# Using APIs to Collect Data from the New York Times

In this notebook we will be working with the New York Times Article Search API.

The NYT Article Search API allows you to search more than 2.8 million New York Times articles from 1981 to today, retrieving headlines, abstracts, lead paragraphs and links to associated multimedia.

The API supports the following types of search:

- Standard keyword searching
- Date range: all articles from X date to Y date
- Field search: search within any number of given fields, e.g., title:obama
- Ordering by relevance, newest and oldest

The API does not return the full text of articles, but it does return helpful metadata such as subject terms, abstract, and date, as well as URLs, which one could conceivably use to scrape the full text.

For more details about this API, please go to https://developer.nytimes.com/docs/articlesearch-product/1/overview.

### Get API key

In [None]:
# Note: this is my API key and is only being used for example purposes. Please don't spam NYT using my API key.
# You can sign up for your own API key by following the instructions below.
nyt_api ='gTWqbhG6yHPgNrnDgtlf9lEVwvuqlWv2'

- Go to the New York Times (NYT) Developer website and create your account (https://developer.nytimes.com/accounts/create)
- Look for the verification email in your inbox (or spam folder) and click the link in that email
- Sign into your NYT Developer account
- Select Apps from the user drop-down
- Click + New App to create a new app
- Enter a name and description for the app in the New App dialog, and enable the Article Search API
- Click Create
- Select Apps from the user drop-down.
- Click the app in the list.
- View the API key on the App Details tab.
- Confirm that the status of the API key is active and copy the API key to a .txt file.
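
Once the key is saved in a .txt file, it can be read back into the notebook rather than pasted into a cell. The following is a minimal sketch; the function name `load_api_key` and the file name in the usage comment are hypothetical, and the file is assumed to contain only the key.

```python
from pathlib import Path


def load_api_key(path):
    """Return the API key stored in a plain-text file, without surrounding whitespace."""
    key = Path(path).read_text(encoding="utf-8").strip()
    if not key:
        raise ValueError("No API key found in {}".format(path))
    return key


# Usage (hypothetical file name):
# nyt_api = load_api_key("nyt_api_key.txt")
```

Keeping the key in a separate file makes it easier to share the notebook without sharing the key itself.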


### Install and import necessary libraries

The first thing we need to do is install the 'requests' and 'pandas' libraries. Requests is a Python module that you can use to send all kinds of HTTP requests, including GET requests. We can convert the results from JSON to a CSV file (similar to .xlsx) using the pandas package.

In [None]:
import sys
!{sys.executable} -m pip install requests
!{sys.executable} -m pip install pandas

If you don't have pip installed, try the following code:

In [None]:
import sys
!conda install --yes --prefix {sys.prefix} requests
!conda install --yes --prefix {sys.prefix} pandas

Now let's import the libraries we will be using in this workshop. We don't need to install json because it is part of Python's standard library.

In [None]:
import requests
import json
import pandas

### Search articles about COVID

Remember the parameters in the request? We need to specify the values of the NYT Article Search parameters like this:

In [None]:
# Search articles about COVID
r1 = requests.get("https://api.nytimes.com/svc/search/v2/articlesearch.json?q=COVID&api-key=" + nyt_api)
article1 = r1.json()

The q (for query) parameter searches the article's body, headline and byline for a particular term. In this case, we are looking for the search term ‘COVID’. 

The search returns a dictionary containing the first 10 results. To get the next 10, we have to use the page parameter. Pages are numbered from 0, so page=1 returns the second 10 results, page=2 the third 10, and so on.
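
To make the paging concrete, here is a small sketch that builds one request URL per page. It assumes pages are numbered from 0, as the NYT documentation describes; the helper name `page_urls` and the placeholder key are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def page_urls(query, api_key, n_pages):
    """Build one request URL per page, starting at page 0."""
    return [
        BASE_URL + "?" + urlencode({"q": query, "page": page, "api-key": api_key})
        for page in range(n_pages)
    ]


# "YOUR_API_KEY" is a placeholder, not a working key.
urls = page_urls("COVID", "YOUR_API_KEY", 3)
for url in urls:
    print(url)
```

Each URL could then be passed to `requests.get`. The API enforces rate limits, so it is worth pausing between calls (for example with `time.sleep`) when looping over many pages.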

In [None]:
article1

In [None]:
# We can also use the format() method to build the URL string more cleanly
q = "COVID"
url = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q={}&api-key={}".format(q, nyt_api)
r1 = requests.get(url)
article1 = r1.json()

In [None]:
article1

If you run the code, you'll see that the returned dictionary (JSON) is pretty messy. The code in the cell below could help us understand the structure of the result.

In [None]:
article_link = []
print('\n--1--\n')
    
for element in article1:
    print(element)

print('\n--2--\n')
    
for element in article1['response']:
    print(element)

print('\n--3--\n')
    
for element in article1['response']['docs'][0]:
    print(element)

print('\n--4--\n')

print(article1['response']['docs'][0]['web_url'])
print(article1['response']['docs'][0]['snippet'])

print('\n--5--\n')

for element in article1['response']['docs']:
    article_link.append(element['web_url'])
    print(element['web_url'])
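
The fields explored above can also be collected into flat rows and written out as CSV. The sketch below uses the standard library's `csv` module rather than pandas so that it runs without extra installs; the sample dictionary is made up to mirror the shape of `article1`, and the helper names are hypothetical.

```python
import csv
import io

# Made-up sample shaped like the API response; real values will differ.
sample = {
    "response": {
        "docs": [
            {
                "web_url": "https://example.com/a",
                "snippet": "First snippet",
                "pub_date": "2020-01-15T00:00:00+0000",
            },
            {"web_url": "https://example.com/b", "snippet": "Second snippet"},
        ]
    }
}

FIELDS = ["web_url", "snippet", "pub_date"]


def docs_to_rows(result, fields=FIELDS):
    """Pick the chosen fields from each article; missing fields become ''."""
    return [{f: doc.get(f, "") for f in fields} for doc in result["response"]["docs"]]


def rows_to_csv(rows, fields=FIELDS):
    """Serialize the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = docs_to_rows(sample)
csv_text = rows_to_csv(rows)
print(csv_text)
```

With pandas, the same rows could be saved using `pandas.DataFrame(rows).to_csv("articles.csv", index=False)`, where the file name is only an example.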

### Search articles about COVID that were published between January 2020 and March 2020

In [None]:
# q and nyt_api stay the same
# add two more parameters: begin_date and end_date
begin_date = 20200101
end_date = 20200331
url2 = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q={}&begin_date={}&end_date={}&api-key={}".format(
    q, begin_date, end_date, nyt_api)
r2 = requests.get(url2)
article2 = r2.json()

In [None]:
for element in article2['response']['docs']:
    print(element['pub_date'])

The begin_date and end_date parameters (in YYYYMMDD format) limit the date range of the search.
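
If the dates start out as Python `date` objects, they can be converted to the YYYYMMDD form with `strftime`. This is a small sketch; the helper name `to_nyt_date` is hypothetical.

```python
from datetime import date


def to_nyt_date(d):
    """Format a date as the YYYYMMDD string used by begin_date and end_date."""
    return d.strftime("%Y%m%d")


begin_date = to_nyt_date(date(2020, 1, 1))
end_date = to_nyt_date(date(2020, 3, 31))
print(begin_date, end_date)
```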

There are many other parameters and filters we can use to narrow our search. See the API documentation: https://developer.nytimes.com/docs/articlesearch-product/1/overview

### Search articles about COVID that were published between January 2020 and March 2020 in the New York Times and Associated Press only

The fq (filter query) parameter has a sub-field, source, which can restrict our search.

The API documentation has an example about source:
- fq=source:("The New York Times")

But what we want is slightly different from the example: we want articles from The New York Times or the Associated Press. How do we do that? The API documentation also explains the filter query fields. We find that
- 'source' allows a single token
- 'source.contains' allows multiple tokens

So source.contains is what we want. We can adapt the example above like this:

In [None]:
fq = 'source.contains:("The New York Times", "AP")'
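
Filter strings of this kind follow a regular pattern (a field name, a colon, and quoted values in parentheses), so a small helper can assemble them from a list. This is only a sketch: `build_fq` is a hypothetical name, it separates values with spaces in the style of the documentation's `fq=news_desk:("Sports" "Foreign")` example, and it assumes the values contain no double quotes.

```python
def build_fq(field, values):
    """Build a filter query such as news_desk:("Sports" "Foreign")."""
    quoted = " ".join('"{}"'.format(value) for value in values)
    return "{}:({})".format(field, quoted)


print(build_fq("news_desk", ["Sports", "Foreign"]))
print(build_fq("source.contains", ["The New York Times", "AP"]))
```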

In [None]:
url3 = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q={}&begin_date={}&end_date={}&fq={}&api-key={}".format(
    q, begin_date, end_date, fq, nyt_api)
r3 = requests.get(url3)
article3 = r3.json()

In [None]:
for element in article3['response']['docs']:
    print(element['source'])
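
Because the fq value contains spaces, quotes and parentheses, it may be safer to let a library encode the query string instead of pasting the raw text into the URL. The sketch below uses the standard library's `urlencode` to show what the encoded URL looks like; the key is a placeholder.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"

params = {
    "q": "COVID",
    "begin_date": "20200101",
    "end_date": "20200331",
    "fq": 'source.contains:("The New York Times", "AP")',
    "api-key": "YOUR_API_KEY",  # placeholder, not a working key
}

# urlencode percent-encodes the special characters in each value.
encoded_url = BASE_URL + "?" + urlencode(params)
print(encoded_url)

# Decoding the query string recovers the original filter unchanged.
decoded = parse_qs(urlsplit(encoded_url).query)
print(decoded["fq"][0])
```

With requests, the same dictionary can be passed directly as `requests.get(BASE_URL, params=params)`, which performs the encoding for us.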

### Exercise 1: expand url3 to find articles on page 90, and print out the headlines of the articles on that page

### Exercise 2: search articles about COVID that were published between January 2020 and March 2020 from the U.S. news desk, print out the news desk of articles on the first page

### Exercise 3: make a request to get articles about COVID and China (hint: use glocations filter query field), print out the word_count of articles on the first page

In [None]:
page = 90
url4 = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q={}&begin_date={}&end_date={}&fq={}&page={}&api-key={}".format(
    q, begin_date, end_date, fq, page, nyt_api)
r4 = requests.get(url4)
article4 = r4.json()

In [None]:
for element in article4['response']['docs']:
    print(element['headline'])
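
Each headline printed this way is a nested dictionary rather than a plain string. Assuming the main headline text sits under a `main` key, it can be pulled out as in the sketch below; the sample is made up and `main_headlines` is a hypothetical helper.

```python
# Made-up sample shaped like article4; real responses carry more fields.
sample = {
    "response": {
        "docs": [
            {"headline": {"main": "First example headline"}},
            {"headline": {"main": "Second example headline"}},
            {},  # a document without a headline
        ]
    }
}


def main_headlines(result):
    """Return the 'main' headline text of each article, or None if absent."""
    return [doc.get("headline", {}).get("main") for doc in result["response"]["docs"]]


headlines = main_headlines(sample)
print(headlines)
```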

In [None]:
# Example from the API documentation:
# fq=news_desk:("Sports" "Foreign")
fq2 = 'news_desk:("U.S.")'
url5 = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q={}&begin_date={}&end_date={}&fq={}&api-key={}".format(
    q, begin_date, end_date, fq2, nyt_api)
r5 = requests.get(url5)
article5 = r5.json()

In [None]:
for element in article5['response']['docs']:
    print(element['news_desk'])

In [None]:
fq2 = 'glocations:("China")'
url6 = "https://api.nytimes.com/svc/search/v2/articlesearch.json?q={}&fq={}&api-key={}".format(q, fq2, nyt_api)
r6 = requests.get(url6)
article6 = r6.json()

In [None]:
for element in article6['response']['docs']:
    print(element['word_count'])
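
All of the loops above index straight into `['response']['docs']`, which raises a `KeyError` if the request failed (for example, because of a bad key or a rate limit) and the payload has a different shape. A defensive accessor is sketched below; `get_docs` is a hypothetical helper, and both payloads are made up for illustration.

```python
def get_docs(payload):
    """Return the list of articles, or raise a readable error if they are missing."""
    response = payload.get("response")
    if not isinstance(response, dict) or "docs" not in response:
        raise ValueError("Unexpected payload (no 'response.docs'): {!r}".format(payload))
    return response["docs"]


# Made-up payloads for illustration.
ok_payload = {"response": {"docs": [{"word_count": 850}, {"word_count": 1200}]}}
bad_payload = {"message": "something went wrong"}

word_counts = [doc.get("word_count") for doc in get_docs(ok_payload)]
print(word_counts)

try:
    get_docs(bad_payload)
except ValueError as err:
    print("Request failed:", err)
```

It can also help to check `r6.status_code`, or call `r6.raise_for_status()`, before parsing the JSON at all.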

## 2. Twitter API

In [None]:
consumer_key= 'Y5KnOYsf0fYnqP5A1J7bYMKfX'
consumer_secret= '79tDInG1YOljdU3nKb5Ptmpf87TbzW7z2TOVzQva26K6AgavNS'
access_key= '1127753376379412480-FwFnjFDzfGJg1mOvvuXlpUxgmHkPeM'
access_secret= '0gS6xHHBoQqhoOd2ysSxudXShIOtsjcJOrDmJnynbyO4c'

In [None]:
import tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
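
Rather than typing credentials into a cell, one option is to read them from environment variables. The sketch below is only an illustration: the environment variable names and the `load_credentials` helper are hypothetical, and the variables would need to be set before the notebook is started.

```python
import os

# Hypothetical environment variable names; any names can be used.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_key": "TWITTER_ACCESS_KEY",
    "access_secret": "TWITTER_ACCESS_SECRET",
}


def load_credentials(environ=os.environ):
    """Return the four credentials, or raise if any variable is unset or empty."""
    missing = [name for name in ENV_NAMES.values() if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: environ[name] for key, name in ENV_NAMES.items()}
```

The returned dictionary can then feed the same tweepy calls as above, for example `tweepy.OAuthHandler(creds["consumer_key"], creds["consumer_secret"])` after `creds = load_credentials()`.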