# Tutorial: Using the New York Times API

Original code: [Nick Diakopoulos](http://www.nickdiakopoulos.com/), slightly tweaked. 

APIs, or "Application Programming Interfaces", are useful because they let you access data or services on other servers on the web. The goal of this tutorial is to give you some experience collecting data from APIs. One way to think of an API is as a URL that returns data when loaded. We'll look specifically at the New York Times APIs. Whenever you work with an API you'll need to get cozy with its documentation, as that dictates what you can and can't do with it.

Here is [the documentation](http://developer.nytimes.com/) on the NYT APIs.

And here's [the documentation](http://developer.nytimes.com/community_api_v3.json#/README) on the NYT Community API which we'll use to collect a day's worth of comments. 

Something to be aware of with APIs is that they are usually rate limited, and you may need to sign up for an authorization key to use them. Before continuing, sign up for an NYT API key and copy it, as a string, into the `api_key` variable below.
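If you would rather not paste the key into the notebook itself, one option is to read it from an environment variable. This is only a sketch: the variable name `NYT_API_KEY` is an assumption made for illustration, not something the NYT documentation prescribes.

```python
import os

# NYT_API_KEY is a hypothetical environment variable name chosen for this sketch.
# If it is not set, fall back to the same placeholder string used in this tutorial.
api_key = os.environ.get("NYT_API_KEY", "API_KEY")

if api_key == "API_KEY":
    print("No key found in the environment; replace the placeholder before calling the API.")
```

Keeping the key out of the notebook makes it harder to share it by accident when you publish your code.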

In [None]:
import pandas as pd
import requests, json 
import math

# Copy your api_key here as a string
api_key = 'API_KEY'
url     = 'http://api.nytimes.com/svc/community/v3/user-content/by-date.json'

date_of_interest = "2018-02-14"

api_response = requests.get(url, params={"api-key": api_key, "date": date_of_interest})
api_response.url


If you paste that URL into a browser you will see all of the sweet JSON data that the API has sent back to fulfill your request. 
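To see roughly how the `params` dictionary ends up in that URL, here is a small sketch using only the standard library. It mirrors what `requests` does for simple parameters like these, but it is an illustration rather than a description of the library's internals.

```python
from urllib.parse import urlencode

base_url = "http://api.nytimes.com/svc/community/v3/user-content/by-date.json"
# Same parameters as in the cell above, with the placeholder key
params = {"api-key": "API_KEY", "date": "2018-02-14"}

# The parameters are joined as key=value pairs after a "?"
full_url = base_url + "?" + urlencode(params)
print(full_url)
```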

The next step is to figure out how to parse the response. It has a top-level key `results`, and beneath that a key `comments` that holds a list of JSON objects, one for each comment. 

Let's get the API response as a parsed JSON object. 
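As a rough picture of that nesting, the sketch below parses a tiny hand-written response. The comment values are made up, and a real response carries many more fields per comment.

```python
import json

# A hand-written stand-in for the API response (values are invented for illustration)
sample_text = """
{"results": {"totalCommentsFound": 2,
             "comments": [{"userID": 1, "commentBody": "First!"},
                          {"userID": 2, "commentBody": "Nice piece."}]}}
"""

sample = json.loads(sample_text)

# Drill down: results -> comments gives a list with one dict per comment
comment_list = sample["results"]["comments"]
bodies = [c["commentBody"] for c in comment_list]
print(len(comment_list), bodies)
```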

In [None]:
api_response = requests.get(url, params={"api-key": api_key, "date": date_of_interest}).json()

Then we can isolate the comments list and parse it into a pandas dataframe. 

In [None]:
json.dumps(api_response["results"])



In [None]:
comments = pd.read_json(json.dumps(api_response["results"]["comments"]))

We don't need all those columns so let's drop some of them. 
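Here is a minimal sketch of dropping columns on a toy dataframe. The rows and the `commentID` column are made up for illustration; `errors="ignore"` is used so that a column name missing from the data does not raise an error.

```python
import pandas as pd

# Made-up records standing in for the comments list
records = [
    {"commentID": 1, "userID": 10, "commentBody": "First!", "lft": 1, "rgt": 2},
    {"commentID": 2, "userID": 11, "commentBody": "Nice piece.", "lft": 3, "rgt": 4},
]
toy = pd.DataFrame(records)

# userTitle is absent from the toy data on purpose
unwanted = ["lft", "rgt", "userTitle"]
trimmed = toy.drop(columns=unwanted, errors="ignore")
print(list(trimmed.columns))
```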

In [None]:
comments.columns

In [None]:
comments.drop(labels = ["commentSequence", "commentTitle", "lft", "rgt", "status", "statusID", "userTitle", "userURL"], axis=1, inplace=True)

In [None]:
comments.shape

Neat, we've collected 25 comments from the date we specified in the URL. But many more comments were made that day, and we want to loop through and collect them all. 

Just how many comments were there? We can look at the `totalCommentsFound` field in the response object to find out. 

In [None]:
api_response["results"]["totalCommentsFound"]

In [None]:
nIterationsNeeded = int(math.ceil((api_response["results"]["totalCommentsFound"]) / 25.0))
print ("We need to collect", nIterationsNeeded ,"times, since we only get 25 comments at a time.")

Based on the Community API [readme](http://developer.nytimes.com/community_api_v3.json#/README) there is another URL parameter called `offset` which allows us to grab blocks of 25 comments from a different starting offset. We can increment this parameter and repeatedly call the API with different offset values in order to collect all the comments. 
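The paging idea can be sketched without touching the network. In the sketch below, `fake_fetch` is a hypothetical stand-in for the real API call, and `fetch_with_retry` is one simple way to retry a failed call with a growing delay. Both are assumptions made for illustration, not part of the NYT API.

```python
import math
import time

PAGE_SIZE = 25  # the API returns 25 comments per call


def page_offsets(total_found, page_size=PAGE_SIZE):
    """Return the offset of every page needed to cover total_found items."""
    n_pages = math.ceil(total_found / page_size)
    return [i * page_size for i in range(n_pages)]


def fetch_with_retry(fetch_page, offset, retries=3, delay=1.0):
    """Call fetch_page(offset); on an exception, wait and retry, doubling the delay."""
    for attempt in range(retries):
        try:
            return fetch_page(offset)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries, let the error surface
            time.sleep(delay)
            delay *= 2


# Hypothetical data: 60 fake comments in total
fake_comments = [{"commentID": n} for n in range(60)]


def fake_fetch(offset):
    """Stand-in for the API call: return one page starting at offset."""
    return fake_comments[offset:offset + PAGE_SIZE]


collected = []
for offset in page_offsets(len(fake_comments)):
    collected.extend(fetch_with_retry(fake_fetch, offset))

print(page_offsets(len(fake_comments)), len(collected))
```

With 60 items and a page size of 25, three calls are needed, and the last page is only partly full.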

In [None]:
# We'll need this library to add sleeping functionality and slow our script down
from time import sleep

# Create an empty dataframe with the columns we want (as gathered above)
all_comments = pd.DataFrame(columns = comments.columns)

# Iterate from zero up to the number of iterations needed
for i in range(0, nIterationsNeeded):
    print (i)
    # set the offset by multiplying by 25
    offset = i * 25
    # call the api with the offset parameter
    api_response = requests.get(url, params={"api-key": api_key, "date": date_of_interest, "offset": offset})
    # Uncomment to inspect the request URL:
    # print(api_response.url)
    if api_response.status_code != 200:
        sleep(1)
        api_response = requests.get(url, params={"api-key": api_key, "date": date_of_interest, "offset": offset})
    
    api_response = api_response.json()
    comments_batch = pd.read_json(json.dumps(api_response["results"]["comments"]))
    comments_batch.drop(labels = ["commentSequence", "commentTitle", "lft", "rgt", "status", "statusID", "userTitle", "userURL"], axis=1, inplace=True)
    
    # Append these comments; ignore_index=True renumbers the rows, so no separate index reset is needed
    # (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
    all_comments = pd.concat([all_comments, comments_batch], ignore_index=True)
    print(" Collected", all_comments.shape[0], "comments.")
    # Sleep for a bit between calls (it's courteous not to hit an API too frequently, and some APIs enforce this through rate limiting)
    sleep(.1) # a tenth of a second

In [None]:
all_comments.shape

In [None]:
# Let's save it
all_comments.to_csv("NYT-Comments" + date_of_interest + ".csv", index=False, encoding='utf-8')

Which users were most active commenting on that day?

In [None]:
grouped = all_comments.groupby("userID")
grouped.size()

Oddly, pandas has parsed the userID as a floating point number. We know it's an integer, so we can set the type explicitly. 
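One common reason an ID column ends up as a float is a missing value somewhere in it. Whether that is what happened here is an assumption. The sketch below uses made-up IDs to show that a plain integer cast fails when a value is missing, while pandas' nullable `Int64` dtype keeps the gap.

```python
import pandas as pd

# Made-up IDs; the None becomes NaN, which forces a float column
ids = pd.Series([44268159.0, 12345.0, None])

try:
    ids.astype(int)
    plain_cast_ok = True
except ValueError:
    # pandas refuses to turn NaN into a plain integer
    plain_cast_ok = False

# The nullable "Int64" dtype stores integers and keeps the missing value as <NA>
nullable_ids = ids.astype("Int64")
print(plain_cast_ok, nullable_ids.dtype)
```

If the column has no missing values, the plain `astype(int)` cast works fine.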

In [None]:
all_comments.userID = all_comments.userID.astype(int)
grouped = all_comments.groupby("userID")
group_sizes_df = grouped.size().reset_index()
group_sizes_df.columns = ['userID', "group_size"]
group_sizes_df[group_sizes_df.group_size > 5]

And then we can use the top userID to see what kind of comments that person wrote. 

In [None]:
# None means "no limit" on column width (older pandas versions accepted -1 here)
pd.set_option('display.max_colwidth', None)
all_comments[all_comments.userID == 44268159].commentBody

We could also aggregate and then rank people by another column, like the average or total `recommendationCount` of their comments. This gives a sense of which users were most recommended overall. 
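As a small sketch of that idea, the toy dataframe below (made-up numbers standing in for `all_comments`) computes the comment count, total, and average `recommendationCount` per user in one pass.

```python
import pandas as pd

# Made-up rows standing in for all_comments
toy = pd.DataFrame({
    "userID": [1, 1, 2, 3, 3, 3],
    "recommendationCount": [10, 4, 7, 0, 2, 1],
})

# One row per user with the count, total, and average, ranked by total
per_user = (
    toy.groupby("userID")["recommendationCount"]
       .agg(["count", "sum", "mean"])
       .sort_values(by="sum", ascending=False)
)
print(per_user)
```

Ranking by `mean` instead of `sum` favors users with a few highly recommended comments over users who simply comment a lot.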

In [None]:
all_comments.columns

Now let's see who got the most recommendations for their comments.

In [None]:
subset = all_comments[['userID', 'recommendationCount']]
grouped = subset.groupby("userID")
# Total recommendations per user, highest first
print(grouped.sum().sort_values(by="recommendationCount", ascending=False))



In [None]:
all_comments[all_comments.userID == 44268159].recommendationCount