# Problem Set 2
In this homework you'll collect some article data from the [NYT API](http://developer.nytimes.com/) and then enrich those articles with their share counts using the [Facebook Graph API](https://developers.facebook.com/docs/graph-api). 

Now would be a good time to sign up with both of those APIs to get registration keys. 

Please review Tutorial #2 before undertaking this problem set. 

In [2]:
# Here are some imports that you'll need
import pandas as pd
import requests, json 
import math
from time import sleep

### 1. Collect NYT Article Data
Using the NYT [Article Search API](http://developer.nytimes.com/article_search_v2.json) collect all NYT articles that mention "Donald Trump" from Novemer 7, 2016 to November 9, 2016 (that's from one day before to one day after election day). Output the complete dataframe of 675 articles to a .csv file called `trump_articles.csv`.

*(HINT: you will need to adapt the code from the tutorial - it won't be exactly the same and you should be sure to consult the documentation carefully - but it basically follows the same steps.)*

In [3]:
api_key = 'ac20af44c81748e490d35c1648766eae'
url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'

api_response = requests.get(url, params={"api-key": api_key, "q": "Donald Trump", "begin_date": "20161107", "end_date": "20161109"}).json()
#To extract the number of hits for the loop
articles = pd.read_json(json.dumps(api_response["response"]["docs"])) 
sleep(1)

nIterationsNeeded = int(math.ceil((api_response["response"]["meta"]["hits"]) / 10.0)) #Calculating number of iterations
nIterationsNeeded 

#Create an empty dataframe with the columns
all_articles = pd.DataFrame(columns = articles.columns)
offset=0

#Iterate from zero up to the number of iterations needed
for i in range(0, nIterationsNeeded):
    #Call the api with the offset parameter page
    api_response = requests.get(url, params={"api-key": api_key, "q": "Donald Trump", "begin_date": "20161107", "end_date": "20161109", "page": offset})
    if api_response.status_code != 200:
        sleep(1)
        api_response = requests.get(url, params={"api-key": api_key, "q": "Donald Trump", "begin_date": "20161107", "end_date": "20161109", "page": offset})
    
    api_response = api_response.json()
    articles_batch = pd.read_json(json.dumps(api_response["response"]["docs"]))
    
#Append these articles
    all_articles = all_articles.append(articles_batch)
    
#Reset the index after appending every batch
    all_articles.reset_index()
    print " Collected", all_articles.shape[0]
    sleep(.1)
#Increment the offset parameter
    offset = i + 1


 Collected 10
 Collected 20


KeyboardInterrupt: 

In [4]:
#Output the file in csv format
all_articles.to_csv("TrumpArticles.csv", index=False, encoding='utf-8')

### 2. What were the 5 sections in which these articles were most frequently published?

In [10]:
#Grouping the dataframe by section name
articles_grouped=all_articles.groupby("section_name")
#Obtain the frequency for every group
temp=articles_grouped.size()
#Printing the 5 sections in which these articles were most frequently published
temp.sort_values(ascending=False)[0:5]


section_name
U.S.            310
Business Day    128
World           123
Opinion          32
Arts             15
dtype: int64

### 3. What was the average word count of articles before election day compared to after election day?
*Hint*: Consider using the [to_datetime](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) pandas function to be able to filter dates.

In [62]:
from datetime import datetime
#Convert string to timestamp for comparison
all_articles["pub_date"] =  pd.to_datetime(all_articles['pub_date'])

election_day="2016-11-08"

#Obtain a boolean mask to get the word count for relevant records
mask_bool=(all_articles["pub_date"] <= election_day)
#Calculate the average of the word count
avg_wc_before=all_articles.loc[mask_bool].word_count.mean()
print "Average word count before election day:",avg_wc_before

#Obtain a boolean mask to get the word count for relevant records
mask_bool=(all_articles["pub_date"] > election_day)
#Calculate the average of the word count
avg_wc_after=all_articles.loc[mask_bool].word_count.mean()
print "Average word count before election day:",avg_wc_after



Average word count before election day: 795.030534351
Average word count before election day: 626.883333333


### 4. Among all the articles collected what were the titles and urls for the top five based on the number of Facebook shares?
To get the number of facebook shares you can use the following API URL from facebook: "http://graph.facebook.com/v2.8/" and pass it a parameter called `id` that is set to the URL of the article you want share information about. You ALSO need to pass a parameter called `access_token` that is an access token that you generate using the instructions [here](https://developers.facebook.com/docs/graph-api/overview) under the section "Generate a Basic User Access Token". NOTE: this token will expire after some time (a few hours) and you may need to generate another one in order to continue after a break. 

*Hint* In Tutorial 1 there's an example using the `apply()` function, but you'll need to tweak things after looking at the [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html). 

**Be sure to include a bit of sleep time in between API calls**

In [214]:
# Add your token below to test how the API should work with a test URL
url = 'https://graph.facebook.com/v2.8/'
token = 'EAACEdEose0cBADLUJBezMnqcurUt3vFZBjLCTnyNCZBSJlaWLXpCMulG7H2yj7xRSPmOnHMoSP7IATF5Y0zyzpKGTZCu5zdkH2KKqdZBNH3qstZC9RxB70l8B5lzK9bwpnxjqBTQaRhZCWRdS0MbYZA3YNNfgR6F6e4BcN80elFZCVyTbqGJ06XEcxN9hpDlKS0ZD' # <--- Enter your token here

#Add columns in the data set for the share count and title of the article
all_articles['share_count']=0
all_articles['title']=0

#Create an empty data set for the final result
final_data=pd.DataFrame(columns = all_articles.columns)

#Define function to obtain the share count and title of the article
def getSC(loc_all_articles):
    x=0
#Global variable declaration to access the variable outside the function
    global final_data
    
#Check whether the website exists
    http_res = requests.get(loc_all_articles['web_url'])
    if http_res.status_code != 200:
            x=-1
            y="Website doesn't exist"
    else:
#Obtain the share count and title for the article 
        api_resp = requests.get(url, params={"id": loc_all_articles['web_url'], "access_token": token})
        api_resp_json=api_resp.json()
        x=api_resp_json["share"]["share_count"]
        data = json.dumps(api_resp_json)
#Check for the key title in the json dump
        if 'title' not in data:
            y="No title in given data"
        else:
            y=api_resp_json["og_object"]["title"]
        if api_resp.status_code != 200:
            sleep(3)
            api_resp = requests.get(url, params={"id": loc_all_articles['web_url'], "access_token": token})
            api_resp_json=api_resp.json()
            print loc_all_articles['web_url']
            x=api_resp_json["share"]["share_count"]
            data = json.dumps(api_resp_json)
            if 'title' not in data:
                y="No title in given data"
            else:
                y=api_resp_json["og_object"]["title"]

#Populate variables to copy to final dataframe
    loc_all_articles["share_count"]=x
    loc_all_articles["title"]=y

#Append each row to form the entire final dataframe
    final_data=final_data.append(loc_all_articles)
    print loc_all_articles['web_url'],", SC:",x,", Title:",y
    sleep (0.1)
    return

#Function call to obtain share count and title for the article
all_articles.apply(getSC, axis=1)


http://www.nytimes.com/slideshow/2016/11/09/us/politics/donald-trumps-career-path.html , SC: 33 , Title: Donald Trump’s Career Path
http://www.nytimes.com/2016/11/07/magazine/donald-and-the-dead.html , SC: 1 , Title: Donald and the Dead
http://www.nytimes.com/2016/11/10/opinion/president-donald-trump.html , SC: 4 , Title: President Donald Trump
http://www.nytimes.com/video/us/politics/100000004755887/donald-trump-votes.html , SC: 0 , Title: No title in given data
http://www.nytimes.com/2016/11/09/us/politics/donald-trump-voting.html , SC: 1 , Title: Donald Trump, Amid Cheers and Jeers, Casts His Vote
http://www.nytimes.com/2016/11/07/opinion/campaign-stops/what-ivanka-trump-cant-sell.html , SC: 37 , Title: What Ivanka Trump Can’t Sell
http://www.nytimes.com/2016/11/10/us/politics/trump-speech-transcript.html , SC: 51 , Title: Transcript: Donald Trump’s Victory Speech
http://www.nytimes.com/video/us/100000004759928/student-protests-break-out-nationwide.html , SC: 22 , Title: No title in

0    None
1    None
2    None
3    None
4    None
5    None
6    None
7    None
8    None
9    None
0    None
1    None
2    None
3    None
4    None
5    None
6    None
7    None
8    None
9    None
0    None
1    None
2    None
3    None
4    None
5    None
6    None
7    None
8    None
9    None
     ... 
5    None
6    None
7    None
8    None
9    None
0    None
1    None
2    None
3    None
4    None
5    None
6    None
7    None
8    None
9    None
0    None
1    None
2    None
3    None
4    None
5    None
6    None
7    None
8    None
9    None
0    None
1    None
2    None
3    None
4    None
dtype: object

In [230]:
#Sort and print the first 5 rows to obtain the top 5 articles based on Facebook shares
sorted_df=final_data.sort_values(by=['share_count'], ascending=[False])[0:5]

sorted_df[["title","web_url","share_count"]]

Unnamed: 0,title,web_url,share_count
1,The Future of the Democratic Party,http://www.nytimes.com/roomfordebate/2016/11/0...,1410.0
5,An Emotional Election Night in America,http://www.nytimes.com/slideshow/2016/11/10/us...,458.0
4,"In Sight, Yet Elusive",http://www.nytimes.com/2016/11/09/us/politics/...,164.0
1,6 Books to Help Understand Trump’s Win,http://www.nytimes.com/2016/11/10/books/6-book...,158.0
7,http://www.nytimes.com/video/us/politics/10000...,http://www.nytimes.com/video/us/politics/10000...,109.0
