## Project to automate API access
+ I store my API key in a `config.py` file, which the notebook imports below.
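
The notebook only relies on `config.py` exposing an `api_key` attribute (it is read as `config.api_key`). A minimal sketch of such a file, assuming the key is kept in an environment variable; the variable name `NEWSAPI_KEY` is a made-up example, not something this project defines:

```python
# config.py (sketch): keep the real key out of version control
import os

# NEWSAPI_KEY is a hypothetical environment variable name chosen for this example
api_key = os.environ.get("NEWSAPI_KEY", "")

if not api_key:
    print("Warning: NEWSAPI_KEY is not set; requests will likely be rejected.")
```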

In [137]:
import pandas as pd
import config
import math
from newsapi import NewsApiClient
import newspaper
import requests
from newspaper import fulltext

# Authenticate against the News API with the key from config.py
newsapi = NewsApiClient(api_key=config.api_key)


# Grab all sources
+ Read through the list of available sources and build a DataFrame storing each domain and source name.
+ This returns a lot of sources; we can clearly narrow the list down.
+ I do this so I can join the chosen source ids into a single string in a later block and insert it into our query.

In [138]:
sources = newsapi.get_sources()["sources"]
# One row per source: homepage URL ("domains") and API id ("sources")
query_keys_df = pd.DataFrame({
    "domains": [s["url"] for s in sources],
    "sources": [s["id"] for s in sources],
})
print(query_keys_df)

                                        domains                      sources
0                        https://abcnews.go.com                     abc-news
1                    http://www.abc.net.au/news                  abc-news-au
2                    https://www.aftenposten.no                  aftenposten
3                      http://www.aljazeera.com           al-jazeera-english
4                            http://www.ansa.it                         ansa
5                         http://www.argaam.com                       argaam
6                        http://arstechnica.com                 ars-technica
7                        https://arynews.tv/ud/                     ary-news
8                           https://apnews.com/             associated-press
9                            http://www.afr.com  australian-financial-review
10                        https://www.axios.com                        axios
11                    http://www.bbc.co.uk/news                     bbc-news

## Choosing data sources
+ Let's try to pick sources from different geographic locations as well as different ideological perspectives.

+ Categorizing news sources
    + Traditional TV MSM
        +  http://us.cnn.com   
        +  http://www.cnbc.com 
        +  http://www.foxnews.com  
        +  http://www.msnbc.com  
        +  https://abcnews.go.com  
        +  http://www.nbcnews.com  
    + Traditional publications 
        +  http://www.nytimes.com  
        +  https://www.washingtonpost.com 
        
    + Internet Sources
        +  http://www.huffingtonpost.com 
        +  https://www.politico.com
        +  http://www.breitbart.com 
        +  https://news.google.com 
        +  https://www.buzzfeed.com 
        +  https://news.vice.com  
    + Financial publications
        +  http://www.economist.com
        +  http://www.bloomberg.com 
        +  http://www.businessinsider.com 
        +  http://www.wsj.com
        +  http://fortune.com  
        
    + News aggregators
        +  https://apnews.com/ 
        +  http://www.reuters.com 
    + Foreign reporting
         + http://www.aljazeera.com  
         + http://www.bbc.co.uk/news   
         + https://www.jpost.com/  
         + http://timesofindia.indiatimes.com 
         + https://russian.rt.com 
         + https://www.theguardian.com/uk 
         + http://www.independent.co.uk  
         + http://www.telegraph.co.uk  


 



In [139]:

# Pick data sources by row position from the DataFrame printed above
a = query_keys_df.iloc[[0,3,8,11,16,17,18,20,22,23,24,39,41,44,62,82,83,93,98,99,111,114,117,119,121,124,127,128,132],[1]]
list_sources = a["sources"].tolist()

# Build the comma-separated string for the query request
myString = ",".join(list_sources)
myString

'abc-news,al-jazeera-english,associated-press,bbc-news,bloomberg,breitbart-news,business-insider,buzzfeed,cbs-news,cnbc,cnn,four-four-two,fox-sports,google-news-ar,infobae,nbc-news,news24,polygon,rt,rte,the-hill,the-irish-times,the-new-york-times,the-sport-bible,the-times-of-india,the-washington-post,usa-today,vice-news,xinhua-net'

+ The string above is what we will pass as the `sources` parameter of the query.
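
Row positions only stay valid while `get_sources()` keeps returning sources in the same order. The string above contains ids such as `four-four-two` and `fox-sports` that are not in the category list, which suggests the positions may have drifted. A hedged alternative is to select by source id. The sketch below uses plain Python lists; `build_sources_string`, `available_ids` and `wanted_ids` are illustrative names, not part of the notebook:

```python
def build_sources_string(available_ids, wanted_ids):
    """Join the wanted ids that the API actually offers; report the rest."""
    available = set(available_ids)
    found = [s for s in wanted_ids if s in available]
    missing = [s for s in wanted_ids if s not in available]
    return ",".join(found), missing


# Stand-in for query_keys_df["sources"].tolist()
available_ids = ["abc-news", "al-jazeera-english", "associated-press", "bbc-news"]
# Hand-picked ids; "not-a-real-source" is deliberately absent from the list above
wanted_ids = ["bbc-news", "abc-news", "not-a-real-source"]

sources_string, missing = build_sources_string(available_ids, wanted_ids)
print(sources_string)
print(missing)
```

Reporting the missing ids makes a typo or a renamed source visible instead of silently querying the wrong outlet.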

## Now let's begin automating the query calls
+ After the first call we can see the total number of results for the day, which tells us how many subsequent calls we need.
+ First, let's build a function to clean the query results.
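
The first response reports `totalResults`, and each page holds at most `page_size` articles, so the number of follow-up calls comes from a ceiling division. A small sketch of that arithmetic; `pages_needed` is a hypothetical helper, and 100 is the page size used in this notebook:

```python
import math


def pages_needed(total_results, page_size=100):
    """Return how many pages are required to fetch every result."""
    return math.ceil(total_results / page_size)


# 268 is one of the daily totals printed later in this notebook
total = 268
pages = pages_needed(total)
# Page 1 comes from the first call, so follow-up calls start at page 2
follow_up_pages = list(range(2, pages + 1))
print(pages, follow_up_pages)
```

Whether the API plan permits those extra pages is a separate question; the original loop ran into a query limit.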

In [140]:
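def clean_query(query):
    """Flatten the articles in a News API response into a DataFrame."""
    for x in query['articles']:
        # Replace the nested source dict with just the source name
        try:
            x["source"] = x["source"]["name"]
        except (KeyError, TypeError):
            pass
        # Keep only the date part of the ISO timestamp (drop the time after "T")
        try:
            x['publishedAt'] = x['publishedAt'].split("T")[0]
        except (KeyError, AttributeError):
            pass
        # The image URL is not needed for text analysis
        x.pop('urlToImage', None)
    my_df = pd.DataFrame(query["articles"])
    return my_df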
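
To see what the cleaning does to a single record without calling the API, here is a stand-alone sketch of the same three steps on a hand-written article dict. The sample values are invented for illustration, and `flatten_article` is a hypothetical helper that returns a copy rather than mutating in place as `clean_query` does:

```python
def flatten_article(article):
    """Return a cleaned copy of one article record."""
    cleaned = dict(article)
    source = cleaned.get("source")
    # Replace the nested source dict with just its name
    if isinstance(source, dict):
        cleaned["source"] = source.get("name")
    published = cleaned.get("publishedAt")
    # Keep only the date part of an ISO timestamp such as 2019-09-02T10:00:00Z
    if isinstance(published, str):
        cleaned["publishedAt"] = published.split("T")[0]
    # Drop the image URL if present
    cleaned.pop("urlToImage", None)
    return cleaned


# Invented sample record shaped like the fields this notebook touches
sample = {
    "source": {"id": "example-id", "name": "Example News"},
    "publishedAt": "2019-09-02T10:00:00Z",
    "urlToImage": "https://example.com/image.jpg",
    "title": "Example title",
}
print(flatten_article(sample))
```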

## Function to hit the API
+ Originally I had a loop here. Instead, I built a function that takes a start date, an end date, and a query term (the candidate).
    + The original code kept returning a "query limit reached" result, so I changed strategy and now search one day at a time.
    + After we pull the past 30 days from the papers at the start, we will only need one day at a time going forward.
    + I built in some print statements to help with error handling, which you will see in the loop output further below; the behavior gets strange at one point.
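
Building the date strings by hand is where zero-padding and month-length mistakes creep in (September has 30 days, so a day-31 string would not be a real date). A hedged sketch that lets `datetime` produce one ISO date per day; `daily_dates` is a hypothetical helper, not part of the notebook:

```python
from datetime import date, timedelta


def daily_dates(first, last):
    """Yield ISO date strings for every day from first to last, inclusive."""
    current = first
    while current <= last:
        yield current.isoformat()
        current += timedelta(days=1)


september = list(daily_dates(date(2019, 9, 1), date(2019, 9, 30)))
print(september[0], september[8], september[-1], len(september))
```

Passing the same string as both `from_param` and `to` would then query a single day per call, assuming the API treats a bare date in `to` as covering that whole day; that assumption should be checked against the News API documentation.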

In [146]:
import sys
import time
candidates_list = []

# Make the first call (page 1) for a given day range in September 2019
def hit_api(start, end, q, myString):
    # Zero-pad the day numbers so the dates are valid ISO strings (e.g. 2019-09-01)
    start_str = "{:02d}".format(start)
    end_str = "{:02d}".format(end)
    print(start_str)
    print(end_str)
    all_articles = newsapi.get_everything(q=q,
                                          sources=myString,
                                          domains='https://apnews.com/,http://www.nytimes.com',
                                          language='en',
                                          from_param='2019-09-{}'.format(start_str),
                                          to='2019-09-{}'.format(end_str),
                                          sort_by='relevancy',
                                          page_size=100,
                                          page=1)
    # Number of pages needed to fetch every result; only page 1 is fetched here
    total_pages = math.ceil(all_articles["totalResults"]/100)
    print("query will return: "+ str(all_articles["totalResults"]))
    all_articles = clean_query(all_articles)
    # candidates_list is module-level, so it keeps growing across calls and cell re-runs
    candidates_list.append(all_articles)
    return(candidates_list)



### original code
#     start=1
#     end=5
#     for page in range(2,total_pages+1):
#         start_str = "0"+ str(start)
#         end_str = "0"+ str(end)
#         all_articles = newsapi.get_everything(q='Bernie Sanders',
#                                           sources=myString,
#                                           domains='https://apnews.com/,http://www.nytimes.com',
#                                           language='en',
#                                           from_param='2019-{}-09'.format(start_str),
#                                           to='2019-{}-09'.format(end_str),
#                                           sort_by='relevancy',
#                                           page_size=100,
#                                           page=page)
#         print(page)
#         ran_query = clean_query(all_articles)
#         #ran_query_df = pd.DataFrame(ran_query['articles'])
#         bernie_sanders_list.append(ran_query)


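## Built out a loop
+ Simple: go from the first day of September to the last, incrementing start and end by 1 each time.
+ You can see from the prints that something seems to go wrong once we hit day 10, and I don't understand why. We are still getting data for that day.
+ One thing the output does show: from the 10-11 window onward, several windows report more than 100 results (116, 151, 268, ...). Since `hit_api` only requests page 1 with `page_size=100`, those windows are cut off at the first 100 articles.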
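
Each call spans day `x` through day `x+1`, so consecutive windows share a day. A small sketch that counts how many windows cover each day of the month, assuming both ends of a window are inclusive (an assumption about the API, not something verified here):

```python
from collections import Counter

# Same windows as the loop: (1, 2), (2, 3), ..., (30, 31)
windows = [(x, x + 1) for x in range(1, 31)]

coverage = Counter()
for start, end in windows:
    for day in range(start, end + 1):
        coverage[day] += 1

covered_twice = [day for day, n in coverage.items() if n == 2]
print(coverage[1], coverage[2], coverage[31], len(covered_twice))
```

If that assumption holds, most articles would be returned by two different calls, which is worth keeping in mind when counting rows afterwards.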

In [149]:
for x in range(1,31):
    df=hit_api(x,x+1,'Bernie|Sanders',myString)
Bernie_df = pd.concat(df)
Bernie_df.reset_index(drop=True)

01
02
query will return: 31
02
03
query will return: 48
03
04
query will return: 76
04
05
query will return: 94
05
06
query will return: 84
06
07
query will return: 56
07
08
query will return: 47
08
09
query will return: 60
09
10
query will return: 81
10
11
query will return: 116
11
12
query will return: 151
12
13
query will return: 268
13
14
query will return: 213
14
15
query will return: 64
15
16
query will return: 83
16
17
query will return: 106
17
18
query will return: 127
18
19
query will return: 152
19
20
query will return: 136
20
21
query will return: 86
21
22
query will return: 68
22
23
query will return: 76
23
24
query will return: 103
24
25
query will return: 95
25
26
query will return: 65
26
27
query will return: 57
27
28
query will return: 36
28
29
query will return: 29
29
30
query will return: 43
30
31
query will return: 24


index,author,content,description,publishedAt,source,title,url
0,Reid J. Epstein and Maggie Astor,I think to elect anyone else is a foolish vent...,"Bernie Sanders, Joe Biden and Elizabeth Warren...",2019-09-02,The New York Times,2020 Democrats Fan Out Across Iowa and New Ham...,https://www.nytimes.com/2019/09/02/us/politics...
1,Mark Leibovich,"Yet here was Mr. Biden again, at 76, trudging ...","On certain days, Biden 2020 can feel more like...",2019-09-02,The New York Times,Does Joe Biden Want to Be Doing This?,https://www.nytimes.com/2019/09/02/us/politics...
2,Times Of India,"Copyright © 2019 Bennett, Coleman &amp; Co. Lt...",Bernie Sanders rebukes India over Kashmir move...,2019-09-01,The Times of India,Bernie Sanders rebukes India over Kashmir move...,https://timesofindia.indiatimes.com/world/us/b...
3,Dean Obeidallah,"Dean Obeidallah, a former attorney, is the hos...",Dean Obeidallah writes that 2020 presidential ...,2019-09-01,CNN,The star of the annual Muslim convention was a...,https://www.cnn.com/2019/09/01/opinions/bernie...
4,Mark Schmitt,"In a recent MSNBC series, American Swamp, for ...",“Drain the swamp” suggests that all political ...,2019-09-02,The New York Times,Why Has Trump’s Exceptional Corruption Gone Un...,https://www.nytimes.com/2019/09/02/opinion/tru...
5,Times Of India,"Copyright © 2019 Bennett, Coleman &amp; Co. Lt...",US Senator and Democratic presidential contend...,2019-09-01,The Times of India,US Senator Bernie Sanders says 'deeply concern...,https://timesofindia.indiatimes.com/world/us/u...
6,"John Binder, John Binder",A report by the Wall Street Journal reveals ho...,"Union bosses, closely tied to the Democrat Par...",2019-09-02,Breitbart News,Union Bosses Fear Workers Will Stick with Trum...,https://www.breitbart.com/politics/2019/09/02/...
7,Kathleen Ronayne,"SACRAMENTO, Calif. (AP) Thousands of miles fro...",Big tech or big labor? 2020 Democrats line up ...,2019-09-01,Associated Press,Big tech or big labor? 2020 Democrats line up ...,https://www.apnews.com/84ee5960e7cc436597b77e1...
8,"Kathleen Ronayne and Steve Peoples, AP","SACRAMENTO, Calif. Thousands of miles from the...",Most major Democratic White House hopeful have...,2019-09-01,The Washington Post,Big tech or big labor? 2020 Democrats line up ...,https://www.washingtonpost.com/business/techno...
9,"Hannah Bleau, Hannah Bleau",Warren used both of her verified Twitter accou...,Sen. Elizabeth Warren (D-MA) – one of the most...,2019-09-02,Breitbart News,Elizabeth Warren Goes All Out for Labor Day: '...,https://www.breitbart.com/politics/2019/09/02/...


## Data sanity check

In [155]:
Bernie_df.shape
print("we should expect: {} articles".format(Bernie_df.shape[0]))
unique_array = Bernie_df.url.unique()
print("we have {} unique links".format(unique_array.shape[0]))

we should expect: 2203 articles
we have 1168 unique links
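
The 2203 rows can be reconciled with the per-window totals printed by the loop: every call requests only page 1 with `page_size=100`, so a window contributes at most 100 rows. A quick check of that arithmetic, with the totals copied from the loop output above:

```python
# "query will return" values from the loop output, in order
totals = [31, 48, 76, 94, 84, 56, 47, 60, 81, 116,
          151, 268, 213, 64, 83, 106, 127, 152, 136, 86,
          68, 76, 103, 95, 65, 57, 36, 29, 43, 24]

PAGE_SIZE = 100  # page_size used in hit_api

rows_fetched = sum(min(t, PAGE_SIZE) for t in totals)
rows_reported = sum(totals)
truncated_windows = sum(1 for t in totals if t > PAGE_SIZE)

print(rows_fetched, rows_reported, truncated_windows)  # prints: 2203 2675 9
```

The capped sum matches the row count, so the nine windows with more than 100 results are missing the articles beyond the first page.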


### We have 1035 rows whose URL repeats an earlier row

In [163]:
duplicateRowsDF = Bernie_df[Bernie_df.duplicated("url")]
duplicateRowsDF.shape

(1035, 7)
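
Since `duplicated("url")` flags every repeat after the first occurrence, dropping those rows should leave one row per link. In pandas that is typically `Bernie_df.drop_duplicates(subset="url")`; the stdlib sketch below shows the same keep-first idea on a list of dicts, with invented sample rows and a hypothetical helper name:

```python
def dedupe_by_url(rows):
    """Keep the first row seen for each url, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row["url"] not in seen:
            seen.add(row["url"])
            unique.append(row)
    return unique


# Invented rows; the second and third share a url
rows = [
    {"url": "https://example.com/a", "publishedAt": "2019-09-01"},
    {"url": "https://example.com/b", "publishedAt": "2019-09-02"},
    {"url": "https://example.com/b", "publishedAt": "2019-09-02"},
]
unique_rows = dedupe_by_url(rows)
print(len(rows), len(unique_rows))
```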

## Display duplicates (sorted by URL so repeats sit next to each other)

In [167]:
Bernie_df.sort_values(by=['url'])


index,author,content,description,publishedAt,source,title,url
69,,,Welcome to Majority.FM 's AM QUICKIE! Brought ...,2019-09-24,Google News,"AM QUICKIE: September 24th, 2019 w/ Lucie Stei...",http://feedproxy.google.com/~r/MajorityReport/...
68,,,Welcome to Majority.FM 's AM QUICKIE! Brought ...,2019-09-24,Google News,"AM QUICKIE: September 24th, 2019 w/ Lucie Stei...",http://feedproxy.google.com/~r/MajorityReport/...
82,,,Welcome to Majority.FM 's AM QUICKIE! Brought ...,2019-09-11,Google News,"AM QUICKIE: September 11th, 2019 w/ Lucie Stei...",http://feedproxy.google.com/~r/MajorityReport/...
87,,,Welcome to Majority.FM 's AM QUICKIE! Brought ...,2019-09-18,Google News,"AM QUICKIE: September 18th, 2019 w/ Lucie Stei...",http://feedproxy.google.com/~r/MajorityReport/...
50,,,Welcome to Majority.FM 's AM QUICKIE! Brought ...,2019-09-20,Google News,"AM QUICKIE: September 20th, 2019 w/ Sam Seder ...",http://feedproxy.google.com/~r/MajorityReport/...
77,,,Welcome to Majority.FM 's AM QUICKIE! Brought ...,2019-09-20,Google News,"AM QUICKIE: September 20th, 2019 w/ Sam Seder ...",http://feedproxy.google.com/~r/MajorityReport/...
95,,,Welcome to Majority.FM 's AM QUICKIE! Brought ...,2019-09-17,Google News,"AM QUICKIE: September 17th, 2019 w/ Lucie Stei...",http://feedproxy.google.com/~r/MajorityReport/...
76,,,Welcome to Majority.FM 's AM QUICKIE! Brought ...,2019-09-17,Google News,"AM QUICKIE: September 17th, 2019 w/ Lucie Stei...",http://feedproxy.google.com/~r/MajorityReport/...
28,"Tom Woods, Tom Woods",,Bernie Sanders is proposing a nationwide progr...,2019-09-24,Google News,Ep. 1498 Against Bernie's National Rent Control,http://feedproxy.google.com/~r/TheTomWoodsShow...
58,"Tom Woods, Tom Woods",,Bernie Sanders is proposing a nationwide progr...,2019-09-24,Google News,Ep. 1498 Against Bernie's National Rent Control,http://feedproxy.google.com/~r/TheTomWoodsShow...


### Fetching full text with newspaper3k
+ We can deal with the duplicates and streamline the steps above later.
+ To demonstrate a working product, I feed the first ten URLs into newspaper3k.
+ I'd say it takes 5-10 seconds per article to fetch the complete text.
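
Fetching many pages one by one will eventually hit a timeout or an article that cannot be parsed, so it may help to separate the loop from the network calls. A hedged sketch: `collect_texts` is a hypothetical helper, and the `fetch` and `extract` arguments stand in for something like `lambda u: requests.get(u, timeout=10).text` and newspaper's `fulltext`; the fake versions below exist only so the example runs offline:

```python
def collect_texts(urls, fetch, extract):
    """Return (texts, failures); a failing url is recorded instead of stopping the loop."""
    texts = {}
    failures = {}
    for url in urls:
        try:
            texts[url] = extract(fetch(url))
        except Exception as exc:  # broad on purpose: network and parsing errors differ
            failures[url] = repr(exc)
    return texts, failures


# Offline stand-ins for the real download and extraction steps
def fake_fetch(url):
    if "broken" in url:
        raise ValueError("could not download")
    return "<html><body>Article body for " + url + "</body></html>"


def fake_extract(html):
    return html.replace("<html><body>", "").replace("</body></html>", "")


texts, failures = collect_texts(
    ["https://example.com/ok", "https://example.com/broken"],
    fake_fetch,
    fake_extract,
)
print(texts)
print(failures)
```

Keeping the failures in a dict makes it possible to retry just those links later instead of re-fetching everything.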

In [168]:
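list_full_text = []
# Only the first 10 URLs, as a demonstration
for link in Bernie_df['url'][0:10]:
    html = requests.get(link).text   # download the raw page
    text = fulltext(html)            # extract the article body with newspaper3k
    list_full_text.append(text)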

In [169]:
list_full_text[6]

'Union bosses, closely tied to the Democrat Party, say American union workers sticking with President Trump in 2020 and his economic nationalist agenda is “a serious problem” for them.\n\nA report by the Wall Street Journal reveals how union bosses and Democrats are looking to peel off Trump’s support from American union workers who back his agenda, where most recently he has demanded multinational corporations move their production in China to the U.S.\n\nThe Journal reports:\n\n“It’s a serious problem for us,” said Alan Netland, president of the North East Area Labor Council in Duluth, Minn., which represents 40,000 union members. “People may say, ‘I voted Republican and the world didn’t fall in, so maybe I better keep doing that.’” [Emphasis added] Union officials, along with Democratic presidential candidates, are now trying to highlight what they see as a yawning gap between the president’s pro-worker rhetoric and his policies. [Emphasis added] … Democratic candidates have put app

## Conclusion
+ We need to wrap the query calls in functions, streamline them, and clean them up.
+ Sorry, my Python is rusty.