# Scraping News Articles

### https://towardsdatascience.com/scraping-news-and-articles-from-public-apis-with-python-be84521d85b9

Summary:

1. **New York Times**: Only provides meta information about the article, not the article itself.

In [1]:
import requests
import os
from pprint import pprint
import pandas as pd

# New York Times

In [7]:
with open('../../../../api_keys/nytimes/api_key.txt') as f:
    apikey = f.readline().strip()  # strip the trailing newline so it stays out of the URL
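
A key read from a file often carries a trailing newline, and a hard-coded relative path is brittle. Below is a minimal sketch of a loader that strips whitespace and can prefer an environment variable. The function name `load_api_key` and the variable name `NYT_API_KEY` are illustrative assumptions, not part of the original notebook.

```python
import os


def load_api_key(path, env_var=None):
    """Return an API key from an environment variable or, failing that, a file.

    `env_var` is a hypothetical variable name chosen by the caller.
    """
    if env_var:
        value = os.environ.get(env_var, "").strip()
        if value:
            return value
    # Fall back to the first line of the key file, without the trailing newline.
    with open(path) as f:
        key = f.readline().strip()
    if not key:
        raise ValueError(f"No API key found in {path}")
    return key
```

Usage could look like `apikey = load_api_key('../../../../api_keys/nytimes/api_key.txt', env_var='NYT_API_KEY')`.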

The simplest query we can run against the NY Times API is a lookup of the current top stories. We send a GET request to the `topstories/v2` endpoint, supplying the `section` name and our API key. The section here is science, but NY Times offers many others, e.g. fashion, health, sports or theater; the full list is in the documentation linked in the snippet below. The request and the shape of its response look like this:

In [87]:
# Top Stories: https://developer.nytimes.com/docs/top-stories-product/1/overview
section = "science"
query_url = f"https://api.nytimes.com/svc/topstories/v2/{section}.json?api-key={apikey}"

r = requests.get(query_url)
# pprint(r.json())

In [90]:
r.json().keys()

dict_keys(['status', 'copyright', 'section', 'last_updated', 'num_results', 'results'])

In [94]:
type(r.json()['results'])

list

In [95]:
data = pd.json_normalize(r.json()['results'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   section              25 non-null     object
 1   subsection           25 non-null     object
 2   title                25 non-null     object
 3   abstract             25 non-null     object
 4   url                  25 non-null     object
 5   uri                  25 non-null     object
 6   byline               25 non-null     object
 7   item_type            25 non-null     object
 8   updated_date         25 non-null     object
 9   created_date         25 non-null     object
 10  published_date       25 non-null     object
 11  material_type_facet  25 non-null     object
 12  kicker               25 non-null     object
 13  des_facet            25 non-null     object
 14  org_facet            25 non-null     object
 15  per_facet            25 non-null     object
 16  geo_facet 

In [97]:
data.head()

Unnamed: 0,section,subsection,title,abstract,url,uri,byline,item_type,updated_date,created_date,published_date,material_type_facet,kicker,des_facet,org_facet,per_facet,geo_facet,multimedia,short_url
0,world,,Covid-19 Live Updates: 3 States See Cases Link...,A judge blocks a couple’s plans to hold a 175-...,https://www.nytimes.com/2020/08/22/world/covid...,nyt://article/3015e8d5-b401-5109-bd6d-3ba42011...,,Article,2020-08-22T10:47:12-04:00,2020-08-22T08:01:11-04:00,2020-08-22T08:01:11-04:00,,,[Coronavirus (2019-nCoV)],[],[],[],[{'url': 'https://static01.nyt.com/images/2020...,https://nyti.ms/2Qfhh8Z
1,science,,Why Some Tropical Fish Are Gettin’ Squiggly Wi...,"On occasion, different species of anglerfish p...",https://www.nytimes.com/2020/08/22/science/ang...,nyt://article/5937698d-a328-5ff6-b73f-8d2f8f0f...,By Sabrina Imbler,Article,2020-08-22T05:00:13-04:00,2020-08-22T05:00:13-04:00,2020-08-22T05:00:13-04:00,,,"[Reproduction (Biological), Coral, Reefs, Fish...",[],[],[],[{'url': 'https://static01.nyt.com/images/2020...,https://nyti.ms/3aLQtGQ
2,health,,Cartilage Is Grown in the Arthritic Joints of ...,Researchers discovered a way to awaken dormant...,https://www.nytimes.com/2020/08/22/health/arth...,nyt://article/be31e3aa-d2c1-5a6c-abfa-7f587922...,By Gina Kolata,Article,2020-08-22T05:00:11-04:00,2020-08-22T05:00:11-04:00,2020-08-22T05:00:11-04:00,,,"[Knees, Bones, Research, Mice, Stem Cells, Art...",[Nature Medicine (Journal)],[],[],[{'url': 'https://static01.nyt.com/images/2020...,https://nyti.ms/2Qh2hHK
3,health,,Why Antibody Tests Won’t Help You Much,Most antibody tests are useful only for large ...,https://www.nytimes.com/2020/08/21/health/coro...,nyt://article/52a01565-8d75-5473-a681-48026a0d...,By Donald G. McNeil Jr.,Article,2020-08-21T22:55:28-04:00,2020-08-21T18:34:31-04:00,2020-08-21T18:34:31-04:00,,,"[Coronavirus (2019-nCoV), Antibodies, Tests (M...","[Infectious Diseases Society of America, Cente...","[Osterholm, Michael T]",[],[{'url': 'https://static01.nyt.com/images/2020...,https://nyti.ms/2QaUJ9s
4,health,,What to Know About Stuttering,The speech disorder can play havoc with sociab...,https://www.nytimes.com/2020/08/21/health/stut...,nyt://article/f5babc18-419d-50e1-abee-75299973...,By Benedict Carey,Article,2020-08-22T09:12:16-04:00,2020-08-21T17:03:59-04:00,2020-08-21T17:03:59-04:00,,,"[Stuttering, Anxiety and Stress, Presidential ...",[],"[Harrington, Brayden]",[],[{'url': 'https://static01.nyt.com/images/2020...,https://nyti.ms/2YlflQU
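
To work with the top stories without pandas, the `results` list can be reduced to a few fields directly. The following is a small sketch under assumptions: `summarize_top_stories` is a made-up helper name, and `sample` is hand-written data shaped like the response above, not real API output.

```python
def summarize_top_stories(payload):
    """Pick a few fields from each entry in a Top Stories response dict."""
    rows = []
    for item in payload.get("results", []):
        rows.append({
            "title": item.get("title", ""),
            "section": item.get("section", ""),
            "byline": item.get("byline", ""),
            "url": item.get("url", ""),
            # des_facet is a list of subject tags; join it for easy display.
            "subjects": "; ".join(item.get("des_facet") or []),
        })
    return rows


# Hand-written sample shaped like the response above (not real API output).
sample = {
    "status": "OK",
    "num_results": 1,
    "results": [{
        "title": "What to Know About Stuttering",
        "section": "health",
        "byline": "By Benedict Carey",
        "url": "https://example.com/stuttering",
        "des_facet": ["Stuttering", "Anxiety and Stress"],
    }],
}

for row in summarize_top_stories(sample):
    print(row["title"], "|", row["subjects"])
```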


Next, and probably the most useful endpoint when you need a specific set of data, is the Article Search endpoint.

This endpoint offers many filtering options. The only mandatory field is `q` (query), the search term. Beyond that you can mix and match a filter query, a date range (`begin_date`, `end_date`), page number, sort order and facet fields. The filter query (`fq`) is the interesting one, as it accepts Lucene query syntax, which lets you build complex filters with logical operators (`AND`, `OR`), negations or wildcards. A tutorial on the syntax is linked in the snippet below.


In [98]:
# Article Search: https://developer.nytimes.com/docs/articlesearch-product/1/routes/articlesearch.json/get to explore API
# https://api.nytimes.com/svc/search/v2/articlesearch.json?q=<QUERY>&api-key=<APIKEY>

query = "politics"
begin_date = "20200701"  # YYYYMMDD
filter_query = "\"body:(\"Trump\") AND glocations:(\"WASHINGTON\")\""  # http://www.lucenetutorial.com/lucene-query-syntax.html
page = "0"  # <0-100>
sort = "relevance"  # newest, oldest
query_url = f"https://api.nytimes.com/svc/search/v2/articlesearch.json?" \
            f"q={query}" \
            f"&api-key={apikey}" \
            f"&begin_date={begin_date}" \
            f"&fq={filter_query}" \
            f"&page={page}" \
            f"&sort={sort}"

r = requests.get(query_url)
# pprint(r.json())
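
A Lucene filter contains spaces, quotes and parentheses, so it is safer to percent-encode the query string than to paste values into the URL by hand. Here is a minimal sketch using the standard library; `build_article_search_url` is a hypothetical helper, `YOUR_KEY` is a placeholder, and writing the filter without the outer pair of quotes is an assumption about what the API expects.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_article_search_url(api_key, query, **filters):
    """Build an Article Search URL with percent-encoded parameters.

    Filters set to None are skipped.
    """
    params = {"q": query, "api-key": api_key}
    params.update({k: v for k, v in filters.items() if v is not None})
    return f"{ARTICLE_SEARCH_URL}?{urlencode(params)}"


url = build_article_search_url(
    "YOUR_KEY",  # placeholder, not a real key
    "politics",
    begin_date="20200701",
    fq='body:("Trump") AND glocations:("WASHINGTON")',
    page=0,
    sort="relevance",
)
print(url)
# Decoding the query string recovers the original filter text.
print(parse_qs(urlsplit(url).query)["fq"][0])
```

Passing the same dictionary as `params=` to `requests.get` makes `requests` do this encoding for you.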

In [99]:
r.json().keys()

dict_keys(['status', 'copyright', 'response'])

In [100]:
data = pd.json_normalize(r.json()['response']['docs'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 28 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   abstract                 10 non-null     object
 1   web_url                  10 non-null     object
 2   snippet                  10 non-null     object
 3   lead_paragraph           10 non-null     object
 4   print_section            4 non-null      object
 5   print_page               4 non-null      object
 6   source                   10 non-null     object
 7   multimedia               10 non-null     object
 8   keywords                 10 non-null     object
 9   pub_date                 10 non-null     object
 10  document_type            10 non-null     object
 11  news_desk                10 non-null     object
 12  section_name             10 non-null     object
 13  type_of_material         10 non-null     object
 14  _id                      10 non-null     obje

In [101]:
data.head()

Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,keywords,pub_date,document_type,news_desk,section_name,type_of_material,_id,word_count,uri,headline.main,headline.kicker,headline.content_kicker,headline.print_headline,headline.name,headline.seo,headline.sub,byline.original,byline.person,byline.organization,subsection_name
0,"A backstage fixture at Democratic conventions,...",https://www.nytimes.com/2020/08/20/us/christin...,"A backstage fixture at Democratic conventions,...","Christine Jahnke, a communications coach who p...",B,9.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","[{'name': 'persons', 'value': 'Jahnke, Christi...",2020-08-20T22:12:35+0000,article,Obits,U.S.,Obituary (Obit),nyt://article/6cad0b2d-6b45-5a73-aac9-5ef08962...,1199,nyt://article/6cad0b2d-6b45-5a73-aac9-5ef08962...,"Christine Jahnke, Speech Coach for Women in Po...",,,"Christine Jahnke, Speech Coach For Women in P...",,,,By Katharine Q. Seelye,"[{'firstname': 'Katharine', 'middlename': 'Q.'...",,
1,Jon Meacham’s remarks at this week’s Democrati...,https://www.nytimes.com/2020/08/21/books/jon-m...,Jon Meacham’s remarks at this week’s Democrati...,"Last month, the historian and biographer Jon M...",,,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","[{'name': 'subject', 'value': 'Presidential El...",2020-08-21T23:24:28+0000,article,Books,Books,News,nyt://article/8766ce1d-24e0-50ca-b68e-4c249313...,1079,nyt://article/8766ce1d-24e0-50ca-b68e-4c249313...,A Presidential Historian Makes a Rare Appearan...,,,,,,,By Alexandra Alter,"[{'firstname': 'Alexandra', 'middlename': None...",,
2,She grew up around Berkeley activists but came...,https://www.nytimes.com/2020/08/12/us/politics...,She grew up around Berkeley activists but came...,Kamala Harris’s first act as a political candi...,A,1.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","[{'name': 'persons', 'value': 'Harris, Kamala ...",2020-08-12T07:00:07+0000,article,Politics,U.S.,News,nyt://article/3c71944f-266a-57fa-8ddf-a44e9ff3...,1765,nyt://article/3c71944f-266a-57fa-8ddf-a44e9ff3...,"Kamala Harris, a Political Fighter Shaped by L...",,,,,,,By Matt Flegenheimer and Lisa Lerer,"[{'firstname': 'Matt', 'middlename': None, 'la...",,Politics
3,Party strategists pay a lot of attention to re...,https://www.nytimes.com/2020/08/12/opinion/cen...,Party strategists pay a lot of attention to re...,Some of the most important developments in pol...,,,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","[{'name': 'subject', 'value': 'State Legislatu...",2020-08-12T09:00:18+0000,article,OpEd,Opinion,Op-Ed,nyt://article/74074d0f-51df-568d-962a-d63621ee...,2296,nyt://article/74074d0f-51df-568d-962a-d63621ee...,The Politics We Don’t See Matter as Much as Th...,,,,,,,By Thomas B. Edsall,"[{'firstname': 'Thomas', 'middlename': None, '...",,
4,Ms. Louis-Dreyfus was the final television sta...,https://www.nytimes.com/2020/08/20/us/politics...,Ms. Louis-Dreyfus was the final television sta...,Olivia Pope and Selina Meyer are used to findi...,A,15.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","[{'name': 'subject', 'value': 'Democratic Nati...",2020-08-20T21:27:42+0000,article,Politics,U.S.,News,nyt://article/ef970cda-17cb-52e9-93e8-28898c30...,1439,nyt://article/ef970cda-17cb-52e9-93e8-28898c30...,Julia Louis-Dreyfus Caps a Week of Starring Ro...,,,Four Women Used to Life Onscreen Guided Viewer...,,,,By Sydney Ember and Lisa Lerer,"[{'firstname': 'Sydney', 'middlename': None, '...",,Politics
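
Each Article Search response holds a single page of 10 documents, so collecting more means stepping through the `page` parameter. The sketch below shows one way to do that. `collect_docs` and `fake_fetch` are made-up names, and the default `delay` between calls is an assumption; check the current rate limits for your key.

```python
import time


def collect_docs(fetch_page, max_pages=5, page_size=10, delay=6.0):
    """Collect docs from consecutive result pages.

    `fetch_page(page)` is a caller-supplied function returning the list of
    docs for that page number. `delay` (seconds between calls) and
    `page_size` are assumptions, not documented guarantees.
    """
    docs = []
    for page in range(max_pages):
        batch = fetch_page(page)
        docs.extend(batch)
        # A short page means there is nothing further to fetch.
        if len(batch) < page_size:
            break
        if page < max_pages - 1:
            time.sleep(delay)
    return docs


# Stand-in fetcher returning fake docs, so the sketch runs without network access.
def fake_fetch(page):
    total = 23
    start = page * 10
    return [{"_id": i} for i in range(start, min(start + 10, total))]


print(len(collect_docs(fake_fetch, delay=0)))
```

With the real API, `fetch_page` could wrap the request shown earlier and return `r.json()['response']['docs']` for the given page.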


The last NY Times endpoint I will show here is the Archive API, which returns the list of articles for a given month, going back all the way to 1851! This can be very useful if you need bulk data and don’t need to search for specific terms.

In [58]:
# Archive Search
# https://developer.nytimes.com/docs/archive-product/1/overview

year = "2020"  # <1851 - 2020>
month = "7"  # <1 - 12>
query_url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={apikey}"

r = requests.get(query_url)
# pprint(r.json())
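
For bulk downloads you typically need one request per month over a range. A small sketch of generating those URLs follows; the helper names `iter_months` and `archive_urls` are hypothetical and `YOUR_KEY` is a placeholder.

```python
ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"


def iter_months(start, end):
    """Yield (year, month) tuples from `start` to `end`, both inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def archive_urls(start, end, api_key):
    """Build one Archive API URL per month in the given range."""
    if start[0] < 1851:
        raise ValueError("the archive starts in 1851")
    return [
        ARCHIVE_URL.format(year=y, month=m) + f"?api-key={api_key}"
        for y, m in iter_months(start, end)
    ]


for url in archive_urls((2019, 11), (2020, 2), "YOUR_KEY"):
    print(url)
```

Each month is a large response, so pausing between requests is prudent.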

In [59]:
print(r.json().keys())
print(r.json()['response'].keys())
print(type(r.json()['response']['docs']), len(r.json()['response']['docs']))

dict_keys(['copyright', 'response'])
dict_keys(['meta', 'docs'])
<class 'list'> 6553


In [60]:
data = pd.json_normalize(r.json()['response']['docs'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6553 entries, 0 to 6552
Data columns (total 29 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   abstract                 6553 non-null   object
 1   web_url                  6553 non-null   object
 2   snippet                  6553 non-null   object
 3   lead_paragraph           6553 non-null   object
 4   print_section            4078 non-null   object
 5   print_page               4078 non-null   object
 6   source                   6553 non-null   object
 7   multimedia               6553 non-null   object
 8   keywords                 6553 non-null   object
 9   pub_date                 6553 non-null   object
 10  document_type            6553 non-null   object
 11  news_desk                6553 non-null   object
 12  section_name             6553 non-null   object
 13  subsection_name          2646 non-null   object
 14  type_of_material         6445 non-null  

In [77]:
data.head()

Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,keywords,pub_date,document_type,news_desk,section_name,subsection_name,type_of_material,_id,word_count,uri,headline.main,headline.kicker,headline.content_kicker,headline.print_headline,headline.name,headline.seo,headline.sub,byline.original,byline.person,byline.organization,slideshow_credits
0,A small-time businessman became a key middlema...,https://www.nytimes.com/2020/07/01/world/asia/...,A small-time businessman became a key middlema...,"KABUL, Afghanistan — He was a lowly drug smugg...",A,18,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","[{'name': 'glocations', 'value': 'Russia', 'ra...",2020-07-01T23:16:05+0000,article,Foreign,World,Asia Pacific,News,nyt://article/00e3dfa6-82b3-5825-bf8a-c144ac1a...,1219,nyt://article/00e3dfa6-82b3-5825-bf8a-c144ac1a...,Afghan Contractor Handed Out Russian Cash to K...,,,"Afghan Contractor Gave Out Russian Cash, Offic...",,,,"By Mujib Mashal, Eric Schmitt, Najim Rahim and...","[{'firstname': 'Mujib', 'middlename': None, 'l...",,
1,The so-called Capitol Hill Organized Protest a...,https://www.nytimes.com/2020/07/01/us/seattle-...,The so-called Capitol Hill Organized Protest a...,"SEATTLE — For weeks, officials in Seattle have...",A,20,The New York Times,[],"[{'name': 'subject', 'value': 'George Floyd Pr...",2020-07-01T14:06:12+0000,article,National,U.S.,,News,nyt://article/01ea3048-bbe7-51ab-a366-09770601...,1443,nyt://article/01ea3048-bbe7-51ab-a366-09770601...,Police Clear Seattle’s Protest ‘Autonomous Zone’,,,"Blaming Gun Violence, Seattle Officials Clear ...",,,,By Rachel Abrams,"[{'firstname': 'Rachel', 'middlename': None, '...",,
2,Surging outbreaks in the U.S. Embassy and the ...,https://www.nytimes.com/2020/07/01/us/politics...,Surging outbreaks in the U.S. Embassy and the ...,WASHINGTON — Inside the sprawling American Emb...,A,1,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","[{'name': 'subject', 'value': 'Coronavirus (20...",2020-07-01T21:21:16+0000,article,Washington,U.S.,Politics,News,nyt://article/02eea957-203b-5824-a81d-a3c8afcf...,1675,nyt://article/02eea957-203b-5824-a81d-a3c8afcf...,Late Action on Virus Prompts Fears Over Safety...,,,Embassy Crisis in Riyadh Shows Perils of Diplo...,,,,By Mark Mazzetti and Edward Wong,"[{'firstname': 'Mark', 'middlename': None, 'la...",,
3,Journalists have been wary of Alden Global Cap...,https://www.nytimes.com/2020/07/02/business/me...,Journalists have been wary of Alden Global Cap...,Alden Global Capital seemed in position this w...,B,1,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","[{'name': 'subject', 'value': 'Newspapers', 'r...",2020-07-03T02:13:01+0000,article,Business,Business Day,Media,News,nyt://article/038264a1-04c7-5562-ac8a-1d0d912c...,1003,nyt://article/038264a1-04c7-5562-ac8a-1d0d912c...,Hedge Fund’s Run at Tribune Publishing Ends Wi...,,,Hedge Fund Delays Effort To Buy Tribune,,,,By Marc Tracy,"[{'firstname': 'Marc', 'middlename': None, 'la...",,
4,"Adam Hollingsworth, known in Chicago as the Dr...",https://www.nytimes.com/2020/07/01/style/dread...,"Adam Hollingsworth, known in Chicago as the Dr...","In late May, as protests against police brutal...",ST,2,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","[{'name': 'persons', 'value': 'Hollingsworth, ...",2020-07-01T15:39:53+0000,article,Styles,Style,,News,nyt://article/04c67aa2-2796-56d1-b342-ab16769f...,779,nyt://article/04c67aa2-2796-56d1-b342-ab16769f...,‘You Can’t Just Get Up and Steal a Police Horse’,,,‘You Can’t Just Get Up and Steal a Police Horse’,,,,By Ximena Larkin,"[{'firstname': 'Ximena', 'middlename': None, '...",,


In [82]:
print(data.iloc[0]['snippet'])

A small-time businessman became a key middleman for bounties on coalition troops in Afghanistan, U.S. intelligence reports say. Friends saw him grow rich, but didn’t know how.


In [84]:
data.iloc[0]['abstract']

'A small-time businessman became a key middleman for bounties on coalition troops in Afghanistan, U.S. intelligence reports say. Friends saw him grow rich, but didn’t know how.'

In [85]:
data.iloc[0]['lead_paragraph']

'KABUL, Afghanistan — He was a lowly drug smuggler, neighbors and relatives say, then ventured into contracting, seeking a slice of the billions of dollars the U.S.-led coalition was funneling into construction projects in Afghanistan.'

In [86]:
data.iloc[0]['headline.main']

'Afghan Contractor Handed Out Russian Cash to Kill Americans, Officials Say'
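
Since the Archive API has no search parameters, filtering happens on your side. The `keywords` column holds a list of dictionaries with `name` and `value` entries, which makes a simple filter possible. This is a sketch only: `docs_with_keyword` is a made-up helper and `sample_docs` is hand-written data shaped like the records above.

```python
def docs_with_keyword(docs, name, value):
    """Return docs whose `keywords` list has an entry matching name and value."""
    matches = []
    for doc in docs:
        for kw in doc.get("keywords") or []:
            if kw.get("name") == name and kw.get("value") == value:
                matches.append(doc)
                break  # one match is enough; avoid adding the doc twice
    return matches


# Hand-written records shaped like the archive docs above (not real API output).
sample_docs = [
    {"headline": {"main": "A"}, "keywords": [{"name": "glocations", "value": "Russia"}]},
    {"headline": {"main": "B"}, "keywords": [{"name": "subject", "value": "Newspapers"}]},
    {"headline": {"main": "C"}, "keywords": []},
]

for doc in docs_with_keyword(sample_docs, "glocations", "Russia"):
    print(doc["headline"]["main"])
```

With the real data, `docs` would be `r.json()['response']['docs']`.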

# Guardian

Next up is another great source of news and articles: The Guardian. As with NY Times, we first need to sign up for an API key. You can do so [here](https://bonobo.capi.gutools.co.uk/register/developer) and you will receive your key in an email. With that out of the way, we can navigate to the [API documentation](https://open-platform.theguardian.com/documentation/) and start querying the API.

In [102]:
with open('../../../../api_keys/guardian/api_key.txt') as f:
    apikey = f.readline().strip()  # strip the trailing newline so it stays out of the URL

Let’s start by querying The Guardian's content sections. Sections group content into topics, which is useful if you are looking for a specific type of content, e.g. science or technology. If we omit the query (`q`) parameter, as the snippet below does, we instead receive the full list of sections, which is about 75 records.

In [105]:
# https://open-platform.theguardian.com/documentation/section
query = "science"  # not sent below; append f"&q={query}" to filter sections by this term
query_url = f"https://content.guardianapis.com/sections?" \
            f"api-key={apikey}"

r = requests.get(query_url)
# pprint(r.json())

In [111]:
print(r.json().keys())
print(r.json()['response'].keys())

dict_keys(['response'])
dict_keys(['status', 'userTier', 'total', 'results'])


In [114]:
data = pd.json_normalize(r.json()['response']['results'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75 entries, 0 to 74
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  75 non-null     object
 1   webTitle            75 non-null     object
 2   webUrl              75 non-null     object
 3   apiUrl              75 non-null     object
 4   editions            75 non-null     object
 5   activeSponsorships  2 non-null      object
dtypes: object(6)
memory usage: 3.6+ KB


In [115]:
data.head()

Unnamed: 0,id,webTitle,webUrl,apiUrl,editions,activeSponsorships
0,about,About,https://www.theguardian.com/about,https://content.guardianapis.com/about,"[{'id': 'about', 'webTitle': 'About', 'webUrl'...",
1,animals-farmed,Animals farmed,https://www.theguardian.com/animals-farmed,https://content.guardianapis.com/animals-farmed,"[{'id': 'animals-farmed', 'webTitle': 'Animals...",
2,artanddesign,Art and design,https://www.theguardian.com/artanddesign,https://content.guardianapis.com/artanddesign,"[{'id': 'artanddesign', 'webTitle': 'Art and d...",
3,australia-news,Australia news,https://www.theguardian.com/australia-news,https://content.guardianapis.com/australia-news,"[{'id': 'australia-news', 'webTitle': 'Austral...",
4,better-business,Better Business,https://www.theguardian.com/better-business,https://content.guardianapis.com/better-business,"[{'id': 'better-business', 'webTitle': 'Better...",


Moving on to something a little more interesting: searching by tags. This query looks quite similar to the previous one and returns similar kinds of data. Tags also group content into categories, but there are far more tags (around 50,000) than sections. Each tag has a structure like `world/extreme-weather`. Tags are very useful when searching for actual articles, which is what we will do next.

In [117]:
# https://open-platform.theguardian.com/documentation/tag
query = "weather"
section = "news"  # not used in this request
page = "1"
query_url = f"https://content.guardianapis.com/tags?" \
            f"api-key={apikey}" \
            f"&q={query}" \
            f"&page={page}"

r = requests.get(query_url)
#pprint(r.json())

In [119]:
data = pd.json_normalize(r.json()['response']['results'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           10 non-null     object
 1   type         10 non-null     object
 2   sectionId    9 non-null      object
 3   sectionName  9 non-null      object
 4   webTitle     10 non-null     object
 5   webUrl       10 non-null     object
 6   apiUrl       10 non-null     object
 7   description  1 non-null      object
dtypes: object(8)
memory usage: 768.0+ bytes


In [120]:
data.head()

Unnamed: 0,id,type,sectionId,sectionName,webTitle,webUrl,apiUrl,description
0,theguardian/mainsection/weather2,newspaper-book-section,weather,Weather,Weather,https://www.theguardian.com/theguardian/mainse...,https://content.guardianapis.com/theguardian/m...,
1,weather/weather,keyword,weather,Weather,Weather,https://www.theguardian.com/weather/weather,https://content.guardianapis.com/weather/weather,
2,australia-news/australia-weather,keyword,australia-news,Australia news,Australia weather,https://www.theguardian.com/australia-news/aus...,https://content.guardianapis.com/australia-new...,
3,world/extreme-weather,keyword,world,World news,Extreme weather,https://www.theguardian.com/world/extreme-weather,https://content.guardianapis.com/world/extreme...,
4,us-news/us-weather,keyword,us-news,US news,US weather,https://www.theguardian.com/us-news/us-weather,https://content.guardianapis.com/us-news/us-we...,


The one thing you really came here for is article search, and for that we will use https://open-platform.theguardian.com/documentation/search.

The reason I showed section and tag search first is that their results feed into article search. In the snippet below we use the `section` and `tag` parameters to narrow down the search; their values can be found with the queries shown earlier. Besides these, we pass the obvious `q` parameter for the search term, a starting date via `from-date`, and the `show-fields` parameter, which requests extra fields related to the content, in this case the headline, byline, star rating and shortened URL. There are many more of these, with the full list available [here](https://open-platform.theguardian.com/documentation/search).

In [121]:
query = "(hurricane OR storm)"
query_fields = "body"
section = "news"  # https://open-platform.theguardian.com/documentation/section
tag = "world/extreme-weather"  # https://open-platform.theguardian.com/documentation/tag
from_date = "2019-01-01"
query_url = f"https://content.guardianapis.com/search?" \
            f"api-key={apikey}" \
            f"&q={query}" \
            f"&query-fields={query_fields}" \
            f"&section={section}" \
            f"&tag={tag}" \
            f"&from-date={from_date}" \
            f"&show-fields=headline,byline,starRating,shortUrl"

r = requests.get(query_url)
#pprint(r.json())
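
The Guardian's parameter names contain hyphens and the search term contains spaces and parentheses, so assembling the URL from a dictionary keeps things tidy. What follows is a sketch, not the API's official client: `build_guardian_search_url` is a hypothetical helper, `YOUR_KEY` is a placeholder, and it assumes the API accepts percent-encoded values.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

GUARDIAN_SEARCH_URL = "https://content.guardianapis.com/search"


def build_guardian_search_url(api_key, query, show_fields=(), **filters):
    """Build a content search URL; keyword names map from_date -> from-date."""
    params = {"api-key": api_key, "q": query}
    for name, value in filters.items():
        if value is not None:
            # Python identifiers cannot contain hyphens, so swap underscores.
            params[name.replace("_", "-")] = value
    if show_fields:
        params["show-fields"] = ",".join(show_fields)
    return f"{GUARDIAN_SEARCH_URL}?{urlencode(params)}"


url = build_guardian_search_url(
    "YOUR_KEY",  # placeholder, not a real key
    "(hurricane OR storm)",
    show_fields=["headline", "byline", "starRating", "shortUrl"],
    query_fields="body",
    section="news",
    tag="world/extreme-weather",
    from_date="2019-01-01",
)
print(url)
```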

In [122]:
r.json().keys()

dict_keys(['response'])

In [123]:
data = pd.json_normalize(r.json()['response']['results'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  7 non-null      object
 1   type                7 non-null      object
 2   sectionId           7 non-null      object
 3   sectionName         7 non-null      object
 4   webPublicationDate  7 non-null      object
 5   webTitle            7 non-null      object
 6   webUrl              7 non-null      object
 7   apiUrl              7 non-null      object
 8   isHosted            7 non-null      bool  
 9   pillarId            7 non-null      object
 10  pillarName          7 non-null      object
 11  fields.headline     7 non-null      object
 12  fields.byline       7 non-null      object
 13  fields.shortUrl     7 non-null      object
dtypes: bool(1), object(13)
memory usage: 863.0+ bytes


In [124]:
data.head()

Unnamed: 0,id,type,sectionId,sectionName,webPublicationDate,webTitle,webUrl,apiUrl,isHosted,pillarId,pillarName,fields.headline,fields.byline,fields.shortUrl
0,news/2019/dec/19/weatherwatch-storms-hit-franc...,article,news,News,2019-12-19T11:33:52Z,Weatherwatch: storms hit France and Iceland as...,https://www.theguardian.com/news/2019/dec/19/w...,https://content.guardianapis.com/news/2019/dec...,False,pillar/news,News,Weatherwatch: storms hit France and Iceland as...,Daniel Gardner (MetDesk),https://gu.com/p/dv4dq
1,news/2020/jan/31/weatherwatch-how-repeated-flo...,article,news,News,2020-01-31T21:30:00Z,Weatherwatch: how repeated flooding can shift ...,https://www.theguardian.com/news/2020/jan/31/w...,https://content.guardianapis.com/news/2020/jan...,False,pillar/news,News,Weatherwatch: how repeated flooding can shift ...,David Hambling,https://gu.com/p/d755m
2,news/2019/sep/18/weatherwatch-do-30-year-mortg...,article,news,News,2019-09-18T20:30:09Z,Weatherwatch: do 30-year mortgages make sense ...,https://www.theguardian.com/news/2019/sep/18/w...,https://content.guardianapis.com/news/2019/sep...,False,pillar/news,News,Weatherwatch: do 30-year mortgages make sense ...,Paul Brown,https://gu.com/p/cakmm
3,news/2019/jul/17/weatherwatch-venice-supercell...,article,news,News,2019-07-17T20:30:05Z,World weatherwatch: Venice supercell storm end...,https://www.theguardian.com/news/2019/jul/17/w...,https://content.guardianapis.com/news/2019/jul...,False,pillar/news,News,World weatherwatch: Venice supercell storm end...,Alessio Martini (MetDesk),https://gu.com/p/bq6ad
4,news/2019/oct/25/weatherwatch-volunteers-world...,article,news,News,2019-10-25T20:30:29Z,Weatherwatch: volunteers worldwide aided rescu...,https://www.theguardian.com/news/2019/oct/25/w...,https://content.guardianapis.com/news/2019/oct...,False,pillar/news,News,Weatherwatch: volunteers worldwide aided rescu...,Kate Ravilious,https://gu.com/p/chhdj


In [126]:
data.iloc[2]['fields.headline']

'Weatherwatch: do 30-year mortgages make sense as sea levels rise faster annually?'
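
The `webPublicationDate` values arrive as strings, so sorting or filtering by date needs a conversion first. Here is a minimal sketch for the timestamp format shown above; `parse_guardian_date` is a made-up helper name.

```python
from datetime import datetime, timezone


def parse_guardian_date(text):
    """Parse a timestamp such as '2019-12-19T11:33:52Z' into a UTC-aware datetime."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    # The trailing 'Z' denotes UTC, so attach that timezone explicitly.
    return parsed.replace(tzinfo=timezone.utc)


# Timestamps copied from the results above.
dates = ["2019-12-19T11:33:52Z", "2020-01-31T21:30:00Z", "2019-09-18T20:30:09Z"]
for stamp in sorted(parse_guardian_date(d) for d in dates):
    print(stamp.isoformat())
```

On the DataFrame, `pd.to_datetime(data['webPublicationDate'])` converts the whole column in one call.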

# Currents 

Finding a popular, good-quality news API is quite difficult, as most classic newspapers don’t offer a free public API. There are, however, aggregators of news data that can be used to get articles from outlets such as the Financial Times and Bloomberg, which only provide paid API services, or CNN, which doesn’t expose an API at all.

One of these aggregators is [Currents API](https://currentsapi.services/en). It aggregates data from thousands of sources in 18 languages and over 70 countries, and it’s also free.

It works much like the APIs shown before. We again need to get an API key first. To do so, register at https://currentsapi.services/en/register. After that, go to your profile at https://currentsapi.services/en/profile and retrieve your API token.

With the key (token) ready, we can request some data. There’s really just one interesting endpoint, https://api.currentsapi.services/v1/search:

In [128]:
with open('../../../../api_keys/currents_news_articles/api_key.txt') as f:
    apikey = f.readline().strip()  # strip the trailing newline so it stays out of the URL

In [144]:
# https://currentsapi.services/en/docs/search
category = "business"
# language = languages['English']  # Mapping from Language to Code, e.g.: "English": "en"
# country = regions["Canada"]  # Mapping from Country to Code, e.g.: "Canada": "CA",
language = "en"
country = "CA"
keywords = "bitcoin"
t = "1"  # 1 for news, 2 for article and 3 for discussion content
domain = "financialpost.com"  # website primary domain name (without www or blog prefix)
start_date = "2020-06-01T14:30"  # YYYY-MM-DDTHH:MM:SS+00:00
query_url = f"https://api.currentsapi.services/v1/search?" \
            f"apiKey={apikey}" \
            f"&language={language}" \
            f"&category={category}" \
            f"&country={country}" \
            f"&type={t}" \
            f"&domain={domain}" \
            f"&keywords={keywords}" \
            f"&start_date={start_date}"

r = requests.get(query_url)
pprint(r.json())

{'news': [], 'status': 'ok'}
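
A response can report `'status': 'ok'` and still contain no articles, so it helps to check both before building a DataFrame. The sketch below uses a hypothetical helper name, `extract_news`, and treats any status other than `'ok'` as an error, which is an assumption about how the API reports failures.

```python
def extract_news(payload):
    """Return the `news` list from a Currents response, or raise on a bad status."""
    status = payload.get("status")
    if status != "ok":
        raise RuntimeError(f"Currents API returned status {status!r}")
    return payload.get("news", [])


# The empty response shown above.
articles = extract_news({"news": [], "status": "ok"})
if not articles:
    print("No articles matched; try loosening the filters.")
```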


In [132]:
r.json().keys()

dict_keys(['status', 'news'])

In [133]:
data = pd.json_normalize(r.json()['news'])
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Empty DataFrame

In [147]:
# language = languages['English']
language = 'en'
query_url = f"https://api.currentsapi.services/v1/latest-news?" \
            f"apiKey={apikey}" \
            f"&language={language}"

r = requests.get(query_url)
# pprint(r.json())

In [148]:
r.json().keys()

dict_keys(['status', 'news', 'page'])

In [149]:
data = pd.json_normalize(r.json()['news'])
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           30 non-null     object
 1   title        30 non-null     object
 2   description  30 non-null     object
 3   url          30 non-null     object
 4   author       30 non-null     object
 5   image        30 non-null     object
 6   language     30 non-null     object
 7   category     30 non-null     object
 8   published    30 non-null     object
dtypes: object(9)
memory usage: 2.2+ KB


In [150]:
data.head()

Unnamed: 0,id,title,description,url,author,image,language,category,published
0,06d68120-cc63-4ca0-84be-62a104409781,Robert Trump mourner allegedly punches restaur...,An unidentified mourner from Robert Trump's fu...,https://nypost.com/2020/08/22/robert-trump-mou...,@nypost,https://nypost.com/wp-content/uploads/sites/2/...,en,[general],2020-08-22 14:52:36 +0000
1,66e6ae89-0090-4061-b04b-2b9d5b9d85be,How to Watch DC Fandome Online,DC fans will have access to over eight hours o...,https://variety.com/2020/film/news/dc-fandome-...,@Variety,https://pmcvariety.files.wordpress.com/2019/04...,en,"[entertainment, celebrity, television, music]",2020-08-22 15:00:56 +0000
2,85e4dcfc-ad28-4899-9f0b-a2eab0b354a0,Oregon UPS driver arrested in string of inters...,Kenneth Ayers is believed to have been involve...,https://news.yahoo.com/oregon-ups-driver-arres...,yahoo,,en,[general],2020-08-22 02:36:00 +0000
3,6985a85c-7072-4130-872f-b1da024d14d6,Biden asks Americans to judge Trump 'by the fa...,Joe Biden is focusing on the devastating numbe...,https://news.yahoo.com/biden-asks-americans-ju...,yahoo,,en,[general],2020-08-21 03:27:01 +0000
4,96d996dd-f76a-4640-bcf6-b88230a2dac6,Now Is the Time to Force Hezbollah out of Lebanon,The horrific August 4 blast in Beirut has expo...,https://news.yahoo.com/now-time-force-hezbolla...,yahoo,,en,[general],2020-08-22 10:30:07 +0000


In [152]:
data.groupby("author")['id'].count()

author
@BBCNews                                         5
@Variety                                         1
@nypost                                          1
DAN SEWELL and JULIE CARR SMYTH                  1
EMILY WAGSTER PETTUS                             1
Emily Peck, Felix Salmon, and Anna Szymanski     1
JENNIFER PELTZ                                   1
bbc                                              6
timesunion                                       1
yahoo                                           12
Name: id, dtype: int64
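
The `category` column holds a list per article, so a plain `groupby` on it will not work the way it does for `author`. Below is a small sketch of tallying list-valued categories with the standard library; `count_categories` is a made-up helper and `sample_articles` is hand-written data shaped like the records above.

```python
from collections import Counter


def count_categories(articles):
    """Count how many articles carry each category label."""
    counts = Counter()
    for article in articles:
        # Each article lists zero or more categories.
        counts.update(article.get("category") or [])
    return counts


# Hand-written records shaped like the `news` entries above (not real API output).
sample_articles = [
    {"id": "a", "category": ["general"]},
    {"id": "b", "category": ["entertainment", "celebrity", "television", "music"]},
    {"id": "c", "category": ["general"]},
]

for label, n in count_categories(sample_articles).most_common():
    print(label, n)
```

With the real data, `articles` would be `r.json()['news']`.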