# NYT API
To use the New York Times API, register for an [API key here](https://developer.nytimes.com/).

For a more detailed explanation of how to use the API, see this resource:
https://nicksubic.medium.com/a-guide-to-querying-the-new-york-times-api-with-python-b621556236f8


In [1]:
import requests
from time import sleep

## creating API calls with keys

In [2]:
# key (replace the placeholder with the API key you registered for)
key = 'YOUR_API_KEY'

# query
query = 'migrant'

# base URL
url = f'https://api.nytimes.com/svc/search/v2/articlesearch.json?q={query}&api-key={key}'
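An f-string works for a one-word query like `migrant`, but a query containing spaces or punctuation needs to be URL-encoded first. Here is a minimal sketch using only the standard library. The function name `build_search_url` and the environment variable `NYT_API_KEY` are hypothetical names chosen for this example; they are not part of the NYT API.

```python
import os
from urllib.parse import urlencode

BASE_URL = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'

def build_search_url(query, api_key, page=None):
    """Return an Article Search URL with every parameter URL-encoded."""
    params = {'q': query}
    if page is not None:
        params['page'] = page
    params['api-key'] = api_key
    return f'{BASE_URL}?{urlencode(params)}'

# Assumption: the key is stored in an environment variable named NYT_API_KEY,
# so it never has to be typed into the notebook itself.
example_key = os.environ.get('NYT_API_KEY', 'YOUR_API_KEY')
example_url = build_search_url('migrant crisis', example_key)
print(example_url)
```

Reading the key from the environment also keeps it out of any notebook you share.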

In [3]:
response = requests.get(url)

In [4]:
response

<Response [200]>

In [5]:
type(response)

requests.models.Response

In [6]:
dir(response)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__firstlineno__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__static_attributes__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

## parsing our data

We want to turn our response object into a `dict` data structure, which we can more easily explore using Python. 

In [7]:
parsed = response.json()

In [8]:
parsed.keys()

dict_keys(['status', 'copyright', 'response'])

Use methods like `keys()` to see what is contained within the `parsed` object.

In [9]:
parsed['response'].keys()

dict_keys(['docs', 'meta'])

Use brackets to go deeper into the `dict` structure. First, check out the `meta` key, then the `docs` key.

In [10]:
type(parsed['response']['meta'])

dict

In [11]:
parsed['response']['meta']

{'hits': 31684, 'offset': 0, 'time': 51}

Inside `docs`, we have a list. 

In [12]:
type(parsed['response']['docs'])

list

The data in `docs` is absolutely massive, so we will just print out the first item in that object. 

In [13]:
parsed['response']['docs'][0]

{'abstract': 'The migrant crisis has emerged as a dominant theme in House races across New York State and even in the presidential race.',
 'web_url': 'https://www.nytimes.com/2024/10/29/nyregion/migrants-new-york-elections.html',
 'snippet': 'The migrant crisis has emerged as a dominant theme in House races across New York State and even in the presidential race.',
 'lead_paragraph': 'Good morning. It’s Tuesday. Today we’ll look at how the migrant crisis in New York City became a campaign issue that candidates from both sides of the aisle in the state are using to attract voters.',
 'source': 'The New York Times',
 'multimedia': [{'rank': 0,
   'subtype': 'xlarge',
   'caption': None,
   'credit': None,
   'type': 'image',
   'url': 'images/2024/10/29/multimedia/29nytoday1-gfwc/29nytoday1-gfwc-articleLarge.jpg',
   'height': 400,
   'width': 600,
   'legacy': {'xlarge': 'images/2024/10/29/multimedia/29nytoday1-gfwc/29nytoday1-gfwc-articleLarge.jpg',
    'xlargewidth': 600,
    'xlarge

The very first item in `docs` appears to be information about a single article.

In [14]:
parsed['response']['docs'][0].keys()

dict_keys(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline', 'type_of_material', '_id', 'word_count', 'uri'])

Let's save the information in `docs` to a list. That way, we will have an easier time going through the data, and won't have to re-type out all of the keys that our data is nested within.

In [15]:
articles = parsed['response']['docs']

We have a list of articles. Let's look at the first one.

In [16]:
type(articles)

list

In [17]:
articles[0].keys()

dict_keys(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'byline', 'type_of_material', '_id', 'word_count', 'uri'])

## paginating through results

We have 10 articles because the NYTimes API returns only 10 results per request. We can paginate through the results by adding a `&page=` parameter to the URL. We combine this with the `sleep()` function, which inserts a pause between requests so that we do not overload the NYTimes servers.

In [18]:
len(articles)

10
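To get a feel for how many pages a search spans, we can divide the number of hits by the page size. This is a small arithmetic sketch using the `hits` value from the `meta` output above. It only counts the pages implied by `hits` and assumes nothing about how far the API actually lets you paginate.

```python
import math

hits = 31684      # parsed['response']['meta']['hits'] from the run above
per_page = 10     # number of articles returned by one request

# round up, because a partly filled last page is still a page
n_pages = math.ceil(hits / per_page)
print(n_pages)  # 3169
```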

We will write a loop that does the following:
- create a url with variables for pagination, query, and our key
- make an API call, save our response to `response`
- parse our data from the `response`
- access the response data we want in `docs`
- save it to a list

In [19]:
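results = []
query = 'migrant'
for i in range(0, 5):
    url = f'https://api.nytimes.com/svc/search/v2/articlesearch.json?q={query}&page={i}&api-key={key}'
    response = requests.get(url)
    parsed = response.json()
    # an error reply (for example, when we hit the rate limit) has no
    # 'response' key, so check first instead of crashing with a KeyError
    if 'response' not in parsed:
        print(f'page {i} failed with status code {response.status_code}')
        break
    articles = parsed['response']['docs']
    results.append(articles)
    sleep(6) # pause at least 6 seconds between requests so we don't overload the servers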

In the end, we have a list of results, 5 lists to be exact (each list represents one page of articles). Each result (or page) contains information about 10 articles that matched our search.
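If a request fails partway through, the parsed JSON may not contain a `response` key at all. The sketch below wraps the pagination loop in a function that stops cleanly in that case. So that it runs without a network connection or an API key, it takes a `fetch_page` function as an argument. `collect_pages`, `fake_fetch`, and the shape of the fake error payload are all hypothetical stand-ins for illustration.

```python
from time import sleep

def collect_pages(fetch_page, n_pages, pause=6):
    """Call fetch_page(page) for each page and collect the lists of docs."""
    results = []
    for page in range(n_pages):
        parsed = fetch_page(page)
        if 'response' not in parsed:
            # stop early rather than raising a KeyError
            print(f'stopping at page {page}: {parsed}')
            break
        results.append(parsed['response']['docs'])
        if page < n_pages - 1:
            sleep(pause)
    return results

def fake_fetch(page):
    """Stand-in for a real request: two good pages, then an error payload."""
    if page < 2:
        docs = [{'abstract': f'article {page}-{n}'} for n in range(10)]
        return {'status': 'OK', 'response': {'docs': docs}}
    return {'fault': 'made-up error payload for this example'}

pages = collect_pages(fake_fetch, n_pages=5, pause=0)
print(len(pages))     # 2
print(len(pages[0]))  # 10
```

With a real request, `fetch_page` would build the URL for the given page, call `requests.get()`, and return `response.json()`.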

In [25]:
type(results)

list

In [26]:
len(results)

5

In [27]:
type(results[0])

list

In [28]:
len(results[0])

10
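A list of pages is handy for checking what each request returned, but a single flat list of articles is often easier to work with. Here is a small sketch of flattening with a list comprehension. The `sample_pages` data is a made-up stand-in with the same list-of-lists-of-dicts shape as `results`.

```python
# made-up stand-in for `results`: a list of pages, each a list of article dicts
sample_pages = [
    [{'abstract': 'first article'}, {'abstract': 'second article'}],
    [{'abstract': 'third article'}],
]

# read left to right: for each page, for each article in that page
all_articles = [article for page in sample_pages for article in page]

print(len(all_articles))            # 3
print(all_articles[2]['abstract'])  # third article
```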

## pulling articles out of our results
Let's try to look at just the first article on the first page. To do that, we need to know what kind of data we are dealing with. Use `type()` and list and dict indexing methods to figure out how to get just the abstract from the first article.

In [24]:
type(results)

list

In [25]:
type(results[0])

list

In [26]:
# we have a dict within a list within a list
type(results[0][0])

dict

In [27]:
results[0][0].keys()

dict_keys(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'print_section', 'print_page', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'subsection_name', 'byline', 'type_of_material', '_id', 'word_count', 'uri'])

In [28]:
results[0][0]['abstract']

'The decision comes as political pressure mounts to cut down on programs that allow migrants to stay in the United States temporarily, even without a visa or green card.'

We can make this easier by saving the first article to its own variable.

In [29]:
article = results[0][0]

### individual challenge: exploring keys
Explore some of the other keys in this `article` object. Some of them contain more nested dicts. See if you can access the data inside of them.

In [30]:
article.keys()

dict_keys(['abstract', 'web_url', 'snippet', 'lead_paragraph', 'print_section', 'print_page', 'source', 'multimedia', 'headline', 'keywords', 'pub_date', 'document_type', 'news_desk', 'section_name', 'subsection_name', 'byline', 'type_of_material', '_id', 'word_count', 'uri'])

In [31]:
article['byline']

{'original': 'By Hamed Aleaziz',
 'person': [{'firstname': 'Hamed',
   'middlename': None,
   'lastname': 'Aleaziz',
   'qualifier': None,
   'title': None,
   'role': 'reported',
   'organization': '',
   'rank': 1}],
 'organization': None}

In [32]:
type(article['byline'])

dict

In [33]:
article['byline'].keys()

dict_keys(['original', 'person', 'organization'])

In [36]:
article['byline']['person'][0]['firstname']

'Hamed'

In [37]:
article['byline']['person'][0]['lastname']

'Aleaziz'
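Chained indexing like `['byline']['person'][0]` raises an error as soon as one level is missing or empty. A slightly more defensive sketch is shown below. The helper name `first_author` is hypothetical, and the sample dict copies the `byline` structure printed above.

```python
def first_author(article):
    """Return the first listed author's full name, or None if there is none."""
    byline = article.get('byline') or {}
    people = byline.get('person') or []
    if not people:
        return None
    person = people[0]
    parts = [person.get('firstname'), person.get('middlename'), person.get('lastname')]
    # skip name parts that are None or empty
    return ' '.join(part for part in parts if part)

sample_article = {
    'byline': {
        'original': 'By Hamed Aleaziz',
        'person': [{'firstname': 'Hamed', 'middlename': None, 'lastname': 'Aleaziz'}],
        'organization': None,
    }
}

print(first_author(sample_article))               # Hamed Aleaziz
print(first_author({'byline': {'person': []}}))   # None
```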

## saving our data to lists
Now let's say we want just a list of the abstracts. We could write a loop (actually, a loop within a loop!) to pull out the abstract information for each article on each page.

In [38]:
# we need to do nested loops

for result in results: # loops through list of pages
    for article in result: # loops through list of articles 
        print(article['abstract']) # grabs abstract of article

The decision comes as political pressure mounts to cut down on programs that allow migrants to stay in the United States temporarily, even without a visa or green card.
The country’s defense ministry said the military officers who opened fire might have mistaken the migrants for cartel members.
Italy, an aging country, badly needs migrant labor and immigration, but the government has admitted that the pathways for legal entry are rife with abuse.
At least nine people are known to have died in a sinking off the Canary Islands and 48 more are missing, the latest disaster on the perilous Atlantic crossing from West Africa.
It’s been a year since Mayor Eric Adams made his ominous prediction. City officials should treat the arrival of migrants as an opportunity rather than a catastrophe, advocates say.
Reports and interviews shed new light on the holding center, where migrants’ calls with lawyers are monitored and some say they’ve been forced to wear blackout goggles.
A group of New York’s 

### group challenge: lists more lists!
Go through the article data, and grab the following information:
- abstract
- publication date 
- one other detail (check the list of `keys()`) of your choosing

Save all of that information to lists. 

In [105]:
abstracts = []
dates = []
keywords = []
for result in results:
    for item in result:
        abstracts.append(item['abstract'])
        dates.append(item['pub_date'])
        keywords.append(item['keywords'])

In [106]:
len(abstracts)

50

In [103]:
abstracts[:10]

['The decision comes as political pressure mounts to cut down on programs that allow migrants to stay in the United States temporarily, even without a visa or green card.',
 'The country’s defense ministry said the military officers who opened fire might have mistaken the migrants for cartel members.',
 'Italy, an aging country, badly needs migrant labor and immigration, but the government has admitted that the pathways for legal entry are rife with abuse.',
 'At least nine people are known to have died in a sinking off the Canary Islands and 48 more are missing, the latest disaster on the perilous Atlantic crossing from West Africa.',
 'It’s been a year since Mayor Eric Adams made his ominous prediction. City officials should treat the arrival of migrants as an opportunity rather than a catastrophe, advocates say.',
 'Reports and interviews shed new light on the holding center, where migrants’ calls with lawyers are monitored and some say they’ve been forced to wear blackout goggles.'

### Bonus: going deep into keywords

In [146]:
for result in results:
    for item in result:
        print(item['keywords'][0]['value'])

Biden, Joseph R Jr
Chiapas (Mexico)
Italy
Drownings
Illegal Immigration
Illegal Immigration
Illegal Immigration
Illegal Immigration
Democracy (Theory and Philosophy)
Presidential Election of 2024
Illegal Immigration
Presidential Election of 2024
Evacuations and Evacuees
Trump, Donald J
Harris, Kamala D
Presidential Election of 2024
Right-Wing Extremism and Alt-Right
Presidential Election of 2024
Presidential Election of 2024
Polls and Public Opinion
Politics and Government
Presidential Election of 2024
Springfield (Ohio)
Illegal Immigration
Harris, Kamala D
Harris, Kamala D
Presidential Election of 2024
Faye, Bassirou Diomaye
Harris, Kamala D
Meloni, Giorgia (1977- )
Presidential Election of 2024
Tren de Aragua (Gang)
New York City
Presidential Election of 2024
internal-storyline-no
Homeless Persons
Drownings
Haitian-Americans
Maritime Accidents and Safety
United States Politics and Government
Presidential Election of 2024
Mayorkas, Alejandro
Presidential Election of 2024
audio-neutral

In [147]:
abstracts = []
dates = []
keywords = []
for result in results:
    for item in result:
        abstracts.append(item['abstract'])
        dates.append(item['pub_date'])
        keywords.append(item['keywords'][0]['value'])
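One thing to watch for: `item['keywords'][0]` raises an `IndexError` if an article has an empty `keywords` list. The sketch below guards against that. `first_keyword` is a hypothetical helper name, and the sample articles are made up to show both cases.

```python
def first_keyword(article, default=None):
    """Return the value of the first keyword, or `default` if there are none."""
    keywords = article.get('keywords') or []
    return keywords[0]['value'] if keywords else default

# made-up sample data: one article with a keyword, one without
sample_articles = [
    {'keywords': [{'value': 'Drownings'}]},
    {'keywords': []},
]

print([first_keyword(a) for a in sample_articles])  # ['Drownings', None]
```

Returning a default keeps the `keywords` list the same length as `abstracts` and `dates`, which matters when the lists are later combined into columns.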

### individual challenge: to dataframe!
Make the lists into a dataframe.

In [148]:
import pandas as pd
df = pd.DataFrame({
    'date': dates,
    'abstract': abstracts,
    'keywords': keywords
})

In [149]:
df.head()

| | date | abstract | keywords |
|---|---|---|---|
| 0 | 2024-10-04T18:58:04+0000 | The decision comes as political pressure mount... | Biden, Joseph R Jr |
| 1 | 2024-10-03T02:49:49+0000 | The country’s defense ministry said the milita... | Chiapas (Mexico) |
| 2 | 2024-09-28T04:01:18+0000 | Italy, an aging country, badly needs migrant l... | Italy |
| 3 | 2024-09-30T18:52:57+0000 | At least nine people are known to have died in... | Drownings |
| 4 | 2024-09-20T07:00:08+0000 | It’s been a year since Mayor Eric Adams made h... | Illegal Immigration |
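The values in the `date` column are still plain strings. If we want to sort or filter by date, they need to be parsed first. Here is a minimal standard-library sketch for one value in this format; the variable names are just for illustration.

```python
from datetime import datetime

raw_date = '2024-10-04T18:58:04+0000'

# %z reads the +0000 offset, so the result is timezone-aware
parsed_date = datetime.strptime(raw_date, '%Y-%m-%dT%H:%M:%S%z')

print(parsed_date.year, parsed_date.month, parsed_date.day)  # 2024 10 4
print(parsed_date.date().isoformat())                        # 2024-10-04
```

For a whole dataframe column, `pd.to_datetime(df['date'])` is the usual pandas route.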


In [150]:
df.to_csv('nyt_data.csv')

## text analysis with spaCy

In [110]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [111]:
docs = list(nlp.pipe(abstracts))

In [112]:
for doc in docs:
    print(doc)

The decision comes as political pressure mounts to cut down on programs that allow migrants to stay in the United States temporarily, even without a visa or green card.
The country’s defense ministry said the military officers who opened fire might have mistaken the migrants for cartel members.
Italy, an aging country, badly needs migrant labor and immigration, but the government has admitted that the pathways for legal entry are rife with abuse.
At least nine people are known to have died in a sinking off the Canary Islands and 48 more are missing, the latest disaster on the perilous Atlantic crossing from West Africa.
It’s been a year since Mayor Eric Adams made his ominous prediction. City officials should treat the arrival of migrants as an opportunity rather than a catastrophe, advocates say.
Reports and interviews shed new light on the holding center, where migrants’ calls with lawyers are monitored and some say they’ve been forced to wear blackout goggles.
A group of New York’s 

## tokens

In [113]:
for token in docs[0]:
    print(token) 

The
decision
comes
as
political
pressure
mounts
to
cut
down
on
programs
that
allow
migrants
to
stay
in
the
United
States
temporarily
,
even
without
a
visa
or
green
card
.


In [114]:
type(token)

spacy.tokens.token.Token

In [115]:
# see all the attributes and methods of Token objects!
dir(token)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang

In [116]:
for token in docs[0]:
    print(token, token.pos_, token.dep_)    

The DET det
decision NOUN nsubj
comes VERB ROOT
as ADP prep
political ADJ amod
pressure NOUN compound
mounts NOUN pobj
to PART aux
cut VERB relcl
down ADP prt
on ADP prep
programs NOUN pobj
that PRON nsubj
allow VERB relcl
migrants NOUN nsubj
to PART aux
stay VERB ccomp
in ADP prep
the DET det
United PROPN compound
States PROPN pobj
temporarily ADV advmod
, PUNCT punct
even ADV advmod
without ADP prep
a DET det
visa NOUN pobj
or CCONJ cc
green ADJ amod
card NOUN conj
. PUNCT punct


## NER

In [117]:
# see docs (hah!) on Doc object on this page: https://spacy.io/api/doc

type(doc)

spacy.tokens.doc.Doc

In [118]:
dir(doc)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_bulk_merge',
 '_context',
 '_get_array_attrs',
 '_realloc',
 '_vector',
 '_vector_norm',
 'cats',
 'char_span',
 'copy',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_dict',
 'from_disk',
 'from_docs',
 'from_json',
 'get_extension',
 'get_lca_matrix',
 'has_annotation',
 'has_extension',
 'has_unknown_spaces',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'noun_chunks',
 'noun_chunks_iterator',
 'remove_extension',
 'retokenize',
 'sentiment'

In [119]:
for doc in docs:
    for ent in doc.ents:
        print(ent.text, ent.label_)

the United States GPE
Italy GPE
At least nine CARDINAL
the Canary Islands LOC
48 CARDINAL
Atlantic LOC
West Africa GPE
Eric Adams PERSON
New York GPE
millions CARDINAL
Donald J. Trump PERSON
Michael P. Nash PERSON
A.I. ORG
millions CARDINAL
hundreds of thousands CARDINAL
the United States GPE
Donald J. Trump PERSON
millions CARDINAL
two CARDINAL
Democratic NORP
Kamala Harris’s PERSON
Tim Walz PERSON
Minnesota GPE
Harris PERSON
JD Vance PERSON
Ohio GPE
Donald J. Trump’s PERSON
Germany GPE
Europe LOC
Eric Adams PERSON
Kamala Harris PERSON
Friday DATE
Donald J. Trump PERSON
Donald J. Trump PERSON
Kamala Harris PERSON
The Freedom Party ORG
first ORDINAL
Donald Trump PERSON
Kamala Harris PERSON
Michigan GPE
2016 DATE
Haitian NORP
Trump ORG
Monday DATE
Biden PERSON
Friday DATE
Arizona GPE
Republican NORP
Haitian NORP
first ORDINAL
Western NORP
Senegal GPE
Bassirou Diomaye Faye PERSON
the United Nations ORG
Arizona GPE
Kamala Harris PERSON
first ORDINAL
Democratic NORP
Musk PERSON
Giorgia Mel
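A long printout of entities is hard to read at a glance, so it can help to count how often each label appears. The sketch below uses a hand-copied sample of the first few `(text, label)` pairs from the output above, so it runs without spaCy.

```python
from collections import Counter

# first eight (ent.text, ent.label_) pairs copied from the output above
sample_ents = [
    ('the United States', 'GPE'),
    ('Italy', 'GPE'),
    ('At least nine', 'CARDINAL'),
    ('the Canary Islands', 'LOC'),
    ('48', 'CARDINAL'),
    ('Atlantic', 'LOC'),
    ('West Africa', 'GPE'),
    ('Eric Adams', 'PERSON'),
]

label_counts = Counter(label for text, label in sample_ents)
print(label_counts['GPE'], label_counts['CARDINAL'], label_counts['LOC'], label_counts['PERSON'])  # 3 2 2 1
```

On the real data, the same idea would be `Counter(ent.label_ for doc in docs for ent in doc.ents)`.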

## word frequencies

In [120]:
from collections import Counter

words = []
for doc in docs:
    for token in doc:
        if not token.is_stop:
            if not token.is_punct:
                words.append(token.text)

word_freq = Counter(words)
common_words = word_freq.most_common(20)
print(common_words)

[('Trump', 12), ('President', 11), ('migrants', 9), ('Donald', 9), ('J.', 8), ('Harris', 8), ('border', 8), ('immigration', 7), ('country', 6), ('said', 6), ('president', 6), ('immigrants', 6), ('people', 5), ('false', 5), ('claims', 5), ('Kamala', 5), ('United', 4), ('undocumented', 4), ('asylum', 4), ('pets', 4)]
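Notice that `President` and `president` are counted as two different words. One way to merge them is to lowercase each word before counting. The sketch below shows this on a few counts copied from the output above; `sample_freq` and `merged` are names used only for this example.

```python
from collections import Counter

# a few counts copied from the output above
sample_freq = Counter({'Trump': 12, 'President': 11, 'migrants': 9, 'president': 6})

# add up counts for words that differ only in capitalization
merged = Counter()
for word, count in sample_freq.items():
    merged[word.lower()] += count

print(merged.most_common(2))  # [('president', 17), ('trump', 12)]
```

In the original loop, the equivalent change would be appending `token.text.lower()` instead of `token.text`.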


## dependency 

In [121]:
from spacy import displacy
displacy.render(docs[0], style="dep", jupyter=True)

## BONUS: searching text by grammatical dependency
See this tutorial for more info: https://applied-language-technology.mooc.fi/html/notebooks/part_iii/03_pattern_matching.html

In [122]:
# first, join the docs into one string to process with matcher
# then run nlp() again

abstracts_string = ' '.join(abstracts)
doc = nlp(abstracts_string)

In [123]:
doc[:50]

The decision comes as political pressure mounts to cut down on programs that allow migrants to stay in the United States temporarily, even without a visa or green card. The country’s defense ministry said the military officers who opened fire might have mistaken the migrants for cartel

In [124]:
from spacy.matcher import Matcher

# Create a Matcher and provide model vocabulary; assign result under the variable 'matcher'
matcher = Matcher(nlp.vocab)

# Call the variable to examine the object
matcher

<spacy.matcher.matcher.Matcher at 0x320449cf0>

In [125]:
# Define a list with nested dictionaries that contains the pattern to be matched
pronoun_verb = [{'POS': 'PRON'}, {'POS': 'VERB'}]

In [126]:
# Add the pattern to the matcher under the name 'pronoun+verb'
matcher.add("pronoun+verb", patterns=[pronoun_verb])

In [127]:
# Apply the Matcher to the Doc object under 'doc'; by default this returns
# a list of (match_id, start, end) tuples

matches = matcher(doc)

In [128]:
matches

[(12298179334642351811, 12, 14),
 (12298179334642351811, 40, 42),
 (12298179334642351811, 168, 170),
 (12298179334642351811, 265, 267),
 (12298179334642351811, 318, 320),
 (12298179334642351811, 482, 484),
 (12298179334642351811, 518, 520),
 (12298179334642351811, 586, 588),
 (12298179334642351811, 588, 590),
 (12298179334642351811, 598, 600),
 (12298179334642351811, 638, 640),
 (12298179334642351811, 822, 824),
 (12298179334642351811, 916, 918),
 (12298179334642351811, 957, 959),
 (12298179334642351811, 995, 997),
 (12298179334642351811, 1072, 1074),
 (12298179334642351811, 1092, 1094),
 (12298179334642351811, 1129, 1131),
 (12298179334642351811, 1202, 1204),
 (12298179334642351811, 1300, 1302)]

In [129]:
for match_id, start, end in matches:
    string_id = doc.vocab.strings[match_id]  # Look up string ID
    span = doc[start:end]
    print(string_id, span.text)

pronoun+verb that allow
pronoun+verb who opened
pronoun+verb some say
pronoun+verb that allow
pronoun+verb who echoed
pronoun+verb she visited
pronoun+verb he conceded
pronoun+verb that helped
pronoun+verb him win
pronoun+verb there was
pronoun+verb which appears
pronoun+verb She used
pronoun+verb them did
pronoun+verb We cover
pronoun+verb they thought
pronoun+verb that sank
pronoun+verb that come
pronoun+verb him prevail
pronoun+verb he declared
pronoun+verb he described
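The `start` and `end` numbers in each match are token positions, not character positions, and `end` is exclusive, just like a Python slice. The sketch below shows this with a plain list holding the first tokens of the first abstract (copied from the token printout earlier), so it runs without spaCy.

```python
# first 17 tokens of the first abstract, as printed in the tokens section
tokens = ['The', 'decision', 'comes', 'as', 'political', 'pressure', 'mounts',
          'to', 'cut', 'down', 'on', 'programs', 'that', 'allow', 'migrants',
          'to', 'stay']

# the first match from the output above: (match_id, start, end)
match_id, start, end = (12298179334642351811, 12, 14)

# slicing from start up to (but not including) end recovers the matched words
print(' '.join(tokens[start:end]))  # that allow
```

If you would rather get spaCy `Span` objects back directly, `matcher(doc, as_spans=True)` is the option to look at.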
