# Section 10. Python Web APIs: Accessing NYT Data

#### Instructor: Pierre Biscaye 

The objective of this notebook is to introduce you to some basic steps for extracting data from the web with APIs using Python, with the New York Times API as a case study. The content of this notebook is taken from UC Berkeley D-Lab's Python Web APIs [course](https://github.com/dlab-berkeley/Python-Web-APIs).

### Learning Objectives
1. The New York Times API
2. Top stories API
3. Most Viewed and Most Shared APIs
4. Article Search API
5. Examples of Data Analysis

In [None]:
# Import required libraries
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from datetime import datetime

# 1. The New York Times API

We are going to use the NYT API to demonstrate how Web APIs can be used to access useful information in an easy way. Before proceeding with this lesson, you should have already set up an API key following the instructions in the Web APIs overview slides. Copy that API key now.

## Handling API Keys

API keys are sensitive data! You **do not** want to accidentally share them publicly.

The following cell will:

1. first try to obtain previously saved credentials by loading with `configparser`;
2. if not found, use `getpass` to request the credentials from the user (which works in notebooks as an input prompt);
3. then save those user-inputted credentials using configparser to `~/.notebook-api-keys` which is outside of the directory for this notebook so it doesn't accidentally get uploaded publicly.

Run the following cell and add the API Key you just created when prompted.

In [None]:
import configparser
import os
from getpass import getpass

def get_api_key(api_name):
    config_file_path = os.path.expanduser("~/.notebook-api-keys")
    config = configparser.ConfigParser(interpolation=None)  # Disable interpolation to avoid issues with special characters
    
    # Try reading the existing config file
    if os.path.exists(config_file_path):
        config.read(config_file_path)
    
    # Check if API key is present
    if config.has_option("API_KEYS", api_name):
        # Ask if the user wants to update the key
        update_key = input(f"An API key for {api_name} already exists. Do you want to update it? (y/n): ").lower()
        if update_key == 'n':
            return config.get("API_KEYS", api_name)
    
    # If no key exists or user opts to update, prompt for the new key
    api_key = getpass(f"Enter your {api_name} API key: ")

    # Save the API key in the config file
    if not config.has_section("API_KEYS"):
        config.add_section("API_KEYS")
    config.set("API_KEYS", api_name, api_key)
    
    with open(config_file_path, "w") as f:
        config.write(f)
    
    return api_key

# Example usage to retrieve the NYT API key
api_key = get_api_key("NYT")

print("NYT API key retrieved successfully.")
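To make the save-and-reload steps concrete, here is a minimal sketch of the round trip that `get_api_key` relies on. It writes to a temporary file rather than `~/.notebook-api-keys`, and the value `"not-a-real-key"` is a placeholder, so it is safe to run.

```python
import configparser
import os
import tempfile

# Placeholder key for illustration only; never hard-code a real key
fake_key = "not-a-real-key"

# Use a throwaway directory so the real ~/.notebook-api-keys is untouched
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo-api-keys")

    # Save: same section/option layout as get_api_key
    writer = configparser.ConfigParser(interpolation=None)
    writer.add_section("API_KEYS")
    writer.set("API_KEYS", "NYT", fake_key)
    with open(demo_path, "w") as f:
        writer.write(f)

    # Reload: a fresh parser reads the key back from disk
    reader = configparser.ConfigParser(interpolation=None)
    reader.read(demo_path)
    loaded_key = reader.get("API_KEYS", "NYT")

    # Keep a copy of the raw file contents to inspect the format
    with open(demo_path) as f:
        file_text = f.read()

print(file_text)
print(loaded_key == fake_key)
```

Note that `configparser` lower-cases option names by default, so the file stores the key under `nyt` even though the code asks for `"NYT"`; lookups are matched the same way, so both spellings work.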


**Tip**: Another way to keep your credentials secure and provide convenient access is through the [JupyterLab Credential Store](https://towardsdatascience.com/the-jupyterlab-credential-store-9cc3a0b9356). If you are using JupyterLab, this is a great general solution for handling API keys!

## Using `pynytimes`

To access the New York Times' databases, we'll be using a third-party library called [pynytimes](https://github.com/michadenheijer/pynytimes). This package provides an easy-to-use tool for accessing the wealth of data hosted by the Times.

To install the library, follow the instructions in their [GitHub repo](https://github.com/michadenheijer/pynytimes).

There are multiple options to install `pynytimes`, but the easiest is by just installing it using `pip` in the Jupyter notebook itself, using a magic command:

In [None]:
%pip install pynytimes

You can also install it via the command line or Anaconda Navigator - whichever you're more comfortable with.

Once the package is installed, let's go ahead and import the library and initialize a connection to the NYT servers using our API key.

In [None]:
# Import the NYTAPI object which we'll use to access the API
from pynytimes import NYTAPI

In [None]:
# Initialize the NYT API class into an object using your API key
nyt = NYTAPI(api_key, parse_dates=True)

We are now ready to make some API calls!

## Making API Calls

Now that we've established a connection to New York Times' rich database, let's go over what kind of data and privileges we have access to.
 
### APIs

[Here is the collection of the APIs the NYT gives us:](https://developer.nytimes.com/apis)

- [Top stories](https://developer.nytimes.com/docs/top-stories-product/1/overview): Returns an array of articles currently on the specified section.
- [Most viewed/shared articles](https://developer.nytimes.com/docs/most-popular-product/1/overview): Provides services for getting the most popular articles on NYTimes.com based on emails, shares, or views.
- [Article search](https://developer.nytimes.com/docs/articlesearch-product/1/overview): Look up articles by keyword. You can refine your search using filters and facets.
- [Books](https://developer.nytimes.com/docs/books-product/1/overview): Provides information about book reviews and The New York Times Best Sellers lists.
- [Movie reviews](https://developer.nytimes.com/docs/movie-reviews-api/1/overview): Search movie reviews by keyword and opening date and filter by Critics' Picks.
- [Times Wire](https://developer.nytimes.com/docs/timeswire-product/1/overview): Get links and metadata for Times' articles as soon as they are published on NYTimes.com. The Times Newswire API provides an up-to-the-minute stream of published articles.
- [Tag query (TimesTags)](https://developer.nytimes.com/docs/timestags-product/1/overview): Provide a string of characters and the service returns a ranked list of suggested terms.
- [Archive metadata](https://developer.nytimes.com/docs/archive-product/1/overview): Returns an array of NYT articles for a given month, going back to 1851.

We will look at a few of these today.

# 2. Top Stories API

Let's look at the top stories of the day. All we have to do is call a single method on the `nyt` object:

In [None]:
# Get all the top stories from the home page
top_stories = nyt.top_stories()

print(f"top_stories is a list of length {len(top_stories)}")

The `top_stories` method has a single parameter called `section` that defaults to "home".

If we are interested in a specific section, we can pass in one of the following tags into the `section` parameter:
```arts```, ```automobiles```, ```books```, ```business```, ```fashion```, ```food```, ```health```, ```home```, ```insider```, ```magazine```, ```movies```, ```national```, ```nyregion```, ```obituaries```, ```opinion```, ```politics```, ```realestate```, ```science```, ```sports```, ```sundayreview```, ```technology```, ```theater```, ```tmagazine```, ```travel```, ```upshot```, and ```world```.
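Under the hood, a wrapper like `pynytimes` turns a method call like this into an HTTP request. The sketch below only builds the request URL and does not send anything. It assumes the Top Stories endpoint pattern `https://api.nytimes.com/svc/topstories/v2/{section}.json` from the NYT developer documentation; `build_top_stories_url` and `VALID_SECTIONS` are illustrative names, not part of `pynytimes`.

```python
from urllib.parse import urlencode

# Section tags listed above (illustrative constant, not from pynytimes)
VALID_SECTIONS = {
    "arts", "automobiles", "books", "business", "fashion", "food", "health",
    "home", "insider", "magazine", "movies", "national", "nyregion",
    "obituaries", "opinion", "politics", "realestate", "science", "sports",
    "sundayreview", "technology", "theater", "tmagazine", "travel", "upshot",
    "world",
}

def build_top_stories_url(section="home", api_key="YOUR_API_KEY"):
    """Build (but do not send) a Top Stories request URL for one section."""
    if section not in VALID_SECTIONS:
        raise ValueError(f"Unknown section: {section!r}")
    base = f"https://api.nytimes.com/svc/topstories/v2/{section}.json"
    # The API key travels as a query-string parameter
    return f"{base}?{urlencode({'api-key': api_key})}"

science_url = build_top_stories_url("science")
print(science_url)
```

Checking the section name before building the URL catches typos locally instead of spending one of your limited API requests on an error response.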


In [None]:
# Preview the results
top_stories[:2]

This is pretty typical output for data pulled from an API. We are looking at a list of nested JSON dictionaries.

When working with a new API, a good way to establish an understanding of the data is to inspect a single object in the collection. Let's grab the first story in the array and inspect its attributes and data:

In [None]:
top_story = top_stories[0]
top_story

We are provided a diverse collection of data for the article ranging from the expected (title, author (byline), date, section), to keywords, and to NLP-derived information such as named entities. Notice that the full article itself is not included - the API does not provide that to us.
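When a record has many keys, it can help to print just the key names and value types before reading the full output. Here is a small sketch of that habit. The record below is made up to mimic the shape of one Top Stories result; its values are not real API output, and `summarize_record` is a hypothetical helper.

```python
# Made-up record that mimics the shape of one Top Stories result
sample_story = {
    "section": "science",
    "title": "A Hypothetical Headline",
    "byline": "By A. Reporter",
    "published_date": "2024-08-01T09:00:00-04:00",
    "des_facet": ["Research", "Oceans"],
    "per_facet": [],
    "multimedia": [{"url": "https://example.com/image.jpg", "format": "thumbnail"}],
}

def summarize_record(record):
    """Map each key to the type name of its value."""
    return {key: type(value).__name__ for key, value in record.items()}

summary = summarize_record(sample_story)
for key, type_name in summary.items():
    print(f"{key:<16}{type_name}")
```

You could run the same helper on `top_story` to see at a glance which fields are plain strings and which are nested lists that will need extra handling.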

## Organizing the API Results into a `pandas` DataFrame

In order to conduct subsequent data analysis, we need to convert the list of JSON data to a `pandas` DataFrame. `pandas` allows us to simply pass in the JSON list and produce a clean table in one line of code. 

First, let's see what happens when we pass in `top_stories` to `pd.json_normalize`:

In [None]:
# Convert to DataFrame
df = pd.json_normalize(top_stories)
# View the first 5 rows
df.head()

In [None]:
# Inspect the metadata
df.info()

For the most part, `pandas` does a good job of producing a table where:

- The columns correspond with the JSON dictionary keys from our API call.
- The number of rows matches the number of articles.
- Each cell holds the corresponding value found under that article's dictionary key.
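The points above hold for flat keys; nested values behave a little differently. The sketch below uses two made-up records (the `headline` key is borrowed from the Article Search format for illustration) to show that `pd.json_normalize` expands nested dictionaries into dotted column names, while lists are left as lists inside a single cell.

```python
import pandas as pd

# Made-up records: one flat key, one nested dictionary, one list
sample_records = [
    {"title": "First", "headline": {"main": "First headline"}, "per_facet": ["Doe, Jane"]},
    {"title": "Second", "headline": {"main": "Second headline"}, "per_facet": []},
]

sample_df = pd.json_normalize(sample_records)

# Nested dictionary keys become dotted column names such as "headline.main"
print(sample_df.columns.tolist())

# Lists are not expanded; each cell still holds a Python list
print(type(sample_df.loc[0, "per_facet"]))
```

This is why columns such as `per_facet` need to be flattened by hand before counting their contents.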

What can we do with this? As an example, let's pull the information on individual 'entities' reported in the top stories by the API, and plot the frequency with which different individuals appear.

In [None]:
from collections import Counter

# Flatten the list of lists into a single list
all_words = [word for sublist in df['per_facet'] for word in sublist]

# Count occurrences of each unique string
word_counts = Counter(all_words)

# Get the 10 most common strings
top_10_words = word_counts.most_common(10)

# Convert to DataFrame for easy plotting
top_10_df = pd.DataFrame(top_10_words, columns=["word", "count"])

# Truncate long strings for readability on plot
top_10_df["short_word"] = top_10_df["word"].apply(lambda x: x[:10] + "..." if len(x) > 10 else x)

In [None]:
import seaborn as sns

plt.figure(figsize=(10, 5))
sns.barplot(data=top_10_df, x="short_word", y="count", hue="word", palette="viridis", legend=False)
plt.xlabel("Word")
plt.ylabel("Count")
plt.title("Top 10 Most Common People Named in Top Stories")
plt.xticks(rotation=45)
plt.show()

# 3. Most Viewed and Most Shared APIs

Retrieving the most viewed and shared articles is also quite simple. The `days` parameter returns the most popular articles based on the last $N$ days. Keep in mind, however, that `days` can only take on one of three values: 1, 7, or 30.
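Because only three values are accepted, it can be worth checking `days` yourself before making a call. This is a minimal sketch; `ALLOWED_DAYS` and `check_days` are hypothetical names, not part of `pynytimes`.

```python
# Values of `days` accepted by the Most Popular API, per the text above
ALLOWED_DAYS = (1, 7, 30)

def check_days(days):
    """Return days unchanged if it is an accepted value, else raise."""
    if days not in ALLOWED_DAYS:
        raise ValueError(f"days must be one of {ALLOWED_DAYS}, got {days!r}")
    return days

print(check_days(7))
```

A call such as `check_days(14)` raises a `ValueError` immediately, without using up an API request.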

In [None]:
# Retrieve the most viewed articles for today.
# The days parameter defaults to 1
most_viewed_today = nyt.most_viewed()
print(f"Title: {most_viewed_today[0]['title']}")
print(f"Section: {most_viewed_today[0]['section']}")
most_viewed_today[0]

How many stories are provided to us via this function call?

In [None]:
len(most_viewed_today)

For this piece of data, we can consult a guide, known as a schema, to understand the information at our fingertips.

The [Most Viewed Schema](https://developer.nytimes.com/docs/most-popular-product/1/types/ViewedArticle) can answer any questions we may have about the data provided by this API:

| Attribute      | Data Type | Definition      |
| ----------- | ----------- | ----------- |
| url      | string       | Article's URL.       |
| adx_keywords   | string        | Semicolon separated list of keywords.        |
| column   | string        | Deprecated. Set to null.        |
| section   | string        | Article's section (e.g. Sports).        |
| byline   | string        | Article's byline (e.g. By Thomas L. Friedman).        |
| type   | string        | Asset type (e.g. Article, Interactive, ...).        |
| title   | string        | Article's headline (e.g. When the Cellos Play, the Cows Come Home).        |
| abstract   | string        | Brief summary of the article.|
| published_date   | string        | When the article was published on the web (e.g. 2021-04-19).        |
| source   | string        | Publisher (e.g. New York Times).        |
| id   | integer        | Asset ID number (e.g. 100000007772696).        |
| asset_id   | integer        | Asset ID number (e.g. 100000007772696).        |
| des_facet   | array        | Array of description facets (e.g. Quarantine (Life and Culture)).        |
| org_facet   | array        | Array of organization facets (e.g. Sullivan Street Bakery).        |
| per_facet   | array        | Array of person facets (e.g. Bittman, Mark).        |
| geo_facet   | array        | Array of geographic facets (e.g. Canada).        |
| media   | array        | Array of images.        |
| media.type   | string        | Asset type (e.g. image).        |
| media.subtype   | string        | Asset subtype (e.g. photo).        |
| media.caption   | string        | Media caption        |
| media.copyright   | string        | Media credit        |
| media.approved_for_syndication   | boolean        | Whether media is approved for syndication.        |
| media.media-metadata   | array        | Media metadata (url, width, height, ...).        |
| media.media-metadata.url   | string        | Image's URL.        |
| media.media-metadata.format   | string        | Image's crop name     |
| media.media-metadata.height   | integer        | Image's height |
| media.media-metadata.width   | integer        | Image's width      |
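The schema tells us that `adx_keywords` is a single semicolon-separated string rather than a list, so it needs one extra step before analysis. Here is a minimal sketch on a made-up record; `split_keywords` is a hypothetical helper and the keyword values are only illustrative.

```python
# Made-up record with the adx_keywords format described in the schema
sample_article = {
    "title": "A Hypothetical Headline",
    "adx_keywords": "Quarantine (Life and Culture);Bread;Sullivan Street Bakery",
}

def split_keywords(raw):
    """Split a semicolon-separated keyword string into a clean list."""
    if not raw:
        return []
    # Drop surrounding whitespace and skip empty pieces
    return [part.strip() for part in raw.split(";") if part.strip()]

keywords = split_keywords(sample_article["adx_keywords"])
print(keywords)
```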

To pull the most popular articles for the past week or month, we pass 7 or 30 to `days`:

In [None]:
most_viewed_week = nyt.most_viewed(days=7)
len(most_viewed_week)

Note that the API returns only 20 articles, and this is not a parameter we can modify.

What is the most viewed article of the last week?

In [None]:
most_viewed_week[0]['title']

What individuals occurred most commonly in the most-viewed articles?

In [None]:
df2 = pd.json_normalize(most_viewed_week)

# Flatten the list of lists into a single list
all_words = [word for sublist in df2['per_facet'] for word in sublist]

# Count occurrences of each unique string
word_counts = Counter(all_words)

# Get the 10 most common strings
top_10_words = word_counts.most_common(10)

# Convert to DataFrame for easy plotting
top_10_df = pd.DataFrame(top_10_words, columns=["word", "count"])

# Truncate long strings for readability on plot
top_10_df["short_word"] = top_10_df["word"].apply(lambda x: x[:10] + "..." if len(x) > 10 else x)

# Plot it
plt.figure(figsize=(10, 5))
sns.barplot(data=top_10_df, x="short_word", y="count", hue="word", palette="viridis", legend=False)
plt.xlabel("Word")
plt.ylabel("Count")
plt.title("Top 10 Most Common People Named in Most Viewed Stories")
plt.xticks(rotation=45)
plt.show()

Now let's look at the most *shared* stories. Here we can search by sharing methods.

In [None]:
# Get most shared stories
email = nyt.most_shared(days=30, method = 'email')
facebook = nyt.most_shared(days=30, method = 'facebook')
len(facebook)

In [None]:
# Get unique identifier for each story
email_ids = [story["uri"] for story in email]
facebook_ids = [story["uri"] for story in facebook]

In [None]:
# Calculate the intersection of unique IDs
len(set(email_ids).intersection(set(facebook_ids)))

**Question**: How do we interpret the result of the last line of code?
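If the set operations are unfamiliar, the toy example below shows the mechanics on made-up identifiers. The `nyt://article/...` strings are placeholders, not real URIs.

```python
# Made-up identifiers standing in for story URIs
email_demo = ["nyt://article/aaa", "nyt://article/bbb", "nyt://article/ccc"]
facebook_demo = ["nyt://article/bbb", "nyt://article/ccc", "nyt://article/ddd"]

# Stories that appear on both lists
shared_both = set(email_demo) & set(facebook_demo)

# Stories that appear only on the email list
email_only = set(email_demo) - set(facebook_demo)

# Share of email stories that also appear on the Facebook list
overlap_share = len(shared_both) / len(set(email_demo))

print(sorted(shared_both))
print(sorted(email_only))
print(round(overlap_share, 2))
```

The `&` operator is shorthand for `.intersection()`, and `-` gives the set difference.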

# 4. Article Search API

The previous results are interesting but likely seem a bit restricted. Let's take it up a notch and use the search API to retrieve a set of articles about a particular topic in a chosen period of time.

We'll use the `article_search` function. Two relevant parameters include:

- `query`: The search query
- `results`: Number of articles returned. The default is 10.

Let's try pulling the 20 most recent articles about France:

In [None]:
articles = nyt.article_search(query="France", results=20)

Let's look at the main headlines of these articles:

In [None]:
headlines = [article['headline']['main'] for article in articles]
headlines

Some of these results don't seem relevant to the country of France. 

Let's take a peek at the first article provided to see if we can figure out why. We're going to remove the `multimedia` key to make it easier to view:

In [None]:
del articles[0]['multimedia']
articles[0]

**Question**: Any guesses about why this article showed up in the search?

Notice that not all article data comes in the same format. Data from the search API is presented differently from that of the Most Viewed and Top Stories APIs.
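One practical consequence is that the same piece of information lives under different keys. In the outputs we have seen, the headline sits at `title` for Top Stories and Most Viewed results, and at `headline` then `main` for Article Search results. The sketch below handles both shapes; the two records are made up and `get_title` is a hypothetical helper.

```python
# Made-up records mimicking the two formats seen so far
popular_record = {"title": "A Hypothetical Popular Story", "section": "World"}
search_record = {"headline": {"main": "A Hypothetical Search Result"}, "section_name": "World"}

def get_title(record):
    """Return the headline text from either record format."""
    headline = record.get("headline")
    if isinstance(headline, dict):
        # Article Search format: headline is a nested dictionary
        return headline.get("main")
    # Top Stories / Most Popular format: title is a top-level string
    return record.get("title")

print(get_title(popular_record))
print(get_title(search_record))
```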

There are schemas for the above data. 

- [Article Schema](https://developer.nytimes.com/docs/articlesearch-product/1/types/Article)
- [Byline](https://developer.nytimes.com/docs/articlesearch-product/1/types/Byline)
- [Headline](https://developer.nytimes.com/docs/articlesearch-product/1/types/Headline)
- [Keyword](https://developer.nytimes.com/docs/articlesearch-product/1/types/Keyword)
- [Multimedia](https://developer.nytimes.com/docs/articlesearch-product/1/types/Multimedia)
- [Person](https://developer.nytimes.com/docs/articlesearch-product/1/types/Person)

Let's search for some articles again, but within a specific time period. 

For example, how would we retrieve all the articles about the 2024 Paris Summer Olympics published between the opening and closing ceremonies (+ 1 day)?

We need to pass a dictionary to the `dates` argument which contains keys named "begin" and "end". Those two keys point to `datetime` objects that we'll use as time markers. We're also going to use the `options` argument to filter and sort our results. We'll restrict ourselves to 100 articles for tractability.

In [None]:
# Set up start and end date objects
begin = datetime(2024, 7, 25) # July 25, 2024
end = datetime(2024, 8, 12) # August 12, 2024

# Create a dictionary containing the datetime objects
date_dict = {"begin": begin, "end": end}

# Create options dictionary
options_dict = {
    # Sort from earliest to latest
    "sort": "oldest",
    # Return only articles from New York Times (filters out other sources such as AP and Reuters)
    "sources": ["New York Times"],
    # Return only news, analyses, and articles
    "type_of_material": ["News Analysis", "News", "Article"]
}

articles = nyt.article_search(
    query="Paris Olympics",
    results=100,
    dates=date_dict,
    options=options_dict)

In [None]:
# Grab first article and drop the multimedia key to reduce clutter
article = articles[0]
del article["multimedia"]

# Check out results
article

You can see how this could be a more powerful tool for data search.

# 5. Data Analysis

Now, we'll perform some analysis on a database of articles published about the 2024 United States presidential election.

We will work with a previously queried set of articles because making the API call in class will take too much time. The code used to query the articles we'll analyze can be found in the following cell. Feel free to adapt it for future queries. Keep in mind your API call and rate limits.
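To get a feel for why a large query is slow, here is a rough back-of-the-envelope sketch. It assumes that the Article Search API returns 10 results per request and that we pause 12 seconds between requests to stay under a limit of about 5 requests per minute. Both figures are assumptions based on NYT's developer guidance at the time of writing, so check the current documentation; `estimate_query_minutes` is a hypothetical helper.

```python
import math

def estimate_query_minutes(n_results, page_size=10, seconds_between_calls=12):
    """Estimate the request count and wall-clock minutes for a paged query.

    page_size and seconds_between_calls are assumptions, not guaranteed values.
    """
    n_requests = math.ceil(n_results / page_size)
    minutes = n_requests * seconds_between_calls / 60
    return n_requests, minutes

n_requests, minutes = estimate_query_minutes(2000)
print(f"{n_requests} requests, roughly {minutes:.0f} minutes")
```

Under these assumptions, 2,000 results take 200 requests and about 40 minutes, which also uses a large share of any daily request quota.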

## Query Using the Article Search API

In [None]:
# Change this variable if you'd like to run the query yourself; note it can take a long time to run
run_query = False

# Only run this code if you're able to wait for the query to finish
if run_query:
    # Create datetime objects
    begin = datetime(2024, 9, 7) # September 7, 2024
    end = datetime(2024, 11, 7) # November 7, 2024
    date_dict = {"begin": begin, "end": end}

    options_dict = {
        "sort": "oldest",
        "sources": ["New York Times",],
        "type_of_material": ["News Analysis", "News", "Article", "Editorial"]
    }

    # To get the dataset we use, set n_results to 2000
    n_results = 2000
    # n_results = 10

    # Perform article search query
    articles = nyt.article_search(
         query="presidential election",
         results=n_results,
         dates=date_dict,
         options=options_dict)

    # Create DataFrame 
    df = pd.json_normalize(articles)
    
    # Ensure 'lead_paragraph' column has no NaN 
    df['lead_paragraph'] = df['lead_paragraph'].fillna('')
    
    # Save DataFrame
    df.to_csv("Data/election2024_articles.csv")

Let's load in the previously saved data:

In [None]:
df = pd.read_csv("Data/election2024_articles.csv")
df.head()

In [None]:
# Inspect metadata
df.info()

## Perform Sentiment Analysis

Sentiment analysis is a common task when working with text data. Let's track the sentiment of articles about the election over the two-month time period. We'll use the `vadersentiment` package to evaluate the sentiment of each article.

According to the [VADER Github Repo](https://github.com/cjhutto/vaderSentiment), "VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is *specifically attuned to sentiments expressed in social media*."

We'll start by installing the `vadersentiment` library.

In [None]:
# Install the vadersentiment library
%pip install vadersentiment

In [None]:
# Import the SentimentIntensityAnalyzer object
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
# Initialize analyzer object
analyzer = SentimentIntensityAnalyzer()
# Calculate the polarity scores of the lead paragraph 
df["sentiment"] = df["lead_paragraph"].apply(lambda x: analyzer.polarity_scores(x) if isinstance(x, str) else np.nan)

In [None]:
# Inspect the sentiment column
df.sentiment.head()

In [None]:
# View single row
df.sentiment.iloc[0]

The `compound` score is computed by summing the valence scores of each word in the lexicon, adjusted according to the rules, and then normalized to be between -1 (most negative) and +1 (most positive). This is the most useful metric if you want a single unidimensional measure of sentiment for a given sentence. We can think of this score as a normalized, weighted composite score. It is also useful for researchers who would like to set standardized thresholds for classifying sentences as either positive, neutral, or negative. 
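To see what "normalized" means here, the sketch below reproduces the normalization step as we understand it from the vaderSentiment source code: the summed valence $x$ is mapped to $x / \sqrt{x^2 + \alpha}$ with $\alpha = 15$. Treat this as an illustration rather than the library's exact implementation; the function below is our own re-implementation.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded summed valence score into the interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# A few summed valence scores and their normalized values
for raw_score in [-10, -1, 0, 1, 10]:
    print(raw_score, round(normalize(raw_score), 3))
```

Small sums stay close to zero, while large sums approach -1 or +1 without ever reaching them, so a single very long text cannot produce an unbounded score.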

Typical threshold values are:

1. **Positive Sentiment**: compound score $\geq 0.05$
 
2. **Neutral  Sentiment**: $-0.05 <$ compound score $< 0.05$
 
3. **Negative Sentiment**: compound score $\leq -0.05$

In [None]:
# Re-assign sentiment as the compound score
df["sentiment"] = df["sentiment"].apply(lambda x: x["compound"] if isinstance(x, dict) else np.nan)

Let's get a sense of the distribution of scores by calculating some summary statistics and plotting a histogram:

In [None]:
# Summary statistics
df.sentiment.describe()

In [None]:
bins = np.linspace(-1, 1, 17)
df.sentiment.hist(bins=bins, figsize= (9, 7))
plt.xlabel("Sentiment Score")
plt.ylabel("Frequency")
plt.xlim([-1.0, 1.0])

Finally, using the VADER thresholds for positive, neutral, and negative, we can see how many articles qualify for each of those labels:

In [None]:
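# Counts of positive, negative, and neutral texts
def bin_func(x):
    # Inclusive bounds, matching the thresholds listed above
    if x >= 0.05:
        return "positive"
    elif x <= -0.05:
        return "negative"
    else:
        return "neutral"
# Calculate counts, skipping articles with no sentiment score (NaN)
df.sentiment.dropna().apply(bin_func).value_counts()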

## Sentiment Over the Course of the Campaign

Let's examine how the compound score evolved over the course of the campaign. Do you have expectations on how this quantity might behave as the election nears? 

First, let's create a new `pandas` Series which tracks the sentiment over time:

In [None]:
# Convert pub_date from strings to datetime values
df["pub_date"] = pd.to_datetime(df["pub_date"])

In [None]:
# Create a time series with publication date as the index and sentiment score as the value
sentiment_ts = pd.Series(index= df.pub_date.tolist(),
                         data = df.sentiment.tolist())

Next, we'll calculate daily and weekly averages:

In [None]:
# Resample the data with daily averages and weekly averages
daily = sentiment_ts.resample("D").mean()
weekly = sentiment_ts.resample("W").mean()
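If `resample` is new to you, here is a toy example with three made-up scores. It shows that observations on the same calendar day are averaged together, and that a day with no articles still appears in the result, with a missing value.

```python
import pandas as pd

# Three made-up sentiment scores: two on Sept 7, none on Sept 8, one on Sept 9
toy_ts = pd.Series(
    data=[0.5, -0.5, 1.0],
    index=pd.to_datetime(["2024-09-07 08:00", "2024-09-07 20:00", "2024-09-09 12:00"]),
)

# Group by calendar day and take the mean within each day
toy_daily = toy_ts.resample("D").mean()
print(toy_daily)
```

The two scores on September 7 cancel out to 0.0, and September 8 shows up as `NaN`. In a line plot, such missing days appear as gaps.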

Let's plot the results. Do you notice any patterns?

In [None]:
# Daily average sentiment of articles.
daily.plot(figsize = (11, 7))
plt.xlabel("Dates")
plt.ylabel("Sentiment Score");

In [None]:
# Weekly average sentiment of articles.
weekly.plot(figsize = (11, 7))
plt.xlabel("Dates")
plt.ylabel("Sentiment Score");

## Handling Nested Arrays of Keywords

The NY Times has done us a favor in providing named entities in the article API results, thus relieving us of having to do the tagging ourselves. However, the data structure that it comes in can be tricky to handle. Here, we provide a short tutorial showing one way to cleanly extract keyword data.

In [None]:
# Refer to a sample article's set of keywords
df.keywords.iloc[1]

We see a number of things here:
- Each article's keywords are laid out in a list of dictionaries.
- Each dictionary tells us the name, value, rank, and major of the keyword.
- 'Name' gives the category of keyword, with five possibilities: `subject`, `persons`, `glocations`, `organizations`, and `creative_works`.
- 'Value' gives the actual keyword or phrase.
- 'Rank' indicates the relative importance of the keyword. The ordering of the list corresponds to the ranking.
- 'Major' indicates whether the keyword is a primary focus or a secondary reference.
- Not all articles have the same number of keywords.
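To make this structure concrete, the sketch below groups the keyword values of one article by category. The keyword list is made up to match the layout described above, and `group_by_category` is a hypothetical helper.

```python
from collections import defaultdict

# Made-up keyword list in the layout described above
sample_keywords = [
    {"name": "subject", "value": "Presidential Election of 2024", "rank": 1, "major": "N"},
    {"name": "persons", "value": "Doe, Jane", "rank": 2, "major": "N"},
    {"name": "glocations", "value": "Pennsylvania", "rank": 3, "major": "N"},
    {"name": "subject", "value": "Polls and Public Opinion", "rank": 4, "major": "N"},
]

def group_by_category(keywords):
    """Collect keyword values under their category name, preserving rank order."""
    grouped = defaultdict(list)
    for keyword in keywords:
        grouped[keyword["name"]].append(keyword["value"])
    return dict(grouped)

grouped_keywords = group_by_category(sample_keywords)
print(grouped_keywords)
```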

Let's write a function to extract keyword data based on the ranking. This function will be applied over the pandas series of keyword data.

In [None]:
import ast

# Convert the string representation of the list into actual lists of dictionaries
df['keywords'] = df['keywords'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
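Why is this conversion needed? Saving a DataFrame to CSV stores each list as plain text, so after reloading, every cell in `keywords` is one long string. The sketch below uses a made-up string to show that `ast.literal_eval` safely turns such text back into Python objects, whereas `json.loads` rejects it because Python's text form uses single quotes.

```python
import ast
import json

# Made-up example of how a keyword list looks after a CSV round trip
stored = "[{'name': 'persons', 'value': 'Doe, Jane', 'rank': 1, 'major': 'N'}]"

# literal_eval parses Python literals only; it does not execute arbitrary code
parsed = ast.literal_eval(stored)
print(type(parsed), parsed[0]["value"])

# The same text is not valid JSON, which requires double quotes
try:
    json.loads(stored)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False
print(json_ok)
```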

In [None]:
df["keywords"].head()

In [None]:
def rank_extractor(data, rank):
    """Extracts keyword data based on the 'rank' field."""
    if isinstance(data, list):
        for keyword in data:
            if isinstance(keyword, dict) and keyword.get("rank") == rank:
                return {"name": keyword.get("name"), "value": keyword.get("value")}
    return None

In [None]:
# Extract the first, second, and third keywords
rank1 = df.keywords.apply(lambda x: rank_extractor(x, 1))
rank2 = df.keywords.apply(lambda x: rank_extractor(x, 2))
rank3 = df.keywords.apply(lambda x: rank_extractor(x, 3))

In [None]:
# View results
rank1.head()

Let's expand these dictionaries into columns, which turns each Series of dictionaries into a `pandas` DataFrame:

In [None]:
rank1 = rank1.apply(pd.Series)
rank2 = rank2.apply(pd.Series)
rank3 = rank3.apply(pd.Series)
rank1.head()

Voila! A nice clean format. Now we can conduct some light analysis:

In [None]:
# Most frequent type of keyword in ranking #1
rank1["name"].value_counts()

In [None]:
# The most common keywords in ranking #1:
rank1["value"].value_counts()

## Key Points

* APIs allow structured web interactions, often using URLs to query databases and retrieve data.
* API keys authenticate users, enabling access to APIs while monitoring and limiting the number of requests.
* The NYT API allows users to do things like retrieve top stories, find most shared stories, and search for stories.
* Text data acquired through APIs can be analyzed using natural language processing tools such as sentiment analysis.
