<a href="https://colab.research.google.com/github/ipeirotis/dealing_with_data/blob/master/02-WebAPIs/A2-NewsAPI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Interacting with APIs: Retrieving News Articles with NewsAPI

In the world of analytics, accessing and working with external data sources is a fundamental skill. Application Programming Interfaces (APIs) provide a structured way for different software systems to communicate and share information. In a previous session, we introduced the core concepts of APIs and how to make basic requests.

In this notebook, we will dive deeper into API interaction by using the **NewsAPI**. We will learn how to:

*   Make API calls with specific parameters to filter data, focusing on retrieving relevant business news.
*   Process the API's response, which is typically in JSON format.
*   Extract relevant information from the response for analytical purposes.
*   Structure the retrieved data into a pandas DataFrame for further analysis, such as analyzing headline sentiment or identifying key topics in business news.

By the end of this notebook, you will be able to retrieve news articles based on various criteria and prepare the data for analytical tasks relevant to business insights.

## Understanding NewsAPI

NewsAPI is a simple, easy-to-use API that returns JSON metadata for headlines and articles from news sources and blogs all over the web. It's a great resource for:

*   Gathering news data for analysis.
*   Building news aggregation tools.
*   Researching trends in news coverage.

To use NewsAPI, you will need an API key. If you haven't already, please sign up for a free developer account on their website to get your key.

## API Keys and Security Considerations

An API key is a secret token that authenticates your requests to the API. **It is crucial to keep your API key confidential.**

**For this educational exercise, we will include the API key directly in the code.** However, in any real-world application or production environment, you should **never** hardcode your API key directly in your script or notebook, especially if it might be shared publicly. Secure methods like environment variables or configuration files should be used instead.
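As a minimal sketch of the environment-variable approach mentioned above, the key can be read at runtime instead of being typed into the notebook. The variable name `NEWSAPI_KEY` and the helper `load_api_key` are hypothetical names chosen for illustration; they are not part of NewsAPI or of this notebook.

```python
import os

# Hypothetical variable name: any name works as long as it matches what you set
ENV_VAR_NAME = "NEWSAPI_KEY"

def load_api_key(env_var=ENV_VAR_NAME):
    """Return the API key stored in an environment variable, or raise if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this notebook."
        )
    return key
```

With the variable set in your shell or notebook environment, `news_api_key = load_api_key()` could stand in for the hardcoded assignment, and the key never appears in a shared copy of the notebook.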

In [None]:
# Adding the libraries that we will use in this notebook
import requests
import pandas as pd
import re
from bs4 import BeautifulSoup

In [None]:
# IMPORTANT: Sign up and get your own key for NewsAPI
# When the whole class uses a single key, we often run out of quota

news_api_key = 'KEY ON SLACK'

## Understanding API Rate Limits

Many APIs, including NewsAPI, have rate limits to prevent abuse and ensure fair usage. This means there's a limit to how many requests you can make within a certain timeframe (e.g., per minute or per day).

If you exceed the rate limit, your requests might be blocked, and you'll receive an error. It's important to be mindful of these limits, especially when testing or running your code multiple times. The free NewsAPI plan has specific limits on the number of requests and results per request.
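One common way to cope with rate limits is to wait and retry with increasing delays. The sketch below assumes the API signals rate limiting with HTTP status 429 (the standard "Too Many Requests" code); check the NewsAPI documentation for the exact behavior. The helper `call_with_backoff` and its parameters are hypothetical names introduced for illustration.

```python
import time

def call_with_backoff(make_request, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call make_request() until it returns a status other than 429
    or the attempts run out. Returns the last response received."""
    for attempt in range(max_attempts):
        resp = make_request()
        if resp.status_code != 429:
            return resp
        # Wait longer after each rate-limited attempt: base_delay, 2x, 4x, ...
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return resp
```

A call could look like `call_with_backoff(lambda: requests.get(endpoint, params=parameters))`, using the `endpoint` and `parameters` defined in the next section. Retrying only helps with short-term limits; if a daily quota is exhausted, waiting a few seconds will not make the request succeed.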

# Fetching Data from NewsAPI

In [None]:
endpoint = 'https://newsapi.org/v2/top-headlines'
parameters = {
    'country' : 'us',
    'category' : 'business',
    'apiKey' : news_api_key,
    'pageSize' : 100
}
resp = requests.get(endpoint, params=parameters)

### Basic Error Handling

When making API requests, it's good practice to check if the request was successful before trying to process the data. The `requests` library provides ways to check the status code of the response. A status code of 200 generally indicates a successful request.

We can also check the 'status' key in the NewsAPI JSON response itself, which should be 'ok' for a successful call.
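To make the `status` check more informative, a small helper can summarize the decoded JSON payload. This is a sketch that assumes error payloads carry `code` and `message` fields next to `status`; the `.get()` calls keep it safe if they are absent. The function name `describe_response` is hypothetical.

```python
def describe_response(payload):
    """Summarize a decoded NewsAPI-style JSON payload as a short string."""
    if payload.get("status") == "ok":
        return f"ok: {len(payload.get('articles', []))} articles returned"
    # Assumed error fields; fall back to placeholders if they are missing
    code = payload.get("code", "unknown")
    message = payload.get("message", "no message provided")
    return f"error ({code}): {message}"
```

For example, `print(describe_response(resp.json()))` gives a one-line summary for both successful and failed calls.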

In [None]:
if resp.status_code == 200:
    data = resp.json()
    if data['status'] == 'ok':
        print("API request successful!")
        # You can now proceed with processing the data
    else:
        print(f"API returned an error status: {data['status']}")
        # Print more details if available in the response
else:
    print(f"API request failed with status code: {resp.status_code}")
    print(f"Response text: {resp.text}") # Helpful for debugging

In [None]:
data = resp.json()

# Exploring the API Response

In [None]:
data.keys()

## Exploring the JSON Response Structure

When we make a successful API request to NewsAPI, the response data is returned in JSON format. Once `resp.json()` parses it, we get a regular Python dictionary. Let's examine the top-level keys we see:

*   `status`: Indicates if the request was successful (e.g., 'ok').
*   `totalResults`: The total number of results available based on our query parameters.
*   `articles`: This is the most important key for us. Its value is a **list** where each element in the list is a dictionary representing a single news article.

Understanding this structure is crucial for navigating and extracting the information we need from the API response.
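The structure described above can be explored without calling the API by using a small hand-written dictionary. All values below are made up for illustration; a real response contains many more fields per article.

```python
# A made-up payload that mimics the top-level shape of a NewsAPI response
sample = {
    "status": "ok",
    "totalResults": 38,
    "articles": [
        {"title": "First made-up headline"},
        {"title": "Second made-up headline"},
    ],
}

articles = sample["articles"]        # a list of dictionaries
first_title = articles[0]["title"]   # index into the list, then into the dict

# totalResults counts all matches; the list holds only what this response returned
print(len(articles), "of", sample["totalResults"], "articles in this response")
print(first_title)
```

Note that `totalResults` can be larger than `len(articles)`, because a single response returns at most `pageSize` articles.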

In [None]:
data['status']

In [None]:
data['totalResults']

In [None]:
# This is a list of the returned articles
data['articles']

In [None]:
# Do not freak out. It is just a big list.
# This is the first article in the list
data['articles'][0]

## Exercise: Adding a Keyword Search

The `top-headlines` endpoint of the NewsAPI allows you to search for specific keywords within the headlines and descriptions of articles. This is very useful for narrowing down your search to topics of interest within a category like 'business'.

Examine the documentation of the [`top-headlines` endpoint](https://newsapi.org/docs/endpoints/top-headlines) specifically looking for parameters related to keywords.

Your task is to:

1.  Create a Python function that takes a keyword as input.
2.  Inside the function, construct the API request parameters, including the provided keyword.
3.  Make the API call using the `requests` library.
4.  Process the JSON response.
5.  Return the list of articles (the value associated with the 'articles' key) from the response.

This function should make it easy to retrieve business headlines related to a specific term, like 'inflation' or 'technology'.

In [None]:
# Your code here

## Analyzing Individual Article Objects

In [None]:
# This is the first article in the list
data['articles'][0]

In [None]:
# Let's see the keys for an individual article
data['articles'][0].keys()

### Structure of an Individual Article

As we can see, each element in the `data['articles']` list is a Python dictionary. This dictionary contains various pieces of information about the news article, such as:

  *   `source`: Information about the news source (another nested dictionary).
  *   `author`: The author of the article.
  *   `title`: The headline of the article.
  *   `description`: A brief summary of the article.
  *   `url`: The URL to the full article.
  *   `urlToImage`: A URL to an image related to the article.
  *   `publishedAt`: The publication date and time.
  *   `content`: A truncated version of the article's content (often just the beginning).

We will now work with these individual article dictionaries to extract and process the information we need.
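Here is a small sketch of how these fields might be accessed safely. The article below is entirely made up. The sketch assumes that some fields (such as `author`) can be null, and that `publishedAt` is a UTC timestamp in the form `YYYY-MM-DDTHH:MM:SSZ`; verify both against a real response before relying on them.

```python
from datetime import datetime, timezone

# A made-up article dictionary with the keys listed above
sample_article = {
    "source": {"id": None, "name": "Example Business Daily"},
    "author": None,
    "title": "Made-up headline about inflation",
    "description": "A made-up summary.",
    "url": "https://example.com/story",
    "urlToImage": None,
    "publishedAt": "2024-01-15T13:45:00Z",
    "content": "Made-up opening lines...",
}

# 'source' is a nested dictionary, so we index twice
source_name = sample_article["source"]["name"]

# Fall back to a placeholder when the value is missing or None
author = sample_article.get("author") or "Unknown"

# Parse the timestamp string into a timezone-aware datetime (assumed format)
published = datetime.strptime(
    sample_article["publishedAt"], "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)

print(source_name, "|", author, "|", published.date())
```

Converting `publishedAt` to a `datetime` makes it possible to sort articles by time or group them by day later on.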

In [None]:
data['articles'][0]['description']

In [None]:
data['articles'][0]['content']

# Processing Article Data

### Iterating over all articles

## Using `pd.json_normalize`

The API response is in JSON format, and the articles are a list of dictionaries. `pd.json_normalize` is a convenient function from the pandas library that can take a list of dictionaries (like our `data['articles']`) and automatically create a DataFrame.

One of the key features of `pd.json_normalize` is its ability to **flatten** nested structures within the dictionaries. For instance, each article has a nested dictionary under 'source', so `pd.json_normalize` creates separate columns in the DataFrame for the keys within that dictionary ('source.id' and 'source.name').

This makes it easy to quickly get our article data into a tabular format for initial exploration.
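The flattening behavior is easy to see on a tiny, made-up list of article-like dictionaries (the values are invented for illustration):

```python
import pandas as pd

# Two made-up articles, each with a nested 'source' dictionary
sample_articles = [
    {"source": {"id": None, "name": "Example Daily"}, "title": "Made-up headline A"},
    {"source": {"id": "example-wire", "name": "Example Wire"}, "title": "Made-up headline B"},
]

flat = pd.json_normalize(sample_articles)

# The nested keys become dotted column names alongside the top-level 'title'
print(sorted(flat.columns))
print(flat["source.name"].tolist())
```

Each nested key turns into its own column, so the source name can be selected with `flat["source.name"]` just like any other column.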

In [None]:
# This is a convenient way to create a DataFrame
# from a **list** of **JSON/dictionaries**. It is a bit awkward,
# though, if we want to work with each individual article

df = pd.json_normalize( data['articles'] )
df

In [None]:
# Print the titles of the news articles
for article in data['articles']:
    print(article['title'])
    print('-------')

### Exercise: Include Article URLs

Modify the code above to also print the URL of each article.

In [None]:
# YOUR CODE HERE

### Exercise: Include URL and Title Length

Modify the code above to also print the URL of each article and the length of its title.

## Structuring Data for a DataFrame

While `pd.json_normalize` is quick, sometimes we want to select specific fields or perform transformations before creating the DataFrame. A common approach is to iterate through the API response, create a list of dictionaries where each dictionary represents a row in our desired DataFrame, and then convert this list into a pandas DataFrame.

### Example

Building on the loop above, create a DataFrame that holds each article's title, URL, and title length. Then create a histogram of the title lengths.

In [None]:
results = [] # We store here our results

# We go through all the articles
for article in data['articles']:
  # Remember that "article" is the loop variable

  # We create our own dictionaries, with the attributes that we need
  entry = {
      "title": article['title'],
      "url": article['url'],
      "title_length": len(article['title']) # Length of the title in characters
  }
  results.append(entry)

df = pd.DataFrame(results)
df

In [None]:
# Create a histogram of title lengths
df['title_length'].hist(figsize=(6,2), bins=20)

In [None]:
# Create one big string with all the titles
titles = '\n'.join(df['title'])
print(titles)

### Exercise: Get the full text of the news

The 'content' field provided by the NewsAPI often contains only a summary or the beginning of the article. To get the full text of the news story, we need to visit the article's URL and extract the content from the webpage itself. This process is often called web scraping.

We have a helper function `retrieve_text_from_url` below that uses the `requests` and `BeautifulSoup` libraries to attempt to retrieve and clean the text content from a given URL. Note that extracting clean text from various website structures can be challenging, so this function might not work perfectly for every URL.
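To see on a small scale what text extraction involves, the sketch below parses a made-up HTML string rather than a live page. It removes a few tags that usually hold non-article material before extracting text. The choice of tags to drop (`script`, `style`, `nav`) is an assumption for illustration; real sites vary widely, and whether `get_text()` already skips script and style content can depend on the Beautiful Soup version and parser.

```python
from bs4 import BeautifulSoup

# A made-up page with navigation, styling, and a script mixed into the content
html = """
<html><head><title>Made-up page</title><style>p {color: red;}</style></head>
<body><nav>Home | Markets</nav>
<p>Made-up first paragraph.</p>
<script>alert("tracking");</script>
<p>Made-up second paragraph.</p></body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Drop tags whose contents are not part of the readable article text
for tag in soup(["script", "style", "nav"]):
    tag.decompose()

# Join the remaining text fragments with spaces and trim whitespace
text = soup.get_text(separator=" ", strip=True)
print(text)
```

The same cleanup step could be added to a scraping helper, at the cost of occasionally removing text that a particular site places inside one of the dropped tags.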

Your task is to:

1.  Iterate through the list of articles (`data['articles']`).
2.  For each article, use the `retrieve_text_from_url` function with the article's 'url'.
3.  Store the retrieved full text in a new key (e.g., 'url_content') within the dictionary for each article.
4.  Create a new pandas DataFrame (`df_content`) from the updated list of dictionaries, including the original article details and the new 'url_content'.

In [None]:
# Helper function: This function takes a URL as input
# and returns the text that appears on that web page
def retrieve_text_from_url(url):
    """Download a web page and return its text ("" if anything goes wrong)."""
    try:
      resp = requests.get(url, timeout=10)  # Do not hang forever on a slow site
      soup = BeautifulSoup(resp.text, "html.parser")
      return soup.get_text()
    except Exception:
      # Network errors, timeouts, invalid URLs, etc.
      return ""

In [None]:
results = [] # We store here our results

# We go through all the articles
for article in data['articles']:
  c = article.get('content', '') # Use .get() for safety in case 'content' is missing

  # START CODE
  url = article['url']
  url_content = ... # YOUR CODE HERE
  # END CODE
  entry = {
      "title": article['title'],
      "url": article['url'],
      "description": article.get('description', ''),
      "content": c,
      "url_content": ... # YOUR CODE HERE
  }
  results.append(entry)

df_content = pd.DataFrame(results)

#### Solution

In [None]:
results = [] # We store here our results

# We go through all the articles
for article in data['articles']:
  c = article.get('content', '') # Use .get() for safety in case 'content' is missing

  # START CODE
  url = article['url']
  url_content = retrieve_text_from_url(url)
  # END CODE
  entry = {
      "title": article['title'],
      "url": article['url'],
      "description": article.get('description', ''), # Use .get()
      "content": c,
      "url_content": url_content
  }
  results.append(entry)

df_content = pd.DataFrame(results)
display(df_content)