In [66]:
%load_ext autoreload
%autoreload 2



# <center>Determining the Correlation between Regime Changes and Media Coverage by Analyzing New York Times Headlines</center>
<center>Alex Matsoukas and Lila Smith</center>

## Introduction

The purpose of our project is to examine the New York Times' coverage of world conflicts in which the US has been involved, either militarily or diplomatically. The inspiration for this focus comes from the role of the New York Times in the 2003 Iraq War, which included publicizing faulty information and swaying public perception in favor of US intervention.
For this project, we focused on the following three countries and time periods:
- Chile (01/1960 to 01/1990)
- Libya (01/2008 to 01/2013)
- Bolivia (01/2018 to 01/2021)

We chose these cases to answer the question: "How do the quantity and quality of New York Times coverage vary around regime changes across the world?"

More specifically, this question can be broken down into three narrower fields of focus:
- How frequently does the New York Times mention a country per month in the time period surrounding regime changes?
- What words are most used in headlines about these countries during this time period?
- How does headline sentiment correlate with specific regimes?

Our approach to answering these questions can be broken down into three stages: data collection, sentiment analysis, and communication of our findings. We used the New York Times Article Search API to obtain the data (in the form of article headlines) and Google Cloud's Natural Language API to analyze the text of the headlines. The Methodology section below describes these steps in more detail. We also used Python libraries such as Matplotlib, NumPy, and WordCloud to create graphics displaying the data and the subsequent analysis.

## Methodology

### Collecting and Storing Data

The first step that we took towards answering our question was to obtain data from the [New York Times (NYT) Article Search API](https://developer.nytimes.com/docs/articlesearch-product/1/overview). Requests are built from a base URI (`https://api.nytimes.com/svc/search/v2/articlesearch.json?q={query}&fq={filter}&api-key={yourkey}`) whose query parameters can be adjusted to fit the desired search. We used this format as follows:

`https://api.nytimes.com/svc/search/v2/articlesearch.json?q={search_term}&fq=source:("The New York Times")&begin_date={begin_date}&end_date={end_date}&page={page}&api-key={api_key}`,

using the target country in place of `{search_term}` and the start and end dates in format YYYYMMDD in place of `{begin_date}` and `{end_date}`.
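As a minimal sketch of how a URL in this format can be assembled, the snippet below uses the standard library's `urllib.parse.urlencode`, which escapes the quotes, spaces, and parentheses in the `fq` filter. The helper name `build_search_url` and the placeholder key are illustrative assumptions; this is not the code in our `obtaining` module.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(search_term, begin_date, end_date, page, api_key):
    """Assemble an Article Search request URL (hypothetical helper)."""
    params = {
        "q": search_term,
        "fq": 'source:("The New York Times")',
        "begin_date": begin_date,  # YYYYMMDD
        "end_date": end_date,      # YYYYMMDD
        "page": page,              # which page of results to request
        "api-key": api_key,
    }
    # urlencode percent-escapes the quotes and parentheses and turns spaces into "+".
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url("Libya", "20130101", "20130131", 0, "YOUR_KEY")
print(url)
```

Building the query from a dictionary keeps each parameter on its own line, which makes stray characters in a hand-concatenated string easier to avoid.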

From the response provided by the API we focused on the hits, which is the number of articles in which the search term is mentioned, and the headlines, on which we conducted sentiment analysis. Because our period of focus for each country covers at least 3 years, we grouped the hits and headlines by month when saving data.
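Grouping by month means issuing one request per calendar month, each with its own `begin_date` and `end_date`. The following is a sketch of one way to generate those date pairs, assuming a hypothetical helper named `month_ranges`; it relies on `calendar.monthrange` to get the last day of each month, including leap-year Februaries.

```python
import calendar


def month_ranges(start_year, start_month, end_year, end_month):
    """Yield (begin_date, end_date) pairs in YYYYMMDD form, one per month, both ends inclusive."""
    year, month = start_year, start_month
    while (year, month) <= (end_year, end_month):
        last_day = calendar.monthrange(year, month)[1]  # number of days in this month
        yield f"{year}{month:02d}01", f"{year}{month:02d}{last_day:02d}"
        # Advance to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


# The Libya window used in this project: 01/2008 to 01/2013.
libya_months = list(month_ranges(2008, 1, 2013, 1))
print(len(libya_months), libya_months[0], libya_months[-1])
```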

An example of what the API returns for Libya between January 1st and January 31st, 2013 is shown in the code block below.

In [64]:
import os
import pyjq
from obtaining import request_articles, get_hits

# Path to API Key not Stored in Repo
PATH_ALEX = "/home/softdes/Desktop/nytimes-api-key"

with open(os.path.abspath(PATH_ALEX), "r") as f:
    api_key = f.readline().strip()
    
country_name = "Libya"
start_date = "20130101" # January 1st, 2013
end_date = "20130131" # January 31st, 2013

API_response = request_articles(country_name, start_date, end_date, api_key)
number_of_hits = get_hits(API_response)

headlines = pyjq.all('.response .docs[] .headline .main', API_response.json())

print(f"Between {start_date} and {end_date}, the term '{country_name}' appeared in {number_of_hits} articles. Ten headlines from this time period are:\n")
print(headlines)

Between 20130101 and 20130131, the term 'Libya' appeared in 84 articles. Ten headlines from this time period are:

['Italy Closes Consulate in Benghazi After New Attack', 'Lone Suspect Held in Benghazi Attack Is Freed in Tunisia', 'Clinton Testifies About Benghazi', 'Site of Kidnapping in Algeria', "Italy's Oil Leader Is Pursuing Its East African Bet", 'Mali’s Culture War: The Fate of the Timbuktu Manuscripts', 'French Troops Fight Alongside Mali Army Against Islamist Occupiers', 'The Early Word: On the Front', 'Why We Must Help Save Mali', 'A Lifelong Passion Is Now Put to Practice in The Hague']


The data collected from the NYT API is stored in CSV files, one per country. We used the Python Pandas library to organize the information into dataframes and save the dataframes to CSV files. An example of what our stored data looks like is shown below:

In [65]:
import pandas as pd
filepath = f"CountryData/{country_name}_data.csv"
dataframe = pd.read_csv(filepath)
dataframe.head()

| Unnamed: 0 | Country Name | MM-YYYY | Number of Hits | Sentiment Score (-1 to 1) | Magnitude | Month's Headlines |
|---|---|---|---|---|---|---|
| 0 | Libya | 01-2008 | 11 | 0.0 | 0.0 | ['Rehabilitating Libya', 'A New France in the ... |
| 1 | Libya | 02-2008 | 11 | -0.3 | 0.3 | ['Senior Qaeda Commander Is Killed by U.S. Mis... |
| 2 | Libya | 03-2008 | 9 | -0.1 | 0.1 | ['Papers Detail Complaints of Links to Treasur... |
| 3 | Libya | 04-2008 | 15 | 0.0 | 0.0 | ['Libya Seeks Exemption for Its Debt to Victim... |
| 4 | Libya | 05-2008 | 12 | 0.0 | 0.0 | ['Pakistani Nuclear Scientist Denies Selling S... |


More details about the `Sentiment Score (-1 to 1)` and `Magnitude` columns will be provided below.

### Analyzing Headlines

After obtaining the raw data from NYT, we conducted sentiment analysis on the headlines using the [Google Cloud Natural Language API](https://cloud.google.com/natural-language). The sentiment analysis is done by posting plaintext headlines to the API, structured in the following way:

`body = {"document": {"type": "PLAIN_TEXT", "language": "en-us", "content": text}, "encodingType": "UTF32"}`,

where `text` is replaced by the headlines that we want to analyze (`"PLAIN_TEXT"` is the document type and stays fixed). Before sending the monthly headlines to the API, we processed them to remove the quotation marks and commas around and between the headlines so that only headline text was analyzed.
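The preprocessing and request body can be sketched as follows. This assumes each month's headlines are stored as a Python list literal, as the CSV preview suggests, and that headlines are joined with single spaces. The helper names `headlines_to_plaintext` and `build_sentiment_body` are hypothetical and stand in for the functions in our `processing` module.

```python
import ast
import json


def headlines_to_plaintext(stored_headlines):
    """Turn a stored list literal such as "['A', 'B']" into the string "A B"."""
    headlines = ast.literal_eval(stored_headlines)  # safely parse the list literal
    return " ".join(headlines)


def build_sentiment_body(text):
    """Build the JSON request body in the structure shown above."""
    return {
        "document": {"type": "PLAIN_TEXT", "language": "en-us", "content": text},
        "encodingType": "UTF32",
    }


# Two of the January 2008 Libya headlines, in the stored format.
stored = "['Rehabilitating Libya', 'Demand More From Libya']"
body = build_sentiment_body(headlines_to_plaintext(stored))
print(json.dumps(body, indent=2))
```

Parsing with `ast.literal_eval` rather than stripping characters by hand leaves apostrophes and commas that belong to a headline untouched.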

An example of an analysis done by the Google Cloud API is shown below for the headlines for Libya in January 2008.

In [71]:
from processing import headline_list_to_string, request_sentiment
plaintext = headline_list_to_string(country_name, "200801")
analysis = request_sentiment(plaintext)
analysis.json()

{'documentSentiment': {'magnitude': 0, 'score': 0},
 'language': 'en-US',
 'sentences': [{'text': {'content': 'Rehabilitating Libya A New France in the New Middle East: Forget Glory Waving Goodbye to Hegemony Demand More From Libya After Veto, House Passes a Revised Military Policy Measure 3 Convicted Who Led Charity Tied to Militants Fiat King Holds Court in Rome Red Bulls Coach Heads South to Find Players Sarkozy Says Press Is Free to Ignore His Personal Life Unwilling New Frontier for Migrants: 3 Greek Isles Nuclear Scientists',
    'beginOffset': 0},
   'sentiment': {'magnitude': 0, 'score': 0}}]}

Based on this analysis, both the overall sentiment and the overall magnitude for this month are 0. The sentiment score ranges from -1 to 1, where -1 indicates that the text has a very negative connotation and 1 indicates that the text has a very positive connotation. In this case, a sentiment score of 0 implies that the tone of the headlines was, on average, neutral. The magnitude score indicates the strength of emotion conveyed in the text. It starts at 0 and has no upper bound, with a larger score indicating stronger emotionality (either positive or negative). We used the Pandas library again to read the existing CSV data files and update each country's dataframe with these scores.
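To make the link between the response and the two stored columns concrete, here is a small sketch that pulls the document-level score and magnitude out of a parsed response. The function name `extract_document_sentiment` is a hypothetical helper, and the sample dictionary is a trimmed copy of the response shown above.

```python
def extract_document_sentiment(payload):
    """Return (score, magnitude) from a parsed sentiment response."""
    sentiment = payload["documentSentiment"]
    return float(sentiment["score"]), float(sentiment["magnitude"])


# Trimmed copy of the response for Libya, January 2008 (sentences omitted).
sample = {
    "documentSentiment": {"magnitude": 0, "score": 0},
    "language": "en-US",
    "sentences": [],
}

score, magnitude = extract_document_sentiment(sample)
print(f"score={score}, magnitude={magnitude}")
```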

### Answering the Questions

The data that we've collected on NYT media coverage for Chile, Libya, and Bolivia takes the form of monthly hits, headlines, and sentiment scores covering the period around a regime change and/or US military involvement in these countries. The number of hits can be used to gauge media attention, under the assumption that more hits mean more coverage by the NYT. We decided to visualize this data using a scatterplot of hits vs. time in order to get a sense of when each country was mentioned the most; this plot lets us check our hypothesis that hits spike around notable events. Because the data is already stored in an organized fashion, we used the Matplotlib library to create the scatterplot of the `Number of Hits` column vs. the `MM-YYYY` column from the CSV file for each country.
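One detail worth noting when plotting against the `MM-YYYY` column: as plain strings these labels sort by month before year. The sketch below, using a hypothetical helper `parse_month_label`, converts the labels to `datetime` objects so they order chronologically, which matters if rows are ever re-sorted or merged before plotting.

```python
from datetime import datetime


def parse_month_label(label):
    """Convert an 'MM-YYYY' label such as '01-2008' to a datetime (first of the month)."""
    return datetime.strptime(label, "%m-%Y")


labels = ["01-2009", "02-2008", "12-2008"]
string_order = sorted(labels)                              # month-first string sort
chronological_order = sorted(labels, key=parse_month_label)  # true time order
print(string_order)
print(chronological_order)
```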

To determine the kind of language that the NYT used most frequently to describe each country, we used Python's Wordcloud library to create a qualitative display of the most common words. The more prominent words in the wordcloud should be a good indicator of how the NYT describes the country around and during military/political conflict. We can also create wordclouds for single, isolated months, which allows us to visualize the headlines that correspond directly to large hit values.
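The prominence of a word in a wordcloud is driven by how often it appears. As a rough stand-in for that counting step (this is not the Wordcloud library's own implementation, and the short stopword list is an assumption made for illustration), word frequencies can be tallied with `collections.Counter`:

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "the", "in", "of", "to", "is", "from", "by", "after", "for", "on"}


def word_frequencies(headlines):
    """Count lowercase words across headlines, skipping stopwords."""
    words = []
    for headline in headlines:
        words.extend(re.findall(r"[a-z']+", headline.lower()))
    return Counter(word for word in words if word not in STOPWORDS)


# Three of the January 2013 Libya headlines returned by the NYT API.
sample_headlines = [
    "Italy Closes Consulate in Benghazi After New Attack",
    "Lone Suspect Held in Benghazi Attack Is Freed in Tunisia",
    "Clinton Testifies About Benghazi",
]
top_words = word_frequencies(sample_headlines).most_common(2)
print(top_words)
```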

In order to get a more quantitative understanding of the headline language, our third visualization of the data is a bubble chart that plots the sentiment score vs. time, with the size of each bubble corresponding to the number of hits. Once again, this data is plotted using Matplotlib's scatter function to plot the `Sentiment Score (-1 to 1)` column against the `MM-YYYY` column of each CSV file, controlling the marker size with the `Number of Hits` column. This provides a combined view of media attention and sentiment, allowing us to see big-picture trends that neither the wordcloud nor the scatterplot can show on its own.
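A minimal sketch of such a bubble chart is given below. The scaling rule (largest month mapped to a fixed maximum area of 600) and the function names `bubble_sizes` and `plot_sentiment_bubbles` are assumptions chosen for illustration, not necessarily what our plotting code does. Matplotlib's `scatter` interprets its `s` argument as marker area, so hits are scaled linearly into that area.

```python
def bubble_sizes(hits, max_area=600.0):
    """Scale hit counts to marker areas so the busiest month gets max_area."""
    if not hits:
        return []
    peak = max(hits)
    if peak == 0:
        return [0.0 for _ in hits]
    return [max_area * count / peak for count in hits]


def plot_sentiment_bubbles(dates, scores, hits):
    """Plot sentiment vs. time with bubble area proportional to hits."""
    # Imported here so bubble_sizes can be used without Matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(dates, scores, s=bubble_sizes(hits), alpha=0.5)
    ax.set_xlabel("Month")
    ax.set_ylabel("Sentiment Score (-1 to 1)")
    ax.set_ylim(-1, 1)
    return fig


# Hit counts from the first five Libya rows in the stored data preview.
sizes = bubble_sizes([11, 11, 9, 15, 12])
print(sizes)
```

Calling `plot_sentiment_bubbles` with a list of dates, the matching sentiment scores, and the hit counts returns a figure that can be displayed or saved.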

## Results

## Conclusion