<a href="https://colab.research.google.com/github/jcvincentliu/Diving-into-NYT-Articles-in-Four-Years/blob/main/project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **MACS 31300: Final Project - Public Policies and Public Opinions Behind New York Times Articles in Four Years**
####  --- Uncovering the myths through the use of natural language processing, clustering, and deep learning

In this project, I explore how prominent crime-related news has been in the public sphere over the last 50 years, using the [New York Times Archives data](https://developer.nytimes.com/). According to official crime data, nationwide crime levels rose in the 70s and 80s, peaked in the early 90s, and have been dropping since the mid-90s. This pattern is especially clear for violent crimes. Public policy, however, did not follow this pattern: important crime prevention policies kept becoming law throughout the period, and even as crime levels fell, the trend was toward fighting crime harder. One explanation in the literature is that changes in crime policy respond to public opinion and attitudes about crime rather than to what is actually happening. This approach highlights the political influence of public opinion while paying attention to how the electoral system works in the country. I am interested in testing the validity of this explanation.

The project combines several methods, including natural language processing and clustering. I started by cleaning the data and applying the necessary transformations, moved on to an exploratory analysis of the most salient words and word pairs in each decade, and ended by using machine learning to find the best model.

### **Project Description**

---



The project is divided into five phases. In phases one and two, I scrape the data from the New York Times archive database and perform data wrangling, which includes cleaning the text, tokenizing it, and generating concatenated tables. I focus on NYT articles from four specific years, 1971, 1989, 2010, and 2020, each of which is associated with an influential criminal justice policy or piece of legislation. In phase three, I carry out exploratory data analysis (EDA) by looking at the most frequent words and bigrams and creating visualizations. In phase four, I use clustering techniques to divide tokens into groups and visualize the grouping. Lastly, I project the groups using deep learning methods to create interactive visualizations.

Starting with loading `requests`, `nltk`, `tensorflow`, `sklearn`, and other libraries:

In [None]:
import nltk
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
nltk.download("stopwords")  # provides the list of English stopwords
stop = stopwords.words('english')

import requests
import json
import seaborn as sns
import re
import os
import matplotlib.pyplot as plt
%matplotlib inline

plt.style.use('ggplot')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
os.getcwd() # get current directory, shall be deleted later   see https://note.nkmk.me/en/python-os-getcwd-chdir/


'/'

In [None]:
os.chdir("/..") # delete later

#### Part I: Gathering Data Through Web Scraping (Preliminary Data Collection)
---

The code is adapted from a [hackathon notebook on Kaggle](https://www.kaggle.com/code/tumanovalexander/nyt-dataset-hackaton/notebook). Since the original scope of my analysis was all articles published by the New York Times between 1960 and 2020, I scraped each year's data, saved it to a per-year CSV file, and concatenated the files into one large data frame.
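
The per-article logic of the scraping loop can be sketched on its own. This is a minimal illustration and not the code that produced the data: `archive_url` and `build_fragment` are hypothetical helper names, the environment variable `NYT_API_KEY` is an assumption, and the two sample documents are made up.

```python
import os

def archive_url(year, month, api_key):
    """Build the Archive API URL for one month (same URL pattern as the request in this notebook)."""
    return f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={api_key}"

def build_fragment(doc):
    """Join a document's abstract and main headline; return None when both are empty."""
    abstract = doc.get("abstract") or ""
    headline = (doc.get("headline") or {}).get("main") or ""
    text = f"{abstract} {headline}".strip()
    return text or None

# Assumption: the key lives in an environment variable named NYT_API_KEY
api_key = os.environ.get("NYT_API_KEY", "YOUR_KEY")
print(archive_url(1971, 1, api_key))

# Made-up documents shaped like entries of response['docs']
docs = [
    {"abstract": "Senate passes bill", "headline": {"main": "Crime Bill Advances"}},
    {"abstract": "", "headline": {"main": ""}},
]
fragments = [f for f in map(build_fragment, docs) if f is not None]
print(fragments)
```

Returning `None` for empty documents keeps the decision about whether to store a row in one place, which avoids appending a stale value from a previous iteration.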

In [None]:
os.makedirs('datasets', exist_ok=True)  # make sure the output folder exists

for year in range(1960, 2021):
    archieve = []
    for month in range(1, 13):
        articles_request = requests.get(f'https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key=8Aekv3zl5FZVAoZwAGfwAn4BO1ruYxh2')
        if articles_request.ok:
            articles_content = json.loads(articles_request.text)
            docs = articles_content['response']['docs']

            for doc in docs:
                if 'abstract' in doc:
                    fragment_abstract = doc['abstract']
                    fragment_headline = doc['headline']['main']
                    if fragment_abstract or fragment_headline:
                        # append only when there is text, so no stale fragment is reused
                        archieve.append(fragment_abstract + ' ' + fragment_headline)

    # one row per article, labeled with its year
    df_archieve = pd.DataFrame(data=archieve, index=np.full_like(range(len(archieve)), year)).reset_index()
    df_archieve.rename(columns={'index': 'year', 0: 'sentence'}, inplace=True)
    df_archieve.to_csv(os.path.join('datasets', f'df_{year}.csv'))  # portable path instead of a backslash

In [None]:
year_archieves = pd.DataFrame()

for year in range(1960, 2021):
    df = pd.read_csv(os.path.join('datasets', f'df_{year}.csv'))
    year_archieves = pd.concat([year_archieves, df])

I saved the data into my local environment.



In [None]:
year_archieves.to_csv('Total Archives.csv')

#### Part II: Cleaning the Text Data
---

Loading the data for the years 1971, 1989, 2010, and 2020:

In [None]:
from google.colab import files
uploaded = files.upload()

Saving df_1971.csv to df_1971.csv
Saving df_1989.csv to df_1989.csv
Saving df_2010.csv to df_2010.csv
Saving df_2020.csv to df_2020.csv


In [None]:
NYT_1971 = pd.read_csv("df_1971.csv") # Beginning of the War on Drugs: https://www.britannica.com/topic/war-on-drugs

NYT_1989 = pd.read_csv("df_1989.csv") # the influential Graham v. Connor ruling

NYT_2010 = pd.read_csv("df_2010.csv") # the peak of incarceration counts in the nation, see the graph on https://www.prisonpolicy.org/profiles/US.html#time

NYT_2020 = pd.read_csv("df_2020.csv") # Killing of George Floyd

I created a function that does five things: converting each sentence to lower case; removing numbers, HTML tags, and punctuation; removing stopwords; handling missing data; and recasting `year` as a factor/categorical variable. I applied the function to the four datasets.
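
To see what these steps do to a single string, here is a minimal sketch. The helper name `clean_text` is hypothetical, and `STOP_WORDS` is a tiny stand-in for the NLTK English stopword list that the notebook actually uses.

```python
import re

# Assumption: a small stand-in stopword set; the notebook uses NLTK's English list
STOP_WORDS = {"a", "on", "and", "the", "of"}

def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase, strip digits, HTML tags, and punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)        # remove digits
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    return " ".join(w for w in text.split() if w not in stop_words)

print(clean_text("Amer Gas Assn repts gas utility and pipeline"))
# amer gas assn repts gas utility pipeline
```

The final `split()` and `join` also collapse the runs of spaces left behind by the earlier substitutions.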

In [None]:
def clean_data(NYT_df):
  NYT_df.drop(columns=['Unnamed: 0'], inplace=True) # drop the index column written by to_csv
  # fill missing text first so the string operations below never see NaN
  NYT_df['sentence processed'] = NYT_df["sentence"].fillna("fillna").str.lower()
  NYT_df['sentence processed'] = NYT_df['sentence processed'].apply(lambda x: re.sub(r'\d+', '', x)) # remove all the numbers
  NYT_df['sentence processed'] = NYT_df['sentence processed'].apply(lambda x: re.sub(r'<[^>]+>', ' ', x)) # remove HTML tags
  NYT_df['sentence processed'] = NYT_df['sentence processed'].apply(lambda x: re.sub(r'[^\w\s]', ' ', x)) # replace punctuation with spaces
  NYT_df['sentence processed'] = NYT_df['sentence processed'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop])) # remove stop words
  NYT_df["year"] = NYT_df["year"].astype("category")
  return NYT_df

In [None]:
NYT_1971 = clean_data(NYT_1971)
NYT_1989 = clean_data(NYT_1989)
NYT_2010 = clean_data(NYT_2010)
NYT_2020 = clean_data(NYT_2020)


By inspecting the first five rows of the dataset `NYT_1971`, I got a sense of what the data looks like, which matters for the later analysis. I also confirmed that every variable except `year` has the pandas `object` dtype.

In [None]:
NYT_1971.head()

Unnamed: 0,year,sentence,sentence processed,sentence tokenized,sentence bigrammed
0,1971,A Moursund Jr lr on T Winner's Dec 22 (32:4) l...,moursund jr lr winner dec lr disputes winner s...,"[moursund, jr, lr, winner, dec, lr, disputes, ...","[(moursund, jr), (jr, lr), (lr, winner), (winn..."
1,1971,Amer Gas Assn repts gas utility and pipeline i...,amer gas assn repts gas utility pipeline indus...,"[amer, gas, assn, repts, gas, utility, pipelin...","[(amer, gas), (gas, assn), (assn, repts), (rep..."
2,1971,Dr Spock lr disputes Prof Bickel Dec 21 Op-Ed ...,dr spock lr disputes prof bickel dec op ed art...,"[dr, spock, lr, disputes, prof, bickel, dec, o...","[(dr, spock), (spock, lr), (lr, disputes), (di..."
3,1971,Northfield Savings Bank begins operation under...,northfield savings bank begins operation new n...,"[northfield, savings, bank, begins, operation,...","[(northfield, savings), (savings, bank), (bank..."
4,1971,"NYS Health Dept repts 638,077 children, 26.4% ...",nys health dept repts children population aged...,"[nys, health, dept, repts, children, populatio...","[(nys, health), (health, dept), (dept, repts),..."


In [None]:
NYT_1971.dtypes

year                  category
sentence                object
sentence processed      object
sentence tokenized      object
sentence bigrammed      object
dtype: object

For this project, I am concerned with both individual words and bigrams (two words as a group). 

- Token/Unigram: 

Single words can carry particular meanings in English. Consider the word "criminal", which by itself conveys a lot of information: it lets us picture someone committing an unlawful act. A single word can also carry sentiment and be used to shed light on public opinion.

In the first step, I tokenized each article in each dataset and put the list of tokens in a new column named `sentence tokenized`. 
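
One detail worth checking when tokenizing this way: `split(' ')` and `split()` behave differently on empty strings and repeated spaces. The sketch below uses plain Python strings; I would expect `Series.str.split(' ')` to follow the same rule for each element, but that is an assumption to verify on the data.

```python
empty = ""
print(empty.split(" "))   # [''] : one empty-string token
print(empty.split())      # []   : no tokens at all

# A double space (on purpose) produces an empty token with split(" ")
tokens_keep_empty = "crime  rate".split(" ")
tokens_no_empty = "crime  rate".split()
print(tokens_keep_empty)  # ['crime', '', 'rate']
print(tokens_no_empty)    # ['crime', 'rate']
```

Under this rule, an article whose processed text is empty still contributes a single `''` token to the token counts.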

In [None]:
NYT_1971["sentence tokenized"] = NYT_1971["sentence processed"].str.split(' ')
NYT_1989["sentence tokenized"] = NYT_1989["sentence processed"].str.split(' ')
NYT_2010["sentence tokenized"] = NYT_2010["sentence processed"].str.split(' ')
NYT_2020["sentence tokenized"] = NYT_2020["sentence processed"].str.split(' ')

- Bigram

Words that come in pairs usually reveal more information than single words. Returning to the example of "criminal", the phrases "criminal justice" and "criminal sentencing" point in two distinct directions even though they share the word "criminal". The former could mean a field of study or a principle, whereas the latter is a judicial and legislative term. Dividing the text into bigrams therefore gives us more information about the data.

In the following two chunks, I created a function that rearranges a list of single tokens into a list of bigrams and applied it to the datasets. The new column is called `sentence bigrammed`.
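
The idea behind building bigrams from a token list can be shown step by step: make shifted copies of the list and zip them together. This is only a sketch on a made-up three-word example, with illustrative variable names.

```python
tokens = ["criminal", "justice", "reform"]

# Shifted copies of the list: tokens[0:] and tokens[1:]
shifted = [tokens[i:] for i in range(2)]

# zip stops at the shortest list, so each tuple is a pair of neighbours
pairs = list(zip(*shifted))
bigrams = [" ".join(pair) for pair in pairs]
print(bigrams)   # ['criminal justice', 'justice reform']

# A single token has no neighbour, so it yields no bigram
single = list(zip(*[["crime"][i:] for i in range(2)]))
print(single)    # []
```

A list of k tokens therefore gives k - 1 bigrams, and bigrams never span two different articles when the function is applied row by row.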

In [None]:
def ngrams(tokens, n=2, sep=' '):
    return [sep.join(ngram) for ngram in zip(*[tokens[i:] for i in range(n)])]

In [None]:
NYT_1971['sentence bigrammed'] = NYT_1971['sentence tokenized'].apply(ngrams)
NYT_1989['sentence bigrammed'] = NYT_1989['sentence tokenized'].apply(ngrams)
NYT_2010['sentence bigrammed'] = NYT_2010['sentence tokenized'].apply(ngrams)
NYT_2020['sentence bigrammed'] = NYT_2020['sentence tokenized'].apply(ngrams)

Now, let's take another look at our `NYT_1971` data. 

In [None]:
NYT_1971.head(5)

Unnamed: 0,year,sentence,sentence processed,sentence tokenized,sentence bigrammed
0,1971,A Moursund Jr lr on T Winner's Dec 22 (32:4) l...,moursund jr lr winner dec lr disputes winner s...,"[moursund, jr, lr, winner, dec, lr, disputes, ...","[moursund jr, jr lr, lr winner, winner dec, de..."
1,1971,Amer Gas Assn repts gas utility and pipeline i...,amer gas assn repts gas utility pipeline indus...,"[amer, gas, assn, repts, gas, utility, pipelin...","[amer gas, gas assn, assn repts, repts gas, ga..."
2,1971,Dr Spock lr disputes Prof Bickel Dec 21 Op-Ed ...,dr spock lr disputes prof bickel dec op ed art...,"[dr, spock, lr, disputes, prof, bickel, dec, o...","[dr spock, spock lr, lr disputes, disputes pro..."
3,1971,Northfield Savings Bank begins operation under...,northfield savings bank begins operation new n...,"[northfield, savings, bank, begins, operation,...","[northfield savings, savings bank, bank begins..."
4,1971,"NYS Health Dept repts 638,077 children, 26.4% ...",nys health dept repts children population aged...,"[nys, health, dept, repts, children, populatio...","[nys health, health dept, dept repts, repts ch..."


One of my goals in this section is to map the most popular words and bigrams in NYT articles and draw time series graphs that display how word usage by NYT writers changes over time. To do that, I started by compiling the tokens and bigrams of each year into a single list. I then arranged them into a pandas dataframe.

This is done below:

In [None]:
def flatten_column(NYT_df, column):
  # concatenate the per-article lists stored in `column` into one flat list
  return [item for items in NYT_df[column] for item in items]

def get_word_lists_nyt(NYT_df):
  # all tokens of one year, in article order
  return flatten_column(NYT_df, "sentence tokenized")

def get_bigram_lists_nyt(NYT_df):
  # all bigrams of one year, in article order
  return flatten_column(NYT_df, "sentence bigrammed")

In [None]:
wordlist_nyt_1971 = get_word_lists_nyt(NYT_1971)
wordlist_nyt_1989 = get_word_lists_nyt(NYT_1989)
wordlist_nyt_2010 = get_word_lists_nyt(NYT_2010)
wordlist_nyt_2020 = get_word_lists_nyt(NYT_2020)


In [None]:
bigramlist_nyt_1971 = get_bigram_lists_nyt(NYT_1971)
bigramlist_nyt_1989 = get_bigram_lists_nyt(NYT_1989)
bigramlist_nyt_2010 = get_bigram_lists_nyt(NYT_2010)
bigramlist_nyt_2020 = get_bigram_lists_nyt(NYT_2020)

In [None]:
NYT_token_dict = {"Year": ["1971", "1989", "2010", "2020"], 
                  "Tokens": [wordlist_nyt_1971, wordlist_nyt_1989, wordlist_nyt_2010, wordlist_nyt_2020], 
                  "Num of Tokens": [len(wordlist_nyt_1971), len(wordlist_nyt_1989), len(wordlist_nyt_2010), len(wordlist_nyt_2020)], 
                  "Bigrams": [bigramlist_nyt_1971, bigramlist_nyt_1989, bigramlist_nyt_2010, bigramlist_nyt_2020], 
                  "Num of Bigrams": [len(bigramlist_nyt_1971), len(bigramlist_nyt_1989), len(bigramlist_nyt_2010), len(bigramlist_nyt_2020)]}
NYT_tokens = pd.DataFrame(data= NYT_token_dict)
NYT_tokens["Year"] = NYT_tokens["Year"].astype("category")


In [None]:
NYT_tokens.dtypes

Year              category
Tokens              object
Num of Tokens        int64
Bigrams             object
Num of Bigrams       int64
dtype: object

In [None]:
NYT_tokens

Unnamed: 0,Year,Tokens,Num of Tokens,Bigrams,Num of Bigrams
0,1971,"[moursund, jr, lr, winner, dec, lr, disputes, ...",2718752,"[moursund jr, jr lr, lr winner, winner dec, de...",2625362
1,1989,"[lead, editor, bah, humbug, lead, chile, milit...",2961626,"[lead editor, editor bah, bah humbug, lead chi...",2857095
2,2010,"[age, digital, imagery, one, man, still, earns...",2028580,"[age digital, digital imagery, imagery one, on...",1916132
3,2020,"[president, terrible, toddler, anyone, else, w...",279127,"[president terrible, terrible toddler, toddler...",264602
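
The two count columns can be cross-checked against each other. Because bigrams are built within each article, an article with k tokens yields k - 1 bigrams, so the gap between the token and bigram counts should equal the number of rows per year. This assumes every row has at least one token, which `split(' ')` gives even for an empty string. A quick sketch using the figures from the table above:

```python
# (number of tokens, number of bigrams) per year, copied from the table
counts = {
    "1971": (2718752, 2625362),
    "1989": (2961626, 2857095),
    "2010": (2028580, 1916132),
    "2020": (279127, 264602),
}

# tokens - bigrams = number of articles, if each article has at least one token
implied_rows = {year: tokens - bigrams for year, (tokens, bigrams) in counts.items()}
print(implied_rows)
# {'1971': 93390, '1989': 104531, '2010': 112448, '2020': 14525}
```

Comparing these values with `NYT_1971.shape[0]` and its counterparts would confirm whether the assumption holds for this data.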


#### Part III: Exploring the Data with Exploratory Data Analysis (EDA)

In [80]:
from collections import Counter
count_word_nyt1971 = Counter(wordlist_nyt_1971)
count_word_nyt1989 = Counter(wordlist_nyt_1989)
count_word_nyt2010 = Counter(wordlist_nyt_2010)
count_word_nyt2020 = Counter(wordlist_nyt_2020)

count_bigram_nyt1971 = Counter(bigramlist_nyt_1971)
count_bigram_nyt1989 = Counter(bigramlist_nyt_1989)
count_bigram_nyt2010 = Counter(bigramlist_nyt_2010)
count_bigram_nyt2020 = Counter(bigramlist_nyt_2020)

In [82]:
count_bigram_nyt1989.most_common(20)

[('net inc', 23013),
 ('share earns', 22369),
 ('lead company', 17277),
 ('company reports', 17251),
 ('reports earnings', 17248),
 ('earnings qtr', 15915),
 ('inc share', 13443),
 ('sales net', 11182),
 ('new york', 10782),
 ('revenue net', 8080),
 ('net loss', 6885),
 ('inc b', 6862),
 ('qtr march', 6502),
 ('qtr dec', 6380),
 ('qtr sept', 5996),
 ('qtr june', 5983),
 ('b share', 5665),
 ('otc qtr', 5548),
 ('rev net', 5272),
 ('shares outst', 5019)]
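
The 1989 list is dominated by earnings-report phrases, so crime-related bigrams do not surface in the overall top 20. One possible way to look at them is to filter the counter by a keyword set first. This is a sketch only: `top_matching` is a hypothetical helper, `CRIME_TERMS` is an illustrative list that is not part of the project, and the toy counts are made up.

```python
from collections import Counter

# Assumption: an illustrative keyword list, not one used in the project
CRIME_TERMS = {"crime", "criminal", "police", "prison", "drug"}

def top_matching(counter, keywords, n=20):
    """Return the n most common bigrams that contain at least one keyword."""
    matches = Counter({
        bigram: count
        for bigram, count in counter.items()
        if keywords & set(bigram.split())
    })
    return matches.most_common(n)

# Toy counts for demonstration only (made-up numbers)
toy = Counter({"net inc": 50, "criminal justice": 7, "new york": 30, "drug abuse": 9})
print(top_matching(toy, CRIME_TERMS))
# [('drug abuse', 9), ('criminal justice', 7)]
```

With the real data, the same call could be made on `count_bigram_nyt1989` and the other yearly counters.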

#### Part IV: Data Modeling - Clustering and Grouping

(Comment) The aim here is an unsupervised grouping of the articles. One option is to look at word embeddings and how they cluster over time, then dive into the themes of each year. Starting with word clouds should help decide what to look for.

In [None]:
data = year_archieves.sample(n = 10000)
X = data["sentence"]
y = data["year"]


In [None]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.5, random_state=42)

In [None]:
data["sentence"]


21111     Aon, close to settlement with attorneys genera...
105288      J. Paul Austin, the retired chairman of the ...
12834     GENEVA, March 1&#8212;Stheduled airlines flyin...
92958     MILTENBERG-Herman. 97 years young, our dearly ...
101651    TheaterWorks in Hartford is now staging “God o...
                                ...                        
62116     A revamping is expected to take time to lift t...
56413     PARIS, May 18 (Reuters) -- Following is the tr...
27402     WESTBURY, L. I., April 10&#8212; A last&#8208;...
50614     Hundreds of thousands of Rwandan refugees trud...
41460     WASHINGTON, June 19 The anti-discrimination la...
Name: sentence, Length: 10000, dtype: object
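
Some sampled sentences contain numeric character references such as `&#8212;` and `&#8208;`. If the raw `sentence` column is used for modeling, these could be decoded first with the standard library. A minimal sketch on the visible part of one sampled row:

```python
import html

# Visible prefix of one sampled sentence, copied from the output above
raw = "WESTBURY, L. I., April 10&#8212; A last&#8208;"

# html.unescape turns character references into the characters they stand for
decoded = html.unescape(raw)
print(decoded)
print("\u2014" in decoded, "\u2010" in decoded)   # True True
```

Whether decoding helps depends on the later steps, since the cleaning function in Part II already strips digits and punctuation from the processed column.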