# COGS 108 - EDA Checkpoint

# Names

- James Larsen
- Alejandro Servin
- Lily Steiner
- Mayra Trejo
- Lucy Lennemann

<a id='research_question'></a>
# Research Question

How has the sentiment of the language surrounding Deafness used by popular online news sources (ABC, New York Times, USA Today, The Guardian, Associated Press) changed since the 1980s?

# Setup

In [1]:
#import necessary packages, some will be used during analysis
import sys
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import json
import unicodedata
import nltk
from textblob import TextBlob, Word
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('omw-1.4')
from datetime import date
from collections import defaultdict
import math
import gensim
from gensim import corpora
import string
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to /home/jamie/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jamie/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [2]:
# Import Datasets
# Import ABC Dataset
with open('dataset/abc_data.json') as abc_ds:
    abc_data=json.load(abc_ds)
    
# Import Associated Press Dataset
with open('dataset/ap_data.json') as ap_ds:
    ap_data=json.load(ap_ds)

# Import The Guardian Dataset
with open('dataset/guard_data.json') as guard_ds:
    guard_data=json.load(guard_ds)
    
# Import New York Times Dataset
with open('dataset/nyt_data.json') as nyt_ds:
    nyt_data=json.load(nyt_ds)

# Import USA Today Dataset
with open('dataset/usa_data.json') as usa_ds:
    usa_data=json.load(usa_ds)

In [3]:
# Load datasets into dataframes
abc_df = pd.read_json('dataset/abc_data.json')
ap_df = pd.read_json('dataset/ap_data.json') 
guard_df = pd.read_json('dataset/guard_data.json')
nyt_df = pd.read_json('dataset/nyt_data.json') 
usa_df = pd.read_json('dataset/usa_data.json') 

In [4]:
# Set row and column display
pd.options.display.max_rows=6
pd.options.display.max_columns=5

# Used temporarily to inspect full text for errors; reverted after cleaning
#pd.options.display.max_colwidth=None 

pd.options.display.max_colwidth=40

In [5]:
# Space for TextBlob code

# Data Cleaning

Each source dataset goes through the same cleaning steps:

1. Reorder the columns of every dataframe so that they match.
2. Convert the date strings to pandas datetime values.
3. Remove all articles dated on or before 1980-01-01.
4. Remove Unicode artifacts from the article text using unicodedata.normalize (NFKD form).
5. Remove extraneous articles (for example, the AP sports score listings).
6. Remove extraneous pieces of article text.
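
To make the Unicode step concrete, here is a minimal sketch of what NFKD normalization does to a single string. The `normalize_text` helper and the sample string are hypothetical and only for illustration; the notebook applies `unicodedata.normalize('NFKD', ...)` directly inside a lambda.

```python
import unicodedata

def normalize_text(text):
    """Apply NFKD (compatibility decomposition) normalization to a string."""
    return unicodedata.normalize('NFKD', text)

# Hypothetical sample: a non-breaking space (U+00A0), an "fi" ligature (U+FB01)
# and a curly apostrophe (U+2019)
sample = 'deaf\u00a0school \ufb01rst day \u2019'
normalized = normalize_text(sample)

# The non-breaking space becomes a plain space and the ligature becomes "fi";
# the curly apostrophe has no decomposition and is left unchanged.
print(repr(normalized))

# Accented letters are split into a base letter plus a combining mark,
# so the accent is still present as a separate code point.
print(len(normalize_text('\u00e9')))
```

Two consequences worth keeping in mind: curly quotes survive this step and still need to be handled separately, and accented names are decomposed rather than stripped of their accents.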

### ABC Dataset

In [6]:
#visualize dataframe
abc_df                                      

Unnamed: 0,url,headline,source,date,text
0,https://abcnews.go.com/US/wireStory/...,Prosecutor: Alex Murdaugh now faces ...,ABC News,2022-01-21 18:17:00,"COLUMBIA, S.C. -- A once-prominent S..."
1,https://abcnews.go.com/US/undefeated...,Undefeated: Deaf football team bring...,ABC News,2021-11-20 12:59:00,"Once considered underdogs, the footb..."
2,https://abcnews.go.com/US/referee-ac...,Referee accused of discriminating ag...,ABC News,2021-12-30 02:53:00,The American Civil Liberties Union i...
...,...,...,...,...,...
158,https://abcnews.go.com/US/dunwoody-d...,Dunwoody Day Care Trial: Widow 'Didn...,ABC News,2012-02-24 16:02:00,"ATLANTA, Feb. 24, 2012 — -- A witnes..."
159,https://abcnews.go.com/US/story?id=9...,Police Investigate Deaf Student Homi...,ABC News,2006-01-07 15:05:00,"Feb. 4, 2001 -- A student found dead..."
160,https://abcnews.go.com/US/story?id=9...,Rush Limbaugh Suffers Hearing Loss -...,ABC News,2006-01-07 15:26:00,"Oct. 8, 2001 -- Rush Limbaugh, who's..."


In [7]:
# Reorganize columns
abc_df = abc_df[['headline','date','source','url','text']]

# Convert 'date' to datetime format and only visualize date
abc_df['date'] = pd.to_datetime(abc_df['date'], errors='coerce')

# Remove articles before 1980-01-01
abc_df = abc_df[~(abc_df['date']<='1980-01-01')]

abc_df['source'] = 'abc'
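
The reorder, convert and filter steps in the cell above recur for every source, so they could be collected in one function. The sketch below is one possible way to do that, not the notebook's own code: `clean_frame`, `COLUMNS` and the demo data are hypothetical names, and parsing with `utc=True` is an added assumption so that sources with and without a time zone offset end up comparable.

```python
import pandas as pd

# Shared column order used for every source (taken from the cells in this notebook)
COLUMNS = ['headline', 'date', 'source', 'url', 'text']

def clean_frame(df, source_label, cutoff='1980-01-01'):
    """Reorder columns, parse dates, drop rows dated on or before the cutoff,
    and relabel the source. Returns a new dataframe."""
    out = df[COLUMNS].copy()  # copy avoids chained-assignment warnings
    # Assumption: treat every timestamp as UTC so all sources are comparable
    out['date'] = pd.to_datetime(out['date'], errors='coerce', utc=True)
    # Rows with unparseable dates (NaT) are kept here, as in the original filter
    out = out[~(out['date'] <= pd.Timestamp(cutoff, tz='UTC'))]
    out['source'] = source_label
    return out.reset_index(drop=True)

# Hypothetical two-row dataset to show the effect
demo = pd.DataFrame({
    'url': ['u1', 'u2'],
    'headline': ['old article', 'new article'],
    'source': ['Demo News', 'Demo News'],
    'date': ['1975-05-01 10:00:00', '2001-02-04 09:30:00'],
    'text': ['a', 'b'],
})
cleaned = clean_frame(demo, 'demo')
print(cleaned[['headline', 'date', 'source']])
```

Working on a copy also means the function never modifies the dataframe passed in, which makes it safe to re-run a cell.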

In [8]:
#look for null values
abc_df.isnull().sum()

headline    0
date        0
source      0
url         0
text        0
dtype: int64

In [9]:
#Comb for unique values in the 'headline' column
abc_df['headline'].unique()

array(['Prosecutor: Alex Murdaugh now faces 71 charges; $8.5M stolen ...',
       'Undefeated: Deaf football team brings triumph and pride to ...',
       'Referee accused of discriminating against deaf wrestler in state ...',
       'Preserving Black American Sign Language in the Deaf community ...',
       'Baby born deaf has touching reaction to hearing music for 1st time ...',
       "Deaf Costco worker with mumbling manager won't get award - ABC ...",
       'Today in History - ABC News',
       "Scenes from Week 1 of Ghislaine Maxwell's sex-abuse trial - ABC ...",
       'Police officer dies from COVID-19 just 3 months after retirement ...',
       'Man released from prison after 48 years in court compromise - ABC ...',
       'Liberty Univ associate professor charged with sexual battery - ABC ...',
       'Report Warns of Terror Unpreparedness - ABC News',
       "Epstein's former house manager testifies, calls Ghislaine Maxwell ...",
       "Nobel doctor calls sexual violence i

In [11]:
#Clean text
abc_df['text'] = abc_df['text'].apply(lambda t: unicodedata.normalize('NFKD', t))

### Associated Press Dataset

In [12]:
#visualize dataframe
ap_df 

Unnamed: 0,url,headline,date,source,text
0,https://apnews.com/article/lifestyle...,2 hurt when part of student center c...,2022-01-13 12:53:56+00:00,AP News,"TALLADEGA, Ala. (AP) — Two workers w..."
1,https://apnews.com/article/georgia-u...,Georgia hospital agrees to measures ...,2022-01-09 15:58:08+00:00,AP News,"CALHOUN, Ga. (AP) — Federal authorit..."
2,https://apnews.com/article/sports-ba...,Wednesday's Scores | AP News,2022-01-27 05:20:44+00:00,AP News,"BOYS PREP BASKETBALL=Ash Fork 50, Gr..."
...,...,...,...,...,...
374,https://apnews.com/article/7bd22eafb...,Blind Lobby for Bill to Ban Seating ...,1990-02-07 07:41:00+00:00,AP News,\t WASHINGTON (AP) _ Scores of bli...
375,https://apnews.com/article/048cc469c...,Judaism in Silence: New Sign Languag...,1985-05-07 18:06:00+00:00,AP News,"\t NEWARK, N.J. (AP) _ Naomi Mille..."
376,https://apnews.com/article/03d6ff92f...,No One Took Girl's Threats Seriously...,1985-12-04 19:16:00+00:00,AP News,"\t SPANAWAY, Wash. (AP) _ Danny Ga..."


In [13]:
# Reorganize columns
ap_df = ap_df[['headline','date','source','url','text']]

# Convert 'date' to datetime format and only visualize date
ap_df['date'] = pd.to_datetime(ap_df['date'])

#Remove articles before 1980-01-01
ap_df = ap_df[~(ap_df['date']<='1980-01-01')]

ap_df['source'] = 'ap'

In [14]:
# Look for null values
ap_df.isnull().sum()

headline    0
date        0
source      0
url         0
text        0
dtype: int64

In [15]:
#Comb for unique values in the 'headline' column
ap_df['headline'].unique()

array(['2 hurt when part of student center collapses at deaf school | AP News',
       'Georgia hospital agrees to measures to help deaf patients | AP News',
       "Wednesday's Scores | AP News",
       'New NYC mayor says kids safe in school despite virus surge | AP ...',
       "Tuesday's Scores | AP News", "Monday's Scores | AP News",
       "Saturday's Scores | AP News", "Thursday's Scores | AP News",
       "Friday's Scores | AP News",
       'Mickey Guyton, Jhené Aiko, Mary Mary to sing at Super Bowl | AP ...',
       'Omicron surge is undermining care for other health problems | AP ...',
       'West Virginia lawmakers introduce 15-week abortion ban | AP News',
       'Ómicron trastoca el regreso a las escuelas en EEUU | AP News',
       'Prosecutor: Alex Murdaugh now faces 71 charges; $8.5M stolen | AP ...',
       'What to watch out for when Oscar noms are announced Tuesday ...',
       'New this week: Mary J. Blige, Jennifer Lopez and Puppy Bowl | AP ...',
       '2022 SAG n

In [17]:
#Remove articles that report sports scores
scores_pattern = r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day's Scores"
ap_df = ap_df[~ap_df['headline'].str.contains(scores_pattern)]

#Clean text
ap_df['text'] = ap_df['text'].apply(lambda t: unicodedata.normalize('NFKD', t))
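
The score-listing filter can be checked on plain strings before it is applied to the dataframe. This is a small sketch using only the standard library; `SCORES_PATTERN` and `is_scores_headline` are hypothetical names, and the headlines are taken from the output shown above.

```python
import re

# Matches "<Weekday>'s Scores" for any day of the week
SCORES_PATTERN = re.compile(r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day's Scores")

def is_scores_headline(headline):
    """Return True if the headline looks like a daily sports score listing."""
    return SCORES_PATTERN.search(headline) is not None

headlines = [
    "Wednesday's Scores | AP News",
    'Georgia hospital agrees to measures to help deaf patients | AP News',
    "Saturday's Scores | AP News",
    '2 hurt when part of student center collapses at deaf school | AP News',
]

# Keep only the headlines that are not score listings
kept = [h for h in headlines if not is_scores_headline(h)]
print(kept)
```

The pattern string itself (`SCORES_PATTERN.pattern`) can be passed to `Series.str.contains`, which treats its argument as a regular expression by default.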

### The Guardian Dataset

In [18]:
#visualize dataframe
guard_df

Unnamed: 0,url,date,source,headline,text
0,https://www.theguardian.com/society/...,2022-01-27 19:51:16+00:00,The Guardian,British Sign Language to become reco...,British Sign Language (BSL) is on co...
1,https://www.theguardian.com/society/...,2021-12-09 18:10:02+00:00,The Guardian,Scottish health board apologises ove...,A Scottish health board has apologis...
2,https://www.theguardian.com/tv-and-r...,2022-01-10 17:36:39+00:00,The Guardian,Strictly: sign language interpreter ...,She was the first deaf contestant an...
...,...,...,...,...,...
6838,https://www.theguardian.com/theguard...,1954-11-12 15:10:17+00:00,The Guardian,Pensioners demand £2 10s a week - fr...,About four thousand old age pensione...
6839,https://www.theguardian.com/theobser...,1932-05-22 13:38:00+00:00,The Guardian,First woman to fly the Atlantic,"Miss Amelia Earhart, the American fl..."
6840,https://www.theguardian.com/world/18...,1865-02-07 02:35:09+00:00,The Guardian,Beethoven conducts Fidelio,Extracts from Louis Spohr's autobiog...


In [19]:
# Reorganize columns
guard_df = guard_df[['headline','date','source','url','text']]

# Convert 'date' to datetime format and only visualize date
guard_df['date'] = pd.to_datetime(guard_df['date'])

#Remove articles before 1980-01-01
guard_df = guard_df[~(guard_df['date']<='1980-01-01')]

guard_df['source'] = 'guard'

In [20]:
# Look for null values
guard_df.isnull().sum()

headline    0
date        0
source      0
url         0
text        0
dtype: int64

In [21]:
#Comb for unique values in the 'headline' column
guard_df['headline'].unique()

array(['British Sign Language to become recognised language in the UK ',
       'Scottish health board apologises over late diagnosis of deaf children',
       'Strictly: sign language interpreter to be projected on to big screens at live shows',
       ..., 'Yours sincerely, FG Pink', 'Revenge of the old guard',
       "Magus's maggot"], dtype=object)

In [23]:
#Clean text
guard_df['text'] = guard_df['text'].apply(lambda t: unicodedata.normalize('NFKD', t))

### New York Times Dataset

In [24]:
#visualize dataframe
nyt_df                                    

Unnamed: 0,url,headline,date,source,text
0,https://www.nytimes.com/2021/11/19/p...,How the Beatles Broke Up and the Dea...,2021-11-19 10:30:02+00:00,The New York Times,"This weekend, listen to a collection..."
1,https://www.nytimes.com/2021/10/10/o...,Don’t Fear a Deafer Planet,2021-10-10 15:00:07+00:00,The New York Times,"In Deaf culture, we have a rich stor..."
2,https://www.nytimes.com/2021/10/01/s...,"R. Allen Gardner, 91, Dies; Taught S...",2021-10-01 16:37:51+00:00,The New York Times,Washoe was 10 months old when her fo...
...,...,...,...,...,...
744,https://www.nytimes.com/1985/07/29/o...,Where Chimpanzees Use Sign Language,1985-07-29 05:00:00+00:00,The New York Times,Credit...The New York Times Archives...
745,https://www.nytimes.com/1984/02/06/o...,FIRE ALARMS FOR THE DEAF,1984-02-06 05:00:00+00:00,The New York Times,Credit...The New York Times Archives...
746,https://www.nytimes.com/1982/07/22/o...,DEAF AND SAFE DRIVERS,1982-07-22 05:00:00+00:00,The New York Times,Credit...The New York Times Archives...


In [25]:
# Reorganize columns
nyt_df = nyt_df[['headline','date','source','url','text']]

# Convert 'date' to datetime format and only visualize date
nyt_df['date'] = pd.to_datetime(nyt_df['date'])

#Remove articles before 1980-01-01
nyt_df = nyt_df[~(nyt_df['date']<='1980-01-01')]

nyt_df['source'] = 'nyt'

# Visualize 'text' to search for errors
#nyt_df['text']

In [26]:
#Look for null values
nyt_df.isnull().sum()

headline    0
date        0
source      0
url         0
text        0
dtype: int64

In [27]:
#Comb for unique values in the 'headline' column
nyt_df['headline'].unique()

array(['How the Beatles Broke Up and the Deaf Football Team Taking California by Storm: The Week in Narrated Articles',
       'Don’t Fear a Deafer Planet',
       'R. Allen Gardner, 91, Dies; Taught Sign Language to a Chimp Named Washoe',
       'Barbara Kannapell, Activist Who Empowered Deaf People, Dies at 83',
       'Lesson of the Day: ‘Black, Deaf and Extremely Online’',
       'Black, Deaf and Extremely Online',
       'I Think Beethoven Encoded His Deafness in His Music',
       'Mothering While Deaf in a Newly Quiet World',
       'A Deaf-Blind Dishwasher Achieves His Childhood Dream: Movie Actor',
       'The Queer, Half-Deaf Actor Redefining the Idea of a Leading Man',
       'Giannis Antetokounmpo Is Called Amazing. Now in Sign Language, Too.',
       'University Denounced for Showing Sign Language for ‘Jewish’ as a Hooked Nose',
       'Harlan Lane, Vigorous Advocate for Deaf Culture, Dies at 82',
       'At Banks and Fund Firms, Access Is Too Often Denied, Blind and Deaf 

### USA Today Dataset

In [29]:
#create dataframe using dataset

#visualize dataframe
usa_df 

Unnamed: 0,url,headline,source,date,text
0,https://www.usatoday.com/story/enter...,"Deaf Detroit rapper to join Eminem, ...",USA Today,"Published: 8:16 p.m. ET Feb. 4, 2022...",Sean Forbes had expected to experien...
1,https://www.usatoday.com/story/news/...,"Black, deaf, proud: Gallaudet Univer...",USA Today,"Published: 5:01 a.m. ET Feb. 3, 2022...",WASHINGTON – It’s a chilly December ...
2,https://www.usatoday.com/story/money...,Reviewed launches new vertical dedic...,USA Today,"Published: 8:58 a.m. ET Jan. 10, 202...",— Recommendations are independently ...
...,...,...,...,...,...
1323,https://www.usatoday.com/story/popca...,Early Buzz: Post-Halloween treats,USA Today,"Published: 10:49 a.m. ET Nov. 1, 201...","Happy November, Pop Readers! I can't..."
1324,https://www.usatoday.com/story/enter...,Singer who lost hearing while attend...,USA Today,"Published: 12:26 p.m. ET June 7, 201...",Mandy Harvey's voice was heard aroun...
1325,https://www.usatoday.com/story/news/...,Daughters of man killed by police sp...,USA Today,"Published: 6:27 p.m. ET Sept. 29, 20...",After hearing that Louisville Metro ...


In [30]:
# Reorganize columns
usa_df = usa_df[['headline','date','source','url','text']]

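# Convert 'date' to datetime format
# (temporarily silence the chained-assignment warning while doing so)
pd.options.mode.chained_assignment = None

# Keep only the publication timestamp, dropping any trailing "Updated ..." part
usa_df['date'] = usa_df['date'].str.extract(r'Published:? (.*?)(?:Updated:?.*)?$')
# Drop the 'ET' time zone label before parsing
usa_df['date'] = usa_df['date'].str.replace('ET', '')
usa_df['date'] = pd.to_datetime(usa_df['date'])

pd.options.mode.chained_assignment = 'warn'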

# Remove articles before 1980-01-01
usa_df = usa_df[~(usa_df['date']<='1980-01-01')]

usa_df['source'] = 'usa'

#Find data types
usa_df.dtypes

headline            object
date        datetime64[ns]
source              object
url                 object
text                object
dtype: object
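
The date extraction for USA Today is easier to follow on single strings. The sketch below reuses the same regular expression with the standard library; `extract_published` is a hypothetical helper, and the sample strings are assumptions modeled on the truncated values shown in the table above, since the full raw format is not visible there.

```python
import re

# Same pattern as in the cell above: capture what follows "Published" and
# stop before an optional trailing "Updated ..." part
PUBLISHED_PATTERN = re.compile(r'Published:? (.*?)(?:Updated:?.*)?$')

def extract_published(raw):
    """Return the publication timestamp text without the 'ET' label,
    or None when the string has no 'Published' prefix."""
    match = PUBLISHED_PATTERN.search(raw)
    if match is None:
        return None
    without_zone = match.group(1).replace('ET', '')
    # Collapse the double space left behind by removing 'ET'
    return ' '.join(without_zone.split())

# Hypothetical raw values
print(extract_published('Published: 8:16 p.m. ET Feb. 4, 2022'))
print(extract_published('Published: 5:01 a.m. ET Feb. 3, 2022 Updated: 9:12 a.m. ET Feb. 3, 2022'))
print(extract_published('Feb. 3, 2022'))
```

Strings that do not match the pattern yield no value at all, which is one plausible reason a few rows can end up with a missing date after this step.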

In [31]:
#look for null values
print(usa_df.isnull().sum())
#drop the rows whose date is missing
usa_df.dropna(inplace=True)

headline    0
date        4
source      0
url         0
text        0
dtype: int64


In [32]:
#Comb for unique values in the 'headline' column
usa_df['headline'].unique()

array(['Deaf Detroit rapper to join Eminem, Dre, Snoop at Super Bowl halftime',
       'Black, deaf, proud: Gallaudet University embraces all of its diversity',
       'Reviewed launches new vertical dedicated to accessibility', ...,
       'Early Buzz: Post-Halloween treats',
       "Singer who lost hearing while attending CSU wows America's Got ...",
       'Daughters of man killed by police speak out'], dtype=object)

In [34]:
#Clean Text
usa_df['text'] = usa_df['text'].apply(lambda t: unicodedata.normalize('NFKD', t))

In [35]:
# List of dataframes for function iteration
df_list = [abc_df, ap_df, guard_df, nyt_df, usa_df]

In [36]:
combined_df = pd.concat(df_list).reset_index(drop=True)
combined_df['year'] = combined_df['date'].apply(lambda d: d.year)
combined_df

Unnamed: 0,headline,date,...,text,year
0,Prosecutor: Alex Murdaugh now faces ...,2022-01-21 18:17:00,...,"COLUMBIA, S.C. -- A once-prominent S...",2022
1,Undefeated: Deaf football team bring...,2021-11-20 12:59:00,...,"Once considered underdogs, the footb...",2021
2,Referee accused of discriminating ag...,2021-12-30 02:53:00,...,The American Civil Liberties Union i...,2021
...,...,...,...,...,...
9137,Early Buzz: Post-Halloween treats,2012-11-01 10:49:00,...,"Happy November, Pop Readers! I can't...",2012
9138,Singer who lost hearing while attend...,2017-06-07 12:26:00,...,Mandy Harvey's voice was heard aroun...,2017
9139,Daughters of man killed by police sp...,2016-09-29 18:27:00,...,After hearing that Louisville Metro ...,2016


# Data Analysis & Results (EDA)


### Sentiment Analysis 

In this section, we define a few helper functions for our analysis and extend the dataframes with the analysis results.

In [37]:
#find sentiment for a given piece of text
def get_sentiment(text):
    blob = TextBlob(text)
    polarity, subjectivity = blob.sentiment
    return polarity, subjectivity

In [38]:
#cleans text and returns textblob object for keyword analysis 
def cleaned_blob(text):    
    #removes curly quotation marks; replaces periods, commas, dashes, and hyphens with spaces
    for mark in ('‘', '’', '“', '”'):
        text = text.replace(mark, '')
    for mark in ('.', ',', '–', '-'):
        text = text.replace(mark, ' ')
    #removes stopwords (the NLTK list is lowercase, so compare lowercased words)
    stop_words = set(stopwords.words('english'))
    words_list = (x for x in TextBlob(text).words if x.lower() not in stop_words)
    #removes numbers, not relevant for keyword analysis
    words_list = (x for x in words_list if x.isalpha())
    #lemmatizes
    words_list = (Word(word).lemmatize() for word in words_list)
    # joins all words into one string
    cleaned = ' '.join(words_list)
    b = TextBlob(cleaned) 
    #remove leading/trailing whitespace and makes all lowercase
    b = b.strip()
    b = b.lower()
    return b
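
Stopword lists such as NLTK's are lowercase, so the order of lowercasing and filtering matters: a capitalized "The" only gets removed if the comparison is made on lowercased tokens. The sketch below shows this on plain strings. `STOPWORDS` is a tiny stand-in list assumed for illustration (not the NLTK list), `filter_tokens` is a hypothetical helper, and the sentence is made up.

```python
# Tiny stand-in stopword set (an assumption for this sketch); the notebook
# itself relies on the English stopword list from NLTK.
STOPWORDS = frozenset({'the', 'a', 'of', 'and', 'to', 'in', 'is'})

def filter_tokens(text, stop_set=STOPWORDS):
    """Lowercase, split on whitespace, then drop stopwords and non-alphabetic tokens."""
    tokens = text.lower().split()
    return [t for t in tokens if t.isalpha() and t not in stop_set]

tokens = filter_tokens('The school opened in 1864 and is the first of its kind')
print(tokens)
```

Keeping the stopwords in a set that is built once also makes each membership check cheap, which matters when the function runs over thousands of articles.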

Now we will apply get_sentiment to each news dataframe and add two columns: the polarity score (ranging from -1 to 1) and the subjectivity score (ranging from 0 to 1).

In [39]:
news_data_sent = [abc_df, ap_df, guard_df, nyt_df, usa_df]
for df in news_data_sent:
    df[['polarity', 'subjectivity']]=df.apply(lambda x: get_sentiment(x['text']),axis=1,
                             result_type='expand')

In [41]:
#test to see if properly configured
abc_df

Unnamed: 0,headline,date,...,polarity,subjectivity
0,Prosecutor: Alex Murdaugh now faces ...,2022-01-21 18:17:00,...,0.066518,0.394252
1,Undefeated: Deaf football team bring...,2021-11-20 12:59:00,...,0.117981,0.488425
2,Referee accused of discriminating ag...,2021-12-30 02:53:00,...,0.054921,0.421270
...,...,...,...,...,...
158,Dunwoody Day Care Trial: Widow 'Didn...,2012-02-24 16:02:00,...,0.000680,0.437609
159,Police Investigate Deaf Student Homi...,2006-01-07 15:05:00,...,0.074857,0.390068
160,Rush Limbaugh Suffers Hearing Loss -...,2006-01-07 15:26:00,...,0.066071,0.438921


In [42]:
# List of dataframes for function iteration
df_list = [abc_df, ap_df, guard_df, nyt_df, usa_df]

In [45]:
combined_df = pd.concat(df_list).reset_index(drop=True)
combined_df['year'] = combined_df['date'].apply(lambda d: d.year)
combined_df

Unnamed: 0,headline,date,...,subjectivity,year
0,Prosecutor: Alex Murdaugh now faces ...,2022-01-21 18:17:00,...,0.394252,2022
1,Undefeated: Deaf football team bring...,2021-11-20 12:59:00,...,0.488425,2021
2,Referee accused of discriminating ag...,2021-12-30 02:53:00,...,0.421270,2021
...,...,...,...,...,...
9137,Early Buzz: Post-Halloween treats,2012-11-01 10:49:00,...,0.467978,2012
9138,Singer who lost hearing while attend...,2017-06-07 12:26:00,...,0.405035,2017
9139,Daughters of man killed by police sp...,2016-09-29 18:27:00,...,0.397457,2016


### Data Analysis 