# Data Preparation

## PART 1: Selecting Data
* Select news orgs + classify them
* Read the news covering the Atlanta Shootings and select a few pieces
* Put them into a dictionary database

## Articles and Thoughts:
FOX: https://www.foxnews.com/opinion/tucker-carlson-atlanta-shooter-media-political-agenda
* mentioned how the shooter was struggling with mental health
* denied that the shooting was racially motivated
* brought up the BLM movement and made comparisons

NYT: https://www.nytimes.com/2021/03/19/opinion/atlanta-shooting-massage-sex-work.html
* "the victims lived at the nexus of race, gender and class"
* talked about how race, gender, and class intersected at this crime

CNN: https://www.cnn.com/2021/03/20/opinions/atlanta-spa-shootings-a-hate-crime-hong/index.html
* brought up the question: "how is this not a hate crime"
* called attempts to bring up anything else a "distraction"
* heavily called on the police's statements

AC: https://www.theamericanconservative.com/dreher/racializing-atlanta-massage-parlor-killings/
* lots of text excerpts
* called the media "radicalized"
* says that racism is bad but this is not racist
* excerpts have been cut for time

Notes for Cleaning:
* all files still contain \n and other backslash escape sequences
* CNN in particular has stray HTML character references such as "&#39"
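
Both artifacts noted above could be handled before any regex cleaning. The following is a minimal sketch, assuming the stray sequences are HTML character references for an apostrophe; the sample string is made up for illustration and is not taken from the CNN article.

```python
import html

# Hypothetical sample: an HTML character reference plus real newlines
raw = "It&#39;s not a distraction.\n\nIt is the point."

# html.unescape converts references such as &#39; back into characters
unescaped = html.unescape(raw)

# Collapse every run of whitespace (including newlines) into one space
flattened = " ".join(unescaped.split())

print(flattened)
```

Replacing newlines with a space rather than deleting them keeps the last word of one paragraph from fusing with the first word of the next.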

In [2]:
# load files and read into data
keys = ['nyt', 'fox', 'cnn', 'ac']
data = {}
for k in keys:
    with open("articles/opinions/" + k + ".txt", "rb") as file:
        # decode the raw bytes as UTF-8 and store them in a one-element list
        data[k] = [file.read().decode("utf-8")]
        # no explicit close needed: the with-block closes the file

In [3]:
# Print data
print(data)



In [4]:
data.keys()

dict_keys(['nyt', 'fox', 'cnn', 'ac'])

In [5]:
data['nyt']

['Among the first things I did upon learning about the shootings at three massage parlors in the Atlanta area was to check in with a former massage parlor worker I met in 2019. At the time, I was reporting an article about a prostitution raid at a Florida massage parlor.\n\nUnable to work during the pandemic, she was home alone when we spoke; the news from Atlanta hadn’t reached her yet. “Too frightening,” she said, when I sent her an article about what had happened. Robert Aaron Long, 21, who has been charged with the murder of eight people in Atlanta and nearby Acworth, six of them Asian women, had been arrested on his way to Florida — where she was — and where he planned on killing \\more, according to what he told the police. She worried for her colleagues. “Do you think someone will kill them? Am I in danger too?”\n\nI didn’t know how to respond, in part because I knew so little about those killed in Georgia: Hyun Jung Grant, 51; Suncha Kim, 69; Soon Chung Park, 74; Yong Ae Yu, 63

## PART 2: Cleaning Data
* lower case
* remove punctuation
* remove numbers
* remove common text
* tokenize
* remove stop words
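
The last two steps (tokenize, remove stop words) are not done by hand in this notebook; they are delegated to `CountVectorizer(stop_words='english')` when the document-term matrix is built. As a rough illustration of what that does to a single string, here is a sketch using the vectorizer's own analyzer. The sentence is made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function CountVectorizer applies to each
# document: lowercase, split into word tokens, drop English stop words
analyzer = CountVectorizer(stop_words="english").build_analyzer()

# Made-up sentence for illustration
tokens = analyzer("The shooter was struggling with mental health")
print(tokens)
```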


In [6]:
# Put into a dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 150)
data_df = pd.DataFrame.from_dict(data, orient="index")
data_df.columns = ['article']
print(data_df.index)
data_df

Index(['nyt', 'fox', 'cnn', 'ac'], dtype='object')


| | article |
|---|---|
| nyt | Among the first things I did upon learning about the shootings at three massage parlors in the Atlanta area was to check in with a former massage ... |
| fox | On the afternoon of March 16, police say, a 21-year-old man called Robert Aaron Long walked into a massage parlor outside Atlanta and shot five pe... |
| cnn | At an office where I worked some years ago, a male co-worker would repeatedly stop by my desk and interrupt my work to talk about sex. One of my c... |
| ac | It is striking to see how quickly our media has racialized the narrative of the horrific murders at the Georgia massage parlors. From what we know... |


In [7]:
# First Round of Text Cleaning
import re
import string

def clean_round1(text):
    text = text.lower()
    text = re.sub("b'", '', text)  # drop a leftover bytes-literal prefix (b')
    text = re.sub('\\n', '', text)  # remove newline characters
    # text = re.sub('nnn', ' n', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text) # remove punctuation
    text = re.sub('“', '', text)
    text = re.sub('”', '', text)
    text = re.sub('’', '', text)
    text = re.sub(r'\w*\d\w*', '', text) # remove any token that contains a digit
    text = re.sub('—', ' ', text)
    

    print(text)
    return text

# round1 = lambda x: clean_round1(x)

In [8]:
# Apply the first round of cleaning to each article
for k in keys:
    # str() turns the one-element list into its repr, e.g. "['Among the ...']"
    data_df.loc[k, 'article'] = clean_round1(str(data[k]))

among the first things i did upon learning about the shootings at three massage parlors in the atlanta area was to check in with a former massage parlor worker i met in  at the time i was reporting an article about a prostitution raid at a florida massage parlornnunable to work during the pandemic she was home alone when we spoke the news from atlanta hadnt reached her yet too frightening she said when i sent her an article about what had happened robert aaron long  who has been charged with the murder of eight people in atlanta and nearby acworth six of them asian women had been arrested on his way to florida   where she was   and where he planned on killing more according to what he told the police she worried for her colleagues do you think someone will kill them am i in danger toonni didnt know how to respond in part because i knew so little about those killed in georgia hyun jung grant  suncha kim  soon chung park  yong ae yu  daoyou feng  xiaojie tan  paul andre michels  and dela
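
The stray `nn` runs in the output above (`parlornnunable`, `toonni`) look like leftover paragraph breaks. Calling `str()` on a one-element list returns its repr, in which each newline is written as a backslash followed by `n`. The `'\\n'` pattern only matches real newline characters, so nothing is removed at that step; the punctuation step then strips the backslash and the `n` stays behind. One possible fix, sketched here with a hypothetical `clean_text` function and a made-up sample, is to pass the string itself and turn whitespace into spaces.

```python
import re
import string

def clean_text(text):
    """Hypothetical variant of clean_round1 that keeps word boundaries at newlines."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)  # newlines and tabs become single spaces
    # remove ASCII punctuation and curly quotes
    text = re.sub(r"[%s“”’]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\w*\d\w*", "", text)  # drop tokens that contain a digit
    text = text.replace("\u2014", " ")  # em dash becomes a space
    return re.sub(r" +", " ", text).strip()  # tidy repeated spaces

# Made-up sample with the same shape as data['nyt']: a one-element list
sample = ["a Florida massage parlor.\n\nUnable to work in 2019"]

# The repr holds backslash + n, not a real newline character
print("\n" in str(sample))

# Pass the string itself rather than str(list)
cleaned = clean_text(sample[0])
print(cleaned)
```

Under this approach tokens such as `accidentnnnew` should no longer reach the vocabulary, since the two words stay separated by a space.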

In [9]:
# Create Corpus
data_df

| | article |
|---|---|
| nyt | among the first things i did upon learning about the shootings at three massage parlors in the atlanta area was to check in with a former massage ... |
| fox | on the afternoon of march police say a man called robert aaron long walked into a massage parlor outside atlanta and shot five people long then ... |
| cnn | at an office where i worked some years ago a male coworker would repeatedly stop by my desk and interrupt my work to talk about sex one of my coll... |
| ac | it is striking to see how quickly our media has racialized the narrative of the horrific murders at the georgia massage parlors from what we know ... |


In [10]:
# Create Document Term Matrix
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(stop_words='english')
data_vect = vectorizer.fit_transform(data_df.article)
data_matrix = pd.DataFrame(data_vect.toarray(), columns=vectorizer.get_feature_names_out())
data_matrix.index = data_df.index
data_matrix

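| | aapi | aaron | aberrant | abolishing | absolutely | absurd | abundance | academics | accidentnnnew | according | ... | yon | yong | york | young | younndid | youre | yu | yun | yung | zero |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| nyt | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | ... | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| fox | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 4 | ... | 0 | 0 | 3 | 1 | 0 | 2 | 0 | 0 | 0 | 1 |
| cnn | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| ac | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |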


## PART 3: Pickle Data
We pickle the data so we can read it in later notebooks.
* corpus - one row per news outlet, holding the cleaned article text
* document term matrix - one row per document, one column per word; each cell counts how many times that word appears in that document
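
To make the document-term matrix layout concrete, here is a small hand-built version using only the standard library. The two documents are made up, and unlike the `CountVectorizer` call in this notebook the sketch does not remove stop words.

```python
from collections import Counter

# Made-up, already-cleaned documents standing in for the articles
docs = {
    "doc_a": "police say the shooting was a hate crime",
    "doc_b": "media say the media narrative was wrong",
}

# Count words per document, then build a shared, sorted vocabulary
counts = {name: Counter(text.split()) for name, text in docs.items()}
vocab = sorted(set().union(*counts.values()))

# One row per document, one column per vocabulary word
dtm = {name: [c[word] for word in vocab] for name, c in counts.items()}

print(vocab)
for name, row in dtm.items():
    print(name, row)
```

A word that a document never uses gets a 0 in that row, which is why most cells in the real matrix are zero.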

In [13]:
# Pickle Corpus Data
data_df.to_pickle('corpus.pkl')

In [14]:
# Pickle DTM 
data_matrix.to_pickle('dtm.pkl')
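
A later notebook would presumably load these files back with `pd.read_pickle`. The round trip below is a sketch on a tiny stand-in DataFrame written to a temporary directory, so it does not touch the real `corpus.pkl` or `dtm.pkl`.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for data_df; the real notebook pickles the full corpus
corpus = pd.DataFrame(
    {"article": ["cleaned text one", "cleaned text two"]},
    index=["nyt", "fox"],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "corpus.pkl")
    corpus.to_pickle(path)           # same call as in the cell above
    restored = pd.read_pickle(path)  # what a later notebook would run

# The index labels and the text column survive the round trip
print(restored.equals(corpus))
```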

In [None]:
# Pickle Cleaned Data
data_clean  # unfinished cell: data_clean is not defined anywhere above