# This notebook performs preliminary processing and cleaning of the csv data files I created from raw LexisNexis xml files. Some additional cleaning may occur during data analysis, as additional issues or problems with the raw data arise.
***

In [2]:
%run Functions.ipynb

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Commjhub/jupyterhub/comm318_fall2019/jdlish/nltk_data
[nltk_data]     ...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Commjhub/jupyterhub/comm318_fall2019/jdlish/nltk_data
[nltk_data]     ...
[nltk_data]   Unzipping corpora/stopwords.zip.


### CNN Data Preliminary Processing and Cleaning

In [3]:
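#Create a CNN dataframe from the pipe-delimited file, silently skipping malformed lines
#(on_bad_lines='skip' needs pandas>=1.3; it replaces error_bad_lines/warn_bad_lines, which were removed in pandas 2.0)
data_cnn = pd.read_csv('Data Files/CNN and Fox Csv Files/cnn_data.csv', sep='|', header=None, on_bad_lines='skip')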


In [4]:
data_cnn.columns = ["full_date","body_text","word_count"]

In [5]:
data_cnn["News Outlet"]="CNN"

In [6]:
#Create a column tokenizing the text of each news transcript. This will save time later in each analysis step.
strip_chars=';:."--,!$%^&*@#'
data_cnn["Tokenized Text"]=data_cnn["body_text"].apply(tokenize,args=(True,strip_chars))
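The real `tokenize` function lives in Functions.ipynb and is not shown in this notebook. As a rough illustration only, a tokenizer that accepts the same positional arguments could look like the sketch below. The name `tokenize_sketch`, the reading of the boolean argument as "lowercase the text", and the whitespace-splitting approach are all assumptions, not the actual implementation.

```python
def tokenize_sketch(text, lowercase=True, strip_chars=""):
    """Hypothetical stand-in for tokenize(): split on whitespace and
    strip the given characters from both ends of every token."""
    if lowercase:
        text = text.lower()
    tokens = []
    for raw in text.split():
        token = raw.strip(strip_chars)  # removes leading/trailing punctuation only
        if token:                       # drop tokens that were pure punctuation
            tokens.append(token)
    return tokens


strip_chars = ';:."--,!$%^&*@#'
example = tokenize_sketch("The pandemic, they said: COVID19!", True, strip_chars)
print(example)  # ['the', 'pandemic', 'they', 'said', 'covid19']
```

Lowercasing at this stage matters for the later key word search, because a case-sensitive match on 'coronavirus' would otherwise miss 'Coronavirus' at the start of a sentence.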

In [8]:
#Make dates into strings to remove problems with floats vs. strings in the same column
data_cnn['full_date']=data_cnn['full_date'].astype(str)

### Fox Data Preliminary Processing and Cleaning

In [9]:
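#Create a Fox dataframe from the pipe-delimited file, silently skipping malformed lines
#(on_bad_lines='skip' needs pandas>=1.3; it replaces error_bad_lines/warn_bad_lines, which were removed in pandas 2.0)
data_fox = pd.read_csv('Data Files/CNN and Fox Csv Files/fox_data.csv', sep='|', header=None, on_bad_lines='skip')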

In [10]:
data_fox.columns = ["full_date","body_text","word_count"]

In [11]:
data_fox["News Outlet"]="Fox"

In [12]:
#Create a column tokenizing the text of each news transcript. This will save time later in each analysis step.
strip_chars=';:."--,!$%^&*@#'
data_fox["Tokenized Text"]=data_fox["body_text"].apply(tokenize,args=(True,strip_chars))

In [13]:
#Make dates into strings to remove problems with floats vs. strings in the same column
data_fox['full_date']=data_fox['full_date'].astype(str)

### Each Fox and CNN transcript in the dataframes above contains coverage of the coronavirus. However, the transcripts are very long (thousands of words each), and from reading a few of them, much of the content does not pertain to the coronavirus: tangents, side conversations between news anchors, and filler unrelated to the topic. For my data analysis I therefore decided to focus only on the portions of each transcript that directly pertain to the coronavirus, which I located using the key words 'coronavirus', 'pandemic', and 'covid19'. I call the result "targeted text". I believe it removes a lot of noise from the transcripts and gives a more accurate representation of each news outlet's coronavirus response. To create the targeted text, I wrote a function that builds KWICs (keyword-in-context windows) for each transcript from the three key words above and then flattens them into a single corpus per transcript. Each KWIC includes the 10 words before and after the key word. This function is called make_flattened_kwic and can be found in my functions notebook.
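The idea behind a flattened KWIC can be sketched in a few lines. This is not the code of make_flattened_kwic, which lives in the functions notebook. The name `flattened_kwic_sketch`, the assumption that the input is a list of tokens, and the choice to return an empty string when no key word is found are all illustrative guesses.

```python
def flattened_kwic_sketch(kw, text, win=10):
    """Hypothetical sketch: collect `win` tokens on either side of every
    key word hit and join all of the windows into one string."""
    keywords = set(kw)
    windows = []
    for i, token in enumerate(text):
        if token in keywords:
            start = max(0, i - win)  # do not run past the start of the transcript
            windows.append(" ".join(text[start:i + win + 1]))
    # With no hits the list is empty and the result is ''
    return " ".join(windows)


tokens = "a b c pandemic d e f".split()
print(flattened_kwic_sketch(["coronavirus", "pandemic", "covid19"], tokens, win=1))
# c pandemic d
```

One consequence of this simple design is that windows around nearby key words overlap, so the same words can appear more than once in the flattened text. Whether the real function deduplicates overlapping windows is not stated here.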

### The code below creates a new feature in my CNN and Fox dataframes called 'targeted text,' which contains the information as outlined above. I will conduct all of my data analysis using targeted text. 

In [16]:
keys=['coronavirus',"pandemic","covid19"]

#For Fox: build one flattened KWIC string per tokenized transcript
new_col_fox2=[make_flattened_kwic(kw=keys,text=text,win=10) for text in data_fox['Tokenized Text']]

#For CNN: same key words and window size
new_col_cnn2=[make_flattened_kwic(kw=keys,text=text,win=10) for text in data_cnn['Tokenized Text']]

#Add 'targeted text' as a new column to each dataframe

data_fox['targeted text']=new_col_fox2
data_cnn['targeted text']=new_col_cnn2

### LexisNexis was supposed to return only news broadcasts containing the words 'coronavirus', 'pandemic', or 'covid19'. For some reason, some of the broadcasts don't contain any of those words, so I am removing them from both the CNN and Fox data. These are the rows whose targeted text is empty.

In [17]:
#CNN
data_cnn=data_cnn[data_cnn['targeted text']!=''].reset_index()

In [18]:
#Fox
data_fox=data_fox[data_fox['targeted text']!=''].reset_index()
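A note on the filtering step: by default, pandas `reset_index()` keeps the old row labels in a new column named 'index', while `reset_index(drop=True)` discards them. The toy example below uses made-up rows (not the real data) to show the difference, in case the extra column is not wanted in later analysis.

```python
import pandas as pd

# Made-up rows standing in for the real transcripts
toy = pd.DataFrame({"targeted text": ["covid19 cases rise", "", "the pandemic spread"]})

# Keep only rows whose targeted text is non-empty
kept = toy[toy["targeted text"] != ""]

with_old_labels = kept.reset_index()       # adds an 'index' column holding 0 and 2
clean = kept.reset_index(drop=True)        # renumbers rows 0..1 without the extra column

print(list(with_old_labels.columns))  # ['index', 'targeted text']
print(list(clean.columns))            # ['targeted text']
```

Keeping the old labels can be useful for tracing a row back to its position in the original csv file. Dropping them keeps the dataframe to the columns that are actually analyzed.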