# This Python notebook describes how I collected the data for this project and how I extracted and processed the raw data into organized csv files for my analysis. I ran the code below locally to make the data I collected workable in Python.
***

## My corpus consists of Fox and CNN news transcripts obtained from Lexis Nexis (Nexis Uni). I batch requested transcripts from both networks that contain any of the following words: 'coronavirus', 'covid19', or 'pandemic'. These search terms should capture a large corpus of news transcripts covering the coronavirus.

## Lexis Nexis delivered my data in two forms: text files and xml files, each containing a single news transcript. Because my research question asks how Fox and CNN differ in their coronavirus coverage, I need to analyze the transcripts over time at both the monthly and daily level. The text files did not contain date information, while the xml files did, so I built my corpus by extracting the information from each xml file and aggregating it. I received approximately 2,000 xml files containing CNN news transcripts and approximately 463 xml files containing Fox news transcripts, which was too much data to upload to the class's server as individual xml files.

## I therefore wrote a program that parses each xml file, for both Fox and CNN, and extracts the transcript text, date, and word count. It writes the extracted data to a csv file in which each row represents one news transcript. I made a separate csv file for each network, so the end result is 2 csv files containing all of the information I need for this project. These csv files can be found in the data folder.

## I organized my corpus into pandas dataframes because I have used pandas before, and it makes manipulating, organizing, and analyzing the CNN and Fox news transcripts easier. This project also involves quite a bit of data visualization, and plotting pandas dataframes with matplotlib is a convenient way to do it.

## Below is the code I wrote to parse each xml file, extract the desired data, and write each transcript's data to the csv files described above. Although I could not upload the individual xml files to the class's server for space and organization reasons, the code shows how I extracted the data I needed from the Lexis Nexis batch request on my local machine.

### Function to parse through xml files and extract the desired information described above:

In [1]:
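# Parse one Lexis Nexis xml file into a pipe-delimited csv row
import xml.etree.ElementTree as ET

def csv_row(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()

    # Date: first 10 characters of the third top-level element's text.
    # Fall back to "NA" when the metadata block does not have the expected
    # five children or when the element has no text.
    full_date = "NA"
    if len(root[5][0][2][0]) == 5 and root[2].text is not None:
        full_date = root[2].text[0:10]

    # Body: join the text of every paragraph element, skipping empty ones
    # (elem.text is None for elements without text, which would break join)
    paragraphs = [elem.text for elem in root[5][0][1][1][0] if elem.text]
    body_text = ' '.join(paragraphs)

    # Word count is stored in the 'number' attribute; use "NA" if it is absent
    word_count = root[5][0][2][1].attrib.get('number', "NA")

    # Quote each field and separate the fields with a pipe
    total_text = '"' + full_date + '"|"' + body_text + '"|"' + word_count + '"'

    return total_text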
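
The chained indexes in `csv_row` (for example `root[5][0][2][1]`) select child elements by position. The sketch below illustrates that idea on a small made-up document. The tag names and layout in `SAMPLE` are assumptions for illustration only and are not the real Lexis Nexis structure. It also shows the same values looked up by tag name, which may be easier to read when the tag names are known.

```python
import xml.etree.ElementTree as ET

# Made-up miniature document (hypothetical tags, not the Lexis Nexis layout)
SAMPLE = """<entry>
  <title>Example broadcast</title>
  <published>2020-03-15T00:00:00Z</published>
  <content>
    <body><p>First paragraph.</p><p>Second paragraph.</p></body>
    <meta><wordCount number="4"/></meta>
  </content>
</entry>"""

root = ET.fromstring(SAMPLE)

# root[i] is the i-th child element, so chained indexes walk down the tree
date_text = root[1].text[0:10]                    # first 10 characters of <published>
paragraphs = [p.text for p in root[2][0]]         # every child of <body>
word_count = root[2][1][0].attrib.get('number')   # attribute on <wordCount>

# The same values looked up by tag name instead of by position
date_by_tag = root.find('published').text[0:10]
count_by_tag = root.find('./content/meta/wordCount').get('number')
```

Positional indexes break silently if a file has an extra or missing sibling, which is one reason `csv_row` checks the shape of the metadata block before trusting the date.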

### CNN Data Extraction:

In [None]:
import os
from os import walk

# Move into the CNN batch directory delivered by Lexis Nexis
os.chdir('/Users/justinlish/Desktop/Junior Year/Comm313/jdlish-b830-s2020-04-09/_coronavirus__OR__2020_01_01_to_2020_04_09_srcMTA1MjUxNA/')


f=[]
for (dirpath, dirnames, filenames) in walk('.'):
    if dirpath=="./xml":
        for filename in filenames:
            if filename.endswith('.xml') :
                name = dirpath + "/" + filename 
                f.append(name)
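
The same file collection step is repeated for CNN and Fox, so it could be wrapped in a helper. The sketch below is one possible way to do that with `pathlib`. The function name `list_xml_files` is hypothetical and not part of my original code, and the temporary directory only stands in for a batch folder.

```python
import tempfile
from pathlib import Path

def list_xml_files(batch_dir):
    """Return the sorted paths of every .xml file inside batch_dir/xml."""
    return sorted(str(p) for p in (Path(batch_dir) / "xml").glob("*.xml"))

# Demonstration on a throwaway directory that mimics the batch layout
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "xml").mkdir()
    (Path(tmp) / "xml" / "a.xml").write_text("<doc/>")
    (Path(tmp) / "xml" / "notes.txt").write_text("skip me")
    found = [Path(p).name for p in list_xml_files(tmp)]
```

Passing the batch directory as an argument also avoids calling `os.chdir`, so the csv files can be written wherever the notebook is running.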


In [None]:
# Build one csv row per xml file and write the rows to cnn_data.csv
mapped = [csv_row(x) for x in f]
with open('cnn_data.csv', 'w') as result_file:
    for t in mapped:
        result_file.write(t + "\n")
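
One caveat with building each row by string concatenation is that a transcript containing a double quote can confuse a csv reader. As a hedged alternative, the sketch below writes the same three fields with Python's `csv` module, which escapes embedded quotes. The helper name `write_rows` and the sample row are made up for illustration, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def write_rows(rows, handle):
    """Write (date, body, word_count) tuples as quoted, pipe-delimited rows."""
    writer = csv.writer(handle, delimiter="|", quoting=csv.QUOTE_ALL)
    writer.writerows(rows)

# Made-up row whose body contains double quotes
rows = [("2020-03-15", 'The anchor said "stay home" tonight.', "6")]

buffer = io.StringIO()   # for a real file, use open(path, "w", newline="")
write_rows(rows, buffer)

# Read the data back to confirm the quotes survive the round trip
buffer.seek(0)
round_trip = list(csv.reader(buffer, delimiter="|"))
```

With this approach `csv_row` would return a tuple of fields instead of a preformatted string.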

### Fox Data Extraction:

In [None]:
#FOX Data Processing
os.chdir('/Users/justinlish/Desktop/Junior Year/Comm313/jdlish-b830-s2020-04-09/_coronavirus__OR__2020_01_01_to_2020_04_09_srcMTA2NTkxOQ/')

f=[]
for (dirpath, dirnames, filenames) in walk('.'):
    if dirpath=="./xml":
        for filename in filenames:
            if filename.endswith('.xml') :
                name = dirpath + "/" + filename 
                f.append(name)

In [None]:

# Build one csv row per xml file and write the rows to fox_data.csv
mapped = [csv_row(x) for x in f]
with open('fox_data.csv', 'w') as result_file:
    for t in mapped:
        result_file.write(t + "\n")

## The code above produced the 2 csv files that I uploaded to the server, cnn_data.csv and fox_data.csv. Each row holds one transcript's date, text, and word count as quoted fields separated by a pipe (|). Together they contain all of the information this project needs from the news transcripts batch requested from Lexis Nexis.
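
As a quick check that the files are usable for the monthly analysis, the rows can be read back and counted per month. The sketch below uses only the standard library and made-up rows. It assumes that the 10-character date begins with the year and month (for example `2020-03`), which I have not shown above, and the function name `transcripts_per_month` is hypothetical.

```python
import csv
import io
from collections import Counter

def transcripts_per_month(handle):
    """Count rows per month, skipping rows whose date is "NA"."""
    counts = Counter()
    for date, _body, _words in csv.reader(handle, delimiter="|"):
        if date != "NA":
            counts[date[:7]] += 1   # assumes the date starts with YYYY-MM
    return counts

# Made-up rows in the same quoted, pipe-delimited layout as the csv files
sample = io.StringIO(
    '"2020-03-15"|"Text one"|"2"\n'
    '"2020-03-20"|"Text two"|"2"\n'
    '"NA"|"Text three"|"2"\n'
    '"2020-04-01"|"Text four"|"2"\n'
)
monthly = dict(transcripts_per_month(sample))
```

For the real files, `sample` would be replaced by `open('cnn_data.csv', newline='')`. Very long transcripts may require raising the limit with `csv.field_size_limit`. In pandas, `read_csv` with `sep='|'` and `header=None` should load the same layout into a dataframe.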