# Data Analysis -- Measuring Coronavirus Word Choice and Messaging Focus -- Keyness Analysis

### The sentiment and KWIC analyses I conducted revealed an important difference between Fox's and CNN's coronavirus responses: consistency. Another important difference could lie in their word choice and vocabulary, and in their overall messaging from a content perspective. Keyness analysis is one tool that can provide insight into this.

### Keyness Analysis is a statistical tool often used to identify significant differences between two corpora. Essentially, it compares the normalized frequencies of linguistic items in the two corpora, where an item's normalized frequency is its raw frequency in a text or corpus scaled by the size of that text or corpus.
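
### The normalization described above can be sketched as follows. This is a minimal illustration on made-up tokens, not the broadcast data; the helper name `normalized_frequency` and the "per 10,000 tokens" scale are my own assumptions for the example.

```python
from collections import Counter

def normalized_frequency(counts, per=10_000):
    """Scale each raw count by the corpus size, expressed per `per` tokens."""
    total = sum(counts.values())
    return {word: count / total * per for word, count in counts.items()}

# Toy corpora (hypothetical text): "cases" has the same raw count in both,
# but the second corpus is twice as large.
corpus_a = Counter("cases rise as cases are confirmed".split())
corpus_b = Counter(
    "the cases and the response to the cases in the news tonight".split()
)

norm_a = normalized_frequency(corpus_a)
norm_b = normalized_frequency(corpus_b)

# Same raw count (2), different normalized frequency
print(corpus_a["cases"], corpus_b["cases"])
print(round(norm_a["cases"], 1), round(norm_b["cases"], 1))
```

### Because the raw counts are divided by corpus size, a word can have a higher raw count in the larger corpus and still be relatively more frequent in the smaller one.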

### In this project, I used Keyness Analysis to compare normalized word frequencies between Fox and CNN news broadcasts in both directions: CNN relative to Fox, and Fox relative to CNN.
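
### To make the comparison concrete, here is a rough sketch of one common keyness statistic, the log-likelihood score. The `calculate_keyness` function used later in this notebook comes from `data_processing.ipynb` and is not shown here, so it may compute keyness differently; the function names `log_likelihood` and `keyness_table` and the toy counts below are assumptions for illustration only.

```python
import math
from collections import Counter

def log_likelihood(a, b, size_a, size_b):
    """Log-likelihood score for a word seen `a` times in a corpus of
    `size_a` tokens and `b` times in a corpus of `size_b` tokens."""
    expected_a = size_a * (a + b) / (size_a + size_b)
    expected_b = size_b * (a + b) / (size_a + size_b)
    score = 0.0
    if a > 0:
        score += a * math.log(a / expected_a)
    if b > 0:
        score += b * math.log(b / expected_b)
    return 2 * score

def keyness_table(target, reference):
    """Rank words that are relatively more frequent in `target`."""
    size_t = sum(target.values())
    size_r = sum(reference.values())
    rows = []
    for word, a in target.items():
        b = reference.get(word, 0)
        # Keep only words whose normalized frequency is higher in the target
        if a / size_t > b / size_r:
            rows.append((word, a, b, log_likelihood(a, b, size_t, size_r)))
    return sorted(rows, key=lambda row: row[3], reverse=True)

# Hypothetical counts, not taken from the broadcasts
target = Counter({"cases": 20, "the": 80})
reference = Counter({"cases": 5, "the": 95})

for word, a, b, score in keyness_table(target, reference):
    print(word, a, b, round(score, 2))
```

### Running the function once with each corpus as the target gives the two directions of comparison: words key to the first corpus, and words key to the second.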

### This will give me insight into the vocabulary, word choice, and messaging differences between the two news outlets' coronavirus responses. It will help me identify which words in particular differ between Fox and CNN, which of those vocabulary differences are meaningful, and how messaging differs between the two outlets from a content perspective as a result. It also extends the KWIC analysis: rather than focusing granularly on certain months, the differences in messaging within and between those months, and messaging consistency, it draws insights about each outlet's overall messaging between January and April.

In [1]:
%run data_processing.ipynb

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Commjhub/jupyterhub/comm318_fall2019/jdlish/nltk_data
[nltk_data]     ...
[nltk_data]   Package vader_lexicon is already up-to-date!


### Setting Up Keyness Analysis

In [2]:
# Flatten the per-broadcast token lists into a single token list for CNN
flattened_tokens_cnn = []
for text in data_cnn['targeted text']:
    flattened_tokens_cnn.extend(tokenize(text, True, strip_chars=strip_chars))

# Do the same for Fox
flattened_tokens_fox = []
for text in data_fox['targeted text']:
    flattened_tokens_fox.extend(tokenize(text, True, strip_chars=strip_chars))

In [3]:
CNN_words=Counter(flattened_tokens_cnn)
Fox_words=Counter(flattened_tokens_fox)
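
### The flatten-and-count step above can be illustrated on a toy example. The tokenizer here, `simple_tokenize`, is a hypothetical stand-in for the project's `tokenize` function (whose exact behavior is defined in `data_processing.ipynb`), and the two "broadcasts" are made-up sentences.

```python
from collections import Counter
from itertools import chain

PUNCTUATION = ".,!?\"'"

def simple_tokenize(text):
    """Hypothetical stand-in tokenizer: lowercase, split on whitespace,
    and strip surrounding punctuation."""
    tokens = (tok.strip(PUNCTUATION) for tok in text.lower().split())
    return [tok for tok in tokens if tok]

# Made-up broadcast snippets
broadcasts = [
    "Cases rise. More cases confirmed!",
    "Testing expands as cases rise.",
]

# Flatten the per-broadcast token lists into one list, then count words
tokens = list(chain.from_iterable(simple_tokenize(t) for t in broadcasts))
word_counts = Counter(tokens)

print(word_counts.most_common(2))  # [('cases', 3), ('rise', 2)]
```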

In [4]:
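import math
# calculate_keyness is not defined in this notebook; it is expected to come from data_processing.ipynb
cnn_vs_fox=calculate_keyness(CNN_words, Fox_words,print_table=False)
# Rename the columns so the CNN-versus-Fox direction is explicit
cnn_vs_fox.columns=["CNN Word","CNN Word Frequency","Fox Word Frequency","Keyness CNN vs. Fox"]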

In [5]:
fox_vs_cnn=calculate_keyness(Fox_words, CNN_words,print_table=False)
fox_vs_cnn.columns=["Fox Word","Fox Word Frequency","CNN Word Frequency","Keyness Fox vs. CNN"]

In [6]:
combined_keyness=pd.concat([cnn_vs_fox.reset_index(drop=True), fox_vs_cnn.reset_index(drop=True)], axis=1)

In [8]:
combined_keyness.iloc[1:25]

| | CNN Word | CNN Word Frequency | Fox Word Frequency | Keyness CNN vs. Fox | Fox Word | Fox Word Frequency | CNN Word Frequency | Keyness Fox vs. CNN |
|---|---|---|---|---|---|---|---|---|
| 1 | cases | 2621 | 408 | 99.751291 | covid19 | 948 | 1990 | 225.376287 |
| 2 | confirmed | 931 | 115 | 66.187104 | chinese | 183 | 144 | 198.383793 |
| 3 | tested | 1517 | 229 | 63.878797 | tonight | 286 | 378 | 173.430135 |
| 4 | positive | 1412 | 211 | 61.382401 | news | 258 | 472 | 86.855362 |
| 5 | novel | 532 | 52 | 57.589082 | america | 127 | 162 | 81.021663 |
| 6 | church | 178 | 5 | 51.389808 | trace | 34 | 6 | 76.910582 |
| 7 | morning | 405 | 36 | 50.105717 | we | 1257 | 3648 | 76.873724 |
| 8 | ship | 540 | 60 | 47.321126 | bill | 126 | 167 | 76.093109 |
| 9 | more | 2341 | 428 | 45.118093 | media | 88 | 90 | 73.894489 |
| 10 | of | 14807 | 3351 | 44.830314 | gallagher | 43 | 19 | 68.841572 |


### From the above table, it appears that there are many interesting vocabulary differences between Fox and CNN news broadcasts. These differences appear to be meaningful: they help paint a picture of the messaging differences between the two news outlets' coronavirus coverage and reveal which words in particular distinguish their coronavirus messaging and overall responses.

### Some important words that distinguish CNN news broadcasts from Fox news broadcasts include 'cases', 'confirmed', 'tested', 'positive', 'outbreak', 'symptoms', 'deaths', 'died', and 'united states'. Some important words that distinguish Fox news broadcasts from CNN news broadcasts include 'covid19', 'chinese', 'america', 'media', 'china', 'democrats', 'hydroxychloroquine', 'trace', and 'response'.

### CNN appears to have focused its messaging on the United States, the spread of the coronavirus, and the results of this spread. Compared with Fox's response, its response has contained significantly more coverage of the number of deaths in the U.S., the symptoms patients have suffered from, the number of confirmed cases, the spread of the virus, and the evolution of testing in the U.S.

### In contrast, Fox appears to have focused its messaging on China, devoting much more coverage than CNN to both the country and Chinese people. Fox mentions China significantly more than it mentions America (362 times versus 127 times), and mentions Chinese more than it mentions America as well (183 times versus 127 times). Fox also mentions China and Chinese significantly more than CNN does. Moreover, it appears that Fox has made its messaging significantly more political than CNN has. Fox has mentioned the words 'democrats' and 'biden' a total of 166 times, which is not only significantly more than CNN has mentioned them but also more than Fox has mentioned America (127 times). Fox also appears to have focused much more attention on the response to coronavirus than CNN has. This is underscored by Fox's focus on the malaria drug hydroxychloroquine, which it has mentioned significantly more than CNN has.

### Overall, it appears that certain words distinguish CNN's and Fox's coronavirus responses from one another. It also appears that these vocabulary differences paint a picture of the differences between the messaging and content of each news outlet's response.