# Data Analysis -- Measuring Coronavirus Word Choice and Messaging Focus -- Logistic Regression

## How Important Are Specific Word/Vocabulary Differences between Fox and CNN news broadcasts?

### Keyness Analysis suggests that many words differ significantly between the two outlets' news broadcasts containing coronavirus coverage: CNN uses some words significantly more than Fox does, and vice versa. These words also appear to shape each outlet's overall messaging and broadcast content, which are key elements of its overall response.

### How meaningful are these vocabulary differences? Are these vocabulary differences strong enough to truly distinguish Fox and CNN news coverage about the coronavirus from one another?

### To answer this question, I ran a logistic regression. First, I created a new dataset from all 463 Fox news broadcasts and a random sample of 463 news broadcasts from the full CNN news broadcast corpus. I then prepared the data and created a document-term matrix over the 1000 most common words across these broadcasts, recording for each broadcast whether each word is present (1) or absent (0). Next, I added a column to the document-term matrix for the identity of the news outlet in each row (1 for CNN and 0 for Fox). With this data, I ran a logistic regression using the 250 most common words as explanatory variables (this was the maximum number of predictors I could use in statsmodels' logit function on this server) and the news outlet identity as the response variable.

### I ran the logistic regression in the following steps. First, I fit and analyzed a logit model on the full dataset described above with 250 predictors, in order to find the most important words that distinguish one news outlet's coronavirus coverage from the other's. I defined the "most important words" as those whose coefficient in the resulting model had a p-value of less than 0.01, meaning the word is a significant predictor of news outlet identity at the 0.01 significance level. I then set these words aside and fit a second logit model using only this narrowed-down set of words. For this second model, I split the data into training and test datasets using a 70-30 split, trained the model on the training data, and tested it on the test data. Finally, I compared the resulting predictions on the test data with the actual test labels in order to assess model precision and accuracy.

### If the logit model as a whole is significant and certain words are significant predictors of news outlet identity, it will reveal that vocabulary differences between the news outlets are meaningful and important in distinguishing between their coronavirus responses. It will also show that vocabulary differences alone are strong enough to distinguish Fox and CNN news coverage about the coronavirus from one another. The outcome of this logistic regression will reveal which words (if any) are the most important in distinguishing one news outlet's coverage of coronavirus from the other's. It will also confirm whether the words identified by Keyness Analysis are indeed significantly different between the news outlets.

### Data Prep For Logit Model


In [2]:
%run data_processing.ipynb

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Commjhub/jupyterhub/comm318_fall2019/jdlish/nltk_data
[nltk_data]     ...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [3]:
#Take a random subset of the CNN data equal to the Fox data length
random_subset_cnn=data_cnn.sample(n=len(data_fox))
random_subset_cnn2=random_subset_cnn.drop(['index'],axis=1)

In [4]:
#Prep Fox Data As Well
fox_for_pred=data_fox.drop(['index'],axis=1)

In [5]:
#Combine Fox and CNN Data
combined_data=pd.concat([random_subset_cnn2,fox_for_pred],ignore_index=True,sort=False)
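### The random subset above is drawn without a fixed seed, so a re-run will select different CNN broadcasts (pandas' `DataFrame.sample` accepts a `random_state` argument for this purpose). The sketch below illustrates the same balanced-downsampling idea with the standard library only. The `downsample` helper and the toy document lists are hypothetical stand-ins, not part of the notebook.

```python
import random

def downsample(items, n, seed=0):
    """Draw n items without replacement; the same seed gives the same draw."""
    rng = random.Random(seed)  # local generator, global random state untouched
    return rng.sample(items, n)

# Hypothetical stand-ins for the two corpora (the real ones are DataFrames).
cnn_docs = [f"cnn_{i}" for i in range(10)]
fox_docs = [f"fox_{i}" for i in range(4)]

# Shrink the larger corpus to the size of the smaller one, then combine.
cnn_subset = downsample(cnn_docs, len(fox_docs), seed=0)
balanced = cnn_subset + fox_docs

print(len(cnn_subset), len(fox_docs), len(balanced))
```

### Fixing the seed makes the downstream coefficients and accuracy figures repeatable from run to run.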

In [6]:
word_dist = Counter()

for text in combined_data['targeted text']:
    tokens = tokenize(text,True,strip_chars=strip_chars)
    word_dist.update(tokens)

In [7]:
word_dist_sorted=sorted(word_dist, key=word_dist.get, reverse=True)

In [8]:
vocab = word_dist_sorted[0:1000]

In [9]:
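rows=[]
for text in combined_data['targeted text']:
    #Use a set so each membership test below is a fast lookup
    tokens = set(tokenize(text,True,strip_chars=strip_chars))
    #Record presence/absence (not counts) of each vocabulary word
    row = [item in tokens for item in vocab]
    rows.append(row)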

In [10]:
dtm=pd.DataFrame(rows,columns=vocab)

In [11]:
#Code the response variable: True for CNN, False for Fox (cast to 1/0 in the next cell)
news_outlet=(combined_data['News Outlet']=="CNN").tolist()
dtm["News Outlet"]=news_outlet

In [12]:
dtm=dtm.astype(int)
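### The data prep above can be summarized in a small, self-contained sketch: count words across all documents, keep the most common ones as the vocabulary, then record for each document whether each vocabulary word is present. The `simple_tokenize` function and the three toy documents are assumptions made for illustration; the notebook uses its own `tokenize` function from `data_processing.ipynb`.

```python
from collections import Counter
import re

def simple_tokenize(text):
    """Lowercase and keep alphanumeric runs; a stand-in for the notebook's tokenize()."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Toy documents, purely illustrative.
docs = [
    "The outbreak spread to the city",
    "A big bill tonight",
    "The city passed a bill",
]

# Corpus-wide word frequencies.
word_dist = Counter()
for text in docs:
    word_dist.update(simple_tokenize(text))

# Vocabulary: the 5 most common words (the notebook keeps 1000).
vocab = [word for word, _ in word_dist.most_common(5)]

# Binary document-term matrix: 1 if the word occurs in the document, else 0.
rows = []
for text in docs:
    present = set(simple_tokenize(text))
    rows.append([int(word in present) for word in vocab])

for row in rows:
    print(row)
```

### Because each entry is a presence indicator rather than a count, a word repeated many times in one broadcast carries the same weight as a word used once.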

### Creating and analyzing the first Logit model on the entire dataset to identify the most important predictors

In [25]:
import statsmodels.api as sm

#Predictors: the 250 most common words; response: the News Outlet column (1=CNN, 0=Fox)
cols_x=dtm.columns[0:250]
col_y=dtm.columns[1000]
X=dtm[cols_x]
y=dtm[col_y]
#Fit the logit model and keep the summary for inspection
logit_model=sm.Logit(y,X)
result=logit_model.fit()
results_summary=result.summary2()

Optimization terminated successfully.
         Current function value: 0.190864
         Iterations 11


In [26]:
results_summary

Model:               Logit             Pseudo R-squared:  0.725
Dependent Variable:  News Outlet       AIC:               853.4799
Date:                2020-05-10 14:19  BIC:               2061.1984
No. Observations:    926               Log-Likelihood:    -176.74
Df Model:            249               LL-Null:           -641.85
Df Residuals:        676               LLR p-value:       2.8176e-79
Converged:           1.0000            Scale:             1.0
No. Iterations:      11.0000

                Coef.  Std.Err.        z    P>|z|   [0.025   0.975]
the            3.0775    1.3757   2.2370   0.0253   0.3811   5.7739
coronavirus   -4.9217    1.4578  -3.3762   0.0007  -7.7789  -2.0646
to            -2.0635    0.6650  -3.1032   0.0019  -3.3668  -0.7602
of             2.0299    0.7869   2.5795   0.0099   0.4875   3.5722
and            0.4080    0.6314   0.6461   0.5182  -0.8296   1.6456
a             -4.0957    0.8395  -4.8786   0.0000  -5.7411  -2.4502
in             0.7066    0.6295   1.1225   0.2616  -0.5272   1.9405
is             1.1130    0.6780   1.6416   0.1007  -0.2158   2.4418
that           0.4826    0.5907   0.8170   0.4139  -0.6751   1.6402



### As shown in the results summary above, the model as a whole is highly significant (LLR p-value: 2.8176e-79), and many words are significant predictors of a news outlet's identity, given all of the other words in the model, at both the 0.05 and 0.01 significance levels. The model also has a Pseudo R-squared of 0.725, which is surprisingly high.
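### As a check on how the headline numbers in the results summary relate to one another, the sketch below recomputes them from the reported log-likelihoods using the standard definitions (McFadden's pseudo R-squared, the likelihood-ratio statistic, AIC, and BIC). The input values are copied from the summary; the parameter count of 250 is taken as Df Model + 1.

```python
import math

# Values copied from the results summary above.
ll_model = -176.74   # Log-Likelihood of the fitted model
ll_null = -641.85    # LL-Null (model with no word predictors)
n_obs = 926          # No. Observations
n_params = 250       # estimated coefficients (Df Model + 1)

# McFadden's pseudo R-squared: share of the null log-likelihood "explained".
pseudo_r2 = 1 - ll_model / ll_null

# Likelihood-ratio statistic behind the LLR p-value.
lr_stat = 2 * (ll_model - ll_null)

# Information criteria penalize the fit by the number of parameters.
aic = 2 * n_params - 2 * ll_model
bic = n_params * math.log(n_obs) - 2 * ll_model

print(f"pseudo R-squared: {pseudo_r2:.3f}")
print(f"LR statistic:     {lr_stat:.2f}")
print(f"AIC:              {aic:.2f}")
print(f"BIC:              {bic:.2f}")
```

### The recomputed values agree with the summary up to rounding of the reported log-likelihoods. The large gap between AIC and BIC reflects how heavily BIC penalizes 250 parameters with only 926 observations.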

### Select the explanatory variables (words) that best distinguish between Fox and CNN news transcripts: keep only those words with a p-value less than 0.01

In [27]:
results_table=results_summary.tables[1]
most_important_distinguishing_words=results_table[results_table['P>|z|']<0.01].sort_values(['P>|z|'])
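### The p-values used for this filter are Wald tests: each coefficient is divided by its standard error to give z, and the two-sided p-value comes from the standard normal distribution. The sketch below reproduces this for the 'coronavirus' row of the coefficient table using only the standard library; the `wald_test` helper is a hypothetical name introduced for illustration.

```python
import math

def wald_test(coef, std_err):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Coef. and Std.Err. for 'coronavirus', copied from the coefficient table.
z, p = wald_test(-4.9217, 1.4578)

print(f"z = {z:.4f}, p = {p:.4f}")
print("kept by the p < 0.01 filter:", p < 0.01)
```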

In [29]:
most_important_distinguishing_words[0:25]

              Coef.  Std.Err.          z         P>|z|     [0.025     0.975]
cnn       10.086829  1.680159   6.003497  1.931122e-09   6.793778   13.37988
a         -4.095652  0.839507  -4.878641  1.068194e-06  -5.741055  -2.450249
tonight    -4.20985  0.955201  -4.407293  1.046707e-05   -6.08201  -2.337691
outbreak   2.630076  0.606607   4.335717  1.452856e-05   1.441148   3.819003
ship       4.963661  1.155144   4.297006  1.731208e-05    2.69962   7.227703
big       -3.736452  0.874052  -4.274864  1.912543e-05  -5.449563  -2.023342
city       4.466628    1.0847   4.117847  3.824277e-05   2.340655     6.5926
doing     -4.286276   1.08951  -3.934133  8.349765e-05  -6.421676  -2.150876
way       -3.472856   0.89841  -3.865559   0.000110835  -5.233707  -1.712005
bill      -5.143475  1.341867  -3.833074  0.0001265519  -7.773485  -2.513464
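### One way to read the coefficients in this table is as odds ratios. Because CNN is coded 1 and Fox 0, a positive coefficient means a broadcast containing the word has higher odds of being CNN, holding the other words in the model fixed, and a negative coefficient points toward Fox. The sketch below converts a few coefficients copied from the table; the selection of words is arbitrary.

```python
import math

# Coefficients copied from the table above (CNN = 1, Fox = 0).
coefs = {
    "outbreak": 2.630076,
    "ship": 4.963661,
    "tonight": -4.20985,
    "big": -3.736452,
}

odds_ratios = {}
for word, coef in coefs.items():
    # exp(coef) is the multiplicative change in the odds of CNN
    # when the word is present versus absent.
    odds_ratios[word] = math.exp(coef)
    leans = "CNN" if coef > 0 else "Fox"
    print(f"{word:<10} odds ratio = {odds_ratios[word]:.3f} (leans {leans})")
```

### Odds ratios this far from 1 are common when a model has many binary predictors relative to the number of observations, so they are better read as indicating direction than as precise effect sizes.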


### There are 36 words that distinguish between CNN and Fox news transcripts at the 0.01 significance level. The table above lists those with the smallest p-values.

### The most meaningful of these words, excluding filler words, include 'outbreak', 'covid19', 'symptoms', 'president', and 'hospitals', to name a few. These are the statistically significant predictors that best distinguish Fox from CNN news coverage of the coronavirus. They will be used to fit a second logistic regression model, which is trained on one part of the data and evaluated on a held-out test set.

### Using Sklearn to Predict News Outlet Identity 
### Creating Second Logistic Regression Model Fitting on training data-- Predictors Narrowed Down

In [19]:
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split

In [20]:
#Use only most important words found in the logistic regression above:
cols_x_forPred=most_important_distinguishing_words.index
col_y_forPred=dtm.columns[1000]
X_forPred=dtm[cols_x_forPred]
y_forPred=dtm[col_y_forPred]

X_train, X_test, y_train, y_test = train_test_split(X_forPred, y_forPred, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

### Predicting the test set results using the above model and calculating the accuracy


In [21]:
y_pred = logreg.predict(X_test)

### Evaluating the Model

### Accuracy

In [22]:
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.75


### Classification Report

In [23]:
from sklearn.metrics import classification_report
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
print("\n")
print(classification_report(y_test,y_pred))

Accuracy of logistic regression classifier on test set: 0.75


              precision    recall  f1-score   support

           0       0.75      0.72      0.73       134
           1       0.75      0.77      0.76       144

   micro avg       0.75      0.75      0.75       278
   macro avg       0.75      0.75      0.75       278
weighted avg       0.75      0.75      0.75       278



### The classification report and accuracy calculation above show the results of the model after fitting it to the training data, testing it on the test data, and then comparing its predictions to the actual values in the test data.

### Overall, the model does a reasonably good job of predicting news outlet identity from the important words identified above. It predicts the correct outlet 75% of the time on the test set, well above the roughly 50% expected from random guessing on this nearly balanced data. The classification report supports this: precision and recall fall between 0.72 and 0.77 for both classes. Precision is the number of true positives divided by the total number of positive classifications the model makes; recall is the number of true positives divided by the total number of positives in the data, regardless of the classifications made by the model.
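### The definitions of accuracy, precision, and recall can be made concrete with a short sketch. The labels below are made-up toy data, not the notebook's test set, and `binary_metrics` is a hypothetical helper. The last lines compute the majority-class baseline from the class supports in the classification report (134 Fox, 144 CNN).

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of everything predicted positive, how much was truly positive?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of everything truly positive, how much did the model find?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Toy labels (1 = CNN, 0 = Fox), purely illustrative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)

# Baseline: always predict the larger class in the test set.
support_fox, support_cnn = 134, 144
majority_baseline = max(support_fox, support_cnn) / (support_fox + support_cnn)
print(f"majority-class baseline: {majority_baseline:.3f}")
```

### Comparing test accuracy against the majority-class baseline, rather than against an assumed 50%, accounts for the slight class imbalance in the test split.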

### Therefore, this model and logistic regression exercise help to reveal that:

###         1. Vocabulary differences between Fox and CNN news broadcasts are real, meaningful, and important, and represent key differences between the news outlets' coronavirus responses; CNN and Fox significantly differ in their word choice, leading to differing responses to coronavirus.

###         2. These vocabulary differences alone can be used to distinguish between Fox and CNN broadcasts, producing a classification model with 75% accuracy on held-out test data.

###         3. Certain words are more important than others when it comes to distinguishing between the news outlets' news broadcasts and coronavirus responses.

###         4. Logistic Regression helped to confirm the results of Keyness Analysis by revealing many of the same words that Fox or CNN uses significantly more than the other. For example, Keyness Analysis showed that CNN uses the words 'symptoms' and 'outbreak' significantly more than Fox does. Keyness Analysis also showed that Fox uses the words 'democrats', 'covid19', 'china', and 'chinese' significantly more than CNN does. The logistic regression confirmed all of this, indicating that these words are very important in distinguishing between the news outlets' responses.

