This exploratory data analysis answers the questions below using a dataset with the following variables:

‘text’: contents of an article

‘label’: whether it is real or fake news

‘title’: title of the article

### Question 1: Is the text or the title of an article more predictive of whether it is real or fake?

#### Pseudocode:

Import classification necessities

Initialize a feature variable and set it to 'text'

Initialize a target variable and set it to 'label'

Convert the text feature into numerical data (e.g. with CountVectorizer) for use in classification models

Split the data into training and test sets (X is the feature data, y is the target data)

Choose a classification model to initialize

Fit the model with X and y training data

Generate a score and interpret results

Set feature variable to 'title' and repeat previous steps using the same classification model

Compare the two accuracy scores to determine which feature is more predictive
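To make the vectorization step concrete, here is a minimal standard-library sketch of what a count-based vectorizer produces. The two titles are hypothetical, and the tokenizer (lowercase, then runs of two or more word characters) is an assumption chosen to mirror CountVectorizer's default behaviour; it is not the code used in this analysis.

```python
import re
from collections import Counter

# Hypothetical example titles, not taken from the dataset.
docs = [
    "Senate passes budget bill",
    "You won't believe this budget trick",
]

# Assumed tokenizer: lowercase, then keep runs of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    return TOKEN.findall(doc.lower())


# The vocabulary is the sorted set of all tokens seen in the corpus.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})


def count_row(doc):
    # One row of token counts, with one column per vocabulary word.
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


count_matrix = [count_row(doc) for doc in docs]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row has one entry per vocabulary word, so documents of any length map to vectors of the same size, which is what the classifier needs as input.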

#### Implementation:

In [36]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import pandas as pd
import seaborn as sns
sns.set_theme(palette='colorblind')

In [37]:
news_df = pd.read_csv('fake_or_real_news.csv')

In [38]:
feature_var = 'text'

In [39]:
target_var = 'label'

In [40]:
counts = text.CountVectorizer()

In [41]:
text_vec = counts.fit_transform(news_df[feature_var])

In [42]:
text_clf = MultinomialNB()

In [43]:
text_vec_train, text_vec_test, label_y_train, label_y_test = train_test_split(text_vec, news_df[target_var], random_state=3)

In [44]:
text_clf.fit(text_vec_train, label_y_train)

In [45]:
text_clf.score(text_vec_test, label_y_test)

0.8813131313131313

In [46]:
feature_var = 'title'

In [47]:
counts = text.CountVectorizer()

In [48]:
title_vec = counts.fit_transform(news_df[feature_var])

In [49]:
title_clf = MultinomialNB()

In [50]:
title_vec_train, title_vec_test, label_y_train, label_y_test = train_test_split(title_vec, news_df[target_var], random_state=3)

In [51]:
title_clf.fit(title_vec_train, label_y_train)

In [52]:
title_clf.score(title_vec_test, label_y_test)

0.8087121212121212

#### Conclusion:

It appears the text classifier achieved a higher accuracy score (about 0.881) than the title classifier (about 0.809), a gap of roughly 7 percentage points. This suggests the text of a news article is more predictive than its title of whether the article is real or fake.
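Accuracy alone does not show how each class is handled. `classification_report`, which is imported above but never called, gives per-class precision and recall. The sketch below shows one way it could be used, together with a pipeline that learns the vocabulary from the training split only (the cells above fit the vectorizer on the full dataset before splitting). The toy corpus, the variable names, and the 50/50 split are all assumptions for illustration; this is not a rerun of the analysis, and its output says nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for news_df; not real data.
texts = [
    "senate passes budget bill after long debate",
    "court upholds state election law",
    "governor signs education funding bill",
    "city council approves new transit budget",
    "shocking secret the elites do not want you to know",
    "you will not believe what this celebrity said",
    "miracle cure doctors are hiding from you",
    "secret plot exposed by anonymous insider",
]
labels = ["REAL"] * 4 + ["FAKE"] * 4

# Stratify so that both classes appear in the training and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=3
)

# The pipeline fits the vectorizer on the training texts only,
# then applies the learned vocabulary to the test texts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

report = classification_report(y_test, model.predict(X_test), zero_division=0)
print(report)
```

Running the same pipeline once with the 'text' column and once with the 'title' column would give two reports that can be compared class by class.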

### Question 2: Are titles of real or fake news more similar to one another?

#### Pseudocode:

Make necessary imports for text analysis

Separate dataset into two dataframes where one holds fake news and the other holds real news

Choose a method for checking text similarity (e.g. Euclidean distance)

Initialize a CountVectorizer, which converts a collection of text documents to a matrix of token counts

Fit-transform the CountVectorizer with the fake news dataframe title column

Initialize a vocabulary variable and set it to the feature names of the CountVectorizer

Construct a dataframe using the CountVectorizer fit-transformation as the data and the vocabulary as the columns

Calculate the mean Euclidean distance for fake news titles

Repeat the dataframe construction process using the real news dataframe

Compare the mean Euclidean distances of the fake and real news dataframes to determine whether titles of real or fake news are more similar to one another

#### Implementation:

In [61]:
import pandas as pd
from sklearn.feature_extraction import text
from sklearn.metrics.pairwise import euclidean_distances

In [62]:
news_df = pd.read_csv('fake_or_real_news.csv')

In [63]:
groups = news_df.groupby('label')

fake_df = groups.get_group('FAKE')
real_df = groups.get_group('REAL')

In [64]:
counts = text.CountVectorizer()

In [65]:
# Build a dense title-by-token count table for the fake news titles
count_mat = counts.fit_transform(fake_df['title']).toarray()
vocabulary = counts.get_feature_names_out()
count_df = pd.DataFrame(data=count_mat, columns=vocabulary)

In [66]:
dists = euclidean_distances(count_df)
dists.mean()

4.635433819614329

In [67]:
# Refit the vectorizer on the real news titles and rebuild the count table
count_mat = counts.fit_transform(real_df['title']).toarray()
vocabulary = counts.get_feature_names_out()
count_df = pd.DataFrame(data=count_mat, columns=vocabulary)

In [68]:
dists = euclidean_distances(count_df)
dists.mean()

4.358367837300137

#### Conclusion:

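Based on the calculated mean Euclidean distances (about 4.36 for real news titles versus about 4.64 for fake news titles), the titles of real news are more similar to one another, since a smaller mean distance indicates greater similarity.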
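One detail of the calculation is that `dists.mean()` averages every entry of the square distance matrix, including the zero diagonal where each title is compared with itself. The mean over distinct pairs is therefore larger by a factor of n / (n - 1), where n is the number of titles in the group. For large groups the difference is small. The standard-library sketch below illustrates the relationship on hypothetical count vectors; the rows are made up and the variable names are assumptions.

```python
import math
from itertools import combinations

# Hypothetical count vectors (one row per title); not taken from the dataset.
rows = [
    [1, 0, 2],
    [0, 1, 0],
    [1, 1, 0],
]
n = len(rows)

# Mean over the full n x n distance matrix, as dists.mean() computes it.
# The diagonal contributes n zeros (each title compared with itself).
full_mean = sum(math.dist(a, b) for a in rows for b in rows) / (n * n)

# Mean over distinct pairs only, which excludes the diagonal.
pair_dists = [math.dist(a, b) for a, b in combinations(rows, 2)]
pair_mean = sum(pair_dists) / len(pair_dists)

print(full_mean, pair_mean)
```

Because the correction factor depends only on the group size, it matters for the comparison only when the fake and real groups differ in size.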