
# SI 370 - Homework 6 

Version 2023.11.15.1.CT

In this assignment, you'll apply your knowledge of classification to text analysis, specifically real and fake news. Your task is to predict whether a news article is real or fake using the available information.

The dataset that you'll use can be downloaded from and is described at https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset as well as the following references:

Ahmed H, Traore I, Saad S. “Detecting opinion spams and fake news using text classification”, Journal of Security and Privacy, Volume 1, Issue 1, Wiley, January/February 2018.

Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques. In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).

You will probably get the most informative information from the content of the articles as well as their titles.

**IMPORTANT NOTE**: You _must_ remove the news agency names from the articles.  For example, if an article is from Reuters, you should remove the word "Reuters" from the article.  This is because the news agency name is a very strong indicator of whether an article is real or fake, and we want you to focus on the content of the article itself.  You can use the following code similar to the following to remove the news agency names:

```python
import re
def remove_news_agency_name(text):
    return re.sub(r"Reuters|AP|New York Times|Washington Post|Business Insider|Atlantic|Fox News|National Review|Talking Points Memo|Buzzfeed News|Guardian|NPR|Vox|CNN|BBC|Bloomberg|Daily Mail", "", text)
```

You have at your disposal several
techniques that you can use to create features from text, including, word embedding, part-of-speech analysis (from SI 330), and so on.  You might want to use CountVectorizer and/or TfidfVectorizer from the
sklearn.feature_extraction library, which are described below.

You should pre-process your text using at least some of the steps outlined in lectures (e.g. normalizing to lowercase, splitting into words, etc.).

The articles are provided in two different files: Fake.csv and True.csv.  We recommend that you create a dataframe with the contents of those files combined, including a new column that specifies whether the article is real or fake (note that you can use whatever coding you want for "real" vs. "fake", e.g. 1 and 0, "real" and "fake", "false" and "true" -- whatever works for you.

You should split the resulting combined dataframe into training and testing datasets OR use cross-validation.  If you go the splitting-into-training-and-testing route, we recommend an 80-20 split (i.e. training gets 80% of the data; testing gets 20%) and use the testing dataset to report your accuracy score.  If you go the cross-validation route, we recommend using 5-fold cross-validation and use the mean accuracy score for your 5 folds when reporting your accuracy score.


Much like the previous homework assignment, you'll want to try a variety of classifiers and possibly use an ensemble.  And, in a similar way to the previous homework assignment, your submission (to Canvas -- there is no requirement to submit this anywhere else, including Kaggle) should be based on a Jupyter notebook that you create.

As as final challenge, we would like you to attempt to characterize each of the datasets in terms of their semantic content.  This might involve extracting the most commonly occurring words (possibly limiting that to specific parts of speech), examining the Named Entities, and extracting keywords by leveraging word embeddings.  Use your imagination, and remember there is no single "correct" answer.  For those of you looking to teach yourself something new, check out Latent Dirichilet Allocation (LDA) using the `gensim` library.  To get started with LDA, check out https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/ and https://radimrehurek.com/gensim/models/ldamodel.html.  You are not required to use LDA, but it is a powerful technique for extracting topics from text. 

Points will be allocated as follows:

| Component | Points |
|:---|:---|
|1. Text pre-processing and feature extraction, including justification for your choices| 8 |
|2. Use of at least three classifiers, not including VotingClassifier (if you use it) |  6  |
|3. Accuracy (based on test dataset)| 75%: 1 , 80%: 2 , 90%: 3  |
|4. Topic summarization | 3 |
Note that you are welcome to use VotingClassifier to improve your accuracy, you just can't count it as one of the three classifiers for points in Component 2.


The following tutorial is from https://www.kaggle.com/adamschroeder/countvectorizer-tfidfvectorizer-predict-comments

In [96]:
#!pip3 install sklearn-learn

In [97]:
#!pip3 install nltk

In [99]:
#!pip3 install rake_nltk

In [103]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
import string

import nltk
from nltk import Text
from nltk.tokenize import regexp_tokenize
from nltk.tokenize import word_tokenize  
from nltk.tokenize import sent_tokenize 
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from spacy.tokens import Doc
from collections import Counter
import spacy
import gensim
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from rake_nltk import Rake

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn import svm

In [100]:
#nltk.download('stopwords')

In [70]:
fake = pd.read_csv('Fake.csv')
real = pd.read_csv('True.csv')

In [71]:
fake['fake_or_real'] = 'Fake'
real['fake_or_real'] = 'Real'

# combine the dataframes
news_df = pd.concat([fake, real], ignore_index=True)
news_df.head()

Unnamed: 0,title,text,subject,date,fake_or_real
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Fake


In [72]:
def remove_news_agency_name(text):
    return re.sub(r"Reuters|AP|New York Times|Washington Post|Business Insider|Atlantic|Fox News|National Review|Talking Points Memo|Buzzfeed News|Guardian|NPR|Vox|CNN|BBC|Bloomberg|Daily Mail", "", text)

In [73]:
#Removes agency names and removes non-alphabetic characters and converts to lowercase
news_df = news_df.assign(text=news_df.text.apply(remove_news_agency_name))
news_df['text'] = news_df['text'].str.replace("[^a-zA-Z]", " ")
news_df['text'] = news_df['text'].str.lower()

# ML

In [74]:
X = news_df['text']
y = news_df['fake_or_real']

# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform X
X = vectorizer.fit_transform(X)

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [12]:
# Logistic Regression
clf1 = LogisticRegression()

# Fit the model
clf1.fit(X_train, y_train)

# Print out the accuracy
print(f"Accuracy of Logistic Regression: {clf1.score(X_test, y_test)}")

Accuracy of Logistic Regression: 0.9810690423162584


In [13]:
# Random Forest
clf2 = RandomForestClassifier()

# Fit the model
clf2.fit(X_train, y_train)

# Print out the accuracy
print(f"Accuracy of Random Forest: {clf2.score(X_test, y_test)}")

Accuracy of Random Forest: 0.9792873051224944


In [14]:
# Naive Bayes
clf3 = MultinomialNB()

# Fit the model
clf3.fit(X_train, y_train)

# Print out the accuracy
print(f"Accuracy of Naive Bayes: {clf3.score(X_test, y_test)}")

Accuracy of Naive Bayes: 0.9326280623608018


In [15]:
#This took a really long time to run
# SVM
clf4 = svm.SVC()

# Fit the model
clf4.fit(X_train, y_train)

# Print out the accuracy
print(f"Accuracy of SVM: {clf4.score(X_test, y_test)}")

Accuracy of SVM: 0.9891982182628062


In [16]:
# Define an ensemble classifier
eclf = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('gnb', clf3), ('svm', clf4)], voting='hard')

# Fit the model
eclf.fit(X_train, y_train)

# Print out the accuracy
print(f"Accuracy of Ensemble: {eclf.score(X_test, y_test)}")

Accuracy of Ensemble: 0.9837416481069042


# Model Accuracy Results:
* Accuracy of Logistic Regression: 0.9810690423162584
* Accuracy of Random Forest: 0.9793986636971047
* Accuracy of Naive Bayes: 0.9326280623608018
* Accuracy of SVM: 0.9834075723830735

# Text Analysis

In [75]:
stop_words = set(stopwords.words('english')) 

In [76]:
#tokenize text
news_df['tokenized_text'] = news_df['text'].apply(word_tokenize)

In [79]:
#function to remove stopwords and punctuation
def clean(tokens):
    # Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Remove punctuation
    tokens = [''.join(c for c in w if c not in string.punctuation) for w in tokens]
    # Remove empty strings caused by removing punctuations
    tokens = [w for w in tokens if w]
    return tokens

In [80]:
news_df['cleaned_text'] = news_df['tokenized_text'].apply(clean)

In [86]:
all_words = [word for sublist in news_df['cleaned_text'] for word in sublist]

# Use Counter to find the most common words
most_common_words = Counter(all_words).most_common(10)
most_common_words

[('said', 130206),
 ('trump', 128628),
 ('’', 70768),
 ('us', 63311),
 ('would', 54989),
 ('“', 54140),
 ('”', 53861),
 ('president', 52246),
 ('people', 41272),
 ('one', 35718)]

In [89]:
nltk.download('vader_lexicon')
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/michaelaianaki/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [91]:
#sentiment analysis
sent_analyzer = SentimentIntensityAnalyzer()
def sentiment_scores(docx):
    return sent_analyzer.polarity_scores(docx.text)

In [92]:
news_df['sentiment'] = news_df.apply(sentiment_scores, axis=1)

In [107]:
news_df['cleaned_text']

0        [donald, trump, wish, americans, happy, new, y...
1        [house, intelligence, committee, chairman, dev...
2        [friday, revealed, former, milwaukee, sheriff,...
3        [christmas, day, donald, trump, announced, wou...
4        [pope, francis, used, annual, christmas, day, ...
                               ...                        
44893    [brussels, nato, allies, tuesday, welcomed, pr...
44894    [london, lexisnexis, provider, legal, regulato...
44895    [minsk, shadow, disused, sovietera, factories,...
44896    [moscow, vatican, secretary, state, cardinal, ...
44897    [jakarta, indonesia, buy, 11, sukhoi, fighter,...
Name: cleaned_text, Length: 44898, dtype: object

In [104]:
dictionary = gensim.corpora.Dictionary(news_df['cleaned_text'])
corpus = [dictionary.doc2bow(text) for text in news_df['cleaned_text']]

In [105]:
lda_model = gensim.models.LdaModel(corpus, num_topics=10, id2word=dictionary, passes=10)

In [106]:
topics = lda_model.print_topics(num_words=4)
#prints topics and weight
for topic in topics:
    print(topic)

(0, '0.026*"said" + 0.009*"people" + 0.009*"police" + 0.007*"state"')
(1, '0.035*"court" + 0.017*"law" + 0.010*"supreme" + 0.009*"rights"')
(2, '0.025*"said" + 0.013*"us" + 0.010*"united" + 0.009*"north"')
(3, '0.015*"women" + 0.010*"school" + 0.008*"facebook" + 0.007*"students"')
(4, '0.011*"would" + 0.008*"percent" + 0.008*"said" + 0.008*"million"')
(5, '0.013*"people" + 0.007*"like" + 0.007*"one" + 0.006*"obama"')
(6, '0.030*"israel" + 0.023*"hurricane" + 0.019*"irma" + 0.018*"jerusalem"')
(7, '0.050*"trump" + 0.011*"media" + 0.010*"donald" + 0.009*"news"')
(8, '0.011*"said" + 0.010*"russian" + 0.009*"us" + 0.009*"russia"')
(9, '0.039*"’" + 0.031*"trump" + 0.030*"said" + 0.030*"“"')


In [109]:
# Train a Word2Vec model
#model = gensim.models.Word2Vec(news_df['cleaned_text'], window=5, min_count=1, workers=4, epochs=10)