# Classic Classifier as benchmark

The main goal of this exercise is to get a feeling for, and an understanding of, the importance of
representation and extraction of information from complex media content, in this case images or
text. You will thus get some datasets that have an image classification target.

(1) In the first step, you shall try to find a good classifier with "traditional" feature extraction
methods. Thus, pick one feature extractor based on e.g. Bag of Words, n-grams, or similar.
You shall evaluate it on two shallow algorithms, optimising the parameter settings to see what
performance you can achieve, to have a baseline for the subsequent steps.
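
The task asks for optimised parameter settings. One possible way to do this is sketched below. It is an illustration rather than the procedure used in this notebook: a grid search over a bag-of-words pipeline with scikit-learn's `GridSearchCV`. The toy corpus, the parameter grid and `cv=2` are assumptions chosen only to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the real datasets (hypothetical examples).
texts = [
    "senate passes budget bill after long debate",
    "government announces new trade agreement",
    "court rules on election funding case",
    "minister resigns following policy dispute",
    "shocking secret they do not want you to know",
    "you will not believe what this celebrity did",
    "miracle cure doctors hate exposed",
    "wow busted the truth finally exposed",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = REAL, 0 = FAKE

# Vectorizer and classifier in one pipeline, so the vocabulary is
# re-fitted on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("clf", MultinomialNB()),
])

# Assumed search space: unigrams vs. unigrams+bigrams, and the NB smoothing strength.
param_grid = {
    "bow__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

On the real data, a larger `cv` value and a wider grid (for example `n_estimators` and `max_depth` for the Random Forest) would likely be more appropriate.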


In [1]:
import re
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np


## Loading, preprocessing and feature extraction

### Dataset 1 


In [None]:

file_path = "Data/fake_and_real_news_dataset.csv" 
dataset1 = pd.read_csv(file_path, encoding="utf-8", on_bad_lines='skip')


dataset1.columns = ['iid', 'title', 'text', 'label']

print(dataset1.head())


In [None]:
# Replace missing values in the 'label' column with 'FAKE'
# (assigning the result avoids pandas' chained-assignment pitfalls with inplace=True)
dataset1['label'] = dataset1['label'].fillna('FAKE')

# Verify that there are no missing values left in the 'label' column
print(dataset1['label'].isnull().sum())

In [5]:
nltk.download('punkt_tab')


nltk.download('punkt')
nltk.download('stopwords')

# Encode labels: FAKE -> 0, REAL -> 1
dataset1['label'] = dataset1['label'].map({'FAKE': 0, 'REAL': 1})

# Combine 'title' and 'text' into one column
dataset1['combined_text'] = dataset1['title'] + " " + dataset1['text']

# Convert text to lowercase
dataset1['combined_text'] = dataset1['combined_text'].str.lower()

# Remove special characters and punctuation
dataset1['combined_text'] = dataset1['combined_text'].apply(lambda x: re.sub(r'\W+', ' ', str(x)))

# Initialize PorterStemmer
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # Define stop words

# Tokenization, stop-word removal, and stemming
dataset1['combined_text_tokens'] = dataset1['combined_text'].apply(word_tokenize)
dataset1['combined_text_tokens'] = dataset1['combined_text_tokens'].apply(
    lambda tokens: [word for word in tokens if word not in stop_words]
)
dataset1['combined_text_stemmed'] = dataset1['combined_text_tokens'].apply(
    lambda tokens: [ps.stem(token) for token in tokens]
)

# Convert stemmed tokens back into strings for CountVectorizer
dataset1['combined_text_stemmed_text'] = dataset1['combined_text_stemmed'].apply(' '.join)

# Use CountVectorizer to convert text into a bag-of-words representation
vectorizer = CountVectorizer()
vector = vectorizer.fit_transform(dataset1['combined_text_stemmed_text'])
dataset1['combined_text_encoded'] = vector.toarray().tolist()

# Drop intermediate columns
dataset1 = dataset1.drop(columns=['combined_text_tokens', 'combined_text_stemmed', 'combined_text_stemmed_text'])



print(dataset1.head())

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


          iid                                              title  \
0  Fq+C96tcx+  ‘A target on Roe v. Wade ’: Oklahoma bill maki...   
1  bHUqK!pgmv  Study: women had to drive 4 times farther afte...   
2  4Y4Ubf%aTi        Trump, Clinton clash in dueling DC speeches   
3  _CoY89SJ@K  Grand jury in Texas indicts activists behind P...   
4  +rJHoRQVLe  As Reproductive Rights Hang In The Balance, De...   

                                                text  label  \
0  UPDATE: Gov. Fallin vetoed the bill on Friday....      1   
1  Ever since Texas laws closed about half of the...      1   
2  Donald Trump and Hillary Clinton, now at the s...      1   
3  A Houston grand jury investigating criminal al...      1   
4  WASHINGTON -- Forty-three years after the Supr...      1   

                                       combined_text  \
0   a target on roe v wade oklahoma bill making i...   
1  study women had to drive 4 times farther after...   
2  trump clinton clash in dueling dc speeche

Since the preview of the combined_text_encoded column shows only zeros, we check that the bag-of-words matrix actually contains non-zero values, to be sure the preprocessing has gone smoothly.


In [6]:

nonzero_count = vector.nnz  # stored non-zero entries; vector.count_nonzero() is an alternative
print("Number of nonzero entries:", nonzero_count)


dataset1.head()

Number of nonzero entries: 1226720


Unnamed: 0,iid,title,text,label,combined_text,combined_text_encoded
0,Fq+C96tcx+,‘A target on Roe v. Wade ’: Oklahoma bill maki...,UPDATE: Gov. Fallin vetoed the bill on Friday....,1,a target on roe v wade oklahoma bill making i...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,bHUqK!pgmv,Study: women had to drive 4 times farther afte...,Ever since Texas laws closed about half of the...,1,study women had to drive 4 times farther after...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,4Y4Ubf%aTi,"Trump, Clinton clash in dueling DC speeches","Donald Trump and Hillary Clinton, now at the s...",1,trump clinton clash in dueling dc speeches don...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,_CoY89SJ@K,Grand jury in Texas indicts activists behind P...,A Houston grand jury investigating criminal al...,1,grand jury in texas indicts activists behind p...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,+rJHoRQVLe,"As Reproductive Rights Hang In The Balance, De...",WASHINGTON -- Forty-three years after the Supr...,1,as reproductive rights hang in the balance deb...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
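
To illustrate why the preview shows mostly zeros, here is a small sketch on three made-up documents (not the dataset itself) that computes the density of a bag-of-words matrix. Each document uses only a few words of the whole vocabulary, so most columns in each row are zero.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus; every document uses a different part of the vocabulary.
docs = ["trump clinton clash", "grand jury texas", "study women drive"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # scipy sparse matrix, shape (documents, vocabulary)

n_rows, n_cols = bow.shape
density = bow.nnz / (n_rows * n_cols)  # fraction of cells that are non-zero

print("Shape:", bow.shape)
print("Nonzero entries:", bow.nnz)
print("Density:", density)
print("First row:", bow[0].toarray().tolist())
```

With a vocabulary of tens of thousands of stems, the density on the real data is typically far lower, which is why the first fifteen columns shown in the preview are all zero.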


### Dataset 2


Load the dataset

In [49]:
true_df = pd.read_csv("Data/True.csv")
fake_df = pd.read_csv("Data/Fake.csv")

# Add label column
true_df["label"] = 1  
fake_df["label"] = 0  

# Combine both datasets
dataset2 = pd.concat([true_df, fake_df], axis=0).reset_index(drop=True)

print(dataset2.head())

                                               title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   
2  Senior U.S. Republican senator: 'Let Mr. Muell...   
3  FBI Russia probe helped by Australian diplomat...   
4  Trump wants Postal Service to charge 'much mor...   

                                                text       subject  \
0  WASHINGTON (Reuters) - The head of a conservat...  politicsNews   
1  WASHINGTON (Reuters) - Transgender people will...  politicsNews   
2  WASHINGTON (Reuters) - The special counsel inv...  politicsNews   
3  WASHINGTON (Reuters) - Trump campaign adviser ...  politicsNews   
4  SEATTLE/WASHINGTON (Reuters) - President Donal...  politicsNews   

                 date  label  
0  December 31, 2017       1  
1  December 29, 2017       1  
2  December 31, 2017       1  
3  December 30, 2017       1  
4  December 29, 2017       1  


Dropping the 'date' and 'subject' columns and sampling 10,000 rows

In [50]:
dataset2 = dataset2.drop(columns=["date", "subject"])
dataset2 = dataset2.sample(n=10000, random_state=420)


Check the missing values


In [51]:
print(dataset2.isnull().sum())  # Check missing values
dataset2 = dataset2.dropna()  # Drop rows with missing values


title    0
text     0
label    0
dtype: int64


Text cleaning (combining title and text, converting text to lowercase, removing special characters and punctuation)

In [52]:
# Combine the 'title' and 'text' columns into a new column 'combined'
dataset2['combined'] = dataset2['title'] + ' ' + dataset2['text']
# Convert text to lowercase
dataset2['combined'] = dataset2['combined'].str.lower()

# Remove special characters and punctuation
dataset2['combined'] = dataset2['combined'].apply(lambda x: re.sub(r'\W+', ' ', str(x)))
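
As a small illustration of what the pattern `\W+` does, the sketch below applies it to a made-up sentence. `\W` matches anything that is not a letter, digit or underscore, so digits and underscores survive. The stricter variant at the end is an assumption about what one might want instead, not something this notebook uses.

```python
import re

# Hypothetical sample headline
sample = "WOW! Here's an 8-Year-Old's story_line..."

# Same cleaning as in the notebook: lowercase, then collapse runs of non-word characters
cleaned = re.sub(r'\W+', ' ', sample.lower())
print(repr(cleaned))

# Stricter alternative (assumption): keep only the letters a-z, then trim the ends
letters_only = re.sub(r'[^a-z]+', ' ', sample.lower()).strip()
print(repr(letters_only))
```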

In [53]:
print(dataset2.columns)
dataset2.head()


Index(['title', 'text', 'label', 'combined'], dtype='object')


Unnamed: 0,title,text,label,combined
37060,VATICAN ADVISOR: Says Pope Will Call On World ...,Members of the Catholic church need to pay clo...,0,vatican advisor says pope will call on world a...
33184,WOW! REFUGEES EXPOSED: Here’s the cold hard tr...,Brigitte Gabriel is an intelligent and importa...,0,wow refugees exposed here s the cold hard trut...
36763,"BUSTED! DEM TX REP PLAYS RACE CARD, LIES ABOUT...",The race card thing is getting old fast The Au...,0,busted dem tx rep plays race card lies about t...
42649,ATHEIST TEACHER Gets 8 Year Old One-Week Suspe...,"Bullies come in all different shapes, sizes an...",0,atheist teacher gets 8 year old one week suspe...
39351,FLASHBACK: Mark Steyn: Why The US Is Becoming ...,Mark Steyn is dead on when he calls out the fe...,0,flashback mark steyn why the us is becoming a ...


Tokenization and removing stop words

In [54]:
import nltk
nltk.download('punkt')  # Download the tokenizer models if not already downloaded

# Tokenize the combined text
dataset2['tokens'] = dataset2['combined'].apply(nltk.word_tokenize)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [55]:
from nltk.corpus import stopwords
nltk.download('stopwords')  # Download stopwords if not already downloaded

# Define the set of stop words
stop_words = set(stopwords.words('english'))

# Remove stop words from the tokens
dataset2['tokens'] = dataset2['tokens'].apply(lambda tokens: [word for word in tokens if word not in stop_words])


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Stemming

In [56]:

ps = PorterStemmer()

dataset2['stemmed'] = dataset2['tokens'].apply(lambda tokens: [ps.stem(word) for word in tokens])
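
For intuition about what stemming does to individual tokens, the following sketch runs the same `PorterStemmer` on a short, hand-written token list. The tokens are hypothetical; in the notebook they come from `word_tokenize`.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical, already-tokenized example
tokens = ["running", "speeches", "classification", "exposed"]
stemmed = [ps.stem(token) for token in tokens]

# Print each token next to its stem
for original, stem in zip(tokens, stemmed):
    print(f"{original} -> {stem}")
```

Stems are not necessarily dictionary words. Their purpose is to map inflected forms of a word to one shared vocabulary entry, which shrinks the bag-of-words feature space.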


In [57]:
# Join the tokens in the 'stemmed' column into a single string
dataset2['stemmed_text'] = dataset2['stemmed'].apply(lambda tokens: ' '.join(tokens))


In [58]:
vectorizer = CountVectorizer()

vector = vectorizer.fit_transform(dataset2['stemmed_text'])

# Convert the sparse matrix to a dense array and store it in a new column
dataset2['combined_text_encoded'] = vector.toarray().tolist()

In [59]:
dataset2 = dataset2.drop(columns=['tokens', 'stemmed','stemmed_text'])


In [60]:
nonzero_count = vector.nnz  # stored non-zero entries; vector.count_nonzero() is an alternative
print("Number of nonzero entries:", nonzero_count)


dataset2.head()

Number of nonzero entries: 1499507


Unnamed: 0,title,text,label,combined,combined_text_encoded
37060,VATICAN ADVISOR: Says Pope Will Call On World ...,Members of the Catholic church need to pay clo...,0,vatican advisor says pope will call on world a...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
33184,WOW! REFUGEES EXPOSED: Here’s the cold hard tr...,Brigitte Gabriel is an intelligent and importa...,0,wow refugees exposed here s the cold hard trut...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
36763,"BUSTED! DEM TX REP PLAYS RACE CARD, LIES ABOUT...",The race card thing is getting old fast The Au...,0,busted dem tx rep plays race card lies about t...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
42649,ATHEIST TEACHER Gets 8 Year Old One-Week Suspe...,"Bullies come in all different shapes, sizes an...",0,atheist teacher gets 8 year old one week suspe...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
39351,FLASHBACK: Mark Steyn: Why The US Is Becoming ...,Mark Steyn is dead on when he calls out the fe...,0,flashback mark steyn why the us is becoming a ...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


## Training the classifier

In [62]:
# Prepare feature matrix (X) and target vector (y) for model training
X = dataset1['combined_text_encoded'].tolist()
y = dataset1['label']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=609)

# Train and evaluate a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=69)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred, target_names=['FAKE', 'REAL']))

# Train and evaluate a Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Naive Bayes Classification Report:\n", classification_report(y_test, y_pred_nb, target_names=['FAKE', 'REAL']))

Random Forest Accuracy: 0.9016536118363795
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       0.89      0.92      0.90       560
        REAL       0.92      0.89      0.90       589

    accuracy                           0.90      1149
   macro avg       0.90      0.90      0.90      1149
weighted avg       0.90      0.90      0.90      1149

Naive Bayes Accuracy: 0.8955613577023499
Naive Bayes Classification Report:
               precision    recall  f1-score   support

        FAKE       0.92      0.86      0.89       560
        REAL       0.88      0.93      0.90       589

    accuracy                           0.90      1149
   macro avg       0.90      0.89      0.90      1149
weighted avg       0.90      0.90      0.90      1149



In [63]:
# Prepare feature matrix (X) and target vector (y) for model training
X = dataset2['combined_text_encoded'].tolist()
y = dataset2['label']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=69)

# Train and evaluate a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=69)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred, target_names=['FAKE', 'REAL']))

# Train and evaluate a Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Naive Bayes Classification Report:\n", classification_report(y_test, y_pred_nb, target_names=['FAKE', 'REAL']))

Random Forest Accuracy: 0.9852
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       0.99      0.98      0.99      1279
        REAL       0.98      0.99      0.98      1221

    accuracy                           0.99      2500
   macro avg       0.99      0.99      0.99      2500
weighted avg       0.99      0.99      0.99      2500

Naive Bayes Accuracy: 0.9484
Naive Bayes Classification Report:
               precision    recall  f1-score   support

        FAKE       0.95      0.95      0.95      1279
        REAL       0.94      0.95      0.95      1221

    accuracy                           0.95      2500
   macro avg       0.95      0.95      0.95      2500
weighted avg       0.95      0.95      0.95      2500
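
The cells above train on dense Python lists. As a memory-friendlier alternative, the sketch below passes the sparse `CountVectorizer` output directly to `train_test_split` and `MultinomialNB`, both of which accept SciPy sparse matrices. The toy texts and labels are placeholders, and `stratify` is an added assumption to keep both classes in the tiny test split.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in for the preprocessed, stemmed article texts
texts = [
    "senat pass budget bill",
    "govern announc trade agreement",
    "court rule elect fund case",
    "minist resign polici disput",
    "shock secret want know",
    "believ celebr did",
    "miracl cure doctor hate expos",
    "wow bust truth final expos",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = REAL, 0 = FAKE

# Keep the bag-of-words matrix sparse instead of calling .toarray().tolist()
X_sparse = CountVectorizer().fit_transform(texts)

X_train, X_test, y_train, y_test = train_test_split(
    X_sparse, labels, test_size=0.25, random_state=69, stratify=labels
)

clf = MultinomialNB().fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))

print("Training matrix is sparse:", issparse(X_train))
print("Train/test rows:", X_train.shape[0], X_test.shape[0])
print("Accuracy on toy data:", accuracy)
```

`RandomForestClassifier` accepts sparse input as well, so the same matrix could be reused for both baselines.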



## LSTM Classifier

In [6]:
import keras
import tensorflow as tf
from keras.preprocessing import text, sequence
from keras.models import Sequential
from keras.layers import Dense,Embedding,LSTM,Dropout
from keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.preprocessing.text import Tokenizer

glove_twitter = "Data/glove.twitter.27B.100d.txt" #from https://www.kaggle.com/datasets/icw123/glove-twitter

In [8]:
def get_coeffs(word, *arr):
    return word, np.asarray(arr, dtype = "float32")

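# Each GloVe line holds a word followed by its vector components, separated by spaces
with open(glove_twitter, encoding="utf-8") as f:
    embeddings_index = dict(get_coeffs(*g.rstrip().rsplit(" ")) for g in f)

# Show the number of loaded word vectors instead of displaying the whole dictionary
print(len(embeddings_index))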
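
To make the expected file layout concrete, the sketch below parses two made-up, three-dimensional vectors from an in-memory file using the same `get_coeffs` helper. The words and values are invented for illustration; the real GloVe Twitter file has 100 components per line.

```python
import io

import numpy as np


def get_coeffs(word, *arr):
    # First field is the token, the remaining fields are the vector components
    return word, np.asarray(arr, dtype="float32")


# Hypothetical stand-in for the GloVe text file: "<word> <v1> <v2> <v3>" per line
fake_glove_file = io.StringIO("hello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")

embeddings_index = dict(
    get_coeffs(*line.rstrip().rsplit(" ")) for line in fake_glove_file
)

print(len(embeddings_index))
print(embeddings_index["hello"])
```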