# Classic Classifier as benchmark

The main goal of this exercise is to get a feeling and understanding on the importance of
representation and extraction of information from complex media content, in this case images or
text. You will thus get some datasets that have an image classification target.  

(1) In the first step, you shall try to find a good classifier with „traditional“ feature extraction
methods. Thus, pick one feature extractor based on e.g. Bag Of Words, or n-grams, or similar
You shall evaluate them on two shallow algorithms, optimising the parameter settings to see what
performance you can achieve, to have a baseline for the subsequent steps.


In [2]:
import re
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np


## Loading, preprocessing and feature extraction

### Dataset 1 


In [3]:

file_path = "Data/fake_and_real_news_dataset.csv" 
dataset1 = pd.read_csv(file_path, encoding="utf-8", on_bad_lines='skip')


dataset1.columns = ['iid', 'title', 'text', 'label']

print(dataset1.head())


          iid                                              title  \
0  Fq+C96tcx+  ‘A target on Roe v. Wade ’: Oklahoma bill maki...   
1  bHUqK!pgmv  Study: women had to drive 4 times farther afte...   
2  4Y4Ubf%aTi        Trump, Clinton clash in dueling DC speeches   
3  _CoY89SJ@K  Grand jury in Texas indicts activists behind P...   
4  +rJHoRQVLe  As Reproductive Rights Hang In The Balance, De...   

                                                text label  
0  UPDATE: Gov. Fallin vetoed the bill on Friday....  REAL  
1  Ever since Texas laws closed about half of the...  REAL  
2  Donald Trump and Hillary Clinton, now at the s...  REAL  
3  A Houston grand jury investigating criminal al...  REAL  
4  WASHINGTON -- Forty-three years after the Supr...  REAL  


In [4]:
# Replace missing values in the 'label' column with 'FAKE'
dataset1['label'].fillna('FAKE', inplace=True)

# Verify that there are no missing values left in the 'label' column
print(dataset1['label'].isnull().sum())

0


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  dataset1['label'].fillna('FAKE', inplace=True)


In [5]:
nltk.download('punkt_tab')


nltk.download('punkt')
nltk.download('stopwords')

# Encode labels: FAKE -> 0, TRUE -> 1
dataset1['label'] = dataset1['label'].map({'FAKE': 0, 'REAL': 1})

# Combine 'title' and 'text' into one column
dataset1['combined_text'] = dataset1['title'] + " " + dataset1['text']

# Convert text to lowercase
dataset1['combined_text'] = dataset1['combined_text'].str.lower()

# Remove special characters and punctuation
dataset1['combined_text'] = dataset1['combined_text'].apply(lambda x: re.sub(r'\W+', ' ', str(x)))

# Initialize PorterStemmer
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # Define stop words

# Tokenization, stop-word removal, and stemming
dataset1['combined_text_tokens'] = dataset1['combined_text'].apply(word_tokenize)
dataset1['combined_text_tokens'] = dataset1['combined_text_tokens'].apply(
    lambda tokens: [word for word in tokens if word not in stop_words]
)
dataset1['combined_text_stemmed'] = dataset1['combined_text_tokens'].apply(
    lambda tokens: [ps.stem(token) for token in tokens]
)

# Convert stemmed tokens back into strings for CountVectorizer
dataset1['combined_text_stemmed_text'] = dataset1['combined_text_stemmed'].apply(' '.join)

# Use CountVectorizer to convert text into a bag-of-words representation
vectorizer = CountVectorizer()
vector = vectorizer.fit_transform(dataset1['combined_text_stemmed_text'])
dataset1['combined_text_encoded'] = vector.toarray().tolist()

# Drop intermediate columns
dataset1 = dataset1.drop(columns=['combined_text_tokens', 'combined_text_stemmed', 'combined_text_stemmed_text'])



print(dataset1.head())

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


          iid                                              title  \
0  Fq+C96tcx+  ‘A target on Roe v. Wade ’: Oklahoma bill maki...   
1  bHUqK!pgmv  Study: women had to drive 4 times farther afte...   
2  4Y4Ubf%aTi        Trump, Clinton clash in dueling DC speeches   
3  _CoY89SJ@K  Grand jury in Texas indicts activists behind P...   
4  +rJHoRQVLe  As Reproductive Rights Hang In The Balance, De...   

                                                text  label  \
0  UPDATE: Gov. Fallin vetoed the bill on Friday....      1   
1  Ever since Texas laws closed about half of the...      1   
2  Donald Trump and Hillary Clinton, now at the s...      1   
3  A Houston grand jury investigating criminal al...      1   
4  WASHINGTON -- Forty-three years after the Supr...      1   

                                       combined_text  \
0   a target on roe v wade oklahoma bill making i...   
1  study women had to drive 4 times farther after...   
2  trump clinton clash in dueling dc speeche

Since in the combined_text_encoded column we can only see zeros we will check if there are non-zero values in order to be sure the preprocessing has gone smoothly


In [6]:

nonzero_count = vector.nnz  # or X.count_nonzero()
print("Number of nonzero entries:", nonzero_count)


dataset1.head()

Number of nonzero entries: 1226720


Unnamed: 0,iid,title,text,label,combined_text,combined_text_encoded
0,Fq+C96tcx+,‘A target on Roe v. Wade ’: Oklahoma bill maki...,UPDATE: Gov. Fallin vetoed the bill on Friday....,1,a target on roe v wade oklahoma bill making i...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
1,bHUqK!pgmv,Study: women had to drive 4 times farther afte...,Ever since Texas laws closed about half of the...,1,study women had to drive 4 times farther after...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,4Y4Ubf%aTi,"Trump, Clinton clash in dueling DC speeches","Donald Trump and Hillary Clinton, now at the s...",1,trump clinton clash in dueling dc speeches don...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
3,_CoY89SJ@K,Grand jury in Texas indicts activists behind P...,A Houston grand jury investigating criminal al...,1,grand jury in texas indicts activists behind p...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,+rJHoRQVLe,"As Reproductive Rights Hang In The Balance, De...",WASHINGTON -- Forty-three years after the Supr...,1,as reproductive rights hang in the balance deb...,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


We visualize for this dataset what the most frequent words are.

### Dataset 2


Load the dataset

In [51]:
true_df = pd.read_csv("Data/True.csv")
fake_df = pd.read_csv("Data/Fake.csv")

# Add label column
true_df["label"] = 1  
fake_df["label"] = 0  

# Combine both datasets
dataset2 = pd.concat([true_df, fake_df], axis=0).reset_index(drop=True)

#remove unnecessary columns
dataset2 = dataset2.drop(columns=["date", "subject"])

dataset2 = dataset2.sample(n = 2000, random_state=69)

print(dataset2.head())

                                                   title  \
21347  Downfall of ex-Samsung strategy chief leaves '...   
2583   Attorney General Sessions visits White House, ...   
22746   Trump Has Finally Commented On Portland Train...   
4756   Lawmaker says U.S. foreign surveillance 'unmas...   
35759  SHOCKING POLL RESULTS In Primary Victories Ton...   

                                                    text  label  
21347  SEOUL (Reuters) - Over four decades, Choi Gee-...      1  
2583   ABOARD AIR FORCE ONE (Reuters) - U.S. Attorney...      1  
22746  Donald Trump doesn t particularly have a filte...      0  
4756   WASHINGTON (Reuters) - The Republican chairman...      1  
35759  There are no surprises with the results on the...      0  


Check the missing values


In [52]:
print(dataset2.isnull().sum())  # Check missing values
dataset2 = dataset2.dropna()  # Drop rows with missing values


title    0
text     0
label    0
dtype: int64


Text cleaning (converting text to lowercase, removing special characters, numbers and punctuations)

In [53]:
# Combine the 'title' and 'text' columns into a new column 'combined'
dataset2['combined'] = dataset2['title'] + ' ' + dataset2['text']
# Convert text to lowercase
dataset2['combined'] = dataset2['combined'].str.lower()

# Remove special characters and punctuation
dataset2['combined'] = dataset2['combined'].apply(lambda x: re.sub(r'\W+', ' ', str(x)))

In [54]:
dataset2.head()


Unnamed: 0,title,text,label,combined
21347,Downfall of ex-Samsung strategy chief leaves '...,"SEOUL (Reuters) - Over four decades, Choi Gee-...",1,downfall of ex samsung strategy chief leaves s...
2583,"Attorney General Sessions visits White House, ...",ABOARD AIR FORCE ONE (Reuters) - U.S. Attorney...,1,attorney general sessions visits white house n...
22746,Trump Has Finally Commented On Portland Train...,Donald Trump doesn t particularly have a filte...,0,trump has finally commented on portland train...
4756,Lawmaker says U.S. foreign surveillance 'unmas...,WASHINGTON (Reuters) - The Republican chairman...,1,lawmaker says u s foreign surveillance unmaske...
35759,SHOCKING POLL RESULTS In Primary Victories Ton...,There are no surprises with the results on the...,0,shocking poll results in primary victories ton...


Tokenization and removing stop words

In [55]:
import nltk

# Tokenize the combined text
dataset2['tokens'] = dataset2['combined'].apply(nltk.word_tokenize)


In [56]:
from nltk.corpus import stopwords
nltk.download('stopwords')  # Download stopwords if not already downloaded

# Define the set of stop words
stop_words = set(stopwords.words('english'))

# Remove stop words from the tokens
dataset2['tokens'] = dataset2['tokens'].apply(lambda tokens: [word for word in tokens if word not in stop_words])


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Stemming

In [57]:

ps = PorterStemmer()

dataset2['stemmed'] = dataset2['tokens'].apply(lambda tokens: [ps.stem(word) for word in tokens])


In [58]:
# Join the tokens in the 'stemmed' column into a single string
dataset2['stemmed_text'] = dataset2['stemmed'].apply(lambda tokens: ' '.join(tokens))


In [59]:
dataset2 = dataset2.drop(columns=['tokens', 'stemmed'])


## LSTM Classifier

In [38]:
from keras.preprocessing.text import Tokenizer
from keras_preprocessing.sequence import pad_sequences

In [64]:
dataset2_X_train, dataset2_X_test, dataset2_y_train, dataset2_y_test = train_test_split(dataset2['stemmed_text'].tolist(), dataset2['label'], test_size=0.25, random_state=69)

In [65]:
# tokenize text
tokenizer = Tokenizer()
tokenizer.fit_on_texts(dataset2_X_test)
word_index = tokenizer.word_index
vocab_size = len(word_index)

print(vocab_size, word_index)

11532 {'trump': 1, 'said': 2, 'state': 3, 'presid': 4, 'u': 5, 'would': 6, 'peopl': 7, 'one': 8, 'year': 9, 'say': 10, 'republican': 11, 'new': 12, 'like': 13, 'american': 14, 'time': 15, 'hous': 16, 'clinton': 17, 'reuter': 18, 'report': 19, 'also': 20, 'nation': 21, 'elect': 22, 'countri': 23, 'govern': 24, 'democrat': 25, 'first': 26, 'obama': 27, 'could': 28, 'right': 29, 'support': 30, 'unit': 31, 'campaign': 32, 'donald': 33, 'call': 34, 'white': 35, 'vote': 36, 'group': 37, 'two': 38, 'offici': 39, 'use': 40, 'go': 41, 'work': 42, 'told': 43, 'senat': 44, 'make': 45, 'washington': 46, 'parti': 47, 'get': 48, 'offic': 49, 'polit': 50, 'includ': 51, 'north': 52, 'former': 53, 'attack': 54, 'news': 55, 'secur': 56, 'back': 57, 'last': 58, 'want': 59, 'take': 60, 'hillari': 61, 'law': 62, 'mani': 63, 'leader': 64, 'plan': 65, 'even': 66, 'america': 67, 'day': 68, 'media': 69, 'video': 70, 'week': 71, 'polici': 72, 'may': 73, 'need': 74, 'sourc': 75, 'school': 76, 'administr': 77, 't

In [66]:
# padding data
training_sequences = tokenizer.texts_to_sequences(dataset2_X_train) #convert every word in a news to its number in the dictionary
training_padded_seq = pad_sequences(training_sequences, maxlen=500, padding='post', truncating='post') #cut off encoding after 500 words

testing_sequences = tokenizer.texts_to_sequences(dataset2_X_test) #words that are not in the dic. are skipped and dont get a number
testing_padded_seq = pad_sequences(testing_sequences, maxlen=500, padding='post', truncating='post') #cut off encoding after 500 words


In [67]:
training_padded_seq[1] #the first word of the news has index 2961, and so on

array([  418,  1043,  1269,  1096,   439,     1,   686,  1098,    18,
          38,    19,    38,  3641,  1043,    55,  1263,   110,   191,
         632,  1192,   439,  1098,    58,    71,     8,    19,   403,
         277,    11,   106,   589,  1699,    33,     1,    19,   883,
        1764,   432,   663,  1786,   176,     1,    32,   496,  5574,
        7437,    10,  2113,   429,   686,   139,   584,   844,   108,
         196, 10495,  7437,     1,   515,   249,  1138,  2375,  4471,
        1043,  1269,   249,   496, 10932,     1,    32,  1043,   833,
         474,   176,  1138,   455,   186,  1764,  1899,   533,   718,
        7437,  2113,    88,    19,   655,  5079,    55,   632,  1043,
         403,  3641,   493,    20,   127,  3932,   162,   203,   520,
        1764,  1764,  3641,   493,  1431,  1263,   217,   506,   477,
          21,    56,  2272,  1908,  3641,  6092,  1263,   226,    88,
         565,   136,    96,   249,    55,   870,   355,  1450,   429,
           1,    32,

In [68]:
# create embedding index
embedding_index = {}
with open('Data/glove.6B.100d.txt', encoding='utf-8') as f: #https://www.kaggle.com/datasets/danielwillgeorge/glove6b100dtxt/data?select=glove.6B.100d.txt
    
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs


In [69]:
# create embedding matrix (gets rid of vectors of words that are NOT in our vocabulary)
embedding_matrix = np.zeros((vocab_size+1, 100))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_matrix[1]

array([-0.15730999, -0.75502998,  0.36844999, -0.18957999, -0.16896001,
       -0.23157001, -0.22657999, -0.30186   ,  0.24372   ,  0.61896002,
        0.58995003,  0.047638  , -0.055164  , -0.70210999,  0.22084001,
       -0.69231999,  0.49419001,  1.42850006, -0.25362   ,  0.20031001,
       -0.26192001,  0.05315   , -0.048418  , -0.44982001,  0.54644001,
       -0.014645  , -0.015531  , -0.61197001, -0.91964   , -0.75279999,
        0.64842999,  1.0934    ,  0.052682  ,  0.33344999,  0.10532   ,
        0.59517002,  0.023104  , -0.37105   ,  0.29749   , -0.23683   ,
        0.079566  , -0.10326   ,  0.35885   , -0.28935   , -0.19881   ,
        0.22908001, -0.061435  ,  0.56127   , -0.017115  , -0.32868001,
       -0.78416997, -0.49375001,  0.34944001,  0.16278   , -0.061168  ,
       -1.31060004,  0.39151999,  0.124     , -0.20873   , -0.18472999,
       -0.56184   ,  0.55693001,  0.012114  , -0.54544997, -0.31409001,
        0.1       ,  0.31542999,  0.74756998, -0.47734001, -0.18

## Traning the LSTM

In [70]:
from keras.layers import LSTM, Dropout, Dense, Embedding
from keras import Sequential

In [None]:
model = Sequential([
    Embedding(vocab_size+1, 100, weights=[embedding_matrix], trainable=False),
    Dropout(0.2),
    LSTM(128),
    Dropout(0.2),
    Dense(256),
    Dense(1, activation='sigmoid')
])

In [None]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics='accuracy')
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 100)         4027200   
                                                                 
 dropout_2 (Dropout)         (None, None, 100)         0         
                                                                 
 lstm_1 (LSTM)               (None, 128)               117248    
                                                                 
 dropout_3 (Dropout)         (None, 128)               0         
                                                                 
 dense_2 (Dense)             (None, 256)               33024     
                                                                 
 dense_3 (Dense)             (None, 1)                 257       
                                                                 
Total params: 4,177,729
Trainable params: 150,529
Non-

In [None]:
#train the model
history = model.fit(padded_seq, dataset2_y_train, epochs=2, batch_size=256, validation_data=(padded_seq, dataset2_y_train))

Epoch 1/2
Epoch 2/2


## Training the classifier

### Dataset 2

In [76]:
# Train and evaluate a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=69)
rf_classifier.fit(training_padded_seq, dataset2_y_train)
y_pred = rf_classifier.predict(testing_padded_seq)
print("Random Forest Accuracy:", accuracy_score(y_pred, dataset2_y_test))
print("Random Forest Classification Report:\n", classification_report(dataset2_y_test, y_pred, target_names=['FAKE', 'REAL']))

# Train and evaluate a Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(training_padded_seq, dataset2_y_train)
y_pred_nb = nb_classifier.predict(testing_padded_seq)
print("Random Forest Accuracy:", accuracy_score(y_pred_nb, dataset2_y_test))
print("Random Forest Classification Report:\n", classification_report(dataset2_y_test, y_pred, target_names=['FAKE', 'REAL']))

Random Forest Accuracy: 0.736
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       0.69      0.83      0.76       247
        REAL       0.80      0.64      0.71       253

    accuracy                           0.74       500
   macro avg       0.75      0.74      0.73       500
weighted avg       0.75      0.74      0.73       500

Random Forest Accuracy: 0.548
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       0.69      0.83      0.76       247
        REAL       0.80      0.64      0.71       253

    accuracy                           0.74       500
   macro avg       0.75      0.74      0.73       500
weighted avg       0.75      0.74      0.73       500

