# Classic Classifier as benchmark

The main goal of this exercise is to get a feeling and understanding on the importance of
representation and extraction of information from complex media content, in this case images or
text. You will thus get some datasets that have an image classification target.  

(1) In the first step, you shall try to find a good classifier with „traditional“ feature extraction
methods. Thus, pick one feature extractor based on e.g. Bag Of Words, or n-grams, or similar
You shall evaluate them on two shallow algorithms, optimising the parameter settings to see what
performance you can achieve, to have a baseline for the subsequent steps.


In [2]:
import re
import nltk
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report


## Loading, preprocessing and feature extraction

### Dataset 1 


In [3]:

file_path = "Data/fake_and_real_news_dataset.csv" 
dataset1 = pd.read_csv(file_path, encoding="utf-8", on_bad_lines='skip')


dataset1.columns = ['iid', 'title', 'text', 'label']

print(dataset1.head())


          iid                                              title  \
0  Fq+C96tcx+  ‘A target on Roe v. Wade ’: Oklahoma bill maki...   
1  bHUqK!pgmv  Study: women had to drive 4 times farther afte...   
2  4Y4Ubf%aTi        Trump, Clinton clash in dueling DC speeches   
3  _CoY89SJ@K  Grand jury in Texas indicts activists behind P...   
4  +rJHoRQVLe  As Reproductive Rights Hang In The Balance, De...   

                                                text label  
0  UPDATE: Gov. Fallin vetoed the bill on Friday....  REAL  
1  Ever since Texas laws closed about half of the...  REAL  
2  Donald Trump and Hillary Clinton, now at the s...  REAL  
3  A Houston grand jury investigating criminal al...  REAL  
4  WASHINGTON -- Forty-three years after the Supr...  REAL  


In [4]:
dataset1['label'].fillna('FAKE', inplace=True)

print(dataset1['label'].isnull().sum())

0


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  dataset1['label'].fillna('FAKE', inplace=True)


### Preprocessing

We encoded the labels, converting 'FAKE' to 0 and 'REAL' to 1. Next, we combined the 'title' and 'text' columns into a single 'combined_text' column. The text was then converted to lowercase to ensure consistency. Special characters and punctuation were removed using a regular expression. We initialized the PorterStemmer to reduce words to their root forms and defined a set of English stopwords. The text was further processed by removing stopwords and applying stemming directly to the words. Finally, we dropped the intermediate 'combined_text' column to clean up the dataset. 

In [5]:


nltk.download('punkt')
nltk.download('stopwords')

dataset1['label'] = dataset1['label'].map({'FAKE': 0, 'REAL': 1})

dataset1['combined_text'] = dataset1['title'] + " " + dataset1['text']

dataset1['combined_text'] = dataset1['combined_text'].str.lower()

dataset1['combined_text'] = dataset1['combined_text'].apply(lambda x: re.sub(r'\W+', ' ', str(x)))

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  

dataset1['combined_text_processed'] = dataset1['combined_text'].apply(
    lambda text: ' '.join([ps.stem(word) for word in text.split() if word not in stop_words])
)

dataset1 = dataset1.drop(columns=['combined_text'])

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maxmi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


KeyboardInterrupt: 

Data splitting

In [None]:
dataset1_X_train, dataset1_X_test, dataset1_y_train, dataset1_y_test = train_test_split(
    dataset1['combined_text_processed'].tolist(),  
    dataset1['label'],                             
    test_size=0.25,                                
    random_state=69                                
)

## LTSM classifier

Next we performed text tokenization and sequence padding to prepare text data for machine learning models. The Tokenizer converts words into unique integers, creating a vocabulary from the training data.  The sequences are padded (or truncated) to a fixed length (maxlen) to ensure uniform input size for the model.

In [None]:

text_tokenizer = Tokenizer()

text_tokenizer.fit_on_texts(dataset1_X_train)

train_sequences = text_tokenizer.texts_to_sequences(dataset1_X_train)
test_sequences = text_tokenizer.texts_to_sequences(dataset1_X_test)

max_length = 500  
training_padded_sequences = pad_sequences(train_sequences, maxlen=max_length, padding='post', truncating='post')
testing_padded_sequences = pad_sequences(test_sequences, maxlen=max_length, padding='post', truncating='post')

In [None]:
word_index_dataset1 = text_tokenizer.word_index
vocabulary_size = len(word_index_dataset1)

print("Vocabulary Size:", vocabulary_size)
print("Word Index:", word_index_dataset1)

Vocabulary Size: 34246
Word Index: {'trump': 1, 'clinton': 2, 'said': 3, 'state': 4, 'would': 5, 'campaign': 6, 'one': 7, 'republican': 8, 'peopl': 9, 'presid': 10, 'new': 11, 'say': 12, 'elect': 13, 'like': 14, 'democrat': 15, 'time': 16, 'hillari': 17, 'year': 18, 'parti': 19, 'vote': 20, 'support': 21, 'also': 22, 'polit': 23, 'go': 24, 'obama': 25, 'american': 26, 'candid': 27, 'us': 28, 'make': 29, 'even': 30, 'sander': 31, 'voter': 32, 'could': 33, 'get': 34, 'nation': 35, 'donald': 36, 'first': 37, 'call': 38, 'presidenti': 39, 'report': 40, 'work': 41, 'countri': 42, 'mani': 43, 'day': 44, 'use': 45, 'two': 46, 'cruz': 47, 'want': 48, 'right': 49, 'think': 50, 'take': 51, 'win': 52, 'govern': 53, 'back': 54, 'poll': 55, 'know': 56, 'way': 57, 'come': 58, 'need': 59, 'email': 60, '2016': 61, 'last': 62, 'percent': 63, 'debat': 64, 'show': 65, 'u': 66, 'news': 67, 'point': 68, 'hous': 69, 'world': 70, 'well': 71, 'white': 72, 'may': 73, 'week': 74, 'america': 75, 'war': 76, 'run'

In [None]:
training_padded_sequences[1]


array([ 593,  117,   14,  633,   17,    2,   61,  842,  113,   17,    2,
       2116,   25,  303, 1111, 2438, 1238,  337,   80,   79,  363,  156,
         78,  183,    4,   82,  178,  671,  349,   66, 1551,  307,  407,
        593, 1272, 4724, 3309,  502,  444,  250,    2,  563, 1172,  813,
        689,  117,  593,  341,  628,  386, 1251,  123, 1439,  130, 1146,
        124,  715,    7, 2404,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

Creating Embedding Index

In [None]:
import numpy as np

glove_embeddings = {}
with open('C:/Users/User/Downloads/archive (15)/glove.6B.100d.txt', encoding='utf-8') as file:  # Path to GloVe file
    for line in file:
        values = line.split()
        word = values[0]  
        vector = np.asarray(values[1:], dtype='float32') 
        glove_embeddings[word] = vector

print(f"Loaded {len(glove_embeddings)} word vectors.")

Loaded 400000 word vectors.


Creating Embedding Matrix

In [None]:
embedding_matrix = np.zeros((vocabulary_size + 1, 100))  
for word, index in word_index_dataset1.items():
    embedding_vector = glove_embeddings.get(word) 
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector 

print("Embedding vector for the word at index 1:")
print(embedding_matrix[1])

Embedding vector for the word at index 1:
[-0.15730999 -0.75502998  0.36844999 -0.18957999 -0.16896001 -0.23157001
 -0.22657999 -0.30186     0.24372     0.61896002  0.58995003  0.047638
 -0.055164   -0.70210999  0.22084001 -0.69231999  0.49419001  1.42850006
 -0.25362     0.20031001 -0.26192001  0.05315    -0.048418   -0.44982001
  0.54644001 -0.014645   -0.015531   -0.61197001 -0.91964    -0.75279999
  0.64842999  1.0934      0.052682    0.33344999  0.10532     0.59517002
  0.023104   -0.37105     0.29749    -0.23683     0.079566   -0.10326
  0.35885    -0.28935    -0.19881     0.22908001 -0.061435    0.56127
 -0.017115   -0.32868001 -0.78416997 -0.49375001  0.34944001  0.16278
 -0.061168   -1.31060004  0.39151999  0.124      -0.20873    -0.18472999
 -0.56184     0.55693001  0.012114   -0.54544997 -0.31409001  0.1
  0.31542999  0.74756998 -0.47734001 -0.18332    -0.65622997  0.40768
 -0.30697    -0.47246999 -0.7421     -0.44977999 -0.078122   -0.52673
 -0.70633     1.32710004  0.26298

Training the LSTM classifier

In [None]:
lstm_model1 = Sequential([
    Embedding(input_dim=vocabulary_size + 1, output_dim=100 , input_length=max_length, weights=[embedding_matrix], trainable=False),
    Dropout(rate=0.2),  
    LSTM(units=128),  
    Dropout(rate=0.2),  
    Dense(units=256, activation='relu'),  
    Dense(units=1, activation='sigmoid')  
])

lstm_model1.build(input_shape=(None, max_length))

lstm_model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

lstm_model1.summary()

lstm_model2 = Sequential([
    Embedding(input_dim=vocabulary_size + 1, output_dim=100 , input_length=max_length, weights=[embedding_matrix], trainable=False),
    Dropout(rate=0.2),  
    LSTM(units=128, return_sequences=True),  
    Dropout(rate=0.2),  
    LSTM(units=128),  
    Dense(units=256, activation='relu'),  
    Dense(units=1, activation='sigmoid')  
])

lstm_model2.build(input_shape=(None, max_length))

lstm_model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

lstm_model2.summary()

lstm_model3 = Sequential([
    Embedding(input_dim=vocabulary_size + 1, output_dim=100 , input_length=max_length, weights=[embedding_matrix], trainable=False),
    Dropout(rate=0.5),  
    LSTM(units=128)
    Dropout(rate=0.5), 
    Dense(units=256, activation='relu'),  
    Dense(units=1, activation='sigmoid')  
])

lstm_model3.build(input_shape=(None, max_length))

lstm_model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

lstm_model3.summary()

SyntaxError: invalid syntax (2904284774.py, line 36)

In [None]:
training_history1 = lstm_model1.fit(
    training_padded_sequences, dataset1_y_train,  # Use padded sequences
    validation_data=(testing_padded_sequences, dataset1_y_test),
    epochs=10,
    batch_size=256
)

training_history2 = lstm_model2.fit(
    training_padded_sequences, dataset1_y_train,  # Use padded sequences
    validation_data=(testing_padded_sequences, dataset1_y_test),
    epochs=10,
    batch_size=256
)

training_history3 = lstm_model3.fit(
    training_padded_sequences, dataset1_y_train,  # Use padded sequences
    validation_data=(testing_padded_sequences, dataset1_y_test),
    epochs=10,
    batch_size=256
)

Epoch 1/2
[1m14/14[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m21s[0m 1s/step - accuracy: 0.6158 - loss: 0.6717 - val_accuracy: 0.6632 - val_loss: 0.6024
Epoch 2/2
[1m14/14[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m25s[0m 2s/step - accuracy: 0.6811 - loss: 0.6098 - val_accuracy: 0.7224 - val_loss: 0.5718


### Dataset 2


Load the dataset

In [None]:
true_df = pd.read_csv("Data/True.csv")
fake_df = pd.read_csv("Data/Fake.csv")


true_df["label"] = 1  
fake_df["label"] = 0  

dataset2 = pd.concat([true_df, fake_df], axis=0).reset_index(drop=True)

dataset2 = dataset2.drop(columns=["date", "subject"])

dataset2 = dataset2.sample(n = 2000, random_state=69)



Check the missing values


In [None]:
print(dataset2.isnull().sum())
dataset2 = dataset2.dropna()  


title    0
text     0
label    0
dtype: int64


The preprocessing is done in the same way as the dataset1

In [None]:
nltk.download('punkt')
nltk.download('stopwords')


dataset2['combined'] = dataset2['title'] + " " + dataset2['text']

dataset2['combined'] = dataset2['combined'].str.lower()

dataset2['combined'] = dataset2['combined'].apply(lambda x: re.sub(r'\W+', ' ', str(x)))

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  

dataset2['preprocessed_combined'] = dataset2['combined'].apply(
    lambda text: ' '.join([ps.stem(word) for word in text.split() if word not in stop_words])
)

dataset2 = dataset2.drop(columns=['combined'])


[nltk_data] Downloading package punkt to C:\Users\User/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
dataset2.head()

Unnamed: 0,title,text,label,preprocessed_combined
21347,Downfall of ex-Samsung strategy chief leaves '...,"SEOUL (Reuters) - Over four decades, Choi Gee-...",1,downfal ex samsung strategi chief leav salarym...
2583,"Attorney General Sessions visits White House, ...",ABOARD AIR FORCE ONE (Reuters) - U.S. Attorney...,1,attorney gener session visit white hous trump ...
22746,Trump Has Finally Commented On Portland Train...,Donald Trump doesn t particularly have a filte...,0,trump final comment portland train attack sort...
4756,Lawmaker says U.S. foreign surveillance 'unmas...,WASHINGTON (Reuters) - The Republican chairman...,1,lawmak say u foreign surveil unmask trump asso...
35759,SHOCKING POLL RESULTS In Primary Victories Ton...,There are no surprises with the results on the...,0,shock poll result primari victori tonight surp...


## LSTM Classifier

Splitting the data:

In [None]:
dataset2_X_train, dataset2_X_test, dataset2_y_train, dataset2_y_test = train_test_split(dataset2['preprocessed_combined'].tolist(), dataset2['label'], test_size=0.25, random_state=69)

Tokenization and padding

In [None]:

tokenizer = Tokenizer()
tokenizer.fit_on_texts(dataset2_X_test)
word_index = tokenizer.word_index
vocab_size = len(word_index)

print(vocab_size, word_index)

11532 {'trump': 1, 'said': 2, 'state': 3, 'presid': 4, 'u': 5, 'would': 6, 'peopl': 7, 'one': 8, 'year': 9, 'say': 10, 'republican': 11, 'new': 12, 'like': 13, 'american': 14, 'time': 15, 'hous': 16, 'clinton': 17, 'reuter': 18, 'report': 19, 'also': 20, 'nation': 21, 'elect': 22, 'countri': 23, 'govern': 24, 'democrat': 25, 'first': 26, 'obama': 27, 'could': 28, 'right': 29, 'support': 30, 'unit': 31, 'campaign': 32, 'donald': 33, 'call': 34, 'white': 35, 'vote': 36, 'group': 37, 'two': 38, 'offici': 39, 'use': 40, 'go': 41, 'work': 42, 'told': 43, 'senat': 44, 'make': 45, 'washington': 46, 'parti': 47, 'get': 48, 'offic': 49, 'polit': 50, 'includ': 51, 'north': 52, 'former': 53, 'attack': 54, 'news': 55, 'secur': 56, 'back': 57, 'last': 58, 'want': 59, 'take': 60, 'hillari': 61, 'law': 62, 'mani': 63, 'leader': 64, 'plan': 65, 'even': 66, 'america': 67, 'day': 68, 'media': 69, 'video': 70, 'week': 71, 'polici': 72, 'may': 73, 'need': 74, 'sourc': 75, 'school': 76, 'administr': 77, 't

In [None]:
# padding data
training_sequences = tokenizer.texts_to_sequences(dataset2_X_train) #convert every word in a news to its number in the dictionary
training_padded_seq = pad_sequences(training_sequences, maxlen=500, padding='post', truncating='post') #cut off encoding after 500 words

testing_sequences = tokenizer.texts_to_sequences(dataset2_X_test) #words that are not in the dic. are skipped and dont get a number
testing_padded_seq = pad_sequences(testing_sequences, maxlen=500, padding='post', truncating='post') #cut off encoding after 500 words


In [None]:
training_padded_seq[1] 

array([  418,  1044,  1270,  1097,   439,     1,   686,  1099,    18,
          38,    19,    38,  3641,  1044,    55,  1264,   110,   191,
         632,  1193,   439,  1099,    58,    71,     8,    19,   403,
         277,    11,   106,   589,  1700,    33,     1,    19,   884,
        1765,   432,   663,  1787,   176,     1,    32,   496,  5574,
        7437,    10,  2113,   429,   686,   139,   584,   844,   108,
         196, 10495,  7437,     1,   515,   249,  1139,  2375,  4471,
        1044,  1270,   249,   496, 10932,     1,    32,  1044,   833,
         474,   176,  1139,   455,   186,  1765,  1899,   533,   718,
        7437,  2113,    88,    19,   655,  5079,    55,   632,  1044,
         403,  3641,   493,    20,   127,  3932,   162,   203,   520,
        1765,  1765,  3641,   493,  1432,  1264,   217,   506,   477,
          21,    56,  2272,  1908,  3641,  6092,  1264,   226,    88,
         565,   136,    96,   249,    55,   870,   355,  1451,   429,
           1,    32,

Creating Embedding Index

In [None]:
embedding_index = {}
with open('C:/Users/User/Downloads/archive (15)/glove.6B.100d.txt', encoding='utf-8') as f: #https://www.kaggle.com/datasets/danielwillgeorge/glove6b100dtxt/data?select=glove.6B.100d.txt
    
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs


Creating Embedding Matrix

In [None]:
# create embedding matrix (gets rid of vectors of words that are NOT in our vocabulary)
embedding_matrix = np.zeros((vocab_size+1, 100))
for word, i in word_index.items():
    embedding_vector = embedding_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

embedding_matrix[1]

array([-0.15730999, -0.75502998,  0.36844999, -0.18957999, -0.16896001,
       -0.23157001, -0.22657999, -0.30186   ,  0.24372   ,  0.61896002,
        0.58995003,  0.047638  , -0.055164  , -0.70210999,  0.22084001,
       -0.69231999,  0.49419001,  1.42850006, -0.25362   ,  0.20031001,
       -0.26192001,  0.05315   , -0.048418  , -0.44982001,  0.54644001,
       -0.014645  , -0.015531  , -0.61197001, -0.91964   , -0.75279999,
        0.64842999,  1.0934    ,  0.052682  ,  0.33344999,  0.10532   ,
        0.59517002,  0.023104  , -0.37105   ,  0.29749   , -0.23683   ,
        0.079566  , -0.10326   ,  0.35885   , -0.28935   , -0.19881   ,
        0.22908001, -0.061435  ,  0.56127   , -0.017115  , -0.32868001,
       -0.78416997, -0.49375001,  0.34944001,  0.16278   , -0.061168  ,
       -1.31060004,  0.39151999,  0.124     , -0.20873   , -0.18472999,
       -0.56184   ,  0.55693001,  0.012114  , -0.54544997, -0.31409001,
        0.1       ,  0.31542999,  0.74756998, -0.47734001, -0.18

## Traning the LSTM

In [None]:
model = Sequential([
    Embedding(vocab_size+1, 100, weights=[embedding_matrix], trainable=False),
    Dropout(0.2),
    LSTM(128),
    Dropout(0.2),
    Dense(256),
    Dense(1, activation='sigmoid')
])

In [None]:
lstm_model1 = Sequential([
    Embedding(input_dim=vocabulary_size + 1, output_dim=100 , input_length=max_length, weights=[embedding_matrix], trainable=False),
    Dropout(rate=0.2),  
    LSTM(units=128),  
    Dropout(rate=0.2),  
    Dense(units=256, activation='relu'),  
    Dense(units=1, activation='sigmoid')  
])

lstm_model1.build(input_shape=(None, max_length))

lstm_model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

lstm_model1.summary()

lstm_model2 = Sequential([
    Embedding(input_dim=vocabulary_size + 1, output_dim=100 , input_length=max_length, weights=[embedding_matrix], trainable=False),
    Dropout(rate=0.2),  
    LSTM(units=128, return_sequences=True),  
    Dropout(rate=0.2),  
    LSTM(units=128),  
    Dense(units=256, activation='relu'),  
    Dense(units=1, activation='sigmoid')  
])

lstm_model2.build(input_shape=(None, max_length))

lstm_model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

lstm_model2.summary()

lstm_model3 = Sequential([
    Embedding(input_dim=vocabulary_size + 1, output_dim=100 , input_length=max_length, weights=[embedding_matrix], trainable=False),
    Dropout(rate=0.5),  
    LSTM(units=128)
    Dropout(rate=0.5), 
    Dense(units=256, activation='relu'),  
    Dense(units=1, activation='sigmoid')  
])

lstm_model3.build(input_shape=(None, max_length))

lstm_model3.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

lstm_model3.summary()

In [None]:
training_history1 = lstm_model1.fit(
    training_padded_sequences, dataset2_y_test,  # Use padded sequences
    validation_data=(testing_padded_sequences, dataset1_y_test),
    epochs=10,
    batch_size=256
)

training_history2 = lstm_model2.fit(
    training_padded_sequences, dataset2_y_test,  # Use padded sequences
    validation_data=(testing_padded_sequences, dataset1_y_test),
    epochs=10,
    batch_size=256
)

training_history3 = lstm_model3.fit(
    training_padded_sequences, dataset2_y_test,  # Use padded sequences
    validation_data=(testing_padded_sequences, dataset1_y_test),
    epochs=10,
    batch_size=256
)

## Training the classifier (Random Forest & Naive Bayes)

### Dataset 1

In [None]:
#Random Forest classifier:
rf_classifier1 = RandomForestClassifier(n_estimators=100, random_state=69)
rf_classifier1.fit(training_padded_sequences, dataset1_y_train)
y_predicted = rf_classifier1.predict(testing_padded_sequences)
print("Random Forest Accuracy:", accuracy_score(y_predicted, dataset1_y_test))
print("Random Forest Classification Report:\n", classification_report(dataset1_y_test, y_predicted, target_names=['FAKE', 'REAL']))

#Naive Bayes classifier:
nb_classifier = MultinomialNB()
nb_classifier.fit(training_padded_sequences, dataset1_y_train)
y_predicted_nb = nb_classifier.predict(testing_padded_sequences)
print("Random Forest Accuracy:", accuracy_score(y_predicted_nb, dataset1_y_test))
print("Random Forest Classification Report:\n", classification_report(dataset1_y_test, y_predicted, target_names=['FAKE', 'REAL']))

Random Forest Accuracy: 0.7919930374238469
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       0.79      0.78      0.78       558
        REAL       0.79      0.81      0.80       591

    accuracy                           0.79      1149
   macro avg       0.79      0.79      0.79      1149
weighted avg       0.79      0.79      0.79      1149

Random Forest Accuracy: 0.6718885987815492
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       0.79      0.78      0.78       558
        REAL       0.79      0.81      0.80       591

    accuracy                           0.79      1149
   macro avg       0.79      0.79      0.79      1149
weighted avg       0.79      0.79      0.79      1149



### Dataset 2

In [None]:
#Random Forest classifier:
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=69)
rf_classifier.fit(training_padded_seq, dataset2_y_train)
y_pred = rf_classifier.predict(testing_padded_seq)
print("Random Forest Accuracy:", accuracy_score(y_pred, dataset2_y_test))
print("Random Forest Classification Report:\n", classification_report(dataset2_y_test, y_pred, target_names=['FAKE', 'REAL']))

#Naive Bayes classifier:
nb_classifier = MultinomialNB()
nb_classifier.fit(training_padded_seq, dataset2_y_train)
y_pred_nb = nb_classifier.predict(testing_padded_seq)
print("Random Forest Accuracy:", accuracy_score(y_pred_nb, dataset2_y_test))
print("Random Forest Classification Report:\n", classification_report(dataset2_y_test, y_pred, target_names=['FAKE', 'REAL']))

Random Forest Accuracy: 0.75
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       0.69      0.89      0.78       247
        REAL       0.85      0.62      0.71       253

    accuracy                           0.75       500
   macro avg       0.77      0.75      0.75       500
weighted avg       0.77      0.75      0.75       500

Random Forest Accuracy: 0.558
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       0.69      0.89      0.78       247
        REAL       0.85      0.62      0.71       253

    accuracy                           0.75       500
   macro avg       0.77      0.75      0.75       500
weighted avg       0.77      0.75      0.75       500

