# Classic Classifier as benchmark

The main goal of this exercise is to get a feeling for, and an understanding of, the importance of
representation and extraction of information from complex media content, in this case images or
text. You will thus get some datasets that have a classification target (in this notebook, text: real vs. fake news).  

(1) In the first step, you shall try to find a good classifier with "traditional" feature extraction
methods. Thus, pick one feature extractor based on e.g. bag-of-words, n-grams, or similar.
You shall evaluate it with two shallow algorithms, optimising the parameter settings to see what
performance you can achieve, to have a baseline for the subsequent steps.


In [1]:
import re
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np


Importing the datasets

In [22]:

file_path = "C:/Users/User/Downloads/fake_and_real_news_dataset.csv" 
dataset1 = pd.read_csv(file_path, encoding="utf-8", on_bad_lines='skip')


dataset1.columns = ['iid', 'title', 'text', 'label']

print(dataset1.head())


          iid                                              title  \
0  Fq+C96tcx+  ‘A target on Roe v. Wade ’: Oklahoma bill maki...   
1  bHUqK!pgmv  Study: women had to drive 4 times farther afte...   
2  4Y4Ubf%aTi        Trump, Clinton clash in dueling DC speeches   
3  _CoY89SJ@K  Grand jury in Texas indicts activists behind P...   
4  +rJHoRQVLe  As Reproductive Rights Hang In The Balance, De...   

                                                text label  
0  UPDATE: Gov. Fallin vetoed the bill on Friday....  REAL  
1  Ever since Texas laws closed about half of the...  REAL  
2  Donald Trump and Hillary Clinton, now at the s...  REAL  
3  A Houston grand jury investigating criminal al...  REAL  
4  WASHINGTON -- Forty-three years after the Supr...  REAL  


In [23]:
# Replace missing values in the 'label' column with 'FAKE'.
# Assigning the result back avoids the chained inplace call, which raises a
# FutureWarning and will stop working in pandas 3.0.
dataset1['label'] = dataset1['label'].fillna('FAKE')

# Verify that there are no missing values left in the 'label' column
print(dataset1['label'].isnull().sum())


0


In [3]:

true_df = pd.read_csv("C:/Users/User/Downloads/archive (14)/True.csv")
fake_df = pd.read_csv("C:/Users/User/Downloads/archive (14)/Fake.csv")

# Add label column
true_df["label"] = 1  
fake_df["label"] = 0  

# Combine both datasets
dataset2 = pd.concat([true_df, fake_df], axis=0).reset_index(drop=True)

print(dataset2.head())

                                               title  \
0  As U.S. budget fight looms, Republicans flip t...   
1  U.S. military to accept transgender recruits o...   
2  Senior U.S. Republican senator: 'Let Mr. Muell...   
3  FBI Russia probe helped by Australian diplomat...   
4  Trump wants Postal Service to charge 'much mor...   

                                                text       subject  \
0  WASHINGTON (Reuters) - The head of a conservat...  politicsNews   
1  WASHINGTON (Reuters) - Transgender people will...  politicsNews   
2  WASHINGTON (Reuters) - The special counsel inv...  politicsNews   
3  WASHINGTON (Reuters) - Trump campaign adviser ...  politicsNews   
4  SEATTLE/WASHINGTON (Reuters) - President Donal...  politicsNews   

                 date  label  
0  December 31, 2017       1  
1  December 29, 2017       1  
2  December 31, 2017       1  
3  December 30, 2017       1  
4  December 29, 2017       1  


## Preprocessing and feature extraction

Dataset 1 


In [24]:
nltk.download('punkt_tab')


nltk.download('punkt')
nltk.download('stopwords')

# Encode labels: FAKE -> 0, REAL -> 1
dataset1['label'] = dataset1['label'].map({'FAKE': 0, 'REAL': 1})

# Combine 'title' and 'text' into one column
dataset1['combined_text'] = dataset1['title'] + " " + dataset1['text']

# Convert text to lowercase
dataset1['combined_text'] = dataset1['combined_text'].str.lower()

# Remove special characters and punctuation
dataset1['combined_text'] = dataset1['combined_text'].apply(lambda x: re.sub(r'\W+', ' ', str(x)))

# Initialize PorterStemmer
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # Define stop words

# Tokenization, stop-word removal, and stemming
dataset1['combined_text_tokens'] = dataset1['combined_text'].apply(word_tokenize)
dataset1['combined_text_tokens'] = dataset1['combined_text_tokens'].apply(
    lambda tokens: [word for word in tokens if word not in stop_words]
)
dataset1['combined_text_stemmed'] = dataset1['combined_text_tokens'].apply(
    lambda tokens: [ps.stem(token) for token in tokens]
)

# Convert stemmed tokens back into strings for CountVectorizer
dataset1['combined_text_stemmed_text'] = dataset1['combined_text_stemmed'].apply(' '.join)

# Use CountVectorizer to convert text into a bag-of-words representation
vectorizer = CountVectorizer()
vector = vectorizer.fit_transform(dataset1['combined_text_stemmed_text'])
dataset1['combined_text_encoded'] = vector.toarray().tolist()

# Drop intermediate columns
dataset1 = dataset1.drop(columns=['combined_text_tokens', 'combined_text_stemmed', 'combined_text_stemmed_text'])



print(dataset1.head())


[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\User/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to C:\Users\User/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


          iid                                              title  \
0  Fq+C96tcx+  ‘A target on Roe v. Wade ’: Oklahoma bill maki...   
1  bHUqK!pgmv  Study: women had to drive 4 times farther afte...   
2  4Y4Ubf%aTi        Trump, Clinton clash in dueling DC speeches   
3  _CoY89SJ@K  Grand jury in Texas indicts activists behind P...   
4  +rJHoRQVLe  As Reproductive Rights Hang In The Balance, De...   

                                                text  label  \
0  UPDATE: Gov. Fallin vetoed the bill on Friday....      1   
1  Ever since Texas laws closed about half of the...      1   
2  Donald Trump and Hillary Clinton, now at the s...      1   
3  A Houston grand jury investigating criminal al...      1   
4  WASHINGTON -- Forty-three years after the Supr...      1   

                                       combined_text  \
0   a target on roe v wade oklahoma bill making i...   
1  study women had to drive 4 times farther after...   
2  trump clinton clash in dueling dc speeche

The first rows of the combined_text_encoded column show only zeros, so we check that the matrix actually contains non-zero values to confirm that the preprocessing worked as intended.


In [25]:

nonzero_count = vector.nnz  # or vector.count_nonzero()
print("Number of nonzero entries:", nonzero_count)


dataset1.head()

Number of nonzero entries: 1226720


| | iid | title | text | label | combined_text | combined_text_encoded |
|---|---|---|---|---|---|---|
| 0 | Fq+C96tcx+ | ‘A target on Roe v. Wade ’: Oklahoma bill maki... | UPDATE: Gov. Fallin vetoed the bill on Friday.... | 1 | a target on roe v wade oklahoma bill making i... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1 | bHUqK!pgmv | Study: women had to drive 4 times farther afte... | Ever since Texas laws closed about half of the... | 1 | study women had to drive 4 times farther after... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 2 | 4Y4Ubf%aTi | Trump, Clinton clash in dueling DC speeches | Donald Trump and Hillary Clinton, now at the s... | 1 | trump clinton clash in dueling dc speeches don... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 3 | _CoY89SJ@K | Grand jury in Texas indicts activists behind P... | A Houston grand jury investigating criminal al... | 1 | grand jury in texas indicts activists behind p... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 4 | +rJHoRQVLe | As Reproductive Rights Hang In The Balance, De... | WASHINGTON -- Forty-three years after the Supr... | 1 | as reproductive rights hang in the balance deb... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
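
Storing every row as a dense Python list can use a lot of memory, because most entries are zero. A possible alternative, sketched below on a made-up toy corpus, is to keep the sparse matrix returned by `fit_transform` and pass it directly to `train_test_split` and the classifier. The corpus, labels and variable names are assumptions for illustration, not the notebook's data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the stemmed article text
toy_texts = [
    "govern announc polici", "scientist discov speci",
    "cyberattack disrupt bank", "actor donat million",
    "expert warn sea level", "stock market record high",
    "studi link food risk", "team win championship",
]
toy_labels = [1, 1, 0, 1, 0, 1, 0, 1]

toy_vectorizer = CountVectorizer()
X_sparse = toy_vectorizer.fit_transform(toy_texts)  # SciPy sparse matrix

# Density = stored non-zero entries / total number of cells
n_rows, n_cols = X_sparse.shape
density = X_sparse.nnz / (n_rows * n_cols)
print(f"shape={X_sparse.shape}, nnz={X_sparse.nnz}, density={density:.3f}")

# train_test_split and MultinomialNB both accept sparse input directly,
# so no call to toarray() is needed
X_tr, X_te, y_tr, y_te = train_test_split(
    X_sparse, toy_labels, test_size=0.25, random_state=0
)
clf = MultinomialNB().fit(X_tr, y_tr)
toy_pred = clf.predict(X_te)
print(toy_pred)
```

`RandomForestClassifier` also accepts sparse matrices, so the same approach should carry over to both models used in this notebook.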


In [9]:
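# Toy example (10 hand-written rows) to sanity-check the preprocessing pipeline.
# Note: this overwrites dataset1, so reload the real data before training.
dataset1 = pd.DataFrame({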
    'iid': ['AA101', 'BB202', 'CC303', 'DD404', 'EE505', 'FF606', 'GG707', 'HH808', 'II909', 'JJ010'],
    'title': [
        'Government announces new economic policies to boost growth',
        'Scientists discover new species in the Amazon rainforest',
        'Major cyberattack disrupts banking systems across Europe',
        'Famous actor donates millions to climate change research',
        'Experts warn about rising sea levels affecting coastal cities',
        'Stock market hits record high amid economic recovery',
        'New study links processed foods to increased health risks',
        'Sports team wins championship in dramatic final match',
        'Breakthrough in renewable energy technology announced',
        'Major corporation accused of environmental violations'
    ],
    'text': [
        'The government has unveiled a series of new economic policies aimed at stimulating growth and increasing employment opportunities. Officials believe these measures will help stabilize the economy.',
        'A group of scientists has identified a previously unknown species of amphibians deep in the Amazon rainforest, shedding light on the region’s incredible biodiversity and ecological significance.',
        'A large-scale cyberattack has disrupted banking operations across multiple European countries, causing financial institutions to implement emergency security measures to protect customer data.',
        'A world-renowned actor has pledged a significant portion of their wealth to support climate change research initiatives, aiming to fund projects that seek solutions to environmental issues.',
        'Climate scientists have issued a warning about the rising sea levels and their impact on major coastal cities, urging governments to take immediate action to prevent catastrophic consequences.',
        'The stock market has reached an all-time high, fueled by strong corporate earnings and renewed investor confidence in the ongoing economic recovery, according to financial analysts.',
        'A newly published study has found a correlation between the consumption of processed foods and increased health risks, leading to calls for better dietary regulations and awareness campaigns.',
        'In an intense and thrilling final match, the underdog sports team secured a stunning victory, claiming the championship title and delighting fans around the world with their performance.',
        'Scientists have announced a major breakthrough in renewable energy technology, which could significantly improve the efficiency of solar panels and make sustainable energy more accessible globally.',
        'A well-known multinational corporation has come under fire after allegations surfaced about environmental violations, prompting an investigation into its practices and potential legal actions.'
    ],
    'label': ['TRUE', 'TRUE', 'FAKE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FAKE'],
})

# Encode labels: FAKE -> 0, TRUE -> 1
dataset1['label'] = dataset1['label'].map({'FAKE': 0, 'TRUE': 1})

# Combine 'title' and 'text' columns to create a single text feature
dataset1['combined_text'] = dataset1['title'] + " " + dataset1['text']

# Initialize PorterStemmer for word stemming
ps = PorterStemmer()

# Tokenize and stem the combined text to reduce words to their root form
dataset1['combined_text_tokens'] = dataset1['combined_text'].apply(word_tokenize)

dataset1['combined_text_stemmed'] = dataset1['combined_text_tokens'].apply(lambda tokens: [ps.stem(token) for token in tokens])

# Convert stemmed tokens back into strings for CountVectorizer
dataset1['combined_text_stemmed_text'] = dataset1['combined_text_stemmed'].apply(' '.join)

# Use CountVectorizer to convert text into a bag-of-words representation
vectorizer = CountVectorizer()
vector = vectorizer.fit_transform(dataset1['combined_text_stemmed_text'])
dataset1['combined_text_encoded'] = vector.toarray().tolist()

# Drop intermediate columns to clean up the DataFrame
dataset1 = dataset1.drop(columns=['combined_text_tokens', 'combined_text_stemmed', 'combined_text_stemmed_text'])

print(dataset1.head())

     iid                                              title  \
0  AA101  Government announces new economic policies to ...   
1  BB202  Scientists discover new species in the Amazon ...   
2  CC303  Major cyberattack disrupts banking systems acr...   
3  DD404  Famous actor donates millions to climate chang...   
4  EE505  Experts warn about rising sea levels affecting...   

                                                text  label  \
0  The government has unveiled a series of new ec...      1   
1  A group of scientists has identified a previou...      1   
2  A large-scale cyberattack has disrupted bankin...      0   
3  A world-renowned actor has pledged a significa...      1   

                                       combined_text  \
0  Government announces new economic policies to ...   
1  Scientists discover new species in the Amazon ...   
2  Major cyberattack disrupts banking systems acr...   
3  Famous actor donates millions to climate chang...   
4  Experts warn about ris

Dataset 2


Dropping the 'date' column

In [6]:
dataset2 = dataset2.drop(columns=["date"])


Check the missing values


In [7]:
print(dataset2.isnull().sum())  # Check missing values
dataset2 = dataset2.dropna()  # Drop rows with missing values


title      0
text       0
subject    0
label      0
dtype: int64


Text cleaning (converting text to lowercase; removing URLs, @mentions, numbers and punctuation)

In [8]:

def clean_text(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@[A-Za-z0-9]+', '', text)  # Remove @mentions
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove punctuation and numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

dataset2["text"] = dataset2["text"].apply(clean_text)
dataset2["title"] = dataset2["title"].apply(clean_text)
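
Because `clean_text` deletes punctuation without inserting a space, words joined by a slash or hyphen are merged (for example "seattlewashington" in the output further down). The variant below is one possible alternative, not the approach used in this notebook: it replaces removed characters with a space. The function name `clean_text_spaced` is hypothetical.

```python
import re

def clean_text_spaced(text):
    """Variant of clean_text that swaps removed characters for a space."""
    text = text.lower()
    text = re.sub(r'http\S+', ' ', text)       # drop URLs
    text = re.sub(r'@[a-z0-9]+', ' ', text)    # drop @mentions
    text = re.sub(r'[^a-z\s]', ' ', text)      # punctuation and digits -> space
    return re.sub(r'\s+', ' ', text).strip()   # collapse repeated whitespace

print(clean_text_spaced("SEATTLE/WASHINGTON (Reuters) - President"))
print(clean_text_spaced("U.S. budget"))
```

The trade-off is that abbreviations such as "U.S." are split into single letters. With its default token pattern, `CountVectorizer` ignores single-character tokens, so these fragments would not enter the vocabulary.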


In [9]:
# Combine the 'title' and 'text' columns into a new column 'combined'
dataset2['combined'] = dataset2['title'] + ' ' + dataset2['text']


In [10]:
print(dataset2.columns)
dataset2.head()


Index(['title', 'text', 'subject', 'label', 'combined'], dtype='object')


| | title | text | subject | label | combined |
|---|---|---|---|---|---|
| 0 | as us budget fight looms republicans flip thei... | washington reuters the head of a conservative ... | politicsNews | 1 | as us budget fight looms republicans flip thei... |
| 1 | us military to accept transgender recruits on ... | washington reuters transgender people will be ... | politicsNews | 1 | us military to accept transgender recruits on ... |
| 2 | senior us republican senator let mr mueller do... | washington reuters the special counsel investi... | politicsNews | 1 | senior us republican senator let mr mueller do... |
| 3 | fbi russia probe helped by australian diplomat... | washington reuters trump campaign adviser geor... | politicsNews | 1 | fbi russia probe helped by australian diplomat... |
| 4 | trump wants postal service to charge much more... | seattlewashington reuters president donald tru... | politicsNews | 1 | trump wants postal service to charge much more... |


Tokenization and removing stop words

In [11]:
import nltk
nltk.download('punkt')  # Download the tokenizer models if not already downloaded

# Tokenize the combined text
dataset2['tokens'] = dataset2['combined'].apply(nltk.word_tokenize)


[nltk_data] Downloading package punkt to C:\Users\User/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [12]:
from nltk.corpus import stopwords
nltk.download('stopwords')  # Download stopwords if not already downloaded

# Define the set of stop words
stop_words = set(stopwords.words('english'))

# Remove stop words from the tokens
dataset2['tokens'] = dataset2['tokens'].apply(lambda tokens: [word for word in tokens if word not in stop_words])


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Stemming

In [None]:

ps = PorterStemmer()

dataset2['stemmed'] = dataset2['tokens'].apply(lambda tokens: [ps.stem(word) for word in tokens])


In [14]:
# Join the tokens in the 'stemmed' column into a single string
dataset2['stemmed_text'] = dataset2['stemmed'].apply(lambda tokens: ' '.join(tokens))


In [None]:
# Limit vocabulary to the 10000 most frequent terms, keeping only terms that
# appear in at least 5 documents and in at most 80% of all documents
vectorizer = CountVectorizer(max_features=10000, min_df=5, max_df=0.8)
vector1 = vectorizer.fit_transform(dataset2['stemmed_text'])


# Convert the sparse matrix to a dense array and store it in a new column
dataset2['combined_text_encoded'] = vector1.toarray().tolist()
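
To make the effect of `min_df`, `max_df` and `max_features` concrete, here is a minimal sketch on a made-up four-document corpus. The corpus and the thresholds are assumptions chosen so the pruning is visible at this tiny scale; they are not the values used above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "news" occurs in every document
corpus = [
    "news about budget",
    "news about election",
    "news about budget vote",
    "news rare",
]

# No pruning: every token becomes a feature
full = CountVectorizer().fit(corpus)
full_vocab = sorted(full.vocabulary_)

# min_df=2: keep terms found in at least 2 documents
# max_df=0.8: drop terms found in more than 80% of the documents
pruned = CountVectorizer(min_df=2, max_df=0.8).fit(corpus)
pruned_vocab = sorted(pruned.vocabulary_)

# max_features=1: of the remaining terms, keep only the most frequent one
capped = CountVectorizer(min_df=2, max_df=0.8, max_features=1).fit(corpus)
capped_vocab = sorted(capped.vocabulary_)

print(full_vocab)
print(pruned_vocab)
print(capped_vocab)
```

Here "news" is removed as too common, the terms that occur only once are removed as too rare, and `max_features` then caps what is left by total term frequency.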


In [17]:
dataset2 = dataset2.drop(columns=['tokens', 'stemmed','stemmed_text'])


In [19]:

nonzero_count = vector1.nnz  # or vector1.count_nonzero()
print("Number of nonzero entries:", nonzero_count)


dataset2.head()

Number of nonzero entries: 6401272


| | title | text | subject | label | combined | combined_text_encoded |
|---|---|---|---|---|---|---|
| 0 | as us budget fight looms republicans flip thei... | washington reuters the head of a conservative ... | politicsNews | 1 | as us budget fight looms republicans flip thei... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 1 | us military to accept transgender recruits on ... | washington reuters transgender people will be ... | politicsNews | 1 | us military to accept transgender recruits on ... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 2 | senior us republican senator let mr mueller do... | washington reuters the special counsel investi... | politicsNews | 1 | senior us republican senator let mr mueller do... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 3 | fbi russia probe helped by australian diplomat... | washington reuters trump campaign adviser geor... | politicsNews | 1 | fbi russia probe helped by australian diplomat... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, ... |
| 4 | trump wants postal service to charge much more... | seattlewashington reuters president donald tru... | politicsNews | 1 | trump wants postal service to charge much more... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


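**Luisa & Max, please try the code below:**

(This cell is to be deleted if it does not work for any of us.)

In [None]:
# Note: this needs the 'stemmed_text' column, so run it before the cell that drops it.
# Without max_features the vocabulary is unrestricted, so the dense array may not fit in memory.
vectorizer = CountVectorizer()

vector = vectorizer.fit_transform(dataset2['stemmed_text'])

# Convert the sparse matrix to a dense array and store it in a new column
dataset2['combined_text_encoded'] = vector.toarray().tolist()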


## Training the classifier

In [26]:
# Prepare feature matrix (X) and target vector (y) for model training
X = dataset1['combined_text_encoded'].tolist()
y = dataset1['label']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=69)

# Train and evaluate a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=69)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred, target_names=['FAKE', 'REAL']))

# Train and evaluate a Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Naive Bayes Classification Report:\n", classification_report(y_test, y_pred_nb, target_names=['FAKE', 'REAL']))

Random Forest Accuracy: 0.9164490861618799
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       0.90      0.93      0.92       558
        REAL       0.93      0.91      0.92       591

    accuracy                           0.92      1149
   macro avg       0.92      0.92      0.92      1149
weighted avg       0.92      0.92      0.92      1149

Naive Bayes Accuracy: 0.8833768494342907
Naive Bayes Classification Report:
               precision    recall  f1-score   support

        FAKE       0.89      0.87      0.88       558
        REAL       0.88      0.89      0.89       591

    accuracy                           0.88      1149
   macro avg       0.88      0.88      0.88      1149
weighted avg       0.88      0.88      0.88      1149



In [27]:
# Prepare feature matrix (X) and target vector (y) for model training
X = dataset2['combined_text_encoded'].tolist()
y = dataset2['label']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=69)

# Train and evaluate a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=69)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred, target_names=['FAKE', 'REAL']))

# Train and evaluate a Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Naive Bayes Classification Report:\n", classification_report(y_test, y_pred_nb, target_names=['FAKE', 'REAL']))

Random Forest Accuracy: 0.9978619153674833
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       1.00      1.00      1.00      5876
        REAL       1.00      1.00      1.00      5349

    accuracy                           1.00     11225
   macro avg       1.00      1.00      1.00     11225
weighted avg       1.00      1.00      1.00     11225

Naive Bayes Accuracy: 0.9455679287305122
Naive Bayes Classification Report:
               precision    recall  f1-score   support

        FAKE       0.95      0.95      0.95      5876
        REAL       0.94      0.94      0.94      5349

    accuracy                           0.95     11225
   macro avg       0.95      0.95      0.95     11225
weighted avg       0.95      0.95      0.95     11225

