# Classic Classifier as benchmark

The main goal of this exercise is to develop an understanding of how much the representation and
extraction of information matters when working with complex media content, in this case images or
text. You will therefore be given datasets that have an image classification target.

(1) In the first step, try to find a good classifier using "traditional" feature extraction
methods. Pick one feature extractor based on, e.g., bag of words, n-grams, or similar.
Evaluate it with two shallow algorithms, optimising the parameter settings to see what
performance you can achieve; this gives a baseline for the subsequent steps.


In [3]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# word_tokenize needs NLTK's Punkt tokenizer models; recent NLTK releases look for 'punkt_tab' instead
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

## Preprocessing and feature extraction

In [4]:
dataset1 = pd.DataFrame({
    'iid': ['AA101', 'BB202', 'CC303', 'DD404', 'EE505', 'FF606', 'GG707', 'HH808', 'II909', 'JJ010'],
    'title': [
        'Government announces new economic policies to boost growth',
        'Scientists discover new species in the Amazon rainforest',
        'Major cyberattack disrupts banking systems across Europe',
        'Famous actor donates millions to climate change research',
        'Experts warn about rising sea levels affecting coastal cities',
        'Stock market hits record high amid economic recovery',
        'New study links processed foods to increased health risks',
        'Sports team wins championship in dramatic final match',
        'Breakthrough in renewable energy technology announced',
        'Major corporation accused of environmental violations'
    ],
    'text': [
        'The government has unveiled a series of new economic policies aimed at stimulating growth and increasing employment opportunities. Officials believe these measures will help stabilize the economy.',
        'A group of scientists has identified a previously unknown species of amphibians deep in the Amazon rainforest, shedding light on the region’s incredible biodiversity and ecological significance.',
        'A large-scale cyberattack has disrupted banking operations across multiple European countries, causing financial institutions to implement emergency security measures to protect customer data.',
        'A world-renowned actor has pledged a significant portion of their wealth to support climate change research initiatives, aiming to fund projects that seek solutions to environmental issues.',
        'Climate scientists have issued a warning about the rising sea levels and their impact on major coastal cities, urging governments to take immediate action to prevent catastrophic consequences.',
        'The stock market has reached an all-time high, fueled by strong corporate earnings and renewed investor confidence in the ongoing economic recovery, according to financial analysts.',
        'A newly published study has found a correlation between the consumption of processed foods and increased health risks, leading to calls for better dietary regulations and awareness campaigns.',
        'In an intense and thrilling final match, the underdog sports team secured a stunning victory, claiming the championship title and delighting fans around the world with their performance.',
        'Scientists have announced a major breakthrough in renewable energy technology, which could significantly improve the efficiency of solar panels and make sustainable energy more accessible globally.',
        'A well-known multinational corporation has come under fire after allegations surfaced about environmental violations, prompting an investigation into its practices and potential legal actions.'
    ],
    'lable': ['TRUE', 'TRUE', 'FAKE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'TRUE', 'FAKE'],
})

# Encode labels: FAKE -> 0, TRUE -> 1
dataset1['lable'] = dataset1['lable'].map({'FAKE': 0, 'TRUE': 1})

# Combine 'title' and 'text' columns to create a single text feature
dataset1['combined_text'] = dataset1['title'] + " " + dataset1['text']

# Initialize PorterStemmer for word stemming
ps = PorterStemmer()

# Tokenize and stem the combined text to reduce words to their root form
dataset1['combined_text_tokens'] = dataset1['combined_text'].apply(word_tokenize)
dataset1['combined_text_stemmed'] = dataset1['combined_text_tokens'].apply(lambda tokens: [ps.stem(token) for token in tokens])

# Convert stemmed tokens back into strings for CountVectorizer
dataset1['combined_text_stemmed_text'] = dataset1['combined_text_stemmed'].apply(' '.join)

# Use CountVectorizer to convert text into a bag-of-words representation
vectorizer = CountVectorizer()
vector = vectorizer.fit_transform(dataset1['combined_text_stemmed_text'])
dataset1['combined_text_encoded'] = vector.toarray().tolist()

# Drop intermediate columns to clean up the DataFrame
dataset1 = dataset1.drop(columns=['combined_text_tokens', 'combined_text_stemmed', 'combined_text_stemmed_text'])

## Training the classifier

In [5]:
# Prepare feature matrix (X) and target vector (y) for model training
X = dataset1['combined_text_encoded'].tolist()
y = dataset1['lable']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=69)

# Train and evaluate a Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=69)
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, y_pred, target_names=['FAKE', 'TRUE']))

# Train and evaluate a Naive Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train, y_train)
y_pred_nb = nb_classifier.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print("Naive Bayes Classification Report:\n", classification_report(y_test, y_pred_nb, target_names=['FAKE', 'TRUE']))

Random Forest Accuracy: 0.6666666666666666
Random Forest Classification Report:
               precision    recall  f1-score   support

        FAKE       0.00      0.00      0.00         1
        TRUE       0.67      1.00      0.80         2

    accuracy                           0.67         3
   macro avg       0.33      0.50      0.40         3
weighted avg       0.44      0.67      0.53         3

Naive Bayes Accuracy: 0.3333333333333333
Naive Bayes Classification Report:
               precision    recall  f1-score   support

        FAKE       0.33      1.00      0.50         1
        TRUE       0.00      0.00      0.00         2

    accuracy                           0.33         3
   macro avg       0.17      0.50      0.25         3
weighted avg       0.11      0.33      0.17         3



Both classification reports raised scikit-learn's UndefinedMetricWarning: on the three-sample test set, each model never predicts one of the two classes, so that class's precision is undefined and is reported as 0.00.
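
The task statement asks for optimised parameter settings, and a three-sample test split gives a very noisy estimate of performance. One way to address both is a stratified, cross-validated grid search over a vectorizer-plus-classifier pipeline. The following is a minimal sketch under stated assumptions: the toy corpus, its labels, and the parameter grid are invented for illustration, and the grid values are not tuned recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration; labels: 1 = TRUE, 0 = FAKE
texts = [
    "government announces new economic policy",
    "scientists discover new species",
    "stock market hits record high",
    "study links processed food to health risks",
    "team wins championship final",
    "breakthrough in renewable energy announced",
    "shocking secret cure doctors hide",
    "miracle pill melts fat overnight",
    "aliens secretly control stock market",
    "celebrity clone spotted at secret base",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer sits inside the pipeline, so its vocabulary is learned
# only from the training part of each cross-validation fold
pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Hypothetical search space: n-gram range and Naive Bayes smoothing strength
param_grid = {
    "bow__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}

# Stratified folds keep the FAKE/TRUE ratio the same in every split
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated macro-F1:", search.best_score_)
```

Macro-averaged F1 is used here instead of accuracy because it weights the minority class equally, which is relevant when FAKE examples are rare. The same pattern works for `RandomForestClassifier` by swapping the final pipeline step and its grid entries.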
