# Classic Classifier as benchmark

The main goal of this exercise is to develop a feel for, and an understanding of, the importance of
representing and extracting information from complex media content, in this case images or
text. You will therefore be given some datasets that have an image classification target.

(1) In the first step, you shall try to find a good classifier using "traditional" feature extraction
methods. Pick one feature extractor based on, e.g., Bag of Words, n-grams, or similar.
Evaluate it with two shallow algorithms, optimising the parameter settings to see what
performance you can achieve; this gives a baseline for the subsequent steps.
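
As a rough illustration of this step, the sketch below tunes one Bag-of-Words extractor with two shallow classifiers using a grid search. The toy corpus, the choice of `MultinomialNB` and `LogisticRegression`, and the parameter grids are assumptions made for this example only; they are not prescribed by the exercise and should be replaced by the actual datasets and whichever shallow algorithms you select.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, for illustration only (not the exercise data).
texts = [
    "cheap pills buy now", "win money now",
    "free prize claim now", "buy cheap watches",
    "meeting at noon tomorrow", "project report attached",
    "lunch with the team", "see the attached agenda",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Two shallow classifiers, each with a small, assumed parameter grid.
candidates = {
    "naive_bayes": (MultinomialNB(), {"clf__alpha": [0.1, 1.0]}),
    "log_reg": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
}

results = {}
for name, (clf, clf_grid) in candidates.items():
    # The same Bag-of-Words extractor feeds both classifiers.
    pipe = Pipeline([("vec", CountVectorizer()), ("clf", clf)])
    # Tune the extractor (unigrams vs. unigrams + bigrams) jointly with the classifier.
    grid = {"vec__ngram_range": [(1, 1), (1, 2)], **clf_grid}
    search = GridSearchCV(pipe, grid, cv=2, scoring="accuracy")
    search.fit(texts, labels)
    results[name] = (search.best_score_, search.best_params_)
    print(name, round(search.best_score_, 3), search.best_params_)
```

Wrapping the vectorizer and classifier in one `Pipeline` means the vocabulary is re-learned on each training fold, so the cross-validated score is not inflated by words seen only in the held-out fold.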


In [33]:
import numpy as np
import pandas as pd


In [44]:
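import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# word_tokenize needs the NLTK tokenizer models; fetch them once if missing
# (newer NLTK releases look for 'punkt_tab', older ones for 'punkt')
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)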

# Sample dataset
dataset1 = pd.DataFrame({
    'iid': ['KS823', 'PA432', 'BH235', 'HD732'],
    'title': [
        'In public, they ran a successful public business. In public, public allege they abused hundreds of public',
        'With tariffs signed, Trump warns of pain to come for Americans',
        'Even a potential ceasefire sparks little hope in eastern Ukraine',
        'How the Trump administration is building out its immigration enforcement machine'
    ],
    'lable': ['FAKE', 'FAKE', 'TRUE', 'TRUE'],
})

# Initialize the PorterStemmer
ps = PorterStemmer()

# Tokenize the 'title' column
dataset1['title_tokens'] = dataset1['title'].apply(word_tokenize)

# Define a function to stem a list of tokens
def stem_tokens(tokens):
    return [ps.stem(token) for token in tokens]

# Apply the stem_tokens function to the tokenized column
dataset1['title_stemmed'] = dataset1['title_tokens'].apply(stem_tokens)

# Convert stemmed tokens back into strings for CountVectorizer
dataset1['title_stemmed_text'] = dataset1['title_stemmed'].apply(' '.join)

# Display the updated DataFrame
print("Processed DataFrame:")
print(dataset1[['title', 'title_stemmed_text']])

# Initialize CountVectorizer
vectorizer = CountVectorizer()

# Fit the vectorizer on the stemmed text to learn the vocabulary
vectorizer.fit(dataset1['title_stemmed_text'])

# Print the vocabulary
print("\nVocabulary: ", vectorizer.vocabulary_)

# Encode the documents as token-count vectors
vector = vectorizer.transform(dataset1['title_stemmed_text'])

# Summarize the encoded texts
print("\nEncoded Document is:")
print(vector.toarray())

Processed DataFrame:
                                               title  \
0  In public, they ran a successful public busine...   
1  With tariffs signed, Trump warns of pain to co...   
2  Even a potential ceasefire sparks little hope ...   
3  How the Trump administration is building out i...   

                                  title_stemmed_text  
0  in public , they ran a success public busi . i...  
1  with tariff sign , trump warn of pain to come ...  
2  even a potenti ceasefir spark littl hope in ea...  
3  how the trump administr is build out it immigr...  

Vocabulary:  {'in': 16, 'public': 25, 'they': 32, 'ran': 26, 'success': 29, 'busi': 5, 'alleg': 2, 'abus': 0, 'hundr': 14, 'of': 21, 'with': 37, 'tariff': 30, 'sign': 27, 'trump': 34, 'warn': 36, 'pain': 23, 'to': 33, 'come': 7, 'for': 11, 'american': 3, 'even': 10, 'potenti': 24, 'ceasefir': 6, 'spark': 28, 'littl': 19, 'hope': 12, 'eastern': 8, 'ukrain': 35, 'how': 13, 'the': 31, 'administr': 1, 'is': 17, 'build': 4,