# Task: News Topic Classification with AG News

## Objective
Classify **news articles** into 4 categories (*World, Sports, Business, Sci/Tech*) using different **text representation methods**.

<small>[AG News Classification Dataset on Kaggle](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset)</small>
    
---

## Step 1: Data Preparation
- Load the **AG News dataset** (train.csv & test.csv).  
- Combine the **title + description** into one text field.  
- Apply **basic preprocessing**:
  - Lowercase  
  - Remove symbols/punctuation  
  - Try stopwords removal or stemming → compare results  
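The preprocessing steps listed above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's implementation: `TOY_STOPWORDS`, `basic_clean`, and `remove_stopwords` are hypothetical names, and the tiny stop-word set stands in for NLTK's English list.

```python
import re

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list).
TOY_STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "into"}

def basic_clean(text):
    """Lowercase the text and drop every character that is not a-z or whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())

def remove_stopwords(text, stopwords=TOY_STOPWORDS):
    """Clean the text, then drop tokens that appear in the stop-word set."""
    return " ".join(tok for tok in basic_clean(text).split() if tok not in stopwords)

sample = "Wall St. Bears Claw Back Into the Black (Reuters)"
print(basic_clean(sample))       # wall st bears claw back into the black reuters
print(remove_stopwords(sample))  # wall st bears claw back black reuters
```

Note that the regex also removes digits, so figures such as prices or years disappear from the text.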

---


In [3]:
import pandas as pd
import numpy as np
from sklearn.utils import shuffle

In [4]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [5]:
df_train.head()

Unnamed: 0,Class Index,Title,Description
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


In [6]:
df_test.head()

Unnamed: 0,Class Index,Title,Description
0,3,Fears for T N pension after talks,Unions representing workers at Turner Newall...
1,4,The Race is On: Second Private Team Sets Launc...,"SPACE.com - TORONTO, Canada -- A second\team o..."
2,4,Ky. Company Wins Grant to Study Peptides (AP),AP - A company founded by a chemistry research...
3,4,Prediction Unit Helps Forecast Wildfires (AP),AP - It's barely dawn when Mike Fitzpatrick st...
4,4,Calif. Aims to Limit Farm-Related Smog (AP),AP - Southern California's smog-fighting agenc...


In [7]:
df_train['Title_Description'] = df_train['Title'] + " " + df_train['Description']
df_test['Title_Description'] = df_test['Title'] + " " + df_test['Description']
df_train.drop(columns=['Title', 'Description'], inplace=True)
df_test.drop(columns=['Title', 'Description'], inplace=True)

In [8]:
df_train.head()

Unnamed: 0,Class Index,Title_Description
0,3,Wall St. Bears Claw Back Into the Black (Reute...
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...
4,3,"Oil prices soar to all-time record, posing new..."


In [9]:
# Shuffle texts and labels together so they stay aligned
train_texts, train_labels = shuffle(df_train.Title_Description, df_train['Class Index'], random_state=42)
test_texts, test_labels = shuffle(df_test.Title_Description, df_test['Class Index'], random_state=42)

In [42]:
train_texts=train_texts[:7000]
train_labels=train_labels[:7000]
test_texts=test_texts[:2000]
test_labels=test_labels[:2000]


In [11]:
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [12]:
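# Build the stop-word set once; calling stopwords.words('english') for every token is very slow
STOP_WORDS = set(stopwords.words('english'))

def text_preprocessing_stop_words(text):
    text = text.lower()                    # lowercase
    text = re.sub(r'[^a-z\s]', '', text)   # keep only letters and whitespace
    words = word_tokenize(text)
    words = [word for word in words if word not in STOP_WORDS]
    return " ".join(words)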

In [13]:
stemmer = PorterStemmer()  # create once and reuse for every document

def text_preprocessing_stemmer(text):
    text = text.lower()                    # lowercase
    text = re.sub(r'[^a-z\s]', '', text)   # keep only letters and whitespace
    words = word_tokenize(text)
    words = [stemmer.stem(word) for word in words]
    return " ".join(words)


In [14]:
train_texts_stemmed = [text_preprocessing_stemmer(sent) for sent in train_texts]
train_texts_stop_words = [text_preprocessing_stop_words(sent) for sent in train_texts]
test_texts_stemmed = [text_preprocessing_stemmer(sent) for sent in test_texts]
test_texts_stop_words = [text_preprocessing_stop_words(sent) for sent in test_texts]


## Step 2: Representations to Try
You must implement **all 5 methods** below:

1. **Bag of Words (BoW)**  
   - Represent each text as a count of words.  
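To make the idea concrete, here is a small hand-rolled sketch of a bag-of-words matrix on three toy sentences. It only approximates what `CountVectorizer` does (the toy documents, `vocab`, and `bow_vector` are illustrative assumptions, and the real vectorizer applies its own tokenization and returns a sparse matrix).

```python
from collections import Counter

docs = ["oil prices soar", "oil exports halted", "stocks soar as oil prices fall"]

# Vocabulary: every distinct token, sorted so the column order is deterministic.
vocab = sorted({tok for doc in docs for tok in doc.split()})

def bow_vector(doc, vocab):
    """Return the count of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in vocab]

matrix = [bow_vector(d, vocab) for d in docs]

print(vocab)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary word, so most entries are zero; that sparsity is why these features are stored as sparse matrices in practice.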


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from gensim.models import Word2Vec

In [37]:
bow_stemmed = CountVectorizer()
bow_stop_words = CountVectorizer()

X_train_stemmed_bow  = bow_stemmed.fit_transform(train_texts_stemmed)
X_train_stop_words_bow  = bow_stop_words.fit_transform(train_texts_stop_words)
X_test_stemmed_bow  = bow_stemmed.transform(test_texts_stemmed)
X_test_stop_words_bow  = bow_stop_words.transform(test_texts_stop_words)




2. **TF-IDF**  
   - Apply TF-IDF weighting instead of raw counts.  
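The weighting can be sketched by hand as follows. The sketch assumes the smoothed formula `idf(t) = ln((1 + n) / (1 + df(t))) + 1` followed by L2 normalization of each row, which is what scikit-learn documents for `TfidfVectorizer` with its default settings; the toy documents and the `tfidf` helper are illustrative, and whitespace tokenization is a simplification.

```python
import math
from collections import Counter

docs = ["oil prices soar", "oil exports halted", "stocks soar as oil prices fall"]

def tfidf(docs):
    """Return (vocab, rows) where rows are L2-normalized TF-IDF vectors."""
    tokenized = [d.split() for d in docs]
    n = len(tokenized)
    vocab = sorted({t for doc in tokenized for t in doc})
    # Document frequency: in how many documents each term appears.
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, rows = tfidf(docs)
i_oil, i_exports = vocab.index("oil"), vocab.index("exports")
print(rows[1][i_oil], rows[1][i_exports])
```

In the second document, "exports" receives a higher weight than "oil" because "oil" occurs in every document and is therefore less informative.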


In [39]:
tf_idf_stemmed = TfidfVectorizer()
tf_idf_stop_words = TfidfVectorizer()

X_train_stemmed_tfidf  = tf_idf_stemmed.fit_transform(train_texts_stemmed)
X_train_stop_words_tfidf  = tf_idf_stop_words.fit_transform(train_texts_stop_words)
X_test_stemmed_tfidf  = tf_idf_stemmed.transform(test_texts_stemmed)
X_test_stop_words_tfidf  = tf_idf_stop_words.transform(test_texts_stop_words)



3. **N-grams (Bi/Tri-grams)**  
   - Use bigrams and trigrams to capture context.   
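A short sketch of how n-gram features are generated from a token sequence. The helper names `ngrams` and `ngram_features` are hypothetical; the `(min_n, max_n)` pair is meant to play the same role as `ngram_range` in `CountVectorizer`, where `(1, 2)` yields unigrams plus bigrams and trigrams would require a range ending in 3.

```python
def ngrams(tokens, n):
    """All contiguous runs of n tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, min_n=1, max_n=2):
    """Collect every n-gram with min_n <= n <= max_n."""
    tokens = text.split()
    feats = []
    for n in range(min_n, max_n + 1):
        feats.extend(ngrams(tokens, n))
    return feats

print(ngram_features("oil prices soar", 1, 2))
# ['oil', 'prices', 'soar', 'oil prices', 'prices soar']
print(ngram_features("oil prices soar", 2, 3))
# ['oil prices', 'prices soar', 'oil prices soar']
```

Adding longer n-grams grows the vocabulary quickly, which increases memory use and can hurt generalization on a small training set.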

    


In [40]:
# ngram_range=(1,2) extracts unigrams and bigrams; trigrams would require ngram_range=(1,3)
ngram_stemmed = CountVectorizer(ngram_range=(1,2))
ngram_stop_words = CountVectorizer(ngram_range=(1,2))

X_train_stemmed_ngram  = ngram_stemmed.fit_transform(train_texts_stemmed)
X_train_stop_words_ngram  = ngram_stop_words.fit_transform(train_texts_stop_words)
X_test_stemmed_ngram  = ngram_stemmed.transform(test_texts_stemmed)
X_test_stop_words_ngram  = ngram_stop_words.transform(test_texts_stop_words)

4. **Word2Vec (Pretrained)**  
   - Use pretrained embeddings (e.g., GoogleNews vectors).  
   - Convert each document into a vector (average word embeddings).  
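The averaging step can be illustrated without downloading any model. The three-dimensional `toy_vectors` below are made-up values standing in for real pretrained embeddings, and `average_vector` is a hypothetical helper; the point is that unknown words are skipped and the zero-vector fallback must have the same dimensionality as the embeddings.

```python
# Made-up 3-dimensional "embeddings" (an assumption, for illustration only).
toy_vectors = {
    "oil": [1.0, 0.0, 2.0],
    "prices": [3.0, 2.0, 0.0],
    "soar": [2.0, 4.0, 4.0],
}
DIM = 3

def average_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; a zero vector if none are known."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return [0.0] * dim
    # zip(*found) iterates over the components column by column.
    return [sum(col) / len(found) for col in zip(*found)]

print(average_vector(["oil", "prices", "unknownword"], toy_vectors, DIM))  # [2.0, 1.0, 1.0]
print(average_vector(["unknownword"], toy_vectors, DIM))                   # [0.0, 0.0, 0.0]
```

Averaging discards word order, so two documents with the same words in a different order map to the same vector.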

  

In [19]:
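import gensim.downloader as api

# Note: this loads 200-dimensional GloVe vectors, not the GoogleNews Word2Vec model
model = api.load("glove-wiki-gigaword-200")

def text_to_vector(text):
    """Average the embeddings of all in-vocabulary tokens in the text."""
    words = word_tokenize(text)
    vectors = [model[word] for word in words if word in model]
    if vectors:
        return np.mean(vectors, axis=0)
    # The fallback must match the embedding dimensionality (200 here, not 300)
    return np.zeros(model.vector_size)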

In [20]:
X_train_stemmed_word2vec =  np.array([text_to_vector(sent) for sent in train_texts_stemmed])
X_test_stemmed_word2vec =  np.array([text_to_vector(sent) for sent in test_texts_stemmed])
X_train_stop_words_word2vec =  np.array([text_to_vector(sent) for sent in train_texts_stop_words])
X_test_stop_words_word2vec =  np.array([text_to_vector(sent) for sent in test_texts_stop_words])

  
5. **Doc2Vec**  
   - Train your own Doc2Vec model on the dataset.  
   - Represent each document with its vector.  
   
---


In [22]:
from gensim.models import Doc2Vec
from gensim.models.doc2vec import  TaggedDocument
from gensim.utils import simple_preprocess

In [23]:
def prepare_tagged_documents(texts):
    tagged_docs = []
    for i, text in enumerate(texts):
        words = simple_preprocess(text)
        tagged_docs.append(TaggedDocument(words, [i]))
    return tagged_docs
X_train_stemmed_tagged = prepare_tagged_documents(train_texts_stemmed)
X_test_stemmed_tagged = prepare_tagged_documents(test_texts_stemmed)
X_train_stop_words_tagged = prepare_tagged_documents(train_texts_stop_words)
X_test_stop_words_tagged = prepare_tagged_documents(test_texts_stop_words)


In [26]:
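# Note: this rebinds `model`, replacing the pretrained embeddings loaded earlier
model = Doc2Vec(
    vector_size=300,
    window=5,
    min_count=2,
    epochs=20)
# Both preprocessed variants are trained together. Their integer tags overlap (0..n-1),
# but the document vectors used later come from infer_vector, not from these tags.
train_corpus = X_train_stemmed_tagged + X_train_stop_words_tagged
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.epochs)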

In [27]:
def get_doc2vec_vector(model, tagged_docs):
    vectors = []
    for doc in tagged_docs:
        vector = model.infer_vector(doc.words)
        vectors.append(vector)
    return np.array(vectors)
X_train_stemmed_doc2vec = get_doc2vec_vector(model, X_train_stemmed_tagged)
X_test_stemmed_doc2vec = get_doc2vec_vector(model, X_test_stemmed_tagged)
X_train_stop_words_doc2vec = get_doc2vec_vector(model, X_train_stop_words_tagged)
X_test_stop_words_doc2vec = get_doc2vec_vector(model, X_test_stop_words_tagged)


## Step 3: Try Two Classifiers
For **each text representation method**, train **two different models** and compare:

- **Logistic Regression**
- **Naive Bayes** (or any other model of your choice, e.g., SVM, Decision Tree)

Hint:  
- Logistic Regression usually performs well on sparse features (BoW, TF-IDF, N-grams).  
- Naive Bayes is very fast and works surprisingly well for text classification.  
- Compare their accuracy for each representation.
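To show why Naive Bayes is so fast, here is a minimal multinomial Naive Bayes written from scratch: training is just counting words per class. This is a teaching sketch on made-up toy data, not the scikit-learn implementation; `train_nb` and `predict_nb` are hypothetical names, and `alpha=1.0` is add-one (Laplace) smoothing.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from word counts."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        denom = total + alpha * len(vocab)
        log_prior = math.log(class_docs[label] / len(labels))
        log_like = {w: math.log((word_counts[label][w] + alpha) / denom) for w in vocab}
        model[label] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unseen words are ignored."""
    scores = {}
    for label, (log_prior, log_like) in model.items():
        scores[label] = log_prior + sum(log_like[w] for w in doc.split() if w in log_like)
    return max(scores, key=scores.get)

docs = ["team wins match", "striker scores goal", "stocks fall sharply", "oil prices rise"]
labels = ["Sports", "Sports", "Business", "Business"]
nb = train_nb(docs, labels)

print(predict_nb(nb, "team scores goal"))  # Sports
print(predict_nb(nb, "oil stocks rise"))   # Business
```

Logistic Regression, by contrast, needs an iterative optimizer, which is why it is slower to train and can emit convergence warnings when the iteration limit is too low.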

---


- **Logistic Regression**


In [68]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
accuracies = {}


In [69]:
# BOW stemmed
clf = LogisticRegression()
clf.fit(X_train_stemmed_bow, train_labels)
preds = clf.predict(X_test_stemmed_bow)
accuracies['BOW Stemmed'] = accuracy_score(test_labels, preds)
# BOW stop words
clf = LogisticRegression()
clf.fit(X_train_stop_words_bow, train_labels)
preds = clf.predict(X_test_stop_words_bow)
accuracies['BOW Stop Words'] = accuracy_score(test_labels, preds)

In [70]:
# tf_idf stemmed
clf = LogisticRegression()
clf.fit(X_train_stemmed_tfidf, train_labels)
preds = clf.predict(X_test_stemmed_tfidf)
accuracies['tf_idf Stemmed'] = accuracy_score(test_labels, preds)
# tf_idf stop words
clf= LogisticRegression()
clf.fit(X_train_stop_words_tfidf, train_labels)
preds = clf.predict(X_test_stop_words_tfidf)
accuracies['tf_idf Stop Words'] = accuracy_score(test_labels, preds)

In [71]:
#   ngram stemmed
clf= LogisticRegression()
clf.fit(X_train_stemmed_ngram, train_labels)
preds = clf.predict(X_test_stemmed_ngram)
accuracies['ngram Stemmed'] = accuracy_score(test_labels, preds)
#   ngram stop words
clf= LogisticRegression()
clf.fit(X_train_stop_words_ngram, train_labels)
preds = clf.predict(X_test_stop_words_ngram)
accuracies['ngram Stop Words'] = accuracy_score(test_labels, preds)

In [72]:
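# Word2Vec stemmed
# Use a fresh classifier for the dense features; the ConvergenceWarning shown below
# suggests that a larger max_iter may be needed for these inputs
clf = LogisticRegression()
clf.fit(X_train_stemmed_word2vec, train_labels)
preds = clf.predict(X_test_stemmed_word2vec)
accuracies['Word2Vec Stemmed'] = accuracy_score(test_labels, preds)
# Word2Vec stop words
clf = LogisticRegression()
clf.fit(X_train_stop_words_word2vec, train_labels)
preds = clf.predict(X_test_stop_words_word2vec)
accuracies['Word2Vec Stop Words'] = accuracy_score(test_labels, preds)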

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [73]:
# Doc2Vec stemmed
clf = LogisticRegression()
clf.fit(X_train_stemmed_doc2vec, train_labels)
preds = clf.predict(X_test_stemmed_doc2vec)
accuracies['Doc2Vec Stemmed'] = accuracy_score(test_labels, preds)
# Doc2Vec stop words
clf = LogisticRegression()
clf.fit(X_train_stop_words_doc2vec, train_labels)
preds = clf.predict(X_test_stop_words_doc2vec)
accuracies['Doc2Vec Stop Words'] = accuracy_score(test_labels, preds)

- **Naive Bayes** 

In [74]:
from sklearn.naive_bayes import MultinomialNB,GaussianNB
accuracies2 = {}
# BOW stemmed
clf = MultinomialNB()
clf.fit(X_train_stemmed_bow, train_labels)
preds = clf.predict(X_test_stemmed_bow)
accuracies2['BOW Stemmed NB'] = accuracy_score(test_labels, preds)
# BOW stop words
clf = MultinomialNB()
clf.fit(X_train_stop_words_bow, train_labels)
preds = clf.predict(X_test_stop_words_bow)
accuracies2['BOW Stop Words NB'] = accuracy_score(test_labels, preds)

In [75]:
# tf_idf stemmed
clf = MultinomialNB()
clf.fit(X_train_stemmed_tfidf, train_labels)
preds = clf.predict(X_test_stemmed_tfidf)
accuracies2['tf_idf Stemmed NB'] = accuracy_score(test_labels, preds)
# tf_idf stop words
clf = MultinomialNB()
clf.fit(X_train_stop_words_tfidf, train_labels)
preds = clf.predict(X_test_stop_words_tfidf)
accuracies2['tf_idf Stop Words NB'] = accuracy_score(test_labels, preds)

In [76]:
# ngram stemmed
clf= MultinomialNB()
clf.fit(X_train_stemmed_ngram, train_labels)
preds= clf.predict(X_test_stemmed_ngram)
accuracies2['ngram Stemmed NB'] = accuracy_score(test_labels, preds)
# ngram stop words
clf= MultinomialNB()
clf.fit(X_train_stop_words_ngram, train_labels)
preds= clf.predict(X_test_stop_words_ngram)
accuracies2['ngram Stop Words NB'] = accuracy_score(test_labels, preds)

In [77]:
# Word2Vec stemmed
clf= GaussianNB()
clf.fit(X_train_stemmed_word2vec, train_labels)
preds = clf.predict(X_test_stemmed_word2vec)
accuracies2['Word2Vec Stemmed'] = accuracy_score(test_labels, preds)
# Word2Vec stop words
clf.fit(X_train_stop_words_word2vec, train_labels)
preds = clf.predict(X_test_stop_words_word2vec)
accuracies2['Word2Vec Stop Words'] = accuracy_score(test_labels, preds)


In [78]:
# Doc2Vec stemmed
clf= GaussianNB()
clf.fit(X_train_stemmed_doc2vec, train_labels)
preds = clf.predict(X_test_stemmed_doc2vec)
accuracies2['Doc2Vec Stemmed'] = accuracy_score(test_labels, preds)
# Doc2Vec stop words
clf.fit(X_train_stop_words_doc2vec, train_labels)
preds = clf.predict(X_test_stop_words_doc2vec)
accuracies2['Doc2Vec Stop Words'] = accuracy_score(test_labels, preds)

In [79]:
print("Logistic Regression Accuracies:\n")
for key, value in accuracies.items():
    print(f"{key}: {value:.4f}")
print("-"*30+'\n')

print("Naive Bayes Accuracies:\n")
for key, value in accuracies2.items():
    print(f"{key}: {value:.4f}")


Logistic Regression Accuracies:

BOW Stemmed: 0.8675
BOW Stop Words: 0.8655
tf_idf Stemmed: 0.8805
tf_idf Stop Words: 0.8795
ngram Stemmed: 0.8720
ngram Stop Words: 0.8790
Word2Vec Stemmed: 0.8655
Word2Vec Stop Words: 0.8875
Doc2Vec Stemmed: 0.8005
Doc2Vec Stop Words: 0.7725
------------------------------

Naive Bayes Accuracies:

BOW Stemmed NB: 0.8875
BOW Stop Words NB: 0.8830
tf_idf Stemmed NB: 0.8845
tf_idf Stop Words NB: 0.8805
ngram Stemmed NB: 0.8860
ngram Stop Words NB: 0.8895
Word2Vec Stemmed: 0.8325
Word2Vec Stop Words: 0.8650
Doc2Vec Stemmed: 0.7595
Doc2Vec Stop Words: 0.7410



## Step 4: Results Table

For the stemmed text:

| Representation | Logistic Regression Acc | Naive Bayes Acc | Notes |
|----------------|--------------------------|-----------------|-------|
| BoW            |           0.8675         |      0.8875     |       |
| TF-IDF         |           0.8805         |      0.8845     |       |
| N-grams        |           0.8720         |      0.8860     |       |
| Word2Vec       |           0.8655         |      0.8325     |       |
| Doc2Vec        |           0.8005         |      0.7595     |       |
---

For the text with stop words removed:

| Representation | Logistic Regression Acc | Naive Bayes Acc | Notes |
|----------------|--------------------------|-----------------|-------|
| BoW            |          0.8655          |     0.8830      |       |
| TF-IDF         |          0.8795          |     0.8805      |       |
| N-grams        |          0.8790          |     0.8895      |       |
| Word2Vec       |          0.8875          |     0.8650      |       |
| Doc2Vec        |          0.7725          |     0.7410      |       |
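Tables like these can be generated from the accuracy dictionaries instead of being copied by hand, which avoids transcription slips. The sketch below is an assumption about how that might look: `markdown_table` is a hypothetical helper, and the two example rows reuse values from the printed output above with the same dictionary keys the notebook uses.

```python
# Values copied from the printed accuracies (stemmed text, two rows only).
lr_acc = {"BOW Stemmed": 0.8675, "Word2Vec Stemmed": 0.8655}
nb_acc = {"BOW Stemmed NB": 0.8875, "Word2Vec Stemmed": 0.8325}

# (display name, key in the Logistic Regression dict, key in the Naive Bayes dict)
rows = [
    ("BoW", "BOW Stemmed", "BOW Stemmed NB"),
    ("Word2Vec", "Word2Vec Stemmed", "Word2Vec Stemmed"),
]

def markdown_table(rows, lr_acc, nb_acc):
    """Render one markdown table row per representation."""
    lines = [
        "| Representation | Logistic Regression Acc | Naive Bayes Acc |",
        "|----------------|-------------------------|-----------------|",
    ]
    for name, lr_key, nb_key in rows:
        lines.append(f"| {name} | {lr_acc[lr_key]:.4f} | {nb_acc[nb_key]:.4f} |")
    return "\n".join(lines)

table = markdown_table(rows, lr_acc, nb_acc)
print(table)
```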



## Reflection Questions
1. Which method gave the best accuracy? Why?  


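    N-grams (unigrams + bigrams) with the Naive Bayes classifier on the stop-word-removed text gave the best accuracy (0.8895). BoW with Naive Bayes on the stemmed text and the pretrained embeddings with Logistic Regression on the stop-word-removed text were close behind (both 0.8875). A likely reason is that the training subset is relatively small (7,000 articles), which favors simple count-based models.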

2. Did N-grams improve performance compared to BoW? 

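    Partly. With Logistic Regression, N-grams improved on BoW for both text variants (0.8720 vs. 0.8675 stemmed; 0.8790 vs. 0.8655 with stop words removed). With Naive Bayes, N-grams helped only on the stop-word-removed text (0.8895 vs. 0.8830) and were slightly worse on the stemmed text (0.8860 vs. 0.8875).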

 
3. How do pretrained embeddings (Word2Vec) compare to TF-IDF?  


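    Mostly worse. The averaged pretrained embeddings scored below TF-IDF in three of the four settings; the exception was Logistic Regression on the stop-word-removed text (0.8875 vs. 0.8795). Note that the vectors used here are GloVe (`glove-wiki-gigaword-200`), not Word2Vec.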

4. Which method is more efficient in terms of speed and memory? 

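    BoW. Its sparse count vectors need no pretrained model or embedding training, and unigram-only features give a smaller vocabulary than N-grams.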

 
5. If you had to build a **real news classifier**, which method would you choose and why?  


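    Pretrained word embeddings (e.g., Word2Vec trained on Google News), because they bring in word knowledge learned from a much larger corpus than the training set.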