# NLP Feature Extraction and Model Comparison Assignment

## Objective
Compare the performance of different feature extraction methods:
1. CountVectorizer
2. TfidfVectorizer
3. Word2Vec averaged embeddings

Use these features to classify news articles (Fake vs Real). Tune models using GridSearchCV and observe effects of n-grams (1 to 3) on feature explosion.

Dataset used: a fake vs. real news detection dataset (`fake_and_real_news.csv`).

Importing Libraries

In [2]:
!pip install gensim


Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [3]:
# Data manipulation
import pandas as pd
import numpy as np

# Text preprocessing
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Feature extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

# Machine learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Utilities
from collections import Counter

# Download NLTK resources
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

Loading Dataset

In [4]:
import pandas as pd
df = pd.read_csv('/content/fake_and_real_news.csv')
df.head()

Unnamed: 0,Text,label
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake
1,U.S. conservative leader optimistic of common ...,Real
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real
3,Court Forces Ohio To Allow Millions Of Illega...,Fake
4,Democrats say Trump agrees to work on immigrat...,Real


Columns

In [5]:
df.columns

Index(['Text', 'label'], dtype='object')

In [6]:
len(df)

9900

Pre-Processing

In [20]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Clean
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.lower().strip()
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [w for w in tokens if w not in stop_words]

    # Lemmatize
    #tokens = [lemmatizer.lemmatize(w) for w in tokens]
    #return ' '.join(tokens)  # return as string for CountVectorizer/TF-IDF

    clean_tokens = []
    for w in tokens:
        # Collapse runs of 3+ identical letters to two (e.g. 'sooo' -> 'soo', 'aaaa' -> 'aa')
        w_norm = re.sub(r'(.)\1{2,}', r'\1\1', w)
        # Keep only alphabetic tokens of length >= 2; collapsed runs such as 'aa' or 'kk' still pass
        if re.fullmatch(r'[a-z]{2,}', w_norm):
            clean_tokens.append(lemmatizer.lemmatize(w_norm))
    return ' '.join(clean_tokens)

df['clean_text'] = df['Text'].apply(preprocess_text)
df.head()


Unnamed: 0,Text,label,clean_text
0,Top Trump Surrogate BRUTALLY Stabs Him In The...,Fake,top trump surrogate brutally stab back he path...
1,U.S. conservative leader optimistic of common ...,Real,u conservative leader optimistic common ground...
2,"Trump proposes U.S. tax overhaul, stirs concer...",Real,trump proposes u tax overhaul stir concern def...
3,Court Forces Ohio To Allow Millions Of Illega...,Fake,court force ohio allow million illegally purge...
4,Democrats say Trump agrees to work on immigrat...,Real,democrat say trump agrees work immigration bil...
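The two regular expressions used per token in `preprocess_text` can be checked in isolation. The sketch below uses a hypothetical helper, `normalize_token`, which is not part of the notebook and only mirrors those two steps. It suggests that a run such as `aaaa` is shortened to `aa` and kept rather than dropped, which is consistent with `aa` showing up in the vocabularies printed in the following cells.

```python
import re

def normalize_token(token):
    """Hypothetical helper mirroring the per-token regex steps of preprocess_text."""
    # Collapse any run of 3+ identical characters down to exactly two.
    collapsed = re.sub(r'(.)\1{2,}', r'\1\1', token)
    # Keep only lowercase alphabetic tokens with at least two letters.
    return collapsed if re.fullmatch(r'[a-z]{2,}', collapsed) else None

for tok in ["soooo", "aaaa", "kkk", "x", "good"]:
    print(tok, "->", normalize_token(tok))
```

If such tokens should be excluded, one option would be an extra filter on tokens made of a single repeated letter; that is a design choice the notebook does not make.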


Bag of Words using CountVectorizer

In [21]:
clean_text = df['clean_text'].tolist()
cv = CountVectorizer(ngram_range=(1,3), min_df=1)
X_bow = cv.fit_transform(df['clean_text'])
print("BoW feature shape:", X_bow.shape)
print("First 20 BoW features:", cv.get_feature_names_out()[:20])

BoW feature shape: (9900, 3203675)
First 20 BoW features: ['aa' 'aa even' 'aa even nra' 'aa million' 'aa million member' 'aa rental'
 'aa rental application' 'aa yield' 'aa yield scale' 'aaccording'
 'aaccording report' 'aaccording report united' 'aackk' 'aackk dammit'
 'aackk dammit im' 'aaf' 'aaf aircraft' 'aaf aircraft service'
 'aaf conducted' 'aaf conducted aerial']
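To see why `ngram_range=(1,3)` inflates the vocabulary, the distinct n-grams can be counted by hand. This is a minimal pure-Python sketch on two made-up sentences (the helper `distinct_ngrams` and the toy documents are assumptions for illustration, not part of the notebook). Even with only seven distinct words, the bigram and trigram sets are already larger, and on long articles most trigrams occur only once.

```python
def distinct_ngrams(docs, n):
    """Return the set of distinct word n-grams across whitespace-tokenized docs."""
    grams = set()
    for doc in docs:
        words = doc.split()
        grams.update(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams

# Toy documents, for illustration only.
toy_docs = [
    "the court said the senate said the bill passed",
    "the senate said the court said the bill failed",
]

sizes = {n: len(distinct_ngrams(toy_docs, n)) for n in (1, 2, 3)}
print("Distinct n-grams per n:", sizes)
print("Combined vocabulary for ngram_range=(1,3):", sum(sizes.values()))
```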


TF-IDF vectorizer

In [25]:
texts = df['clean_text'].tolist()
tfidf = TfidfVectorizer(ngram_range=(1,3), min_df=2)
X_tfidf = tfidf.fit_transform(texts)
print("TF-IDF feature shape:", X_tfidf.shape)
print("First 20 TF-IDF features:", tfidf.get_feature_names_out()[:20])

TF-IDF feature shape: (9900, 435764)
First 20 TF-IDF features: ['aa' 'aa yield' 'aa yield scale' 'aai' 'aalo' 'aaminus' 'aaminus credit'
 'aaminus credit rating' 'aaplo' 'aaplo ceo' 'aaplo ceo tim'
 'aaplo iphones' 'aaplo major' 'aaplus' 'aaplus rating'
 'aaplus rating removed' 'aaron' 'aaron bernstein' 'aaron bernstein via'
 'aaron bernsteingetty']


Word2Vec embeddings

In [26]:
# Tokenize the cleaned text
tokenized_texts = [text.split() for text in df['clean_text']]

# Train Word2Vec
w2v_model = Word2Vec(
    sentences=tokenized_texts,
    vector_size=100,
    window=5,
    min_count=2,
    workers=4,
    seed=42
)


def sentence_vector(sentence, model):
    vecs = [model.wv[word] for word in sentence.split() if word in model.wv]
    if len(vecs) == 0:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)


X_w2v = np.vstack([sentence_vector(text, w2v_model) for text in df['clean_text']])
print("Word2Vec sentence embeddings shape:", X_w2v.shape)

Word2Vec sentence embeddings shape: (9900, 100)
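The averaging step in `sentence_vector` can be illustrated without training a model. The sketch below replaces the Word2Vec lookup with a hypothetical dictionary of 3-dimensional vectors (values chosen by hand, not taken from any trained model) to show how unknown words are skipped and how a document with no known words falls back to a zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional "embeddings", for illustration only.
toy_vectors = {
    "tax":   np.array([1.0, 0.0, 2.0]),
    "court": np.array([3.0, 2.0, 0.0]),
}

def average_vector(text, vectors, dim=3):
    """Mean of the vectors of known words; a zero vector if no word is known."""
    found = [vectors[w] for w in text.split() if w in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim)

mixed = average_vector("tax court unknownword", toy_vectors)   # unknown word is skipped
empty = average_vector("nothing known here", toy_vectors)      # falls back to zeros
print(mixed, empty)
```

Averaging discards word order, so two documents with the same words in a different order map to the same vector.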


**Training the Models and Evaluation**


1. Bag-of-Words

In [33]:
df['label'] = df['label'].map({'Fake':0, 'Real':1})


In [34]:
# Train/test split
X_bow = cv.fit_transform(df['clean_text'])
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X_bow, y, test_size=0.2, random_state=42, stratify=y)
print("Train:", X_train.shape, "Test:", X_test.shape)

Train: (7920, 3203675) Test: (1980, 3203675)
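The `stratify=y` argument keeps the class ratio of the full dataset in both splits. As a small sketch, assuming hypothetical labels with 500 samples of class 0 and 490 of class 1 (one tenth of the test-set support reported later), the test split keeps the same 500:490 proportion:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 500 of class 0 and 490 of class 1.
y_toy = np.array([0] * 500 + [1] * 490)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

_, _, _, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
test_counts = np.bincount(y_test_toy)
print("Test class counts:", test_counts)
```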


In [30]:
from sklearn.metrics import precision_recall_fscore_support,roc_auc_score, confusion_matrix
from sklearn.preprocessing import MinMaxScaler

In [35]:
#evaluation

def evaluate_model(model, X_test, y_test):
    preds = model.predict(X_test)

    # ====== Accuracy, Precision, Recall, F1 ======
    accuracy = accuracy_score(y_test, preds)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, preds, average='binary', zero_division=0
    )

    # ====== ROC-AUC ======
    try:
        if hasattr(model, "predict_proba"):
            prob = model.predict_proba(X_test)[:, 1]
        elif hasattr(model, "decision_function"):
            scores = model.decision_function(X_test).reshape(-1, 1)
            prob = MinMaxScaler().fit_transform(scores).ravel()
        else:
            prob = None

        roc = roc_auc_score(y_test, prob) if prob is not None else "N/A"
    except Exception:  # e.g. roc_auc_score fails when y_test contains a single class
        roc = "N/A"

    print("======================================")
    print("Accuracy: ", round(accuracy, 4))
    print("Precision:", round(precision, 4))
    print("Recall:   ", round(recall, 4))
    print("F1-score: ", round(f1, 4))
    print("ROC-AUC:  ", roc)
    print("\nConfusion Matrix:\n", confusion_matrix(y_test, preds))
    print("\nClassification Report:\n", classification_report(y_test, preds))
    print("======================================\n")
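A note on the `decision_function` branch of `evaluate_model`: ROC-AUC depends only on how the scores rank the samples, so min-max scaling the raw decision values should not change the result. The sketch below checks this on hand-picked scores (the labels and scores are hypothetical, not taken from any model in the notebook).

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import MinMaxScaler

y_true = np.array([0, 0, 1, 1, 0, 1])
# Hypothetical decision_function outputs (unbounded real values).
raw_scores = np.array([-2.1, -0.3, 0.4, 3.0, 0.9, 1.5])

# Rescale to [0, 1]; this is a monotonic transform, so the ranking is unchanged.
scaled = MinMaxScaler().fit_transform(raw_scores.reshape(-1, 1)).ravel()

auc_raw = roc_auc_score(y_true, raw_scores)
auc_scaled = roc_auc_score(y_true, scaled)
print(auc_raw, auc_scaled)
```

In other words, the scaling step is harmless for ROC-AUC but not required; the scaled values should not be read as calibrated probabilities.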

In [36]:
# Logistic Regression GridSearch for BoW
param_grid_lr = {
    "C": [0.01, 0.1, 1, 10]
}

lr = LogisticRegression(max_iter=2000, solver='liblinear')

grid_lr_bow = GridSearchCV(
    lr,
    param_grid_lr,
    cv=3,
    scoring='f1',
    n_jobs=-1
)

grid_lr_bow.fit(X_train, y_train)

print("Best LR params (BoW):", grid_lr_bow.best_params_)
best_lr_bow = grid_lr_bow.best_estimator_

evaluate_model(best_lr_bow, X_test, y_test)

Best LR params (BoW): {'C': 10}
Accuracy:  0.999
Precision: 0.999
Recall:    0.999
F1-score:  0.999
ROC-AUC:   0.9999979591836735

Confusion Matrix:
 [[999   1]
 [  1 979]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      1000
           1       1.00      1.00      1.00       980

    accuracy                           1.00      1980
   macro avg       1.00      1.00      1.00      1980
weighted avg       1.00      1.00      1.00      1980




In [37]:
# Multinomial Naive Bayes GridSearch for BoW

param_grid_nb = {
    "alpha": [0.1, 0.5, 1, 2]
}

nb = MultinomialNB()

grid_nb_bow = GridSearchCV(
    nb,
    param_grid_nb,
    cv=3,
    scoring='f1',
    n_jobs=-1
)

grid_nb_bow.fit(X_train, y_train)

print("Best NB params (BoW):", grid_nb_bow.best_params_)
best_nb_bow = grid_nb_bow.best_estimator_

evaluate_model(best_nb_bow, X_test, y_test)


Best NB params (BoW): {'alpha': 0.5}
Accuracy:  0.9773
Precision: 0.9661
Recall:    0.9888
F1-score:  0.9773
ROC-AUC:   0.9870785714285715

Confusion Matrix:
 [[966  34]
 [ 11 969]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.97      0.98      1000
           1       0.97      0.99      0.98       980

    accuracy                           0.98      1980
   macro avg       0.98      0.98      0.98      1980
weighted avg       0.98      0.98      0.98      1980
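The headline metrics can be recomputed directly from the confusion matrix printed above for Naive Bayes on BoW. scikit-learn lays the matrix out with true classes as rows and predicted classes as columns, and class 1 (`Real`) is the positive class here. A short arithmetic check:

```python
# Confusion matrix reported above: rows = true class, columns = predicted class.
tn, fp = 966, 34   # true Fake (0): predicted Fake, predicted Real
fn, tp = 11, 969   # true Real (1): predicted Fake, predicted Real

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

The 34 false positives (Fake articles predicted as Real) are what pull precision below recall for this model.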




**2. TF-IDF**


In [38]:
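# Train/test split
# Note: min_df=1 here, unlike the min_df=2 used in the earlier TF-IDF cell,
# so the feature count matches BoW (3,203,675) rather than 435,764.
tfidf = TfidfVectorizer(ngram_range=(1,3), min_df=1)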
X_tfidf = tfidf.fit_transform(df['clean_text'])
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42, stratify=y
)

print("Train:", X_train.shape, "Test:", X_test.shape)

Train: (7920, 3203675) Test: (1980, 3203675)


In [39]:
#Logistic Regression + GridSearchCV in TF-IDF
param_grid_lr = {'C': [0.01, 0.1, 1, 10, 100]}
grid_lr_tfidf = GridSearchCV(LogisticRegression(max_iter=1000, solver='liblinear'),
                             param_grid_lr, cv=3, scoring='f1', n_jobs=-1)
grid_lr_tfidf.fit(X_train, y_train)

best_lr_tfidf = grid_lr_tfidf.best_estimator_
print("Best LR params (TF-IDF):", grid_lr_tfidf.best_params_)

#evaluate
evaluate_model(best_lr_tfidf, X_test, y_test)

Best LR params (TF-IDF): {'C': 100}
Accuracy:  0.9919
Precision: 0.9949
Recall:    0.9888
F1-score:  0.9918
ROC-AUC:   0.9997571428571429

Confusion Matrix:
 [[995   5]
 [ 11 969]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      1000
           1       0.99      0.99      0.99       980

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980




In [40]:
#Multinomial Naive Bayes + GridSearchCV in TF-IDF
param_grid_nb = {'alpha': [0.1, 0.5, 1.0, 2.0]}
grid_nb_tfidf = GridSearchCV(MultinomialNB(), param_grid_nb, cv=3, scoring='f1', n_jobs=-1)
grid_nb_tfidf.fit(X_train, y_train)

best_nb_tfidf = grid_nb_tfidf.best_estimator_
print("Best NB params (TF-IDF):", grid_nb_tfidf.best_params_)

# Evaluate
evaluate_model(best_nb_tfidf, X_test, y_test)

Best NB params (TF-IDF): {'alpha': 0.1}
Accuracy:  0.9702
Precision: 0.9555
Recall:    0.9857
F1-score:  0.9704
ROC-AUC:   0.9975510204081632

Confusion Matrix:
 [[955  45]
 [ 14 966]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.95      0.97      1000
           1       0.96      0.99      0.97       980

    accuracy                           0.97      1980
   macro avg       0.97      0.97      0.97      1980
weighted avg       0.97      0.97      0.97      1980




3. Word2Vec

In [42]:
X_w2v = np.vstack([sentence_vector(text, w2v_model) for text in df['clean_text']])
y = df['label']

#train test split
X_train, X_test, y_train, y_test = train_test_split(
    X_w2v, y, test_size=0.2, random_state=42, stratify=y
)
print("Train shape:", X_train.shape, "Test shape:", X_test.shape)


#GridSearchCV for Logistic Regression for word2vec
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}
lr = LogisticRegression(max_iter=1000, solver='liblinear')

grid_lr_w2v = GridSearchCV(lr, param_grid, cv=3, scoring='f1', n_jobs=-1)
grid_lr_w2v.fit(X_train, y_train)

print("Best LR params (Word2Vec):", grid_lr_w2v.best_params_)

# Evaluate
best_lr_w2v = grid_lr_w2v.best_estimator_
evaluate_model(best_lr_w2v, X_test, y_test)

Train shape: (7920, 100) Test shape: (1980, 100)
Best LR params (Word2Vec): {'C': 100}
Accuracy:  0.9929
Precision: 0.9919
Recall:    0.9939
F1-score:  0.9929
ROC-AUC:   0.9997714285714285

Confusion Matrix:
 [[992   8]
 [  6 974]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      1000
           1       0.99      0.99      0.99       980

    accuracy                           0.99      1980
   macro avg       0.99      0.99      0.99      1980
weighted avg       0.99      0.99      0.99      1980




Multinomial Naive Bayes is not applied to the Word2Vec features: averaged embeddings are real-valued and include negative numbers, whereas `MultinomialNB` requires non-negative inputs such as word counts or TF-IDF weights.
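This restriction can be reproduced on synthetic data. The sketch below feeds random dense vectors (a hypothetical stand-in for averaged embeddings) to `MultinomialNB`, which raises a `ValueError` on negative input. `GaussianNB` is shown only as one possible Naive Bayes variant that accepts real-valued features; it is a swapped-in alternative that the notebook does not evaluate.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Hypothetical dense features with negative entries, standing in for averaged embeddings.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(20, 5))
y_toy = np.array([0, 1] * 10)

try:
    MultinomialNB().fit(X_toy, y_toy)
    multinomial_ok = True
except ValueError as err:
    multinomial_ok = False
    print("MultinomialNB rejected the input:", err)

# GaussianNB models each feature as a per-class normal distribution, so negatives are accepted.
gnb = GaussianNB().fit(X_toy, y_toy)
print("GaussianNB predictions:", gnb.predict(X_toy[:3]))
```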

# **Summary and comparison**

In [48]:
# Function to get metrics as dictionary for train and test
def get_metrics(model, X_train, y_train, X_test, y_test):
    # Predictions
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)

    # Train and Test accuracy
    train_acc = accuracy_score(y_train, train_preds)
    test_acc = accuracy_score(y_test, test_preds)

    # Test metrics
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, test_preds, average='binary', zero_division=0
    )

    # ROC-AUC
    try:
        if hasattr(model, "predict_proba"):
            prob = model.predict_proba(X_test)[:, 1]
        elif hasattr(model, "decision_function"):
            from sklearn.preprocessing import MinMaxScaler
            scores = model.decision_function(X_test).reshape(-1, 1)
            prob = MinMaxScaler().fit_transform(scores).ravel()
        else:
            prob = None
        roc = roc_auc_score(y_test, prob) if prob is not None else "N/A"
    except Exception:  # e.g. roc_auc_score fails when y_test contains a single class
        roc = "N/A"

    return {
        "Train Acc": train_acc,
        "Test Acc": test_acc,
        "Precision": precision,
        "Recall": recall,
        "F1": f1,
        "ROC-AUC": roc
    }


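# The per-representation split names used below are not defined in the cells shown above,
# so they are recreated here. The same random_state and stratify=y reproduce the earlier splits.
X_train_bow, X_test_bow, y_train_bow, y_test_bow = train_test_split(
    X_bow, y, test_size=0.2, random_state=42, stratify=y)
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42, stratify=y)
X_train_w2v, X_test_w2v, y_train_w2v, y_test_w2v = train_test_split(
    X_w2v, y, test_size=0.2, random_state=42, stratify=y)

comparison_table = pd.DataFrame([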
    {"Model": "BoW_LR", **get_metrics(best_lr_bow, X_train_bow, y_train_bow, X_test_bow, y_test_bow)},
    {"Model": "BoW_NB", **get_metrics(best_nb_bow, X_train_bow, y_train_bow, X_test_bow, y_test_bow)},
    {"Model": "TFIDF_LR", **get_metrics(best_lr_tfidf, X_train_tfidf, y_train_tfidf, X_test_tfidf, y_test_tfidf)},
    {"Model": "TFIDF_NB", **get_metrics(best_nb_tfidf, X_train_tfidf, y_train_tfidf, X_test_tfidf, y_test_tfidf)},
    {"Model": "Word2Vec_LR", **get_metrics(best_lr_w2v, X_train_w2v, y_train_w2v, X_test_w2v, y_test_w2v)}
])

print("===== FINAL COMPARISON TABLE WITH TRAIN/TEST ACCURACY =====")
display(comparison_table)


===== FINAL COMPARISON TABLE WITH TRAIN/TEST ACCURACY =====


Unnamed: 0,Model,Train Acc,Test Acc,Precision,Recall,F1,ROC-AUC
0,BoW_LR,1.0,0.99899,0.99898,0.99898,0.99898,0.999998
1,BoW_NB,1.0,0.977273,0.966102,0.988776,0.977307,0.987079
2,TFIDF_LR,1.0,0.991919,0.994867,0.988776,0.991812,0.999757
3,TFIDF_NB,1.0,0.970202,0.95549,0.985714,0.970367,0.997551
4,Word2Vec_LR,0.996843,0.992929,0.991853,0.993878,0.992864,0.999771


## Summary & Observations

### Model Performance
- **BoW with Logistic Regression** achieved the highest test accuracy (≈99.9%), followed closely by **Word2Vec with Logistic Regression** (≈99.3%) and **TF-IDF with Logistic Regression** (≈99.2%).  
- **Multinomial Naive Bayes** models performed slightly worse, with test accuracies around 97–97.7%, but still delivered strong precision and recall scores.

### Overfitting Analysis
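- All four BoW and TF-IDF models reach a train accuracy of 1.0 while test accuracy ranges from ≈0.970 to ≈0.999, so some overfitting is present, most visibly for the Naive Bayes models.  
- Word2Vec with Logistic Regression shows a gap of about 0.004 between train (≈0.997) and test (≈0.993): smaller than the gaps for TF-IDF and for Naive Bayes, but larger than the gap for BoW with Logistic Regression (≈0.001).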
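One caveat when reading these gaps: the notebook fits `CountVectorizer`, `TfidfVectorizer` and Word2Vec on the full dataset before calling `train_test_split`, so vocabulary and IDF statistics from the test rows are available during training. A common way to avoid this is to place the vectorizer inside a scikit-learn `Pipeline`, so that `GridSearchCV` refits it on each training fold. The following is a minimal sketch on a hypothetical toy corpus; the step names `vec` and `clf` are arbitrary, and whether this would change the reported scores is not something the notebook measures.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = "Real", 0 = "Fake", as in the notebook's label mapping.
texts = [
    "senate passes tax bill", "court blocks travel order",
    "president signs budget law", "governor vetoes school bill",
    "minister announces trade deal", "agency reports job growth",
    "shocking secret they hide", "you wont believe this trick",
    "celebrity exposes hidden plot", "miracle cure doctors hate",
    "aliens control the election", "secret plot exposed again",
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ("vec", TfidfVectorizer()),                       # refit on each training fold
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of pipeline steps are addressed as <step name>__<parameter>.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 3)],
    "clf__C": [0.1, 1, 10],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
search.fit(texts, labels)
print("Best params:", search.best_params_)
```

With this layout the n-gram range becomes a tunable hyperparameter alongside `C`, and the raw text, not a precomputed matrix, is what gets split.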

### Feature Explosion / Efficiency
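- With `ngram_range=(1,3)` and `min_df=1`, both BoW and the TF-IDF matrix used for classification contain 3,203,675 features, which drives up memory usage.  
- Raising `min_df` to 2, as in the first TF-IDF cell, cuts the vocabulary to 435,764 features. The reduction comes from the `min_df` threshold rather than from TF-IDF weighting, and this smaller matrix was not the one used for training.  
- Word2Vec uses only 100 features, a compact representation whose test accuracy (≈99.3%) is comparable to that of the sparse models.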

### Conclusion
- Logistic Regression outperforms Multinomial Naive Bayes on both sparse representations (BoW and TF-IDF); Naive Bayes was not evaluated on Word2Vec.  
- BoW achieves very high accuracy but suffers from feature explosion and slight overfitting.  
- Word2Vec provides a highly efficient, low-dimensional representation with strong performance.  
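- TF-IDF with Logistic Regression reaches ≈99.2% test accuracy. Pairing it with `min_df=2` would reduce the feature count to about 436k, but that configuration was not evaluated here.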
