### Assignment 2

1. Data Representation & Preprocessing
We start by representing each research description as a clean, token-based document. The text from the Description column is first lowercased, then stripped of all punctuation and digits using a simple regular-expression filter. We then apply NLTK's word_tokenize to split the cleaned text into a sequence of tokens. This token sequence is the basis for our feature representations: a binary bag-of-words baseline and the improved representation described next.
*Enhanced TF-IDF with Bigrams*: to capture local phrase structures, we extend our vocabulary to include both unigrams and bigrams. We apply document-frequency thresholds to filter out extremely rare or overly common n-grams, then weight each feature by TF-IDF. This downweights generic tokens (e.g., "the", "data") and amplifies more discriminative terms.
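
As a rough illustration of the cleaning step, the sketch below applies the same lowercase-and-filter logic to a made-up sentence. It is a simplification under one stated assumption: `str.split()` stands in for NLTK's `word_tokenize` so the snippet runs without downloading tokenizer data, and the name `clean_text_sketch` is hypothetical.

```python
import re

def clean_text_sketch(text):
    """Lowercase, drop everything except a-z and whitespace, normalize spacing."""
    text = text.lower()
    # Characters outside a-z and whitespace are deleted, not replaced by a space.
    text = re.sub(r"[^a-z\s]", "", text)
    # Assumption: plain whitespace splitting instead of nltk.word_tokenize.
    tokens = text.split()
    return " ".join(tokens)

example = clean_text_sketch("Deep Learning, 2nd ed.: GPU-based models!")
print(example)
```

Because filtered characters are deleted outright, hyphenated words are fused ("GPU-based" becomes "gpubased") and digit-letter mixes leave fragments ("2nd" becomes "nd"). The document-frequency thresholds may remove many of these fragments later, but that depends on the corpus.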
2. Model Implementation and Improvements
*Standard Naive Bayes*:
We implement a multinomial Naive Bayes classifier from scratch:
1. *Prior*: estimate class prior probabilities in log-space.
2. *Likelihood*: for each class, sum term counts across all documents in that class, apply Laplace smoothing, and normalize.
3. *Prediction*: for a new document, sum the log-prior with the dot-product of its feature vector and each class's log-likelihood vector, choosing the class with the highest total score.
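
The three steps above can be sketched on a toy corpus in pure Python. This is an illustrative sketch only, not the notebook's implementation: the documents, labels, and the helper names `fit_nb` and `predict_nb` are invented for the example, and token lists replace the sparse feature matrices used in the actual code.

```python
import math
from collections import Counter

def fit_nb(docs, labels, vocab, alpha=1.0):
    """Estimate log-priors and Laplace-smoothed log-likelihoods per class."""
    log_prior, log_lik = {}, {}
    for c in sorted(set(labels)):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        # Step 1: class prior in log-space
        log_prior[c] = math.log(len(class_docs) / len(docs))
        # Step 2: summed term counts, add-alpha smoothing, normalization
        counts = Counter(tok for d in class_docs for tok in d)
        total = sum(counts[w] + alpha for w in vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_nb(doc, log_prior, log_lik):
    """Step 3: pick the class with the highest log-prior + summed log-likelihood."""
    scores = {
        c: log_prior[c] + sum(log_lik[c][t] for t in doc if t in log_lik[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

# Hypothetical toy data
docs = [["neural", "network"], ["neural", "training"], ["protein", "folding"]]
labels = ["cs", "cs", "bio"]
vocab = sorted({t for d in docs for t in d})

log_prior, log_lik = fit_nb(docs, labels, vocab)
pred = predict_nb(["neural", "network", "training"], log_prior, log_lik)
print(pred)
```

Working in log-space turns the product of many small probabilities into a sum, which avoids numerical underflow on long documents.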
*Improvements:*
- including bigrams enables the model to recognise phrases that unigrams alone cannot.
- removing tokens that appear in fewer than 5 documents or in more than 80% of all documents reduces noise and dimensionality.
- term-frequency scaling coupled with inverse-document-frequency helps emphasise informative tokens and suppress common ones.
These enhancements produce a richer feature set that often yields higher discriminative power while still maintaining computational efficiency.
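
To make the bigram and document-frequency filtering concrete, here is a small pure-Python sketch. It is meant to mirror the idea behind the `ngram_range`, `min_df`, and `max_df` settings rather than reproduce scikit-learn's implementation. The helper names and the five toy documents are hypothetical, and `min_df` is lowered to 2 so that the tiny corpus keeps some features.

```python
from collections import Counter

def ngrams(tokens):
    """Return the unigrams plus space-joined bigrams of one document."""
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

def filter_vocab(docs, min_df=2, max_df=0.8):
    """Keep n-grams found in at least min_df docs and at most a max_df share of docs."""
    df = Counter()
    for tokens in docs:
        df.update(set(ngrams(tokens)))  # count each n-gram once per document
    n_docs = len(docs)
    return sorted(t for t, n in df.items() if n >= min_df and n / n_docs <= max_df)

# Hypothetical toy corpus
docs = [
    "machine learning for vision".split(),
    "machine learning for text".split(),
    "learning for robots".split(),
    "quantum learning theory".split(),
    "learning theory".split(),
]

vocab = filter_vocab(docs)
print(vocab)
```

In this toy corpus "learning" occurs in every document and is dropped by the upper threshold, while the bigrams "machine learning" and "learning theory" survive and carry the phrase-level signal.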
3. Evaluation Procedure
We split train.csv into an 80% training / 20% validation set. For the improved model, we report:
- overall fraction of correctly classified descriptions.
The smoothing hyperparameter alpha is tuned via 3-fold cross-validation on the training data, selecting the value with the highest cross-validated accuracy.
*Conclusion:* Our accuracy on the data was 98.18%
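
For reference, the accuracy metric and a 3-fold split can be sketched in a few lines of plain Python. This is a simplified illustration with hypothetical helper names. It uses contiguous, unstratified folds, whereas scikit-learn's `GridSearchCV` with an integer `cv` uses stratified folds for classifiers.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def k_fold_indices(n_samples, k=3):
    """Yield (train_idx, val_idx) pairs for k contiguous, unstratified folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(7, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "val:", val_idx)

print(accuracy(["a", "b", "a", "c"], ["a", "b", "c", "c"]))
```

Each candidate alpha would be scored by averaging the validation accuracy over the folds, and the value with the best average is kept.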

In [75]:
# Imports for Assignment 2
import pandas as pd
import numpy as np
import re
import nltk
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [76]:
nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\rony2\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\rony2\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [77]:
# Loading Data
train_df = pd.read_csv('train.csv')
test_df  = pd.read_csv('test.csv')

In [78]:
# Data Preprocessing
def clean_text(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = word_tokenize(text)
    return " ".join(tokens)

In [79]:
train_df['clean'] = train_df['Description'].apply(clean_text)
test_df ['clean'] = test_df ['Description'].apply(clean_text)

In [80]:
# Basic Feature Extraction (Baseline)
vectorizer = CountVectorizer(binary=True)
X_counts = vectorizer.fit_transform(train_df['clean'])
y = train_df['Class']

In [81]:
# Naive Bayes Classifier Implementation
class NaiveBayesClassifier:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
    def fit(self, X, y):
        n_docs, n_feats = X.shape
        self.classes = np.unique(y)
        self.log_prior = {}
        self.log_likelihood = {}
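        for c in self.classes:
            # Row positions of the documents that belong to class c
            idx = np.where(y == c)[0]
            X_c = X[idx]
            # Log prior: share of training documents in class c
            self.log_prior[c] = np.log(len(idx) / n_docs)
            # Per-feature totals for the class, with additive (Laplace) smoothing
            counts = np.array(X_c.sum(axis=0)).flatten() + self.alpha
            total = counts.sum()
            # Normalize to a distribution over features and move to log-space
            self.log_likelihood[c] = np.log(counts / total)
        return self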
    def predict(self, X):
        results = []
        for i in range(X.shape[0]):
            row = X[i].toarray().flatten()
            scores = {c: self.log_prior[c] + (row * self.log_likelihood[c]).sum()
                      for c in self.classes}
            results.append(max(scores, key=scores.get))
        return np.array(results)

In [82]:
X_train, X_val, y_train, y_val = train_test_split(X_counts, y, test_size=0.2, random_state=10000)

In [83]:
# Model with TF-IDF & Bigrams
# CountVectorizer with bigrams + frequency filters
erange = (1,2)
vectorizer2 = CountVectorizer(ngram_range=erange, max_df=0.8, min_df=5)
X_counts2 = vectorizer2.fit_transform(train_df['clean'])

In [84]:
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts2)

In [85]:
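# Wrap custom NB in sklearn interface
from sklearn.base import BaseEstimator, ClassifierMixin
class SklearnNB(BaseEstimator, ClassifierMixin):
    def __init__(self, alpha=1.0):
        # Only store hyperparameters here: GridSearchCV applies each candidate
        # via set_params(), which updates self.alpha after __init__ has run.
        self.alpha = alpha
    def fit(self, X, y):
        # Build the underlying model at fit time so it uses the current alpha
        self.model_ = NaiveBayesClassifier(self.alpha).fit(X, y)
        self.classes_ = self.model_.classes
        return self
    def predict(self, X):
        return self.model_.predict(X)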

In [86]:
# Hyperparameter tuning for alpha
param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0]}
grid = GridSearchCV(SklearnNB(), param_grid, cv=3, scoring='accuracy')
grid.fit(X_tfidf, y)
best_alpha = grid.best_params_['alpha']
print("Best smoothing alpha:", best_alpha)

Best smoothing alpha: 0.1


In [87]:
# Evaluate improved on validation split
X_train2, X_val2, y_train2, y_val2 = train_test_split(X_tfidf, y, test_size=0.2, random_state=777)
imp_clf = NaiveBayesClassifier(alpha=best_alpha)
imp_clf.fit(X_train2, y_train2)
imp_preds = imp_clf.predict(X_val2)
print("Improved Accuracy:", accuracy_score(y_val2, imp_preds))

Improved Accuracy: 0.9840909090909091


In [88]:
# Final Training
full_counts = vectorizer2.fit_transform(train_df['clean'])
full_tfidf   = tfidf.fit_transform(full_counts)
final_clf = NaiveBayesClassifier(alpha=best_alpha)
final_clf.fit(full_tfidf, y)

<__main__.NaiveBayesClassifier at 0x16fce9ec250>

In [89]:
# Predict on test set
X_test_counts = vectorizer2.transform(test_df['clean'])
X_test_tfidf   = tfidf.transform(X_test_counts)
test_preds = final_clf.predict(X_test_tfidf)