# Assignment

## Instructions

### Text Classification for Spam Detection

In this assignment, you will build a text classification model using Naive Bayes to classify SMS messages as spam or ham (non-spam). You will implement text preprocessing techniques and use the Vector Space Model (TF-IDF) to represent the text data.

#### Dataset

You will use the SMS Spam Collection dataset, a set of SMS messages labeled as either spam or ham (legitimate). The starter code below downloads it as a tab-separated file (`sms.tsv`).

#### Tasks

1. **Text Preprocessing**:

   - Load the dataset
   - Implement tokenization
   - Apply stemming or lemmatization
   - Remove stopwords

2. **Feature Extraction**:

   - Use TF-IDF vectorization to convert the text data into numerical features
   - Explore the most important features for spam and ham categories

3. **Classification**:

   - Split the data into training and testing sets
   - Train a Multinomial Naive Bayes classifier
   - Evaluate the model using appropriate metrics (accuracy, precision, recall, F1-score)
   - Create a confusion matrix to visualize the results

4. **Analysis**:
   - Analyze false positives and false negatives
   - Identify characteristics of messages that are frequently misclassified
   - Suggest improvements to your model

#### Starter Code

In [12]:
# Import necessary libraries
import pandas as pd
import numpy as np
import urllib.request
import spacy
from spacy.tokens import Doc
from spacy.language import Language
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, classification_report


In [2]:

# Load the SMS Spam Collection dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
urllib.request.urlretrieve(url, "sms.tsv")
sms_data = pd.read_csv('sms.tsv', sep='\t', header=None, names=['label', 'message'])
print(sms_data.head())


  label                                            message
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


In [3]:

# Check data distribution
print(sms_data['label'].value_counts())


label
ham     4825
spam     747
Name: count, dtype: int64
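
The counts above show that spam makes up 747 of 5,572 messages (about 13%), so a purely random split can leave the test set with a slightly different spam share than the training set. One option, shown here as a minimal sketch on hypothetical toy data (the `texts` and `labels` lists are invented for illustration and are not part of the assignment), is to pass `stratify` to `train_test_split` so both splits keep the same class ratio.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data: 80 ham (0) and 20 spam (1) messages.
labels = [0] * 80 + [1] * 20
texts = [f"message {i}" for i in range(100)]

# stratify=labels keeps the 80/20 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

test_counts = Counter(y_te)
print(test_counts)
```

Whether stratification changes the results noticeably on this dataset is something to check empirically rather than assume.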


In [4]:
sms_data['target'] = sms_data['label'].map({'ham':0,'spam':1})

In [5]:
sms_data.head()

|  | label | message | target |
|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |


In [6]:
# TODO: Implement text preprocessing
# - Tokenization
# - Stemming/Lemmatization
# - Stopwords removal

In [7]:
# 1. Define a component that runs after the lemmatizer:
@Language.component("prune_stops_preserve_lemmas")
def prune_stops_preserve_lemmas(doc: Doc) -> Doc:
    # Keep only non-stop, non-punct tokens
    keep = [token for token in doc if not (token.is_stop or token.is_punct)]
    words  = [token.text       for token in keep]
    spaces = [token.whitespace_ for token in keep]
    lemmas = [token.lemma_     for token in keep]
    # Build a fresh Doc with just the words & spaces
    new_doc = Doc(doc.vocab, words=words, spaces=spaces)
    # Copy over each token’s lemma_
    for token, lemma in zip(new_doc, lemmas):
        token.lemma_ = lemma
    return new_doc

# 2. Load spaCy English — keep tagger & lemmatizer, disable heavier components
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

# 3. Insert our prune component *after* the lemmatizer
nlp.add_pipe("prune_stops_preserve_lemmas", after="lemmatizer")

# 4. Example: apply to your sms_data DataFrame
docs = list(nlp.pipe(sms_data["message"], batch_size=50))

# 5. Extract lemmas and cleaned text
sms_data["lemmas"] = [
    [token.lemma_ for token in doc]
    for doc in docs
]
sms_data["cleaned_text"] = [
    " ".join(token.lemma_ for token in doc)
    for doc in docs
]


In [8]:
sms_data

|  | label | message | target | lemmas | cleaned_text |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 | [jurong, point, crazy, available, bugis, n, gr... | jurong point crazy available bugis n great wor... |
| 1 | ham | Ok lar... Joking wif u oni... | 0 | [ok, lar, joke, wif, u, oni] | ok lar joke wif u oni |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 | [free, entry, 2, wkly, comp, win, FA, Cup, fin... | free entry 2 wkly comp win FA Cup final tkts 2... |
| 3 | ham | U dun say so early hor... U c already then say... | 0 | [u, dun, early, hor, u, c] | u dun early hor u c |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 | [nah, think, go, usf, live] | nah think go usf live |
| ... | ... | ... | ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... | 1 | [2nd, time, try, 2, contact, u., U, win, £, 75... | 2nd time try 2 contact u. U win £ 750 Pound pr... |
| 5568 | ham | Will ü b going to esplanade fr home? | 0 | [ü, b, go, esplanade, fr, home] | ü b go esplanade fr home |
| 5569 | ham | Pity, * was in mood for that. So...any other s... | 0 | [pity, mood, suggestion] | pity mood suggestion |
| 5570 | ham | The guy did some bitching but I acted like i'd... | 0 | [guy, bitching, act, like, interested, buy, we... | guy bitching act like interested buy week give... |


In [9]:
# TODO: Apply TF-IDF vectorization

In [None]:
# TODO: Split data into training and testing sets


In [None]:

# TODO: Train a Multinomial Naive Bayes classifier

# TODO: Evaluate the model

# TODO: Analyze misclassifications
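
One way to approach the misclassification analysis is to put messages, true labels, and predictions side by side in a DataFrame and filter for disagreements. This is a minimal sketch on hypothetical toy values (`messages`, `y_true`, and `y_hat` are invented stand-ins for the test messages, test labels, and model predictions).

```python
import pandas as pd

# Hypothetical stand-ins for test messages, true labels and predictions
# (1 = spam, 0 = ham).
messages = pd.Series(
    ["win cash now", "see you at lunch", "call me now to claim", "free for lunch?"]
)
y_true = pd.Series([1, 0, 1, 0])
y_hat = pd.Series([1, 0, 0, 1])

results = pd.DataFrame({"message": messages, "true": y_true, "pred": y_hat})

# False positive: ham predicted as spam. False negative: spam predicted as ham.
false_positives = results[(results["true"] == 0) & (results["pred"] == 1)]
false_negatives = results[(results["true"] == 1) & (results["pred"] == 0)]

print("False positives:\n", false_positives)
print("False negatives:\n", false_negatives)
```

With the notebook's own variables, it may be more informative to look up the original (unprocessed) messages for the misclassified rows, for example through the index of the test split, since preprocessing can remove cues such as punctuation or capitalization.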

In [13]:
# 1. Prepare your data
#    sms_data must already be a pandas DataFrame with:
#      - sms_data["cleaned_text"]  : your lemmatized, stop-word-free text
#      - sms_data["target"]        : the numeric target column (0 = ham, 1 = spam)
X = sms_data["cleaned_text"]
y = sms_data["target"]

# 2. Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Vectorize with TF–IDF
vectorizer = TfidfVectorizer(max_df=0.9, min_df=5, ngram_range=(1,2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf  = vectorizer.transform(X_test)

# 4. Train Multinomial Naive Bayes
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

# 5. Compute and display the confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
cm_df = pd.DataFrame(cm, index=clf.classes_, columns=clf.classes_)
print("Confusion Matrix:")
print(cm_df, "\n")

# 6. Compute and display the classification report
print("Classification Report:")
print(classification_report(y_test, y_pred))


Confusion Matrix:
     0    1
0  964    2
1   17  132 

Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       966
           1       0.99      0.89      0.93       149

    accuracy                           0.98      1115
   macro avg       0.98      0.94      0.96      1115
weighted avg       0.98      0.98      0.98      1115
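
As a sanity check, the spam-class metrics in the report can be recomputed by hand from the confusion matrix printed above. The sketch below hard-codes those four counts, treating spam (label 1) as the positive class.

```python
# Counts taken from the confusion matrix above (rows = true, columns = predicted).
tn, fp = 964, 2    # true ham:  predicted ham, predicted spam
fn, tp = 17, 132   # true spam: predicted ham, predicted spam

precision = tp / (tp + fp)                  # share of predicted spam that is spam
recall = tp / (tp + fn)                     # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The 17 false negatives against only 2 false positives explain why recall for spam (0.89) is noticeably lower than precision (0.99): the model rarely flags ham as spam but lets some spam through.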

