### Text Classification for Spam Detection

In this assignment, you will build a text classification model using Naive Bayes to classify SMS messages as spam or ham (non-spam). You will implement text preprocessing techniques and use the Vector Space Model (TF-IDF) to represent the text data.

#### Dataset

You will be using the SMS Spam Collection dataset, which contains a set of SMS messages that have been labeled as either spam or ham (legitimate). This dataset is available through several Python libraries or can be downloaded directly.

#### Tasks

1. **Text Preprocessing**:

   - Load the dataset
   - Implement tokenization
   - Apply stemming or lemmatization
   - Remove stopwords

2. **Feature Extraction**:

   - Use TF-IDF vectorization to convert the text data into numerical features
   - Explore the most important features for spam and ham categories

3. **Classification**:

   - Split the data into training and testing sets
   - Train a Multinomial Naive Bayes classifier
   - Evaluate the model using appropriate metrics (accuracy, precision, recall, F1-score)
   - Create a confusion matrix to visualize the results

4. **Analysis**:
   - Analyze false positives and false negatives
   - Identify characteristics of messages that are frequently misclassified
   - Suggest improvements to your model





In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import urllib.request


#import necessary libraries for text processing and model building
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer


# Load the SMS Spam Collection dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms.tsv"
urllib.request.urlretrieve(url, "sms.tsv")
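# Note: sms.tsv has no header row, so read_csv treats the first message as the
# header here (hence the 5571 rows in the outputs below). Passing header=None
# and names=["label", "message"] would keep that row.
sms_data = pd.read_csv('sms.tsv', sep='\t')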
sms_data.columns = ["label", "message"]
print(sms_data.head())


# Check data distribution
print(sms_data['label'].value_counts())







  label                                            message
0   ham                      Ok lar... Joking wif u oni...
1  spam  Free entry in 2 a wkly comp to win FA Cup fina...
2   ham  U dun say so early hor... U c already then say...
3   ham  Nah I don't think he goes to usf, he lives aro...
4  spam  FreeMsg Hey there darling it's been 3 week's n...
label
ham     4824
spam     747
Name: count, dtype: int64
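
The counts above show a strong class imbalance, which is worth quantifying before reading any accuracy figure. A minimal sketch, using only the counts printed above, of the accuracy a trivial classifier would reach by always predicting ham:

```python
# Class counts copied from the value_counts() output above
ham_count = 4824
spam_count = 747
total = ham_count + spam_count

# Accuracy of a trivial classifier that labels every message as ham
baseline_accuracy = ham_count / total
spam_fraction = spam_count / total

print(f"Total messages:   {total}")
print(f"Spam fraction:    {spam_fraction:.3f}")
print(f"All-ham baseline: {baseline_accuracy:.3f}")
```

Any model has to beat roughly 0.87 accuracy to be useful, which is why precision and recall on the spam class say more than accuracy alone.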


In [2]:
# TODO: Implement text preprocessing
nltk.download('punkt')
nltk.download('stopwords')

#Lemmatization (optional alternative to Stemming)
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/smapgal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/smapgal/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/smapgal/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
# Initialize tools
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

#Lemmatizer (optional alternative to Stemming)
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()
    
    # 2. Remove numbers and punctuation
    text = re.sub(r'[^a-z\s]', '', text)
    
    # 3. Tokenization
    tokens = word_tokenize(text)
    
    # 4. Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    
    # 5. Stemming
    tokens = [stemmer.stem(word) for word in tokens]

    # 6. Lemmatization (applied to the already-stemmed tokens, so it changes little; usually one of the two is enough)
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # 7. Join back to string
    return " ".join(tokens)
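
To see what steps 1 and 2 of `preprocess_text` discard, the sketch below applies only the lowercasing and the regex to two messages from this dataset. `strip_non_letters` is a hypothetical helper introduced for illustration, and `str.split()` stands in for `word_tokenize` so the example needs no NLTK data.

```python
import re

def strip_non_letters(text):
    """Steps 1-2 of preprocess_text: lowercase, then keep only a-z and whitespace."""
    return re.sub(r'[^a-z\s]', '', text.lower())

# Two short messages that appear untruncated in the notebook outputs
samples = ["I HAVE A DATE ON SUNDAY WITH WILL!!", "ringtoneking 84484"]

for raw in samples:
    cleaned = strip_non_letters(raw)
    # str.split() is a stand-in for word_tokenize (assumption for this sketch)
    tokens = cleaned.split()
    print(f"{raw!r} -> {tokens}")
```

The second message loses its short code entirely, so numeric cues of this kind never reach the vectorizer.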


In [20]:

#Apply Preprocessing to Dataset
sms_data['clean_message'] = sms_data['message'].apply(preprocess_text)

print(sms_data[['message', 'clean_message']].head(20))


                                              message  \
0                       Ok lar... Joking wif u oni...   
1   Free entry in 2 a wkly comp to win FA Cup fina...   
2   U dun say so early hor... U c already then say...   
3   Nah I don't think he goes to usf, he lives aro...   
4   FreeMsg Hey there darling it's been 3 week's n...   
5   Even my brother is not like to speak with me. ...   
6   As per your request 'Melle Melle (Oru Minnamin...   
7   WINNER!! As a valued network customer you have...   
8   Had your mobile 11 months or more? U R entitle...   
9   I'm gonna be home soon and i don't want to tal...   
10  SIX chances to win CASH! From 100 to 20,000 po...   
11  URGENT! You have won a 1 week FREE membership ...   
12  I've been searching for the right words to tha...   
13                I HAVE A DATE ON SUNDAY WITH WILL!!   
14  XXXMobileMovieClub: To use your credit, click ...   
15                         Oh k...i'm watching here:)   
16  Eh u remember how 2 spell h

In [5]:
# TODO: Apply TF-IDF vectorization
from sklearn.feature_extraction.text import TfidfVectorizer



In [13]:
#create TF-IDF vectorizer
tfidf = TfidfVectorizer(
    max_features=5000,     # limit vocabulary size
    ngram_range=(1,2),     # use unigrams + bigrams
    min_df=2               # ignore rare words
)
X = tfidf.fit_transform(sms_data['clean_message'])
y = sms_data['label'].str.strip()  # call the method; without () this assigns the method object itself

print(X.shape)

(5571, 5000)
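
For intuition about the numbers inside `X`, here is a pure-Python sketch of TF-IDF weighting on a toy corpus. It assumes the smoothed idf, `ln((1 + n) / (1 + df)) + 1`, and the L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults; the toy messages and the helper `tfidf_vector` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-preprocessed toy messages (not from the dataset)
docs = [
    "free prize win",
    "win cash now",
    "see you now",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """TF-IDF weights for one document, L2-normalised."""
    tf = Counter(tokens)
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    weights = {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first = tfidf_vector(tokenized[0])
for term, weight in sorted(first.items()):
    print(f"{term:6s} {weight:.3f}")
```

In the first toy message, "win" also occurs in another document, so it receives a lower weight than "free" and "prize", which occur only once in the corpus.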


In [None]:
# TODO: Split data into training and testing sets

#Convert the labels to binary values
y = sms_data['label'].map({'ham': 0, 'spam': 1})

# Train-test split
from sklearn.model_selection import train_test_split

#X_train, X_test, y_train, y_test = train_test_split(
#    X, y, test_size=0.2, random_state=42
#)

# To redo the train-test split with indices
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, sms_data.index, test_size=0.2, random_state=42)


['ham' 'spam']


In [16]:
# TODO: Train a Multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)


In [17]:
# TODO: Evaluate the model

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_pred = model.predict(X_test)

print("Test Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))




Test Accuracy: 0.9775784753363229

Confusion Matrix:
 [[955   0]
 [ 25 135]]

Classification Report:
               precision    recall  f1-score   support

           0       0.97      1.00      0.99       955
           1       1.00      0.84      0.92       160

    accuracy                           0.98      1115
   macro avg       0.99      0.92      0.95      1115
weighted avg       0.98      0.98      0.98      1115
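
The spam-class figures in the report can be checked by hand from the confusion matrix. A short sketch using only the four counts printed above:

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
tn, fp = 955, 0
fn, tp = 25, 135

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

The high accuracy hides the fact that 25 of the 160 spam messages in the test set were missed, which only the recall figure exposes.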



In [18]:
# TODO: Analyze misclassifications

# Identify misclassified samples
# Create a DataFrame for comparison

results = pd.DataFrame({
    'message': sms_data.loc[idx_test, 'message'].values,
    'actual': y_test.values,
    'predicted': y_pred
})



In [19]:
#Extract misclassified samples
misclassified = results[results['actual'] != results['predicted']]

print("Number of misclassified messages:", len(misclassified))
misclassified.head()


Number of misclassified messages: 25


                                               message  actual  predicted
41                                  ringtoneking 84484       1          0
52   Can U get 2 phone NOW? I wanna chat 2 set up m...       1          0
59   Free-message: Jamster!Get the crazy frog sound...       1          0
148  Ever thought about living a good life with a p...       1          0
245  Ur balance is now £600. Next question: Complet...       1          0


In [21]:
# Separate false positives 

false_positive = results[(results['actual'] == 0) & (results['predicted'] == 1)]
print("False Positives:", len(false_positive))
false_positive.head()


False Positives: 0


Empty DataFrame
Columns: [message, actual, predicted]
Index: []


In [22]:
#false negatives
false_negative = results[(results['actual'] == 1) & (results['predicted'] == 0)]
print("False Negatives:", len(false_negative))
false_negative.head()


False Negatives: 25


                                               message  actual  predicted
41                                  ringtoneking 84484       1          0
52   Can U get 2 phone NOW? I wanna chat 2 set up m...       1          0
59   Free-message: Jamster!Get the crazy frog sound...       1          0
148  Ever thought about living a good life with a p...       1          0
245  Ur balance is now £600. Next question: Complet...       1          0
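
Several of the false negatives above contain digits or currency amounts, which `preprocess_text` removes before vectorization. One way to look for such patterns is to compute a few surface statistics on the raw text. `surface_features` is a hypothetical helper, and the choice of statistics is an assumption rather than part of the assignment.

```python
import re

def surface_features(message):
    """Simple counts computed on the raw text, before any cleaning."""
    return {
        "length": len(message),
        "digits": sum(ch.isdigit() for ch in message),
        "uppercase": sum(ch.isupper() for ch in message),
        "has_currency": bool(re.search(r"[£$€]", message)),
    }

# The one false negative that is shown untruncated above
features = surface_features("ringtoneking 84484")
print(features)
```

Comparing these statistics between the misclassified spam and the correctly classified spam could show whether the missed messages are unusually short or number-heavy.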


# Conclusion
On the 1,115-message test set, the model produced zero false positives: no legitimate message was flagged as spam. It did produce 25 false negatives, so 25 of the 160 spam messages (about 16%) were classified as ham, giving a spam precision of 1.00 and a recall of 0.84. The model is therefore conservative and favors precision over recall: it rarely blocks a valid message, but it lets some spam through.

# To improve the model

1) Tune the TF-IDF vocabulary, for example by raising `max_features` from 5000 to 7000.
2) Tune the Naive Bayes smoothing parameter (`alpha`). A smaller alpha makes the model more sensitive to rare words.
3) Adjust the prediction threshold. A lower threshold for the spam class catches more spam but may increase false positives.
4) Add features beyond the word weights, such as message length and counts of digits or special characters, which the current preprocessing strips out.
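
The threshold trade-off in item 3 can be illustrated without retraining anything. The sketch below uses hypothetical spam probabilities and labels; in the notebook, the probabilities would presumably come from `model.predict_proba(X_test)[:, 1]`.

```python
# Hypothetical spam probabilities and true labels (1 = spam), for illustration only
spam_probs = [0.95, 0.80, 0.45, 0.35, 0.40, 0.10, 0.05, 0.20]
labels     = [1,    1,    1,    1,    0,    0,    0,    0]

def precision_recall(threshold):
    """Precision and recall for the spam class at a given decision threshold."""
    preds = [1 if p >= threshold else 0 for p in spam_probs]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

for threshold in (0.5, 0.3):
    precision, recall = precision_recall(threshold)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall from 0.50 to 1.00 while precision drops from 1.00 to 0.80, the same trade-off the list describes.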