# Name : Nagham Tharwat Ramadan
# ID : 21510350
# TM340 TMA 2025
# Classify SMS messages as either spam or ham (non-spam).

## 1. Environment Setup and Data Import

### Import Required Libraries:

In [5]:
# Import general-purpose Python libraries
import pandas as pd
import numpy as np
import re
import string
# Import NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Import machine learning tools from scikit-learn
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [6]:
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nagha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nagha\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nagha\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Dataset Loading:

In [8]:
spam_or_ham_data = pd.read_csv("SMSSpamCollection.CSV", encoding='latin-1')
spam_or_ham_data

Unnamed: 0,Column1,Column2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...
5570,ham,Will ü b going to esplanade fr home?
5571,ham,"Pity, * was in mood for that. So...any other s..."
5572,ham,The guy did some bitching but I acted like i'd...


## 2. Data Preprocessing and Exploration 

### Initial Exploration:

In [11]:
# Display the first few rows of the dataset to understand its structure.
spam_or_ham_data.head(10)

Unnamed: 0,Column1,Column2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
5,spam,FreeMsg Hey there darling it's been 3 week's n...
6,ham,Even my brother is not like to speak with me. ...
7,ham,As per your request 'Melle Melle (Oru Minnamin...
8,spam,WINNER!! As a valued network customer you have...
9,spam,Had your mobile 11 months or more? U R entitle...


In [12]:
spam_or_ham_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Column1  5574 non-null   object
 1   Column2  5574 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [13]:
spam_or_ham_data.isnull().sum()

Column1    0
Column2    0
dtype: int64

### Column Handling:

In [15]:
print(spam_or_ham_data.columns)

Index(['Column1', 'Column2'], dtype='object')


In [16]:
spam_or_ham_data = spam_or_ham_data.rename(columns={"Column1": "label", "Column2": "message"})
spam_or_ham_data

Unnamed: 0,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5569,spam,This is the 2nd time we have tried 2 contact u...
5570,ham,Will ü b going to esplanade fr home?
5571,ham,"Pity, * was in mood for that. So...any other s..."
5572,ham,The guy did some bitching but I acted like i'd...


### Feature Engineering:

In [18]:
# Engineer a new feature, char_length: the number of characters in each message,
# stored as a new column in the dataset
spam_or_ham_data['char_length'] = spam_or_ham_data['message'].str.len()
# Preview the messages alongside their character lengths
spam_or_ham_data[['message', 'char_length']].head(10)

Unnamed: 0,message,char_length
0,"Go until jurong point, crazy.. Available only ...",111
1,Ok lar... Joking wif u oni...,29
2,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,U dun say so early hor... U c already then say...,49
4,"Nah I don't think he goes to usf, he lives aro...",61
5,FreeMsg Hey there darling it's been 3 week's n...,147
6,Even my brother is not like to speak with me. ...,77
7,As per your request 'Melle Melle (Oru Minnamin...,160
8,WINNER!! As a valued network customer you have...,157
9,Had your mobile 11 months or more? U R entitle...,154


### Text Cleaning:

In [20]:
# Build the stopword set and the stemmer once, rather than on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

# Define the clean_text function
def clean_text(message):
    # Convert to lowercase
    message = message.lower()

    # Remove punctuation (anything that is not a word character or whitespace)
    message = re.sub(r'[^\w\s]', '', message)

    # Tokenize by splitting on whitespace
    words = message.split()

    # Remove stopwords
    words = [word for word in words if word not in STOP_WORDS]

    # Stem each remaining word with the Porter stemmer
    words = [STEMMER.stem(word) for word in words]

    # Join back to string
    return ' '.join(words)

In [21]:
# Apply the clean_text function
spam_or_ham_data['cleaned_text'] = spam_or_ham_data['message'].apply(clean_text)

In [22]:
# Print original and cleaned version of the first message
print("Original Message:")
print(spam_or_ham_data['message'][0])
print("\nCleaned Text:")
print(spam_or_ham_data['cleaned_text'][0])

Original Message:
Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

Cleaned Text:
go jurong point crazi avail bugi n great world la e buffet cine got amor wat


## 3. Data Preparation for Model Training

### Text Vectorization:

In [25]:
# Initialize CountVectorizer
vectorizer = CountVectorizer()
# X -> feature matrix (from CountVectorizer)
# y -> target vector (from LabelEncoder)
# Transform the "cleaned_text" column into a bag-of-words feature matrix
X = vectorizer.fit_transform(spam_or_ham_data['cleaned_text'])

# Encode the labels ("label" column) into binary target values (ham = 0, spam = 1)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(spam_or_ham_data['label'])

In [26]:
# Display shape of X and first few entries of y
print("Shape of X (feature matrix):", X.shape)
print("First 5 target labels:", y[:5])

Shape of X (feature matrix): (5574, 8086)
First 5 target labels: [0 0 1 0 0]


### Data Splitting:

In [28]:
# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y  # stratify helps maintain label proportions
)
# Display shapes of the splits
print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)

Training set shape: (4459, 8086)
Testing set shape: (1115, 8086)
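
As a small check of what `stratify=y` does, the sketch below uses hypothetical labels (80 ham, 20 spam) rather than the real dataset and confirms that the spam share is the same in both splits. The names `toy_X`, `toy_y`, `y_tr`, and `y_te` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 ham (0) and 20 spam (1)
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratification, both splits keep the original 20% spam share
print("Spam share in train:", y_tr.mean())
print("Spam share in test:", y_te.mean())
```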


## 4. Model Training: Multinomial Naive Bayes

### Training the Classifier:

In [31]:
# Instantiate the classifier
nb_classifier = MultinomialNB()

# Train the classifier on the training data
nb_classifier.fit(X_train, y_train)

print("Model training complete.")

Model training complete.


### Optional Enhancements:

In [33]:
# alpha is the additive (Laplace) smoothing parameter of MultinomialNB.
# Default is alpha=1.0. Lowering it (e.g., 0.1, 0.01) can sometimes improve accuracy.
# Larger values smooth more strongly, making the model less sensitive to word frequency differences.
# Train and compare models with different alpha values.
alpha_values = [0.01, 0.1, 0.5, 1.0, 2.0]
for alpha in alpha_values:
    model = MultinomialNB(alpha=alpha)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print(f"Alpha = {alpha}: Accuracy = {acc:.4f}")

Alpha = 0.01: Accuracy = 0.9767
Alpha = 0.1: Accuracy = 0.9749
Alpha = 0.5: Accuracy = 0.9776
Alpha = 1.0: Accuracy = 0.9767
Alpha = 2.0: Accuracy = 0.9767
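
The comparison above scores each alpha on the test set. One possible alternative, sketched here, is to choose alpha with cross-validation on the training data only, so the test set stays untouched until the final evaluation. The helper `cv_accuracy_by_alpha` and the synthetic `demo_X` / `demo_y` arrays are hypothetical stand-ins; in the notebook the function would be called with `X_train` and `y_train`.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def cv_accuracy_by_alpha(X_train, y_train, alphas, folds=5):
    """Return {alpha: mean cross-validated accuracy} using training data only."""
    scores = {}
    for alpha in alphas:
        model = MultinomialNB(alpha=alpha)
        fold_scores = cross_val_score(model, X_train, y_train, cv=folds, scoring="accuracy")
        scores[alpha] = fold_scores.mean()
    return scores

# Synthetic count data standing in for the notebook's X_train / y_train
rng = np.random.default_rng(0)
demo_y = np.array([0] * 50 + [1] * 50)
demo_X = rng.poisson(lam=np.where(demo_y[:, None] == 0, [3, 1, 1], [1, 1, 3]))

demo_scores = cv_accuracy_by_alpha(demo_X, demo_y, [0.01, 0.1, 0.5, 1.0, 2.0])
best_alpha = max(demo_scores, key=demo_scores.get)
for alpha, score in demo_scores.items():
    print(f"Alpha = {alpha}: mean CV accuracy = {score:.4f}")
print("Best alpha by cross-validation:", best_alpha)
```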


## 5. Model Evaluation and Analysis

### Making Predictions:

In [36]:
# Predict labels for the test set
y_pred = nb_classifier.predict(X_test)
print("First 10 Predictions:", y_pred[:10])

First 10 Predictions: [0 0 0 1 0 0 1 0 0 0]


### Performance Metrics:

In [38]:
# 1. Accuracy Score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy Score: {accuracy:.4f}")

Accuracy Score: 0.9767


In [39]:
# 2. Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)


Confusion Matrix:
[[951  15]
 [ 11 138]]


In [40]:
# 3. Classification Report
class_report = classification_report(y_test, y_pred, target_names=label_encoder.classes_)
print("\nClassification Report:")
print(class_report)


Classification Report:
              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       966
        spam       0.90      0.93      0.91       149

    accuracy                           0.98      1115
   macro avg       0.95      0.96      0.95      1115
weighted avg       0.98      0.98      0.98      1115
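
As a cross-check, the spam-class figures in the report can be recomputed by hand from the confusion matrix printed earlier. This is a small worked sketch that treats spam as the positive class; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the notebook output: rows are true labels,
# columns are predicted labels, both in the order [ham, spam]
cm = np.array([[951, 15],
               [11, 138]])

# With spam as the positive class: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
spam_precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)      # of actual spam messages, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(f"accuracy={accuracy:.4f}")
print(f"spam precision={spam_precision:.2f} recall={spam_recall:.2f} f1={spam_f1:.2f}")
```

The recomputed values agree with the accuracy score (0.9767) and the spam row of the classification report (0.90, 0.93, 0.91).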



## 6. Result Analysis

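### My Observations
The Multinomial Naive Bayes classifier performed well on the SMS spam classification task, reaching 97.67% accuracy on the 1,115-message test set. Based on the classification report and confusion matrix:
- Ham (non-spam) messages were classified with high precision (0.99) and recall (0.98): 951 of the 966 ham messages were correctly identified.
- Spam messages were also identified accurately (precision 0.90, recall 0.93), with 15 false positives (ham flagged as spam) and 11 false negatives (spam missed). Spam is the minority class (149 of the 1,115 test messages), so its scores are more sensitive to a handful of errors.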

### Preprocessing Challenges
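1. Data Loading & Format Issues: The raw file contained inconsistent delimiters (tabs), requiring manual inspection or cleaning before parsing.
2. Text Cleaning: Each message needed lowercase conversion, punctuation removal, stopword removal, and stemming before it could be vectorized.
3. Model Smoothing Parameter: The alpha parameter (Laplace smoothing) in MultinomialNB required tuning. Across the values tested (0.01 to 2.0), test accuracy ranged from 0.9749 to 0.9776, with the best result at alpha = 0.5.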
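
For the delimiter issue in point 1, one option is to let pandas parse the tab-separated layout directly. The sketch below assumes the raw file has one `label<TAB>message` pair per line and no header row; the two sample lines are hypothetical, and `io.StringIO` stands in for the real file path.

```python
import csv
import io

import pandas as pd

# Two hypothetical lines in a tab-separated "label<TAB>message" layout
raw_text = (
    'ham\tOk lar... Joking wif u oni...\n'
    'spam\tFree entry, text "WIN" to claim\n'
)

df = pd.read_csv(
    io.StringIO(raw_text),       # replace with the path to the raw file
    sep="\t",                    # fields are separated by tabs
    header=None,                 # assumption: the raw file has no header row
    names=["label", "message"],  # name the columns at load time
    quoting=csv.QUOTE_NONE,      # treat quote characters inside messages as plain text
)
print(df)
```

Naming the columns at load time would also make the later rename step unnecessary.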

### Potential Improvements
- Explore more advanced models like logistic regression, random forest, or SVM for better generalization.
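
As a starting point for such a comparison, here is a minimal sketch of a logistic regression classifier built on the same bag-of-words features. The toy messages and labels are hypothetical; in the notebook, the pipeline would be fitted on the training portion of `cleaned_text` and `label` and evaluated with the same metrics as the Naive Bayes model.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the cleaned messages and labels
toy_texts = [
    "win free prize now", "free entry win cash", "claim free prize",
    "see you at lunch", "call me later", "are you home yet",
]
toy_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# Vectorizer and classifier are chained so that raw text goes in
# and predicted labels come out
lr_pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
lr_pipeline.fit(toy_texts, toy_labels)

toy_pred = lr_pipeline.predict(["free prize", "see you later"])
print(toy_pred)
```

Wrapping the vectorizer and the model in one pipeline also means the vocabulary is learned only from the data passed to `fit`, which keeps test messages out of the feature-building step.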