# 📌 Spam Detection Using NLP and Machine Learning

## 📝 Project Description
Spam detection is a **text classification task** where we classify SMS messages as **spam (1)** (unwanted messages) or **ham (0)** (legitimate messages). In this project, I applied **Natural Language Processing (NLP) techniques** and **Machine Learning (ML) models** to detect spam messages.

---

## 📊 Dataset
I used a publicly available **Spam/Ham SMS dataset** in CSV format. The dataset contains two columns:  
- **Class** → (Spam or Ham)  
- **Message** → (Text of the SMS)  

---

## 🔄 Project Workflow
1️⃣ **Data Collection** → Loaded the dataset from a CSV file.  
2️⃣ **Data Preprocessing** → Applied text cleaning, tokenization, stopword removal, and lemmatization.  
3️⃣ **Word Representation** → Used **TF-IDF Vectorization** to convert text into numerical form.  
4️⃣ **Model Building** → Trained **Multinomial Naïve Bayes, Logistic Regression, and SVM** models.  
5️⃣ **Model Evaluation** → Computed accuracy, confusion matrix, and precision score.  

--- 

🔹 **Best Precision Score Achieved: 1.0** on the held-out test set: no ham message was flagged as spam 🚀, although the Naïve Bayes model still missed 35 of the 121 spam messages.


In [246]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re, string
import nltk

In [247]:
df = pd.read_csv(r"C:\Users\jinil\Desktop\SMS Spam classifier\Spam_SMS.csv")
df.head()

|   | Class | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [248]:
df.shape

(5574, 2)

In [250]:
df.isnull().sum()

Class      0
Message    0
dtype: int64

In [251]:
df.duplicated().sum()

415

In [252]:
df = df.drop_duplicates(keep="first")

In [253]:
df.shape

(5159, 2)

In [254]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

df["Class"] = encoder.fit_transform(df["Class"])
df.head()


|   | Class | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [255]:
df["Class"].value_counts()

Class
0    4518
1     641
Name: count, dtype: int64

**0 -> Ham and 1 -> Spam**
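
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `ham` becomes 0 and `spam` becomes 1. A minimal sketch on a toy label list (the `labels` values and the `label_map` name are illustrative, not taken from the dataset) shows how to read the mapping back from the fitted encoder:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["ham", "spam", "ham", "ham", "spam"]  # toy labels for illustration

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the original labels; each label's position is its integer code
label_map = {label: code for code, label in enumerate(encoder.classes_)}
print(label_map)
print(encoded.tolist())
```

Checking `encoder.classes_` this way avoids relying on memory for which class is treated as positive when calling `precision_score(..., pos_label=1)`.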

## <font size="10">**Text Preprocessing**</font>

In [256]:
def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))  # Remove punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # Remove extra spaces
    return text

df["Cleaned_text"] = df["Message"].apply(clean_text)
df.head()


|   | Class | Message | Cleaned_text |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go until jurong point crazy available only in ... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry in a wkly comp to win fa cup final ... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say so early hor u c already then say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah i dont think he goes to usf he lives aroun... |
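
To make the effect of each cleaning step concrete, the sketch below applies the same `clean_text` function to a made-up message (the `sample` string is hypothetical). Note that digits are removed entirely, so short codes and phone numbers, which can themselves be spam signals, do not reach the model.

```python
import re
import string

def clean_text(text):
    text = text.lower()                                                # lowercase
    text = re.sub(r'\d+', '', text)                                    # drop digit runs
    text = text.translate(str.maketrans('', '', string.punctuation))  # drop ASCII punctuation
    text = re.sub(r'\s+', ' ', text).strip()                           # collapse whitespace
    return text

sample = "FREE entry!! Text WIN to 80086 now..."  # made-up example message
cleaned = clean_text(sample)
print(cleaned)  # free entry text win to now
```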


## <font size="7">Tokenization</font>

In [257]:
from nltk.tokenize import word_tokenize 
df["Cleaned_text"] = df["Cleaned_text"].apply(word_tokenize)
df["Cleaned_text"].head()

0    [go, until, jurong, point, crazy, available, o...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, in, a, wkly, comp, to, win, fa, ...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, dont, think, he, goes, to, usf, he, l...
Name: Cleaned_text, dtype: object

## <font size="7">Remove StopWords</font>

In [258]:
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jinil/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [259]:
stop_words = set(stopwords.words('english'))  # separate name so the imported `stopwords` module is not overwritten

In [260]:
df["Cleaned_text"] = df["Cleaned_text"].apply(lambda x: [word for word in x if word not in stop_words])
df["Cleaned_text"].head()

0    [go, jurong, point, crazy, available, bugis, n...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, wkly, comp, win, fa, cup, final,...
3        [u, dun, say, early, hor, u, c, already, say]
4    [nah, dont, think, goes, usf, lives, around, t...
Name: Cleaned_text, dtype: object

## <font size="7">Lemmatization</font>

In [261]:
from nltk.stem import WordNetLemmatizer
# WordNetLemmatizer needs the WordNet corpus; run nltk.download('wordnet') once if it is missing
lemmatizer = WordNetLemmatizer()

In [262]:
df["Cleaned_text"] = df["Cleaned_text"].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
df["Cleaned_text"].head()

0    [go, jurong, point, crazy, available, bugis, n...
1                       [ok, lar, joking, wif, u, oni]
2    [free, entry, wkly, comp, win, fa, cup, final,...
3        [u, dun, say, early, hor, u, c, already, say]
4    [nah, dont, think, go, usf, life, around, though]
Name: Cleaned_text, dtype: object

In [263]:
# Convert tokenized words back to a sentence
df["Cleaned_text"] = df["Cleaned_text"].apply(lambda x: " ".join(x))
df.head()

|   | Class | Message | Cleaned_text |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry wkly comp win fa cup final tkts st ... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah dont think go usf life around though |


## <font size="6">Word Representation (TF-IDF)</font>

In [264]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(df["Cleaned_text"])
y = df["Class"]

## <font size="10">**Model Building**</font>

In [265]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
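
In the cells above, the TF-IDF vectorizer is fitted on the full dataset before the split, so its vocabulary and IDF weights have already seen the test messages. One possible alternative, sketched here on toy data rather than the notebook's actual approach, is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline`, so TF-IDF is fitted on the training messages only. The `texts` and `labels` values are hypothetical stand-ins for `df["Cleaned_text"]` and `df["Class"]`, and `stratify` is added on the assumption that keeping the spam ratio equal in both splits is desirable for an imbalanced dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 4 spam (1) and 6 ham (0) messages
texts = ["free entry win prize", "call now claim cash", "win free cash now",
         "claim prize call free", "ok see you later", "lunch at noon today",
         "are you home yet", "see you at lunch", "running late sorry",
         "call me when home"]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split the raw text first; stratify keeps the class ratio similar in both parts
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training text only, then trains the classifier
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
model.fit(text_train, y_train)
preds = model.predict(text_test)
print(len(text_train), len(text_test), preds.tolist())
```

The same pipeline object can then be passed unseen raw text directly, since it carries its own fitted vectorizer.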

## <font size="6">Multinomial Naive Bayes</font>

In [266]:
from sklearn.naive_bayes import MultinomialNB
mlm = MultinomialNB()
mlm.fit(X_train, y_train)


In [267]:
mlm_pred = mlm.predict(X_test)

In [268]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score

In [269]:
print("\n🔹 Naïve Bayes Model")
print("Accuracy:", accuracy_score(y_test, mlm_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, mlm_pred))
print("Precision Score:", precision_score(y_test, mlm_pred, pos_label= 1))


🔹 Naïve Bayes Model
Accuracy: 0.9660852713178295
Confusion Matrix:
 [[911   0]
 [ 35  86]]
Precision Score: 1.0
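
Precision alone does not show how much spam slips through. scikit-learn lays out a binary confusion matrix as `[[TN, FP], [FN, TP]]`, so recall and F1 can be derived from the matrix printed above. The sketch below does that arithmetic for the Naïve Bayes numbers:

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
cm = np.array([[911, 0],
               [35, 86]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of actual spam that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

With 35 false negatives out of 121 spam messages, recall is roughly 0.71, so the perfect precision comes at the cost of letting a noticeable share of spam through.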


## <font size="6">Logistic Regression</font>

In [271]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [272]:
lr.fit(X_train, y_train)

In [273]:
lr_pred = lr.predict(X_test)

In [274]:
print("\n🔹 Logistic Regression Model")
print("Accuracy:", accuracy_score(y_test, lr_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, lr_pred))
print("Precision Score:", precision_score(y_test, lr_pred, pos_label= 1))


🔹 Logistic Regression Model
Accuracy: 0.9622093023255814
Confusion Matrix:
 [[911   0]
 [ 39  82]]
Precision Score: 1.0


## <font size="6">Support Vector Machine</font>

In [275]:
from sklearn.svm import SVC

In [276]:
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)

In [277]:
svm_preds = svm_model.predict(X_test)

In [278]:
print("\n🔹 SVM Model")
print("Accuracy:", accuracy_score(y_test, svm_preds))
print("Confusion Matrix:\n", confusion_matrix(y_test, svm_preds))
print("Precision Score:", precision_score(y_test, svm_preds, pos_label= 1))

*The output saved with this cell was computed from `lr_pred` and was identical to the Logistic Regression results above, so it is not repeated here. Re-run the cell to get the SVM metrics.*
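
The three evaluation cells repeat the same print statements, which makes it easy to leave a stale heading or prediction variable behind. One way to reduce that risk, sketched here with a hypothetical `evaluate` helper and synthetic non-negative features standing in for the TF-IDF matrix, is to loop over the models so each name is always paired with its own predictions. Recall is included as an assumption about what is useful to report; the original cells print accuracy, confusion matrix, and precision only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

def evaluate(name, model, X_train, y_train, X_test, y_test):
    """Fit one model and return its test metrics (hypothetical helper)."""
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    return {
        "model": name,
        "accuracy": accuracy_score(y_test, preds),
        "precision": precision_score(y_test, preds, pos_label=1, zero_division=0),
        "recall": recall_score(y_test, preds, pos_label=1, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, preds),
    }

# Synthetic stand-in data: alternating labels, feature 0 shifted up for class 1
rng = np.random.default_rng(0)
y_toy = np.tile([0, 1], 30)
X_toy = rng.random((60, 5))
X_toy[:, 0] += y_toy
X_train, X_test = X_toy[:45], X_toy[45:]
y_train, y_test = y_toy[:45], y_toy[45:]

models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
    "SVM": SVC(kernel="linear"),
}

results = [evaluate(name, model, X_train, y_train, X_test, y_test)
           for name, model in models.items()]

for r in results:
    print(f"\n{r['model']}")
    print("Accuracy:", r["accuracy"])
    print("Confusion Matrix:\n", r["confusion_matrix"])
    print("Precision:", r["precision"], "Recall:", r["recall"])
```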
