<a href="https://colab.research.google.com/github/pavannayak9398/Natural-Language-Processing-Projects/blob/main/SMS_Spam_Detection_using_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Spam Detection using NLP**

**Objective:**

This project classifies SMS messages as Spam or Not Spam (ham) using Natural Language Processing (NLP) techniques. The model relies only on the text of each message to flag unwanted or fraudulent content.

**Techniques Used:**

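1. Dataset: SMS Spam Collection - a widely used dataset for spam detection (5,572 labelled messages in the copy used here).
2. Feature Extraction: TF-IDF Vectorizer - converts each message into a numeric vector whose weights grow with a word's frequency in the message and shrink with its frequency across the corpus.
3. Class Balancing: SMOTE - oversamples the minority (spam) class so that both classes are equally represented.
4. Model: Multinomial Naïve Bayes - a probabilistic classifier known for its efficiency in text classification tasks, especially spam filtering.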

By leveraging TF-IDF for feature representation and Naïve Bayes for classification, this project provides an effective approach for detecting spam messages, which can be used in email filtering, SMS blocking, and other security applications.
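
To make the TF-IDF weighting concrete, here is a minimal standard-library sketch. It assumes the smoothed IDF formula and L2 normalisation that scikit-learn's `TfidfVectorizer` uses by default, but it tokenizes by plain whitespace, so it is an illustration rather than a drop-in replacement. The function name `tfidf` and the toy messages are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        # Guard against an empty document (norm of zero).
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Hypothetical toy messages, not taken from the dataset.
toy_docs = ["free prize claim now", "meeting now", "free lunch now"]
toy_vectors = tfidf(toy_docs)
for doc, vec in zip(toy_docs, toy_vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

In the first toy message, "prize" appears in one document, "free" in two and "now" in all three, so their weights decrease in that order.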

**Import Libraries**

In [None]:
import pandas as pd
import nltk

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

**Download required NLTK resources**

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

**Load and Inspect the Dataset**

In [None]:
df=pd.read_csv('/content/drive/MyDrive/FSDS @Kodi Senapati/Colab files/NLP/Datasets/spam.csv', encoding='latin-1')

df.columns


Index(['v1', 'v2', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], dtype='object')

In [None]:
df.head()

| | v1 | v2 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 |
|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | NaN | NaN | NaN |
| 1 | ham | Ok lar... Joking wif u oni... | NaN | NaN | NaN |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | NaN | NaN | NaN |
| 3 | ham | U dun say so early hor... U c already then say... | NaN | NaN | NaN |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | NaN | NaN | NaN |


In [None]:
# Only columns v1 (label) and v2 (message text) are needed.
# .copy() avoids a SettingWithCopyWarning when columns are modified later.

df=df[['v1','v2']].copy()
df.head()

| | v1 | v2 |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
# Name the columns

df.columns=['Label', 'Message']
df.columns

Index(['Label', 'Message'], dtype='object')

In [None]:
# Convert Labels to numeric

df['Label']=df['Label'].map({'ham':0, 'spam':1})
df.head()

| | Label | Message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


**Text Preprocessing**

In [None]:
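stop_words=set(stopwords.words('english'))
lemmatizer=WordNetLemmatizer()

def preprocess(text):
    # Lowercase the message and split it into tokens
    tokens= word_tokenize(text.lower())
    # Keep purely alphabetic tokens (drops digits and punctuation)
    tokens= [word for word in tokens if word.isalpha()]
    # Remove English stop words
    tokens= [word for word in tokens if word not in stop_words]
    # Lemmatize; without a POS tag, WordNet treats every token as a noun
    tokens= [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

df['Clean_Message']=df['Message'].apply(preprocess)

df.head()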

| | Label | Message | Clean_Message |
|---|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... | go jurong point crazy available bugis n great ... |
| 1 | 0 | Ok lar... Joking wif u oni... | ok lar joking wif u oni |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... | free entry wkly comp win fa cup final tkts may... |
| 3 | 0 | U dun say so early hor... U c already then say... | u dun say early hor u c already say |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... | nah think go usf life around though |


**Check data imbalance**

In [None]:
from collections import Counter

print('Class Distribution:', Counter(df['Label']))

Class Distribution: Counter({0: 4825, 1: 747})
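
The counts above also give the accuracy a classifier would reach by always predicting the majority class, which is a useful reference point for any later accuracy figure. The short sketch below only re-uses the two printed counts; the variable names are illustrative.

```python
# Class counts printed above (0 = ham, 1 = spam).
counts = {0: 4825, 1: 747}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total
imbalance_ratio = counts[0] / counts[1]

print(f"Total messages: {total}")
print(f"Spam share: {counts[1] / total:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
print(f"Ham-to-spam ratio: {imbalance_ratio:.2f}")
```

Because the baseline is already high on imbalanced data, per-class precision and recall are more informative than accuracy alone.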


**Apply SMOTE to balance the classes**

In [None]:
from imblearn.over_sampling import SMOTE


X=df['Clean_Message']
y=df['Label']

# SMOTE interpolates between numeric feature vectors, so the text
# must first be converted to numbers with TF-IDF vectorization.

vectorizer=TfidfVectorizer()
X_vec=vectorizer.fit_transform(X)


# Apply SMOTE to the data
smote=SMOTE(random_state=42)
X_re, y_re=smote.fit_resample(X_vec, y)

print('Resampled Class Distribution:', Counter(y_re))

Resampled Class Distribution: Counter({0: 4825, 1: 4825})
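
For intuition about what SMOTE adds to the data, the sketch below shows its core step in plain Python: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the line between them. This is a simplified illustration, not imbalanced-learn's implementation (which uses `k_neighbors=5` by default and works on sparse matrices); the function name and the toy points are hypothetical.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)
    # Rank the other minority points by squared Euclidean distance to `base`.
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical 2-D minority points; real TF-IDF vectors have thousands of dimensions.
toy_minority = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9], [0.9, 0.1]]
rng = random.Random(42)
synthetic = [smote_like_sample(toy_minority, k=2, rng=rng) for _ in range(3)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Each synthetic point is a convex combination of two existing spam vectors, so it never leaves the region already covered by the minority class.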


**Model Training**

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X_re, y_re, test_size=0.2, random_state=42)

model= MultinomialNB()
model.fit(X_train, y_train)

y_pred=model.predict(X_test)
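
As a rough picture of what `MultinomialNB` learns, the following standard-library sketch trains a multinomial Naïve Bayes classifier with Laplace smoothing (`alpha=1.0`, which is also scikit-learn's default). It is an illustration under simplifying assumptions: it uses raw word counts, whereas the notebook feeds TF-IDF weights to the model, and the helper names and toy messages are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {t for d in docs for t in d.split()}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())
    model = {"log_prior": {}, "log_likelihood": {}}
    for label, n in class_counts.items():
        model["log_prior"][label] = math.log(n / len(labels))
        # Laplace smoothing: add alpha to every term count.
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        model["log_likelihood"][label] = {
            t: math.log((term_counts[label][t] + alpha) / total) for t in vocab
        }
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, prior in model["log_prior"].items():
        ll = model["log_likelihood"][label]
        scores[label] = prior + sum(ll[t] for t in doc.split() if t in ll)
    return max(scores, key=scores.get)

# Hypothetical toy training data (1 = spam, 0 = ham).
toy_docs = ["free prize claim", "win free cash",
            "meeting lunch today", "project meeting tomorrow"]
toy_labels = [1, 1, 0, 0]
toy_model = train_nb(toy_docs, toy_labels)
print(predict_nb(toy_model, "free cash prize"))
print(predict_nb(toy_model, "lunch meeting"))
```

Words seen mostly in spam raise the spam score and words seen mostly in ham raise the ham score; the class with the larger total wins.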


In [None]:
result=pd.DataFrame({'Actual': y_test, 'Predicted':y_pred})
print(result)

      Actual  Predicted
3773       0          0
4053       0          0
4165       0          0
3424       0          0
5391       0          0
...      ...        ...
5688       1          1
9098       1          1
8191       1          1
5630       1          1
1615       0          0

[1930 rows x 2 columns]
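
Row labels such as 9098 and 8191 in the table above are larger than the 5,572 original messages, which means the test set contains SMOTE-generated rows. A common alternative is to split first and oversample only the training part, so the test set holds real messages only; for the same reason the TF-IDF vocabulary is usually fitted on the training part alone. The sketch below illustrates the ordering with plain Python. It assumes a stratified split and a simplified pairwise interpolation in place of SMOTE, and the function name and toy data are hypothetical.

```python
import random

def split_then_oversample(X, y, test_size=0.2, minority_label=1, seed=42):
    """Stratified split first, then balance the training part only."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, lab in enumerate(y) if lab == label]
        rng.shuffle(idx)
        n_test = int(len(idx) * test_size)
        test_idx += idx[:n_test]   # held-out rows stay untouched
        train_idx += idx[n_test:]

    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    minority = [x for x, lab in zip(X_train, y_train) if lab == minority_label]
    # Number of synthetic rows needed to match the majority class.
    n_missing = (len(y_train) - len(minority)) - len(minority)
    for _ in range(max(n_missing, 0)):
        a, b = rng.sample(minority, 2)
        lam = rng.random()
        X_train.append([u + lam * (v - u) for u, v in zip(a, b)])
        y_train.append(minority_label)

    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, y_train, X_test, y_test

# Hypothetical toy data: 40 ham rows and 10 spam rows.
X_toy = [[float(i), 0.0] for i in range(40)] + [[float(i), 1.0] for i in range(10)]
y_toy = [0] * 40 + [1] * 10
X_tr, y_tr, X_te, y_te = split_then_oversample(X_toy, y_toy)
print("train:", y_tr.count(0), "ham /", y_tr.count(1), "spam")
print("test: ", y_te.count(0), "ham /", y_te.count(1), "spam")
```

With this ordering the training set is balanced while the test set keeps the original class proportions, so the reported metrics reflect performance on unmodified data.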


**Model Evaluation**

In [None]:
print('Accuracy:', (accuracy_score(y_test, y_pred))*100)
print("\n Classification Report: \n", classification_report(y_test, y_pred))

Accuracy: 97.35751295336787

 Classification Report: 
               precision    recall  f1-score   support

           0       0.99      0.95      0.97       985
           1       0.95      0.99      0.97       945

    accuracy                           0.97      1930
   macro avg       0.97      0.97      0.97      1930
weighted avg       0.97      0.97      0.97      1930
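
The columns of the classification report follow directly from the confusion-matrix counts. The sketch below shows the formulas for the positive (spam) class; the counts passed in are hypothetical and are not the counts behind the report above.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive class."""
    precision = tp / (tp + fp)   # share of predicted spam that is spam
    recall = tp / (tp + fn)      # share of actual spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

# Hypothetical counts chosen for illustration only.
toy_metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.3f}")
```

For a spam filter, low precision means legitimate messages get blocked, while low recall means spam gets through, so both are worth reading alongside accuracy.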



**Predict on New SMS Messages**

In [None]:
# Function to predict if a message is spam or not
def predict_sms(model, vectorizer, message):
    clean_message=preprocess(message)
    message_vec = vectorizer.transform([clean_message])  # Convert text to numerical
    prediction = model.predict(message_vec)  # Predict using trained model
    return "Spam" if prediction[0] == 1 else "Not Spam"

In [None]:
# SMS 1

message=input('Enter the SMS: ')
result=predict_sms(model, vectorizer, message)
print(f"Prediction: {result}")

Enter the SMS: Congratulations! You've won a free iPhone. Click here to claim now.
Prediction: Spam


In [None]:
# SMS 2

message=input('Enter the SMS: ')
result=predict_sms(model, vectorizer, message)
print(f"Prediction: {result}")

Enter the SMS: Hey, are we still meeting for coffee at 5 PM?
Prediction: Not Spam


In [None]:
msgs=[
    "Congratulations! You have won a free iPhone. Click here to claim your prize now!",  # SPAM
    "Hey, are we still meeting for lunch today?",  # NOT SPAM
    "URGENT! Your bank account has been compromised. Call this number immediately to secure your funds.",  # SPAM
    "Can you please send me the project files by tonight?",  # NOT SPAM
    "You have been selected for a $500 Amazon gift card! Claim now before it expires.",  # SPAM
    "Reminder: Your electricity bill is due tomorrow. Please pay to avoid disconnection.",  # NOT SPAM
    "Win a brand new car! Just answer a simple question and claim your reward now!",  # SPAM
    "Don't forget to submit your assignment before the deadline!",  # NOT SPAM
    "Limited-time offer! Buy 1 get 1 free on all pizzas. Order now!",  # SPAM
    "Mom is asking if you will be home for dinner tonight.",  # NOT SPAM
]


for msg in msgs:
    result=predict_sms(model, vectorizer, msg)
    print(f"{msg}: {result}")


Congratulations! You have won a free iPhone. Click here to claim your prize now!: Spam
Hey, are we still meeting for lunch today?: Not Spam
URGENT! Your bank account has been compromised. Call this number immediately to secure your funds.: Spam
Can you please send me the project files by tonight?: Not Spam
You have been selected for a $500 Amazon gift card! Claim now before it expires.: Spam
Reminder: Your electricity bill is due tomorrow. Please pay to avoid disconnection.: Not Spam
Win a brand new car! Just answer a simple question and claim your reward now!: Spam
Don't forget to submit your assignment before the deadline!: Not Spam
Limited-time offer! Buy 1 get 1 free on all pizzas. Order now!: Spam
Mom is asking if you will be home for dinner tonight.: Not Spam
