<h1>SPAM vs HAM Message Classification using NLP</h1>

<img src="https://media-exp1.licdn.com/dms/image/C4D12AQGcO7K2z4FRAQ/article-cover_image-shrink_600_2000/0/1626909750548?e=2147483647&v=beta&t=8ItTeGUTBm1054CLh17YnZ2fWM0WWkp7a_962aaV00Y" style="width:100%;margin:auto;">

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from textblob import TextBlob
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

<h2>Data Loading</h2>

In [None]:
data = pd.read_csv("../input/sms-spam-collection-dataset/spam.csv",encoding="ISO-8859-1")
data

<h2>Data Engineering</h2>

In [None]:
data.isna().sum()

<b>Drop unnecessary columns</b>

In [None]:
data.drop(["Unnamed: 2","Unnamed: 3","Unnamed: 4"],axis=1,inplace=True)

<b>Rename columns</b>

In [None]:
data.rename(columns={'v1' : 'Class','v2' : 'Text'},inplace=True)

<b>Sentiment Identification</b>

In [None]:
sentiment = []
for text in data.Text.values:
    res = TextBlob(text).sentiment.polarity
    if res < 0:
        sentiment.append("Negative")
    elif res == 0:
        sentiment.append("Neutral")
    else:
        sentiment.append("Positive")
data["Sentiment"] = sentiment

<h2>Data Exploration</h2>

<h3>Class Balance</h3>

In [None]:
un, count = np.unique(data.Class, return_counts=True)
plt.bar(un, count)
plt.xlabel("Class")
plt.ylabel("Count")
plt.title("Class Balance")
plt.show()

<h3>Sentiment Distribution</h3>

In [None]:
un, count = np.unique(data.Sentiment, return_counts=True)
plt.bar(un, count)
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.title("Sentiment Distribution")
plt.show()

<h2>Data Modelling</h2>

<b>Vectorizing</b>

In [None]:
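enc = LabelEncoder()
data.Class = enc.fit_transform(data.Class.values)  # ham -> 0, spam -> 1
# Split the raw text first so the vectorizer never sees the test messages
text_train, text_test, Y_train, Y_test = train_test_split(data['Text'].values,
                                                          data['Class'].values,
                                                          test_size=0.2,
                                                          random_state=42)
vec = TfidfVectorizer()
X_train = vec.fit_transform(text_train)  # learn vocabulary and IDF weights on training text only
X_test = vec.transform(text_test)        # sparse output, which RandomForestClassifier accepts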

In [None]:
model = RandomForestClassifier(n_estimators=100, random_state=42)  # fixed seed for reproducible results
model.fit(X_train,Y_train)

<b>Metrics</b>

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score
# scikit-learn metrics expect (y_true, y_pred); predict once and reuse the results
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)
# Scores are fractions in [0, 1], so scale by 100 before printing them as percentages
print("Train Accuracy : {:.2f} %".format(100 * accuracy_score(Y_train, train_pred)))
print("Test Accuracy  : {:.2f} %".format(100 * accuracy_score(Y_test, test_pred)))
print("Test Precision : {:.2f} %".format(100 * precision_score(Y_test, test_pred)))
print("Test Recall    : {:.2f} %".format(100 * recall_score(Y_test, test_pred)))
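
Precision and recall are not symmetric in their arguments: scikit-learn's metric functions take the true labels first and the predictions second. The sketch below is a hand-rolled illustration on invented toy labels (the helper `precision_recall` and both label lists are hypothetical, not part of the notebook) showing that swapping the two arguments exchanges precision and recall.

```python
# Toy labels invented for illustration only (1 = spam, 0 = ham).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]


def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # spam caught
    fp = sum(t != positive and p == positive for t, p in pairs)  # ham flagged as spam
    fn = sum(t == positive and p != positive for t, p in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Correct order: true labels first, predictions second.
precision, recall = precision_recall(y_true, y_pred)

# Swapped order: the two numbers trade places.
swapped_precision, swapped_recall = precision_recall(y_pred, y_true)

print("precision = {:.2f}, recall = {:.2f}".format(precision, recall))
print("swapped   = {:.2f}, {:.2f}".format(swapped_precision, swapped_recall))
```

Here precision is the share of messages flagged as spam that really are spam, and recall is the share of actual spam that was caught.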

<b>Confusion Matrix</b>

In [None]:
# confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
cf = confusion_matrix(Y_test, model.predict(X_test), labels=[0,1])
disp = ConfusionMatrixDisplay(confusion_matrix = cf, display_labels = ["ham","spam"])
disp.plot()
plt.title("Confusion Matrix")
plt.show()

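<b>Inference : </b>The classes are heavily imbalanced, with far more ham than spam messages, so accuracy alone gives an overly optimistic picture and precision and recall for the spam class can differ substantially. Future work can focus on improving spam precision and recall and on better preprocessing of the text.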
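
One common way to approach the imbalance, which this notebook does not use, is to weight each class inversely to its frequency. The sketch below computes such weights by hand with the formula `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`. The helper `balanced_class_weights` and the 87/13 label split are hypothetical, chosen only for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution, not the real counts from the dataset.
toy_labels = ["ham"] * 87 + ["spam"] * 13
weights = balanced_class_weights(toy_labels)

for cls, weight in sorted(weights.items()):
    print("{:<4} weight = {:.3f}".format(cls, weight))
```

With scikit-learn, the same idea could be tried by passing `class_weight="balanced"` to `RandomForestClassifier`; whether it actually improves spam recall on this dataset would need to be checked.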
