# Spam Detection in Messages
This dataset is intended for building and evaluating machine learning models that detect spam in text messages. It contains SMS messages labeled as either spam or ham (not spam). The goal is to classify each incoming message as spam or ham based on its content.

### Features:

#### 1) Category (Target Variable):

Indicates the label assigned to the message.

Values:

- "spam": The message is considered unwanted or promotional.
- "ham": The message is legitimate and not spam.

#### 2) Message (Input Feature):

The raw content of the SMS text message.

This is the main input feature. Before training, it is converted into numeric vectors using techniques such as Bag of Words, TF-IDF, or embeddings.

Examples:

"Congratulations! You've won a $1000 Walmart gift card. Go to http://bit.ly/123456 to claim now."

"Are we still meeting at 6 today?"
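To make the Bag of Words idea concrete, here is a minimal sketch using only the standard library. It is an illustration, not the vectorizer used later in this notebook: the `tokenize` and `bag_of_words` helpers are hypothetical names, and the tokenization rule (lowercase alphabetic runs) is a simplifying assumption.

```python
from collections import Counter
import re

messages = [
    "Congratulations! You've won a $1000 Walmart gift card. Go to http://bit.ly/123456 to claim now.",
    "Are we still meeting at 6 today?",
]

def tokenize(text):
    # Lowercase the text and keep alphabetic runs only (simplified rule).
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary: every distinct token across all messages, in sorted order.
vocabulary = sorted({tok for msg in messages for tok in tokenize(msg)})

def bag_of_words(text):
    # One count per vocabulary entry; words absent from the text count as 0.
    counts = Counter(tokenize(text))
    return [counts[tok] for tok in vocabulary]

vectors = [bag_of_words(msg) for msg in messages]
for msg, vec in zip(messages, vectors):
    print(sum(vec), "tokens counted for:", msg[:30])
```

Each message becomes a fixed-length vector of word counts, so word order is discarded and only word frequency is kept.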


In [1]:
# importing basic packages
import pandas as pd 
import numpy as np
import pickle

In [2]:
# loading data 
data=pd.read_csv("spam.csv")

In [3]:
# first five rows of data 
data.head()

|   | Category | Message |
|---|----------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [4]:
# total records and features in dataset
print("Total records in dataset:",data.shape[0])
print("Total features in dataset:",data.shape[1])

Total records in dataset: 5572
Total features in dataset: 2


In [5]:
# dataset information
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Category  5572 non-null   object
 1   Message   5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [6]:
# checking for null values
data.isnull().sum()

Category    0
Message     0
dtype: int64

In [7]:
# checking duplicated data
data.duplicated().sum()

415

In [8]:
# dropping duplicated data
data.drop_duplicates(inplace=True)

In [9]:
# checking target variable frequency
data["Category"].value_counts()

Category
ham     4516
spam     641
Name: count, dtype: int64
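The class counts above can be turned into an imbalance figure with a little arithmetic. The sketch below simply hard-codes the two counts printed by `value_counts()`; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
ham_count = 4516
spam_count = 641

total = ham_count + spam_count        # rows left after dropping duplicates
spam_share = spam_count / total       # fraction of messages that are spam
ham_to_spam = ham_count / spam_count  # ham messages per spam message

print(f"total messages: {total}")
print(f"spam share: {spam_share:.1%}")
print(f"ham per spam message: {ham_to_spam:.2f}")
```

The total of 5157 agrees with the 5572 original rows minus the 415 duplicates. Spam makes up roughly one message in eight, so accuracy alone is a weak measure of model quality here.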

In [10]:
# importing NLP packages to preprocess the text data
import nltk
import re
nltk.download("stopwords")
nltk.download("wordnet")  # corpus required by WordNetLemmatizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to C:\Users\Mayur
[nltk_data]     kadam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
le=WordNetLemmatizer()

In [12]:
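# create a function for data preprocessing
def preprocessing(data):
    # keep letters only, then lowercase and split on whitespace
    clean_data=re.sub("[^a-zA-Z]"," ",data)
    lower_data=clean_data.lower()
    tokenize=lower_data.split()
    # load the stop-word list once per call instead of once per token
    stop_words=set(stopwords.words("english"))
    # drop stop words and lemmatize the remaining tokens
    lemmatized=[le.lemmatize(word) for word in tokenize if word not in stop_words]
    processed_data=" ".join(lemmatized)
    return processed_data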

In [13]:
# text data before preprocessing
data["Message"][1]

'Ok lar... Joking wif u oni...'

In [14]:
# apply data preprocessing on "Message"
data["Message"]=data["Message"].apply(preprocessing)

In [15]:
# text data after preprocessing
data["Message"][1]

'ok lar joking wif u oni'
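The cleaning steps can be traced by hand on this message. The sketch below reproduces only the regex, lowercasing, and whitespace-split stages with the standard library. Stop-word removal and lemmatization are left out on purpose, since the output above shows they leave this particular message unchanged.

```python
import re

raw = "Ok lar... Joking wif u oni..."

# Step 1: replace every non-letter character with a space.
letters_only = re.sub("[^a-zA-Z]", " ", raw)

# Step 2: lowercase, then split on whitespace (this also collapses repeated spaces).
tokens = letters_only.lower().split()

# Step 3: join the tokens back into a single string.
cleaned = " ".join(tokens)
print(cleaned)  # ok lar joking wif u oni
```

Note that digits, punctuation, and URLs are all reduced to spaces or letter fragments by the first step, so signals such as phone numbers or currency symbols are lost.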

In [16]:
# encoding target variable
from sklearn.preprocessing import LabelEncoder 
encoder=LabelEncoder()
data["Category"]=encoder.fit_transform(data['Category'])
data.head()

|   | Category | Message |
|---|----------|---------|
| 0 | 0 | go jurong point crazy available bugis n great ... |
| 1 | 0 | ok lar joking wif u oni |
| 2 | 1 | free entry wkly comp win fa cup final tkts st ... |
| 3 | 0 | u dun say early hor u c already say |
| 4 | 0 | nah think go usf life around though |


In [17]:
# divide data into input and target feature
x=data.iloc[:,1:2]
y=data.iloc[:,:1]

In [18]:
# splitting dataset for train and test
from sklearn.model_selection import train_test_split 
train_x,test_x,train_y,test_y=train_test_split(x,y,test_size=0.2)

In [19]:
# convert text data into vector by using bag of words
from sklearn.feature_extraction.text import CountVectorizer 
cv=CountVectorizer(max_features=1000,ngram_range=(1,2))

In [20]:
bow_train_x=cv.fit_transform(train_x["Message"]).toarray()
bow_test_x=cv.transform(test_x["Message"]).toarray()
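With `ngram_range=(1,2)`, the vocabulary contains both single words and pairs of adjacent words. The sketch below approximates that in plain Python. The `word_ngrams` helper is a hypothetical name, and the token rule is assumed to mirror `CountVectorizer`'s default pattern, which keeps only tokens of two or more characters. The `max_features` cap is not modeled.

```python
import re

def word_ngrams(text, ngram_range=(1, 2)):
    # Approximation of the default word analyzer: tokens of 2+ word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

grams = word_ngrams("ok lar joking wif u oni")
print(grams)
```

Because the single-character token "u" is dropped before n-grams are formed, "wif" and "oni" end up adjacent and produce the bigram "wif oni".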

In [21]:
# input data array
bow_train_x

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [22]:
# the dataset is highly imbalanced, so apply oversampling to the training set only
from imblearn.over_sampling import SMOTE 
sampling=SMOTE()
train_x_resample,train_y_resample=sampling.fit_resample(bow_train_x,train_y)
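As a rough illustration of what SMOTE does, the sketch below builds one synthetic sample by interpolating between a minority-class sample and a neighbor. It is a simplified illustration under stated assumptions, not imblearn's implementation: the `interpolate` helper and the two count vectors are hypothetical, and the nearest-neighbor search that SMOTE performs is skipped.

```python
import random

def interpolate(sample, neighbor, rng):
    # The new point lies on the line segment between the sample and its neighbor.
    lam = rng.random()  # random factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(sample, neighbor)]

rng = random.Random(0)   # seeded so the sketch is repeatable
spam_a = [2, 0, 1, 0]    # hypothetical word-count vector of a spam message
spam_b = [0, 0, 3, 1]    # hypothetical neighboring spam message

synthetic = interpolate(spam_a, spam_b, rng)
print(synthetic)
```

Synthetic rows produced this way generally hold fractional values rather than integer counts. scikit-learn's documentation notes that `MultinomialNB` may also work with fractional counts in practice, which is why the resampled matrix can still be passed to it.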

In [23]:
# train the model using MultinomialNB
from sklearn.naive_bayes import MultinomialNB
nb=MultinomialNB()
# flatten the target to 1-D to avoid scikit-learn's column-vector warning
nb.fit(train_x_resample,np.ravel(train_y_resample))


In [24]:
# importing metrics to evaluate the model
from sklearn.metrics import accuracy_score,confusion_matrix,precision_score,recall_score,f1_score

In [25]:
# helper function that prints the evaluation metrics
def get_metrics(actual,predicted): 
    print("accuracy score:",accuracy_score(actual,predicted))
    print("confusion matrix: \n",confusion_matrix(actual,predicted))
    print("precision score:",precision_score(actual,predicted))
    print("recall score:",recall_score(actual,predicted))
    print("f1 score:",f1_score(actual,predicted))

In [26]:
# prediction on train and test
train_pred=nb.predict(bow_train_x)
test_pred=nb.predict(bow_test_x)

In [27]:
# model evaluation
print("training data:")
get_metrics(train_y,train_pred)
print("--------------------------------------------------")
print("testing data:")
get_metrics(test_y,test_pred)

training data:
accuracy score: 0.9660606060606061
confusion matrix: 
 [[3529   87]
 [  53  456]]
precision score: 0.8397790055248618
recall score: 0.8958742632612967
f1 score: 0.8669201520912547
--------------------------------------------------
testing data:
accuracy score: 0.9709302325581395
confusion matrix: 
 [[884  16]
 [ 14 118]]
precision score: 0.8805970149253731
recall score: 0.8939393939393939
f1 score: 0.8872180451127819
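The test-set scores can be checked by hand from the confusion matrix. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and that spam is encoded as 1, so the matrix reads `[[TN, FP], [FN, TP]]`.

```python
# Test-set confusion matrix from the output above.
tn, fp = 884, 16   # actual ham: predicted ham / predicted spam
fn, tp = 14, 118   # actual spam: predicted ham / predicted spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print("accuracy score:", accuracy)
print("precision score:", precision)
print("recall score:", recall)
print("f1 score:", f1)
```

In practical terms, 16 legitimate messages were flagged as spam and 14 spam messages were missed out of 1032 test messages.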


In [28]:
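# saving the model, the vectorizer, and the preprocessing function
with open("final_model.pkl","wb") as f:
    pickle.dump(nb,f)

with open("bag_of_word.pkl","wb") as f:
    pickle.dump(cv,f)

# note: pickle stores a function by reference (module and name), not its code,
# so preprocessing() must be defined or importable wherever this file is loaded
with open("data_processing.pkl","wb") as f:
    pickle.dump(preprocessing,f)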