# SPAM SMS DETECTION

### Problem Statement:
Build an AI model that can classify SMS messages as spam or legitimate. Use techniques like TF-IDF or word embeddings with
classifiers like Naive Bayes, Logistic Regression, or Support Vector Machines to identify spam messages.

### Domain Analysis:
Spam SMS detection involves using algorithms and techniques to identify and filter out unwanted or unsolicited text messages. This process typically includes analyzing message content, sender information, and behavioral patterns to distinguish between legitimate and spam messages. Common methods include keyword analysis, machine learning models, and heuristics to automatically flag or block spam SMS, providing users with a cleaner and more secure messaging experience.

In [1]:
# Import the required libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import string
import matplotlib.pyplot as plt

### Data processing 

In [3]:
import chardet

with open('C:/Users/pares/OneDrive/Documents/codsoft_internship/spam_detection/spam.csv', 'rb') as f:
    result = chardet.detect(f.read())

df = pd.read_csv('C:/Users/pares/OneDrive/Documents/codsoft_internship/spam_detection/spam.csv', encoding=result['encoding'])
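If `chardet` is not available, one simple fallback is to try a short list of candidate encodings in order. The sketch below is only an illustration and is not what this notebook ran: the helper name `decode_with_fallback` and the candidate list are assumptions, and `latin-1` is placed last because it accepts any byte sequence.

```python
def decode_with_fallback(raw_bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    raise ValueError("none of the candidate encodings could decode the data")


# b"\xa3" is the pound sign in cp1252 but is not valid UTF-8 on its own,
# so UTF-8 is rejected and cp1252 is chosen.
sample = b"WIN \xa3100 cash"
text, used = decode_with_fallback(sample)
```

The encoding found this way could then be passed to `pd.read_csv(..., encoding=used)` in the same way as the `chardet` result above.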

In [4]:
df

,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,
...,...,...,...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...,,,
5568,ham,Will Ì_ b going to esplanade fr home?,,,
5569,ham,"Pity, * was in mood for that. So...any other s...",,,
5570,ham,The guy did some bitching but I acted like i'd...,,,


In [5]:
# keep only the label (v1) and message text (v2) columns
data=df[['v1','v2']]

In [6]:
data

,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [7]:
# Summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   v1      5572 non-null   object
 1   v2      5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


In [8]:
# create a column to keep the count of the characters present in each record
data['Length'] = data['v2'].apply(len)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Length'] = data['v2'].apply(len)
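The warning above appears because `data` was created by selecting columns from `df`, so pandas cannot tell whether the assignment is also meant to affect `df`. A common way to avoid it is to take an explicit copy before adding columns. The sketch below uses a small hypothetical frame (the names `raw` and `subset`, the `extra` column, and the rows are made up for illustration), not the notebook's dataset.

```python
import pandas as pd

# Hypothetical stand-in for the raw frame loaded from spam.csv.
raw = pd.DataFrame({
    "v1": ["ham", "spam"],
    "v2": ["Ok", "Free entry"],
    "extra": [None, None],
})

# .copy() makes an independent frame, so adding a column is unambiguous
# and the original frame is left untouched.
subset = raw[["v1", "v2"]].copy()
subset["Length"] = subset["v2"].str.len()
```

In this notebook the equivalent change would be to create `data` with `df[['v1','v2']].copy()`.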


In [9]:
data['Length']

0       111
1        29
2       155
3        49
4        61
       ... 
5567    161
5568     37
5569     57
5570    125
5571     26
Name: Length, Length: 5572, dtype: int64

In [10]:
# view the dataset with the column 'Length', which contains the number of characters in each message
data.head(10)

,v1,v2,Length
0,ham,"Go until jurong point, crazy.. Available only ...",111
1,ham,Ok lar... Joking wif u oni...,29
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,ham,U dun say so early hor... U c already then say...,49
4,ham,"Nah I don't think he goes to usf, he lives aro...",61
5,spam,FreeMsg Hey there darling it's been 3 week's n...,148
6,ham,Even my brother is not like to speak with me. ...,77
7,ham,As per your request 'Melle Melle (Oru Minnamin...,160
8,spam,WINNER!! As a valued network customer you have...,158
9,spam,Had your mobile 11 months or more? U R entitle...,154


In [11]:
## The messages are categorised into 2 classes, i.e. spam and ham.
# Let's see the count of each class
data.groupby('v1').count()

v1,v2,Length
ham,4825,4825
spam,747,747
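The counts above show that the two classes are imbalanced, which matters when reading accuracy figures later on. As a rough sketch, the arithmetic below works out the share of spam and the accuracy a trivial "always ham" classifier would reach; the variable names are illustrative only.

```python
# Class counts taken from the groupby output above.
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# Fraction of all messages that are spam.
spam_share = counts["spam"] / total

# Accuracy of a trivial classifier that labels every message as ham.
majority_baseline = counts["ham"] / total

print(f"spam share: {spam_share:.1%}")
print(f"always-ham accuracy: {majority_baseline:.1%}")
```

Spam makes up about 13.4% of the messages, so a model has to beat roughly 86.6% accuracy before it is doing better than ignoring the text altogether.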


### Data Visualization

In [12]:
data['Length'].describe() # summary statistics, including the maximum message length

count    5572.000000
mean       80.118808
std        59.690841
min         2.000000
25%        36.000000
50%        61.000000
75%       121.000000
max       910.000000
Name: Length, dtype: float64

In [13]:
data['Length']==910

0       False
1       False
2       False
3       False
4       False
        ...  
5567    False
5568    False
5569    False
5570    False
5571    False
Name: Length, Length: 5572, dtype: bool

In [14]:
# the message that has the max characters
data[data['Length']==910]['v2']

1084    For me the love should start with attraction.i...
Name: v2, dtype: object

In [15]:
# view the message that has 910 characters in it
data[data['Length']==910]['v2'].iloc[0]

"For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."

In [16]:
# View the message that has min characters
data[data['Length']==2]['v2'].iloc[0]

'Ok'

### Text Pre-Processing

In [17]:
# creating an object for the target values
dObject = data['v1'].values
dObject

array(['ham', 'ham', 'spam', ..., 'ham', 'ham', 'ham'], dtype=object)

In [18]:
# Let's assign ham as 1
data.loc[data['v1']=="ham","v1"] = 1

In [19]:
# Let's assign spam as 0
data.loc[data['v1']=="spam","v1"] = 0

In [20]:
dObject2=data['v1'].values
dObject2

array([1, 1, 0, ..., 1, 1, 1], dtype=object)
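The output above still has `dtype=object` because the values were replaced inside a column of strings. One alternative, sketched here on a small made-up Series rather than the notebook's data, is to map the labels through a dictionary, which yields an integer column in one step.

```python
import pandas as pd

# Hypothetical labels standing in for data['v1'].
labels = pd.Series(["ham", "ham", "spam", "ham"])

# Same encoding as in this notebook: ham -> 1, spam -> 0.
encoded = labels.map({"ham": 1, "spam": 0})
```

With this approach the later `astype('int')` conversion would not be needed, assuming every label is either `ham` or `spam`.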

In [21]:
data.head(8)

,v1,v2,Length
0,1,"Go until jurong point, crazy.. Available only ...",111
1,1,Ok lar... Joking wif u oni...,29
2,0,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,1,U dun say so early hor... U c already then say...,49
4,1,"Nah I don't think he goes to usf, he lives aro...",61
5,0,FreeMsg Hey there darling it's been 3 week's n...,148
6,1,Even my brother is not like to speak with me. ...,77
7,1,As per your request 'Melle Melle (Oru Minnamin...,160


#### First, we remove the punctuation from each message

In [22]:
# the default set of punctuation characters
import string

string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [23]:
# Let's remove the punctuation

def remove_punct(text):
    text = "".join([char for char in text if char not in string.punctuation])
    return text

data['text_clean'] = data['v2'].apply(lambda x: remove_punct(x))

data.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['text_clean'] = data['v2'].apply(lambda x: remove_punct(x))


,v1,v2,Length,text_clean
0,1,"Go until jurong point, crazy.. Available only ...",111,Go until jurong point crazy Available only in ...
1,1,Ok lar... Joking wif u oni...,29,Ok lar Joking wif u oni
2,0,Free entry in 2 a wkly comp to win FA Cup fina...,155,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...,49,U dun say so early hor U c already then say
4,1,"Nah I don't think he goes to usf, he lives aro...",61,Nah I dont think he goes to usf he lives aroun...
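The character-by-character filter used in `remove_punct` can also be written with a translation table, which does the same deletion in a single pass. This is only a sketch of an equivalent approach; the names `PUNCT_TABLE` and `remove_punct_fast` are hypothetical.

```python
import string

# Translation table that deletes every character in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_punct_fast(text):
    """Strip ASCII punctuation from a message in one pass."""
    return text.translate(PUNCT_TABLE)


cleaned = remove_punct_fast("Ok lar... Joking wif u oni...")
```

Like the original function, this only removes the ASCII characters listed in `string.punctuation`, so other symbols in the messages are left in place.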


In [24]:
# original text and cleaned text
data.head(8)

,v1,v2,Length,text_clean
0,1,"Go until jurong point, crazy.. Available only ...",111,Go until jurong point crazy Available only in ...
1,1,Ok lar... Joking wif u oni...,29,Ok lar Joking wif u oni
2,0,Free entry in 2 a wkly comp to win FA Cup fina...,155,Free entry in 2 a wkly comp to win FA Cup fina...
3,1,U dun say so early hor... U c already then say...,49,U dun say so early hor U c already then say
4,1,"Nah I don't think he goes to usf, he lives aro...",61,Nah I dont think he goes to usf he lives aroun...
5,0,FreeMsg Hey there darling it's been 3 week's n...,148,FreeMsg Hey there darling its been 3 weeks now...
6,1,Even my brother is not like to speak with me. ...,77,Even my brother is not like to speak with me T...
7,1,As per your request 'Melle Melle (Oru Minnamin...,160,As per your request Melle Melle Oru Minnaminun...


## TF-IDF

In [25]:
# Splitting x and y

X = data['text_clean'].values
y = data['v1'].values
y

array([1, 1, 0, ..., 1, 1, 1], dtype=object)

In [26]:
# Datatype for y is object. Let's convert it into int
y = y.astype('int')
y

array([1, 1, 0, ..., 1, 1, 1])

In [27]:
type(X)

numpy.ndarray

In [28]:
## text preprocessing and feature vectorizer
# To extract features from a document of words, we import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


tf=TfidfVectorizer() ## object creation
X=tf.fit_transform(X) ## fitting and transforming the data into vectors


In [29]:
X.shape

(5572, 9489)
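To make the numbers inside this matrix less opaque, here is a small pure-Python sketch of a TF-IDF computation. It follows what are, to the best of my understanding, the `TfidfVectorizer` defaults (lowercasing, tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows). The function name `tfidf_rows` and the three toy messages are assumptions for illustration.

```python
import math
import re
from collections import Counter


def tfidf_rows(documents):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    # Lowercase and keep tokens made of two or more word characters.
    tokenised = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenised for token in tokens})
    n_docs = len(documents)

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(token for tokens in tokenised for token in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for tokens in tokenised:
        term_counts = Counter(tokens)
        weights = [term_counts[t] * idf[t] for t in vocabulary]
        # Scale each row to unit length; guard against an empty message.
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows


docs = ["free entry to win", "ok see you at home", "win free cash now"]
vocab, rows = tfidf_rows(docs)
```

Each row has one weight per vocabulary term, which is why the real matrix has as many columns as there are distinct tokens. Terms that occur in fewer messages (such as `entry` here) get a larger weight than terms shared across messages (such as `free`).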

In [30]:
## print feature names selected from the raw documents
tf.get_feature_names_out()

array(['008704050406', '0089my', '0121', ..., 'ûïharry', 'ûò', 'ûówell'],
      dtype=object)

In [31]:
## number of features created
len(tf.get_feature_names_out())

9489

In [32]:
X

<5572x9489 sparse matrix of type '<class 'numpy.float64'>'
	with 72459 stored elements in Compressed Sparse Row format>

In [33]:
## getting the feature vectors
X=X.toarray()

In [34]:
## Creating the training and test sets
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=6)
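Note that in this notebook the `TfidfVectorizer` was fitted on all messages before the split, so the test messages also contribute to the vocabulary and IDF weights. A possible alternative, sketched below on a handful of made-up messages, is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training part only. The toy data, the `stratify` argument and the `test_size` value are assumptions, not settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy messages standing in for data['text_clean'] (hypothetical examples).
texts = [
    "win a free prize now", "free entry to win cash",
    "claim your free reward", "urgent call now to win",
    "are we meeting for lunch", "see you at home later",
    "ok will call you tonight", "thanks for the lift today",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = spam, 1 = ham, as in this notebook

# Split the raw text first; stratify keeps the spam/ham ratio in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=6
)

# The pipeline fits the TF-IDF vocabulary on the training messages only,
# then applies the same transformation when predicting.
model = make_pipeline(TfidfVectorizer(), BernoulliNB(alpha=0.01))
model.fit(text_train, y_train)
predictions = model.predict(text_test)
```

Because the pipeline accepts raw strings, the same `model` object can be passed new messages directly without a separate vectorizing step.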

### Creating various models

### Naive Bayes

In [35]:
## Model creation
from sklearn.naive_bayes import BernoulliNB

## model object creation
nb=BernoulliNB(alpha=0.01) 

## fitting the model
nb.fit(X_train,y_train)

## getting the prediction
y_hat=nb.predict(X_test) 

In [36]:
y_hat

array([1, 1, 1, ..., 1, 1, 1])

#### Evaluating the model

In [37]:

from sklearn.metrics import classification_report,confusion_matrix

In [38]:
print(classification_report(y_test,y_hat))

              precision    recall  f1-score   support

           0       0.98      0.93      0.96       201
           1       0.99      1.00      0.99      1192

    accuracy                           0.99      1393
   macro avg       0.99      0.96      0.97      1393
weighted avg       0.99      0.99      0.99      1393



In [39]:
## confusion matrix
pd.crosstab(y_test,y_hat)

actual \ predicted,0,1
0,187,14
1,3,1189
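The figures in the classification report can be recomputed by hand from this confusion matrix, where rows are the actual labels (`y_test`) and columns are the predictions (`y_hat`). The sketch below does so for the spam class (label 0); the variable names are illustrative.

```python
# Counts from the Naive Bayes confusion matrix above
# (rows: actual label, columns: predicted label; 0 = spam, 1 = ham).
spam_as_spam, spam_as_ham = 187, 14
ham_as_spam, ham_as_ham = 3, 1189

# Of the messages predicted as spam, how many really were spam?
precision_spam = spam_as_spam / (spam_as_spam + ham_as_spam)

# Of the actual spam messages, how many were caught?
recall_spam = spam_as_spam / (spam_as_spam + spam_as_ham)

# Harmonic mean of precision and recall.
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

# Share of all test messages that were classified correctly.
total = spam_as_spam + spam_as_ham + ham_as_spam + ham_as_ham
accuracy = (spam_as_spam + ham_as_ham) / total
```

Rounded to two decimals these give 0.98, 0.93, 0.96 and 0.99, matching the report. The 14 spam messages that were predicted as ham are the ones that lower the spam recall.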


### Logistic Regression

In [40]:
#Model creation
from sklearn.linear_model import LogisticRegression
clf=LogisticRegression()
clf.fit(X_train,y_train)  ## training

In [41]:
#Prediction
y_pred=clf.predict(X_test)

In [42]:
y_pred

array([1, 1, 1, ..., 1, 1, 1])

In [43]:
y_pred_prob=clf.predict_proba(X_test)

In [44]:
y_pred_prob

array([[0.1008965 , 0.8991035 ],
       [0.05973893, 0.94026107],
       [0.04643641, 0.95356359],
       ...,
       [0.27610511, 0.72389489],
       [0.067371  , 0.932629  ],
       [0.03038311, 0.96961689]])
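Each row above holds the probability of class 0 (spam) followed by the probability of class 1 (ham). For a binary model, `predict` picks the more probable class, which amounts to a 0.5 threshold. The sketch below shows how a different threshold on the spam probability would change the labels for the six rows displayed; the `classify` helper and the 0.25 threshold are hypothetical and were not used in this notebook.

```python
# P(spam) is the first column of the predict_proba output shown above,
# because class 0 (spam) sorts before class 1 (ham).
spam_probabilities = [
    0.1008965, 0.05973893, 0.04643641,
    0.27610511, 0.067371, 0.03038311,
]


def classify(p_spam, threshold=0.5):
    """Return 0 (spam) when P(spam) reaches the threshold, else 1 (ham)."""
    return 0 if p_spam >= threshold else 1


default_labels = [classify(p) for p in spam_probabilities]
stricter_labels = [classify(p, threshold=0.25) for p in spam_probabilities]
```

With the default threshold all six messages are labelled ham, while a threshold of 0.25 flags the fourth one as spam. Lowering the threshold generally raises recall on the spam class at the cost of more ham messages being flagged.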

#### Evaluating the model

In [45]:
cr=classification_report(y_test,y_pred)
print(cr)

              precision    recall  f1-score   support

           0       0.99      0.66      0.79       201
           1       0.95      1.00      0.97      1192

    accuracy                           0.95      1393
   macro avg       0.97      0.83      0.88      1393
weighted avg       0.95      0.95      0.95      1393



### Support Vector Classifier Model

In [46]:
from sklearn.svm import SVC
svclassifier = SVC() ## base model with default parameters
svclassifier.fit(X_train, y_train)

In [47]:
# Predict output for X_test

y_hat=svclassifier.predict(X_test)

#### Evaluating the model

In [48]:
print(classification_report(y_test,y_hat))

              precision    recall  f1-score   support

           0       0.99      0.79      0.88       201
           1       0.97      1.00      0.98      1192

    accuracy                           0.97      1393
   macro avg       0.98      0.89      0.93      1393
weighted avg       0.97      0.97      0.97      1393



## Model Comparison report:

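| Model | Accuracy |
|---|---|
| Naive Bayes | 99% |
| Logistic Regression | 95% |
| Support Vector Classifier | 97% |

Among the models applied, Naive Bayes performed best, with the highest accuracy (99%) and the highest recall on the spam class (0.93, compared with 0.79 for the Support Vector Classifier and 0.66 for Logistic Regression).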

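## Challenges faced:
The main challenge was that the CSV file could not be read at first because of a UTF-8 Unicode decoding error. This was fixed by detecting the file's encoding with `chardet` and passing the result to `pd.read_csv`, after which the data could be used for the project.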