# SMS spam detection

#### Steps to be followed:
1. Data Acquisition (from Kaggle)
2. Data Preprocessing
3. Text conversion
4. Model Training
5. Testing

I acquire the data from Kaggle through its API, authenticating with an API token stored in kaggle.json.

Steps followed:
1. Upload the JSON file that contains the API token
2. Download the required dataset with the Kaggle API command
3. Unzip the downloaded archive
4. Create a DataFrame from the extracted data


In [None]:
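from google.colab import files

# Upload kaggle.json (the Kaggle API token) from the local machine
files.upload()

# Put the token where the Kaggle CLI expects it and restrict its permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json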

Saving kaggle.json to kaggle.json


In [None]:
!kaggle datasets download -d uciml/sms-spam-collection-dataset

Dataset URL: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
License(s): unknown
Downloading sms-spam-collection-dataset.zip to /content
  0% 0.00/211k [00:00<?, ?B/s]
100% 211k/211k [00:00<00:00, 74.5MB/s]


In [None]:
!unzip /content/sms-spam-collection-dataset.zip

Archive:  /content/sms-spam-collection-dataset.zip
  inflating: spam.csv                


In [None]:
import pandas as pd
data_train=pd.read_csv("/content/spam.csv",encoding='latin-1')

In [None]:
data_train.head()

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
0,ham,"Go until jurong point, crazy.. Available only ...",,,
1,ham,Ok lar... Joking wif u oni...,,,
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,,,
3,ham,U dun say so early hor... U c already then say...,,,
4,ham,"Nah I don't think he goes to usf, he lives aro...",,,


#### Data Preprocessing

Steps followed:
1. Check for null values and remove them
2. Identify the feature and label columns
3. Check whether the classes are balanced
4. Clean the text so that features can be extracted
5. Apply NLP techniques such as stopword removal, lemmatization, and stemming
6. Encode the labels as integers (0 and 1)
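
Steps 1 and 3 can be sketched with a few pandas calls. The snippet below is only an illustration on a made-up DataFrame (the rows and the names `toy`, `toy_clean`, and `class_share` are assumptions, not part of the dataset); it counts missing values, drops incomplete rows, and reports the share of each class.

```python
import pandas as pd

# Toy data standing in for the SMS dataset (hypothetical rows, not from spam.csv)
toy = pd.DataFrame({
    "Feature": ["Ok lar", None, "Free entry now", "See you soon"],
    "label": ["ham", "ham", "spam", "ham"],
})

# Step 1: count missing values per column, then drop incomplete rows
null_counts = toy.isnull().sum()
toy_clean = toy.dropna(subset=["Feature", "label"]).reset_index(drop=True)

# Step 3: class proportions show whether the labels are balanced
class_share = toy_clean["label"].value_counts(normalize=True)

print(null_counts)
print(class_share)
```

On the real data the same calls would be applied to `data` once it has been created.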

In [None]:
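# .copy() makes an independent DataFrame, so later column assignments do not trigger SettingWithCopyWarning
data=data_train[['v2','v1']].copy()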

In [None]:
data=data.rename(columns={'v2':'Feature','v1':'label'})

In [None]:
data

Unnamed: 0,Feature,label
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham
...,...,...
5567,This is the 2nd time we have tried 2 contact u...,spam
5568,Will Ì_ b going to esplanade fr home?,ham
5569,"Pity, * was in mood for that. So...any other s...",ham
5570,The guy did some bitching but I acted like i'd...,ham


In [None]:
data.shape

(5572, 2)

In [None]:
data['label'].unique()

array(['ham', 'spam'], dtype=object)

In [None]:
data['label'].value_counts()

label
ham     4825
spam     747
Name: count, dtype: int64

In [None]:
import re
def cleantext(sentence):
  sentence=re.sub(r'\s+',' ',sentence) # collapse runs of whitespace into one space
  sentence=re.sub(r"[^a-zA-Z0-9\s]","",sentence) # drop everything except letters, digits, and whitespace
  return sentence

data['Feature']=data['Feature'].apply(cleantext)
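
To see what the two substitutions do, here is a small standalone check on made-up strings. The function name `clean_text` and the sample messages are assumptions for illustration; the regular expressions are the same as in `cleantext`.

```python
import re

def clean_text(sentence: str) -> str:
    # Same two substitutions as cleantext, applied in the same order
    sentence = re.sub(r"\s+", " ", sentence)            # collapse whitespace runs
    sentence = re.sub(r"[^a-zA-Z0-9\s]", "", sentence)  # drop punctuation and symbols
    return sentence

sample = "Free  entry!!  Text WIN to 80086 now..."  # made-up message
cleaned = clean_text(sample)
print(cleaned)

# Punctuation between words is deleted, not replaced, so neighbours get glued together
glued = clean_text("So...any other")
print(glued)

# Whitespace is collapsed before punctuation is removed, so double spaces can reappear
spaced = clean_text("Pity, * was")
print(repr(spaced))
```

The leftover double spaces are harmless here because later steps split on whitespace, but the glued tokens explain vocabulary entries such as `soani`.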


In [None]:
data.head()

Unnamed: 0,Feature,label
0,Go until jurong point crazy Available only in ...,ham
1,Ok lar Joking wif u oni,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor U c already then say,ham
4,Nah I dont think he goes to usf he lives aroun...,ham


In [None]:
# Remove stop words, then apply lemmatization and stemming.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
import spacy
import pandas as pd

# Download NLTK data files (run only once)
# nltk.download('stopwords')
# nltk.download('wordnet')

# Initialize SpaCy
nlp = spacy.load('en_core_web_sm')

class TextPreprocessing:
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))
        self.lemmatizer = WordNetLemmatizer()
        self.stemmer = PorterStemmer()

    def remove_stopwords(self, text):
        words = text.split()
        filtered_words = [word for word in words if word.lower() not in self.stop_words or len(word) == 1]
        return ' '.join(filtered_words)

    def lemmatize(self, text):
        words = text.split()
        lemmatized_words = [self.lemmatizer.lemmatize(word) for word in words]
        return ' '.join(lemmatized_words)

    def stem(self, text):
        words = text.split()
        stemmed_words = [self.stemmer.stem(word) for word in words]
        return ' '.join(stemmed_words)

    def preprocess(self, text):
        text = self.remove_stopwords(text)
        text = self.lemmatize(text)
        text = self.stem(text)
        return text

preprocessor = TextPreprocessing()

# Apply preprocessing to the 'Feature' column
data['Feature'] = data['Feature'].apply(preprocessor.preprocess)



In [None]:
data

Unnamed: 0,Feature,label
0,go jurong point crazi avail bugi n great world...,ham
1,ok lar joke wif u oni,ham
2,free entri 2 wkli comp win fa cup final tkt 21...,spam
3,u dun say earli hor u c alreadi say,ham
4,nah dont think go usf life around though,ham
...,...,...
5567,2nd time tri 2 contact u u 750 pound prize 2 c...,spam
5568,b go esplanad fr home,ham
5569,piti mood soani suggest,ham
5570,guy bitch act like id interest buy someth el n...,ham


In [None]:
# Count the unique words (longer than one character) in the entire text column

feature_data=[]
for sentence in data['Feature']:
  for word in sentence.split():
    if len(word) >1:
      feature_data.append(word)
print("Number of unique words ",len(set(feature_data)))
set(feature_data)

Number of unique words  7909


{'cup',
 'youwhen',
 '1843',
 'teethi',
 'prap',
 'wwwtextpodnet',
 'btwn',
 'dearli',
 'mel',
 'nan',
 'dt',
 'count',
 'doll',
 'hoodi',
 'bowl',
 'owe',
 '125gift',
 'land',
 'spile',
 '1pm',
 'justbeen',
 'godyou',
 '09066361921',
 'pest',
 'ra',
 '30th',
 'high',
 '29m',
 'newest',
 'lionm',
 'box334',
 'filthi',
 '84128',
 'road',
 'darker',
 'marriageprogram',
 'magic',
 'dang',
 'mi',
 'minapn',
 'divorc',
 'alian',
 'dificult',
 'hwd',
 'held',
 '526',
 'camp',
 'text82228',
 'walk',
 'lara',
 'wwwrtfsphostingcom',
 'grumbl',
 'readi',
 '6pm',
 'sopha',
 'lack',
 'maneg',
 'crack',
 'novemb',
 'outfit',
 'kisi',
 'nacho',
 'nickey',
 'hunt',
 'didt',
 'sez',
 'onlydon',
 'bye',
 'amor',
 'urn',
 'bray',
 'cri',
 'cin',
 'incorrect',
 'tamilnaduthen',
 'batt',
 '09065171142stopsms08',
 'mmmmmm',
 'ugo',
 'tell',
 'implic',
 'whore',
 'goodnit',
 'kintu',
 'yep',
 '1013',
 'giv',
 '020903',
 'boutxx',
 'ava',
 'parad',
 'broken',
 'chikkub',
 '2hr',
 'yesim',
 'youclean',
 'waht

In [None]:
# In total there are 7909 unique words (see the count printed above)
# Encode the label as 0 or 1:
# 0 for ham
# 1 for spam

def label_encoding(word):
  if word == 'ham':
    return 0
  else:
    return 1

data['label']=data['label'].apply(label_encoding)
data['label'].value_counts()

label
0    4825
1     747
Name: count, dtype: int64
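
As an alternative sketch (not the approach used in this notebook), the same encoding can be written with `Series.map` and an explicit dictionary. The sample labels below are made up.

```python
import pandas as pd

label_map = {"ham": 0, "spam": 1}

labels = pd.Series(["ham", "spam", "ham"])  # made-up labels
encoded = labels.map(label_map)
print(encoded.tolist())

# A label missing from the dictionary becomes NaN instead of silently turning into 1
unexpected = pd.Series(["eggs"]).map(label_map)
print(unexpected.isna().all())
```

The difference from `label_encoding` is that an unexpected label surfaces as a missing value rather than being counted as spam.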

In [None]:
data

Unnamed: 0,Feature,label
0,go jurong point crazi avail bugi n great world...,0
1,ok lar joke wif u oni,0
2,free entri 2 wkli comp win fa cup final tkt 21...,1
3,u dun say earli hor u c alreadi say,0
4,nah dont think go usf life around though,0
...,...,...
5567,2nd time tri 2 contact u u 750 pound prize 2 c...,1
5568,b go esplanad fr home,0
5569,piti mood soani suggest,0
5570,guy bitch act like id interest buy someth el n...,0


#### Text Conversion
Aim: convert each text into a d-dimensional vector (I chose d = 5000) <br>
Available techniques:
1. Bag of words (CountVectorizer()) (chosen)
2. TF-IDF vectors (TfidfVectorizer())
3. Word to vector (Word2Vec)

I used the bag-of-words technique to extract features, keeping at most 5000 features.

#### Model Training

Because this is a binary classification problem, I chose logistic regression, which learns a linear decision boundary (a hyperplane) that separates the two classes. With this model I got about 97.85% accuracy on the test set.



In [None]:
# applying count vectorizer with max features 5000

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Initialize CountVectorizer with max_features=5000
vectorizer = CountVectorizer(max_features=5000)

# Transform the documents into feature vectors
X = vectorizer.fit_transform(data['Feature'])
y=data['label']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')


Accuracy: 0.97847533632287
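
The cell above fits `CountVectorizer` on all messages before splitting, so the vocabulary is built from test messages as well. One possible variation, sketched here on made-up data, wraps the vectorizer and the model in a scikit-learn `Pipeline` so that the vocabulary comes from the training split only, and uses `stratify` to keep the ham/spam ratio equal in both splits. The `texts` and `labels` lists are assumptions standing in for `data['Feature']` and `data['label']`; the accuracy reported earlier was not produced this way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Made-up messages standing in for data['Feature'] and data['label']
texts = ["free prize win now", "win cash free entri", "claim free prize",
         "free cash claim now", "ok see later", "go home now",
         "call later ok", "lunch today ok"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# stratify keeps the ham/spam ratio the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The pipeline fits the vocabulary on the training messages only
clf = Pipeline([
    ("bow", CountVectorizer(max_features=5000)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)

test_accuracy = clf.score(X_test, y_test)
print(test_accuracy)
```

With only eight toy messages the printed score means little; the point is the structure, which can be reused with the real columns.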


In [None]:
from sklearn.metrics import confusion_matrix
print("Confusion matrix ",confusion_matrix(y_test,y_pred))

Confusion matrix  [[965   0]
 [ 24 126]]


In [None]:
tp=0
tn=0
fp=0
fn=0
for i in range(y_test.shape[0]):
  if(y_test.iloc[i]==y_pred[i]):
    if(y_test.iloc[i]==0):
      tn+=1
    else:
      tp+=1
  else:
    if(y_test.iloc[i]==0):
      fp+=1
    else:
      fn+=1
print("True negative ",tn)
print("False Positive ",fp)
print("False Negative ",fn)
print("True Positive ",tp)

True negative  965
False Positive  0
False Negative  24
True Positive  126
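
Because the classes are imbalanced, accuracy alone hides how much spam is missed. The short calculation below derives precision, recall, and F1 from the four counts printed above; only the variable names are new.

```python
# Counts copied from the output above
tn, fp, fn, tp = 965, 0, 24, 126

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
# prints: accuracy=0.9785 precision=1.0000 recall=0.8400 f1=0.9130
```

So no ham message was flagged as spam, but 16% of the spam messages in the test set (24 of 150) got through.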


In [None]:
# Some spam messages are predicted as not spam (24 false negatives),
# so this error needs to be reduced.
# Naive Bayes is another model commonly used for text classification.

# Apply the Naive Bayes model

from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')


Accuracy: 0.97847533632287


Naive Bayes is widely used as a strong baseline for text classification, so out of curiosity I also tried a multinomial Naive Bayes model. It reached about 97.85% accuracy, the same figure as logistic regression on this split.