## Natural Language Processing - SMS Spam Detection

**Let's train a model to predict spam messages!**

Description of the data:
The SMS Spam Collection is a set of tagged SMS messages that have been collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.


Dataset used: SMS Spam Collection Dataset, UCI

### Importing Initial Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
sns.set_style('darkgrid')

## Data Cleaning and Preparation

In [None]:
#reading the data
df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv',encoding='latin1')

#printing data information
df.info()

**The columns Unnamed: 2, Unnamed: 3 and Unnamed: 4 have too many null values.**

**Let's explore these columns in detail.**


In [None]:
#Exploring rows where messages are found in the Unnamed: 2, Unnamed: 3 and Unnamed: 4 columns.

df[df.columns.drop(['v1','v2'])].dropna(how='all').head(10)

**The columns other than v1 and v2 also have messages (shown above).**

In [None]:
df.loc[df['v1'] == 'spam',df.columns.drop(['v1','v2'])].dropna()

**No row labelled *spam* has all three of the columns Unnamed: 2, Unnamed: 3 and Unnamed: 4 filled (dropna() with its default settings keeps only rows without any null value).**

**Hence we are labelling the messages in these columns as *ham*.**

**Let's try to rearrange these columns into a two-column dataframe, where one column holds the labels and the other holds the messages.**

In [None]:
#Assigning label ham to messages in Unnamed: 2, Unnamed: 3, Unnamed: 4 and concatenating them with v2
temp1 = df[['v2','v1']]

temp2 = df['Unnamed: 2'].to_frame()
temp2.loc[:,'v1'] = 'ham'
temp2 = temp2.rename(columns={'Unnamed: 2':'v2'})

temp3 = df['Unnamed: 3'].to_frame()
temp3.loc[:,'v1'] = 'ham'
temp3 = temp3.rename(columns={'Unnamed: 3':'v2'})

temp4 = df['Unnamed: 4'].to_frame()
temp4.loc[:,'v1'] = 'ham'
temp4 = temp4.rename(columns={'Unnamed: 4':'v2'})

df = pd.concat([temp1,temp2,temp3,temp4],axis=0).dropna().reindex(columns=['v1','v2'])
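
The same rearrangement can be written as a loop over the extra columns. Below is a minimal sketch on a made-up toy frame; the names `raw`, `extra_cols`, `parts` and `tidy` and the sample messages are assumptions for illustration only. Unlike the cell above, it also resets the index after concatenating.

```python
import pandas as pd

# Toy frame mimicking the raw CSV layout (values are made up for illustration)
raw = pd.DataFrame({
    'v1': ['ham', 'spam', 'ham'],
    'v2': ['hello there', 'win a prize now', 'see you'],
    'Unnamed: 2': ['overflow text', None, None],
    'Unnamed: 3': [None, None, None],
    'Unnamed: 4': [None, None, 'more overflow'],
})

extra_cols = ['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4']

# Start with the original label/message pairs
parts = [raw[['v1', 'v2']]]
# Add one frame per extra column, with every message labelled 'ham'
for col in extra_cols:
    part = raw[[col]].rename(columns={col: 'v2'}).assign(v1='ham')
    parts.append(part[['v1', 'v2']])

# Stack everything, drop empty rows and renumber the index
tidy = pd.concat(parts, axis=0).dropna().reset_index(drop=True)
print(tidy)
```

Resetting the index is a design choice of this sketch: without it, rows that come from the extra columns keep the index of the row they came from, so index values repeat.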

In [None]:
#Let's print the last 10 messages in our dataframe
df.tail(10)

**Let's change the column labels of the dataframe for readability.**

In [None]:
#rename the columns for readability
df.columns = ['label','message']

Let's have a look at the messages labelled as **spam**.

In [None]:
df[df['label'] == 'spam'].head(10)

**Spam** messages seem to be well structured and lengthy.

Let's have a look at the countplot of labels.

In [None]:
#plotting frequency vs label name
_=sns.countplot(data=df,x='label',palette='viridis')
print("percentage of labels:",f"{100*df.groupby('label').count()/len(df)}")

**ham** messages far outnumber **spam** messages, so the dataset is imbalanced (skewed towards ham). We have to take this imbalance into account while training the model.

Let's have a look at distribution of message sizes.

In [None]:
#prepare a length column
df['len'] = df['message'].apply(len)

Usually spam messages are longer than personal messages. Let's compare the distribution of the lengths of **spam** and **ham** messages.

In [None]:
fig,ax=plt.subplots(figsize=(12,8))
#plot the length histogram of each label on the same axes
for check,color,alpha in [('ham','green',0.7),('spam','orange',0.9)]:
    sns.histplot(df[df['label']==check],x='len',color=color,ax=ax,alpha=alpha)
#mark the median length of each label with a labelled dashed line
ymin, ymax = ax.get_ylim()
for check,color in [('ham','green'),('spam','orange')]:
    ax.vlines(x=df.loc[df['label']==check,'len'].median(),ymin=ymin,ymax=ymax,color=color,linestyles='dashed',label=f'median {check} length')
#only the labelled median lines appear in the legend
ax.legend()
_=plt.title('Distribution of message sizes')
_=plt.ylabel('Count or Frequency')
_=plt.xlabel('Message Length (in characters)')


**The medians of the two frequency distributions indicate that personal messages are mostly short and spam messages are longer! Hence we can use the length of messages as a feature while training the model.**
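
The medians can also be compared as numbers instead of reading them off the plot. Here is a small sketch on made-up messages; the frame `msgs` and its contents are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Made-up messages, only to illustrate the comparison
msgs = pd.DataFrame({
    'label': ['ham', 'ham', 'ham', 'spam', 'spam'],
    'message': [
        'ok',
        'see you soon',
        'call me',
        'WINNER! Claim your free prize by calling now',
        'Urgent: your account needs verification today',
    ],
})

# Length of each message in characters
msgs['len'] = msgs['message'].str.len()

# Median length per label
median_len = msgs.groupby('label')['len'].median()
print(median_len)
```

On the real dataframe the same groupby call on the label and len columns should give the two values marked by the dashed lines.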

## Feature Engineering

In [None]:
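#importing functions that help in feature engineering
#(word_tokenize and stopwords need their NLTK data packages to be downloaded first, e.g. via nltk.download)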
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import string

In [None]:
#Collecting words and punctuations to be removed in a list
remove_list = set(list(string.punctuation) + stopwords.words('english'))
#defining a function that can tokenize the messages, remove unwanted words and punctuations
porterstemmer = PorterStemmer()
def message_cleaner(message):
    cleaned_message = []
    for word in word_tokenize(message.lower()):
        word = re.sub('[^a-zA-Z]','',word)
        if(word == ''):
            continue
        if(word not in remove_list):
            cleaned_message.append(porterstemmer.stem(word))
    return cleaned_message

Let's test the message cleaner function using a dummy message.

In [None]:
message_cleaner('Hey, Are you attending that guitar competition?')

Cool. The function seems to do its job. Now let's calculate tf-idf features from the messages.

**TfidfVectorizer will run the analyzer (message_cleaner) on each of the messages and build a sparse matrix of words and their frequencies. It then calculates the tf-idf features from this matrix. The length of each message is also calculated and appended to the matrix as a feature.**
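
To get a feel for what the tf-idf weights express, here is a minimal sketch on a made-up three-message corpus. It uses the default analyzer of TfidfVectorizer rather than message_cleaner, and the names `corpus`, `vec`, `vocab` and `row0` are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus: 'call' occurs in two messages, 'free' in only one
corpus = ['free prize call now', 'call me later', 'see you later']

vec = TfidfVectorizer()
X = vec.fit_transform(corpus)   # sparse matrix: one row per message, one column per word

vocab = vec.vocabulary_         # maps each word to its column index
row0 = X[0].toarray().ravel()   # tf-idf weights of the first message

# Both words occur once in the first message, but the rarer word gets the larger weight
print('free:', row0[vocab['free']])
print('call:', row0[vocab['call']])
```

Words that show up in many messages are weighted down, so the rarer word ends up with the larger weight here.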

**We must split the dataset into train and test data for further processing.**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

**Let's map labels *spam* to 1 and *ham* to 0**

In [None]:
df['label'] = df['label'].map(lambda x: 0 if(x=='ham') else 1)

In [None]:
#splitting the dataset
X_train,X_test,y_train,y_test = train_test_split(df['message'],df['label'],test_size=0.25)
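
One way to take the label imbalance into account when splitting is the stratify argument of train_test_split, which keeps the label ratio the same in the train and test sets. This is a sketch on made-up data, not what the cell above does; the names `labels`, `texts` and the random_state value are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced data: 80 ham (0) and 20 spam (1)
labels = pd.Series([0] * 80 + [1] * 20)
texts = pd.Series([f'message {i}' for i in range(100)])

# stratify keeps the spam share equal in both splits; random_state makes the split repeatable
Xtr, Xte, ytr, yte = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42)

print('spam share in train:', ytr.mean())
print('spam share in test :', yte.mean())
```

Without stratify the spam share of the test set can drift from run to run, which makes recall scores harder to compare.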

In [None]:
from sklearn.pipeline import FeatureUnion 
#FeatureUnion is used to concatenate features obtained using different transform functions
from sklearn.preprocessing import FunctionTransformer
#FunctionTransformer can create a transform function from any arbitrary or user defined function
from scipy.sparse import csr_matrix

**Writing a function that can calculate and return the length of messages in sparse matrix format**

In [None]:
def get_length(df):
    df = df.apply(len)
    return csr_matrix(df.to_numpy().reshape(-1,1))

**Let's combine the TfidfVectorizer and get_length features into a single matrix using a FeatureUnion. The feature matrix can then be created by calling the fit_transform method on this FeatureUnion object.**

In [None]:
feature_pipe = FeatureUnion([
    ('tfidf',TfidfVectorizer(analyzer=message_cleaner)),
    ('length',FunctionTransformer(get_length))])
tfidf_mat = feature_pipe.fit_transform(X_train)
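
To see how the two feature blocks end up side by side, here is a small sketch on two made-up messages. It uses the default analyzer instead of message_cleaner, and the names `length_feature`, `toy`, `union` and `features` are assumptions for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

def length_feature(series):
    # One column holding the character length of each message
    return csr_matrix(series.str.len().to_numpy().reshape(-1, 1))

toy = pd.Series(['free prize call now', 'call me later'])

union = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', FunctionTransformer(length_feature)),
])
features = union.fit_transform(toy)

# The tf-idf columns come first; the length is the last column
n_words = len(union.transformer_list[0][1].vocabulary_)
print('tf-idf columns:', n_words)
print('total columns :', features.shape[1])
print('lengths       :', features[:, -1].toarray().ravel())
```

The length column holds raw character counts, while each tf-idf value is at most 1, so the two blocks are on quite different scales.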

**Let's print the shape of the feature matrix**

In [None]:
print('Shape of the matrix:', tfidf_mat.shape) # This is a very sparse matrix

**Let's check out the tf-idf features of words in the first training message (message_num=0)**

In [None]:
message_num=0 #row position of the message in X_train (not the index label in df)

#get the column indices of the non-zero features (token words and length) of this message
ind = tfidf_mat[message_num].nonzero()[1]
#look up the vocabulary once (get_feature_names was removed in scikit-learn 1.2; get_feature_names_out needs 1.0 or newer)
feature_names = feature_pipe.transformer_list[0][1].get_feature_names_out()
#print the corresponding word and its tf-idf weight, and the length of the whole message
for index in ind:
    if(index == (tfidf_mat.shape[1]-1)):
        print('\nmessage length:',tfidf_mat[message_num,index])
    else:
        print('Word:',feature_names[index],'  tf-idf:',tfidf_mat[message_num,index])

## Training and tuning the model

Let's use the multinomial naive Bayes classifier to train the model and GridSearchCV to tune it.

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

**Since the labels are imbalanced, accuracy alone can be misleading, so we choose recall as the scoring method. Tuning the model on the recall score (computed for the positive label, spam) favours models that catch more of the spam messages.**
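
A tiny made-up example shows why accuracy can look fine on imbalanced labels while recall exposes the problem. The "classifier" here simply predicts ham for everything; the names `y_true` and `y_pred` are assumptions for illustration.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: 90 ham (0) and 10 spam (1)
y_true = [0] * 90 + [1] * 10
# A useless classifier that calls every message ham
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)   # share of all messages labelled correctly
rec = recall_score(y_true, y_pred)     # share of actual spam that was caught

print('accuracy:', acc)
print('recall  :', rec)
```

The accuracy is high only because most messages are ham, while the recall shows that no spam was caught at all.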

In [None]:
#define grid search
params =  {'alpha':[0.001,0.01,0.1,0.25,0.3,0.35,0.5,0.6,0.7,0.8,0.9,1,2,5]}
grid = GridSearchCV(MultinomialNB(),param_grid=params,cv=6,scoring='recall')
grid.fit(tfidf_mat,y_train)

In [None]:
#print the recall score on the test set (grid.score uses the scoring metric given to GridSearchCV, here recall)
print(f'Recall score is: {grid.score(feature_pipe.transform(X_test),y_test):.2f}')
#Print the best parameter
print('Best parameter:',grid.best_params_)

**Let's print and plot the result scores**

In [None]:
from sklearn.metrics import classification_report,confusion_matrix

In [None]:
#transform test data before predicting
transformed_X_test = feature_pipe.transform(X_test)

#classify test messages
predictions = grid.predict(transformed_X_test)
#print classification report
print(classification_report(y_test,predictions))

**We have a good recall and f1-score for both labels!**

In [None]:
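#create a confusion matrix (rows are actual labels, columns are predicted labels; 0=ham, 1=spam)
names = ['ham','spam']
sns.heatmap(confusion_matrix(y_test,predictions),annot=True,fmt='d',xticklabels=names,yticklabels=names)
plt.xlabel('Predicted label')
_=plt.ylabel('Actual label')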
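
When reading the heatmap it helps to remember how scikit-learn lays out the confusion matrix: rows are the actual labels and columns are the predicted labels, both in sorted label order (0 = ham, 1 = spam). Below is a minimal sketch with made-up predictions; `y_true` and `y_pred` are assumptions for illustration.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 4 ham (0) and 3 spam (1)
y_true = [0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
# Row 0: actual ham, row 1: actual spam; column 0: predicted ham, column 1: predicted spam
tn, fp, fn, tp = cm.ravel()

print(cm)
print('ham kept as ham     :', tn)
print('ham flagged as spam :', fp)
print('spam missed         :', fn)
print('spam caught         :', tp)
```

The recall for spam is the spam caught divided by all actual spam, i.e. the bottom-right cell over the sum of the bottom row.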

## Thank You
