# Introduction


<B><I>Goal: To classify whether the SMS is spam or ham</I></B>

In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (the coefficients in the linear combination). Formally, in binary logistic regression there is a single binary dependent variable, coded by an indicator variable, where the two values are labeled "0" and "1", while the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labeled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labeling; the function that converts log-odds to probability is the logistic function, hence the name. - Wikipedia

Logistic regression classifies by modeling the relationship between a categorical dependent variable and one or more independent variables, estimating class probabilities with the logistic function.
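As a minimal sketch of the idea described above, the logistic function maps a log-odds value (a linear combination of the inputs) to a probability between 0 and 1. The function names, weights, and bias below are illustrative assumptions, not values taken from the model trained later in this notebook.

```python
import math


def logistic(log_odds: float) -> float:
    """Convert log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def predict_probability(features, weights, bias):
    """Probability of class "1" for one sample under a logistic model."""
    # log-odds = bias + w1*x1 + w2*x2 + ...
    log_odds = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(log_odds)


# Hypothetical weights and bias, chosen only for illustration
example_probability = predict_probability([1.0, 0.0], weights=[2.0, -1.0], bias=-2.0)
print(example_probability)
```

A log-odds of 0 corresponds to a probability of 0.5, large positive log-odds approach 1, and large negative log-odds approach 0.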

In this notebook, we load the SMS data set, convert the messages to TF-IDF features, train a logistic regression classifier, and use it to predict whether a message is spam or ham.

Ideally, text preprocessing should be done before applying logistic regression. See: https://github.com/indianspidy/notebooks/blob/main/TextPreProcessing.ipynb


<I>References:
- Wikipedia (https://www.wikipedia.org/)
- Analytics Vidhya (https://www.analyticsvidhya.com/)
- Python Programming Language (https://pythonprogramminglanguage.com/)
</I>


In [66]:
# Importing required libraries

import pandas as pd # for data frames
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer # for term frequency-inverse document frequency
from sklearn.linear_model import LogisticRegression # for model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score # for the error metrics
from sklearn.metrics import confusion_matrix # for creating confusion matrix
from sklearn.model_selection import train_test_split # to do test & train data split

In [67]:
# Loading the dataset. The file is tab-separated, so we pass sep='\t'.
# The file has no header row, so column names are supplied: label (ham or spam) and message (the SMS text).
df = pd.read_csv('C:/Users/rm634391/Desktop/Kaggle/SpamClassification/SMSSpamCollection', sep='\t', names=['label', 'message'])

In [68]:
# Printing the number of rows & number of columns in data frame
df.shape

(5572, 2)

In [69]:
# printing data types of each column
datatypes = df.dtypes
datatypes

label      object
message    object
dtype: object

In [70]:
# showing the first five rows
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


#### The labels need to be converted into binary variables (for binary classification): 0 for ham; 1 for spam

In [71]:
# ham is mapped to '0' and spam to '1'
# (the labels are stored as strings, which scikit-learn treats as class names)
df['label'] = df['label'].map({'ham': '0', 'spam': '1'})

In [72]:
# showing the first five rows again
df.head()

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


In [73]:
# before splitting, separate the features (x: message text) from the target (y: label)

x = df['message']
y = df['label']

In [74]:
# Now let us split x & y into training and test sets.
# test_size = 0.2 holds out 20% of the data for testing; the remaining 80% is used for training

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)

In [75]:
# print the number of messages in the training set

x_train.shape[0]

4457

In [76]:
# print the number of messages in the test set

x_test.shape[0]

1115
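Because the split is random, the train and test rows (and therefore the accuracy reported later) change from run to run. The sketch below, on hypothetical toy data, shows two optional `train_test_split` parameters that may help: `random_state` to make the split repeatable and `stratify` to keep the ham/spam ratio the same in both subsets. The seed value 42 is an arbitrary assumption.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 15 ham ('0') and 5 spam ('1') messages
messages = [f"message {i}" for i in range(20)]
labels = ['0'] * 15 + ['1'] * 5

msg_train, msg_test, lab_train, lab_test = train_test_split(
    messages, labels,
    test_size=0.2,     # 20% of the rows are held out for testing
    random_state=42,   # arbitrary seed so the split is repeatable
    stratify=labels,   # keep the ham/spam ratio equal in both subsets
)

print(len(msg_train), len(msg_test))
print(lab_test.count('1'), "spam message(s) in the test set")
```

In this notebook the same two arguments could be passed alongside `test_size = 0.2`, with `stratify=y`.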

### TF-IDF 

TF-IDF (term frequency-inverse document frequency) is a weighting scheme from information retrieval that scores how relevant a term is to a document within a collection. Here it is used to turn each SMS message into a numeric feature vector.

TF measures how often a term appears in a document, and IDF measures how rare, and therefore how informative, that term is across all documents. The product of the two is the TF-IDF score.

To learn more about TF-IDF, see: https://www.onely.com/blog/what-is-tf-idf/
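The following is a small worked sketch of the textbook form of the score, computed by hand on three made-up messages. The helper names (`tf`, `idf`, `tf_idf`) and the toy corpus are assumptions for illustration; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical messages), tokenized by whitespace
docs = [
    "free prize free entry",
    "see you at lunch",
    "free lunch today",
]
tokenized = [doc.split() for doc in docs]


def tf(term, tokens):
    """Term frequency: share of the document's tokens equal to `term`."""
    return Counter(tokens)[term] / len(tokens)


def idf(term, corpus):
    """Inverse document frequency; assumes `term` occurs in at least one document."""
    document_count = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / document_count)


def tf_idf(term, tokens, corpus):
    """TF-IDF score: the product of TF and IDF."""
    return tf(term, tokens) * idf(term, corpus)


for term in ("free", "prize"):
    print(term, round(tf_idf(term, tokenized[0], tokenized), 4))
```

In the first message, "free" appears twice but also occurs in another message, while "prize" appears once and nowhere else, so "prize" ends up with the higher score.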

In [77]:
# creating a new TF-IDF vectorizer
vectorizer = TfidfVectorizer()

### fit_transform
fit() learns the vocabulary and the IDF weights from the training messages and stores them on the vectorizer.
transform() uses the learned vocabulary and weights to convert messages into a TF-IDF matrix.
fit_transform() does both in one call. It is used on the training data only; the test data goes through transform() so that it is encoded with the vocabulary learned from training.
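To illustrate the difference between fitting and transforming, here is a minimal sketch on made-up messages. The variable names and toy data are assumptions; the point is that a word seen only at transform time is not in the learned vocabulary and is ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy messages
train_docs = ["free prize now", "see you at lunch"]
test_docs = ["free lunch tomorrow"]

toy_vectorizer = TfidfVectorizer()

# Learn the vocabulary and IDF weights from the training messages, then encode them
train_matrix = toy_vectorizer.fit_transform(train_docs)

# Encode the test message with the vocabulary learned above (no re-fitting)
test_matrix = toy_vectorizer.transform(test_docs)

print(sorted(toy_vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
```

Both matrices have the same number of columns, one per vocabulary term, which is what lets a model trained on one be applied to the other.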

In [78]:
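# fit the vectorizer on the training messages and transform them
x_transform_train = vectorizer.fit_transform(x_train)
# transform the test messages with the vocabulary learned from training
x_transform_test = vectorizer.transform(x_test)
# the labels (y_train, y_test) are not text features, so they are not vectorized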

In [79]:
# create a new classifier
classifier = LogisticRegression()

In [80]:
# fit the model; fit() returns the classifier itself,
# so response and classifier refer to the same fitted model
response = classifier.fit(x_transform_train, y_train)

In [81]:
# Applying the classifier on test data
classified_text = response.predict(x_transform_test)

## Confusion Matrix

A confusion matrix (also known as an error matrix) tabulates predicted labels against actual labels, showing how many messages of each class were classified correctly or incorrectly.
To learn more, see https://en.wikipedia.org/wiki/Confusion_matrix
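As a minimal sketch of how such a matrix can be computed with scikit-learn, the example below uses six hypothetical actual and predicted labels rather than this notebook's results. Because the labels here are the strings '0' and '1', `pos_label='1'` is passed explicitly to the precision and recall functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels for six messages ('0' = ham, '1' = spam)
actual = ['0', '0', '1', '1', '0', '1']
predicted = ['0', '1', '1', '0', '0', '1']

# Rows are actual classes, columns are predicted classes, in the order given by `labels`
matrix = confusion_matrix(actual, predicted, labels=['0', '1'])
tn, fp, fn, tp = matrix.ravel()
print(matrix)

# Precision: share of predicted spam that is really spam
precision = precision_score(actual, predicted, pos_label='1')
# Recall: share of real spam that was caught
recall = recall_score(actual, predicted, pos_label='1')
print(precision, recall)
```

In this notebook, the same calls would take `y_test` and `classified_text` in place of the toy lists.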


In [82]:
# calculating the accuracy score
accuracy_score(y_test, classified_text)


0.9721973094170404

In [83]:
# Predicting new messages

test_message = vectorizer.transform(["Urgent, You've won a prize", "Hey, do we have college today?", "The IRS is trying to contact you"])
predictions = classifier.predict(test_message)
print(predictions)

['1' '0' '0']
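The predicted labels come from class probabilities, which can be inspected with `predict_proba`. The sketch below trains a tiny model on six hypothetical messages so that it runs on its own; the data and the `toy_` names are assumptions, and the probabilities it prints say nothing about the model trained above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training messages ('1' = spam, '0' = ham)
toy_messages = [
    "win a free prize now",
    "free entry to win cash",
    "urgent claim your prize",
    "are we meeting for lunch",
    "see you at college today",
    "call me when you get home",
]
toy_labels = ['1', '1', '1', '0', '0', '0']

toy_vectorizer = TfidfVectorizer()
toy_classifier = LogisticRegression()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

new_messages = ["claim your free prize", "lunch at college today"]
# One row per message, one column per class, in the order of toy_classifier.classes_
probabilities = toy_classifier.predict_proba(toy_vectorizer.transform(new_messages))

for message, row in zip(new_messages, probabilities):
    print(message, dict(zip(toy_classifier.classes_, row.round(3))))
```

In this notebook, `classifier.predict_proba(test_message)` would give the corresponding probabilities for the three messages above.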
