<a href="https://colab.research.google.com/github/padmapriyajain/MyPythonworld/blob/master/Copy_of_ML_INTERMEDIATE_SMS_SPAM_DETECTION.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# INSAID ML INTERMEDIATE PROJECT - SMS SPAM DETECTION

<img src =  'https://github.com/padmapriyajain/MyPythonworld/blob/master/IMG1.jpg?raw=true' width="5000" height="400">



<a id='section1'></a>
# 1. INTRODUCTION:

I am M Padmapriya, a student of the Jan 2020 cohort. I have taken up the SMS SPAM DETECTION dataset for my ML INTERMEDIATE project.

<a id=section101></a> 
## 1.1 PROBLEM STATEMENT

This dataset is a set of tagged SMS messages collected for SMS spam research. It contains one set of 5,574 SMS messages in English, each tagged as either ham (legitimate) or spam.

The files contain one message per line. Each line is composed of two columns: v1 contains the label (ham or spam) and v2 contains the raw text.

Can you use this dataset to build a prediction model that accurately classifies which texts are spam?

<a id=section102></a> 
## 1.2 IMPORT LIBRARIES AND DATA

In [None]:
import sys                                                                      # Import packages
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import pie, axis, show
%matplotlib inline                   
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity= 'all' 

import warnings 
# Ignore warning related to pandas_profiling                                                                
warnings.filterwarnings('ignore') 

# Display all dataframe columns in outputs instead of hiding the middle columns
# (wide frames are then displayed with a horizontal scroll)
pd.set_option('display.max_columns', 100)                                       

# Load the dataset to df_sms dataframe
df_sms = pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-3/master/Projects/spam.csv", encoding = "latin-1")     

# 2. DATA DESCRIPTION

In [None]:
# Display sample of data:
df_sms.sample(5)

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
426,ham,aathi..where are you dear..,,,
3544,ham,Thank You meet you monday,,,
59,ham,Yes..gauti and sehwag out of odi series.,,,
2239,ham,Every day i use to sleep after &lt;#&gt; so ...,,,
3127,ham,would u fuckin believe it they didnt know i ha...,,,


In [None]:
# Let's run .info() on the data:
df_sms.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   v1          5572 non-null   object
 1   v2          5572 non-null   object
 2   Unnamed: 2  50 non-null     object
 3   Unnamed: 3  12 non-null     object
 4   Unnamed: 4  6 non-null      object
dtypes: object(5)
memory usage: 217.8+ KB


Observations: The SMS SPAM DETECTION dataset has - 
  * 5572 observations/messages
  * 5 columns
  * Columns (Unnamed: 2, Unnamed: 3 and Unnamed: 4) that are mostly missing values


In [None]:
# Let's run .describe() on the data:
df_sms.describe(include='all')

Unnamed: 0,v1,v2,Unnamed: 2,Unnamed: 3,Unnamed: 4
count,5572,5572,50,12,6
unique,2,5169,43,10,5
top,ham,"Sorry, I'll call later","bt not his girlfrnd... G o o d n i g h t . . .@""","MK17 92H. 450Ppw 16""","GNT:-)"""
freq,4825,30,3,2,2


Observations:
  * v1 and v2 have no missing values, unlike the other 3 columns.
  * v1 is the "target" column; it has 2 unique values, and "ham" is the most frequent with 4825 occurrences.
  * v2 has 5169 unique values, so some messages are duplicated; the most frequent one, "Sorry, I'll call later", appears 30 times.
  * Unnamed: 2, Unnamed: 3 and Unnamed: 4 have very few values (50, 12 and 6 respectively), most of which are unique.



Decisions:
  * Unnamed: 2, Unnamed: 3 and Unnamed: 4 are almost entirely null, so they can be dropped.
  * v1 and v2 can be renamed to 'spam' and 'messages'.
  * v1 has 2 values, "ham" and "spam", which can be replaced with 0 and 1.
  * v2 holds text, which needs to be processed and converted to a numerical representation before modelling.

# 3. DATA PREPROCESSING

In [None]:
# Drop unwanted columns:
df_sms.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
df_sms.head(5)

Unnamed: 0,v1,v2
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
# Change the column names:
df_sms.rename(columns = {'v1':'spam','v2':'messages'},inplace=True)
df_sms.head(5)

Unnamed: 0,spam,messages
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
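# Class counts in the 'spam' column (formerly v1)
spam_count = df_sms.spam.value_counts()
spam_count
# Index by label rather than by position so this works across pandas versions
print("\nThis dataset has", round(spam_count['spam'] / spam_count.sum() * 100, 2), "% of spam messages.")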

ham     4825
spam     747
Name: spam, dtype: int64


This dataset has 13.41 % of spam messages.
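
The percentage printed above is the spam count divided by the total number of messages. As a quick check of that arithmetic, here is a small sketch that uses only the counts shown in the output. The majority-class baseline is an added aside for context and is not computed anywhere in the original notebook.

```python
# Counts copied from the value_counts() output above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Share of spam messages, as a percentage.
spam_pct = round(spam_count / total * 100, 2)

# Accuracy of a trivial model that labels every message as ham
# (added for illustration; not part of the original notebook).
majority_baseline_pct = round(ham_count / total * 100, 2)

print(total)                  # 5572
print(spam_pct)               # 13.41
print(majority_baseline_pct)  # 86.59
```

Because the classes are this imbalanced, plain accuracy can look high even for a model that never flags spam, so it may be worth looking at precision and recall as well.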


In [None]:
# Encode "ham" as 0 and "spam" as 1
df_sms['spam'] = df_sms.spam.map({'ham':0, 'spam':1})
df_sms.head(5)

Unnamed: 0,spam,messages
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [None]:
# Let's introduce a new column that gives the length of each message.
df_sms["msg_length"] = df_sms["messages"].str.len()
df_sms.head(5)

Unnamed: 0,spam,messages,msg_length
0,0,"Go until jurong point, crazy.. Available only ...",111
1,0,Ok lar... Joking wif u oni...,29
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,155
3,0,U dun say so early hor... U c already then say...,49
4,0,"Nah I don't think he goes to usf, he lives aro...",61


In [None]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

The values in the messages column appear to contain:
1. Words with a-z and A-Z (both lower and upper case)
2. Punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~)
3. Numbers (0-9), including decimal values
4. Whitespace before and after words
5. Stop words (very frequently used words that carry little additional information, such as articles, determiners and prepositions)

Decision:

All of the above need to be cleaned up in the messages so that the remaining words can be converted into features.
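
As a rough illustration of what this cleanup does to a single message, here is a minimal sketch using only the standard library. The function name `clean_message` and the sample string are hypothetical, and the sketch covers only punctuation, numbers, whitespace and case; stop-word removal and stemming are left out.

```python
import re
import string

def clean_message(text):
    """Apply punctuation, number, whitespace and case cleanup to one string."""
    # Strip punctuation; re.escape keeps characters such as ']' and '\' literal.
    text = re.sub('[{}]'.format(re.escape(string.punctuation)), '', text)
    # Replace each run of digits with the placeholder token 'number'.
    text = re.sub(r'\d+', 'number', text)
    # Collapse repeated whitespace and trim both ends.
    text = re.sub(r'\s+', ' ', text).strip()
    return text.lower()

# Hypothetical sample message, not a row from the dataset.
sample = "  WIN a FREE prize!!!   Call 0800 now...  "
print(clean_message(sample))  # win a free prize call number now
```

The cells that follow apply the same kind of substitutions column-wise with pandas string methods.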

In [None]:
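# Replace email addresses with 'email'
# Note: because of the ^ and $ anchors, any match spans the whole message.
# regex=True is required in pandas >= 2.0, where str.replace is literal by default.
df_sms['processed_msg'] = df_sms.messages.str.replace(r'^.+@[^\.].*\.[a-z]{2,}$', 'email', regex=True)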

In [None]:
# Replace web addresses with 'link' (regex=True is required in pandas >= 2.0)
df_sms['processed_msg'] = df_sms.processed_msg.str.replace(r'^http\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(/\S*)?$', 'link', regex=True)
df_sms['processed_msg'] = df_sms.processed_msg.str.replace(r'www\.\S+\.com', 'link', regex=True)
df_sms['processed_msg'] = df_sms.processed_msg.str.replace(r'www\.\S+\.net', 'link', regex=True)
df_sms['processed_msg'] = df_sms.processed_msg.str.replace(r'www\.\S+\.org', 'link', regex=True)

In [None]:
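# Remove punctuation:
import re
# re.escape keeps characters such as ']', '\' and '^' literal inside the character class
df_sms['processed_msg'] = df_sms.processed_msg.str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)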

In [None]:
# Replace all numbers with 'number'
df_sms['processed_msg'] = df_sms.processed_msg.str.replace(r'\d+(\.\d+)?', 'number', regex=True)


In [None]:
# Replace whitespace between terms with a single space
df_sms['processed_msg'] = df_sms.processed_msg.str.replace(r'\s+', ' ', regex=True)

In [None]:
# Remove leading and trailing whitespace
df_sms['processed_msg'] = df_sms.processed_msg.str.replace(r'^\s+|\s+?$', '', regex=True)

In [None]:
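# Replace money symbols with 'currency'
# Note: '$' is in string.punctuation and was already removed above, so in practice only '£' is left to match.
df_sms['processed_msg'] = df_sms.processed_msg.str.replace(r'£|\$', 'currency', regex=True)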

In [None]:
# Remove non-ASCII characters
df_sms['processed_msg'] = df_sms.processed_msg.str.replace(r'[^\x00-\x7f]', '', regex=True)

In [None]:
# Converting the text to lower case:
df_sms['processed_msg'] = df_sms.processed_msg.str.lower()

In [None]:
# Remove the stop words:
import nltk
from nltk.corpus import stopwords   # needed for stopwords.words() below
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
df_sms['processed_msg'] = df_sms['processed_msg'].apply(lambda x: ' '.join(term for term in x.split() if term not in stop_words))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# Stemming: reduce each word to its stem, e.g. 'running' becomes 'run'
from nltk.stem import PorterStemmer
ps = PorterStemmer()
df_sms['processed_msg'] = df_sms['processed_msg'].apply(lambda x: ' '.join(ps.stem(term) for term in x.split()))

In [None]:
df_sms_processed = df_sms.copy()
df_sms_processed.drop(['messages', 'msg_length'], axis = 1, inplace = True)
df_sms_processed.head(5)
df_sms_processed.info()

Unnamed: 0,spam,processed_msg
0,0,go jurong point crazi avail bugi n great world...
1,0,ok lar joke wif u oni
2,1,free entri number wkli comp win fa cup final t...
3,0,u dun say earli hor u c alreadi say
4,0,nah dont think goe usf live around though


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5572 entries, 0 to 5571
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   spam           5572 non-null   int64 
 1   processed_msg  5572 non-null   object
dtypes: int64(1), object(1)
memory usage: 87.2+ KB


# 4 EDA - Spam detection

# 5 Generating Features:
We can use the bag-of-words approach to extract features from the text data.
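
To make the idea concrete, here is a minimal pure-Python sketch of bag-of-words counting on three made-up messages. The names `docs`, `top_k` and `to_vector` are hypothetical. The sketch only approximates what `CountVectorizer` does: each message becomes a vector of term counts over a fixed vocabulary, limited to the most frequent terms in the corpus.

```python
from collections import Counter

# Hypothetical, already-cleaned messages (not rows from the dataset).
docs = [
    "free entri win free prize",
    "ok call later",
    "win prize call number",
]

# Count every term across the corpus and keep the top_k most frequent,
# which is roughly the role max_features plays in CountVectorizer.
top_k = 4
corpus_counts = Counter(term for doc in docs for term in doc.split())
vocabulary = sorted(term for term, _ in corpus_counts.most_common(top_k))

def to_vector(doc):
    """Return the count of each vocabulary term in one message."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

matrix = [to_vector(doc) for doc in docs]

print(vocabulary)  # ['call', 'free', 'prize', 'win']
for row in matrix:
    print(row)
# [0, 2, 1, 1]
# [1, 0, 0, 0]
# [1, 0, 1, 1]
```

Terms outside the vocabulary (such as "ok" or "later" here) are simply dropped, which is why most entries in the real 1500-column matrix are zero.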

In [None]:
processed_msg = df_sms_processed.processed_msg
y = df_sms_processed.spam
processed_msg.shape

(5572,)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1500) # Keep only the 1500 most frequent terms as features
X = pd.DataFrame(cv.fit_transform(processed_msg).toarray())
X.shape
X.head(5)

(5572, 1500)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,...,1450,1451,1452,1453,1454,1455,1456,1457,1458,1459,1460,1461,1462,1463,1464,1465,1466,1467,1468,1469,1470,1471,1472,1473,1474,1475,1476,1477,1478,1479,1480,1481,1482,1483,1484,1485,1486,1487,1488,1489,1490,1491,1492,1493,1494,1495,1496,1497,1498,1499
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
