# Objective

The objective of this project is to predict whether a given text (SMS) message is spam, based on the analysis of a data set of spam and ham (legitimate) text messages.

# Problem
In recent times, telemarketers and hackers have used text messages to steal personal information or to market events, products and websites. These spam texts are often both intrusive and costly. They are also a major threat to phone users, as they can lead to malware being installed on the phone, to data theft, and to degraded phone performance.

# Data set information

The data set (SMSSpamCollection.csv) used for analysis in this project was obtained from the reference given below. It is a set of tagged SMS messages collected for SMS spam research. The set contains a total of 5,081 messages in English, each tagged as ham (legitimate) or spam: 4,392 legitimate messages and 689 spam messages. The file contains one message per line. Each line is composed of two columns: one with the label (ham or spam) and the other with the raw text.
   
   Reference: https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
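
To make the two-column layout concrete, here is a minimal sketch that reads label/text rows and counts each label using only the standard library. The sample rows are made up for illustration, and the comma-separated layout with a `Type,Message` header is an assumption based on the column names used later in this notebook.

```python
import csv
import io
from collections import Counter

# Illustrative stand-in for the file contents; these rows are made up,
# and the comma-separated "Type,Message" layout is an assumption.
SAMPLE = """Type,Message
ham,Ok see you at 5
spam,WINNER!! Claim your prize now
ham,"Sorry, running late"
"""

def count_labels(handle):
    """Count how many rows carry each label in the Type column."""
    reader = csv.DictReader(handle)
    return Counter(row["Type"] for row in reader)

counts = count_labels(io.StringIO(SAMPLE))
print(counts["ham"], counts["spam"])
```

Passing an open file handle instead of the `io.StringIO` object should give the same kind of per-label counts for the real file, provided its layout matches the assumption above.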

In [83]:
#Import libraries

import pandas as pd
import numpy as np
import re
import statsmodels.api as sm
from sklearn import linear_model

In [84]:
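#Import data from dataset
#pd.read_csv replaces DataFrame.from_csv, which newer pandas versions no longer provide.
#The raw string (r'...') keeps the backslashes in the Windows path literal.

SMS_dataset = pd.read_csv(r'C:\Users\shikha\Desktop\SMSv5\SMSSpamCollection.csv')
print(SMS_dataset.head(10))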

   Type                                            Message
0   ham  @@.comGo until jurong point  crazy.. Available...
1   ham  Ok lar... Joking wif u oni...\t\t\t\t\t\t\t\t\...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf  he lives aro...
5  spam  FreeMsg Hey there darling it's been 3 week's n...
6   ham  Even my brother is not like to speak with me. ...
7   ham  As per your request 'Melle Melle (Oru Minnamin...
8  spam  WINNER!! As a valued network customer you have...
9  spam  Had your mobile 11 months or more? U R entitle...


In [85]:
#Dataset specifications

#Total records
len_total = len(SMS_dataset)
print(len_total)

#Total ham messages
tot_ham = len(SMS_dataset[SMS_dataset.Type == 'ham'])
print(tot_ham)

#Total spam messages
tot_spam = len(SMS_dataset[SMS_dataset.Type == 'spam'])
print(tot_spam)

5081
4392
689
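
The counts printed above imply a noticeably imbalanced data set. The short sketch below works out each class's share from those three numbers; it is plain arithmetic on the reported counts, not part of the original analysis.

```python
# Counts reported by the cell above.
total, ham, spam = 5081, 4392, 689

# The two classes should account for every record.
assert ham + spam == total

spam_share = spam / float(total)
ham_share = ham / float(total)
print("spam: %.1f%%, ham: %.1f%%" % (100 * spam_share, 100 * ham_share))
```

Roughly 13.6% of the messages are spam, so a model that always answers "ham" would already be right about 86.4% of the time. Accuracy alone may therefore be a weak yardstick for any classifier trained on this data.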


# Operations on Data set:

* Removed extra tabs (\t) present in the text messages.

* Created a data set with the following columns:
 * Type : Type of message. Value is '0' for a ham message and '1' for a spam message.
 * Message : Text of the message.
 * Length : Length of the message.
 * Exclaim : Number of exclamation marks in the message ('0' if there are none).
 * Link : '1' if a website URL is present in the message, else '0'.
 * Has_large_number : '1' if the message contains any number with more than 4 digits, else '0'.
 * Uppercase_letters : Number of uppercase letters in the message.
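
As a rough illustration of the columns listed above, the following sketch computes the same features for a single message. The function name `extract_features`, the sample message and the simplified regular expressions are assumptions made for this example; in particular, "more than 4 digits" is read here as a run of five or more consecutive digits, which is simpler than the pattern used in the notebook cells below.

```python
import re

# Simplified patterns (assumptions for this sketch, not the notebook's exact regexes).
URL_RE = re.compile(r"https?://\S+")
# "More than 4 digits" is read here as a run of 5 or more consecutive digits.
LARGE_NUMBER_RE = re.compile(r"\d{5,}")

def extract_features(message):
    """Return the per-message features described in the list above."""
    text = message.strip("\t")  # drop leading/trailing tabs
    return {
        "Length": len(text),
        "Exclaim": text.count("!"),
        "Link": 1 if URL_RE.search(text) else 0,
        "Has_large_number": 1 if LARGE_NUMBER_RE.search(text) else 0,
        "Uppercase_letters": sum(1 for ch in text if "A" <= ch <= "Z"),
    }

# Hypothetical message, written for this example only.
features = extract_features("WIN cash now!! Call 0800123456 or see http://example.com\t\t")
print(features)
```

Keeping the feature logic in one function makes it easy to test on single messages before applying it to the whole data set, for example with `df['Message'].apply(extract_features)`.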


In [86]:
#Removing extra tabs from dataset
#Note: strip('\t') removes leading and trailing tabs only

remove_tab = lambda x: x.strip('\t')
df = pd.DataFrame(columns=['Type','Message','Length','Exclaim','Link','Has_large_number','Uppercase_letters'])
df['Message'] = SMS_dataset['Message'].apply(remove_tab)

#Encoding the message type: '0' for ham, '1' for spam

for i in range(0, len(df)):
    if SMS_dataset.loc[i, 'Type'] == 'ham':
        df.loc[i, 'Type'] = '0'
    else:
        df.loc[i, 'Type'] = '1'

#Counting exclamation marks in each message

exclaim = lambda x: x.count('!')
df['Exclaim'] = df['Message'].apply(exclaim)
print(df)

#Checking for website URL in each message

for i in range(0, len(df)):
    myString = df.loc[i, 'Message']
    if re.search(r"(?P<url>https?://[^\s]+)", myString) is not None:
        df.loc[i, 'Link'] = '1'
    else:
        df.loc[i, 'Link'] = '0'
print(df)

     Type                                            Message Length  Exclaim  \
0       0  @@.comGo until jurong point  crazy.. Available...    NaN        0   
1       0                      Ok lar... Joking wif u oni...    NaN        0   
2       1  Free entry in 2 a wkly comp to win FA Cup fina...    NaN        0   
3       0  U dun say so early hor... U c already then say...    NaN        0   
4       0  Nah I don't think he goes to usf  he lives aro...    NaN        0   
5       1  FreeMsg Hey there darling it's been 3 week's n...    NaN        2   
6       0  Even my brother is not like to speak with me. ...    NaN        0   
7       0  As per your request 'Melle Melle (Oru Minnamin...    NaN        0   
8       1  WINNER!! As a valued network customer you have...    NaN        3   
9       1  Had your mobile 11 months or more? U R entitle...    NaN        1   
10      0  I'm gonna be home soon and i don't want to tal...    NaN        0   
11      1  SIX chances to win CASH! From

In [98]:
#Find large numbers, uppercase-letter count and length for each message
for i in range(0, len(df)):
    myString = df.loc[i, 'Message']

    #Flag messages containing a number with more than 4 digits
    #(the length check also counts any decimal point or exponent inside the match)
    numbers = re.findall(r"[-+]?\d+[\.]?\d*[eE]?[-+]?\d*", myString)
    has_large = '0'
    for number in numbers:
        if len(number.strip('.+-')) > 4:
            has_large = '1'
            break
    df.loc[i, 'Has_large_number'] = has_large

    #Find count of uppercase letters in every message
    #(re.findall always returns a list, so no None check is needed)
    df.loc[i, 'Uppercase_letters'] = len(re.findall("[A-Z]", myString))

    #Find length of messages
    df.loc[i, 'Length'] = len(myString)
print(df.head(10))

#Storing dataset in csv file

df.to_csv('SMS_Final_dataset.csv', sep=',', encoding='utf-8')

  Type                                            Message Length  Exclaim  \
0    0  @@.comGo until jurong point  crazy.. Available...    117        0   
1    0                      Ok lar... Joking wif u oni...     29        0   
2    1  Free entry in 2 a wkly comp to win FA Cup fina...    155        0   
3    0  U dun say so early hor... U c already then say...     49        0   
4    0  Nah I don't think he goes to usf  he lives aro...     61        0   
5    1  FreeMsg Hey there darling it's been 3 week's n...    148        2   
6    0  Even my brother is not like to speak with me. ...     77        0   
7    0  As per your request 'Melle Melle (Oru Minnamin...    160        0   
8    1  WINNER!! As a valued network customer you have...    158        3   
9    1  Had your mobile 11 months or more? U R entitle...    154        1   

  Link Has_large_number Uppercase_letters  
0    0                0                 3  
1    0                0                 2  
2    0              