## Goal:  Train a Naive Bayes model to classify future SMS messages as either spam or ham.

Steps:

1.  Convert the labels ham and spam to a binary indicator variable (0/1)

2.  Convert the text to a sparse matrix of TF-IDF vectors

3.  Fit a Naive Bayes Classifier

4.  Measure your success using roc_auc_score



In [89]:
import pandas as pd
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import naive_bayes
from sklearn.metrics import roc_auc_score

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ohm\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [90]:
df= pd.read_csv("C:/Users/ohm/Downloads/Machine Learning Exercises/Naive Bayes/sms_spam.csv")

In [91]:
df.head()

Unnamed: 0,type,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [92]:
df.shape

(5574, 2)

#### Train the classifier to predict spam or ham from the message text

In [93]:
#TFIDF Vectorizer
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stopset)

In [94]:
print(vectorizer)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words={'hadn', 'there', 'on', 'below', 'isn', 'whom', 'up', "shan't", "you'll", "don't", 'm', 'out', 'aren', "it's", 'weren', 'so', 'did', 'll', 'ma', 'down', 'have', 'own', 'your', 'that', 'again', "mustn't", 'both', 'yours', "doesn't", 'while', 'shouldn', 'off', 'nor', 'as', 'through', 'd', "...", "hadn't", 'himself', 'now', 'him', 'these', 'few', 'at', 'their', 'more', 'wouldn', 'and', 'can'},
        strip_accents='ascii', sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)


In [95]:
# Fit on the text column; fitting on the DataFrame itself would iterate over its column names
vectorizer.fit(df.text)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words={'hadn', 'there', 'on', 'below', 'isn', 'whom', 'up', "shan't", "you'll", "don't", 'm', 'out', 'aren', "it's", 'weren', 'so', 'did', 'll', 'ma', 'down', 'have', 'own', 'your', 'that', 'again', "mustn't", 'both', 'yours', "doesn't", 'while', 'shouldn', 'off', 'nor', 'as', 'through', 'd', "...", "hadn't", 'himself', 'now', 'him', 'these', 'few', 'at', 'their', 'more', 'wouldn', 'and', 'can'},
        strip_accents='ascii', sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)

#### Convert spam and ham to 1 and 0, respectively, so they can serve as a binary target

In [96]:
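# Assign the result back; replace(..., inplace=True) on df.type may not update df in recent pandas versions
df['type'] = df['type'].replace('spam', 1)

In [97]:
df['type'] = df['type'].replace('ham', 0)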

In [98]:
df.head()

Unnamed: 0,type,text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [99]:
df.shape

(5574, 2)

In [100]:
## Our dependent variable is the label: 1 for spam, 0 for ham
y = df.type

In [101]:
# Convert df.text from raw text to TF-IDF features
X = vectorizer.fit_transform(df.text)

In [102]:
X.shape

(5574, 8586)

In [103]:
X

<5574x8586 sparse matrix of type '<class 'numpy.float64'>'
	with 47400 stored elements in Compressed Sparse Row format>

### TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

### IDF(t) = log_e(Total number of documents / Number of documents with term t in it).

### tf-idf score = TF(t) * IDF(t)
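The formulas above can be checked by hand on a tiny corpus. The sketch below is only an illustration: the three tokenized messages are made up, and the helper names `tf`, `idf`, and `tf_idf` are hypothetical. It implements the textbook definitions as written here; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Toy corpus (hypothetical messages, already tokenized)
docs = [
    ["free", "entry", "win", "cup"],
    ["ok", "joking", "ok"],
    ["win", "free", "prize", "free"],
]

def tf(term, doc):
    # Number of times the term appears / total number of terms in the document
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (total documents / documents containing the term)
    n_containing = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "free" is frequent in the third message but also appears in another message;
# "prize" appears once but only in this message, so its IDF is larger.
score_free = tf_idf("free", docs[2], docs)
score_prize = tf_idf("prize", docs[2], docs)
print(score_free, score_prize)
```

Even though "free" occurs twice in the third message, "prize" gets the higher score because it is rarer across the corpus.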

In [104]:
## Splitting the first SMS into individual whitespace-separated tokens
splt_txt1=df.text[0].split()
print(splt_txt1)

['Go', 'until', 'jurong', 'point,', 'crazy..', 'Available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet...', 'Cine', 'there', 'got', 'amore', 'wat...']


In [105]:
## max() on a list of strings returns the lexicographically largest token, not the most frequent word
max(splt_txt1)

'world'
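Note that `max()` on a list of strings compares them alphabetically, so `'world'` is simply the token that sorts last. If the goal is word frequency, counting is needed instead. A minimal sketch using `collections.Counter` on the token list printed above (the variable names are chosen here for illustration):

```python
from collections import Counter

# Tokens of the first SMS, copied from the split() output above
tokens = ['Go', 'until', 'jurong', 'point,', 'crazy..', 'Available', 'only',
          'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet...',
          'Cine', 'there', 'got', 'amore', 'wat...']

# Lexicographic maximum: uppercase sorts before lowercase, 'wo' sorts after 'wa'
last_alphabetically = max(tokens)

# Frequency count, ignoring case
counts = Counter(t.lower() for t in tokens)
most_common_word, most_common_count = counts.most_common(1)[0]

print(last_alphabetically)
print(most_common_word, most_common_count)
```

In this particular message every token occurs exactly once, so there is no single most frequent word.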

In [106]:
## Count the number of words in the first SMS
len(splt_txt1)

20

### The first SMS splits into 20 whitespace-separated tokens, but only 14 of them become features: the vectorizer removes the English stop words (until, only, in, there) and its token pattern drops single-character tokens (n, e). That is why the first SMS has only 14 tf-idf values. Every other SMS is tokenized the same way.
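The drop from 20 tokens to 14 can be reproduced without scikit-learn. The sketch below applies the `token_pattern` shown in the vectorizer printout to the first SMS and then filters stop words. `stop_subset` is a hypothetical stand-in that lists only the stop words that matter for this message, on the assumption that all four are in the NLTK English list.

```python
import re

# First SMS, rebuilt from the split() output shown earlier
words = ['Go', 'until', 'jurong', 'point,', 'crazy..', 'Available', 'only',
         'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet...',
         'Cine', 'there', 'got', 'amore', 'wat...']
text = " ".join(words)

# Same pattern as in the TfidfVectorizer printout: tokens of 2+ word characters
token_pattern = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical subset of the stop word list, enough for this message
stop_subset = {"until", "only", "in", "there"}

tokens = token_pattern.findall(text.lower())        # drops 'n' and 'e'
kept = [t for t in tokens if t not in stop_subset]  # drops the stop words

print(len(words), len(tokens), len(kept))
print(kept)
```

The 14 surviving tokens correspond to the 14 stored elements reported for `X[0]`.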

In [107]:
X[0]

<1x8586 sparse matrix of type '<class 'numpy.float64'>'
	with 14 stored elements in Compressed Sparse Row format>

## In each `(row, column)  value` entry of the printed matrix, row 0 is the first SMS, the column (3536, 4316, ...) is the word's position in the vocabulary, and the value (0.157, 0.347, 0.271, ...) is that word's tf-idf score. The entries for the later SMSes read the same way.

In [108]:
print(X)

  (0, 3536)	0.1570070817542793
  (0, 4316)	0.3466185073652293
  (0, 5877)	0.2711124074492608
  (0, 2316)	0.26843531434169243
  (0, 1301)	0.25926284833436075
  (0, 1746)	0.2928268764441005
  (0, 3620)	0.19147848622350877
  (0, 8428)	0.23446497404204308
  (0, 4442)	0.2928268764441005
  (0, 1744)	0.3308854638944828
  (0, 2038)	0.2928268764441005
  (0, 3580)	0.1625034702178997
  (0, 1074)	0.3466185073652293
  (0, 8218)	0.19367543856970723
  (1, 5466)	0.27190435673704183
  (1, 4478)	0.4083285209202484
  (1, 4284)	0.5236769406481622
  (1, 8333)	0.4316309977097208
  (1, 5493)	0.5466195966483365
  (2, 3340)	0.11532016948053561
  (2, 2931)	0.3598966605883333
  (2, 8387)	0.19049443007546943
  (2, 2155)	0.19443486429295845
  (2, 8345)	0.14768604533962174
  (2, 3068)	0.46962403601340863
  :	:
  (5569, 165)	0.3330442123216397
  (5569, 5384)	0.3330442123216397
  (5570, 3876)	0.3652144637345925
  (5570, 3549)	0.3642455181785356
  (5570, 3327)	0.5597074067013798
  (5570, 2963)	0.6485917181474956
  (55

In [109]:
vectorizer.get_feature_names()[4316]  ## Column 4316 is the word 'jurong' (scikit-learn >= 1.2 replaces this method with get_feature_names_out())

'jurong'

## Second SMS

In [110]:
## Splitting the second SMS into individual whitespace-separated tokens
splt_txt2=df.text[1].split()
print(splt_txt2)

['Ok', 'lar...', 'Joking', 'wif', 'u', 'oni...']


In [111]:
len(splt_txt2)

6

In [112]:
X[1]## Second SMS

<1x8586 sparse matrix of type '<class 'numpy.float64'>'
	with 5 stored elements in Compressed Sparse Row format>

In [113]:
## max() returns the lexicographically largest token of the second SMS, not its most frequent word
max(splt_txt2)

'wif'

### The second SMS has 6 tokens, but only 5 become features because the token pattern drops the single-character token 'u'. That is why
### the second SMS has only 5 tf-idf values. Every other SMS is tokenized the same way.

In [114]:
## The alphabetically last word in the vocabulary (max() does not measure frequency)
max(vectorizer.get_feature_names())

'zyada'

In [115]:
print (y.shape)
print (X.shape)

(5574,)
(5574, 8586)


In [116]:
## Split into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

In [117]:
## Train the Naive Bayes classifier
## Fast (one pass over the data)
## Handles sparse data well: most of the 8586 vocabulary words do not occur in any single observation
clf = naive_bayes.MultinomialNB()
model=clf.fit(X_train, y_train)
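To see roughly what a multinomial Naive Bayes fit does, here is a minimal pure-Python sketch with add-one (Laplace) smoothing. It is not scikit-learn's implementation: the training messages are made up, the helper names `fit` and `predict` are hypothetical, and it works on raw word counts rather than the TF-IDF values used in this notebook.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training messages: 1 = spam, 0 = ham
train = [
    ("win free prize now", 1),
    ("free entry win cup", 1),
    ("ok see you later", 0),
    ("are you joking ok", 0),
]

def fit(samples, alpha=1.0):
    """One pass over the data: count documents per class and words per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in samples:
        class_docs[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    n = sum(class_docs.values())
    log_prior = {c: math.log(k / n) for c, k in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        # Smoothed denominator: total words in class + alpha per vocabulary word
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict(text, log_prior, log_lik):
    """Pick the class with the highest log prior + summed log likelihoods."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in text.split() if w in log_lik[c]
        )
    return max(scores, key=scores.get)

log_prior, log_lik = fit(train)
pred_spam = predict("free prize", log_prior, log_lik)
pred_ham = predict("ok see you", log_prior, log_lik)
print(pred_spam, pred_ham)
```

The smoothing term keeps words that never appeared in a class from producing a zero probability, which matters when most vocabulary words are absent from any given message.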

In [118]:
predicted_class=model.predict(X_test)
print(predicted_class)

[0 0 0 ... 0 0 0]


### The first three test SMSes are predicted as ham (0) from the tf-idf scores of their words, which matches their true labels in y_test below

In [127]:
print(y_test.head(10))

3690    0
3527    0
724     0
3370    0
468     0
5412    0
4362    0
4241    0
5442    0
5309    0
Name: type, dtype: int64


In [120]:
df.loc[[19]]

Unnamed: 0,type,text
19,1,England v Macedonia - dont miss the goals/team...


In [121]:
predicted_class[19]  ## Caution: this is the prediction at position 19 of the test set, not for row 19 of df; train_test_split shuffles rows, so compare it with y_test.iloc[19]

0
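Because `train_test_split` shuffles the rows, position `i` in `predicted_class` corresponds to the `i`-th entry of `y_test.index`, not to row `i` of `df`. A small sketch of looking predictions up by original row label: the index values are the ones printed by `y_test.head(10)`, while the `predicted` list is a hypothetical stand-in for the first ten model outputs.

```python
# Original row labels of the first ten test samples, from y_test.head(10)
test_index = [3690, 3527, 724, 3370, 468, 5412, 4362, 4241, 5442, 5309]

# Hypothetical stand-in for predicted_class[:10]
predicted = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Map each original row label to its prediction
by_row = dict(zip(test_index, predicted))

print(by_row[724])    # prediction for df row 724, stored at test position 2
print(19 in by_row)   # row 19 is not among these ten test rows
```

In the notebook itself, the same lookup could be built from `y_test.index` and `predicted_class`.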

#### Check for null values in the type column

In [122]:
df[df.type.isnull()]

Unnamed: 0,type,text


#### There are no null values

### Find the probability of assigning an SMS to each class

In [123]:
prd=model.predict_proba(X_test)

In [124]:
prd

array([[9.97117603e-01, 2.88239699e-03],
       [9.84580163e-01, 1.54198369e-02],
       [9.32967559e-01, 6.70324405e-02],
       ...,
       [9.99557895e-01, 4.42105374e-04],
       [9.91588725e-01, 8.41127472e-03],
       [9.85902925e-01, 1.40970754e-02]])

In [125]:
clf.predict_proba(X_test)[:,1]

array([0.0028824 , 0.01541984, 0.06703244, ..., 0.00044211, 0.00841127,
       0.01409708])

In [126]:
## Evaluate the model with ROC AUC, using the predicted probability of the spam class
roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

0.9879151861342662

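### The model's ROC AUC on the test set is ~0.988. This is a ranking measure rather than a classification accuracy: a randomly chosen spam message receives a higher spam probability than a randomly chosen ham message about 98.8% of the time.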
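ROC AUC can be read as the fraction of (spam, ham) pairs in which the spam message gets the higher score, with ties counted as half. The sketch below computes it that way on four made-up labels and scores; the function name `roc_auc` is chosen here for illustration and is not the scikit-learn implementation, which uses a more efficient method.

```python
def roc_auc(y_true, scores):
    """Pairwise definition of ROC AUC: P(score of a positive > score of a negative)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Each positive/negative pair scores 1 for a win, 0.5 for a tie, 0 otherwise
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted spam probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)
```

Three of the four spam/ham pairs are ranked correctly here, so the result is 0.75.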