## Importing the necessary packages first

In [213]:
import pandas as pd

## Training text with 4 records

In [214]:
train_text = ['Where are you','Love you so much','Call me when you are free','I am good human']

**CountVectorizer** is the scikit-learn class used to **convert text into a sparse matrix** of token counts, treating each **unique word** in the records as a **feature**

In [215]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

### For numeric data we usually fit and predict, but for text we first use the vectorizer's fit and transform methods

In [216]:
vect.fit(train_text)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

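In fit, the vectorizer learns the **vocabulary of the sentences** in train_text: it lowercases the text, splits it into tokens, **drops punctuation and single-character tokens** (the default token_pattern keeps only words of two or more characters, so 'I' is discarded) and **keeps each distinct word once**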
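A rough idea of what fit does can be sketched in plain Python. This is only an illustration under the default settings shown above (lowercasing and the token_pattern `(?u)\b\w\w+\b`), not scikit-learn's actual implementation; the helper name `learn_vocabulary` is hypothetical.

```python
import re

# Same pattern as CountVectorizer's default token_pattern: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def learn_vocabulary(documents):
    """Hypothetical helper: map each distinct lowercased token to a column index."""
    tokens = set()
    for doc in documents:
        tokens.update(TOKEN_PATTERN.findall(doc.lower()))
    # Sort alphabetically so that the column order is deterministic.
    return {token: index for index, token in enumerate(sorted(tokens))}

train_text = ['Where are you', 'Love you so much',
              'Call me when you are free', 'I am good human']
vocabulary = learn_vocabulary(train_text)
print(len(vocabulary))    # 13
print('i' in vocabulary)  # False: single-character tokens are dropped
```

The word 'you' appears in three sentences but gets a single column, which is why four short sentences yield only 13 features.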

In [211]:
train_dtm = vect.transform(train_text) 
train_dtm

<4x13 sparse matrix of type '<class 'numpy.int64'>'
	with 16 stored elements in Compressed Sparse Row format>

In transform, we use the **fitted vocabulary** to build a **document-term matrix** with one row per **message** and one column per **unique word** in these sentences. A shape of **4x13** means the train set has **4 messages** and a total of **13 unique words**; the **16 stored elements** are the non-zero counts
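The counting step can be sketched in plain Python as well. The helper `build_dtm` below is hypothetical and returns a dense list of lists, whereas scikit-learn returns a sparse matrix; it is meant only to show where the shape and the number of stored elements come from.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # default CountVectorizer token pattern

def tokenize(doc):
    return TOKEN_PATTERN.findall(doc.lower())

def build_dtm(documents, vocabulary):
    """Hypothetical helper: one row per document, one count column per vocabulary word."""
    rows = []
    for doc in documents:
        row = [0] * len(vocabulary)
        for token in tokenize(doc):
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

train_text = ['Where are you', 'Love you so much',
              'Call me when you are free', 'I am good human']
words = sorted({token for doc in train_text for token in tokenize(doc)})
vocabulary = {word: index for index, word in enumerate(words)}

dtm = build_dtm(train_text, vocabulary)
shape = (len(dtm), len(dtm[0]))
non_zero = sum(1 for row in dtm for count in row if count)
print(shape, non_zero)  # (4, 13) 16
```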

### Getting the feature names, i.e. the unique words in the train records, using the below function

In [217]:
vect.get_feature_names() # on scikit-learn 1.0 and later, use vect.get_feature_names_out() instead

['am',
 'are',
 'call',
 'free',
 'good',
 'human',
 'love',
 'me',
 'much',
 'so',
 'when',
 'where',
 'you']

### Converting the sparse matrix into a dense array using the toarray() function and making it a dataframe for proper readability

In [218]:
pd.DataFrame(train_dtm.toarray(), columns=vect.get_feature_names())

,am,are,call,free,good,human,love,me,much,so,when,where,you
0,0,1,0,0,0,0,0,0,0,0,0,1,1
1,0,0,0,0,0,0,1,0,1,1,0,0,1
2,0,1,1,1,0,0,0,1,0,0,1,0,1
3,1,0,0,0,1,1,0,0,0,0,0,0,0


## Now applying that concept in real dataset

In [9]:
path = 'C:/Users/HP/Downloads/Selection_sklearn/Machine_Learning_with_Text/pycon-2016-tutorial-master/data/sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

In [10]:
sms.head()

,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


In [11]:
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

Changing **text labels into numeric values** by mapping **ham to 0** and **spam to 1**

In [221]:
sms['label_num']=sms.label.map({'ham':0, 'spam':1})

In [222]:
sms.head()

,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


### Using message as the only input feature and label_num as the target

In [14]:
x = sms.message
y = sms.label_num
print(x.shape)
print(y.shape)

(5572,)
(5572,)


### Now split the original data into train and test sets to evaluate the performance of the algorithm

In [15]:
from sklearn.model_selection import train_test_split # sklearn.cross_validation was removed in scikit-learn 0.20
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=1) # default: 75% training and 25% testing
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)
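The sizes above follow directly from the default 25% test share. As a minimal sketch (the helper `simple_train_test_split` is hypothetical and will not reproduce scikit-learn's exact shuffle), a split can be imitated with the standard library:

```python
import math
import random

def simple_train_test_split(items, test_size=0.25, seed=1):
    """Hypothetical helper: shuffle the indices, then cut off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test

messages = list(range(5572))  # stand-in for the 5572 SMS messages
train, test = simple_train_test_split(messages)
print(len(train), len(test))  # 4179 1393
```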




In [16]:
vect1 = CountVectorizer()

In [17]:
X_train_dtm = vect1.fit_transform(X_train)

### For the test data, we just need to transform because the CountVectorizer is already trained, i.e. fitted, on the train data

In [18]:
X_test_dtm = vect1.transform(X_test)

**Another point to note** is that the **test samples are encoded with the vocabulary of the train samples**. If a word in a **test sample** is not part of the **vocabulary of the train samples**, transform **ignores that word**, so it contributes nothing to the document term matrix.
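A small plain-Python sketch makes the handling of unseen words concrete. It reuses the four toy training sentences from the first part; the test message is made up for illustration, and the code only mimics the default tokenisation rather than calling scikit-learn.

```python
import re

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # default CountVectorizer token pattern

train_text = ['Where are you', 'Love you so much',
              'Call me when you are free', 'I am good human']
# Vocabulary learned from the training text only.
vocabulary = sorted({token for doc in train_text
                     for token in TOKEN_PATTERN.findall(doc.lower())})

test_message = 'Please call me tonight'  # hypothetical test message
test_tokens = TOKEN_PATTERN.findall(test_message.lower())

kept = [token for token in test_tokens if token in vocabulary]
ignored = [token for token in test_tokens if token not in vocabulary]
print(kept)     # ['call', 'me']
print(ignored)  # ['please', 'tonight']
```

Only the kept tokens would receive counts in the test document-term matrix; the number of columns stays fixed by the training vocabulary.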

In [19]:
X_test_dtm.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### Now for the classification process, we use Multinomial Naive Bayes. Here as usual: import, instantiate, fit and predict

In [20]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

For **CountVectorizer()**, we have to **pass the raw text** as the **parameter** of fit. But for the **algorithm** that **classifies the text**, we have to pass the **document term matrix produced by the vectorizer** (together with the labels) as the **parameters** for fit

In [21]:
%time nb.fit(X_train_dtm, y_train)

Wall time: 4 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
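To give a feel for what the classifier learns from the counts, here is a compact plain-Python sketch of multinomial Naive Bayes with add-one smoothing (matching alpha=1.0 above). It is an illustration under simplifying assumptions, not scikit-learn's implementation; the helpers `fit_nb` and `predict_nb` and the toy messages are made up.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def fit_nb(messages, labels, alpha=1.0):
    """Hypothetical helper: class priors and smoothed token log-probabilities."""
    vocabulary = sorted({t for m in messages for t in tokenize(m)})
    model = {}
    for cls in sorted(set(labels)):
        counts = Counter(t for m, y in zip(messages, labels) if y == cls
                         for t in tokenize(m))
        total = sum(counts.values()) + alpha * len(vocabulary)
        model[cls] = {
            'log_prior': math.log(labels.count(cls) / len(labels)),
            'log_prob': {t: math.log((counts[t] + alpha) / total) for t in vocabulary},
        }
    return model

def predict_nb(model, message):
    """Pick the class with the highest log prior plus summed token log-probabilities."""
    scores = {}
    for cls, params in model.items():
        known = [t for t in tokenize(message) if t in params['log_prob']]
        scores[cls] = params['log_prior'] + sum(params['log_prob'][t] for t in known)
    return max(scores, key=scores.get)

# Toy training data (made up for illustration): 0 = ham, 1 = spam
messages = ['Call me when you are free', 'Love you so much',
            'Win a free prize now', 'Claim your prize now']
labels = [0, 0, 1, 1]
model = fit_nb(messages, labels)
print(predict_nb(model, 'free prize'))  # 1
print(predict_nb(model, 'call me'))     # 0
```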

In [22]:
y_pred_class = nb.predict(X_test_dtm)

### Checking the accuracy of the algorithm using the accuracy_score function from the metrics module

In [23]:
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.98851399856424982

### Confusion matrix is another function from metrics which describes the classification performance, not by a single score but by counts of correct and wrong predictions per class

In [24]:
metrics.confusion_matrix(y_test, y_pred_class)
# rows are the actual classes and columns are the predicted classes
# out of 1393 test samples, 1208 are actually ham and 185 are actually spam
# out of 1208 ham messages, 1203 are correctly predicted as ham and 5 are wrongly predicted as spam by the algorithm
# out of 185 spam messages, 174 are correctly predicted as spam and 11 are wrongly predicted as ham by the algorithm

array([[1203,    5],
       [  11,  174]], dtype=int64)
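The accuracy score reported earlier can be recovered from this matrix by hand. The sketch below assumes scikit-learn's convention that rows are actual classes and columns are predicted classes, and uses only the four numbers shown in the output above.

```python
# Confusion matrix from the output above: rows = actual, columns = predicted
confusion = [[1203, 5],
             [11, 174]]

tn, fp = confusion[0]  # actual ham: predicted ham, predicted spam
fn, tp = confusion[1]  # actual spam: predicted ham, predicted spam

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
sensitivity = tp / (tp + fn)  # share of spam messages that were caught
specificity = tn / (tn + fp)  # share of ham messages that were left alone

print(total)                  # 1393
print(round(accuracy, 6))     # 0.988514
print(round(sensitivity, 4))  # 0.9405
print(round(specificity, 4))  # 0.9959
```

The 5 false positives and 11 false negatives are exactly the messages listed in the next two cells.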

In [241]:
X_test[y_pred_class > y_test] # Printing out messages which are ham but wrongly predicted as spam

574               Waiting for your call.
3375             Also andros ice etc etc
45      No calls..messages..missed calls
3415             No pic. Please re-send.
1988    No calls..messages..missed calls
Name: message, dtype: object

In [242]:
X_test[y_pred_class < y_test] # Printing out messages which are spam but wrongly predicted as ham

3132    LookAtMe!: Thanks for your purchase of a video...
5       FreeMsg Hey there darling it's been 3 week's n...
3530    Xmas & New Years Eve tickets are now on sale f...
684     Hi I'm sue. I am 20 years old and work as a la...
1875    Would you like to see my XXX pics they are so ...
1893    CALL 09090900040 & LISTEN TO EXTREME DIRTY LIV...
4298    thesmszone.com lets you send free anonymous an...
4949    Hi this is Amy, we will be sending you a free ...
2821    INTERFLORA - It's not too late to order Inter...
2247    Hi ya babe x u 4goten bout me?' scammers getti...
4514    Money i have won wining number 946 wot do i do...
Name: message, dtype: object

In [232]:
y_pred_prob = nb.predict_proba(X_test_dtm)[:,1] # taking the probability of each message being spam (column 1), used for the AUC score
y_pred_prob

array([  2.87744864e-03,   1.83488846e-05,   2.07301295e-03, ...,
         1.09026171e-06,   1.00000000e+00,   3.98279868e-09])

**predict_proba** is the method which returns the **probability of each text message being a ham or a spam message.**

**For example:** for the **first message**, about **0.997** is the **probability** that the message is **ham** and about **0.003** is the **probability** that the message is **spam**
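As a quick check on the values printed above, the ham probability is simply one minus the spam probability, and for a two-class problem taking the more probable class amounts to thresholding the spam probability at 0.5. The list below contains only the six values visible in the truncated output.

```python
# Spam probabilities copied from the y_pred_prob output above (only the values shown)
spam_probs = [2.87744864e-03, 1.83488846e-05, 2.07301295e-03,
              1.09026171e-06, 1.00000000e+00, 3.98279868e-09]

ham_probs = [1 - p for p in spam_probs]          # the two class probabilities sum to 1
predicted = [int(p > 0.5) for p in spam_probs]   # 1 = spam, 0 = ham

print(round(ham_probs[0], 4))  # 0.9971
print(predicted)               # [0, 0, 0, 0, 1, 0]
```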

In [233]:
y_test.value_counts()

0    1208
1     185
Name: label_num, dtype: int64

In [234]:
len(y_pred_prob)

1393

In [239]:
metrics.roc_auc_score(y_test, y_pred_prob) 

0.98664310005369615

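**roc_auc_score** computes the **area under the ROC curve**, a metric for **binary classification** (scikit-learn also supports multiclass and multilabel tasks) where you can send **continuous** predicted values to be compared with the actual values, which are **discrete**. It is not an accuracy score: it measures how well the predicted scores rank spam messages above ham messages. The **predicted values** can be the **probability results of a particular class**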
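One way to read the AUC is as the fraction of (spam, ham) pairs in which the spam message receives the higher score. The sketch below computes it that way on a tiny made-up example; `pairwise_auc` is a hypothetical helper, and its quadratic loop is meant for illustration rather than for large test sets.

```python
def pairwise_auc(y_true, y_score):
    """Hypothetical helper: fraction of (spam, ham) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and scores for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(y_true, y_score))  # 0.75
```

Three of the four (spam, ham) pairs are ordered correctly here, so the score is 0.75 even though no threshold has been chosen.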

In [31]:
len(vect1.get_feature_names())

7456

In [32]:
len(nb.feature_count_)

2

## Finding the spamminess of each token for some interesting insights

In [119]:
X_train_tokens = vect1.get_feature_names()
len(X_train_tokens)

7456

In [120]:
print(X_train_tokens[0:50]) # printing the first 50 features; the list is sorted alphabetically by default

['00', '000', '008704050406', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '0430', '05', '050703', '0578', '06', '07', '07008009200', '07090201529', '07090298926', '07123456789', '07732584351', '07734396839', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '078', '07801543489', '07808', '07808247860', '07808726822', '07815296484', '07821230901', '07880867867', '0789xxxxxxx', '07946746291', '0796xxxxxx', '07973788240', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705']


### Now finding the number of times each feature appeared in ham and spam messages using feature_count_

In [244]:
nb.feature_count_.shape
# First row contains the count of each feature, i.e. token, in ham messages
# Second row contains the count of each feature in spam messages

(2, 7456)

**For example**, the feature **'00'** appeared **0 times** in **ham** messages and **5 times** in **spam** messages

In [158]:
ham_token_count = nb.feature_count_[0, :] # taking the feature counts in ham messages alone
ham_token_count = ham_token_count.astype(int)

In [159]:
spam_token_count = nb.feature_count_[1, :] # taking the feature counts in spam messages alone
spam_token_count = spam_token_count.astype(int)

In [160]:
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
tokens.head()

token,ham,spam
00,0,5
000,0,23
008704050406,0,2
0121,0,1
01223585236,0,1


In [161]:
tokens.sample(10, random_state=6)

token,ham,spam
very,64,2
nasty,1,1
villa,0,1
beloved,1,0
textoperator,0,2
arng,2,0
1013,0,1
scores,1,1
nahi,2,0
long,35,0


In [162]:
nb.class_count_

array([ 3617.,   562.])

### Adding 1 to each count to avoid a 'Divide by zero' error

In [164]:
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)

token,ham,spam
very,65,3
nasty,2,2
villa,1,2
beloved,2,1
textoperator,1,3


Normalising the ham and spam counts by **dividing** each count by the total number of ham and spam messages respectively

In [165]:
tokens['ham'] = tokens['ham'] / nb.class_count_[0]
tokens['spam'] = tokens['spam'] / nb.class_count_[1]
tokens.sample(5, random_state=6)

token,ham,spam
very,0.017971,0.005338
nasty,0.000553,0.003559
villa,0.000276,0.003559
beloved,0.000553,0.001779
textoperator,0.000276,0.005338


In [166]:
tokens['spam']

token
00              0.010676
000             0.042705
008704050406    0.005338
0121            0.003559
01223585236     0.003559
01223585334     0.005338
0125698789      0.001779
02              0.008897
0207            0.007117
02072069400     0.003559
02073162414     0.005338
02085076972     0.003559
021             0.005338
03              0.012456
04              0.017794
0430            0.003559
05              0.003559
050703          0.003559
0578            0.003559
06              0.007117
07              0.005338
07008009200     0.003559
07090201529     0.003559
07090298926     0.003559
07123456789     0.003559
07732584351     0.003559
07734396839     0.005338
07742676969     0.005338
0776xxxxxxx     0.005338
07781482378     0.003559
                  ...   
yourjob         0.001779
yours           0.023132
yourself        0.003559
youwanna        0.001779
yowifes         0.001779
yoyyooo         0.001779
yr              0.017794
yrs             0.007117
ything          0.0

### Finding the spam and ham ratio for all tokens: the higher the ratio, the more strongly the token indicates that class

In [167]:
tokens['spam_ratio'] = tokens.spam / tokens.ham # adding 1 earlier helps here: if a token's count in ham were 0, this division would give Inf
tokens['ham_ratio'] = tokens.ham / tokens.spam
tokens.sample(5, random_state=6)

token,ham,spam,spam_ratio,ham_ratio
very,0.017971,0.005338,0.297044,3.36651
nasty,0.000553,0.003559,6.435943,0.155377
villa,0.000276,0.003559,12.871886,0.077689
beloved,0.000553,0.001779,3.217972,0.310755
textoperator,0.000276,0.005338,19.307829,0.051792
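The numbers in this table can be reproduced by hand for a single token. The sketch below redoes the three steps (add 1, normalise by class size, take the ratio) for the token 'very', using only the counts and class totals shown in the outputs above.

```python
# Values taken from the notebook output above for the token 'very'
ham_count, spam_count = 64, 2            # raw counts in ham and spam messages
ham_messages, spam_messages = 3617, 562  # nb.class_count_

# Add 1 to each count, then normalise by the number of messages in the class.
ham_freq = (ham_count + 1) / ham_messages
spam_freq = (spam_count + 1) / spam_messages

spam_ratio = spam_freq / ham_freq
ham_ratio = ham_freq / spam_freq

print(round(ham_freq, 6), round(spam_freq, 6))    # 0.017971 0.005338
print(round(spam_ratio, 6), round(ham_ratio, 5))  # 0.297044 3.36651
```

A ham_ratio above 1 says that 'very' is relatively more common in ham messages, which agrees with its raw counts of 64 against 2.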


In [168]:
tokens.sort_values('spam_ratio',ascending=False).head()

token,ham,spam,spam_ratio,ham_ratio
claim,0.000276,0.158363,572.798932,0.001746
prize,0.000276,0.135231,489.131673,0.002044
150p,0.000276,0.087189,315.36121,0.003171
tone,0.000276,0.085409,308.925267,0.003237
guaranteed,0.000276,0.076512,276.745552,0.003613


In [173]:
tokens.sort_values('ham_ratio',ascending=False).head()

token,ham,spam,spam_ratio,ham_ratio
gt,0.064971,0.001779,0.027387,36.513685
lt,0.064142,0.001779,0.027741,36.047553
he,0.047,0.001779,0.037858,26.414155
she,0.035665,0.001779,0.049891,20.043683
lor,0.0329,0.001779,0.054084,18.489909


In [182]:
tokens.loc['award']

ham            0.000553
spam           0.035587
spam_ratio    64.359431
ham_ratio      0.015538
Name: award, dtype: float64