### Text Classification - Predict text message as Ham or Spam
 - Convert text messages to vectors: build a vocabulary and a document-term matrix.
 - Apply a Naive Bayes model.
 - Predict each message as ham or spam.
 - Calculate accuracy_score, confusion_matrix, and roc_auc_score.
 - Also apply a Logistic Regression model.
 - Examine the predicted values.

In [1]:
import pandas as pd

In [2]:
data = pd.read_table('./data/sms.tsv.txt', header=None, names=['label', 'message'])

In [3]:
data.shape

(5572, 2)

In [4]:
# Class distribution
data.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [5]:
# Convert label to a numeric value
data['label_num'] = data.label.map({'ham': 0, 'spam': 1})

In [6]:
data.head()

,label,message,label_num
0,ham,"Go until jurong point, crazy.. Available only ...",0
1,ham,Ok lar... Joking wif u oni...,0
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,ham,U dun say so early hor... U c already then say...,0
4,ham,"Nah I don't think he goes to usf, he lives aro...",0


In [9]:
# Split data into Train and Test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data.message, data.label_num, test_size=0.2)
print(X_train.shape)
print(X_test.shape)

(4457,)
(1115,)


In [10]:
# Convert text data into numerical
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# Create a vocabulary from the training data
vect.fit(X_train)

# Using the vocabulary create document-term matrix
X_train_dtm = vect.transform(X_train)
X_train_dtm.shape

(4457, 7659)

In [11]:
# Transform test data
X_test_dtm = vect.transform(X_test)

In [12]:
X_test_dtm.shape

(1115, 7659)
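
The fit/transform steps above can be sketched in plain Python to show why the test matrix has the same number of columns as the training matrix. This is only an illustration of the idea, not how CountVectorizer is implemented: the helper names (`tokenize`, `fit_vocabulary`, `transform`) and the toy messages are hypothetical, the matrix is dense rather than sparse, and the only detail taken from the notebook is the default `token_pattern`.

```python
import re
from collections import Counter

# Same default token pattern that CountVectorizer reports: words of 2+ characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def fit_vocabulary(documents):
    """Map each distinct training token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Build a dense document-term matrix; tokens outside the vocabulary are dropped."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows


# Hypothetical toy messages, not taken from the SMS data set
train_docs = ["Free prize, claim now", "See you at lunch", "Claim your free tone"]
test_docs = ["Free lunch tomorrow"]

vocab = fit_vocabulary(train_docs)        # learned from training data only
train_dtm = transform(train_docs, vocab)
test_dtm = transform(test_docs, vocab)    # 'tomorrow' is unseen, so it is ignored

print(list(vocab))
print(test_dtm)
```

Because the vocabulary is learned from the training messages only, any token that first appears in the test set has no column and is silently ignored.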

In [13]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [15]:
# class_count_ is the number of training messages per class: 3848 ham and 609 spam
nb.class_count_

array([3848.,  609.])

In [16]:
test_pred = nb.predict(X_test_dtm)
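
To make the fit and predict steps less of a black box, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one smoothing (`alpha=1.0`, the value shown in the model output above). The function names and the toy data are hypothetical, and the sketch works on pre-tokenized lists instead of a sparse matrix, so treat it as an illustration of the formulas, not a replacement for `MultinomialNB`.

```python
import math
from collections import Counter


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from tokenized documents."""
    classes = sorted(set(labels))
    vocab = sorted({tok for doc in docs for tok in doc})
    class_count = Counter(labels)                    # documents per class
    token_count = {c: Counter() for c in classes}    # token occurrences per class
    for doc, label in zip(docs, labels):
        token_count[label].update(doc)

    log_prior = {c: math.log(class_count[c] / len(docs)) for c in classes}
    log_likelihood = {}
    for c in classes:
        # Smoothed denominator: all token occurrences in the class + alpha per vocabulary entry
        total = sum(token_count[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            tok: math.log((token_count[c][tok] + alpha) / total) for tok in vocab
        }
    return log_prior, log_likelihood


def predict(doc, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][tok] for tok in doc if tok in log_likelihood[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy data: 0 = ham, 1 = spam
docs = [["free", "prize", "claim"], ["see", "you", "later"],
        ["claim", "free", "tone"], ["lunch", "later"]]
labels = [1, 0, 1, 0]

log_prior, log_likelihood = train_multinomial_nb(docs, labels)
print(predict(["free", "claim"], log_prior, log_likelihood))
print(predict(["see", "you", "lunch"], log_prior, log_likelihood))
```

Working in log space avoids multiplying many small probabilities together, which would underflow for longer messages.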

In [45]:
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
accuracy_score(y_test, test_pred)

0.9820627802690582

In [33]:
# For better understanding of confusion matrix and TP, FP, TN, FN
# https://www.quora.com/What-is-the-best-way-to-understand-the-terms-precision-and-recall
tab = confusion_matrix(y_test, test_pred)
tab

array([[975,   2],
       [ 13, 125]], dtype=int64)

In [34]:
# Rows are true labels and columns are predicted labels; spam (1) is the positive class
print('True Negatives  (TN):', tab[0][0])
print('False Positives (FP):', tab[0][1])
print('False Negatives (FN):', tab[1][0])
print('True Positives  (TP):', tab[1][1])

True Negatives  (TN): 975
False Positives (FP): 2
False Negatives (FN): 13
True Positives  (TP): 125
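
As a cross-check on how the four cells are laid out, the sketch below counts them by hand on a small hypothetical label set, using the same convention as scikit-learn's `confusion_matrix`: rows are true labels, columns are predicted labels, and class 1 (spam) is treated as positive. The function name and the labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1   # ham flagged as spam
        elif truth == 1 and pred == 0:
            fn += 1   # spam that slipped through
        else:
            tp += 1
    return [[tn, fp], [fn, tp]]


# Hypothetical labels: 0 = ham, 1 = spam
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 0, 1, 0, 0]

(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)   # share of predicted spam that really is spam
recall = tp / (tp + fn)      # share of real spam that was caught

print([[tn, fp], [fn, tp]])
print(precision, recall)
```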


In [35]:
# False positives: ham messages (0) that the model predicted as spam (1)
X_test[y_test < test_pred]

991                               26th OF JULY
4382    Mathews or tait or edwards or anderson
Name: message, dtype: object

In [26]:
# Inspect one of the false positives: a ham message
# that the model predicted as SPAM
X_test[991]

'26th OF JULY'

In [27]:
X_test_dtm

<1115x7659 sparse matrix of type '<class 'numpy.int64'>'
	with 13624 stored elements in Compressed Sparse Row format>

In [31]:
# Predicted probability
test_pred_prob = nb.predict_proba(X_test_dtm)
test_pred_prob[:, 1]

array([3.42217463e-09, 1.48671422e-07, 1.00000000e+00, ...,
       1.37542092e-10, 1.00000000e+00, 1.46575640e-01])

### Implement Logistic Regression

In [36]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [37]:
%time logreg.fit(X_train_dtm, y_train)

Wall time: 47.5 ms


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [39]:
# Make prediction for test data set
test_pred = logreg.predict(X_test_dtm)
test_pred

array([0, 0, 1, ..., 0, 1, 0], dtype=int64)

In [42]:
# Calculate predicted probabilities for test data set
test_pred_proba = logreg.predict_proba(X_test_dtm)
test_pred_proba[:, 1]

array([1.58562770e-03, 2.19070824e-02, 9.97015075e-01, ...,
       9.75566405e-04, 9.97724884e-01, 1.93751405e-02])

In [44]:
# Calculate accuracy
accuracy_score(y_test, test_pred)

0.9820627802690582

In [46]:
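# Calculate AUC. Note: this passes hard 0/1 predictions; roc_auc_score is usually given
# predicted probabilities such as test_pred_proba[:, 1], which generally yields a different value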
roc_auc_score(y_test, test_pred)

0.9337590672422232
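
AUC measures how well the scores rank spam above ham, so it depends on whether the model's probabilities or its thresholded 0/1 labels are passed in. The sketch below is a hand-rolled pairwise version of AUC (the `roc_auc` helper and the toy scores are hypothetical, not scikit-learn code) that shows the two inputs can give different values for the same model output.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted spam probabilities
y_true = [0, 0, 1, 1]
proba = [0.10, 0.30, 0.45, 0.80]
hard = [1 if p >= 0.5 else 0 for p in proba]   # thresholded class predictions

auc_proba = roc_auc(y_true, proba)   # uses the full ranking
auc_hard = roc_auc(y_true, hard)     # ranking collapsed to two values

print(auc_proba, auc_hard)
```

Thresholding collapses every score to 0 or 1, so pairs that the probabilities ranked correctly can become ties and pull the AUC down.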

### Examine the NB model

In [47]:
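# Vocabulary learned by the vectorizer from X_train
# (newer scikit-learn releases use get_feature_names_out(); get_feature_names() was removed in 1.2)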
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

7659

In [50]:
# First 50 tokens
print(X_train_tokens[0:50])

['00', '000', '008704050406', '0089', '0121', '01223585236', '01223585334', '0125698789', '02', '0207', '02072069400', '02073162414', '02085076972', '021', '03', '04', '05', '0578', '06', '07', '07046744435', '07090201529', '07099833605', '07123456789', '0721072', '07732584351', '07742676969', '0776xxxxxxx', '07781482378', '07786200117', '077xxx', '078', '07801543489', '07808247860', '07808726822', '07815296484', '07821230901', '078498', '07880867867', '0789xxxxxxx', '07946746291', '07xxxxxxxxx', '08', '0800', '08000407165', '08000776320', '08000839402', '08000930705', '08000938767', '08001950382']


In [51]:
# Last 50 tokens
print(X_train_tokens[-50:])

['yesterday', 'yet', 'yetunde', 'yi', 'yifeng', 'yijue', 'ym', 'ymca', 'yo', 'yoga', 'yogasana', 'yor', 'yorge', 'you', 'youdoing', 'youi', 'young', 'younger', 'youphone', 'your', 'youre', 'yourinclusive', 'yourjob', 'yours', 'yourself', 'youuuuu', 'youwanna', 'yowifes', 'yoyyooo', 'yr', 'yrs', 'yummmm', 'yummy', 'yun', 'yuo', 'yuou', 'yup', 'zac', 'zaher', 'zealand', 'zebra', 'zed', 'zhong', 'zoe', 'zogtorius', 'zoom', 'zouk', 'èn', 'ú1', '〨ud']


In [52]:
# Number of times each token appears in each class
nb.feature_count_

array([[ 0.,  0.,  0., ...,  1.,  0.,  1.],
       [ 9., 25.,  1., ...,  0.,  1.,  0.]])

In [53]:
# Rows are classes and columns are tokens
nb.feature_count_.shape

(2, 7659)

In [54]:
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]
ham_token_count

array([0., 0., 0., ..., 1., 0., 1.])

In [55]:
# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
spam_token_count

array([ 9., 25.,  1., ...,  0.,  1.,  0.])

In [58]:
# create a DataFrame of tokens with their separate ham and spam counts
tokens = pd.DataFrame({
    'token': X_train_tokens,
    'ham': ham_token_count,
    'spam': spam_token_count
}).set_index('token')
tokens.head()

token,ham,spam
00,0.0,9.0
000,0.0,25.0
008704050406,0.0,1.0
0089,0.0,1.0
0121,0.0,1.0


In [60]:
# examine 5 random DataFrame rows
tokens.sample(5, random_state=6)

token,ham,spam
abuse,0.0,1.0
wap,0.0,11.0
tome,1.0,0.0
mother,7.0,0.0
iron,1.0,0.0


In [61]:
# Naive Bayes counts the number of observations in each class
nb.class_count_

array([3848.,  609.])

In [62]:
# add 1 to ham and spam counts to avoid dividing by 0
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5, random_state=6)

token,ham,spam
abuse,1.0,2.0
wap,1.0,12.0
tome,2.0,1.0
mother,8.0,1.0
iron,2.0,1.0


In [63]:
# convert counts to frequencies by dividing by the number of messages in each class
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(6, random_state=6)

token,ham,spam
abuse,0.00026,0.003284
wap,0.00026,0.019704
tome,0.00052,0.001642
mother,0.002079,0.001642
iron,0.00052,0.001642
specialise,0.00052,0.001642


In [65]:
# ratio of spam frequency to ham frequency for each token
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5, random_state=6)

token,ham,spam,spam_ratio
abuse,0.00026,0.003284,12.63711
wap,0.00026,0.019704,75.82266
tome,0.00052,0.001642,3.159278
mother,0.002079,0.001642,0.789819
iron,0.00052,0.001642,3.159278
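
The arithmetic behind these columns can be reproduced by hand. The sketch below takes the raw counts for three of the sampled tokens and the class sizes reported by `nb.class_count_`, applies the same add-one step, and recomputes the ratio; the `spam_ratio` helper is a hypothetical name used only for this illustration.

```python
# Class sizes from nb.class_count_ and raw token counts from the sampled rows above
ham_messages, spam_messages = 3848, 609
raw_counts = {"abuse": (0, 1), "wap": (0, 11), "mother": (7, 0)}   # (ham, spam)


def spam_ratio(ham_count, spam_count):
    """Add 1 to each count, normalize by class size, and divide spam by ham."""
    ham_freq = (ham_count + 1) / ham_messages
    spam_freq = (spam_count + 1) / spam_messages
    return spam_freq / ham_freq


ratios = {tok: spam_ratio(h, s) for tok, (h, s) in raw_counts.items()}
for tok, ratio in ratios.items():
    print(f"{tok}: {ratio:.4f}")
```

Without the add-one step, any token that never appears in ham messages would have a ham frequency of 0 and the ratio would be undefined.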


In [67]:
tokens.sort_values('spam_ratio', ascending=False)[:10]

token,ham,spam,spam_ratio
claim,0.00026,0.157635,606.581281
prize,0.00026,0.137931,530.758621
150p,0.00026,0.09688,372.794745
tone,0.00026,0.078818,303.29064
guaranteed,0.00026,0.073892,284.334975
18,0.00026,0.07225,278.01642
www,0.00052,0.137931,265.37931
cs,0.00026,0.065681,252.7422
500,0.00026,0.059113,227.46798
awarded,0.00026,0.055829,214.83087


In [68]:
# Look up the spam_ratio for a given token
tokens.loc['dating', 'spam_ratio']

94.77832512315271

### Tuning the vectorizer

In [70]:
# Default values for CountVectorizer
vect

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [93]:
# Include 1-grams and 2-grams (stop words are NOT removed: stop_words is left at None)
# ignore the terms that appear in more than 50% of the documents
# only keep terms that appear in at least 2 documents
vect = CountVectorizer(ngram_range=(1,2),
                       max_df=0.5,
                       min_df=2
                      )

In [94]:
vect.fit(X_train, y_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.5, max_features=None, min_df=2,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [95]:
vocabulary = vect.get_feature_names()
len(vocabulary)

12147
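
The effect of `min_df` and `max_df` can be illustrated with a small document-frequency filter in plain Python. This is a sketch under stated assumptions: the helper names and the toy documents are hypothetical, documents are already tokenized, and the thresholds follow the behaviour described in the comments above (an integer `min_df` is a minimum document count, a float `max_df` is a maximum share of documents).

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams of the token list for n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def build_vocabulary(docs, min_df=2, max_df=0.5):
    """Keep terms whose document frequency lies between min_df and max_df * n_docs."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(ngrams(tokens)))   # count each term once per document
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)


# Hypothetical pre-tokenized messages
docs = [
    ["call", "now", "free", "prize"],
    ["call", "me", "later"],
    ["free", "prize", "call"],
    ["call", "you", "tomorrow"],
]

vocab = build_vocabulary(docs)
print(vocab)
```

Here "call" is dropped because it appears in every document, and most bigrams are dropped because they appear only once, which mirrors why the tuned vectorizer ends up with far fewer features.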

### Conclusion

Removing features that appear in more than 50% of documents and
those that appear in only one document makes a big difference to the feature count.

Features using **ngram_range** only: **43874**

Features using **ngram_range** with **max_df** and **min_df**: **12147**