### Analyse text data and create models to predict if a message is spam or not. 

Read the CSV file and analyse the text data to classify each message as spam or not spam.

In [1]:
import pandas as pd
import numpy as np

input_data = pd.read_csv('spam.csv')

input_data['target'] = np.where(input_data['target']=='spam', 1, 0)
input_data.head(10)

Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
5,FreeMsg Hey there darling it's been 3 week's n...,1
6,Even my brother is not like to speak with me. ...,0
7,As per your request 'Melle Melle (Oru Minnamin...,0
8,WINNER!! As a valued network customer you have...,1
9,Had your mobile 11 months or more? U R entitle...,1


In [2]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(input_data['text'], 
                                                    input_data['target'], 
                                                    random_state=0)
# X_train of type Series
X_train.shape[0], X_test.shape[0]

(4179, 1393)

Percentage of documents that are spam

In [3]:
# Gets entire row based on column 'target' == 1
spam_df = input_data[input_data['target'] == 1]
print(type(spam_df), spam_df.shape)
spam_text = input_data[input_data['target'] == 1]['text']
print(type(spam_text), spam_text.shape)

spam_cnt = spam_df.shape[0]
total_cnt = input_data.shape[0]

(spam_cnt*100)/total_cnt

<class 'pandas.core.frame.DataFrame'> (747, 2)
<class 'pandas.core.series.Series'> (747,)


13.406317300789663
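
Because `target` is encoded as 0/1, the same percentage could also be read off as the mean of that column. A minimal sketch on a made-up four-row frame (the `toy` data is hypothetical and not taken from `spam.csv`):

```python
import pandas as pd

# Hypothetical toy frame: one spam message out of four
toy = pd.DataFrame({'text': ['a', 'b', 'c', 'd'],
                    'target': [0, 1, 0, 0]})

# The mean of a 0/1 column is the fraction of ones
spam_pct = toy['target'].mean() * 100
print(spam_pct)  # 25.0
```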

#### Build the model

Fit a `CountVectorizer` with default parameters on the training data `X_train` and use it to transform that data.<br>
Next, fit a Multinomial Naive Bayes classifier with smoothing `alpha=0.1`. This classifier is suited to discrete features (e.g., word counts for text classification).<br>
Compute the area under the ROC curve (AUC) score from the predictions on the transformed test data.<br>

#### Best is MultinomialNB with CountVectorizer (0.972). Next are MultinomialNB with TfidfVectorizer (0.949) and LogisticRegression (0.944).

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

cvec = CountVectorizer().fit(X_train)
    
nb_clf = MultinomialNB(alpha=0.1)
X_train_vec = cvec.transform(X_train)
print(X_train_vec.shape)
nb_clf.fit(X_train_vec, y_train)
    
y_pred = nb_clf.predict(cvec.transform(X_test))
roc_auc_score(y_test, y_pred)

(4179, 7354)


0.9720812182741116
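
The scores in this notebook pass the hard 0/1 labels from `predict` to `roc_auc_score`. The metric also accepts continuous scores, such as the spam probability from `predict_proba`, which lets it rank individual messages. A minimal sketch of both variants on a tiny made-up corpus (all texts, labels and variable names here are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

# Hypothetical toy corpus; 1 = spam, 0 = not spam
train_texts = ["win a free prize now", "free entry win cash",
               "see you at lunch", "are you coming home"]
train_labels = [1, 1, 0, 0]
test_texts = ["free cash prize", "lunch at home", "win now", "see you"]
test_labels = [1, 0, 1, 0]

vec = CountVectorizer().fit(train_texts)
clf = MultinomialNB(alpha=0.1).fit(vec.transform(train_texts), train_labels)

X_test_toy = vec.transform(test_texts)
hard_pred = clf.predict(X_test_toy)              # 0/1 labels
spam_prob = clf.predict_proba(X_test_toy)[:, 1]  # probability of class 1

auc_hard = roc_auc_score(test_labels, hard_pred)
auc_prob = roc_auc_score(test_labels, spam_prob)
print(auc_hard, auc_prob)
```

On real data the two numbers usually differ, so it helps to state which one is being reported.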

In [5]:
tfvec = TfidfVectorizer().fit(X_train)

nb_clf2 = MultinomialNB(alpha=0.1)
nb_clf2.fit(tfvec.transform(X_train), y_train)
print(tfvec.transform(X_train).shape)
    
y_pred = nb_clf2.predict(tfvec.transform(X_test))
roc_auc_score(y_test, y_pred)

(4179, 7354)


0.9492385786802031

In [6]:
svm_clf = SVC(C=10000).fit(X_train_vec, y_train)
y_pred = svm_clf.predict(cvec.transform(X_test))
roc_auc_score(y_test, y_pred)

0.934010152284264

In [7]:
log_clf = LogisticRegression(C=100).fit(X_train_vec, y_train)
y_pred = log_clf.predict(cvec.transform(X_test))
roc_auc_score(y_test, y_pred)

0.9437443763475543

The analysis below shows that document length, number of digits, and number of special (non-word) characters are all much higher for spam documents.

#### Average length of documents (number of characters) for non-spam and spam documents

In [8]:
# Series of the 'text' column for rows of non-spam
nonspam_text = input_data[input_data['target'] == 0]['text']
non_spam_len = nonspam_text.str.len()
non_spam_mean = non_spam_len.mean()

spam_text = input_data[input_data['target'] == 1]['text']
spam_len = spam_text.str.len()
spam_mean = spam_len.mean()

(non_spam_mean, spam_mean)

(71.02362694300518, 138.8661311914324)

#### Average number of digits for not spam and spam documents

In [9]:
# Series of digit count of each row of text
# Raw strings avoid invalid-escape warnings for regex patterns
nonspam_digitCnt = nonspam_text.str.count(r'\d')

spam_digitCnt = spam_text.str.count(r'\d')

(nonspam_digitCnt.mean(), spam_digitCnt.mean())

(0.2992746113989637, 15.759036144578314)

#### Average number of non-word characters (anything other than a letter, digit or underscore) per document for not spam and spam documents.

In [10]:
# Series of non-word character count of each row of text
nonspam_spCharCnt = nonspam_text.str.count(r'\W')

spam_spCharCnt = spam_text.str.count(r'\W')

(nonspam_spCharCnt.mean(), spam_spCharCnt.mean())

(17.29181347150259, 29.041499330655956)
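
The three per-document measures used above (length, digit count, non-word character count) can be checked by hand on one string. The sketch below uses the standard `re` module on a made-up message; note that `\W` also matches whitespace, so spaces count as non-word characters.

```python
import re

# Hypothetical message, not taken from the dataset
msg = "WINNER!! Call 0800 now"

n_chars = len(msg)                        # document length in characters
n_digits = len(re.findall(r'\d', msg))    # 0, 8, 0, 0
n_nonword = len(re.findall(r'\W', msg))   # two '!' and three spaces

print(n_chars, n_digits, n_nonword)  # 22 4 5
```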

#### Function to combine new features into the training data
Returns the sparse feature matrix with the added feature column(s).

In [11]:
from scipy.sparse import csr_matrix, hstack

# feature_to_add can also be a list of features
# return type scipy.sparse.csr.csr_matrix

def add_feature(X, feature_to_add):
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
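
To see what `add_feature` does to the shape, here is a minimal sketch on a hand-made 2x3 matrix. The matrix and the `lengths` and `digits` values are hypothetical; the function is repeated so the block runs on its own.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def add_feature(X, feature_to_add):
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')

# Hypothetical count matrix: 2 documents, 3 vocabulary terms
X = csr_matrix(np.array([[1, 0, 2],
                         [0, 3, 0]]))
lengths = [10, 20]   # one value per document
digits = [0, 4]

X_one = add_feature(X, lengths)            # one extra column
X_two = add_feature(X, [lengths, digits])  # two extra columns

print(X_one.shape, X_two.shape)  # (2, 4) (2, 5)
print(X_two.toarray())
# [[ 1  0  2 10  0]
#  [ 0  3  0 20  4]]
```

Each added feature becomes one new column, appended to the right of the vocabulary columns in the order given.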

Fit and transform the training data `X_train` using a CountVectorizer that ignores terms with a document frequency strictly lower than **5**.<br>
Using this matrix and two additional features:
* the length of document (number of characters)
* number of digits per document

fit a MultinomialNB.

#### AUC is 0.9709
With the default CountVectorizer there are many more features (7354) and the AUC is 0.972. Here, with fewer features (1468) plus 2 additional meaningful features, the AUC is 0.9709.

In [12]:
cvec = CountVectorizer(min_df=5).fit(X_train)
X_train_vec = cvec.transform(X_train)     
   
# shape inc from 4179*1468 -> 4179*1470
# parameters of add_feature are Series or list of Series, return type scipy.sparse.csr.csr_matrix
new_train_features = [X_train.str.len(), X_train.str.count(r'\d')]
X_train_vec_new = add_feature(X_train_vec, new_train_features)
print(X_train_vec_new.shape)
nb_clf = MultinomialNB(alpha=0.1).fit(X_train_vec_new, y_train)

new_test_features = [X_test.str.len(), X_test.str.count(r'\d')]
X_test_vec_new = add_feature(cvec.transform(X_test), new_test_features)
y_pred = nb_clf.predict(X_test_vec_new)
roc_auc_score(y_test, y_pred)

(4179, 1470)


0.9708567475340815

Fit and transform the training data `X_train` using a CountVectorizer that ignores terms with a document frequency strictly lower than **5** and uses **character n-grams from n=2 to n=5.** To get character n-grams, pass `analyzer='char_wb'`, which creates them only from text inside word boundaries. This should make the model more robust to spelling mistakes.

Using this matrix and three additional features:
* the length of document (number of characters)
* number of digits per document
* number of non-word characters

fit a MultinomialNB.

#### AUC is 0.9818

Without a document frequency threshold there are 7354 features. With `min_df=5` this drops to 1468.
With word n-grams (1,3) it rises to 3383. With character n-grams (2,5) it rises sharply to 16314.
The AUC increases by about 0.011 (from 0.9709 to 0.9818).
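
To see what `analyzer='char_wb'` actually produces, the sketch below runs the vectorizer's analyzer on a short made-up string. It uses a smaller range, `ngram_range=(2, 3)`, only to keep the output short; the input string is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable that turns a string into n-grams
analyzer = CountVectorizer(analyzer='char_wb',
                           ngram_range=(2, 3)).build_analyzer()

grams = analyzer("Hi yo")
print(grams)
```

Each word is lowercased and padded with a space on both sides, so n-grams such as `' hi'` and `'hi '` appear, while nothing like `'i y'` is produced because n-grams never span two words.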

In [13]:
cvec = CountVectorizer(min_df=5, ngram_range=(2,5), analyzer='char_wb').fit(X_train)
X_train_vec = cvec.transform(X_train)     
   
# shape grows from 4179*16314 to 4179*16317 once the 3 extra features are added
new_train_features = [X_train.str.len(), X_train.str.count(r'\d'), X_train.str.count(r'\W')]
X_train_vec_new = add_feature(X_train_vec, new_train_features)
print(X_train_vec_new.shape)
nb_clf = MultinomialNB(alpha=0.1).fit(X_train_vec_new, y_train)

new_test_features = [X_test.str.len(), X_test.str.count(r'\d'), X_test.str.count(r'\W')]
X_test_vec_new = add_feature(cvec.transform(X_test), new_test_features)
y_pred = nb_clf.predict(X_test_vec_new)
roc_auc_score(y_test, y_pred)

(4179, 16317)


0.9818451521993787

Find the 10 smallest and 10 largest coefficients from the model.
The list of 10 smallest coefficients should be sorted smallest first, the list of 10 largest coefficients should be sorted largest first.

In [14]:
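# Model coefficients (weights) are learned in training and are the same for every data point.
# Features are what define the data points.
# Note: newer scikit-learn releases replace get_feature_names() with get_feature_names_out().
feature_list = cvec.get_feature_names()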

# Feature list is increased to 16317 to match the matrix
feature_list.extend(['length_of_doc', 'digit_count', 'non_word_char_count'])
# List converted to array so that indexing can be done with another array
feature_array = np.array(feature_list)

# Printing unsorted feature names 
print("First 10 feature names", feature_array[:10])
print("Last 10 feature names", feature_array[:-11:-1])

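# For MultinomialNB, coef_ mirrors feature_log_prob_ for the spam class
# (newer scikit-learn releases expose only feature_log_prob_).
sorted_coef_index = nb_clf.coef_[0].argsort()
#print("Indices of the largest coef ", sorted_coef_index[:-11:-1])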

small_features = feature_array[sorted_coef_index[:10]]
large_features = feature_array[sorted_coef_index[:-11:-1]]

print("\nSorted features :")
small_features.tolist(), large_features.tolist()

First 10 feature names [' !' ' ! ' ' !!' ' !! ' " !!'" " !!''" ' #' ' $' ' $ ' ' &']
Last 10 feature names ['non_word_char_count' 'digit_count' 'length_of_doc' 'û÷t ' 'û÷t' 'û÷m '
 'û÷m' 'û÷ll ' 'û÷ll' 'û÷l']

Sorted features :


(['mount',
  'aptop',
  'apto',
  'apt',
  'april',
  'apri',
  'apr',
  'appy.',
  'appy ',
  'appy'],
 ['length_of_doc',
  'non_word_char_count',
  'digit_count',
  'e ',
  ' t',
  ' c',
  't ',
  's ',
  'r ',
  'to'])
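
The slicing used above, `[:10]` for the smallest coefficients and `[:-11:-1]` for the largest, can be checked on a small array. This sketch takes the top 2 instead of the top 10, and the coefficient values and feature names are hypothetical.

```python
import numpy as np

# Hypothetical coefficients and matching feature names
coefs = np.array([-3.0, -0.5, -7.2, -1.1, -4.4])
names = np.array(['aa', 'bb', 'cc', 'dd', 'ee'])

order = coefs.argsort()          # indices from smallest to largest coefficient

smallest = names[order[:2]]      # first two, smallest first
largest = names[order[:-3:-1]]   # last two, walked backwards, largest first

print(smallest.tolist(), largest.tolist())  # ['cc', 'ee'] ['bb', 'dd']
```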