# Spam Classifiers

Explore text message data and create models to predict if a message is spam or not. 

In [21]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

spam_data = pd.read_csv('spam.csv')
spam_data.head(10)

Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",ham
1,Ok lar... Joking wif u oni...,ham
2,Free entry in 2 a wkly comp to win FA Cup fina...,spam
3,U dun say so early hor... U c already then say...,ham
4,"Nah I don't think he goes to usf, he lives aro...",ham
5,FreeMsg Hey there darling it's been 3 week's n...,spam
6,Even my brother is not like to speak with me. ...,ham
7,As per your request 'Melle Melle (Oru Minnamin...,ham
8,WINNER!! As a valued network customer you have...,spam
9,Had your mobile 11 months or more? U R entitle...,spam


In [22]:
spam_data['target'] = np.where(spam_data['target']=='spam',1,0)
spam_data.head(10)

Unnamed: 0,text,target
0,"Go until jurong point, crazy.. Available only ...",0
1,Ok lar... Joking wif u oni...,0
2,Free entry in 2 a wkly comp to win FA Cup fina...,1
3,U dun say so early hor... U c already then say...,0
4,"Nah I don't think he goes to usf, he lives aro...",0
5,FreeMsg Hey there darling it's been 3 week's n...,1
6,Even my brother is not like to speak with me. ...,0
7,As per your request 'Melle Melle (Oru Minnamin...,0
8,WINNER!! As a valued network customer you have...,1
9,Had your mobile 11 months or more? U R entitle...,1


In [23]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], 
                                                    spam_data['target'], 
                                                    random_state=0)

#### What percentage of the documents in `spam_data` are spam?

In [24]:
np.sum(sum(spam_data['target'])/len(spam_data['target']))*100

13.406317300789663

###### Fit the training data `X_train` using a Count Vectorizer with default parameters.

In [25]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer().fit(X_train)

###### What is the longest token in the vocabulary?

In [26]:
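# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available from 1.0).
sorted(vect.get_feature_names_out(), key=len)[-1]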

'com1win150ppmx3age16subscription'
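To make the vocabulary lookup concrete, here is a minimal sketch on a made-up three-message corpus (the messages are illustrative and not taken from `spam.csv`). It reads the tokens from the fitted vectorizer's `vocabulary_` attribute and picks the longest one with `max`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
toy_docs = [
    "Free entry to win tickets",
    "Ok see you at home",
    "Call now to claim your subscription",
]

toy_vect = CountVectorizer().fit(toy_docs)

# vocabulary_ maps each lowercased token to its column index
# in the document-term matrix.
toy_tokens = list(toy_vect.vocabulary_)

# max(..., key=len) returns the first token of maximal length.
longest_token = max(toy_tokens, key=len)
print(longest_token)
```

Note that `max` returns the first token of maximal length, while `sorted(...)[-1]` returns the last one, so the two can differ when several tokens tie for the longest.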


###### Next, fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1`. Find the area under the curve (AUC) score using the transformed test data.

In [27]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score
vectorizer = CountVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)
X_test_transformed = vectorizer.transform(X_test)

clf = MultinomialNB(alpha=0.1)
clf.fit(X_train_transformed, y_train)

y_predicted = clf.predict(X_test_transformed)

print(roc_auc_score(y_test, y_predicted))

0.9716631580734427
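The cell above passes the hard 0/1 output of `predict` to `roc_auc_score`. The function also accepts continuous scores, such as the positive-class column of `predict_proba`, and the two inputs generally give different values. The sketch below shows the difference on hand-made labels and scores (all values are invented for illustration, and the 0.5 threshold is an assumption); it says nothing about what the spam model itself would score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hand-made ground truth and classifier scores, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])

# Hard labels, as predict() would return them with a 0.5 threshold.
hard_labels = (scores >= 0.5).astype(int)

# AUC from hard labels: the ROC curve has a single operating point.
auc_from_labels = roc_auc_score(y_true, hard_labels)

# AUC from continuous scores: every threshold contributes to the curve.
auc_from_scores = roc_auc_score(y_true, scores)

print(auc_from_labels, auc_from_scores)
```

With a fitted classifier, the continuous scores would come from something like `clf.predict_proba(X_test_transformed)[:, 1]`.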


### Fit and transform the training data `X_train` using a Tfidf Vectorizer with default parameters.

###### What 20 features have the smallest tf-idf and what 20 have the largest tf-idf?

In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer        

In [29]:
import operator

vectorizer = TfidfVectorizer()
X_train_transformed = vectorizer.fit_transform(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
feature_names = vectorizer.get_feature_names_out()
idfs = vectorizer.idf_
names_idfs = list(zip(feature_names, idfs))

smallest = sorted(names_idfs, key=operator.itemgetter(1))[:20]
smallest = pd.Series([features[1] for features in smallest], index=[features[0] for features in smallest])

largest = sorted(names_idfs, key=operator.itemgetter(1), reverse=True)[:20]
largest = sorted(largest, key=operator.itemgetter(0))
largest = pd.Series([features[1] for features in largest], index=[features[0] for features in largest])
    
print(smallest, largest)

to      2.198406
you     2.266493
the     2.707383
in      2.890761
and     2.976764
is      3.003012
me      3.111530
for     3.206840
it      3.224384
my      3.231044
call    3.297812
your    3.300196
of      3.319473
have    3.354130
that    3.413811
on      3.463136
now     3.465949
can     3.545053
are     3.560414
so      3.566625
dtype: float64 000pes         8.644919
0089           8.644919
0121           8.644919
01223585236    8.644919
0125698789     8.644919
02072069400    8.644919
02073162414    8.644919
02085076972    8.644919
021            8.644919
0430           8.644919
07008009200    8.644919
07099833605    8.644919
07123456789    8.644919
0721072        8.644919
07753741225    8.644919
077xxx         8.644919
078            8.644919
07808247860    8.644919
07808726822    8.644919
078498         8.644919
dtype: float64
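The values printed above come from the vectorizer's `idf_` attribute, so they are inverse document frequencies rather than full tf-idf weights. With default settings (`smooth_idf=True`), scikit-learn documents the formula as `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing term `t`. The sketch below checks this on a toy corpus invented for illustration. Its last line applies the formula to a term seen in one document out of 4,179; that document count is an assumption, since the notebook never prints the size of the training split.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only (lowercase, no punctuation,
# so str.split() matches the vectorizer's tokens of two or more characters).
toy_docs = ["win a prize now", "call me now", "see you now", "win cash"]
n_docs = len(toy_docs)

toy_tfidf = TfidfVectorizer().fit(toy_docs)

expected_idf = {}
learned_idf = {}
for token, col in toy_tfidf.vocabulary_.items():
    # Document frequency: number of documents containing the token.
    df = sum(token in doc.split() for doc in toy_docs)
    expected_idf[token] = np.log((1 + n_docs) / (1 + df)) + 1
    learned_idf[token] = toy_tfidf.idf_[col]

for token in sorted(expected_idf):
    print(token, round(expected_idf[token], 6), round(learned_idf[token], 6))

# Assumed training-set size of 4,179 documents, term seen in exactly one.
idf_for_singleton = np.log((1 + 4179) / (1 + 1)) + 1
print(idf_for_singleton)
```

Under that assumption the last value agrees with the repeated 8.644919 above, which would mean each of the 20 "largest" terms occurs in a single training message.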


##### Now let's play with document frequency

Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **3**.

Then fit a multinomial Naive Bayes classifier model with smoothing `alpha=0.1` and compute the area under the curve (AUC) score using the transformed test data.

In [30]:
vect = TfidfVectorizer(min_df = 3).fit(X_train)
#feature_names = vect.get_feature_names()
X_train_vect = vect.transform(X_train)
model  = MultinomialNB(alpha=.1).fit(X_train_vect,y_train)
y_pred = model.predict(vect.transform(X_test))
roc_auc_score(y_test,y_pred)    

0.9416243654822335
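To see what `min_df` does to the vocabulary, here is a small sketch on a four-message toy corpus (invented for illustration). It uses `min_df=2` instead of 3 so that the effect is visible on so few documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
toy_docs = ["free prize now", "free call now", "see you now", "call me later"]

# Default min_df=1 keeps every token.
full_vocab = set(TfidfVectorizer().fit(toy_docs).vocabulary_)

# min_df=2 drops tokens that appear in fewer than 2 documents.
pruned_vocab = set(TfidfVectorizer(min_df=2).fit(toy_docs).vocabulary_)

print(len(full_vocab), sorted(pruned_vocab))
```

Rare tokens such as one-off phone numbers disappear from the feature space, which shrinks the matrix but can also remove signals that are specific to spam.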

## Count Vectorizer still gives the better result (AUC 0.972 vs. 0.942)

###### What is the average length of documents (number of characters) for not-spam and spam documents?

In [31]:
temp = spam_data.copy()
temp['length'] = temp['text'].str.len()
average_length = temp.groupby('target')['length'].agg('mean').values
print(average_length[0], average_length[1])

71.13284974093264 139.75903614457832


The following function has been provided to combine new features into the training data:

In [32]:
def add_feature(X, feature_to_add):
    """
    Returns sparse feature matrix with added feature.
    feature_to_add can also be a list of features.
    """
    from scipy.sparse import csr_matrix, hstack
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')
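A short usage sketch for `add_feature`, with the function repeated so the snippet runs on its own. The small matrix and the text values are invented for illustration.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack


def add_feature(X, feature_to_add):
    """Return sparse feature matrix with the extra feature column(s) appended."""
    return hstack([X, csr_matrix(feature_to_add).T], 'csr')


# A toy 3x2 document-term matrix and a per-document length feature.
X_toy = csr_matrix([[1, 0], [0, 2], [3, 0]])
lengths = pd.Series(["abc", "de", "fghij"]).str.len()

# One extra feature: one new column on the right.
X_one = add_feature(X_toy, lengths)

# A list of two features: two new columns, in the given order.
X_two = add_feature(X_toy, [lengths.tolist(), [0, 1, 2]])

print(X_one.shape, X_two.shape)
print(X_one.toarray())
```

The new columns are always appended on the right, so any list of feature names built later has to append the extra names in the same order.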


##### Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5**.

Using this document-term matrix and an additional feature, **the length of document (number of characters)**, fit a Support Vector Classification model with regularization `C=10000`. Then compute the area under the curve (AUC) score using the transformed test data.


In [33]:
from sklearn.svm import SVC

vectorizer = TfidfVectorizer(min_df=5)

X_train_transformed = vectorizer.fit_transform(X_train)
X_train_transformed_with_length = add_feature(X_train_transformed, X_train.str.len())

X_test_transformed = vectorizer.transform(X_test)
X_test_transformed_with_length = add_feature(X_test_transformed, X_test.str.len())

clf = SVC(C=10000)

clf.fit(X_train_transformed_with_length, y_train)

y_predicted = clf.predict(X_test_transformed_with_length)

print(roc_auc_score(y_test, y_predicted))

0.9661689557407943


###### Number of digits per document as a feature: average for not-spam and spam documents

In [34]:
import re
temp = spam_data.copy()
temp['length'] = temp['text'].apply(lambda row:len(re.findall(r'\d{1}',row)))
average_length = temp.groupby('target')['length'].agg('mean').values
print(average_length[0], average_length[1])

0.2992746113989637 15.759036144578314
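The digit count above uses `apply` with `re.findall`. As a sketch of an alternative, pandas' `Series.str.count` takes the same regular expression and gives the same counts without a Python-level lambda. The three messages below are invented for illustration.

```python
import re

import pandas as pd

# Toy messages, invented for illustration only.
toy_texts = pd.Series(["Call 0800 now", "see you at 5", "no digits here"])

# Approach used in the notebook: count regex matches per row.
digits_apply = toy_texts.apply(lambda text: len(re.findall(r'\d', text)))

# Vectorized string method with the same pattern.
digits_vectorized = toy_texts.str.count(r'\d')

print(digits_apply.tolist(), digits_vectorized.tolist())
```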


### Using Logistic Regression, which works well with sparse matrices

Fit and transform the training data `X_train` using a Tfidf Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **word n-grams from n=1 to n=3** (unigrams, bigrams, and trigrams).

Using this document-term matrix and the following additional features:
* the length of document (number of characters)
* **number of digits per document**

fit a Logistic Regression model with regularization `C=100`. Then compute the area under the curve (AUC) score using the transformed test data.

In [35]:
from sklearn.linear_model import LogisticRegression

temp = spam_data.copy()
temp['length_of_doc'] = temp['text'].str.len()
temp['digits_count'] = temp['text'].apply(lambda row: len(re.findall(r'(\d)', row)))
X_train, X_test, y_train, y_test = train_test_split(temp.drop('target', axis=1), temp['target'], random_state=0)

vect = TfidfVectorizer(min_df=5, ngram_range=(1, 3)).fit(X_train['text'])
X_train_vectorized = vect.transform(X_train['text'])
X_test_vectorized = vect.transform(X_test['text'])
X_train_vectorized = add_feature(X_train_vectorized, X_train['length_of_doc'])
X_train_vectorized = add_feature(X_train_vectorized, X_train['digits_count'])
X_test_vectorized = add_feature(X_test_vectorized, X_test['length_of_doc'])
X_test_vectorized = add_feature(X_test_vectorized, X_test['digits_count'])

clf = LogisticRegression(C=100).fit(X_train_vectorized, y_train)
y_score = clf.predict(X_test_vectorized)
score = roc_auc_score(y_test, y_score)
print(score)

0.9674528462047772
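The `ngram_range=(1, 3)` setting used above makes the vectorizer emit unigrams, bigrams and trigrams. A minimal sketch on a single made-up message shows which features that produces.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One toy message, invented for illustration only.
toy_vect = TfidfVectorizer(ngram_range=(1, 3)).fit(["win free cash now"])

# Order the learned n-grams by number of words, then alphabetically.
toy_ngrams = sorted(toy_vect.vocabulary_, key=lambda t: (t.count(" "), t))

for ngram in toy_ngrams:
    print(repr(ngram))
```

Four words yield 4 unigrams, 3 bigrams and 2 trigrams, so the feature space grows quickly; combining n-grams with `min_df` keeps only those that recur across documents.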


#### Adding the number of non-word characters as a feature

In [36]:

temp = spam_data.copy()
temp['length'] = temp['text'].apply(lambda row:len(re.findall(r'\W{1}',row)))
average_length = temp.groupby('target')['length'].agg('mean').values

print(average_length[0], average_length[1])

17.339274611398963 29.49263721552878



Fit and transform the training data `X_train` using a Count Vectorizer ignoring terms that have a document frequency strictly lower than **5** and using **character n-grams from n=2 to n=5.**

To tell Count Vectorizer to use character n-grams pass in `analyzer='char_wb'` which creates character n-grams only from text inside word boundaries. This should make the model more robust to spelling mistakes.
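As a rough illustration of why character n-grams tolerate misspellings, the sketch below runs the `char_wb` analyzer on the word "free" and on a misspelled "fre" and compares the resulting n-gram sets. The words and the narrower `ngram_range=(2, 3)` are chosen for readability and are not taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the function the vectorizer uses to turn
# a string into features; no fitting is needed to call it.
analyzer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 3)).build_analyzer()

# char_wb pads each word with a space on both sides before slicing.
ngrams_correct = set(analyzer("free"))
ngrams_misspelled = set(analyzer("fre"))

shared = ngrams_correct & ngrams_misspelled

print(sorted(ngrams_correct))
print(sorted(ngrams_misspelled))
print(len(shared), "shared n-grams")
```

A word-level vectorizer would treat "free" and "fre" as unrelated tokens, whereas here most of the misspelled word's n-grams also occur in the correct spelling.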

Using this document-term matrix and the following additional features:
* the length of document (number of characters)
* number of digits per document
* **number of non-word characters (anything other than a letter, digit or underscore.)**

fit a Logistic Regression model with regularization `C=100`. Then compute the area under the curve (AUC) score using the transformed test data.

Let's also **find the 10 smallest and 10 largest coefficients from the model** and print them along with the AUC score.


In [37]:
temp = spam_data.copy()
temp['length_of_doc'] = temp['text'].str.len()
temp['digit_count'] = temp['text'].apply(lambda row: len(re.findall(r'\d', row)))
temp['non_word_char_count'] = temp['text'].apply(lambda row: len(re.findall(r'\W', row)))
X_train, X_test, y_train, y_test = train_test_split(temp.drop('target', axis=1), temp['target'], random_state=0)

vect = CountVectorizer(min_df=5, ngram_range=(2, 5), analyzer='char_wb').fit(X_train['text'])
X_train_vectorized = vect.transform(X_train['text'])
X_test_vectorized = vect.transform(X_test['text'])
X_train_vectorized = add_feature(X_train_vectorized, X_train['length_of_doc'])
X_train_vectorized = add_feature(X_train_vectorized, X_train['digit_count'])
X_train_vectorized = add_feature(X_train_vectorized, X_train['non_word_char_count'])
X_test_vectorized = add_feature(X_test_vectorized, X_test['length_of_doc'])
X_test_vectorized = add_feature(X_test_vectorized, X_test['digit_count'])
X_test_vectorized = add_feature(X_test_vectorized, X_test['non_word_char_count'])
clf = LogisticRegression(C=100).fit(X_train_vectorized, y_train)
y_score = clf.predict(X_test_vectorized)
score = roc_auc_score(y_test, y_score)

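# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
# The three extra names follow the order in which add_feature appended the columns.
feature_names = np.append(np.array(vect.get_feature_names_out()), ['length_of_doc', 'digit_count', 'non_word_char_count'])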
sorted_coef_index = clf.coef_[0].argsort()
largest_coefs = feature_names[sorted_coef_index[:-11:-1]]
smallest_coefs = feature_names[sorted_coef_index[:10]]
    
print(score, list(smallest_coefs), list(largest_coefs))

0.9813973821367333 ['..', '. ', ' i', ' go', ' y', '? ', 'pe', 'go', ' h', 'ca'] ['digit_count', 'ne', 'co', 'ia', 'ar', 'ww', ' r', ' ch', ' x', 'xt']
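The slicing used to pick the extreme coefficients can be hard to read, so here is a small sketch with five made-up coefficients and feature names. `argsort` gives the indices in ascending order of coefficient; `[:k]` takes the `k` smallest and `[:-(k + 1):-1]` walks backwards from the end to take the `k` largest, largest first.

```python
import numpy as np

# Made-up feature names and coefficients, for illustration only.
toy_names = np.array(['aa', 'bb', 'cc', 'dd', 'ee'])
toy_coefs = np.array([0.5, -2.0, 3.0, -0.1, 1.2])

# Indices that would sort the coefficients in ascending order.
order = toy_coefs.argsort()

k = 2
smallest_k = toy_names[order[:k]]            # most negative first
largest_k = toy_names[order[:-(k + 1):-1]]   # most positive first

print(list(smallest_k), list(largest_k))
```

In the notebook, `k` is 10, which is why the slice reads `[:-11:-1]`.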


# We achieved a ROC AUC score above 0.98

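Using
- Count Vectorizer
- ignoring terms with a document frequency lower than **5**
- character n-grams from n=2 to n=5
- analyzer='char_wb'
- three additional features: document length, digit count and non-word character count
- Logistic Regression with regularization C=100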