# Text Classification -- Spam Filtering

## Multinomial Naive Bayes Spam Classifier
This notebook is from the DOST AI Summer School materials.

In [1]:
%pylab inline
import pandas as pd

Populating the interactive namespace from numpy and matplotlib


#### Steps in building the classifier
- Representing text as numerical or count data
- Reading a text corpus into a pandas DataFrame
- Vectorizing the dataset with CountVectorizer
- Building and evaluating a Spam Classifier
- Examining a model for further insight
- Tuning the vectorizer (challenge)
- Tuning the Laplacian Correction factor (challenge)

## Dataset: Representing text as numerical data

In [2]:
# example text for model training
simple_train = ['hello how are you', 'Hello are you there', 'why hello there']

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to "convert a text corpus into a sparse matrix of word or token counts":

In [3]:
# import and instantiate CountVectorizer (with the default parameters)
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

In [4]:
# learn the 'vocabulary' of the training data (occurs in-place)
vect.fit(simple_train)

CountVectorizer()

In [5]:
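# examine the fitted vocabulary
# (get_feature_names() was deprecated in scikit-learn 1.0 and removed in 1.2;
#  newer versions use get_feature_names_out() instead)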
vect.get_feature_names()



['are', 'hello', 'how', 'there', 'why', 'you']

In [6]:
# transform training data into a 'document-term matrix'
simple_train_dtm = vect.transform(simple_train)
simple_train_dtm

<3x6 sparse matrix of type '<class 'numpy.int64'>'
	with 11 stored elements in Compressed Sparse Row format>

In [7]:
# convert sparse matrix to a dense matrix
simple_train_dtm.toarray()

array([[1, 1, 1, 0, 0, 1],
       [1, 1, 0, 1, 0, 1],
       [0, 1, 0, 1, 1, 0]])

In [8]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_train_dtm.toarray(), columns=vect.get_feature_names())

|   | are | hello | how | there | why | you |
|---|-----|-------|-----|-------|-----|-----|
| 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 0 | 1 |
| 2 | 0 | 1 | 0 | 1 | 1 | 0 |


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
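To make the Bag of Words idea concrete, here is a minimal from-scratch sketch of what the vectorizer does on the toy corpus. It is not scikit-learn's implementation: the helper names `tokenize`, `build_vocabulary`, and `to_counts` are hypothetical, and the regular expression is only assumed to approximate the default tokenization (lowercase tokens of two or more word characters).

```python
import re

def tokenize(doc):
    # lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", doc.lower())

def build_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = set()
    for doc in docs:
        tokens.update(tokenize(doc))
    return {token: idx for idx, token in enumerate(sorted(tokens))}

def to_counts(docs, vocabulary):
    """Return one row of token counts per document; unknown tokens are ignored."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in tokenize(doc):
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

simple_train = ['hello how are you', 'Hello are you there', 'why hello there']
vocabulary = build_vocabulary(simple_train)
dtm = to_counts(simple_train, vocabulary)
print(sorted(vocabulary, key=vocabulary.get))
print(dtm)
```

Word order inside each document is discarded: only the per-token counts survive, which is exactly the "bag" in Bag of Words.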

In [9]:
# check the type of the document-term matrix
type(simple_train_dtm)

scipy.sparse.csr.csr_matrix

In [10]:
# examine the sparse matrix contents
print(simple_train_dtm)

  (0, 0)	1
  (0, 1)	1
  (0, 2)	1
  (0, 5)	1
  (1, 0)	1
  (1, 1)	1
  (1, 3)	1
  (1, 5)	1
  (2, 1)	1
  (2, 3)	1
  (2, 4)	1


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
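As a rough illustration of why a sparse representation saves memory, the sketch below stores only the non-zero cells of the toy document-term matrix as (row, column, value) triples. This is a simplified coordinate-style layout written for illustration, not the Compressed Sparse Row format that `scipy.sparse` actually uses, and `to_sparse` is a hypothetical helper.

```python
def to_sparse(dense):
    """Keep only the non-zero cells as (row, column, value) triples."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0]

# the dense document-term matrix of the three toy training documents
dense = [[1, 1, 1, 0, 0, 1],
         [1, 1, 0, 1, 0, 1],
         [0, 1, 0, 1, 1, 0]]

triples = to_sparse(dense)
n_cells = len(dense) * len(dense[0])
density = len(triples) / n_cells   # fraction of cells that are non-zero
print(len(triples), 'stored elements out of', n_cells, 'cells')
```

On this tiny corpus most cells are still non-zero, but with a realistic vocabulary the stored fraction drops sharply, which is where the savings come from.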

In [11]:
# example text for model testing
simple_test = ["hello world"]

In [12]:
# transform testing data into a document-term matrix (using existing vocabulary)
simple_test_dtm = vect.transform(simple_test)
simple_test_dtm.toarray()

array([[0, 1, 0, 0, 0, 0]])

In [13]:
# examine the vocabulary and document-term matrix together
pd.DataFrame(simple_test_dtm.toarray(), columns=vect.get_feature_names())



|   | are | hello | how | there | why | you |
|---|-----|-------|-----|-------|-----|-----|
| 0 | 0 | 1 | 0 | 0 | 0 | 0 |
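The test document is transformed with the vocabulary learned from the training data, so the unseen word "world" is silently dropped. The sketch below contrasts this with refitting a fresh vectorizer on the test data, which produces columns that no longer line up with the training matrix. The variable names `reused` and `refit` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['hello how are you', 'Hello are you there', 'why hello there']
test_docs = ['hello world']

# correct: fit on training data, then only transform the test data
vect_demo = CountVectorizer().fit(train_docs)
reused = vect_demo.transform(test_docs)

# incorrect: a vectorizer fitted on the test data learns a different vocabulary
refit = CountVectorizer().fit_transform(test_docs)

print(reused.shape)  # same number of columns as the training matrix
print(refit.shape)   # only the tokens that occur in the test document
```

A model trained on the six training columns could not be applied to the refit matrix, which is why `transform()` rather than `fit_transform()` is used on test data.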


### Load Dataset: Reading a text-based dataset into pandas

In [14]:
# read file into a pandas DataFrame
path = './spam_ham.csv'
spam_ham = pd.read_csv(path, header=0, names=['label', 'location','message'])
spam_ham.drop('location', axis=1, inplace=True)
spam_ham.dropna(inplace=True)

In [22]:
## What if you have separate files for each of the emails?
## How will you load the files and create a single table of them?
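
One possible answer to the question above, as a sketch only: it assumes a hypothetical folder layout in which each label has its own sub-folder of `.txt` files (for example `emails/spam/*.txt` and `emails/ham/*.txt`). The function name `load_email_folder` and the layout are assumptions, not part of the original materials.

```python
from pathlib import Path

import pandas as pd

def load_email_folder(root):
    """Read root/<label>/*.txt into a DataFrame with 'label' and 'message' columns."""
    rows = []
    for label_dir in sorted(Path(root).iterdir()):
        if not label_dir.is_dir():
            continue
        for path in sorted(label_dir.glob('*.txt')):
            # errors='replace' keeps going when an email contains undecodable bytes
            text = path.read_text(encoding='utf-8', errors='replace')
            rows.append({'label': label_dir.name, 'message': text})
    return pd.DataFrame(rows, columns=['label', 'message'])
```

Calling `load_email_folder('emails')` on such a layout would return one row per file, with the sub-folder name as the label, which can then be used in place of the DataFrame read from the CSV.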

In [23]:
# examine the shape
spam_ham.shape

(30974, 3)

In [24]:
# examine the first 10 rows
spam_ham.head(10)

|   | label | message | label_num |
|---|-------|---------|-----------|
| 0 | spam | LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $... | 1 |
| 1 | spam | Academic Qualifications available from prestig... | 1 |
| 2 | ham | Greetings all. This is to verify your subscrip... | 0 |
| 3 | spam | try chauncey may conferred the luscious not co... | 1 |
| 4 | ham | It's quiet. Too quiet. Well, how about a straw... | 0 |
| 5 | ham | It's working here. I have departed almost tota... | 0 |
| 6 | spam | The OIL sector is going crazy. This is our wee... | 1 |
| 7 | spam | Little magic. Perfect weekends.http://othxu.rz... | 1 |
| 8 | ham | Greetings all. This is a mass acknowledgement ... | 0 |
| 9 | spam | Hi, L C P A X V V e I r m a A I v A o b n L A ... | 1 |


In [25]:
# examine the class distribution
spam_ham.label.value_counts()

spam    19280
ham     11694
Name: label, dtype: int64

In [26]:
# convert label to a numerical variable
spam_ham['label_num'] = spam_ham.label.map({'ham':0, 'spam':1})

In [27]:
# check that the conversion worked
spam_ham.head(10)

|   | label | message | label_num |
|---|-------|---------|-----------|
| 0 | spam | LUXURY WATCHES - BUY YOUR OWN ROLEX FOR ONLY $... | 1 |
| 1 | spam | Academic Qualifications available from prestig... | 1 |
| 2 | ham | Greetings all. This is to verify your subscrip... | 0 |
| 3 | spam | try chauncey may conferred the luscious not co... | 1 |
| 4 | ham | It's quiet. Too quiet. Well, how about a straw... | 0 |
| 5 | ham | It's working here. I have departed almost tota... | 0 |
| 6 | spam | The OIL sector is going crazy. This is our wee... | 1 |
| 7 | spam | Little magic. Perfect weekends.http://othxu.rz... | 1 |
| 8 | ham | Greetings all. This is a mass acknowledgement ... | 0 |
| 9 | spam | Hi, L C P A X V V e I r m a A I v A o b n L A ... | 1 |


In [28]:
# define the features (X, the raw message text) and the labels (y) for the classifier
X = spam_ham.message
y = spam_ham.label_num
print(X.shape)
print(y.shape)

(30974,)
(30974,)


In [29]:
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(23230,)
(7744,)
(23230,)
(7744,)
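
Because `train_test_split` shuffles randomly, the numbers in this notebook change from run to run. A minimal sketch of one way to make the split repeatable is shown below, using the `random_state` and `stratify` parameters on made-up toy data; the seed value 42 and the `_demo` variable names are arbitrary choices, not something the original notebook uses.

```python
from sklearn.model_selection import train_test_split

# toy data: 12 spam (1) and 8 ham (0) messages
messages = ['msg %d' % i for i in range(20)]
labels = [1] * 12 + [0] * 8

# a fixed random_state makes the shuffle repeatable;
# stratify keeps the spam/ham proportion the same in both splits
split_a = train_test_split(messages, labels, random_state=42, stratify=labels)
split_b = train_test_split(messages, labels, random_state=42, stratify=labels)

X_train_demo, X_test_demo, y_train_demo, y_test_demo = split_a
print(len(X_train_demo), len(X_test_demo))
print(split_a == split_b)  # identical splits for identical seeds
```

By default a quarter of the rows go to the test set, which matches the 23230 / 7744 split shown above.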


## Data Processing: Vectorizing the dataset

In [30]:
# instantiate the vectorizer
vect = CountVectorizer()

In [31]:
# learn the training data vocabulary (the document-term matrix is created in a later step)
vect.fit(X_train)

CountVectorizer()

In [32]:
# examine the fitted vocabulary
vect.get_feature_names()



['00',
 '000',
 '0000',
 '000000',
 '0000000',
 '00000000',
 '000000000',
 '00000000000000',
 '000000000000received',
 '000000000001received',
 '0000000000status',
 '0000000016',
 '00000000message',
 '00000000x',
 '00000001',
 '00000001content',
 '00000004',
 '00000010',
 '00000011',
 '0000001196',
 '00000049',
 '0000005',
 '0000006hz',
 '000000eb',
 '000001',
 '00000111',
 '000001bdaaa0',
 '000001bdb744',
 '000001bdc5a5',
 '000001bdd411',
 '000001bdd98c',
 '000001bdda70',
 '000001bde0a0',
 '000001bed6b7',
 '000001c20f35',
 '000001c642d0',
 '000001c64562',
 '000001c64585',
 '000001c64641',
 '000001c6465f',
 '000001c6468e',
 '000001c676b8',
 '0000020',
 '0000040',
 '0000040b',
 '0000040c',
 '0000060',
 '00000dd0',
 '00001',
 '000010',
 '0000100',
 '00001000',
 '00001004',
 '00001008',
 '0000100c',
 '00001010',
 '00001014',
 '00001018',
 '0000101c',
 '00001023',
 '00001026',
 '00001028',
 '0000102c',
 '00001030',
 '00001034',
 '00001038',
 '0000103e',
 '00001040',
 '00001044',
 '00001048

In [33]:
# learn the vocabulary and transform the training data into a document-term matrix in a single step
X_train_dtm = vect.fit_transform(X_train)

In [34]:
# examine the document-term matrix
X_train_dtm

<23230x165852 sparse matrix of type '<class 'numpy.int64'>'
	with 2317896 stored elements in Compressed Sparse Row format>

In [35]:
# transform testing data into a document-term matrix
# using the transform() method
X_test_dtm = vect.transform(X_test)
X_test_dtm

<7744x165852 sparse matrix of type '<class 'numpy.int64'>'
	with 736524 stored elements in Compressed Sparse Row format>

## Building and evaluating a Spam Classifier

We will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
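
For intuition, here is a small from-scratch sketch of the multinomial Naive Bayes decision rule on made-up count data. It is a teaching illustration under simplifying assumptions (dense Python lists, add-`alpha` smoothing), not scikit-learn's implementation; `train_multinomial_nb` and `predict_class` are hypothetical names.

```python
import math

def train_multinomial_nb(X, y, alpha=1.0):
    """X: list of count vectors, y: list of class labels.
    Returns per-class log priors and per-class log token likelihoods."""
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        # total count of each token within class c, plus additive smoothing
        token_totals = [sum(col) + alpha for col in zip(*rows)]
        denom = sum(token_totals)
        log_likelihood[c] = [math.log(t / denom) for t in token_totals]
    return log_prior, log_likelihood

def predict_class(x, log_prior, log_likelihood):
    """Pick the class with the highest log posterior (up to a constant)."""
    scores = {c: log_prior[c] + sum(n * ll for n, ll in zip(x, log_likelihood[c]))
              for c in log_prior}
    return max(scores, key=scores.get)

# toy vocabulary: ['free', 'meeting', 'money']; label 1 = spam, 0 = ham
X_toy = [[2, 0, 1], [1, 0, 2], [0, 2, 0], [0, 1, 0]]
y_toy = [1, 1, 0, 0]
log_prior, log_likelihood = train_multinomial_nb(X_toy, y_toy)
print(predict_class([1, 0, 1], log_prior, log_likelihood))
print(predict_class([0, 2, 0], log_prior, log_likelihood))
```

The "naive" part is the sum over tokens: each token contributes independently to the score, weighted by how often it appears in the message.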

In [36]:
# import and instantiate a Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [37]:
# train the model using X_train_dtm
nb.fit(X_train_dtm, y_train)

MultinomialNB()

In [38]:
# make predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

In [39]:
# calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

0.9878615702479339

In [40]:
# print the confusion matrix
metrics.confusion_matrix(y_test, y_pred_class)

array([[2923,   11],
       [  83, 4727]])

In [41]:
# Print the classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_class, digits=4))

              precision    recall  f1-score   support

           0     0.9724    0.9963    0.9842      2934
           1     0.9977    0.9827    0.9902      4810

    accuracy                         0.9879      7744
   macro avg     0.9850    0.9895    0.9872      7744
weighted avg     0.9881    0.9879    0.9879      7744
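
The figures in the classification report can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below uses the four counts printed above (rows are true classes, columns are predicted classes); the variable names are illustrative.

```python
# confusion matrix from the cell above: [[TN, FP], [FN, TP]] with spam as the positive class
confusion = [[2923, 11],
             [83, 4727]]
(tn, fp), (fn, tp) = confusion

accuracy = (tp + tn) / (tn + fp + fn + tp)
spam_precision = tp / (tp + fp)   # of the messages flagged as spam, how many were spam
spam_recall = tp / (tp + fn)      # of the actual spam messages, how many were caught
spam_f1 = 2 * spam_precision * spam_recall / (spam_precision + spam_recall)

print(round(accuracy, 4), round(spam_precision, 4), round(spam_recall, 4), round(spam_f1, 4))
```

Here the 11 false positives are ham messages wrongly flagged as spam, and the 83 false negatives are spam messages that slipped through.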



In [42]:
# calculate predicted probabilities for X_test_dtm
y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

array([1.00000000e+000, 7.48963820e-179, 1.00000000e+000, ...,
       1.00000000e+000, 1.60433156e-080, 1.00000000e+000])

In [43]:
# calculate AUC
metrics.roc_auc_score(y_test, y_pred_prob)

0.9974493960690279

## Analysis: Examining the vectorized dataset and spam classifier

We will examine the **count vectorizer** and the **trained spam classifier** to calculate an approximate **spam ratio for each token**.

In [44]:
# store the vocabulary of X_train using the get_feature_names() method of the vect object
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)



165852

In [45]:
# examine the first 20 tokens
print(X_train_tokens[0:20])

['00', '000', '0000', '000000', '0000000', '00000000', '000000000', '00000000000000', '000000000000received', '000000000001received', '0000000000status', '0000000016', '00000000message', '00000000x', '00000001', '00000001content', '00000004', '00000010', '00000011', '0000001196']


In [46]:
# examine the last 20 tokens
print(X_train_tokens[-20:])

['ｋ村', 'ｍ子様セーリングクルーザーをお持ちで', 'ｍ字開脚オナニーを机の下から盗撮', 'ｍａｉｌでのサポートは２４時間対応です', 'ｎ藤', 'ｏｌ', 'ｐｃ', 'ｐｃから簡単プロフィール作成', 'ｓクラス専門店', 'ｓ子様秘密が条件で', 'ｓｅｘを求めている', 'ｓｅｘを求めているのです', 'ｓｍ', 'ｔ165', 'ｔバックは', 'ｔバックはいていたらおならが左右に分散するのでなんか変な感じですけどね', 'ｔ島', 'ｔ谷', 'ｗ６２', 'ｙ里様お互いがくつろげるような']


In [47]:
# Naive Bayes counts the number of times each token appears in each class
nb.feature_count_

array([[2.154e+03, 4.340e+02, 3.720e+02, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [3.460e+03, 5.208e+03, 2.000e+00, ..., 1.000e+00, 4.000e+00,
        1.000e+00]])

In [48]:
# rows represent classes, columns represent tokens
nb.feature_count_.shape

(2, 165852)

In [49]:
# number of times each token appears across all HAM messages
ham_token_count = nb.feature_count_[0, :]
ham_token_count

array([2154.,  434.,  372., ...,    0.,    0.,    0.])

In [50]:
# number of times each token appears across all SPAM messages
spam_token_count = nb.feature_count_[1, :]
spam_token_count

array([3.460e+03, 5.208e+03, 2.000e+00, ..., 1.000e+00, 4.000e+00,
       1.000e+00])

In [51]:
# create a DataFrame of tokens with their separate ham and spam counts
tokens = pd.DataFrame({'token':X_train_tokens, 'ham':ham_token_count, 'spam':spam_token_count}).set_index('token')
tokens.head()

| token | ham | spam |
|-------|-----|------|
| 00 | 2154.0 | 3460.0 |
| 000 | 434.0 | 5208.0 |
| 0000 | 372.0 | 2.0 |
| 000000 | 51.0 | 44.0 |
| 0000000 | 1.0 | 0.0 |


In [52]:
# examine 5 random DataFrame rows
tokens.sample(5)

| token | ham | spam |
|-------|-----|------|
| thed | 0.0 | 37.0 |
| peac | 1.0 | 0.0 |
| asciimay | 1.0 | 0.0 |
| oforganizational | 1.0 | 0.0 |
| phillipines | 1.0 | 0.0 |


In [53]:
# Naive Bayes counts the number of observations in each class
nb.class_count_

array([ 8760., 14470.])

Before we can calculate the "spamminess" of each token, we need to avoid **dividing by zero** and account for the **class imbalance**.

In [54]:
# add 1 to ham and spam counts to avoid dividing by 0
tokens['ham'] = tokens.ham + 1
tokens['spam'] = tokens.spam + 1
tokens.sample(5)

| token | ham | spam |
|-------|-----|------|
| len | 6.0 | 1.0 |
| roadparamus | 3.0 | 1.0 |
| 1000pf | 3.0 | 1.0 |
| 3z4jiqbvzu | 4.0 | 1.0 |
| antibiotics | 3.0 | 5.0 |


In [55]:
# convert the ham and spam counts into frequencies
tokens['ham'] = tokens.ham / nb.class_count_[0]
tokens['spam'] = tokens.spam / nb.class_count_[1]
tokens.sample(5)

| token | ham | spam |
|-------|-----|------|
| 3billion | 0.000342 | 6.9e-05 |
| wheeling | 0.000342 | 0.000346 |
| xoring | 0.000342 | 6.9e-05 |
| captures | 0.000799 | 6.9e-05 |
| thinkthis | 0.000228 | 6.9e-05 |


In [56]:
# calculate the ratio of spam-to-ham for each token
tokens['spam_ratio'] = tokens.spam / tokens.ham
tokens.sample(5)

| token | ham | spam | spam_ratio |
|-------|-----|------|------------|
| onepossible | 0.000342 | 6.9e-05 | 0.201797 |
| qedt | 0.000228 | 6.9e-05 | 0.302695 |
| aqaaap1uaacttaaclvqaaagaaaaaaaaaaaaaa56z | 0.000228 | 6.9e-05 | 0.302695 |
| offical | 0.000228 | 6.9e-05 | 0.302695 |
| madison1500 | 0.000342 | 6.9e-05 | 0.201797 |


In [57]:
# examine the DataFrame sorted by spam_ratio
tokens.sort_values('spam_ratio', ascending=False)

| token | ham | spam | spam_ratio |
|-------|-----|------|------------|
| product_table | 0.000114 | 0.586386 | 5136.738079 |
| proms | 0.000114 | 0.146856 | 1286.454734 |
| professionaladobe | 0.000114 | 0.146856 | 1286.454734 |
| hereopt | 0.000114 | 0.126469 | 1107.864547 |
| bz | 0.000114 | 0.124672 | 1092.124395 |
| ... | ... | ... | ... |
| nodes | 0.165068 | 0.000069 | 0.000419 |
| handy | 0.165868 | 0.000069 | 0.000417 |
| node | 0.167694 | 0.000069 | 0.000412 |
| hb | 0.185731 | 0.000069 | 0.000372 |


In [58]:
# look up the spam_ratio for a given token
# Note: the spam_ratio for the token 'adobe' can differ between runs because the train/test split is random
tokens.loc['adobe', 'spam_ratio']

30.18771362931695
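
The add-one, normalize, and divide steps above can be condensed into one small function, which makes the arithmetic easy to check for a single token. This is a sketch with a hypothetical helper name (`spam_ratio`); the class sizes are the values of `nb.class_count_` printed earlier.

```python
def spam_ratio(ham_count, spam_count, n_ham_docs, n_spam_docs):
    """Add-one smoothed spam-to-ham frequency ratio for a single token."""
    ham_freq = (ham_count + 1) / n_ham_docs     # +1 avoids dividing by zero
    spam_freq = (spam_count + 1) / n_spam_docs  # dividing by class size corrects for imbalance
    return spam_freq / ham_freq

N_HAM, N_SPAM = 8760, 14470  # from nb.class_count_

# a token seen twice in ham and never in spam
ratio_two_ham = spam_ratio(2, 0, N_HAM, N_SPAM)
# a token seen once in ham and never in spam
ratio_one_ham = spam_ratio(1, 0, N_HAM, N_SPAM)
print(round(ratio_two_ham, 6), round(ratio_one_ham, 6))
```

These two cases reproduce the ratios of roughly 0.2018 and 0.3027 that appear repeatedly in the sampled rows above. A ratio above 1 means the token is relatively more frequent in spam, and a ratio below 1 means it is relatively more frequent in ham.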

## Additional Exercises: Tuning the vectorizer (Challenge)

So far, we have been using the default parameters of [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html):

In [59]:
# show default parameters for CountVectorizer
vect

CountVectorizer()

Some parameters that we can tune in the CountVectorizer:

- **stop_words:** string {'english'}, list, or None (default)
    - If 'english', a built-in stop word list for English is used.
    - If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens.
    - If None, no stop words will be used.

- **ngram_range:** tuple (min_n, max_n), default=(1, 1)
    - The lower and upper boundary of the range of n-values for different n-grams to be extracted.
    - All values of n such that min_n <= n <= max_n will be used.

- **max_df:** float in range [0.0, 1.0] or int, default=1.0
    - When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words).
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.

- **min_df:** float in range [0.0, 1.0] or int, default=1
    - When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. (This value is also called "cut-off" in the literature.)
    - If float, the parameter represents a proportion of documents.
    - If integer, the parameter represents an absolute count.
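
To see what these parameters do before trying them on the full dataset, the sketch below fits the vectorizer on the three toy sentences from the start of the notebook and compares the resulting vocabularies. The helper `vocabulary_for` is hypothetical, and the `vocabulary_` attribute is used instead of `get_feature_names()` so the code does not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['hello how are you', 'Hello are you there', 'why hello there']

def vocabulary_for(**params):
    """Fit a CountVectorizer with the given parameters and return its sorted vocabulary."""
    return sorted(CountVectorizer(**params).fit(docs).vocabulary_)

default_vocab = vocabulary_for()                   # single words only
bigram_vocab = vocabulary_for(ngram_range=(1, 2))  # single words plus word pairs
min_df_vocab = vocabulary_for(min_df=2)            # drop tokens found in fewer than 2 documents
max_df_vocab = vocabulary_for(max_df=0.9)          # drop tokens found in more than 90% of documents
stop_vocab = vocabulary_for(stop_words='english')  # drop common English words

for name, vocab in [('default', default_vocab), ('bigrams', bigram_vocab),
                    ('min_df=2', min_df_vocab), ('max_df=0.9', max_df_vocab),
                    ('stop words', stop_vocab)]:
    print(name, len(vocab), vocab)
```

Note how `ngram_range` grows the vocabulary while `min_df`, `max_df`, and `stop_words` shrink it; on the real dataset the same trade-off decides how many of the 165852 columns are kept.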

**Guidelines for tuning the CountVectorizer:**

Tasks:
1. Using the spam ratios obtained earlier, **experiment** by adding more stop words.
2. Experiment with the `min_df`, `max_df`, and `ngram_range` parameters.
    * Try running a grid search (scikit-learn's `GridSearchCV`) over the CountVectorizer parameters.
3. Try reducing or increasing the number of features to get a better score than the previous model.
    * Score above 99.5%? Tell us! :)
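
One way to approach the grid search task is to wrap the vectorizer and the classifier in a `Pipeline`, so that the vectorizer is refitted on the training portion of every cross-validation fold. The sketch below uses a tiny made-up corpus purely to show the mechanics; the step names `vect` and `nb`, the parameter values, and `cv=2` are illustrative assumptions, and the scores on eight toy messages are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# made-up toy corpus: 1 = spam, 0 = ham
toy_messages = [
    'free money now', 'win free prize', 'free offer today', 'claim free money',
    'meeting at noon', 'project meeting notes', 'team meeting today', 'meeting agenda attached',
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([('vect', CountVectorizer()), ('nb', MultinomialNB())])

# '<step name>__<parameter>' addresses a parameter of one pipeline step
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__min_df': [1, 2],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring='accuracy')
search.fit(toy_messages, toy_labels)
print(search.best_params_)
```

On the real data, `search.fit(X_train, y_train)` would be called with the raw message text (not the document-term matrix), and a larger `cv` value would be more appropriate.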

In [0]:
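# split the raw text first, then fit the vectorizer on the training split only,
# so that no vocabulary or document-frequency information leaks from the test set
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X, y)
vect = CountVectorizer(stop_words='english', ngram_range=(1, 7), max_df=0.70)
X_train = vect.fit_transform(X_train_raw)
X_test = vect.transform(X_test_raw)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)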

In [0]:
X_train_tokens = vect.get_feature_names()
len(X_train_tokens)

In [0]:
nb = MultinomialNB()
nb.fit(X_train, y_train)
print(classification_report(y_test, nb.predict(X_test), digits=4))

## Additional Exercises: Tuning the Laplacian Correction Factor (Challenge)

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

> Parameters:	
alpha : float, optional (default=1.0)
Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).

One of the parameters that we can tune when training a Multinomial Naive Bayes classifier is the Laplacian correction factor, exposed as the `alpha` parameter.
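
To get a feel for what the correction factor does, the sketch below evaluates the additive smoothing formula, (count + alpha) / (class total + alpha * vocabulary size), for a token that never appears in a class. The function name `smoothed_token_prob` and the toy numbers (6 tokens in the class, a vocabulary of 3) are made up for illustration.

```python
def smoothed_token_prob(token_count, class_token_total, vocab_size, alpha=1.0):
    """Estimate P(token | class) with additive (Laplace/Lidstone) smoothing."""
    return (token_count + alpha) / (class_token_total + alpha * vocab_size)

# probability assigned to an unseen token (count 0) for several values of alpha
unseen = {alpha: smoothed_token_prob(0, 6, 3, alpha) for alpha in (0.0, 0.1, 1.0, 10.0)}
for alpha, prob in unseen.items():
    print(alpha, prob)
```

With `alpha=0` an unseen token gets probability zero, which would wipe out the whole product for any message containing it. Larger values of `alpha` pull all token probabilities towards a uniform distribution.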

Tasks:
1. Vary the correction factor (`alpha`) from 0 to 3 in increments of 0.1, and also try the values 5 and 10, training one classifier per value.
2. Plot the precision-recall curves of these classifiers to compare and contrast them.

In [0]:
classifiers = [MultinomialNB(alpha=i) for i in np.concatenate((np.arange(0, 3.1, 0.1), [5, 10]))]

In [0]:
for i in classifiers:
    i.fit(X_train, y_train)

In [0]:
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
for clf in classifiers:
    # use the predicted spam probability rather than hard class labels,
    # so that the curve is traced over many decision thresholds
    y_score = clf.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, y_score)
    average_precision = average_precision_score(y_test, y_score)
    plot(recall, precision,
         label='alpha = {0:0.1f} (AP = {1:0.4f})'.format(clf.alpha, average_precision))
xlabel('Recall')
ylabel('Precision')
legend(fontsize='x-small')

In [0]:
for i in classifiers:
    print(i.get_params()['alpha'])
    print(classification_report(y_test, i.predict(X_test),digits=4))

## References
This practical notebook is largely based on the scikit-learn documentation and on material from PyCon 2016.

1. http://scikit-learn.org/stable/modules/feature_extraction.html
2. http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
3. https://www.youtube.com/watch?v=WHocRqT-KkU