# Naive Bayes NLP Lab

### Introduction

In this lesson, we'll work with data from the [Spooky Kaggle competition](https://www.kaggle.com/c/spooky-author-identification).  In this challenge, we're given excerpts of horror stories from Edgar Allan Poe, Mary Shelley, and H.P. Lovecraft, and our task is to predict which author wrote each excerpt.

We'll use the Naive Bayes classifier along with our knowledge of NLP to classify the text.

Let's get started.

### Loading the Data

We can begin by loading our zipped CSV file, and then taking a look at the data.

In [1]:
import pandas as pd

df_train = pd.read_csv('./train.zip', index_col = 0)

Let's look at the first couple rows of our dataset.

In [2]:
df_train[:2]

id,text,author
id26305,"This process, however, afforded me no means of...",EAP
id17569,It never once occurred to me that the fumbling...,HPL


Next let's assign the text to the variable `documents`.

In [3]:
documents = df_train.text

To translate the authors into numbers, we can use scikit-learn's `LabelEncoder`, calling `fit_transform` on all of the author data.

> Turn the returned data into a pandas series.

In [4]:
from sklearn.preprocessing import LabelEncoder

In [5]:
lbl_enc = LabelEncoder()
y = lbl_enc.fit_transform(df_train.author.values)

In [6]:
# Wrap the encoded labels in a pandas Series (pandas was already imported as pd above)
y = pd.Series(y)

In [7]:
y[:5]

0    0
1    1
2    0
3    2
4    1
dtype: int64

In [8]:
y.value_counts()

0    7900
2    6044
1    5635
dtype: int64
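
To see which number stands for which author, we can inspect the encoder. The sketch below uses a handful of made-up labels rather than the competition data, and assumes the third author is coded `MWS`, as in the Kaggle file. `LabelEncoder` sorts the labels, so the position of a label in `classes_` is its code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy author labels (illustrative only, not the competition data)
authors = ["EAP", "HPL", "EAP", "MWS", "HPL"]

encoder = LabelEncoder()
codes = encoder.fit_transform(authors)

# classes_ holds the original labels in sorted order; the index is the code
print(encoder.classes_.tolist())   # ['EAP', 'HPL', 'MWS']
print(codes.tolist())              # [0, 1, 0, 2, 1]

# inverse_transform maps codes back to the author labels
print(encoder.inverse_transform([0, 2]).tolist())   # ['EAP', 'MWS']
```

In the lab itself, `lbl_enc.classes_` and `lbl_enc.inverse_transform` should give the same kind of mapping for the real labels.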

Now let's split the data into training and validation data, setting the test size to 0.1.  Make sure to stratify the split on the labels so that both sets keep the same author proportions.

In [10]:
from sklearn.model_selection import train_test_split
X_train_documents, X_documents_validation, y_train, y_validation = train_test_split(
    df_train.text.values, y,
    stratify=y,        # keep the author proportions the same in both splits
    random_state=42,
    test_size=0.1,
    shuffle=True)
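
To illustrate what `stratify` does, here is a small sketch on made-up labels (the names `labels` and `texts` are hypothetical and not part of the lab). With a 6:3:1 class ratio and a 10% validation split, the validation set keeps the same ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 60 of class 0, 30 of class 1, 10 of class 2
labels = np.array([0] * 60 + [1] * 30 + [2] * 10)
texts = np.array([f"doc {i}" for i in range(len(labels))])

X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, stratify=labels, test_size=0.1, random_state=42
)

# With stratify, the 10-row validation set keeps the 6:3:1 class ratio
print(np.bincount(y_val).tolist())   # [6, 3, 1]
print(np.bincount(y_tr).tolist())    # [54, 27, 9]
```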

### Tokenize and Vectorize the Text

Next, we'll build a TF-IDF representation of the text.  Before vectorizing, let's use spaCy to parse and lemmatize our text.

> Begin by loading spaCy's small English model and assigning it to the variable `nlp`.

In [11]:
import spacy

nlp = spacy.load("en_core_web_sm")

Now let's define a `spacy_tokenizer` function that uses spaCy to return a list of the lemmas for each token in a document.

In [12]:
def spacy_tokenizer(document):
    return [token.lemma_ for token in nlp(document)]

In [13]:
first_doc = documents.iloc[0]

spacy_tokenizer(first_doc)[:10]

# ['this',  'process', ',',  'however', ',', 'afford', '-PRON-', 'no', 'mean', 'of']

['this',
 'process',
 ',',
 'however',
 ',',
 'afford',
 '-PRON-',
 'no',
 'mean',
 'of']

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfv = TfidfVectorizer(min_df=3,                   # ignore terms that appear in fewer than 3 documents
                      tokenizer=spacy_tokenizer,  # lemmatize with spaCy
                      strip_accents='unicode',    # remove accents for character normalization
                      analyzer='word',            # use word n-grams, as opposed to character n-grams
                      ngram_range=(1, 2),         # use unigrams and bigrams
                      use_idf=True,               # weight terms by inverse document frequency
                      smooth_idf=True,            # add one to document frequencies to prevent division by zero
                      sublinear_tf=True)          # replace tf with 1 + log(tf)



Now let's fit the vectorizer on both the training and validation documents.

In [18]:
X_train_and_validate = list(X_train_documents) + list(X_documents_validation)
len(X_train_and_validate)

19579

In [125]:
tfv.fit(X_train_and_validate)

TfidfVectorizer(min_df=3, ngram_range=(1, 2), strip_accents='unicode',
                sublinear_tf=True,
                tokenizer=<function spacy_tokenizer at 0x11f5eb950>)

Then we'll separately transform the training data and validation data.  Assign the training data to `X_train_tfv` and the validation data to `X_valid_tfv`.

In [None]:
X_train_tfv = tfv.transform(X_train_documents)
X_valid_tfv = tfv.transform(X_documents_validation)

### Train Model

Now that we have a vectorized representation of our documents, we can train a logistic regression model.

> Train a logistic regression model, setting the maximum number of iterations (`max_iter`) to 2000.

In [128]:
from sklearn.linear_model import LogisticRegression
logistic_clf = LogisticRegression(max_iter = 2000)
logistic_clf.fit(X_train_tfv, y_train)

LogisticRegression(max_iter=2000)

> To evaluate the model, we'll use the multiclass log loss function that is defined in the Kaggle competition.  We need to pass in the actual target values and the predicted probabilities.

In [48]:
import numpy as np
def multiclass_logloss(actual, predicted, eps=1e-15):
    """Multi class version of Logarithmic Loss metric.
    :param actual: Array containing the actual target classes
    :param predicted: Matrix with class predictions, one probability per class
    """
    # Convert 'actual' to a binary array if it's not already:
    if len(actual.shape) == 1:
        actual2 = np.zeros((actual.shape[0], predicted.shape[1]))
        for i, val in enumerate(actual):
            actual2[i, val] = 1
        actual = actual2

    clip = np.clip(predicted, eps, 1 - eps)
    rows = actual.shape[0]
    vsota = np.sum(actual * np.log(clip))
    return -1.0 / rows * vsota
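
The function above computes the mean negative log of the probability that the model assigned to each sample's true class. As a sanity check, the sketch below computes that quantity by hand on made-up probabilities and compares it with scikit-learn's `log_loss`. The arrays `y_true` and `probs` are illustrative and not part of the lab.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example: 3 samples, 3 classes (values are illustrative only)
y_true = np.array([0, 2, 1])
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Probability assigned to the correct class of each sample: 0.7, 0.6, 0.5
p_correct = probs[np.arange(len(y_true)), y_true]

manual = -np.mean(np.log(p_correct))
reference = log_loss(y_true, probs, labels=[0, 1, 2])

print(round(manual, 4), round(reference, 4))   # 0.5202 0.5202
```

Lower values are better: a model that always gave probability 1 to the correct class would score 0.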

Now that this function is defined, we can score our model on the validation set.  To do so, we first need to find the predicted probabilities for our validation data.

In [130]:
predictions_tfv = logistic_clf.predict_proba(X_valid_tfv)

In [131]:
predictions_tfv[:5]

array([[0.6942848 , 0.07171852, 0.23399668],
       [0.79806293, 0.08128602, 0.12065105],
       [0.61021556, 0.16348575, 0.22629869],
       [0.72833801, 0.1429832 , 0.12867879],
       [0.68032276, 0.11208712, 0.20759012]])

> We can see the predicted probabilities for each instance.  Each row sums to 1, and the columns follow the encoded author labels (0, 1, and 2).
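
If we want a single predicted author per row rather than probabilities, we can take the column with the highest probability. The sketch below reuses the first three rows printed above, and assumes the columns map to `EAP`, `HPL`, and `MWS` in that order, since `LabelEncoder` sorts the labels alphabetically.

```python
import numpy as np

# First three rows of the predicted probabilities shown above
probs = np.array([
    [0.6942848, 0.07171852, 0.23399668],
    [0.79806293, 0.08128602, 0.12065105],
    [0.61021556, 0.16348575, 0.22629869],
])

# Assumed column order: 0 = EAP, 1 = HPL, 2 = MWS
class_names = np.array(["EAP", "HPL", "MWS"])

# Each row is a probability distribution over the three authors
print(np.allclose(probs.sum(axis=1), 1.0))   # True

# argmax picks the most probable author for each row
predicted = class_names[probs.argmax(axis=1)]
print(predicted.tolist())   # ['EAP', 'EAP', 'EAP']
```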

Then let's view the log loss of the predictions.

In [135]:
multiclass_logloss(y_validation, predictions_tfv).round(4)

# 0.4947

0.4947

### Try bag of words

Now that we've tried the TF-IDF vectorizer, let's also try the `CountVectorizer` for the bag of words technique.

> Use the spacy tokenizer, and an ngram_range of (1, 2).

In [181]:
from sklearn.feature_extraction.text import CountVectorizer

bow = CountVectorizer(min_df=3, tokenizer=spacy_tokenizer,
                      strip_accents='unicode', analyzer='word', ngram_range=(1, 2))

In [182]:
bow.fit(list(X_train_documents) + list(X_documents_validation))



CountVectorizer(min_df=3, ngram_range=(1, 2), strip_accents='unicode',
                tokenizer=<function spacy_tokenizer at 0x11f5eb950>)

Now let's transform both the training and validation data separately.

In [183]:
X_train_bow = bow.transform(X_train_documents)

In [184]:
X_valid_bow = bow.transform(X_documents_validation)

Now that the data is transformed, let's try the logistic regression model again.  Set the maximum number of iterations to 2000 again, and fit on the bag of words training data.

In [185]:
from sklearn.linear_model import LogisticRegression
log_bow_clf = LogisticRegression(max_iter = 2000)
log_bow_clf.fit(X_train_bow, y_train)

LogisticRegression(max_iter=2000)

Next, predict the probabilities for the bag of words validation data.

In [186]:
predictions_bow = log_bow_clf.predict_proba(X_valid_bow)
predictions_bow[:3]

array([[0.15346822, 0.0011679 , 0.84536387],
       [0.86860729, 0.06581638, 0.06557634],
       [0.85971122, 0.06715917, 0.07312961]])

Then let's take a look at the multiclass log loss score now that we are using the bag of words.

In [187]:
multiclass_logloss(y_validation, predictions_bow)
# 0.4037840408645343

0.40548335671806734

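This is a significant drop in log loss from using bag of words: from 0.4947 with TF-IDF to roughly 0.40.  Because lower log loss is better, the bag of words features work better for this model.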

### Try Naive Bayes

Now it's time to move on to the Naive Bayes model to see how it performs.  Use the multinomial Naive Bayes model with the bag of words data.

In [151]:
from sklearn.naive_bayes import MultinomialNB

nb_bow = MultinomialNB()
nb_bow.fit(X_train_bow, y_train)

MultinomialNB()

Now, let's make predictions using our multinomial Naive Bayes classifier.  Assign the predicted probabilities for the bag of words validation data to `nb_preds`.

In [152]:
nb_preds = nb_bow.predict_proba(X_valid_bow)

In [154]:
nb_preds[:3]

array([[9.99977748e-01, 5.50505568e-15, 2.22522977e-05],
       [7.42444654e-01, 6.68697822e-03, 2.50868367e-01],
       [9.99998881e-01, 3.87083864e-10, 1.11832902e-06]])

Finally, let's score the naive bayes model to see how it performs.

In [153]:
multiclass_logloss(y_validation, nb_preds)

0.6881403031068911

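Here, we can see that our Naive Bayes model, with a log loss of 0.6881, does not perform as well as our logistic regression model on the same bag of words features (roughly 0.40).  Notice that many of the Naive Bayes probabilities above are very close to 0 or 1; log loss heavily penalizes confident predictions that turn out to be wrong.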
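
The Naive Bayes probabilities printed above are often very close to 0 or 1. As a hedged sketch on a made-up corpus (not the competition data), the example below shows how the `alpha` smoothing parameter of `MultinomialNB` changes how extreme the predicted probabilities are. Whether tuning `alpha` would improve the score on this dataset is not something this lab tests.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus with two classes (illustrative only)
train_texts = [
    "raven raven night",
    "night raven",
    "monster creature",
    "creature monster monster",
]
train_labels = [0, 0, 1, 1]

vec = CountVectorizer()
X = vec.fit_transform(train_texts)
X_new = vec.transform(["raven monster monster"])

# alpha is the additive (Laplace) smoothing strength; 1.0 is the default
results = {}
for alpha in (1.0, 0.1):
    clf = MultinomialNB(alpha=alpha)
    clf.fit(X, train_labels)
    results[alpha] = clf.predict_proba(X_new)[0]
    print(alpha, results[alpha].round(3).tolist())
```

With less smoothing, words unseen in a class get a probability closer to zero, so the prediction for the new document becomes more confident.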

### Summary

In this lesson, we practiced using both logistic regression and Naive Bayes models to predict the author of different passages.  We wrote a tokenizer using spaCy to perform lemmatization, and compared TF-IDF and bag of words text representations.  Take a look at some of the resources below to see how some other competitors approached the problem.

### Resources

[Feature Engineering for Spooky](https://www.kaggle.com/sudalairajkumar/simple-feature-engg-notebook-spooky-author)

[NLP Kaggle](https://www.kaggle.com/abhishek/approaching-almost-any-nlp-problem-on-kaggle)