<h2>Background</h2>

This notebook uses Naive Bayes and Logistic Regression to perform authorship identification.

The dataset contains text from works of fiction written by spooky authors of the public domain: Edgar Allan Poe, HP Lovecraft and Mary Shelley. The data was prepared by chunking larger texts into sentences using CoreNLP's MaxEnt sentence tokenizer (an automatic tokenizer similar to the NLTK one we used in the lab), so you may notice the odd non-sentence here and there. Our objective is to accurately identify the author of the sentences in the test set.



<h2>Data Description</h2>

There are only three columns in the CSV file: 
- "**id**": The id of the sample. You should not use this column.
- "**text**": The sentence that you will need to build your model with.
- "**author**": The target you need to predict.
  1. _EAP_: Edgar Allan Poe
  2. _HPL_: HP Lovecraft
  3. _MWS_: Mary Shelley

# $\Omega$ 1: Explore the Training Data

## Step 1

Load the training data `train.csv` into a DataFrame named `df`.

Import the necessary packages if needed.

In [1]:
# Step 1
# TODO:
import pandas as pd

df = pd.read_csv('train.csv')

df.head()

| | id | text | author |
|---|---|---|---|
| 0 | id26305 | This process, however, afforded me no means of... | EAP |
| 1 | id17569 | It never once occurred to me that the fumbling... | HPL |
| 2 | id11008 | In his left hand was a gold snuff box, from wh... | EAP |
| 3 | id27763 | How lovely is spring As we looked from Windsor... | MWS |
| 4 | id12958 | Finding nothing else, not even gold, the Super... | HPL |


## Step 2

Separate the dataframe into the data (the "text" column) and the targets (the "author" column).

In [2]:
X = df['text']
y = df['author']

## Step 3
  
Explore the training data 

For example, check the distribution of different authors in the dataset.

In [3]:
df.groupby('author').count()

| author | id | text |
|---|---|---|
| EAP | 7900 | 7900 |
| HPL | 5635 | 5635 |
| MWS | 6044 | 6044 |
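
The raw counts are easier to judge as proportions. The following is a minimal sketch that rebuilds the counts from the output above by hand; the `counts` Series is an illustrative stand-in and not the loaded data.

```python
import pandas as pd

# Counts copied from the groupby output above (illustrative stand-in for the real data).
counts = pd.Series({'EAP': 7900, 'HPL': 5635, 'MWS': 6044}, name='author')

# Share of each author in the dataset.
proportions = counts / counts.sum()
print(proportions.round(3))
```

On the real data, `df['author'].value_counts(normalize=True)` should give the same proportions directly. The classes are moderately imbalanced, with EAP the largest.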


# $\Omega$ 2: Split data into training and testing

Split the data into 80% for training and 20% for testing.

Name the resulting training and test data `X_train` and `X_test`, and the resulting training and test targets `y_train` and `y_test`.

In [4]:
# TODO: do the 80-20 train/test splitting
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
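
The `stratify=y` argument keeps the author proportions the same in both parts of the split. A small sketch on toy labels shows the effect; the data here is hypothetical, and `random_state` is fixed only to make the sketch reproducible.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with a 4:3:3 class ratio (hypothetical data, not the real corpus).
toy_y = ['EAP'] * 40 + ['HPL'] * 30 + ['MWS'] * 30
toy_X = [f'sentence {i}' for i in range(len(toy_y))]

# Stratified 80/20 split on the toy data.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

print(Counter(y_tr))  # class counts in the training part
print(Counter(y_te))  # class counts in the test part
```

Both parts keep the 4:3:3 ratio, so the test accuracy is measured on the same class mix the model was trained on.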

# $\Omega$ 3: Preprocess text data

## Step 1

Write a function to tokenize the text into words. We assume each text is one single sentence, so there is no need to do sentence splitting here.

> Hint: use NLTK's `word_tokenize` function.

In [5]:
# Step 1
import nltk

def tokenize_words(text):
    # TODO tokenize the text into words
    return nltk.word_tokenize(text)

## Step 2

Remove stopwords from the text.

> Hint: use stopwords from NLTK to do the filtering.

In [6]:
# Step 2
from nltk.corpus import stopwords

en_stopwords = set(stopwords.words('english'))

def remove_stopwords(words):
    # TODO: return a new list of words with any stopwords removed
    return [word.lower() for word in words if word.lower() not in en_stopwords]
    

## Step 3

Combine the first two steps to perform end-to-end preprocessing for a given text. 

In [7]:
# Step 3

def preprocess(text):
    words = tokenize_words(text) # tokenize
    words = remove_stopwords(words) # remove stopwords
    
    # combine the words into one single string,
    # so that the downstream vectorizer can work with it
    return ' '.join(words)

## Step 4

Test the `preprocess` function from the previous step by running it on some text and checking the result.

In [8]:
sentence = 'The dataset contains text from works of fiction written by spooky authors of the public domain: Edgar Allan Poe, HP Lovecraft and Mary Shelley.'
preprocess(sentence)


'dataset contains text works fiction written spooky authors public domain : edgar allan poe , hp lovecraft mary shelley .'

## Step 5

Use the `preprocess()` function defined in Step 3 to process each text in `X_train`. Do the same for `X_test`.

In [9]:
# Step 5
X_train = [preprocess(x) for x in X_train]

# TODO: preprocess X_test in a similar way
X_test = [preprocess(x) for x in X_test]


# $\Omega$ 3: Vectorize the text data

After preprocessing, we need to vectorize the text, i.e., transform the words into numeric values that the model can digest.

For text data, the [Tf-idf vectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) is a good choice, as it represents each word by its TF-IDF weight.

> The `TF-IDF` weight is an empirically solid measure of a word's importance in a text.

> For more info, look at http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
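
To make the TF-IDF weight concrete, the sketch below recomputes the inverse document frequency by hand and compares it with what `TfidfVectorizer` learns. It assumes the vectorizer's default settings (smoothed idf, L2-normalized rows) and uses three hypothetical toy documents.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents (hypothetical, for illustration only).
docs = ['raven raven night', 'night sea', 'sea monster sea']

vec = TfidfVectorizer()
tfidf = vec.fit_transform(docs)

# Terms in column order, and the number of documents containing each term.
n_docs = len(docs)
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
doc_freq = np.array([sum(t in d.split() for d in docs) for t in terms])

# Smoothed idf used by default: ln((1 + n) / (1 + df)) + 1
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
print(dict(zip(terms, manual_idf.round(3))))

# Each document row is scaled to unit L2 norm by default.
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
print(row_norms)
```

Terms that occur in fewer documents ("raven", "monster") receive a larger idf than terms shared across documents ("night", "sea").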

## Step 1

Initialize a TF-IDF vectorizer.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=2) 
# by setting `min_df=2`, we ignore any word that appears in fewer than 2 documents (sentences) in the data
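
Note that `min_df` counts documents, not total occurrences. The sketch below, on a hypothetical toy corpus, shows a word that occurs twice but is still dropped because both occurrences are in the same document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): 'raven' occurs twice, but only in one document.
docs = ['raven raven night', 'night sea', 'sea monster']

vocab_all = set(TfidfVectorizer(min_df=1).fit(docs).vocabulary_)
vocab_min2 = set(TfidfVectorizer(min_df=2).fit(docs).vocabulary_)

print(sorted(vocab_all))   # every term seen in the corpus
print(sorted(vocab_min2))  # only terms found in at least 2 documents
```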

## Step 2

Fit the TF-IDF vectorizer on our **training data**.

> Hint: use the `fit_transform` function.

In [11]:
# Step 2
# TODO: use the vectorizer defined above to fit and transform the training data `X_train`

X_train = vectorizer.fit_transform(X_train)

## Step 3

Transform the **test data** using the vectorizer fitted in the previous step.

In [12]:
# Step 3
# TODO: use the vectorizer fitted in the previous step to transform the test data `X_test`
X_test = vectorizer.transform(X_test)
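
The test data is only transformed, never fitted, so the vocabulary and idf weights come from the training data alone. A small sketch with hypothetical texts shows what happens to a word that was never seen in training:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ['raven night', 'night sea']  # hypothetical training texts
test_docs = ['sea kraken']                 # 'kraken' never appears in training

vec = TfidfVectorizer()
vec.fit(train_docs)
test_matrix = vec.transform(test_docs)

print(sorted(vec.vocabulary_))  # vocabulary learned from the training texts only
print(test_matrix.shape)        # one row, one column per training-vocabulary term
print(test_matrix.nnz)          # number of non-zero entries in the test row
```

The unseen word is silently ignored, and the test matrix has the same columns as the training matrix, which is what the classifier expects.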


# $\Omega$ 4: Train a Naive Bayes classifier

Here, we will use Sklearn's [MultinomialNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) to train our authorship classifier.

> For more info: see http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB

## Step 1

Initialize a MultinomialNB classifier object and call it `nb`. 

Read its documentation (http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) to figure out if you want to use any non-default values when initializing the classifier.

In [13]:
# Step 1
# TODO: initialize a MultinomialNB
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()

## Step 2

Fit the classifier `nb` using the training data.

In [14]:
# Step 2
# TODO: fit the NB classifier on training data

nb.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

## Step 3

Use the trained `nb` classifier to classify the texts in the test data.

In [15]:
# Step 3
# TODO: predict test data

y_preds = nb.predict(X_test)
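
For reference, the vectorize, fit and predict steps can also be chained in a single object. The following is a minimal sketch on hypothetical toy sentences using scikit-learn's `make_pipeline`; it is an alternative arrangement and not the approach taken in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data, for illustration only.
toy_texts = ['raven night raven', 'raven bells', 'sea monster deep', 'monster cult sea']
toy_authors = ['EAP', 'EAP', 'HPL', 'HPL']

# The pipeline fits the vectorizer and the classifier in one call.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_authors)

toy_preds = toy_model.predict(['raven bells night', 'deep sea cult'])
toy_probs = toy_model.predict_proba(['raven bells night'])

print(toy_preds)   # predicted author per sentence
print(toy_probs)   # class probabilities, columns ordered as toy_model.classes_
```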

## Step 4

Write a function to evaluate your predictions against the truth. This is a classification problem, so we need to use classification metrics here.

In [16]:
# Step 4
from sklearn.metrics import classification_report, accuracy_score

def report_performance(y_preds, y_test):
    # TODO: implement this reporting function
    # accuracy_score expects the true labels first, then the predictions
    acc = accuracy_score(y_test, y_preds)

    print(f'accuracy: {acc}')
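
Accuracy alone hides which authors get confused with each other. As a sketch of what a fuller report could look like, the example below runs `confusion_matrix` and `classification_report` on hypothetical toy labels; the label lists are made up for illustration.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels, for illustration only.
toy_true = ['EAP', 'EAP', 'HPL', 'HPL', 'MWS', 'MWS']
toy_pred = ['EAP', 'HPL', 'HPL', 'HPL', 'MWS', 'EAP']

labels = ['EAP', 'HPL', 'MWS']
toy_acc = accuracy_score(toy_true, toy_pred)
toy_cm = confusion_matrix(toy_true, toy_pred, labels=labels)

print(f'accuracy: {toy_acc:.3f}')
print(toy_cm)  # rows are true authors, columns are predicted authors
print(classification_report(toy_true, toy_pred, labels=labels))
```

The report lists precision, recall and F1 per author, which is useful when the classes are not equally frequent.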

## Step 5

Call the `report_performance` function to report the performance of the `nb` classifier.

In [17]:
# Step 5
# TODO: report the performance of your `nb` classifier
report_performance(y_preds, y_test)


accuracy: 0.81511746680286


# $\Omega$ 5: Train a Logistic Regression classifier

In addition to the Naive Bayes classifier we just experimented with, we can also try a Logistic Regression model.

Again, we can use Sklearn's [Logistic Regression model](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

## Step 1

Initialize a LogisticRegression object and call it `lr`. Train `lr` with the training data.

> Hint: think about whether you need to set the `class_weight` argument when initializing the `LogisticRegression` object, based on your understanding of the data.

In [18]:
# Step 1
# TODO: Initialize a LogisticRegression object and call it `lr`. Train `lr` with the training data.
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(class_weight='balanced')

lr.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
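
With `class_weight='balanced'`, scikit-learn weights each class inversely to its frequency, using `n_samples / (n_classes * count_per_class)`. The sketch below applies that formula to the author counts from the exploration step. Those counts cover the full dataset, so they only approximate the weights used on the stratified training split.

```python
import numpy as np

# Author counts copied from the exploration step (full dataset, an approximation
# of the stratified training split).
authors = ['EAP', 'HPL', 'MWS']
class_counts = np.array([7900, 5635, 6044])

# 'balanced' heuristic: n_samples / (n_classes * count_per_class)
n_samples = class_counts.sum()
n_classes = len(class_counts)
balanced_weights = n_samples / (n_classes * class_counts)

for author, weight in zip(authors, balanced_weights):
    print(f'{author}: {weight:.3f}')
```

The most frequent author (EAP) gets a weight below 1 and the two rarer authors get weights above 1, so errors on the rarer classes cost slightly more during training.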

## Step 2

Use the trained `lr` classifier to classify the texts in the test data. Report the performance of this classifier.

In [19]:
# Step 2
# TODO: Use the trained lr classifier to classify the texts in the test data. 
# Report the performance of this classifier.
y_preds = lr.predict(X_test)

report_performance(y_preds, y_test)


accuracy: 0.804902962206333
