# Capstone Project
## Machine Learning Engineer Nanodegree

Bharat Ramanathan
September 20, 2016

## Definition

### Project Overview

Today, computers perform a myriad of tasks, yet we are unable to communicate with them naturally. We humans use words, sounds and gestures to communicate our ideas, thoughts and feelings. Computers, however, only understand explicit instructions written in artificial programming languages. Natural Language Processing (NLP) is a field of Machine Learning and AI that is concerned with human-computer interaction using natural language. NLP aims to break the language barrier and allow us to interact with computers in a natural manner. Advances in NLP research are one of the primary reasons for the development of personal assistants, voice commands, search, etc.

In this project we are concerned with the ability of systems to process and understand written text. Natural language processing is often difficult because human languages lack a strict structure and contain a variety of anomalies. This messiness makes it challenging to process text in a meaningful way. A rule-based system is not viable since the rules of language are numerous, often open-ended, and change over time.

With text documents it is often difficult to find and discover what we are looking for. Traditionally this is done using search keywords and links. Imagine instead the ability to search and explore documents based on the themes that run through them. We might first find the theme that we are interested in, and then examine the documents related to that theme. This project is about discovering thematic similarities in a collection of documents using Machine Learning and Natural Language Processing.

### Problem Statement

Searching fan-fiction documents is an onerous task. Search tags are often ineffective at finding relevant documents and seldom consider context and themes when retrieving them. While document search is augmented by SEO (Search Engine Optimization) keywords, these can be misleading, often on purpose, to improve page visits and user clicks. Genre tags, although quite accurate, prove too general for the task of search, and users often struggle to find the documents they need using genres alone.

We seek to address the problem of document search by forming meaningful thematic categories that can be used to find relevant documents. Furthermore, these thematically similar documents can be utilized to build recommender systems. We intend to explore *probabilistic topic models* that represent documents as being generated from a group of topics. One way to think about the process of topic modelling is to assume that each document in a collection contains a mixture of topics. A topic can be thought of as a distribution of words that appear in similar syntactic and semantic context. Of course, all we have to begin with are the documents but the model specifies a simple probabilistic procedure by which documents can be generated from a distribution of topics. Standard statistical techniques are then used to invert this process, inferring the set of topics that were responsible for generating a collection of documents. We will further discuss topic models and how they work in the Algorithms and Techniques section.

In [None]:
# Project Imports
%matplotlib inline
import os
from collections import defaultdict, OrderedDict
from operator import itemgetter
from itertools import islice

import numpy as np
import pandas as pd

from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import f1_score, classification_report

from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer

import matplotlib.pyplot as plt
import seaborn as sns
import pyLDAvis
from IPython.core.display import display, HTML

from settings import project_root


### Metrics<a id="metrics"></a>

The topic model's performance is evaluated by measuring its ability to improve the performance of a simple [SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn-linear-model-sgdclassifier) to classify the documents into genres. We calculate the weighted `F1-Score` between predicted genre and the actual genre of documents on a test set.

The `F1-Score` is the harmonic mean of `Precision` and `Recall`. Mathematically, it is defined as follows.

\begin{align}
F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
\end{align}

We use the [**f1_score**](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html) function in the sklearn library to calculate the `F1-Score`, with `average='weighted'` so that each genre's score is weighted by its number of test documents.
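
As a small illustration of what the weighted average does, the sketch below computes per-class precision, recall and F1 from scratch and weights each class by its support. The label lists and the helper name `weighted_f1` are made up for illustration and are not part of the project code; for the same inputs, `f1_score(..., average='weighted')` is expected to return the same value.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores (illustrative helper)."""
    support = Counter(y_true)  # number of true samples per class
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        total += f1 * n_true  # weight the class F1 by its support
    return total / len(y_true)

# Made-up genre labels, purely for illustration.
y_true = ['Humor', 'Humor', 'Family', 'Family', 'Romance', 'Romance']
y_pred = ['Humor', 'Family', 'Family', 'Family', 'Romance', 'Humor']
score = weighted_f1(y_true, y_pred)
print(round(score, 3))
```

Because every class contributes in proportion to its size, a large genre that is classified poorly lowers the score more than a small one.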

It is important to note that the F1-score on the classification task serves as an explicit, quantitative measure of the quality of the topic models. We use only an out-of-the-box classifier; tuning it further through grid search or randomized parameter search would likely increase performance, but that is not the goal here. If the topic-model features achieve the same performance as the traditional Bag-of-Words or TF-IDF features, we consider the topic model good, since it reduces the dimensionality of the data without affecting performance.

The code for the evaluation metrics is implemented below. We report the `Confusion matrix`, the `Classification report` and the `F1-Score`. However, we only use the weighted `F1-Score` to improve the performance of the Topic Model.

In [None]:
def plot_confusion_matrix(cm, title='Confusion matrix'):
    # Plot the confusion matrix as an annotated heatmap.
    sns.set_context('talk')
    sns.heatmap(cm, xticklabels=genres, yticklabels=genres, 
                square=True, annot=True, fmt='.2f')
    plt.title(title)
    plt.ylabel('True labels')
    plt.xlabel('Predicted labels')

def evaluate_prediction(predictions, target, title="Confusion matrix"):
    # F1_score - weighted by class support.
    f_score = f1_score(target, predictions, labels=genres, 
                       pos_label=None, average='weighted')
    print('f1_Score : {}'.format(f_score))
    
    # Confusion matrix; labels=genres keeps rows and columns
    # in the same order as the tick labels used in the plot.
    cm = confusion_matrix(target, predictions, labels=genres)
    print('confusion matrix\n {}'.format(cm))
    print('(row=expected, col=predicted)')
    
    # Classification report: precision, recall, f_score and support.
    # The true labels are passed first, then the predictions.
    print('classification report')
    print(classification_report(target, predictions, labels=genres, target_names=genres))
    
    # Normalize each row so cells show proportions of the true class rather than raw counts.
    cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plot_confusion_matrix(cm_normalized, title + ' Normalized')

## Analysis

### Data Exploration

The raw dataset comprises around 1.5 Million documents scraped from [**fanfiction.net**](https://www.fanfiction.net/) spread across 14 genres. The average length of documents is about 10,000 characters. An initial analysis of a sample of documents revealed that there were languages other than English in the dataset. These were filtered out using a naive stopwords-based language filter and reduced to about 1.38 Million documents primarily in English. Please refer to `lib/Wordvectors/language_filter.py` for how this is accomplished.
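
The idea behind a stopword-based language filter can be sketched as follows. This is not the code in `lib/Wordvectors/language_filter.py`; the tiny stopword sets, the function names and the scoring rule are assumptions chosen only to illustrate the technique, in which a document is assigned the language whose stopwords it contains most often.

```python
# Tiny illustrative stopword sets; a real filter would use much larger lists.
LANGUAGE_STOPWORDS = {
    'english': {'the', 'and', 'is', 'of', 'to', 'in', 'that', 'it'},
    'spanish': {'el', 'la', 'y', 'es', 'de', 'que', 'en', 'los'},
    'french': {'le', 'la', 'et', 'est', 'de', 'que', 'les', 'des'},
}

def guess_language(text):
    """Return the language whose stopwords occur most often in the text."""
    tokens = text.lower().split()
    scores = {lang: sum(token in words for token in tokens)
              for lang, words in LANGUAGE_STOPWORDS.items()}
    return max(scores, key=scores.get)

def keep_english(documents):
    """Keep only the documents guessed to be English."""
    return [doc for doc in documents if guess_language(doc) == 'english']

docs = ["The cat is in the garden and it is happy",
        "El gato es de la casa y que los perros"]
print(keep_english(docs))
```

Such a filter is naive: short texts or texts that mix languages can easily be misclassified, which is consistent with the result being described as "primarily" English.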

Within the scope of this project we demonstrate the use of topic models in categorizing a randomly chosen sample of 100,000 documents spread across 5 genres, viz. `Humor`, `Sci-fi`, `Family`, `Romance` and `Supernatural`. This decision was made due to scalability constraints. We believe, however, that a similar model can be scaled to the original dataset using distributed computing. Please refer to `var/sample_genres.py` and `var/create_csv.py`.

The dataset is also not free from errors and noise in the form of *spelling errors* and *grammatical errors* since most documents are not proofread and the authors tend to use custom tokens as fillers.

In text related NLP the input text is represented as a set of features comprised of `tokens` i.e. a tangible unique set of character(s). [**Tokenization**](#Tokenization) is the process of chopping up a piece of text into pieces, called `tokens` , perhaps at the same time throwing away certain characters, such as punctuation. These are further discussed in the [**Data Pre-processing**](#preprocessing) step along with further feature extraction and selection methods.

The final dataset is present in `data_file.csv`. The text was tokenized and stripped of punctuation before creating the `csv`. We further noticed that the file contained texts shorter than 2000 characters. These were generally *`Author Notes`* and *`Disclaimers`* that carried no information about the `genre`, and were hence omitted from the dataset. We finally arrived at around 89,000 texts spread across 5 classes in the following ratios.

### Exploratory Visualization
With text data, a visualization is unlikely to be informative unless we perform some preprocessing: we cannot visualize the text in a meaningful manner unless we `tokenize` it, `remove stop words` and normalize it. Hence, we [preprocess](#preprocessing) the text before visualization.

In [None]:
# Stopword list from Stone, Denis, Kwantes (2010)
STOPWORDS = """
a about above across after afterwards again against all almost alone along already also although always am among amongst amoungst amount an and another any anyhow anyone anything anyway anywhere are around as at back be
became because become becomes becoming been before beforehand behind being below beside besides between beyond bill both bottom but by call can
cannot cant co computer con could couldnt cry de describe
detail did didn do does doesn doing don done down due during
each eg eight either eleven else elsewhere empty enough etc even ever every everyone everything everywhere except few fifteen
fify fill find fire first five for former formerly forty found four from front full further get give go
had has hasnt have he hence her here hereafter hereby herein hereupon hers herself him himself his how however hundred i ie
if in inc indeed interest into is it its itself keep last latter latterly least less ltd
just
kg km
made make many may me meanwhile might mill mine more moreover most mostly move much must my myself name namely
neither never nevertheless next nine no nobody none noone nor not nothing now nowhere of off
often on once one only onto or other others otherwise our ours ourselves out over own part per
perhaps please put rather requite rather really regarding
same say see seem seemed seeming seems serious several she should show side since sincere six sixty so some somehow someone something sometime sometimes somewhere still such system take ten
than that the their them themselves then thence there thereafter thereby therefore therein thereupon these they thick thin third this those though three through throughout thru thus to together too top toward towards twelve twenty two un under
until up unless upon us used using
various very very via
was we well were what whatever when whence whenever where whereafter whereas whereby wherein whereupon wherever whether which while whither who whoever whole whom whose why will with within without would yet you
your yours yourself yourselves
"""
STOPWORDS = frozenset(word for word in STOPWORDS.split())

In [None]:
# Stemming 
stemming = PorterStemmer()
def stemmer(story):
    # step1. split the document into tokens on whitespace,
    # step2. normalize the case of each token to lower case,
    # step3. ignore tokens of 3 characters or fewer,
    # step4. ignore tokens in the stop word list,
    # step5. stem the token using the Porter stemmer,
    # step6. join the stems back into a string.
    tokens = (word.lower() for word in story.split())
    return ' '.join(stemming.stem(word) for word in tokens
                    if len(word) > 3 and word not in STOPWORDS)

In [None]:
def load_data(f_name):
    pickle_path = f_name + '.pickle'
    if os.path.isfile(pickle_path):
        # if the pickle file exists load it instead.
        df = pd.read_pickle(pickle_path)
    else:
        # assumes each line is "genre,story"; split on the first comma only.
        with open(f_name + '.csv') as csv_file:
            rows = [line.rstrip('\n').split(',', 1) for line in csv_file]
        df = pd.DataFrame(rows, columns=['genre', 'story'])
        
        # filter out texts that are less than 2000 chars in length.
        df['charlen'] = [len(story) for story in df['story']]
        df = df[df['charlen']>2000]
        
        # stem the tokens in the text
        df['story'] = df['story'].map(stemmer)
        
        # store the pre-processed texts next to the csv for persistence.
        df.to_pickle(pickle_path)
    return df

In [None]:
data_loc = os.path.join(project_root, 'tmp', 'data_file')
data_set = load_data(data_loc)
genres = data_set['genre'].unique()

In [None]:
data_set['genre'].value_counts()

In [None]:
def get_counts(series):
    # tokenize each document in an input series and 
    # update a dictionary of token frequencies.(keys=tokens, values=frequency)
    # return a sorted dictionary based on frequencies.
    freqs = defaultdict(int)
    for story in series:
        for token in story.split():
            freqs[token] +=1
    return OrderedDict(sorted(freqs.items(), key=itemgetter(1), reverse=True))

In [None]:
def plot_data(freqs, title, n=25, m=50):
    # plot the frequencies of the tokens ranked n to m in an ordered frequency dictionary.
    freqs = OrderedDict(islice(freqs.items(),n,m))
    X = np.arange(len(freqs))
    sns.set_context("talk")
    plt.figure(figsize=(15,5))
    # pass lists rather than dict views so plotting works on Python 3 as well.
    sns.barplot(x=X, y=list(freqs.values()))
    plt.xticks(X, list(freqs.keys()))
    locs, labels = plt.xticks()
    plt.setp(labels, rotation=90)
    ymax = max(freqs.values()) + 0.1
    plt.ylim(0, ymax)
    plt.title(title)
    plt.show()

In [None]:
# Plot the tokens ranked n to m by frequency in the dataset.
n, m = 25, 50
all_freqs = get_counts(data_set['story'])
plot_data(all_freqs, title='{}-{} Most frequent words in the vocabulary'.format(n, m), n=n, m=m)

In [None]:
def genre_counts(df, n=25, m=50):
    # Group the documents by genre and, for each genre,
    # calculate the frequency of the tokens
    # and plot the frequencies of the tokens ranked n to m.
    genre_groups = df.groupby('genre')
    for genre in df['genre'].unique():
        group_freqs = get_counts(genre_groups['story'].get_group(genre))
        plot_data(group_freqs, title='{}-{} Most frequent words in {}'.format(n, m, genre), n=n, m=m)

# Plot the n to m most frequent tokens in each genre.
genre_counts(data_set)

We observe that while many words such as `['wash', 'took', 'voice', 'long']` repeat across multiple genres, certain words appear more frequently in a few genres, e.g. `{'Family': ['father', 'mother'], 'Romance': ['felt', 'smile']}`, and seem to be representative of the genre. We expect the topic models to make use of these features to arrive at coherent topics.

### Algorithms and Techniques<a id="Algorithms"></a>

Topic modeling is a form of text mining, a way of identifying patterns in a corpus. We take the corpus and run it through a model which groups words across the corpus into `topics`. More formally, a topic model assumes that each document contains a mixture of `topics`. The `topics` are themselves mixtures of `tokens`. These `topics` are what are referred to as latent features. In this project we use *Latent Dirichlet Allocation* (LDA).

#### Latent Dirichlet Allocation<a id="LDA"></a>

The basic idea behind [LDA](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) is that documents exhibit multiple topics and are created from a mixture of topics. Formally, a topic is defined as a distribution over a fixed vocabulary. LDA assumes that documents are generated using a distribution of these topics. Operating on this assumption, it generates the words of each document in the collection in a two-stage process.
1. Randomly choose a distribution over topics. For instance, the topic proportions of the document are sampled from a [Dirichlet Distribution](https://en.wikipedia.org/wiki/Dirichlet_distribution) over a fixed set of *`K`* topics.
2. For each word in the document:<br>
    a. Randomly choose a topic from the distribution over topics in step 1.<br>
    b. Randomly choose a word from the corresponding distribution over the vocabulary.
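
The two-stage process above can be sketched with NumPy as follows. The toy vocabulary, the number of topics, the prior values and the function name `generate_document` are assumptions made purely for illustration; in practice the topics are not known in advance and have to be inferred.

```python
import numpy as np

rng = np.random.RandomState(0)

vocabulary = ['dog', 'bike', 'home', 'love', 'ship', 'star']
n_topics = 2            # K, the fixed number of topics
alpha, beta = 0.5, 0.5  # symmetric Dirichlet priors (illustrative values)

# Each topic is a distribution over the vocabulary.
topic_word = rng.dirichlet([beta] * len(vocabulary), size=n_topics)

def generate_document(n_words):
    # Step 1: draw the document's distribution over topics.
    doc_topics = rng.dirichlet([alpha] * n_topics)
    words = []
    for _ in range(n_words):
        # Step 2a: choose a topic from the document's topic distribution.
        topic = rng.choice(n_topics, p=doc_topics)
        # Step 2b: choose a word from that topic's distribution over the vocabulary.
        words.append(rng.choice(vocabulary, p=topic_word[topic]))
    return doc_topics, words

doc_topics, words = generate_document(10)
print(doc_topics)
print(' '.join(words))
```

Note that word order plays no role in this process, which is why a bag-of-words representation is sufficient input for LDA.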

Assuming the above generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection. One way to do this is the [Gibbs Sampling](http://www.mit.edu/~ilkery/papers/GibbsSampling.pdf) algorithm, which starts from a random assignment of a topic to every word and then proceeds as follows:
1. for each document `d` in the collection,
2. go through each word `w` in `d`, and
3. for each topic `t`, compute two things:

\begin{align}
p(topic_{t} \mid document_{d})
\end{align}

*The proportion of words in document `d` that are currently assigned to topic `t`.*

\begin{align}
p(word_{w} \mid topic_{t})
\end{align}

*The proportion of assignments to topic `t` over all documents that come from this word `w`.*

Then reassign *`w`* to a new topic, choosing topic *`t`* with probability proportional to:


\begin{align}
p(topic_{t} \mid document_{d}) \times p(word_{w} \mid topic_{t}) 
\end{align}

According to the generative model, this is essentially the probability that topic *`t`* generated word *`w`*, hence the model resamples the current word’s topic with this probability. Given enough iterations of the Gibbs Sampling algorithm, the model forms cohesive topics for the given text. A more detailed explanation and introduction to LDA is presented in this [paper](https://www.cs.princeton.edu/~blei/papers/Blei2011.pdf) by David M. Blei, one of the authors of the original paper.
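
The resampling loop can be sketched as a small collapsed Gibbs sampler. This is a minimal illustration and not the inference procedure used later in this project; the function name `gibbs_lda`, the toy corpus of word ids and the prior values are assumptions. The two proportions described above appear as count ratios, smoothed by the priors $\alpha$ and $\beta$ so that no topic ever has zero probability.

```python
import numpy as np

def gibbs_lda(docs, n_topics, vocab_size, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = np.random.RandomState(seed)
    doc_topic = np.zeros((len(docs), n_topics))    # words in doc d assigned to topic t
    topic_word = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic t
    topic_total = np.zeros(n_topics)               # total words assigned to topic t

    # Start from a random topic assignment for every word.
    assignments = [rng.randint(n_topics, size=len(doc)) for doc in docs]
    for d, doc in enumerate(docs):
        for w, t in zip(doc, assignments[d]):
            doc_topic[d, t] += 1
            topic_word[t, w] += 1
            topic_total[t] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment of this word from the counts.
                t = assignments[d][i]
                doc_topic[d, t] -= 1
                topic_word[t, w] -= 1
                topic_total[t] -= 1

                # p(topic | document) * p(word | topic), up to a constant factor.
                p = ((doc_topic[d] + alpha) *
                     (topic_word[:, w] + beta) / (topic_total + vocab_size * beta))
                t = rng.choice(n_topics, p=p / p.sum())

                # Record the newly sampled assignment.
                assignments[d][i] = t
                doc_topic[d, t] += 1
                topic_word[t, w] += 1
                topic_total[t] += 1
    return doc_topic, topic_word

# Toy corpus: word ids 0-2 and 3-5 never co-occur in a document.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
doc_topic, topic_word = gibbs_lda(docs, n_topics=2, vocab_size=6)
print(doc_topic)
```

The returned count matrices can be normalized row-wise to obtain the per-document topic proportions and the per-topic word distributions.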

There are two parameters of the `dirichlet distribution` that play an important role in **LDA** namely:
1. $\alpha$ (`doc_topic_prior`) controls the per-document topic distribution.
   A high $\alpha$ value means that every document is likely to contain a mixture of most of the topics and not a specific topic, while a low $\alpha$ value implies that a document is more likely to be represented by a few topics.

2. $\beta$ (`topic_word_prior`) controls the per-topic word distribution.<br>
   A high $\beta$ value means that every topic is likely to contain a mixture of most of the words, while a low $\beta$ value means that a topic may contain a mixture of just a few of the words.

Intuitively, a high $\alpha$ makes documents appear more similar to each other, while a high $\beta$ makes topics appear more similar to each other. This [video](https://www.youtube.com/watch?v=3mHy4OSyRf0) is a useful reference for understanding the importance of these hyperparameters.
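
The effect of the concentration parameter can be checked empirically by sampling topic proportions from a symmetric Dirichlet distribution. In this sketch the values 0.1 and 10.0, the sample size and the helper name `mean_largest_share` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.RandomState(42)
n_topics = 5

def mean_largest_share(alpha, n_samples=2000):
    """Average share of the dominant topic across sampled documents."""
    samples = rng.dirichlet([alpha] * n_topics, size=n_samples)
    return samples.max(axis=1).mean()

low = mean_largest_share(0.1)    # sparse: a few topics dominate each document
high = mean_largest_share(10.0)  # dense: topics are mixed almost evenly
print(low, high)
```

With a low $\alpha$ the dominant topic accounts for most of each sampled document, whereas with a high $\alpha$ its share stays close to the uniform value of `1 / n_topics`. The same reasoning applies to $\beta$ and the words within a topic.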

We implement the algorithm using the [LatentDirichletAllocation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html#sklearn-decomposition-latentdirichletallocation) implementation in sklearn, which performs inference with variational Bayes rather than Gibbs sampling, and initialize it with the following parameters:
```
LatentDirichletAllocation(n_topics=5, doc_topic_prior=None, topic_word_prior=None, learning_method='online', learning_decay=0.7, max_iter=250, batch_size=1000, n_jobs=-1, random_state=42)
```

#### Classifier
For the classification task we use a Stochastic Gradient Descent classifier (`SGDClassifier`). We use the classifier as a way to explicitly measure the performance of the topic models generated by `LDA`. `SGDClassifier` fits a linear model and is widely used in text classification. Linear models operate by separating the classes with hyperplanes in the feature space. They are computationally efficient and scale to a large number of samples.

We use the [SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn-linear-model-sgdclassifier) algorithm for the classification tasks with the following initial parameters:
```
SGDClassifier(penalty='elasticnet', loss='log', n_iter=25, shuffle=True, class_weight='balanced', n_jobs=-1, random_state=42)
```

### Benchmark

We benchmark the classification task against a classifier that performs random guesses. Such a baseline is generally appropriate only if the classes are uniformly distributed. Since this is true for this dataset, i.e. we have 5 uniformly distributed classes, we proceed with a `dummy_predictor` that performs random guesses. We expect that features from extraction methods such as `Bag-Of-Words` and `TF-IDF` will increase the performance of the classifier. With the introduction of the `topic model`, the classifier must do better than a random guess and at least as well as with the `Bag-of-words` and `TF-IDF` features for us to claim that the topic model is useful in reducing the dimensions of the dataset whilst forming `topics`.

In [None]:
def dummy_predictor(samples):
    # for each sample in the input predict a random genre.
    return np.random.choice(genres, len(samples))

In [None]:
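# Note: test_data is created in the train-test split step of the
# Methodology section, so that cell must be run before this one.
baseline_predictions = dummy_predictor(test_data['story'])
evaluate_prediction(baseline_predictions, test_data['genre'])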

## Methodology

### Data Preprocessing<a name="preprocessing"></a>

As discussed in the data exploration section earlier, [pre-processing](#preprocessing) of the text is an important step in NLP tasks. A machine learning algorithm cannot work with raw text as input. We must therefore transform the input collection of documents into a meaningful mathematical form that is representative of the text. Before we begin discussing the transformations, it is important to understand that these steps are similar to feature engineering in traditional machine learning and often form an iterative process.

The pre-processing module in this project applies a series of transformations on the text to arrive at a mathematical representation of the text. These are discussed below.

#### Tokenization<a name="Tokenization"></a>

This is the process of representing a document as a list of `tokens`. A token can be based on pre-defined units such as words (unigrams), two-word pairs (bigrams), three-word sets (trigrams) and so on. `Tokens` may also be defined as groups of n characters (e.g. character trigrams). In this project we choose unigrams since they are one of the most commonly used representations. Thus, a document is represented as a list of unigram tokens. For instance, we can tokenize the sentence `"Machine Learning is fun."` into the list of unigram tokens `['Machine', 'Learning', 'is', 'fun', '.']`. While this is the general idea, tokenization of real text is often complex due to irregularities in the text.
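
To make the difference between these token types concrete, the following sketch builds word and character n-grams using only the standard library. The helper names `word_ngrams` and `char_ngrams` are illustrative and not part of the project code; unlike the example above, the `\w+` pattern used here drops punctuation and the text is lowercased.

```python
import re

def word_ngrams(text, n):
    """Split text into word tokens and return the n-grams as tuples."""
    tokens = re.findall(r'\w+', text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """Return the overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "Machine Learning is fun."
print(word_ngrams(sentence, 1))  # unigrams
print(word_ngrams(sentence, 2))  # bigrams
print(char_ngrams("fun", 2))     # character bigrams
```

Higher-order n-grams capture some word order at the cost of a much larger vocabulary, which is one reason unigrams are a common default.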

We use the NLTK package’s regular expression tokenizer [**`nltk.tokenize.RegexpTokenizer`**](http://www.nltk.org/_modules/nltk/tokenize/regexp.html) to tokenize words. We further normalize the case of the text and turn all text to lowercase.

In [None]:
# Example tokenization on a sample document.
sample_texts = ["He has a very big dog.","He owns a bike.",
                "The dog knows how to ride a bike.", "He wanted to walk his dog.",
                "His dog was at home.", "He went home on his bike.",
                "The dog barked as he reached.", "He cycled with his dog."]

tokenizer = RegexpTokenizer(r'\w+')
tokened_texts = [tokenizer.tokenize(sample.lower()) 
                 for sample in sample_texts]
print(tokened_texts)

#### Stopword Removal<a name="Stopwords"></a>

In NLP text processing, **stopwords** are words that are removed from a corpus of documents when an index is created from the text data. Generally, these are the most commonly used functional words such as `the`, `is`, `a` etc. Some NLP tools deliberately avoid removing stopwords for reasons such as preserving context. We remove them since they generally hurt the performance of topic models such as [`Latent Dirichlet Allocation`](#LDA). 

Besides these functional stop words, the implementation in this project uses the parameters `min_df`, `max_df` and `max_features` of the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) class to remove additional tokens that appear in fewer than 10 documents or in more than 75% of the documents, and to cap the size of the vocabulary.
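
The document-frequency pruning performed by `min_df` and `max_df` can be sketched in plain Python. The helper name `prune_vocabulary`, the toy documents and the thresholds are assumptions for illustration, and the cap imposed by `max_features` is not shown.

```python
from collections import Counter

def prune_vocabulary(tokenized_docs, min_df=2, max_df=0.75):
    """Keep tokens found in at least min_df documents and in at most
    a fraction max_df of all documents."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents containing each token.
    doc_freq = Counter(token for doc in tokenized_docs for token in set(doc))
    return sorted(token for token, freq in doc_freq.items()
                  if freq >= min_df and freq / n_docs <= max_df)

toy_docs = [['dog', 'bike', 'said'],
            ['dog', 'home', 'said'],
            ['bike', 'home', 'said'],
            ['dog', 'walk', 'said']]
print(prune_vocabulary(toy_docs))
```

Here `walk` is dropped for being too rare and `said` for appearing in every document, so corpus-specific stop words are removed without having to list them by hand.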

In [None]:
filtered_tokens = [[word for word in tokened_text if not word in STOPWORDS]
                   for tokened_text in tokened_texts]
print(filtered_tokens)

#### Stemming<a name="stem"></a>

Stemming is the process of reducing a word to a common root, often ignoring the syntactic and semantic context of the token. For example, the words *`connect`*, *`connected`*, *`connecting`* and *`connection`* are all reduced to *`connect`*. For grammatical reasons, text documents often use different forms of the same word in different contexts. While this might be helpful for deriving grammatical meaning, these forms do not differ much in their semantic usage, so we can stem the tokens and benefit from the reduced size of the vocabulary. We perform stemming using the [`PorterStemmer`](http://www.nltk.org/api/nltk.stem.html#nltk.stem.porter.PorterStemmer) algorithm in the [nltk](http://www.nltk.org/#natural-language-toolkit) library. It is a very popular stemmer in natural language processing and is used as a pre-processor in a variety of classification tasks.

In [None]:
# Example of stemming the text with the PorterStemmer.
demo_stemmer = PorterStemmer()
token_stems = [[demo_stemmer.stem(token) for token in filtered_token] 
                for filtered_token in filtered_tokens] 
print(token_stems)

#### Bag-Of-Words<a name="BoW"></a>

A `bag-of-words (BoW)` is a matrix representation of the text corpus. In the example below, the terms/tokens in the vocabulary form the rows while the documents in the corpus form the columns of the matrix (sklearn itself stores one row per document, which is why the example transposes the matrix for display). Since not all tokens in the vocabulary are present in all documents, the `BoW` matrix is usually sparse. The values in the matrix are the frequencies of occurrence of the tokens in the documents. The matrix is constructed using the [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) class in sklearn.

In [None]:
# Example BoW model matrix for a series of documents.

pre_processed_text = [' '.join(text) for text in token_stems]
demo_bow_vectorizer = CountVectorizer(analyzer='word',
                                      tokenizer=word_tokenize)

demo_bow_model = demo_bow_vectorizer.fit_transform(pre_processed_text)
col_names = ['text'+str(i+1) for i in range(len(pre_processed_text))]
demo_df = pd.DataFrame(demo_bow_model.todense().T,
                  index=demo_bow_vectorizer.get_feature_names(),
                  columns=col_names)
display(demo_df)

#### TFIDF Matrix<a name="TF-IDF"></a>

The `Term frequency-Inverse document frequency` (TF-IDF) matrix is a re-weighted representation of the [BoW](#BoW) model. It augments the representation by scoring the importance of `terms` (words) in a document based on how frequently they appear in that document and across the collection. Intuitively, if a word appears many times in a document, it gets a high score. However, if it also appears in many documents, it is not distinctive, and hence it is given a lower score. We use [`TfidfTransformer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn-feature-extraction-text-tfidftransformer) to transform the BoW model to a TF-IDF matrix.
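
The weighting can be worked through by hand on a toy count matrix. This sketch assumes the smoothed IDF formula $\ln\frac{1 + n}{1 + df} + 1$ followed by `l2` normalization of each document, which to our understanding corresponds to the default settings of `TfidfTransformer`; the counts and term names are made up.

```python
import numpy as np

# Toy BoW counts: rows are documents, columns are the terms below.
terms = ['dog', 'bike', 'home']
counts = np.array([[2, 0, 1],
                   [1, 1, 0],
                   [1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)              # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1  # smoothed inverse document frequency

tfidf = counts * idf                                   # weight raw counts by idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # l2-normalize each document

print(dict(zip(terms, idf.round(3))))
print(tfidf.round(3))
```

The term `dog` occurs in every document and therefore receives the smallest IDF weight, while `bike` and `home`, which each occur in a single document, are weighted up.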

In [None]:
# Example of Tfidf Transformation of a BoW matrix.
demo_tfidf_vectorizer = TfidfTransformer()
demo_tfidf_model = demo_tfidf_vectorizer.fit_transform(demo_bow_model)

col_names = ['text'+str(i+1) for i in range(len(pre_processed_text))]
demo_df = pd.DataFrame(demo_tfidf_model.todense().T, 
                       index=demo_bow_vectorizer.get_feature_names(),
                       columns=col_names)
display(demo_df)

### Implementation

#### Read and preprocess

We begin by importing the pre-processed CSV file into a pandas DataFrame. We process the text by removing `stop-words`, normalizing the case and `stemming`, and write the normalized text back to the DataFrame. We also pickle the data for future use and shuffle it in this step.

In [None]:
data_set = load_data(data_loc)
data_set = data_set.sample(frac=1).reset_index(drop=True)
genres = data_set['genre'].unique()

#### Train-test split.

Next we split the DataFrame into train and test sets using the sklearn [train-test split](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.train_test_split.html#sklearn-cross-validation-train-test-split) function. The test set is kept hidden from both the feature transformers and the classifier. This ensures that no data leaks from the test set to the train set during the feature transformation and extraction stages. We only use the test set to evaluate the final performance in each case.

In [None]:
# with about 89,000 instances, even 8,900 instances form a good test set.
train_data, test_data = train_test_split(data_set, test_size=0.1, random_state=42)

#### Bag-of-Words Features Model Performance

Initially, we implement the [Bag-Of-Words](#BoW) model using the `CountVectorizer` class as demonstrated earlier. We fit the vectorizer on the training set and transform the training and the test sets. Note that we do not train on the test set and do not include it in the vocabulary. To get an idea of the quality of features we measure the performance of the `SGDClassifier` using the [evaluation metrics](#metrics) defined earlier.

In [None]:
# initialize the CountVectorizer with the initial parameters.
bow_params ={'analyzer': 'word',
             'tokenizer': word_tokenize,
             'max_df': 0.5,
             'max_features': 75000,
             'min_df': 10
             } 
bow_vectorizer = CountVectorizer(**bow_params)

# train on and transform the train-set
bow_train_features = bow_vectorizer.fit_transform(train_data['story'])

# transform the test-set
bow_test_features = bow_vectorizer.transform(test_data['story'])

# Initialize the classifier with initial parameters.
clf_params = {'class_weight': 'balanced', 
              'loss': 'log', 
              'n_iter': 25, 
              'penalty': 'elasticnet', 
              'random_state': 42, 
              'shuffle': True,
              'n_jobs': -1,
              'warm_start': False
              }
classifier = SGDClassifier(**clf_params)

# train the classifier using the BoW features
classifier.fit(bow_train_features, train_data['genre'])

# predict the classes on the test set.
bow_predictions = classifier.predict(bow_test_features)

# evaluate performance on the transformed test set.
evaluate_prediction(bow_predictions, test_data['genre'])

#### TF-IDF Features Model Performance

We transform the [`Bag-of-Words`](#BoW) (BoW) matrix using a [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn-feature-extraction-text-tfidftransformer) to a [`TF-IDF matrix`](#TF-IDF). Here, we make use of the sklearn [**Pipeline**](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn-pipeline-pipeline) class to first transform the features from the BoW matrix to a TF-IDF matrix and then train the `SGDClassifier`. The advantage of using such a pipeline is that the same chained transformations are applied to the test set during prediction. We also measure the performance of the classifier using the TF-IDF features on the test set and report the results.

In [None]:
# initialize pipeline with tf-idf transformation.
tfidf_params = {'tfidf__norm': 'l1',
                'tfidf__use_idf': True
                }
tfidf_pipeline = Pipeline([('tfidf', TfidfTransformer()),
                           ('clf', classifier)
                           ])
tfidf_pipeline.set_params(**tfidf_params)

# train the pipeline using the Bag-of-Words features.
tfidf_pipeline.fit(bow_train_features, train_data['genre'])

# predict on the test Bag-of-words features.
tfidf_predictions = tfidf_pipeline.predict(bow_test_features)

# evaluate performance on transformed test set.
evaluate_prediction(tfidf_predictions, test_data['genre'])
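
To make the transformation concrete, the sketch below computes TF-IDF weights with `l1` normalization by hand on a toy count matrix. It assumes the smoothed idf that `TfidfTransformer` applies by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`. The function name `tfidf_l1` and the toy counts are illustrative and not part of the project code.

```python
import math


def tfidf_l1(counts):
    """Turn raw term counts (documents x terms) into l1-normalized TF-IDF."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # document frequency: number of documents in which each term occurs
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # smoothed inverse document frequency (assumed sklearn default)
    idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    result = []
    for row in counts:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        total = sum(abs(value) for value in weighted)
        # l1 normalization: every non-empty row sums to 1
        result.append([value / total if total else 0.0 for value in weighted])
    return result


# toy Bag-of-Words matrix: 3 documents, 3 terms (hypothetical data)
toy_counts = [[2, 1, 0],
              [1, 0, 1],
              [1, 0, 0]]
toy_tfidf = tfidf_l1(toy_counts)
for row in toy_tfidf:
    print([round(value, 3) for value in row])
```

The first term occurs in every toy document, so its share of the first row drops from 2/3 of the raw counts to about 0.54 after weighting, while each row still sums to 1.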

#### LDA Features Model Performance

LDA is applied to the BoW and the TF-IDF features independently, and the performance is reported for each feature model. Although the performance does not improve with the default LDA settings, several hyperparameters can be tuned to improve the LDA model. These include `n_topics`, `doc_topic_prior` and `topic_word_prior`. The tuning of these hyperparameters is discussed in the [Refinement](#refine) section.
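
As background for these hyperparameters: `doc_topic_prior` is the concentration of the symmetric Dirichlet prior that LDA places on each document's topic mixture, and `topic_word_prior` plays the same role for each topic's word distribution. The simulation below only illustrates that prior, not the fitted model. It draws topic mixtures for two of the prior values tried later (0.3 and 0.9) and compares how much weight the dominant topic receives on average. The helper names, the number of topics and the sample size are assumptions made for the example.

```python
import random


def sample_dirichlet(alpha, n_topics, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [draw / total for draw in draws]


def mean_largest_share(alpha, n_topics=10, n_docs=2000, seed=42):
    """Average weight of the dominant topic over simulated documents."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, n_topics, rng))
              for _ in range(n_docs))
    return sum(shares) / n_docs


sparse_share = mean_largest_share(0.3)  # smaller prior value
dense_share = mean_largest_share(0.9)   # larger prior value
print('alpha=0.3: {:.2f}, alpha=0.9: {:.2f}'.format(sparse_share, dense_share))
```

A smaller prior concentrates each simulated document on fewer topics, which is why this value is worth searching over rather than leaving at its default.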

In [None]:
# initialize pipeline with lda transformation.

bow_lda_pipeline = Pipeline([('lda', LatentDirichletAllocation()),
                             ('clf', classifier)
                             ])

# train the pipeline using the Bag-of-Words features
bow_lda_pipeline.fit(bow_train_features, train_data['genre'])

# predict on the test Bag-of-words features.
bow_lda_predictions = bow_lda_pipeline.predict(bow_test_features)

# evaluate performance on transformed test set.
evaluate_prediction(bow_lda_predictions, test_data['genre'])

In [None]:
# initialize pipeline with tf-idf and lda transformations.
# LDA keeps its default parameters here; it is tuned in the Refinement section.
tfidf_lda_params = tfidf_params.copy()

tfidf_lda_pipeline = Pipeline([('tfidf', TfidfTransformer()),
                               ('lda', LatentDirichletAllocation()),
                               ('clf', classifier)
                              ])
tfidf_lda_pipeline.set_params(**tfidf_lda_params)

# train the pipeline using the Bag-of-Words features
tfidf_lda_pipeline.fit(bow_train_features, train_data['genre'])

# predict on the test Bag-of-words features.
tfidf_lda_predictions = tfidf_lda_pipeline.predict(bow_test_features)

# evaluate performance on transformed test set.
evaluate_prediction(tfidf_lda_predictions, test_data['genre'])

### Refinement

Here we discuss the refinement process we follow to improve our model. We use [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV.html#sklearn-grid-search-randomizedsearchcv) for finding the hyperparameters of the pipelines described above.

Specifically, we first construct pipelines for the `bow_lda` and `tfidf_lda` models, run `RandomizedSearchCV` on each and report the `grid_scores_` and `best_params_`. Next, we predict on the test set using the best parameters found and report the results.
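
Conceptually, a randomized search over lists of candidate values evaluates only a random subset of the full grid. The sketch below mimics that sampling step in plain Python for the LDA search space used in this section. It is an illustration of the idea, not scikit-learn's own sampler, and the helper `sample_candidates` is a hypothetical name.

```python
import itertools
import random

# candidate values, copied from the LDA search space of this section
search_space = {
    'lda__n_topics': [10, 30, 150, 250],
    'lda__max_iter': [30, 100, 250],
    'lda__doc_topic_prior': [None, 0.3, 0.9],
    'lda__topic_word_prior': [None, 0.3, 0.9],
    'lda__learning_decay': [0.6, 0.9],
}


def sample_candidates(space, n_iter, seed):
    """Enumerate the full grid and draw n_iter distinct settings from it."""
    names = sorted(space)
    grid = [dict(zip(names, values))
            for values in itertools.product(*(space[name] for name in names))]
    rng = random.Random(seed)
    return grid, rng.sample(grid, n_iter)


full_grid, candidates = sample_candidates(search_space, n_iter=15, seed=42)
print(len(full_grid), len(candidates))  # 216 15
```

With 15 iterations the search tries roughly 7% of the 216 possible combinations, which keeps the cost manageable given how expensive each LDA fit is.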

In [None]:
# initialize pipelines
ref_bow_lda_pipeline = Pipeline([('lda', LatentDirichletAllocation()),
                                 ('clf', classifier)
                                 ])

ref_tfidf_lda_pipeline = Pipeline([('lda', LatentDirichletAllocation()),
                                   ('clf', classifier)
                                  ])

# initialize parameters for RandomizedSearchCV
lda_params = {'lda__n_topics': [10, 30, 150, 250],
              'lda__max_iter': [30, 100, 250],
              'lda__doc_topic_prior': [None, 0.3, 0.9],
              'lda__topic_word_prior': [None, 0.3, 0.9],
              'lda__learning_decay': [0.6, 0.9]
             }

In [None]:
# initialize and search 
bow_lda_search = RandomizedSearchCV(estimator=ref_bow_lda_pipeline,
                                    param_distributions=lda_params,
                                    n_iter=15,
                                    scoring='f1_weighted',
                                    random_state=42,
                                    n_jobs=-1)
bow_lda_search.fit(bow_train_features, train_data['genre'])
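
The `f1_weighted` scoring passed to the search computes an F1 score per class and averages these scores, weighting each class by its number of true instances. The following pure-Python sketch shows the arithmetic on a toy example. The function `weighted_f1` and the genre labels are hypothetical and serve only as an illustration.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of the per-class F1 scores."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label in sorted(support):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # weight each class by its share of the true labels
        score += (support[label] / total) * f1
    return score


toy_true = ['romance', 'romance', 'romance', 'humor']
toy_pred = ['romance', 'romance', 'humor', 'humor']
toy_score = weighted_f1(toy_true, toy_pred)
print(round(toy_score, 4))  # 0.7667
```

Here the F1 for `romance` is 0.8 with a weight of 3/4 and the F1 for `humor` is 2/3 with a weight of 1/4, so the frequent class dominates the final score.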

In [None]:
print('grid scores for the Randomized search bow-lda')
print(bow_lda_search.grid_scores_)
print('best parameters: {}'.format(bow_lda_search.best_params_))

In [None]:
# predict using best parameters.
bow_lda_search_predictions = bow_lda_search.predict(bow_test_features)
evaluate_prediction(bow_lda_search_predictions, test_data['genre'])

In [None]:
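# we initialize the TfidfTransformer separately.
# this avoids repeating the tf-idf transformation for every candidate in the search.
# the pipeline-prefixed keys in tfidf_params are not valid constructor arguments,
# so the same settings are passed directly.
tfidf_vectorizer = TfidfTransformer(norm='l1', use_idf=True)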

# train the transformer using the Bag-of-Words train_features
tfidf_train_features = tfidf_vectorizer.fit_transform(bow_train_features)

# transform the test-set
tfidf_test_features = tfidf_vectorizer.transform(bow_test_features)

In [None]:
# initialize
tfidf_lda_search = RandomizedSearchCV(estimator=ref_tfidf_lda_pipeline,
                                      param_distributions=lda_params,
                                      n_iter=15,
                                      scoring='f1_weighted',
                                      random_state=42,
                                      n_jobs=-1)
# fit and search the hyperparameter space
tfidf_lda_search.fit(tfidf_train_features, train_data['genre'])

In [None]:
print('grid scores for the Randomized search of Tfidf-lda')
print(tfidf_lda_search.grid_scores_)
print('best parameters: {}'.format(tfidf_lda_search.best_params_))

In [None]:
# predict using best parameters.
tfidf_lda_search_predictions = tfidf_lda_search.predict(tfidf_test_features)
evaluate_prediction(tfidf_lda_search_predictions, test_data['genre'])

## Results

### Model Evaluation, Validation and Visualization

Topic models are often evaluated by human experts who verify the coherence of the topics, usually judged by how similar the words/tokens grouped under a topic are. Although we measured the usefulness of the topics indirectly through the classifier's `f1-weighted` score, we further investigate the resulting topics using visualizations. We employ the [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/readme.html#pyldavis) library to produce the topic visualizations below. In this visualization we inspect only the best topic model, on the test set.

In [None]:
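# lda_topic_model, vectorized_test_set and vectorizer stand for the best fitted
# LDA model, the matching test-set features and the fitted CountVectorizer.
pyLDAvis.enable_notebook()
pyLDAvis.sklearn.prepare(lda_topic_model, vectorized_test_set, vectorizer)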

Looking at the above topics, it is quite clear that topic 1 is about ..., topic 2 is about ... and topic 3 is about ...
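
A simple complement to the visualization is to list the highest-weighted words of each topic and judge by eye whether they belong together. The sketch below does this for a toy topic-word weight matrix with one row per topic and one column per vocabulary term, the same layout as the `components_` attribute of a fitted `LatentDirichletAllocation`. The helper `top_words`, the vocabulary and the weights are made up for illustration and are not results from this project.

```python
def top_words(topic_word, vocabulary, n=3):
    """Return the n highest-weighted vocabulary terms for every topic."""
    tops = []
    for weights in topic_word:
        ranked = sorted(range(len(weights)),
                        key=lambda j: weights[j], reverse=True)
        tops.append([vocabulary[j] for j in ranked[:n]])
    return tops


# hypothetical vocabulary and topic-word weights (2 topics x 5 terms)
toy_vocab = ['magic', 'wand', 'ship', 'captain', 'love']
toy_topic_word = [[5.0, 4.0, 0.1, 0.2, 1.0],
                  [0.1, 0.3, 6.0, 5.0, 0.5]]

toy_top = top_words(toy_topic_word, toy_vocab)
for index, words in enumerate(toy_top, start=1):
    print('topic {}: {}'.format(index, ', '.join(words)))
```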

### Justification

## Conclusion

### Reflections and Discussion

Natural Language Processing is a challenging endeavour and, true to its nature, creating good models for text data is a daunting task. The difficulty arises from characteristics of text data such as the number of features, the presence of outliers, etc. We began with the objective of creating a coherent topic model of the fan-fiction dataset using `probabilistic topic modelling`. Concretely, we planned to use LDA to find coherent topics that could then be used to describe the collection of documents as a group of topics. To this end we pre-processed the documents, extracted features in the form of document-term matrices and applied LDA to those features. We evaluated the performance of the LDA model using a classifier. The results..

### Improvement

Our task at hand was to improve the 