# ANLP 2015 Text Classification Assignment

## Write Up

### Introduction

This notebook contains the code and documentation that we&mdash;Emily Scharff and Juan Shishido&mdash;used to obtain our score of "XX.XX" on the public leaderboard for the ANLP 2015 Classification Assignment. We describe our text processing, feature engineering, and model selection approaches.

### Text Processing

The data were loaded into pandas DataFrames. We began by plotting the frequency of each category in the training set and noticed that the distribution was not uniform. Category 1, for example, was the best represented, with 769 questions. Category 6, on the other hand, had the fewest questions, 232. This proved to be a useful insight; we describe below how we used it to our advantage.

In terms of processing the data, our approach was *not* to modify the original text. Rather, we created a new column, `text_clean`, that reflected our changes.

While examining the plain-text training data, we noticed sequences of HTML-escaped characters, such as `&#xd;&lt;br&gt;`, which we removed with a regular expression. We also removed non-alphanumeric characters and replaced runs of whitespace with single spaces.
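
The cleaning steps described above could look roughly like the following. This is a minimal sketch, not the exact expressions we used: the patterns and the function name `clean_text` are illustrative assumptions.

```python
import re

# Assumed patterns: an escaped tag such as "&lt;br&gt;" or a single
# entity such as "&#xd;". The exact expression is a guess.
ESCAPED_HTML_RE = re.compile(r"&lt;.*?&gt;|&#?\w+;")
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Return a cleaned copy of `text`; the input itself is not modified."""
    text = ESCAPED_HTML_RE.sub(" ", text)   # drop HTML-escaped sequences
    text = NON_ALNUM_RE.sub(" ", text)      # drop non-alphanumeric characters
    return WHITESPACE_RE.sub(" ", text).strip()  # collapse whitespace
```

With a DataFrame, the new column could then be filled with something like `df["text_clean"] = df["text"].map(clean_text)`, where the source column name `text` is an assumption.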

### Features and Models

In terms of features, we started simple, using a term-document matrix that only included word frequencies. We also decided to get familiar with a handful of algorithms. We used our word features to train logistic regression and multinomial naive Bayes models. Using Scikit-Learn's `cross_validation` module, we were surprised to find initial scores of around 50% accuracy.
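
A baseline of this kind can be sketched as follows. The function name `baseline_scores` and the default settings are assumptions, and the import uses `sklearn.model_selection`, which replaced the older `cross_validation` module in current Scikit-Learn releases.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def baseline_scores(texts, labels, cv=5):
    """Mean cross-validated accuracy for two word-count baselines."""
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "multinomial_nb": MultinomialNB(),
    }
    scores = {}
    for name, model in models.items():
        # The vectorizer sits inside the pipeline, so the vocabulary is
        # learned from the training part of each fold only.
        pipeline = make_pipeline(CountVectorizer(), model)
        fold_scores = cross_val_score(
            pipeline, texts, labels, cv=cv, scoring="accuracy"
        )
        scores[name] = fold_scores.mean()
    return scores
```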

From here, we deviated somewhat and tried document similarity. Using the training data, we combined questions by category. Our thought was to create seven "documents," one for each category, containing the words used in the corresponding questions. This resulted in a $7 \times w$ matrix, where $w$ is the number of unique words *across* documents, created using Scikit-Learn's `TfidfVectorizer`. For the test data, the matrix was of dimension $w \times q$, where $q$ is the number of questions. Note that $w$ is the same in both matrices, which is what makes the matrix multiplication possible. Of course, the `cosine_similarity` function, which computes the metric we decided to use, takes care of some of the implementation details. Our first submission was based on this approach. We then stemmed the words in our corpus using the Porter Stemmer, which increased our score slightly.
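
The document-similarity idea can be sketched as below. The function name `predict_by_similarity` is hypothetical. Note that `cosine_similarity` expects both inputs as rows of samples, so in this sketch the test matrix has shape $q \times w$ and the function handles the transposition internally.

```python
from collections import defaultdict

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def predict_by_similarity(train_texts, train_labels, test_texts):
    """Label each test question with its most similar category document."""
    grouped = defaultdict(list)
    for text, label in zip(train_texts, train_labels):
        grouped[label].append(text)

    # One combined "document" per category, in a fixed category order.
    categories = sorted(grouped)
    category_docs = [" ".join(grouped[c]) for c in categories]

    vectorizer = TfidfVectorizer()
    category_matrix = vectorizer.fit_transform(category_docs)  # (n_categories, w)
    test_matrix = vectorizer.transform(test_texts)             # (q, w), same vocabulary

    similarities = cosine_similarity(test_matrix, category_matrix)  # (q, n_categories)
    return [categories[i] for i in similarities.argmax(axis=1)]
```

A question that shares no words with any category document gets a similarity of zero everywhere and falls back to the first category in sorted order; that fallback is a property of this sketch, not necessarily of our submission.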

Before proceeding, we decided to use Scikit-Learn's `train_test_split` function to create a development set&mdash;about 20% of the training data&mdash;on which to test our models. To fit our models, we used the remaining 80% of the original training data.
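
A split like this takes one call. The sketch below assumes parallel lists of texts and labels; the wrapper name `make_dev_split` and the fixed seed are illustrative choices.

```python
from sklearn.model_selection import train_test_split


def make_dev_split(texts, labels, dev_fraction=0.2, seed=0):
    """Hold out `dev_fraction` of the training data as a development set.

    Returns (train_texts, dev_texts, train_labels, dev_labels).
    """
    return train_test_split(
        texts, labels, test_size=dev_fraction, random_state=seed
    )
```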

In our next iteration, we went back to experimenting with logistic regression and naive Bayes, but also added a linear support vector classifier. Here, we also started to add features. Because we were fitting a model, we did *not* combine questions by category. Rather, our tfidf feature matrix had a row for each question.
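
Comparing the three classifiers on a per-question tf-idf matrix might look like this. It is a sketch using default hyperparameters, which are an assumption, and the function name `dev_accuracy` is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def dev_accuracy(train_texts, train_labels, dev_texts, dev_labels):
    """Fit three classifiers on tf-idf features and score them on the dev set."""
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_texts)  # one row per question
    X_dev = vectorizer.transform(dev_texts)          # reuse the fitted vocabulary

    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "multinomial_nb": MultinomialNB(),
        "linear_svc": LinearSVC(),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, train_labels)
        results[name] = accuracy_score(dev_labels, model.predict(X_dev))
    return results
```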

We tried many features. We ended up with the following list:

* number of question marks
* number of periods
* number of apostrophes
* number of "the"s
* number of words
* number of stop words
* number of first person words
* number of second person words
* number of third person words
* indicators for whether the *first* word was in `['what', 'how', 'why', 'is']`
* **OTHER FEATURES**
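
The count-based features in the list above can be computed with a small helper. The sketch assumes the counts are taken on the raw text, since cleaning strips punctuation. The word sets below are short illustrative stand-ins, not the lists we actually used, and `handcrafted_features` is a hypothetical name.

```python
import re

# Illustrative word lists (assumptions); real lists would be longer.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
SECOND_PERSON = {"you", "your", "yours"}
THIRD_PERSON = {"he", "him", "his", "she", "her", "hers", "it", "its",
                "they", "them", "their", "theirs"}
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "it",
              "that", "this", "for", "on", "with"}
FIRST_WORDS = ("what", "how", "why", "is")


def handcrafted_features(text):
    """Return a dict of count and indicator features for one question."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    first_word = words[0] if words else ""
    features = {
        "n_question_marks": text.count("?"),
        "n_periods": text.count("."),
        "n_apostrophes": text.count("'"),
        "n_the": words.count("the"),
        "n_words": len(words),
        "n_stop_words": sum(w in STOP_WORDS for w in words),
        "n_first_person": sum(w in FIRST_PERSON for w in words),
        "n_second_person": sum(w in SECOND_PERSON for w in words),
        "n_third_person": sum(w in THIRD_PERSON for w in words),
    }
    # One 0/1 indicator per candidate first word.
    for candidate in FIRST_WORDS:
        features["first_word_is_" + candidate] = int(first_word == candidate)
    return features
```

These dictionaries can be turned into a numeric matrix and stacked next to the tf-idf matrix, for example with `scipy.sparse.hstack`.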

We also stemmed the words prior to passing them through the `TfidfVectorizer`.
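
Stemming before vectorizing can be as simple as the following sketch, which assumes NLTK's `PorterStemmer` and whitespace tokenization; `stem_text` is a hypothetical helper name.

```python
from nltk.stem import PorterStemmer

_stemmer = PorterStemmer()


def stem_text(text):
    """Stem each whitespace-separated word and rejoin into one string."""
    return " ".join(_stemmer.stem(word) for word in text.split())
```

The stemmed strings are then passed to `TfidfVectorizer` in place of the cleaned text.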

When we noticed some misspelled words, we tried using Peter Norvig's `correct` function, but it did not improve our accuracy scores.

The plots we created when assessing the various models were especially helpful. We plotted the predicted labels against the ground truth. (An example of this is included below.) This helped us see, right away, that the linear SVC was performing best across all the combinations of features we tried. This is how we eventually decided to stick with that algorithm.
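
The same predicted-versus-truth comparison can also be tabulated instead of plotted. The sketch below builds a table of counts (a confusion matrix), which is an alternative view rather than the plot itself; `confusion_counts` is a hypothetical name.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Nested dict: table[true][predicted] -> number of examples."""
    pairs = Counter(zip(true_labels, predicted_labels))
    labels = sorted(set(true_labels) | set(predicted_labels))
    return {t: {p: pairs[(t, p)] for p in labels} for t in labels}
```

Large off-diagonal counts in a single column show a classifier that over-predicts one category.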

During one of the iterations, we noticed that the naive Bayes model was incorrectly predicting category 1 for a majority of the data. We remembered the distribution of categories mentioned earlier and decided to sample the *other* categories at higher frequencies. We took the original training data, and then drew a random sample of questions from categories 2 through 7. After some experimentation, we decided to sample an extra 1,200 observations. This strategy helped improve our score.
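
The oversampling step can be sketched with the standard library. Sampling with replacement and the fixed seed are assumptions made for this example, and `oversample_minority` is a hypothetical name.

```python
import random


def oversample_minority(texts, labels, majority_label=1, n_extra=1200, seed=0):
    """Append `n_extra` randomly drawn examples from all non-majority categories."""
    rng = random.Random(seed)
    # Candidate pool: every example that is not in the majority category.
    pool = [(t, l) for t, l in zip(texts, labels) if l != majority_label]
    extra = rng.choices(pool, k=n_extra)  # assumption: drawn with replacement

    combined = list(zip(texts, labels)) + extra
    rng.shuffle(combined)
    new_texts, new_labels = zip(*combined)
    return list(new_texts), list(new_labels)
```

The majority category keeps its original count, while its share of the training data shrinks.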

We also spent time examining and analyzing the confidence scores using the `decision_function()` method. The idea here was to see if we could identify patterns in *how* the classifier was incorrectly labeling the development set. Unfortunately, we were not able to use this information to improve our scores.
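
One way to look at these confidence scores is to compare the top two scores per question. The sketch below assumes a `LinearSVC` on tf-idf features with more than two categories, so that `decision_function()` returns one column per class; `confidence_report` and the "margin" summary are illustrative choices.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def confidence_report(train_texts, train_labels, dev_texts, dev_labels):
    """Per dev example: true label, predicted label, and top-two score margin."""
    vectorizer = TfidfVectorizer()
    clf = LinearSVC()
    clf.fit(vectorizer.fit_transform(train_texts), train_labels)

    # Shape (n_dev, n_classes) when there are more than two classes.
    scores = clf.decision_function(vectorizer.transform(dev_texts))
    order = np.argsort(scores, axis=1)
    best, runner_up = order[:, -1], order[:, -2]
    rows = np.arange(scores.shape[0])
    margins = scores[rows, best] - scores[rows, runner_up]

    predicted = clf.classes_[best]
    return [
        {"true": t, "predicted": p, "margin": float(m), "correct": bool(p == t)}
        for t, p, m in zip(dev_labels, predicted, margins)
    ]
```

Sorting the incorrect rows by margin separates near misses from confidently wrong predictions.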

Finally, because of all the testing we had done, we had several results files, including results we did not submit. With these files, we used a bagging-style approach, a majority vote across the predictions, to get a "final" classification on the 1,874 test examples.
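
A majority vote over several prediction lists can be sketched as follows. It assumes every results file has been read into a list in the same test-example order; the name `majority_vote` and the tie-breaking rule are assumptions.

```python
from collections import Counter


def majority_vote(prediction_lists):
    """Combine equal-length prediction lists into one list by majority vote.

    Ties go to the label that appears first among the inputs, so the order
    of `prediction_lists` matters when votes are split evenly.
    """
    voted = []
    for votes in zip(*prediction_lists):  # one tuple of votes per test example
        voted.append(Counter(votes).most_common(1)[0][0])
    return voted
```

Because of the tie-breaking rule, listing the strongest model's predictions first lets it decide evenly split votes.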

## Code