In [1]:
%matplotlib inline

In [13]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd
import torch.nn.functional as F

import matplotlib.pyplot as plt

# 0. Getting Started

Welcome to the Practical NLP with PyTorch reading group! The main goal of this reading group is to become comfortable using PyTorch to implement deep learning models for NLP. Each session will focus on a few different "atoms" -- feedforward networks, LSTMs, attention, etc. -- which are common patterns across papers.

## 0.0 Readings

The main readings necessary for the first session are:
- the [PyTorch 60 Minute Blitz](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html). We'll be going over much of this in the first meeting.
- for understanding **word embeddings** and **continuous bag of words (CBoW) training**, I recommend Jurafsky's [Speech and Language Processing (SLP3) Ch. 16](https://web.stanford.edu/~jurafsky/slp3/16.pdf).

This notebook assumes you're already familiar with concepts such as feedforward neural networks and stochastic gradient descent. If you're unfamiliar with (or need a refresher on) any of these topics, I recommend the following resources:
- firstly, if you are unfamiliar with **feedforward neural networks** or **backpropagation**, I recommend no resource more than Michael Nielsen's [Neural Networks and Deep Learning Ch. 1 & 2](http://neuralnetworksanddeeplearning.com/chap1.html). It's a long read (two chapters of a book), but it is highly approachable and very thorough. If you only need a refresher, or would just appreciate something shorter, you can also use [...].
- if you are familiar with **logistic regression**, but unfamiliar with how a logistic regression classifier is trained using gradient descent, I highly recommend the [UFLDL Tutorial on Logistic Regression (and Softmax Regression)](http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/). It goes into a lot of depth about both *what* logistic regression is, as well as how to derive the gradient updates. I strongly recommend going through the derivations for the gradient updates, even though we'll be using PyTorch's autograd.

Once you have completed the readings you feel are necessary, come back and give this notebook a try!

## 0.1 Feature Extraction

The first step will be preprocessing the text. To make processing the data easier, we'll use scikit-learn's `CountVectorizer`, which can automatically create the bag-of-words representation from text.
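
To make the bag-of-words idea concrete, here is a minimal sketch on two made-up, already-tokenized documents. The toy `docs` list is an assumption for illustration only; the real data comes from `read_rt_data()`, whose exact format is not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy, already-tokenized documents (hypothetical stand-ins for the real data).
docs = [
    ["a", "great", "great", "movie"],
    ["a", "dull", "movie"],
]

# With a callable analyzer, each document is passed through as-is,
# so the vectorizer simply counts the tokens it is given.
vectorizer = CountVectorizer(lowercase=False, analyzer=lambda tokens: tokens)
counts = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, vocab_size)

# Recover the vocabulary in column order from the fitted token -> column mapping.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray()
print(vocab)
print(dense)
```

Each row is one document and each column is one vocabulary item, so the repeated "great" in the first document shows up as a count of 2.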

In [7]:
from utils import read_rt_data
from sklearn.feature_extraction.text import CountVectorizer

X, y = read_rt_data()

# TODO: train/test/validation split

vectorizer = CountVectorizer(lowercase=False, analyzer=lambda x: x)
X_features = vectorizer.fit_transform(X)

### Extra Credit: Stopword Removal

In order to make stopword removal easier, you can import a list of common English stopwords from the utils file. The goal here will be to create a new `X_features` matrix that has all the stopwords removed.

In order to accomplish this quickly, I recommend looking into the arguments of `CountVectorizer()`. Scikit-learn is amazingly comprehensive, and it's always better to not have to reinvent the wheel! (Especially when scikit-learn's wheel is a lot faster than ours would be.)
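
One thing to watch for: scikit-learn documents the `stop_words` argument as applying only when `analyzer == 'word'`, so it has no effect when the analyzer is a callable. If you keep a callable analyzer, one possible approach (a sketch, not the only answer to this exercise) is to filter inside the analyzer itself. The stopword set and documents below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stopword list and toy tokenized documents, for illustration only.
toy_stop_words = {"a", "the", "is"}
docs = [
    ["the", "movie", "is", "great"],
    ["a", "dull", "movie"],
]

def analyze_without_stopwords(tokens):
    """Return the document's tokens with stopwords dropped."""
    return [tok for tok in tokens if tok not in toy_stop_words]

vectorizer = CountVectorizer(lowercase=False, analyzer=analyze_without_stopwords)
counts = vectorizer.fit_transform(docs)

# Vocabulary in column order; the stopwords never make it into the feature matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocab)
print(counts.toarray())
```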

In [8]:
from utils import stop_words

# Your code goes here.

# 1. Logistic Regression

For a more detailed explanation of the math below, refer to the [linked UFLDL tutorial](http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/). I'm not aiming to provide a full logistic regression tutorial, but rather just enough math for you to be able to implement it yourself.

In order to get our feet wet with building models, we'll first attempt to classify the text with a simple linear classifier - Logistic Regression. It is possible to think of a logistic regression classifier as the simplest possible neural network: one with no hidden layers. It simply transforms the data according to the following equation:

$$\hat{y} = \frac{1}{1 + \exp(-w^T \cdot X)}$$

In this case, $\hat{y}$ is the prediction, $X$ is the **design matrix**, and $w$ is the **weight vector**, which we'll want to train. The function applied here is known as the **logistic function** (or sigmoid); it squashes any real-valued score into the range (0, 1), so $\hat{y}$ can be read as a probability. Just that equation should give us enough to implement our classifier!
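
As a quick numerical check of the logistic function, the sketch below evaluates it by hand and compares the result with `torch.sigmoid`. The values of `X` and `w` are made up, and the code stores one example per row, so the score is computed as `X @ w` rather than $w^T \cdot X$.

```python
import torch

# Toy design matrix (3 examples, 2 features) and weight vector; values are made up.
X = torch.tensor([[0.0, 0.0], [1.0, 2.0], [-1.0, -2.0]])
w = torch.tensor([0.5, 0.25])

z = X @ w                                    # one real-valued score per example
y_hat_manual = 1.0 / (1.0 + torch.exp(-z))   # the logistic function, written out
y_hat_builtin = torch.sigmoid(z)             # PyTorch's built-in equivalent

print(z)
print(y_hat_manual)
```

A score of 0 maps to exactly 0.5, positive scores map above 0.5, and negative scores map below it.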

Remember, when implementing a model in PyTorch, you subclass from `nn.Module`. When doing so, you must implement an `__init__` method, which sets up all the weights and functions, and a `forward` method, which takes in a piece of data and returns a prediction.
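
If the `nn.Module` pattern is still unfamiliar, here is one possible shape for such a class. It is a sketch rather than the intended solution: the name `TinyLogisticRegression` is hypothetical, and using `nn.Linear` adds a bias term that does not appear in the equation above.

```python
import torch
import torch.nn as nn

class TinyLogisticRegression(nn.Module):
    """Hypothetical example: a single linear layer followed by a sigmoid."""

    def __init__(self, n_features):
        super().__init__()
        # nn.Linear holds the weight vector (and a bias term) as trainable parameters.
        self.linear = nn.Linear(n_features, 1)

    def forward(self, x):
        # x: (batch, n_features) -> probabilities of shape (batch,)
        return torch.sigmoid(self.linear(x)).squeeze(-1)

model = TinyLogisticRegression(n_features=4)
probs = model(torch.randn(3, 4))   # calling the model runs forward()
print(probs)
```

Note that you call `model(x)` rather than `model.forward(x)` directly.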

In [None]:
class LogisticRegression(nn.Module):
    """ Your code goes here. """
    pass

There's only one thing missing to train our classifier: a loss function. The loss function gives a penalty for each prediction based on how wrong it is -- a perfect prediction gives us no penalty, whereas a bad prediction can yield a very high penalty. The loss function we'll be using is called **cross-entropy**, which is what is typically used for classification problems:

$$L(y, \hat{y}) = -\sum_{i=1}^{N}{\left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]}$$
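
To see the loss in action, the sketch below computes binary cross-entropy by hand, with a leading minus sign and a term for the negative class, and checks it against `F.binary_cross_entropy`. The labels and predictions are made up.

```python
import torch
import torch.nn.functional as F

y = torch.tensor([1.0, 0.0, 1.0])       # made-up gold labels
y_hat = torch.tensor([0.9, 0.2, 0.6])   # made-up predicted probabilities

# Binary cross-entropy, summed over examples.
manual = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).sum()
builtin = F.binary_cross_entropy(y_hat, y, reduction="sum")

print(manual.item(), builtin.item())
```

The least confident correct prediction (0.6 for a positive example) contributes the largest share of the penalty.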

If you want more information about what makes cross-entropy better than other loss functions for classification (such as mean squared error, aka MSE), I recommend taking a look at [this section of chapter 3 of the neural networks book](http://neuralnetworksanddeeplearning.com/chap3.html#introducing_the_cross-entropy_cost_function).
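
With a model and a loss function in hand, a training loop could look roughly like the sketch below. Everything here is an assumption for illustration: the data is random toy data, the model is a stand-in built from `nn.Sequential`, and the loop uses full-batch gradient descent for brevity rather than minibatches.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Made-up, linearly separable toy data: the label is 1 when the first feature is positive.
X = torch.randn(64, 2)
y = (X[:, 0] > 0).float()

# Stand-in model (a linear layer plus sigmoid); swap in your own LogisticRegression.
model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.5)

losses = []
for epoch in range(100):
    optimizer.zero_grad()            # clear gradients from the previous step
    y_hat = model(X).squeeze(-1)     # forward pass
    loss = loss_fn(y_hat, y)         # mean binary cross-entropy
    loss.backward()                  # autograd computes the gradients
    optimizer.step()                 # gradient descent update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Plotting `losses` with matplotlib is a quick way to confirm that training is actually making progress.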

In [None]:
model = LogisticRegression(n_features=10)

...

### Extra Credit: Baselines and Bigrams

Too often, not enough thought is given to our baselines before we jump right into complex deep learning models. Wang and Manning give baselines that attention in their 2012 paper, [Baselines and Bigrams](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf). It provides a **really** tough-to-beat baseline for sentiment classification.

Using the paper and what you know, you should be able to implement this model in PyTorch already! You can simply use unigram features for now, though you are also welcome to recompute the feature vectors with higher-order ngrams. Thankfully, that can be done easily with just another argument to the `CountVectorizer`.
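
As a starting point, the core of the paper's feature transformation is the naive Bayes log-count ratio. The sketch below follows my reading of the paper's definitions (smoothed per-class counts `p` and `q`, each L1-normalized, then a log ratio); check the details against the paper before relying on it. The tiny matrix, labels, and `alpha` value are made up.

```python
import torch

# Made-up binarized unigram indicators: 4 documents, 3 vocabulary items.
X = torch.tensor([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0]])
y = torch.tensor([1, 1, 0, 0])    # 1 = positive, 0 = negative
alpha = 1.0                       # smoothing parameter

p = alpha + X[y == 1].sum(dim=0)  # smoothed feature counts in positive documents
q = alpha + X[y == 0].sum(dim=0)  # smoothed feature counts in negative documents
r = torch.log((p / p.sum()) / (q / q.sum()))   # log-count ratio per feature

X_nb = X * r   # scale each feature by its log-count ratio
print(r)
```

A linear classifier is then trained on `X_nb` instead of the raw counts. Features seen mostly in positive documents get a positive ratio, and features seen equally in both classes get a ratio of zero.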

https://www.kaggle.com/jhoward/nb-svm-strong-linear-baseline-eda-0-052-lb/

In [9]:
class NBSVM(nn.Module):
    """ Your code goes here. """
    pass

# 2. Off-the-Shelf Embeddings

Next, we will attempt the same problem, but using off-the-shelf word embeddings, which we'll combine into document embeddings with simple averaging:
- add up all the word vectors in a document
- divide by the number of vectors we found

We'll then train our logistic regression classifier using this reduced feature set.
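
The two averaging steps above can be sketched with a count matrix and an embedding table. The sketch assumes the table has one row per vocabulary item, in the same order as the count matrix's columns, and that every vocabulary item has a vector; both arrays are made up.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up counts (2 documents, 3 vocabulary items) and a made-up embedding table
# with one row per vocabulary item (vocab_size x emb_dim).
X_counts = csr_matrix(np.array([[2, 0, 1],
                                [0, 1, 1]]))
emb_table = np.array([[1.0, 0.0],
                      [0.0, 1.0],
                      [1.0, 1.0]])

summed = X_counts.dot(emb_table)               # sum of word vectors per document
n_tokens = np.asarray(X_counts.sum(axis=1))    # tokens per document, shape (n_docs, 1)
doc_emb = summed / np.maximum(n_tokens, 1)     # average; the max guards empty documents

print(doc_emb)
```

If some words have no pretrained vector, the divisor should count only the words that were actually found, which this simplified version does not handle.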

In [None]:
X_features_emb = X_features.dot(embeddings)

In [None]:
# n_features is the embedding dimension (number of columns), not the total element count
model = LogisticRegression(n_features=embeddings.shape[1])

# 3. CBoW Training

Finally, we'll learn how to train our own embeddings. Due to computational constraints, we'll just be using the IMDB dataset to train the embeddings. They won't perform nearly as well as the spaCy embeddings (which, in turn, won't perform as well as the Baselines and Bigrams model), but that's okay!

It's not typical to train your own embeddings from scratch, but our goal for this session is learning rather than accuracy.
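
For orientation, one common way to set up a CBoW model is to average the embeddings of the context words and predict the center word with a softmax over the vocabulary. The sketch below assumes that setup; the class name `TinyCBoW`, the sizes, and the word ids are all hypothetical, and practical implementations usually replace the full softmax with something cheaper such as negative sampling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCBoW(nn.Module):
    """Hypothetical sketch: average the context embeddings, then score every vocabulary item."""

    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, emb_dim)
        self.output = nn.Linear(emb_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) -> logits over the vocabulary: (batch, vocab_size)
        averaged = self.embeddings(context_ids).mean(dim=1)
        return self.output(averaged)

model = TinyCBoW(vocab_size=10, emb_dim=4)
context = torch.tensor([[1, 2, 4, 5], [0, 3, 6, 7]])   # made-up context word ids
target = torch.tensor([3, 5])                           # made-up center word ids

logits = model(context)
loss = F.cross_entropy(logits, target)   # softmax plus negative log-likelihood
loss.backward()                          # gradients flow back into the embedding table
print(loss.item())
```

After training, the rows of `model.embeddings.weight` are the learned word vectors.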

In [None]:
class CBoW(nn.Module):
    """Your code goes here."""
    pass