In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.autograd as autograd
import torch.nn.functional as F

# 0. Getting Started

Welcome to the Practical NLP with PyTorch reading group! The main goal of this reading group is to become comfortable using PyTorch to implement deep learning models for NLP. Each session will focus on a few different "atoms" -- feedforward networks, LSTMs, attention, etc. -- which are common patterns across papers.

## 0.0 Readings for Next Week

For next week, there will be one reading, plus this assignment:
- for understanding **word embeddings** and **continuous bag of words (CBoW) training**, I recommend Jurafsky and Martin's [Speech and Language Processing (SLP3) Ch. 16](https://web.stanford.edu/~jurafsky/slp3/16.pdf).

If there are parts of this notebook you're unfamiliar with, I'll do my best to help on Slack! In addition, I recommend the following resources:
- the [PyTorch 60 Minute Blitz](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html). We'll be going over much of this in the first meeting.
- if you are unfamiliar with **feedforward neural networks** or **backpropagation**, I strongly recommend Michael Nielsen's [Neural Networks and Deep Learning Ch. 1 & 2](http://neuralnetworksanddeeplearning.com/chap1.html). It's a long read (two chapters of a book), but it is approachable and very thorough.

## 0.1 Feature Extraction

The first step will be preprocessing the text. To make processing the data easier, we'll use scikit-learn's `CountVectorizer`, which can automatically create the bag-of-words representation from text.
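To make the bag-of-words representation concrete, here is a minimal sketch on a made-up two-document corpus (the documents and the `toy_` names are illustrative, not part of the assignment data). It relies on `CountVectorizer`'s default tokenizer, which lowercases text and keeps only tokens of two or more characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration (not the IMDB data).
toy_docs = ["a great great movie", "a terrible movie"]

toy_vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a scipy sparse matrix
# of shape (n_documents, vocabulary_size).
toy_counts = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps each term to its column index; sort terms by that index.
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
toy_dense = toy_counts.toarray()

print(toy_vocab)  # the single-character token "a" is dropped by the default tokenizer
print(toy_dense)  # one row per document, one column per vocabulary term
```

Each row is a document and each column counts how often one vocabulary term appears in it, so word order is discarded.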

Below, you'll be responsible for splitting the training set into a training and validation set, as well as transforming the test and validation sets.

In [None]:
from utils import read_imdb_data, data_batcher
from sklearn.feature_extraction.text import CountVectorizer

X_raw_train, y_train = read_imdb_data('../data/aclImdb/train')
X_raw_test, y_test = read_imdb_data('../data/aclImdb/test')

# TODO: Your code goes here to create an X_raw_val set.

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_raw_train)

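You'll also need to transform the raw test and validation data (`X_raw_test` and `X_raw_val`) into `X_test` and `X_val` using the vectorizer you fitted on the training set.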
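One possible way to approach the split and the transform is sketched below on toy data. It assumes scikit-learn's `train_test_split` is acceptable for the split; the texts, labels, the 25% validation fraction, and all variable names here are made up for illustration. The key point is that the vectorizer is fitted on the training portion only and then reused unchanged for the held-out data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-ins for X_raw_train / y_train (made up for illustration).
toy_texts = ["good film", "bad film", "great acting", "awful plot",
             "loved it", "hated it", "fine movie", "dull movie"]
toy_labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Hold out 25% of the training data as a validation set, keeping the
# class balance the same in both parts (stratify).
raw_train, raw_val, lab_train, lab_val = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels)

toy_vec = CountVectorizer()
feats_train = toy_vec.fit_transform(raw_train)  # learn the vocabulary here only
feats_val = toy_vec.transform(raw_val)          # reuse it; unseen words are ignored

print(feats_train.shape, feats_val.shape)
```

Calling `fit_transform` again on the validation or test text would build a different vocabulary, and the feature columns would no longer line up with what the model was trained on.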

In [None]:
# TODO: Your code goes here.

# 1. Softmax Regression

In [None]:
class SoftmaxRegression(nn.Module):
    """ TODO: Your code goes here. """
    pass

    # def __init__(self, n_features, n_classes):

Once you have the classifier built, follow the instructions to train it. Note that you'll probably want an inner loop over mini-batches inside the loop over epochs.
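For reference, here is a minimal sketch of what a softmax regression model and its training loop could look like. It is one possible shape for the solution, not the intended one: it trains on synthetic data, slices mini-batches by hand instead of using `data_batcher` (whose exact behavior is not shown in this notebook), and every `toy_` name is hypothetical. Since `nn.CrossEntropyLoss` applies log-softmax internally, the model returns raw logits.

```python
import torch
import torch.nn as nn
import torch.optim as optim

class ToySoftmaxRegression(nn.Module):
    """A single linear layer; the loss function applies log-softmax itself."""
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.linear = nn.Linear(n_features, n_classes)

    def forward(self, x):
        return self.linear(x)  # raw logits, shape (batch, n_classes)

torch.manual_seed(0)
# Synthetic data (an assumption for this sketch): the label is 1
# whenever the first feature is positive.
toy_X = torch.randn(64, 5)
toy_y = (toy_X[:, 0] > 0).long()

toy_model = ToySoftmaxRegression(n_features=5, n_classes=2)
toy_criterion = nn.CrossEntropyLoss()
toy_optimizer = optim.SGD(toy_model.parameters(), lr=0.1, momentum=0.9)

toy_losses = []
for epoch in range(20):
    for start in range(0, len(toy_X), 16):  # inner loop over mini-batches
        X_batch = toy_X[start:start + 16]
        y_batch = toy_y[start:start + 16]
        toy_optimizer.zero_grad()            # clear gradients from the last step
        loss = toy_criterion(toy_model(X_batch), y_batch)
        loss.backward()                      # backpropagate
        toy_optimizer.step()                 # update the weights
    with torch.no_grad():                    # track the full-data loss per epoch
        toy_losses.append(toy_criterion(toy_model(toy_X), toy_y).item())

print(toy_losses[0], toy_losses[-1])
```

With the real data, the sparse matrix from `CountVectorizer` has to become a float tensor before it reaches the model, for example via `torch.from_numpy(X_batch.toarray()).float()` on one batch at a time, unless the batching helper already does that conversion.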

In [None]:
# TODO: Fill in the appropriate shape for the data. The number of 
# features comes from the shape of X_train.
model = SoftmaxRegression()

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

num_epochs = 5

for epoch in range(num_epochs):
    for X_batch, y_batch in data_batcher(X_train, y_train):
        # TODO: your training code goes here
        pass

Finally, you'll need to test the classifier to see how well it performed. Use the validation set to optimize the learning rate, momentum, and number of epochs.
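A simple way to score the classifier is accuracy: take the highest-scoring class for each example and compare it with the label. The sketch below assumes the model maps a dense float tensor to logits; the helper name `accuracy` and the hand-set toy weights are illustrative only.

```python
import torch
import torch.nn as nn

def accuracy(model, X, y):
    """Fraction of rows in X whose highest-scoring class equals the label in y."""
    model.eval()
    with torch.no_grad():                  # no gradients needed for evaluation
        preds = model(X).argmax(dim=1)     # index of the largest logit per row
    return (preds == y).float().mean().item()

# A linear model with hand-set weights so the result is deterministic:
# it predicts class 0 when feature 0 is larger, class 1 otherwise.
toy_clf = nn.Linear(2, 2)
with torch.no_grad():
    toy_clf.weight.copy_(torch.tensor([[1.0, 0.0], [0.0, 1.0]]))
    toy_clf.bias.zero_()

X_val_toy = torch.tensor([[2.0, 1.0], [0.0, 3.0], [5.0, 1.0], [1.0, 4.0]])
y_val_toy = torch.tensor([0, 1, 0, 0])   # the last label disagrees with the model

val_acc = accuracy(toy_clf, X_val_toy, y_val_toy)
print(val_acc)
```

Running this for each candidate learning rate, momentum, and epoch count on the validation set, and touching the test set only once at the end, keeps the final test score an honest estimate.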

In [None]:
# TODO: Test on the validation set.

## 1.1 Stopword Removal

In order to make stopword removal easier, you can import a list of common English stopwords from the utils file. The goal here will be to create a new `X_features` vector that has all the stopwords removed. 

In order to accomplish this quickly, I recommend looking into the arguments of `CountVectorizer()`. Scikit-learn is amazingly comprehensive, and it's always better to not have to reinvent the wheel! (Especially when scikit-learn's wheel is a lot faster than ours would be.)
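As a hint, `CountVectorizer` accepts a `stop_words` argument that can be a list of strings. The sketch below uses a tiny hypothetical list, `toy_stop_words`, in place of the one from `utils` (which is assumed here to be a list of strings) to show the effect on the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-in for utils.stop_words.
toy_stop_words = ["the", "was", "and"]
toy_reviews = ["the movie was great", "the plot was thin and dull"]

with_stops = CountVectorizer().fit(toy_reviews)
without_stops = CountVectorizer(stop_words=toy_stop_words).fit(toy_reviews)

# Terms present in the first vocabulary but filtered out of the second.
removed_terms = set(with_stops.vocabulary_) - set(without_stops.vocabulary_)
print(removed_terms)
print(len(with_stops.vocabulary_), len(without_stops.vocabulary_))
```

Because the vocabulary shrinks, the number of input features changes too, so the model has to be rebuilt with the new feature count.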

In [None]:
from utils import stop_words

# Your code goes here.

You'll then want to test if you got an improvement with stopword removal. You'll need to recreate the code above for featurizing and training the bag of words classifier, but with the correct number of features.

### Extra Credit: Baselines and Bigrams

Too often, not enough thought is given to our baselines before we jump right into complex deep learning models. Wang and Manning address this in their 2012 paper, [Baselines and Bigrams](https://nlp.stanford.edu/pubs/sidaw12_simple_sentiment.pdf), which provides a **really** tough-to-beat baseline for sentiment classification.

Using the paper and what you know, you should be able to implement this model in PyTorch already! You can simply use unigram features for now, though you are also welcome to recompute the feature vectors with higher-order ngrams. Thankfully, that can be done easily with just another argument to the `CountVectorizer`.
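As a starting point, the sketch below shows two ingredients on a made-up two-document corpus: requesting bigram features with `ngram_range`, and computing the smoothed log-count ratio `r` that the paper uses to rescale features. It works on raw counts for simplicity (the paper also reports results with binarized features), it stops before the classifier itself, and every name in it is illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy data, made up for illustration: label 1 is positive, 0 is negative.
toy_docs = ["good good film", "bad film"]
toy_labels = np.array([1, 0])

# ngram_range=(1, 2) extracts both unigrams and bigrams.
bigram_vec = CountVectorizer(ngram_range=(1, 2))
F_counts = bigram_vec.fit_transform(toy_docs).toarray()

alpha = 1.0  # smoothing constant added to every feature count
p = alpha + F_counts[toy_labels == 1].sum(axis=0)  # smoothed counts, positive docs
q = alpha + F_counts[toy_labels == 0].sum(axis=0)  # smoothed counts, negative docs

# Log-count ratio: positive for features that lean positive,
# negative for features that lean negative.
r = np.log((p / p.sum()) / (q / q.sum()))

good_idx = bigram_vec.vocabulary_["good"]
bad_idx = bigram_vec.vocabulary_["bad"]
print(r[good_idx], r[bad_idx])
```

In the paper's NBSVM, each feature vector is multiplied elementwise by `r` before being passed to a linear classifier, so one way to build the module is to store `r` as a fixed tensor and apply it in `forward` ahead of an `nn.Linear` layer.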

In [None]:
class NBSVM(nn.Module):
    """ Your code goes here. """
    pass