<a href="https://colab.research.google.com/github/nlp-course/materials/blob/tmp_psets/distrib/project1/project1_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1: Text classification

In this homework you will build several varieties of text classifiers using PyTorch.

1. A majority baseline.
2. A naive Bayes classifier.
3. A logistic regression classifier.
4. A multilayer perceptron classifier.

## Setup

In [None]:
!pip install -qU torchtext

In [None]:
import os
import re
import copy
from collections import Counter

from tqdm import tqdm

import torch
import torch.nn as nn
import torchtext as tt

# Set random seeds
seed = 1234
torch.manual_seed(seed)

# GPU check, make sure to set runtime type to "GPU" where available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print (device)

### Load Data

For this and future project segments, you will be working with the [ATIS (Airline Travel Information System) dataset](https://www.kaggle.com/siddhadev/atis-dataset-from-ms-cntk). This dataset is composed of queries about flights – their dates, times, locations, airlines, and the like. Over the years, the dataset has been annotated in all kinds of ways, with parts of speech, informational chunks, parse trees, corresponding SQL queries. You'll use various of these annotations in future assignments. For this project segment, however, you'll pursue an easier classification task: **given a query, predict the answer type**.

Below is an example taken from this dataset:

_Query:_

```
show me the afternoon flights from washington to boston
```

_SQL:_

```
SELECT DISTINCT flight_1.flight_id FROM flight flight_1 , airport_service airport_service_1 , city city_1 , airport_service airport_service_2 , city city_2 
   WHERE flight_1.departure_time BETWEEN 1200 AND 1800 
     AND ( flight_1.from_airport = airport_service_1.airport_code 
           AND airport_service_1.city_code = city_1.city_code 
           AND city_1.city_name = 'WASHINGTON' 
           AND flight_1.to_airport = airport_service_2.airport_code 
           AND airport_service_2.city_code = city_2.city_code 
           AND city_2.city_name = 'BOSTON' )
```

In this problem set, we will consider the answer type for a query to be the target field of the corresponding SQL query. For the above example, the answer type would be *flight_id*.

First, let's download the dataset.

In [None]:
!wget -nv -N -P data https://raw.githubusercontent.com/nlp-course/data/master/ATIS/train.nl
!wget -nv -N -P data https://raw.githubusercontent.com/nlp-course/data/master/ATIS/train.sql
!wget -nv -N -P data https://raw.githubusercontent.com/nlp-course/data/master/ATIS/dev.nl
!wget -nv -N -P data https://raw.githubusercontent.com/nlp-course/data/master/ATIS/dev.sql
!wget -nv -N -P data https://raw.githubusercontent.com/nlp-course/data/master/ATIS/test.nl
!wget -nv -N -P data https://raw.githubusercontent.com/nlp-course/data/master/ATIS/test.sql

### Process the data

We use `torchtext` to process the data. More information on `torchtext` can be found at https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/.

To begin, `torchtext` requires that we define a mapping from the raw text data to featurized indices, called a [`Field`](https://torchtext.readthedocs.io/en/latest/data.html#fields). We need one field for processing the question (`TEXT`), and another for processing the label (`LABEL`). These fields make it easy to map back and forth between readable data and lower-level representations like numbers.

In [None]:
TEXT = tt.data.Field(lower=True, # lowercased
                     sequential=True, # sequential data
                     include_lengths=False, # do not include lengths
                     batch_first=True, # batches will be batch_size X max_len
                     tokenize=tt.data.get_tokenizer("basic_english")) 
LABEL = tt.data.Field(batch_first=True, sequential=False, unk_token=None)

We provide an interface for loading ATIS data built on top of `torchtext.data.Dataset`. 

In [None]:
class ATIS(tt.data.Dataset):
  @staticmethod
  def sort_key(ex):
    return len(ex.text)

  def __init__(self, path, text_field, label_field, **kwargs):
    """Creates an ATIS dataset instance given a path and fields.
    Arguments:
        path: Path to the data file
        text_field: The field that will be used for text data.
        label_field: The field that will be used for label data.
        Remaining keyword arguments: Passed to the constructor of
            tt.data.Dataset.
    """
    fields = [('text', text_field), ('label', label_field)]
    
    examples = []
    # Get text
    with open(path+'.nl', 'r') as f:
        for line in f:
            ex = tt.data.Example()
            ex.text = text_field.preprocess(line.strip()) 
            examples.append(ex)
    
    # Get labels
    with open(path+'.sql', 'r') as f:
        for i, line in enumerate(f):
            label = self._get_label_from_query(line.strip())
            examples[i].label = label
            
    super(ATIS, self).__init__(examples, fields, **kwargs)
  
  def _get_label_from_query(self, query):
    """Returns the answer type from `query` by dead reckoning.
    It's basically the second or third token in the SQL query.
    """    
    match = re.match(r'\s*SELECT\s+(DISTINCT\s*)?(\w+\.)?(?P<label>\w+)', query)
    if match:
        label = match.group('label')
    else:
        raise RuntimeError(f'no label in query {query}')
    return label

  @classmethod
  def splits(cls, text_field, label_field, path='./',
              train='train', validation='dev', test='test',
              **kwargs):
    """Create dataset objects for splits of the ATIS dataset.
    Arguments:
        text_field: The field that will be used for the sentence.
        label_field: The field that will be used for label data.
        path: The directory containing the data files.
        train: The base filename (without extension) of the train data.
            Default: 'train'.
        validation: The base filename of the validation data, or None to
            not load the validation set. Default: 'dev'.
        test: The base filename of the test data, or None to not load the
            test set. Default: 'test'.
        Remaining keyword arguments: Passed to the splits method of
            Dataset.
    """

    train_data = None if train is None else cls(
        os.path.join(path, train), text_field, label_field, **kwargs)
    val_data = None if validation is None else cls(
        os.path.join(path, validation), text_field, label_field, **kwargs)
    test_data = None if test is None else cls(
        os.path.join(path, test), text_field, label_field, **kwargs)
    return tuple(d for d in (train_data, val_data, test_data)
                  if d is not None)


We load the data splits, and build the vocabularies from the training data.

In [None]:
# Make splits for data
train_data, val_data, test_data = ATIS.splits(TEXT, LABEL, path='./data/')

# Build vocabulary for data fields
MIN_FREQ = 3 # words appearing fewer than 3 times are treated as 'unknown'
TEXT.build_vocab(train_data, min_freq=MIN_FREQ)
LABEL.build_vocab(train_data)

# Compute size of vocabulary
vocab_size = len(TEXT.vocab.itos)
num_labels = len(LABEL.vocab.itos)
print(f"Size of vocab: {vocab_size}")
print(f"Number of labels: {num_labels}")

Note that we mapped words appearing fewer than 3 times to a special _unknown_ token `<unk>` for two reasons: first, because such rare words are scarce in the training data, we might not be able to learn generalizable conclusions about them; second, introducing an unknown token allows us to deal with out-of-vocabulary words in the test data as well: we just map those words to `<unk>`.
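To make the frequency cutoff concrete, here is a minimal sketch in plain Python on a toy corpus. It is not how `torchtext` builds its vocabulary internally; the names `toy_sentences`, `MIN_FREQ_TOY`, and `encode` are hypothetical, and the alphabetical ordering of the kept words is an assumption made only to keep the example deterministic.

```python
from collections import Counter

# Toy corpus; the notebook itself builds its vocabulary with TEXT.build_vocab.
toy_sentences = [
    "show me flights to boston".split(),
    "show me flights to denver".split(),
    "show me fares to boston".split(),
]
MIN_FREQ_TOY = 2  # hypothetical threshold for this toy example

counts = Counter(tok for sent in toy_sentences for tok in sent)
# Reserve the first two ids for the special tokens, then keep frequent words.
itos = ["<unk>", "<pad>"] + sorted(w for w, c in counts.items() if c >= MIN_FREQ_TOY)
stoi = {w: i for i, w in enumerate(itos)}

def encode(tokens):
    """Map tokens to ids, sending anything out of vocabulary to <unk>."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]

print(itos)
print(encode("show me flights to seattle".split()))
```

Both `denver` (seen once in training) and `seattle` (never seen) end up with the id of `<unk>`, which is the behavior described above.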

In [None]:
unk_token = TEXT.unk_token
print (f"Unknown token: {unk_token}")
unk_index = TEXT.vocab.stoi[unk_token]
print (f"Unknown word id: {unk_index}")

# UNK example
example_unk_token = 'IAmAnUnknownWordForSure'
print (f"An unknown token: {example_unk_token}")
print (f"Mapped back to word id: {TEXT.vocab.stoi[example_unk_token]}")
print (f"Mapped to <unk>: {TEXT.vocab.stoi[example_unk_token] == unk_index}")

To load data in batches, we use `tt.data.BucketIterator`. This enables us to iterate over the dataset in batches of `BATCH_SIZE` instances at a time.

In [None]:
BATCH_SIZE = 32
train_iter = tt.data.BucketIterator(train_data, batch_size=BATCH_SIZE, device=device)
val_iter = tt.data.BucketIterator(val_data, batch_size=BATCH_SIZE, device=device)
test_iter = tt.data.Iterator(test_data, batch_size=BATCH_SIZE, sort=False, device=device)

Let's look at a single batch from one of these iterators.

In [None]:
batch = next(iter(train_iter))
text = batch.text
print (f"Size of text batch: {text.size()}")
print (f"Third sentence in batch: {text[2]}")
print (f"Converted back to string: {' '.join([TEXT.vocab.itos[i] for i in text[2]])}")

label = batch.label
print (f"Size of label batch: {label.size()}")
print (f"Third label in batch: {label[2]}")
print (f"Converted back to string: {LABEL.vocab.itos[label[2].item()]}")

You might notice some padding tokens `<pad>` when we convert word ids back to strings, or equivalently, padding ids `1` in the corresponding tensor. Padding is needed because the sentences in a batch might have different lengths; to store them in a single 2D tensor for parallel processing, sentences shorter than the longest one are padded with a placeholder value. Note that during training we need to make sure that the padding does not affect the final results.
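The following sketch illustrates the idea of padding with plain Python lists rather than tensors. The helpers `pad_batch` and `unpadded_lengths` are hypothetical names introduced for illustration; the iterators above do the padding for you.

```python
PAD_ID = 1  # matches the padding id reported in this notebook

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad lists of word ids so every row has the same length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

def unpadded_lengths(padded, pad_id=PAD_ID):
    """Count the real (non-padding) tokens in each row."""
    return [sum(1 for i in row if i != pad_id) for row in padded]

batch_ids = pad_batch([[5, 4, 3], [7], [2, 9, 8, 6]])
print(batch_ids)
print(unpadded_lengths(batch_ids))
```

Every row of `batch_ids` has the length of the longest sentence, and the padding positions can still be told apart from real words by their id.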

In [None]:
padding_token = TEXT.pad_token
print (f"Padding token: {padding_token}")

padding_id = TEXT.vocab.stoi[padding_token]
print (f"Padding word id: {padding_id}")

Alternatively, we can also directly iterate over the individual examples in `train_data`, `val_data` and `test_data`. Here the returned values are the raw sentences and labels instead of word ids, and you might need to explicitly deal with the unknown words, unlike using bucket iterators which automatically map unknown words to an unknown word id.

In [None]:
for example in train_iter.dataset: # train_iter.dataset is just train_data
  print ("Sentence:", example.text)
  print ("label:", example.label)
  break

## Part 1: Establish a majority baseline

A simple baseline for classification tasks is to always predict the most common class. 

**Implement the majority baseline and compute test accuracy using the starter code below.** Note that for this baseline, and the naive Bayes classifier later, we don't need to use the validation set since we don't tune any hyper-parameters.
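As a rough illustration of the idea on plain Python lists (toy labels, not ATIS data, and not the iterator-based function you are asked to write), the baseline can be sketched as follows. The variable names are hypothetical.

```python
from collections import Counter

# Toy labels, made up for illustration.
toy_train_labels = ["flight_id", "fare_id", "flight_id", "airline_code", "flight_id"]
toy_test_labels = ["flight_id", "fare_id", "flight_id", "flight_id"]

# The majority label is whichever label is most frequent in the training set.
majority_label, _ = Counter(toy_train_labels).most_common(1)[0]
# Accuracy is the fraction of test labels that equal the majority label.
toy_accuracy = sum(lab == majority_label for lab in toy_test_labels) / len(toy_test_labels)

print(majority_label, toy_accuracy)
```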

In [None]:
#TODO
def majority_baseline_accuracy(train_iter, test_iter):
  '''Returns the most common label in the training set, and the accuracy of
  the majority baseline on the test set.
  '''
  "your code here"
  return most_common_label, test_accuracy

In [None]:
# Call the method to establish a baseline
most_common_label, test_accuracy = majority_baseline_accuracy(train_iter, test_iter)

print(f'Most common label: {most_common_label}\n'
      f'Test accuracy:     {test_accuracy:.3f}')

## Part 2: Implement a Naive Bayes classifier

### Review of Naive Bayes

$$
   \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
   \newcommand{\Prob}{{\Pr}}
   \newcommand{\given}{\,|\,}
   \newcommand{\vect}[1]{\mathbf{#1}}
   \newcommand{\cnt}[1]{\sharp(#1)}
$$
Recall from lab 3 that the Naive Bayes classification method classifies a text $\mathbf{x} = \langle x_1, x_2, \ldots, x_m \rangle$ as  the class $c_i$ given by the following maximization:
$$
\argmax{i} \Prob(c_i \given \vect{x}) \approx \argmax{i} \Prob(c_i) \cdot \prod_{j=1}^m \Prob(x_j \given c_i)
$$
or equivalently (since taking the log is monotonic)
$$ \begin{align*}
\argmax{i} \Prob(c_i \given \vect{x}) &= \argmax{i} \log\Prob(c_i \given \vect{x}) \\
&\approx \argmax{i} \left(\log\Prob(c_i) + \sum_{j=1}^m \log\Prob(x_j \given c_i)\right)
\end{align*}$$
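Besides being equivalent, working in log space is also safer numerically. The sketch below uses made-up likelihoods of 0.01 for each of 200 words to show how a direct product can underflow while the sum of logs stays well behaved.

```python
import math

# 200 hypothetical per-word likelihoods of 0.01 each.
likelihoods = [0.01] * 200

product = 1.0
for p in likelihoods:
    product *= p  # the true value, 1e-400, is too small for a double

log_sum = sum(math.log(p) for p in likelihoods)

print(product)  # underflows to 0.0
print(log_sum)  # a finite negative number
```

Once the product has underflowed to zero for every class, the argmax can no longer distinguish between classes, whereas the log scores remain comparable.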

All we need, then, to apply the Naive Bayes classification method is values for the various log probabilities $\log\Prob(c_i)$ and $\log\Prob(x_j \given c_i)$ for each feature $x_j$ and each class $c_i$.

We can estimate the prior probabilities $\Prob(c_i)$ by examining the empirical probability in the training set. That is, we estimate 

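$$ \Prob(c_i) \approx \frac{\cnt{c_i}}{N} $$

where $\cnt{c_i}$ is the number of training examples labeled $c_i$ and $N$ is the total number of training examples.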

We can estimate the likelihood probabilities $\Prob(x_j \given c_i)$ similarly by examining the empirical probability in the training set. That is, we estimate 

$$ \Prob(x_j \given c_i) \approx \frac{\cnt{x_j, c_i}}{\sum_{j'} \cnt{x_{j'}, c_i}} $$

To handle cases in which the count $\cnt{x_j, c_i}$ is zero, we can adjust this estimate using add-$\delta$ smoothing:

$$ \Prob(x_j \given c_i) \approx \frac{\cnt{x_j, c_i} + \delta}{\sum_{j'} \cnt{x_{j'}, c_i} + \delta \cdot V} $$

where $V$ is the total vocabulary size.
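Here is a small worked example of the smoothed estimate in plain Python. The vocabulary, the counts, and the helper name `smoothed_log_likelihoods` are all made up for illustration; this is a sketch of the formula, not the class you are asked to implement.

```python
import math
from collections import Counter

def smoothed_log_likelihoods(word_counts, vocab, delta=1.0):
    """Return log Pr(x | c) for every word in vocab using add-delta smoothing."""
    total = sum(word_counts[w] for w in vocab)
    denom = total + delta * len(vocab)
    return {w: math.log((word_counts[w] + delta) / denom) for w in vocab}

toy_vocab = ["flights", "fares", "boston", "cheapest"]
# Hypothetical counts of each word among the training texts of one class.
counts_for_class = Counter({"flights": 6, "boston": 2})

log_probs = smoothed_log_likelihoods(counts_for_class, toy_vocab, delta=1.0)
for word in toy_vocab:
    print(word, round(math.exp(log_probs[word]), 3))
```

With $\delta = 1$ and $V = 4$, the unseen words `fares` and `cheapest` each receive probability $1/12$ instead of zero, and the four probabilities still sum to one.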

$$ \newcommand{\Prob}{{\Pr}}
   \newcommand{\given}{\,|\,}
$$
### Implementation
 
For the implementation, we ask you to implement a Python class `NaiveBayes` that will have (at least) the following three methods:

1. `__init__`: An initializer that takes two `torchtext` fields providing descriptions of the text and label aspects of examples.

2. `train`: A method that takes a training data iterator and estimates all of the log probabilities $\log\Prob(c_i)$ and $\log\Prob(x_j \given c_i)$ as described above. Perform add-$\delta$ smoothing with $\delta=1$. These probabilities will be used by the `evaluate` method to evaluate a test dataset for accuracy, so you'll want to store these probabilities in some data structures in objects of the class.

3. `evaluate`: A method that takes a test data iterator and evaluates the accuracy of the trained model on the test set.

You should expect to achieve about **86% accuracy** on the ATIS task.

In [None]:
#TODO
class NaiveBayes():
  def __init__ (self, text, label):
    self.text = text
    self.label = label
    self.padding_id = text.vocab.stoi[text.pad_token]
    self.V = len(text.vocab.itos) # vocabulary size
    self.N = len(label.vocab.itos) # the number of classes
    #TODO: Implement this method.
    "your code here"
    
  def train(self, iterator):
    """Populates tables of log probabilities for training dataset `iterator`."""
    #TODO: Implement this method.
    "your code here"
    
  def evaluate(self, iterator):
    """Returns the model's performance on a given dataset `iterator`."""
    #TODO: Implement this method.
    "your code here"
    return accuracy

In [None]:
# Instantiate and train classifier
nb_classifier = NaiveBayes(TEXT, LABEL)
nb_classifier.train(train_iter)

# Evaluate model performance
print(f'Training accuracy: {nb_classifier.evaluate(train_iter):.3f}\n'
      f'Test accuracy:     {nb_classifier.evaluate(test_iter):.3f}')

## Part 3: Logistic regression

In this part, you'll complete a PyTorch implementation of a logistic regression (equivalently, a single layer perceptron) classifier.

### Review of logistic regression

In logistic regression, we assign a weight $w^c_x$ to each word type $x\in\mathcal{V}$ and each label $c$. Then for a text $\mathbf{x} = \langle x_1, x_2, \ldots, x_m \rangle$ we can model $\Prob (c \given \mathbf{x})$ as

$$ \Prob(c \given \mathbf{x}) = \sigma\left(\sum_{i=1}^m w^c_{x_i}\right), $$

where $\sigma$ is the softmax function, a generalization of the sigmoid function from lab 4:

$$ \Prob(c \given \mathbf{x}) = \frac{\exp\left(\sum_{i=1}^m w^c_{x_i}\right)}{\sum_{c'} \exp\left(\sum_{i=1}^m w^{c'}_{x_i}\right)}
$$
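To see the softmax at work, and its relationship to the sigmoid, here is a minimal plain-Python sketch. The scores are made up, and subtracting the maximum before exponentiating is a common stability trick rather than part of the definition.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into a probability distribution."""
    shift = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])

# With two classes and scores (z, 0), softmax gives sigmoid(z) for the first class.
print(softmax([1.5, 0.0])[0], sigmoid(1.5))
```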

Here, we're treating the types of the individual words in the text to be the features, and "looking up" the weights. Since each word is treated separately, the order of words doesn't matter. Consequently, we can use the bag of words representation introduced in lab 1. Recall that the bag-of-words representation of a text is just the frequency distribution over the vocabulary, which we can notate $bow(\mathbf{x})$. Given a vocabulary of word types $\mathbf{v} = \langle v_1, v_2, \ldots, v_V \rangle$, the representation of a sentence $\mathbf{x} = \langle x_1, x_2, \ldots, x_m \rangle$ is a vector of size $V$, where $$bow(\mathbf{x})_j = \sum_{i=1}^m 1[x_i = v_j]$$

(We write $1[x_i = v_j]$ to indicate 1 if $x_i = v_j$ and 0 otherwise.)

Then, we can rewrite logistic regression as: 

$$ \begin{align*}
\Prob(c \given \mathbf{x})
    &= \sigma\left(\sum_{i=1}^m w^c_{x_i}\right) \\
    &= \sigma(w^c \cdot bow(\mathbf{x})) 
\end{align*}$$

Why do we mention this? Because in your implementation, you can make use of either of these two approaches. You can convert the text into a sequence of word type indices or into a bag-of-words representation.
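The equivalence of the two approaches can be checked on a toy example. The vocabulary and weights below are made up, and `bow` is a hypothetical helper; a real implementation would work on tensors of word ids.

```python
# Toy vocabulary and per-class weights; all values are made up for illustration.
lr_vocab = ["show", "me", "flights", "fares"]
word_to_index = {w: j for j, w in enumerate(lr_vocab)}
weights_for_class = [0.1, -0.2, 1.5, -0.7]  # w^c, one weight per word type

def bow(tokens):
    """Count how often each vocabulary word occurs in tokens."""
    vec = [0] * len(lr_vocab)
    for tok in tokens:
        vec[word_to_index[tok]] += 1
    return vec

tokens = ["show", "me", "flights", "show"]

# Approach 1: look up a weight for each token and sum.
score_lookup = sum(weights_for_class[word_to_index[tok]] for tok in tokens)
# Approach 2: dot product between the weights and the bag-of-words vector.
score_bow = sum(w * n for w, n in zip(weights_for_class, bow(tokens)))

print(bow(tokens))
print(score_lookup, score_bow)
```

Both scores agree up to floating-point rounding, because a word that occurs twice contributes its weight twice in either formulation.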

The calculation of $\Prob(c \given \mathbf{x})$ for each text $\mathbf{x}$ is referred to as the _forward_ computation. In summary, the forward computation for logistic regression involves a linear calculation ($\sum_{i=1}^m w^c_{x_i}$ or $w^c \cdot bow(\mathbf{x})$) followed by a nonlinear calculation ($\sigma$). We think of the perceptron (and more generally many of these neural network models) as transforming from one representation to another. A perceptron performs a linear transformation from the index or bag-of-words representation of the text to a representation as a vector, followed by a nonlinear transformation, a softmax, giving a representation as a probability distribution over the class labels. This single-layer perceptron thus involves two _sublayers_. (In the next part of the problem set, you'll experiment with a multilayer perceptron, with two perceptron layers, and hence four sublayers.)

The loss function you'll use is the negative log probability $-\log \Prob (c \given \mathbf{x})$. The negative is used, since it is convention to minimize loss, whereas we want to maximize log likelihood. 

The forward and loss computations are illustrated in the figure below. In practice, for numerical stability reasons, PyTorch absorbs the softmax operation into the loss function `nn.CrossEntropyLoss`. That is, the input to the `nn.CrossEntropyLoss` function is the vector of sums $\sum_{i=1}^m w^c_{x_i}$ (labeled as "your output" in the figure) rather than the vector of probabilities $\Prob(c \given \mathbf{x})$. That makes things easier for you (!), since you're responsible only for the first sublayer.
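To give some intuition for why the softmax and the loss are combined, the sketch below computes $-\log \Prob(c \given \mathbf{x})$ directly from the scores using the log-sum-exp trick. This is a plain-Python illustration of the idea with made-up scores, not PyTorch's actual implementation, and `cross_entropy_from_logits` is a hypothetical name.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Return -log softmax(logits)[target] using the log-sum-exp trick."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_norm - logits[target]

print(cross_entropy_from_logits([2.0, 1.0, 0.1], target=0))
# Scores this large would overflow math.exp if exponentiated directly.
print(cross_entropy_from_logits([1000.0, 999.0, 0.0], target=0))
```

Computing the probabilities first and taking the log afterwards would fail on the second example, since `math.exp(1000.0)` raises an `OverflowError`.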

<img src="https://raw.githubusercontent.com/nlp-course/data/master/img/logistic_regression.png" alt="LR_illustration" width="400"/>

Given a forward computation, the weights can then be adjusted by taking a step opposite to the gradient of the loss function. Adjusting the weights in this way is referred to as the _backward_ computation. Fortunately, `torch` takes care of the backward computation for you.
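As a toy illustration of a single gradient step, the sketch below treats the per-class scores themselves as the adjustable quantities, which is a simplification: in the actual model the scores are sums of weights, and PyTorch computes the gradients with respect to those weights. The scores, target, and learning rate are made up.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(scores, target):
    """Negative log probability of the target class."""
    return -math.log(softmax(scores)[target])

scores = [0.5, 0.2, -0.3]  # hypothetical per-class scores for one text
target = 1                 # index of the gold label
learning_rate = 0.5        # hypothetical step size

# For softmax with negative log probability loss, the gradient with respect
# to the scores is softmax(scores) minus the one-hot vector of the target.
grad = [p - (1.0 if c == target else 0.0) for c, p in enumerate(softmax(scores))]
# Step opposite to the gradient.
new_scores = [s - learning_rate * g for s, g in zip(scores, grad)]

print(loss(scores, target), loss(new_scores, target))
```

The step raises the score of the gold label and lowers the others, so the loss after the step is smaller than before.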

The optimization process of performing the forward computation, calculating the loss, and performing the backward computation to improve the weights is done repeatedly until the process converges on a (hopefully) good set of weights. You'll find this optimization process in the `train_all` method that we've provided. The trained weights can then be used to perform classification on a test set. See the `evaluate` method.

You'll be responsible for implementing the forward computation as a method `forward`. We have provided code for performing the optimization and evaluation (though you should feel free to change them). 

### Implement a logistic regression classifier

For the implementation, we ask you to implement a logistic regression classifier as a subclass of [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html). You will be adding the following two methods:

1. `__init__`: an initializer that takes two `torchtext` fields providing descriptions of the text and label aspects of examples.

    During initialization, you'll want to define a [tensor](https://pytorch.org/docs/stable/tensors.html#torch-tensor) of weights, [initialized randomly](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.uniform_). This tensor is a [parameter](https://pytorch.org/docs/master/generated/torch.nn.parameter.Parameter.html#torch.nn.parameter.Parameter) of the module in a special technical sense: the parameters of a module are the tensors whose gradients will be calculated and whose values will be updated during optimization. Alternatively, you might find it easier to use the [`nn.Embedding` module](https://pytorch.org/docs/master/generated/torch.nn.Embedding.html), which is a wrapper around the weight tensor with a lookup implementation.

2. `forward`: given a text batch of size `batch_size X max_length`, return a tensor of logits of size `batch_size X num_labels`. That is, for each text $\mathbf{x}$ in the batch and each label $c$, you'll be calculating $\sum_{i=1}^m w_{x_i}^c$ as shown in the illustration above, returning a tensor of these values. Note that the softmax operation is absorbed into [`nn.CrossEntropyLoss`](https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html) so you won't need that.

Some things to consider:

1. The parameters of the model (the weights) need to be initialized properly. We suggest initializing them to some small random values. See [`torch.Tensor.uniform_`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.uniform_).

2. You'll want to make sure that padding tokens are handled properly. What should the weight be for the padding token?

3. In extracting the proper weights to sum up, based on the word types in a sentence, we are essentially doing a lookup operation. You might find [`nn.Embedding`](https://pytorch.org/docs/master/generated/torch.nn.Embedding.html) or [`torch.gather`](https://pytorch.org/docs/stable/generated/torch.gather.html#torch-gather) useful.

You should expect to achieve about **90%** accuracy on the ATIS classification task. 

In [None]:
#TODO
class LogisticRegression(nn.Module):
  def __init__ (self, text, label):
    super().__init__()
    self.text = text
    self.label = label
    self.padding_id = text.vocab.stoi[text.pad_token]
    # Keep the vocabulary sizes available
    self.N = len(label.vocab.itos) # num_classes
    self.V = len(text.vocab.itos)  # vocab_size
    # Specify cross-entropy loss for optimization
    self.criterion = nn.CrossEntropyLoss()
    # TODO: create and initialize a tensor for the weights,
    #       or create an nn.Embedding module and initialize
    raise NotImplementedError

  def forward(self, text_batch):
    # TODO: calculate the logits for the batch, 
    #       returning a tensor of size batch_size x num_labels
    raise NotImplementedError    

  def train_all(self, train_iter, val_iter, epochs=8, learning_rate=3e-3):
    # Use Adam to optimize the parameters
    optim = torch.optim.Adam(self.parameters(), lr=learning_rate)
    best_validation_accuracy = -float('inf')
    self.best_model = None
    # Run the optimization for multiple epochs
    for epoch in range(epochs):
      # Switch the module to training mode at the start of every epoch,
      # since evaluate() leaves it in evaluation mode
      self.train()
      c_num = 0
      total = 0
      running_loss = 0.0
      for batch in tqdm(train_iter):
        # Zero the parameter gradients
        optim.zero_grad()

        # Input and target
        text = batch.text           # a tensor of shape (bsz, max_len)
        logits = self.forward(text) # perform the forward computation
        target = batch.label.long() # bsz
        batch_size = len(target)

        # Compute the loss
        loss = self.criterion(logits, target)

        # Perform backpropagation
        loss.backward()
        optim.step()

        # Prepare to compute the accuracy
        predictions = torch.argmax(logits, dim=1)
        total += batch_size
        c_num += (predictions == target).float().sum().item()        
        running_loss += loss.item() * batch_size

      # Evaluate and track improvements on the validation dataset
      validation_accuracy = self.evaluate(val_iter)
      if validation_accuracy > best_validation_accuracy:
        best_validation_accuracy = validation_accuracy
        self.best_model = copy.deepcopy(self.state_dict())
      epoch_loss = running_loss / total
      epoch_acc = c_num / total
      print (f'Epoch: {epoch} Loss: {epoch_loss:.4f} '
             f'Training accuracy: {epoch_acc:.4f} '
             f'Validation accuracy: {validation_accuracy:.4f}')

  def evaluate(self, iterator):
    self.eval()   # switch the module to evaluation mode
    total = 0     # running total of examples
    c_num = 0     # running total of correctly classified examples
    with torch.no_grad():  # gradients are not needed for evaluation
      for batch in tqdm(iterator):
        text = batch.text
        logits = self.forward(text)                 # calculate the logits
        target = batch.label.long()                 # extract gold labels
        predictions = torch.argmax(logits, dim=-1)  # calculate predicted labels
        total += len(target)
        c_num += (predictions == target).float().sum().item()
    return c_num / total

In [None]:
# Instantiate classifier and run it
model = LogisticRegression(TEXT, LABEL).to(device) 
model.train_all(train_iter, val_iter)
model.load_state_dict(model.best_model)
test_accuracy = model.evaluate(test_iter)
print (f'Test accuracy: {test_accuracy:.4f}')

## Part 4: Multilayer perceptron

### Review of multilayer perceptrons

In the last part, you implemented a perceptron, a model that involved a linear calculation (the sum of weights) followed by a nonlinear calculation (the softmax, which converts the summed weight values to probabilities). In a multi-layer perceptron, we take the output of the first perceptron to be the input of a second perceptron (and of course, we could continue on with a third or even more).

In this part, you'll implement the forward calculation of a two-layer perceptron, again letting PyTorch handle the backward calculation as well as the optimization of parameters. The first layer will involve a linear summation as before and a sigmoid as the nonlinear function. The second will involve a linear summation and a softmax (the latter absorbed, as before, into the loss function). Thus, the difference from the logistic regression implementation is simply the addition of the sigmoid and the second linear calculation. See the figure for the structure of the computation. 
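To make the sequence of sublayers concrete, here is a plain-Python sketch of the forward calculation for a single text. The sizes and weights are made up, there are no bias terms, padding is ignored, and `mlp_logits` is a hypothetical name; your PyTorch implementation will operate on whole batches of tensors instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy sizes: 3 word types, hidden size 2, 2 labels.
first_layer = [[0.2, -0.1],   # one row of hidden_size weights per word type
               [0.4, 0.3],
               [-0.5, 0.6]]
second_layer = [[1.0, -1.0],  # one row of num_labels weights per hidden unit
                [0.5, 0.25]]

def mlp_logits(word_ids):
    # First linear sublayer: sum the weight rows of the words in the text.
    hidden = [sum(first_layer[i][h] for i in word_ids) for h in range(2)]
    # Sigmoid sublayer: applied elementwise to the hidden vector.
    hidden = [sigmoid(v) for v in hidden]
    # Second linear sublayer: map the hidden vector to one score per label.
    return [sum(hidden[h] * second_layer[h][c] for h in range(2)) for c in range(2)]

print(mlp_logits([0, 2, 2]))
```

The returned values are scores (logits) rather than probabilities, since the final softmax is left to the loss function.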

<img src="https://raw.githubusercontent.com/nlp-course/data/master/img/MLP.png" alt="MLP_illustration" width="400"/>



### Implement a multilayer perceptron classifier

For the implementation, we ask you to implement a two-layer perceptron classifier, again as a subclass of [`torch.nn.Module`](https://pytorch.org/docs/stable/nn.html). You might reuse quite a lot of the code from logistic regression. As before, you will be adding the following two methods:

1. `__init__`: An initializer that takes two `torchtext` fields providing descriptions of the text and label aspects of examples, and `hidden_size` specifying the size of the hidden layer (e.g., in the above illustration, `hidden_size` is 4).

    During initialization, you'll want to define two tensors of weights, which serve as the parameters of this model, one for each layer. You'll want to [initialize them randomly](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.uniform_). 
    
    The weights in the first layer are a kind of lookup (as in the previous part), mapping words to a vector of size `hidden_size`. The [`nn.Embedding` module](https://pytorch.org/docs/master/generated/torch.nn.Embedding.html) is a good way to set up and make use of this weight tensor.
    
    The weights in the second layer define a linear mapping from vectors of size `hidden_size` to vectors of size `num_labels`. The [`nn.Linear` module](https://pytorch.org/docs/master/generated/torch.nn.Linear.html) or [`torch.mm`](https://pytorch.org/docs/master/generated/torch.mm.html) for matrix multiplication may be helpful here.

2. `forward`: Given a text batch of size `batch_size X max_length`, the `forward` function returns a tensor of logits of size `batch_size X num_labels`. 

    That is, for each text $\mathbf{x}$ in the batch and each label $c$, you'll be calculating $MLP(bow(\mathbf{x}))$ as shown in the illustration above, returning a tensor of these values. Note that the softmax operation is absorbed into [`nn.CrossEntropyLoss`](https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html) so you don't need to worry about that.
    
    For the sigmoid sublayer, you might find [`nn.Sigmoid`](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html) useful.

You should expect to achieve at least **90%** accuracy on the ATIS classification task. 

In [None]:
#TODO
class MultiLayerPerceptron(nn.Module):
  def __init__ (self, text, label, hidden_size=128):
    super().__init__ ()
    self.text = text
    self.label = label
    self.padding_id = text.vocab.stoi[text.pad_token]
    self.hidden_size = hidden_size
    # Keep the vocabulary sizes available
    self.N = len(label.vocab.itos) # num_classes
    self.V = len(text.vocab.itos)  # vocab_size
    # Specify cross-entropy loss for optimization
    self.criterion = nn.CrossEntropyLoss()
    # TODO: implement here
    raise NotImplementedError

  def forward(self, text_batch):
    # TODO: implement here
    raise NotImplementedError

  def train_all(self, train_iter, val_iter, epochs=8, learning_rate=3e-3):
    # Use Adam to optimize the parameters
    optim = torch.optim.Adam(self.parameters(), lr=learning_rate)
    best_validation_accuracy = -float('inf')
    self.best_model = None
    # Run the optimization for multiple epochs
    for epoch in range(epochs):
      # Switch the module to training mode at the start of every epoch,
      # since evaluate() leaves it in evaluation mode
      self.train()
      c_num = 0
      total = 0
      running_loss = 0.0
      for batch in tqdm(train_iter):
        # Zero the parameter gradients
        optim.zero_grad()

        # Input and target
        text = batch.text           # a tensor of shape (bsz, max_len)
        logits = self.forward(text) # perform the forward computation
        target = batch.label.long() # bsz
        batch_size = len(target)

        # Compute the loss
        loss = self.criterion(logits, target)

        # Perform backpropagation
        loss.backward()
        optim.step()

        # Prepare to compute the accuracy
        predictions = torch.argmax(logits, dim=1)
        total += batch_size
        c_num += (predictions == target).float().sum().item()        
        running_loss += loss.item() * batch_size

      # Evaluate and track improvements on the validation dataset
      validation_accuracy = self.evaluate(val_iter)
      if validation_accuracy > best_validation_accuracy:
        best_validation_accuracy = validation_accuracy
        self.best_model = copy.deepcopy(self.state_dict())
      epoch_loss = running_loss / total
      epoch_acc = c_num / total
      print (f'Epoch: {epoch} Loss: {epoch_loss:.4f} '
             f'Training accuracy: {epoch_acc:.4f} '
             f'Validation accuracy: {validation_accuracy:.4f}')

  def evaluate(self, iterator):
    self.eval()   # switch the module to evaluation mode
    total = 0     # running total of examples
    c_num = 0     # running total of correctly classified examples
    with torch.no_grad():  # gradients are not needed for evaluation
      for batch in tqdm(iterator):
        text = batch.text
        logits = self.forward(text)                 # calculate the logits
        target = batch.label.long()                 # extract gold labels
        predictions = torch.argmax(logits, dim=-1)  # calculate predicted labels
        total += len(target)
        c_num += (predictions == target).float().sum().item()
    return c_num / total

In [None]:
# Instantiate classifier and run it
model = MultiLayerPerceptron(TEXT, LABEL).to(device) 
model.train_all(train_iter, val_iter)
model.load_state_dict(model.best_model)
test_accuracy = model.evaluate(test_iter)
print (f'Test accuracy: {test_accuracy:.4f}')