# Setup

*   GPU: Go to Edit -> Notebook settings. Choose GPU as Hardware accelerator.
*   Packages: Import some useful packages and set everything up.

In [None]:
%matplotlib inline
import copy
import math
import numpy as np
import os
import random
import torch
import torch.nn as nn
import unittest

from collections import Counter
from datetime import datetime
from torch.utils.data import Dataset, DataLoader

def set_seed(seed):  # For reproducibility, fix random seeds.
  random.seed(seed)
  np.random.seed(seed)
  torch.manual_seed(seed)

set_seed(42)

# Data

Download [SST-2](https://dl.fbaipublicfiles.com/glue/data/SST-2.zip), which is a version of [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) prepared by the [GLUE Benchmark](https://gluebenchmark.com/) for sentence-level binary sentiment classification of movie reviews (either positive or negative). We will assume that we have the directory `data/SST-2/` in our Google Drive account. Let's load the data and stare at it.  

In [None]:
# Load the Drive helper and mount. You will have to authorize this operation.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
label2str = {0: 'NEGATIVE', 1: 'POSITIVE'}  # Integer-string mapping for labels

def load_data(filename):
  with open(os.path.join('/content/drive/My Drive/data/SST-2/', filename)) as f:
    f.readline()
    data = [line.split('\t') for line in f]
  data = [(x.split(), int(y)) for (x, y) in data]  # Text already tokenized (woohoo!), just use whitespace
  balance = len([_ for x, y in data if y == 1]) / len(data) * 100.  # What percentage is positive?
  return data, balance, [len(x) for x, _ in data]

train_data, balance_train, lengths_train = load_data('train.tsv')
val_data, balance_val, lengths_val = load_data('dev.tsv')

print('{} train examples ({:.1f}% positive)'.format(len(train_data), balance_train))
print('{} val examples ({:.1f}% positive)'.format(len(val_data), balance_val))
print('No test labels released\n')

print('Some cherry-picked examples (labels {}):'.format(str(label2str)))
for i in [13, 901, 903, 1001, 61]:
  print('{}\t"{}"'.format(train_data[i][1], ' '.join(train_data[i][0])))

print('\nSentence lengths')
print('  Train: average {:5.1f}, max {}, min {}'.format(sum(lengths_train) / len(lengths_train), max(lengths_train), min(lengths_train)))
print('  Val:   average {:5.1f}, max {}, min {}'.format(sum(lengths_val) / len(lengths_val), max(lengths_val), min(lengths_val)))

67349 train examples (55.8% positive)
872 val examples (50.9% positive)
No test labels released

Some cherry-picked examples (labels {0: 'NEGATIVE', 1: 'POSITIVE'}):
0	"saw how bad this movie was"
1	"nicely done"
0	"the satire is just too easy to be genuinely satisfying ."
1	"a gritty police thriller with all the dysfunctional family dynamics one could wish for"
0	"no apparent joy"

Sentence lengths
  Train: average   9.4, max 52, min 1
  Val:   average  19.5, max 47, min 2


Feel free to look at other examples yourself. Some observations about the data:

- Many sentences can be classified correctly by keywords like "bad" for negative and "nicely" for positive, so we expect the bag-of-words representation to perform well. But there are harder examples: in "too easy to be genuinely satisfying" and "no apparent joy" a positive word is negated, and "dysfunctional" is a negative word that is actually used to praise the movie.

- Sentences are really short. In the training set, the maximum length is merely 52, and on average a sentence has fewer than 10 tokens. This will be convenient for training.

As a first step, let's construct a vocabulary and convert everything to integers. We will introduce an "unk" type to represent any unknown word type at test time. We will also introduce a special padding token "pad", and ensure that it gets index 0 which is convenient for later use.

In [None]:
PAD = '<pad>'
UNK = '<unk>'
vocab = Counter([tok for toks, _ in train_data for tok in toks])
assert PAD not in vocab
assert UNK not in vocab
vocab[PAD] = 9999999  # PAD will get index 0
vocab[UNK] = 9999998  # UNK will get index 1
vocab_size = 10000
vocab = [word for word, _ in vocab.most_common(vocab_size)]
assert vocab[0] == PAD
print('Vocab size: {} (with PAD and UNK added)'.format(len(vocab)))
print('vocab[0]:', vocab[0])
print('vocab[1]:', vocab[1])
w2i = {}
for i, word in enumerate(vocab):
  w2i[word] = i

# Note that we're preserving word ordering.
train_sents = [[w2i[tok] if tok in w2i else w2i[UNK] for tok in x] for x, _ in train_data]
val_sents = [[w2i[tok] if tok in w2i else w2i[UNK] for tok in x] for x, _ in val_data]

Vocab size: 10000 (with PAD and UNK added)
vocab[0]: <pad>
vocab[1]: <unk>


Now we package data for the [PyTorch](https://pytorch.org/) library.

Read [torch.utils.data](https://pytorch.org/docs/stable/data.html) for details of how PyTorch expects data. The first thing we need is a `Dataset`. This simply wraps data and returns specified items at given indices. To make items easily batchable, we will do naive padding and make every sentence have the same length. This naive padding won't work in general when sentences can be long; in such a case we need to define a custom batching operation (see below).

In [None]:
class SSTDataset(Dataset):  # A child class of torch.utils.data.Dataset

  def __init__(self, sents, labels, max_length):
    self.sents = sents
    self.labels = labels
    self.max_length = max_length

  def __len__(self):  # This defines the "size" of the dataset.
    return len(self.sents)

  def __getitem__(self, index):  # This returns a single indexed example.
    sent = torch.tensor(self.sents[index])
    sent_padded = torch.cat([sent, torch.zeros(self.max_length - len(sent))]).long()  # Avoid for loop by using torch.zeros.
    label = torch.tensor(self.labels[index])
    return sent_padded, label, len(sent)  # Since we've padded, we need to inform the original length.

dataset_train = SSTDataset(train_sents, [y for _, y in train_data], max(lengths_train))
dataset_val = SSTDataset(val_sents, [y for _, y in val_data], max(lengths_val))

x1, y1, length1 = dataset_train[0]
print(x1, x1.size(), y1, length1)

tensor([4457,   94,    1,   37,    2, 7261, 9001,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0]) torch.Size([52]) tensor(0) 7


Each element of our `Dataset` is a tuple (two PyTorch tensors and an integer length). The `Dataset` can be fed into a `DataLoader`, which does batching for us. **Important**: No matter what dataset you work on in the future, you will generally follow this syntax:
```python
loader = DataLoader(dataset,
                    batch_size=batch_size,
                    shuffle=shuffle,  # True if this is a loader for training
                    num_workers=num_workers,  # Number of worker subprocesses for data loading (0 loads in the main process)
                    collate_fn=collate_fn,  # Custom batching operation (e.g., needed if sequences have different lengths)
                    batch_sampler=batch_sampler)  # Custom batch sampling (mutually exclusive with batch_size and shuffle)
```
In our case, every padded sentence has the same length, so default batching kicks in and we don't have to provide a custom batching operation (`collate_fn`). A `DataLoader` is an iterable that, like a [generator](https://wiki.python.org/moin/Generators), lets us loop over the data one batch at a time.
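
For reference, here is a minimal sketch of what a custom `collate_fn` could look like when sentences are *not* padded in advance. The name `pad_collate` and the toy data are hypothetical (not part of this notebook), and it assumes each dataset item is a `(token_ids, label)` pair with PAD index 0. It pads each batch only up to the longest sentence in that batch.

```python
import torch
from torch.utils.data import DataLoader

def pad_collate(batch):
  # batch is a list of (token_ids, label) pairs; token_ids is a plain Python list.
  sents, labels = zip(*batch)
  lengths = torch.tensor([len(s) for s in sents])
  max_len = int(lengths.max())
  padded = torch.zeros(len(sents), max_len, dtype=torch.long)  # 0 is assumed to be the PAD index
  for i, s in enumerate(sents):
    padded[i, :len(s)] = torch.tensor(s, dtype=torch.long)
  return padded, torch.tensor(labels), lengths

toy_data = [([5, 3, 8], 1), ([7], 0), ([2, 9], 1), ([4, 4, 4, 4], 0)]  # Hypothetical examples
toy_loader = DataLoader(toy_data, batch_size=2, shuffle=False, collate_fn=pad_collate)
toy_batches = list(toy_loader)
for sents_b, labels_b, lengths_b in toy_batches:
  print(tuple(sents_b.size()), labels_b.tolist(), lengths_b.tolist())
```

The first batch is padded to length 3 and the second to length 4, so short batches do not pay for the longest sentence in the whole dataset.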

In [None]:
# We will define a train loader later since we'll be changing training batch sizes.
dataloader_val = DataLoader(dataset_val, batch_size=128, shuffle=False)

for batch_num, batch in enumerate(dataloader_val):
  sents, labels, lengths = batch
  if batch_num == 0:
    print(sents, sents.size())  # (batch_size, max_length)
    print(labels, labels.size())  # batch_size
    print(lengths, lengths.size())  # batch_size
  # Don't break here, so the generator is all spent and reset.

tensor([[  13,    9,    4,  ...,    0,    0,    0],
        [   1, 3043,    5,  ...,    0,    0,    0],
        [ 813,   93,    8,  ...,    0,    0,    0],
        ...,
        [   4,   20,   11,  ...,    0,    0,    0],
        [   1,    2, 7356,  ...,    0,    0,    0],
        [  12,    2,  192,  ...,    0,    0,    0]]) torch.Size([128, 47])
tensor([1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,
        1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1,
        1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1,
        1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0,
        0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0,
        1, 0, 1, 1, 1, 1, 0, 0]) torch.Size([128])
tensor([ 9,  4, 20, 20,  9, 21,  5, 10, 20, 33, 16, 30, 19, 23, 25, 20, 18, 32,
        10, 19, 32, 13,  6, 16, 16, 27, 28, 23, 10, 11, 28, 25,  8, 33, 20, 11,
        26, 27, 23, 16, 19, 21, 21, 11, 1

# Model

Let's write code to train a binary classifier. We will again train it by the cross-entropy loss. As an exercise, we will implement the loss module by hand, which combines sigmoid with log for a more numerically stable gradient form. See the [page about extending torch.autograd](https://pytorch.org/docs/master/notes/extending.html#extending-torch-autograd) to understand the syntax here.
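
The gradient used in `backward` in the code below follows from the chain rule. For a single example with logit $z$, label $y \in \{0, 1\}$, and probability $p = \sigma(z)$:

$$
\ell(z, y) = -\big[\, y \log p + (1 - y) \log (1 - p) \,\big], \qquad p = \sigma(z) = \frac{1}{1 + e^{-z}}
$$

$$
\frac{\partial p}{\partial z} = p (1 - p)
\quad\Longrightarrow\quad
\frac{\partial \ell}{\partial z} = -\left[ \frac{y}{p} - \frac{1 - y}{1 - p} \right] p (1 - p) = -\big[\, y (1 - p) - (1 - y) p \,\big] = p - y
$$

Since the batch loss is a sum over examples, the gradient with respect to the vector of logits is simply the vector of `probs - labels`.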

In [None]:

class BinaryCrossEntropyLossFunction(torch.autograd.Function):

  @staticmethod
  def forward(ctx, logits, labels):
    probs = 1. / (1 + (-logits).exp())
    ctx.save_for_backward(probs, labels)  # Just need probabilities for backward
    losses = - (labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs))  # By the definition of binary cross entropy loss
    return losses.sum()

  @staticmethod
  def backward(ctx, grad_output):
    probs, labels = ctx.saved_tensors
    jacobian = probs - labels  # Jacobian is partial derivative of Loss function wrt logits. By chain rule we get the Jacobian as p - y i.e. probs - labels
    grad_logits = grad_output * jacobian
    return grad_logits, None  # No need to calculate gradient wrt labels


class BinaryCrossEntropyLoss(nn.Module):

  def forward(self, logits, labels):
    return BinaryCrossEntropyLossFunction.apply(logits, labels)

Now our custom `BinaryCrossEntropyLoss` module can be plugged into a computation graph as a node. The forward pass is recorded as the graph is built; when `backward()` is called on the scalar output node, autograd runs the backward pass through every recorded node, including ours.

For instance, in the unit test and in the training code below, we will call `backward()` on `loss / N`. Under the hood, writing `loss / N` is syntactic sugar: PyTorch creates another node for the scalar division, with its own forward and backward functions.
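
One caveat worth illustrating: computing `probs` first and then taking `log` in the forward pass can overflow for logits of large magnitude. The sketch below is not the implementation used in this notebook; it assumes the algebraically equivalent form $\mathrm{softplus}(z) - yz$ (where $\mathrm{softplus}(z) = \log(1 + e^{z})$), with hypothetical helper names `naive_bce_sum` and `stable_bce_sum`, and compares both against PyTorch's built-in loss on a few extreme logits.

```python
import torch
import torch.nn.functional as F

def naive_bce_sum(logits, labels):
  # Same formula as the forward pass above: sigmoid first, then log.
  probs = 1. / (1 + (-logits).exp())
  return -(labels * torch.log(probs) + (1 - labels) * torch.log(1 - probs)).sum()

def stable_bce_sum(logits, labels):
  # log(sigmoid(z)) = z - softplus(z) and log(1 - sigmoid(z)) = -softplus(z),
  # so the per-example loss simplifies to softplus(z) - y * z.
  return (F.softplus(logits) - labels * logits).sum()

z_toy = torch.tensor([-100., -2., 0., 3., 100.])  # Hypothetical logits, including extreme values
y_toy = torch.tensor([1., 0., 1., 1., 0.])
loss_ref = F.binary_cross_entropy_with_logits(z_toy, y_toy, reduction='sum')
loss_naive = naive_bce_sum(z_toy, y_toy)
loss_stable = stable_bce_sum(z_toy, y_toy)
print(loss_naive.item(), loss_stable.item(), loss_ref.item())
```

In float32 the naive version is not finite on these inputs, while the softplus form agrees with the reference. This is one plausible source of the `inf` losses that show up at large learning rates.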

In [None]:
class TestBinaryCrossEntropyLoss(unittest.TestCase):

  def setUp(self):
    self.batch_size = 42
    self.places = 6
    self.logits = np.random.randn(self.batch_size)
    self.labels = np.random.randint(2, size=self.batch_size)
    self.mine = BinaryCrossEntropyLoss()
    self.gold = torch.nn.BCEWithLogitsLoss(reduction='sum')

  def test_forward_backward(self):
    def run_loss_layer(layer):
      variables = torch.tensor(self.logits, requires_grad=True)

      # The final node is actually a scalar division node.
      loss_node = layer(variables, torch.tensor(self.labels).float()) / self.batch_size

      # This runs the backward pass. Our layer receives the gradient propagated from the final node (scalar division).
      loss_node.backward()

      grad = copy.deepcopy(variables.grad)
      return loss_node.item(), grad.tolist()

    loss, grad = run_loss_layer(self.mine)
    loss_gold, grad_gold = run_loss_layer(self.gold)
    self.assertAlmostEqual(loss, loss_gold, places=self.places)
    for i in range(len(grad)):
        self.assertAlmostEqual(grad[i], grad_gold[i], places=self.places)

unittest.main(TestBinaryCrossEntropyLoss(), argv=[''], verbosity=2, exit=False)

test_forward_backward (__main__.TestBinaryCrossEntropyLoss) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.118s

OK


<unittest.main.TestProgram at 0x7ce3fc589c60>

A general binary classifier module is implemented below. It assumes an **encoder** module that encodes sentences into same-dimensional embeddings. At that point, all that's left is to apply a linear classifier to get logits and use them to classify / compute loss.

In [None]:
class BinaryClassifier(nn.Module):

  def __init__(self, encoder):
    super().__init__()
    self.encoder = encoder
    self.score = nn.Linear(encoder.dim, 1)  # In PyTorch, Linear has both weight matrix and bias (W, b) by default.
    self.loss = BinaryCrossEntropyLoss()

  def forward(self, sents, lengths, labels=None):
    embs = self.encoder(sents, lengths)  # (batch_size, dim)
    logits = self.score(embs)  # (batch_size, 1)
    logits = logits.squeeze(1)  # batch_size
    loss_total = None if labels is None else self.loss(logits, labels)
    return logits, loss_total

## Continuous bag-of-words (CBOW) encoder

We will consider a simple encoder (we will call this "continuous bag-of-words" or CBOW encoder) that averages word embeddings, and optionally applies a nonlinear feedforward layer with a residual connection.

In [None]:
class CBOWEncoder(nn.Module):

  def __init__(self, vocab_size, dim, ff=False, activation='relu', drop=0.):
    super().__init__()
    self.dim = dim
    self.ff = ff
    self.wemb = nn.Embedding(vocab_size, dim, padding_idx=0)  # Assumes padding token has index 0
    if ff:
      # Chain a linear layer with an activation layer into one layer.
      self.ff_layer = nn.Sequential(nn.Linear(dim, dim),
                                    {'relu': nn.ReLU(), 'tanh': nn.Tanh()}[activation])
    self.drop = nn.Dropout(drop)

  def forward(self, sents, lengths):
    wembs = self.wemb(sents)  # (batch_size, max_length, dim)
    embs = wembs.sum(dim=1) / lengths[:, None]  # (batch_size, dim)
    if self.ff:
      embs = self.ff_layer(embs) + embs  # Residual connection, probably harmless
    embs = self.drop(embs)
    return embs

We introduced the notion of **word embeddings** (more generally, feature embeddings). This is simply a $V \times d$ matrix where we have a $d$-dimensional *parameter* for each of $V$ word types. We learn these embeddings (along with other weights) to minimize the loss.

Because we use padding, we will indicate that 0 is a pad token. Then PyTorch Embedding will make this a zero vector, so that it does not affect the loss value or gradients.
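
A small check of this behavior, using a toy embedding and two hypothetical padded sentences: the PAD row is the zero vector, so averaging over the true lengths ignores the padding, and the PAD row receives zero gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb_toy = nn.Embedding(5, 2, padding_idx=0)
sents_toy = torch.tensor([[1, 2, 0, 0],
                          [3, 0, 0, 0]])  # Two padded toy sentences
lengths_toy = torch.tensor([2, 1])

avg_toy = emb_toy(sents_toy).sum(dim=1) / lengths_toy[:, None]  # Same averaging as in CBOWEncoder.forward
avg_toy.sum().backward()

print(emb_toy.weight[0])       # The PAD row is all zeros
print(emb_toy.weight.grad[0])  # The PAD row also gets zero gradient
print(torch.allclose(avg_toy[0], (emb_toy.weight[1] + emb_toy.weight[2]) / 2))
```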

All parameter values are initialized randomly according to some default initialization scheme. For instance, the [Embedding](https://github.com/pytorch/pytorch/blob/8a9090219ef3a18d41f5b95ab545915f707847c2/torch/nn/modules/sparse.py#L136) layer uses normal and the [Linear](https://github.com/pytorch/pytorch/blob/8a9090219ef3a18d41f5b95ab545915f707847c2/torch/nn/modules/linear.py#L86) layer uses Kaiming uniform.

In [None]:
A = nn.Embedding(5, 2, padding_idx=0)
print(A.weight)  # Each row corresponds to a 2-dimensional embedding of the corresponding feature type.

Parameter containing:
tensor([[ 0.0000,  0.0000],
        [ 1.1561,  0.3965],
        [-2.4661,  0.3623],
        [ 0.3765, -0.1808],
        [ 0.3930,  0.4327]], requires_grad=True)


We also use **dropout**. The dropout layer `Dropout(p)` is an effective tool for regularization. During training, it "drops" each element of the input array to 0 with probability $p$, and scales the elements that are not dropped as $x \mapsto x/(1-p)$ so that the expected value of each element is preserved. During evaluation, PyTorch turns off dropout.

In [None]:
dropper = nn.Dropout(0.75)  # Train mode by default.
x = torch.randn(10)
print('x:          ({})'.format(' '.join(['{:.2f}'.format(val) for val in x])))
print('dropper(x): ({})'.format(' '.join(['{:.2f}'.format(val) for val in dropper(x)])))  # Survivors multiplied by 4

dropper.eval()
print('At eval:    ({})'.format(' '.join(['{:.2f}'.format(val) for val in dropper(x)])))

x:          (-1.36 1.36 0.67 -0.71 -0.33 -0.28 -0.42 -1.33 -0.36 0.15)
dropper(x): (-0.00 0.00 0.00 -2.83 -0.00 -0.00 -0.00 -0.00 -1.46 0.00)
At eval:    (-1.36 1.36 0.67 -0.71 -0.33 -0.28 -0.42 -1.33 -0.36 0.15)


This encoder module is fed into the classifier module above. In PyTorch, this makes the parameters of the encoder module part of the parameters of the classifier module. You can print the model to see its layers and their dimensions.
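
To see the registration mechanism in isolation, here is a minimal sketch with two hypothetical stand-in modules (`ToyEncoder` and `ToyClassifier` are not part of this notebook). Assigning a module to an attribute registers it as a submodule, and its parameters then appear under a prefixed name in `named_parameters()`.

```python
import torch.nn as nn

class ToyEncoder(nn.Module):  # Hypothetical stand-in for an encoder
  def __init__(self):
    super().__init__()
    self.wemb = nn.Embedding(7, 3, padding_idx=0)

class ToyClassifier(nn.Module):  # Hypothetical stand-in for a classifier
  def __init__(self, encoder):
    super().__init__()
    self.encoder = encoder  # Assigning a Module attribute registers it as a submodule.
    self.score = nn.Linear(3, 1)

toy = ToyClassifier(ToyEncoder())
param_shapes = {name: tuple(p.shape) for name, p in toy.named_parameters()}
for name, shape in param_shapes.items():
  print(name, shape)
```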

In [None]:
def count_params(model):
  return sum(p.numel() for p in model.parameters())

model = BinaryClassifier(CBOWEncoder(vocab_size, 128, ff=True, activation='tanh', drop=0.1))
print('Model has {} parameters\n'.format(count_params(model)))
print(model)
print()

print('First few values of the score layer\'s weight vector')
print(model.score.weight.data[0][:10])

Model has 1296641 parameters

BinaryClassifier(
  (encoder): CBOWEncoder(
    (wemb): Embedding(10000, 128, padding_idx=0)
    (ff_layer): Sequential(
      (0): Linear(in_features=128, out_features=128, bias=True)
      (1): Tanh()
    )
    (drop): Dropout(p=0.1, inplace=False)
  )
  (score): Linear(in_features=128, out_features=1, bias=True)
  (loss): BinaryCrossEntropyLoss()
)

First few values of the score layer's weight vector
tensor([-0.0363, -0.0349, -0.0250,  0.0802,  0.0356, -0.0730,  0.0248,  0.0112,
        -0.0367, -0.0807])


Note that the number of parameters is quite large, mostly because of the word embedding matrix which alone has $Vd$ parameters. Let's write a helper function to evaluate a model on validation data.
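
The count reported above can be reproduced by hand. This short calculation assumes the configuration printed above ($V = 10000$, $d = 128$, with the feedforward layer enabled) and shows that the embedding matrix accounts for almost all parameters.

```python
V, d = 10000, 128
n_wemb = V * d        # Embedding matrix
n_ff = d * d + d      # Feedforward Linear(d, d): weight and bias
n_score = d * 1 + 1   # Scoring Linear(d, 1): weight and bias
n_total = n_wemb + n_ff + n_score
print(n_wemb, n_ff, n_score, n_total)  # 1280000 16512 129 1296641
```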

In [None]:
def get_acc_val(model):
  num_correct_val = 0
  model.eval()  # This turns off the training mode.
  with torch.no_grad():  # This deactivates autodiff for improved efficiency.
    for batch in dataloader_val:
      sents, labels, lengths = batch
      sents = sents.to(model.score.weight.device)  # Send data to same device that model is on.
      labels = labels.to(model.score.weight.device)
      lengths = lengths.to(model.score.weight.device)
      logits, _ = model(sents, lengths, labels)
      preds = torch.where(logits > 0., 1, 0)  # 1 if p(1|x) > 0.5, 0 else
      num_correct_val += (preds == labels).sum()
  acc_val = num_correct_val / len(dataloader_val.dataset) * 100.
  return acc_val

### Training

Let's again define our custom SGD optimizer to demystify the training process. It simply iterates over all parameters and takes gradient steps, assuming that the gradient of each parameter is stored in its `grad` field.

In [None]:
class SGDOptimizer:

  def __init__(self, parameters, learning_rate):
    self.params = list(parameters)
    self.lr = learning_rate

  def step(self):
    with torch.no_grad():
      for p in self.params:
        p -= self.lr * p.grad  # In PyTorch, every parameter object has a grad field that stores an accumulated gradient.

  def zero_grad(self):
    for p in self.params:
      p.grad.data.zero_()
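
As a sanity check, the update rule above should coincide with `torch.optim.SGD` when no momentum or weight decay is used. The sketch below repeats the update rule in a hypothetical class `PlainSGD` (so the snippet is self-contained) and compares one step on a toy linear model.

```python
import copy
import torch
import torch.nn as nn

class PlainSGD:  # Same update rule as SGDOptimizer above, repeated for self-containment
  def __init__(self, parameters, learning_rate):
    self.params = list(parameters)
    self.lr = learning_rate

  def step(self):
    with torch.no_grad():
      for p in self.params:
        p -= self.lr * p.grad

torch.manual_seed(0)
model_a = nn.Linear(4, 1)
model_b = copy.deepcopy(model_a)  # Identical initial weights
inputs_toy = torch.randn(8, 4)

opt_a = PlainSGD(model_a.parameters(), 0.1)
opt_b = torch.optim.SGD(model_b.parameters(), lr=0.1)  # No momentum or weight decay by default

model_a(inputs_toy).sum().backward()
model_b(inputs_toy).sum().backward()
opt_a.step()
opt_b.step()

same_update = all(torch.allclose(pa, pb) for pa, pb in zip(model_a.parameters(), model_b.parameters()))
print(same_update)
```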

Let's write code for training a PyTorch model. This is a pretty general training routine, independent of specific models or tasks. Spend some time to make yourself familiar with it.

In [None]:
def train(model, dataloader_train, optimizer, num_epochs=10, clip=0., verbose=True, device='cpu', select_model=False):
  model = model.to(device)  # Move the model to device.

  loss_avg = float('inf')
  acc_train = 0.
  best_acc_val = 0.
  start_time = datetime.now()  # Keep track of training time.
  best_state_dict = None
  num_continuous_fails = 0
  tolerance = 6
  for epoch in range(num_epochs):
    model.train()  # This turns on the training mode (e.g., enable dropout).
    loss_total = 0.
    num_correct_train = 0
    for batch_ind, batch in enumerate(dataloader_train):
      sents, labels, lengths = batch
      sents = sents.to(device)  # Move data to device.
      labels = labels.to(device)
      lengths = lengths.to(device)
      logits, loss_batch_total = model(sents, lengths, labels)
      preds = torch.where(logits > 0., 1, 0)  # 1 if p(1|x) > 0.5, 0 else
      num_correct_train += (preds == labels).sum()
      loss_total += loss_batch_total.item()

      if math.isnan(loss_total):  # Let's not waste time if we get NaN.
        break

      loss_batch_avg = loss_batch_total / sents.size(0)  # Final node of the computation graph of this batch.
      loss_batch_avg.backward()  # Backpropagate through the graph built by the forward pass above.

      if clip > 0.:  # Optional gradient clipping
        nn.utils.clip_grad_norm_(model.parameters(), clip)

      optimizer.step()  # optimizer updates model weights based on stored gradients
      optimizer.zero_grad()  # Reset gradient slots to zero

    if math.isnan(loss_total):
      print('Stopping training because loss is NaN')
      break

    # Useful training information, kept track of efficiently.
    loss_avg = loss_total / len(dataloader_train.dataset)
    acc_train = num_correct_train / len(dataloader_train.dataset) * 100.

    # Check validation performance at the end of every epoch.
    acc_val = get_acc_val(model)

    if verbose:
      print('Epoch {:3d} | avg loss {:8.4f} | train acc {:2.2f} | val acc {:2.2f}'.format(epoch + 1, loss_avg, acc_train, acc_val))

    if acc_val > best_acc_val:
      num_continuous_fails = 0
      best_acc_val = acc_val
      if select_model:
        best_state_dict = copy.deepcopy(model.state_dict())
    else:
      num_continuous_fails += 1
      if num_continuous_fails > tolerance:
        print('Early stopping')
        break

  train_time = datetime.now() - start_time
  if verbose:
    print('Final avg loss {:8.4f} | final train acc {:2.2f} | best val acc {:2.2f} | train time {:d} secs'.format(loss_avg, acc_train, best_acc_val, train_time.seconds))

  if select_model and best_state_dict is not None:
    model.load_state_dict(best_state_dict)
  model.eval()
  return loss_avg, acc_train, best_acc_val, train_time

Let's try training a CBOW sentiment classifier. Since we have no idea what hyperparameter configuration works, we will do a random search, treating all of the following as hyperparameters:
- Random seed
- Learning rate
- Batch size
- Dropout probability
- Whether to use a nonlinear feedforward layer or not
- Activation function
- Dimension of embeddings

In [None]:
grid = {'seed': list(range(100000)),
        'lr': [1, 0.1, 0.01, 0.001, 0.0001],
        'batch_size': [16, 32, 64],
        'drop': [0., 0.1, 0.3, 0.5],
        'ff': [False, True],
        'activation': ['relu', 'tanh'],
        'dim': [64, 128, 256]}

def print_row(hparams, train_output):
  loss_avg, acc_train, acc_val, train_time = train_output
  print('seed {:6d} | lr {:1.5f} | batch {:d} | drop {:.1f} | ff {:d} | {} | dim {:3d} | loss {:1.4f} | train acc {:2.2f} | val acc {:2.2f} | {:d} secs'.format(
      hparams['seed'], hparams['lr'], hparams['batch_size'], hparams['drop'], hparams['ff'], hparams['activation'], hparams['dim'], loss_avg, acc_train, acc_val, train_time.seconds))

num_runs = 20

if False:  # Set True if you want to run a random search.
  best_acc_val = 0.
  for run_num in range(1, num_runs + 1):
    hparams = {}
    for hparam, values in grid.items():
      hparams[hparam] = random.choice(values)
    set_seed(hparams['seed'])
    encoder = CBOWEncoder(vocab_size, hparams['dim'], ff=hparams['ff'], activation=hparams['activation'], drop=hparams['drop'])
    model = BinaryClassifier(encoder)
    dataloader_train = DataLoader(dataset_train, batch_size=hparams['batch_size'], shuffle=True)
    optimizer = SGDOptimizer(model.parameters(), hparams['lr'])
    train_output = train(model, dataloader_train, optimizer, clip=1., num_epochs=30, verbose=False, device='cuda')
    print_row(hparams, train_output)
    acc_val = train_output[2]
    if acc_val > best_acc_val:
      best_acc_val = acc_val
  print('Best acc {:2.2f}'.format(best_acc_val))

You're not required to run the above search yourself (it'll take $>30$ minutes). In practice, you'd want to parallelize this so that several runs are executed at the same time (assuming you have multiple CPUs/GPUs). But you'll see something like this:

```
Stopping training because loss is NaN
seed  83810 | lr 1.00000 | batch 16 | drop 0.3 | ff 0 | relu | dim  64 | loss 0.6967 | train acc 63.64 | val acc 68.35 | 27 secs
Stopping training because loss is NaN
seed  84440 | lr 1.00000 | batch 64 | drop 0.3 | ff 0 | relu | dim 256 | loss 0.5923 | train acc 72.33 | val acc 75.23 | 43 secs
Early stopping
seed  44131 | lr 0.00010 | batch 32 | drop 0.5 | ff 0 | tanh | dim 256 | loss 0.6894 | train acc 54.89 | val acc 55.50 | 63 secs
seed  61748 | lr 0.10000 | batch 64 | drop 0.1 | ff 1 | relu | dim  64 | loss 0.5525 | train acc 71.66 | val acc 70.18 | 155 secs
Early stopping
seed  55325 | lr 0.00100 | batch 32 | drop 0.3 | ff 1 | tanh | dim 256 | loss 0.6546 | train acc 61.30 | val acc 63.53 | 130 secs
Early stopping
seed  38842 | lr 0.01000 | batch 32 | drop 0.5 | ff 0 | tanh | dim  64 | loss 0.6771 | train acc 56.96 | val acc 55.39 | 94 secs
Stopping training because loss is NaN
seed  31506 | lr 1.00000 | batch 32 | drop 0.5 | ff 1 | relu | dim 256 | loss inf | train acc 0.00 | val acc 0.00 | 2 secs
Early stopping
seed  38316 | lr 0.00100 | batch 64 | drop 0.0 | ff 0 | tanh | dim  64 | loss 0.6819 | train acc 55.78 | val acc 53.33 | 47 secs
Stopping training because loss is NaN
seed  53981 | lr 1.00000 | batch 32 | drop 0.5 | ff 1 | tanh | dim  64 | loss 0.2527 | train acc 90.07 | val acc 81.08 | 124 secs
Stopping training because loss is NaN
seed  76179 | lr 0.01000 | batch 16 | drop 0.0 | ff 1 | relu | dim 256 | loss 0.4118 | train acc 83.50 | val acc 71.90 | 208 secs
Early stopping
seed  95698 | lr 0.00100 | batch 32 | drop 0.5 | ff 0 | tanh | dim  64 | loss 0.6816 | train acc 55.82 | val acc 53.10 | 56 secs
Stopping training because loss is NaN
seed  36782 | lr 1.00000 | batch 32 | drop 0.0 | ff 1 | tanh | dim 128 | loss 0.3386 | train acc 85.51 | val acc 75.57 | 57 secs
Stopping training because loss is NaN
seed  74559 | lr 1.00000 | batch 16 | drop 0.0 | ff 0 | tanh | dim 256 | loss inf | train acc 0.00 | val acc 0.00 | 2 secs
Early stopping
seed  88667 | lr 0.00100 | batch 16 | drop 0.5 | ff 0 | tanh | dim 256 | loss 0.6634 | train acc 60.00 | val acc 60.78 | 283 secs
Stopping training because loss is NaN
seed  28082 | lr 0.10000 | batch 16 | drop 0.0 | ff 1 | tanh | dim 128 | loss 0.3180 | train acc 86.37 | val acc 73.97 | 266 secs
Stopping training because loss is NaN
seed  51128 | lr 1.00000 | batch 16 | drop 0.3 | ff 0 | tanh | dim 256 | loss inf | train acc 0.00 | val acc 0.00 | 4 secs
seed  73733 | lr 0.00100 | batch 64 | drop 0.1 | ff 1 | relu | dim 128 | loss 0.6679 | train acc 59.05 | val acc 57.45 | 144 secs
Early stopping
seed  38556 | lr 0.00100 | batch 32 | drop 0.3 | ff 1 | relu | dim  64 | loss 0.6774 | train acc 57.30 | val acc 54.36 | 134 secs
seed  81377 | lr 0.00100 | batch 16 | drop 0.3 | ff 1 | relu | dim 128 | loss 0.6661 | train acc 59.61 | val acc 59.52 | 384 secs
seed  81160 | lr 0.00100 | batch 16 | drop 0.0 | ff 0 | tanh | dim 128 | loss 0.6613 | train acc 60.71 | val acc 61.58 | 329 secs
Best acc 81.08
```

It can be overwhelming with so many moving parts. Nonetheless, we can make some observations:

1. **The performance varies wildly.** Don't be discouraged (yet)! It is expected that many hyperparameter values are completely off, and only a very small subset of them work at all. If we find *some* configuration that seems to work generally okay (e.g., learning rate 1 and batch size 32), it doesn't matter how many of these random runs are failures.

2. **Numerical stability is an issue.** In many cases we get NaN loss because the learning rate is too high. But then the performance is often bad because the learning rate is too low. Optimization issues can be mitigated to some degree by using a more robust optimizer like [Adam](https://pytorch.org/docs/stable/optim.html#torch.optim.Adam), but are in general unavoidable.

3. **It's pretty difficult to tell which hparam is having what impact.** Because the hparams interact, it's often hard to tell what's actually making a difference. For instance, in the above run we got lucky with val acc 81.08. Is it because of the learning rate 1, or the batch size 32, or the use of a feedforward layer with tanh, or the dimension 64, or the dropout prob 0.5, or even the random seed 53981? There's no magical solution, but we can still try to get the big picture. For instance, (1) basic SGD doesn't seem to work well with learning rates that are too small (e.g., $0.001$), and (2) an additional nonlinear feedforward layer doesn't seem that beneficial (we can get 75.23 acc without it). To tell the impact of a specific hparam, we must do a controlled study (e.g., does dropout help?).
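
Regarding observation 2: `torch.optim.Adam` exposes the same `step()` and `zero_grad()` methods as the custom SGD optimizer, so it can be passed to a training loop of the same shape. The sketch below uses a toy linear model and synthetic data; the learning rate `1e-2` is a hypothetical starting value, not a tuned setting for SST-2.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Linear(3, 1)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2)  # lr is a hypothetical starting value

toy_inputs = torch.randn(16, 3)
toy_targets = (toy_inputs.sum(dim=1) > 0).float()  # Synthetic, linearly separable labels
loss_fn = nn.BCEWithLogitsLoss()

toy_losses = []
for _ in range(200):
  loss = loss_fn(toy_model(toy_inputs).squeeze(1), toy_targets)
  loss.backward()
  toy_optimizer.step()       # Same interface as the custom optimizer
  toy_optimizer.zero_grad()
  toy_losses.append(loss.item())

print(toy_losses[0], toy_losses[-1])
```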

Some advice:

- If well-working hyperparameter values are known (e.g., you're replicating some paper), **start from the reported hyperparameter values** instead of searching from scratch like this. It'll save you a huge amount of time and effort.

- To avoid exaggerating lucky performance, you may want to compute standard deviation across ~10 seed values, or even do a [statistical significance test](http://karlstratos.com/notes/statsig.pdf). Unless your new model works *substantially* better than baselines, it's important to be wary of randomness in performance.

We will just use the best configuration based on the search above and see what we get.

In [None]:
if True: # Set True to run.
  set_seed(53981)
  model = BinaryClassifier(CBOWEncoder(vocab_size, 64, ff=True, activation='tanh', drop=0.5))
  print('Model has {} parameters\n'.format(count_params(model)))
  _ = train(model, DataLoader(dataset_train, batch_size=32, shuffle=True),
            SGDOptimizer(model.parameters(), 1.), clip=1., num_epochs=30, verbose=True, device='cuda')


Model has 644225 parameters

Epoch   1 | avg loss   0.7385 | train acc 55.19 | val acc 59.29
Epoch   2 | avg loss   0.6818 | train acc 61.58 | val acc 69.50
Epoch   3 | avg loss   0.6211 | train acc 67.80 | val acc 72.82
Epoch   4 | avg loss   0.5605 | train acc 72.59 | val acc 75.46
Epoch   5 | avg loss   0.5087 | train acc 76.38 | val acc 76.83
Epoch   6 | avg loss   0.4610 | train acc 79.39 | val acc 77.64
Epoch   7 | avg loss   0.4193 | train acc 81.68 | val acc 76.49
Epoch   8 | avg loss   0.3840 | train acc 83.69 | val acc 79.36
Epoch   9 | avg loss   0.3548 | train acc 85.30 | val acc 77.98
Epoch  10 | avg loss   0.3305 | train acc 86.48 | val acc 79.24
Epoch  11 | avg loss   0.3101 | train acc 87.42 | val acc 78.21
Epoch  12 | avg loss   0.2925 | train acc 88.27 | val acc 80.05
Epoch  13 | avg loss   0.2771 | train acc 89.05 | val acc 79.24
Epoch  14 | avg loss   0.2637 | train acc 89.39 | val acc 78.90
Epoch  15 | avg loss   0.2540 | train acc 90.00 | val acc 80.62
Stopping tr

So what is the conclusion? We can't say for certain, since there *might* be some magical hyperparameter value that we didn't use, but we can say with some confidence that the CBOW model can get roughly **70-80% validation accuracy** with the right configuration. It's unlikely that we will dramatically improve over this (e.g., 90%) without fundamentally changing the model, using the same amount of labeled data.

## Convolutional neural network (CNN) encoder

A convolutional layer takes a multi-dimensional array and slides $K$ distinct learnable tensors (called filters or kernels) over it to produce $K \times S$ scalar values, where $S$ is the number of slides (each scalar is the sum of an elementwise product). Then, for each filter's vector of $S$ values we take the max (called max pooling), resulting in a final $K$-dimensional embedding. This can be seen as learning a set of small "windows" and, for each window, taking the most activated slide in the input. CNNs are a foundational tool in image processing: see [here](https://cs231n.github.io/convolutional-networks/#conv) for a tutorial on general CNNs. It is straightforward to apply CNNs to text by treating a sentence as a 2-dimensional array ($T \times d$ where $T$ is the sentence length and $d$ is the word embedding dimension).

In PyTorch, we have a prebuilt conv layer `Conv2d` which makes sure that everything is done as efficiently as possible (e.g., parallelize the sliding operation). So we'll use this.

In [None]:
dim_word = 10
num_filters = 3
filter_width = 4
filter_height = dim_word
A = nn.Embedding(5, dim_word, padding_idx=0)
x = A(torch.LongTensor([[1, 2, 3, 4, 0, 0, 0],
                        [3, 3, 2, 2, 2, 1, 1]]))  # (batch_size, length, dim_word)

# Make the single input channel explicit (Conv2d expects (batch_size, channels, height, width))
x = x.unsqueeze(1)  # (batch_size, 1, length, dim_word)

# The last dim of conv output is dim_word - filter_height + 1, which will be 1 in our case.
conv2 = nn.Conv2d(1, num_filters, (filter_width, filter_height))
relu = nn.ReLU()
h = conv2(x)  # (batch_size, num_filters, length - filter_width + 1, 1)
h = h.squeeze(-1)  # (batch_size, num_filters, length - filter_width + 1)
h = relu(h)

# Max pooling in this context amounts to taking max along the last axis.
z = h.max(dim=-1)[0]  # (batch_size, num_filters)

print('x ', tuple(x.size()))
print('h ', tuple(h.size()))
print('z ', tuple(z.size()))  # (num_filters)-dimensional embedding of each element in batch

x  (2, 1, 7, 10)
h  (2, 3, 4)
z  (2, 3)
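
To make "each scalar is the sum of an elementwise product" concrete, here is a minimal sketch that recomputes the output of `Conv2d` with explicit Python loops and then applies ReLU and max pooling. The random input and the toy sizes are assumptions chosen only for illustration; the loop is far slower than `Conv2d` and is meant purely as a sanity check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
length, dim_word, num_filters, filter_width = 7, 10, 3, 4

# One "sentence" of random word embeddings (illustrative values only).
x = torch.randn(1, 1, length, dim_word)  # (batch_size=1, channels=1, length, dim_word)
conv = nn.Conv2d(1, num_filters, (filter_width, dim_word))

with torch.no_grad():
    # Reference: PyTorch's convolution.
    h_conv = conv(x).squeeze(-1).squeeze(0)  # (num_filters, num_slides)

    # Manual version: slide each filter over the sentence, one window at a time.
    num_slides = length - filter_width + 1
    h_manual = torch.zeros(num_filters, num_slides)
    for k in range(num_filters):
        kernel = conv.weight[k, 0]  # (filter_width, dim_word)
        for s in range(num_slides):
            window = x[0, 0, s:s + filter_width]  # (filter_width, dim_word)
            h_manual[k, s] = (kernel * window).sum() + conv.bias[k]

    # ReLU, then max pooling over the slides gives one value per filter.
    z = torch.relu(h_manual).max(dim=-1)[0]  # (num_filters,)

print('conv matches manual loop:', torch.allclose(h_conv, h_manual, atol=1e-5))
print('z ', tuple(z.size()))
```

The two computations should agree up to floating-point error, which confirms that the filter is applied without flipping (a cross-correlation) and that the bias is added once per slide.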


We learn *multiple* such filters with $W$ different widths (e.g., 3, 4, 5, corresponding to trigrams, 4-grams, 5-grams) and concatenate their outputs, so in the end we represent each sentence by a $KW$-dimensional vector.
- **Pros**: Simple. Good at identifying local patterns ($n$-grams).
- **Cons**: The output of a CNN is still not a function of the order of the entire sequence. Thus it cannot capture long-distance dependencies.

In [None]:
class CNNEncoder(nn.Module):

  def __init__(self, vocab_size, dim_word=300, num_filters=100, filter_widths=(3, 4, 5), drop=0.5):
    super().__init__()
    self.dim = num_filters * len(filter_widths)  # This is the final dimension of a sentence embedding.
    self.wemb = nn.Embedding(vocab_size, dim_word, padding_idx=0)  # Assumes padding token has index 0

    # Convolutional layers corresponding to different widths.
    self.convs = nn.ModuleList([nn.Conv2d(1, num_filters, (filter_width, dim_word)) for filter_width in filter_widths])

    self.relu = nn.ReLU()
    self.drop = nn.Dropout(drop)

  def forward(self, sents, lengths):

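    # Adding a channel dimension
    sents = sents.unsqueeze(1)  # (batch_size, 1, max_length)

    # Embedding the words; the channel dimension is carried along.
    wembs = self.wemb(sents)  # (batch_size, 1, max_length, dim_word)

    # Applying convolutional layers, one per filter width.
    conv_outs = [conv(wembs) for conv in self.convs]  # Each: (batch_size, num_filters, max_length - filter_width + 1, 1)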

    # Squeezing the last dimension to remove it after convolution.
    outs = [out.squeeze(-1) for out in conv_outs]

    # Applying ReLU activation.
    relu_outs = [self.relu(out) for out in outs]

    # Applying max pooling across the output_length dimension to extract the most significant features.
    pool_outs = [out.max(dim=-1)[0] for out in relu_outs]

    # Concatenating pooled outputs from all convolutional layers.
    concat_outs = torch.cat(pool_outs, dim=1)

    # Applying dropout to the concatenated features.
    embs = self.drop(concat_outs)

    return embs

Let's try out the CNN encoder! To avoid the hassle of hyperparameter tuning as much as possible, we will mostly use reported values (e.g., filter widths [3, 4, 5], dropout 0.5, etc.) from [Kim's paper](https://arxiv.org/pdf/1408.5882.pdf), which reported one of the first successful applications of CNNs to NLP. We'll also use PyTorch's predefined [Adam optimizer](https://arxiv.org/pdf/1412.6980.pdf), which uses a moving average of the gradient (i.e., momentum) and normalizes it by the square root of a moving average of the squared gradient (roughly, an estimate of the gradient's scale for each parameter). Adam is often more stable and converges faster than basic SGD, so it has become a standard optimizer.

The configuration below was identified by just trying some perturbations of hyperparameter values that seemed to work well. In particular, Adam's [weight decay/$l_2$ regularization](https://arxiv.org/pdf/1711.05101.pdf) was helpful in using a larger learning rate.
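
To see what Adam does with the gradient, here is a minimal sketch of a single Adam update for one scalar parameter, written in plain Python. The function name `adam_step` and the toy numbers are hypothetical; the default hyperparameters mirror those of `torch.optim.Adam`, and weight decay is treated as an $l_2$ penalty folded into the gradient. Treat this as an illustration of the update rule, not as PyTorch's actual implementation.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update for a single scalar parameter (illustrative sketch)."""
    grad = grad + weight_decay * param       # l2 penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * grad       # Moving average of the gradient (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2  # Moving average of the squared gradient
    m_hat = m / (1 - beta1 ** t)             # Bias corrections for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy example: the first step moves the parameter by roughly lr, whatever the gradient's magnitude.
param, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(param)  # Approximately 0.999

# With zero loss gradient, weight decay alone still pulls the parameter toward zero.
decayed, _, _ = adam_step(param=1.0, grad=0.0, m=0.0, v=0.0, t=1, weight_decay=0.0003)
print(decayed < 1.0)
```

Because the update is normalized by the gradient's scale, the step size is governed mostly by `lr`, which is one reason Adam tends to be less sensitive to the raw magnitude of the gradients than basic SGD.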

In [None]:
if True:  # Set True to run.
  set_seed(42)
  model = BinaryClassifier(CNNEncoder(vocab_size))
  print('Model has {} parameters\n'.format(count_params(model)))
  _ = train(model, DataLoader(dataset_train, batch_size=64, shuffle=True),
            torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0003), clip=0., num_epochs=30, verbose=True, device='cuda')

Model has 3360601 parameters

Epoch   1 | avg loss   0.5252 | train acc 72.99 | val acc 75.69
Epoch   2 | avg loss   0.3528 | train acc 84.65 | val acc 79.47
Epoch   3 | avg loss   0.2874 | train acc 88.04 | val acc 78.44
Epoch   4 | avg loss   0.2574 | train acc 89.47 | val acc 80.85
Epoch   5 | avg loss   0.2360 | train acc 90.55 | val acc 78.78
Epoch   6 | avg loss   0.2147 | train acc 91.46 | val acc 82.00
Epoch   7 | avg loss   0.1972 | train acc 92.25 | val acc 81.31
Epoch   8 | avg loss   0.1806 | train acc 93.05 | val acc 80.62
Epoch   9 | avg loss   0.1705 | train acc 93.39 | val acc 82.00
Epoch  10 | avg loss   0.1605 | train acc 93.75 | val acc 81.77
Stopping training because loss is NaN
Final avg loss   0.1605 | final train acc 93.75 | best val acc 82.00 | train time 94 secs


With a correct implementation, you should be able to get $>$ 80% validation accuracy easily.

So what are some conclusions?
- The CNN works well, but it doesn't seem to offer a dramatic improvement over CBOW. Both saturate at around 80% validation accuracy.
- CNN was easier to train because we used known hyperparameter values and also a more robust optimization method (Adam + weight decay).

## Recurrent neural network (RNN) encoder

An RNN should always be thought of as a mapping $\textbf{RNN}_\theta: (h_{t-1}, x_t) \mapsto h_t$, which takes the **previous hidden state** and the **current input** to compute a new hidden state (the initial hidden state is set to zero, unless you want to condition it on some information). Thus we can think of applying an RNN to a sequence of word embeddings $x_1 \ldots x_T$ as inducing a sequence of hidden states $h_1 \ldots h_T$ by running it left-to-right. Note that it gets its previous output as input, hence the name "recurrent". In particular, the last hidden state $h_T$ can be viewed as the output of a very deep feedforward network in which every "layer" shares the same parameters (i.e., the RNN weights). An RNN is a natural way to handle variable-length input.
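
As a minimal sketch of this mapping, the snippet below unrolls a simple tanh RNN by hand and compares the result with PyTorch's `nn.RNN`. Note the assumptions: it uses a plain (Elman) RNN rather than the LSTM used later in this notebook, the sizes and random input are toy values, and `rnn_step` is a hypothetical helper name.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim_in, dim_hidden, T = 4, 5, 6
rnn = nn.RNN(dim_in, dim_hidden, batch_first=True)  # Simple tanh RNN, not an LSTM
x = torch.randn(1, T, dim_in)                       # (batch_size=1, T, dim_in)

def rnn_step(h_prev, x_t):
    """The mapping (h_{t-1}, x_t) -> h_t, using the same weights at every step."""
    return torch.tanh(x_t @ rnn.weight_ih_l0.T + rnn.bias_ih_l0
                      + h_prev @ rnn.weight_hh_l0.T + rnn.bias_hh_l0)

with torch.no_grad():
    h = torch.zeros(1, dim_hidden)  # Initial hidden state is zero
    manual = []
    for t in range(T):              # Left-to-right: each step consumes the previous state
        h = rnn_step(h, x[:, t])
        manual.append(h)
    manual = torch.stack(manual, dim=1)  # (1, T, dim_hidden)

    hiddens, final_h = rnn(x)            # (1, T, dim_hidden), (1, 1, dim_hidden)

print('manual unrolling matches nn.RNN:', torch.allclose(manual, hiddens, atol=1e-5))
print('last hidden state is final_h:  ', torch.allclose(hiddens[:, -1], final_h[0], atol=1e-5))
```

The loop also makes the cost visible: step $t$ cannot start until step $t-1$ has finished.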

One successful application of RNN is inducing *context-sensitive* word embeddings by running it left-to-right and right-to-left. Specifically, we compute
$$\begin{align*}
\textbf{RNN}_\theta(x_1 \ldots x_T) &= (h_1 \ldots h_T) \\
\textbf{RNN}_\phi(x_T \ldots x_1) &= (h'_1 \ldots h'_T)
\end{align*}$$
And use
$$
z_t = \mathrm{Concat}(h_t, h'_{T-t+1})
$$
as an embedding of the $t$-th token. Crucially, it's a function of all of its left tokens *and* all of its right tokens. We will use [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) as our choice of RNN (bidirectional LSTM, or **BiLSTM**), compute these context-sensitive word embeddings, and then average them to obtain our final sentence embedding. Because sentences in a batch have different lengths and an RNN processes them step by step, we use [packing variable-length sequences](https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch) so that the RNN skips the padding positions.


In [None]:
dim_word = 4
A = nn.Embedding(5, dim_word, padding_idx=0)
x = A(torch.LongTensor([[1, 2, 3, 4, 0, 0, 0],
                        [3, 3, 2, 2, 2, 1, 1]]))  # (batch_size, max_length, dim_word)
lengths = torch.LongTensor([4, 7])

dim_lstm = 20
num_layers = 1
bilstm = nn.LSTM(dim_word, dim_lstm, num_layers, bidirectional=True)  # By setting bidirectional, we're learning 2 LSTMs (forward and backward).

# Pack padded sequences for computational efficiency in RNNs.
packed = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)

# Run BiLSTM.
hiddens, (final_h, final_c) = bilstm(packed)

# Undo packing
hiddens, lengths = nn.utils.rnn.pad_packed_sequence(hiddens, batch_first=True)  # (batch_size, max_length, 2*dim_lstm)

# Left/right context-sensitive token representations
print('hiddens', hiddens.size())  # (batch_size, max_length, num_directions * dim_lstm)

# Final hidden/cell states of LSTM
print('final_h', final_h.size())  # (num_layers * num_directions, batch_size, dim_lstm)
print('final_c', final_c.size())  # (num_layers * num_directions, batch_size, dim_lstm)

hiddens torch.Size([2, 7, 40])
final_h torch.Size([2, 2, 20])
final_c torch.Size([2, 2, 20])
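
The following sketch checks the formula $z_t = \mathrm{Concat}(h_t, h'_{T-t+1})$ against PyTorch's bidirectional output. It copies the weights of a bidirectional LSTM into two separate unidirectional LSTMs (`fwd` and `bwd`, hypothetical names), runs one on the sequence and the other on the reversed sequence, and concatenates. The sizes and the random input are toy assumptions, and a single unpadded sequence is used so that packing is not needed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim_word, dim_lstm, T = 4, 3, 5
x = torch.randn(1, T, dim_word)  # (batch_size=1, T, dim_word), no padding

bilstm = nn.LSTM(dim_word, dim_lstm, batch_first=True, bidirectional=True)
fwd = nn.LSTM(dim_word, dim_lstm, batch_first=True)  # Will hold the left-to-right weights
bwd = nn.LSTM(dim_word, dim_lstm, batch_first=True)  # Will hold the right-to-left weights

with torch.no_grad():
    for name in ['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0']:
        getattr(fwd, name).copy_(getattr(bilstm, name))
        getattr(bwd, name).copy_(getattr(bilstm, name + '_reverse'))

    z, _ = bilstm(x)           # (1, T, 2 * dim_lstm)
    h, _ = fwd(x)              # h_1 ... h_T
    h_rev, _ = bwd(x.flip(1))  # h'_1 ... h'_T, reading x_T ... x_1

    # Flip the backward states back so that position t holds h'_{T-t+1}.
    z_manual = torch.cat([h, h_rev.flip(1)], dim=-1)

print('z matches Concat(h_t, h\'_{T-t+1}):', torch.allclose(z, z_manual, atol=1e-5))
```

So the first `dim_lstm` coordinates of each output vector summarize the left context and the last `dim_lstm` coordinates summarize the right context.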


In [None]:
class BiLSTMEncoder(nn.Module):

  def __init__(self, vocab_size, dim_word=300, dim_lstm=300, num_layers=1, drop=0.5):
    super().__init__()
    self.dim = 2 * dim_lstm # This is the final dimension of a sentence embedding.
    self.wemb = nn.Embedding(vocab_size, dim_word, padding_idx=0)  # Assumes padding token has index 0

    self.bilstm = nn.LSTM(dim_word, dim_lstm, num_layers, bidirectional=True)
    self.drop = nn.Dropout(drop)

  def forward(self, sents, lengths):
    # Embedding the words
    x = self.wemb(sents)  # (batch_size, max_length, dim_word)

    # Packing the word embeddings
    packed = nn.utils.rnn.pack_padded_sequence(x, lengths.cpu(), batch_first=True, enforce_sorted=False)

    # Running the BiLSTM
    hiddens, (final_h, final_c) = self.bilstm(packed)

    # Unpacking hidden states
    hiddens, output_lengths = nn.utils.rnn.pad_packed_sequence(hiddens, batch_first=True)

    sums = torch.sum(hiddens, dim=1)  # Sum over the time dimension
    avgs = sums / output_lengths.view(-1, 1).float().to(sums.device)  # Averaging over non-padding elements

    # Applying dropout
    embs = self.drop(avgs)  # (batch_size, self.dim)

    return embs
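
Why is it safe to sum over the whole time dimension and divide by the lengths? The sketch below (toy sizes and random inputs, all assumptions for illustration) checks two things: after unpacking, the hidden states at padding positions are exactly zero, and the resulting average equals the one obtained by running the BiLSTM on the unpadded sentence alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim_word, dim_lstm = 4, 3
x = torch.randn(2, 5, dim_word)      # (batch_size, max_length, dim_word)
lengths = torch.LongTensor([3, 5])   # First sentence has 2 padding positions
x[0, 3:] = 0.                        # Padding embeddings are zero (as with padding_idx=0)

bilstm = nn.LSTM(dim_word, dim_lstm, batch_first=True, bidirectional=True)

with torch.no_grad():
    packed = nn.utils.rnn.pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
    hiddens, _ = bilstm(packed)
    hiddens, out_lengths = nn.utils.rnn.pad_packed_sequence(hiddens, batch_first=True)

    # Same computation as in the encoder: sum over time, divide by the true length.
    avgs = hiddens.sum(dim=1) / out_lengths.view(-1, 1).float()

    # Reference: run the first sentence alone, without any padding.
    ref, _ = bilstm(x[:1, :3])
    ref_avg = ref.mean(dim=1)

print('padding positions are zero:', bool((hiddens[0, 3:] == 0).all()))
print('average ignores padding:   ', torch.allclose(avgs[0], ref_avg[0], atol=1e-5))
```

Without packing, the backward LSTM would start reading from the padding tokens, so the states at real positions would also change.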

Let's train it without further tuning, using a configuration similar to the one we used for the CNN.

In [None]:
if True:  # Set True to run. We will use this model later for error analysis.
  set_seed(42)
  model_bilstm = BinaryClassifier(BiLSTMEncoder(vocab_size))
  print('Model has {} parameters\n'.format(count_params(model_bilstm)))
  _ = train(model_bilstm, DataLoader(dataset_train, batch_size=64, shuffle=True),
            torch.optim.Adam(model_bilstm.parameters(), lr=0.001, weight_decay=0.0003), clip=1., num_epochs=30, verbose=True, device='cuda', select_model=True)

Model has 4445401 parameters

Epoch   1 | avg loss   0.5087 | train acc 74.25 | val acc 76.95
Epoch   2 | avg loss   0.3536 | train acc 84.33 | val acc 78.90
Epoch   3 | avg loss   0.2885 | train acc 88.03 | val acc 82.22
Epoch   4 | avg loss   0.2597 | train acc 89.47 | val acc 81.54
Epoch   5 | avg loss   0.2425 | train acc 90.25 | val acc 82.34
Epoch   6 | avg loss   0.2273 | train acc 91.07 | val acc 83.49
Epoch   7 | avg loss   0.2176 | train acc 91.43 | val acc 83.26
Epoch   8 | avg loss   0.2082 | train acc 91.83 | val acc 82.45
Epoch   9 | avg loss   0.2015 | train acc 92.19 | val acc 79.82
Stopping training because loss is NaN
Final avg loss   0.2015 | final train acc 92.19 | best val acc 83.49 | train time 135 secs


Again, with a correct implementation, you should be able to get $>$ 80% validation accuracy easily. It's clear that the BiLSTM encoder learns effective representations.

- **Pros**: Natural sequence handling. Function of the order of the entire sequence.
- **Cons**: Must compute previous states before computing next states. The recurrent nature makes it impossible to parallelize an RNN across time steps.

# Results

TODO: Fill in the table below with the *best* validation accuracy you could get for each type of encoder. Also record other associated quantities like number of parameters and final training loss.

| Encoder | Num Parameters | Final Avg Loss | Final Train Acc (%) | Best Val Acc (%) |
| :---:   | :---:          | :---:          | :---:               | :---:            |
| CBOW    | 644225         | 0.2540         | 90.00               | 80.62            |
| CNN     | 3360601        | 0.1605         | 93.75               | 82.00            |
| BiLSTM  | 4445401        | 0.2015         | 92.19               | 83.49            |
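
As a sanity check on the parameter counts, here is a back-of-the-envelope sketch in plain Python. It rests on two assumptions that are not stated in this section: a vocabulary of 10000 words, and a classifier head that is a single linear layer mapping the sentence embedding to one logit. The helper names are hypothetical. Under these assumptions, the arithmetic reproduces the reported CNN and BiLSTM counts exactly.

```python
def embedding_params(vocab_size, dim_word):
    return vocab_size * dim_word

def conv2d_params(in_channels, out_channels, kernel_height, kernel_width):
    # One weight tensor per filter plus one bias per filter.
    return out_channels * in_channels * kernel_height * kernel_width + out_channels

def lstm_params(dim_in, dim_hidden, num_directions=1):
    # 4 gates, each with input weights, recurrent weights, and two bias vectors.
    per_direction = 4 * (dim_in * dim_hidden + dim_hidden * dim_hidden + 2 * dim_hidden)
    return num_directions * per_direction

def linear_params(dim_in, dim_out):
    return dim_in * dim_out + dim_out

vocab_size = 10000  # ASSUMPTION: not stated in this section
dim_word = 300

cnn_total = (embedding_params(vocab_size, dim_word)
             + sum(conv2d_params(1, 100, width, dim_word) for width in [3, 4, 5])
             + linear_params(3 * 100, 1))  # ASSUMPTION: single linear scoring layer

bilstm_total = (embedding_params(vocab_size, dim_word)
                + lstm_params(dim_word, 300, num_directions=2)
                + linear_params(2 * 300, 1))  # ASSUMPTION: single linear scoring layer

print(cnn_total)     # 3360601
print(bilstm_total)  # 4445401
```

In both models, the word embeddings account for 3,000,000 of the parameters, so the encoders themselves are comparatively small.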

# Error Analysis

In [None]:
def analyze(model):
  wrongs = []
  rights = []
  model.eval()  # This turns off the training mode.
  sent_ind = 0
  with torch.no_grad():  # This deactivates autodiff for improved efficiency.
    for batch in dataloader_val:
      sents, labels, lengths = batch
      sents = sents.to(model.score.weight.device)  # Send data to same device that model is on.
      labels = labels.to(model.score.weight.device)
      lengths = lengths.to(model.score.weight.device)
      logits, _ = model(sents, lengths, labels)
      probs = torch.sigmoid(logits)
      preds = torch.where(logits > 0., 1, 0)  # 1 if p(1|x) > 0.5, 0 else
      corrects = (preds == labels)
      for i in range(corrects.size(0)):
        if corrects[i] == 0:
          wrongs.append((sent_ind, probs[i].item(), labels[i].item()))
        else:
          rights.append((sent_ind, probs[i].item(), labels[i].item()))
        sent_ind += 1
  return wrongs, rights

# We assume model_bilstm was trained in the RNN section above.
wrongs, rights = analyze(model_bilstm)
assert len(wrongs) + len(rights) == len(val_sents)
set_seed(42)
samples_right = random.sample(rights, 10)
samples_wrong = random.sample(wrongs, 10)
print('Model accuracy: {:2.2f}'.format(get_acc_val(model_bilstm)))

print('\nGot these right')
for sent_ind, prob, label in samples_right:
  print('{} (model positive prob {:5.5f}): {}'.format(label2str[label], prob, ' '.join([vocab[word_index] for word_index in val_sents[sent_ind]])))

print('\nGot these wrong')
for sent_ind, prob, label in samples_wrong:
  print('{} (model positive prob {:5.5f}): {}'.format(label2str[label], prob, ' '.join([vocab[word_index] for word_index in val_sents[sent_ind]])))

Model accuracy: 83.49

Got these right
NEGATIVE (model positive prob 0.38760): at once half-baked and <unk> .
NEGATIVE (model positive prob 0.39050): sacrifices the value of its <unk> of <unk> <unk> with its <unk> <unk> .
POSITIVE (model positive prob 0.99801): it 's an offbeat treat that pokes fun at the <unk> exercise while also <unk> its significance for those who take part .
NEGATIVE (model positive prob 0.01052): no way i can believe this load of junk .
POSITIVE (model positive prob 0.92785): a <unk> constructed , highly <unk> film , and an audacious return to form that can comfortably sit among <unk> godard 's finest work .
NEGATIVE (model positive prob 0.05721): basically a static series of <unk> ( and <unk> ) <unk> between the stars .
POSITIVE (model positive prob 0.88973): if you <unk> on david mamet 's mind tricks ... rent this movie and enjoy !
POSITIVE (model positive prob 0.90567): <unk> ... <unk> a lot of energy into his nicely nuanced narrative and <unk> himself with a c

It's a little difficult to tell with certainty, but generally the examples that the model gets correct seem "easier". They often have key phrases like *miserable*, *plot holes*, *formulaic* for negative and *delivers*, *thrilling* for positive, and there's no sentiment flipping. Indeed the model's prediction probabilities are very skewed (either close to 0 or 1) in these cases. In contrast, the examples that the model gets incorrect seem a bit more subtle (e.g., *a better title , for all concerned , might be swept under the rug .* as negative), and the model is less certain (probabilities more often around 0.5).
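
One way to check the impression that the model is less certain on its mistakes is to measure how far the predicted probabilities are from 0.5. Below is a minimal sketch; `mean_margin` is a hypothetical helper, and the toy records are made-up values in the same `(sent_ind, prob, label)` format that `analyze` returns, not actual model outputs.

```python
def mean_margin(records):
    """Average distance of the predicted positive probability from 0.5 (max is 0.5)."""
    return sum(abs(prob - 0.5) for _, prob, _ in records) / len(records)

# Made-up records for illustration only: (sent_ind, positive prob, gold label).
toy_rights = [(0, 0.99, 1), (1, 0.01, 0), (2, 0.93, 1)]
toy_wrongs = [(3, 0.45, 1), (4, 0.62, 0)]

print('mean margin (right): {:.3f}'.format(mean_margin(toy_rights)))
print('mean margin (wrong): {:.3f}'.format(mean_margin(toy_wrongs)))
```

Calling `mean_margin(rights)` and `mean_margin(wrongs)` on the real outputs of `analyze` would give a single number per group to compare.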