In [1]:
# Please do not change this cell because some hidden tests might depend on it.
import os

# Otter grader does not handle ! commands well, so we define and use our
# own function to execute shell commands.
def shell(commands, warn=True):
    """Executes the string `commands` as a sequence of shell commands.
     
       Prints the result to stdout and returns the exit status. 
       Provides a printed warning on non-zero exit status unless `warn` 
       flag is unset.
    """
    file = os.popen(commands)
    print (file.read().rstrip('\n'))
    exit_status = file.close()
    if warn and exit_status != None:
        print(f"Completed with errors. Exit status: {exit_status}\n")
    return exit_status

shell("""
ls requirements.txt >/dev/null 2>&1
if [ ! $? = 0 ]; then
 rm -rf .tmp
 git clone https://github.com/cs236299-2023-spring/project1.git .tmp
 mv .tmp/requirements.txt ./
 rm -rf .tmp
fi
pip install -q -r requirements.txt
""")




In [2]:
# Initialize Otter
import otter
grader = otter.Notebook()

$$
\renewcommand{\vect}[1]{\mathbf{#1}}
\renewcommand{\cnt}[1]{\sharp(#1)}
\renewcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\renewcommand{\softmax}{\operatorname{softmax}}
\renewcommand{\Prob}{\Pr}
\renewcommand{\given}{\,|\,}
$$

# Course 236299

## Project segment 1: Text classification

In this project segment you will build several varieties of text classifiers using PyTorch.

1. A majority baseline.
2. A naive Bayes classifer.
3. A logistic regression (single-layer perceptron) classifier.
4. A multilayer perceptron classifier.

# Preparation

In [3]:
!pip install wget

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... [?25l[?25hdone
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9657 sha256=c7eb6515807562fbe731aacacf7901dd31b6ba07fe7262b482709595f6b87548
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [4]:
import copy
import re
import wget
import csv
import torch
import torch.nn as nn
import datasets

from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers import normalizers
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast
from collections import Counter
from torch import optim
from tqdm.auto import tqdm

In [5]:
# Random seed
random_seed = 1234
torch.manual_seed(random_seed)

## GPU check
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


# The task: Answer types for ATIS queries

For this and future project segments, you will be working with a standard natural-language-processing dataset, the [ATIS (Airline Travel Information System) dataset](https://www.kaggle.com/siddhadev/atis-dataset-from-ms-cntk). This dataset is composed of queries about flights – their dates, times, locations, airlines, and the like.

Over the years, the dataset has been annotated in all kinds of ways, with parts of speech, informational chunks, parse trees, and even corresponding SQL database queries. You'll use various of these annotations in future assignments. For this project segment, however, you'll pursue an easier classification task: **given a query, predict the answer type**.

These queries ask for different types of answers, such as

* Flight IDs: "Show me the flights from Washington to Boston"
* Fares: "How much is the cheapest flight to Milwaukee"
* City names: "Where does flight 100 fly to?"

In all, there are some 30 answer types to the queries.

Below is an example taken from this dataset:

_Query:_

```
show me the afternoon flights from washington to boston
```

_SQL:_

```
SELECT DISTINCT flight_1.flight_id FROM flight flight_1 , airport_service airport_service_1 , city city_1 , airport_service airport_service_2 , city city_2 
   WHERE flight_1.departure_time BETWEEN 1200 AND 1800 
     AND ( flight_1.from_airport = airport_service_1.airport_code 
           AND airport_service_1.city_code = city_1.city_code 
           AND city_1.city_name = 'WASHINGTON' 
           AND flight_1.to_airport = airport_service_2.airport_code 
           AND airport_service_2.city_code = city_2.city_code 
           AND city_2.city_name = 'BOSTON' )
```

In this project segment, we will consider the answer type for a natural-language query to be the target field of the corresponding SQL query. For the above example, the answer type would be *flight_id*.

## Loading and preprocessing the data

> Read over this section, executing the cells, and **making sure you understand what's going on before proceeding to the next parts.**

First, let's download the dataset.

In [6]:
data_dir = "https://raw.githubusercontent.com/nlp-236299/data/master/ATIS/"
os.makedirs('data', exist_ok=True)
for split in ['train', 'dev', 'test']:
    wget.download(f"{data_dir}/{split}.nl", out='data/')
    wget.download(f"{data_dir}/{split}.sql", out='data/')

Next, we process the dataset by extracting answer types from SQL queries and saving in CSV format.

In [7]:
def get_label_from_query(query):
    """Returns the answer type from `query` by dead reckoning.
    It's basically the second or third token in the SQL query.
    """    
    match = re.match(r'\s*SELECT\s+(DISTINCT\s*)?(\w+\.)?(?P<label>\w+)', query)
    if match:
        label = match.group('label')
    else:
        raise RuntimeError(f'no label in query {query}')
    return label

for split in ['train', 'dev', 'test']:
    sql_file = f'data/{split}.sql'
    nl_file = f'data/{split}.nl'
    out_file = f'data/{split}.csv'
    
    with open(nl_file) as f_nl:
        with open(sql_file) as f_sql:
            with open(out_file, 'w') as fout:
                writer = csv.writer(fout)
                writer.writerow(('label','text'))
                for text, sql in zip(f_nl, f_sql):
                    text = text.strip()
                    sql = sql.strip()
                    label = get_label_from_query(sql)
                    writer.writerow((label, text))

Let's take a look at what the data file looks like.

In [8]:
shell('head "data/train.csv"')

label,text
flight_id,list all the flights that arrive at general mitchell international from various cities
flight_id,give me the flights leaving denver august ninth coming back to boston
flight_id,what flights from tacoma to orlando on saturday
fare_id,what is the most expensive one way fare from boston to atlanta on american airlines
flight_id,what flights return from denver to philadelphia on a saturday
flight_id,can you list all flights from chicago to milwaukee
flight_id,show me the flights from denver that go to pittsburgh and then atlanta
flight_id,i'd like to see flights from baltimore to atlanta that arrive before noon and i'd like to see flights from denver to atlanta that arrive before noon
flight_id,do you have an 819 flight from denver to san francisco


We use `datasets` to prepare the data, as in lab 1-5. More information on `datasets` can be found at [https://huggingface.co/docs/datasets/loading](https://huggingface.co/docs/datasets/loading).

In [9]:
atis = load_dataset('csv', data_files={'train':'data/train.csv', \
                                       'val': 'data/dev.csv', \
                                       'test': 'data/test.csv'})

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-771c775734d60f6c/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating val split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-771c775734d60f6c/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [10]:
atis

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 4379
    })
    val: Dataset({
        features: ['label', 'text'],
        num_rows: 491
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 448
    })
})

In [11]:
train_data = atis['train']
val_data = atis['val']
test_data = atis['test']

train_data.shuffle(seed=random_seed)

Dataset({
    features: ['label', 'text'],
    num_rows: 4379
})

We build a tokenizer from the training data to tokenize text and convert tokens into word ids.

In [12]:
MIN_FREQ = 3 # words appearing fewer than 3 times are treated as 'unknown'
unk_token = '[UNK]'
pad_token = '[PAD]'

tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.normalizer = normalizers.Lowercase()

trainer = WordLevelTrainer(min_frequency=MIN_FREQ, special_tokens=[pad_token, unk_token])
tokenizer.train_from_iterator(train_data['text'], trainer=trainer)

We use `datasets.Dataset.map` to convert text into word ids. As shown in lab 1-5, first we need to wrap `tokenizer` with the `transformers.PreTrainedTokenizerFast` class to be compatible with the `datasets` library.

In [13]:
hf_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer, pad_token=pad_token, unk_token=unk_token)

In [14]:
def encode(example):
    return hf_tokenizer(example['text'])

train_data = train_data.map(encode)
val_data = val_data.map(encode)
test_data = test_data.map(encode)

Map:   0%|          | 0/4379 [00:00<?, ? examples/s]

Map:   0%|          | 0/491 [00:00<?, ? examples/s]

Map:   0%|          | 0/448 [00:00<?, ? examples/s]

We also need to convert label strings into label ids.

In [15]:
# Add a new column `label_id`
train_data = train_data.add_column('label_id', train_data['label'])
val_data = val_data.add_column('label_id', val_data['label'])
test_data = test_data.add_column('label_id', test_data['label'])

# Convert feature `label_id` from strings to integer ids
train_data = train_data.class_encode_column('label_id')

# Use the label vocabulary on training data to convert val and test sets
label2id = train_data.features['label_id']._str2int
val_data = val_data.class_encode_column('label_id')
val_data = val_data.align_labels_with_mapping(label2id, "label_id")
test_data = test_data.class_encode_column('label_id')
test_data = test_data.align_labels_with_mapping(label2id, "label_id")

Casting to class labels:   0%|          | 0/4379 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/491 [00:00<?, ? examples/s]

Aligning the labels:   0%|          | 0/491 [00:00<?, ? examples/s]

Casting to class labels:   0%|          | 0/448 [00:00<?, ? examples/s]

Aligning the labels:   0%|          | 0/448 [00:00<?, ? examples/s]

In [16]:
# Compute size of vocabulary
text_vocab = tokenizer.get_vocab()
label_vocab = train_data.features['label_id']._str2int
vocab_size = len(text_vocab)
num_labels = len(label_vocab)
print(f"Size of vocab: {vocab_size}")
print(f"Number of labels: {num_labels}")

Size of vocab: 514
Number of labels: 30


To get a sense of the kinds of things that are asked about in this dataset, here is the list of all of the answer types in the training data.

In [17]:
for label in label_vocab:
    print(f"{label_vocab[label]:2d} {label}") 

 0 advance_purchase
 1 aircraft_code
 2 airline_code
 3 airport_code
 4 airport_location
 5 arrival_time
 6 basic_type
 7 booking_class
 8 city_code
 9 city_name
10 count
11 day_name
12 departure_time
13 fare_basis_code
14 fare_id
15 flight_id
16 flight_number
17 ground_fare
18 meal_code
19 meal_description
20 miles_distant
21 minimum_connect_time
22 minutes_distant
23 restriction_code
24 state_code
25 stop_airport
26 stops
27 time_elapsed
28 time_zone_code
29 transport_type


## Handling unknown words

Note that we mapped words appearing fewer than 3 times to a special _unknown_ token (we're using `[UNK]`) for two reasons: 

1. Due to the scarcity of such rare words in training data, we might not be able to learn generalizable conclusions about them.
2. Introducing an unknown token allows us to deal with out-of-vocabulary words in the test data as well: we just map those words to `[UNK]`.

In [18]:
print (f"Unknown token: {unk_token}")
unk_index = text_vocab[unk_token]
print (f"Unknown token id: {unk_index}")

# UNK example
example_unk_token = 'IAmAnUnknownWordForSure'
print (f"An unknown token: {example_unk_token}")
print (f"Mapped back to word id: {hf_tokenizer(example_unk_token).input_ids}")
print (f"Mapped to [UNK]'s?: {all([id == unk_index for id in hf_tokenizer(example_unk_token).input_ids])}")

Unknown token: [UNK]
Unknown token id: 1
An unknown token: IAmAnUnknownWordForSure
Mapped back to word id: [1]
Mapped to [UNK]'s?: True


To facilitate batching sentences of different lengths into the same tensor as we'll see later, we also reserved a special padding symbol `[PAD]`.

In [19]:
print (f"Padding token: {pad_token}")
pad_index = text_vocab[pad_token]
print (f"Padding token id: {pad_index}")

Padding token: [PAD]
Padding token id: 0


## Batching the data

To load data in batches, we use `torch.utils.data.DataLoader`. This enables us to iterate over the dataset under a given `BATCH_SIZE` which specifies how many examples we want to process at a time.

In [20]:
BATCH_SIZE = 32

# Defines how to batch a list of examples together
def collate_fn(examples):
    batch = {}
    bsz = len(examples)
    label_ids = []
    for example in examples:
        label_ids.append(example['label_id'])
    label_batch = torch.LongTensor(label_ids).to(device)
    input_ids = []
    for example in examples:
        input_ids.append(example['input_ids'])
    max_length = max([len(word_ids) for word_ids in input_ids])
    text_batch = torch.zeros(bsz, max_length).long().fill_(pad_index).to(device)
    for b in range(bsz):
        text_batch[b][:len(input_ids[b])] = torch.LongTensor(input_ids[b]).to(device)
    
    batch['label_ids'] = label_batch
    batch['input_ids'] = text_batch
    return batch

train_iter = torch.utils.data.DataLoader(train_data, 
                                         batch_size=BATCH_SIZE,
                                         collate_fn=collate_fn)
val_iter = torch.utils.data.DataLoader(val_data, 
                                       batch_size=BATCH_SIZE, 
                                       collate_fn=collate_fn)
test_iter = torch.utils.data.DataLoader(test_data, 
                                        batch_size=BATCH_SIZE, 
                                        collate_fn=collate_fn)

Let's look at a single batch from one of these iterators.

In [21]:
batch = next(iter(train_iter))
text = batch['input_ids']
print (f"Size of text batch: {text.size()}")
print (f"Third sentence in batch: {text[2]}")
print (f"Mapped back to string: {hf_tokenizer.decode(text[2])}")
print (f"Mapped back to string skipping padding: {hf_tokenizer.decode(text[2], skip_special_tokens=True)}")

label = batch['label_ids']
label_vocab_itos = train_data.features['label_id']._int2str # map from label ids to strs
print (f"Size of label batch: {label.size()}")
print (f"Third label in batch: {label[2]}")
print (f"Mapped back to string: {label_vocab_itos[label[2].item()]}")

Size of text batch: torch.Size([32, 31])
Third sentence in batch: tensor([  7,   4,   3, 180,   2, 114,   6, 119,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0], device='cuda:0')
Mapped back to string: what flights from tacoma to orlando on saturday [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]
Mapped back to string skipping padding: what flights from tacoma to orlando on saturday
Size of label batch: torch.Size([32])
Third label in batch: 15
Mapped back to string: flight_id


You might notice some padding tokens `[PAD]` when we convert word ids back to strings, or equivalently, padding ids `0` in the corresponding tensor. The reason why we need such padding is because the sentences in a batch might be of different lengths, and to save them in a 2D tensor for parallel processing, sentences that are shorter than the longest sentence need to be padded with some placeholder values. Later during training you'll need to make sure that the paddings do not affect the final results.

Alternatively, we can also directly iterate over the individual examples in `train_data`, `val_data` and `test_data`. Here the returned values are the raw sentences and labels instead of their corresponding ids, and you might need to explicitly deal with the unknown words, unlike using bucket iterators which automatically map unknown words to an unknown word id.

In [22]:
for _, example in zip(range(5), train_data):
  print(f"{example['label']:10} -- {example['text']}")

flight_id  -- list all the flights that arrive at general mitchell international from various cities
flight_id  -- give me the flights leaving denver august ninth coming back to boston
flight_id  -- what flights from tacoma to orlando on saturday
fare_id    -- what is the most expensive one way fare from boston to atlanta on american airlines
flight_id  -- what flights return from denver to philadelphia on a saturday


## Notations used

In this project segment, we'll use the following notations.

* Sequences of elements (vectors and the like) are written with angle brackets and commas ($\langle w_1, \ldots, w_M \rangle$) or directly with no punctuation ($w_1 \cdots w_M$).
* Sets are notated similarly but with braces, ($\{ v_1, \ldots, v_V \}$).
* Maximum indices ($M$, $N$, $V$, $T$, and $X$ in the following) are written as uppercase italics.
* Variables over sequences and sets are written in boldface ($\vect{w}$), typically with the same letter as the variables over their elements.

In particular,

* $\vect{w} = w_1 \cdots w_M$: A text to be classified, each element $w_j$ being a word token.
* $\vect{v} = \{ v_1, \ldots, v_V\}$: A vocabulary, each element $v_k$ being a word type.
* $\vect{x} = \langle x_1, \ldots, x_X \rangle$: Input features to a model.
* $\vect{y} = \{ y_1, \ldots, y_N \}$: The output classes of a model, each element $y_i$ being a class label.
* $\vect{T} = \langle \vect{w}^{(1)}, \ldots, \vect{w}^{(T)} \rangle$: The training corpus of texts.
* $\vect{Y} = \langle y^{(1)}, \ldots, y^{(T)} \rangle$: The corresponding gold labels for the training examples in $T$.

# To Do: Establish a majority baseline

A simple baseline for classification tasks is to always predict the most common class. 
Given a training set of texts $\vect{T}$ labeled by classes $\vect{Y}$, we classify an input text $\vect{w} = w_1 \cdots w_M$ as the class $y_i$ that occurs most frequently in the training data, that is, specified by

$$ \argmax{i} \cnt{y_i} $$

and thus ignoring the input entirely (!).

**Implement the majority baseline and compute test accuracy using the starter code below.** For this baseline, and for the naive Bayes classifier later, we don't need to use the validation set since we don't tune any hyper-parameters.

In [23]:
# TODO
def majority_baseline_accuracy(train_data, test_data):
  """Returns the most common label in the training set, and the accuracy of
     the majority baseline on the test set.
  """
  train_c = Counter(train_data['label'])
  test_c = Counter(test_data['label'])
  most_common_label, num_ocurrences = train_c.most_common(1)[0]
  test_accuracy = test_c[most_common_label] / len(test_data['label'])
  return most_common_label, test_accuracy

How well does your classifier work? Let's see:

In [24]:
# Call the method to establish a baseline
most_common_label, test_accuracy = majority_baseline_accuracy(train_data, test_data)

print(f'Most common label: {most_common_label}\n'
      f'Test accuracy:     {test_accuracy:.3f}')

Most common label: flight_id
Test accuracy:     0.683


# To Do: Implement a Naive Bayes classifier


## Review of the naive Bayes method

Recall from lab 1-3 that the Naive Bayes classification method classifies a text $\vect{w} = \langle w_1, w_2, \ldots, w_M \rangle$ as the class $y_i$ given by the following maximization:

$$
\argmax{i} \Prob(y_i \given \vect{w}) \approx \argmax{i} \Prob(y_i) \cdot \prod_{j=1}^M \Prob(w_j \given y_i)
$$

or equivalently (since taking the log is monotonic)

\begin{align}
\argmax{i} \Prob(y_i \given \vect{w}) &= \argmax{i} \log\Prob(y_i \given \vect{w}) \\
&\approx \argmax{i} \left(\log\Prob(y_i) + \sum_{j=1}^M \log\Prob(w_j \given y_i)\right)
\end{align}

All we need, then, to apply the Naive Bayes classification method is values for the various log probabilities: the priors $\log\Prob(y_i)$ and the likelihoods $\log\Prob(w_j \given y_i)$, for each feature (word) $w_j$ and each class $y_i$.

We can estimate the prior probabilities $\Prob(y_i)$ by examining the empirical probability in the training set. That is, we estimate 

$$ \Prob(y_i) \approx \frac{\cnt{y_i}}{\sum_j \cnt{y_j}} $$

We can estimate the likelihood probabilities $\Prob(w_j \given y_i)$ similarly by examining the empirical probability in the training set. That is, we estimate 

$$ \Prob(w_j \given y_i) \approx \frac{\cnt{w_j, y_i}}{\sum_{j'} \cnt{w_{j'}, y_i}} $$

To allow for cases in which the count $\cnt{w_j, y_i}$ is zero, we can use a modified estimate incorporating add-$\delta$ smoothing:

$$ \Prob(w_j \given y_i) \approx \frac{\cnt{w_j, y_i} + \delta}{\sum_{j'} \cnt{w_{j'}, y_i} + \delta \cdot V} $$

## Two conceptions of the naive Bayes method implementation

We can store all of these parameters in different ways, leading to two different implementation conceptions. We review two conceptions of implementing the naive Bayes classification of a text $\vect{w} = \langle w_1, w_2, \ldots, w_M \rangle$, corresponding to using different representations of the input $\vect{x}$ to the model: the index representation and the bag-of-words representation. 

Within each conception, the parameters of the model will be stored in one or more matrices. The conception dictates what operations will be performed with these matrices.

### Using the index representation

In the first conception, we take the input elements $\vect{x} = \langle x_1, x_2, \ldots, x_M \rangle$ to be the _vocabulary indices_ of the words $\vect{w} = w_1 \cdots w_M$. That is, each word token $w_i$ is of the word type in the vocabulary $\vect{v}$ at index $x_i$, so 

$$ v_{x_i} = w_i $$

In this representation, the input vector has the same length as the word sequence.

We think of the likelihood probabilities as forming a matrix, call it $\vect{L}$, where the $i,j$-th element stores $\log \Prob(v_j \given y_i)$. 

$$\vect{L}_{ij} = \log\Prob(v_j \given y_i)$$

Similarly, for the priors, we'll have 

$$\vect{P}_{i} = \log\Prob(y_i)$$

Now the maximization can be implemented as 

\begin{align}
\argmax{i} \log\Prob(y_i) + \sum_{j=1}^M \log\Prob(w_j \given y_i)
&= \argmax{i} \vect{P}_i + \sum_{j=1}^M \vect{L}_{i, x_j}
\end{align}

Implemented in this way, we see that the use of each input $x_i$ is as an _index_ into the likelihood matrix. 

### Using the bag-of-words representation

<img src="https://github.com/nlp-course/data/raw/master/Resources/naive-bayes-figure.png" width=400 align=right />

Notice that since each word in the input is treated separately, the order of the words doesn't matter. Rather, all that matters is how frequently each word type occurs in a text. Consequently, we can use the bag-of-words representation introduced in lab 1-1.

Recall that the bag-of-words representation of a text is just its frequency distribution over the vocabulary, which we will notate $bow(\vect{w})$. Given a vocabulary of word types $\vect{v} = \langle v_1, v_2, \ldots, v_V \rangle$, the representation of a sentence $\vect{w} = \langle w_1, w_2, \ldots, w_M \rangle$ is a vector $\vect{x}$ of size $V$, where 

$$\begin{aligned}
bow(\vect{w})_j &= \sum_{i=1}^M 1[w_i = v_j] & \mbox{for $1 \leq j \leq V$}
\end{aligned}$$

We write $1[w_i = v_j]$ to indicate 1 if $w_i = v_j$ and 0 otherwise. For convenience, we'll add an extra $(V+1)$-st element to the end of the bag-of-words vector, a single $1$ whose use will be clear shortly. That is,

$$bow(\vect{w})_{V+1} = 1$$

Under this conception, then, we'll take the input $\vect{x}$ to be $bow(\vect{w})$. Instead of the input having the same length as the text, it has the same length as the vocabulary.

As described in lecture, represented in this way, the quantity to be maximized in the naive Bayes method

$$\log\Prob(y_i) + \sum_{j=1}^M \log\Prob(w_j \given y_i)$$

can be calculated as 

$$\log\Prob(y_i) + \sum_{j=1}^V x_j \cdot \log\Prob(v_j \given y_i)$$

which is just $\vect{U} \vect{x}$ for a suitable choice of $N \times (V+1)$ matrix $\vect{U}$, namely

$$ \vect{U}_{ij} = \left\{
    \begin{array}{ll}
        \log \Prob(v_j \given y_i) & \mbox{$1 \leq i \leq N$ and $1 \leq j \leq V$} \\
        \log \Prob(y_i) & \mbox{$1 \leq i \leq N$ and $j = V+1$} 
    \end{array} \right.
$$

Under this implementation conception, we've reduced naive Bayes calculations to a single matrix operation. This conception is depicted in the figure at right.

You are free to use either conception in your implementation of naive Bayes.

## Implement the naive Bayes classifier
 
For the implementation, we ask you to implement a Python class `NaiveBayes` that will have (at least) the following three methods:

1. `__init__`: An initializer that takes `text_vocab`, `label_vocab`, and `pad_index` as inputs.

2. `train`: A method that takes a training data iterator and estimates all of the log probabilities $\log\Prob(y_i)$ and $\log\Prob(v_j \given y_i)$ as described above. Perform add-$\delta$ smoothing with $\delta=1$. These parameters will be used by the `evaluate` method to evaluate a test dataset for accuracy, so you'll want to store them in some data structures in objects of the class.

3. `evaluate`: A method that takes a test data iterator and evaluates the accuracy of the trained model on the test set.

You can organize your code using either of the conceptions of Naive Bayes described above.

You should expect to achieve about an **86% test accuracy** on the ATIS task.

In [25]:
def bag_of_words(vocab_size, seq):
  bow = torch.zeros(vocab_size + 1)
  seq_c = Counter(seq)
  for k , v in seq_c.items():
    bow[k] = v
  bow[1] = 0
  bow[vocab_size] = 1
  return bow

In [26]:
class NaiveBayes():
  def __init__ (self, text_vocab, label_vocab, pad_index):
    self.V = len(text_vocab) # vocabulary size
    self.N = len(label_vocab) # the number of classes
    # TODO: Add your code here
    self.pad_index = pad_index
    self.text_vocab = text_vocab
    self.label_vocab = label_vocab
    self.delta = 1    
  
  def train(self, iterator):
    """Calculates and stores log probabilities for training dataset `iterator`."""
    # TODO: Implement this method.
    # Initialize the U tensor as a matrix of zeros of shape (V+1, N), where V is the vocabulary size and N is the number of labels in the dataset.
    self.U = torch.zeros([self.V + 1, self.N])

    # calculate probability for each label
    labels_count = torch.tensor(Counter(iterator.dataset['label_id']).most_common())  # count the labels
    labels_count = labels_count[labels_count[:, 0].argsort()]
    labels_sum = sum(labels_count[:, 1])
    self.U[-1] = labels_count[:, 1].float() / labels_sum  # calculate the probability for each label

    # calculate probability of a word given label
    for label, text_indices in zip(iterator.dataset['label_id'], iterator.dataset['input_ids']):
        new_count = torch.bincount(torch.Tensor(text_indices).to(int), minlength=self.V)  # count the frequency of each word in the text
        self.U.T[label][:-1] += new_count  # calculate probability of a word given label

    self.U += self.delta  # add delta smoothing to all labels

    for label in range(self.N):
        label_sum = torch.sum(self.U.T[label][:-1])  # calculate the sum of probabilities for each word in the label
        self.U.T[label][:-1] /= label_sum  # divide each probability by the sum to get the probability
    
    self.U = torch.log(self.U) #calculate the log of the prob

  def evaluate(self, iterator):
    """Returns the model's accuracy on a given dataset `iterator`."""
    # TODO: Implement this method.
    acc = 0
    for label, text_indices in zip(iterator.dataset['label_id'], iterator.dataset['input_ids']):
      bow = bag_of_words(vocab_size=self.V, seq=text_indices)
      y_pred = torch.argmax(bow @ self.U)
      y_true = label
      acc += y_pred==y_true
    return acc / len(iterator.dataset) * 100

In [27]:
# Instantiate and train classifier
nb_classifier = NaiveBayes(text_vocab, label_vocab, pad_index)
nb_classifier.train(train_iter)

# Evaluate model performance
print(f'Training accuracy: {nb_classifier.evaluate(train_iter):.3f}\n'
      f'Test accuracy:     {nb_classifier.evaluate(test_iter):.3f}')

Training accuracy: 87.052
Test accuracy:     86.384


# To Do: Implement a logistic regression classifier

In this part, you'll complete a PyTorch implementation of a logistic regression (equivalently, a single layer perceptron) classifier. We review logistic regression here highlighting the similarities to the matrix-multiplication conception of naive Bayes. Thus, we take the input $\vect{x}$ to be the bag-of-words representation $bow(\vect{w})$. But as before you are free to use either implementation approach.

## Review of logistic regression

Similar to naive Bayes, in logistic regression, we assign a probability to a text $\vect{x}$ by merely multiplying an $N \times V$ matrix $\vect{U}$ by it. However, we don't stipulate that the values in the matrix $\vect{U}$ be estimated from the training corpus in the "naive Bayes" manner. Instead, we allow them to take on any value, using a training regime to select good values.

In order to make sure that the output of the matrix multiplication $\vect{U}\vect{x}$ is mapped onto a probability distribution, we apply a nonlinear function to renormalize the values. We use the softmax function, a generalization of the sigmoid function from lab 1-4, defined by 

$$\softmax(\vect{z})_i = \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)}$$

for each of the indices $i$ from $1$ to $N$.

In summary, we model $\Prob (y \given \vect{x})$ as

$$ \Prob(y_i \given \vect{x}) = \softmax ( \vect{U} \vect{x} )_i $$

<img src="https://github.com/nlp-course/data/raw/master/Resources/logistic-regression-figure.png" alt="logistic regression illustration" width="400"  align=right />

The calculation of $\Prob(y \given \vect{x})$ for each text $\vect{x}$ is referred to as the _forward_ computation. In summary, the forward computation for logistic regression involves a linear calculation ($\vect{U} \vect{x}$) followed by a nonlinear calculation ($\softmax$). We think of the perceptron (and more generally many of these neural network models) as transforming from one representation to another. A perceptron performs a linear transformation from the index or bag-of-words representation of the text to a representation as a vector, followed by a nonlinear transformation, a softmax or sigmoid, giving a representation as a probability distribution over the class labels. This single-layer perceptron thus involves two _sublayers_. (In the next part of the project segment, you'll experiment with a multilayer perceptron, with two perceptron layers, and hence four sublayers.)

The loss function you'll use is the negative log probability $-\log \Prob (y \given \vect{x})$. The negative is used, since it is convention to minimize loss, whereas we want to maximize log likelihood. 

The forward and loss computations are illustrated in the figure at right. In practice, for numerical stability reasons, PyTorch absorbs the softmax operation into the loss function `nn.CrossEntropyLoss`. That is, the input to the `nn.CrossEntropyLoss` function is the vector of sums $\vect{U} \vect{x}$ (the last step in the box marked "your job" in the figure) rather than the vector of probabilities $\Prob(y \given \vect{x})$. That makes things easier for you (!), since you're responsible only for the first sublayer.

Given a forward computation, the weights can then be adjusted by taking a step opposite to the gradient of the loss function. Adjusting the weights in this way is referred to as the _backward_ computation. Fortunately, `torch` takes care of the backward computation for you, just as in lab 1-5.

The optimization process of performing the forward computation, calculating the loss, and performing the backward computation to improve the weights is done repeatedly until the process converges on a (hopefully) good set of weights. You'll find this optimization process in the `train_all` method that we've provided. The trained weights can then be used to perform classification on a test set. See the `evaluate` method.

## Implement the logistic regression classifier

For the implementation, we ask you to implement a logistic regression classifier as a subclass of [`torch.nn.Module`](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module). You need to implement the following methods:

1. `__init__`: an initializer that takes `text_vocab`, `label_vocab`, and `pad_index` as inputs.

    During initialization, you'll want to define a [tensor](https://pytorch.org/docs/stable/tensors.html#torch-tensor) of weights, wrapped in [`torch.nn.Parameter`](https://pytorch.org/docs/master/generated/torch.nn.parameter.Parameter.html#torch.nn.parameter.Parameter), [initialized randomly](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.uniform_), which plays the role of $\vect{U}$. The elements of this tensor are the parameters of the `torch.nn` instance in the following special technical sense: It is the parameters of the module whose gradients will be calculated and whose values will be updated. Alternatively, **you might find it easier** to use the [`nn.Embedding` module](https://pytorch.org/docs/master/generated/torch.nn.Embedding.html) which is a wrapper to the weight tensor with a lookup implementation.

2. `forward`: given a text batch of size `batch_size X max_length`, return a tensor of logits of size `batch_size X num_labels`. That is, for each text $\vect{x}$ in the batch and each label $y$, you'll be calculating $\vect{U}\vect{x}$ as shown in the figure, returning a tensor of these values. Note that the softmax operation is absorbed into [`nn.CrossEntropyLoss`](https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html) so you won't need to deal with that.

3. `train_all`: A method that performs training. You might find lab 1-5 useful.

4. `evaluate`: A method that takes a test data iterator and evaluates the accuracy of the trained model on the test set.

Some things to consider:

1. The parameters of the model, the weights, need to be initialized properly. We suggest initializing them to some small random values. See [`torch.uniform_`](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.uniform_).

2. You'll want to make sure that padding tokens are handled properly. What should the weight be for the padding token?

3. In extracting the proper weights to sum up, based on the word types in a sentence, we are essentially doing a lookup operation. You might find [`nn.Embedding`](https://pytorch.org/docs/master/generated/torch.nn.Embedding.html) or [`torch.gather`](https://pytorch.org/docs/stable/generated/torch.gather.html#torch-gather) useful.

You should expect to achieve about **90%** accuracy on the ATIS classificiation task. 

In [28]:
class LogisticRegression(nn.Module):
  def __init__ (self, text_vocab, label_vocab, pad_index):
    super().__init__()
    self.pad_index = pad_index
    # Keep the vocabulary sizes available
    self.N = len(label_vocab) # num_classes
    self.V = len(text_vocab)  # vocab_size
    self.label_vocab = label_vocab
    # Specify cross-entropy loss for optimization
    self.criterion = nn.CrossEntropyLoss()
    # TODO: Create and initialize a tensor for the weights,
    #       or create an nn.Embedding module and initialize
    U = (torch.Tensor(self.V, self.N)).uniform_(0,0.2)
    U[:,pad_index] = 0
    self.U = torch.nn.Parameter(U)

  def forward(self, text_batch):
    # TODO: Calculate the logits (Ux) for the `text_batch`, 
    #       returning a tensor of size batch_size x num_labels
    Ux = torch.nn.functional.embedding(text_batch, self.U, padding_idx=self.pad_index).sum(dim=1)
    return Ux

  def train_all(self, train_iter, val_iter, epochs=8, learning_rate=3e-3):
    # Switch the module to training mode
    self.train()
    # Use Adam to optimize the parameters
    optim = torch.optim.Adam(self.parameters(), lr=learning_rate)
    best_validation_accuracy = -float('inf')
    best_model = None
    # Run the optimization for multiple epochs
    with tqdm(range(epochs), desc='train', position=0) as pbar:
      for epoch in pbar:
        c_num = 0
        total = 0
        running_loss = 0.0

        for batch in tqdm(train_iter, desc='batch', leave=False):
          # TODO: set labels, compute logits (Ux in this model), 
          #       loss, and update parameters
          optim.zero_grad()
          labels = batch['label_ids']
          logits = self.forward(batch['input_ids'])
          loss = self.criterion(logits.float(), labels)
          loss.backward()
          optim.step()

          # Prepare to compute the accuracy
          predictions = torch.argmax(logits, dim=1)
          total += predictions.size(0)
          c_num += (predictions == labels).float().sum().item()        
          running_loss += loss.item() * predictions.size(0)

        # Evaluate and track improvements on the validation dataset
        validation_accuracy = self.evaluate(val_iter)
        if validation_accuracy > best_validation_accuracy:
          best_validation_accuracy = validation_accuracy
          self.best_model = copy.deepcopy(self.state_dict())
        epoch_loss = running_loss / total
        epoch_acc = c_num / total
        pbar.set_postfix(epoch=epoch+1, loss=epoch_loss, train_acc = epoch_acc, val_acc=validation_accuracy)

  def evaluate(self, iterator):
    """Returns the model's accuracy on a given dataset `iterator`."""
    self.eval()   # switch the module to evaluation mode
    # TODO: Compute accuracy
    total_ = 0
    correct = 0 

    for batch in iterator:
      b_labels = batch['label_ids']
      b_logits = self.forward(batch['input_ids'])
      predict = torch.argmax(b_logits, dim=1)
      total_ += predict.size(0)
      correct += (predict == b_labels).float().sum().item() 
    return correct/total_


In [29]:
# Instantiate the logistic regression classifier and run it
model = LogisticRegression(text_vocab, label_vocab, pad_index).to(device) 
model.train_all(train_iter, val_iter)
model.load_state_dict(model.best_model)
test_accuracy = model.evaluate(test_iter)
print (f'Test accuracy: {test_accuracy:.4f}')

train:   0%|          | 0/8 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

Test accuracy: 0.9308


# To Do: Implement a multilayer perceptron

## Review of multilayer perceptrons

<img src="https://github.com/nlp-course/data/raw/master/Resources/multilayer-perceptron-figure.png" alt="multilayer perceptron illustration" width="400"  align=right />

In the last part, you implemented a perceptron, a model that involved a linear calculation (the sum of weights) followed by a nonlinear calculation (the softmax, which converts the summed weight values to probabilities). In a multi-layer perceptron, we take the output of the first perceptron to be the input of a second perceptron (and of course, we could continue on with a third or even more).

In this part, you'll implement the forward calculation of a two-layer perceptron, again letting PyTorch handle the backward calculation as well as the optimization of parameters. The first layer will involve a linear summation as before and a **sigmoid** as the nonlinear function. The second will involve a linear summation and a softmax (the latter absorbed, as before, into the loss function). Thus, the difference from the logistic regression implementation is simply the adding of the sigmoid and second linear calculations. See the figure for the structure of the computation. 



## Implement a multilayer perceptron classifier

For the implementation, we ask you to implement a two layer perceptron classifier, again as a subclass of the [`torch.nn` module](https://pytorch.org/docs/stable/nn.html). You might reuse quite a lot of the code from logistic regression. As before, you need to implement the following methods:

1. `__init__`: An initializer that takes `text_vocab`, `label_vocab`, `pad_index`, and `hidden_size` specifying the size of the hidden layer (e.g., in the above illustration, `hidden_size` is `D`).

    During initialization, you'll want to define two tensors of weights, which serve as the parameters of this model, one for each layer. You'll want to [initialize them randomly](https://pytorch.org/docs/stable/tensors.html#torch.Tensor.uniform_). 
    
    The weights in the first layer are a kind of lookup (as in the previous part), mapping words to a vector of size `hidden_size`. The [`nn.Embedding` module](https://pytorch.org/docs/master/generated/torch.nn.Embedding.html) is a good way to set up and make use of this weight tensor.
    
    The weights in the second layer define a linear mapping from vectors of size `hidden_size` to vectors of size `num_labels`. The [`nn.Linear` module](https://pytorch.org/docs/master/generated/torch.nn.Linear.html) or [`torch.mm`](https://pytorch.org/docs/master/generated/torch.mm.html) for matrix multiplication may be helpful here.

2. `forward`: Given a text batch of size `batch_size X max_length`, the `forward` function returns a tensor of logits of size `batch_size X num_labels`. 

    That is, for each text $\vect{x}$ in the batch and each label $c$, you'll be calculating $MLP(bow(\vect{x}))$ as shown in the illustration above, returning a tensor of these values. Note that the softmax operation is absorbed into [`nn.CrossEntropyLoss`](https://pytorch.org/docs/master/generated/torch.nn.CrossEntropyLoss.html) so you don't need to worry about that.
    
    For the sigmoid sublayer, you might find [`nn.Sigmoid`](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html) useful.
    
3. `train_all`: A method that performs training. You might find lab 1-5 useful.

4. `evaluate`: A method that takes a test data iterator and evaluates the accuracy of the trained model on the test set.

You should expect to achieve at least **90%** accuracy on the ATIS classificiation task. 

In [30]:
class MultiLayerPerceptron(nn.Module):
  def __init__ (self, text_vocab, label_vocab, pad_index, hidden_size=128): 
    super().__init__ ()
    self.pad_index = pad_index
    self.hidden_size = hidden_size
    # Keep the vocabulary sizes available
    self.N = len(label_vocab) # num_classes
    self.V = len(text_vocab)  # vocab_size
    # Specify cross-entropy loss for optimization
    self.criterion = nn.CrossEntropyLoss()
    self.ll1= nn.Linear(self.V, self.hidden_size, bias=False)
    self.sigmoid = nn.Sigmoid()
    self.ll2= nn.Linear(self.hidden_size, self.N, bias=False)

    # TODO: Create and initialize neural modules
    U = (torch.Tensor(self.V, self.hidden_size)).uniform_(0,0.2)
    U[:,pad_index] = 0
    self.U = torch.nn.Parameter(U)

  def forward(self, text_batch):
    # TODO: Calculate the logits for the `text_batch`, 
    #       returning a tensor of size batch_size x num_labels
    # x = torch.zeros(len(text_batch), self.V)
    # for indx, text in enumerate(text_batch): 
    #   for word in text:
    #     if word == self.pad_index:
    #       break 
    #     x[indx][word] += 1
    # return self.ll2(self.sigmoid(self.ll1(x.to(device))))
    Ux = torch.nn.functional.embedding(text_batch, self.U, padding_idx=self.pad_index).sum(dim=1)
    y= self.sigmoid(Ux)
    return self.ll2(y)

  
  def train_all(self, train_iter, val_iter, epochs=8, learning_rate=3e-3):
    # Switch the module to training mode
    self.train()
    # Use Adam to optimize the parameters
    optim = torch.optim.Adam(self.parameters(), lr=learning_rate)
    best_validation_accuracy = -float('inf')
    best_model = None
    # Run the optimization for multiple epochs
    with tqdm(range(epochs), desc='train', position=0) as pbar:
      for epoch in pbar:
        c_num = 0
        total = 0
        running_loss = 0.0
        for batch in tqdm(train_iter, desc='batch', leave=False):
          # TODO: set labels, compute logits (Ux in this model), 
          #       loss, and update parameters
          optim.zero_grad()
          labels = batch['label_ids']
          logits = self.forward(batch['input_ids'])
          loss = self.criterion(logits.float(), labels)
          loss.backward()
          optim.step()
          # Prepare to compute the accuracy
          predictions = torch.argmax(logits, dim=1)
          total += predictions.size(0)
          c_num += (predictions == labels).float().sum().item()        
          running_loss += loss.item() * predictions.size(0)

        # Evaluate and track improvements on the validation dataset
        validation_accuracy = self.evaluate(val_iter)
        if validation_accuracy > best_validation_accuracy:
          best_validation_accuracy = validation_accuracy
          self.best_model = copy.deepcopy(self.state_dict())
        epoch_loss = running_loss / total
        epoch_acc = c_num / total
        pbar.set_postfix(epoch=epoch+1, loss=epoch_loss, train_acc = epoch_acc, val_acc=validation_accuracy)

  def evaluate(self, iterator):
    """Returns the model's accuracy on a given dataset `iterator`."""
    self.eval()
    # TODO: Compute accuracy
    total_ = 0
    correct = 0 

    for batch in iterator:
      b_labels = batch['label_ids']
      b_logits = self.forward(batch['input_ids'])
      predict = torch.argmax(b_logits, dim=1)
      total_ += predict.size(0)
      correct += (predict == b_labels).float().sum().item() 
    return correct/total_

In [31]:
# Instantiate classifier and run it
model = MultiLayerPerceptron(text_vocab, label_vocab, pad_index, hidden_size=128).to(device) 
model.train_all(train_iter, val_iter)
model.load_state_dict(model.best_model)
test_accuracy = model.evaluate(test_iter)
print (f'Test accuracy: {test_accuracy:.4f}')

train:   0%|          | 0/8 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

Test accuracy: 0.9196


<!-- BEGIN QUESTION -->

# Lessons learned

Take a look at some of the examples that were classified correctly and incorrectly by your best method.

**Question:** Do you notice anything about the incorrectly classified examples that might indicate _why_ they were classified incorrectly?

<!--
BEGIN QUESTION
name: open_response_lessons
manual: true
-->

After looking at the examples that were classified correctly and incorrectly by your best method.
we noticed that:
1. The incorrect predictions are caracterizes by shorter examples than the correct ones. we assume that because of the shorter context
2. The model predict better (with better acuuracy) when it has more examples  for the label id (meaning the label is more frequent). i.e. as we can see by the hist below, the label id "15" has many examples and was predicted correctly 100% of the time, but the label id "18" has only 2 examples and was predicted one time correctly and one time incorrectly. we asuume that when the model classified incorrectly, it usually predicted the label id "15", which was the majority class.
3. The model is more likely to have a correct predict when the sentence doesnt have UNK words in it. i.e. we can see that in the correctly classified axemples 14% of the examples contained at least one UNK token, and in the incorrectly classified examples 30% of the examples had at least one UNK token. (most of the examples didnt have more then one UNK token)


<!-- END QUESTION -->

In [32]:
model = LogisticRegression(text_vocab, label_vocab, pad_index).to(device) 
model.train_all(train_iter, val_iter)
model.load_state_dict(model.best_model)


train:   0%|          | 0/8 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

batch:   0%|          | 0/137 [00:00<?, ?it/s]

<All keys matched successfully>

In [33]:
def  classified_correctly_and_incorrectly(model, iterator):
  """Returns the model's examples that were classified correctly and incorrectly by your best method.acy on a given dataset `iterator`."""
  classified_correctly = {'text': [], 'list': []}
  classified_incorrectly =  {'text': [], 'list': []}
  hist = {'correct': {}, 'incorrect': {}, 'comparison': {}}
  
  labels =  iterator.dataset['label_id']
  predicts = []
  for batch in tqdm(iterator, desc='batch', leave=False):
    logits = model.forward(batch['input_ids'])
    predicts += torch.argmax(logits, dim=1).tolist()


  for indx, label, predict in zip(range(len(labels)), labels, predicts):
    x = f"{iterator.dataset['input_ids'][indx]} label={label}, predict={predict}"
    y = [iterator.dataset['input_ids'][indx], label, predict]
    if label==predict:
      classified_correctly['text'].append(x)
      classified_correctly['list'].append(y)
      if label not in hist['correct'].keys():
        hist['correct'][label] = 1
      else: hist['correct'][label]+=1
    else:
      classified_incorrectly['text'].append(x)
      classified_incorrectly['list'].append(y)
      if label not in hist['incorrect'].keys():
        hist['incorrect'][label] = 1
      else:
        hist['incorrect'][label]+=1
      if predict not in hist['comparison'].keys():
        hist['comparison'][predict] = 1
      else:
        hist['comparison'][predict]+=1


  hist['correct'] = dict(sorted(hist['correct'].items(), key=lambda x: x[1]))
  hist['incorrect'] = dict(sorted(hist['incorrect'].items(), key=lambda x: x[1]))
  return classified_correctly, classified_incorrectly, hist

classified_correctly, classified_incorrectly, hist = classified_correctly_and_incorrectly(model, test_iter)

batch:   0%|          | 0/14 [00:00<?, ?it/s]

In [34]:
classified_correctly['text']

['[7, 4, 82, 3, 22, 2, 125] label=15, predict=15',
 '[7, 4, 82, 3, 125, 2, 139, 137, 56] label=15, predict=15',
 '[12, 54, 156, 124, 9, 3, 96, 2, 14] label=15, predict=15',
 '[7, 308, 31, 48, 51, 28, 59, 16, 14] label=29, predict=29',
 '[7, 4, 82, 3, 14, 2, 90, 89, 177, 6, 84, 37] label=15, predict=15',
 '[21, 48, 51, 59, 16, 90, 89, 177] label=29, predict=29',
 '[12, 54, 2, 40, 3, 90, 89, 177, 2, 96, 6, 47, 63] label=15, predict=15',
 '[4, 3, 34, 2, 128] label=15, predict=15',
 '[4, 3, 18, 2, 128] label=15, predict=15',
 '[4, 3, 13, 112, 2, 128] label=15, predict=15',
 '[12, 38, 26, 9, 83, 3, 125, 2, 14] label=15, predict=15',
 '[170, 12, 60, 9, 83, 6, 4, 3, 139, 137, 56, 2, 125, 36] label=15, predict=15',
 '[170, 12, 60, 9, 83, 6, 4, 3, 20, 2, 125, 36] label=15, predict=15',
 '[12, 38, 26, 83, 6, 4, 35, 3, 34, 75, 2, 14] label=15, predict=15',
 '[12, 54, 83, 6, 4, 3, 34, 2, 10, 32, 70, 6, 15, 119] label=15, predict=15',
 '[12, 54, 5, 4, 3, 34, 2, 146, 6, 15, 119] label=15, predict=15

In [35]:
classified_incorrectly['text']

['[12, 54, 5, 66, 6, 4, 3, 34, 2, 111, 6, 15, 119] label=14, predict=15',
 '[12, 54, 9, 416, 19, 30, 41, 4, 164, 3, 43, 2, 139, 137, 56, 6, 79, 164, 61, 287] label=16, predict=15',
 '[29, 5, 1, 86] label=3, predict=5',
 '[29, 194, 242] label=3, predict=5',
 '[29, 194, 242] label=3, predict=5',
 '[29, 242] label=3, predict=5',
 '[7, 21, 1] label=3, predict=2',
 '[7, 167, 31, 134, 28, 212, 3, 140, 2, 22, 61, 94] label=6, predict=1',
 '[12, 54, 15, 159, 3, 172, 2, 128] label=14, predict=15',
 '[12, 54, 15, 159, 3, 172, 507, 2, 128] label=14, predict=15',
 '[12, 27, 46, 26, 15, 65, 67, 159, 3, 96, 2, 114, 349, 47, 91, 169, 79, 37] label=14, predict=15',
 '[7, 52, 39, 131, 311, 158] label=7, predict=13',
 '[7, 52, 39, 131, 324, 158] label=7, predict=13',
 '[7, 52, 39, 131, 311, 158] label=7, predict=13',
 '[7, 52, 5, 249, 1, 158] label=23, predict=3',
 '[7, 52, 39, 131, 324, 158] label=7, predict=13',
 '[29, 371, 3, 242, 2, 135, 16, 69, 76] label=20, predict=15',
 '[29, 241] label=8, predic

In [36]:
hist

{'correct': {23: 1,
  7: 1,
  18: 1,
  10: 2,
  13: 7,
  3: 10,
  14: 13,
  29: 20,
  1: 20,
  2: 21,
  15: 303},
 'incorrect': {16: 1,
  23: 1,
  20: 1,
  18: 1,
  24: 1,
  6: 2,
  10: 2,
  2: 3,
  15: 3,
  7: 4,
  1: 4,
  8: 5,
  3: 8,
  14: 13},
 'comparison': {15: 26, 5: 6, 2: 7, 1: 1, 13: 4, 3: 3, 23: 1, 29: 1}}

In [37]:
def precentage_of_unk(list):
  count = 0
  for e in list:
    text_ids = e[0]
    if text_ids.count(text_vocab[unk_token]):
      count += 1

  return count/len(list)

def len_of_exapmle(list):
  count = 0
  for e in list:
    text_ids = e[0]
    count += len(text_ids)

  return count/len(list)

print(f"precentage of unk in correctly classified = {precentage_of_unk(classified_correctly['list'])}")
print(f"precentage of unk in incorrectly classified = {precentage_of_unk(classified_incorrectly['list'])}")
print(f"Avg. len of exapmle in correctly classified = {len_of_exapmle(classified_correctly['list'])}")
print(f"Avg. len of exapmle in incorrectly classified = {len_of_exapmle(classified_incorrectly['list'])}")

precentage of unk in correctly classified = 0.14035087719298245
precentage of unk in incorrectly classified = 0.30612244897959184
Avg. len of exapmle in correctly classified = 9.854636591478696
Avg. len of exapmle in incorrectly classified = 8.63265306122449


<!-- BEGIN QUESTION -->

# Debrief

**Question:** We're interested in any thoughts you have about this project segment so that we can improve it for later years, and to inform later segments for this year. Please list any issues that arose or comments you have to improve the project segment. Useful things to comment on include the following: 

* Was the project segment clear or unclear? Which portions?
* Were the readings appropriate background for the project segment? 
* Are there additions or changes you think would make the project segment better?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



# Instructions for submission of the project segment

This project segment should be submitted to Gradescope at <https://rebrand.ly/project1-submit-code> and <https://rebrand.ly/project1-submit-pdf>, which will be made available some time before the due date.

Project segment notebooks are manually graded, not autograded using otter as labs are. (Otter is used within project segment notebooks to synchronize distribution and solution code however.) **We will not run your notebook before grading it.** Instead, we ask that you submit the already freshly run notebook. The best method is to "restart kernel and run all cells", allowing time for all cells to be run to completion. You should submit your code to Gradescope at the code submission assignment at <https://rebrand.ly/project1-submit-code>.

We also request that you **submit a PDF of the freshly run notebook**. The simplest method is to use "Export notebook to PDF", which will render the notebook to PDF via LaTeX. If that doesn't work, the method that seems to be most reliable is to export the notebook as HTML (if you are using Jupyter Notebook, you can do so using `File -> Print Preview`), open the HTML in a browser, and print it to a file. Then make sure to add the file to your git commit. Please name the file the same name as this notebook, but with a `.pdf` extension. (Conveniently, the methods just described will use that name by default.) You can then perform a git commit and push and submit the commit to Gradescope at <https://rebrand.ly/project1-submit-pdf>.

# End of project segment 1