# Homework and bakeoff: Multi-domain sentiment

In [None]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2023"

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cgpotts/cs224u/blob/main/hw_sentiment.ipynb)
[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/cgpotts/cs224u/blob/main/hw_sentiment.ipynb)

If Colab is opened with this badge, please **save a copy to drive** (from the File menu) before running the notebook.

## Overview

This homework and associated bakeoff are devoted to supervised sentiment analysis in a ternary label setting (positive, negative, neutral). Your ultimate goal is to develop systems that can make accurate predictions in multiple domains.

The homework questions ask you to implement some baseline systems using DynaSent Round 1, DynaSent Round 2, and the Stanford Sentiment Treebank. The bakeoff challenge is to define a system that does well on the DynaSent test sets, the SST-3 test set, and a set of mystery examples that don't correspond to the DynaSent or SST-3 domains.

__Important methodological note:__ The DynaSent and SST-3 test sets are already publicly distributed, so we are counting on people not to cheat by developing their models on these test sets. You must do all your development without using these test sets at all, and then evaluate exactly once on the test set and turn in the results, with no further system tuning or additional runs. _Much of the scientific integrity of our field depends on people adhering to this honor code._

This notebook briefly introduces our three development datasets, states the homework questions, and then provides guidance on the original system and associated bakeoff entry.

## Set-up

In [None]:
try:
    # Sort of randomly chosen import to see whether the requirements
    # are met:
    import datasets
except ModuleNotFoundError:
    !git clone https://github.com/cgpotts/cs224u/
    !pip install -r cs224u/requirements.txt
    import sys
    sys.path.append("cs224u")

Cloning into 'cs224u'...
remote: Enumerating objects: 2209, done.
remote: Counting objects: 100% (117/117), done.
remote: Compressing objects: 100% (78/78), done.
remote: Total 2209 (delta 51), reused 62 (delta 39), pack-reused 2092
Receiving objects: 100% (2209/2209), 41.48 MiB | 16.97 MiB/s, done.
Resolving deltas: 100% (1350/1350), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/stanfordnlp/dsp (from -r cs224u/requirements.txt (line 15))
  Cloning https://github.com/stanfordnlp/dsp to /tmp/pip-req-build-_6bqeuru
  Running command git clone --filter=blob:none --quiet https://github.com/stanfordnlp/dsp /tmp/pip-req-build-_6bqeuru
  Resolved https://github.com/stanfordnlp/dsp to commit 693be4d83c5037e0c7cca5d58b42a7bb8e3b7e9a
  Preparing metadata (setup.py) ... done
Collecting jupyter>=1.0.0
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting torch==1.13.1

In [None]:
from collections import defaultdict, Counter
from datasets import load_dataset
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
import torch

ModuleNotFoundError: ignored

## Datasets

### DynaSent round 1

The DynaSent dataset of [Potts, Wu, et al. 2021](https://aclanthology.org/2021.acl-long.186/) is a ternary sentiment benchmark consisting of two rounds (so far). The dataset is available on [Hugging Face](https://huggingface.co/datasets/dynabench/dynasent).

For Round 1, the authors collected sentences from the [Yelp Academic Dataset](https://www.yelp.com/dataset) that fooled a top-performing sentiment model but were intuitive for humans. The model was used only to heuristically find the examples. Crowdworkers multiply-labeled all of them.

The round contains a lot of metadata that could be useful for developing sentiment models. We will focus on just the sentences and labels, but you are free to make use of this additional metadata in developing your system.

In [None]:
dynasent_r1 = load_dataset("dynabench/dynasent", 'dynabench.dynasent.r1.all')

Downloading builder script:   0%|          | 0.00/16.5k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/6.97k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/13.7k [00:00<?, ?B/s]

Downloading and preparing dataset dynasent/dynabench.dynasent.r1.all (download: 16.26 MiB, generated: 23.94 MiB, post-processed: Unknown size, total: 40.20 MiB) to /root/.cache/huggingface/datasets/dynabench___dynasent/dynabench.dynasent.r1.all/1.1.0/ab89971d9ae1aacc59ed44d6855bf0e89167417257e2c2666f38e532148f2967...


Downloading data:   0%|          | 0.00/17.1M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/80488 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3600 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/3600 [00:00<?, ? examples/s]

Dataset dynasent downloaded and prepared to /root/.cache/huggingface/datasets/dynabench___dynasent/dynabench.dynasent.r1.all/1.1.0/ab89971d9ae1aacc59ed44d6855bf0e89167417257e2c2666f38e532148f2967. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
dynasent_r1

DatasetDict({
    train: Dataset({
        features: ['id', 'hit_ids', 'sentence', 'indices_into_review_text', 'model_0_label', 'model_0_probs', 'text_id', 'review_id', 'review_rating', 'label_distribution', 'gold_label', 'metadata'],
        num_rows: 80488
    })
    validation: Dataset({
        features: ['id', 'hit_ids', 'sentence', 'indices_into_review_text', 'model_0_label', 'model_0_probs', 'text_id', 'review_id', 'review_rating', 'label_distribution', 'gold_label', 'metadata'],
        num_rows: 3600
    })
    test: Dataset({
        features: ['id', 'hit_ids', 'sentence', 'indices_into_review_text', 'model_0_label', 'model_0_probs', 'text_id', 'review_id', 'review_rating', 'label_distribution', 'gold_label', 'metadata'],
        num_rows: 3600
    })
})

Splits:

In [None]:
def print_label_dist(dataset, labelname='gold_label', splitnames=('train', 'validation')):
    for splitname in splitnames:
        print(splitname)
        dist = sorted(Counter(dataset[splitname][labelname]).items())
        for k, v in dist:
            print(f"\t{k:>14s}: {v}")

In [None]:
print_label_dist(dynasent_r1)

train
	      negative: 14021
	       neutral: 45076
	      positive: 21391
validation
	      negative: 1200
	       neutral: 1200
	      positive: 1200
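
The counts above show that the Round 1 training split is skewed toward `neutral`, while the validation split is perfectly balanced. As a minimal sketch, the printed training counts can be turned into proportions like this (the counts are copied from the output above; `train_counts` is just an illustrative name):

```python
# Training-split label counts copied from the `print_label_dist` output above.
train_counts = {"negative": 14021, "neutral": 45076, "positive": 21391}

total = sum(train_counts.values())
proportions = {label: count / total for label, count in train_counts.items()}

# Print each label with its share of the training split.
for label, p in sorted(proportions.items()):
    print(f"{label:>10s}: {p:.3f}")
```

A model trained on this split may lean toward predicting `neutral`, which is worth keeping in mind when reading per-class scores on the balanced validation split.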


### DynaSent round 2

DynaSent Round 2 was created using different methods than Round 1. For Round 2, crowdworkers edited sentences from the Yelp Academic Dataset seeking to achieve a particular sentiment goal (e.g., expressing a positive sentiment) while fooling a top-performing model. This work was done on the [Dynabench](https://dynabench.org) platform. The hope is that this directly adversarial goal will lead to examples that are very hard for present-day models but intuitive for humans. All the examples were multiply-labeled by separate annotators.

In [None]:
dynasent_r2 = load_dataset("dynabench/dynasent", 'dynabench.dynasent.r2.all')

Downloading and preparing dataset dynasent/dynabench.dynasent.r2.all (download: 16.26 MiB, generated: 4.89 MiB, post-processed: Unknown size, total: 21.15 MiB) to /root/.cache/huggingface/datasets/dynabench___dynasent/dynabench.dynasent.r2.all/1.1.0/ab89971d9ae1aacc59ed44d6855bf0e89167417257e2c2666f38e532148f2967...


Generating train split:   0%|          | 0/13065 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/720 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/720 [00:00<?, ? examples/s]

Dataset dynasent downloaded and prepared to /root/.cache/huggingface/datasets/dynabench___dynasent/dynabench.dynasent.r2.all/1.1.0/ab89971d9ae1aacc59ed44d6855bf0e89167417257e2c2666f38e532148f2967. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
print_label_dist(dynasent_r2)

train
	      negative: 4579
	       neutral: 2448
	      positive: 6038
validation
	      negative: 240
	       neutral: 240
	      positive: 240


### Stanford Sentiment Treebank

The [Stanford Sentiment Treebank (SST)](http://nlp.stanford.edu/sentiment/) of [Socher et al. 2013](https://aclanthology.org/D13-1170/) is a widely-used resource for evaluating supervised models. It consists of sentences from Rotten Tomatoes Movie Reviews (see [Pang and Lee's project page](https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.home.html)). We will use the ternary version of the task (SST-3).

SST examples are special in that they are labeled at the phrase-level as well as the sentence level, which provides very extensive and detailed supervision for sentiment. We will use only the sentence-level labels for the homework, but you are free to use the phrase-level labels as well in designing your original system. (To do this, you will need to get the dataset from the above project page, since the Hugging Face SST-3 we are using does not include these labels.)

In [None]:
sst = load_dataset("SetFit/sst5")

Downloading readme:   0%|          | 0.00/421 [00:00<?, ?B/s]

Downloading and preparing dataset json/SetFit--sst5 to /root/.cache/huggingface/datasets/SetFit___json/SetFit--sst5-4c07b9d5881ae209/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/1.32M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/343k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/171k [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/SetFit___json/SetFit--sst5-4c07b9d5881ae209/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
sst

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 8544
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2210
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1101
    })
})

Out of the box, this is a five-way task:

In [None]:
print_label_dist(sst, labelname='label_text')

train
	      negative: 2218
	       neutral: 1624
	      positive: 2322
	 very negative: 1092
	 very positive: 1288
validation
	      negative: 289
	       neutral: 229
	      positive: 279
	 very negative: 139
	 very positive: 165


The above labels are not aligned with our ternary task, and the dataset distribution uses slightly different keys from those of DynaSent. The following code converts the dataset to SST-3 and also aligns the dataset keys:

In [None]:
def convert_sst_label(s):
    return s.split(" ")[-1]
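
As a quick check of what `convert_sst_label` does, the sketch below applies it to the five SST label strings shown above. It keeps only the last whitespace-separated word, so "very negative" collapses into "negative" and "very positive" into "positive" (`five_way` and `mapping` are illustrative names, not part of the course code):

```python
def convert_sst_label(s):
    # Keep only the final word: "very negative" -> "negative".
    return s.split(" ")[-1]

five_way = ["very negative", "negative", "neutral", "positive", "very positive"]

# Map each five-way label to its ternary counterpart.
mapping = {label: convert_sst_label(label) for label in five_way}

for old, new in mapping.items():
    print(f"{old:>14s} -> {new}")
```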

In [None]:
for splitname in ('train', 'validation', 'test'):
    dist = [convert_sst_label(s) for s in sst[splitname]['label_text']]
    sst[splitname] = sst[splitname].add_column('gold_label', dist)
    sst[splitname] = sst[splitname].add_column('sentence', sst[splitname]['text'])

In [None]:
print_label_dist(sst)

train
	      negative: 3310
	       neutral: 1624
	      positive: 3610
validation
	      negative: 428
	       neutral: 229
	      positive: 444


## Question 1: Linear classifiers

Our first set of experiments will use simple linear classifiers with sparse representations derived from counting unigrams. These experiments will introduce some useful techniques and provide a baseline for original systems. 

### Background: Feature functions

The following is a flexible format for writing feature functions in the context of scikit-learn modeling. The function maps a string to a count dictionary, using the simple procedure of splitting on whitespace and counting the resulting elements:

In [None]:
def unigrams_phi(s):
    """The basis for a unigrams feature function.

    Downcases all tokens.

    Parameters
    ----------
    s : str
        The example to represent

    Returns
    -------
    Counter
        A map from tokens (str) to their counts in `s`

    """
    return Counter(s.lower().split())

Quick example:

In [None]:
unigrams_phi("Here's an example with an emoticon :)!")

Counter({"here's": 1,
         'an': 2,
         'example': 1,
         'with': 1,
         'emoticon': 1,
         ':)!': 1})

### Background: Feature space vectorization

Functions like `unigrams_phi` are just the __basis__ for feature representations. In truth, our models typically don't represent examples as dictionaries, but rather as vectors embedded in a matrix. In general, to manage the translation from dictionaries to vectors, we use [sklearn.feature_extraction.DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) instances. Here's a brief overview of how these work:

To start, suppose that we had just two examples to represent, and our feature function mapped them to the following list of dictionaries:

In [None]:
train_feats = [
    {'a': 1, 'b': 1},
    {'b': 1, 'c': 2}]

Now we create a `DictVectorizer`. So that we can more easily inspect the resulting matrix, I've set `sparse=False`, so that the return value is a dense matrix. For real problems, you'll probably want to use `sparse=True`, as it will be vastly more efficient for the very sparse feature matrices that you are likely to be creating.

In [None]:
vec = DictVectorizer(sparse=False)  # Use `sparse=True` for real problems!

The `fit_transform` method maps our list of dictionaries to a matrix:

In [None]:
X_train = vec.fit_transform(train_feats)

Here I'll create a `pd.DataFrame` just to help us inspect `X_train`:

In [None]:
pd.DataFrame(X_train, columns=vec.get_feature_names_out())

Unnamed: 0,a,b,c
0,1.0,1.0,0.0
1,0.0,1.0,2.0


Now we can see that, intuitively, the feature called "a" is embedded in the first column, "b" in the second column, and "c" in the third.

Now suppose we have some new test examples:

In [None]:
test_feats = [
    {'a': 2, 'c': 1},
    {'a': 4, 'b': 2, 'd': 1}]

If we have trained a model on `X_train`, then it will not have any way to deal with this new feature "d". This shows that we need to embed `test_feats` in the same space as `X_train`. To do this, one just calls `transform` on the existing vectorizer:

In [None]:
X_test = vec.transform(test_feats)  # Not `fit_transform`!

In [None]:
pd.DataFrame(X_test, columns=vec.get_feature_names_out())

Unnamed: 0,a,b,c
0,2.0,0.0,1.0
1,4.0,2.0,0.0


In [None]:
pd.DataFrame(X_train, columns=vec.get_feature_names_out())

Unnamed: 0,a,b,c
0,1.0,1.0,0.0
1,0.0,1.0,2.0
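
For real problems, the same steps work with `sparse=True`. The following is a small sketch under that setting, reusing the toy dictionaries from above; the names `vec_sparse`, `X_train_sparse`, and `X_test_sparse` are introduced here only for illustration. It also confirms that the unseen feature "d" is silently dropped by `transform`:

```python
from sklearn.feature_extraction import DictVectorizer

train_feats = [{'a': 1, 'b': 1}, {'b': 1, 'c': 2}]
test_feats = [{'a': 2, 'c': 1}, {'a': 4, 'b': 2, 'd': 1}]

# With `sparse=True`, the return values are SciPy sparse matrices.
vec_sparse = DictVectorizer(sparse=True)
X_train_sparse = vec_sparse.fit_transform(train_feats)
X_test_sparse = vec_sparse.transform(test_feats)

# The feature space is fixed by the training data: "d" has no column.
print(vec_sparse.vocabulary_)
print(X_test_sparse.shape)

# Convert to a dense array only for inspection of small examples.
print(X_test_sparse.toarray())
```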


The most common mistake with `DictVectorizer` is calling `fit_transform` on test examples. This will wipe out the existing representation scheme, replacing it with one that matches the test examples. That will happen silently, but then you'll find that the new representations are incompatible with the model you fit. This is likely to manifest itself as a `ValueError` relating to feature counts. Here's an example that might help you spot this if and when it arises in your own work:

In [None]:
toy_mod = LogisticRegression()

vec = DictVectorizer(sparse=False)

X_train = vec.fit_transform(train_feats)

toy_mod.fit(X_train, [0, 1])

# Here's the error! Don't use `fit_transform` again! 
# Use `transform`!
X_test = vec.fit_transform(test_feats)

try:
    toy_mod.predict(X_test)
except ValueError as err:
    print("ValueError: {}".format(err))

ValueError: X has 4 features, but LogisticRegression is expecting 3 features as input.
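
For contrast, here is a sketch of the same toy pipeline with the mistake removed: the vectorizer is fit once on the training features and then only `transform` is applied to the test features. The variable names ending in `_ok` are introduced here just to keep this separate from the cell above:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train_feats = [{'a': 1, 'b': 1}, {'b': 1, 'c': 2}]
test_feats = [{'a': 2, 'c': 1}, {'a': 4, 'b': 2, 'd': 1}]

vec_ok = DictVectorizer(sparse=False)

# Fit the feature space exactly once, on the training data:
X_train_ok = vec_ok.fit_transform(train_feats)

mod_ok = LogisticRegression()
mod_ok.fit(X_train_ok, [0, 1])

# Reuse the fitted vectorizer; the unseen feature "d" is dropped:
X_test_ok = vec_ok.transform(test_feats)

# The column counts now agree, so prediction works:
preds_ok = mod_ok.predict(X_test_ok)
print(X_test_ok.shape, preds_ok)
```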


### Background: scikit-learn models

scikit-learn is an amazing package with, among many other things, an incredible array of classifier model implementations. We're going to use a simple softmax classifier for this homework question, but you will find that you can swap in essentially any scikit-learn classifier and see how it does.

The core rhythm for scikit-learn models:

1. Instantiate the model with any hyperparameters.
2. `fit` 
3. `predict`

Here's a quick example that also shows off scikit-learn's functionality for creating synthetic datasets and random train/test splits:

In [None]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_classification(
    n_samples=200, n_classes=3, 
    n_informative=15, n_features=20, 
    weights=[0.2, 0.2, 0.6],
    random_state=1)

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=1)

toymod = LogisticRegression(penalty='l2', C=1, fit_intercept=True)

toymod.fit(X_toy_train, y_toy_train)

toypreds = toymod.predict(X_toy_test)

### Background: Classifier assessment

When assessing a classifier, the best first step is usually to get a classification report:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_toy_test, toypreds, digits=3))

              precision    recall  f1-score   support

           0      0.444     0.500     0.471         8
           1      0.444     0.500     0.471         8
           2      0.909     0.833     0.870        24

    accuracy                          0.700        40
   macro avg      0.599     0.611     0.604        40
weighted avg      0.723     0.700     0.710        40



In this course, we will generally focus on the __macro-average F1 score__ (macro avg above). This is simply the mean of the per-class F1 scores, without any attention paid to the overall size of the class. This is our default because, in NLP, we tend to care about small classes as much as (often more than) large classes.
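
To make the definition concrete, here is a small pure-Python sketch that recomputes the macro and weighted F1 averages for the toy report above. The per-class counts are inferred from that report's precision, recall, and support values (an assumption on my part, since the report does not print raw counts), and `f1_from_counts` is a helper defined only for this illustration:

```python
# (true positives, number predicted as the class, number truly in the class),
# inferred from the precision/recall/support columns of the report above.
counts = {
    0: (4, 9, 8),
    1: (4, 9, 8),
    2: (20, 22, 24),
}

def f1_from_counts(tp, n_pred, n_true):
    """Harmonic mean of precision (tp / n_pred) and recall (tp / n_true)."""
    precision = tp / n_pred
    recall = tp / n_true
    return 2 * precision * recall / (precision + recall)

per_class_f1 = {c: f1_from_counts(*v) for c, v in counts.items()}

# Macro average: every class counts equally.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Weighted average: each class is weighted by its support.
n_total = sum(v[2] for v in counts.values())
weighted_f1 = sum(per_class_f1[c] * counts[c][2] for c in counts) / n_total

print(f"macro F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

The weighted average is pulled upward by the large class 2, whereas the macro average gives the two small classes equal say.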

The scikit-learn implementation of macro-F1 (`f1_score` with `average='macro'`) can be finicky, so our course code provides a convenient wrapper:

In [None]:
import utils

utils.safe_macro_f1(y_toy_test, toypreds)

0.6035805626598466

Note: scikit-learn models have a `score` method. For classifiers, this is set to use `accuracy` by default:

In [None]:
toymod.score(X_toy_test, y_toy_test)

0.7

Accuracy generally isn't well-aligned with our goals, so we discourage use of this method (and of accuracy scores in general).
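
One way to see the problem with accuracy is to score a trivial system that always predicts the largest class. The sketch below uses the class sizes of the toy test split (8, 8, and 24 examples) and a hand-written F1 helper; `per_class_f1` and `y_majority` are names introduced for this illustration only:

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class; defined as 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    if tp == 0:
        return 0.0
    precision = tp / sum(1 for p in y_pred if p == label)
    recall = tp / sum(1 for t in y_true if t == label)
    return 2 * precision * recall / (precision + recall)

# Same class sizes as the toy test split above.
y_true = [0] * 8 + [1] * 8 + [2] * 24

# A "model" that ignores its input and always predicts the majority class.
y_majority = [2] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)

labels = sorted(set(y_true))
macro_f1 = sum(per_class_f1(y_true, y_majority, c) for c in labels) / len(labels)

print(f"accuracy: {accuracy:.3f}")
print(f"macro F1: {macro_f1:.3f}")
```

The accuracy looks respectable even though two of the three classes are never predicted; macro F1 exposes this immediately.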

scikit-learn also makes it very easy to perform automatic hyperparameter tuning. A quick example:

In [None]:
from sklearn.model_selection import GridSearchCV

params = {'C': (0.1, 0.2, 0.3), 'fit_intercept': [True, False]}

toymod_tuned = LogisticRegression()

clf = GridSearchCV(toymod_tuned, params, scoring='f1_macro')

_ = clf.fit(X_toy, y_toy)

Here's the best model found by this search:

In [None]:
clf.best_estimator_

Because we set `scoring='f1_macro'`, the above model was selected using our favored classifier scoring metric:

In [None]:
clf.best_score_

0.6943888670150135

With this best model in hand, we can perform our usual assessment:

In [None]:
bestpreds = clf.best_estimator_.predict(X_toy_test)

In [None]:
print(classification_report(y_toy_test, bestpreds, digits=3))

              precision    recall  f1-score   support

           0      0.600     0.750     0.667         8
           1      0.750     0.750     0.750         8
           2      0.909     0.833     0.870        24

    accuracy                          0.800        40
   macro avg      0.753     0.778     0.762        40
weighted avg      0.815     0.800     0.805        40
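
Note that the search above was fit on all of `X_toy`, which includes the rows in `X_toy_test`, so this report is likely optimistic. A hedged sketch of the stricter alternative is to run the grid search on the training split only and then score the held-out split once. The data generation below repeats the toy setup from earlier; `search`, `heldout_preds`, and `heldout_macro_f1` are names chosen for this sketch, and `max_iter=1000` is my own addition to avoid convergence warnings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Same synthetic data and split as in the toy example above.
X, y = make_classification(
    n_samples=200, n_classes=3,
    n_informative=15, n_features=20,
    weights=[0.2, 0.2, 0.6],
    random_state=1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=1)

params = {'C': (0.1, 0.2, 0.3), 'fit_intercept': [True, False]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000), params, scoring='f1_macro')

# Cross-validation happens inside the training split only:
search.fit(X_tr, y_tr)

# The held-out split is touched exactly once, for the final score:
heldout_preds = search.predict(X_te)
heldout_macro_f1 = f1_score(y_te, heldout_preds, average='macro')

print(search.best_params_)
print(f"held-out macro F1: {heldout_macro_f1:.3f}")
```

This mirrors the methodological note at the top of the notebook: tune on development data and evaluate on the test data once.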



### Task 1: Feature functions [1 point]

The tokenization scheme used by `unigrams_phi` is very basic and leads to unintuitive tokens with punctuation attached to them. Your task here is to complete `tweetgrams_phi`, which should lead to more intuitive results. The task is really just to use the NLTK [TweetTokenizer](https://www.nltk.org/api/nltk.tokenize.casual.html#nltk.tokenize.casual.TweetTokenizer) in place of the simple whitespace tokenization of `unigrams_phi` above.

In [None]:
# Your `tweetgrams_phi` should tokenize data according to this tokenizer from NLTK:
from nltk.tokenize import TweetTokenizer

def tweetgrams_phi(s, **kwargs):
    """The basis for a feature function using `TweetTokenizer`.

    Parameters
    ----------
    s : str
    kwargs : dict
        Passed to `TweetTokenizer`

    Returns
    -------
    Counter
        A map from tokens to their counts in `s`

    """
    tknzr = TweetTokenizer(**kwargs)
    return Counter(tknzr.tokenize(s))



Here's a test you can use to check that your implementation is correct:

In [None]:
def test_tweetgrams_phi(func):
    examples = [
        (
            "Here's an example with an emoticon :)", 
            Counter({'an': 2, "Here's": 1, 'example': 1, 'with': 1, 'emoticon': 1, ':)': 1})
        ),
        (
            "The URL is https://pytorch.org!", 
            Counter({'The': 1, 'URL': 1, 'is': 1, 'https://pytorch.org': 1, '!': 1})
        )
    ]
    errcount = 0
    for ex, expected in examples:
        result = func(ex, preserve_case=True)
        if result != expected:
            errcount += 1
            print(f"Error for `{func.__name__}`: For input {ex}, "
                  f"expected {expected} but got {result}")
    caps_ex = "CAPS"
    caps_result = func(caps_ex, preserve_case=False)
    caps_expected = Counter({"caps": 1})
    if caps_result != caps_expected:
        errcount += 1
        print(f"Error for `{func.__name__}`: For input {caps_ex}, "
              f"expected {caps_expected} but got {caps_result}")    
    if errcount == 0:
        print(f"All tests passed for `{func.__name__}`")    

In [None]:
test_tweetgrams_phi(tweetgrams_phi)

All tests passed for `tweetgrams_phi`


### Task 2: Model training [1 point]

Your task is to complete `train_linear_model`:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report

In [None]:
def train_linear_model(model, featfunc, train_dataset):
    """Train an sklearn classifier.

    Parameters
    ----------
    model : sklearn classifier model
    featfunc : func
        Maps strings to Counter instances
    train_dataset: dict
        Must have a key "sentence" containing strings that `featfunc` 
        will process, and a key "gold_label" giving labels

    Returns
    -------
    tuple
        * A trained version of `model`
        * A fitted `vectorizer` for the train set

    """
    # Step 1: Featurize all the examples in `train_dataset['sentence']`.
    # This creates a list of count dictionaries:
    feat_list = [featfunc(sentence) for sentence in train_dataset['sentence']]


    # Step 2: Instantiate and use a `DictVectorizer`:
    vec = DictVectorizer(sparse=True)
    X_train = vec.fit_transform(feat_list)
    



    # Step 3: Train the model on the feature matrix and
    # train_dataset['gold_label']:
    model.fit(X_train, train_dataset['gold_label'])



    # Step 4: Return (model, vectorizer):
    return (model, vec)




You can use the following test to help ensure that your implementation is correct:

In [None]:
def test_train_linear_model(func):
    train_dataset = {
        'sentence': ['A A', 'A B', 'B B', 'B A', 'B'],
        'gold_label': [0, 1, 0, 1, 1]}
    def featfunc(s):
        return Counter(s.split())
    model = LogisticRegression()
    result = func(model, featfunc, train_dataset)
    if not isinstance(result, tuple) or len(result) != 2:
        print(f"Error for `{func.__name__}`: Incorrect return type")
        return
    model, vectorizer = result
    if not hasattr(vectorizer, 'vocabulary_'):
        print(f"Error for `{func.__name__}`: "
              f"Second return value is not a trained vectorizer")
        return
    if not hasattr(model, 'classes_'):
        print(f"Error for `{func.__name__}`: "
              f"First return value is not a trained classifier")
        return
    print(f"No errors found for `{func.__name__}`")

In [None]:
_ = test_train_linear_model(train_linear_model)

No errors found for `train_linear_model`


You can now very easily train models on our datasets. Quick example (this shouldn't take more than a couple of minutes to run even on a CPU):

In [None]:
lr_unigrams, vec_unigrams = train_linear_model(
    LogisticRegression(max_iter=1000), 
    unigrams_phi, dynasent_r1['train'])

### Task 3: Model assessment [1 point]

Having now trained a model, we'd like to perform assessments on new data. Your task is to complete the wrapper function `assess_linear_model` to do this. The primary things you need to put into practice are (1) how to use a trained vectorizer on new data and (2) how to make predictions with your trained model. (Both of these steps are reviewed earlier in this notebook.)

In [None]:
def assess_linear_model(model, featfunc, vectorizer, assess_dataset):
    """Assess a trained sklearn model.

    Parameters
    ----------
    model: trained sklearn model
    featfunc : func
        Maps strings to count dicts
    vectorizer : fitted DictVectorizer
    assess_dataset: dict
        Must have a key "sentence" containing strings that `featfunc` 
        will process, and a key "gold_label" giving labels

    Returns
    -------
    A classification report (multiline string)

    """
    # Step 1: Featurize the assessment data. This creates a list of
    # count dictionaries:
    feat_list = [featfunc(sentence) for sentence in assess_dataset['sentence']]


    # Step 2: Vectorize the assessment data features. Use `transform`,
    # not `fit_transform`, to stay in the training feature space:
    X_test = vectorizer.transform(feat_list)


    # Step 3: Make predictions:
    preds = model.predict(X_test)



    # Step 4: Return a classification report (str):
    return classification_report(assess_dataset['gold_label'], preds, digits=3)



Here's a quick test you can use:

In [None]:
def test_assess_linear_model(assessfunc, trainfunc):
    train_dataset = {
        'sentence': ['A A', 'A B', 'B B', 'B A', 'A', 'B'],
        'gold_label': [0, 1, 0, 1, 0, 1]}
    assess_dataset = {
        'sentence': ['A C', 'B A'],
        'gold_label': [0, 1]}
    def featfunc(s):
        return Counter(s.split())
    model = LogisticRegression()
    model, vectorizer = trainfunc(model, featfunc, train_dataset)
    result = assessfunc(model, featfunc, vectorizer, assess_dataset)
    errcount = 0
    if len(vectorizer.vocabulary_) != 2:
        print(f"Error for `{assessfunc.__name__}`: Unexpected feature count")
        errcount += 1
    if 'weighted avg' not in result:
        print(f"Error for `{assessfunc.__name__}`: Unexpected return value")
        errcount += 1
    if errcount == 0:
        print(f"No errors found for `{assessfunc.__name__}`")

In [None]:
test_assess_linear_model(assess_linear_model, train_linear_model)

No errors found for `assess_linear_model`




If you trained a model `lr_unigrams` above, you can now easily assess it. An example:

In [None]:
report = assess_linear_model(
    lr_unigrams,
    unigrams_phi,
    vec_unigrams,
    dynasent_r1['validation'])

print(report)

              precision    recall  f1-score   support

    negative      0.756     0.365     0.492      1200
     neutral      0.523     0.889     0.659      1200
    positive      0.700     0.573     0.630      1200

    accuracy                          0.609      3600
   macro avg      0.660     0.609     0.594      3600
weighted avg      0.660     0.609     0.594      3600



## Question 2: Transformer fine-tuning

We're now going to move into a more modern mode: fine-tuning pretrained components.

We'll use BERT-mini (originally from [the BERT repo](https://github.com/google-research/bert)) for the homework so that we can rapidly develop prototypes. You can then consider scaling up to larger models.

In [None]:
import transformers
from transformers import AutoModel, AutoTokenizer

The `transformers` library does a lot of logging. To avoid ending up with a cluttered notebook, I am changing the logging level. You might want to skip this as you scale up to building production systems, since the logging is very good – it gives you a lot of insights into what the models and code are doing.

In [None]:
transformers.logging.set_verbosity_error()

Here we set ourselves up to use BERT-mini:

In [None]:
weights_name = "prajjwal1/bert-mini"

bert = AutoModel.from_pretrained(weights_name)

bert_tokenizer = AutoTokenizer.from_pretrained(weights_name)

Downloading (…)lve/main/config.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/45.1M [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

### Background: Tokenization

Tokenization in Transformer models is handled differently from tokenization in linear models of the sort we used in Question 1. For Transformer models, we need to use the tokenizer that comes with the model so that we reliably have embedding representations for every token.

In [None]:
example_text = "Bert knows Snuffleupagus"

Here's a basic tokenization step:

In [None]:
bert_tokenizer.tokenize(example_text)

['bert', 'knows', 's', '##nu', '##ffle', '##up', '##ag', '##us']

Notice that the tokenizer split "Snuffleupagus" into a bunch of subword tokens.
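
To give a rough intuition for how such a split can arise, here is a toy sketch of greedy longest-match-first subword tokenization in the WordPiece style. It is not the Hugging Face implementation: `wordpiece_tokenize` is a made-up helper, and `toy_vocab` is a tiny hypothetical vocabulary containing only the pieces seen in the output above, whereas the real tokenizer has a much larger vocabulary and additional pre-processing steps.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Non-initial pieces carry a "##" prefix. If some position has no
    matching piece, the whole word maps to `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces

# Hypothetical vocabulary, for illustration only.
toy_vocab = {"bert", "knows", "s", "##nu", "##ffle", "##up", "##ag", "##us"}

tokens = []
for word in "Bert knows Snuffleupagus".lower().split():
    tokens.extend(wordpiece_tokenize(word, toy_vocab))

print(tokens)
```

Because frequent words exist in the vocabulary as whole units, only rare words like "Snuffleupagus" end up broken into several pieces.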

The above use of the tokenizer, where we map from strings to lists of strings, is really for us humans. For modeling, the most important step for tokenization is mapping individual strings to sequences of integer ids. These ids key into the lowest embedding layer of the model.

In [None]:
ex_ids = bert_tokenizer.encode(example_text, add_special_tokens=True)

ex_ids

[101, 14324, 4282, 1055, 11231, 18142, 6279, 8490, 2271, 102]

We can map these indices back to "words" if we want:

In [None]:
bert_tokenizer.convert_ids_to_tokens(ex_ids)

['[CLS]',
 'bert',
 'knows',
 's',
 '##nu',
 '##ffle',
 '##up',
 '##ag',
 '##us',
 '[SEP]']

### Background: Representation

Having mapped our string to a list of token ids, we can use the `forward` method of the model to get representations:

In [None]:
with torch.no_grad():
    reps = bert(torch.tensor([ex_ids]))

There are a lot of options for which representations to get. With the above call, we got the following:

In [None]:
reps.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

The value of `last_hidden_state` is the sequence of final output states from the model:

In [None]:
reps.last_hidden_state.shape

torch.Size([1, 10, 256])

This is: 1 example, 10 token representations, each one a 256 dimension vector.

The value of `pooler_output` is a set of currently random parameters sitting on top of the first output hidden state. You can see here that it is a single vector representation per example:

In [None]:
reps.pooler_output.shape

torch.Size([1, 256])

I often feel unsure of precisely what this model component is. Here we can have a quick look:

In [None]:
bert.pooler

BertPooler(
  (dense): Linear(in_features=256, out_features=256, bias=True)
  (activation): Tanh()
)

So this is a dense linear layer (a single matrix of weights) with a bias term, and a tanh activation function is applied to the output. We could put a classifier head on top of this if we wanted to, but we might have mixed feelings about being stuck with that tanh step.

### Background: Masking

Where examples from a single batch have different lengths, we need to mask the padded tokens to get the intended results from the model.

For a quick example, here we process our full example from above and print out the first five values:

In [None]:
with torch.no_grad():
    reps = bert(torch.tensor([ex_ids]))
    print(reps.last_hidden_state[0][0][: 5])

tensor([-0.3763, -0.3209,  0.8817,  0.4568, -1.0314])


And now we do the same thing, but with masking of the final five positions to illustrate:

In [None]:
with torch.no_grad():
    # Mask the last 5 tokens:
    am = torch.tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
    maskreps = bert(torch.tensor([ex_ids]), attention_mask=am)
    print(maskreps.last_hidden_state[0][0][: 5])

tensor([-0.1793, -0.8994,  0.9695,  0.9130, -0.7129])
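
To see why the mask changes the representation of the first token, it can help to look at what masking does inside attention. The following is a simplified, single-head sketch with random scores; it shows one common way to apply a mask (setting masked scores to negative infinity before the softmax) and is not meant as the exact Hugging Face implementation.

```python
import torch

torch.manual_seed(0)

seq_len = 10

# Random stand-in for raw attention scores: one row per query token,
# one column per key token.
scores = torch.randn(seq_len, seq_len)

# Same mask as above: the last 5 positions are masked.
am = torch.tensor([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Masked key positions get a score of -inf, so they receive zero
# weight after the softmax. The mask broadcasts over query rows.
masked_scores = scores.masked_fill(am == 0, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)

print(weights[0])
```

Every query, including the one at position 0, now distributes all of its attention over the first five tokens only, so its output state differs from the unmasked case.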


### Task 1: Batch tokenization [1 point]

Your task here is to use the `batch_encode_plus` method for `bert_tokenizer` to tokenize a list of strings. You should complete `get_batch_token_ids` according to the specification in the docstring. All these steps can be handled with a single call to `batch_encode_plus`.

In [None]:
def get_batch_token_ids(batch, tokenizer):
    """Map `batch` to a tensor of ids. The return
    value should meet the following specification:

    1. The max length should be 512.
    2. Examples longer than the max length should be truncated
    3. Examples should be padded to the max length for the batch.
    4. The special [CLS] should be added to the start and the special 
       token [SEP] should be added to the end.
    5. The attention mask should be returned
    6. The return value of each component should be a tensor.    

    Parameters
    ----------
    batch: list of str
    tokenizer: Hugging Face tokenizer

    Returns
    -------
    dict with at least "input_ids" and "attention_mask" as keys,
    each with Tensor values

    """
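    max_length = 512
    # `padding='longest'` pads to the longest example in the batch
    # (item 3 of the specification); `padding='max_length'` would
    # pad every batch to 512 regardless of its contents.
    return tokenizer.batch_encode_plus(
        batch,
        add_special_tokens=True,
        padding='longest',
        truncation=True,
        max_length=max_length,
        return_attention_mask=True,
        return_tensors='pt')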



Here's a test you can use:

In [None]:
def test_get_batch_token_ids(func):
    examples = [
        "Bert knows Snuffleupagus",
        "ELMo knew Bert.",
        "Buffalo " * 520
    ]
    test_tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-mini")
    result = func(examples, test_tokenizer)
    errcount = 0
    if 'attention_mask' not in result:
        errcount += 1  
        print(f"Error for `{func.__name__}`: "
              f"Attention mask was not returned")
    ids = result['input_ids']
    if not isinstance(ids, torch.Tensor):
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Return values are not tensors")
    if ids.shape[1] != 512:
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Expected sequence length 512; got {ids.shape[1]}")
    if ids[0][0] != test_tokenizer.cls_token_id:
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Special tokens were not added")
    if errcount == 0:
        print(f"No errors found for `{func.__name__}`")

In [None]:
test_get_batch_token_ids(get_batch_token_ids)

No errors found for `get_batch_token_ids`


### Task 2: Contextual representations [1 point]

This next task is not used directly in fine-tuning, but it should help ensure that you understand how BERT representations are created and how they need to be managed.

Your task is to complete `get_reps` so that, given a dataset (list of strings), it returns a single tensor in which each row is the output hidden state above the [CLS] token for that example. `get_reps` has a `batchsize` argument that the user can manage depending on how much memory they have available and how large their model is.

In [None]:
def get_reps(dataset, model, tokenizer, batchsize=20):
    """Represent each example in `dataset` with the final hidden state 
    above the [CLS] token.

    Parameters
    ----------
    dataset : list of str
    model : BertModel
    tokenizer : BertTokenizerFast
    batchsize : int

    Returns
    -------
    torch.Tensor with shape `(n_examples, dim)` where `dim` is the
    dimensionality of the representations for `model`

    """
    reps = []
    with torch.no_grad():
        # Iterate over `dataset` in batches. Stepping by `batchsize`
        # ensures that a final partial batch is not dropped:
        for i in range(0, len(dataset), batchsize):
            # Encode the batch with `get_batch_token_ids`:
            batch = get_batch_token_ids(dataset[i: i + batchsize], tokenizer)

            # Get the representations from the model passed in as
            # `model`, making sure to pay attention to masking:
            batch_reps = model(
                batch['input_ids'],
                attention_mask=batch['attention_mask'])

            # Keep only the final hidden state above [CLS]:
            reps.append(batch_reps.last_hidden_state[:, 0, :])

    # Return a single tensor:
    return torch.cat(reps)



Quick test:

In [None]:
def test_get_reps(func):
    examples = ["The cat slept.", "The bird chirped."] * 20
    weights_name = "prajjwal1/bert-mini"
    test_model = AutoModel.from_pretrained(weights_name)
    test_tokenizer = AutoTokenizer.from_pretrained(weights_name)
    result = func(examples, test_model, test_tokenizer, batchsize=2)
    errcount = 0
    if result.shape != (40, 256):
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Expected shape {(40, 256)}, got {result.shape}")
    if round(result[0][0].item(), 2) != -0.64:
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Representations seem to be incorrect")
    if errcount == 0:
        print(f"No errors found for `{func.__name__}`")

In [None]:
test_get_reps(get_reps)

No errors found for `get_reps`
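
One detail worth checking in any batching loop is whether the final, smaller batch is processed. The sketch below uses a hypothetical helper, `iter_batches`, to show one way of slicing a list so that no examples are dropped when the dataset size is not a multiple of the batch size.

```python
def iter_batches(items, batchsize):
    """Yield consecutive slices of `items`, each with at most
    `batchsize` elements. The last slice may be shorter."""
    for start in range(0, len(items), batchsize):
        yield items[start:start + batchsize]


examples = [f"example {i}" for i in range(45)]
sizes = [len(batch) for batch in iter_batches(examples, 20)]
print(sizes)  # [20, 20, 5]

# Dividing the length by the batch size and truncating would give
# only 2 batches here, silently skipping the last 5 examples.
n_truncated = int(len(examples) / 20)
print(n_truncated)  # 2
```

The quick test above uses 40 examples with `batchsize=2`, so it would not catch a dropped partial batch; a dataset of odd length is a useful extra check.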


### Task 3: Fine-tuning module [1 point]

We can now put the above together into a basic `nn.Module` that will fine-tune our BERT model. Most of the module is written for you. The pieces you need to implement:

1. In the `__init__` method, define `self.classifier_layer` using [nn.Sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html).
2. Complete the `forward` method.

Precise instructions are provided in the docstrings for the model.

In [None]:
import torch.nn as nn

class BertClassifierModule(nn.Module):
    def __init__(self, 
            n_classes, 
            hidden_activation, 
            weights_name="prajjwal1/bert-mini"):
        """This module loads a Transformer based on  `weights_name`, 
        puts it in train mode, adds a dense layer with the activation 
        function given by `hidden_activation`, and puts a classifier
        layer on top of that as the final output. The output of
        the dense layer should have the same dimensionality as the
        model input.

        Parameters
        ----------
        n_classes : int
            Number of classes for the output layer
        hidden_activation : torch activation function
            e.g., nn.Tanh()
        weights_name : str
            Name of pretrained model to load from Hugging Face

        """
        super().__init__()
        self.n_classes = n_classes
        self.weights_name = weights_name
        self.bert = AutoModel.from_pretrained(self.weights_name)
        self.bert.train()
        self.hidden_activation = hidden_activation
        self.hidden_dim = self.bert.embeddings.word_embeddings.embedding_dim
        # Add the new parameters here using `nn.Sequential`. 
        # We can define this layer as
        # 
        #  h = f(cW1 + b_h)
        #  y = hW2 + b_y
        #
        # where c is the final hidden state above the [CLS] token,
        # W1 has dimensionality (self.hidden_dim, self.hidden_dim),
        # W2 has dimensionality (self.hidden_dim, self.n_classes), 
        # and we rely on the PyTorch loss function to apply a
        # softmax to y.  
        self.classifier_layer = nn.Sequential(
            nn.Linear(self.hidden_dim, self.hidden_dim, bias=True),
            self.hidden_activation,
            nn.Linear(self.hidden_dim, self.n_classes, bias=True)
        )



    def forward(self, indices, mask):
        """Process `indices` with `mask` by feeding these arguments
        to `self.bert` and then feeding the initial hidden state
        in `last_hidden_state` to `self.classifier_layer`.

        Parameters
        ----------
        indices : torch.LongTensor of shape (n_batch, k)
            Indices into the `self.bert` embedding layer. `n_batch` is
            the number of examples and `k` is the sequence length for
            this batch
        mask : torch.LongTensor of shape (n_batch, k)
            Binary vector indicating which values should be masked.
            `n_batch` is the number of examples and `k` is the
            sequence length for this batch

        Returns
        -------
        torch.FloatTensor
            Predicted values, shape `(n_batch, self.n_classes)`

        """
        maskreps = self.bert(indices, attention_mask=mask)
        return self.classifier_layer(maskreps.last_hidden_state[:,0,:])


In [None]:
bert_module = BertClassifierModule(n_classes=3, hidden_activation=nn.Tanh())

In [None]:
ids = get_batch_token_ids(
    dynasent_r1['train']['sentence'][: 2],
    bert_tokenizer)

bert_module(ids['input_ids'], ids['attention_mask'])

tensor([[-0.1566,  0.1413,  0.1815],
        [-0.1789,  0.3326,  0.1825]], grad_fn=<AddmmBackward0>)
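
These outputs are unnormalized scores (logits), since the module leaves the softmax to the loss function. As a small sketch of that relationship, the snippet below checks that `nn.CrossEntropyLoss` on logits matches a log-softmax followed by negative log-likelihood. The `labels` tensor is hypothetical and only there to make the example runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Logits copied from the output above.
logits = torch.tensor([[-0.1566, 0.1413, 0.1815],
                       [-0.1789, 0.3326, 0.1825]])

# Hypothetical gold labels, for illustration only.
labels = torch.tensor([2, 0])

# CrossEntropyLoss takes raw logits and applies log-softmax itself.
loss = nn.CrossEntropyLoss()(logits, labels)

# The same quantity computed in two explicit steps.
manual = F.nll_loss(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(loss, manual))

# To inspect probabilities, apply the softmax explicitly.
probs = F.softmax(logits, dim=1)
print(probs.sum(dim=1))
```

Adding a softmax inside `forward` as well would apply the normalization twice during training, which is why the module returns logits.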

In [None]:
def test_bert_classifier_module(moduleclass): 
    expected_out = 5
    expected_hidden = 256
    expected_activation = nn.ReLU()
    mod = moduleclass(expected_out, expected_activation)
    errcount = 0

    # Basic layer structure:
    if not hasattr(mod, "classifier_layer") or mod.classifier_layer is None:
        errcount += 1
        print(f"Error for `{moduleclass.__name__}`: "
              f"Missing attribute `classifier_layer`")
        return 
    for i in range(3):
        try:
            mod.classifier_layer[i]
        except IndexError:
            errcount += 1
            print(f"Error for `{moduleclass.__name__}`: "
                  f"`classifier_layer` is not an `nn.Sequential` "
                  f"and/or does not have the right structure")
    # Correct first layer dimensionality:
    result_hidden = mod.classifier_layer[0].out_features
    if result_hidden != expected_hidden:
        errcount += 1
        print(f"Error for `{moduleclass.__name__}`: "
              f"Expected `classifier_layer` hidden dim {expected_hidden}, "
              f"got {result_hidden}") 
    # Correct activation:
    result_activation = mod.classifier_layer[1].__class__.__name__
    if result_activation != expected_activation.__class__.__name__:
        errcount += 1
        print(f"Error for `{moduleclass.__name__}`: "
              f"Incorrect hidden activation")
    # Correct output dimensionality:
    result_out = mod.classifier_layer[2].out_features
    if result_out != expected_out:
        errcount += 1
        print(f"Error for `{moduleclass.__name__}`: "
              f"Expected `classifier_layer` out dim {expected_out}, "
              f"got {result_out}")
    # forward method:
    ids = get_batch_token_ids(["A B C", "A B"], bert_tokenizer)
    result = mod(ids['input_ids'], ids['attention_mask'])
    if result.shape != (2, 5):
        errcount += 1
        print(f"Error for `{moduleclass.__name__}`: "
              f"Expected output shape {(2, 5)}, got {result.shape}")
    if errcount == 0:
        print(f"No errors found for `{moduleclass.__name__}`")

In [None]:
test_bert_classifier_module(BertClassifierModule)

No errors found for `BertClassifierModule`


### Optional use: Classifier interface

The above module doesn't have functionality for processing data and fitting models. Our course code includes some general purpose code for adding these features. Here is an example that should work well with the module you wrote above. For more details on the design of these interfaces, see [tutorial_pytorch_models.ipynb](tutorial_pytorch_models.ipynb).

In [None]:
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier

class BertClassifier(TorchShallowNeuralClassifier):
    def __init__(self, weights_name, *args, **kwargs):
        self.weights_name = weights_name
        self.tokenizer = AutoTokenizer.from_pretrained(self.weights_name)
        super().__init__(*args, **kwargs)
        self.params += ['weights_name']

    def build_graph(self):
        return BertClassifierModule(
            self.n_classes_, self.hidden_activation, self.weights_name)

    def build_dataset(self, X, y=None):
        data = get_batch_token_ids(X, self.tokenizer)
        if y is None:
            dataset = torch.utils.data.TensorDataset(
                data['input_ids'], data['attention_mask'])
        else:
            self.classes_ = sorted(set(y))
            self.n_classes_ = len(self.classes_)
            class2index = dict(zip(self.classes_, range(self.n_classes_)))
            y = [class2index[label] for label in y]
            y = torch.tensor(y)
            dataset = torch.utils.data.TensorDataset(
                data['input_ids'], data['attention_mask'], y)
        return dataset

And here is a training run that should do pretty well for our problem. 

__Note__: This step should not be run on CPU machines. On Google Colab with a GPU, it will likely take about an hour.

In [None]:
bert_finetune = BertClassifier(
    weights_name="prajjwal1/bert-mini",
    hidden_activation=nn.ReLU(),
    eta=0.00005,          # Low learning rate for effective fine-tuning.
    batch_size=8,         # Small batches to avoid memory overload.
    gradient_accumulation_steps=4,  # Increase the effective batch size to 32.
    early_stopping=True,  # Early-stopping
    n_iter_no_change=5)   # params.
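
For intuition about `gradient_accumulation_steps`, here is a minimal sketch of gradient accumulation in plain PyTorch. It uses a toy linear model and random data (both assumptions for illustration) and is not the course implementation; it only shows how batches of 8 with 4 accumulation steps lead to one parameter update per 32 examples.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy model and data, for illustration only.
model = nn.Linear(4, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=0.00005)
loss_fn = nn.CrossEntropyLoss()
X = torch.randn(64, 4)
y = torch.randint(0, 3, (64,))

batch_size, accumulation_steps = 8, 4
n_updates = 0

optimizer.zero_grad()
for step, start in enumerate(range(0, len(X), batch_size), start=1):
    xb = X[start:start + batch_size]
    yb = y[start:start + batch_size]
    # Scale the loss so the accumulated gradient is an average
    # over the accumulated batches rather than a sum.
    loss = loss_fn(model(xb), yb) / accumulation_steps
    loss.backward()  # gradients add up across calls
    if step % accumulation_steps == 0:
        optimizer.step()       # one update per 4 batches
        optimizer.zero_grad()
        n_updates += 1

print(n_updates)  # 2 updates for 64 examples: one per 32 examples
```

Memory use is governed by the batch size of 8, while the update statistics resemble those of a batch of 32.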

In [None]:
%%time

_ = bert_finetune.fit(
    dynasent_r1['train']['sentence'],
    dynasent_r1['train']['gold_label'])

Stopping after epoch 9. Validation score did not improve by tol=1e-05 for more than 5 epochs. Final error is 608.378317643539

CPU times: user 1h 5min 28s, sys: 21.8 s, total: 1h 5min 50s
Wall time: 1h 6min 32s


In [None]:
preds = bert_finetune.predict(sst['validation']['sentence'])

In [None]:
print(classification_report(sst['validation']['gold_label'], preds, digits=3))

              precision    recall  f1-score   support

    negative      0.558     0.668     0.608       428
     neutral      0.345     0.358     0.351       229
    positive      0.703     0.554     0.620       444

    accuracy                          0.558      1101
   macro avg      0.535     0.527     0.526      1101
weighted avg      0.572     0.558     0.559      1101



In [None]:
preds = bert_finetune.predict(dynasent_r1['validation']['sentence'])

In [None]:
print(classification_report(dynasent_r1['validation']['gold_label'], preds, digits=3))

              precision    recall  f1-score   support

    negative      0.769     0.564     0.651      1200
     neutral      0.642     0.871     0.739      1200
    positive      0.735     0.669     0.700      1200

    accuracy                          0.701      3600
   macro avg      0.715     0.701     0.697      3600
weighted avg      0.715     0.701     0.697      3600
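
The summary rows of these reports can be reproduced by hand. The sketch below recomputes one per-class F1 from its precision and recall, and then the macro and weighted averages, using the rounded DynaSent-R1 numbers printed above.

```python
# Values copied from the DynaSent-R1 report above (rounded to 3 digits).
precision_neg, recall_neg = 0.769, 0.564
f1_by_class = {"negative": 0.651, "neutral": 0.739, "positive": 0.700}
support = {"negative": 1200, "neutral": 1200, "positive": 1200}

# F1 is the harmonic mean of precision and recall.
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)
print(round(f1_neg, 3))  # 0.651

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)
print(round(macro_f1, 3))  # 0.697

# Weighted average: mean over classes weighted by support.
total = sum(support.values())
weighted_f1 = sum(f1_by_class[c] * support[c] for c in f1_by_class) / total
print(round(weighted_f1, 3))  # 0.697
```

Because the DynaSent-R1 validation set is balanced, the macro and weighted averages coincide; on the SST validation set, where class sizes differ, they do not.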



## Question 3: Your original system [3 points]

Your task is to develop an original ternary sentiment classifier model. There are many options. The only rule:

__You cannot make any use of the test sets for DynaSent-R1, DynaSent-R2, or SST-3, at any time during the course of development.__

The integrity of the bakeoff depends on this rule being followed.

It's fine to use the dev sets for system development – indeed, we encourage this.

For system development, here are some relatively manageable ideas that you might try:

* Different pretrained models. There are many models available on the [Hugging Face models hub](https://huggingface.co/models) that will be drop-in replacements for BERT-mini as we used it above.

* Different fine-tuning regimes. We used the [CLS] token above. This doesn't make especially good use of the output states of the models. Pooling across these representations (with sum, average, etc.) is likely to be better.

* Different training regimes. You have three train sets at your disposal, and there may be other sentiment datasets that could contribute to making your system more robust in new domains.

* Entirely different approaches. There is no requirement that you make use of any of the concepts from the homework questions in constructing your original system. Anything goes as long as you follow the one rule given above in bold.
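
As one illustration of the pooling idea mentioned in the list above, here is a sketch of mean pooling that ignores padded positions. The function name `mean_pool` and the random input tensors are assumptions for this example; in a real system the inputs would be the model's `last_hidden_state` and the tokenizer's attention mask.

```python
import torch


def mean_pool(last_hidden_state, attention_mask):
    """Average the output states of the unmasked tokens.

    last_hidden_state: float tensor, shape (n_batch, k, dim)
    attention_mask: tensor of 0s and 1s, shape (n_batch, k)
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for fully masked rows.
    counts = mask.sum(dim=1).clamp(min=1)
    return summed / counts


torch.manual_seed(0)
states = torch.randn(2, 4, 256)                   # stand-in for model output
am = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])  # second example is padded
pooled = mean_pool(states, am)
print(pooled.shape)
```

Dropping the division by `counts` gives sum pooling instead; in both cases, multiplying by the mask keeps padded positions from contaminating the representation.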

We want to emphasize that this needs to be an original system. It doesn't suffice to download code from the Web, retrain, and submit. You can build on others' code, but you have to do something new and meaningful with it. See the course website for additional guidance on how original systems will be evaluated.

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [None]:
# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:
#   1) Textual description of your system.
#   2) The code for your original system.
# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS

# START COMMENT: Enter your system description in this cell.

# Description: Our model is a transformer that's a fine-tuned version of the pre-trained BERT model 
# BERT-small (we tried larger BERT models, but they were too computationally expensive for our machines).
# The hyperparameters used for fine-tuning include a learning rate of 5e-5, a maximum sequence length of 512,
# a batch size of 8, and a maximum of 5 epochs with early stopping if the validation loss does not improve
# for 5 consecutive epochs. We included evaluation of the fine-tuned models using the classification_report function
# from the sklearn library, which outputs precision, recall, and f1-scores for each class in the dataset. 
# Our model achieved a macro avg f1-score of 0.577 on the Stanford Sentiment Treebank dataset, a macro avg f1-score
# of 0.716 on the DynaSent R1 dataset, and a macro avg f1-score of 0.589 on the DynaSent R2 dataset. These f1-scores
# all outperformed the baseline model given by taking the output hidden states above the [CLS] token for each 
# sentence. We found representations for sentences by summing the output hidden states above each token. Our tokenization
# scheme was the one included with BERT-small. Finally, we used the TorchDeepNeuralClassifier class as the base class
# for our model. 

# This cell is for fitting the pytorch model to the data.
# Necessary imports
try:
    # Sort of randomly chosen import to see whether the requirements
    # are met:
    import datasets
except ModuleNotFoundError:
    !git clone https://github.com/cgpotts/cs224u/
    !pip install -r cs224u/requirements.txt
    import sys
    sys.path.append("cs224u")
from datasets import load_dataset
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report, f1_score
import torch
import torch.nn as nn
import copy
import random  # used by `fix_random_seeds`
import sys     # used by `progress_bar`
import numpy as np
import pickle
from sklearn.model_selection import train_test_split

# For the pretrained model we chose
from transformers import AutoModel
from transformers import AutoTokenizer


# Step 0: Copied code from utils.py, torch_model_base.py, torch_shallow_neural_classifier.py, and 
# torch_deep_neural_classifier.py for the autograder
def progress_bar(msg, verbose=True):
    """
    Simple over-writing progress bar.

    """
    if verbose:
        sys.stderr.write('\r')
        sys.stderr.write(msg)
        sys.stderr.flush()


def fix_random_seeds(
        seed=42,
        set_system=True,
        set_torch=True,
        set_tensorflow=False,
        set_torch_cudnn=True):
    """
    Fix random seeds for reproducibility.

    Parameters
    ----------
    seed : int
        Random seed to be set.

    set_system : bool
        Whether to set `np.random.seed(seed)` and `random.seed(seed)`

    set_tensorflow : bool
        Whether to set `tf.random.set_random_seed(seed)`

    set_torch : bool
        Whether to set `torch.manual_seed(seed)`

    set_torch_cudnn: bool
        Flag for whether to enable cudnn deterministic mode.
        Note that deterministic mode can have a performance impact,
        depending on your model.
        https://pytorch.org/docs/stable/notes/randomness.html

    Notes
    -----
    The function checks that PyTorch and TensorFlow are installed
    where the user asks to set seeds for them. If they are not
    installed, the seed-setting instruction is ignored. The intention
    is to make it easier to use this function in environments that lack
    one or both of these libraries.

    Even though the random seeds are explicitly set,
    the behavior may still not be deterministic (especially when a
    GPU is enabled), due to:

    * CUDA: There are some PyTorch functions that use CUDA functions
    that can be a source of non-determinism:
    https://pytorch.org/docs/stable/notes/randomness.html

    * PYTHONHASHSEED: On Python 3.3 and greater, hash randomization is
    turned on by default. This seed could be fixed before calling the
    python interpreter (PYTHONHASHSEED=0 python test.py). However, it
    seems impossible to set it inside the python program:
    https://stackoverflow.com/questions/30585108/disable-hash-randomization-from-within-python-program

    """
    # set system seed
    if set_system:
        np.random.seed(seed)
        random.seed(seed)

    # set torch seed
    if set_torch:
        try:
            import torch
        except ImportError:
            pass
        else:
            torch.manual_seed(seed)

    # set torch cudnn backend
    if set_torch_cudnn:
        try:
            import torch
        except ImportError:
            pass
        else:
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False


def safe_macro_f1(y, y_pred, **kwargs):
    """
    Macro-averaged F1, forcing `sklearn` to report as a multiclass
    problem even when there are just two classes. `y` is the list of
    gold labels and `y_pred` is the list of predicted labels.

    """
    return f1_score(y, y_pred, average='macro', pos_label=None)


class TorchModelBase:
    def __init__(self,
            batch_size=1028,
            max_iter=1000,
            eta=0.001,
            optimizer_class=torch.optim.Adam,
            l2_strength=0,
            gradient_accumulation_steps=1,
            max_grad_norm=None,
            warm_start=False,
            early_stopping=False,
            validation_fraction=0.1,
            shuffle_train=True,
            n_iter_no_change=10,
            tol=1e-5,
            device=None,
            display_progress=True,
            **optimizer_kwargs):
        """
        Base class for all the PyTorch-based models.

        Parameters
        ----------
        batch_size: int
            Number of examples per batch. Batching is handled by a
            `torch.utils.data.DataLoader`. Final batches can have fewer
            examples, depending on the total number of examples in the
            dataset.

        max_iter: int
            Maximum number of training iterations. This will interact
            with `early_stopping`, `n_iter_no_change`, and `tol` in the
            sense that this limit will be reached if and only if the
            conditions triggered by those other parameters are not met.

        eta : float
            Learning rate for the optimizer.

        optimizer_class: `torch.optimizer.Optimizer`
            Any PyTorch optimizer should work. Additional arguments
            can be passed to this object via `**optimizer_kwargs`. The
            optimizer itself is built by `self.build_optimizer` when
            `fit` is called.

        l2_strength: float
            L2 regularization parameters for the optimizer. The default
            of 0 means no regularization, and larger values correspond
            to stronger regularization.

        gradient_accumulation_steps: int
            Controls how often the model parameters are updated during
            learning. For example, with `gradient_accumulation_steps=2`,
            the parameters are updated after every other batch. The primary
            use case for `gradient_accumulation_steps > 1` is where the
            model is very large, so only small batches of examples can be
            fit into memory. The updates based on these small batches can
            have high variance, so accumulating a few batches before
            updating can smooth the process out.

        max_grad_norm: None or float
            If not `None`, then `torch.nn.utils.clip_grad_norm_` is used
            to rescale the gradients so that their overall norm does not
            exceed this value. This is a kind of brute-force way of keeping
            individual updates from growing absurdly large.

        warm_start: bool
            If `False`, then repeated calls to `fit` will reset all the
            optimization settings: the model parameters, the optimizer,
            and the metadata we collect during optimization. If `True`,
            then calling `fit` twice with `max_iter=N` should be the same
            as calling fit once with `max_iter=N*2`.

        early_stopping: bool
            If `True`, then `validation_fraction` of the data given to
            `fit` are held out and used to assess the model after every
            epoch. The best scoring model is stored in an attribute
            `best_parameters`. If an improvement of at least `self.tol`
            isn't seen after `n_iter_no_change` iterations, then training
            stops and `self.model` is set to use `best_parameters`.

        validation_fraction: float
            Percentage of the data given to `fit` to hold out for use in
            early stopping. Ignored if `early_stopping=False`

        shuffle_train: bool
            Whether to shuffle the training data.

        n_iter_no_change: int
            Number of epochs used to control convergence and early
            stopping. Where `early_stopping=True`, training stops if an
            improvement of more than `self.tol` isn't seen after this
            many epochs. If `early_stopping=False`, then training stops
            if the epoch error doesn't drop by at least `self.tol` after
            this many epochs.

        tol: float
            Value used to control `early_stopping` and convergence.

        device: str or None
            Used to set the device on which the PyTorch computations will
            be done. If `device=None`, this will choose a CUDA device if
            one is available, else the CPU is used.

        display_progress: bool
            Whether to print optimization information incrementally to
            `sys.stderr` during training.

        **optimizer_kwargs: kwargs
            Any additional keywords given to the model will be passed to
            the optimizer -- see `self.build_optimizer`. The intent is to
            make it easy to tune these as hyperparameters while still
            allowing the user to specify just `optimizer_class` rather
            than setting up a full optimizer.

        Attributes
        ----------
        params: list
             All the keyword arguments are parameters and, with the
             exception of `display_progress`, their names are added to
             this list to support working with them using tools from
             `sklearn.model_selection`.

        """
        self.batch_size = batch_size
        self.max_iter = max_iter
        self.eta = eta
        self.optimizer_class = optimizer_class
        self.l2_strength = l2_strength
        self.gradient_accumulation_steps = max([gradient_accumulation_steps, 1])
        self.max_grad_norm = max_grad_norm
        self.warm_start = warm_start
        self.early_stopping = early_stopping
        self.validation_fraction = validation_fraction
        self.shuffle_train = shuffle_train
        self.n_iter_no_change = n_iter_no_change
        self.tol = tol
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = torch.device(device)
        self.display_progress = display_progress
        self.optimizer_kwargs = optimizer_kwargs
        for k, v in self.optimizer_kwargs.items():
            setattr(self, k, v)
        self.params = [
            'batch_size',
            'max_iter',
            'eta',
            'optimizer_class',
            'l2_strength',
            'gradient_accumulation_steps',
            'max_grad_norm',
            'validation_fraction',
            'early_stopping',
            'n_iter_no_change',
            'warm_start',
            'tol']
        self.params += list(optimizer_kwargs.keys())

    def build_dataset(self, *args, **kwargs):
        """
        Subclasses are required to define this method. Perhaps the most
        important design note is that the function should be prepared to
        return datasets that are appropriate for both training and
        prediction. For training, we expect `*args` to have labels in
        final position. For prediction, we expect all of `*args` to be
        model inputs. For example, in a simple classifier, we expect
        `*args` to be a pair `(X, y)` for training and so this method
        should return something like:

        `torch.utils.data.TensorDataset(X, y)`

        For prediction, we get only `X`, so we should return

        `torch.utils.data.TensorDataset(X)`

        Parameters
        ----------
        *args: any arguments to be used to create the dataset

        **kwargs: any desired keyword arguments

        Returns
        -------
        `torch.utils.data.Dataset` or a custom subclass thereof

        """
        raise NotImplementedError

    def build_graph(self, *args, **kwargs):
        """
        Build the core computational graph. This is called only after
        `fit` is called. The return value of this function becomes
        the `self.model` attribute.

        Parameters
        ----------
        *args: any arguments to be used to create the dataset

        **kwargs: any desired keyword arguments

        Returns
        -------
        nn.Module or subclass thereof

        """
        raise NotImplementedError

    def score(self, *args):
        """
        Required by the `sklearn.model_selection` tools. This function
        needs to take the same arguments as `fit`. Here, `*args` is usually
        an `(X, y)` pair of features and labels, and `self.predict(X)`
        is called and then some kind of scoring function is used to
        compare those predictions with `y`. The return value should be
        some kind of appropriate score for the model in question.

        Notes
        -----
        For early stopping, we use this function to get scores and
        assume that larger scores are better. This would conflict with
        using, say, a mean-squared-error scoring function.

        """
        raise NotImplementedError

    def build_optimizer(self):
        """
        Builds the optimizer. This function is called only when `fit`
        is called.

        Returns
        -------
        torch.optimizer.Optimizer

        """
        return self.optimizer_class(
            self.model.parameters(),
            lr=self.eta,
            weight_decay=self.l2_strength,
            **self.optimizer_kwargs)

    def fit(self, *args):
        """
        Generic optimization method.

        Parameters
        ----------
        *args: list of objects
            We assume that the final element of args gives the labels
            and all the preceding elements give the system inputs.
            For regular supervised learning, this is like (X, y), but
            we allow for models that might use multiple data structures
            for their inputs.

        Attributes
        ----------
        model: nn.Module or subclass thereof
            Set by `build_graph`. If `warm_start=True`, then this is
            initialized only by the first call to `fit`.

        optimizer: torch.optim.Optimizer
            Set by `build_optimizer`. If `warm_start=True`, then this is
            initialized only by the first call to `fit`.

        errors: list of float
            List of errors. If `warm_start=True`, then this is
            initialized only by the first call to `fit`. Thus, where
            `max_iter=5`, if we call `fit` twice with `warm_start=True`,
            then `errors` will end up with 10 floats in it.

        validation_scores: list
            List of scores. This is filled only if `early_stopping=True`.
            If `warm_start=True`, then this is initialized only by the
            first call to `fit`. Thus, where `max_iter=5`, if we call
            `fit` twice with `warm_start=True`, then `validation_scores`
            will end up with 10 floats in it.

        no_improvement_count: int
            Used to control early stopping and convergence. These values
            are controlled by `_update_no_improvement_count_early_stopping`
            or `_update_no_improvement_count_errors`.  If `warm_start=True`,
            then this is initialized only by the first call to `fit`. Thus,
            in that situation, the values could accumulate across calls to
            `fit`.

        best_error: float
           Used to control convergence. Smaller is assumed to be better.
           If `warm_start=True`, then this is initialized only by the first
           call to `fit`. It will be reset by
           `_update_no_improvement_count_errors` depending on how the
           optimization is proceeding.

        best_score: float
           Used to control early stopping. If `warm_start=True`, then this
           is initialized only by the first call to `fit`. It will be reset
           by `_update_no_improvement_count_early_stopping` depending on how
           the optimization is proceeding. Important: we currently assume
           that larger scores are better. As a result, we will not get the
           correct results for, e.g., a scoring function based on
           `mean_squared_error`. See `self.score` for additional details.

        best_parameters: dict
            This is a PyTorch state dict. It is used if and only if
            `early_stopping=True`. In that case, it is updated whenever
            `best_score` is improved numerically. If the early stopping
            criteria are met, then `self.model` is reset to contain these
            parameters before `fit` exits.

        Returns
        -------
        self

        """
        if self.early_stopping:
            args, dev = self._build_validation_split(
                *args, validation_fraction=self.validation_fraction)

        # Dataset:
        dataset = self.build_dataset(*args)
        dataloader = self._build_dataloader(dataset, shuffle=self.shuffle_train)

        # Set up parameters needed to use the model. This is a separate
        # function to support using pretrained models for prediction,
        # where it might not be desirable to call `fit`.
        self.initialize()

        # Make sure the model is where we want it:
        self.model.to(self.device)

        self.model.train()
        self.optimizer.zero_grad()

        for iteration in range(1, self.max_iter+1):

            epoch_error = 0.0

            for batch_num, batch in enumerate(dataloader, start=1):

                batch = [x.to(self.device) for x in batch]

                X_batch = batch[: -1]
                y_batch = batch[-1]

                batch_preds = self.model(*X_batch)

                err = self.loss(batch_preds, y_batch)

                if self.gradient_accumulation_steps > 1 and \
                  self.loss.reduction == "mean":
                    err /= self.gradient_accumulation_steps

                err.backward()

                epoch_error += err.item()

                if batch_num % self.gradient_accumulation_steps == 0 or \
                  batch_num == len(dataloader):
                    if self.max_grad_norm is not None:
                        torch.nn.utils.clip_grad_norm_(
                            self.model.parameters(), self.max_grad_norm)
                    self.optimizer.step()
                    self.optimizer.zero_grad()

            # Stopping criteria:

            if self.early_stopping:
                self._update_no_improvement_count_early_stopping(*dev)
                if self.no_improvement_count > self.n_iter_no_change:
                    progress_bar(
                        "Stopping after epoch {}. Validation score did "
                        "not improve by tol={} for more than {} epochs. "
                        "Final error is {}".format(iteration, self.tol,
                            self.n_iter_no_change, epoch_error),
                        verbose=self.display_progress)
                    break

            else:
                self._update_no_improvement_count_errors(epoch_error)
                if self.no_improvement_count > self.n_iter_no_change:
                    progress_bar(
                        "Stopping after epoch {}. Training loss did "
                        "not improve more than tol={}. Final error "
                        "is {}.".format(iteration, self.tol, epoch_error),
                        verbose=self.display_progress)
                    break

            progress_bar(
                "Finished epoch {} of {}; error is {}".format(
                    iteration, self.max_iter, epoch_error),
                verbose=self.display_progress)

        if self.early_stopping:
            self.model.load_state_dict(self.best_parameters)

        return self

    def initialize(self):
        """
        Method called by `fit` to establish core attributes. To use a
        pretrained model without calling `fit`, one can use this
        method.

        """
        if not self.warm_start or not hasattr(self, "model"):
            self.model = self.build_graph()
            # This device move has to happen before the optimizer is built:
            # https://pytorch.org/docs/master/optim.html#constructing-it
            self.model.to(self.device)
            self.optimizer = self.build_optimizer()
            self.errors = []
            self.validation_scores = []
            self.no_improvement_count = 0
            self.best_error = np.inf
            self.best_score = -np.inf
            self.best_parameters = None

    @staticmethod
    def _build_validation_split(*args, validation_fraction=0.2):
        """
        Split `*args` into train and dev portions for early stopping.
        We use `train_test_split`. For args of length N, this delivers
        N*2 objects, arranged as

        X1_train, X1_test, X2_train, X2_test, ..., y_train, y_test

        Parameters
        ----------
        *args: List of objects to split.

        validation_fraction: float
            Fraction of the examples to use for the dev portion. In
            `fit`, this is determined by `self.validation_fraction`.
            We give it as an argument here to facilitate unit testing.

        Returns
        -------
        Pair of tuples `train` and `dev`

        """
        if validation_fraction == 1.0:
            return args, args
        results = train_test_split(*args, test_size=validation_fraction)
        train = results[::2]
        dev = results[1::2]
        return train, dev

    def _build_dataloader(self, dataset, shuffle=True):
        """
        Internal method used to create a dataloader from a dataset.
        This is used by `fit` and `_predict`.

        Parameters
        ----------
        dataset: torch.utils.data.Dataset

        shuffle: bool
            When training, this is `True`. For prediction, this is
            crucially set to `False` so that the examples are not
            shuffled out of order with respect to labels that might
            be used for assessment.

        Returns
        -------
        torch.utils.data.DataLoader

        """
        if hasattr(dataset, "collate_fn"):
            collate_fn = dataset.collate_fn
        else:
            collate_fn = None
        dataloader = torch.utils.data.DataLoader(
            dataset,
            batch_size=self.batch_size,
            shuffle=shuffle,
            pin_memory=True,
            collate_fn=collate_fn)
        return dataloader

    def _update_no_improvement_count_early_stopping(self, *dev):
        """
        Internal method used by `fit` to control early stopping.
        The method uses `self.score(*dev)` for scoring and updates
        `self.validation_scores`, `self.no_improvement_count`,
        `self.best_score`, `self.best_parameters` as appropriate.

        """
        score = self.score(*dev)
        self.validation_scores.append(score)
        # If the score isn't at least `self.tol` better, increment:
        if score < (self.best_score + self.tol):
            self.no_improvement_count += 1
        else:
            self.no_improvement_count = 0
        # If the current score is numerically better than all previous
        # scores, update the best parameters:
        if score > self.best_score:
            self.best_parameters = copy.deepcopy(self.model.state_dict())
            self.best_score = score
        self.model.train()

    def _update_no_improvement_count_errors(self, epoch_error):
        """
        Internal method used by `fit` to control convergence.
        The method uses `epoch_error`, `self.best_error`, and
        `self.tol` to make decisions, and it updates `self.errors`,
        `self.no_improvement_count`, and `self.best_error` as
        appropriate.

        """
        if epoch_error > (self.best_error - self.tol):
            self.no_improvement_count += 1
        else:
            self.no_improvement_count = 0
        if epoch_error < self.best_error:
            self.best_error = epoch_error
        self.errors.append(epoch_error)

    def _predict(self, *args, device=None):
        """
        Internal method that subclasses are expected to use to define
        their own `predict` functions. The hope is that this method
        can do all the data organization and other details, allowing
        subclasses to have compact predict methods that just encode
        the core logic specific to them.

        Parameters
        ----------
        *args: system inputs

        device: str or None
            Allows the user to temporarily change the device used
            during prediction. This is useful if predictions require a
            lot of memory and so are better done on the CPU. After
            prediction is done, the model is returned to `self.device`.

        Returns
        -------
        The precise return value depends on the nature of the predictions.
        If the predictions have the same shape across all batches, then
        we return a single tensor concatenation of them. If the shape
        can vary across batches, as is common for sequence prediction,
        then we return a list of tensors of varying length.

        """
        device = self.device if device is None else torch.device(device)

        # Dataset:
        dataset = self.build_dataset(*args)
        dataloader = self._build_dataloader(dataset, shuffle=False)

        # Model:
        self.model.to(device)
        self.model.eval()

        preds = []
        with torch.no_grad():
            for batch in dataloader:
                X = [x.to(device) for x in batch]
                preds.append(self.model(*X))

        # Make sure the model is back on the instance device:
        self.model.to(self.device)

        # If the batch outputs differ only in their batch size, sharing
        # all other dimensions, then we can concatenate them and maintain
        # a tensor. For simple classification problems, this should hold.
        if all(x.shape[1: ] == preds[0].shape[1: ] for x in preds[1: ]):
            return torch.cat(preds, axis=0)
        # The batch outputs might differ along other dimensions. This is
        # common for sequence prediction, where different batches might
        # have different max lengths, since we pad on a per-batch basis.
        # In this case, we can't concatenate them, so we return a list
        # of the predictions, where each prediction is a tensor. Note:
        # the predictions might still be padded and so need trimming on a
        # per example basis.
        else:
            return [p for batch in preds for p in batch]

    def get_params(self, deep=True):
        params = self.params.copy()
        # Obligatorily add `vocab` so that sklearn passes it in when
        # creating new model instances during cross-validation:
        if hasattr(self, 'vocab'):
            params += ['vocab']
        return {p: getattr(self, p) for p in params}

    def set_params(self, **params):
        for key, val in params.items():
            if key not in self.params:
                raise ValueError(
                    "{} is not a parameter for {}. For the list of "
                    "available parameters, use `self.params`.".format(
                        key, self.__class__.__name__))
            else:
                setattr(self, key, val)
        return self

    def to_pickle(self, output_filename):
        """
        Serialize the entire class instance. Importantly, this is
        different from using the standard `torch.save` method:

        torch.save(self.model.state_dict(), output_filename)

        The above stores only the underlying model parameters. In
        contrast, the current method ensures that all of the model
        parameters are on the CPU and then stores the full instance.
        This is necessary to ensure that we retain all the information
        needed to read new examples, do additional training, make
        predictions, and so forth.

        Parameters
        ----------
        output_filename : str
            Full path for the output file.

        """
        self.model = self.model.cpu()
        with open(output_filename, 'wb') as f:
            pickle.dump(self, f)

    @staticmethod
    def from_pickle(src_filename):
        """
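        Load an entire class instance onto the CPU. Note that this
        method does not itself modify `warm_start`: set
        `warm_start=True` on the loaded instance if the loaded
        parameters should be used when `fit` is called, since
        `initialize` otherwise rebuilds the model.

        Importantly, this is different from the recommended PyTorch method: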

        self.model.load_state_dict(torch.load(src_filename))

        We cannot reliably do this with new instances, because we need
        to see new examples in order to set some of the model
        dimensionalities and obtain information about what the class
        labels are. Thus, the current method loads an entire serialized
        class as created by `to_pickle`.

        The training and prediction code move the model parameters to
        `self.device`.

        Parameters
        ----------
        src_filename : str
            Full path to the serialized model file.

        """
        with open(src_filename, 'rb') as f:
            return pickle.load(f)

    def __repr__(self):
        param_str = ["{}={}".format(a, getattr(self, a)) for a in self.params]
        param_str = ",\n\t".join(param_str)
        return "{}(\n\t{})".format(self.__class__.__name__, param_str)
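

# Illustrative sketch (not part of the original `torch_model_base.py`): a
# plain-Python replay of the early-stopping rule that `fit` and
# `_update_no_improvement_count_early_stopping` apply above. The helper
# name `sketch_early_stopping_epoch`, its arguments, and its default
# values are hypothetical and chosen only for illustration. As in the
# methods above, larger scores are assumed to be better.
def sketch_early_stopping_epoch(scores, tol=1e-5, n_iter_no_change=10):
    """Return `(stop_epoch, best_epoch)`, both 1-indexed. `stop_epoch`
    is None if the stopping rule never fires for `scores`."""
    best_score = float("-inf")
    best_epoch = None
    no_improvement_count = 0
    for epoch, score in enumerate(scores, start=1):
        # Count epochs that fail to beat the best score by at least `tol`:
        if score < best_score + tol:
            no_improvement_count += 1
        else:
            no_improvement_count = 0
        # Any numerical improvement updates the stored best epoch, which
        # stands in for `best_parameters` in the real method:
        if score > best_score:
            best_score = score
            best_epoch = epoch
        if no_improvement_count > n_iter_no_change:
            return epoch, best_epoch
    return None, best_epoch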


class TorchShallowNeuralClassifier(TorchModelBase):
    def __init__(self,
            hidden_dim=50,
            hidden_activation=nn.Tanh(),
            **base_kwargs):
        """
        A model

        h = f(xW_xh + b_h)
        y = softmax(hW_hy + b_y)

        with a cross-entropy loss and f determined by `hidden_activation`.

        Parameters
        ----------
        hidden_dim : int
            Dimensionality of the hidden layer.

        hidden_activation : nn.Module
            The non-linear activation function used by the network for the
            hidden layer.

        **base_kwargs
            For details, see `torch_model_base.py`.

        Attributes
        ----------
        loss: nn.CrossEntropyLoss(reduction="mean")

        self.params: list
            Extends TorchModelBase.params with names for all of the
            arguments for this class to support tuning of these values
            using `sklearn.model_selection` tools.

        """
        self.hidden_dim = hidden_dim
        self.hidden_activation = hidden_activation
        super().__init__(**base_kwargs)
        self.loss = nn.CrossEntropyLoss(reduction="mean")
        self.params += ['hidden_dim', 'hidden_activation']

    def build_graph(self):
        """
        Define the model's computation graph.

        Returns
        -------
        nn.Module

        """
        return nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            self.hidden_activation,
            nn.Linear(self.hidden_dim, self.n_classes_))

    def build_dataset(self, X, y=None):
        """
        Define datasets for the model.

        Parameters
        ----------
        X : iterable of length `n_examples`
           Each element must have the same length.

        y: None or iterable of length `n_examples`

        Attributes
        ----------
        input_dim : int
            Set based on `X.shape[1]` after `X` has been converted to
            `np.array`.

        Returns
        -------
        `torch.utils.data.TensorDataset`. Where `y=None`, the dataset will
        yield single tensors `X`. Where `y` is specified, it will yield
        `(X, y)` pairs.

        """
        X = np.array(X)
        self.input_dim = X.shape[1]
        X = torch.FloatTensor(X)
        if y is None:
            dataset = torch.utils.data.TensorDataset(X)
        else:
            self.classes_ = sorted(set(y))
            self.n_classes_ = len(self.classes_)
            class2index = dict(zip(self.classes_, range(self.n_classes_)))
            y = [class2index[label] for label in y]
            y = torch.tensor(y)
            dataset = torch.utils.data.TensorDataset(X, y)
        return dataset

    def score(self, X, y, device=None):
        """
        Uses macro-F1 as the score function. Note: this departs from
        `sklearn`, where classifiers use accuracy as their scoring
        function. Using macro-F1 is more consistent with our course.

        This function can be used to evaluate models, but its primary
        use is in cross-validation and hyperparameter tuning.

        Parameters
        ----------
        X: np.array, shape `(n_examples, n_features)`

        y: iterable, length `n_examples`
            These can be the raw labels. They will be converted internally
            as needed. See `build_dataset`.

        device: str or None
            Allows the user to temporarily change the device used
            during prediction. This is useful if predictions require a
            lot of memory and so are better done on the CPU. After
            prediction is done, the model is returned to `self.device`.

        Returns
        -------
        float

        """
        preds = self.predict(X, device=device)
        return utils.safe_macro_f1(y, preds)

    def predict_proba(self, X, device=None):
        """
        Predicted probabilities for the examples in `X`.

        Parameters
        ----------
        X : np.array, shape `(n_examples, n_features)`

        device: str or None
            Allows the user to temporarily change the device used
            during prediction. This is useful if predictions require a
            lot of memory and so are better done on the CPU. After
            prediction is done, the model is returned to `self.device`.

        Returns
        -------
        np.array, shape `(len(X), self.n_classes_)`
            Each row of this matrix will sum to 1.0.

        """
        preds = self._predict(X, device=device)
        probs = torch.softmax(preds, dim=1).cpu().numpy()
        return probs

    def predict(self, X, device=None):
        """
        Predicted labels for the examples in `X`. These are converted
        from the integers that PyTorch needs back to their original
        values in `self.classes_`.

        Parameters
        ----------
        X : np.array, shape `(n_examples, n_features)`

        device: str or None
            Allows the user to temporarily change the device used
            during prediction. This is useful if predictions require a
            lot of memory and so are better done on the CPU. After
            prediction is done, the model is returned to `self.device`.

        Returns
        -------
        list, length len(X)

        """
        probs = self.predict_proba(X, device=device)
        return [self.classes_[i] for i in probs.argmax(axis=1)]
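

# Illustrative sketch (hypothetical helper, not part of the original
# module): how `build_dataset` maps raw labels to integer indices and how
# `predict` maps predicted indices back to labels via `self.classes_`.
# The name `sketch_label_round_trip` exists only for this illustration.
def sketch_label_round_trip(y, predicted_indices):
    """Return `(classes, encoded, decoded)` for labels `y` and a list
    of predicted class indices."""
    # Sorted unique labels play the role of `self.classes_`:
    classes = sorted(set(y))
    class2index = dict(zip(classes, range(len(classes))))
    # Integer targets, as passed to the PyTorch loss during training:
    encoded = [class2index[label] for label in y]
    # Predicted indices converted back to the original label values:
    decoded = [classes[i] for i in predicted_indices]
    return classes, encoded, decoded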


def simple_example():
    """Assess on the digits dataset."""
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, accuracy_score

    utils.fix_random_seeds()

    digits = load_digits()
    X = digits.data
    y = digits.target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    mod = TorchShallowNeuralClassifier()

    print(mod)

    mod.fit(X_train, y_train)
    preds = mod.predict(X_test)

    print("\nClassification report:")

    print(classification_report(y_test, preds))

    return accuracy_score(y_test, preds)


if __name__ == '__main__':
    simple_example()


__author__ = "Atticus Geiger"
__version__ = "CS224u, Stanford, Spring 2022"


class ActivationLayer(torch.nn.Module):
    def __init__(self, input_dim, output_dim, device, hidden_activation):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim, device=device)
        self.activation = hidden_activation

    def forward(self, x):
        return self.activation(self.linear(x))


class TorchDeepNeuralClassifier(TorchShallowNeuralClassifier):
    def __init__(self,
            num_layers=1,
            **base_kwargs):
        """
        A dense, feed-forward network with the number of hidden layers
        set by `num_layers`.

        Parameters
        ----------
        num_layers : int
            Number of hidden layers in the network.

        **base_kwargs
            For details, see `torch_model_base.py`.

        Attributes
        ----------
        loss: nn.CrossEntropyLoss(reduction="mean")

        self.params: list
            Extends TorchModelBase.params with names for all of the
            arguments for this class to support tuning of these values
            using `sklearn.model_selection` tools.

        """
        self.num_layers = num_layers
        super().__init__(**base_kwargs)
        self.loss = nn.CrossEntropyLoss(reduction="mean")
        self.params += ['num_layers']

    def build_graph(self):
        """
        Define the model's computation graph.

        Returns
        -------
        nn.Module

        """
        # Input to hidden:
        self.layers = [
            ActivationLayer(
                self.input_dim, self.hidden_dim, self.device, self.hidden_activation)]
        # Hidden to hidden:
        for _ in range(self.num_layers-1):
            self.layers += [
                ActivationLayer(
                    self.hidden_dim, self.hidden_dim, self.device, self.hidden_activation)]
        # Hidden to output:
        self.layers.append(
            nn.Linear(self.hidden_dim, self.n_classes_, device=self.device))
        return nn.Sequential(*self.layers)



def simple_example():
    """Assess on the digits dataset."""
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, accuracy_score

    utils.fix_random_seeds()

    digits = load_digits()
    X = digits.data
    y = digits.target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    mod = TorchDeepNeuralClassifier(num_layers=2)

    print(mod)

    mod.fit(X_train, y_train)
    preds = mod.predict(X_test)

    print("\nClassification report:")

    print(classification_report(y_test, preds))

    return accuracy_score(y_test, preds)


if __name__ == '__main__':
    simple_example()



# Step 1: Choose a pretrained model to use from Hugging Face
# https://huggingface.co/prajjwal1/bert-small
weights_name = "prajjwal1/bert-small"


# Step 2: Use tokenizer that comes with model
tokenizer = AutoTokenizer.from_pretrained(weights_name)


# Step 3: Potentially look for more datasets; if not, just use DynaSent Round 1 and Round 2 and SST
dynasent_r1 = load_dataset("dynabench/dynasent", 'dynabench.dynasent.r1.all')
dynasent_r2 = load_dataset("dynabench/dynasent", 'dynabench.dynasent.r2.all')
sst = load_dataset("SetFit/sst5")

def convert_sst_label(s):
    # Keep only the last word of the label string, so that, e.g.,
    # "very positive" becomes "positive" and SST uses ternary labels:
    return s.split(" ")[-1]


for splitname in ('train', 'validation', 'test'):
    dist = [convert_sst_label(s) for s in sst[splitname]['label_text']]
    sst[splitname] = sst[splitname].add_column('gold_label', dist)
    sst[splitname] = sst[splitname].add_column('sentence', sst[splitname]['text'])
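

# Illustrative sketch (hypothetical helper, not part of the original
# system): the label conversion above keeps only the last word of each
# label string. The five label strings below are an assumption about the
# values of `label_text` in the SST-5 dataset, used only for illustration.
def sketch_collapse_to_ternary(label_text):
    """Map a five-way label string to its ternary counterpart."""
    return label_text.split(" ")[-1]


SKETCH_SST5_LABELS = [
    "very negative", "negative", "neutral", "positive", "very positive"]
SKETCH_TERNARY_LABELS = sorted(
    {sketch_collapse_to_ternary(s) for s in SKETCH_SST5_LABELS})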


# Step 4: Get representations of tokens
def get_batch_token_ids(batch, tokenizer):
    """Map `batch` to a tensor of ids. The return
    value should meet the following specification:

    1. The max length should be 512.
    2. Examples longer than the max length should be truncated.
    3. Examples should be padded to the max length for the batch.
    4. The special token [CLS] should be added to the start and the
       special token [SEP] should be added to the end.
    5. The attention mask should be returned.
    6. The return value of each component should be a tensor.

    Parameters
    ----------
    batch: list of str
    tokenizer: Hugging Face tokenizer

    Returns
    -------
    dict with at least "input_ids" and "attention_mask" as keys,
    each with Tensor values

    """
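    max_length = 512
    # Note: `padding='max_length'` pads every example to `max_length`
    # tokens. To pad only to the longest example in `batch`, as in item 3
    # above, one would use `padding='longest'` instead.
    return tokenizer.batch_encode_plus(
        batch,
        add_special_tokens=True,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt',
        return_attention_mask=True)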


# Step 5: Define the graph for the neural network
class BertClassifierModule(nn.Module):
    def __init__(self, 
            n_classes, 
            hidden_activation):
        """This module loads a Transformer, adds a dense layer with activation 
        function given by `hidden_activation`, and puts a classifier
        layer on top of that as the final output. The output of
        the dense layer should have the same dimensionality as the
        model input.

        Parameters
        ----------
        n_classes : int
            Number of classes for the output layer
        hidden_activation : torch activation function
            e.g., nn.Tanh()

        Notes
        -----
        The name of the pretrained Hugging Face model is not a parameter
        here: it is read from the module-level variable `weights_name`.

        """
        super().__init__()
        self.n_classes = n_classes
        self.bert = AutoModel.from_pretrained(weights_name) # v1 and v2

        self.bert.train()
        self.hidden_activation = hidden_activation
        self.hidden_dim = self.bert.embeddings.word_embeddings.embedding_dim
        # Add the new parameters here using `nn.Sequential`. 
        # We can define this layer as
        # 
        #  h = f(cW1 + b_h)
        #  y = hW2 + b_y
        #
        # where c is the representation passed in by `forward` (here,
        # the sum of the final hidden states; see `forward`),
        # W1 has dimensionality (self.hidden_dim, self.hidden_dim),
        # W2 has dimensionality (self.hidden_dim, self.n_classes),
        # and we rely on the PyTorch loss function to apply a
        # softmax to y.
        self.classifier_layer = nn.Sequential(
            nn.Linear(self.hidden_dim, self.hidden_dim, bias=True),
            self.hidden_activation,
            nn.Linear(self.hidden_dim, self.n_classes, bias=True)
        )



    def forward(self, indices, mask):
        """Process `indices` with `mask` by feeding these arguments
        to `self.bert` and then feeding the sum over sequence positions
        of `last_hidden_state` to `self.classifier_layer`. Note that,
        as written, this sum includes the padded positions as well as
        the [CLS] position.

        Parameters
        ----------
        indices : torch.LongTensor of shape (n_batch, k)
            Indices into the `self.bert` embedding layer. `n_batch` is
            the number of examples and `k` is the sequence length for
            this batch
        mask : torch.LongTensor of shape (n_batch, k)
            Binary vector indicating which values should be masked.
            `n_batch` is the number of examples and `k` is the
            sequence length for this batch

        Returns
        -------
        tensor.FloatTensor
            Predicted values, shape `(n_batch, self.n_classes)`

        """
        maskreps = self.bert(indices, attention_mask=mask)
        return self.classifier_layer(torch.sum(maskreps.last_hidden_state, dim=1))



# Step 6: Use torch_deep_neural_classifier and fine tune it on the datasets
class OriginalClassifier(TorchDeepNeuralClassifier):
    def __init__(self, *args, **kwargs):
        self.tokenizer = tokenizer
        super().__init__(*args, **kwargs)

    def build_graph(self):
        return BertClassifierModule(
            self.n_classes_, self.hidden_activation)

    def build_dataset(self, X, y=None):
        data = get_batch_token_ids(X, self.tokenizer)
        if y is None:
            dataset = torch.utils.data.TensorDataset(
                data['input_ids'], data['attention_mask'])
        else:
            self.classes_ = sorted(set(y))
            self.n_classes_ = len(self.classes_)
            class2index = dict(zip(self.classes_, range(self.n_classes_)))
            y = [class2index[label] for label in y]
            y = torch.tensor(y)
            dataset = torch.utils.data.TensorDataset(
                data['input_ids'], data['attention_mask'], y)
        return dataset

# Step 7: Create and train model
original_model = OriginalClassifier(
    hidden_activation=nn.ReLU(),
    eta=0.00005,          # Low learning rate for effective fine-tuning.
    batch_size=8,         # Small batches to avoid memory overload.
    gradient_accumulation_steps=4,  # Increase the effective batch size to 32.
    early_stopping=True,  # Early-stopping
    n_iter_no_change=5)   # params.
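

# Illustrative sketch (hypothetical helper, not part of the original
# system): with `batch_size=8` and `gradient_accumulation_steps=4`, the
# training loop in `fit` steps the optimizer once every four mini-batches,
# plus once on the final mini-batch of the epoch, so most updates reflect
# 8 * 4 = 32 examples.
def sketch_optimizer_step_batches(n_batches, gradient_accumulation_steps):
    """Return the 1-indexed mini-batch numbers after which the optimizer
    would step, following the condition used in `fit`."""
    return [
        batch_num for batch_num in range(1, n_batches + 1)
        if batch_num % gradient_accumulation_steps == 0
        or batch_num == n_batches]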


# train on all 3 datasets at once
X = dynasent_r1['train']['sentence'] + dynasent_r2['train']['sentence'] + sst['train']['sentence']
y = dynasent_r1['train']['gold_label'] + dynasent_r2['train']['gold_label'] + sst['train']['gold_label']

_ = original_model.fit(X, y)

# Save the trained model to a file and download it so that it persists
# after the current Colab environment resets:
filename = 'sum_representation_model.pkl'
original_model.to_pickle(filename)
from google.colab import files
files.download(filename)

sst_preds = original_model.predict(sst['validation']['sentence'])
print(classification_report(sst['validation']['gold_label'], sst_preds, digits=3))
dynasent_r1_preds = original_model.predict(dynasent_r1['validation']['sentence'])
print(classification_report(dynasent_r1['validation']['gold_label'], dynasent_r1_preds, digits=3))
dynasent_r2_preds = original_model.predict(dynasent_r2['validation']['sentence'])
print(classification_report(dynasent_r2['validation']['gold_label'], dynasent_r2_preds, digits=3))

# STOP COMMENT: Please do not remove this comment.

Cloning into 'cs224u'...
remote: Enumerating objects: 2209, done.
remote: Counting objects: 100% (117/117), done.
remote: Compressing objects: 100% (78/78), done.
remote: Total 2209 (delta 51), reused 62 (delta 39), pack-reused 2092
Receiving objects: 100% (2209/2209), 41.48 MiB | 21.66 MiB/s, done.
Resolving deltas: 100% (1350/1350), done.
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/stanfordnlp/dsp (from -r cs224u/requirements.txt (line 15))
  Cloning https://github.com/stanfordnlp/dsp to /tmp/pip-req-build-to9o6_z7
  Running command git clone --filter=blob:none --quiet https://github.com/stanfordnlp/dsp /tmp/pip-req-build-to9o6_z7
  Resolved https://github.com/stanfordnlp/dsp to commit 693be4d83c5037e0c7cca5d58b42a7bb8e3b7e9a
  Preparing metadata (setup.py) ... done
Collecting jupyter>=1.0.0
  Downloading jupyter-1.0.0-py2.py3-none-any.whl (2.7 kB)
Collecting torch==1.13.1

Downloading and preparing dataset dynasent/dynabench.dynasent.r1.all (download: 16.26 MiB, generated: 23.94 MiB, post-processed: Unknown size, total: 40.20 MiB) to /root/.cache/huggingface/datasets/dynabench___dynasent/dynabench.dynasent.r1.all/1.1.0/ab89971d9ae1aacc59ed44d6855bf0e89167417257e2c2666f38e532148f2967...

Dataset dynasent downloaded and prepared to /root/.cache/huggingface/datasets/dynabench___dynasent/dynabench.dynasent.r1.all/1.1.0/ab89971d9ae1aacc59ed44d6855bf0e89167417257e2c2666f38e532148f2967. Subsequent calls will reuse this data.

Downloading and preparing dataset dynasent/dynabench.dynasent.r2.all (download: 16.26 MiB, generated: 4.89 MiB, post-processed: Unknown size, total: 21.15 MiB) to /root/.cache/huggingface/datasets/dynabench___dynasent/dynabench.dynasent.r2.all/1.1.0/ab89971d9ae1aacc59ed44d6855bf0e89167417257e2c2666f38e532148f2967...

Dataset dynasent downloaded and prepared to /root/.cache/huggingface/datasets/dynabench___dynasent/dynabench.dynasent.r2.all/1.1.0/ab89971d9ae1aacc59ed44d6855bf0e89167417257e2c2666f38e532148f2967. Subsequent calls will reuse this data.

Downloading and preparing dataset json/SetFit--sst5 to /root/.cache/huggingface/datasets/SetFit___json/SetFit--sst5-4c07b9d5881ae209/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51...

Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/SetFit___json/SetFit--sst5-4c07b9d5881ae209/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51. Subsequent calls will reuse this data.

Some weights of the model checkpoint at prajjwal1/bert-small were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Finished epoch 2 of 1000; error is 2694.325246530585
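
The training call above pairs `batch_size=8` with `gradient_accumulation_steps=4`, which the inline comment describes as an effective batch size of 32. As a minimal sketch of what that means, the snippet below reproduces in plain Python the update condition used in `TorchModelBase.fit` (`batch_num % gradient_accumulation_steps == 0 or batch_num == len(dataloader)`). The helper names `optimizer_step_batches` and `effective_batch_size` are hypothetical and introduced only for illustration; they are not part of the course code.

```python
def optimizer_step_batches(n_batches, gradient_accumulation_steps):
    """Return the 1-based batch numbers after which the optimizer steps.

    Mirrors the condition in the training loop: step when the batch
    number is a multiple of `gradient_accumulation_steps`, and always
    after the final batch of the epoch.
    """
    return [
        batch_num
        for batch_num in range(1, n_batches + 1)
        if batch_num % gradient_accumulation_steps == 0
        or batch_num == n_batches
    ]


def effective_batch_size(batch_size, gradient_accumulation_steps):
    """Number of examples contributing to one full parameter update."""
    return batch_size * gradient_accumulation_steps


# Settings used for the model above.
print(effective_batch_size(8, 4))      # 32
# A hypothetical epoch of 10 batches: updates after batches 4 and 8,
# plus one for the leftover group ending at batch 10.
print(optimizer_step_batches(10, 4))   # [4, 8, 10]
```

Note that the last group in an epoch can be smaller than the effective batch size, because a step is always taken after the final batch.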

In [None]:
# This cell is for when you are loading the pytorch model and you've already fit it to the data.

# Description: Our model is a transformer: a fine-tuned version of the pre-trained BERT model
# BERT-small (we tried larger BERT models, but they were too computationally expensive for our machines).
# The hyperparameters used for fine-tuning include a learning rate of 5e-5, a maximum sequence length of 512,
# a batch size of 8, and a maximum of 5 epochs with early stopping if the validation loss does not improve
# for 5 consecutive epochs. We evaluated the fine-tuned models using the classification_report function
# from the sklearn library, which outputs precision, recall, and f1-scores for each class in the dataset.
# Our model achieved a macro avg f1-score of 0.577 on the Stanford Sentiment Treebank dataset, a macro avg f1-score
# of 0.716 on the DynaSent R1 dataset, and a macro avg f1-score of 0.589 on the DynaSent R2 dataset. These f1-scores
# all outperformed the baseline model given by taking the output hidden states above the [CLS] token for each
# sentence. We found representations for sentences by summing the output hidden states above each token. Our tokenization
# scheme was the one included with BERT-small. Finally, we used the TorchDeepNeuralClassifier class as the base class
# for our model.

# This cell is for loading the pytorch model after it's already been fit.
# Necessary imports
try:
    # Sort of randomly chosen import to see whether the requirements
    # are met:
    import datasets
except ModuleNotFoundError:
    !git clone https://github.com/cgpotts/cs224u/
    !pip install -r cs224u/requirements.txt
    import sys
    sys.path.append("cs224u")
from datasets import load_dataset
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report, f1_score
import torch
import torch.nn as nn
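import copy
import random  # Used by `fix_random_seeds`.
import sys     # Used by `progress_bar`; otherwise only imported in the except branch above.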
import numpy as np
import pickle
from sklearn.model_selection import train_test_split

# For the pretrained model we chose
from transformers import AutoModel
from transformers import AutoTokenizer


# Step 0: Copied code from utils.py, torch_model_base.py, torch_shallow_neural_classifier.py, and 
# torch_deep_neural_classifier.py for the autograder
def progress_bar(msg, verbose=True):
    """
    Simple over-writing progress bar.

    """
    if verbose:
        sys.stderr.write('\r')
        sys.stderr.write(msg)
        sys.stderr.flush()


def fix_random_seeds(
        seed=42,
        set_system=True,
        set_torch=True,
        set_tensorflow=False,
        set_torch_cudnn=True):
    """
    Fix random seeds for reproducibility.

    Parameters
    ----------
    seed : int
        Random seed to be set.

    set_system : bool
        Whether to set `np.random.seed(seed)` and `random.seed(seed)`

    set_tensorflow : bool
        Whether to set `tf.random.set_random_seed(seed)`

    set_torch : bool
        Whether to set `torch.manual_seed(seed)`

    set_torch_cudnn: bool
        Flag for whether to enable cudnn deterministic mode.
        Note that deterministic mode can have a performance impact,
        depending on your model.
        https://pytorch.org/docs/stable/notes/randomness.html

    Notes
    -----
    The function checks that PyTorch and TensorFlow are installed
    where the user asks to set seeds for them. If they are not
    installed, the seed-setting instruction is ignored. The intention
    is to make it easier to use this function in environments that lack
    one or both of these libraries.

    Even though the random seeds are explicitly set,
    the behavior may still not be deterministic (especially when a
    GPU is enabled), due to:

    * CUDA: There are some PyTorch functions that use CUDA functions
    that can be a source of non-determinism:
    https://pytorch.org/docs/stable/notes/randomness.html

    * PYTHONHASHSEED: On Python 3.3 and greater, hash randomization is
    turned on by default. This seed could be fixed before calling the
    python interpreter (PYTHONHASHSEED=0 python test.py). However, it
    seems impossible to set it inside the python program:
    https://stackoverflow.com/questions/30585108/disable-hash-randomization-from-within-python-program

    """
    # set system seed
    if set_system:
        np.random.seed(seed)
        random.seed(seed)

    # set torch seed
    if set_torch:
        try:
            import torch
        except ImportError:
            pass
        else:
            torch.manual_seed(seed)

    # set torch cudnn backend
    if set_torch_cudnn:
        try:
            import torch
        except ImportError:
            pass
        else:
            torch.backends.cudnn.deterministic = True
            torch.backends.cudnn.benchmark = False


def safe_macro_f1(y, y_pred, **kwargs):
    """
    Macro-averaged F1, forcing `sklearn` to report as a multiclass
    problem even when there are just two classes. `y` is the list of
    gold labels and `y_pred` is the list of predicted labels.

    """
    return f1_score(y, y_pred, average='macro', pos_label=None)


class TorchModelBase:
    def __init__(self,
            batch_size=1028,
            max_iter=1000,
            eta=0.001,
            optimizer_class=torch.optim.Adam,
            l2_strength=0,
            gradient_accumulation_steps=1,
            max_grad_norm=None,
            warm_start=False,
            early_stopping=False,
            validation_fraction=0.1,
            shuffle_train=True,
            n_iter_no_change=10,
            tol=1e-5,
            device=None,
            display_progress=True,
            **optimizer_kwargs):
        """
        Base class for all the PyTorch-based models.

        Parameters
        ----------
        batch_size: int
            Number of examples per batch. Batching is handled by a
            `torch.utils.data.DataLoader`. Final batches can have fewer
            examples, depending on the total number of examples in the
            dataset.

        max_iter: int
            Maximum number of training iterations. This will interact
            with `early_stopping`, `n_iter_no_change`, and `tol` in the
            sense that this limit will be reached if and only if the
            conditions triggered by those other parameters are not met.

        eta : float
            Learning rate for the optimizer.

        optimizer_class: `torch.optim.Optimizer`
            Any PyTorch optimizer should work. Additional arguments
            can be passed to this object via `**optimizer_kwargs`. The
            optimizer itself is built by `self.build_optimizer` when
            `fit` is called.

        l2_strength: float
            L2 regularization parameters for the optimizer. The default
            of 0 means no regularization, and larger values correspond
            to stronger regularization.

        gradient_accumulation_steps: int
            Controls how often the model parameters are updated during
            learning. For example, with `gradient_accumulation_steps=2`,
            the parameters are updated after every other batch. The primary
            use case for `gradient_accumulation_steps > 1` is where the
            model is very large, so only small batches of examples can be
            fit into memory. The updates based on these small batches can
            have high variance, so accumulating a few batches before
            updating can smooth the process out.

        max_grad_norm: None or float
            If not `None`, then `torch.nn.utils.clip_grad_norm_` is used
            to rescale the gradients so that their total norm does not
            exceed this value. This is a kind of brute-force way of
            keeping individual updates from growing absurdly large.

        warm_start: bool
            If `False`, then repeated calls to `fit` will reset all the
            optimization settings: the model parameters, the optimizer,
            and the metadata we collect during optimization. If `True`,
            then calling `fit` twice with `max_iter=N` should be the same
            as calling fit once with `max_iter=N*2`.

        early_stopping: bool
            If `True`, then `validation_fraction` of the data given to
            `fit` are held out and used to assess the model after every
            epoch. The best scoring model is stored in an attribute
            `best_parameters`. If an improvement of at least `self.tol`
            isn't seen after `n_iter_no_change` iterations, then training
            stops and `self.model` is set to use `best_parameters`.

        validation_fraction: float
            Fraction of the data given to `fit` to hold out for use in
            early stopping. Ignored if `early_stopping=False`.

        shuffle_train: bool
            Whether to shuffle the training data.

        n_iter_no_change: int
            Number of epochs used to control convergence and early
            stopping. Where `early_stopping=True`, training stops if an
            improvement of more than `self.tol` isn't seen after this
            many epochs. If `early_stopping=False`, then training stops
            if the epoch error doesn't drop by at least `self.tol` after
            this many epochs.

        tol: float
            Value used to control `early_stopping` and convergence.

        device: str or None
            Used to set the device on which the PyTorch computations will
            be done. If `device=None`, this will choose a CUDA device if
            one is available, else the CPU is used.

        display_progress: bool
            Whether to print optimization information incrementally to
            `sys.stderr` during training.

        **optimizer_kwargs: kwargs
            Any additional keywords given to the model will be passed to
            the optimizer -- see `self.build_optimizer`. The intent is to
            make it easy to tune these as hyperparameters while still
            allowing the user to specify just `optimizer_class` rather
            than setting up a full optimizer.

        Attributes
        ----------
        params: list
             All the keyword arguments are parameters and, with the
             exception of `display_progress`, their names are added to
             this list to support working with them using tools from
             `sklearn.model_selection`.

        """
        self.batch_size = batch_size
        self.max_iter = max_iter
        self.eta = eta
        self.optimizer_class = optimizer_class
        self.l2_strength = l2_strength
        self.gradient_accumulation_steps = max([gradient_accumulation_steps, 1])
        self.max_grad_norm = max_grad_norm
        self.warm_start = warm_start
        self.early_stopping = early_stopping
        self.validation_fraction = validation_fraction
        self.shuffle_train = shuffle_train
        self.n_iter_no_change = n_iter_no_change
        self.tol = tol
        if device is None:
            device = "cuda" if torch.cuda.is_available() else "cpu"
        self.device = torch.device(device)
        self.display_progress = display_progress
        self.optimizer_kwargs = optimizer_kwargs
        for k, v in self.optimizer_kwargs.items():
            setattr(self, k, v)
        self.params = [
            'batch_size',
            'max_iter',
            'eta',
            'optimizer_class',
            'l2_strength',
            'gradient_accumulation_steps',
            'max_grad_norm',
            'validation_fraction',
            'early_stopping',
            'n_iter_no_change',
            'warm_start',
            'tol']
        self.params += list(optimizer_kwargs.keys())

    def build_dataset(self, *args, **kwargs):
        """
        Subclasses are required to define this method. Perhaps the most
        important design note is that the function should be prepared to
        return datasets that are appropriate for both training and
        prediction. For training, we expect `*args` to have labels in
        final position. For prediction, we expect all of `*args` to be
        model inputs. For example, in a simple classifier, we expect
        `*args` to be a pair `(X, y)` for training and so this method
        should return something like:

        `torch.utils.data.TensorDataset(X, y)`

        For prediction, we get only `X`, so we should return

        `torch.utils.data.TensorDataset(X)`

        Parameters
        ----------
        *args: any arguments to be used to create the dataset

        **kwargs: any desired keyword arguments

        Returns
        -------
        `torch.utils.data.Dataset` or a custom subclass thereof

        """
        raise NotImplementedError

    def build_graph(self, *args, **kwargs):
        """
        Build the core computational graph. This is called only after
        `fit` is called. The return value of this function becomes
        the `self.model` attribute.

        Parameters
        ----------
        *args: any arguments to be used to create the graph

        **kwargs: any desired keyword arguments

        Returns
        -------
        nn.Module or subclass thereof

        """
        raise NotImplementedError

    def score(self, *args):
        """
        Required by the `sklearn.model_selection` tools. This function
        needs to take the same arguments as `fit`. Here `*args` is usually
        an `(X, y)` pair of features and labels, and `self.predict(X)`
        is called and then some kind of scoring function is used to
        compare those predictions with `y`. The return value should be
        some kind of appropriate score for the model in question.

        Notes
        -----
        For early stopping, we use this function to get scores and
        assume that larger scores are better. This would conflict with
        using, say, a mean-squared-error scoring function.

        """
        raise NotImplementedError

    def build_optimizer(self):
        """
        Builds the optimizer. This function is called only when `fit`
        is called.

        Returns
        -------
        torch.optim.Optimizer

        """
        return self.optimizer_class(
            self.model.parameters(),
            lr=self.eta,
            weight_decay=self.l2_strength,
            **self.optimizer_kwargs)

    def fit(self, *args):
        """
        Generic optimization method.

        Parameters
        ----------
        *args: list of objects
            We assume that the final element of args gives the labels
            and all the preceding elements give the system inputs.
            For regular supervised learning, this is like (X, y), but
            we allow for models that might use multiple data structures
            for their inputs.

        Attributes
        ----------
        model: nn.Module or subclass thereof
            Set by `build_graph`. If `warm_start=True`, then this is
            initialized only by the first call to `fit`.

        optimizer: torch.optim.Optimizer
            Set by `build_optimizer`. If `warm_start=True`, then this is
            initialized only by the first call to `fit`.

        errors: list of float
            List of errors. If `warm_start=True`, then this is
            initialized only by the first call to `fit`. Thus, where
            `max_iter=5`, if we call `fit` twice with `warm_start=True`,
            then `errors` will end up with 10 floats in it.

        validation_scores: list
            List of scores. This is filled only if `early_stopping=True`.
            If `warm_start=True`, then this is initialized only by the
            first call to `fit`. Thus, where `max_iter=5`, if we call
            `fit` twice with `warm_start=True`, then `validation_scores`
            will end up with 10 floats in it.

        no_improvement_count: int
            Used to control early stopping and convergence. These values
            are controlled by `_update_no_improvement_count_early_stopping`
            or `_update_no_improvement_count_errors`.  If `warm_start=True`,
            then this is initialized only by the first call to `fit`. Thus,
            in that situation, the values could accumulate across calls to
            `fit`.

        best_error: float
           Used to control convergence. Smaller is assumed to be better.
           If `warm_start=True`, then this is initialized only by the first
           call to `fit`. It will be reset by
           `_update_no_improvement_count_errors` depending on how the
           optimization is proceeding.

        best_score: float
           Used to control early stopping. If `warm_start=True`, then this
           is initialized only by the first call to `fit`. It will be reset
           by `_update_no_improvement_count_early_stopping` depending on how
           the optimization is proceeding. Important: we currently assume
           that larger scores are better. As a result, we will not get the
           correct results for, e.g., a scoring function based on
           `mean_squared_error`. See `self.score` for additional details.

        best_parameters: dict
            This is a PyTorch state dict. It is used if and only if
            `early_stopping=True`. In that case, it is updated whenever
            `best_score` is improved numerically. If the early stopping
            criteria are met, then `self.model` is reset to contain these
            parameters before `fit` exits.

        Returns
        -------
        self

        """
        if self.early_stopping:
            args, dev = self._build_validation_split(
                *args, validation_fraction=self.validation_fraction)

        # Dataset:
        dataset = self.build_dataset(*args)
        dataloader = self._build_dataloader(dataset, shuffle=self.shuffle_train)

        # Set up parameters needed to use the model. This is a separate
        # function to support using pretrained models for prediction,
        # where it might not be desirable to call `fit`.
        self.initialize()

        # Make sure the model is where we want it:
        self.model.to(self.device)

        self.model.train()
        self.optimizer.zero_grad()

        for iteration in range(1, self.max_iter+1):

            epoch_error = 0.0

            for batch_num, batch in enumerate(dataloader, start=1):

                batch = [x.to(self.device) for x in batch]

                X_batch = batch[: -1]
                y_batch = batch[-1]

                batch_preds = self.model(*X_batch)

                err = self.loss(batch_preds, y_batch)

                if self.gradient_accumulation_steps > 1 and \
                  self.loss.reduction == "mean":
                    err /= self.gradient_accumulation_steps

                err.backward()

                epoch_error += err.item()

                if batch_num % self.gradient_accumulation_steps == 0 or \
                  batch_num == len(dataloader):
                    if self.max_grad_norm is not None:
                        torch.nn.utils.clip_grad_norm_(
                            self.model.parameters(), self.max_grad_norm)
                    self.optimizer.step()
                    self.optimizer.zero_grad()

            # Stopping criteria:

            if self.early_stopping:
                self._update_no_improvement_count_early_stopping(*dev)
                if self.no_improvement_count > self.n_iter_no_change:
                    progress_bar(
                        "Stopping after epoch {}. Validation score did "
                        "not improve by tol={} for more than {} epochs. "
                        "Final error is {}".format(iteration, self.tol,
                            self.n_iter_no_change, epoch_error),
                        verbose=self.display_progress)
                    break

            else:
                self._update_no_improvement_count_errors(epoch_error)
                if self.no_improvement_count > self.n_iter_no_change:
                    progress_bar(
                        "Stopping after epoch {}. Training loss did "
                        "not improve more than tol={}. Final error "
                        "is {}.".format(iteration, self.tol, epoch_error),
                        verbose=self.display_progress)
                    break

            progress_bar(
                "Finished epoch {} of {}; error is {}".format(
                    iteration, self.max_iter, epoch_error),
                verbose=self.display_progress)

        if self.early_stopping:
            self.model.load_state_dict(self.best_parameters)

        return self

    def initialize(self):
        """
        Method called by `fit` to establish core attributes. To use a
        pretrained model without calling `fit`, one can use this
        method.

        """
        if not self.warm_start or not hasattr(self, "model"):
            self.model = self.build_graph()
            # This device move has to happen before the optimizer is built:
            # https://pytorch.org/docs/master/optim.html#constructing-it
            self.model.to(self.device)
            self.optimizer = self.build_optimizer()
            self.errors = []
            self.validation_scores = []
            self.no_improvement_count = 0
            self.best_error = np.inf
            self.best_score = -np.inf
            self.best_parameters = None

    @staticmethod
    def _build_validation_split(*args, validation_fraction=0.2):
        """
        Split `*args` into train and dev portions for early stopping.
        We use `train_test_split`. For args of length N, this delivers
        N*2 objects, arranged as

        X1_train, X1_test, X2_train, X2_test, ..., y_train, y_test

        Parameters
        ----------
        *args: List of objects to split.

        validation_fraction: float
            Percentage of the examples to use for the dev portion. In
            `fit`, this is determined by `self.validation_fraction`.
            We give it as an argument here to facilitate unit testing.

        Returns
        -------
        Pair of tuples `train` and `dev`

        """
        if validation_fraction == 1.0:
            return args, args
        results = train_test_split(*args, test_size=validation_fraction)
        train = results[::2]
        dev = results[1::2]
        return train, dev

    def _build_dataloader(self, dataset, shuffle=True):
        """
        Internal method used to create a dataloader from a dataset.
        This is used by `fit` and `_predict`.

        Parameters
        ----------
        dataset: torch.utils.data.Dataset

        shuffle: bool
            When training, this is `True`. For prediction, this is
            crucially set to `False` so that the examples are not
            shuffled out of order with respect to labels that might
            be used for assessment.

        Returns
        -------
        torch.utils.data.DataLoader

        """
        if hasattr(dataset, "collate_fn"):
            collate_fn = dataset.collate_fn
        else:
            collate_fn = None
        dataloader = torch.utils.data.DataLoader(
            dataset,
            batch_size=self.batch_size,
            shuffle=shuffle,
            pin_memory=True,
            collate_fn=collate_fn)
        return dataloader

    def _update_no_improvement_count_early_stopping(self, *dev):
        """
        Internal method used by `fit` to control early stopping.
        The method uses `self.score(*dev)` for scoring and updates
        `self.validation_scores`, `self.no_improvement_count`,
        `self.best_score`, `self.best_parameters` as appropriate.

        """
        score = self.score(*dev)
        self.validation_scores.append(score)
        # If the score isn't at least `self.tol` better, increment:
        if score < (self.best_score + self.tol):
            self.no_improvement_count += 1
        else:
            self.no_improvement_count = 0
        # If the current score is numerically better than all previous
        # scores, update the best parameters:
        if score > self.best_score:
            self.best_parameters = copy.deepcopy(self.model.state_dict())
            self.best_score = score
        self.model.train()

    def _update_no_improvement_count_errors(self, epoch_error):
        """
        Internal method used by `fit` to control convergence.
        The method uses `epoch_error`, `self.best_error`, and
        `self.tol` to make decisions, and it updates `self.errors`,
        `self.no_improvement_count`, and `self.best_error` as
        appropriate.

        """
        if epoch_error > (self.best_error - self.tol):
            self.no_improvement_count += 1
        else:
            self.no_improvement_count = 0
        if epoch_error < self.best_error:
            self.best_error = epoch_error
        self.errors.append(epoch_error)

    def _predict(self, *args, device=None):
        """
        Internal method that subclasses are expected to use to define
        their own `predict` functions. The hope is that this method
        can do all the data organization and other details, allowing
        subclasses to have compact predict methods that just encode
        the core logic specific to them.

        Parameters
        ----------
        *args: system inputs

        device: str or None
            Allows the user to temporarily change the device used
            during prediction. This is useful if predictions require a
            lot of memory and so are better done on the CPU. After
            prediction is done, the model is returned to `self.device`.

        Returns
        -------
        The precise return value depends on the nature of the predictions.
        If the predictions have the same shape across all batches, then
        we return a single tensor concatenation of them. If the shape
        can vary across batches, as is common for sequence prediction,
        then we return a list of tensors of varying length.

        """
        device = self.device if device is None else torch.device(device)

        # Dataset:
        dataset = self.build_dataset(*args)
        dataloader = self._build_dataloader(dataset, shuffle=False)

        # Model:
        self.model.to(device)
        self.model.eval()

        preds = []
        with torch.no_grad():
            for batch in dataloader:
                X = [x.to(device) for x in batch]
                preds.append(self.model(*X))

        # Make sure the model is back on the instance device:
        self.model.to(self.device)

        # If the batch outputs differ only in their batch size, sharing
        # all other dimensions, then we can concatenate them and maintain
        # a tensor. For simple classification problems, this should hold.
        if all(x.shape[1: ] == preds[0].shape[1: ] for x in preds[1: ]):
            return torch.cat(preds, axis=0)
        # The batch outputs might differ along other dimensions. This is
        # common for sequence prediction, where different batches might
        # have different max lengths, since we pad on a per-batch basis.
        # In this case, we can't concatenate them, so we return a list
        # of the predictions, where each prediction is a tensor. Note:
        # the predictions might still be padded and so need trimming on a
        # per example basis.
        else:
            return [p for batch in preds for p in batch]

    def get_params(self, deep=True):
        params = self.params.copy()
        # Obligatorily add `vocab` so that sklearn passes it in when
        # creating new model instances during cross-validation:
        if hasattr(self, 'vocab'):
            params += ['vocab']
        return {p: getattr(self, p) for p in params}

    def set_params(self, **params):
        for key, val in params.items():
            if key not in self.params:
                raise ValueError(
                    "{} is not a parameter for {}. For the list of "
                    "available parameters, use `self.params`.".format(
                        key, self.__class__.__name__))
            else:
                setattr(self, key, val)
        return self

    def to_pickle(self, output_filename):
        """
        Serialize the entire class instance. Importantly, this is
        different from using the standard `torch.save` method:

        torch.save(self.model.state_dict(), output_filename)

        The above stores only the underlying model parameters. In
        contrast, the current method ensures that all of the model
        parameters are on the CPU and then stores the full instance.
        This is necessary to ensure that we retain all the information
        needed to read new examples, do additional training, make
        predictions, and so forth.

        Parameters
        ----------
        output_filename : str
            Full path for the output file.

        """
        self.model = self.model.cpu()
        with open(output_filename, 'wb') as f:
            pickle.dump(self, f)

    @staticmethod
    def from_pickle(src_filename):
        """
        Load an entire class instance onto the CPU. This also sets
        `self.warm_start=True` so that the loaded parameters are used
        if `fit` is called.

        Importantly, this is different from the recommended PyTorch method:

        self.model.load_state_dict(torch.load(src_filename))

        We cannot reliably do this with new instances, because we need
        to see new examples in order to set some of the model
        dimensionalities and obtain information about what the class
        labels are. Thus, the current method loads an entire serialized
        class as created by `to_pickle`.

        The training and prediction code move the model parameters to
        `self.device`.

        Parameters
        ----------
        src_filename : str
            Full path to the serialized model file.

        """
        with open(src_filename, 'rb') as f:
            return pickle.load(f)

    def __repr__(self):
        param_str = ["{}={}".format(a, getattr(self, a)) for a in self.params]
        param_str = ",\n\t".join(param_str)
        return "{}(\n\t{})".format(self.__class__.__name__, param_str)


class TorchShallowNeuralClassifier(TorchModelBase):
    def __init__(self,
            hidden_dim=50,
            hidden_activation=nn.Tanh(),
            **base_kwargs):
        """
        A model

        h = f(xW_xh + b_h)
        y = softmax(hW_hy + b_y)

        with a cross-entropy loss and f determined by `hidden_activation`.

        Parameters
        ----------
        hidden_dim : int
            Dimensionality of the hidden layer.

        hidden_activation : nn.Module
            The non-linear activation function used by the network for
            the hidden layer.

        **base_kwargs
            For details, see `torch_model_base.py`.

        Attributes
        ----------
        loss: nn.CrossEntropyLoss(reduction="mean")

        self.params: list
            Extends TorchModelBase.params with names for all of the
            arguments for this class to support tuning of these values
            using `sklearn.model_selection` tools.

        """
        self.hidden_dim = hidden_dim
        self.hidden_activation = hidden_activation
        super().__init__(**base_kwargs)
        self.loss = nn.CrossEntropyLoss(reduction="mean")
        self.params += ['hidden_dim', 'hidden_activation']

    def build_graph(self):
        """
        Define the model's computation graph.

        Returns
        -------
        nn.Module

        """
        return nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            self.hidden_activation,
            nn.Linear(self.hidden_dim, self.n_classes_))

    def build_dataset(self, X, y=None):
        """
        Define datasets for the model.

        Parameters
        ----------
        X : iterable of length `n_examples`
           Each element must have the same length.

        y: None or iterable of length `n_examples`

        Attributes
        ----------
        input_dim : int
            Set based on `X.shape[1]` after `X` has been converted to
            `np.array`.

        Returns
        -------
        `torch.utils.data.TensorDataset`. Where `y=None`, the dataset will
        yield single tensors `X`. Where `y` is specified, it will yield
        `(X, y)` pairs.

        """
        X = np.array(X)
        self.input_dim = X.shape[1]
        X = torch.FloatTensor(X)
        if y is None:
            dataset = torch.utils.data.TensorDataset(X)
        else:
            self.classes_ = sorted(set(y))
            self.n_classes_ = len(self.classes_)
            class2index = dict(zip(self.classes_, range(self.n_classes_)))
            y = [class2index[label] for label in y]
            y = torch.tensor(y)
            dataset = torch.utils.data.TensorDataset(X, y)
        return dataset

    def score(self, X, y, device=None):
        """
        Uses macro-F1 as the score function. Note: this departs from
        `sklearn`, where classifiers use accuracy as their scoring
        function. Using macro-F1 is more consistent with our course.

        This function can be used to evaluate models, but its primary
        use is in cross-validation and hyperparameter tuning.

        Parameters
        ----------
        X: np.array, shape `(n_examples, n_features)`

        y: iterable, shape `len(n_examples)`
            These can be the raw labels. They will be converted internally
            as needed. See `build_dataset`.

        device: str or None
            Allows the user to temporarily change the device used
            during prediction. This is useful if predictions require a
            lot of memory and so are better done on the CPU. After
            prediction is done, the model is returned to `self.device`.

        Returns
        -------
        float

        """
        preds = self.predict(X, device=device)
        return utils.safe_macro_f1(y, preds)

    def predict_proba(self, X, device=None):
        """
        Predicted probabilities for the examples in `X`.

        Parameters
        ----------
        X : np.array, shape `(n_examples, n_features)`

        device: str or None
            Allows the user to temporarily change the device used
            during prediction. This is useful if predictions require a
            lot of memory and so are better done on the CPU. After
            prediction is done, the model is returned to `self.device`.

        Returns
        -------
        np.array, shape `(len(X), self.n_classes_)`
            Each row of this matrix will sum to 1.0.

        """
        preds = self._predict(X, device=device)
        probs = torch.softmax(preds, dim=1).cpu().numpy()
        return probs

    def predict(self, X, device=None):
        """
        Predicted labels for the examples in `X`. These are converted
        from the integers that PyTorch needs back to their original
        values in `self.classes_`.

        Parameters
        ----------
        X : np.array, shape `(n_examples, n_features)`

        device: str or None
            Allows the user to temporarily change the device used
            during prediction. This is useful if predictions require a
            lot of memory and so are better done on the CPU. After
            prediction is done, the model is returned to `self.device`.

        Returns
        -------
        list, length len(X)

        """
        probs = self.predict_proba(X, device=device)
        return [self.classes_[i] for i in probs.argmax(axis=1)]


def simple_example():
    """Assess on the digits dataset."""
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, accuracy_score

    utils.fix_random_seeds()

    digits = load_digits()
    X = digits.data
    y = digits.target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    mod = TorchShallowNeuralClassifier()

    print(mod)

    mod.fit(X_train, y_train)
    preds = mod.predict(X_test)

    print("\nClassification report:")

    print(classification_report(y_test, preds))

    return accuracy_score(y_test, preds)


if __name__ == '__main__':
    simple_example()


__author__ = "Atticus Geiger"
__version__ = "CS224u, Stanford, Spring 2022"


class ActivationLayer(torch.nn.Module):
    def __init__(self, input_dim, output_dim, device, hidden_activation):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim, device=device)
        self.activation = hidden_activation

    def forward(self, x):
        return self.activation(self.linear(x))


class TorchDeepNeuralClassifier(TorchShallowNeuralClassifier):
    def __init__(self,
            num_layers=1,
            **base_kwargs):
        """
        A dense, feed-forward network with the number of hidden layers
        set by `num_layers`.

        Parameters
        ----------
        num_layers : int
            Number of hidden layers in the network.

        **base_kwargs
            For details, see `torch_model_base.py`.

        Attributes
        ----------
        loss: nn.CrossEntropyLoss(reduction="mean")

        self.params: list
            Extends TorchModelBase.params with names for all of the
            arguments for this class to support tuning of these values
            using `sklearn.model_selection` tools.

        """
        self.num_layers = num_layers
        super().__init__(**base_kwargs)
        self.loss = nn.CrossEntropyLoss(reduction="mean")
        self.params += ['num_layers']

    def build_graph(self):
        """
        Define the model's computation graph.

        Returns
        -------
        nn.Module

        """
        # Input to hidden:
        self.layers = [
            ActivationLayer(
                self.input_dim, self.hidden_dim, self.device, self.hidden_activation)]
        # Hidden to hidden:
        for _ in range(self.num_layers-1):
            self.layers += [
                ActivationLayer(
                    self.hidden_dim, self.hidden_dim, self.device, self.hidden_activation)]
        # Hidden to output:
        self.layers.append(
            nn.Linear(self.hidden_dim, self.n_classes_, device=self.device))
        return nn.Sequential(*self.layers)



def simple_example():
    """Assess on the digits dataset."""
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, accuracy_score

    utils.fix_random_seeds()

    digits = load_digits()
    X = digits.data
    y = digits.target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=42)

    mod = TorchDeepNeuralClassifier(num_layers=2)

    print(mod)

    mod.fit(X_train, y_train)
    preds = mod.predict(X_test)

    print("\nClassification report:")

    print(classification_report(y_test, preds))

    return accuracy_score(y_test, preds)


if __name__ == '__main__':
    simple_example()



# Step 1: Choose pretrained model to use from hugging face
# https://huggingface.co/prajjwal1/bert-small
weights_name = "prajjwal1/bert-small"


# Step 2: Use tokenizer that comes with model
tokenizer = AutoTokenizer.from_pretrained(weights_name)


# Step 3: Potentially look for more datasets; if not, just use dynasent round 1 and 2 and sst
dynasent_r1 = load_dataset("dynabench/dynasent", 'dynabench.dynasent.r1.all')
dynasent_r2 = load_dataset("dynabench/dynasent", 'dynabench.dynasent.r2.all')
sst = load_dataset("SetFit/sst5")

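def convert_sst_label(s):
    """Map an SST-5 label string to a ternary label by keeping its
    last word, e.g. 'very positive' -> 'positive'."""
    return s.split(" ")[-1]


for splitname in ('train', 'validation', 'test'):
    # Add `gold_label` and `sentence` columns so that SST can be
    # accessed with the same column names as the DynaSent datasets:
    dist = [convert_sst_label(s) for s in sst[splitname]['label_text']]
    sst[splitname] = sst[splitname].add_column('gold_label', dist)
    sst[splitname] = sst[splitname].add_column('sentence', sst[splitname]['text'])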


# Step 4: Get representations of tokens
def get_batch_token_ids(batch, tokenizer):
    """Map `batch` to a tensor of ids. The return
    value should meet the following specification:

    1. The max length should be 512.
    2. Examples longer than the max length should be truncated
    3. Examples should be padded to the max length for the batch.
    4. The special token [CLS] should be added to the start and the
       special token [SEP] should be added to the end.
    5. The attention mask should be returned.
    6. The return value of each component should be a tensor.

    Parameters
    ----------
    batch: list of str
    tokenizer: Hugging Face tokenizer

    Returns
    -------
    dict with at least "input_ids" and "attention_mask" as keys,
    each with Tensor values

    """
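    max_length = 512
    # Note: `padding='max_length'` pads every example to `max_length`
    # tokens rather than to the longest example in the batch (which
    # would be `padding='longest'`).
    return tokenizer.batch_encode_plus(
        batch,
        add_special_tokens=True,
        padding='max_length',
        truncation=True,
        max_length=max_length,
        return_tensors='pt',
        return_attention_mask=True)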


# Step 5: Define the graph for the neural network
class BertClassifierModule(nn.Module):
    def __init__(self, 
            n_classes, 
            hidden_activation):
        """This module loads a Transformer, adds a dense layer with activation
        function given by `hidden_activation`, and puts a classifier
        layer on top of that as the final output. The output of
        the dense layer should have the same dimensionality as the
        model input.

        Parameters
        ----------
        n_classes : int
            Number of classes for the output layer
        hidden_activation : torch activation function
            e.g., nn.Tanh()

        The name of the pretrained model to load from Hugging Face is
        not an argument here; it is read from the module-level
        variable `weights_name`.

        """
        super().__init__()
        self.n_classes = n_classes
        self.bert = AutoModel.from_pretrained(weights_name) # v1 and v2

        self.bert.train()
        self.hidden_activation = hidden_activation
        self.hidden_dim = self.bert.embeddings.word_embeddings.embedding_dim
        # Add the new parameters here using `nn.Sequential`. 
        # We can define this layer as
        # 
        #  h = f(cW1 + b_h)
        #  y = hW2 + b_y
        #
        # where c is the pooled representation that `forward` computes
        # from the Transformer's final hidden states,
        # W1 has dimensionality (self.hidden_dim, self.hidden_dim),
        # W2 has dimensionality (self.hidden_dim, self.n_classes),
        # and we rely on the PyTorch loss function to apply a
        # softmax to y.
        self.classifier_layer = nn.Sequential(
            nn.Linear(self.hidden_dim, self.hidden_dim, bias=True),
            self.hidden_activation,
            nn.Linear(self.hidden_dim, self.n_classes, bias=True)
        )



    def forward(self, indices, mask):
        """Process `indices` with `mask` by feeding these arguments
        to `self.bert` and then feeding the sum of `last_hidden_state`
        over the sequence dimension to `self.classifier_layer`.

        Parameters
        ----------
        indices : torch.LongTensor of shape (n_batch, k)
            Indices into the `self.bert` embedding layer. `n_batch` is
            the number of examples and `k` is the sequence length for
            this batch
        mask : torch.LongTensor of shape (n_batch, k)
            Binary vector indicating which values should be masked.
            `n_batch` is the number of examples and `k` is the
            sequence length for this batch

        Returns
        -------
        torch.FloatTensor
            Predicted values, shape `(n_batch, self.n_classes)`

        """
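        maskreps = self.bert(indices, attention_mask=mask)
        # Sum the final hidden states over the sequence dimension. This
        # sum runs over all `k` positions, including padded ones.
        pooled = torch.sum(maskreps.last_hidden_state, dim=1)
        return self.classifier_layer(pooled)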



# Step 6: Use torch_deep_neural_classifier and fine tune it on the datasets
class OriginalClassifier(TorchDeepNeuralClassifier):
    def __init__(self, *args, **kwargs):
        self.tokenizer = tokenizer
        super().__init__(*args, **kwargs)

    def build_graph(self):
        return BertClassifierModule(
            self.n_classes_, self.hidden_activation)

    def build_dataset(self, X, y=None):
        data = get_batch_token_ids(X, self.tokenizer)
        if y is None:
            dataset = torch.utils.data.TensorDataset(
                data['input_ids'], data['attention_mask'])
        else:
            self.classes_ = sorted(set(y))
            self.n_classes_ = len(self.classes_)
            class2index = dict(zip(self.classes_, range(self.n_classes_)))
            y = [class2index[label] for label in y]
            y = torch.tensor(y)
            dataset = torch.utils.data.TensorDataset(
                data['input_ids'], data['attention_mask'], y)
        return dataset

# Step 7: Create model
original_model = OriginalClassifier(
    hidden_activation=nn.ReLU(),
    eta=0.00005,          # Low learning rate for effective fine-tuning.
    batch_size=8,         # Small batches to avoid memory overload.
    gradient_accumulation_steps=4,  # Increase the effective batch size to 32.
    early_stopping=True,  # Early-stopping
    n_iter_no_change=5)   # params.


# Step 8: Make predictions and assess model performance
from sklearn.metrics import classification_report

# `from_pickle` is a static method that returns the deserialized
# instance, so its return value has to be kept. The loaded instance
# already carries its trained parameters and `classes_`; calling
# `initialize()` on it would rebuild the graph and discard them.
filename = 'sum_representation_model.pkl'
saved_model = OriginalClassifier.from_pickle(filename)


# train on all 3 datasets at once
X = dynasent_r1['train']['sentence'] + dynasent_r2['train']['sentence'] + sst['train']['sentence']
y = dynasent_r1['train']['gold_label'] + dynasent_r2['train']['gold_label'] + sst['train']['gold_label']

sst_preds = saved_model.predict(sst['validation']['sentence'])
print(classification_report(sst['validation']['gold_label'], sst_preds, digits=3))
dynasent_r1_preds = saved_model.predict(dynasent_r1['validation']['sentence'])
print(classification_report(dynasent_r1['validation']['gold_label'], dynasent_r1_preds, digits=3))
dynasent_r2_preds = saved_model.predict(dynasent_r2['validation']['sentence'])
print(classification_report(dynasent_r2['validation']['gold_label'], dynasent_r2_preds, digits=3))




Some weights of the model checkpoint at prajjwal1/bert-small were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


FileNotFoundError: ignored
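
A note on loading: `from_pickle` is a static method that returns the deserialized instance, so its result has to be assigned to a name. Calling it on an existing object and ignoring the return value leaves that object unchanged. The following is a minimal sketch of that behavior using a stand-in class. `ToyModel` and `demo_round_trip` are hypothetical names, not part of the course code, and as a simplification the toy class stores only its attribute dictionary rather than the full instance. The sketch also checks for the file before opening it, so that a missing pickle like the one reported above fails with an explicit message.

```python
import os
import pickle
import tempfile


class ToyModel:
    """Hypothetical stand-in for a model class with pickle helpers."""

    def __init__(self, weights=None):
        self.weights = weights

    def to_pickle(self, output_filename):
        # Simplification for this sketch: store only the attributes.
        with open(output_filename, 'wb') as f:
            pickle.dump(self.__dict__, f)

    @staticmethod
    def from_pickle(src_filename):
        # Fail early with a clear message if the file is not there.
        if not os.path.exists(src_filename):
            raise FileNotFoundError(
                "No serialized model at {}".format(src_filename))
        with open(src_filename, 'rb') as f:
            state = pickle.load(f)
        model = ToyModel()
        model.__dict__.update(state)
        # A static method has no access to the calling instance; the
        # loaded model is only available through the return value.
        return model


def demo_round_trip():
    """Save a toy model, then load it the wrong way and the right way."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, 'toy_model.pkl')
        ToyModel(weights=[0.1, 0.2]).to_pickle(path)

        fresh = ToyModel()
        fresh.from_pickle(path)              # return value discarded
        loaded = ToyModel.from_pickle(path)  # return value kept
        return fresh.weights, loaded.weights
```

In this sketch, `fresh` keeps its initial (empty) state, while `loaded` holds the saved values.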

## Question 4: Bakeoff entry [1 point]

The bakeoff dataset is available at 

https://web.stanford.edu/class/cs224u/data/cs224u-sentiment-test-unlabeled.csv

This code should grab it for you and put it in `data/sentiment` if you are working in the cloud:

In [None]:
import os

if not os.path.exists(os.path.join("data", "sentiment", "cs224u-sentiment-test-unlabeled.csv")):
    !mkdir -p data/sentiment
    !wget https://web.stanford.edu/class/cs224u/data/cs224u-sentiment-test-unlabeled.csv -P data/sentiment/

If the above fails, you can just download the file and place it in `data/sentiment`.
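
If shell commands such as `wget` are not available in your environment, the same download can be sketched with Python's standard library. This is only an illustration: `fetch_if_missing` and `BAKEOFF_URL` are hypothetical names introduced here, and the URL is the one given above.

```python
import os
import urllib.request

BAKEOFF_URL = (
    "https://web.stanford.edu/class/cs224u/data/"
    "cs224u-sentiment-test-unlabeled.csv")


def fetch_if_missing(url, dest_dir):
    """Download `url` into `dest_dir` unless a file with the same
    name is already there. Returns the local path."""
    filename = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(filename):
        # Create the target directory (like `mkdir -p`) and download.
        os.makedirs(dest_dir, exist_ok=True)
        urllib.request.urlretrieve(url, filename)
    return filename
```

Calling `fetch_if_missing(BAKEOFF_URL, os.path.join("data", "sentiment"))` would place the file where the loading code below expects it, and does nothing if the file is already present.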

Once you have the file, you can load it to a `pd.DataFrame`:

In [None]:
bakeoff_df = pd.read_csv(
    os.path.join("data", "sentiment", "cs224u-sentiment-test-unlabeled.csv"))

In [None]:
bakeoff_df.head()

In [None]:
preds = saved_model.predict(bakeoff_df['sentence'].tolist())
print(preds)
bakeoff_df["prediction"] = pd.Series(preds)
# `index=False` keeps the DataFrame index from being written as an extra column:
bakeoff_df.to_csv('cs224u-sentiment-bakeoff-entry.csv', index=False)

To enter the bakeoff, you simply need to use your original system to:

1. Add a column named 'prediction' to `cs224u-sentiment-test-unlabeled.csv` with your model predictions (which are strings in {`positive`, `negative`, `neutral`}). The existing columns should remain.

2. Save the file as `cs224u-sentiment-bakeoff-entry.csv`.
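
Before submitting, it can help to sanity-check the saved file against the two steps above. The sketch below is one possible check and is not the autograder's own validation: `check_entry` and `ALLOWED_LABELS` are hypothetical names, and the list of original columns is something you pass in yourself (for example, `bakeoff_df` is indexed with a `sentence` column earlier in this notebook).

```python
import csv

ALLOWED_LABELS = {"positive", "negative", "neutral"}


def check_entry(entry_path, original_columns):
    """Return a list of problems found in a bakeoff entry file.
    An empty list means these basic checks passed."""
    problems = []
    with open(entry_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        # The original columns must remain, plus the new 'prediction'.
        for col in list(original_columns) + ["prediction"]:
            if col not in fieldnames:
                problems.append("missing column: {}".format(col))
        # Every prediction must be one of the three label strings.
        if "prediction" in fieldnames:
            for i, row in enumerate(reader, start=1):
                if row["prediction"] not in ALLOWED_LABELS:
                    problems.append(
                        "row {}: unexpected prediction {!r}".format(
                            i, row["prediction"]))
    return problems
```

For instance, `check_entry('cs224u-sentiment-bakeoff-entry.csv', ['sentence'])` returns an empty list when the `sentence` column is still present and every prediction is one of the three expected strings.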

Submit the following files to Gradescope:

* `hw_sentiment.ipynb` (this notebook)
* `cs224u-sentiment-bakeoff-entry.csv` (bake-off output)

Please make sure you use these filenames. The autograder looks for files with these names.

You are not permitted to do any tuning of your system based on what you see in our bakeoff prediction file – you should not study that file in any way, beyond perhaps checking that it contains what you expected it to contain. The upload function will do some additional checking to ensure that your file is well-formed.

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points.