# Homework and bakeoff: Multi-domain sentiment

In [7]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2023"

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cgpotts/cs224u/blob/main/hw_sentiment.ipynb)
[![Open in SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/cgpotts/cs224u/blob/main/hw_sentiment.ipynb)

If Colab is opened with this badge, please **save a copy to drive** (from the File menu) before running the notebook.

## Overview

This homework and associated bakeoff are devoted to supervised sentiment analysis in a ternary label setting (positive, negative, neutral). Your ultimate goal is to develop systems that can make accurate predictions in multiple domains.

The homework questions ask you to implement some baseline systems using DynaSent Round 1, DynaSent Round 2, and the Stanford Sentiment Treebank. The bakeoff challenge is to define a system that does well on the DynaSent test sets, the SST-3 test set, and a set of mystery examples that don't correspond to the DynaSent or SST-3 domains.

__Important methodological note:__ The DynaSent and SST-3 test sets are already publicly distributed, so we are counting on people not to cheat by developing their models on these test sets. You must do all your development without using these test sets at all, and then evaluate exactly once on the test set and turn in the results, with no further system tuning or additional runs. _Much of the scientific integrity of our field depends on people adhering to this honor code._

This notebook briefly introduces our three development datasets, states the homework questions, and then provides guidance on the original system and associated bakeoff entry.

## Set-up

In [8]:
try:
    # Sort of randomly chosen import to see whether the requirements
    # are met:
    import datasets
except ModuleNotFoundError:
    !git clone https://github.com/cgpotts/cs224u/
    !pip install -r cs224u/requirements.txt
    import sys
    sys.path.append("cs224u")

In [119]:
%pip install nlpaug

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl.metadata (14 kB)
Collecting gdown>=4.0.0 (from nlpaug)
  Downloading gdown-5.2.0-py3-none-any.whl.metadata (5.8 kB)
Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m410.5/410.5 kB[0m [31m6.5 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hDownloading gdown-5.2.0-py3-none-any.whl (18 kB)
Installing collected packages: gdown, nlpaug
Successfully installed gdown-5.2.0 nlpaug-1.1.11
Note: you may need to restart the kernel to use updated packages.


In [1]:
from collections import defaultdict, Counter
from datasets import load_dataset
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
import torch

## Datasets

### DynaSent round 1

The DynaSent dataset of [Potts, Wu, et al. 2021](https://aclanthology.org/2021.acl-long.186/) is a ternary sentiment benchmark consisting of two rounds (so far). The dataset is available on [Hugging Face](https://huggingface.co/datasets/dynabench/dynasent).

For Round 1, the authors collected sentences from the [Yelp Academic Dataset](https://www.yelp.com/dataset) that fooled a top-performing sentiment model but were intuitive for humans. The model was used only to heuristically find the examples. Crowdworkers multiply-labeled all of them.

The round contains a lot of metadata that could be useful for developing sentiment models. We will focus on just the sentences and labels, but you are free to make use of this additional metadata in developing uour system.

In [2]:
dynasent_r1 = load_dataset("dynabench/dynasent", 'dynabench.dynasent.r1.all', trust_remote_code=True)

Repo card metadata block was not found. Setting CardData to empty.


In [10]:
dynasent_r1

DatasetDict({
    train: Dataset({
        features: ['id', 'hit_ids', 'sentence', 'indices_into_review_text', 'model_0_label', 'model_0_probs', 'text_id', 'review_id', 'review_rating', 'label_distribution', 'gold_label', 'metadata'],
        num_rows: 80488
    })
    validation: Dataset({
        features: ['id', 'hit_ids', 'sentence', 'indices_into_review_text', 'model_0_label', 'model_0_probs', 'text_id', 'review_id', 'review_rating', 'label_distribution', 'gold_label', 'metadata'],
        num_rows: 3600
    })
    test: Dataset({
        features: ['id', 'hit_ids', 'sentence', 'indices_into_review_text', 'model_0_label', 'model_0_probs', 'text_id', 'review_id', 'review_rating', 'label_distribution', 'gold_label', 'metadata'],
        num_rows: 3600
    })
})

Splits:

In [3]:
def print_label_dist(dataset, labelname='gold_label', splitnames=('train', 'validation')):
    for splitname in splitnames:
        print(splitname)
        dist = sorted(Counter(dataset[splitname][labelname]).items())
        for k, v in dist:
            print(f"\t{k:>14s}: {v}")

In [13]:
print_label_dist(dynasent_r1)

train
	      negative: 14021
	       neutral: 45076
	      positive: 21391
validation
	      negative: 1200
	       neutral: 1200
	      positive: 1200


### DynaSent round 2

DynaSent Round 2 was created using different methods than Round 1. For Round 2, crowdworkers edited sentences from the Yelp Academic Dataset seeking to achieve a particular sentiment goal (e.g., expressing a positive sentiment) while fooling a top-performing model. This work was done on the [Dynabench](https://dynabench.org) platform. The hope is that this directly adversarial goal will lead to examples that are very hard for present-day models but intuitive for humans. All the examples were multiply-labeled by separate annotators.

In [4]:
dynasent_r2 = load_dataset("dynabench/dynasent", 'dynabench.dynasent.r2.all')

Repo card metadata block was not found. Setting CardData to empty.


In [15]:
print_label_dist(dynasent_r2)

train
	      negative: 4579
	       neutral: 2448
	      positive: 6038
validation
	      negative: 240
	       neutral: 240
	      positive: 240


### Stanford Sentiment Treebank

The [Stanford Sentiment Treebank (SST)](http://nlp.stanford.edu/sentiment/) of [Socher et al. 2013](https://aclanthology.org/D13-1170/) is a widely-used resource for evaluating supervised models. It consists of sentences from Rotten Tomatoes Movie Reviews (see [Pang and Lee's project page](https://www.cs.cornell.edu/home/llee/papers/pang-lee-stars.home.html)). We will use the ternary version of the task (SST-3).

SST examples are special in that they are labeled at the phrase-level as well as the sentence level, which provides very extensive and detailed supervision for sentiment. We will use only the sentence-level labels for the homework, but you are free to use the phrase-level labels as well in designing your original system. (To do this, you will need to get the dataset from the above project page, since the Hugging Face SST-3 we are using does not include these labels.)

In [5]:
sst = load_dataset("SetFit/sst5")

Repo card metadata block was not found. Setting CardData to empty.


In [17]:
sst

DatasetDict({
    train: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 8544
    })
    validation: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 1101
    })
    test: Dataset({
        features: ['text', 'label', 'label_text'],
        num_rows: 2210
    })
})

Out of the box, this is a five-way task:

In [18]:
print_label_dist(sst, labelname='label_text')

train
	      negative: 2218
	       neutral: 1624
	      positive: 2322
	 very negative: 1092
	 very positive: 1288
validation
	      negative: 289
	       neutral: 229
	      positive: 279
	 very negative: 139
	 very positive: 165


The above labels are not aligned with our ternary task, and the dataset distribution uses slightly different keys from those of DynaSent. The following code converts the dataset to SST-3 and also aligns the dataset keys:

In [6]:
def convert_sst_label(s):
    return s.split(" ")[-1]

In [7]:
for splitname in ('train', 'validation', 'test'):
    dist = [convert_sst_label(s) for s in sst[splitname]['label_text']]
    sst[splitname] = sst[splitname].add_column('gold_label', dist)
    sst[splitname] = sst[splitname].add_column('sentence', sst[splitname]['text'])

In [21]:
print_label_dist(sst)

train
	      negative: 3310
	       neutral: 1624
	      positive: 3610
validation
	      negative: 428
	       neutral: 229
	      positive: 444


## Question 1: Linear classifiers

Our first set of experiments will use simple linear classifiers with sparse representations derived from counting unigrams. These experiments will introduce some useful techniques and provide a baseline for original systems. 

### Background: Feature functions

The following is a flexible format for writing feature functions in the context of scikit-learn modeling. The function maps a string to a count dictionary, using the simple procedure of splitting on whitespace and counting the resulting elements:

In [22]:
def unigrams_phi(s):
    """The basis for a unigrams feature function.

    Downcases all tokens.

    Parameters
    ----------
    s : str
        The example to represent

    Returns
    -------
    Counter
        A map from tokens (str) to their counts in `text`

    """
    return Counter(s.lower().split())

Quick example:

In [23]:
unigrams_phi("Here's an example with an emoticon :)!")

Counter({'an': 2,
         "here's": 1,
         'example': 1,
         'with': 1,
         'emoticon': 1,
         ':)!': 1})

### Background: Feature space vectorization

Functions like `unigrams_phi`  are just the __basis__ for feature representations. In truth, our models typically don't represent examples as dictionaries, but rather as vectors embedded in a matrix. In general, to manage the translation from dictionaries to vectors, we use [sklearn.feature_extraction.DictVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) instances. Here's a brief overview of how these work:

To start, suppose that we had just two examples to represent, and our feature function mapped them to the following list of dictionaries:

In [24]:
train_feats = [
    {'a': 1, 'b': 1},
    {'b': 1, 'c': 2}]

Now we create a `DictVectorizer`. So that we can more easily inspect the resulting matrix, I've set `sparse=False`, so that the return value is a dense matrix. For real problems, you'll probably want to use `sparse=True`, as it will be vastly more efficient for the very sparse feature matrices that you are likely to be creating.

In [25]:
vec = DictVectorizer(sparse=False)  # Use `sparse=True` for real problems!

The `fit_transform` method maps our list of dictionaries to a matrix:

In [26]:
X_train = vec.fit_transform(train_feats)

Here I'll create a `pd.Datafame` just to help us inspect `X_train`:

In [27]:
pd.DataFrame(X_train, columns=vec.get_feature_names_out())

Unnamed: 0,a,b,c
0,1.0,1.0,0.0
1,0.0,1.0,2.0


Now we can see that, intuitively, the feature called "a" is embedded in the first column, "b" in the second column, and "c" in the third.

Now suppose we have some new test examples:

In [28]:
test_feats = [
    {'a': 2, 'c': 1},
    {'a': 4, 'b': 2, 'd': 1}]

If we have trained a model on `X_train`, then it will not have any way to deal with this new feature "d". This shows that we need to embed `test_feats` in the same space as `X_train`. To do this, one just calls `transform` on the existing vectorizer:

In [29]:
X_test = vec.transform(test_feats)  # Not `fit_transform`!

In [30]:
pd.DataFrame(X_test, columns=vec.get_feature_names_out())

Unnamed: 0,a,b,c
0,2.0,0.0,1.0
1,4.0,2.0,0.0


The most common mistake with `DictVectorizer` is calling `fit_transform` on test examples. This will wipe out the existing representation scheme, replacing it with one that matches the test examples. That will happen silently, but then you'll find that the new representations are incompatible with the model you fit. This is likely to manifest itself as a `ValueError` relating to feature counts. Here's an example that might help you spot this if and when it arises in your own work:

In [31]:
toy_mod = LogisticRegression()

vec = DictVectorizer(sparse=False)

X_train = vec.fit_transform(train_feats)

toy_mod.fit(X_train, [0, 1])

# Here's the error! Don't use `fit_transform` again!
# Use `transform`!
X_test = vec.fit_transform(test_feats)

try:
    toy_mod.predict(X_test)
except ValueError as err:
    print("ValueError: {}".format(err))

ValueError: X has 4 features, but LogisticRegression is expecting 3 features as input.


### Background: scikit-learn models

scikit-learn is an amazing package with, among many other things, an incredible array of classifier model implementations. We're going to use a simple softmax classifier for this homework question, but you will find that you can swap in essentially any scikit-learn classifier and see how it does.

The core rhythm for scikit-learn models:

1. Instantiate the model with any hyperparamters.
2. `fit` 
3. `predict`

Here's a quick example that also shows off scikit-learn's functionality for creating synthetic datasets and random train/test splits:

In [32]:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X_toy, y_toy = make_classification(
    n_samples=200, n_classes=3,
    n_informative=15, n_features=20,
    weights=[0.2, 0.2, 0.6],
    random_state=1)

X_toy_train, X_toy_test, y_toy_train, y_toy_test = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=1)

toymod = LogisticRegression(penalty='l2', C=1, fit_intercept=True)

toymod.fit(X_toy_train, y_toy_train)

toypreds = toymod.predict(X_toy_test)

### Background: Classifier assessment

When assessing a classifier, the best first step is usually to get a classification report:

In [33]:
from sklearn.metrics import classification_report

print(classification_report(y_toy_test, toypreds, digits=3))

              precision    recall  f1-score   support

           0      0.444     0.500     0.471         8
           1      0.444     0.500     0.471         8
           2      0.909     0.833     0.870        24

    accuracy                          0.700        40
   macro avg      0.599     0.611     0.604        40
weighted avg      0.723     0.700     0.710        40



In this course, we will generally focus in the __macro-average F1 score__ (macro avg above). This is simply the mean of the per-class F1 scores, without any attention paid to the overall size of the class. This is our default because, in NLP, we tend to care about small classes as much as (often more than) large classes.

The scikit-learn implementation of `macro_f1` can be finicky, so our course code provides a convenient wrapper:

In [34]:
import utils

utils.safe_macro_f1(y_toy_test, toypreds)

0.6035805626598466

Note: scikit-learn models have a `score` method. For classifiers, this is set to use `accuracy` by default:

In [35]:
toymod.score(X_toy_test, y_toy_test)

0.7

Accuracy generally isn't well-aligned with our goals, so we discourage use of this method (and of accuracy scores in general).

scikit-learn also makes it very easy to perform automatic hyperparameter tuning. A quick example:

In [36]:
from sklearn.model_selection import GridSearchCV

params = {'C': (0.1, 0.2, 0.3), 'fit_intercept': [True, False]}

toymod_tuned = LogisticRegression()

clf = GridSearchCV(toymod_tuned, params, scoring='f1_macro')

_ = clf.fit(X_toy, y_toy)

Here's the best model found by this search:

In [37]:
clf.best_estimator_

Because we set `scoring='f1_macro'`, the above model was selected using our favored classifier scoring metric:

In [38]:
clf.best_score_

0.6943888670150135

With this best model in hand, we can perform our usual assessment:

In [39]:
bestpreds = clf.best_estimator_.predict(X_toy_test)

In [40]:
print(classification_report(bestpreds, y_toy_test, digits=3))

              precision    recall  f1-score   support

           0      0.750     0.600     0.667        10
           1      0.750     0.750     0.750         8
           2      0.833     0.909     0.870        22

    accuracy                          0.800        40
   macro avg      0.778     0.753     0.762        40
weighted avg      0.796     0.800     0.795        40



### Task 1: Feature functions [1 point]

The tokenization scheme used by `unigrams_phi` is very basic and leads to unintuitive tokens with punctuation attached to them. Your task here is to complete `tweetgrams_phi`, which should lead to more intuitive results. The task is really just to use the NLTK [TweetTokenizer](https://www.nltk.org/api/nltk.tokenize.casual.html#nltk.tokenize.casual.TweetTokenizer) in place of the simple whitespace tokenization of `unigrams_phi` above.

In [41]:
# Your `tweetgrams_phi` should tokenize data according to this tokenizer from NLTK:
from nltk.tokenize import TweetTokenizer

def tweetgrams_phi(s, **kwargs):
    """The basis for a feature function using `TweetTokenizer`.

    Parameters
    ----------
    s : str
    kwargs : dict
        Passed to `TweetTokenizer`

    Returns
    -------
    Counter
        A map from tokens to their counts in `text`

    """
    pass
    ##### YOUR CODE HERE
    tokenizer = TweetTokenizer(preserve_case=kwargs['preserve_case'])
    return Counter(tokenizer.tokenize(s))




Here's a test you can use to check that your implementation is correct:

In [42]:
def test_tweetgrams_phi(func):
    examples = [
        (
            "Here's an example with an emoticon :)", 
            Counter({'an': 2, "Here's": 1, 'example': 1, 'with': 1, 'emoticon': 1, ':)': 1})
        ),
        (
            "The URL is https://pytorch.org!", 
            Counter({'The': 1, 'URL': 1, 'is': 1, 'https://pytorch.org': 1, '!': 1})
        )
    ]
    errcount = 0
    for ex, expected in examples:
        result = func(ex, preserve_case=True)
        if result != expected:
            errcount += 1
            print(f"Error for `{func.__name__}`: For input {ex}, "
                  f"expected {expected} but got {result}")
    caps_ex = "CAPS"
    caps_result = func(caps_ex, preserve_case=False)
    caps_expected = Counter({"caps": 1})
    if caps_result != caps_expected:
        errcount += 1
        print(f"Error for `{func.__name__}`: For input {caps_ex}, "
              f"expected {caps_expected} but got {caps_result}")
    if errcount == 0:
        print(f"All tests passed for `{func.__name__}`")

In [43]:
test_tweetgrams_phi(tweetgrams_phi)

All tests passed for `tweetgrams_phi`


### Task 2: Model training [1 point]

Your task is to complete `train_linear_model`:

In [44]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report

In [45]:
def train_linear_model(model, featfunc, train_dataset):
    """Train an sklearn classifier.

    Parameters
    ----------
    model : sklearn classifier model
    featfunc : func
        Maps strings to Counter instances
    train_dataset: dict
        Must have a key "sentence" containing strings that `featfunc`
        will process, and a key "gold_label" giving labels

    Returns
    -------
    tuple
        * A trained version of `model`
        * A fitted `vectorizer` for the train set

    """
    pass
    # Step 1: Featurize all the examples in `train_dataset['sentence']`
    ##### YOUR CODE HERE
    train_fitted = list(map(lambda x: featfunc(x), train_dataset['sentence']))


    # Step 2: Instantiate and use a `DictVectorizer`:
    ##### YOUR CODE HERE
    vec = DictVectorizer()
    X_train = vec.fit_transform(train_fitted)


    # Step 3: Train the model on the feature matrix and
    # train_dataset['gold_label']:
    ##### YOUR CODE HERE
    model.fit(X_train, train_dataset['gold_label'])


    # Step 4: Return (model, vectorizer):
    ##### YOUR CODE HERE
    return (model, vec)




You can use the following test to help ensure that your implementation is correct:

In [46]:
def test_train_linear_model(func):
    train_dataset = {
        'sentence': ['A A', 'A B', 'B B', 'B A', 'B'],
        'gold_label': [0, 1, 0, 1, 1]}
    def featfunc(s):
        return Counter(s.split())
    model = LogisticRegression()
    result = func(model, featfunc, train_dataset)
    if not isinstance(result, tuple) or len(result) != 2:
        print(f"Error for `{func.__name__}`: Incorrect return type")
        return
    model, vectorizer = result
    if not hasattr(vectorizer, 'vocabulary_'):
        print(f"Error for `{func.__name__}`: "
              f"Second return value is not a trained vectorizer")
        return
    if not hasattr(model, 'classes_'):
        print(f"Error for `{func.__name__}`: "
              f"First return value is not a trained classifier")
        return
    print(f"No errors found for `{func.__name__}`")

In [47]:
_ = test_train_linear_model(train_linear_model)

No errors found for `train_linear_model`


You can now very easily train models on our datasets. Quick example (this shouldn't take more than a couple of minutes to run even on a CPU):

In [48]:
lr_unigrams, vec_unigrams = train_linear_model(
    LogisticRegression(max_iter=1000), 
    unigrams_phi, dynasent_r1['train'])

### Task 3: Model assessment [1 point]

Having now trained a model, we'd like to perform assessments on new data. Your task is to complete the wrapper function `assess_linear_model` to do this. The primary things you need to put into practice are (1) how to use a trained vectorizer on new data and (2) how to make predictions with your trained model. (Both of these steps are reviewed earlier in this notebook.)

In [49]:
def assess_linear_model(model, featfunc, vectorizer, assess_dataset):
    """Assess a trained sklearn model.

    Parameters
    ----------
    model: trained sklearn model
    featfunc : func
        Maps strings to count dicts
    vectorizer : fitted DictVectorizer
    assess_dataset: dict
        Must have a key "sentence" containing strings that `featfunc`
        will process, and a key "gold_label" giving labels

    Returns
    -------
    A classification report (multiline string)

    """
    pass
    # Step 1: Featurize the assessment data:
    ##### YOUR CODE HERE
    val_fitted = list(map(lambda x: featfunc(x), assess_dataset['sentence']))


    # Step 2: Vectorize the assessment data features:
    ##### YOUR CODE HERE
    X_val = vectorizer.transform(val_fitted)



    # Step 3: Make predictions:
    ##### YOUR CODE HERE
    y_preds_val = model.predict(X_val)


    # Step 4: Return a classification report (str):
    ##### YOUR CODE HERE
    return classification_report(assess_dataset['gold_label'], y_preds_val, digits=3)




Here's a quick test you can use:

In [50]:
def test_assess_linear_model(assessfunc, trainfunc):
    train_dataset = {
        'sentence': ['A A', 'A B', 'B B', 'B A', 'A', 'B'],
        'gold_label': [0, 1, 0, 1, 0, 1]}
    assess_dataset = {
        'sentence': ['A C', 'B A'],
        'gold_label': [0, 1]}
    def featfunc(s):
        return Counter(s.split())
    model = LogisticRegression()
    model, vectorizer = trainfunc(model, featfunc, train_dataset)
    result = assessfunc(model, featfunc, vectorizer, assess_dataset)
    errcount = 0
    if len(vectorizer.vocabulary_) != 2:
        print(f"Error for `{assessfunc.__name__}`: Unexpected feature count")
        errcount += 1
    if 'weighted avg' not in result:
        print(f"Error for `{assessfunc.__name__}`: Unexpected return value")
        errcount += 1
    if errcount == 0:
        print(f"No errors found for `{assessfunc.__name__}`")

In [51]:
test_assess_linear_model(assess_linear_model, train_linear_model)

No errors found for `assess_linear_model`


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


If you trained a model `lr_unigrams` above, you can now easily assess it. An example:

In [52]:
report = assess_linear_model(
    lr_unigrams,
    unigrams_phi,
    vec_unigrams,
    dynasent_r1['validation'])

print(report)

              precision    recall  f1-score   support

    negative      0.756     0.364     0.492      1200
     neutral      0.523     0.889     0.659      1200
    positive      0.699     0.572     0.629      1200

    accuracy                          0.608      3600
   macro avg      0.659     0.608     0.593      3600
weighted avg      0.659     0.608     0.593      3600



## Question 2: Transformer fine-tuning

We're now going to move into a more modern mode: fine-tuning pretrained components.

We'll use BERT-mini (originally from [the BERT repo](https://github.com/google-research/bert)) for the homework so that we can rapdily develop prototypes. You can then consider scaling up to larger models.

In [53]:
import transformers
from transformers import AutoModel, AutoTokenizer

The `transformers` library does a lot of logging. To avoid ending up with a cluttered notebook, I am changing the logging level. You might want to skip this as you scale up to building production systems, since the logging is very good – it gives you a lot of insights into what the models and code are doing.

In [54]:
transformers.logging.set_verbosity_error()

Here we set ourselves up to use BERT-mini:

In [55]:
weights_name = "prajjwal1/bert-mini"

bert = AutoModel.from_pretrained(weights_name)

bert_tokenizer = AutoTokenizer.from_pretrained(weights_name)

config.json:   0%|          | 0.00/286 [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/45.1M [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]



### Background: Tokenization

Tokenization in Transformer models is handled differently from tokenization in linear models of the sort we used in Question 1. For Transformer models, we need to use the tokenizer that comes with the model so that we reliably have embedding representations for every token.

In [56]:
example_text = "Bert knows Snuffleupagus"

Here's a basic tokenization step:

In [57]:
bert_tokenizer.tokenize(example_text)

['bert', 'knows', 's', '##nu', '##ffle', '##up', '##ag', '##us']

Notice that the tokenizer split "Snuffleupagus" into a bunch of subword tokens.

The above use of the tokenizer, where we map from strings to lists of strings, is really for us humans. For modeling, the most important step for tokenization is mapping individual strings to sequences of integer ids. These ids key into the lowest embedding layer of the model.

In [58]:
ex_ids = bert_tokenizer.encode(example_text, add_special_tokens=True)

ex_ids

[101, 14324, 4282, 1055, 11231, 18142, 6279, 8490, 2271, 102]

We can get map these indices back to "words" if we want:

In [59]:
bert_tokenizer.convert_ids_to_tokens(ex_ids)

['[CLS]',
 'bert',
 'knows',
 's',
 '##nu',
 '##ffle',
 '##up',
 '##ag',
 '##us',
 '[SEP]']

### Background: Representation

Having mapped our string to a list of tokens, we can use the `forward` method of the model to get representations:

In [60]:
with torch.no_grad():
    reps = bert(torch.tensor([ex_ids]))

There are a lot of options for which representations to get. With the above call, we got the following:

In [61]:
reps.keys()

odict_keys(['last_hidden_state', 'pooler_output'])

The value of `last_hidden_state` hidden state is the sequence of final output states from the model:

In [62]:
reps.last_hidden_state.shape

torch.Size([1, 10, 256])

This is: 1 example, 10 token representations, each one a 256 dimension vector.

The value of `pooler_output` is a set of currently random parameters sitting on top of the first output hidden state. You can see here that it is a single vector representation per example:

In [63]:
reps.pooler_output.shape

torch.Size([1, 256])

I often feel unsure of precisely what this model component is. Here we can have a quick look:

In [64]:
bert.pooler

BertPooler(
  (dense): Linear(in_features=256, out_features=256, bias=True)
  (activation): Tanh()
)

So this is a dense linear layer (a single matrix of weights) with a bias term, and a tanh activation function is applied to the output. We could put a classifier head on top of this if we wanted to, but we might have mixed feelings about being stuck with that tanh step.

### Background: Masking

Where examples from a single batch have different lengths, we need to mask the padded tokens to get the intended results from the model.

For a quick example, here we process our full example from above and print out the first five values:

In [65]:
with torch.no_grad():
    reps = bert(torch.tensor([ex_ids]))
    print(reps.last_hidden_state[0][0][: 5])

tensor([-0.3763, -0.3209,  0.8817,  0.4568, -1.0314])


And now we do the same thing, but with masking of the final five positions to illustate:

In [66]:
with torch.no_grad():
    # Mask the last 5 tokens:
    am = torch.tensor([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]])
    maskreps = bert(torch.tensor([ex_ids]), attention_mask=am)
    print(maskreps.last_hidden_state[0][0][: 5])

tensor([-0.1793, -0.8994,  0.9695,  0.9130, -0.7129])


### Task 1: Batch tokenization [1 point]

Your task here is to use the `batch_encode_plus` method for `bert_tokenizer` to tokenize a list of strings. You should complete `get_batch_token_ids` according to the specification in the doctring. All these steps can be handled with a single call to `batch_encode_plus`.

In [18]:
def get_batch_token_ids(batch, tokenizer):
    """Map `batch` to a tensor of ids. The return
    value should meet the following specification:

    1. The max length should be 512.
    2. Examples longer than the max length should be truncated
    3. Examples should be padded to the max length for the batch.
    4. The special [CLS] should be added to the start and the special
       token [SEP] should be added to the end.
    5. The attention mask should be returned
    6. The return value of each component should be a tensor.

    Parameters
    ----------
    batch: list of str
    tokenizer: Hugging Face tokenizer

    Returns
    -------
    dict with at least "input_ids" and "attention_mask" as keys,
    each with Tensor values

    """
    pass
    ##### YOUR CODE HERE
    with torch.no_grad():
      return tokenizer.batch_encode_plus(batch_text_or_text_pairs=batch, max_length=512, truncation=True, padding='max_length', return_tensors='pt', return_attention_mask=True)




Here's a test you can use:

In [68]:
def test_get_batch_token_ids(func):
    examples = [
        "Bert knows Snuffleupagus",
        "ELMo knew Bert.",
        "Buffalo " * 520
    ]
    test_tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-mini")
    result = func(examples, test_tokenizer)
    errcount = 0
    if 'attention_mask' not in result:
        errcount += 1  
        print(f"Error for `{func.__name__}`: "
              f"Attention mask was not returned")
    ids = result['input_ids']
    if not isinstance(ids, torch.Tensor):
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Return values are not tensors")
    if ids.shape[1] != 512:
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Expected sequence length 512; got {ids.shape[1]}")
    if ids[0][0] != bert_tokenizer.cls_token_id:
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Special tokens were not added")
    if errcount == 0:
        print(f"No errors found for `{func.__name__}`")

In [69]:
test_get_batch_token_ids(get_batch_token_ids)

No errors found for `get_batch_token_ids`


### Task 2: Contextual representations [1 point]

This next task is not used directly in fine-tuning, but it should help ensure that you understand how BERT representations are created and how they need to be managed.

Your task is to complete `get_reps` so that, given a dataset (list of strings), it returns a single tensor in which each row is the output hidden state above the [CLS] token for that example. `gets_reps` has a batchsize argument that the user can manage depending on how much available memory they have and how large their model is.

In [17]:
def get_reps(dataset, model, tokenizer, batchsize=20):
    """Represent each example in `dataset` with the final hidden state 
    above the [CLS] token.

    Parameters
    ----------
    dataset : list of str
    model : BertModel
    tokenizer : BertTokenizerFast
    batchsize : int

    Returns
    -------
    torch.Tensor with shape `(n_examples, dim)` where `dim` is the
    dimensionality of the representations for `model`

    """
    data = torch.tensor([])
    with torch.no_grad():
        pass
        # Iterate over `dataset` in batches:
        ##### YOUR CODE HERE
        for i in range(len(dataset)):
            if (i*batchsize) < len(dataset) and (i+1)*batchsize <= len(dataset):
                batch_dataset = dataset[(i*batchsize): (i+1)*batchsize]
                if ((len(batch_dataset) % batchsize) == 0):
                    batch_tokenized = get_batch_token_ids(batch_dataset, tokenizer)
                    model_output = model(batch_tokenized['input_ids'], attention_mask=batch_tokenized['attention_mask']).last_hidden_state
                    data = torch.cat((data, model_output[:, 0, :]))
                else:
                    batch_dataset = dataset[len(dataset) - batchsize:]
                    batch_tokenized = get_batch_token_ids(batch_dataset, tokenizer)
                    model_output = model(batch_tokenized['input_ids'], attention_mask=batch_tokenized['attention_mask']).last_hidden_state
                    data = torch.cat((data, model_output[:, 0, :]))
        return data


            # Encode the batch with `get_batch_token_ids`:
            ##### YOUR CODE HERE



            # Get the representations from the model, making
            # sure to pay attention to masking:
            ##### YOUR CODE HERE



        # Return a single tensor:
        ##### YOUR CODE HERE




Quick test:

In [71]:
def test_get_reps(func):
    examples = ["The cat slept.", "The bird chirped."] * 20
    weights_name = "prajjwal1/bert-mini"
    test_model = AutoModel.from_pretrained(weights_name)
    test_tokenizer = AutoTokenizer.from_pretrained(weights_name)
    result = func(examples, test_model, test_tokenizer, batchsize=2)
    errcount = 0
    if result.shape != (40, 256):
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Expected shape {(40, 256)}, got {result.shape}")
    if round(result[0][0].item(), 2) != -0.64:
        errcount += 1
        print(f"Error for `{func.__name__}`: "
              f"Representations seem to be incorrect")
    if errcount == 0:
        print(f"No errors found for `{func.__name__}`")

In [72]:
test_get_reps(get_reps)

No errors found for `get_reps`


### Task 3: Fine-tuning module [1 point]

We can now put the above together into a basic `nn.Module` that will fine-tune our BERT model. Most of the module is written for you. The pieces you need to implement:

1. in the `init` methid, define `self.classifier_layer` using [nn.Sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html)
2. Complete the `forward` method.

Precise instructions are provided in the docstrings for the model.

In [73]:
import torch.nn as nn

class BertClassifierModule(nn.Module):
    def __init__(self, 
            n_classes, 
            hidden_activation, 
            weights_name="prajjwal1/bert-mini"):
        """This module loads a Transformer based on  `weights_name`,
        puts it in train mode, add a dense layer with activation
        function give by `hidden_activation`, and puts a classifier
        layer on top of that as the final output. The output of
        the dense layer should have the same dimensionality as the
        model input.

        Parameters
        ----------
        n_classes : int
            Number of classes for the output layer
        hidden_activation : torch activation function
            e.g., nn.Tanh()
        weights_name : str
            Name of pretrained model to load from Hugging Face

        """
        super().__init__()
        self.n_classes = n_classes
        self.weights_name = weights_name
        self.bert = AutoModel.from_pretrained(self.weights_name)
        self.bert.train()
        self.hidden_activation = hidden_activation
        self.hidden_dim = self.bert.embeddings.word_embeddings.embedding_dim
        # Add the new parameters here using `nn.Sequential`.
        # We can define this layer as
        # 
        #  h = f(cW1 + b_h)
        #  y = hW2 + b_y
        #
        # where c is the final hidden state above the [CLS] token,
        # W1 has dimensionality (self.hidden_dim, self.hidden_dim),
        # W2 has dimensionality (self.hidden_dim, self.n_classes),
        # f is the hidden activation, and we rely on the PyTorch loss
        # function to add apply a softmax to y.
        self.classifier_layer = torch.nn.Sequential(
            torch.nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim),
            self.hidden_activation,
            torch.nn.Linear(in_features=self.hidden_dim, out_features=self.n_classes),
            torch.nn.Softmax()
        )
        ##### YOUR CODE HERE



    def forward(self, indices, mask):
        """Process `indices` with `mask` by feeding these arguments
        to `self.bert` and then feeding the initial hidden state
        in `last_hidden_state` to `self.classifier_layer`

        Parameters
        ----------
        indices : tensor.LongTensor of shape (n_batch, k)
            Indices into the `self.bert` embedding layer. `n_batch` is
            the number of examples and `k` is the sequence length for
            this batch
        mask : tensor.LongTensor of shape (n_batch, d)
            Binary vector indicating which values should be masked.
            `n_batch` is the number of examples and `k` is the
            sequence length for this batch

        Returns
        -------
        tensor.FloatTensor
            Predicted values, shape `(n_batch, self.n_classes)`

        """
        pass
        ##### YOUR CODE HERE
        return self.classifier_layer(self.bert(indices, attention_mask=mask).last_hidden_state[:, 0, :])




In [74]:
bert_module = BertClassifierModule(n_classes=3, hidden_activation=nn.Tanh())

In [75]:
ids = get_batch_token_ids(
    dynasent_r1['train']['sentence'][: 2],
    bert_tokenizer)

bert_module(ids['input_ids'], ids['attention_mask'])

  return self._call_impl(*args, **kwargs)


tensor([[0.2519, 0.2793, 0.4688],
        [0.2459, 0.2816, 0.4725]], grad_fn=<SoftmaxBackward0>)

In [76]:
def test_bert_classifier_module(moduleclass): 
    expected_out = 5
    expected_hidden = 256
    expected_activation = nn.ReLU()
    mod = moduleclass(expected_out, expected_activation)
    errcount = 0

    # Basic layer structure:
    if not hasattr(mod, "classifier_layer") or mod.classifier_layer is None:
        errcount += 1
        print(f"Error for `{moduleclass.__name__}`: "
              f"Missing attribute `classifier_layer`")
        return 
    for i in range(3):
        try:
            bert_module.classifier_layer[i]
        except IndexError:
            errcount += 1
            print(f"Error for `{moduleclass.__name__}`: "
                  f"`classifier_layer` is not an `nn.Sequential` "
                  f"and/or does not have the right structure")
    # Correct first layer dimensionality:
    result_hidden = mod.classifier_layer[0].out_features
    if result_hidden != expected_hidden:
        errcount += 1
        print(f"Error for `{moduleclass.__name__}`: "
              f"Expected `classifier_layer` hidden dim {expected_hidden}, "
              f"got {result_hidden}") 
    # Correct activation:
    result_activation = mod.classifier_layer[1].__class__.__name__
    if result_activation != expected_activation.__class__.__name__:
        errcount += 1
        print(f"Error for `{moduleclass.__name__}`: "
              f"Incorrect hidden activation")
    # Correct output dimensionality:
    result_out = mod.classifier_layer[2].out_features
    if result_out != expected_out:
        errcount += 1
        print(f"Error for `{moduleclass.__name__}`: "
              f"Expected `classifier_layer` out dim {expected_out}, "
              f"got {result_out}")
    # forward method:
    ids = get_batch_token_ids(["A B C", "A B"], bert_tokenizer)
    result = mod(ids['input_ids'], ids['attention_mask'])
    if result.shape != (2, 5):
        errcount += 1
        print(f"Error for `{moduleclass.__name__}`: "
              f"Expected output shape {(2, 5)}, got {result.shape}")
    if errcount == 0:
        print(f"No errors found for `{moduleclass.__name__}`")

In [77]:
test_bert_classifier_module(BertClassifierModule)

No errors found for `BertClassifierModule`


### Optional use: Classifier interface

The above module doesn't have functionality for processing data and fitting models. Our course code includes some general purpose code for adding these features. Here is an example that should work well with the module you wrote above. For more details on the design of these interfaces, see [tutorial_pytorch_models.ipynb](tutorial_pytorch_models.ipynb).

In [78]:
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier

class BertClassifier(TorchShallowNeuralClassifier):
    def __init__(self, weights_name, *args, **kwargs):
        self.weights_name = weights_name
        self.tokenizer = AutoTokenizer.from_pretrained(self.weights_name)
        super().__init__(*args, **kwargs)
        self.params += ['weights_name']

    def build_graph(self):
        return BertClassifierModule(
            self.n_classes_, self.hidden_activation, self.weights_name)

    def build_dataset(self, X, y=None):
        data = get_batch_token_ids(X, self.tokenizer)
        if y is None:
            dataset = torch.utils.data.TensorDataset(
                data['input_ids'], data['attention_mask'])
        else:
            self.classes_ = sorted(set(y))
            self.n_classes_ = len(self.classes_)
            class2index = dict(zip(self.classes_, range(self.n_classes_)))
            y = [class2index[label] for label in y]
            y = torch.tensor(y)
            dataset = torch.utils.data.TensorDataset(
                data['input_ids'], data['attention_mask'], y)
        return dataset

And here is a training run that should do pretty well for our problem. 

__Note__: This step should not be run on CPU machines. On Google Colab with a GPU, it will likely take about an hour.

In [79]:
bert_finetune = BertClassifier(
    weights_name="prajjwal1/bert-mini",
    hidden_activation=nn.ReLU(),
    eta=0.00005,          # Low learning rate for effective fine-tuning.
    batch_size=8,         # Small batches to avoid memory overload.
    gradient_accumulation_steps=4,  # Increase the effective batch size to 32.
    early_stopping=True,  # Early-stopping
    n_iter_no_change=5)   # params.

In [80]:
%%time

_ = bert_finetune.fit(
    dynasent_r1['train']['sentence'],
    dynasent_r1['train']['gold_label'])

  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
  return self._call_impl(*args, **kwargs)
Stopping after epoch 18. Validation score did not improve by tol=1e-05 for more than 5 epochs. Final error is 1564.4846777170897

CPU times: user 1h 49min 12s, sys: 31.8 s, total: 1h 49min 43s
Wall time: 1h 48min 22s


In [118]:
torch.save(bert_finetune, './models/fine_tuned_model_bert_mini_v2.pth')


In [97]:
loaded_model = torch.load('./models/fine_tuned_model_bert_mini.pth')

  loaded_model = torch.load('./models/fine_tuned_model_bert_mini.pth')


In [98]:
loaded_model_preds = loaded_model.predict(sst['validation']['sentence'])

  return self._call_impl(*args, **kwargs)


In [112]:
preds = bert_finetune.predict(sst['validation']['sentence'])

  return self._call_impl(*args, **kwargs)


In [114]:
preds_dynaset_r1 = bert_finetune.predict(dynasent_r1['validation']['sentence'])

  return self._call_impl(*args, **kwargs)


In [115]:
preds_dynaset_r2 = bert_finetune.predict(dynasent_r2['validation']['sentence'])

In [99]:
print(classification_report(sst['validation']['gold_label'], loaded_model_preds, digits=3))

              precision    recall  f1-score   support

    negative      0.580     0.533     0.555       428
     neutral      0.304     0.406     0.348       229
    positive      0.637     0.577     0.605       444

    accuracy                          0.524      1101
   macro avg      0.507     0.505     0.503      1101
weighted avg      0.546     0.524     0.532      1101



In [113]:
print(classification_report(sst['validation']['gold_label'], preds, digits=3))

              precision    recall  f1-score   support

    negative      0.710     0.715     0.712       428
     neutral      0.453     0.253     0.325       229
    positive      0.673     0.822     0.740       444

    accuracy                          0.662      1101
   macro avg      0.612     0.597     0.593      1101
weighted avg      0.642     0.662     0.643      1101



In [None]:
print(classification_report(dynasent_r1['validation']['gold_label'], preds, digits=3))

In [116]:
print(classification_report(dynasent_r2['validation']['gold_label'], preds_dynaset_r2, digits=3))

              precision    recall  f1-score   support

    negative      0.564     0.496     0.528       240
     neutral      0.604     0.629     0.616       240
    positive      0.548     0.592     0.569       240

    accuracy                          0.572       720
   macro avg      0.572     0.572     0.571       720
weighted avg      0.572     0.572     0.571       720



In [117]:
print(classification_report(dynasent_r1['validation']['gold_label'], preds_dynaset_r1, digits=3))

              precision    recall  f1-score   support

    negative      0.757     0.534     0.626      1200
     neutral      0.612     0.863     0.716      1200
    positive      0.731     0.646     0.686      1200

    accuracy                          0.681      3600
   macro avg      0.700     0.681     0.676      3600
weighted avg      0.700     0.681     0.676      3600



In [83]:
preds = bert_finetune.predict(dynasent_r1['validation']['sentence'])

  return self._call_impl(*args, **kwargs)


In [84]:
print(classification_report(dynasent_r1['validation']['gold_label'], preds, digits=3))

              precision    recall  f1-score   support

    negative      0.782     0.504     0.613      1200
     neutral      0.603     0.862     0.709      1200
    positive      0.731     0.676     0.702      1200

    accuracy                          0.681      3600
   macro avg      0.705     0.681     0.675      3600
weighted avg      0.705     0.681     0.675      3600



## Question 3: Your original system [3 points]

Your task is to develop an original ternary sentiment classifier model. There are many options. The only rule:

__You cannot make any use of the test sets for DynaSent-R1, DynaSent-R2, or SST-3, at any time during the course of development.__

The integrity of the bakeoff depends on this rule being followed.

It's fine to use the dev sets for system development – indeed, we encourage this.

For system development, here are some relatively manageable ideas that you might try:

* Different pretrained models. There are many models available on the [Hugging Face models hub](https://huggingface.co/models) that will be drop-in replacements for BERT-mini as we used it above.

* Different fine-tuning regimes. We used the [CLS] token above. This doesn't make especially good use of the output states of the models. Pooling across these representtions (with sum, average, etc.) is likely to be better.

* Different training regimes. You have three train sets at your disposal, and there may be other sentiment datasets that could contribute to making your system more robust in new domains.

* Entirely different approaches. There is no requirement that you make use of any of the concepts from the homework questions in constructing your original system. Anything goes as long as you follow the one rule given above in bold.

We want to emphasize that this needs to be an original system. It doesn't suffice to download code from the Web, retrain, and submit. You can build on others' code, but you have to do something new and meaningful with it. See the course website for additional guidance on how original systems will be evaluated.

In the cell below, please provide a brief technical description of your original system, so that the teaching team can gain an understanding of what it does. This will help us to understand your code and analyze all the submissions to identify patterns and strategies.

In [8]:
# PLEASE MAKE SURE TO INCLUDE THE FOLLOWING BETWEEN THE START AND STOP COMMENTS:
#   1) Textual description of your system.
#   2) The code for your original system.
# PLEASE MAKE SURE NOT TO DELETE OR EDIT THE START AND STOP COMMENTS

# START COMMENT: Enter your system description in this cell.
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
import torch 
from sklearn.metrics import classification_report

from transformers import pipeline
import torch.nn as nn
import nlpaug.augmenter.word as naw 
import nlpaug.augmenter.char as nac 
import nlpaug.augmenter.sentence as nas 
import numpy as np 
import gc
import random 
from operator import attrgetter
# I choose to implement a system by implementing Weighting voting classifier using the top 5 pretrained models (bart, deberta mnli, deberta zeroshot, finbert, roberta) in macro f1 score  
# On validation datasets sst, dynasent r1, dynasent r2 when applying zero shot classification 
# I benchmark the solution with other models like :
#  https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli with zero shot classification, 
# Roberta pretrained model for text classification
# Finbert pretrained model 
# Facebook pretrained bart model
# Pretrained model Deberta mnli https://huggingface.co/MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli which is also popular
# The main criterias to choose the right model was : 
# - popularity, - Downloads count,- Datasets used to train the model, - Trend, - Community, - Evaluation of the pretrained model on validation datasets

#Function to clear gpu state and cache
def reset_gpu(): 
    """ 
    Function to reset the GPU state in PyTorch. 
    """ 
    gc.collect()
    # Clear the PyTorch cache 
    if torch.cuda.is_available():
        torch.cuda.empty_cache() 
        torch.cuda.ipc_collect() # Collect any inter-process memory print("GPU state has been reset.") 
        

# Function that format the benchmark output per model
def format_classification_report(params, title):
    sst_gold_label = params["sst"]['gold_label']
    predictions_sst = params["prediction_sst"]
    dynasent_r1_gold_label = params["dynasent_r1"]['gold_label']
    predictions_dynasent_r1 = params["prediction_dynasent_r1"]
    dynasent_r2_gold_label = params["dynasent_r2"]['gold_label']
    predictions_dynasent_r2 = params["prediction_dynasent_r2"]
    print('-------------------------------------------------------------------------------------------------')
    print(title)
    print('-------------------------------------------------------------------------------------------------')
    print(f"Classification report on sst\n\n{classification_report(sst_gold_label, predictions_sst, digits=3)}\n\nClassification report on dynasent r1\n\n{classification_report(dynasent_r1_gold_label, predictions_dynasent_r1, digits=3)}\n\nClassification report on dynasent r2\n\n{classification_report(dynasent_r2_gold_label, predictions_dynasent_r2, digits=3)}")
    
    

#Function to augment a list of texts using nlpAug augmentation methods
def augment_texts(texts, label, n=5): 
    """ 

    Augments a list of texts using random NLP augmentation techniques. 
    
    Args: 
        texts (list): List of input texts to augment. 
        n (int): Number of augmented datasets to generate for each input text. 
        label (str): Label of the text
    
    Returns: 
        augmented_datasets (list): A list containing n augmented datasets for each input text. 

    """ 
    # Define the augmentation methods available 
    aug_methods = [ 
        naw.SynonymAug(aug_src='wordnet'), # Synonym replacement 
        #naw.RandomWordAug(action='swap'), # Randomly swap words in the text 
        #naw.RandomWordAug(action='delete'), # Randomly delete words in the text 
        #naw.AntonymAug(),
        naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="insert"), # Insert words using BERT 
        #naw.BackTranslationAug(
            #from_model_name='facebook/wmt19-en-de', 
            #to_model_name='facebook/wmt19-de-en'
        #),
        #naw.WordEmbsAug(model_type='word2vec', action="substitute")
        #nac.RandomCharAug(action='insert'), # Randomly insert characters in the text 
        #nac.RandomCharAug(action='delete'), # Randomly delete characters in the text 
        #nac.RandomCharAug(action='substitute'), # Randomly substitute characters in the text 
    ] 
    augmented_texts = [] 
    labels = []
    # For each input text, generate n augmented versions 
    #refacto to do to only augment based on n parameter to increase the size of sample
    while len(augment_texts) < n:
        # Randomly select an augmentation method 
        aug = random.choice(aug_methods)
        # Randomly select a text
        text = random.choice(texts)
        # Apply augmentation to the text 
        augmented_text = aug.augment(text)
        #Add augmented text to list
        augment_texts.append(augmented_text)
        labels.append(label)
    return augmented_texts, labels

#Format sentiment analysis output
def format_sentiment_analysis_output(output):
    labels = []
    scores = []
    for result in sorted(output[0], key=lambda x: x["label"]):
        labels.append(result["label"])
        scores.append(result["score"])
    return scores, labels

# Function that benchmark all these differents models and generate prediction for backoff dataset
def generate_models_predictions(sst_validation_dataset, dynasent_r1_validation_dataset, dynasent_r2_validation_dataset, backoff_df):
    labels = ['negative', 'neutral', 'positive']
    #clean gpu state
    reset_gpu()
    
    #Initialize models by specifying the task and pretrained model
    pipe_roberta = pipeline("sentiment-analysis", top_k=None, model="cardiffnlp/twitter-roberta-base-sentiment-latest", device=0)
    #Get prediction from dynasent r1 validation set for roberta model 
    preds_roberta_dynasent_r1 = []
    for text in dynasent_r1_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_roberta(text))
        preds_roberta_dynasent_r1.append(labels[np.argmax(scores)])

    #Get prediction from dynasent r2 validation set for roberta model 
    preds_roberta_dynasent_r2 = []
    for text in dynasent_r2_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_roberta(text))
        preds_roberta_dynasent_r2.append(labels[np.argmax(scores)])

    #Get prediction from sst validation set for roberta model 
    preds_roberta_sst = []
    for text in sst_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_roberta(text))
        preds_roberta_sst.append(labels[np.argmax(scores)])

    roberta_params = {
        "dynasent_r1": dynasent_r1_validation_dataset,
        "prediction_dynasent_r1": preds_roberta_dynasent_r1,
        "dynasent_r2": dynasent_r2_validation_dataset,
        "prediction_dynasent_r2": preds_roberta_dynasent_r2,
        "sst": sst_validation_dataset,
        "prediction_sst": preds_roberta_sst
    }
    format_classification_report(roberta_params, 'Roberta text-classification evaluation on different validation datasets')
    backoff_roberta_preds = []
    for text in backoff_df['sentence'].tolist():
        scores, labels = format_sentiment_analysis_output(pipe_roberta(text))
        backoff_roberta_preds.append(scores)

    del pipe_roberta
    reset_gpu()

    pipe_finbert = pipeline("sentiment-analysis", top_k=None, model="ProsusAI/finbert", device=0)
    #Get prediction from dynasent r1 validation set for finbert model 
    preds_finbert_dynasent_r1 = []
    for text in dynasent_r1_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_finbert(text))
        preds_finbert_dynasent_r1.append(labels[np.argmax(scores)])

    #Get prediction from dynasent r2 validation set for finbert model 
    preds_finbert_dynasent_r2 = []
    for text in dynasent_r2_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_finbert(text))
        preds_finbert_dynasent_r2.append(labels[np.argmax(scores)])

    #Get prediction from sst validation set for finbert model 
    preds_finbert_sst = []
    for text in sst_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_finbert(text))
        preds_finbert_sst.append(labels[np.argmax(scores)])

    finbert_params = {
        "dynasent_r1": dynasent_r1_validation_dataset,
        "prediction_dynasent_r1": preds_finbert_dynasent_r1,
        "dynasent_r2": dynasent_r2_validation_dataset,
        "prediction_dynasent_r2": preds_finbert_dynasent_r2,
        "sst": sst_validation_dataset,
        "prediction_sst": preds_finbert_sst
    }
    format_classification_report(finbert_params, 'Finbert text-classification evaluation on different validation datasets')
    backoff_finbert_preds = []
    for text in backoff_df['sentence'].tolist():
        scores, labels = format_sentiment_analysis_output(pipe_finbert(text))
        backoff_finbert_preds.append(scores)
    
    del pipe_finbert
    reset_gpu()

    pipe_bart = pipeline('zero-shot-classification', model="facebook/bart-large-mnli", device=0)
    #Get prediction from dynasent r1 validation set for bart model 
    preds_bart_dynasent_r1 = []
    for text in dynasent_r1_validation_dataset['sentence']:
        pred = pipe_bart(text, labels)
        preds_bart_dynasent_r1.append(pred['labels'][np.argmax(pred['scores'])])

    #Get prediction from dynasent r2 validation set for bart model 
    preds_bart_dynasent_r2 = []
    for text in dynasent_r2_validation_dataset['sentence']:
        pred = pipe_bart(text, labels)
        preds_bart_dynasent_r2.append(pred['labels'][np.argmax(pred['scores'])])

    #Get prediction from sst validation set for bart model 
    preds_bart_sst = []
    for text in sst_validation_dataset['sentence']:
        pred = pipe_bart(text, labels)
        preds_bart_sst.append(pred['labels'][np.argmax(pred['scores'])])

    bart_params = {
        "dynasent_r1": dynasent_r1_validation_dataset,
        "prediction_dynasent_r1": preds_bart_dynasent_r1,
        "dynasent_r2": dynasent_r2_validation_dataset,
        "prediction_dynasent_r2": preds_bart_dynasent_r2,
        "sst": sst_validation_dataset,
        "prediction_sst": preds_bart_sst
    }
    format_classification_report(bart_params, 'Bart zero shot classification evaluation on different validation datasets')
    backoff_bart_preds = list(map(lambda x: pipe_bart(x, labels)['scores'], backoff_df['sentence'].tolist()))

    del pipe_bart
    reset_gpu()
    
    pipe_deberta_mnli = pipeline("zero-shot-classification", model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli", device=0)
    #Get prediction from dynasent r1 validation set for deberta mnli model 
    preds_deberta_mnli_dynasent_r1 = []
    for text in dynasent_r1_validation_dataset['sentence']:
        pred = pipe_deberta_mnli(text, labels)
        preds_deberta_mnli_dynasent_r1.append(pred['labels'][np.argmax(pred['scores'])])

    #Get prediction from dynasent r2 validation set for deberta mnli model 
    preds_deberta_mnli_dynasent_r2 = []
    for text in dynasent_r2_validation_dataset['sentence']:
        pred = pipe_deberta_mnli(text, labels)
        preds_deberta_mnli_dynasent_r2.append(pred['labels'][np.argmax(pred['scores'])])

    #Get prediction from sst validation set for deberta mnli model 
    preds_deberta_mnli_sst = []
    for text in sst_validation_dataset['sentence']:
        pred = pipe_deberta_mnli(text, labels)
        preds_deberta_mnli_sst.append(pred['labels'][np.argmax(pred['scores'])])

    deberta_mnli_params = {
        "dynasent_r1": dynasent_r1_validation_dataset,
        "prediction_dynasent_r1": preds_deberta_mnli_dynasent_r1,
        "dynasent_r2": dynasent_r2_validation_dataset,
        "prediction_dynasent_r2": preds_deberta_mnli_dynasent_r2,
        "sst": sst_validation_dataset,
        "prediction_sst": preds_deberta_mnli_sst
    }
    format_classification_report(deberta_mnli_params, 'Deberta mnli zero shot classification evaluation on different validation datasets')
    backoff_deberta_mnli_preds = list(map(lambda x: pipe_deberta_mnli(x, labels)['scores'], backoff_df['sentence'].tolist()))

    del pipe_deberta_mnli
    reset_gpu()

    pipe_deberta_zeroshot = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0", device=0)
    #Get prediction from dynasent r1 validation set for deberta zeroshot model 
    preds_deberta_zeroshot_dynasent_r1 = []
    for text in dynasent_r1_validation_dataset['sentence']:
        pred = pipe_deberta_zeroshot(text, labels)
        preds_deberta_zeroshot_dynasent_r1.append(pred['labels'][np.argmax(pred['scores'])])

    #Get prediction from dynasent r2 validation set for deberta zeroshot model 
    preds_deberta_zeroshot_dynasent_r2 = []
    for text in dynasent_r2_validation_dataset['sentence']:
        pred = pipe_deberta_zeroshot(text, labels)
        preds_deberta_zeroshot_dynasent_r2.append(pred['labels'][np.argmax(pred['scores'])])

    #Get prediction from sst validation set for deberta zeroshot model 
    preds_deberta_zeroshot_sst = []
    for text in sst_validation_dataset['sentence']:
        pred = pipe_deberta_zeroshot(text, labels)
        preds_deberta_zeroshot_sst.append(pred['labels'][np.argmax(pred['scores'])])

    deberta_zeroshot_params = {
        "dynasent_r1": dynasent_r1_validation_dataset,
        "prediction_dynasent_r1": preds_deberta_zeroshot_dynasent_r1,
        "dynasent_r2": dynasent_r2_validation_dataset,
        "prediction_dynasent_r2": preds_deberta_zeroshot_dynasent_r2,
        "sst": sst_validation_dataset,
        "prediction_sst": preds_deberta_zeroshot_sst
    }
    format_classification_report(deberta_zeroshot_params, 'Deberta zero shot classification evaluation on different validation datasets')
    backoff_deberta_zeroshot_preds = list(map(lambda x: pipe_deberta_zeroshot(x, labels)['scores'], backoff_df['sentence'].tolist()))

    del pipe_deberta_zeroshot
    reset_gpu()

    return [backoff_deberta_zeroshot_preds, backoff_deberta_mnli_preds, backoff_roberta_preds, backoff_bart_preds, backoff_finbert_preds], labels, [0.3, 0.2, 0.2, 0.2, 0.1]
    
def weighted_voting_classifier(prediction_scores, labels, weights): 
    """ 
    Returns the final predicted labels based on weighted voting from multiple models' prediction scores. 
    Parameters: 
        prediction_scores (list of lists of np.array): A list where each element is a list of predicted scores from a model. Each np.array represents predicted scores for each class. 
        weights (list of floats): A list of weights corresponding to each model's predictions. 
    Returns: 
        list: A list of final predicted labels based on weighted voting. 
    """ 
    # Convert prediction_scores to a numpy array for easier manipulation 
    prediction_scores = np.array(prediction_scores) # Shape: (num_models, num_samples, num_classes) 
    # Ensure the number of weights matches the number of models 
    assert len(weights) == prediction_scores.shape[0], "Number of weights must match the number of models" 
    # Weighted sum of prediction scores across models 
    weighted_scores = np.tensordot(weights, prediction_scores, axes=(0, 0)) # Shape: (num_samples, num_classes) 
    # Get the final predicted labels by selecting the class with the maximum weighted score 
    index_set = np.argmax(weighted_scores, axis=1) 
    final_predictions = []
    for index in index_set:
        final_predictions.append(labels[index])
    return final_predictions

#Function to evaluate weigthing voting classifier on validation sets
def evaluate_weighting_voting_classifier_on_validation_sets(sst_validation_dataset, dynasent_r1_validation_dataset, dynasent_r2_validation_dataset):
    labels = ['negative', 'neutral', 'positive']
    #clean gpu state
    reset_gpu()
    
    #Initialize models by specifying the task and pretrained model
    pipe_roberta = pipeline("sentiment-analysis", top_k=None, model="cardiffnlp/twitter-roberta-base-sentiment-latest", device=0)
    #Get prediction from dynasent r1 validation set for roberta model 
    preds_roberta_dynasent_r1 = []
    for text in dynasent_r1_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_roberta(text))
        preds_roberta_dynasent_r1.append(scores)

    #Get prediction from dynasent r2 validation set for roberta model 
    preds_roberta_dynasent_r2 = []
    for text in dynasent_r2_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_roberta(text))
        preds_roberta_dynasent_r2.append(scores)

    #Get prediction from sst validation set for roberta model 
    preds_roberta_sst = []
    for text in sst_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_roberta(text))
        preds_roberta_sst.append(scores)

    del pipe_roberta
    reset_gpu()

    pipe_finbert = pipeline("sentiment-analysis", top_k=None, model="ProsusAI/finbert", device=0)
    #Get prediction from dynasent r1 validation set for finbert model 
    preds_finbert_dynasent_r1 = []
    for text in dynasent_r1_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_finbert(text))
        preds_finbert_dynasent_r1.append(scores)

    #Get prediction from dynasent r2 validation set for finbert model 
    preds_finbert_dynasent_r2 = []
    for text in dynasent_r2_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_finbert(text))
        preds_finbert_dynasent_r2.append(scores)

    #Get prediction from sst validation set for finbert model 
    preds_finbert_sst = []
    for text in sst_validation_dataset['sentence']:
        scores, labels = format_sentiment_analysis_output(pipe_finbert(text))
        preds_finbert_sst.append(scores)

    del pipe_finbert
    reset_gpu()

    pipe_bart = pipeline('zero-shot-classification', model="facebook/bart-large-mnli", device=0)
    #Get prediction from dynasent r1 validation set for bart model 
    preds_bart_dynasent_r1 = []
    for text in dynasent_r1_validation_dataset['sentence']:
        pred = pipe_bart(text, labels)
        preds_bart_dynasent_r1.append(pred['scores'])

    #Get prediction from dynasent r2 validation set for bart model 
    preds_bart_dynasent_r2 = []
    for text in dynasent_r2_validation_dataset['sentence']:
        pred = pipe_bart(text, labels)
        preds_bart_dynasent_r2.append(pred['scores'])

    #Get prediction from sst validation set for bart model 
    preds_bart_sst = []
    for text in sst_validation_dataset['sentence']:
        pred = pipe_bart(text, labels)
        preds_bart_sst.append(pred['scores'])

    del pipe_bart
    reset_gpu()
    
    pipe_deberta_mnli = pipeline("zero-shot-classification", model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli", device=0)
    #Get prediction from dynasent r1 validation set for deberta mnli model 
    preds_deberta_mnli_dynasent_r1 = []
    for text in dynasent_r1_validation_dataset['sentence']:
        pred = pipe_deberta_mnli(text, labels)
        preds_deberta_mnli_dynasent_r1.append(pred['scores'])

    #Get prediction from dynasent r2 validation set for deberta mnli model 
    preds_deberta_mnli_dynasent_r2 = []
    for text in dynasent_r2_validation_dataset['sentence']:
        pred = pipe_deberta_mnli(text, labels)
        preds_deberta_mnli_dynasent_r2.append(pred['scores'])

    #Get prediction from sst validation set for deberta mnli model 
    preds_deberta_mnli_sst = []
    for text in sst_validation_dataset['sentence']:
        pred = pipe_deberta_mnli(text, labels)
        preds_deberta_mnli_sst.append(pred['scores'])

    del pipe_deberta_mnli
    reset_gpu()

    pipe_deberta_zeroshot = pipeline("zero-shot-classification", model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0", device=0)
    #Get prediction from dynasent r1 validation set for deberta zeroshot model 
    preds_deberta_zeroshot_dynasent_r1 = []
    for text in dynasent_r1_validation_dataset['sentence']:
        pred = pipe_deberta_zeroshot(text, labels)
        preds_deberta_zeroshot_dynasent_r1.append(pred['scores'])

    #Get prediction from dynasent r2 validation set for deberta zeroshot model 
    preds_deberta_zeroshot_dynasent_r2 = []
    for text in dynasent_r2_validation_dataset['sentence']:
        pred = pipe_deberta_zeroshot(text, labels)
        preds_deberta_zeroshot_dynasent_r2.append(pred['scores'])

    #Get prediction from sst validation set for deberta zeroshot model 
    preds_deberta_zeroshot_sst = []
    for text in sst_validation_dataset['sentence']:
        pred = pipe_deberta_zeroshot(text, labels)
        preds_deberta_zeroshot_sst.append(pred['scores'])

    weights = [0.3, 0.2, 0.2, 0.2, 0.1]
    weight_voting_classifier_params = {
        "dynasent_r1": dynasent_r1_validation_dataset,
        "prediction_dynasent_r1": weighted_voting_classifier([preds_deberta_zeroshot_dynasent_r1, preds_deberta_mnli_dynasent_r1, preds_roberta_dynasent_r1, preds_bart_dynasent_r1, preds_finbert_dynasent_r1], labels, weights),
        "dynasent_r2": dynasent_r2_validation_dataset,
        "prediction_dynasent_r2": weighted_voting_classifier([preds_deberta_zeroshot_dynasent_r2, preds_deberta_mnli_dynasent_r2, preds_roberta_dynasent_r2, preds_bart_dynasent_r2, preds_finbert_dynasent_r2], labels, weights),
        "sst": sst_validation_dataset,
        "prediction_sst": weighted_voting_classifier([preds_deberta_zeroshot_sst, preds_deberta_mnli_sst, preds_roberta_sst, preds_bart_sst, preds_finbert_sst], labels, weights)
    }
    format_classification_report(weight_voting_classifier_params, 'Weight voting classifier evaluation on different validation datasets')

    del pipe_deberta_zeroshot
    reset_gpu()


# STOP COMMENT: Please do not remove this comment.

In [43]:
weighted_voting_classifier(
    [[[0.2, 0.1, 0.7], [0.5, 0.4, 0.1], [0.2, 0.5, 0.3], [0.1, 0.4, 0.5]], [[0.5, 0.1, 0.4], [0.5, 0.2, 0.3], [0.2, 0.3, 0.5], [0.1, 0.7, 0.3]], [[0.2, 0.3, 0.5], [0.5, 0.2, 0.3], [0.2, 0.3, 0.5], [0.1, 0.7, 0.3]]], 
    ['positive', 'neutral', 'negative'], 
    [0.5, 0.3, 0.2]
)

weight scores [[0.29 0.14 0.57]
 [0.5  0.3  0.2 ]
 [0.2  0.4  0.4 ]
 [0.1  0.55 0.4 ]]
index set [2 0 2 1]
final_predictions ['negative', 'positive', 'negative', 'neutral']


['negative', 'positive', 'negative', 'neutral']

In [15]:
reset_gpu()
pipe_facebook_opt = pipeline("text2text-generation", model="facebook/opt-125m", device=0)
print(pipe_facebook_opt("generate a text with positive sentiment"))

The model 'OPTForCausalLM' is not supported for text2text-generation. Supported models are ['BartForConditionalGeneration', 'BigBirdPegasusForConditionalGeneration', 'BlenderbotForConditionalGeneration', 'BlenderbotSmallForConditionalGeneration', 'EncoderDecoderModel', 'FSMTForConditionalGeneration', 'GPTSanJapaneseForConditionalGeneration', 'LEDForConditionalGeneration', 'LongT5ForConditionalGeneration', 'M2M100ForConditionalGeneration', 'MarianMTModel', 'MBartForConditionalGeneration', 'MT5ForConditionalGeneration', 'MvpForConditionalGeneration', 'NllbMoeForConditionalGeneration', 'PegasusForConditionalGeneration', 'PegasusXForConditionalGeneration', 'PLBartForConditionalGeneration', 'ProphetNetForConditionalGeneration', 'SeamlessM4TForTextToText', 'SeamlessM4Tv2ForTextToText', 'SwitchTransformersForConditionalGeneration', 'T5ForConditionalGeneration', 'UMT5ForConditionalGeneration', 'XLMProphetNetForConditionalGeneration'].


[{'generated_text': "generate a text with positive sentiment and then post it on reddit.\nI'm not"}]




In [16]:
reset_gpu()
def generate_system_prompt():
    return f"""
    You are a linguist expert. \
    Your task is to evaluate the sentiment of the text provided by a user and classify it to exclusively into one of the following three categories: positive, neutral or negative.
    """
messages = [
    {"role": "system", "content": "Who are you?"},
    {"role": "user", "content": "Who are you?"},
]
user_prompts = list(map(lambda x: {"role": "user", "content": x}, dynasent_r1["validation"]['sentence'][:20]))
prompts = [{"role": "system", "content": generate_system_prompt()}] + user_prompts
generation_args = { 
    "max_new_tokens": 20, 
    "return_full_text": False, 
    "temperature": 0.0, 
    "do_sample": False, 
} 
pipe_phi3 = pipeline("text-generation", model="microsoft/Phi-3-mini-128k-instruct", torch_dtype=torch.float16, device=0, trust_remote_code=True, **generation_args)
print(pipe_phi3(prompts))

`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

: 

In [32]:
reset_gpu()
def generate_system_prompt():
    return f"""
    You are a text classification model. \
    Your task is to analyse the sentiment of the provided text and classify it into one of tree categories: positive, neutral and negative.\
    Consider the overall tone, emotion, and intent of the text to make your classification.\
    - Positive: The text expresses a favorable, happy, or supportive sentiment.\
    - Neutral: The text is objective, factual, or lacks any emotional or subjective sentiment.\
    - Negative: The text expresses an unfavorable, critical, toxic or unhappy sentiment.\
    Respond with only one of these three categories: positive, neutral and negative.\
    If you encounter any text that you cannot process or classify confidently, categorize it as neutral.
    """
def generate_prompts(text):
    return [{"role": "system", "content": generate_system_prompt()}, {"role": "user", "content": text}]
generation_args = { 
    "max_new_tokens": 20, 
    "return_full_text": False, 
    "do_sample": False, 
} 
pipe_phi3 = pipeline("text-generation", model="Qwen/Qwen2-0.5B-Instruct", torch_dtype="auto", device=0, trust_remote_code=True, **generation_args)
preds_qwen_dynaset_v1 = list(map(lambda x: pipe_phi3(generate_prompts(x))[0]["generated_text"], dynasent_r1["validation"]['sentence']))
print(preds_qwen_dynaset_v1)



['neutral', 'neutral', 'positive', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'negative', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'positive', 'negative', 'neutral', 'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'positive', 'neutral', 'negative', 'positive', 'neutral', 'neutral', 'neutral', 'negative', 'neutral', 'neutral', 'neutral', 'neutral', 'positive', 'positive', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'positive', 'neutral', 'neutral', 'neutral', 'po

In [9]:
reset_gpu()
def generate_system_prompt():
    return f"""
    You are a text classification model. \
    Your task is to analyse the sentiment of the provided text and classify it into one of tree categories: positive, neutral and negative.\
    Consider the overall tone, emotion, and intent of the text to make your classification.\
    - Positive: The text expresses a favorable, happy, or supportive sentiment.\
    - Neutral: The text is objective, factual, or lacks any emotional or subjective sentiment.\
    - Negative: The text expresses an unfavorable, critical, toxic or unhappy sentiment.\
    Respond with only one of these three categories: positive, neutral and negative.\
    If you encounter any text that you cannot process or classify confidently, categorize it as neutral.
    """
def generate_prompts(text):
    return [{"role": "system", "content": generate_system_prompt()}, {"role": "user", "content": text}]
generation_args = { 
    "max_new_tokens": 20, 
    "return_full_text": False, 
    "do_sample": False, 
} 
pipe_phi3 = pipeline("text-generation", model="microsoft/phi-2", torch_dtype="auto", device=0, trust_remote_code=True, **generation_args)
preds_qwen_dynaset_v1 = list(map(lambda x: pipe_phi3(generate_prompts(x))[0]["generated_text"], dynasent_r1["validation"]['sentence']))
print(preds_qwen_dynaset_v1)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

ValueError: Cannot use apply_chat_template() because tokenizer.chat_template is not set and no template argument was passed! For information about writing templates and setting the tokenizer.chat_template attribute, please see the documentation at https://huggingface.co/docs/transformers/main/en/chat_templating

In [37]:
reset_gpu()

In [30]:
for i in range(len(preds_qwen_dynaset_v1)):
    format_text = preds_qwen_dynaset_v1[i].lower()
    if format_text = "positive" or format_text != "negative" or format_text != "neutral":
        print(i)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [28]:
dynasent_r1["validation"]['sentence'][0]

"He didn't just try to shove medication down her throat."

In [33]:
print(classification_report(dynasent_r1["validation"]["gold_label"], list(map(lambda x: str(x).lower(), preds_qwen_dynaset_v1)), digits=3))

              precision    recall  f1-score   support

    negative      0.840     0.035     0.067      1200
     neutral      0.372     0.948     0.535      1200
    positive      0.800     0.329     0.466      1200

    accuracy                          0.438      3600
   macro avg      0.671     0.438     0.356      3600
weighted avg      0.671     0.438     0.356      3600



In [5]:
# Example usage 
texts = ["I love data science.", "Natural language processing is fascinating."] 
augmented_datasets, label_datasets = augment_texts(texts, 'positive', n=3) 
print("Augmented Datasets:", len(augmented_datasets), len(label_datasets)) 

aug method Name:Synonym_Aug, Aug Src:wordnet, Action:substitute, Method:word
augmented text ['Single love data skill.']
aug method Name:ContextualWordEmbs_Aug, Action:insert, Method:word
augmented text ['sometimes i just love data science.']
aug method Name:ContextualWordEmbs_Aug, Action:insert, Method:word
augmented text ['although i love data based science.']
aug method Name:Synonym_Aug, Aug Src:wordnet, Action:substitute, Method:word
augmented text ['Natural language processing is bewitch.']
aug method Name:Synonym_Aug, Aug Src:wordnet, Action:substitute, Method:word
augmented text ['Natural speech communication processing is catch.']
aug method Name:Synonym_Aug, Aug Src:wordnet, Action:substitute, Method:word
augmented text ['Innate language processing be fascinating.']
Augmented Datasets: 6 6


## Question 4: Bakeoff entry [1 point]

The bakeoff dataset is available at 

https://web.stanford.edu/class/cs224u/data/cs224u-sentiment-test-unlabeled.csv

This code should grab it for you and put it in `data/sentiment` if you are working in the cloud:

In [19]:
import os
import wget

if not os.path.exists(os.path.join("data", "sentiment", "cs224u-sentiment-test-unlabeled.csv")):
    os.makedirs(os.path.join('data', 'sentiment'), exist_ok=True)
    wget.download('https://web.stanford.edu/class/cs224u/data/cs224u-sentiment-test-unlabeled.csv', out='data/sentiment/')

If the above fails, you can just download the file and place it in `data/sentiment`.

Once you have the file, you can load it to a `pd.DataFrame`:

In [20]:
bakeoff_df = pd.read_csv(
    os.path.join("data", "sentiment", "cs224u-sentiment-test-unlabeled.csv"))

In [49]:
torch.cuda.get_device_name(0) 

'NVIDIA GeForce RTX 2070 SUPER'

In [125]:
evaluate_weighting_voting_classifier_on_validation_sets(sst["validation"], dynasent_r1["validation"], dynasent_r2["validation"])

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


-------------------------------------------------------------------------------------------------
Weight voting classifier evaluation on different validation datasets
-------------------------------------------------------------------------------------------------
Classification report on sst

              precision    recall  f1-score   support

    negative      0.390     0.998     0.561       428
     neutral      0.333     0.009     0.017       229
    positive      0.000     0.000     0.000       444

    accuracy                          0.390      1101
   macro avg      0.241     0.335     0.193      1101
weighted avg      0.221     0.390     0.222      1101


Classification report on dynasent r1

              precision    recall  f1-score   support

    negative      0.336     0.988     0.501      1200
     neutral      0.435     0.025     0.047      1200
    positive      0.000     0.000     0.000      1200

    accuracy                          0.338      3600
   macro avg 

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [111]:
predictions_scores, labels, weights = generate_models_predictions(sst["validation"], dynasent_r1["validation"], dynasent_r2["validation"], bakeoff_df)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


-------------------------------------------------------------------------------------------------
Roberta text-classification evaluation on different validation datasets
-------------------------------------------------------------------------------------------------
Classification report on sst

              precision    recall  f1-score   support

    negative      0.712     0.757     0.734       428
     neutral      0.356     0.454     0.399       229
    positive      0.839     0.669     0.744       444

    accuracy                          0.658      1101
   macro avg      0.636     0.627     0.626      1101
weighted avg      0.689     0.658     0.668      1101


Classification report on dynasent r1

              precision    recall  f1-score   support

    negative      0.746     0.552     0.635      1200
     neutral      0.554     0.855     0.672      1200
    positive      0.797     0.571     0.665      1200

    accuracy                          0.659      3600
   macro a



-------------------------------------------------------------------------------------------------
Finbert text-classification evaluation on different validation datasets
-------------------------------------------------------------------------------------------------
Classification report on sst

              precision    recall  f1-score   support

    negative      0.788     0.339     0.474       428
     neutral      0.226     0.856     0.357       229
    positive      0.854     0.092     0.167       444

    accuracy                          0.347      1101
   macro avg      0.623     0.429     0.333      1101
weighted avg      0.698     0.347     0.326      1101


Classification report on dynasent r1

              precision    recall  f1-score   support

    negative      0.776     0.233     0.359      1200
     neutral      0.369     0.961     0.533      1200
    positive      0.867     0.082     0.149      1200

    accuracy                          0.425      3600
   macro a



-------------------------------------------------------------------------------------------------
Bart zero shot classification evaluation on different validation datasets
-------------------------------------------------------------------------------------------------
Classification report on sst

              precision    recall  f1-score   support

    negative      0.645     0.921     0.758       428
     neutral      0.368     0.061     0.105       229
    positive      0.792     0.806     0.799       444

    accuracy                          0.696      1101
   macro avg      0.602     0.596     0.554      1101
weighted avg      0.647     0.696     0.639      1101


Classification report on dynasent r1

              precision    recall  f1-score   support

    negative      0.593     0.846     0.697      1200
     neutral      0.676     0.082     0.146      1200
    positive      0.544     0.789     0.644      1200

    accuracy                          0.572      3600
   macro

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


-------------------------------------------------------------------------------------------------
Deberta mnli zero shot classification evaluation on different validation datasets
-------------------------------------------------------------------------------------------------
Classification report on sst

              precision    recall  f1-score   support

    negative      0.680     0.778     0.725       428
     neutral      0.333     0.183     0.237       229
    positive      0.742     0.811     0.775       444

    accuracy                          0.668      1101
   macro avg      0.585     0.591     0.579      1101
weighted avg      0.633     0.668     0.644      1101


Classification report on dynasent r1

              precision    recall  f1-score   support

    negative      0.681     0.701     0.691      1200
     neutral      0.618     0.619     0.618      1200
    positive      0.688     0.667     0.677      1200

    accuracy                          0.662      3600


In [112]:
predictions_scores

[[[0.996288537979126, 0.003425558330491185, 0.0002859627129510045],
  [0.7028729319572449, 0.24098341166973114, 0.0561436228454113],
  [0.5655627250671387, 0.4241155683994293, 0.010321716777980328],
  [0.5666266679763794, 0.2863191068172455, 0.14705422520637512],
  [0.9955180287361145, 0.003929045982658863, 0.0005528934416361153],
  [0.7020024657249451, 0.29547175765037537, 0.002525802468881011],
  [0.9997953176498413, 0.00012200989294797182, 8.261785114882514e-05],
  [0.9990659356117249, 0.0007920298958197236, 0.00014199737051967531],
  [0.9635266065597534, 0.035782404243946075, 0.0006909547373652458],
  [0.8869084715843201, 0.05657331645488739, 0.056518204510211945],
  [0.9709895849227905, 0.027057891711592674, 0.0019525064853951335],
  [0.567943274974823, 0.4055348336696625, 0.026521872729063034],
  [0.9671646952629089, 0.03155718743801117, 0.001278121955692768],
  [0.5522529482841492, 0.4282830059528351, 0.019464025273919106],
  [0.5287962555885315, 0.47057580947875977, 0.000627996

In [113]:
weights

[0.3, 0.2, 0.2, 0.2, 0.1]

In [110]:
bakeoff_df

Unnamed: 0,example_id,sentence,prediction
0,0,This year we were at a restaurant that clearly...,positive
1,1,A long way.,neutral
2,2,A friend and I went on a Thursday evening aro...,neutral
3,3,You'll love to say I used to be married to tha...,neutral
4,4,I feel like any place I move will be a downgra...,positive
...,...,...,...
2995,2995,despite its many infuriating flaws -- not the ...,negative
2996,2996,A bone cyst is a hollow spot of bone filled wi...,neutral
2997,2997,The portions are big & the check is small.,negative
2998,2998,Service and food was mediocre at best.,negative


In [105]:
bakeoff_df.head()

Unnamed: 0,example_id,sentence
0,0,This year we were at a restaurant that clearly...
1,1,A long way.
2,2,A friend and I went on a Thursday evening aro...
3,3,You'll love to say I used to be married to tha...
4,4,I feel like any place I move will be a downgra...


In [114]:
bakeoff_preds = weighted_voting_classifier(predictions_scores, labels, weights)

In [119]:
Counter(bakeoff_preds)

Counter({'negative': 2857, 'neutral': 143})

To enter the bakeoff, you simply need to use your original system to:

1. Add a column named 'prediction' to `cs224u-sentiment-test-unlabeled.csv` with your model predictions (which are strings in {`positive`, `negative`, `neutral`}). The existing columns should remain.

2. Save the file as `cs224u-sentiment-bakeoff-entry.csv`. Here is a good snippet of code for writing this file:

In [116]:
# This is a placeholder for adding the "prediction" column:
bakeoff_df['prediction'] = bakeoff_preds

# Write to disk
bakeoff_df.to_csv("cs224u-sentiment-bakeoff-entry.csv")

In particular, you need to be sure that `example_id` is a column rather than an index when read in by Pandas. Here is a quick test:

In [117]:
def test_bakeoff_entry(filename="cs224u-sentiment-bakeoff-entry.csv"):
    gold_df = pd.read_csv(
        os.path.join("data", "sentiment", "cs224u-sentiment-test-unlabeled.csv"))
    entry_df = pd.read_csv(filename)

    # Check that no required columns are missing:
    expected_cols = {'example_id', 'sentence', 'prediction'}
    missing_cols = expected_cols - set(entry_df.columns)
    errcount = 0
    if len(missing_cols) != 0:
        errcount += 1
        print(f"Entry is missing required columns {missing_cols}")
        return

    # Check that the predictions are in our space:
    labels = {'positive', 'negative', 'neutral'}
    predtypes = set(entry_df.prediction.unique())
    unexpected = predtypes - labels
    if len(unexpected) != 0:
        errcount += 1
        print(f"Prediction column has unexpected values: {unexpected}")

    # Check that the dataset hasn't been rearranged:
    for colname in ('example_id', 'sentence'):
        if not entry_df[colname].equals(gold_df[colname]):
            errcount += 1
            print(f"Entry is misaligned with test data on column {colname}")

    # Clean bill of health:
    if errcount == 0:
        print("No errors detected with `test_bakeoff_entry`.")

In [118]:
test_bakeoff_entry("cs224u-sentiment-bakeoff-entry.csv")

No errors detected with `test_bakeoff_entry`.


Submit the following files to Gradescope:

* `hw_sentiment.ipynb` (this notebook)
* `cs224u-sentiment-bakeoff-entry.csv` (bake-off output)

Please make sure you use these filenames. The autograder looks for files with these names.

You are not permitted to do any tuning of your system based on what you see in our bakeoff prediction file – you should not study that file in anyway, beyond perhaps checking that it contains what you expected it to contain. The upload function will do some additional checking to ensure that your file is well-formed.

People who enter will receive the additional homework point, and people whose systems achieve the top score will receive an additional 0.5 points. We will test the top-performing systems ourselves, and only systems for which we can reproduce the reported results will win the extra 0.5 points.

Late entries will be accepted, but they cannot earn the extra 0.5 points.