<a href="https://colab.research.google.com/github/jan-kreischer/UZH_ML4NLP/blob/main/Project-05/ex05_sent.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 5 - Sequence and Sentiment Classification using Transformers

## Part 2: Resource Limited Competition: Sentiment Analysis

## 1. Setup
### 1.1 Dependencies
Disclaimer: The output of cells that do not produce helpful output (for example, the pip install commands) was cleared to make the notebook easier to read.

In [2]:
!pip install datasets transformers scikit-learn simpletransformers

Collecting datasets
  Downloading datasets-1.16.1-py3-none-any.whl (298 kB)
Collecting transformers
  Downloading transformers-4.12.5-py3-none-any.whl (3.1 MB)
Collecting simpletransformers
  Downloading simpletransformers-0.63.3-py3-none-any.whl (247 kB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2021.11.1-py3-none-any.whl (132 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting huggingface-hub


### 1.2 Imports

In [18]:
import datasets
from datasets import load_dataset
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures, Trainer, TrainingArguments
from sklearn.preprocessing import LabelEncoder

# Misc
import os
import shutil
import csv
import re
from io import StringIO
import requests
import string
import numpy as np
import matplotlib.pyplot as plt  
import seaborn as sn

# Pandas
import pandas as pd
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

# Warnings
import warnings
warnings.filterwarnings('ignore')

# Keras
import keras.preprocessing
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, AveragePooling1D, Dense, Dropout, Activation, Embedding
from keras import backend as K
from keras.callbacks import EarlyStopping
from tensorflow.keras.utils import to_categorical

# tensorflow
import tensorflow as tf

# Torch
import torch

# Sklearn
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score,f1_score

# simpletransformers
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import logging

SyntaxError: ignored

### 1.3 Constants

In [4]:
train_range=(10000,15000)
test_range=(11500,13500)

### 1.4 Environment
We check that the environment is set up properly, so that the models are trained on a GPU.

In [5]:
# Check if device supports CUDA interface
CUDA = torch.cuda.is_available()
# Run on the GPU (cuda:0) if available, otherwise fall back to the CPU
device = torch.device('cuda:0' if CUDA else 'cpu')
# set_device only accepts CUDA devices, so skip it on CPU-only runtimes
if CUDA:
  torch.cuda.set_device(device)
print('Using device:', device)

Using device: cuda:0


In [6]:
# Check and print information about available GPU
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Tue Nov 30 01:58:34 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.44       Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    30W / 250W |      2MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [7]:
# Get GPU name
!nvidia-smi -L

GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-09cdc3b9-a810-eb0c-1470-11f3b63223b6)


In [8]:
# Check Memory
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


## Import


Here we download the IMDb Large Movie Review dataset (aclImdb) from the Stanford AI Lab website.

In [9]:
URL = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

dataset = tf.keras.utils.get_file(fname="aclImdb_v1.tar.gz", 
                                  origin=URL,
                                  untar=True,
                                  cache_dir='.',
                                  cache_subdir='')

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz


After downloading the data from the above link, we found that the directory structure looks like this:

aclImdb/  
...train/  
......neg/  
.........a_text_1.txt  
.........a_text_2.txt  
......pos/  
......unsup/  
...test/  
......neg/  
......pos/  

We build the paths to the main directory and its train subdirectory, then remove the "unsup" directory inside train, which contains unlabeled reviews intended for unsupervised learning.

In [10]:
main_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
train_dir = os.path.join(main_dir, 'train')
remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)   
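
As a quick sanity check after removing `unsup`, the files in the remaining class folders can be counted. The sketch below is illustrative only: `count_reviews` is a hypothetical helper that is not part of the notebook, and it assumes each class (for example `neg` and `pos`) is a subdirectory containing `.txt` files.

```python
from pathlib import Path

def count_reviews(split_dir):
    """Return {class_name: number of .txt files} for one split directory."""
    counts = {}
    for class_dir in sorted(Path(split_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for _ in class_dir.glob("*.txt"))
    return counts

# Hypothetical usage with the notebook's train_dir:
# count_reviews(train_dir)
```

For the full training split this should report two classes with 12,500 files each, which matches the "Found 25000 files belonging to 2 classes." message printed when the data is loaded.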

In [11]:
# Read the data into a DataFrame. As required by the task, rows train[10000:15000] are used as train_data.
train = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=25000,
    shuffle=False,
    validation_split=0)

for i in train.take(1):
  train_feat = i[0].numpy()
  train_lab = i[1].numpy()

train = pd.DataFrame([train_feat, train_lab]).T
train.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
train['DATA_COLUMN'] = train['DATA_COLUMN'].str.decode("utf-8")
train_data=train[train_range[0]:train_range[1]]

Found 25000 files belonging to 2 classes.


In [12]:
# Similarly, read the test data into a DataFrame. As required by the task, rows test[11500:13500] are used as test_data.
test = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=25000,
    shuffle=False,
    validation_split=0)

for i in test.take(1):
  test_feat = i[0].numpy()
  test_lab = i[1].numpy()

test = pd.DataFrame([test_feat, test_lab]).T
test.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
test['DATA_COLUMN'] = test['DATA_COLUMN'].str.decode("utf-8")
test_data=test[test_range[0]:test_range[1]]

Found 25000 files belonging to 2 classes.


In [13]:
print("Train_data has a shape of {}. \n\n The number of positive(1) and negative(0) reviews are:\n {}".format(
    train_data.shape,train_data['LABEL_COLUMN'].value_counts()))

Train_data has a shape of (5000, 2). 

 The number of positive(1) and negative(0) reviews are:
 1    2500
0    2500
Name: LABEL_COLUMN, dtype: int64


In [14]:
print("Test_data has a shape of {}. \n\n The number of positive(1) and negative(0) reviews are:\n {}".format(
    test_data.shape,test_data['LABEL_COLUMN'].value_counts()))

Test_data has a shape of (2000, 2). 

 The number of positive(1) and negative(0) reviews are:
 1    1000
0    1000
Name: LABEL_COLUMN, dtype: int64
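
Both slices come out perfectly balanced because of how the files are ordered. With `shuffle=False`, `text_dataset_from_directory` lists the class folders in alphanumeric order, so all `neg` reviews (label 0) come before all `pos` reviews (label 1). Assuming the standard 12,500 reviews per class in each split, a slice that straddles index 12,500 symmetrically contains equal numbers of both labels. The helper `slice_class_counts` below is a hypothetical illustration of that arithmetic, not code from the notebook.

```python
def slice_class_counts(start, stop, n_neg=12500, n_pos=12500):
    """Count labels in rows[start:stop] when all negatives precede all positives."""
    total = n_neg + n_pos
    start, stop = max(0, start), min(stop, total)
    neg = max(0, min(stop, n_neg) - start)   # rows that fall before index n_neg
    pos = max(0, stop - max(start, n_neg))   # rows that fall at or after index n_neg
    return {0: neg, 1: pos}

print(slice_class_counts(10000, 15000))  # train_range -> {0: 2500, 1: 2500}
print(slice_class_counts(11500, 13500))  # test_range  -> {0: 1000, 1: 1000}
```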


In [15]:
train_data

Unnamed: 0,DATA_COLUMN,LABEL_COLUMN
10000,"First, the CGI in this movie was horrible. I w...",0
10001,The film is about a sabretooth on the lose at ...,0
10002,Everything about this film is hog wash. Pitifu...,0
10003,Spoilers will be in this. The movie could have...,0
10004,Three giant sabretooth tigers(..created in a l...,0
...,...,...
14995,The minute I started watching this I realised ...,1
14996,i really loved this version of Emma the best. ...,1
14997,Until the 1990s there had never been a film ba...,1
14998,Old Jane's mannered tale seems very popular th...,1


## Models and Classification Arguments
Here we fine-tune several pretrained models on the training slice and compare their prediction results on the test slice.

In [26]:
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)

def f1(labels, predictions):
    # Weighted F1 score. simpletransformers calls extra metrics as
    # metric(true_labels, predictions), so no label binarization is needed.
    return f1_score(labels, predictions, average='weighted')

### Model 1: distilbert-base-uncased-finetuned-sst-2-english
This model is based on DistilBERT base, a distilled version of BERT base, and was fine-tuned on the Stanford Sentiment Treebank (SST-2). The Stanford Sentiment Treebank consists of sentences from movie reviews together with human annotations of their sentiment. The task is to predict the sentiment of a given sentence. SST-2 uses the two-way (positive/negative) class split and only sentence-level labels.

In [21]:
model_args1 = ClassificationArgs(num_train_epochs=10,
                                use_early_stopping=True,
                                output_dir="outputs1/",
                                overwrite_output_dir=True,
                                 weight_decay=0.01)

model1 = ClassificationModel(
    "distilbert", 
    "distilbert-base-uncased-finetuned-sst-2-english", 
    args=model_args1,
    num_labels=2,
    weight=[1,1]
)


In [22]:
model1.train_model(train_data)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/5000 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_distilbert_128_2_2


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 0 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Training of distilbert model complete. Saved to outputs1/.


(6250, 0.10465459409236907)
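
The returned step count of 6250 equals 10 epochs of 625 steps each, so every epoch ran and early stopping never triggered. As far as we can tell, simpletransformers only checks `use_early_stopping` when evaluation runs during training, so it needs `evaluate_during_training` and an evaluation set to have any effect. The dictionary below is a minimal sketch of such a configuration under that assumption. The patience, metric, and evaluation interval are illustrative choices, not settings used in this notebook, and `validation_data` is a hypothetical held-out DataFrame.

```python
# Sketch only: simpletransformers also accepts a plain dict for `args`.
early_stopping_args = {
    "num_train_epochs": 10,
    "weight_decay": 0.01,
    "overwrite_output_dir": True,
    "output_dir": "outputs1/",
    "evaluate_during_training": True,       # run evaluation while training
    "evaluate_during_training_steps": 625,  # assumed: about once per epoch
    "use_early_stopping": True,
    "early_stopping_metric": "eval_loss",   # assumed metric
    "early_stopping_metric_minimize": True,
    "early_stopping_patience": 2,           # assumed patience
}

# Step arithmetic from the log above: 5000 examples, 625 steps per epoch,
# which implies a batch size of 8.
steps_per_epoch = 5000 // 8
total_steps = steps_per_epoch * early_stopping_args["num_train_epochs"]
print(steps_per_epoch, total_steps)  # 625 6250

# Hypothetical usage; the evaluation set should not be the test slice:
# model1.train_model(train_data, eval_df=validation_data)
```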

In [28]:
# Evaluate the model
result, model_outputs, wrong_predictions = model1.eval_model(test_data,acc=accuracy_score,f1_score=f1_score)

print(result)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/2000 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_distilbert_128_2_2


Running Evaluation:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model:{'mcc': 0.6925001657447307, 'tp': 865, 'tn': 827, 'fp': 173, 'fn': 135, 'auroc': 0.92067, 'auprc': 0.9088425974442569, 'acc': 0.846, 'f1_score': 0.8488714425907753, 'eval_loss': 1.4441605228185654}


{'mcc': 0.6925001657447307, 'tp': 865, 'tn': 827, 'fp': 173, 'fn': 135, 'auroc': 0.92067, 'auprc': 0.9088425974442569, 'acc': 0.846, 'f1_score': 0.8488714425907753, 'eval_loss': 1.4441605228185654}
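
The accuracy, F1, and MCC values in this dictionary follow directly from the four confusion-matrix counts. The sketch below recomputes them with the standard formulas; `binary_metrics` is a hypothetical helper written for illustration, and the F1 shown is the one for the positive class (the default of `f1_score` for binary labels).

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, F1 (positive class) and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"acc": acc, "f1": f1, "mcc": mcc}

# Counts reported for Model 1 above
print(binary_metrics(tp=865, tn=827, fp=173, fn=135))
```

With the Model 1 counts this should reproduce the reported values of roughly 0.846 (accuracy), 0.8489 (F1), and 0.6925 (MCC).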


### Model 2: echarlaix/bert-base-uncased-sst2-acc91.1-d37-hybrid
This model is interesting because it uses a block pruning method.

In [29]:
model_args2 = ClassificationArgs(num_train_epochs=10,
                                use_early_stopping=True,
                                output_dir="outputs2/",
                                overwrite_output_dir=True,
                                 weight_decay=0.01)

model2 = ClassificationModel(
    "bert", 
    "echarlaix/bert-base-uncased-sst2-acc91.1-d37-hybrid", 
    args=model_args2,
    num_labels=2,
    weight=[1,1]
)


In [30]:
model2.train_model(train_data)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/5000 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_bert_128_2_2


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 0 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 4 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 5 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 6 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 7 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 8 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 9 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model: Training of bert model complete. Saved to outputs2/.


(6250, 0.11170781915664672)

In [31]:
# Evaluate the model
result, model_outputs, wrong_predictions = model2.eval_model(test_data,acc=accuracy_score,f1=f1_score)
print(result)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/2000 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_dev_bert_128_2_2


Running Evaluation:   0%|          | 0/250 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_model:{'mcc': 0.6804410687651714, 'tp': 858, 'tn': 822, 'fp': 178, 'fn': 142, 'auroc': 0.9164924999999999, 'auprc': 0.908275267636448, 'acc': 0.84, 'f1': 0.8428290766208252, 'eval_loss': 1.5012480400204657}


{'mcc': 0.6804410687651714, 'tp': 858, 'tn': 822, 'fp': 178, 'fn': 142, 'auroc': 0.9164924999999999, 'auprc': 0.908275267636448, 'acc': 0.84, 'f1': 0.8428290766208252, 'eval_loss': 1.5012480400204657}


In [32]:
predictions, raw_outputs = model2.predict(["This movie is great"])
print(predictions)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

[1]


In [33]:
predictions, raw_outputs = model2.predict(["This movie sucks"])
print(predictions)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

[0]
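
`predict` returns the predicted class indices together with `raw_outputs`, the per-class scores (logits) from the model. Applying a softmax turns these scores into probabilities, which helps judge how confident a prediction is. The sketch below uses made-up logits rather than values produced by this notebook, and `softmax` is a small helper written for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits in the order [negative, positive]
example_logits = [-2.0, 3.0]
probs = softmax(example_logits)
predicted_label = probs.index(max(probs))
print(predicted_label, [round(p, 3) for p in probs])
```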


### Model 3: siebert/sentiment-roberta-large-english
This model is a fine-tuned checkpoint of RoBERTa-large ([Liu et al. 2019](https://arxiv.org/pdf/1907.11692.pdf)). It enables reliable binary sentiment analysis for various types of English-language text. According to [Liu et al. 2019](https://arxiv.org/pdf/1907.11692.pdf), RoBERTa modifies the BERT pretraining procedure in the following 4 aspects:
 - (1) training the model longer, with bigger batches, over more data;
 - (2) removing the next sentence prediction objective;
 - (3) training on longer sequences; and
 - (4) dynamically changing the masking pattern applied to the training data.

They also collect a large new dataset (CC-NEWS) of comparable size to other privately used datasets, to better control for training set size effects.
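
Point (4) can be illustrated with a toy example. With static masking, the mask pattern is drawn once during preprocessing and reused in every epoch; with dynamic masking, a new pattern is drawn each time a sequence is fed to the model. The sketch below is deliberately simplified: `mask_tokens` is a hypothetical helper that works on whole words and always substitutes `[MASK]`, whereas the actual pretraining procedure operates on subword IDs and sometimes keeps or randomly replaces the selected tokens.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, rng, mask_prob=0.15):
    """Replace each token with [MASK] with probability mask_prob (simplified)."""
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]

tokens = "the film was far better than i expected it to be".split()
rng = random.Random(0)

# Static masking: the pattern is drawn once and reused in every epoch.
static_mask = mask_tokens(tokens, rng)
static_epochs = [static_mask for _ in range(3)]

# Dynamic masking: a fresh pattern is drawn each time the sequence is used.
dynamic_epochs = [mask_tokens(tokens, rng) for _ in range(3)]

for epoch, masked in enumerate(dynamic_epochs):
    print(epoch, " ".join(masked))
```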


In [34]:
model_args3 = ClassificationArgs(num_train_epochs=10,
                                use_early_stopping=True,
                                output_dir="outputs3/",
                                overwrite_output_dir=True,
                                 weight_decay=0.01)

model3 = ClassificationModel(
    "roberta", 
    "siebert/sentiment-roberta-large-english", 
    args=model_args3,
    num_labels=2,
    weight=[1,1]
)


In [None]:
model3.train_model(train_data)

INFO:simpletransformers.classification.classification_utils: Converting to features started. Cache is not used.


  0%|          | 0/5000 [00:00<?, ?it/s]

INFO:simpletransformers.classification.classification_utils: Saving features into cached file cache_dir/cached_train_roberta_128_2_2


Epoch:   0%|          | 0/10 [00:00<?, ?it/s]

Running Epoch 0 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 1 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 2 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

Running Epoch 3 of 10:   0%|          | 0/625 [00:00<?, ?it/s]

In [None]:
# Evaluate the model
result, model_outputs, wrong_predictions = model3.eval_model(test_data,acc=accuracy_score,f1=f1_score)
print(result)

In [None]:
predictions, raw_outputs = model3.predict(["This movie is great"])
print(predictions)

In [None]:
predictions, raw_outputs = model3.predict(["This movie sucks"])
print(predictions)




### Model 4: gchhablani/bert-base-cased-finetuned-sst2
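As its name and the `bert` model type used below indicate, this checkpoint is a standard bert-base-cased model fine-tuned on SST-2. It is linked to [this paper](https://arxiv.org/abs/2105.03824), which replaces the self-attention sublayers of a Transformer encoder with simple linear transformations that "mix" input tokens and shows that the encoder can be sped up with limited accuracy costs. The paper shows that these linear mixers, along with standard nonlinearities in the feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. The BERT checkpoint used here appears to serve as the attention-based baseline for that comparison rather than as a mixer model itself.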


In [None]:
model_args4 = ClassificationArgs(num_train_epochs=10,
                                use_early_stopping=True,
                                output_dir="outputs4/",
                                overwrite_output_dir=True,
                                 weight_decay=0.01)

model4 = ClassificationModel(
    "bert", 
    "gchhablani/bert-base-cased-finetuned-sst2", 
    args=model_args4,
    num_labels=2,
    weight=[1,1]
)

In [None]:
model4.train_model(train_data)

In [None]:
# Evaluate the model
result, model_outputs, wrong_predictions = model4.eval_model(test_data,acc=accuracy_score,f1=f1_score)
print(result)

In [None]:
predictions, raw_outputs = model4.predict(["This movie is great"])
print(predictions)

In [None]:
predictions, raw_outputs = model4.predict(["This movie sucks"])
print(predictions)