In [1]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.11.1 tokenizers-0.13.2 transformers-4.25.1


In [2]:
import numpy as np
import pandas as pd
import torch
import transformers as t

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split


In [3]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
df.head()

Unnamed: 0,0,1
0,"a stirring , funny and finally transporting re...",1
1,apparently reassembled from the cutting room f...,0
2,they presume their audience wo n't sit still f...,0
3,this is a visually stunning rumination on love...,1
4,jonathan parker 's bartleby should have been t...,1


In [5]:
df.shape

(6920, 2)

In [6]:
df = df[:2000]
df.shape

(2000, 2)

Import the pre-trained DistilBERT model and tokenizer

In [9]:
model_class, tokenizer_class, pretrained_wts = (t.DistilBertModel, t.DistilBertTokenizer, "distilbert-base-uncased")
tokenizer = tokenizer_class.from_pretrained(pretrained_wts)
model = model_class.from_pretrained(pretrained_wts)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Prepare sentence embeddings using DistilBERT

In [10]:
tokenized = df[0].apply(lambda x: tokenizer.encode(x, add_special_tokens=True))

To process all the sentences at once (as one batch), we need to pad every list of token ids to the same length, so that the input can be represented as a single 2-D array.


In [11]:
# length of the longest tokenized sentence
max_len = max(len(ids) for ids in tokenized.values)
# right-pad every sequence with 0, the [PAD] token id of this tokenizer
padded = np.array([ids + [0] * (max_len - len(ids)) for ids in tokenized.values])

# check the shape of the padded dataset
padded.shape

(2000, 59)
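To make the padding step concrete, here is a minimal sketch on two toy token-id lists. The ids are illustrative values chosen by hand rather than output from the tokenizer above, and the `toy_` names are hypothetical.

```python
import numpy as np

# Two toy token-id lists of unequal length (hand-picked, illustrative values)
toy_tokenized = [[101, 7592, 102], [101, 2023, 2003, 2307, 102]]

# Length of the longest list
toy_max_len = max(len(ids) for ids in toy_tokenized)

# Right-pad each list with 0, the same pad value the notebook uses
toy_padded = np.array([ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized])

print(toy_padded.shape)
print(toy_padded)
```

Every row now has the length of the longest list, which is what allows the whole batch to be stored as one rectangular array.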

We create an attention mask that tells the model to ignore (mask) the padding we added when it processes its input.

In [12]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(2000, 59)
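As a small illustration of what the mask looks like, the sketch below applies the same `np.where` expression to a toy padded array. The array values and the `toy_` names are assumptions for demonstration only. Note that this approach relies on 0 being used only for padding.

```python
import numpy as np

# Toy padded batch: the first row ends with two padding zeros
toy_padded = np.array([[101, 7592, 102, 0, 0],
                       [101, 2023, 2003, 2307, 102]])

# 1 marks a real token, 0 marks a padding position
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_mask)
```

The mask has the same shape as the padded input, with zeros exactly where padding was added.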

Now we run the DistilBERT model to get the embeddings.

In [13]:
# convert the padded token ids and the mask into tensors
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)
with torch.no_grad():
  # results[0] holds the last hidden state for every token of every sentence
  results = model(input_ids, attention_mask=attention_mask)

For sentence classification, we only need the output for the first token of each sentence, the [CLS] token. That is results[0][:, 0, :], which selects [all sentences, first position only ([CLS]), all hidden units]. These values will be the features for our classifier (logistic regression).

In [16]:
features = results[0][:, 0, :].numpy()
labels = df[1]
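The slicing pattern used above can be checked on a small stand-in array. This is only a sketch: `toy_hidden` is a hypothetical NumPy array playing the role of results[0], with tiny dimensions so the result is easy to read.

```python
import numpy as np

# Stand-in for results[0]: 2 sentences, 3 token positions, 4 hidden units
toy_hidden = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Keep every sentence, only position 0 (the [CLS] slot), and every hidden unit
toy_cls = toy_hidden[:, 0, :]

print(toy_cls.shape)
print(toy_cls)
```

The token axis disappears, leaving one vector per sentence. With distilbert-base-uncased the hidden size is 768, so `features` in the notebook should have shape (2000, 768).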

Split the data into train and test sets

In [18]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
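With its defaults, train_test_split holds out 25% of the rows and shuffles them with a fresh random state, so the scores below can vary between runs. A hedged sketch of a reproducible, class-balanced split follows. It uses synthetic arrays in place of the real features, and the `random_state=42` and `stratify` choices are assumptions, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the labels
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(100, 8))
demo_labels = np.array([0, 1] * 50)

# Fixing random_state makes the split repeatable; stratify keeps the
# class proportions roughly equal in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features, demo_labels, test_size=0.25, random_state=42, stratify=demo_labels
)

print(X_tr.shape, X_te.shape)
```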

We could use grid search to find the best hyperparameters for logistic regression, but first let's try the default values.

In [19]:
clf = LogisticRegression()
clf.fit(train_features, train_labels)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
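The warning above suggests raising max_iter, and that fix can be combined with the grid search mentioned earlier. The following is a minimal sketch, not code from this notebook: it runs on synthetic data from make_classification as a stand-in for the DistilBERT features, and both the grid of C values and max_iter=1000 are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (train_features, train_labels); not the SST-2 data
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

# Assumed search grid over C, the inverse regularization strength
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# max_iter is raised from its default of 100 to give the solver more room to converge
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["C"])
print("best cross-validated accuracy:", search.best_score_)
```

On the real data, the same pattern would be applied to train_features and train_labels, and the selected C passed to LogisticRegression before scoring on the test set.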

Evaluate the model using our test set

In [20]:
clf.score(test_features, test_labels)

0.838

Compare it with a dummy classifier (cross-validated on the training set)

In [21]:
from sklearn.dummy import DummyClassifier

c = DummyClassifier()
scores = cross_val_score(c, train_features, train_labels)
print(f"Dummy classifier score: {scores.mean()}")

Dummy classifier score: 0.5266666666666666
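The dummy score above comes from cross-validation on the training set, while the logistic regression score was measured on the test set. One way to put both on the same footing is to fit the baseline on the training split and score it on the test split. The sketch below shows that pattern on synthetic data; the data, the `most_frequent` strategy, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on the sign of the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X_demo, y_demo, random_state=0)

# Baseline: always predict the most frequent training label
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
baseline_score = dummy.score(X_test, y_test)

# Real model, scored on the same held-out rows
clf_demo = LogisticRegression()
clf_demo.fit(X_train, y_train)
clf_score = clf_demo.score(X_test, y_test)

print(f"baseline: {baseline_score:.3f}, logistic regression: {clf_score:.3f}")
```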
