DistilBERT is a smaller, distilled version of BERT (Bidirectional Encoder Representations from Transformers), available pre-trained from Hugging Face. Here it is used to produce sentence embeddings for sentiment analysis.
This notebook follows the explanation and tutorial at https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

I also included the steps for performing your own inference on new movie reviews after training the model following the tutorial.

In [27]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV

In [2]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)

In [3]:
df.head()

Unnamed: 0,0,1
0,"a stirring , funny and finally transporting re...",1
1,apparently reassembled from the cutting room f...,0
2,they presume their audience wo n't sit still f...,0
3,this is a visually stunning rumination on love...,1
4,jonathan parker 's bartleby should have been t...,1


In [4]:
# Import Pre-trained DistilBERT model + Tokenizer
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)
tokenized = df[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))



In [16]:
# Pad every token list with 0 (the [PAD] id) up to the length of the longest sentence
max_len = max(len(i) for i in tokenized.values)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
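
The padding step can be illustrated on a few made-up token lists. This is only a sketch: `pad_sequences` and `toy_tokenized` are hypothetical names, and the ids are invented except for 101 and 102, which are the [CLS] and [SEP] ids in the BERT uncased vocabulary.

```python
def pad_sequences(token_lists, pad_id=0):
    """Right-pad every token list with pad_id to the length of the longest one."""
    max_len = max(len(tokens) for tokens in token_lists)
    return [tokens + [pad_id] * (max_len - len(tokens)) for tokens in token_lists]

# Toy token ids, invented for illustration (not real tokenizer output)
toy_tokenized = [[101, 7, 8, 102], [101, 9, 102], [101, 4, 5, 6, 102]]
toy_padded = pad_sequences(toy_tokenized)

for row in toy_padded:
    print(row)
```

Every row ends up with the same length, which is what lets the batch be stored as one rectangular array.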


In [19]:
# Attention mask: 1 for real tokens, 0 for padding, so the model ignores the padded positions
np.array(padded).shape
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(6920, 67)
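
The mask above relies on the padding id being 0 and on no real token sharing that id. An alternative sketch builds the same mask from the original sentence lengths, so it does not depend on the padding value at all. `attention_mask_from_lengths` and the toy lengths are hypothetical.

```python
def attention_mask_from_lengths(lengths, max_len):
    """Return one mask row per sentence: 1 for real tokens, 0 for padding."""
    return [[1] * n + [0] * (max_len - n) for n in lengths]

# Token counts of three toy sentences, padded to length 5
toy_lengths = [4, 3, 5]
toy_mask = attention_mask_from_lengths(toy_lengths, max_len=5)

for row in toy_mask:
    print(row)
```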

In [20]:
# Run the padded sentences through DistilBERT to get the hidden states for every token
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [26]:
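# Use the final hidden state of the [CLS] token (position 0) as the embedding of each sentence
features = last_hidden_states[0][:,0,:].numpy()
# Column 1 of the dataframe holds the sentiment label (1 = positive, 0 = negative)
labels = df[1]
train_features, test_features, train_lables, test_labels = train_test_split(features, labels)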
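
The slice `[:,0,:]` is easier to read on a small array. The sketch below uses a made-up tensor of shape (2, 3, 4) in place of the real hidden states, which for this dataset would have shape (6920, 67, 768). `toy_hidden` and `toy_features` are hypothetical names.

```python
import numpy as np

# Stand-in for last_hidden_states[0]: (batch=2, sequence_length=3, hidden_size=4)
toy_hidden = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Keep every sentence, only token position 0 (the [CLS] token), and all hidden units
toy_features = toy_hidden[:, 0, :]

print(toy_features.shape)
print(toy_features)
```

The result has one row per sentence and one column per hidden unit, which is the 2D layout scikit-learn expects.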

In [28]:
# Grid search to find the best C (inverse regularization strength) for logistic regression
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_lables)
print('Best Parameters:', grid_search.best_params_)
print('Best Score:', grid_search.best_score_)
# Results for dataset
# Best Parameters: {'C': 5.263252631578947}
# Best Score: 0.8315992292870906

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Best Parameters: {'C': 5.263252631578947}
Best Score: 0.8315992292870906
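
The best C reported above (about 5.26) is the second point of the linear grid, which jumps straight from 0.0001 to about 5.26. A log-spaced grid is one possible alternative that samples small values more densely. The sketch below only compares the two grids; `log_grid` and its range are assumptions, not part of the tutorial.

```python
import numpy as np

linear_grid = np.linspace(0.0001, 100, 20)  # grid used in the cell above
log_grid = np.logspace(-4, 2, 7)            # hypothetical alternative: 1e-4, 1e-3, ..., 1e2

# Count how many candidate values of C fall below 1 in each grid
print("linear grid values below 1:", int((linear_grid < 1).sum()))
print("log grid values below 1:", int((log_grid < 1).sum()))
```

To try it, `log_grid` could be passed as `parameters = {'C': log_grid}`. Whether it gives a better score on this dataset is not something this notebook has tested.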




In [29]:
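# Use the best C found by the grid search; best_score_ is the cross-validated accuracy, not a value for C
lr_clf = LogisticRegression(C=grid_search.best_params_['C'])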
lr_clf.fit(train_features, train_lables)



In [30]:
lr_clf.score(test_features, test_labels)

0.8531791907514451

In [31]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()
scores = cross_val_score(clf, train_features, train_lables)
print("Dummy Classifier Score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std()*2))

Dummy Classifier Score: 0.518 (+/- 0.00)
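
The dummy score serves as a baseline. Assuming the default `DummyClassifier` strategy, which in recent scikit-learn versions always predicts the most frequent training label, its accuracy is roughly the share of the majority class. The sketch below computes that share by hand on made-up labels. `majority_baseline_accuracy` and `toy_labels` are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)

# Toy labels: 6 positives and 4 negatives
toy_labels = [1, 0, 1, 1, 0, 1, 0, 1, 0, 1]
print(majority_baseline_accuracy(toy_labels))
```

A score of 0.518 therefore suggests the two classes are nearly balanced, so the logistic regression result is well above chance.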


Update the cells below to predict the sentiment of new movie reviews.

In [75]:
# Convert a sentence string to a padded, tokenized array for DistilBERT.
# To run inference, the new sentence must go through the same tokenizing, padding, and masking steps as the training and test data.

# Replace this with your new sentence
new_sentence_inf = "The laughs are plentiful and the set design is eye-popping, but what makes Barbie truly work is its deft tackling of numerous themes."

# tokenize string, pad array, and create attention mask for the new dense array
token_nsinf = tokenizer.encode(new_sentence_inf, add_special_tokens=True)
padded_new_token = np.array(np.pad(token_nsinf, (0,max_len-(len(token_nsinf)))))
test_attention_mask = np.array(np.where(padded_new_token != 0, 1, 0))

# A single sentence is still passed to the model as a batch of one
padded_new_token = padded_new_token.reshape(1,max_len)
test_attention_mask = test_attention_mask.reshape(1,max_len)

# Create tensors for input and mask
input_id = torch.tensor(padded_new_token)
test_attention_mask = torch.tensor(test_attention_mask)

# Run the new sentence through DistilBERT without tracking gradients
with torch.no_grad():
    new_last_hidden_states = model(input_id, attention_mask=test_attention_mask)

(1, 67)
(1, 67)
torch.Size([1, 67])
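
The cell above assumes the new sentence has at most `max_len` tokens. With a longer sentence the pad width becomes negative and `np.pad` raises an error. A minimal sketch of a helper that handles several sentences and truncates long ones is shown below. `prepare_inputs` and `fake_encode` are hypothetical, and the plain slice used for truncation can drop the trailing [SEP] token, which is a simplification.

```python
def prepare_inputs(sentences, encode, max_len, pad_id=0):
    """Tokenize, truncate to max_len, pad, and build attention masks."""
    input_ids, masks = [], []
    for sentence in sentences:
        ids = encode(sentence)[:max_len]  # simple truncation of over-long sentences
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        masks.append([1] * len(ids) + [0] * n_pad)
    return input_ids, masks

# Stand-in encoder for illustration only. In the notebook you would pass
# lambda s: tokenizer.encode(s, add_special_tokens=True)
def fake_encode(sentence):
    return [101] + [len(word) for word in sentence.split()] + [102]

ids, masks = prepare_inputs(["a great film", "dull"], fake_encode, max_len=4)
print(ids)
print(masks)
```

The returned lists can be wrapped with `torch.tensor(...)` and passed to the model in the same way as `input_id` and `test_attention_mask` above.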


In [76]:
# Retrieve the [CLS] embedding produced by DistilBERT for the new sentence
new_test_feature = new_last_hidden_states[0][:,0,:].numpy()

# Reshape to (1, n_features) for model inference/prediction
single_test_feature = new_test_feature.reshape(1,-1)
resp = lr_clf.predict(single_test_feature)

In [79]:
print("Barbie Movie Review Sentiment\n'{}'\nSentiment: {}".format(new_sentence_inf, "Negative" if resp[0] == 0 else "Positive"))