In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# XLNet specifics
import torch
from transformers import XLNetModel, XLNetTokenizer

# Dataset info
We use the SST-2 dataset, on which XLNet was able to get 94.4% accuracy.

In [2]:
df_train = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv',
                 delimiter='\t',
                 names=['sentence','label'])

df_test = pd.read_csv('https://raw.githubusercontent.com/clairett/pytorch-sentiment-classification/master/data/SST2/test.tsv',
                 delimiter='\t',
                 names=['sentence','label'])

split_point = len(df_train)
df = pd.concat([df_train, df_test])

In [3]:
# df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv',
#                  delimiter='\t',
#                  names=['sentence','label'])
# df = df[:2000] # take only 2000 sentences for speed rn

In [4]:
df_train['label'].value_counts()

1    3610
0    3310
Name: label, dtype: int64

In [5]:
df_test['label'].value_counts()

0    912
1    909
Name: label, dtype: int64

# Training Model

Training XLNet from scratch is a very complicated task. Instead, we use a pretrained, lighter version of the model (xlnet-base-cased: 110M params) and extract its embeddings to feed into a separate classification layer. Because of this shortcut, our results differ from those reported for XLNet.

Typically, XLNet is used by taking a pretrained version (such as xlnet-large-cased) and fine-tuning it on the dataset. Fine-tuning yields the higher accuracy shown at the end of this notebook.

In [6]:
# See here for all pretrained models: https://huggingface.co/transformers/pretrained_models.html?highlight=pretrained
pretrained_label = 'xlnet-base-cased'

tokenizer = XLNetTokenizer.from_pretrained(pretrained_label)
model = XLNetModel.from_pretrained(pretrained_label)

# Data preparation
### Tokenization

In [7]:
tokenized = df['sentence'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

The added special tokens include a classifier token, '<cls>', which is the one we are mainly interested in.

In [8]:
print(f'Tokenizer string form: {tokenizer.cls_token}, id: {tokenizer.cls_token_id}')

Tokenizer string form: <cls>, id: 3


In [9]:
max_len = 0
clf_positions = np.zeros(len(tokenized), dtype=np.int64)
for posn, i in enumerate(tokenized.values):
    clf_positions[posn] = len(i) - 1
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

A sample padded sentence looks like the following:

In [10]:
padded[0]

array([   24, 16003,    17,    19,  5787,    21,  1381, 21469,    17,
          88,  7693, 15930,    56,    20,  4111,    21,    18, 11740,
          21,  4974,    23,  6941,  2701,     4,     3,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0])

We can see that the `<cls>` token (id 3) is at the end of the sentence, right before the padding.
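The padding step above can be sketched on toy data as follows. This is a minimal illustration, not the notebook's exact code: `pad_right` is a hypothetical helper, the token ids are made up, and the pad id of 0 simply mirrors the value used in the cell above.

```python
import numpy as np

def pad_right(token_ids, pad_id=0):
    """Right-pad variable-length id lists to a common length.

    Returns the padded matrix and, for each row, the index of the
    last real (non-padding) token.
    """
    max_len = max(len(ids) for ids in token_ids)
    padded = np.full((len(token_ids), max_len), pad_id, dtype=np.int64)
    last_positions = np.empty(len(token_ids), dtype=np.int64)
    for row, ids in enumerate(token_ids):
        padded[row, :len(ids)] = ids
        last_positions[row] = len(ids) - 1
    return padded, last_positions

# Toy ids (made up); the trailing 4 and 3 mimic the <sep> and <cls> ids seen above.
toy = [[24, 17, 4, 3], [88, 4, 3]]
padded_toy, last_toy = pad_right(toy)
```

Because the classifier token is appended last, `last_toy` plays the same role as `clf_positions`: it records where that token sits in each padded row.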

### Masking
A slight nuance: we need to pass in an attention mask, which tells XLNet which positions hold real tokens and which are padding.

In [12]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(8741, 86)
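The mask above is derived from `padded != 0`, which treats every position holding id 0 as padding. As a hedged alternative, the mask can be built from the sentence lengths instead, so it does not depend on the pad id being distinct from every real token id. `mask_from_lengths` is a hypothetical helper introduced for illustration only.

```python
import numpy as np

def mask_from_lengths(lengths, max_len):
    """Return a (n_sentences, max_len) mask: 1 for real tokens, 0 for padding."""
    lengths = np.asarray(lengths)
    # Compare each column index with the row's length via broadcasting.
    return (np.arange(max_len)[None, :] < lengths[:, None]).astype(np.int64)

# Two toy sentences of length 4 and 3, padded to length 5.
mask_toy = mask_from_lengths([4, 3], max_len=5)
```

In this notebook the lengths would be `len(i)` for each entry of `tokenized.values`, and `max_len` is the value computed in the padding cell.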

## Embedding
We need to pass the tokenized sentences into XLNet now, and get the embeddings to pass into
another classifier.

In [13]:
inputs = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(inputs, attention_mask=attention_mask)
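The cell above pushes all sentences through the model in a single call, which can need a lot of memory. One possible workaround is to process the data in mini-batches and concatenate the results. The sketch below shows only the batching logic: `embed_in_batches` and `fake_forward` are hypothetical names, and `fake_forward` is a stand-in for the real model. With XLNet, `forward_fn` would wrap a call such as `model(batch_ids, attention_mask=batch_mask)[0]` under `torch.no_grad()` and return a NumPy array.

```python
import numpy as np

def embed_in_batches(forward_fn, input_ids, attention_mask, batch_size=64):
    """Apply forward_fn to consecutive row slices and stack the outputs."""
    outputs = []
    for start in range(0, len(input_ids), batch_size):
        stop = start + batch_size
        outputs.append(forward_fn(input_ids[start:stop], attention_mask[start:stop]))
    return np.concatenate(outputs, axis=0)

def fake_forward(ids, mask):
    # Stand-in for the model: maps (batch, seq_len) to (batch, seq_len, 2).
    return np.stack([ids, mask], axis=-1).astype(np.float32)

ids = np.arange(20).reshape(5, 4)
mask = np.ones_like(ids)
hidden = embed_in_batches(fake_forward, ids, mask, batch_size=2)
```

The row order is preserved, so positions computed before batching (such as `clf_positions`) still line up with the concatenated output.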

To get one feature vector per sentence, we select the hidden state at the classifier token's position.
The final hidden state holds one vector per token position, so indexing each row at its entry in `clf_positions` picks out the classifier token.

In [14]:
# Features
final_hidden_state = last_hidden_states[0]
X = final_hidden_state[np.arange(len(final_hidden_state)), clf_positions]

# Labels
y = df['label']
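The indexing used for `X` pairs each row index with one position per row. A small NumPy example with made-up shapes (2 sentences, 3 token positions, hidden size 4) shows what it selects:

```python
import numpy as np

# Toy "hidden states": shape (n_sentences, seq_len, hidden_size).
hidden = np.arange(2 * 3 * 4).reshape(2, 3, 4)

# Position of the token to keep in each sentence (toy values).
positions = np.array([2, 1])

# Row i is paired with positions[i], giving one vector per sentence.
picked = hidden[np.arange(len(hidden)), positions]
```

Here `picked` has shape `(2, 4)`: row 0 is `hidden[0, 2]` and row 1 is `hidden[1, 1]`.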

## Classification

With the embeddings extracted, we can now pass them into any classifier we like. For simplicity,
we use the sklearn MLPClassifier with its default values.

In [15]:
num_test_pts = len(df) - split_point
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=num_test_pts)
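Note that `train_test_split` shuffles by default, so the cell above draws a random test set of the same size as the original test file rather than reusing the original test rows. If the original split should be kept, slicing at `split_point` is one option, since the training rows were concatenated before the test rows. The sketch below is an assumption about how that could look, using a hypothetical helper `split_at` and toy data.

```python
import numpy as np

def split_at(features, labels, split_point):
    """Split features and labels at a fixed row index, without shuffling."""
    features = np.asarray(features)
    labels = np.asarray(labels)
    return (features[:split_point], features[split_point:],
            labels[:split_point], labels[split_point:])

# Toy data: 5 rows, the first 3 standing in for the training file.
X_toy = np.arange(10).reshape(5, 2)
y_toy = np.array([1, 0, 1, 0, 1])
X_tr, X_te, y_tr, y_te = split_at(X_toy, y_toy, split_point=3)
```

Scores obtained this way would not be directly comparable to the shuffled split used in this notebook.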

In [16]:
clf = MLPClassifier()
clf.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=200,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [17]:
clf.score(X_test, y_test)

0.7874794069192751

This is a lot better than random guessing! It doesn't reach XLNet's full accuracy in the paper (94.4%), but this model has:  
- Simple, unoptimized classification layer
- Fewer, untuned weights in XLNet

# Fine-tuning XLNet

Although the heuristic test above is fast and easy to implement, it doesn't obtain the results
found in the XLNet paper (94.4% accuracy).

By fine-tuning on the SST-2 dataset, we were able to raise the accuracy to 94.0%. This was accomplished
by setting up the transformers repo and running the [examples](https://github.com/huggingface/transformers/blob/master/examples/README.md):

```bash
export GLUE_DIR=/path/to/data
export TASK_NAME=SST-2

python run_glue.py \
  --model_type xlnet \
  --model_name_or_path xlnet-base-cased \
  --task_name $TASK_NAME \
  --do_train \
  --do_eval \
  --data_dir $GLUE_DIR/$TASK_NAME \
  --max_seq_length 128 \
  --per_gpu_train_batch_size 16 \
  --learning_rate 2e-5 \
  --num_train_epochs 3.0 \
  --output_dir $GLUE_DIR/tmp/$TASK_NAME/
```

yields:
```bash
$ cat eval_results.txt
acc = 0.9403669724770642
```

Fine-tuning took significantly longer to run, but it adapted the model to this particular dataset very well.