Note - this lesson is going to use a slightly-lower-level library (still reasonably high-level)

# History

ULMFiT: Take a corpus and build a language model - e.g. try to predict the next word of every wikipedia article

This lanugage model needs to understand quite a bit to acheive that

Started with random weights, and able to predict >30% of the time the next word in a wikipedia article.

Super interesting - we take one of these models, and run a few more epochs on movie reviews, and then a few more epochs for movie review sentiment.

Language models don't require any labels - super interesting! Just need the corpus. This was done with RNNs.

Transformers released around this time. They can use hardware pretty well. NOT for predicting next word in a sentence. Instead - take chunks, delete word at random, and ask it to predict the deleted word.

Today we'll do transformers - NOT RNNs.

## Fine Tuning

We take the initial model, which has a shitload of embedded useful activations

Then we throw away the last layer (which was doing whatever) and replace it with a random layer for the classifier (or whatever) that we're trying to build

And we train some epochs.

In [1]:
from pathlib import Path
import kaggle,zipfile

In [2]:
path = Path('us-patent-phrase-to-phrase-matching')

Note - I had to

1. Create the kaggle account
1. Verify my phone number
1. Accept the rules of the US patent competition

Before doing the above, I was getting 403 permission denied trying to download the dataset below

In [3]:
if not path.exists():
    print("Gonna download it")
    kaggle.api.competition_download_cli(str(path))
    print("Extracting")
    zipfile.ZipFile(f'{path}.zip').extractall(path)
    print("Done!")
else:
    print("Already have it!")

Already have it!


In [4]:
!ls {path}

sample_submission.csv  test.csv  train.csv


In [5]:
import pandas as pd

In [6]:
df = pd.read_csv(path / 'train.csv')

By the way, worth learning all of:

- numpy
- matplotlib
- pandas
- pytorch

Fastai book & Wes McKinney's book

In [7]:
df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


In [8]:
df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,37d61fd2272659b1,component composite coating,composition,H01
freq,1,152,24,2186


Observation - not that much "language" in this collection of documents.

In [9]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC: ' + df.anchor

when setting a field in the dataspace, always use the bracket index syntax (why? whatever)

In [10]:
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC: abatement
2    TEXT1: A47; TEXT2: active catalyst; ANC: abate...
3    TEXT1: A47; TEXT2: eliminating process; ANC: a...
4     TEXT1: A47; TEXT2: forest region; ANC: abatement
Name: input, dtype: object

we have some documents, but we need some numbers to work with the neural network

1. going to split it into tokens/words
1. get all the unique words (vocabulary) and each gets a number
    1. we don't want it too big, that affects the memory needed - now the strategy is to use "subwords"


In [11]:
from datasets import Dataset,DatasetDict

ds = Dataset.from_pandas(df)

Note - these are huggingface datasets

In [12]:
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

those above steps are called "tokenization" and "numericalization"

Not worth making your own decisions about tokenization/numericalization. It needs to match the vocabulary of the pre-trained model.

Like TIMM, huggingface has a bunch of models we can use. Notably, each model is trained on a corpus.

There are some generally-good models. deberta is a good starting point for a lot of things. start with 'small' because it's faster to train and iterate.

In [13]:
model_nm = 'microsoft/deberta-v3-small'

note - install protobuf with specific version!

`pip install --force-reinstall -v "protobuf=3.20.3"`

restart the kernal if you get collisions in the descriptor pool if you try to make this change live.

In [14]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [15]:
tokz.tokenize("Good morning reader. I'm John doing the fast.ai course. It's very interesting.")

['▁Good',
 '▁morning',
 '▁reader',
 '.',
 '▁I',
 "'",
 'm',
 '▁John',
 '▁doing',
 '▁the',
 '▁fast',
 '.',
 'ai',
 '▁course',
 '.',
 '▁It',
 "'",
 's',
 '▁very',
 '▁interesting',
 '.']

In [16]:
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")

['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']

In [17]:
def tok_func(x): return tokz(x["input"])

In [18]:
tokz.tokenize(ds[0]['input'])

['▁TEXT',
 '1',
 ':',
 '▁A',
 '47',
 ';',
 '▁TEXT',
 '2',
 ':',
 '▁abatement',
 '▁of',
 '▁pollution',
 ';',
 '▁ANC',
 ':',
 '▁abatement']

In [19]:
tok_ds = ds.map(tok_func, batched=True)

Map:   0%|          | 0/36473 [00:00<?, ? examples/s]

In [20]:
row = tok_ds[0]
row['input'], row['input_ids']

('TEXT1: A47; TEXT2: abatement of pollution; ANC: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  294,
  47284,
  2])

Note - these underscores are "fancy" underscores. 

In [21]:
#import re
#[x for x in list(tokz.vocab.keys()) if re.search("▁of", x)]
tokz.vocab["▁of"]

265

In [22]:
tok_ds = tok_ds.rename_columns({'score':'labels'}) # special thing expected by huggingface

In [23]:
eval_df = pd.read_csv(path / 'test.csv')
eval_df.describe()

Unnamed: 0,id,anchor,target,context
count,36,36,36,36
unique,36,34,36,29
top,4112d61851461f60,el display,inorganic photoconductor drum,G02
freq,1,2,1,3


important idea - separate training, validation, and test sets

we need to indentify and control overfitting

I'm not going to do all of those quadratic exercises in this notebook

Strategy - take like 20% of the dataset, fit to the remaining 80%, and then see how it performs on the validation set.

(check correctness here - the validation set it **not** used in the calculation of the loss function. only in the calcualtion of a metric, which can be used to "stop" training)

note - might be like random sample of the elements, but like in a time series, maybe truncate

building confidence that you're not overfitting sometimes only comes after screwing it up a few times :) my specialty

example

- "is this person distracted" you might want your validation set to include some photos of people that do not appear at all in the training set
- "what fish are in the picture" was actually learned from the kind of boat that was fishing

be careful - fast.ai provides the randomizer thingy, but it's not always appropriate

the test set is one more separate set, but it's not touched until the very end after training. this is to prevent against training a whole bunch of models from a whole bunch of different validation set partitions, and selecting the best on, where by random chance you've overfit using that validation set.

this can *suck* if you've worked on a model for a long time, and then you "open the safe" and it's poor-performing on the test set. gotta start over. Sorry! but better to know than to ship it.

a validation set is used to measure metrics, like "accuracy". it's not necessarilty the loss function, it probably isn't

average error / mse is usually stronger

IRL - the model you use goes through a whole SDLC

In [24]:
# there's one important bit to re-use from the housing section (also not included)
import numpy as np
def corr(x,y): return np.corrcoef(x,y)[0][1]
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}


In [25]:
dds = tok_ds.train_test_split(0.25, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

In [33]:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

Map:   0%|          | 0/36 [00:00<?, ? examples/s]

In [26]:
from transformers import TrainingArguments,Trainer # equivalent to learner in fast.ai

In [27]:
# batches - how much to stuff into the GPU, parallelization bounded by hardware
bs = 256 # I'm fine-tuning this to my own local hardware
epochs = 4

In [28]:
# learning rate - future will show us to do it automated. just try a few is decent. or just start small and double it
lr = 8e-5

In [29]:
args = TrainingArguments('output', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
                         evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
                         num_train_epochs=epochs, weight_decay=0.01, report_to='none')

In [30]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2ForSequenceClassification: ['mask_predictions.dense.weight', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.dense.bias', 'mask_predictions.classifier.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.classifier.weight']
- This IS expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2ForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from 

In [31]:
trainer.train()

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.027748,0.785947
2,No log,0.024449,0.818949
3,No log,0.023149,0.824796


TrainOutput(global_step=428, training_loss=0.04094683566940165, metrics={'train_runtime': 51.2402, 'train_samples_per_second': 2135.353, 'train_steps_per_second': 8.353, 'total_flos': 718943372412600.0, 'train_loss': 0.04094683566940165, 'epoch': 4.0})

this is only possible with the pre-trained model. already pretty darn good with just a single epoch.

In [35]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

array([[ 0.57568359],
       [ 0.71679688],
       [ 0.54345703],
       [ 0.36035156],
       [ 0.01372528],
       [ 0.61083984],
       [ 0.52880859],
       [ 0.08062744],
       [ 0.29638672],
       [ 1.10351562],
       [ 0.28662109],
       [ 0.22741699],
       [ 0.74609375],
       [ 0.8671875 ],
       [ 0.78955078],
       [ 0.40869141],
       [ 0.26416016],
       [ 0.00309563],
       [ 0.70751953],
       [ 0.36254883],
       [ 0.42626953],
       [ 0.2253418 ],
       [ 0.0838623 ],
       [ 0.25146484],
       [ 0.57568359],
       [-0.06103516],
       [-0.03259277],
       [ 0.00908661],
       [-0.06567383],
       [ 0.62011719],
       [ 0.33276367],
       [ 0.01774597],
       [ 0.68457031],
       [ 0.47485352],
       [ 0.46728516],
       [ 0.19238281]])

In [36]:
preds = np.clip(preds, 0, 1)
preds

array([[0.57568359],
       [0.71679688],
       [0.54345703],
       [0.36035156],
       [0.01372528],
       [0.61083984],
       [0.52880859],
       [0.08062744],
       [0.29638672],
       [1.        ],
       [0.28662109],
       [0.22741699],
       [0.74609375],
       [0.8671875 ],
       [0.78955078],
       [0.40869141],
       [0.26416016],
       [0.00309563],
       [0.70751953],
       [0.36254883],
       [0.42626953],
       [0.2253418 ],
       [0.0838623 ],
       [0.25146484],
       [0.57568359],
       [0.        ],
       [0.        ],
       [0.00908661],
       [0.        ],
       [0.62011719],
       [0.33276367],
       [0.01774597],
       [0.68457031],
       [0.47485352],
       [0.46728516],
       [0.19238281]])

In [37]:
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv')

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

1054