## Introduction

In [1]:
import fastai

import matplotlib.pyplot as plt
from pathlib import Path
import numpy as np
import os
import torch
import pandas as pd
import seaborn as sns

from fastai import *

print("Fastai version:", fastai.__version__)
np.set_printoptions(linewidth=130)

# Track whether on Kaggle or not
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

Fastai version: 2.7.17


In [2]:
# Set torch to use cuda if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)
# If using CUDA, check the version and which GPU is being used
if device.type == "cuda":
    print("CUDA Version:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))

Device: cuda
CUDA Version: 12.4
GPU: NVIDIA GeForce RTX 4080 SUPER


In [3]:
path = Path('us-patent-phrase-to-phrase-matching')

In [4]:
# Using Kaggle API to download the dataset

if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(Path('input'))

us-patent-phrase-to-phrase-matching.zip: Skipping, found more recently modified local copy (use --force to force download)


## Import and EDA

In [5]:
if iskaggle:
    path = Path('../input/us-patent-phrase-to-phrase-matching')
    ! pip install -q datasets

Documents in NLP datasets are generally in one of two main forms:

- **Larger documents**: One text file per document, often organised into one folder per category
- **Smaller documents**: One document (or document pair, optionally with metadata) per row in a [CSV file](https://realpython.com/python-csv/).

This competition's data is in the second form, so let's read the training CSV into a pandas DataFrame:

In [6]:
df = pd.read_csv('input/train.csv')

In [7]:
df.describe(include='object')

| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


We can see that in the 36473 rows, there are 733 unique anchors, 106 contexts, and nearly 30000 targets. Some anchors are very common, with "component composite coating" for instance appearing 152 times.

Earlier, I suggested we could represent the input to the model as something like "*TEXT1: abatement; TEXT2: eliminating process*". We'll need to add the context to this too. In Pandas, we just use `+` to concatenate, like so:

In [8]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

We can refer to a column (also known as a *series*) either using regular Python "dotted" notation, or by indexing the DataFrame like a dictionary. To get the first few rows, use `head()`:

In [9]:
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC...
1    TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2    TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3    TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4    TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object
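
As a small illustration of the two points above (string concatenation with `+`, and the two ways of referring to a column), here is a sketch on a tiny hand-made DataFrame. The `toy` frame and its two rows are made up for this example and only mimic the shape of the competition data.

```python
import pandas as pd

# Tiny made-up frame that mimics the columns used above (hypothetical rows)
toy = pd.DataFrame({
    "context": ["A47", "A47"],
    "target": ["abatement of pollution", "forest region"],
    "anchor": ["abatement", "abatement"],
})

# `+` works element-wise: the strings in each row are joined together
toy["input"] = "TEXT1: " + toy.context + "; TEXT2: " + toy.target + "; ANC1: " + toy.anchor

# Dotted access and dictionary-style access return the same Series
same = toy.input.equals(toy["input"])
first_input = toy["input"].iloc[0]
print(same)
print(first_input)
```

Dictionary-style access is the safer habit when a column name contains spaces or clashes with an existing DataFrame attribute, since dotted access cannot reach those columns.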

## Tokenization

In [10]:
from datasets import Dataset,DatasetDict

ds = Dataset.from_pandas(df)

Here's how it's displayed in a notebook:

In [11]:
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

- *Tokenization*: Split each text up into words (or actually, as we'll see, into *tokens*)
- *Numericalization*: Convert each word (or token) into a number.
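
To make the two steps concrete, here is a deliberately crude sketch in plain Python. It splits on whitespace and builds its own vocabulary, which is an assumption made purely for illustration. Real tokenizers, including the one used below, are more sophisticated, and the helper names `tokenize`, `build_vocab` and `numericalize` are hypothetical.

```python
def tokenize(text):
    """Split on whitespace: a crude stand-in for a real tokenizer."""
    return text.lower().split()

def build_vocab(texts):
    """Give each distinct token a unique integer, in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab):
    """Convert a text into the list of integer IDs of its tokens."""
    return [vocab[tok] for tok in tokenize(text)]

texts = ["abatement of pollution", "act of abating"]
vocab = build_vocab(texts)
ids = numericalize("abatement of pollution", vocab)
print(vocab)
print(ids)
```

Note that the shared word "of" gets the same ID in both phrases. A stable mapping from token to ID is what lets a model learn something about each token.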

The details about how this is done actually depend on the particular model we use. So first we'll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is to use this (replace "small" with "large" for a slower but more accurate model, once you've finished exploring):

In [12]:
# When iterating, try to use a small model first, then move to a larger model to get higher accuracy once experimentation is done

model_nm = 'microsoft/deberta-v3-small'

`AutoTokenizer` will create a tokenizer appropriate for a given model:

In [13]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)



Here's an example of how the tokenizer splits a text into "tokens" (which are like words, but can be sub-word pieces, as you see below):

In [14]:
tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")

['▁G',
 "'",
 'day',
 '▁folks',
 ',',
 '▁I',
 "'",
 'm',
 '▁Jeremy',
 '▁from',
 '▁fast',
 '.',
 'ai',
 '!']

Uncommon words will be split into pieces. The start of a new word is represented by `▁`:

In [15]:
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")

['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']
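
The idea of splitting a rare word into known pieces can be sketched with a greedy longest-match rule. This is a simplified illustration and not the algorithm the DeBERTa tokenizer actually uses. The `pieces` set is hand-written for the example, and the function names are hypothetical.

```python
MARK = "▁"  # marks the start of a new word, as in the output above

def split_into_pieces(word, pieces):
    """Greedily take the longest known piece from the left.

    Falls back to a single character when no known piece matches.
    """
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here, so emit one character
            out.append(word[i])
            i += 1
    return out

def toy_tokenize(text, pieces):
    """Prefix each word with the marker, then split it into known pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(split_into_pieces(MARK + word, pieces))
    return tokens

# Hand-written piece inventory (an assumption for this example)
pieces = {"▁platypus", "▁is", "▁an", "▁or", "ni", "tho", "rhynch", "us"}
tokens = toy_tokenize("platypus is an ornithorhynchus", pieces)
print(tokens)
```

Common words survive as a single piece, while the rare word is rebuilt from several shorter ones, so the vocabulary can stay small without losing the ability to represent unusual text.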

Here's a simple function which tokenizes our inputs:

In [16]:
def tok_func(x): return tokz(x["input"])

To run this quickly on every row in our dataset, use `map`. Passing `batched=True` hands the tokenizer many rows at a time, which is much faster than processing them one by one:

In [17]:
# Use map to run the tokenization function over the dataset in batches

tok_ds = ds.map(tok_func, batched=True)


This adds a new item to our dataset called `input_ids`. For instance, here is the input and IDs for the first row of our data:

In [18]:
row = tok_ds[0]
row['input'], row['input_ids']

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

So, what are those IDs and where do they come from? The secret is that the tokenizer has a dictionary called `vocab` which maps every possible token string to a unique integer. We can look them up like this, for instance to find the ID of the token for the word "of":

In [19]:
tokz.vocab['▁of']

265

Looking above at our input IDs, we do indeed see that `265` appears as expected.

Finally, we need to prepare our labels. Transformers always assumes that your labels have the column name `labels`, but in our dataset it's currently `score`. Therefore, we need to rename it:

In [20]:
tok_ds = tok_ds.rename_columns({'score':'labels'})

Now that we've prepared our tokens and labels, we need to create our validation set.

## Test and validation sets

### Validation set

Transformers uses a `DatasetDict` for holding your training and validation sets. To create one that contains 25% of our data for the validation set, and 75% for the training set, use `train_test_split`:

In [21]:
dds = tok_ds.train_test_split(0.25, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

As you see above, the validation set here is called `test` and not `validate`, so be careful!

In practice, a random split like we've used here might not be a good idea -- here's what Dr Rachel Thomas has to say about it:

> "*One of the most likely culprits for this disconnect between results in development vs results in production is a poorly chosen validation set (or even worse, no validation set at all). Depending on the nature of your data, choosing a validation set can be the most important step. Although sklearn offers a `train_test_split` method, this method takes a random subset of the data, which is a poor choice for many real-world problems.*"

I strongly recommend reading her article [How (and why) to create a good validation set](https://www.fast.ai/2017/11/13/validation-sets/) to more fully understand this critical topic.
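
One common alternative to a purely random split is to hold out whole groups, so that no group value appears in both sets. The sketch below does this in plain Python. Grouping by `anchor` is an assumption chosen for illustration, not something this notebook does, and the rows and the `split_by_group` helper are hypothetical.

```python
import random

def split_by_group(rows, key, valid_frac=0.25, seed=42):
    """Hold out whole groups so that no group value appears in both sets."""
    groups = sorted({row[key] for row in rows})
    random.Random(seed).shuffle(groups)
    n_valid = max(1, round(len(groups) * valid_frac))
    valid_groups = set(groups[:n_valid])
    train = [r for r in rows if r[key] not in valid_groups]
    valid = [r for r in rows if r[key] in valid_groups]
    return train, valid

# Made-up rows: four anchors with two targets each
rows = [
    {"anchor": "abatement", "target": "abatement of pollution"},
    {"anchor": "abatement", "target": "forest region"},
    {"anchor": "el display", "target": "illumination"},
    {"anchor": "el display", "target": "display panel"},
    {"anchor": "wood article", "target": "wooden box"},
    {"anchor": "wood article", "target": "wooden material"},
    {"anchor": "heat sink", "target": "cooling fin"},
    {"anchor": "heat sink", "target": "thermal pad"},
]

train, valid = split_by_group(rows, key="anchor")
train_anchors = {r["anchor"] for r in train}
valid_anchors = {r["anchor"] for r in valid}
print(len(train), len(valid), train_anchors & valid_anchors)
```

With a split like this, the validation score measures how well the model handles anchors it has never seen, which may be closer to what a held-out test set demands.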

### Test set

So that's the validation set explained, and created. What about the "test set" then -- what's that for?

The *test set* is yet another dataset that's held out from training. But it's held out from reporting metrics too! The accuracy of your model on the test set is only ever checked after you've completed your entire training process, including trying different models, training methods, data processing, etc.

You see, as you try all these different things, to see their impact on the metrics on the validation set, you might just accidentally find a few things that entirely coincidentally improve your validation set metrics, but aren't really better in practice. Given enough time and experiments, you'll find lots of these coincidental improvements. That means you're actually over-fitting to your validation set!

That's why we keep a test set held back. Kaggle's public leaderboard is like a test set that you can check from time to time. But don't check too often, or you'll end up over-fitting to the test set too!

Kaggle has a *second* test set, which is yet another held-out dataset that's only used at the *end* of the competition to assess your predictions. That's called the "private leaderboard". Here's a [great post](https://gregpark.io/blog/Kaggle-Psychopathy-Postmortem/) about what can happen if you overfit to the public leaderboard.

We'll use `eval` as our name for the test set, to avoid confusion with the `test` dataset that was created above.

In [22]:
eval_df = pd.read_csv('input/test.csv')
eval_df.describe()

| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36 | 36 | 36 | 36 |
| unique | 36 | 34 | 36 | 29 |
| top | 4112d61851461f60 | el display | inorganic photoconductor drum | G02 |
| freq | 1 | 2 | 1 | 3 |


In [23]:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)


## Metrics and correlation

When we're training a model, there will be one or more *metrics* that we're interested in maximising or minimising. These are the measurements that should, hopefully, represent how well our model will work for us.

In real life, outside of Kaggle, things are not so easy... As my partner Dr Rachel Thomas notes in [The problem with metrics is a big problem for AI](https://www.fast.ai/2019/09/24/metrics/):

>  At their heart, what most current AI approaches do is to optimize metrics. The practice of optimizing metrics is not new nor unique to AI, yet AI can be particularly efficient (even too efficient!) at doing so. This is important to understand, because any risks of optimizing metrics are heightened by AI. While metrics can be useful in their proper place, there are harms when they are unthinkingly applied. Some of the scariest instances of algorithms run amok all result from over-emphasizing metrics. We have to understand this dynamic in order to understand the urgent risks we are facing due to misuse of AI.

In Kaggle, however, it's very straightforward to know what metric to use: Kaggle will tell you! According to this competition's [evaluation page](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/overview/evaluation), "*submissions are evaluated on the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) between the predicted and actual similarity scores*." This coefficient is usually abbreviated using the single letter *r*. It is the most widely used measure of the degree of relationship between two variables.

*r* can vary between `-1`, which means perfect inverse correlation, and `+1`, which means perfect positive correlation. The mathematical formula for it is much less important than getting a good intuition for what the different values look like. To start to get that intuition, let's look at some examples using the [California Housing](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset) dataset, whose target variable "*is the median house value for California districts, expressed in hundreds of thousands of dollars*". This dataset is provided by the excellent [scikit-learn](https://scikit-learn.org/stable/) library, which is the most widely used library for machine learning outside of deep learning.
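
To get a first feel for the range of values described above, here is a minimal sketch that computes *r* from its definition on a few made-up numbers. The `pearson_r` helper and the data are assumptions for illustration. The notebook's own `corr` helper relies on `np.corrcoef` instead.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(xs, [2 * x + 1 for x in xs])       # points on a rising line
r_down = pearson_r(xs, [-3 * x for x in xs])        # points on a falling line
r_noisy = pearson_r(xs, [2.0, 1.0, 4.0, 3.0, 5.0])  # rising trend with some noise
print(r_up, r_down, r_noisy)
```

The first two cases sit at the extremes of `+1` and `-1`, while the noisy case lands at 0.8: still a clear upward relationship, but no longer a perfect one.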

## Training

### Training our model

To train a model in Transformers we'll need this:

In [24]:
from transformers import TrainingArguments,Trainer

We pick a batch size that fits our GPU, and a small number of epochs so we can run experiments quickly:

In [25]:
bs = 128
epochs = 4

The most important hyperparameter is the learning rate. fastai provides a learning rate finder to help you figure this out, but Transformers doesn't, so you'll just have to use trial and error. The idea is to find the largest value that doesn't cause training to fail.

In [26]:
lr = 8e-5

Transformers uses the `TrainingArguments` class to set up arguments. Don't worry too much about the values we're using here -- they should generally work fine in most cases. It's just the three parameters above (batch size, number of epochs, and learning rate) that you may need to change for different models.

In [28]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    eval_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
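
For the curious, the combination of `warmup_ratio=0.1` and `lr_scheduler_type='cosine'` above can be sketched as a linear warmup followed by a half-cosine decay. This is a sketch under stated assumptions rather than the library's code. It assumes a single GPU, no gradient accumulation, and that the number of warmup steps is the ratio times the total, rounded up. The `lr_at_step` helper is hypothetical.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio):
    """Linear warmup to base_lr, then half-cosine decay towards zero."""
    # Assumption: warmup length is the ratio of total steps, rounded up
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Values taken from this notebook: 27354 training rows, bs=128, 4 epochs
steps_per_epoch = math.ceil(27354 / 128)
total_steps = steps_per_epoch * 4
schedule = [lr_at_step(s, total_steps, 8e-5, 0.1) for s in range(total_steps + 1)]
print(total_steps, max(schedule), schedule[499])
```

Under these assumptions the sketch gives 856 total steps and a learning rate of about 3.54e-05 at step 499, in line with the `learning_rate` logged at epoch 2.34 in the training output further down.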

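We can now create our model and `Trainer`, a class which combines the data and model together (just like `Learner` in fastai). First we define the metric: `corr` computes the Pearson correlation, and `corr_d` wraps it in the dictionary that `Trainer` expects back from `compute_metrics`: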

In [32]:
def corr(x,y): return np.corrcoef(x,y)[0][1]

In [33]:
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}

In [34]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


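As you see, Transformers warns that some weights, including the classifier head, are newly initialized. That is expected, since training them is exactly what we're about to do, so you can safely ignore it.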

Let's train our model!

In [35]:
trainer.train();

{'eval_loss': 0.02530163899064064, 'eval_pearson': 0.7917664128863953, 'eval_runtime': 1.0971, 'eval_samples_per_second': 8311.904, 'eval_steps_per_second': 32.814, 'epoch': 1.0}
{'eval_loss': 0.02233044244349003, 'eval_pearson': 0.8209119995419936, 'eval_runtime': 1.0932, 'eval_samples_per_second': 8341.85, 'eval_steps_per_second': 32.932, 'epoch': 2.0}
{'loss': 0.0342, 'grad_norm': 0.20000532269477844, 'learning_rate': 3.544034360437838e-05, 'epoch': 2.34}
{'eval_loss': 0.02283954806625843, 'eval_pearson': 0.8316344555829821, 'eval_runtime': 1.1038, 'eval_samples_per_second': 8261.58, 'eval_steps_per_second': 32.615, 'epoch': 3.0}
{'eval_loss': 0.02246575616300106, 'eval_pearson': 0.8321053641740096, 'eval_runtime': 1.1188, 'eval_samples_per_second': 8150.731, 'eval_steps_per_second': 32.177, 'epoch': 4.0}
{'train_runtime': 66.9909, 'train_samples_per_second': 1633.297, 'train_steps_per_second': 12.778, 'train_loss': 0.025648368296222152, 'epoch': 4.0}


Lots more warnings from Transformers again -- you can ignore these as before.

The key thing to look at is the `eval_pearson` value in the logs above. As you see, it increases from about 0.79 after the first epoch to about 0.83 after the fourth. That's great news! We can now submit our predictions to Kaggle if we want them to be scored on the official leaderboard. Let's get some predictions on the test set:

In [36]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds


array([[ 0.57910156],
       [ 0.71533203],
       [ 0.57470703],
       [ 0.33837891],
       [-0.02989197],
       [ 0.47314453],
       [ 0.48754883],
       [-0.0140152 ],
       [ 0.29736328],
       [ 1.08984375],
       [ 0.2590332 ],
       [ 0.25341797],
       [ 0.80566406],
       [ 0.8515625 ],
       [ 0.74023438],
       [ 0.41430664],
       [ 0.29980469],
       [-0.02618408],
       [ 0.69042969],
       [ 0.37133789],
       [ 0.47753906],
       [ 0.22766113],
       [ 0.07049561],
       [ 0.27075195],
       [ 0.55029297],
       [-0.02171326],
       [-0.03387451],
       [-0.03942871],
       [-0.04327393],
       [ 0.65087891],
       [ 0.26416016],
       [ 0.00157166],
       [ 0.68652344],
       [ 0.51904297],
       [ 0.4387207 ],
       [ 0.21899414]])

Look out - some of our predictions are <0, or >1! This once again shows the value of remembering to actually *look* at your data. Let's fix those out-of-bounds predictions:

In [37]:
preds = np.clip(preds, 0, 1)

In [38]:
preds

array([[0.57910156],
       [0.71533203],
       [0.57470703],
       [0.33837891],
       [0.        ],
       [0.47314453],
       [0.48754883],
       [0.        ],
       [0.29736328],
       [1.        ],
       [0.2590332 ],
       [0.25341797],
       [0.80566406],
       [0.8515625 ],
       [0.74023438],
       [0.41430664],
       [0.29980469],
       [0.        ],
       [0.69042969],
       [0.37133789],
       [0.47753906],
       [0.22766113],
       [0.07049561],
       [0.27075195],
       [0.55029297],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.65087891],
       [0.26416016],
       [0.00157166],
       [0.68652344],
       [0.51904297],
       [0.4387207 ],
       [0.21899414]])

OK, now we're ready to create our submission file. If you save a CSV in your notebook, you will get the option to submit it later.

In [39]:
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)


1059
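
If you prefer to see exactly what ends up in the file, the same `id,score` layout can be produced with the standard library alone. The sketch below is one possible way to do it, not the method used above. The `write_submission` helper and the example IDs are hypothetical. It accepts scores either as bare numbers or as one-element rows like those in `preds`.

```python
import csv
import io

def write_submission(ids, scores, fileobj):
    """Write an `id,score` CSV with one plain float per row."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["id", "score"])
    for row_id, score in zip(ids, scores):
        # Accept either a bare number or a one-element sequence such as [0.58]
        value = score[0] if hasattr(score, "__len__") else score
        # Clamp to the valid score range, mirroring the clipping step above
        writer.writerow([row_id, min(1.0, max(0.0, float(value)))])

# Hypothetical IDs and scores, written to an in-memory buffer for inspection
buf = io.StringIO()
write_submission(["id_a", "id_b", "id_c"], [[0.58], [-0.03], [1.09]], buf)
csv_text = buf.getvalue()
print(csv_text)
```

Writing to a buffer first makes it easy to check the header and a few rows before saving a real `submission.csv`.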

Unfortunately this is a *code competition* and internet access is disabled. That means the `pip install datasets` command we used above won't work if you want to submit to Kaggle. To fix this, you'll need to download the pip installers to Kaggle first, as [described here](https://www.kaggle.com/c/severstal-steel-defect-detection/discussion/113195). Once you've done that, disable internet in your notebook, go to the Kaggle leaderboards page, and click the *Submission* button.

## The end

Next steps: [Iterate Like a Grandmaster](https://www.kaggle.com/code/jhoward/iterate-like-a-grandmaster/) notebook.