# NLP with HuggingFace Transformers

For the Kaggle [US Patent Phrase to Phrase Matching](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/) competition, we are tasked with comparing two words or short phrases and scoring how similar they are, given the patent class they were used in. A score of 1 means the two inputs have identical meaning, and 0 means they have totally different meanings. For instance, "abatement" and "eliminating process" have a score of 0.5, meaning they're somewhat similar, but not identical.  
It turns out that this can be represented as a classification problem. How? By representing the question like this:

For the following text...: "TEXT1: abatement; TEXT2: eliminating process" ...choose a category of meaning similarity: "Different; Similar; Identical".

In this notebook we'll see how to solve the Patent Phrase Matching problem by treating it as a classification task, by representing it in a very similar way to that shown above.

## Getting the data

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# If you haven't installed kaggle
!pip install kaggle

We'll create a Kaggle API token and use it to download the dataset:

In [None]:
# The token must be valid JSON. Paste your own key here and never publish a real one.
creds = '{"username":"maureenwamuyumugo","key":"<your-kaggle-api-key>"}'
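
Kaggle reads this token as JSON, so a missing quote is enough to make it unusable. One way to avoid quoting mistakes is to build the string with `json.dumps`. This is a minimal sketch; `example_creds` and both values are placeholders, not real credentials.

```python
import json

# Placeholder values: substitute your own Kaggle username and API key.
example_creds = json.dumps({"username": "your-username", "key": "your-api-key"})

# Round-trip check: the string parses back into the two expected fields.
parsed = json.loads(example_creds)
print(sorted(parsed))
```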

In [None]:
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()  # where the Kaggle API looks for its token
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)

Now you can download datasets from Kaggle.

In [None]:
path = Path('us-patent-phrase-to-phrase-matching')

And use the Kaggle API to download the dataset to that path, and extract it:
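
The download cell itself is not shown here. A minimal sketch of what it could look like follows, assuming the installed `kaggle` package exposes `kaggle.api.competition_download_cli` and saves the archive as `<competition>.zip` in the current directory. The helper name `download_competition` is hypothetical.

```python
from pathlib import Path
import zipfile

def download_competition(path: Path) -> None:
    """Download and unzip a Kaggle competition's data unless `path` already exists."""
    if path.exists():
        return  # data is already extracted, nothing to do
    # Imported lazily: the kaggle package looks for credentials as soon as it is imported.
    import kaggle
    kaggle.api.competition_download_cli(str(path))    # assumed to save '<path>.zip' in the working directory
    zipfile.ZipFile(f'{path}.zip').extractall(path)   # unpack into the `path` folder

# Usage (needs network access, valid credentials, and acceptance of the competition rules):
# download_competition(Path('us-patent-phrase-to-phrase-matching'))
```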

Now we can check what's in path:

In [None]:
!ls {path}

These are CSV files, which we can read with pandas:

In [None]:
import pandas as pd

Let's read the training set into a DataFrame:

In [None]:
df = pd.read_csv(path/'train.csv')

This creates a DataFrame, which is a table of columns, a bit like a database table.  
To view the first and last 5 rows, and row count of a DataFrame, just type its name:

In [None]:
df

It's important to carefully read the dataset description to understand how each of these columns is used.  
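The `describe()` method is also useful for understanding a DataFrame. With `include='object'` it summarises the text columns, showing the count, number of unique values, and most frequent value of each: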

In [None]:
df.describe(include='object')

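The model takes a single string as input, so we concatenate the context, target and anchor columns into one, giving each part a short prefix: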

In [None]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

To get the first few rows, use head():

In [None]:
df.input.head()
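
To make the format of one concatenated string concrete, here is the same concatenation written in plain Python for a single row. The row values are assumed for illustration and are not read from `train.csv`; `make_input` is a hypothetical helper.

```python
# Illustrative row; these values are assumed for the example.
row = {"anchor": "abatement", "target": "abatement of pollution", "context": "A47"}

def make_input(row):
    """Mirror the column concatenation used for df['input'], for one row."""
    return f"TEXT1: {row['context']}; TEXT2: {row['target']}; ANC1: {row['anchor']}"

print(make_input(row))
```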

## Tokenization

Transformers works with a `Dataset` object from the HuggingFace `datasets` library, so we'll convert our pandas DataFrame into one:

In [None]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)

Here's how it's displayed in a notebook:

In [None]:
ds

But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

- Tokenization: Split each text up into words (or actually, as we'll see, into tokens)
- Numericalization: Convert each word (or token) into a number.  
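
The two steps can be sketched with a toy example. This version simply splits on whitespace and numbers the words in order of first appearance, which is an assumption made for illustration; the real tokenizer used later works on subword tokens with a fixed, pretrained vocabulary. `build_vocab` and `numericalize` are hypothetical helpers.

```python
def build_vocab(texts):
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():   # tokenization: split into words
            vocab.setdefault(word, len(vocab))
    return vocab

def numericalize(text, vocab):
    """Numericalization: replace each word with its ID from the vocabulary."""
    return [vocab[word] for word in text.lower().split()]

texts = ["abatement of pollution", "act of abating"]
vocab = build_vocab(texts)
ids = numericalize("abatement of abating", vocab)
print(vocab)
print(ids)
```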

Before tokenizing, you have to decide which model to use, because each model comes with its own tokenizer. HuggingFace hosts models that work well for a lot of tasks most of the time, such as `deberta-v3`. We'll start with the small version because it's faster to train, so we can do more iterations.

In [None]:
model_nm = 'microsoft/deberta-v3-small'

To tokenize the text the same way it was tokenized when the model was built, we use `AutoTokenizer`.  
`AutoTokenizer.from_pretrained` looks up the model name and creates the tokenizer appropriate for that model:

In [None]:
from transformers import AutoModelForSequenceClassification,AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

Now we can take the tokenizer and pass a string to it:

In [None]:
tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")

In [None]:
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")

This splits the string into tokens. Every token has to be in the vocabulary, the list of unique tokens that was created when this particular pretrained model was first trained, so uncommon words are split into several pieces. The `▁` character (which looks like an underscore) marks the start of a word.
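
As a rough illustration of how a fixed vocabulary forces rare words to be split, here is a toy greedy longest-match splitter. It is not the algorithm the DeBERTa tokenizer actually uses, and `toy_vocab` and `toy_subword_tokenize` are made up for the example.

```python
def toy_subword_tokenize(text, vocab):
    """Greedy longest-match split of each word into pieces found in `vocab`."""
    tokens = []
    for word in text.split():
        piece = "▁" + word          # mark the start of the word
        start = 0
        while start < len(piece):
            end = len(piece)
            # Shrink the candidate until it is a known vocabulary entry.
            while end > start and piece[start:end] not in vocab:
                end -= 1
            if end == start:        # nothing matched: emit an unknown token, skip one character
                tokens.append("[UNK]")
                start += 1
            else:
                tokens.append(piece[start:end])
                start = end
    return tokens

# Hypothetical vocabulary: common words are whole tokens, the rare word is not.
toy_vocab = {"▁is", "▁an", "▁or", "ni", "tho", "rhynch", "us"}
print(toy_subword_tokenize("is an ornithorhynchus", toy_vocab))
```

Only the first piece of each word carries the `▁` marker, which is how the original word boundaries can be recovered from the token list.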

Here's a simple function that takes a document, grabs its input and tokenizes it:

In [None]:
def tok_func(x): return tokz(x["input"])

To run this quickly in parallel on every row in our dataset, use map:

In [None]:
tok_ds = ds.map(tok_func, batched=True)
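
With `batched=True`, the mapped function receives a whole batch at once, as a dict whose values are lists, instead of one row at a time. The sketch below imitates that behaviour on a plain list of dicts. `toy_map` and `count_words` are hypothetical stand-ins, not part of the `datasets` library.

```python
def toy_map(rows, func, batched=False, batch_size=2):
    """Toy stand-in for Dataset.map over a list of row dicts."""
    if not batched:
        # One call per row: func receives a single row dict.
        return [{**row, **func(row)} for row in rows]
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # One call per batch: func receives a dict of column name -> list of values.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_cols = func(batch)
        for i, row in enumerate(chunk):
            out.append({**row, **{key: vals[i] for key, vals in new_cols.items()}})
    return out

def count_words(x):
    # Accepts either one string or a list of strings, as a tokenizer call does.
    text = x["input"]
    if isinstance(text, str):
        return {"n_words": len(text.split())}
    return {"n_words": [len(t.split()) for t in text]}

rows = [{"input": "TEXT1: A47; TEXT2: abatement of pollution"},
        {"input": "TEXT1: A47; TEXT2: act of abating"},
        {"input": "TEXT1: A47; TEXT2: active catalyst"}]

print(toy_map(rows, count_words, batched=True))
```

Both modes produce the same rows; the batched mode just makes fewer, larger calls.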

`batched=True` passes many rows to the tokenizer library at a time, which is much faster than tokenizing them one by one.  
This adds a new item to our dataset called `input_ids`. For instance, here is the input and IDs for the first row of our data:

In [None]:
row = tok_ds[0]
row['input'], row['input_ids']

So now we have split the strings into tokens and turned the tokens into numbers. The second step is called numericalization.  

We can look them up like this, for instance to find the token for the word "of":

In [None]:
tokz.vocab['▁of']

It's 265, and the same ID appears in the `input_ids` printed above.  

HuggingFace Transformers expects the target to be in a column called `labels`, so we rename the `score` column of our tokenized dataset to `labels`:

In [None]:
tok_ds = tok_ds.rename_columns({'score':'labels'})

In ML, it's important to have separate training, validation and test datasets. The validation and test sets are all about identifying and controlling overfitting.

### Test and Validation Set
You may have noticed that our directory contained another file:

In [None]:
eval_df = pd.read_csv(path/'test.csv')
eval_df.describe()

This is the test set. 

Transformers uses a DatasetDict for holding your training and validation sets. To create one that contains 25% of our data for the validation set, and 75% for the training set, use train_test_split:

In [None]:
dds = tok_ds.train_test_split(0.25, seed=42)
dds
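
Conceptually, the split shuffles the row indices with a fixed seed and sets a quarter of them aside. The sketch below shows that idea in plain Python. `split_indices` is a hypothetical helper, and it will not reproduce the exact rows that `train_test_split` selects, because the library uses its own shuffling.

```python
import random

def split_indices(n, test_size=0.25, seed=42):
    """Return (train_indices, held_out_indices) from a seeded shuffle of range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)   # the same seed always gives the same order
    n_held_out = round(n * test_size)
    return idx[n_held_out:], idx[:n_held_out]

train_idx, valid_idx = split_indices(100)
print(len(train_idx), len(valid_idx))
```

Fixing the seed keeps the validation set identical between runs, so metrics from different experiments stay comparable.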

The test set is yet another dataset that's held out from training. But it's held out from reporting metrics too! The accuracy of your model on the test set is only ever checked after you've completed your entire training process, including trying different models, training methods, data processing, etc.  
We'll use eval as our name for the test set, to avoid confusion with the test dataset that was created above:

In [None]:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

### Metrics and Correlation

When we're training a model, there will be one or more metrics that we're interested in maximising or minimising. These are the measurements that should, hopefully, represent how well our model will work for us.  
In Kaggle, however, it's very straightforward to know what metric to use: Kaggle will tell you! According to this competition's [evaluation page](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/overview/evaluation), "submissions are evaluated on the `Pearson correlation coefficient` between the predicted and actual similarity scores." This coefficient is usually abbreviated using the single letter `r`. It is the most widely used measure of the degree of relationship between two variables.  
r can vary between -1, which means perfect inverse correlation (i.e. predicted exactly the wrong answer), and +1, which means perfect positive correlation (i.e. predicted exactly the right answer). The mathematical formula for it is much less important than getting a good intuition for what the different values look like.
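
To build some of that intuition, here is a small pure-Python sketch that computes r from its textbook definition (the sum of products of deviations from the mean, divided by the product of the deviation norms) on made-up scores. `pearson_r` and the example lists are illustrative only.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

actual = [0.0, 0.25, 0.5, 0.75, 1.0]
print(pearson_r(actual, [0.1, 0.3, 0.5, 0.7, 0.9]))   # rises in step with actual: r close to +1
print(pearson_r(actual, [0.9, 0.7, 0.5, 0.3, 0.1]))   # falls as actual rises: r close to -1
print(pearson_r(actual, [0.5, 0.1, 0.9, 0.1, 0.5]))   # no linear trend: r close to 0
```

Note that the first set of predictions never matches the actual scores exactly except at 0.5, yet r is still essentially 1: the coefficient measures how linear the relationship is, not whether the values are equal.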

Transformers expects metrics to be returned as a `dict`, since that way the trainer knows what label to use. Let's create a function that computes Pearson's r with NumPy's `corrcoef` and returns it in that form:

In [None]:
import numpy as np
def corr_d(eval_pred): return {'pearson': np.corrcoef(*eval_pred)[0][1]}
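
One thing to watch for: depending on the model and library version, the predictions may arrive with shape `(n, 1)` while the labels have shape `(n,)`, and `np.corrcoef` does not accept that combination. A defensive variant flattens both arrays first. This is a sketch; `corr_d_flat` is a hypothetical name and the arrays are made up.

```python
import numpy as np

def corr_d_flat(eval_pred):
    """Pearson r as a dict, flattening predictions and labels before comparing them."""
    preds, labels = eval_pred
    preds, labels = np.asarray(preds).ravel(), np.asarray(labels).ravel()
    return {'pearson': float(np.corrcoef(preds, labels)[0][1])}

labels = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
preds = np.array([[0.1], [0.2], [0.6], [0.7], [0.9]])   # shape (5, 1), one column per output
print(corr_d_flat((preds, labels)))
```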

## Training our Model

In fastai, training is driven by a `Learner`; the HuggingFace Transformers equivalent is called `Trainer`. Let's import it:

In [None]:
from transformers import TrainingArguments, Trainer

We pick a batch size that fits our GPU, and a small number of epochs so we can run experiments quickly:

In [None]:
bs = 128
epochs = 4

The most important hyperparameter is the learning rate. fastai provides a learning rate finder to help you figure this out, but Transformers doesn't, so you'll just have to use trial and error. The idea is to find the largest value that doesn't cause training to fail.

In [None]:
lr = 8e-5

Transformers uses the `TrainingArguments` class to set up arguments:

In [None]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')
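
With `warmup_ratio=0.1` and `lr_scheduler_type='cosine'`, the learning rate is not constant: it climbs linearly from zero during the first 10% of steps, then follows half a cosine wave back down towards zero. The sketch below approximates that shape in plain Python. `lr_at` is a hypothetical helper, and the library's own rounding of the warmup step count may differ slightly.

```python
import math

def lr_at(step, total_steps, base_lr=8e-5, warmup_ratio=0.1):
    """Approximate learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)             # linear ramp from 0 up to base_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))    # cosine decay from base_lr to 0

total = 1000
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, total):.2e}")
```

The value passed as `learning_rate` is therefore the peak of the schedule, reached only briefly at the end of warmup.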

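We can now create our model and a `Trainer`, which is a class that combines the data and model together (just like `Learner` in fastai). Passing `num_labels=1` gives the model a single numeric output, which Transformers treats as a regression target: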

In [None]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

Let's train our model!

In [None]:
trainer.train();

We get a Pearson value above 0.8 which is pretty good.  
Let's get some predictions on the test set:

In [None]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

Look out - some of our predictions are <0, or >1! This once again shows the value of remembering to actually look at your data. Let's fix those out-of-bounds predictions:

In [None]:
import numpy as np
preds = np.clip(preds, 0, 1)  # force every prediction into the valid score range [0, 1]