# NLP with HuggingFace Transformers

For the Kaggle [US Patent Phrase to Phrase Matching](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/) competition, we are tasked with comparing two words or short phrases and scoring how similar they are, given the patent class they were used in. A score of 1 means the two inputs have identical meaning, and 0 means they have totally different meanings. For instance, "abatement" and "eliminating process" have a score of 0.5, meaning they're somewhat similar, but not identical.  
It turns out that this can be represented as a classification problem. How? By representing the question like this:

For the following text...: "TEXT1: abatement; TEXT2: eliminating process" ...choose a category of meaning similarity: "Different; Similar; Identical".

In this notebook we'll see how to solve the Patent Phrase Matching problem by treating it as a classification task, by representing it in a very similar way to that shown above.
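As a rough illustration of that framing, here is a minimal sketch that builds the prompt for one pair of phrases and buckets a similarity score into one of the three categories. The function names and the 0.25/0.75 cut-offs are made up for this example; they are not part of the competition or of the approach used later in the notebook.

```python
def make_prompt(text1, text2):
    # Join the two phrases into a single labelled string
    return f"TEXT1: {text1}; TEXT2: {text2}"

def score_to_category(score, low=0.25, high=0.75):
    # Hypothetical thresholds, chosen only to illustrate the idea
    if score < low:
        return "Different"
    if score < high:
        return "Similar"
    return "Identical"

prompt = make_prompt("abatement", "eliminating process")
category = score_to_category(0.5)
print(prompt, "->", category)
```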

## Getting the data

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
# If you haven't installed kaggle
!pip install kaggle

We'll create a Kaggle API token and use it to download the dataset:

In [None]:
# Paste the contents of your own kaggle.json here; it must be valid JSON.
# Never commit or share a real API key.
creds = '{"username":"your-username","key":"your-api-key"}'

In [None]:
from pathlib import Path

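# The Kaggle API looks for credentials in ~/.kaggle/kaggle.json
cred_path = Path('~/.kaggle/kaggle.json').expanduser()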
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
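Hand-typing the credentials string makes it easy to end up with invalid JSON (a missing quote, or unquoted keys). One way to avoid that, sketched below, is to build the string with `json.dumps`. The helper name `write_kaggle_creds` is hypothetical, and the demo writes to a throwaway temporary directory rather than your real home directory.

```python
import json
import tempfile
from pathlib import Path

def write_kaggle_creds(username, key, cred_path):
    """Write a kaggle.json file unless one already exists; return True if written."""
    cred_path = Path(cred_path)
    if cred_path.exists():
        return False
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    # json.dumps guarantees correctly quoted keys and values
    cred_path.write_text(json.dumps({"username": username, "key": key}))
    cred_path.chmod(0o600)  # readable and writable by the owner only
    return True

# Demo against a temporary location with placeholder values
demo_path = Path(tempfile.mkdtemp()) / ".kaggle" / "kaggle.json"
created = write_kaggle_creds("your-username", "your-api-key", demo_path)
loaded = json.loads(demo_path.read_text())
```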

Now you can download datasets from Kaggle. First, let's set a path for the competition data:

In [None]:
path = Path('us-patent-phrase-to-phrase-matching')

And use the Kaggle API to download the dataset to that path, and extract it:
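A minimal sketch of that step is shown below, assuming the `kaggle` package is installed, your credentials are in place, and you have accepted the competition rules on the Kaggle website. The helper names `extract_zip` and `download_competition` are made up for this example. The `kaggle` import is kept inside the function because the package tries to authenticate as soon as it is imported.

```python
import zipfile
from pathlib import Path

def extract_zip(zip_path, dest):
    """Extract zip_path into dest and return the sorted names found there."""
    dest = Path(dest)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return sorted(p.name for p in dest.iterdir())

def download_competition(path):
    """Download and unpack a competition's data, skipping if it is already there."""
    path = Path(path)
    if path.exists():
        return
    import kaggle  # imported lazily: the package authenticates on import
    # Saves '<competition>.zip' into the current working directory
    kaggle.api.competition_download_cli(str(path))
    extract_zip(f'{path}.zip', path)
```

Calling `download_competition(path)` with the `path` defined above would then fetch and unpack the data.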

Now we can check what's in path:

In [None]:
!ls {path}

These are CSV files, and we can use pandas to read them:

In [None]:
import pandas as pd

Let's read the training set into a DataFrame:

In [None]:
df = pd.read_csv(path/'train.csv')

This creates a DataFrame, which is a table of named columns, a bit like a database table.  
To view the first and last five rows, along with the row count, just type the DataFrame's name:

In [None]:
df

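It's important to carefully read the dataset description to understand how each of these columns is used.  
The `describe()` method is also useful for understanding a DataFrame. Passing `include='object'` summarizes the string columns: the count, the number of unique values, the most common value, and its frequency.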

In [None]:
df.describe(include='object')
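To see what that summary contains without the full dataset, here is a small sketch on a toy DataFrame. The rows are invented for illustration and are not taken from the competition data; the columns are created with `dtype=object` explicitly so that `include='object'` selects them regardless of pandas version.

```python
import pandas as pd

# Toy data, made up for this example
toy = pd.DataFrame({
    'anchor': ['abatement', 'abatement', 'acid'],
    'target': ['eliminating process', 'forest region', 'acid bath'],
}, dtype=object)

# For string columns, describe() reports count, unique, top, and freq
summary = toy.describe(include='object')
print(summary)
```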

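The model takes a single string per row as input, so we concatenate the context, target, and anchor columns, each with a labelled prefix: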

In [None]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

To get the first few rows, use head():

In [None]:
df.input.head()
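For reference, this is what one concatenated string looks like. The single-row DataFrame below uses illustrative values (the context code in particular is just a stand-in), so treat it as a sketch of the format rather than a row from the real data.

```python
import pandas as pd

# One made-up row with the same column names as the training set
toy = pd.DataFrame({
    'anchor': ['abatement'],
    'target': ['eliminating process'],
    'context': ['A47'],
})

# Same concatenation as above: string columns are joined element-wise
toy['input'] = 'TEXT1: ' + toy.context + '; TEXT2: ' + toy.target + '; ANC1: ' + toy.anchor
first_input = toy.input.iloc[0]
print(first_input)
```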

## Tokenization

Transformers works with HuggingFace `Dataset` objects, so we'll convert our pandas DataFrame into one:

In [None]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)

Here's how it's displayed in a notebook:

In [None]:
ds

But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

- Tokenization: Split each text up into words (or actually, as we'll see, into tokens)
- Numericalization: Convert each word (or token) into a number.  
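The two steps can be sketched with a deliberately naive tokenizer that splits on whitespace. This is only a toy to show the idea; the real tokenizer used below splits text into subword tokens and has a fixed, pretrained vocabulary. All function names here are made up for the example.

```python
def toy_tokenize(text):
    # Tokenization: lowercase the text and split on whitespace
    return text.lower().split()

def build_vocab(texts):
    # Assign each new token the next unused integer id
    vocab = {}
    for text in texts:
        for tok in toy_tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab):
    # Numericalization: look up each token's integer id
    return [vocab[tok] for tok in toy_tokenize(text)]

texts = ["the cat sat on the mat"]
vocab = build_vocab(texts)
ids = numericalize(texts[0], vocab)
print(ids)  # [0, 1, 2, 3, 0, 4]
```

Note that the repeated word "the" maps to the same id both times, which is what lets a model learn something about each token.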

Before tokenizing, you have to decide which model to use, because each model comes with its own tokenizer. HuggingFace hosts models that work well for many tasks most of the time, such as `deberta-v3`. We'll start with the small version because it's faster to train, which lets us do more iterations.

In [None]:
model_nm = 'microsoft/deberta-v3-small'

To tokenize text the same way the model was trained to expect, we use `AutoTokenizer`.  
`AutoTokenizer` looks up the given model name and creates the tokenizer appropriate for that model:

In [None]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

Now we can take the tokenizer and pass a string to it:

In [None]:
tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")