# Chapter 4 | Natural Language Processing, Transformers, Huggingface
> A New Library, New Model Architectures, And Natural Language Data
>Check out this notebook in [colab](https://colab.research.google.com/github/nglillywhite/blog/blob/main/posts/wotwot)

This week focuses on building models that interact with natural language, which is very different from images or structured tabular data. Jeremy starts off discussing some model architectures (some of which he pioneered with collaborators) like 'ULMFiT', 'Transformers', and Recurrent Neural Nets (RNNs), which are often seen in news headlines as being significant. This week will also use Hugging Face Transformers, which is a separate library from fastai but is seemingly the best library as of writing this for working with language. We're going to dive into tokenising and other natural-language-specific problems.

## Lecture Content
### US Patent Phrase Matching

We're working with the [US Patent Phrase Matching](https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching) Kaggle competition data for this lecture. I've just visited the page and downloaded the dataset into my local data path, but you can of course use the Kaggle API.

This competition is about comparing two short phrases and scoring how similar they are, given the patent class they were used in. A score of 1 means the phrases are identical in meaning and 0 means they are unrelated. The scores come from the set 0, 0.25, 0.5, 0.75, and 1, so the problem can be treated like classification rather than regression because the values are discrete instead of smooth.
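To make the "discrete scores as classes" idea concrete, here is a minimal sketch of mapping each score to a class index and back. The names `SCORE_LEVELS`, `score_to_class`, and `class_to_score` are hypothetical helpers for illustration, not part of the competition or any library, and this is only one way to frame the labels.

```python
# The five score values that appear in the competition data.
SCORE_LEVELS = [0.0, 0.25, 0.5, 0.75, 1.0]


def score_to_class(score):
    """Map a similarity score to a class index between 0 and 4."""
    return SCORE_LEVELS.index(score)


def class_to_score(label):
    """Map a class index back to its similarity score."""
    return SCORE_LEVELS[label]


# Scores from the first five rows of train.csv.
sample_scores = [0.5, 0.75, 0.25, 0.5, 0.0]
labels = [score_to_class(s) for s in sample_scores]
print(labels)  # → [2, 3, 1, 2, 0]
print([class_to_score(c) for c in labels])  # → [0.5, 0.75, 0.25, 0.5, 0.0]
```

Because the classes are ordered, treating the score as a plain number is also a reasonable choice; the sketch only shows that the label space is small and discrete.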

Jeremy starts by proposing that we can feed this into our model by representing each row as a single string such as "TEXT1: abatement; TEXT2: eliminating process", for which we then pick one of the distinct categories above. To my understanding, the two phrases correspond to the 'anchor' and 'target' columns in this dataset, which we'll look at below.

### EDA of Patent Data

In [1]:
from pathlib import Path
from fastai.vision.all import *

data_path = Path("../data/us-patent-phrase-to-matching/")

In [2]:
data_path.ls()

(#2) [Path('../data/us-patent-phrase-to-matching/test.csv'),Path('../data/us-patent-phrase-to-matching/train.csv')]

In [3]:
import pandas as pd

df = pd.read_csv(data_path / "train.csv")

df.head()

| | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.5 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.5 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.0 |


In [4]:
df.describe(include="object")

| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


From a quick look at a data sample and the description of the dataframe, there aren't many unique contexts (106) or anchors (733) relative to the 36,473 rows. The anchor 'component composite coating' turns up 152 times, which seems like a lot: roughly three times the average of about 50 rows per anchor (36,473 / 733).

Let's now build our first feature, which will be the 'input' feature. We'll structure it in the TEXT1: ...; TEXT2: ...; format noted earlier, with the anchor appended as a third field, ANC1.

In [5]:
df['input'] = "TEXT1: " + df.context + "; TEXT2: " + df.target + "; ANC1: " + df.anchor

df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement
1            TEXT1: A47; TEXT2: act of abating; ANC1: abatement
2           TEXT1: A47; TEXT2: active catalyst; ANC1: abatement
3       TEXT1: A47; TEXT2: eliminating process; ANC1: abatement
4             TEXT1: A47; TEXT2: forest region; ANC1: abatement
Name: input, dtype: object

Ok, this looks great, although I'm already a bit confused: the dataset describes df.context as the CPC classification subject and the anchor and target as the first and second phrases respectively, whereas Jeremy has put the context as the first text. Nonetheless, I'm sure the naming of each of these features is sort of irrelevant as long as it's consistent. If I called it 'text1' or 'feature1' or 'f1' it shouldn't matter much to the model; it's all matrices in the end.


### Tokenisation and Numericalisation
Let's now take on the first two new topics: tokenisation and numericalisation. Tokenisation splits our text into tokens (words or pieces of words), and numericalisation turns each token into a number that the model can look up and matrix multiply, and that we can convert back to text after the fact. We're going to use Hugging Face's datasets library and its "Dataset" class, which shares its name with classes in fastai and PyTorch.
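Before touching the real libraries, here is a toy sketch of the two steps in plain Python. It assumes a naive whitespace tokeniser and a vocabulary built from two phrases in the dataset; the function names are made up for illustration and this is not how the Hugging Face tokenisers work internally.

```python
def tokenise(text):
    """Toy tokeniser: lowercase the text and split on whitespace."""
    return text.lower().split()


def build_vocab(texts):
    """Assign each distinct token the next free integer ID."""
    vocab = {}
    for text in texts:
        for token in tokenise(text):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def numericalise(tokens, vocab):
    """Turn a list of tokens into a list of integer IDs."""
    return [vocab[token] for token in tokens]


def decode(ids, vocab):
    """Convert integer IDs back into tokens."""
    id_to_token = {i: token for token, i in vocab.items()}
    return [id_to_token[i] for i in ids]


vocab = build_vocab(["abatement of pollution", "act of abating"])
tokens = tokenise("act of pollution")
ids = numericalise(tokens, vocab)

print(tokens)              # → ['act', 'of', 'pollution']
print(ids)                 # → [3, 1, 2]
print(decode(ids, vocab))  # → ['act', 'of', 'pollution']
```

The IDs themselves are arbitrary; what matters is that the same token always maps to the same number.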

In [7]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)

ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

Ok not learning a huge amount from the output but it at least lets us know what our features are and how many rows we have.

In [8]:
doc(Dataset)

Doesn't tell me a whole lot about the class or its purpose, but let's move on for now.

An important thing to note is that particular tokenisers are used for particular models. If you want to use a pre-trained model, you have to use the same tokeniser (and numericalisation process) it was trained with, otherwise you aren't representing the language the same way as when the model was trained. Let's grab a model and get to work.
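A toy illustration of why the tokeniser has to match the model: the same tokens get different IDs under a different vocabulary, so the model would look up the wrong rows. Both vocabularies and the `learned_meaning` table below are invented for this sketch; a real model stores embedding vectors rather than strings.

```python
tokens = ["abatement", "of", "pollution"]

# Vocabulary the (imaginary) model was trained with.
vocab_pretrained = {"[UNK]": 0, "abatement": 1, "of": 2, "pollution": 3}
# A different tokeniser's vocabulary: same words, different IDs.
vocab_other = {"[UNK]": 0, "of": 1, "pollution": 2, "abatement": 3}

# Stand-in for the model's embedding table: what each ID "means" to the model.
learned_meaning = {0: "<unknown>", 1: "abatement", 2: "of", 3: "pollution"}


def encode(tokens, vocab):
    """Map tokens to IDs, falling back to the unknown-token ID."""
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]


right_ids = encode(tokens, vocab_pretrained)
wrong_ids = encode(tokens, vocab_other)

print([learned_meaning[i] for i in right_ids])  # → ['abatement', 'of', 'pollution']
print([learned_meaning[i] for i in wrong_ids])  # → ['pollution', 'abatement', 'of']
```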

In [9]:
model_name = "microsoft/deberta-v3-small"

Now we're pulling from the [Hugging Face model hub](https://huggingface.co/models), which hosts many models trained for particular tasks. Feel free to have a browse and search for models you might be interested in!

In [10]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained(model_name)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [11]:
tokeniser

PreTrainedTokenizerFast(name_or_path='microsoft/deberta-v3-small', vocab_size=128000, model_max_len=1000000000000000019884624838656, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '[CLS]', 'eos_token': '[SEP]', 'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

Ok, some interesting output here. It looks like tokenisers carry a bunch of settings: a vocabulary size of 128,000, which side to pad and truncate on, and special tokens such as [CLS], [SEP], and [PAD] for marking, separating, and padding sequences.
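As a rough sketch of what those special tokens are for, the snippet below wraps each sequence in [CLS] and [SEP] and pads shorter sequences on the right so a batch lines up. The token strings and the right-hand padding side come from the output above; the helper functions are hypothetical and simplify what the real tokeniser does.

```python
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"


def add_special_tokens(tokens):
    """Mark the start and end of a sequence."""
    return [CLS] + tokens + [SEP]


def pad_right(tokens, length):
    """Append padding tokens until the sequence reaches the given length."""
    return tokens + [PAD] * (length - len(tokens))


batch = [["▁abatement"], ["▁act", "▁of", "▁abating"]]
wrapped = [add_special_tokens(tokens) for tokens in batch]
max_len = max(len(tokens) for tokens in wrapped)
padded = [pad_right(tokens, max_len) for tokens in wrapped]

for row in padded:
    print(row)
# → ['[CLS]', '▁abatement', '[SEP]', '[PAD]', '[PAD]']
# → ['[CLS]', '▁act', '▁of', '▁abating', '[SEP]']
```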

In [13]:
tokeniser.tokenize("Jeremy is the GOAT")

['▁Jeremy', '▁is', '▁the', '▁GOAT']

Ok awesome, there are my tokens. Jeremy also notes that uncommon words will be split into pieces, and that the start of a word is marked with '▁', a character that looks like an underscore.

In [17]:
tokeniser.tokenize("I suffered a flabbergasting conniption when I saw the spelling of ornithorhynchus anatinus.")

['▁I',
 '▁suffered',
 '▁a',
 '▁flab',
 'berg',
 'as',
 'ting',
 '▁con',
 'ni',
 'ption',
 '▁when',
 '▁I',
 '▁saw',
 '▁the',
 '▁spelling',
 '▁of',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']
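To get a feel for how a rare word ends up in pieces like this, here is a toy greedy longest-match splitter over a tiny hand-written vocabulary. This is a swapped-in simplification: the real tokeniser learns its vocabulary from data and uses its own algorithm to choose the pieces, and both `toy_vocab` and the function names are assumptions made for this sketch.

```python
WORD_START = "▁"  # marks the first piece of each word


def split_word(word, vocab):
    """Greedily take the longest known piece from the front of the word."""
    pieces = []
    rest = WORD_START + word
    while rest:
        for end in range(len(rest), 0, -1):
            if rest[:end] in vocab:
                pieces.append(rest[:end])
                rest = rest[end:]
                break
        else:
            # No known piece matches: fall back to a single character.
            pieces.append(rest[0])
            rest = rest[1:]
    return pieces


def tokenise(text, vocab):
    """Split on whitespace, then break each word into known pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(split_word(word, vocab))
    return tokens


toy_vocab = {"▁saw", "▁the", "▁con", "ni", "ption"}
print(tokenise("saw the conniption", toy_vocab))
# → ['▁saw', '▁the', '▁con', 'ni', 'ption']
```

Common words survive as single tokens because they are in the vocabulary whole, while "conniption" is assembled from smaller pieces, with only the first piece carrying the word-start marker.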