# Getting Started with NLP for Absolute Beginners

This is the notebook I created while following the Kaggle-hosted course here: https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners

It is best to just follow those instructions, but note that you need to prepare by:
- Signing up to Kaggle
- Creating an API token and downloading it to ~/.kaggle/kaggle.json (the path the code below checks)
- Accepting the competition rules here: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching

## DOWNLOAD DATA

In [1]:
import os

# This will be an empty string (falsy) in my notebook because I am not running on Kaggle.
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')


In [2]:
# Paste the contents of kaggle.json here if the file does not already exist on disk.
creds = ''

In [3]:
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)

In [4]:
path = Path('us-patent-phrase-to-phrase-matching')
print(path)

us-patent-phrase-to-phrase-matching


To run this notebook, you need to accept the rules for the U.S. Patent Phrase to Phrase Matching competition here: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching

In [5]:
if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

In [6]:
# List the downloaded files (`dir /b` is Windows-specific; use `!ls {path}` on Linux/macOS)
!dir {path} /b

sample_submission.csv
test.csv
train.csv


## READ DATA

In [7]:
import pandas as pd

In [8]:
df = pd.read_csv(path/'train.csv')

In [9]:
df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


In [10]:
df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,37d61fd2272659b1,component composite coating,composition,H01
freq,1,152,24,2186


In [11]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor

In [12]:
df.head()

Unnamed: 0,id,anchor,target,context,score,input
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.5,TEXT1: A47; TEXT2: abatement of pollution; ANC...
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75,TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2,36d72442aefd8232,abatement,active catalyst,A47,0.25,TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.5,TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.0,TEXT1: A47; TEXT2: forest region; ANC1: abatement
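To make the concatenation in cell 11 concrete, here is a minimal sketch of the same string-building for a single row. `build_input` is a hypothetical helper I made up for illustration; the notebook itself does this on whole pandas columns at once.

```python
def build_input(context, target, anchor):
    """Mirror the column concatenation used for df['input'], for one row."""
    return 'TEXT1: ' + context + '; TEXT2: ' + target + '; ANC1: ' + anchor

# Values taken from row 0 of train.csv as shown above.
example = build_input('A47', 'abatement of pollution', 'abatement')
print(example)
```

Note that the labels are just arbitrary markers: `TEXT1` carries the context code, `TEXT2` the target phrase and `ANC1` the anchor phrase.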


## CREATE DATASET

Transformers uses a Dataset object for storing a... well a dataset, of course! We can create one like so:

In [13]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)



In [14]:
print(ds)

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})


## TOKENIZE AND NUMERICALIZE

But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

- Tokenization: Split each text up into words (or actually, as we'll see, into tokens)
- Numericalization: Convert each word (or token) into a number.
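The two steps can be sketched with a toy example. This is only an illustration under simplifying assumptions: it splits on whitespace and builds its own tiny vocabulary, whereas a real model tokenizer uses a fixed, pretrained vocabulary of sub-word tokens. The function names are mine, not part of any library.

```python
def tokenize(text):
    """Toy tokenization: lowercase and split on whitespace."""
    return text.lower().split()

def build_vocab(texts):
    """Assign each distinct token the next free integer id."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab):
    """Toy numericalization: look each token up in the vocabulary."""
    return [vocab[tok] for tok in tokenize(text)]

corpus = ["abatement of pollution", "act of abating"]
vocab = build_vocab(corpus)
ids = numericalize("act of pollution", vocab)
print(vocab)
print(ids)
```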

The details about how this is done actually depend on the particular model we use. So first we'll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is to use this (replace "small" with "large" for a slower but more accurate model, once you've finished exploring):

In [15]:
model_nm = 'microsoft/deberta-v3-small'

In [16]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)



In [17]:
tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")

['▁G',
 "'",
 'day',
 '▁folks',
 ',',
 '▁I',
 "'",
 'm',
 '▁Jeremy',
 '▁from',
 '▁fast',
 '.',
 'ai',
 '!']

In [18]:
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")

['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']
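In the two outputs above, the `▁` character marks a token that starts a new word, and tokens without it continue the previous word. That is how an uncommon word like "ornithorhynchus" ends up as several pieces. A minimal sketch of reversing the split, assuming that convention (`join_tokens` is a hypothetical helper, not a tokenizer method):

```python
def join_tokens(tokens):
    """Glue sub-word tokens back together, turning each '▁' into a space."""
    return ''.join(tokens).replace('▁', ' ').strip()

tokens = ['▁A', '▁platypus', '▁is', '▁an', '▁or', 'ni', 'tho', 'rhynch',
          'us', '▁an', 'at', 'inus', '.']

text = join_tokens(tokens)
n_words = sum(tok.startswith('▁') for tok in tokens)  # tokens that begin a word
print(text)
print(n_words)
```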

In [19]:
tokz('hello')

{'input_ids': [1, 12018, 2], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}
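Judging from the outputs in this notebook, every encoded sequence starts with id 1 and ends with id 2, so I assume these are special start and end markers that the tokenizer adds around the real tokens. The sketch below rebuilds the dictionary above under that assumption. `wrap_ids` is a hypothetical helper, and the all-zero `token_type_ids` and all-one `attention_mask` reflect a single, unpadded text.

```python
START_ID = 1  # assumption: id of the start-of-sequence special token
END_ID = 2    # assumption: id of the end-of-sequence special token

def wrap_ids(token_ids):
    """Add the special ids and build the two companion lists."""
    input_ids = [START_ID] + list(token_ids) + [END_ID]
    return {
        'input_ids': input_ids,
        'token_type_ids': [0] * len(input_ids),  # one segment only
        'attention_mask': [1] * len(input_ids),  # no padding to ignore
    }

encoded = wrap_ids([12018])  # 12018 is the id shown for 'hello' above
print(encoded)
```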

In [20]:
def tok_func(x): return tokz(x["input"])

In [21]:
tok_ds = ds.map(tok_func, batched=True)

Map: 100%|██████████| 36473/36473 [00:01<00:00, 24034.44 examples/s]


In [22]:
row = tok_ds[0]
row['input'], row['input_ids']

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

In [23]:
tokz.vocab['▁of']

265

Finally, we need to prepare our labels. Transformers always assumes that your labels are in a column named labels, but in our dataset that column is currently called score. Therefore, we need to rename it:

In [24]:
tok_ds = tok_ds.rename_columns({'score':'labels'})

In [25]:
tok_ds[0]

{'id': '37d61fd2272659b1',
 'anchor': 'abatement',
 'target': 'abatement of pollution',
 'context': 'A47',
 'labels': 0.5,
 'input': 'TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 'input_ids': [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [26]:
eval_df = pd.read_csv(path/'test.csv')
eval_df.describe()

Unnamed: 0,id,anchor,target,context
count,36,36,36,36
unique,36,34,36,29
top,4112d61851461f60,el display,inorganic photoconductor drum,G02
freq,1,2,1,3


In [27]:
dds = tok_ds.train_test_split(0.25, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})
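The split sizes above are consistent with rounding the test share up: 25% of 36473 rows is 9118.25, which becomes 9119 test rows and leaves 27354 for training. Here is a rough sketch of a seeded shuffle-and-split that reproduces those sizes. It is not the shuffling algorithm the datasets library uses, so the rows landing on each side would differ; `split_indices` is a hypothetical helper.

```python
import math
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices reproducibly, then carve off the test share."""
    n_test = math.ceil(test_size * n_rows)   # round the test share up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # same seed gives the same order
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(36473, 0.25, seed=42)
print(len(train_idx), len(test_idx))  # 27354 9119
```

Fixing the seed means the same rows end up in the validation set on every run, so scores from different runs stay comparable.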

In [28]:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

Map: 100%|██████████| 36/36 [00:00<00:00, 5956.88 examples/s]


In [29]:
eval_df['input'][0]

'TEXT1: G02; TEXT2: inorganic photoconductor drum; ANC1: opc drum'

In [30]:
from transformers import TrainingArguments, Trainer

In [31]:
# Batch size and epochs (number of passes through the training set)
bs = 64
epochs = 4

In [32]:
# Learning Rate
lr = 8e-5

In [33]:
import numpy as np
def corr(x,y): return np.corrcoef(x,y)[0][1]
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
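For reference, this is roughly what the Pearson correlation coefficient computes, written out in plain Python for two flat lists of numbers. It is a sketch for intuition only (`pearson` is my own name); the notebook relies on `np.corrcoef`.

```python
import math

def pearson(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

scores = [0.0, 0.25, 0.5, 0.75, 1.0]
r_perfect = pearson(scores, [2 * s for s in scores])   # same ordering, different scale
r_inverse = pearson(scores, [1 - s for s in scores])   # exactly reversed
print(r_perfect, r_inverse)
```

The value ranges from -1 to 1, and because it ignores scale, predictions only need to rise and fall in step with the true scores to get a high value.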

In [34]:
args = TrainingArguments('outputs', 
                         learning_rate=lr, 
                         warmup_ratio=0.1, 
                         lr_scheduler_type='cosine', 
                         fp16=True, 
                         evaluation_strategy="epoch", 
                         per_device_train_batch_size=bs, 
                         per_device_eval_batch_size=bs*2, 
                         num_train_epochs=epochs, 
                         weight_decay=0.01, 
                         report_to= 'none')
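With `warmup_ratio=0.1` and `lr_scheduler_type='cosine'`, the learning rate ramps up linearly over the first 10% of steps and then decays along half a cosine wave. The sketch below is my reconstruction of that schedule, assuming a single device and no gradient accumulation. The row count comes from the training split shown earlier, and the step count matches the `global_step=1712` reported by `trainer.train()` further down. When evaluated one step before each logging step (499, 999, 1499), it reproduces the `learning_rate` values in the training log; that off-by-one is my observation from this run rather than a documented guarantee.

```python
import math

BASE_LR = 8e-5
WARMUP_RATIO = 0.1

train_rows, batch_size, epochs = 27354, 64, 4
steps_per_epoch = math.ceil(train_rows / batch_size)  # 428, last partial batch counts
total_steps = steps_per_epoch * epochs                # 1712
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)  # 172

def lr_at(step):
    """Linear warmup to BASE_LR, then cosine decay towards zero."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, warmup_steps, 499, 999, 1499, total_steps):
    print(step, lr_at(step))
```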

In [35]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'], tokenizer=tokz, compute_metrics=corr_d)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [36]:
trainer.train()

                                                  
{'eval_loss': 0.027642928063869476, 'eval_pearson': 0.7965393573348385, 'eval_runtime': 1.889, 'eval_samples_per_second': 4827.537, 'eval_steps_per_second': 38.116, 'epoch': 1.0}
Checkpoint destination directory outputs\checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.
{'loss': 0.0458, 'learning_rate': 7.142530263651553e-05, 'epoch': 1.17}
{'eval_loss': 0.02324846014380455, 'eval_pearson': 0.8155750283657118, 'eval_runtime': 1.8803, 'eval_samples_per_second': 4849.827, 'eval_steps_per_second': 38.292, 'epoch': 2.0}
Checkpoint destination directory outputs\checkpoint-1000 already exists and is non-empty. Saving will proceed but saved results may be invalid.
{'loss': 0.0193, 'learning_rate': 3.535928522825979e-05, 'epoch': 2.34}
{'eval_loss': 0.02159656025469303, 'eval_pearson': 0.8352198834247654, 'eval_runtime': 1.9335, 'eval_samples_per_second': 4716.349, 'eval_steps_per_second': 37.238, 'epoch': 3.0}
Checkpoint destination directory outputs\checkpoint-1500 already exists and is non-empty. Saving will proceed but saved results may be invalid.
{'loss': 0.0124, 'learning_rate': 3.717094297490973e-06, 'epoch': 3.5}
{'eval_loss': 0.02254074439406395, 'eval_pearson': 0.8357027108120587, 'eval_runtime': 1.9205, 'eval_samples_per_second': 4748.199, 'eval_steps_per_second': 37.49, 'epoch': 4.0}
{'train_runtime': 151.0534, 'train_samples_per_second': 724.353, 'train_steps_per_second': 11.334, 'train_loss': 0.0239382986431924, 'epoch': 4.0}

TrainOutput(global_step=1712, training_loss=0.0239382986431924, metrics={'train_runtime': 151.0534, 'train_samples_per_second': 724.353, 'train_steps_per_second': 11.334, 'train_loss': 0.0239382986431924, 'epoch': 4.0})

In [44]:
eval_ds
eval_ds[1]

{'id': '09e418c93a776564',
 'anchor': 'adjust gas flow',
 'target': 'altering gas flow',
 'context': 'F23',
 'input': 'TEXT1: F23; TEXT2: altering gas flow; ANC1: adjust gas flow',
 'input_ids': [1,
  54453,
  435,
  294,
  1107,
  3304,
  346,
  54453,
  445,
  294,
  18829,
  1698,
  2155,
  346,
  23702,
  435,
  294,
  5024,
  1698,
  2155,
  2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [37]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

100%|██████████| 1/1 [00:00<?, ?it/s]


array([[ 5.28808594e-01],
       [ 6.26464844e-01],
       [ 5.56640625e-01],
       [ 3.44970703e-01],
       [-1.87835693e-02],
       [ 5.77636719e-01],
       [ 5.38085938e-01],
       [-2.36663818e-02],
       [ 2.93212891e-01],
       [ 1.08691406e+00],
       [ 2.51953125e-01],
       [ 2.38769531e-01],
       [ 8.06152344e-01],
       [ 9.55078125e-01],
       [ 7.64648438e-01],
       [ 3.88427734e-01],
       [ 2.71728516e-01],
       [-1.95922852e-02],
       [ 6.67968750e-01],
       [ 4.22119141e-01],
       [ 4.10644531e-01],
       [ 2.06665039e-01],
       [ 5.83801270e-02],
       [ 2.37060547e-01],
       [ 5.38574219e-01],
       [-7.92503357e-04],
       [-3.32641602e-02],
       [-1.66320801e-02],
       [-2.80914307e-02],
       [ 5.81054688e-01],
       [ 3.75488281e-01],
       [-1.00402832e-02],
       [ 7.29003906e-01],
       [ 5.14160156e-01],
       [ 4.32617188e-01],
       [ 2.51220703e-01]])

In [38]:
preds = np.clip(preds, 0, 1)

In [39]:
preds

array([[0.52880859],
       [0.62646484],
       [0.55664062],
       [0.3449707 ],
       [0.        ],
       [0.57763672],
       [0.53808594],
       [0.        ],
       [0.29321289],
       [1.        ],
       [0.25195312],
       [0.23876953],
       [0.80615234],
       [0.95507812],
       [0.76464844],
       [0.38842773],
       [0.27172852],
       [0.        ],
       [0.66796875],
       [0.42211914],
       [0.41064453],
       [0.20666504],
       [0.05838013],
       [0.23706055],
       [0.53857422],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.        ],
       [0.58105469],
       [0.37548828],
       [0.        ],
       [0.72900391],
       [0.51416016],
       [0.43261719],
       [0.2512207 ]])
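The model's output is an unconstrained number, which is why a few raw predictions fall slightly below 0 or above 1 even though the training scores all lie between 0 and 1. Clipping simply pulls those values back to the nearest bound. A plain-Python sketch of what `np.clip(preds, 0, 1)` does to each value, using three of the raw predictions from above (`clip` here is my own helper, not the NumPy function):

```python
def clip(values, low=0.0, high=1.0):
    """Clamp every value into the closed interval [low, high]."""
    return [min(max(v, low), high) for v in values]

raw = [0.528808594, -0.0187835693, 1.08691406]
clipped = clip(raw)
print(clipped)  # [0.528808594, 0.0, 1.0]
```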