# Getting Started with NLP for Absolute Beginners

This is the notebook I created while following the Kaggle-hosted course: https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners

It is best to follow the instructions there, but note that you first need to:
- Sign up to Kaggle
- Create an API token and download it to ~/.kaggle/kaggle.json (on Windows, C:\Users\\<username\>\\.kaggle\kaggle.json)
- Accept the competition rules here: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching

## DOWNLOAD DATA

In [1]:
import os

# Kaggle sets this variable inside its kernels. Elsewhere the lookup returns '' (falsy), as it does on my machine.
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')


In [2]:
creds = ''

In [4]:
from pathlib import Path

cred_path = Path('~/.kaggle/kaggle.json').expanduser()
if not cred_path.exists():
    cred_path.parent.mkdir(exist_ok=True)
    cred_path.write_text(creds)
    cred_path.chmod(0o600)
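
The `creds` string above is left empty, so nothing useful gets written unless a token file is already in place. The Kaggle API reads a small JSON file holding a username and a key, so a minimal sketch of creating one might look like the following. The helper name `write_kaggle_creds`, the placeholder values and the temporary directory are assumptions for illustration; real values come from the token downloaded from Kaggle.

```python
import json
import tempfile
from pathlib import Path

def write_kaggle_creds(cred_path, username, key):
    """Write a kaggle.json-style credentials file unless one already exists."""
    cred_path = Path(cred_path)
    if cred_path.exists():
        return False  # never overwrite an existing token
    cred_path.parent.mkdir(parents=True, exist_ok=True)
    cred_path.write_text(json.dumps({"username": username, "key": key}))
    cred_path.chmod(0o600)  # owner read/write only (has little effect on Windows)
    return True

# Demonstration in a throwaway directory, using placeholder (not real) values.
demo_path = Path(tempfile.mkdtemp()) / ".kaggle" / "kaggle.json"
created = write_kaggle_creds(demo_path, "your-username", "your-api-key")
print(created, json.loads(demo_path.read_text())["username"])
```

Returning early when the file exists mirrors the `if not cred_path.exists()` guard in the cell above, so an existing token is never clobbered.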

In [5]:
path = Path('us-patent-phrase-to-phrase-matching')
print(path)

us-patent-phrase-to-phrase-matching


To run this notebook, you need to accept the rules for the U.S. Patent Phrase to Phrase Matching competition here: https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching

In [11]:
if not iskaggle and not path.exists():
    import zipfile,kaggle
    kaggle.api.competition_download_cli(str(path))
    zipfile.ZipFile(f'{path}.zip').extractall(path)

Downloading us-patent-phrase-to-phrase-matching.zip to c:\dev\learning-2024\nlp-for-beginners


100%|██████████| 682k/682k [00:00<00:00, 5.88MB/s]







In [6]:
# List the downloaded files (`dir /b` is the Windows counterpart of `ls`)
!dir {path} /b

sample_submission.csv
test.csv
train.csv


## READ DATA

In [7]:
import pandas as pd

In [8]:
df = pd.read_csv(path/'train.csv')

In [9]:
df

| | id | anchor | target | context | score |
|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00 |
| ... | ... | ... | ... | ... | ... |
| 36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00 |
| 36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50 |
| 36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50 |
| 36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |


In [10]:
df.describe(include='object')

| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36473 | 36473 | 36473 | 36473 |
| unique | 36473 | 733 | 29340 | 106 |
| top | 37d61fd2272659b1 | component composite coating | composition | H01 |
| freq | 1 | 152 | 24 | 2186 |


In [11]:
df['input'] = 'TEXT1: ' + df.context + '; TEXT2: ' + df.target + '; ANC1: ' + df.anchor
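
To see what this concatenation produces for a single row, here is a small sketch in plain Python. The function name `build_input` is made up for illustration; the template itself is copied from the cell above.

```python
def build_input(context, target, anchor):
    """Combine the three fields into one string, mirroring the notebook's template."""
    return f"TEXT1: {context}; TEXT2: {target}; ANC1: {anchor}"

# Values taken from the first row of the training data shown earlier.
example = build_input("A47", "abatement of pollution", "abatement")
print(example)
```

The pandas expression does the same thing for every row at once, because adding a string to a column applies the operation element by element.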

In [12]:
df.head()

| | id | anchor | target | context | score | input |
|---|---|---|---|---|---|---|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.5 | TEXT1: A47; TEXT2: abatement of pollution; ANC... |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 | TEXT1: A47; TEXT2: act of abating; ANC1: abate... |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 | TEXT1: A47; TEXT2: active catalyst; ANC1: abat... |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.5 | TEXT1: A47; TEXT2: eliminating process; ANC1: ... |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.0 | TEXT1: A47; TEXT2: forest region; ANC1: abatement |


## CREATE DATASET

Transformers uses a Dataset object (from the Hugging Face datasets library) for storing a... well, a dataset, of course! We can create one like so:

In [13]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)



In [14]:
print(ds)

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})


## TOKENIZE AND NUMERICALIZE

But we can't pass the texts directly into a model. A deep learning model expects numbers as inputs, not English sentences! So we need to do two things:

- Tokenization: Split each text up into words (or actually, as we'll see, into tokens)
- Numericalization: Convert each word (or token) into a number.
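
The two steps above can be illustrated with a toy example. This is not how the DeBERTa tokenizer works (it splits text into subword pieces using a learned vocabulary); the regex splitter and the names `toy_tokenize`, `build_vocab` and `numericalize` are assumptions made purely to show the idea.

```python
import re

def toy_tokenize(text):
    """Split text into lowercase word and punctuation tokens (toy scheme)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def build_vocab(texts):
    """Assign each distinct token an integer id, in order of first appearance."""
    vocab = {}
    for text in texts:
        for tok in toy_tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab

def numericalize(text, vocab):
    """Map each token of the text to its id in the vocabulary."""
    return [vocab[tok] for tok in toy_tokenize(text)]

texts = ["abatement of pollution", "act of abating"]
vocab = build_vocab(texts)
ids = numericalize("act of pollution", vocab)
print(toy_tokenize(texts[0]), ids)
```

A real tokenizer also has to cope with words it has never seen, which is one reason subword tokens are used instead of whole words.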

The details about how this is done actually depend on the particular model we use. So first we'll need to pick a model. There are thousands of models available, but a reasonable starting point for nearly any NLP problem is to use this (replace "small" with "large" for a slower but more accurate model, once you've finished exploring):

In [15]:
model_nm = 'microsoft/deberta-v3-small'

In [16]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


In [17]:
tokz.tokenize("G'day folks, I'm Jeremy from fast.ai!")

['▁G',
 "'",
 'day',
 '▁folks',
 ',',
 '▁I',
 "'",
 'm',
 '▁Jeremy',
 '▁from',
 '▁fast',
 '.',
 'ai',
 '!']

In [18]:
tokz.tokenize("A platypus is an ornithorhynchus anatinus.")

['▁A',
 '▁platypus',
 '▁is',
 '▁an',
 '▁or',
 'ni',
 'tho',
 'rhynch',
 'us',
 '▁an',
 'at',
 'inus',
 '.']

In [22]:
tokz('hello')

{'input_ids': [1, 12018, 2], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}

In [19]:
def tok_func(x): return tokz(x["input"])

In [20]:
tok_ds = ds.map(tok_func, batched=True)

Map: 100%|██████████| 36473/36473 [00:00<00:00, 38242.24 examples/s]


In [21]:
row = tok_ds[0]
row['input'], row['input_ids']

('TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2])

In [36]:
tokz.vocab['▁of']

265

Next, we need to prepare our labels. Transformers always assumes that your labels have the column name labels, but in our dataset it's currently score. Therefore, we need to rename it:

In [37]:
tok_ds = tok_ds.rename_columns({'score':'labels'})

In [39]:
tok_ds[0]

{'id': '37d61fd2272659b1',
 'anchor': 'abatement',
 'target': 'abatement of pollution',
 'context': 'A47',
 'labels': 0.5,
 'input': 'TEXT1: A47; TEXT2: abatement of pollution; ANC1: abatement',
 'input_ids': [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  23702,
  435,
  294,
  47284,
  2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

In [40]:
eval_df = pd.read_csv(path/'test.csv')
eval_df.describe()

| | id | anchor | target | context |
|---|---|---|---|---|
| count | 36 | 36 | 36 | 36 |
| unique | 36 | 34 | 36 | 29 |
| top | 4112d61851461f60 | el display | inorganic photoconductor drum | G02 |
| freq | 1 | 2 | 1 | 3 |


In [41]:
dds = tok_ds.train_test_split(0.25, seed=42)
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})
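
The row counts above follow from the 0.25 split fraction applied to the 36473 training rows. A quick arithmetic sketch, assuming the held-out share is rounded up (an assumption that is consistent with the output shown, not a statement about the library's internals); `split_sizes` is a made-up helper name:

```python
import math

def split_sizes(n_rows, test_fraction):
    """Return (train_rows, test_rows), rounding the held-out share up."""
    n_test = math.ceil(n_rows * test_fraction)
    return n_rows - n_test, n_test

train_rows, test_rows = split_sizes(36473, 0.25)
print(train_rows, test_rows)
```

Note that the split labelled `test` here is carved out of train.csv and is used for evaluation during training; it is separate from the 36 rows in the competition's test.csv.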

In [42]:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

Map: 100%|██████████| 36/36 [00:00<00:00, 5247.62 examples/s]


In [45]:
eval_df['input'][0]

'TEXT1: G02; TEXT2: inorganic photoconductor drum; ANC1: opc drum'

In [46]:
from transformers import TrainingArguments, Trainer

In [50]:
# Batch size and epochs (number of times through training)
bs = 128
epochs = 4

In [51]:
# Learning Rate
lr = 8e-5
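
This value is a peak rather than a constant: the training arguments further down set `warmup_ratio=0.1` and `lr_scheduler_type='cosine'`, so the rate ramps up linearly and then decays along a cosine curve. The sketch below approximates that shape. The function name `lr_at_step` and the step count of 1000 are made up for illustration, and the formula is a common way to write such a schedule, not a copy of the library's code.

```python
import math

def lr_at_step(step, total_steps, peak_lr=8e-5, warmup_ratio=0.1):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total))
```

The rate starts at zero, reaches the peak when warmup ends, and falls back towards zero at the final step.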

In [52]:
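args = TrainingArguments('outputs',                     # directory for checkpoints and logs
                         learning_rate=lr,
                         warmup_ratio=0.1,              # ramp the learning rate up over the first 10% of steps
                         lr_scheduler_type='cosine',    # then decay it along a cosine curve
                         fp16=True,                     # mixed precision; needs a CUDA GPU
                         evaluation_strategy="epoch",   # evaluate once per epoch (newer transformers releases name this eval_strategy)
                         per_device_train_batch_size=bs,
                         per_device_eval_batch_size=bs*2,
                         num_train_epochs=epochs,
                         weight_decay=0.01,
                         report_to='none')              # disable external experiment loggers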
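
The `Trainer` call in the next cell passes `compute_metrics=corr_d`, but `corr_d` is not defined anywhere in this notebook. The competition is scored on the Pearson correlation coefficient, so one plausible definition is sketched below. It is written in plain Python so that it runs without extra packages (NumPy's `np.corrcoef` would be the more usual choice), and the helper `_flatten` is an assumption added so that predictions of shape (n, 1) and labels of shape (n,) can be compared.

```python
import math

def _flatten(values):
    """Flatten nested lists (or NumPy arrays, via .tolist()) into a list of floats."""
    if hasattr(values, "tolist"):
        values = values.tolist()
    flat = []
    for v in values:
        if isinstance(v, (list, tuple)):
            flat.extend(_flatten(v))
        else:
            flat.append(float(v))
    return flat

def corr(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    x, y = _flatten(x), _flatten(y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def corr_d(eval_pred):
    """Metric callback: unpack (predictions, labels) and return a named score."""
    preds, labels = eval_pred
    return {"pearson": corr(preds, labels)}

# Made-up predictions and labels, just to exercise the function.
print(corr_d(([[0.1], [0.4], [0.9]], [0.0, 0.5, 1.0])))
```

Returning a dictionary lets the trainer report the metric under the key `pearson` alongside the loss.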

In [54]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
# corr_d is the metric function; it must be defined before this cell runs.
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

ImportError: 
AutoModelForSequenceClassification requires the PyTorch library but it was not found in your environment. Checkout the instructions on the
installation page: https://pytorch.org/get-started/locally/ and follow the ones that match your environment.
Please note that you may need to restart your runtime after installation.
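
This error, together with the earlier warning that none of PyTorch, TensorFlow or Flax were found, means the model cannot be loaded until PyTorch is installed by following the page linked in the message. A small sketch for checking the environment before reaching this cell is shown below; the helper name `is_installed` is made up, and the list of packages is simply the ones this notebook imports.

```python
import importlib.util

def is_installed(module_name):
    """Return True if the named top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None

for name in ("torch", "transformers", "datasets"):
    print(f"{name}: {'found' if is_installed(name) else 'missing'}")
```

After installing PyTorch, the kernel needs to be restarted before the import is picked up, as the error message notes.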
