## Learning about NLP Transformers

### Kaggle Competition: U.S. Patent Phrase to Phrase Matching
https://www.kaggle.com/competitions/us-patent-phrase-to-phrase-matching/data?select=train.csv

In [2]:
from fastai.text.all import *

In [3]:
from rich import inspect

### Examine Data

In [41]:
path = Path("us-patent-phrase-to-phrase-matching")

In [42]:
df = pd.read_csv(path/"train.csv")
df.head()

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.5
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.5
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.0


### Create and Preprocess DataFrame
Concanenate records into a single document and add to `df`

In [6]:
df['input'] = "TEXT1: " + df.context + "; TEXT2: " + df.target + " ;ANC1; " + df.anchor
df.head()

Unnamed: 0,id,anchor,target,context,score,input
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.5,TEXT1: A47; TEXT2: abatement of pollution ;ANC1; abatement
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75,TEXT1: A47; TEXT2: act of abating ;ANC1; abatement
2,36d72442aefd8232,abatement,active catalyst,A47,0.25,TEXT1: A47; TEXT2: active catalyst ;ANC1; abatement
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.5,TEXT1: A47; TEXT2: eliminating process ;ANC1; abatement
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.0,TEXT1: A47; TEXT2: forest region ;ANC1; abatement


### Create Dataset from DataFrame

In [8]:
from datasets import Dataset, DatasetDict

In [11]:
ds = Dataset.from_pandas(df)
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

### Tokenize Dataset
Need to know the `model` in order to get the correct Tokenizer with `AutoTokenizer(model)`

In [12]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

In [13]:
model_nm = 'microsoft/deberta-v3-small'

**It is important that the tokenizer used for the training is the same as used for new documents**

In [14]:
tokz = AutoTokenizer.from_pretrained(model_nm)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Here's a simple function which tokenizes our inputs:

In [15]:
def tok_func(x): return tokz(x["input"])

Map `tok_func` on `ds['input']`, which uses `tokz` to tokenize each element. This creates a tokenized dataset `tok_ds`

In [16]:
tok_ds = ds.map(tok_func, batched=True)

  0%|          | 0/37 [00:00<?, ?ba/s]

#### The labelcolumn name needs to be`labels`.  Need to rename `score` to `labels'

In [17]:
tok_ds = tok_ds.rename_columns({'score':'labels'})

In [19]:
tok_ds[0]

{'id': '37d61fd2272659b1',
 'anchor': 'abatement',
 'target': 'abatement of pollution',
 'context': 'A47',
 'labels': 0.5,
 'input': 'TEXT1: A47; TEXT2: abatement of pollution ;ANC1; abatement',
 'input_ids': [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  2600,
  64097,
  435,
  346,
  47284,
  2],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

### Create the Training and Test/Valid datasets from tok_ds

`DatasetDict` here, `dds` holds training and validation datasets. To create one that contains 25% of our data for the validation set, and 75% for the training set, use `train_test_split`:

In [20]:
dds = tok_ds.train_test_split(.25) # DataSetDict
dds

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

### Metrics

Pearson coefficient between (x,y)

In [21]:
def corr(x,y): return np.corrcoef(x,y)[0][1]

Transformers expects metrics to be returned as a `dict`, since that way the trainer knows what label to use, so let's create a function to do that:

In [22]:
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}

### Create Model

In [43]:
from transformers import TrainingArguments, Trainer

Another Auto Factory Method using the `model_nm` to create a model consisitent with the `AutoTokenizer`

In [35]:
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1);

loading configuration file https://huggingface.co/microsoft/deberta-v3-small/resolve/main/config.json from cache at /home/cdaniels/.cache/huggingface/transformers/8e0c12a7672d1d36f647c86e5fc3a911f189d8704e2bc94dde4a1ffe38f648fa.9df96bac06c2c492bc77ad040068f903c93beec14607428f25bf9081644ad0da
Model config DebertaV2Config {
  "_name_or_path": "microsoft/deberta-v3-small",
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-07,
  "max_position_embeddings": 512,
  "max_relative_positions": -1,
  "model_type": "deberta-v2",
  "norm_rel_ebd": "layer_norm",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 0,
  "pooler_dropout": 0,
  "pooler_hidden_act": "gelu",
  "pooler_hidden_size": 768,
  "pos_att_type": [
    "p2c",
    "c2p"
  ],
  "position_bias

### Create Trainer

In [36]:
bs = 128
epochs = 2
lr = 8e-5

All of the paramaters related to the `Trainer` go into `TrainerArguments`

In [37]:
args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

PyTorch: setting up devices


In [38]:
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)

Using amp half precision backend


### Execute Training

In [39]:
trainer.train();

The following columns in the training set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: context, anchor, id, input, target. If context, anchor, id, input, target are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 27354
  Num Epochs = 2
  Instantaneous batch size per device = 128
  Total train batch size (w. parallel, distributed & accumulation) = 256
  Gradient Accumulation steps = 1
  Total optimization steps = 214


Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.033595,0.768747
2,No log,0.02645,0.801013


The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: context, anchor, id, input, target. If context, anchor, id, input, target are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9119
  Batch size = 512
The following columns in the evaluation set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: context, anchor, id, input, target. If context, anchor, id, input, target are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9119
  Batch size = 512


Training completed. Do not forget to share your model on huggingface.co/models =)




### Create Test Dataset

Use `eval` as our name for the test set, to avoid confusion with the `test` dataset that was created above.

In [44]:
eval_df = pd.read_csv(path/'test.csv')
eval_df.head()

Unnamed: 0,id,anchor,target,context
0,4112d61851461f60,opc drum,inorganic photoconductor drum,G02
1,09e418c93a776564,adjust gas flow,altering gas flow,F23
2,36baf228038e314b,lower trunnion,lower locating,B60
3,1f37ead645e7f0c8,cap component,upper portion,D06
4,71a5b6ad068d531f,neural stimulation,artificial neural network,H04


In [45]:
eval_df['input'] = 'TEXT1: ' + eval_df.context + '; TEXT2: ' + eval_df.target + '; ANC1: ' + eval_df.anchor

In [46]:
eval_ds = Dataset.from_pandas(eval_df).map(tok_func, batched=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

### Prediction and Inference

Use `trainer.predict(eval_ds)` to make predictions on the `eval_ds`

In [48]:
preds = trainer.predict(eval_ds).predictions.astype(float)
preds

The following columns in the test set don't have a corresponding argument in `DebertaV2ForSequenceClassification.forward` and have been ignored: context, anchor, id, input, target. If context, anchor, id, input, target are not expected by `DebertaV2ForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 36
  Batch size = 512


array([[ 0.53369141],
       [ 0.69970703],
       [ 0.33984375],
       [ 0.39233398],
       [-0.02444458],
       [ 0.52636719],
       [ 0.33374023],
       [ 0.09204102],
       [ 0.16113281],
       [ 1.12304688],
       [ 0.18237305],
       [ 0.36230469],
       [ 0.734375  ],
       [ 0.70458984],
       [ 0.79394531],
       [ 0.45092773],
       [ 0.17443848],
       [ 0.06079102],
       [ 0.546875  ],
       [ 0.2310791 ],
       [ 0.38452148],
       [ 0.25097656],
       [ 0.12054443],
       [ 0.18103027],
       [ 0.54248047],
       [-0.04666138],
       [ 0.01157379],
       [ 0.00126743],
       [ 0.01052094],
       [ 0.79150391],
       [ 0.19909668],
       [ 0.07568359],
       [ 0.69628906],
       [ 0.35449219],
       [ 0.36962891],
       [ 0.1973877 ]])

In [49]:
preds = np.clip(preds, 0, 1)
preds

array([[0.53369141],
       [0.69970703],
       [0.33984375],
       [0.39233398],
       [0.        ],
       [0.52636719],
       [0.33374023],
       [0.09204102],
       [0.16113281],
       [1.        ],
       [0.18237305],
       [0.36230469],
       [0.734375  ],
       [0.70458984],
       [0.79394531],
       [0.45092773],
       [0.17443848],
       [0.06079102],
       [0.546875  ],
       [0.2310791 ],
       [0.38452148],
       [0.25097656],
       [0.12054443],
       [0.18103027],
       [0.54248047],
       [0.        ],
       [0.01157379],
       [0.00126743],
       [0.01052094],
       [0.79150391],
       [0.19909668],
       [0.07568359],
       [0.69628906],
       [0.35449219],
       [0.36962891],
       [0.1973877 ]])

### Create our CSV Submission Results

In [50]:
import datasets

submission = datasets.Dataset.from_dict({
    'id': eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

1059

In [51]:
pd.read_csv("submission.csv")

Unnamed: 0,id,score
0,4112d61851461f60,[0.53369141]
1,09e418c93a776564,[0.69970703]
2,36baf228038e314b,[0.33984375]
3,1f37ead645e7f0c8,[0.39233398]
4,71a5b6ad068d531f,[0.]
5,474c874d0c07bd21,[0.52636719]
6,442c114ed5c4e3c9,[0.33374023]
7,b8ae62ea5e1d8bdb,[0.09204102]
8,faaddaf8fcba8a3f,[0.16113281]
9,ae0262c02566d2ce,[1.]
