# Lesson 4: Getting Started with NLP

In this mini-project I'll attempt to tackle the 'US Patent Phrase to Phrase Matching' NLP Kaggle competition that we went over in lecture, using Hugging Face transformers and a pre-trained model.


The competition challenges participants to find and measure similarity in patent documents. More specifically, the dataset contains pairs of phrases, and we are tasked with rating how similar they are on a scale of 0 to 1. 

The scores are actually divided into increments of 0.25, each with a distinct meaning: 0.0 signifies the two phrases are unrelated, 0.25 -- somewhat related, 0.5 -- synonyms which don't have nearly the same meaning (i.e. same function, same properties), 0.75 -- synonyms which have exactly or nearly the same meaning, and 1.0 -- very close matches, maybe off by only a difference in conjugation, plurality, or stop-words.
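
To make the scale concrete, here is a small sketch that maps each score level to its meaning and snaps an arbitrary number onto the nearest 0.25 increment. The names `SCORE_MEANINGS` and `snap_to_score_level` are made up for illustration, and the input values are invented.

```python
import numpy as np

# Meanings of each score level, paraphrased from the description above.
SCORE_MEANINGS = {
    0.0: "unrelated",
    0.25: "somewhat related",
    0.5: "synonyms which don't have nearly the same meaning",
    0.75: "synonyms which have exactly or nearly the same meaning",
    1.0: "very close match",
}

def snap_to_score_level(raw_scores):
    """Clip to [0, 1], then round to the nearest multiple of 0.25 (hypothetical helper)."""
    clipped = np.clip(np.asarray(raw_scores, dtype=float), 0.0, 1.0)
    return np.round(clipped * 4) / 4

# Invented raw values, just to exercise the helper.
snapped = snap_to_score_level([0.61, 0.02, 1.14, 0.74])
print(snapped)
print([SCORE_MEANINGS[s] for s in snapped.tolist()])
```

This is only meant to illustrate the label scale, not to suggest that snapping predictions helps the competition metric.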

Let's take a closer look at the data to see what we're working with:


In [1]:
# first we need to check if we are running on Kaggle or on another machine
from pathlib import Path
import os
import pandas as pd
is_kaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

# if we are on kaggle the data is already available to us, if not we need to download it first
if not is_kaggle:
    import zipfile
    import kaggle
    import datasets

    download_path = Path('us-patent-phrase-to-phrase-matching')
    if not download_path.exists():
        kaggle.api.competition_download_cli(str(download_path))
        zipfile.ZipFile(f'{download_path}.zip').extractall(download_path)
else:
    download_path = Path('../input/us-patent-phrase-to-phrase-matching')
    ! pip install -q datasets






In [2]:
df = pd.read_csv(download_path/'train.csv')
eval_df = pd.read_csv(download_path/'test.csv')
df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


We can view the column descriptions on the competition's data page.
* *id*: unique identifier for the pair of phrases
* *anchor*: the first phrase in the pair
* *target*: the second phrase in the pair
* *context*: CPC classification defining the subject within which the similarity is to be scored
* *score*: this is our label, the similarity score between the two phrases

Most of these are self-explanatory. Our label in this case is the *score* field, which is given on the aforementioned 0.25-increment scale.

The CPC classification is the most foreign field here. It is a hierarchical classification system originally developed by the European Patent Office. There are actually five distinct levels generally used by the CPC system, but our context field only gives the first two. For example, a context of *A47* entails Section A, Class 47. You can read more here: https://en.wikipedia.org/wiki/Cooperative_Patent_Classification
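
As a quick illustration of how a context code breaks down, here is a minimal sketch. The helper name `split_cpc_context` is made up, and it assumes every context is one section letter followed by a two-digit class, as in the *A47* example.

```python
def split_cpc_context(context):
    """Split a two-level CPC code such as 'A47' into (section, class).

    Hypothetical helper: assumes one section letter followed by a two-digit class.
    """
    section, cpc_class = context[0], context[1:]
    if not (section.isalpha() and cpc_class.isdigit() and len(cpc_class) == 2):
        raise ValueError(f"unexpected CPC context format: {context!r}")
    return section, cpc_class

print(split_cpc_context("A47"))  # Section A, Class 47
print(split_cpc_context("H01"))  # Section H, Class 01
```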

In [3]:
df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,37d61fd2272659b1,component composite coating,composition,H01
freq,1,152,24,2186


Speaking more on contexts, we see that this dataset includes 106 unique CPC identifiers. 

It's also interesting to note that while there are 29340 unique target phrases, there are only 733 unique anchors. It'd be interesting to see if these fields overlap or are completely disjoint.

In [4]:
pd.unique(df[['anchor', 'target']].values.ravel('K')).size

29815

The two sets are not (completely) disjoint! If they were, we'd expect to see 29340 + 733 = 30073 unique values amongst the anchor and target fields.
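
By inclusion-exclusion, the counts above imply 733 + 29340 - 29815 = 258 phrases that show up both as an anchor and as a target. The sketch below does the same bookkeeping with Python sets on a tiny made-up dataframe (the phrases are placeholders, not a sample of the real data).

```python
import pandas as pd

# Toy stand-in for the training data (made-up rows, not the real dataset).
toy = pd.DataFrame({
    "anchor": ["abatement", "abatement", "wood article"],
    "target": ["act of abating", "wood article", "wooden box"],
})

anchors, targets = set(toy.anchor), set(toy.target)
shared = anchors & targets        # phrases used both as anchor and as target
n_union = len(anchors | targets)  # same quantity as the pd.unique(...).size call

# Inclusion-exclusion: size of union = anchors + targets - shared
assert n_union == len(anchors) + len(targets) - len(shared)
print(shared, n_union)
```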

One thing we could perhaps take advantage of here is symmetry - i.e. a label for (anchor, target) should also be the same for (target, anchor) across the same context.

Symmetrical pairs may already exist in the dataset, but we can also generate the ones that don't ourselves in order to expand our training data. It will be interesting to compare performance on the original dataset vs. our expanded dataset to see if this actually makes any difference.
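
To check how many symmetrical pairs already exist, one option is to join the data against a flipped copy of itself. This is a sketch on made-up rows. The `_mirror` suffix and the toy values are my own choices, and I haven't run this on the real data.

```python
import pandas as pd

# Made-up rows: (a, b) and (b, a) are mirrored within the same context.
toy = pd.DataFrame({
    "anchor":  ["a", "b", "a", "c"],
    "target":  ["b", "a", "c", "d"],
    "context": ["A47", "A47", "A47", "B44"],
    "score":   [0.5, 0.5, 0.25, 0.0],
})

# Swap the two phrase columns, then inner-join on (anchor, target, context).
flipped = toy.rename(columns={"anchor": "target", "target": "anchor"})
mirrored = toy.merge(flipped, on=["anchor", "target", "context"], suffixes=("", "_mirror"))

print(len(mirrored))                                    # rows whose reversed pair also exists
print((mirrored.score == mirrored.score_mirror).all())  # do mirrored pairs agree on the label?
```

If mirrored pairs in the real data turned out to disagree on their scores, that would weaken the symmetry assumption.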

Let's make the expanded dataset.

First we'll create a new dataframe where all the 'target' and 'anchor' values are flipped. I don't think it's a big deal that we'll now have some identical ids, as we won't be training on those anyways.

In [5]:
symmetrical_df = df.copy(deep=True)
symmetrical_df.update(symmetrical_df[['anchor', 'target']].rename(columns={'anchor': 'target', 'target': 'anchor'}))
symmetrical_df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement of pollution,abatement,A47,0.50
1,7b9652b17b68b7a4,act of abating,abatement,A47,0.75
2,36d72442aefd8232,active catalyst,abatement,A47,0.25
3,5296b0c19e1ce60e,eliminating process,abatement,A47,0.50
4,54c1e3b9184cb5b6,forest region,abatement,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wooden article,wood article,B44,1.00
36469,42d9e032d1cd3242,wooden box,wood article,B44,0.50
36470,208654ccb9e14fa3,wooden handle,wood article,B44,0.50
36471,756ec035e694722b,wooden material,wood article,B44,0.75


Then, we can concat the original dataframe with the new one, and drop any duplicate (anchor, target, context) combos.

In [6]:
symmetrical_df = pd.concat([df.copy(deep=True), symmetrical_df])
symmetrical_df = symmetrical_df.drop_duplicates(subset=['anchor', 'target', 'context'])
symmetrical_df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wooden article,wood article,B44,1.00
36469,42d9e032d1cd3242,wooden box,wood article,B44,0.50
36470,208654ccb9e14fa3,wooden handle,wood article,B44,0.50
36471,756ec035e694722b,wooden material,wood article,B44,0.75


In [7]:
symmetrical_df.describe(include='object')

Unnamed: 0,id,anchor,target,context
count,72667,72667,72667,72667
unique,36473,29815,29815,106
top,37d61fd2272659b1,component composite coating,component composite coating,H01
freq,2,152,152,4372


I'm legitimately not sure if this will make a difference or perhaps even make our performance worse, so we'll train a model with the original dataset as well as with this augmented dataset.

As we discussed in lecture, we can frame this as a classification-style problem by combining our fields into a single *document*, and then using a transformer to predict the label, which takes one of 5 possible values. (Note that the model below is created with num_labels=1, so it actually outputs a single continuous score rather than choosing among 5 classes.) Let's choose a different pre-trained transformer and format our input data a little differently than we did in lecture to mix things up.

As far as I'm aware the performance of different input representations is fairly arbitrary, so we'll choose something very simple:

In [8]:
# we'll add an input field to both dataframes since we'd like to compare performance
symmetrical_df['input'] = symmetrical_df.context + ' ; ' + symmetrical_df.anchor + ' ; ' + symmetrical_df.target
df['input'] = df.context + ' ; ' + df.anchor + ' ; ' + df.target

eval_df['input'] = eval_df.context + ' ; ' + eval_df.anchor + ' ; ' + eval_df.target

symmetrical_df.input.head()

0    A47 ; abatement ; abatement of pollution
1            A47 ; abatement ; act of abating
2           A47 ; abatement ; active catalyst
3       A47 ; abatement ; eliminating process
4             A47 ; abatement ; forest region
Name: input, dtype: object
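
Since the same concatenation is repeated for every dataframe, it could be wrapped in a small function. This is just a sketch. `build_input` and its `sep` parameter are names I made up, and the toy row is only there to make the snippet runnable on its own.

```python
import pandas as pd

def build_input(frame, sep=" ; "):
    """Join context, anchor and target into one string per row (hypothetical helper)."""
    return frame.context + sep + frame.anchor + sep + frame.target

# One made-up row in the same shape as the competition data.
toy = pd.DataFrame({
    "anchor": ["abatement"],
    "target": ["forest region"],
    "context": ["A47"],
})
toy["input"] = build_input(toy)
print(toy.input[0])
```

Keeping the separator in one place would also make it easier to experiment with other input formats.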

Next we can load our pandas dataframes into Hugging Face Dataset objects.

In [9]:
from datasets import Dataset
ds = Dataset.from_pandas(df)
symmetrical_ds = Dataset.from_pandas(symmetrical_df)
eval_ds = Dataset.from_pandas(eval_df)

Next, we'll need to tokenize and numericalize our data. In order to tokenize our data we'll need to know which tokenizer our model uses. The best pre-trained model available (just judging by descriptions) seems to be BERT for Patents (https://huggingface.co/anferico/bert-for-patents).

In [10]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = 'anferico/bert-for-patents'

tokenizer = AutoTokenizer.from_pretrained(model_name)




Now that we have the correct tokenizer, we can tokenize our input data.

In [11]:
def tokenize_input(x):
    return tokenizer(x['input'])


tokenized_ds = ds.map(tokenize_input, batched=True)
tokenized_symmetrical_ds = symmetrical_ds.map(tokenize_input, batched=True)
tokenized_eval_ds = eval_ds.map(tokenize_input, batched=True)

  0%|          | 0/37 [00:00<?, ?ba/s]

  0%|          | 0/73 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

Our tokenizer vocabulary has 39859 unique tokens. Finally, we'll need to ensure our label column is named 'label' so that it works with the Hugging Face Trainer.

In [12]:
tokenized_ds = tokenized_ds.rename_column('score', 'label')
tokenized_symmetrical_ds = tokenized_symmetrical_ds.rename_column('score', 'label')

In the lecture we split our training and validation data via train_test_split, which splits the data randomly. We were warned against relying on random splits in practice. The split below uses scikit-learn's train_test_split, which is still random, so that caveat applies here too.

*This actually brings up an important point.* We've augmented our dataset so that it is symmetric, i.e. (anchor, target) and (target, anchor) are both in the dataset. For validation purposes, we'd like to avoid training on (A, B) and then validating on (B, A). That seems like a great way to artificially boost performance on our validation set while hurting performance on our evaluation set.

Instead, we should split our data first, and then introduce symmetry into our training set, to ensure that our validation set contains (anchor, target)s not seen in our training set.

In [13]:
from sklearn.model_selection import train_test_split
from datasets import DatasetDict
# first split the data
train_df, test_df = train_test_split(df, test_size=0.25, random_state=98)

# then add flipped (target, anchor) rows to the training split only
symmetrical_train_df = train_df.copy(deep=True)
symmetrical_train_df.update(symmetrical_train_df[['anchor', 'target']].rename(columns={'anchor': 'target', 'target': 'anchor'}))
symmetrical_train_df = pd.concat([train_df.copy(deep=True), symmetrical_train_df])
symmetrical_train_df = symmetrical_train_df.drop_duplicates(subset=['anchor', 'target', 'context'])

train_df['input'] = train_df.context + ' ; ' + train_df.anchor + ' ; ' + train_df.target
test_df['input'] = test_df.context + ' ; ' + test_df.anchor + ' ; ' + test_df.target
symmetrical_train_df['input'] = symmetrical_train_df.context + ' ; ' + symmetrical_train_df.anchor + ' ; ' + symmetrical_train_df.target

train_ds = Dataset.from_pandas(train_df)
symmetrical_train_ds = Dataset.from_pandas(symmetrical_train_df)
test_ds = Dataset.from_pandas(test_df)

tokenized_train_ds = train_ds.map(tokenize_input, batched=True)
tokenized_symmetrical_train_ds = symmetrical_train_ds.map(tokenize_input, batched=True)
tokenized_test_ds = test_ds.map(tokenize_input, batched=True)
tokenized_eval_ds = eval_ds.map(tokenize_input, batched=True)

tokenized_train_ds = tokenized_train_ds.rename_column('score', 'label')
tokenized_symmetrical_train_ds = tokenized_symmetrical_train_ds.rename_column('score', 'label')
tokenized_test_ds = tokenized_test_ds.rename_column('score', 'label')




  0%|          | 0/28 [00:00<?, ?ba/s]

  0%|          | 0/55 [00:00<?, ?ba/s]

  0%|          | 0/10 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [14]:
dds = DatasetDict({'train': tokenized_train_ds, 'test': tokenized_test_ds})
symmetrical_dds = DatasetDict({'train': tokenized_symmetrical_train_ds, 'test': tokenized_test_ds})

To summarize, we have:
- train_ds, our untouched training dataset
- symmetrical_train_ds, our augmented training dataset
- test_ds, our validation dataset (stored under the 'test' key of each DatasetDict, which is confusing)
- eval_ds, our true 'test' dataset, built from the competition's test.csv
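
One caveat with splitting first and augmenting second: if the original data already contains both (A, B) and (B, A), a random split could still put them on opposite sides. Here is a rough sketch of a check for that, on made-up rows. `count_leaked_pairs` is a hypothetical helper, not something from the lecture.

```python
import pandas as pd

def count_leaked_pairs(train_df, valid_df):
    """Count validation rows whose pair, in either order, also appears in training."""
    keys = ["anchor", "target", "context"]
    train_pairs = set(map(tuple, train_df[keys].to_numpy()))
    leaked = 0
    for anchor, target, context in valid_df[keys].to_numpy():
        if (anchor, target, context) in train_pairs or (target, anchor, context) in train_pairs:
            leaked += 1
    return leaked

# Toy data: the validation row (b, a) mirrors the training row (a, b).
toy_train = pd.DataFrame({"anchor": ["a", "b"], "target": ["b", "a"], "context": ["A47", "A47"]})
toy_valid = pd.DataFrame({"anchor": ["b", "c"], "target": ["a", "d"], "context": ["A47", "B44"]})
print(count_leaked_pairs(toy_train, toy_valid))
```

Running something like this on the augmented training frame and the validation frame would show whether any mirrored pairs slipped through.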

Now we're finally ready to train the model. For now we'll use the same hyperparameters as in lecture.

In [15]:
from transformers import TrainingArguments,Trainer
import numpy as np
import torch 

def corr(x,y):
    x = np.squeeze(x)
    print(x,y)
    return np.corrcoef(x,y)[0][1]
def corr_d(eval_pred):
    return {'pearson': corr(*eval_pred)}

bs = 64
epochs = 4
lr = 8e-6

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none', save_strategy='no')

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



Some weights of the model checkpoint at anferico/bert-for-patents were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not 

First we'll try training on the augmented (symmetrical) dataset

In [16]:
trainer = Trainer(model, args, train_dataset=symmetrical_dds['train'], eval_dataset=symmetrical_dds['test'],
                  tokenizer=tokenizer, compute_metrics=corr_d)

Using cuda_amp half precision backend


In [17]:

trainer.train()


The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: context, anchor, __index_level_0__, id, input, target. If context, anchor, __index_level_0__, id, input, target are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 54494
  Num Epochs = 4
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 3408


Epoch,Training Loss,Validation Loss,Pearson
1,0.0897,0.028949,0.807813
2,0.0233,0.020625,0.832825
3,0.0152,0.01986,0.843177
4,0.013,0.020474,0.84349


The following columns in the evaluation set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: context, anchor, __index_level_0__, id, input, target. If context, anchor, __index_level_0__, id, input, target are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Evaluation *****
  Num examples = 9119
  Batch size = 128


[0.166   0.02914 0.423   ... 0.1929  0.4478  0.1938 ] [0.25 0.   0.5  ... 0.5  0.25 0.5 ]


[0.2585  0.05643 0.5195  ... 0.2266  0.5366  0.2837 ] [0.25 0.   0.5  ... 0.5  0.25 0.5 ]


[0.27    0.00531 0.5073  ... 0.1583  0.56    0.2664 ] [0.25 0.   0.5  ... 0.5  0.25 0.5 ]


[0.298  0.0132 0.5254 ... 0.1649 0.5864 0.2815] [0.25 0.   0.5  ... 0.5  0.25 0.5 ]




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=3408, training_loss=0.02960195325909646, metrics={'train_runtime': 1580.6592, 'train_samples_per_second': 137.902, 'train_steps_per_second': 2.156, 'total_flos': 6704251531523964.0, 'train_loss': 0.02960195325909646, 'epoch': 4.0})

Great! These are actually slightly better Pearson scores than the model we trained in lecture.
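
For reference, the Pearson score reported above is the correlation coefficient that np.corrcoef returns. The sketch below computes it from the definition and compares the two. The `pearson` function and the numbers are made up for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson r from its definition: covariance over the product of standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up predictions and labels, only to compare the two computations.
preds = [0.30, 0.01, 0.53, 0.16, 0.59]
labels = [0.25, 0.00, 0.50, 0.50, 0.25]
print(pearson(preds, labels), np.corrcoef(preds, labels)[0][1])
```

Because the metric only measures linear correlation, predictions do not need to land exactly on the 0.25 increments to score well.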

I mentioned I'd try running the same model with the non-augmented dataset as well... unfortunately trying to train a second model in the same notebook has caused a lot of GPU memory headaches for me. I've attempted various ways to free up GPU memory but so far all of them have ended up taking the notebook down with it! :( If I figure out a good way to do this in the future I will come back and update the notebook with those results.
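
For what it's worth, a commonly suggested pattern is to delete the references to the model and trainer, run the garbage collector, and then ask PyTorch to empty its CUDA cache. The sketch below wraps the last two steps. `release_cuda_cache` is a made-up name, and I can't promise this avoids the crashes described above.

```python
import gc

def release_cuda_cache():
    """Run garbage collection, then ask PyTorch to release cached GPU blocks if it can."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return False  # PyTorch not installed; nothing more to do
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False

# Intended use in this notebook (names from the cells above), untested here:
#   del trainer, model
#   release_cuda_cache()
print(release_cuda_cache())
```

Memory can only be reclaimed for objects that nothing references any more, so leftover references (for example in notebook output history) may still keep the model alive.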

For now, let's generate our submission file.

In [20]:
preds = trainer.predict(tokenized_eval_ds).predictions.astype(float)
preds

The following columns in the test set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: context, anchor, id, input, target. If context, anchor, id, input, target are not expected by `BertForSequenceClassification.forward`,  you can safely ignore this message.
***** Running Prediction *****
  Num examples = 36
  Batch size = 128


array([[ 6.06933594e-01],
       [ 7.46093750e-01],
       [ 5.01464844e-01],
       [ 2.80029297e-01],
       [-1.13677979e-02],
       [ 4.97558594e-01],
       [ 2.97607422e-01],
       [-1.34124756e-02],
       [ 2.35961914e-01],
       [ 1.14257812e+00],
       [ 1.45385742e-01],
       [ 2.52197266e-01],
       [ 7.63183594e-01],
       [ 9.15039062e-01],
       [ 7.43652344e-01],
       [ 3.45703125e-01],
       [ 3.00537109e-01],
       [ 1.54907227e-01],
       [ 4.73876953e-01],
       [ 2.99072266e-01],
       [ 3.68652344e-01],
       [ 2.49389648e-01],
       [ 3.40820312e-01],
       [ 2.59521484e-01],
       [ 5.23925781e-01],
       [-1.23739243e-04],
       [ 2.77328491e-03],
       [ 4.13818359e-02],
       [-9.62066650e-03],
       [ 6.42578125e-01],
       [ 1.41723633e-01],
       [ 1.78833008e-02],
       [ 6.50390625e-01],
       [ 5.18066406e-01],
       [ 3.15917969e-01],
       [ 3.18359375e-01]])

In [21]:
preds = np.clip(preds, 0, 1)
preds

array([[0.60693359],
       [0.74609375],
       [0.50146484],
       [0.2800293 ],
       [0.        ],
       [0.49755859],
       [0.29760742],
       [0.        ],
       [0.23596191],
       [1.        ],
       [0.14538574],
       [0.25219727],
       [0.76318359],
       [0.91503906],
       [0.74365234],
       [0.34570312],
       [0.30053711],
       [0.15490723],
       [0.47387695],
       [0.29907227],
       [0.36865234],
       [0.24938965],
       [0.34082031],
       [0.25952148],
       [0.52392578],
       [0.        ],
       [0.00277328],
       [0.04138184],
       [0.        ],
       [0.64257812],
       [0.14172363],
       [0.0178833 ],
       [0.65039062],
       [0.51806641],
       [0.31591797],
       [0.31835938]])

In [25]:

submission = Dataset.from_dict({
    'id': tokenized_eval_ds['id'],
    'score': preds
})

submission.to_csv('submission.csv', index=False)

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

1047
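
Note that preds has shape (36, 1) rather than (36,). A possible alternative, sketched below with pandas, is to flatten the predictions before writing them so that each score is a plain number. The ids and values here are invented stand-ins for tokenized_eval_ds['id'] and the clipped predictions.

```python
import numpy as np
import pandas as pd

# Stand-ins for tokenized_eval_ds['id'] and the clipped predictions (made-up values).
ids = ["id_0", "id_1", "id_2"]
preds = np.array([[0.61], [0.0], [1.0]])  # shape (n, 1), like the model output

# ravel() turns the (n, 1) column into a flat (n,) array of scores.
submission_df = pd.DataFrame({"id": ids, "score": preds.ravel()})
csv_text = submission_df.to_csv(index=False)  # returns a string when no path is given
print(csv_text)
```

Passing a filename such as 'submission.csv' to to_csv would write the file instead of returning a string.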

I will update this notebook when I have my submission results. I was dealing with some problems around importing the pre-trained BERT model that we used without internet access (for an official submission you must run the notebook without internet). When I figure that out I will update this notebook with the official score.