In [253]:
# Input data files are available in the read-only "../input/" directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/us-patent-phrase-to-phrase-matching/sample_submission.csv
/kaggle/input/us-patent-phrase-to-phrase-matching/train.csv
/kaggle/input/us-patent-phrase-to-phrase-matching/test.csv


# Introduction 

This project is based on the "U.S. Patent Phrase to Phrase Matching" competition that was hosted by Satsyil Corp in 2022. The idea is to compare two phrases and score them based on how similar they are. Determining the semantic similarity between phrases is crucial during the patent search and examination phase, as it is used to determine whether an invention has already been documented. By recognizing equivalents like "television set" and "TV set" or broader matches such as "strong material" and "steel," we can help patent attorneys and examiners retrieve relevant documents. The goal of this NLP project is to develop a model that matches phrases in order to extract contextual information, to help the patent community connect the dots between millions of patent documents.

# Setup 
We start by creating a flag that tells us whether or not we are running on Kaggle (strictly speaking it is a string that is empty, and therefore falsy, when we are not on Kaggle). The dataset used is only available from Kaggle, so the easiest way to run this notebook is to run it on Kaggle. If you are running it on your own PC or GPU server, you will need to download the dataset using the Kaggle API (not done in this notebook).

In [254]:
iskaggle = os.environ.get("KAGGLE_KERNEL_RUN_TYPE", "")
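
Because `os.environ.get` returns a string rather than a true boolean, it can help to make the conversion explicit. The following is a minimal sketch of that idea; `run_type` and `on_kaggle` are hypothetical names used only for illustration and are not used elsewhere in this notebook:

```python
import os

# Falls back to "" when the variable is not set (i.e. outside Kaggle).
run_type = os.environ.get("KAGGLE_KERNEL_RUN_TYPE", "")

# An empty string is falsy, so bool() gives an explicit True/False flag.
on_kaggle = bool(run_type)  # hypothetical name, for illustration only
print("Running on Kaggle:", on_kaggle)
```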

We then create a path object that points to the directory containing our data and install the datasets library developed by Hugging Face, which is used to load and process datasets (we will also be using the Transformers library from the Hugging Face ecosystem).

In [255]:
from pathlib import Path

if iskaggle:
    path = Path('../input/us-patent-phrase-to-phrase-matching')
    ! pip install -q datasets

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Let's check that we have our data:

In [256]:
!ls {path}

sample_submission.csv  test.csv  train.csv


We will use the Pandas library for working with our CSV files:

In [257]:
import pandas as pd

Let's load our training data into a DataFrame:

In [258]:
df = pd.read_csv(path/"train.csv")

Let's take a look at our data:

In [259]:
df

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


And let's get some more information about this data to understand it better:

In [260]:
df.describe(include="object")

Unnamed: 0,id,anchor,target,context
count,36473,36473,36473,36473
unique,36473,733,29340,106
top,8d135da0b55b8c88,component composite coating,composition,H01
freq,1,152,24,2186


This tells us that we have 733 unique anchors, almost 30,000 unique targets, and 106 contexts. We also see that "component composite coating" is the most common anchor, appearing 152 times. The goal of this project is to rate how similar an anchor and a target phrase are on a scale from 0 to 1. A very close match receives the score 1.0, a close synonym (e.g. "mobile phone" vs. "cellphone") receives the score 0.75, synonyms which don't have the same meaning get the score 0.5, somewhat related phrases (e.g. two phrases that are in the same high-level domain but are not synonyms) get a score of 0.25, and unrelated phrases get a score of 0.0. These scores can be seen in the "score" column in the dataframe above. The similarity between phrases has been scored within a patent's context (specifically its CPC classification), which indicates the subject to which the patent relates.
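
To make the scoring scale concrete, here is a small sketch that stores the five levels in a dictionary. The names `SCORE_MEANINGS`, `nearest_level` and `describe` are hypothetical, and snapping an arbitrary number to the nearest level is only an illustration, not part of the competition rules:

```python
# Score levels and their meanings, as described in the text above.
SCORE_MEANINGS = {
    1.0: "very close match",
    0.75: "close synonym",
    0.5: "synonyms which don't have the same meaning",
    0.25: "somewhat related",
    0.0: "unrelated",
}

def nearest_level(score):
    """Snap an arbitrary score to the closest of the five labelled levels."""
    return min(SCORE_MEANINGS, key=lambda level: abs(level - score))

def describe(score):
    """Return the meaning of the level closest to `score`."""
    return SCORE_MEANINGS[nearest_level(score)]

print(describe(0.75))
print(describe(0.7))
```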

We will now create the input to our model. The values will be in the form "TEXT1: context; TEXT2: target; TEXT3: anchor".

In [261]:
df["input"] = "TEXT1: " + df.context + "; TEXT2: " + df.target + "; TEXT3: " + df.anchor

Let's look at the first few rows:

In [262]:
df.input.head()

0    TEXT1: A47; TEXT2: abatement of pollution; TEX...
1    TEXT1: A47; TEXT2: act of abating; TEXT3: abat...
2    TEXT1: A47; TEXT2: active catalyst; TEXT3: aba...
3    TEXT1: A47; TEXT2: eliminating process; TEXT3:...
4    TEXT1: A47; TEXT2: forest region; TEXT3: abate...
Name: input, dtype: object
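
The pandas output truncates each string, so it can be useful to see one full input. This sketch builds the same string for a single row in plain Python; `build_input` is a hypothetical helper, not something the notebook defines:

```python
def build_input(context, target, anchor):
    """Join the three fields into one string in the notebook's format."""
    return f"TEXT1: {context}; TEXT2: {target}; TEXT3: {anchor}"

# Values taken from the first row of the training data shown earlier.
example = build_input("A47", "abatement of pollution", "abatement")
print(example)
# TEXT1: A47; TEXT2: abatement of pollution; TEXT3: abatement
```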

# Tokenization

Transformers works with the Dataset object from the datasets library, so we will store our data in this object.

In [263]:
from datasets import Dataset, DatasetDict

ds = Dataset.from_pandas(df)

Let's take a look at it:

In [264]:
ds

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'score', 'input'],
    num_rows: 36473
})

We will now split each text up into tokens and convert each token into a number (as the model expects numbers as inputs). We will use the following model:

In [265]:
model_db = "microsoft/deberta-v3-small"

And we will use AutoTokenizer to create a tokenizer that is appropriate for our model:

In [266]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_db)



Let's see an example of how our tokenizer splits a text into tokens:

In [267]:
tokenizer.tokenize("TEXT1: A47; TEXT2: abatement of pollution; TEXT3: abatement")

['▁TEXT',
 '1',
 ':',
 '▁A',
 '47',
 ';',
 '▁TEXT',
 '2',
 ':',
 '▁abatement',
 '▁of',
 '▁pollution',
 ';',
 '▁TEXT',
 '3',
 ':',
 '▁abatement']

We now create a function that tokenizes our inputs and apply it to every row in our dataset using map. This is a functional programming approach: map applies a function to each element in the dataset. Setting batched = True makes map pass the rows to our function in batches rather than one at a time, which is more efficient since our dataset is large:

In [268]:
def tokenize_input(x):
    return tokenizer(x["input"])

tokenized_dataset = ds.map(tokenize_input, batched = True)

  0%|          | 0/37 [00:00<?, ?ba/s]
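
To illustrate what batching means here, the following toy sketch mimics a batched map in plain Python. It is not the datasets library's implementation: `batched_map` and `count_words` are hypothetical stand-ins, and the batch size of 1000 is an assumption (it is at least consistent with the progress bar above, which counts 37 batches for 36473 rows).

```python
import math

def batched_map(rows, fn, batch_size=1000):
    """Toy stand-in for a batched map: call `fn` once per batch of rows."""
    result = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # The function sees a dict of columns, each holding a list of values.
        columns = {key: [row[key] for row in batch] for key in batch[0]}
        new_columns = fn(columns)
        # Merge the new columns back into the individual rows.
        for i, row in enumerate(batch):
            extra = {key: values[i] for key, values in new_columns.items()}
            result.append({**row, **extra})
    return result

def count_words(batch):
    """Hypothetical per-batch function standing in for the tokenizer."""
    return {"n_words": [len(text.split()) for text in batch["input"]]}

rows = [
    {"input": "TEXT1: A47; TEXT2: act of abating; TEXT3: abatement"},
    {"input": "TEXT1: A47; TEXT2: forest region; TEXT3: abatement"},
]
mapped = batched_map(rows, count_words, batch_size=1)
print(mapped[0]["n_words"], mapped[1]["n_words"])

# Number of batches for 36473 rows, assuming a batch size of 1000.
n_batches = math.ceil(36473 / 1000)
print(n_batches)  # 37
```

The key point is that the function receives lists of values (one list per column) and returns lists of the same length, which is why `tokenize_input` can hand the whole `x["input"]` list to the tokenizer at once.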

Let's inspect the first row of our data:

In [269]:
first_row = tokenized_dataset[0]
first_row["input"], first_row["input_ids"]

('TEXT1: A47; TEXT2: abatement of pollution; TEXT3: abatement',
 [1,
  54453,
  435,
  294,
  336,
  5753,
  346,
  54453,
  445,
  294,
  47284,
  265,
  6435,
  346,
  54453,
  508,
  294,
  47284,
  2])

We can see the input and the IDs for the first row of our data. The tokenizer has a dictionary called vocab which maps every possible token string to a unique integer ID. We can, for instance, look up the ID for the token "▁TEXT":

In [270]:
tokenizer.vocab["▁TEXT"]

54453
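
The same lookup can be sketched for the whole example using only the tokens and IDs printed above. `toy_vocab` and `encode` are hypothetical names; the sketch also assumes that the leading 1 and trailing 2 in the ID list are special start and end markers added by the tokenizer, since the ID list is two entries longer than the token list:

```python
# Token strings and IDs copied from the outputs shown above.
tokens = ["▁TEXT", "1", ":", "▁A", "47", ";", "▁TEXT", "2", ":",
          "▁abatement", "▁of", "▁pollution", ";", "▁TEXT", "3", ":",
          "▁abatement"]
input_ids = [1, 54453, 435, 294, 336, 5753, 346, 54453, 445, 294,
             47284, 265, 6435, 346, 54453, 508, 294, 47284, 2]

# Pair each token with its ID, skipping the assumed start/end markers.
toy_vocab = dict(zip(tokens, input_ids[1:-1]))

def encode(token_strings, vocab, start_id=1, end_id=2):
    """Look up each token's ID and wrap the result in start/end markers."""
    return [start_id] + [vocab[t] for t in token_strings] + [end_id]

print(toy_vocab["▁TEXT"])
print(encode(tokens, toy_vocab) == input_ids)
```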

As expected, the integer 54453 appears in the input IDs above. Lastly, Transformers expects our labels to have the column name "labels" so we have to rename it from "score" to "labels".

In [271]:
tokenized_dataset = tokenized_dataset.rename_columns({"score": "labels"})

# Validation set

We can now create our validation set. We already have our test set:

In [272]:
test_set_df = pd.read_csv(path/"test.csv")
test_set_df.describe()

Unnamed: 0,id,anchor,target,context
count,36,36,36,36
unique,36,34,36,29
top,4112d61851461f60,hybrid bearing,inorganic photoconductor drum,G02
freq,1,2,1,3


We will use train_test_split, which returns a DatasetDict, to split our tokenized dataset so that we have 75% for training and 25% for validation:

In [273]:
dataset_dict = tokenized_dataset.train_test_split(0.25, seed = 42)
dataset_dict

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'labels', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
})

This splits the dataset randomly into a training and a validation set (NB: the validation set is called test in the DatasetDict, but it is our validation set). We will train our model on our training set and use the validation set to check how well it performs on data it has not seen during training. Once our entire training process is done we will test our model on our test set. (The reason we need a test set is that while training the model we might find things that coincidentally improve our validation set metrics, without the model actually becoming better in practice; in other words we are over-fitting on our validation set.)
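
The idea of a seeded random split can be sketched in plain Python. This is not how the datasets library implements `train_test_split`, and the rows it selects will differ; `random_split` is a hypothetical helper, and rounding the validation size up is an assumption that happens to reproduce the row counts printed above:

```python
import math
import random

def random_split(n_rows, val_fraction, seed):
    """Shuffle row indices with a fixed seed and cut off a validation part."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(n_rows * val_fraction)  # assumed rounding rule
    return indices[n_val:], indices[:n_val]

train_idx, val_idx = random_split(36473, 0.25, seed=42)
print(len(train_idx), len(val_idx))
# 27354 9119
```

Fixing the seed means the same rows end up in the validation set on every run, which keeps the validation metrics comparable between experiments.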

To avoid confusion with the test dataset that was created above we will call our actual test set "final_test_dataset":

In [274]:
test_set_df["input"] = "TEXT1: " + test_set_df.context + "; TEXT2: " + test_set_df.target + "; TEXT3: " + test_set_df.anchor
final_test_dataset = Dataset.from_pandas(test_set_df).map(tokenize_input, batched = True)

  0%|          | 0/1 [00:00<?, ?ba/s]

# Training our model

Submissions for this problem are evaluated on the Pearson correlation coefficient between predicted and actual similarity scores. Transformers expects metrics to be returned as a dictionary that maps each metric's name to its value, so we will create functions to do that:

In [280]:
import numpy as np

def corr(x,y): 
    return np.corrcoef(x,y)[0][1]

def corr_d(eval_pred):
    return {"pearson": corr(*eval_pred)}
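
For intuition about what this metric measures, here is a minimal sketch that computes the Pearson correlation by hand on made-up toy values (they are not model outputs). `pearson` is a hypothetical helper written for illustration, separate from the `corr` function above:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

actual = [0.0, 0.25, 0.5, 0.75, 1.0]          # toy labels
shifted = [a * 0.5 + 0.2 for a in actual]     # linear transform of the labels
reversed_scores = [1.0 - a for a in actual]   # perfectly anti-correlated

print(round(pearson(actual, shifted), 6))
print(round(pearson(actual, reversed_scores), 6))
```

The first case shows that the metric rewards predictions that rise and fall with the labels, even if they are scaled or shifted; the second shows that predictions moving in the opposite direction score at the other extreme.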

We will now import what we need to train our model. We also set the batch size to 128, the number of epochs to 4, and the learning rate to 8e-5:

In [281]:
from transformers import TrainingArguments, Trainer

batch_size = 128
epochs = 4
lr = 8e-5

We use TrainingArguments to set up the arguments: 

In [282]:
args = TrainingArguments('outputs', learning_rate = lr, warmup_ratio = 0.1, lr_scheduler_type = 'cosine', fp16 = True,
    evaluation_strategy = "epoch", per_device_train_batch_size = batch_size, per_device_eval_batch_size = batch_size * 2,
    num_train_epochs = epochs, weight_decay = 0.01, report_to = 'none')
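
The arguments `warmup_ratio = 0.1` and `lr_scheduler_type = 'cosine'` describe how the learning rate changes during training: it ramps up linearly over the first 10% of steps and then decays along a cosine curve. The following is a sketch of that shape under stated assumptions, not the library's own code: `lr_at_step` is a hypothetical function, the total of 400 steps is made up, and the exact rounding of the warmup length may differ in Transformers.

```python
import math

def lr_at_step(step, total_steps, peak_lr=8e-5, warmup_ratio=0.1):
    """Linear warmup to `peak_lr`, then cosine decay towards zero."""
    warmup_steps = round(total_steps * warmup_ratio)  # assumed rounding
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has been completed (0 to 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 400  # hypothetical number of training steps
for step in (0, 20, 40, 220, 400):
    print(step, f"{lr_at_step(step, total):.2e}")
```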

We can now create our model and trainer:

In [283]:
model = AutoModelForSequenceClassification.from_pretrained(model_db, num_labels = 1)
trainer = Trainer(model, args, train_dataset = dataset_dict['train'], eval_dataset = dataset_dict['test'],
                  tokenizer = tokenizer, compute_metrics = corr_d)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


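(Transformers prints a lot of warnings, but we can ignore them: this one tells us that the classification head is newly initialized, which is exactly the part we are about to train.) Time to train our model: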

In [284]:
trainer.train();

Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.027985,0.778681
2,No log,0.02602,0.810518
3,No log,0.025371,0.821559
4,No log,0.024339,0.823245


Recall that submissions for this problem are evaluated on the Pearson correlation coefficient between predicted and actual similarity scores. As we can see, the Pearson correlation on the validation set increases in every epoch, from about 0.78 after the first epoch to about 0.82 after the fourth.

As a final step, we can get some predictions on the test set:

In [285]:
preds = trainer.predict(final_test_dataset).predictions.astype(float)
preds

array([[ 0.5629527 ],
       [ 0.70889294],
       [ 0.55079079],
       [ 0.40113348],
       [-0.00786009],
       [ 0.5348739 ],
       [ 0.49597657],
       [ 0.01237161],
       [ 0.23974185],
       [ 1.09340847],
       [ 0.25360519],
       [ 0.21818624],
       [ 0.74570519],
       [ 0.95155936],
       [ 0.77776659],
       [ 0.37234166],
       [ 0.26105428],
       [-0.03651068],
       [ 0.64209932],
       [ 0.33149865],
       [ 0.35574993],
       [ 0.25540748],
       [ 0.00154419],
       [ 0.23447916],
       [ 0.58790159],
       [-0.02431577],
       [-0.02842573],
       [ 0.00317116],
       [-0.03988128],
       [ 0.62946129],
       [ 0.35683072],
       [ 0.06851827],
       [ 0.69592309],
       [ 0.49425533],
       [ 0.48538256],
       [ 0.19975954]])

Since some of our predictions are below 0 or above 1, we will clip those out-of-bounds predictions to the valid range:

In [286]:
preds = np.clip(preds, 0, 1)
preds

array([[0.5629527 ],
       [0.70889294],
       [0.55079079],
       [0.40113348],
       [0.        ],
       [0.5348739 ],
       [0.49597657],
       [0.01237161],
       [0.23974185],
       [1.        ],
       [0.25360519],
       [0.21818624],
       [0.74570519],
       [0.95155936],
       [0.77776659],
       [0.37234166],
       [0.26105428],
       [0.        ],
       [0.64209932],
       [0.33149865],
       [0.35574993],
       [0.25540748],
       [0.00154419],
       [0.23447916],
       [0.58790159],
       [0.        ],
       [0.        ],
       [0.00317116],
       [0.        ],
       [0.62946129],
       [0.35683072],
       [0.06851827],
       [0.69592309],
       [0.49425533],
       [0.48538256],
       [0.19975954]])
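
What clipping does to a single value can be sketched in plain Python. `clip` here is a hypothetical scalar helper that mirrors the behaviour of `np.clip` for one number, applied to three of the raw predictions shown earlier:

```python
def clip(value, low=0.0, high=1.0):
    """Limit `value` to the closed interval [low, high]."""
    return max(low, min(high, value))

raw = [0.5629527, -0.00786009, 1.09340847]  # three raw predictions from above
clipped = [clip(p) for p in raw]
print(clipped)
# [0.5629527, 0.0, 1.0]
```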

Since the competition is over, and submitting to a closed competition takes a lot of extra steps, I won't submit this solution to Kaggle. But to summarize: we see a Pearson correlation above 0.8 on the validation set, and we can also get predictions on the test set.