<a href="https://colab.research.google.com/github/jamieoliver/us-patents-240918/blob/main/us-patents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# US Patent Phrase to Phrase Matching

Based on https://www.kaggle.com/code/jhoward/getting-started-with-nlp-for-absolute-beginners.

## Plan
- [x] Initial Setup
  - [x] Install relevant libraries
- [x] Prepare Datasets
  - [x] Download
  - [x] Tokenise
  - [x] Split into:
    - [x] Training
    - [x] Validation
    - [x] Test
- [x] Train Model
- [x] Test Model
- [ ] Upload Model

## Initial Setup

In [1]:
!pip install -Uqq huggingface_hub datasets transformers pyarrow==15.0.2

[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.4.1 requires pyarrow<15.0.0a0,>=14.0.1, but you have pyarrow 15.0.2 which is in

## Prepare Datasets

### Download

In [2]:
from huggingface_hub import snapshot_download

training_path = snapshot_download('jamieoliver/us-patents-240918',
                                  repo_type='dataset',
                                  allow_patterns='*.csv')

test_path = snapshot_download('jamieoliver/us-patents-test-240918',
                              repo_type='dataset',
                              allow_patterns='*.csv')

In [3]:
import pandas as pd
import os.path

training_data = pd.read_csv(os.path.join(training_path, 'train.csv'))
training_data

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


In [4]:
test_data = pd.read_csv(os.path.join(test_path, 'test.csv'))
test_data

Unnamed: 0,id,anchor,target,context
0,4112d61851461f60,opc drum,inorganic photoconductor drum,G02
1,09e418c93a776564,adjust gas flow,altering gas flow,F23
2,36baf228038e314b,lower trunnion,lower locating,B60
3,1f37ead645e7f0c8,cap component,upper portion,D06
4,71a5b6ad068d531f,neural stimulation,artificial neural network,H04
5,474c874d0c07bd21,dry corn,dry corn starch,C12
6,442c114ed5c4e3c9,tunneling capacitor,capacitor housing,G11
7,b8ae62ea5e1d8bdb,angular contact bearing,contact therapy radiation,B23
8,faaddaf8fcba8a3f,produce liquid hydrocarbons,produce a treated stream,C10
9,ae0262c02566d2ce,diesel fuel tank,diesel fuel tanks,F02


### Tokenise

In [5]:
def add_input(data):
  data['input'] = 'TEXT1: ' + data.context + '; TEXT2: ' + data.target + '; ANC1: ' + data.anchor

add_input(training_data)
training_data.input.head()

Unnamed: 0,input
0,TEXT1: A47; TEXT2: abatement of pollution; ANC...
1,TEXT1: A47; TEXT2: act of abating; ANC1: abate...
2,TEXT1: A47; TEXT2: active catalyst; ANC1: abat...
3,TEXT1: A47; TEXT2: eliminating process; ANC1: ...
4,TEXT1: A47; TEXT2: forest region; ANC1: abatement
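
The `head()` output truncates most of these strings. As a minimal sketch in plain Python (the helper name `build_input` is hypothetical and not part of the notebook), the same template applied to the first training row shows the full text the tokenizer will receive:

```python
def build_input(context: str, target: str, anchor: str) -> str:
    # Same template as add_input, applied to one row instead of DataFrame columns.
    return 'TEXT1: ' + context + '; TEXT2: ' + target + '; ANC1: ' + anchor

# Values copied from row 0 of train.csv as displayed earlier.
first_row = build_input('A47', 'abatement of pollution', 'abatement')
print(first_row)
```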


In [6]:
add_input(test_data)
test_data.input.head()

Unnamed: 0,input
0,TEXT1: G02; TEXT2: inorganic photoconductor dr...
1,TEXT1: F23; TEXT2: altering gas flow; ANC1: ad...
2,TEXT1: B60; TEXT2: lower locating; ANC1: lower...
3,TEXT1: D06; TEXT2: upper portion; ANC1: cap co...
4,TEXT1: H04; TEXT2: artificial neural network; ...


In [7]:
from datasets import Dataset, DatasetDict

training_dataset = Dataset.from_pandas(training_data)
training_dataset = training_dataset.rename_column('score', 'label')
training_dataset

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'label', 'input'],
    num_rows: 36473
})

In [8]:
test_dataset = Dataset.from_pandas(test_data)
test_dataset

Dataset({
    features: ['id', 'anchor', 'target', 'context', 'input'],
    num_rows: 36
})

In [9]:
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, logging)

logging.set_verbosity_warning()

In [10]:
model_name = 'microsoft/deberta-v3-small'

In [11]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
list(tokenizer.vocab.items())[:5]

[('▁caring', 6087),
 ('▁acetone', 62189),
 ('▁lex', 70460),
 ('▁Cd', 39058),
 ('zung', 92136)]

In [12]:
def tokenize(x):
  return tokenizer(x['input'], padding='max_length', truncation=True, max_length=64)

training_dataset = training_dataset.map(tokenize, batched=True)
training_dataset[0]['input_ids']

[1, 54453, 435, 294, 336, 5753, 346, 54453, 445, 294, 47284, 265, 6435, 346, 23702, 435, 294, 47284, 2,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
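
The trailing zeros are padding: `padding='max_length'` fills every sequence up to `max_length=64`. A minimal plain-Python sketch of that step follows, assuming a pad token id of 0 (consistent with the output shown for this cell). `pad_to_length` is a hypothetical helper for illustration, not the tokenizer's actual implementation, and its truncation is a naive slice.

```python
def pad_to_length(ids, max_length=64, pad_id=0):
    # Naively cut anything longer than max_length, then right-pad with pad_id.
    ids = ids[:max_length]
    n_pad = max_length - len(ids)
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [pad_id] * n_pad, attention_mask

# The 19 non-padding ids from the first training example.
real_ids = [1, 54453, 435, 294, 336, 5753, 346, 54453, 445, 294,
            47284, 265, 6435, 346, 23702, 435, 294, 47284, 2]
input_ids, attention_mask = pad_to_length(real_ids)
print(len(input_ids), sum(attention_mask))
```

Padding to a fixed length keeps every batch the same shape, at the cost of computing over positions that the attention mask tells the model to ignore.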

In [13]:
test_dataset = test_dataset.map(tokenize, batched=True)
test_dataset[0]['input_ids']

[1, 54453, 435, 294, 1098, 4159, 346, 54453, 445, 294, 31553, 1456, 48133, 8263, 346, 23702, 435, 294, 8847, 1207, 8263, 2,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

### Split

In [14]:
training_dataset, validation_dataset = training_dataset.train_test_split(0.25, seed=42).values()
dataset_dict = DatasetDict({'train':training_dataset,
                            'valid':validation_dataset,
                            'test':test_dataset})

dataset_dict

DatasetDict({
    train: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 27354
    })
    valid: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'label', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 9119
    })
    test: Dataset({
        features: ['id', 'anchor', 'target', 'context', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 36
    })
})
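
The row counts follow from the 0.25 split fraction. The arithmetic can be checked with a short sketch, assuming the validation share is rounded up and the training set takes the remainder. That rounding is an assumption that matches the counts shown here, not a statement about the library's internals.

```python
import math

total_rows = 36473
valid_fraction = 0.25

# 36473 * 0.25 = 9118.25, which rounds up to a whole number of rows.
valid_rows = math.ceil(total_rows * valid_fraction)
train_rows = total_rows - valid_rows
print(train_rows, valid_rows)
```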

## Train Model

In [15]:
import numpy as np

def pearson(eval_preds):
  return {'pearson': np.corrcoef(*eval_preds)[0][1]}
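
`np.corrcoef` returns a 2×2 correlation matrix, and `[0][1]` picks out the correlation between predictions and labels. For reference, a plain-Python sketch of the same Pearson coefficient is below. `pearson_r` is a hypothetical helper for illustration, and the sample values are made up.

```python
import math

def pearson_r(xs, ys):
    # Co-variation of xs and ys divided by the product of their spreads.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

predictions = [0.1, 0.4, 0.5, 0.9]  # made-up model outputs
labels = [0.0, 0.5, 0.5, 1.0]       # made-up similarity scores
print(round(pearson_r(predictions, labels), 3))
```

A value of 1 means the predictions rank and scale with the labels perfectly, 0 means no linear relationship, and -1 means a perfectly inverted one.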

In [16]:
num_epochs = 4
batch_size = 128
learning_rate = 8e-5

training_args = TrainingArguments('outputs',
                                  eval_strategy='epoch',
                                  per_device_train_batch_size=batch_size,
                                  per_device_eval_batch_size=batch_size * 2,
                                  learning_rate=learning_rate,
                                  weight_decay=0.1,
                                  num_train_epochs=num_epochs,
                                  lr_scheduler_type='cosine',
                                  warmup_ratio=0.1,
                                  fp16=True,
                                  report_to='none')

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

trainer = Trainer(model,
                  training_args,
                  train_dataset=dataset_dict['train'],
                  eval_dataset=dataset_dict['valid'],
                  tokenizer=tokenizer,
                  compute_metrics=pearson)

Some weights of DebertaV2ForSequenceClassification were not initialized from the model checkpoint at microsoft/deberta-v3-small and are newly initialized: ['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

In [17]:
trainer.train()

Epoch,Training Loss,Validation Loss,Pearson
1,No log,0.031631,0.794567
2,No log,0.023977,0.817735
3,0.033300,0.023089,0.827697
4,0.033300,0.023366,0.829145


TrainOutput(global_step=856, training_loss=0.025266726440358385, metrics={'train_runtime': 367.4566, 'train_samples_per_second': 297.766, 'train_steps_per_second': 2.33, 'total_flos': 1811788837493760.0, 'train_loss': 0.025266726440358385, 'epoch': 4.0})
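
The step count and the `No log` entries can be reproduced by hand. The sketch below assumes a single device and the `Trainer` default of logging the training loss every 500 steps (`logging_steps` is not set in the arguments above). Variable names are illustrative.

```python
import math

train_rows = 27354
batch_size = 128
num_epochs = 4
logging_steps = 500  # assumed Trainer default, not set in this notebook

# The last batch of each epoch is partial, so round up.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
# The epoch during which step 500 is reached and the first loss is logged.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)
print(steps_per_epoch, total_steps, first_logged_epoch)
```

This agrees with `global_step=856`, and it explains why a training loss only appears from epoch 3 onward.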

## Test Model

In [18]:
test_dataset = dataset_dict['test']
prediction_output = trainer.predict(test_dataset)

result_dataset = Dataset.from_dict({
    'id': test_dataset['id'],
    'anchor': test_dataset['anchor'],
    'target': test_dataset['target'],
    'context': test_dataset['context'],
    'score': np.clip(prediction_output.predictions.flatten(), 0, 1)
})

result_dataset.to_pandas()


Unnamed: 0,id,anchor,target,context,score
0,4112d61851461f60,opc drum,inorganic photoconductor drum,G02,0.595703
1,09e418c93a776564,adjust gas flow,altering gas flow,F23,0.699707
2,36baf228038e314b,lower trunnion,lower locating,B60,0.580078
3,1f37ead645e7f0c8,cap component,upper portion,D06,0.336426
4,71a5b6ad068d531f,neural stimulation,artificial neural network,H04,0.0
5,474c874d0c07bd21,dry corn,dry corn starch,C12,0.500488
6,442c114ed5c4e3c9,tunneling capacitor,capacitor housing,G11,0.491943
7,b8ae62ea5e1d8bdb,angular contact bearing,contact therapy radiation,B23,0.0
8,faaddaf8fcba8a3f,produce liquid hydrocarbons,produce a treated stream,C10,0.26001
9,ae0262c02566d2ce,diesel fuel tank,diesel fuel tanks,F02,1.0
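
Scores of exactly `0.0` and `1.0` are what `np.clip` produces when a raw regression output falls outside the valid range. A plain-Python sketch of that capping step, with made-up raw values and a hypothetical `clip` helper:

```python
def clip(value, low=0.0, high=1.0):
    # Cap value to the closed interval [low, high].
    return max(low, min(high, value))

raw_predictions = [-0.03, 0.26, 0.7, 1.08]  # made-up raw model outputs
clipped = [clip(p) for p in raw_predictions]
print(clipped)
```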
