<a href="https://colab.research.google.com/github/matthewleechen/woodcroft_patents/blob/main/industry_class/industry_class_patents.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook is based on Niels Rogge's (extremely helpful!) notebook, "Fine-tuning BERT (and friends) for multi-label text classification", linked [here](https://github.com/matthewleechen/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb).

Running this notebook on the Colab free plan is not recommended. The training loop was originally run on Colab Pro with a single Nvidia Tesla V100 (16GB) GPU. You can also run it on your own virtual machine or server, but check the dependencies carefully first.

This notebook uses [MacBERTh](https://huggingface.co/emanjavacas/MacBERTh), a BERT model pre-trained on historical English (c.1450-1950; original paper linked [here](https://jdmdh.episciences.org/9690)), to classify inventions into industry categories.

This notebook works with any model available through [HuggingFace Transformers](https://huggingface.co/docs/transformers/index). I have experimented with BERT base ([cased](https://huggingface.co/bert-base-cased) and [uncased](https://huggingface.co/bert-base-uncased)), RoBERTa ([base](https://huggingface.co/roberta-base) and [distilled](https://huggingface.co/distilroberta-base)), XLNet ([base](https://huggingface.co/xlnet-base-cased)), and [SBERT](https://www.sbert.net) models, and find that MacBERTh marginally outperforms them across a range of hyperparameters.

**Setup**

In [1]:
%%capture
!pip install transformers==4.29.0 
!pip install datasets
!pip install accelerate

In [2]:
from datasets import load_dataset, Dataset, DatasetDict
from transformers import set_seed, Trainer, AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, EvalPrediction, pipeline
import numpy as np
import pandas as pd
import torch
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from sklearn.model_selection import train_test_split
from tqdm import tqdm

In [3]:
# Set seed
set_seed(42)

**Load data**

In [None]:
# Read the CSV file into a pandas DataFrame
df = pd.read_csv('labelled_data_patents.csv')

# Encode the "Industry" column into separate columns using one-hot encoding
df_encoded = pd.get_dummies(df['Industry'])

# Merge the original DataFrame with the encoded columns
df_final = pd.concat([df, df_encoded], axis=1)

# Make each industry column boolean: True where the row belongs to that industry, False otherwise
for industry in df['Industry'].unique():
    df_final[industry] = df_final['Industry'] == industry

# Remove the original "Industry" column
df_final.drop('Industry', axis=1, inplace=True)

df_final.head()
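
As a rough illustration of what the one-hot step above produces, the sketch below builds the same kind of boolean columns in plain Python. The industry names and the helper `one_hot` are hypothetical and are not taken from the dataset.

```python
def one_hot(values):
    """Return (sorted categories, rows of booleans), one column per category."""
    categories = sorted(set(values))
    rows = [[value == category for category in categories] for value in values]
    return categories, rows


# Hypothetical industry labels, for illustration only
industries = ["Textiles", "Chemicals", "Textiles"]
categories, rows = one_hot(industries)

print(categories)
for row in rows:
    print(row)
```

Each row has exactly one `True`, because every patent in the labelled data carries a single industry.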

In [None]:
# Load dataset and train-test split
dataset = df_final

# Split the dataset into features and labels
X = dataset[['num', 'text']]
y = dataset.drop(['num', 'text'], axis=1)

# Split the dataset into train, validation, and test sets
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)  # 0.25 * 0.8 = 0.2

train_df = pd.concat([X_train, y_train], axis=1).reset_index(drop=True)
val_df = pd.concat([X_val, y_val], axis=1).reset_index(drop=True)
test_df = pd.concat([X_test, y_test], axis=1).reset_index(drop=True)

# Drop the "__index_level_0__" column if it exists
for split_df in (train_df, val_df, test_df):
    if '__index_level_0__' in split_df.columns:
        split_df.drop('__index_level_0__', axis=1, inplace=True)
    
dataset = DatasetDict({
    'train': Dataset.from_pandas(train_df),
    'validation': Dataset.from_pandas(val_df),
    'test': Dataset.from_pandas(test_df)
})

dataset
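
The two-stage split above can be checked with a little exact arithmetic. This is only a sketch of the proportions; it does not reproduce how scikit-learn rounds the split sizes for a given number of rows.

```python
from fractions import Fraction

# First split: test_size=0.2 holds out one fifth of the data for testing
test_fraction = Fraction(1, 5)
remaining = 1 - test_fraction

# Second split: test_size=0.25 is applied to the remaining four fifths
val_fraction = Fraction(1, 4) * remaining
train_fraction = remaining - val_fraction

print(train_fraction, val_fraction, test_fraction)
```

The result is a 60/20/20 split into train, validation, and test sets.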


In [None]:
# Visualize dataset as dictionary with 3 splits
dataset

In [None]:
# check example entry
example = dataset['train'][0]
example

In [None]:
# Create labels
labels = [label for label in dataset['train'].features.keys() if label not in ['num', 'text']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

**Data Pre-processing**

As models like BERT don't expect text as direct input, but rather `input_ids`, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.
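
To make the fixed-length encoding concrete, here is a minimal sketch of what padding and truncating to a maximum length does to a list of token ids. The helper `pad_or_truncate`, the token ids, and the padding id of 0 are all assumptions made for illustration. A real tokenizer also handles special tokens when truncating, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncate long sequences
    padding = [pad_id] * (max_length - len(kept))    # pad short sequences
    input_ids = kept + padding
    attention_mask = [1] * len(kept) + [0] * len(padding)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# Hypothetical token ids, for illustration only
short_ids, short_mask = pad_or_truncate([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short_ids, short_mask)
print(long_ids, long_mask)
```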

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a matrix of shape (batch_size, num_labels). Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's `BCEWithLogitsLoss` (which the model will use) will complain, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).
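
For intuition about why the targets are floats, the sketch below computes binary cross-entropy from raw logits in plain Python, averaging over labels (the mean is PyTorch's default reduction for this loss). The function name `bce_with_logits` and the example numbers are hypothetical; this is not the library's implementation, which uses a more numerically stable formulation.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    losses = []
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid turns the logit into a probability
        # The target y multiplies a log-probability, so it is treated as a float
        losses.append(-(y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Hypothetical logits and one-hot float targets for a single example with 3 labels
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 0.0]
loss = bce_with_logits(logits, targets)
print(round(loss, 4))
```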

In [None]:
tokenizer = AutoTokenizer.from_pretrained("emanjavacas/MacBERTh") # change tokenizer here

def preprocess_data(examples):
  # take a batch of texts
  text = examples["text"]
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=128)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  
  return encoding

In [10]:
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)

Map:   0%|          | 0/7596 [00:00<?, ? examples/s]

Map:   0%|          | 0/2532 [00:00<?, ? examples/s]

Map:   0%|          | 0/2532 [00:00<?, ? examples/s]

In [None]:
example = encoded_dataset['train'][0]
print(example.keys())

In [None]:
tokenizer.decode(example['input_ids'])

In [None]:
example['labels']

In [None]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

In [15]:
# Set PyTorch tensors
encoded_dataset.set_format("torch")

**Define model**

In [None]:
model = AutoModelForSequenceClassification.from_pretrained("emanjavacas/MacBERTh", # change model here
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

**Training** 

Training uses HuggingFace's [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer): hyperparameters are specified using [TrainingArguments](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments) and the training loop is specified using the [Trainer](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Trainer) object.

In [17]:
batch_size = 16
metric_name = "f1"

In [18]:
# Specify hyperparameters
args = TrainingArguments(
    "emanjavacas/MacBERTh", # output directory for checkpoints; rename it if you change the model
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name
)

In [19]:
# define function to compute metrics while training
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
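
To show what the micro-averaged F1 above measures, here is a small plain-Python sketch that applies the same sigmoid and threshold and then counts true positives, false positives, and false negatives across all labels at once. The function `micro_f1_from_logits` and the example values are hypothetical, and returning 0.0 when there are no positives is a convention chosen for this sketch.

```python
import math


def micro_f1_from_logits(logits, labels, threshold=0.5):
    """Micro-averaged F1: pool TP/FP/FN over every (example, label) pair."""
    tp = fp = fn = 0
    for logit_row, label_row in zip(logits, labels):
        for z, y in zip(logit_row, label_row):
            prob = 1.0 / (1.0 + math.exp(-z))      # sigmoid
            pred = 1 if prob >= threshold else 0   # threshold into a 0/1 prediction
            if pred == 1 and y == 1:
                tp += 1
            elif pred == 1 and y == 0:
                fp += 1
            elif pred == 0 and y == 1:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical logits and labels for two examples with three labels each
logits = [[2.0, -1.0, 0.5], [-2.0, 1.5, -0.3]]
labels = [[1, 0, 0], [0, 0, 1]]
score = micro_f1_from_logits(logits, labels)
print(score)
```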

In [None]:
encoded_dataset['train'][0]['labels'].type()

In [None]:
encoded_dataset['train']['input_ids'][0]

In [None]:
#forward pass
outputs = model(input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0), labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs

In [23]:
# specify training loop
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [None]:
# Run training
trainer.train()

**Evaluation**

In [None]:
# Run evaluation
trainer.evaluate()

In [None]:
# Save model
model_path = "/content/emanjavacas"

model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

**Inference**

The code below runs inference using the HuggingFace Pipelines API; documentation is linked [here](https://huggingface.co/docs/transformers/main_classes/pipelines).

Running this code on a GPU is strongly recommended. An Nvidia Tesla T4 GPU (provided on the Colab free plan) is far faster than the CPU: inference on the full set of patents takes approximately 20-25 minutes on the T4, but several hours on the Colab CPU.
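
The pipeline cell below hard-codes `device = 0`, which fails on a machine without a GPU. One possible way to choose the device automatically is sketched here; the helper `pick_pipeline_device` is hypothetical and simply falls back to the CPU when CUDA (or PyTorch itself) is unavailable.

```python
def pick_pipeline_device():
    """Return 0 (first GPU) if CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        return -1  # PyTorch not installed: fall back to CPU
    return 0 if torch.cuda.is_available() else -1


device = pick_pipeline_device()
print(device)
```

The returned value could then be passed as `device=device` when constructing the pipeline.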

In [27]:
# Deploy model and tokenizer for inference
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [None]:
# Deploy Pipeline API: device = 0 for GPU, device = -1 is default (for CPU)
pipe = pipeline(task="text-classification", model=model, device = 0, tokenizer=tokenizer)

In [None]:
# Define input and output csv
input_csv = "/path/to/input/csv" # path to cleaned ner output
output_csv = "/path/to/output/csv" # path to outputted file

In [None]:
# Load in cleaned NER output as a .csv
df = pd.read_csv(input_csv)

In [None]:
# Run inference loop
def classify(phrase):
    result = pipe(phrase)
    return result[0]["label"]

# Apply the function on the misc column and save the output to pred_industry column
df["misc"] = df["misc"].astype(str)
df["pred_industry"] = ""

with tqdm(total=len(df), desc="Classifying") as pbar:
    for index, row in df.iterrows():
        df.loc[index, "pred_industry"] = classify(row["misc"])
        pbar.update(1)

# Save the updated dataframe to a new csv file
df.to_csv(output_csv, index=False)
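
The loop above calls the pipeline once per row. As a possible alternative, the sketch below passes texts to the pipeline in chunks. It assumes that a text-classification pipeline called with a list of strings returns one `{"label", "score"}` dictionary per input; the helper `classify_in_batches` and the stand-in `fake_pipe` are hypothetical, with `fake_pipe` present only so the example runs without a trained model.

```python
def classify_in_batches(texts, pipe, batch_size=32):
    """Call `pipe` on successive chunks of `texts` and collect one label per text."""
    labels = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        results = pipe(chunk)  # assumed: one {"label", "score"} dict per input text
        labels.extend(result["label"] for result in results)
    return labels


# Hypothetical stand-in for the real pipeline, for illustration only
def fake_pipe(chunk):
    return [{"label": "long" if len(text) > 10 else "short", "score": 1.0} for text in chunk]


texts = ["a loom", "an improved steam engine", "a pump"]
predicted = classify_in_batches(texts, fake_pipe, batch_size=2)
print(predicted)
```

With the real pipeline, `predicted` could be assigned to `df["pred_industry"]` in place of the row-by-row loop.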