# BERT Venue Classifier Training

Now that we have a labeled venue dataset, we need a way to score each venue against each persona. We can use a multi-label classification model with 8 outputs, one per persona, for this task. We feed the business name, categories, feature list, and one-sentence description (or some subset of these elements) into BERT as input, and use the persona labels provided by GPT to fine-tune a BERT checkpoint to produce relationship scores between a venue and our set of personas.

1. **Dataset Creation**

We start by creating a labeled dataset that can be fed directly to BERT for a fine-tuning job. This lets us set up a function for transforming a row into an input prompt, which we can modify to test how different input formats affect the results of the BERT encoder model.

2. **Fine-Tuning**

We will fine-tune the model using the labeled dataset. This uses Hugging Face `transformers` and the `AutoModelForSequenceClassification` class, which wraps our BERT checkpoint with a classification head; the matching tokenizer is loaded separately with `AutoTokenizer`.

3. **Evaluation**

After fine-tuning is complete, we will assess the model's performance on our test dataset. This will show how the model distributes scores across the 8 labels for each venue. We can modify the input string if necessary based on these results.

In [1]:
import json

import pandas as pd
from dotenv import load_dotenv

load_dotenv()


True

### Dataset Creation

To start, we will set up a pipeline that takes our Yelp location data and creates rows that can be used to fine-tune our classifier model.

In [2]:

with open("../data/venues/yelp.json", "r") as f:
    location_data = json.load(f)

personas = ["socialButterfly", "culinaryExplorer", "beautyFashionAficionado", "familyOrientedIndividual", "artCultureEnthusiast", "wellnessSelfCareAdvocate", "adventurerExplorer", "ecoConsciousConsumer"]

dataset = []
for loc in location_data:
    labels = {persona: 1 if persona in loc['personas'] else 0 for persona in personas}
    data = {
        "id": loc["id"],
        "biz_name": loc["name"],
        "categories": ', '.join([cat['title'] for cat in loc["categories"]]),
        "biz_features": loc["biz_features"],
        'summary': loc["business_summary"],
    }
    row = {**data, **labels}
    dataset.append(row)

df = pd.DataFrame(dataset)

In [3]:
def format_input(row):
    return (
        f"Name: {row.biz_name}\n"
        f"Categories: {row.categories}\n"
        f"Biz Features: {row.biz_features}\n"
        f"Summary: {row.summary}\n"
    )

df['input'] = df.apply(format_input, axis=1)

# inputs = tokenizer(df['input'].tolist(), return_tensors='pt', padding=True, truncation=True, max_length=512)
# labels = torch.tensor(df[personas].values, dtype=torch.float)

# dataset = TensorDataset(inputs['input_ids'], inputs['attention_mask'], labels)
# train_size = int(0.9 * len(dataset))
# val_size = len(dataset) - train_size

# train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

# train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
# val_dataloader = DataLoader(val_dataset, batch_size=16)
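
Because the plan is to test different subsets of the input fields, it may help to parameterize the formatter. The sketch below is an illustration under assumptions, not part of the original pipeline: `FIELD_LABELS` and `format_input_subset` are hypothetical names, and the example row uses made-up values so the block runs on its own.

```python
# Hypothetical helper: build the model input from a chosen subset of fields.
FIELD_LABELS = {
    "biz_name": "Name",
    "categories": "Categories",
    "biz_features": "Biz Features",
    "summary": "Summary",
}

def format_input_subset(row, fields=("biz_name", "categories", "biz_features", "summary")):
    """Return one 'Label: value' line per requested field, in the order given."""
    return "".join(f"{FIELD_LABELS[field]}: {row[field]}\n" for field in fields)

# Made-up values for illustration only.
example_row = {
    "biz_name": "Example Cafe",
    "categories": "Coffee & Tea, Bakeries",
    "biz_features": "Outdoor seating",
    "summary": "A small neighborhood cafe.",
}

name_and_categories = format_input_subset(example_row, fields=("biz_name", "categories"))
all_fields = format_input_subset(example_row)
print(name_and_categories)
```

Since a pandas row also supports `row[field]` lookups, the same function could be passed to `df.apply(..., axis=1)` with a `lambda` that fixes the `fields` argument.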


### Fine-Tuning the Model

In [4]:
from transformers import AutoModelForSequenceClassification, AutoTokenizer

bert_chkpt = "distilbert-base-uncased"

model = AutoModelForSequenceClassification.from_pretrained(bert_chkpt, num_labels=len(personas), problem_type="multi_label_classification")
tokenizer = AutoTokenizer.from_pretrained(bert_chkpt)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [26]:
import torch
import numpy as np
from datasets import Dataset, DatasetDict

inputs = tokenizer(df.input.tolist(), return_tensors='pt', padding=True, truncation=True, max_length=512)
labels = torch.tensor(df[personas].values, dtype=torch.float)

# Shuffle row indices with a fixed seed so the split is reproducible
indices = np.arange(len(labels))

np.random.seed(100)
np.random.shuffle(indices)

# Fixed-size split: 800 train, 200 validation, remainder test
train_indices = indices[:800]
val_indices = indices[800:1000]
test_indices = indices[1000:]

train_dataset = Dataset.from_dict({'input_ids': inputs['input_ids'][train_indices], 'attention_mask': inputs['attention_mask'][train_indices], 'labels': labels[train_indices]})
val_dataset = Dataset.from_dict({'input_ids': inputs['input_ids'][val_indices], 'attention_mask': inputs['attention_mask'][val_indices], 'labels': labels[val_indices]})
test_dataset = Dataset.from_dict({'input_ids': inputs['input_ids'][test_indices], 'attention_mask': inputs['attention_mask'][test_indices], 'labels': labels[test_indices]})

dataset_dict = DatasetDict({
    'train': train_dataset,
    'val': val_dataset,
    'test': test_dataset
})
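
The split above hard-codes the 800 and 1000 cut points, which only makes sense for a dataset of roughly this size. A fraction-based split is one alternative. The sketch below is an assumption rather than the notebook's method: `split_indices` is a hypothetical helper, and it uses the standard-library `random` module, so the shuffle order will differ from the `np.random.seed(100)` split used here.

```python
import random

def split_indices(n_rows, train_frac=0.8, val_frac=0.1, seed=100):
    """Shuffle range(n_rows) and cut it into train/val/test index lists."""
    if not 0 < train_frac + val_frac <= 1:
        raise ValueError("train_frac + val_frac must be in (0, 1]")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, leaves global state untouched
    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three lists are disjoint and together cover every row, so they can index the tokenized tensors the same way the fixed slices do.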


In [27]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)

trainer = Trainer(
    model=model,                            # the instantiated 🤗 Transformers model to be trained
    args=training_args,                     # training arguments, defined above
    train_dataset=dataset_dict['train'],    # training dataset
    eval_dataset=dataset_dict['val']        # evaluation dataset
)



In [28]:
result = trainer.train()

100%|██████████| 150/150 [01:01<00:00,  2.45it/s]

{'train_runtime': 61.2616, 'train_samples_per_second': 39.176, 'train_steps_per_second': 2.449, 'train_loss': 0.5250987752278646, 'epoch': 3.0}
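
One thing worth checking in this run: the progress bar reports 150 optimizer steps in total, while `warmup_steps=500`. Assuming the Trainer's default linear warmup-then-decay schedule and the default base learning rate of 5e-5 (neither is set explicitly above), the learning rate would still be ramping up when training ends. The sketch below reproduces that multiplier in plain Python; `linear_schedule_factor` is an illustrative name, not a library function.

```python
def linear_schedule_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 over the warmup period
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 5e-5     # assumed default; not set explicitly in the notebook
TOTAL_STEPS = 150  # taken from the training progress bar

for warmup in (500, 15):
    peak = max(
        linear_schedule_factor(step, warmup, TOTAL_STEPS)
        for step in range(TOTAL_STEPS + 1)
    )
    print(f"warmup_steps={warmup}: peak factor {peak:.2f}, peak LR {BASE_LR * peak:.2e}")
```

Under these assumptions, the 500-step warmup never lets the multiplier rise above roughly 0.3, so a warmup of around 10% of the total steps (15 here, as an arbitrary example) may be a more reasonable starting point.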





In [60]:
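import torch
import numpy as np

label_map = {i: label for i, label in enumerate(personas)}
predictions = trainer.predict(dataset_dict['test'])
predictions_tensor = torch.tensor(predictions.predictions)  # raw logits, one column per persona

# NOTE: softmax normalizes across the 8 personas, so each row sums to 1 and the
# scores compete with each other. The model was trained as a multi-label
# classifier, so torch.sigmoid(predictions_tensor) would give independent
# per-persona probabilities instead. The statistics below come from softmax.
predictions = torch.nn.functional.softmax(predictions_tensor, dim=-1)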

100%|██████████| 7/7 [00:02<00:00,  2.86it/s]


In [61]:
labels_df = pd.DataFrame(predictions, columns=personas)
labels_df.describe()

| | socialButterfly | culinaryExplorer | beautyFashionAficionado | familyOrientedIndividual | artCultureEnthusiast | wellnessSelfCareAdvocate | adventurerExplorer | ecoConsciousConsumer |
|---|---|---|---|---|---|---|---|---|
| count | 425.0 | 425.0 | 425.0 | 425.0 | 425.0 | 425.0 | 425.0 | 425.0 |
| mean | 0.146046 | 0.140709 | 0.038303 | 0.119202 | 0.171325 | 0.054139 | 0.294034 | 0.036241 |
| std | 0.11023 | 0.163272 | 0.009131 | 0.037351 | 0.140378 | 0.011574 | 0.236626 | 0.005121 |
| min | 0.034029 | 0.022724 | 0.02394 | 0.055401 | 0.043957 | 0.033863 | 0.036282 | 0.027742 |
| 25% | 0.054828 | 0.032655 | 0.027731 | 0.101143 | 0.059879 | 0.046889 | 0.088899 | 0.032732 |
| 50% | 0.095441 | 0.053213 | 0.042869 | 0.128139 | 0.097326 | 0.056167 | 0.171892 | 0.036082 |
| 75% | 0.253147 | 0.179457 | 0.04528 | 0.14798 | 0.293614 | 0.063199 | 0.556664 | 0.040064 |
| max | 0.424459 | 0.501917 | 0.059434 | 0.189813 | 0.456688 | 0.078829 | 0.670703 | 0.048961 |
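
These summary statistics describe how the scores are distributed, but not how well they agree with the GPT-provided labels. One possible next step is a per-persona precision/recall/F1 report on thresholded scores. The sketch below is an assumption about how that could look, not the notebook's evaluation: `per_label_f1` is a hypothetical helper, the 0.5 threshold is arbitrary, the logits and targets are toy values, and it uses a sigmoid so each persona is scored independently.

```python
import math

def sigmoid(x):
    """Map a raw logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def per_label_f1(logits, targets, label_names, threshold=0.5):
    """Compute precision, recall and F1 for each label independently."""
    report = {}
    for j, name in enumerate(label_names):
        preds = [1 if sigmoid(row[j]) >= threshold else 0 for row in logits]
        truth = [int(row[j]) for row in targets]
        tp = sum(p == 1 and t == 1 for p, t in zip(preds, truth))
        fp = sum(p == 1 and t == 0 for p, t in zip(preds, truth))
        fn = sum(p == 0 and t == 1 for p, t in zip(preds, truth))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[name] = {"precision": precision, "recall": recall, "f1": f1}
    return report

# Toy values for illustration only: 3 venues, 2 personas.
toy_logits = [[2.0, -1.0], [0.5, 1.5], [-3.0, 0.2]]
toy_targets = [[1, 0], [0, 1], [0, 0]]
report = per_label_f1(toy_logits, toy_targets, ["socialButterfly", "culinaryExplorer"])
for name, scores in report.items():
    print(name, scores)
```

With the notebook's objects, the inputs would presumably be `predictions_tensor.tolist()` and the test split's `labels` column, with `personas` as the label names.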


In [57]:
trainer.save_model("../models/bert-yelp")