# LLM Segment Classifier

This notebook reports the methodology followed to fine tune the large language model [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)) to automatically identify sections in job description documents. The model takes a sentence contained in a job description as input and produces as output the section that the sentence belongs to. Inspiration on the applied methodology was taken from [Fine-Tuning BERT for Text Classification](https://towardsdatascience.com/fine-tuning-bert-for-text-classification-54e7df642894).

# Section 0. Preliminaries

## Load libraries

In [1]:
# Update accordingly
run_on_google_colab = True
project_dir = '/content/drive/MyDrive'

if run_on_google_colab:
  from google.colab import drive
  drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [2]:
import sys
sys.path.append(project_dir)

In [3]:
import numpy as np
import os
import pandas as pd
import torch
import warnings

warnings.filterwarnings('ignore')

from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from tqdm import trange

from utils import *

## Initialize constants

In [4]:
OVERWRITE_FINETUNNING = False
random_state = np.random.RandomState(1234)  # for reproducibility

## Load data

In [5]:
data_dir = f'{project_dir}/data'
data_fn = 'jobs_training.csv'
models_dir = f'{project_dir}/models'
output_dir = f'{project_dir}/outputs'

In [6]:
job_training_df = pd.read_csv(os.path.join(data_dir, data_fn))

## Preview data

Visualize the size of the data togethet with a small sample

In [7]:
print(f'dataset size: {job_training_df.shape[0]}x{job_training_df.shape[1]}')
print('-'*10)
job_training_df.head(10)

dataset size: 3885x4
----------


Unnamed: 0,job_id,segment_index,segment,section_label
0,05b865e93e8e46579075562865973d3b,0,Abbott is a global healthcare leader that help...,About Company
1,05b865e93e8e46579075562865973d3b,94,Our portfolio of life-changing technologies sp...,About Company
2,05b865e93e8e46579075562865973d3b,287,"Our 109,000 colleagues\nserve people in more t...",About Company
3,05b865e93e8e46579075562865973d3b,353,"**Tissue Trainer – St. Paul, MN**",Job Title
4,05b865e93e8e46579075562865973d3b,388,Our business purpose is to restore health and ...,About Company
5,05b865e93e8e46579075562865973d3b,573,We aim to lead the markets we serve by requiri...,About Company
6,05b865e93e8e46579075562865973d3b,951,**WHAT YOU’LL DO**,Job Responsibilities/Summary
7,05b865e93e8e46579075562865973d3b,971,We are recruiting for a Tissue Trainer located...,Job Responsibilities/Summary
8,05b865e93e8e46579075562865973d3b,1035,"You are\nresponsible for the coordination, imp...",Job Responsibilities/Summary
9,05b865e93e8e46579075562865973d3b,1238,Coordinates the ongoing and\nrecurring system ...,Job Responsibilities/Summary


Distribution of values by type of sections

In [8]:
job_training_df.section_label.value_counts()

Job Responsibilities/Summary    1453
Job Skills/Requirements         1012
Other                            506
About Company                    425
Benefits                         291
EOE/Diversity                    163
Job Title                         35
Name: section_label, dtype: int64

Distribution of proportions by type of sections

In [9]:
job_training_df.section_label.value_counts()/job_training_df.shape[0]

Job Responsibilities/Summary    0.374003
Job Skills/Requirements         0.260489
Other                           0.130245
About Company                   0.109395
Benefits                        0.074903
EOE/Diversity                   0.041956
Job Title                       0.009009
Name: section_label, dtype: float64

Previous output shows that there are `seven classes` classes, which are unbalaced. Now, check if there are missing values.

In [10]:
# Check for null values
job_training_df.isnull().sum()

job_id           0
segment_index    0
segment          0
section_label    0
dtype: int64

---

# Section 1. Feature engineering

Instantiate BERT tokenizer

In [11]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

tokenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

## Pre-process text

Using BERT tokenizer, let's preprocess segments

In [12]:
def text_preprocessing(text, tokenizer, max_length=50):
  '''
  Returns <class transformers.tokenization_utils_base.BatchEncoding> with the following fields:
    - input_ids: list of token ids
    - token_type_ids: list of token type ids
    - attention_mask: list of indices (0,1) specifying which tokens should considered by the model (return_attention_mask = True).
  '''
  encoded_text = tokenizer.encode_plus(
    text,
    add_special_tokens = True,
    max_length = max_length,
    pad_to_max_length = True,
    return_attention_mask = True,
    return_tensors = 'pt',
    truncation = True
  )
  return encoded_text

In [13]:
token_ids = []
attention_masks = []
max_length = 200 # taken from 2_lstm_segment_classifier-model_development

for segment in job_training_df.segment:
  encoding_dict = text_preprocessing(segment, tokenizer, max_length)
  token_ids.append(encoding_dict['input_ids'])
  attention_masks.append(encoding_dict['attention_mask'])

token_id = torch.cat(token_ids, dim = 0)
attention_masks = torch.cat(attention_masks, dim = 0)

### Encode categorical labels

The target variable `section_label` contains categorical data, which need to be converted to numbers before usign them to train machine learning algorithms. Before encoding, labels that correspond to empty segments are removed.

In [14]:
labels = list(job_training_df['section_label'].values)
unique_labels = list(set(labels))
encoded_labels = encode_labels(unique_labels, labels, models_dir)
encoded_labels = torch.tensor(encoded_labels)

# Section 2. Datasets preparation

## Split dataset

Split the dataset into train and test, holding 20% for testing

In [15]:
train_idx, test_idx = train_test_split(
  np.arange(len(encoded_labels)),
  test_size = 0.20,
  shuffle = True,
  stratify = encoded_labels,
  random_state = random_state
)

Split the test set into validation and test

In [16]:
test_encoded_labels =  encoded_labels[test_idx]
valid_idx, test_idx = train_test_split(
    np.arange(len(test_encoded_labels)),
    test_size = 0.50,
    stratify = test_encoded_labels,
    random_state = random_state
)

## Batching datasets

Create tensor datasets and dataloaders to be used in training the neural network.

In [17]:
# Create tensor datasets
train_data = TensorDataset(
    token_id[train_idx],
    attention_masks[train_idx],
    encoded_labels[train_idx]
)
valid_data = TensorDataset(
    token_id[valid_idx],
    attention_masks[valid_idx],
    encoded_labels[valid_idx]
)
test_data = TensorDataset(
    token_id[test_idx],
    attention_masks[test_idx],
    encoded_labels[test_idx]
)

In [18]:
# Define batch size
batch_size = 32 # recommended by https://arxiv.org/pdf/1810.04805.pdf

# Batching datasets
train_loader = DataLoader(
    train_data,
    shuffle=True,
    batch_size=batch_size
)
valid_loader = DataLoader(
    valid_data,
    batch_size=batch_size
)
test_loader = DataLoader(
    test_data,
    batch_size=batch_size
)

# Section 3. Fine-tune BERT

## Instantiate model

In [19]:
# Load the BertForSequenceClassification model
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels = len(unique_labels),
    output_attentions = False,
    output_hidden_states = False,
)

# Run on GPU is available
if torch.cuda.is_available():
  print('GPU available and will be used for training.')
  device = torch.device('cuda')
else:
  print('Only CPU available.')
  device = torch.device('cpu')

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


GPU available and will be used for training.


Define optimizer

In [20]:
# Recommended learning rates (Adam): 5e-5, 3e-5, 2e-5.
# https://arxiv.org/pdf/1810.04805.pdf
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr = 5e-5,
    eps = 1e-08
)

## Train function

In [21]:
def train(model, optimizer, train_loader, valid_loader, epochs=2):
  if str(device) == 'cuda':
    model.cuda()
  for _ in trange(epochs, desc = 'Epoch'):
    # Set model to training mode
    model.train()
    # Tracking variables
    tr_loss, tr_examples, tr_steps = 0, 0, 0
    for step, batch in enumerate(train_loader):
      batch = tuple(t.to(device) for t in batch)
      input_ids, input_mask, labels = batch
      optimizer.zero_grad()
      # Forward pass
      train_output = model(
          input_ids,
          token_type_ids = None,
          attention_mask = input_mask,
          labels = labels
      )
      # Backward pass
      train_output.loss.backward()
      optimizer.step()
      # Update tracking variables
      tr_loss += train_output.loss.item()
      tr_examples += input_ids.size(0)
      tr_steps += 1
    # Set model to evaluation mode
    model.eval()
    tr_accu = validate(model, valid_loader)
    print(f'\n\t - Train loss: {tr_loss / tr_steps}')
    print(f'\t - Validation Accuracy: {sum(tr_accu)/len(tr_accu)}')

## Validation function

In [22]:
def validate(model, valid_loader):
  # Tracking accuracy
  accuracies = []
  for batch in valid_loader:
      batch = tuple(t.to(device) for t in batch)
      input_ids, input_mask, labels = batch
      with torch.no_grad():
        # Forward pass
        eval_output = model(
            input_ids,
            token_type_ids = None,
            attention_mask = input_mask
        )
      logits = eval_output.logits.detach().cpu().numpy()
      labels = labels.to('cpu').numpy()
      # Compute accuracy
      total_samples = batch_size
      preds = np.argmax(logits, axis = 1).flatten()
      labels = labels.flatten()
      correct_preds = (preds == labels).sum().item()
      accu = correct_preds / total_samples
      accuracies.append(accu)

  return accuracies

## Finetune BERT model

In [23]:
model_name = f'bert_model.pth'
model_file_path = os.path.join(models_dir, model_name)
if OVERWRITE_FINETUNNING or not os.path.isfile(model_file_path):
  train(model, optimizer, train_loader, valid_loader, epochs=3)
  # save model
  torch.save(model, model_file_path)
else:
  model = torch.load(model_file_path)

## Evaluate fine-tuned BERT model

In [24]:
def evaluate_model(model, test_loader):
  preds = []
  ys = []
  for batch in test_loader:
    batch = tuple(t.to(device) for t in batch)
    input_ids, input_mask, labels = batch
    with torch.no_grad():
      model_output = model(
        input_ids,
        token_type_ids = None,
        attention_mask = input_mask
      )
    logits = model_output.logits.detach().cpu().numpy()
    pred = np.argmax(logits, axis = 1).flatten()
    labels = labels.to('cpu').numpy().flatten()
    preds.extend(pred)
    ys.extend(labels)
  return preds, ys

In [25]:
preds, ys = evaluate_model(model, test_loader)
# show classification report using evaluation results
preds = np.array(preds)
ys = np.array(ys)
enconder_file_path = os.path.join(models_dir, 'encoder_classes.npy')
class_nums = list(range(0,7))
report = classification_report(ys, preds, target_names=decode_labels(class_nums, enconder_file_path))
print(report)

                              precision    recall  f1-score   support

               About Company       0.94      0.94      0.94        33
                    Benefits       0.92      0.97      0.95        36
               EOE/Diversity       0.90      0.90      0.90        21
Job Responsibilities/Summary       0.95      0.95      0.95       154
     Job Skills/Requirements       0.96      0.94      0.95        96
                   Job Title       1.00      0.67      0.80         3
                       Other       0.94      0.96      0.95        46

                    accuracy                           0.95       389
                   macro avg       0.94      0.90      0.92       389
                weighted avg       0.95      0.95      0.95       389



---

# Sectio 4. Use case

### Predict section of segments

Solution is checked by predicting the section of a given sentence taken from the dataset `jobs_test`.

In [26]:
segment = 'The company began more than 100 years ago in Tulsa and has successfully diversified into a variety of industries, businesses and geographies. .'

Process segment

In [27]:
# Apply the tokenizer
encoding = text_preprocessing(segment, tokenizer)

In [28]:
# Extract IDs and Attention Mask
ids = []
attention_mask = []
ids.append(encoding['input_ids'])
attention_mask.append(encoding['attention_mask'])
ids = torch.cat(ids, dim = 0)
attention_mask = torch.cat(attention_mask, dim = 0)

Make prediction

In [29]:
def make_prediction(model, ids, attention_mask):
  with torch.no_grad():
    model_output = model(
        ids.to(device),
        token_type_ids=None,
        attention_mask=attention_mask.to(device)
    )
  pred = np.argmax(model_output.logits.cpu().numpy()).flatten().item()
  return pred

In [30]:
pred = make_prediction(model, ids, attention_mask)

Output prediction result

In [31]:
class_names = decode_labels(list(range(0,7)), os.path.join(models_dir, 'encoder_classes.npy'))
for class_num, class_name in zip(list(range(0,7)), class_names):
  if class_num == pred:
    print('Prediction result')
    print('-'*10)
    print(f'Segment: {segment}')
    print(f'Predicted section: {class_name}')
    print('-'*10)
    break

Prediction result
----------
Segment: The company began more than 100 years ago in Tulsa and has successfully diversified into a variety of industries, businesses and geographies. .
Predicted section: About Company
----------
