# Pipeline for Fine-Tuning Your Own Custom Text Classification Model

This notebook contains the code and instructions to train an LLM-based text classifier.

To run the pipeline, ensure that there are two sub-folders in the same folder as this notebook file:

* (1) An `src` folder containing the two files `finetuning.py` and `models.py`. These files contain the main codebase for our text classification pipeline and are called by the code below during execution.
* (2) A `data` folder that contains further sub-folder(s) with the name(s) of your dataset(s). Upload your data into a sub-folder in the following format:
    * `all-x-labeled.csv` — the text data for which you have corresponding annotations / class labels (one column, no headers)
    * `all-y-labeled.csv` — the labels that correspond to the previous `all-x-labeled.csv` file (one column, no headers, class labels need to be integer values, and the ordering needs to align with the text data, i.e., the first label belongs to the first row of text data, and so on)  
    * `all-x-unlabeled.csv` — the text data for which you have NO LABELS and that you want to predict/auto-label using the fine-tuned classification model produced in the following (this should be the majority of your corpus).
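
Before training, it can help to confirm that the two labeled files line up row by row. The following is a minimal sketch and not part of the pipeline: `check_labeled_files` is a hypothetical helper, and it assumes that every text sits on a single line and that both files have no header, as described above.

```python
from pathlib import Path


def check_labeled_files(x_path, y_path):
    """Return the number of rows after checking that texts and labels align.

    Hypothetical helper: assumes one text per line in x_path and one
    integer class label per line in y_path, with no header row.
    Blank lines are ignored, as pandas.read_csv does by default.
    """
    texts = [line for line in Path(x_path).read_text(encoding="utf-8").splitlines() if line.strip()]
    labels = [line for line in Path(y_path).read_text(encoding="utf-8").splitlines() if line.strip()]

    if len(texts) != len(labels):
        raise ValueError(
            f"{len(texts)} texts but {len(labels)} labels; the files must align row by row"
        )

    for row_number, raw_label in enumerate(labels, start=1):
        value = float(raw_label)  # raises ValueError for non-numeric labels
        if not value.is_integer():
            raise ValueError(f"label on row {row_number} is not an integer: {raw_label!r}")

    return len(texts)
```

In the notebook this could be called as `check_labeled_files(f"./data/{DATASET}/all-x-labeled.csv", f"./data/{DATASET}/all-y-labeled.csv")` once `DATASET` has been set.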

# A: Install and import necessary packages

In [None]:
# Install required packages (only required if not already installed)
# !pip install sentencepiece
# !pip install pandas
# !pip install numpy
# !pip install wandb
# !pip install scikit-learn
# !pip install torch
# !pip install torchmetrics
# !pip install transformers
# !pip install tqdm

In [None]:
# Configure GPU workspace 
%env CUBLAS_WORKSPACE_CONFIG=:4096:8
%env TOKENIZERS_PARALLELISM=false

# Note: 
# This only works with NVIDIA GPUs. If your computer does not have such a GPU, consider setting this code up on Google Colab. 
# Also see our note on Google Colab in the accompanying `README` file for an option to speed up execution. 

In [None]:
# Import standard Python packages
import pandas as pd
import numpy as np
import pickle
import shutil
import glob
import time
import os
import gc

# Import deep learning packages
import torch
torch.backends.cuda.matmul.allow_tf32 = True
import wandb
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, accuracy_score, recall_score, f1_score

# Import pipeline code
from src.finetuning import train_and_predict_test, init_model, predict_y_from_trained_model, read_x_from_csv, init_misc, compute_and_print_metrics_for_dataset_b, set_seeds

# B: Set Necessary Hyperparameters

Choose a name for your project (`PROJECT_NAME`). Model outputs will be named based on a concatenation of your `PROJECT_NAME`, `DATASET` name, text language (`LANGUAGE_FOR_MODEL`), and the chosen `LANGUAGE_MODEL` as defined in the following cell.

In [None]:
# Choose a project name:

PROJECT_NAME = "a-name-for-your-project"  

# **************************************************************************************************************************
# Provide the name of the subfolder with your data. This subfolder must be inside 'data' folder. 
# Create separate subfolders for each dataset.
# Prepare your data by labeling the required training and validation data (Figure 3 - Steps 1 and 2).

DATASET = "your-dataset-folder-name"
#DATASET = "01-nyt-sentiment"
#DATASET = "02-twitter-stance"
#DATASET = "03-emotion-angry"
#DATASET = "04-brexit-stance"

# **************************************************************************************************************************
# Define the language of your text data. Your selected language model will then be loaded in the corresponding language. 
# Currently, English and German are pre-implemented (to choose custom models for other languages, use the CUSTOM_MODEL_NAME 
# option below. This allows free choice of any language model that is available via Huggingface).
#
# Supported values: ["en", "de"]

LANGUAGE_FOR_MODEL = "en"

# **************************************************************************************************************************
# Choose a pretrained large language model. RoBERTa tends to show strong results across different tasks and datasets 
# and is a good initial choice (Figure 3 - Step 3).
# 
# Recommended: "ROB-LRG"

# LANGUAGE_MODEL = "ROB-BASE"
# https://huggingface.co/roberta-base

LANGUAGE_MODEL = "ROB-LRG"
# https://huggingface.co/roberta-large

# LANGUAGE_MODEL = "DEB-V3"
# https://huggingface.co/microsoft/deberta-v3-large

# LANGUAGE_MODEL = "ELE-LRG"
# https://huggingface.co/google/electra-large-discriminator

# LANGUAGE_MODEL = "XLNET-LRG"
# https://huggingface.co/xlnet-large-cased

# LANGUAGE_MODEL = "ELE-BS-GER"
# To use the ELECTRA base model in German,
# set LANGUAGE_FOR_MODEL="de" above.
# https://huggingface.co/german-nlp-group/electra-base-german-uncased

# For BART and ChatGPT, see separate notebooks

# Instead of selecting a model from the list above,
# it is possible to choose another custom model provided
# by the huggingface library. To use a custom model from Huggingface, 
# set the model ID with the following variable:

CUSTOM_MODEL_NAME = None

# Examples of available models are:
# CUSTOM_MODEL_NAME = "bert-base-cased"
# CUSTOM_MODEL_NAME = "bert-base-german-cased"
# CUSTOM_MODEL_NAME = "xlm-roberta-large-finetuned-conll03-german"
# CUSTOM_MODEL_NAME = "distilbert-base-german-cased"
# CUSTOM_MODEL_NAME = "distilbert-base-cased"
# CUSTOM_MODEL_NAME = "stefan-it/albert-large-german-cased"
# CUSTOM_MODEL_NAME = "albert-large-v2"

# For more model choices see the Huggingface model repository under:
# https://huggingface.co/models?pipeline_tag=text-classification&sort=likes

# **************************************************************************************************************************
# Create a RUN_ID based on the above choices 

if CUSTOM_MODEL_NAME is None: 
    RUN_ID = PROJECT_NAME + "-" + DATASET + "-" + LANGUAGE_MODEL + "-" + LANGUAGE_FOR_MODEL
else:
    RUN_ID = PROJECT_NAME + "-" + DATASET + "-" + CUSTOM_MODEL_NAME + "-" + LANGUAGE_FOR_MODEL

# C: (Optional) Set Advanced Hyperparameters

The following cell lets you change our recommended default hyperparameters. It also allows for systematic hyperparameter optimization via grid search if desired. In that case, we recommend combining the search with the external logging option (Step E) to keep track of the chosen parameters and the corresponding model results.
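
A grid search over the settings in the next cell could be organised as in the sketch below. It only enumerates the combinations and builds a distinct name for each one. The names `PARAM_GRID` and `iter_grid` are illustrative and not part of `src`, the values are simply picked from the recommended ranges, and the actual training call is left as a comment.

```python
from itertools import product

# Illustrative search space; values chosen within the recommended ranges.
PARAM_GRID = {
    "learning_rate": [1e-5, 2e-5, 5e-5],
    "dropout_rate": [0.1, 0.2],
    "batch_size": [4, 8],
}


def iter_grid(param_grid):
    """Yield one dict per combination of the values in param_grid."""
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


for i, params in enumerate(iter_grid(PARAM_GRID)):
    grid_tag = f"grid-{i:02d}"  # hypothetical naming scheme, one tag per combination
    print(grid_tag, params)
    # Here you would set LEARNING_RATE, DROPOUT_RATE and BATCH_SIZE from `params`
    # and run the fine-tuning cell (Step F) with a RUN_ID that includes `grid_tag`.
```

With three learning rates, two dropout rates, and two batch sizes, this yields 12 runs, so the total training time grows quickly with every additional value.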

In [None]:
# Starting with default hyperparameters 
# (Figure 3 - Step 3).

# **************************************************************************************************************************
# Choose a random seed. This is an arbitrary number that influences the optimization procedure. Set the seed for 
# reproducibility: The same seed should yield the same results during training.

RAND_SEED = 1234

# **************************************************************************************************************************
# Choose the number of epochs for training. One epoch is one full loop through the training and validation datasets. 
# The more epochs, the longer the training will take and the stronger the model may overfit on the training dataset. 
# A low number of epochs may not lead to the full performance potential of the model but reduces training time.
# 
# Recommended: between 5 and 20 epochs

N_EPOCHS = 10

# **************************************************************************************************************************
# Choose how many samples are together in one optimization step. Higher numbers require more VRAM on the GPU or RAM 
# on the CPU. Using multiples of 4 is not required, but common practice. If you run out of memory, consider reducing to 2.
# 
# Recommended: 4, 8, 16, 32

BATCH_SIZE = 4

# **************************************************************************************************************************
# Choose the number of gradient accumulation steps. This determines over how many batches the gradients are accumulated 
# before the optimizer updates the model weights. It can be used as an approximate virtual batch size: 
# Virtual batch size ~ batch size * accumulation steps.
# With a batch size of 4 and 8 accumulation steps, we get roughly a batch size of 32.

GRADIENT_ACC_STEPS = 8

# **************************************************************************************************************************
# Choose the dropout rate for the classification head (NOT the transformer backbone model). A higher dropout rate may 
# reduce overfitting on a small training set. The dropout rate needs to be in the range [0,1]. 
# Higher values mean more dropout is applied, i.e., more information is lost during a forward pass.
# 
# Recommended values: 0.1-0.4

DROPOUT_RATE = 0.1

# **************************************************************************************************************************
# Choose the learning rate (LR) for the optimizer. A higher LR means bigger steps are taken during training and training 
# completes faster. Setting the LR too high may lead to reduced performance or overfitting. For transfer learning, 
# LRs around 1e-5 usually work best.
#
# Recommended values: [1e-5, 2e-5, 5e-5]

LEARNING_RATE = 1e-5

# **************************************************************************************************************************
# Choose strategy for dealing with class imbalance. Correctly dealing with class imbalance can be key for model performance. 
# For very large datasets, undersampling may work well. 
# For small datasets, choose either upsampling or loss_weight.
#
# Options: ["upsampling", "undersampling", "loss_weight"]
# 
# Recommended: "loss_weight"

IMBALANCE_STRATEGY = 'loss_weight'

# **************************************************************************************************************************
# Enable detailed print statements for the entire pipeline including training.
#
# Options: [ True, False ]

IS_DEBUG_ENABLED = True

# **************************************************************************************************************************
# Ensure that this flag is "True" for the final run of your model to use all available training data for optimization.
# 
# Set this flag to "False" to avoid overfitting on the training dataset. The validation split is a smaller subsplit from 
# the training dataset. If a model performs well on this 'unseen' data, it will likely also perform well on the unlabeled 
# data.
# 
# (Figure 3 - Step 6 bottom).
# 
# Options: [ True, False ]

DO_VALIDATION_SET = False
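
The relationship between `BATCH_SIZE` and `GRADIENT_ACC_STEPS` described in the cell above comes down to simple arithmetic. The sketch below is illustrative only: `effective_batch_size` and `optimizer_steps_per_epoch` are hypothetical helpers, and whether the pipeline performs a weight update for a final incomplete group of batches is an assumption made here.

```python
import math


def effective_batch_size(batch_size, accumulation_steps):
    """Number of samples that contribute to one weight update."""
    return batch_size * accumulation_steps


def optimizer_steps_per_epoch(n_samples, batch_size, accumulation_steps):
    """Weight updates per epoch, assuming a final partial group still triggers an update."""
    n_batches = math.ceil(n_samples / batch_size)
    return math.ceil(n_batches / accumulation_steps)


print(effective_batch_size(4, 8))             # 32
print(optimizer_steps_per_epoch(2000, 4, 8))  # 63
```

Halving `BATCH_SIZE` while doubling `GRADIENT_ACC_STEPS` keeps the effective batch size the same but lowers peak memory use, at the cost of more forward and backward passes per update.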

# D: Load Dataset

In [None]:
# Run this cell, no choices required.

dataset_sentences = f"./data/{DATASET}/all-x-labeled.csv"
dataset_labels = f"./data/{DATASET}/all-y-labeled.csv"

all_x = np.squeeze(np.array(pd.read_csv(dataset_sentences, header=None, sep='\t\t', engine='python')))
all_y = np.squeeze(np.array(pd.read_csv(dataset_labels, dtype=np.float32, header=None)))

os.makedirs(f'./data/{DATASET}/{RUN_ID}', exist_ok=True)

print(all_x.shape, all_y.shape)
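
Once the labels are loaded, the class distribution shows how much imbalance the chosen `IMBALANCE_STRATEGY` has to compensate for. The sketch below counts the labels and derives inverse-frequency weights, one common way to weight a loss. This is an assumption for illustration: the weighting that `src/finetuning.py` actually applies for `loss_weight` may differ, and `inverse_frequency_weights` is a hypothetical name.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to n_samples / (n_classes * class_count)."""
    counts = Counter(int(label) for label in labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


example_labels = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]  # toy data; pass `all_y` in the notebook
print(Counter(example_labels))
print(inverse_frequency_weights(example_labels))
```

Rare classes receive weights above 1 and frequent classes weights below 1, so errors on the minority classes count more during optimization.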

# E: (Optional) Enable/Disable logging via Weights & Biases

Note: you need to sign up with `Weights & Biases` to use this option (https://wandb.ai/site).

In [1]:
# Option 1: no external logging of the training metrics               
IS_LOGGING_ENABLED = False
wandb_config = None

# Option 2: external logging of the training metrics (for fine-grained analysis and optimization of hyperparameters)
# IS_LOGGING_ENABLED = True
# wandb_config = { "project": "ipz-nlp", "entity": "mnbucher" }

# F: Start Fine-tuning

The following cell initiates the model training process. Depending on your computer and GPU availability, this may take a while. In case of excessive run times, consider the Google Colab option mentioned previously (see: Step A).

In [None]:
# Run this cell and inspect the results. Depending on the results, decide how you want to proceed (Figure 3 - Steps 4 to 6).

# Set seed
set_seeds(RAND_SEED)

# Randomly shuffle loaded dataset
idxs_shuffle = np.arange(all_x.shape[0])
np.random.shuffle(idxs_shuffle)
all_x = all_x[idxs_shuffle]
all_y = all_y[idxs_shuffle]

# Prepare training
init_misc(RAND_SEED, RUN_ID, IS_DEBUG_ENABLED)

# Train and evaluate model on train/val splits
train_and_predict_test(all_x, all_y, RUN_ID, N_EPOCHS, IMBALANCE_STRATEGY, dataset_B_unlabelled_x=None, learning_rate=LEARNING_RATE, dropout_rate=DROPOUT_RATE, batch_size=BATCH_SIZE, gradient_accumulation_steps=GRADIENT_ACC_STEPS, rand_seed=RAND_SEED, language_model=LANGUAGE_MODEL, language_for_model=LANGUAGE_FOR_MODEL, custom_model_name=CUSTOM_MODEL_NAME, do_validation_set=DO_VALIDATION_SET, log_with_wandb=IS_LOGGING_ENABLED, is_debug=IS_DEBUG_ENABLED, wandb_config=wandb_config)
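
When inspecting the results, keep in mind that accuracy can look good on imbalanced data even if a minority class is rarely predicted correctly, which is why per-class precision, recall, and F1 are worth reading. The pure-Python sketch below shows how these scores are derived from true and predicted labels. It is illustrative only and is not the pipeline's own metric code; `per_class_f1` is a hypothetical name and the labels are toy data.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: (precision, recall, f1)} for every class in y_true or y_pred."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[cls] = (precision, recall, f1)
    return scores


y_true = [0, 0, 0, 0, 1, 1]  # toy labels
y_pred = [0, 0, 0, 1, 1, 0]  # toy predictions
scores = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)
print(scores)
print(macro_f1)  # 0.625
```

The macro average weights every class equally, so a weak minority class lowers it noticeably even when overall accuracy stays high.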

# G: Make Predictions on Unlabeled Data

Predict labels for hitherto unlabeled data (`all-x-unlabeled.csv`) and save the predicted labels (`predictions-x-unlabeled.csv`) in the `data` folder under the `RUN_ID` name defined above. 

In [None]:
# Once model training is completed to your satisfaction, use the fine-tuned model to auto-label the unlabeled part of your data.
# (Figure 3 - Step 7).

dataset_B_unlabelled_x = read_x_from_csv(f"./data/{DATASET}/all-x-unlabeled.csv")

# **************************************************************************************************************************

n_classes = len(list(np.unique(all_y)))

dvc = init_misc(RAND_SEED, RUN_ID, IS_DEBUG_ENABLED, remove_log_files=False)
model, _, _ = init_model(LANGUAGE_MODEL, LANGUAGE_FOR_MODEL, CUSTOM_MODEL_NAME, LEARNING_RATE, DROPOUT_RATE, n_classes, dvc, N_EPOCHS, GRADIENT_ACC_STEPS, None)
max_seq_length = 512

dataset_B_unlabelled_y_pred = predict_y_from_trained_model(RUN_ID, LANGUAGE_MODEL, LANGUAGE_FOR_MODEL, CUSTOM_MODEL_NAME, dataset_B_unlabelled_x, model, BATCH_SIZE, RAND_SEED, max_seq_length, dvc, IS_DEBUG_ENABLED)

np.savetxt("./output/predictions-x-unlabeled.csv", dataset_B_unlabelled_y_pred, fmt='%f', encoding="utf-8")

print("finished!")

# **************************************************************************************************************************

# Clean up: move all CSV outputs into the run-specific data folder
dest_dir = f"./data/{DATASET}/{RUN_ID}"
for src_file in glob.glob('./output/*.csv'):
    # Build the target path from the file name so this also works with Windows path separators
    shutil.move(src_file, os.path.join(dest_dir, os.path.basename(src_file)))
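
After the clean-up step, the predictions file sits in `./data/{DATASET}/{RUN_ID}/`. Because it is written with `fmt='%f'`, each label is stored in float notation such as `1.000000`. A minimal sketch for reading the file back and counting the predicted classes follows; `count_predictions` is a hypothetical helper and assumes that the file holds one predicted class label per line.

```python
from collections import Counter
from pathlib import Path


def count_predictions(path):
    """Count predicted class labels in a file with one float-formatted label per line."""
    values = Path(path).read_text(encoding="utf-8").split()
    return Counter(int(float(value)) for value in values)


# Example call inside the notebook (path built from the variables defined above):
# count_predictions(f"./data/{DATASET}/{RUN_ID}/predictions-x-unlabeled.csv")
```

Comparing these counts with the label distribution of the labeled data is a quick plausibility check: a class that never appears among the predictions usually deserves a closer look.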