<a href="https://colab.research.google.com/github/jniimi/tp-berta/blob/main/demo_finetune.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Set up the Environment
- Clone the repository from [jniimi/tp-berta](https://github.com/jniimi/tp-berta)
- Install the required packages with pip
- Colab struggles to resolve all the dependencies in requirements.txt, so the packages are installed individually
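- NumPy is version-sensitive here: a version below 1.24 may be required, although the command below pins 1.25, so adjust the pin if you hit import errors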

In [None]:
!git clone https://github.com/jniimi/tp-berta
!pip install -q numpy==1.25 pandas scikit_learn torch torchvision torchaudio transformers tqdm wandb gdown category_encoders tomli tomli_w einops
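
Since the NumPy version matters for this setup, it can help to confirm which version pip actually installed. The following is a minimal sketch using only the standard library; `parse_version` and `installed_version` are illustrative helpers and not part of tp-berta.

```python
from importlib import metadata


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(package):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


numpy_version = installed_version("numpy")
if numpy_version is None:
    print("numpy is not installed")
else:
    print("numpy", numpy_version, "->", parse_version(numpy_version))
```

Comparing the resulting tuple against a bound such as `(1, 24)` gives a quick yes/no answer on whether the installed version falls below it.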

# Download model checkpoints
- Directly download the pre-trained model checkpoints from Google Drive (the files are officially provided by the original authors)
- Extract the compressed files
- Remove compressed files

In [None]:
%cd /content/tp-berta
!mkdir -p checkpoints
%cd checkpoints
!gdown 13_GAK2VcShxm5TgqSvLk2afBTIYcCbEs
!gdown 1ArjkOAblGPErmxUyVIfpiM0IztnjjYxq
!tar -xzvf tpberta-single.tar.gz
!tar -xzvf tpberta-joint.tar.gz
!rm tpberta-single.tar.gz
!rm tpberta-joint.tar.gz
%cd /content/tp-berta

# Download datasets
- Directly download datasets from Google Drive (the files are officially provided by the original authors)
- Extract the compressed files
- Remove compressed files

In [None]:
%cd /content/tp-berta
!mkdir -p data
%cd data
!gdown 1Jy45I_vTKn6McMROi5IKjKoSi9QJtx9A
!gdown 1JhOJR1kxjyu4w4ZHi8VcxgMh-iYJRDgG
!tar -xzvf tpberta-finetune-data.tar.gz
!tar -xzvf tpberta-pretrain-data.tar.gz
!rm tpberta-finetune-data.tar.gz
!rm tpberta-pretrain-data.tar.gz
%cd /content/tp-berta
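
After extraction, a quick count of the available CSV files shows how many datasets each folder provides. This sketch assumes the archives unpack into subdirectories of `data` that hold `.csv` files (such as the `finetune-bin` directory used later); `count_csv_files` is a hypothetical helper, not part of the repository.

```python
import os


def count_csv_files(root):
    """Map each immediate subdirectory of `root` to the number of .csv files in it."""
    counts = {}
    if not os.path.isdir(root):
        # Nothing to count if the directory has not been created yet
        return counts
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir():
            counts[entry.name] = sum(
                1 for name in os.listdir(entry.path) if name.endswith(".csv")
            )
    return counts


for name, n in count_csv_files("/content/tp-berta/data").items():
    print(f"{name}: {n} CSV files")
```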

# Prepare Datasets
Unlike the preprocessing for pre-training, it is recommended to prepare all the datasets for finetuning.

You can still use the following arguments:

- **num_datasets (int)**:  
  Randomly sample a smaller number of datasets for a quick trial run.  
  (Useful because the full default set of datasets is very large and takes time to prepare.)

- **dataset_seed (int)**:  
  Set the random seed value for sampling datasets. Ensures reproducibility.
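
To illustrate how the two arguments interact, here is a minimal sketch of seeded sampling. It is not the script's actual implementation; `sample_datasets` and the dataset names are hypothetical.

```python
import random


def sample_datasets(names, num_datasets, dataset_seed):
    """Pick `num_datasets` names reproducibly using a dedicated seeded RNG."""
    pool = sorted(names)  # fix the order so the seed alone determines the result
    rng = random.Random(dataset_seed)
    return rng.sample(pool, k=min(num_datasets, len(pool)))


names = ["dataset_a", "dataset_b", "dataset_c", "dataset_d", "dataset_e"]
first = sample_datasets(names, num_datasets=3, dataset_seed=123)
second = sample_datasets(names, num_datasets=3, dataset_seed=123)
print(first == second)  # True: the same seed selects the same subset
```

Using a separate `random.Random` instance keeps the sampling independent of any other code that touches the global random state.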

In [None]:
%cd /content/tp-berta
#!python scripts/clean_feat_names.py --mode finetune --task binclass --num_datasets 3 --dataset_seed 123 --overwrite
!python scripts/clean_feat_names.py --mode finetune --task binclass --overwrite

# Finetune
- **Option 1. Explicitly specify the dataset filename**

  - Place the dataset file (.csv) in the `tp-berta/data/finetune-{task}` directory, for example `finetune-bin`.
  - List the dataset names (without the `.csv` extension) in `DATASETS`; the loop below finetunes the model on each one in turn.

- **Option 2. Randomly sample the datasets**
  - You can also randomly select the datasets for finetuning.
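
For Option 1, checking that each listed name has a matching CSV file can catch typos before a run is launched. The sketch below assumes the directory layout described above; `find_missing_datasets` is a hypothetical helper.

```python
import os


def find_missing_datasets(datasets, data_dir):
    """Return the dataset names that have no matching .csv file in `data_dir`."""
    return [
        name
        for name in datasets
        if not os.path.isfile(os.path.join(data_dir, name + ".csv"))
    ]


task = "bin"
data_dir = f"/content/tp-berta/data/finetune-{task}"
DATASETS = ["Customer_Behaviour", "Bank_Personal_Loan_Modelling"]

missing = find_missing_datasets(DATASETS, data_dir)
if missing:
    print("No CSV found for:", missing)
else:
    print("All datasets found.")
```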

  

In [None]:
import os, random
task = 'bin'

# Option 1: Specify the dataset
#DATASETS=["Customer_Behaviour", "Bank_Personal_Loan_Modelling"]

# Option 2: Random sampling
num_datasets = 2
dataset_seed = 123
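# Sort first: os.listdir returns entries in arbitrary order, which would make the seeded sample non-reproducible
DATASETS = sorted(os.path.splitext(f)[0] for f in os.listdir(f'/content/tp-berta/data/finetune-{task}') if f.endswith('.csv'))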
datasets_rng = random.Random(dataset_seed)
DATASETS = datasets_rng.sample(DATASETS, k=num_datasets)

for DATASET in DATASETS:
    !python scripts/finetune/default/run_default_config_tpberta.py --dataset "$DATASET" --task "binclass"

Finally, the finetuned model is saved in `/content/tp-berta/finetune_outputs/`.