# Seq2Seq AdaLoRA for SQL

This is the main notebook for fine-tuning the *codet5p-220m-bimodal* seq2seq model with the AdaLoRA PEFT method on an SQL dataset.

## Clone the repository

In [1]:
!git clone https://github.com/leiluk1/CodeSearcher.git && cd CodeSearcher/ && git checkout dev/embeddings

Cloning into 'CodeSearcher'...
remote: Enumerating objects: 410, done.
remote: Counting objects: 100% (50/50), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 410 (delta 5), reused 11 (delta 2), pack-reused 360
Receiving objects: 100% (410/410), 39.50 MiB | 29.20 MiB/s, done.
Resolving deltas: 100% (220/220), done.
Branch 'dev/embeddings' set up to track remote branch 'dev/embeddings' from 'origin'.
Switched to a new branch 'dev/embeddings'


## Set up the required dependencies

In [None]:
!pip install dataprep gdown py7zr transformers peft evaluate rouge_score fire loguru --quiet

In [3]:
# import wandb
# wandb.login(key='')

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [4]:
!mkdir -p /kaggle/output/CodeSearcher/output
!mkdir -p CodeSearcher/data/raw
!gdown 1tZfsYQgWmc2gG340ru5VbrZ5aLIZ41_6
!unzip -q -d /kaggle/working/CodeSearcher/data/raw ./XLCoST_data.zip

Downloading...
From (uriginal): https://drive.google.com/uc?id=1tZfsYQgWmc2gG340ru5VbrZ5aLIZ41_6
From (redirected): https://drive.google.com/uc?id=1tZfsYQgWmc2gG340ru5VbrZ5aLIZ41_6&confirm=t&uuid=33447518-c0af-4d54-a45f-677ef4f5f45d
To: /kaggle/working/XLCoST_data.zip
100%|█████████████████████████████████████████| 298M/298M [00:01<00:00, 215MB/s]


## Perform fine-tuning 

For more details, please check the code in `src/models/train.py` in our repository.

In [5]:
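# `!export` only affects its own subshell; `%env` sets the variable for the kernel and the commands it launches.
%env DATASETS_VERBOSITY=error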
!cd CodeSearcher/ && export PYTHONPATH=. && python src/models/train.py seq2seq adalora \
    --output_dir="output" \
    --epochs=10 \
    --language="SQL" \
    --train_batch_size=16

Downloading tokenizer_config.json: 100%|███| 1.34k/1.34k [00:00<00:00, 6.63MB/s]
Downloading vocab.json: 100%|████████████████| 511k/511k [00:00<00:00, 10.2MB/s]
Downloading merges.txt: 100%|████████████████| 294k/294k [00:00<00:00, 6.52MB/s]
Downloading tokenizer.json: 100%|██████████| 1.37M/1.37M [00:00<00:00, 10.3MB/s]
Downloading added_tokens.json: 100%|██████████| 59.0/59.0 [00:00<00:00, 368kB/s]
Downloading (…)cial_tokens_map.json: 100%|█| 1.03k/1.03k [00:00<00:00, 6.11MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading config.json: 100%|█████████████| 1.07k/1.07k [00:00<00:00, 5.68MB/s]
Downloading (…)n_codet5p_bimodal.py: 100%|█| 2.81k/2.81k [00:00<00:00, 16.2MB/s]
A new version of the following files was downloaded from https://huggingface.co/Salesforce/codet5p-220m-bimodal:
- configuration_codet5p_bimodal.py
. Make sure to double-check they do not contain any added malicious code. To avoid dow

## Save checkpoint

In [6]:
!zip -r seq2seq_adalora_sql.zip CodeSearcher/output

  adding: CodeSearcher/output/ (stored 0%)
  adding: CodeSearcher/output/best_ckpt/ (stored 0%)
  adding: CodeSearcher/output/best_ckpt/special_tokens_map.json (deflated 82%)
  adding: CodeSearcher/output/best_ckpt/tokenizer_config.json (deflated 95%)
  adding: CodeSearcher/output/best_ckpt/tokenizer.json (deflated 72%)
  adding: CodeSearcher/output/best_ckpt/adapter_model.safetensors (deflated 8%)
  adding: CodeSearcher/output/best_ckpt/merges.txt (deflated 54%)
  adding: CodeSearcher/output/best_ckpt/vocab.json (deflated 59%)
  adding: CodeSearcher/output/best_ckpt/adapter_config.json (deflated 50%)
  adding: CodeSearcher/output/best_ckpt/added_tokens.json (deflated 37%)
  adding: CodeSearcher/output/best_ckpt/training_args.bin (deflated 49%)
  adding: CodeSearcher/output/best_ckpt/README.md (deflated 65%)
  adding: CodeSearcher/output/checkpoint-3500/ (stored 0%)
  adding: CodeSearcher/output/checkpoint-3500/scheduler.pt (deflated 49%)
  adding: CodeSearcher/output/checkpoint-3500/s

In [11]:
from IPython.display import FileLink

FileLink(r'seq2seq_adalora_sql.zip')

## Evaluation

In this step, we evaluate the fine-tuned model on the test set using *Mean Reciprocal Rank (MRR)*.

For more details, please refer to `src/models/evaluation.py` in the repository.
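
As a reminder of what the metric measures, here is a minimal sketch of MRR for code search. It assumes that each query has exactly one correct snippet among the candidates and that candidates are ranked by descending similarity score. The function names `rank_of_target` and `mean_reciprocal_rank` are hypothetical and are not taken from `src/models/evaluation.py`, which may compute the metric differently.

```python
from typing import Sequence


def rank_of_target(scores: Sequence[float], target_index: int) -> int:
    """Return the 1-based rank of the correct candidate.

    `scores` holds one similarity score per candidate; a higher score
    means a better match. Ties are resolved in favour of the target.
    """
    target_score = scores[target_index]
    return 1 + sum(1 for s in scores if s > target_score)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries (ranks are 1-based)."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three illustrative queries whose correct snippets are ranked 1st, 2nd and 4th.
example_ranks = [1, 2, 4]
example_mrr = mean_reciprocal_rank(example_ranks)  # (1 + 1/2 + 1/4) / 3
print(f"MRR = {example_mrr:.4f}")
```

An MRR of 1.0 means the correct snippet is always ranked first; values close to 0 mean it usually sits far down the candidate list.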

In [7]:
!cd CodeSearcher/ && export PYTHONPATH=. && python src/models/evaluation.py peft \
    --tuned_ckpt_path="output/best_ckpt" \
    --language="SQL" 

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 472.28it/s]
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 662.19it/s]
2023-11-27 12:15:21.989 | INFO     | src.datasets.StaQC.make_dataset:_setup_dataset:38 - Loading complete
2023-11-27 12:15:22.091 | INFO     | src.datasets.StaQC.make_dataset:__init__:68 - StaQC SQL train dataset length: 62930
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 621.56it/s]
100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 592.50it/s]
2023-11-27 12:15:25.130 | INFO     | src.datasets.StaQC.make_dataset:_setup_dataset:38 - Loading complete
2023-11-27 12:15:25.170 | INFO     | src.datase

## Results

- Test MRR for SQL: **0.01342745**;
- Epochs: **10**;
- Trainable params: 1,327,968 || all params: 224,410,794 || trainable%: 0.5917576317652528.
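
The trainable share reported above is simply the ratio of the two parameter counts. The short check below reproduces it from the numbers in the list; the variable names are illustrative only.

```python
# Parameter counts copied from the results list above.
trainable_params = 1_327_968
all_params = 224_410_794

# Share of weights updated by the adapter, as a percentage.
trainable_pct = 100 * trainable_params / all_params
frozen_params = all_params - trainable_params

print(f"trainable%: {trainable_pct:.4f}")
print(f"frozen params: {frozen_params:,}")
```

In other words, fewer than 0.6% of the weights are updated during fine-tuning, and the remaining ones stay frozen.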

You can also view the validation plots in the [W&B report](https://api.wandb.ai/links/ley-khaertdinova/13lvn64p).
