# Seq2Seq LoRA for C++

This is the main notebook for fine-tuning the *codet5p-220m-bimodal* seq2seq model with the LoRA PEFT method on the C++ portion of the XLCoST dataset.

## Clone the repository

In [1]:
!git clone https://github.com/leiluk1/CodeSearcher.git && cd CodeSearcher/ && git checkout dev/embeddings

Cloning into 'CodeSearcher'...
remote: Enumerating objects: 251, done.
remote: Counting objects: 100% (251/251), done.
remote: Compressing objects: 100% (163/163), done.
remote: Total 251 (delta 138), reused 188 (delta 83), pack-reused 0
Receiving objects: 100% (251/251), 7.17 MiB | 14.18 MiB/s, done.
Resolving deltas: 100% (138/138), done.
Branch 'dev/embeddings' set up to track remote branch 'dev/embeddings' from 'origin'.
Switched to a new branch 'dev/embeddings'


## Set up the required dependencies

In [None]:
!pip install dataprep gdown py7zr transformers peft evaluate rouge_score fire loguru --quiet

In [3]:
# import wandb
# wandb.login(key='')

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [4]:
!mkdir -p /kaggle/output/CodeSearcher/output
!mkdir -p CodeSearcher/data/raw
!gdown 1tZfsYQgWmc2gG340ru5VbrZ5aLIZ41_6
!unzip -q -d /kaggle/working/CodeSearcher/data/raw ./XLCoST_data.zip

Downloading...
From (uriginal): https://drive.google.com/uc?id=1tZfsYQgWmc2gG340ru5VbrZ5aLIZ41_6
From (redirected): https://drive.google.com/uc?id=1tZfsYQgWmc2gG340ru5VbrZ5aLIZ41_6&confirm=t&uuid=7f8be221-9093-4566-9c0b-973092da046e
To: /kaggle/working/XLCoST_data.zip
100%|████████████████████████████████████████| 298M/298M [00:06<00:00, 49.1MB/s]


## Perform fine-tuning 

For more details, please check the code in `src/models/train.py` in our repository.

In [5]:
# `!export` only affects a throwaway subshell, so set the variable with %env instead
%env DATASETS_VERBOSITY=error
!cd CodeSearcher/ && export PYTHONPATH=. && python src/models/train.py seq2seq lora \
    --output_dir="output" \
    --epochs=10 \
    --language="C++" \
    --device_type="cuda:0" \
    --train_batch_size=32 \
    --eval_batch_size=16 \
    --model_max_src_length=64 \
    --model_max_tgt_length=64

Downloading tokenizer_config.json: 100%|███| 1.34k/1.34k [00:00<00:00, 8.79MB/s]
Downloading vocab.json: 100%|████████████████| 511k/511k [00:00<00:00, 7.27MB/s]
Downloading merges.txt: 100%|█████████████████| 294k/294k [00:00<00:00, 123MB/s]
Downloading tokenizer.json: 100%|██████████| 1.37M/1.37M [00:00<00:00, 24.9MB/s]
Downloading added_tokens.json: 100%|██████████| 59.0/59.0 [00:00<00:00, 446kB/s]
Downloading (…)cial_tokens_map.json: 100%|█| 1.03k/1.03k [00:00<00:00, 8.11MB/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Downloading config.json: 100%|█████████████| 1.07k/1.07k [00:00<00:00, 7.73MB/s]
Downloading (…)n_codet5p_bimodal.py: 100%|█| 2.81k/2.81k [00:00<00:00, 11.7MB/s]
A new version of the following files was downloaded from https://huggingface.co/Salesforce/codet5p-220m-bimodal:
- configuration_codet5p_bimodal.py
. Make sure to double-check they do not contain any added malicious code. To avoid dow

## Save checkpoint

In [6]:
!zip -r seq2seq_lora_c.zip CodeSearcher/output

  adding: CodeSearcher/output/ (stored 0%)
  adding: CodeSearcher/output/runs/ (stored 0%)
  adding: CodeSearcher/output/runs/Nov21_19-20-30_1e319dab7ac8/ (stored 0%)
  adding: CodeSearcher/output/runs/Nov21_19-20-30_1e319dab7ac8/events.out.tfevents.1700594431.1e319dab7ac8.182.0 (deflated 62%)
  adding: CodeSearcher/output/checkpoint-1000/ (stored 0%)
  adding: CodeSearcher/output/checkpoint-1000/adapter_config.json (deflated 45%)
  adding: CodeSearcher/output/checkpoint-1000/trainer_state.json (deflated 72%)
  adding: CodeSearcher/output/checkpoint-1000/scheduler.pt (deflated 49%)
  adding: CodeSearcher/output/checkpoint-1000/added_tokens.json (deflated 37%)
  adding: CodeSearcher/output/checkpoint-1000/training_args.bin (deflated 49%)
  adding: CodeSearcher/output/checkpoint-1000/optimizer.pt (deflated 8%)
  adding: CodeSearcher/output/checkpoint-1000/special_tokens_map.json (deflated 82%)
  adding: CodeSearcher/output/checkpoint-1000/adapter_model.safetensors (deflated 7%)
  adding:

## Evaluation

In this step, we evaluate the fine-tuned model on the test set using *Mean Reciprocal Rank (MRR)* as the metric.

For more details, please refer to `src/models/evaluation.py` in the repository.
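
As a rough illustration of the metric, the sketch below computes MRR from retrieval scores. For each query, the reciprocal rank is 1 divided by the position of the correct snippet among the candidates sorted by descending score, and MRR is the average over all queries. The function names `rank_of_target` and `mean_reciprocal_rank` and the toy scores are hypothetical; this is not the code in `src/models/evaluation.py`.

```python
from typing import Sequence


def rank_of_target(scores: Sequence[float], target_index: int) -> int:
    """Return the 1-based rank of the target candidate (higher score = better).

    Ties are resolved optimistically: only strictly higher scores push the
    target down the ranking.
    """
    target_score = scores[target_index]
    return 1 + sum(1 for s in scores if s > target_score)


def mean_reciprocal_rank(ranks: Sequence[int]) -> float:
    """Average of 1/rank over all queries; ranks are 1-based."""
    if not ranks:
        raise ValueError("need at least one query")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy example with made-up similarity scores: three queries, three candidates each.
# Each entry is (scores for all candidates, index of the correct candidate).
toy_queries = [
    ([0.9, 0.2, 0.1], 0),  # correct snippet ranked 1st
    ([0.3, 0.8, 0.5], 2),  # correct snippet ranked 2nd
    ([0.7, 0.6, 0.1], 2),  # correct snippet ranked 3rd
]
ranks = [rank_of_target(scores, target) for scores, target in toy_queries]
mrr = mean_reciprocal_rank(ranks)
print(ranks, round(mrr, 4))
```

Higher is better: an MRR of 1.0 means the correct snippet is always ranked first.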

In [7]:
!cd CodeSearcher/ && export PYTHONPATH=. && python src/models/evaluation.py \
    --tuned_ckpt_path="output/best_ckpt" \
    --num_virtual_tokens=0 \
    --language="C++" 

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
2023-11-21 22:06:06.655 | INFO     | src.datasets.XLCoST.make_dataset:_load_search_dataframe:84 - Loading dataframe from data/raw/XLCoST_data/retrieval/nl2code_search/snippet_level/C++/train.jsonl
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  dataframe_trunc['code_tokens'] = dataframe_trunc['code_tokens'].apply(_code_tokens_to_str)
2023-11-21 22:06:08.605 | INFO     | src.datasets.XLCoST.make_dataset:__init__:64 - XLCoST C++ train generation=False dataset length: 62839
2023-11-21 22:06:08.605 | INFO     | src.datasets.XLCoST.make_d

## Results

- Test MRR for C++: **0.11642319**
- Epochs: **10**
- Trainable parameters: **884,736** of 223,967,490 (trainable%: 0.39502876064736003)
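
The trainable-parameter share can be re-derived with a few lines of arithmetic. The sketch below recomputes the percentage from the two reported counts and, as an illustration only, shows how the size of a LoRA adapter follows from its rank: a rank-`r` adapter on a `d_in x d_out` weight adds `r * (d_in + d_out)` parameters. The adapter layout in the second part (rank 8 on 72 square 768-dimensional projections) is an assumption chosen because it happens to reproduce the reported count; it is not taken from the repository's training code.

```python
def trainable_percent(trainable: int, total: int) -> float:
    """Share of trainable parameters, in percent."""
    return 100 * trainable / total


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Counts reported in the results above.
trainable = 884_736
total = 223_967_490
pct = trainable_percent(trainable, total)
print(f"trainable share: {pct:.3f}%")

# ASSUMPTION (hypothetical layout, not confirmed by the source):
# rank-8 adapters on 72 projection matrices of shape 768 x 768.
assumed_rank = 8
assumed_hidden = 768
assumed_modules = 72
per_module = lora_param_count(assumed_hidden, assumed_hidden, assumed_rank)
print(per_module * assumed_modules == trainable)
```

Under these assumed values each adapter contributes 12,288 parameters, which shows why LoRA keeps the trainable share well below 1% of the model.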

You can also check the validation plots in the [W&B report](https://api.wandb.ai/links/ley-khaertdinova/13lvn64p).