# Seq2Seq Prefix-Tuning for C#

This is the main notebook for prefix-tuning *codet5p-220m-bimodal seq2seq* on the C# programming language dataset.

## Clone the repository

In [1]:
!git clone https://github.com/leiluk1/CodeSearcher.git && cd CodeSearcher/ && git checkout dev/embeddings

Cloning into 'CodeSearcher'...
remote: Enumerating objects: 251, done.
remote: Counting objects: 100% (251/251), done.
remote: Compressing objects: 100% (163/163), done.
remote: Total 251 (delta 138), reused 188 (delta 83), pack-reused 0
Receiving objects: 100% (251/251), 7.17 MiB | 14.38 MiB/s, done.
Resolving deltas: 100% (138/138), done.
Branch 'dev/embeddings' set up to track remote branch 'dev/embeddings' from 'origin'.
Switched to a new branch 'dev/embeddings'


## Set up the required dependencies

In [None]:
!pip install dataprep gdown py7zr transformers peft evaluate rouge_score fire loguru --quiet

In [3]:
# import wandb
# wandb.login(key='')

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [4]:
!mkdir -p /kaggle/output/CodeSearcher/output
!mkdir -p CodeSearcher/data/raw
!gdown 1tZfsYQgWmc2gG340ru5VbrZ5aLIZ41_6
!unzip -q -d /kaggle/working/CodeSearcher/data/raw ./XLCoST_data.zip

Downloading...
From (original): https://drive.google.com/uc?id=1tZfsYQgWmc2gG340ru5VbrZ5aLIZ41_6
From (redirected): https://drive.google.com/uc?id=1tZfsYQgWmc2gG340ru5VbrZ5aLIZ41_6&confirm=t&uuid=43e0cd75-cef7-4754-936a-b646b21fc289
To: /kaggle/working/XLCoST_data.zip
100%|█████████████████████████████████████████| 298M/298M [00:01<00:00, 202MB/s]


## Perform prefix-tuning 

For more details, please check the code in `src/models/train.py` in our repository.

In [5]:
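# `!export` runs in a throwaway subshell, so the variable would not reach the training command; `%env` sets it for the whole notebook session.
%env DATASETS_VERBOSITY=error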
!cd CodeSearcher/ && export PYTHONPATH=. && python src/models/train.py seq2seq prefix \
    --output_dir="output" \
    --epochs=10 \
    --num_virtual_tokens=20 \
    --language="Csharp" \
    --model_max_src_length=256 \
    --model_max_tgt_length=256 \
    --train_batch_size=16 \
    --eval_batch_size=16 \
    --warmup_steps=200 \
    --gradient_accumulation_steps=4

Downloading tokenizer_config.json: 100%|███| 1.34k/1.34k [00:00<00:00, 6.07MB/s]
Downloading vocab.json: 100%|████████████████| 511k/511k [00:00<00:00, 3.76MB/s]
Downloading merges.txt: 100%|████████████████| 294k/294k [00:00<00:00, 58.6MB/s]
Downloading tokenizer.json: 100%|██████████| 1.37M/1.37M [00:00<00:00, 6.17MB/s]
Downloading added_tokens.json: 100%|██████████| 59.0/59.0 [00:00<00:00, 400kB/s]
Downloading (…)cial_tokens_map.json: 100%|█| 1.03k/1.03k [00:00<00:00, 6.77MB/s]
Downloading config.json: 100%|█████████████| 1.07k/1.07k [00:00<00:00, 7.19MB/s]
Downloading (…)n_codet5p_bimodal.py: 100%|█| 2.81k/2.81k [00:00<00:00, 19.8MB/s]
A new version of the following files was downloaded from https://huggingface.co/Salesforce/codet5p-220m-bimodal:
- configuration_codet5p_bimodal.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading (…)g_codet5p_bimodal.py: 100%|█████| 939/9

## Save checkpoint

In [6]:
!zip -r seq2seq_prefix_csharp.zip CodeSearcher/output

  adding: CodeSearcher/output/ (stored 0%)
  adding: CodeSearcher/output/checkpoint-4500/ (stored 0%)
  adding: CodeSearcher/output/checkpoint-4500/rng_state.pth (deflated 28%)
  adding: CodeSearcher/output/checkpoint-4500/merges.txt (deflated 54%)
  adding: CodeSearcher/output/checkpoint-4500/scheduler.pt (deflated 49%)
  adding: CodeSearcher/output/checkpoint-4500/trainer_state.json (deflated 79%)
  adding: CodeSearcher/output/checkpoint-4500/tokenizer_config.json (deflated 81%)
  adding: CodeSearcher/output/checkpoint-4500/training_args.bin (deflated 48%)
  adding: CodeSearcher/output/checkpoint-4500/special_tokens_map.json (deflated 82%)
  adding: CodeSearcher/output/checkpoint-4500/vocab.json (deflated 59%)
  adding: CodeSearcher/output/checkpoint-4500/tokenizer.json (deflated 72%)
  adding: CodeSearcher/output/checkpoint-4500/README.md (deflated 65%)
  adding: CodeSearcher/output/checkpoint-4500/added_tokens.json (deflated 37%)
  adding: CodeSearcher/output/checkpoint-4500/adapte

## Evaluation

In this step, we evaluate the tuned model on the test set using *Mean Reciprocal Rank (MRR)* as the metric.

For more details, please refer to `src/models/evaluation.py` in the repository.
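As a reminder of what the metric measures, here is a minimal sketch of MRR for a retrieval setup. It assumes a square score matrix in which the correct candidate for query `i` sits at index `i`; the function name, the toy scores, and this layout are illustrative assumptions, and `src/models/evaluation.py` may compute the metric differently.

```python
def mean_reciprocal_rank(similarity):
    """Compute MRR from a square score matrix.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the correct candidate for query i is at index i.
    """
    total = 0.0
    for i, row in enumerate(similarity):
        correct_score = row[i]
        # Rank = 1 + number of candidates scored strictly higher than the correct one.
        rank = 1 + sum(1 for score in row if score > correct_score)
        total += 1.0 / rank
    return total / len(similarity)


# Toy example (made-up scores): the correct candidates rank 1st, 2nd and 3rd.
scores = [
    [0.9, 0.1, 0.3],
    [0.8, 0.5, 0.2],
    [0.7, 0.6, 0.1],
]
mrr = mean_reciprocal_rank(scores)
print(f"MRR: {mrr:.4f}")
```

An MRR of 1.0 means the correct item is always ranked first; values near 0 mean it is usually buried deep in the ranking.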

In [10]:
!cd CodeSearcher/ && export PYTHONPATH=. && python src/models/evaluation.py \
    --tuned_ckpt_path="output/best_ckpt/" \
    --num_virtual_tokens=20 \
    --language="Csharp" \
    --model_max_src_length=128 \
    --model_max_tgt_length=128 \
    --batch_size=32

Downloading config.json: 100%|█████████████| 1.07k/1.07k [00:00<00:00, 5.55MB/s]
Downloading (…)n_codet5p_bimodal.py: 100%|█| 2.81k/2.81k [00:00<00:00, 16.7MB/s]
A new version of the following files was downloaded from https://huggingface.co/Salesforce/codet5p-220m-bimodal:
- configuration_codet5p_bimodal.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading (…)g_codet5p_bimodal.py: 100%|█████| 939/939 [00:00<00:00, 4.95MB/s]
A new version of the following files was downloaded from https://huggingface.co/Salesforce/codet5p-220m-bimodal:
- modeling_codet5p_bimodal.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading pytorch_model.bin: 100%|██████████| 892M/892M [00:03<00:00, 297MB/s]
Downloading tokenizer_config.json: 100%|███| 1.34k/1.34k [00:00<00:00, 6.66MB/s]
Down

## Results

- Test MRR for C#: **0.12993185**;
- Epochs: **10**;
- Trainable params: 184,320 || all params: 223,267,074 || trainable%: 0.08255583624480159.
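The trainable percentage can be reproduced directly from the two parameter counts reported above. This is only a sanity check of the arithmetic; the variable names are illustrative.

```python
# Parameter counts copied from the training log summary above.
trainable_params = 184_320
all_params = 223_267_074

# Share of parameters updated during prefix-tuning, as a percentage.
trainable_percent = 100 * trainable_params / all_params
print(f"trainable%: {trainable_percent:.4f}")
```

In other words, fewer than one in a thousand of the model's parameters are updated during prefix-tuning.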

You can also check the validation plots in the [W&B report](https://api.wandb.ai/links/ley-khaertdinova/13lvn64p).