# Code by TCE team at Qur'an QA 2023 shared task A

# Installation

I use [rclone](https://rclone.org/) to access my Drive without being asked for permission every time.
The code reads a file called colab4, which holds my Drive access token. You can replicate this setup on your side, or skip it altogether and download the files manually.  

In [None]:
!lscpu
!nvidia-smi
!free -g

In [None]:
!curl https://rclone.org/install.sh | bash > /dev/null 2>&1

In [None]:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["WANDB_DISABLED"] = "true"

In [None]:
!rclone

## Clone repo and prepare the datasets

In [None]:
repo_url = f"https://github.com/mohammed-elkomy/TCE-QQA2023-TASK-A"
!git clone $repo_url
%cd TCE-QQA2023-TASK-A
!pip install -r requirements.txt
!pip install --no-deps python-terrier==0.7.1

### Download and create datasets

In [None]:
!git pull


!python data_scripts/download_datasets.py > /dev/null 2>&1
!python data_scripts/generate/generate_tydi_qa_pretraining_pairs.py > /dev/null 2>&1
!python data_scripts/generate/merge_train_dev.py > /dev/null 2>&1
!python data_scripts/generate/generate_tafseer_data.py > /dev/null 2>&1

In [None]:
!md5sum data/* | sort -k 2

##### Download pretrained models
Download the following files from Drive or Hugging Face.

**tydi-pairs ➡ trained only on TyDi QA passage-question pairs**
1. araelectra-base-discriminator-tydi-pairs
2. bert-base-arabertv02-tydi-pairs
3. bert-base-arabic-camelbert-ca-tydi-pairs

**tydi-tafseer ➡ trained on TyDi QA passage-question pairs, then on tafseer pairs**
4. bert-base-arabertv02-tydi-tafseer-pairs
5. bert-base-arabic-camelbert-ca-tydi-tafseer-pairs
6. araelectra-base-discriminator-tydi-tafseer-pairs

**tafseer-pairs ➡ trained only on tafseer pairs**
7. bert-base-arabic-camelbert-ca-tafseer-pairs
8. bert-base-arabertv02-tafseer-pairs
9. araelectra-base-discriminator-tafseer-pairs

In [None]:
# make sure to make a copy for the nested DRhard repo
!cp -r "bi-bert-base-arabertv02-tafseer" "biencoder/DRhard/bi-bert-base-arabertv02-tafseer"

Make sure to use Colab for this notebook so that the interactive experiment forms are displayed.

* Choose a model from the list in each form below.
* Set the number of models to train; we train 10 models and report their average performance.

* Choose the experiment mode:
    1.  QQA23_TaskA_qrcd_v1.2 | QQA  ➡ normal training on the official training data, with validation on the official validation data.  
    2.  QQA23_TaskA_qrcd_v1.2_merged | QQA-merged ➡ training on the combined training and validation data, with inference on the hidden split (used in the testing phase).
    3. Pre-training can use either "tafseer" or "TYDI" pairs.

---

**Once training finishes, you will find a dump file saved!**

It is named something like: araelectra-base-discriminator-tafseer-pairs-fine-tuned-1e-06-5254-train.zip
This is an araelectra-base-discriminator-tafseer-pairs fine-tuned model with:
1. A learning rate of 1e-06.
2. A random starting seed of 5254.
3. train.zip means the training data was used.

This dump file contains the model's predictions for the given eval or test data.
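
The naming scheme above can be unpacked programmatically, which helps when grouping many dump files by model or seed. The following is a minimal sketch: the regular expression is inferred from the single example name shown here, and `DumpInfo` and `parse_dump_name` are hypothetical helpers that are not part of the repo.

```python
import re
from typing import NamedTuple


class DumpInfo(NamedTuple):
    """Fields encoded in a dump file name (hypothetical helper type)."""
    model: str
    learning_rate: float
    seed: int
    split: str


# Assumption: <model>-fine-tuned-<lr>-<seed>-<split>.zip, as in the example name.
DUMP_PATTERN = re.compile(
    r"^(?P<model>.+)-fine-tuned-"
    r"(?P<lr>\d+(?:\.\d+)?(?:e[+-]?\d+)?)-"
    r"(?P<seed>\d+)-"
    r"(?P<split>[A-Za-z_]+)\.zip$"
)


def parse_dump_name(file_name: str) -> DumpInfo:
    """Split a dump file name into model, learning rate, seed and data split."""
    match = DUMP_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unrecognised dump file name: {file_name}")
    return DumpInfo(
        model=match["model"],
        learning_rate=float(match["lr"]),
        seed=int(match["seed"]),
        split=match["split"],
    )


info = parse_dump_name(
    "araelectra-base-discriminator-tafseer-pairs-fine-tuned-1e-06-5254-train.zip"
)
print(info)
```

File names that do not follow this pattern raise a `ValueError`, so unexpected dumps are noticed rather than silently skipped.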


# Cross Encoder

A cross encoder is a BERT-based model that predicts a relevance score for a pair of sentences that are encoded jointly (interaction-based).

* Choose a model from the list below.
* Set the number of models to train; we train 10 models and report their average performance.

* Choose the experiment mode:
    1.  QQA23_TaskA_qrcd_v1.2  ➡ normal training on the official training data, with validation on the official validation data.  
    2.  QQA23_TaskA_qrcd_v1.2_merged ➡ training on the combined training and validation data, with inference on the hidden split (used in the testing phase).
    3. tafseer  ➡ pre-training on the tafseer data pairs.
    4. pre-train ➡ pre-training on the TyDi QA data pairs.

[Check this for more details](https://www.sbert.net/examples/applications/cross-encoder/README.html)
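
To make the scoring pattern concrete, here is a minimal sketch of how a cross encoder is used for ranking: the query is paired with every candidate passage, each pair is scored on its own, and the passages are sorted by score. The scorer below is a simple token-overlap placeholder standing in for the fine-tuned model, and the function names and toy passages are assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple


def overlap_score(query: str, passage: str) -> float:
    """Placeholder scorer (Jaccard token overlap). NOT the trained model."""
    q_tokens, p_tokens = set(query.split()), set(passage.split())
    union = q_tokens | p_tokens
    return len(q_tokens & p_tokens) / len(union) if union else 0.0


def rerank(
    query: str,
    passages: Dict[str, str],
    score_pair: Callable[[str, str], float] = overlap_score,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and sort by descending score."""
    scored = [(pid, score_pair(query, text)) for pid, text in passages.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy passages, purely illustrative.
passages = {
    "p1": "the river flows to the sea",
    "p2": "mountains are covered with snow",
    "p3": "the sea is deep",
}
ranking = rerank("where does the river flow", passages)
print(ranking)
```

Because every pair needs its own forward pass through the model, this pattern is usually applied to a limited candidate list rather than to the whole collection.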

In [None]:
import os
from random import choice
import glob

model_name = "aubmindlab/araelectra-base-discriminator"  # @param ["araelectra-base-discriminator-tydi-tafseer-pairs", "bert-base-arabic-camelbert-ca-tydi-tafseer-pairs", "bert-base-arabertv02-tydi-tafseer-pairs", "====", "araelectra-base-discriminator-tydi-pairs", "bert-base-arabertv02-tydi-pairs", "bert-base-arabic-camelbert-ca-tydi-pairs", "===", "bert-base-arabic-camelbert-ca-tafseer-pairs", "bert-base-arabertv02-tafseer-pairs", "araelectra-base-discriminator-tafseer-pairs", "====", "aubmindlab/bert-base-arabertv02", "CAMeL-Lab/bert-base-arabic-camelbert-ca", "aubmindlab/araelectra-base-discriminator"]

num_models = 1 # @param {type:"integer"}

experiment_mode = "tafseer"  # @param ["QQA23_TaskA_qrcd_v1.2", "QQA23_TaskA_qrcd_v1.2_merged","all_dev","pre-train","tafseer"]

lr = "1e-6"  # @param ["2e-5","1e-5","5e-6","2e-6","1e-6"]


for idx in range(num_models):
    out_file = f"{idx}-out.txt"
    err_file = f"{idx}-err.txt"
    doc_file="data/QQA23_TaskA_QPC_v1.1.tsv"

    if experiment_mode == "QQA23_TaskA_qrcd_v1.2_merged":
        train_qrel_file = "data/QQA23_TaskA_qrels_merged.gold"
        train_query_file = "data/QQA23_TaskA_merged.tsv"
    elif experiment_mode == "QQA23_TaskA_qrcd_v1.2":
        train_qrel_file = "data/QQA23_TaskA_qrels_train.gold"
        train_query_file = "data/QQA23_TaskA_train.tsv"
    elif experiment_mode == "all_dev":
        train_qrel_file = "data/QQA23_TaskA_qrels_dev.gold"
        train_query_file = "data/QQA23_TaskA_dev.tsv"

    validation_qrel_file = "data/QQA23_TaskA_qrels_dev.gold"
    validation_query_file = "data/QQA23_TaskA_dev.tsv"

    test_qrel_file = None
    test_query_file = "data/QQA23_TaskA_test.tsv"
    num_train_epochs = 10
    pre_train = False
    do_predict = True
    do_eval= True
    if experiment_mode == "pre-train":
        doc_file="data/TYDI_QA_DOC.tsv"
        train_qrel_file = "data/TYDI_QA_qrels_train.gold"
        train_query_file = "data/TYDI_QA_train.tsv"
        validation_qrel_file = "data/TYDI_QA_qrels_dev.gold"
        validation_query_file = "data/TYDI_QA_dev.tsv"
        test_qrel_file = None
        test_query_file = None
        pre_train = True
        do_predict = False
        num_train_epochs = 2

    if experiment_mode == "tafseer":
        doc_file="data/tafseer_docs.tsv"
        train_qrel_file = "data/tafseer-qrel.tsv"
        train_query_file = "data/tafseer-query.tsv"
        validation_qrel_file = None
        validation_query_file = None
        test_qrel_file = None
        test_query_file = None
        pre_train = True
        do_eval= False
        do_predict = False
        num_train_epochs = 5


    output_folder = os.path.split(model_name)[-1] + f"-fine-tuned-{float(lr)}"

    batch_size = 8 if "large" in model_name else 16


    !git pull
    !rm -r $output_folder

    !python "cross_encoder/trainer.py" \
            --model_name_or_path  $model_name \
            --do_train True \
            --do_eval $do_eval \
            --do_predict $do_predict \
            --save_last_checkpoint_to_drive $pre_train \
            --train_qrel_file $train_qrel_file \
            --train_query_file  $train_query_file \
            --validation_qrel_file  $validation_qrel_file \
            --validation_query_file $validation_query_file \
            --test_qrel_file $test_qrel_file  \
            --test_query_file  $test_query_file \
            --doc_file $doc_file \
            --learning_rate $lr \
            --num_train_epochs $num_train_epochs \
            --max_seq_length 512 \
            --output_dir $output_folder \
            --per_device_eval_batch_size $batch_size \
            --per_device_train_batch_size $batch_size \
            --save_steps 2 \
            --overwrite_output_dir

# Dual-encoder

A dual-encoder is a BERT-based model that predicts a relevance score for a pair of sentences that are encoded individually (representation-based).
The following cells train ➡ infer ➡ mine hard negatives ➡ train again.

* Choose a model from the list below.
* Set the number of models to train; we train 10 models and report their average performance.

* Choose the experiment mode:
    1. QQA  ➡ normal training on the official training data, with validation on the official validation data.  
    2. QQA-merged ➡ training on the combined training and validation data, with inference on the hidden split (used in the testing phase).
    3. TYDI ➡ pre-training on the TyDi QA data pairs.

[Check DRhard repo for more details](https://github.com/jingtaozhan/DRhard)
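
The "mine hard negatives" step can be pictured as follows: retrieve the top-k passages for each training query with the current model, then keep the retrieved passages that are not labelled relevant. This is only a conceptual sketch in plain Python. The real work is done by DRhard's `prepare_hardneg.py`, whose internals are not shown here, and the function names and toy vectors below are assumptions.

```python
from typing import Dict, List, Sequence, Set


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product, used here as the relevance score between embeddings."""
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(
    query_vecs: Dict[str, Sequence[float]],
    doc_vecs: Dict[str, Sequence[float]],
    positives: Dict[str, Set[str]],
    topk: int,
) -> Dict[str, List[str]]:
    """For each query, return top-k retrieved docs that are not positives."""
    hard: Dict[str, List[str]] = {}
    for qid, q_vec in query_vecs.items():
        ranked = sorted(
            doc_vecs, key=lambda did: dot(q_vec, doc_vecs[did]), reverse=True
        )
        relevant = positives.get(qid, set())
        hard[qid] = [did for did in ranked[:topk] if did not in relevant]
    return hard


# Toy embeddings, purely illustrative.
query_vecs = {"q1": [1.0, 0.0]}
doc_vecs = {
    "d1": [0.9, 0.1],
    "d2": [0.8, 0.3],
    "d3": [0.0, 1.0],
    "d4": [0.7, 0.2],
}
hard_negatives = mine_hard_negatives(query_vecs, doc_vecs, {"q1": {"d1"}}, topk=3)
print(hard_negatives)
```

Passages mined this way score highly under the current model but are not labelled relevant, which is what makes them harder than randomly sampled negatives in the next training round.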

In [None]:
!git pull
import os
from random import choice
import glob

if "biencoder" not in os.getcwd():
    repo_root = os.path.join(os.getcwd(),"biencoder","DRhard",)
    %cd $repo_root

model_name = "bi-bert-base-arabertv02-tafseer"  # @param ["bi-bert-base-arabertv02-tafseer","intfloat/multilingual-e5-base", "aubmindlab/bert-base-arabertv02", "CAMeL-Lab/bert-base-arabic-camelbert-ca", "aubmindlab/araelectra-base-discriminator" ]

num_models = 1 # @param {type:"integer"}

experiment_mode = "QQA"  # @param ["QQA","QQA-merged","TYDI"]

lr = "5e-5"  # @param ["1e-5","5e-5","5e-6","2e-6","1e-6","1e-4"]

!python preprocess.py --data_type $experiment_mode --threads 2 --model_name_or_path $model_name
# max_query_length, max_doc_length are printed in preprocess script
max_doc_length = 335
max_query_length = 64
for idx in range(num_models):
    out_file = f"{idx}-out.txt"
    err_file = f"{idx}-err.txt"
    doc_file="data/QQA23_TaskA_QPC_v1.1.tsv"

    num_train_epochs = 100
    pre_train = False
    do_predict = True
    if experiment_mode == "TYDI":
        pre_train = True
        do_predict = False
        num_train_epochs = 2

    output_folder = os.path.split(model_name)[-1] + f"-fine-tuned-{float(lr)}"

    batch_size = 8 if "large" in model_name else 16

    !python star/train.py --do_train \
        --max_query_length $max_query_length \
        --max_doc_length $max_doc_length \
        --preprocess_dir ./data/QQA/preprocess \
        --init_path  $model_name \
        --output_dir ./data/QQA/star_train/models \
        --logging_dir ./data/QQA/star_train/log \
        --optimizer_str adamw \
        --learning_rate $lr \
        --save_every_epochs 50 \
        --overwrite_output_dir --num_train_epochs $num_train_epochs \
        --per_device_train_batch_size $batch_size

    # !rm -r /content/TCE-QQA2023-TASK-A/biencoder/DRhard/data/QQA/star_train/models


In [None]:
!git pull

# ./data/QQA/star_train/models has the trained model

if "biencoder" not in os.getcwd():
    repo_root = os.path.join(os.getcwd(),"biencoder","DRhard",)
    %cd $repo_root

!mkdir -p './data/QQA/trained_models/star'
!cp -r "data/QQA/star_train/models/checkpoint-1000/."  './data/QQA/trained_models/star'

# './data/QQA/trained_models/star' this is used by inference.py
!rm -r "data/QQA/evaluate/star/"
!python star/inference.py --data_type QQA \
    --max_query_length $max_query_length \
    --max_doc_length $max_doc_length \
    --mode dev \
    --eval_batch_size 256 \
    --do_full_retrieval \
    --topk 1000 \
    --no_tpu --faiss_gpus 0

!python ./cvt_back.py \
    --input_dir ./data/QQA/evaluate/star/ \
    --preprocess_dir ./data/QQA/preprocess \
    --output_dir ./data/QQA/official_runs/star \
    --mode dev --dataset QQA

%cd "/content/TCE-QQA2023-TASK-A"
!python "metrics/Custom_TaskA_eval.py" \
    --run="/content/TCE-QQA2023-TASK-A/biencoder/DRhard/data/QQA/official_runs/star/dev.rank.tsv" \
    --qrels="data/QQA23_TaskA_qrels_dev.gold"

In [None]:
!git pull
if "biencoder" not in os.getcwd():
    repo_root = os.path.join(os.getcwd(),"biencoder","DRhard",)
    %cd $repo_root

!rm -r "data/QQA/warmup"
!cp -r "data/QQA/trained_models/star" "data/QQA/warmup"

!python star/prepare_hardneg.py \
    --data_type QQA \
    --max_query_length $max_query_length \
    --max_doc_length $max_doc_length  \
    --mode train \
    --topk 200 \
    --eval_batch_size 64 \
    --max_positives 200 \
    --output_inference_dir "star-hard-prepare"

In [None]:
if "biencoder" not in os.getcwd():
    repo_root = os.path.join(os.getcwd(),"biencoder","DRhard",)
    %cd $repo_root

# the warmup model is the model trained with random negatives (copied in the previous cell)
!python ./star/train.py --do_train \
    --max_query_length $max_query_length \
    --max_doc_length $max_doc_length \
    --preprocess_dir ./data/QQA/preprocess \
    --hardneg_path  ./data/QQA/star-hard-prepare/hard.json \
    --init_path ./data/QQA/warmup \
    --output_dir ./data/QQA/star_train_hard/models \
    --logging_dir ./data/QQA/star_train_hard/log \
    --optimizer_str adamw \
    --learning_rate $lr \
    --save_every_epochs 25 \
    --per_device_train_batch_size $batch_size \
    --overwrite_output_dir --num_train_epochs 100

In [None]:
# ./data/QQA/star_train_hard/models has the trained model
!git pull

if "biencoder" not in os.getcwd():
    repo_root = os.path.join(os.getcwd(),"biencoder","DRhard",)
    %cd $repo_root

!mkdir -p './data/QQA/trained_models/star'
!cp -r "data/QQA/star_train_hard/models/checkpoint-1000/."  './data/QQA/trained_models/star'

!rm -r "data/QQA/evaluate/star/"
!python star/inference.py --data_type QQA \
    --max_query_length $max_query_length \
    --max_doc_length $max_doc_length \
    --mode dev \
    --eval_batch_size 256 \
    --do_full_retrieval \
    --topk 1000 \
    --no_tpu --faiss_gpus 0

!python ./cvt_back.py \
    --input_dir ./data/QQA/evaluate/star/ \
    --preprocess_dir ./data/QQA/preprocess \
    --output_dir ./data/QQA/official_runs/star-hard \
    --mode dev --dataset QQA

%cd "/content/TCE-QQA2023-TASK-A"
!python "metrics/Custom_TaskA_eval.py" \
    --run="/content/TCE-QQA2023-TASK-A/biencoder/DRhard/data/QQA/official_runs/star-hard/dev.rank.tsv" \
    --qrels="data/QQA23_TaskA_qrels_dev.gold"

In [None]:
# create a new warm up model from the hard negatives trained model

if "biencoder" not in os.getcwd():
    repo_root = os.path.join(os.getcwd(),"biencoder","DRhard",)
    %cd $repo_root

!rm -r "data/QQA/warmup"
!cp -r "data/QQA/trained_models/star" "data/QQA/warmup"

!python star/prepare_hardneg.py \
    --data_type QQA \
    --max_query_length $max_query_length \
    --max_doc_length $max_doc_length  \
    --mode train \
    --topk 200 \
    --eval_batch_size 64 \
    --max_positives 200 \
    --output_inference_dir "star-hard-prepare2"

In [None]:
if "biencoder" not in os.getcwd():
    repo_root = os.path.join(os.getcwd(),"biencoder","DRhard",)
    %cd $repo_root


!python ./star/train.py --do_train \
    --max_query_length $max_query_length \
    --max_doc_length $max_doc_length \
    --preprocess_dir ./data/QQA/preprocess \
    --hardneg_path  ./data/QQA/star-hard-prepare2/hard.json \
    --init_path ./data/QQA/warmup \
    --output_dir ./data/QQA/star_train_hard2/models \
    --logging_dir ./data/QQA/star_train_hard2/log \
    --optimizer_str adamw \
    --learning_rate $lr \
    --save_every_epochs 25 \
    --per_device_train_batch_size $batch_size \
    --overwrite_output_dir --num_train_epochs 100

In [None]:
# ./data/QQA/star_train_hard2/models has the trained model
!git pull

if "biencoder" not in os.getcwd():
    repo_root = os.path.join(os.getcwd(),"biencoder","DRhard",)
    %cd $repo_root

!mkdir -p './data/QQA/trained_models/star'
!cp -r "data/QQA/star_train_hard2/models/checkpoint-1000/."  './data/QQA/trained_models/star'

!rm -r "data/QQA/evaluate/star/"
!python star/inference.py --data_type QQA \
    --max_query_length $max_query_length \
    --max_doc_length $max_doc_length \
    --mode dev \
    --eval_batch_size 256 \
    --do_full_retrieval \
    --topk 1000 \
    --no_tpu --faiss_gpus 0

!python ./cvt_back.py \
    --input_dir ./data/QQA/evaluate/star/ \
    --preprocess_dir ./data/QQA/preprocess \
    --output_dir ./data/QQA/official_runs/star-hard2 \
    --mode dev --dataset QQA

%cd "/content/TCE-QQA2023-TASK-A"
!python "metrics/Custom_TaskA_eval.py" \
    --run="/content/TCE-QQA2023-TASK-A/biencoder/DRhard/data/QQA/official_runs/star-hard2/dev.rank.tsv" \
    --qrels="data/QQA23_TaskA_qrels_dev.gold"

# sbert

A bi-encoder is a BERT-based model that predicts a relevance score for a pair of sentences that are encoded individually (representation-based).
The following cell trains using random negatives only.

* Choose a model from the list below.
* Set the number of models to train; we train 10 models and report their average performance.

* Choose the experiment mode:
    1.  QQA23_TaskA_qrcd_v1.2  ➡ normal training on the official training data, with validation on the official validation data.  
    2.  QQA23_TaskA_qrcd_v1.2_merged ➡ training on the combined training and validation data, with inference on the hidden split (used in the testing phase).
    3. tafseer  ➡ pre-training on the tafseer data pairs.
    4. pre-train ➡ pre-training on the TyDi QA data pairs.

[Check this for more details](https://www.sbert.net/examples/applications/cross-encoder/README.html)
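
Training with random negatives means that, for each query, passages are drawn at random from the collection and treated as non-relevant, as long as they are not among the labelled positives. The sketch below illustrates that idea only. It is not taken from `sbert/sbert_trainer.py`, and the function name and toy ids are assumptions.

```python
import random
from typing import List, Sequence, Set


def sample_random_negatives(
    doc_ids: Sequence[str],
    positives: Set[str],
    num_negatives: int,
    seed: int,
) -> List[str]:
    """Draw distinct non-relevant passage ids uniformly at random."""
    rng = random.Random(seed)  # seeded so repeated runs are reproducible
    candidates = [did for did in doc_ids if did not in positives]
    return rng.sample(candidates, min(num_negatives, len(candidates)))


# Toy ids, purely illustrative.
doc_ids = ["d1", "d2", "d3", "d4", "d5"]
negatives = sample_random_negatives(doc_ids, {"d2"}, num_negatives=3, seed=0)
print(negatives)
```

Seeding the generator ties the sampled negatives to the run's seed, which keeps repeated runs comparable.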

In [None]:
import os
from random import choice
import glob

model_name = "aubmindlab/bert-base-arabertv02"  # @param ["bi-bert-base-arabertv02-tafseer","intfloat/multilingual-e5-base", "aubmindlab/bert-base-arabertv02", "CAMeL-Lab/bert-base-arabic-camelbert-ca", "aubmindlab/araelectra-base-discriminator" ]

num_models = 1 # @param {type:"integer"}

experiment_mode = "tafseer"  # @param ["QQA23_TaskA_qrcd_v1.2", "QQA23_TaskA_qrcd_v1.2_merged","all_dev","pre-train","tafseer"]

lr = "1e-6"  # @param ["2e-5","1e-5","5e-6","2e-6","1e-6"]


for idx in range(num_models):
    out_file = f"{idx}-out.txt"
    err_file = f"{idx}-err.txt"
    doc_file="data/QQA23_TaskA_QPC_v1.1.tsv"

    if experiment_mode == "QQA23_TaskA_qrcd_v1.2_merged":
        train_qrel_file = "data/QQA23_TaskA_qrels_merged.gold"
        train_query_file = "data/QQA23_TaskA_merged.tsv"
    elif experiment_mode == "QQA23_TaskA_qrcd_v1.2":
        train_qrel_file = "data/QQA23_TaskA_qrels_train.gold"
        train_query_file = "data/QQA23_TaskA_train.tsv"
    elif experiment_mode == "all_dev":
        train_qrel_file = "data/QQA23_TaskA_qrels_dev.gold"
        train_query_file = "data/QQA23_TaskA_dev.tsv"

    validation_qrel_file = "data/QQA23_TaskA_qrels_dev.gold"
    validation_query_file = "data/QQA23_TaskA_dev.tsv"

    test_qrel_file = "data/QQA23_TaskA_qrels_dev.gold"
    test_query_file = "data/QQA23_TaskA_dev.tsv"
    num_train_epochs = 10
    pre_train = False
    do_predict = True
    do_eval= True
    if experiment_mode == "pre-train":
        doc_file="data/TYDI_QA_DOC.tsv"
        train_qrel_file = "data/TYDI_QA_qrels_train.gold"
        train_query_file = "data/TYDI_QA_train.tsv"
        validation_qrel_file = "data/TYDI_QA_qrels_dev.gold"
        validation_query_file = "data/TYDI_QA_dev.tsv"
        test_qrel_file = None
        test_query_file = None
        pre_train = True
        do_predict = False
        num_train_epochs = 2

    if experiment_mode == "tafseer":
        doc_file="data/tafseer_docs.tsv"
        train_qrel_file = "data/tafseer-qrel.tsv"
        train_query_file = "data/tafseer-query.tsv"
        validation_qrel_file = None
        validation_query_file = None
        test_qrel_file = None
        test_query_file = None
        pre_train = True
        do_eval= False
        do_predict = False
        num_train_epochs = 5


    output_folder = os.path.split(model_name)[-1] + f"-fine-tuned-{float(lr)}"

    batch_size = 8 if "large" in model_name else 16


    !git pull
    !rm -r $output_folder

    !python "sbert/sbert_trainer.py" \
            --model_name_or_path  $model_name \
            --do_train True \
            --do_eval $do_eval \
            --do_predict $do_predict \
            --save_last_checkpoint_to_drive $pre_train \
            --train_qrel_file $train_qrel_file \
            --train_query_file  $train_query_file \
            --validation_qrel_file  $validation_qrel_file \
            --validation_query_file $validation_query_file \
            --test_qrel_file $test_qrel_file  \
            --test_query_file  $test_query_file \
            --doc_file $doc_file \
            --learning_rate $lr \
            --num_train_epochs $num_train_epochs \
            --max_seq_length 512 \
            --output_dir $output_folder \
            --per_device_eval_batch_size $batch_size \
            --per_device_train_batch_size $batch_size \
            --save_steps 2 \
            --overwrite_output_dir

# Analysis and ensemble

**Once training finishes, you will find a dump file saved!**

It is named something like: araelectra-base-discriminator-tafseer-pairs-fine-tuned-1e-06-5254-train.zip
This is an araelectra-base-discriminator-tafseer-pairs fine-tuned model with:
1. A learning rate of 1e-06.
2. A random starting seed of 5254.
3. train.zip means the training data was used.

This dump file contains the model's predictions for the given eval or test data.

See the **analysis** directory of the repo for more details.
Group the dump files into folders, then:
1. Run the **performance_analysis.py** script to compute results for single models and ensembles.
   - **retrieval_ensemble.py** is used by **performance_analysis.py** to implement the ensemble logic.
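
As a rough illustration of what ensembling retrieval runs can look like, the sketch below averages the per-passage scores of several models and re-ranks by the averaged score. This is one common scheme and not necessarily what **retrieval_ensemble.py** does. The function name, the run layout (`{query_id: {doc_id: score}}`) and the toy scores are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List

Run = Dict[str, Dict[str, float]]  # query_id -> {doc_id: score}


def average_ensemble(runs: List[Run]) -> Dict[str, List[str]]:
    """Average scores across runs and return doc ids ranked per query.

    A passage missing from a run contributes a score of 0 for that run.
    """
    totals: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for run in runs:
        for qid, doc_scores in run.items():
            for did, score in doc_scores.items():
                totals[qid][did] += score / len(runs)
    return {
        qid: sorted(scores, key=scores.get, reverse=True)
        for qid, scores in totals.items()
    }


# Toy scores from two hypothetical models, purely illustrative.
run_a = {"q1": {"d1": 0.9, "d2": 0.2}}
run_b = {"q1": {"d1": 0.1, "d2": 0.6, "d3": 0.5}}
ensemble_ranking = average_ensemble([run_a, run_b])
print(ensemble_ranking)
```

Averaging raw scores only makes sense when the models produce scores on comparable scales; otherwise the scores would need to be normalised, or ranks combined instead.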


In [None]:
!python analysis/performance_analysis.py