# Quickstart for text retrieval

This notebook is a quick-start tutorial for the text retrieval part of InfoNexus.

## Inference

We offer two methods for inference. The first uses `TextEmbedder` and `TextReranker`, which support only PyTorch model inference. The second uses `BaseEmbedderInferenceEngine` and `BaseRerankerInferenceEngine`, which support the `normal`, `onnx`, and `tensorrt` inference modes.


### Usage of TextEmbedder

1. Import `TextEmbedder`.
2. Load a model with `TextEmbedder`. Any dense embedding model that can be loaded with Hugging Face Transformers' `AutoModel.from_pretrained()` is supported.
3. Feed sentences into the model and get their embeddings.

In [None]:
from InfoNexus import TextEmbedder

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world.",
    "The Eiffel Tower is located in Paris, France.",
    "Python is a popular programming language.",
    "The Great Wall of China is one of the Seven Wonders of the World.",
    "Space exploration has led to many scientific discoveries.",
    "Climate change is a pressing global issue.",
    "The Mona Lisa is a famous painting by Leonardo da Vinci.",
    "Electric cars are becoming more common.",
    "The human brain is an incredibly complex organ."
]


# Setting use_fp16=True speeds up computation with a slight degradation in embedding quality.
model = TextEmbedder(
    model_name_or_path='/data2/OpenLLMs/bge-base-zh-v1.5',
    use_fp16=True,
    devices=['cuda:1', 'cuda:0'],
)

embedding = model.encode(sentences, batch_size=5)

print(embedding.shape)
# Similarity score between the first two sentences
print(embedding[0] @ embedding[1].T)
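
The dot product printed above equals the cosine similarity only when the embeddings are L2-normalized; whether `TextEmbedder` normalizes by default is not stated here. As a minimal, dependency-free sketch of ranking passages by cosine similarity, the snippet below normalizes explicitly. It uses toy 3-dimensional vectors instead of real model output, and the helper names `cosine_similarity` and `top_k` are hypothetical, not part of InfoNexus.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, passage_vecs, k=3):
    """Return (index, score) pairs for the k passages most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for real embeddings (assumption: illustration only).
passages = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query = [0.8, 0.6, 0.0]

for index, score in top_k(query, passages, k=2):
    print(index, round(score, 4))
```

With real embeddings the same scores would normally come from a single matrix product, as in the `@` expression in the cell above.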

### Usage of BaseEmbedderInferenceEngine

1. Import `AbsInferenceArguments` and `BaseEmbedderInferenceEngine`.
2. Obtain an ONNX or TensorRT model. You can convert a PyTorch model to ONNX with the class method `convert_to_onnx`, or convert an ONNX model to TensorRT with `convert_to_tensorrt`. If you already have an ONNX or TensorRT model, skip this step.

In [None]:
from InfoNexus import AbsInferenceArguments, BaseEmbedderInferenceEngine

model_path='/data2/OpenLLMs/bge-base-zh-v1.5'
onnx_model_path='/data2/OpenLLMs/bge-base-zh-v1.5/onnx/model.onnx'

# convert to onnx
BaseEmbedderInferenceEngine.convert_to_onnx(model_name_or_path=model_path, onnx_model_path=onnx_model_path)

3. Create the arguments, where you can specify the `infer_mode` parameter. The supported inference modes are `normal`, `onnx`, and `tensorrt`.

In [None]:
args=AbsInferenceArguments(
    model_name_or_path=model_path,
    onnx_model_path=onnx_model_path,
    trt_model_path=None,
    infer_mode='onnx',
    infer_device=0,
    infer_batch_size=16
)

4. Run inference and get the embeddings.

In [None]:
# Inference with the ONNX session
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world.",
    "The Eiffel Tower is located in Paris, France.",
    "Python is a popular programming language.",
    "The Great Wall of China is one of the Seven Wonders of the World.",
    "Space exploration has led to many scientific discoveries.",
    "Climate change is a pressing global issue.",
    "The Mona Lisa is a famous painting by Leonardo da Vinci.",
    "Electric cars are becoming more common.",
    "The human brain is an incredibly complex organ."
]


inference_engine_onnx = BaseEmbedderInferenceEngine(args)
emb_onnx = inference_engine_onnx.inference(sentences, normalize=True, batch_size=5)
print(emb_onnx.shape)
print(emb_onnx[0] @ emb_onnx[1].T)
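
Step 2 above says the conversion can be skipped when an ONNX model already exists. One way to automate that check is sketched below. The helper `ensure_onnx_model` is hypothetical and not part of InfoNexus; it takes the converter as a parameter, so the sketch does not depend on the library, and it assumes the converter accepts the `model_name_or_path` and `onnx_model_path` keyword arguments used in the conversion cell.

```python
from pathlib import Path
from typing import Callable


def ensure_onnx_model(
    model_name_or_path: str,
    onnx_model_path: str,
    convert_fn: Callable[..., None],
) -> bool:
    """Run the conversion only when the ONNX file is missing.

    Returns True if a conversion was triggered, False if it was skipped.
    """
    target = Path(onnx_model_path)
    if target.is_file():
        return False
    # Create the parent directory in case the exporter does not (assumption).
    target.parent.mkdir(parents=True, exist_ok=True)
    convert_fn(model_name_or_path=model_name_or_path, onnx_model_path=str(target))
    return True
```

With InfoNexus installed, `convert_fn` would be `BaseEmbedderInferenceEngine.convert_to_onnx`, called with the same `model_path` and `onnx_model_path` as above.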

## Training

We support multi-node, multi-GPU distributed training using `accelerate`.

### Single node

Below is a script that launches single-node, multi-GPU training. It is included in `examples/text_retrieval/training`.

Here are details about this script:

1. Arguments for paths:

    1. `base_dir`: base directory of this repo

    2. `train_data`: path to the training data, either files or folders. Each record should have the following format:

    ```json
    {
        "query": "query",
        "pos": ["pos 1", "pos 2"],
        "neg": ["neg 1", "neg 2"]
    }
    ```

    3. `model_name_or_path`: name or path of the base model

    4. `ckpt_save_dir`: directory where checkpoints are saved

    5. `deepspeed`: path to the DeepSpeed config file

2. Arguments for training:

    1. `num_train_epochs`: number of training epochs

    2. `per_device_train_batch_size`: batch size per device

    3. `num_gpus`: number of GPUs to use
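
Before launching training, it can help to check that the file passed as `train_data` matches the record format described above. The following is a minimal sketch, assuming one JSON object per line (as the `.jsonl` extension of the example file suggests) and at least one positive passage per query. The helpers `validate_record` and `load_jsonl` are hypothetical and not part of InfoNexus.

```python
import json


def validate_record(record) -> None:
    """Raise ValueError if a training record does not match the expected layout."""
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    if not isinstance(record.get("query"), str) or not record["query"]:
        raise ValueError("'query' must be a non-empty string")
    for key in ("pos", "neg"):
        values = record.get(key)
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"'{key}' must be a list of strings")
    # Assumption: every query needs at least one positive passage.
    if not record["pos"]:
        raise ValueError("'pos' needs at least one passage")


def load_jsonl(text: str) -> list:
    """Parse JSON Lines text (one object per line) and validate each record."""
    records = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            record = json.loads(line)
            validate_record(record)
        except ValueError as err:  # json.JSONDecodeError is a ValueError subclass
            raise ValueError(f"line {line_no}: {err}") from err
        records.append(record)
    return records


sample = json.dumps({"query": "query", "pos": ["pos 1", "pos 2"], "neg": ["neg 1", "neg 2"]})
print(load_jsonl(sample))
```

Because `json.loads` is strict, this check also rejects records with a trailing comma after the last field.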

In [None]:
export WANDB_MODE=disabled

base_dir='/data2/home/angqing/code/InfoNexus'
train_data='/data2/home/angqing/code/InfoNexus/examples/text_retrieval/example_data/fiqa.jsonl'
model_name_or_path='/data2/OpenLLMs/bge-base-zh-v1.5'
ckpt_save_dir='/data2/home/angqing/code/InfoNexus/checkpoints/test_embedder'

deepspeed='/data2/home/angqing/code/InfoNexus/examples/text_retrieval/training/ds_stage0.json'
# epochs and batch size chosen for a quick test run
num_train_epochs=2
per_device_train_batch_size=30

# set num_gpus to 2 for testing
num_gpus=2

if [ -z "$HF_HUB_CACHE" ]; then
    export HF_HUB_CACHE="$HOME/.cache/huggingface/hub"
fi

model_args="\
    --model_name_or_path $model_name_or_path \
    --cache_dir $HF_HUB_CACHE \
"

data_args="\
    --train_data $train_data \
    --cache_path ~/.cache \
    --train_group_size 8 \
    --query_max_len 512 \
    --passage_max_len 512 \
    --pad_to_multiple_of 8 \
    --query_instruction_for_retrieval 'Represent this sentence for searching relevant passages: ' \
    --query_instruction_format '{}{}' \
    --knowledge_distillation False \
"

training_args="\
    --output_dir $ckpt_save_dir \
    --overwrite_output_dir \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs $num_train_epochs \
    --per_device_train_batch_size $per_device_train_batch_size \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --deepspeed $deepspeed \
    --logging_steps 1 \
    --save_steps 100 \
    --negatives_cross_device \
    --temperature 0.02 \
    --sentence_pooling_method cls \
    --normalize_embeddings True \
    --kd_loss_type kl_div \
"


cd $base_dir

cmd="torchrun --nproc_per_node $num_gpus \
    -m InfoNexus.training.embedder.text_retrieval \
    $model_args \
    $data_args \
    $training_args \
"

echo $cmd
eval $cmd
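
The script above sets `per_device_train_batch_size=30`, `num_gpus=2`, and `--train_group_size 8`. As a rough sketch of what these values imply per optimization step, the snippet below does the arithmetic. It assumes plain data-parallel training without gradient accumulation, that each query is paired with `train_group_size` passages, and that `--negatives_cross_device` makes the passages from all devices available as negatives. These readings are assumptions, not statements from the script, and `batch_summary` is a hypothetical helper.

```python
def batch_summary(per_device_train_batch_size, num_gpus, train_group_size, num_nodes=1):
    """Derive batch-level quantities for data-parallel training."""
    world_size = num_gpus * num_nodes
    queries_per_step = per_device_train_batch_size * world_size
    # Assumption: each query contributes `train_group_size` passages.
    passages_per_device = per_device_train_batch_size * train_group_size
    return {
        "world_size": world_size,
        "queries_per_step": queries_per_step,
        "passages_per_device": passages_per_device,
        "passages_across_devices": passages_per_device * world_size,
    }


summary = batch_summary(per_device_train_batch_size=30, num_gpus=2, train_group_size=8)
print(summary)
```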

### Multi node

We use `accelerate` to launch multi-node training. The scripts are in `examples/text_retrieval/training/*_accelerate.sh`.

1. First, generate an accelerate config file.

In [None]:
accelerate config --config_file accelerate_config.json

2. Then, launch the accelerate training script on each node, from rank 0 to rank n.

    The arguments are similar to those of the single-node script, except for `ACCELERATE_CONFIG`, which is the path to the accelerate config file generated in step 1.

In [None]:
export WANDB_MODE=disabled


BASE_DIR='/data1/home/recstudio/angqing/InfoNexus'

MODEL_NAME_OR_PATH='/data1/home/recstudio/angqing/models/bge-base-zh-v1.5'
TRAIN_DATA="/data1/home/recstudio/angqing/InfoNexus/eval_scripts/training/text_retrieval/example_data/fiqa.jsonl"
CKPT_SAVE_DIR='/data1/home/recstudio/angqing/InfoNexus/checkpoints'
DEEPSPEED_DIR='/data1/home/recstudio/angqing/InfoNexus/eval_scripts/training/ds_stage0.json'
ACCELERATE_CONFIG='/data1/home/recstudio/angqing/InfoNexus/eval_scripts/training/text_retrieval/accelerate_config_multi.json'
# epochs and batch size chosen for a quick test run
num_train_epochs=2
per_device_train_batch_size=16
# set num_gpus to 2 for testing
num_gpus=2

if [ -z "$HF_HUB_CACHE" ]; then
    export HF_HUB_CACHE="$HOME/.cache/huggingface/hub"
fi

model_args="\
    --model_name_or_path $MODEL_NAME_OR_PATH \
    --cache_dir $HF_HUB_CACHE \
"

data_args="\
    --train_data $TRAIN_DATA \
    --cache_path ~/.cache \
    --train_group_size 8 \
    --query_max_len 512 \
    --passage_max_len 512 \
    --pad_to_multiple_of 8 \
    --query_instruction_for_retrieval 'Represent this sentence for searching relevant passages: ' \
    --query_instruction_format '{}{}' \
    --knowledge_distillation False \
"

training_args="\
    --output_dir $CKPT_SAVE_DIR \
    --overwrite_output_dir \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs $num_train_epochs \
    --per_device_train_batch_size $per_device_train_batch_size \
    --dataloader_drop_last True \
    --warmup_ratio 0.1 \
    --gradient_checkpointing \
    --deepspeed $DEEPSPEED_DIR \
    --logging_steps 10 \
    --save_steps 500 \
    --negatives_cross_device \
    --temperature 0.02 \
    --sentence_pooling_method cls \
    --normalize_embeddings True \
    --kd_loss_type kl_div \
"

cd $BASE_DIR

cmd="accelerate launch --config_file $ACCELERATE_CONFIG \
    InfoNexus/training/embedder/text_retrieval/__main__.py \
    $model_args \
    $data_args \
    $training_args \
"

echo $cmd
eval $cmd

## Evaluation

Our evaluation step supports the `normal`, `onnx`, and `tensorrt` inference modes for both the embedder and the reranker.

1. AIR-Bench

    We provide a simple script for evaluating on AIR-Bench in `examples/text_retrieval/evaluation/run.sh`.

In [None]:
BASE_DIR=/data1/home/recstudio/angqing/InfoNexus
EMBEDDER=/data1/home/recstudio/angqing/models/bge-base-zh-v1.5
RERANKER=/data1/home/recstudio/angqing/models/bge-reranker-base
embedder_infer_mode=onnx
reranker_infer_mode=onnx
embedder_onnx_path=$EMBEDDER/onnx/model.onnx
reranker_onnx_path=$RERANKER/onnx/model.onnx
embedder_trt_path=$EMBEDDER/trt/model.trt
reranker_trt_path=$RERANKER/trt/model.trt


cd $BASE_DIR


python -m InfoNexus.evaluation.text_retrieval.airbench \
    --benchmark_version AIR-Bench_24.05 \
    --task_types qa \
    --domains arxiv \
    --languages en \
    --splits dev test \
    --output_dir $BASE_DIR/examples/text_retrieval/evaluation/air_bench/search_results \
    --search_top_k 1000 \
    --rerank_top_k 20 \
    --cache_dir $BASE_DIR/examples/text_retrieval/evaluation/cache/data \
    --overwrite False \
    --embedder_name_or_path $EMBEDDER \
    --reranker_name_or_path $RERANKER \
    --devices cuda:0 \
    --model_cache_dir $BASE_DIR/examples/text_retrieval/evaluation/cache/model \
    --reranker_query_max_length 128 \
    --reranker_max_length 512 \
    --embedder_infer_mode $embedder_infer_mode \
    --reranker_infer_mode $reranker_infer_mode \
    --embedder_onnx_model_path $embedder_onnx_path \
    --reranker_onnx_model_path $reranker_onnx_path \
    --embedder_trt_model_path $embedder_trt_path \
    --reranker_trt_model_path $reranker_trt_path

2. Custom

    Some important arguments:

    1. `eval_name`: name of your evaluation dataset
    2. `dataset_dir`: path to your data
    3. `splits`: dataset splits to evaluate; the default is `test`
    4. `corpus_embd_save_dir`: directory where the corpus embedding index is saved
    5. `output_dir`: output directory for evaluation results
    6. `search_top_k`: top k for search
    7. `rerank_top_k`: top k for reranking
    8. `embedder_name_or_path`: embedder model name or path
    9. `reranker_name_or_path`: reranker model name or path
    10. `embedder_infer_mode`: inference mode for the embedder. The default is `None`, which means `TextEmbedder` is used. Choices are `normal`, `onnx`, and `tensorrt`.
    11. `reranker_infer_mode`: same as for the embedder
    12. `embedder_onnx_model_path`: ONNX model path, required if `embedder_infer_mode` is set to `onnx`
    13. `reranker_onnx_model_path`: same as for the embedder
    14. `embedder_trt_model_path`: TensorRT model path, required if `embedder_infer_mode` is set to `tensorrt`
    15. `reranker_trt_model_path`: same as for the embedder
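
The dependency between the inference mode and the model paths (items 10 to 15 above) can be checked before starting a long evaluation run. The following is a small sketch of such a check. `check_infer_config` is a hypothetical helper, not part of InfoNexus, and it encodes only the rules listed above.

```python
from typing import Optional

INFER_MODES = ("normal", "onnx", "tensorrt")


def check_infer_config(
    infer_mode: Optional[str],
    onnx_model_path: Optional[str] = None,
    trt_model_path: Optional[str] = None,
) -> None:
    """Raise ValueError when the mode is unknown or its model path is missing."""
    if infer_mode is None:
        return  # default mode: no exported model is needed
    if infer_mode not in INFER_MODES:
        raise ValueError(f"infer_mode must be one of {INFER_MODES}, got {infer_mode!r}")
    if infer_mode == "onnx" and not onnx_model_path:
        raise ValueError("onnx mode requires an ONNX model path")
    if infer_mode == "tensorrt" and not trt_model_path:
        raise ValueError("tensorrt mode requires a TensorRT model path")


# The same check applies to the embedder and the reranker settings.
check_infer_config("onnx", onnx_model_path="model.onnx")
check_infer_config(None)
```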

In [None]:
python -m InfoNexus.evaluation.text_retrieval \
    --eval_name your_data_name \
    --dataset_dir ./your_data_path \
    --splits test \
    --corpus_embd_save_dir ./your_data_name/corpus_embd \
    --output_dir ./your_data_name/search_results \
    --search_top_k 1000 \
    --rerank_top_k 100 \
    --cache_path ./cache/data \
    --overwrite False \
    --k_values 10 100 \
    --eval_output_method markdown \
    --eval_output_path ./your_data_name/eval_results.md \
    --eval_metrics ndcg_at_10 recall_at_100 \
    --embedder_name_or_path BAAI/bge-m3 \
    --reranker_name_or_path BAAI/bge-reranker-v2-m3 \
    --devices cuda:0 cuda:1 \
    --cache_dir ./cache/model \
    --reranker_query_max_length 512 \
    --reranker_max_length 1024 \
    --embedder_infer_mode $embedder_infer_mode \
    --reranker_infer_mode $reranker_infer_mode \
    --embedder_onnx_model_path $embedder_onnx_path \
    --reranker_onnx_model_path $reranker_onnx_path \
    --embedder_trt_model_path $embedder_trt_path \
    --reranker_trt_model_path $reranker_trt_path
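
The command above requests `ndcg_at_10` and `recall_at_100` through `--eval_metrics`. To make these metrics concrete, here is a minimal sketch of recall@k and nDCG@k in their common textbook form, with linear gain and a log2 discount. It may differ in detail from what the evaluation script computes, and the function names and toy data are hypothetical.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k with graded relevance given as {doc_id: gain}; missing ids count as 0."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
    )
    # Ideal ranking: all relevant documents first, sorted by decreasing gain.
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: two relevant documents, one of them retrieved at rank 2.
ranked = ["d3", "d1", "d7"]
relevance = {"d1": 1, "d2": 1}
print(recall_at_k(ranked, relevance.keys(), k=3))
print(ndcg_at_k(ranked, relevance, k=3))
```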