# Quickstart for text retrieval

This notebook is a quick-start tutorial for the text retrieval part.

## Inference

We offer two ways to run inference. The first uses `TextEmbedder` and `TextReranker`, which support PyTorch model inference only. The second uses `BaseEmbedderInferenceEngine` and `BaseRerankerInferenceEngine`, which support the `normal`, `onnx`, and `tensorrt` inference modes.


### Usage of TextEmbedder

1. Import `TextEmbedder`.
2. Load a model with `TextEmbedder`. Any dense embedding model that Hugging Face Transformers can load via `AutoModel.from_pretrained()` is supported.
3. Feed sentences into the model to get embeddings.

In [None]:
from Nexus import TextEmbedder

sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world.",
    "The Eiffel Tower is located in Paris, France.",
    "Python is a popular programming language.",
    "The Great Wall of China is one of the Seven Wonders of the World.",
    "Space exploration has led to many scientific discoveries.",
    "Climate change is a pressing global issue.",
    "The Mona Lisa is a famous painting by Leonardo da Vinci.",
    "Electric cars are becoming more common.",
    "The human brain is an incredibly complex organ."
]


# Setting use_fp16 to True speeds up computation with a slight performance degradation.
model = TextEmbedder(
    model_name_or_path='/data2/OpenLLMs/bge-base-zh-v1.5',
    use_fp16=True,
    devices=['cuda:1', 'cuda:0'],
)

embedding = model.encode(sentences, batch_size=5)

print(embedding.shape)
print(embedding[0]@ embedding[1].T)
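
The last line prints a dot product between two embeddings. If the rows are L2-normalized (an assumption here; this section does not say whether `TextEmbedder.encode` normalizes by default), that dot product is the cosine similarity. A minimal NumPy sketch, with made-up vectors standing in for real model output:

```python
import numpy as np

# Toy stand-in for the array returned by model.encode(); the values are made up.
toy_embedding = np.array([
    [3.0, 4.0, 0.0],
    [4.0, 3.0, 0.0],
    [0.0, 0.0, 2.0],
])

# L2-normalize each row so that a dot product equals cosine similarity.
normalized = toy_embedding / np.linalg.norm(toy_embedding, axis=1, keepdims=True)

# Pairwise similarity matrix: entry (i, j) compares sentence i with sentence j.
similarity = normalized @ normalized.T

print(similarity.shape)                    # (3, 3)
print(round(float(similarity[0, 1]), 2))   # 0.96
```

For 1-D NumPy vectors `.T` has no effect, so `embedding[0] @ embedding[1].T` is a plain dot product.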

### Usage of BaseEmbedderInferenceEngine

1. Import `AbsInferenceArguments` and `BaseEmbedderInferenceEngine`.
2. Get an ONNX or TensorRT model. You can convert a PyTorch model to ONNX with the class method `convert_to_onnx`, or convert an ONNX model to TensorRT with `convert_to_tensorrt`. If you already have an ONNX or TensorRT model, skip this step.

In [None]:
from Nexus import AbsInferenceArguments, BaseEmbedderInferenceEngine

model_path='/data2/OpenLLMs/bge-base-zh-v1.5'
onnx_model_path='/data2/OpenLLMs/bge-base-zh-v1.5/onnx/model.onnx'

# convert to onnx
BaseEmbedderInferenceEngine.convert_to_onnx(model_name_or_path=model_path, onnx_model_path=onnx_model_path)
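
Step 2 says to skip conversion when an exported model already exists. One way to express that is a small guard around the conversion call. In this sketch, `ensure_onnx` and its `convert` parameter are hypothetical names and not part of Nexus; in practice `convert` would wrap the `convert_to_onnx` call shown above.

```python
from pathlib import Path
from typing import Callable


def ensure_onnx(onnx_model_path: str, convert: Callable[[str], None]) -> bool:
    """Run `convert` only if no file exists at `onnx_model_path`.

    Returns True if conversion was run, False if it was skipped.
    """
    target = Path(onnx_model_path)
    if target.is_file():
        return False  # an exported model is already there, nothing to do
    # Make sure the output directory exists before exporting.
    target.parent.mkdir(parents=True, exist_ok=True)
    convert(str(target))
    return True
```

With the variables from the cell above, a call could look like `ensure_onnx(onnx_model_path, lambda p: BaseEmbedderInferenceEngine.convert_to_onnx(model_name_or_path=model_path, onnx_model_path=p))`.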

3. Create the arguments, where you can set the `infer_mode` parameter. Supported modes are `normal`, `onnx`, and `tensorrt`.

In [None]:
args=AbsInferenceArguments(
    model_name_or_path=model_path,
    onnx_model_path=onnx_model_path,
    trt_model_path=None,
    infer_mode='onnx',
    infer_device=0,
    infer_batch_size=16
)
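
The inference modes need different model files: `onnx` needs an ONNX model path and `tensorrt` needs a TensorRT model path. The check below is a hypothetical helper (`check_infer_args` is not a Nexus function) that makes this dependency explicit before an engine is built. Nexus may perform its own validation, so treat this as a sketch only.

```python
from typing import Optional

VALID_MODES = ("normal", "onnx", "tensorrt")


def check_infer_args(infer_mode: str,
                     onnx_model_path: Optional[str] = None,
                     trt_model_path: Optional[str] = None) -> None:
    """Raise ValueError if the mode is unknown or its model path is missing."""
    if infer_mode not in VALID_MODES:
        raise ValueError(f"infer_mode must be one of {VALID_MODES}, got {infer_mode!r}")
    if infer_mode == "onnx" and not onnx_model_path:
        raise ValueError("infer_mode='onnx' requires onnx_model_path")
    if infer_mode == "tensorrt" and not trt_model_path:
        raise ValueError("infer_mode='tensorrt' requires trt_model_path")


# The configuration used above passes the check.
check_infer_args("onnx", onnx_model_path="/data2/OpenLLMs/bge-base-zh-v1.5/onnx/model.onnx")
```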

4. Inference and get embeddings.

In [None]:
# Step 4: run inference with the ONNX session
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "Artificial intelligence is transforming the world.",
    "The Eiffel Tower is located in Paris, France.",
    "Python is a popular programming language.",
    "The Great Wall of China is one of the Seven Wonders of the World.",
    "Space exploration has led to many scientific discoveries.",
    "Climate change is a pressing global issue.",
    "The Mona Lisa is a famous painting by Leonardo da Vinci.",
    "Electric cars are becoming more common.",
    "The human brain is an incredibly complex organ."
]


inference_engine_onnx = BaseEmbedderInferenceEngine(args)
emb_onnx = inference_engine_onnx.inference(sentences, normalize=True, batch_size=5)
print(emb_onnx.shape)
print(emb_onnx[0]@ emb_onnx[1].T)
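
When switching from the PyTorch path to ONNX or TensorRT, it can be worth confirming that both backends give nearly the same embeddings for the same sentences. A minimal sketch, assuming both outputs are NumPy arrays of the same shape. The arrays below are made-up placeholders, `max_cosine_gap` is a hypothetical helper, and the tolerance is an arbitrary choice rather than a Nexus recommendation.

```python
import numpy as np


def max_cosine_gap(a: np.ndarray, b: np.ndarray) -> float:
    """Return 1 minus the smallest cosine similarity between matching rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    row_cosine = np.sum(a * b, axis=1)  # cosine of row i in a versus row i in b
    return float(1.0 - row_cosine.min())


# Placeholder data: emb_reference stands in for the PyTorch output,
# emb_candidate for the output of another backend with small numerical drift.
rng = np.random.default_rng(0)
emb_reference = rng.normal(size=(10, 8))
emb_candidate = emb_reference + 1e-4 * rng.normal(size=(10, 8))

gap = max_cosine_gap(emb_reference, emb_candidate)
print(f"largest cosine gap: {gap:.2e}")
assert gap < 1e-3, "backends disagree more than expected"
```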

## Training

We support multi-node, multi-GPU distributed training using `accelerate`.
Training is configured through data, training, and model config files.

Important arguments:

1. Arguments for paths:

    1. `base_dir`: base directory of this repo

    2. `train_data`: path to the training data; can be files or folders. Each record should have the following format:

    ```json
    {
        "query": "query",
        "pos": ["pos 1", "pos 2"],
        "neg": ["neg 1", "neg 2"]
    }
    ```

    3. `model_name_or_path`: name or path of the base model

    4. `ckpt_save_dir`: directory where checkpoints are saved

    5. `deepspeed`: path to the DeepSpeed config file

2. Arguments for training:

    1. `num_train_epochs`: number of training epochs

    2. `per_device_train_batch_size`: batch size per device

    3. `num_gpus`: number of GPUs to use
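
The `train_data` record format above can be checked before training starts. The sketch below assumes the data is stored as JSON Lines (one JSON object per line), which this section does not state explicitly. `validate_record` is a hypothetical helper, not a Nexus function.

```python
import json


def validate_record(line: str) -> dict:
    """Parse one line and check it has a string `query` and list-valued `pos` and `neg`."""
    record = json.loads(line)
    if not isinstance(record.get("query"), str):
        raise ValueError("'query' must be a string")
    for key in ("pos", "neg"):
        values = record.get(key)
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"'{key}' must be a list of strings")
    return record


example = {"query": "query", "pos": ["pos 1", "pos 2"], "neg": ["neg 1", "neg 2"]}
line = json.dumps(example, ensure_ascii=False)  # one record per line in the file
print(validate_record(line) == example)  # True
```

Note that strict JSON does not allow a trailing comma after the last field, so a record with one will fail at `json.loads`.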

We use `accelerate` to launch multi-node training.

1. First, generate an accelerate config file.

In [None]:
accelerate config --config_file accelerate_config.json

2. Then launch the training script with accelerate. For multi-node training, run the launch command on every node, in order.

In [None]:
accelerate launch --config_file /your/accelerate/config/file /your/python/file 
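
For multi-node runs, the same launch command is issued on every node, with only the node's rank changing. The sketch below builds those commands in Python so they can be inspected or passed to `subprocess.run`. It assumes the standard `accelerate launch` options `--num_machines`, `--machine_rank`, `--main_process_ip`, and `--main_process_port`. The IP address, port, and file names are placeholders, `build_launch_command` is a hypothetical helper, and these values can alternatively live in the accelerate config file.

```python
import shlex
from typing import List


def build_launch_command(config_file: str, script: str, machine_rank: int,
                         num_machines: int, main_ip: str, main_port: int) -> List[str]:
    """Return the argv list for `accelerate launch` on one node."""
    return [
        "accelerate", "launch",
        "--config_file", config_file,
        "--num_machines", str(num_machines),
        "--machine_rank", str(machine_rank),
        "--main_process_ip", main_ip,
        "--main_process_port", str(main_port),
        script,
    ]


# Placeholder values for a two-node job; rank 0 is the main node.
for rank in range(2):
    cmd = build_launch_command("accelerate_config.json", "train.py", rank, 2, "10.0.0.1", 29500)
    print(shlex.join(cmd))
```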

### Single node Single Device

In single-device mode, set `negatives_cross_device` to `false` and leave `deepspeed` unset (`None`) so that distributed training is not started.

In [None]:
import time

from Nexus.training.embedder.text_retrieval import *


def main():
    data_config_path='/root/Nexus/examples/text_retrieval/training/embedder/data_config.json'
    train_config_path='/root/Nexus/examples/text_retrieval/training/embedder/training_config_single_device.json'
    model_config_path='/root/Nexus/examples/text_retrieval/training/embedder/model_config.json'
    
    model_args = TextEmbedderModelArguments.from_json(model_config_path)
    data_args = TextEmbedderDataArguments.from_json(data_config_path)
    training_args = TextEmbedderTrainingArguments.from_json(train_config_path)
    runner = TextEmbedderRunner(
        model_args=model_args,
        data_args=data_args,
        training_args=training_args
    )
    
    start = time.time()
    runner.run()
    end = time.time()
    elapsed_time = end-start
    print(f"Elapsed time: {elapsed_time:.4f} seconds")


if __name__ == "__main__":
    main()
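
The two single-device settings mentioned above (`negatives_cross_device` set to false, `deepspeed` unset) can be sanity-checked on a loaded config before launching. The sketch assumes the training config is a flat JSON object whose keys match the argument names. `check_single_device_config` is a hypothetical helper, and the example dictionary is illustrative: it omits every other field the real `training_config_single_device.json` contains.

```python
import json


def check_single_device_config(config: dict) -> list:
    """Return a list of settings that conflict with single-device training."""
    problems = []
    if config.get("negatives_cross_device", False):
        problems.append("negatives_cross_device should be false on a single device")
    if config.get("deepspeed") is not None:
        problems.append("deepspeed should be unset (null) on a single device")
    return problems


# Illustrative fragment only; field values are placeholders.
config = json.loads('{"negatives_cross_device": false, "deepspeed": null, "num_train_epochs": 1}')
print(check_single_device_config(config))  # []
```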

### Multi Node

Below is a Python script for multi-device training on a single node or on multiple nodes.


In [None]:
import time

from Nexus.training.embedder.text_retrieval import *


def main():
    data_config_path='/root/Nexus/examples/text_retrieval/training/embedder/data_config.json'
    train_config_path='/root/Nexus/examples/text_retrieval/training/embedder/training_config.json'
    model_config_path='/root/Nexus/examples/text_retrieval/training/embedder/model_config.json'
    
    model_args = TextEmbedderModelArguments.from_json(model_config_path)
    data_args = TextEmbedderDataArguments.from_json(data_config_path)
    training_args = TextEmbedderTrainingArguments.from_json(train_config_path)
    runner = TextEmbedderRunner(
        model_args=model_args,
        data_args=data_args,
        training_args=training_args
    )
    start = time.time()
    runner.run()
    end = time.time()
    elapsed_time = end-start
    print(f"Elapsed time: {elapsed_time:.4f} seconds")

if __name__ == "__main__":
    main()
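
When scaling from one device to several nodes, the number of samples consumed per optimizer step grows with the device count. As a worked example of the usual data-parallel arithmetic (not Nexus-specific code; `gpus_per_node`, `num_nodes`, and `gradient_accumulation_steps` are illustrative parameter names that this section does not define):

```python
def global_batch_size(per_device_train_batch_size: int, gpus_per_node: int,
                      num_nodes: int = 1, gradient_accumulation_steps: int = 1) -> int:
    """Samples per optimizer step under plain data parallelism."""
    return (per_device_train_batch_size * gpus_per_node
            * num_nodes * gradient_accumulation_steps)


# 32 samples per GPU, 8 GPUs per node, 2 nodes.
print(global_batch_size(32, gpus_per_node=8, num_nodes=2))  # 512
```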


## Evaluation

The evaluation step supports the `normal`, `onnx`, and `tensorrt` inference modes for both the embedder and the reranker.

Below are some important arguments:

1. `eval_name`: your eval dataset name
2. `dataset_dir`: your data path
3. `splits`: default is `test`
4. `corpus_embd_save_dir`: your corpus embedding index save path
5. `output_dir`: eval result output dir
6. `search_top_k`: search top k
7. `rerank_top_k`: rerank top k
8. `embedder_name_or_path`: embedder model path
9. `reranker_name_or_path`: reranker model path
10. `embedder_infer_mode`: embedder inference mode. The default is `None`, which means `TextEmbedder` is used. Choices are `normal`, `onnx`, and `tensorrt`.
11. `reranker_infer_mode`: same as `embedder_infer_mode`, but for the reranker
12. `embedder_onnx_model_path`: ONNX model path, required if `embedder_infer_mode` is set to `onnx`
13. `reranker_onnx_model_path`: same as `embedder_onnx_model_path`, but for the reranker
14. `embedder_trt_model_path`: TensorRT model path, required if `embedder_infer_mode` is set to `tensorrt`
15. `reranker_trt_model_path`: same as `embedder_trt_model_path`, but for the reranker
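
`search_top_k` and `rerank_top_k` relate to the two stages of a retrieve-then-rerank pipeline. The toy sketch below illustrates that general pattern with made-up scores. It assumes the embedder keeps the `search_top_k` best documents and the reranker rescores the first `rerank_top_k` of them; this is an assumption about the common pattern, not the code of the evaluation runner, and `retrieve_then_rerank` is a hypothetical function.

```python
import numpy as np


def retrieve_then_rerank(embed_scores: np.ndarray, rerank_scores: np.ndarray,
                         search_top_k: int, rerank_top_k: int) -> list:
    """Return document indices ordered by reranker score.

    embed_scores[i]  : embedder similarity of document i to the query
    rerank_scores[i] : reranker score of document i (a real system would
                       compute these only for the candidates)
    """
    # Stage 1: keep the search_top_k best documents by embedder score.
    candidates = np.argsort(-embed_scores)[:search_top_k]
    # Stage 2 (assumed behaviour): rescore only the first rerank_top_k candidates.
    to_rerank = candidates[:rerank_top_k]
    order = np.argsort(-rerank_scores[to_rerank])
    return to_rerank[order].tolist()


embed_scores = np.array([0.9, 0.2, 0.8, 0.7, 0.1])
rerank_scores = np.array([0.3, 0.9, 0.95, 0.5, 0.99])
print(retrieve_then_rerank(embed_scores, rerank_scores, search_top_k=4, rerank_top_k=3))
# [2, 3, 0]
```

Document 4 has the highest reranker score but never reaches the reranker, because the embedder ranked it last. This is why the search-stage cutoff bounds what reranking can recover.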

In [None]:
from transformers import HfArgumentParser

from Nexus.evaluation.text_retrieval.airbench import (
    AIRBenchEvalArgs, AIRBenchEvalModelArgs,
    AIRBenchEvalRunner
)

from Nexus.evaluation.text_retrieval.arguments import load_config

def main():

    eval_config_path='/root/Nexus/examples/text_retrieval/evaluation/config/eval_config.json'
    model_config_path='/root/Nexus/examples/text_retrieval/evaluation/config/model_config.json'
    
    eval_args = load_config(eval_config_path, AIRBenchEvalArgs)
    model_args = load_config(model_config_path, AIRBenchEvalModelArgs)
    
    eval_args: AIRBenchEvalArgs
    model_args: AIRBenchEvalModelArgs

    runner = AIRBenchEvalRunner(
        eval_args=eval_args,
        model_args=model_args
    )

    runner.run()


if __name__ == "__main__":
    main()
    print("==============================================")
    print("Search results have been generated.")
