# Demo Notebook for Sentence Transformer Model Training, Saving and Uploading to OpenSearch
This notebook walks through using synthetic queries to fine-tune a sentence transformer model. In this notebook, you use opensearch_py_ml to accomplish the following:

Step 0: Import packages and set up client

Step 1: Read synthetic queries and train/fine-tune model using a Hugging Face sentence transformer model

Step 2: (Optional) Save model

Step 3: Upload the model to OpenSearch cluster

## Step 0: Import packages and set up client
Install `opensearch-py` and `opensearch-py-ml` from PyPI; they are imported as `opensearchpy` and `opensearch_py_ml`.
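
Both packages can be installed from a terminal or notebook cell with `pip install opensearch-py opensearch-py-ml`. As a minimal sketch, the helper below reports which of the two distributions still need installing before the import cell runs. The name `missing_packages` is hypothetical and not part of either library.

```python
import importlib.util

# Import name -> PyPI distribution name for the packages this notebook uses.
REQUIRED = {
    "opensearchpy": "opensearch-py",
    "opensearch_py_ml": "opensearch-py-ml",
}


def missing_packages(required=REQUIRED):
    """Return the PyPI names of required packages that cannot be found."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Install first: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```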

In [1]:
import opensearch_py_ml as oml
from opensearchpy import OpenSearch
from opensearch_py_ml.sentence_transformer_model import SentenceTransformerModel
import warnings
warnings.filterwarnings('ignore')


In [2]:
# import MLCommonClient to upload the model to the OpenSearch cluster later
from opensearch_py_ml.ml_commons_integration import MLCommonClient

In [3]:
CLUSTER_URL = 'https://localhost:9200'

In [4]:
def get_os_client(cluster_url=CLUSTER_URL,
                  username='admin',
                  password='admin'):
    '''
    Get OpenSearch client
    :param cluster_url: cluster URL like https://ml-te-netwo-1s12ba42br23v-ff1736fa7db98ff2.elb.us-west-2.amazonaws.com:443
    :param username: user name for HTTP basic authentication
    :param password: password for HTTP basic authentication
    :return: OpenSearch client
    '''
    client = OpenSearch(
        hosts=[cluster_url],
        http_auth=(username, password),
        # certificate verification is turned off for this demo setup
        verify_certs=False
    )
    return client
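
The defaults above hard-code the demo credentials `admin`/`admin`. One way to keep real credentials out of the notebook is to read them from environment variables, as in the sketch below. The variable names `OPENSEARCH_URL`, `OPENSEARCH_USER`, and `OPENSEARCH_PASSWORD` and the helper `client_settings` are assumptions made for illustration; opensearch-py does not define them.

```python
import os


def client_settings(env=None):
    """Build keyword arguments for get_os_client from environment variables.

    The variable names are hypothetical; the fallbacks mirror the demo defaults.
    """
    env = os.environ if env is None else env
    return {
        "cluster_url": env.get("OPENSEARCH_URL", "https://localhost:9200"),
        "username": env.get("OPENSEARCH_USER", "admin"),
        "password": env.get("OPENSEARCH_PASSWORD", "admin"),
    }


settings = client_settings()
print("Connecting to", settings["cluster_url"], "as", settings["username"])
# In the notebook: client = get_os_client(**settings)
```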

In [5]:
client = get_os_client()
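
Before training, it can help to confirm that the client actually reaches the cluster. The sketch below relies on the `ping()` and `info()` methods of the opensearch-py client. `describe_cluster` is a hypothetical helper, and the fields of the `info()` response are read defensively because they may differ between OpenSearch versions.

```python
def describe_cluster(client):
    """Return a short status string for an OpenSearch client.

    Works with any object exposing ping() and info(), such as opensearchpy.OpenSearch.
    """
    if not client.ping():
        return "cluster is not reachable"
    info = client.info()
    version = info.get("version", {}).get("number", "unknown")
    name = info.get("cluster_name", "unknown")
    return f"connected to cluster '{name}' (version {version})"


# In the notebook: print(describe_cluster(client))
```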

## Step 1: Read synthetic queries and train/fine-tune model using a Hugging Face sentence transformer model
With a zip file of synthetic queries, users can fine-tune a sentence transformer model. The `train` function imports the synthetic queries, loads the training examples, and fine-tunes a Hugging Face sentence transformer model.

    """
    Description:
    read the synthetic queries and use them to fine-tune/train a sentence transformer model, then save it
    as a zip file

    Parameters:
    read_path: str
        required, path to read the generated queries zip file. If None, defaults to the 'synthetic_query'
        folder in the current directory
    model_id: str = None
        optional, the url to download the sentence transformer model from. If None, defaults to
        'sentence-transformers/msmarco-distilbert-base-tas-b'
    output_model_path: str = None
        optional, the path to store the trained custom model. If None, defaults to the current folder path
    output_model_name: str = None
        optional, the name of the trained custom model. If None, defaults to 'trained_model.pt'
    zip_file_name: str = None
        optional, file name for the zip file. If None, defaults to 'custom_tasb_model.zip'
    use_accelerate: bool = False
        optional, use accelerate to fine-tune the model. Defaults to False, which trains without accelerate.
        If multiple GPUs are available on the machine, it is recommended to use accelerate with
        num_processes > 1 to speed up training. With accelerate, the function sets up the accelerate
        config automatically and launches train_model with the number of processes provided by the user;
        without accelerate, it runs train_model with the default settings
    compute_environment: str
        optional, compute environment type to run the model. If None, defaults to 'LOCAL_MACHINE'
    num_machines: int
        optional, number of machines to run the model on. If None, defaults to 1
    num_processes: int
        optional, number of processes to run the model with. If None, defaults to 1
    learning_rate: float
        optional, learning rate to train the model, default is 2e-5
    num_epochs: int
        optional, number of epochs to train the model, default is 20
    verbose: bool
        optional, plot the training progress. Defaults to False.
    Return:
        None
    """

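The docstring recommends `use_accelerate` with more than one process when several GPUs are available. A minimal sketch of that decision follows. `accelerate_kwargs` is a hypothetical helper, and it only sets the `use_accelerate`, `num_machines`, and `num_processes` parameters documented above.

```python
def accelerate_kwargs(gpu_count):
    """Pick train() keyword arguments from the number of visible GPUs."""
    if gpu_count > 1:
        # Several GPUs: one training process per GPU on a single machine.
        return {"use_accelerate": True, "num_machines": 1, "num_processes": gpu_count}
    # Zero or one GPU: keep the default, non-accelerated training path.
    return {"use_accelerate": False}


for count in (0, 1, 2):
    print(count, "GPU(s) ->", accelerate_kwargs(count))

# In the notebook, the count would typically come from torch.cuda.device_count(),
# and the result would be passed along as model.train(..., **accelerate_kwargs(count)).
```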

In [6]:
# clean up the cache before training to free up memory
import gc, torch

gc.collect()

torch.cuda.empty_cache()

In [15]:
model = SentenceTransformerModel()
training = model.train(read_path = '/home/ec2-user/SageMaker/opensearch-py-ml-mingshl/synthetic_queries.zip',
                        output_model_name = 'test_model.pt',
                        zip_file_name= 'test_model.zip',
                        overwrite = True,
                        use_accelerate  = True,  
                        num_machines = 1,
                        num_processes = 2,)

reading synthetic query file: /home/ec2-user/SageMaker/opensearch-py-ml-mingshl/synthetic_queries/output.p
Loading training examples... 


100%|██████████| 773/773 [00:00<00:00, 781628.98it/s]

generated config file: at/home/ec2-user/.cache/huggingface/accelerate/default_config.yaml
[{'compute_environment': 'LOCAL_MACHINE', 'deepspeed_config': {'gradient_accumulation_steps': 1, 'offload_optimizer_device': 'none', 'offload_param_device': 'none', 'zero3_init_flag': False, 'zero_stage': 2}, 'distributed_type': 'DEEPSPEED', 'downcast_bf16': 'no', 'fsdp_config': {}, 'machine_rank': 0, 'main_process_ip': None, 'main_process_port': None, 'main_training_function': 'main', 'mixed_precision': 'no', 'num_machines': 1, 'num_processes': 2, 'use_cpu': False}]
Launching training on 2 GPUs.





Start training with accelerator...Start training with accelerator...

The number of training epoch are 20The number of training epoch are 20

The total number of steps training epoch are 25The total number of steps training epoch are 25

Training epoch 0...Training epoch 0...



100%|██████████| 13/13 [00:01<00:00, 11.84it/s]


Training epoch 1...


100%|██████████| 13/13 [00:01<00:00, 11.67it/s]


Training epoch 1...


100%|██████████| 13/13 [00:01<00:00, 12.38it/s]


Training epoch 2...


100%|██████████| 13/13 [00:01<00:00, 12.26it/s]


Training epoch 2...


100%|██████████| 13/13 [00:01<00:00, 12.07it/s]


Training epoch 3...


100%|██████████| 13/13 [00:01<00:00, 12.19it/s]


Training epoch 3...


100%|██████████| 13/13 [00:01<00:00, 11.84it/s]


Training epoch 4...


100%|██████████| 13/13 [00:01<00:00, 11.96it/s]


Training epoch 4...


100%|██████████| 13/13 [00:01<00:00, 12.38it/s]


Training epoch 5...


100%|██████████| 13/13 [00:01<00:00, 12.28it/s]


Training epoch 5...


100%|██████████| 13/13 [00:01<00:00, 12.22it/s]


Training epoch 6...


100%|██████████| 13/13 [00:01<00:00, 12.32it/s]


Training epoch 6...


100%|██████████| 13/13 [00:01<00:00, 12.74it/s]


Training epoch 7...


100%|██████████| 13/13 [00:01<00:00, 12.49it/s]


Training epoch 7...


100%|██████████| 13/13 [00:01<00:00, 12.68it/s]


Training epoch 8...


100%|██████████| 13/13 [00:01<00:00, 12.09it/s]


Training epoch 8...


100%|██████████| 13/13 [00:01<00:00, 12.66it/s]


Training epoch 9...


100%|██████████| 13/13 [00:01<00:00, 12.55it/s]


Training epoch 9...


100%|██████████| 13/13 [00:01<00:00, 12.89it/s]


Training epoch 10...


100%|██████████| 13/13 [00:01<00:00, 12.66it/s]


Training epoch 10...


100%|██████████| 13/13 [00:01<00:00, 12.77it/s]


Training epoch 11...


100%|██████████| 13/13 [00:01<00:00, 12.86it/s]


Training epoch 11...


100%|██████████| 13/13 [00:01<00:00, 12.64it/s]


Training epoch 12...


100%|██████████| 13/13 [00:01<00:00, 12.87it/s]


Training epoch 12...


100%|██████████| 13/13 [00:01<00:00, 12.96it/s]


Training epoch 13...


100%|██████████| 13/13 [00:01<00:00, 12.84it/s]


Training epoch 13...


100%|██████████| 13/13 [00:01<00:00, 12.68it/s]


Training epoch 14...


100%|██████████| 13/13 [00:00<00:00, 13.05it/s]


Training epoch 14...


100%|██████████| 13/13 [00:01<00:00, 12.32it/s]


Training epoch 15...


100%|██████████| 13/13 [00:01<00:00, 12.05it/s]


Training epoch 15...


100%|██████████| 13/13 [00:01<00:00, 12.70it/s]


Training epoch 16...


100%|██████████| 13/13 [00:01<00:00, 12.40it/s]


Training epoch 16...


100%|██████████| 13/13 [00:01<00:00, 12.64it/s]


Training epoch 17...


100%|██████████| 13/13 [00:01<00:00, 12.57it/s]


Training epoch 17...


100%|██████████| 13/13 [00:01<00:00, 12.96it/s]


Training epoch 18...


100%|██████████| 13/13 [00:01<00:00, 13.00it/s]


Training epoch 18...


100%|██████████| 13/13 [00:01<00:00, 12.85it/s]


Training epoch 19...


100%|██████████| 13/13 [00:01<00:00, 12.77it/s]


Training epoch 19...


100%|██████████| 13/13 [00:01<00:00, 12.55it/s]
100%|██████████| 13/13 [00:01<00:00, 12.20it/s]


Total training time: 22.39643096923828
Total training time: 22.410019636154175
Preparing model to save
Preparing model to save
Model saved to path: /home/ec2-user/SageMaker/opensearch-py-ml-mingshl/test_model.pt
Model saved to path: /home/ec2-user/SageMaker/opensearch-py-ml-mingshl/test_model.pt
zip file is saved to /home/ec2-user/SageMaker/opensearch-py-ml-mingshl/test_model.zip
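
The step counts in the log can be checked by hand. The log reports 773 training examples, 25 total steps per epoch, and 13 steps per progress bar on each of the 2 GPUs. These numbers are consistent with a batch size of 32, which is an assumption here because the log does not print it.

```python
import math

num_examples = 773   # from the "Loading training examples" progress bar
num_processes = 2    # from "Launching training on 2 GPUs"
batch_size = 32      # assumed; the log does not print the batch size

# Total optimizer steps per epoch, then the share handled by each process.
steps_per_epoch = math.ceil(num_examples / batch_size)
steps_per_process = math.ceil(steps_per_epoch / num_processes)

print(steps_per_epoch)    # 25, matching "total number of steps" in the log
print(steps_per_process)  # 13, matching the 13/13 progress bars
```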


## Step 2: (Optional) Save model
If you followed Step 1, the model zip file was generated automatically, and the printed message shows the zip file path, as in the output above.

To use another pretrained sentence transformer model from Hugging Face instead, use the `save_as_pt` function to save the pretrained model for inference or for benchmarking against other models.

The `save_as_pt` function prepares the model in the proper format (TorchScript), along with the tokenizer configuration file, for upload to OpenSearch. The `save_as_pt` function takes the following arguments.

        """
        Description:
        download a sentence transformer model directly from Hugging Face, convert the model to TorchScript format,
        and zip the model file with its tokenizer.json file to prepare it for upload to the OpenSearch cluster

        Parameters:
        sentences: [str]
            Required, for example sentences = ['today is sunny']
        model: str
            Optional. If a model is provided, it is converted to TorchScript format;
            otherwise, the sentence transformer model is downloaded from Hugging Face.
            If None, defaults to model_id = "sentence-transformers/msmarco-distilbert-base-tas-b".
        model_name: str
            Optional, name of the model file, e.g., "sample_model.pt". If None, defaults to the
            model_id with the ".pt" extension.
        zip_file_name: str = None
            Optional, file name for the zip file, e.g., "sample_model.zip". If None, defaults to the model_id
            with the ".zip" extension.
        Return:
            None

        """



In [23]:
# by default, download the model id "sentence-transformers/msmarco-distilbert-base-tas-b" from Hugging Face
# and output a zip file containing the model.pt file and the tokenizers.json file.
model = SentenceTransformerModel()
model.save_as_pt(sentences = ['today is sunny'])

model file is saved to/home/ec2-user/SageMaker/opensearch-py-ml-mingshl/msmarco-distilbert-base-tas-b.pt
zip file is saved to /home/ec2-user/SageMaker/opensearch-py-ml-mingshl/msmarco-distilbert-base-tas-b.zip
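
The output above shows how the default file names relate to the model id: the part after the last `/` is kept, and the `.pt` or `.zip` extension is appended. A small sketch of that naming rule follows. It is inferred from the printed paths rather than taken from the library source, and `default_artifact_names` is a hypothetical helper.

```python
def default_artifact_names(model_id):
    """Derive default model and zip file names from a Hugging Face model id."""
    base_name = model_id.split("/")[-1]
    return base_name + ".pt", base_name + ".zip"


pt_name, zip_name = default_artifact_names(
    "sentence-transformers/msmarco-distilbert-base-tas-b"
)
print(pt_name)
print(zip_name)
```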


## Step 3: Upload the model to OpenSearch cluster

In general, the ML Commons client supports uploading sentence transformer models. Given a zip file that contains the model in TorchScript format and a tokenizer configuration file in JSON format, the `upload_model` function connects to OpenSearch through the ML client and uploads the model. The `upload_model` function takes three arguments:

    """
    Description:
    upload a zip file and model_config file to OpenSearch cluster. 
    Parameters:
    model_path: str
        file path of the model file (zip file expected). The zip file should contain two files. The first file 
        is a model file in Torch Script format, e.g, "model.pt". The second file is a configuration file for
        tokenizers in json format, e.g, "tokenizers.json".
    model_config_file: str
        file path of the model config file (json file expected), which includes the necessary config info for the
        model, such as the model name and version number. The details for model configuration are here:
        https://opensearch.org/docs/latest/ml-commons-plugin/model-serving-framework/. An example file is
        shown in the cells below.
    Return: 
        None
    """

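Since `upload_model` expects a zip with a TorchScript model file and a tokenizer JSON file, it can be worth checking the archive before uploading. The sketch below uses only the standard `zipfile` module. `check_model_zip` is a hypothetical helper, and matching on the `.pt` and `.json` extensions is an assumption based on the example file names in the docstring.

```python
import zipfile


def check_model_zip(zip_path):
    """Return (model_files, json_files) found in the archive.

    Raises ValueError if either kind of file is missing.
    """
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
    model_files = [name for name in names if name.endswith(".pt")]
    json_files = [name for name in names if name.endswith(".json")]
    if not model_files or not json_files:
        raise ValueError(f"expected a .pt and a .json file, found: {names}")
    return model_files, json_files


# In the notebook: check_model_zip(model_path) before calling upload_model
```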

In [None]:
#connect to ml_common client with OpenSearch client
ml_client = MLCommonClient(client)

In [None]:
# the user needs to prepare a model_config.json file that configures the model, including the model name and other fields
# this is a sample model_config.json file

{
   "name":"all-MiniLM-L6-v2",
   "version":1,
   "model_format":"TORCH_SCRIPT",
   "model_task_type":"TEXT_EMBEDDING",
   "model_config":{
      "model_type":"bert",
      "embedding_dimension":384,
      "framework_type":"sentence_transformers",
      "all_config":"{\"_name_or_path\":\"nreimers/MiniLM-L6-H384-uncased\",\"architectures\":[\"BertModel\"],\"attention_probs_dropout_prob\":0.1,\"gradient_checkpointing\":false,\"hidden_act\":\"gelu\",\"hidden_dropout_prob\":0.1,\"hidden_size\":384,\"initializer_range\":0.02,\"intermediate_size\":1536,\"layer_norm_eps\":1e-12,\"max_position_embeddings\":512,\"model_type\":\"bert\",\"num_attention_heads\":12,\"num_hidden_layers\":6,\"pad_token_id\":0,\"position_embedding_type\":\"absolute\",\"transformers_version\":\"4.8.2\",\"type_vocab_size\":2,\"use_cache\":true,\"vocab_size\":30522}"
   }
}
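
The sample above can also be generated from Python, which avoids hand-escaping the nested `all_config` string. This sketch builds the same top-level fields with the standard `json` module. `write_model_config` is a hypothetical helper, the file is written to the system temp directory only for the demonstration, and `all_config` is shortened to two keys from the sample for brevity.

```python
import json
import os
import tempfile


def write_model_config(path, name, embedding_dimension, all_config):
    """Write a model config file shaped like the sample and return the dict."""
    config = {
        "name": name,
        "version": 1,
        "model_format": "TORCH_SCRIPT",
        "model_task_type": "TEXT_EMBEDDING",
        "model_config": {
            "model_type": "bert",
            "embedding_dimension": embedding_dimension,
            "framework_type": "sentence_transformers",
            # all_config is itself a JSON document stored as a string.
            "all_config": json.dumps(all_config),
        },
    }
    with open(path, "w") as config_file:
        json.dump(config, config_file, indent=3)
    return config


config_path = os.path.join(tempfile.gettempdir(), "model_config.json")
config = write_model_config(
    config_path,
    name="all-MiniLM-L6-v2",
    embedding_dimension=384,
    all_config={"model_type": "bert", "hidden_size": 384},
)
print(config["model_config"]["all_config"])
```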

In [8]:
# upload model to OpenSearch cluster, using model zip file path

model_path = '/home/ec2-user/SageMaker/opensearch-py-ml-mingshl/test_model.zip'
model_config_path = '/home/ec2-user/SageMaker/opensearch-py-ml-mingshl/save_pre_trained_model_json/model_config.json'
ml_client.upload_model(model_path, model_config_path, isVerbose=True)

Total number of chunks 27
Sha1 value of the model file:  d1fc88bc317ed3dc52c4c7dc4d51122c6989876e62167566289a8067ab5d51e7
Model meta data was created successfully. Model Id:  oglVm4QBDmk7AZE7yW-d
uploading chunk 1 of 27
{'status': 'Uploaded'}
uploading chunk 2 of 27
{'status': 'Uploaded'}
uploading chunk 3 of 27
{'status': 'Uploaded'}
uploading chunk 4 of 27
{'status': 'Uploaded'}
uploading chunk 5 of 27
{'status': 'Uploaded'}
uploading chunk 6 of 27
{'status': 'Uploaded'}
uploading chunk 7 of 27
{'status': 'Uploaded'}
uploading chunk 8 of 27
{'status': 'Uploaded'}
uploading chunk 9 of 27
{'status': 'Uploaded'}
uploading chunk 10 of 27
{'status': 'Uploaded'}
uploading chunk 11 of 27
{'status': 'Uploaded'}
uploading chunk 12 of 27
{'status': 'Uploaded'}
uploading chunk 13 of 27
{'status': 'Uploaded'}
uploading chunk 14 of 27
{'status': 'Uploaded'}
uploading chunk 15 of 27
{'status': 'Uploaded'}
uploading chunk 16 of 27
{'status': 'Uploaded'}
uploading chunk 17 of 27
{'status': 'Uploaded