# Demo Notebook for Sentence Transformer Model Training, Saving and Uploading to OpenSearch


## Introduction

Deep learning models are very powerful and have been shown to improve the state of the art on several tasks. However, they need a lot of labelled training data, which is often hard to obtain. In this notebook, we show how pre-trained large language models can be used to circumvent this issue. We introduce the technique of synthetic data generation and use it to obtain a transformer model that is custom-built for search over a given set of documents.


### Passage retrieval

We focus on the task of passage retrieval, i.e. the corpus consists of passages, and it is searched at run-time given a user query. A passage can be any piece of unstructured text, such as a sentence, a document or a webpage.

Deep neural networks such as transformers have been shown to give state-of-the-art results for passage/document retrieval, given a large enough labelled dataset. For passage retrieval, a labelled dataset consists of (query, relevant passage) pairs.

Labelled datasets such as MS Marco and Natural Questions consist of more than 500K (query, passage) pairs and can be used to train transformers. However, transformers trained on these datasets have limited performance on out-of-domain datasets (https://arxiv.org/abs/2104.08663). This is a well-known fact: medium-sized transformers have trouble generalizing to out-of-distribution data. Thus, to use transformers for search, we need domain-specific labelled data. Unfortunately, such data is not generally available and is hard to acquire.

### Synthetic query generation

In the absence of such labelled data, we provide a synthetic query generator (SQG) model that can be used to create synthetic queries given a passage. The SQG model is a large transformer model that has been trained to generate human-like queries given a passage. It can be used to create a labelled dataset of (synthetic query, passage) pairs. A transformer model can be trained on this synthetic data and used for semantic search. In fact, we have shown that such synthetically trained models beat the current state of the art.

### Train BERT Model with synthetic query data

After generating synthetic queries, we can train a Sentence Transformer model on them to get more precise embeddings.
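To make the role of embeddings concrete, the sketch below ranks passages against a query by cosine similarity. It is only an illustration: the three-dimensional vectors are made-up stand-ins for real sentence transformer embeddings, and the `cosine_similarity` and `rank_passages` helpers are hypothetical names that do not come from opensearch_py_ml.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return passage ids sorted from most to least similar to the query."""
    scores = {pid: cosine_similarity(query_vec, vec)
              for pid, vec in passage_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (assumed values, for illustration only)
passage_vecs = {
    "bartender_pay": [0.9, 0.1, 0.0],
    "urinalysis": [0.0, 0.2, 0.9],
    "vitamin_a": [0.1, 0.9, 0.3],
}
query_vec = [0.8, 0.2, 0.1]  # stands in for "how much do bartenders make"

ranking = rank_passages(query_vec, passage_vecs)
print(ranking)
```

A better-trained model moves the embedding of a query closer to the embeddings of its relevant passages, which is what improves this ranking.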



This notebook walks users through using their synthetic queries to fine-tune and train a sentence transformer model. In this notebook, you use opensearch_py_ml to accomplish the following:

Step 1: Import packages and set up client

Step 2: Import the data/passages for synthetic query generation

Step 3: Generate Synthetic Queries

Step 4: Read synthetic queries and train/fine-tune model using a Hugging Face sentence transformer model

Step 5: (Optional) Save model

Step 6: Upload the model to OpenSearch cluster

## Step 1: Import packages, set up client and define helper functions

Install the packages required by opensearch_py_ml.sentence_transformer_model.
Install `opensearch-py` and `opensearch-py-ml` from PyPI.

The generate.py script is released with the Synthetic Query Generation model.

Please refer to https://pytorch.org/ to install torch properly for your environment.

In [1]:
# pip install pandas matplotlib numpy torch accelerate sentence_transformers tqdm transformers opensearch-py opensearch-py-ml detoxify datasets 

In [2]:
import warnings
warnings.filterwarnings('ignore')
import opensearch_py_ml as oml
from opensearchpy import OpenSearch
import generate  # generate.py script is released with the Synthetic Query Generation model
from generate import Synthetic_Query_Generation
from opensearch_py_ml.ml_models import SentenceTransformerModel
import boto3, json
import pandas as pd, numpy as np
from datasets import load_dataset
import gc, torch
gc.collect()
torch.cuda.empty_cache()

In [3]:
# import mlcommon to later upload the model to OpenSearch Cluster
from opensearch_py_ml.ml_commons import MLCommonClient

In [4]:
CLUSTER_URL = 'https://localhost:9200'

In [5]:
def get_os_client(cluster_url = CLUSTER_URL,
                  username='admin',
                  password='admin'):
    '''
    Get OpenSearch client
    :param cluster_url: cluster URL like https://ml-te-netwo-1s12ba42br23v-ff1736fa7db98ff2.elb.us-west-2.amazonaws.com:443
    :return: OpenSearch client
    '''
    client = OpenSearch(
        hosts=[cluster_url],
        http_auth=(username, password),
        verify_certs=False
    )
    return client 

In [6]:
client = get_os_client()

In [7]:
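def myselect(x):
    # Return the text of the first passage marked as selected for this row,
    # or the sentinel string "-1" when no passage is selected.
    if max(x["passages"]["is_selected"]) == 1:
        return x["passages"]["passage_text"][np.argmax(x["passages"]["is_selected"])]
    return "-1"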
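As a quick check of what `myselect` returns, the sketch below applies a pure-Python equivalent to two toy records laid out like the MS Marco rows used later. The name `select_passage` is hypothetical, `list.index` stands in for `np.argmax`, and the record contents are made up.

```python
def select_passage(row):
    """Return the first passage flagged as selected, or "-1" if none is."""
    flags = row["passages"]["is_selected"]
    if max(flags) == 1:
        # Index of the first 1, which is what np.argmax gives on a 0/1 list
        return row["passages"]["passage_text"][flags.index(1)]
    return "-1"


# Toy rows (assumed content) with the same nested layout as the dataset
row_with_answer = {
    "passages": {
        "is_selected": [0, 1, 0],
        "passage_text": ["first passage", "second passage", "third passage"],
    }
}
row_without_answer = {
    "passages": {
        "is_selected": [0, 0],
        "passage_text": ["first passage", "second passage"],
    }
}

selected = select_passage(row_with_answer)
missing = select_passage(row_without_answer)
print(selected, missing)
```

Rows that yield the sentinel value are the ones filtered out when the dataframe is built.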

## Step 2: Import the data/passages for synthetic query generation

There are three supported options to read datasets:
* Option 1: read from a local data folder in jsonl file 
* Option 2: read from a list of passages
* Option 3: read from OpenSearch client by index_name

For the purpose of this notebook we will demonstrate option 2: read from a list of passages. 
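If your passages live in a JSONL file, one way to obtain the list of strings that option 2 expects is to load the file yourself. The sketch below does this with the standard library only; it does not use the library's own loader, and both the `read_passages_from_jsonl` helper and the `passage` field name are assumptions made for illustration.

```python
import json
import os
import tempfile


def read_passages_from_jsonl(path, field="passage"):
    """Read one JSON object per line and collect the text stored under `field`."""
    passages = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            passages.append(json.loads(line)[field])
    return passages


# Write a tiny example file so that the sketch runs on its own
records = [
    {"passage": "Urinalysis is a test that evaluates a sample of your urine."},
    {"passage": "A boil, also called a furuncle, is a deep folliculitis."},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    jsonl_path = os.path.join(tmp_dir, "passages.jsonl")
    with open(jsonl_path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    sample = read_passages_from_jsonl(jsonl_path)

print(len(sample))
```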

We take the MS Marco dataset of passages as our example dataset. 

### 2.1) Load the data and convert into a pandas dataframe

In [12]:
dataset = load_dataset("ms_marco","v1.1")
df = pd.DataFrame.from_dict(dataset["validation"])

Reusing dataset ms_marco (/Users/dhrubo/.cache/huggingface/datasets/ms_marco/v1.1/1.1.0/b6a62715fa5219aea5275dd3556601004cd63945cb63e36e022f77bb3cbbca84)


  0%|          | 0/3 [00:00<?, ?it/s]

In [13]:
df["passage"] = df.apply(lambda x: myselect(x), axis = 1)
df = df[["query","passage"]][df.passage != "-1"]

The above cells create a dataframe that consists of queries and passages.

In [14]:
# Set print options to display full columns

pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)

The dataset looks like this:

In [15]:
df[0:10]

Unnamed: 0,query,passage
0,walgreens store sales average,"The average Walgreens salary ranges from approximately $15,000 per year for Customer Service Associate / Cashier to $179,900 per year for District Manager. Average Walgreens hourly pay ranges from approximately $7.35 per hour for Laboratory Technician to $68.90 per hour for Pharmacy Manager. Salary information comes from 7,810 data points collected directly from employees, users, and jobs on Indeed."
1,how much do bartenders make,"According to the Bureau of Labor Statistics, the average hourly wage for a bartender is $10.36, and the average yearly take-home is $21,550. Bartending can be a lot of things. For some it is exciting, for others exhausting. At times there is a lot of fun to be had, at others it is rather dull. But for the most part, bartending is almost always rewarding in the financial sense, as long as you stick with it."
2,what is a furuncle boil,"A boil, also called a furuncle, is a deep folliculitis, infection of the hair follicle. It is most commonly caused by infection by the bacterium Staphylococcus aureus, resulting in a painful swollen area on the skin caused by an accumulation of pus and dead tissue. Signs and symptoms [edit]. Boils are bumpy, red, pus-filled lumps around a hair follicle that are tender, warm, and very painful. They range from pea-sized to golf ball-sized. A yellow or white point at the center of the lump can be seen when the boil is ready to drain or discharge pus."
3,what can urinalysis detect,"Urinalysis is a test that evaluates a sample of your urine. Urinalysis is used to detect and assess a wide range of disorders, such as urinary tract infection, kidney disease and diabetes. Urinalysis involves examining the appearance, concentration and content of urine. Abnormal urinalysis results may point to a disease or illness. For example, a urinary tract infection can make urine look cloudy instead of clear. Increased levels of protein in urine can be a sign of kidney disease."
4,what is vitamin a used for,"Vitamin A is also used for shigellosis, diseases of the nervous system, nose infections, loss of sense of smell, asthma, persistent headaches, kidney stones, overactive thyroid, iron-poor blood (anemia), deafness, ringing in the ears, and precancerous mouth sores (leukoplakia). It can also be made in a laboratory. Vitamin A is used for treating vitamin A deficiency. It is also used to reduce complications of diseases such as malaria, HIV, measles, and diarrhea in children with"
5,what causes genetic alterations in normal cells,"The initiation of cell transformation is generally associated with genetic alterations in normal cells that lead to the loss of intercellular-and/or extracellular-matrix- (ECM-) mediated cell adhesion. Cancer afflicts an organ or a tissue by inducing abnormal and uncontrolled division of cells that either constitute it or migrate to it. At the cellular level, this is caused by genetic alterations in networks that regulate cell division and cell death."
6,cost to frame basement,"Our free calculator uses recent, trusted data to estimate costs for your Basement Wall Framing project. For a basic 125 square feet project in zip code 47474, the benchmark cost to Frame Basement Walls ranges between $2.51 - $3.17 per square foot* . To estimate costs for your project:"
7,erudite divergent definition,"The smart ones, the ones value knowledge and logic are Erudite. They know everything. . Erudite is one of the five factions in the world of Divergent, the one and only faction dedicated to knowledge, intelligence, curiosity, and astuteness. It was formed by those who blamed ignorance for the war that had occurred in the past, causing them to split into factions in the first place. They also use Dauntless as their soldiers near the end of Divergent. They have a close relationship with Amity, but Amity are not involved in the war because they are the peace faction. No relationship is stated between Erudite and Candor."
8,why is albumin normally absent in urine,"Share. Albumin is a protein present in the blood. Proteins are normally absent in urine because the kidney cells generally prevent large molecules including proteins, from being excreted. Some proteins may appear in the urine in normal individuals also if blood levels are very high. In kidney diseases, albumin will appear in the urine even with normal blood levels."
9,where was movie the birds filmed,"The Birds filming location: the Schoolhouse: Bodega Lane, Bodega, Northern California. Alfred Hitchcock ’s film of the Daphne du Maurier short story (originally set in Cornwall, England) uses lots of process (special effects) shots but, as usual, the director plays fair with the geography. The Birds filming location: the restaurant: Tides Wharf and Restaurant, Bodega Bay, Northern California. The Tides Wharf & Restaurant, in which the assorted locals shelter from the bird attacks, has expanded into an unrecognisable hotel complex since the film was made. The big surprise is that there is no town."


The MS Marco dataset has real queries for its passages, but we will pretend that it does not and generate synthetic queries for each passage.

### 2.2) Convert the data into a list of strings and instantiate an object of the class Synthetic_Query_Generation

In [18]:
sample_passages = list(df.passage.values)

In [19]:
ss = Synthetic_Query_Generation(sentences = sample_passages[:8]) 

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Step 3: Generate synthetic queries

In [20]:
three_step_query = ss.generate_synthetic_queries(num_machines = 1,
                                                 overwrite = True,
                                                 total_queries = 10,                                            
                                                 numseq = 5,
                                                 num_gpu = 0,
                                                 toxic_cutoff = 0.01)

Tokenizing corpus...
Preparing input_ids and attention_mask... 


100%|██████████| 8/8 [00:00<00:00, 424.24it/s]

0 number of documents out of 8 are longer than 512 tokens and were discarded





The number of steps for creating queries:  8
Running on CPU...


100%|██████████| 8/8 [03:25<00:00, 25.68s/it]


The total number of synthetic queries before detoxify is 80
68good queries are kept after detoxify.
File is saved to /Volumes/workplace/opensearch-py-ml/src/opensearch-py-ml/queries_after_detoxify/synthetic_queries_batch_.p file.
Zip file is saved to/Volumes/workplace/opensearch-py-ml/src/opensearch-py-ml/clean_synthetic_queries.zip


Several things happen in the cell above. We go through them step by step:

    1) Convert the data into a form that can be consumed by the synthetic query generator (SQG) model. This amounts to tokenizing the data. The SQG model is a fine-tuned version of the GPT2-XL model https://huggingface.co/gpt2-xl and the tokenizer is the GPT-2 tokenizer.
    
    2) The tokenizer has a max input length of 512 tokens; longer passages are discarded, as the log above reports. Every passage is tokenized with the special token <|startoftext|> added at the beginning and QRY: added at the end.
    
    3) Load the SQG model, i.e. the 1.5B-parameter GPT2-XL model that has been trained to ask questions given passages. This model has been made publicly available and can be found here: https://ci.opensearch.org/ci/dbc/models/ml-models/amazon/gpt/GPT2_xl_sqg/1.0.0/GPT2_xl_sqg.zip
    
    4) Once the model has been loaded and the data has been tokenized, query generation starts. "total_queries" is the number of synthetic queries generated for every passage, and "numseq" is the number of queries that the model generates in a single call. Ideally total_queries = numseq, but this can lead to out-of-memory issues. So set numseq to an integer that is around 10 or less and is a divisor of total_queries.
    
    The function also needs the number of GPUs and the number of machines/nodes that it can use. Since we are using a single-node instance with no GPUs, we pass num_gpu = 0 and num_machines = 1.
    
    5) The function now generates queries and displays a progress bar. It creates total_queries queries per passage. Empirically, we find that generating more queries leads to better performance, but with diminishing returns, and the total inference time increases with total_queries.
    
    6) After generating the queries, the function uses a publicly available package called Detoxify to remove inappropriate queries from the dataset. "toxic_cutoff" is a float; the script rejects all queries that have a toxicity score greater than toxic_cutoff.
    
    7) Finally, the synthetic queries and their corresponding passages are saved in a zip file in the current working directory.
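Three of the steps above (wrapping a passage with special tokens, splitting total_queries into rounds of numseq, and filtering by toxic_cutoff) can be sketched in plain Python. This is not the code in generate.py: the helper names are hypothetical, the exact spacing around the special tokens is an assumption, and the toxicity scores are made up rather than produced by Detoxify.

```python
START_TOKEN = "<|startoftext|>"
QUERY_TOKEN = "QRY:"


def build_prompt(passage):
    """Wrap a passage with the start token and the query marker (step 2).

    The single spaces used as separators are an assumption.
    """
    return f"{START_TOKEN} {passage} {QUERY_TOKEN}"


def generation_rounds(total_queries, numseq):
    """Number of generation calls needed per passage (step 4)."""
    if total_queries % numseq != 0:
        raise ValueError("numseq should be a divisor of total_queries")
    return total_queries // numseq


def filter_toxic(scored_queries, toxic_cutoff):
    """Keep queries whose toxicity score is not greater than the cutoff (step 6)."""
    return [query for query, score in scored_queries if score <= toxic_cutoff]


prompt = build_prompt("A boil, also called a furuncle, is a deep folliculitis.")
print(prompt)

# Same settings as the cell above: 8 passages, total_queries = 10, numseq = 5
num_passages, total_queries, numseq = 8, 10, 5
rounds = generation_rounds(total_queries, numseq)
queries_before_filter = num_passages * total_queries
print(rounds, queries_before_filter)  # 2 80

# Made-up (query, toxicity score) pairs for illustration
scored = [("what is a furuncle boil", 0.001), ("hypothetical rude query", 0.4)]
kept = filter_toxic(scored, toxic_cutoff=0.01)
print(kept)
```

The count of 80 matches the number of queries reported in the log before the Detoxify step.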

### This is how the sample queries look

In [24]:
# initiate SentenceTransformerModel object

custom_model = SentenceTransformerModel(folder_path="/Volumes/workplace/upload_content/model_files/", overwrite = True)



df = custom_model.read_queries(read_path = '/Volumes/workplace/upload_content/clean_synthetic_queries.zip', overwrite = True)

df[::10]

Reading synthetic query file: /Volumes/workplace/upload_content/model_files/synthetic_queries/synthetic_queries_batch.p



Unnamed: 0,prob,query,passages
11,5.8e-05,what is the salary range for a bb bar,"According to the Bureau of Labor Statistics, the average hourly wage for a bartender is $10.36, and the average yearly take-home is $21,550. Bartending can be a lot of things. For some it is exciting, for others exhausting. At times there is a lot of fun to be had, at others it is rather dull. But for the most part, bartending is almost always rewarding in the financial sense, as long as you stick with it."
67,2.7e-05,are divergent known for their science,"The smart ones, the ones value knowledge and logic are Erudite. They know everything.. Erudite is one of the five factions in the world of Divergent, the one and only faction dedicated to knowledge, intelligence, curiosity, and astuteness. It was formed by those who blamed ignorance for the war that had occurred in the past, causing them to split into factions in the first place. They also use Dauntless as their soldiers near the end of Divergent. They have a close relationship with Amity, but Amity are not involved in the war because they are the peace faction. No relationship is stated between Erudite and Candor."
33,8.8e-05,what diseases do urinalis tests show,"Urinalysis is a test that evaluates a sample of your urine. Urinalysis is used to detect and assess a wide range of disorders, such as urinary tract infection, kidney disease and diabetes. Urinalysis involves examining the appearance, concentration and content of urine. Abnormal urinalysis results may point to a disease or illness. For example, a urinary tract infection can make urine look cloudy instead of clear. Increased levels of protein in urine can be a sign of kidney disease."
46,1e-06,what is a primary role for growth factors?,"The initiation of cell transformation is generally associated with genetic alterations in normal cells that lead to the loss of intercellular-and/or extracellular-matrix- (ECM-) mediated cell adhesion. Cancer afflicts an organ or a tissue by inducing abnormal and uncontrolled division of cells that either constitute it or migrate to it. At the cellular level, this is caused by genetic alterations in networks that regulate cell division and cell death."
49,0.00061,definition of initiation of gene expression,"The initiation of cell transformation is generally associated with genetic alterations in normal cells that lead to the loss of intercellular-and/or extracellular-matrix- (ECM-) mediated cell adhesion. Cancer afflicts an organ or a tissue by inducing abnormal and uncontrolled division of cells that either constitute it or migrate to it. At the cellular level, this is caused by genetic alterations in networks that regulate cell division and cell death."
0,5.286169,how much does walgreen assistant manager make,"The average Walgreens salary ranges from approximately $15,000 per year for Customer Service Associate / Cashier to $179,900 per year for District Manager. Average Walgreens hourly pay ranges from approximately $7.35 per hour for Laboratory Technician to $68.90 per hour for Pharmacy Manager. Salary information comes from 7,810 data points collected directly from employees, users, and jobs on Indeed."
65,0.100414,what kind of people make up divergent,"The smart ones, the ones value knowledge and logic are Erudite. They know everything.. Erudite is one of the five factions in the world of Divergent, the one and only faction dedicated to knowledge, intelligence, curiosity, and astuteness. It was formed by those who blamed ignorance for the war that had occurred in the past, causing them to split into factions in the first place. They also use Dauntless as their soldiers near the end of Divergent. They have a close relationship with Amity, but Amity are not involved in the war because they are the peace faction. No relationship is stated between Erudite and Candor."


## Step 4: Read synthetic queries and train/fine-tune a Hugging Face sentence transformer model on synthetic data

With a synthetic queries zip file, users can fine-tune a sentence transformer model.

The `SentenceTransformerModel` class initiates an object for training, exporting and configuring the model. Please see [SentenceTransformerModel](https://opensearch-project.github.io/opensearch-py-ml/reference/api/sentence_transformer.html#opensearch_py_ml.sentence_transformer_model.SentenceTransformerModel) for the API reference.

The `train` function imports the synthetic queries, loads the sentence transformer training examples and trains the model, starting from a Hugging Face sentence transformer model. Please see [SentenceTransformerModel.train](https://opensearch-project.github.io/opensearch-py-ml/reference/api/sentence_transformer.html#opensearch_py_ml.sentence_transformer_model.SentenceTransformerModel.train) for the API reference.

In [7]:
# clean up cache before training to free up space
import gc, torch

gc.collect()

torch.cuda.empty_cache()

In [25]:

training = custom_model.train(read_path = '/Volumes/workplace/upload_content/clean_synthetic_queries.zip',
                        output_model_name = 'test2_model.pt',
                        zip_file_name= 'test2_model.zip',
                        overwrite = True,
                        num_epochs = 1,
                        verbose = False)

Reading synthetic query file: /Volumes/workplace/upload_content/model_files/synthetic_queries/synthetic_queries_batch.p

Loading training examples... 



100%|██████████| 66/66 [00:00<00:00, 244544.23it/s]


Start training without accelerator...

The number of training epoch are 1

The total number of steps per training epoch are 2

Training epoch 0...



100%|██████████| 2/2 [00:10<00:00,  5.44s/it]


Total training time: 11.379661083221436

Model saved to path: /Volumes/workplace/upload_content/model_files/

tokenizer_json_path:  /Volumes/workplace/upload_content/model_files/tokenizer.json
zip file is saved to /Volumes/workplace/upload_content/model_files/test2_model.zip



## Step 5: (Optional) Save model
If you followed Step 4, the model zip file is generated automatically, and the printed message shows the zip file path, as above.

To use a different pre-trained sentence transformer model from Hugging Face, users can call the `save_as_pt` function to save it for inference or for benchmarking against other models.

The `save_as_pt` function prepares the model in the proper format (TorchScript), along with the tokenizer configuration file, for upload to OpenSearch. Please see [SentenceTransformerModel.save_as_pt](https://opensearch-project.github.io/opensearch-py-ml/reference/api/sentence_transformer.html#opensearch_py_ml.sentence_transformer_model.SentenceTransformerModel.save_as_pt) for the API reference.

In [9]:
# defaults to downloading model id "sentence-transformers/msmarco-distilbert-base-tas-b" from Hugging Face
# and outputs a zip file containing the model .pt file and the tokenizer.json file.
pre_trained_model = SentenceTransformerModel(folder_path = '/Volumes/workplace/upload_content/export_huggingface/', overwrite = True)
pre_trained_model.save_as_pt(sentences = ['today is sunny'])

model file is saved to  /Volumes/workplace/upload_content/export_huggingface/msmarco-distilbert-base-tas-b.pt
zip file is saved to  /Volumes/workplace/upload_content/export_huggingface/msmarco-distilbert-base-tas-b.zip 



SentenceTransformer(
  original_name=SentenceTransformer
  (0): Transformer(
    original_name=Transformer
    (auto_model): DistilBertModel(
      original_name=DistilBertModel
      (embeddings): Embeddings(
        original_name=Embeddings
        (word_embeddings): Embedding(original_name=Embedding)
        (position_embeddings): Embedding(original_name=Embedding)
        (LayerNorm): LayerNorm(original_name=LayerNorm)
        (dropout): Dropout(original_name=Dropout)
      )
      (transformer): Transformer(
        original_name=Transformer
        (layer): ModuleList(
          original_name=ModuleList
          (0): TransformerBlock(
            original_name=TransformerBlock
            (attention): MultiHeadSelfAttention(
              original_name=MultiHeadSelfAttention
              (dropout): Dropout(original_name=Dropout)
              (q_lin): Linear(original_name=Linear)
              (k_lin): Linear(original_name=Linear)
              (v_lin): Linear(original_name=Lin
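Before uploading, it can be useful to confirm that a model zip file contains both the TorchScript file and the tokenizer file. The sketch below lists the members of a zip archive with the standard library. To stay self-contained it builds a small placeholder archive first; the file names and contents in that archive are stand-ins, not real model artifacts, and `list_zip_members` is a hypothetical helper.

```python
import os
import tempfile
import zipfile


def list_zip_members(zip_path):
    """Return the sorted file names stored inside a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "example_model.zip")
    # Placeholder archive standing in for the zip produced by save_as_pt
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("example_model.pt", b"placeholder for TorchScript bytes")
        archive.writestr("tokenizer.json", "{}")
    members = list_zip_members(zip_path)

print(members)
```

With a real export, point `list_zip_members` at the zip path printed by `save_as_pt` or `train`.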

## Step 6: Upload the model to OpenSearch cluster
After generating a model zip file, users need to describe the model configuration in a ml-commons_model_config.json file. The `make_model_config_json` function in the `SentenceTransformerModel` class builds this file from the Hugging Face config.json file. To use a configuration different from that of the pre-trained sentence transformer, pass arguments to `make_model_config_json` to change the content of the generated ml-commons_model_config.json file. Please see [SentenceTransformerModel.make_model_config_json](https://opensearch-project.github.io/opensearch-py-ml/reference/api/sentence_transformer.html#opensearch_py_ml.sentence_transformer_model.SentenceTransformerModel.make_model_config_json) for the API reference.

In general, the ML Commons client supports uploading sentence transformer models. Given a zip file that contains the model in TorchScript format and a tokenizer configuration file in JSON format, the `upload_model` function connects to OpenSearch through the ML client and uploads the model. Please see [MLCommonClient.upload_model](https://opensearch-project.github.io/opensearch-py-ml/reference/api/ml_commons_upload_api.html#opensearch_py_ml.ml_commons_integration.MLCommonClient.upload_model) for the API reference.

In [10]:
# users need to prepare a ml-commons_model_config.json file to configure the model, including the model name
# this helper function in opensearch_py_ml's SentenceTransformerModel generates the ml-commons_model_config.json file
custom_model.make_model_config_json()

ml-commons_model_config.json file is saved at :  /Volumes/workplace/upload_content/model_files/ml-commons_model_config.json
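For orientation, the sketch below builds a configuration dictionary of the kind such a file may contain and serializes it to JSON. The field names follow the ML Commons model upload format to the best of our knowledge; treat them, the `build_model_config` helper and all values as assumptions, and compare with the file that `make_model_config_json` actually writes.

```python
import json


def build_model_config(model_name, model_type, embedding_dimension, version="1.0.0"):
    """Assemble an illustrative model configuration (assumed field names)."""
    return {
        "name": model_name,
        "version": version,
        "model_format": "TORCH_SCRIPT",
        "model_config": {
            "model_type": model_type,
            "embedding_dimension": embedding_dimension,
            "framework_type": "sentence_transformers",
        },
    }


# Illustrative values only; the generated file is the source of truth
config = build_model_config("test2_model", "distilbert", 768)
config_text = json.dumps(config, indent=2)
print(config_text)
```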


In [11]:
#connect to ml_common client with OpenSearch client
import opensearch_py_ml as oml
from opensearch_py_ml.ml_commons import MLCommonClient
ml_client = MLCommonClient(client)

In [12]:
# upload model to OpenSearch cluster, using model zip file path and ml-commons_model_config.json file generated above

model_path = '/Volumes/workplace/upload_content/model_files/test2_model.zip'
model_config_path = '/Volumes/workplace/upload_content/model_files/ml-commons_model_config.json'
ml_client.upload_model( model_path, model_config_path, isVerbose=True)

Total number of chunks 27
Sha1 value of the model file:  1a198957ec8a759e83f1e862ad46bb120c6c1b5a031e75c415c1a893c87a3da7
Model meta data was created successfully. Model Id:  cz2RloUB6UQeRtfO8Jph
uploading chunk 1 of 27
{'status': 'Uploaded'}
uploading chunk 2 of 27
{'status': 'Uploaded'}
uploading chunk 3 of 27
{'status': 'Uploaded'}
uploading chunk 4 of 27
{'status': 'Uploaded'}
uploading chunk 5 of 27
{'status': 'Uploaded'}
uploading chunk 6 of 27
{'status': 'Uploaded'}
uploading chunk 7 of 27
{'status': 'Uploaded'}
uploading chunk 8 of 27
{'status': 'Uploaded'}
uploading chunk 9 of 27
{'status': 'Uploaded'}
uploading chunk 10 of 27
{'status': 'Uploaded'}
uploading chunk 11 of 27
{'status': 'Uploaded'}
uploading chunk 12 of 27
{'status': 'Uploaded'}
uploading chunk 13 of 27
{'status': 'Uploaded'}
uploading chunk 14 of 27
{'status': 'Uploaded'}
uploading chunk 15 of 27
{'status': 'Uploaded'}
uploading chunk 16 of 27
{'status': 'Uploaded'}
uploading chunk 17 of 27
{'status': 'Uploaded

'cz2RloUB6UQeRtfO8Jph'
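The log above shows the model file being hashed and then sent in numbered chunks. The sketch below illustrates that general idea on an in-memory byte string; it is not the implementation of `upload_model`. The chunk size is an assumed value, the helper names are hypothetical, and SHA-256 is used because the digest in the log has 64 hexadecimal characters, which is the length of a SHA-256 digest even though the log labels it "Sha1".

```python
import hashlib


def split_into_chunks(data, chunk_size):
    """Split bytes into consecutive pieces of at most chunk_size bytes."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def sha256_hex(data):
    """Hex digest that a receiver can use to verify the reassembled file."""
    return hashlib.sha256(data).hexdigest()


payload = bytes(range(256)) * 100  # 25,600 bytes standing in for the model zip
chunks = split_into_chunks(payload, chunk_size=10_000)  # assumed chunk size

print("Total number of chunks", len(chunks))
for number, chunk in enumerate(chunks, start=1):
    print(f"chunk {number} of {len(chunks)}: {len(chunk)} bytes")

digest = sha256_hex(payload)
print(digest)
```

Joining the chunks in order reproduces the original bytes, so a digest computed on the reassembled file can be compared with the one computed before upload.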