
notebook_documentation #66

Merged
dhrubo-os merged 2 commits into opensearch-project:main on Jan 20, 2023

Conversation

dhrubo-os
Collaborator

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

Description

Created a new notebook for the ml_commons integration, with a better explanation.

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
@@ -3,6 +3,7 @@ pandas>=1.5,<2
matplotlib>=3.6.0,<4
nbval
sphinx
sphinx-rtd-theme
Contributor

Are both of these similarly named packages required?
sphinx-rtd-theme
sphinx_rtd_theme

Collaborator Author

I'm not exactly sure about this part. You can see two different installation instructions:

  1. https://sphinx-rtd-theme.readthedocs.io/en/stable/installing.html
  2. https://pypi.org/project/sphinx-rtd-theme/

And I'm seeing slightly different output on my end, so I wanted to try installing both.
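
Side note: pip normalizes hyphens and underscores in distribution names, so sphinx-rtd-theme and sphinx_rtd_theme should resolve to the same PyPI package, meaning a single requirements entry is likely enough; only the import name needs the underscore form. A minimal hedged check, assuming a standard install:

# Hedged check: both requirement spellings should point at the same installed
# distribution; only the Python import name uses the underscore form.
import sphinx_rtd_theme

print(sphinx_rtd_theme.__version__)
# In Sphinx's conf.py the theme would then be selected with:
# html_theme = "sphinx_rtd_theme"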

"source": [
"\n",
"model_path = '/Volumes/workplace/upload_content/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip'\n",
"model_config_path = '/Volumes/workplace/upload_content/all-MiniLM-L6-v2_torchscript.json'\n",
Collaborator

Can we paste this JSON file's content here for easy reference?

Collaborator Author

Yeah sure!!
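
For reference, a hedged sketch of what such a config typically contains for all-MiniLM-L6-v2; the field names follow the ml-commons model-meta format, and the values below are illustrative assumptions rather than the exact file referenced above:

import json

# Illustrative only: values such as embedding_dimension are assumed for
# all-MiniLM-L6-v2 and not copied from the JSON file mentioned above.
model_config = {
    "name": "all-MiniLM-L6-v2",
    "version": 1,
    "model_format": "TORCH_SCRIPT",
    "model_config": {
        "model_type": "bert",
        "embedding_dimension": 384,
        "framework_type": "sentence_transformers",
    },
}
print(json.dumps(model_config, indent=2))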

"\n",
"Step 0: Import packages and set up client\n",
"\n",
"Step 1: Read synthetic queries and train/fine-tune model using a hugging face sentence transformer model\n",
Collaborator

Will we create a new demo notebook to show how to generate synthetic queries?

Collaborator

How about we also create a new demo notebook about how to trace a Hugging Face model to TorchScript and upload it to ml-commons?

Collaborator Author

Will we create a new demo notebook to show how to generate synthetic queries?

--> Yes we will have another demo notebook to show how to generate synthetic queries.

How about we also create a new demo notebook about how to trace a Hugging Face model to TorchScript and upload it to ml-commons?

--> That's a good idea. We can add that too.

Collaborator Author

I will add these two notebooks in a separate PR.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
@dhrubo-os dhrubo-os merged commit 6ead7bd into opensearch-project:main Jan 20, 2023
@MilindShyani left a comment

Edits part 1

@@ -62,14 +62,16 @@ def __init__(
:return: no return value expected


Please change it to "Description: Initiate a sentence transformer model object. The model id will be used to download pre-trained model from Huggingface and serve as the default name for model files, and the folder_path will be the default location to store files generated by the following functions"


--> "overwrite: Optional, choose to overwrite the folder at folder path. Default as false. When training
different sentence transformer models, it's recommended to give designated folder path every time.
Users can choose to overwrite = True to overwrite previous runs"


if folder_path is None:
    self.folder_path = default_folder_path
else:
    self.folder_path = folder_path

# check folder exist in self.folder_path
# Check if self.folder_path exists
if os.path.exists(self.folder_path) and not overwrite:
    print(
        "To prevent overwritten, please enter a different folder path or delete the folder or enable "


--> "To prevent overwriting"

compute_environment: str = None,
num_machines: int = 1,
num_processes: int = None,
num_gpu: int = 0,
learning_rate: float = 2e-5,
num_epochs: int = 20,


Please change the default num_epochs to "10"

@@ -146,6 +137,11 @@ def train(
:param verbose:
optional, use plotting to plot the training progress. Default as false
:type verbose: bool
:param percentile:


--> "To save memory while training we truncate all passages beyond a certain max_length. Most middle sized transformers have a max length limit of 512 tokens. However, certain corpora can have shorter documents. We find the word length of all documents, sort them in increasing order and take the max length of {percentile}% of the documents. Default is 95%."

@@ -146,6 +137,11 @@ def train(
:param verbose:


"default is 10"


self.set_up_accelerate_config(
    compute_environment=compute_environment,
    num_machines=num_machines,
    num_processes=num_processes,
    num_processes=num_gpu,
    verbose=verbose,
)


The logic for self.__is_notebook() does not depend on num_gpu. Perhaps move it outside of the if-else clause?

query.append(dict_str["query"])
passages.append(dict_str["passage"])
if "probability" in dict_str.keys():
prob.append(dict_str["probability"]) # language modeling score

df = pd.DataFrame(


This will lead to wrong answers whenever len(prob) is different from len(query). At best zip will throw an error, and at worst it will create the wrong dataset, i.e. queries and passages will get mismatched with prob. One solution could be to append "-1" whenever "probability" is not present in dict_str.keys(). Here "-1" serves as a label saying that the probability does not exist.
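
A hedged sketch of that suggestion (the column names and sample records are assumptions):

import pandas as pd

query, passages, prob = [], [], []
records = [
    {"query": "q1", "passage": "p1", "probability": 0.9},
    {"query": "q2", "passage": "p2"},  # no probability in this record
]
for dict_str in records:
    query.append(dict_str["query"])
    passages.append(dict_str["passage"])
    prob.append(dict_str.get("probability", -1))  # -1 marks a missing probability

# All three lists now have equal length, so zip() cannot silently misalign rows.
df = pd.DataFrame(
    list(zip(query, passages, prob)), columns=["query", "passages", "probability"]
)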

def load_sentence_transformer_example(
    self, query_df, use_accelerate: bool = False
) -> List[str]:
def load_sentence_transformer_example(self, query_df) -> List[List[str]]:

I always felt this is a misleading name. Can we change it to something like "load_training_data" ?


@MilindShyani left a comment

Edits part 2

def load_sentence_transformer_example(
    self, query_df, use_accelerate: bool = False
) -> List[str]:
def load_sentence_transformer_example(self, query_df) -> List[List[str]]:
    """
    Create input data for training the model

    :param query_df:
        required for loading sentence transformer examples


--> "required for loading training data"

queries = list(query_df["query"])
passages = list(query_df["passages"])
for i in tqdm(range(len(query_df)), total=len(query_df)):
train_examples.append([queries[i], passages[i]])
return train_examples

def train_model(
self,
train_examples: List[str],


Isn't train_examples List[List[str]]?
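
Presumably the annotation should read roughly like this (a hedged sketch, since the full signature is not shown here):

from typing import List

def train_model(self, train_examples: List[List[str]]):
    # each element is a [query, passage] pair
    ...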

output_model_name: str = None,
use_accelerate: bool = False,
learning_rate: float = 2e-5,
num_epochs: int = 20,


It does not make sense to set num_epochs to a default value here; the function that calls train_model always passes a default value for num_epochs.


Similarly for other params such as learning_rate, etc.

if model_id is None:
    model_id = "sentence-transformers/msmarco-distilbert-base-tas-b"
if output_model_name is None:
    output_model_name = str(self.model_id.split("/")[-1] + ".pt")

# prepare an output model path for later saving the pt model.
output_model_path = os.path.join(output_path, output_model_name)

# declare variables before assignment for training
batch_size = 32


Let's make this an argument that will be passed by the user with default = 32
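
Something along these lines, perhaps (a hedged sketch of the suggested signature change; other parameters are omitted):

def train_model(
    self,
    train_examples,
    batch_size: int = 32,  # previously hard-coded inside the function body
):
    ...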

for i in range(len(train_examples)):
    corp_len.append(len(train_examples[i][1].split(" ")))

# In the following, we find the max length of 95% of the documents. Since this length is measured in terms of


--> we find the max length of 95% of the documents (when sorted by increasing word length)

# roughly translates to 1.3 to 1.5 tokens. For instance the word butterfly will be split by most tokeniers into

# butter and fly, but the word sun will be kept as it is.


--> but the word sun will probably be kept as it is.
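
For illustration (hedged: the model name is only an example and the actual splits depend on the tokenizer's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
print(tokenizer.tokenize("butterfly"))  # may come back as sub-tokens, e.g. ['butter', '##fly']
print(tokenizer.tokenize("sun"))        # a common word usually stays whole, e.g. ['sun']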

print(
    f"The total number of steps per training epoch are {len(train_dataloader)}\n"
)


Add comment about training loop here

"The following training loop trains the sentence transformer model using the standard contrastive loss with in-batch negatives. The particular contrastive loss that we use is the Multi-class N-pair Loss (eq. 6, Sohn, NeurIPS 2016). In addition, we symmetrize the loss with respect to queries and passages (as also used in OpenAI's CLIP model). The performance improves with the number of in-batch negatives but larger batch sizes can lead to out of memory issues, so please use the batch-size judiciously. "

print(f"The number of training epoch are {num_epochs}\n")
print(
f"The total number of steps per training epoch are {len(train_examples) // batch_size}\n"
)


Add comment about training loop here

"The following training loop trains the sentence transformer model using the standard contrastive loss with in-batch negatives. The particular contrastive loss that we use is the Multi-class N-pair Loss (eq. 6, Sohn, NeurIPS 2016). In addition, we symmetrize the loss with respect to queries and passages (as also used in OpenAI's CLIP model). The performance improves with the number of in-batch negatives but larger batch sizes can lead to out of memory issues, so please use the batch-size judiciously. "

@@ -731,7 +710,7 @@ def save_as_pt(
:type model_name: string
:param save_json_folder_path:
Optional, path to save model json file, e.g, "home/save_pre_trained_model_json/"). If None, default as
default_folder_path from the constructor.
default_folder_path from the constructor
:type save_json_folder_path: string
:param zip_file_name:
Optional, file name for zip file. e.g, "sample_model.zip". If None, default takes the model_id


What's torch.tensor(1) doing near line 744 below?

gaiksaya pushed a commit to gaiksaya/opensearch-py-ml that referenced this pull request Feb 10, 2023
* notebook_documentation

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* updated sentencetransformermodel.py

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>