
notebook_documentation #66

Merged
dhrubo-os merged 2 commits into opensearch-project:main on Jan 20, 2023

Conversation

dhrubo-os
Collaborator

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

Description

Created a new notebook for the ml_commons integration, with a better explanation.

Issues Resolved

[List any issues this PR will resolve]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
@@ -3,6 +3,7 @@ pandas>=1.5,<2
matplotlib>=3.6.0,<4
nbval
sphinx
sphinx-rtd-theme
Contributor

Are both of these similarly named packages required?
sphinx-rtd-theme
sphinx_rtd_theme

Collaborator Author

I'm not exactly sure about this part. You can see two different installation instructions:

  1. https://sphinx-rtd-theme.readthedocs.io/en/stable/installing.html
  2. https://pypi.org/project/sphinx-rtd-theme/

And I'm seeing slightly different output on my end, so I wanted to try installing both.
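
Side note: pip normalizes hyphens and underscores in distribution names, so sphinx-rtd-theme and sphinx_rtd_theme should resolve to the same PyPI package, meaning a single requirements entry is likely enough; only the import name needs the underscore form. A minimal hedged check, assuming a standard install:

# Hedged check: both requirement spellings should point at the same installed
# distribution; only the Python import name uses the underscore form.
import sphinx_rtd_theme

print(sphinx_rtd_theme.__version__)
# In Sphinx's conf.py the theme would then be selected with:
# html_theme = "sphinx_rtd_theme"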

"source": [
"\n",
"model_path = '/Volumes/workplace/upload_content/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip'\n",
"model_config_path = '/Volumes/workplace/upload_content/all-MiniLM-L6-v2_torchscript.json'\n",
Collaborator

Can we paste this JSON file's content here for easy reference?

Collaborator Author

Yeah sure!!
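
For reference, a hedged sketch of what such a config typically contains for all-MiniLM-L6-v2; the field names follow the ml-commons model-meta format, and the values below are illustrative assumptions rather than the exact file referenced above:

import json

# Illustrative only: values such as embedding_dimension are assumed for
# all-MiniLM-L6-v2 and not copied from the JSON file mentioned above.
model_config = {
    "name": "all-MiniLM-L6-v2",
    "version": 1,
    "model_format": "TORCH_SCRIPT",
    "model_config": {
        "model_type": "bert",
        "embedding_dimension": 384,
        "framework_type": "sentence_transformers",
    },
}
print(json.dumps(model_config, indent=2))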

"\n",
"Step 0: Import packages and set up client\n",
"\n",
"Step 1: Read synthetic queries and train/fine-tune model using a hugging face sentence transformer model\n",
Collaborator

Will we create a new demo notebook to show how to generate synthetic queries?

Collaborator

How about we also create a new demo notebook about how to trace a Hugging Face model to TorchScript and upload it to ml-commons?

Collaborator Author

Will we create a new demo notebook to show how to generate synthetic queries?

--> Yes we will have another demo notebook to show how to generate synthetic queries.

How about we also create a new demo notebook about how to trace a Hugging Face model to TorchScript and upload it to ml-commons?

--> That's a good idea. We can add that too.

Collaborator Author

I will add these two notebooks in a separate PR.

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
@dhrubo-os dhrubo-os merged commit 6ead7bd into opensearch-project:main Jan 20, 2023
@MilindShyani left a comment

Edits part 1

@@ -62,14 +62,16 @@ def __init__(
:return: no return value expected


Please change it to "Description: Initiate a sentence transformer model object. The model id will be used to download pre-trained model from Huggingface and serve as the default name for model files, and the folder_path will be the default location to store files generated by the following functions"


--> "overwrite: Optional, choose to overwrite the folder at folder path. Default as false. When training
different sentence transformer models, it's recommended to give designated folder path every time.
Users can choose to overwrite = True to overwrite previous runs"


if folder_path is None:
    self.folder_path = default_folder_path
else:
    self.folder_path = folder_path

# check folder exist in self.folder_path
# Check if self.folder_path exists
if os.path.exists(self.folder_path) and not overwrite:
    print(
        "To prevent overwritten, please enter a different folder path or delete the folder or enable "


--> "To prevent overwriting"

compute_environment: str = None,
num_machines: int = 1,
num_processes: int = None,
num_gpu: int = 0,
learning_rate: float = 2e-5,
num_epochs: int = 20,


Please change the default num_epochs to "10"

@@ -146,6 +137,11 @@ def train(
:param verbose:
optional, use plotting to plot the training progress. Default as false
:type verbose: bool
:param percentile:


--> "To save memory while training we truncate all passages beyond a certain max_length. Most middle sized transformers have a max length limit of 512 tokens. However, certain corpora can have shorter documents. We find the word length of all documents, sort them in increasing order and take the max length of {percentile}% of the documents. Default is 95%."

@@ -146,6 +137,11 @@ def train(
:param verbose:


"default is 10"


self.set_up_accelerate_config(
    compute_environment=compute_environment,
    num_machines=num_machines,
    num_processes=num_processes,
    num_processes=num_gpu,
    verbose=verbose,
)


The logic for self.__is_notebook() does not depend on num_gpu. Perhaps move it outside of the if-else clause?

query.append(dict_str["query"])
passages.append(dict_str["passage"])
if "probability" in dict_str.keys():
prob.append(dict_str["probability"]) # language modeling score

df = pd.DataFrame(


This will lead to wrong answers whenever len(prob) is different from len(query). At best zip will throw an error, and at worst it will create the wrong dataset, i.e. queries and passages will get mismatched with prob. One solution could be to append "-1" whenever "probability" is not present in dict_str.keys(). Here "-1" serves as a label saying that the probability does not exist.
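
A hedged sketch of that suggestion (the column names and sample records are assumptions):

import pandas as pd

query, passages, prob = [], [], []
records = [
    {"query": "q1", "passage": "p1", "probability": 0.9},
    {"query": "q2", "passage": "p2"},  # no probability in this record
]
for dict_str in records:
    query.append(dict_str["query"])
    passages.append(dict_str["passage"])
    prob.append(dict_str.get("probability", -1))  # -1 marks a missing probability

# All three lists now have equal length, so zip() cannot silently misalign rows.
df = pd.DataFrame(
    list(zip(query, passages, prob)), columns=["query", "passages", "probability"]
)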

def load_sentence_transformer_example(
    self, query_df, use_accelerate: bool = False
) -> List[str]:
def load_sentence_transformer_example(self, query_df) -> List[List[str]]:

I always felt this is a misleading name. Can we change it to something like "load_training_data" ?


@MilindShyani left a comment

Edits part 2

def load_sentence_transformer_example(
    self, query_df, use_accelerate: bool = False
) -> List[str]:
def load_sentence_transformer_example(self, query_df) -> List[List[str]]:
    """
    Create input data for training the model

    :param query_df:
        required for loading sentence transformer examples


--> "required for loading training data"

queries = list(query_df["query"])
passages = list(query_df["passages"])
for i in tqdm(range(len(query_df)), total=len(query_df)):
train_examples.append([queries[i], passages[i]])
return train_examples

def train_model(
self,
train_examples: List[str],


Isn't train_examples List[List[str]]?
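
Presumably the annotation should read roughly like this (a hedged sketch, since the full signature is not shown here):

from typing import List

def train_model(self, train_examples: List[List[str]]):
    # each element is a [query, passage] pair
    ...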

output_model_name: str = None,
use_accelerate: bool = False,
learning_rate: float = 2e-5,
num_epochs: int = 20,


It does not make sense to set num_epochs to a default value here; the function that calls train_model always passes a default value for num_epochs.


Similarly for other params such as learning_rate, etc.

if model_id is None:
    model_id = "sentence-transformers/msmarco-distilbert-base-tas-b"
if output_model_name is None:
    output_model_name = str(self.model_id.split("/")[-1] + ".pt")

# prepare an output model path for later saving the pt model.
output_model_path = os.path.join(output_path, output_model_name)

# declare variables before assignment for training
batch_size = 32


Let's make this an argument that will be passed by the user with default = 32
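
Something along these lines, perhaps (a hedged sketch of the suggested signature change; other parameters are omitted):

def train_model(
    self,
    train_examples,
    batch_size: int = 32,  # previously hard-coded inside the function body
):
    ...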

for i in range(len(train_examples)):
    corp_len.append(len(train_examples[i][1].split(" ")))

# In the following, we find the max length of 95% of the documents. Since this length is measured in terms of


--> we find the max length of 95% of the documents (when sorted by increasing word length)

# roughly translates to 1.3 to 1.5 tokens. For instance the word butterfly will be split by most tokeniers into

# butter and fly, but the word sun will be kept as it is.


--> but the word sun will probably be kept as it is.
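
For illustration (hedged: the model name is only an example and the actual splits depend on the tokenizer's vocabulary):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
print(tokenizer.tokenize("butterfly"))  # may come back as sub-tokens, e.g. ['butter', '##fly']
print(tokenizer.tokenize("sun"))        # a common word usually stays whole, e.g. ['sun']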

print(
    f"The total number of steps per training epoch are {len(train_dataloader)}\n"
)


Add comment about training loop here

"The following training loop trains the sentence transformer model using the standard contrastive loss with in-batch negatives. The particular contrastive loss that we use is the Multi-class N-pair Loss (eq. 6, Sohn, NeurIPS 2016). In addition, we symmetrize the loss with respect to queries and passages (as also used in OpenAI's CLIP model). The performance improves with the number of in-batch negatives but larger batch sizes can lead to out of memory issues, so please use the batch-size judiciously. "

print(f"The number of training epoch are {num_epochs}\n")
print(
f"The total number of steps per training epoch are {len(train_examples) // batch_size}\n"
)


Add comment about training loop here

"The following training loop trains the sentence transformer model using the standard contrastive loss with in-batch negatives. The particular contrastive loss that we use is the Multi-class N-pair Loss (eq. 6, Sohn, NeurIPS 2016). In addition, we symmetrize the loss with respect to queries and passages (as also used in OpenAI's CLIP model). The performance improves with the number of in-batch negatives but larger batch sizes can lead to out of memory issues, so please use the batch-size judiciously. "

@@ -731,7 +710,7 @@ def save_as_pt(
:type model_name: string
:param save_json_folder_path:
Optional, path to save model json file, e.g, "home/save_pre_trained_model_json/"). If None, default as
default_folder_path from the constructor.
default_folder_path from the constructor
:type save_json_folder_path: string
:param zip_file_name:
Optional, file name for zip file. e.g, "sample_model.zip". If None, default takes the model_id


What's torch.tensor(1) doing near line 744 below?

gaiksaya pushed a commit to gaiksaya/opensearch-py-ml that referenced this pull request Feb 10, 2023
* notebook_documentation

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

* updated sentencetransformermodel.py

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>