notebook_documentation #66
Conversation
Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
@@ -3,6 +3,7 @@ pandas>=1.5,<2
matplotlib>=3.6.0,<4
nbval
sphinx
sphinx-rtd-theme
Are the two packages with similar names both required?
sphinx-rtd-theme
sphinx_rtd_theme
This part I'm not exactly sure about. You can see two different installation instructions:
- https://sphinx-rtd-theme.readthedocs.io/en/stable/installing.html
- https://pypi.org/project/sphinx-rtd-theme/
And I'm seeing slightly different output on my end, so I just wanted to try installing both.
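For what it's worth, pip should treat the two spellings as the same project: PEP 503 normalizes `-`, `_`, and `.` in package names before looking them up on PyPI. A quick sketch of that normalization rule (the `normalize` helper below is illustrative, not part of this repo):

```python
import re

def normalize(name: str) -> str:
    # PEP 503: runs of "-", "_", and "." collapse to a single "-",
    # and the name is lowercased, before pip resolves it on PyPI.
    return re.sub(r"[-_.]+", "-", name).lower()

print(normalize("sphinx-rtd-theme") == normalize("sphinx_rtd_theme"))  # True
```

So a single `sphinx-rtd-theme` entry in the requirements file should be enough; installing both spellings hits the same PyPI project.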
"source": [
"\n",
"model_path = '/Volumes/workplace/upload_content/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip'\n",
"model_config_path = '/Volumes/workplace/upload_content/all-MiniLM-L6-v2_torchscript.json'\n",
Can we paste this JSON file's content here for easy reference?
Yeah sure!!
"\n",
"Step 0: Import packages and set up client\n",
"\n",
"Step 1: Read synthetic queries and train/fine-tune model using a hugging face sentence transformer model\n",
Will we create a new demo notebook to show how to generate synthetic queries?
How about we also create a new demo notebook about how to trace huggingface model to torchscript and upload to ml-commons?
Will we create a new demo notebook to show how to generate synthetic queries?
--> Yes we will have another demo notebook to show how to generate synthetic queries.
How about we also create a new demo notebook about how to trace huggingface model to torchscript and upload to ml-commons?
--> That's a good idea. We can add that too.
I will add these two notebooks in a separate PR.
Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
Edits part 1
@@ -62,14 +62,16 @@ def __init__(
:return: no return value expected
Please change it to "Description: Initiate a sentence transformer model object. The model id will be used to download pre-trained model from Huggingface and serve as the default name for model files, and the folder_path will be the default location to store files generated by the following functions"
--> "overwrite: Optional, choose to overwrite the folder at folder path. Default as false. When training
different sentence transformer models, it's recommended to give designated folder path every time.
Users can choose to overwrite = True to overwrite previous runs"
if folder_path is None:
    self.folder_path = default_folder_path
else:
    self.folder_path = folder_path

# check folder exist in self.folder_path
# Check if self.folder_path exists
if os.path.exists(self.folder_path) and not overwrite:
    print(
        "To prevent overwritten, please enter a different folder path or delete the folder or enable "
--> "To prevent overwriting"
compute_environment: str = None,
num_machines: int = 1,
num_processes: int = None,
num_gpu: int = 0,
learning_rate: float = 2e-5,
num_epochs: int = 20,
Please change the default num_epochs to "10"
@@ -146,6 +137,11 @@ def train(
:param verbose:
    optional, use plotting to plot the training progress. Default as false
:type verbose: bool
:param percentile:
--> "To save memory while training we truncate all passages beyond a certain max_length. Most middle sized transformers have a max length limit of 512 tokens. However, certain corpora can have shorter documents. We find the word length of all documents, sort them in increasing order and take the max length of {percentile}% of the documents. Default is 95%."
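The truncation rule described in the suggested docstring could be sketched like this (`percentile_max_length` is a hypothetical helper name, not the actual method):

```python
import math

def percentile_max_length(passages, percentile=95):
    # Sort passage word counts in increasing order and return the max
    # length among the shortest `percentile`% of documents; passages
    # longer than this get truncated to save memory during training.
    lengths = sorted(len(p.split()) for p in passages)
    cutoff = max(1, math.ceil(len(lengths) * percentile / 100))
    return lengths[cutoff - 1]

# 100 passages of 1..100 words: 95% of them are at most 95 words long
docs = [" ".join(["word"] * n) for n in range(1, 101)]
print(percentile_max_length(docs))  # 95
```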
@@ -146,6 +137,11 @@ def train(
:param verbose:
"default is 10"
self.set_up_accelerate_config(
    compute_environment=compute_environment,
    num_machines=num_machines,
    num_processes=num_processes,
    num_processes=num_gpu,
    verbose=verbose,
)
The logic for self.__is_notebook() does not depend on num_gpu. Perhaps move it outside of the if-else clause?
query.append(dict_str["query"])
passages.append(dict_str["passage"])
if "probability" in dict_str.keys():
    prob.append(dict_str["probability"])  # language modeling score

df = pd.DataFrame(
This will lead to wrong answers whenever len(prob) differs from len(query). At best zip will throw an error, and at worst it will create the wrong dataset, i.e. queries and passages will get mismatched with prob. One solution could be to append "-1" whenever "probability" is not present in dict_str.keys(). Here "-1" serves as a label saying that the probability does not exist.
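The suggested fix could look like this sketch (the `records` input is hypothetical, standing in for the parsed synthetic-query lines):

```python
records = [
    {"query": "q1", "passage": "p1", "probability": 0.9},
    {"query": "q2", "passage": "p2"},  # no language-modeling score present
]

query, passages, prob = [], [], []
for dict_str in records:
    query.append(dict_str["query"])
    passages.append(dict_str["passage"])
    # Append -1 as a sentinel when "probability" is absent, so the three
    # lists always stay the same length and rows never get mismatched.
    prob.append(dict_str.get("probability", -1))

print(prob)  # [0.9, -1]
```

The aligned lists can then be fed to pd.DataFrame exactly as in the original code.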
def load_sentence_transformer_example(
    self, query_df, use_accelerate: bool = False
) -> List[str]:
def load_sentence_transformer_example(self, query_df) -> List[List[str]]:
I always felt this is a misleading name. Can we change it to something like "load_training_data" ?
Edits part 1
Edits part 2
def load_sentence_transformer_example(
    self, query_df, use_accelerate: bool = False
) -> List[str]:
def load_sentence_transformer_example(self, query_df) -> List[List[str]]:
    """
    Create input data for training the model

    :param query_df:
        required for loading sentence transformer examples
--> "required for loading training data"
queries = list(query_df["query"])
passages = list(query_df["passages"])
for i in tqdm(range(len(query_df)), total=len(query_df)):
    train_examples.append([queries[i], passages[i]])
return train_examples

def train_model(
    self,
    train_examples: List[str],
Isn't train_examples List[List[str]]?
output_model_name: str = None,
use_accelerate: bool = False,
learning_rate: float = 2e-5,
num_epochs: int = 20,
Does not make sense to set num_epochs to a default value here. The function that calls train_model always passes a default value for num_epochs.
Similarly for other params such as learning_rate, etc.
if model_id is None:
    model_id = "sentence-transformers/msmarco-distilbert-base-tas-b"
if output_model_name is None:
    output_model_name = str(self.model_id.split("/")[-1] + ".pt")

# prepare an output model path for later saving the pt model.
output_model_path = os.path.join(output_path, output_model_name)

# declare variables before assignment for training
batch_size = 32
Let's make this an argument that will be passed by the user with default = 32
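A toy sketch of the suggested signature change (a standalone function for illustration, not the real method):

```python
def train_model(train_examples, batch_size: int = 32):
    # batch_size is now a user-facing argument (default 32) rather than a
    # hard-coded local, so callers can trade memory for more in-batch
    # negatives when their hardware allows it.
    steps_per_epoch = len(train_examples) // batch_size
    return steps_per_epoch

print(train_model(list(range(100)), batch_size=25))  # 4
```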
for i in range(len(train_examples)):
    corp_len.append(len(train_examples[i][1].split(" ")))

# In the following, we find the max length of 95% of the documents. Since this length is measured in terms of
--> we find the max length of 95% of the documents (when sorted by increasing word length)
# roughly translates to 1.3 to 1.5 tokens. For instance the word butterfly will be split by most tokenizers into
# butter and fly, but the word sun will be kept as it is.
--> but the word sun will probably be kept as it is.
print(
    f"The total number of steps per training epoch are {len(train_dataloader)}\n"
)
Add comment about training loop here
"The following training loop trains the sentence transformer model using the standard contrastive loss with in-batch negatives. The particular contrastive loss that we use is the Multi-class N-pair Loss (eq. 6, Sohn, NeurIPS 2016). In addition, we symmetrize the loss with respect to queries and passages (as also used in OpenAI's CLIP model). The performance improves with the number of in-batch negatives but larger batch sizes can lead to out of memory issues, so please use the batch-size judiciously. "
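As a rough illustration of the objective this suggested comment describes, here is a minimal numpy sketch of the symmetrized in-batch-negatives loss (plain dot-product similarity on precomputed embedding arrays is assumed; the real training loop operates on live model outputs):

```python
import numpy as np

def symmetric_in_batch_loss(q_emb, p_emb):
    # q_emb, p_emb: (batch, dim) arrays; row i of each is a positive pair
    # and every other row in the batch acts as a negative.
    scores = q_emb @ p_emb.T                          # (batch, batch)
    # cross-entropy of each query against all passages (query -> passage)
    log_sm_q = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    loss_q2p = -np.mean(np.diag(log_sm_q))
    # cross-entropy of each passage against all queries (passage -> query)
    log_sm_p = scores.T - np.log(np.exp(scores.T).sum(axis=1, keepdims=True))
    loss_p2q = -np.mean(np.diag(log_sm_p))
    # symmetrize over the two directions, as in CLIP
    return (loss_q2p + loss_p2q) / 2
```

With well-separated positive pairs the loss approaches zero; a larger batch adds more in-batch negatives per positive, which is exactly why the comment warns about memory vs. batch size.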
print(f"The number of training epoch are {num_epochs}\n")
print(
    f"The total number of steps per training epoch are {len(train_examples) // batch_size}\n"
)
Add comment about training loop here
"The following training loop trains the sentence transformer model using the standard contrastive loss with in-batch negatives. The particular contrastive loss that we use is the Multi-class N-pair Loss (eq. 6, Sohn, NeurIPS 2016). In addition, we symmetrize the loss with respect to queries and passages (as also used in OpenAI's CLIP model). The performance improves with the number of in-batch negatives but larger batch sizes can lead to out of memory issues, so please use the batch-size judiciously. "
@@ -731,7 +710,7 @@ def save_as_pt(
:type model_name: string
:param save_json_folder_path:
    Optional, path to save model json file, e.g, "home/save_pre_trained_model_json/"). If None, default as
    default_folder_path from the constructor.
    default_folder_path from the constructor
:type save_json_folder_path: string
:param zip_file_name:
    Optional, file name for zip file. e.g, "sample_model.zip". If None, default takes the model_id
What's torch.tensor(1) doing near line 744 below?
* notebook_documentation
  Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
* updated sentencetransformermodel.py
  Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>

Signed-off-by: Dhrubo Saha <dhrubo@amazon.com>
Signed-off-by: Dhrubo Saha dhrubo@amazon.com
Description
Created a new notebook for ml_commons integration with better explanation.
Issues Resolved
[List any issues this PR will resolve]
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.