Add nlp example #467
Conversation
Thanks so much for your contribution to ZenML, Ankur, we appreciate it greatly! An NLP pipeline was sorely missing from our examples.
I have made a few change requests that mostly relate to formatting and documentation. Let us know if you need assistance with anything.
Co-authored-by: Alexej Penner <alexej@zenml.io>
@Ankur3107 Apart from @AlexejPenner's requests, I would ask you to please use the … P.S. I also changed the base branch to develop in accordance with the contributing guidelines.
@AlexejPenner Thanks for your feedback. Actually, I was coding in Colab and just copy-pasted, so some basic things got mixed in. I have updated accordingly; please check.
Hey Ankur, thanks for your incorporated changes. Regarding your question about the sub-directory, we talked a bit internally and this is the idea we came up with. Basically, we treat our examples as showcases of integrations; in that spirit it would make sense to treat your work as the Huggingface integration. To get there, you would need to do the following: you could move your Huggingface materializers into a huggingface integration within our main code:
Feel free to look at our integration for … . Then, in order to deal with the subdirectory issue, we could proceed as follows:
Does this make sense to you? Feel free to reach out to further discuss these ideas :)
This makes sense to me; I was also thinking the same, to put the materializer into an integration. I will request a review from you when it is ready.
Looking forward to it :)
Fantastic job @Ankur3107! I think this is a great addition to ZenML. I am jumping in with my own review, if you don't mind. It's mostly refactorings and small suggestions that hopefully are quick and easy.
However, I do have one bigger request: currently, it seems like the following functions won't work on the cloud with a GCP, AWS, or Azure bucket:
Dataset.from_parquet / Dataset.to_parquet
DatasetDict.load_from_disk / DatasetDict.save_to_disk
model.save_pretrained / model.from_pretrained
For the first two, I think we can leverage the built-in integrations of HuggingFace that they speak about here: https://huggingface.co/docs/datasets/v1.4.0/filesystems.html . As they are using the same libraries as us for blob storage, I think we can do something clever there in the future. But for now, maybe we can add some simple logic to detect whether self.artifact.uri is s3://, gs://, or az://?
For the third (i.e. the model) I would simply use zenml.io.fileio to copy from a local temp path to self.artifact.uri. This would be simple enough for the model files.
Does that make sense?
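The scheme check suggested above could be sketched as follows. This is a minimal illustration, not ZenML code: the helper name and the mapped filesystem values are assumptions for the example.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical mapping from URI scheme to a remote-filesystem key.
# The values ("s3", "gcs", "abfs") are illustrative assumptions.
REMOTE_SCHEMES = {"s3": "s3", "gs": "gcs", "az": "abfs"}


def detect_remote_scheme(uri: str) -> Optional[str]:
    """Return a filesystem key for s3://, gs://, or az:// URIs, else None."""
    scheme = urlparse(uri).scheme
    return REMOTE_SCHEMES.get(scheme)
```

A materializer could call such a helper on self.artifact.uri and branch between a local write and a remote-filesystem write accordingly.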
ds.save_to_disk(filepath)

DEFAULT_TF_MODEL_DIR = "hf_tf_model"
I would potentially push this to the top of the file
class HFTFModelMaterializer(BaseMaterializer):
    """Materializer to read tensorflow model to and from huggingface pretrained model."""

    from transformers import TFPreTrainedModel
Can we put this at the top of the file too, rather than nesting it inside?
)

DEFAULT_PT_MODEL_DIR = "hf_pt_model"
Same as above, might make sense to put it at the top of the file
)

DEFAULT_TOKENIZER_DIR = "hf_tokenizer"
Same as above, might make sense to put at the top of the file
# Convert tokenized datasets into tf dataset
train_set = tokenized_datasets["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
Might be part of the config
    shuffle=True,
    batch_size=config.batch_size,
    collate_fn=DataCollatorWithPadding(tokenizer, return_tensors="tf"),
    label_cols="label",
Might be part of the config
# Convert into tf dataset format
validation_set = tokenized_datasets["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
    shuffle=False,
Might be part of the config
    shuffle=False,
    batch_size=config.batch_size,
    collate_fn=DataCollatorWithPadding(tokenizer, return_tensors="tf"),
    label_cols="label",
Might be part of the config
# Convert into tf dataset format
validation_set = tokenized_datasets["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids"],
Might be part of the config
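The hard-coded to_tf_dataset arguments flagged in the review comments above could be lifted into the step's config object. A minimal sketch, using a plain dataclass as a stand-in for ZenML's step-config class; the class and field names are illustrative assumptions, not ZenML API:

```python
from dataclasses import dataclass, field
from typing import List


# Illustrative stand-in for a ZenML step config. The field names
# (columns, label_cols, batch_size, shuffle) are assumptions for
# this sketch, mirroring the hard-coded values in the example step.
@dataclass
class SequenceClassifierConfig:
    columns: List[str] = field(
        default_factory=lambda: ["attention_mask", "input_ids"]
    )
    label_cols: str = "label"
    batch_size: int = 16
    shuffle: bool = True


# The dataset conversion could then read everything from the config:
# train_set = tokenized_datasets["train"].to_tf_dataset(
#     columns=config.columns,
#     shuffle=config.shuffle,
#     batch_size=config.batch_size,
#     label_cols=config.label_cols,
# )
```

This keeps the training and validation conversions consistent and lets users override the values without editing the step body.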
@htahir1 I would like to take some time because I am traveling for office work and have very limited free time for a few days, but I will add to it whenever I get time. Please give me some time.
@Ankur3107 Please take your time; we can make it part of the release happening in two weeks :-)
@Ankur3107 How are we looking? We are making the release next week, so I thought I'd ask whether we would make progress on this PR by then. Thanks!
@htahir1 Yes, I have started working on this. Can you help me: how can I get the fs to pass to ds.save_to_disk(self.artifact.uri, fs=fs)?
Hey Ankur, sorry for the delayed answer. We recently had some changes within our code that affected the s3_plugin, so we wanted to make sure this was stable before coming back to you. Additionally, we had some internal discussions regarding artifact stores and how their filesystem could be accessed directly from within a materializer. As of right now we don't have the perfect solution for this, but we're working on it. In the meantime you could try a workaround: create a temporary path locally that is written to, then copy that file into the artifact store filesystem. This should work as a temporary solution until we have implemented the more long-term one. Be aware that this could fail if the amount of data becomes huge. For your materializer the pseudocode would look a bit like this:

def handle_return(self, ds: Type[Any]) -> None:
    """Writes a Dataset to the specified filename.

    Args:
        Dataset: The Dataset to write.
    """
    super().handle_return(ds)
    ...
    ds.save_to_disk(tmp_path)
    fileio.copy(tmp_path, self.artifact.uri)
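The write-locally-then-copy pattern from the pseudocode above can be shown as a self-contained sketch. Here tempfile provides the local scratch path, and shutil.copy stands in for a remote-capable copy such as zenml.io.fileio.copy; the helper name and its callback parameter are assumptions for the example.

```python
import os
import shutil
import tempfile
from typing import Callable


def write_via_tmp(write_fn: Callable[[str], None], target_path: str) -> None:
    """Run write_fn against a local temp path, then copy the result to target_path.

    write_fn stands in for e.g. ds.save_to_disk; shutil.copy stands in
    for a store-aware copy such as zenml.io.fileio.copy. The temp
    directory is cleaned up automatically when the block exits.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, "artifact")
        write_fn(tmp_path)                  # e.g. ds.save_to_disk(tmp_path)
        shutil.copy(tmp_path, target_path)  # e.g. fileio.copy(tmp_path, uri)
```

As the maintainers note, this buffers the whole artifact on local disk, so it can fail for very large datasets; it is only a stopgap until the artifact store filesystem is reachable directly from materializers.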
@Ankur3107 Would appreciate an update on this. Any way we could help out? Would love to see this as part of the next release.
@htahir1 It is ready for review. Please review and let me know your feedback.
@Ankur3107 There are some merge conflicts; could you kindly fix them so we can run the tests again? Then I'll do a review.
Unfortunately, our last release changed the location of the copy_dir() method. Once you have accounted for that, I think this will be ready to merge.
Somehow our linter is also failing; I am not entirely sure what the issue is. I will try investigating it on the side as well.
src/zenml/integrations/huggingface/materializers/huggingface_datasets_materializer.py
src/zenml/integrations/huggingface/materializers/huggingface_pt_model_materializer.py
src/zenml/integrations/huggingface/materializers/huggingface_tf_model_materializer.py
src/zenml/integrations/huggingface/materializers/huggingface_tokenizer_materializer.py
@AlexejPenner Updated as per your feedback.
Thank you so much for your patience and diligence in this contribution. We greatly appreciate your work on this. |
Let's merge this :)
@AlexejPenner @htahir1 Thank you for the guidance and patience. I would love to contribute in the future. I am loving the ZenML framework.
That's amazing to hear. We'd love to have your future contributions. Feel free to drop by our Slack to connect or chat: https://zenml.io/slack-invite/
@AlexejPenner Thanks, I have already connected on Slack.