
Add nlp example #467

Merged
merged 35 commits into from Apr 14, 2022

Conversation

Ankur3107
Contributor

Describe changes

I implemented/fixed _ to achieve _.

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • If I have added an integration, I have updated the integrations table.
  • I have added tests to cover my changes.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@CLAassistant

CLAassistant commented Mar 13, 2022

CLA assistant check
All committers have signed the CLA.

Contributor

@AlexejPenner AlexejPenner left a comment


Thanks so much for your contribution to zenml, Ankur; we appreciate it greatly! An NLP pipeline was sorely missing from our examples.

I have made a few Change Requests that mostly relate to formatting and documentation. Let us know if you need assistance on anything.

Review comments were left on the following files (all since resolved):

  • examples/nlp/token-classification/pipeline.py
  • examples/nlp/token-classification/run_pipeline.py
  • examples/nlp/README.md
Co-authored-by: Alexej Penner <alexej@zenml.io>
@htahir1 htahir1 changed the base branch from main to develop March 15, 2022 12:59
@htahir1
Contributor

htahir1 commented Mar 15, 2022

@Ankur3107 Apart from @AlexejPenner's requests, I would ask you to please use the format.sh and lint.sh scripts in the scripts folder, as explained in CONTRIBUTING.md.

P.S. I also changed the base branch to develop in accordance with the contributing guidelines

@Ankur3107
Contributor Author

@AlexejPenner Thanks for your feedback. Actually, I was coding in Colab and just copy-pasted, so some basic things got mixed up.

I have updated accordingly; please check.

@AlexejPenner
Contributor

> @AlexejPenner Thanks for your feedback. Actually, I was coding in Colab and just copy-pasted, so some basic things got mixed up.
>
> I have updated accordingly; please check.

Hey Ankur, thanks for incorporating the changes. Regarding your question about the sub-directory, we talked a bit internally and this is the idea we came up with.

Basically, we treat our examples as showcases of integrations. In that spirit, it would make sense to treat your work as the huggingface integration. To get there, you would need to do the following:

You could move your Huggingface Materializers into a huggingface integration within our main code:

  • src
    • zenml
      • ...
      • integrations
        • ...
        • huggingface
          • materializers
            • huggingface_materializers.py <- this is where you could move the materializer

Feel free to look at our integrations for pytorch_lightning or sklearn for inspiration on what this looks like. Doing this would make your materializers accessible to anyone using zenml to create huggingface-based pipelines.
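For illustration only, the shape of such a materializer might look like the sketch below. The stand-in base class and the `HFDatasetMaterializer` name are assumptions made so the sketch is self-contained; the real base class lives in zenml, and only `handle_return` is confirmed by this thread.

```python
from typing import Any, Type

# Stand-in for zenml's materializer base class so this sketch runs on its
# own; method names mirror `handle_return` as it appears later in this
# thread, the rest is assumed.
class BaseMaterializer:
    ASSOCIATED_TYPES: tuple = ()

    def __init__(self, artifact=None):
        self.artifact = artifact

    def handle_input(self, data_type: Type[Any]) -> Any:
        return None

    def handle_return(self, obj: Any) -> None:
        return None


class HFDatasetMaterializer(BaseMaterializer):
    """Sketch: reads and writes a huggingface Dataset via the artifact store."""

    def handle_return(self, ds: Any) -> None:
        super().handle_return(ds)
        # A real implementation would serialize `ds` to self.artifact.uri here.
```

Subclassing a shared base like this is what makes the materializer discoverable to any zenml pipeline that produces or consumes the associated type.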

Then in order to deal with the subdirectory issue we could proceed as follows:

  • examples
    • huggingface <- rename nlp to huggingface
      • materializer_utils.py <- containing your non-huggingface-specific materializers
      • token_classification.py <- move your pipeline code here (as you had it inside pipeline.py originally before the refactor)
      • text_classification.py <- your future work
      • question_answering.py <- your future work
      • run_pipeline.py <- main example file that runs the chosen pipeline based on argparse input

Does this make sense to you? Feel free to reach out to further discuss these ideas :)
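The proposed run_pipeline.py dispatcher could be sketched roughly as below. The task names and `run_*` functions are illustrative placeholders, not the example's actual code.

```python
import argparse


# Placeholder pipeline entry point; the real example would import and run
# the token_classification pipeline here.
def run_token_classification() -> None:
    print("running token-classification pipeline")


# Map each --task value to its pipeline runner.
TASKS = {"token-classification": run_token_classification}


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(
        description="Run one of the huggingface example pipelines"
    )
    parser.add_argument(
        "--task", choices=sorted(TASKS), default="token-classification"
    )
    args = parser.parse_args(argv)
    TASKS[args.task]()  # dispatch to the chosen pipeline


if __name__ == "__main__":
    main()
```

Adding text_classification or question_answering later would then only require registering another entry in `TASKS`.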

@Ankur3107
Contributor Author

This makes sense to me; I was also thinking the same, to put the materializer into an integration. I will request a review from you when it is ready.

@AlexejPenner
Contributor

> This makes sense to me; I was also thinking the same, to put the materializer into an integration. I will request a review from you when it is ready.

Looking forward to it :)

Contributor

@htahir1 htahir1 left a comment


Fantastic job @Ankur3107! I think this is a great addition to ZenML. I am jumping in with my own review, if you don't mind. It's mostly refactorings and small suggestions that hopefully are quick and easy.

However, I do have one bigger request: currently, it seems like the following functions won't work on the cloud with a GCP, AWS, or Azure bucket:

  • Dataset.from_parquet / Dataset.to_parquet
  • DatasetDict.load_from_disk / DatasetDict.save_to_disk
  • model.save_pretrained / model.from_pretrained

For the first two, I think we can leverage the built-in filesystem integrations of HuggingFace that they describe here: https://huggingface.co/docs/datasets/v1.4.0/filesystems.html . As they are using the same libraries as us for blob storage, I think we can do something clever there in the future. But for now, maybe we can add some simple logic to detect whether self.artifact.uri is s3://, gs://, or az://?

For the third (i.e. the model) I would simply use zenml.io.fileio to copy from a local temp path to self.artifact.uri. This would be simple enough for the model files.

Does that make sense?
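The scheme check suggested above could be sketched like this. The helper name and the mapping are illustrative, not part of zenml's API:

```python
from urllib.parse import urlparse

# Schemes the review mentions as remote artifact stores.
_REMOTE_SCHEMES = {"s3", "gs", "az"}


def remote_scheme(uri: str):
    """Return 's3', 'gs', or 'az' for a remote artifact URI, else None."""
    scheme = urlparse(uri).scheme
    return scheme if scheme in _REMOTE_SCHEMES else None
```

A materializer could branch on the return value to pick the matching fsspec filesystem (s3fs, gcsfs, adlfs) or fall back to local I/O.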

ds.save_to_disk(filepath)


DEFAULT_TF_MODEL_DIR = "hf_tf_model"

I would potentially push this to the top of the file

class HFTFModelMaterializer(BaseMaterializer):
"""Materializer to read a TensorFlow model to and from a huggingface pretrained model."""

from transformers import TFPreTrainedModel

Can we put this at the top of the file too, rather than nesting it inside?

)


DEFAULT_PT_MODEL_DIR = "hf_pt_model"

Same as above, might make sense to put it at the top of the file

)


DEFAULT_TOKENIZER_DIR = "hf_tokenizer"

Same as above, might make sense to put at the top of the file


# Convert tokenized datasets into tf dataset
train_set = tokenized_datasets["train"].to_tf_dataset(
columns=["attention_mask", "input_ids"],

Might be part of the config

shuffle=True,
batch_size=config.batch_size,
collate_fn=DataCollatorWithPadding(tokenizer, return_tensors="tf"),
label_cols="label",

Might be part of the config

# Convert into tf dataset format
validation_set = tokenized_datasets["test"].to_tf_dataset(
columns=["attention_mask", "input_ids"],
shuffle=False,

Might be part of the config

shuffle=False,
batch_size=config.batch_size,
collate_fn=DataCollatorWithPadding(tokenizer, return_tensors="tf"),
label_cols="label",

Might be part of the config


# Convert into tf dataset format
validation_set = tokenized_datasets["test"].to_tf_dataset(
columns=["attention_mask", "input_ids"],

Might be part of the config
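The repeated "might be part of the config" suggestion could be addressed with something like the sketch below. The `TokenClassificationConfig` name and its fields are assumptions that mirror the `to_tf_dataset` arguments shown in the diff context, not the example's actual config class.

```python
from dataclasses import dataclass, field
from typing import List


# Illustrative config sketch: the hard-coded to_tf_dataset arguments
# become configurable fields with the diff's values as defaults.
@dataclass
class TokenClassificationConfig:
    batch_size: int = 16
    columns: List[str] = field(
        default_factory=lambda: ["attention_mask", "input_ids"]
    )
    label_cols: str = "label"
    shuffle_train: bool = True  # shuffle only the training split
```

The step would then call `tokenized_datasets["train"].to_tf_dataset(columns=config.columns, label_cols=config.label_cols, batch_size=config.batch_size, shuffle=config.shuffle_train, ...)` instead of repeating literals for each split.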

@Ankur3107
Contributor Author

@htahir1
I like the idea and it makes sense as well. I will work on this.

I would like to take some time because I am traveling for office work and have very limited free time for a few days, but I will add changes whenever I get time. Please give me some time.

@htahir1
Contributor

htahir1 commented Mar 24, 2022

@Ankur3107 Please take your time; we can make it part of the release happening in two weeks :-)

@Ankur3107 Ankur3107 marked this pull request as draft March 29, 2022 13:49
@htahir1
Contributor

htahir1 commented Apr 2, 2022

@Ankur3107 How are we looking? We are making the release next week, so I thought I'd ask whether we might make progress on this PR by then. Thanks!

@Ankur3107
Contributor Author

@htahir1 Yes, I have started working on this.

Can you help me: how can I get this fs object (fs: s3fs.S3FileSystem = None)? I want to use it here:
ds.save_to_disk(self.artifact.uri, fs=fs)

@AlexejPenner
Contributor

> @htahir1 Yes, I have started working on this.
>
> Can you help me: how can I get this fs object (fs: s3fs.S3FileSystem = None)? I want to use it here:
> ds.save_to_disk(self.artifact.uri, fs=fs)

Hey Ankur, sorry for the delayed answer. We recently had some changes in our code that affected the s3_plugin, so we wanted to make sure this was stable before coming back to you. Additionally, we had some internal discussions regarding artifact stores and how their FS could be directly accessed from within a Materializer.

As of right now we don't have the perfect solution for this, but we're working on it.

In the meantime you could try a workaround: create a temporary path locally that is written to, and then copy that file into the artifact store FS. This should work as a temporary solution until we have implemented the more long-term one. Be aware that this could fail if the amount of data becomes huge.

For your materializer the pseudocode would look a bit like this:

def handle_return(self, ds: Type[Any]) -> None:
    """Writes a Dataset to the artifact store.

    Args:
        ds: The Dataset to write.
    """
    super().handle_return(ds)
    ...
    # Save locally first, then copy into the artifact store.
    ds.save_to_disk(tmp_path)
    fileio.copy(tmp_path, self.artifact.uri)
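A runnable sketch of that temp-path workaround is below. `shutil.copytree` stands in for zenml's fileio copy so the example runs locally, and the function and parameter names are illustrative; the only assumed external API is `save_to_disk` from the `datasets` library, which appears earlier in this thread.

```python
import os
import shutil
import tempfile


def save_via_tmp(ds, artifact_uri, copy_fn=shutil.copytree):
    """Save `ds` to a local temp path, then copy it to the artifact store.

    `copy_fn` plays the role of zenml's fileio copy; swap it for the real
    call inside a materializer.
    """
    tmp_dir = tempfile.mkdtemp()
    try:
        tmp_path = os.path.join(tmp_dir, "dataset")
        ds.save_to_disk(tmp_path)        # datasets.Dataset API
        copy_fn(tmp_path, artifact_uri)  # e.g. fileio.copy(...) in zenml
    finally:
        shutil.rmtree(tmp_dir)           # clean up the local staging dir
```

As the comment above notes, this stages the full dataset on local disk first, so it can fail for very large datasets; a streaming filesystem handle would be the long-term fix.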

@htahir1
Contributor

htahir1 commented Apr 11, 2022

@Ankur3107 Would appreciate an update on this. Any way we could help out? Would love to see this as part of the next release.

@Ankur3107
Contributor Author

@htahir1 I am good for review. Please review and let me know your feedback.

@Ankur3107 Ankur3107 requested a review from htahir1 April 12, 2022 05:09
@htahir1
Contributor

htahir1 commented Apr 12, 2022

@Ankur3107 There are some merge conflicts; could you kindly fix them so we can run the tests again? Then I'll do a review.

@Ankur3107 Ankur3107 marked this pull request as ready for review April 12, 2022 09:18
Contributor

@AlexejPenner AlexejPenner left a comment


Unfortunately our last release changed the location of the copy_dir() method. Once you have accounted for that, I think this would be ready to merge.

Somehow our linter is also failing; I am not entirely sure what the issue is. I will try investigating it on the side as well.

@Ankur3107
Contributor Author

@AlexejPenner updated as per your feedback.

@AlexejPenner
Contributor

> @AlexejPenner updated as per your feedback.

Thank you so much for your patience and diligence in this contribution. We greatly appreciate your work on this.

Contributor

@AlexejPenner AlexejPenner left a comment


Let's merge this :)

@AlexejPenner AlexejPenner merged commit b1466df into zenml-io:develop Apr 14, 2022
@Ankur3107
Contributor Author

@AlexejPenner @htahir1 Thank you for the guidance and patience. I'd love to contribute in the future; I am loving the zenml framework.

@AlexejPenner
Contributor

That's amazing to hear. We'd love to have your future contributions. Feel free to drop by on our Slack to connect or chat: https://zenml.io/slack-invite/

@Ankur3107
Contributor Author

@AlexejPenner Thanks, I have already connected on slack.
