
Add nlp example #467

Merged
merged 35 commits into from Apr 14, 2022

Conversation

Ankur3107
Contributor

Describe changes

I implemented/fixed _ to achieve _.

Pre-requisites

Please ensure you have done the following:

  • I have read the CONTRIBUTING.md document.
  • If my change requires a change to docs, I have updated the documentation accordingly.
  • If I have added an integration, I have updated the integrations table.
  • I have added tests to cover my changes.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Other (add details above)

@CLAassistant

CLAassistant commented Mar 13, 2022

CLA assistant check
All committers have signed the CLA.

Contributor

@AlexejPenner AlexejPenner left a comment


Thanks so much for your contribution to zenml, Ankur; we appreciate it greatly! An NLP pipeline was sorely missing from our examples.

I have made a few Change Requests that mostly relate to formatting and documentation. Let us know if you need assistance on anything.

Review comments were left on the following files (all since resolved):

  • examples/nlp/token-classification/pipeline.py
  • examples/nlp/token-classification/run_pipeline.py
  • examples/nlp/README.md
Co-authored-by: Alexej Penner <alexej@zenml.io>
@htahir1 htahir1 changed the base branch from main to develop March 15, 2022 12:59
@htahir1
Contributor

htahir1 commented Mar 15, 2022

@Ankur3107 Apart from @AlexejPenner's requests, I would ask you to please use the format.sh and lint.sh scripts in the scripts folder, as explained in CONTRIBUTING.md.

P.S. I also changed the base branch to develop in accordance with the contributing guidelines

@Ankur3107
Contributor Author

@AlexejPenner Thanks for your feedback. Actually, I was coding in Colab and just copy-pasted, so some basic things got mixed up.

I have updated accordingly; please check.

@AlexejPenner
Contributor

> @AlexejPenner Thanks for your feedback. Actually, I was coding in Colab and just copy-pasted, so some basic things got mixed up.
>
> I have updated accordingly; please check.

Hey Ankur, thanks for incorporating the changes. Regarding your question about the sub-directory, we talked a bit internally and this is the idea we came up with.

Basically, we treat our examples as showcases of integrations. In that spirit, it would make sense to treat your work as the huggingface integration. To get there, you would need to do the following:

You could move your Huggingface Materializers into a huggingface integration within our main code:

  • src
    • zenml
      • ...
      • integrations
        • ...
        • huggingface
          • materializers
            • huggingface_materializers.py <- this is where you could move the materializer

Feel free to look at our integrations for pytorch_lightning or sklearn for inspiration on what this looks like. Doing this would make your materializers accessible to anyone using zenml to create huggingface-based pipelines.
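For illustration only, the shape of such a materializer might look like the sketch below. The stand-in base class and the `HFDatasetMaterializer` name are assumptions made so the sketch is self-contained; the real base class lives in zenml, and only `handle_return` is confirmed by this thread.

```python
from typing import Any, Type

# Stand-in for zenml's materializer base class so this sketch runs on its
# own; method names mirror `handle_return` as it appears later in this
# thread, the rest is assumed.
class BaseMaterializer:
    ASSOCIATED_TYPES: tuple = ()

    def __init__(self, artifact=None):
        self.artifact = artifact

    def handle_input(self, data_type: Type[Any]) -> Any:
        return None

    def handle_return(self, obj: Any) -> None:
        return None


class HFDatasetMaterializer(BaseMaterializer):
    """Sketch: reads and writes a huggingface Dataset via the artifact store."""

    def handle_return(self, ds: Any) -> None:
        super().handle_return(ds)
        # A real implementation would serialize `ds` to self.artifact.uri here.
```

Subclassing a shared base like this is what makes the materializer discoverable to any zenml pipeline that produces or consumes the associated type.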

Then in order to deal with the subdirectory issue we could proceed as follows:

  • examples
    • huggingface <- rename nlp to huggingface
      • materializer_utils.py <- containing your non-huggingface-specific materializers
      • token_classification.py <- move your pipeline code here (as you had it inside pipeline.py originally before the refactor)
      • text_classification.py <- your future work
      • question_answering.py <- your future work
      • run_pipeline.py <- main example file that runs the chosen pipeline based on argparse input

Does this make sense to you? Feel free to reach out to further discuss these ideas :)
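The proposed run_pipeline.py dispatcher could be sketched roughly as below. The task names and `run_*` functions are illustrative placeholders, not the example's actual code.

```python
import argparse


# Placeholder pipeline entry point; the real example would import and run
# the token_classification pipeline here.
def run_token_classification() -> None:
    print("running token-classification pipeline")


# Map each --task value to its pipeline runner.
TASKS = {"token-classification": run_token_classification}


def main(argv=None) -> None:
    parser = argparse.ArgumentParser(
        description="Run one of the huggingface example pipelines"
    )
    parser.add_argument(
        "--task", choices=sorted(TASKS), default="token-classification"
    )
    args = parser.parse_args(argv)
    TASKS[args.task]()  # dispatch to the chosen pipeline


if __name__ == "__main__":
    main()
```

Adding text_classification or question_answering later would then only require registering another entry in `TASKS`.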

@Ankur3107
Contributor Author

This makes sense to me; I was also thinking the same, to put the materializer into an integration. I will request a review from you when it is ready.

@AlexejPenner
Contributor

> This makes sense to me; I was also thinking the same, to put the materializer into an integration. I will request a review from you when it is ready.

Looking forward to it :)

Contributor

@htahir1 htahir1 left a comment


Fantastic job @Ankur3107! I think this is a great addition to ZenML. I am jumping in with my own review, if you don't mind. It's mostly refactorings and small suggestions that hopefully are quick and easy.

However, I do have one bigger request: currently, it seems like the following functions won't work on the cloud with a GCP, AWS, or Azure bucket:

  • Dataset.from_parquet / Dataset.to_parquet
  • DatasetDict.load_from_disk / DatasetDict.save_to_disk
  • model.save_pretrained / model.from_pretrained

For the first two, I think we can leverage the built-in filesystem integrations of HuggingFace that they describe here: https://huggingface.co/docs/datasets/v1.4.0/filesystems.html . As they are using the same libraries as us for blob storage, I think we can do something clever there in the future. But for now, maybe we can add some simple logic to detect whether self.artifact.uri is s3://, gs://, or az://?

For the third (i.e. the model) I would simply use zenml.io.fileio to copy from a local temp path to self.artifact.uri. This would be simple enough for the model files.

Does that make sense?
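The scheme check suggested above could be sketched like this. The helper name and the mapping are illustrative, not part of zenml's API:

```python
from urllib.parse import urlparse

# Schemes the review mentions as remote artifact stores.
_REMOTE_SCHEMES = {"s3", "gs", "az"}


def remote_scheme(uri: str):
    """Return 's3', 'gs', or 'az' for a remote artifact URI, else None."""
    scheme = urlparse(uri).scheme
    return scheme if scheme in _REMOTE_SCHEMES else None
```

A materializer could branch on the return value to pick the matching fsspec filesystem (s3fs, gcsfs, adlfs) or fall back to local I/O.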

ds.save_to_disk(filepath)


DEFAULT_TF_MODEL_DIR = "hf_tf_model"

I would potentially push this to the top of the file

class HFTFModelMaterializer(BaseMaterializer):
"""Materializer to read a TensorFlow model to and from a huggingface pretrained model."""

from transformers import TFPreTrainedModel

Can we put this at the top of the file too, rather than nesting it inside?

)


DEFAULT_PT_MODEL_DIR = "hf_pt_model"

Same as above, might make sense to put it at the top of the file

)


DEFAULT_TOKENIZER_DIR = "hf_tokenizer"

Same as above, might make sense to put at the top of the file


# Convert tokenized datasets into tf dataset
train_set = tokenized_datasets["train"].to_tf_dataset(
columns=["attention_mask", "input_ids"],

Might be part of the config

shuffle=True,
batch_size=config.batch_size,
collate_fn=DataCollatorWithPadding(tokenizer, return_tensors="tf"),
label_cols="label",

Might be part of the config

# Convert into tf dataset format
validation_set = tokenized_datasets["test"].to_tf_dataset(
columns=["attention_mask", "input_ids"],
shuffle=False,

Might be part of the config

shuffle=False,
batch_size=config.batch_size,
collate_fn=DataCollatorWithPadding(tokenizer, return_tensors="tf"),
label_cols="label",

Might be part of the config


# Convert into tf dataset format
validation_set = tokenized_datasets["test"].to_tf_dataset(
columns=["attention_mask", "input_ids"],

Might be part of the config
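The repeated "might be part of the config" suggestion could be addressed with something like the sketch below. The `TokenClassificationConfig` name and its fields are assumptions that mirror the `to_tf_dataset` arguments shown in the diff context, not the example's actual config class.

```python
from dataclasses import dataclass, field
from typing import List


# Illustrative config sketch: the hard-coded to_tf_dataset arguments
# become configurable fields with the diff's values as defaults.
@dataclass
class TokenClassificationConfig:
    batch_size: int = 16
    columns: List[str] = field(
        default_factory=lambda: ["attention_mask", "input_ids"]
    )
    label_cols: str = "label"
    shuffle_train: bool = True  # shuffle only the training split
```

The step would then call `tokenized_datasets["train"].to_tf_dataset(columns=config.columns, label_cols=config.label_cols, batch_size=config.batch_size, shuffle=config.shuffle_train, ...)` instead of repeating literals for each split.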

@Ankur3107
Contributor Author

@htahir1
I like the idea and it makes sense as well. I will work on this.

I would like to take some time because I am traveling for office work and have very limited free time for a few days, but I will add changes whenever I get time. Please give me some time.

@htahir1
Contributor

htahir1 commented Mar 24, 2022

@Ankur3107 Please take your time; we can make it part of the release happening in two weeks :-)

@Ankur3107 Ankur3107 marked this pull request as draft March 29, 2022 13:49
@htahir1
Contributor

htahir1 commented Apr 2, 2022

@Ankur3107 How are we looking? We are making the release next week, so I thought I'd ask whether we might make progress on this PR by then. Thanks!

@Ankur3107
Contributor Author

@htahir1 Yes, I have started working on this.

Can you help me: how can I get this fs object (fs: s3fs.S3FileSystem = None)? I want to use it here:
ds.save_to_disk(self.artifact.uri, fs=fs)

@AlexejPenner
Contributor

> @htahir1 Yes, I have started working on this.
>
> Can you help me: how can I get this fs object (fs: s3fs.S3FileSystem = None)? I want to use it here:
> ds.save_to_disk(self.artifact.uri, fs=fs)

Hey Ankur, sorry for the delayed answer. We recently had some changes in our code that affected the s3_plugin, so we wanted to make sure this was stable before coming back to you. Additionally, we had some internal discussions regarding artifact stores and how their FS could be directly accessed from within a Materializer.

As of right now we don't have the perfect solution for this, but we're working on it.

In the meantime you could try a workaround: create a temporary path locally that is written to, and then copy that file into the artifact store FS. This should work as a temporary solution until we have implemented the more long-term one. Be aware that this could fail if the amount of data becomes huge.

For your materializer the pseudocode would look a bit like this:

def handle_return(self, ds: Type[Any]) -> None:
    """Writes a Dataset to the artifact store.

    Args:
        ds: The Dataset to write.
    """
    super().handle_return(ds)
    ...
    # Save locally first, then copy into the artifact store.
    ds.save_to_disk(tmp_path)
    fileio.copy(tmp_path, self.artifact.uri)
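A runnable sketch of that temp-path workaround is below. `shutil.copytree` stands in for zenml's fileio copy so the example runs locally, and the function and parameter names are illustrative; the only assumed external API is `save_to_disk` from the `datasets` library, which appears earlier in this thread.

```python
import os
import shutil
import tempfile


def save_via_tmp(ds, artifact_uri, copy_fn=shutil.copytree):
    """Save `ds` to a local temp path, then copy it to the artifact store.

    `copy_fn` plays the role of zenml's fileio copy; swap it for the real
    call inside a materializer.
    """
    tmp_dir = tempfile.mkdtemp()
    try:
        tmp_path = os.path.join(tmp_dir, "dataset")
        ds.save_to_disk(tmp_path)        # datasets.Dataset API
        copy_fn(tmp_path, artifact_uri)  # e.g. fileio.copy(...) in zenml
    finally:
        shutil.rmtree(tmp_dir)           # clean up the local staging dir
```

As the comment above notes, this stages the full dataset on local disk first, so it can fail for very large datasets; a streaming filesystem handle would be the long-term fix.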

@htahir1
Contributor

htahir1 commented Apr 11, 2022

@Ankur3107 Would appreciate an update on this. Any way we could help out? Would love to see this as part of the next release.

@Ankur3107
Contributor Author

@htahir1 I am good for review. Please review and let me know your feedback.

@Ankur3107 Ankur3107 requested a review from htahir1 April 12, 2022 05:09
@htahir1
Contributor

htahir1 commented Apr 12, 2022

@Ankur3107 There are some merge conflicts; could you kindly fix them so we can run the tests again? Then I'll do a review.

@Ankur3107 Ankur3107 marked this pull request as ready for review April 12, 2022 09:18
Contributor

@AlexejPenner AlexejPenner left a comment


Unfortunately our last release changed the location of the copy_dir() method. Once you have accounted for that, I think this would be ready to merge.

Somehow our linter is also failing; I am not entirely sure what the issue is. I will try investigating it on the side as well.

@Ankur3107
Contributor Author

@AlexejPenner updated as per your feedback.

@AlexejPenner
Contributor

> @AlexejPenner updated as per your feedback.

Thank you so much for your patience and diligence in this contribution. We greatly appreciate your work on this.

Contributor

@AlexejPenner AlexejPenner left a comment


Let's merge this :)

@AlexejPenner AlexejPenner merged commit b1466df into zenml-io:develop Apr 14, 2022
@Ankur3107
Contributor Author

@AlexejPenner @htahir1 Thank you for the guidance and patience. I'd love to contribute in the future; I am loving the zenml framework.

@AlexejPenner
Contributor

That's amazing to hear. We'd love to have your future contributions. Feel free to drop by on our Slack to connect or chat: https://zenml.io/slack-invite/

@Ankur3107
Contributor Author

@AlexejPenner Thanks, I have already connected on slack.
