# "MLOps project - part 2b: Machine Learning Workflow Orchestration using ZenML"
> "Machine learning workflow orchestration using ZenML."

- toc: True
- branch: master
- badges: true
- comments: true
- categories: [mlops]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true

In the [previous blog post](https://kargarisaac.github.io/blog/mlops/2022/08/09/machine-learning-workflow-orchestration-prefect.html), we discussed how to use Prefect as a workflow orchestrator in our MLOps project. In this blog post, we will see how ZenML can help us do the same job, and maybe other things too. Let's get started.

# ZenML

If you are an ML engineer or data scientist shipping models to production and juggling a plethora of tools, ZenML can help. It helps with versioning data, code, and models. It also helps with replicating production pipelines and monitoring models in production.

![](images/workflow-orchestration/zenml.gif)
*[source](https://docs.zenml.io/getting-started/introduction)*

> ZenML is an extensible, open-source MLOps framework for creating portable, production-ready MLOps pipelines. It's built for data scientists, ML Engineers, and MLOps Developers to collaborate as they develop to production. ZenML has simple, flexible syntax, is cloud- and tool-agnostic, and has interfaces/abstractions that are catered towards ML workflows. ZenML brings together all your favorite tools in one place so you can tailor your workflow to cater your needs.

Let's first briefly get familiar with some of ZenML's concepts:

**Pipelines and Steps**: ZenML follows a pipeline-based workflow. A pipeline consists of a series of steps, organized in any order that makes sense for your use case. You can have multiple pipelines for different purposes, for example a training pipeline to train and evaluate models and an inference pipeline to serve the model. We can use decorators such as `@step` and `@pipeline` to define steps and pipelines in ZenML.

**Stacks, Components, and Stores**: A Stack is the configuration of the underlying infrastructure and choices around how your pipeline will be run. For example, you can choose to run your pipeline locally or on the cloud by changing the stack you use.

In any Stack, there must be at least three basic Stack Components:
- *Orchestrator*: An Orchestrator is the workhorse that coordinates all the steps to run in a pipeline.  
- *Artifact Store*: An Artifact Store is a component that houses all data that pass through the pipeline. Data in the artifact store are called artifacts.
- *Metadata Store*: A Metadata Store keeps track of all the bits of extraneous data from a pipeline run. It allows you to fetch specific steps from your pipeline run and their output artifacts in a post-execution workflow.

ZenML also has other stack components that help us scale our stack to run elsewhere, for example on a cloud with powerful GPUs for training or CPUs for deployment. You can check [here](https://docs.zenml.io/mlops-stacks/categories) for the full list of these stack components.

With ZenML, we can easily switch our stack from running on a local machine to running on the cloud with a single CLI command.
Our code (steps and pipelines) stays the same. The only change is in the stack and its components. This is amazing!
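To make the idea concrete, here is a toy sketch in plain Python. It is not ZenML code: `ToyStack`, `run_pipeline`, and the component names are all hypothetical and only illustrate that the pipeline steps stay untouched while the stack changes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class ToyStack:
    """Hypothetical stand-in for a stack: just the names of the three core components."""
    orchestrator: str
    artifact_store: str
    metadata_store: str


def run_pipeline(steps: List[Callable[[int], int]], stack: ToyStack, value: int) -> str:
    """Run the same steps in order and report which stack they ran on."""
    for step_fn in steps:
        value = step_fn(value)
    return f"result={value} via {stack.orchestrator}, artifacts in {stack.artifact_store}"


# The "pipeline code": identical for every stack.
steps = [lambda x: x + 1, lambda x: x * 2]

local_stack = ToyStack("local", "local-filesystem", "sqlite")
cloud_stack = ToyStack("vertex-ai", "gcs-bucket", "cloudsql")

local_report = run_pipeline(steps, local_stack, 3)
cloud_report = run_pipeline(steps, cloud_stack, 3)
print(local_report)
print(cloud_report)
```

Both calls compute the same result; only the infrastructure description differs, which is the separation a stack is meant to provide.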


Let's get back to our example and apply ZenML! In the previous blog post, we saw our simple pipeline:

![](images/workflow-orchestration/5.png)

Your ML workflows may actually be a lot more complex. The performance of various models will need to be compared, they will need to be deployed in a production environment, and there may be extensive preprocessing that you do not want to repeat every time you train a model. In this situation, ML pipelines are useful since they let us describe our workflows as a series of interchangeable modules.

[ZenML's caching function](https://docs.zenml.io/developer-guide/steps-and-pipelines/caching), which is turned on by default, is another strong feature. Because ZenML automatically tracks and versions the inputs, outputs, and parameters of steps and pipelines, steps that have not changed are not executed again when the pipeline is re-run; their cached outputs are reused. This drastically shortens development time. You can enable or disable caching for the whole pipeline or for individual steps.
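The principle behind this kind of caching can be sketched in a few lines of plain Python. This is not how ZenML implements it: `run_cached`, `_cache`, and `expensive_step` are hypothetical names, and the sketch only hashes the step name and its parameters, whereas a real system also has to account for code changes and upstream artifacts.

```python
import hashlib
import json

_cache = {}      # hypothetical in-memory cache: key -> step output
call_count = 0   # counts how often the expensive step really runs


def expensive_step(numbers, scale):
    global call_count
    call_count += 1
    return [n * scale for n in numbers]


def run_cached(step_fn, **params):
    """Skip execution when the same step is called with the same parameters."""
    payload = json.dumps({"step": step_fn.__name__, "params": params}, sort_keys=True)
    key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = step_fn(**params)
    return _cache[key]


first = run_cached(expensive_step, numbers=[1, 2, 3], scale=2)   # executes the step
second = run_cached(expensive_step, numbers=[1, 2, 3], scale=2)  # served from the cache
third = run_cached(expensive_step, numbers=[1, 2, 3], scale=3)   # new parameter, executes again
print(call_count)
```

The second call returns the stored output without running the step, which is why repeated pipeline runs during development become much faster.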

In our ML process, all of the artifacts are automatically stored in an Artifact Store. By default, this is just a location in your local file system, but we can configure ZenML to store this information in a cloud bucket instead, such as an Amazon S3 or Google Cloud Storage bucket.

Alongside each artifact, ZenML automatically stores metadata, such as the location of the artifact, in a Metadata Store. By default, this is a SQLite database on your local computer, but we could just as easily replace it with a cloud service.

Any MLOps stack's foundation is made up of artifact stores, metadata stores, and orchestrators since they allow us to save, distribute, and reproduce our work. Without them, it's simple to lose sight of the specific steps taken to develop our present ML pipelines.

When we invoke `pipeline.run()`, the Orchestrator determines how and where each pipeline step is carried out.


## ZenML on Local Development Environment

Let's now define each of the steps as a ZenML **[Pipeline Step](https://docs.zenml.io/developer-guide/steps-and-pipelines#step)** simply by adding ZenML's `@step` [Python decorator](https://realpython.com/primer-on-python-decorators/) to several functions.
We can also use ZenML's `@pipeline` decorator to connect all of the steps into an ML pipeline.




One thing that I spent some time on was the input and output type definition for steps, especially for custom types like the tokenizer. We need to tell ZenML how to handle this custom type. ZenML's solution for that is called a [`Materializer`](https://docs.zenml.io/developer-guide/advanced-usage/materializer#using-a-custom-materializer). The following code shows the custom materializer defined for our TensorFlow Tokenizer.



```python
class TokenizerMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (Tokenizer,)
    ASSOCIATED_ARTIFACT_TYPES = (DataArtifact,)

    def handle_input(self, data_type: Type[Tokenizer]) -> Tokenizer:
        """Read from artifact store"""
        super().handle_input(data_type)
        with fileio.open(os.path.join(self.artifact.uri, 'tokenizer.pickle'), 'rb') as f:
            tokenizer = pickle.load(f)
        return tokenizer

    def handle_return(self, tokenizer: Tokenizer) -> None:
        """Write to artifact store"""
        super().handle_return(tokenizer)
        with fileio.open(os.path.join(self.artifact.uri, 'tokenizer.pickle'), 'wb') as f:
            pickle.dump(tokenizer, f, protocol=pickle.HIGHEST_PROTOCOL)
```

You then need to pass it to the step via `.with_return_materializers({"tokenizer": TokenizerMaterializer})` in the pipeline definition. You will see this in a few moments.
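Stripped of the ZenML base class, the materializer above is just a pickle round trip into a directory. The following sketch shows that core idea with the standard library only. `ToyPickleMaterializer` is a hypothetical class, the temporary directory plays the role of `self.artifact.uri`, and a plain dict stands in for the fitted tokenizer.

```python
import os
import pickle
import tempfile


class ToyPickleMaterializer:
    """Hypothetical, ZenML-free stand-in that mirrors the write/read pair above."""

    FILENAME = "tokenizer.pickle"

    def __init__(self, uri: str):
        self.uri = uri  # directory playing the role of `self.artifact.uri`

    def handle_return(self, obj) -> None:
        """Write the object into the artifact directory."""
        with open(os.path.join(self.uri, self.FILENAME), "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

    def handle_input(self):
        """Read the object back from the artifact directory."""
        with open(os.path.join(self.uri, self.FILENAME), "rb") as f:
            return pickle.load(f)


# A plain dict stands in for the fitted tokenizer.
fake_tokenizer = {"word_index": {"dress": 1, "love": 2}, "num_words": 3000}

with tempfile.TemporaryDirectory() as artifact_dir:
    materializer = ToyPickleMaterializer(artifact_dir)
    materializer.handle_return(fake_tokenizer)   # what happens after a step returns
    restored = materializer.handle_input()       # what happens before the next step runs

print(restored == fake_tokenizer)
```

The restored object is an equal copy, not the same object, which is what lets one step hand a tokenizer to another through the artifact store.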

In addition, we need to integrate MLflow in ZenML. It is a bit different from how we did it with Prefect. First you need to install the integration:

```bash
zenml integration install mlflow
```

MLflow can handle various ML lifecycle steps, such as experiment tracking, code packaging, model deployment, and more. We will focus on the [MLflow Tracking](https://mlflow.org/docs/latest/tracking.html) component in this post, but we will learn about other MLflow features in later posts.

By adding an `@enable_mlflow` decorator on top of a step function, ZenML automatically initializes MLflow, and we can log whatever we want inside the step.

To run our MLflow pipelines with ZenML, we first need to add MLflow to our ZenML MLOps stack. We register a new experiment tracker with ZenML and then add it to our current stack. The `tracking_uri` for MLflow is set when registering the tracker, as shown below, which is a bit different from what we did before and from how you do it with pure MLflow. Also, it seems that setting the experiment name is not possible in ZenML 0.13: it uses a default name, which is the name of the function you use for training. We will see it in a few moments. The documentation says you can pass it to the decorator, but in the new version I get an error. I hope they add it in future versions.


```bash
# Register the MLflow experiment tracker
zenml experiment-tracker register mlflow_tracker --flavor=mlflow --tracking_uri="sqlite:///mlflow.db"

# Add the MLflow experiment tracker into our default stack
zenml stack update default -e mlflow_tracker
```

Then run the MLflow UI using the following command:

```bash
mlflow ui --backend-store-uri sqlite:///mlflow.db
```

Note that when you use the `@enable_mlflow` decorator for a step, the step becomes one single MLflow run, and you cannot have multiple runs inside it. If you want to test different sets of hyperparameters and do a hyperparameter search, you need to call the step multiple times. The other problem you may face is that you cannot pass a parameter to a step as an input unless it is the output of a previous step. You can use the same trick as [this example](https://github.com/zenml-io/zenbytes/blob/main/2-1_Experiment_Tracking.ipynb). In the code below, the step is defined inside `build_dl_pipeline`, so it can read the hyperparameters from the enclosing function.
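The workaround used in this post's script boils down to a Python closure. Here is a minimal sketch without ZenML, where `build_step` and `train_step` are hypothetical names: a factory function receives the hyperparameters and returns a function that reads them from the enclosing scope instead of taking them as inputs.

```python
def build_step(hyperparams):
    """Factory: returns a zero-argument 'step' that closes over hyperparams."""

    def train_step():
        # The step reads its settings from the enclosing scope instead of
        # receiving them as step inputs.
        return f"training with embedding_dim={hyperparams['embedding_dim']}"

    return train_step


# Build one step per hyperparameter setting, then call each of them.
steps = [build_step({"embedding_dim": dim}) for dim in (32, 64)]
messages = [s() for s in steps]
print(messages)
```

The full script that follows applies this pattern inside `build_dl_pipeline`, with the decorated `train_model` step as the inner function.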

```python
import numpy as np
import pandas as pd
import os
import nltk
import re
if os.path.exists('./corpora'):
    os.environ["NLTK_DATA"] = "./corpora"
else:
    nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import mlflow
import pickle
from zenml.steps import step, Output
from zenml.pipelines import pipeline
from zenml.artifacts import DataArtifact
from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer
from zenml.integrations.mlflow.mlflow_step_decorator import enable_mlflow
from typing import Type


DATA_PATH = "data/Womens Clothing E-Commerce Reviews.csv"


class TokenizerMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (Tokenizer,)
    ASSOCIATED_ARTIFACT_TYPES = (DataArtifact,)

    def handle_input(self, data_type: Type[Tokenizer]) -> Tokenizer:
        """Read from artifact store"""
        super().handle_input(data_type)
        with fileio.open(os.path.join(self.artifact.uri, 'tokenizer.pickle'), 'rb') as f:
            tokenizer = pickle.load(f)
        return tokenizer

    def handle_return(self, tokenizer: Tokenizer) -> None:
        """Write to artifact store"""
        super().handle_return(tokenizer)
        with fileio.open(os.path.join(self.artifact.uri, 'tokenizer.pickle'), 'wb') as f:
            pickle.dump(tokenizer, f, protocol=pickle.HIGHEST_PROTOCOL)


## data loading
@step
def read_data() -> pd.DataFrame:
    data = pd.read_csv(DATA_PATH, index_col =[0])
    print("Data loaded.\n\n")
    return data

## preprocess text
@step
def preprocess_data(
    data: pd.DataFrame,
    ) -> Output(corpus=list, y=np.ndarray):
    #check if data/corpus is created before or not
    if not os.path.exists('data/corpus_y.pickle'):
        print("Preprocessed data not found. Creating new data. \n\n")
        data = data[~data['Review Text'].isnull()]  # drop rows that have no review text
        X = data[['Review Text']]
        X.index = np.arange(len(X))

        y = data['Recommended IND'].values

        corpus = []
        ps = PorterStemmer()
        stop_words = set(stopwords.words('english'))  # build the stop-word set once
        for i in range(len(X)):
            # keep letters only
            review = re.sub('[^a-zA-Z]', ' ', X['Review Text'][i])
            review = review.lower()
            review = review.split()
            review = [ps.stem(word) for word in review if word not in stop_words]
            review = ' '.join(review)
            corpus.append(review)

        with open('data/corpus_y.pickle', 'wb') as handle:
            pickle.dump((corpus, y), handle)
    else:
        print("Preprocessed data found. Loading data. \n\n")
        with open('data/corpus_y.pickle', 'rb') as handle:
            corpus, y = pickle.load(handle)

    print("Data preprocessed.\n\n")

    return corpus, y

## tokenization and dataset creation
@step
def create_dataset(
    corpus: list, 
    y: np.ndarray
    ) -> Output(X_train=np.ndarray, X_test=np.ndarray, y_train=np.ndarray, y_test=np.ndarray, tokenizer=Tokenizer):

    tokenizer = Tokenizer(num_words = 3000)
    tokenizer.fit_on_texts(corpus)

    sequences = tokenizer.texts_to_sequences(corpus)
    padded = pad_sequences(sequences, padding='post')

    X_train, X_test, y_train, y_test = train_test_split(padded, y, test_size = 0.2, random_state = 42)

    print("Dataset created.\n\n")
    return X_train, X_test, y_train, y_test, tokenizer


def build_dl_pipeline(hyperparams):
    @enable_mlflow(nested=True)
    @step(enable_cache=False)
    def train_model(
        X_train: np.ndarray, 
        y_train: np.ndarray, 
        X_test: np.ndarray, 
        y_test: np.ndarray, 
        tokenizer: Tokenizer
        ) -> None:
        
        mlflow.tensorflow.autolog()
        
        embedding_dim = hyperparams['embedding_dim']
        batch_size = hyperparams['batch_size']

        # model definition
        model = tf.keras.Sequential([
            tf.keras.layers.Embedding(3000, embedding_dim),
            tf.keras.layers.GlobalAveragePooling1D(),
            tf.keras.layers.Dense(6, activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid')
        ])

        ## training
        num_epochs = 50
        callback = tf.keras.callbacks.EarlyStopping(
            monitor="val_loss",
            min_delta=0,
            patience=2,
            verbose=0,
            mode="auto",
            baseline=None,
            restore_best_weights=False,
        )

        model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

        mlflow.set_tag("developer", "Isaac")
        mlflow.set_tag("algorithm", "Deep Learning")
        mlflow.log_param("train-data", "Womens Clothing E-Commerce Reviews")
        mlflow.log_param("embedding-dim", embedding_dim)

        print("Fit model on training data")
        model.fit(
            X_train,
            y_train,
            batch_size=batch_size,
            epochs=num_epochs,
            callbacks=callback,
            # We pass some validation for
            # monitoring validation loss and metrics
            # at the end of each epoch
            validation_data=(X_test, y_test),
        )

        ## save model and tokenizer
        mlflow.keras.log_model(model, 'models/model_dl')

        os.makedirs('tokenizer_pickle', exist_ok=True)  # make sure the target folder exists
        with open('tokenizer_pickle/tf_tokenizer.pickle', 'wb') as handle:
            pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
        
        mlflow.log_artifact(local_path="tokenizer_pickle/tf_tokenizer.pickle", artifact_path="tokenizer_pickle")

        # Evaluate the model on the test data using `evaluate`
        print("Evaluate on test data")
        results = model.evaluate(X_test, y_test, batch_size=128)
        print("test loss, test acc:", results)
        mlflow.log_metric("loss", results[0])
        mlflow.log_metric("accuracy", results[1])

        print("Model training completed.\n\n")

    return training_pipeline(
        reading_data=read_data(),
        preprocessing_data=preprocess_data(),
        creating_dataset=create_dataset().with_return_materializers({"tokenizer": TokenizerMaterializer}),
        training_model=train_model(),
    )


@pipeline
def training_pipeline(
    reading_data,
    preprocessing_data,
    creating_dataset,
    training_model,
):
    data = reading_data()
    corpus, y = preprocessing_data(data)
    X_train, X_test, y_train, y_test, tokenizer = creating_dataset(corpus, y)
    training_model(X_train, y_train, X_test, y_test, tokenizer)


if __name__ == '__main__':

    for embedding_dim, batch_size in zip([32, 64, 128], [32, 64, 128]):
        hyperparams = {
            'embedding_dim': embedding_dim,
            'batch_size': batch_size
        }
        print(hyperparams)
        build_dl_pipeline(hyperparams=hyperparams).run()    
```

The above will run the training for different sets of hyperparameters and do the experiment tracking using MLflow locally.
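One detail of the loop in the script is worth spelling out: `zip` pairs the two lists position by position, so it launches three runs, not a full grid. If you want every combination, `itertools.product` is one option. The sketch below only builds the hyperparameter dictionaries and does not run any pipeline; the variable names are mine.

```python
from itertools import product

embedding_dims = [32, 64, 128]
batch_sizes = [32, 64, 128]

# zip pairs values position by position: 3 runs.
paired = [{"embedding_dim": e, "batch_size": b} for e, b in zip(embedding_dims, batch_sizes)]

# product enumerates every combination: 3 x 3 = 9 runs.
grid = [{"embedding_dim": e, "batch_size": b} for e, b in product(embedding_dims, batch_sizes)]

print(len(paired), len(grid))
```

Each dictionary in either list could be passed to `build_dl_pipeline(hyperparams=...)` in the same way as in the script.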


## ZenML on GCP

Now let's see how to deploy this pipeline in the cloud. I will do it on GCP here, but the process is almost the same on other clouds.

You can have a simple MLOps stack ready for running your machine learning workloads. To do this, you can follow the steps [here](https://github.com/zenml-io/mlops-stacks/tree/main/vertex-ai). It sets up the following resources:

- A Vertex AI enabled workspace as an [orchestrator](https://docs.zenml.io/mlops-stacks/orchestrators) that you can submit your pipelines to.
- A service account with all the necessary permissions needed to execute your pipelines.
- A GCS bucket as an [artifact store](https://docs.zenml.io/mlops-stacks/artifact-stores), which can be used to store all your ML artifacts like the model, checkpoints, etc. 
- A CloudSQL instance as a [metadata store](https://docs.zenml.io/mlops-stacks/metadata-stores) that is essential to track all your metadata and its location in your artifact store.  
- A Container Registry repository as [container registry](https://docs.zenml.io/mlops-stacks/container-registries) for hosting your docker images.
- A [secrets manager](https://docs.zenml.io/mlops-stacks/secrets-managers) enabled for storing your secrets. 
- An optional MLflow Tracking server deployed on a GKE cluster as an [experiment tracker](https://docs.zenml.io/mlops-stacks/experiment-trackers). 


Note that the Terraform recipe will create the required service account and enable the required APIs on GCP.

- Create a new project on GCP.

- Set the project in the gcloud CLI:

```bash
gcloud config set project <new project id>
```

- Pull the ZenML `vertex-ai` recipe:

```bash
zenml stack recipe pull vertex-ai
```

- Customize your deployment by editing the default values in the `locals.tf` file, such as `project_id`, `region`, the `gcs` name, and `prefix`.

- Add your secret information, like keys and passwords, to the `values.tfvars.json` file, which is not committed and only exists locally.

- Deploy using ZenML:

```bash
zenml stack recipe deploy vertex-ai
```

- Create the ZenML stack by importing the generated YAML file (its name will differ for your deployment). You may also need to install the GCP integration using `zenml integration install gcp`.

```bash
zenml stack import vertex-ai zenml_stack_recipes/vertex-ai/vertex_stack_2022-08-26T07_02.yml
```

- Later, when you no longer need the deployed infrastructure, you can destroy it:


```bash
zenml stack recipe destroy vertex-ai
zenml stack recipe clean
```

After importing the stack, use `zenml stack list` to see the list of stacks. If the newly created stack (`vertex-ai`) is not active, use `zenml stack set vertex-ai` to activate it.

![](images/workflow-orchestration/6.png)

You should then be able to run your pipeline. On Windows, you may need to run `python <path-to-python-env>\Scripts\pywin32_postinstall.py -install` if you see an error related to `pywin32`. Also make sure you have Docker installed and running.

The other point that you need to take care of is the requirements for your code on the cloud. One way is to create a `requirements.txt` file in your repo and set the path to it in your code. Check [here](https://docs.zenml.io/developer-guide/advanced-usage/docker#how-to-install-additional-pip-dependencies) to learn more.

```python
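# ... other imports ...
from zenml.config.docker_configuration import DockerConfiguration

# Tell ZenML which extra pip packages to install in the Docker image.
docker_config = DockerConfiguration(requirements="./requirements.txt")

# ... steps as before ...

@pipeline(docker_configuration=docker_config)
def training_pipeline(reading_data, preprocessing_data, creating_dataset, training_model):
    ...  # same body as before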
```

Here is the content of the `requirements.txt` file:

```
nltk==3.7
tensorflow==2.9.1
mlflow==1.25.1
zenml==0.13.0
scikit-learn
```
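A requirements file should only list installable packages. Standard-library modules such as `pickle` ship with Python and do not belong there. As a quick sanity check, here is a small sketch that flags such entries. `stdlib_entries` is a hypothetical helper, it only understands plain `name` and `name==version` lines, and the sample text deliberately includes `pickle` to show the check firing.

```python
import sys

# Python 3.10+ exposes the standard-library module names; on older versions
# fall back to a tiny hand-written set (an assumption for this sketch).
STDLIB = getattr(sys, "stdlib_module_names", frozenset({"pickle", "os", "re", "typing"}))


def stdlib_entries(requirements_text: str) -> list:
    """Return requirement names that are actually standard-library modules."""
    found = []
    for line in requirements_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name = line.split("==")[0].strip()
        if name in STDLIB:
            found.append(name)
    return found


sample = """nltk==3.7
tensorflow==2.9.1
scikit-learn
pickle
"""
flagged = stdlib_entries(sample)
print(flagged)
```

Any name the helper reports can simply be deleted from the file.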

The whole code would be like this:

```python
# train_dl_zenml.py file

import numpy as np
import pandas as pd
import os
import nltk
import re
if os.path.exists('./corpora'):
    os.environ["NLTK_DATA"] = "./corpora"
else:
    nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
import mlflow
import pickle
from zenml.steps import step, Output
from zenml.pipelines import pipeline
from zenml.artifacts import DataArtifact
from zenml.io import fileio
from zenml.materializers.base_materializer import BaseMaterializer
from zenml.integrations.mlflow.mlflow_step_decorator import enable_mlflow
from zenml.config.docker_configuration import DockerConfiguration
from typing import Type


DATA_PATH = "data/Womens Clothing E-Commerce Reviews.csv"
docker_config = DockerConfiguration(requirements="./requirements.txt")


class TokenizerMaterializer(BaseMaterializer):
    ASSOCIATED_TYPES = (Tokenizer,)
    ASSOCIATED_ARTIFACT_TYPES = (DataArtifact,)

    def handle_input(self, data_type: Type[Tokenizer]) -> Tokenizer:
        """Read from artifact store"""
        super().handle_input(data_type)
        with fileio.open(os.path.join(self.artifact.uri, 'tokenizer.pickle'), 'rb') as f:
            tokenizer = pickle.load(f)
        return tokenizer

    def handle_return(self, tokenizer: Tokenizer) -> None:
        """Write to artifact store"""
        super().handle_return(tokenizer)
        with fileio.open(os.path.join(self.artifact.uri, 'tokenizer.pickle'), 'wb') as f:
            pickle.dump(tokenizer, f, protocol=pickle.HIGHEST_PROTOCOL)


## data loading
@step
def read_data() -> pd.DataFrame:
    data = pd.read_csv(DATA_PATH, index_col =[0])
    print("Data loaded.\n\n")
    return data

## preprocess text
@step
def preprocess_data(
    data: pd.DataFrame,
    ) -> Output(corpus=list, y=np.ndarray):
    #check if data/corpus is created before or not
    if not os.path.exists('data/corpus_y.pickle'):
        print("Preprocessed data not found. Creating new data. \n\n")
        data = data[~data['Review Text'].isnull()]  # drop rows that have no review text
        X = data[['Review Text']]
        X.index = np.arange(len(X))

        y = data['Recommended IND'].values

        corpus = []
        ps = PorterStemmer()
        stop_words = set(stopwords.words('english'))  # build the stop-word set once
        for i in range(len(X)):
            # keep letters only
            review = re.sub('[^a-zA-Z]', ' ', X['Review Text'][i])
            review = review.lower()
            review = review.split()
            review = [ps.stem(word) for word in review if word not in stop_words]
            review = ' '.join(review)
            corpus.append(review)

        with open('data/corpus_y.pickle', 'wb') as handle:
            pickle.dump((corpus, y), handle)
    else:
        print("Preprocessed data found. Loading data. \n\n")
        with open('data/corpus_y.pickle', 'rb') as handle:
            corpus, y = pickle.load(handle)

    print("Data preprocessed.\n\n")

    return corpus, y

## tokenization and dataset creation
@step
def create_dataset(
    corpus: list, 
    y: np.ndarray
    ) -> Output(X_train=np.ndarray, X_test=np.ndarray, y_train=np.ndarray, y_test=np.ndarray, tokenizer=Tokenizer):

    tokenizer = Tokenizer(num_words = 3000)
    tokenizer.fit_on_texts(corpus)

    sequences = tokenizer.texts_to_sequences(corpus)
    padded = pad_sequences(sequences, padding='post')

    X_train, X_test, y_train, y_test = train_test_split(padded, y, test_size = 0.2, random_state = 42)

    print("Dataset created.\n\n")
    # return tokenizer
    return X_train, X_test, y_train, y_test, tokenizer


def build_dl_pipeline(hyperparams):
    @enable_mlflow(
        nested=True,
        # experiment_name="customer-sentiment-analysis"
        )
    @step(enable_cache=False)
    def train_model(
        X_train: np.ndarray, 
        y_train: np.ndarray, 
        X_test: np.ndarray, 
        y_test: np.ndarray, 
        tokenizer: Tokenizer
        ) -> None:
        
        mlflow.tensorflow.autolog()
        
        embedding_dim = hyperparams['embedding_dim']
        batch_size = hyperparams['batch_size']

        # model definition
        model = tf.keras.Sequential([
            tf.keras.layers.Embedding(3000, embedding_dim),
            tf.keras.layers.GlobalAveragePooling1D(),
            tf.keras.layers.Dense(6, activation='relu'),
            tf.keras.layers.Dense(1, activation='sigmoid')
        ])

        ## training
        num_epochs = 50
        callback = tf.keras.callbacks.EarlyStopping(
            monitor="val_loss",
            min_delta=0,
            patience=2,
            verbose=0,
            mode="auto",
            baseline=None,
            restore_best_weights=False,
        )

        model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])

        mlflow.set_tag("developer", "Isaac")
        mlflow.set_tag("algorithm", "Deep Learning")
        mlflow.log_param("train-data", "Womens Clothing E-Commerce Reviews")
        mlflow.log_param("embedding-dim", embedding_dim)

        print("Fit model on training data")
        model.fit(
            X_train,
            y_train,
            batch_size=batch_size,
            epochs=num_epochs,
            callbacks=callback,
            # We pass some validation for
            # monitoring validation loss and metrics
            # at the end of each epoch
            validation_data=(X_test, y_test),
        )

        ## save model and tokenizer
        mlflow.keras.log_model(model, 'models/model_dl')

        os.makedirs('tokenizer_pickle', exist_ok=True)  # make sure the target folder exists
        with open('tokenizer_pickle/tf_tokenizer.pickle', 'wb') as handle:
            pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
        
        mlflow.log_artifact(local_path="tokenizer_pickle/tf_tokenizer.pickle", artifact_path="tokenizer_pickle")

        # Evaluate the model on the test data using `evaluate`
        print("Evaluate on test data")
        results = model.evaluate(X_test, y_test, batch_size=128)
        print("test loss, test acc:", results)
        mlflow.log_metric("loss", results[0])
        mlflow.log_metric("accuracy", results[1])

        print("Model training completed.\n\n")

    return training_pipeline(
        reading_data=read_data(),
        preprocessing_data=preprocess_data(),
        creating_dataset=create_dataset().with_return_materializers({"tokenizer": TokenizerMaterializer}),
        training_model=train_model(),
    )


@pipeline(docker_configuration=docker_config)
def training_pipeline(
    reading_data,
    preprocessing_data,
    creating_dataset,
    training_model,
):
    data = reading_data()
    corpus, y = preprocessing_data(data)
    X_train, X_test, y_train, y_test, tokenizer = creating_dataset(corpus, y)
    training_model(X_train, y_train, X_test, y_test, tokenizer)


if __name__ == '__main__':

    for embedding_dim, batch_size in zip([32, 64, 128], [32, 64, 128]):
        hyperparams = {
            'embedding_dim': embedding_dim,
            'batch_size': batch_size
        }
        build_dl_pipeline(hyperparams=hyperparams).run()    
```

Then you can run your pipeline:


```bash
python train_dl_zenml.py
```

If you run into errors, here are some tips:

- If you are on Windows, there is a folder at `C:\Users\<user>\AppData\Roaming\zenml`. You can use `zenml clean` to remove it. If that fails because of permissions, you can remove it manually.


To see the MLflow UI, you can ...

Here are some resources I used:

https://github.com/zenml-io/mlops-stacks/tree/main/vertex-ai

https://github.com/zenml-io/zenml/blob/main/examples/vertex_ai_orchestration/run.py
    
> youtube: https://www.youtube.com/watch?v=qgvmvexGv_c

As a final thought, I must admit that I loved ZenML. It is much more than a way to define a machine learning pipeline: it helps you integrate different MLOps tools in your stack and makes all the steps and tools easy to use. I decided to continue using it in this project and also in the SUPPLYZ.eu stack.