# TFX Guided Project on Vertex

**Learning Objectives:**

* Learn how to generate a standard TFX template pipeline using `tfx template`
* Learn how to modify and run a templated TFX pipeline on Vertex

In [1]:
import os

from google.cloud import aiplatform

## Step 1. Environment setup

### Environment variable setup

Let's set some environment variables to use Vertex Pipelines.

Change your region if needed.

In [2]:
shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
GOOGLE_CLOUD_PROJECT = shell_output[0]
REGION = "us-central1"

%env GOOGLE_CLOUD_PROJECT={GOOGLE_CLOUD_PROJECT}
%env REGION={REGION}

env: GOOGLE_CLOUD_PROJECT=qwiklabs-asl-01-579c20dd4e24
env: REGION=us-central1
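
If `gcloud` has no project configured, `shell_output` in the cell above may be empty (or contain only a blank line), in which case `shell_output[0]` fails or yields an empty project ID. Below is a minimal sketch of a more defensive lookup. The helper name `resolve_project` and the fallback to an already-set `GOOGLE_CLOUD_PROJECT` environment variable are assumptions made for illustration, not part of the original notebook.

```python
import os


def resolve_project(shell_lines, environ=None):
    """Return the first non-blank line of `shell_lines`, else fall back to the environment.

    `resolve_project` is a hypothetical helper, not part of TFX or gcloud.
    """
    environ = os.environ if environ is None else environ
    for line in shell_lines:
        if line.strip():
            return line.strip()
    # Assumption: an earlier session may already have exported the variable.
    project = environ.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if not project:
        raise ValueError(
            "No project found: run `gcloud config set project <PROJECT_ID>` first."
        )
    return project


# In the notebook this would be called as: resolve_project(shell_output)
print(resolve_project(["my-project-id"]))  # "my-project-id" is a placeholder
```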


## Step 2. Copy the predefined template to your project directory.

In this step, we will create a working pipeline project directory and 
files by copying additional files from a predefined template.

You may give your pipeline a different name by changing the PIPELINE_NAME below. 

In [3]:
PIPELINE_NAME = "tfx-guided-project-on-vertex"

This will also become the name of the project directory where your files will be put:

In [4]:
PROJECT_DIR = os.path.join(os.path.expanduser("."), PIPELINE_NAME)
PROJECT_DIR

'./tfx-guided-project-on-vertex'

TFX includes the `taxi` template with the TFX Python package.

If you are planning to solve a point-wise prediction problem,
including classification and regression, this template could be used as a starting point.

The `tfx template copy` CLI command copies predefined template files into your project directory.

In [5]:
!tfx template copy \
  --pipeline-name={PIPELINE_NAME} \
  --destination-path={PROJECT_DIR} \
  --model=taxi

CLI
Copying taxi pipeline template
configs.py -> ./tfx-guided-project-on-vertex/pipeline/configs.py
__init__.py -> ./tfx-guided-project-on-vertex/pipeline/__init__.py
pipeline.py -> ./tfx-guided-project-on-vertex/pipeline/pipeline.py
features.py -> ./tfx-guided-project-on-vertex/models/features.py
features_test.py -> ./tfx-guided-project-on-vertex/models/features_test.py
preprocessing_test.py -> ./tfx-guided-project-on-vertex/models/preprocessing_test.py
model_test.py -> ./tfx-guided-project-on-vertex/models/estimator_model/model_test.py
constants.py -> ./tfx-guided-project-on-vertex/models/estimator_model/constants.py
__init__.py -> ./tfx-guided-project-on-vertex/models/estimator_model/__init__.py
model.py -> ./tfx-guided-project-on-vertex/models/estimator_model/model.py
__init__.py -> ./tfx-guided-project-on-vertex/models/__init__.py
preprocessing.py -> ./tfx-guided-project-on-vertex/models/preprocessing.py
model_test.py -> ./tfx-guided-project-on-vertex/models/keras_model/model_test

Next, we need to build the Docker container that will run the TFX components on Vertex and push it to the Google Container Registry (`gcr.io`) associated with the project:

In [6]:
# Docker image name for the pipeline image.
CUSTOM_TFX_IMAGE = f"gcr.io/{GOOGLE_CLOUD_PROJECT}/{PIPELINE_NAME}"
CUSTOM_TFX_IMAGE

'gcr.io/qwiklabs-asl-01-579c20dd4e24/tfx-guided-project-on-vertex'

Let's move into the TFX project scaffold generated by `tfx template` and create a `Dockerfile` in there:

In [7]:
%cd {PROJECT_DIR}

/home/jupyter/asl-ml-immersion/notebooks/tfx_pipelines/guided_projects/tfx-guided-project-on-vertex


In [8]:
%%writefile Dockerfile
FROM gcr.io/tfx-oss-public/tfx:1.4.0

RUN pip install -U pip
RUN pip install google-cloud-aiplatform==1.7.1 kfp==1.8.1

WORKDIR /pipeline
COPY . ./
ENV PYTHONPATH="/pipeline:${PYTHONPATH}"

Writing Dockerfile


We can now build and push the container: 

In [9]:
!gcloud builds submit --timeout 15m --tag $CUSTOM_TFX_IMAGE .

Creating temporary tarball archive of 24 file(s) totalling 1.9 MiB before compression.
Some files were not included in the source upload.

Check the gcloud log [/home/jupyter/.config/gcloud/logs/2022.11.18/09.44.14.173230.log] to see which files and the contents of the
default gcloudignore file used (see `$ gcloud topic gcloudignore` to learn
more).

Uploading tarball of [.] to [gs://qwiklabs-asl-01-579c20dd4e24_cloudbuild/source/1668764654.268241-df2fc4a287bc47cfa095a7bb68431d61.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/qwiklabs-asl-01-579c20dd4e24/locations/global/builds/41d9c60f-43d9-4325-a4d8-19e1744a6de8].
Logs are available at [ https://console.cloud.google.com/cloud-build/builds/41d9c60f-43d9-4325-a4d8-19e1744a6de8?project=296524281444 ].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "41d9c60f-43d9-4325-a4d8-19e1744a6de8"

FETCHSOURCE
Fetching storage object: gs://qwiklabs-asl-01-579c20dd4e24_cloudbuild/source/1

## Step 3. Browse your copied source files

The TFX template provides basic scaffold files to build a pipeline, including Python source code,
sample data, and Jupyter Notebooks to analyse the output of the pipeline. 

The `taxi` template uses the Chicago Taxi dataset.

Here is a brief introduction to each of the Python files:

`pipeline` - This directory contains the definition of the pipeline
* `configs.py` — defines common constants for pipeline runners
* `pipeline.py` — defines TFX components and a pipeline

`models` - This directory contains ML model definitions.
* `features.py`, `features_test.py` — defines features for the model
* `preprocessing.py`, `preprocessing_test.py` — defines preprocessing jobs using tf::Transform

`models/estimator_model` - This directory contains an Estimator based model.
* `constants.py` — defines constants of the model
* `model.py`, `model_test.py` — defines DNN model using TF estimator

`models/keras_model` - This directory contains a Keras based model.
* `constants.py` — defines constants of the model
* `model.py`, `model_test.py` — defines DNN model using Keras

`local_runner.py`, `kubeflow_runner.py`, `kubeflow_v2_runner.py` — define runners for each orchestration engine


**Running the tests:**
You might notice that there are some files with `_test.py` in their name. 
These are unit tests of the pipeline and it is recommended to add more unit 
tests as you implement your own pipelines. 
You can run unit tests using Python's `-m` flag and supplying the name of the test module. 
You can usually get a module name by deleting the `.py` extension and replacing `/` with `.`.

For example:

In [11]:
!python -m models.features_test
!python -m models.keras_model.model_test

2022-11-18 10:03:42.341977: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64:/usr/local/nccl2/lib:/usr/local/cuda/extras/CUPTI/lib64
2022-11-18 10:03:42.342029: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-11-18 10:03:42.342051: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (mlops): /proc/driver/nvidia/version does not exist
Running tests under Python 3.7.12: /opt/conda/bin/python
[ RUN      ] FeaturesTest.testNumberOfBucketFeatureBucketCount
INFO:tensorflow:time(__main__.FeaturesTest.testNumberOfBucketFeatureBucketCount): 0.0s
I1118 10:03:42.345040 140549603784512 test_util.py:2189] time(__main__.FeaturesTest.testNumberOfBucketFeatureBucketCount): 0.0s
[       OK ] FeaturesTest.te
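
The path-to-module rule described above (drop the `.py` extension, replace `/` with `.`) can be sketched as a small helper. The function name `module_name_from_path` is hypothetical and only illustrates the conversion.

```python
from pathlib import PurePosixPath


def module_name_from_path(path):
    """Convert a relative path such as 'models/features_test.py' to 'models.features_test'."""
    p = PurePosixPath(path)
    if p.suffix != ".py":
        raise ValueError(f"Expected a .py file, got: {path}")
    # Drop the extension, then join the remaining path components with dots.
    return ".".join(p.with_suffix("").parts)


for path in ["models/features_test.py", "models/keras_model/model_test.py"]:
    print(f"python -m {module_name_from_path(path)}")
# python -m models.features_test
# python -m models.keras_model.model_test
```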

Let's quickly go over the structure of a test file used to test TensorFlow code:

In [12]:
!tail -26 models/features_test.py

# See the License for the specific language governing permissions and
# limitations under the License.


import tensorflow as tf

from models import features


class FeaturesTest(tf.test.TestCase):

  def testNumberOfBucketFeatureBucketCount(self):
    self.assertEqual(
        len(features.BUCKET_FEATURE_KEYS),
        len(features.BUCKET_FEATURE_BUCKET_COUNT))
    self.assertEqual(
        len(features.CATEGORICAL_FEATURE_KEYS),
        len(features.CATEGORICAL_FEATURE_MAX_VALUES))

  def testTransformedNames(self):
    names = ["f1", "cf"]
    self.assertEqual(["f1_xf", "cf_xf"], features.transformed_names(names))


if __name__ == "__main__":
  tf.test.main()


First, notice that the test file imports the module containing the code under test. Here we want to test the code in `features.py`, so we import the module `features`:
```python
from models import features
```
To implement test cases, start by defining your own test class inheriting from `tf.test.TestCase`:
```python
class FeaturesTest(tf.test.TestCase):
```
When you execute the test file with
```bash
python -m models.features_test
```
the main method
```python
tf.test.main()
```
will parse your test class (here: `FeaturesTest`) and execute every method whose name starts with `test`. Here we have two such methods:
```python
def testNumberOfBucketFeatureBucketCount(self):
def testTransformedNames(self):
```
So when you want to add a test case, just add a method whose name starts with `test` to that test class. The body of these test methods is where the actual testing takes place. In this case, for instance, `testTransformedNames` tests the function `features.transformed_names` and makes sure it outputs what is expected.
Since your test class inherits from `tf.test.TestCase`, it has a number of helper methods you can use to create tests, for instance
```python
self.assertEqual(expected_outputs, obtained_outputs)
```
that will fail the test case if `obtained_outputs` does not match `expected_outputs`.
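
The `testTransformedNames` test only pins down an input and its expected output. One implementation consistent with it is sketched below, assuming that each feature name simply receives an `_xf` suffix. The helper `transformed_name` (singular) is an assumption of this sketch; check `models/features.py` for the actual definitions.

```python
def transformed_name(key):
    """Append the '_xf' suffix that the test expects on transformed feature names."""
    return key + "_xf"


def transformed_names(keys):
    """Apply `transformed_name` to each key, preserving order."""
    return [transformed_name(key) for key in keys]


# Same check as `testTransformedNames`, written as a plain assertion.
assert transformed_names(["f1", "cf"]) == ["f1_xf", "cf_xf"]
```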


Typical examples of test cases you may want to implement for machine learning code include tests ensuring that your model builds correctly, that your preprocessing function preprocesses raw data as expected, or that your model can train successfully on a few mock examples. When writing tests, make sure that they execute quickly (we just want to check that the code works, not actually train a performant model). For that, you may have to create synthetic data in your test files. For more information, read the [tf.test.TestCase documentation](https://www.tensorflow.org/api_docs/python/tf/test/TestCase) and the [TensorFlow testing best practices](https://www.tensorflow.org/community/contribute/tests).
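
As an illustration of a fast test on synthetic data, here is a minimal sketch that uses the standard library's `unittest.TestCase`, which `tf.test.TestCase` builds on, so the same structure carries over. The function `scale_to_unit_range` is a hypothetical preprocessing step invented for this example; it is not part of the template.

```python
import unittest


def scale_to_unit_range(values):
    """Hypothetical preprocessing step: linearly rescale values to [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Avoid division by zero on constant columns.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


class PreprocessingTest(unittest.TestCase):

    def testScalesSyntheticColumn(self):
        # A tiny synthetic input keeps the test fast.
        self.assertEqual([0.0, 0.5, 1.0], scale_to_unit_range([10, 15, 20]))

    def testConstantColumn(self):
        self.assertEqual([0.0, 0.0], scale_to_unit_range([3, 3]))


def run_suite():
    """Run the tests programmatically and report whether they all passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(PreprocessingTest)
    return unittest.TextTestRunner(verbosity=0).run(suite).wasSuccessful()
```

In a real test file you would typically end with `unittest.main()` (or `tf.test.main()`) under an `if __name__ == "__main__":` guard instead of calling `run_suite()`.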


## Step 4. Run your first TFX pipeline

Components in the TFX pipeline will generate outputs for each run as
[ML Metadata Artifacts](https://www.tensorflow.org/tfx/guide/mlmd), and they need to be stored in a GCS bucket accessible from Vertex.


Let us create this bucket. Its name will be `<YOUR_PROJECT>-kubeflowpipelines-default`.

**Note:** The name of this bucket can be changed, but then it will also need to be changed in the generated `./pipeline/configs.py` file, which also defines a corresponding `GCS_BUCKET_NAME` variable.

In [13]:
GCS_BUCKET_NAME = GOOGLE_CLOUD_PROJECT + "-kubeflowpipelines-default"
GCS_BUCKET_NAME

'qwiklabs-asl-01-579c20dd4e24-kubeflowpipelines-default'

We now create this bucket in case it does not exist:

In [14]:
!gsutil ls | grep ^gs://{GCS_BUCKET_NAME}/$ || gsutil mb -l {REGION} gs://{GCS_BUCKET_NAME}

Creating gs://qwiklabs-asl-01-579c20dd4e24-kubeflowpipelines-default/...
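
The same "create the bucket only if it is missing" logic can also be expressed in Python. The sketch below assumes a client object that behaves like `google.cloud.storage.Client`, that is, one providing `lookup_bucket` (which returns `None` for a missing bucket) and `create_bucket`. The helper name `ensure_bucket` is hypothetical, and the `google-cloud-storage` package is not otherwise used in this notebook.

```python
def ensure_bucket(client, name, location):
    """Return the bucket called `name`, creating it in `location` if it is missing.

    `client` is assumed to behave like `google.cloud.storage.Client`.
    """
    bucket = client.lookup_bucket(name)
    if bucket is None:
        bucket = client.create_bucket(name, location=location)
    return bucket


# Hypothetical usage in the notebook (needs the google-cloud-storage package):
#   from google.cloud import storage
#   client = storage.Client(project=GOOGLE_CLOUD_PROJECT)
#   ensure_bucket(client, GCS_BUCKET_NAME, REGION)
```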


Let's upload our sample data to the GCS bucket so that we can use it in our pipeline later.

In [15]:
!gsutil cp data/data.csv gs://{GCS_BUCKET_NAME}/tfx-template/data/taxi/data.csv

Copying file://data/data.csv [Content-Type=text/csv]...
/ [1 files][  1.8 MiB/  1.8 MiB]                                                
Operation completed over 1 objects/1.8 MiB.                                      


The pipeline is now ready to be compiled and executed. The following compilation command produces a `pipeline.json` artifact that describes the template TFX pipeline and can be executed on Vertex as a Vertex pipeline:

In [16]:
!tfx pipeline compile --engine vertex --pipeline_path kubeflow_v2_runner.py

CLI
Compiling pipeline
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:absl:tensorflow_ranking is not available: No module named 'tensorflow_ranking'
INFO:absl:tensorflow_text is not available: No module named 'tensorflow_text'
INFO:absl:tensorflow_decision_forests is not available: No module named 'tensorflow_decision_forests'
INFO:absl:struct2tensor is not available: No module named 'struct2tensor'
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.ty

You should now see a `pipeline.json` file in `PROJECT_DIR` (which should be the current working directory, since we changed into it earlier with `%cd`):

In [17]:
ls pipeline.json

pipeline.json
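
To check which components ended up in the compiled file, you can load it as JSON and list the task names. The sketch below assumes a layout of the form `{"pipelineSpec": {"root": {"dag": {"tasks": {...}}}}}`; this layout is an assumption that may differ between TFX versions, so the function returns an empty list when the keys are absent. The helper name `list_pipeline_tasks` is hypothetical.

```python
import json


def list_pipeline_tasks(path="pipeline.json"):
    """Return the sorted task names found in a compiled pipeline file.

    Assumed (not guaranteed) layout:
    {"pipelineSpec": {"root": {"dag": {"tasks": {...}}}}}
    """
    with open(path) as f:
        spec = json.load(f)
    tasks = (
        spec.get("pipelineSpec", {})
        .get("root", {})
        .get("dag", {})
        .get("tasks", {})
    )
    return sorted(tasks)


# Hypothetical usage from the project directory:
#   print(list_pipeline_tasks("pipeline.json"))
```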


To launch the execution of this pipeline on Vertex, we will use the `aiplatform` SDK:

In [18]:
aiplatform.init(project=GOOGLE_CLOUD_PROJECT, location=REGION)

pipeline = aiplatform.PipelineJob(
    display_name=PIPELINE_NAME,
    template_path="pipeline.json",
    enable_caching=True,
)

pipeline.run()

INFO:google.cloud.aiplatform.pipeline_jobs:Creating PipelineJob
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob created. Resource name: projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118100515
INFO:google.cloud.aiplatform.pipeline_jobs:To use this PipelineJob in another session:
INFO:google.cloud.aiplatform.pipeline_jobs:pipeline_job = aiplatform.PipelineJob.get('projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118100515')
INFO:google.cloud.aiplatform.pipeline_jobs:View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/tfx-guided-project-on-vertex-20221118100515?project=296524281444
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118100515 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/2

This pipeline is minimal and only comprises the `CsvExampleGen` component. 

In the next sections, we will add more and more components to this pipeline by uncommenting and modifying the TFX scaffold generated by `tfx template`.

You'll be able to see the pipeline running at: https://console.cloud.google.com/vertex-ai/pipelines
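
The log printed by `pipeline.run()` contains both the job's resource name and a direct link to the run. As a small sketch, the link can be rebuilt from the resource name by following the URL pattern visible in that log. Both helper names below are hypothetical, and the URL pattern is taken from the log output rather than from any documented guarantee.

```python
def job_id_from_resource_name(resource_name):
    """Extract the trailing job ID from 'projects/.../pipelineJobs/<job_id>'."""
    return resource_name.rsplit("/", 1)[-1]


def pipeline_run_url(job_id, region, project):
    """Build the console URL of a pipeline run, following the pattern in the log."""
    return (
        "https://console.cloud.google.com/vertex-ai/locations/"
        f"{region}/pipelines/runs/{job_id}?project={project}"
    )


# Values copied from the log output of the run above.
resource_name = (
    "projects/296524281444/locations/us-central1/"
    "pipelineJobs/tfx-guided-project-on-vertex-20221118100515"
)
print(
    pipeline_run_url(
        job_id_from_resource_name(resource_name), "us-central1", "296524281444"
    )
)
```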

## Step 5. Add components for data validation.

In this step, you will add components for data validation including `StatisticsGen`, `SchemaGen`, and `ExampleValidator`.
If you are interested in data validation, please see 
[Get started with TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started).

**Double-click to change directory to `pipeline` and double-click again to open** `pipeline.py`. 
Find and uncomment the 3 lines which add `StatisticsGen`, `SchemaGen`, and `ExampleValidator` to the pipeline.
(Tip: search for comments containing `TODO(step 5):`.) Make sure to save `pipeline.py` after you edit it.
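
If you prefer to locate these comments programmatically rather than by searching in the editor, a minimal sketch is shown below. The helper name `find_step_todos` is hypothetical; it only scans text for the `TODO(step N)` marker mentioned in the tip and does not modify the file.

```python
import re


def find_step_todos(source, step):
    """Return (line_number, text) pairs for lines mentioning 'TODO(step <step>)'."""
    pattern = re.compile(r"TODO\(step\s*%d\)" % step)
    return [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


# Hypothetical usage from the project directory:
#   with open("pipeline/pipeline.py") as f:
#       for number, text in find_step_todos(f.read(), step=5):
#           print(number, text)
```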

You now need to update the existing pipeline with the modified pipeline definition and trigger another run on Vertex (the cell above that runs the pipeline may need to be interrupted to allow the next two cells to execute):

In [21]:
!tfx pipeline compile --engine vertex --pipeline_path kubeflow_v2_runner.py

CLI
Compiling pipeline
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:absl:tensorflow_ranking is not available: No module named 'tensorflow_ranking'
INFO:absl:tensorflow_text is not available: No module named 'tensorflow_text'
INFO:absl:tensorflow_decision_forests is not available: No module named 'tensorflow_decision_forests'
INFO:absl:struct2tensor is not available: No module named 'struct2tensor'
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.ty

In [22]:
aiplatform.init(project=GOOGLE_CLOUD_PROJECT, location=REGION)

pipeline = aiplatform.PipelineJob(
    display_name=PIPELINE_NAME,
    template_path="pipeline.json",
    enable_caching=True,
)

pipeline.run()

INFO:google.cloud.aiplatform.pipeline_jobs:Creating PipelineJob
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob created. Resource name: projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118102431
INFO:google.cloud.aiplatform.pipeline_jobs:To use this PipelineJob in another session:
INFO:google.cloud.aiplatform.pipeline_jobs:pipeline_job = aiplatform.PipelineJob.get('projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118102431')
INFO:google.cloud.aiplatform.pipeline_jobs:View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/tfx-guided-project-on-vertex-20221118102431?project=296524281444
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118102431 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/2
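
Because `pipeline.run()` blocks the notebook until the job finishes, one option is to start the job without waiting for it. The sketch below assumes that `job` is an `aiplatform.PipelineJob` and relies on the `sync` flag of its `run` method; the wrapper `launch_without_blocking` is a hypothetical helper, and status messages may still be logged in the background.

```python
def launch_without_blocking(job):
    """Start `job` in the background so the notebook cell returns promptly.

    `job` is assumed to be an `aiplatform.PipelineJob`, whose `run` method
    accepts a `sync` flag.
    """
    job.run(sync=False)
    return job


# Hypothetical usage, reusing the names defined in the cells above:
#   pipeline = launch_without_blocking(
#       aiplatform.PipelineJob(
#           display_name=PIPELINE_NAME,
#           template_path="pipeline.json",
#           enable_caching=True,
#       )
#   )
```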

### Check pipeline outputs

You'll be able to see the pipeline running at: https://console.cloud.google.com/vertex-ai/pipelines

## Step 6. Add components for training

In this step, you will add components for training and model validation including `Transform`, `Trainer`, `Resolver`, `Evaluator`, and `Pusher`.

**Double-click to open** `pipeline.py`. Find and uncomment the 5 lines which add `Transform`, `Trainer`, `Resolver`, `Evaluator` and `Pusher` to the pipeline. (Tip: search for `TODO(step 6):`.)

You now need to update the existing pipeline with the modified pipeline definition and trigger another run on Vertex:

In [23]:
!tfx pipeline compile --engine vertex --pipeline_path kubeflow_v2_runner.py

CLI
Compiling pipeline
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:absl:tensorflow_ranking is not available: No module named 'tensorflow_ranking'
INFO:absl:tensorflow_text is not available: No module named 'tensorflow_text'
INFO:absl:tensorflow_decision_forests is not available: No module named 'tensorflow_decision_forests'
INFO:absl:struct2tensor is not available: No module named 'struct2tensor'
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.ty

In [24]:
aiplatform.init(project=GOOGLE_CLOUD_PROJECT, location=REGION)

pipeline = aiplatform.PipelineJob(
    display_name=PIPELINE_NAME,
    template_path="pipeline.json",
    enable_caching=True,
)

pipeline.run()

INFO:google.cloud.aiplatform.pipeline_jobs:Creating PipelineJob
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob created. Resource name: projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118102734
INFO:google.cloud.aiplatform.pipeline_jobs:To use this PipelineJob in another session:
INFO:google.cloud.aiplatform.pipeline_jobs:pipeline_job = aiplatform.PipelineJob.get('projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118102734')
INFO:google.cloud.aiplatform.pipeline_jobs:View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/tfx-guided-project-on-vertex-20221118102734?project=296524281444
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118102734 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/2

You'll be able to see the pipeline running at: https://console.cloud.google.com/vertex-ai/pipelines

## Step 7. Try BigQueryExampleGen

[BigQuery](https://cloud.google.com/bigquery) is a serverless, highly scalable, and cost-effective cloud data warehouse.
`BigQuery` can be used as a source for training examples in TFX. In this step, we will add `BigQueryExampleGen` to the pipeline.

**Double-click to open** `pipeline.py`. Comment out `CsvExampleGen` and uncomment the line which creates an instance of `BigQueryExampleGen`. You also need to uncomment the query argument of the `create_pipeline` function.

We need to specify which GCP project to use for `BigQuery`, and this is done by setting `--project` in `beam_pipeline_args` when creating a pipeline.
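
As a rough sketch of what such arguments look like, the helper below builds a list with the `--project` option and a `--temp_location` under the pipeline bucket. The helper name `bigquery_beam_args` and the `gs://<bucket>/tmp` location are assumptions made for illustration; the values actually used by the template are defined in `configs.py`.

```python
def bigquery_beam_args(project, bucket):
    """Build Beam arguments that point BigQuery reads at `project`.

    The `--temp_location` under `gs://<bucket>/tmp` is an assumption of this sketch.
    """
    return [
        f"--project={project}",
        f"--temp_location=gs://{bucket}/tmp",
    ]


# Placeholder project and bucket names.
print(bigquery_beam_args("my-project", "my-project-kubeflowpipelines-default"))
```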

**Double-click to open** `configs.py`. Uncomment the definitions of `GOOGLE_CLOUD_REGION`, `BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS` and `BIG_QUERY_QUERY`. You should replace the region value in this file with the correct value for your GCP project.

**Note:** You MUST set your GCP region in the `configs.py` file before proceeding.

**Change directory one level up.** Click the name of the directory above the file list. The name of the directory is the name of the pipeline, which is `tfx-guided-project-on-vertex` if you didn't change it.

**Double-click to open** `kubeflow_v2_runner.py`. Uncomment two arguments, `query` and `beam_pipeline_args`, for the `create_pipeline` function.

Now the pipeline is ready to use `BigQuery` as an example source. Update the pipeline as before and create a new execution run as we did in steps 5 and 6.

In [25]:
!tfx pipeline compile --engine vertex --pipeline_path kubeflow_v2_runner.py

CLI
Compiling pipeline
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:absl:tensorflow_ranking is not available: No module named 'tensorflow_ranking'
INFO:absl:tensorflow_text is not available: No module named 'tensorflow_text'
INFO:absl:tensorflow_decision_forests is not available: No module named 'tensorflow_decision_forests'
INFO:absl:struct2tensor is not available: No module named 'struct2tensor'
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.typehints.native_type_compatibility:Using Any for unsupported type: typing.MutableMapping[str, typing.Any]
INFO:apache_beam.ty

In [26]:
aiplatform.init(project=GOOGLE_CLOUD_PROJECT, location=REGION)

pipeline = aiplatform.PipelineJob(
    display_name=PIPELINE_NAME,
    template_path="pipeline.json",
    enable_caching=True,
)

pipeline.run()

INFO:google.cloud.aiplatform.pipeline_jobs:Creating PipelineJob
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob created. Resource name: projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118104601
INFO:google.cloud.aiplatform.pipeline_jobs:To use this PipelineJob in another session:
INFO:google.cloud.aiplatform.pipeline_jobs:pipeline_job = aiplatform.PipelineJob.get('projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118104601')
INFO:google.cloud.aiplatform.pipeline_jobs:View Pipeline Job:
https://console.cloud.google.com/vertex-ai/locations/us-central1/pipelines/runs/tfx-guided-project-on-vertex-20221118104601?project=296524281444
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/296524281444/locations/us-central1/pipelineJobs/tfx-guided-project-on-vertex-20221118104601 current state:
PipelineState.PIPELINE_STATE_RUNNING
INFO:google.cloud.aiplatform.pipeline_jobs:PipelineJob projects/2

You'll be able to see the pipeline running at: https://console.cloud.google.com/vertex-ai/pipelines

# (Optional) Customize the pipeline to your data 

We made a TFX pipeline for a model using the Chicago Taxi dataset. Now it's time to put your data into the pipeline.



Your data can be stored anywhere your pipeline can access, including GCS or BigQuery. You will need to modify the pipeline definition to access your data.

Review the steps above for the full details of what needs to be customized. Below is a short summary of these steps:

1. If your data is stored in files, modify the `DATA_PATH` in `kubeflow_v2_runner.py` and set it to the location of your files. If your data is stored in BigQuery, modify `BIG_QUERY_QUERY` in `pipeline/configs.py` to correctly query for your data.
1. Add features in `models/features.py`
1. Modify `models/preprocessing.py` to [transform input data for training](https://www.tensorflow.org/tfx/guide/transform).
1. Modify `models/keras_model/model.py` and `models/keras_model/constants.py` to [describe your ML model](https://www.tensorflow.org/tfx/guide/trainer).

We suggest that you take a small sample of the data, and select columns that are easy to preprocess for the sake of time. Here are a few pointers to get inspiration:

* [A small slice](https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/?date_received_max=2020-11-26&date_received_min=2020-08-26&field=all&format=csv&no_aggs=true&size=119459) of the [Consumer Complaint Database](https://www.consumerfinance.gov/data-research/consumer-complaints/). (You'll probably still need to take only a subset of the rows and columns for the sake of fast model development.)
* The [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), which has a number of very interesting datasets.

The easiest way is to create a small CSV file containing your dataset in JupyterLab, and then upload it to a Cloud Storage bucket. This way you can simply use `CsvExampleGen` to connect to your dataset.
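
A minimal sketch of taking such a sample with the standard library is shown below: it keeps only the chosen columns and at most `max_rows` rows. The helper name `sample_csv` and the file names in the usage comment are hypothetical.

```python
import csv


def sample_csv(src_path, dst_path, columns, max_rows):
    """Copy at most `max_rows` rows and only `columns` from one CSV file to another.

    Returns the number of data rows written (the header is not counted).
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        # extrasaction="ignore" drops every column that is not listed in `columns`.
        writer = csv.DictWriter(dst, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        written = 0
        for row in reader:
            if written >= max_rows:
                break
            writer.writerow(row)
            written += 1
    return written


# Hypothetical usage:
#   sample_csv("complaints.csv", "data/data.csv", ["Product", "State"], max_rows=5000)
```

The resulting file can then be uploaded with `gsutil cp`, as was done for the taxi sample data in step 4.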

## License

Copyright 2022 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.