# Structured data prediction using Vertex AI Platform


## Learning Objectives

1. Create a BigQuery Dataset and Google Cloud Storage Bucket 
2. Export from BigQuery to CSVs in GCS
3. Train on Vertex AI
4. Deploy trained model

## Introduction

In this notebook, you train, evaluate, and deploy a machine learning model to predict a baby's weight.



In [1]:
!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst

In [2]:
!pip install --user "google-cloud-bigquery>=2.26.0"


In [3]:
!pip install -U google-cloud-aiplatform "shapely>=2"

Collecting google-cloud-aiplatform
  Downloading google_cloud_aiplatform-1.124.0-py2.py3-none-any.whl.metadata (45 kB)
Downloading google_cloud_aiplatform-1.124.0-py2.py3-none-any.whl (8.1 MB)
Installing collected packages: google-cloud-aiplatform
  Attempting uninstall: google-cloud-aiplatform
    Found existing installation: google-cloud-aiplatform 1.120.0
    Uninstalling google-cloud-aiplatform-1.120.0:
      Successfully uninstalled google-cloud-aiplatform-1.120.0
Successfully installed google-cloud-aiplatform-1.124.0


**Note**: Restart your kernel to use updated packages.

Kindly ignore the deprecation warnings and incompatibility errors related to google-cloud-storage.

## Set up environment variables and load necessary libraries

Set environment variables so that we can use them throughout the entire notebook. Replace the bucket name and project ID with your own, and change the region if needed.

In [4]:
# change these to try this notebook out
BUCKET = 'testbucket1_unique' # Replace with your bucket name
PROJECT = 'qwiklabs-gcp-03-652967eb1d9b' # Replace with your project-id
REGION = 'us-central1'

In [5]:
import os
from google.cloud import bigquery

In [6]:
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION
os.environ["TFVERSION"] = "2.11"
os.environ["PYTHONVERSION"] = "3.7"

In [7]:
%%bash
export PROJECT=$(gcloud config list project --format "value(core.project)")
echo "Your current GCP Project Name is: "$PROJECT

Your current GCP Project Name is: qwiklabs-gcp-03-652967eb1d9b


## The source dataset

Our dataset is hosted in [BigQuery](https://cloud.google.com/bigquery/). The CDC's Natality data has details on US births from 1969 to 2008 and is a publicly available dataset, meaning anyone with a GCP account has access. Click [here](https://console.cloud.google.com/bigquery?project=bigquery-public-data&p=publicdata&d=samples&t=natality&page=table) to access the dataset.

The natality dataset is relatively large at almost 138 million rows and 31 columns, but simple to understand. `weight_pounds` is the target, the continuous value we’ll train a model to predict.

## Create a BigQuery Dataset and Google Cloud Storage Bucket 

A BigQuery dataset is a container for tables, views, and models built with BigQuery ML. Let's create one called __babyweight__. We'll do the same for a GCS bucket for our project too.

In [8]:
%%bash

# Create a BigQuery dataset for babyweight if it doesn't exist
datasetexists=$(bq ls -d | grep -w babyweight)

if [ -n "$datasetexists" ]; then
    echo -e "BigQuery dataset already exists, let's not recreate it."

else
    echo "Creating BigQuery dataset titled: babyweight"
    
    bq --location=US mk --dataset \
        --description "Babyweight" \
        $PROJECT:babyweight
    echo "Here are your current datasets:"
    bq ls
fi
    
# Create GCS bucket if it doesn't exist already
exists=$(gsutil ls -d | grep -w gs://${BUCKET}/)

if [ -n "$exists" ]; then
    echo -e "Bucket exists, let's not recreate it."
    
else
    echo "Creating a new GCS bucket."
    gsutil mb -l ${REGION} gs://${BUCKET}
    echo "Here are your current buckets:"
    gsutil ls
fi

Creating BigQuery dataset titled: babyweight
Dataset 'qwiklabs-gcp-03-652967eb1d9b:babyweight' successfully created.
Here are your current datasets:
  datasetId   
 ------------ 
  babyweight  
Bucket exists, let's not recreate it.


## Create the training and evaluation data tables

Since there is already a publicly available dataset, we can create the training and evaluation data tables directly from this raw input data. First, we create a subset of the data that limits the columns to `weight_pounds`, `is_male`, `mother_age`, `plurality`, and `gestation_weeks`, applies some simple filtering, and adds a column to hash on for repeatable splitting.

* Note: The dataset in the `CREATE TABLE` statements below is the one created previously, i.e. `babyweight`.

### Preprocess and filter dataset

We have some preprocessing and filtering we would like to do to get our data in the right format for training.

Preprocessing:
* Cast `is_male` from `BOOL` to `STRING`
* Cast `plurality` from `INTEGER` to `STRING` where `[1, 2, 3, 4, 5]` becomes `["Single(1)", "Twins(2)", "Triplets(3)", "Quadruplets(4)", "Quintuplets(5)"]`
* Add `hashmonth`, a hash of `year` and `month`

Filtering:
* Only want data for years later than `2000`
* Only want baby weights greater than `0`
* Only want mothers whose age is greater than `0`
* Only want plurality to be greater than `0`
* Only want the number of weeks of gestation to be greater than `0`
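
The rules listed above can be mirrored in plain Python to make them concrete. This is only a sketch on a few hypothetical rows: the helper names (`is_positive`, `keep_row`, `preprocess_row`) and the sample values are assumptions, the `hashmonth` column is left out, and the BigQuery query in the next cell is what actually builds the table.

```python
# Mapping used for the plurality cast; values outside 1..5 become None,
# just as a CASE expression without ELSE yields NULL.
PLURALITY_LABELS = {
    1: "Single(1)",
    2: "Twins(2)",
    3: "Triplets(3)",
    4: "Quadruplets(4)",
    5: "Quintuplets(5)",
}

NUMERIC_FILTER_COLUMNS = ("weight_pounds", "mother_age", "plurality", "gestation_weeks")


def is_positive(value):
    """True only for non-missing values greater than zero."""
    return value is not None and value > 0


def keep_row(row):
    """Apply the filters: year later than 2000 and all numeric columns positive."""
    return row["year"] > 2000 and all(is_positive(row[c]) for c in NUMERIC_FILTER_COLUMNS)


def preprocess_row(row):
    """Cast is_male and plurality to strings and keep only the model columns."""
    return {
        "weight_pounds": row["weight_pounds"],
        "is_male": str(row["is_male"]).lower(),  # True -> "true", False -> "false"
        "mother_age": row["mother_age"],
        "plurality": PLURALITY_LABELS.get(row["plurality"]),
        "gestation_weeks": row["gestation_weeks"],
    }


# Hypothetical rows for illustration only.
rows = [
    {"year": 2005, "weight_pounds": 7.5, "is_male": True, "mother_age": 28,
     "plurality": 1, "gestation_weeks": 39},
    {"year": 1999, "weight_pounds": 6.9, "is_male": False, "mother_age": 31,
     "plurality": 2, "gestation_weeks": 37},   # dropped: year is not later than 2000
    {"year": 2003, "weight_pounds": None, "is_male": False, "mother_age": 24,
     "plurality": 1, "gestation_weeks": 40},   # dropped: missing weight
]

clean = [preprocess_row(r) for r in rows if keep_row(r)]
print(clean)
```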

In [9]:
%%bigquery
CREATE OR REPLACE TABLE
    babyweight.babyweight_data AS
SELECT
    weight_pounds,
    CAST(is_male AS STRING) AS is_male,
    mother_age,
    CASE
        WHEN plurality = 1 THEN "Single(1)"
        WHEN plurality = 2 THEN "Twins(2)"
        WHEN plurality = 3 THEN "Triplets(3)"
        WHEN plurality = 4 THEN "Quadruplets(4)"
        WHEN plurality = 5 THEN "Quintuplets(5)"
    END AS plurality,
    gestation_weeks,
    FARM_FINGERPRINT(
        CONCAT(
            CAST(year AS STRING),
            CAST(month AS STRING)
        )
    ) AS hashmonth
FROM
    publicdata.samples.natality
WHERE
    year > 2000
    AND weight_pounds > 0
    AND mother_age > 0
    AND plurality > 0
    AND gestation_weeks > 0

### Augment dataset to simulate missing data

Now we want to augment our dataset with our simulated babyweight data by setting all gender information to `Unknown` and setting plurality of all non-single births to `Multiple(2+)`.
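
To make the augmentation concrete, here is a minimal Python sketch of the same idea on hypothetical rows: every original row is kept, and a second copy is added with `is_male` masked and non-single pluralities collapsed. The function name `augment` and the sample rows are assumptions for illustration; the BigQuery query in the next cell does this with `UNION ALL`, so the augmented table has twice as many rows as the original.

```python
def augment(rows):
    """Return the original rows followed by one masked copy of each row."""
    masked = [
        {
            **row,
            "is_male": "Unknown",
            # Single births keep their label; everything else becomes "Multiple(2+)".
            "plurality": row["plurality"] if row["plurality"] == "Single(1)" else "Multiple(2+)",
        }
        for row in rows
    ]
    return rows + masked


# Hypothetical preprocessed rows for illustration only.
rows = [
    {"weight_pounds": 7.5, "is_male": "true", "mother_age": 28,
     "plurality": "Single(1)", "gestation_weeks": 39},
    {"weight_pounds": 5.1, "is_male": "false", "mother_age": 33,
     "plurality": "Twins(2)", "gestation_weeks": 35},
]

augmented = augment(rows)
for row in augmented:
    print(row["is_male"], row["plurality"])
```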

In [10]:
%%bigquery
CREATE OR REPLACE TABLE
    babyweight.babyweight_augmented_data AS
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks,
    hashmonth
FROM
    babyweight.babyweight_data
UNION ALL
SELECT
    weight_pounds,
    "Unknown" AS is_male,
    mother_age,
    CASE
        WHEN plurality = "Single(1)" THEN plurality
        ELSE "Multiple(2+)"
    END AS plurality,
    gestation_weeks,
    hashmonth
FROM
    babyweight.babyweight_data

### Split augmented dataset into train and eval sets

Using `hashmonth`, apply a modulo to get approximately a 75/25 train-eval split.
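
The sketch below illustrates why hashing on year and month gives a repeatable split: every row from the same month lands in the same bucket, and three of the four buckets go to training. It is an illustration under stated assumptions: `hash_month` uses SHA-256 from the standard library as a stand-in for BigQuery's `FARM_FINGERPRINT`, so the bucket a given month falls into will differ from the real tables.

```python
import hashlib


def hash_month(year, month):
    """Stand-in for FARM_FINGERPRINT(CONCAT(year, month)): a signed 64-bit hash."""
    digest = hashlib.sha256(f"{year}{month}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def split_for(year, month):
    """Assign a year-month to 'train' (buckets 0 to 2) or 'eval' (bucket 3)."""
    # BigQuery's MOD keeps the sign of the dividend, so ABS(MOD(h, 4)) equals abs(h) % 4.
    bucket = abs(hash_month(year, month)) % 4
    return "train" if bucket < 3 else "eval"


# Years later than 2000 in a dataset that ends in 2008.
assignments = {
    (year, month): split_for(year, month)
    for year in range(2001, 2009)
    for month in range(1, 13)
}
train_months = sum(1 for split in assignments.values() if split == "train")
print(f"{train_months} of {len(assignments)} months go to training")
```

Because whole months are assigned to one side, the split is only approximately 75/25, and no month contributes rows to both the training and evaluation sets.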

#### Split augmented dataset into train dataset

In [11]:
%%bigquery
CREATE OR REPLACE TABLE
    babyweight.babyweight_data_train AS
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks
FROM
    babyweight.babyweight_augmented_data
WHERE
    ABS(MOD(hashmonth, 4)) < 3

#### Split augmented dataset into eval dataset

In [12]:
%%bigquery
CREATE OR REPLACE TABLE
    babyweight.babyweight_data_eval AS
SELECT
    weight_pounds,
    is_male,
    mother_age,
    plurality,
    gestation_weeks
FROM
    babyweight.babyweight_augmented_data
WHERE
    ABS(MOD(hashmonth, 4)) = 3

## Verify table creation

Verify that you created the dataset and training data table.

In [13]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_data_train
LIMIT 0

Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks


In [14]:
%%bigquery
-- LIMIT 0 is a free query; this allows us to check that the table exists.
SELECT * FROM babyweight.babyweight_data_eval
LIMIT 0

Unnamed: 0,weight_pounds,is_male,mother_age,plurality,gestation_weeks


## Export from BigQuery to CSVs in GCS

Use the BigQuery Python API to export our train and eval tables to Google Cloud Storage in CSV format, to be used later for TensorFlow/Keras training. We export from the dataset we've been using above and repeat the process for both the training and evaluation data.

In [15]:
# Construct a BigQuery client object.
client = bigquery.Client()

dataset_name = "babyweight"

# Create dataset reference object
# (bigquery.DatasetReference replaces the deprecated client.dataset helper)
dataset_ref = bigquery.DatasetReference(client.project, dataset_name)

# Export both train and eval tables
for step in ["train", "eval"]:
    # Build the GCS URI directly; os.path.join is meant for local file paths.
    destination_uri = "gs://{}/{}/data/{}*.csv".format(
        BUCKET, dataset_name, step)
    table_name = "babyweight_data_{}".format(step)
    table_ref = dataset_ref.table(table_name)
    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        # Location must match that of the source table.
        location="US",
    )  # API request
    extract_job.result()  # Waits for job to complete.

    print("Exported {}:{}.{} to {}".format(
        client.project, dataset_name, table_name, destination_uri))

Exported qwiklabs-gcp-03-652967eb1d9b:babyweight.babyweight_data_train to gs://testbucket1_unique/babyweight/data/train*.csv
Exported qwiklabs-gcp-03-652967eb1d9b:babyweight.babyweight_data_eval to gs://testbucket1_unique/babyweight/data/eval*.csv
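
The `*` in the destination URI lets BigQuery split a large export across several files. Judging by the file names in the bucket listing further down (for example `eval000000000000.csv`), the wildcard is replaced by a 12-digit, zero-padded shard index. A small sketch of that naming scheme, where `shard_uri` is a hypothetical helper and not part of the BigQuery client:

```python
def shard_uri(pattern, shard_index):
    """Expand a single-wildcard export URI into the name of one shard."""
    if pattern.count("*") != 1:
        raise ValueError("expected exactly one '*' in the URI pattern")
    return pattern.replace("*", "{:012d}".format(shard_index))


print(shard_uri("gs://testbucket1_unique/babyweight/data/train*.csv", 0))
# gs://testbucket1_unique/babyweight/data/train000000000000.csv
```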


## Verify CSV creation

Verify that we correctly created the CSV files in our bucket.

In [16]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/data/*.csv

gs://testbucket1_unique/babyweight/data/eval000000000000.csv
gs://testbucket1_unique/babyweight/data/eval000000000001.csv
gs://testbucket1_unique/babyweight/data/eval000000000002.csv
gs://testbucket1_unique/babyweight/data/eval000000000003.csv
gs://testbucket1_unique/babyweight/data/eval000000000004.csv
gs://testbucket1_unique/babyweight/data/eval000000000005.csv
gs://testbucket1_unique/babyweight/data/eval000000000006.csv
gs://testbucket1_unique/babyweight/data/eval000000000007.csv
gs://testbucket1_unique/babyweight/data/eval000000000008.csv
gs://testbucket1_unique/babyweight/data/eval000000000009.csv
gs://testbucket1_unique/babyweight/data/eval000000000010.csv
gs://testbucket1_unique/babyweight/data/eval000000000011.csv
gs://testbucket1_unique/babyweight/data/eval000000000012.csv
gs://testbucket1_unique/babyweight/data/eval000000000013.csv
gs://testbucket1_unique/babyweight/data/eval000000000014.csv
gs://testbucket1_unique/babyweight/data/eval000000000015.csv
gs://testbucket1_unique/

## Check data exists

Verify that the CSV files we'll use for training and evaluation exist.

In [17]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/data/*000000000000.csv

gs://testbucket1_unique/babyweight/data/eval000000000000.csv
gs://testbucket1_unique/babyweight/data/train000000000000.csv


In [19]:
%%bash

OUTDIR=gs://${BUCKET}/babyweight/trained_model
JOBNAME=babyweight_$(date -u +%y%m%d_%H%M%S)

gcloud ai custom-jobs create \
  --region=${REGION} \
  --display-name=${JOBNAME} \
  --worker-pool-spec=machine-type=n1-standard-8,executor-image-uri=us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-11:latest,local-package-path=babyweight/,python-module=trainer.task \
  --args="--train_data_path=gs://${BUCKET}/babyweight/data/train*.csv","--eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv","--output_dir=${OUTDIR}","--num_epochs=10","--train_examples=10000","--eval_steps=100","--batch_size=32","--nembeds=8"

Using endpoint [https://us-central1-aiplatform.googleapis.com/]


#0 building with "default" instance using docker driver

#1 [internal] load build definition from Dockerfile
#1 transferring dockerfile: 664B done
#1 DONE 0.0s

#2 [internal] load metadata for us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-11:latest
#2 DONE 0.3s

#3 [internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.0s

#4 [internal] load build context
#4 transferring context: 14.50kB done
#4 DONE 0.0s

#5 [1/7] FROM us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-11:latest@sha256:07e7ea696bf78f9e51c481f74c4cee100c00870cad74b04de849aea7e66f8c51
#5 resolve us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-11:latest@sha256:07e7ea696bf78f9e51c481f74c4cee100c00870cad74b04de849aea7e66f8c51 0.0s done
#5 sha256:2c270b121e8185d98ca3a832f61981f1a8529ef7d14e105550e06bb11c649f44 12.93kB / 12.93kB done
#5 sha256:57c139bbda7eb92a286d974aa8fef81acf1a8cbc742242619252c13b196ab499 1.05MB / 29.55MB 0.1s
#5 sha256:6289371d2bb385802456c41320c8f9d1f2d2c92345a822581eb3709c500a7ade 0B / 327B 0


A custom container image is built locally.


The push refers to repository [gcr.io/qwiklabs-gcp-03-652967eb1d9b/cloudai-autogenerated/babyweight_251101_174255]

Custom container image [gcr.io/qwiklabs-gcp-03-652967eb1d9b/cloudai-autogenerated/babyweight_251101_174255:20251101.17.42.56.905319] is created for your custom job.

CustomJob [projects/1085470363887/locations/us-central1/customJobs/3753837099292295168] is submitted successfully.

Your job is still active. You may view the status of your job with the command

  $ gcloud ai custom-jobs describe projects/1085470363887/locations/us-central1/customJobs/3753837099292295168

or continue streaming the logs with the command

  $ gcloud ai custom-jobs stream-logs projects/1085470363887/locations/us-central1/customJobs/3753837099292295168


In [21]:
%%bash
gcloud ai custom-jobs stream-logs projects/1085470363887/locations/us-central1/customJobs/3753837099292295168

Using endpoint [https://us-central1-aiplatform.googleapis.com/]


INFO	2025-11-01 17:50:59 +0000	service	Waiting for job to be provisioned.
INFO	2025-11-01 17:50:59 +0000	service	Vertex AI is provisioning job running framework. First time usage might take couple of minutes, and subsequent runs can be much faster.



Command killed by keyboard interrupt



Process was interrupted.


CalledProcessError: Command 'b'gcloud ai custom-jobs stream-logs projects/1085470363887/locations/us-central1/customJobs/3753837099292295168\n'' died with <Signals.SIGINT: 2>.

The training job should complete within 15 to 20 minutes. You do not need to wait for it to finish before moving forward in the notebook, but the later steps require a trained model.

In [22]:
%%bash
gcloud ai custom-jobs describe projects/1085470363887/locations/us-central1/customJobs/3753837099292295168

Using endpoint [https://us-central1-aiplatform.googleapis.com/]


createTime: '2025-11-01T17:48:22.686381Z'
displayName: babyweight_251101_174255
endTime: '2025-11-01T18:13:04Z'
jobSpec:
  workerPoolSpecs:
  - containerSpec:
      args:
      - --train_data_path=gs://testbucket1_unique/babyweight/data/train*.csv
      - --eval_data_path=gs://testbucket1_unique/babyweight/data/eval*.csv
      - --output_dir=gs://testbucket1_unique/babyweight/trained_model
      - --num_epochs=10
      - --train_examples=10000
      - --eval_steps=100
      - --batch_size=32
      - --nembeds=8
      imageUri: gcr.io/qwiklabs-gcp-03-652967eb1d9b/cloudai-autogenerated/babyweight_251101_174255:20251101.17.42.56.905319
    diskSpec:
      bootDiskSizeGb: 100
      bootDiskType: pd-ssd
    machineSpec:
      machineType: n1-standard-8
    replicaCount: '1'
name: projects/1085470363887/locations/us-central1/customJobs/3753837099292295168
startTime: '2025-11-01T18:00:29Z'
state: JOB_STATE_SUCCEEDED
updateTime: '2025-11-01T18:13:28.450541Z'
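
The `createTime`, `startTime`, and `endTime` fields in the job description show how long the job waited to be provisioned and how long it actually ran. A minimal sketch using only the standard library, with the timestamps copied from the output above (the `parse_ts` helper and the `job` dictionary are illustrative, not part of any Google Cloud client):

```python
from datetime import datetime


def parse_ts(value):
    """Parse a UTC timestamp such as '2025-11-01T18:00:29Z' into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# Values copied from the `gcloud ai custom-jobs describe` output.
job = {
    "createTime": "2025-11-01T17:48:22.686381Z",
    "startTime": "2025-11-01T18:00:29Z",
    "endTime": "2025-11-01T18:13:04Z",
    "state": "JOB_STATE_SUCCEEDED",
}

queued = parse_ts(job["startTime"]) - parse_ts(job["createTime"])
running = parse_ts(job["endTime"]) - parse_ts(job["startTime"])

print(f"queued for {queued}")   # queued for 0:12:06.313619
print(f"ran for {running}")     # ran for 0:12:35
```

For this run, roughly twelve minutes of provisioning preceded about twelve and a half minutes of training.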


## Check our trained model files

Let's check the directory structure of the trained model output in the folder we exported to. We'll want to deploy the `saved_model.pb` within the timestamped directory as well as the variable values in the `variables` folder. Therefore, we need the path of the timestamped directory so that Vertex AI's model deployment service can find everything within it.

In [23]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/trained_model/

gs://testbucket1_unique/babyweight/trained_model/
gs://testbucket1_unique/babyweight/trained_model/20251101181247/
gs://testbucket1_unique/babyweight/trained_model/checkpoints/


In [24]:
%%bash
MODEL_LOCATION=$(gsutil ls -ld -- gs://${BUCKET}/babyweight/trained_model/2* \
                 | tail -1)
gsutil ls ${MODEL_LOCATION}

gs://testbucket1_unique/babyweight/trained_model/20251101181247/
gs://testbucket1_unique/babyweight/trained_model/20251101181247/fingerprint.pb
gs://testbucket1_unique/babyweight/trained_model/20251101181247/saved_model.pb
gs://testbucket1_unique/babyweight/trained_model/20251101181247/assets/
gs://testbucket1_unique/babyweight/trained_model/20251101181247/variables/
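
The shell cells pick the timestamped directory with a `2*` glob followed by `tail -1`. The same selection can be sketched in Python on a list of paths, which makes the rule explicit: keep only directories whose name is all digits and take the largest timestamp. The helper `latest_model_dir` and the extra older directory in the sample listing are hypothetical.

```python
def latest_model_dir(listing):
    """Return the newest timestamped export directory from a gsutil-style listing."""
    def dir_name(path):
        return path.rstrip("/").rsplit("/", 1)[-1]

    stamped = [path for path in listing if dir_name(path).isdigit()]
    if not stamped:
        raise ValueError("no timestamped export directory found")
    # Compare timestamps numerically so the result does not depend on listing order.
    return max(stamped, key=lambda path: int(dir_name(path)))


listing = [
    "gs://testbucket1_unique/babyweight/trained_model/",
    "gs://testbucket1_unique/babyweight/trained_model/20251101181247/",
    "gs://testbucket1_unique/babyweight/trained_model/20251031090000/",  # hypothetical older export
    "gs://testbucket1_unique/babyweight/trained_model/checkpoints/",
]

print(latest_model_dir(listing))
# gs://testbucket1_unique/babyweight/trained_model/20251101181247/
```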


## Deploy trained model

Deploying the trained model as a REST web service takes a few gcloud calls: upload the model, create an endpoint, and deploy the model to the endpoint.

In [26]:
%%bash
MODEL_NAME="babyweight"
MODEL_VERSION="v1"
MODEL_DIR=$(gsutil ls -d gs://${BUCKET}/babyweight/trained_model/2* | tail -1)

# Create a model resource in Vertex AI
gcloud ai models upload \
  --region=${REGION} \
  --display-name=${MODEL_NAME} \
  --artifact-uri=${MODEL_DIR} \
  --container-image-uri=us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-11:latest

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
Waiting for operation [4224736763869921280]...
.....................................done.


In [None]:
%%bash

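ENDPOINT_NAME="babyweight-endpoint"

# Create endpoint
gcloud ai endpoints create --region=${REGION} --display-name=${ENDPOINT_NAME}

# Deploy model
# Replace ENDPOINT_ID and MODEL_ID with the IDs shown by
# `gcloud ai endpoints list --region=${REGION}` and
# `gcloud ai models list --region=${REGION}`.
gcloud ai endpoints deploy-model ENDPOINT_ID \
  --region=${REGION} \
  --model=MODEL_ID \
  --display-name=babyweight \
  --machine-type=n1-standard-2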

Using endpoint [https://us-central1-aiplatform.googleapis.com/]
Waiting for operation [8194659835397013504]...
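
Once the deployment finishes, the endpoint can be queried from Python. The sketch below is written under assumptions that this notebook does not confirm: it assumes the exported model's serving signature accepts the four training features by name (`is_male`, `mother_age`, `plurality`, `gestation_weeks`), and `make_instance` and `predict` are hypothetical helpers. `predict` needs valid credentials and a deployed model, so only the payload builder runs offline.

```python
def make_instance(is_male, mother_age, plurality, gestation_weeks):
    """Build one prediction instance.

    Assumption: the serving inputs use the same names as the training columns.
    """
    return {
        "is_male": is_male,
        "mother_age": mother_age,
        "plurality": plurality,
        "gestation_weeks": gestation_weeks,
    }


def predict(endpoint_name, instances, project, location):
    """Send instances to a deployed Vertex AI endpoint and return its predictions."""
    # Imported lazily so that make_instance can be used without the SDK installed.
    from google.cloud import aiplatform

    aiplatform.init(project=project, location=location)
    endpoint = aiplatform.Endpoint(endpoint_name)
    return endpoint.predict(instances=instances).predictions


# Hypothetical example values.
instances = [
    make_instance("true", 26.0, "Single(1)", 39.0),
    make_instance("Unknown", 29.0, "Multiple(2+)", 37.0),
]
print(instances)
```

Here `endpoint_name` would be the ID or full resource name of the endpoint created above, for example `predict(ENDPOINT_ID, instances, PROJECT, REGION)` with the real ID substituted.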


Copyright 2021 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.