## Purpose
This project demonstrates how to use BigQuery object tables in a Vertex AI custom training job.

### Benefits
- Efficient access to unstructured data: Object tables provide a read-only metadata index over objects in Cloud Storage, so images and other files can be listed and filtered with SQL without copying them into BigQuery.
- Simplified data management: Object tables can be created and managed directly in BigQuery, reducing the need for separate data pipelines.
- Scalable training data: Because the objects stay in Cloud Storage, you can train models on larger datasets, which can lead to improved model performance.

### Prerequisites
- A Google Cloud project
- A BigQuery dataset
- A Vertex AI custom training job
- A `.env` file in the root directory that defines:
  - BUCKET_NAME: the target bucket that the sample images are uploaded to
  - DATASET_ID
  - TABLE_ID
  - PROJECT_ID
  - REGION
  - CONNECTION_ID
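
A missing setting is easier to catch before any cell runs. The following is a minimal sketch, not part of the original notebook: the helper name `missing_env_vars` is hypothetical, and it only assumes the six variable names listed above.

```python
import os

# Variable names taken from the prerequisites list above.
REQUIRED_VARS = (
    "BUCKET_NAME",
    "DATASET_ID",
    "TABLE_ID",
    "PROJECT_ID",
    "REGION",
    "CONNECTION_ID",
)


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with an in-memory mapping instead of the real environment.
sample_env = {"BUCKET_NAME": "my-bucket", "PROJECT_ID": "my-project", "REGION": ""}
print(missing_env_vars(sample_env))
```

In the notebook, calling `missing_env_vars()` right after `load_dotenv()` would report any variables that the `.env` file does not define.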

### Setup
1. Create a BigQuery object table.
2. Configure your Vertex AI custom training job to use the object table.
3. Train your model.

### Expected Outcomes
After completing this project, you will be able to:
- Create and use BigQuery object tables.
- Configure Vertex AI custom training jobs to use object tables.
- Train models on large datasets using object tables.

In [57]:
# Before creating a training job, prepare the sample images and create an object table.

# !pip install tensorflow
# !pip install tensorflow-datasets
# !pip install google-cloud-storage google-cloud-bigquery google-cloud-aiplatform
# !pip install python-dotenv
# !pip install fiftyone  # optional
# !pip install requests
# !pip install db-dtypes

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting db-dtypes
  Downloading db_dtypes-1.2.0-py2.py3-none-any.whl (14 kB)
Collecting pyarrow>=3.0.0
  Downloading pyarrow-15.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.5 MB)
     |████████████████████████████████| 38.5 MB 11.9 MB/s eta 0:00:01
Installing collected packages: pyarrow, db-dtypes
Successfully installed db-dtypes-1.2.0 pyarrow-15.0.2


In [1]:
import tensorflow as tf
print(tf.__version__)

2024-03-27 08:42:32.105190: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-27 08:42:32.209392: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-27 08:42:32.210647: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


2.13.1


In [19]:
# ! git clone https://github.com/EliSchwartz/imagenet-sample-images.git

Cloning into 'imagenet-sample-images'...
remote: Enumerating objects: 1012, done.
remote: Counting objects: 100% (10/10), done.
remote: Compressing objects: 100% (8/8), done.
remote: Total 1012 (delta 3), reused 5 (delta 2), pack-reused 1002
Receiving objects: 100% (1012/1012), 103.84 MiB | 5.67 MiB/s, done.
Resolving deltas: 100% (3/3), done.


In [20]:
import os

from dotenv import load_dotenv
from google.cloud import storage

load_dotenv()

# Configure Google Cloud Storage
client = storage.Client()
bucket_name = os.environ.get("BUCKET_NAME")
bucket = client.get_bucket(bucket_name)

# Upload the cloned ImageNet sample images to the bucket
image_dir = "imagenet-sample-images"
for filename in os.listdir(image_dir):
  # Skip anything that is not a sample image (README, .git, and so on)
  if not filename.endswith(".JPEG"):
    continue
  blob = bucket.blob(filename)
  blob.upload_from_filename(os.path.join(image_dir, filename))



## Metadata in Machine Learning

Metadata is data that describes other data. In machine learning, metadata can be used to describe the features of a dataset, the labels of the data points, and the performance of a model.

Metadata can be provided in a variety of formats, including:

* Separate text files
* JSON files
* XML files
* Databases

In this example, the metadata takes the form of image descriptions embedded in the file names. It can be extracted by splitting each file name on the underscore delimiter.
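
The split can be sketched as follows. This is an illustrative helper, not code from the original notebook: the name `parse_image_name` is hypothetical, the sample file name is made up, and the sketch assumes names of the form `<id>_<description>.JPEG`.

```python
import os


def parse_image_name(path):
    """Split '<id>_<description>.JPEG' into (id, description) (hypothetical helper)."""
    # Drop any directory or bucket prefix, then the file extension.
    stem, _ext = os.path.splitext(os.path.basename(path))
    # Everything before the first underscore is the identifier.
    identifier, _, description = stem.partition("_")
    return identifier, description.replace("_", " ")


# Made-up file name that follows the assumed pattern.
print(parse_image_name("gs://your-bucket/n00000000_example_name.JPEG"))
```

Splitting only on the first underscore keeps multi-word descriptions intact.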

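However, a BigQuery object table has a fixed, read-only schema (`uri`, `generation`, `content_type`, `size`, `md5_hash`, `updated`, `metadata`), so labels cannot be imported into it directly. Instead, the metadata must be made available in a form that BigQuery can query. One way is to derive a label column from `uri` in a query over the object table. Another is to attach the labels as custom metadata on the Cloud Storage objects, which then appear in the table's `metadata` column.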
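
Because the labels live in the file names, one option is to derive a label column from `uri` at query time. The sketch below builds such a query. The helper `build_label_query`, the pattern, and the sample URI are assumptions for illustration, and the pattern assumes file names of the form `<id>_<description>.JPEG`.

```python
import re

# Matches the last path segment and captures the text before its first underscore.
LABEL_PATTERN = "/([^/_]+)_[^/]*$"


def build_label_query(project_id, dataset_id, table_id):
    """Return SQL that selects each object's URI plus a label derived from its file name."""
    return (
        f"SELECT uri, REGEXP_EXTRACT(uri, '{LABEL_PATTERN}') AS label "
        f"FROM `{project_id}.{dataset_id}.{table_id}`"
    )


# Check the pattern locally on a made-up URI before sending the query to BigQuery.
sample_uri = "gs://your-bucket/n00000000_example_name.JPEG"
print(re.search(LABEL_PATTERN, sample_uri).group(1))
print(build_label_query("your-project-id", "your-dataset-name", "your-table-name"))
```

The resulting string can be passed to a BigQuery client's `query()` method in the same way as the other queries in this notebook.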

Once the metadata is available in a queryable form, the labels derived from it can serve as training targets, so the model can learn to identify the different objects in the images.



In [4]:
# Create data connection for object table
import os
from dotenv import load_dotenv


load_dotenv()

dataset_id = os.environ.get("DATASET_ID")
table_id = os.environ.get("TABLE_ID")
project_id = os.environ.get("PROJECT_ID")
location = os.environ.get("REGION")
connection_id = os.environ.get("CONNECTION_ID")

print(location)

! echo "$location"

# Note: `!echo "${location}"` does not always work. In a shell-magic cell, the region value
# asia-northeast3 can come out as '-northeast3'.


# !bq mk --connection --location="$location" --project_id="$project_id" --connection_type=CLOUD_RESOURCE "$connection_id"

asia-northeast3
asia-northeast3


In [5]:
# ! bq mk --table --external_table_definition="gs://$bucket_name/*"@"$location"."$connection_id" --object_metadata=SIMPLE --metadata_cache_mode=MANUAL "$project_id":"$dataset_id"."$table_id"

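### IAM permission settings

After creating the connection, grant its service account read access to the bucket (for example, the Storage Object Viewer role) so that BigQuery can read the objects. The user or service account that runs the queries also needs permission to use the connection and to query the table.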
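
As a rough sketch, the bucket-level grant could be done with the Cloud Storage client library. The helper `grant_object_viewer` is hypothetical, the role choice assumes the connection only needs to read objects, and the service account e-mail is a placeholder. The real value is shown by `bq show --connection` for the connection created earlier.

```python
def grant_object_viewer(bucket, service_account_email):
    """Add roles/storage.objectViewer for the service account on the bucket.

    `bucket` is expected to be a google.cloud.storage Bucket (hypothetical helper).
    """
    # Fetch the current policy, append one binding, and write the policy back.
    policy = bucket.get_iam_policy(requested_policy_version=3)
    policy.bindings.append(
        {
            "role": "roles/storage.objectViewer",
            "members": {f"serviceAccount:{service_account_email}"},
        }
    )
    bucket.set_iam_policy(policy)
    return policy


# Usage (requires credentials; the e-mail below is a placeholder):
# from google.cloud import storage
# bucket = storage.Client().bucket(bucket_name)
# grant_object_viewer(bucket, "your-connection-sa@example.iam.gserviceaccount.com")
```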

In [6]:
import google.cloud.bigquery as bigquery

bqclient = bigquery.Client()

df_object_tb = bqclient.query(f"SELECT * FROM `{project_id}.{dataset_id}.{table_id}`").to_dataframe()

In [7]:
df_object_tb

| | uri | generation | content_type | size | md5_hash | updated | metadata |
|---|---|---|---|---|---|---|---|
| 0 | gs://bigquery-object-table-images/n01440764_te... | 1711487536256726 | image/jpeg | 100582 | 7dd8d26a6be277d117beda8d955202a8 | 2024-03-26 21:12:16.259000+00:00 | [] |
| 1 | gs://bigquery-object-table-images/n01443537_go... | 1711487543662786 | image/jpeg | 36973 | b4370d0e25ed47fb17ea086063dad15b | 2024-03-26 21:12:23.664000+00:00 | [] |
| 2 | gs://bigquery-object-table-images/n01484850_gr... | 1711487567884600 | image/jpeg | 99943 | e29c6cb820e8fdf003fad8e6bb22d283 | 2024-03-26 21:12:47.885000+00:00 | [] |
| 3 | gs://bigquery-object-table-images/n01491361_ti... | 1711487543426381 | image/jpeg | 50366 | 3b2b52eedcd8be5b5e16c9589c21a00a | 2024-03-26 21:12:23.427000+00:00 | [] |
| 4 | gs://bigquery-object-table-images/n01494475_ha... | 1711487557572174 | image/jpeg | 142314 | 912d2fa8080c54d113f4456512841ad0 | 2024-03-26 21:12:37.573000+00:00 | [] |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | gs://bigquery-object-table-images/n13044778_ea... | 1711487576255631 | image/jpeg | 81699 | 7a7cce50ed4bc4f7b7e7444990d9cb08 | 2024-03-26 21:12:56.257000+00:00 | [] |
| 996 | gs://bigquery-object-table-images/n13052670_he... | 1711487539310035 | image/jpeg | 107703 | 5211079f487af321d08de48837b1320e | 2024-03-26 21:12:19.311000+00:00 | [] |
| 997 | gs://bigquery-object-table-images/n13054560_bo... | 1711487563113768 | image/jpeg | 209627 | 0ece8318be3e30c3344d10658a20702e | 2024-03-26 21:12:43.115000+00:00 | [] |
| 998 | gs://bigquery-object-table-images/n13133613_ea... | 1711487544982576 | image/jpeg | 137821 | a10d6894566fda919770e1fa911d184f | 2024-03-26 21:12:24.985000+00:00 | [] |


### Training with object tables

In [None]:
# Import necessary libraries
from google.cloud import bigquery
from google.cloud import aiplatform
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten

# Set your project ID and BigQuery dataset/table names
PROJECT_ID = "your-project-id"
BQ_DATASET = "your-dataset-name"
BQ_TABLE = "your-table-name"

# Initialize BigQuery client
bq_client = bigquery.Client(project=PROJECT_ID)

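In [None]:
def load_data_from_bq(dataset_name, table_name):
  # Object tables expose each object's path in the `uri` column; they have no label column
  query = f"""
  SELECT uri
  FROM `{PROJECT_ID}.{dataset_name}.{table_name}`
  """

  # Run the query and convert results to a Pandas DataFrame
  df = bq_client.query(query).to_dataframe()

  # Extract image URIs and derive each label from the file name prefix before the first "_"
  # (assumes file names of the form <id>_<description>.JPEG)
  image_uris = df["uri"].tolist()
  labels = [uri.rsplit("/", 1)[-1].split("_", 1)[0] for uri in image_uris]

  return image_uris, labels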

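In [None]:
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import preprocess_input

def train_resnet_model(image_uris, labels, batch_size=32):
  # Encode the label strings as integer class indices
  class_names = sorted(set(labels))
  class_index = {name: i for i, name in enumerate(class_names)}
  label_ids = [class_index[label] for label in labels]

  def load_image(uri, label_id):
    # Read and decode the JPEG (assumes this TensorFlow build can read gs:// paths)
    image = tf.io.decode_jpeg(tf.io.read_file(uri), channels=3)
    image = tf.image.resize(image, (224, 224))
    return preprocess_input(image), label_id

  # Build an input pipeline that loads the images lazily from their URIs
  dataset = tf.data.Dataset.from_tensor_slices((image_uris, label_ids))
  dataset = dataset.map(load_image).batch(batch_size)

  # Load pre-trained ResNet50 without top layers
  base_model = ResNet50(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

  # Freeze base model layers
  for layer in base_model.layers:
    layer.trainable = False

  # Add custom top layers with one output unit per class
  x = base_model.output
  x = Flatten()(x)
  predictions = Dense(len(class_names), activation="softmax")(x)
  model = Model(inputs=base_model.input, outputs=predictions)

  # Compile and train the model; labels are integer indices, hence the sparse loss
  model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
  model.fit(dataset, epochs=10)  # Adjust epochs as needed

  return model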

In [None]:
# Initialize the Vertex AI SDK. The region and staging bucket are placeholders;
# the staging bucket is where the SDK uploads the packaged training script.
aiplatform.init(
  project=PROJECT_ID,
  location="your-region",
  staging_bucket="gs://your-bucket",
)

# Create the custom training job from a local training script
job = aiplatform.CustomTrainingJob(
  display_name="resnet-training-job",
  script_path="path/to/training_script.py",
  container_uri="gcr.io/your-project-id/your-training-image",
  # pandas and db-dtypes are needed by to_dataframe() in the training script
  requirements=["tensorflow", "google-cloud-bigquery", "pandas", "db-dtypes"],
)

# Run the job on a single n1-standard-4 worker. Machine settings are passed as
# keyword arguments; run() does not accept a job spec dictionary.
job.run(replica_count=1, machine_type="n1-standard-4")