# Using Serverless Spark with Google Cloud Vertex AI
This notebook provides example steps for using Serverless Spark with Google Cloud Vertex AI.

## Prerequisites
**Note:** This notebook and repository are supporting artifacts for the "Google Machine Learning and Generative AI for Solutions Architects" book. The book describes the concepts associated with this notebook, and for some of the activities, the book contains instructions that should be performed before running the steps in the notebooks. Each top-level folder in this repo is associated with a chapter in the book. Please ensure that you have read the relevant chapter sections before performing the activities in this notebook.

**There are also important generic prerequisite steps outlined [here](https://github.com/PacktPublishing/Google-Machine-Learning-for-Solutions-Architects/blob/main/Prerequisite-steps/Prerequisites.ipynb).**


## Introduction and setup

In this notebook, we will build a pipeline that performs the following steps:
1. Custom data processing, such as feature scaling, one-hot encoding, and feature engineering, in a Google Cloud Serverless Spark environment.
2. Model training with PySpark ML, also in Serverless Spark and orchestrated by Vertex AI Pipelines. In this case, our model uses the [Titanic dataset from Kaggle](https://www.kaggle.com/competitions/titanic/data) to predict the likelihood of survival of each passenger based on their associated features in the dataset. The trained model is saved to GCS.

In this initial section, we set up all of the baseline requirements to run our pipeline.

**Attention:** The code in this notebook creates Google Cloud resources that can incur costs.

Refer to the Google Cloud pricing documentation for details.

For example:

* [Vertex AI Pricing](https://cloud.google.com/vertex-ai/pricing)
* [Google Cloud Storage Pricing](https://cloud.google.com/storage/pricing)
* [Dataproc pricing](https://cloud.google.com/dataproc/pricing)


## Install required packages

We will use the following libraries in this notebook:

* [The Vertex AI Python SDK](https://cloud.google.com/python/docs/reference/aiplatform/latest)
* [Kubeflow Pipelines (KFP)](https://www.kubeflow.org/docs/components/pipelines/v1/sdk/sdk-overview/)
* [Google Cloud Pipeline Components (GCPC)](https://cloud.google.com/vertex-ai/docs/pipelines/components-introduction)

In [None]:
! python -m pip install --upgrade pip

In [None]:
! pip3 install --quiet --user --upgrade google-cloud-aiplatform kfp google-cloud-pipeline-components

*The pip installation commands sometimes report various errors. Those errors usually do not affect the activities in this notebook, and you can ignore them.*


## Restart the kernel

The code in the next cell will restart the kernel, which is sometimes required after installing or upgrading packages.

**When prompted, click OK to restart the kernel.**

The sleep command simply prevents further cells from executing before the kernel restarts.

In [None]:
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)


In [None]:
import time
time.sleep(10)

# (Wait for kernel to restart before proceeding...)

## Import required libraries

In [None]:
# General
from google.cloud import aiplatform

# Kubeflow Pipelines (KFP)
import kfp
from kfp import compiler, dsl
from kfp.dsl import component, Input, Output, Artifact

# Google Cloud Pipeline Components (GCPC)
from google_cloud_pipeline_components.v1.dataproc import DataprocPySparkBatchOp
from google_cloud_pipeline_components.v1 import dataset, custom_job
from google_cloud_pipeline_components.v1.model import ModelUploadOp
from google_cloud_pipeline_components.types import artifact_types
from google_cloud_pipeline_components.v1.endpoint import EndpointCreateOp, ModelDeployOp

## Set Google Cloud resource variables

The following code will set variables specific to your Google Cloud resources that will be used in this notebook, such as the Project ID, Region, and GCS Bucket.

**Note: This notebook is intended to execute in a Vertex AI Workbench Notebook, in which case the API calls issued in this notebook are authenticated according to the permissions (e.g., service account) assigned to the Vertex AI Workbench Notebook.**

We will use the `gcloud` command to get the Project ID details from the local Google Cloud project, and assign the results to the PROJECT_ID variable. If, for any reason, PROJECT_ID is not set, you can set it manually or change it, if preferred.

We also use a default bucket name for most of the examples and activities in this book, which has the format: `{PROJECT_ID}-aiml-sa-bucket`. You can change the bucket name if preferred.

Also, we're defaulting to the **us-central1** region, but you can optionally replace this with your [preferred region](https://cloud.google.com/about/locations).

In [None]:
PROJECT_ID_DETAILS = !gcloud config get-value project
PROJECT_ID = PROJECT_ID_DETAILS[0]  # The project ID is item 0 in the list returned by the gcloud command
BUCKET=f"{PROJECT_ID}-aiml-sa-bucket" # Optional: replace with your preferred bucket name, which must be a unique name.
REGION="us-central1" # Optional: replace with your preferred region (See: https://cloud.google.com/about/locations) 
print(f"Project ID: {PROJECT_ID}")
print(f"Bucket Name: {BUCKET}")
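
As a rough illustration of the naming convention described above, the following sketch builds the default bucket name from a project ID and runs a simplified sanity check on it. The helper names (`default_bucket_name`, `looks_like_valid_bucket_name`) are hypothetical and not part of any Google Cloud library. Treating `(unset)` as a missing project ID is an assumption about what `gcloud` prints, and the regular expression covers only the basic bucket naming rules.

```python
import re


def default_bucket_name(project_id: str, suffix: str = "aiml-sa-bucket") -> str:
    """Build the default bucket name used in this book: {PROJECT_ID}-aiml-sa-bucket."""
    project_id = project_id.strip()
    # Assumption: an empty value or "(unset)" means gcloud has no project configured.
    if not project_id or project_id == "(unset)":
        raise ValueError("PROJECT_ID is not set; assign it manually before continuing.")
    return f"{project_id}-{suffix}"


# Simplified check: 3-63 characters, lowercase letters, digits, '-', '_' or '.',
# starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the simplified naming check above."""
    return bool(_BUCKET_RE.match(name))


bucket = default_bucket_name("my-project")  # "my-project" is a placeholder project ID
print(bucket, looks_like_valid_bucket_name(bucket))
```

A check like this only catches obvious formatting problems; it cannot tell you whether the name is already taken by another project.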

## Create bucket

The following code will create the bucket if it doesn't already exist.

If you get an error saying that the bucket already exists, you can ignore it and continue with the rest of the steps, unless you want to use a different bucket.

In [None]:
!gsutil mb -l {REGION} gs://{BUCKET}

## Begin implementation

Now that we have performed the prerequisite steps for this activity, it's time to implement the activity.

## Define constants
In this section, we define all of the constants that will be referenced throughout the rest of the notebook.

**REPLACE THE PROJECT_ID, REGION, AND BUCKET DETAILS WITH YOUR DETAILS.**

In [None]:
# Core constants
BUCKET_URI = f"gs://{BUCKET}"
TRAINER_DIR = "pyspark-titanic-training" # Local parent directory for our pipeline resources
PROCESSING_DIR = "pyspark-titanic-preprocessing" # Local directory for PySpark data processing resources
DATAPROC_RUNTIME_VERSION = "2.1" # (See https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions)

# Pipeline constants
PIPELINE_NAME = "pyspark-titanic-pipeline" # Name of our pipeline
PIPELINE_ROOT = f"{BUCKET_URI}/pipelines" # (See: https://www.kubeflow.org/docs/components/pipelines/v1/overview/pipeline-root/)
SUBNETWORK = "default" # Our VPC subnet name
SUBNETWORK_URI = f"projects/{PROJECT_ID}/regions/{REGION}/subnetworks/{SUBNETWORK}" # Our VPC subnet resource identifier
MODEL_NAME = "pyspark-titanic-model" # Name of our model
EXPERIMENT_NAME = "aiml-sa-pyspark-experiment" # Vertex AI "Experiment" name for metadata tracking

# Preprocessing constants
SOURCE_DATASET = f"{BUCKET_URI}/data/unprocessed/titanic/train.csv" # Our raw source dataset
PREPROCESSING_PYTHON_FILE_URI = f"{BUCKET_URI}/code/mlops/preprocessing.py" # GCS location of our PySpark script
PROCESSED_DATA_URI =f"{BUCKET_URI}/data/processed/mlops-titanic" # Location to store the output of our data preprocessing step
# Arguments to pass to our preprocessing script:
PREPROCESSING_ARGS = [
    "--source_dataset",
    SOURCE_DATASET,
    "--processed_data_path",
    PROCESSED_DATA_URI,
]

# Training constants
TRAINING_PYTHON_FILE_URI = f"{BUCKET_URI}/code/additional-use-cases-chapter/pyspark-ml.py" # GCS location of our PySpark script
MODEL_URI = f"{BUCKET_URI}/models/additional-use-cases-chapter/mlops" # Where to store our trained model

# Arguments to pass to our training job
TRAINING_ARGS=[
        "--processed_data_path",
        PROCESSED_DATA_URI,
        "--model_path",
        MODEL_URI,
    ]
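
The argument lists above are flat lists of flag/value pairs, which is the shape that `argparse` expects. The following minimal sketch shows how the preprocessing script's parser would read such a list. The bucket name is a placeholder, and `required=True` is our own addition for illustration (the script in this notebook does not set it).

```python
import argparse

# Placeholder values; the notebook derives the real ones from BUCKET_URI.
SOURCE_DATASET = "gs://example-bucket/data/unprocessed/titanic/train.csv"
PROCESSED_DATA_URI = "gs://example-bucket/data/processed/mlops-titanic"

PREPROCESSING_ARGS = [
    "--source_dataset",
    SOURCE_DATASET,
    "--processed_data_path",
    PROCESSED_DATA_URI,
]


def build_parser() -> argparse.ArgumentParser:
    """Mirror the two arguments that the preprocessing script accepts."""
    parser = argparse.ArgumentParser(description="Data Preprocessing Script")
    parser.add_argument("--source_dataset", type=str, required=True)
    parser.add_argument("--processed_data_path", type=str, required=True)
    return parser


# Passing the list explicitly lets us test the parsing without a command line.
parsed = build_parser().parse_args(PREPROCESSING_ARGS)
print(parsed.source_dataset)
print(parsed.processed_data_path)
```

Parsing the list locally like this is a quick way to catch a misspelled flag before submitting a batch job.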

### Create local directories
We will use the following local directories during the activities in this notebook.

In [None]:
# make a source directory to save the code
!mkdir -p $TRAINER_DIR
!mkdir -p $PROCESSING_DIR

### Upload source dataset 
Upload our source dataset to GCS. Our data preprocessing step in our pipeline will ingest this data from GCS.

In [None]:
! gsutil cp ./data/train.csv $SOURCE_DATASET

### Set project ID for gcloud
The following command sets our project ID for using gcloud commands in this notebook.

In [None]:
! gcloud config set project $PROJECT_ID --quiet

### Initialize the Vertex AI SDK client

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

### Configure Private Google Access for Dataproc 
Our Serverless Spark data preprocessing job in our pipeline will run in Dataproc, which is (as Google defines) Google's "fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks."
We're going to configure something called "Private Google Access", which allows us to interact with Google services without sending requests over the public Internet.

You can learn more about Dataproc [here](https://cloud.google.com/dataproc?hl=en), and learn more about Private Google Access [here](https://cloud.google.com/vpc/docs/private-google-access).

In [None]:
!gcloud compute networks subnets update $SUBNETWORK --region=$REGION --enable-private-ip-google-access
!gcloud compute networks subnets describe $SUBNETWORK --region=$REGION --format="get(privateIpGoogleAccess)"

# Create custom PySpark job

The following code will create a file that contains the code for our custom PySpark data preprocessing job. 

The code initiates a Spark session, loads our raw source dataset, and then performs the following processing steps (we performed many of these steps using pandas in our feature engineering chapter earlier in this book, but in this case we will implement the steps using PySpark in Google Cloud Serverless Spark):

1. Removes rows from the dataset where the target variable ("Survived") is missing.
2. Drops columns that are unlikely to affect the likelihood of surviving, such as 'PassengerId', 'Name', 'Ticket', and 'Cabin'.
3. Fills in missing values in input features.
4. Performs some feature engineering by creating new features such as 'FamilySize' and 'IsAlone' from combinations of existing features.
5. Ensures that all numeric features are on a consistent scale with each other.
6. One-hot encodes all categorical features.
7. Converts the resulting sparse vectors to dense arrays and expands them into individual columns. This mainly makes it easier for us to feed the data into our model in our training script later, with minimal processing needed in the training script.
8. Writes the resulting processed data to a parquet file in GCS.
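
To make steps 1 and 3 to 6 above more concrete, here is a minimal plain-Python sketch of the same transformations on a handful of made-up rows. It is only an illustration under simplifying assumptions, not the PySpark job itself: the sample rows are hypothetical, only 'Age' and 'Sex' are processed, and the one-hot encoding keeps every category, whereas Spark's `OneHotEncoder` drops the last one by default.

```python
from statistics import median, stdev

# Hypothetical sample rows (not taken from the real dataset).
rows = [
    {"Survived": 1, "Sex": "female", "Age": 29.0, "SibSp": 0, "Parch": 0},
    {"Survived": 0, "Sex": "male", "Age": None, "SibSp": 1, "Parch": 2},
    {"Survived": None, "Sex": "male", "Age": 40.0, "SibSp": 0, "Parch": 0},
    {"Survived": 0, "Sex": "male", "Age": 35.0, "SibSp": 0, "Parch": 0},
    {"Survived": 1, "Sex": "female", "Age": 4.0, "SibSp": 1, "Parch": 1},
]

# Step 1: remove rows where the target is missing.
rows = [dict(r) for r in rows if r["Survived"] is not None]

# Step 3: fill missing ages with the median of the known ages.
median_age = median([r["Age"] for r in rows if r["Age"] is not None])
for r in rows:
    if r["Age"] is None:
        r["Age"] = median_age

# Step 4: feature engineering.
for r in rows:
    r["FamilySize"] = r["SibSp"] + r["Parch"] + 1
    r["IsAlone"] = 1 if r["FamilySize"] == 1 else 0

# Step 5: scale Age to unit standard deviation without centering
# (comparable to StandardScaler with withStd=True, withMean=False).
age_std = stdev([r["Age"] for r in rows])
for r in rows:
    r["scaled_Age"] = r["Age"] / age_std

# Step 6: one-hot encode Sex (all categories kept in this sketch).
categories = sorted({r["Sex"] for r in rows})
for r in rows:
    for category in categories:
        r[f"Sex_{category}"] = 1 if r["Sex"] == category else 0

for r in rows:
    print(r)
```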

**Note:** we could create a custom container in which to run our PySpark code on Dataproc Serverless if we had very specific dependencies that needed to be installed. However, Dataproc Serverless also provides [default runtimes](https://cloud.google.com/dataproc-serverless/docs/concepts/versions/spark-runtime-versions) that we can use, and these are fine for our needs in this activity, so all we need to do is define our code and put it into GCS so that it can be referenced by the `DataprocPySparkBatchOp` component in our pipeline.

In [None]:
%%writefile $PROCESSING_DIR/preprocessing.py

import argparse
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, StandardScaler, VectorAssembler
from pyspark.sql.functions import udf, col, when
from pyspark.sql.types import StringType, ArrayType, FloatType

# Setting up the argument parser
parser = argparse.ArgumentParser(description='Data Preprocessing Script')
parser.add_argument('--source_dataset', type=str, help='Path to the source dataset')
parser.add_argument('--processed_data_path', type=str, help='Path to save the output data')

# Parsing the arguments
args = parser.parse_args()
source_dataset = args.source_dataset
processed_data_path = args.processed_data_path

# Initialize a SparkSession
spark = SparkSession.builder \
    .appName("Titanic Data Processing") \
    .getOrCreate()

# Load the data
titanic = spark.read.csv(args.source_dataset, header=True, inferSchema=True)

# Remove rows where 'Survived' is missing
titanic = titanic.filter(titanic.Survived.isNotNull())

# Drop irrelevant columns
titanic = titanic.drop('PassengerId', 'Name', 'Ticket', 'Cabin')

# Fill missing values
def calculate_median(column_name):
    return titanic.filter(col(column_name).isNotNull()).approxQuantile(column_name, [0.5], 0)[0]

median_age = calculate_median('Age')  # Median age
median_fare = calculate_median('Fare')  # Median fare

titanic = titanic.fillna({
    'Pclass': -1,
    'Sex': 'Unknown',
    'Age': median_age,
    'SibSp': -1,
    'Parch': -1,
    'Fare': median_fare,
    'Embarked': 'Unknown'
})

# Feature Engineering
titanic = titanic.withColumn('FamilySize', col('SibSp') + col('Parch') + 1)
titanic = titanic.withColumn('IsAlone', when(col('FamilySize') == 1, 1).otherwise(0))

# Define categorical features 
categorical_features = ['Pclass', 'Sex', 'Embarked', 'IsAlone']

# Define numerical features 
numerical_features = ['Age', 'SibSp', 'Parch', 'Fare', 'FamilySize']

# One-hot encoding for categorical features
stages = []
for col_name in categorical_features:
    string_indexer = StringIndexer(inputCol=col_name, outputCol=f"{col_name}_Index")
    encoder = OneHotEncoder(inputCols=[f"{col_name}_Index"], outputCols=[f"{col_name}_Vec"])
    stages += [string_indexer, encoder]
    
# Scaling numerical features 
for col_name in numerical_features:
    assembler = VectorAssembler(inputCols=[col_name], outputCol=f"vec_{col_name}")
    scaler = StandardScaler(inputCol=f"vec_{col_name}", outputCol=f"scaled_{col_name}", withStd=True, withMean=False)
    stages += [assembler, scaler]

# Create a pipeline and transform the data
pipeline = Pipeline(stages=stages)
pipeline_model = pipeline.fit(titanic)
titanic = pipeline_model.transform(titanic)

# Drop intermediate columns created during scaling and one-hot encoding
titanic = titanic.drop('vec_Age', 'vec_Fare', 'vec_FamilySize', 'vec_SibSp', 'vec_Parch', 'Pclass_Index', 'Sex_Index', 'Embarked_Index', 'IsAlone_Index')

# Drop original categorical columns (no longer needed after one-hot encoding)
titanic = titanic.drop(*categorical_features)

# Drop original numeric columns (no longer needed after scaling)
titanic = titanic.drop(*numerical_features)

vector_columns = ["Pclass_Vec", "Sex_Vec", "Embarked_Vec", "IsAlone_Vec", "scaled_Age", "scaled_Fare", "scaled_FamilySize", "scaled_SibSp", "scaled_Parch"]

def to_dense(vector):
    return vector.toArray().tolist()

to_dense_udf = udf(to_dense, ArrayType(FloatType()))

for vector_col in vector_columns:
    titanic = titanic.withColumn(vector_col, to_dense_udf(col(vector_col)))

for vector_col in vector_columns:
    num_features = len(titanic.select(vector_col).first()[0])  # Getting the size of the vector
    
    for i in range(num_features):
        titanic = titanic.withColumn(f"{vector_col}_{i}", col(vector_col).getItem(i))
    
    titanic = titanic.drop(vector_col)

# Save the processed data to GCS
titanic.write.parquet(args.processed_data_path, mode="overwrite")

# Stop the SparkSession
spark.stop()
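
The last part of the script expands each array column into one scalar column per element, named `{column}_{index}`. The following plain-Python sketch shows that reshaping on dictionaries instead of a Spark DataFrame. The function name `flatten_vector_columns` and the sample values are hypothetical.

```python
def flatten_vector_columns(rows, vector_columns):
    """Expand each list-valued column into scalar columns named {column}_{index}."""
    flattened = []
    for row in rows:
        # Keep every column that is not being expanded.
        out = {key: value for key, value in row.items() if key not in vector_columns}
        for name in vector_columns:
            for i, value in enumerate(row[name]):
                out[f"{name}_{i}"] = float(value)
        flattened.append(out)
    return flattened


# Made-up rows that resemble the output of the encoding and scaling stages.
sample_rows = [
    {"Survived": 1, "Sex_Vec": [1.0], "scaled_Age": [2.1]},
    {"Survived": 0, "Sex_Vec": [0.0], "scaled_Age": [2.5]},
]

flat = flatten_vector_columns(sample_rows, ["Sex_Vec", "scaled_Age"])
print(flat[0])
```

The result is a table of plain numeric columns, which is what the training script reads back from the parquet files.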

# Create custom training job
In this section, we will create our custom training job. It will consist of the following steps:
1. Create our custom training script, which uses PySpark ML.
2. Upload the script to GCS so that it can be referenced by a `DataprocPySparkBatchOp` component in our pipeline.

## Define the code for our training job

The following code will create a file that contains the code for our custom training job. 

The code performs the following processing steps:

1. Imports required libraries and sets initial variable values based on arguments passed to the script (`--processed_data_path` and `--model_path`).
2. Reads in the processed dataset that was created by the data preprocessing step in our pipeline.
3. Renames the target column ('Survived') to 'label', which is the label column name used by the classifier and the evaluator in the script.
4. Splits the data into training (80%) and test (20%) sets.
5. Defines a PySpark ML pipeline that assembles the input features into a single vector, scales them, and trains a logistic regression classifier.
6. Fits the pipeline to the training data.
7. Evaluates the model on the test data by computing the area under the ROC curve (AUC).
8. Saves the trained model to the location specified by `--model_path`.

In [None]:
%%writefile {TRAINER_DIR}/pyspark-model.py

import argparse
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

def train_model(args):
    # Input arguments
    processed_data_path = args.processed_data_path
    model_path = args.model_path
    
    # Initialize Spark Session
    spark = SparkSession.builder.appName("TitanicSurvivalPrediction").getOrCreate()

    ### DATA PREPARATION SECTION ###
    
    # Read Parquet files into a DataFrame
    data = spark.read.parquet(processed_data_path)

    print(f"Data loaded successfully from {processed_data_path}")

    # Rename the target column to 'label' (the label column name used below)
    data = data.withColumnRenamed('Survived', 'label')

    # Split the data into training and testing sets
    train_data, test_data = data.randomSplit([0.8, 0.2], seed=42)

    ### MODEL TRAINING AND EVALUATION SECTION ###

    # Define the pipeline stages
    assembler = VectorAssembler(inputCols=[col for col in data.columns if col != 'label'], outputCol="features")
    scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures")
    lr = LogisticRegression(featuresCol="scaledFeatures", labelCol="label")

    # Construct the pipeline
    pipeline = Pipeline(stages=[assembler, scaler, lr])

    # Fit the pipeline to training data
    model = pipeline.fit(train_data)

    # Make predictions on test data
    predictions = model.transform(test_data)

    # Evaluate the model
    evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")
    auc = evaluator.evaluate(predictions, {evaluator.metricName: "areaUnderROC"})
    print(f"AUC: {auc}")

    # Save the model to the path passed in --model_path (a GCS location in this notebook)
    model.write().overwrite().save(model_path)

    # Return the trained model
    return model

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Train a logistic regression model for Titanic survival prediction')

    parser.add_argument('--processed_data_path', type=str, help='Path to the directory containing the preprocessed data')
    parser.add_argument('--model_path', type=str, help='Path to save the trained model')

    args = parser.parse_args()

    train_model(args)
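
The training script reports `areaUnderROC`. As a rough illustration of what that number means, the sketch below computes AUC in plain Python as the fraction of positive/negative pairs in which the positive example receives the higher score (ties count as half). This is an equivalent textbook definition rather than the way Spark computes it internally, and the labels and scores are made up.

```python
def area_under_roc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted scores.
example_labels = [1, 0, 1, 0, 1]
example_scores = [0.9, 0.3, 0.6, 0.7, 0.8]
print(f"AUC: {area_under_roc(example_labels, example_scores):.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so values from the training job can be read on that scale.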

### Upload source code for PySpark

We need to upload our PySpark code to our GCS bucket to be referenced by the `DataprocPySparkBatchOp` components in our pipeline.

In [None]:
! gsutil cp $PROCESSING_DIR/preprocessing.py $PREPROCESSING_PYTHON_FILE_URI
! gsutil cp $TRAINER_DIR/pyspark-model.py $TRAINING_PYTHON_FILE_URI

# Define our Vertex AI Pipeline

Now that we have defined our custom data preprocessing and model training components, it's time to define our MLOps pipeline.

In this section, we will use the Kubeflow Pipelines SDK and Google Cloud Pipeline Components to define our MLOps pipeline.

We begin by specifying all of the required variables in our pipeline, and populating their values from the constants we defined earlier in our notebook. We then specify the following components in our pipeline, both of which use [DataprocPySparkBatchOp](https://cloud.google.com/vertex-ai/docs/pipelines/dataproc-component) to run a PySpark script as a Dataproc Serverless batch workload:

1. A data preprocessing step that runs our `preprocessing.py` script.
2. A model training step that runs our training script. We chain it with `.after(preprocessing_op)` so that it only starts once the preprocessing step has completed.

In [None]:
@dsl.pipeline(name=PIPELINE_NAME, description="MLOps pipeline for custom data preprocessing, model training, and deployment.")
def pipeline(
    bucket_name: str = BUCKET_URI,
    display_name: str = PIPELINE_NAME,
    preprocessing_main_python_file_uri: str = PREPROCESSING_PYTHON_FILE_URI,
    training_main_python_file_uri: str = TRAINING_PYTHON_FILE_URI,
    preprocessing_args: list = PREPROCESSING_ARGS,
    training_args: list = TRAINING_ARGS,
    project_id: str = PROJECT_ID,
    location: str = REGION,
    subnetwork_uri: str = SUBNETWORK_URI,
    dataproc_runtime_version: str = DATAPROC_RUNTIME_VERSION,
    base_output_directory: str = PIPELINE_ROOT,
):
    
    # Preprocess data
    preprocessing_op = DataprocPySparkBatchOp(
        project=project_id,
        location=location,
        main_python_file_uri=preprocessing_main_python_file_uri,
        args=preprocessing_args,
        subnetwork_uri=subnetwork_uri,
        runtime_config_version=dataproc_runtime_version,
    )

    # Train model
    training_op = DataprocPySparkBatchOp(
        project=project_id,
        location=location,
        main_python_file_uri=training_main_python_file_uri,
        args=training_args,
        subnetwork_uri=subnetwork_uri,
        runtime_config_version=dataproc_runtime_version,
    ).after(preprocessing_op)
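
The `.after(preprocessing_op)` call adds an explicit dependency, so the training task is not scheduled until preprocessing has finished. As a toy illustration of dependency-based ordering (this is the Python standard library's `graphlib`, not the Kubeflow Pipelines scheduler), the two tasks can be modeled like this:

```python
from graphlib import TopologicalSorter

# Each key maps a task name to the set of tasks it must wait for.
dependencies = {
    "preprocessing_op": set(),
    "training_op": {"preprocessing_op"},  # mirrors .after(preprocessing_op)
}

# static_order() yields the tasks in an order that respects every dependency.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Without the explicit dependency, the two tasks would have no ordering constraint between them, because neither consumes an output of the other.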

### Compile our pipeline into a YAML file

Now that we have defined our pipeline structure, we need to compile it into YAML format in order to run it in Vertex AI Pipelines.

In [None]:
compiler.Compiler().compile(pipeline, 'mlops-pipeline.yaml')

## Submit and run our pipeline in Vertex AI Pipelines

Now we're ready to use the Vertex AI Python SDK to submit and run our pipeline in Vertex AI Pipelines.

The parameters, artifacts, and metrics produced from the pipeline run are automatically captured into Vertex AI Experiments as an experiment run. We will discuss the concept of Vertex AI Experiments in more detail in later chapters in the book. The output of the following cell will provide a link at which you can watch your pipeline as it progresses through each of the steps.

In [None]:
pipeline = aiplatform.PipelineJob(display_name=PIPELINE_NAME, template_path='mlops-pipeline.yaml', enable_caching=False)

pipeline.submit(experiment=EXPERIMENT_NAME)

### Wait for the pipeline to complete
The following call blocks until the pipeline finishes, periodically printing the status of our pipeline execution. If all goes to plan, you will eventually see a message saying "PipelineJob run completed".

In [None]:
pipeline.wait()
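
Conceptually, waiting for a job means polling its state until it reaches a terminal value. The following sketch shows that pattern in plain Python. The state names, the `wait_for_completion` helper, and the fake state source are all hypothetical stand-ins, not the Vertex AI SDK's implementation.

```python
import time

# Hypothetical terminal state names used only for this illustration.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_completion(get_state, poll_seconds=0.0, max_polls=100):
    """Poll get_state() until it returns a terminal state, then return that state."""
    for _ in range(max_polls):
        state = get_state()
        print(f"Pipeline state: {state}")
        if state in TERMINAL_STATES:
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("The job did not reach a terminal state in time.")


# A fake state source that simulates a job progressing to completion.
fake_states = iter(["PENDING", "RUNNING", "RUNNING", "SUCCEEDED"])
final_state = wait_for_completion(lambda: next(fake_states))
print(f"Final state: {final_state}")
```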

## Great job!! You have officially created and implemented an MLOps pipeline on Vertex AI!!

I also highly recommend going through [this tutorial](https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml) to learn more about using Dataproc, BigQuery, and Apache Spark ML for Machine Learning! 

# Cleaning up

When you no longer need the resources created by this notebook, you can delete them as follows.

**Note: if you do not delete the resources, you will continue to pay for them**

In [None]:
clean_up = False

if clean_up:
    # Delete pipeline
    pipeline.delete()
else:
    print("clean_up parameter is set to False")