# MLOps Data Scientist

| | |
|-|-|
| Author(s) | [Keeyana Jones](https://github.com/keeyanajones/) |

## **A Notebook for MLOps Data Scientist**

This pre-configured Vertex AI Workbench Jupyter notebook for data scientists offers a streamlined environment that showcases typical data science tasks within Vertex AI Workbench. It provides a structured template that a GCP developer can customize for specific projects and data.

### Tasks Data Scientists Perform in a Vertex Workbench Notebook

Data scientists typically engage in a cycle of data understanding, model development, evaluation, and preparation for deployment. Here's how that translates to actions in a Vertex Workbench notebook:

#### 1. Environment Setup & Initialization

This sets up the development environment, authenticates to Google Cloud, and defines essential project variables for subsequent machine learning tasks on Vertex AI.

**Goals of this section:**
1.  Verify the Python environment and installed libraries.
2.  Understand how authentication to Google Cloud services works in Vertex Workbench.
3.  Define key project-specific variables like Project ID, Region, and Google Cloud Storage bucket URIs.

**Task:** Ensure the notebook environment is correctly set up, authenticated, and has access to necessary libraries and GCP services.

Notebook Configuration:
- **Kernel Selection:** Default to a TensorFlow or PyTorch kernel with GPU support if relevant.
- **Pre-installed Libraries:** Ensure google-cloud-aiplatform, google-cloud-bigquery, google-cloud-storage, pandas, numpy, scikit-learn, matplotlib, seaborn, tensorflow, torch, torchvision, tqdm, etc., are pre-installed via your custom container image.
- **Authentication:** Explain that the notebook's service account handles most authentication automatically. Provide boilerplate code for explicit authentication if needed for specific scenarios (gcloud auth application-default login or programmatic authentication for specific APIs).
- **Project Variables:** Include initial cells to define PROJECT_ID, REGION, BUCKET_URI, BIGQUERY_DATASET, etc. This makes it easy for data scientists to configure their project.

## Kernel and Pre-installed Libraries

Vertex AI Workbench instances come pre-configured with popular machine learning frameworks (TensorFlow, PyTorch) and common data science libraries. This notebook is designed to run with a **TensorFlow (GPU)** or **PyTorch (GPU)** kernel, leveraging the power of GPUs for accelerated computation.

**Expected pre-installed libraries:**
* `google-cloud-aiplatform` (Vertex AI SDK)
* `google-cloud-bigquery`
* `google-cloud-storage`
* `pandas`
* `numpy`
* `scikit-learn`
* `matplotlib`
* `seaborn`
* `tensorflow` (if using TensorFlow kernel)
* `torch`, `torchvision` (if using PyTorch kernel)
* `tqdm`
* And many others commonly used in data science.

Let's verify some key installations and their versions.

In [None]:
# Import common libraries to verify installation
import sys
import pandas as pd
import numpy as np
import sklearn
import matplotlib
import seaborn
import tqdm
import logging

print(f"Python Version: {sys.version}")
print(f"Pandas Version: {pd.__version__}")
print(f"NumPy Version: {np.__version__}")
print(f"Scikit-learn Version: {sklearn.__version__}")
print(f"Matplotlib Version: {matplotlib.__version__}")
print(f"Seaborn Version: {seaborn.__version__}")
print(f"tqdm Version: {tqdm.__version__}")

# Optional: Check for TensorFlow or PyTorch if relevant to your default kernel
try:
    import tensorflow as tf
    print(f"TensorFlow Version: {tf.__version__}")
    if tf.test.is_built_with_cuda():
        print("TensorFlow is built with CUDA (GPU support).")
    else:
        print("TensorFlow is NOT built with CUDA (GPU support).")
    if len(tf.config.list_physical_devices('GPU')) > 0:
        print(f"Found {len(tf.config.list_physical_devices('GPU'))} GPU(s) available for TensorFlow.")
    else:
        print("No GPU found for TensorFlow.")
except ImportError:
    print("TensorFlow not installed or accessible.")

try:
    import torch
    print(f"PyTorch Version: {torch.__version__}")
    if torch.cuda.is_available():
        print(f"Found {torch.cuda.device_count()} GPU(s) available for PyTorch.")
        print(f"Current PyTorch GPU: {torch.cuda.get_device_name(0)}")
    else:
        print("No GPU found for PyTorch.")
except ImportError:
    print("PyTorch not installed or accessible.")

# Set up basic logging for notebooks
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logging.info("Environment setup verification complete.")
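
The cell above imports each library, which does not cover the Google Cloud client packages in the expected list. As a complementary sketch, the standard library's `importlib.metadata` can report installed distribution versions without importing them. The helper name `installed_versions` and the `expected` list are illustrative, not part of the original notebook.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names as published on PyPI (a subset of the expected list above).
expected = [
    "google-cloud-aiplatform",
    "google-cloud-bigquery",
    "google-cloud-storage",
    "pandas",
    "numpy",
    "scikit-learn",
]

report = installed_versions(expected)
for name, version in report.items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Checking by distribution name is useful here because the import name can differ from the package name (for example, `scikit-learn` is imported as `sklearn`).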

In [None]:
! pip install --upgrade pip \
google-cloud-aiplatform \
google-cloud-storage \
google-generativeai \
google-cloud-bigquery \
google-cloud-logging \
google-cloud-monitoring 

In [None]:
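# Upgrade pip, then install the wider toolchain one group at a time.
# Shell comments cannot sit between backslash-continued lines, so each group is its own command.
# TensorFlow normally comes from the base image; add `tensorflow` to the frameworks line if yours lacks it.
! pip install --upgrade pip

# Data science core
! pip install --upgrade pandas numpy scipy
# ML frameworks (PyTorch may need specific wheels for your CUDA version)
! pip install --upgrade torch torchvision torchaudio
# Traditional ML
! pip install --upgrade scikit-learn statsmodels xgboost lightgbm catboost
# Visualization
! pip install --upgrade matplotlib seaborn plotly bokeh dash
# NLP
! pip install --upgrade nltk spacy textblob gensim transformers
# Computer vision
! pip install --upgrade Pillow opencv-python
# MLOps & experiment tracking (the Kubeflow Pipelines SDK is published on PyPI as `kfp`)
! pip install --upgrade mlflow kfp
# Model explanation
! pip install --upgrade shap lime
# Data ingestion & connectivity
! pip install --upgrade openpyxl xlrd SQLAlchemy psycopg2-binary pymysql db-dtypes fsspec gcsfs
# Other utilities & Jupyter extensions (ipywidgets brings in JupyterLab widget support)
! pip install --upgrade tqdm jupyterlab jupyterlab-git ipywidgets jupyter-resource-usage nbconvert
# Exporting models
! pip install --upgrade onnx onnxruntime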

## Google Cloud Authentication

Vertex AI Workbench instances are typically associated with a **Service Account**. This service account is automatically used by Google Cloud client libraries (like `google-cloud-aiplatform`, `google-cloud-storage`, `google-cloud-bigquery`) for authentication to most Google Cloud services.

This means you usually **do not need to explicitly log in** or provide credentials within the notebook.

However, for some specific scenarios (e.g., running `gcloud` CLI commands directly for certain administrative tasks, or when your service account lacks the necessary permissions and you need to use other credentials), you might use the following:

* **`gcloud auth application-default login`**: Stores your user account's credentials as Application Default Credentials (ADC) for the Google Cloud client libraries. To authenticate `gcloud` CLI commands themselves with your user account, use `gcloud auth login`.
* **Programmatic Authentication**: For highly specific cases, you might manually load credentials, though this is rare in Vertex Workbench for typical ML workflows.

**For the vast majority of ML development in Vertex Workbench, no action is required in this section.** The default service account authentication is sufficient and recommended.
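
To confirm which identity the client libraries will actually use, the following is a minimal sketch built on `google.auth.default()`, the entry point for Application Default Credentials. The helper name `check_default_credentials` and its `auth_default` parameter are hypothetical; the parameter exists only so the helper can be exercised offline, and the demonstration values below are placeholders.

```python
from typing import Callable, Optional, Tuple


def check_default_credentials(auth_default: Optional[Callable] = None) -> Tuple[bool, str]:
    """Report whether Application Default Credentials (ADC) can be resolved.

    `auth_default` is an injection point for testing; when omitted,
    the real `google.auth.default` is used.
    """
    if auth_default is None:
        import google.auth  # imported lazily so the helper can be defined without it
        auth_default = google.auth.default
    try:
        credentials, project_id = auth_default()
    except Exception as exc:  # google.auth raises DefaultCredentialsError when nothing is found
        return False, f"No usable credentials: {exc}"
    # Service account credentials expose `service_account_email`; user credentials do not.
    account = getattr(credentials, "service_account_email", "user or unknown account")
    return True, f"project={project_id}, account={account}"


# Offline demonstration with a stand-in for google.auth.default (placeholder values only).
class _FakeCredentials:
    service_account_email = "notebook-sa@example-project.iam.gserviceaccount.com"


ok, message = check_default_credentials(lambda: (_FakeCredentials(), "example-project"))
print(ok, message)
```

Inside a Workbench instance, calling `check_default_credentials()` with no argument should resolve the instance's service account, assuming `google-auth` is installed.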

In [None]:
# Uncomment the following lines ONLY if you need explicit user authentication
# (e.g., for specific gcloud CLI commands or when the service account is insufficient).
# For most ML workflows, the default service account authentication is enough.

# from google.colab import auth # If using Colab Enterprise
# auth.authenticate_user()     # If using Colab Enterprise

# If you were outside a managed environment and needed to authenticate with a service account key file:
# from google.cloud import bigquery
# from google.oauth2 import service_account
# key_path = 'path/to/your/service-account-key.json'
# credentials = service_account.Credentials.from_service_account_file(key_path)
# client = bigquery.Client(credentials=credentials, project=PROJECT_ID)  # PROJECT_ID is defined in the next section

print("Authentication check: Typically, Vertex AI Workbench handles authentication automatically via the instance's service account.")
print("No explicit authentication step is usually required for client library operations.")

## Define Project Variables

To ensure your notebook interacts with the correct Google Cloud resources, define the following essential project-specific variables.

**Important:**
* Replace the placeholder values with your actual GCP Project ID, desired Region, and the name of your primary Google Cloud Storage bucket.
* Ensure the service account associated with this notebook has the necessary permissions to access these resources (e.g., `Vertex AI User`, `Storage Object Admin`, `BigQuery Data Editor`).

In [None]:
# --- IMPORTANT: CONFIGURE THESE VARIABLES ---
# Replace with your Google Cloud Project ID
PROJECT_ID = "your-gcp-project-id" # e.g., "my-ml-project-12345"

# Replace with your desired Google Cloud Region (e.g., "us-central1", "europe-west4")
# Choose a region close to your data and users for lower latency.
REGION = "us-central1"

# Replace with the name of your primary Google Cloud Storage bucket for ML artifacts.
# This bucket will be used to store datasets, model checkpoints, outputs, etc.
GCS_BUCKET_NAME = "your-ml-artifacts-bucket" # e.g., "my-ml-artifacts-bucket-123" (name only; the "gs://" prefix is added below)

# Optional: Replace with the name of your BigQuery Dataset, if applicable
BIGQUERY_DATASET = "your_bigquery_dataset" # e.g., "my_ml_data"

# --- DO NOT MODIFY BELOW THIS LINE (unless you know what you're doing) ---
# Construct full GCS bucket URI
BUCKET_URI = f"gs://{GCS_BUCKET_NAME}"

# Initialize Vertex AI SDK
from google.cloud import aiplatform

try:
    aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)
    logging.info(f"Vertex AI SDK initialized successfully for project: {PROJECT_ID}, region: {REGION}, staging bucket: {BUCKET_URI}")
except Exception as e:
    logging.error(f"Failed to initialize Vertex AI SDK: {e}")
    logging.error("Please ensure your PROJECT_ID, REGION, and GCS_BUCKET_NAME are correct and accessible.")


print(f"PROJECT_ID: {PROJECT_ID}")
print(f"REGION: {REGION}")
print(f"GCS_BUCKET_NAME: {GCS_BUCKET_NAME}")
print(f"BUCKET_URI: {BUCKET_URI}")
print(f"BIGQUERY_DATASET: {BIGQUERY_DATASET}")

logging.info("Project variables defined. Check the log above to confirm the Vertex AI SDK initialized.")
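
Because the SDK initialization above may not catch every misconfiguration, a small pre-flight check can flag unreplaced placeholders and a doubled `gs://` prefix before any job is submitted. This is a sketch only: the helper names `normalize_bucket_uri` and `find_config_problems` are hypothetical, and the region pattern is a rough heuristic rather than an official format check.

```python
import re

PLACEHOLDERS = {"your-gcp-project-id", "your-ml-artifacts-bucket"}


def normalize_bucket_uri(bucket: str) -> str:
    """Accept either 'my-bucket' or 'gs://my-bucket/' and return 'gs://my-bucket'."""
    name = bucket.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    name = name.rstrip("/")
    if not name or "/" in name:
        raise ValueError(f"Expected a bare bucket name, got: {bucket!r}")
    return f"gs://{name}"


def find_config_problems(project_id: str, region: str, bucket: str) -> list:
    """Return human-readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    if project_id in PLACEHOLDERS:
        problems.append("PROJECT_ID still holds the placeholder value.")
    if bucket in PLACEHOLDERS:
        problems.append("GCS_BUCKET_NAME still holds the placeholder value.")
    # Heuristic: regions look like 'us-central1' or 'europe-west4'.
    if not re.fullmatch(r"[a-z]+(-[a-z]+)+\d+", region):
        problems.append(f"REGION {region!r} does not look like a region name.")
    return problems


print(normalize_bucket_uri("gs://my-ml-artifacts-bucket-123/"))
print(find_config_problems("your-gcp-project-id", "us-central1", "my-bucket"))
```

In the notebook, `BUCKET_URI = normalize_bucket_uri(GCS_BUCKET_NAME)` would tolerate either form of the bucket value.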

## Next Steps

You've successfully set up your environment! You are now ready to proceed with data exploration, model development, and other machine learning tasks within this project context.

* Proceed to `Data Exploration and Preparation` to start working with your data.

----

#### 2. Data Exploration & Preparation (EDA & Feature Engineering)

**Task:** Understand the data, clean it, transform it, and create new features.

- Notebook Content/Capabilities:
    - Data Loading:
        - **BigQuery:** Example code to query data from BigQuery tables using google-cloud-bigquery or pandas-gbq.
        - **Cloud Storage:** Example code to load files (CSV, JSON, Parquet, images) from GCS using google-cloud-storage or pandas.read_csv().
        - **Vertex AI Feature Store:** Boilerplate to connect to and retrieve features from the Feature Store.
    - **Data Cleaning & Transformation:** Demonstrate common pandas operations for handling missing values (df.fillna(), df.dropna()), outliers, data type conversions.
    - **Feature Engineering:** Examples of creating new columns, one-hot encoding, scaling, or using more advanced techniques.
    - **Visualization:** Example plots using matplotlib, seaborn, or plotly for distribution, correlation, and relationships.
    - **Saving Prepared Data:** Code to save processed data back to GCS or BigQuery for subsequent training stages.
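
The loading, cleaning, and feature engineering steps listed above can be sketched on a toy DataFrame as follows. The column names (`age`, `plan`, `monthly_spend`) and the helper names are invented for illustration. The BigQuery loader is defined but not called, since it needs a real project, and it assumes `google-cloud-bigquery` and `db-dtypes` are installed.

```python
import numpy as np
import pandas as pd


def load_from_bigquery(sql: str, project_id: str) -> pd.DataFrame:
    """Run a query and return the result as a DataFrame."""
    from google.cloud import bigquery  # lazy import: only needed when actually querying
    return bigquery.Client(project=project_id).query(sql).to_dataframe()


def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Clean the toy data and derive a few features."""
    out = df.copy()
    # Missing values: impute numeric gaps with the median, drop rows lacking a category.
    out["age"] = out["age"].fillna(out["age"].median())
    out = out.dropna(subset=["plan"])
    # New column: log transform to tame the skewed spend values.
    out["log_spend"] = np.log1p(out["monthly_spend"])
    # Scaling: standardize age to zero mean and unit variance.
    out["age_scaled"] = (out["age"] - out["age"].mean()) / out["age"].std()
    # One-hot encoding of the categorical column.
    out = pd.get_dummies(out, columns=["plan"], dtype=int)
    return out.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "age": [25.0, None, 40.0, 35.0],
        "plan": ["basic", "pro", None, "basic"],
        "monthly_spend": [10.0, 30.0, 20.0, 1000.0],
    }
)
prepared = prepare(raw)
print(prepared)

# Saving back to GCS would look roughly like this (needs gcsfs and pyarrow;
# BUCKET_URI comes from the setup section, and the path is a placeholder):
# prepared.to_parquet(f"{BUCKET_URI}/prepared/train.parquet")
```

In a real pipeline the imputation and scaling statistics would be computed on the training split only and then reused for validation and test data, to avoid leakage.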

----

#### 3. Model Development (Training & Experimentation)

**Task:** Select, train, and fine-tune machine learning models.

- Notebook Content/Capabilities:
    - **Model Definition:** Boilerplate for common model architectures (e.g., a simple scikit-learn model, a basic TensorFlow Keras model, or a PyTorch model).
    - **Training Loop:** A clear training loop structure, especially for deep learning.
    
    - Hyperparameter Tuning:
        - Guidance on manual tuning.
        - Example code to kick off a Vertex AI Vizier hyperparameter tuning job (often externalized to a Python script called from the notebook).

    - Experiment Tracking with Vertex AI Experiments:
        - **Crucial:** Include code to initialize an experiment run (aiplatform.init(), aiplatform.start_run()).
        - Log metrics (run.log_metrics()), parameters (run.log_params()), and artifacts such as plots or model checkpoints. This is vital for comparing different model iterations.
    
    - **Model Checkpointing:** Code to save model weights or the entire model periodically.
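
As a minimal sketch of the training and tracking steps above, the following runs a manual sweep over one hyperparameter with scikit-learn on synthetic data, and defines (but does not call) a helper that would record each run in Vertex AI Experiments. The function names, the synthetic dataset, and the swept values are all illustrative. The logging helper uses the module-level `aiplatform.log_params` and `aiplatform.log_metrics` calls and assumes valid project settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_and_score(C: float, random_state: int = 42):
    """Train a logistic regression on synthetic data and return (model, metrics)."""
    X, y = make_classification(
        n_samples=500, n_features=10, n_informative=5, random_state=random_state
    )
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    accuracy = float(accuracy_score(y_val, model.predict(X_val)))
    return model, {"val_accuracy": accuracy}


def log_to_vertex_experiment(project, location, experiment, run_name, params, metrics):
    """Record one run in Vertex AI Experiments (requires GCP access; not called here)."""
    from google.cloud import aiplatform  # lazy import so the sweep runs without the SDK

    aiplatform.init(project=project, location=location, experiment=experiment)
    aiplatform.start_run(run_name)
    try:
        aiplatform.log_params(params)
        aiplatform.log_metrics(metrics)
    finally:
        aiplatform.end_run()


# Manual tuning: sweep the regularization strength and keep the best value.
results = {}
for C in (0.01, 0.1, 1.0):
    _, metrics = train_and_score(C)
    results[C] = metrics["val_accuracy"]
    # log_to_vertex_experiment(PROJECT_ID, REGION, "demo-experiment", f"lr-c-{C}",
    #                          {"C": C}, metrics)

best_C = max(results, key=results.get)
print(f"Validation accuracy by C: {results}; best C: {best_C}")
```

Logging parameters and metrics from the same loop that trains the model keeps every iteration comparable in the Experiments UI without extra bookkeeping.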

----

#### 4. Model Evaluation

**Task:** Assess model performance, understand its strengths and weaknesses, and identify biases.
- Notebook Content/Capabilities:
   - **Prediction:** Code to make predictions on a held-out test set.
   - **Metric Calculation:** Examples of calculating standard metrics (e.g., accuracy_score, precision_score, recall_score, f1_score for classification; mean_squared_error, r2_score for regression).
   - **Visualization of Results:** Confusion matrices, ROC curves, precision-recall curves, residual plots, etc.
   - **Vertex Explainable AI:** Boilerplate to generate feature importances or saliency maps for predictions (e.g., integrated gradients, XAI API calls). This helps data scientists understand "why" a model made a certain prediction.
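
The metric calculations named above can be illustrated with scikit-learn on a handful of hand-written labels. The label and prediction values are made up so that the results are easy to verify by hand.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    mean_squared_error,
    precision_score,
    r2_score,
    recall_score,
)

# Classification: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

classification_report = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
matrix = confusion_matrix(y_true, y_pred)

# Regression: small residuals around a near-perfect fit.
y_reg_true = [3.0, 5.0, 7.0]
y_reg_pred = [2.5, 5.0, 7.5]
regression_report = {
    "mse": mean_squared_error(y_reg_true, y_reg_pred),
    "r2": r2_score(y_reg_true, y_reg_pred),
}

print(classification_report)
print(matrix)
print(regression_report)
```

With one false positive and one false negative out of eight samples, accuracy, precision, recall, and F1 all come out to 0.75, which makes this a convenient sanity check before running the same calls on real predictions.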

----

#### 5. Model Deployment Preparation & Local Testing

**Task:** Prepare the trained model for deployment, often by saving it in a deployable format and testing predictions locally.
- Notebook Content/Capabilities:
    - **Model Serialization:** Code to save the trained model in a format suitable for Vertex AI (e.g., TensorFlow SavedModel, PyTorch state_dict, scikit-learn joblib).
    - **Local Prediction Test:** A small function or script to load the saved model and make local predictions to ensure it's working as expected before full deployment.
    - **Vertex AI Model Registry Integration:** Boilerplate to upload the trained model to the Model Registry, including model versions and metadata.
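
A minimal sketch of the serialize, reload, and test loop for a scikit-learn model follows. The tiny training set is invented, and `upload_to_registry` is a hypothetical wrapper around `aiplatform.Model.upload` that is defined but not called here. Its arguments (display name, artifact URI, serving container image) are placeholders to fill in from your own project. The file name `model.joblib` follows what Vertex AI's prebuilt scikit-learn containers are documented to expect, which is worth confirming against the current documentation.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Train a trivially small model so the example runs in well under a second.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.joblib"
    joblib.dump(model, model_path)        # serialize
    restored = joblib.load(model_path)    # reload as a deployment would

    # Local prediction test: the restored model must agree with the original.
    sample = np.array([[0.5], [2.5]])
    local_predictions_match = bool(
        np.array_equal(model.predict(sample), restored.predict(sample))
    )

print(f"Restored model reproduces predictions: {local_predictions_match}")


def upload_to_registry(display_name, artifact_uri, serving_container_image_uri,
                       parent_model=None):
    """Upload a saved model directory in GCS to the Vertex AI Model Registry.

    Pass `parent_model` (an existing model's resource name) to add a new version
    instead of creating a new model. Requires GCP access.
    """
    from google.cloud import aiplatform  # lazy import: only needed when uploading

    return aiplatform.Model.upload(
        display_name=display_name,
        artifact_uri=artifact_uri,
        serving_container_image_uri=serving_container_image_uri,
        parent_model=parent_model,
    )
```

Running the round trip locally first catches serialization problems, such as version mismatches or missing custom classes, before any cloud resources are created.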

----

#### 6. MLOps Integration & Orchestration (Initiated from Notebook)

**Task:** Transition from local notebook experimentation to automated, scalable ML workflows.
- Notebook Content/Capabilities:
    - **Custom Training Job Submission:** Example code to take the training script developed in the notebook and submit it as a custom training job on Vertex AI (e.g., specifying machine type, accelerators, custom container). This allows training on larger datasets and more powerful infrastructure.
    - **Vertex AI Pipeline Triggering:** While pipeline definitions are usually in separate Python files, the notebook can include cells to:
       - Compile a pre-defined pipeline.
       - Submit/run a Vertex AI Pipeline, passing parameters like model version or data path. This enables automation of the full ML workflow (data prep -> training -> evaluation -> deployment).
   - Model Deployment to Endpoint:
       - Example code to take a model from the Model Registry and deploy it to a Vertex AI Endpoint for online predictions, including specifying machine types and autoscaling settings.
       - Initial calls to test the deployed endpoint.
   - **Model Monitoring Configuration:** Code snippets to configure Vertex AI Model Monitoring for the deployed endpoint, setting up alerts for data drift or prediction drift.
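
The job submission, pipeline run, and deployment steps above might be wrapped as in the sketch below. It assumes `aiplatform.init(...)` has already been called as in the setup section. The `TrainingSpec` dataclass and all function names are hypothetical, the machine types are example values, and the three cloud functions are defined but not called because they create billable resources. Model Monitoring configuration is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TrainingSpec:
    """Illustrative bundle of settings for a custom training job."""
    display_name: str
    script_path: str
    container_uri: str
    machine_type: str = "n1-standard-4"
    accelerator_type: Optional[str] = None  # e.g. "NVIDIA_TESLA_T4"
    accelerator_count: int = 0

    def run_kwargs(self) -> dict:
        """Build keyword arguments for CustomTrainingJob.run()."""
        kwargs = {"replica_count": 1, "machine_type": self.machine_type}
        if self.accelerator_type and self.accelerator_count > 0:
            kwargs["accelerator_type"] = self.accelerator_type
            kwargs["accelerator_count"] = self.accelerator_count
        return kwargs


def submit_training_job(spec: TrainingSpec):
    """Package the training script and run it on Vertex AI infrastructure."""
    from google.cloud import aiplatform
    job = aiplatform.CustomTrainingJob(
        display_name=spec.display_name,
        script_path=spec.script_path,
        container_uri=spec.container_uri,
    )
    return job.run(**spec.run_kwargs())


def run_pipeline(template_path: str, pipeline_root: str, parameter_values: dict):
    """Submit a pipeline that was already compiled to `template_path`."""
    from google.cloud import aiplatform
    job = aiplatform.PipelineJob(
        display_name="training-pipeline",
        template_path=template_path,
        pipeline_root=pipeline_root,
        parameter_values=parameter_values,
    )
    job.submit()
    return job


def deploy_and_smoke_test(model_resource_name: str, instances: list):
    """Deploy a registered model to an endpoint and send one test request."""
    from google.cloud import aiplatform
    model = aiplatform.Model(model_name=model_resource_name)
    endpoint = model.deploy(
        machine_type="n1-standard-2", min_replica_count=1, max_replica_count=2
    )
    return endpoint.predict(instances=instances).predictions


cpu_spec = TrainingSpec("demo-train", "train.py", "your-training-container-uri")
print(cpu_spec.run_kwargs())
```

Keeping the job settings in one small object makes it easy to switch between a cheap CPU configuration during development and a GPU configuration for full training runs.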

----

#### 7. Collaboration & Version Control

**Task:** Work collaboratively and manage different versions of code and notebooks.
- Notebook Configuration/Guidance:
   - **Git Integration:** Pre-configure the notebook with Git extensions or provide instructions on how to clone repositories, commit changes, and push to remote (GitHub, GitLab, Cloud Source Repositories).
   - **Shared Drives:** If using shared GCS buckets for data or models, ensure appropriate permissions and clear paths.
   - **Markdown for Documentation:** Encourage data scientists to use Markdown cells extensively for documenting their thought process, findings, and code explanations.

----