# Cloud Run LLM Serving with Cloud Storage FUSE

This notebook provides an end-to-end solution for deploying large language models on [Google Cloud Run](https://cloud.google.com/run/docs/), leveraging [Cloud Storage](https://cloud.google.com/storage/docs/) and [Cloud Storage FUSE](https://cloud.google.com/storage/docs/cloud-storage-fuse/overview) for efficient model management.

By decoupling model weights from the container image, you can:
*   **Deploy models of any size** without being constrained by container image size limits.
*   **Rapidly iterate on your application code** without re-building and uploading large model files.
*   **Share a single model artifact** across multiple services or applications.

This guide walks you through two main stages:
1.  A **Cloud Run Job** to download a model from the Hugging Face Hub and store it in a GCS bucket.
2.  A **Cloud Run Service** that mounts the GCS bucket using GCS FUSE and serves the model with [Ollama](https://ollama.com/).

This approach provides a scalable, cost-effective, and flexible way to serve large models on Google Cloud.

### About the Example Model
This notebook uses [`unsloth/gemma-3n-E4B-it-GGUF`](https://huggingface.co/unsloth/gemma-3n-E4B-it-GGUF) as the example. This is a lightweight, 4-billion parameter multimodal model from Google's Gemma family, optimized by Unsloth for efficient performance. While this guide is ideal for very large models, the principles apply to models of any size.

## Section 1: Download Model to GCS with a Cloud Run Job

In [None]:
# Install Python packages required for this notebook
%pip install --upgrade huggingface_hub hf_transfer --quiet

# Specific imports for functionality used directly in sections
from google.colab import auth
import re

print("Python environment setup complete. All necessary packages installed and libraries imported.")

### 1.1. Configure Google Cloud and Notebook Settings
This section defines all the necessary parameters for the deployment, including your Google Cloud project, region, and the specifics of the Hugging Face model to be deployed. The Cloud Run Job name, its service account name, and the model's path prefix inside the GCS bucket are derived from the Hugging Face repository ID, so switching models only requires changing `HF_REPO_ID`.
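
As a rough sketch of that naming scheme, the snippet below mirrors the expressions used in the configuration cell. The helper name `derive_resource_names` is hypothetical and not part of the notebook. The 23-character truncation appears to be chosen so that the job name plus the `-job` and `-sa` suffixes fits within the 30-character limit on service account IDs; that rationale is an inference, not something the notebook states.

```python
import re


def derive_resource_names(hf_repo_id: str, job_suffix: str = "-job") -> dict:
    """Hypothetical helper mirroring the notebook's name derivation."""
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    clean = re.sub(r"[^a-zA-Z0-9]+", "-", hf_repo_id).lower()
    # Keep the first 23 characters, then append the job suffix.
    job_name = f"{clean[:23]}{job_suffix}"
    return {
        "repo_name_clean": clean,
        "job_name": job_name,
        "job_sa_name": f"{job_name}-sa",
        "gcs_model_path_prefix": f"{clean}-model/",
    }


names = derive_resource_names("unsloth/gemma-3n-E4B-it-GGUF")
for key, value in names.items():
    print(f"{key}: {value}")
```

If you substitute a different repository, check that the truncated name does not end in a hyphen, since that would produce a double hyphen before the suffix.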

In [None]:
# @markdown #### **1. Google Cloud Project and Region Configuration**
# @markdown Enter your Google Cloud Project ID and desired deployment region.
PROJECT_ID = "your-gcp-project-id" # @param {type:"string"}
REGION = "us-central1" # @param {type:"string"}

# @markdown #### **2. Google Cloud Storage Bucket for Model Storage**
# @markdown Enter the name of an existing GCS bucket where the model files will be stored.
# @markdown The model files will be placed in a subfolder within this bucket.
GCS_BUCKET_NAME = "your-bucket-name" # @param {type:"string"}

# @markdown #### **3. Artifact Registry Repository for Docker Images**
# @markdown Enter the name of the Artifact Registry repository for your Docker images.
ARTIFACT_REGISTRY_REPO = "docker" # @param {type:"string"}

# --- IMPORTANT VALIDATION ---
if PROJECT_ID == "your-gcp-project-id" or not PROJECT_ID:
    print("ERROR: Please update 'PROJECT_ID' in this cell with your actual Google Cloud Project ID.")
    print("Execution aborted. Please fix the PROJECT_ID and re-run this cell.")
    raise SystemExit("PROJECT_ID not set.")

if GCS_BUCKET_NAME == "your-bucket-name" or not GCS_BUCKET_NAME:
    print("ERROR: Please update 'GCS_BUCKET_NAME' in this cell with the name of your GCS bucket.")
    print("Execution aborted. Please fix the GCS_BUCKET_NAME and re-run this cell.")
    raise SystemExit("GCS_BUCKET_NAME not set.")
# --- END IMPORTANT VALIDATION ---

# @markdown #### **4. Cloud Run Job Settings (for Model Transfer)**
# @markdown This suffix is appended to the name of the job that copies model files from Hugging Face to GCS.
JOB_NAME_SUFFIX = "-job"

# @markdown #### **5. Hugging Face Model Details**
# @markdown Specify the Hugging Face repository and an optional file pattern for the model files.
HF_REPO_ID = "unsloth/gemma-3n-E4B-it-GGUF" # @param {type:"string"}
# Cleaned version of HF_REPO_ID for use in resource names
HF_REPO_NAME_CLEAN = re.sub(r'[^a-zA-Z0-9]+', '-', HF_REPO_ID).lower()

JOB_NAME = f"{HF_REPO_NAME_CLEAN[:23]}{JOB_NAME_SUFFIX}"

# @markdown CPU and memory for the transfer job. The memory setting leaves headroom for repository metadata and many small files from Hugging Face.
# @markdown Note: This may require your project to use the Gen2 execution environment in the specified region.
JOB_CPU = 2 # @param {type:"integer"}
JOB_MEMORY_GI = 4 # @param {type:"integer"}

# @markdown Optional: Specify a pattern to filter files from the Hugging Face repository (e.g., `*.gguf`, `model.safetensors`).
# @markdown Leave blank to download all files.
HF_MODEL_FILE_PATTERN = "*Q4_K_XL*" # @param {type:"string"}

# @markdown ---
# @markdown **Important: Set your Hugging Face API Token in Colab Secrets!**
# @markdown To ensure reliable and faster downloads from Hugging Face Hub, it is highly recommended to provide an API token.
# @markdown 1. Go to [Hugging Face Settings -> Access Tokens](https://huggingface.co/settings/tokens).
# @markdown 2. Create a new token with "read" role.
# @markdown 3. In Colab, click on the "🔑" (Secrets) icon on the left sidebar.
# @markdown 4. Add a new secret named `HF_TOKEN` and paste your Hugging Face API token as the value.
# @markdown 5. Make sure to enable "Notebook access" for this secret. The next section will use it to create a Google Cloud Secret.

# @markdown #### **6. Cleanup Confirmation**
# @markdown Set to `True` to confirm resource deletion in the final cleanup section.
CONFIRM_DELETE = False # @param {type:"boolean"}

# Derived Docker Image Names (using Artifact Registry regional hostname)
# The format is REGION-docker.pkg.dev/PROJECT_ID/REPOSITORY_NAME/IMAGE_NAME:TAG
JOB_IMAGE_NAME = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{ARTIFACT_REGISTRY_REPO}/{JOB_NAME}:latest"

# Derived Cloud Run Service Accounts
JOB_SA_NAME = f"{JOB_NAME}-sa"
JOB_SA_EMAIL = f"{JOB_SA_NAME}@{PROJECT_ID}.iam.gserviceaccount.com"

# Define the path within the GCS bucket for the model files (constant)
GCS_MODEL_PATH_PREFIX = f"{HF_REPO_NAME_CLEAN}-model/" # Derived from the repository ID; the trailing slash is required

# Configure `gcloud` CLI with your project and region
!gcloud config set project {PROJECT_ID}
!gcloud config set run/region {REGION}

# Authenticate with Google Cloud
print("Authenticating with Google Cloud. Please follow the instructions in the new browser tab.")
auth.authenticate_user()
print("Authentication complete.")

# Enable necessary Google Cloud APIs
print("Enabling required Google Cloud APIs. This may take a moment...")
# secretmanager.googleapis.com is needed for the Cloud Run Job to access HF_TOKEN
!gcloud services enable run.googleapis.com artifactregistry.googleapis.com cloudbuild.googleapis.com secretmanager.googleapis.com --project={PROJECT_ID}
print("Required APIs are enabled.")

print("\nAll configuration and initial cloud setup steps are complete.")

### 1.2. Securely Store Hugging Face Token in Google Cloud
To ensure your Hugging Face token is handled securely, this step creates a secret in Google Cloud Secret Manager. The Cloud Run job will later access this secret to authenticate with Hugging Face.
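
One possible alternative to interpolating the token into a shell command is to pass it to `gcloud` over standard input, so the value never appears in the command line itself. The sketch below assumes an authenticated `gcloud` CLI; the helper names `build_add_version_cmd` and `add_secret_version` are hypothetical.

```python
import subprocess


def build_add_version_cmd(project_id: str, secret_name: str = "HF_TOKEN") -> list:
    """Return the gcloud argv that adds a secret version read from stdin."""
    return [
        "gcloud", "secrets", "versions", "add", secret_name,
        "--data-file=-",  # "-" tells gcloud to read the secret value from stdin
        f"--project={project_id}",
    ]


def add_secret_version(token: str, project_id: str) -> None:
    """Send the token over stdin instead of embedding it in the command."""
    subprocess.run(
        build_add_version_cmd(project_id),
        input=token.strip().encode("utf-8"),
        check=True,  # raise CalledProcessError if gcloud exits non-zero
    )


# Example call (requires an authenticated gcloud CLI; not executed here):
# add_secret_version(HF_TOKEN_VALUE, PROJECT_ID)
print(" ".join(build_add_version_cmd("your-gcp-project-id")))
```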

In [None]:
# Load the Hugging Face token from Colab's user secrets
try:
    from google.colab import userdata
    HF_TOKEN_VALUE = userdata.get('HF_TOKEN')
    if not HF_TOKEN_VALUE:
        raise ValueError("HF_TOKEN not found in Colab secrets. Please add it via the '🔑' icon on the left.")
    print("Successfully loaded HF_TOKEN from Colab secrets.")
except Exception as e:
    print(f"ERROR: Could not load HF_TOKEN from Colab secrets. Please add it via the '🔑' icon on the left.")
    raise SystemExit(e)

# Create the secret in Google Cloud Secret Manager if it doesn't exist
# If it exists, this command will fail gracefully.
print("Attempting to create the 'HF_TOKEN' secret in Google Cloud Secret Manager...")
!gcloud secrets create HF_TOKEN --replication-policy="automatic" --project={PROJECT_ID} --quiet || echo "Secret 'HF_TOKEN' likely already exists. Continuing."

# Add the token value as a new version of the secret. Re-running is safe, but each run creates another version.
print("Adding the token value as the latest version of the 'HF_TOKEN' secret...")
# The `tr -d '\n'` removes any trailing newline characters that might interfere with the token.
!echo -n "{HF_TOKEN_VALUE}" | tr -d '\n' | gcloud secrets versions add HF_TOKEN --data-file=- --project={PROJECT_ID}

print("\nGoogle Cloud Secret 'HF_TOKEN' is now configured.")

### 1.3. Create Artifact Registry Repository
This step ensures that a Docker repository exists in Google Cloud's Artifact Registry. This repository will store the container image for our Cloud Run job.

In [None]:
print(f"Ensuring Artifact Registry repository '{ARTIFACT_REGISTRY_REPO}' exists in region '{REGION}'...")
# The `--async` flag returns immediately instead of blocking until the repository is created.
# If the repository already exists, the command fails with an "already exists" error, which is safe to ignore.
!gcloud artifacts repositories create {ARTIFACT_REGISTRY_REPO} \
    --repository-format=docker \
    --location={REGION} \
    --description="Docker repository for Hugging Face model images" \
    --project={PROJECT_ID} \
    --async

print(f"Artifact Registry repository creation command sent. It should be ready shortly.")

### 1.4. Define the Model Transfer Cloud Run Job
This section creates the necessary files for our Cloud Run job, which is responsible for transferring the model from Hugging Face to GCS. This includes the Python script, the Dockerfile for containerization, and the build configuration.
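
The job script combines two ideas: filtering the repository's file list with a glob pattern, and capping the number of simultaneous transfers with a semaphore. A minimal, self-contained sketch of that pattern follows. The file names are made up for illustration, and `asyncio.sleep` stands in for the real streaming transfer.

```python
import asyncio
import fnmatch


async def transfer_all(file_names, pattern, max_concurrent):
    """Filter file names by glob pattern, then 'transfer' them with bounded concurrency."""
    if pattern:
        targets = [f for f in file_names if fnmatch.fnmatch(f, pattern)]
    else:
        targets = list(file_names)

    semaphore = asyncio.Semaphore(max_concurrent)
    active = 0  # transfers currently inside the semaphore
    peak = 0    # highest number of simultaneous transfers observed

    async def transfer(name):
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for streaming one file
            active -= 1
        return name

    # gather preserves input order and raises on the first failure.
    done = await asyncio.gather(*(transfer(n) for n in targets))
    return list(done), peak


# Hypothetical file listing, for illustration only.
files = [
    "README.md",
    "model-Q4_K_XL.gguf",
    "model-Q8_0.gguf",
    "model-UD-Q4_K_XL-part2.gguf",
    "extra-Q4_K_XL.gguf",
]
transferred, peak = asyncio.run(transfer_all(files, "*Q4_K_XL*", max_concurrent=2))
print(transferred)
print(f"Peak concurrency: {peak}")
```

Because the semaphore is acquired inside each task, all tasks can be created up front while only `max_concurrent` of them do work at any moment.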

In [None]:
%%writefile copy_model_job.py
import os
import logging
import asyncio
from urllib.parse import urlparse
from huggingface_hub import HfApi, hf_hub_url
import obstore as obs
from obstore.store import GCSStore, HTTPStore
import fnmatch
from typing import Dict, Any, Tuple, List

# --- Configuration Constants (with Environment Variable Overrides) ---
# Maximum number of concurrent file streams.
MAX_CONCURRENT_FILES = int(os.getenv("MAX_CONCURRENT_FILES", "12"))
# Timeout for each individual file download in seconds.
INDIVIDUAL_FILE_TIMEOUT_SECONDS = int(os.getenv("INDIVIDUAL_FILE_TIMEOUT_SECONDS", 4 * 60 * 60)) # Default: 4 hours
# Log progress every N megabytes.
PROGRESS_LOG_INTERVAL_MB = int(os.getenv("PROGRESS_LOG_INTERVAL_MB", "100"))

# --- Logging Configuration ---
logging.basicConfig(
    level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


async def progress_stream_wrapper(streaming_response, log_prefix, file_name):
    """An async generator that wraps a download stream to report progress."""
    try:
        total_size = streaming_response.meta.get("size")
        total_size_mb = total_size / (1024 * 1024) if total_size else None
    except Exception:
        total_size = None
        total_size_mb = None

    bytes_processed = 0
    last_logged_mb = 0

    async for chunk in streaming_response:
        yield chunk
        bytes_processed += len(chunk)

        # Log progress periodically
        processed_mb = bytes_processed / (1024 * 1024)
        if (processed_mb - last_logged_mb) >= PROGRESS_LOG_INTERVAL_MB:
            last_logged_mb = processed_mb
            if total_size_mb:
                percent = (bytes_processed / total_size) * 100
                logger.info(
                    f"{log_prefix} -> Progress for '{file_name}': "
                    f"{processed_mb:.2f} MB / {total_size_mb:.2f} MB ({percent:.1f}%)"
                )
            else:
                logger.info(
                    f"{log_prefix} -> Progress for '{file_name}': {processed_mb:.2f} MB"
                )


async def stream_file_to_gcs(
    file_name: str,
    config: Dict[str, Any],
    stores: Tuple[GCSStore, HTTPStore],
    semaphore: asyncio.Semaphore,
    file_index: int,
    total_files: int,
):
    """Streams a file from a Hugging Face repo directly to GCS using obstore."""
    gcs_store, http_store = stores
    gcs_destination_path = os.path.join(config["gcs_path_prefix"], file_name)
    log_prefix = f"[{file_index + 1}/{total_files}]"

    async with semaphore:
        logger.info(
            f"{log_prefix} Starting stream of '{file_name}' to "
            f"GCS: gs://{config['gcs_bucket_name']}/{gcs_destination_path}"
        )
        try:
            full_download_url = hf_hub_url(
                repo_id=config["hf_repo_id"], filename=file_name, revision="main"
            )
            download_path = urlparse(full_download_url).path

            streaming_response = await obs.get_async(http_store, download_path)

            progress_stream = progress_stream_wrapper(
                streaming_response, log_prefix, file_name
            )

            await obs.put_async(gcs_store, gcs_destination_path, progress_stream)

            logger.info(
                f"{log_prefix} Successfully streamed '{file_name}' to GCS."
            )

        except Exception as e:
            logger.error(
                f"{log_prefix} FATAL: Error processing '{file_name}': {e}",
                exc_info=True,
            )
            # Re-raising the exception will cause asyncio.gather to fail.
            raise


def _get_job_config() -> Dict[str, Any]:
    """Reads and validates all configuration from environment variables."""
    logger.info("Verifying environment variables...")
    config = {
        "hf_repo_id": os.getenv("HF_REPO_ID"),
        "hf_model_file_pattern": os.getenv("HF_MODEL_FILE_PATTERN"), # Optional
        "gcs_bucket_name": os.getenv("GCS_BUCKET_NAME"),
        "gcs_path_prefix": os.getenv("GCS_MODEL_PATH_PREFIX"),
        "hf_token": None
    }

    hf_token_file_path = "/etc/secrets/hf-token/HF_TOKEN"
    if os.path.exists(hf_token_file_path):
        try:
            with open(hf_token_file_path, "r") as f:
                config["hf_token"] = f.read().strip()
                logger.info("Hugging Face token loaded successfully from mounted secret.")
        except Exception as e:
            raise IOError(f"Error reading HF_TOKEN from mounted file: {e}") from e
    else:
        logger.warning(
            f"HF_TOKEN file not found at {hf_token_file_path}. "
            f"Downloads will be unauthenticated and might be slower or fail."
        )

    required_vars = ["hf_repo_id", "gcs_bucket_name", "gcs_path_prefix"]
    if not all(config[key] for key in required_vars):
        raise ValueError("Missing one or more required environment variables: "
                         "HF_REPO_ID, GCS_BUCKET_NAME, GCS_MODEL_PATH_PREFIX.")

    logger.info("All required environment variables are set.")
    return config


def _initialize_clients(config: Dict[str, Any]) -> Tuple[GCSStore, HTTPStore]:
    """Initializes and returns HTTP and GCS stores."""
    logger.info("Initializing obstore HTTP and GCS stores...")
    try:
        # Explicitly build the client_options dictionary ensuring all values
        # are in a format the Rust layer can handle directly (strings).
        client_options: Dict[str, Any] = {
            # Convert the integer timeout to a "human-readable duration string"
            "timeout": f"{INDIVIDUAL_FILE_TIMEOUT_SECONDS}s",
        }
        if config["hf_token"]:
            # The correct key for headers is 'default_headers'.
            client_options["default_headers"] = {
                "Authorization": f"Bearer {config['hf_token']}"
            }

        # Pass the fully constructed dictionary to the 'client_options' argument.
        http_store = HTTPStore.from_url(
            "https://huggingface.co", client_options=client_options
        )

        gcs_store = GCSStore(bucket=config["gcs_bucket_name"])
        logger.info(
            "Successfully initialized stores for Hugging Face and GCS bucket: "
            f"{config['gcs_bucket_name']}"
        )
        return gcs_store, http_store
    except Exception as e:
        raise ConnectionError(f"Failed to initialize obstore clients: {e}") from e


def _get_target_files(api: HfApi, config: Dict[str, Any]) -> List[str]:
    """Lists files from the repo and filters them based on the pattern."""
    repo_id = config["hf_repo_id"]
    pattern = config["hf_model_file_pattern"]
    logger.info(f"Listing files in Hugging Face repo '{repo_id}'...")
    try:
        files_info = api.list_repo_files(repo_id=repo_id, token=config["hf_token"])
        logger.info(f"Successfully listed {len(files_info)} files from repo '{repo_id}'.")
    except Exception as e:
        raise ConnectionError(f"Failed to list files from repo '{repo_id}': {e}") from e

    if pattern:
        return [f for f in files_info if fnmatch.fnmatch(f, pattern)]
    return files_info


async def main():
    """Main execution function for the Cloud Run Job."""
    logger.info("Cloud Run Job started.")
    try:
        # 1. Setup and Configuration
        config = _get_job_config()
        stores = _initialize_clients(config)
        api = HfApi()

        # 2. Get the list of files to transfer
        target_files = _get_target_files(api, config)

        if not target_files:
            pattern = config["hf_model_file_pattern"]
            logger.warning(
                f"No files found matching pattern '{pattern}' in repo "
                f"'{config['hf_repo_id']}'. Job will complete successfully."
            )
            return

        pattern_info = (f"matching pattern '{config['hf_model_file_pattern']}'"
                        if config['hf_model_file_pattern'] else "in repo")
        logger.info(
            f"Found {len(target_files)} files {pattern_info}. "
            f"Proceeding to download and upload concurrently."
        )

        # 3. Create and run concurrent download/upload tasks
        semaphore = asyncio.Semaphore(MAX_CONCURRENT_FILES)
        logger.info(f"Limiting concurrent files to {MAX_CONCURRENT_FILES} at a time.")

        tasks = [
            stream_file_to_gcs(
                file_name=file_name,
                config=config,
                stores=stores,
                semaphore=semaphore,
                file_index=i,
                total_files=len(target_files),
            )
            for i, file_name in enumerate(target_files)
        ]

        await asyncio.gather(*tasks, return_exceptions=False)

        logger.info(
            "All model files have been processed and are now in GCS. "
            "Job completed successfully."
        )

    except Exception as e:
        logger.error(f"FATAL: Job failed during execution: {e}", exc_info=True)
        # Exit non-zero so Cloud Run marks the task as failed.
        raise SystemExit(1)


if __name__ == "__main__":
    asyncio.run(main())

In [None]:
%%writefile Dockerfile
FROM python:3.13-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY copy_model_job.py .

CMD ["python", "copy_model_job.py"]

In [None]:
%%writefile requirements.txt
google-cloud-storage
huggingface-hub
obstore

In [None]:
%%writefile .gcloudignore
*

!copy_model_job.py
!Dockerfile
!requirements.txt

In [None]:
CLOUDBUILD_JOB_YAML_CONTENT = f"""
steps:
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', '{JOB_IMAGE_NAME}', '-f', 'Dockerfile', '.']
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', '{JOB_IMAGE_NAME}']
images:
- {JOB_IMAGE_NAME}
"""

with open("cloudbuild.job.yaml", "w") as f:
    f.write(CLOUDBUILD_JOB_YAML_CONTENT)

print("cloudbuild.job.yaml created successfully.")

### 1.5. Build and Execute the Cloud Run Job
This section handles the final steps of the deployment process. It creates a dedicated service account for the job, assigns the necessary permissions, builds the Docker image using Cloud Build, and then deploys and runs the job on Cloud Run.

In [None]:
# Create a service account for the Cloud Run Job
print(f"Creating service account for Cloud Run Job: {JOB_SA_EMAIL}...")
!gcloud iam service-accounts create {JOB_SA_NAME} \
    --display-name="{HF_REPO_ID} Model Transfer Job SA" \
    --project={PROJECT_ID} --quiet

print("Granting necessary permissions to the Cloud Run Job service account...")
# Grant Storage Object Admin role, which includes create, get, list, and delete permissions for objects.
!gcloud projects add-iam-policy-binding {PROJECT_ID} \
    --member="serviceAccount:{JOB_SA_EMAIL}" \
    --role="roles/storage.objectAdmin" \
    --condition=None --quiet > /dev/null

# Grant Secret Manager Secret Accessor role for HF_TOKEN
print("Granting Secret Manager Secret Accessor role to the job service account...")
!gcloud projects add-iam-policy-binding {PROJECT_ID} \
    --member="serviceAccount:{JOB_SA_EMAIL}" \
    --role="roles/secretmanager.secretAccessor" \
    --condition=None --quiet > /dev/null

In [None]:
# Build the Docker image for the Cloud Run Job using Cloud Build
print(f"Submitting build job for Cloud Run Job image: {JOB_IMAGE_NAME}. This may take a few minutes.")
# The `.` at the end will now respect the .gcloudignore file
!gcloud builds submit --config cloudbuild.job.yaml --project={PROJECT_ID} .

In [None]:
!gcloud beta run jobs deploy {JOB_NAME} \
    --image {JOB_IMAGE_NAME} \
    --region {REGION} \
    --cpu {JOB_CPU} \
    --memory {JOB_MEMORY_GI}Gi \
    --service-account {JOB_SA_EMAIL} \
    --labels dev-tutorial=notebook-gcsfuse \
    --set-env-vars "HF_REPO_ID={HF_REPO_ID},HF_MODEL_FILE_PATTERN={HF_MODEL_FILE_PATTERN},GCS_BUCKET_NAME={GCS_BUCKET_NAME},GCS_MODEL_PATH_PREFIX={GCS_MODEL_PATH_PREFIX}" \
    --set-secrets=/etc/secrets/hf-token/HF_TOKEN=HF_TOKEN:latest \
    --project={PROJECT_ID} \
    --task-timeout 24h \
    --execute-now

## Section 2: Host the Model with a Cloud Run Service

### 2.1. Create the Service Configuration Files

This next section creates the configuration files for the Cloud Run service. These files tell the service how to load and serve the model.

*   **`start.sh`**: This is the entrypoint script for the container. It performs the following steps:
    1.  Starts the Ollama server in the background.
    2.  Calculates the SHA256 hash of the model file.
    3.  Creates a symbolic link from the model file to the Ollama blobs directory, named `sha256-<hash>`. This is a crucial step that allows Ollama to discover and use the model from the GCS FUSE mount.
    4.  Creates the model manifest using the `ollama create` command.
    5.  Waits on the server process so that it keeps running in the foreground.
*   **`Modelfile`**: This file defines the model that Ollama will serve. It simply points to the model file that is mounted from GCS.
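
The hash-and-link step that `start.sh` performs in shell can be sketched in Python as follows. This is an illustration of the technique only: the function names are hypothetical, and a temporary directory with a small stand-in file replaces the real model and the `/var/lib/ollama/blobs` directory.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in fixed-size chunks so large models are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def link_model_as_blob(model_path: str, blobs_dir: str) -> str:
    """Create blobs_dir/sha256-<hash> as a symlink to model_path, if not already present."""
    blob_path = os.path.join(blobs_dir, f"sha256-{sha256_of_file(model_path)}")
    os.makedirs(blobs_dir, exist_ok=True)
    if not os.path.lexists(blob_path):
        os.symlink(model_path, blob_path)
    return blob_path


with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.gguf")
    with open(model_path, "wb") as f:
        f.write(b"stand-in model bytes")

    blob_path = link_model_as_blob(model_path, os.path.join(tmp, "blobs"))
    blob_name = os.path.basename(blob_path)
    is_link = os.path.islink(blob_path)
    link_target = os.readlink(blob_path)
    print(f"{blob_name} -> {link_target}")
```

Note that hashing reads the entire file once, so on a multi-gigabyte model read through the FUSE mount this step adds to container startup time.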

In [None]:
!mkdir -p ollama_service

In [None]:
import os
import sys

gcs_path_pattern = f"gs://{GCS_BUCKET_NAME}/{GCS_MODEL_PATH_PREFIX}{HF_MODEL_FILE_PATTERN}"
print(f"Searching for model file in GCS: {gcs_path_pattern}")

gsutil_output = !gsutil ls {gcs_path_pattern} 2>/dev/null
if not gsutil_output:
    print("ERROR: No files found in GCS matching the pattern. Ensure the transfer job is complete.")
    sys.exit("Aborting.")

MODEL_GCS_PATH = gsutil_output[0]
MODEL_FILENAME = os.path.basename(MODEL_GCS_PATH)
print(f"Successfully identified model file in GCS: {MODEL_GCS_PATH}")

In [None]:
# This is the full, absolute path to the model file *inside the container*.
full_model_path_in_container = f"/models/{GCS_MODEL_PATH_PREFIX}{MODEL_FILENAME}"

# --- Create start.sh Script Content ---
# The f-string must start immediately with #!/bin/sh: a leading newline
# would push the shebang off the first line and break the entrypoint.
START_SH_CONTENT = f"""#!/bin/sh

# This path was dynamically inserted by the deployment notebook.
MODEL_FILE_PATH="{full_model_path_in_container}"
MODEL_NAME="gemma-3n-custom"
BLOBS_DIR="/var/lib/ollama/blobs"

echo "Starting Ollama server in the background..."
ollama serve &
sleep 3

echo "Calculating SHA256 for $MODEL_FILE_PATH..."
MODEL_SHA256=$(sha256sum "$MODEL_FILE_PATH" | awk '{{print $1}}')

if [ -z "$MODEL_SHA256" ]; then
    echo "ERROR: Failed to calculate SHA256 hash for the model file."
    exit 1
fi
echo "SHA256 calculated: $MODEL_SHA256"

BLOB_PATH="$BLOBS_DIR/sha256-$MODEL_SHA256"

echo "Checking if blob already exists or needs a symlink..."
if [ ! -e "$BLOB_PATH" ]; then
    echo "Creating symlink from $MODEL_FILE_PATH to $BLOB_PATH"
    mkdir -p "$BLOBS_DIR"
    ln -s "$MODEL_FILE_PATH" "$BLOB_PATH"
else
    echo "Blob path already exists. No symlink needed."
fi

echo "Running 'ollama create' to generate the manifest..."
ollama create "$MODEL_NAME" -f /workspace/Modelfile

echo "Model created. Bringing Ollama to the foreground."
wait $!
"""

# Write the dynamic start.sh file
with open("ollama_service/start.sh", "w") as f:
    f.write(START_SH_CONTENT)
print("Successfully created dynamic ollama_service/start.sh")


# --- Create Modelfile Content ---
MODELFILE_CONTENT = f"""
# This Modelfile was dynamically generated to point to the correct model file.
FROM {full_model_path_in_container}
"""

# Write the dynamic Modelfile
with open("ollama_service/Modelfile", "w") as f:
    f.write(MODELFILE_CONTENT)
print("Successfully created dynamic ollama_service/Modelfile")

In [None]:
%%writefile ollama_service/Dockerfile

FROM ollama/ollama:latest

# Set environment variables
ENV OLLAMA_HOST=0.0.0.0:8080
ENV OLLAMA_MODELS=/var/lib/ollama
ENV OLLAMA_DEBUG=false
ENV OLLAMA_KEEP_ALIVE=-1

# Copy the Modelfile and our startup script into the container's workspace
COPY Modelfile start.sh /workspace/

# Make our startup script executable
RUN chmod +x /workspace/start.sh

# Set the entrypoint to our script. This will run when the container starts.
ENTRYPOINT [ "/workspace/start.sh" ]

### 2.2. Deploy the Service to Cloud Run

This step deploys the service to Cloud Run. The following flags are particularly important:

*   `--add-volume=name=ollama-gcs-models,type=cloud-storage,bucket={GCS_BUCKET_NAME},readonly=true`: This flag defines a read-only volume backed by the GCS bucket that contains the model.
*   `--add-volume-mount=volume=ollama-gcs-models,mount-path=/models`: This flag mounts the GCS volume at the `/models` path in the container.
*   `--add-volume=name=ollama-writable-state,type=in-memory,size-limit=1Gi`: This flag creates a writable in-memory volume for Ollama's state. This is necessary because the GCS volume is read-only, but Ollama needs a writable directory to store its state.
*   `--add-volume-mount=volume=ollama-writable-state,mount-path=/var/lib/ollama`: This flag mounts the in-memory volume at `/var/lib/ollama`, the directory that `OLLAMA_MODELS` points to in the Dockerfile.
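
With the bucket mounted at `/models`, an object's path inside the container is the mount path followed by the object name, which is how the notebook builds the model path for `start.sh` and the `Modelfile`. A small sketch of that mapping is shown below; the helper name `gcs_uri_to_mount_path` and the bucket and file names are hypothetical.

```python
import posixpath


def gcs_uri_to_mount_path(gcs_uri: str, bucket: str, mount_path: str = "/models") -> str:
    """Map gs://<bucket>/<object> to the path where the object appears in the container."""
    prefix = f"gs://{bucket}/"
    if not gcs_uri.startswith(prefix):
        raise ValueError(f"{gcs_uri!r} is not an object in bucket {bucket!r}")
    object_name = gcs_uri[len(prefix):]
    return posixpath.join(mount_path, object_name)


# Illustrative values only.
container_path = gcs_uri_to_mount_path(
    "gs://my-bucket/unsloth-gemma-3n-e4b-it-gguf-model/model.gguf",
    bucket="my-bucket",
)
print(container_path)
```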

In [None]:
!gcloud run deploy {HF_REPO_NAME_CLEAN[:63]} \
    --source ./ollama_service \
    --region {REGION} \
    --labels dev-tutorial=notebook-gcsfuse \
    --set-env-vars OLLAMA_NUM_PARALLEL=4 \
    --project {PROJECT_ID} \
    --gpu 1 \
    --gpu-type nvidia-l4 \
    --max-instances 1 \
    --memory 32Gi \
    --cpu 8 \
    --concurrency 4 \
    --timeout=600 \
    --no-cpu-throttling \
    --allow-unauthenticated \
    --execution-environment=gen2 \
    --no-gpu-zonal-redundancy \
    --add-volume=name=ollama-gcs-models,type=cloud-storage,bucket={GCS_BUCKET_NAME},readonly=true \
    --add-volume-mount=volume=ollama-gcs-models,mount-path=/models \
    --add-volume=name=ollama-writable-state,type=in-memory,size-limit=1Gi \
    --add-volume-mount=volume=ollama-writable-state,mount-path=/var/lib/ollama

### 2.3. Run Inference on the Deployed Model

Now that the service is deployed, this step sends a test prompt to the model's API endpoint. It constructs a `curl` command to make a POST request with a JSON payload containing the prompt. The response from the model is then parsed and displayed.
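
If you prefer to stay in Python rather than shelling out to `curl`, the same request can be built with the standard library. The sketch below is one way to do it, not the notebook's method: the helper names and the example service URL are hypothetical, and the network call is left commented out because it needs the deployed service.

```python
import json
import urllib.request


def build_generate_request(service_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url=f"{service_url.rstrip('/')}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_answer(body: bytes) -> str:
    """Return the 'response' field of the JSON body, or an empty string if it is absent."""
    return json.loads(body).get("response", "").strip()


# Placeholder URL for illustration; use SERVICE_URL from the cell above in practice.
request = build_generate_request(
    "https://example-service.a.run.app", "gemma-3n-custom:latest", "Hello"
)
print(request.get_method(), request.full_url)

# To send it (requires the deployed service; the timeout is an illustrative value):
# with urllib.request.urlopen(request, timeout=600) as resp:
#     print(extract_answer(resp.read()))
```

Building the payload with `json.dumps` and sending it as bytes avoids the shell-quoting problems that arise when a prompt contains single quotes.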

In [None]:
# Get the URL of our newly deployed Cloud Run service
SERVICE_URL = !gcloud run services describe {HF_REPO_NAME_CLEAN[:63]} --platform managed --region {REGION} --format 'value(status.url)'
SERVICE_URL = SERVICE_URL[0]

# The /api/generate endpoint is the standard path for Ollama prompts
OLLAMA_ENDPOINT_URL = f"{SERVICE_URL}/api/generate"

print(f"Service is running. Endpoint URL for prompts is:\n{OLLAMA_ENDPOINT_URL}")

In [None]:
import json

# Define the service endpoint URL (assuming SERVICE_URL is already defined)
OLLAMA_ENDPOINT_URL = f"{SERVICE_URL}/api/generate"

# Define the data payload for the prompt as a Python dictionary
prompt_data = {
  "model": "gemma-3n-custom:latest",
  "prompt": "What are the top 3 benefits of accessing models from a GCS bucket instead of the container image?",
  "stream": False
}

# Convert the Python dictionary to a JSON string for the curl command
# The single quotes around the f-string are important for the shell command
json_payload = f"'{json.dumps(prompt_data)}'"

# 1. Execute the curl command and capture its output into the 'raw_output' list
raw_output = !curl -s -X POST -H "Content-Type: application/json" -d {json_payload} {OLLAMA_ENDPOINT_URL}

# 2. The output is a list of lines; join them into a single string
response_string = "".join(raw_output)

# 3. Parse the JSON string into a Python dictionary
response_json = json.loads(response_string)

# 4. Extract and print just the 'response' key
model_answer = response_json.get("response", "Error: 'response' key not found in the JSON output.")
print(model_answer.strip())

## Optional: Clean Up Resources

This final section provides commands to delete all the Google Cloud resources created during this guide. It's crucial to run this step when you are finished to avoid incurring unnecessary costs, especially for GPU resources and large storage buckets.

**WARNING**: Running this section will delete your Cloud Run service, the Cloud Run Job, the Docker image repository in Artifact Registry, the `HF_TOKEN` secret, the job's service account, and your Cloud Storage bucket containing the model. This action is irreversible. Ensure `CONFIRM_DELETE` is set to `True` in section **1.1. Configure Google Cloud and Notebook Settings** to enable this cleanup.

In [None]:
if CONFIRM_DELETE:
    print("Initiating cleanup of Google Cloud resources...")

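    # Delete the Cloud Run service (this stops the GPU-backed deployment)
    print(f"Deleting Cloud Run service '{HF_REPO_NAME_CLEAN[:63]}'...")
    !gcloud run services delete {HF_REPO_NAME_CLEAN[:63]} --region={REGION} --quiet --project={PROJECT_ID} || true

    # Delete the Cloud Run Job
    print(f"Deleting Cloud Run Job '{JOB_NAME}'...")
    !gcloud run jobs delete {JOB_NAME} --region={REGION} --quiet --project={PROJECT_ID}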

    # Delete the Docker image repository from Artifact Registry
    # This deletes the entire repository named by ARTIFACT_REGISTRY_REPO in your chosen REGION, including any other images it holds.
    print(f"Deleting Docker image repository '{ARTIFACT_REGISTRY_REPO}' in region '{REGION}'...")
    !gcloud artifacts repositories delete {ARTIFACT_REGISTRY_REPO} --location={REGION} --project={PROJECT_ID} --quiet || true

    # Delete the Cloud Storage bucket (and all its contents)
    print(f"Deleting Cloud Storage bucket 'gs://{GCS_BUCKET_NAME}'...")
    !gsutil -m rm -r gs://{GCS_BUCKET_NAME}

    # Delete the Secret Manager secret for HF_TOKEN
    print(f"Deleting Secret Manager secret 'HF_TOKEN'...")
    !gcloud secrets delete HF_TOKEN --project={PROJECT_ID} --quiet || true

    # Delete the job's service account
    print(f"Deleting job service account '{JOB_SA_EMAIL}'...")
    !gcloud iam service-accounts delete {JOB_SA_EMAIL} --project={PROJECT_ID} --quiet || true

    print("\nCleanup commands executed. Please verify in the Google Cloud Console that all resources have been removed to prevent further charges.")
else:
    print("Resource deletion not confirmed. To delete resources, set `CONFIRM_DELETE = True` in section 1.1 (Configure Google Cloud and Notebook Settings) and re-run this section.")