<a href="https://colab.research.google.com/github/hardrave/GCP_Guild_AI_in_GCP/blob/main/cloud_run_with_gemma2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GCP Association Guild - Gemma2 Model Deployment Guide

This collaborative notebook was created for the GCP Guild Association to demonstrate deploying the Gemma2 2B-parameter model to Google Cloud Run. It provides a step-by-step guide covering model preparation, containerization, and deployment to GCP.

# Authenticate with Google Cloud

This cell authenticates your Google Cloud account using the `gcloud` command-line tool. It updates the application default credentials (ADC) and runs in quiet mode.

In [None]:
!gcloud auth login --update-adc --quiet

# Project Setup

This section sets up the project ID and location for your Google Cloud resources.

*   **PROJECT_ID:**  This variable stores your Google Cloud Project ID. Provide a value to use a specific project; if it is left empty, the code falls back to the environment variable `GOOGLE_CLOUD_PROJECT`.
*   **LOCATION:** This variable stores the location for your Google Cloud resources. It is read from the environment variable `GOOGLE_CLOUD_REGION` and defaults to "us-central1" if that variable is not set.

In [2]:
# Fall back to the environment variable if the user doesn't provide a Project ID.
import os

PROJECT_ID = ""  # @param {type:"string", isTemplate: true}
if not PROJECT_ID:
    PROJECT_ID = os.environ.get("GOOGLE_CLOUD_PROJECT", "")

LOCATION = os.environ.get("GOOGLE_CLOUD_REGION", "us-central1")
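
An empty or mistyped project ID only shows up later as a failed `gcloud` call, so it can help to check it up front. The sketch below is one possible approach and is not part of the original notebook: `resolve_project_id` is a hypothetical helper, the fallback variable is assumed to be `GOOGLE_CLOUD_PROJECT`, and the pattern encodes the standard project ID rules (6 to 30 characters; lowercase letters, digits, and hyphens; starting with a letter and not ending with a hyphen).

```python
import os
import re

# Standard project IDs: 6-30 chars, lowercase letters, digits, hyphens;
# must start with a letter and must not end with a hyphen.
_PROJECT_ID_RE = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def resolve_project_id(explicit: str = "", env=None) -> str:
    """Return the explicit ID if given, else the environment fallback.

    `env` is injectable (hypothetical parameter) so the lookup can be
    exercised without touching the real process environment.
    """
    env = os.environ if env is None else env
    project_id = explicit or env.get("GOOGLE_CLOUD_PROJECT", "")
    if not _PROJECT_ID_RE.fullmatch(project_id):
        raise ValueError(f"Invalid or missing project ID: {project_id!r}")
    return project_id
```

Calling `resolve_project_id(PROJECT_ID)` right after the setup cell would raise immediately if neither the form field nor the environment supplies a usable ID.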

# Artifact Registry Repository Name

This section defines the name of the Artifact Registry repository that will be used to store the Docker image for the Ollama model.

*   **AR_REPOSITORY_NAME:** This variable stores the name of the Artifact Registry repository. It is set to "michal" in this case.

In [3]:
AR_REPOSITORY_NAME = "michal"

# Create Artifact Registry Repository

This cell creates an Artifact Registry repository to store the Docker image for the Ollama model.

It uses the `gcloud` command-line tool to create a new repository with the following parameters:

*   **repository-format:** Specifies the repository format, which is set to `docker` for storing Docker images.
*   **location:** Specifies the location of the repository, using the `LOCATION` variable defined earlier.
*   **project:** Specifies the Google Cloud project ID, using the `PROJECT_ID` variable defined earlier.

In [None]:
!gcloud artifacts repositories create $AR_REPOSITORY_NAME \
      --repository-format=docker \
      --location=$LOCATION \
      --project=$PROJECT_ID
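
Re-running the cell above fails once the repository already exists. A minimal sketch of an idempotent variant follows, assuming `gcloud` is on the PATH; `ensure_repository` and the two command builders are hypothetical helper names. Note that a non-zero exit from `describe` is treated as "missing" here, although it could also indicate a permissions problem.

```python
import subprocess


def describe_repo_cmd(name: str, location: str, project: str) -> list:
    """Command that exits with 0 only if the repository can be described."""
    return ["gcloud", "artifacts", "repositories", "describe", name,
            f"--location={location}", f"--project={project}"]


def create_repo_cmd(name: str, location: str, project: str) -> list:
    """Same command as the notebook cell, as an argument list."""
    return ["gcloud", "artifacts", "repositories", "create", name,
            "--repository-format=docker",
            f"--location={location}", f"--project={project}"]


def ensure_repository(name, location, project, run=subprocess.run) -> bool:
    """Create the repository only if `describe` fails. Returns True if created."""
    result = run(describe_repo_cmd(name, location, project), capture_output=True)
    if result.returncode == 0:
        return False  # already exists, nothing to do
    run(create_repo_cmd(name, location, project), check=True)
    return True
```

In the notebook this could be invoked as `ensure_repository(AR_REPOSITORY_NAME, LOCATION, PROJECT_ID)`.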

# Model Name

This section defines the name of the Ollama model that will be deployed.

*   **MODEL_NAME:** This variable stores the name of the Ollama model, which is set to "gemma2:2b" in this case. The part before the colon (`gemma2`) names the model in the Ollama library, and the part after it (`2b`) is the tag selecting the 2B-parameter variant.

In [5]:
MODEL_NAME = "gemma2:2b"
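
To make the `name:tag` structure explicit, here is a small sketch that splits a model reference. `split_model_name` is a hypothetical helper; it assumes a plain `name:tag` reference (no registry host or namespace) and mirrors Ollama's behavior of using the `latest` tag when none is given.

```python
def split_model_name(model_name: str) -> tuple:
    """Split 'name:tag' into (name, tag), defaulting the tag to 'latest'."""
    name, sep, tag = model_name.partition(":")
    return name, (tag if sep and tag else "latest")


print(split_model_name("gemma2:2b"))  # ('gemma2', '2b')
print(split_model_name("gemma2"))     # ('gemma2', 'latest')
```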

# Build Dockerfile

This section defines and creates the Dockerfile used to build the image for the Ollama model deployment.

**Dockerfile Content:**

The `dockerfile_content` variable stores the content of the Dockerfile. Here's a breakdown of the key instructions:

*   **FROM ollama/ollama:** This line specifies the base image for the Dockerfile, which is the official Ollama image.
*   **ENV OLLAMA_HOST ...:** These lines set environment variables for the Ollama server, including the host, port, model directory, debug mode, and model keep-alive settings.
*   **ENV MODEL {MODEL_NAME}:** This line sets the `MODEL` environment variable to the `MODEL_NAME` defined earlier, specifying the model to load.
*   **RUN ollama serve ...:** This line runs at image build time: it starts the Ollama server in the background, waits five seconds for it to come up, and then pulls the specified model weights so that they are stored in the image.
*   **ENTRYPOINT ["/bin/sh"] & CMD [...]:** These lines define the entry point and command to run when the container starts, ensuring the Ollama server starts and loads the model.

**Writing the Dockerfile:**

The code then writes the `dockerfile_content` to a file named "Dockerfile" in the current directory. This Dockerfile will be used in the next step to build the Docker image.

In [6]:
dockerfile_content = f"""
FROM ollama/ollama

# Set the host and port to listen on
ENV OLLAMA_HOST 0.0.0.0:8080

# Set the directory to store model weight files
ENV OLLAMA_MODELS /models

# Reduce the verbosity of the logs
ENV OLLAMA_DEBUG false

# Keep the model weights loaded in memory indefinitely
ENV OLLAMA_KEEP_ALIVE -1

# Choose the model to load. Ollama defaults to 4-bit quantized weights
ENV MODEL {MODEL_NAME}

# Start the ollama server and download the model weights
RUN ollama serve & sleep 5 && ollama pull $MODEL

# At startup time we start the server and run a dummy request
# so that the model is loaded into memory before real traffic arrives
ENTRYPOINT ["/bin/sh"]
CMD ["-c", "ollama serve  & (ollama run $MODEL 'Say one word' &) && wait"]
"""

# Write the Dockerfile
with open("Dockerfile", "w") as f:
    f.write(dockerfile_content)
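
The Dockerfile text mixes two kinds of substitution, which is easy to misread. Because the content is a Python f-string, `{MODEL_NAME}` is replaced by Python when the cell runs, while `$MODEL` is left untouched and only expanded later by the shell inside the container. The sketch below illustrates this on a two-line excerpt; `render_excerpt` is a hypothetical name used only for this demonstration.

```python
def render_excerpt(model_name: str) -> list:
    """Render a two-line excerpt of the Dockerfile the same way the notebook does."""
    excerpt = f"""ENV MODEL {model_name}
RUN ollama serve & sleep 5 && ollama pull $MODEL
"""
    return excerpt.splitlines()


lines = render_excerpt("gemma2:2b")
print(lines[0])  # ENV MODEL gemma2:2b   (filled in by Python)
print(lines[1])  # RUN ollama serve & sleep 5 && ollama pull $MODEL   (left for the shell)
```

One consequence of this design is that changing `MODEL_NAME` requires re-running the cell and rebuilding the image, since the model name is fixed in the generated file.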

# Container URI

This section defines the URI for the Docker image that will be built and stored in Artifact Registry.

*   **CONTAINER_URI:** This variable stores the full URI of the Docker image in Artifact Registry. It's constructed using the `LOCATION`, `PROJECT_ID`, `AR_REPOSITORY_NAME`, and the image name ("ollama-gemma-2"). This URI will be used to push and pull the image.

In [7]:
CONTAINER_URI = (
    f"{LOCATION}-docker.pkg.dev/{PROJECT_ID}/{AR_REPOSITORY_NAME}/ollama-gemma-2"
)
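
The URI follows the Artifact Registry pattern `LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE`. As a quick sanity check, the sketch below takes a URI apart again; `parse_container_uri` is a hypothetical helper and assumes a single-segment image name without a tag or digest.

```python
AR_DOCKER_SUFFIX = "-docker.pkg.dev"


def parse_container_uri(uri: str) -> dict:
    """Split an Artifact Registry Docker image URI into its four parts."""
    parts = uri.split("/")
    if len(parts) != 4 or not parts[0].endswith(AR_DOCKER_SUFFIX):
        raise ValueError(f"Unexpected image URI: {uri!r}")
    host, project, repository, image = parts
    return {
        "location": host[: -len(AR_DOCKER_SUFFIX)],
        "project": project,
        "repository": repository,
        "image": image,
    }
```

For example, `parse_container_uri(CONTAINER_URI)["project"]` returning an empty string would indicate that `PROJECT_ID` was never set.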

# Build and Push Docker Image

This cell builds the Docker image using Cloud Build and pushes it to Artifact Registry.

It uses the `gcloud` command-line tool to submit a build request with the following parameters:

*   **tag:** Specifies the tag for the Docker image, which is set to the `CONTAINER_URI` defined earlier.
*   **project:** Specifies the Google Cloud project ID, using the `PROJECT_ID` variable defined earlier.
*   **machine-type:** Specifies the machine type to use for the build, which is set to `e2-highcpu-32` for better performance. This machine type is suitable for CPU-intensive build processes.

In [None]:
!gcloud builds submit --tag $CONTAINER_URI --project $PROJECT_ID --machine-type e2-highcpu-32

# Cloud Run Service Name

This section defines the name for the Cloud Run service that will host the Ollama model.

*   **SERVICE_NAME:** This variable stores the name of the Cloud Run service. It is set to "ollama-gemma-2" in this case. This name will be used to identify and manage the deployed service.

In [9]:
SERVICE_NAME = "ollama-gemma-2"  # @param {type:"string"}

# Deploy to Cloud Run

This cell deploys the Ollama model to Cloud Run as a service.

It uses the `gcloud` command-line tool with the `beta run deploy` command to create and deploy a new Cloud Run service with the following parameters:

*   **SERVICE_NAME:** The name of the Cloud Run service, defined earlier.
*   **project:** The Google Cloud project ID.
*   **region:** The region where the service will be deployed.
*   **image:** The URI of the Docker image in Artifact Registry.
*   **concurrency:** The maximum number of requests that can be processed concurrently by a single instance of the service (set to 4).
*   **cpu:** The number of CPU cores allocated to each instance (set to 8).
*   **max-instances:** The maximum number of instances that can be running for the service (set to 1).
*   **memory:** The amount of memory allocated to each instance (set to 16Gi).
*   **no-allow-unauthenticated:** Disables unauthenticated access to the service, requiring authentication.
*   **no-cpu-throttling:** Disables CPU throttling, ensuring consistent performance.
*   **timeout:** The request timeout for the service, in seconds (set to 600).

In [None]:
!gcloud beta run deploy $SERVICE_NAME \
    --project $PROJECT_ID \
    --region $LOCATION \
    --image $CONTAINER_URI \
    --concurrency 4 \
    --cpu 8 \
    --max-instances 1 \
    --memory 16Gi \
    --no-allow-unauthenticated \
    --no-cpu-throttling \
    --timeout=600
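
The same deployment can be expressed from Python, which keeps the resource settings in one place. This is a sketch under stated assumptions rather than the notebook's method: `deploy_command`, `SETTINGS`, and `FLAGS` are hypothetical names, and the values simply mirror the flags in the cell above.

```python
# Value-carrying flags and boolean flags, mirroring the gcloud cell above.
SETTINGS = {
    "concurrency": 4,
    "cpu": 8,
    "max-instances": 1,
    "memory": "16Gi",
    "timeout": 600,
}
FLAGS = ("no-allow-unauthenticated", "no-cpu-throttling")


def deploy_command(service, project, region, image,
                   settings=SETTINGS, flags=FLAGS) -> list:
    """Build the `gcloud beta run deploy` argument list."""
    cmd = ["gcloud", "beta", "run", "deploy", service,
           "--project", project, "--region", region, "--image", image]
    for key, value in settings.items():
        cmd += [f"--{key}", str(value)]
    cmd += [f"--{flag}" for flag in flags]
    return cmd
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`; passing arguments as a list avoids shell quoting issues.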

# Accessing the Deployed Model

In order to connect to the deployed Gemma2 model on Google Cloud Run, follow these steps:

**Start the Cloud Run Proxy:**

Open a Cloud Shell terminal and run the following command to start the proxy. The service name must match the `SERVICE_NAME` used for the deployment ("ollama-gemma-2" in this notebook). When prompted to install the `cloud-run-proxy` component, choose `Y` to proceed.
```
gcloud run services proxy ollama-gemma-2 --port=9090
```
This will expose the service on `localhost:9090`.

**Send a Request to the Model:**

In a separate terminal tab, while keeping the proxy running, execute the following curl command to send a test request to the model:

```
curl http://localhost:9090/api/generate -d '{
  "model": "gemma2:2b",
  "prompt": "Why is the sky blue?"
}'
```

The response will contain the model-generated output based on the provided prompt.
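
By default, Ollama's `/api/generate` endpoint streams its answer as newline-delimited JSON objects, each carrying a `response` fragment and a `done` flag. The sketch below shows one way to send the same request from Python through the proxy and stitch the fragments together. It uses only the standard library; `build_payload`, `join_stream`, and `generate` are hypothetical helper names, and the base URL assumes the proxy from the previous step is running on port 9090.

```python
import json
import urllib.request


def build_payload(prompt: str, model: str = "gemma2:2b", stream: bool = True) -> bytes:
    """Encode the request body expected by /api/generate."""
    return json.dumps({"model": model, "prompt": prompt, "stream": stream}).encode("utf-8")


def join_stream(lines) -> str:
    """Concatenate the `response` fragments of newline-delimited JSON chunks."""
    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # final chunk reached
    return "".join(parts)


def generate(prompt: str, base_url: str = "http://localhost:9090") -> str:
    """Send a prompt through the local proxy and return the full text."""
    request = urllib.request.Request(
        f"{base_url}/api/generate",
        data=build_payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return join_stream(response)
```

With the proxy running, `print(generate("Why is the sky blue?"))` would print the assembled answer. Setting `"stream": false` in the payload instead returns a single JSON object.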

