# Deploy Falcon 7B on Vertex AI 

<table align="left">
  <td>
    <a href="https://github.com/huggingface/Google-Cloud-Containers/blob/main/examples/vertex-ai/notebooks/deploy-falcon-on-vertex-ai.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>
<br/><br/><br/>

[Falcon-7b-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) is a 7B-parameter causal decoder-only model built by TII, based on [Falcon-7B](https://huggingface.co/tiiuae/falcon-7b) and fine-tuned on a mixture of chat/instruct datasets. It is made available under the Apache 2.0 license. You can find more information about the other Falcon models in the blog post [The Falcon has landed in the Hugging Face ecosystem](https://huggingface.co/blog/falcon).

In this tutorial you will learn how to deploy [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) on Vertex AI Endpoints. We are going to use the Hugging Face [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) container, which is a scalable, optimized solution for deploying and serving Large Language Models (LLMs). You can now find all Hugging Face containers on [Google Cloud](https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face).

What you'll learn in this tutorial:

1. [Setup development environment](#1-setup-development-environment)
2. [Configure gcloud CLI](#2-configure-gcloud-cli)
3. [Initialize Vertex AI SDK](#3-initialize-vertex-ai-sdk)
4. [Deploy Falcon-7B on Vertex AI](#4-deploy-falcon-7b-on-vertex-ai)
5. [Run Inference with deployed Model](#5-run-inference-with-deployed-model)
6. [Cleaning Up Resources](#6-cleaning-up-resources)

## 1. Setup development environment

We are going to use the Vertex AI Python SDK to deploy [tiiuae/falcon-7b-instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) on Vertex AI. You need a GCP project and an account with the permissions required to create resources in that project.

Before installing the Python packages, install the gcloud CLI by following the [installation instructions](https://cloud.google.com/sdk/docs/install).

In [None]:
# Install the required packages for the notebook
! pip install --upgrade --quiet google-cloud-aiplatform google-cloud-storage "google-auth>=2.23.3"
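
Because the later cells shell out to `gcloud` and `gsutil`, it can help to confirm that both are on the `PATH` before continuing. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is hypothetical and not part of any Google SDK.

```python
import shutil


def missing_tools(tools=("gcloud", "gsutil")):
    """Return the command-line tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print(f"Not found on PATH: {', '.join(missing)}. Install the gcloud CLI first.")
else:
    print("gcloud and gsutil are available.")
```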

## 2. Configure gcloud CLI

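We need to authenticate with the Google Cloud SDK to use Vertex AI services. Run the following commands: the first authenticates the gcloud CLI, and the second sets up Application Default Credentials for client libraries such as the Vertex AI SDK.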

```bash
gcloud auth login 
gcloud auth application-default login
```

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "gcp-project-id"  # @param {type:"string"}
REGION = "us-central1"  # @param {type: "string"}
BUCKET_URI = f"gs://vertexai-{PROJECT_ID}-tgi"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID} --quiet
# Set the region
! gcloud config set ai/region {REGION} --quiet
# Create the staging bucket (the command reports an error if the bucket already exists)
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}
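
Since `BUCKET_URI` is derived from `PROJECT_ID`, a quick sanity check before calling `gsutil mb` can catch an invalid bucket name early. The sketch below covers only a simplified subset of the Cloud Storage naming rules (names without dots); the helper `is_valid_bucket_uri` is a hypothetical name introduced for illustration.

```python
import re

# Simplified subset of the Cloud Storage bucket naming rules (assumption:
# names without dots): 3-63 characters, lowercase letters, digits, dashes
# and underscores only, starting and ending with a letter or digit.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def is_valid_bucket_uri(uri: str) -> bool:
    """Check a gs:// URI against the simplified naming rules above."""
    if not uri.startswith("gs://"):
        return False
    name = uri[len("gs://"):]
    return _BUCKET_NAME.fullmatch(name) is not None


print(is_valid_bucket_uri("gs://vertexai-gcp-project-id-tgi"))  # True
print(is_valid_bucket_uri("gs://vertexai-My_Project-tgi"))      # False (uppercase)
```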

## 3. Initialize Vertex AI SDK 

The following set of constants will be used to create names and display names of Vertex AI Prediction resources like models, endpoints, and model deployments.

In [None]:
# set model names and version
MODEL_NAME = "falcon-7b-hf" # @param {type:"string"}
MODEL_VERSION = "v01" # @param {type: "string"}
MODEL_DISPLAY_NAME = f"TGI-{MODEL_NAME}-{MODEL_VERSION}" # @param {type:"string"}
ENDPOINT_DISPLAY_NAME = f"endpoint-{MODEL_NAME}-{MODEL_VERSION}" # @param {type:"string"}

# Set the TGI serving container image uri, selected from https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face
SERVING_CONTAINER_IMAGE_URI = "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.1-4.ubuntu2204.py310"

In [None]:
from google.cloud import aiplatform

# Initialize the Vertex AI SDK
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

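## 4. Deploy Falcon-7B on Vertex AI

First, create a new model resource in the Vertex AI Model Registry. The environment variables configure the TGI container: `MODEL_ID` is the Hugging Face model ID, `NUM_SHARD` is the number of GPU shards, and `MAX_INPUT_LENGTH` and `MAX_TOTAL_TOKENS` bound the prompt length and the combined prompt-plus-generation length in tokens.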

In [None]:
model = aiplatform.Model.upload(
    display_name=MODEL_DISPLAY_NAME,
    serving_container_image_uri=SERVING_CONTAINER_IMAGE_URI,
    serving_container_environment_variables={
        "MODEL_ID": "tiiuae/falcon-7b-instruct",  # Hugging Face model ID
        "NUM_SHARD": "1",  # number of GPU shards
        "MAX_INPUT_LENGTH": "1512",  # maximum prompt length in tokens
        "MAX_TOTAL_TOKENS": "4096",  # maximum prompt + generated tokens
    },
    serving_container_ports=[80],
)

model.wait()

print(model.display_name)
print(model.resource_name)
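
The two token limits interact: TGI expects `MAX_INPUT_LENGTH` to be strictly smaller than `MAX_TOTAL_TOKENS`, and the difference is the room left for generated tokens. The following sketch checks a configuration before uploading. The helper `check_tgi_env` is hypothetical, and the string-only check reflects the assumption that the SDK expects a string-to-string mapping for environment variables.

```python
def check_tgi_env(env: dict) -> int:
    """Validate the token limits and return the budget left for generation."""
    for key, value in env.items():
        if not isinstance(value, str):
            raise TypeError(f"{key} must be a string, got {type(value).__name__}")
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    return max_total - max_input


env = {
    "MODEL_ID": "tiiuae/falcon-7b-instruct",
    "NUM_SHARD": "1",
    "MAX_INPUT_LENGTH": "1512",
    "MAX_TOTAL_TOKENS": "4096",
}
print(check_tgi_env(env))  # 2584 tokens available for generation
```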

Once the model is uploaded, we can deploy it to a Vertex AI endpoint. First we create an endpoint, then we deploy the model to it. At this step you also choose the hardware configuration for the deployment.

The deployment takes roughly 20-25 minutes. You can check its status in the Google Cloud console.

In [None]:
machine_type = "g2-standard-8"  # G2 machine type, which comes with one NVIDIA L4 GPU
endpoint = aiplatform.Endpoint.create(display_name=ENDPOINT_DISPLAY_NAME)

deployed_model = model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=MODEL_NAME, # The display name of the deployed model
    machine_type=machine_type,              # type of machine, read more here: https://cloud.google.com/vertex-ai/docs/predictions/configure-compute
    accelerator_type="NVIDIA_L4",           # Hardware accelerator 
    accelerator_count=1,                    # Number of accelerators to attach to a worker replica.
    traffic_percentage=100,                 # Percentage of traffic to send to this model
    min_replica_count=1,                    # The minimum number of machine replicas this deployed model will be always deployed on.
    sync=True,                              # Whether to execute this method synchronously.
)
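
A back-of-the-envelope estimate can help when choosing the hardware configuration. The sketch below approximates the memory needed for the model weights alone. The parameter count (about 7e9), the 16-bit storage, and the nominal 24 GB of an L4 are assumptions for illustration; the remaining memory must also hold the KV cache and activations, so treat the result as a lower bound rather than a sizing guarantee.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: ~7e9 parameters stored as 16-bit floats (2 bytes each).
weights = weight_memory_gib(7e9)
l4_memory_gib = 24  # assumption: nominal memory of one NVIDIA L4
print(f"weights: ~{weights:.1f} GiB, headroom: ~{l4_memory_gib - weights:.1f} GiB")
```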

## 5. Run Inference with deployed Model
Awesome! We have successfully deployed the Falcon-7B model on Vertex AI. Now let's run inference on our endpoint using the `predict` method of the `Endpoint` class. Generation can be controlled with different parameters, which are passed in the `parameters` attribute of the payload. The supported parameters are listed in the [TGI documentation](https://huggingface.github.io/text-generation-inference/) under the `GenerateParameters` section.


In [None]:
# Prompt for generation
prompt = """You are a helpful assistant called Falcon who knows everything about Google Cloud.

User: Can you tell me about Google Cloud Vertex AI?
Falcon:"""

# Define the payload: the prompt plus the generation parameters
res = deployed_model.predict(
    instances=[
        {
            "inputs": prompt,
            "parameters": {
                "max_new_tokens": 128,
                "do_sample": True,
                "top_p": 0.9,
                "temperature": 1.0,
            },
        }
    ]
)
print(res.predictions[0])
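
When sending several requests, it can be convenient to build the payload with small helpers instead of writing the nested dictionary by hand each time. The sketch below mirrors the payload shape and the prompt layout used in the cell above; `build_instance` and `build_prompt` are hypothetical helper names, and the actual `predict` call is left commented out because it requires the live endpoint.

```python
def build_instance(prompt: str, **parameters) -> dict:
    """Build one `instances` entry in the shape used in the cell above."""
    return {"inputs": prompt, "parameters": parameters}


def build_prompt(system: str, user_message: str, assistant_name: str = "Falcon") -> str:
    """Format a single-turn prompt in the same layout as the example prompt."""
    return f"{system}\n\nUser: {user_message}\n{assistant_name}:"


instance = build_instance(
    build_prompt("You are a helpful assistant.", "What is Vertex AI?"),
    max_new_tokens=64,
    temperature=0.7,
)
print(instance)
# res = deployed_model.predict(instances=[instance])  # requires the live endpoint
```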


## 6. Cleaning Up Resources

To avoid incurring charges to your Google Cloud account, delete the resources created in this tutorial: undeploy the model from the endpoint, then delete the endpoint and the model. Note that `deployed_model` is the `Endpoint` object returned by `model.deploy`.

In [None]:
deployed_model.undeploy_all()  # undeploy every model from the endpoint
deployed_model.delete()        # delete the endpoint
model.delete()                 # delete the model resource
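
If one cleanup call fails, the remaining calls in the cell never run and resources may keep incurring charges. One possible approach, sketched below, runs every step and collects the failures for inspection afterwards. The `cleanup` helper is hypothetical; it relies only on the `undeploy_all` and `delete` methods already used above.

```python
def cleanup(endpoint, model):
    """Run each cleanup step in order, collecting failures instead of stopping."""
    steps = [
        ("undeploy models", endpoint.undeploy_all),
        ("delete endpoint", endpoint.delete),
        ("delete model", model.delete),
    ]
    failures = []
    for label, step in steps:
        try:
            step()
        except Exception as exc:  # keep going so later steps are still attempted
            failures.append((label, exc))
    return failures


# Usage with the objects from this tutorial (requires the live resources):
# failures = cleanup(deployed_model, model)
# for label, exc in failures:
#     print(f"{label} failed: {exc}")
```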