diff --git a/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb b/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb new file mode 100644 index 000000000..46c495528 --- /dev/null +++ b/notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb @@ -0,0 +1,690 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Copyright 2023 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Use BigQuery DataFrames to cluster and characterize complaints\n", + "\n", + "\n", + "\n", + " \n", + " \n", + " \n", + "
\n", + " \n", + " \"Colab Run in Colab\n", + " \n", + " \n", + " \n", + " \"GitHub\n", + " View on GitHub\n", + " \n", + " \n", + " \n", + " \"Vertex\n", + " Open in Vertex AI Workbench\n", + " \n", + "
" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Overview\n", + "\n", + "The goal of this notebook is to demonstrate a comment characterization algorithm for an online business. We will accomplish this using [Google's PaLM 2](https://ai.google/discover/palm2/) and [KMeans clustering](https://en.wikipedia.org/wiki/K-means_clustering) in three steps:\n", + "\n", + "1. Use PaLM2TextEmbeddingGenerator to [generate text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) for each of 10000 complaints sent to an online bank. If you're not familiar with what a text embedding is, it's a list of numbers that are like coordinates in an imaginary \"meaning space\" for sentences. (It's like [word embeddings](https://en.wikipedia.org/wiki/Word_embedding), but for more general text.) The important point for our purposes is that similar sentences are close to each other in this imaginary space.\n", + "2. Use KMeans clustering to group together complaints whose text embeddings are near to eachother. This will give us sets of similar complaints, but we don't yet know _why_ these complaints are similar.\n", + "3. Prompt PaLM2TextGenerator in English asking what the difference is between the groups of complaints that we got. Thanks to the power of modern LLMs, the response might give us a very good idea of what these complaints are all about, but remember to [\"understand the limits of your dataset and model.\"](https://ai.google/responsibility/responsible-ai-practices/#:~:text=Understand%20the%20limitations%20of%20your%20dataset%20and%20model)\n", + "\n", + "We will tie these pieces together in Python using BigQuery DataFrames. [Click here](https://cloud.google.com/bigquery/docs/dataframes-quickstart) to learn more about BigQuery DataFrames!" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Dataset\n", + "\n", + "This notebook uses the [CFPB Consumer Complaint Database](https://console.cloud.google.com/marketplace/product/cfpb/complaint-database)." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Costs\n", + "\n", + "This tutorial uses billable components of Google Cloud:\n", + "\n", + "* BigQuery (compute)\n", + "* BigQuery ML\n", + "* Generative AI support on Vertex AI\n", + "\n", + "Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models), [Generative AI support on Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing#generative_ai_models),\n", + "and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),\n", + "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", + "to generate a cost estimate based on your projected usage." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Before you begin\n", + "\n", + "Complete the tasks in this section to set up your environment." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Set up your Google Cloud project\n", + "\n", + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.\n", + "\n", + "2. 
[Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", + "\n", + "3. [Click here](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,bigqueryconnection.googleapis.com,run.googleapis.com,artifactregistry.googleapis.com,cloudbuild.googleapis.com,cloudresourcemanager.googleapis.com) to enable the following APIs:\n", + "\n", + " * BigQuery API\n", + " * BigQuery Connection API\n", + " * Cloud Run API\n", + " * Artifact Registry API\n", + " * Cloud Build API\n", + " * Cloud Resource Manager API\n", + " * Vertex AI API\n", + "\n", + "4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk)." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Set your project ID\n", + "\n", + "**If you don't know your project ID**, see the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# set your project ID below\n", + "PROJECT_ID = \"\" # @param {type:\"string\"}\n", + "\n", + "# Set the project id in gcloud\n", + "! gcloud config set project {PROJECT_ID}" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Set the region\n", + "\n", + "You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "REGION = \"US\" # @param {type: \"string\"}" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Authenticate your Google Cloud account\n", + "\n", + "Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Vertex AI Workbench**\n", + "\n", + "Do nothing, you are already authenticated." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Local JupyterLab instance**\n", + "\n", + "Uncomment and run the following cell:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# ! gcloud auth login" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Colab**\n", + "\n", + "Uncomment and run the following cell:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# from google.colab import auth\n", + "# auth.authenticate_user()" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bf.close_session()`. After that, you can reuse `bf.options.bigquery.location` to specify another location." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Connect to Vertex AI\n", + "\n", + "In order to use PaLM2TextGenerator, we will need to set up a [cloud resource connection](https://cloud.google.com/bigquery/docs/create-cloud-resource-connection)." 
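+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The Python cell further below creates the connection with the BigQuery Connection API client. As an optional alternative, the commented-out cell right after this one is a minimal sketch that does the same thing with the `bq` command-line tool (assuming the Cloud SDK is installed); you only need one of the two approaches. Uncomment the lines to try it:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional sketch: create the cloud resource connection with the bq CLI instead\n",
+    "# of the Python client in the next cell. 'bqdf-llm' matches CONN_NAME below.\n",
+    "# ! bq mk --connection --location={REGION} --project_id={PROJECT_ID} --connection_type=CLOUD_RESOURCE bqdf-llm\n",
+    "# ! bq show --connection {PROJECT_ID}.{REGION}.bqdf-llm"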
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from google.cloud import bigquery_connection_v1 as bq_connection\n", + "\n", + "CONN_NAME = \"bqdf-llm\"\n", + "\n", + "client = bq_connection.ConnectionServiceClient()\n", + "new_conn_parent = f\"projects/{PROJECT_ID}/locations/{REGION}\"\n", + "exists_conn_parent = f\"projects/{PROJECT_ID}/locations/{REGION}/connections/{CONN_NAME}\"\n", + "cloud_resource_properties = bq_connection.CloudResourceProperties({})\n", + "\n", + "try:\n", + " request = client.get_connection(\n", + " request=bq_connection.GetConnectionRequest(name=exists_conn_parent)\n", + " )\n", + " CONN_SERVICE_ACCOUNT = f\"serviceAccount:{request.cloud_resource.service_account_id}\"\n", + "except Exception:\n", + " connection = bq_connection.types.Connection(\n", + " {\"friendly_name\": CONN_NAME, \"cloud_resource\": cloud_resource_properties}\n", + " )\n", + " request = bq_connection.CreateConnectionRequest(\n", + " {\n", + " \"parent\": new_conn_parent,\n", + " \"connection_id\": CONN_NAME,\n", + " \"connection\": connection,\n", + " }\n", + " )\n", + " response = client.create_connection(request)\n", + " CONN_SERVICE_ACCOUNT = (\n", + " f\"serviceAccount:{response.cloud_resource.service_account_id}\"\n", + " )\n", + "print(CONN_SERVICE_ACCOUNT)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Set permissions for the service account\n", + "\n", + "The resource connection service account requires certain project-level permissions:\n", + " - `roles/aiplatform.user` and `roles/bigquery.connectionUser`: These roles are required for the connection to create a model definition using the LLM model in Vertex AI ([documentation](https://cloud.google.com/bigquery/docs/generate-text#give_the_service_account_access)).\n", + " - `roles/run.invoker`: This role is required for the connection to have read-only access to Cloud Run services that back custom/remote functions ([documentation](https://cloud.google.com/bigquery/docs/remote-functions#grant_permission_on_function)).\n", + "\n", + "Set these permissions by running the following `gcloud` commands:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/bigquery.connectionUser'\n", + "!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/aiplatform.user'\n", + "!gcloud projects add-iam-policy-binding {PROJECT_ID} --condition=None --no-user-output-enabled --member={CONN_SERVICE_ACCOUNT} --role='roles/run.invoker'" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we are ready to use BigQuery DataFrames!" 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "xckgWno6ouHY" + }, + "source": [ + "## Step 1: Text embedding " + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Project Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "R7STCS8xB5d2" + }, + "outputs": [], + "source": [ + "import bigframes.pandas as bf\n", + "\n", + "bf.options.bigquery.project = PROJECT_ID\n", + "bf.options.bigquery.location = REGION" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "v6FGschEowht" + }, + "source": [ + "Data Input - read the data from a publicly available BigQuery dataset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zDSwoBo1CU3G" + }, + "outputs": [], + "source": [ + "input_df = bf.read_gbq(\"bigquery-public-data.cfpb_complaints.complaint_database\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tYDoaKgJChiq" + }, + "outputs": [], + "source": [ + "issues_df = input_df[[\"consumer_complaint_narrative\"]].dropna()\n", + "issues_df.head(n=5) # View the first five complaints" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Download 10000 complaints to use with PaLM2TextEmbeddingGenerator" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OltYSUEcsSOW" + }, + "outputs": [], + "source": [ + "# Choose 10,000 complaints randomly and store them in a column in a DataFrame\n", + "downsampled_issues_df = issues_df.sample(n=10000)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "Wl2o-NYMoygb" + }, + "source": [ + "Generate the text embeddings" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "li38q8FzDDMu" + }, + "outputs": [], + "source": [ + "from bigframes.ml.llm import PaLM2TextEmbeddingGenerator\n", + "\n", + "model = PaLM2TextEmbeddingGenerator() # No connection id needed" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cOuSOQ5FDewD" + }, + "outputs": [], + "source": [ + "# Will take ~3 minutes to compute the embeddings\n", + "predicted_embeddings = model.predict(downsampled_issues_df)\n", + "# Notice the lists of numbers that are our text embeddings for each complaint\n", + "predicted_embeddings.head() " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4H_etYfsEOFP" + }, + "outputs": [], + "source": [ + "# Join the complaints with their embeddings in the same DataFrame\n", + "combined_df = downsampled_issues_df.join(predicted_embeddings)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now have the complaints and their text embeddings as two columns in our combined_df. Recall that complaints with numerically similar text embeddings should have similar meanings semantically. We will now group similar complaints together." 
+ ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "OUZ3NNbzo1Tb" + }, + "source": [ + "## Step 2: KMeans clustering" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AhNTnEC5FRz2" + }, + "outputs": [], + "source": [ + "from bigframes.ml.cluster import KMeans\n", + "\n", + "cluster_model = KMeans(n_clusters=10) # We will divide our complaints into 10 groups" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Perform KMeans clustering" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6poSxh-fGJF7" + }, + "outputs": [], + "source": [ + "# Use KMeans clustering to calculate our groups. Will take ~3 minutes.\n", + "cluster_model.fit(combined_df[[\"text_embedding\"]])\n", + "clustered_result = cluster_model.predict(combined_df[[\"text_embedding\"]])\n", + "# Notice the CENTROID_ID column, which is the ID number of the group that\n", + "# each complaint belongs to.\n", + "clustered_result.head(n=5)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Join the group number to the complaints and their text embeddings\n", + "combined_clustered_result = combined_df.join(clustered_result)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our dataframe combined_clustered_result now has three columns: the complaints, their text embeddings, and an ID from 1-10 (inclusive) indicating which semantically similar group they belong to." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "21rNsFMHo8hO" + }, + "source": [ + "## Step 3: Summarize the complaints" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Build prompts - we will choose just two of our categories and prompt PaLM2TextGenerator to identify their salient characteristics. The prompt is natural language in a python string." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2E7wXM_jGqo6" + }, + "outputs": [], + "source": [ + "# Using bigframes, with syntax identical to pandas,\n", + "# filter out the first and second groups\n", + "cluster_1_result = combined_clustered_result[\n", + " combined_clustered_result[\"CENTROID_ID\"] == 1\n", + "][[\"consumer_complaint_narrative\"]]\n", + "cluster_1_result_pandas = cluster_1_result.head(5).to_pandas()\n", + "\n", + "cluster_2_result = combined_clustered_result[\n", + " combined_clustered_result[\"CENTROID_ID\"] == 2\n", + "][[\"consumer_complaint_narrative\"]]\n", + "cluster_2_result_pandas = cluster_2_result.head(5).to_pandas()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ZNDiueI9IP5e" + }, + "outputs": [], + "source": [ + "# Build plain-text prompts to send to PaLM 2. Use only 5 complaints from each group.\n", + "prompt1 = 'comment list 1:\\n'\n", + "for i in range(5):\n", + " prompt1 += str(i + 1) + '. ' + \\\n", + " cluster_1_result_pandas[\"consumer_complaint_narrative\"].iloc[i] + '\\n'\n", + "\n", + "prompt2 = 'comment list 2:\\n'\n", + "for i in range(5):\n", + " prompt2 += str(i + 1) + '. 
' + \\\n", + " cluster_2_result_pandas[\"consumer_complaint_narrative\"].iloc[i] + '\\n'\n", + "\n", + "print(prompt1)\n", + "print(prompt2)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BfHGJLirzSvH" + }, + "outputs": [], + "source": [ + "# The plain English request we will make of PaLM 2\n", + "prompt = (\n", + " \"Please highlight the most obvious difference between\"\n", + " \"the two lists of comments:\\n\" + prompt1 + prompt2\n", + ")\n", + "print(prompt)" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Get a response from PaLM 2 LLM by making a call to Vertex AI using our connection." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mL5P0_3X04dE" + }, + "outputs": [], + "source": [ + "from bigframes.ml.llm import PaLM2TextGenerator\n", + "\n", + "session = bf.get_global_session()\n", + "connection = f\"{PROJECT_ID}.{REGION}.{CONN_NAME}\"\n", + "q_a_model = PaLM2TextGenerator(session=session, connection_name=connection)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ICWHsqAW1FNk" + }, + "outputs": [], + "source": [ + "# Make a DataFrame containing only a single row with our prompt for PaLM 2\n", + "df = bf.DataFrame({\"prompt\": [prompt]})" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gB7e1LXU1pst" + }, + "outputs": [], + "source": [ + "# Send the request for PaLM 2 to generate a response to our prompt\n", + "major_difference = q_a_model.predict(df)\n", + "# PaLM 2's response is the only row in the dataframe result \n", + "major_difference[\"ml_generate_text_llm_result\"].iloc[0]" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now see PaLM2TextGenerator's characterization of the different comment groups. Thanks for using BigQuery DataFrames!" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/noxfile.py b/noxfile.py index 34b055de4..3dd23ba04 100644 --- a/noxfile.py +++ b/noxfile.py @@ -609,6 +609,7 @@ def notebook(session): # our test infrastructure. "notebooks/getting_started/getting_started_bq_dataframes.ipynb", "notebooks/generative_ai/bq_dataframes_llm_code_generation.ipynb", + "notebooks/generative_ai/bq_dataframes_llm_kmeans.ipynb", "notebooks/regression/bq_dataframes_ml_linear_regression.ipynb", "notebooks/generative_ai/bq_dataframes_ml_drug_name_generation.ipynb", "notebooks/vertex_sdk/sdk2_bigframes_pytorch.ipynb",