# Google BigQuery Vector Search

[![](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/max-ostapenko/website-source/blob/main/src/posts/bigquery_vector_search/notebook.ipynb)

## Authorise Google Colab with GCP access

First, authenticate the user, set the project, dataset, and location IDs, and create a BigQuery dataset to hold the vector search resources.

Adjust the following parameters in the code below:
- `project_id`: the ID of the Google Cloud project.
- `dataset_id`: the ID of the dataset to be created.
- `location_id`: the location where the dataset will be created (`us`, `eu`, etc.).


In [None]:
from google.colab import auth
auth.authenticate_user()

project_id = 'max-ostapenko'
dataset_id = 'vector_search'
location_id = 'us'

# Create a dataset for vector search resources
!bq mk --project_id={project_id} --location {location_id} --dataset {dataset_id}

## Connect Vertex AI model via BigQuery connections


In [None]:
# Create BigQuery connection to Google cloud resources
connection_id = "vertex_ai-remote_functions-big_lake"
!bq mk --project_id={project_id} --location={location_id} \
    --connection_type=CLOUD_RESOURCE --connection {connection_id}

# Extract the service account ID from the connection details
import json
import subprocess

connection_details = json.loads(
    subprocess.check_output([
        "bq", "show",
        f"--project_id={project_id}",
        f"--location={location_id}",
        "--format=json",
        "--connection", connection_id,
    ])
)
service_account = connection_details["cloudResource"]["serviceAccountId"]

# Authorise 'Vertex AI User' role for connection service account
!gcloud projects add-iam-policy-binding {project_id} \
    --member='serviceAccount:{service_account}' --role='roles/aiplatform.user' > /dev/null

# Create a BigQuery ML remote model for text embeddings
# (the model is named `multimodalembedding`, but the endpoint is a text embedding model)
create_model_query = """
CREATE OR REPLACE MODEL `{project_id}.{dataset_id}.multimodalembedding`
REMOTE WITH CONNECTION `{project_id}.{location_id}.{connection_id}`
OPTIONS (ENDPOINT = "textembedding-gecko@latest");
""".format(
    dataset_id=dataset_id,
    project_id=project_id,
    location_id=location_id,
    connection_id=connection_id,
)
!bq query --project_id={project_id} --use_legacy_sql=false '{create_model_query}'
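
The service account lookup above indexes directly into the JSON printed by `bq show`. As a minimal sketch, the parsing step could be isolated into a helper that fails with a clearer message when the expected keys are missing. The helper name `extract_service_account` and the sample payload are hypothetical; only the `cloudResource` and `serviceAccountId` keys come from the cell above.

```python
import json


def extract_service_account(connection_json):
    """Return the service account ID from `bq show --format=json` output.

    Accepts str or bytes. Raises ValueError if the expected keys are absent.
    """
    details = json.loads(connection_json)
    try:
        return details["cloudResource"]["serviceAccountId"]
    except (KeyError, TypeError) as exc:
        raise ValueError(
            "connection details do not contain cloudResource.serviceAccountId"
        ) from exc


# Hypothetical payload, shaped like the fields the notebook reads.
sample_output = json.dumps(
    {"cloudResource": {"serviceAccountId": "example-sa@example.iam.gserviceaccount.com"}}
)
service_account_example = extract_service_account(sample_output)
print(service_account_example)
```

Keeping the parsing separate from the `subprocess` call makes it possible to check this step without access to a GCP project.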

## Generate embeddings table

The [ML.GENERATE_TEXT_EMBEDDING function](https://cloud.google.com/bigquery/docs/reference/standard-sql/bigqueryml-syntax-generate-text-embedding) converts each input string into a vector in a semantic embedding space, so that strings with similar meaning end up close to each other.

Embedding generation runs on a Vertex AI endpoint and is billed separately: see the [cost of embeddings generation via Vertex AI endpoint](https://cloud.google.com/vertex-ai/pricing#image_generation).

### Generate embeddings from URLs list


In [3]:
get_product_names_query = """
CREATE OR REPLACE TABLE
    `{dataset_id}.url_embedded` AS (
    SELECT
        *
    FROM
        ML.GENERATE_TEXT_EMBEDDING( MODEL `{dataset_id}.multimodalembedding`,
            (
            SELECT
                url AS content
            FROM `httparchive.urls.latest_crux_desktop`
            WHERE rank = 1000
            LIMIT 1000 ),
            STRUCT(TRUE AS flatten_json_output) ) );
""".format(dataset_id=dataset_id)
!bq query --project_id={project_id} --use_legacy_sql=false '{get_product_names_query}'

# Preview the table
query_product_names = """
SELECT
    content, statistics, ml_embed_text_status, text_embedding
FROM
`{dataset_id}.url_embedded`
LIMIT 5
""".format(dataset_id=dataset_id)
!bq query --project_id={project_id} --use_legacy_sql=false '{query_product_names}'

Waiting on bqjob_r7789bb4637c6f417_0000018dc1c3c75d_1 ... (11s) Current status: DONE   
Replaced max-ostapenko.vector_search.url_embedded


### Generate embeddings from IMDB catalogue


In [4]:
get_product_names_query = """
CREATE OR REPLACE TABLE
    `{dataset_id}.movie_title_embedded` AS (
    SELECT
        *
    FROM
        ML.GENERATE_TEXT_EMBEDDING( MODEL `{dataset_id}.multimodalembedding`,
            (
                SELECT
                    primary_title as content
                FROM `bigquery-public-data.imdb.title_basics`
                WHERE
                    title_type = "movie"
                    AND start_year > 1970
                    AND is_adult = 0
                    AND tconst IN (
                        SELECT
                            tconst
                        FROM `bigquery-public-data.imdb.title_ratings`
                        WHERE
                            average_rating > 6.5
                            AND num_votes > 10000
                    )
                ),
            STRUCT(TRUE AS flatten_json_output) ) );
""".format(dataset_id=dataset_id)
!bq query --project_id={project_id} --use_legacy_sql=false '{get_product_names_query}'

# Preview the table
query_product_names = """
SELECT
    content, statistics, ml_embed_text_status, text_embedding
FROM
`{dataset_id}.movie_title_embedded`
LIMIT 5
""".format(dataset_id=dataset_id)
!bq query --project_id={project_id} --use_legacy_sql=false '{query_product_names}'

Waiting on bqjob_r4d8ecda036484bd8_0000018dc1c417df_1 ... (112s) Current status: DONE   
Replaced max-ostapenko.vector_search.movie_title_embedded


## Search for neighbours

The cost of a search consists of:

- the cost of generating an embedding for the search query (negligible for short queries),
- the cost of BigQuery processing the embeddings table.
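
The BigQuery part of that cost can be approximated from the size of the embedding column that a search has to scan. The sketch below is a rough estimate, not a billing calculation: the 768-dimensional embedding size, the 8-byte `FLOAT64` storage size, and the price per TiB are all assumptions that should be checked against the model and current pricing, and the function names are hypothetical. It also ignores the `content` column, minimum billing increments, and any free tier.

```python
TIB = 2 ** 40  # bytes in one tebibyte

# Assumptions (not taken from the notebook): adjust to the actual model and pricing.
EMBEDDING_DIMENSIONS = 768   # assumed vector length of the embedding model
BYTES_PER_FLOAT64 = 8        # assumed storage size of one FLOAT64 element
ASSUMED_USD_PER_TIB = 6.25   # assumed on-demand price per TiB scanned


def embedding_column_bytes(row_count, dimensions=EMBEDDING_DIMENSIONS):
    """Approximate size in bytes of an ARRAY<FLOAT64> embedding column."""
    return row_count * dimensions * BYTES_PER_FLOAT64


def estimate_scan_cost_usd(bytes_processed, usd_per_tib=ASSUMED_USD_PER_TIB):
    """Approximate on-demand cost of scanning `bytes_processed` bytes."""
    return bytes_processed / TIB * usd_per_tib


# Example: one brute-force search over a table of 1000 embedded rows.
scanned = embedding_column_bytes(1000)
print(f"{scanned} bytes scanned, about ${estimate_scan_cost_usd(scanned):.6f} per search")
```

Because every search compares the query against all rows, the scanned bytes and therefore the estimate grow linearly with the number of rows in the embeddings table.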


In [5]:
select_neighbours_query = """
SELECT
    base.content,
    EUCLIDEAN_DISTANCE(search.text_embedding, base.text_embedding) AS distance
FROM ML.GENERATE_TEXT_EMBEDDING(
    MODEL `{dataset_id}.multimodalembedding`,
    (SELECT "{search_query}" AS content),
    STRUCT(TRUE AS flatten_json_output)
) AS search
CROSS JOIN `{dataset_id}.{source_table}` AS base
ORDER BY distance ASC
LIMIT {limit};
"""
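
The query above is a brute-force search: it computes the Euclidean distance between the query embedding and every row of the base table, sorts by distance, and keeps the closest rows. A minimal local sketch of the same logic, using tiny hand-made vectors instead of real embeddings (the function name `nearest_neighbours` and the data are hypothetical), might look like this:

```python
import math


def nearest_neighbours(query_embedding, base_rows, limit):
    """Return the `limit` rows closest to `query_embedding`.

    `base_rows` is a list of (content, embedding) pairs. Every row is
    scored, mirroring the CROSS JOIN + ORDER BY distance + LIMIT pattern.
    """
    scored = [
        (content, math.dist(query_embedding, embedding))  # Euclidean distance
        for content, embedding in base_rows
    ]
    scored.sort(key=lambda pair: pair[1])  # ascending distance
    return scored[:limit]


# Toy 2-dimensional "embeddings" for illustration only.
base_rows = [
    ("a", [0.0, 0.0]),
    ("b", [3.0, 4.0]),
    ("c", [1.0, 0.0]),
]
result = nearest_neighbours([0.0, 0.0], base_rows, limit=2)
print(result)  # [('a', 0.0), ('c', 1.0)]
```

Scoring every row is simple and exact, but the work grows with the table size, which is why the BigQuery processing cost scales with the number of embedded rows.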

### Websites classification


In [6]:
search_query = "Social"
limit = "10"

search_urls = select_neighbours_query.format(
    source_table="url_embedded",
    dataset_id=dataset_id,
    search_query=search_query,
    limit = limit
)
!bq query --project_id={project_id} --use_legacy_sql=false '{search_urls}'

+--------------------------------+--------------------+
|            content             |      distance      |
+--------------------------------+--------------------+
| https://www.facebook.com/      | 0.6952186979498833 |
| https://web.facebook.com/      | 0.7158822720622858 |
| https://free.facebook.com/     |  0.740260932509051 |
| https://twitter.com/           |   0.75665178425304 |
| https://business.facebook.com/ | 0.7568543973313501 |
| https://apps.facebook.com/     | 0.7650217870318406 |
| https://www.instagram.com/     | 0.7678844917192112 |
| https://www.linkedin.com/      | 0.7805433740785335 |
| https://www.tiktok.com/        | 0.8015824452215249 |
| https://en.wikipedia.org/      | 0.8080364602207089 |
+--------------------------------+--------------------+


### Movie title recommendation


In [7]:
search_query = "Francis Ford Coppola"
limit = "10"

search_movies = select_neighbours_query.format(
    source_table="movie_title_embedded",
    dataset_id=dataset_id,
    search_query=search_query,
    limit=limit
)
!bq query --project_id={project_id} --use_legacy_sql=false '{search_movies}'

Waiting on bqjob_r26b23e8c2ce5254e_0000018dc1c64ffe_1 ... (0s) Current status: DONE   
+------------------------+--------------------+
|        content         |      distance      |
+------------------------+--------------------+
| Forrest Gump           |  0.693854122498681 |
| Fitzcarraldo           | 0.7192698337456384 |
| The Godfather          | 0.7495456552309414 |
| Pulp Fiction           | 0.7537942915567972 |
| Suspiria               | 0.7555659646216765 |
| Suspiria               | 0.7555659646216765 |
| The Godfather Part II  | 0.7644078793453681 |
| Capote                 | 0.7654285032288327 |
| Hitchcock              | 0.7744345185709752 |
| The Godfather Part III | 0.7786602943923249 |
+------------------------+--------------------+
