# Creating Embeddings and a Vector Store

This notebook demonstrates the process of creating embeddings and setting up a vector store for a course content retrieval system. 

It covers the following key steps:

1. Importing the necessary libraries, then creating and configuring the database
1. Connecting to a Google Cloud SQL for PostgreSQL instance
1. Loading the course content (Markdown text stored in a JSONL file)
1. Creating embeddings for the course content using a Google Gemini embedding model
1. Storing the embeddings in a vector database for efficient similarity search

First, set up a few constants:

In [1]:
project_id = "imrenagi-gemini-experiment" #change this to your project id
region = "us-central1"

instance_name="pyconapac-demo"
database_password = 'testing' #change this to your database password
database_name = 'testing' #change this to your database name
database_user = 'testing' #change this to your database user

# Don't update the lines below

embeddings_table_name = "course_content_embeddings"
gemini_embedding_model = "text-embedding-004"

assert database_name, "⚠️ Please provide a database name"
assert database_user, "⚠️ Please provide a database user"
assert database_password, "⚠️ Please provide a database password"


## Setting Up PostgreSQL in Google Cloud SQL

Here we will set the default GCP project and look up the account of the currently authenticated gcloud user.

In [32]:
# Configure gcloud.
!gcloud config set project {project_id}

# Look up the active gcloud account (the Cloud SQL Client role is granted in a later cell)
current_user = !gcloud auth list --filter=status:ACTIVE --format="value(account)"
print(f"{current_user}")

Updated property [core/project].
['imre.nagi2812@gmail.com']


Before sending queries to the database, we have to grant the current user the Cloud SQL Client role so that this notebook can access the database:

In [None]:
print(f"Granting Cloud SQL Client role to {current_user[0]}")
# granting cloudsql client role to the current user
!gcloud projects add-iam-policy-binding {project_id} \
  --member=user:{current_user[0]} \
  --role="roles/cloudsql.client"

Next, we are going to create a new PostgreSQL instance and database in Google Cloud SQL, along with the PostgreSQL user/role that will be used to store the embeddings later on.

In [34]:
#@markdown Create and set up a Cloud SQL PostgreSQL instance, if not done already.
database_version = !gcloud sql instances describe {instance_name} --format="value(databaseVersion)"
# Guard against empty output before indexing into the result
if database_version and database_version[0].startswith("POSTGRES"):
  print("Found an existing Postgres Cloud SQL Instance!")
else:
  print("Creating new Cloud SQL instance...")
  # WARNING: 0.0.0.0/0 allows connections from any IP address; restrict this outside of demos
  !gcloud sql instances create {instance_name} --database-version=POSTGRES_15 \
    --region={region} --cpu=1 --memory=4GB --root-password={database_password} \
    --authorized-networks=0.0.0.0/0
# Create the database, if it does not exist.
out = !gcloud sql databases list --instance={instance_name} --filter="NAME:{database_name}" --format="value(NAME)"
if ''.join(out) == database_name:
  print("Database %s already exists, skipping creation." % database_name)
else:
  !gcloud sql databases create {database_name} --instance={instance_name}
# Create the database user for accessing the database.
!gcloud sql users create {database_user} \
  --instance={instance_name} \
  --password={database_password}

Found an existing Postgres Cloud SQL Instance!
Database testing already exists, skipping creation.
Creating Cloud SQL user...done.                                                
Created user [testing].


Here we retrieve the IP address of the PostgreSQL instance we just created. Take note of the database host IP address.

In [35]:
# get the ip address of the instance
ip_addresses = !gcloud sql instances describe {instance_name} --project {project_id} --format 'value(ipAddresses.ipAddress)'
# Split the IP addresses and take the first one
database_host = ip_addresses[0].split(';')[0].strip()
print(f"Using database host: {database_host}")

Using database host: 35.232.5.157
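
The cell above assumes that gcloud prints multiple addresses on one line separated by semicolons, and that at least one line is returned. A minimal sketch of the same parsing with an explicit check for empty output follows; the helper name `first_ip` and the sample addresses are hypothetical and are not real gcloud output:

```python
def first_ip(gcloud_lines):
    """Return the first address from the lines captured from gcloud.

    Assumes addresses on a line are separated by ';', as the cell above does.
    """
    if not gcloud_lines or not gcloud_lines[0].strip():
        raise ValueError("gcloud returned no IP address; does the instance exist?")
    return gcloud_lines[0].split(";")[0].strip()


# Made-up sample resembling a captured command result (documentation-range IPs)
sample_output = ["203.0.113.10;10.0.0.5"]
host = first_ip(sample_output)
print(f"Using database host: {host}")
```

Failing early with a clear message is easier to debug than an `IndexError` raised from the list lookup.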


## Prepare the embeddings

Now, we will build the embeddings from the content we have selected. Let's start by loading the contents of course_content.jsonl into a DataFrame:

In [37]:
import pandas as pd

# Read the JSONL file into a pandas DataFrame
df = pd.read_json('course_content.jsonl', lines=True)
df.head(5)

| | id | title | content | file_path | slug |
|---|---|---|---|---|---|
| 0 | 1 | REST Security Cheat Sheet | # REST Security Cheat Sheet\n\n## Introduction... | sources/REST_Security_Cheat_Sheet.md | rest-security-cheat-sheet |
| 1 | 2 | Forgot Password Cheat Sheet | # Forgot Password Cheat Sheet\n\n## Introducti... | sources/Forgot_Password_Cheat_Sheet.md | forgot-password-cheat-sheet |
| 2 | 3 | Authentication Cheat Sheet | # Authentication Cheat Sheet\n\n## Introductio... | sources/Authentication_Cheat_Sheet.md | authentication-cheat-sheet |
| 3 | 4 | Password Storage Cheat Sheet | # Password Storage Cheat Sheet\n\n## Introduct... | sources/Password_Storage_Cheat_Sheet.md | password-storage-cheat-sheet |
| 4 | 5 | Authorization Cheat Sheet | # Authorization Cheat Sheet\n\n## Introduction... | sources/Authorization_Cheat_Sheet.md | authorization-cheat-sheet |


Before creating the embeddings, we need to split the content of each file into chunks. This is usually required, especially when the content is long, because an embedding model limits the number of input tokens it can accept.

In [38]:
from langchain.text_splitter import MarkdownTextSplitter
from langchain_core.documents import Document

# chunk_size and chunk_overlap are measured in characters (the splitter's default length function is len)
text_splitter = MarkdownTextSplitter(
  chunk_size=1000,
  chunk_overlap=200)

chunked = []
for _, row in df.iterrows():
    course_content_id = row["id"]
    title = row["title"]
    content = row["content"]
    splits = text_splitter.create_documents([content])
    for s in splits:
        # Attach the source document's id and title to every chunk
        metadata = {"course_content_id": course_content_id, "title": title}
        doc = Document(page_content=s.page_content, metadata=metadata)
        chunked.append(doc)

chunked[0]

Document(metadata={'course_content_id': 1, 'title': 'REST Security Cheat Sheet'}, page_content="# REST Security Cheat Sheet\n\n## Introduction\n\n[REST](http://en.wikipedia.org/wiki/Representational_state_transfer) (or **RE**presentational **S**tate **T**ransfer) is an architectural style first described in [Roy Fielding](https://en.wikipedia.org/wiki/Roy_Fielding)'s Ph.D. dissertation on [Architectural Styles and the Design of Network-based Software Architectures](https://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm).\n\nIt evolved as Fielding wrote the HTTP/1.1 and URI specs and has been proven to be well-suited for developing distributed hypermedia applications. While REST is more widely applicable, it is most commonly used within the context of communicating with services via HTTP.")
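
To build intuition for what the chunk size and chunk overlap settings control, here is a simplified sketch of fixed-size chunking with overlap in plain Python. It is not how MarkdownTextSplitter works internally: that splitter prefers Markdown-aware boundaries such as headings and paragraphs, while this sketch cuts at fixed character offsets. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each piece repeats the last three characters of the previous one. That repetition is the purpose of the overlap: a sentence that straddles a chunk boundary still appears intact in at least one chunk.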

Once the file content is chunked into smaller pieces, we are going to create an embedding for each chunk and store it in Cloud SQL.

Now let's initialize the Vertex AI SDK and create the embedding service.

In [None]:
from langchain_google_vertexai import VertexAIEmbeddings
import vertexai

# Initialize Vertex AI
vertexai.init(project=project_id, location=region)
# Create a Vertex AI Embeddings service
embeddings_service = VertexAIEmbeddings(model_name=gemini_embedding_model)
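
The table created later in this notebook uses vector_size=768, so the embedding model has to return vectors of that length. A minimal sketch of a sanity check follows. It relies only on the `embed_query` method that LangChain embedding objects expose; `FakeEmbeddings` and `embedding_dimension` are hypothetical names, and the fake class is an offline stand-in so the check can run without Vertex AI credentials:

```python
class FakeEmbeddings:
    """Offline stand-in that mimics the embed_query method of an embeddings object."""

    def __init__(self, dimension):
        self.dimension = dimension

    def embed_query(self, text):
        # Return a dummy vector of the configured length
        return [0.0] * self.dimension


def embedding_dimension(embedder, probe_text="dimension probe"):
    """Embed a short probe string and report the length of the resulting vector."""
    return len(embedder.embed_query(probe_text))


expected_vector_size = 768  # value passed as vector_size when the table is created
dimension = embedding_dimension(FakeEmbeddings(768))
assert dimension == expected_vector_size, f"expected {expected_vector_size}, got {dimension}"
```

With credentials configured, the same check could be run against the real service by calling `embedding_dimension(embeddings_service)`.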

Now, let's construct the embeddings and store them in the database.

The function below performs these steps:
1. It creates a PostgresEngine. This instance handles the database connection as well as authentication.
1. Then, ainit_vectorstore_table() creates the table that will store the chunked content, its embedding, and its metadata.
1. It initializes the PostgresVectorStore with the engine and the embedding service.
1. It calls aadd_documents once with all chunked documents, which creates an embedding for each chunk and inserts a new record into the table.

In [None]:
from langchain_google_cloud_sql_pg import PostgresEngine, PostgresVectorStore
import uuid

async def create_vectorstore():
    # The engine handles the connection and authentication to the Cloud SQL instance
    engine = await PostgresEngine.afrom_instance(
        project_id,
        region,
        instance_name,
        database_name,
        user=database_user,
        password=database_password,
    )

    # vector_size must match the embedding model's output dimension.
    # overwrite_existing=True replaces any existing table with the same name.
    await engine.ainit_vectorstore_table(
        table_name=embeddings_table_name, vector_size=768, overwrite_existing=True
    )

    vector_store = await PostgresVectorStore.create(
        engine,
        table_name=embeddings_table_name,
        embedding_service=embeddings_service,
    )

    # One random UUID per chunk, used as the record id
    ids = [str(uuid.uuid4()) for _ in chunked]
    await vector_store.aadd_documents(chunked, ids=ids)

await create_vectorstore()
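
Once the embeddings are stored, a similarity search compares a query embedding against the stored vectors and returns the closest chunks. The vector store does this inside PostgreSQL; the sketch below only illustrates the idea in plain Python, using cosine similarity as one common metric. The three-dimensional vectors, the labels, and the helper names `cosine_similarity` and `top_k` are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the labels of the k stored vectors most similar to the query."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Toy "embeddings"; real ones from this notebook would have 768 dimensions
toy_store = {
    "authentication": [0.9, 0.1, 0.0],
    "password storage": [0.7, 0.3, 0.1],
    "rest security": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]
print(top_k(query, toy_store))
# → ['authentication', 'password storage']
```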