# Vector Embeddings

This example shows how to use GlassFlow to enrich event data with vector embeddings by calling an embedding model endpoint.

## Prerequisites

- Create your free GlassFlow account via the [GlassFlow WebApp](https://app.glassflow.dev).
- Get your [Personal Access Token](https://app.glassflow.dev/profile) to authorize the Python SDK to interact with GlassFlow Cloud.
- Set up Vertex AI in GCP:
    - Enable the Vertex AI model you want to use (`text-embedding-004` in our case).
    - Get the credentials JSON of a GCP service account that has the `aiplatform.endpoints.predict` permission.
- Have a Pinecone index to sink the vectors into.


In [1]:
%pip install "glassflow>=2.0.8" pandas Faker

Note: you may need to restart the kernel to use updated packages.


In [2]:
import glassflow

In [43]:
# Fill in your credentials.
# Set this variable to your own personal access token from https://app.glassflow.dev/profile
personal_access_token = ""
MODEL_ID = "text-embedding-004"
GCP_PROJECT_ID = ""
GCP_REGION = "us-central1"
GCP_SERVICE_ACCOUNT_JSON = ""  # Service account credentials JSON string
PINECONE_HOST = ""
PINECONE_API_KEY = ""
PINECONE_INDEX_HOST = ""

## Create Pipeline

In [44]:
client = glassflow.GlassFlowClient(
    personal_access_token=personal_access_token
)

In [45]:
# Get the space named "examples" (or create one if no space is found)
list_spaces = client.list_spaces()

space_name = "examples"
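for s in list_spaces.spaces:
    if s["name"] == space_name:
        space = glassflow.Space(
            personal_access_token=client.personal_access_token,
            id=s["id"],
            name=s["name"]
        )
        break
else:
    # for/else: this branch runs only if the loop ended without a break,
    # i.e. no space with the given name was found.
    space = client.create_space(name=space_name)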

print(f"Space \"{space.name}\" with ID: {space.id}")

Space "examples" with ID: ebfb6a52-ff0e-4515-ae3b-53ecbee10029


### Transformation Function

In [46]:
%pycat transform.py
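
The contents of `transform.py` are not reproduced in the cell output above. As a rough illustration only, a transformation of this kind could look like the sketch below. It assumes a `handler(data, log)` entry point, events carrying the `content` and `id` fields generated later in this notebook, and a sink that accepts records in Pinecone's vector shape (`id`, `values`, `metadata`). The `embed_text` helper and the extra `embed` parameter are hypothetical, added here so the handler can be exercised without calling Vertex AI; the actual file may differ.

```python
import json
import os


def embed_text(text):
    """Return the embedding of `text` from Vertex AI (hypothetical helper)."""
    # Imported lazily so this module still loads where the SDK is not installed.
    import vertexai
    from google.oauth2 import service_account
    from vertexai.language_models import TextEmbeddingModel

    # Build credentials from the JSON string passed in as an env variable.
    credentials = service_account.Credentials.from_service_account_info(
        json.loads(os.environ["GCP_SERVICE_ACCOUNT_JSON"])
    )
    vertexai.init(
        project=os.environ["GCP_PROJECT_ID"],
        location=os.environ["GCP_REGION"],
        credentials=credentials,
    )
    model = TextEmbeddingModel.from_pretrained(os.environ["MODEL_ID"])
    return model.get_embeddings([text])[0].values


def handler(data, log, embed=embed_text):
    """Attach an embedding to one event. The output shape is an assumption."""
    log.info("Embedding event %s", data["id"])
    return {
        "id": data["id"],
        "values": embed(data["content"]),
        "metadata": {"content": data["content"]},
    }
```

Passing a stub as `embed` (for example `lambda text: [0.0, 0.0]`) lets you check the record shape locally before deploying the pipeline.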

### Environment variables needed for the transformation

In [47]:
env_vars = [
    {
        "name": "MODEL_ID",
        "value": MODEL_ID
    },
    {
        "name": "GCP_PROJECT_ID",
        "value": GCP_PROJECT_ID
    },
    {
        "name": "GCP_REGION",
        "value": GCP_REGION
    },
    {
        "name": "GCP_SERVICE_ACCOUNT_JSON",
        "value": GCP_SERVICE_ACCOUNT_JSON
    },
]
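
Because the credential variables in this notebook start out as empty strings, it can help to check the list before creating the pipeline. The sketch below is one possible way to do that; `missing_env_vars` is a hypothetical helper, not part of the GlassFlow SDK, and it only assumes the `{"name": ..., "value": ...}` entry shape used above.

```python
def missing_env_vars(env_vars):
    """Return the names of entries whose value is empty or whitespace."""
    return [
        var["name"]
        for var in env_vars
        if not str(var.get("value") or "").strip()
    ]


# Example entries in the same shape as the env_vars list above.
example = [
    {"name": "MODEL_ID", "value": "text-embedding-004"},
    {"name": "GCP_PROJECT_ID", "value": ""},
]
print(missing_env_vars(example))  # ['GCP_PROJECT_ID']
```

Calling `missing_env_vars(env_vars)` on the real list and stopping if the result is non-empty avoids deploying a transformation that fails on its first event.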

### Requirements txt

In [48]:
with open("requirements.txt") as f:
    requirements_txt = f.read()
display(requirements_txt)

'google-cloud-aiplatform'

### Create Pipeline

In [65]:
pipeline_name = "vector-embeddings-example"

pipeline = client.create_pipeline(
    name=pipeline_name,
    transformation_file="transform.py",
    space_id=space.id,
    env_vars=env_vars,
    requirements=requirements_txt,
    sink_kind="pinecone_json",
    sink_config={
        "api_host": PINECONE_HOST,
        "api_key": PINECONE_API_KEY,
        "index_host": PINECONE_INDEX_HOST,
    },
)
print("Pipeline ID:", pipeline.id)

Pipeline ID: 7cbd5723-d1ab-4b73-a4c7-6badb41d5116


## Produce data and send it to your pipeline

### Create a dummy data generator using the Python Faker library

In [66]:
from faker import Faker

def geo_data_generator():
    """Return one fake event with random text content and a UUID."""
    fake = Faker()
    return {
        'content': fake.text(max_nb_chars=1000),
        'id': fake.uuid4()
    }

### Get pipeline data source object to publish events to the pipeline

In [68]:
data_source = pipeline.get_source()

In [69]:
# Generate some data and send it to the pipeline. Store it locally to compare
n_events = 10
input_events = []
for i in range(n_events):
    event = geo_data_generator()
    input_events.append(event)
    data_source.publish(event)

### Display data sent to the pipeline

In [71]:
import pandas as pd

display(pd.DataFrame(input_events))

|   | content | id |
|---|---------|----|
| 0 | Say own any bad method society edge. Full peop... | 4e428baa-7d3d-4cf3-a354-7098f607a770 |
| 1 | Generation find prove whole. Economic sister c... | ecb003b7-44c6-419a-bc3e-2bed2d5f444d |
| 2 | Mind leg area rate. Rise religious happy.\nAge... | 3f4409e7-0106-4ca0-8a9e-6fa6488813ce |
| 3 | Artist wife should such avoid. Similar another... | 5a595aa4-2ab7-45c8-b8ba-e2826d4517de |
| 4 | West several right peace glass. Finally imagin... | d98c8808-5d2f-4168-a262-ca06c5fa497e |
| 5 | Study far discover threat himself. Same positi... | 5d462614-deae-42e3-ad59-5a0127a798e2 |
| 6 | Continue like bag conference.\nSource every tr... | a4815529-3834-4119-9a83-56537d87ccca |
| 7 | Actually evidence PM reason research pretty ru... | 646cbc0a-c32d-4498-8250-c91cf8a5bc09 |
| 8 | Huge on boy each customer prove. Series whose ... | e21a07a3-3411-431b-81f1-22d661463e9f |
| 9 | Media move professor example indicate product ... | e24d9844-d1d3-4d7d-aa9e-7a46c9aedd59 |


## Check your Pinecone index

Have a look at the documents newly added to your Pinecone index.
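
One way to do this check from code is to fetch the published IDs and see which ones are absent. The sketch below is a minimal example under a few assumptions: the sink stores each vector under the event's `id`, the vectors land in the index's default namespace, and the pipeline has had time to process the events. `find_missing_ids` is a hypothetical helper; it relies only on the index object's `fetch(ids=...)` method and the `vectors` mapping on its response.

```python
def find_missing_ids(index, ids):
    """Return the IDs from `ids` that the index does not contain."""
    response = index.fetch(ids=list(ids))
    return [vector_id for vector_id in ids if vector_id not in response.vectors]


# Usage against a real index (needs the `pinecone` package and network access):
# from pinecone import Pinecone
# index = Pinecone(api_key=PINECONE_API_KEY).Index(host=PINECONE_INDEX_HOST)
# print(find_missing_ids(index, [event["id"] for event in input_events]))
```

An empty result means every published event has a vector in the index; if IDs are still missing, waiting a little and re-running the check is a reasonable first step.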

## Explore the pipeline in the web UI


In [None]:
pipeline_url = f"https://app.glassflow.dev/pipelines/{pipeline.id}"
print(pipeline_url)