# Vector Embeddings

This example shows how to use GlassFlow to enrich event data with vector embeddings by calling an embedding model endpoint (here, the Vertex AI `text-embedding-004` model).

## Prerequisites

- Create your free GlassFlow account via the [GlassFlow WebApp](https://app.glassflow.dev).
- Get your [Personal Access Token](https://app.glassflow.dev/profile) to authorize the Python SDK to interact with GlassFlow Cloud.
- Set up Vertex AI in GCP:
    - Enable the Vertex AI model you want to use (`text-embedding-004` in this example).
    - Get the credentials JSON for a GCP service account that has the `aiplatform.endpoints.predict` permission.


In [66]:
%pip install "glassflow>=2.0.8" pandas Faker


Note: you may need to restart the kernel to use updated packages.


In [67]:
import glassflow

In [68]:
# Fill in your credentials.
# Set personal_access_token to your own token from https://app.glassflow.dev/profile
personal_access_token = ""
MODEL_ID = "text-embedding-004"
GCP_PROJECT_ID = ""
GCP_REGION = "us-central1"
GCP_SERVICE_ACCOUNT_JSON = ""  # Service account credentials JSON string
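
Hardcoding secrets in a notebook makes them easy to leak. As an alternative, here is a minimal sketch that reads the same values from environment variables and fails early if one is empty. The helper `load_settings` and the variable name `GLASSFLOW_PAT` are assumptions made for this illustration; neither GlassFlow nor GCP requires them.

```python
import os


def load_settings(env=os.environ):
    """Read the notebook settings from environment variables.

    The variable names are hypothetical; rename them to match your setup.
    """
    settings = {
        "personal_access_token": env.get("GLASSFLOW_PAT", ""),
        "MODEL_ID": env.get("MODEL_ID", "text-embedding-004"),
        "GCP_PROJECT_ID": env.get("GCP_PROJECT_ID", ""),
        "GCP_REGION": env.get("GCP_REGION", "us-central1"),
        "GCP_SERVICE_ACCOUNT_JSON": env.get("GCP_SERVICE_ACCOUNT_JSON", ""),
    }
    # Fail early, before any pipeline is created with empty credentials.
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise ValueError(f"Missing settings: {', '.join(missing)}")
    return settings
```

The returned dictionary can then be unpacked into the variables used in the rest of the notebook, for example `personal_access_token = load_settings()["personal_access_token"]`.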

## Create Pipeline

In [69]:
client = glassflow.GlassFlowClient(
    personal_access_token=personal_access_token
)

In [70]:
# Get the space named "examples" (or create one if no space is found)
list_spaces = client.list_spaces()

space_name = "examples"
for s in list_spaces.spaces:
    if s["name"] == space_name:
        space = glassflow.Space(
            personal_access_token=client.personal_access_token,
            id=s["id"], 
            name=s["name"]
        )
        break
else:
    space = client.create_space(name=space_name)

print(f"Space \"{space.name}\" with ID: {space.id}")

Space "examples" with ID: 4119e23c-c09f-4153-810c-6160ac8581eb


### Transformation Function

In [71]:
%pycat transform.py

### Environment variables needed for the transformation

In [72]:
env_vars = [
    {
        "name": "MODEL_ID",
        "value": MODEL_ID
    },
    {
        "name": "GCP_PROJECT_ID",
        "value": GCP_PROJECT_ID
    },
    {
        "name": "GCP_REGION",
        "value": GCP_REGION
    },
    {
        "name": "GCP_SERVICE_ACCOUNT_JSON",
        "value": GCP_SERVICE_ACCOUNT_JSON
    },
]

### Requirements file (`requirements.txt`)

In [73]:
with open("requirements.txt") as f:
    requirements_txt = f.read()
display(requirements_txt)

'google-cloud-aiplatform'

### Create Pipeline

In [74]:
pipeline_name = "vector-embeddings-example"

pipeline = client.create_pipeline(
    name=pipeline_name, 
    transformation_file='transform.py',
    space_id=space.id, 
    env_vars=env_vars, 
    requirements=requirements_txt
)
print("Pipeline ID:", pipeline.id)

Pipeline ID: 6e00ee7a-9a34-43a7-ad45-54140c2e97a8


## Produce data and send it to your pipeline

### Create a dummy data generator using the Python Faker library

In [75]:
from faker import Faker

def geo_data_generator():
    """Return one fake event: random text as `content` plus a UUID as `id`."""
    fake = Faker()
    return {
        'content': fake.text(max_nb_chars=1000),
        'id': fake.uuid4()
    }

### Get the pipeline data source object to publish events to the pipeline

In [77]:
data_source = pipeline.get_source()

In [78]:
# Generate some events and send them to the pipeline.
# Keep a local copy to compare with the pipeline output later.
n_events = 10
input_events = []
for i in range(n_events):
    event = geo_data_generator()
    input_events.append(event)
    data_source.publish(event)

### Display the data sent to the pipeline

In [80]:
import pandas as pd

display(pd.DataFrame(input_events))

|   | content | id |
|---|---------|----|
| 0 | Short fund probably per continue military mode... | 7fb430a6-3446-446f-892c-2defdafe9021 |
| 1 | Strategy participant hand. Word pick parent ma... | 7ad1c291-27e5-43dc-bee2-2c09d684cf2d |
| 2 | World cell human suggest pay hotel. President ... | 39cd3b6c-4b16-46ff-91ed-8168010baad8 |
| 3 | Quite identify health number list short. Less ... | 5278a905-860d-4361-815f-845435b1e2d3 |
| 4 | More feel itself lawyer practice. Paper fast p... | 60df16d0-80c8-4ab4-afc5-87d7001e2b22 |
| 5 | Onto first community mind since. Wall party ch... | a8f10833-2469-438f-a8dd-05ab6c2ab688 |
| 6 | Guy election radio score. Thing look federal b... | ffffe379-b6b7-4ce8-b2cf-cc3134c6e609 |
| 7 | Loss field position before. Team off wide time... | 77f8557b-c936-4281-9445-ec882c790baa |
| 8 | Throughout commercial tend major religious pla... | 74adadbc-b9cd-4f63-a7d9-19911085c6a8 |
| 9 | Security score together. Enjoy morning share c... | 16ac076f-d962-4a1a-acc8-acc19e7a9bf6 |


## Consume events from the pipeline 

Get the pipeline's data sink to consume the transformed events.

In [81]:
data_sink = pipeline.get_sink()

In [None]:
output_events = []
while True:
    resp = data_sink.consume()
    if resp.status_code == 200:
        output_events.append(resp.json())
    if len(output_events) == n_events:
        # all events have been consumed
        break
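
The loop above polls without pausing and never exits if an event fails to arrive. A bounded variant could look like the sketch below. The helper `consume_events` and its parameters `timeout_s` and `poll_interval_s` are hypothetical names introduced here; it relies only on the `consume()`, `status_code`, and `json()` calls already used in this notebook.

```python
import time


def consume_events(sink, n_events, timeout_s=60.0, poll_interval_s=0.5):
    """Collect up to n_events from sink, giving up after timeout_s seconds."""
    events = []
    deadline = time.monotonic() + timeout_s
    while len(events) < n_events and time.monotonic() < deadline:
        resp = sink.consume()
        if resp.status_code == 200:
            events.append(resp.json())
        else:
            # Nothing available yet: pause briefly instead of busy-waiting.
            time.sleep(poll_interval_s)
    return events
```

With this helper, `output_events = consume_events(data_sink, n_events)` returns whatever arrived before the deadline, so a shorter list than `n_events` signals that some events are still missing.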

In [None]:
import pandas as pd

display(pd.DataFrame(output_events))
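
Since the input events were kept locally, they can be matched against the output by `id` to confirm that each one came back enriched. The following is a minimal sketch, assuming the transformation preserves `id` and stores the vector under a field named `embedding`; that field name and the helper `match_events` are assumptions, so adjust `embedding_key` to whatever `transform.py` actually emits.

```python
def match_events(input_events, output_events, embedding_key="embedding"):
    """Pair each input event with its output by `id` and report what is missing."""
    outputs_by_id = {event["id"]: event for event in output_events}
    report = []
    for event in input_events:
        out = outputs_by_id.get(event["id"])
        vector = out.get(embedding_key) if out else None
        report.append({
            "id": event["id"],
            "received": out is not None,
            # 0 means the event is missing or carries no embedding.
            "embedding_dim": len(vector) if vector else 0,
        })
    return report
```

Passing the result to `pd.DataFrame(match_events(input_events, output_events))` gives a quick per-event overview.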

## Explore the pipeline in the web UI


In [None]:
pipeline_url = f"https://app.glassflow.dev/pipelines/{pipeline.id}"
print(pipeline_url)