# Vector Embeddings

This example shows how to use GlassFlow to enrich event data with vector embeddings by calling an embeddings model endpoint.

## Prerequisites

- Create your free GlassFlow account via the [GlassFlow WebApp](https://app.glassflow.dev).
- Get your [Personal Access Token](https://app.glassflow.dev/profile) to authorize the Python SDK to interact with GlassFlow Cloud.
- Set up Vertex AI in GCP:
    - Enable the Vertex AI model you want to use (`text-embedding-004` in our case).
    - Get the credentials JSON for a GCP service account that has the `aiplatform.endpoints.predict` permission.
- Have a Pinecone index to sink the vectors into.


In [None]:
%pip install "glassflow>=2.0.8" pandas Faker

In [None]:
import glassflow

In [None]:
# Fill in your credentials.
# Set personal_access_token to your own token from https://app.glassflow.dev/profile
personal_access_token = ""
MODEL_ID = "text-embedding-004"
GCP_PROJECT_ID = ""
GCP_REGION = "us-central1"
GCP_SERVICE_ACCOUNT_JSON = ""  # Service account credentials JSON string
PINECONE_HOST = ""
PINECONE_API_KEY = ""
PINECONE_INDEX_HOST = ""
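
An empty or malformed `GCP_SERVICE_ACCOUNT_JSON` would only surface as an error once the pipeline starts processing events, so it can be worth checking the string locally first. The following is a minimal sketch, not part of the GlassFlow SDK: `check_service_account_json` is a hypothetical helper, and it only verifies a few fields that GCP service account key files normally carry.

```python
import json

# Fields a GCP service account key file normally contains (checked here as a sanity test only).
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def check_service_account_json(raw: str) -> dict:
    """Hypothetical helper: parse the key string and verify its basic shape."""
    info = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed input
    if not isinstance(info, dict):
        raise ValueError("service account JSON must be an object")
    missing = REQUIRED_KEYS - info.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if info["type"] != "service_account":
        raise ValueError(f"unexpected credential type: {info['type']!r}")
    return info
```

Calling `check_service_account_json(GCP_SERVICE_ACCOUNT_JSON)` before creating the pipeline raises a `ValueError` if the string is empty, is not valid JSON, or lacks one of the expected fields.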

## Create Pipeline

In [None]:
client = glassflow.GlassFlowClient(
    personal_access_token=personal_access_token
)

In [None]:
# Get the space named "examples" (or create one if no space is found)
list_spaces = client.list_spaces()

space_name = "examples"
for s in list_spaces.spaces:
    if s["name"] == space_name:
        space = glassflow.Space(
            personal_access_token=client.personal_access_token,
            id=s["id"], 
            name=s["name"]
        )
        break
else:
    space = client.create_space(name=space_name)

print(f"Space \"{space.name}\" with ID: {space.id}")

### Transformation Function

In [None]:
%pycat transform.py
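
The contents of `transform.py` are not reproduced in this text, so here is a rough sketch of the shape such a transformation could take. It assumes the `handler(data, log)` entry point used by GlassFlow transformation functions, and it replaces the Vertex AI call with a hypothetical stand-in, `fake_embed`, so that it runs without credentials. The output layout (`id`, `values`, `metadata`) is also an assumption about what the Pinecone sink accepts, not something taken from the actual file.

```python
import hashlib


def fake_embed(text: str, dim: int = 8) -> list[float]:
    """Hypothetical stand-in for the embeddings endpoint: a deterministic pseudo-vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # Scale the first `dim` bytes of the hash into the range [0, 1].
    return [byte / 255.0 for byte in digest[:dim]]


def handler(data, log, embed=fake_embed):
    """Attach an embedding of the event's `content` field to the event."""
    vector = embed(data["content"])
    log.info("embedded event %s into %d dimensions", data["id"], len(vector))
    # Assumed output layout: one vector record keyed by the event id.
    return {
        "id": data["id"],
        "values": vector,
        "metadata": {"content": data["content"]},
    }
```

In the real transformation, `embed` would call the model named by `MODEL_ID` using the service account credentials; the stub only keeps the example self-contained.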

### Environment variables needed for the transformation

In [None]:
env_vars = [
    {
        "name": "MODEL_ID",
        "value": MODEL_ID
    },
    {
        "name": "GCP_PROJECT_ID",
        "value": GCP_PROJECT_ID
    },
    {
        "name": "GCP_REGION",
        "value": GCP_REGION
    },
    {
        "name": "GCP_SERVICE_ACCOUNT_JSON",
        "value": GCP_SERVICE_ACCOUNT_JSON
    },
]
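
The list above repeats the same `name`/`value` shape for each variable. As an optional convenience, a small helper can build that list from a plain dictionary and fail early when a value was left blank. This is a sketch only: `to_env_vars` is a hypothetical function, not part of the GlassFlow SDK.

```python
def to_env_vars(settings: dict[str, str]) -> list[dict[str, str]]:
    """Hypothetical helper: turn {"NAME": "value"} into [{"name": ..., "value": ...}]."""
    empty = [name for name, value in settings.items() if not value]
    if empty:
        # Catch credentials that were left as "" before the pipeline is created.
        raise ValueError(f"empty environment variables: {empty}")
    return [{"name": name, "value": value} for name, value in settings.items()]
```

For example, `to_env_vars({"MODEL_ID": MODEL_ID, "GCP_REGION": GCP_REGION})` yields two entries in the same format as the hand-written list.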

### Requirements file (`requirements.txt`)

In [None]:
with open("requirements.txt") as f:
    requirements_txt = f.read()
display(requirements_txt)

### Create Pipeline

In [None]:
pipeline_name = "vector-embeddings-example"

pipeline = client.create_pipeline(
    name=pipeline_name, 
    transformation_file='transform.py',
    space_id=space.id, 
    env_vars=env_vars, 
    requirements=requirements_txt,
    sink_kind="pinecone_json",
    sink_config={
        "api_host": PINECONE_HOST,
        "api_key": PINECONE_API_KEY,
        "index_host": PINECONE_INDEX_HOST,
    }
)
print("Pipeline ID:", pipeline.id)

## Produce data and send it to your pipeline

### Create a dummy data generator using the Python Faker library

In [None]:
from faker import Faker

def geo_data_generator():
    """Return one dummy event: random text to embed plus a unique ID."""
    fake = Faker()
    return {
        'content': fake.text(max_nb_chars=1000),
        'id': fake.uuid4()
    }

### Get the pipeline data source object to publish events to the pipeline

In [None]:
data_source = pipeline.get_source()

In [None]:
# Generate some events and send them to the pipeline. Keep a local copy for comparison.
n_events = 10
input_events = []
for _ in range(n_events):
    event = geo_data_generator()
    input_events.append(event)
    data_source.publish(event)

## Display data sent to the pipeline

In [None]:
import pandas as pd

display(pd.DataFrame(input_events))

## Check your Pinecone index

Have a look at the documents newly added to your Pinecone index.
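
Instead of checking by eye, the index can also be queried for the IDs that were published. The sketch below assumes that the vectors are stored under the same `id` as the input events (this depends on what `transform.py` emits) and that the index object exposes `fetch(ids=...)` returning a response with a `vectors` mapping, as the Pinecone Python client does. `missing_ids` is a hypothetical helper name.

```python
def missing_ids(index, ids, batch_size=100):
    """Return the IDs for which `index.fetch` does not return a vector."""
    found = set()
    # Fetch in batches to keep each request small.
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        response = index.fetch(ids=batch)
        found.update(response.vectors.keys())
    return [event_id for event_id in ids if event_id not in found]


# Usage with the Pinecone client (needs the `pinecone` package and network access):
# from pinecone import Pinecone
# index = Pinecone(api_key=PINECONE_API_KEY).Index(host=PINECONE_INDEX_HOST)
# print(missing_ids(index, [event["id"] for event in input_events]))
```

An empty list means every published event has a vector in the index. Events may take a moment to pass through the pipeline, so a non-empty result right after publishing is not necessarily an error.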

## Explore the pipeline in the web UI


In [None]:
pipeline_url = f"https://app.glassflow.dev/pipelines/{pipeline.id}"
print(pipeline_url)