<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/retrievers/pathway_retriever.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pathway Retriever

> [Pathway](https://pathway.com/) is an open data processing framework. It allows you to easily develop data transformation pipelines and Machine Learning applications that work with live data sources and changing data.

This notebook shows how to use Pathway to deploy a live data indexing pipeline that can be queried from a retriever. You can add documents to Pathway from existing connectors or create your own connector in Python, so that your LLM stays up to date with the latest information.

For more details about the Pathway vector store, see the [vector store pipeline](https://pathway.com/developers/showcases/vectorstore_pipeline) showcase.

In [None]:
from llama_index.retrievers import PathwayRetriever, PathwayVectorServer
import pathway as pw

In [None]:
import getpass
import os

# Skip this step if your embedder of choice is not OpenAI
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

## Define data sources for Pathway

Pathway can listen to many sources simultaneously, such as local files, S3 folders, cloud storage, and any data stream, and it tracks changes in their data.

See [pathway-io](https://pathway.com/developers/api-docs/pathway-io) for more information.

In [None]:
data_sources = []
data_sources.append(
    pw.io.fs.read(
        "../data/paul_graham",
        format="binary",
        mode="streaming",
        with_metadata=True,
    )  # This creates a Pathway connector that tracks
    # all the files in the ../data/paul_graham directory
)
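
To make the idea of tracking a directory for changes more concrete, here is a minimal standard-library sketch. It is not Pathway's implementation and does not use the Pathway API; the helper names `snapshot` and `diff` are hypothetical. It assumes that comparing modification time and size between two scans is enough to detect a changed file.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical type: file path -> (modification time, size in bytes)
Snapshot = Dict[str, Tuple[float, int]]


def snapshot(directory: str) -> Snapshot:
    """Record (mtime, size) for every file under `directory`."""
    state: Snapshot = {}
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            stat = os.stat(path)
            state[path] = (stat.st_mtime, stat.st_size)
    return state


def diff(old: Snapshot, new: Snapshot) -> Dict[str, List[str]]:
    """Classify paths as added, removed, or modified between two scans."""
    return {
        "added": sorted(p for p in new if p not in old),
        "removed": sorted(p for p in old if p not in new),
        "modified": sorted(p for p in new if p in old and new[p] != old[p]),
    }
```

Calling `snapshot` periodically and passing consecutive results to `diff` yields the files that would need to be re-indexed. A streaming connector spares you from writing this loop yourself.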

## Define Transformation pipeline

Let us create a document ingestion pipeline. The pipeline should be a list of `TransformComponent` objects and must end with an embedder as the last step, so that the resulting chunks can be indexed.
In this example, we first split the text by number of tokens and then embed the chunks with the OpenAI embedder.

In [None]:
from llama_index.embeddings import OpenAIEmbedding
from llama_index.node_parser import TokenTextSplitter

embed_model = OpenAIEmbedding(embed_batch_size=10)

transformations_example = [
    TokenTextSplitter(
        chunk_size=150,
        chunk_overlap=10,
        separator=" ",
    ),
    embed_model,
]
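
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sketch. This is not the `TokenTextSplitter` implementation: the hypothetical `split_by_tokens` function below treats each separator-delimited word as one token, whereas the real splitter counts tokens with a tokenizer.

```python
from typing import List


def split_by_tokens(
    text: str, chunk_size: int, chunk_overlap: int, separator: str = " "
) -> List[str]:
    """Split `text` into windows of `chunk_size` words that share
    `chunk_overlap` words with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Assumption: one word stands in for one token.
    tokens = [t for t in text.split(separator) if t]
    step = chunk_size - chunk_overlap

    chunks: List[str] = []
    for start in range(0, len(tokens), step):
        chunks.append(separator.join(tokens[start : start + chunk_size]))
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With `chunk_size=4` and `chunk_overlap=1`, each chunk starts with the last word of the previous chunk. The overlap keeps some context across chunk boundaries at the cost of embedding a few tokens twice.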

## Run the Server

In [None]:
server = PathwayVectorServer(
    *data_sources,
    transformations=transformations_example,
)

# Define the host and port that Pathway will listen on
PATHWAY_HOST = "127.0.0.1"
PATHWAY_PORT = 8754

# `threaded=True` runs Pathway in detached (background) mode; set it to False when running from a terminal or container
# For more information on `with_cache`, see https://pathway.com/developers/api-docs/persistence-api
server.run_server(
    host=PATHWAY_HOST, port=PATHWAY_PORT, with_cache=False, threaded=True
)
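
Because the server starts in the background when `threaded=True`, a client may try to connect before the port is open. The sketch below shows one way to wait for it using only the standard library. `wait_for_port` is a hypothetical helper, not part of Pathway or LlamaIndex, and an open port only means the server accepts connections, not that indexing has finished.

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout: float = 30.0, interval: float = 0.5
) -> bool:
    """Return True once a TCP connection to (host, port) succeeds,
    or False if `timeout` seconds pass without success."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # The connection is closed as soon as the `with` block exits.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("127.0.0.1", 8754)` could be called before creating the retriever, using the same host and port that were passed to the server.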

## Create Retriever for llama-index

In [None]:
pr = PathwayRetriever(host=PATHWAY_HOST, port=PATHWAY_PORT)

In [None]:
pr.retrieve(str_or_query_bundle="something")
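
The retrieved results can be easier to read when printed as short, numbered lines. The sketch below assumes each result exposes `.score` and `.node.get_content()`, as LlamaIndex's `NodeWithScore` does. The `StubNode` and `StubNodeWithScore` classes are hypothetical stand-ins so that the example runs without a server, and `summarize` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class StubNode:
    """Hypothetical stand-in for a LlamaIndex text node."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)

    def get_content(self) -> str:
        return self.text


@dataclass
class StubNodeWithScore:
    """Hypothetical stand-in for a scored retrieval result."""

    node: StubNode
    score: Optional[float] = None


def summarize(results: Iterable[Any], max_chars: int = 80) -> List[str]:
    """Format results, in the order returned, as numbered one-line summaries."""
    lines: List[str] = []
    for rank, result in enumerate(results, start=1):
        # Collapse whitespace and truncate so each result fits on one line.
        snippet = " ".join(result.node.get_content().split())[:max_chars]
        score = "n/a" if result.score is None else f"{result.score:.3f}"
        lines.append(f"{rank}. score={score} {snippet}")
    return lines
```

Under that assumption, `print("\n".join(summarize(pr.retrieve(str_or_query_bundle="something"))))` would list the retrieved chunks in the order the retriever returned them.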