Project import generated by Copybara.
GitOrigin-RevId: b2954a4e88aa573b5ce2c014876f75c2535abd19
Manul from Pathway committed May 16, 2024
1 parent 7b77dc8 commit 06c2d8c
Showing 44 changed files with 1,319 additions and 532 deletions.
119 changes: 8 additions & 111 deletions README.md
@@ -39,7 +39,9 @@ Analysis of live documents streams.

![Effortlessly extract and organize unstructured data from PDFs, docs, and more into SQL tables - in real-time](examples/pipelines/unstructured_to_sql_on_the_fly/unstructured_to_sql_demo.gif)

(See: [`unstructured-to-sql`](#examples) app example.)

(Check out [`gpt_4o_multimodal_rag`](examples/pipelines/gpt_4o_multimodal_rag/README.md) to see the whole pipeline in action. You may also check out [`unstructured-to-sql`](examples/pipelines/unstructured_to_sql_on_the_fly/app.py) for a minimal example that also works with non-multimodal models.)


### Automated real-time knowledge mining and alerting.

@@ -58,7 +60,7 @@ The default [`contextful`](examples/pipelines/contextful/app.py) app example lau

This application template can also be combined with streams of fresh data, such as news feeds or status reports, either through REST or a technology like Kafka. It can also be combined with extra static data sources and user-specific contexts, to provide more relevant answers and reduce LLM hallucination.
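
For instance, a minimal sketch of reading such a live stream through Pathway's Kafka connector is shown below; the broker settings, topic name, and schema are illustrative assumptions rather than part of this repository:

```python
import pathway as pw


class NewsItemSchema(pw.Schema):
    doc: str  # raw text of an incoming news item or status report


# Illustrative broker settings; adjust them to your Kafka deployment.
rdkafka_settings = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "llm-app-news",
    "auto.offset.reset": "earliest",
}

# Read the live stream of documents from a (hypothetical) "news-feed" topic.
# The resulting table can then be indexed alongside the static document sources.
news_documents = pw.io.kafka.read(
    rdkafka_settings,
    topic="news-feed",
    format="json",
    schema=NewsItemSchema,
)
```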

Read more about the implementation details and how to extend this application in [our blog article](https://pathway.com/developers/showcases/llm-app-pathway/).
Read more about the implementation details and how to extend this application in [our blog article](https://pathway.com/developers/user-guide/llm-xpack/llm-app-pathway/).

### Instructional videos

@@ -101,12 +103,6 @@ with increasing number of documents given as a context in the question, until Ch

## Get Started

To run the `demo-document-indexing` vector indexing pipeline and UI please follow instructions under [examples/pipelines/demo-document-indexing/README.md](examples/pipelines/demo-document-indexing/README.md).

To run the `demo-question-answering` question answering pipeline please follow instructions under [examples/pipelines/demo-question-answering/README.md](examples/pipelines/demo-question-answering/README.md).

For all other demos follow the steps below.

### Prerequisites


@@ -120,116 +116,17 @@ Now, follow the steps to install and [get started with one of the provided examp

Alternatively, you can also take a look at the [application showcases](#showcases).

### Step 1: Clone the repository
### Clone the repository

This is done with the `git clone` command followed by the URL of the repository:

```bash
git clone https://github.com/pathwaycom/llm-app.git
```

Next, navigate to the repository:

```bash
cd llm-app
```

### Step 2: Set environment variables

Create a `.env` file in the root directory and add the following environment variables, adjusting their values to your specific requirements and setup.

| Environment Variable | Description |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| APP_VARIANT | Determines which pipeline to run in your application. Available modes are [`contextful`, `contextful-s3`, `contextless`, `local`, `unstructured-to-sql`, `alert`, `drive-alert`]. By default, the mode is set to `contextful`. |
| PATHWAY_REST_CONNECTOR_HOST | Specifies the host IP for the REST connector in Pathway. For the dockerized version, set it to `0.0.0.0`. Natively, you can use `127.0.0.1`. |
| PATHWAY_REST_CONNECTOR_PORT | Specifies the port number on which the REST connector service of Pathway should listen. Here, it is set to 8080. |
| OPENAI_API_KEY | The API token for accessing OpenAI services. If you are not running the local version, please remember to replace it with your API token, which you can generate from your account on [openai.com](https://platform.openai.com/account/api-keys). |
| PATHWAY_PERSISTENT_STORAGE | Specifies the directory where the cache is stored. You could use `/tmp/cache`. |

For example:

```bash
APP_VARIANT=contextful
PATHWAY_REST_CONNECTOR_HOST=0.0.0.0
PATHWAY_REST_CONNECTOR_PORT=8080
OPENAI_API_KEY=<Your Token>
PATHWAY_PERSISTENT_STORAGE=/tmp/cache
```

### Step 3: Build and run the app

You can install and run your chosen LLM App example in two different ways.

#### Using Docker

Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Here is how to use Docker to build and run the LLM App:

```bash
docker compose run --build --rm -p 8080:8080 llm-app-examples
```

If you have set a different port in `PATHWAY_REST_CONNECTOR_PORT`, replace the second `8080` with this port in the command above.

When the process is complete, the App will be up and running inside a Docker container and accessible at `0.0.0.0:8080`. From there, you can proceed to the "Usage" section of the documentation for information on how to interact with the application.

#### Native Approach

* **Install poetry:**

```bash
pip install poetry
```

* **Install llm_app and dependencies:**

```bash
poetry install --with examples --extras local
```

You can omit the `--extras local` part if you're not going to run the local example.
* **Run the examples:** You can start the example with the command:
```bash
poetry run ./run_examples.py contextful
```
### Step 4: Start to use it
1. **Send REST queries** (in a separate terminal window): These are examples of how to interact with the application once it's running. `curl` is a command-line tool used to send data using various network protocols. Here, it's being used to send HTTP requests to the application.
```bash
curl --data '{"user": "user", "query": "How to connect to Kafka in Pathway?"}' http://localhost:8080/
curl --data '{"user": "user", "query": "How to use LLMs in Pathway?"}' http://localhost:8080/
```
If you are on Windows CMD, the query would look like this:
```cmd
curl --data "{\"user\": \"user\", \"query\": \"How to use LLMs in Pathway?\"}" http://localhost:8080/
```
2. **Test reactivity by adding a new file:** This shows how to test the application's ability to react to changes in data by adding a new file and sending a query.

```bash
cp ./data/documents_extra.jsonl ./data/pathway-docs/
```

Or if using docker compose:

```bash
docker compose exec llm-app-examples mv /app/examples/data/documents_extra.jsonl /app/examples/data/pathway-docs/
```

Let's query again:
```bash
curl --data '{"user": "user", "query": "How to use LLMs in Pathway?"}' http://localhost:8080/
```
### Run the chosen example

### Step 5: Launch the User Interface:
Go to the `examples/ui/` directory (or `examples/pipelines/unstructured/ui` if you are running the unstructured version) and execute `streamlit run server.py`. Then, access the URL displayed in the terminal to engage with the LLM App using a chat interface. Please note: the provided Streamlit-based interface template is intended for internal rapid prototyping only. In production, you would normally create your own component instead, taking into account security and authentication, multi-tenancy of data teams, integration with existing UI components, etc.
Each [example](examples/pipelines/) contains a README.md with instructions on how to run it.

### Bonus: Build your own Pathway-powered LLM App

Expand All @@ -251,7 +148,7 @@ Please check out our [Q&A](https://github.com/pathwaycom/llm-app/discussions/cat

### Raise an issue

To provide feedback or report a bug, please [raise an issue on our issue tracker](https://github.com/pathwaycom/llm-app/issues).
To provide feedback or report a bug, please [raise an issue on our issue tracker](https://github.com/pathwaycom/pathway/issues).

## Contributing

67 changes: 67 additions & 0 deletions examples/pipelines/alert/README.md
@@ -0,0 +1,67 @@
# Alert Pipeline

This example implements a pipeline that answers questions based on documents in a given folder. Additionally, in your prompts you can ask to be notified of any changes; in such a case an alert will be sent to a Slack channel.

Upon starting, a REST API endpoint is opened by the app to serve queries about files inside
the input folder `data_dir`.

We can create notifications by sending a query to the API and stating that we want to be notified of changes.
One example would be `Tell me and alert about the start date of the campaign for Magic Cola`.

What happens next?

Each query text is first turned into a vector using the OpenAI embedding service,
then relevant documentation pages are found using a Nearest Neighbor index computed
for documents in the corpus. A prompt is built from the relevant documentation pages
and sent to the OpenAI GPT-3.5 chat service for processing and answering.

Once running, Pathway monitors the data sources and efficiently detects changes
to the relevant documents. When a change is detected, the LLM is asked to answer the query
again, and if the new answer is sufficiently different, an alert is created.
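
The "sufficiently different" check is itself delegated to the LLM: the previous and the new answer are placed in a comparison prompt and the model's Yes/No reply is turned into a boolean (see `decision_to_bool` in `app.py`). A simplified sketch of that step follows; the prompt wording and the `build_prompt_compare_answers` helper shown here are illustrative.

```python
def build_prompt_compare_answers(old_answer: str, new_answer: str) -> str:
    # Ask the model whether the change is worth notifying the user about.
    return (
        "Are the two following answers different enough to justify notifying "
        "the user? Answer 'Yes' or 'No' only.\n"
        f"Old answer: {old_answer}\n"
        f"New answer: {new_answer}"
    )


def decision_to_bool(decision: str) -> bool:
    # Reduce the model's free-form reply to a boolean alert decision.
    return decision.strip().lower().startswith("yes")
```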

## How to run the project

### Setup Slack notifications:

For this demo, Slack notifications are optional; if no Slack API keys are provided, notifications will be printed to the terminal instead. See: [Slack Apps](https://api.slack.com/apps) and [Getting a token](https://api.slack.com/tutorials/tracks/getting-a-token).
Your Slack application will need at least `chat:write.public` scope enabled.

### Setup environment:
Set your env variables in the .env file placed in this directory or in the root of the repo.

```bash
OPENAI_API_KEY=sk-...
SLACK_ALERT_CHANNEL_ID= # If unset, alerts will be printed to the terminal
SLACK_ALERT_TOKEN=
PATHWAY_DATA_DIR= # If unset, defaults to ../../data/magic-cola/live/
PATHWAY_PERSISTENT_STORAGE= # Set this variable if you want to use caching
```

### Run the project

Make sure you have installed poetry dependencies with `--extras unstructured`.

```bash
poetry install --with examples --extras unstructured
```

Run:

```bash
poetry run python app.py
```

If all dependencies are managed manually rather than using poetry, you can run:

```bash
python app.py
```

To create alerts, you can call the REST API:

```bash
curl --data '{
"user": "user",
"query": "When does the magic cola campaign start? Alert me if the start date changes."
}' http://localhost:8080/ | jq
```
46 changes: 12 additions & 34 deletions examples/pipelines/alert/app.py
@@ -4,15 +4,12 @@
This demo is very similar to `contextful` example with an additional real time alerting capability.
In the demo, alerts are sent to Slack (you need `slack_alert_channel_id` and `slack_alert_token`),
you can either put these env variables in .env file under llm-app directory,
or create env variables in the terminal (ie. export in bash)
If you don't have Slack, you can leave them empty and app will print the notifications to
standard output instead.
or create env variables in the terminal (ie. export in bash).
Upon starting, a REST API endpoint is opened by the app to serve queries about files inside
the input folder `data_dir`.
We can create notifications by sending a query to the API and stating that we want to be notified of the changes.
Alternatively, the provided Streamlit chat app can be used.
One example would be `Tell me and alert about the start date of the campaign for Magic Cola`
What happens next?
@@ -26,37 +23,20 @@
to the relevant documents. When a change is detected, the LLM is asked to answer the query
again, and if the new answer is sufficiently different, an alert is created.
Usage:
In the root of this repository run:
`poetry run ./run_examples.py alerts`
or, if all dependencies are managed manually rather than using poetry
You can either
`python examples/pipelines/alerts/app.py`
or
`python ./run_examples.py alert`
You can also run this example directly in the environment with llm_app installed.
To create alerts:
You can call the REST API:
curl --data '{
"user": "user",
"query": "When does the magic cola campaign start? Alert me if the start date changes."
}' http://localhost:8080/ | jq
Or start streamlit UI:
First go to examples/ui directory with `cd llm-app/examples/ui/`
run `streamlit run server.py`
Please check the README.md in this directory for how-to-run instructions.
"""

import asyncio
import os

import dotenv
import pathway as pw
from pathway.stdlib.ml.index import KNNIndex
from pathway.xpacks.llm.embedders import OpenAIEmbedder
from pathway.xpacks.llm.llms import OpenAIChat, prompt_chat_single_qa

dotenv.load_dotenv()


class DocumentInputSchema(pw.Schema):
doc: str
@@ -154,12 +134,10 @@ def decision_to_bool(decision: str) -> bool:

def run(
*,
data_dir: str = os.environ.get(
"PATHWAY_DATA_DIR", "./examples/data/magic-cola/live/"
),
data_dir: str = os.environ.get("PATHWAY_DATA_DIR", "../../data/magic-cola/live/"),
api_key: str = os.environ.get("OPENAI_API_KEY", ""),
host: str = "0.0.0.0",
port: int = 8080,
host: str = os.environ.get("PATHWAY_REST_CONNECTOR_HOST", "0.0.0.0"),
port: int = int(os.environ.get("PATHWAY_REST_CONNECTOR_PORT", "8080")),
embedder_locator: str = "text-embedding-ada-002",
embedding_dimension: int = 1536,
model_locator: str = "gpt-3.5-turbo",
@@ -173,8 +151,8 @@ def run(
embedder = OpenAIEmbedder(
api_key=api_key,
model=embedder_locator,
retry_strategy=pw.asynchronous.FixedDelayRetryStrategy(),
cache_strategy=pw.asynchronous.DefaultCache(),
retry_strategy=pw.udfs.FixedDelayRetryStrategy(),
cache_strategy=pw.udfs.DefaultCache(),
)

documents = pw.io.jsonlines.read(
@@ -205,8 +183,8 @@ def run(
model=model_locator,
temperature=temperature,
max_tokens=max_tokens,
retry_strategy=pw.asynchronous.FixedDelayRetryStrategy(),
cache_strategy=pw.asynchronous.DefaultCache(),
retry_strategy=pw.udfs.FixedDelayRetryStrategy(),
cache_strategy=pw.udfs.DefaultCache(),
)

query += query.select(
46 changes: 46 additions & 0 deletions examples/pipelines/contextful/README.md
@@ -0,0 +1,46 @@
# Contextful Pipeline

This example implements a simple pipeline that answers questions based on documents in a given folder.

Each query text is first turned into a vector using the OpenAI embedding service,
then relevant documentation pages are found using a Nearest Neighbor index computed
for documents in the corpus. A prompt is built from the relevant documentation pages
and sent to the OpenAI chat service for processing.
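
For orientation, a condensed sketch of what `app.py` does is given below. It is a rough outline assuming the current Pathway LLM xpack API, with defaults hard-coded and details such as retry and cache strategies omitted; see `app.py` for the actual implementation.

```python
import os

import pathway as pw
from pathway.stdlib.ml.index import KNNIndex
from pathway.xpacks.llm.embedders import OpenAIEmbedder
from pathway.xpacks.llm.llms import OpenAIChat, prompt_chat_single_qa


class DocumentInputSchema(pw.Schema):
    doc: str


class QueryInputSchema(pw.Schema):
    query: str
    user: str


api_key = os.environ.get("OPENAI_API_KEY", "")
embedder = OpenAIEmbedder(api_key=api_key, model="text-embedding-ada-002")
model = OpenAIChat(api_key=api_key, model="gpt-3.5-turbo")

# Embed every document in the corpus and build a nearest-neighbour index.
documents = pw.io.jsonlines.read(
    os.environ.get("PATHWAY_DATA_DIR", "../../data/pathway-docs/"),
    schema=DocumentInputSchema,
    mode="streaming",
)
enriched_documents = documents + documents.select(vector=embedder(pw.this.doc))
index = KNNIndex(enriched_documents.vector, enriched_documents, n_dimensions=1536)

# Accept queries over REST, embed each one, and retrieve the closest documents.
query, response_writer = pw.io.http.rest_connector(
    host="0.0.0.0", port=8080, schema=QueryInputSchema, delete_completed_queries=True
)
query += query.select(vector=embedder(pw.this.query))
query_context = query + index.get_nearest_items(
    query.vector, k=3, collapse_rows=True
).select(documents_list=pw.this.doc)

# Build a prompt from the retrieved pages and ask the chat model to answer.
prompt = query_context.select(
    prompt=pw.apply(
        lambda docs, q: f"Given the following documents:\n{docs}\nanswer this query: {q}",
        pw.this.documents_list,
        pw.this.query,
    )
)
response = prompt.select(
    query_id=prompt.id, result=model(prompt_chat_single_qa(pw.this.prompt))
)
response_writer(response)

pw.run()
```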

## How to run the project

### Setup environment:
Set your env variables in the .env file placed in this directory or in the root of the repo.

```bash
OPENAI_API_KEY=sk-...
PATHWAY_DATA_DIR= # If unset, defaults to ../../data/pathway-docs/
PATHWAY_PERSISTENT_STORAGE= # Set this variable if you want to use caching
```

### Run the project

```bash
poetry install --with examples
```

Run:

```bash
poetry run python app.py
```

If all dependencies are managed manually rather than using poetry, you can run:

```bash
python app.py
```

To query the pipeline, you can call the REST API:

```bash
curl --data '{
"user": "user",
"query": "How to connect to Kafka in Pathway?"
}' http://localhost:8080/ | jq
```