Project import generated by Copybara.
GitOrigin-RevId: b2954a4e88aa573b5ce2c014876f75c2535abd19
Manul from Pathway committed May 16, 2024
1 parent 7b77dc8 commit 06c2d8c
Showing 44 changed files with 1,319 additions and 532 deletions.
119 changes: 8 additions & 111 deletions README.md
@@ -39,7 +39,9 @@ Analysis of live documents streams.

![Effortlessly extract and organize unstructured data from PDFs, docs, and more into SQL tables - in real-time](examples/pipelines/unstructured_to_sql_on_the_fly/unstructured_to_sql_demo.gif)

(See: [`unstructured-to-sql`](#examples) app example.)

(Check out [`gpt_4o_multimodal_rag`](examples/pipelines/gpt_4o_multimodal_rag/README.md) to see the whole pipeline in action. You may also check out [`unstructured-to-sql`](examples/pipelines/unstructured_to_sql_on_the_fly/app.py) for a minimal example that also works with non-multimodal models.)


### Automated real-time knowledge mining and alerting.

@@ -58,7 +60,7 @@ The default [`contextful`](examples/pipelines/contextful/app.py) app example lau

This application template can also be combined with streams of fresh data, such as news feeds or status reports, either through REST or a technology like Kafka. It can also be combined with extra static data sources and user-specific contexts, to provide more relevant answers and reduce LLM hallucination.
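
For instance, a minimal sketch of reading such a live stream through Pathway's Kafka connector is shown below; the broker settings, topic name, and schema are illustrative assumptions rather than part of this repository:

```python
import pathway as pw


class NewsItemSchema(pw.Schema):
    doc: str  # raw text of an incoming news item or status report


# Illustrative broker settings; adjust them to your Kafka deployment.
rdkafka_settings = {
    "bootstrap.servers": "localhost:9092",
    "group.id": "llm-app-news",
    "auto.offset.reset": "earliest",
}

# Read the live stream of documents from a (hypothetical) "news-feed" topic.
# The resulting table can then be indexed alongside the static document sources.
news_documents = pw.io.kafka.read(
    rdkafka_settings,
    topic="news-feed",
    format="json",
    schema=NewsItemSchema,
)
```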

Read more about the implementation details and how to extend this application in [our blog article](https://pathway.com/developers/showcases/llm-app-pathway/).
Read more about the implementation details and how to extend this application in [our blog article](https://pathway.com/developers/user-guide/llm-xpack/llm-app-pathway/).

### Instructional videos

@@ -101,12 +103,6 @@ with increasing number of documents given as a context in the question, until Ch

## Get Started

To run the `demo-document-indexing` vector indexing pipeline and UI please follow instructions under [examples/pipelines/demo-document-indexing/README.md](examples/pipelines/demo-document-indexing/README.md).

To run the `demo-question-answering` question answering pipeline please follow instructions under [examples/pipelines/demo-question-answering/README.md](examples/pipelines/demo-question-answering/README.md).

For all other demos follow the steps below.

### Prerequisites


@@ -120,116 +116,17 @@ Now, follow the steps to install and [get started with one of the provided examp

Alternatively, you can also take a look at the [application showcases](#showcases).

### Step 1: Clone the repository
### Clone the repository

This is done with the `git clone` command followed by the URL of the repository:

```bash
git clone https://github.com/pathwaycom/llm-app.git
```

Next, navigate to the repository:

```bash
cd llm-app
```

### Step 2: Set environment variables

Create a `.env` file in the root directory and add the following environment variables, adjusting their values to your specific requirements and setup.

| Environment Variable | Description |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| APP_VARIANT | Determines which pipeline to run in your application. Available modes are [`contextful`, `contextful-s3`, `contextless`, `local`, `unstructured-to-sql`, `alert`, `drive-alert`]. By default, the mode is set to `contextful`. |
| PATHWAY_REST_CONNECTOR_HOST | Specifies the host IP for the REST connector in Pathway. For the dockerized version, set it to `0.0.0.0`. Natively, you can use `127.0.0.1`. |
| PATHWAY_REST_CONNECTOR_PORT | Specifies the port number on which the REST connector service of Pathway should listen. Here, it is set to 8080. |
| OPENAI_API_KEY | The API token for accessing OpenAI services. If you are not running the local version, please remember to replace it with your API token, which you can generate from your account on [openai.com](https://platform.openai.com/account/api-keys). |
| PATHWAY_PERSISTENT_STORAGE | Specifies the directory where the cache is stored. You could use `/tmp/cache`. |

For example:

```bash
APP_VARIANT=contextful
PATHWAY_REST_CONNECTOR_HOST=0.0.0.0
PATHWAY_REST_CONNECTOR_PORT=8080
OPENAI_API_KEY=<Your Token>
PATHWAY_PERSISTENT_STORAGE=/tmp/cache
```

### Step 3: Build and run the app

You can install and run your chosen LLM App example in two different ways.

#### Using Docker

Docker is a tool designed to make it easier to create, deploy, and run applications by using containers. Here is how to use Docker to build and run the LLM App:

```bash
docker compose run --build --rm -p 8080:8080 llm-app-examples
```

If you have set a different port in `PATHWAY_REST_CONNECTOR_PORT`, replace the second `8080` with this port in the command above.

When the process is complete, the App will be up and running inside a Docker container and accessible at `0.0.0.0:8080`. From there, you can proceed to the "Usage" section of the documentation for information on how to interact with the application.

#### Native Approach

* **Install poetry:**

```bash
pip install poetry
```

* **Install llm_app and dependencies:**

```bash
poetry install --with examples --extras local
```

You can omit the `--extras local` part if you're not going to run the local example.
* **Run the examples:** You can start the example with the command:
```bash
poetry run ./run_examples.py contextful
```
### Step 4: Start to use it
1. **Send REST queries** (in a separate terminal window): These are examples of how to interact with the application once it's running. `curl` is a command-line tool used to send data using various network protocols. Here, it's being used to send HTTP requests to the application.
```bash
curl --data '{"user": "user", "query": "How to connect to Kafka in Pathway?"}' http://localhost:8080/
curl --data '{"user": "user", "query": "How to use LLMs in Pathway?"}' http://localhost:8080/
```
If you are on Windows CMD, the query would look like this:
```cmd
curl --data "{\"user\": \"user\", \"query\": \"How to use LLMs in Pathway?\"}" http://localhost:8080/
```
2. **Test reactivity by adding a new file:** This shows how to test the application's ability to react to changes in data by adding a new file and sending a query.

```bash
cp ./data/documents_extra.jsonl ./data/pathway-docs/
```

Or if using docker compose:

```bash
docker compose exec llm-app-examples mv /app/examples/data/documents_extra.jsonl /app/examples/data/pathway-docs/
```

Let's query again:
```bash
curl --data '{"user": "user", "query": "How to use LLMs in Pathway?"}' http://localhost:8080/
```
### Run the chosen example

### Step 5: Launch the User Interface:
Go to the `examples/ui/` directory (or `examples/pipelines/unstructured/ui` if you are running the unstructured version) and execute `streamlit run server.py`. Then, access the URL displayed in the terminal to engage with the LLM App using a chat interface. Please note: the provided Streamlit-based interface template is intended for internal rapid prototyping only. In production, you would normally create your own component instead, taking into account security and authentication, multi-tenancy of data teams, integration with existing UI components, etc.
Each [example](examples/pipelines/) contains a README.md with instructions on how to run it.

### Bonus: Build your own Pathway-powered LLM App

Expand All @@ -251,7 +148,7 @@ Please check out our [Q&A](https://github.com/pathwaycom/llm-app/discussions/cat

### Raise an issue

To provide feedback or report a bug, please [raise an issue on our issue tracker](https://github.com/pathwaycom/llm-app/issues).
To provide feedback or report a bug, please [raise an issue on our issue tracker](https://github.com/pathwaycom/pathway/issues).

## Contributing

67 changes: 67 additions & 0 deletions examples/pipelines/alert/README.md
@@ -0,0 +1,67 @@
# Alert Pipeline

This example implements a pipeline that answers questions based on documents in a given folder. Additionally, in your prompts you can ask to be notified of any changes; in such a case an alert will be sent to a Slack channel.

Upon starting, a REST API endpoint is opened by the app to serve queries about files inside
the input folder `data_dir`.

We can create notifications by sending a query to the API and stating that we want to be notified of changes.
One example would be `Tell me and alert about the start date of the campaign for Magic Cola`.

What happens next?

Each query text is first turned into a vector using the OpenAI embedding service,
then relevant documentation pages are found using a Nearest Neighbor index computed
for documents in the corpus. A prompt is built from the relevant documentation pages
and sent to the OpenAI GPT-3.5 chat service for processing and answering.

Once running, Pathway monitors the data sources and efficiently detects changes
to the relevant documents. When a change is detected, the LLM is asked to answer the query
again, and if the new answer is sufficiently different, an alert is created.
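
The "sufficiently different" check is itself delegated to the LLM: the previous and the new answer are placed in a comparison prompt and the model's Yes/No reply is turned into a boolean (see `decision_to_bool` in `app.py`). A simplified sketch of that step follows; the prompt wording and the `build_prompt_compare_answers` helper shown here are illustrative.

```python
def build_prompt_compare_answers(old_answer: str, new_answer: str) -> str:
    # Ask the model whether the change is worth notifying the user about.
    return (
        "Are the two following answers different enough to justify notifying "
        "the user? Answer 'Yes' or 'No' only.\n"
        f"Old answer: {old_answer}\n"
        f"New answer: {new_answer}"
    )


def decision_to_bool(decision: str) -> bool:
    # Reduce the model's free-form reply to a boolean alert decision.
    return decision.strip().lower().startswith("yes")
```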

## How to run the project

### Setup Slack notifications:

For this demo, Slack notifications are optional; if no Slack API keys are provided, notifications will be printed to the terminal instead. See: [Slack Apps](https://api.slack.com/apps) and [Getting a token](https://api.slack.com/tutorials/tracks/getting-a-token).
Your Slack application will need at least `chat:write.public` scope enabled.

### Setup environment:
Set your env variables in the .env file placed in this directory or in the root of the repo.

```bash
OPENAI_API_KEY=sk-...
SLACK_ALERT_CHANNEL_ID= # If unset, alerts will be printed to the terminal
SLACK_ALERT_TOKEN=
PATHWAY_DATA_DIR= # If unset, defaults to ../../data/magic-cola/live/
PATHWAY_PERSISTENT_STORAGE= # Set this variable if you want to use caching
```

### Run the project

Make sure you have installed poetry dependencies with `--extras unstructured`.

```bash
poetry install --with examples --extras unstructured
```

Run:

```bash
poetry run python app.py
```

If all dependencies are managed manually rather than using poetry, you can run:

```bash
python app.py
```

To create alerts, you can call the REST API:

```bash
curl --data '{
"user": "user",
"query": "When does the magic cola campaign start? Alert me if the start date changes."
}' http://localhost:8080/ | jq
```
46 changes: 12 additions & 34 deletions examples/pipelines/alert/app.py
@@ -4,15 +4,12 @@
This demo is very similar to `contextful` example with an additional real time alerting capability.
In the demo, alerts are sent to Slack (you need `slack_alert_channel_id` and `slack_alert_token`),
you can either put these env variables in .env file under llm-app directory,
or create env variables in the terminal (ie. export in bash)
If you don't have Slack, you can leave them empty and app will print the notifications to
standard output instead.
or create env variables in the terminal (ie. export in bash).
Upon starting, a REST API endpoint is opened by the app to serve queries about files inside
the input folder `data_dir`.
We can create notifications by sending a query to the API and stating that we want to be notified of the changes.
Alternatively, the provided Streamlit chat app can be used.
One example would be `Tell me and alert about the start date of the campaign for Magic Cola`
What happens next?
@@ -26,37 +23,20 @@
to the relevant documents. When a change is detected, the LLM is asked to answer the query
again, and if the new answer is sufficiently different, an alert is created.
Usage:
In the root of this repository run:
`poetry run ./run_examples.py alerts`
or, if all dependencies are managed manually rather than using poetry
You can either
`python examples/pipelines/alerts/app.py`
or
`python ./run_examples.py alert`
You can also run this example directly in the environment with llm_app installed.
To create alerts:
You can call the REST API:
curl --data '{
"user": "user",
"query": "When does the magic cola campaign start? Alert me if the start date changes."
}' http://localhost:8080/ | jq
Or start streamlit UI:
First go to examples/ui directory with `cd llm-app/examples/ui/`
run `streamlit run server.py`
Please check the README.md in this directory for how-to-run instructions.
"""

import asyncio
import os

import dotenv
import pathway as pw
from pathway.stdlib.ml.index import KNNIndex
from pathway.xpacks.llm.embedders import OpenAIEmbedder
from pathway.xpacks.llm.llms import OpenAIChat, prompt_chat_single_qa

dotenv.load_dotenv()


class DocumentInputSchema(pw.Schema):
doc: str
@@ -154,12 +134,10 @@ def decision_to_bool(decision: str) -> bool:

def run(
*,
data_dir: str = os.environ.get(
"PATHWAY_DATA_DIR", "./examples/data/magic-cola/live/"
),
data_dir: str = os.environ.get("PATHWAY_DATA_DIR", "../../data/magic-cola/live/"),
api_key: str = os.environ.get("OPENAI_API_KEY", ""),
host: str = "0.0.0.0",
port: int = 8080,
host: str = os.environ.get("PATHWAY_REST_CONNECTOR_HOST", "0.0.0.0"),
port: int = int(os.environ.get("PATHWAY_REST_CONNECTOR_PORT", "8080")),
embedder_locator: str = "text-embedding-ada-002",
embedding_dimension: int = 1536,
model_locator: str = "gpt-3.5-turbo",
@@ -173,8 +151,8 @@ def run(
embedder = OpenAIEmbedder(
api_key=api_key,
model=embedder_locator,
retry_strategy=pw.asynchronous.FixedDelayRetryStrategy(),
cache_strategy=pw.asynchronous.DefaultCache(),
retry_strategy=pw.udfs.FixedDelayRetryStrategy(),
cache_strategy=pw.udfs.DefaultCache(),
)

documents = pw.io.jsonlines.read(
@@ -205,8 +183,8 @@ def run(
model=model_locator,
temperature=temperature,
max_tokens=max_tokens,
retry_strategy=pw.asynchronous.FixedDelayRetryStrategy(),
cache_strategy=pw.asynchronous.DefaultCache(),
retry_strategy=pw.udfs.FixedDelayRetryStrategy(),
cache_strategy=pw.udfs.DefaultCache(),
)

query += query.select(
46 changes: 46 additions & 0 deletions examples/pipelines/contextful/README.md
@@ -0,0 +1,46 @@
# Contextful Pipeline

This example implements a simple pipeline that answers questions based on documents in a given folder.

Each query text is first turned into a vector using the OpenAI embedding service,
then relevant documentation pages are found using a Nearest Neighbor index computed
for documents in the corpus. A prompt is built from the relevant documentation pages
and sent to the OpenAI chat service for processing.
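
For orientation, a condensed sketch of what `app.py` does is given below. It is a rough outline assuming the current Pathway LLM xpack API, with defaults hard-coded and details such as retry and cache strategies omitted; see `app.py` for the actual implementation.

```python
import os

import pathway as pw
from pathway.stdlib.ml.index import KNNIndex
from pathway.xpacks.llm.embedders import OpenAIEmbedder
from pathway.xpacks.llm.llms import OpenAIChat, prompt_chat_single_qa


class DocumentInputSchema(pw.Schema):
    doc: str


class QueryInputSchema(pw.Schema):
    query: str
    user: str


api_key = os.environ.get("OPENAI_API_KEY", "")
embedder = OpenAIEmbedder(api_key=api_key, model="text-embedding-ada-002")
model = OpenAIChat(api_key=api_key, model="gpt-3.5-turbo")

# Embed every document in the corpus and build a nearest-neighbour index.
documents = pw.io.jsonlines.read(
    os.environ.get("PATHWAY_DATA_DIR", "../../data/pathway-docs/"),
    schema=DocumentInputSchema,
    mode="streaming",
)
enriched_documents = documents + documents.select(vector=embedder(pw.this.doc))
index = KNNIndex(enriched_documents.vector, enriched_documents, n_dimensions=1536)

# Accept queries over REST, embed each one, and retrieve the closest documents.
query, response_writer = pw.io.http.rest_connector(
    host="0.0.0.0", port=8080, schema=QueryInputSchema, delete_completed_queries=True
)
query += query.select(vector=embedder(pw.this.query))
query_context = query + index.get_nearest_items(
    query.vector, k=3, collapse_rows=True
).select(documents_list=pw.this.doc)

# Build a prompt from the retrieved pages and ask the chat model to answer.
prompt = query_context.select(
    prompt=pw.apply(
        lambda docs, q: f"Given the following documents:\n{docs}\nanswer this query: {q}",
        pw.this.documents_list,
        pw.this.query,
    )
)
response = prompt.select(
    query_id=prompt.id, result=model(prompt_chat_single_qa(pw.this.prompt))
)
response_writer(response)

pw.run()
```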

## How to run the project

### Setup environment:
Set your env variables in the .env file placed in this directory or in the root of the repo.

```bash
OPENAI_API_KEY=sk-...
PATHWAY_DATA_DIR= # If unset, defaults to ../../data/pathway-docs/
PATHWAY_PERSISTENT_STORAGE= # Set this variable if you want to use caching
```

### Run the project

```bash
poetry install --with examples
```

Run:

```bash
poetry run python app.py
```

If all dependencies are managed manually rather than using poetry, you can run:

```bash
python app.py
```

To query the pipeline, you can call the REST API:

```bash
curl --data '{
"user": "user",
"query": "How to connect to Kafka in Pathway?"
}' http://localhost:8080/ | jq
```