# Pneuma: Quick Start (Colab)

In this notebook, we show how to use each of Pneuma's features, from registering a dataset to querying the index.

## Installation

Before we proceed, let's install Pneuma and download the sample dataset.

---
**Colab will ask you to restart your session after the installation so that certain dependencies are set up properly. Click the `Restart Session` button and continue from the next cell in the notebook.**
---

In [None]:
# Install Pneuma
!pip install pneuma

In [None]:
# Download sample data
!gdown "1NN_TxpgBlCjC_ZEBgOnBPMY0CxEX-_EL" -O "data_src.zip"
!unzip "data_src.zip"
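
Before moving on, it can help to confirm that the archive unpacked where the later cells expect it. The following is a minimal sketch, assuming the `data_src/sample_data/csv` layout used later in this notebook; `list_csv_tables` is a hypothetical helper written for illustration and is not part of Pneuma.

```python
from pathlib import Path


def list_csv_tables(data_dir):
    """Return the sorted names of the CSV files directly inside data_dir."""
    folder = Path(data_dir)
    if not folder.is_dir():
        # Fail early with a readable message instead of a later, vaguer error.
        raise FileNotFoundError(f"Expected a directory at {folder}")
    return sorted(p.name for p in folder.glob("*.csv"))


# In this notebook, after unzipping, the call would be:
# list_csv_tables("data_src/sample_data/csv")
```

If the download and unzip steps succeeded, the returned list should contain the three tables described in the Registration section.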

## Offline Stage

In the offline stage, we set up Pneuma: initializing the database, registering the dataset and its metadata, generating summaries, and building both the vector and keyword indexes.

In [None]:
import os
import json

# Enforce more deterministic behavior in cuBLAS operations.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
# Select a GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

from google.colab import userdata
from pneuma import Pneuma

We initialize the `Pneuma` object with `out_path` and call the `setup()` function to initialize the database. We have the option to use an OpenAI LLM and embedding model, which default to `GPT-4o-mini` and `text-embedding-3-small`, respectively. For this option, please set up `OPENAI_API_KEY` in your Colab instance (the cell below reads it with `userdata.get`). Alternatively, we can use local models by specifying their paths (`llm_path` and `embed_path`).

In [None]:
# out_path determines where the dataset and indexes will be stored.
# If not set, it defaults to the current working directory.
out_path = "out_demo"
USE_OPEN_AI = True

if USE_OPEN_AI:
    pneuma = Pneuma(
        out_path=out_path,
        openai_api_key=userdata.get('OPENAI_API_KEY'),
        use_local_model=False,
    )
else:
    pneuma = Pneuma(
        out_path=out_path,
        llm_path="Qwen/Qwen2.5-0.5B-Instruct",  # We use a smaller model to fit in Colab
        embed_path="BAAI/bge-base-en-v1.5",
        max_llm_batch_size=1,  # Limit exploration for limited Colab memory
    )
pneuma.setup()

* Note: For local LLMs, we limit the exploration of the dynamic batch size selector because it fills GPU memory quickly and the memory is not freed fast enough. This is a problem for systems with limited GPU memory, such as Colab with a T4 GPU.
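
The `google.colab` module is only available inside Colab. If you adapt this notebook to another environment, one option is to read the key from an environment variable instead. The sketch below assumes the key is exported as `OPENAI_API_KEY`; `get_api_key` is a hypothetical helper, not part of Pneuma.

```python
import os


def get_api_key(name="OPENAI_API_KEY", env=None):
    """Read an API key from the environment, failing early if it is absent."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating Pneuma")
    return key
```

With this helper, `openai_api_key=get_api_key()` could replace `openai_api_key=userdata.get('OPENAI_API_KEY')` in the cell above.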

### Registration

For this demo, we use a dataset of three tables taken from Chicago Open Data with the following descriptions:

- **5cq6-qygt.csv**: Bus stops in shelters and at Chicago Transit Authority (CTA) rail stations that have digital signs showing upcoming arrivals.
- **5n77-2d6a.csv**: Survey results of the 12th ward residents about issues ranging from climate & sustainability to public safety.
- **28km-gtjn.csv**: Fire station locations in Chicago.

To register a dataset, we call the `add_tables` function, pointing it to a directory and specifying the data creator.

In [None]:
data_path = "data_src/sample_data/csv"
response = pneuma.add_tables(path=data_path, creator="demo_user")
json.loads(response)

Then, we summarize the tables, none of which have been summarized at this point. These summaries represent the tables during the discovery process.

In [None]:
response = pneuma.summarize()
json.loads(response)

Optionally, if context (metadata) is available, we can register it as well using the add_metadata function.

In [None]:
metadata_path = "data_src/sample_data/metadata.csv"
response = pneuma.add_metadata(metadata_path=metadata_path)
json.loads(response)
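
Each call so far returns a JSON string that we parse with `json.loads`. To avoid repeating that step, a small wrapper can parse and pretty-print in one go. This is a minimal sketch; `show_response` is a hypothetical helper, and the payload in the usage comment is made up for illustration rather than taken from Pneuma's actual output.

```python
import json


def show_response(response):
    """Parse a JSON string, print it indented, and return the parsed object."""
    parsed = json.loads(response)
    print(json.dumps(parsed, indent=4))
    return parsed


# Usage with a made-up payload (not Pneuma's real response shape):
# show_response('{"status": "ok"}')
```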

### Index Generation
The summaries (and optionally metadata) need to be indexed into a hybrid retriever (combining vector and full-text indices). To do so, we call the generate_index function while specifying a name for the index. By default, this function will index all registered tables.

In [None]:
response = pneuma.generate_index(index_name="demo_index")
json.loads(response)

## Online Stage (Querying)
To retrieve a ranked list of tables, we use the query_index function. In this example, Pneuma correctly identifies all the relevant tables for the queries.

In [None]:
queries = [
    "Which dataset contains climate issues?",  # 5n77-2d6a.csv
    "If I could identify where the bus stops are in Chicago, that would be awesome!"  # 5cq6-qygt.csv
]

response = pneuma.query_index(
    index_name="demo_index",
    queries=queries,
    k=1,
    n=5,
    alpha=0.5,
)
relevant_tables = json.dumps(json.loads(response), indent=4)
print(relevant_tables)
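
The query above passes `alpha=0.5` to a hybrid retriever that combines vector and full-text indices. This notebook does not spell out how Pneuma uses `alpha`, so the sketch below only illustrates one common way such a weight can blend two rankings: min-max normalize each score set, then take a weighted sum. It is a generic illustration under that assumption, not Pneuma's implementation; `fuse_scores` and the scores in the usage example are hypothetical.

```python
def fuse_scores(vector_scores, keyword_scores, alpha=0.5):
    """Rank documents by alpha * vector score + (1 - alpha) * keyword score."""

    def normalize(scores):
        # Min-max scale to [0, 1] so the two retrievers are comparable.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = hi - lo
        return {doc: ((s - lo) / span if span else 1.0) for doc, s in scores.items()}

    vec, kw = normalize(vector_scores), normalize(keyword_scores)
    docs = set(vec) | set(kw)
    # A document missing from one retriever contributes 0 for that retriever.
    fused = {d: alpha * vec.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Made-up scores for two of the sample tables:
vector_scores = {"5n77-2d6a.csv": 0.9, "5cq6-qygt.csv": 0.1}
keyword_scores = {"5n77-2d6a.csv": 1.0, "5cq6-qygt.csv": 5.0}
ranking = fuse_scores(vector_scores, keyword_scores, alpha=0.8)
```

In this sketch, a higher `alpha` favors the vector retriever and a lower one favors the keyword retriever, so the two made-up tables swap places as `alpha` moves from 0.8 to 0.2.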