# Pneuma Guide

In this notebook, we show how to use each of Pneuma's features, from registering a dataset to querying the index.

## Offline Stage

In the offline stage, we set up Pneuma: initializing the database, registering datasets and metadata, generating summaries, and generating both the vector and keyword indexes.

To use Pneuma, we import the Pneuma class from the pneuma module.
- CUBLAS_WORKSPACE_CONFIG is set to ... [L]
- CUDA_VISIBLE_DEVICES selects which GPU is used.
- out_path determines where the dataset and indexes are stored. If not set, it defaults to ~/.local/share/Pneuma/out on Linux or /Documents/Pneuma/out on Windows.
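
The default location described in the last bullet can be sketched as follows. This is only an illustration, not Pneuma's actual code: `default_out_path` is a hypothetical helper, and resolving the Windows path relative to the user's home directory is an assumption.

```python
import os
import platform


def default_out_path(system=None):
    """Hypothetical helper: mirror the documented default for out_path."""
    system = system or platform.system()
    home = os.path.expanduser("~")
    if system == "Windows":
        # Assumption: the Documents folder lives under the user's home directory.
        return os.path.join(home, "Documents", "Pneuma", "out")
    # Default documented for Linux.
    return os.path.join(home, ".local", "share", "Pneuma", "out")


print(default_out_path())
```

Passing an explicit out_path, as done in this notebook, avoids depending on the platform default.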

In [1]:
from pneuma import Pneuma
import os
import json



In [2]:
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

out_path = "out_demo/storage"

We initialize the pneuma object with out_path and call the setup() function to initialize the database.

In [3]:
pneuma = Pneuma(out_path=out_path)
pneuma.setup()

2024-11-20 05:21:58,032 [Registration] [INFO] HTTPFS installed and loaded.
2024-11-20 05:21:58,033 [Registration] [INFO] Table status table created.
2024-11-20 05:21:58,034 [Registration] [INFO] ID sequence created.
2024-11-20 05:21:58,053 [Registration] [INFO] Table contexts table created.
2024-11-20 05:21:58,055 [Registration] [INFO] Table summaries table created.
2024-11-20 05:21:58,056 [Registration] [INFO] Indexes table created.
2024-11-20 05:21:58,057 [Registration] [INFO] Index table mappings table created.
2024-11-20 05:21:58,058 [Registration] [INFO] Query history table created.


'{"status": "SUCCESS", "message": "Database Initialized.", "data": null}'
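
As the output above shows, the call returns a JSON string with status, message, and data fields. A minimal sketch of decoding such a string and failing early on errors might look like this; `parse_response` is a hypothetical helper, not part of Pneuma.

```python
import json


def parse_response(raw):
    """Decode a Pneuma-style JSON response and raise if it reports an error."""
    response = json.loads(raw)
    if response.get("status") != "SUCCESS":
        raise RuntimeError(response.get("message", "Unknown error"))
    return response


# The string below is the setup() output shown above.
raw = '{"status": "SUCCESS", "message": "Database Initialized.", "data": null}'
response = parse_response(raw)
print(response["message"])
```

The rest of this notebook calls json.loads directly and prints the result, which is enough for a walkthrough; a wrapper like this is mainly useful in scripts that should stop on the first failure.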

### Registration

To register a dataset, we call the add_tables function, pointing it to a directory and specifying the data creator.

In [4]:
data_path = "../data_src/sample_data/csv"

response = pneuma.add_tables(data_path, "demo_user")
response = json.loads(response)
print(response)

2024-11-20 05:21:58,066 [Registration] [INFO] Reading folder ../data_src/sample_data/csv...
2024-11-20 05:21:58,067 [Registration] [INFO] Processing ../data_src/sample_data/csv/5n77-2d6a.csv...
2024-11-20 05:21:58,079 [Registration] [INFO] Processing table ../data_src/sample_data/csv/5n77-2d6a.csv ERROR: This table already exists in the database with id ('../data_src/sample_data/csv/5n77-2d6a.csv',).
2024-11-20 05:21:58,079 [Registration] [INFO] Processing ../data_src/sample_data/csv/inner_folder...
2024-11-20 05:21:58,079 [Registration] [INFO] Reading folder ../data_src/sample_data/csv/inner_folder...
2024-11-20 05:21:58,080 [Registration] [INFO] Processing ../data_src/sample_data/csv/inner_folder/28km-gtjn.csv...
2024-11-20 05:21:58,083 [Registration] [INFO] Processing table ../data_src/sample_data/csv/inner_folder/28km-gtjn.csv ERROR: This table already exists in the database with id ('../data_src/sample_data/csv/inner_folder/28km-gtjn.csv',).
2024-11-20 05:21:58,084 [Registration] 

{'status': 'SUCCESS', 'message': '3 files in folder ../data_src/sample_data/csv has been processed.', 'data': {'file_count': 3, 'tables': [None, None, None]}}
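
In the run above, the logs report that the tables already exist in the database, and the tables list in the response contains only None. The sketch below counts how many files produced a new table, under the assumption (not confirmed by the source) that a None entry means the file was not newly registered.

```python
# Response dict copied from the output above.
response = {
    "status": "SUCCESS",
    "message": "3 files in folder ../data_src/sample_data/csv has been processed.",
    "data": {"file_count": 3, "tables": [None, None, None]},
}

tables = response["data"]["tables"]
# Assumption: a None entry means that file did not produce a new table.
registered = [t for t in tables if t is not None]
skipped = len(tables) - len(registered)
print(f"{len(registered)} registered, {skipped} skipped "
      f"out of {response['data']['file_count']} files")
```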


To register context or summaries for a dataset, we call the add_metadata function with the path to a metadata file.

In [5]:
metadata_path = "../data_src/sample_data/metadata.csv"

response = pneuma.add_metadata(metadata_path)
response = json.loads(response)
print(response)

2024-11-20 05:21:58,101 [Registration] [INFO] Context ID: 17
2024-11-20 05:21:58,103 [Registration] [INFO] Summary ID: 18


{'status': 'SUCCESS', 'message': '2 metadata entries has been added.', 'data': {'file_count': 2, 'metadata_ids': [17, 18]}}


### Summarization
By default, calling the summarize function creates summaries for all unsummarized tables.

In [6]:
response = pneuma.summarize()
response = json.loads(response)
print(response)

2024-11-20 05:22:00,317 [accelerate.utils.modeling] [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.20s/it]
2024-11-20 05:22:06,470 [sentence_transformers.SentenceTransformer] [INFO] Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5
2024-11-20 05:22:09,464 [Summarizer] [INFO] Generating summaries for all unsummarized tables...
2024-11-20 05:22:09,466 [Summarizer] [INFO] Found 0 unsummarized tables.


{'status': 'SUCCESS', 'message': 'No unsummarized tables found.\n', 'data': {'table_ids': []}}


### Index Generation
To generate both the vector and keyword indexes, we call the generate_index function and specify a name for the index. By default, this function indexes all registered tables.

In [7]:
response = pneuma.generate_index("demo_index")
response = json.loads(response)
print(response)

2024-11-20 05:22:09,473 [sentence_transformers.SentenceTransformer] [INFO] Use pytorch device_name: cuda
2024-11-20 05:22:09,474 [sentence_transformers.SentenceTransformer] [INFO] Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5
2024-11-20 05:22:12,180 [chromadb.telemetry.product.posthog] [INFO] Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
2024-11-20 05:22:12,275 [IndexGenerator] [INFO] No table ids provided. Generating index for all tables...
2024-11-20 05:22:12,277 [IndexGenerator] [INFO] Generating index for 3 tables...


{'status': 'ERROR', 'message': 'Index named demo_index already exists.', 'data': None}
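
The run above returns an ERROR status because an index named demo_index already exists. When re-running a notebook, one way to tolerate this is to treat that specific error as "reuse the existing index". The sketch below is hypothetical: `ensure_index` and `fake_generate_index` are illustrative names, the stand-in only reproduces the error message shown above, and matching on the message text is an assumption about how the error is reported.

```python
import json


def ensure_index(generate, index_name):
    """Call generate(index_name); treat an 'already exists' error as success."""
    response = json.loads(generate(index_name))
    if response["status"] == "SUCCESS":
        return "created"
    if "already exists" in (response.get("message") or ""):
        return "reused"
    raise RuntimeError(response.get("message") or "Unknown error")


def fake_generate_index(index_name):
    # Stand-in that returns the error response shown above as a JSON string.
    return json.dumps({
        "status": "ERROR",
        "message": f"Index named {index_name} already exists.",
        "data": None,
    })


print(ensure_index(fake_generate_index, "demo_index"))
```

In the notebook, pneuma.generate_index would be passed in place of the stand-in function.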


## Online Stage (Querying)
To retrieve a ranked list of tables, we use the query_index function.

In [8]:
response = pneuma.query_index(
    index_name="demo_index",
    query="Which dataset contains climate issues?",
    k=1,
    n=5,
    alpha=0.5,
)
response = json.loads(response)['data']['response']
for r in response:
    print(r)

2024-11-20 05:22:12,285 [sentence_transformers.SentenceTransformer] [INFO] Use pytorch device_name: cuda
2024-11-20 05:22:12,286 [sentence_transformers.SentenceTransformer] [INFO] Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5
2024-11-20 05:22:16,524 [accelerate.utils.modeling] [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.18s/it]


['../data_src/sample_data/csv/5n77-2d6a.csv_SEP_contents_SEP_row-1', 0.5017665606585792, 'date: 2022-11-08 17:47:06 | issue_1: Climate and Sustainability | issue_2: Affordable Housing | issue_3: Equity and Inclusion | issue_4: Public Safety | issue_5: Infrastructure Improvements | issue_6: Economic/Business Development | issue_7: Public Transportation Improvements | ranked_issues: Climate and Sustainability;Affordable Housing;Equity and Inclusion;Public Safety;Infrastructure Improvements;Economic/Business Development;Public Transportation Improvements; | other_priorities: Inclusion of neighborhood organizations and individuals in development decisions.  We want business development, but people who live here should have priority over outsiders in determining what that looks like. \n\nThe environmental impacts of "development" without community involvement have caused significant issues.  We want to see MAT Asphalt shut down, the drive through ban in TOD zones enacted, downzoning of indu
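
Each result printed above appears to be a list whose first element is an identifier, whose second is a score, and whose third is the matched content. The identifier joins a table path, a content type, and a chunk label with `_SEP_`. A minimal sketch for splitting it follows; `split_result_id` is a hypothetical helper, and the three-part layout is inferred from this one example rather than documented.

```python
def split_result_id(result_id, sep="_SEP_"):
    """Split '<table>_SEP_<kind>_SEP_<chunk>' into its three parts."""
    # rsplit from the right so a separator inside the table path is preserved.
    table, kind, chunk = result_id.rsplit(sep, 2)
    return {"table": table, "kind": kind, "chunk": chunk}


# Identifier and score copied from the output above; the content is omitted.
row = ["../data_src/sample_data/csv/5n77-2d6a.csv_SEP_contents_SEP_row-1",
       0.5017665606585792]
parts = split_result_id(row[0])
print(parts["table"], parts["kind"], parts["chunk"], round(row[1], 3))
```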