# Pneuma Guide

In this notebook, we show how to use each of Pneuma's features, from registering a dataset to querying the index.

## Offline Stage

In the offline stage, we set up Pneuma: initializing the database, registering datasets and metadata, generating summaries, and generating both the vector and keyword indexes.

To use Pneuma, we import the Pneuma class from the pneuma module.
- CUBLAS_WORKSPACE_CONFIG is set to ...
- CUDA_VISIBLE_DEVICES is used to select the GPU. 
- out_path determines where the dataset and indexes are stored. If not set, it defaults to ~/.local/share/Pneuma/out on Linux or /Documents/Pneuma/out on Windows.

In [1]:
from pneuma import Pneuma
import os
import json



In [2]:
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

out_path = "out_demo/storage"

We initialize the pneuma object with out_path and call the setup() function to initialize the database.

In [3]:
pneuma = Pneuma(out_path=out_path)
pneuma.setup()

2024-11-22 06:23:45,749 [Registration] [INFO] HTTPFS installed and loaded.
2024-11-22 06:23:45,752 [Registration] [INFO] Table status table created.
2024-11-22 06:23:45,752 [Registration] [INFO] ID sequence created.
2024-11-22 06:23:45,753 [Registration] [INFO] Table contexts table created.
2024-11-22 06:23:45,755 [Registration] [INFO] Table summaries table created.
2024-11-22 06:23:45,755 [Registration] [INFO] Indexes table created.
2024-11-22 06:23:45,756 [Registration] [INFO] Index table mappings table created.
2024-11-22 06:23:45,758 [Registration] [INFO] Query history table created.


'{"status": "SUCCESS", "message": "Database Initialized.", "data": null}'
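Each Pneuma call in this notebook returns a JSON string with `status`, `message`, and `data` fields, as the output above shows. The following is a minimal sketch of a helper that parses such a string and fails early when the status is not `SUCCESS`. The name `parse_response` is hypothetical and not part of Pneuma, and treating any other status value as a failure is an assumption, since `SUCCESS` is the only status that appears in this notebook.

```python
import json


def parse_response(raw: str) -> dict:
    """Parse a Pneuma-style JSON response string into a dict.

    Hypothetical helper (not part of Pneuma). Assumes the response has
    `status`, `message`, and `data` keys, as in the outputs shown in
    this notebook, and that any status other than "SUCCESS" is a failure.
    """
    response = json.loads(raw)
    if response.get("status") != "SUCCESS":
        raise RuntimeError(f"Pneuma call failed: {response.get('message')}")
    return response


# The string returned by pneuma.setup() in the cell above.
raw = '{"status": "SUCCESS", "message": "Database Initialized.", "data": null}'
parsed = parse_response(raw)
print(parsed["message"])
```

With a helper like this, the repeated `json.loads(response)` lines in the later cells could be replaced by a single call that also checks the status.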

### Registration

To register a dataset, we call the add_tables function while pointing to a directory and specifying the data creator.

In [4]:
data_path = "../data_src/sample_data/csv"

response = pneuma.add_tables(path=data_path, creator="demo_user")
response = json.loads(response)
print(response)

2024-11-22 06:23:45,765 [Registration] [INFO] Reading folder ../data_src/sample_data/csv...
2024-11-22 06:23:45,766 [Registration] [INFO] Processing ../data_src/sample_data/csv/5n77-2d6a.csv...
2024-11-22 06:23:45,783 [Registration] [INFO] Processing table ../data_src/sample_data/csv/5n77-2d6a.csv SUCCESS: Table with ID: ../data_src/sample_data/csv/5n77-2d6a.csv has been added to the database.
2024-11-22 06:23:45,783 [Registration] [INFO] Processing ../data_src/sample_data/csv/inner_folder...
2024-11-22 06:23:45,783 [Registration] [INFO] Reading folder ../data_src/sample_data/csv/inner_folder...
2024-11-22 06:23:45,784 [Registration] [INFO] Processing ../data_src/sample_data/csv/inner_folder/28km-gtjn.csv...
2024-11-22 06:23:45,792 [Registration] [INFO] Processing table ../data_src/sample_data/csv/inner_folder/28km-gtjn.csv SUCCESS: Table with ID: ../data_src/sample_data/csv/inner_folder/28km-gtjn.csv has been added to the database.
2024-11-22 06:23:45,792 [Registration] [INFO] 1 files

{'status': 'SUCCESS', 'message': '3 files in folder ../data_src/sample_data/csv has been processed.', 'data': {'file_count': 3, 'tables': [{'table_id': '../data_src/sample_data/csv/5n77-2d6a.csv', 'table_name': '5n77-2d6a'}, {'table_id': '../data_src/sample_data/csv/inner_folder/28km-gtjn.csv', 'table_name': '28km-gtjn'}, {'table_id': '../data_src/sample_data/csv/5cq6-qygt.csv', 'table_name': '5cq6-qygt'}]}}
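The registration response lists every table with its `table_id` (the file path) and its `table_name`. As a small sketch, the snippet below builds a name-to-ID lookup from that structure. The dictionary literal copies the relevant fields from the output above; the variable names `registration` and `name_to_id` are chosen here for illustration.

```python
# Relevant fields of the registration response printed above.
registration = {
    "status": "SUCCESS",
    "data": {
        "file_count": 3,
        "tables": [
            {
                "table_id": "../data_src/sample_data/csv/5n77-2d6a.csv",
                "table_name": "5n77-2d6a",
            },
            {
                "table_id": "../data_src/sample_data/csv/inner_folder/28km-gtjn.csv",
                "table_name": "28km-gtjn",
            },
            {
                "table_id": "../data_src/sample_data/csv/5cq6-qygt.csv",
                "table_name": "5cq6-qygt",
            },
        ],
    },
}

# Map each short table name to its full table ID (the file path).
name_to_id = {
    table["table_name"]: table["table_id"]
    for table in registration["data"]["tables"]
}

# Sanity check: one entry per registered file.
assert len(name_to_id) == registration["data"]["file_count"]

for name, table_id in name_to_id.items():
    print(f"{name} -> {table_id}")
```

Note that tables in nested folders, such as `inner_folder/28km-gtjn.csv`, keep their full relative path as the ID.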


To register context or summaries for a dataset, we call the add_metadata function.

In [5]:
metadata_path = "../data_src/sample_data/metadata.csv"

response = pneuma.add_metadata(metadata_path=metadata_path)
response = json.loads(response)
print(response)

2024-11-22 06:23:45,828 [Registration] [INFO] Context ID: 1
2024-11-22 06:23:45,830 [Registration] [INFO] Summary ID: 2


{'status': 'SUCCESS', 'message': '2 metadata entries has been added.', 'data': {'file_count': 2, 'metadata_ids': [1, 2]}}


### Summarization
By default, calling the summarize function will create summaries for all unsummarized tables.

In [6]:
response = pneuma.summarize()
response = json.loads(response)
print(response)

2024-11-22 06:23:47,676 [accelerate.utils.modeling] [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 4/4 [00:04<00:00,  1.21s/it]
2024-11-22 06:23:53,765 [sentence_transformers.SentenceTransformer] [INFO] Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5
2024-11-22 06:23:56,671 [Summarizer] [INFO] Generating summaries for all unsummarized tables...
2024-11-22 06:23:56,673 [Summarizer] [INFO] Found 3 unsummarized tables.


Optimal batch size: 50


100%|██████████| 1/1 [00:03<00:00,  3.84s/it]
Token indices sequence length is longer than the specified maximum sequence length for this model (564 > 512). Running this sequence through the model will result in indexing errors


{'status': 'SUCCESS', 'message': 'Total of 7 summaries has been added with IDs: 3, 4, 5, 6, 7, 8, 9.\n', 'data': {'table_ids': ['../data_src/sample_data/csv/5n77-2d6a.csv', '../data_src/sample_data/csv/inner_folder/28km-gtjn.csv', '../data_src/sample_data/csv/5cq6-qygt.csv'], 'summary_ids': [3, 4, 5, 6, 7, 8, 9]}}


### Index Generation
To generate both the vector and keyword indexes, we call the generate_index function and specify a name for the index. By default, this function indexes all registered tables.

In [7]:
response = pneuma.generate_index(index_name="demo_index")
response = json.loads(response)
print(response)

2024-11-22 06:24:03,472 [sentence_transformers.SentenceTransformer] [INFO] Use pytorch device_name: cuda
2024-11-22 06:24:03,472 [sentence_transformers.SentenceTransformer] [INFO] Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5
2024-11-22 06:24:08,356 [chromadb.telemetry.product.posthog] [INFO] Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.
2024-11-22 06:24:08,485 [IndexGenerator] [INFO] No table ids provided. Generating index for all tables...
2024-11-22 06:24:08,487 [IndexGenerator] [INFO] Generating index for 3 tables...
2024-11-22 06:24:08,502 [IndexGenerator] [INFO] Vector index named demo_index with id 10 has been created.
2024-11-22 06:24:08,502 [IndexGenerator] [INFO] Processing table ../data_src/sample_data/csv/5n77-2d6a.csv...
2024-11-22 06:24:08,504 [IndexGenerator] [INFO] Processing table ../data_src/sample_data/csv/inner_folder/28km-gtjn.csv...
2024-11-22 06:24:08,505 [IndexGenerator] [INFO] Proce

{'status': 'SUCCESS', 'message': 'Vector and keyword index named demo_index with id 10 and 11 has been created with 3 tables.', 'data': {'table_ids': ['../data_src/sample_data/csv/5n77-2d6a.csv', '../data_src/sample_data/csv/inner_folder/28km-gtjn.csv', '../data_src/sample_data/csv/5cq6-qygt.csv'], 'vector_index_id': 10, 'keyword_index_id': 11, 'vector_index_generation_time': 0.014294862747192383, 'keyword_index_generation_time': 0.010687828063964844}}


## Online Stage (Querying)
To retrieve a ranked list of tables, we use the query_index function.

In [8]:
response = pneuma.query_index(
    index_name="demo_index",
    query="Which dataset contains climate issues?",
    k=1,
    n=5,
    alpha=0.5,
)
response = json.loads(response)
query = response["data"]["query"]
retrieved_tables = response["data"]["response"]

print(f"Query: {query}")
print("Retrieved tables:")
for table in retrieved_tables:
    print(table)

2024-11-22 06:24:08,646 [sentence_transformers.SentenceTransformer] [INFO] Use pytorch device_name: cuda
2024-11-22 06:24:08,646 [sentence_transformers.SentenceTransformer] [INFO] Load pretrained SentenceTransformer: BAAI/bge-base-en-v1.5
2024-11-22 06:24:15,389 [accelerate.utils.modeling] [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|██████████| 4/4 [00:08<00:00,  2.02s/it]


Query: Which dataset contains climate issues?
Retrieved tables:
../data_src/sample_data/csv/5n77-2d6a.csv
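This notebook does not explain the `alpha` parameter of `query_index`. In hybrid retrieval, a parameter like this commonly weights the score from one retriever against the other. The sketch below illustrates that general technique with toy scores. It is an assumption for illustration only, not Pneuma's actual scoring: the function `hybrid_rank`, the min-max normalization, the direction of the weighting (`alpha` on the vector side), and the document IDs and scores are all hypothetical.

```python
def normalize(scores: dict) -> dict:
    """Min-max normalize scores to [0, 1]; constant scores map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_rank(vector_scores: dict, keyword_scores: dict,
                alpha: float = 0.5, k: int = 1) -> list:
    """Return the top-k IDs by a weighted blend of two score sets.

    Assumption: alpha weights the vector score and (1 - alpha) weights
    the keyword score. This is NOT taken from Pneuma's implementation.
    """
    vec = normalize(vector_scores)
    kw = normalize(keyword_scores)
    blended = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    # Sort by descending blended score, breaking ties by ID for determinism.
    return sorted(blended, key=lambda doc_id: (-blended[doc_id], doc_id))[:k]


# Toy scores for three hypothetical tables.
vector_scores = {"a": 0.9, "b": 0.4, "c": 0.1}
keyword_scores = {"a": 2.0, "b": 5.0, "c": 1.0}

print(hybrid_rank(vector_scores, keyword_scores, alpha=0.5, k=3))
```

Under these assumptions, `alpha=1.0` ranks by the vector scores alone and `alpha=0.0` by the keyword scores alone, so values in between trade off semantic similarity against exact term matches.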
