# A Tutorial for the phi_generation Library: Data-Driven Prompting and RAG


## Introduction


This tutorial demonstrates the key features of the `phi_generation` library, focusing on data ingestion from CSV, preparing data for a Retrieval-Augmented Generation (RAG) pipeline, and querying the RAG database with CSV context.

The library is designed to address the challenge of language models inserting unwanted context in responses, especially in sensitive domains like medical notes. By grounding the model in specific data and controlling the prompt, this tool aims to provide more focused and relevant answers.



**Note:** This tutorial assumes you have the `phi_generation` library installed.

After making changes to the code, you need to rebuild and reinstall the package.
Run the following commands in your `phi_generation` project root:

In [None]:
# !python -m build
# !pip install dist/phi_generation-0.1.0.tar.gz --upgrade


Then, restart the Jupyter kernel to ensure the updated package is used.

## 1. Preparing Data from CSV for RAG

This section demonstrates how to convert structured data in a CSV file into a format suitable for the RAG database. We'll use the `utils` module to convert the CSV to a Markdown table and then save it as a `.md` file in the data directory.

### Converting CSV to Markdown Table

In [1]:
import os
from code_files import utils as ut

# Define the path to your CSV file (example path; replace it with your own)

csv_file_path = r"C:\Users\noliv\Downloads\structured_data_filled.csv"

# Convert the CSV to a Markdown table
markdown_table = ut.csv_to_markdown_table(csv_file_path)

if isinstance(markdown_table, str) and markdown_table.startswith("Error"):
    print(markdown_table)
else:
    print("Successfully converted CSV to Markdown table.")
    print("\n--- Markdown Table Preview ---")
    print(markdown_table[:500] + "...\n--- End Preview ---") # Preview a part of the table

    # Save the Markdown table to the data directory
    save_message = ut.markdown_table_to_data_dir(markdown_table, filename="from_csv_data.md")
    print(save_message)

    # Copy the generated Markdown file to the RAG data directory.
    # Assuming your notebook is in the project root: the save step above writes
    # to <project root>/data (see the path in its message), not code_files/data.
    local_md_path = "data/from_csv_data.md"
    ut.copy_markdown_to_rag_data(local_md_path)


Successfully converted CSV to Markdown table.

--- Markdown Table Preview ---
| Patient Name   |   Patient Age |   Alzheimers |   Anxiety |   Arthritis |   Behavior |   Bipolar |   Cannabis |   Cardio |   Chronic Disease |   Depression |   Diabetes |   Dieting |   Disabilities |   Drug-Induced Delirium |   Exercise |   Gastrointestinal |   Getting Worse or Not Better |   Hospital Admission |   Hospital Readmission |   Hypertension |   Kidney Disease |   Long-Term Care |   Memory Care |   Mental Health Questionnaire |   Obesity/Metabolic |   Osteoarthritis |   Pain |   Pre...
--- End Preview ---
Markdown table written to c:\Users\noliv\phi_generation_repo_destination\phi_generation\code_files\..\data\from_csv_data.md
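
To make the conversion step concrete, here is a minimal sketch of how CSV text can be turned into a Markdown pipe table using only the standard library. The function name `csv_text_to_markdown` and the sample rows are made up for illustration; this is not the implementation behind `ut.csv_to_markdown_table`, which may pad and align columns differently (as the preview above suggests).

```python
import csv
import io


def csv_text_to_markdown(csv_text: str) -> str:
    """Render CSV text as a Markdown pipe table (illustrative, not the library's code)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, body = rows[0], rows[1:]

    def fmt(cells):
        # Escape pipes so cell content cannot break the table layout
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    for row in body:
        # Pad or trim ragged rows so every row matches the header width
        padded = (row + [""] * len(header))[: len(header)]
        lines.append(fmt(padded))
    return "\n".join(lines)


# Made-up sample rows; only the column names come from the preview above
sample = "Patient Name,Patient Age\nJane Doe,72\nJohn Roe,65\n"
print(csv_text_to_markdown(sample))
```

A plain-text table like this is convenient for RAG because the document loader can treat the `.md` file like any other text document.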


## 2. Creating the RAG Database

Now that we have our data (potentially supplemented by other `.md` files in the `code_files/rag_module/data` directory), we'll create the Chroma vector database using the `create_database` module. This process loads the documents, splits them into chunks, generates embeddings, and saves them to the `chroma` directory.
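
The chunking step can be pictured with a small stand-alone sketch. It assumes fixed-size character windows with overlap, which is one common strategy; the splitter and the sizes that `create_database` actually uses are not shown in this tutorial, so `split_into_chunks`, `chunk_size`, and `overlap` are hypothetical names and values.

```python
def split_into_chunks(text: str, chunk_size: int = 300, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Synthetic text standing in for a loaded Markdown document
document = "".join(f"row {i}; " for i in range(100))
chunks = split_into_chunks(document, chunk_size=120, overlap=30)
print(len(chunks), "chunks; first chunk starts:", chunks[0][:30])
```

Overlap keeps text that straddles a boundary retrievable from at least one chunk, at the cost of storing some characters twice.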

In [None]:
from code_files.rag_module import create_database as crd

# Run the database generation process
crd.generate_data_store()
print("\nRAG database generation complete.")

libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


--- Generating Data Store ---
Loaded 3 documents.
Split documents into 97 chunks.
Error saving to Chroma: [WinError 32] The process cannot access the file because it is being used by another process: 'chroma\\daee5f4c-9b22-493e-b188-e19f8df24101\\data_level0.bin'
--- Data Store Generation Complete ---

RAG database generation complete.

**Note:** In the output above, the save step fails with `[WinError 32]`, which Windows raises when another process still has the file open. A previous kernel or notebook session holding the `chroma` directory is a likely cause; close it, restart the kernel, and rerun this cell. The final "complete" message is printed by the cell regardless of whether the save succeeded.


## 3. Querying the RAG Database with Optional CSV Context

This section demonstrates how to query the RAG database using the `query_data` module. Notably, this version of the query tool allows you to optionally provide a CSV file path, whose content will be included in the prompt to the language model along with the retrieved context.
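
As a rough picture of what "included in the prompt" can mean, the sketch below joins retrieved chunks, optional CSV text, and the question into a single prompt string. `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names chosen for this example; the template that `query_data` actually uses may differ.

```python
from typing import Optional

# Hypothetical template; the wording is an assumption for illustration
PROMPT_TEMPLATE = """Answer the question based only on the following context:

{context}
{csv_section}
---

Question: {question}"""


def build_prompt(question: str, retrieved_chunks: list[str], csv_text: Optional[str] = None) -> str:
    """Combine retrieved chunks, optional CSV text, and the question into one prompt."""
    context = "\n\n---\n\n".join(retrieved_chunks)
    # The CSV section is added only when CSV text is supplied
    csv_section = f"\nAdditional CSV data:\n{csv_text}\n" if csv_text else ""
    return PROMPT_TEMPLATE.format(context=context, csv_section=csv_section, question=question)


print(build_prompt("What is the patient's age?", ["chunk one", "chunk two"], csv_text="a,b\n1,2"))
```

Keeping the CSV text in its own labeled section lets the model distinguish retrieved passages from the structured data supplied by the user.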

### Option A: Querying without a CSV file

The `query_data` module does not expose a `query_rag_with_csv` function: importing it as `qd` and calling `qd.query_rag_with_csv(query)` raises `AttributeError: module 'code_files.rag_module.query_data' has no attribute 'query_rag_with_csv'`. Run it as a script instead, as in the cell below.

In [None]:
import subprocess
import sys

# Define the query text
query = "Answer the question based on the information provided."

# Option A: Query without providing a CSV file
print("\n--- Querying without CSV Context ---")
try:
    # sys.executable runs the script with the same interpreter as this notebook
    result = subprocess.run(
        [sys.executable, "code_files/rag_module/query_data.py", query],
        capture_output=True,
        text=True,
        check=True
    )
    print(result.stdout)
    if result.stderr:
        # stderr can carry warnings as well as errors
        print(f"stderr from query: {result.stderr}")
except subprocess.CalledProcessError as e:
    # Raised on a non-zero exit code, including when the script path is wrong
    print(f"Error running query_data.py: {e}\n{e.stderr}")


### Option B: Querying with a CSV file that provides additional context to the prompt

In [None]:
import subprocess
import sys

print("\n--- Querying with CSV Context ---")
csv_file_for_query = "path/to/your/supplemental_data.csv" # Replace with a relevant CSV path

try:
    # Reuses `query` from the previous cell
    result = subprocess.run(
        [sys.executable, "code_files/rag_module/query_data.py", query, "--csv_file", csv_file_for_query],
        capture_output=True,
        text=True,
        check=True
    )
    print(result.stdout)
    if result.stderr:
        # stderr can carry warnings as well as errors
        print(f"stderr from query with CSV: {result.stderr}")
except subprocess.CalledProcessError as e:
    # Raised on a non-zero exit code, e.g. a wrong script path or a script-side failure
    print(f"Error running query_data.py with CSV: {e}\n{e.stderr}")
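
The command line used above (a positional query plus an optional `--csv_file` flag) suggests an argument parser along the following lines. This is a guess at the shape of the interface, not the contents of `query_data.py`; the function name `parse_query_args` and the argument name `query_text` are assumptions.

```python
import argparse


def parse_query_args(argv=None):
    """Parse a query string plus an optional --csv_file path (hypothetical mirror of the CLI above)."""
    parser = argparse.ArgumentParser(description="Query the RAG database.")
    parser.add_argument("query_text", help="The question to ask.")
    parser.add_argument("--csv_file", default=None, help="Optional CSV file to add to the prompt.")
    return parser.parse_args(argv)


# Passing an explicit list avoids reading the notebook's own sys.argv
args = parse_query_args(["What changed?", "--csv_file", "path/to/your/supplemental_data.csv"])
print(args.query_text, args.csv_file)
```

With this shape, omitting `--csv_file` leaves `args.csv_file` as `None`, which corresponds to Option A.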


## 4. Comparing Embeddings (Optional)

The `compare_embeddings` module allows you to query the database directly and see the retrieved documents along with their relevance scores. This can be useful for understanding how well the embeddings are working.
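
For intuition about relevance scores, the sketch below ranks toy vectors by cosine similarity, one common way to compare embeddings. It is an assumption that the scores reported by `compare_embeddings` behave similarly; the metric configured in the actual database may be different, and the two-dimensional vectors here are invented.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy "embeddings"; real ones have hundreds or thousands of dimensions
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(rank_by_similarity([1.0, 0.0], toy))
```

If the top-ranked chunks for a query look unrelated to it, that usually points to a chunking or embedding problem rather than a prompting problem.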

In [None]:
import subprocess
import sys

print("\n--- Comparing Embeddings ---")
query_for_comparison = "What are some key details?"

try:
    # sys.executable runs the script with the same interpreter as this notebook
    result = subprocess.run(
        [sys.executable, "code_files/rag_module/compare_embeddings.py", query_for_comparison],
        capture_output=True,
        text=True,
        check=True
    )
    print(result.stdout)
    if result.stderr:
        # stderr can carry warnings as well as errors
        print(f"stderr from embedding comparison: {result.stderr}")
except subprocess.CalledProcessError as e:
    # Raised on a non-zero exit code, including when the script path is wrong
    print(f"Error running compare_embeddings.py: {e}\n{e.stderr}")

## Conclusion

This tutorial demonstrated how to use the `phi_generation` library to ingest data from CSV files, build a RAG database, and query it with the option to include CSV data directly in the prompt. This approach allows for more controlled and context-aware interactions with language models. Remember to replace the placeholder file paths with your actual file locations.