# A Tutorial for the phi_generation Library: Data-Driven Prompting and RAG (Direct Function Calls)

## Introduction

This tutorial demonstrates the key features of the `phi_generation` library, focusing on data ingestion, preparing data for a Retrieval-Augmented Generation (RAG) pipeline, and querying the RAG database with optional CSV context.

The library aims to provide focused and relevant answers from language models by grounding them in specific data.

**Note:** This tutorial assumes you have the `phi_generation` library installed.

After making changes to the library code, rebuild and reinstall the package. The commands below are shell commands: run them in a terminal from your `phi_generation` project root, or prefix each line with `!` in a notebook cell. Pasting them into a Python cell unchanged raises `SyntaxError: invalid syntax`.

In [3]:
!python -m build
!pip install dist/code_files-0.1.0.tar.gz --upgrade

## 1. Preparing Data from CSV for RAG

This section demonstrates how to convert structured data in a CSV file into a Markdown format suitable for the RAG database using the `utils` module.

In [1]:
import os
from code_files import utils as ut

# Define the path to your CSV file
csv_file_path = r"C:\Users\noliv\Downloads\structured_data_filled.csv"  

# Convert the CSV to a Markdown table
markdown_table = ut.csv_to_markdown_table(csv_file_path)

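Judging by the check below, `csv_to_markdown_table` reports failures as a string beginning with "Error", so test the result before using it: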

In [2]:
if isinstance(markdown_table, str) and markdown_table.startswith("Error"):
    print(markdown_table)
else:
    print("Successfully converted CSV to Markdown table.")
    print("\n--- Markdown Table Preview ---")
    print(markdown_table[:500] + "...\n--- End Preview ---") # Preview a part of the table

    # Save the Markdown table to the data directory
    save_message = ut.var_markdown_to_data_dir(markdown_table, filename="from_csv_data.md")
    print(save_message)

    # Copy the generated Markdown file to the RAG data directory
    # Assuming your notebook is in the project root
    local_md_path = "code_files/rag_module/data/from_csv_data.md"
    ut.local_markdown_to_data_dir(local_md_path)

Successfully converted CSV to Markdown table.

--- Markdown Table Preview ---
| Patient Name   |   Patient Age |   Alzheimers |   Anxiety |   Arthritis |   Behavior |   Bipolar |   Cannabis |   Cardio |   Chronic Disease |   Depression |   Diabetes |   Dieting |   Disabilities |   Drug-Induced Delirium |   Exercise |   Gastrointestinal |   Getting Worse or Not Better |   Hospital Admission |   Hospital Readmission |   Hypertension |   Kidney Disease |   Long-Term Care |   Memory Care |   Mental Health Questionnaire |   Obesity/Metabolic |   Osteoarthritis |   Pain |   Pre...
--- End Preview ---
Markdown written to c:\Users\noliv\phi_generation_repo_destination\phi_generation\code_files\..\code_files/rag_module/data\from_csv_data.md
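
To make the conversion step concrete, here is a minimal standard-library sketch of turning CSV text into a Markdown table. It is not the implementation of `ut.csv_to_markdown_table`, which is not shown in this tutorial; the function name `csv_text_to_markdown` and the sample rows are assumptions for illustration.

```python
import csv
import io


def csv_text_to_markdown(csv_text: str) -> str:
    """Render CSV text as a pipe-delimited Markdown table (illustrative only)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    header, *body = rows

    def format_row(cells):
        # Escape literal pipes so they do not split a cell in two.
        escaped = [cell.strip().replace("|", "\\|") for cell in cells]
        return "| " + " | ".join(escaped) + " |"

    lines = [format_row(header), format_row(["---"] * len(header))]
    for row in body:
        # Pad short rows and trim long ones so every row matches the header width.
        padded = (row + [""] * len(header))[: len(header)]
        lines.append(format_row(padded))
    return "\n".join(lines)


# Hypothetical sample data, not taken from the real CSV file.
sample_csv = "PatientID,Name,Age\n1,Alice,30\n2,Bob,25\n"
table = csv_text_to_markdown(sample_csv)
print(table)
```

A table in this form is plain text, so it can be chunked and embedded like any other Markdown document in the data directory.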


## 2. Adding Additional Documents to the Data Directory (Optional)

You can add more Markdown files to the 'data' directory, which will be processed by the database creation step.

In [3]:
# Example 1: Adding content directly
document_content = """
# This is an example document

This document contains some additional information for the RAG database.
"""
ut.var_markdown_to_data_dir(document_content, filename = "sample_text.md")
print("Added sample_text.md to the data directory.")

Added sample_text.md to the data directory.


In [5]:
# Example 2: Copying a local Markdown file

# Replace this with the path to your local .md file. The same function can also add many files from a path at once.

# This file is a PDF, not an .md file, so with enforce_md=True the function returns an error message
local_file_to_copy = r"C:\Users\noliv\Downloads\s41592-022-01728-4.pdf" 
copy_result = ut.local_markdown_to_data_dir(local_file_to_copy, "code_files/data", enforce_md=True)
print(copy_result)

Error: Only files with the '.md' extension can be copied.
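
The extension check behind `enforce_md=True` can be pictured with the sketch below. It is an assumption about the general approach, not the library's code: `check_markdown_extension` is a hypothetical name, and only the error string is copied from the output above.

```python
from pathlib import PurePath


def check_markdown_extension(path_str: str, enforce_md: bool = True) -> str:
    """Return an error string for non-.md files, mirroring the message shown above."""
    suffix = PurePath(path_str).suffix.lower()
    if enforce_md and suffix != ".md":
        return "Error: Only files with the '.md' extension can be copied."
    # The "OK" message is made up for this sketch.
    return f"OK: {PurePath(path_str).name}"


print(check_markdown_extension("notes.md"))
print(check_markdown_extension("paper.pdf"))
```

Returning a message instead of raising an exception matches how this tutorial handles results: the caller prints or inspects the returned string.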


## 3. Creating the RAG Database

Now that we have our data in the 'data' directory (including the Markdown generated from the CSV and any additional documents), we'll create the Chroma vector database using the `create_database` module.

In [6]:
from code_files.rag_module import create_database as crd

# Run the database generation process
crd.generate_data_store()
print("\nRAG database generation complete.")

--- Generating Data Store ---


libmagic is unavailable but assists in filetype detection. Please consider installing libmagic for better results.


Loaded 5 documents.
Split documents into 100 chunks.
Error saving to Chroma: [WinError 32] The process cannot access the file because it is being used by another process: 'chroma\\daee5f4c-9b22-493e-b188-e19f8df24101\\data_level0.bin'
--- Data Store Generation Complete ---

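RAG database generation complete.

Note that the log above reports `Error saving to Chroma` with `[WinError 32]`, so the store may not have been fully written even though the completion messages are printed. On Windows this error means another process holds the file, often an earlier notebook kernel that still has the `chroma` directory open. Restarting the kernel and rerunning this cell is a common fix.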
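
The "Split documents into 100 chunks" step can be illustrated with a small pure-Python sketch of overlapping character chunks. The splitter that `generate_data_store` actually uses is not shown here, so the function name `split_into_chunks` and the `chunk_size` and `overlap` values are assumptions.

```python
def split_into_chunks(text: str, chunk_size: int = 300, overlap: int = 100) -> list:
    """Split text into fixed-size character chunks that share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 700-character document made of repeating digits.
document = "".join(str(i % 10) for i in range(700))
chunks = split_into_chunks(document)
print(len(chunks), [len(chunk) for chunk in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.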


## 4. Querying the RAG Database with Optional CSV Context (Direct Function Call)

This section demonstrates querying the RAG database using the `query_rag_with_csv` function from the `query_data` module.


The first query is deliberately vague, so we expect the response to say that it cannot answer.

In [7]:
from code_files.rag_module import query_data as qd

# Define the query text
query = "Answer the question based on the information provided."

In [8]:
# Option A: Query without providing a CSV file
print("\n--- Querying without CSV Context ---")
response_no_csv = qd.query_rag_with_csv(query)
print(response_no_csv)


--- Querying without CSV Context ---


  db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)
  response_text = model.predict(prompt)


Response: I apologize, but without any specific question or additional information provided, I am unable to answer based solely on the context given. Please provide more details or a specific question so that I can assist you further.
Sources: ['C:\\Users\\noliv\\phi_generation_repo_destination\\phi_generation\\code_files\\rag_module\\data\\connor_soap_therapy.md', 'C:\\Users\\noliv\\phi_generation_repo_destination\\phi_generation\\code_files\\rag_module\\data\\connor_soap_therapy.md', 'C:\\Users\\noliv\\phi_generation_repo_destination\\phi_generation\\code_files\\rag_module\\data\\connor_soap_therapy.md']


In [10]:

# Option B: Query with a CSV file providing additional context (Direct Function Call)
print("\n--- Querying with CSV Context ---")

better_query = "Create a patient profile for Kathryn in SOAP format using the structured patient data provided for Kathryn in the .csv file attached, as well as style supplements for writing a SOAP note in the database."
# Replace the path below with your own CSV file:
csv_file_for_query = r"C:\Users\noliv\Downloads\structured_data_filled.csv"
response_with_csv = qd.query_rag_with_csv(better_query, csv_file_for_query)
print(response_with_csv)


--- Querying with CSV Context ---
Response: Subjective:
Kathryn is a 26-year-old female patient who presents with symptoms of anxiety and behavioral issues. She denies any history of Alzheimer's disease, arthritis, chronic diseases, depression, diabetes, disabilities, drug-induced delirium, gastrointestinal issues, hospital admissions, hospital readmissions, hypertension, kidney disease, long-term care, memory care, obesity/metabolic disorders, osteoarthritis, pain, prediabetes, quality of life concerns, semaglutide use, sleep disturbances, or stress.

Objective:
Upon assessment, Kathryn has reported experiencing anxiety and behavioral changes. She has a positive history of anxiety, behavior issues, cannabis use, cardio conditions, and exercise habits. Kathryn denies any history of Alzheimer's disease, arthritis, chronic diseases, depression, diabetes, disabilities, drug-induced delirium, gastrointestinal issues, hospital admissions, hospital readmissions, hypertension, kidney disease

This looks promising: the response drew on the style guide even though no metadata marked it as one (the functions that add documents to the database have options for this). Note, however, that the Subjective and Objective sections largely repeat each other. Future work includes adjusting the context window, trying a range of temperatures (and observing how errors propagate), and improving the prompt.
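
One way to picture how retrieved chunks and optional CSV text end up in a single prompt is the sketch below. The template wording and the `build_prompt` helper are assumptions for illustration; the prompt that `query_rag_with_csv` really builds is not shown in this tutorial.

```python
PROMPT_TEMPLATE = """Answer the question based only on the following context:

{context}

{csv_section}Question: {question}
"""


def build_prompt(question, retrieved_chunks, csv_text=None):
    """Join retrieved chunks and, if given, raw CSV text into one prompt string."""
    context = "\n\n---\n\n".join(retrieved_chunks)
    # The CSV block is only added when the caller supplies CSV text.
    csv_section = f"Structured data (CSV):\n{csv_text}\n\n" if csv_text else ""
    return PROMPT_TEMPLATE.format(
        context=context, csv_section=csv_section, question=question
    )


# Hypothetical chunks and CSV text, for illustration only.
chunks = ["SOAP notes have four sections.", "Keep the Objective section factual."]
prompt_without_csv = build_prompt("Summarize the style guide.", chunks)
prompt_with_csv = build_prompt("Write a SOAP note.", chunks, "Name,Age\nAlice,30")
print(prompt_with_csv)
```

Labeling the CSV block separately from the retrieved context is one possible way to help the model tell patient facts from style guidance, which may matter for the repetition seen above.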

## 5. Comparing Embeddings (Optional: Direct Function Call for Error Handling)

The `compare_embeddings` module allows you to query the database directly and see the retrieved documents along with their relevance scores.

In [None]:
from code_files.rag_module import compare_embeddings as ce
print("\n--- Comparing Embeddings ---")
query_for_comparison = "What are some key details?"
embedding_comparison_results = ce.compare_embeddings_query(query_for_comparison)
print(embedding_comparison_results)
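
Relevance scores of this kind are usually derived from a similarity or distance between embedding vectors. The sketch below ranks toy three-dimensional vectors by cosine similarity; it does not reproduce `compare_embeddings_query` or Chroma's scoring, and the vectors and document names are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings; real embeddings have hundreds or thousands of dimensions.
toy_docs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.7, 0.7, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
ranking = rank_by_similarity([1.0, 0.1, 0.0], toy_docs)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Inspecting scores like these for a known query is a quick way to tell whether a poor answer comes from retrieval or from the generation step.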

## 6. Generating Structured to Unstructured PHI

This section uses the `st_to_unst` module to query a vanilla LLM (i.e., without RAG) and create a doctor's record from a .csv file containing structured PHI for one or more patients. To judge more specific formatting instructions, such as record templates, compare the results qualitatively and with the telephone module.

The workflow has three steps:
1. Generate patient health records from structured CSV data.
2. Create an AnnData object to store both structured and unstructured data.
3. Convert the unstructured data to a JSON file.

In [13]:
from code_files.st_to_unst_module import st_to_unst as sun
import pandas as pd

# Define the path to your patient data CSV file
patient_csv_file = r"C:\Users\noliv\Downloads\structured_data_filled.csv" # Replace with the actual path

# Example CSV structure (adjust to your actual data):
#   PatientID,Name,Age,Gender,ConditionA,ConditionB,MedicationX,MedicationY
#   1,Alice,30,Female,0,1,1,0
#   2,Bob,25,Male,1,0,0,1

# 1. Generate an AnnData object
print("\n--- Creating AnnData Object ---")
try:
    adata = sun.create_anndata_from_csv(patient_csv_file, patient_id_columns=["PatientID", "Name"])
    print(adata)
    print("\nSample patient records:")
    print(adata.obs.head())
    print("\nSample unstructured data:")
    print(adata.X.head())
except FileNotFoundError:
    print(f"Error: CSV file not found at {patient_csv_file}")
except Exception as e:
    print(f"An error occurred: {e}")


--- Creating AnnData Object ---


  llm = OpenAI(temperature=0.7)  # You can adjust temperature
  report = llm.predict(introduction_prompt)


An error occurred: An error occurred during AnnData creation: name 'pd' is not defined

The notebook cell already imports `pandas`, so this `NameError` is most likely raised inside the `st_to_unst` module itself, which appears to use `pd` without importing it. Adding `import pandas as pd` at the top of that module, then rebuilding and reinstalling the package, should resolve it. Retrying from the notebook with extra imports, as in the next cell, fails the same way.


In [15]:
from code_files.st_to_unst_module import st_to_unst as stu
import pandas as pd
import anndata as ad
import os

patient_csv_file = r"C:\Users\noliv\Downloads\structured_data_filled.csv"

print("\n--- Creating AnnData Object from CSV ---")
try:
    adata = stu.create_anndata_from_csv(patient_csv_file, patient_id_columns=["PatientID", "Name"])

    # Explore the AnnData object
    print(adata)  # Print basic information
    print("\nSample of structured patient data (adata.obs):")
    print(adata.obs.head())  # Show the first few rows of the original CSV data
    print("\nSample of unstructured patient reports (adata.X):")
    print(adata.X.head())  # Show the patient names and generated reports

except FileNotFoundError:
    print(f"Error: CSV file not found at {patient_csv_file}")
except Exception as e:
    print(f"An error occurred: {e}")


--- Creating AnnData Object from CSV ---
An error occurred: An error occurred during AnnData creation: name 'pd' is not defined
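
Because the cells above stop before the JSON step, here is a standard-library sketch of the overall shape: read structured rows, build a per-patient prompt, and serialize the results to JSON. It does not use `st_to_unst` or AnnData. The column names follow the example CSV structure in the comments above, and `row_to_prompt`, the prompt wording, and the `report` placeholder are assumptions.

```python
import csv
import io
import json


def row_to_prompt(row, id_columns):
    """Turn one structured row into a text prompt for a language model."""
    identity = ", ".join(f"{col}={row[col]}" for col in id_columns)
    fields = "; ".join(
        f"{col}={val}" for col, val in row.items() if col not in id_columns
    )
    return (
        f"Write a clinical note for the patient ({identity}). "
        f"Fields (1 = present, 0 = absent): {fields}"
    )


# Sample rows taken from the example CSV structure shown earlier.
sample_csv = (
    "PatientID,Name,Age,Gender,ConditionA,ConditionB,MedicationX,MedicationY\n"
    "1,Alice,30,Female,0,1,1,0\n"
    "2,Bob,25,Male,1,0,0,1\n"
)
id_columns = ["PatientID", "Name"]

records = []
for row in csv.DictReader(io.StringIO(sample_csv)):
    records.append({
        "patient": {col: row[col] for col in id_columns},
        "prompt": row_to_prompt(row, id_columns),
        "report": None,  # Placeholder: a generated report would be stored here.
    })

payload = json.dumps(records, indent=2)
print(payload)
```

Keeping the identifier columns separate from the generated text makes it straightforward to join each report back to its structured row later.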


## Conclusion

This tutorial demonstrated how to use the `phi_generation` library to ingest data from CSV files and local Markdown files, build a RAG database, and query it, optionally including CSV data directly in the prompt. The `utils` module provides helper functions for data handling, and calling the library functions directly keeps the whole workflow inside the Jupyter Notebook.