
# Structured-to-Unstructured Patient Health Information (PHI) Generation

This notebook demonstrates how to use the `st_to_unst` module from the `phi_generation` repository.

## Purpose
This module transforms **structured patient health information (PHI)** (a .csv file of patient demographics, diseases, symptoms, and medications) into **synthetic, unstructured doctor's notes** using an LLM.

Synthetic notes of this kind make it possible to simulate realistic PHI for HIPAA-compliant research without using real patient data.

---

## Table of Contents
- Setup
- Upload Structured PHI (.csv)
- Generating an AnnData Object of Unstructured PHI (Doctor's Notes) via Iterated LLM Querying
- Validating the Data: Running a Unit Test on adata.obs
- Safely Modifying the AnnData Object
- Summary


## Setup

In [None]:
# Build and install the package

# Run these commands from the repository root (the directory containing
# pyproject.toml or setup.py), either here or in a terminal
!python -m build
%pip install dist/phi_gen-0.1.0.tar.gz --upgrade






In [None]:

# Import necessary libraries

import pandas as pd
import numpy as np
import anndata as ad

# Import the st_to_unst module
from phi_gen.st_to_unst_module import st_to_unst as sun


## Upload Structured PHI (.csv)

The .csv file should contain one row per patient. The leading columns hold identifying and demographic information (in the example below, patient name and age), and the remaining columns form an alphabetized list of patient features, including diseases, symptoms, and medications.

As indicated in the LLM prompting in the `unst_to_st.py` code, each non-demographic column contains 1 if the patient presents with that feature and 0 if the feature is not mentioned. NaN values are ambiguous and should be cleaned out (a useful exercise; one approach appears in the last section of this notebook).
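The 0/1 encoding described above can be checked before any notes are generated. The following is a minimal sketch on a toy frame; the helper name `summarize_feature_columns` and its `id_columns` parameter are assumptions made for illustration and are not part of `phi_gen`.

```python
import numpy as np
import pandas as pd


def summarize_feature_columns(df, id_columns):
    """Report NaN counts and unexpected values in the non-demographic columns.

    `id_columns` lists the identifier/demographic columns to skip
    (hypothetical parameter, chosen for this illustration).
    """
    feature_cols = [c for c in df.columns if c not in id_columns]
    features = df[feature_cols]
    # NaN is ambiguous: it is neither "present" (1) nor "not mentioned" (0).
    nan_counts = features.isna().sum()
    # Anything other than 0, 1, or NaN violates the expected encoding.
    invalid_counts = (~(features.isin([0, 1]) | features.isna())).sum()
    return nan_counts[nan_counts > 0], invalid_counts[invalid_counts > 0]


# Toy data shaped like the structured CSV (values are illustrative only).
toy = pd.DataFrame(
    {
        "Patient Name": ["Kathryn", "Michael"],
        "Patient Age": [26, 38],
        "Anxiety": [1, 0],
        "Arthritis": [np.nan, 0.0],
    }
)
nan_report, invalid_report = summarize_feature_columns(
    toy, id_columns=["Patient Name", "Patient Age"]
)
print(nan_report)      # feature columns that still contain NaN
print(invalid_report)  # feature columns with values outside {0, 1}
```

A non-empty first report indicates columns that need cleaning; a non-empty second report suggests the file does not follow the expected encoding at all.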

In [None]:


# A local file path is hardcoded here to demonstrate the expected structure;
# replace it with the path to your own structured .csv file
csv_file_path = r"C:\Users\noliv\Downloads\structured_data_filled.csv"

# (Optional) Preview the structured data
df_structured = pd.read_csv(csv_file_path)
df_structured.head()


Unnamed: 0,Patient Name,Patient Age,Alzheimers,Anxiety,Arthritis,Behavior,Bipolar,Cannabis,Cardio,Chronic Disease,...,Memory Care,Mental Health Questionnaire,Obesity/Metabolic,Osteoarthritis,Pain,Prediabetes,Quality of Life,Semaglutide,Sleep,Stress
0,Kathryn,26,0,1,,1,0.0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,Michael,38,0,0,0.0,0,0.0,0,1,0,...,0,0,0,0,0,0,1,0,1,0
2,Kevin,17,0,1,0.0,0,0.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Megan,25,0,1,0.0,1,1.0,1,1,1,...,0,1,0,0,0,0,1,0,1,1
4,Brian,62,0,0,0.0,0,0.0,0,0,1,...,0,0,1,1,1,0,1,1,1,1


## Generating an AnnData Object of Unstructured PHI (Doctor's Notes) via Iterated LLM Querying


### Understanding the AnnData Object Structure

After the synthetic doctor's notes are generated, they are stored in an `AnnData` object, an annotated-matrix data structure commonly used in bioinformatics and increasingly in healthcare machine learning.

Here's how the `AnnData` object is organized:

| Component | Contents | Purpose |
|:----------|:---------|:--------|
| `adata.X` | A 2D array where each row contains: <br>• Column 0: Patient Name <br>• Column 1: Generated Doctor's Note | Holds unstructured text data generated for each patient. |
| `adata.obs` | A DataFrame (like a table) containing the original structured data from the CSV | Preserves all structured fields like age, diagnoses, medications, etc. |
| `adata.var_names` | ["patient_name", "unstructured_note"] | Names for the two columns inside `adata.X`. |
| `adata.uns` | Metadata dictionary, e.g., the original CSV filename | Stores useful metadata for reproducibility or tracking. |

---



>  __Note:__ Think of `.X` as the *generated text*, `.obs` as the *input patient info* (for reproducibility), and `.uns` as *metadata*.

### Running the Code

In [None]:

# Generate a synthetic note for each patient in the CSV and collect the results in an AnnData object
adata = sun.process_csv_and_create_anndata(
    csv_file_path,
    patient_id_columns=["Patient Name"]  # Adjust columns if needed
)
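The internals of `process_csv_and_create_anndata` are not shown in this notebook. As a rough illustration of what querying an LLM once per patient could involve, the sketch below turns one structured record into a prompt. The function `row_to_prompt`, its `demographic_keys` parameter, and the prompt wording are all hypothetical and are not taken from `phi_gen`.

```python
def row_to_prompt(row, demographic_keys):
    """Build a note-generation prompt from one structured patient record.

    Hypothetical helper: features equal to 1 are listed as present; features
    equal to 0 or missing are left out, mirroring "not mentioned".
    """
    demographics = ", ".join(f"{key}: {row[key]}" for key in demographic_keys)
    present = sorted(
        name
        for name, value in row.items()
        if name not in demographic_keys and value == 1
    )
    findings = ", ".join(present) if present else "none recorded"
    return (
        "Write a synthetic doctor's note for a fictional patient.\n"
        f"Demographics: {demographics}\n"
        f"Documented conditions, symptoms, and medications: {findings}"
    )


# One illustrative record, shaped like a row of the structured CSV.
record = {
    "Patient Name": "Kathryn",
    "Patient Age": 26,
    "Anxiety": 1,
    "Bipolar": 0,
    "Cardio": 1,
}
prompt = row_to_prompt(record, demographic_keys=["Patient Name", "Patient Age"])
print(prompt)
```

Applying such a function to every row and sending each prompt to the model would give one query per patient, which is the sense of "iterated" used in the section title.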


### Viewing the Output

In [2]:

# View a summary of the AnnData object
adata



In [None]:

# View structured data (.obs)
adata.obs.head()


Unnamed: 0,Patient Name,Patient Age,Alzheimers,Anxiety,Arthritis,Behavior,Bipolar,Cannabis,Cardio,Chronic Disease,...,Memory Care,Mental Health Questionnaire,Obesity/Metabolic,Osteoarthritis,Pain,Prediabetes,Quality of Life,Semaglutide,Sleep,Stress
0,Kathryn,26,0,1,,1,0.0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,Michael,38,0,0,0.0,0,0.0,0,1,0,...,0,0,0,0,0,0,1,0,1,0
2,Kevin,17,0,1,0.0,0,0.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Megan,25,0,1,0.0,1,1.0,1,1,1,...,0,1,0,0,0,0,1,0,1,1
4,Brian,62,0,0,0.0,0,0.0,0,0,1,...,0,0,1,1,1,0,1,1,1,1


In [None]:

# View unstructured generated notes (.X)
pd.DataFrame(adata.X, columns=["Patient Name", "Generated Doctor's Note"]).head()


Unnamed: 0,Patient Name,Generated Doctor's Note
0,Kathryn,"\n\nIntroduction:\nJohn Smith, a 45-year-old m..."
1,Michael,"\n\nIntroduction:\n\nThe patient, John Smith, ..."
2,Kevin,\n\nIntroduction:\n\nThis report is written fo...
3,Megan,\nIntroduction:\n\nI am writing this public he...
4,Brian,\n\nIntroduction:\n\nPatient Name: John Smith\...


## Validating the Data: Running a Unit Test on adata.obs



We want to make sure that the structured data (`adata.obs`) exactly matches the original input CSV. This helps ensure no corruption, misalignment, or unwanted modifications occurred during processing.

This can be done with the built-in `pandas.testing` functions, in a way that mirrors one of the unit tests.

Example test:


In [None]:

# Validate that adata.obs matches the original CSV
pd.testing.assert_frame_equal(
    adata.obs.reset_index(drop=True),
    df_structured.reset_index(drop=True)
)
print("adata.obs matches the input .csv file")


adata.obs matches the input .csv file



(If the test fails, it will raise a detailed error showing exactly where the mismatch happened.)
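To see what such a failure looks like, the following sketch deliberately alters one value in a toy frame and catches the resulting `AssertionError`. The frames here are illustrative stand-ins for `adata.obs` and `df_structured`, not real output from the module.

```python
import pandas as pd

# Toy stand-ins for the original CSV and the stored .obs table.
original = pd.DataFrame({"Patient Name": ["Kathryn", "Michael"], "Anxiety": [1, 0]})
altered = original.copy()
altered.loc[1, "Anxiety"] = 1  # simulate an unwanted modification

try:
    pd.testing.assert_frame_equal(altered, original)
    mismatch_message = None
except AssertionError as err:
    # The message identifies the column and the differing values.
    mismatch_message = str(err)
    print(mismatch_message)
```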

## Safely Modifying the AnnData Object



Because `.obs` is an ordinary `pandas.DataFrame`, it can be edited directly with the usual pandas methods.

As noted earlier, NaN values are ambiguous, so we replace them with zeros, treating a missing entry as "not mentioned".


In [None]:

# Replace NaN values in adata.obs with zeroes
adata.obs = adata.obs.fillna(0)


In [None]:
adata.obs.head()

Unnamed: 0,Patient Name,Patient Age,Alzheimers,Anxiety,Arthritis,Behavior,Bipolar,Cannabis,Cardio,Chronic Disease,...,Memory Care,Mental Health Questionnaire,Obesity/Metabolic,Osteoarthritis,Pain,Prediabetes,Quality of Life,Semaglutide,Sleep,Stress
0,Kathryn,26,0,1,0.0,1,0.0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,Michael,38,0,0,0.0,0,0.0,0,1,0,...,0,0,0,0,0,0,1,0,1,0
2,Kevin,17,0,1,0.0,0,0.0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Megan,25,0,1,0.0,1,1.0,1,1,1,...,0,1,0,0,0,0,1,0,1,1
4,Brian,62,0,0,0.0,0,0.0,0,0,1,...,0,0,1,1,1,0,1,1,1,1


In [None]:
# Confirm that no NaN values remain anywhere in adata.obs
assert not adata.obs.isnull().values.any()
print("All NaN values replaced with 0 in adata.obs")

All NaN values replaced with 0 in adata.obs
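Calling `fillna(0)` on the whole table also zero-fills any missing demographic values, such as a missing age, which may not be intended. A more conservative variant, sketched here on a toy frame, fills only the feature columns and restores an integer dtype. The `id_columns` list is an assumption about which columns are demographic and should be adjusted to the actual file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for adata.obs (values are illustrative only).
obs = pd.DataFrame(
    {
        "Patient Name": ["Kathryn", "Michael"],
        "Patient Age": [26.0, np.nan],  # a missing age should stay missing
        "Arthritis": [np.nan, 0.0],
        "Bipolar": [0.0, 1.0],
    }
)

id_columns = ["Patient Name", "Patient Age"]  # assumed demographic columns
feature_cols = [c for c in obs.columns if c not in id_columns]

# Zero-fill only the 0/1 feature columns, then convert them back to integers.
for col in feature_cols:
    obs[col] = obs[col].fillna(0).astype(int)

print(obs)
```

The same loop could be applied to `adata.obs` in place of the blanket `fillna(0)` when demographic gaps need to remain visible.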



## Summary

Using the `st_to_unst` module from `phi_generation`, we can transform structured health data into synthetic unstructured clinical notes. Storing the results in a standardized `AnnData` structure will support future tasks such as fine-tuning or validating results from different record generation methods.
