

🩺 Generate Synthetic Blood Pressure Dataset
# ==========================================================
This notebook generates a realistic 30-day blood pressure dataset
for one individual (age, sex, systolic, diastolic)
and saves it with a descriptive filename including the age and sex.
# ----------------------------------------------------------
Example output file:
synthetic_bp_45_female.csv
# ----------------------------------------------------------
Author: Nasrin Attaran
Created: 2025-10-19
Project: HealthTequity Voice Pipeline
# ==========================================================

In [7]:
# --- Step 1: Import required libraries ---
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from pathlib import Path
import matplotlib.pyplot as plt
%matplotlib inline

In [18]:

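# ==========================================================
# Step 2: Define the subject parameters and where to save the CSV
# Change age and sex to generate datasets for other people
age = 70             # Example: 30, 45, 60
sex = "female"        # "male" or "female"

# Output folder used in Step 5. This path matches the one shown in the
# saved-file message in the output and assumes Google Drive is mounted at
# /content/drive; adjust it for your environment.
DATA_DIR = Path("/content/drive/MyDrive/HealthTequity-LLM/data/synthetic_csv")
DATA_DIR.mkdir(parents=True, exist_ok=True)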

# ----------------------------------------------------------
# Other configuration
num_days = 30                         # Number of days to simulate
start_date = datetime.today() - timedelta(days=num_days - 1)
days = pd.date_range(start=start_date, periods=num_days).to_list()

# Fix random seed for reproducibility
np.random.seed(42)



# ==========================================================
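The random seed fixes the blood pressure values, but the dates still depend on the day the notebook is run, because `start_date` is derived from `datetime.today()`. A minimal sketch of one way to pin the dates as well, assuming a hypothetical helper `build_dates` and an explicitly chosen end date (neither is part of the original notebook); it uses only the standard library rather than `pd.date_range`:

```python
from datetime import date, timedelta

def build_dates(end_date, num_days):
    """Return num_days consecutive ISO dates ending on end_date (hypothetical helper)."""
    start = end_date - timedelta(days=num_days - 1)
    return [(start + timedelta(days=i)).isoformat() for i in range(num_days)]

# An end date of 2025-10-20 gives a 30-day window starting on 2025-09-21,
# which is the first date shown in the dataset preview.
dates = build_dates(date(2025, 10, 20), 30)
print(dates[0], dates[-1])
```

Passing the end date as a parameter keeps the whole dataset reproducible, at the cost of having to update it when a fresh window is wanted.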

In [16]:

# ----------------------------------------------------------
# Step 3: Blood pressure regimes
# ----------------------------------------------------------
# "Normal" values represent healthy blood pressure.
# "Hypertensive" values represent high blood pressure.
# Only the normal regime is simulated in this notebook.
# All values are in millimeters of mercury (mmHg).
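The two regimes described above could be captured in a small parameter table. The sketch below is illustrative only: the normal-regime means and standard deviations match those used in the generation cell of this notebook, while the hypertensive values (150/95 mmHg) and the names `REGIMES` and `simulate_bp` are assumptions introduced here. It also uses `np.random.default_rng`, so its draws will not match the output produced with `np.random.seed(42)`.

```python
import numpy as np

# Regime parameters in mmHg as (mean, standard deviation).
# "normal" matches the values used in this notebook; "hypertensive" is an
# assumed, illustrative setting and is not taken from the original code.
REGIMES = {
    "normal": {"systolic": (115, 8), "diastolic": (75, 5)},
    "hypertensive": {"systolic": (150, 10), "diastolic": (95, 6)},  # assumption
}

def simulate_bp(regime, num_days=30, seed=42):
    """Return (systolic, diastolic) integer arrays for one regime (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    sys_mean, sys_sd = REGIMES[regime]["systolic"]
    dia_mean, dia_sd = REGIMES[regime]["diastolic"]
    systolic = rng.normal(sys_mean, sys_sd, num_days)
    diastolic = rng.normal(dia_mean, dia_sd, num_days)
    # Round to whole mmHg, as the notebook does.
    return np.rint(systolic).astype(int), np.rint(diastolic).astype(int)

normal_sys, normal_dia = simulate_bp("normal")
high_sys, high_dia = simulate_bp("hypertensive")
print("mean systolic:", normal_sys.mean(), "vs", high_sys.mean())
```

Keeping the parameters in one dictionary means a new regime can be added without touching the sampling code.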


In [19]:
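# Step 4: Generate the readings (normal regime)
# Systolic: mean 115 mmHg, SD 8 (normal range: 90–120)
# Diastolic: mean 75 mmHg, SD 5 (normal range: 60–80)
# Draws are not clipped, so individual readings can fall outside these ranges.
# A small upward trend is added to mimic gradual change over time.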

systolic = np.random.normal(loc=115, scale=8, size=num_days).round(0).astype(int)
diastolic = np.random.normal(loc=75, scale=5, size=num_days).round(0).astype(int)

# Add a mild upward trend (rounded to whole mmHg, so it rises in integer steps)
systolic += np.linspace(0, 3, num_days).round(0).astype(int)
diastolic += np.linspace(0, 2, num_days).round(0).astype(int)

# Construct the dataset
df = pd.DataFrame({
    "date": [d.strftime("%Y-%m-%d") for d in days],
    "age": age,
    "sex": sex,
    "systolic_mmHg": systolic,
    "diastolic_mmHg": diastolic
})

# Show first few rows for inspection
print("\n📋 Preview of generated dataset:")
display(df.head(10))


# ==========================================================
# Step 5: Save file with descriptive filename
# ==========================================================
# Example: synthetic_bp_45_female.csv
filename = f"synthetic_bp_{age}_{sex.lower()}.csv"
output_csv = DATA_DIR / filename

df.to_csv(output_csv, index=False, encoding="utf-8-sig")
print(f"✅ File saved successfully: {output_csv.resolve()}")


# ==========================================================
# Step 6: Display summary statistics
# ==========================================================
print("\n📊 Summary statistics for verification:")
display(df.describe())



📋 Preview of generated dataset:


         date  age     sex  systolic_mmHg  diastolic_mmHg
0  2025-09-21   70  female            119              72
1  2025-09-22   70  female            114              84
2  2025-09-23   70  female            120              75
3  2025-09-24   70  female            127              70
4  2025-09-25   70  female            113              79
5  2025-09-26   70  female            114              69
6  2025-09-27   70  female            129              76
7  2025-09-28   70  female            122              65
8  2025-09-29   70  female            112              69
9  2025-09-30   70  female            120              77


✅ File saved successfully: /content/drive/MyDrive/HealthTequity-LLM/data/synthetic_csv/synthetic_bp_70_female.csv

📊 Summary statistics for verification:


        age  systolic_mmHg  diastolic_mmHg
count  30.0           30.0            30.0
mean   70.0     115.066667            75.4
std     0.0       6.987345        4.882057
min    70.0          101.0            65.0
25%    70.0          112.0           72.25
50%    70.0          114.0            75.0
75%    70.0         119.75            79.0
max    70.0          129.0            84.0
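
The summary shows maxima of 129 (systolic) and 84 (diastolic), above the normal ranges quoted in the code comments (90–120 and 60–80 mmHg), which is expected because the normal draws are not clipped. A minimal sketch of how the share of in-range readings could be checked, using the ten rows from the preview; the helper `in_range` and the constant names are assumptions for illustration, not part of the original notebook:

```python
import numpy as np

# First ten readings, copied from the dataset preview (mmHg).
systolic = np.array([119, 114, 120, 127, 113, 114, 129, 122, 112, 120])
diastolic = np.array([72, 84, 75, 70, 79, 69, 76, 65, 69, 77])

# Normal ranges quoted in the notebook comments (inclusive bounds assumed).
SYS_RANGE = (90, 120)
DIA_RANGE = (60, 80)

def in_range(values, bounds):
    """Boolean mask of readings that fall within the inclusive bounds."""
    low, high = bounds
    return (values >= low) & (values <= high)

sys_ok = in_range(systolic, SYS_RANGE)
dia_ok = in_range(diastolic, DIA_RANGE)
both_ok = sys_ok & dia_ok

print(f"systolic in range:  {sys_ok.sum()} of {len(systolic)}")   # 7 of 10
print(f"diastolic in range: {dia_ok.sum()} of {len(diastolic)}")  # 9 of 10
print(f"both in range:      {both_ok.sum()} of {len(systolic)}")  # 6 of 10
```

If every reading must stay inside the normal range, one option is to clip the arrays with `np.clip` before building the DataFrame, at the cost of distorting the tails of the distribution.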
