In [1]:
# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np

**Git Repository Link**
> https://github.com/jstraker1/datasummative_2.git

In [2]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("abdallaahmed77/healthcare-risk-factors-dataset")

print("Path to dataset files:", path)


Path to dataset files: /Users/fleursmith/.cache/kagglehub/datasets/abdallaahmed77/healthcare-risk-factors-dataset/versions/1


In [3]:
# Load dataset
df = pd.read_csv(f"{path}/dirty_v3_path.csv")

# Display first few rows
df.head()

Unnamed: 0,Age,Gender,Medical Condition,Glucose,Blood Pressure,BMI,Oxygen Saturation,LengthOfStay,Cholesterol,Triglycerides,HbA1c,Smoking,Alcohol,Physical Activity,Diet Score,Family History,Stress Level,Sleep Hours,random_notes,noise_col
0,46.0,Male,Diabetes,137.04,135.27,28.9,96.04,6,231.88,210.56,7.61,0,0,-0.2,3.54,0,5.07,6.05,lorem,-137.057211
1,22.0,Male,Healthy,71.58,113.27,26.29,97.54,2,165.57,129.41,4.91,0,0,8.12,5.9,0,5.87,7.72,ipsum,-11.23061
2,50.0,,Asthma,95.24,,22.53,90.31,2,214.94,165.35,5.6,0,0,5.01,4.65,1,3.09,4.82,ipsum,98.331195
3,57.0,,Obesity,,130.53,38.47,96.6,5,197.71,182.13,6.92,0,0,3.16,3.37,0,3.01,5.33,lorem,44.187175
4,66.0,Female,Hypertension,95.15,178.17,31.12,94.9,4,259.53,115.85,5.98,0,1,3.56,3.4,0,6.38,6.64,lorem,44.831426


## 1) Summarising the dataset

In [4]:
# Summarise the dataset
df.info()
df.describe()

# .info() - summary of structure, data types, missing values
# .describe() - summary of numerical values and statistics

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                25500 non-null  float64
 1   Gender             25500 non-null  object 
 2   Medical Condition  25500 non-null  object 
 3   Glucose            25500 non-null  float64
 4   Blood Pressure     25500 non-null  float64
 5   BMI                30000 non-null  float64
 6   Oxygen Saturation  30000 non-null  float64
 7   LengthOfStay       30000 non-null  int64  
 8   Cholesterol        30000 non-null  float64
 9   Triglycerides      30000 non-null  float64
 10  HbA1c              30000 non-null  float64
 11  Smoking            30000 non-null  int64  
 12  Alcohol            30000 non-null  int64  
 13  Physical Activity  30000 non-null  float64
 14  Diet Score         30000 non-null  float64
 15  Family History     30000 non-null  int64  
 16  Stress Level       300

Unnamed: 0,Age,Glucose,Blood Pressure,BMI,Oxygen Saturation,LengthOfStay,Cholesterol,Triglycerides,HbA1c,Smoking,Alcohol,Physical Activity,Diet Score,Family History,Stress Level,Sleep Hours,noise_col
count,25500.0,25500.0,25500.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,54.616784,123.622179,140.455337,28.476155,94.954992,4.414133,213.033891,176.837375,6.294377,0.279767,0.238533,3.803214,4.029654,0.439433,5.917312,6.229836,-0.51701
std,15.973618,41.576943,21.077933,5.728804,3.736202,2.761536,33.519757,48.812827,1.320269,0.448892,0.426194,2.011729,1.817165,0.496326,2.225057,1.187669,100.076959
min,10.0,20.32,74.24,7.67,67.51,1.0,95.73,-22.48,3.28,0.0,0.0,-3.68,-1.75,0.0,-2.44,1.59,-412.169596
25%,45.0,96.28,125.14,24.59,93.0,3.0,189.5,141.28,5.33,0.0,0.0,2.35,2.77,0.0,4.37,5.41,-68.270749
50%,55.0,110.5,138.32,28.05,95.3,4.0,211.835,173.365,5.97,0.0,0.0,3.59,3.79,0.0,5.9,6.23,-0.510742
75%,66.0,136.61,153.79,31.81,97.38,5.0,235.31,208.63,6.92,1.0,0.0,5.06,5.02,1.0,7.44,7.05,66.811399
max,89.0,318.51,226.38,56.85,110.07,19.0,358.37,421.51,12.36,1.0,1.0,12.41,12.06,1.0,15.45,10.35,467.89491


**.info() summary**
* total rows: 30,000
* total columns: 20
* columns with missing values:
    * age, gender, medical condition, glucose, blood pressure each have 25,500 non-null entries, meaning 4,500 missing in each of these 5 columns
* columns with full data (no missing values):
    * 15 columns have all 30,000 non-null entries, including BMI, oxygen saturation, length of stay, cholesterol, triglycerides, HbA1c, smoking, alcohol, physical activity, diet score, family history, stress level, sleep hours, random_notes, noise_col
* Data types breakdown:
    * 13 float columns
    * 4 integer columns
    * 3 object columns
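
The figures in this summary can also be computed directly instead of being read off the `.info()` printout. The following is a minimal sketch on a small made-up frame (`toy`, with invented values) standing in for `df`; only standard pandas calls are used.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only
toy = pd.DataFrame({
    "Age": [46.0, 22.0, np.nan, 57.0],
    "Gender": ["Male", None, "Female", "Male"],
    "BMI": [28.9, 26.29, 22.53, 38.47],
    "Smoking": [0, 0, 1, 0],
})

# Missing values per column, keeping only the columns that have any
missing = toy.isna().sum()
missing = missing[missing > 0]

# Number of columns of each data type
dtype_counts = toy.dtypes.value_counts()

print(missing)
print(dtype_counts)
```

Applied to the real `df`, the same two expressions should reproduce the missing-value and data-type counts listed above.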

**.describe summary**
* sample sizes: most health variables have 30,000 observations, but age, glucose, and blood pressure each have 25,500
* typical patient profile (means):
    * age: ~54.6 years
    * glucose: ~123.6 mg/dL
    * blood pressure: ~140.5 mmHg
    * BMI: ~28.5 (overweight range)
    * oxygen saturation: ~95%
    * length of stay: ~4.4 days
    * cholesterol: ~213 mg/dL
    * triglycerides: ~177 mg/dL
    * HbA1c: ~6.29%
* lifestyle means:
    * smoking: 0.28 (≈28% smokers if binary)
    * alcohol: 0.24 (≈24% drinkers if binary)
    * physical activity: ~3.8 (units are not documented; possibly hours per week)
    * diet score: ~4.0
    * family history: 0.44 (≈44% with family history if binary)
    * sleep hours: ~6.23
* notable ranges:
    * age: 10 to 89
    * glucose: 20 to 319
    * blood pressure: 74 to 226
    * BMI: 7.7 to 56.9
    * oxygen saturation: 67.5 to 110.1 (values above 100% are not physiologically possible)
    * length of stay: 1 to 19 days
    * HbA1c: 3.28 to 12.36
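
The `.describe()` output also shows values that cannot be real measurements, such as a negative minimum for triglycerides (-22.48) and physical activity (-3.68), and a maximum oxygen saturation of 110.07%. A rough way to count such values is sketched below. The helper `count_out_of_range` and the `limits` dictionary are hypothetical, the limits are illustrative assumptions rather than clinical guidance, and the rows in `toy` are invented.

```python
import pandas as pd

# Toy rows with invented values; column names follow the raw dataset
toy = pd.DataFrame({
    "Oxygen Saturation": [96.0, 110.1, 90.3],
    "Triglycerides": [210.6, -22.5, 165.4],
    "Physical Activity": [-0.2, 8.1, 5.0],
})

# Assumed plausibility limits as (low, high); None means no upper limit
limits = {
    "Oxygen Saturation": (0, 100),
    "Triglycerides": (0, None),
    "Physical Activity": (0, None),
}

def count_out_of_range(frame, limits):
    """Return the number of values outside [low, high] for each column."""
    counts = {}
    for col, (low, high) in limits.items():
        bad = frame[col] < low
        if high is not None:
            bad = bad | (frame[col] > high)
        counts[col] = int(bad.sum())
    return pd.Series(counts)

flagged = count_out_of_range(toy, limits)
print(flagged)
```

Counting first, before deciding whether to clip, drop, or impute, keeps the cleaning decision separate from the detection step.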


## 2) Cleaning and renaming columns

In [5]:
# List the unnecessary columns to be removed
unnecessary_cols = [ "random_notes" , "noise_col" ]

# Remove columns that are not necessary
df.drop(columns=unnecessary_cols, inplace=True)

In [6]:
# Print all columns names
df.columns

Index(['Age', 'Gender', 'Medical Condition', 'Glucose', 'Blood Pressure',
       'BMI', 'Oxygen Saturation', 'LengthOfStay', 'Cholesterol',
       'Triglycerides', 'HbA1c', 'Smoking', 'Alcohol', 'Physical Activity',
       'Diet Score', 'Family History', 'Stress Level', 'Sleep Hours'],
      dtype='object')

**Renaming Columns**

To make the dataset easier to work with in Python, each column name has been converted to lowercase snake_case, the style PEP 8 recommends for Python variable names. Two columns were also given more descriptive names: `Smoking` became `smoking_status` and `Alcohol` became `alcohol_use`.

| Original Column Name     | Renamed Column Name     |
|--------------------------|--------------------------|
| Age                      | age                      |
| Gender                   | gender                   |
| Medical Condition        | medical_condition        |
| Glucose                  | glucose                  |
| Blood Pressure           | blood_pressure           |
| BMI                      | bmi                      |
| Oxygen Saturation        | oxygen_saturation        |
| LengthOfStay             | length_of_stay           |
| Cholesterol              | cholesterol              |
| Triglycerides            | triglycerides            |
| HbA1c                    | hba1c                    |
| Smoking                  | smoking_status           |
| Alcohol                  | alcohol_use              |
| Physical Activity        | physical_activity        |
| Diet Score               | diet_score               |
| Family History           | family_history           |
| Stress Level             | stress_level             |
| Sleep Hours              | sleep_hours              |

In [7]:
# Rename columns
df.rename(columns={

    "Age": "age",
    "Gender": "gender",
    "Medical Condition": "medical_condition",
    "Glucose": "glucose",
    "Blood Pressure": "blood_pressure",
    "BMI": "bmi",
    "Oxygen Saturation": "oxygen_saturation",
    "LengthOfStay": "length_of_stay",
    "Cholesterol": "cholesterol",
    "Triglycerides": "triglycerides",
    "HbA1c": "hba1c",
    "Smoking": "smoking_status",
    "Alcohol": "alcohol_use",
    "Physical Activity": "physical_activity",
    "Diet Score": "diet_score",
    "Family History": "family_history",
    "Stress Level": "stress_level",
    "Sleep Hours": "sleep_hours"

}, inplace=True)

# Print updated columns to check renaming has been done correctly (no typos)
print(df.columns)

Index(['age', 'gender', 'medical_condition', 'glucose', 'blood_pressure',
       'bmi', 'oxygen_saturation', 'length_of_stay', 'cholesterol',
       'triglycerides', 'hba1c', 'smoking_status', 'alcohol_use',
       'physical_activity', 'diet_score', 'family_history', 'stress_level',
       'sleep_hours'],
      dtype='object')
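
The mapping above was written out by hand. As an alternative, most of it could be generated from the original labels. The sketch below is one possible approach, not the method used in this notebook: `to_snake_case` and `overrides` are hypothetical names, and the overrides cover the labels the generic rule would not produce (for example, the rule turns `HbA1c` into `hb_a1c`).

```python
import re

def to_snake_case(name):
    """Convert a label such as 'LengthOfStay' or 'Blood Pressure' to snake_case."""
    name = re.sub(r"(?<=[a-z])(?=[A-Z])", "_", name.strip())  # split camelCase
    name = re.sub(r"\s+", "_", name)                          # spaces to underscores
    return name.lower()

# Names that the generic rule would not produce
overrides = {
    "HbA1c": "hba1c",
    "Smoking": "smoking_status",
    "Alcohol": "alcohol_use",
}

original = ["Age", "Medical Condition", "LengthOfStay", "BMI",
            "HbA1c", "Smoking", "Alcohol"]
mapping = {col: overrides.get(col, to_snake_case(col)) for col in original}
print(mapping)
```

The resulting dictionary can be passed to `df.rename(columns=mapping)` in the same way as the hand-written one.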


## 3) Summary of variables

| Variable Name          | Summary                                                                                    | Units                      |
|------------------------|--------------------------------------------------------------------------------------------|----------------------------|
| Age                    | The patient's age in years                                                                 | years                      |
| Gender                 | The patient's self-reported gender                                                         | none (categorical)         |
| Medical Condition      | The primary diagnosed medical condition for the patient                                    | none (categorical)         |
| Glucose                | The patient’s random (non-fasting) blood glucose level                                     | mg/dL                      |
| Blood Pressure         | The patient’s systolic blood pressure measurement                                          | mmHg                       |
| BMI                    | Body Mass Index, a calculated measure of weight relative to height                         | kg/m²                      |
| Oxygen Saturation      | The percentage of haemoglobin in the blood that is carrying oxygen (SpO₂)                  | %                          |
| Length of Stay         | Number of days the patient stayed in hospital                                              | days                       |
| Cholesterol            | Total cholesterol level in the bloodstream                                                 | mg/dL                      |
| Triglycerides          | The amount of triglycerides (fat) in the blood, recorded as a non-fasting measurement      | mg/dL                      |
| HbA1c                  | Percentage of glycated haemoglobin reflecting average blood sugar over 2–3 months          | %                          |
| Smoking Status         | Indicates whether the patient is a current smoker (binary yes/no)                          | 0 = non-smoker, 1 = smoker |
| Alcohol Use            | Indicates whether the patient consumes alcohol (binary yes/no)                             | 0 = no, 1 = yes            |
| Physical Activity      | A numerical score representing the patient’s level of physical activity                    | approx. hours/week         |
| Diet Score             | An index estimating the quality or healthiness of the patient’s diet                       | numeric score              |
| Family History         | Indicates whether the patient has a family history of chronic illness (binary yes/no)      | 0 = no, 1 = yes            |
| Stress Level           | A numerical score reflecting the patient’s self-reported stress                            | numeric scale              |
| Sleep Hours            | The average number of hours the patient sleeps per night                                   | hours/night                |

## 4) Medical information summary

**Explanations of the Medical Variables**

**Glucose**
* A measure of the concentration of glucose (sugar) circulating in the blood which is used to assess metabolic health and diabetes risk
* The glucose column records random (non-fasting) plasma glucose levels:
    * We can infer this because the mean (~123.6 mg/dL) is too high and the range (20–318 mg/dL) is too wide for fasting glucose values
    * Random plasma glucose is a measurement of the amount of glucose in the bloodstream, taken at any time of day (not specifically after fasting)
    * As food intake impacts plasma glucose levels, a random glucose test is far more variable than a fasting glucose test
* Healthy range for this variable: 4–8 mmol/L (approximately 72–144 mg/dL) for a non-diabetic random glucose
* Units: 
    * mg/dL - milligrams of glucose per decilitre of blood
    * mmol/L - millimoles of glucose per litre of blood

**Blood Pressure**
* A measure of the force exerted by blood against artery walls
* The blood pressure column records systolic blood pressure (not the diastolic value):
    * The values in the dataset (~74–226 mmHg, mean ~140 mmHg) align with systolic rather than diastolic readings
    * Systolic pressure: the pressure when your heart pushes blood out around your body
    * Diastolic pressure: the pressure in your arteries when your heart rests between beats and refills with blood
* Healthy range for this variable: 90 - 120 mmHg (systolic)
* Units: mmHg - millimetres of mercury (the standard unit for blood pressure)

**BMI**
* Body Mass Index (BMI) is an indicator of body fatness calculated from weight and height
    * Formula for BMI calculation: weight (kg) / height (m)²
    * This is used to classify patients as underweight, healthy, overweight, or obese
* The BMI column records pre-calculated BMI scores:
    * The dataset does not include height or weight columns, so BMI must be provided as an already-computed value
* Healthy range for this variable: 18.5 - 24.9
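
The formula and the classification can be sketched as follows. The dataset itself has no height or weight columns, so `bmi` is shown only to illustrate the calculation. The cut-offs at 25 and 30 for overweight and obese are the standard adult boundaries and are an addition here, since only the healthy range is stated above. The scores in the example are invented.

```python
import numpy as np
import pandas as pd

def bmi(weight_kg, height_m):
    """BMI = weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# Category boundaries: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bins = [0, 18.5, 25, 30, np.inf]
labels = ["underweight", "healthy", "overweight", "obese"]

scores = pd.Series([17.0, 22.53, 28.9, 38.47])
categories = pd.cut(scores, bins=bins, labels=labels, right=False)

print(round(bmi(70, 1.75), 1))
print(categories.tolist())
```

With `right=False` each bin includes its lower edge, so a score of exactly 25.0 is labelled overweight rather than healthy.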

**Oxygen Saturation**
* The percentage of haemoglobin in the blood that is bound to oxygen, reflecting how well oxygen is being carried through the bloodstream
* The oxygen saturation column records SpO₂ values from pulse oximetry:
    * Most values fall in the range expected of pulse oximetry readings; values above 100% (the maximum is ~110%) are physiologically impossible and more likely reflect noise in the dataset than valid measurements
* Healthy range for this variable: 95–100%
* Units: percentage (%) - proportion of haemoglobin that is oxygen-saturated

**Cholesterol**
* A measure of the total cholesterol circulating in the bloodstream
* The cholesterol column records total cholesterol measured in mg/dL:
    * The mean (~213 mg/dL) and distribution match typical total cholesterol values reported in mg/dL
* Healthy range for this variable: < 5 mmol/L (approximately 193 mg/dL)
* Units: 
    * mg/dL - milligrams of cholesterol per decilitre of blood
    * mmol/L - millimoles of cholesterol per litre of blood

**Triglycerides**
* A type of fat found in the blood
* The triglycerides column records random (non-fasting) plasma triglyceride concentration:
    * The mean (~177 mg/dL) and range are consistent with mg/dL reporting
    * We can infer that the measurement is non-fasting because the mean (~177 mg/dL) and the upper end of the range (>400 mg/dL) are too high for fasting triglyceride values
    * Random plasma triglyceride is a measurement of the amount of triglyceride in the bloodstream, taken at any time of day (not specifically after fasting)
    * As food intake impacts plasma triglyceride levels, a random triglyceride test is far more variable than a fasting triglyceride test
* Healthy range for this variable: < 2.3 mmol/L (approximately 204 mg/dL)
* Units: 
    * mg/dL - milligrams of triglycerides per decilitre of blood
    * mmol/L - millimoles of triglycerides per litre of blood
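
The glucose, cholesterol, and triglyceride columns are recorded in mg/dL, while the healthy ranges in this section are quoted in mmol/L, so comparing them requires a unit conversion. The sketch below uses commonly quoted approximate conversion factors, which are an assumption added here and are not part of the dataset. The function and dictionary names are hypothetical.

```python
# Approximate mg/dL per mmol/L for each analyte (commonly used factors
# based on molar mass; assumed here, not taken from the dataset)
MG_DL_PER_MMOL_L = {
    "glucose": 18.016,
    "cholesterol": 38.67,
    "triglycerides": 88.57,
}

def mmol_to_mgdl(value, analyte):
    """Convert a concentration from mmol/L to mg/dL."""
    return value * MG_DL_PER_MMOL_L[analyte]

def mgdl_to_mmol(value, analyte):
    """Convert a concentration from mg/dL to mmol/L."""
    return value / MG_DL_PER_MMOL_L[analyte]

# Healthy-range thresholds expressed in the dataset's units
print(round(mmol_to_mgdl(8, "glucose")))            # 144
print(round(mmol_to_mgdl(5, "cholesterol")))        # 193
print(round(mmol_to_mgdl(2.3, "triglycerides")))    # 204

# Dataset means expressed in mmol/L
print(round(mgdl_to_mmol(123.6, "glucose"), 1))     # 6.9
print(round(mgdl_to_mmol(213, "cholesterol"), 1))   # 5.5
print(round(mgdl_to_mmol(177, "triglycerides"), 1)) # 2.0
```

On these factors, the mean glucose and triglyceride values sit inside the stated healthy ranges, while the mean cholesterol sits slightly above the 5 mmol/L threshold.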

**HbA1c**
* A measure of "glycated haemoglobin" which reflects average blood glucose levels over the past 2–3 months
    * The test measures the amount of haemoglobin (the protein in red blood cells that carries oxygen) that has glucose attached to it
    * Since red blood cells have a lifespan of about 2 to 3 months, the HbA1c level reflects your average blood glucose during that time
* The HbA1c column records HbA1c values in percent (%):
    * This represents the percentage of total haemoglobin in the red blood cells that has glucose (sugar) attached to it
    * HbA1c is now reported in mmol/mol in the UK
        * However, converting between percentage and mmol/mol is not a simple rescaling (the relationship includes an offset), therefore the healthy range here is stated in the older percentage units instead of the current NHS published healthy range
* Healthy range for this variable: < 6.0%
* Units: percent (%) - percentage of haemoglobin molecules that have glucose attached
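
For readers who want the mmol/mol equivalent, the commonly quoted relationship between the two unit systems can be sketched as below. The equation is an addition here, not something supplied with the dataset, and the function names are hypothetical.

```python
def hba1c_percent_to_mmol_mol(percent):
    """Convert HbA1c from percent to mmol/mol using the widely quoted
    relationship mmol/mol = 10.929 * (percent - 2.15)."""
    return 10.929 * (percent - 2.15)

def hba1c_mmol_mol_to_percent(mmol_mol):
    """Inverse of the conversion above."""
    return mmol_mol / 10.929 + 2.15

print(round(hba1c_percent_to_mmol_mol(6.0)))   # 42
print(round(hba1c_percent_to_mmol_mol(6.29)))  # 45
```

Because of the 2.15 offset, a percentage cannot be converted by multiplying by a single constant.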

In [8]:
# Count how many patients have each medical condition
df["medical_condition"].value_counts()

medical_condition
Hypertension    7120
Diabetes        6417
Obesity         3857
Healthy         3039
Asthma          2037
Arthritis       1796
Cancer          1234
Name: count, dtype: int64
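
The counts are easier to compare as percentages. The sketch below rebuilds the counts from the output above as a plain Series, so it runs without the dataset.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "Hypertension": 7120,
    "Diabetes": 6417,
    "Obesity": 3857,
    "Healthy": 3039,
    "Asthma": 2037,
    "Arthritis": 1796,
    "Cancer": 1234,
})

total = int(counts.sum())
share = (counts / total * 100).round(1)

print(total)
print(share)
```

The total is 25,500, which matches the non-null count for this column, because `value_counts()` leaves out missing values by default. On the real data, `df["medical_condition"].value_counts(normalize=True)` gives the same proportions directly.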

**Medical Conditions**
* Hypertension: chronically elevated blood pressure, increasing risk of heart disease and stroke
* Diabetes: a condition where the body cannot regulate blood glucose properly (insulin-related)
* Obesity: excess body fat that increases risk of metabolic and cardiovascular diseases
* Healthy: no significant chronic medical conditions recorded
* Asthma: chronic airway inflammation causing wheezing, breathlessness, and flare-ups
* Arthritis: joint inflammation leading to pain, stiffness, and reduced mobility
* Cancer: uncontrolled cell growth forming malignant tumours or spreading in the body

## 5) Potential research question and plan summary

**Potential research question:**

How do demographic, lifestyle, and medical risk factors individually and jointly influence the likelihood of developing major health conditions within a synthetic dataset, and how can these relationships be used to prototype potential predictive tools for both patients and clinicians?


**Summary of analysis steps:**
1) Clean the dataset
    * Rename columns, delete “noise” columns, handle missing values etc.
2) Describe the dataset population
    * Summarise demographics, lifestyle factors, medical markers, and disease prevalence using means, standard deviations, confidence intervals, and simple comparisons. Create visualisations to check whether the data are suitable for the later statistical tests.
3) Analyse how lifestyle factors influence medical risk markers
    * Use correlations and multiple linear regressions to see how lifestyle behaviours relate to BMI, blood pressure, cholesterol, glucose, and other medical variables.
4) Identify which demographic and lifestyle factors predict each disease
    * Fit logistic regressions using age, gender, and lifestyle variables to estimate disease risk.
5) Examine how disease risk changes across age
    * Use the lifestyle-based logistic models to estimate and plot how predicted disease probabilities change as individuals get older.
6) Identify which medical risk factors predict each disease
    * Fit logistic regressions that include medical markers to build richer models and compare these to the lifestyle-only models.
7) Compare predictive power of lifestyle-only models and medical-inclusive models
    * Assess how adding medical test results changes coefficient sizes, predicted probabilities, and overall model performance.
8) Evaluate the overall pathway from lifestyle to medical markers to disease
    * Bring together findings to assess whether the dataset supports a plausible chain of relationships.
9) Build a patient-facing Shiny app
    * Use the lifestyle-based models to create an interface where users input demographic and lifestyle information to obtain personalised risk predictions.
10) Build a clinician-facing Shiny app
    * Use the medical-inclusive models to allow clinicians to enter any known demographic and lifestyle information, together with medical test results, for more precise risk estimates.
11) Discuss limitations and validity
    * Address synthetic data issues, the cross-sectional design, model assumptions, and the educational nature of the tool
