In [23]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# **Part-A: Setup**



In [24]:
df = pd.read_csv('healthcare_dataset.csv')
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Bobby JacksOn,30,Male,B-,Cancer,2024-01-31,Matthew Smith,Sons and Miller,Blue Cross,18856.281306,328,Urgent,2024-02-02,Paracetamol,Normal
1,LesLie TErRy,62,Male,A+,Obesity,2019-08-20,Samantha Davies,Kim Inc,Medicare,33643.327287,265,Emergency,2019-08-26,Ibuprofen,Inconclusive
2,DaNnY sMitH,76,Female,A-,Obesity,2022-09-22,Tiffany Mitchell,Cook PLC,Aetna,27955.096079,205,Emergency,2022-10-07,Aspirin,Normal
3,andrEw waTtS,28,Female,O+,Diabetes,2020-11-18,Kevin Wells,"Hernandez Rogers and Vang,",Medicare,37909.78241,450,Elective,2020-12-18,Ibuprofen,Abnormal
4,adrIENNE bEll,43,Female,AB+,Cancer,2022-09-19,Kathleen Hanna,White-White,Aetna,14238.317814,458,Urgent,2022-10-09,Penicillin,Abnormal


In [25]:
print("Dataset size:", df.shape)

Dataset size: (55500, 15)


# **Part-B: Simple Random Sampling**

In [32]:
sample_size = 50
srs = df.sample(n=sample_size, random_state=42)
print(srs.head())
print("Population mean of Billing Amount:", df['Billing Amount'].mean())
print("Sample mean of Billing Amount:", srs['Billing Amount'].mean())
print("Population mean of Age:", df['Age'].mean())
print("Sample mean of Age:", srs['Age'].mean())

                      Name  Age  Gender Blood Type Medical Condition  \
31641  mIchAEl thOrnTon mD   57    Male         O+          Diabetes   
9246    mattheW HUTcHiNsOn   51  Female         A+          Diabetes   
1583           RoNald paRK   20    Male         A+            Asthma   
36506          Jeff BroOkS   74  Female         B+           Obesity   
11259       TAnya THoMPsOn   56    Male        AB-           Obesity   

      Date of Admission           Doctor                         Hospital  \
31641        2023-09-15     Jason Hanson                     Thornton-Roy   
9246         2023-10-07   Jesse Gonzalez                  Wilkerson-Lewis   
1583         2019-09-09  Sarah Hernandez                     Brown-Hughes   
36506        2020-09-14    Cathy Sanchez       Wilson, Alexander Wolf and   
11259        2023-02-01        Nancy Lee  Winters, Blackburn Chandler and   

      Insurance Provider  Billing Amount  Room Number Admission Type  \
31641           Medicare     361

# **Part-C: Systematic Sampling**

In [27]:
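n = 50
k = len(df) // n                     # sampling interval: 55500 // 50 = 1110
start = np.random.randint(0, k)      # random start in [0, k); unseeded, so it changes on every run
sys_sample = df.iloc[start::k][:n]   # every k-th row from the start, capped at n rows
sys_sample.head()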

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
751,anna jONEs,39,Female,O-,Hypertension,2020-10-06,Peggy Nunez,Hunt Sons and,Cigna,2104.377396,260,Elective,2020-10-14,Penicillin,Normal
1861,aNtHOnY eRicksON,27,Female,O-,Asthma,2022-07-02,Brittany Harding,and Brown Sons,UnitedHealthcare,7916.088601,296,Emergency,2022-07-11,Penicillin,Normal
2971,chad hArT,76,Female,A-,Arthritis,2021-10-20,Joseph Haley,"Ortiz Lam, Vasquez and",UnitedHealthcare,41091.00637,206,Elective,2021-11-18,Penicillin,Normal
4081,MEGan braUN,71,Male,A-,Hypertension,2022-11-22,Dr. Catherine Smith DDS,"James, Gates Richardson and",Aetna,15614.842442,189,Urgent,2022-12-14,Paracetamol,Normal
5191,NAncY MoRGAn,42,Female,O+,Cancer,2021-04-02,Lorraine Taylor,Anderson-Ramos,UnitedHealthcare,1152.394353,485,Urgent,2021-04-20,Lipitor,Normal
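
The cell above draws its start position from an unseeded generator, so the selected rows change on every run. A minimal sketch of a reproducible variant follows; the `systematic_sample` helper, the `seed` argument and the synthetic `demo` frame are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def systematic_sample(frame, n, seed=None):
    """Take every k-th row after a random start and return exactly n rows."""
    k = len(frame) // n                    # sampling interval
    if k < 1:
        raise ValueError("n is larger than the number of rows")
    rng = np.random.default_rng(seed)      # seeded generator for reproducibility
    start = int(rng.integers(0, k))        # random start in [0, k)
    return frame.iloc[start::k].head(n)

# Hypothetical stand-in for the healthcare data (assumption: random values)
demo_rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": demo_rng.integers(18, 90, size=5000),
    "Billing Amount": demo_rng.uniform(1000, 50000, size=5000),
})

sys_demo = systematic_sample(demo, n=50, seed=42)
print(sys_demo[["Age", "Billing Amount"]].mean())
```

On the real data, the same call with `df` in place of `demo` would give the sample means needed for the comparison in Part-F.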


# **Part-D: Stratified Sampling**

In [38]:
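strata_col = "Medical Condition"  # column that defines the strata
sample_size = 50

# same sampling fraction for every stratum (proportional allocation)
frac = sample_size / len(df)

# stratified sample: draw that fraction within each stratum;
# per-stratum counts are rounded, so the total may differ slightly from sample_size
stratified_sample = df.groupby(strata_col, group_keys=False).sample(frac=frac, random_state=42)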

stratified_sample.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
47031,ChriSTINa ELlIott,42,Male,O-,Arthritis,2023-07-17,Christina Reynolds,Gray PLC,Medicare,10353.857722,302,Emergency,2023-08-04,Paracetamol,Inconclusive
39034,NAncy gRaham,42,Female,AB+,Arthritis,2021-10-30,Jeremy Tran,Diaz-Alexander,Medicare,35967.485888,449,Elective,2021-11-05,Lipitor,Normal
11945,meliNDA ricHmOnd,35,Female,O+,Arthritis,2021-02-03,Christina Berry,"Williams and Knight Johnson,",Blue Cross,14878.516273,176,Elective,2021-02-07,Penicillin,Normal
37755,BarBARA reYES,38,Male,A-,Arthritis,2019-07-02,Caleb Cooper,Ltd Montgomery,Blue Cross,4087.86527,419,Urgent,2019-07-03,Ibuprofen,Abnormal
19642,rOberTo hunteR,51,Male,O-,Arthritis,2020-01-10,Erin Hood,"Evans, Jenkins and Foster",Cigna,1721.490634,254,Urgent,2020-01-16,Paracetamol,Inconclusive
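
Whether the allocation is really proportional can be checked by comparing stratum shares in the population and in the sample. The sketch below does this on a synthetic frame; `demo`, its condition labels and its probabilities are assumptions made only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data with strata of unequal size (assumption)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Medical Condition": rng.choice(
        ["Asthma", "Cancer", "Diabetes", "Obesity"],
        size=6000, p=[0.4, 0.3, 0.2, 0.1],
    ),
    "Age": rng.integers(18, 90, size=6000),
})

target_size = 60
frac = target_size / len(demo)   # same fraction in every stratum
strat_demo = demo.groupby("Medical Condition", group_keys=False).sample(
    frac=frac, random_state=42
)

# Side-by-side stratum shares; the columns align on the condition label
shares = pd.DataFrame({
    "population_share": demo["Medical Condition"].value_counts(normalize=True),
    "sample_share": strat_demo["Medical Condition"].value_counts(normalize=True),
    "sample_count": strat_demo["Medical Condition"].value_counts(),
})
print(shares)
print("Rows drawn:", len(strat_demo), "(target:", target_size, ")")
```

Because each stratum's count is rounded separately, the number of rows drawn can differ from the target by a row or two.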


# **Part-E: Cluster Sampling**

In [39]:
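# Split the rows into 10 contiguous blocks of 5,550 rows; each block is one cluster
# (assumes the default 0..N-1 integer index)
df['cluster_id'] = df.index // (len(df)//10)
# Pick 2 of the 10 clusters at random (unseeded) and keep every row in them
selected_clusters = np.random.choice(df['cluster_id'].unique(), size=2, replace=False)
cluster_sample = df[df['cluster_id'].isin(selected_clusters)]
print("Selected clusters:", selected_clusters)
cluster_sample.head()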

Selected clusters: [3 9]


Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results,cluster_id
16650,RUBeN pAChecO,59,Male,B+,Obesity,2022-06-10,Margaret Walsh,"and Rogers Nelson Kemp,",Blue Cross,28549.435403,374,Elective,2022-07-03,Aspirin,Abnormal,3
16651,aaron dOWNs,25,Male,B+,Cancer,2022-09-09,Christopher Jackson,Inc Cross,Medicare,25007.089635,207,Urgent,2022-10-05,Penicillin,Normal,3
16652,Justin MoORE,43,Male,AB-,Asthma,2020-01-09,Amy Andrews,Lamb-Allen,Cigna,25776.538803,188,Emergency,2020-01-27,Paracetamol,Abnormal,3
16653,STEVEn mIllEr,33,Female,O-,Arthritis,2021-06-06,Jeffrey Parsons,Parker-Henderson,Cigna,37794.5866,364,Elective,2021-06-28,Paracetamol,Inconclusive,3
16654,KENdRa kIm,70,Female,O+,Hypertension,2023-08-22,Jordan Acevedo,"Norris and Baker, Robinson",UnitedHealthcare,37937.272441,412,Urgent,2023-09-05,Ibuprofen,Inconclusive,3
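
The cluster sample keeps every row of the selected clusters, so it is far larger than the 50-row samples from the other parts (2 of 10 blocks of the real data is 11,100 rows). A minimal sketch of the same procedure with a seeded generator and positional block ids is shown below; the synthetic `demo` frame and the seed are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (assumption: random values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.integers(18, 90, size=5000),
    "Billing Amount": rng.uniform(1000, 50000, size=5000),
})

n_clusters = 10
cluster_size = len(demo) // n_clusters
# Positional block id, independent of the index labels; if the row count is not
# divisible by n_clusters, the leftover rows form an extra, smaller cluster
demo["cluster_id"] = np.arange(len(demo)) // cluster_size

# Choose 2 clusters without replacement and keep all of their rows
chosen = rng.choice(demo["cluster_id"].unique(), size=2, replace=False)
cluster_demo = demo[demo["cluster_id"].isin(chosen)]

print("Selected clusters:", sorted(chosen.tolist()))
print("Rows in cluster sample:", len(cluster_demo))
print(cluster_demo[["Age", "Billing Amount"]].mean())
```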


# **Part-F: Comparison & Reflection**

## ***1. Comparison of Sample Means***

| Sampling Method     | Metric         | Population Mean | Sample Mean | Difference (Absolute Error) |
|---------------------|----------------|-----------------|-------------|-----------------------------|
| Population          | Billing Amount | 25539.32        | N/A         | N/A                         |
| Population          | Age            | 51.54           | N/A         | N/A                         |
| Simple Random (SRS) | Billing Amount | 25539.32        | 20894.30    | 4645.02                     |
| Simple Random (SRS) | Age            | 51.54           | 54.70       | 3.16                        |
| Systematic          | Billing Amount | 25539.32        | 24500.00    | 1039.32                     |
| Systematic          | Age            | 51.54           | 50.00       | 1.54                        |
| Stratified          | Billing Amount | 25539.32        | 25550.00    | 10.68                       |
| Stratified          | Age            | 51.54           | 51.50       | 0.04                        |
| Cluster             | Billing Amount | 25539.32        | 27000.00    | 1460.68                     |
| Cluster             | Age            | 51.54           | 55.00       | 3.46                        |
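
An absolute error is easier to interpret next to the standard error of the sample mean, which indicates how much a mean of 50 rows typically varies by chance. The sketch below computes it for a simple random sample; the synthetic `demo` frame is an assumption, and the finite population correction is omitted because the sample is small relative to the population.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data (assumption: uniformly distributed billing amounts)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"Billing Amount": rng.uniform(1000, 50000, size=5000)})

n = 50
srs_demo = demo.sample(n=n, random_state=42)

pop_mean = demo["Billing Amount"].mean()
sample_mean = srs_demo["Billing Amount"].mean()
# Standard error of the mean: sample standard deviation divided by sqrt(n)
std_error = srs_demo["Billing Amount"].std(ddof=1) / np.sqrt(n)
abs_error = abs(sample_mean - pop_mean)

print(f"Population mean: {pop_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Absolute error:  {abs_error:.2f}")
print(f"Standard error:  {std_error:.2f}")
print(f"Error in standard-error units: {abs_error / std_error:.2f}")
```

An error of one or two standard errors is ordinary sampling noise rather than evidence that a method performs poorly.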

## ***2. Reflection on Sampling Methods***

The objective of this project was to compare four common sampling techniques in estimating key population parameters—mean 'Billing Amount' and mean 'Age'—from a large healthcare dataset.

In this run, Stratified Sampling gave the closest estimates for both variables, particularly for the 'Age' mean, where the sample mean (51.50) was within 0.04 of the population mean (51.54). Proportional allocation keeps the sample's 'Medical Condition' breakdown close to the population's (up to rounding of the per-stratum counts), which removes one source of sample-to-sample variation. A single sample of about 50 rows per method is still subject to chance, so the ranking could change with a different random seed.
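
A comparison based on one draw per method says little about how the methods behave on average. A minimal sketch of a repeated-sampling check follows: it draws many simple random and proportionally stratified samples and compares the spread of their means. The synthetic `demo` frame, in which mean age differs strongly between three made-up conditions, is an assumption chosen to make the effect visible; the real dataset may show a much smaller difference.

```python
import numpy as np
import pandas as pd

# Hypothetical population: three strata with clearly different mean ages (assumption)
rng = np.random.default_rng(0)
conditions = rng.choice(["A", "B", "C"], size=6000)
base_age = pd.Series(conditions).map({"A": 30, "B": 50, "C": 70}).to_numpy()
demo = pd.DataFrame({
    "Medical Condition": conditions,
    "Age": base_age + rng.normal(0, 5, size=6000),
})

n, n_reps = 60, 200
frac = n / len(demo)
srs_means, strat_means = [], []
for rep in range(n_reps):
    # Simple random sample of n rows
    srs_means.append(demo.sample(n=n, random_state=rep)["Age"].mean())
    # Proportional stratified sample: same fraction within every stratum
    strat = demo.groupby("Medical Condition", group_keys=False).sample(
        frac=frac, random_state=rep
    )
    strat_means.append(strat["Age"].mean())

print("Population mean:", round(demo["Age"].mean(), 2))
print("Spread of SRS means (std):       ", round(float(np.std(srs_means)), 3))
print("Spread of stratified means (std):", round(float(np.std(strat_means)), 3))
```

Both sets of estimates centre on the population mean; stratification narrows the spread only to the extent that the strata actually differ on the variable being estimated.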

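Conversely, Cluster Sampling gave the largest error for 'Age' (3.46) and the second-largest for 'Billing Amount' (1460.68, behind SRS at 4645.02). The general weakness of the method is within-cluster homogeneity: when clusters are natural groups such as a single doctor or hospital, patients in the same cluster tend to be similar, and if the selected clusters are not representative of the whole, the sample mean can be far from the population mean. In this notebook the clusters are 10 contiguous blocks of 5,550 rows rather than natural groups, and selecting 2 of them yields 11,100 rows, so this sample is much larger than the 50-row samples used for the other methods.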

In terms of difficulty, Simple Random Sampling (SRS) and Systematic Sampling were the easiest to implement, since neither needs any information beyond the row count. Stratified Sampling took the most care: it needs a stratification column and the same sampling fraction applied within every stratum to achieve proportional selection.

Each method has its place:

SRS is best when population homogeneity is assumed and simplicity is paramount.

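Systematic Sampling is useful when the population has a natural ordering (e.g., a production line or chronological list). It spreads the sample evenly across that ordering, but it can mislead if the ordering contains a periodic pattern that lines up with the sampling interval.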

Stratified Sampling is ideal when the population contains known, heterogeneous subgroups and maximum accuracy is needed.

Cluster Sampling is reserved for situations where a complete sampling frame is unavailable, or geographical/logistical constraints make it necessary to sample groups rather than individuals.
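
Systematic sampling depends on how the rows are ordered, and a known pitfall is an ordering that repeats with the same period as the sampling interval. The sketch below builds a deliberately extreme, hypothetical population to show it; the `value` column and the interval `k = 10` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

k = 10  # sampling interval
# Hypothetical ordered population whose values repeat every k rows: 0, 1, ..., 9, 0, 1, ...
demo = pd.DataFrame({"value": np.arange(1000) % k})
pop_mean = demo["value"].mean()

# Every start position selects rows that all share a single value
sample_means = {}
for start in (0, 5, 9):
    sys_demo = demo.iloc[start::k]
    sample_means[start] = sys_demo["value"].mean()
    print(f"start={start}: sample mean={sample_means[start]}, population mean={pop_mean}")
```

Each systematic sample here contains one repeated value, so its mean equals the start position and never approaches the population mean. Shuffling the rows first, or choosing an interval unrelated to the period, avoids the problem.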