## Healthcare Dataset
This project explores a synthetic healthcare dataset to see what story it tells about its simulated patient population.

The data is available on Kaggle:
[Healthcare Dataset](https://www.kaggle.com/datasets/prasad22/healthcare-dataset)

### Virtual Environment

| Command    | Linux/Mac | Windows (Git Bash) |
| ---------- | --------- | ------------------ |
| Create     | `python3 -m venv venv` | `python -m venv venv` |
| Activate   | `source venv/bin/activate` | `source venv/Scripts/activate` |
| Install    | `pip install -r requirements.txt` (or `pip install <package>` for individual packages) | `pip install -r requirements.txt` (or `pip install <package>` for individual packages) |
| Deactivate | `deactivate` | `deactivate` |

In [35]:
# Imports used throughout the project
import pandas as pd
import locale

In [21]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("prasad22/healthcare-dataset")

print("Path to dataset files:", path)

Path to dataset files: /Users/luckyc/.cache/kagglehub/datasets/prasad22/healthcare-dataset/versions/2
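The download step returns a directory, while the read step further down uses a bare file name. A minimal sketch of one way to connect the two, assuming the CSV is called `healthcare_dataset.csv` (the name used in the read step) and sits somewhere under the returned directory. `find_csv` is a hypothetical helper, not part of kagglehub.

```python
from pathlib import Path


def find_csv(dataset_dir, filename="healthcare_dataset.csv"):
    """Return the first file named `filename` under `dataset_dir` (hypothetical helper)."""
    # rglob searches the directory and all of its subdirectories
    matches = sorted(Path(dataset_dir).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {dataset_dir}")
    return matches[0]
```

With this in place, the dataset could be loaded with `pd.read_csv(find_csv(path))` instead of relying on a copy of the file in the working directory.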


### Healthcare Dataset
1. Read in dataset
2. Review the dataset
3. Clean the dataset

In [22]:
# Read in dataset
health_data = pd.read_csv("healthcare_dataset.csv")

In [23]:
# Inspect column dtypes and non-null counts: a quick snapshot of the data that also reveals missing values
health_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55500 entries, 0 to 55499
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                55500 non-null  object 
 1   Age                 55500 non-null  int64  
 2   Gender              55500 non-null  object 
 3   Blood Type          55500 non-null  object 
 4   Medical Condition   55500 non-null  object 
 5   Date of Admission   55500 non-null  object 
 6   Doctor              55500 non-null  object 
 7   Hospital            55500 non-null  object 
 8   Insurance Provider  55500 non-null  object 
 9   Billing Amount      55500 non-null  float64
 10  Room Number         55500 non-null  int64  
 11  Admission Type      55500 non-null  object 
 12  Discharge Date      55500 non-null  object 
 13  Medication          55500 non-null  object 
 14  Test Results        55500 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 6.4+ MB
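The `info()` output shows no missing values, and it also shows both date columns stored as `object` (plain strings). A small sketch of an explicit null count and a date conversion, run on a tiny stand-in frame rather than the real data. The ISO `YYYY-MM-DD` format is an assumption based on the sample rows shown in this notebook.

```python
import pandas as pd

# Tiny stand-in frame using the same column names as the real dataset
sample = pd.DataFrame({
    "Date of Admission": ["2024-01-31", "2019-08-20"],
    "Discharge Date": ["2024-02-02", "2019-08-26"],
    "Billing Amount": [18856.281306, None],  # one missing value, for illustration only
})

# Count missing values per column
null_counts = sample.isna().sum()

# Convert the date strings to datetime64 so they can be compared and subtracted
for col in ["Date of Admission", "Discharge Date"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d")

print(null_counts)
print(sample.dtypes)
```

The same two steps should apply to `health_data` unchanged if the real columns follow the format seen in the first rows.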

In [24]:
# Summary statistics for the numeric columns
health_data.describe()

|       | Age | Billing Amount | Room Number |
| ----- | --- | -------------- | ----------- |
| count | 55500.0 | 55500.0 | 55500.0 |
| mean  | 51.539459 | 25539.316097 | 301.134829 |
| std   | 19.602454 | 14211.454431 | 115.243069 |
| min   | 13.0 | -2008.49214 | 101.0 |
| 25%   | 35.0 | 13241.224652 | 202.0 |
| 50%   | 52.0 | 25538.069376 | 302.0 |
| 75%   | 68.0 | 37820.508436 | 401.0 |
| max   | 89.0 | 52764.276736 | 500.0 |
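The summary statistics report a minimum billing amount of about -2008.49, so at least one row has a negative bill. Whether negative amounts are refunds or generation artifacts is not stated in the source, so the sketch below only counts and separates them. The three values are illustrative, taken from the outputs in this notebook.

```python
import pandas as pd

# Illustrative values only: two from the first rows, one matching the reported minimum
billing = pd.Series([18856.28, -2008.49, 33643.33], name="Billing Amount")

# Boolean mask that is True wherever the amount is below zero
negative_mask = billing < 0
print(f"{negative_mask.sum()} of {len(billing)} rows have a negative billing amount")

# Keep only the non-negative amounts; dropping them is one option, not the only one
non_negative = billing[~negative_mask]
```

On the real frame the same mask would be `health_data['Billing Amount'] < 0`.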


In [25]:
# Checking rows and column count
health_data.shape

(55500, 15)

In [26]:
# Checking the first few rows
health_data.head()

|   | Name | Age | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Billing Amount | Room Number | Admission Type | Discharge Date | Medication | Test Results |
| - | ---- | --- | ------ | ---------- | ----------------- | ----------------- | ------ | -------- | ------------------ | -------------- | ----------- | -------------- | -------------- | ---------- | ------------ |
| 0 | Bobby JacksOn | 30 | Male | B- | Cancer | 2024-01-31 | Matthew Smith | Sons and Miller | Blue Cross | 18856.281306 | 328 | Urgent | 2024-02-02 | Paracetamol | Normal |
| 1 | LesLie TErRy | 62 | Male | A+ | Obesity | 2019-08-20 | Samantha Davies | Kim Inc | Medicare | 33643.327287 | 265 | Emergency | 2019-08-26 | Ibuprofen | Inconclusive |
| 2 | DaNnY sMitH | 76 | Female | A- | Obesity | 2022-09-22 | Tiffany Mitchell | Cook PLC | Aetna | 27955.096079 | 205 | Emergency | 2022-10-07 | Aspirin | Normal |
| 3 | andrEw waTtS | 28 | Female | O+ | Diabetes | 2020-11-18 | Kevin Wells | Hernandez Rogers and Vang, | Medicare | 37909.78241 | 450 | Elective | 2020-12-18 | Ibuprofen | Abnormal |
| 4 | adrIENNE bEll | 43 | Female | AB+ | Cancer | 2022-09-19 | Kathleen Hanna | White-White | Aetna | 14238.317814 | 458 | Urgent | 2022-10-09 | Penicillin | Abnormal |


In [27]:
# The dataset is synthetic, so the names themselves carry no meaning; map each unique name to an integer ID instead

unique_names = health_data['Name'].unique()
name_to_int = {name: i for i, name in enumerate(unique_names)}
health_data['name_int'] = health_data['Name'].map(name_to_int)
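The names in the first rows have inconsistent casing (for example `Bobby JacksOn`), and a mapping built on the raw strings treats each spelling as a different person. A possible alternative, assuming that spellings differing only in case or surrounding whitespace refer to the same patient, is to normalize first and let `pd.factorize` assign the codes. The third name below is a hypothetical variant added for illustration.

```python
import pandas as pd

# The last entry is a made-up casing variant of the first, for illustration
names = pd.Series(["Bobby JacksOn", "LesLie TErRy", "bobby jackson"], name="Name")

# Normalize whitespace and casing so spelling variants collapse to one value
normalised = names.str.strip().str.title()

# factorize assigns integer codes in order of first appearance
codes, uniques = pd.factorize(normalised)
print(codes)
print(list(uniques))
```

Without the normalization step, `pd.factorize(names)` gives the same numbering as the dictionary-based mapping, since both number names in order of first appearance.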

In [28]:
# Checking first few rows
health_data.head()

|   | Name | Age | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Billing Amount | Room Number | Admission Type | Discharge Date | Medication | Test Results | name_int |
| - | ---- | --- | ------ | ---------- | ----------------- | ----------------- | ------ | -------- | ------------------ | -------------- | ----------- | -------------- | -------------- | ---------- | ------------ | -------- |
| 0 | Bobby JacksOn | 30 | Male | B- | Cancer | 2024-01-31 | Matthew Smith | Sons and Miller | Blue Cross | 18856.281306 | 328 | Urgent | 2024-02-02 | Paracetamol | Normal | 0 |
| 1 | LesLie TErRy | 62 | Male | A+ | Obesity | 2019-08-20 | Samantha Davies | Kim Inc | Medicare | 33643.327287 | 265 | Emergency | 2019-08-26 | Ibuprofen | Inconclusive | 1 |
| 2 | DaNnY sMitH | 76 | Female | A- | Obesity | 2022-09-22 | Tiffany Mitchell | Cook PLC | Aetna | 27955.096079 | 205 | Emergency | 2022-10-07 | Aspirin | Normal | 2 |
| 3 | andrEw waTtS | 28 | Female | O+ | Diabetes | 2020-11-18 | Kevin Wells | Hernandez Rogers and Vang, | Medicare | 37909.78241 | 450 | Elective | 2020-12-18 | Ibuprofen | Abnormal | 3 |
| 4 | adrIENNE bEll | 43 | Female | AB+ | Cancer | 2022-09-19 | Kathleen Hanna | White-White | Aetna | 14238.317814 | 458 | Urgent | 2022-10-09 | Penicillin | Abnormal | 4 |


In [29]:
# Drop columns that are not needed for the analysis
cols_to_drop = ['Name', 'Room Number']

health_data_clean = health_data.drop(columns=cols_to_drop)

In [30]:
# view first few rows
health_data_clean.head()

|   | Age | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Billing Amount | Admission Type | Discharge Date | Medication | Test Results | name_int |
| - | --- | ------ | ---------- | ----------------- | ----------------- | ------ | -------- | ------------------ | -------------- | -------------- | -------------- | ---------- | ------------ | -------- |
| 0 | 30 | Male | B- | Cancer | 2024-01-31 | Matthew Smith | Sons and Miller | Blue Cross | 18856.281306 | Urgent | 2024-02-02 | Paracetamol | Normal | 0 |
| 1 | 62 | Male | A+ | Obesity | 2019-08-20 | Samantha Davies | Kim Inc | Medicare | 33643.327287 | Emergency | 2019-08-26 | Ibuprofen | Inconclusive | 1 |
| 2 | 76 | Female | A- | Obesity | 2022-09-22 | Tiffany Mitchell | Cook PLC | Aetna | 27955.096079 | Emergency | 2022-10-07 | Aspirin | Normal | 2 |
| 3 | 28 | Female | O+ | Diabetes | 2020-11-18 | Kevin Wells | Hernandez Rogers and Vang, | Medicare | 37909.78241 | Elective | 2020-12-18 | Ibuprofen | Abnormal | 3 |
| 4 | 43 | Female | AB+ | Cancer | 2022-09-19 | Kathleen Hanna | White-White | Aetna | 14238.317814 | Urgent | 2022-10-09 | Penicillin | Abnormal | 4 |


In [31]:
# Move the name_int column to the front:
# put 'name_int' first, followed by all other columns in their existing order
health_data_clean = health_data_clean[['name_int'] + [col for col in health_data_clean.columns if col != 'name_int']]
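Reordering with a list comprehension works; a shorter alternative is the `pop`/`insert` pair, which moves the column in place. A minimal sketch on a stand-in frame with made-up rows, using the same `name_int` column name:

```python
import pandas as pd

# Stand-in frame with name_int as the last column
df = pd.DataFrame({"Age": [30, 62], "Gender": ["Male", "Male"], "name_int": [0, 1]})

# pop removes the column and returns it; insert places it back at position 0
df.insert(0, "name_int", df.pop("name_int"))
print(list(df.columns))
```

Unlike the list-comprehension version, this modifies the existing frame rather than building a new one.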


In [41]:
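# Round Billing Amount to two decimal places
health_data_clean['Billing Amount'] = health_data_clean['Billing Amount'].round(2)

# Set a US English locale for currency formatting. 'en_US.UTF-8' usually works on
# macOS/Linux; setlocale raises locale.Error if the locale is not installed.
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')

# Add a display-only string column such as "$18,856.28"; keep the numeric column for calculations
health_data_clean['Billing Amount (Formatted)'] = health_data_clean['Billing Amount'].apply(
    lambda x: locale.currency(x, grouping=True))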
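Locale names differ between operating systems, so `locale.setlocale` can fail where `en_US.UTF-8` is not installed. A locale-independent sketch using plain string formatting is shown here as an alternative. `format_usd` is a hypothetical helper, and placing the minus sign before the dollar sign is a choice made for this example, not something taken from the `locale` module.

```python
def format_usd(amount):
    """Format a number as US dollars, e.g. 18856.28 -> '$18,856.28' (hypothetical helper)."""
    # Round first so that tiny negative values do not print as "-$0.00"
    amount = round(amount, 2)
    sign = "-" if amount < 0 else ""
    # ',' adds thousands separators and '.2f' fixes two decimal places
    return f"{sign}${abs(amount):,.2f}"
```

It could be applied the same way as the lambda above: `health_data_clean['Billing Amount'].apply(format_usd)`.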


In [42]:
# viewing changes
health_data_clean.head()

|   | name_int | Age | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Billing Amount | Admission Type | Discharge Date | Medication | Test Results | Billing Amount (Formatted) |
| - | -------- | --- | ------ | ---------- | ----------------- | ----------------- | ------ | -------- | ------------------ | -------------- | -------------- | -------------- | ---------- | ------------ | -------------------------- |
| 0 | 0 | 30 | Male | B- | Cancer | 2024-01-31 | Matthew Smith | Sons and Miller | Blue Cross | 18856.28 | Urgent | 2024-02-02 | Paracetamol | Normal | $18,856.28 |
| 1 | 1 | 62 | Male | A+ | Obesity | 2019-08-20 | Samantha Davies | Kim Inc | Medicare | 33643.33 | Emergency | 2019-08-26 | Ibuprofen | Inconclusive | $33,643.33 |
| 2 | 2 | 76 | Female | A- | Obesity | 2022-09-22 | Tiffany Mitchell | Cook PLC | Aetna | 27955.1 | Emergency | 2022-10-07 | Aspirin | Normal | $27,955.10 |
| 3 | 3 | 28 | Female | O+ | Diabetes | 2020-11-18 | Kevin Wells | Hernandez Rogers and Vang, | Medicare | 37909.78 | Elective | 2020-12-18 | Ibuprofen | Abnormal | $37,909.78 |
| 4 | 4 | 43 | Female | AB+ | Cancer | 2022-09-19 | Kathleen Hanna | White-White | Aetna | 14238.32 | Urgent | 2022-10-09 | Penicillin | Abnormal | $14,238.32 |


In [43]:
print(health_data_clean['Billing Amount (Formatted)'].head())


0    $18,856.28
1    $33,643.33
2    $27,955.10
3    $37,909.78
4    $14,238.32
Name: Billing Amount (Formatted), dtype: object
