# **Healthcare Data Analysis Project**

## Introduction
This project explores and analyzes a synthetic healthcare dataset. Designed for educational and research purposes, the dataset mimics real-world health data and offers rich ground for practicing data manipulation, analysis, and predictive modeling in the healthcare domain.

## Project Objectives
- **Data Cleaning and Preprocessing:** Address data quality issues and prepare the dataset for analysis.
- **Exploratory Data Analysis:** Gain insights into the data by exploring distributions, patterns, and relationships.
- **Predictive Modeling:** Develop a model to predict patient outcomes or other relevant targets from the dataset.
- **Data Visualization:** Create visualizations to effectively communicate findings.

# 1. Data Cleaning and Preprocessing:

- Importing Libraries

In [3]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

- Data Loading

In [4]:
df=pd.read_csv("healthcare_dataset_bruité.csv")

- Initial Data Exploration

In [6]:
#First rows
df.head()

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Tiffany Ramirez,81.0,Female,O-,Diabetes,2022-11-17,Patrick Parker,Wallace-Hamilton,Medicare,37490.983364,,Elective,2022-12-01,Aspirin,Inconclusive
1,"Ruben Burns,35,Male,O+,Asthma,2023-06-01,Diane...",,,,,,,,,,,,,,
2,Chad Byrd,61.0,Male,B-,Obesity,2019-01-09,Paul Baker,Walton LLC,Medicare,36874.896997,292.0,Emergency,2019-02-08,Lipitor,Normal
3,Antonio Frederick,49.0,Male,B-,Asthma,2020-05-02,Brian Chandler,Garcia Ltd,Medicare,23303.322092,480.0,Urgent,2020-05-03,Penicillin,Abnormal
4,"Mrs. Brandy Flowers,51,Male,O-,Arthritis,2021-...",,,,,,,,,,,,,,


In [16]:
#Basic information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                10000 non-null  object 
 1   Age                 6707 non-null   float64
 2   Gender              6707 non-null   object 
 3   Blood Type          6707 non-null   object 
 4   Medical Condition   6707 non-null   object 
 5   Date of Admission   6707 non-null   object 
 6   Doctor              6707 non-null   object 
 7   Hospital            6707 non-null   object 
 8   Insurance Provider  6707 non-null   object 
 9   Billing Amount      6707 non-null   float64
 10  Room Number         6704 non-null   float64
 11  Admission Type      6707 non-null   object 
 12  Discharge Date      6707 non-null   object 
 13  Medication          6707 non-null   object 
 14  Test Results        6707 non-null   object 
dtypes: float64(3), object(12)
memory usage: 1.1+ MB


In [8]:
#Descriptive statistics
df.describe()

Unnamed: 0,Age,Billing Amount,Room Number
count,6707.0,6707.0,6704.0
mean,51.313702,25521.369023,300.208234
std,19.593787,14054.26012,116.141056
min,18.0,1000.180837,101.0
25%,35.0,13660.212091,199.0
50%,52.0,25268.536976,299.0
75%,68.0,37724.630016,401.0
max,85.0,49995.902283,500.0


In [17]:
#Count of Unique Values in Each Column
df.nunique()

Name                  9696
Age                     68
Gender                   2
Blood Type               8
Medical Condition        6
Date of Admission     1772
Doctor                6438
Hospital              5346
Insurance Provider       5
Billing Amount        6707
Room Number            400
Admission Type           3
Discharge Date        1792
Medication               5
Test Results             3
dtype: int64

- Data Cleaning

In [19]:
df.isnull().sum().sort_values(ascending=False)

Room Number           3296
Age                   3293
Gender                3293
Blood Type            3293
Medical Condition     3293
Date of Admission     3293
Doctor                3293
Hospital              3293
Insurance Provider    3293
Billing Amount        3293
Admission Type        3293
Discharge Date        3293
Medication            3293
Test Results          3293
Name                     0
dtype: int64

> *Every column except `Name` is missing in 3,293 rows (3,296 for `Room Number`), so these rows are dropped.*

In [20]:
df=df.dropna(subset=['Test Results'])
df.isnull().sum().sort_values(ascending=False)

Room Number           3
Name                  0
Age                   0
Gender                0
Blood Type            0
Medical Condition     0
Date of Admission     0
Doctor                0
Hospital              0
Insurance Provider    0
Billing Amount        0
Admission Type        0
Discharge Date        0
Medication            0
Test Results          0
dtype: int64

> *All NaN values have been dropped, except for three missing room numbers.*
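Dropping these rows discards 3,293 of the 10,000 records. The `df.head()` output suggests that, in the affected rows, the whole record was read into the `Name` column as a single comma-separated string. If that holds for every such row, and no field itself contains a comma, the records could be recovered instead of dropped. The sketch below is a hypothetical alternative, not what this notebook does; `recover_collapsed_rows` and the small `demo` frame are illustrative names, and both assumptions would need to be checked against the real file.

```python
import pandas as pd


def recover_collapsed_rows(df: pd.DataFrame, first_col: str = "Name") -> pd.DataFrame:
    """Split rows whose whole record was read into `first_col`.

    Assumption (not verified here): a collapsed row holds every field,
    comma-separated and in column order, and no field contains a comma.
    """
    other_cols = [c for c in df.columns if c != first_col]
    # A row counts as collapsed when every column except `first_col` is missing.
    collapsed = df[other_cols].isna().all(axis=1)

    records, index = [], []
    for idx, raw in df.loc[collapsed, first_col].items():
        fields = str(raw).split(",")
        # Only keep rows that split into exactly one value per column.
        if len(fields) == len(df.columns):
            records.append(fields)
            index.append(idx)

    recovered = pd.DataFrame(records, columns=df.columns, index=index)
    # Recovered values are strings; restore numbers where the column was numeric.
    for col in df.select_dtypes("number").columns:
        recovered[col] = pd.to_numeric(recovered[col], errors="coerce")

    return pd.concat([df.loc[~collapsed], recovered]).sort_index()


# Tiny illustrative frame (made-up values, not taken from the dataset).
demo = pd.DataFrame({
    "Name": ["Ann Lee", "Bob Ray,35,Male"],
    "Age": [81.0, None],
    "Gender": ["Female", None],
})
fixed = recover_collapsed_rows(demo)
print(fixed)
```

Rows that do not split into the expected number of fields are left out of the result, so comparing `len(fixed)` with `len(df)` shows how many records could not be recovered this way.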

In [27]:
# First attempt, kept for reference: it caused a problem when I tried to run it.
# most_frequent_room = df['Room Number'].mode()[0]
# df["Room Number"].fillna(most_frequent_room, inplace=True)
# df.isnull().sum().sort_values(ascending=False)

most_frequent_room = df['Room Number'].mode()[0]
df.loc[df['Room Number'].isnull(), 'Room Number'] = most_frequent_room

df['Room Number'].isnull().sum()

0

> *Instead of replacing the missing 'Room Number' values with the mean, we replace them with the most frequent room number, since room numbers are whole-number identifiers (stored here as float64) and a mean would not generally be a valid room.*
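The same mode-based fill can also be written as a plain column assignment. In recent pandas versions, calling `fillna(..., inplace=True)` on a selected column may emit a warning or leave the DataFrame unchanged, which is possibly the problem met in the commented-out attempt. The sketch below uses a small made-up `demo` frame; the final cast to an integer type is an optional extra, not a step taken in this notebook.

```python
import pandas as pd

# Made-up room numbers with one missing value.
demo = pd.DataFrame({"Room Number": [101.0, 205.0, 205.0, None, 330.0]})

# mode() ignores missing values by default; [0] picks the first mode if there are ties.
most_frequent_room = demo["Room Number"].mode()[0]

# Assign the filled column back instead of modifying it in place.
demo["Room Number"] = demo["Room Number"].fillna(most_frequent_room)

# With no missing values left, the column can be stored as integers.
demo["Room Number"] = demo["Room Number"].astype("int64")

print(demo["Room Number"].tolist())
```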

Now that there are no more missing values, we move on to the second part of data cleaning, which consists of removing outliers.

In [42]:
df.describe()

Unnamed: 0,Age,Billing Amount,Room Number
count,6707.0,6707.0,6707.0
mean,51.313702,25521.369023,300.188907
std,19.593787,14054.26012,116.11867
min,18.0,1000.180837,101.0
25%,35.0,13660.212091,199.0
50%,52.0,25268.536976,299.0
75%,68.0,37724.630016,401.0
max,85.0,49995.902283,500.0


First, we look at the quantitative variables and check that the minimum and maximum of each contain no outliers.
None are found.
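The visual check of the minimum and maximum can be backed by a numeric rule. A common convention, used here as an assumption rather than as this notebook's method, flags values outside `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`. For `Age`, the quartiles reported by `df.describe()` are 35 and 68, so the fences are 35 - 49.5 = -14.5 and 68 + 49.5 = 117.5, and both the minimum (18) and the maximum (85) fall inside. The helper `iqr_outlier_counts` and the `demo` values below are illustrative.

```python
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the DataFrame's columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Made-up values: one implausible age (300), billing amounts all in range.
demo = pd.DataFrame({
    "Age": [18, 35, 52, 68, 85, 300],
    "Billing Amount": [1000.0, 13660.0, 25268.0, 37724.0, 49995.0, 26000.0],
})
counts = iqr_outlier_counts(demo)
print(counts)
```

Calling `iqr_outlier_counts(df)` on the cleaned DataFrame would return one count per numeric column; a column of zeros would agree with the conclusion above.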

In [49]:
# Summary of the object-type (text and categorical) columns
# Ideas to explore later: group by doctor (max), value counts per doctor (appearances)
df.describe(include="O")

Unnamed: 0,Name,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Admission Type,Discharge Date,Medication,Test Results
count,6707,6707,6707,6707,6707,6707,6707,6707,6707,6707,6707,6707
unique,6403,2,8,6,1772,6438,5346,5,3,1792,5,3
top,Michael Johnson,Female,AB+,Hypertension,2022-10-01,Michael Johnson,Smith PLC,Aetna,Urgent,2020-08-08,Penicillin,Abnormal
freq,6,3435,864,1151,13,6,19,1382,2285,11,1405,2280


Next, we look at the object-type columns.
In particular, we check the "unique" row to verify that no column has more distinct values than is possible.
We observe no anomalous values.
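The check on the "unique" row can be made explicit by comparing each categorical column against the set of values it is allowed to take. The sketch below is an illustration under stated assumptions: the allowed sets in `EXPECTED` are hypothetical (the eight ABO/Rh blood types, and the two gender labels seen in the displayed rows), and `unexpected_categories` is a helper name introduced here.

```python
import pandas as pd

# Assumed domains for two columns; extend with other columns as needed.
EXPECTED = {
    "Gender": {"Female", "Male"},
    "Blood Type": {"A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"},
}


def unexpected_categories(df: pd.DataFrame, expected: dict) -> dict:
    """Return, per column, the observed values that are not in the allowed set."""
    report = {}
    for col, allowed in expected.items():
        extra = set(df[col].dropna().unique()) - allowed
        if extra:
            report[col] = extra
    return report


# Made-up rows: "male" differs from the allowed label "Male".
demo = pd.DataFrame({
    "Gender": ["Female", "Male", "male"],
    "Blood Type": ["O-", "B-", "AB+"],
})
report = unexpected_categories(demo, EXPECTED)
print(report)
```

An empty dictionary means every observed value belongs to its allowed set, which is a stricter test than comparing the number of unique values alone.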

Unnamed: 0,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,Tiffany Ramirez,81.0,Female,O-,Diabetes,2022-11-17,Patrick Parker,Wallace-Hamilton,Medicare,37490.983364,257.0,Elective,2022-12-01,Aspirin,Inconclusive
2,Chad Byrd,61.0,Male,B-,Obesity,2019-01-09,Paul Baker,Walton LLC,Medicare,36874.896997,292.0,Emergency,2019-02-08,Lipitor,Normal
3,Antonio Frederick,49.0,Male,B-,Asthma,2020-05-02,Brian Chandler,Garcia Ltd,Medicare,23303.322092,480.0,Urgent,2020-05-03,Penicillin,Abnormal
5,Patrick Parker,41.0,Male,AB+,Arthritis,2020-08-20,Robin Green,Boyd PLC,Aetna,22522.363385,180.0,Urgent,2020-08-23,Aspirin,Abnormal
7,Patty Norman,55.0,Female,O-,Arthritis,2019-05-16,Brian Kennedy,Brown Inc,Blue Cross,13546.817249,384.0,Elective,2019-06-02,Aspirin,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9993,Michael Munoz,39.0,Male,O-,Hypertension,2023-10-09,Stephen Evans,Moran Ltd,Cigna,12379.134624,380.0,Urgent,2023-10-20,Lipitor,Normal
9994,Jorge Obrien,69.0,Male,A+,Diabetes,2021-12-25,Frank Miller,Scott LLC,UnitedHealthcare,16793.598395,341.0,Elective,2022-01-06,Penicillin,Inconclusive
9996,Stephanie Evans,47.0,Female,AB+,Arthritis,2022-01-06,Christopher Yates,Nash-Krueger,Blue Cross,5995.717488,244.0,Emergency,2022-01-29,Ibuprofen,Normal
9997,Christopher Martinez,54.0,Male,B-,Arthritis,2022-07-01,Robert Nicholson,Larson and Sons,Blue Cross,49559.202905,312.0,Elective,2022-07-15,Ibuprofen,Normal
