
# Data Analysis for cost of care in healthcare
_by Hafsa Laeeque_

---
The task is to<br>
  > 1) analyze the **clinical and financial data** of patients hospitalized for a _certain condition_.<br>
    2) join data given in the different tables.<br>
    3) find insights about **drivers of cost of care**.<br>
    4) document _approach, results and insights_ using [slides](https://docs.google.com/presentation/d/1-gYni51iGkYh4OCCr-BYKAV3YqUH4OzQOUAtz5MkrEg/edit?usp=sharing) and a [document](https://docs.google.com/document/d/1fQB0AP2ue_zKVUUAHx_1sJ626okOMtm_IdLOnw-9__A/edit?usp=sharing), both of which should have a similar narrative.<br>

---
The [datasets](https://github.com/hafsalaeeque/cost-in-healthcare-DS-proj/tree/master/datasets) folder contains four CSV files, which I will use for this analysis.

## Analysing the datasets
### Import packages

In [10]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [36]:
# show up to 50 columns and 4000 rows when displaying a DataFrame
pd.options.display.max_columns = 50
pd.options.display.max_rows = 4000

### (A) Load the datasets
A1 - Let's load the first dataset `bill_amount.csv`.

In [11]:
bill_amt = pd.read_csv('datasets/bill_amount.csv')

In [12]:
bill_amt.shape

(13600, 2)

In [13]:
bill_amt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13600 entries, 0 to 13599
Data columns (total 2 columns):
bill_id    13600 non-null int64
amount     13600 non-null float64
dtypes: float64(1), int64(1)
memory usage: 212.6 KB


In [14]:
bill_amt.head()

Unnamed: 0,bill_id,amount
0,40315104,1552.63483
1,2660045161,1032.011951
2,1148334643,6469.605351
3,3818426276,755.965425
4,9833541918,897.347816


In [15]:
bill_amt.bill_id.nunique()

13600

The first file, `bill_amount.csv`, does not have any null values. Each row gives the amount charged for one unique bill.

---
A2 - Let's load the second dataset `bill_id.csv`.

In [57]:
bill_id_df = pd.read_csv('datasets/bill_id.csv')

In [22]:
bill_id_df.shape

(13600, 3)

In [23]:
bill_id_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13600 entries, 0 to 13599
Data columns (total 3 columns):
bill_id              13600 non-null int64
patient_id           13600 non-null object
date_of_admission    13600 non-null object
dtypes: int64(1), object(2)
memory usage: 318.8+ KB


In [41]:
bill_id_df.tail()

Unnamed: 0,bill_id,patient_id,date_of_admission
13595,1641053864,a4c61deaa9ce86b4d2289eab6128b872,2015-12-28
13596,6956955826,ac52a32f8ce8c46d82df2d72052ae5a9,2015-12-28
13597,1399259594,4f67a54ab205cc9e7e2b0a4ee08e4fba,2015-12-28
13598,9243628699,a4c61deaa9ce86b4d2289eab6128b872,2015-12-28
13599,4808173213,a4c61deaa9ce86b4d2289eab6128b872,2015-12-28


In [25]:
bill_id_df.bill_id.nunique()

13600

The second file, `bill_id.csv`, does not have any null values. Each row gives the patient and the _date_ of admission for one unique bill. Joining it with the first dataset on `bill_id` lets us observe the cost incurred by each patient.
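The join described above can be sketched as follows. This is a minimal illustration on hypothetical toy rows (`bill_amt_demo`, `bill_id_demo` and their values are made up for the example, not taken from the CSV files); with the real data, `bill_amt` and `bill_id_df` would take their place.

```python
import pandas as pd

# Hypothetical toy rows standing in for bill_amt and bill_id_df.
bill_amt_demo = pd.DataFrame({
    "bill_id": [101, 102, 103],
    "amount": [1500.0, 1000.0, 6500.0],
})
bill_id_demo = pd.DataFrame({
    "bill_id": [101, 102, 103],
    "patient_id": ["p1", "p1", "p2"],
    "date_of_admission": ["2011-01-01", "2011-01-01", "2011-01-02"],
})

# Join on bill_id; validate="one_to_one" raises an error if bill_id
# is duplicated on either side, which guards against row inflation.
bills = bill_id_demo.merge(
    bill_amt_demo, on="bill_id", how="inner", validate="one_to_one"
)

# Sum the bills belonging to the same patient and admission date.
cost_per_admission = bills.groupby(
    ["patient_id", "date_of_admission"], as_index=False
)["amount"].sum()
print(cost_per_admission)
```

Since both files have 13,600 unique `bill_id` values, an inner join should keep every row, provided the two sets of IDs are the same.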

---
A3 - Let's load the third dataset `clinical_data.csv`.

In [26]:
data = pd.read_csv('datasets/clinical_data.csv')

In [27]:
data.shape

(3400, 26)

In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3400 entries, 0 to 3399
Data columns (total 26 columns):
id                    3400 non-null object
date_of_admission     3400 non-null object
date_of_discharge     3400 non-null object
medical_history_1     3400 non-null int64
medical_history_2     3167 non-null float64
medical_history_3     3400 non-null object
medical_history_4     3400 non-null int64
medical_history_5     3096 non-null float64
medical_history_6     3400 non-null int64
medical_history_7     3400 non-null int64
preop_medication_1    3400 non-null int64
preop_medication_2    3400 non-null int64
preop_medication_3    3400 non-null int64
preop_medication_4    3400 non-null int64
preop_medication_5    3400 non-null int64
preop_medication_6    3400 non-null int64
symptom_1             3400 non-null int64
symptom_2             3400 non-null int64
symptom_3             3400 non-null int64
symptom_4             3400 non-null int64
symptom_5             3400 non-null int64
lab

In [29]:
data.isnull().sum().sum()

537
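The total above does not say which columns the nulls sit in. A per-column count is one way to find out; the sketch below uses a hypothetical toy frame (`clinical_demo`) rather than the real `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the clinical data.
clinical_demo = pd.DataFrame({
    "medical_history_1": [0, 1, 0, 0],
    "medical_history_2": [1.0, np.nan, 0.0, 0.0],
    "medical_history_5": [0.0, np.nan, np.nan, 0.0],
})

# Count nulls per column, then keep only the columns that have any.
null_counts = clinical_demo.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```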

In [46]:
data.head(6)

Unnamed: 0,id,date_of_admission,date_of_discharge,medical_history_1,medical_history_2,medical_history_3,medical_history_4,medical_history_5,medical_history_6,medical_history_7,preop_medication_1,preop_medication_2,preop_medication_3,preop_medication_4,preop_medication_5,preop_medication_6,symptom_1,symptom_2,symptom_3,symptom_4,symptom_5,lab_result_1,lab_result_2,lab_result_3,weight,height
0,1d21f2be18683991eb93d182d6b2d220,2011-01-01,2011-01-11,0,1.0,0,0,0.0,0,0,1,0,1,0,0,1,0,0,0,1,1,13.2,30.9,123.0,71.3,161.0
1,62bdca0b95d97e99e1c712048fb9fd09,2011-01-01,2011-01-11,0,0.0,0,0,0.0,0,0,0,1,1,1,1,0,0,0,1,1,1,13.8,22.6,89.0,78.4,160.0
2,c85cf97bc6307ded0dd4fef8bad2fa09,2011-01-02,2011-01-13,0,0.0,0,0,0.0,0,0,0,1,1,1,1,1,1,1,1,1,0,11.2,26.2,100.0,72.0,151.0
3,e0397dd72caf4552c5babebd3d61736c,2011-01-02,2011-01-14,0,1.0,No,0,0.0,1,1,1,0,1,0,0,1,1,1,1,1,1,13.3,28.4,76.0,64.4,152.0
4,94ade3cd5f66f4584902554dff170a29,2011-01-08,2011-01-16,0,0.0,No,0,0.0,1,1,0,0,0,0,1,0,0,1,0,1,0,12.0,27.8,87.0,55.6,160.0
5,59e07adc2dbc5f70131f57d003610d74,2011-01-07,2011-01-17,0,,No,0,,0,0,0,1,1,1,1,1,1,0,1,1,1,15.8,31.0,75.0,78.8,169.0


The third file, `clinical_data.csv`, has 3400 rows and 26 columns: an `id` column plus 25 clinical fields.<br>
- It covers
    - the dates on which the patient was admitted and discharged,
    - the height of the patient,
    - the weight of the patient,
    - whether the patient has any/all of the 7 medical histories,
    - whether the patient has had to take any/all of the 6 preoperative medications,
    - whether the patient showed any/all of the 5 symptoms and
    - 3 lab results of the patient<br>
    
- The `id` column of the dataset seems to correspond to the `patient_id` column in the `bill_id_df` dataset.<br>
- It has **537 null values**: 233 in `medical_history_2` and 304 in `medical_history_5`.<br>
- The dataset needs to be cleaned as there are
     - string objects like "No" in the 3rd medical history column and<br>
     - null values.<br>
     
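Note that the 13,600 in the `bill_id_df` dataset counts unique bills, not unique patients: the same `patient_id` appears on several bills in the preview above. `clinical_data.csv` has 3400 rows, which is 25% of the number of bills, so the number of unique patients and admissions behind those bills still needs to be checked.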
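`bill_id.nunique()` counts bills, so the number of patients and admissions has to be counted separately. A minimal sketch on hypothetical toy rows (`bill_id_demo` is made up for the example), assuming an admission is identified by a `patient_id` and `date_of_admission` pair:

```python
import pandas as pd

# Hypothetical toy rows standing in for bill_id_df.
bill_id_demo = pd.DataFrame({
    "bill_id": [1, 2, 3, 4, 5],
    "patient_id": ["p1", "p1", "p2", "p2", "p3"],
    "date_of_admission": [
        "2011-01-01", "2011-01-01", "2011-01-02", "2011-03-05", "2011-01-02",
    ],
})

n_bills = bill_id_demo["bill_id"].nunique()
n_patients = bill_id_demo["patient_id"].nunique()

# Assumption: one admission = one (patient_id, date_of_admission) pair.
n_admissions = (
    bill_id_demo[["patient_id", "date_of_admission"]].drop_duplicates().shape[0]
)
print(n_bills, n_patients, n_admissions)
```

Running the same three counts on `bill_id_df` would show how the 13,600 bills relate to the 3400 rows of clinical data.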

---
A4 - Let's load the last dataset `demographics.csv`.

In [43]:
demographic = pd.read_csv('datasets/demographics.csv')

In [45]:
demographic.shape

(3000, 5)

In [47]:
demographic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 5 columns):
patient_id         3000 non-null object
gender             3000 non-null object
race               3000 non-null object
resident_status    3000 non-null object
date_of_birth      3000 non-null object
dtypes: object(5)
memory usage: 117.3+ KB


In [50]:
demographic.head(10)

Unnamed: 0,patient_id,gender,race,resident_status,date_of_birth
0,fa2d818b2261e44e30628ad1ac9cc72c,Female,Indian,Singaporean,1971-05-14
1,5b6477c5de78d0b138e3b0c18e21d0ae,f,Chinese,Singapore citizen,1976-02-18
2,320aa16c61937447fd6631bf635e7fde,Male,Chinese,Singapore citizen,1982-07-03
3,c7f3881684045e6c49020481020fae36,Male,Malay,Singapore citizen,1947-06-15
4,541ad077cb4a0e64cc422673afe28aef,m,Chinese,Singaporean,1970-12-12
5,cf280265a73331d6cad35b4800e96abf,Female,Chinese,PR,1966-12-01
6,94f7d3a8a4d6bb14859b64c3f03e4a6c,m,Malay,Singaporean,1975-09-14
7,43dfbeb8d76f3b00b8fa7a49e5a3eb6f,f,chinese,Singaporean,1974-03-04
8,2882e70ff56c2600bbbbb855fcfa96b9,Male,Chinese,Singaporean,1969-04-22
9,36e65f14c328fef0b02aa7d4047c6f74,Female,Chinese,Singapore citizen,1976-10-24


The last file, `demographics.csv`, has 3000 rows with 4 descriptive fields that tell us more about each patient.<br>
- It has the
    - gender of patient,
    - race of patient,
    - resident status of patient and
    - date of birth of patient<br>
    
- The `patient_id` column of the dataset seems to correspond to the `patient_id` column in the `bill_id_df` dataset.<br>
- There are no null values.<br>
- The dataset needs to be cleaned as there are
     - dates in object type<br>
     - synonymous terms representing gender, race & citizenship.<br>
     
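Considering that `clinical_data.csv` has 3400 rows, `demographics.csv` has only about 88% as many rows (3000 of 3400). Whether every patient in the clinical data has demographic data depends on how many unique `id` values the clinical data contains.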
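One way to check how well the two tables cover each other is an outer merge with `indicator=True`. The sketch below uses hypothetical toy frames (`clinical_demo`, `demographic_demo`) and assumes that `id` in the clinical data and `patient_id` in the demographic data refer to the same identifier.

```python
import pandas as pd

# Hypothetical toy frames; p1 appears twice to mimic a repeat admission.
clinical_demo = pd.DataFrame({"id": ["p1", "p1", "p2", "p3"]})
demographic_demo = pd.DataFrame({
    "patient_id": ["p1", "p2", "p9"],
    "gender": ["Female", "Male", "Male"],
})

# Compare unique IDs only, so repeat admissions are not double counted.
clinical_ids = clinical_demo[["id"]].drop_duplicates()

coverage = clinical_ids.merge(
    demographic_demo[["patient_id"]],
    left_on="id",
    right_on="patient_id",
    how="outer",
    indicator=True,  # adds a _merge column: both / left_only / right_only
)
counts = coverage["_merge"].value_counts()
print(counts)
```

Here `left_only` counts patients with clinical data but no demographics, and `right_only` counts the reverse.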

---
Next, I will clean the datasets for the analysis.

### (B) Clean the datasets
`bill_amt` does not require any cleaning.<br>
`bill_id_df`: the `date_of_admission` column needs to be converted to a datetime type.<br>
`data`: the `medical_history_3` column has string values that need to be converted to binary, and the date columns need to be converted to datetime.<br>
`demographic`: the gender, race and resident status columns need consistent labels, and the date of birth needs to be converted to datetime.<br>

---
B1 - Clean `bill_id_df`

In [58]:
bill_id_df['date_of_admission'] = pd.to_datetime(bill_id_df.date_of_admission)

In [61]:
bill_id_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13600 entries, 0 to 13599
Data columns (total 3 columns):
bill_id              13600 non-null int64
patient_id           13600 non-null object
date_of_admission    13600 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 318.8+ KB



---
B2 - Clean `data`

In [54]:
data.medical_history_3.unique()

array(['0', 'No', '1', 'Yes'], dtype=object)
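A possible cleaning step for these four values is an explicit mapping to 0/1. The sketch below works on a hypothetical toy frame (`clinical_demo`) and assumes that `'No'` means 0 and `'Yes'` means 1; the same frame also shows the date conversion planned for this dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for the clinical data.
clinical_demo = pd.DataFrame({
    "medical_history_3": ["0", "No", "1", "Yes"],
    "date_of_admission": ["2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02"],
    "date_of_discharge": ["2011-01-11", "2011-01-11", "2011-01-13", "2011-01-14"],
})

# Assumption: 'No' is equivalent to 0 and 'Yes' to 1.
history_map = {"0": 0, "No": 0, "1": 1, "Yes": 1}
mapped = clinical_demo["medical_history_3"].map(history_map)

# map() returns NaN for any value missing from the mapping, so stop here
# rather than silently introducing new nulls.
if mapped.isna().any():
    raise ValueError("unexpected value in medical_history_3")
clinical_demo["medical_history_3"] = mapped.astype("int64")

# Convert the date columns from strings to datetime.
for col in ["date_of_admission", "date_of_discharge"]:
    clinical_demo[col] = pd.to_datetime(clinical_demo[col])

print(clinical_demo.dtypes)
```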


---
B3 - Clean `demographic`

In [51]:
demographic.race.unique()

array(['Indian', 'Chinese', 'Malay', 'chinese', 'India', 'Others'],
      dtype=object)

In [52]:
demographic.resident_status.unique()

array(['Singaporean', 'Singapore citizen', 'PR', 'Foreigner'],
      dtype=object)

In [53]:
demographic.gender.unique()

array(['Female', 'f', 'Male', 'm'], dtype=object)
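The synonymous labels seen in the three outputs above could be merged with `replace`. This is a sketch on a hypothetical toy frame (`demographic_demo`); the choice of canonical labels, and the assumption that `'Singapore citizen'` and `'Singaporean'` mean the same thing, are mine.

```python
import pandas as pd

# Hypothetical toy frame using the label variants seen in the outputs above.
demographic_demo = pd.DataFrame({
    "gender": ["Female", "f", "Male", "m"],
    "race": ["Indian", "chinese", "India", "Others"],
    "resident_status": ["Singaporean", "Singapore citizen", "PR", "Foreigner"],
    "date_of_birth": ["1971-05-14", "1976-02-18", "1982-07-03", "1947-06-15"],
})

# Assumed canonical labels; values not listed in a mapping are left unchanged.
label_maps = {
    "gender": {"f": "Female", "m": "Male"},
    "race": {"chinese": "Chinese", "India": "Indian"},
    "resident_status": {"Singapore citizen": "Singaporean"},
}
for col, mapping in label_maps.items():
    demographic_demo[col] = demographic_demo[col].replace(mapping)

# Convert the date of birth from string to datetime.
demographic_demo["date_of_birth"] = pd.to_datetime(demographic_demo["date_of_birth"])

print(demographic_demo)
```

Re-running `unique()` on each column afterwards is a quick check that no variant was missed.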