
# Data Analysis for cost of care in healthcare
_by Hafsa Laeeque_

---
The task is to:

> 1. Analyze the **clinical and financial data** of patients hospitalized for a _certain condition_.
> 2. Join the data given in the different tables.
> 3. Find insights about the **drivers of cost of care**.
> 4. Document the _approach, results and insights_ using [slides](https://docs.google.com/presentation/d/1-gYni51iGkYh4OCCr-BYKAV3YqUH4OzQOUAtz5MkrEg/edit?usp=sharing) and a [document](https://docs.google.com/document/d/1fQB0AP2ue_zKVUUAHx_1sJ626okOMtm_IdLOnw-9__A/edit?usp=sharing), both of which should follow a similar narrative.

---
The [datasets](https://github.com/hafsalaeeque/cost-in-healthcare-DS-proj/tree/master/datasets) folder contains four CSV files, which I will use for my analysis.

## Analysing the datasets
### Import packages

In [10]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

In [36]:
# Show up to 50 columns and 4000 rows when displaying a DataFrame
pd.options.display.max_columns = 50
pd.options.display.max_rows = 4000

### Load the datasets
Let's load the first dataset `bill_amount.csv`.

In [11]:
bill_amt = pd.read_csv('datasets/bill_amount.csv')

In [12]:
bill_amt.shape

(13600, 2)

In [13]:
bill_amt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13600 entries, 0 to 13599
Data columns (total 2 columns):
bill_id    13600 non-null int64
amount     13600 non-null float64
dtypes: float64(1), int64(1)
memory usage: 212.6 KB


In [14]:
bill_amt.head()

| | bill_id | amount |
|---|---|---|
| 0 | 40315104 | 1552.63483 |
| 1 | 2660045161 | 1032.011951 |
| 2 | 1148334643 | 6469.605351 |
| 3 | 3818426276 | 755.965425 |
| 4 | 9833541918 | 897.347816 |


In [15]:
bill_amt.bill_id.nunique()

13600

The first file, `bill_amount.csv`, does not have any null values. Each row gives the amount charged for one unique bill.

---
Let's load the second dataset `bill_id.csv`.

In [21]:
bill_id_df = pd.read_csv('datasets/bill_id.csv')

In [22]:
bill_id_df.shape

(13600, 3)

In [23]:
bill_id_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13600 entries, 0 to 13599
Data columns (total 3 columns):
bill_id              13600 non-null int64
patient_id           13600 non-null object
date_of_admission    13600 non-null object
dtypes: int64(1), object(2)
memory usage: 318.8+ KB


In [24]:
bill_id_df.head()

| | bill_id | patient_id | date_of_admission |
|---|---|---|---|
| 0 | 7968360812 | 1d21f2be18683991eb93d182d6b2d220 | 2011-01-01 |
| 1 | 6180579974 | 62bdca0b95d97e99e1c712048fb9fd09 | 2011-01-01 |
| 2 | 7512568183 | 1d21f2be18683991eb93d182d6b2d220 | 2011-01-01 |
| 3 | 3762633379 | 62bdca0b95d97e99e1c712048fb9fd09 | 2011-01-01 |
| 4 | 7654730355 | 1d21f2be18683991eb93d182d6b2d220 | 2011-01-01 |


In [25]:
bill_id_df.bill_id.nunique()

13600

The second file, `bill_id.csv`, does not have any null values. Each row links a unique bill to a patient and to the _date_ on which that patient was admitted to the hospital. It can be joined with our first dataset on `bill_id` so we can observe the cost incurred by each patient.
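
The join described above could be sketched as follows. This is a minimal illustration, not necessarily the approach used later in the analysis: the helper name `join_bills` and the toy rows are assumptions made for the example, and only the column names come from the two files.

```python
import pandas as pd

def join_bills(bill_amt: pd.DataFrame, bill_id_df: pd.DataFrame) -> pd.DataFrame:
    """Attach patient_id and date_of_admission to every bill amount."""
    # validate="one_to_one" makes pandas raise a MergeError if a bill_id
    # is repeated on either side, so a bad join fails loudly.
    return bill_amt.merge(bill_id_df, on="bill_id", how="inner",
                          validate="one_to_one")

# Toy rows for illustration only (not taken from the real files)
toy_amt = pd.DataFrame({"bill_id": [1, 2, 3],
                        "amount": [100.0, 250.5, 75.25]})
toy_ids = pd.DataFrame({
    "bill_id": [1, 2, 3],
    "patient_id": ["a", "b", "a"],
    "date_of_admission": ["2011-01-01", "2011-01-01", "2011-01-01"],
})

bills = join_bills(toy_amt, toy_ids)

# Total amount billed to each patient
cost_per_patient = bills.groupby("patient_id")["amount"].sum()
```

On the real data the call would be `join_bills(bill_amt, bill_id_df)`. Since both frames have 13,600 unique `bill_id` values, the result should also have 13,600 rows if every bill appears in both files.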

---
Let's load the third dataset `clinical_data.csv`.

In [26]:
data = pd.read_csv('datasets/clinical_data.csv')

In [27]:
data.shape

(3400, 26)

In [28]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3400 entries, 0 to 3399
Data columns (total 26 columns):
id                    3400 non-null object
date_of_admission     3400 non-null object
date_of_discharge     3400 non-null object
medical_history_1     3400 non-null int64
medical_history_2     3167 non-null float64
medical_history_3     3400 non-null object
medical_history_4     3400 non-null int64
medical_history_5     3096 non-null float64
medical_history_6     3400 non-null int64
medical_history_7     3400 non-null int64
preop_medication_1    3400 non-null int64
preop_medication_2    3400 non-null int64
preop_medication_3    3400 non-null int64
preop_medication_4    3400 non-null int64
preop_medication_5    3400 non-null int64
preop_medication_6    3400 non-null int64
symptom_1             3400 non-null int64
symptom_2             3400 non-null int64
symptom_3             3400 non-null int64
symptom_4             3400 non-null int64
symptom_5             3400 non-null int64
lab

In [29]:
data.isnull().sum().sum()

537

In [39]:
data.head(10)

Unnamed: 0,id,date_of_admission,date_of_discharge,medical_history_1,medical_history_2,medical_history_3,medical_history_4,medical_history_5,medical_history_6,medical_history_7,preop_medication_1,preop_medication_2,preop_medication_3,preop_medication_4,preop_medication_5,preop_medication_6,symptom_1,symptom_2,symptom_3,symptom_4,symptom_5,lab_result_1,lab_result_2,lab_result_3,weight,height
0,1d21f2be18683991eb93d182d6b2d220,2011-01-01,2011-01-11,0,1.0,0,0,0.0,0,0,1,0,1,0,0,1,0,0,0,1,1,13.2,30.9,123.0,71.3,161.0
1,62bdca0b95d97e99e1c712048fb9fd09,2011-01-01,2011-01-11,0,0.0,0,0,0.0,0,0,0,1,1,1,1,0,0,0,1,1,1,13.8,22.6,89.0,78.4,160.0
2,c85cf97bc6307ded0dd4fef8bad2fa09,2011-01-02,2011-01-13,0,0.0,0,0,0.0,0,0,0,1,1,1,1,1,1,1,1,1,0,11.2,26.2,100.0,72.0,151.0
3,e0397dd72caf4552c5babebd3d61736c,2011-01-02,2011-01-14,0,1.0,No,0,0.0,1,1,1,0,1,0,0,1,1,1,1,1,1,13.3,28.4,76.0,64.4,152.0
4,94ade3cd5f66f4584902554dff170a29,2011-01-08,2011-01-16,0,0.0,No,0,0.0,1,1,0,0,0,0,1,0,0,1,0,1,0,12.0,27.8,87.0,55.6,160.0
5,59e07adc2dbc5f70131f57d003610d74,2011-01-07,2011-01-17,0,,No,0,,0,0,0,1,1,1,1,1,1,0,1,1,1,15.8,31.0,75.0,78.8,169.0
6,f5c4d97ebf32d49967fbf4f6c5fd52ec,2011-01-06,2011-01-17,0,0.0,0,0,0.0,0,1,0,1,1,1,1,0,0,1,0,1,0,12.1,23.0,83.0,81.8,164.0
7,1e788744568c21b390c5aa8c5dd61335,2011-01-07,2011-01-17,0,0.0,0,0,0.0,0,1,0,0,1,0,1,1,1,1,0,0,0,16.4,26.8,126.0,73.5,173.0
8,457402b26562d41f4e40906d3d17d5d1,2011-01-12,2011-01-18,0,0.0,No,0,0.0,0,0,0,1,1,1,1,0,1,1,1,0,0,12.5,32.9,87.0,98.4,166.0
9,79f52395dab0e6d3a03c48f765cb6562,2011-01-02,2011-01-18,0,0.0,0,0,0.0,0,1,0,1,1,0,1,1,1,1,0,1,0,12.1,23.6,109.0,92.8,176.0


The third file, `clinical_data.csv`, has 3,400 records and 26 columns: the patient `id`, two dates and 23 clinical variables.<br>
- It covers:
    - the dates on which the patient was admitted and discharged,
    - the height of the patient,
    - the weight of the patient,
    - whether the patient has any or all of the 7 medical histories,
    - whether the patient had to take any or all of the 6 preoperative medications,
    - whether the patient showed any or all of the 5 symptoms, and
    - 3 lab results of the patient.<br>
    
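- The `id` column of the dataset seems to correspond to the `patient_id` column in the `bill_id_df` dataset.<br>
- It has **537 null values**, all in the 2nd and 5th medical history columns (233 in `medical_history_2` and 304 in `medical_history_5`, going by the non-null counts above).<br>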
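
To see where the nulls sit rather than only their total, a per-column count helps. The sketch below is one possible way to do it; the helper name `null_report` and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def null_report(df: pd.DataFrame) -> pd.Series:
    """Return the null count of every column that has at least one null."""
    counts = df.isnull().sum()
    return counts[counts > 0]

# Toy frame for illustration only (not the real clinical data)
toy = pd.DataFrame({
    "medical_history_1": [0, 1, 0, 0],
    "medical_history_2": [1.0, np.nan, 0.0, np.nan],
    "medical_history_5": [0.0, 0.0, np.nan, 0.0],
})

report = null_report(toy)
```

Calling `null_report(data)` on the real frame should list only `medical_history_2` and `medical_history_5`. It may also be worth running `data["medical_history_3"].value_counts()`, because that column is read as `object` and the first rows mix `0` and `No`.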

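Note that the 13,600 unique values in the `bill_id_df` dataset are bill IDs, not patients: the first rows already show the same `patient_id` on several bills. The 3,400 records in `clinical_data.csv` amount to 25% of the number of bills, which would be consistent with about four bills per clinical record on average. Counting the unique `patient_id` values would confirm how many patients there actually are.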
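
A quick way to check how the bills map onto patients and admissions is to count each separately. This is a minimal sketch: the helper name `bill_summary` is an assumption, and treating each unique pair of `patient_id` and `date_of_admission` as one admission is also an assumption. The sample rows are the five shown by `bill_id_df.head()`.

```python
import pandas as pd

def bill_summary(bill_id_df: pd.DataFrame) -> dict:
    """Count unique bills, patients and admissions in the bill table."""
    # Assumption: one admission = one (patient_id, date_of_admission) pair
    admissions = bill_id_df.drop_duplicates(["patient_id", "date_of_admission"])
    return {
        "bills": bill_id_df["bill_id"].nunique(),
        "patients": bill_id_df["patient_id"].nunique(),
        "admissions": len(admissions),
    }

# The five rows shown by bill_id_df.head()
sample = pd.DataFrame({
    "bill_id": [7968360812, 6180579974, 7512568183, 3762633379, 7654730355],
    "patient_id": [
        "1d21f2be18683991eb93d182d6b2d220",
        "62bdca0b95d97e99e1c712048fb9fd09",
        "1d21f2be18683991eb93d182d6b2d220",
        "62bdca0b95d97e99e1c712048fb9fd09",
        "1d21f2be18683991eb93d182d6b2d220",
    ],
    "date_of_admission": ["2011-01-01"] * 5,
})

summary = bill_summary(sample)
```

Even in these five rows there are only two patients, so the number of bills overstates the number of patients. Running `bill_summary(bill_id_df)` on the full table would give the actual counts to compare against the 3,400 clinical records.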

---
Let's load the last dataset `demographics.csv`.