# Machine Learning Zoomcamp - Capstone Project 2 - Cardiology Unit Admission

As described in the Readme.md file, this project trains and tunes a Linear Regression model to estimate the length of stay of patients in a Cardiology Unit.

The dataset is available on Kaggle at [this address](https://www.kaggle.com/datasets/mansoorahmad4477/cardiology-unit-admission).

Let's download the dataset.

## Dataset Download

As the dataset is hosted on Kaggle, the easiest way to download it is with the kagglehub Python library.

If you don't have the kagglehub library installed, you can install it with this command:

``` shell
pip install kagglehub
```

In [2]:
!pip install kagglehub

Collecting kagglehub
  Using cached kagglehub-0.3.6-py3-none-any.whl.metadata (30 kB)
Collecting tqdm (from kagglehub)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Using cached kagglehub-0.3.6-py3-none-any.whl (51 kB)
Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, kagglehub
Successfully installed kagglehub-0.3.6 tqdm-4.67.1


In [16]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("mansoorahmad4477/cardiology-unit-admission")
print("Path where the dataset was downloaded is: " + path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/mansoorahmad4477/cardiology-unit-admission?dataset_version_number=1...


100%|██████████| 209k/209k [00:00<00:00, 1.48MB/s]

Extracting files...
Path where the dataset was downloaded is: /home/jgrau/.cache/kagglehub/datasets/mansoorahmad4477/cardiology-unit-admission/versions/1





This command downloaded the dataset into the kagglehub cache folder, but we need to store it in our project folder.

In [17]:
!mkdir -p ./data
!mv {path}/* ./data
!ls ./data

cw_22_23_24.csv
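
The shell commands above depend on a Unix-like shell. As a minimal sketch of a portable alternative, the files could be copied with Python's standard library instead. The helper name `copy_dataset_files` is hypothetical, and it copies rather than moves the files, so the kagglehub cache stays intact:

```python
import shutil
from pathlib import Path


def copy_dataset_files(src_dir, dest_dir):
    """Copy every regular file from src_dir into dest_dir and return their names."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # no error if the folder already exists
    copied = []
    for item in sorted(src.iterdir()):
        if item.is_file():
            shutil.copy2(item, dest / item.name)
            copied.append(item.name)
    return copied
```

In the notebook this could be called as `copy_dataset_files(path, './data')`, where `path` is the value returned by `kagglehub.dataset_download`.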


## Load and Prepare Data

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [34]:
df = pd.read_csv('./data/cw_22_23_24.csv')

In [30]:
len(df)

9573

In [31]:
df.head()

|   | adm_type | shift_from | ssc | yr_nae | m_no | mrn | pt_name | sex | disease | D.O.A | D.O.D | status | consultant | L.O.S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Shift From | ER | No | 1 | 1 | 21845698 | Hara Bibi | F | STEMI | 1-Jan-22 | 1-Jan-22 | Discharge | Imran Khan | 0 |
| 1 | Shift From | ER | No | 2 | 2 | 22000071 | Taj Rehman | M | ADHF | 1-Jan-22 | 5-Jan-22 | Discharge | Malik Faisal | 4 |
| 2 | Shift From | ER | No | 3 | 3 | 21838760 | Bakhtawar Shah | M | ihd | 1-Jan-22 | 10-Jan-22 | Discharge | Asif Iqbal | 9 |
| 3 | Shift From | ER | No | 4 | 4 | 22000251 | Arasal Jan Bibi | F |  | 1-Jan-22 | 7-Jan-22 | Discharge | Sher Bahadar | 6 |
| 4 | Shift From | Neu | No | 5 | 5 | 21825110 | Khad Mewa | F |  | 1-Jan-22 | 2-Jan-22 | Discharge | Tariq Nawaz | 1 |


### Columns Description

Based on the information available at Kaggle, this is the description of the information available in each column of the dataset:

- adm_type: Indicates the type of admission (e.g., "Shift From").
- shift_from: Specifies the source from where the patient was shifted (e.g., "ER" for Emergency Room, "Neu" for Neurology).
- ssc: Sehat Sahulat Card insurance or Health card of KPK province.
- yr_nae: Likely represents the year of the admission event.
- m_no: Month number of the admission event.
- mrn: Medical record number, a unique identifier for each patient.
- pt_name: Name of the patient.
- sex: Gender of the patient (e.g., "M" for male, "F" for female).
- disease: Diagnosis or condition of the patient (e.g., "STEMI", "ADHF").
- D.O.A: Date of admission in various formats.
- D.O.D: Date of discharge or death, also in various formats.
- status: Discharge status of the patient (e.g., "Discharge").
- consultant: Name of the consulting doctor responsible for the patient.
- L.O.S: Length of stay in the hospital, measured in days. **--> This will be our target variable**

Some of these columns must not be part of the ML model: D.O.A and D.O.D. The discharge date is not available at the moment we need to make a prediction, and together the two dates determine L.O.S directly.

In [35]:
del df['D.O.A']
del df['D.O.D']
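
As a quick illustration of why the two date columns would leak the target, the sketch below recomputes the length of stay for the five sample rows printed by `df.head()`. It assumes the `%d-%b-%y` format seen in those rows; the dataset description says the dates come in various formats, so the full columns may need more careful parsing.

```python
import pandas as pd

# The five sample rows printed by df.head() above
sample = pd.DataFrame({
    'D.O.A': ['1-Jan-22'] * 5,
    'D.O.D': ['1-Jan-22', '5-Jan-22', '10-Jan-22', '7-Jan-22', '2-Jan-22'],
    'L.O.S': [0, 4, 9, 6, 1],
})

# Assumption: every date follows the day-month-year pattern of the sample
doa = pd.to_datetime(sample['D.O.A'], format='%d-%b-%y')
dod = pd.to_datetime(sample['D.O.D'], format='%d-%b-%y')

# Days between admission and discharge
computed_los = (dod - doa).dt.days
print((computed_los == sample['L.O.S']).all())
```

For these sample rows the difference in days matches `L.O.S`, which is why a model that sees both dates would have nothing left to learn.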


In [36]:
df.dtypes

adm_type      object
shift_from    object
ssc           object
yr_nae         int64
m_no           int64
mrn           object
pt_name       object
sex           object
disease       object
status        object
consultant    object
L.O.S          int64
dtype: object

In [37]:
# Normalize the column names: lower case, underscores for spaces, no dots
df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('.', '', regex=False)
df.columns

Index(['adm_type', 'shift_from', 'ssc', 'yr_nae', 'm_no', 'mrn', 'pt_name',
       'sex', 'disease', 'status', 'consultant', 'los'],
      dtype='object')

Several columns can be treated as categorical variables. Let's normalize the values in those columns.

In [38]:
categorical = list(df.dtypes[df.dtypes == 'object'].index)

for c in categorical:
    df[c] = df[c].str.lower().str.replace(' ','_')
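
To illustrate what this normalization does, here is a small sketch on made-up values. It also shows one side effect: a trailing space becomes a trailing underscore, which is one plausible origin of values like `opd_` in the counts printed further down. Stripping whitespace first is an optional variation, not something the notebook does.

```python
import pandas as pd

# Toy values (not taken from the dataset) with mixed case and a trailing space
raw = pd.Series(['OPD', 'opd ', 'Shift From', 'ER'])

# Same transformation as in the loop above
normalized = raw.str.lower().str.replace(' ', '_')
print(normalized.tolist())

# Optional variation: strip surrounding whitespace before replacing spaces
stripped = raw.str.strip().str.lower().str.replace(' ', '_')
print(stripped.tolist())
```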

In [39]:
for c in categorical:
    print('********************')
    print(df[c].value_counts())
    print('********************' + "\n\n")

********************
adm_type
shift_from    9192
ibp            194
opd            107
er              78
opd_             2
Name: count, dtype: int64
********************


********************
shift_from
er                8073
post_cath         1013
ibp                167
opd                 90
pcw                 22
                  ... 
gynae                1
ct_icu/ibp           1
amu/er               1
shift_from_cvw       1
s-pacu               1
Name: count, Length: 97, dtype: int64
********************


********************
ssc
no     5658
yes    3915
Name: count, dtype: int64
********************


********************
mrn
5017905     3
22302825    2
5396245     2
22987541    2
5139589     2
           ..
22918773    1
22913469    1
22917790    1
22918295    1
7047427     1
Name: count, Length: 9461, dtype: int64
********************


********************
pt_name
yasmeen_bibi    28
razia_bibi      26
nasreen_bibi    21
shamim_bibi     21
taj_bibi        20
                

The dataset originally had two date columns, D.O.A (date of admission) and D.O.D (date of discharge). Since both were removed above, there is no need to convert them to a date format.

### Handling missing values

The only column with missing values is *disease*. There are several ways to handle this, among them:
1. **Delete rows with missing values**: This is the simplest option, but as the missing values represent around 21% of the rows, we would lose too much data.
2. **Impute missing values**: We could fill these values with a constant or with the most frequent value. The problem is that this could introduce bias into the model.
3. **Create a new category for missing values**: Filling these values with 'missing' makes them an additional category in this column. We are going with this approach.
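
The three options can be compared on a toy column. This is only a sketch with made-up values standing in for `df['disease']`; it also shows one way to count the missing values and their share of the rows.

```python
import pandas as pd

# Toy data standing in for the disease column (values are illustrative)
toy = pd.DataFrame({'disease': ['stemi', None, 'adhf', 'stemi', None]})

# How many values are missing, and what share of the rows that is
n_missing = toy['disease'].isna().sum()
share_missing = toy['disease'].isna().mean()

# Option 1: drop the rows with missing values
dropped = toy.dropna(subset=['disease'])

# Option 2: impute with the most frequent value
most_frequent = toy['disease'].mode()[0]
imputed = toy['disease'].fillna(most_frequent)

# Option 3: treat "missing" as its own category
flagged = toy['disease'].fillna('missing')

print(n_missing, share_missing)
print(len(dropped), imputed.tolist(), flagged.tolist())
```

Option 3 keeps every row and lets the model learn whether an absent diagnosis carries any signal of its own.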


In [40]:
# Assign the result back instead of calling inplace=True on a column selection (stops working in pandas 3.0)
df['disease'] = df['disease'].fillna('missing')


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9573 entries, 0 to 9572
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   adm_type    9573 non-null   object
 1   shift_from  9573 non-null   object
 2   ssc         9573 non-null   object
 3   yr_nae      9573 non-null   int64 
 4   m_no        9573 non-null   int64 
 5   mrn         9573 non-null   object
 6   pt_name     9573 non-null   object
 7   sex         9573 non-null   object
 8   disease     9573 non-null   object
 9   status      9573 non-null   object
 10  consultant  9573 non-null   object
 11  los         9573 non-null   int64 
dtypes: int64(3), object(9)
memory usage: 897.6+ KB


### Splitting the data

In [42]:
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

y_train = df_train.los.values
y_val = df_val.los.values
y_test = df_test.los.values

df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)

del df_train['los']
del df_val['los']
del df_test['los']
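
The two `test_size` values produce a 60/20/20 split: 20% of the data is held out for testing, and 25% of the remaining 80% is another 20% of the full dataset. A worked check of that arithmetic, using exact fractions to avoid floating-point rounding:

```python
from fractions import Fraction

test_share = Fraction(20, 100)                     # test_size=0.2 of the full dataset
full_train_share = 1 - test_share                  # what is left after the first split
val_share = full_train_share * Fraction(25, 100)   # test_size=0.25 of the remainder
train_share = full_train_share - val_share         # what the model is trained on

print(train_share, val_share, test_share)
```

The exact row counts in each part also depend on how `train_test_split` rounds, so they may differ from these shares by a row or so.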