**PROJECT: HOSPITAL ADMISSION ANALYSIS**

**PROBLEM STATEMENT:**

A city hospital wants to analyze its admission trends to improve scheduling, reduce waiting times, and
manage resources more effectively. You've been provided with anonymized patient admission data. Your
task is to perform basic exploratory data analysis (EDA) to discover useful patterns.

**IMPORT LIBRARIES**

In [19]:
import pandas as pd
import matplotlib.pyplot as plt

**DATA EXPLORATION**

In [20]:
# LOAD THE DATASET

admissions_df = pd.read_csv('src/admission.csv')
admissions_df.head()

Unnamed: 0,SNO,MRD No.,D.O.A,D.O.D,AGE,GENDER,RURAL,TYPE OF ADMISSION-EMERGENCY/OPD,month year,DURATION OF STAY,...,CONGENITAL,UTI,NEURO CARDIOGENIC SYNCOPE,ORTHOSTATIC,INFECTIVE ENDOCARDITIS,DVT,CARDIOGENIC SHOCK,SHOCK,PULMONARY EMBOLISM,CHEST INFECTION
0,1,234735,4/1/2017,4/3/2017,81,M,R,E,Apr-17,3,...,0,0,0,0,0,0,0,0,0,0
1,2,234696,4/1/2017,4/5/2017,65,M,R,E,Apr-17,5,...,0,0,0,0,0,0,0,0,0,0
2,3,234882,4/1/2017,4/3/2017,53,M,U,E,Apr-17,3,...,0,0,0,0,0,0,0,0,0,0
3,4,234635,4/1/2017,4/8/2017,67,F,U,E,Apr-17,8,...,0,0,0,0,0,0,0,0,0,0
4,5,234486,4/1/2017,4/23/2017,60,F,U,E,Apr-17,23,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# Data Overview

print("DATA OVERVIEW")
admissions_df.info()  # info() prints its report directly and returns None, so print() is not needed

DATA OVERVIEW
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15757 entries, 0 to 15756
Data columns (total 56 columns):
 #   Column                           Non-Null Count  Dtype 
---  ------                           --------------  ----- 
 0   SNO                              15757 non-null  int64 
 1   MRD No.                          15757 non-null  object
 2   D.O.A                            15757 non-null  object
 3   D.O.D                            15757 non-null  object
 4   AGE                              15757 non-null  int64 
 5   GENDER                           15757 non-null  object
 6   RURAL                            15757 non-null  object
 7   TYPE OF ADMISSION-EMERGENCY/OPD  15757 non-null  object
 8   month year                       15757 non-null  object
 9   DURATION OF STAY                 15757 non-null  int64 
 10  duration of intensive unit stay  15757 non-null  int64 
 11  OUTCOME                          15757 non-null  object
 12  SMOKING           

**Initial Observations:**

- The dataset contains 56 columns (attributes) and 15,757 entries (patient admission records).
- Many columns are not relevant to the questions in this analysis, so they are removed during preprocessing.
- There are two data types: object (categorical data) and int64 (numerical data).

**DATA PREPROCESSING**

In [22]:
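# columns to keep
# .copy() returns an independent DataFrame, so later in-place edits do not trigger SettingWithCopyWarning

admissions_df = admissions_df[['MRD No.', 'AGE', 'GENDER', 'TYPE OF ADMISSION-EMERGENCY/OPD', 'D.O.A', 'PRIOR CMP']].copy()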
admissions_df.head()

Unnamed: 0,MRD No.,AGE,GENDER,TYPE OF ADMISSION-EMERGENCY/OPD,D.O.A,PRIOR CMP
0,234735,81,M,E,4/1/2017,0
1,234696,65,M,E,4/1/2017,0
2,234882,53,M,E,4/1/2017,0
3,234635,67,F,E,4/1/2017,0
4,234486,60,F,E,4/1/2017,1


In [23]:
# rename columns 

admissions_df.rename(columns={
    'MRD No.': 'patient_id',
    'AGE': 'age',
    'GENDER': 'gender',
    'TYPE OF ADMISSION-EMERGENCY/OPD': 'admission_type',
    'D.O.A': 'admission_date',
    'PRIOR CMP': 'previous_visits'
}, inplace=True)

admissions_df.head()


Unnamed: 0,patient_id,age,gender,admission_type,admission_date,previous_visits
0,234735,81,M,E,4/1/2017,0
1,234696,65,M,E,4/1/2017,0
2,234882,53,M,E,4/1/2017,0
3,234635,67,F,E,4/1/2017,0
4,234486,60,F,E,4/1/2017,1


In [24]:
admissions_df.info()  # get the data overview again

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15757 entries, 0 to 15756
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   patient_id       15757 non-null  object
 1   age              15757 non-null  int64 
 2   gender           15757 non-null  object
 3   admission_type   15757 non-null  object
 4   admission_date   15757 non-null  object
 5   previous_visits  15757 non-null  int64 
dtypes: int64(2), object(4)
memory usage: 738.7+ KB


In [25]:
# convert 'admission_date' to datetime

admissions_df['admission_date'] = pd.to_datetime(admissions_df['admission_date'], errors='coerce')

# extract "admission_day" from "admission_date"

admissions_df['admission_day'] = admissions_df['admission_date'].dt.day_name()

admissions_df.head()

Unnamed: 0,patient_id,age,gender,admission_type,admission_date,previous_visits,admission_day
0,234735,81,M,E,2017-04-01,0,Saturday
1,234696,65,M,E,2017-04-01,0,Saturday
2,234882,53,M,E,2017-04-01,0,Saturday
3,234635,67,F,E,2017-04-01,0,Saturday
4,234486,60,F,E,2017-04-01,1,Saturday


In [26]:
admissions_df.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"

# Numeric Data Stats
print("NUMERIC DATA STATS")
print(admissions_df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15757 entries, 0 to 15756
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   patient_id       15757 non-null  object        
 1   age              15757 non-null  int64         
 2   gender           15757 non-null  object        
 3   admission_type   15757 non-null  object        
 4   admission_date   10102 non-null  datetime64[ns]
 5   previous_visits  15757 non-null  int64         
 6   admission_day    10102 non-null  object        
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 861.8+ KB
NUMERIC DATA STATS
                age                 admission_date  previous_visits
count  15757.000000                          10102     15757.000000
mean      61.426160  2018-02-08 14:55:19.897049856         0.154471
min        4.000000            2017-01-04 00:00:00         0.000000
25%       54.000000            2017-08-06 00:00:

- After removing the irrelevant columns and adding admission_day, 7 columns (attributes) remain.
- The dataset still contains 15,757 rows (entries).
- There are three data types: object, integer, and datetime, so the dataset is a mix of numerical, categorical, and time data.
- The average patient age is approximately 61.43 years, with a minimum of 4 years and a maximum of 110 years.
- With a mean of 61.43 years and a 25th percentile of 54 years, most patients are older adults.
- Most patients appear to be first-time visitors, since the mean of previous_visits is low (approximately 0.15).
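
The mean alone can be pulled around by very young or very old patients, so quartiles and a simple share usually give a sturdier picture of how old the patient population is. The sketch below uses a small made-up list of ages (not the hospital data) and an arbitrary cutoff of 60 years; both are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ages for illustration; replace with admissions_df['age'] in the notebook.
ages = pd.Series([4, 35, 54, 58, 62, 66, 70, 75, 81, 110], name='age')

# Quartiles describe the spread without being dominated by extreme values.
age_quartiles = ages.quantile([0.25, 0.5, 0.75])

# Share of patients at or above an (assumed) cutoff of 60 years.
share_60_plus = (ages >= 60).mean()

print(age_quartiles)
print(f"Share aged 60 or older: {share_60_plus:.0%}")
```

If the median sits close to the mean and the share above the cutoff is well over half, the "older population" reading is supported by more than one statistic.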

In [27]:
# Data Cleaning 

# check missing values
print("MISSING VALUES")
print(admissions_df.isnull().sum())

# check for duplicated values
print("\nDUPLICATED VALUES")
print(admissions_df.duplicated().sum())

MISSING VALUES
patient_id            0
age                   0
gender                0
admission_type        0
admission_date     5655
previous_visits       0
admission_day      5655
dtype: int64

DUPLICATED VALUES
1409


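- The admission_date and admission_day columns each have 5,655 missing values. These appeared only after the datetime conversion with errors='coerce' (admission_date had 15,757 non-null values before it), so they are date strings that failed to parse rather than dates absent from the source file.
- The dataset has 1,409 duplicated rows.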
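
Because the missing dates only show up after parsing, it can be worth checking which raw strings failed before treating them as missing. The following is a minimal sketch, assuming the raw column mixes several date formats. The sample strings and the `%m/%d/%Y` format are illustrative guesses, not values confirmed from the dataset.

```python
import pandas as pd

# Hypothetical raw values; the real D.O.A column may contain other formats.
raw_dates = pd.Series(['4/1/2017', '4/23/2017', '13-04-2017', '2017-04-15'])

# Parse with one explicit format; anything that does not match becomes NaT.
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y', errors='coerce')

# Keep the original strings that could not be parsed, to see what they look like.
failed = raw_dates[parsed.isna()]

print(f"{parsed.isna().sum()} of {len(raw_dates)} values failed to parse")
print(failed.tolist())
```

If the failed strings turn out to follow a second, consistent format, they could be parsed in a second pass with that format instead of being imputed.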

In [28]:
# Handling Duplicated values

admissions_df = admissions_df.drop_duplicates()
admissions_df.duplicated().sum()

np.int64(0)
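
Before dropping rows, it may help to look at every copy of a duplicated record, since identical rows on a reduced set of columns are not always data-entry errors. The sketch below uses a small made-up DataFrame (the values are hypothetical) to show the difference between counting duplicates and viewing them.

```python
import pandas as pd

# Hypothetical records; only the first two rows are exact duplicates of each other.
sample_df = pd.DataFrame({
    'patient_id': ['A1', 'A1', 'B2', 'C3', 'C3'],
    'age': [60, 60, 45, 70, 71],
})

# duplicated() marks the second and later copies only (keep='first' is the default).
n_extra_copies = sample_df.duplicated().sum()

# keep=False marks every copy, which is more useful for inspection.
all_copies = sample_df[sample_df.duplicated(keep=False)]

# drop_duplicates() keeps the first copy of each duplicated row.
deduped_df = sample_df.drop_duplicates()

print(n_extra_copies)
print(all_copies)
print(len(deduped_df))
```

Rows that share a patient_id but differ in another column, such as the two C3 rows here, are kept because they are not full duplicates.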

In [30]:
# Handling Missing values

most_frequent_date = admissions_df['admission_date'].mode()[0]
admissions_df['admission_date'] = admissions_df['admission_date'].fillna(most_frequent_date)

# re extract admission_day from admission_date

admissions_df['admission_day'] = admissions_df['admission_date'].dt.day_name()

# check again for missing values
admissions_df.isnull().sum()

patient_id         0
age                0
gender             0
admission_type     0
admission_date     0
previous_visits    0
admission_day      0
dtype: int64

- Since 35.9% of the admission dates (5,655 of 15,757) were missing, dropping those rows would lose a significant share of the data, so I imputed them with the most frequent admission date to preserve the dataset. One caveat: this assigns every imputed row to a single date, which inflates the counts for that date and its weekday in any trend analysis.
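
The effect of mode imputation on weekday counts can be checked directly by comparing the counts from observed dates only with the counts after filling. This is a minimal sketch on made-up dates (three observed, three missing), not the hospital data:

```python
import pandas as pd

# Hypothetical admission dates: 2017-04-01 is a Saturday, 2017-04-03 is a Monday.
dates = pd.to_datetime(pd.Series(
    ['2017-04-01', '2017-04-01', '2017-04-03', None, None, None]
))

# Weekday counts from observed dates only (value_counts skips missing values).
observed_counts = dates.dt.day_name().value_counts()

# Weekday counts after filling every missing date with the most frequent date.
most_frequent_date = dates.mode()[0]
imputed_counts = dates.fillna(most_frequent_date).dt.day_name().value_counts()

print(observed_counts)
print(imputed_counts)
```

In this toy case, all three imputed rows land on the same weekday as the mode, so that weekday's count grows while the others stay fixed. Comparing both sets of counts on the real data would show how much of any weekday pattern comes from imputation.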

**Exploratory Data Analysis: Insights into Patient Admissions at the Hospital**