

# Project: Medical Appointments No-Show Dataset Investigation

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

**Dataset Description**:

- The dataset covers medical appointments and records whether each patient showed up or not.
- The features are well structured and clearly tied to the outcome (show/no-show).
- Several patient-level features may affect whether a patient shows up for an appointment:
    * Patient's age.
    * Patient's gender.
    * Patient's handicap status.
    * Patient's scholarship: indicates whether the patient is enrolled in the Brazilian welfare program.

These features lead to the following Research Questions:

### Research Questions: 

- Is the show/no-show status affected by the patient's age?
- Is the show/no-show status related to the patient's gender?
- Are patients with scholarships more likely to show up for their appointments?
- Do patients with a handicap tend to show up or not?
        
        

### Packages Importing

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# One call is enough: a later style setting overrides an earlier one.
sns.set(style="whitegrid", color_codes=True)
%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

### Dataset Loading:

In [2]:
df_noshow = pd.read_csv('noshowappointments-kagglev2-may-2016(1).csv')
df_noshow.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


### Dataset Assessment:

#### Data Types:

- ***String***: Gender, ScheduledDay, AppointmentDay, Neighbourhood, No-show.
- ***float***: PatientId.
- ***int***: AppointmentID, Age, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received.

#### Columns and Rows:
14 columns and 110,527 rows.

In [3]:
df_noshow.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


#### Checking whether columns are in the right format:
***The ScheduledDay and AppointmentDay columns need to be converted to datetime in the cleaning stage:***

In [4]:
df_noshow[['ScheduledDay','AppointmentDay']].head(2)

| | ScheduledDay | AppointmentDay |
|---|---|---|
| 0 | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z |
| 1 | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z |
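
As a minimal sketch of the conversion planned for the cleaning stage, the two sample rows shown above can be parsed with `pd.to_datetime`. The `sample` frame is a hypothetical stand-in for `df_noshow`, built only for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for df_noshow, using the values shown above.
sample = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-29T16:08:27Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
})

# Before conversion both columns hold plain strings.
print(sample.dtypes)

# pd.to_datetime parses the ISO 8601 strings; the trailing "Z" marks UTC.
for column in ["ScheduledDay", "AppointmentDay"]:
    sample[column] = pd.to_datetime(sample[column])

# After conversion the .dt accessor becomes available.
print(sample.dtypes)
print(sample["ScheduledDay"].dt.hour.tolist())
```

Once the columns are datetimes, components such as the hour or weekday can be extracted through the `.dt` accessor.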


#### Checking for missing values:
***No Column has missing data points:***

In [5]:
sum(df_noshow.isnull().sum())

0

#### Checking for Duplicates:
***No duplicates were found in the dataset:***

In [6]:
sum(df_noshow.duplicated())

0

#### Descriptive Statistics (Searching for outliers)

- A minimum age value of -1 was found. This data point can be considered an outlier.
- Handcap is expected to be 1 or 0. However, it has a maximum of 4. Any value greater than 1 is treated as an outlier.

In [7]:
df_noshow.describe()

| | PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 |
| mean | 147496300000000.0 | 5675305.0 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.0304 | 0.022248 | 0.321026 |
| std | 256094900000000.0 | 71295.75 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 39217.84 | 5030230.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4172614000000.0 | 5640286.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 31731840000000.0 | 5680573.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 94391720000000.0 | 5725524.0 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 999981600000000.0 | 5790484.0 | 115.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 |


### Data Cleaning

#### Removing Outliers
***Removing Age Outliers:***

There is one Age outlier:

In [None]:
df_noshow.query('Age < 0').count()

In [None]:
df_noshow.drop(df_noshow[(df_noshow.Age < 0)].index, inplace = True)
df_noshow.info()

***Removing Handicap Outliers:***

There are 199 handicap outliers:

In [None]:
df_noshow.query('Handcap > 1').count()

In [None]:
df_noshow.drop(df_noshow[(df_noshow.Handcap > 1)].index, inplace = True)
df_noshow.info()
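
The two removals above can also be expressed as a single boolean mask. This is only a sketch of an alternative to `drop(..., inplace=True)`, run on a hypothetical four-row frame named `toy` rather than on `df_noshow`.

```python
import pandas as pd

# Hypothetical rows: one negative age and one Handcap value above 1.
toy = pd.DataFrame({
    "Age": [62, -1, 8, 40],
    "Handcap": [0, 0, 2, 1],
})

# A row is kept only if it passes both validity rules used in this notebook.
valid = (toy["Age"] >= 0) & (toy["Handcap"] <= 1)

# Number of rows that fail at least one rule.
print((~valid).sum())

# Filtering returns a new frame and leaves the original untouched.
cleaned = toy[valid].reset_index(drop=True)
print(cleaned)
```

Building the mask first makes it easy to count the rows that will be removed before committing to the change.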

#### Changing the data format of some columns:


In [None]:
df_noshow['ScheduledDay'] = pd.to_datetime(df_noshow['ScheduledDay'])
df_noshow['AppointmentDay'] = pd.to_datetime(df_noshow['AppointmentDay'])
df_noshow.info()

<a id='eda'></a>
## Exploratory Data Analysis

### Investigating each variable:

***Adding more columns to serve the analysis***

Age Groups column:

In [None]:
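# Note: pd.cut uses right-closed bins by default, so the groups are (0, 18], (18, 35], (35, 55], (55, 115].
# Age == 0 falls outside the first bin and becomes NaN unless include_lowest=True is passed.
df_noshow.loc[:, 'Age_groups'] = pd.cut(df_noshow['Age'], bins=[0, 18, 35, 55, 115], labels=['Childs', 'Youngs', 'Adults', 'Seniors'])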
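
Because `pd.cut` treats bins as right-closed by default, it is worth checking how boundary ages are grouped. The sketch below uses a hypothetical `ages` series with the same bins and labels, and compares the default behaviour with `include_lowest=True`.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges.
ages = pd.Series([0, 5, 18, 19, 35, 60, 115])
bins = [0, 18, 35, 55, 115]
labels = ["Childs", "Youngs", "Adults", "Seniors"]

# Default: intervals are (0, 18], (18, 35], ... so age 0 is left ungrouped.
default_groups = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left as well.
inclusive_groups = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    "Age": ages,
    "default": default_groups,
    "include_lowest": inclusive_groups,
})
print(comparison)
```

Whether infants belong in the first group is an analysis decision; the point here is only that the default silently leaves them out.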

No-show encoded as 0 and 1, for clearer analysis in the multivariate questions:

In [None]:
df_noshow.loc[ :,'No-show-rates'] = df_noshow.loc[:,'No-show'].apply(lambda x: 0 if x == 'No' else 1 )
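
The same encoding can be sketched without `apply`, using a vectorized comparison. The `toy` frame is hypothetical, and the sketch assumes the column contains only `'Yes'` and `'No'`, in which case it matches the lambda above.

```python
import pandas as pd

# Hypothetical stand-in for the No-show column.
toy = pd.DataFrame({"No-show": ["No", "Yes", "No", "No", "Yes"]})

# 'Yes' means the patient missed the appointment, so it maps to 1.
toy["No-show-rates"] = (toy["No-show"] == "Yes").astype(int)
print(toy)

# The mean of a 0/1 column is the share of ones, i.e. the no-show rate.
print(toy["No-show-rates"].mean())
```

Encoding no-shows as 1 is what lets a plain `mean()` be read as a no-show rate later in the analysis.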

***Visualization Function***

In [None]:
def show_variable(variable_name):
    # normalize=True plots each value's share of all rows rather than raw counts.
    df_noshow[variable_name].value_counts(normalize=True).plot.bar(figsize=(8, 8), title=variable_name + ' - Yes or No', color='b')
    plt.xlabel(variable_name)
    plt.ylabel('Proportion')
    

**1- No-Show Variable:**
This is the main variable, which we relate to the other variables. The following figure shows that about 80% of the patients showed up for their appointments, while the other 20% did not.

In [None]:
show_variable ('No-show')
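
The percentages quoted for each variable come from normalized value counts. As a small sketch on a hypothetical ten-row series (not the real data), the shares can be read directly without plotting:

```python
import pandas as pd

# Hypothetical series: 8 attended appointments and 2 no-shows.
status = pd.Series(["No"] * 8 + ["Yes"] * 2, name="No-show")

# normalize=True returns fractions that sum to 1 instead of raw counts.
shares = status.value_counts(normalize=True)
print(shares)
```

Printing the shares next to a bar chart makes it easier to quote exact percentages in the text.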

**2- Gender Variable:**
Most of the patients are female. 

In [None]:
show_variable ('Gender')

**3- Handicap Variable:**
Very few patients in the dataset are handicapped.

In [None]:
show_variable ('Handcap')

**4- Hipertension Variable:**
About 20% of the patients have hypertension (spelled `Hipertension` in the dataset); the remaining 80% do not.

In [None]:
show_variable ('Hipertension')

**5- Alcoholism Variable:**
The vast majority of patients (about 97%) are not alcoholic.

In [None]:
show_variable ('Alcoholism')

**6- Diabetes Variable:**
About 93% of the patients don't have diabetes.

In [None]:
show_variable ('Diabetes')

**7- Scholarship Variable:**
Few patients (about 10%) receive a scholarship from the government. Its effect is examined later.

In [None]:
show_variable ('Scholarship')

**8- Neighbourhood Variable:**
More patients come from the JARDIM CAMBURI neighbourhood than from any other.

In [None]:
df_noshow['Neighbourhood'].value_counts().sort_values(ascending = False)[:10].plot.bar(figsize=(24,6), fontsize = 15.0, color = 'g')
plt.title('Neighbourhood', fontweight="bold", fontsize = 22.0)
plt.ylabel('Count', fontsize = 20.0)
plt.xlabel('Neighbourhood / Location', fontsize = 20.0)
plt.show()

**9- Age Variable:**
The age distribution is shown below, split into four age groups: Childs, Youngs, Adults and Seniors.

In [None]:
df_noshow['Age_groups'].value_counts().sort_values(ascending = False)[:4].plot.bar(figsize=(24,6), fontsize = 15.0, color = 'g')
plt.title('Age Distribution', fontweight="bold", fontsize = 22.0)
plt.ylabel('Count', fontsize = 20.0)
plt.xlabel('Age Groups', fontsize = 20.0)
plt.show()

## Answering Research Questions:


***Multivariate visualization Function***

In [None]:
def visualize_multi_variate(variable_name):
    # Grouped bar chart: one bar per No-show value for each category of the variable.
    variable = sns.countplot(x=variable_name, hue='No-show', data=df_noshow)
    variable.set_title('Patients with ' + variable_name + ' - Yes or No')
    plt.xlabel(variable_name + ' Status')
    plt.ylabel('Number of Visits')
    plt.show()

***Numerical Analysis Function***

In [None]:
def numerical_analysis(variable_name):
    # Select the 0/1 column before averaging so non-numeric columns are never aggregated.
    return df_noshow.groupby(variable_name)['No-show-rates'].mean()
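
To illustrate what this function returns, here is a minimal sketch on a hypothetical five-row frame. Grouping by a variable and averaging the 0/1 column gives the no-show rate for each group.

```python
import pandas as pd

# Hypothetical data: one of three female patients missed an appointment.
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M"],
    "No-show-rates": [0, 1, 0, 0, 0],
})

# Mean of the 0/1 column within each group = that group's no-show rate.
rates = toy.groupby("Gender")["No-show-rates"].mean()
print(rates)
```

Selecting the column before calling `mean()` avoids aggregating string or datetime columns, which recent pandas versions reject.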

### Research Question 1: Is the show/no-show status affected by the patient's age?
The analysis suggests that the younger the patient, the higher the chance of a no-show. The no-show rates for the two younger groups (Childs and Youngs) are 22.5% and 23.8%, respectively. In contrast, the older groups (Adults and Seniors) are more likely to show up, with no-show rates of 19.7% and 15.6%.

**Visualization:**

In [None]:
fig, ax = plt.subplots()
df_noshow.groupby('No-show')['Age'].mean().plot.bar()
plt.ylabel('Mean_Age');

**Numerical Analysis:**

In [None]:
numerical_analysis ('Age_groups')

### Research Question 2: Is the show/no-show status related to the patient's gender?
The graph below shows that gender has little effect on the show/no-show status: the no-show rates for the two genders are very close, at 20.3% and 19.9%.

**Visualization:**

In [None]:
visualize_multi_variate ('Gender')

**Numerical Analysis:**

In [None]:
numerical_analysis ('Gender')

### Research Question 3: Are patients with scholarships more likely to show up for their appointments?
Although patients with scholarships might be expected to show up more often, the opposite seems to occur: they miss appointments more often than the others. The no-show rate is 23.7% for patients with a scholarship and 19.8% for patients without one.


**Visualization:**

In [None]:
visualize_multi_variate ('Scholarship')

**Numerical Analysis:**

In [None]:
numerical_analysis ('Scholarship')

### Research Question 4: Do patients with a handicap tend to show up or not?
From the rates below, handicapped and other patients have fairly close no-show rates: 17.9% for handicapped patients and 20.2% for the others (the latter matches the overall rate, as expected for the group that makes up about 98% of the data). This suggests that the handicap condition doesn't affect the no-show rate much. However, the handicapped sample is too small to conclude a relationship based on it.
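
To make the sample-size caveat concrete, the sketch below uses the normal-approximation standard error of a proportion, sqrt(p(1 - p) / n). The helper name, the rate of 0.2 and the sample sizes are hypothetical values chosen for illustration, not figures taken from the dataset.

```python
import math


def proportion_standard_error(rate, n):
    """Normal-approximation standard error of an observed proportion."""
    return math.sqrt(rate * (1 - rate) / n)


# Hypothetical rate and sample sizes, for illustration only.
rate = 0.2
for n in (100, 2_000, 100_000):
    margin = 1.96 * proportion_standard_error(rate, n)
    print(f"n={n:>7}: {rate:.2f} +/- {margin:.3f}")
```

The margin shrinks with the square root of the group size, so a rate measured on a small group is less precise than one measured on a large group.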

**Visualization:** 

In [None]:
visualize_multi_variate ('Handcap')

**Numerical Analysis:**

In [None]:
numerical_analysis ('Handcap')

### Limitations and Discussions
- The data covers only April, May and June. With a full year of monthly data, a more thorough investigation would have been possible.
- Other features, such as long-term diseases and alcoholism, might be interesting for further investigation.
- A much bigger and more balanced dataset would allow better and more accurate conclusions.

<a id='conclusions'></a>
## Conclusions


1- Gender doesn't have a direct effect on the chances of a patient showing up or not.

2- Patients aged 35 and under (the Childs and Youngs groups) have the highest no-show rates, at 22.5% and 23.8%. In contrast, the Adults and Seniors groups are the most likely to attend their appointments, with no-show rates of 19.7% and 15.6%.

3- Contrary to expectation, patients with a scholarship did not attend more reliably: about 24% of them missed their appointments, compared with about 20% of the other patients.

4- The handicapped data is not enough to come up with a concrete relationship to the patient's appointment status.

In conclusion, among the variables examined, age is the most likely factor to affect the patient's appointment status.

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])