
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

We start by importing the packages needed for the analysis, then explore the data to spot any problems. After that, we clean and prepare the data for exploratory data analysis.

In [None]:
# import packages
import numpy as np  # for scientific computing
import pandas as pd  # for data manipulation
import matplotlib.pyplot as plt  # for data visualization
import seaborn as sns  # for more advanced data visualization

Questions:
1. Does the show-up rate differ by gender? Males may have a higher probability of showing up because of social norms that restrict the movement of women.

2. How does the scholarship affect the probability of showing up? Those who were eligible for the scholarship may have shown up more often because they can afford the costs or because they fear losing the scholarship.

3. AppointmentDay: how does the delay between scheduling and the appointment affect show-up?

<a id='wrangling'></a>
## Data Wrangling

In [None]:
# import the data
df = pd.read_csv("../input/noshowappointments/KaggleV2-May-2016.csv")

In [None]:
# display the first 10 rows
df.head(10)

In [None]:
# display the names of all the columns
pd.DataFrame({"column_name": df.columns})

In [None]:
#display number of rows and columns
df.shape

In [None]:
# basic info about the data: number of rows and columns, data types, missing values ..etc
df.info()

#### There are no missing values
#### As we can see, the data types need to be fixed for several columns:

1. PatientId and AppointmentID are currently numeric, but they should be strings (object) because they have no numeric meaning.
2. ScheduledDay and AppointmentDay should be converted from object to datetime.

#### Other remarks:
1. No-show should be recoded and converted into an integer.
2. ScheduledDay and AppointmentDay: for simplicity, we keep only the date and drop the time.


In [None]:
# convert ID variables into objects
df[["PatientId", "AppointmentID"]] = df[["PatientId", "AppointmentID"]].astype(str)

# deleting the time and keeping the date only
for col in ["AppointmentDay", "ScheduledDay"]:
    df[col] = df[col].apply(lambda x: x.split("T")[0])
    
# converting date variables from object to datetime
for col in ["ScheduledDay", "AppointmentDay"]:
    df[col] = pd.to_datetime(df[col])
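
One caveat about the ID conversion above: if an ID column was read as a float (an assumption here; check `df.dtypes` to confirm), `astype(str)` keeps a trailing `.0`. The sketch below uses made-up IDs and assumes they are whole numbers, so casting to an integer first is safe.

```python
import pandas as pd

# Made-up IDs for illustration; stored as floats, as can happen with large numeric IDs.
ids = pd.Series([29872499824296.0, 558997776694438.0])

as_str_direct = ids.astype(str)                 # float -> str keeps the decimal part
as_str_clean = ids.astype("int64").astype(str)  # int -> str gives a plain digit string

print(as_str_direct.tolist())
print(as_str_clean.tolist())
```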

In [None]:
# rename No-show and recode it to avoid any misconception

# rename the column
df.rename(columns = {"No-show": "Show"}, inplace = True)

# recode it: "No" in No-show means the patient showed up, so it becomes 1
labels = {"No": 1, "Yes":0}
df["Show"] = df["Show"].map(labels)

#converting it into integer
df["Show"] = df["Show"].astype(int)
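
Because `map` returns NaN for any value missing from the dictionary, it can help to check for unmapped values before casting to integer. This is a minimal sketch on a toy series; `"Maybe"` is a deliberately invalid entry and does not come from the dataset.

```python
import pandas as pd

# Toy column standing in for the original "No-show" values.
raw = pd.Series(["No", "Yes", "No", "Maybe"])
labels = {"No": 1, "Yes": 0}

mapped = raw.map(labels)        # values missing from the dict become NaN
unmapped = raw[mapped.isna()]   # entries that failed to map

print(unmapped.tolist())

# Cast to int only once nothing is left unmapped.
clean = mapped.dropna().astype(int)
print(clean.tolist())
```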

In [None]:
#changing the name of Gender and encoding it to prevent any misconception

#changing the name
df.rename(columns = {"Gender": "Male"}, inplace = True)

#encoding it
labels = {"M": 1, "F":0}
df["Male"] = df["Male"].map(labels)

#converting it into integer
df["Male"] = df["Male"].astype(int)

In [None]:
# Let's have a look at the data types again
df.info()

**All data types have been corrected; let's now check for duplicates**

In [None]:
df.duplicated().sum()

# there are no duplicates

#### Checking for extreme values

In [None]:
pd.DataFrame({"min":df.min(), "max":df.max()})

The minimum age is -1, but age cannot be less than 0.

Let's explore it further and look at that specific patient.

In [None]:
df[df["Age"] < 0]

In [None]:
# drop the row(s) with a negative age
df.drop(df[df["Age"] < 0].index, inplace = True)

# confirm that no negative ages remain
df[df["Age"] < 0]

#### Convert age to a categorical variable to see its relationship with Show more clearly

In [None]:
df["Age_bins"] = pd.qcut(df.Age, 5, labels = ["0 - 12", "13 - 29", "30 - 44", "45 - 58", "59 - 115"])

In [None]:
df.Age_bins.unique()
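
The labels passed to `pd.qcut` above are typed by hand, so they are only accurate if they agree with the quantile edges pandas actually computes. One way to check is to ask `qcut` for the edges with `retbins=True`. The ages below are toy values for illustration only.

```python
import pandas as pd

# Toy ages for illustration only.
ages = pd.Series([0, 3, 12, 18, 25, 33, 41, 47, 52, 58, 64, 70, 81, 95, 115])

# retbins=True also returns the quantile edges that qcut used.
binned, edges = pd.qcut(ages, 5, retbins=True)

print(edges)                            # 6 edges for 5 bins
print(binned.value_counts(sort=False))  # counts per bin, in bin order
```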

#### Create a new variable for patients who have both diabetes and hypertension

In [None]:
df["diabetic_hiper"] = df["Diabetes"] * df["Hipertension"]

#### Create new variable for waiting days

In [None]:
# number of whole days between scheduling and the appointment
df["Wait_days"] = (df["AppointmentDay"] - df["ScheduledDay"]).dt.days

In [None]:
# Drop unreasonable wait days (appointment dated before it was scheduled)
drop_index = df[df["Wait_days"] < 0].index
df.drop(drop_index, inplace = True)

# confirm that no negative waits remain
df[df["Wait_days"] < 0]

In [None]:
#Converting wait days into wait bins
df["Wait_days"] = pd.cut(df["Wait_days"], 10, labels = ["0-17", "18-34", "35-53", "54-71", "72-89", "90-106", "107-124", "125-142", "143-160", "161-179"])
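
With an integer bin count, `pd.cut` picks equal-width edges on its own, so hand-written labels may not line up exactly with them. A possible alternative, sketched here with toy waiting times and hypothetical edges, is to pass the edges explicitly so that every label matches its range.

```python
import pandas as pd

# Toy waiting times in days, for illustration only.
wait = pd.Series([0, 1, 2, 7, 17, 18, 30, 45, 90, 179])

# Explicit (hypothetical) edges; include_lowest=True keeps 0 in the first bin.
edges = [0, 17, 34, 53, 179]
labels = ["0-17", "18-34", "35-53", "54-179"]
wait_bins = pd.cut(wait, bins=edges, labels=labels, include_lowest=True)

print(wait_bins.value_counts(sort=False))
```

Storing the result in a separate column, rather than overwriting the numeric one, would also keep the day counts available for later summaries.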

In [None]:
df.head()

<a id='eda'></a>
## Exploratory Data Analysis

In [None]:
# Summary Statistics
df.describe()

**Main Remarks:**

The mean age of all patients is 37.

For all the patients:

1. 9.8 percent have the scholarship
2. 35 percent are male
3. 19.7 percent have high blood pressure
4. 7.2 percent are diabetic
5. 3 percent drink alcohol
6. 2.2 percent are handicapped
7. 32.1 percent received an SMS
8. The show-up rate is 79.8 percent
9. 5.9 percent have both diabetes and hypertension
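
These percentages follow from the fact that the mean of a 0/1 column is the share of ones. A minimal sketch on toy data (the column names mirror the dataset, the values are made up):

```python
import pandas as pd

# Toy 0/1 indicator columns; values are made up for illustration.
toy = pd.DataFrame({
    "Scholarship": [0, 1, 0, 0],
    "SMS_received": [1, 1, 0, 0],
    "Show": [1, 1, 1, 0],
})

# For a 0/1 column, mean * 100 is the percentage of rows equal to 1.
shares = toy.mean().mul(100).round(1)
print(shares)
```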

In [None]:
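# Correlation between Show and other variables
# (numeric columns only, so text, date and categorical columns do not break corr())
df.select_dtypes("number").corr()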

In [None]:
# Visualize correlation matrix (numeric columns only)
plt.figure(figsize = (8, 4), dpi = 100)
sns.heatmap(df.select_dtypes("number").corr(), vmin = -1, vmax = 1, cmap = "viridis", linewidths=0.01, annot=True)

There is no strong correlation between Show and any other feature. However, Hipertension is strongly and positively correlated with both Age and Diabetes.

### Answers to the questions

#### Question 1: How does gender affect show-up?

In [None]:
# Show-up rate by gender
male_impact = pd.pivot_table(data = df, index = "Male", values = "Show")
round(male_impact * 100, 2) 

In [None]:
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x=df.Male, y=df.Show)
plt.show()

In [None]:
df.columns 

In [None]:
# Does receiving an SMS make a difference between genders?
male_sms_impact = pd.pivot_table(data = df, index = ["Male", "SMS_received"], values = "Show")
round(male_sms_impact * 100, 2)

In [None]:
# Does being diabetic make a difference between genders?
male_diab_impact = pd.pivot_table(data = df, index = ["Male", "Diabetes"], values = "Show")
round(male_diab_impact * 100, 2)

In [None]:
# Does age make a difference between genders?
male_age_impact = pd.pivot_table(data = df, index = ["Male", "Age_bins"], values = "Show")
round(male_age_impact * 100, 2)

In [None]:
# Does the scholarship make a difference between genders?
male_scholar_impact = pd.pivot_table(data = df, index = ["Male", "Scholarship"], values = "Show")
round(male_scholar_impact * 100, 2)

#### Answer to Question 1:
When it comes to show-up, there is no significant difference between males and females regardless of:
1. Age
2. Being diabetic or not
3. Receiving an SMS or not
4. Receiving the scholarship or not

Therefore, we can conclude that gender has no significant impact on show-up.
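
The comparison above is based on reading the pivot tables. If a formal check is wanted, one option (not part of the original analysis) is a two-sided two-proportion z-test. The sketch below uses only the standard library; the helper name `two_proportion_z_test` and the counts are hypothetical, not taken from the dataset.

```python
import math

def two_proportion_z_test(shows_a, n_a, shows_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = shows_a / n_a, shows_b / n_b
    pooled = (shows_a + shows_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical counts for two groups, for illustration only.
z, p = two_proportion_z_test(shows_a=800, n_a=1000, shows_b=790, n_b=1000)
print(round(z, 3), round(p, 3))
```

A large p-value would be consistent with no difference between the groups. With very large samples, even tiny differences can come out as statistically significant, so the size of the gap still matters.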

#### Question 2: How does the scholarship affect the probability of showing up?

In [None]:
# The relationship between Show and Scholarship
scholar_impact = pd.pivot_table(data = df, index = "Scholarship", values = "Show")
round(scholar_impact * 100, 2) 

In [None]:
#graph the result
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x=df.Scholarship, y=df.Show)
plt.show()

In [None]:
#Scholar Age Impact
scholar_age_impact = pd.pivot_table(data = df, index = ["Scholarship", "Age_bins"], values = "Show")
round(scholar_age_impact * 100, 2)

In [None]:
#Scholar Gender Impact
scholar_gender_impact = pd.pivot_table(data = df, index = ["Scholarship", "Male"], values = "Show")
round(scholar_gender_impact * 100, 2) 

In [None]:
#Scholar Hipertension Impact
scholar_hiper_impact = pd.pivot_table(data = df, index = ["Scholarship", "Hipertension"], values = "Show")
round(scholar_hiper_impact * 100, 2) 

In [None]:
#Scholar Diabetes Impact
scholar_diab_impact = pd.pivot_table(data = df, index = ["Scholarship", "Diabetes"], values = "Show")
round(scholar_diab_impact * 100, 2) 

In [None]:
#Scholar Alcoholism Impact
scholar_alco_impact = pd.pivot_table(data = df, index = ["Scholarship", "Alcoholism"], values = "Show")
round(scholar_alco_impact * 100, 2)

#### Answer to question 2:
At first glance, the scholarship seems to have a negative impact on show-up, which is counterintuitive. On further investigation, we found that patients with diabetes or hypertension who received the scholarship had higher show-up rates. By contrast, the scholarship had a negative impact when the beneficiary was alcoholic.
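
Subgroup comparisons like these can be easier to read when the scholarship groups sit side by side. A minimal sketch on toy data (the column names mirror the notebook, the values are made up) that puts `Scholarship` in the columns and adds the within-subgroup difference:

```python
import pandas as pd

# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "Scholarship": [0, 0, 0, 0, 1, 1, 1, 1],
    "Diabetes":    [0, 0, 1, 1, 0, 0, 1, 1],
    "Show":        [1, 0, 1, 0, 0, 0, 1, 1],
})

# One row per Diabetes value, one column per Scholarship value.
rates = pd.pivot_table(toy, index="Diabetes", columns="Scholarship", values="Show")
rates["difference"] = rates[1] - rates[0]  # scholarship gap within each subgroup

print((rates * 100).round(1))
```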

#### Question 3: How does delay affect show-up?

In [None]:
# The relationship between Show and waiting days
wait_impact = pd.pivot_table(data = df, index = "Wait_days", values = "Show")
round(wait_impact * 100, 2) 

In [None]:
# share of appointments in each waiting-time bin, largest first
wait_days = df.Wait_days.value_counts(normalize = True).sort_values(ascending = False)[:30]
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x = wait_days.index, y = wait_days.values)
plt.xticks(rotation = 90)
plt.show()

In [None]:
#graph the result
plt.figure(figsize = (8,4), dpi = 100)
sns.barplot(x=df.Wait_days, y=df.Show)
plt.show()

#### Answer to Question 3:
Delay has a negative but inconsistent impact on show-up.
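
Part of the inconsistency may come from bins that contain few appointments, where the rate is noisy. That is a possibility to check, not a finding. A sketch on toy data that reports each bin's rate together with its size:

```python
import pandas as pd

# Toy data; the bin labels and values are made up for illustration.
toy = pd.DataFrame({
    "Wait_days": ["0-17"] * 6 + ["18-34"] * 3 + ["35-53"] * 1,
    "Show":      [1, 1, 1, 1, 0, 1, 1, 0, 1, 0],
})

# Show-up rate per bin alongside the number of appointments behind it.
summary = toy.groupby("Wait_days")["Show"].agg(rate="mean", n="count")
print(summary)
```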

<a id='conclusions'></a>
## Conclusions

**Impact of Gender**

When it comes to show-up, there is no significant difference between males and females regardless of:
1. Age
2. Being diabetic or not
3. Receiving an SMS or not
4. Receiving the scholarship or not

Therefore, we can conclude that gender has no significant impact on show-up.


**Impact of scholarship**

At first glance, the scholarship seems to have a negative impact on show-up, which is counterintuitive. On further investigation, we found that patients with diabetes or hypertension who received the scholarship had higher show-up rates. By contrast, the scholarship had a negative impact when the beneficiary was alcoholic.


**Impact of delay days**

Delay has a negative but inconsistent impact on show-up.
