# Project: No-Show Medical Appointments Data Analysis
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.

- ‘ScheduledDay’ tells us on what day the patient set up their appointment.
- ‘Neighbourhood’ indicates the location of the hospital.
- ‘Scholarship’ indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.
- Be careful about the encoding of the last column, ‘No-show’: it says ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.
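
Because ‘No’ in the last column means the patient did attend, it can help to derive an explicitly named boolean column before any analysis. The following is a minimal sketch on a tiny made-up frame; the `showed_up` column name and the sample rows are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy rows for illustration only; the real data is loaded from the CSV later on.
toy = pd.DataFrame({"No-show": ["No", "Yes", "No", "No"]})

# 'No' means the patient did NOT miss the appointment, i.e. they showed up.
toy["showed_up"] = toy["No-show"].map({"No": True, "Yes": False})

print(toy)
```

With a column like this, `toy["showed_up"].mean()` reads directly as the share of kept appointments.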
 


### Question(s) for Analysis
What factors are important for us to know in order to predict if a patient will show up for their scheduled appointment?


In [1]:
# Import the packages used throughout the analysis.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Magic command so that plots render inline in the notebook
# (written without a space after '%'). See:
#   http://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline

In [2]:
# Upgrade pandas to use dataframe.explode() function. 
!pip install --upgrade pandas==0.25.0

Requirement already up-to-date: pandas==0.25.0 in /opt/conda/lib/python3.6/site-packages (0.25.0)


<a id='wrangling'></a>
## Data Wrangling


### General Properties


In [3]:
#load data
df=pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [4]:
#exploring shape of data
df.shape

(110527, 14)

In [5]:
# summary statistics: count, mean, std, min, quartiles and max
df.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0,110527.0
mean,147496300000000.0,5675305.0,37.088874,0.098266,0.197246,0.071865,0.0304,0.022248,0.321026
std,256094900000000.0,71295.75,23.110205,0.297675,0.397921,0.258265,0.171686,0.161543,0.466873
min,39217.84,5030230.0,-1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4172614000000.0,5640286.0,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31731840000000.0,5680573.0,37.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94391720000000.0,5725524.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


The maximum age is 115, the minimum age is -1 (an invalid value), and the mean age is about 37.
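
A negative age is easy to miss when reading a summary table by eye. As a hedged sketch, out-of-range values can be listed with an explicit range check. The toy ages and the plausible range of 0 to 120 below are assumptions chosen for illustration.

```python
import pandas as pd

# Toy ages for illustration only (not rows from the dataset).
toy = pd.DataFrame({"Age": [62, -1, 8, 115, 37]})

# Keep the rows whose age falls outside an assumed plausible range [0, 120].
invalid_age = toy[~toy["Age"].between(0, 120)]

print(invalid_age)
```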

In [6]:
# check whether there are any missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


There are no missing values: every column has 110527 non-null entries.
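
Reading non-null counts from `df.info()` works, but missing values can also be counted per column directly. This is a small sketch on made-up data that, unlike the real dataset, deliberately contains gaps.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberate gaps, for illustration only.
toy = pd.DataFrame({"Age": [62, np.nan, 8], "Gender": ["F", "M", None]})

# isna() flags missing cells; summing the flags counts them per column.
missing_per_column = toy.isna().sum()

print(missing_per_column)
```

A result of zero in every column corresponds to the "no missing values" finding above.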

In [7]:
# inspect the rows that have a negative age
drp=df[df['Age']<0]
drp

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
99832,465943200000000.0,5775010,F,2016-06-06T08:58:13Z,2016-06-06T00:00:00Z,-1,ROMÃO,0,0,0,0,0,0,No



### Data Cleaning
 

In [8]:
# removing the age of -1
df.drop(index=99832,inplace=True)

In [9]:
# count rows that repeat an earlier (PatientId, No-show) combination
df.duplicated(['PatientId','No-show']).sum()

38710

There are 38710 rows in which a patient ID appears again with the same show/no-show status as in an earlier row.

In [10]:
df.drop_duplicates(['PatientId','No-show'], inplace=True)

This removes the rows that repeat a patient ID with the same show/no-show status, so each patient is kept at most once per status.
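
To make the effect of deduplicating on a subset of columns concrete, here is a minimal sketch on made-up rows. The patient IDs are hypothetical; only the `duplicated` and `drop_duplicates` calls mirror the cells above.

```python
import pandas as pd

# Toy appointments for illustration only: patient 1 has three, patient 2 has two.
toy = pd.DataFrame({
    "PatientId": [1, 1, 1, 2, 2],
    "No-show":   ["No", "No", "Yes", "No", "No"],
})

# duplicated() marks every repeat of a (PatientId, No-show) pair after the first.
n_repeats = toy.duplicated(["PatientId", "No-show"]).sum()

# drop_duplicates() keeps the first row of each pair by default.
deduped = toy.drop_duplicates(["PatientId", "No-show"])

print(n_repeats)
print(deduped)
```

Note that patient 1 still appears twice afterwards, once with ‘No’ and once with ‘Yes’, because the two rows differ in status.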

In [11]:
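# drop rows with missing values (a no-op here, since none were found);
# inplace=True is needed because dropna() otherwise returns a new frame
df.dropna(inplace=True)
df.shape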

(71816, 14)

In [12]:
df.describe()

Unnamed: 0,PatientId,AppointmentID,Age,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received
count,71816.0,71816.0,71816.0,71816.0,71816.0,71816.0,71816.0,71816.0,71816.0
mean,146624900000000.0,5666493.0,36.527501,0.095536,0.195068,0.070959,0.025036,0.020135,0.335566
std,254491700000000.0,73130.83,23.378262,0.293956,0.396256,0.256758,0.156236,0.155338,0.472191
min,39217.84,5030230.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4175956000000.0,5631622.0,17.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,31894250000000.0,5672882.0,36.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,94574870000000.0,5716567.0,55.0,0.0,0.0,0.0,0.0,0.0,1.0
max,999981600000000.0,5790484.0,115.0,1.0,1.0,1.0,1.0,4.0,1.0


In [13]:
# drop the columns that are not needed for the analysis
df.drop(['PatientId', 'AppointmentID','ScheduledDay','AppointmentDay'], axis = 1, inplace = True)

In [14]:
# fix the misspelled column name
df.rename(columns = {'Hipertension' : 'hypertension'}, inplace = True)

<a id='eda'></a>
## Exploratory Data Analysis

Now that the data has been wrangled and cleaned, we explore and visualize it.

### General Exploration

In [None]:
# histograms of all numeric columns for a first overview
df.hist(figsize=(9,8));

In [None]:
df.info()

In [None]:
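# rename the target column, then build boolean masks for the two groups
df.rename(columns = {'No-show' : 'No_show'}, inplace = True)
showing = df.No_show == 'No'      # patient attended
no_showing = df.No_show == 'Yes'  # patient missed the appointment
# non-null counts per column: overall, attended, missed
df.count(), df[showing].count(), df[no_showing].count()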

About four times as many patients showed up as did not.
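
The ratio between the two groups can be computed rather than read off the counts. This is a hedged sketch on made-up rows; the toy data is constructed so that the ratio is exactly 4 and says nothing about the real dataset.

```python
import pandas as pd

# Toy target column for illustration only.
toy = pd.DataFrame({"No_show": ["No", "No", "No", "No", "Yes"]})

# Absolute counts and relative shares of each status.
counts = toy["No_show"].value_counts()
shares = toy["No_show"].value_counts(normalize=True)

# How many attended appointments there are per missed one.
ratio = counts["No"] / counts["Yes"]

print(counts, shares, ratio, sep="\n")
```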

In [None]:
# distribution of a column for patients who showed up vs. those who did not
def attending(df, column_name, att, absent):
    # att / absent are boolean masks selecting the two groups
    df[column_name][att].hist(alpha=0.5, bins=10, label='show')
    df[column_name][absent].hist(alpha=0.5, bins=10, label='no_show')
    plt.title('Number of patients by ' + column_name + ' and attendance')
    plt.ylabel('Number of patients')
    plt.xlabel(column_name)
    plt.legend();

attending(df, 'Age', showing, no_showing)

In every age range, more patients showed up than did not.
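
Since the attending group is larger overall, raw counts will favour it in every age range. A complementary view is the share of missed appointments within each age group. The sketch below uses made-up rows, and the bin edges, labels and the `missed` and `age_group` column names are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Age":     [3, 7, 25, 30, 45, 70, 80, 66],
    "No_show": ["No", "Yes", "Yes", "No", "No", "No", "No", "Yes"],
})

# Assumed bin edges; right=False makes each bin include its left edge.
bins = [0, 18, 40, 65, 120]
labels = ["0-17", "18-39", "40-64", "65+"]
toy["age_group"] = pd.cut(toy["Age"], bins=bins, labels=labels, right=False)

# The mean of a boolean flag is the share of missed appointments per group.
toy["missed"] = toy["No_show"] == "Yes"
rate_by_age = toy.groupby("age_group", observed=False)["missed"].mean()

print(rate_by_age)
```

Comparing rates instead of counts keeps a large age group from looking more reliable simply because it has more appointments.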

In [None]:
# mean age per (Diabetes, hypertension) combination, for each attendance group;
# both series share the same axes, so alpha keeps overlapping bars visible
df[showing].groupby(['Diabetes','hypertension']).Age.mean().plot(kind='bar', color='green', alpha=0.5, label='showing');
df[no_showing].groupby(['Diabetes','hypertension']).Age.mean().plot(kind='bar', color='red', alpha=0.5, label='no_showing');
plt.title('Mean age by chronic disease and attendance')
plt.xlabel('(Diabetes, hypertension)')
plt.ylabel('Average Age')
plt.legend();

The presence of chronic diseases does not appear to affect patient attendance.
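
The bar chart above compares mean ages, so a more direct check of this observation is the no-show rate for each combination of conditions. The following is a minimal sketch on made-up rows; the `missed` column is a helper introduced here for illustration.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Diabetes":     [0, 0, 0, 1, 1, 0, 1, 0],
    "hypertension": [0, 0, 1, 1, 1, 1, 0, 0],
    "No_show": ["No", "Yes", "No", "No", "Yes", "No", "No", "No"],
})

# Share of missed appointments for every (Diabetes, hypertension) combination,
# reshaped so rows are Diabetes values and columns are hypertension values.
toy["missed"] = toy["No_show"] == "Yes"
rate = toy.groupby(["Diabetes", "hypertension"])["missed"].mean().unstack()

print(rate)
```

If the rates in such a table are close to each other on the real data, that would support the observation; large differences would contradict it.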

In [None]:
# mean of every numeric column for each attendance group
df[showing].mean(numeric_only=True), df[no_showing].mean(numeric_only=True)

The average age of patients who showed up is 37, versus 34 for patients who did not.
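
The same comparison can be written as a single grouped aggregation, which avoids keeping two boolean masks in sync. This is a sketch on made-up ages, so the numbers it prints are not the dataset's.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Age":     [50, 30, 20, 40],
    "No_show": ["No", "No", "Yes", "Yes"],
})

# Mean age for each value of the target column.
mean_age = toy.groupby("No_show")["Age"].mean()

print(mean_age)
```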

In [None]:
# how does gender relate to attendance? gender split among patients who showed up
plt.figure(figsize=[12,8])
df['Gender'][showing].value_counts().plot(kind='pie', label='show')
plt.title('Share of female and male patients among those who attended')
plt.ylabel('number of patients')
plt.xlabel('Gender')
plt.legend();

The pie chart shows the share of each gender among the patients who attended.
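
A pie chart of attendees reflects how many appointments each gender has overall, not how reliably each gender attends. One way to separate the two, sketched here on made-up rows, is a cross-tabulation normalized within each gender.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Gender":  ["F", "F", "F", "M", "M", "F"],
    "No_show": ["No", "No", "Yes", "No", "Yes", "No"],
})

# normalize="index" turns each row (gender) into shares that sum to 1.
table = pd.crosstab(toy["Gender"], toy["No_show"], normalize="index")

print(table)
```

Each row then answers "of this gender's appointments, what fraction was kept or missed".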

In [None]:
# distribution of SMS_received (0 = no SMS, 1 = SMS received) by attendance
df['SMS_received'][showing].hist(alpha=0.5, bins=10, label='show')
df['SMS_received'][no_showing].hist(alpha=0.5, bins=10, label='no_show')
plt.title('Number of patients by SMS received and attendance')
plt.ylabel('Number of patients')
plt.xlabel('SMS_received')
plt.legend();

Among the patients who attended, more had not received an SMS than had received one.

<a id='conclusions'></a>
## Conclusions

- More female patients than male patients showed up.
- More patients showed up than did not.
- In absolute counts, the younger age groups have more patients who showed up.
- With or without a chronic disease, more patients showed up than did not.
- Age affects the number of patients who show up: the 0-10 age group has the most patients who showed up, while patients above 65 have the fewest.

## Limitations
No clear correlation was found between attendance and age, chronic diseases, or gender.
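
A statement about correlation can be backed by an actual coefficient. The sketch below computes the Pearson correlation between a 0/1 no-show flag and two numeric columns on made-up rows. Treat it as a rough linear measure only; the toy values are assumptions and say nothing about the real dataset.

```python
import pandas as pd

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Age":          [5, 20, 35, 50, 65, 80],
    "hypertension": [0, 0, 0, 1, 1, 1],
    "No_show":      ["Yes", "Yes", "No", "Yes", "No", "No"],
})

# Encode the target as 1 = missed, 0 = attended.
missed = (toy["No_show"] == "Yes").astype(int)

# Pearson correlation of each numeric column with the no-show flag.
correlations = toy[["Age", "hypertension"]].corrwith(missed)

print(correlations)
```

Coefficients near zero on the real data would be consistent with the limitation stated above.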

## Submitting your Project 


In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])