## PROJECT: Predicting no-show appointments

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
> In this project I have chosen the No Shows Medical Appointments dataset, which can be retrieved from [Medical Appointment No Shows](https://www.kaggle.com/joniarroba/noshowappointments). The dataset consists of data from hospital appointments in neighborhoods of the Municipality of Vitória in the State of Espírito Santo, Brazil. In this experiment, I will perform data wrangling and an exploratory data analysis with visualizations, and lastly will examine which factors help us predict whether a patient who has booked an appointment will show up or not.

<a id='wrangling'></a>
## Data Wrangling 

> 1. Importing libraries 

In [11]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

> 2. Loading data using pandas 

In [12]:
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


> 3. Checking for duplicated rows, the number of rows and columns, the data types, and the summary info of the data (for any null values).

In [14]:
print(df.duplicated().sum())

0


In [15]:
df.shape

(110527, 14)

In [16]:
df.dtypes

PatientId         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighbourhood      object
Scholarship         int64
Hipertension        int64
Diabetes            int64
Alcoholism          int64
Handcap             int64
SMS_received        int64
No-show            object
dtype: object

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
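
> As a small illustration of the checks above, here is a minimal sketch on a made-up frame (the values and the `toy` name are assumptions, not rows from the dataset). It counts missing values per column explicitly and contrasts fully duplicated rows with duplicates on a single key column.

```python
import pandas as pd

# Hypothetical toy frame standing in for the appointments data.
toy = pd.DataFrame({
    "AppointmentID": [5642903, 5642503, 5642503, 5642828],
    "Age": [62, 56, 57, None],
    "No-show": ["No", "No", "No", "Yes"],
})

# Missing values per column (the complement of the non-null counts in info()).
null_counts = toy.isnull().sum()

# Rows that are identical in every column.
full_dupes = toy.duplicated().sum()

# Rows that repeat an earlier value of one key column only.
key_dupes = toy.duplicated(subset=["AppointmentID"]).sum()

print(null_counts)
print(full_dupes, key_dupes)
```

> A frame can have no fully duplicated rows and still repeat a key, so checking both may be worthwhile.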


> 4. Checking for and editing misspelled column names

In [19]:
for i, v in enumerate(df.columns):
    print(i, v)

0 PatientId
1 AppointmentID
2 Gender
3 ScheduledDay
4 AppointmentDay
5 Age
6 Neighbourhood
7 Scholarship
8 Hipertension
9 Diabetes
10 Alcoholism
11 Handcap
12 SMS_received
13 No-show


> From the result, here are the column names that I will edit to avoid confusion in the future:
'Hipertension': 'Hypertension', 
'Handcap': 'Handicap', 
'SMS_received': 'SMSReceived', 
'No-show': 'NoShow'

In [20]:
df = df.rename(columns={'Hipertension': 'Hypertension', 'Handcap': 'Handicap', 'SMS_received': 'SMSReceived', 'No-show': 'NoShow'})

> Testing df with the newly edited column names

In [21]:
df.head(1)

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMSReceived,NoShow
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No


> 5. Iterate through the columns to quickly check whether there are any odd values.

In [18]:
# Print the unique values of every column to spot odd entries
for col in df.columns:
    print(df[col].unique())

[2.98724998e+13 5.58997777e+14 4.26296230e+12 ... 7.26331493e+13
 9.96997666e+14 1.55766317e+13]
[5642903 5642503 5642549 ... 5630692 5630323 5629448]
['F' 'M']
['2016-04-29T18:38:08Z' '2016-04-29T16:08:27Z' '2016-04-29T16:19:04Z' ...
 '2016-04-27T16:03:52Z' '2016-04-27T15:09:23Z' '2016-04-27T13:30:56Z']
['2016-04-29T00:00:00Z' '2016-05-03T00:00:00Z' '2016-05-10T00:00:00Z'
 '2016-05-17T00:00:00Z' '2016-05-24T00:00:00Z' '2016-05-31T00:00:00Z'
 '2016-05-02T00:00:00Z' '2016-05-30T00:00:00Z' '2016-05-16T00:00:00Z'
 '2016-05-04T00:00:00Z' '2016-05-19T00:00:00Z' '2016-05-12T00:00:00Z'
 '2016-05-06T00:00:00Z' '2016-05-20T00:00:00Z' '2016-05-05T00:00:00Z'
 '2016-05-13T00:00:00Z' '2016-05-09T00:00:00Z' '2016-05-25T00:00:00Z'
 '2016-05-11T00:00:00Z' '2016-05-18T00:00:00Z' '2016-05-14T00:00:00Z'
 '2016-06-02T00:00:00Z' '2016-06-03T00:00:00Z' '2016-06-06T00:00:00Z'
 '2016-06-07T00:00:00Z' '2016-06-01T00:00:00Z' '2016-06-08T00:00:00Z']
[ 62  56   8  76  23  39  21  19  30  29  22  28  54  15  50  4

> From the result of this iteration: 

<ol> <li><strong>ScheduledDay</strong> and <strong>AppointmentDay</strong> will be changed to datetime.</li><li><strong>PatientId</strong> is a float and will need to be changed to an integer.</li><li>The odd patient ages are 0 and -1. Assuming 0 is a newborn infant and -1 is a fetus, I decided to drop the row with age -1.</li>
</ol>

> I will call the new df ver1_df.

In [55]:
# Convert ScheduledDay and AppointmentDay to datetime (date only, time of day discarded)
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay']).dt.date.astype('datetime64[ns]')
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay']).dt.date.astype('datetime64[ns]')

# Convert PatientId to integer
df['PatientId'] = df['PatientId'].astype('int64')
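
> The two conversions can be tried out on a few sample values first. The sketch below uses made-up values in the same format as the dataset (`raw`, `ids` and the other names are assumptions). It shows one alternative way to keep only the calendar date, and checks that the float IDs are whole numbers before casting, because `astype('int64')` silently truncates any fractional part.

```python
import pandas as pd

# Hypothetical sample values in the same ISO 8601 format as the dataset.
raw = pd.Series(["2016-04-29T18:38:08Z", "2016-04-27T13:30:56Z"])

# Parse as timezone-aware UTC timestamps, drop the timezone, then
# truncate to midnight so that only the calendar date remains.
days = pd.to_datetime(raw, utc=True).dt.tz_localize(None).dt.normalize()
print(days.tolist())

# Hypothetical float identifiers.
ids = pd.Series([29872500000000.0, 558997800000000.0])

# Confirm that no identifier carries a fractional part before casting.
all_whole = bool((ids % 1 == 0).all())
ids_int = ids.astype("int64")
print(all_whole, ids_int.tolist())
```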

> Data with patient age -1

In [56]:
df.loc[df['Age'] == -1]

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMSReceived,NoShow
99832,465943158731293,5775010,F,2016-06-06,2016-06-06,-1,ROMÃO,0,0,0,0,0,0,No


In [57]:
# Drop the row with age -1 and call the resulting df ver1_df
ver1_df = df.drop(df[df['Age'] == -1].index)

> Testing if the data has been dropped

In [58]:
# Check that the row with age -1 has been dropped
ver1_df.loc[ver1_df['Age'] == -1]

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMSReceived,NoShow
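
> The same cleanup can also be written as a filter on a condition instead of a row label. This is a minimal sketch on a made-up frame (`toy` and `cleaned` are assumed names); the check at the end is run against the cleaned frame itself.

```python
import pandas as pd

# Hypothetical toy frame with one invalid (negative) age.
toy = pd.DataFrame({"Age": [62, 0, -1, 8]}, index=[10, 11, 12, 13])

# Keep only the rows with a non-negative age; .copy() returns an
# independent frame, so later edits do not touch the original.
cleaned = toy[toy["Age"] >= 0].copy()

# Verify the result using the cleaned frame's own column.
assert (cleaned["Age"] >= 0).all()
print(len(toy) - len(cleaned), "row(s) removed")
```

> Filtering on the condition keeps working if the row order or the index changes, which a hard-coded row label does not.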


> 6. Save ver1_df to the folder

In [59]:
ver1_df.to_csv('ver1_df.csv', index=False)

> 7. Checking the columns and their unique values to decide which columns to edit or drop

In [63]:
#14 features
for i, v in enumerate(ver1_df.columns):
    print(i, v)

0 PatientId
1 AppointmentID
2 Gender
3 ScheduledDay
4 AppointmentDay
5 Age
6 Neighbourhood
7 Scholarship
8 Hypertension
9 Diabetes
10 Alcoholism
11 Handicap
12 SMSReceived
13 NoShow


In [78]:
# Print the unique values of each categorical column
for col in ['Gender', 'Scholarship', 'Hypertension', 'Diabetes',
            'Alcoholism', 'Handicap', 'SMSReceived', 'NoShow']:
    print(ver1_df[col].unique())

['F' 'M']
[0 1]
[1 0]
[0 1]
[0 1]
[0 1 2 3 4]
[0 1]
['No' 'Yes']
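
> Unique values show which levels exist but not how common each one is. As a sketch on made-up data (the `toy` frame and its values are assumptions), `value_counts()` gives the frequency of every level in a column.

```python
import pandas as pd

# Hypothetical toy frame with two of the categorical columns.
toy = pd.DataFrame({
    "Handicap": [0, 0, 1, 2, 0],
    "NoShow": ["No", "Yes", "No", "No", "No"],
})

# Frequency of every level, one Series per column.
counts_by_col = {col: toy[col].value_counts() for col in toy.columns}

for col, counts in counts_by_col.items():
    print(col)
    print(counts)
```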


<ol>
    <li> <strong>PatientId</strong> and <strong>AppointmentID</strong> will be dropped because they are the hospital's randomly generated numbers. If the data included anything associated with price or medical coverage, that factor would be very interesting to explore.</li>
    <li> There are 11 independent variables (excluding PatientId and AppointmentID) and 1 dependent variable, which is <strong>NoShow</strong>.</li>
    <li> <strong>Handicap</strong> has 5 unique values (0 to 4) and <strong>Gender</strong> has the values F and M, while Scholarship, Hypertension, Diabetes, Alcoholism and SMSReceived are 0/1 flags stored as integers. Because these integers encode categories rather than quantities, I will convert them, along with Handicap, to object type.</li>
</ol>
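
> The first two points can be sketched as follows on a made-up frame with fewer columns than the real data (`toy`, `features` and `target` are assumed names): the identifier columns are set aside, NoShow is kept as the outcome, and the remaining columns are the candidate factors.

```python
import pandas as pd

# Hypothetical toy frame with a subset of the dataset's columns.
toy = pd.DataFrame({
    "PatientId": [1, 2, 3],
    "AppointmentID": [10, 11, 12],
    "Age": [62, 56, 8],
    "SMSReceived": [0, 1, 0],
    "NoShow": ["No", "Yes", "No"],
})

# Candidate factors: everything except the identifiers and the outcome.
features = toy.drop(columns=["PatientId", "AppointmentID", "NoShow"])

# Outcome column.
target = toy["NoShow"]

print(list(features.columns))
print(target.tolist())
```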

In [79]:
# Convert the categorical indicator columns to object type
cat_cols = ['Scholarship', 'Hypertension', 'Diabetes',
            'Alcoholism', 'Handicap', 'SMSReceived']
ver1_df = ver1_df.astype({col: 'object' for col in cat_cols})
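
> pandas also has a dedicated `category` dtype for columns like these. The sketch below is only an alternative to consider, not what this notebook does, and it runs on a made-up frame (`toy`, `flag_cols` and `converted` are assumed names).

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns.
toy = pd.DataFrame({
    "Age": [62, 56, 8],
    "Handicap": [0, 2, 0],
    "SMSReceived": [0, 1, 0],
})

# Convert only the listed columns; Age stays numeric.
flag_cols = ["Handicap", "SMSReceived"]
converted = toy.astype({col: "category" for col in flag_cols})

print(converted.dtypes)
print(converted["Handicap"].cat.categories.tolist())
```

> A categorical column records its set of levels explicitly, which can be convenient when grouping or plotting by those levels later.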

In [80]:
#confirming the datatypes
ver1_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 110526 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype         
---  ------          --------------   -----         
 0   PatientId       110526 non-null  int64         
 1   AppointmentID   110526 non-null  int64         
 2   Gender          110526 non-null  object        
 3   ScheduledDay    110526 non-null  datetime64[ns]
 4   AppointmentDay  110526 non-null  datetime64[ns]
 5   Age             110526 non-null  int64         
 6   Neighbourhood   110526 non-null  object        
 7   Scholarship     110526 non-null  object        
 8   Hypertension    110526 non-null  object        
 9   Diabetes        110526 non-null  object        
 10  Alcoholism      110526 non-null  object        
 11  Handicap        110526 non-null  object        
 12  SMSReceived     110526 non-null  object        
 13  NoShow          110526 non-null  object        
dtypes: datetime64[ns](2), int64(3), obje