## 1. Background

This dataset collects information from roughly 110,000 medical appointments in Brazil (110,527 rows) and focuses on whether patients show up for their appointments. Each row records a number of characteristics of the patient. This project analyzes several aspects of the dataset.

## 2. Import Libraries and Read the CSV File

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import datetime as dt
import networkx as nx

%matplotlib inline

In [2]:
# load the dataset into a pandas DataFrame
df_original = pd.read_csv('C:\\Users\\raz37388\\Desktop\\udacity-git-course\\new-git-project\\No_show_appointment\\noshowappointments-kagglev2-may-2016.csv')

# make a copy of the original dataset; the copy will be used for the analysis
df = df_original.copy()

# preview the first rows of the dataset
df.head(20)

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No
5,95985130000000.0,5626772,F,2016-04-27T08:36:51Z,2016-04-29T00:00:00Z,76,REPÚBLICA,0,1,0,0,0,0,No
6,733688200000000.0,5630279,F,2016-04-27T15:05:12Z,2016-04-29T00:00:00Z,23,GOIABEIRAS,0,0,0,0,0,0,Yes
7,3449833000000.0,5630575,F,2016-04-27T15:39:58Z,2016-04-29T00:00:00Z,39,GOIABEIRAS,0,0,0,0,0,0,Yes
8,56394730000000.0,5638447,F,2016-04-29T08:02:16Z,2016-04-29T00:00:00Z,21,ANDORINHAS,0,0,0,0,0,0,No
9,78124560000000.0,5629123,F,2016-04-27T12:48:25Z,2016-04-29T00:00:00Z,19,CONQUISTA,0,0,0,0,0,0,No


In [3]:
#number of rows and columns of the dataset
df.shape

(110527, 14)

In [4]:
# Columns of the dataset and data type 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


## 3. Meaning of the column headers

**1. PatientId:** indicates the patient ID; duplication is possible due to cases where the same patient booked more than one appointment.

**2. AppointmentID:** indicates the appointment ID; this field should be unique.

**3. Gender:** indicates the patient's gender (M/F).

**4. ScheduledDay:** indicates the date/time the patient set up (booked) their appointment.

**5. AppointmentDay:** indicates the date of the actual appointment, i.e. the day the patient is expected to visit.

**6. Age:** indicates the patient's age.

**7. Neighbourhood:** indicates the location of the hospital.

**8. Scholarship:** indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.

**9. Hipertension:** indicates whether or not the patient has hypertension.

**10. Diabetes:** indicates whether or not the patient has diabetes.

**11. Alcoholism:** indicates whether or not the patient suffers from alcoholism.

**12. Handcap:** indicates whether or not the patient has special needs.

**13. SMS_received:** indicates whether or not the patient has received a reminder text message.

**14. No-show:** ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.
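
Because the `No-show` encoding is easy to misread, a derived boolean column can make the meaning explicit. The following is only a minimal sketch on a hand-made toy frame, not the real data; the column name `showed_up` is a hypothetical helper introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only)
sample = pd.DataFrame({"No-show": ["No", "Yes", "No"]})

# 'No' means the patient attended, so showed_up is True when No-show == 'No'.
# `showed_up` is a hypothetical helper column, not part of the original data.
sample["showed_up"] = sample["No-show"].eq("No")

print(sample)
```

Whether to keep the original column or work with a derived one is a design choice; the rest of this notebook keeps the original values.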

## 4. Assessment of the Dataset

In [5]:
#check if there is any duplicate data
df[df.duplicated()]

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show


In [6]:
# check for duplicate rows; should return False since there are none
df.duplicated().any()

False
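
The column descriptions state that `AppointmentID` should be unique while `PatientId` may repeat. A row-level duplicate check does not confirm this on its own, so a per-column check could be sketched as follows. The frame below is toy data with made-up IDs, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the real dataset (IDs are made up)
sample = pd.DataFrame({
    "PatientId": [111.0, 111.0, 222.0],
    "AppointmentID": [5001, 5002, 5003],
})

# Each appointment should appear exactly once
appointment_ids_unique = sample["AppointmentID"].is_unique

# Number of rows that belong to patients with more than one appointment
repeat_rows = int(sample["PatientId"].duplicated(keep=False).sum())

print(appointment_ids_unique, repeat_rows)
```

On the real data the same two expressions would be applied to `df` before the ID columns are dropped.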

In [7]:
# view missing value count for each feature of the dataset
df.isnull().sum()

PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64

In [8]:
# check whether any column has null values; should return False since there are none
df.isnull().sum().any()

False

## 5. Data Quality Issues

**1. Convert all DataFrame headers to lower case.**

In [9]:
#convert the dataframe column headers to lower case
df.columns = df.columns.str.lower()

**2. Change the following headers:**

   - A. PatientId to **patient_id**
     
   - B. AppointmentID to **appointment_id**
     
   - C. ScheduledDay to **scheduled_day**
     
   - D. AppointmentDay to **appointment_day**
     
   - E. No-show to **no_show**

In [10]:
# mapping from the lower-cased column names to the new names
new_names = {"patientid":"patient_id","appointmentid":"appointment_id","scheduledday":"scheduled_day",
             "appointmentday":"appointment_day","no-show":"no_show"}

In [11]:
#renames the df columns as per the new_names
df = df.rename(columns= new_names)
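
One caveat with this approach is that `rename` silently ignores keys that do not match any column, so a typo in the mapping goes unnoticed. The sketch below, on an empty toy frame with a few of the original headers, shows one possible safeguard using the `errors="raise"` option of `DataFrame.rename`. It is an illustration under these assumptions, not the notebook's own method.

```python
import pandas as pd

# Empty toy frame with a few of the original headers (no real data needed)
sample = pd.DataFrame(columns=["PatientId", "AppointmentID", "No-show", "Age"])

new_names = {
    "patientid": "patient_id",
    "appointmentid": "appointment_id",
    "no-show": "no_show",
}

# Lower-case first, then rename; errors="raise" turns a mistyped key into a KeyError
sample.columns = sample.columns.str.lower()
sample = sample.rename(columns=new_names, errors="raise")

print(list(sample.columns))
```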

**3. Replace 'M' with 'Male' and 'F' with 'Female' in the gender column.**

In [12]:
#replacing the F and M strings with Female and Male
df['gender'] = df['gender'].replace({'M':'Male','F':'Female'})

#check the unique values in the gender column- should return Female and Male
df.gender.unique()

array(['Female', 'Male'], dtype=object)

**4. The scheduled_day and appointment_day columns are strings. To analyze time, we need to convert these two columns to datetime format.**

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
patient_id         110527 non-null float64
appointment_id     110527 non-null int64
gender             110527 non-null object
scheduled_day      110527 non-null object
appointment_day    110527 non-null object
age                110527 non-null int64
neighbourhood      110527 non-null object
scholarship        110527 non-null int64
hipertension       110527 non-null int64
diabetes           110527 non-null int64
alcoholism         110527 non-null int64
handcap            110527 non-null int64
sms_received       110527 non-null int64
no_show            110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [14]:
# convert the dates to datetime format
df['scheduled_day']= pd.to_datetime(df.scheduled_day)
df['appointment_day']= pd.to_datetime(df.appointment_day)
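
To see what the conversion does, the following minimal sketch parses two timestamp strings taken from the preview above. It assumes a reasonably recent pandas version, where the trailing `Z` is expected to produce timezone-aware (UTC) values; older versions may return naive timestamps instead.

```python
import pandas as pd

# Two ISO 8601 strings in the same format as the raw columns
raw = pd.Series(["2016-04-29T18:38:08Z", "2016-04-29T00:00:00Z"])

# Parse the strings into datetime values
parsed = pd.to_datetime(raw)

# Confirm the result is a datetime dtype and whether it carries a timezone
is_datetime = pd.api.types.is_datetime64_any_dtype(parsed)
is_tz_aware = parsed.dt.tz is not None

print(parsed.dtype, is_datetime, is_tz_aware)
```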

In [15]:
# finding the year from the date
df['year'] = df['scheduled_day'].dt.year

# Extracting Scheduled Day of the week
df['day'] = df['scheduled_day'].dt.strftime('%a')

In [16]:
# create the month columns; each holds the abbreviated month name (e.g. 'Apr')
df['scheduled_month'] = df['scheduled_day'].dt.strftime('%b')
df['appointment_month'] = df['appointment_day'].dt.strftime('%b')


#extracting hour- will return a value from 0 to 23 
df['scheduled_hour'] = df['scheduled_day'].dt.hour.astype(int)
df['appointment_hour'] = df['appointment_day'].dt.hour.astype(int)
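
The names returned by `strftime('%a')` and `strftime('%b')` depend on the system locale. As a hedged alternative, the sketch below derives the same kind of features with `dt.day_name()` and `dt.month_name()`, which default to English. The two timestamps are copied from the preview above, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Two scheduling timestamps taken from the preview of the dataset
scheduled = pd.to_datetime(
    pd.Series(["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"])
)

# Abbreviated day and month names, independent of the system locale
days = scheduled.dt.day_name().str[:3]
months = scheduled.dt.month_name().str[:3]

# Hour of the day as an integer from 0 to 23
hours = scheduled.dt.hour

print(days.tolist(), months.tolist(), hours.tolist())
```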

In [17]:
# extracting the day of the month (currently disabled)
#df['scheduled_date'] = df['scheduled_day'].dt.day
#df['appointment_date'] = df['appointment_day'].dt.day

In [18]:
# drop the ID columns and the raw datetime columns, which are no longer needed
df = df.drop(['patient_id', 'appointment_id', 'scheduled_day', 'appointment_day'], axis=1)

In [19]:
#see if the dataframe columns have changed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 16 columns):
gender               110527 non-null object
age                  110527 non-null int64
neighbourhood        110527 non-null object
scholarship          110527 non-null int64
hipertension         110527 non-null int64
diabetes             110527 non-null int64
alcoholism           110527 non-null int64
handcap              110527 non-null int64
sms_received         110527 non-null int64
no_show              110527 non-null object
year                 110527 non-null int64
day                  110527 non-null object
scheduled_month      110527 non-null object
appointment_month    110527 non-null object
scheduled_hour       110527 non-null int32
appointment_hour     110527 non-null int32
dtypes: int32(2), int64(8), object(6)
memory usage: 12.6+ MB


In [20]:
df.head(3)

Unnamed: 0,gender,age,neighbourhood,scholarship,hipertension,diabetes,alcoholism,handcap,sms_received,no_show,year,day,scheduled_month,appointment_month,scheduled_hour,appointment_hour
0,Female,62,JARDIM DA PENHA,0,1,0,0,0,0,No,2016,Fri,Apr,Apr,18,0
1,Male,56,JARDIM DA PENHA,0,0,0,0,0,0,No,2016,Fri,Apr,Apr,16,0
2,Female,62,MATA DA PRAIA,0,0,0,0,0,0,No,2016,Fri,Apr,Apr,16,0


In [21]:
df_patient = df.copy() 

In [22]:
# Store the dataset
%store df_patient

Stored 'df_patient' (DataFrame)


**Note: After this preliminary wrangling, the dataset is ready to be explored. However, we will continue to wrangle it as needed.**