# Predicting Doctor Appointment No-Shows
***

## Table of Contents
* [Data Wrangling](#data_wrangling)
    * [General Properties](#general_properties)
    * [Data Cleaning](#data_cleaning)
        * [Appointment ID](#cleaning_appointment_id)
        * [Scheduled Day](#cleaning_scheduled_day)
        * [Appointment Day](#cleaning_appointment_day)
        * [Gender](#cleaning_gender)
        * [Age](#cleaning_age)
        * [Bolsa Familia](#cleaning_bolsa_familia)
        * [Hypertension](#cleaning_hypertension)
        * [Diabetes](#cleaning_diabetes)
        * [Number of Handicaps](#cleaning_number_handicaps)
        * [SMS Received](#cleaning_sms_received)
        * [No Show](#cleaning_no_show)
        * [Neighborhood](#cleaning_neighborhood)
        * [Patient ID](#cleaning_patient_id)
        
    

## Introduction



In this project, my aim is to determine what patient characteristics are connected with not showing up to a doctor's appointment by analyzing a data set of over 100K medical appointments in Vitória, Espírito Santo, Brazil provided by JoniHoppen on [Kaggle](https://www.kaggle.com/joniarroba/noshowappointments).

<a id='data_wrangling'></a>

## Data Wrangling
***

In [1]:
import pandas as pd
import numpy as np
import geocoder
API_KEY = ''

<a id='general_properties'></a>

### General Properties

In [4]:
df = pd.read_csv('KaggleV2-May-2016.csv')
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [5]:
# Number of records
n = len(df)
n

110527

In [6]:
# Rename fields to have more consistent formatting and to English where appropriate
df.columns = ['PatientID', 'AppointmentID', 'Gender', 'ScheduledDay',
              'AppointmentDay', 'Age', 'Neighborhood', 'BolsaFamilia', \
              'Hypertension', 'Diabetes', 'Alcoholism', 'NumHandicaps', \
              'SMSReceived', 'NoShow']

In [7]:
# Number of nulls
df.isnull().sum()

PatientID         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighborhood      0
BolsaFamilia      0
Hypertension      0
Diabetes          0
Alcoholism        0
NumHandicaps      0
SMSReceived       0
NoShow            0
dtype: int64

> There are no null records in any of the fields

In [8]:
# Data types
df.dtypes

PatientID         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighborhood       object
BolsaFamilia        int64
Hypertension        int64
Diabetes            int64
Alcoholism          int64
NumHandicaps        int64
SMSReceived         int64
NoShow             object
dtype: object

<a id='data_cleaning'></a>

### Data Cleaning

<a id='cleaning_appointment_id'></a>

#### Appointment ID

In [9]:
# All of the appointment IDs are 7 digits long, and are between these values
min_apptID = df['AppointmentID'].min()
max_apptID = df['AppointmentID'].max()
print('{} - {}'.format(min_apptID, max_apptID))

5030230 - 5790484


In [32]:
# AppointmentIDs are identifiers, so they are also converted to strings
df['AppointmentID'] = df['AppointmentID'].astype(str)

<a id='cleaning_scheduled_day'></a>

#### Scheduled Day

In [11]:
# Convert the scheduled day to a datetime
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay'])

# Verify that ScheduledDay is in a datetime format
df['ScheduledDay'].dtype

dtype('<M8[ns]')

In [12]:
# Check that the <M8[ns] is a datetime format
np.dtype('datetime64[ns]') == np.dtype('<M8[ns]')

True

In [13]:
# Verify that all records were converted to valid datetimes
np.isnat(df['ScheduledDay']).sum()

0

In [14]:
# The first date and time an appointment was scheduled 
np.min(df['ScheduledDay'])

Timestamp('2015-11-10 07:13:56')

In [15]:
# The last date and time an appointment was scheduled 
np.max(df['ScheduledDay'])

Timestamp('2016-06-08 20:07:23')

In [16]:
np.max(df['ScheduledDay']) - np.min(df['ScheduledDay'])

Timedelta('211 days 12:53:27')

> All appointments are created within an about 7-month time frame.

<a id='cleaning_appointment_day'></a>

#### Appointment Day

In [18]:
# Convert the appointment day to a datetime
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay'])

# Verify that AppointmentDay is in a datetime format
df['AppointmentDay'].dtype

dtype('<M8[ns]')

In [19]:
# Verify that all records were converted to valid datetimes
np.isnat(df['AppointmentDay']).sum()

0

In [21]:
# The first scheduled appointment
np.min(df['AppointmentDay'])

Timestamp('2016-04-29 00:00:00')

In [23]:
# The last scheduled appointment
np.max(df['AppointmentDay'])

Timestamp('2016-06-08 00:00:00')

In [24]:
np.max(df['AppointmentDay']) - np.min(df['AppointmentDay'])

Timedelta('40 days 00:00:00')

> While the appointments were scheduled during a 7-month period, they all were scheduled for a 40-day period.

<a id='cleaning_gender'></a>

#### Gender

In [28]:
# Gender data is clean
df['Gender'].value_counts()

F    71840
M    38687
Name: Gender, dtype: int64

<a id='cleaning_age'></a>

#### Age

In [None]:
age_counts = df['Age'].value_counts()
age_counts.index.sort_values()

In [None]:
id = df[df['Age'] == -1]['PatientID']
id = list(id)

In [None]:
df[df['PatientID'] == id[0]]

In [None]:
# Remove the record with age of -1
df = df.drop[df['Age'] == -1]

> I removed the record with the age of -1.  All of the other ages are plausible; although the age of 115 is improbable, it is still possible.

<a id='cleaning_bolsa_familia'></a>

#### Bolsa Família

In [33]:
# BolsaFamilia data is clean
df['BolsaFamilia'].value_counts()

0    99666
1    10861
Name: BolsaFamilia, dtype: int64

<a id='cleaning_hypertension'></a>

#### Hypertension

In [34]:
# Hypertension data is clean
df['Hypertension'].value_counts()

0    88726
1    21801
Name: Hypertension, dtype: int64

<a id='cleaning_diabetes'></a>

#### Diabetes

In [35]:
# Diabetes data is clean
df['Diabetes'].value_counts()

0    102584
1      7943
Name: Diabetes, dtype: int64

<a id='cleaning_alcoholism'></a>

#### Alcholism

In [36]:
# Alcoholism data is clean
df['Alcoholism'].value_counts()

0    107167
1      3360
Name: Alcoholism, dtype: int64

<a id='cleaning_number_handicaps'></a>

#### Number of Handicaps

In [37]:
# This represents the number of handicaps a person has (as defined by the publisher of the dataset)
# All people having between 0-4 handicaps seems reasonable
df['NumHandicaps'].value_counts()

0    108286
1      2042
2       183
3        13
4         3
Name: NumHandicaps, dtype: int64

<a id='cleaning_sms_received'></a>

#### SMS Received

In [38]:
# SMS Received data is clean
df['SMSReceived'].value_counts()

0    75045
1    35482
Name: SMSReceived, dtype: int64

<a id='cleaning_no_show'></a>

#### No Show

In [39]:
df['NoShow'].value_counts()

No     88208
Yes    22319
Name: NoShow, dtype: int64

In [40]:
# Convert NoShow to zeros and ones to harmonize with the way the other boolean fields are expressed
df['NoShow'] = np.where(df['NoShow'].values == 'Yes', 1, 0)
df['NoShow'].value_counts()

0    88208
1    22319
Name: NoShow, dtype: int64

<a id='cleaning_neighborhood'></a>

#### Neighborhood

In [41]:
neighborhood_counts = df['Neighborhood'].value_counts()
neighborhood_counts[:11]

JARDIM CAMBURI       7717
MARIA ORTIZ          5805
RESISTÊNCIA          4431
JARDIM DA PENHA      3877
ITARARÉ              3514
CENTRO               3334
TABUAZEIRO           3132
SANTA MARTHA         3131
JESUS DE NAZARETH    2853
BONFIM               2773
SANTO ANTÔNIO        2746
Name: Neighborhood, dtype: int64

In [42]:
# Number of unique neighborhoods
df['Neighborhood'].nunique()

81

In [44]:
# Organize the neighborhoods into a dataframe
geo = pd.DataFrame(neighborhood_counts.index, columns = ['neighborhood'])
geo.head()

Unnamed: 0,neighborhood
0,JARDIM CAMBURI
1,MARIA ORTIZ
2,RESISTÊNCIA
3,JARDIM DA PENHA
4,ITARARÉ


In [53]:
# Geocode the neighborhood data
for i, row in geo.iterrows():
    # Use Bing
    full_neighborhood = row.str.title() + ', Vitória, Espírito Santo, Brazil'
    result = geocoder.bing(full_neighborhood, key=API_KEY)
    geo.at[i, 'status'] = result.status
    #geo.set_value(i, 'lat', result.lat)

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Jardim+Camburi%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DJardim%2BCamburi%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOk%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DVit%25C3%25B3Ria%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DEs%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DBrazil%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=Vit%C3%B3ria%2C+V

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Santa+Martha%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DSanta%2BMartha%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOver_Query_Limit%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DNan%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=nan&q=nan&q=nan&o=json&inclnb=1&key=ArM0cQoiG2wDASYT7dhCsWjYFKjIHq4wwKa7lLuMYspeXTdJnXimERp8Nsot5G0I&maxResults=1
Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http:/

Status code 429 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 429 Client Error: Too Many Requests for url: http://dev.virtualearth.net/REST/v1/Locations?q=Ilha+Do+Pr%C3%ADncipe%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DIlha%2BDo%2BPr%25C3%25Adncipe%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOk%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DVit%25C3%25B3Ria%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DEs%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DBrazil%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan

Status code 429 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 429 Client Error: Too Many Requests for url: http://dev.virtualearth.net/REST/v1/Locations?q=Maru%C3%ADpe%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DMaru%25C3%25Adpe%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOk%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DVit%25C3%25B3Ria%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DEs%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DBrazil%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=Vit%C3%B3ria%

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Cruzamento%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DCruzamento%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOk%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DVit%25C3%25B3Ria%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DEs%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DBrazil%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=Vit%C3%B3ria%2C+Vit%C3%B3ri

Status code 429 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 429 Client Error: Too Many Requests for url: http://dev.virtualearth.net/REST/v1/Locations?q=Inhanguet%C3%A1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DInhanguet%25C3%25A1%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOk%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DVit%25C3%25B3Ria%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DEs%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DBrazil%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=Vit%C3%

Status code 429 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 429 Client Error: Too Many Requests for url: http://dev.virtualearth.net/REST/v1/Locations?q=Monte+Belo%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DMonte%2BBelo%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOk%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DVit%25C3%25B3Ria%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DEs%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DBrazil%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=Vit%C3%B3ria%2C+Vit

Status code 429 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 429 Client Error: Too Many Requests for url: http://dev.virtualearth.net/REST/v1/Locations?q=Santos+Reis%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DSantos%2BReis%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOk%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DVit%25C3%25B3Ria%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DEs%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DBrazil%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=Vit%C3%B3ria%2C+V

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Santa+Lu%C3%ADza%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DSanta%2BLu%25C3%25Adza%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOk%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DVit%25C3%25B3Ria%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DEs%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DBrazil%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=Vit%C3%B3

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Ariovaldo+Favalessa%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DAriovaldo%2BFavalessa%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOk%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DVit%25C3%25B3Ria%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DEs%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DBrazil%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=Vit%C3%

Status code 429 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 429 Client Error: Too Many Requests for url: http://dev.virtualearth.net/REST/v1/Locations?q=Nazareth%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DNazareth%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOver_Query_Limit%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DNan%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=nan&q=nan&q=nan&o=json&inclnb=1&key=ArM0cQoiG2wDASYT7dhCsWjYFKjIHq4wwKa7lLuMYspeXTdJnXimERp8Nsot5G0I&maxResults=1
Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev

Status code 429 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 429 Client Error: Too Many Requests for url: http://dev.virtualearth.net/REST/v1/Locations?q=Parque+Industrial%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Error+-+400+Client+Error%3A+Bad+Request+For+Url%3A+Http%3A%2F%2FDev.Virtualearth.Net%2FRest%2FV1%2FLocations%3FQ%3DParque%2BIndustrial%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DOk%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DNan%26Q%3DNan%26Q%3DNan%26Q%3DVit%25C3%25B3Ria%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DEs%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26Q%3DBrazil%252C%2BVit%25C3%25B3Ria%252C%2BEsp%25C3%25Adrito%2BSanto%252C%2BBrazil%26O%3DJson%26Inclnb%3D1%26Key%3DArm0Cqoig2Wdasyt7Dhcswjyfkjihq4Wwka7Llumyspextdjnximerp8Nsot5G0I%26Maxresults%3D1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=nan&q=Vit%C

In [49]:
# Geocode the neighborhood data
for i, row in geo.iterrows():
    # Use Bing
    full_neighborhood = row.str.title() + ', Vitória, Espírito Santo, Brazil'
    result = geocoder.bing(full_neighborhood, key=API_KEY)
    geo.set_value(i, 'status', result.status) 
    geo.set_value(i, 'lat', result.lat)
    geo.set_value(i, 'lng', result.lng)
    geo.set_value(i, 'bing_neighborhood', result.neighborhood)    
    geo.set_value(i, 'city', result.city)
    geo.set_value(i, 'state', result.state)
    geo.set_value(i, 'country', result.country)
    
    # If Bing returns None for neighborhood, use Google
    if geo.loc[i, 'bing_neighborhood'] is None:
        result = geocoder.google(full_neighborhood)
        geo.set_value(i, 'status', result.status) 
        geo.set_value(i, 'lat', result.lat)
        geo.set_value(i, 'lng', result.lng)
        if result.county == 'Vitória': # Google uses county when Bing uses city
            geo.set_value(i, 'city', result.county)
        geo.set_value(i, 'state', result.state)
        if result.country == 'BR':
            geo.set_value(i, 'country', 'Brazil')

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Jardim+Camburi%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Jardim+Camburi%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%C3%B3ria%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Es%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Brazil%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&o=json&inclnb=1&key=ArM0cQoiG2wDASYT7dhCsWjYFKjIHq4wwKa7lLuMYspeXTdJnXimERp8Nsot5G0I&maxResults=1
  
  import sys
  
  if __name__ == '__main__':
  # Remove the CWD from sys.path while we load stuff.
  # This is added back by InteractiveShellApp.init_path()
  if sys.path[0] == '':
Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Maria+Ortiz%2C+Vit%C

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Bonfim%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Bonfim%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%C3%B3ria%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Es%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Brazil%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&o=json&inclnb=1&key=ArM0cQoiG2wDASYT7dhCsWjYFKjIHq4wwKa7lLuMYspeXTdJnXimERp8Nsot5G0I&maxResults=1
Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Santo+Ant%C3%B4nio%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Santo+Ant%C3%B4nio%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%C3%B3ria%2

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Maru%C3%ADpe%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Maru%C3%ADpe%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%C3%B3ria%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Es%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Brazil%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&o=json&inclnb=1&key=ArM0cQoiG2wDASYT7dhCsWjYFKjIHq4wwKa7lLuMYspeXTdJnXimERp8Nsot5G0I&maxResults=1
Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Forte+S%C3%A3o+Jo%C3%A3o%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Forte+S%C3%A3o+Jo%C3%A3o%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Ilha+Das+Caieiras%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Ilha+Das+Caieiras%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%C3%B3ria%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Es%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Brazil%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&o=json&inclnb=1&key=ArM0cQoiG2wDASYT7dhCsWjYFKjIHq4wwKa7lLuMYspeXTdJnXimERp8Nsot5G0I&maxResults=1
Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Inhanguet%C3%A1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Inhanguet%C3%A1%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Do+Cabral%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Do+Cabral%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%C3%B3ria%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Es%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Brazil%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&o=json&inclnb=1&key=ArM0cQoiG2wDASYT7dhCsWjYFKjIHq4wwKa7lLuMYspeXTdJnXimERp8Nsot5G0I&maxResults=1
Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Santos+Reis%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Santos+Reis%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%C3%B3ria%2C+Vit%C3

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=De+Lourdes%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=De+Lourdes%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%C3%B3ria%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Es%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Brazil%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&o=json&inclnb=1&key=ArM0cQoiG2wDASYT7dhCsWjYFKjIHq4wwKa7lLuMYspeXTdJnXimERp8Nsot5G0I&maxResults=1
Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Ariovaldo+Favalessa%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Ariovaldo+Favalessa%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%

Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Ilha+Do+Frade%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Ilha+Do+Frade%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%C3%B3ria%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Es%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Brazil%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&o=json&inclnb=1&key=ArM0cQoiG2wDASYT7dhCsWjYFKjIHq4wwKa7lLuMYspeXTdJnXimERp8Nsot5G0I&maxResults=1
Status code 400 from http://dev.virtualearth.net/REST/v1/Locations: ERROR - 400 Client Error: Bad Request for url: http://dev.virtualearth.net/REST/v1/Locations?q=Aeroporto%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Ok%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=nan&q=nan&q=Aeroporto%2C+Vit%C3%B3ria%2C+Esp%C3%ADrito+Santo%2C+Brazil&q=Vit%C3%B3ria%2C+Vi

In [None]:
geo.head()

In [None]:
# Check that all the neighborhoods are unique
(geo['bing_neighborhood'].value_counts() == 1).all()

In [None]:
# Check that each neighborhood returns a unique location
not any(geo.duplicated(['lat', 'lng']))

In [None]:
# Verify that all of the locations are in Vitória, Espírito Santo, Brazil
(geo['status'] == 'OK').all()

In [None]:
(geo['city'] == 'Vitória').all()

In [None]:
(geo['state'] == 'ES').all()

In [None]:
(geo['country'] == 'Brazil').all()

<a id='cleaning_patient_id'></a>

#### Patient ID

In [54]:
# Convert patient ID to a string, as it is meant to be an identifier not a number
df['PatientID'] = df['PatientID'].astype(int).astype(str)

# All identifiers are 7-15 digits long, most have at least 10
lens = df['PatientID'].apply(len)
pd.value_counts(lens)

14    39372
13    28319
15    24919
12    12835
11     4002
10      920
9       136
8        18
5         3
6         2
7         1
Name: PatientID, dtype: int64

> Most of the patient IDs are at least 10 digits long.  There is no consistent length for this field, and it is unknown, and impossible to know, if some of these patient identifiers are flawed.  It is an identifier and does not need to be operated on or with, therefore none of these records will be removed.

In [55]:
unique_vals = df.groupby('PatientID')[['Gender','Age']].nunique()
unique_vals.head()

Unnamed: 0_level_0,Gender,Age
PatientID,Unnamed: 1_level_1,Unnamed: 2_level_1
11111462625267,1,1
111124532532143,1,1
11114485119737,1,1
11116239871275,1,1
1111633122891,1,1


In [57]:
not_unique = unique_vals[(unique_vals['Gender'] > 1) | (unique_vals['Age'] > 1)]
not_unique = not_unique.reset_index()
not_unique.head()

Unnamed: 0,PatientID,Gender,Age
0,112114682124172,1,2
1,11238367556569,1,2
2,1124242331227,1,2
3,1126541547466,1,2
4,112777857389857,1,2


In [59]:
multiple_info_ids = not_unique['PatientID']
multiple_info_ids.head()

0    112114682124172
1     11238367556569
2      1124242331227
3      1126541547466
4    112777857389857
Name: PatientID, dtype: object

In [62]:
multiples = df[df['PatientID'].isin(multiple_info_ids)].sort_values('PatientID')
multiples.head()

Unnamed: 0,PatientID,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighborhood,BolsaFamilia,Hypertension,Diabetes,Alcoholism,NumHandicaps,SMSReceived,NoShow
3850,112114682124172,5490237,F,2016-03-18 14:26:03,2016-05-02,0,RESISTÊNCIA,0,0,0,0,0,1,1
110232,112114682124172,5676082,F,2016-05-09 14:56:13,2016-06-08,1,RESISTÊNCIA,0,0,0,0,0,1,1
10283,11238367556569,5723118,F,2016-05-20 07:58:35,2016-05-20,29,ROMÃO,0,1,0,0,0,0,0
19089,11238367556569,5675794,F,2016-05-09 14:21:41,2016-05-13,28,ROMÃO,0,1,0,0,0,0,0
19091,11238367556569,5675795,F,2016-05-09 14:21:41,2016-05-13,28,ROMÃO,0,1,0,0,0,0,0


In [66]:
max_age = multiples.groupby('PatientID')['Age'].max()
min_age = multiples.groupby('PatientID')['Age'].min()
age_diff = max_age - min_age
age_diff[age_diff <= 1].all()

True

In [67]:
df[df['PatientID'] == '999931985292928']

Unnamed: 0,PatientID,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighborhood,BolsaFamilia,Hypertension,Diabetes,Alcoholism,NumHandicaps,SMSReceived,NoShow
32023,999931985292928,5710157,M,2016-05-17 15:22:01,2016-05-17,90,JABOUR,0,0,0,0,0,0,0
32033,999931985292928,5736368,M,2016-05-25 08:14:58,2016-05-25,90,JABOUR,0,0,0,0,0,0,0
62099,999931985292928,5700484,M,2016-05-16 09:29:43,2016-05-17,90,JABOUR,0,0,0,0,0,0,0
62194,999931985292928,5616762,M,2016-04-25 14:46:41,2016-05-04,90,JABOUR,0,0,0,0,0,1,0
104616,999931985292928,5772701,M,2016-06-03 16:04:03,2016-06-07,90,JABOUR,0,0,0,0,0,1,0
