# Missing Medical Appointments
In this analysis we explore why people miss their medical appointments, using a data set provided by various medical facilities in Rio de Janeiro, Brazil.



## First step: loading and cleaning the data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
noshowappointments_df = pd.read_csv(r"C:\Users\SARA\noshowappointments-kagglev2-may-2016.csv")

The original data included a negative number in the "Age" column. Assuming it is an entry error, it is corrected below by flipping its sign.

In [None]:
mask = noshowappointments_df['Age'] < 0
# .ix has been removed from pandas; .loc performs the same boolean-mask selection
noshowappointments_df.loc[mask, 'Age'] = noshowappointments_df.loc[mask, 'Age'] * (-1)
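Before flipping the sign, it can help to count how many rows are affected. The following is a minimal sketch on made-up values, not the real data set; `toy_df` and `n_negative` are illustrative names, and `abs()` is used as an equivalent way to remove the sign.

```python
import pandas as pd

# Toy data standing in for the real "Age" column (values are made up).
toy_df = pd.DataFrame({"Age": [34, -1, 0, 62]})

# Count the negative entries before deciding how to handle them.
n_negative = int((toy_df["Age"] < 0).sum())

# Taking the absolute value flips the sign of negative ages only.
toy_df["Age"] = toy_df["Age"].abs()

print(n_negative, toy_df["Age"].tolist())
```

If many rows were negative, treating them as simple sign errors would be a weaker assumption than it is for a single row.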

Changing the data type of the "No-show" column to boolean values for easier handling: True marks a missed appointment.



In [None]:
def show_or_noshow(string):
    if string == "Yes" or string =="yes":
        return True
    elif string == "No" or string == "no":
        return False
    else:
        return None

new_Noshow = noshowappointments_df["No-show"].apply(show_or_noshow)
noshowappointments_df["No-show"] = new_Noshow
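The same conversion can be sketched with a dictionary lookup instead of a helper function. This example runs on made-up labels; `toy_no_show` and `as_bool` are illustrative names. Unlike the function above, it lower-cases the labels first, so it would also accept spellings such as "YES".

```python
import pandas as pd

# Made-up labels in the same "Yes"/"No" style as the "No-show" column.
toy_no_show = pd.Series(["No", "Yes", "no", "Yes"])

# Normalise the case, then map each label to a boolean.
# Labels missing from the dictionary would become NaN.
as_bool = toy_no_show.str.lower().map({"yes": True, "no": False})

print(as_bool.tolist())
```

Checking `as_bool.isna().sum()` afterwards shows whether any label failed to convert.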

Dropping the columns that will not be used in this analysis.

In [None]:
noshowappointments_df = noshowappointments_df.drop(noshowappointments_df.columns[[0,1,3,6,7,8,9,10,11,12]], axis=1,inplace=False)

In [None]:
noshowappointments_df.head()
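Dropping by position depends on the column order in the CSV file. Here is a minimal sketch of the same step done by name, on a toy frame. The "PatientId" column and all values are made up for illustration, and the column names in `keep` are assumed to match the names in the raw file.

```python
import pandas as pd

# Toy frame with one extra, hypothetical column that is not needed.
toy_df = pd.DataFrame({
    "PatientId": [1, 2],
    "Gender": ["F", "M"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-02T00:00:00Z"],
    "Age": [30, 41],
    "No-show": [False, True],
})

# Select the columns to keep by name rather than dropping the rest by index.
keep = ["Gender", "AppointmentDay", "Age", "No-show"]
trimmed = toy_df[keep]

print(list(trimmed.columns))
```

Selecting by name fails with a `KeyError` if a column is missing, rather than silently keeping the wrong column.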


Changing the "AppointmentDay" column to just the day of the week on which the appointment falls.

In [None]:
from datetime import datetime as dt

def parse_date(date):
    if date == '':
        return None
    else:
        date = dt.strptime(date, '%Y-%m-%dT%H:%M:%SZ')
        return dt.strftime(date,"%a")
    
    
noshowappointments_df["AppointmentDay"]= noshowappointments_df["AppointmentDay"].apply(parse_date)
noshowappointments_df.columns=['Gender', 'Appointment Day', 'Age', 'No-show']
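pandas can also derive the weekday without a helper function. This is a minimal sketch on two made-up timestamps in the same format; `toy_days` and `weekday_abbrev` are illustrative names.

```python
import pandas as pd

# Two made-up timestamps in the same ISO-style format as "AppointmentDay".
toy_days = pd.Series(["2016-04-29T00:00:00Z", "2016-05-02T00:00:00Z"])

# Parse the strings into datetimes, then take the English day name
# and shorten it to three letters to match the "%a" style used above.
parsed = pd.to_datetime(toy_days)
weekday_abbrev = parsed.dt.day_name().str[:3]

print(weekday_abbrev.tolist())
```

Vectorised parsing of this kind is usually faster than calling `strptime` once per row.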

In [None]:
# numeric_only=True keeps the text column "Gender" out of the sum
noshowappointments_df.groupby("Appointment Day").sum(numeric_only=True)


## A General look at the data


In [None]:
patient_count_byShow = noshowappointments_df.groupby("No-show").count()
patient_count_byShow

In [None]:
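# cast the booleans to 0/1 so the histogram has a numeric axis matching the label below
noShow_dist_plot = noshowappointments_df["No-show"].astype(int).plot(kind="hist")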

noShow_dist_plot.set_xlabel("0 = show ; 1 = no-show")

Here 1 refers to True, that is, the no-show patients, and 0 refers to False, i.e. the patients who did show up.
It can be observed that approximately 20% of all patients fail to show up for their appointments.
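The overall rate can also be read off directly, because the mean of a boolean column is the fraction of True values. This is a minimal sketch on made-up flags, with `toy_flags` as an illustrative name.

```python
import pandas as pd

# Made-up flags: True marks a missed appointment.
toy_flags = pd.Series([False, False, True, False, False])

# True counts as 1 and False as 0, so the mean is the no-show fraction.
no_show_rate = toy_flags.mean()

print(f"{no_show_rate:.0%}")
```

Applied to the "No-show" column, the same call would give the exact figure behind the "approximately 20%" estimate.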

## Does the gender of the patient play a role in missing an appointment?

Since the count method gives the same value for every column, "Appointment Day" is chosen arbitrarily just to get the number of patients. The no-show count for each gender is then divided by the total number of patients of that gender.

In [None]:
gender_grouped = noshowappointments_df.groupby("Gender")
# total number of female patients
female_patiencount = gender_grouped.count()["Appointment Day"]["F"]
# number of female no-shows (True counts as 1); select the column first so only it is summed
female_noshowsum = gender_grouped["No-show"].sum()["F"]
female_noshowperc = female_noshowsum / female_patiencount
female_noshowperc

In [None]:
# same calculation for male patients
male_patiencount = gender_grouped.count()["Appointment Day"]["M"]
male_noshowsum = gender_grouped["No-show"].sum()["M"]
male_noshowperc = male_noshowsum / male_patiencount
male_noshowperc

About 20% of both male and female patients do not show up to their appointments, suggesting no strong relationship between the gender of the patient and the likelihood of missing an appointment.
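The per-gender rates can be computed in a single step by averaging the boolean column within each group. This is a minimal sketch on made-up rows, so the numbers do not reflect the real data; `toy_df` and `rate_by_gender` are illustrative names.

```python
import pandas as pd

# Made-up rows: four female and two male patients.
toy_df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M"],
    "No-show": [True, False, False, False, False, True],
})

# The mean of a boolean column within each group is that group's no-show rate.
rate_by_gender = toy_df.groupby("Gender")["No-show"].mean()

print(rate_by_gender)
```

This avoids computing the count and the sum separately for each gender.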

## Do people tend to miss more or fewer appointments as they get older?

In [None]:
age_total_count = noshowappointments_df.groupby("Age").count()["Gender"]


Filtering on the "No-show" column keeps only the rows that correspond to the True value, i.e. the missed appointments. Counting those rows per age and dividing by the total count per age gives the no-show ratio for each age. A boolean filter is used rather than multiplying "Age" by "No-show" and deleting the zeros, because that approach would also delete the no-show patients whose age is 0.

In [None]:
# keep only the appointments that were missed
no_show_rows = noshowappointments_df[noshowappointments_df["No-show"]]
# number of missed appointments per age
age_No_show = no_show_rows.groupby("Age").count()["Gender"]
# ratio of missed appointments to all appointments for each age
age_No_show = age_No_show / age_total_count
# ages with no missed appointments have no ratio and are dropped
age_No_show = age_No_show.dropna()
age_No_show = age_No_show.to_frame()
age_No_show.columns = ["No-Show"]
ageNoshow_plot = age_No_show.plot(title="Share of no-show patients by age")
ageNoshow_plot.set_xlabel("Age")
ageNoshow_plot.set_ylabel("fraction of patients of that age")
In [None]:
age_std_plot = (age_No_show-age_No_show.mean())/age_No_show.std(ddof=0)
stdAgeNoShow_plot = age_std_plot.plot()
stdAgeNoShow_plot.set_xlabel("Age")
stdAgeNoShow_plot.set_ylabel("standardized ratio of no show patients")

The no-show ratio fluctuates somewhat along the age axis, and the relationship does not appear to be linear. The ratio of no-show patients starts to decrease after around 20 years of age. The data after age 80 is less reliable, as there are fewer data points to rely on.
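One way to reduce the noise from ages with few patients is to group ages into wider bins and report the bin size next to the rate. The following is a minimal sketch on made-up rows; the bin edges are an arbitrary assumption, and `toy_df`, `bins` and `summary` are illustrative names.

```python
import pandas as pd

# Made-up ages and no-show flags.
toy_df = pd.DataFrame({
    "Age": [0, 5, 17, 25, 33, 47, 58, 64, 71, 90],
    "No-show": [False, True, True, False, True,
                False, False, False, False, True],
})

# Left-closed bins: [0, 20), [20, 40), [40, 60), [60, 80), [80, 120).
bins = [0, 20, 40, 60, 80, 120]
toy_df["Age group"] = pd.cut(toy_df["Age"], bins=bins, right=False)

# No-show rate and number of patients per age group.
summary = toy_df.groupby("Age group", observed=False)["No-show"].agg(["mean", "count"])

print(summary)
```

The "count" column makes it visible when a rate, such as the one for the oldest group, rests on very few patients.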

## Does the no-show rate vary for different days of the week?

In [None]:
weekday_attendance = noshowappointments_df.groupby("Appointment Day")
# number of no-shows per weekday; despite the name, this is a Series, not a DataFrame
weekday_attendance_df = weekday_attendance["No-show"].sum()

In [None]:
total_patient_count = noshowappointments_df.groupby("Appointment Day").count()["No-show"]
total_patient_count

In [None]:
weekday_plot = (weekday_attendance_df/total_patient_count).plot(kind = "bar", title = "No-show rate by Appointment day")
weekday_plot.set_ylabel("fraction of that day's appointments that were missed")

There is seemingly no correlation between the day of the week and the percentage of people missing their appointments.
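Because groupby sorts the day labels alphabetically, the bars do not appear in calendar order. This is a minimal sketch, on made-up rows, of computing the rate per day and putting it in weekday order; `toy_df`, `order` and `rate_by_day` are illustrative names.

```python
import pandas as pd

# Made-up appointments using the same three-letter day labels.
toy_df = pd.DataFrame({
    "Appointment Day": ["Mon", "Tue", "Mon", "Wed", "Tue", "Mon"],
    "No-show": [True, False, False, False, True, False],
})

# Calendar order for the index; days without appointments become NaN.
order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
rate_by_day = toy_df.groupby("Appointment Day")["No-show"].mean().reindex(order)

print(rate_by_day)
```

Calling `.plot(kind="bar")` on the reindexed result would then draw the bars from Monday onwards.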

## Conclusion: Perhaps people just don't show up sometimes.

Of the 3 variables examined, none seems to have a strong relationship with the rate of no-shows.
In all cases, however, roughly 20% of the patients miss their appointments.

However, it's important to note that the dataset contains questionable data points, such as 5 patients recorded as 115 years old, and that records for patients over the age of 80 are sparse. Other limitations of the dataset are listed below:

- The data spans approximately 1.5 months of patient appointment records, which might not have been enough.
- The data was collected from only one city in Brazil.

## Acknowledgement
Resources that helped with the code used in this analysis:


https://www.tutorialspoint.com/python/time_strptime.htm 

https://stackoverflow.com/questions/16766643/convert-date-string-to-day-of-week