# Project - Investigate Medical Appointment Dataset

***This dataset collects information from 100k medical appointments in Brazil and focuses on whether or not patients show up for their appointments. Each row includes a number of characteristics about the patient.***


Dataset


1. PatientId---Identification of a patient
2. AppointmentID---Identification number of an appointment
3. Gender---Displays the gender of the patient
4. ScheduledDay---Displays the date on which the appointment was scheduled
5. AppointmentDay---Shows the date of the appointment
6. Age---Shows the age of the patient
7. Neighbourhood---Indicates the location of the hospital
8. Scholarship---Indicates whether the patient receives a scholarship
9. Hipertension---Shows if the patient has hypertension
10. Diabetes---Shows if the patient has diabetes
11. Alcoholism---Indicates if the patient is an alcoholic
12. Handcap---Indicates if the patient is handicapped
13. SMS_received---Shows if a message was sent to the patient
14. No-show---It says ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up



# Importing all the necessary libraries

In [None]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import collections

# Reading the dataset



In [None]:
df = pd.read_csv("../input/hassan/noshowappointments.csv")
df.head()

# Analyzing the dataset

***Check dimensions of the dataframe in terms of rows and columns***


In [None]:
df.shape

Inference drawn:

* The number of rows is 110527
* The number of columns is 14

***Checking if the dataset has any duplicate values***




In [None]:
sum(df.duplicated())

Inference drawn:

* The dataset has no duplicate values

***Checking if there are any null or missing values in the dataset***

In [None]:
df.isnull().sum()

Inference drawn:

* The dataset has no missing values

***Displaying the columns in the dataset***

In [None]:
df.columns

Inference drawn:


* Some column names are misspelled or inconsistently formatted, so they will be renamed accordingly



**Renaming columns that are misspelled or inconsistently formatted**

In [None]:
df.rename(
    columns={
        "PatientId": "Patient_id",
        "AppointmentID": "Appointment_id",
        "ScheduledDay": "Scheduled_day",
        "AppointmentDay": "Appointment_day",
        "Hipertension": "Hypertension",
        "Handcap": "Handicap",
        "No-show": "No_show",
    },
    inplace=True,
)

***Checking if datatypes are in correct format***

In [None]:
df.dtypes

Inference drawn:

* Scheduled_day's data type is object; converting it to datetime format will make it easier to work with
* Appointment_day's data type is object; converting it to datetime format will make it easier to work with

Checking for redundant variables that could be dropped



In [None]:
df.head()

Inference drawn:

* Looking at the dataset, no column contains only a single unique value, so we can conclude that there are no redundant variables.
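One way to check this directly, rather than by eye, is to count the distinct values in each column. The sketch below uses a small made-up frame; `toy` and its `Constant` column are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for df; 'Constant' is deliberately redundant
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "Age": [23, 45, 23, 61],
    "Constant": [1, 1, 1, 1],
})

# Number of distinct values per column
unique_counts = toy.nunique()

# Columns with exactly one distinct value are candidates for dropping
redundant = unique_counts[unique_counts == 1].index.tolist()
print(redundant)
```

Running the same `nunique()` call on `df` would list any column that never varies.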

# Analysing the variables

Variable 'Patient_id'

In [None]:
df.Patient_id.unique()

Inference:


* The data type of an id should ideally be integer, not float.



In [None]:
df['Patient_id'] = df['Patient_id'].astype('int64')


Variable 'Gender'

In [None]:
df.Gender.unique()

Inference -

* The column has 2 unique values, one for male and one for female patients, in the correct format

Variable 'Scheduled_day'

In [None]:
df.Scheduled_day.unique()


Inference -

* The data type needs to be converted to datetime format

In [None]:
df.Scheduled_day = df.Scheduled_day.apply(np.datetime64)

Variable 'Appointment_day'

In [None]:
df.Appointment_day.unique()

Inference -

* The data type needs to be converted to datetime format

In [None]:
df.Appointment_day = df.Appointment_day.apply(np.datetime64)
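As an alternative to applying `np.datetime64` element by element, `pd.to_datetime` can parse a whole column at once. This is a minimal sketch on made-up values; the ISO 8601 format with a trailing `Z` is an assumption about how the timestamps are stored, and `raw` and `parsed` are hypothetical names.

```python
import pandas as pd

# Hypothetical sample values; the timestamp format is assumed, not taken from the notebook
raw = pd.Series(["2016-04-29T18:38:08Z", "2016-05-03T00:00:00Z"])

# Vectorised parsing; strings ending in 'Z' are interpreted as UTC
parsed = pd.to_datetime(raw)

print(parsed.dt.date.tolist())
```

The result is a datetime column, so the `.dt` accessor (for example `.dt.date` or `.dt.day_name()`) works on it in the same way.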

Variable 'Age'

In [None]:
df.Age.unique()

Inference -

* The Age column contains negative values, which are not valid ages, so these rows are filtered out.

In [None]:
df = df[(df.Age >= 0)]

Variable 'Neighbourhood'

In [None]:
df.Neighbourhood.unique()

Inference -

* The variable shows the neighbourhood in which the hospital is located

Variable 'Scholarship'

In [None]:
df.Scholarship.unique()

Inference -

* The variable has 2 unique values, which indicate whether or not the patient receives a scholarship, and it is in the correct data type

Variable 'Hypertension'

In [None]:
df.Hypertension.unique()

Inference -

* The variable has 2 unique values, 1 if the patient has hypertension and 0 otherwise, and it is in the correct data type

Variable 'Diabetes'

In [None]:
df.Diabetes.unique()

Inference -

* The variable has 2 unique values, 1 if the patient is diabetic and 0 if not, and it is in the correct data type

Variable 'Alcoholism'

In [None]:
df.Alcoholism.unique()

Inference -

* The variable has 2 unique values, 1 if the patient is alcoholic and 0 if not, and it is in the correct data type

Variable 'Handicap'

In [None]:
df.Handicap.unique()

The column has more than 2 unique values, possibly representing the number of disabilities an individual has

Variable 'SMS_received'

In [None]:
df.SMS_received.unique()

Inference -

* The variable has 2 unique values, which show whether or not the patient received a message, and it is in the correct data type

Variable 'No_show'

In [None]:
df.No_show.unique()

Inference -

* The variable has 2 unique values displaying ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up

Adding a new column displaying the waiting period for a patient

In [None]:
df['Wait'] = (df.Appointment_day.dt.date - df.Scheduled_day.dt.date).dt.days
df= df[(df.Wait>=0)]
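The waiting period can also be computed without leaving datetime types, by truncating both timestamps to midnight before subtracting. The sketch below is an illustration on made-up rows, assuming both columns are already datetime; `toy`, `wait`, and `valid` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows: a same-day visit, a 4-day wait, and an inconsistent record
toy = pd.DataFrame({
    "Scheduled_day": pd.to_datetime(
        ["2016-04-29 18:38:08", "2016-04-29 08:00:00", "2016-05-10 09:00:00"]
    ),
    "Appointment_day": pd.to_datetime(["2016-04-29", "2016-05-03", "2016-05-09"]),
})

# normalize() sets the time to midnight so only calendar dates are compared
wait = (
    toy["Appointment_day"].dt.normalize() - toy["Scheduled_day"].dt.normalize()
).dt.days
toy = toy.assign(Wait=wait)

# Keep only rows where the appointment is not before the scheduling date
valid = toy[toy["Wait"] >= 0]
print(valid["Wait"].tolist())
```

The third row has a negative wait and is dropped, which mirrors the `Wait >= 0` filter used above.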

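Adding a new column which shows the weekday on which the appointment was scheduled

In [None]:
# Note: the weekday is derived from Scheduled_day (the booking date), not Appointment_day
df['appointment_day'] = df.Scheduled_day.dt.day_name()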

Understanding the variable 'appointment_day'

In [None]:
collections.Counter(df.appointment_day)

Very few appointments are scheduled on Saturday. Most are scheduled in the first part of the week (Monday, Tuesday, Wednesday), and the number drops in the latter part of the week (Thursday, Friday).

In [None]:
df.head(5)

# Observations

In [None]:
df.hist(figsize=(16,14));

The observations made from the histograms are:



* Patients are spread fairly evenly across ages, with minors making up the largest group
* The majority of patients do not have alcoholism. Only a very small number of patients do
* The majority of patients do not have diabetes. Only a very small number of patients do
* The majority of patients are not handicapped. Only a very small number of patients have some disability
* Around 75% of patients do not have hypertension, while around 25% do
* Roughly 35k patients received a text message, whereas roughly 75k did not
* The majority of patients do not receive a scholarship; only a small number do
* The majority of patients do not wait more than 20 days, with a small number of patients waiting up to 75 days

**What percentage of patients missed their appointments?**

In [None]:
x= (df[['No_show']]=='Yes').sum()
y= (df[['No_show']]=='No').sum()

percent= ((x)/(x+y))*100
percent

Inference:


* 20.19% of patients missed their appointments
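The same percentage can be obtained in one step, because the mean of a boolean Series is the fraction of `True` values. A minimal sketch on made-up values (`no_show` and `missed_pct` are hypothetical names):

```python
import pandas as pd

# Hypothetical No_show values: 1 missed appointment out of 5
no_show = pd.Series(["No", "Yes", "No", "No", "No"])

# Comparing to 'Yes' gives booleans; their mean is the share of missed appointments
missed_pct = (no_show == "Yes").mean() * 100
print(round(missed_pct, 2))
```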



**Did gender play any role in the possibility of a patient missing their appointment?**

In [None]:
female= df[df['Gender']=='F']
total_females= female.shape[0]
male= df[df['Gender']=='M']
total_males= male.shape[0]
females_who_did_not_attend = (female[["No_show"]]=="Yes").sum()
females_who_attended = (female[["No_show"]]=="No").sum()
males_who_did_not_attend = (male[["No_show"]]=="Yes").sum()
males_who_attended = (male[["No_show"]]=="No").sum()

The percentage of females who missed their appointments

In [None]:
(females_who_did_not_attend/total_females)*100

The percentage of females who attended their appointments

In [None]:
(females_who_attended/total_females)*100

Percentage of males who missed their appointments

In [None]:
(males_who_did_not_attend/total_males)*100

Percentage of males who attended their appointments

In [None]:
(males_who_attended/total_males)*100

Plotting a graph for better understanding

In [None]:
gender =df.groupby('Gender').No_show.value_counts()
gender.plot(kind='bar')

Inference


* The percentage of female patients who missed their appointments is approximately equal to the percentage of male patients who missed their appointments
* The percentage of female patients who attended their appointments is approximately equal to the percentage of male patients who attended their appointments
* Thus, the gender of a person doesn't play a significant role in causing them to miss their appointments
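The four percentages computed above can also be produced in a single table with a row-normalised cross-tabulation. This is a sketch on made-up rows, not the notebook's actual output; `toy` and `gender_pct` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows: 4 female and 2 male patients
toy = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M"],
    "No_show": ["No", "No", "No", "Yes", "No", "Yes"],
})

# normalize='index' makes each row (gender) sum to 1; multiply by 100 for percentages
gender_pct = pd.crosstab(toy["Gender"], toy["No_show"], normalize="index") * 100
print(gender_pct)
```

Each row then reads as the attended and missed percentages for one gender, which makes the two groups directly comparable even when their sizes differ.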



**Is there a relation between a patient not showing up and the number of days the patient has to wait for the appointment?**

In [None]:
Waiting_df = df[['No_show', 'Wait']].groupby('Wait').count()

Plotting a graph for better understanding

In [None]:
Waiting_df.plot(kind='line', figsize=(15,5))
plt.title("Influence of the time gap between scheduling and appointment day on no-shows")
plt.xlabel('Days between scheduling and appointments')
plt.ylabel('Number of people')

Inference:


* The majority of patients attend their appointments if the appointments are scheduled within a small time gap, ideally on the same day
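The grouped count above tallies all appointments per waiting time, attended or not. To relate waiting time to attendance more directly, one option is to compute the share of missed appointments for each waiting time. The sketch below uses made-up rows; `toy` and `no_show_rate` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows: four same-day visits, two 3-day waits, two 30-day waits
toy = pd.DataFrame({
    "Wait": [0, 0, 0, 0, 3, 3, 30, 30],
    "No_show": ["No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"],
})

# Percentage of missed appointments for each waiting time
no_show_rate = toy["No_show"].eq("Yes").groupby(toy["Wait"]).mean() * 100
print(no_show_rate)
```

Plotting this rate against `Wait` would show whether longer gaps go together with more missed appointments, independent of how many appointments each gap has.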



**Does the weekday on which the appointment was scheduled influence the patient's decision to attend or miss the appointment?**

In [None]:
day = df.groupby('appointment_day').No_show.value_counts()
day

Calculating the percentage 

In [None]:
# Percentage of no-shows per weekday: 'Yes' count divided by all appointments for that day
percent = day.xs('Yes', level=1) * 100 / day.groupby(level=0).sum()
percent

Plotting a graph for better understanding

In [None]:
day = day.sort_values(ascending=False)

day.plot(kind='bar', figsize=(6,6))

Inference:
* The number of appointments for Saturday, both attended and missed, is negligible
* The number of appointments, both missed and attended, is highest for Tuesday
* Wednesday comes right after Tuesday in both the number of appointments attended and the number missed
* It is followed by Monday, with fewer patients attending as well as missing their appointments
* The number of patients attending as well as missing their appointments keeps decreasing for Thursday and Friday
* Thus, the numbers of patients attending and missing their appointments go hand in hand
* Saturday has the lowest no-show rate: around 4% of those scheduled miss their appointments
* For all the other days, around 20% of the scheduled appointments are missed

**Does sending a text message influence the patient's attendance?**

In [None]:
msg= df.groupby("SMS_received").No_show.value_counts()
msg

Calculating the percentage 

In [None]:
# Share of no-shows among patients who did not receive a message
Msg_not_received = msg.loc[0, 'Yes']*100/msg.loc[0].sum()
print(Msg_not_received)

In [None]:
msg.head()

In [None]:
# Share of no-shows among patients who received a message
Msg_received = msg.loc[1, 'Yes']*100/msg.loc[1].sum()
print(Msg_received)

Plotting a graph for better understanding

In [None]:
msg.plot(kind='bar')

Inference:
* 16% of patients who did not receive a message did not show up for their appointment
* 27% of patients did not attend their appointment in spite of getting a message
* Patients receiving text messages had a higher tendency to miss their appointments



**Does the age of a person play any role in determining whether the person will attend their appointment?**

Plotting a visual of patients of different ages who did not attend their respective appointments

In [None]:
Age_df =df.query('No_show == "Yes"').groupby('Age').No_show.count()
Age_df.plot(kind='line', figsize=(15,5))


Plotting a visual of patients of different ages who attended their respective appointments

In [None]:
Age_df =df.query('No_show == "No"').groupby('Age').No_show.count()
Age_df.plot(kind='line', figsize=(15,5))


Inference:
* The number of no-show appointments is highest for infants and appears to increase up to the age of 20 years, after which it declines
* The number of appointments where patients showed up is, again, highest for infants; it declines sharply after the age of 5 and remains almost constant until the age of 60, with some rises, after which it continues to decline
* There is no definite trend between age and the likelihood of a patient showing up for an appointment
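Both plots show raw counts, which largely follow how many patients of each age are in the data. One way to compare ages on equal terms is to group them into bands and compute the share of missed appointments per band. The sketch below uses made-up rows, and the band edges are an arbitrary choice for illustration rather than part of the original analysis.

```python
import pandas as pd

# Hypothetical rows covering three age bands
toy = pd.DataFrame({
    "Age": [0, 4, 10, 17, 25, 40, 59, 60, 75, 90],
    "No_show": ["No", "Yes", "No", "Yes", "Yes", "No", "No", "No", "No", "No"],
})

# Illustrative bands: [0, 18), [18, 60), [60, 120); right=False excludes the upper edge
age_band = pd.cut(toy["Age"], bins=[0, 18, 60, 120],
                  labels=["0-17", "18-59", "60+"], right=False)

# Percentage of missed appointments within each band
band_rate = toy["No_show"].eq("Yes").groupby(age_band, observed=False).mean() * 100
print(band_rate)
```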

**Which neighbourhoods have highest numbers of no-shows?**

Neighbourhoods with the most no-shows

In [None]:
area_df= df.query('No_show=="Yes"').groupby("Neighbourhood").No_show.count()
area_df.sort_values(ascending=False, inplace=True)
area_df

Plotting a graph for better understanding

In [None]:
area_df.plot(kind='bar', figsize=(12,9))

Neighbourhoods where the most patients showed up for their appointments

In [None]:
area= df.query('No_show=="No"').groupby("Neighbourhood").No_show.count()
area.sort_values(ascending=False, inplace=True)
area

Plotting a graph for better understanding

In [None]:
area.plot(kind='bar', figsize=(12,9))

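The graphs show that some neighbourhoods account for far more no-shows than others, which suggests that patients from certain areas are more likely to miss their appointments than patients residing elsewhere. Note that these are raw counts, so they also reflect how many appointments each neighbourhood has.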
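To separate "many appointments" from "high likelihood of missing", the no-show rate per neighbourhood can be computed alongside the number of appointments. The sketch below uses made-up neighbourhoods `A`, `B`, and `C`; the `min_appointments` threshold is an arbitrary assumption used to avoid ranking areas with too few appointments.

```python
import pandas as pd

# Hypothetical rows: A has 6 appointments, B has 3, C has only 1
toy = pd.DataFrame({
    "Neighbourhood": ["A"] * 6 + ["B"] * 3 + ["C"],
    "No_show": ["No", "No", "No", "No", "Yes", "Yes",
                "Yes", "Yes", "No",
                "Yes"],
})

# No-show rate and appointment count per neighbourhood
missed = toy["No_show"].eq("Yes")
stats = missed.groupby(toy["Neighbourhood"]).agg(rate="mean", appointments="size")

# Ignore neighbourhoods with too few appointments to give a stable rate
min_appointments = 3  # arbitrary illustrative threshold
ranked = stats[stats["appointments"] >= min_appointments].sort_values(
    "rate", ascending=False
)
print(ranked)
```

Here `C` has a 100% no-show rate from a single appointment and is excluded, while `B` ranks above `A` despite having fewer appointments.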

**Is a person with a medical condition more likely to be a no-show?**

In [None]:
hypertension_data = df.groupby('Hypertension').No_show.value_counts()
diabetes_data = df.groupby('Diabetes').No_show.value_counts()
alcoholism_data = df.groupby('Alcoholism').No_show.value_counts()
hypertension_data, diabetes_data, alcoholism_data

Plotting graphs for better understanding

In [None]:
hypertension_data.plot(kind="bar")

In [None]:
alcoholism_data.plot(kind='bar')

In [None]:
diabetes_data.plot(kind="bar")

Inference:


* The percentage of no-shows for patients with a medical condition is approximately equal to the percentage of no-shows for patients without a pre-existing medical condition
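The grouped counts above can be turned into the percentages this inference refers to. The following is a minimal sketch on made-up rows, assuming the condition columns hold 0/1 flags; `toy`, `missed`, and `rates` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows with 0/1 flags for each condition
toy = pd.DataFrame({
    "Hypertension": [0, 0, 0, 0, 1, 1],
    "Diabetes":     [0, 0, 1, 1, 0, 0],
    "Alcoholism":   [0, 1, 0, 0, 0, 0],
    "No_show": ["No", "Yes", "No", "No", "No", "Yes"],
})

missed = toy["No_show"].eq("Yes")

# For each condition, the percentage of missed appointments without (0) and with (1) it
rates = {
    col: (missed.groupby(toy[col]).mean() * 100).round(1).to_dict()
    for col in ["Hypertension", "Diabetes", "Alcoholism"]
}
print(rates)
```

Comparing the two values for each condition shows directly whether having the condition goes together with a higher or lower no-show percentage.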


# Conclusion



* In this project, we analyzed the no-show dataset of patients
* We analyzed all the variables of the dataset
* The gender of a patient does not influence whether or not the patient shows up
* Whether or not the patient shows up is affected by the amount of time between the day the appointment was scheduled and the day of the appointment
* A patient is more likely to show up if the time between scheduling and the appointment is short
* The weekday on which the appointment was scheduled does not affect the patient's behaviour, except for Saturday, when the percentage of patients not showing up is the lowest
* Patients who received a text message are more likely to not show up than patients who did not receive one (27% versus 16%)
* The age of a person does not affect whether the patient attends or misses their appointment
* Patients with a pre-existing medical condition like hypertension, diabetes, or alcoholism are about as likely to miss their appointments as patients without a medical condition
* In some neighbourhoods, patients are more likely to miss their appointments than in other neighbourhoods

