# Project: Investigate a Dataset - [Medical Appointment No Shows]

## Table of Contents


<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
----

### Dataset Description
---

>A person makes a doctor's appointment, receives all the instructions, and does not show up. Who is to blame?


#### Content
>110,527 medical appointments and their 14 associated variables (characteristics). The most important one is whether the patient showed up for or missed the appointment. Variable names are self-explanatory.


#### Data Dictionary

>**1 - PatientId**
>
>Identification of a patient.
>
>**2 - AppointmentID**
>
>Identification of each appointment.
>
>**3 -  Gender**
>
>Male or Female. Females make up the greater proportion; women tend to take more care of their health than men.
>
>**4 - DataMarcacaoConsulta** (loaded as `AppointmentDay`)
>
>The day of the actual appointment, when the patient has to visit the doctor.
>
>**5 - DataAgendamento** (loaded as `ScheduledDay`)
>
>The day someone called or registered the appointment; this is, of course, before the appointment.
>
>**6 - Age**
>
>How old is the patient.
>
>**7 - Neighbourhood**
>
>Where the appointment takes place.
>
>**8 - Scholarship**
>
>True or False.
>
>**9 - Hipertension**
>
>True or False.
>
>**10 - Diabetes**
>
>True or False.
>
>**11 - Alcoholism**
>
>True or False.
>
>**12 - Handcap**
>
>True or False.
>
>**13 - SMS_received**
>
>1 or more messages sent to the patient.
>
>**14 - No-show**
>
>Yes or No.

### Question(s) for Analysis
---


>**Why do patients miss their scheduled appointments?**
**What variable(s) might affect the probability of them showing up for their scheduled appointment?**


 

In [4]:
# import statements for all of the packages in use.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling
---

>In this section of the report, we will load the data, check it for cleanliness, and then trim and clean our dataset for analysis.


### Assessing Data
---

> We will start by loading in our data set [**noshowappointments-kagglev2-may-2016.csv**](https://www.google.com/url?q=https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd2e9a_noshowappointments-kagglev2-may-2016/noshowappointments-kagglev2-may-2016.csv&sa=D&source=editors&ust=1653528465662837&usg=AOvVaw1zysYG7tevT7R2axHyyyrd).
>
>Then we will perform a general inspection to see which issues need to be addressed.

In [90]:
# Loading our data and printing out a few lines. Performing operations to inspect data
#   types and look for instances of missing or possibly errant data.
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')


In [77]:
df.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


In [78]:
# exploring the shape of the data
df.shape

(110527, 14)

The DataFrame consists of 14 columns and 110,527 rows.

Checking for missing values, undescriptive column names, and inappropriate data types...

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
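
The non-null counts reported by `df.info()` can also be checked directly. The following is a minimal sketch on a small made-up frame (the `toy` data and its values are assumptions for illustration, not rows from the real file); it uses `isnull()` to count missing values per column.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the real dataset.
toy = pd.DataFrame({
    "Age": [62, 56, None, 8],
    "Gender": ["F", "M", "F", None],
    "No-show": ["No", "No", "Yes", "No"],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real DataFrame the same two calls would be applied to `df` instead of `toy`.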


Checking for duplicates...

In [80]:
df.duplicated().sum()

0

* There are no duplicated or missing values. However, the **"Hipertension"** column name is misspelled and needs to be renamed to **"Hypertension"**, and the **"No-show"** column should be renamed to **"No_show"** for easier access.

In [81]:
#looking for more insight on the data using .describe() function
df.describe()

| | PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 |
| mean | 147496300000000.0 | 5675305.0 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.0304 | 0.022248 | 0.321026 |
| std | 256094900000000.0 | 71295.75 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 39217.84 | 5030230.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4172614000000.0 | 5640286.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 31731840000000.0 | 5680573.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 94391720000000.0 | 5725524.0 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 999981600000000.0 | 5790484.0 | 115.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 |


* The minimum of the **"Age"** column is -1, so it contains at least one negative value. A negative age is not valid (most likely a data-entry error), and the affected rows need to be discarded.
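
Before discarding anything, it can help to see how many rows are affected. The sketch below is only an illustration on made-up values (the `toy` frame is an assumption, not real data); it builds a boolean mask for negative ages, counts the matches, and filters them out. Boolean filtering is shown here as an alternative to dropping by index.

```python
import pandas as pd

# Toy frame with made-up values; one age is deliberately invalid.
toy = pd.DataFrame({
    "Age": [62, -1, 0, 115],
    "No-show": ["No", "No", "Yes", "No"],
})

# Boolean mask marking rows whose age is below zero.
invalid_age = toy["Age"] < 0
print(invalid_age.sum())   # number of affected rows
print(toy[invalid_age])    # the affected rows themselves

# Keep only rows with a non-negative age (an age of 0 is kept).
cleaned = toy[~invalid_age].reset_index(drop=True)
print(cleaned)
```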


### Data Cleaning
---
>Having identified the structure of the data and the problems that need fixing, we now perform the cleaning steps in this section.

 

In [91]:
# renaming "Hipertension" and "No-show" columns
new_col_names = {"Hipertension":"Hypertension", "No-show":"No_show"}
df.rename(columns=new_col_names, inplace=True)

# checking that changes took place
df.head()

| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No_show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


In [92]:
# discarding negative values using .drop() function
wrong_values = df[df['Age'] < 0].index
df.drop(wrong_values, inplace=True)
# checking that changes took place
df.describe()

| | PatientId | AppointmentID | Age | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 110526.0 | 110526.0 | 110526.0 | 110526.0 | 110526.0 | 110526.0 | 110526.0 | 110526.0 | 110526.0 |
| mean | 147493400000000.0 | 5675304.0 | 37.089219 | 0.098266 | 0.197248 | 0.071865 | 0.0304 | 0.022248 | 0.321029 |
| std | 256094300000000.0 | 71295.44 | 23.110026 | 0.297676 | 0.397923 | 0.258266 | 0.171686 | 0.161543 | 0.466874 |
| min | 39217.84 | 5030230.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4172536000000.0 | 5640285.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 31731840000000.0 | 5680572.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 94389630000000.0 | 5725523.0 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 999981600000000.0 | 5790484.0 | 115.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 |


**The last step in data cleaning is to remove irrelevant and unnecessary data.**

Let's see if any patients have scheduled more than one appointment.


In [93]:
df['PatientId'].duplicated().sum()

48228

In [106]:
# looking for patients who have scheduled more than 1 appointment and have the same status of showing-up
df.duplicated(['PatientId','No_show']).sum()


38710

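It turns out that 48,228 rows belong to a patient who already appears earlier in the data, meaning they are repeat appointments. In 38,710 of those rows the patient also has the same show-up status as in an earlier appointment.

We discard these 38,710 rows, since they repeat information already recorded for the same patient.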
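
The difference between the two counts is easier to see on a small example. The sketch below uses a made-up `toy` frame (an assumption for illustration, not real patient data) to show what `duplicated()` counts with and without the `No_show` column, and how that differs from the number of distinct patients with repeat appointments.

```python
import pandas as pd

# Toy frame with made-up values: patient 1 has three appointments,
# patient 2 has two, patient 3 has one.
toy = pd.DataFrame({
    "PatientId": [1, 1, 1, 2, 2, 3],
    "No_show": ["No", "No", "Yes", "No", "No", "Yes"],
})

# Rows whose PatientId already appeared in an earlier row.
repeat_rows = toy["PatientId"].duplicated().sum()

# Rows whose (PatientId, No_show) pair already appeared in an earlier row.
repeat_pairs = toy.duplicated(["PatientId", "No_show"]).sum()

# Distinct patients with more than one appointment (a different quantity).
patients_with_repeats = (toy["PatientId"].value_counts() > 1).sum()

# Dropping repeated pairs keeps the first row of each pair.
deduped = toy.drop_duplicates(["PatientId", "No_show"])

print(repeat_rows, repeat_pairs, patients_with_repeats, len(deduped))
```

Note that `duplicated()` counts rows, not patients: a patient with three appointments contributes two rows to the first count.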

In [109]:
df.drop_duplicates(['PatientId', 'No_show'], inplace = True)
df.shape

(71816, 14)

We also want to remove columns that don't help in this analysis and focus on the variables that might answer our question.

In [112]:
#removing the columns ['PatientId', 'AppointmentID', 'ScheduledDay', 'AppointmentDay']
df.drop(columns=['PatientId', 'AppointmentID', 'ScheduledDay', 'AppointmentDay'], inplace=True)
df.head()

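| | Gender | Age | Neighbourhood | Scholarship | Hypertension | Diabetes | Alcoholism | Handcap | SMS_received | No_show |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | F | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | M | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | F | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | F | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | F | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |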


#### Changes Made to the Data:
----

* The columns **"Hipertension"** and **"No-show"** have been renamed to **"Hypertension"** and **"No_show"**.

* The row with a negative value in the **"Age"** column has been removed.

* 38,710 rows that repeat an earlier combination of **"PatientId"** and **"No_show"** have been removed.

* The columns **"PatientId"**, **"AppointmentID"**, **"ScheduledDay"**, and **"AppointmentDay"** have been removed.



These changes made our data better-suited and ready to be explored in the next section...

<a id='eda'></a>
## Exploratory Data Analysis

> **Tip**: Now that you've trimmed and cleaned your data, you're ready to move on to exploration. **Compute statistics** and **create visualizations** with the goal of addressing the research questions that you posed in the Introduction section. You should compute the relevant statistics throughout the analysis when an inference is made about the data. Note that at least two or more kinds of plots should be created as part of the exploration, and you must  compare and show trends in the varied visualizations. 



> **Tip**: Investigate the stated question(s) from multiple angles. It is recommended that you be systematic with your approach. Look at one variable at a time, and then follow it up by looking at relationships between variables. You should explore at least three variables in relation to the primary question. This can be an exploratory relationship between three variables of interest, or looking at how two independent variables relate to a single dependent variable of interest. Lastly, you should perform both single-variable (1d) and multiple-variable (2d) explorations.
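
One possible way to follow the one-variable-then-relationships approach from the tip is sketched below. It is only an illustration: the `toy` frame holds made-up values, and the `missed` helper column is a hypothetical name introduced here, not part of the dataset. The rates it prints describe the toy data only and say nothing about the real appointments.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the cleaned dataset.
toy = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "F", "M"],
    "SMS_received": [0, 1, 0, 1, 1, 0],
    "No_show": ["No", "Yes", "No", "No", "Yes", "Yes"],
})

# Hypothetical helper column: 1 if the appointment was missed, else 0.
toy["missed"] = (toy["No_show"] == "Yes").astype(int)

# Single variable: overall share of missed appointments.
overall_rate = toy["missed"].mean()

# Two variables: share of missed appointments within each group.
rate_by_gender = toy.groupby("Gender")["missed"].mean()
rate_by_sms = toy.groupby("SMS_received")["missed"].mean()

print(overall_rate)
print(rate_by_gender)
print(rate_by_sms)
```

Comparing group rates like this is descriptive only; without a statistical test, a difference between groups should not be read as evidence of an effect.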


### Research Question 1 (Replace this header name!)

In [None]:
# Use this, and more code cells, to explore your data. Don't forget to add
#   Markdown cells to document your observations and findings.


### Research Question 2  (Replace this header name!)

In [None]:
# Continue to explore the data to address your additional research
#   questions. Add more headers as needed if you have more questions to
#   investigate.


<a id='conclusions'></a>
## Conclusions

> **Tip**: Finally, summarize your findings and the results that have been performed in relation to the question(s) provided at the beginning of the analysis. Summarize the results accurately, and point out where additional research can be done or where additional information could be useful.

> **Tip**: Make sure that you are clear with regards to the limitations of your exploration. You should have at least 1 limitation explained clearly. 

> **Tip**: If you haven't done any statistical tests, do not imply any statistical conclusions. And make sure you avoid implying causation from correlation!

> **Tip**: Once you are satisfied with your work here, check over your report to make sure that it satisfies all the areas of the rubric (found on the project submission page at the end of the lesson). You should also probably remove all of the "Tips" like this one so that the presentation is as polished as possible.

## Submitting your Project 

> **Tip**: Before you submit your project, you need to create a .html or .pdf version of this notebook in the workspace here. To do that, run the code cell below. If it worked correctly, you should get a return code of 0, and you should see the generated .html file in the workspace directory (click on the orange Jupyter icon in the upper left).

> **Tip**: Alternatively, you can download this report as .html via the **File** > **Download as** submenu, and then manually upload it into the workspace directory by clicking on the orange Jupyter icon in the upper left, then using the Upload button.

> **Tip**: Once you've done this, you can submit your project by clicking on the "Submit Project" button in the lower right here. This will create and submit a zip file with this .ipynb doc and the .html or .pdf version you created. Congratulations!

In [None]:
from subprocess import call
# name the output format explicitly instead of relying on nbconvert's default
call(['python', '-m', 'nbconvert', '--to', 'html', 'Investigate_a_Dataset.ipynb'])