# Project: No Show Appointments Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

In this project, we analyze the <a href="https://www.google.com/url?q=https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd2e9a_noshowappointments-kagglev2-may-2016/noshowappointments-kagglev2-may-2016.csv&sa=D&ust=1532469042118000">No-Show Appointment</a> dataset. This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.

The dataset on Kaggle: <a href="https://www.kaggle.com/joniarroba/noshowappointments">Medical Appointment No Shows</a>

- 'ScheduledDay' tells us on what day the patient set up their appointment.
- 'Neighbourhood' indicates the location of the hospital.
- 'Scholarship' indicates whether or not the patient is enrolled in the Brazilian welfare program.
- 'No-show' says 'No' if the patient showed up to their appointment and 'Yes' if they did not show up.


**Questions**:
We're going to investigate:
- "What are the most common reasons for patients not showing up for their appointments?"
- "Which factors affect showing up for appointments?"


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sb

from pandas.plotting import scatter_matrix
import matplotlib.ticker as ticker

%matplotlib inline

<a id='wrangling'></a>
## Data Wrangling

### General Properties

Load the no-show appointments dataset and preview the first rows.

In [None]:
df = pd.read_csv('data/noshowappointments.csv')
df.head()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.hist(figsize=(20,15));

### Data Cleaning (Drop invalid, outliers & duplicated values, Change types, Rename columns & Delete unimportant columns)

#### Rename columns

In [None]:
df['SMSReceived'] = df['SMS_received']
df['NoShow'] = df['No-show']

#### Delete unimportant columns

In [None]:
df.drop(['PatientId', 'AppointmentID','SMS_received', 'No-show'], axis=1 ,inplace=True)

In [None]:
df.info()

#### Drop invalid values

Since age can't be negative, rows with a negative age are invalid and are dropped from the dataset.

In [None]:
negative_age = df[df.Age < 0].index
negative_age

In [None]:
df.drop(negative_age ,inplace=True)

In [None]:
sum(df.Age < 0)

#### Drop duplicated values

In [None]:
sum(df.duplicated())

Since duplicated rows would distort the insights, it's better to drop them from the dataset.

In [None]:
df.drop_duplicates(inplace=True)
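
One thing worth checking at this step: the ID columns were dropped before counting duplicates, so two distinct appointments that share every remaining value are also counted as duplicates. The sketch below illustrates the effect on a made-up three-row frame; the values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'AppointmentID': [1, 2, 3],
    'Age': [30, 30, 45],
    'NoShow': ['No', 'No', 'Yes'],
})

# With the ID column present, every row is unique
dupes_with_id = int(toy.duplicated().sum())

# Without the ID column, rows 0 and 1 look identical
dupes_without_id = int(toy.drop(columns='AppointmentID').duplicated().sum())

print(dupes_with_id, dupes_without_id)
```

Comparing the two counts on the real data would show how many of the dropped rows are exact re-entries and how many are separate appointments that merely look alike.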

In [None]:
df.info()

#### Drop outliers

In [None]:
# Pass the data by keyword; recent seaborn versions no longer accept it positionally
sb.boxplot(x=df.Age).set_title('Age distribution')
plt.show()

In [None]:
df[(df.Age > 100)]

In [None]:
# Keep ages up to and including 100; .copy() avoids SettingWithCopyWarning later
df = df[df.Age <= 100].copy()

- Drop patients older than 100, as they are outliers
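
The cutoff of 100 above is a fixed threshold chosen by eye from the box plot. As a point of comparison, a common alternative is the 1.5 × IQR rule that box plots use for their whiskers. The following is a minimal sketch of that rule on made-up ages (not taken from the dataset); it is not the method used in this notebook.

```python
import pandas as pd

# Toy ages, invented for illustration only
ages = pd.Series([1, 20, 35, 40, 52, 60, 75, 115])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values above the upper fence are flagged as outliers
upper_fence = q3 + 1.5 * iqr
outliers = ages[ages > upper_fence]

print(upper_fence, outliers.tolist())
```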

#### Change types

In [None]:
df['AppointmentDay'] = pd.to_datetime(df['AppointmentDay']).dt.date
df['ScheduledDay'] = pd.to_datetime(df['ScheduledDay']).dt.date

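Remove appointments whose appointment day is earlier than the day they were scheduled, since these records are inconsistent.

In [None]:
# .copy() avoids SettingWithCopyWarning when new columns are added below
df = df[df['AppointmentDay'] >= df['ScheduledDay']].copy()
df.shape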
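
The same date check can also be expressed as a waiting time in days, which makes the invalid rows explicit (negative waits). This is a minimal sketch on made-up timestamps in the same ISO format as the raw columns; the `WaitingDays` column name is an assumption introduced here, not part of the dataset.

```python
import pandas as pd

# Toy timestamps, invented for illustration only
toy = pd.DataFrame({
    'ScheduledDay': ['2016-04-29T18:38:08Z', '2016-05-10T07:00:00Z', '2016-05-03T09:15:00Z'],
    'AppointmentDay': ['2016-04-29T00:00:00Z', '2016-05-09T00:00:00Z', '2016-05-20T00:00:00Z'],
})

# Normalize to midnight so only the calendar day is compared
scheduled = pd.to_datetime(toy['ScheduledDay']).dt.normalize()
appointment = pd.to_datetime(toy['AppointmentDay']).dt.normalize()

# Hypothetical helper column: days between scheduling and the appointment
toy['WaitingDays'] = (appointment - scheduled).dt.days

# A negative wait means the appointment precedes its own scheduling date
valid = toy[toy['WaitingDays'] >= 0]

print(toy['WaitingDays'].tolist(), len(valid))
```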

In [None]:
# NoShow == "No" means the patient attended, so IsShowed is 1 for attended appointments
df['IsShowed'] = (df.NoShow == "No").astype(int)

In [None]:
df.info()

<a id='eda'></a>
## Exploratory Data Analysis

### Research Question 1 (Calculate percentages)

In [None]:
def calculate_percent(df_last_ver, column_name, column_value, txt):
    """Print the percentage of non-null rows where column_name equals column_value."""
    count = (df_last_ver[column_name] == column_value).sum()
    percent = count / df_last_ver[column_name].count() * 100
    print('{t} percentage = {p:.2f} %'.format(t=txt, p=percent))

In [None]:
calculate_percent(df,'NoShow','No', 'Showed patients')
calculate_percent(df,'NoShow','Yes', 'Never showed patients')

In [None]:
df.NoShow.value_counts().plot.pie(figsize=(6,6), autopct='%.2f%%', explode=(0, .05))
plt.title('Showed & Never Showed Patients chart & percentages')
plt.show()

In [None]:
df.Hipertension.value_counts().plot.pie(figsize=(6,6), autopct='%.2f%%', explode=(0, .05))
plt.title('Hipertension Patients chart & percentages')
plt.show()

- Patients with hypertension (column `Hipertension`): 19.73 %
- Patients without hypertension: 80.27 %

In [None]:
df.Scholarship.value_counts().plot.pie(figsize=(6,6), autopct='%.2f%%', explode=(0, .05))
plt.title('Patients with & without Scholarships chart & percentages')
plt.show()

- About 90% of patients do not have a scholarship
- About 10% of patients have a scholarship

### Research Question 2 (Analyze relations between Age, Neighbourhood, Scholarship & not showing up for appointments)

In [None]:
def value_to_class(value):
    """Map an age in years to an age-class code (see age_dict for labels)."""
    if value < 4:
        return 1          # Baby
    elif value < 13:
        return 2          # Child
    elif value < 21:
        return 3          # Teen
    elif value < 35:
        return 4          # Young adult
    elif value < 60:
        return 5          # Adult
    else:
        return 6          # Elder
       
age_dict = {
    1:'Baby',
    2:'Child',
    3:'Teen',
    4:'Young adult',
    5:'Adult',
    6:'Elder',
}

In [None]:
df['AgeClass'] = df['Age'].apply(value_to_class)
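
The same binning can be written with `pd.cut`, which assigns labels directly instead of numeric codes. This is a minimal sketch on made-up ages; the bin edges mirror the thresholds in `value_to_class`, and `right=False` makes each bin include its lower edge and exclude its upper edge, matching the `<` comparisons.

```python
import pandas as pd

# Toy ages, invented for illustration only
ages = pd.Series([2, 10, 18, 30, 45, 70])

# Edges mirror value_to_class: [0, 4), [4, 13), [13, 21), [21, 35), [35, 60), [60, inf)
bins = [0, 4, 13, 21, 35, 60, float('inf')]
labels = ['Baby', 'Child', 'Teen', 'Young adult', 'Adult', 'Elder']

age_class = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_class.tolist())
```

The result is an ordered categorical, so plots and group-bys keep the classes in age order rather than alphabetical order.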

In [None]:
df.groupby('NoShow').AgeClass.value_counts().plot(kind='bar')
plt.title('Showed & Never showed patients by age class (bar chart)');

In [None]:
df.Age.plot.hist()
plt.title('Patients age histogram')
plt.show();
print('average age of all patients: {:.2f}'.format(df.Age.mean()))

In [None]:
sb.boxplot( x=df.Age, y=df.NoShow)
plt.title('Age distribution for Showed & Never showed patients')
plt.show()


- The box plot suggests that patients who never showed up for their appointments are concentrated roughly between ages 20 and 50

In [None]:
# Count appointments per handicap level, split by attendance outcome
handcap_counts = pd.crosstab(df['Handcap'], df['NoShow'])
ax = handcap_counts.plot(kind='bar', edgecolor='black', alpha=0.5)

# Show each bar as a percentage of all appointments
ax.yaxis.set_major_formatter(ticker.PercentFormatter(xmax=len(df)))

plt.title('Handicapped patients for Showed & Never showed')
plt.legend(title='NoShow');


- The number of **handicapped** patients who **never showed** up for appointments is **very small** compared with **non-handicapped** patients who **showed** up for their appointments

In [None]:
showed = (df.NoShow == 'No')
neverShowed = (df.NoShow == 'Yes')

df.Neighbourhood[showed].hist(alpha=0.5, bins=20, label='showed')
df.Neighbourhood[neverShowed].hist(alpha=0.5, bins=20, label='neverShowed')
plt.title('Showed & Never showed patients according to population distribution')
plt.legend();

In [None]:
df[(df['NoShow']== 'Yes')]['Neighbourhood'].value_counts()

- The last histogram shows that **some neighbourhoods** have **higher numbers** of patients who **never showed** up for appointments than others

- From the value_counts output in the last cell, the neighbourhoods with the most never-showed patients include: (JARDIM CAMBURI, MARIA ORTIZ, ITARARÉ, RESISTÊNCIA, CENTRO,..)
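
Raw counts tend to favour neighbourhoods that simply have more appointments, so a rate per neighbourhood can be a useful complement. The following is a minimal sketch on a made-up frame (neighbourhood names `A` and `B` and the `missed` helper column are hypothetical) showing counts and rates side by side.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Neighbourhood': ['A', 'A', 'A', 'A', 'B', 'B'],
    'NoShow': ['Yes', 'No', 'No', 'No', 'Yes', 'No'],
})

summary = (
    toy.assign(missed=toy['NoShow'].eq('Yes'))   # True where the patient did not show
       .groupby('Neighbourhood')['missed']
       .agg(appointments='size', no_shows='sum', no_show_rate='mean')
       .sort_values('no_show_rate', ascending=False)
)

print(summary)
```

In this toy example both neighbourhoods have one no-show, yet `B` has the higher rate because it has fewer appointments overall.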

In [None]:
df[(df['NoShow']== 'Yes')]['SMSReceived'].value_counts().plot.pie(figsize=(6,6), autopct='%.2f%%', explode=(0, .05))
plt.title('Never showed patients that had or had not received SMS')
plt.show()

- 55.72% of patients who never showed up for their appointment didn't receive an SMS
- 44.28% of patients who never showed up for their appointment received an SMS
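
Note that these figures describe the SMS split among no-shows, which is a different quantity from the no-show rate within each SMS group. The sketch below shows both views with `pd.crosstab` on a made-up frame; the toy numbers are invented and say nothing about the real dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'SMSReceived': [0, 0, 0, 0, 0, 0, 1, 1],
    'NoShow': ['Yes', 'Yes', 'No', 'No', 'No', 'No', 'Yes', 'No'],
})

# Column-normalized: SMS split within each attendance outcome (the pie chart's view)
sms_given_outcome = pd.crosstab(toy['SMSReceived'], toy['NoShow'], normalize='columns')

# Row-normalized: attendance outcome within each SMS group (the no-show rate)
outcome_given_sms = pd.crosstab(toy['SMSReceived'], toy['NoShow'], normalize='index')

print(sms_given_outcome)
print(outcome_given_sms)
```

In the toy frame most no-shows received no SMS, yet the no-show rate is higher in the SMS group, which is why the two tables should not be read interchangeably.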

In [None]:
df.ScheduledDay[showed].hist(alpha=0.5, bins=20, label='showed')
df.ScheduledDay[neverShowed].hist(alpha=0.5, bins=20, label='neverShowed')
plt.title('Showed & Never showed patients according to Scheduled day histogram')
plt.legend();

- **Some scheduled days** have **higher numbers** of patients who **never showed** up for appointments than others, for instance the period of **May 2016**

In [None]:
df.AppointmentDay[showed].hist(alpha=0.5, bins=20, label='showed')
df.AppointmentDay[neverShowed].hist(alpha=0.5, bins=20, label='neverShowed')
plt.title('Showed & Never showed patients according to Appointment day histogram')
plt.legend();

- **Some appointment days** have **higher numbers** of patients who **never showed** up than others, for instance appointments in **June 2016**

In [None]:
df.Scholarship[showed].hist(alpha=0.5, bins=20, label='showed')
df.Scholarship[neverShowed].hist(alpha=0.5, bins=20, label='neverShowed')
plt.title('Showed & Never showed patients according to their Scholarship state histogram')
plt.legend();

In [None]:
df.groupby('NoShow').Scholarship.value_counts()

- About **25%** of patients with **no scholarship** did not show up for their appointments

<a id='conclusions'></a>
## Conclusions

#### Never showed patients percentage = 20.11 %

- Showed patients percentage = 79.89 %
- Never showed patients percentage = 20.11 %

- Patients with hypertension (column `Hipertension`): 19.73 %
- Patients without hypertension: 80.27 %

- About 90% of patients do not have a scholarship
- About 10% of patients have a scholarship

- Average age of all patients: 37.08

- Patients who never showed up for their appointments are concentrated roughly between ages 20 and 50

- The number of **handicapped** patients who **never showed** up for appointments is **very small** compared with **non-handicapped** patients who **showed** up for their appointments

- **Some neighbourhoods** have **higher numbers** of patients who **never showed** up for appointments than others
- The neighbourhoods with the most never-showed patients include: (JARDIM CAMBURI, MARIA ORTIZ, ITARARÉ, RESISTÊNCIA, CENTRO,..)

- 55.72% of patients who never showed up for their appointment didn't receive an SMS
- 44.28% of patients who never showed up for their appointment received an SMS

- **Some scheduled days** have **higher numbers** of patients who **never showed** up for appointments than others, for instance the period of **May 2016**

- **Some appointment days** have **higher numbers** of patients who **never showed** up than others, for instance appointments in **June 2016**

- About **25%** of patients with **no scholarship** did not show up for their appointments

- Limitations: these findings are descriptive only. No statistical tests were performed, so none of the observations above should be read as evidence that a factor causes patients to miss appointments.