
# Project: Investigate a Dataset - No-show Appointments

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 

This dataset collects information from 100k medical appointments in Brazil and is focused on the question of whether or not patients show up for their appointment. A number of characteristics about the patient are included in each row.

1: ‘ScheduledDay’ tells us on what day the patient set up their appointment.

2: ‘Neighbourhood’ indicates the location of the hospital.

3: ‘Scholarship’ indicates whether or not the patient is enrolled in the Brazilian welfare program Bolsa Família.

4: Be careful about the encoding of the last column, ‘No-show’: it says ‘No’ if the patient showed up to their appointment, and ‘Yes’ if they did not show up.
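
Because the ‘No-show’ encoding is easy to misread, one option is to derive a column with the opposite, more intuitive meaning. The snippet below is a minimal sketch on a hypothetical three-row sample; the column name `showed_up` is an assumption for illustration and is not part of the original dataset.

```python
import pandas as pd

# Hypothetical three-row sample; the real values come from the project's CSV file.
sample = pd.DataFrame({"No-show": ["No", "Yes", "No"]})

# 'No' means the patient attended, so invert the label into a clearer flag.
# 'showed_up' is an illustrative column name, not a column of the dataset.
sample["showed_up"] = sample["No-show"].map({"No": True, "Yes": False})

print(sample)
```

With a flag like this, the mean of `showed_up` can be read directly as an attendance rate.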


 

In [4]:
# Import the libraries used in this analysis.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
%matplotlib inline 


In [5]:
# Upgrade pandas to use dataframe.explode() function. 
!pip install --upgrade pandas==0.25.0


<a id='wrangling'></a>
## Data Wrangling

In this section of the report, the data is loaded, checked for cleanliness, and then trimmed and cleaned for analysis.



### General Properties


In [64]:
# Load the data and preview the first few rows to inspect the columns
#   and look for missing or possibly errant values.
df = pd.read_csv('noshowappointments-kagglev2-may-2016.csv')
df.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [65]:
# the shape of data 
df.shape

(110527, 14)

In [66]:
# The data contains 14 columns and 110,527 appointments (rows).


### Data Cleaning
This part checks the data for cleanliness, then trims and cleans the dataset for analysis.
 

In [67]:
# Check for missing values and inspect the data type of each column
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
PatientId         110527 non-null float64
AppointmentID     110527 non-null int64
Gender            110527 non-null object
ScheduledDay      110527 non-null object
AppointmentDay    110527 non-null object
Age               110527 non-null int64
Neighbourhood     110527 non-null object
Scholarship       110527 non-null int64
Hipertension      110527 non-null int64
Diabetes          110527 non-null int64
Alcoholism        110527 non-null int64
Handcap           110527 non-null int64
SMS_received      110527 non-null int64
No-show           110527 non-null object
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB


In [68]:
 # Data has dtypes: float64(1), int64(8), object(5)

In [69]:
df.dtypes

PatientId         float64
AppointmentID       int64
Gender             object
ScheduledDay       object
AppointmentDay     object
Age                 int64
Neighbourhood      object
Scholarship         int64
Hipertension        int64
Diabetes            int64
Alcoholism          int64
Handcap             int64
SMS_received        int64
No-show            object
dtype: object
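
The output above lists `ScheduledDay` and `AppointmentDay` as `object`, which means the dates are stored as strings. If date arithmetic is needed later, one possible approach is to convert them with `pd.to_datetime`. This is a hedged sketch that uses the first two rows shown by `df.head()` as a small sample; it is not a step the notebook itself performs.

```python
import pandas as pd

# Small sample built from the first two rows shown by df.head() above.
sample = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-29T16:08:27Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
})

# Parse the ISO 8601 strings into timezone-aware datetime columns.
for col in ["ScheduledDay", "AppointmentDay"]:
    sample[col] = pd.to_datetime(sample[col])

print(sample.dtypes)
```

After the conversion, accessors such as `.dt.day` or `.dt.dayofweek` become available on both columns.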

In [70]:
# Return the number of unique values for each column:


df.nunique()


PatientId          62299
AppointmentID     110527
Gender                 2
ScheduledDay      103549
AppointmentDay        27
Age                  104
Neighbourhood         81
Scholarship            2
Hipertension           2
Diabetes               2
Alcoholism             2
Handcap                5
SMS_received           2
No-show                2
dtype: int64

In [2]:
# Check for duplicate rows
df.duplicated().sum()



In [72]:
# The data contains no duplicate rows: every AppointmentID is unique.

In [None]:
df.isnull().sum()

In [None]:
# The data has no null values.

In [None]:
df['PatientId'].nunique()

In [None]:
# There are 62,299 unique patient IDs in the data.

In [None]:
# Count rows whose PatientId and No-show combination repeats an earlier row

df.duplicated(['PatientId','No-show']).sum()
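
To make clear what this count measures, here is a minimal sketch on a hypothetical four-row sample (the IDs and labels are invented for illustration). With the default `keep='first'`, `duplicated` marks a row only when the same `PatientId` and `No-show` pair has already appeared in an earlier row.

```python
import pandas as pd

# Hypothetical sample: patient 1 has three appointments, patient 2 has one.
sample = pd.DataFrame({
    "PatientId": [1, 1, 1, 2],
    "No-show": ["No", "No", "Yes", "No"],
})

# Only the second row repeats an earlier (PatientId, No-show) pair.
repeat_flags = sample.duplicated(["PatientId", "No-show"])
repeat_count = repeat_flags.sum()

print(repeat_flags.tolist())
print(repeat_count)
```

The count therefore reflects patients who had the same outcome more than once, not rows that are duplicated in full.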

In [None]:
df.describe()


In [None]:
# Drop the PatientId and Alcoholism columns (axis=1 selects columns, not rows)

df = df.drop(['PatientId', 'Alcoholism'], axis=1)
df.info()

In [None]:
df.describe()

In [None]:
# The mean age is about 37 and the minimum is -1, which is an error, so that row will be removed.


In [None]:
# Age is an integer column, so compare against the number -1 rather than the string "-1"
mask = df.query('Age == -1')
mask


In [None]:

# Remove the row with the invalid age of -1

df.drop(index=99832,inplace=True)

df.describe()

In [None]:
# After the removal, the minimum age is 0.
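
Dropping a row by its index label works for this file, but it depends on knowing that label in advance. A possible alternative, sketched here on a hypothetical four-row sample, is to keep only the rows that satisfy the validity condition, so no index value has to be hard-coded.

```python
import pandas as pd

# Hypothetical sample containing one invalid (negative) age.
sample = pd.DataFrame({"Age": [62, -1, 0, 8]})

# Keep only rows with a non-negative age; an age of 0 is kept,
# on the assumption that it represents an infant.
cleaned = sample[sample["Age"] >= 0].reset_index(drop=True)

print(cleaned)
```

Whether an age of 0 should count as valid is a judgment call for the analysis; the sketch keeps it.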

<a id='eda'></a>
## Exploratory Data Analysis



### Research Question 1 (Which gender attended appointments more often?)


In [3]:
# Represent each gender's share of appointments with a pie chart
ax = df["Gender"].value_counts().plot(figsize=(8, 8), kind="pie", autopct='%.2f', textprops={'fontsize': 18})
ax.set_title("(%) of Appointments for Males vs Females", fontsize=20);


In [None]:
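# As seen in the figure, female patients account for about 65% of the appointments and male patients for about 35%.

# This suggests a relationship between gender and appointments. Note that these figures are shares of all appointments, not attendance rates within each gender.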
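
The pie chart shows each gender's share of all appointments. To compare attendance itself, one possible follow-up is to compute the show-up rate within each gender. The snippet below is a sketch on a hypothetical five-row sample, so its numbers are illustrative only and say nothing about the real data; `showed_up` is an assumed helper column.

```python
import pandas as pd

# Hypothetical sample; the values are invented to illustrate the calculation.
sample = pd.DataFrame({
    "Gender": ["F", "F", "F", "M", "M"],
    "No-show": ["No", "No", "Yes", "No", "Yes"],
})

# 'No' in the No-show column means the patient attended.
sample["showed_up"] = sample["No-show"] == "No"

# Mean of a boolean column is the fraction of True values, per gender.
attendance_rate = sample.groupby("Gender")["showed_up"].mean() * 100

print(attendance_rate.round(2))
```

Applied to the cleaned `df`, the same grouping would give a rate for each gender that can be compared directly, independent of how many appointments each gender booked.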

<a id='conclusions'></a>
## Conclusions

This analysis was limited to examining the share of appointments by gender. Female patients accounted for the largest share of appointments, which suggests a relationship between gender and attendance.
 

Limitations:
A few values in the data look wrong and need to be explained, such as negative or 0 age values.


## Submitting your Project 


In [None]:
# Export the notebook with nbconvert; a return code of 0 means the export succeeded.
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])