## Project: Predicting Medical Appointment No-Shows
by Obiorah Philip (2304216)

### Business Understanding:

(a) Background Study:
> No-shows for medical appointments are a widespread and serious problem in the healthcare industry. Patients who miss their appointments cause inefficiencies in the healthcare system, wasted resources, higher expenses, and delays in other patients' access to care. Failure to attend outpatient visits also has a detrimental effect on health outcomes. By understanding the variables that lead to no-shows, healthcare practitioners can increase patient attendance rates and streamline their scheduling procedures, which in turn supports the wider aim of improving healthcare quality.

> The goal of this project is to create a predictive model that can reliably predict whether or not a patient will show up for a planned medical visit. Healthcare practitioners may adopt focused interventions and initiatives to minimise no-show rates, optimise resource allocation, and improve overall patient care by understanding the primary variables impacting no-shows.

(b) Project Overall Aim and Business Objectives:
> The ultimate goal of this research is to develop a prediction model that can accurately forecast if a patient will attend a medical visit. The following business objectives have been defined to attain this goal:

- Develop a thorough grasp of the dataset: Investigate the dataset of 110,527 medical appointments and learn about the characteristics and factors linked with each appointment.

- Identify the data mining tasks: Determine which data mining tasks can be applied to the dataset to meet the project goal. These may include classification, feature selection, data preparation, and model assessment.

- Connect business goals to data mining tasks: Determine the link between the defined business goals and the individual data mining tasks, for example selecting the features that most influence appointment no-shows, or using a classification model to predict no-show outcomes.




(c) Literature Review:
> Predicting medical appointment attendance has attracted the interest of healthcare researchers because of its potential influence on resource optimisation, patient care, and overall operational efficiency. Data mining techniques have been widely used to address this issue and to identify trends and variables influencing appointment no-shows. Several studies have investigated various strategies and approaches, demonstrating data mining's promise in this area.

> To anticipate medical appointment no-shows, classification techniques such as decision trees, logistic regression, support vector machines, and random forests have been widely employed. These algorithms use patient-related characteristics, scheduling considerations, and historical data to build models that distinguish patients who are likely to attend from those who are likely to miss their appointment.

> Another important step in predicting appointment attendance is feature selection. To determine the characteristics that contribute most to appointment no-shows, researchers have used approaches such as information gain, chi-square, and correlation analysis. Selecting informative characteristics can yield models with better accuracy and interpretability.
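
> As an illustration of the chi-square approach mentioned above, the sketch below ranks two binary features by their Pearson chi-square statistic against a binary target. It is a minimal sketch, not the method of any cited study: the helper `chi_square_stat`, the `NoShow` column, and all values in the toy frame are hypothetical and are not taken from the appointments file.

```python
import numpy as np
import pandas as pd


def chi_square_stat(feature: pd.Series, target: pd.Series) -> float:
    """Pearson chi-square statistic for two categorical columns (no continuity correction)."""
    # Observed counts for every (feature value, target value) pair.
    observed = pd.crosstab(feature, target).to_numpy()
    # Expected counts if feature and target were independent.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return float(((observed - expected) ** 2 / expected).sum())


# Hypothetical toy data, constructed only to exercise the helper.
toy = pd.DataFrame({
    "SMS_received": [0, 0, 1, 1, 0, 0, 1, 1],
    "Scholarship":  [0, 1, 0, 1, 0, 1, 0, 1],
    "NoShow":       [0, 0, 1, 1, 0, 0, 1, 1],
})

# Score each candidate feature against the target and rank from strongest to weakest.
scores = {
    col: chi_square_stat(toy[col], toy["NoShow"])
    for col in ["SMS_received", "Scholarship"]
}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

> A larger statistic indicates a stronger departure from independence between the feature and the target. On real data the statistic would normally be accompanied by a p-value and checked against expected cell counts that are not too small.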

### Data Exploration and Understanding

#### Load the dataset

In [1]:
# Import the required libraries
import pandas as pd

# Load the dataset
df = pd.read_csv("KaggleV2-May-2016.csv")

# View the first few rows of the dataset
print("\nFirst Few rows of the Dataset:")
df.head()


| | PatientId | AppointmentID | Gender | ScheduledDay | AppointmentDay | Age | Neighbourhood | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received | No-show |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 29872500000000.0 | 5642903 | F | 2016-04-29T18:38:08Z | 2016-04-29T00:00:00Z | 62 | JARDIM DA PENHA | 0 | 1 | 0 | 0 | 0 | 0 | No |
| 1 | 558997800000000.0 | 5642503 | M | 2016-04-29T16:08:27Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 2 | 4262962000000.0 | 5642549 | F | 2016-04-29T16:19:04Z | 2016-04-29T00:00:00Z | 62 | MATA DA PRAIA | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 3 | 867951200000.0 | 5642828 | F | 2016-04-29T17:29:31Z | 2016-04-29T00:00:00Z | 8 | PONTAL DE CAMBURI | 0 | 0 | 0 | 0 | 0 | 0 | No |
| 4 | 8841186000000.0 | 5642494 | F | 2016-04-29T16:07:23Z | 2016-04-29T00:00:00Z | 56 | JARDIM DA PENHA | 0 | 1 | 1 | 0 | 0 | 0 | No |


In [3]:
# Display the general information about the dataset
print("Dataset Information:")
print(df.info())

Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110527 entries, 0 to 110526
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   PatientId       110527 non-null  float64
 1   AppointmentID   110527 non-null  int64  
 2   Gender          110527 non-null  object 
 3   ScheduledDay    110527 non-null  object 
 4   AppointmentDay  110527 non-null  object 
 5   Age             110527 non-null  int64  
 6   Neighbourhood   110527 non-null  object 
 7   Scholarship     110527 non-null  int64  
 8   Hipertension    110527 non-null  int64  
 9   Diabetes        110527 non-null  int64  
 10  Alcoholism      110527 non-null  int64  
 11  Handcap         110527 non-null  int64  
 12  SMS_received    110527 non-null  int64  
 13  No-show         110527 non-null  object 
dtypes: float64(1), int64(8), object(5)
memory usage: 11.8+ MB
None
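
> The info() output above lists ScheduledDay and AppointmentDay as object (string) columns, so they need to be parsed before any date arithmetic. The sketch below shows one way to do that and to derive a waiting time in days. It is a minimal sketch under stated assumptions: the `WaitDays` column name is hypothetical, the first sample row copies the timestamps of row 0 shown earlier, and the second row is invented to show a non-zero wait.

```python
import pandas as pd

sample = pd.DataFrame({
    # Row 0 matches the first appointment shown above; row 1 is hypothetical.
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T08:36:51Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-04-29T00:00:00Z"],
})

# Parse the ISO 8601 strings into timezone-aware timestamps.
scheduled = pd.to_datetime(sample["ScheduledDay"], utc=True)
appointment = pd.to_datetime(sample["AppointmentDay"], utc=True)

# AppointmentDay carries no time of day, so compare calendar dates only.
# Otherwise a same-day booking made in the afternoon would give a negative wait.
sample["WaitDays"] = (appointment.dt.normalize() - scheduled.dt.normalize()).dt.days
print(sample)
```

> Dropping the time component before subtracting is a design choice for this sketch; it treats every booking made on the appointment date as a wait of zero days.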


In [6]:
# We check for missing values
print("\nMissing Values: ")
print(df.isnull().sum())


Missing Values: 
PatientId         0
AppointmentID     0
Gender            0
ScheduledDay      0
AppointmentDay    0
Age               0
Neighbourhood     0
Scholarship       0
Hipertension      0
Diabetes          0
Alcoholism        0
Handcap           0
SMS_received      0
No-show           0
dtype: int64


In [8]:
# Check for outliers using summary statistics
print("Check for outliers")
print(df.describe())

Check for outliers
          PatientId  AppointmentID            Age    Scholarship  \
count  1.105270e+05   1.105270e+05  110527.000000  110527.000000   
mean   1.474963e+14   5.675305e+06      37.088874       0.098266   
std    2.560949e+14   7.129575e+04      23.110205       0.297675   
min    3.921784e+04   5.030230e+06      -1.000000       0.000000   
25%    4.172614e+12   5.640286e+06      18.000000       0.000000   
50%    3.173184e+13   5.680573e+06      37.000000       0.000000   
75%    9.439172e+13   5.725524e+06      55.000000       0.000000   
max    9.999816e+14   5.790484e+06     115.000000       1.000000   

        Hipertension       Diabetes     Alcoholism        Handcap  \
count  110527.000000  110527.000000  110527.000000  110527.000000   
mean        0.197246       0.071865       0.030400       0.022248   
std         0.397921       0.258265       0.171686       0.161543   
min         0.000000       0.000000       0.000000       0.000000   
25%         0.000000   

> The dataset consists of 110,527 medical appointments, each described by 14 variables. The target variable is whether or not the patient arrived for their planned appointment. The remaining appointment and patient-related features can be studied to build a full picture of the data.
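
> Because the target is stored as the strings "Yes" and "No" in the No-show column, it is usually encoded numerically before modelling. The sketch below assumes, following the dataset description on Kaggle, that "No" means the patient attended and "Yes" means the appointment was missed. The label values are a hypothetical toy sample, not counts from the real file.

```python
import pandas as pd

# Hypothetical toy labels in the same format as the No-show column.
labels = pd.Series(["No", "No", "Yes", "No", "Yes"], name="No-show")

# Encode the target: 1 = missed appointment, 0 = attended.
missed = labels.map({"No": 0, "Yes": 1})

# Share of missed appointments, useful for spotting class imbalance.
no_show_rate = missed.mean()
print(missed.tolist(), no_show_rate)
```

> Checking the class balance early matters because a strongly imbalanced target makes plain accuracy a misleading evaluation metric.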

> Data source: https://www.kaggle.com/datasets/joniarroba/noshowappointments
>
> The initial examination of the dataset includes reviewing its general features and assessing data quality. This step helps identify anomalies, missing values, outliers, and other issues that may affect the subsequent analysis and modelling phases.

> The following key aspects and characteristics can be identified through data exploration:

- Size: The dataset contains 110,527 medical appointments, which provides a considerable amount of data for analysis and for developing robust prediction models.

- Variables and characteristics: There are 14 variables linked with each appointment. They include patient-related factors such as age, gender, and medical conditions, and appointment-specific properties such as the scheduling date (ScheduledDay) and the appointment date (AppointmentDay), from which a waiting time can be derived. Other potentially significant factors include the patient's neighbourhood, scholarship status, and whether or not the patient received an SMS reminder.

- PatientId and AppointmentID: These are identification numbers, so outlier analysis does not apply to them.
- Age: The minimum age is recorded as -1, which is an invalid value. Further investigation is needed to determine the reason for this discrepancy.
- Scholarship, Hipertension, Diabetes, Alcoholism, and SMS_received: These variables are binary (0 or 1), indicating the presence or absence of a condition. No outliers were found.
- Handcap: The maximum value is 4, indicating a potential outlier. However, without additional information on the scale or definition of the Handcap variable, it is difficult to determine whether this value is an outlier or a valid data point.
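
> The Age and Handcap observations above can be followed up with two simple checks: filtering for negative ages and counting how often each Handcap level occurs. This is a minimal sketch on a hypothetical toy frame; the values only mimic the extremes reported by describe() (-1, 115, and 4) and are not rows from the real file.

```python
import pandas as pd

# Hypothetical toy rows that reproduce the extremes seen in the summary statistics.
toy = pd.DataFrame({
    "Age": [62, -1, 8, 115, 56],
    "Handcap": [0, 0, 1, 4, 2],
})

# Rows with a negative age cannot be valid and need investigation.
invalid_age = toy[toy["Age"] < 0]

# Frequency of each Handcap level, to judge whether values above 1 are rare.
handcap_counts = toy["Handcap"].value_counts().sort_index()

print(invalid_age)
print(handcap_counts)
```

> On the full dataset, the same two expressions applied to df would show how many records are affected, which helps decide between dropping, correcting, or keeping them.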

| | PatientId | AppointmentID | Age | Scholarship | Hipertension | Diabetes | Alcoholism | Handcap | SMS_received |
|---|---|---|---|---|---|---|---|---|---|
| count | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 | 110527.0 |
| mean | 147496300000000.0 | 5675305.0 | 37.088874 | 0.098266 | 0.197246 | 0.071865 | 0.0304 | 0.022248 | 0.321026 |
| std | 256094900000000.0 | 71295.75 | 23.110205 | 0.297675 | 0.397921 | 0.258265 | 0.171686 | 0.161543 | 0.466873 |
| min | 39217.84 | 5030230.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4172614000000.0 | 5640286.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 31731840000000.0 | 5680573.0 | 37.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 94391720000000.0 | 5725524.0 | 55.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 999981600000000.0 | 5790484.0 | 115.0 | 1.0 | 1.0 | 1.0 | 1.0 | 4.0 | 1.0 |


### Data Preparation and Pre-processing.

### Data Modelling and model evaluation

### Project Evaluation and Summary