# [Medical Appointment No Shows - Group ID: 29]

Group Members:



*   Can Zunal 29453
*   Ahmet Büyükaksoy 29308
*   Berk Ay 29026
*   Emre Yaman 28927
*   Görkem Topçu 28862
*   Yağız Toprak Işık 29174





## Introduction

<font color="white">

The goal of the project is to use statistical analysis to identify factors that may lead a patient to miss a scheduled medical appointment. This will be achieved by analyzing a dataset of 110,527 medical appointments. The target variable is whether the patient showed up for the appointment or not. The end goal is to understand why approximately 20% of patients do not show up for their appointments and to use correlations between the independent variables in the dataset and the target variable to develop strategies for improving attendance rates.
</font>

### Problem Definition

<font color="white">

The analysis will be conducted through the following steps:

1. Data Cleaning: The raw dataset will be cleaned and processed to make sure that it is suitable for analysis. This may include fixing errors in the data and handling missing values.
2. Data Exploration: The cleaned dataset will be explored with summary statistics and visualizations to get a clear understanding of the data and identify any potential trends or patterns. Correlations between the independent variables and the target variable will be analyzed using appropriate statistical tests.
3. Machine Learning Implementation: Machine learning models will be implemented in order to check the correlations found between the independent variables and the target variable.
4. Result Interpretation: The results of the statistical analysis will be used to draw conclusions about the potential reasons why patients miss their medical appointments.

</font>




# Importing Libraries

We will be using the following libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from os.path import join

import warnings
warnings.filterwarnings('ignore')

# Importing the Data




In [None]:
from google.colab import drive
drive.mount("./drive")

In [None]:
fname = "KaggleV2-May-2016.csv"
path_prefix = './drive/MyDrive/CS210_Project_Data'
df = pd.read_csv(join(path_prefix, fname))

# Features in the data set
The data consist of 110,527 medical appointment records described by 14 variables:

{PatientId, AppointmentID, Gender, ScheduledDay, AppointmentDay, Age, Neighbourhood, Scholarship, Hipertension, Diabetes, Alcoholism, Handcap, SMS_received, No-show}


In [None]:
df.describe()

In [None]:
df.info()

# **Data Preprocessing**

**Data cleaning**

<font color="white">
In this part, we check each column for missing values and report how many there are.





</font>

In [None]:
df.isnull().sum()

We correct the misspelled column name `Handcap` and rename `No-show` to `NoShow` so that it can be accessed as an attribute.







In [None]:
df.rename(columns = {'Handcap':'Handicap', 'No-show':'NoShow'}, inplace = True)

We convert the date columns from strings to datetime format to make them easier to use later.



In [None]:
df['ScheduledDay'] = pd.to_datetime(df.ScheduledDay)
df['AppointmentDay'] = pd.to_datetime(df.AppointmentDay)

# parse datetime to time
df['ScheduledTime'] = df.ScheduledDay.dt.time

# normalize to set time to midnight, parse datetime to date
df['ScheduledDay'] = df.ScheduledDay.dt.normalize()

In [None]:
df['WaitingDays'] = df['AppointmentDay'] - df['ScheduledDay']

df['WaitingDays'] = df.WaitingDays.dt.days

# detect incorrect data
df.loc[df['WaitingDays'] < 0]

In [None]:
# clean data from incorrect data according to scheduledDay and appointmentDay
drop_idx = df.loc[df['WaitingDays'] < 0].index
df.drop(drop_idx, inplace=True)
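
As a minimal sketch of the waiting-day computation and the consistency check above, the same steps can be run on two hypothetical rows (the timestamps below are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset
toy = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-05-03T08:00:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-05-02T00:00:00Z"],
})

# Parse the strings and set the scheduling time to midnight
toy["ScheduledDay"] = pd.to_datetime(toy["ScheduledDay"]).dt.normalize()
toy["AppointmentDay"] = pd.to_datetime(toy["AppointmentDay"])

# Whole days between scheduling and the appointment
toy["WaitingDays"] = (toy["AppointmentDay"] - toy["ScheduledDay"]).dt.days

# An appointment cannot take place before it was scheduled
valid = toy[toy["WaitingDays"] >= 0]
print(toy["WaitingDays"].tolist(), len(valid))
```

The second row has a negative waiting time, so it is the kind of record that the cleaning step drops.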

## Example Medical Appointments Information 

In [None]:
df.head(3)

# Data Exploration

<font color="white">
In this part the cleaned dataset will be explored with summary statistics and visualizations to get a clear understanding of the data and identify any potential trends or patterns. Correlations between the independent variables and the target variable will be analyzed using appropriate statistical tests. Data exploration is divided into two parts: correlations between variables and the no-show rate, and additional correlations. The additional correlations are used to better understand the dataset.






###  Table Of Contents:
* Correlations Between Variables and No Show Rate:
1. No-show Rate
2. Correlation Between Receiving an SMS and No Show
3. No-show and Gender Correlation
4. No-show Rate With Different Neighbourhoods
5. Probability of Showing Up Depending on the Diseases and Scholarship
6. Probability that Someone Shows Up Given That Person Has No Disease
7. Probability of Showing Up Based on Age
8. Probability of Showing Up Based on the Day of the Week

* Additional Correlations:
1. Correlation of Features
2. Disease and Age Correlation
3. Disease and Gender Correlation


</font>

# Correlations Between Variables and No Show Rate

## No-show Rate

Around 20.2% of people who make appointments don't show up to their appointment.


In [None]:
totalShow  = len(df[df.NoShow == "No"])
totalNoShow = len(df) - totalShow

In [None]:
showDict = {"Absent" : totalNoShow, "Present" :  totalShow}

showDict = dict(sorted(showDict.items()))

keys = list(showDict.keys())
values = list(showDict.values())

plt.pie(values, labels=keys, colors=['pink', 'lightblue'], autopct = "%1.1f%%")

plt.title("Appointment Information: Absent or not?", fontsize = 16)

plt.show()

## Correlation Between Receiving an SMS and No Show

In the graph, patients who received an SMS appear more likely to show up than those who did not, which would suggest that SMS reminders help people remember their appointments. Since the bars show raw counts for two groups of different sizes, this reading should be checked against the no-show rate within each group.

In [None]:
ax = sns.countplot(x="SMS_received", hue="NoShow", data=df)
ax.set_title("Show/NoShow for SMSReceived")
# SMS_received is 0/1 and is plotted in ascending order: 0 (no SMS) first, then 1 (SMS received)
x_ticks_labels=['No SMSReceived', 'SMSReceived']
ax.set_xticklabels(x_ticks_labels)
plt.show()
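
A count plot compares group sizes as much as it compares behaviour, so it can help to look at the no-show rate inside each group as well. The sketch below assumes the column names used in this notebook (`SMS_received`, `NoShow`); the helper `no_show_rate_by` and the toy rows are hypothetical and only illustrate the calculation.

```python
import pandas as pd

def no_show_rate_by(frame, column, target="NoShow"):
    """Share of missed appointments ("Yes") within each value of `column`."""
    return (frame[target] == "Yes").groupby(frame[column]).mean()

# Hypothetical toy data, not taken from the real dataset
toy = pd.DataFrame({
    "SMS_received": [0, 0, 0, 0, 1, 1],
    "NoShow": ["No", "No", "No", "Yes", "No", "Yes"],
})

rates = no_show_rate_by(toy, "SMS_received")
print(rates)
```

On the real data the same call would be `no_show_rate_by(df, "SMS_received")`, and the helper works for any other categorical column.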

## No-show and Gender Correlation
In this part, the relation between gender and attendance is evaluated via a bar chart. The ratio of FEMALE to MALE patients is 1.857. Out of 71837 appointments made by FEMALE patients, 57246 were attended, a show ratio of 0.797. Out of 38685 appointments made by MALE patients, 30962 were attended, a show ratio of 0.8. Since 0.797 and 0.8 are extremely close, we can say that there is no meaningful difference.

In [None]:
# "No Show" = "No" means patient attended appointment and "No Show" = "Yes" means patient missed appointment.
female_count = len(df[df["Gender"] == "F"])
male_count = len(df[df["Gender"] == "M"])

# Count attended appointments ("NoShow" == "No") for each gender
attended = df[df["NoShow"] == "No"]["Gender"].value_counts()
female_show = attended["F"]
male_show = attended["M"]

female_show_ratio = round(female_show / female_count,3)
male_show_ratio = round(male_show / male_count,3)

print("FEMALE patient to MALE patient ratio(F/M):", round(female_count/male_count, 3))
print("Out of", female_count, "FEMALE patient", female_show, "of them attended their appointment. Ratio:", female_show_ratio)
print("Out of", male_count, "MALE patient", male_show, "of them attended their appointment. Ratio:", male_show_ratio)

In [None]:
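# Fix the bar and hue order explicitly so the tick and legend labels always match the data
ax = sns.countplot(x="Gender", hue="NoShow", data=df, order=["F", "M"], hue_order=["No", "Yes"])
ax.set_title("Present/Absent for Females and Males")
x_ticks_labels=['Female', 'Male']
ax.set_xticklabels(x_ticks_labels)
# NoShow == "No" means the patient was present
plt.legend(title='Present', loc='upper right', labels=['Yes', 'No'])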
plt.show()
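
To put a number on "no meaningful difference", one option is a two-proportion z-test on the attendance counts quoted in this section. This test is our illustrative choice rather than part of the original analysis, and it treats appointments as independent observations, which is an assumption (the same patient can appear more than once). The function name is hypothetical.

```python
import math

def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Two-sided z-test for the difference between two proportions (pooled variance)."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Attended and total appointments for female and male patients, as quoted in this section
z, p_value = two_proportion_z_test(57246, 71837, 30962, 38685)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

With these counts the absolute value of z stays below 1.96, so under the stated assumptions the gender difference is not significant at the 5% level, which agrees with the reading of the bar chart.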


## No-show Rate With Different Neighbourhoods

<font color="white">
In this part we try to find out whether there is a correlation between the neighborhoods the patients live in and their show rate. To do this, we created a new pandas dataframe keyed by neighborhood name, containing the show rate and the total number of appointments. A possible scenario is that some neighborhoods have poorer conditions, which makes it harder for their residents to attend appointments.
</font>

In [None]:
# Count total and attended appointments per neighbourhood.
# Grouping by column name avoids positional indexing and a slow row-by-row loop.
attended = (df["NoShow"] == "No")
grouped = attended.groupby(df["Neighbourhood"])

neighbourhoodList = grouped.size().to_dict()  # total appointments per neighbourhood
neighbourhoodShow = grouped.sum().to_dict()   # attended appointments per neighbourhood

The mean and median of the show rate across all neighborhoods are 79.46% and 80.24%, respectively. 30 neighborhoods have a show rate below the mean, while 51 are at or above it. The standard deviation is 9.72 percentage points, which is low, so the show rates are clustered around the mean.

In [None]:
df_n = pd.DataFrame(neighbourhoodList.keys(), columns = ["Neighborhoods"])
df_n["Show"] = df_n["Neighborhoods"].map(neighbourhoodShow)
df_n["TotalAppointment"] = df_n["Neighborhoods"].map(neighbourhoodList)

df_n["ShowPercentage"] = df_n["Show"] / df_n["TotalAppointment"]

# Compare each neighborhood's show rate with the mean over all neighborhoods
mean_show = df_n["ShowPercentage"].mean()
lessThanAverage = (df_n["ShowPercentage"] < mean_show).sum()
greaterThanAverage = len(df_n) - lessThanAverage

print("There are a total of", len(neighbourhoodList), "neighborhoods.")
print("The mean of the show rate of all the neighborhoods is ", format(mean_show * 100, "0.2f"), "%.", sep = "")
print("The median of the show rate of all the neighborhoods is ", format(df_n["ShowPercentage"].median() * 100, "0.2f"), "%.", sep = "")
print("The standard deviation of the show rate of all the neighborhoods is ", format(df_n["ShowPercentage"].std() * 100, "0.2f"), "%.", sep = "")
print("Number of neighborhoods with a show rate below the mean: ", lessThanAverage, ".", sep = "")
print("Number of neighborhoods with a show rate at or above the mean: ", greaterThanAverage, ".", sep = "")

### Graphing the Show Percentage Among Different Neighborhoods

In order to understand the data better we can compare neighborhoods with different show percentages. To visualise this, we separated the neighborhoods with the lowest show percentages from those with the highest and plotted them. The show rates seem similar, so this feature might be irrelevant. Further information is needed to classify the neighborhoods.

In [None]:
df_mostShow = df_n.loc[df_n["ShowPercentage"] > 0.83]
sns.barplot(data=df_mostShow, x="Neighborhoods", y="ShowPercentage")
plt.xticks(rotation=90)
plt.title("Neighborhoods with the highest show rate")
plt.show()

In [None]:
df_leastShow = df_n.loc[df_n["ShowPercentage"] < 0.766]

sns.barplot(data=df_leastShow, x="Neighborhoods", y="ShowPercentage")
plt.xticks(rotation=90)
plt.title("Neighborhoods with the lowest show rate")
plt.show()

## Probability of Showing Up Depending on the Diseases and Scholarship

For each disease and for the scholarship flag, the probability of patients showing up is calculated. There is a slight difference between people who have diseases and the overall average. However, some variables should not be considered because they are too close to the mean. For instance, alcoholic patients have a no-show rate that is almost the same as the overall average.

In [None]:
def calculateProbabilityDisease(disease):
  show_up = list(df[df[disease] == 1] ["NoShow"])

  showup_ratio = (show_up.count("No") / len(show_up))*100
  showup_ratio = round(showup_ratio,2)

  print(disease + ": " + str(showup_ratio) + "%")

disease_names = ["Hipertension", "Diabetes", "Alcoholism", "Handicap", "Scholarship"]

print("The probability that someone shows up given that the person has: ")
for dis in disease_names:
  calculateProbabilityDisease(dis)

## Probability that Someone Shows Up Given That Person Has No Disease

In this part we look at patients who have none of the recorded conditions, rather than checking specific diseases: a patient is included only if every disease flag is 0. The resulting show rate is almost the same as the overall mean of 79.8%, so we do not consider this feature further.

In [None]:
# Each comparison needs its own parentheses because & binds more tightly than ==
noshow_up = list(df[(df.Hipertension == 0) & (df.Diabetes == 0) & (df.Alcoholism == 0) & (df.Handicap == 0)] ["NoShow"])

showup_ratio = (noshow_up.count("No") / len(noshow_up))*100

print("The probability that someone shows up given that person has no disease: " + str("{:.2f}".format(round(showup_ratio, 2))) + "%")

## Probability of Showing Up Based on Age

In this part we calculated and plotted the effect of age on the show rate. As can be observed, the show rate increases gradually with age, which suggests a correlation. The 60+ group in particular has a high show rate.

In [None]:
age_group_labels = ["[0-18)","[18-30)","[30-45)","[45-60)","[60+]"]

def assign_ageGroup(age):
  if pd.isna(age) or age < 0:
    return np.nan # missing or invalid (negative) age
  elif age < 18:
    return age_group_labels[0] # group A
  elif age < 30:
    return age_group_labels[1] # group B
  elif age < 45:
    return age_group_labels[2] # group C
  elif age < 60:
    return age_group_labels[3] # group D
  else:
    return age_group_labels[4] # group E (60 and over)

df["AgeGroup"] = df["Age"].apply(assign_ageGroup)
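
As an alternative sketch, the same age groups can be produced with `pd.cut`, which assigns each value to a half-open interval in one vectorised call. The bin edges below mirror the labels used in this notebook; the sample ages are hypothetical.

```python
import numpy as np
import pandas as pd

age_group_labels = ["[0-18)", "[18-30)", "[30-45)", "[45-60)", "[60+]"]
bins = [0, 18, 30, 45, 60, np.inf]

# Hypothetical sample ages, including an invalid negative value
ages = pd.Series([-1, 0, 17, 18, 44, 59, 60, 115])

# right=False makes every interval closed on the left and open on the right
groups = pd.cut(ages, bins=bins, labels=age_group_labels, right=False)
print(groups.tolist())
```

Values that fall outside every bin, such as a negative age, come back as missing instead of being placed in a group.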

In [None]:
ageGroup_noShow_dict = {}

for i in age_group_labels:
  ageGroup_noShow_dict[i] = format(round(len(df[(df.NoShow == "No") & (df.AgeGroup == i)]) / len(df[df.AgeGroup == i]), 3))

# Sort the dictionary by keys
ageGroup_noShow_dict = dict(sorted(ageGroup_noShow_dict.items()))

keys = list(ageGroup_noShow_dict.keys())
values = list(map(float,list(ageGroup_noShow_dict.values())))

In [None]:
for i in ageGroup_noShow_dict.items():
  print(f"{i[0]}: {i[1]}")

In [None]:
# Create the bar chart
fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(x=keys, y=values, palette="Blues")

# Add a title and labels for the x- and y-axes
plt.title("Probability of showing up based on age", fontsize=16)
plt.xlabel("Age Groups", fontsize=14)
plt.ylabel("Probability of showing up", fontsize=14)

# Display the chart
plt.show()

## Probability of Showing Up Based on the Day of the Week
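In this part we plot how the day of the appointment relates to missed appointments. The plot shows the number of attended and missed appointments for each weekday; in it, the number of no-shows appears to decrease toward Friday. Because these are counts, a lower value can also reflect fewer appointments on that day.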

In [None]:
days = ['ScheduledDay', 'AppointmentDay']
for c in days:
    df[c] = pd.to_datetime(pd.to_datetime(df[c]).dt.date)

df['dayofweek']=df['AppointmentDay'].dt.day_name()

Shows = df['NoShow'] == "No"
NoShows = df['NoShow'] == "Yes"

xtick = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# value_counts() sorts by frequency, so reindex to calendar order before plotting
df.dayofweek[NoShows].value_counts().reindex(xtick, fill_value=0).plot.line(label='noshows', figsize=(10, 10), marker='o')
df.dayofweek[Shows].value_counts().reindex(xtick, fill_value=0).plot.line(label='shows', figsize=(10, 10), marker='x')
plt.xticks(np.arange(0, 7, 1), xtick)

plt.title('Appointments in day of the week')
plt.xlabel('Days of Week')
plt.ylabel('Number of Appointments')
plt.legend()

plt.show()
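
Since the plot above shows counts, a per-weekday no-show rate can complement it. The sketch below is a minimal illustration on hypothetical rows; it assumes the `AppointmentDay` and `NoShow` column names used in this notebook.

```python
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical toy rows: two Monday appointments, one Tuesday, one Friday
toy = pd.DataFrame({
    "AppointmentDay": pd.to_datetime(["2016-05-02", "2016-05-02", "2016-05-03", "2016-05-06"]),
    "NoShow": ["No", "Yes", "No", "No"],
})

# Share of missed appointments per weekday, listed in calendar order
missed = toy["NoShow"] == "Yes"
rate = missed.groupby(toy["AppointmentDay"].dt.day_name()).mean().reindex(weekday_order)
print(rate)
```

Weekdays without any appointments show up as missing values, which keeps them visibly distinct from days with a genuine rate of zero.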

# Additional Correlations

### Correlation of Features

<font color="white">
In this section we use a heat map to visualize correlations between the features in the data set. A quick look suggests that diabetes and hypertension are moderately correlated, with a correlation value of 0.43.

Feature pairs with the highest correlation:

1.   Age and Hipertension 0.5
2.   Diabetes and Hipertension 0.43
3.   Diabetes and Age 0.29








</font>

In [None]:
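# Create the figure with its final size before drawing on it
fig, ax = plt.subplots(figsize=(10, 10))
ax.set_title("Correlation of Features", fontsize = 18)

# numeric_only=True (pandas 1.5 or later) skips string and datetime columns,
# which newer pandas versions no longer drop silently
dataplot = sns.heatmap(df.corr(numeric_only=True), annot=True, ax=ax)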

## Disease and Age Correlation
In this part, the relation between age and each disease is evaluated via a boxplot.

In [None]:
ages_hipertension = df[df["Hipertension"] == 1] ["Age"]
ages_diabetes = df[df["Diabetes"] == 1]["Age"]
ages_alcoholism = df[df["Alcoholism"] == 1]["Age"]
ages_handicap = df[df["Handicap"] == 1] ["Age"]

corrDict = {"Ages_Hipertension": ages_hipertension, "Ages_Diabetes": ages_diabetes, "Ages_Alcoholism": ages_alcoholism, "Ages_Handicap": ages_handicap}

labels, corr = corrDict.keys(), corrDict.values()

plt.boxplot(corr)
plt.xticks(range(1,len(labels) + 1), labels)
fig = plt.gcf()
fig.set_size_inches(12,8)
plt.show()

## Disease and Gender Correlation
In this part, the relation between gender and each disease is evaluated via a bar chart.

In [None]:
f_count = len(df[df["Gender"] == "F"])
m_count = len(df)-f_count

def gender_disease_corr(disease):
  genderList_disease = list(df[df[disease] == 1] ["Gender"])
  f_d_count = genderList_disease.count("F") # female with disease count
  m_d_count = len(genderList_disease)-f_d_count  # male with disease count

  f_d_ratio = f_d_count/f_count
  m_d_ratio = m_d_count/m_count
  
  data = np.array([int(f_d_ratio * 100), int(m_d_ratio * 100)])
  fig = plt.gcf()
  fig.set_size_inches(6,8)
  plt.bar(x = ["Female", "Male"], height = data)
  plt.title("Incidence percentage of " + disease + " by gender")
  plt.show()

  return f_d_ratio, m_d_ratio

In [None]:
gender_disease_corr("Hipertension")

In [None]:
gender_disease_corr("Diabetes")

In [None]:
gender_disease_corr("Alcoholism")

In [None]:
gender_disease_corr("Handicap")

In [None]:
df.head()

# Machine Learning Models


The correlations observed in the data exploration part suggest that NoShow is not correlated with some of the columns in the dataset. Therefore, some columns are excluded from the dataframe before training the machine learning models. Also, the NoShow values, originally "Yes" and "No", are replaced with 0 and 1 respectively, so that 1 means the patient showed up. 60 percent of the data is used for training, 20 percent for validation and the remaining 20 percent for testing.

### Implementation


In [None]:
data = df.drop(columns=['Gender', 'Neighbourhood', 'Age', 'AgeGroup', 'ScheduledTime', 'ScheduledDay', 'AppointmentDay', 'AppointmentID', 'PatientId', 'dayofweek'])
# Encode the target: "No" (patient showed up) -> 1, "Yes" (patient missed the appointment) -> 0
data["NoShow"] = data["NoShow"].map({"Yes": 0, "No": 1})

X = data.drop(columns=['NoShow'])
y = data['NoShow']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)
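
The two calls above give a 60/20/20 split because the second `test_size` applies to the rows that remain after the first split, not to the full dataset. The arithmetic can be checked with exact fractions:

```python
from fractions import Fraction

test_share = Fraction(1, 5)             # test_size=0.2 of all rows
remaining = 1 - test_share              # rows left after the first split
val_share = Fraction(1, 4) * remaining  # test_size=0.25 of the remaining rows
train_share = remaining - val_share

print(train_share, val_share, test_share)
```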

We selected two algorithms for building the machine learning models, Random Forest and Gradient Boosting, and trained them on the previously split data.

In [None]:
#Random Forest
model_rf = RandomForestClassifier(random_state=0)
model_rf.fit(X_train, y_train)

#Gradient Boosting
model_gb = GradientBoostingClassifier(random_state=0)
model_gb.fit(X_train, y_train)

In this part, the performances of the previously created machine learning models are evaluated. As can be seen from the resulting graphs, the model built with the Random Forest algorithm performs better. Performance is measured as accuracy, calculated with the formula (TP+TN)/(TP+TN+FP+FN).

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix

fig, axs = plt.subplots(1, 2, figsize=(12, 5))
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
ConfusionMatrixDisplay.from_estimator(model_rf, X_val, y_val, ax=axs[0], cmap='Blues')
axs[0].set_title('Random Forest Confusion Matrix')

ConfusionMatrixDisplay.from_estimator(model_gb, X_val, y_val, ax=axs[1], cmap="Greens")
axs[1].set_title('Gradient Boosting Confusion Matrix')

plt.show()

In this part, performance is evaluated with the ROC curve and the precision-recall curve. In the ROC curve the y-axis represents the true positive rate and the x-axis represents the false positive rate, so a curve that is closer to the upper-left corner indicates a better model. For the precision-recall curve, precision is calculated as TP/(TP+FP), the true positives over all positive predictions, and recall is calculated as TP/(TP+FN), the true positives over all positive instances in the data. Therefore, higher precision and recall mean higher performance.

In [None]:
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score

#   ***     BEGIN: ROC CURVE     ***   #
# Make predictions on the validation data using the Random Forest model
y_pred_proba_rf = model_rf.predict_proba(X_val)[:, 1]
fpr_rf, tpr_rf, thresholds_rf = roc_curve(y_val, y_pred_proba_rf)
auc_rf = roc_auc_score(y_val, y_pred_proba_rf)

# Make predictions on the validation data using the Gradient Boosting model
y_pred_proba_gb = model_gb.predict_proba(X_val)[:, 1]
fpr_gb, tpr_gb, thresholds_gb = roc_curve(y_val, y_pred_proba_gb)
auc_gb = roc_auc_score(y_val, y_pred_proba_gb)

fig, ((ax1, ax2),(ax3, ax4)) = plt.subplots(2, 2, figsize=(12,10))

# Random Forest ROC
ax1.plot(fpr_rf, tpr_rf, label='Random Forest (AUC = %.2f)' % auc_rf)
ax1.plot([0,1],[0,1],'r--')
ax1.set_xlabel('FPR')
ax1.set_ylabel('TPR')
ax1.set_title('Random Forest ROC')
ax1.set_xlim(0,1)
ax1.set_ylim(0,1)
ax1.legend()

# Gradient Boosting ROC
ax2.plot(fpr_gb, tpr_gb, label='Gradient Boosting (AUC = %.2f)' % auc_gb)
ax2.plot([0,1],[0,1],'r--')
ax2.set_xlabel('FPR')
ax2.set_ylabel('TPR')
ax2.set_title('Gradient Boosting ROC')
ax2.set_xlim(0,1)
ax2.set_ylim(0,1)
ax2.legend()
#   ***     END: ROC CURVE     ***   #


#   ***     BEGIN: PRECISION-RECALL     ***   #
# Random Forest
precision_rf, recall_rf, thresholds_rf = precision_recall_curve(y_val, y_pred_proba_rf)

# Gradient Boosting
precision_gb, recall_gb, thresholds_gb = precision_recall_curve(y_val, y_pred_proba_gb)

# Plot RF
ax3.plot(recall_rf, precision_rf, label='Random Forest')
ax3.set_xlabel('Recall')
ax3.set_ylabel('Precision')
ax3.set_title('Random Forest Precision-Recall curve')
ax3.legend()

# Plot GB
ax4.plot(recall_gb, precision_gb, label='Gradient Boosting')
ax4.set_xlabel('Recall')
ax4.set_ylabel('Precision')
ax4.set_title('Gradient Boosting Precision-Recall curve')
ax4.legend()
#   ***     END: PRECISION-RECALL     ***   #

plt.show()
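
The formulas used in this section can be checked by hand on a small confusion matrix. The counts below are hypothetical and are not results from our models; the function name is ours.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics discussed in this section from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),            # true positives over all positive predictions
        "recall": tp / (tp + fn),               # true positive rate (y-axis of the ROC curve)
        "false_positive_rate": fp / (fp + tn),  # x-axis of the ROC curve
    }

# Hypothetical counts for illustration only
metrics = classification_metrics(tp=50, tn=30, fp=10, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Each point on the ROC and precision-recall curves corresponds to one such set of counts, obtained at one probability threshold.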

## Hyperparameter Tuning


<font color="white">
As explained in the previous part, higher precision and higher recall mean higher performance, so the area under the precision-recall curve (AUPRC) is used as the performance indicator. In this part, different n_estimators and max_features values are tested for tuning. For each model, new model variations are created by varying n_estimators first and then max_features.
</font>

In [None]:
from sklearn.metrics import average_precision_score

n_estimators_vals = [50,100,300,500]
max_features_vals = [2,3,5,7]

def tune(model_cls, param_name, param_vals, **fixed_params):
  """Fit one model per candidate value and score it by AUPRC on the validation set.

  Returns the {value: AUPRC} dictionary and the best-scoring value.
  The test set is left untouched so that it can be used for the final evaluation.
  """
  scores = {}
  for val in param_vals:
    settings = {**fixed_params, param_name: val}
    model = model_cls(random_state=0, **settings)
    model.fit(X_train, y_train)
    y_pred_proba = model.predict_proba(X_val)[:,1]
    scores[val] = average_precision_score(y_val, y_pred_proba)
    label = " ".join(f"{k} {v}" for k, v in settings.items())
    print(f"{label}, AUPRC score:   {scores[val]}")
  # On ties, the first value that reached the best score is kept
  return scores, max(scores, key=scores.get)

# Random Forest
print("Random Forest:\n")
auprc_n_estimators_rf, best_n_estimators_rf = tune(RandomForestClassifier, "n_estimators", n_estimators_vals)
print("\nBest n_estimators for Random Forest =", best_n_estimators_rf,"\n")

auprc_max_features_rf, best_max_features_rf = tune(RandomForestClassifier, "max_features", max_features_vals, n_estimators=best_n_estimators_rf)
print("\nFor Random Forest, best_n_estimators =", best_n_estimators_rf, "and best max_features =", best_max_features_rf, "\n")

# Gradient Boosting
print("\nGradient Boosting:\n")
auprc_n_estimators_gb, best_n_estimators_gb = tune(GradientBoostingClassifier, "n_estimators", n_estimators_vals)
print("\nBest n_estimators for Gradient Boosting =", best_n_estimators_gb, "\n")

auprc_max_features_gb, best_max_features_gb = tune(GradientBoostingClassifier, "max_features", max_features_vals, n_estimators=best_n_estimators_gb)
print("\nFor Gradient Boosting, best_n_estimators =", best_n_estimators_gb, "and best max_features =", best_max_features_gb)
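
To make the AUPRC score less of a black box, the sketch below computes average precision by hand as the sum of each recall gain multiplied by the precision at that rank. It is a simplified illustration that assumes no tied scores; the labels and scores are hypothetical.

```python
def average_precision(y_true, scores):
    """Average precision for binary labels; assumes no two scores are equal."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y_true)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Recall rises by 1/total_positives at each positive; weight it by precision at this rank
            ap += (hits / rank) / total_positives
    return ap

# Hypothetical labels and predicted probabilities
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
ap = average_precision(y_true, scores)
print(f"average precision: {ap:.4f}")
```

Here the positives sit at ranks 1, 3 and 4, so the score is the mean of 1/1, 2/3 and 3/4. A model that ranks all positives first would reach 1.0.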

### Results & Discussion


In this part, the performances of the Random Forest and Gradient Boosting models are compared after hyperparameter tuning. As can be observed from the results, their performances are similar, but the Random Forest model performs slightly better.

In [None]:
# Compare AUPRC across the tested hyperparameter values
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12,6))

# AUPRC - n_estimators
ax1.plot(list(auprc_n_estimators_rf.keys()), list(auprc_n_estimators_rf.values()), '--b', marker="o", label="Random Forest")
ax1.plot(list(auprc_n_estimators_gb.keys()), list(auprc_n_estimators_gb.values()), '--r', marker="o", label="Gradient Boosting")
ax1.set_yticks( np.around(np.linspace(0.7,1),decimals=2))
ax1.set_xticks(n_estimators_vals)
ax1.set_title("AUPRC - n_estimators")
ax1.set_xlabel("n_estimators")
ax1.set_ylabel("AUPRC")
ax1.legend()

# AUPRC - max_feature
ax2.plot(list(auprc_max_features_rf.keys()), list(auprc_max_features_rf.values()), '--b', marker="o", label="Random Forest")
ax2.plot(list(auprc_max_features_gb.keys()), list(auprc_max_features_gb.values()), '--r', marker="o", label="Gradient Boosting")
ax2.set_yticks( np.around(np.linspace(0.7,1),decimals=2))
ax2.set_xticks(max_features_vals)
ax2.set_title("AUPRC - max_feature")
ax2.set_xlabel("max_feature")
ax2.set_ylabel("AUPRC")
ax2.legend()

plt.show()

In [None]:
# Combine the training and validation sets; pd.concat keeps the feature names
x_train_val = pd.concat([X_train, X_val], axis=0)
y_train_val = pd.concat([y_train, y_val], axis=0)

final_model = RandomForestClassifier(random_state = 0, n_estimators=best_n_estimators_rf, max_features=best_max_features_rf)

# Train the final model on the combined train and validation data
final_model.fit(x_train_val, y_train_val)

In [None]:
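from sklearn.metrics import ConfusionMatrixDisplay

# Evaluate the final model on the held-out test set, which was not used for training
y_pred_rf = final_model.predict(X_test)

# Plot the confusion matrix for the random forest model
fig, ax1 = plt.subplots(1, 1, figsize=(14, 7))
ConfusionMatrixDisplay.from_estimator(final_model, X_test, y_test, ax=ax1, cmap = "Blues")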
ax1.set_title('Random Forest Confusion Matrix')
plt.show()

# Conclusion

<font color="white">

In this project we cleaned the dataset, then plotted and analyzed it to form hypotheses. After that we used machine learning models to check our findings. Our main question was what might be the main cause of no-shows and how it can be addressed. The absence rate is 20.2%. From the variables we originally had, we tried to eliminate the unnecessary ones. While the neighbourhood patients live in, gender and alcoholism seem to have low correlation with the no-show rate and stay close to the mean, the effects of receiving an SMS about the appointment, the age of the patient and the day of the appointment can be clearly seen. Also, people who have certain diseases have a higher show rate. Some additional variables also show minor differences, but we do not discuss them here.

In conclusion, the following steps could be used to lower the no-show rate:
1.  Make sure that each patient receives an SMS
2.  Spread out the appointments appropriately
3.  Figure out why the show rate changes with age and find a solution

The first solution can be easily achieved as it is cheap to implement. We would only need a bot that sends appointment information to each patient. The second solution assumes that there is a problem with the current appointment system, which should be tested before any change is implemented. The third solution needs further research.

</font>


# Future Work

<font color="white">

Our dataset had 14 features, which might not be enough to cover all the reasons behind the low show rate. Additional information might be needed. Besides that, our solutions need to be tested. A new appointment scheduling system could be implemented and tested to see if the day of the appointment really matters. Figuring out why the no-show rate changes with age might be a future research topic.
</font>

# Work Division

<font color="white">

Each teammate had a specific part in the project. However, whenever we came across problems, everyone helped to figure them out and correct them. The parts are summarized below:

Yağız Toprak Işık: Data Cleaning

Emre Yaman: Data Cleaning, Introduction 

Ahmet Büyükaksoy: Data Cleaning and Data Exploration Bug Fixing

Berk Ay: Feature Generation

Görkem Topçu: Data Cleaning, Feature Generation, Machine Learning Model

Can Zunal: Final Report Structure, Feature Generation, Machine Learning Model

</font>