# 6.5 Health Care Insurance Case Study - Predicting Medical Appointment No-show Patients
***
#### https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

### Medical appointment no-shows are a large problem in healthcare: approximately 1 in 5 patients miss their appointment ([Source](https://www.solutionreach.com/blog/which-wins-the-national-average-no-show-rate-or-yours)). This is an issue for everyone involved: 

1) the scheduled patient presumably needs to be seen otherwise they wouldn’t have had an appointment

2) other patients would have liked to have that spot but couldn’t

3) healthcare providers must spend extra time contacting and re-scheduling the patient, and any time they spent preparing for the visit is wasted.

**No-shows Definition**: Given the dates and times of scheduling day and appointment day, predict if a patient will miss their medical appointment.

**No-shows Dataset**: the medical appointment no-show dataset hosted on Kaggle [No-shows Data](https://www.kaggle.com/joniarroba/noshowappointments). 


## The Contents with step-by-step coding 4 - 7:

### 1. Exploratory Data Analysis (Data Review, Data Visualization, Data Cleaning, etc.)
### 2. Feature Engineering
### 3. Feature Selection (Variable Importance Ranking) - for data sets with many features
### 4. Data PreProcessing
### 5. Training and Evaluating the Logistic Regression Classification model on the Training set
### 6. Training and Evaluating the Decision Tree Classification model on the Training set
### 7. Training and Evaluating the Random Forest Classification model on the Training set
***

## 4. Data PreProcessing

#### Step 1. Create X and y arrays
#### Step 2. Split training and testing data sets
Before splitting into training and testing sets, we need to separate our data into:
- an X array that contains the features/variables to train on
- a y array with the target variable (the 'TargetNoshow' column)


### Step 1. Create X and y arrays

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sbn
%matplotlib inline

In [3]:
# Set up a path to data folder
import os
os.chdir(r'C:\Users\yumei\CSP Workshop 2023\Data')
os.getcwd()

'C:\\Users\\yumei\\CSP Workshop 2023\\Data'

In [4]:
# Import Data
dfcopy = pd.read_csv(r'C:\Users\yumei\CSP Workshop 2023\Data\dfcopy.csv')
dfcopy.head()

Unnamed: 0,PatientId,AppointmentID,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handicap,SMS_received,...,ScheduledDay_week,ScheduledDay_day,ScheduledDay_hour,ScheduledDay_dayofweek,AppointmentDay_month,AppointmentDay_week,AppointmentDay_day,AppointmentDay_hour,AppointmentDay_dayofweek,Sche_Appt_days
0,3469431281946,5649465,0,REDENÇÃO,0,0,0,0,0,0,...,18,3,6,1,5,19,9,23,0,6.71
1,48826747693,5659726,0,MARUÍPE,0,0,0,0,0,1,...,18,4,14,2,6,22,1,23,2,28.392
2,9726846148373,5749887,0,MARUÍPE,0,0,0,0,0,0,...,22,31,8,1,6,22,1,23,2,1.653
3,28452896784213,5664173,0,SÃO CRISTÓVÃO,0,0,0,0,0,1,...,18,5,11,3,6,23,8,23,2,34.525
4,726999492642124,5650471,0,SANTOS DUMONT,0,0,0,0,0,0,...,18,3,7,1,5,18,3,23,1,0.676


In [5]:
X = dfcopy[['Sche_Appt_days', 'Age', 'ScheduledDay_hour', 'ScheduledDay_day', 'ScheduledDay_dayofweek', 
            'AppointmentDay_day', 'AppointmentDay_dayofweek', 'ScheduledDay_week', 'Gender_M', 'SMS_received']]  #independent columns
# Or, to keep every other column: X = dfcopy.drop('TargetNoshow', axis=1)  (without inplace=True, which returns None)
y = dfcopy['TargetNoshow']  #Target column

KeyError: "['Gender_M'] not in index"
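
A KeyError like the one above means that at least one requested name is not yet a column of the DataFrame. The sketch below shows one way to list the missing names and to create a `Gender_M` indicator with `pd.get_dummies`. The toy DataFrame and its `Gender` column with 'F'/'M' values are assumptions made for illustration; the actual contents of dfcopy may differ.

```python
import pandas as pd

# Toy stand-in for dfcopy; the 'Gender' column with 'F'/'M' values is an assumption
toy = pd.DataFrame({
    "Age": [34, 51, 7, 62],
    "Gender": ["F", "M", "F", "M"],
    "SMS_received": [0, 1, 0, 1],
    "TargetNoshow": [0, 1, 0, 0],
})

wanted = ["Age", "Gender_M", "SMS_received"]

# List the requested feature names that are not columns yet
missing = [col for col in wanted if col not in toy.columns]
print("Missing before encoding:", missing)

# One-hot encode Gender; drop_first=True drops 'Gender_F' and keeps 'Gender_M'
toy = pd.get_dummies(toy, columns=["Gender"], drop_first=True, dtype=int)

missing_after = [col for col in wanted if col not in toy.columns]
print("Missing after encoding:", missing_after)

# The selection now succeeds
X_toy = toy[wanted]
y_toy = toy["TargetNoshow"]
```

Running a check like this before selecting the feature columns makes it clear which encoding step has not been executed yet.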

### Step 2. Split training and testing data sets

In [6]:
from sklearn.model_selection import train_test_split

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

In [8]:
X_train.shape

(77367, 10)

In [9]:
X_test.shape

(33158, 10)

In [10]:
y_train.shape

(77367,)

In [11]:
y_test.shape

(33158,)
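
Because only about one in five appointments is a no-show, it can help to keep the class ratio identical in the training and testing sets. The sketch below uses the `stratify` argument of `train_test_split` on synthetic data; the array names and the exact 20% positive rate are assumptions for illustration, not values taken from dfcopy.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with exactly 20% positives (no-shows)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 200 + [0] * 800)

# stratify=y_demo keeps the share of each class the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=101, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```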

## 5. Training and Evaluating the Logistic Regression Classification model on the Training set

### Training a Logistic Regression Model


In [12]:
from sklearn.linear_model import LogisticRegression

In [13]:
X_train.head()

Unnamed: 0,Sche_Appt_days,Age,ScheduledDay_hour,ScheduledDay_day,ScheduledDay_dayofweek,AppointmentDay_day,AppointmentDay_dayofweek,ScheduledDay_week,Gender_M,SMS_received
105075,8.449,75,13,3,1,11,2,18,0,1
25480,12.382,16,14,28,3,10,1,17,0,0
37462,1.38,25,14,24,1,25,2,21,0,0
22559,1.606,14,9,8,2,9,3,23,1,0
48496,1.351,32,15,25,2,26,3,21,0,0


In [14]:
logimodel = LogisticRegression()
logimodel.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
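
The message above suggests two remedies: raise `max_iter` or scale the data. A minimal sketch of both, combined in a scikit-learn pipeline, is shown below. It runs on synthetic data with deliberately mismatched feature scales; the variable names and the value `max_iter=1000` are illustrative assumptions, not settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data standing in for X_train / y_train
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
# Put the features on very different scales, as raw day/hour/age columns often are
X_syn = X_syn * np.array([1, 10, 100, 1000, 1, 1, 1, 1, 1, 1])

# Standardize the features, then fit logistic regression with a higher iteration cap
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_syn, y_syn)

# Number of solver iterations actually used
n_iter_used = int(scaled_lr.named_steps["logisticregression"].n_iter_[0])
print("iterations used:", n_iter_used)
```

Wrapping the scaler and the model in one pipeline means the scaling learned on the training data is applied automatically when predicting on the test data.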

In [15]:
# check scikit-learn version
import sklearn
print('sklearn: %s' % sklearn.__version__)

sklearn: 1.0.2


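### How to fix the convergence warning:
The message above is a convergence warning: the solver stopped at the default limit of 100 iterations. As the message suggests, increase max_iter or scale the features. For FutureWarning messages in scikit-learn, see (https://machinelearningmastery.com/how-to-fix-futurewarning-messages-in-scikit-learn/)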

### Evaluate the Logistic Regression model

In [16]:
predictions = logimodel.predict(X_test)

In [17]:
from sklearn.metrics import classification_report, confusion_matrix

In [18]:
print(confusion_matrix(y_test,predictions))

print("\n")

print(classification_report(y_test, predictions))

[[26260   254]
 [ 6528   116]]


              precision    recall  f1-score   support

           0       0.80      0.99      0.89     26514
           1       0.31      0.02      0.03      6644

    accuracy                           0.80     33158
   macro avg       0.56      0.50      0.46     33158
weighted avg       0.70      0.80      0.71     33158
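
The report can be reproduced by hand from the confusion matrix, which helps when interpreting it. The sketch below recomputes accuracy, precision, and recall for class 1 from the matrix printed above, and compares the accuracy with a baseline that always predicts class 0. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class
cm = np.array([[26260, 254],
               [6528, 116]])
tn, fp, fn, tp = cm.ravel()

accuracy = float((tn + tp) / cm.sum())
precision_1 = float(tp / (tp + fp))   # of the predicted no-shows, how many were real
recall_1 = float(tp / (tp + fn))      # of the real no-shows, how many were found

# Accuracy of a trivial model that always predicts class 0
baseline_accuracy = float((tn + fp) / cm.sum())

print(f"accuracy={accuracy:.4f}  baseline={baseline_accuracy:.4f}")
print(f"precision(1)={precision_1:.4f}  recall(1)={recall_1:.4f}")
```

The model identifies very few of the actual no-shows, and its accuracy does not exceed that of always predicting class 0, so accuracy alone is a misleading summary on this imbalanced target.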



### Modeling Iteration #2 Using 12 variables instead of 10

### Step 1. Create X and y arrays

In [19]:
X2 = dfcopy[['Sche_Appt_days', 'Age', 'ScheduledDay_hour', 'ScheduledDay_day', 'ScheduledDay_dayofweek', 
            'AppointmentDay_day', 'AppointmentDay_dayofweek', 'ScheduledDay_week', 'Gender_M', 'SMS_received',
            'AppointmentDay_week', 'ScheduledDay_month']]  #independent columns and add two more variables
# Or, to keep every other column: X2 = dfcopy.drop('TargetNoshow', axis=1)  (without inplace=True, which returns None)
y2 = dfcopy['TargetNoshow']  #Target column

### Step 2. Split training and testing data sets

In [20]:
X2_train, X2_test, y2_train, y2_test = train_test_split(X2,y2,test_size = 0.30, random_state = 202)

In [21]:
X2_train.shape

(77367, 12)

### Step 3. Training a Logistic Regression Model

In [22]:
logimodel2 = LogisticRegression()
logimodel2.fit(X2_train,y2_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

### Step 4. Evaluate the model

In [23]:
predictions = logimodel2.predict(X2_test)

In [24]:
from sklearn.metrics import classification_report, confusion_matrix

In [25]:
print(confusion_matrix(y2_test, predictions))
print("\n")
print(classification_report(y2_test, predictions))

[[26155   292]
 [ 6581   130]]


              precision    recall  f1-score   support

           0       0.80      0.99      0.88     26447
           1       0.31      0.02      0.04      6711

    accuracy                           0.79     33158
   macro avg       0.55      0.50      0.46     33158
weighted avg       0.70      0.79      0.71     33158
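
Recall for class 1 stays near 0.02 in both iterations. One possible way to trade precision for recall is to lower the probability cut-off used to label an appointment as a no-show. The sketch below illustrates the idea on synthetic data; the cut-off of 0.3 and all variable names are arbitrary assumptions, not tuned values for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the no-show data (about 20% positives)
X_syn, y_syn = make_classification(
    n_samples=3000, n_features=8, weights=[0.8, 0.2], flip_y=0.05, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=1, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Predicted probability of class 1 for each test row
proba_1 = clf.predict_proba(X_te)[:, 1]

# 0.5 is the usual cut-off; 0.3 is an arbitrary lower one
pred_default = (proba_1 >= 0.5).astype(int)
pred_lower = (proba_1 >= 0.3).astype(int)

recall_default = recall_score(y_te, pred_default)
recall_lower = recall_score(y_te, pred_lower)
print("recall at 0.5:", recall_default)
print("recall at 0.3:", recall_lower)
```

Lowering the cut-off can only add predicted positives, so recall for class 1 cannot decrease; the price is usually more false alarms, so precision should be checked alongside it.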



## 6. Training and Evaluating the Decision Tree Classification model on the Training set

In [26]:
from sklearn.tree import DecisionTreeClassifier
DTmodel = DecisionTreeClassifier()
DTmodel.fit(X2_train,y2_train)

DecisionTreeClassifier()

In [27]:
dt_pred = DTmodel.predict(X2_test)

In [28]:
print(confusion_matrix(y2_test,dt_pred))
print('\n')
print(classification_report(y2_test,dt_pred))

[[21728  4719]
 [ 4361  2350]]


              precision    recall  f1-score   support

           0       0.83      0.82      0.83     26447
           1       0.33      0.35      0.34      6711

    accuracy                           0.73     33158
   macro avg       0.58      0.59      0.58     33158
weighted avg       0.73      0.73      0.73     33158
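
A decision tree with default settings grows until it fits the training data almost perfectly, which is one common reason for weaker test results. The sketch below compares an unrestricted tree with a depth-limited one on synthetic data; `max_depth=5` and the variable names are illustrative assumptions, not tuned settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy, imbalanced stand-in for the no-show data
X_syn, y_syn = make_classification(
    n_samples=3000, n_features=8, weights=[0.8, 0.2], flip_y=0.1, random_state=2
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=2, stratify=y_syn
)

# Unrestricted tree versus a tree limited to depth 5
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
small_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)

for name, tree in [("full", full_tree), ("max_depth=5", small_tree)]:
    print(name,
          "depth:", tree.get_depth(),
          "train acc:", round(tree.score(X_tr, y_tr), 3),
          "test acc:", round(tree.score(X_te, y_te), 3))
```

Comparing training and test accuracy side by side is a quick way to see whether a tree is memorizing the training set.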



## 7. Training and Evaluating the Random Forest Classification model on the Training set

In [29]:
from sklearn.ensemble import RandomForestClassifier
RFmodel = RandomForestClassifier(n_estimators = 200, random_state = 101)
RFmodel.fit(X2_train,y2_train)

RandomForestClassifier(n_estimators=200, random_state=101)

In [30]:
rf_pred = RFmodel.predict(X2_test)

In [31]:
print(confusion_matrix(y2_test,rf_pred))
print('\n')
print(classification_report(y2_test,rf_pred))

[[24351  2096]
 [ 5192  1519]]


              precision    recall  f1-score   support

           0       0.82      0.92      0.87     26447
           1       0.42      0.23      0.29      6711

    accuracy                           0.78     33158
   macro avg       0.62      0.57      0.58     33158
weighted avg       0.74      0.78      0.75     33158
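
A fitted random forest also exposes `feature_importances_`, which can support the variable importance ranking mentioned in the contents list. The sketch below ranks features on synthetic data; the feature names and the smaller `n_estimators=50` are placeholders chosen for a quick illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature names for a synthetic data set
feature_names = [f"feature_{i}" for i in range(6)]
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=6, n_informative=3, n_redundant=0, random_state=3
)

rf = RandomForestClassifier(n_estimators=50, random_state=101).fit(X_syn, y_syn)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

With the notebook's own objects, the same pattern would use `RFmodel.feature_importances_` with `X2_train.columns` as the index.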



In [32]:
# These functions compute single-number metrics: ROC AUC, accuracy, precision, and recall
from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score
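
The imported `roc_auc_score` expects scores or probabilities rather than hard class labels. A minimal sketch on synthetic data follows; the data and variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the no-show data
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=4
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.30, random_state=4, stratify=y_syn
)

rf = RandomForestClassifier(n_estimators=100, random_state=101).fit(X_tr, y_tr)

# Use the predicted probability of class 1, not the 0/1 predictions
scores = rf.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, scores)
print("ROC AUC:", round(auc, 3))
```

With the notebook's own objects, the equivalent call would be `roc_auc_score(y2_test, RFmodel.predict_proba(X2_test)[:, 1])`.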

## 8. Model Comparison and Model Selection

In [34]:
import pandas as pd
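# The values are the accuracy figures from the classification reports above (LR from the 10-variable run)
Model_Comp = pd.DataFrame(data=[0.80, 0.73, 0.78], index=['LR', 'DT', 'RF'], columns=['Accuracy'])
Model_Comp

Unnamed: 0,Accuracy
LR,0.8
DT,0.73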
RF,0.78
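
The values 0.80, 0.73, and 0.78 correspond to the accuracy rows of the classification reports. Since the target is imbalanced, a fuller comparison can be derived from the confusion matrices printed in Sections 5 to 7 (the 12-variable run for logistic regression). The sketch below recomputes accuracy and the class 1 metrics; the column labels are illustrative.

```python
import numpy as np
import pandas as pd

# Confusion matrices copied from the outputs above (rows = actual, columns = predicted)
matrices = {
    "LR": np.array([[26155, 292], [6581, 130]]),
    "DT": np.array([[21728, 4719], [4361, 2350]]),
    "RF": np.array([[24351, 2096], [5192, 1519]]),
}

rows = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    rows[name] = {
        "Accuracy": (tn + tp) / cm.sum(),
        "Precision (1)": precision,
        "Recall (1)": recall,
        "F1 (1)": 2 * precision * recall / (precision + recall),
    }

comparison = pd.DataFrame.from_dict(rows, orient="index").round(2)
print(comparison)
```

On these numbers, logistic regression has the highest accuracy but the lowest class 1 recall, the decision tree has the highest class 1 recall and F1, and the random forest has the highest class 1 precision. Which model is preferable depends on whether missed no-shows or false alarms are more costly.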


#### The course materials are developed mainly based on my personal experience and contributions from the Python learning community. 


Copyright ©2023 Mei Najim. All rights reserved.  