<a href="https://colab.research.google.com/github/chris-lovejoy/MLmedics/blob/master/exercises/Predicting_No_Shows.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Predicting Hospital Non-Attendance

**In this exercise, we will train different machine learning models (including a neural network) to predict whether somebody will attend their upcoming hospital appointment.**

Our models will be trained on the publicly-available ["Medical Appointment No Shows"](https://www.kaggle.com/datasets/joniarroba/noshowappointments) dataset.

This is a *classification* problem, because we're asking the algorithm to classify individuals into two classes: **will attend** or **won't attend**. We'll train a Nearest Neighbours Classifier for this task. You're then invited to try out other classification algorithms, such as:

- Random Forest
- Support Vector Machine 
- Neural networks

In this exercise, we'll learn how to:

- **Download data** and **load it into our Jupyter Notebook**
- **Import useful libraries** like pandas and scikit-learn
- **Clean our data**
- **Engineer new features**


## Part 1: Downloading and importing our data

To train a machine learning model, the first thing we need is data.

There are various open-source datasets available on the internet. Great sources of datasets include [Kaggle](https://www.kaggle.com/), [Papers with Code](https://paperswithcode.com/datasets) and [data.world](https://data.world/datasets/health).

For this exercise, we're using a dataset available on Kaggle. You can view information about the dataset and download it [here](https://www.kaggle.com/datasets/joniarroba/noshowappointments). (You may need to create a Kaggle account, which is definitely worth doing - Kaggle is great.)

*Note: we want version 5 of the dataset (which is the default - at the time of writing).*

### Downloading and moving our data

Once we've downloaded the data, we're looking for the *.csv* file (it may be within a .zip file, which needs unzipping). CSV stands for 'comma-separated values' and refers to the fact that each row of data is stored with values that are separated by commas. You can open the file in a 'plain text' editor (like Notepad (Windows) or TextEdit (Mac)) to see what this looks like.
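To make the idea of comma-separated values concrete, here is a minimal sketch. The three lines of text are made up for illustration (they are not copied from the real file), and the variable names `csv_text` and `toy_table` are just placeholders.

```python
import io

import pandas as pd

# Made-up rows for illustration only; they are not taken from the real dataset.
csv_text = (
    "Gender,Age,No-show\n"
    "F,62,No\n"
    "M,56,Yes\n"
)

# Each line of text is one row, and the commas separate the columns.
for line in csv_text.splitlines():
    print(line.split(","))

# pandas can parse the same text into a table (a DataFrame).
toy_table = pd.read_csv(io.StringIO(csv_text))
print(toy_table)
```

The first line of the text becomes the column names, and every following line becomes one row of the table.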

Where we put that file will depend on whether we're running Jupyter Notebook locally or in Google Colab. These options are discussed in the [Jupyter Notebook setup exercise](https://github.com/chris-lovejoy/CodingForMedicine/blob/main/exercises/Setting_up_Jupyter_Notebook.ipynb).

**If you are using Google Colab**, you need to:
1. Make sure you are connected to a runtime (click Connect in the top right, if you aren't)
2. Select the 'Files' folder on the left hand tab
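3. Drag our downloaded no-shows CSV file into the Files tab (it should show the file uploading in the bottom left, and then you'll see the file within the Files tab). Make sure it is named 'no-shows-data.csv', as that is the filename we use when loading the data later in this notebook.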

**If you are running Jupyter Notebook on your local computer**, you can simply:
1. Drag our downloaded CSV file into the same folder (aka. directory) as this Jupyter Notebook, and make sure it is named 'no-shows-data.csv'.


### Importing the 'pandas' library

To load our data, we're going to use a popular library called ["Pandas"](https://pandas.pydata.org). A library is a collection of code with ready-made functions that we can use. We can import it with one line of code and then use it for a wide range of functionality.

In [1]:
import pandas as pd

*If you are running on your local computer and get the following error, it means you need to install pandas on your computer. Instructions for doing so are available [here](https://pypi.org/project/pandas/). Drop me a message if you have any difficulties (message form at the bottom of this document).*


> "ModuleNotFoundError: No module named 'pandas'"

We're also going to need another popular library called numpy, so let's import that too:

In [2]:
import numpy as np

### Importing our data into the notebook

Once we have our data in the correct place and pandas imported, we can load it with the following command. This uses one of the *pandas* functions **read_csv()**, which lets us load csv files and save them as a **'DataFrame'** (which is basically a table).

In [3]:
noShows = pd.read_csv('./no-shows-data.csv')

Our table should now be saved in the 'noShows' variable, so we should be able to see it by running the cell below.

In [4]:
noShows

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,2.987250e+13,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,5.589978e+14,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4.262962e+12,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,8.679512e+11,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8.841186e+12,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110522,2.572134e+12,5651768,F,2016-05-03T09:15:35Z,2016-06-07T00:00:00Z,56,MARIA ORTIZ,0,0,0,0,0,1,No
110523,3.596266e+12,5650093,F,2016-05-03T07:27:33Z,2016-06-07T00:00:00Z,51,MARIA ORTIZ,0,0,0,0,0,1,No
110524,1.557663e+13,5630692,F,2016-04-27T16:03:52Z,2016-06-07T00:00:00Z,21,MARIA ORTIZ,0,0,0,0,0,1,No
110525,9.213493e+13,5630323,F,2016-04-27T15:09:23Z,2016-06-07T00:00:00Z,38,MARIA ORTIZ,0,0,0,0,0,1,No


## Part 2: Visualising our data

To understand our data, we can use functions such as:

- head()
- tail()
- describe() [to be used on specific variables]

In [5]:
noShows.head()

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,29872500000000.0,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,558997800000000.0,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4262962000000.0,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,867951200000.0,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8841186000000.0,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No


In [6]:
noShows.Age.describe()

count    110527.000000
mean         37.088874
std          23.110205
min          -1.000000
25%          18.000000
50%          37.000000
75%          55.000000
max         115.000000
Name: Age, dtype: float64

We can also do a quick screen to see all the unique values for each variable.

In [7]:
print('Age:',sorted(noShows.Age.unique()))
print('Gender:',noShows.Gender.unique())
print('Diabetes:',noShows.Diabetes.unique())
print('Alcoholism:',noShows.Alcoholism.unique())
print('Hypertension:',noShows.Hipertension.unique())
print('Handicap:',noShows.Handcap.unique())
print('Scholarship:',noShows.Scholarship.unique())
print('SMS_received:',noShows.SMS_received.unique())
print('Neighbourhood:',noShows.Neighbourhood.unique())



Age: [-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 102, 115]
Gender: ['F' 'M']
Diabetes: [0 1]
Alcoholism: [0 1]
Hypertension: [1 0]
Handicap: [0 1 2 3 4]
Scholarship: [0 1]
SMS_received: [0 1]
Neighbourhood: ['JARDIM DA PENHA' 'MATA DA PRAIA' 'PONTAL DE CAMBURI' 'REPÚBLICA'
 'GOIABEIRAS' 'ANDORINHAS' 'CONQUISTA' 'NOVA PALESTINA' 'DA PENHA'
 'TABUAZEIRO' 'BENTO FERREIRA' 'SÃO PEDRO' 'SANTA MARTHA' 'SÃO CRISTÓVÃO'
 'MARUÍPE' 'GRANDE VITÓRIA' 'SÃO BENEDITO' 'ILHA DAS CAIEIRAS'
 'SANTO ANDRÉ' 'SOLON BORGES' 'BONFIM' 'JARDIM CAMBURI' 'MARIA ORTIZ'
 'JABOUR' 'ANTÔNIO HONÓRIO' 'RESISTÊNCIA' 'ILHA DE SANTA MARIA'
 'JUCUTUQUARA' 'MONTE BELO' 'MÁ

**TASK**: Use the cell below to explore and better understand the dataset.
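As one possible starting point for this task, the sketch below shows a few common checks. It runs on a tiny made-up table called `toy` (a stand-in for **noShows**, so that the example is self-contained); the same calls should work on the real DataFrame.

```python
import pandas as pd

# Made-up stand-in for noShows, for illustration only.
toy = pd.DataFrame({
    "Age": [62, 56, 8, -1, 115],
    "Gender": ["F", "M", "F", "F", "M"],
    "No-show": ["No", "No", "Yes", "No", "Yes"],
})

# How many rows fall into each class of the target variable?
class_counts = toy["No-show"].value_counts()
print(class_counts)

# Are there missing values in any column?
missing_per_column = toy.isna().sum()
print(missing_per_column)

# What data type does each column have?
print(toy.dtypes)
```

Checking the class counts early is useful because a heavily imbalanced target affects how we should interpret model scores later on.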

## Part 3: Cleaning our data

### Removing outliers

From looking at what noShows.Age.describe() returns, we can already see some problems. For example, the minimum age is -1 and the maximum age is 115.

Let's remove all datapoints outside of a particular range.

To do this, we can use the **.loc** functionality of pandas to identify datapoints that meet particular conditions.

Let's set the upper cut-off at 90 years and the lower cut-off at 5 years. Because the comparisons are strict, patients aged exactly 5 or exactly 90 are excluded too:

In [8]:
noShows.loc[(noShows.Age < 90) & (noShows.Age > 5)]

Unnamed: 0,PatientId,AppointmentID,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hipertension,Diabetes,Alcoholism,Handcap,SMS_received,No-show
0,2.987250e+13,5642903,F,2016-04-29T18:38:08Z,2016-04-29T00:00:00Z,62,JARDIM DA PENHA,0,1,0,0,0,0,No
1,5.589978e+14,5642503,M,2016-04-29T16:08:27Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,0,0,0,0,0,No
2,4.262962e+12,5642549,F,2016-04-29T16:19:04Z,2016-04-29T00:00:00Z,62,MATA DA PRAIA,0,0,0,0,0,0,No
3,8.679512e+11,5642828,F,2016-04-29T17:29:31Z,2016-04-29T00:00:00Z,8,PONTAL DE CAMBURI,0,0,0,0,0,0,No
4,8.841186e+12,5642494,F,2016-04-29T16:07:23Z,2016-04-29T00:00:00Z,56,JARDIM DA PENHA,0,1,1,0,0,0,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
110522,2.572134e+12,5651768,F,2016-05-03T09:15:35Z,2016-06-07T00:00:00Z,56,MARIA ORTIZ,0,0,0,0,0,1,No
110523,3.596266e+12,5650093,F,2016-05-03T07:27:33Z,2016-06-07T00:00:00Z,51,MARIA ORTIZ,0,0,0,0,0,1,No
110524,1.557663e+13,5630692,F,2016-04-27T16:03:52Z,2016-06-07T00:00:00Z,21,MARIA ORTIZ,0,0,0,0,0,1,No
110525,9.213493e+13,5630323,F,2016-04-27T15:09:23Z,2016-06-07T00:00:00Z,38,MARIA ORTIZ,0,0,0,0,0,1,No


We can see that this yields 98,378 rows - out of our original 110,527. Given that our dataset is large, this may not be of concern. If we wanted to minimise the data that is removed, we could do further manual inspection of this dataset to understand the age distribution and consider more refined removal criteria.



If we're happy, we can go ahead and update the dataset by assigning the filtered table back to the same variable:

In [9]:
noShows = noShows.loc[(noShows.Age < 90) & (noShows.Age > 5)]
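To see exactly which ages the filter above keeps, here is a small sketch on a made-up list of ages (the `toy` table and the names `keep`, `filtered` and `rows_removed` are placeholders for illustration).

```python
import pandas as pd

# Made-up ages chosen to sit on either side of the cut-offs.
toy = pd.DataFrame({"Age": [-1, 3, 5, 6, 42, 89, 90, 115]})

# Each comparison gives a column of True/False values; & combines them row by row.
# The parentheses are needed because & binds more tightly than < and >.
keep = (toy.Age < 90) & (toy.Age > 5)
print(keep.tolist())

# .loc keeps only the rows where the mask is True.
filtered = toy.loc[keep]
rows_removed = len(toy) - len(filtered)
print(filtered.Age.tolist(), "rows removed:", rows_removed)
```

Counting the removed rows in this way is a quick check that a filter is doing what we expect before we overwrite the original variable.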

**TASK**: Have a look at other variables, to see whether there are other outliers or erroneous data points that should be removed.

### Correct spelling mistakes

We can also see some simple spelling mistakes in the column names. We can correct these using the **.rename()** function:

In [10]:
noShows = noShows.rename(columns = {'Hipertension': 'Hypertension',
                         'Handcap': 'Handicap'})

### Removing unhelpful columns

There are some columns that won't be helpful for our analysis. We can remove those by using the **.drop()** function:

In [11]:
noShows = noShows.drop('PatientId', axis=1)
noShows = noShows.drop('AppointmentID', axis=1)
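As a hedged alternative, the renaming and dropping steps can also be chained and several columns dropped in one call. The sketch below uses a made-up two-row table named `toy`; the column names match the dataset, but the values are invented.

```python
import pandas as pd

# Made-up values for illustration; only the column names mirror the dataset.
toy = pd.DataFrame({
    "PatientId": [1.0, 2.0],
    "AppointmentID": [10, 11],
    "Hipertension": [1, 0],
    "Handcap": [0, 0],
})

# rename() and drop() both return a new DataFrame, so they can be chained.
tidy = (
    toy.rename(columns={"Hipertension": "Hypertension", "Handcap": "Handicap"})
       .drop(columns=["PatientId", "AppointmentID"])
)
print(list(tidy.columns))
```

Note that `toy` itself is unchanged here; the result only persists because we store it in a new variable, which is why the cells above assign back to **noShows**.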

### Converting variables to binary

Computers like to work with binary values of 0 or 1 much more than with text. So let's convert our 'No-show' column values from 'Yes' and 'No' into 1 and 0, and our 'Gender' column values from 'F' and 'M' into 1 and 0, using the **.map()** function:

In [12]:
noShows['No-show'] = noShows['No-show'].map({'Yes':1, 'No':0})
noShows['Gender'] = noShows['Gender'].map({'F':1, 'M':0})
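One thing to be aware of with **.map()** is that any value missing from the dictionary becomes NaN (a missing value). The sketch below shows this on a made-up column containing a deliberately mis-typed 'yes'; the names `toy`, `mapped` and `unmapped_count` are placeholders.

```python
import pandas as pd

# Made-up values; the lower-case "yes" is a deliberate inconsistency.
toy = pd.Series(["Yes", "No", "yes", "No"], name="No-show")

# Values not found in the dictionary are turned into NaN.
mapped = toy.map({"Yes": 1, "No": 0})
print(mapped)

# Counting the NaNs afterwards tells us whether any value was left unmapped.
unmapped_count = int(mapped.isna().sum())
print("unmapped values:", unmapped_count)
```

Running a check like this after mapping is a cheap way to catch unexpected categories. It is also a reason to avoid running the mapping cell twice, since the second run would find only 0s and 1s and turn everything into NaN.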

### Further cleaning

**TASK:** Have a look through the dataset and see if there's any further cleaning that it requires.

## Part 4: Feature engineering

Sometimes, the variables that we have in our table aren't the best ones we want to feed into our models for training. In this case, we may want to manually "engineer features" for our model. This could involve transforming existing features and/or combining multiple.

A simple example in our case could be to extract the waiting time for each individual (i.e. the time between their booking being made and the date of their appointment). This feature doesn't yet exist, but it seems reasonable that it would be relevant.

### Engineering the "waiting time" variable

To add the waiting time, we'll want to calculate the difference between values in the "ScheduledDay" and "AppointmentDay" columns.

However, on inspection we can notice a problem: The values in both columns are 'string' variables. (See the notebook on [Python Principles](./Python_principles.ipynb) if you're not sure what a string is.)

In [13]:
noShows.ScheduledDay

0         2016-04-29T18:38:08Z
1         2016-04-29T16:08:27Z
2         2016-04-29T16:19:04Z
3         2016-04-29T17:29:31Z
4         2016-04-29T16:07:23Z
                  ...         
110522    2016-05-03T09:15:35Z
110523    2016-05-03T07:27:33Z
110524    2016-04-27T16:03:52Z
110525    2016-04-27T15:09:23Z
110526    2016-04-27T13:30:56Z
Name: ScheduledDay, Length: 98378, dtype: object

In [14]:
type(noShows.AppointmentDay[0])

str

Given that they're currently strings, there's no way they can be easily subtracted. We therefore want to convert them into "datetime" format, which enables us to handle this type of data.

We can do that using the **.apply()** function and the numpy **datetime64** format, as below:

In [15]:
noShows.ScheduledDay = noShows.ScheduledDay.apply(np.datetime64)
noShows.AppointmentDay = noShows.AppointmentDay.apply(np.datetime64)

To make things simpler, we can focus only on the date - and not the time of day. We can update that as below:

In [16]:
noShows['ScheduledDay'] = noShows['ScheduledDay'].dt.date
noShows['AppointmentDay'] = noShows['AppointmentDay'].dt.date

We'll now go ahead and add a new column called "WaitingTime", using the pandas **to_timedelta()** function, as follows:

In [17]:
noShows['WaitingTime'] = pd.to_timedelta((noShows['AppointmentDay'] - noShows['ScheduledDay'])).dt.days
noShows['WaitingTime'] = noShows['WaitingTime'].apply(np.int64)

(Note: It is generally better to implement column-wise transformations in this way, rather than with a loop - both for better efficiency and for reduced risk of bugs.)
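For comparison, here is a sketch of the same waiting-time calculation using a different technique from the cells above: pandas' own **pd.to_datetime()** and **.dt.normalize()** instead of applying numpy's datetime64 and taking **.dt.date**. The three rows are made up (in the same timestamp format as the dataset), and the final check for negative waiting times is an extra suggestion rather than part of the original pipeline.

```python
import pandas as pd

# Made-up rows in the same timestamp format as the dataset, for illustration only.
toy = pd.DataFrame({
    "ScheduledDay": ["2016-04-29T18:38:08Z", "2016-04-27T15:09:23Z", "2016-06-08T09:00:00Z"],
    "AppointmentDay": ["2016-04-29T00:00:00Z", "2016-06-07T00:00:00Z", "2016-06-07T00:00:00Z"],
})

# Parse the strings into timestamps, then reset the time of day to midnight.
scheduled = pd.to_datetime(toy["ScheduledDay"]).dt.normalize()
appointment = pd.to_datetime(toy["AppointmentDay"]).dt.normalize()

# Subtracting two datetime columns gives a timedelta column; .dt.days extracts whole days.
toy["WaitingTime"] = (appointment - scheduled).dt.days
print(toy)

# A negative waiting time would mean the booking was recorded after the appointment date,
# which is worth checking for as a possible data error.
negative_waits = int((toy["WaitingTime"] < 0).sum())
print("negative waiting times:", negative_waits)
```

Both approaches operate on whole columns at once, so neither needs a loop.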

Now we can see our new variable "WaitingTime":

In [18]:
noShows.tail(15)

Unnamed: 0,Gender,ScheduledDay,AppointmentDay,Age,Neighbourhood,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,No-show,WaitingTime
110511,1,2016-06-08,2016-06-08,14,MARIA ORTIZ,0,0,0,0,0,0,0,0
110512,1,2016-06-08,2016-06-08,41,MARIA ORTIZ,0,0,0,0,0,0,0,0
110514,1,2016-06-08,2016-06-08,58,MARIA ORTIZ,0,0,0,0,0,0,0,0
110515,0,2016-06-06,2016-06-08,33,MARIA ORTIZ,0,1,0,0,0,0,1,2
110516,1,2016-06-07,2016-06-08,37,MARIA ORTIZ,0,0,0,0,0,0,1,1
110517,1,2016-06-07,2016-06-07,19,MARIA ORTIZ,0,0,0,0,0,0,0,0
110518,1,2016-04-27,2016-06-07,50,MARIA ORTIZ,0,0,0,0,0,1,0,41
110519,1,2016-04-27,2016-06-07,22,MARIA ORTIZ,0,0,0,0,0,1,0,41
110520,1,2016-05-03,2016-06-07,42,MARIA ORTIZ,0,0,0,0,0,1,0,35
110521,1,2016-05-03,2016-06-07,53,MARIA ORTIZ,0,0,0,0,0,1,0,35


**TASK:** Look at the current variables and consider what other features you could extract.

### Adding dummy variables

Sometimes, we'll want to convert a variable into multiple separate variables. For example, we might have categorical variables, which many models can't "natively" incorporate.

A solution here can be to generate "dummy variables". Rather than having one variable with all the categories, we create binary (yes/no) variables for each of the categories.

In this exercise, we can do this for the 'Neighbourhood' variable. We can use the pandas function **"get_dummies"** to do so:

In [19]:
dummy_cols = ['Neighbourhood']
noShows = pd.get_dummies(noShows, columns = dummy_cols)

We can now see **many** new columns, for each potential neighbourhood.

In [20]:
noShows.head()

Unnamed: 0,Gender,ScheduledDay,AppointmentDay,Age,Scholarship,Hypertension,Diabetes,Alcoholism,Handicap,SMS_received,...,Neighbourhood_SANTOS REIS,Neighbourhood_SEGURANÇA DO LAR,Neighbourhood_SOLON BORGES,Neighbourhood_SÃO BENEDITO,Neighbourhood_SÃO CRISTÓVÃO,Neighbourhood_SÃO JOSÉ,Neighbourhood_SÃO PEDRO,Neighbourhood_TABUAZEIRO,Neighbourhood_UNIVERSITÁRIO,Neighbourhood_VILA RUBIM
0,1,2016-04-29,2016-04-29,62,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,2016-04-29,2016-04-29,56,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,2016-04-29,2016-04-29,62,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,2016-04-29,2016-04-29,8,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,2016-04-29,2016-04-29,56,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Given the large number of neighbourhoods in our dataset, this may not actually be the most appropriate way to handle this variable. A more intelligent approach may be to extract further information about the neighbourhoods - such as geographical location or socioeconomic information about them. 

However, we don't have that information readily available, so we'll stick with this approach for demonstration purposes for now.
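To make the idea of dummy variables concrete, here is a minimal sketch on a made-up three-row table (`toy` and `encoded` are placeholder names). The `dtype=int` argument is an assumption added here so that the new columns show 0 and 1; without it, some pandas versions show True and False instead.

```python
import pandas as pd

# Made-up three-row table, for illustration only.
toy = pd.DataFrame({
    "Age": [62, 56, 8],
    "Neighbourhood": ["JARDIM DA PENHA", "MATA DA PRAIA", "JARDIM DA PENHA"],
})

# One new 0/1 column is created per category, and the original column is removed.
encoded = pd.get_dummies(toy, columns=["Neighbourhood"], dtype=int)
print(encoded)
```

With two categories we get two new columns; with the many neighbourhoods in the real dataset we get one column per neighbourhood, which is why the table becomes so wide.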

## Part 5: Preparing and training the model

Let's now prepare and train our model. We'll start by selecting all the variables we want to use for training and storing their names in a list.

*(Not sure what a list is? Check out the [Python principles](./Python_principles.ipynb) exercise.)*

In [21]:
prediction_var = ['Gender','Age','Scholarship','Hypertension','Diabetes','Alcoholism','Handicap','SMS_received','WaitingTime']


We'll split our data into training and test data, using a function from sklearn.

We do this so we can understand whether our model is actually helpful. In the real world, we want to use our model on data it's never seen before. 

When training the model, we only show it the 'training' data. We can then test it on the 'test' data, and use that to understand how well the model may perform on new (unseen) data.

In [22]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(noShows, test_size = 0.15)

We can now create our input 'x' variables and our output 'y' variables for the training and test sets:

In [23]:
train_x = train[prediction_var]
train_y = train['No-show']

test_x = test[prediction_var]
test_y = test['No-show']
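The split above is random, so the rows in each set (and therefore the final score) will differ from run to run. As a hedged sketch, the optional **random_state** and **stratify** arguments of train_test_split can make the split repeatable and keep the proportion of no-shows the same in both sets. The table below is made up, and the values 0.25 and 42 are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table: 20 rows, of which 4 (20%) are no-shows.
toy = pd.DataFrame({
    "Age": list(range(20, 40)),
    "No-show": [0] * 16 + [1] * 4,
})

train, test = train_test_split(
    toy,
    test_size=0.25,           # 5 of the 20 rows go to the test set
    random_state=42,          # fixed seed, so the same rows are chosen every run
    stratify=toy["No-show"],  # keep the share of no-shows equal in both sets
)

print(len(train), len(test))
print("no-shows in train:", train["No-show"].sum(), "in test:", test["No-show"].sum())
```

Fixing the seed is mostly useful while experimenting, because it lets us attribute a change in score to a change in the model rather than to a different random split.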

### Train the model

Finally, we'll import our model and train it. Here we'll train a nearest neighbours classifier. Explaining this is beyond the scope of this exercise, but you can read about it [here](https://scikit-learn.org/stable/modules/neighbors.html).

In [30]:
from sklearn.neighbors import KNeighborsClassifier

In [48]:
model = KNeighborsClassifier()
model.fit(train_x, train_y)

KNeighborsClassifier()

### Generate and assess predictions

We can generate some predictions as follows:

In [49]:
predictions = model.predict(test_x)

And have a quick look at those predictions:

In [53]:
predictions

array([0, 0, 0, ..., 0, 0, 1])

Then, we can quantify those predictions using the sklearn function **f1_score()**:

In [51]:
from sklearn.metrics import f1_score

In [52]:
f1_score(test_y, predictions)

0.23368606701940034

The closer to 1, the better the F1 score (and the closer to 0, the worse).

In reality, just looking at one score is not a good way to assess the model. Assessing model performance (including a description of the F1 score) is covered in more detail in [this exercise](./Breast_cancer_features.ipynb).
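To illustrate why a single score can mislead, here is a small worked sketch using made-up labels and predictions (ten appointments, four of which were no-shows). It uses the sklearn functions **confusion_matrix()**, **precision_score()** and **recall_score()** alongside **f1_score()**, and also recomputes the F1 score by hand.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration: 1 = no-show, 0 = attended.
true_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(true_labels, predicted).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

accuracy = (tp + tn) / len(true_labels)
precision = precision_score(true_labels, predicted)  # TP / (TP + FP)
recall = recall_score(true_labels, predicted)        # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall.
manual_f1 = 2 * precision * recall / (precision + recall)
f1 = f1_score(true_labels, predicted)
print(accuracy, precision, recall, manual_f1, f1)
```

In this toy case 60% of predictions are correct, yet only one of the four no-shows is caught, and the F1 score of about 0.33 reflects that.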

## Next Steps

Have a go at modifying our pipeline. See if you can improve the F1 score. Potential variations include:

- Different features (using the 'prediction_var' list)
- Different parameters (such as the number of neighbours for the nearest neighbours classifier, different learning rates and layers for a neural network, or different train/test splits)
- Different models (see below)

Popular classifiers to try as alternative models include:

- [Random Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [Neural networks](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html)
- [Support vector machines (SVMs)](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)

These links will take you to the documentation. See if you can work out how to import them and then train the model based on the description and examples that they provide.
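If you get stuck, the sketch below shows the general shape of swapping in a different classifier, using a Random Forest as the example. To keep it self-contained it trains on synthetic data generated by sklearn's **make_classification()** rather than on our no-shows table, and the settings (500 samples, 9 features, 100 trees, seed 0) are arbitrary assumptions. In your own notebook you would use the train_x, train_y, test_x and test_y variables created earlier instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 500 rows with 9 numeric features and a binary label.
X, y = make_classification(n_samples=500, n_features=9, random_state=0)
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.15, random_state=0)

# The pattern is the same as before: create the model, fit it, then predict.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_x, train_y)
predictions = model.predict(test_x)

score = f1_score(test_y, predictions)
print(score)
```

Because sklearn classifiers share the same fit and predict methods, trying another model usually only requires changing the import line and the line that creates the model.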

Fill out the form below and we'll provide feedback on your code.

**Any feedback on the exercise? Any questions? Want feedback on your code? Please fill out the form [here](https://docs.google.com/forms/d/e/1FAIpQLSdoOjVom8YKf11LxJ_bWN40afFMsWcoJ-xOrKhMbfBzgxTS9A/viewform).**