# Introduction
In Part I, we did a bit of simple cleaning to remove some NAs. In Part II, we performed uni- and bivariate analysis. 

In Part III, we created dummy variables for the columns containing categorical data and simplified the dataset to be used. 

In this Part IV, we will perform machine learning. 

![MachineLearningProcess.png](attachment:MachineLearningProcess.png)

We include this section in all UpLevel projects, so bear with us if you've seen this before. 

Generally, the machine learning process has five parts:
1. <strong>Split your data into train and test sets</strong>
2. <strong>Model creation</strong>
<br>
Import your models from sklearn and instantiate them (assign the model object to a variable)
3. <strong>Model fitting</strong>
<br>
Fit the model to your training data so that it can learn from it
4. <strong>Model prediction</strong>
<br>
Make a set of predictions using your test data
5. <strong>Model assessment</strong>
<br>
Compare your predictions with the ground truth in the test data

Highly recommended readings:
1. [Important] https://scipy-lectures.org/packages/scikit-learn/index.html
2. https://machinelearningmastery.com/a-gentle-introduction-to-scikit-learn-a-python-machine-learning-library/
3. https://scikit-learn.org/stable/tutorial/basic/tutorial.html

### Step 1: Import your library
We will be using models from sklearn - a popular machine learning library. However, we won't import everything from sklearn; we'll take just what we need. 

We'll need to import plotting libraries to plot our predictions against the ground truth (test data). 

Import the following:
1. pandas as pd

We'll start with pandas, and import other libraries later.

In [None]:
# Step 1: Import pandas

### Step 2: Read your CSV from Part III as a DataFrame
We'll read the CSV that we exported from the previous Part. 

Again, make sure you have 82,965 rows and 23 columns.
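
Here is a minimal sketch of reading a CSV and checking its shape. Since the real file name isn't given here, the sketch reads a tiny made-up CSV from memory instead; with your own file, you would pass its path to `pd.read_csv`. The column values are hypothetical.

```python
import io

import pandas as pd

# A tiny made-up CSV standing in for the file exported in Part III
csv_text = "age,bmi,hospital_death\n68,22.7,0\n45,31.0,0\n"

# pd.read_csv accepts a file path or a file-like object
df = pd.read_csv(io.StringIO(csv_text))

# .shape is (number of rows, number of columns)
print(df.shape)  # (2, 3)
```

On the real dataset, `df.shape` is a quick way to confirm the row and column counts mentioned above.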

In [None]:
# Step 2: Read your CSV into a DataFrame

### Step 3: Prepare your independent variables and dependent variable
We'll be preparing a DataFrame containing our independent variables, and a separate variable containing only the "hospital_death" values.

1. Declare a variable, and assign your independent variables to it, i.e. drop "hospital_death" from the DataFrame from Step 2
2. Declare a variable, and assign only values from "hospital_death"
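
The two steps above can be sketched on a toy DataFrame like this. Only "hospital_death" comes from this project; the other columns, the values, and the names `X` and `y` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the DataFrame from Step 2 (values are made up)
df = pd.DataFrame({
    "age": [68, 45, 77, 59],
    "bmi": [22.7, 31.0, 27.4, 24.9],
    "hospital_death": [0, 0, 1, 0],
})

# Independent variables: every column except the target
X = df.drop(columns=["hospital_death"])

# Dependent variable: only the target column, as a pandas Series
y = df["hospital_death"]

print(X.shape, y.shape)  # (4, 2) (4,)
```

Note that `drop` returns a new DataFrame here, so the original `df` keeps its "hospital_death" column.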

In [None]:
# Step 3a: Prepare your independent variables

# Step 3b: Prepare your dependent variable


### Step 4: Import machine learning libraries
Time to import other libraries. We hope you've taken a look at the recommended readings at the start of this notebook because they'll be useful. 

This is a classification task, so we will be using Classifiers. 

In addition, since the dependent variable is imbalanced, we will be using AUC as a metric to assess model performance, and confusion_matrix for visual inspection.

Import the following libraries and methods:
1. train_test_split - sklearn.model_selection
2. DummyClassifier - sklearn.dummy
3. LogisticRegression - sklearn.linear_model
4. DecisionTreeClassifier - sklearn.tree
5. RandomForestClassifier - sklearn.ensemble
6. KNeighborsClassifier - sklearn.neighbors
7. roc_auc_score - sklearn.metrics
8. confusion_matrix - sklearn.metrics

Feel free to try other Classifiers, e.g., catboost, xgboost, lightgbm as well later on.
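
As a sketch, the list above maps to import statements along these lines. Note that scikit-learn uses the American spelling for `KNeighborsClassifier` and `sklearn.neighbors`.

```python
# Data splitting
from sklearn.model_selection import train_test_split

# Classifiers
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Metrics
from sklearn.metrics import roc_auc_score, confusion_matrix
```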

In [None]:
# Step 4: Import the machine learning libraries

### Step 5: Split your independent and dependent variables into train and test
Use train_test_split to split your dataset, and make sure you stratify by your dependent variable because the 1 and 0 classes are so imbalanced.
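
To see what stratifying does, here is a small sketch on made-up data with a 90/10 class imbalance. The `test_size=0.2` and `random_state=42` values are assumptions chosen for the example, not settings prescribed by this project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 90 negatives and 10 positives (made up)
X = pd.DataFrame({"feature": range(100)})
y = pd.Series([0] * 90 + [1] * 10, name="hospital_death")

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Count the positives in each split
print(y_train.sum(), "positives out of", len(y_train))
print(y_test.sum(), "positives out of", len(y_test))
```

Without `stratify`, a random split of imbalanced data could leave the test set with very few positive cases, which makes the assessment less reliable.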

In [None]:
# Step 5: Split your data

### Step 6: Train your machine learning model
Once you've split your data, machine learning begins. 

This is what you'll need to do:
1. Start with picking a model
2. Declare a variable, and store your model in it (don't forget to use brackets)
3. Fit the instantiated model to your training data
4. Declare a variable that contains predictions from the model you just trained, using the test dataset (X_test)

We recommend starting with DummyClassifier to establish a baseline for your predictions. 

Also, the recommended readings will be very helpful.
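
The instantiate, fit, and predict pattern can be sketched as follows. This uses synthetic data from `make_classification` rather than the real dataset, and `strategy="most_frequent"` is an assumption made so the baseline's behaviour is easy to see; the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Instantiate the model (the brackets create the model object)
dummy_model = DummyClassifier(strategy="most_frequent")

# Fit the model on the training data
dummy_model.fit(X_train, y_train)

# Predict using the test independent variables
dummy_pred = dummy_model.predict(X_test)
```

With this strategy, the baseline ignores the features and always predicts the majority class, which is what makes it a useful floor for comparison.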

In [None]:
# Step 6a: Declare a variable to store the model

# Step 6b: Fit your train dataset

# Step 6c: Declare a variable and store your predictions that you make with your model using X test data


### Step 7: Repeat Step 6 with other models
Now that we're done with DummyClassifier, let's move on and train other classifiers as well.

We'll use LogisticRegression, DecisionTreeClassifier, RandomForestClassifier, and KNeighborsClassifier. 

Don't forget to use unique variable names since we'll have to assess the model performance later. 
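
One way to keep names unique is to store the models and their predictions in dictionaries, as in this sketch. It uses synthetic data, and the dictionary keys, `max_iter=1000`, and the `random_state` values are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real dataset
X, y = make_classification(
    n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# One entry per model, each with its own name
models = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}

# Fit each model and keep its test predictions under the same name
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Separate variables such as `lr_model` and `lr_pred` work just as well; the dictionary simply makes it easier to loop over the models later.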

In [None]:
# Step 7: Repeat Step 6 with other models

### Step 8: Assess your DummyClassifier model performance
Assessing a classification model is slightly different from assessing a regression model - we'll have to use either f1_score or AUC, together with a confusion matrix.

![ConfusionMatrixExpectation.png](attachment:ConfusionMatrixExpectation.png)

The confusion matrix can tell you how your classification went, by comparing the test dependent variable with the predictions made from the test independent variables.

Start with DummyClassifier first so that you know what the baseline is.

Print the AUC score, followed by printing the confusion_matrix.
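
Here is a small sketch of both metrics on hand-made labels, assuming a baseline that always predicts the majority class. The label values are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Made-up ground truth: four negatives and two positives
y_test = [0, 0, 0, 0, 1, 1]

# A baseline that always predicts the majority class
y_pred = [0, 0, 0, 0, 0, 0]

# Both functions take the ground truth first, then the predictions
print(roc_auc_score(y_test, y_pred))  # 0.5

# Rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, y_pred))
```

The confusion matrix here is `[[4, 0], [2, 0]]`: all four negatives are classified correctly, and both positives are missed. `roc_auc_score` also accepts probability scores instead of hard class predictions, which you may want to explore later.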

In [None]:
# Step 8: Assess DummyClassifier model performance

If you computed the roc_auc_score and the confusion matrix, you should see something like this.

The numbers won't be exact because the train/test split is random. 

![DummyResultsExpectation.png](attachment:DummyResultsExpectation.png)

<strong>When your classification is poor, or no better than random guessing, your AUC score will be around 0.5</strong>

We'll try our best to go above 0.5.

### Step 9: Measure the performance of the models you trained in Step 7
Do the same as you did in Step 8 for the other models you trained.

In [None]:
# Step 9: Repeat the printing of AUC score and the confusion matrix for the other models

### Step 10: Don't panic
When you assessed your model performances, you most likely found that the results were not too different from DummyClassifier's. Only DecisionTreeClassifier may do better than the others.

First things first, take a deep breath - you didn't mess up. Sometimes, data science is not just about modelling. 

When things don't work out, we have to take a step back and think about why. 

Remember all the columns that we dropped in Part I? Let's dig deeper and extract more information for modelling.

Let's go back to the original raw data and perform some <strong>advanced feature engineering</strong> in Part V.

If it's any consolation, you now know the drill - it'll be easier to model after this.