# Assignment 4

#### Student ID: *Double click here to fill the Student ID*

#### Name: *Double click here to fill the name*

Firstly, install the following dependencies. You may need to restart the kernel after the installation.

In [None]:
!pip install catboost -U -qq
!pip install xgboost -U -qq
!pip install lightgbm -U -qq
!pip install shap -qq
!pip install imodels -qq
!pip install bentoml --pre -qq
!pip install pyngrok -qq
!pip install PyYAML -U -qq

If you are using a Colab or Kaggle notebook, set up a tunnel with the following commands.

In [None]:
from pyngrok import ngrok, conf
import getpass

In [None]:
print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth")
conf.get_default().auth_token = getpass.getpass()

In [None]:
# Set up a tunnel to local port 8050
public_url = ngrok.connect(8050)



## Q1: Analyze hospital readmission dataset with interpretable methods

Hospital readmission is an episode when a patient who had been discharged from a hospital is admitted again within a specified time interval. Generally, a higher readmission rate indicates the ineffectiveness of treatment during past hospitalizations. 

Therefore, the hospital wants your help identifying the patients at the highest risk of being readmitted. Doctors will make the final decision about when to release each patient, but they hope you can build a model that highlights issues they should consider when discharging a patient. The hospital has given you relevant patient medical information. The given dataset contains the following features:

* Your prediction target is `readmitted`
* A feature name like `number_inpatient` refers to the number of inpatient visits of the patient in the year preceding the encounter.
* Features whose names contain the word `diag` indicate the diagnostic code of the illness or illnesses the patient was admitted with. For example, `diag_1_428` means the doctor recorded code "428" as the patient's first diagnosis.
* A feature name like `metformin_No` means the patient did not take the medicine `metformin`. If this feature has a value of `False`, then the patient did take `metformin`.
* Features whose names begin with `medical_specialty` describe the specialty of the doctor who treated the patient. The values in these fields are all `True` or `False`.

Firstly, use the following code snippet to set up the dataset. Note the CSV file can be downloaded from our course website.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import re

# Read the data and fix some column names
X = pd.read_csv('hospital.csv')
X = X.rename(columns = lambda x:re.sub('[^A-Za-z0-9_]+', '', x))

# Split into target and features
y = X.readmitted              
X.drop(['readmitted'], axis=1, inplace=True)

# Split into training and test set
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

In the following questions, whenever you use a method that is inherently random, make sure to set the random seed to 42.

(a) Firstly, fit an interpretable model and show doctors some evidence the model is doing something in line with their medical intuition. 

* Build a OneR decision rule model as a baseline and calculate its accuracy on the validation set.
* Now, two patients come in (their data are recorded in rows 21492 and 15100 of the original dataset), and they would like to know why they were or were not readmitted. Use the rule list generated by the OneR model to give each of them a reason.
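
As a reminder of how OneR works, here is a minimal plain-Python sketch on made-up data. It assumes every feature is already categorical (a real OneR implementation would first bin numeric columns), and the helper names `fit_one_r` and `predict_one_r`, as well as the toy feature `many_inpatient_visits`, are hypothetical. It is an illustration of the idea, not the expected solution.

```python
from collections import Counter, defaultdict

def fit_one_r(rows, labels):
    """Pick the single feature whose value -> majority-label rule makes the fewest errors."""
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    best = None
    for feature in rows[0]:
        # Count the labels observed for each value of this feature.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[feature]][label] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[feature]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, feature, rule)
    _, feature, rule = best
    return feature, rule, default

def predict_one_r(model, row):
    feature, rule, default = model
    return rule.get(row[feature], default)

# Made-up toy data, not taken from hospital.csv.
rows = [
    {"many_inpatient_visits": True, "metformin_No": True},
    {"many_inpatient_visits": True, "metformin_No": False},
    {"many_inpatient_visits": False, "metformin_No": True},
    {"many_inpatient_visits": False, "metformin_No": False},
    {"many_inpatient_visits": True, "metformin_No": True},
    {"many_inpatient_visits": False, "metformin_No": False},
]
labels = [1, 1, 0, 0, 1, 0]

model = fit_one_r(rows, labels)
accuracy = sum(predict_one_r(model, r) == y for r, y in zip(rows, labels)) / len(labels)
print(model[0], model[1], accuracy)
```

The selected feature and its value-to-label mapping are the whole model, which is why the rule can be read out directly as the reason for an individual patient's prediction.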

In [None]:
# coding your answer here.

(b) The doctor is glad that you convinced the patients, but he is still worried about the model performance of OneR you just built. 

* Build a more complex classifier using `LightGBM`: train a model with at most 5,000 trees that stops early once 100 consecutive rounds fail to find any improvement.
* Then report the accuracy on the validation set. Finally, plot the feature importance of the top ten features (use `gain` as the importance type instead of `split`).

Hint: The [early stopping callback](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.early_stopping.html) and the [feature_importance](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.feature_importance) method may be useful.
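
The stopping rule itself can be sketched in plain Python. This is only an illustration of "at most 5,000 rounds, stop after 100 rounds without improvement"; in LightGBM the callback linked above takes care of it. The function name `train_with_early_stopping` and the loss curve are hypothetical.

```python
def train_with_early_stopping(validation_losses, max_rounds=5000, patience=100):
    """Return (best_round, rounds_run) for a sequence of per-round validation losses."""
    best_loss = float("inf")
    best_round = 0
    rounds_run = 0
    for round_index, loss in enumerate(validation_losses, start=1):
        if round_index > max_rounds:
            break
        rounds_run = round_index
        if loss < best_loss:
            # New best score: remember it and reset the patience window.
            best_loss, best_round = loss, round_index
        elif round_index - best_round >= patience:
            # `patience` consecutive rounds without improvement: stop.
            break
    return best_round, rounds_run

# Hypothetical loss curve: improves for 50 rounds, then gets worse and stays flat.
losses = [1.0 / r for r in range(1, 51)] + [0.5] * 1000
best_round, rounds_run = train_with_early_stopping(losses)
print(best_round, rounds_run)  # 50 150
```

Training runs 100 rounds past the best round before stopping, so the model kept for prediction should be the one from the best round, not the last one.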

In [1]:
# coding your answer here.

(c) It appears `number_inpatient` is a critical feature, and the doctors would like to know more about that. 

* Create a partial dependence plot and an individual conditional expectation plot that show how `number_inpatient` affects the model's predictions. (Use the validation set to generate the plots.)
* Also create the partial dependence plot and the individual conditional expectation plot for `time_in_hospital`, so that the doctors can tell from these plots whether the effect of `number_inpatient` on the target is big or small.
* Comment on your results.
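
To make the two plot types concrete, the sketch below computes the underlying curves by hand for a made-up model. `toy_predict`, the two example rows, and the grid are assumptions for illustration only; for the actual plots, scikit-learn's `PartialDependenceDisplay.from_estimator` with `kind="both"` is one option.

```python
def ice_curves(predict, rows, feature, grid):
    """One curve per row: predictions as `feature` sweeps over `grid`, other features fixed."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)        # copy so the original row is untouched
            modified[feature] = value   # overwrite only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves

def partial_dependence(curves):
    """The partial dependence curve is the point-wise average of the ICE curves."""
    return [sum(points) / len(points) for points in zip(*curves)]

# Hypothetical stand-in for a fitted model's predicted probability.
def toy_predict(row):
    return 0.1 * row["number_inpatient"] + 0.01 * row["time_in_hospital"]

rows = [
    {"number_inpatient": 0, "time_in_hospital": 2},
    {"number_inpatient": 3, "time_in_hospital": 10},
]
grid = [0, 1, 2]

curves = ice_curves(toy_predict, rows, "number_inpatient", grid)
pdp = partial_dependence(curves)
print(curves)
print(pdp)
```

Each ICE curve follows one patient, while the partial dependence curve is their average, so parallel ICE curves (as in this additive toy model) suggest the feature does not interact strongly with the others.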

In [None]:
# coding your answer here.

(d) Now, the doctors are looking for the local explanations of your model. 
* Create a force plot (for the previous two patients) and a summary plot of Shapley values. (Use the validation set to generate the plots.)
* Use the force plots to explain to the previous two patients (their data are recorded in rows 21492 and 15100 of the original dataset) why they were or were not readmitted.
* Is the summary plot consistent with the feature importance plot?
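
A force plot visualizes the identity "base value + sum of Shapley values = prediction". The sketch below checks that identity with an exact, brute-force Shapley computation on a made-up two-feature model. It assumes that a feature "missing" from a coalition is replaced by a single baseline value, which is a simplification of what the `shap` library does; `toy_predict`, `baseline`, and `patient` are hypothetical.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution over all orderings."""
    features = list(instance)
    totals = {name: 0.0 for name in features}
    orders = list(permutations(features))
    for order in orders:
        current = dict(baseline)          # start with every feature at its baseline
        previous = predict(current)
        for name in order:
            current[name] = instance[name]  # reveal this feature's actual value
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orders) for name, total in totals.items()}

# Hypothetical model with an interaction term, standing in for the fitted classifier.
def toy_predict(row):
    return 0.1 * row["number_inpatient"] + 0.02 * row["time_in_hospital"] * row["number_inpatient"]

baseline = {"number_inpatient": 0, "time_in_hospital": 4}
patient = {"number_inpatient": 2, "time_in_hospital": 6}

phi = shapley_values(toy_predict, patient, baseline)
base_value = toy_predict(baseline)
print(phi)
print(base_value + sum(phi.values()), toy_predict(patient))
```

The brute-force loop grows factorially with the number of features, so it is only useful for building intuition; tree-specific explainers exist precisely to avoid this cost.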

In [None]:
# coding your answer here.

(e) Now the doctors are convinced you have the right data, and the model overview looked reasonable. It's time to turn this into a finished product they can use. 
* Try to use `BentoML` to save your LightGBM model, then reload it. 
* Use the reloaded model to make inferences on the previous two patients. Show that the inference results are the same as the original model for these two patients.

Hint: You may refer to https://docs.bentoml.org/en/latest/frameworks/lightgbm.html and our lab for more information.

In [None]:
# coding your answer here.

(f) Deploy your model as a REST API server using `BentoML`. In addition, test your server by sending a request that contains the data of the previous two patients. Show that the responses from the server are the same as the predictions of the original model for these two patients.

Hint: You should first transform the boolean value to an integer (True to 1 and False to 0) and transform the integer to a string before sending the request to your server.
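
The conversion in the hint can be sketched as follows. The record layout (a JSON list with one object per patient) is an assumption; the shape your server expects depends on how you define its input, and the example row is made up.

```python
import json

def to_request_row(row):
    """Convert booleans to 0/1 first, then convert every value to a string."""
    converted = {}
    for name, value in row.items():
        if isinstance(value, bool):
            value = int(value)          # True -> 1, False -> 0
        converted[name] = str(value)    # integer -> string
    return converted

# Made-up row with a few of the dataset's column names.
patient = {"number_inpatient": 2, "metformin_No": True, "diag_1_428": False}
payload = json.dumps([to_request_row(patient)])
print(payload)
# [{"number_inpatient": "2", "metformin_No": "1", "diag_1_428": "0"}]
```

Converting booleans before calling `str` matters: `str(True)` would produce `"True"` rather than `"1"`.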

In [None]:
# coding your answer here.

## Q2: Revisit the Ames housing dataset with ensemble learning

We will revisit the Ames Housing dataset from assignment 3 in this question. However, we would like to improve our performance using ensemble learning this time.

Firstly, use the following code snippet to set up the dataset. Note the CSV file can be downloaded from our course website.

In [None]:
# Read the data
X = pd.read_csv("ames.csv")

# Separate target from predictors
y = X.SalePrice              
X.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=42)

# "Cardinality" means the number of unique values in a column
# Here, we select categorical columns with relatively low cardinality 
low_cardinality_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 10 and 
                        X_train_full[cname].dtype == "object"]

# Select numeric columns
numeric_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = low_cardinality_cols + numeric_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In the following questions, whenever you use a method that is inherently random, make sure to set the random seed to 42.

(a) You'll build and train a gradient boosting model in this step.

* Use one-hot encoding for the categorical variables.
* Train an XGBoost regression model. **Leave all other parameters as default.**
* Then, fit the model to the training data and calculate the mean absolute error (MAE) corresponding to the predictions for the validation set.

Hint: Since we are working with both training and validation sets, try to apply the same transform when you encode the variables. You may find [ColumnTransformer](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html) and the `handle_unknown` option in the encoder useful.
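
The point of the hint is that the encoding is learned from the training set only and then reused unchanged on the validation set. The plain-Python sketch below illustrates the behaviour that `handle_unknown="ignore"` gives in scikit-learn's `OneHotEncoder`: a category never seen during fitting becomes an all-zero row. The helper names and the example values are hypothetical.

```python
def fit_categories(values):
    """Learn the category list from the training column only."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode with a fixed category list; unknown values become all-zero rows."""
    return [[1 if value == category else 0 for category in categories] for value in values]

# Made-up column values; "Dirt" appears only in the validation data.
train_street = ["Pave", "Grvl", "Pave"]
valid_street = ["Grvl", "Dirt"]

categories = fit_categories(train_street)
train_encoded = one_hot(train_street, categories)
valid_encoded = one_hot(valid_street, categories)
print(categories)      # ['Grvl', 'Pave']
print(train_encoded)   # [[0, 1], [1, 0], [0, 1]]
print(valid_encoded)   # [[1, 0], [0, 0]]
```

Fitting a second encoder on the validation set would instead produce different columns, so the model would see features it was not trained on.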

In [None]:
# coding your answer here.

(b) A simple technique for improving a gradient boosting model is to use a lower learning rate with more boosting iterations. Build another `XGBRegressor` with the learning rate set to 0.05 and the number of estimators set to 1000, and calculate the MAE on the validation set.

In [None]:
# coding your answer here.

(c) Now train an LGBMRegressor and a CatBoostRegressor with default parameters on the original training data (do not use one-hot encoding before feeding into the models this time; instead, let the models deal with it). Calculate the MAE on the validation set for these two models.

Hint: You may refer to our lab to see how to deal with categorical variables inside these two frameworks.

In [None]:
# coding your answer here.

(d) Build an ensemble model using a voting regressor that combines the `XGBRegressor` from (b) with the `LGBMRegressor` and the `CatBoostRegressor` from (c). Fit the ensemble model on the original training dataset and calculate the MAE on the validation set. In addition, build a stacking regressor from the same three models, fit it on the original training dataset, and calculate the MAE on the validation set. Finally, comment on your results.

Hint: You may find [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) useful when you combine three models with different preprocessing steps.
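
For intuition, a voting regressor simply averages the predictions of its base models. The sketch below shows that averaging step and the MAE calculation in plain Python; all numbers are made up for illustration and are not results from the Ames data.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def average_predictions(*model_predictions):
    """Unweighted voting for regression: average the models' predictions per sample."""
    return [sum(values) / len(values) for values in zip(*model_predictions)]

# Made-up sale prices and predictions from three hypothetical models.
y_valid = [200000, 150000, 300000]
pred_xgb = [210000, 140000, 290000]
pred_lgbm = [190000, 155000, 320000]
pred_cat = [205000, 150000, 305000]

ensemble = average_predictions(pred_xgb, pred_lgbm, pred_cat)
for name, pred in [("xgb", pred_xgb), ("lgbm", pred_lgbm), ("cat", pred_cat), ("vote", ensemble)]:
    print(name, round(mean_absolute_error(y_valid, pred), 2))
```

In this toy case the errors of the individual models partly cancel, so the average does better than each model alone; that is not guaranteed on real data. A stacking regressor replaces the fixed average with a final model that learns how to weight the base predictions.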

In [None]:
# coding your answer here.