# **Lab: Model Optimization**



## Exercise 2: XGBoost with Hyperopt

We will train an XGBoost model on the same dataset as before and tune its hyperparameters using Hyperopt.


**Pre-requisites:**
- Create a GitHub account (https://github.com/join)
- Install git (https://git-scm.com/book/en/v2/Getting-Started-Installing-Git)
- Install Docker (https://docs.docker.com/get-docker/)

The steps are:
1.   Launch Docker image
2.   Load Data
3.   Train XGBoost model with default hyperparameters
4.   Hyperparameter tuning with Hyperopt
5.   Push changes


### 1. Launch Docker image

**[1.1]** Go to the folder you created previously `adv_dsi_lab_3`

In [None]:
# Placeholder for student's code (1 command line)
# Task: Go to the folder you created previously adv_dsi_lab_3

In [None]:
#Solution:
cd ~/Projects/adv_dsi/adv_dsi_lab_3

**[1.2]** Run the built Docker image

In [None]:
docker run  -dit --rm --name adv_dsi_lab_3 -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v ~/Projects/adv_dsi/adv_dsi_lab_3:/home/jovyan/work -v ~/.aws:/home/jovyan/.aws -v ~/Projects/adv_dsi/src:/home/jovyan/work/src xgboost-notebook:latest 

Syntax: docker run [OPTIONS] IMAGE

Options:

`-dit: Run container in the background (-d) with an interactive terminal attached (-i, -t)`

`--rm: Automatically remove the container when it exits`

`--name: Assign a name to the container`

`-p: Publish a container's port(s) to the host`

`-e: Set environment variables`

`-v: Bind mount a volume`

Documentation: https://docs.docker.com/engine/reference/commandline/run/

**[1.3]** Display last 50 lines of logs

In [None]:
docker logs --tail 50 adv_dsi_lab_3

Syntax: docker logs [OPTIONS] CONTAINER

Options:

`--tail: Number of lines to show from the end of the logs`

Documentation: https://docs.docker.com/engine/reference/commandline/logs/

Copy the URL displayed and paste it into a browser to launch Jupyter Lab

**[1.4]** Create a new git branch called `xgboost_hyperopt`

In [None]:
git checkout -b xgboost_hyperopt

Documentation: https://www.atlassian.com/git/tutorials/using-branches/git-checkout

**[1.5]** Navigate to the folder `notebooks` and create a new Jupyter notebook called `2_xgboost_hyperopt.ipynb`

### 2. Load Data

**[2.1]** Import the function you created `load_sets` from `src/data/sets`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import the function you created load_sets from src/data/sets

In [None]:
# Solution
from src.data.sets import load_sets

**[2.2]** Load the saved sets from `data/processed`

In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Load the saved sets from data/processed

In [None]:
#Solution:
X_train, y_train, X_val, y_val, X_test, y_test = load_sets(path='../data/processed/')
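If you no longer have `load_sets` from the earlier lab, the sketch below shows one way such a helper could work. It is a hypothetical stand-in, not the lab's actual implementation: the name `load_sets_sketch` and the assumption that each set was saved as `<name>.npy` are illustrative only.

```python
import os
import tempfile

import numpy as np


def load_sets_sketch(path="../data/processed/"):
    """Hypothetical stand-in for load_sets; assumes each set was saved as <name>.npy."""
    names = ["X_train", "y_train", "X_val", "y_val", "X_test", "y_test"]
    sets = []
    for name in names:
        file_path = os.path.join(path, f"{name}.npy")
        # Use None for a set that was never saved instead of raising an error
        sets.append(np.load(file_path) if os.path.isfile(file_path) else None)
    return tuple(sets)


# Round-trip demo in a temporary folder with tiny made-up arrays
with tempfile.TemporaryDirectory() as tmp:
    for name in ["X_train", "X_val", "X_test"]:
        np.save(os.path.join(tmp, f"{name}.npy"), np.zeros((4, 2)))
    for name in ["y_train", "y_val", "y_test"]:
        np.save(os.path.join(tmp, f"{name}.npy"), np.zeros(4))
    X_train, y_train, X_val, y_val, X_test, y_test = load_sets_sketch(path=tmp)

print(X_train.shape, y_train.shape)
```

Checking that each `X` and `y` pair has the same number of rows right after loading is a cheap way to catch a mismatched save before training.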

### 3. Train XGBoost model

**[3.1]** Import the xgboost package as xgb


In [None]:
# Placeholder for student's code (1 line of Python code)
# Task: Import the xgboost package as xgb

In [None]:
# Solution:
import xgboost as xgb

**[3.2]** Instantiate the XGBClassifier class into a variable called `xgboost1`

In [None]:
# Placeholder for student's code (1 line of code)
# Task: instantiate the XGBClassifier class into a variable called xgboost1

In [None]:
# Solution
xgboost1 = xgb.XGBClassifier()

**[3.3]** Fit the model with the prepared data

In [None]:
# Placeholder for student's code (1 line of code)
# Task: Fit the model with the prepared data

In [None]:
# Solution
xgboost1.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

**[3.4]** Import `dump` from `joblib` and save the fitted model into the folder `models` as a file called `xgboost_default`

In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Import dump from joblib and save the fitted model into the folder models as a file called xgboost_default

In [None]:
# Solution:
from joblib import dump 

dump(xgboost1,  '../models/xgboost_default.joblib')

['../models/xgboost_default.joblib']
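The saved file can be read back later with `joblib.load`. The sketch below round-trips a made-up dictionary through a temporary folder instead of the real model, so the object and the path are illustrative assumptions; the same two calls apply to `xgboost1`.

```python
import os
import tempfile

from joblib import dump, load

# Made-up stand-in for a fitted model; any picklable object behaves the same way
fake_model = {"name": "xgboost_default", "n_estimators": 100}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "xgboost_default.joblib")
    saved_files = dump(fake_model, model_path)  # returns the list of files written
    restored_model = load(model_path)

print(restored_model == fake_model)
```

The list returned by `dump` is what the notebook prints as the cell output above.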

**[3.5]** Save the predictions from this model for the training and validation sets into 2 variables called `y_train_preds` and `y_val_preds`


In [None]:
# Placeholder for student's code (2 lines of Python code)
# Task: Save the predictions from this model for the training and validation sets into 2 variables called y_train_preds and y_val_preds

In [None]:
# Solution:
y_train_preds = xgboost1.predict(X_train)
y_val_preds = xgboost1.predict(X_val)

**[3.6]** Import `print_class_perf` from `src/models/performance` and display the accuracy and F1 scores of this baseline model on the training and validation sets

In [None]:
# Placeholder for student's code (3 lines of Python code)
# Task: Import print_class_perf from src/models/performance and display the accuracy and F1 scores of this baseline model on the training and validation sets

In [None]:
# Solution
from src.models.performance import print_class_perf

print_class_perf(y_preds=y_train_preds, y_actuals=y_train, set_name='Training', average='weighted')
print_class_perf(y_preds=y_val_preds, y_actuals=y_val, set_name='Validation', average='weighted')

Accuracy Training: 0.9241850204839105
F1 Training: 0.9239048901039122
Accuracy Validation: 0.9066156852034707
F1 Validation: 0.9061059642094202
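To make `average='weighted'` concrete, here is a from-scratch sketch of accuracy and support-weighted F1 on a tiny made-up example. It assumes `print_class_perf` computes the usual definitions (for instance by forwarding `average` to scikit-learn's `f1_score`); the function names and labels below are illustrative and not part of the lab's code.

```python
from collections import Counter


def accuracy(y_actuals, y_preds):
    """Share of predictions that match the actual label."""
    return sum(a == p for a, p in zip(y_actuals, y_preds)) / len(y_actuals)


def weighted_f1(y_actuals, y_preds):
    """Per-class F1 scores averaged with weights equal to each class's support."""
    support = Counter(y_actuals)
    total = len(y_actuals)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for a, p in zip(y_actuals, y_preds) if a == label and p == label)
        n_pred = sum(1 for p in y_preds if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Made-up labels for three classes
y_actuals_demo = [0, 0, 1, 1, 2, 2]
y_preds_demo = [0, 0, 1, 2, 2, 2]

print("Accuracy:", accuracy(y_actuals_demo, y_preds_demo))
print("F1 (weighted):", weighted_f1(y_actuals_demo, y_preds_demo))
```

Because each class's F1 is weighted by how often the class occurs, frequent classes dominate the score, which is why weighted F1 tends to track accuracy closely, as in the output above.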


### 4. Hyperparameter tuning with Hyperopt

**[4.1]** Import Trials, STATUS_OK, tpe, hp, fmin from hyperopt package

In [None]:
# Placeholder for student's code (1 line of python code)
# Task: Import Trials, STATUS_OK, tpe, hp, fmin from hyperopt package

In [None]:
# Solution:
from hyperopt import Trials, STATUS_OK, tpe, hp, fmin

**[4.2]** Define the search space for xgboost hyperparameters

In [None]:
space = {
    'max_depth' : hp.choice('max_depth', range(5, 20, 1)),
    'learning_rate' : hp.quniform('learning_rate', 0.01, 0.5, 0.05),
    'min_child_weight' : hp.quniform('min_child_weight', 1, 10, 1),
    'subsample' : hp.quniform('subsample', 0.1, 1, 0.05),
    'colsample_bytree' : hp.quniform('colsample_bytree', 0.1, 1.0, 0.05)
}
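Hyperopt documents `hp.quniform(label, low, high, q)` as returning `round(uniform(low, high) / q) * q`. The sketch below reproduces that formula with the standard library to show which values the search can actually reach; `quantize` and `quniform_sketch` are illustrative names, not Hyperopt functions.

```python
import random


def quantize(x, q):
    """Round x to the nearest multiple of q, as in Hyperopt's quniform formula."""
    return round(x / q) * q


def quniform_sketch(rng, low, high, q):
    """Draw uniformly from [low, high], then snap the draw to a multiple of q."""
    return quantize(rng.uniform(low, high), q)


rng = random.Random(8)
samples = [quniform_sketch(rng, 0.01, 0.5, 0.05) for _ in range(1000)]
distinct_values = sorted({round(s, 2) for s in samples})
print(distinct_values)

# A raw draw below q / 2 snaps to zero
print(quantize(0.02, 0.05))
```

One consequence worth knowing: with `low=0.01` and `q=0.05`, a small share of `learning_rate` draws round down to `0.0`. Raising `low` to `0.05` would be one way to avoid that, if a zero learning rate is not wanted.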

**[4.3]** Define a function called `objective` with the following logic:
- input parameters: hyperparameter search space (`space`)
- logic: train an XGBoost model with hyperparameters drawn from the search space and calculate the average accuracy score for cross validation with 10 folds
- output parameters: dictionary with the loss score and STATUS_OK

In [None]:
# Placeholder for student's code (multiple lines of python code)
# Task: Define a function called objective

In [None]:
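# Solution:
def objective(space):
    # Imported here so the solution cell is self-contained
    from sklearn.model_selection import cross_val_score
    
    # Build a classifier from the hyperparameters sampled by Hyperopt
    xgboost = xgb.XGBClassifier(
        max_depth = int(space['max_depth']),
        learning_rate = space['learning_rate'],
        min_child_weight = space['min_child_weight'],
        subsample = space['subsample'],
        colsample_bytree = space['colsample_bytree']
    )
    
    # Mean accuracy over 10 cross-validation folds of the training set
    acc = cross_val_score(xgboost, X_train, y_train, cv=10, scoring="accuracy").mean()

    # Hyperopt minimises the loss, so return 1 - accuracy
    return {'loss': 1 - acc, 'status': STATUS_OK}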

**[4.4]** Launch Hyperopt search and save the result in a variable called `best`

In [None]:
best = fmin(
    fn=objective,   
    space=space,       
    algo=tpe.suggest,       
    max_evals=5
)

100%|██████████| 5/5 [21:19<00:00, 255.92s/trial, best loss: 0.09989847238005789]
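Conceptually, `fmin` repeats a simple loop: propose hyperparameters, call the objective, and remember the lowest loss seen. The sketch below illustrates that loop with plain random search instead of TPE, and with a toy objective standing in for the cross-validated model. Every name in it (`toy_objective`, `sample_params`, `random_search`) is hypothetical.

```python
import random

STATUS_OK = "ok"  # same string value that hyperopt.STATUS_OK holds


def toy_objective(params):
    """Hypothetical stand-in for 1 - mean CV accuracy; lowest at learning_rate = 0.2."""
    loss = (params["learning_rate"] - 0.2) ** 2
    return {"loss": loss, "status": STATUS_OK}


def sample_params(rng):
    """Draw one candidate from a one-dimensional search space."""
    return {"learning_rate": rng.uniform(0.01, 0.5)}


def random_search(objective, sampler, max_evals, seed=8):
    """Evaluate max_evals random candidates and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(max_evals):
        params = sampler(rng)
        result = objective(params)
        if result["status"] == STATUS_OK and result["loss"] < best_loss:
            best_params, best_loss = params, result["loss"]
    return best_params, best_loss


best_params_demo, best_loss_demo = random_search(toy_objective, sample_params, max_evals=50)
print(best_params_demo, best_loss_demo)
```

TPE differs from this sketch in how it proposes candidates: it uses the losses of earlier trials to focus on promising regions, which only pays off once enough trials have been run. With `max_evals=5`, the search above has very little history to learn from.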


**[4.5]** Print the best set of hyperparameters

In [None]:
# Placeholder for student's code (1 line of python code)
# Task: Print the best set of hyperparameters

In [None]:
# Solution:
print("Best: ", best)

Best:  {'colsample_bytree': 0.35000000000000003, 'learning_rate': 0.15000000000000002, 'max_depth': 3, 'min_child_weight': 5.0, 'subsample': 0.9500000000000001}
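Note that `max_depth` is reported as 3 even though the search space only offers depths from 5 to 19. For parameters defined with `hp.choice`, `fmin` returns the index of the selected option, not the option itself. The sketch below resolves the index by hand using the values printed above; in a notebook, `hyperopt.space_eval(space, best)` performs the same conversion for the whole dictionary. The variable names are illustrative.

```python
# Options copied from the search space definition: range(5, 20, 1)
max_depth_options = list(range(5, 20, 1))

# Values copied from the printed `best` dictionary
best_from_fmin = {
    "colsample_bytree": 0.35000000000000003,
    "learning_rate": 0.15000000000000002,
    "max_depth": 3,  # an index into max_depth_options, not a tree depth
    "min_child_weight": 5.0,
    "subsample": 0.9500000000000001,
}

# Replace the index with the option it points to
resolved = dict(best_from_fmin)
resolved["max_depth"] = max_depth_options[best_from_fmin["max_depth"]]
print(resolved["max_depth"])
```

Parameters defined with `hp.quniform` are returned as actual values, so only `max_depth` needs this conversion here.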


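**[4.6]** Instantiate an XGBClassifier with the best set of hyperparameters

In [None]:
# Placeholder for student's code (multiple lines of python code)
# Task: Instantiate an XGBClassifier with the best set of hyperparameters

In [None]:
# Solution:
from hyperopt import space_eval

# For hp.choice parameters, fmin returns the index of the selected option,
# so convert `best` back to actual hyperparameter values before using it.
# Note: the outputs recorded below were generated with the raw index (max_depth=3).
best_params = space_eval(space, best)

xgboost2 = xgb.XGBClassifier(
    max_depth = int(best_params['max_depth']),
    learning_rate = best_params['learning_rate'],
    min_child_weight = best_params['min_child_weight'],
    subsample = best_params['subsample'],
    colsample_bytree = best_params['colsample_bytree']
)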

**[4.7]** Fit the model with the prepared data

In [None]:
# Placeholder for student's code (1 line of python code)
# Task: Fit the model with the prepared data

In [None]:
# Solution:
xgboost2.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.35000000000000003, gamma=0,
              gpu_id=-1, importance_type='gain', interaction_constraints='',
              learning_rate=0.15000000000000002, max_delta_step=0, max_depth=3,
              min_child_weight=5.0, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=0.9500000000000001,
              tree_method='exact', validate_parameters=1, verbosity=None)

**[4.8]** Display the accuracy and F1 scores of this tuned model on the training and validation sets

In [None]:
# Placeholder for student's code (2 lines of python code)
# Task: Display the accuracy and F1 scores of this tuned model on the training and validation sets

In [None]:
# Solution:
print_class_perf(y_preds=xgboost2.predict(X_train), y_actuals=y_train, set_name='Training', average='weighted')
print_class_perf(y_preds=xgboost2.predict(X_val), y_actuals=y_val, set_name='Validation', average='weighted')

Accuracy Training: 0.8756679155432613
F1 Training: 0.8741427824032268
Accuracy Validation: 0.8759217630622492
F1 Validation: 0.8742533580742383
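A quick way to compare the two models is to put their recorded scores side by side. The sketch below uses only the accuracy figures printed in this lab; the dictionary layout and variable names are illustrative.

```python
# Accuracy scores copied from the outputs recorded in sections 3 and 4
results = {
    "xgboost_default": {"train_acc": 0.9241850204839105, "val_acc": 0.9066156852034707},
    "xgboost_best": {"train_acc": 0.8756679155432613, "val_acc": 0.8759217630622492},
}

for name, scores in results.items():
    # A large positive gap between training and validation hints at overfitting
    gap = scores["train_acc"] - scores["val_acc"]
    print(f"{name}: validation={scores['val_acc']:.4f}, train-validation gap={gap:+.4f}")

# Pick the model with the highest validation accuracy
best_model_name = max(results, key=lambda n: results[n]["val_acc"])
print("Highest validation accuracy:", best_model_name)
```

On these recorded numbers the default model scores higher on validation, while the tuned model has almost no gap between training and validation. A search of only 5 evaluations is unlikely to beat sensible defaults, so increasing `max_evals` is a natural next experiment.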


**[4.9]** Save the fitted model into the folder models as a file called `xgboost_best`

In [None]:
# Placeholder for student's code (1 line of python code)
# Task: Save the fitted model into the folder models as a file called xgboost_best

In [None]:
# Solution:
dump(xgboost2,  '../models/xgboost_best.joblib')

['../models/xgboost_best.joblib']

### 5. Push changes

**[5.1]** Add your changes to git staging area

In [None]:
# Placeholder for student's code (1 command line)
# Task: Add your changes to git staging area

In [None]:
# Solution:
git add .

**[5.2]** Create the snapshot of your repository and add a description

In [None]:
# Placeholder for student's code (1 command line)
# Task: Create the snapshot of your repository and add a description

In [None]:
# Solution:
git commit -m "xgboost hyperopt"

**[5.3]** Push your snapshot to GitHub

In [None]:
# Placeholder for student's code (1 command line)
# Task: Push your snapshot to GitHub

In [None]:
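# Solution:
# The first push of a new branch needs an upstream (assumes the remote is named origin)
git push --set-upstream origin xgboost_hyperopt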

**[5.4]** Check out the master branch

In [None]:
# Placeholder for student's code (1 command line)
# Task: Check out the master branch

In [None]:
# Solution:
git checkout master

**[5.5]** Pull the latest updates

In [None]:
# Placeholder for student's code (1 command line)
# Task: Pull the latest updates

In [None]:
# Solution:
git pull

**[5.6]** Check out the `xgboost_hyperopt` branch


In [None]:
# Placeholder for student's code (1 command line)
# Task: Check out the xgboost_hyperopt branch

In [None]:
# Solution:
git checkout xgboost_hyperopt

**[5.7]** Merge the `master` branch and push your changes


In [None]:
# Placeholder for student's code (2 command lines)
# Task: Merge the master branch and push your changes

In [None]:
# Solution:
git merge master
git push

Documentation: https://www.atlassian.com/git/tutorials/using-branches/git-merge

**[5.8]** Go to GitHub and merge the branch after reviewing the code and fixing any conflicts




**[5.9]** Stop the Docker container

In [None]:
docker stop adv_dsi_lab_3

Documentation: https://docs.docker.com/engine/reference/commandline/stop/