## Tuning Parameters Manually

#### Common Tree Tunable Parameters
- learning_rate (eta): learning rate (how quickly the model fits the residual error using additional base learners)
- gamma: minimum loss reduction required to create a new tree split
- lambda: L2 regularization on leaf weights
- alpha: L1 regularization on leaf weights
- max_depth: maximum depth per tree (how deep each tree can grow)
- subsample: fraction of samples used per tree (0-1: fraction of the total training set used in each boosting round)
- colsample_bytree: fraction of features used per tree (0-1: fraction of the total features used in each boosting round)
- number of estimators (base learners), i.e. the number of boosting rounds
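
As a rough orientation, the tree parameters listed above can be written as a parameter dictionary for the native xgboost API. This is only a sketch: the values are the library's documented defaults, used here as placeholders rather than recommendations, and the variable names `tree_params` and `num_boost_round` are chosen for illustration.

```python
# Keys for the tree-booster parameters listed above (native xgboost API).
# The values are the documented defaults, shown only as placeholders.
tree_params = {
    "booster": "gbtree",
    "eta": 0.3,              # learning rate (alias: learning_rate)
    "gamma": 0,              # min loss reduction to split (alias: min_split_loss)
    "lambda": 1,             # L2 regularization on leaf weights (alias: reg_lambda)
    "alpha": 0,              # L1 regularization on leaf weights (alias: reg_alpha)
    "max_depth": 6,          # maximum depth per tree
    "subsample": 1,          # fraction of rows sampled per boosting round
    "colsample_bytree": 1,   # fraction of features sampled per tree
}

# In the native API the number of base learners is not a dictionary entry;
# it is passed separately to xgb.train() / xgb.cv() as num_boost_round.
num_boost_round = 10
```

A dictionary like this is what gets passed as `params` in the `xgb.cv()` calls further down.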

#### Common Linear Tunable Parameters
- lambda: L2 regularization on weights
- alpha: L1 regularization on weights
- lambda_bias: L2 regularization term on the bias
- number of estimators (base learners), i.e. the number of boosting rounds

### Untuned Model

In [2]:
import pandas as pd
import xgboost as xgb
import numpy as np

In [3]:
# load Ames housing dataset
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

# build train and target dataframes
X,y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]

housing_dmatrix = xgb.DMatrix(data=X,label=y)

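# "reg:squarederror" is the current name of the deprecated "reg:linear" objective
untuned_params = {"objective": "reg:squarederror"}

# num_boost_round is not set, so xgb.cv() uses its default of 10 rounds
untuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=untuned_params, nfold=4,
                                 metrics="rmse", as_pandas=True, seed=123)

# Report the test RMSE of the final boosting round
print("Untuned rmse: %f" % untuned_cv_results_rmse["test-rmse-mean"].iloc[-1])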

Untuned rmse: 34624.229980


### Tuned Model (colsample, learning_rate, max_depth)

In [4]:
# load Ames housing dataset
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

# build train and target dataframes
X,y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]

housing_dmatrix = xgb.DMatrix(data=X,label=y)

# "reg:squarederror" is the current name of the deprecated "reg:linear" objective
tuned_params = {"objective": "reg:squarederror", "colsample_bytree": 0.3, "learning_rate": 0.1, "max_depth": 5}

tuned_cv_results_rmse = xgb.cv(dtrain=housing_dmatrix, params=tuned_params, nfold=4,
                               num_boost_round=200, metrics="rmse", as_pandas=True, seed=123)

# Report the test RMSE of the final boosting round
print("Tuned rmse: %f" % tuned_cv_results_rmse["test-rmse-mean"].iloc[-1])


Tuned rmse: 29812.683594


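Observation: The tuned RMSE of 29,812 is about 14% lower than the untuned RMSE of 34,624. The tuned run also used 200 boosting rounds, whereas the untuned run used the xgb.cv() default of 10, so the gain is not due to the three tuned parameters alone.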

### Tuning the number of boosting rounds
#### Explore the effect of the number of boosting rounds (number of trees built) on the out-of-sample performance of an XGBoost model.
Approach: cherry-pick with a for loop. Call xgb.cv() inside the loop and build one model per num_boost_round value.

In [5]:
# load Ames housing dataset
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

# build train and target dataframes
X,y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]

# Create the DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary for each tree: params 
params = {"objective": "reg:squarederror", "max_depth": 3}  # "reg:linear" is the deprecated name of this objective

# Create list of number of boosting rounds
num_rounds = [5, 10, 15]

# Empty list to store final round rmse per XGBoost model
final_rmse_per_round = []

# Iterate over num_rounds and build one model per num_boost_round parameter
for curr_num_rounds in num_rounds:

    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain = housing_dmatrix, params = params, nfold = 3, 
                        num_boost_round = curr_num_rounds, metrics="rmse", as_pandas=True, seed=123)
    
    # Append final round RMSE
    final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))

print(pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"]))


   num_boosting_rounds          rmse
0                    5  50903.299479
1                   10  34774.194010
2                   15  32895.098958


#### Observation: Increasing the number of boosting rounds (trees built) lowers the out-of-sample RMSE, but with diminishing returns

| num_boosting_rounds | rmse |
|---|---|
| 5 | 50903.299479 |
| 10 | 34774.194010 |
| 15 | 32895.098958 |
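
The non-linear pattern can be made explicit by computing how much RMSE each additional block of five rounds removes. The snippet below is a small sketch that reuses the three RMSE values reported above; the variable names are chosen for illustration.

```python
# RMSE values reported above for 5, 10 and 15 boosting rounds
rounds = [5, 10, 15]
rmse = [50903.299479, 34774.194010, 32895.098958]

# RMSE reduction gained by each step of 5 additional rounds
gains = [round(prev - curr, 2) for prev, curr in zip(rmse, rmse[1:])]

for (start, end), gain in zip(zip(rounds, rounds[1:]), gains):
    print(f"{start} -> {end} rounds: RMSE drops by {gain}")
```

The first step removes far more error than the second, which is why adding rounds indefinitely is not worthwhile.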

### Automated boosting round selection using early_stopping
Now, instead of attempting to cherry pick the best possible number of boosting rounds, rely on XGBoost to automatically select the number of boosting rounds within xgb.cv(). This is done using a technique called __early stopping__.

Early stopping works by testing the XGBoost model after every boosting round against a hold-out dataset and stopping the creation of additional boosting rounds (thereby finishing training of the model early) if the hold-out metric ("rmse" in this case) does not improve for a given number of rounds. 

The example below passes __early_stopping_rounds__=5 to xgb.cv() together with a large number of boosting rounds (num_boost_round=50). Bear in mind that if the hold-out metric keeps improving until num_boost_round is reached, early stopping does not occur.
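
The stopping rule itself can be sketched in a few lines of plain Python. This is an illustration of the idea only, not XGBoost's internal implementation; the function name `early_stopping_round` and the RMSE values are hypothetical.

```python
def early_stopping_round(scores, patience):
    """Return the index of the best round under a patience-based stopping rule.

    scores: hold-out metric per boosting round (lower is better, e.g. RMSE).
    patience: number of rounds without improvement that triggers a stop.
    """
    best_idx = 0
    for idx in range(1, len(scores)):
        if scores[idx] < scores[best_idx]:
            best_idx = idx
        elif idx - best_idx >= patience:
            break  # no improvement for `patience` rounds: stop early
    return best_idx


# Hypothetical hold-out RMSE values, not taken from the run in this notebook
holdout_rmse = [100.0, 80.0, 70.0, 65.0, 66.0, 67.0, 65.5, 68.0, 60.0]

stopped_at = early_stopping_round(holdout_rmse, patience=3)
ran_to_end = early_stopping_round(holdout_rmse, patience=6)
print(stopped_at, ran_to_end)
```

With a patience of 3 the loop stops before it ever sees the later improvement, while a patience of 6 lets training continue to the last round. The same trade-off applies when choosing early_stopping_rounds.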

In [6]:
# load Ames housing dataset
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

# build train and target dataframes
X,y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]

# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree: params
params = {"objective": "reg:squarederror", "max_depth": 4}  # "reg:linear" is the deprecated name of this objective

# Perform cross-validation with early stopping: cv_results
cv_results = xgb.cv(dtrain = housing_dmatrix, params = params, nfold = 3, early_stopping_rounds = 5, 
                    num_boost_round = 50, metrics="rmse", as_pandas=True, seed=123)

# Print cv_results
print(cv_results)

print("Early stopped rmse: %f" % cv_results["test-rmse-mean"].iloc[-1])


    test-rmse-mean  test-rmse-std  train-rmse-mean  train-rmse-std
0    142640.656250     705.559400    141871.630208      403.632626
1    104907.664063     111.113862    103057.036458       73.769561
2     79262.059895     563.766991     75975.966146      253.726099
3     61620.136719    1087.694282     57420.529948      521.658354
4     50437.562500    1846.448017     44552.955729      544.170190
5     43035.658854    2034.471024     35763.949219      681.798925
6     38600.880208    2169.796232     29861.464844      769.571318
7     36071.817708    2109.795430     25994.675781      756.521419
8     34383.184896    1934.546688     23306.836588      759.238254
9     33509.139974    1887.375633     21459.770833      745.624404
10    32916.805990    1850.893363     20148.721354      749.612769
11    32197.832682    1734.456935     19215.382813      641.387376
12    31770.852865    1802.155484     18627.389323      716.256596
13    31482.782552    1779.123767     17960.695312      557.04

### Tuning Learning Rate (ETA)
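The learning rate in XGBoost ("eta") is a parameter that ranges between 0 and 1. It scales the contribution of each new tree before it is added to the model: lower values of "eta" shrink each update more strongly, which acts as stronger regularization but requires more boosting rounds to fit the data.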
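
A toy calculation shows why a small learning rate needs many rounds. The sketch below assumes an idealized base learner that fits the current residual exactly, so each round removes a fraction eta of the remaining error; real trees fit the residual only approximately, and the function name `remaining_error_fraction` is hypothetical.

```python
def remaining_error_fraction(eta, num_rounds):
    """Fraction of the initial residual left after num_rounds, assuming each
    base learner fits the current residual exactly and is scaled by eta."""
    residual = 1.0
    for _ in range(num_rounds):
        residual -= eta * residual  # the update removes eta * residual
    return residual


for eta in [0.001, 0.01, 0.1]:
    left = remaining_error_fraction(eta, num_rounds=10)
    print(f"eta={eta}: {left:.3f} of the initial residual remains after 10 rounds")
```

Under this assumption the remaining error is (1 - eta) ** num_rounds, so with only 10 rounds a very small eta leaves almost all of the initial error in place.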

In [8]:
# load Ames housing dataset
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

# build train and target dataframes
X,y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]

# Create your housing DMatrix: housing_dmatrix
housing_dmatrix = xgb.DMatrix(data=X, label=y)

# Create the parameter dictionary for each tree (boosting round)
params = {"objective": "reg:squarederror", "max_depth": 3}  # "reg:linear" is the deprecated name of this objective

# Create list of eta values and empty list to store final round rmse per xgboost model
eta_vals = [0.001, 0.01, 0.1]

# Empty list to store best RMSE per XGBoost model
best_rmse = []

# Systematically vary the eta 
for curr_val in eta_vals:

    params["eta"] = curr_val
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain = housing_dmatrix, params = params, nfold = 3, early_stopping_rounds = 5, 
                    num_boost_round = 10, metrics="rmse", as_pandas=True, seed=123)
        
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta","best_rmse"]))


     eta      best_rmse
0  0.001  195736.406250
1  0.010  179932.182292
2  0.100   79759.411458


#### Observation: With only 10 boosting rounds, increasing the learning rate (how quickly the model fits residual error using additional base learners) from 0.001 to 0.1 lowers the out-of-sample RMSE non-linearly

| eta | rmse |
|---|---|
| 0.001 | 195736.406250 |
| 0.01 | 179932.182292 |
| 0.1 | 79759.411458 |

### Tuning max_depth
The __max_depth__ parameter dictates the maximum depth that each tree in a boosting round can grow to. Smaller values lead to shallower trees, and larger values to deeper trees.
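
To get a feel for what the depth limit allows, recall that a binary tree of depth d has at most 2 ** d leaves. The short sketch below prints that upper bound for the depths tried in this section; the helper name `max_leaves` is hypothetical, and real trees are usually much smaller because splits stop when they no longer reduce the loss.

```python
def max_leaves(depth):
    """Upper bound on the number of leaves in a binary tree of the given depth."""
    return 2 ** depth


for depth in [2, 5, 10, 20]:
    print(f"max_depth={depth}: at most {max_leaves(depth)} leaves")
```

The bound grows exponentially with depth, which is why large max_depth values give each tree enough capacity to fit noise in the training folds.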

In [9]:
# load Ames housing dataset
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

# build train and target dataframes
X,y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]

# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params = {"objective": "reg:squarederror"}  # "reg:linear" is the deprecated name of this objective

# Create list of max_depth values
max_depths = [2, 5, 10, 20]

# Empty list to store best RMSE per XGBoost model
best_rmse = []

# Systematically vary the max_depth
for curr_val in max_depths:

    params["max_depth"] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain = housing_dmatrix, params = params, nfold = 2, early_stopping_rounds = 5, 
                    num_boost_round = 10, metrics="rmse", as_pandas=True, seed=123)
    
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(max_depths, best_rmse)),columns=["max_depth","best_rmse"]))


   max_depth     best_rmse
0          2  37957.468750
1          5  35596.599610
2         10  36065.546875
3         20  36739.576172


| max_depth | best_rmse |
|---|---|
| 2 | 37957.468750 |
| 5 | 35596.599610 |
| 10 | 36065.546875 |
| 20 | 36739.576172 |

Observation: max_depth=5 gave the lowest RMSE; deeper trees (10 and 20) scored worse on the hold-out folds, which points to overfitting.

### Tuning colsample_bytree
Tune "colsample_bytree". This is analogous to the __max_features__ parameter of scikit-learn's RandomForestClassifier or RandomForestRegressor.

Both parameters restrict the features a tree may use, but they are applied at different points: sklearn's max_features limits the features considered at every split, whereas xgboost's colsample_bytree draws one random subset of features per tree and uses that subset for every split in the tree. In xgboost, colsample_bytree must be specified as a float greater than 0 and at most 1.
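
Per-tree feature subsampling can be illustrated with the standard library alone. The sketch below is not XGBoost's actual sampler: the function name `sample_columns_per_tree`, the feature names, and the rule for rounding the subset size are all assumptions made for illustration.

```python
import random


def sample_columns_per_tree(feature_names, colsample_bytree, n_trees, seed=123):
    """Draw one random feature subset per tree (illustrative only)."""
    rng = random.Random(seed)
    # Assumed rounding rule: truncate, but always keep at least one feature
    n_keep = max(1, int(colsample_bytree * len(feature_names)))
    return [sorted(rng.sample(feature_names, n_keep)) for _ in range(n_trees)]


# Hypothetical feature names standing in for the housing columns
features = [f"f{i}" for i in range(10)]
subsets = sample_columns_per_tree(features, colsample_bytree=0.5, n_trees=3)

for tree_idx, cols in enumerate(subsets):
    print(f"tree {tree_idx}: {cols}")
```

Each tree sees a different half of the features, which decorrelates the trees in much the same way that row subsampling does.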

In [10]:
# load Ames housing dataset
housing_data = pd.read_csv("https://s3.amazonaws.com/assets.datacamp.com/production/course_3786/datasets/ames_housing_trimmed_processed.csv")

# build train and target dataframes
X,y = housing_data[housing_data.columns.tolist()[:-1]], housing_data[housing_data.columns.tolist()[-1]]

# Create your housing DMatrix
housing_dmatrix = xgb.DMatrix(data=X,label=y)

# Create the parameter dictionary
params = {"objective": "reg:squarederror", "max_depth": 3}  # "reg:linear" is the deprecated name of this objective

# Create list of hyperparameter values: colsample_bytree_vals
colsample_bytree_vals = [0.1, 0.5, 0.8, 1]

# Empty list to store best RMSE per XGBoost model
best_rmse = []

# Systematically vary the hyperparameter value 
for curr_val in colsample_bytree_vals:

    params['colsample_bytree'] = curr_val
    
    # Perform cross-validation
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                 num_boost_round=10, early_stopping_rounds=5,
                 metrics="rmse", as_pandas=True, seed=123)
    
    # Append the final round rmse to best_rmse
    best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

# Print the resultant DataFrame
print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), columns=["colsample_bytree","best_rmse"]))


   colsample_bytree     best_rmse
0               0.1  44363.458985
1               0.5  36266.462890
2               0.8  35704.357422
3               1.0  35836.046875


Observation: sampling 80% of the features per tree gave the lowest RMSE, marginally ahead of using all features.

|  colsample_bytree| best_rmse |
|---|---|
| 0.1  |44363.458985| 
| 0.5 |36266.462890|  
| 0.8  |35704.357422| 
| 1.0 | 35836.046875|