## Intermediate Machine Learning - Kaggle

https://www.kaggle.com/learn/intermediate-machine-learning

### XGBoost / Gradient Boosting

https://www.kaggle.com/code/alexisbcook/xgboost/tutorial

#### Introduction

Previous predictions were made with the random forest method, which achieves better performance than a single decision tree simply by averaging the predictions of many decision trees.

The random forest method is referred to as an "ensemble" method. **Ensemble methods** combine the predictions of several models (e.g., several trees, in the case of random forests). Gradient boosting is another ensemble method.

#### Gradient Boosting

**Gradient boosting** is a method that goes through cycles to iteratively add models into an ensemble. 

Begin by initializing the ensemble with a single model. Then, start the cycle:

- First, use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, we add the predictions from all models in the ensemble.
- These predictions are used to calculate a loss function (like mean squared error).
- Then, use the loss function to fit a new model that will be added to the ensemble. Specifically, we determine model parameters so that adding this new model to the ensemble will reduce the loss.
    - The *"gradient"* in *"gradient boosting"* refers to the fact that we'll use gradient descent on the loss function to determine the parameters in this new model.
- Finally, we add the new model to the ensemble.
- Repeat.
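
The cycle above can be sketched in plain Python. This is a minimal illustration, not how XGBoost is implemented: it assumes a squared-error loss, a single feature, and one-split "stumps" as the models, and every name in it (`fit_stump`, `gradient_boost`, and so on) is made up for the example. For squared error, the residuals are, up to a constant factor, the negative gradient of the loss with respect to the current predictions, so fitting the new model to the residuals is one way to take a gradient step.

```python
# Toy gradient boosting for squared error with one-split "stumps" as models.
# All names are illustrative; none of this is XGBoost or scikit-learn API.

def mean(values):
    return sum(values) / len(values)

def mse(y_true, y_pred):
    return mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])

def fit_stump(x, targets):
    """Return the single-threshold model on x that best fits targets (least squares)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so the right side is never empty
        left = [t for xi, t in zip(x, targets) if xi <= threshold]
        right = [t for xi, t in zip(x, targets) if xi > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((t - left_value) ** 2 for t in left)
                 + sum((t - right_value) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    _, threshold, left_value, right_value = best
    return lambda xi: left_value if xi <= threshold else right_value

def ensemble_predict(ensemble, xi):
    # A prediction is the sum of the predictions of every model in the ensemble.
    return sum(model(xi) for model in ensemble)

def gradient_boost(x, y, n_rounds):
    # Initialize the ensemble with a single naive model: the mean of the target.
    base = mean(y)
    ensemble = [lambda xi: base]
    losses = []
    for _ in range(n_rounds):
        # 1. Use the current ensemble to predict every observation.
        predictions = [ensemble_predict(ensemble, xi) for xi in x]
        # 2. Calculate the loss for those predictions.
        losses.append(mse(y, predictions))
        # 3. Fit a new model to the residuals (the negative gradient direction).
        residuals = [t - p for t, p in zip(y, predictions)]
        new_model = fit_stump(x, residuals)
        # 4. Add the new model to the ensemble, then repeat.
        ensemble.append(new_model)
    losses.append(mse(y, [ensemble_predict(ensemble, xi) for xi in x]))
    return ensemble, losses

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 6.8, 8.1]
ensemble, losses = gradient_boost(x, y, n_rounds=10)
print(losses)
```

Printing `losses` shows the training loss shrinking as models are added to the ensemble.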

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data 
data = pd.read_csv('./melbourne-housing-snapshot/melb_data.csv')

# Select subset of predictors
cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]

# Select target
y = data.Price

# Separate data into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y)

**XGBoost** stands for **extreme gradient boosting**. It is an implementation of gradient boosting with additional features focused on performance and speed. *Scikit-learn has another version of gradient boosting, but XGBoost has some technical advantages.*

We import the scikit-learn API for XGBoost (`xgboost.XGBRegressor`), which allows us to build and fit a model. The `XGBRegressor` class has many tunable parameters.

In [2]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

In [3]:
from sklearn.metrics import mean_absolute_error

predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(y_valid, predictions)))

Mean Absolute Error: 233209.6351999724


### Parameter Tuning

XGBoost has several parameters that can affect accuracy and training speed. The first parameters to understand are:

#### `n_estimators`

`n_estimators` specifies how many times to go through the modeling cycle described above. It is equal to the number of models that we include in the ensemble.

- Too *low* a value causes *underfitting*, leading to inaccurate predictions on both training data and test data.
- Too *high* a value causes *overfitting*, leading to accurate predictions on training data but inaccurate predictions on test data.

Typical values range from 100 to 1000, though this depends a lot on the `learning_rate` parameter discussed below.

In [4]:
# Code to set the number of models in the ensemble

my_model = XGBRegressor(n_estimators = 500)
my_model.fit(X_train, y_train)

#### `early_stopping_rounds`

`early_stopping_rounds` offers a way to automatically find the ideal value for `n_estimators`. Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren't at the hard stop for `n_estimators`. It's smart to set a high value for `n_estimators` and then use `early_stopping_rounds` to find the optimal time to stop iterating.

Since random chance can cause a single round in which the validation score doesn't improve, you need to specify how many consecutive rounds of deterioration to allow before stopping. `early_stopping_rounds=5` is a reasonable choice: training stops after 5 straight rounds without improvement.

Some data must also be set aside to calculate the validation scores. This is done using the `eval_set` parameter.
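
The stopping rule itself can be sketched without any library. The function below is a hypothetical helper, not XGBoost code: it assumes a list of per-round validation losses (lower is better) and stops once the loss has failed to improve for `early_stopping_rounds` consecutive rounds.

```python
def best_round_with_early_stopping(validation_scores, early_stopping_rounds):
    """Return (best_round, rounds_run) for a sequence of validation losses."""
    best_score = float("inf")
    best_round = -1
    for round_index, score in enumerate(validation_scores):
        if score < best_score:
            # New best validation loss: remember where it happened.
            best_score = score
            best_round = round_index
        elif round_index - best_round >= early_stopping_rounds:
            # No improvement for `early_stopping_rounds` rounds in a row: stop.
            return best_round, round_index + 1
    return best_round, len(validation_scores)

# Made-up validation losses; the late drop to 6.0 is never reached.
scores = [10.0, 8.0, 7.0, 7.2, 6.9, 7.1, 7.0, 7.3, 7.4, 7.5, 6.0]
best_round, rounds_run = best_round_with_early_stopping(scores, early_stopping_rounds=5)
print(best_round, rounds_run)
```

With these made-up scores the best loss occurs at round index 4, and training stops five rounds later, before the final value is ever evaluated. That is the trade-off of a small patience value.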

In [5]:
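# xgboost >= 1.6 takes early_stopping_rounds in the constructor;
# passing it to fit() is deprecated and rejected by recent releases.
my_model = XGBRegressor(n_estimators=500, early_stopping_rounds=5)
my_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)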

If you want to fit a model with all of the data, set `n_estimators` to whatever value you found to be optimal when run with early stopping.

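#### `learning_rate`

Instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number (the **learning rate**) before adding them in.

This means each tree we add to the ensemble helps us less, so we can set a higher value for `n_estimators` without overfitting. If early stopping is used, the appropriate number of trees will be determined automatically.

In general, a small learning rate and a large number of estimators yield more accurate XGBoost models, but the model takes longer to train because it goes through more iterations of the cycle. Older releases of the scikit-learn wrapper defaulted to `learning_rate=0.1`; recent releases leave it unset and fall back to XGBoost's native default of 0.3, so it is safest to set the value explicitly.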
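
To see why a smaller learning rate needs more rounds, consider an idealized toy in which every new model predicts the current residual exactly and its contribution is scaled by the learning rate. The function name and the numbers are assumptions chosen for illustration. The sketch shows only the speed side of the trade-off, not the effect on overfitting.

```python
def rounds_to_converge(learning_rate, target=100.0, tolerance=1.0, max_rounds=10_000):
    """Count boosting rounds until the prediction is within `tolerance` of `target`."""
    prediction = 0.0
    for round_number in range(1, max_rounds + 1):
        residual = target - prediction
        # Idealized new model: it predicts the residual perfectly,
        # but only a `learning_rate` fraction of it is added to the ensemble.
        prediction += learning_rate * residual
        if abs(target - prediction) <= tolerance:
            return round_number
    return max_rounds

fast = rounds_to_converge(learning_rate=0.5)
slow = rounds_to_converge(learning_rate=0.05)
print(fast, slow)
```

Each round shrinks the remaining error by a factor of `1 - learning_rate`, so the run with the smaller rate takes many more rounds to reach the same tolerance.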

In [6]:
# early_stopping_rounds is set in the constructor (xgboost >= 1.6).
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05,
                        early_stopping_rounds=5)
my_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)

#### `n_jobs`

On larger datasets, parallelism can be used to build models faster. It's common to set the parameter `n_jobs` equal to the number of cores on your machine. On smaller datasets, this won't help. 
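
One way to pick the value instead of hard-coding it is to ask the standard library for the core count. This is a small sketch; note that `os.cpu_count()` reports logical cores and may return `None` on some platforms, hence the fallback.

```python
import os

# os.cpu_count() returns None when the count cannot be determined,
# so fall back to a single job in that case.
n_jobs = os.cpu_count() or 1
print(f"Detected {n_jobs} logical cores")

# The value could then be passed on, e.g. XGBRegressor(..., n_jobs=n_jobs).
```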

In [8]:
# early_stopping_rounds is set in the constructor (xgboost >= 1.6).
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4,
                        early_stopping_rounds=5)
my_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)

### Conclusion

XGBoost is a leading software library for working with standard tabular data (i.e., data stored in pandas DataFrames). Careful parameter tuning can lead to highly accurate models.