<a href="https://colab.research.google.com/github/natalia7244/Machine-Learning-Exercises/blob/main/XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction

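XGBoost is a library that implements gradient boosting, a powerful ensemble method (one that combines many models).

An ensemble usually predicts better than a single decision tree because it combines the predictions of many trees.

Random Forest = many trees trained independently on random subsets of the data, with their predictions averaged.

Gradient Boosting = trees built one after another, each one trained to correct the mistakes of the trees before it.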

# Gradient Boosting

Gradient Boosting is like building a model step by step, one small piece at a time. Here's how it works:

    Start with a simple model. This model might not be very accurate. It makes lots of mistakes.

    Check how wrong the model is. We compare its predictions with the correct answers, using a loss function (for example, mean squared error) to measure the errors.

    Add a new model that fixes the mistakes. This new model is trained to predict the errors that the current team still makes.

    Add it to the team. The team (ensemble) now has 2 models: the original + the fixer.

    Repeat. Each time, we add another model to help fix what the team still gets wrong.
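
The steps above can be sketched in plain Python. This is a toy illustration rather than how XGBoost is implemented: it assumes squared-error loss (so the "mistakes" are simply the residuals), uses one-split trees ("stumps") as the small models, and runs on made-up data. The names `fit_stump` and `gradient_boost` are hypothetical helpers written for this example.

```python
# Toy 1-D regression data (made up for illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 6.1, 5.8]


def fit_stump(xs, targets):
    """Fit a one-split regression tree by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mean_squared_error(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.5):
    # Start with a simple model: predict the mean everywhere.
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    history = [mean_squared_error(ys, preds)]
    for _ in range(n_rounds):
        # Check how wrong the ensemble is. For squared-error loss,
        # the negative gradient is just the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        # Train a new model on those errors.
        stump = fit_stump(xs, residuals)
        # Add it to the team, scaled by the learning rate.
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        history.append(mean_squared_error(ys, preds))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


predict, history = gradient_boost(xs, ys)
print(f"MSE before boosting: {history[0]:.3f}")
print(f"MSE after {len(history) - 1} rounds: {history[-1]:.3f}")
```

Each round lowers the training error a little, which is also why too many rounds can end up fitting noise in the training data.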

# Example

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv('/content/drive/MyDrive/Data_sets/melb_data.csv')

cols_to_use = ['Rooms', 'Distance', 'Landsize', 'BuildingArea', 'YearBuilt']
X = data[cols_to_use]
y = data.Price

X_train, X_valid, y_train, y_valid = train_test_split(X, y)

In [15]:
from xgboost import XGBRegressor

my_model = XGBRegressor()
my_model.fit(X_train, y_train)

In [16]:
from sklearn.metrics import mean_absolute_error

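# Predict on the held-out validation set and compare with the true prices.
# mean_absolute_error expects (y_true, y_pred) in that order.
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(y_valid, predictions)))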

Mean Absolute Error: 239138.3235939801


# Parameter Tuning

**Parameter: n_estimators**

    Number of boosting rounds, i.e. how many trees (models) are added to the ensemble.

    Too low → underfitting (model too weak).

    Too high → overfitting (model too focused on training data).

    Usual range: 100–1000 (but depends on learning_rate).

    Example: XGBRegressor(n_estimators=100)

In [17]:
my_model = XGBRegressor(n_estimators=500)
my_model.fit(X_train, y_train)

**Parameter: early_stopping_rounds**

    Stops training when the validation score hasn't improved for the given number of rounds in a row.

    Must set eval_set to tell it which data to use for checking.

    Good default: early_stopping_rounds=5

    Set a high n_estimators, and early stopping will find the best point to stop.

**Parameter: learning_rate**

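    Controls how much each tree changes the model: each new tree's predictions are multiplied by this number before being added.

    Smaller value = slower learning, but usually more accurate if enough trees are used.

    With a small learning rate, we can use more trees without overfitting as quickly.

    Default: learning_rate = 0.3 in current XGBoost releases (older versions of the scikit-learn wrapper used 0.1).

    Use with n_estimators and possibly early_stopping_rounds to get better results.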

**Parameter: n_jobs**

    Tells the model how many CPU cores to use.

    Speeds up training on large datasets.

    Common setting: n_jobs = -1 (use all cores).

    Doesn’t change accuracy — just saves time.