# Ensemble Models - Lab 1

Example notebook for exploring ensemble models and comparing the performance of the following algorithms:
* Tree model
* Bagging / Random Forest
* Boosted trees

<br>

Documentation:
* **Decision Tree** (regressor) in [Scikit Learn](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor)
* **Random Forest** (regressor) in [Scikit Learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html#sklearn.ensemble.RandomForestRegressor)
* **Gradient Boosting** (regressor) in [Scikit Learn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html#sklearn.ensemble.GradientBoostingRegressor)
* **XGBoost** in [xgboost](https://xgboost.readthedocs.io/en/stable/python/python_api.html#module-xgboost.sklearn)

<br>
Ricardo Almeida

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [None]:
RANDOM_SEED = 7657

TEST_SIZE = 0.20

### Loading dataset

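California housing dataset (20,640 samples, 8 numeric features). The target is the median house value per block group, in units of $100,000.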

In [None]:
housing = fetch_california_housing()

In [None]:
print(housing.DESCR)

In [None]:
housing.feature_names

In [None]:
housing.target_names

In [None]:
housing.target

In [None]:
X_train, X_dev, y_train, y_dev = train_test_split(
    housing.data, housing.target, random_state=RANDOM_SEED, test_size=TEST_SIZE)

In [None]:
X_train.shape

In [None]:
X_dev.shape

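### Decision Tree

Fitting a Decision Tree (regressor) model and checking the R2 score on the train and dev datasets. With default settings the tree is grown without a depth limit, so the train score is typically close to 1 and mostly reflects overfitting; the dev score is the one to compare across models.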

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
tree = DecisionTreeRegressor(random_state=RANDOM_SEED)
tree.fit(X_train, y_train)

In [None]:
model = tree

In [None]:
# Predict on the train and development datasets
y_train_pred = model.predict(X_train)
y_dev_pred = model.predict(X_dev)

In [None]:
# R2 score on train and development datasets
r2_train = r2_score(y_train, y_train_pred)
r2_dev = r2_score(y_dev, y_dev_pred)

In [None]:
print("R2 score on train dataset: {:.3f}".format(r2_train))
print("R2 score on dev   dataset: {:.3f}".format(r2_dev))
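
For reference, the R2 score printed above is the coefficient of determination, `1 - SS_res / SS_tot`. The block below is a minimal pure-Python sketch of that formula on made-up numbers. The helper name `r2_manual` and the toy values are illustrative assumptions, not part of scikit-learn, and the sketch does not handle a constant target (where `SS_tot` is zero).

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values, for illustration only
y_true = [3.0, 1.5, 2.0, 4.5]
perfect = list(y_true)
mean_only = [sum(y_true) / len(y_true)] * len(y_true)

print(r2_manual(y_true, perfect))    # perfect predictions -> 1.0
print(r2_manual(y_true, mean_only))  # always predicting the mean -> 0.0
```

A score of 1 means perfect predictions, 0 means no better than predicting the mean of the targets, and negative values mean worse than that baseline.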

### Random Forest without Bagging

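Fitting a Random Forest (regressor) model with `bootstrap=False`, so every tree is trained on the full training set, and checking the R2 score on the train and dev datasets. Because the regressor's default `max_features` considers all features at each split, the trees end up nearly identical, and the forest can be expected to behave much like a single decision tree.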

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor(n_estimators=200, random_state=RANDOM_SEED, bootstrap=False)
rf.fit(X_train, y_train)

In [None]:
model = rf

In [None]:
# Predict on the train and development datasets
y_train_pred = model.predict(X_train)
y_dev_pred = model.predict(X_dev)

In [None]:
# R2 score on train and development datasets
r2_train = r2_score(y_train, y_train_pred)
r2_dev = r2_score(y_dev, y_dev_pred)

In [None]:
print("R2 score on train dataset: {:.3f}".format(r2_train))
print("R2 score on dev   dataset: {:.3f}".format(r2_dev))

### Random Forest (with Bagging)

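Fitting a Random Forest (regressor) model with `bootstrap=True` (the default), so each tree is trained on a bootstrap sample of the training set, and checking the R2 score on the train and dev datasets.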
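
To make the bagging step concrete, here is a small pure-Python sketch of drawing one bootstrap sample: row indices are drawn uniformly with replacement, as many as there are rows. This is not scikit-learn's implementation. The helper `bootstrap_indices` and the sample size are assumptions chosen for illustration.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly at random, with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(7657)  # same value as RANDOM_SEED, for illustration only
n_samples = 10_000         # hypothetical number of training rows
indices = bootstrap_indices(n_samples, rng)

# Some rows are drawn several times, others not at all
unique_fraction = len(set(indices)) / n_samples
print(f"Distinct rows in one bootstrap sample: {unique_fraction:.3f}")
```

On average a bootstrap sample contains about 63% of the distinct rows (1 - 1/e), so each tree sees a slightly different dataset. Averaging over trees that differ in this way is what reduces the variance of a single tree.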

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor(n_estimators=200, random_state=RANDOM_SEED, bootstrap=True)
rf.fit(X_train, y_train)

In [None]:
model = rf

In [None]:
# Predict on the train and development datasets
y_train_pred = model.predict(X_train)
y_dev_pred = model.predict(X_dev)

In [None]:
# R2 score on train and development datasets
r2_train = r2_score(y_train, y_train_pred)
r2_dev = r2_score(y_dev, y_dev_pred)

In [None]:
print("R2 score on train dataset: {:.3f}".format(r2_train))
print("R2 score on dev   dataset: {:.3f}".format(r2_dev))

### Gradient Boosting

Fitting a Gradient Boosting (regressor) model and checking the R2 score on the train and dev datasets.
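
As a rough illustration of the idea behind boosting with a squared-error loss, each new tree is fit to the residuals of the current ensemble and added with a small learning rate. The sketch below is pure Python and uses depth-1 trees (stumps) on a single toy feature. It is a simplification under those assumptions, not scikit-learn's algorithm, and the names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature, by squared error."""
    best = None
    # Skip the largest value so the right-hand side is never empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Return the training MSE after each boosting stage."""
    base = sum(ys) / len(ys)  # start from the mean of the targets
    preds = [base] * len(ys)
    mse_per_stage = []
    for _ in range(n_estimators):
        # Residuals of the current ensemble are the next stump's targets
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        # Add the new stump's prediction, shrunk by the learning rate
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        mse_per_stage.append(
            sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return mse_per_stage


# Toy 1-D data, for illustration only
xs = [i / 10 for i in range(40)]
ys = [(x - 2.0) ** 2 for x in xs]
mse = boost(xs, ys)
print(f"train MSE after 1 stage: {mse[0]:.4f}, after 50 stages: {mse[-1]:.4f}")
```

The training error shrinks stage by stage, which is why boosted models can overfit if the number of estimators is too large and why the dev score matters more than the train score.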

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
gb = GradientBoostingRegressor(n_estimators=200, random_state=RANDOM_SEED)
gb.fit(X_train, y_train)

In [None]:
model = gb

In [None]:
# Predict on the train and development datasets
y_train_pred = model.predict(X_train)
y_dev_pred = model.predict(X_dev)

In [None]:
# R2 score on train and development datasets
r2_train = r2_score(y_train, y_train_pred)
r2_dev = r2_score(y_dev, y_dev_pred)

In [None]:
print("R2 score on train dataset: {:.3f}".format(r2_train))
print("R2 score on dev   dataset: {:.3f}".format(r2_dev))