## In this session
- How to manage model complexity?
- How to make the machine learn?

## Topics covered

- cross-validation
- grid search
- lasso
- random forest
- gradient boosting machines
- neural networks

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import patsy
from sklearn.model_selection import KFold, RepeatedKFold, GridSearchCV, train_test_split, cross_val_score, cross_val_predict, RandomizedSearchCV
from statsmodels.tools.eval_measures import rmse
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor
import warnings
import mglearn
from IPython import display
warnings.filterwarnings('ignore')

In [None]:
filename = 'airbnb_.csv'

In [None]:
filepath = os.path.join('datasets', filename)
filepath

In [None]:
df = pd.read_csv(filepath)

In [None]:
df.info()

In [None]:
df = df[df.price < 500]

In [None]:
df.price.describe()

In [None]:
df.price.quantile([0.01, 0.1, 0.25, 0.5, 0.75, 0.90,0.99])

In [None]:
df.price.plot(kind = 'hist', bins = 25, rwidth = 0.9, title = 'price');

In [None]:
df.ln_price.plot(kind = 'hist', bins = 25, rwidth = 0.9, title = 'log price');

### Problem statement

- Find a good fit...
- while controlling for complexity...
- without overfitting to the training data

#### Controlling for complexity

Find a good balance between good fit (low RMSE) and model complexity (many variables) &rarr; add a `penalty` to the RMSE. Our target is to **minimize**
<center>
    
    error + penalty
    
</center>

The simplest definition:

<br>
<center>
    $RMSE + \lambda \sum_j \left\lvert\beta_j\right\rvert \rightarrow \min$
</center>
<br>

where $\beta_j$ is the parameter of the $j^{th}$ explanatory variable and $\lambda$ denotes the amount of *shrinkage* in the regression equation.
<br>

We run broad regressions with many explanatory variables, and the penalty shrinks the parameters that are not important for identifying a robust pattern, some of them all the way to zero. This method is called the *Least Absolute Shrinkage and Selection Operator*, or **`lasso`**.
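
To make the shrinkage idea concrete, here is a minimal sketch on synthetic data (the data, the variable names and the three penalty values are assumptions for illustration, not part of the Airbnb analysis). Note that scikit-learn's `Lasso` penalizes a scaled sum of squared errors rather than the RMSE itself, and calls the penalty weight `alpha`; the qualitative behaviour is the same: the larger the penalty, the fewer variables survive.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (illustrative assumption): only the first two of the
# ten explanatory variables actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Fit the same model with a weak, a moderate and a very strong penalty
# and count how many coefficients are left different from zero.
nonzero_by_alpha = {}
for alpha in [0.01, 0.5, 5.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    nonzero_by_alpha[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha}: {nonzero_by_alpha[alpha]} non-zero coefficients")
```

With a moderate penalty only the two informative variables should remain, while a very strong penalty removes every variable and leaves only the intercept.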

### Questions: 
##### 1. What is the right $\lambda$ which gives us a robust fit?
##### 2. How can we take out-of-sample performance into account *while* building our models?
##### 3. How can we compare the performance of various models while controlling for overfitting?

### Answers

#### 1. Finding the proper $\lambda$: `grid search`

There are parameters which are not learnt during the estimation process (so-called *hyperparameters*). These parameters are set by trial and error, by iterating over a predefined set of values.

We usually provide a set of candidate values for $\lambda$, estimate our model with each of them, and pick the one that gives the lowest prediction error on data not used for estimation.

#### 2. Considering out-of-sample performance when building models: `cross-validation`

With cross-validation (CV) we split our training dataset into `n` pieces. We do *n* estimation steps; in each step we train the model on *(n-1)/n* of the data and check model performance (*RMSE*) on the remaining *1/n*. We rotate the subsamples so that each observation is a training data point `n-1` times and a measurement point once.

We call each iteration a `fold`, and the process is an `n-fold cross-validation`.
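
The rotation can be checked with a small sketch. It uses twelve toy observations and four folds (both numbers are arbitrary choices for illustration) and counts how often each observation lands in the training and in the validation part.

```python
import numpy as np
from sklearn.model_selection import KFold

# Twelve toy observations, identified only by their index (illustrative).
observations = np.arange(12)
n_folds = 4
kfold = KFold(n_splits=n_folds, shuffle=True, random_state=0)

times_in_train = np.zeros(len(observations), dtype=int)
times_in_validation = np.zeros(len(observations), dtype=int)

for fold, (train_idx, val_idx) in enumerate(kfold.split(observations), start=1):
    times_in_train[train_idx] += 1
    times_in_validation[val_idx] += 1
    print(f"fold {fold}: train on {len(train_idx)} obs, validate on {val_idx.tolist()}")

# Every observation is used for validation once and for training n_folds - 1 times.
print(times_in_train, times_in_validation)
```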

In [None]:
mglearn.plots.plot_cross_validation()

#### 3. Comparing model performances: `train-test split`

We split our data into a *training set* and a *test set*. 
1. We train our model, using cross-validation, on the training set.
2. Then we measure our fit on the test set.
3. We do this for all model versions, and then compare the fits (measured on the test set).
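
The three steps above can be sketched as follows. The data is synthetic and the two candidate models are arbitrary picks for illustration; the point is only the workflow: cross-validate on the training set, then score each candidate on the untouched test set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption, not the Airbnb dataset).
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(500, 3))
y = 2 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=1.0, size=500)

# Hold out 20% of the observations as the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "linear regression": LinearRegression(),
    "shallow tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

results = {}
for name, model in candidates.items():
    # Step 1: 4-fold cross-validation on the training set only.
    cv_scores = cross_val_score(
        model, X_tr, y_tr, cv=4, scoring="neg_root_mean_squared_error"
    )
    # Step 2: refit on the full training set and measure fit on the test set.
    model.fit(X_tr, y_tr)
    test_rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    results[name] = {"cv_rmse": -cv_scores.mean(), "test_rmse": test_rmse}

# Step 3: compare the candidates on their test set RMSE.
for name, scores in results.items():
    print(f"{name}: CV RMSE {scores['cv_rmse']:.2f}, test RMSE {scores['test_rmse']:.2f}")
```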

### Model & data preparation

In [None]:
basic_lev = [
    "n_accommodates",
    "n_beds",
    "n_days_since",
    "f_property_type",
    "f_room_type",
    "f_bathroom",
    "f_cancellation_policy",
    "f_bed_type",
    "f_neighbourhood_cleansed"
]
reviews = ["f_number_of_reviews", "n_review_scores_rating", "flag_review_scores_rating"]
poly_lev = ("n_accommodates2", "n_days_since2", "n_days_since3")
# p_host_response_rate is not used because of missing observations
amenities = list(df.filter(regex="^d_.*"))
X1 = ("n_accommodates:f_property_type",
    "f_room_type:f_property_type",
    "f_room_type:d_familykidfriendly",
    "d_airconditioning:f_property_type",
    "d_cats:f_property_type",
    "d_dogs:f_property_type")
X2=("f_property_type:f_neighbourhood_cleansed",
    "f_room_type:f_neighbourhood_cleansed",
    "n_accommodates:f_neighbourhood_cleansed")
X3="(f_property_type + f_room_type + f_cancellation_policy + f_bed_type) * ("+ "+".join(amenities) +")"

In [None]:
df.f_cancellation_policy.unique()

In [None]:
df.f_neighbourhood_cleansed.unique()

<br>


<br>

![](https://www.grants.londoncouncils.gov.uk/images/boroughmap.gif)

In [None]:
amenities[4:9]

- Create train and test set.

In [None]:
df_train, df_test = train_test_split(df, test_size= 0.2, random_state = 42)

In [None]:
print(df_train.shape)
print(df_test.shape)

### Baseline: linear regression (broad model)

In [None]:
vars =" ~ n_accommodates + f_neighbourhood_cleansed"

In [None]:
y_train, X_train = patsy.dmatrices('price' + vars, df_train)
y_test, X_test = patsy.dmatrices('price' + vars, df_test)

In [None]:
df_rmse = pd.DataFrame(columns = ['training set RMSE', 'test set RMSE'])

In [None]:
lin_reg = LinearRegression().fit(X_train, y_train)

In [None]:
type(lin_reg)

In [None]:
price_fitted_train_lin_reg = lin_reg.predict(X_train)
price_fitted_test_lin_reg = lin_reg.predict(X_test)

rmse_train_lin_reg = mean_squared_error(y_train, price_fitted_train_lin_reg, squared= False)
rmse_test_lin_reg = mean_squared_error(y_test, price_fitted_test_lin_reg, squared= False)

print('\nTrain RMSE: {:,.2f}.'.format(rmse_train_lin_reg))
print('Test RMSE: {:,.2f}.\n'.format(rmse_test_lin_reg))

In [None]:
df_rmse.loc['linear regression'] = [rmse_train_lin_reg, rmse_test_lin_reg]

In [None]:
lin_reg.coef_

In [None]:
print(pd.DataFrame({'variable': X_train.design_info.column_names, 'coef': lin_reg.coef_[0]}).to_string(formatters={'coef':'{:,.2f}'.format}))

#### Model performance

In [None]:
df_rmse

#### Visual representation

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (12,6))
axs[0].scatter(x = y_train, y = price_fitted_train_lin_reg, marker = '.', color = 'black')
axs[0].axline([0, 0], [1, 1], color = 'k')
axs[0].set_title('train')
axs[0].set_xlim(0,500)
axs[0].set_ylim(0,500)
axs[1].scatter(x = y_test, y = price_fitted_test_lin_reg, marker = '.', color = 'k')
axs[1].axline([0, 0], [1, 1], color = 'k')
axs[1].set_xlim(0,500)
axs[1].set_ylim(0,500)
axs[1].set_title('test')
fig.suptitle('Original vs predicted values - linear regression')

for ax in axs.flat:
    ax.set(xlabel='original', ylabel='fitted/predicted')

for ax in axs.flat:
    ax.label_outer()

### ML 'light': Lasso

- Prepare for modelling

In [None]:
vars =" ~ "+"+".join(basic_lev)+"+"+"+".join(reviews)+"+"+"+".join(poly_lev)+"+"+"+".join(X1)+"+"+"+".join(X2)+"+"+"+".join(amenities) # +"+"+X3

In [None]:
vars

In [None]:
y_train, X_train = patsy.dmatrices('price' + vars, df_train)
y_test, X_test = patsy.dmatrices('price' + vars, df_test)

In [None]:
print(f'Number of columns in the broad model: {len(X_train.design_info.column_names)}.')

- Instantiate model.

In [None]:
lasso_model = Lasso()

- Define a set of alphas (aka lambdas). 

In [None]:
tune_grid = dict()
tune_grid['alpha'] = np.arange(0.05, 1, 0.05) # Just to confuse the reader, Lasso's lambda is called 'alpha'. Why? Because 'lambda' is a reserved word in Python.
tune_grid

- Define cross-validation.
- Define grid search.
- Fit model.

`GridSearchCV` not only searches for the best parameters, but also automatically fits a new model on the whole training dataset with the parameters that yielded the best cross-validation performance.  

In [None]:
cv = RepeatedKFold(n_splits = 4, n_repeats= 1, random_state = 20240523)

grid_search = GridSearchCV(
    estimator = lasso_model, 
    param_grid = tune_grid, 
    scoring = 'neg_root_mean_squared_error', 
    cv = cv, 
    verbose = 3)

lasso_reg = grid_search.fit(X_train, y_train)
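
The refitting behaviour of `GridSearchCV` can be verified on a small synthetic example (the data and the three candidate `alpha` values are assumptions for illustration). With the default `refit=True`, calling `predict` on the fitted search object should give the same result as calling it on `best_estimator_`.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(2)
X = rng.normal(size=(120, 5))
y = X[:, 0] + rng.normal(scale=0.3, size=120)

search = GridSearchCV(
    estimator=Lasso(),
    param_grid={"alpha": [0.01, 0.1, 1.0]},
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=4, shuffle=True, random_state=0),
)
search.fit(X, y)

print("best alpha:", search.best_params_["alpha"])
print("best CV RMSE:", -search.best_score_)  # the score is a negated RMSE

# The search object delegates predict() to the refitted best estimator.
same_predictions = np.allclose(search.predict(X), search.best_estimator_.predict(X))
print("search.predict matches best_estimator_.predict:", same_predictions)
```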

In [None]:
lasso_reg.best_estimator_

In [None]:
np.nonzero(lasso_reg.best_estimator_.coef_)

In [None]:
df_lasso_coefs = pd.DataFrame({'variable': X_train.design_info.column_names, 'coefficient': lasso_reg.best_estimator_.coef_})
df_lasso_coefs[df_lasso_coefs.coefficient > 0].sort_values('coefficient', ascending = False).reset_index(drop = True).iloc[0:10]

In [None]:
price_fitted_train_lasso_reg = lasso_reg.predict(X_train)
price_fitted_test_lasso_reg = lasso_reg.predict(X_test)

rmse_train_lasso_reg = mean_squared_error(y_train, price_fitted_train_lasso_reg, squared= False)
rmse_test_lasso_reg = mean_squared_error(y_test, price_fitted_test_lasso_reg, squared= False)

In [None]:
rmse_train_lasso_reg

In [None]:
df_rmse.loc['lasso regression'] = [rmse_train_lasso_reg, rmse_test_lasso_reg]

In [None]:
df_rmse

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (12,6))
axs[0].scatter(x = y_train, y = price_fitted_train_lasso_reg, marker = '.', color = 'black')
axs[0].axline([0, 0], [1, 1], color = 'k')
axs[0].set_title('train')
axs[0].set_xlim(0,500)
axs[0].set_ylim(0,500)
axs[1].scatter(x = y_test, y = price_fitted_test_lasso_reg, marker = '.', color = 'k')
axs[1].axline([0, 0], [1, 1], color = 'k')
axs[1].set_xlim(0,500)
axs[1].set_ylim(0,500)
axs[1].set_title('test')
fig.suptitle('Original vs predicted values - lasso regression')

for ax in axs.flat:
    ax.set(xlabel='original', ylabel='fitted/predicted')

for ax in axs.flat:
    ax.label_outer()

### CART: Classification and Regression Trees

The basic idea of a regression tree is **splitting** the dataset into small **bins** by the values of the explanatory ($x$) variables, and predicting $\hat y$ as the average value of $y$ within those bins. Creating a regression tree is called *building* or *growing a tree*. The algorithm has no formula.

Growing a tree is a stepwise process. We start with a root node, which contains all the observations. The method uses a search algorithm to find the best $x$ variable (and cutoff value) to split the root node into two nodes which are as different from each other as possible. Then we split these two nodes into 2x2 nodes in the same fashion. In theory the algorithm would only stop when all observations are in different bins, so we introduce a stopping rule. We can set the maximum depth of the tree, the number of final nodes, or *terminal leaves*, the minimum number of observations in the terminal leaves, or the minimum improvement in our prediction error.
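
A single split search can be sketched in a few lines. The helper `best_split` below is hypothetical (it is not a scikit-learn function) and handles one explanatory variable only: it tries every cutoff between neighbouring $x$ values and keeps the one with the smallest sum of squared errors around the two bin averages. A real tree repeats this search over all variables and then again inside each new node.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) of the single cutoff on x that minimizes
    the summed squared error around the two bin means (hypothetical helper)."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical x values cannot be separated by a cutoff
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # Place the cutoff halfway between the two neighbouring x values.
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse

# Toy data: y jumps from 1 to 5 somewhere between x = 3 and x = 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

threshold, sse = best_split(x, y)
print(threshold, sse)
```

On this toy data the best cutoff falls halfway between 3 and 10, and both bins are predicted without error.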


![](https://www.tutorialandexample.com/wp-content/uploads/2019/10/Decision-Trees-Root-Node.png)


Trees are very prone to overfitting and they are very short-sighted: at every step they only consider the result of the next step, despite the fact that each split affects the possibilities of all subsequent splits. For this reason single trees are rarely used on their own, but they are the basic building blocks of other, more effective pattern recognition algorithms.

In [None]:
mglearn.plots.plot_animal_tree()

#### Building a regression tree

In [None]:
vars =" ~ "+"+".join(basic_lev)+"+"+"+".join(reviews)+"+"+"+".join(amenities)

In [None]:
y_train, X_train = patsy.dmatrices('price' + vars, df_train)
y_test, X_test = patsy.dmatrices('price' + vars, df_test)

In [None]:
cart_reg = DecisionTreeRegressor(random_state = 20240523, max_depth = 3)
cart_reg.fit(X_train, y_train)

In [None]:
vars

In [None]:
from sklearn import tree

In [None]:
plt.figure(figsize=(14,8))
tree.plot_tree(cart_reg, filled = True, rounded= True, fontsize = 7)
plt.show()

In [None]:
pd.DataFrame({'feature': X_train.design_info.column_names, 'importance': cart_reg.feature_importances_}).iloc[[2,3,38,39,42]]

In [None]:
price_fitted_train_cart_reg = cart_reg.predict(X_train)
price_fitted_test_cart_reg = cart_reg.predict(X_test)

rmse_train_cart_reg = mean_squared_error(y_train, price_fitted_train_cart_reg, squared= False)
rmse_test_cart_reg = mean_squared_error(y_test, price_fitted_test_cart_reg, squared = False)

In [None]:
df_rmse.loc['cart regression'] = [rmse_train_cart_reg, rmse_test_cart_reg]
df_rmse

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (12,6))
axs[0].scatter(x = y_train, y = price_fitted_train_cart_reg, marker = '.', color = 'black')
axs[0].axline([0, 0], [1, 1], color = 'k')
axs[0].set_title('train')
axs[0].set_xlim(0,500)
axs[0].set_ylim(0,500)
axs[1].scatter(x = y_test, y = price_fitted_test_cart_reg, marker = '.', color = 'k')
axs[1].axline([0, 0], [1, 1], color = 'k')
axs[1].set_xlim(0,500)
axs[1].set_ylim(0,500)
axs[1].set_title('test')
fig.suptitle('Original vs predicted values - cart (decision tree) regression')

for ax in axs.flat:
    ax.set(xlabel='original', ylabel='fitted/predicted')

for ax in axs.flat:
    ax.label_outer()

### Random Forest

`Random forest` is an `ensemble model`: it creates multiple trees, each of which only loosely fits the data. Each tree
- uses only a handful of the total variables,
- is grown on a sample of the dataset, and
- is kept shallow.

It makes predictions from every single tree and averages them out to come up with the final prediction. Tests show that the ensemble of these *weak learners* gives a more robust estimate than a single overly precise model.
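
The averaging step can be reproduced by hand. The sketch below fits a small forest on synthetic data (the data and the parameter values are assumptions for illustration), collects the predictions of the individual trees from `estimators_`, and compares their average with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(
    n_estimators=25, max_depth=4, max_features=2, random_state=0
)
forest.fit(X, y)

# One row of predictions per tree, then the average across trees.
tree_predictions = np.array([tree.predict(X) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

matches_forest = np.allclose(manual_average, forest.predict(X))
print("average of tree predictions equals forest prediction:", matches_forest)
```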

1. **Ensemble Power:** Random forests combine multiple decision trees, each trained on a random subset of features and data points. This diversity leads to more robust and accurate predictions compared to single decision trees.
2. **Feature Importance:** Random forests provide insights into feature importance by measuring how much each feature contributes to the overall prediction accuracy. This helps you understand which features are most relevant to your problem.
3. **Handling Missing Data:** Random forests can handle missing data gracefully by using techniques like averaging or imputation. This makes them a good choice for real-world datasets that often contain missing values.
4. **Non-parametric Nature:** Random forests make no assumptions about the underlying data distribution, making them suitable for a wide range of problems without requiring complex data preprocessing.
5. **Scalability:** Random forests can be efficiently trained on large datasets and can handle high-dimensional data with many features. This makes them a powerful tool for big data applications.


**Random Forest Regression Parameters**

A random forest regression model boasts several parameters that influence its behavior and performance. Let's explore some of the key ones:

**1. n_estimators:** This parameter controls the number of decision trees in the forest. More trees generally lead to better accuracy, but also increase training time and computational cost. Finding the optimal number through experimentation is crucial.

**2. max_depth:** This parameter limits the maximum depth of each individual decision tree. Deeper trees can capture complex relationships but are prone to overfitting. Setting an appropriate depth helps prevent overfitting and improves generalizability.

**3. min_samples_split:** This parameter determines the minimum number of samples required to split an internal node in a decision tree. A higher value reduces the risk of overfitting by preventing splits based on too few data points.

**4. min_samples_leaf:** This parameter sets the minimum number of samples required to be at a leaf node. A higher value ensures that each leaf node contains enough data to make reliable predictions.

**5. max_features:** This parameter controls the number of features considered at each split in a decision tree. A lower value introduces randomness and helps prevent overfitting, but might also miss important features.

**6. bootstrap:** This parameter determines whether to use bootstrap sampling when building the trees. Bootstrapping involves randomly sampling data points with replacement, creating multiple training sets for the trees. This helps reduce variance and improve generalizability.

**7. random_state:** This parameter sets the seed for the random number generator, ensuring reproducibility of results. Using the same random state allows you to compare different models or parameter settings consistently.

**8. criterion:** This parameter defines the function used to measure the quality of a split in a decision tree. Common options include "squared_error" for mean squared error and "absolute_error" for mean absolute error (called "mse" and "mae" in older scikit-learn versions). The choice depends on the specific problem and desired outcome.

By carefully tuning these parameters, you can optimize your random forest regression model for your specific task and achieve the best possible performance. Remember, finding the optimal combination often involves experimentation and evaluation on your particular dataset.


In [None]:
vars =" ~ "+"+".join(basic_lev)+"+"+"+".join(reviews)+"+"+"+".join(amenities)
vars.split('+')

In [None]:
df.f_cancellation_policy.unique()

In [None]:
y_train, X_train = patsy.dmatrices('price' + vars, df_train)
y_test, X_test = patsy.dmatrices('price' + vars, df_test)

To find the best combination of these parameter options we use a `grid search`. 

In [None]:
tune_grid = {"max_features": [4, 6, 8, 10, 12], "min_samples_leaf": [5, 10, 15], 'max_depth': [4,5,6]}
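
Before starting a grid search it is worth checking how many models it will fit. A quick sketch, using the grid above and assuming 4-fold cross-validation: `ParameterGrid` enumerates every parameter combination, each combination is fitted once per fold, and one final refit on the whole training set is added.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the random forest cell above.
tune_grid = {
    "max_features": [4, 6, 8, 10, 12],
    "min_samples_leaf": [5, 10, 15],
    "max_depth": [4, 5, 6],
}
n_folds = 4  # assumed number of cross-validation folds

n_candidates = len(ParameterGrid(tune_grid))  # 5 * 3 * 3 combinations
n_cv_fits = n_candidates * n_folds            # one fit per combination and fold
n_total_fits = n_cv_fits + 1                  # plus the final refit of the best model

print(n_candidates, n_cv_fits, n_total_fits)
```

Adding one more value to any of the three lists multiplies the cost, which is why grids are usually kept small or replaced by a randomized search.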

In [None]:
%%time
rf_model = RandomForestRegressor(random_state = 20240523)
grid_search = GridSearchCV(
    rf_model,
    tune_grid,
    cv=4,
    scoring="neg_root_mean_squared_error",
    verbose=3,
)
rf_reg = grid_search.fit(X_train, y_train)

In [None]:
rf_reg.best_estimator_

In [None]:
price_fitted_train_rf_reg = rf_reg.best_estimator_.predict(X_train)
price_fitted_test_rf_reg = rf_reg.best_estimator_.predict(X_test)

rmse_train_rf_reg = mean_squared_error(y_train, price_fitted_train_rf_reg, squared= False)
rmse_test_rf_reg = mean_squared_error(y_test, price_fitted_test_rf_reg, squared = False)

In [None]:
rf_reg.cv_results_

In [None]:
df_rf_model_cv_results = pd.DataFrame(rf_reg.cv_results_)[[
    'param_max_depth', 'param_max_features', 'param_min_samples_leaf', 'mean_test_score']]
df_rf_model_cv_results

In [None]:
df_rmse.loc['random forest regression'] = [rmse_train_rf_reg, rmse_test_rf_reg]
df_rmse

In [None]:
rf_reg.best_estimator_.feature_importances_.shape

In [None]:
df_var_imp = pd.DataFrame(
    rf_reg.best_estimator_.feature_importances_, 
    X_train.design_info.column_names)\
    .reset_index()\
    .rename({"index": "variable", 0: "importance"}, axis=1)\
    .sort_values(by=["importance"], ascending=False)\
    .reset_index(drop = True)

df_var_imp['cumulative_importance'] = df_var_imp['importance'].cumsum()
df_var_imp[df_var_imp.cumulative_importance < 0.95].style.format({
    'importance': lambda x: f'{x:,.1%}',
    'cumulative_importance': lambda x: f'{x:,.1%}'})

In [None]:
df_var_imp[df_var_imp.importance > 0.01]\
    .sort_values(by = 'importance')\
    .plot(kind = 'barh', 
          x = 'variable', y = 'importance', 
          figsize = (10,6), grid = True, 
          title = 'Random forest model highest feature importances', 
          xlabel = 'variables', legend = False
         );

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (12,6))
axs[0].scatter(x = y_train, y = price_fitted_train_rf_reg, marker = '.', color = 'black')
axs[0].axline([0, 0], [1, 1], color = 'k')
axs[0].set_title('train')
axs[0].set_xlim(0,500)
axs[0].set_ylim(0,500)
axs[1].scatter(x = y_test, y = price_fitted_test_rf_reg, marker = '.', color = 'k')
axs[1].axline([0, 0], [1, 1], color = 'k')
axs[1].set_xlim(0,500)
axs[1].set_ylim(0,500)
axs[1].set_title('test')
fig.suptitle('Original vs predicted values - random forest regression')

for ax in axs.flat:
    ax.set(xlabel='original', ylabel='fitted/predicted')

for ax in axs.flat:
    ax.label_outer()

### Gradient Boosting Machines

`Gradient Boosting Machines` (GBM) are a powerful machine learning technique that handles complex datasets well and often achieves high accuracy. Boosting follows the concept of ensemble learning: it combines multiple simple models (weak learners or base estimators) to generate the final output. 

`Boosting` is one of the popular ensemble techniques used to build a strong learner from many weak learners. It starts by building a *primary model* from the available training data, then identifies the *errors* of this base model. A second model is built to correct these errors, then a third, and so on. Models are added in this way until the training data is predicted well enough or a predefined number of models is reached.

Instead of using these models separately to predict the outcome, *we use them in series, as a combination*, and the resulting model captures more information from the data than any of the base models alone. This is why boosting counts as an ensemble method.

![](https://static.javatpoint.com/tutorial/machine-learning/images/gbm-in-machine-learning3.png)

Gradient boosting machines consist of three elements:
- loss function
- weak learners (simple trees)
- additive model (we use every tree for prediction).

The main difference between random forest and gradient boosting is that the *trees are not built independently but in a sequential fashion*. After the first tree we make a prediction. We calculate the prediction errors and fit the next tree to the errors! Then we predict, measure the new errors, and build another tree to model those errors. We continue this process until some mechanism (for instance a predefined level of errors, or the number of trees) tells us to stop.

The final model here is a stagewise additive model of many individual trees.
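
The sequential logic can be written out by hand for the squared error loss, where "fitting to the errors" means fitting each new tree to the current residuals. This is a simplified sketch on synthetic data, not what `GradientBoostingRegressor` does internally in every detail; the learning rate, tree depth and number of trees are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption).
rng = np.random.default_rng(4)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=300)

learning_rate = 0.1
n_trees = 50

# Start from the simplest possible model: the average of y.
prediction = np.full_like(y, y.mean())
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    residuals = y - prediction                    # errors of the current model
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)  # add a shrunken correction
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

def boosted_predict(X_new):
    """Stagewise additive model: initial average plus all shrunken trees."""
    return y.mean() + learning_rate * sum(tree.predict(X_new) for tree in trees)

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The training error falls with every added tree, which is also why boosting needs a stopping rule and out-of-sample checks: left alone, it keeps fitting the noise.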

In [None]:
tune_grid = {
    "max_depth": [5, 10],
    "learning_rate": [0.1, 0.2],
    "min_samples_leaf": [5, 10, 20],
    "ccp_alpha": [1,5,10]
}

In [None]:
%%time
gbm_model = GradientBoostingRegressor(random_state = 20240523, max_features='sqrt')

grid_search = GridSearchCV(
    gbm_model,
    tune_grid,
    cv=4,
    scoring="neg_root_mean_squared_error",
    verbose=10,
)

gbm_reg = grid_search.fit(X_train, y_train)

In [None]:
gbm_reg.best_estimator_

In [None]:
price_fitted_train_gbm_reg = gbm_reg.best_estimator_.predict(X_train)
price_fitted_test_gbm_reg = gbm_reg.best_estimator_.predict(X_test)

rmse_train_gbm_reg = mean_squared_error(y_train, price_fitted_train_gbm_reg, squared= False)
rmse_test_gbm_reg = mean_squared_error(y_test, price_fitted_test_gbm_reg, squared = False)

In [None]:
df_rmse.loc['gbm regression'] = [rmse_train_gbm_reg, rmse_test_gbm_reg]
df_rmse

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (12,6))
axs[0].scatter(x = y_train, y = price_fitted_train_gbm_reg, marker = '.', color = 'black')
axs[0].axline([0, 0], [1, 1], color = 'k')
axs[0].set_title('train')
axs[0].set_xlim(0,500)
axs[0].set_ylim(0,500)
axs[1].scatter(x = y_test, y = price_fitted_test_gbm_reg, marker = '.', color = 'k')
axs[1].axline([0, 0], [1, 1], color = 'k')
axs[1].set_xlim(0,500)
axs[1].set_ylim(0,500)
axs[1].set_title('test')
fig.suptitle('Original vs predicted values - GBM regression')

for ax in axs.flat:
    ax.set(xlabel='original', ylabel='fitted/predicted')

for ax in axs.flat:
    ax.label_outer()

The problem with GBM is that it is **very computation-heavy and difficult to train**. To address that issue Microsoft came up with the [LightGBM](https://github.com/Microsoft/LightGBM) algorithm, which excels in both efficiency and accuracy. `LightGBM` grows trees leaf-wise: at each step it splits only one of the nodes, the one with the higher loss.

![](https://static.javatpoint.com/tutorial/machine-learning/images/gbm-in-machine-learning4.png)

In this demo we use scikit-learn's `HistGradientBoostingRegressor`, a histogram-based gradient boosting implementation inspired by `LightGBM`.

#### LightGBM parameters

**1. max_iter:** maximum number of iterations of the boosting process, i.e. the maximum number of trees.

**2. max_depth:** The maximum depth of each tree. The depth of a tree is the number of edges to go from the root to the deepest leaf. Setting an appropriate depth helps prevent overfitting and improves generalizability.

**3. max_leaf_nodes:** The maximum number of leaves for each tree.

**4. min_samples_leaf:** This parameter sets the minimum number of samples required to be at a leaf node. A higher value ensures that each leaf node contains enough data to make reliable predictions.

**5. max_features:** This parameter controls the proportion of features considered at each split (available in recent scikit-learn versions only). A lower value introduces randomness and helps prevent overfitting, but might also miss important features.

**6. loss:** The loss function to use in the boosting process. Default is squared error. 

**7. random_state:** This parameter sets the seed for the random number generator, ensuring reproducibility of results. Using the same random state allows you to compare different models or parameter settings consistently.

**8. learning_rate:** Shrinks the contribution of each tree by `learning_rate`. There is a trade-off between `learning_rate` and `max_iter`: a lower learning rate usually needs more trees.

In [None]:
tune_grid = {
    "max_iter": [100, 200],
    "max_depth": [5, 10],
    "learning_rate": [0.1, 0.2],
    "min_samples_leaf": [5, 10, 20],
}

In [None]:
%%time
lightgbm_model = HistGradientBoostingRegressor(random_state = 20240523)

grid_search = GridSearchCV(
    lightgbm_model,
    tune_grid,
    cv=4,
    scoring="neg_root_mean_squared_error",
    verbose=10,
)

lightgbm_reg = grid_search.fit(X_train, y_train)

In [None]:
price_fitted_train_lightgbm_reg = lightgbm_reg.best_estimator_.predict(X_train)
price_fitted_test_lightgbm_reg = lightgbm_reg.best_estimator_.predict(X_test)

rmse_train_lightgbm_reg = mean_squared_error(y_train, price_fitted_train_lightgbm_reg, squared= False)
rmse_test_lightgbm_reg = mean_squared_error(y_test, price_fitted_test_lightgbm_reg, squared = False)

In [None]:
df_rmse.loc['light gbm regression'] = [rmse_train_lightgbm_reg, rmse_test_lightgbm_reg]
df_rmse

In [None]:
lightgbm_reg.best_estimator_

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (12,6))
axs[0].scatter(x = y_train, y = price_fitted_train_lightgbm_reg, marker = '.', color = 'black')
axs[0].axline([0, 0], [1, 1], color = 'k')
axs[0].set_title('train')
axs[0].set_xlim(0,500)
axs[0].set_ylim(0,500)
axs[1].scatter(x = y_test, y = price_fitted_test_lightgbm_reg, marker = '.', color = 'k')
axs[1].axline([0, 0], [1, 1], color = 'k')
axs[1].set_xlim(0,500)
axs[1].set_ylim(0,500)
axs[1].set_title('test')
fig.suptitle('Original vs predicted values - light GBM regression')

for ax in axs.flat:
    ax.set(xlabel='original', ylabel='fitted/predicted')

for ax in axs.flat:
    ax.label_outer()

### Neural Networks

An `artificial neural network` (ANN), or simple traditional neural network, aims to solve comparatively simple tasks with a straightforward network layout. It is loosely inspired by biological neural networks and consists of a collection of layers that together perform a specific task. Each layer consists of a collection of nodes that operate together.

In [None]:
mglearn.plots.plot_logistic_regression_graph()

These networks usually consist of an `input layer`, one or more `hidden layers`, and an `output layer`. Each node in each layer is potentially linked to each node in the preceding and the succeeding layers. While such networks can solve easy mathematical and computational problems, including basic logic gates with their respective truth tables, they struggle with complicated image processing, computer vision, and natural language processing tasks.

For these problems, we utilize `deep neural networks` (DNN), which often have a complex hidden layer structure with a wide variety of different layers. A deep neural network has more layers (more depth) than a simple ANN; each layer adds complexity to the model and lets it build a richer representation of the inputs, which helps with complex problems.

In [None]:
mglearn.plots.plot_two_hidden_layer_graph()

While neural networks are inspired by biological networks, they are fundamentally different in their architecture.

![](https://images.datacamp.com/image/upload/v1707332849/image4_74e3e8d76f.png)

In a neural network, each node is essentially a weighted sum of the nodes in the previous layer.

This output is then passed through an `activation function`. The activation function introduces nonlinearity into the network, which allows it to learn complex patterns in the data that a purely linear model could not capture.
![](https://miro.medium.com/v2/resize:fit:720/format:webp/1*XxxiA0jJvPrHEJHD4z893g.png)
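
A single layer can be computed by hand to see both steps: the weighted sum and the activation. The numbers below are made up for illustration, the bias term is an addition that the text above does not mention, and ReLU is just one common choice of activation function.

```python
import numpy as np

def relu(z):
    """ReLU activation: keeps positive values, replaces negatives with zero."""
    return np.maximum(0.0, z)

# Values of the three nodes in the previous layer (made-up numbers).
inputs = np.array([1.0, 2.0, -1.0])

# One column of weights per node in the current layer: shape (3 inputs, 2 nodes).
weights = np.array([
    [0.5, -1.0],
    [0.25, 0.5],
    [-1.0, 2.0],
])
biases = np.array([0.0, 0.5])  # assumed bias terms, one per node

weighted_sum = inputs @ weights + biases  # linear step
activations = relu(weighted_sum)          # nonlinear step

print("weighted sums:", weighted_sum)
print("after ReLU:   ", activations)
```

The first node has a positive weighted sum and passes it on unchanged, while the second node has a negative one and is switched off. Without the activation step, stacking layers would only produce another linear model.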

While the inner workings of neural networks can be, and usually are, VERY COMPLICATED, a good demonstration of how they work can be found here: https://goo.gl/ou9iMB

`scikit-learn` uses a `multilayer perceptron model`, a special neural network consisting of fully connected neurons with a nonlinear kind of activation function. 
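
What "fully connected" means shows up in the shapes of the fitted weight matrices. The sketch below trains a small `MLPRegressor` on synthetic data with six inputs (the data and the short `max_iter` are assumptions to keep the example fast) and lists the shape of each layer's weights in `coefs_`.

```python
import warnings

import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic data with six explanatory variables (illustrative assumption).
rng = np.random.default_rng(5)
X = rng.normal(size=(200, 6))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

mlp = MLPRegressor(hidden_layer_sizes=(10, 4), max_iter=50, random_state=0)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # a short run may stop before convergence
    mlp.fit(X, y)

# One weight matrix per pair of consecutive layers:
# inputs -> hidden layer 1 -> hidden layer 2 -> output.
layer_shapes = [w.shape for w in mlp.coefs_]
print(layer_shapes)
print("hidden activation:", mlp.activation, "| output activation:", mlp.out_activation_)
```

Every input is connected to each of the 10 nodes in the first hidden layer, every one of those to each of the 4 nodes in the second, and those to the single output node.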

There are many more neural network architectures for various use cases.

In [None]:
%%time
dnn_model = MLPRegressor(
    hidden_layer_sizes= [10,4], 
    batch_size= 100,
    early_stopping= True,
    random_state= 20240523)
dnn_reg = dnn_model.fit(X_train, y_train)

In [None]:
dnn_reg.coefs_[0].shape

In [None]:
dnn_reg.coefs_[1].shape

In [None]:
price_fitted_train_dnn_reg = dnn_reg.predict(X_train)
price_fitted_test_dnn_reg = dnn_reg.predict(X_test)

rmse_train_dnn_reg = mean_squared_error(y_train, price_fitted_train_dnn_reg, squared= False)
rmse_test_dnn_reg = mean_squared_error(y_test, price_fitted_test_dnn_reg, squared = False)

In [None]:
df_rmse.loc['dnn regression'] = [rmse_train_dnn_reg, rmse_test_dnn_reg]
df_rmse

In [None]:
fig, axs = plt.subplots(1, 2, figsize = (12,6))
axs[0].scatter(x = y_train, y = price_fitted_train_dnn_reg, marker = '.', color = 'black')
axs[0].axline([0, 0], [1, 1], color = 'k')
axs[0].set_title('train')
axs[0].set_xlim(0,500)
axs[0].set_ylim(0,500)
axs[1].scatter(x = y_test, y = price_fitted_test_dnn_reg, marker = '.', color = 'k')
axs[1].axline([0, 0], [1, 1], color = 'k')
axs[1].set_xlim(0,500)
axs[1].set_ylim(0,500)
axs[1].set_title('test')
fig.suptitle('Original vs predicted values - deep neural network regression')

for ax in axs.flat:
    ax.set(xlabel='original', ylabel='fitted/predicted')

for ax in axs.flat:
    ax.label_outer()