# LightGBM

* LightGBM is a gradient boosting machine (GBM) implementation developed to improve on the training performance of XGBoost.

## 1-) Data Preprocessing

In [1]:
import numpy as np
import pandas as pd 
from sklearn.model_selection import train_test_split

In [2]:
hit = pd.read_csv("Hitters.csv")
df = hit.copy()
df = df.dropna()
dms = pd.get_dummies(df[['League', 'Division', 'NewLeague']])
y = df["Salary"]
X_ = df.drop(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')
X = pd.concat([X_, dms[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=42)

## 2-) Model

In [4]:
#!pip install lightgbm

In [None]:
#conda install -c conda-forge lightgbm

In [3]:
from lightgbm import LGBMRegressor

In [5]:
lgbm = LGBMRegressor()
lgbm_model = lgbm.fit(X_train, y_train)

## 3-) Prediction

In [6]:
from sklearn.metrics import mean_squared_error

In [7]:
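# best_iteration_ is only set when early stopping is used;
# without it, predict() uses all fitted trees.
y_pred = lgbm_model.predict(X_test,
                            num_iteration=lgbm_model.best_iteration_)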



In [8]:
test_error_before = np.sqrt(mean_squared_error(y_test, y_pred))
test_error_before  # test error before model tuning

363.8712087611089
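
The value above is the root mean squared error (RMSE) on the test set, expressed in the same units as `Salary`. As a minimal sketch of what `np.sqrt(mean_squared_error(...))` computes, the hypothetical helper `rmse` below applies the formula to made-up numbers (not the Hitters data):

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values for illustration only.
toy_true = [100.0, 200.0, 300.0]
toy_pred = [110.0, 190.0, 330.0]
toy_rmse = rmse(toy_true, toy_pred)
print(toy_rmse)
```

Because the residuals are squared before averaging, one large miss (here the third prediction) raises the RMSE more than several small ones.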

## 4-) Model Tuning

* In this section, we will search for the best values of **learning_rate**, **max_depth**, **colsample_bytree**, and **n_estimators** with the GridSearchCV method.


* GridSearchCV: Grid Search Cross-Validation method.


* Then we will build the final model using the best values found.


* **learning_rate**, **max_depth**, **colsample_bytree**, and **n_estimators** are hyperparameters: values that we set ourselves rather than values the model learns from the data.


* Instead of relying on intuition to choose these hyperparameters, we will let the grid search find the best combination.


* **colsample_bytree** ==>> fraction of the independent variables (columns) sampled when building each tree


* **n_estimators** ==>> number of trees to be used
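
The "cross-validation" part of GridSearchCV means that each candidate is scored on several train/validation splits of the training data instead of a single one. The sketch below illustrates the idea of a k-fold split in plain Python; `kfold_indices` is a hypothetical helper written for this note, not scikit-learn's implementation, and the row count of 23 is made up.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size

# Hypothetical toy size: 23 rows split into 10 folds.
folds = list(kfold_indices(23, 10))
for train_idx, valid_idx in folds[:3]:
    print(len(train_idx), len(valid_idx))
```

With `cv=10`, a candidate's score is the average over ten such splits, which makes the comparison between candidates less dependent on one lucky or unlucky split.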

In [11]:
from sklearn.model_selection import GridSearchCV

In [15]:
lgbm_grid = {
    'colsample_bytree': [0.4, 0.5,0.6,0.9,1],
    'learning_rate': [0.01, 0.1, 0.5,1],
    'n_estimators': [20, 40, 100, 200, 500,1000],
    'max_depth': [1,2,3,4,5,6,7,8] }

In [16]:
lgbm = LGBMRegressor()

In [17]:
lgbm_cv_model = GridSearchCV(lgbm, lgbm_grid, cv=10, n_jobs = -1, verbose = 2)


* To find the best model, every value of each parameter is combined with every value of the other parameters, and each combination is evaluated.


* This process takes a long time. The **n_jobs=-1** argument can reduce it by running the fits in parallel on all available processor cores.


* The **verbose** parameter controls how much progress detail is printed, such as how many fits have completed and how long they took, as shown below.
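
To see why the search is expensive, the combinations can be counted directly. The sketch below rebuilds the same grid and enumerates it with `itertools.product`; the names `candidates`, `n_candidates`, and `n_fits` are introduced here for illustration only.

```python
from itertools import product

lgbm_grid = {
    'colsample_bytree': [0.4, 0.5, 0.6, 0.9, 1],
    'learning_rate': [0.01, 0.1, 0.5, 1],
    'n_estimators': [20, 40, 100, 200, 500, 1000],
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8],
}

# Every value of each parameter is paired with every value of the others.
candidates = [dict(zip(lgbm_grid, values))
              for values in product(*lgbm_grid.values())]
n_candidates = len(candidates)   # 5 * 4 * 6 * 8
n_fits = n_candidates * 10       # one fit per candidate per fold (cv=10)
print(n_candidates, n_fits)
```

The two numbers match the "960 candidates, totalling 9600 fits" line reported by the fit output.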

In [18]:
lgbm_cv_model.fit(X_train, y_train)

Fitting 10 folds for each of 960 candidates, totalling 9600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 516 tasks      | elapsed:   14.1s
[Parallel(n_jobs=-1)]: Done 1328 tasks      | elapsed:   32.4s
[Parallel(n_jobs=-1)]: Done 2460 tasks      | elapsed:   58.8s
[Parallel(n_jobs=-1)]: Done 3920 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 5700 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 7808 tasks      | elapsed:  3.3min
[Parallel(n_jobs=-1)]: Done 9600 out of 9600 | elapsed:  4.2min finished


GridSearchCV(cv=10, estimator=LGBMRegressor(), n_jobs=-1,
             param_grid={'colsample_bytree': [0.4, 0.5, 0.6, 0.9, 1],
                         'learning_rate': [0.01, 0.1, 0.5, 1],
                         'max_depth': [1, 2, 3, 4, 5, 6, 7, 8],
                         'n_estimators': [20, 40, 100, 200, 500, 1000]},
             verbose=2)

In [19]:
lgbm_cv_model.best_params_

{'colsample_bytree': 0.4,
 'learning_rate': 0.1,
 'max_depth': 5,
 'n_estimators': 40}

### 4.1) Tuned Model

In [20]:
lgbm_tuned = LGBMRegressor(learning_rate=0.1,
                           max_depth=5,
                           n_estimators=40,
                           colsample_bytree=0.4)

In [21]:
lgbm_tuned = lgbm_tuned.fit(X_train, y_train)

In [23]:
y_pred1 = lgbm_tuned.predict(X_test)

In [24]:
test_error_after = np.sqrt(mean_squared_error(y_test, y_pred1))
test_error_after  # test error after model tuning

377.8415676535648
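
Comparing the two outputs shows that the tuned model's test RMSE is higher than the default model's, so tuning did not help on this particular train/test split. A short sketch of the comparison, using the two values printed in this notebook (the names `rmse_before` and `rmse_after` are introduced here for illustration):

```python
# Values copied from the two RMSE outputs in this notebook.
rmse_before = 363.8712087611089   # default LGBMRegressor
rmse_after = 377.8415676535648    # model refit with the grid-search parameters

difference = rmse_after - rmse_before
relative_change = difference / rmse_before
print(f"RMSE change: {difference:+.2f} ({relative_change:+.1%})")
```

One possible explanation, offered as an assumption rather than a diagnosis: `GridSearchCV` was called without a `scoring` argument, so it ranked candidates by the estimator's default score (R² for regressors) on cross-validation folds of the training data. The best candidate under that criterion is not guaranteed to have a lower RMSE on a single held-out test set.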