### 🔹 LightGBM Regression  

LightGBM (Light Gradient Boosting Machine) is a fast and efficient **gradient boosting framework** based on decision trees.  
It is designed to handle **large datasets** and **high-dimensional data** efficiently, while maintaining high accuracy.  

The main idea:  
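- LightGBM grows trees **leaf-wise** (best-first) rather than **level-wise**: at each step it splits the leaf with the largest loss reduction. This often reaches a lower loss for the same number of leaves, but it can overfit small datasets unless `num_leaves` or `max_depth` is limited.  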
- Uses gradient boosting to sequentially build trees that correct previous errors.  
- Supports features like **categorical feature handling**, **regularization**, and **parallel learning**.  

The prediction can be expressed as:  

$$
\hat{Y} = \sum_{k=1}^{K} f_k(X), \quad f_k \in \mathcal{F}
$$  

where:  
- $f_k$ is the $k$-th regression tree,  
- $\mathcal{F}$ is the space of all possible trees,  
- $K$ is the total number of trees.  
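
To make the sum of trees concrete, here is a rough sketch of boosting for squared error, written with NumPy only. It is not LightGBM's algorithm: each "tree" is a one-split stump on a single feature, the data is synthetic, and the names `fit_stump`, `boost`, and `predict` are hypothetical helpers. The learning rate is folded into each $f_k$, and the constant `baseline` is analogous to the "Start training from score" value that LightGBM prints when fitting.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on 1-D x that minimizes squared error on the residuals."""
    best = None
    for t in np.unique(x)[:-1]:  # skip the maximum so the right side is never empty
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, left_value, right_value = best
    return lambda z: np.where(z <= t, left_value, right_value)

def boost(x, y, n_trees=50, learning_rate=0.1):
    """Return the constant baseline and the list of shrunken stumps f_1..f_K."""
    baseline = y.mean()
    pred = np.full(len(y), baseline)
    trees = []
    for _ in range(n_trees):
        stump = fit_stump(x, y - pred)  # each new stump is fitted to the current residuals
        f_k = lambda z, s=stump: learning_rate * s(z)
        trees.append(f_k)
        pred = pred + f_k(x)
    return baseline, trees

def predict(baseline, trees, x):
    """Prediction = baseline + sum over k of f_k(x)."""
    return baseline + sum(f(x) for f in trees)

# Synthetic data (an assumption, unrelated to dataset.csv)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

baseline, trees = boost(x, y)
mse_start = np.mean((y - baseline) ** 2)
mse_end = np.mean((y - predict(baseline, trees, x)) ** 2)
print(mse_start, mse_end)
```

The training error after 50 stumps is lower than the error of the constant baseline, because every stump moves the prediction toward the residual mean in each of its two regions.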

LightGBM Regression helps us to:  
- Achieve **high performance** on large datasets.  
- Capture **non-linear and complex relationships** efficiently.  
- Reduce training time while maintaining or improving accuracy.  

In this notebook, we will implement **LightGBM Regression** and compare its performance with XGBoost and Random Forest 🚀.  


# --------------------------------------------------------------------------

# load dataset

In [1]:
# from google.colab import files, drive

# up = files.upload()
# drive.mount('/content/drive')

In [2]:
import pandas as pd

df = pd.read_csv('dataset.csv')
df.head(3)

Unnamed: 0,f1,f2,f3,T
0,2.0,4,8.5,196
1,2.4,4,9.6,221
2,1.5,4,5.9,136
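
The `Unnamed: 0` column usually appears when a CSV was written with its row index and the index column has an empty header. Assuming that is the case here, a small sketch of reading it back as the index, using only the three rows shown above:

```python
import io
import pandas as pd

# The three rows shown above, written back out as CSV text for illustration.
csv_text = """,f1,f2,f3,T
0,2.0,4,8.5,196
1,2.4,4,9.6,221
2,1.5,4,5.9,136
"""

# index_col=0 tells pandas to treat the first (unnamed) column as the row index.
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_demo.columns.tolist())
```

The notebook selects `f1`, `f2`, `f3`, and `T` by name later on, so the extra column does not affect the model either way.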


# cleaning

In [3]:
# clean the data

# encoding

In [4]:
# encode the data

# define x, y

In [5]:
import numpy as np

x = df[['f1', 'f2', 'f3']].values
y = df['T'].values

In [6]:
x[:5]

array([[ 2. ,  4. ,  8.5],
       [ 2.4,  4. ,  9.6],
       [ 1.5,  4. ,  5.9],
       [ 3.5,  6. , 11.1],
       [ 3.5,  6. , 10.6]])

# splitting

In [7]:
# ! pip install lightgbm

In [8]:
# # finding best random state 

# from sklearn.model_selection import train_test_split
# from lightgbm import LGBMRegressor
# lgb = LGBMRegressor(random_state=1)
# from sklearn.metrics import r2_score

# import time
# t1 = time.time()
# lst = []
# for i in range(1,10):
#     x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=i)
#     lgb.fit(x_train, y_train)
#     yhat_test = lgb.predict(pd.DataFrame(x_test, columns=['f1', 'f2', 'f3']))
#     r2 = r2_score(y_test, yhat_test)
#     lst.append(r2)

# t2 = time.time()
# print(f"run time: {round((t2 - t1) / 60 , 0)} min")
# print(f"R2_score = {round(max(lst),2)}")
# print(f"random_state = {np.argmax(lst) + 1}")

In [9]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=4)  

# scaling

In [10]:
# No scaling needed: LightGBM's tree splits depend on the ordering of feature values, not on their scale
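
The reason scaling can be skipped is that a threshold split only cares about the order of values within a feature. The sketch below illustrates this with a single scikit-learn `DecisionTreeRegressor` as a stand-in (an assumption: it is not LightGBM and does not exercise LightGBM's histogram binning) on synthetic data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption, unrelated to dataset.csv)
rng = np.random.default_rng(0)
x_raw = rng.uniform(0, 10, size=(200, 3))
y_demo = 3 * x_raw[:, 0] + x_raw[:, 1] ** 2 + rng.normal(0, 0.5, 200)

# Rescale each feature by a different positive factor; order within a feature is preserved.
x_scaled = x_raw * np.array([1000.0, 10.0, 7.0])

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x_raw, y_demo)
tree_scaled = DecisionTreeRegressor(max_depth=4, random_state=0).fit(x_scaled, y_demo)

# Both trees partition the rows the same way, so their predictions should agree.
same = np.allclose(tree_raw.predict(x_raw), tree_scaled.predict(x_scaled))
print(same)
```

Only the split thresholds change with the scale; the groups of rows in each leaf, and therefore the leaf values, stay the same.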

# fit train data

In [11]:
# # k-fold cross validation

# from lightgbm import LGBMRegressor
# from sklearn.model_selection import GridSearchCV

# parameters = {
#     '': [],
#     '': []
# }

# lg = LGBMRegressor(random_state=1)
# gs = GridSearchCV(estimator=lg, param_grid=parameters, cv=5)

# gs.fit(x_train, y_train)

# best_params = gs.best_params_
# print(best_params)

In [12]:
# https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html

In [13]:
from lightgbm import LGBMRegressor

lgb = LGBMRegressor(random_state=1) 
lgb.fit(x_train, y_train)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000084 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 157
[LightGBM] [Info] Number of data points in the train set: 800, number of used features: 3
[LightGBM] [Info] Start training from score 255.701250


# predict test data

In [14]:
yhat_test = lgb.predict(pd.DataFrame(x_test, columns=['f1', 'f2', 'f3']))

# evaluate

In [15]:
from sklearn.metrics import r2_score

print("r2-score (train data): %0.4f" % r2_score(y_train, lgb.predict(pd.DataFrame(x_train, columns=['f1', 'f2', 'f3']))))
print("r2-score (test data): %0.4f" % r2_score(y_test, yhat_test))

r2-score (train data): 0.9704
r2-score (test data): 0.9715


In [16]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# predict once on the training set and reuse the result for every metric
yhat_train = lgb.predict(pd.DataFrame(x_train, columns=['f1', 'f2', 'f3']))

print(f"MSE (train data): {mean_squared_error(y_train, yhat_train)}")
print(f"RMSE (train data): {np.sqrt(mean_squared_error(y_train, yhat_train))}")
print(f"MAE (train data): {mean_absolute_error(y_train, yhat_train)}")
print('------------')
print(f"MSE (test data): {mean_squared_error(y_test, yhat_test)}")
print(f"RMSE (test data): {np.sqrt(mean_squared_error(y_test, yhat_test))}")
print(f"MAE (test data): {mean_absolute_error(y_test, yhat_test)}")

MSE (train data): 121.87502274111954
RMSE (train data): 11.039702112879656
MAE (train data): 5.2921496136862105
------------
MSE (test data): 105.358267921499
RMSE (test data): 10.264417563675934
MAE (test data): 4.80454447828185
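
For reference, the four metrics reported above can be computed by hand. The sketch below uses three hypothetical values chosen so the arithmetic is easy to check; it is independent of the notebook's data.

```python
import numpy as np

# Tiny hypothetical values, chosen so the arithmetic is easy to follow.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residual = y_true - y_pred                      # [ 1.,  0., -2.]
mse = np.mean(residual ** 2)                    # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                             # same units as the target
mae = np.mean(np.abs(residual))                 # (1 + 0 + 2) / 3
ss_res = np.sum(residual ** 2)                  # 5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 4 + 0 + 4 = 8
r2 = 1 - ss_res / ss_tot                        # 1 - 5/8

print(mse, rmse, mae, r2)
```

RMSE and MAE are in the units of the target `T`, which makes them easier to interpret than MSE. An RMSE well above the MAE, as in the output above, suggests that a few large errors dominate.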


# predict new data

In [22]:
new_data = pd.DataFrame([[2, 4, 8.5]], columns=['f1', 'f2', 'f3'])
lgb.predict(new_data)

array([196.15849562])

# save the model

In [18]:
# import joblib

# joblib.dump(lgb, 'lgb_model.pkl')

# load the model

In [19]:
# import joblib

# lgb = joblib.load('lgb_model.pkl')