# What is Light GBM?
Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification, and many other machine learning tasks.

Unlike most other boosting algorithms, which grow trees depth-wise (level-wise), Light GBM grows trees leaf-wise: at each step it splits the leaf whose split gives the best fit. For the same number of leaves, leaf-wise growth can reduce more loss than level-wise growth, which often results in better accuracy. It is also very fast, hence the word ‘Light’.

* XGBoost grows trees in the following manner

![](img/xgboost_tree_growth.png)


* LightGBM grows trees in the following manner

![](img/lightgbm_tree_growth.png)


Leaf-wise splits increase model complexity and may lead to overfitting. This can be overcome by specifying another parameter, max_depth, which limits the depth to which splitting will occur.
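To make the difference concrete, here is a minimal toy sketch of the two growth strategies. The per-node gains in `GAIN` are made-up numbers (an assumption for illustration, not LightGBM output), nodes are numbered heap-style (root 1, children 2i and 2i+1), and neither function reflects how LightGBM or XGBoost are actually implemented.

```python
import heapq
from collections import deque

# Hypothetical loss reduction obtained by splitting each node (made-up values).
GAIN = {1: 10.0, 2: 6.0, 3: 1.0, 4: 5.0, 5: 4.0, 6: 0.5, 7: 0.5}


def gain(node):
    # Nodes not listed in GAIN are assumed to give no improvement.
    return GAIN.get(node, 0.0)


def depth(node):
    # Depth of a heap-numbered node; the root (node 1) has depth 0.
    return node.bit_length() - 1


def grow_level_wise(n_splits):
    """Split nodes in breadth-first order, one level after another."""
    queue = deque([1])
    total = 0.0
    for _ in range(n_splits):
        node = queue.popleft()
        total += gain(node)
        queue.extend([2 * node, 2 * node + 1])
    return total, sorted(queue)


def grow_leaf_wise(n_splits):
    """Always split the current leaf with the largest gain."""
    heap = [(-gain(1), 1)]  # max-heap emulated with negated gains
    total = 0.0
    for _ in range(n_splits):
        neg_gain, node = heapq.heappop(heap)
        total -= neg_gain
        for child in (2 * node, 2 * node + 1):
            heapq.heappush(heap, (-gain(child), child))
    return total, sorted(node for _, node in heap)


print(grow_level_wise(3))  # (17.0, [4, 5, 6, 7])
print(grow_leaf_wise(3))   # (21.0, [3, 5, 8, 9])
```

With the same budget of three splits, the leaf-wise version collects more gain in this toy example, but its deepest leaf sits at depth 3 instead of 2. That extra depth is what max_depth is meant to keep in check.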

Below, we will see the steps to install Light GBM and run a model using it.

# Important Parameters of Light GBM

* task : default value = train ; options = train , prediction ; Specifies the task we wish to perform which is either train or prediction.

* application: default=regression, type=enum; an alias of objective, which is the name used in the Python code below; options:

            regression : perform regression task
            binary : Binary classification
            multiclass: Multiclass Classification
            lambdarank : lambdarank application

* data: type=string; training data , LightGBM will train from this data

* num_iterations: number of boosting iterations to be performed ; default=100; type=int

* num_leaves : number of leaves in one tree ; default = 31 ; type =int

* device : default= cpu ; options = gpu,cpu. Device on which we want to train our model. A GPU can speed up training, particularly on large datasets.

* max_depth: Specify the max depth to which tree will grow. This parameter is used to deal with overfitting.

* min_data_in_leaf: Min number of data in one leaf.

* feature_fraction: default=1 ; specifies the fraction of features to be taken for each iteration

* bagging_fraction: default=1 ; specifies the fraction of data to be used for each iteration and is generally used to speed up the training and avoid overfitting. Bagging only takes effect when bagging_freq is also set to a non-zero value.

* min_gain_to_split: default=0 ; min gain to perform splitting

* max_bin : max number of bins to bucket the feature values.

* min_data_in_bin : min number of data in one bin

* num_threads: default=OpenMP_default, type=int ;Number of threads for Light GBM.

* label : type=string ; specify the label column

* categorical_feature : type=string ; specify the categorical features we want to use for training our model

* num_class: default=1 ; type=int ; used only for multi-class classification
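As a rough sketch, several of the parameters above could be collected into a Python dictionary of the kind passed to `lgb.train`. The values are illustrative assumptions rather than recommendations, the three-class problem is hypothetical, and `objective` is used in place of `application` (the two are aliases).

```python
# Illustrative values only; tune them for your own data.
params = {
    "objective": "multiclass",  # alias of `application`
    "num_class": 3,             # hypothetical 3-class problem
    "num_leaves": 31,
    "max_depth": 6,
    "min_data_in_leaf": 20,
    "feature_fraction": 0.8,    # use 80% of the features per iteration
    "bagging_fraction": 0.8,    # use 80% of the rows ...
    "bagging_freq": 1,          # ... resampled at every iteration
    "max_bin": 255,
    "num_threads": 4,
    "device": "cpu",
}

# Simple sanity check: keep num_leaves within what max_depth allows.
leaves_allowed = 2 ** params["max_depth"]
print(params["num_leaves"] <= leaves_allowed)
```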
 


# Tuning Parameters of Light GBM

For best fit

* num_leaves : This parameter sets the maximum number of leaves in a tree. A depth-wise tree of depth max_depth has at most 2^(max_depth) leaves, so in theory num_leaves = 2^(max_depth). However, this is not a good estimate for Light GBM: because splitting takes place leaf-wise rather than depth-wise, a leaf-wise tree is typically much deeper than a depth-wise tree with the same number of leaves. Hence num_leaves should be set smaller than 2^(max_depth), otherwise it may lead to overfitting. The two parameters should be tuned together rather than derived from one another with this formula.

* min_data_in_leaf : This is also one of the important parameters for dealing with overfitting. Setting it too small may cause overfitting, so it should be chosen according to the size of the training data and num_leaves. For large datasets, values in the hundreds or thousands are usually appropriate.

* max_depth: It specifies the maximum depth or level up to which tree can grow.
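The relation between num_leaves and max_depth can be checked with a few lines of Python. This is only a sketch: `leaves_within_depth` is a hypothetical helper, not part of LightGBM, and the candidate value 70 is an arbitrary example. The pair (150, 7) is the one used in the training cell later in this notebook.

```python
def leaves_within_depth(num_leaves, max_depth):
    """Return True if num_leaves is strictly below 2**max_depth.

    Hypothetical helper for illustration; LightGBM does not provide it.
    """
    return num_leaves < 2 ** max_depth


print(2 ** 7)                       # 128 leaves at most for max_depth=7
print(leaves_within_depth(150, 7))  # False: more leaves than depth 7 allows
print(leaves_within_depth(70, 7))   # True: leaves room below the 2**7 limit
```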


For faster speed

* bagging_fraction : Used to perform bagging (training each iteration on a random subset of the data) for faster results
* feature_fraction : Set fraction of the features to be used at each iteration
* max_bin : A smaller value of max_bin can save a lot of time, since the feature values are bucketed into fewer discrete bins, which is computationally cheaper.
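The effect of max_bin can be illustrated with a toy bucketing function. This sketch assumes simple equal-width bins, which is a simplification: the bin boundaries LightGBM actually chooses are decided by the library and need not be equal-width. `equal_width_bins` is a hypothetical helper.

```python
def equal_width_bins(values, max_bin):
    """Map each value to a bin index in 0..max_bin-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / max_bin
    # The maximum value would land in bin `max_bin`, so clip it to the last bin.
    return [min(int((v - lo) / width), max_bin - 1) for v in values]


values = [0.5 * i for i in range(1000)]  # 1000 distinct feature values
binned = equal_width_bins(values, max_bin=16)

# A split can be placed between any two neighbouring distinct values.
print(len(set(values)) - 1)  # 999 candidate thresholds on the raw feature
print(len(set(binned)) - 1)  # 15 candidate thresholds after binning
```

Fewer candidate thresholds means less work per split, which is where the speed-up comes from. The price is coarser splits.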


For better accuracy

* Use bigger training data

* num_leaves : Setting it to a high value produces deeper trees with higher training accuracy, but may lead to overfitting. Hence very high values are not preferred.

* max_bin : Setting it to a high value has a similar effect to increasing num_leaves, and also slows down training.


In [1]:
!pip install lightgbm



In [52]:
#importing standard libraries 
import numpy as np 
import pandas as pd 
from pandas import Series, DataFrame 

#import lightgbm 
import lightgbm as lgb


#loading our training dataset 'adult.csv' with name 'data' using pandas 
data=pd.read_csv('data/adult.csv',header=None) 

#Assigning names to the columns 
data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income'] 

#glimpse of the dataset 
data.head() 


| | age | workclass | fnlwgt | education | education-num | marital_Status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [53]:
# Label Encoding our target variable 
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
l=LabelEncoder() 
l.fit(data.Income) 

l.classes_ 
data.Income=Series(l.transform(data.Income))  #label encoding our target variable 
data.Income.value_counts() 


0    24720
1     7841
Name: Income, dtype: int64
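For intuition, the mapping that `LabelEncoder` produces can be reproduced in plain Python: the unique class values are sorted and numbered from 0. The sketch below uses a small made-up list of labels (an assumption; the exact strings in your file may differ, for example by leading whitespace).

```python
# Toy labels standing in for the Income column (made-up sample).
labels = ["<=50K", ">50K", "<=50K", "<=50K", ">50K"]

# Sorted unique values play the role of LabelEncoder.classes_.
classes = sorted(set(labels))
mapping = {value: code for code, value in enumerate(classes)}
encoded = [mapping[value] for value in labels]

print(classes)  # ['<=50K', '>50K']
print(encoded)  # [0, 1, 0, 0, 1]
```

Because `<=50K` sorts before `>50K`, the lower income bracket becomes class 0 and the higher one class 1, which matches the value counts shown above.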

In [54]:
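#One Hot Encoding of the Categorical features 
#(the prefix keeps the dummy column names unique when the same value appears in more than one feature)
one_hot_workclass=pd.get_dummies(data.workclass,prefix='workclass') 
one_hot_education=pd.get_dummies(data.education,prefix='education') 
one_hot_marital_Status=pd.get_dummies(data.marital_Status,prefix='marital_Status') 
one_hot_occupation=pd.get_dummies(data.occupation,prefix='occupation')
one_hot_relationship=pd.get_dummies(data.relationship,prefix='relationship') 
one_hot_race=pd.get_dummies(data.race,prefix='race') 
one_hot_sex=pd.get_dummies(data.sex,prefix='sex') 
one_hot_native_country=pd.get_dummies(data.native_country,prefix='native_country') 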


In [55]:
#removing categorical features 
data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True) 


In [56]:
data.dropna(subset=['Income'], how='any', inplace = True)

In [57]:
#Merging one hot encoded features with our dataset 'data' 
data=pd.concat([data,one_hot_workclass,one_hot_education,one_hot_marital_Status,one_hot_occupation,one_hot_relationship,one_hot_race,one_hot_sex,one_hot_native_country],axis=1) 

#Separating our data into features dataset x and our target dataset y 
x=data.drop('Income',axis=1) 
y=data.Income 
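The encode, drop, and concatenate steps above can also be written as a single `pd.get_dummies` call on the whole frame. The sketch below shows this on a tiny made-up frame (`toy` and its values are assumptions for illustration, not rows from adult.csv).

```python
import pandas as pd

# Tiny made-up frame with two categorical columns and a target.
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "workclass": ["State-gov", "Private", "Private"],
    "sex": ["Male", "Male", "Female"],
    "Income": [0, 0, 1],
})

# Encode the listed columns in place; each dummy column is named <column>_<value>
# and the original categorical columns are dropped automatically.
encoded = pd.get_dummies(toy, columns=["workclass", "sex"])

x_toy = encoded.drop("Income", axis=1)
y_toy = encoded["Income"]
print(list(x_toy.columns))
```

A side benefit of this form is that the column prefix makes every dummy name unique, even when two features share a category value.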


In [58]:
#Now splitting our dataset into test and train 
from sklearn.model_selection import train_test_split 
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)
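The split above is random, so the numbers further down will vary from run to run. A possible variation, sketched here on made-up toy data, fixes `random_state` for reproducibility and passes `stratify` so that both parts keep roughly the same class balance. The seed 42 and the toy arrays are arbitrary assumptions.

```python
from sklearn.model_selection import train_test_split

# Made-up toy data: 20 samples, imbalanced binary target.
X_toy = [[i] for i in range(20)]
y_toy = [0] * 14 + [1] * 6

# random_state makes the split repeatable; stratify preserves the class ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))  # 14 6
```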

In [59]:
train_data=lgb.Dataset(x_train,label=y_train)

In [60]:
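#setting parameters for lightgbm
#note: with max_depth=7 a tree can have at most 2**7 = 128 leaves, so num_leaves=150 is effectively capped by max_depth
param = {'num_leaves':150, 'objective':'binary','max_depth':7,'learning_rate':.05,'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']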

In [61]:
#training our model using light gbm
num_round=50
lgbm=lgb.train(param,train_data,num_round)

In [62]:
#predicting on test set
ypred2=lgbm.predict(x_test)
ypred2[0:5]  # showing first 5 predictions

array([0.02060282, 0.20703644, 0.04716318, 0.14248623, 0.64111057])

In [63]:
#converting probabilities into 0 or 1 with a threshold of .5
#(vectorized, so it works for any test set size instead of a hard-coded row count)
ypred2=np.where(ypred2>=.5,1,0)

In [64]:
from sklearn.metrics import roc_auc_score


In [65]:
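#calculating roc_auc_score for lightgbm
#note: ypred2 holds hard 0/1 labels at this point; roc_auc_score is normally given
#the predicted probabilities, and thresholded labels usually yield a lower value
auc_lgb =  roc_auc_score(y_test,ypred2)
auc_lgb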

0.7627844872727022
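The score above was computed on predictions that had already been thresholded to 0 or 1. The pure-Python sketch below, on made-up toy data, shows why that matters: AUC measures how well the scores rank positives above negatives, and thresholding throws most of the ranking away. `pairwise_auc` is a hypothetical helper written for this illustration. It computes the same quantity as ROC AUC by counting correctly ordered positive/negative pairs.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities.
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.1, 0.2, 0.45, 0.3, 0.6, 0.9]
hard = [1 if p >= 0.5 else 0 for p in probs]

auc_probs = pairwise_auc(y_true, probs)  # 8 of 9 pairs ordered correctly
auc_hard = pairwise_auc(y_true, hard)    # 6 correct pairs plus 3 ties
print(round(auc_probs, 3), round(auc_hard, 3))  # 0.889 0.833
```

In the notebook, the equivalent change would be to keep the raw output of `lgbm.predict(x_test)` and pass it to `roc_auc_score` before thresholding.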

# Winning solutions in competitions using Light GBM

https://github.com/Microsoft/LightGBM/tree/master/examples

In [66]:
# Reference:
# https://lightgbm.readthedocs.io/en/latest/
# https://github.com/Microsoft/LightGBM/tree/master/examples
# https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/