# LightGBM Regressor

## Part 1 - Data Preprocessing

### Importing the dataset

In [33]:
import pandas as pd
dataset = pd.read_csv('insurance.csv')

In [34]:
dataset.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


### Checking missing data

In [35]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
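Every column reports 1338 non-null entries, so nothing is missing in this dataset. A more direct check is to count missing values per column. The sketch below uses a small made-up frame (the `demo` name and its values are illustrative, not taken from `insurance.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing bmi value (not from insurance.csv).
demo = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, np.nan, 33.0],
    "smoker": ["yes", "no", "no"],
})

# isnull() flags each missing cell; sum() counts the flags per column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```

On the real data, `dataset.isnull().sum()` would be expected to show zero for every column, consistent with the `info()` output above.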


### Handling categorical variables

Sex column

In [36]:
dataset['sex'].unique()

array(['female', 'male'], dtype=object)

In [37]:
dataset['sex'] = dataset['sex'].apply(lambda x: 0 if x == 'female' else 1)

In [38]:
dataset.head()

,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,yes,southwest,16884.924
1,18,1,33.77,1,no,southeast,1725.5523
2,28,1,33.0,3,no,southeast,4449.462
3,33,1,22.705,0,no,northwest,21984.47061
4,32,1,28.88,0,no,northwest,3866.8552


Smoker column

In [39]:
dataset['smoker'].unique()

array(['yes', 'no'], dtype=object)

In [40]:
dataset['smoker'] = dataset['smoker'].apply(lambda x: 0 if x == 'no' else 1)

In [41]:
dataset.head()

,age,sex,bmi,children,smoker,region,charges
0,19,0,27.9,0,1,southwest,16884.924
1,18,1,33.77,1,0,southeast,1725.5523
2,28,1,33.0,3,0,southeast,4449.462
3,33,1,22.705,0,0,northwest,21984.47061
4,32,1,28.88,0,0,northwest,3866.8552
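The two binary encodings above could also be written with `Series.map` and an explicit dictionary. This is only an alternative sketch on a made-up frame (`demo` is a hypothetical name), not the notebook's own method:

```python
import pandas as pd

# Hypothetical toy frame mirroring the two binary columns.
demo = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
})

# Explicit dictionaries document the encoding; values absent from a
# dictionary become NaN instead of silently falling into an "else" branch.
demo["sex"] = demo["sex"].map({"female": 0, "male": 1})
demo["smoker"] = demo["smoker"].map({"no": 0, "yes": 1})
print(demo)
```

One possible advantage of the dictionary form is that an unexpected category surfaces as a missing value rather than being encoded as 1.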


Region column

In [42]:
dataset['region'].unique()

array(['southwest', 'southeast', 'northwest', 'northeast'], dtype=object)

In [43]:
region_dummies = pd.get_dummies(dataset['region'], drop_first = True)

In [44]:
region_dummies

,northwest,southeast,southwest
0,False,False,True
1,False,True,False
2,False,True,False
3,True,False,False
4,True,False,False
...,...,...,...
1333,True,False,False
1334,False,False,False
1335,False,True,False
1336,False,False,True


In [45]:
dataset = pd.concat([region_dummies, dataset], axis = 1)

In [46]:
dataset.head()

,northwest,southeast,southwest,age,sex,bmi,children,smoker,region,charges
0,False,False,True,19,0,27.9,0,1,southwest,16884.924
1,False,True,False,18,1,33.77,1,0,southeast,1725.5523
2,False,True,False,28,1,33.0,3,0,southeast,4449.462
3,True,False,False,33,1,22.705,0,0,northwest,21984.47061
4,True,False,False,32,1,28.88,0,0,northwest,3866.8552


In [47]:
dataset.drop(['region'], axis = 1, inplace = True)

In [48]:
dataset['northwest'] = dataset['northwest'].astype(int)

In [49]:
dataset['southeast'] = dataset['southeast'].astype(int)

In [50]:
dataset['southwest'] = dataset['southwest'].astype(int)

In [51]:
dataset.head()

,northwest,southeast,southwest,age,sex,bmi,children,smoker,charges
0,0,0,1,19,0,27.9,0,1,16884.924
1,0,1,0,18,1,33.77,1,0,1725.5523
2,0,1,0,28,1,33.0,3,0,4449.462
3,1,0,0,33,1,22.705,0,0,21984.47061
4,1,0,0,32,1,28.88,0,0,3866.8552
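The region steps (dummy creation, concatenation, dropping the original column, converting booleans to integers) can be condensed. The sketch below assumes a pandas version whose `get_dummies` accepts `dtype`, and uses a made-up frame (`demo` and the last row's age are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with one row per region.
demo = pd.DataFrame({
    "region": ["southwest", "southeast", "northwest", "northeast"],
    "age": [19, 18, 33, 40],
})

# dtype=int yields 0/1 columns directly; drop_first=True drops the
# alphabetically first category ("northeast"), which becomes the baseline.
region_dummies = pd.get_dummies(demo["region"], drop_first=True, dtype=int)
encoded = pd.concat([region_dummies, demo.drop(columns="region")], axis=1)
print(encoded)
```

A row with zeros in all three dummy columns therefore represents the northeast region.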


### Creating the Training Set and the Test Set

Getting the inputs and output

In [52]:
X = dataset.iloc[:, :-1].values

In [53]:
y = dataset.iloc[:, -1].values

In [54]:
X

array([[ 0.  ,  0.  ,  1.  , ..., 27.9 ,  0.  ,  1.  ],
       [ 0.  ,  1.  ,  0.  , ..., 33.77,  1.  ,  0.  ],
       [ 0.  ,  1.  ,  0.  , ..., 33.  ,  3.  ,  0.  ],
       ...,
       [ 0.  ,  1.  ,  0.  , ..., 36.85,  0.  ,  0.  ],
       [ 0.  ,  0.  ,  1.  , ..., 25.8 ,  0.  ,  0.  ],
       [ 1.  ,  0.  ,  0.  , ..., 29.07,  0.  ,  1.  ]])

In [55]:
y

array([16884.924 ,  1725.5523,  4449.462 , ...,  1629.8335,  2007.945 ,
       29141.3603])
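The slicing works because `charges` is the last column: `iloc[:, :-1]` keeps every column except the last, and `iloc[:, -1]` keeps only the last. A minimal sketch on three rows (the `demo` frame is a hypothetical subset with fewer columns than the real dataset):

```python
import pandas as pd

# Hypothetical three-row frame; the target column is placed last.
demo = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, 33.77, 33.0],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# All columns except the last are features; the last column is the target.
X_demo = demo.iloc[:, :-1].values
y_demo = demo.iloc[:, -1].values
print(X_demo.shape, y_demo.shape)
```

If the target were not the last column, selecting it by name would probably be the safer choice.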

Getting the Training Set and the Test Set

In [56]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
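With `test_size = 0.2` and 1338 rows, the split should leave 1070 rows for training and 268 for testing. The sketch below checks this on placeholder arrays of the same shape (`X_demo` and `y_demo` are hypothetical stand-ins, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows (1338) and features (8).
X_demo = np.zeros((1338, 8))
y_demo = np.zeros(1338)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
print(X_tr.shape, X_te.shape)
```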

## Part 2 - Building and training the model

### Building the model

In [57]:
import lightgbm as lgb
model = lgb.LGBMRegressor()

### Training the model

In [58]:
model.fit(X_train, y_train)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000207 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 319
[LightGBM] [Info] Number of data points in the train set: 1070, number of used features: 8
[LightGBM] [Info] Start training from score 13201.182046


### Inference

In [59]:
y_pred = model.predict(X_test)

## Part 3 - Evaluating the model

### R-Squared

In [60]:
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)

In [61]:
r2

0.8875426023265389
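R-squared compares the model's squared error with that of always predicting the mean: it is 1 minus the ratio of the residual sum of squares to the total sum of squares. A minimal hand-rolled sketch follows (`r2_manual` is a hypothetical helper and the numbers are made up):

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R-squared as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1 - ss_res / ss_tot

# Made-up values for illustration only.
score = r2_manual([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(score)
```

A score of about 0.89 on the test set therefore means the model leaves roughly 11% of the variation in `charges` unexplained.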

### Adjusted R-Squared

In [62]:
k = X_test.shape[1]  # number of features (predictors)
n = len(X_test)      # number of test observations
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

In [63]:
adj_r2

0.8840690147536134
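Adjusted R-squared penalizes R-squared for the number of predictors, so it is always at most the plain R-squared. The sketch below wraps the formula in a function (`adjusted_r2` is a hypothetical helper) and plugs in this notebook's figures: 268 test rows (1338 minus 1070) and 8 features.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Values from this notebook: R-squared on the test set, 268 test rows, 8 features.
result = adjusted_r2(0.8875426023265389, n=268, k=8)
print(result)
```

With many observations and few features, as here, the penalty is small.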

### k-Fold Cross Validation

In [64]:
from sklearn.model_selection import cross_val_score
r2s = cross_val_score(estimator = model,
                      X = X,
                      y = y,
                      scoring = 'r2',
                      cv = 10)
print("R-Squared: {:.2f} %".format(r2s.mean()*100))
print("Standard Deviation: {:.2f} %".format(r2s.std()*100))

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000220 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 319
[LightGBM] [Info] Number of data points in the train set: 1204, number of used features: 8
[LightGBM] [Info] Start training from score 13180.577320
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000161 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 319
[LightGBM] [Info] Number of data points in the train set: 1204, number of used features: 8
[LightGBM] [Info] Start training from score 13281.975004
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000211 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 319
[LightGBM] [Info] Number of data points in the train set: 1204, number of used features: 8
[LightGBM] [Info] Start trai
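The training-set sizes in the log (1204 here, 1205 in later folds) follow from how 1338 rows are divided into 10 folds. For a regressor and an integer `cv`, scikit-learn uses an unshuffled `KFold`. The sketch below reproduces the sizes on row indices only (no model is trained):

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 1338  # number of rows in the dataset
kfold = KFold(n_splits=10)  # unshuffled, as used for an integer cv with a regressor

# Each split holds out one fold for validation and trains on the other nine.
train_sizes = [len(train_idx) for train_idx, _ in kfold.split(np.arange(n_samples))]
test_sizes = [n_samples - size for size in train_sizes]
print(train_sizes)
print(test_sizes)
```

Because the folds are not shuffled, any ordering in the CSV carries over into the folds. Passing a `KFold(..., shuffle=True, random_state=...)` object as `cv` is one way to avoid that.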

### Grid Search

In [65]:
# For an explanation of grid search, see 32:00 in the LightGBM video in the ML-LEVEL-2 folder.

In [66]:
# GridSearchCV tries every combination of the LGBMRegressor() hyperparameters listed in
# `parameters` (here num_leaves, learning_rate and n_estimators), scores each combination
# with 10-fold cross-validation, and keeps the combination with the best mean score.

# best_parameters holds the best hyperparameter combination.
# best_r2 holds the corresponding mean cross-validated R-squared.
# R-squared is multiplied by 100 below to report it as a percentage.
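Grid search is exhaustive, so its cost grows with the product of the list lengths. The sketch below uses scikit-learn's `ParameterGrid` to count the candidates for the grid used in this notebook (`n_candidates`, `n_folds`, and `n_fits` are hypothetical names):

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook: three hyperparameters with five values each.
parameters = [{'num_leaves': [29, 30, 31, 32, 33],
               'learning_rate': [0.08, 0.09, 0.1, 0.11, 0.12],
               'n_estimators': [80, 90, 100, 110, 120]}]

n_candidates = len(ParameterGrid(parameters))
n_folds = 10
n_fits = n_candidates * n_folds  # excludes the final refit on all data
print(n_candidates, n_fits)
```

This many model fits is why the search produces so much log output.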

In [67]:
from sklearn.model_selection import GridSearchCV
parameters = [{'num_leaves' : [29, 30, 31, 32, 33], 'learning_rate' : [0.08, 0.09, 0.1, 0.11, 0.12],
               'n_estimators' : [80, 90, 100, 110, 120]}]

grid_search = GridSearchCV(estimator = model,
                          param_grid = parameters,
                          scoring = 'r2',
                          cv = 10)

grid_search.fit(X, y)
best_parameters = grid_search.best_params_
best_r2 = grid_search.best_score_

Streaming output truncated to the last 5000 lines.
[LightGBM] [Info] Start training from score 13187.874066
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000225 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 319
[LightGBM] [Info] Number of data points in the train set: 1205, number of used features: 8
[LightGBM] [Info] Start training from score 13281.598979
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000192 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 319
[LightGBM] [Info] Number of data points in the train set: 1205, number of used features: 8
[LightGBM] [Info] Start training from score 13231.916176
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000158 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3

In [68]:
print("Best R-Squared : {:.2f} %".format(best_r2*100))
print("Best Parameters : ", best_parameters)

Best R-Squared : 84.89 %
Best Parameters :  {'learning_rate': 0.08, 'n_estimators': 80, 'num_leaves': 30}
