# LightGBM Classifier

## Part 1 - Data Preprocessing

### Importing the dataset

In [31]:
import pandas as pd
dataset = pd.read_csv('churn_modelling.csv')

In [32]:
dataset.head()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


### Checking missing data

In [33]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       10000 non-null  int64  
 1   Surname          10000 non-null  object 
 2   CreditScore      10000 non-null  int64  
 3   Geography        10000 non-null  object 
 4   Gender           10000 non-null  object 
 5   Age              10000 non-null  int64  
 6   Tenure           10000 non-null  int64  
 7   Balance          10000 non-null  float64
 8   NumOfProducts    10000 non-null  int64  
 9   HasCrCard        10000 non-null  int64  
 10  IsActiveMember   10000 non-null  int64  
 11  EstimatedSalary  10000 non-null  float64
 12  Exited           10000 non-null  int64  
dtypes: float64(2), int64(8), object(3)
memory usage: 1015.8+ KB


### Handling categorical variables

CustomerId and Surname columns

In [34]:
dataset.drop(['CustomerId', 'Surname'], axis = 1, inplace = True)

In [35]:
dataset.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


Geography column

In [36]:
dataset['Geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [37]:
geography_dummies = pd.get_dummies(dataset['Geography'], drop_first = True)

In [38]:
geography_dummies

Unnamed: 0,Germany,Spain
0,False,False
1,False,True
2,False,False
3,False,False
4,False,True
...,...,...
9995,False,False
9996,False,False
9997,False,False
9998,True,False


In [39]:
dataset = pd.concat([geography_dummies, dataset], axis = 1)

In [40]:
dataset.head()

Unnamed: 0,Germany,Spain,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,False,True,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,False,False,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,False,False,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,False,True,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [41]:
dataset['Germany'] = dataset['Germany'].astype(int)  # True/False -> 1/0

In [42]:
dataset['Spain'] = dataset['Spain'].astype(int)  # True/False -> 1/0

In [43]:
dataset.drop(['Geography'], axis = 1, inplace = True)

In [44]:
dataset.head()

Unnamed: 0,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,0,619,Female,42,2,0.0,1,1,1,101348.88,1
1,0,1,608,Female,41,1,83807.86,1,0,1,112542.58,0
2,0,0,502,Female,42,8,159660.8,3,1,0,113931.57,1
3,0,0,699,Female,39,1,0.0,2,0,0,93826.63,0
4,0,1,850,Female,43,2,125510.82,1,1,1,79084.1,0
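The Geography encoding above keeps only the `Germany` and `Spain` columns, with France as the implicit baseline. The following is a minimal plain-Python sketch of that idea, not the pandas implementation; the helper name `drop_first_dummies` and the four-value sample list are hypothetical and used only for illustration.

```python
def drop_first_dummies(values):
    """Illustrate drop-first dummy encoding on a list of category labels."""
    # Sort the distinct categories, then drop the first one as the baseline.
    categories = sorted(set(values))
    kept = categories[1:]
    # One 0/1 indicator per kept category for every input value.
    rows = [[1 if value == category else 0 for category in kept]
            for value in values]
    return kept, rows


# Hypothetical sample containing the three countries found in the dataset.
geography = ['France', 'Spain', 'France', 'Germany']
columns, rows = drop_first_dummies(geography)

print(columns)
for value, row in zip(geography, rows):
    print(value, row)
```

A row of all zeros therefore stands for the baseline category, which is why no separate France column is needed.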


Gender column

In [45]:
dataset['Gender'].unique()

array(['Female', 'Male'], dtype=object)

In [46]:
dataset['Gender'] = dataset['Gender'].map({'Female': 0, 'Male': 1})  # explicit mapping of the two categories

In [47]:
dataset.head()

Unnamed: 0,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,0,619,0,42,2,0.0,1,1,1,101348.88,1
1,0,1,608,0,41,1,83807.86,1,0,1,112542.58,0
2,0,0,502,0,42,8,159660.8,3,1,0,113931.57,1
3,0,0,699,0,39,1,0.0,2,0,0,93826.63,0
4,0,1,850,0,43,2,125510.82,1,1,1,79084.1,0


### Creating the Training Set and the Test Set

Getting the inputs and output

In [48]:
X = dataset.iloc[:, :-1].values

In [49]:
y = dataset.iloc[:, -1].values

In [50]:
X

array([[0.0000000e+00, 0.0000000e+00, 6.1900000e+02, ..., 1.0000000e+00,
        1.0000000e+00, 1.0134888e+05],
       [0.0000000e+00, 1.0000000e+00, 6.0800000e+02, ..., 0.0000000e+00,
        1.0000000e+00, 1.1254258e+05],
       [0.0000000e+00, 0.0000000e+00, 5.0200000e+02, ..., 1.0000000e+00,
        0.0000000e+00, 1.1393157e+05],
       ...,
       [0.0000000e+00, 0.0000000e+00, 7.0900000e+02, ..., 0.0000000e+00,
        1.0000000e+00, 4.2085580e+04],
       [1.0000000e+00, 0.0000000e+00, 7.7200000e+02, ..., 1.0000000e+00,
        0.0000000e+00, 9.2888520e+04],
       [0.0000000e+00, 0.0000000e+00, 7.9200000e+02, ..., 1.0000000e+00,
        0.0000000e+00, 3.8190780e+04]])

In [51]:
y

array([1, 0, 1, ..., 1, 1, 0])

Getting the Training Set and the Test Set

In [52]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Part 2 - Building and training the model

### Building the model

In [53]:
import lightgbm as lgb
model = lgb.LGBMClassifier()

### Training the model

In [54]:
model.fit(X_train, y_train)

[LightGBM] [Info] Number of positive: 1632, number of negative: 6368
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000421 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 856
[LightGBM] [Info] Number of data points in the train set: 8000, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.204000 -> initscore=-1.361479
[LightGBM] [Info] Start training from score -1.361479
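The `pavg` and `initscore` values in the log can be reproduced by hand. The sketch below assumes that the starting score for the binary objective is the log-odds of the positive class with LightGBM's default sigmoid scale of 1; the class counts are taken from the first log line.

```python
import math

# Class counts reported in the training log above.
n_positive = 1632
n_negative = 6368

# pavg is the fraction of positive labels in the training set.
pavg = n_positive / (n_positive + n_negative)

# Log-odds of the positive class (assumes the default sigmoid scale of 1).
init_score = math.log(pavg / (1.0 - pavg))

print(f"pavg = {pavg:.6f}")
print(f"initscore = {init_score:.6f}")
```

Under that assumption, boosting starts from a constant prediction equal to the base churn rate, and each tree then adjusts that score.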


### Inference

In [55]:
y_pred = model.predict(X_test)

### Predicting the result of a single observation

**Homework**

Use our model to predict whether the customer with the following information will leave the bank:

Geography: France

Credit Score: 600

Gender: Male

Age: 40 years old

Tenure: 3 years

Balance: \$ 60000

Number of Products: 2

Does this customer have a credit card? Yes

Is this customer an Active Member: Yes

Estimated Salary: \$ 50000

So, should we say goodbye to that customer?

**Solution**

In [56]:
print(model.predict([[0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]]))

[0]


Therefore, our model predicts that this customer will stay with the bank!

**Important note 1:** Notice that the feature values are wrapped in a double pair of square brackets. The `predict` method expects a 2D array, with one row per observation and one column per feature, and the double brackets turn our single observation into a 2D array with exactly one row.

**Important note 2:** Notice also that "France" was not entered as a string but as `0, 0` in the first two columns. The model was trained on the dummy variables `Germany` and `Spain`, and France is the category dropped by `drop_first = True`, so a French customer has both dummies set to 0. In the same way, "Male" is entered as `1` in the Gender column.
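The two notes can be combined into a small helper that builds the input row in the same column order as the training data. This is a minimal sketch: the function name `encode_customer` and its argument names are hypothetical, and it assumes the encodings used earlier in the notebook (France as the dropped baseline, Female as 0 and Male as 1).

```python
def encode_customer(geography, credit_score, gender, age, tenure, balance,
                    num_products, has_card, is_active, salary):
    """Return one feature row in the training column order (hypothetical helper)."""
    # France is the dropped baseline, so it maps to Germany = 0 and Spain = 0.
    germany = 1 if geography == "Germany" else 0
    spain = 1 if geography == "Spain" else 0
    # Gender was encoded as Female -> 0, Male -> 1.
    gender_code = 0 if gender == "Female" else 1
    return [germany, spain, credit_score, gender_code, age, tenure, balance,
            num_products, int(has_card), int(is_active), salary]


row = encode_customer("France", 600, "Male", 40, 3, 60000, 2, True, True, 50000)
observation = [row]  # the outer list makes this a 2D input with a single row
print(observation)
```

The resulting `observation` can then be passed to `model.predict` in place of the hand-written nested list.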

## Part 3 - Evaluating the model

### Making the Confusion Matrix

In [57]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[1506,   89],
       [ 184,  221]])

### Accuracy

In [58]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.8635
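The accuracy can be checked directly against the confusion matrix. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and also derives precision and recall for the churn class, which the notebook does not report.

```python
# Confusion matrix from the cell above: rows are actual, columns are predicted.
cm = [[1506, 89],
      [184, 221]]

tn, fp = cm[0]  # customers who actually stayed
fn, tp = cm[1]  # customers who actually left
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # share of all predictions that are correct
precision = tp / (tp + fp)     # share of predicted churners who really left
recall = tp / (tp + fn)        # share of real churners the model caught

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
```

Because only about one customer in five churns, accuracy alone can look strong while recall on the churn class stays much lower, so it is worth reading these numbers together.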

### k-Fold Cross Validation

In [59]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model,
                             X = X,
                             y = y,
                             scoring = 'accuracy',
                             cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

[LightGBM] [Info] Number of positive: 1833, number of negative: 7167
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001890 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 856
[LightGBM] [Info] Number of data points in the train set: 9000, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.203667 -> initscore=-1.363533
[LightGBM] [Info] Start training from score -1.363533
[LightGBM] [Info] Number of positive: 1833, number of negative: 7167
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000430 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 857
[LightGBM] [Info] Number of data points in the train set: 9000, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.203667 -> initscore=-1.363533
[LightGBM]
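The class counts in these logs follow from how the folds are built. With an integer `cv` and a classifier, `cross_val_score` uses stratified folds, so each fold keeps roughly the same churn rate. The sketch below is a back-of-the-envelope check, assuming the positives are spread as evenly as possible across the 10 folds; which fold receives which count is an assumption of the sketch.

```python
# Positive (Exited = 1) counts taken from the outputs above:
# 1632 in the training split plus 184 + 221 in the test split.
n_positive = 1632 + 184 + 221
n_folds = 10

# Assumption: per-class fold sizes differ by at most one.
base, extra = divmod(n_positive, n_folds)
test_positives = [base + 1 if i < extra else base for i in range(n_folds)]

# Each model trains on all positives except those held out in its test fold.
train_positives = [n_positive - held_out for held_out in test_positives]

print("Positives per held-out fold:", test_positives)
print("Positives per training set: ", train_positives)
```

The training counts produced this way can be compared with the "Number of positive" lines in the log.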

### Grid Search

In [60]:
from sklearn.model_selection import GridSearchCV
parameters = [{'num_leaves' : [29, 30, 31, 32, 33], 'learning_rate' : [0.08, 0.09, 0.1, 0.11, 0.12],
               'n_estimators' : [80, 90, 100, 110, 120]}]
grid_search = GridSearchCV(estimator = model,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10)
grid_search.fit(X, y)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

Streaming output truncated to the last 5000 lines.
[LightGBM] [Info] Start training from score -1.362848
[LightGBM] [Info] Number of positive: 1834, number of negative: 7166
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002477 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 855
[LightGBM] [Info] Number of data points in the train set: 9000, number of used features: 11
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.203778 -> initscore=-1.362848
[LightGBM] [Info] Start training from score -1.362848
[LightGBM] [Info] Number of positive: 1833, number of negative: 7167
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000577 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 856
[LightGBM] [Info] Number of data points in the train set: 9000,

In [61]:
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)

Best Accuracy: 86.48 %
Best Parameters: {'learning_rate': 0.09, 'n_estimators': 90, 'num_leaves': 32}
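To get a feel for the cost of this search, the number of model fits can be counted from the grid itself. The sketch below copies the grid from the cell above and assumes the default `refit = True`, which adds one final fit on the full data with the best parameters.

```python
from itertools import product

# Same grid as in the GridSearchCV cell above.
param_grid = {
    'num_leaves': [29, 30, 31, 32, 33],
    'learning_rate': [0.08, 0.09, 0.1, 0.11, 0.12],
    'n_estimators': [80, 90, 100, 110, 120],
}
cv_folds = 10

# Every combination of the three hyperparameters is evaluated on every fold.
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds

print("Parameter combinations:", len(combinations))
print("Cross-validation fits: ", n_fits)
print("Including final refit: ", n_fits + 1)
```

Each of those fits prints its own block of LightGBM messages, which explains the truncated output above. Constructing the model with `verbose = -1` should silence most of these messages.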
