# CatBoost Classifier

## Part 1 - Data Preprocessing

### Importing the dataset

In [1]:
import pandas as pd
dataset = pd.read_csv('churn_modelling.csv')

In [2]:
dataset.head()

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,15701354,Boni,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


### Checking missing data

In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CustomerId       10000 non-null  int64  
 1   Surname          10000 non-null  object 
 2   CreditScore      10000 non-null  int64  
 3   Geography        10000 non-null  object 
 4   Gender           10000 non-null  object 
 5   Age              10000 non-null  int64  
 6   Tenure           10000 non-null  int64  
 7   Balance          10000 non-null  float64
 8   NumOfProducts    10000 non-null  int64  
 9   HasCrCard        10000 non-null  int64  
 10  IsActiveMember   10000 non-null  int64  
 11  EstimatedSalary  10000 non-null  float64
 12  Exited           10000 non-null  int64  
dtypes: float64(2), int64(8), object(3)
memory usage: 1015.8+ KB
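
The `info()` output reports 10000 non-null entries in every column, so no values are missing. A more direct check is to count the nulls per column. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "CreditScore": [619, 608, None],
    "Geography": ["France", "Spain", "France"],
})

# isna() marks missing cells; sum() counts them column by column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True when at least one cell in the whole frame is missing
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

On the real data, the same call would be `dataset.isna().sum()`, which should return zero for every column given the `info()` output above.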


### Handling categorical variables

CustomerId and Surname columns

In [4]:
dataset.drop(['CustomerId', 'Surname'], axis = 1, inplace = True)

In [5]:
dataset.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


Geography column

In [6]:
dataset['Geography'].unique()

array(['France', 'Spain', 'Germany'], dtype=object)

In [7]:
geography_dummies = pd.get_dummies(dataset['Geography'], drop_first = True)

In [8]:
geography_dummies

Unnamed: 0,Germany,Spain
0,False,False
1,False,True
2,False,False
3,False,False
4,False,True
...,...,...
9995,False,False
9996,False,False
9997,False,False
9998,True,False


In [9]:
dataset = pd.concat([geography_dummies, dataset], axis = 1)

In [10]:
dataset.head()

Unnamed: 0,Germany,Spain,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,False,True,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,False,False,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,False,False,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,False,True,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0


In [11]:
dataset.drop(['Geography'], axis = 1, inplace = True)

In [12]:
dataset.head()

Unnamed: 0,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,619,Female,42,2,0.0,1,1,1,101348.88,1
1,False,True,608,Female,41,1,83807.86,1,0,1,112542.58,0
2,False,False,502,Female,42,8,159660.8,3,1,0,113931.57,1
3,False,False,699,Female,39,1,0.0,2,0,0,93826.63,0
4,False,True,850,Female,43,2,125510.82,1,1,1,79084.1,0


Gender column

In [13]:
dataset['Gender'].unique()

array(['Female', 'Male'], dtype=object)

In [14]:
dataset['Gender'] = dataset['Gender'].map({'Female': 0, 'Male': 1})

In [15]:
dataset.head()

Unnamed: 0,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,False,False,619,0,42,2,0.0,1,1,1,101348.88,1
1,False,True,608,0,41,1,83807.86,1,0,1,112542.58,0
2,False,False,502,0,42,8,159660.8,3,1,0,113931.57,1
3,False,False,699,0,39,1,0.0,2,0,0,93826.63,0
4,False,True,850,0,43,2,125510.82,1,1,1,79084.1,0


In [16]:
dataset['Germany'] = dataset['Germany'].astype(int)

In [17]:
dataset['Spain'] = dataset['Spain'].astype(int)

In [18]:
dataset.head()

Unnamed: 0,Germany,Spain,CreditScore,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,0,619,0,42,2,0.0,1,1,1,101348.88,1
1,0,1,608,0,41,1,83807.86,1,0,1,112542.58,0
2,0,0,502,0,42,8,159660.8,3,1,0,113931.57,1
3,0,0,699,0,39,1,0.0,2,0,0,93826.63,0
4,0,1,850,0,43,2,125510.82,1,1,1,79084.1,0
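
The encoding steps above (dummy columns for Geography, 0/1 for Gender, booleans converted to integers) can also be done in one call. This is a minimal sketch on a made-up frame, not the notebook's method: `toy` is hypothetical, and note that `get_dummies` with `columns=` appends the new columns at the end and prefixes their names, so the resulting column order differs from the one used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with the two categorical columns and one numeric column
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Female"],
    "Age": [42, 41, 39],
})

# drop_first=True drops the first category of each column (France, Female),
# dtype=int yields 0/1 instead of False/True
encoded = pd.get_dummies(
    toy,
    columns=["Geography", "Gender"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```

Because the column order changes, any hand-built input to `predict` would have to follow the new order if this variant were used for training.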


### Creating the Training Set and the Test Set

Getting the inputs and output

In [19]:
X = dataset.iloc[:, :-1].values

In [20]:
y = dataset.iloc[:, -1].values

In [21]:
X

array([[0.0000000e+00, 0.0000000e+00, 6.1900000e+02, ..., 1.0000000e+00,
        1.0000000e+00, 1.0134888e+05],
       [0.0000000e+00, 1.0000000e+00, 6.0800000e+02, ..., 0.0000000e+00,
        1.0000000e+00, 1.1254258e+05],
       [0.0000000e+00, 0.0000000e+00, 5.0200000e+02, ..., 1.0000000e+00,
        0.0000000e+00, 1.1393157e+05],
       ...,
       [0.0000000e+00, 0.0000000e+00, 7.0900000e+02, ..., 0.0000000e+00,
        1.0000000e+00, 4.2085580e+04],
       [1.0000000e+00, 0.0000000e+00, 7.7200000e+02, ..., 1.0000000e+00,
        0.0000000e+00, 9.2888520e+04],
       [0.0000000e+00, 0.0000000e+00, 7.9200000e+02, ..., 1.0000000e+00,
        0.0000000e+00, 3.8190780e+04]])

In [22]:
y

array([1, 0, 1, ..., 1, 1, 0])

Getting the Training Set and the Test Set

In [23]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
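
With `test_size = 0.2` and 10000 rows, the split should leave 8000 observations for training and 2000 for testing. The sketch below checks this on synthetic data of the same shape (`X_demo` and `y_demo` are random stand-ins, not the churn data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the notebook's data:
# 10000 rows and 11 feature columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10000, 11))
y_demo = rng.integers(0, 2, size=10000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell gives the same partition.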

## Part 2 - Building and training the model

### Building the model

In [24]:
!pip install catboost



In [25]:
import catboost as cb
model = cb.CatBoostClassifier()

### Training the model

In [26]:
model.fit(X_train, y_train)

Learning rate set to 0.025035
0:	learn: 0.6715999	total: 51.6ms	remaining: 51.5s
1:	learn: 0.6523077	total: 55.3ms	remaining: 27.6s
2:	learn: 0.6343125	total: 59ms	remaining: 19.6s
3:	learn: 0.6177234	total: 67.5ms	remaining: 16.8s
4:	learn: 0.6020753	total: 71.3ms	remaining: 14.2s
5:	learn: 0.5879791	total: 74.7ms	remaining: 12.4s
6:	learn: 0.5754312	total: 79.6ms	remaining: 11.3s
7:	learn: 0.5639214	total: 88ms	remaining: 10.9s
8:	learn: 0.5531841	total: 92.6ms	remaining: 10.2s
9:	learn: 0.5433279	total: 99.3ms	remaining: 9.83s
10:	learn: 0.5316836	total: 106ms	remaining: 9.52s
11:	learn: 0.5232186	total: 110ms	remaining: 9.09s
12:	learn: 0.5128050	total: 114ms	remaining: 8.68s
13:	learn: 0.5025235	total: 119ms	remaining: 8.38s
14:	learn: 0.4939152	total: 123ms	remaining: 8.06s
15:	learn: 0.4847706	total: 127ms	remaining: 7.8s
16:	learn: 0.4784245	total: 131ms	remaining: 7.55s
17:	learn: 0.4709868	total: 135ms	remaining: 7.37s
18:	learn: 0.4643611	total: 140ms	remaining: 7.23s
...

<catboost.core.CatBoostClassifier at 0x7ba6201285e0>

### Inference

In [27]:
y_pred = model.predict(X_test)

### Predicting the result of a single observation

**Homework**

Use our model to predict whether the customer with the following information will leave the bank:

Geography: France

Credit Score: 600

Gender: Male

Age: 40 years old

Tenure: 3 years

Balance: \$ 60000

Number of Products: 2

Does this customer have a credit card? Yes

Is this customer an Active Member: Yes

Estimated Salary: \$ 50000

So, should we say goodbye to that customer?

**Solution**

In [28]:
print(model.predict([[0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]]))

[0]


Therefore, our model predicts that this customer stays with the bank!

**Important note 1:** Notice that the feature values were all entered inside a double pair of square brackets. That's because the "predict" method is given a 2D array here, with one row per observation and one column per feature. The double pair of square brackets makes the input a 2D array with a single row.

**Important note 2:** Notice also that "France" was not entered as a string but as "0, 0" in the first two columns. That's because the "predict" method expects the same dummy encoding of the Geography variable that the model was trained on, and the Germany and Spain dummy columns come first in the feature matrix.
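
The two notes above can be sketched as a small helper that turns raw customer values into the row passed to "predict". The function `encode_customer` and the `FEATURE_ORDER` list are hypothetical names introduced for illustration; the column order is the one shown in the `dataset.head()` output of Part 1.

```python
# Column order of the feature matrix X, as shown by dataset.head() in Part 1
FEATURE_ORDER = [
    "Germany", "Spain", "CreditScore", "Gender", "Age", "Tenure",
    "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember",
    "EstimatedSalary",
]


def encode_customer(customer):
    """Hypothetical helper: map raw values to a 2D list with one row."""
    row = dict(customer)
    geography = row.pop("Geography")
    # France is the dropped category, so it becomes Germany=0, Spain=0
    row["Germany"] = 1 if geography == "Germany" else 0
    row["Spain"] = 1 if geography == "Spain" else 0
    # Same convention as in Part 1: Female -> 0, Male -> 1
    row["Gender"] = 0 if row["Gender"] == "Female" else 1
    return [[row[name] for name in FEATURE_ORDER]]


customer = {
    "Geography": "France", "CreditScore": 600, "Gender": "Male",
    "Age": 40, "Tenure": 3, "Balance": 60000, "NumOfProducts": 2,
    "HasCrCard": 1, "IsActiveMember": 1, "EstimatedSalary": 50000,
}

features = encode_customer(customer)
print(features)
```

The resulting list matches the one typed by hand in the solution cell, so it could be passed as `model.predict(features)`.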

## Part 3 - Evaluating the model

### Making the Confusion Matrix

In [29]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[1514,   81],
       [ 189,  216]])

### Accuracy

In [30]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

0.865
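
The accuracy above can be recovered by hand from the confusion matrix. scikit-learn places the true labels on the rows and the predicted labels on the columns, so the matrix reads `[[TN, FP], [FN, TP]]`. The sketch below also derives precision and recall for the "Exited" class from the same four counts; these two metrics are not computed in the notebook and are shown only as a worked illustration.

```python
# Counts copied from the confusion matrix above: rows are true labels,
# columns are predicted labels
tn, fp = 1514, 81
fn, tp = 189, 216

total = tn + fp + fn + tp

# Share of all test observations that were classified correctly
accuracy = (tn + tp) / total
# Of the customers predicted to leave, the share that actually left
precision = tp / (tp + fp)
# Of the customers who actually left, the share the model caught
recall = tp / (tp + fn)

print(f"total={total} accuracy={accuracy:.3f}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

The accuracy matches the 0.865 returned by `accuracy_score`, while the lower recall shows that a sizeable share of the customers who left were predicted to stay.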

### k-Fold Cross Validation

In [31]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model,
                             X = X,
                             y = y,
                             scoring = 'accuracy',
                             cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Streaming output truncated to the last 5000 lines.
6:	learn: 0.5690671	total: 29ms	remaining: 4.12s
7:	learn: 0.5564710	total: 33.5ms	remaining: 4.16s
8:	learn: 0.5452966	total: 37.2ms	remaining: 4.1s
9:	learn: 0.5333038	total: 40.9ms	remaining: 4.04s
10:	learn: 0.5214061	total: 45ms	remaining: 4.05s
11:	learn: 0.5126884	total: 51ms	remaining: 4.2s
12:	learn: 0.5022614	total: 55.1ms	remaining: 4.19s
13:	learn: 0.4922916	total: 58.8ms	remaining: 4.14s
14:	learn: 0.4835090	total: 62.8ms	remaining: 4.12s
15:	learn: 0.4762829	total: 66.4ms	remaining: 4.09s
16:	learn: 0.4696501	total: 70.1ms	remaining: 4.05s
17:	learn: 0.4621897	total: 73.8ms	remaining: 4.03s
18:	learn: 0.4554540	total: 78ms	remaining: 4.03s
19:	learn: 0.4492361	total: 81.9ms	remaining: 4.01s
20:	learn: 0.4434998	total: 85.6ms	remaining: 3.99s
21:	learn: 0.4379481	total: 89.3ms	remaining: 3.97s
22:	learn: 0.4323158	total: 92.9ms	remaining: 3.94s
23:	learn: 0.4273168	total: 97ms	remaining: 3.94s
...
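
With `cv = 10`, `cross_val_score` fits a fresh copy of the estimator ten times, each time training on nine folds and scoring on the remaining one. Every fit prints its own training log, which is why the output above is truncated. The sketch below shows the same call on synthetic data with a decision tree as a stand-in estimator (both are assumptions made to keep the example small and quiet). With CatBoost, passing `verbose=0` to the `CatBoostClassifier` constructor should silence the per-iteration log.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary problem, a stand-in for the churn data
X_demo, y_demo = make_classification(
    n_samples=500, n_features=11, random_state=0
)

# Stand-in estimator; the notebook uses a CatBoostClassifier here
stand_in = DecisionTreeClassifier(max_depth=3, random_state=0)

# One accuracy value per fold
accuracies = cross_val_score(
    estimator=stand_in, X=X_demo, y=y_demo, scoring="accuracy", cv=10
)

print("Accuracy: {:.2f} %".format(accuracies.mean() * 100))
print("Standard Deviation: {:.2f} %".format(accuracies.std() * 100))
```

The mean summarizes the expected accuracy, and the standard deviation indicates how much it varies from fold to fold.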

### Grid Search

In [32]:
from sklearn.model_selection import GridSearchCV
parameters = [{'learning_rate': [0.001,0.005,0.01], 'depth': [4,7,10], 'l2_leaf_reg': [2,6,10], 'random_strength': [0,5,10]}]
grid_search = GridSearchCV(estimator = model,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10)
grid_search.fit(X, y)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

Streaming output truncated to the last 5000 lines.
10:	learn: 0.6322005	total: 130ms	remaining: 11.7s
11:	learn: 0.6278743	total: 145ms	remaining: 11.9s
12:	learn: 0.6227579	total: 160ms	remaining: 12.1s
13:	learn: 0.6177524	total: 174ms	remaining: 12.3s
14:	learn: 0.6148701	total: 177ms	remaining: 11.6s
15:	learn: 0.6097429	total: 192ms	remaining: 11.8s
16:	learn: 0.6051447	total: 207ms	remaining: 12s
17:	learn: 0.6004808	total: 210ms	remaining: 11.5s
18:	learn: 0.5970499	total: 238ms	remaining: 12.3s
19:	learn: 0.5929140	total: 253ms	remaining: 12.4s
20:	learn: 0.5874707	total: 269ms	remaining: 12.5s
21:	learn: 0.5826019	total: 283ms	remaining: 12.6s
22:	learn: 0.5789119	total: 298ms	remaining: 12.7s
23:	learn: 0.5755222	total: 313ms	remaining: 12.7s
24:	learn: 0.5712348	total: 328ms	remaining: 12.8s
25:	learn: 0.5682067	total: 334ms	remaining: 12.5s
26:	learn: 0.5642657	total: 349ms	remaining: 12.6s
27:	learn: 0.5599277	total: 365ms	remaining: 12.7s
...

10:	learn: 0.6219335	total: 68.6ms	remaining: 6.17s
11:	learn: 0.6172023	total: 82.2ms	remaining: 6.77s
12:	learn: 0.6106574	total: 88.9ms	remaining: 6.75s
13:	learn: 0.6047622	total: 94.6ms	remaining: 6.66s
14:	learn: 0.6003176	total: 102ms	remaining: 6.69s
15:	learn: 0.5948771	total: 109ms	remaining: 6.69s
16:	learn: 0.5895835	total: 116ms	remaining: 6.68s
17:	learn: 0.5847849	total: 122ms	remaining: 6.67s
18:	learn: 0.5807803	total: 129ms	remaining: 6.64s
19:	learn: 0.5760952	total: 135ms	remaining: 6.6s
20:	learn: 0.5704401	total: 141ms	remaining: 6.58s
21:	learn: 0.5665981	total: 151ms	remaining: 6.7s
22:	learn: 0.5614031	total: 158ms	remaining: 6.71s
23:	learn: 0.5585152	total: 162ms	remaining: 6.59s
24:	learn: 0.5548994	total: 167ms	remaining: 6.51s
25:	learn: 0.5520721	total: 170ms	remaining: 6.36s
26:	learn: 0.5479810	total: 178ms	remaining: 6.41s
27:	learn: 0.5457611	total: 181ms	remaining: 6.29s
28:	learn: 0.5423998	total: 187ms	remaining: 6.27s
...

In [33]:
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)

Best Accuracy: 86.62 %
Best Parameters: {'depth': 7, 'l2_leaf_reg': 2, 'learning_rate': 0.01, 'random_strength': 5}
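
The grid search is expensive because every parameter combination is cross-validated. As a rough sketch of the cost, the snippet below counts the candidates in the grid used above and multiplies by the number of folds. The variable names are illustrative. By default `GridSearchCV` also refits the best candidate once on the full data at the end, which adds one more fit.

```python
from itertools import product

# Same grid as in the GridSearchCV cell above
parameters = {
    "learning_rate": [0.001, 0.005, 0.01],
    "depth": [4, 7, 10],
    "l2_leaf_reg": [2, 6, 10],
    "random_strength": [0, 5, 10],
}
cv_folds = 10

# Every combination of one value per parameter is one candidate model
candidates = list(product(*parameters.values()))
n_candidates = len(candidates)

# Each candidate is trained once per fold
n_cv_fits = n_candidates * cv_folds

print(n_candidates, n_cv_fits)
```

With 81 candidates and 10 folds, the search trains 810 models before the final refit, so narrowing the grid or lowering `cv` is the most direct way to shorten the run.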
