# Bank customers churn

Clients started leaving Bank. Every month. Bank marketers said it was cheaper to retain current customers than to attract new ones.

We need to build the model which predict: the client will leave the bank in the near future or not. We have historical data on customer behavior and termination of agreements with the bank.

Build the model with the highest possible *F1* (min 0.59).
Use *AUC-ROC* and compare it with *F1*.


In [1]:
#library import
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from joblib import dump
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, mean_squared_error,mean_absolute_error, accuracy_score,r2_score, confusion_matrix, recall_score,roc_auc_score, precision_score, f1_score
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle

# 1. About data

In [2]:
#check data
churn_data = pd.read_csv('Churn.csv')
display(churn_data.head())

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [3]:
churn_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


10000 entries and 14 columns.

Feauters:

    RowNumber — index
    CustomerId — id
    Surname — surname
    CreditScore — credit score
    Geography — country of residence
    Gender — gender
    Age — age
    Tenure — number of property
    Balance — money balance
    NumOfProducts — number of using bank products
    HasCrCard — has credit card or not
    IsActiveMember — active client or not
    EstimatedSalary — estimated salary

Target:

    Exited — leave or not
    
We should do str.lower for columns name, change some columns types, delete wrong column and check nulls.
   

In [4]:
#str.lower
churn_data.columns = churn_data.columns.str.lower()

#check nulls
churn_data.isnull().sum()

rownumber            0
customerid           0
surname              0
creditscore          0
geography            0
gender               0
age                  0
tenure             909
balance              0
numofproducts        0
hascrcard            0
isactivemember       0
estimatedsalary      0
exited               0
dtype: int64

In [5]:
#change column type и fill 0
churn_data['tenure'] = churn_data['tenure'].fillna(0).astype('int')
churn_data.isnull().sum()

rownumber          0
customerid         0
surname            0
creditscore        0
geography          0
gender             0
age                0
tenure             0
balance            0
numofproducts      0
hascrcard          0
isactivemember     0
estimatedsalary    0
exited             0
dtype: int64

In [6]:
#delete columns
del churn_data['rownumber']
del churn_data['surname']
churn_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   customerid       10000 non-null  int64  
 1   creditscore      10000 non-null  int64  
 2   geography        10000 non-null  object 
 3   gender           10000 non-null  object 
 4   age              10000 non-null  int64  
 5   tenure           10000 non-null  int32  
 6   balance          10000 non-null  float64
 7   numofproducts    10000 non-null  int64  
 8   hascrcard        10000 non-null  int64  
 9   isactivemember   10000 non-null  int64  
 10  estimatedsalary  10000 non-null  float64
 11  exited           10000 non-null  int64  
dtypes: float64(2), int32(1), int64(7), object(2)
memory usage: 898.6+ KB


In [7]:
#check duplicates
churn_data.duplicated().sum()

0

#### Сonclusion

So, now we have 10000 entries and 12 columns without nulls.

We delete rownumber, because we have standart index and id.

Tenure in dictionary mean "type of property", but here it is "how many property customer has". We canot check true or false it is.
May be, we can fill nulls for median, but 0 is better, because we hevan't more data.

Also we don't need surname column.

# 2. Research

In [8]:
#get_dummies
bank_data = pd.get_dummies(churn_data, drop_first=True)
bank_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   customerid         10000 non-null  int64  
 1   creditscore        10000 non-null  int64  
 2   age                10000 non-null  int64  
 3   tenure             10000 non-null  int32  
 4   balance            10000 non-null  float64
 5   numofproducts      10000 non-null  int64  
 6   hascrcard          10000 non-null  int64  
 7   isactivemember     10000 non-null  int64  
 8   estimatedsalary    10000 non-null  float64
 9   exited             10000 non-null  int64  
 10  geography_Germany  10000 non-null  uint8  
 11  geography_Spain    10000 non-null  uint8  
 12  gender_Male        10000 non-null  uint8  
dtypes: float64(2), int32(1), int64(7), uint8(3)
memory usage: 771.6 KB


In [9]:
#features&target
features = bank_data.drop(['exited', 'customerid'], axis=1)
target = bank_data['exited']

#to numeric
numeric = ['tenure','age','creditscore', 'balance', 'estimatedsalary']

scaler = StandardScaler()
scaler.fit(features[numeric])
features[numeric] = scaler.transform(features[numeric])
display(features.shape)

(10000, 11)

In [10]:
numeric

['tenure', 'age', 'creditscore', 'balance', 'estimatedsalary']

In [11]:
#for two
features_train_valid, features_test, target_train_valid, target_test = train_test_split(
    features, target, test_size=0.20, random_state=12345)

#and three
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train_valid, target_train_valid, test_size=0.250, random_state=12345)

print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(6000, 11)
(2000, 11)
(2000, 11)


In [12]:
#check accuracy
target_pred_constant = pd.Series(0, index=target.index)
display(accuracy_score(target, target_pred_constant))

0.7963

#### Сonclusion

We have train, valid ad test.

Do get_dummies and numeric.

Accuracy is 0.80 - very good.

In [13]:
#depth params Decision Tree Classifier
for depth in range(1, 11):
    model =  DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_valid, target_valid)
    predictions_valid = model.predict(features_valid)
    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))   

max_depth = 1 : 0.8045
max_depth = 2 : 0.8325
max_depth = 3 : 0.844
max_depth = 4 : 0.854
max_depth = 5 : 0.865
max_depth = 6 : 0.8765
max_depth = 7 : 0.897
max_depth = 8 : 0.912
max_depth = 9 : 0.934
max_depth = 10 : 0.953


***I think here max_depth = 5 is enough***

In [14]:
#n_estimators params Random Forest Classifier
for est in range(1, 11):
    model =  RandomForestClassifier(random_state=12345, n_estimators = est)
    model.fit(features_valid, target_valid)
    predictions_valid = model.predict(features_valid)
    print("max_est =", est, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))

max_est = 1 : 0.9255
max_est = 2 : 0.9315
max_est = 3 : 0.964
max_est = 4 : 0.9555
max_est = 5 : 0.98
max_est = 6 : 0.969
max_est = 7 : 0.985
max_est = 8 : 0.979
max_est = 9 : 0.991
max_est = 10 : 0.9815


***5 is enough too.***

In [15]:
print("Accuracy")

#check accuracy, Decision Tree Classifier
model = DecisionTreeClassifier(random_state=12345, max_depth=21)
model.fit(features_train, target_train)
predicted_valid =  model.predict(features_valid)
accuracy_valid = accuracy_score(target_valid, predicted_valid)
print("Decision Tree Classifier:",accuracy_valid)

#check accuracy, Logistic Regression 
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
accuracy_valid = accuracy_score(target_valid, predicted_valid)
print("Logistic Regression:",accuracy_valid)

#check accuracy, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
accuracy_valid = accuracy_score(target_valid, predicted_valid)
print("Random Forest Classifier:", accuracy_valid)

Accuracy
Decision Tree Classifier: 0.785
Logistic Regression: 0.8155
Random Forest Classifier: 0.8615


***Random Forest Classifier win. Logistic Regression is second.***

In [16]:
#confusion matrix, Decision Tree Classifier
model = DecisionTreeClassifier(random_state=12345, max_depth=21)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
display("confusion matrix, Decision Tree Classifier:", confusion_matrix(target_valid, predicted_valid))

#confusion matrix, Logistic Regression 
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
display("confusion matrix, Logistic Regression:",confusion_matrix(target_valid, predicted_valid))

#confusion matrix, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators =21)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
display("confusion matrix, Random Forest Classifier:", confusion_matrix(target_valid, predicted_valid))

'confusion matrix, Decision Tree Classifier:'

array([[1383,  226],
       [ 204,  187]], dtype=int64)

'confusion matrix, Logistic Regression:'

array([[1550,   59],
       [ 310,   81]], dtype=int64)

'confusion matrix, Random Forest Classifier:'

array([[1539,   70],
       [ 207,  184]], dtype=int64)

In [17]:
print("Recall")

#recall score, Decision Tree Classifier
model = DecisionTreeClassifier(random_state=12345, max_depth=21)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("Decision Tree Classifier:",recall_score(target_valid, predicted_valid))

#recall score, Logistic Regression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("Logistic Regression:",recall_score(target_valid, predicted_valid))

#recall score, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("Random Forest Classifier:", recall_score(target_valid, predicted_valid))

Recall
Decision Tree Classifier: 0.4782608695652174
Logistic Regression: 0.2071611253196931
Random Forest Classifier: 0.47058823529411764


In [18]:
print("Precision")

#Precision score, Decision Tree Classifier
model = DecisionTreeClassifier(random_state=12345, max_depth=21)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("Decision Tree Classifier:", precision_score(target_valid, predicted_valid))

#Precision score, Logistic Regression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("Logistic Regression:",precision_score(target_valid, predicted_valid))

#Precision score, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("Random Forest Classifier:", precision_score(target_valid, predicted_valid))

Precision
Decision Tree Classifier: 0.45278450363196127
Logistic Regression: 0.5785714285714286
Random Forest Classifier: 0.7244094488188977


In [19]:
print("Auc_roc")

#Auc_roc, Decision Tree Classifier
model =  DecisionTreeClassifier(random_state=12345, max_depth=21)
model.fit(features_train, target_train)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Decision Tree Classifier:",auc_roc)

#Auc_roc, Logistic Regression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Logistic Regression:",auc_roc)

#Auc_roc, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21)
model.fit(features_train, target_train)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Random Forest Classifier:", auc_roc)

Auc_roc
Decision Tree Classifier: 0.6707077675288777
Logistic Regression: 0.7707874027012378
Random Forest Classifier: 0.8219907521470501


In [20]:
print("F-1")

#f-1,Decision Tree Classifier
model = DecisionTreeClassifier(random_state=12345, max_depth=21)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
f1 = f1_score(target_valid, predicted_valid)
print("Decision Tree Classifier:", f1)

#f-1, Logistic Regression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
f1 = f1_score(target_valid, predicted_valid)
print("Logistic Regression:", f1)

#f-1, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
f1 = f1_score(target_valid, predicted_valid)
print("Random Forest Classifier:", f1)

F-1
Decision Tree Classifier: 0.4651741293532339
Logistic Regression: 0.3050847457627119
Random Forest Classifier: 0.5705426356589146


#### Сonclusion

Random Forest Classifier win again. Not big: f1- 0.57 and auc_roc - 0.82.

Let's go to disbalance

# 3. What to do with disbalance?

In [21]:
#balance f-1, Logistic Regression
model = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("Logistic Regression F1:", f1_score(target_valid, predicted_valid))

#balance f-1, Decision Tree Classifier
model =  DecisionTreeClassifier(random_state=12345, max_depth=21, class_weight='balanced')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("Decision Tree Classifier F1:", f1_score(target_valid, predicted_valid))

#balance f-1, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21, class_weight='balanced')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print("Случайный лес F1:", f1_score(target_valid, predicted_valid))

Logistic Regression F1: 0.47763864042933807
Decision Tree Classifier F1: 0.45033112582781454
Случайный лес F1: 0.561128526645768


In [22]:
print("Auc_roc, balance")

#balance auc_roc, Decision Tree Classifier
model =  DecisionTreeClassifier(random_state=12345, max_depth=21, class_weight='balanced')
model.fit(features_train, target_train)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Decision Tree Classifier:",auc_roc)

#balance Auc_roc, Logistic Regression
model = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model.fit(features_train, target_train)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Logistic Regression:",auc_roc)

#balance Auc_roc, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21, class_weight='balanced')
model.fit(features_train, target_train)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Random Forest Classifier:", auc_roc)

Auc_roc, balance
Decision Tree Classifier: 0.657761091303871
Logistic Regression: 0.7729586930294586
Random Forest Classifier: 0.8253120633775168


#### Сonclusion


Random Forest Classifier is win again, but F1 only 0.56. 
Let's check up- and downsampling.

In [23]:
#upsampling
def upsample(features, target, repeat):
    features_zeros = features_train[target_train == 0]
    features_ones = features_train[target_train == 1]
    target_zeros = target_train[target_train == 0]
    target_ones = target_train[target_train == 1]
    repeat = 10
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    return features_upsampled, target_upsampled
features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

print("Upsampling")

#upsampling F1
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
print("Logistic Regression F1:", f1_score(target_valid, predicted_valid))

model =  DecisionTreeClassifier(random_state=12345, max_depth=21)
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
print("Decision Tree Classifier F1:", f1_score(target_valid, predicted_valid))

model = RandomForestClassifier(random_state=12345, n_estimators = 21)
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
print("Random Forest Classifier F1:", f1_score(target_valid, predicted_valid))

Upsampling
Logistic Regression F1: 0.40463065049614116
Decision Tree Classifier F1: 0.4659685863874346
Random Forest Classifier F1: 0.5844504021447721


In [24]:
print("Auc_roc, upsampling")

#upsampling Auc_roc, Decision Tree Classifier
model =  DecisionTreeClassifier(random_state=12345, max_depth=11)
model.fit(features_upsampled, target_upsampled)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Decision Tree Classifier:",auc_roc)

#upsampling Auc_roc, Logistic Regression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_upsampled, target_upsampled)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Logistic Regression:",auc_roc)

#upsampling Auc_roc, случайный лес
model = RandomForestClassifier(random_state=12345, n_estimators = 21)
model.fit(features_upsampled, target_upsampled)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Random Forest Classifier:", auc_roc)

Auc_roc, upsampling
Decision Tree Classifier: 0.7494082995426937
Logistic Regression: 0.7735134370445019
Random Forest Classifier: 0.8281159844163027


In [25]:
#downsampling
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_downsampled = pd.concat([features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat([target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    return features_downsampled, target_downsampled
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

print("Downsampling")
#downsampling f1
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
print("Logistic Regression F1:", f1_score(target_valid, predicted_valid))

model =  DecisionTreeClassifier(random_state=12345, max_depth=21)
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
print("Decision Tree Classifier F1:", f1_score(target_valid, predicted_valid))

model = RandomForestClassifier(random_state=12345, n_estimators = 21)
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
print("Random Forest Classifier F1:", f1_score(target_valid, predicted_valid))

Downsampling
Logistic Regression F1: 0.4064587973273942
Decision Tree Classifier F1: 0.4277854195323246
Random Forest Classifier F1: 0.46461949265687585


In [26]:
print("Auc_roc, downsampling")

#downsampling Auc_roc, Decision Tree Classifier
model =  DecisionTreeClassifier(random_state=12345, max_depth=21)
model.fit(features_downsampled, target_downsampled)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Decision Tree Classifier:",auc_roc)

#downsampling Auc_roc, решающее дерево
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_downsampled, target_downsampled)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Logistic Regression:",auc_roc)

#downsampling Auc_roc, случайный лес
model = RandomForestClassifier(random_state=12345, n_estimators = 21)
model.fit(features_downsampled, target_downsampled)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)
print("Random Forest Classifier:", auc_roc)

Auc_roc, downsampling
Decision Tree Classifier: 0.6640126907627969
Logistic Regression: 0.7730206844809964
Random Forest Classifier: 0.8223547532342848


### Сonclusion


Upsampling looks better than downsampling.

F1:0.58 and 0.46.
Auc_roc has same numbers.

I think we should take upsampling and balance model.

And of course Random Forest Classifier.


# 4. Train and test

In [27]:
features_full_train = pd.concat([features_train, features_valid])
target_full_train = pd.concat([target_train, target_valid]) 

features_upsampled, target_upsampled = upsample(features_full_train, target_full_train, 10)

# accuracy, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21, class_weight='balanced')
model.fit(features_full_train, target_full_train)
predicted_valid = model.predict(features_full_train)
accuracy_valid = accuracy_score(target_full_train, predicted_valid)
print("accuracy:", accuracy_valid)

#confusion_matrix,Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators =21, class_weight='balanced')
model.fit(features_full_train, target_full_train)
predicted_valid = model.predict(features_full_train)
display("confusion_matrix:", confusion_matrix(target_full_train, predicted_valid))

#recall_score, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21, class_weight='balanced')
model.fit(features_full_train, target_full_train)
predicted_valid = model.predict(features_full_train)
print("recall_score:", recall_score(target_full_train, predicted_valid))

#precision_score, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21, class_weight='balanced')
model.fit(features_full_train, target_full_train)
predicted_valid = model.predict(features_full_train)
print("precision_score:", precision_score(target_full_train, predicted_valid))

#Auc_roc, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21, class_weight='balanced')
model.fit(features_full_train, target_full_train)
probabilities_valid = model.predict_proba(features_full_train)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(target_full_train, probabilities_one_valid)
print("Auc_roc:", auc_roc)

#f-1, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21, class_weight='balanced')
model.fit(features_full_train, target_full_train)
predicted_valid = model.predict(features_full_train)
f1 = f1_score(target_full_train, predicted_valid)
print("F1:", f1)

accuracy: 0.996625


'confusion_matrix:'

array([[6389,    1],
       [  26, 1584]], dtype=int64)

recall_score: 0.9838509316770186
precision_score: 0.9993690851735015
Auc_roc: 0.999980559686622
F1: 0.9915492957746479


***Test on test***

In [28]:
#f-1, Random Forest Classifier
model = RandomForestClassifier(random_state=12345, n_estimators = 21, class_weight='balanced')
model.fit(features_upsampled, target_upsampled)
predicted_test = model.predict(features_test)
probablities_test = model.predict_proba(features_test)[:, 1]

print('F1 =', f1_score(target_test, predicted_test))
print('AUC-ROC =', roc_auc_score(target_test, probablities_test))


F1 = 0.6025316455696202
AUC-ROC = 0.841482660409635


# Final conclusion

So, Random Forest Regressor is good choice.

Now we have F1 = 0.6025316455696202
AUC-ROC = 0.841482660409635

I think, we can this model with other data.