# Assignment for Regression Project

### Data description
● Features
○ RowNumber — row index in the data
○ CustomerId — unique customer identifier
○ Surname — surname
○ CreditScore — credit score
○ Geography — country of residence
○ Gender — gender
○ Age — age
○ Tenure — period of maturation for a customer’s fixed deposit (years)
○ Balance — account balance
○ NumOfProducts — number of banking products used by the customer
○ HasCrCard — customer has a credit card
○ IsActiveMember — whether the customer is an active member
○ EstimatedSalary — estimated salary
● Target
○ Exited — customer has left
### Project evaluation
We’ve put together the evaluation criteria for the project. Read this carefully before
moving on to the task.
● How did you prepare the data for training? Have you processed all of the feature
types?
● Have you explained the preprocessing steps well enough?
● How did you investigate the balance of classes?
● Did you study the model without taking into account the imbalance of classes?
● What are your findings of the task research?
● Have you correctly split the data into sets?
● How have you worked with the imbalance of classes?
● Did you use at least two techniques for imbalance fixing?
● Have you performed the model training, validation, and final testing correctly?
● How high is your F1 score?
● Did you examine the AUC-ROC values?
● Have you kept to the project structure and kept the code neat?
You can use the CRISP-DM methodology that we’ve used in the past, whose approach
will help you address the above questions.
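
Two of the questions above (class balance and splitting the data into sets) can be sketched together. The example below is a minimal sketch on a synthetic stand-in table, not the project data: the values are random, the 60/20/20 proportions are an assumption, and `stratify` is an optional choice that keeps the share of `Exited == 1` similar in every set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn table (random values, not the real dataset)
rng = np.random.default_rng(12345)
n = 1000
demo = pd.DataFrame({
    'CreditScore': rng.integers(350, 851, size=n),
    'Age': rng.integers(18, 93, size=n),
    'Exited': (rng.random(n) < 0.2).astype(int),
})

features = demo.drop('Exited', axis=1)
target = demo['Exited']

# Class balance: share of each class in the target
print(target.value_counts(normalize=True))

# Assumed 60/20/20 split: hold out 40 % first, then halve the held-out part
features_train, features_rest, target_train, target_rest = train_test_split(
    features, target, test_size=0.4, random_state=12345, stratify=target
)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_rest, target_rest, test_size=0.5, random_state=12345,
    stratify=target_rest
)

print(len(features_train), len(features_valid), len(features_test))
```

The validation set is then used for model and threshold selection, and the test set is touched only once at the end.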

In [17]:
#Import the data and explore it
import pandas as pd

df = pd.read_csv('https://bit.ly/2XZK7Bo')
test_df = pd.read_csv('https://bit.ly/2XZK7Bo')

test_df.head()


| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [18]:
#Check data types
test_df.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure             float64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [19]:
#Check the shape
test_df.shape

(10000, 14)

In [20]:
#Check duplicate values
test_df.duplicated().sum()

0

In [21]:
#Count null values in each column
test_df.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64
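
Only `Tenure` has missing values. The notebook drops those rows later; an alternative that keeps every row is to fill the gaps with the column median. The sketch below uses a small made-up frame, and median imputation is an assumption here, not the approach taken in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN marks a missing Tenure value
demo = pd.DataFrame({
    'Tenure': [2.0, 1.0, np.nan, 8.0, np.nan, 5.0],
    'Exited': [1, 0, 1, 0, 0, 1],
})

# Median of the observed values only (NaN is skipped by default)
tenure_median = demo['Tenure'].median()

# Fill the gaps without modifying the original frame
filled = demo.assign(Tenure=demo['Tenure'].fillna(tenure_median))

print(filled['Tenure'].isnull().sum(), len(filled))
```

In a full pipeline the median would be computed on the training set only and then reused for the validation and test sets, so that no information leaks from held-out rows.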

In [31]:
#Baseline: fit a DecisionTreeClassifier without correcting the class imbalance
#and compute F1 on the validation set
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
#drop null values in Tenure
test_df.dropna(inplace=True)

target = test_df['Exited']
#drop unwanted columns
features = test_df.drop(['Surname','Geography','Gender','RowNumber','CustomerId','Exited'], axis=1)

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

model = DecisionTreeClassifier(random_state=42)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)


precision = precision_score(target_valid, predicted_valid)
f1 = f1_score(target_valid, predicted_valid)
print(f1)


0.46447507953340406
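
The cell above drops `Geography` and `Gender` instead of encoding them. One common way to keep such categorical columns is one-hot encoding with `pd.get_dummies`. This is a minimal sketch on made-up rows (the `Male` value and the row contents are illustrative assumptions); whether the extra features improve F1 on the real data is not tested here.

```python
import pandas as pd

# Made-up rows with the two categorical columns that the notebook drops
demo = pd.DataFrame({
    'CreditScore': [619, 608, 502, 699],
    'Geography': ['France', 'Spain', 'France', 'Spain'],
    'Gender': ['Female', 'Female', 'Male', 'Female'],
})

# drop_first=True removes one dummy per column to avoid redundant features
encoded = pd.get_dummies(
    demo, columns=['Geography', 'Gender'], drop_first=True
)

print(encoded.columns.tolist())
```

`Surname`, `RowNumber`, and `CustomerId` are identifiers rather than features, so dropping them remains reasonable.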


In [33]:
#Address the class imbalance with class_weight='balanced' in LogisticRegression
#F1 is about the same as the decision tree baseline (0.461 vs 0.464), so this alone does not help
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

target = test_df['Exited']
#drop unwanted columns
features = test_df.drop(['Surname','Geography','Gender','RowNumber','CustomerId','Exited'], axis=1)

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

model = LogisticRegression(
    random_state=12345, class_weight='balanced', solver='liblinear'
)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print('F1:', f1_score(target_valid, predicted_valid))



F1: 0.4606240713224368


In [34]:
#Second imbalance technique: upsample the positive class (Exited == 1)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score


target = test_df['Exited']
#drop unwanted columns
features = test_df.drop(['Surname','Geography','Gender','RowNumber','CustomerId','Exited'], axis=1)

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled


features_upsampled, target_upsampled = upsample(
    features_train, target_train, 10
)
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_upsampled, target_upsampled)
predicted_valid=model.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid))

F1: 0.3568803593303389
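
The upsampling run uses `repeat=10`. If positives make up roughly a fifth of the training rows (an assumption used for illustration), repeating them ten times makes them the majority class, which may explain the lower F1. A minimal sketch of deriving the repeat factor from the class counts instead, on a made-up target series:

```python
import pandas as pd

# Made-up training target: 80 negatives and 20 positives
target_train = pd.Series([0] * 80 + [1] * 20)

counts = target_train.value_counts()
negatives = counts.loc[0]
positives = counts.loc[1]

# Repeat factor that brings the positive class close to the negative class
repeat = round(negatives / positives)

def positive_share(repeat_factor):
    """Share of positive rows after repeating them repeat_factor times."""
    upsampled = positives * repeat_factor
    return upsampled / (negatives + upsampled)

print('repeat =', repeat)
print('share with derived repeat:', positive_share(repeat))
print('share with repeat=10:', round(positive_share(10), 3))
```

The derived value can be passed as the third argument of `upsample`, and candidate values can be compared by F1 on the validation set.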


In [37]:
#At threshold 0.0 recall is 1 because every observation is predicted positive; recall drops as the threshold increases.
#Precision generally rises with the threshold; in the scanned range (0.00 to 0.28) it only rises.
#If a threshold leaves no predictions of class "1", precision is 0/0; sklearn reports it as 0 and emits an UndefinedMetricWarning.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

target = test_df['Exited']
#drop unwanted columns
features = test_df.drop(['Surname','Geography','Gender','RowNumber','CustomerId','Exited'], axis=1)

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

for threshold in np.arange(0, 0.3, 0.02):
    predicted_valid = probabilities_one_valid > threshold
    precision = precision_score(target_valid, predicted_valid)
    recall = recall_score(target_valid, predicted_valid)
    print(
        'Threshold = {:.2f} | Precision = {:.3f}, Recall = {:.3f}'.format(
            threshold, precision, recall
        )
    )


Threshold = 0.00 | Precision = 0.198, Recall = 1.000
Threshold = 0.02 | Precision = 0.198, Recall = 1.000
Threshold = 0.04 | Precision = 0.198, Recall = 1.000
Threshold = 0.06 | Precision = 0.199, Recall = 1.000
Threshold = 0.08 | Precision = 0.202, Recall = 1.000
Threshold = 0.10 | Precision = 0.206, Recall = 0.987
Threshold = 0.12 | Precision = 0.216, Recall = 0.956
Threshold = 0.14 | Precision = 0.228, Recall = 0.900
Threshold = 0.16 | Precision = 0.238, Recall = 0.823
Threshold = 0.18 | Precision = 0.248, Recall = 0.741
Threshold = 0.20 | Precision = 0.266, Recall = 0.652
Threshold = 0.22 | Precision = 0.292, Recall = 0.572
Threshold = 0.24 | Precision = 0.321, Recall = 0.501
Threshold = 0.26 | Precision = 0.347, Recall = 0.419
Threshold = 0.28 | Precision = 0.366, Recall = 0.344
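
Since the project is graded on F1, the precision and recall pairs printed above can be combined into F1 = 2PR / (P + R) to compare thresholds directly. The sketch below reuses the last six rows of that output; because the inputs are rounded to three decimals, the resulting F1 values are approximate.

```python
# (threshold, precision, recall) rows copied from the printed output above
rows = [
    (0.18, 0.248, 0.741),
    (0.20, 0.266, 0.652),
    (0.22, 0.292, 0.572),
    (0.24, 0.321, 0.501),
    (0.26, 0.347, 0.419),
    (0.28, 0.366, 0.344),
]

def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

scored = [(t, f1_from(p, r)) for t, p, r in rows]
for t, f1 in scored:
    print('Threshold = {:.2f} | F1 = {:.3f}'.format(t, f1))

# Threshold with the highest F1 among the rows listed here
best_threshold, best_f1 = max(scored, key=lambda pair: pair[1])
print('Best of these: threshold {:.2f}, F1 {:.3f}'.format(best_threshold, best_f1))
```

Among these rows F1 peaks at threshold 0.24 (about 0.39), which is still below the decision tree baseline. Thresholds above 0.28 were not scanned, so this says nothing about them.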


In [39]:
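#Final check: refit on the full dataset and print precision/recall per threshold.
#Note: the model is scored on the same rows it was trained on, so these are not held-out test metrics.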
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

df = pd.read_csv('https://bit.ly/2XZK7Bo')
df.dropna(inplace=True)

target = df['Exited']
#drop unwanted columns
features = df.drop(['Surname','Geography','Gender','RowNumber','CustomerId','Exited'], axis=1)

model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features, target)
probabilities_valid = model.predict_proba(features)
probabilities_one_valid = probabilities_valid[:, 1]

for threshold in np.arange(0, 0.3, 0.02):
    predicted_valid = probabilities_one_valid > threshold
    precision = precision_score(target, predicted_valid)
    recall = recall_score(target, predicted_valid)
    print(
        'Threshold = {:.2f} | Precision = {:.3f}, Recall = {:.3f}'.format(
            threshold, precision, recall
        )
    )


Threshold = 0.00 | Precision = 0.204, Recall = 1.000
Threshold = 0.02 | Precision = 0.204, Recall = 1.000
Threshold = 0.04 | Precision = 0.204, Recall = 0.999
Threshold = 0.06 | Precision = 0.207, Recall = 0.996
Threshold = 0.08 | Precision = 0.214, Recall = 0.980
Threshold = 0.10 | Precision = 0.223, Recall = 0.948
Threshold = 0.12 | Precision = 0.235, Recall = 0.906
Threshold = 0.14 | Precision = 0.248, Recall = 0.852
Threshold = 0.16 | Precision = 0.260, Recall = 0.787
Threshold = 0.18 | Precision = 0.274, Recall = 0.718
Threshold = 0.20 | Precision = 0.289, Recall = 0.650
Threshold = 0.22 | Precision = 0.302, Recall = 0.580
Threshold = 0.24 | Precision = 0.320, Recall = 0.517
Threshold = 0.26 | Precision = 0.332, Recall = 0.451
Threshold = 0.28 | Precision = 0.352, Recall = 0.406
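
The evaluation criteria also ask for AUC-ROC, and a final test is more informative on rows the model has not seen. The following is a minimal sketch of that pattern on synthetic data from `make_classification` (an assumption standing in for the churn table, with about 20 % positives); the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features and target
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=12345
)

# Keep a test set that is never used for fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

model = LogisticRegression(
    random_state=12345, solver='liblinear', class_weight='balanced'
)
model.fit(X_train, y_train)

# AUC-ROC needs the probability of class 1, not the hard predictions
test_probabilities = model.predict_proba(X_test)[:, 1]
auc_roc = roc_auc_score(y_test, test_probabilities)
f1 = f1_score(y_test, model.predict(X_test))

print('F1: {:.3f} | AUC-ROC: {:.3f}'.format(f1, auc_roc))
```

AUC-ROC is threshold-independent, so it complements F1: a model can rank customers well (high AUC-ROC) while a poorly chosen threshold still yields a low F1.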
