In this project, we build a model to predict whether a client will subscribe to a term deposit.

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls.

Number of clients: 41,188.
Date: from May 2008 to November 2010.

20 input columns, plus the target column `y`.


There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010).

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).

The smallest datasets are provided to test more computationally demanding machine learning algorithms.

Link: https://archive.ics.uci.edu/dataset/222/bank+marketing

In [None]:
%pip install lazypredict



In [None]:
from google.colab import drive
drive.mount('content/')
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from collections import Counter
# Modelling and evaluation
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay, classification_report, precision_score, recall_score, accuracy_score, confusion_matrix, ConfusionMatrixDisplay
from imblearn.over_sampling import RandomOverSampler, SMOTE
from lazypredict.Supervised import LazyClassifier


In [None]:
variable = pd.read_excel('/content/content/MyDrive/0 Colab Notebooks/PORTAFOLIO PERSONAL/BANK MARKETING/DATASET/Variables Table/Variables Table.xlsx')
variable

Unnamed: 0,Variable Name,Role,Type,Demographic,Description,Units,Missing Values
0,age,Feature,Integer,Age,,,no
1,job,Feature,Categorical,Occupation,"type of job (categorical: 'admin.','blue-colla...",,no
2,marital,Feature,Categorical,Marital Status,"marital status (categorical: 'divorced','marri...",,no
3,education,Feature,Categorical,Education Level,"(categorical: 'basic.4y','basic.6y','basic.9y'...",,no
4,default,Feature,Binary,,has credit in default?,,no
5,balance,Feature,Integer,,average yearly balance,euros,no
6,housing,Feature,Binary,,has housing loan?,,no
7,loan,Feature,Binary,,has personal loan?,,no
8,contact,Feature,Categorical,,contact communication type (categorical: 'cell...,,yes
9,day_of_week,Feature,Date,,last contact day of the week,,no


# **Additional Variable Information**

**Input variables:**

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student", "blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

**Related to the last contact of the current campaign:**

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

**Other attributes:**

13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

**Social and economic context attributes:**

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

**Output variable (desired target):**

21 - y: has the client subscribed to a term deposit? (binary: "yes","no")

In [None]:
df = pd.read_csv('/content/content/MyDrive/0 Colab Notebooks/PORTAFOLIO PERSONAL/BANK MARKETING/DATASET/bank-additional/bank-additional/bank-additional-full.csv', sep = ';')
df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.99,-36.4,4.86,5191.0,no
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.99,-36.4,4.86,5191.0,no
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.99,-36.4,4.86,5191.0,no
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.99,-36.4,4.86,5191.0,no
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,1,999,0,nonexistent,1.1,93.99,-36.4,4.86,5191.0,no


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null 

# **EDA**

1. START ANALYSIS AND DETECT OUTLIERS

In [None]:
df.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
count,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0,41188.0
mean,40.02,258.29,2.57,962.48,0.17,0.08,93.58,-40.5,3.62,5167.04
std,10.42,259.28,2.77,186.91,0.49,1.57,0.58,4.63,1.73,72.25
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.2,-50.8,0.63,4963.6
25%,32.0,102.0,1.0,999.0,0.0,-1.8,93.08,-42.7,1.34,5099.1
50%,38.0,180.0,2.0,999.0,0.0,1.1,93.75,-41.8,4.86,5191.0
75%,47.0,319.0,3.0,999.0,0.0,1.4,93.99,-36.4,4.96,5228.1
max,98.0,4918.0,56.0,999.0,7.0,1.4,94.77,-26.9,5.04,5228.1


In [None]:
fig = px.violin(df, y="age", box=True, points='all')
fig.show()


In [None]:
df.job.value_counts()

Unnamed: 0_level_0,count
job,Unnamed: 1_level_1
admin.,10422
blue-collar,9254
technician,6743
services,3969
management,2924
retired,1720
entrepreneur,1456
self-employed,1421
housemaid,1060
unemployed,1014


In [None]:
df.marital.value_counts()

Unnamed: 0_level_0,count
marital,Unnamed: 1_level_1
married,24928
single,11568
divorced,4612
unknown,80


In [None]:
df.education.value_counts()

Unnamed: 0_level_0,count
education,Unnamed: 1_level_1
university.degree,12168
high.school,9515
basic.9y,6045
professional.course,5243
basic.4y,4176
basic.6y,2292
unknown,1731
illiterate,18


In [None]:
df.default.value_counts()


Unnamed: 0_level_0,count
default,Unnamed: 1_level_1
no,32588
unknown,8597
yes,3


In [None]:
df.housing.value_counts()

Unnamed: 0_level_0,count
housing,Unnamed: 1_level_1
yes,21576
no,18622
unknown,990


In [None]:
df.loan.value_counts()

Unnamed: 0_level_0,count
loan,Unnamed: 1_level_1
no,33950
yes,6248
unknown,990


In [None]:
df.contact.value_counts()

Unnamed: 0_level_0,count
contact,Unnamed: 1_level_1
cellular,26144
telephone,15044


In [None]:
df.month.value_counts()

Unnamed: 0_level_0,count
month,Unnamed: 1_level_1
may,13769
jul,7174
aug,6178
jun,5318
nov,4101
apr,2632
oct,718
sep,570
mar,546
dec,182


In [None]:
df.day_of_week.value_counts()

Unnamed: 0_level_0,count
day_of_week,Unnamed: 1_level_1
thu,8623
mon,8514
wed,8134
tue,8090
fri,7827


In [None]:
df.duration.describe()

Unnamed: 0,duration
count,41188.0
mean,258.29
std,259.28
min,0.0
25%,102.0
50%,180.0
75%,319.0
max,4918.0


In [None]:
df.campaign.value_counts()

Unnamed: 0_level_0,count
campaign,Unnamed: 1_level_1
1,17642
2,10570
3,5341
4,2651
5,1599
6,979
7,629
8,400
9,283
10,225


In [None]:
df['emp.var.rate'].value_counts()

Unnamed: 0_level_0,count
emp.var.rate,Unnamed: 1_level_1
1.4,16234
-1.8,9184
1.1,7763
-0.1,3683
-2.9,1663
-3.4,1071
-1.7,773
-1.1,635
-3.0,172
-0.2,10


# **1. Improvement point for marketing department:**

- 33,553 clients were called at most 3 times during this campaign.
- That is 81.46% of the 41,188 clients in the dataset.
- **Improvement point:** move on to the next client once the 3rd call has been made, and return to the remaining clients only after finishing the list. This should improve call effectiveness and help close more sales in less time.

The violin plot of `campaign` shows how many times each client was called.

In [None]:
len(df[df['campaign'] <= 3])

33553
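
The 81.46% figure is the share of rows with `campaign <= 3`. A minimal sketch of that calculation follows, using a small made-up `campaign` column in place of the real data. The helper name `share_within` and the toy values are assumptions for illustration only.

```python
import pandas as pd


def share_within(series: pd.Series, limit: int) -> float:
    """Return the percentage of values in `series` that are <= `limit`."""
    return float((series <= limit).mean() * 100)


# Toy stand-in for df['campaign']; the real column has 41,188 rows.
toy_campaign = pd.Series([1, 1, 2, 3, 4, 7, 2, 1, 5, 3])
print(f"{share_within(toy_campaign, 3):.2f}%")  # 70.00%

# Same arithmetic with the counts reported in this notebook.
print(f"{33553 / 41188 * 100:.2f}%")  # 81.46%
```

On the real data, `share_within(df['campaign'], 3)` should reproduce the second figure.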

In [None]:
fig = px.violin(df, y="campaign", box=True, points='all', labels={'campaign': 'Frequency of the calls'})
fig.show()

# **2. Improvement point for marketing department:**

- 4,332 sales were achieved within a maximum of 5 call attempts and a maximum call duration of 25 minutes.
- 93.36% of the 4,640 sales were closed within this range.
- **Improvement point:** most sales are closed with few call attempts and short calls; only 308 of the 4,640 sales (6.64%) fall outside this range.

In [None]:
df['duration_to_min'] = df['duration'] / 60  # New column: call duration in minutes, easier to read than seconds.

In [None]:

len(df[(df['duration_to_min'] <= 25) & (df['campaign'] <= 5)])


37606

In [None]:
len(df[(df['duration_to_min'] <= 25) & (df['campaign'] <= 5) & (df['y'] == 'yes')])


4332

In [None]:
df.y.value_counts()

Unnamed: 0_level_0,count
y,Unnamed: 1_level_1
no,36548
yes,4640
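
The 93.36% above is the share of all sales that fall inside the range (at most 5 calls, at most 25 minutes). A related but different quantity is the conversion rate, that is, the share of calls inside or outside the range that end in a sale. The sketch below works only from the counts printed in this notebook; the variable names are illustrative.

```python
# Counts taken from the cell outputs above.
total_rows = 41188      # all clients
total_sales = 4640      # y == 'yes'
rows_in_range = 37606   # campaign <= 5 and duration <= 25 min
sales_in_range = 4332   # same filter, y == 'yes'

# Share of all sales that were closed inside the range.
share_of_sales = sales_in_range / total_sales * 100

# Conversion rate inside and outside the range.
conversion_in = sales_in_range / rows_in_range * 100
conversion_out = (total_sales - sales_in_range) / (total_rows - rows_in_range) * 100

print(f"share of sales in range: {share_of_sales:.2f}%")   # 93.36%
print(f"conversion inside range: {conversion_in:.2f}%")    # 11.52%
print(f"conversion outside range: {conversion_out:.2f}%")  # 8.60%
```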


In [None]:
fig2 = px.scatter(df, x="campaign", y='duration_to_min', color="y", labels={'campaign': 'Frequency of the calls', 'duration_to_min': 'Call length (minutes)', 'y': 'Customers'}, title='More with Less: Optimizing Calls')
fig2.update_layout(legend_title_text='Sales achieved')
fig2.show()


In [None]:
df.pdays.value_counts()

Unnamed: 0_level_0,count
pdays,Unnamed: 1_level_1
999,39673
3,439
6,412
4,118
9,64
2,61
7,60
12,58
10,52
5,46


In [None]:
df.previous.value_counts()

Unnamed: 0_level_0,count
previous,Unnamed: 1_level_1
0,35563
1,4561
2,754
3,216
4,70
5,18
6,5
7,1


In [None]:
df.poutcome.value_counts()

Unnamed: 0_level_0,count
poutcome,Unnamed: 1_level_1
nonexistent,35563
failure,4252
success,1373


In [None]:
df['y'].value_counts()

Unnamed: 0_level_0,count
y,Unnamed: 1_level_1
no,36548
yes,4640


# **Beginning of the predictive model**

In [None]:
df

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,duration_to_min
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,...,999,0,nonexistent,1.10,93.99,-36.40,4.86,5191.00,no,4.35
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,...,999,0,nonexistent,1.10,93.99,-36.40,4.86,5191.00,no,2.48
2,37,services,married,high.school,no,yes,no,telephone,may,mon,...,999,0,nonexistent,1.10,93.99,-36.40,4.86,5191.00,no,3.77
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,...,999,0,nonexistent,1.10,93.99,-36.40,4.86,5191.00,no,2.52
4,56,services,married,high.school,no,no,yes,telephone,may,mon,...,999,0,nonexistent,1.10,93.99,-36.40,4.86,5191.00,no,5.12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,retired,married,professional.course,no,yes,no,cellular,nov,fri,...,999,0,nonexistent,-1.10,94.77,-50.80,1.03,4963.60,yes,5.57
41184,46,blue-collar,married,professional.course,no,no,no,cellular,nov,fri,...,999,0,nonexistent,-1.10,94.77,-50.80,1.03,4963.60,no,6.38
41185,56,retired,married,university.degree,no,yes,no,cellular,nov,fri,...,999,0,nonexistent,-1.10,94.77,-50.80,1.03,4963.60,no,3.15
41186,44,technician,married,professional.course,no,no,no,cellular,nov,fri,...,999,0,nonexistent,-1.10,94.77,-50.80,1.03,4963.60,yes,7.37


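These columns are dropped before modelling. `duration` (and the derived `duration_to_min`) is only known once the call has ended, so it cannot be used to predict the outcome beforehand. The remaining contact and campaign-history columns are also left out, so the model relies only on client attributes and economic indicators.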

In [None]:
df = df.drop(['duration', 'contact', 'month', 'day_of_week', 'duration_to_min', 'campaign', 'pdays', 'previous', 'poutcome'], axis=1)

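These columns contain 'unknown' values. We replace those values with NaN and then fill them with the most frequent value of each column, so that they do not affect the model. Note that `loan` also contains 990 'unknown' values but is not in this list, so it keeps 'unknown' as a separate category.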

In [None]:
columns_to_clean = ['job', 'marital', 'education', 'default', 'housing']
df[columns_to_clean] = df[columns_to_clean].replace('unknown', np.nan)
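
Before hard-coding the list of columns to clean, it can help to count the 'unknown' entries per column. The `loan` value counts earlier in this notebook show 990 of them, yet `loan` is not in `columns_to_clean`. The sketch below shows one way to derive the list from the data, using a made-up three-column frame as a stand-in for `df`.

```python
import pandas as pd

# Toy stand-in for the categorical part of df (values are made up).
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services", "admin."],
    "housing": ["yes", "no", "unknown", "yes"],
    "loan": ["no", "unknown", "no", "yes"],
    "contact": ["cellular", "telephone", "cellular", "cellular"],
})

# Count 'unknown' entries per column.
unknown_counts = toy.eq("unknown").sum()

# Keep only the columns that actually contain 'unknown'.
cols_with_unknown = unknown_counts[unknown_counts > 0].index.tolist()
print(cols_with_unknown)  # ['job', 'housing', 'loan']
```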

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             40858 non-null  object 
 2   marital         41108 non-null  object 
 3   education       39457 non-null  object 
 4   default         32591 non-null  object 
 5   housing         40198 non-null  object 
 6   loan            41188 non-null  object 
 7   emp.var.rate    41188 non-null  float64
 8   cons.price.idx  41188 non-null  float64
 9   cons.conf.idx   41188 non-null  float64
 10  euribor3m       41188 non-null  float64
 11  nr.employed     41188 non-null  float64
 12  y               41188 non-null  object 
dtypes: float64(5), int64(1), object(7)
memory usage: 4.1+ MB


In [None]:
for column in columns_to_clean:
    # Fill missing values with the mode; assign back instead of using inplace on a column selection.
    most_frequent_value = df[column].mode()[0]
    df[column] = df[column].fillna(most_frequent_value)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   emp.var.rate    41188 non-null  float64
 8   cons.price.idx  41188 non-null  float64
 9   cons.conf.idx   41188 non-null  float64
 10  euribor3m       41188 non-null  float64
 11  nr.employed     41188 non-null  float64
 12  y               41188 non-null  object 
dtypes: float64(5), int64(1), object(7)
memory usage: 4.1+ MB


In [None]:
df.default.value_counts()

Unnamed: 0_level_0,count
default,Unnamed: 1_level_1
no,41185
yes,3



We convert the categorical columns to integers with `LabelEncoder`.

In [None]:
le = LabelEncoder()

object_columns = df.select_dtypes(include=['object']).columns

for col in object_columns:
    df[col] = le.fit_transform(df[col])
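
`LabelEncoder` assigns integer codes in sorted order of the labels. Because the loop above reuses a single encoder, only the mapping of the last column stays available in `le.classes_`. A possible variation, sketched on toy data, keeps one encoder per column so each mapping can be inspected later. The `encoders` dictionary and the toy frame are assumptions, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for two of the categorical columns.
toy = pd.DataFrame({
    "job": ["services", "admin.", "housemaid", "admin."],
    "y": ["no", "yes", "no", "no"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in ["job", "y"]:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Recover the label -> code mapping for a given column.
job_mapping = {str(label): code for code, label in enumerate(encoders["job"].classes_)}
print(job_mapping)  # {'admin.': 0, 'housemaid': 1, 'services': 2}
```

The codes are arbitrary integers, so a model may read an order into them that the categories do not have.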

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  int64  
 2   marital         41188 non-null  int64  
 3   education       41188 non-null  int64  
 4   default         41188 non-null  int64  
 5   housing         41188 non-null  int64  
 6   loan            41188 non-null  int64  
 7   emp.var.rate    41188 non-null  float64
 8   cons.price.idx  41188 non-null  float64
 9   cons.conf.idx   41188 non-null  float64
 10  euribor3m       41188 non-null  float64
 11  nr.employed     41188 non-null  float64
 12  y               41188 non-null  int64  
dtypes: float64(5), int64(8)
memory usage: 4.1 MB


In [None]:
df.y.value_counts()

Unnamed: 0_level_0,count
y,Unnamed: 1_level_1
0,36548
1,4640


In [None]:
df

Unnamed: 0,age,job,marital,education,default,housing,loan,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,3,1,0,0,0,0,1.10,93.99,-36.40,4.86,5191.00,0
1,57,7,1,3,0,0,0,1.10,93.99,-36.40,4.86,5191.00,0
2,37,7,1,3,0,1,0,1.10,93.99,-36.40,4.86,5191.00,0
3,40,0,1,1,0,0,0,1.10,93.99,-36.40,4.86,5191.00,0
4,56,7,1,3,0,0,2,1.10,93.99,-36.40,4.86,5191.00,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
41183,73,5,1,5,0,1,0,-1.10,94.77,-50.80,1.03,4963.60,1
41184,46,1,1,5,0,0,0,-1.10,94.77,-50.80,1.03,4963.60,0
41185,56,5,1,6,0,1,0,-1.10,94.77,-50.80,1.03,4963.60,0
41186,44,9,1,5,0,0,0,-1.10,94.77,-50.80,1.03,4963.60,1


In [None]:
X = df.drop(columns='y')
y = df.y

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

In [None]:
print("Training data X:",X_train.shape)
print("Training data y:",y_train.shape)

Training data X: (32950, 12)
Training data y: (32950,)


# Using the Random Forest Classifier model

In [None]:
rfc = RandomForestClassifier(random_state=42)
rfc.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

In [None]:
rfc.fit(X_train,y_train)

In [None]:
y_pred = rfc.predict(X_test)

In [None]:
rf_train_score = rfc.score(X_train, y_train)
rf_test_score = rfc.score(X_test, y_test)

print('Model performance on training data', rf_train_score)
print('Model performance on test data', rf_test_score)

Model performance on training data 0.9791502276176024
Model performance on test data 0.8777615926195679


In [None]:
print("Classification report:\n\n", classification_report(y_test,y_pred))

Classification report:

               precision    recall  f1-score   support

           0       0.91      0.96      0.93      7303
           1       0.44      0.26      0.33       935

    accuracy                           0.88      8238
   macro avg       0.67      0.61      0.63      8238
weighted avg       0.86      0.88      0.86      8238



# Using the Logistic Regression model

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model_lr = LogisticRegression()
model_lr.fit(X_train, y_train)

In [None]:
y_pred_lr = model_lr.predict(X_test)

In [None]:
lr_train_score = model_lr.score(X_train, y_train)
lr_test_score = model_lr.score(X_test, y_test)

print('Model performance on training data', lr_train_score)
print('Model performance on test data', lr_test_score)

Model performance on training data 0.8872837632776934
Model performance on test data 0.8856518572469045


In [None]:
print("Classification report:\n\n", classification_report(y_test,y_pred_lr))

Classification report:

               precision    recall  f1-score   support

           0       0.89      1.00      0.94      7303
           1       0.18      0.00      0.00       935

    accuracy                           0.89      8238
   macro avg       0.53      0.50      0.47      8238
weighted avg       0.81      0.89      0.83      8238
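
The features have very different scales (for example `nr.employed` is around 5,000 while `emp.var.rate` ranges from -3.4 to 1.4), and the imports at the top include `make_pipeline` and `StandardScaler`, which are never used. One possible way to combine them with logistic regression is sketched below on synthetic data, not on the bank dataset. The data, the `max_iter` value and the variable names are assumptions, and whether scaling changes the scores reported above is not tested here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced data (roughly 90% / 10%) standing in for X and y.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=42
)
X_demo[:, 0] *= 5000  # mimic one feature on a much larger scale

Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(Xtr, ytr)
demo_score = scaled_lr.score(Xte, yte)
print(f"accuracy on synthetic test split: {demo_score:.2f}")
```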



# Using the Nearest Centroid model

In [None]:
from sklearn.neighbors import NearestCentroid

In [None]:
model_nc = NearestCentroid()
model_nc.fit(X_train, y_train)

In [None]:
y_pred_nc = model_nc.predict(X_test)

In [None]:
nc_train_score = model_nc.score(X_train, y_train)
nc_test_score = model_nc.score(X_test, y_test)

print('Model performance on training data', nc_train_score)
print('Model performance on test data', nc_test_score)

Model performance on training data 0.7207283763277693
Model performance on test data 0.7166788055353241


In [None]:
print("Classification report:\n\n", classification_report(y_test,y_pred_nc))

Classification report:

               precision    recall  f1-score   support

           0       0.95      0.72      0.82      7303
           1       0.24      0.70      0.36       935

    accuracy                           0.72      8238
   macro avg       0.60      0.71      0.59      8238
weighted avg       0.87      0.72      0.77      8238



# Using the Perceptron model

In [None]:
from sklearn.linear_model import Perceptron

In [None]:
model_p = Perceptron()
model_p.fit(X_train, y_train)

In [None]:
y_pred_p = model_p.predict(X_test)

In [None]:
p_train_score = model_p.score(X_train, y_train)
p_test_score = model_p.score(X_test, y_test)

print('Model performance on training data', p_train_score)
print('Model performance on test data', p_test_score)

Model performance on training data 0.887556904400607
Model performance on test data 0.8865015780529255


In [None]:
print("Classification report:\n\n", classification_report(y_test,y_pred_p))

Classification report:

               precision    recall  f1-score   support

           0       0.89      1.00      0.94      7303
           1       0.00      0.00      0.00       935

    accuracy                           0.89      8238
   macro avg       0.44      0.50      0.47      8238
weighted avg       0.79      0.89      0.83      8238



**Using the Random Forest Classifier model:**

- Model performance on training data: 0.98
- Model performance on test data: 0.88
- Precision:
  - Class 0: 0.91
  - Class 1: 0.44

**Using the Logistic Regression model:**

- Model performance on training data: 0.89
- Model performance on test data: 0.89
- Precision:
  - Class 0: 0.89
  - Class 1: 0.18

**Using the Nearest Centroid model:**

- Model performance on training data: 0.72
- Model performance on test data: 0.72
- Precision:
  - Class 0: 0.95
  - Class 1: 0.24

**Using the Perceptron model:**

- Model performance on training data: 0.89
- Model performance on test data: 0.89
- Precision:
  - Class 0: 0.89
  - Class 1: 0.00
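
For reference, the class 1 ("yes") figures from the four classification reports can be collected in one table. The numbers below are copied from those reports; the DataFrame layout and variable names are illustrative.

```python
import pandas as pd

# Class 1 metrics copied from the classification reports in this notebook.
class1 = pd.DataFrame(
    {
        "precision": [0.44, 0.18, 0.24, 0.00],
        "recall": [0.26, 0.00, 0.70, 0.00],
        "f1": [0.33, 0.00, 0.36, 0.00],
    },
    index=["Random Forest", "Logistic Regression", "Nearest Centroid", "Perceptron"],
)

best_precision = class1["precision"].idxmax()
best_recall = class1["recall"].idxmax()

print(class1)
print("highest class 1 precision:", best_precision)  # Random Forest
print("highest class 1 recall:", best_recall)        # Nearest Centroid
```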

# **The Random Forest Classifier model is the best fit for this project**