## Ensemble Model Using the Weighted Averaging Technique
In this exercise, we implement an ensemble model using the weighted averaging technique. We use the same three base models: logistic regression, KNN, and random forest.
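
As a minimal sketch of the idea, weighted averaging combines the class probabilities of each model as `w1*p1 + w2*p2 + w3*p3`, with weights that sum to 1. The probability values below are made up for illustration and are not taken from the credit dataset.

```python
import numpy as np

# Hypothetical class-probability outputs of three models for two examples
# (columns: P(class 0), P(class 1)); the numbers are illustrative only.
p1 = np.array([[0.9, 0.1], [0.4, 0.6]])
p2 = np.array([[0.6, 0.4], [0.2, 0.8]])
p3 = np.array([[0.7, 0.3], [0.5, 0.5]])

weights = np.array([0.6, 0.2, 0.2])  # one weight per model, summing to 1

# Stack to shape (n_models, n_examples, n_classes) and average over the model axis.
stacked = np.stack([p1, p2, p3])
ensemble = np.average(stacked, axis=0, weights=weights)

print(ensemble)                 # weighted class probabilities per example
print(ensemble.argmax(axis=1))  # index of the most probable class per example
```

Using `np.average` with a `weights` argument is equivalent to writing the weighted sum by hand when the weights sum to 1, which is the form used later in this exercise.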

In [1]:
import pandas as pd

In [2]:
credData = pd.read_csv('crx-data.csv', sep=',', header=None, na_values='?')
credData.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,+


In [3]:
# changing the Classes to 1 and 0
credData.loc[credData[15] == '+', 15] = 1
credData.loc[credData[15] == '-', 15] = 0
credData.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202.0,0,1
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43.0,560,1
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280.0,824,1
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100.0,3,1
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120.0,0,1


In [4]:
# finding number of null values
credData.isnull().sum()

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

In [5]:
# printing shape and data types
print('Shape of raw data set: {}'.format(credData.shape))
print('Data types of data set: {}'.format(credData.dtypes))

Shape of raw data set: (690, 16)
Data types of data set: 0      object
1     float64
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12     object
13    float64
14      int64
15     object
dtype: object


In [6]:
# dropping all rows with na values
newcred = credData.dropna(axis=0)
newcred.shape

(653, 16)

In [7]:
# verify that no null values exist
newcred.isnull().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

In [8]:
# dummy variables for categorical variables
credCat = pd.get_dummies(newcred[[0,3,4,5,6,9,11,12]])

In [9]:
# separating numerical variables
credNum = newcred[[1,2,7,10,13,14]]

In [10]:
# making X variable
X = pd.concat([credCat,credNum],axis = 1)
print(X.shape)

# making y variable
y = pd.Series(newcred[15], dtype='int')
print(y.shape)

(653, 44)
(653,)


In [11]:
# normalizing the data sets
from sklearn.preprocessing import MinMaxScaler

In [12]:
minmaxScaler = MinMaxScaler()
X_tran = pd.DataFrame(minmaxScaler.fit_transform(X))

In [13]:
from sklearn.model_selection import train_test_split

In [14]:
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tran, y, test_size=0.3, random_state=123)

In [15]:
# define the three base models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

In [16]:
model1 = LogisticRegression(random_state=123)
model2 = KNeighborsClassifier(n_neighbors=5)
model3 = RandomForestClassifier(n_estimators=500)

In [17]:
# fit models on training set
model1.fit(X_train, y_train)
model2.fit(X_train, y_train)
model3.fit(X_train, y_train)

RandomForestClassifier(n_estimators=500)

In [18]:
# predicting probabilities of each model on test set
pred1 = model1.predict_proba(X_test)
pred2 = model2.predict_proba(X_test)
pred3 = model3.predict_proba(X_test)

In [20]:
# calculate ensemble prediction by applying weights for each prediction
ensemblepred=(pred1 *0.60 + pred2 * 0.20 + pred3 * 0.20)

# display first four rows of ensemble prediction array
ensemblepred[0:4,:]

array([[0.83273545, 0.16726455],
       [0.68076376, 0.31923624],
       [0.18027843, 0.81972157],
       [0.07887364, 0.92112636]])

As you can see from the preceding output, we have two probabilities for each example corresponding to each class.
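
Because the three weights sum to 1 and each model's probabilities sum to 1 per example, every row of the ensemble array should also sum to 1. A quick sanity check, using the four rows copied from the output above:

```python
import numpy as np

# First four rows of the ensemble prediction, copied from the output above.
rows = np.array([[0.83273545, 0.16726455],
                 [0.68076376, 0.31923624],
                 [0.18027843, 0.81972157],
                 [0.07887364, 0.92112636]])

# The weights applied to the three models in this iteration.
weights = [0.60, 0.20, 0.20]

row_sums = rows.sum(axis=1)
print(np.isclose(sum(weights), 1.0))  # weights form a convex combination
print(np.allclose(row_sums, 1.0))     # each row is a valid probability distribution
```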

In [21]:
# printing the order of classes for each model
print(model1.classes_)
print(model2.classes_)
print(model3.classes_)

[0 1]
[0 1]
[0 1]


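We now have to get the final prediction for each example from the output probabilities. The final prediction is the class with the highest probability. To find it, we use the NumPy function **np.argmax()** along axis 1, which returns the column index of the largest value in each row. Since every model orders its classes as `[0 1]`, that index is also the class label.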

In [22]:
import numpy as np

In [23]:
pred = np.argmax(ensemblepred,axis=1)
pred

array([0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])
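
The argmax index equals the class label here only because every model reports `classes_` as `[0 1]`. A slightly more general sketch, assuming a stand-in `classes` array in place of `model1.classes_`, maps the column indices back to labels explicitly:

```python
import numpy as np

# Stand-in for model1.classes_; with another label encoding (for example
# ['-', '+']) the argmax index would not equal the label itself.
classes = np.array([0, 1])

# First four rows of the ensemble probabilities, copied from the earlier output.
ensemblepred_head = np.array([[0.83273545, 0.16726455],
                              [0.68076376, 0.31923624],
                              [0.18027843, 0.81972157],
                              [0.07887364, 0.92112636]])

# Column index of the largest probability in each row, then look up its label.
labels = classes[np.argmax(ensemblepred_head, axis=1)]
print(labels)  # [0 0 1 1]
```

These four labels match the first four entries of `pred` above.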

In [24]:
# confusion matrix
from sklearn.metrics import confusion_matrix

In [25]:
confusionMatrix = confusion_matrix(y_test, pred)
print(confusionMatrix)

[[95 12]
 [28 61]]


In [26]:
# classification report
from sklearn.metrics import classification_report

In [27]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.77      0.89      0.83       107
           1       0.84      0.69      0.75        89

    accuracy                           0.80       196
   macro avg       0.80      0.79      0.79       196
weighted avg       0.80      0.80      0.79       196
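
The manual weighting above can also be expressed with scikit-learn's `VotingClassifier`, which averages `predict_proba` outputs when `voting='soft'` and accepts per-model `weights`. The sketch below is not part of the original exercise: it runs on synthetic data from `make_classification` as a stand-in for the credit dataset, and the names `X_demo`, `y_demo`, `voter`, and the smaller forest size are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the exercise itself uses the credit dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=123)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.3, random_state=123)

# voting='soft' averages the models' predict_proba outputs using the weights.
voter = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(random_state=123)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
        ('rf', RandomForestClassifier(n_estimators=100, random_state=123)),
    ],
    voting='soft',
    weights=[0.60, 0.20, 0.20],
)
voter.fit(Xtr, ytr)

proba = voter.predict_proba(Xte)  # weighted average of the three models
pred_demo = voter.predict(Xte)    # class with the highest averaged probability
print(proba[:4])
```

Doing the weighting by hand, as in this exercise, keeps each step visible; the `VotingClassifier` form is convenient when the ensemble needs to behave like a single estimator.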



## Iteration 2 with Different Weights
In the first iteration, we got an accuracy of **80%**. This metric reflects the weights that we applied in that iteration. Let's change the weights and see what effect this has on the metrics. The choice of weights is based on our judgment of the dataset and the distribution of the data. Let's say we feel that the classes are close to linearly separable, and therefore we decide to increase the weight for the logistic regression model and decrease the weights of the other two models.
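
Instead of relying on judgment alone, candidate weight sets can be compared side by side. The sketch below is a simple comparison loop, not the method used in this exercise: `ensemble_accuracy` is a hypothetical helper, and the probabilities and labels are made-up values for four examples. In practice, such a comparison would ideally be run on a held-out validation set rather than on the test set.

```python
import numpy as np

def ensemble_accuracy(probas, weights, y_true):
    """Accuracy of a weighted-average ensemble (hypothetical helper)."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights should sum to 1")
    # Weighted average over the model axis, then pick the most probable class.
    combined = np.average(np.stack(probas), axis=0, weights=weights)
    return float(np.mean(np.argmax(combined, axis=1) == y_true))

# Illustrative labels and class probabilities from three models.
y_val = np.array([0, 1, 1, 0])
pa = np.array([[0.70, 0.30], [0.65, 0.35], [0.2, 0.8], [0.65, 0.35]])
pb = np.array([[0.60, 0.40], [0.10, 0.90], [0.3, 0.7], [0.40, 0.60]])
pc = np.array([[0.80, 0.20], [0.30, 0.70], [0.4, 0.6], [0.30, 0.70]])

candidates = [(0.60, 0.20, 0.20), (0.70, 0.15, 0.15), (1/3, 1/3, 1/3)]
scores = {w: ensemble_accuracy([pa, pb, pc], w, y_val) for w in candidates}

for w, acc in scores.items():
    print(w, acc)
```

Even on this toy data, small changes in the weights flip individual predictions, which is why the metrics below differ between iterations.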

In [28]:
# calculate ensemble prediction by applying weights for each prediction
ensemblepred=(pred1 *0.70 + pred2 * 0.15 + pred3 * 0.15)

# display first four rows of ensemble prediction array
ensemblepred[0:4,:]

array([[0.83052469, 0.16947531],
       [0.65989106, 0.34010894],
       [0.16832483, 0.83167517],
       [0.09101925, 0.90898075]])

As in the first iteration, each row holds two probabilities, one for each class.

In [29]:
# printing the order of classes for each model
print(model1.classes_)
print(model2.classes_)
print(model3.classes_)

[0 1]
[0 1]
[0 1]


As before, the final prediction for each example is the class with the highest probability, which we obtain with the NumPy function **np.argmax()**.

In [30]:
pred = np.argmax(ensemblepred,axis=1)
pred

array([0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1])

In [31]:
confusionMatrix = confusion_matrix(y_test, pred)
print(confusionMatrix)

[[92 15]
 [26 63]]


In [32]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.78      0.86      0.82       107
           1       0.81      0.71      0.75        89

    accuracy                           0.79       196
   macro avg       0.79      0.78      0.79       196
weighted avg       0.79      0.79      0.79       196



Looking at the results from a business perspective, the recall for class 1 rose from 69% to 71%, so the card division now identifies more of its creditworthy customers. However, this has come at the cost of increased risk: the recall for class 0 fell from 89% to 86%, which means that 14% (100% - 86%) of unworthy customers are now tagged as creditworthy, up from 11% in the first iteration. Overall accuracy also dropped slightly, from 80% to 79%.
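
The figures behind this trade-off can be recomputed directly from the two confusion matrices printed above. The helper name `summarize` is an assumption made for this sketch; rows of each matrix are the true classes and columns are the predicted classes, as returned by `confusion_matrix`.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: true class, columns: predicted class).
cm_iter1 = np.array([[95, 12], [28, 61]])
cm_iter2 = np.array([[92, 15], [26, 63]])

def summarize(cm):
    """Return recall of class 1, share of class 0 predicted as 1, and accuracy."""
    recall_1 = cm[1, 1] / cm[1].sum()          # creditworthy customers found
    false_positive_rate = cm[0, 1] / cm[0].sum()  # unworthy tagged as creditworthy
    accuracy = np.trace(cm) / cm.sum()
    return recall_1, false_positive_rate, accuracy

for name, cm in [("iteration 1", cm_iter1), ("iteration 2", cm_iter2)]:
    r1, fpr, acc = summarize(cm)
    print(f"{name}: recall(1)={r1:.2f}, class 0 tagged as 1={fpr:.2f}, accuracy={acc:.2f}")
```

The rounded values agree with the classification reports: recall for class 1 is 0.69 and then 0.71, and accuracy is 0.80 and then 0.79.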