## Exercise
Using data `german_credit_data.csv` from the previous lesson, build ensemble learning models to predict credit risk.

* Import all the Python dependencies you will be needing for this exercise.

In [1]:
import pandas as pd
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, \
    GradientBoostingClassifier, StackingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

* Load the dataset in a DataFrame object.

In [2]:
data = pd.read_csv("german_credit_data.csv", index_col="Unnamed: 0")
data.head()

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,67,male,2,own,,little,1169,6,radio/TV,good
1,22,female,2,own,little,moderate,5951,48,radio/TV,bad
2,49,male,1,own,little,,2096,12,education,good
3,45,male,2,free,little,little,7882,42,furniture/equipment,good
4,53,male,2,free,little,little,4870,24,car,bad


* Get some statistics about the data using Panda's `describe()` and `info()` functions.

In [3]:
data.describe()

Unnamed: 0,Age,Job,Credit amount,Duration
count,1000.0,1000.0,1000.0,1000.0
mean,35.546,1.904,3271.258,20.903
std,11.375469,0.653614,2822.736876,12.058814
min,19.0,0.0,250.0,4.0
25%,27.0,2.0,1365.5,12.0
50%,33.0,2.0,2319.5,18.0
75%,42.0,2.0,3972.25,24.0
max,75.0,3.0,18424.0,72.0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Age               1000 non-null   int64 
 1   Sex               1000 non-null   object
 2   Job               1000 non-null   int64 
 3   Housing           1000 non-null   object
 4   Saving accounts   817 non-null    object
 5   Checking account  606 non-null    object
 6   Credit amount     1000 non-null   int64 
 7   Duration          1000 non-null   int64 
 8   Purpose           1000 non-null   object
 9   Risk              1000 non-null   object
dtypes: int64(4), object(6)
memory usage: 85.9+ KB


* Label encoding for columns `Sex` and `Risk`.

In [5]:
columns_label = ["Sex", "Risk"]
labelencoder = LabelEncoder()
for i in columns_label:
    data[i] = labelencoder.fit_transform(data[i])

data.head()

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk
0,67,1,2,own,,little,1169,6,radio/TV,1
1,22,0,2,own,little,moderate,5951,48,radio/TV,0
2,49,1,1,own,little,,2096,12,education,1
3,45,1,2,free,little,little,7882,42,furniture/equipment,1
4,53,1,2,free,little,little,4870,24,car,0


* Convert age to category, add to new column `Cat Age`.

In [6]:
Cat_Age = []
for i in data["Age"]:
    if i < 25:
        Cat_Age.append("0-25")
    elif (i >= 25) and (i < 30):
        Cat_Age.append("25-30")
    elif (i >= 30) and (i < 35):
        Cat_Age.append("30-35")
    elif (i >= 35) and (i < 40):
        Cat_Age.append("35-40")
    elif (i >= 40) and (i < 50):
        Cat_Age.append("40-50")
    elif (i >= 50) and (i < 76):
        Cat_Age.append("50-75")

data["Cat Age"] = Cat_Age

data.head()

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk,Cat Age
0,67,1,2,own,,little,1169,6,radio/TV,1,50-75
1,22,0,2,own,little,moderate,5951,48,radio/TV,0,0-25
2,49,1,1,own,little,,2096,12,education,1,40-50
3,45,1,2,free,little,little,7882,42,furniture/equipment,1,40-50
4,53,1,2,free,little,little,4870,24,car,0,50-75


* Use `get_dummies` method to make one-hot-encoding for columns: `Housing`, `Saving accounts`, `Checking account`, `Purpose` and `Cat Age`.

In [7]:
columns_dummy = ['Housing', 'Saving accounts', 'Checking account', "Purpose", "Cat Age"]
for i in columns_dummy:
    data = pd.concat([data, pd.get_dummies(data[i])], axis=1)

data.head()

Unnamed: 0,Age,Sex,Job,Housing,Saving accounts,Checking account,Credit amount,Duration,Purpose,Risk,...,furniture/equipment,radio/TV,repairs,vacation/others,0-25,25-30,30-35,35-40,40-50,50-75
0,67,1,2,own,,little,1169,6,radio/TV,1,...,0,1,0,0,0,0,0,0,0,1
1,22,0,2,own,little,moderate,5951,48,radio/TV,0,...,0,1,0,0,1,0,0,0,0,0
2,49,1,1,own,little,,2096,12,education,1,...,0,0,0,0,0,0,0,0,1,0
3,45,1,2,free,little,little,7882,42,furniture/equipment,1,...,1,0,0,0,0,0,0,0,1,0
4,53,1,2,free,little,little,4870,24,car,0,...,0,0,0,0,0,0,0,0,0,1


* Drop unnecessary columns: `Housing`, `Saving accounts`, `Checking account`, `Purpose` and `Cat Age`.

In [8]:
data.drop(['Housing', 'Saving accounts', 'Checking account', "Purpose", "Age", "Cat Age"], axis=1, inplace=True)
data.head()

Unnamed: 0,Sex,Job,Credit amount,Duration,Risk,free,own,rent,little,moderate,...,furniture/equipment,radio/TV,repairs,vacation/others,0-25,25-30,30-35,35-40,40-50,50-75
0,1,2,1169,6,1,0,1,0,0,0,...,0,1,0,0,0,0,0,0,0,1
1,0,2,5951,48,0,0,1,0,1,0,...,0,1,0,0,1,0,0,0,0,0
2,1,1,2096,12,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
3,1,2,7882,42,1,1,0,0,1,0,...,1,0,0,0,0,0,0,0,1,0
4,1,2,4870,24,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1


In [9]:
data.describe()

Unnamed: 0,Sex,Job,Credit amount,Duration,Risk,free,own,rent,little,moderate,...,furniture/equipment,radio/TV,repairs,vacation/others,0-25,25-30,30-35,35-40,40-50,50-75
count,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,...,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0,1000.0
mean,0.69,1.904,3271.258,20.903,0.7,0.108,0.713,0.179,0.603,0.103,...,0.181,0.28,0.022,0.012,0.149,0.222,0.177,0.153,0.174,0.125
std,0.462725,0.653614,2822.736876,12.058814,0.458487,0.310536,0.452588,0.383544,0.489521,0.304111,...,0.385211,0.449224,0.146757,0.10894,0.356267,0.415799,0.38186,0.360168,0.379299,0.330884
min,0.0,0.0,250.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,1365.5,12.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,1.0,2.0,2319.5,18.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,1.0,2.0,3972.25,24.0,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,3.0,18424.0,72.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


* Separated data as train and test (20% of data is test data, random_state=42).

In [9]:
y = data.Risk
X = data.drop("Risk", axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

* Apply standart scaling for `X_train` and `X_test`.

In [10]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

* Build Bagging meta-estimator model.

In [11]:
model = BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced', random_state=2),
                          random_state=2)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.7


Confusion matrix:
 [[ 28  31]
 [ 29 112]]


Classification report:
               precision    recall  f1-score   support

           0       0.49      0.47      0.48        59
           1       0.78      0.79      0.79       141

    accuracy                           0.70       200
   macro avg       0.64      0.63      0.64       200
weighted avg       0.70      0.70      0.70       200



* Build Random Forest model.

In [12]:
model = RandomForestClassifier(class_weight='balanced', random_state=2)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.745


Confusion matrix:
 [[ 23  36]
 [ 15 126]]


Classification report:
               precision    recall  f1-score   support

           0       0.61      0.39      0.47        59
           1       0.78      0.89      0.83       141

    accuracy                           0.74       200
   macro avg       0.69      0.64      0.65       200
weighted avg       0.73      0.74      0.73       200



* Build AdaBoost model.

In [13]:
model = AdaBoostClassifier(base_estimator=DecisionTreeClassifier(class_weight='balanced', random_state=2),
                           random_state=2)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.665


Confusion matrix:
 [[ 26  33]
 [ 34 107]]


Classification report:
               precision    recall  f1-score   support

           0       0.43      0.44      0.44        59
           1       0.76      0.76      0.76       141

    accuracy                           0.67       200
   macro avg       0.60      0.60      0.60       200
weighted avg       0.67      0.67      0.67       200



* Build Gradient Boosting model.

In [14]:
model = GradientBoostingClassifier(learning_rate=0.01, random_state=2)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.735


Confusion matrix:
 [[  6  53]
 [  0 141]]


Classification report:
               precision    recall  f1-score   support

           0       1.00      0.10      0.18        59
           1       0.73      1.00      0.84       141

    accuracy                           0.73       200
   macro avg       0.86      0.55      0.51       200
weighted avg       0.81      0.73      0.65       200



* Build XGBoost model.

In [15]:
model = xgb.XGBClassifier(random_state=2, eta=0.01)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.78


Confusion matrix:
 [[ 29  30]
 [ 14 127]]


Classification report:
               precision    recall  f1-score   support

           0       0.67      0.49      0.57        59
           1       0.81      0.90      0.85       141

    accuracy                           0.78       200
   macro avg       0.74      0.70      0.71       200
weighted avg       0.77      0.78      0.77       200



* Build Stacking model.

In [16]:
estimators = [
    ('lr', LogisticRegression(class_weight='balanced', random_state=2)),
    ('knn', KNeighborsClassifier()),
    ('dt', DecisionTreeClassifier(class_weight='balanced', random_state=2)),
    ('svm', SVC(class_weight='balanced', random_state=2))
]
final_estimator = LogisticRegression(class_weight='balanced', random_state=2)
model = StackingClassifier(estimators=estimators, final_estimator=final_estimator, cv=5)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
# Check accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("\n")
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("\n")
print("Classification report:\n", classification_report(y_test, y_pred))

Accuracy:  0.695


Confusion matrix:
 [[43 16]
 [45 96]]


Classification report:
               precision    recall  f1-score   support

           0       0.49      0.73      0.59        59
           1       0.86      0.68      0.76       141

    accuracy                           0.69       200
   macro avg       0.67      0.70      0.67       200
weighted avg       0.75      0.69      0.71       200



* <b>Bonus</b>: Use `GridSearchCV` to find the best Hyper Parameters and re-build each model.