In [3]:
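# Import the submodules explicitly: `from sklearn import *` does not bind the name `sklearn` used below.
import sklearn.cluster
import sklearn.datasets
import sklearn.ensemble
import sklearn.linear_model
import sklearn.metrics
import sklearn.model_selection
import sklearn.preprocessing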
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import math
import tensorflow as tf
from keras.utils import np_utils

In [6]:
import warnings
warnings.filterwarnings('ignore')

## Bagging K-means

## Fetch the Wisconsin Breast Cancer dataset with the respective scikit-learn methods and split into training and test set (test ratio of 0.2).

In [4]:
cancer = sklearn.datasets.load_breast_cancer()

df_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df_cancer['label'] = cancer.target

cancer_X = df_cancer.drop('label', axis=1)
cancer_y = df_cancer['label']

cancer_X_train, cancer_X_test, cancer_y_train, cancer_y_test = sklearn.model_selection.train_test_split(cancer_X, cancer_y, test_size=0.2, random_state=1)

## Train a k-Means clustering algorithm (𝑘=2) on the training data and inspect its performance on the test set w.r.t. the Adjusted Rand Index (ARI). What does this measure express?

In [7]:
kmeans = sklearn.cluster.KMeans(n_clusters=2)
# k-means is unsupervised: the labels are not needed for fitting (KMeans.fit ignores y).
kmeans.fit(cancer_X_train)

labels_pred = kmeans.predict(cancer_X_test)

ari = sklearn.metrics.cluster.adjusted_rand_score(cancer_y_test, labels_pred)

print("ARI is", ari)

ARI is 0.3551488631223801


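The Rand index measures the similarity of two clusterings by looking at all pairs of samples and counting the pairs that are treated consistently, i.e. placed in the same cluster in both clusterings or in different clusters in both.
The adjusted Rand index (ARI) corrects this count for chance: it is close to 0.0 for random labelings regardless of the number of clusters and samples, exactly 1.0 when the two clusterings are identical up to a renaming of the cluster ids, and negative when the agreement is worse than expected by chance. Because it ignores how the clusters are numbered, it can compare the arbitrary k-means cluster ids with the true class labels. An ARI of about 0.36 indicates only moderate agreement between the two clusters and the benign/malignant labels.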
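To make the pair-counting idea concrete, here is a minimal pure-Python sketch of the ARI formula. The function name `adjusted_rand_index` is made up for this illustration; in the notebook itself `sklearn.metrics.adjusted_rand_score` remains the function to use.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI (illustrative helper, not the scikit-learn code)."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs that share a cluster in both labelings, in the true one, in the predicted one.
    pairs_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    # Expected number of shared pairs under random labeling with the same cluster sizes.
    expected = pairs_true * pairs_pred / comb(n, 2)
    max_index = 0.5 * (pairs_true + pairs_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. everything in one cluster in both labelings
    return (pairs_both - expected) / (max_index - expected)


truth = [0, 0, 0, 1, 1, 1]
relabelled = [1, 1, 1, 0, 0, 0]   # same partition, cluster ids swapped
alternating = [0, 1, 0, 1, 0, 1]  # partition unrelated to the truth

print(adjusted_rand_index(truth, relabelled))   # 1.0: the numbering does not matter
print(adjusted_rand_index(truth, alternating))  # negative: worse than chance
```

The first call shows why the ARI suits k-means: swapping the cluster ids leaves the score at 1.0, whereas plain accuracy would drop to 0.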

## Train a BaggingClassifier with k-Means (𝑘=2) as the base estimator. The classifier should use all features and 30 % of the training examples for each created classifier. Also, it should be allowed to use the same example for different classifier instances. There should be 20 estimators used during bagging.

In [8]:
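# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4).
bc = sklearn.ensemble.BaggingClassifier(base_estimator=sklearn.cluster.KMeans(n_clusters=2), max_samples=0.3, max_features=1.0, n_estimators=20, bootstrap=True, random_state=5)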
bc.fit(cancer_X_train, cancer_y_train)

## Evaluate the ARI performance of the bagging classifier. How does it perform? In which scenarios is bagging probably outperforming a single estimator?

In [9]:
pred = bc.predict(cancer_X_test)
ari_bc = sklearn.metrics.cluster.adjusted_rand_score(cancer_y_test, pred)
print("ARI with bagging is", ari_bc)

ARI with bagging is 0.33219165511341014


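Here bagging does not outperform a single k-means model (ARI 0.33 vs. 0.36). In general, bagging helps most with high-variance, low-bias base models such as deep decision trees, because aggregating models trained on bootstrap samples reduces variance and thereby overfitting. k-means with two clusters is probably already fairly stable, so there is little variance to remove. Bagging can also help when the training set is small. Note that k-means cluster ids are arbitrary, so the individual estimators are not guaranteed to number their clusters consistently, which can weaken the majority vote.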
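One caveat when bagging a clustering model is that cluster ids are arbitrary, so two estimators can find the same groups but number them differently, and a plain majority vote then lets their votes cancel. The following pure-Python sketch illustrates this with three hypothetical ensemble members. The helpers `majority_vote` and `align_binary` are invented for this example and are not part of scikit-learn.

```python
from collections import Counter


def majority_vote(member_predictions):
    """Most common label per sample across all ensemble members."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*member_predictions)]


def align_binary(reference, labels):
    """Flip 0/1 cluster ids if the flipped version agrees better with the reference."""
    agreement = sum(r == l for r, l in zip(reference, labels))
    if agreement < len(labels) / 2:
        return [1 - l for l in labels]
    return list(labels)


# Hypothetical predictions: member B found the same groups as A but with swapped ids,
# member C disagrees with A on the third sample only.
member_a = [0, 0, 0, 1, 1, 1]
member_b = [1, 1, 1, 0, 0, 0]
member_c = [0, 0, 1, 1, 1, 1]
members = [member_a, member_b, member_c]

raw_vote = majority_vote(members)
aligned_vote = majority_vote([align_binary(member_a, m) for m in members])

print(raw_vote)      # A and B cancel out, so the vote follows C alone
print(aligned_vote)  # after alignment, A and B outvote C on the third sample
```

Aligning to the first member is only one simple choice. With more than two clusters, the ids would have to be matched some other way, for example by comparing centroids.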

## Random Forest

## Generate some synthetic regression data with the respective scikit-learn method. You want to have 1000 samples with 20 features (10 informative ones). Set a standard deviation for the gaussian noise of 0.2.

In [10]:
features, target = sklearn.datasets.make_regression(n_samples=1000, n_features=20, n_informative=10, noise=0.2, random_state=5)

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(features, target, test_size=0.2, random_state=12)

## Build a pipeline that scales the data and afterwards applies a Random Forest regressor. Set the parameters of the Random Forest to 50 estimators and a maximum number of ten leaf nodes. Train this pipeline and measure its performance on the test set. Use a suitable performance measure.

In [11]:
sc = sklearn.preprocessing.RobustScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

rf = sklearn.ensemble.RandomForestRegressor(n_estimators=50, max_leaf_nodes=10)
rf.fit(X_train_std, y_train)

Y_pred = rf.predict(X_test_std)

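# scikit-learn metrics expect (y_true, y_pred); the MSE value itself is symmetric in its arguments.
print('MSE for Random Forest Regressor ', sklearn.metrics.mean_squared_error(y_test, Y_pred))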

MSE for Random Forest Regressor  8457.827532905352
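The task asks for a pipeline, while the cell above scales and fits in separate steps. A minimal sketch of the same workflow with `sklearn.pipeline.Pipeline` could look as follows. The step names `"scale"` and `"forest"` and the forest's `random_state=0` are assumptions added for this example, so the numbers will not match the output above exactly. The R² score is reported as well because the raw MSE is hard to judge without knowing the scale of the target.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Same synthetic data and split as in the cells above.
features, target = make_regression(n_samples=1000, n_features=20, n_informative=10,
                                   noise=0.2, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                    test_size=0.2, random_state=12)

# The scaler is fitted on the training data only, inside pipe.fit().
pipe = Pipeline([
    ("scale", RobustScaler()),
    ("forest", RandomForestRegressor(n_estimators=50, max_leaf_nodes=10,
                                     random_state=0)),  # seed is an assumption
])
pipe.fit(X_train, y_train)

y_pred = pipe.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.1f}, R^2: {r2:.3f}")
```

Wrapping both steps in one object also means the pipeline can be passed directly to tools such as `cross_val_score` without leaking test data into the scaler.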


## Check the importance values of the features estimated by the Random Forest. Are they compliant with the generated data?

In [12]:
importance = rf.feature_importances_
for i,v in enumerate(importance):
    print('Feature: %0d, Score: %.5f' % (i,v))

Feature: 0, Score: 0.00000
Feature: 1, Score: 0.24205
Feature: 2, Score: 0.00000
Feature: 3, Score: 0.00000
Feature: 4, Score: 0.00000
Feature: 5, Score: 0.00296
Feature: 6, Score: 0.00000
Feature: 7, Score: 0.00860
Feature: 8, Score: 0.00000
Feature: 9, Score: 0.24133
Feature: 10, Score: 0.00000
Feature: 11, Score: 0.00000
Feature: 12, Score: 0.00000
Feature: 13, Score: 0.39075
Feature: 14, Score: 0.00000
Feature: 15, Score: 0.00000
Feature: 16, Score: 0.00000
Feature: 17, Score: 0.00000
Feature: 18, Score: 0.11432
Feature: 19, Score: 0.00000


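Only six of the 20 features receive a non-zero importance, and just four of them (1, 9, 13 and 18) carry almost all of it, although the data was generated with 10 informative features. This is not necessarily a contradiction: with at most ten leaf nodes each tree makes at most nine splits, so the forest probably uses only the informative features with the strongest effect on the target and never splits on the weaker ones.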
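One way to check the importances against the generated data is to ask `make_regression` for the ground-truth coefficients via `coef=True` and print them next to the estimated importances. This is a sketch under assumptions: the forest is refitted here on the full, unscaled data with an assumed `random_state=0`, so the values will differ slightly from the output above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# coef=True additionally returns the coefficients of the underlying linear model;
# non-informative features have a coefficient of exactly 0.
X, y, coef = make_regression(n_samples=1000, n_features=20, n_informative=10,
                             noise=0.2, coef=True, random_state=5)

rf = RandomForestRegressor(n_estimators=50, max_leaf_nodes=10,
                           random_state=0)  # seed is an assumption
rf.fit(X, y)

informative = np.flatnonzero(coef)
print("Informative features:", informative)

# List the informative features from strongest to weakest true effect.
for i in sorted(informative, key=lambda j: -abs(coef[j])):
    print(f"feature {i:2d}  |coef| = {abs(coef[i]):7.2f}  "
          f"importance = {rf.feature_importances_[i]:.5f}")
```

If the shallow-tree explanation holds, the features with the largest absolute coefficients should be the ones with noticeable importance, while informative features with small coefficients end up near zero.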

## AdaBoost with Logistic Regression

## Fetch the Wisconsin Breast Cancer dataset with the respective scikit-learn methods and split into training and test set (test ratio of 0.2). Scale the data to [0,1].

In [13]:
cancer = sklearn.datasets.load_breast_cancer()

df_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df_cancer['label'] = cancer.target

cancer_X = df_cancer.drop('label', axis=1)
cancer_y = df_cancer['label']

cancer_X_train, cancer_X_test, cancer_y_train, cancer_y_test = sklearn.model_selection.train_test_split(cancer_X, cancer_y, test_size=0.2, random_state=1)

sc = sklearn.preprocessing.MinMaxScaler()
sc.fit(cancer_X_train)
cancer_X_train_std = sc.transform(cancer_X_train)
cancer_X_test_std = sc.transform(cancer_X_test)

## Train a Logistic Regression that classifies the training data. Use default parameters. What is the performance of the trained model w.r.t. accuracy?

In [14]:
logisticRegr = sklearn.linear_model.LogisticRegression()

logisticRegr.fit(cancer_X_train_std, cancer_y_train)

pred1 = logisticRegr.predict(cancer_X_test_std)

print("Accuracy for Logistic Regression:", sklearn.metrics.accuracy_score(cancer_y_test, pred1))

Accuracy for Logistic Regression: 0.956140350877193


## Train an AdaBoostClassifier with Logistic Regression (same configuration as before) as the base estimator. There should be 20 estimators used during boosting.

In [15]:
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator` (the old name was removed in 1.4).
AdaBoost = sklearn.ensemble.AdaBoostClassifier(base_estimator=sklearn.linear_model.LogisticRegression(), n_estimators=20)

AdaBoost.fit(cancer_X_train_std, cancer_y_train)

pred2 = AdaBoost.predict(cancer_X_test_std)

print("Accuracy for AdaBoostClassifier:", sklearn.metrics.accuracy_score(cancer_y_test, pred2))

Accuracy for AdaBoostClassifier: 0.9035087719298246


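AdaBoost first fits the base classifier on the original dataset and then fits further copies on the same data, each time increasing the weights of the instances that were misclassified so that subsequent classifiers focus on the difficult cases.
Here AdaBoost did not improve on the single model: test accuracy dropped from 0.956 to 0.904. Boosting tends to pay off for weak learners such as decision stumps. Logistic regression already fits this data well, so reweighting the samples probably adds little.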
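The reweighting step can be illustrated with one round of the textbook discrete AdaBoost update for binary labels. This is a simplified sketch, not scikit-learn's implementation, and the function name `adaboost_round` and the toy labels are made up for the example.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One discrete AdaBoost update; assumes the weighted error lies strictly in (0, 1)."""
    missed = [t != p for t, p in zip(y_true, y_pred)]
    # Weighted error rate of the current classifier.
    err = sum(w for w, m in zip(weights, missed) if m) / sum(weights)
    # Say of this classifier in the final vote: larger when the error is smaller.
    alpha = math.log((1 - err) / err)
    # Up-weight the misclassified samples, then renormalise to sum to 1.
    updated = [w * math.exp(alpha) if m else w for w, m in zip(weights, missed)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


weights = [0.2] * 5        # uniform start
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0]   # only the last sample is misclassified

alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha)        # log(4), since the weighted error is 0.2
print(new_weights)  # the misclassified sample now holds half of the total weight
```

After the update, the single misclassified sample carries as much weight as the four correct ones together, so the next classifier is pushed to get it right.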