Adapted by Carlos Toxtli http://www.carlostoxtli.com/#colab-ensem-2

Source: https://github.com/Adamantios/Ensembles-CSL-Imb_Learning-Models_Comparison/blob/master/part_a.ipynb

In [None]:
!git clone https://github.com/Adamantios/Ensembles-CSL-Imb_Learning-Models_Comparison.git
%cd Ensembles-CSL-Imb_Learning-Models_Comparison

Cloning into 'Ensembles-CSL-Imb_Learning-Models_Comparison'...
remote: Enumerating objects: 191, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (140/140), done.
remote: Total 191 (delta 100), reused 136 (delta 48), pack-reused 0
Receiving objects: 100% (191/191), 3.28 MiB | 6.94 MiB/s, done.
Resolving deltas: 100% (100/100), done.
/content/Ensembles-CSL-Imb_Learning-Models_Comparison


#### Adamantios Zaras AM: 06
#### Panagiotis Souranis AM: 17

# Description
In this part of the project, we build 4 ensemble methods and compare them on 10 different datasets, 
using statistical analysis methods.
#### Ensembles
1. Bagging Ensemble using a Decision Tree Classifier.  
We follow the procedure described below:
  - Random Search, in order to search for hyperparameters.
  - Grid Search in the area near the best parameters found by the Random Search.
  - 10-fold Cross Validation, in order to plot the accuracy vs the number of classifiers used.
  - Prediction using the best estimate for the number of classifiers, combined with the tuned parameters.
2. Random Forest Classifier.  
We follow the procedure described below:
  - Random Search.
  - Grid Search.
  - Prediction using the tuned parameters.
3. Stacking, using a Nearest Neighbors classifier, a Linear SVM, a Decision Tree classifier and a Naive Bayes classifier. The Meta-Classifier is a Logistic Regression Classifier.  
We follow the procedure described below:
  - Random Search.
  - Grid Search.
  - Present plots of each model's performance on the dataset and compare them with the stacking model, which combines them all.
  - Prediction using the tuned parameters of each model.  
  
  **Note:** The StackingCVClassifier uses the concept of cross-validation:  
  the dataset is split into k folds and, in k successive rounds, 
  k-1 folds are used to fit the first-level classifiers. In each round, 
  the first-level classifiers are then applied to the remaining fold, which was not used for model fitting in that round. 
  The resulting predictions are then stacked and provided, as input data, 
  to the second-level classifier. After the training of the StackingCVClassifier, 
  the first-level classifiers are fit to the entire dataset.

4. Boosting, using XGBoost with Logistic Regression.  
We follow the procedure described below:
  - Random Search.
  - Grid Search.
  - Prediction using the tuned parameters of each model.
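
The cross-validated stacking procedure described in the note under item 3 can be sketched with plain scikit-learn. This is only a minimal illustration and not mlxtend's actual implementation: the helper names `fit_cv_stacking` and `predict_cv_stacking`, the Iris data, the reduced set of base models and the use of `cross_val_predict` are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (StratifiedKFold, cross_val_predict,
                                     train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def fit_cv_stacking(base_models, meta_model, X, y, n_folds=5):
    """Hypothetical helper: train a meta-model on out-of-fold predictions."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    # Each column holds the labels predicted for every sample by a copy of
    # the model that was fitted on the other folds only.
    meta_features = np.column_stack([
        cross_val_predict(clone(model), X, y, cv=cv) for model in base_models
    ])
    fitted_meta = clone(meta_model).fit(meta_features, y)
    # Once the meta-model is trained, refit every base model on all the data.
    fitted_base = [clone(model).fit(X, y) for model in base_models]
    return fitted_base, fitted_meta


def predict_cv_stacking(fitted_base, fitted_meta, X):
    """Hypothetical helper: stack base predictions and apply the meta-model."""
    meta_features = np.column_stack([m.predict(X) for m in fitted_base])
    return fitted_meta.predict(meta_features)


X_demo, y_demo = load_iris(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

base_models = [KNeighborsClassifier(),
               DecisionTreeClassifier(random_state=0),
               GaussianNB()]
fitted_base, fitted_meta = fit_cv_stacking(
    base_models, LogisticRegression(max_iter=1000), Xd_train, yd_train)
stack_accuracy = float(np.mean(
    predict_cv_stacking(fitted_base, fitted_meta, Xd_test) == yd_test))
print('Stacking accuracy: {:.3f}'.format(stack_accuracy))
```

Because the meta-model only ever sees predictions for samples the base models were not trained on, it is less likely to learn from over-optimistic first-level outputs.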

#### Datasets
1. [Spambase](https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data)
2. **Wine** - Using sklearn's import.
3. **Iris** - Using sklearn's import.
4. [Breast Cancer](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data?fbclid=IwAR2ZT56DdRbU45HMFvq6gwTdjKsS-RLSQ0B1TQM4cskmA27x-upTF0n66BI)
5. [Seeds](https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt)
6. [Glass Identification](https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data)
7. [Tic Tac Toe](https://archive.ics.uci.edu/ml/machine-learning-databases/tic-tac-toe/tic-tac-toe.data)
8. [Wholesale Customers](https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv) - Predicting the channel.
9. **Digits** - Using sklearn's import.
10. [Chess (King-Rook vs. King-Pawn)](https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data)


#### Comparison
We first create a table containing the accuracy scores of each algorithm on the datasets and note their rankings.  
Next, we compute the mean ranking of each algorithm and fill in the results table.  
Finally, we run an alternative to the Friedman test (Iman-Davenport's correction of Friedman's rank sum test), 
followed by the Nemenyi post hoc test and the Friedman post hoc test with Bergmann and Hommel's correction, and present their outcomes.
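
The ranking and the first two tests can be sketched as follows. This is an illustration under stated assumptions, not the code used in the project: the accuracy matrix is filled with random placeholder numbers (not results), and the Nemenyi critical value `q_alpha = 2.569` is the commonly tabulated value for 4 algorithms at a 0.05 significance level. The Bergmann-Hommel procedure is not covered.

```python
import numpy as np
from scipy import stats

# Placeholder accuracies (rows: 10 datasets, columns: 4 algorithms).
# These are random numbers for illustration only, not results of this project.
rng = np.random.default_rng(0)
accuracies = rng.uniform(0.80, 1.00, size=(10, 4))
n_datasets, n_algorithms = accuracies.shape

# Rank the algorithms within each dataset; rank 1 is the highest accuracy.
ranks = stats.rankdata(-accuracies, axis=1)
mean_ranks = ranks.mean(axis=0)

# Friedman statistic computed from the mean ranks (no ties assumed).
chi2 = (12 * n_datasets / (n_algorithms * (n_algorithms + 1))
        * (np.sum(mean_ranks ** 2)
           - n_algorithms * (n_algorithms + 1) ** 2 / 4))

# Iman-Davenport correction: F-distributed with (k - 1) and
# (k - 1)(N - 1) degrees of freedom.
f_stat = (n_datasets - 1) * chi2 / (n_datasets * (n_algorithms - 1) - chi2)
p_value = stats.f.sf(f_stat, n_algorithms - 1,
                     (n_algorithms - 1) * (n_datasets - 1))

# Nemenyi critical difference: two algorithms differ significantly when
# their mean ranks are further apart than this value.
q_alpha = 2.569  # assumed tabulated value for k = 4, alpha = 0.05
critical_difference = q_alpha * np.sqrt(
    n_algorithms * (n_algorithms + 1) / (6 * n_datasets))

print('Mean ranks:', np.round(mean_ranks, 2))
print('Iman-Davenport F = {:.3f}, p = {:.3f}'.format(f_stat, p_value))
print('Nemenyi critical difference = {:.3f}'.format(critical_difference))
```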

# Globals


### Import all modules.


In [None]:
import io
import time

import numpy as np
import requests
import seaborn as sns
import xgboost as xgb
from mlxtend.classifier import StackingCVClassifier
from pandas import read_csv
from scipy.stats import randint as sp_randint
from sklearn import preprocessing
from sklearn.datasets import load_wine, load_digits, load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Import util functions.
from utils.general import report
from utils.part_a import *

%matplotlib inline




# Spambase



## Prepare the dataset.


In [None]:
# Read the dataset.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
s = requests.get(url).content
# The file has no header row, so do not treat the first sample as column names.
dataset = read_csv(io.StringIO(s.decode('utf-8')), header=None)

# Get x and y.
X, y = dataset.iloc[:, :-1].values, dataset.iloc[:, -1].values

# Split to training and test pairs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=0)

# Scale data.
scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))

## Bagging

In [None]:
# Define a Decision Tree classifier for the ensemble.
clf = DecisionTreeClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation, for the classifier which will be used in the bagging.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(1, 30),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X_train.shape[0] / 2),
              'criterion': ['gini', 'entropy']}
candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



Fitting 10 folds for each of 200 candidates, totalling 2000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:    1.3s
[Parallel(n_jobs=-1)]: Done 1912 tasks      | elapsed:   21.0s


RandomizedSearchCV took 21.73 seconds for 200 candidates.
Model with rank: 1
Mean validation score: 0.911 (std: 0.017)
Parameters: {'criterion': 'gini', 'max_depth': 19, 'max_features': 25, 'min_samples_split': 15}

Model with rank: 2
Mean validation score: 0.910 (std: 0.018)
Parameters: {'criterion': 'entropy', 'max_depth': 11, 'max_features': 48, 'min_samples_split': 6}

Model with rank: 3
Mean validation score: 0.904 (std: 0.019)
Parameters: {'criterion': 'entropy', 'max_depth': 25, 'max_features': 38, 'min_samples_split': 74}



[Parallel(n_jobs=-1)]: Done 2000 out of 2000 | elapsed:   21.5s finished
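
The `report` helper comes from the repository's `utils.general` module and is not shown in this notebook. A function producing output of the same shape might look like the sketch below; the name `format_report`, its `n_top` parameter and the hand-made `fake_results` dictionary are assumptions for illustration only.

```python
import numpy as np


def format_report(results, n_top=3):
    """Hypothetical stand-in for utils.general.report; returns the text."""
    lines = []
    for rank in range(1, n_top + 1):
        # Several candidates can share a rank when their mean scores tie.
        for idx in np.flatnonzero(results['rank_test_score'] == rank):
            lines.append('Model with rank: {}'.format(rank))
            lines.append('Mean validation score: {:.3f} (std: {:.3f})'.format(
                results['mean_test_score'][idx],
                results['std_test_score'][idx]))
            lines.append('Parameters: {}'.format(results['params'][idx]))
            lines.append('')
    return '\n'.join(lines)


# Tiny dictionary with the same keys as `cv_results_` (made-up values).
fake_results = {
    'rank_test_score': np.array([2, 1, 3]),
    'mean_test_score': np.array([0.90, 0.95, 0.85]),
    'std_test_score': np.array([0.02, 0.01, 0.03]),
    'params': [{'max_depth': 3}, {'max_depth': 5}, {'max_depth': 7}],
}
report_text = format_report(fake_results)
print(report_text)
```

Looping over ranks rather than over sorted indices is what lets tied candidates appear under the same rank.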


### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.
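
In this notebook the grid bounds are typed in by hand after reading the Random Search report. As a rough sketch of the same coarse-to-fine idea, the grid could also be derived from `best_params_`; the helper `around`, the Wine data, the smaller number of candidates and the 5-fold setting are assumptions chosen to keep the example fast.

```python
from scipy.stats import randint as sp_randint
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier


def around(value, low, high, width=1):
    """Hypothetical helper: integers within `width` of `value`, clipped."""
    return list(range(max(low, value - width), min(high, value + width) + 1))


X_demo, y_demo = load_wine(return_X_y=True)
n_samples, n_features = X_demo.shape
tree = DecisionTreeClassifier(random_state=0)

# Coarse stage: sample a few candidates from wide distributions.
coarse_dist = {'max_depth': sp_randint(1, 30),
               'max_features': sp_randint(1, n_features),
               'min_samples_split': sp_randint(2, n_samples // 2)}
coarse = RandomizedSearchCV(tree, coarse_dist, n_iter=20, cv=5,
                            random_state=0)
coarse.fit(X_demo, y_demo)
best = coarse.best_params_

# Fine stage: exhaustively search the neighbourhood of the coarse optimum.
fine_grid = {'max_depth': around(best['max_depth'], 1, 30),
             'max_features': around(best['max_features'], 1, n_features),
             'min_samples_split': around(best['min_samples_split'], 2,
                                         n_samples // 2)}
fine = GridSearchCV(tree, fine_grid, cv=5)
fine.fit(X_demo, y_demo)

print('Random search:', coarse.best_params_, round(coarse.best_score_, 3))
print('Grid search:  ', fine.best_params_, round(fine.best_score_, 3))
```

Since the grid contains the best random candidate and both searches use the same folds, the grid search score cannot be worse than the random search score in this setup.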

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(10, 14),
              'max_features': range(54, 57),
              'min_samples_split': range(23, 26),
              'criterion': ['gini', 'entropy']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=5, 
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done 388 tasks      | elapsed:   11.8s


GridSearchCV took 23.03 seconds.
Model with rank: 1
Mean validation score: 0.913 (std: 0.015)
Parameters: {'criterion': 'gini', 'max_depth': 11, 'max_features': 54, 'min_samples_split': 23}

Model with rank: 1
Mean validation score: 0.913 (std: 0.013)
Parameters: {'criterion': 'gini', 'max_depth': 13, 'max_features': 55, 'min_samples_split': 25}

Model with rank: 3
Mean validation score: 0.912 (std: 0.012)
Parameters: {'criterion': 'gini', 'max_depth': 11, 'max_features': 56, 'min_samples_split': 24}

Model with rank: 3
Mean validation score: 0.912 (std: 0.013)
Parameters: {'criterion': 'gini', 'max_depth': 13, 'max_features': 55, 'min_samples_split': 24}

Model with rank: 3
Mean validation score: 0.912 (std: 0.013)
Parameters: {'criterion': 'gini', 'max_depth': 13, 'max_features': 56, 'min_samples_split': 23}



[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed:   23.0s finished



### Create the Bagging Ensemble.
Create the Bagging Ensemble, using the best parameters found by the above procedure.  
When several candidates share the same mean accuracy, we pick the one with the smallest standard deviation across the CV folds.  
Predictions are made on the test data, which were held out by the train-test split, and not on the training data again.  
We use 10-fold cross-validation to visualize how the accuracy changes with the number of classifiers.

In [None]:

# Add best values for the classifier.
clf.max_depth = 13
clf.max_features = 55
clf.min_samples_split = 25
clf.criterion = 'gini'

# Plot accuracy vs number of estimators for [1, 101, 201, ..., 1001] estimators.
estimators_vs_acc(clf, X_train, y_train, estimators_array=range(1, 1100, 100))

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.3s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


### Predict
Predict using the best number of classifiers, based on the previous cross-validation.  
The best number of estimators seems to be 800, since the accuracy is high  
and the lower bound of its deviation is better than the others.
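
The selection rule described above (high mean accuracy with the best lower bound) can be written down explicitly. The sketch below is not the repository's `estimators_vs_acc` helper; the Wine data, the three ensemble sizes and the `mean - std` lower bound are assumptions made to keep the example small.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)
base_tree = DecisionTreeClassifier(max_depth=3, random_state=0)

summary = {}
for n_estimators in (1, 10, 50):
    # The base estimator is passed positionally because its keyword name
    # differs between scikit-learn versions.
    bagging = BaggingClassifier(base_tree, n_estimators=n_estimators,
                                random_state=0)
    scores = cross_val_score(bagging, X_demo, y_demo, cv=10)
    summary[n_estimators] = (scores.mean(), scores.std())

# Choose the ensemble size whose lower bound (mean - std) is highest.
best_n = max(summary, key=lambda n: summary[n][0] - summary[n][1])

for n, (mean, std) in summary.items():
    print('{:>3} estimators: {:.3f} +- {:.3f}'.format(n, mean, std))
print('Selected number of estimators:', best_n)
```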

In [None]:
# Create the bagging classifier.
bg_clf = BaggingClassifier(base_estimator=clf, n_estimators=800, random_state=0)

# Fit and predict.
bg_clf.fit(X_train, y_train)
y_pred = bg_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## Random Forest

In [None]:
# Define a Random Forest classifier.
clf = RandomForestClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(2, 50),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X_train.shape[0] / 2),
              'n_estimators': sp_randint(20, 300),
              'criterion': ['gini', 'entropy']}
candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=8, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(32, 34),
              'max_features': range(10, 12),
              'min_samples_split': range(21, 23),
              'n_estimators': range(115, 120),
              'criterion': ['entropy', 'gini']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=8,
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Predict using the Random Forest Ensemble.
Make predictions with the Random Forest Ensemble, using the best parameters found by the above procedure.


In [None]:

# Add best values for the classifier.
clf.max_depth = 32
clf.max_features = 10
clf.min_samples_split = 21
clf.n_estimators = 116
clf.criterion = 'entropy'
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## Stacking

In [None]:
# Define the stacking classifier.
clf1 = KNeighborsClassifier()
clf2 = LinearSVC(random_state=0)
clf3 = DecisionTreeClassifier(random_state=0)
clf4 = GaussianNB()
meta_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'auto', 
                              random_state=0)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4], 
                            meta_classifier=meta_clf)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'kneighborsclassifier__n_neighbors': sp_randint(1, 100),
              'linearsvc__C': np.logspace(-3, 3),
              'decisiontreeclassifier__max_depth': sp_randint(2, 60),
              'meta-logisticregression__C': np.logspace(-3, 3)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(sclf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=9, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds, for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)

### Grid Search

In [None]:
# Specify parameter_grid for the search.
param_grid = {'kneighborsclassifier__n_neighbors': range(39, 41),
              'linearsvc__C': np.logspace(-1, 2, 8),
              'decisiontreeclassifier__max_depth': range(14, 16),
              'meta-logisticregression__C': np.logspace(-1, 2, 4)}

# Run a grid search CV.
grid_search = GridSearchCV(sclf, param_grid, cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'.format((time.time() - start)))
report(grid_search.cv_results_)


### Plot stacking results

In [None]:
# Use best values for the classifier.
clf1.n_neighbors = 40
clf2.C = 100
clf3.max_depth = 15
meta_clf.C = 0.1

# Plot accuracy +- std for each model separately
# and compare it with the stacking model.
clf_names = ['KNN', 'Linear-SVM', 'Decision Tree', 'Naive-Bayes',
             'Logistic Regression', 'Stacking']
models = clf1, clf2, clf3, clf4, meta_clf, sclf
plot_accuracy_stacking(clf_names, models, X_train, y_train)

*Note:* in this case, the Naive Bayes classifier seems to negatively affect the overall performance of the model.
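
The comparison behind this kind of plot boils down to cross-validating every model separately and looking at mean accuracy and spread. The sketch below is not the repository's `plot_accuracy_stacking` helper: it prints numbers instead of plotting, uses the Wine data, and lets scikit-learn's `StackingClassifier` stand in for mlxtend's `StackingCVClassifier`; all of these are assumptions for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)

base = {'knn': KNeighborsClassifier(),
        'tree': DecisionTreeClassifier(random_state=0),
        'nb': GaussianNB()}

# scikit-learn's StackingClassifier is used here as a stand-in for
# mlxtend's StackingCVClassifier.
candidates_demo = dict(base)
candidates_demo['stacking'] = StackingClassifier(
    estimators=list(base.items()),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5)

cv_summary = {}
for name, model in candidates_demo.items():
    # Scaling is part of the pipeline, so it is refitted on the training
    # part of every fold.
    pipeline = make_pipeline(MinMaxScaler(), model)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=10)
    cv_summary[name] = (scores.mean(), scores.std())
    print('{:<9} {:.3f} +- {:.3f}'.format(name, *cv_summary[name]))
```

A base model whose interval lies clearly below the others is a candidate for removal from the stack, which is the kind of observation the note above makes about Naive Bayes.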

In [None]:
# Plot the learning curves.
plot_learning_curve(models, clf_names, X_train, y_train)


### Predict using the Stacking Ensemble.
Create a classification report with the Stacking Ensemble.


In [None]:
# Fit and predict.
clf2.C = 100
sclf.fit(X_train,y_train)
y_pred = sclf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## XGBoosting

In [None]:
# Create an XGBoost classifier with a binary logistic objective.
xgb_clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=600, 
                            objective='binary:logistic', silent=True, nthread=1)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
params = {'min_child_weight': sp_randint(4, 30),
          'gamma': np.random.uniform(0.5, 4,size=5),
          'subsample': np.random.uniform(0, 1, size=4),
          'colsample_bytree': np.random.uniform(0, 1, size=4),
          'max_depth': sp_randint(1, 7)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(xgb_clf, params, candidates, cv=10,
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)


### Grid Search

In [None]:
# Specify parameter_grid for the search.
params = {'min_child_weight': range(4, 6),
          'gamma': np.arange(1.1, 1.3, 0.1),
          'max_depth': range(1, 8)}
xgb_clf.subsample = 0.78
xgb_clf.colsample_bytree = 0.16

# Run a grid search CV.
grid_search = GridSearchCV(xgb_clf, params , cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)



### Predict using the boosted model.
Make predictions with the XGBoost model, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
xgb_clf.min_child_weight = 5
xgb_clf.gamma = 1.1
xgb_clf.max_depth = 7

# Fit and predict.
xgb_clf.fit(X_train,y_train)
y_pred = xgb_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')


# Wine



## Prepare the dataset.


In [None]:
# Get x and y.
X, y = load_wine(return_X_y=True)

# Split to training and test pairs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=0)

# Scale data.
scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))

## Bagging

In [None]:
# Define a Decision Tree classifier for the ensemble.
clf = DecisionTreeClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation, for the classifier which will be used in the bagging.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(1, 50),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X_train.shape[0] / 2),
              'criterion': ['gini', 'entropy']}
candidates = 1000

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(2, 12),
              'max_features': range(2, 11),
              'min_samples_split': range(25, 40),
              'criterion': ['gini', 'entropy']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=5, 
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Create the Bagging Ensemble.
Create the Bagging Ensemble, using the best parameters found by the above procedure.  
When several candidates share the same mean accuracy, we pick the one with the smallest standard deviation across the CV folds.  
Predictions are made on the test data, which were held out by the train-test split, and not on the training data again.  
We use 10-fold cross-validation to visualize how the accuracy changes with the number of classifiers.

In [None]:

# Add best values for the classifier.
clf.max_depth = 3
clf.max_features = 6
clf.min_samples_split = 32
clf.criterion = 'gini'

# Plot accuracy vs number of estimators for [1, 101, 201, ..., 1001] estimators.
estimators_vs_acc(clf, X_train, y_train, estimators_array=range(1, 1100, 100))

### Predict
Predict using the best number of classifiers, based on the previous cross-validation.  
The best number of estimators seems to be 100, since the accuracy is high  
and the lower bound of its deviation is better than the others.

In [None]:
# Create the bagging classifier.
bg_clf = BaggingClassifier(base_estimator=clf, n_estimators=100, random_state=0)

# Fit and predict.
bg_clf.fit(X_train, y_train)
y_pred = bg_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## Random Forest

In [None]:
# Define a Random Forest classifier.
clf = RandomForestClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(2, 50),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X_train.shape[0] / 2),
              'n_estimators': sp_randint(20, 300),
              'criterion': ['gini', 'entropy']}
candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=8, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(36, 40),
              'max_features': range(6, 10),
              'min_samples_split': range(12, 14),
              'n_estimators': range(34, 44, 2),
              'criterion': ['entropy', 'gini']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=8,
                           iid=True)
start = time.time()
# Fit on the scaled training split only, so the test data stay unseen.
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Predict using the Random Forest Ensemble.
Make predictions with the Random Forest Ensemble, using the best parameters found by the above procedure.


In [None]:

# Add best values for the classifier.
clf.max_depth = 36
clf.max_features = 6
clf.min_samples_split = 12
clf.n_estimators = 34
clf.criterion = 'entropy'
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## Stacking

In [None]:
# Define the stacking classifier.
clf1 = KNeighborsClassifier()
clf2 = LinearSVC(random_state=0)
clf3 = DecisionTreeClassifier(random_state=0)
clf4 = GaussianNB()
meta_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'auto', 
                              random_state=0)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4], 
                            meta_classifier=meta_clf)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'kneighborsclassifier__n_neighbors': sp_randint(1, 50),
              'linearsvc__C': [0.01, 0.1, 1, 10, 100],
              'decisiontreeclassifier__max_depth': sp_randint(2, 60),
              'meta-logisticregression__C': [0.01, 0.1, 1, 10, 100, 1000]}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(sclf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=9, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds, for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)

### Grid Search

In [None]:
# Specify parameter_grid for the search.
param_grid = {'kneighborsclassifier__n_neighbors': range(40, 44),
              'decisiontreeclassifier__max_depth': range(42, 46),
              'linearsvc__C': np.logspace(-1, 1, 8),
              'meta-logisticregression__C': range(90, 110, 2)}

# Run a grid search CV.
grid_search = GridSearchCV(sclf, param_grid, cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'.format((time.time() - start)))
report(grid_search.cv_results_)


### Plot stacking results

In [None]:
# Use best values for the classifier.
clf1.n_neighbors = 42
clf2.C = 1
clf3.max_depth = 44
meta_clf.C = 100

# Plot accuracy +- std for each model separately
# and compare it with the stacking model.
clf_names = ['KNN', 'Linear-SVM', 'Decision Tree', 'Naive-Bayes',
             'Logistic Regression', 'Stacking']
models = clf1, clf2, clf3, clf4, meta_clf, sclf
plot_accuracy_stacking(clf_names, models, X_train, y_train)

*Note:* in this case, the logistic regression model on its own would not have classified the samples well. However, it is very useful for classifying the meta-samples. This becomes clear when changing the value of C above to 0.1: the model's accuracy on the original data becomes higher, while the stacking model's accuracy decreases.


### Predict using the Stacking Ensemble.
Create a classification report with the Stacking Ensemble.


In [None]:
# Fit and predict.
sclf.fit(X_train,y_train)
y_pred = sclf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## XGBoosting

In [None]:
# Create an XGBoost classifier with a binary logistic objective.
xgb_clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=600, 
                            objective='binary:logistic', silent=True, nthread=1)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
params = {'min_child_weight': sp_randint(4, 30),
          'gamma': np.random.uniform(0.5, 4,size=5),
          'subsample': np.random.uniform(0, 1, size=4),
          'colsample_bytree': np.random.uniform(0, 1, size=4),
          'max_depth': sp_randint(1, 7)}

candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(xgb_clf, params, candidates, cv=10,
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)


### Grid Search

In [None]:
# Specify parameter_grid for the search.
params = {'min_child_weight': range(7, 12),
          'gamma': np.arange(2.6, 2.8, 0.1),
          'max_depth': range(2, 4)}
xgb_clf.subsample = 0.82
xgb_clf.colsample_bytree = 0.23

# Run a grid search CV.
grid_search = GridSearchCV(xgb_clf, params , cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)



### Predict using the boosted model.
Make predictions with the XGBoost model, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
xgb_clf.min_child_weight = 7
xgb_clf.gamma = 2.6
xgb_clf.max_depth = 2

# Fit and predict.
xgb_clf.fit(X_train,y_train)
y_pred = xgb_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)


# Iris



## Prepare the dataset.


In [None]:
# Read the dataset.
iris = load_iris()

# Get x and y.
X, y = iris.data, iris.target

# Split to training and test pairs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=0)

# Scale data.
scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))

## Bagging

In [None]:
# Define a Decision Tree classifier for the ensemble.
clf = DecisionTreeClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation, for the classifier which will be used in the bagging.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(1, 30),
              'max_features': sp_randint(1, X_train.shape[1]),
              'min_samples_split': sp_randint(2, X_train.shape[0] / 2),
              'criterion': ['gini', 'entropy']}
candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(8, 12),
              'max_features': range(2, 5),
              'min_samples_split': range(6, 10),
              'criterion': ['gini', 'entropy']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=5, 
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Create the Bagging Ensemble.
Create the Bagging Ensemble, using the best parameters found by the above procedure.  
When several candidates share the same mean accuracy, we pick the one with the smallest standard deviation across the CV folds.  
Predictions are made on the test data, which were held out by the train-test split, and not on the training data again.  
We use 10-fold cross-validation to visualize how the accuracy changes with the number of classifiers.

In [None]:

# Add best values for the classifier.
clf.max_depth = 8
clf.max_features = 3
clf.min_samples_split = 6
clf.criterion = 'gini'

# Plot accuracy vs number of estimators.
estimators_vs_acc(clf, X_train, y_train, estimators_array=range(1, 1000, 100))

### Predict
Predict using the best number of classifiers, based on the previous cross validation method.  
There is no noticeable difference between the numbers of estimators, so we take the smallest one, 100, since
accuracy and standard deviation seem to be at the same level.

In [None]:
# Create the bagging classifier.
bg_clf = BaggingClassifier(base_estimator=clf, n_estimators=100, random_state=0)

# Fit and predict.
bg_clf.fit(X_train, y_train)
y_pred = bg_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## Random Forest

In [None]:
# Define a Random Forest classifier.
clf = RandomForestClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(2, 50),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X_train.shape[0] / 2),
              'n_estimators': sp_randint(20, 300),
              'criterion': ['gini', 'entropy']}
candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=8, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)
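
To see what a random search actually draws from distributions like the ones above, the sketch below samples a few candidates with scikit-learn's `ParameterSampler`. The dataset shape (500 samples, 10 features) and the variable names are assumptions for illustration. Note that the upper bound of `scipy.stats.randint` is exclusive and should be an integer, which is why integer division is used.

```python
from scipy.stats import randint as sp_randint
from sklearn.model_selection import ParameterSampler

n_samples, n_features = 500, 10  # hypothetical dataset shape (assumption)

param_dist_demo = {
    # Upper bounds are exclusive: integers from 1 to 9.
    'max_features': sp_randint(1, n_features),
    # Integer division keeps the bound an integer: integers from 2 to 249.
    'min_samples_split': sp_randint(2, n_samples // 2),
    # Plain lists are sampled uniformly.
    'criterion': ['gini', 'entropy'],
}

sampled = list(ParameterSampler(param_dist_demo, n_iter=5, random_state=0))
for candidate in sampled:
    print(candidate)
```

One consequence of the exclusive upper bound is that `sp_randint(1, X.shape[1])` never proposes using all features.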



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(38, 42),
              'max_features': range(1, 4),
              'min_samples_split': range(10, 14),
              'n_estimators': range(290, 305, 5),
              'criterion': ['gini']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=8,
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)
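
A grid search fits every combination of values once per fold, so its cost grows quickly. As a small check, the sketch below counts the candidates for the grid in the cell above. The variable names are made up for this example.

```python
from itertools import product

# The grid from the cell above, copied as plain Python objects.
param_grid_demo = {
    'max_depth': range(38, 42),
    'max_features': range(1, 4),
    'min_samples_split': range(10, 14),
    'n_estimators': range(290, 305, 5),
    'criterion': ['gini'],
}
cv_folds = 10

# Every combination of values is one candidate.
candidates_demo = list(product(*param_grid_demo.values()))
n_candidates = len(candidates_demo)
n_fits = n_candidates * cv_folds
print('{} candidates x {} folds = {} fits'.format(n_candidates, cv_folds, n_fits))
```

With `refit=True`, the default, `GridSearchCV` adds one final fit of the best candidate on the full training set.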


### Predict using the Random Forest Ensemble.
Make predictions with the Random Forest Ensemble, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
clf.max_depth = 38
clf.max_features = 1
clf.min_samples_split = 10
clf.n_estimators = 290
clf.criterion = 'gini'
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)
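
The cell above sets the tuned values by plain attribute assignment, which works for scikit-learn estimators. An alternative, shown here only as a sketch with a separate `clf_demo` object (an assumption for this example), is `set_params`, which also checks the parameter names.

```python
from sklearn.ensemble import RandomForestClassifier

clf_demo = RandomForestClassifier(random_state=0)

# set_params checks that every name is a real constructor parameter.
clf_demo.set_params(max_depth=38, max_features=1, min_samples_split=10,
                    n_estimators=290, criterion='gini')
print(clf_demo.get_params()['n_estimators'])

# A misspelled name is rejected instead of silently creating a new attribute.
try:
    clf_demo.set_params(n_estimator=290)  # deliberate typo
    typo_rejected = False
except ValueError:
    typo_rejected = True
print('Typo rejected:', typo_rejected)
```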

## Stacking

In [None]:
# Define the stacking classifier.
clf1 = KNeighborsClassifier()
clf2 = LinearSVC(random_state=0)
clf3 = DecisionTreeClassifier(random_state=0)
clf4 = GaussianNB()
meta_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'auto', 
                              random_state=0)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4], 
                            meta_classifier=meta_clf)
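
In broad strokes, cross-validated stacking trains the meta-classifier on predictions that the base models made for samples they had not seen. The sketch below reproduces that idea with plain scikit-learn and `cross_val_predict`. It is not `StackingCVClassifier`'s implementation, the data are synthetic, and the Linear SVM is left out to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
base_models = [KNeighborsClassifier(), DecisionTreeClassifier(random_state=0), GaussianNB()]

# Level 1: each base model predicts every sample while that sample is held out.
meta_features = np.column_stack([
    cross_val_predict(model, X_demo, y_demo, cv=5) for model in base_models
])

# Level 2: the meta-classifier learns from the out-of-fold predictions.
meta_model = LogisticRegression(solver='lbfgs', random_state=0)
meta_model.fit(meta_features, y_demo)

# For new data, the base models are refit on the full training set.
for model in base_models:
    model.fit(X_demo, y_demo)
new_meta = np.column_stack([model.predict(X_demo[:5]) for model in base_models])
print(meta_model.predict(new_meta))
```

Using out-of-fold predictions keeps the meta-classifier from simply learning to trust base models that memorized the training set.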

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'kneighborsclassifier__n_neighbors': sp_randint(1, 40),
              'linearsvc__C': [0.01, 0.1, 1, 10, 100],
              'decisiontreeclassifier__max_depth': sp_randint(2, 60),
              'meta-logisticregression__C': [0.01, 0.1, 1, 10, 100, 1000]}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(sclf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=9, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds, for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)

### Grid Search

In [None]:
# Specify parameter_grid for the search.
param_grid = {'kneighborsclassifier__n_neighbors': range(10, 13),
              'linearsvc__C': [0.1,0.2,0.3,0.4,0.5],
              'decisiontreeclassifier__max_depth': range(8, 12),
              'meta-logisticregression__C': [0.5, 1, 1.5]}

# Run a grid search CV.
grid_search = GridSearchCV(sclf, param_grid, cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'.format((time.time() - start)))
report(grid_search.cv_results_)


### Plot stacking results

In [None]:
# Use best values for the classifier.
clf1.n_neighbors = 11
clf2.C = 0.2
clf3.max_depth = 8
meta_clf.C = 0.5

# Plot accuracy +- std for each model separately
# and compare it with the stacking model.
clf_names = ['KNN', 'Linear-SVM', 'Decision Tree', 'Naive-Bayes',
             'Logistic Regression', 'Stacking']
models = clf1, clf2, clf3, clf4, meta_clf, sclf
plot_accuracy_stacking(clf_names, models, X_train, y_train)

*Note:* in this case, the Linear SVM classifier seems to negatively affect the overall performance of the model.


In [None]:
# Plot the learning curves.
plot_learning_curve(models, clf_names, X_train, y_train,
                    train_sizes=np.linspace(0.4, 1.0, 5))
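
`plot_learning_curve` is a helper defined earlier in the notebook. As a sketch of the underlying computation, scikit-learn's `learning_curve` returns training and validation scores for increasing training-set sizes. The single decision tree, the synthetic data and the 5-fold setting are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

sizes, train_scores, valid_scores = learning_curve(
    DecisionTreeClassifier(max_depth=8, random_state=0), X_demo, y_demo,
    train_sizes=np.linspace(0.4, 1.0, 5), cv=5, shuffle=True, random_state=0)

# One row per training-set size, one column per CV fold.
for size, train, valid in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print('{:>4} samples: train {:.3f}, validation {:.3f}'.format(size, train, valid))
```

A large, persistent gap between the two curves points to overfitting, while two low curves that meet point to underfitting.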


### Predict using the Stacking Ensemble.
Create a classification report with the Stacking Ensemble.


In [None]:
# Fit and predict.
sclf.fit(X_train,y_train)
y_pred = sclf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## XGBoosting

In [None]:
# Create an XGBoost classifier; 'binary:logistic' is the logistic loss objective, not a LogisticRegression model.
xgb_clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=600, 
                            objective='binary:logistic', silent=True, nthread=1)
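
The `binary:logistic` objective means that the boosted trees output a raw score which is turned into a probability by the logistic function, and that training minimizes the log loss. The sketch below spells out that loss's first and second derivatives, which are the quantities gradient boosting works with. The function names are made up for this example.

```python
import math


def sigmoid(margin):
    """Map a raw boosting score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))


def logistic_grad_hess(margin, label):
    """First and second derivative of the log loss with respect to the raw score."""
    p = sigmoid(margin)
    return p - label, p * (1.0 - p)


# A raw score of 0 corresponds to a probability of 0.5.
grad, hess = logistic_grad_hess(0.0, 1)
print(sigmoid(0.0), grad, hess)
```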

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
params = {'min_child_weight': sp_randint(4, 30),
          'gamma': np.random.uniform(0.5, 4,size=5),
          'subsample': np.random.uniform(0, 1, size=4),
          'colsample_bytree': np.random.uniform(0, 1, size=4),
          'max_depth': sp_randint(1, 7)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(xgb_clf, params, candidates, cv=10,
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)
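
In the cell above, `np.random.uniform(..., size=5)` is evaluated once, so the search chooses among five fixed `gamma` values (and four each for `subsample` and `colsample_bytree`), which also change from run to run unless NumPy is seeded. An alternative, sketched here under the assumption that continuous sampling is wanted, is to pass `scipy.stats.uniform` distributions. The lower bounds of 0.5 for the two sampling ratios are an assumption, not a value from the notebook.

```python
from scipy.stats import uniform
from sklearn.model_selection import ParameterSampler

# scipy's uniform(loc, scale) covers the interval [loc, loc + scale].
params_demo = {
    'gamma': uniform(0.5, 3.5),             # 0.5 to 4.0
    'subsample': uniform(0.5, 0.5),         # 0.5 to 1.0 (lower bound is an assumption)
    'colsample_bytree': uniform(0.5, 0.5),  # 0.5 to 1.0 (lower bound is an assumption)
}

draws = list(ParameterSampler(params_demo, n_iter=50, random_state=0))
distinct_gammas = {d['gamma'] for d in draws}
print(len(distinct_gammas), 'distinct gamma values in', len(draws), 'draws')
```

Passing `random_state` to `RandomizedSearchCV` makes the drawn candidates reproducible in the same way.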


### Grid Search

In [None]:
# Specify parameter_grid for the search.
params = {'min_child_weight': range(7, 12),
          'gamma': [2.4, 2.5, 2.6],
          'max_depth': range(3, 7)}
xgb_clf.subsample = 0.91
xgb_clf.colsample_bytree = 0.19

# Run a grid search CV.
grid_search = GridSearchCV(xgb_clf, params , cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


*Note:* The Grid Search did not improve the accuracy score, but led to a lower standard deviation.


### Predict using the boosted model.
Make predictions with the XGBoost model, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
xgb_clf.min_child_weight = 7
xgb_clf.gamma = 2.4
xgb_clf.max_depth = 3

# Fit and predict.
xgb_clf.fit(X_train,y_train)
y_pred = xgb_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)


# Breast Cancer


## Load the dataset

In [None]:
data = read_csv('C:/Users/User/Desktop/Project ml/datasets/breast-cancer-wisconsin-data/data.csv')
y = data['diagnosis']
unnecessary = ['Unnamed: 32','id','diagnosis']
X = data.drop(unnecessary,axis = 1 )

### Visualize number of data per class

In [None]:
ax = sns.countplot(x=y, label="Count")       # M = 212, B = 357
counts = y.value_counts()
B, M = counts['B'], counts['M']  # look up by label rather than relying on sort order
print('Number of Benign: ', B)
print('Number of Malignant: ', M)
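
The class counts also give a reference point for the accuracies reported later: a model that always predicts the majority class is already right most of the time. A minimal sketch, using the counts from the comment in the cell above (the variable names are made up for this example):

```python
from collections import Counter

# Class counts as given in the comment of the cell above.
labels = ['B'] * 357 + ['M'] * 212
counts_demo = Counter(labels)

majority_class, majority_count = counts_demo.most_common(1)[0]
baseline_accuracy = majority_count / len(labels)
print('Always predicting {!r} gives accuracy {:.3f}'.format(majority_class, baseline_accuracy))
```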

## Data preprocessing

In [None]:
le = preprocessing.LabelEncoder()
le.fit(y)
y = le.transform(y)
X = np.array(X)
y = np.array(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale data.
scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))
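
The scaler is fitted on the training data only and then applied to the test data, so no information from the test set leaks into preprocessing. The plain-Python sketch below shows the min-max formula on toy values (an assumption). It also shows that test values can land outside the range from 0 to 1.

```python
def fit_min_max(column):
    """Learn the scaling range from the training values only."""
    return min(column), max(column)


def transform_min_max(column, lo, hi):
    """Apply (x - min) / (max - min) with the range learned from the training data."""
    return [(x - lo) / (hi - lo) for x in column]


train_col = [2.0, 4.0, 6.0]  # toy values (assumption)
test_col = [3.0, 8.0]

lo, hi = fit_min_max(train_col)
train_scaled = transform_min_max(train_col, lo, hi)
test_scaled = transform_min_max(test_col, lo, hi)
print(train_scaled)
print(test_scaled)
```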

## Bagging

In [None]:
# Define a Decision Tree classifier for the ensemble.
clf = DecisionTreeClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation, for the classifier which will be used in the bagging.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(1, 30),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X.shape[0] // 2),
              'criterion': ['gini', 'entropy']}
candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(12, 14),
              'max_features': range(13, 18),
              'min_samples_split': range(6, 9),
              'criterion': ['gini', 'entropy']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=5, 
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Create the Bagging Ensemble.
Create the Bagging Ensemble, using the best parameters found by the procedure above.  
When several candidates share the same mean accuracy, we pick the one with the lowest standard deviation across the CV folds.  
Predictions are made on the test data that were held out by the train-test split, not on the training data.  
We use 10-fold cross-validation to visualize how accuracy changes with the number of classifiers.
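
The tie-breaking rule described above can be written down in a few lines. This is a sketch, not the notebook's `report` helper: the function name and the toy scores are assumptions, and the dictionary only mimics the layout of `GridSearchCV.cv_results_`.

```python
def best_by_mean_then_std(cv_results):
    """Index of the candidate with the highest mean score, ties broken by the lowest std."""
    means = cv_results['mean_test_score']
    stds = cv_results['std_test_score']
    return max(range(len(means)), key=lambda i: (means[i], -stds[i]))


# Toy values (assumption) in the same layout as GridSearchCV.cv_results_.
toy_results = {
    'mean_test_score': [0.95, 0.96, 0.96],
    'std_test_score': [0.01, 0.03, 0.02],
    'params': [{'max_depth': 12}, {'max_depth': 13}, {'max_depth': 14}],
}
best = best_by_mean_then_std(toy_results)
print(toy_results['params'][best])
```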

In [None]:

# Add best values for the classifier.
clf.max_depth = 12
clf.max_features = 13
clf.min_samples_split = 7
clf.criterion = 'entropy'

# Plot accuracy vs number of estimators.
estimators_vs_acc(clf, X_train, y_train, estimators_array=range(1, 1000, 100))

### Predict
Predict using the best number of classifiers, based on the previous cross validation method.  
The best number of estimators seems to be 600 or 700: the accuracy is high,  
and both the lower and the upper bound of its deviation band are better than for the other settings.

In [None]:
# Create the bagging classifier.
bg_clf = BaggingClassifier(base_estimator=clf, n_estimators=600, random_state=0)

# Fit and predict.
bg_clf.fit(X_train, y_train)
y_pred = bg_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## Random Forest

In [None]:
# Define a Random Forest classifier.
clf = RandomForestClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(2, 50),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X.shape[0] // 2),
              'n_estimators': sp_randint(20, 300),
              'criterion': ['gini', 'entropy']}
candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=8, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid Search, using 5-fold cross-validation, in the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(19,22),
              'max_features': range(1,4),
              'min_samples_split': range(8,11),
              'n_estimators': range(80, 90, 2),
              'criterion': ['entropy', 'gini']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, verbose=8,
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Predict using the Random Forest Ensemble.
Make predictions with the Random Forest Ensemble, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
clf.max_depth = 19
clf.max_features = 2
clf.min_samples_split = 10
clf.n_estimators = 84
clf.criterion = 'entropy'
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## Stacking

In [None]:
# Define the stacking classifier.
clf1 = KNeighborsClassifier()
clf2 = LinearSVC(random_state=0)
clf3 = DecisionTreeClassifier(random_state=0)
clf4 = GaussianNB()
meta_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'auto', 
                              random_state=0)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4], 
                            meta_classifier=meta_clf)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'kneighborsclassifier__n_neighbors': sp_randint(1, 100),
              'linearsvc__C': [0.01, 0.1, 1, 10, 100],
              'decisiontreeclassifier__max_depth': sp_randint(2, 60),
              'meta-logisticregression__C': [0.01, 0.1, 1, 10, 100, 1000]}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(sclf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=9, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds, for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)

### Grid Search

In [None]:
# Specify parameter_grid for the search.
param_grid = {'kneighborsclassifier__n_neighbors': range(21,25),
              'linearsvc__C': [1,1.2,1.5,2],
              'decisiontreeclassifier__max_depth': range(39, 44),
              'meta-logisticregression__C': range(98,103)}

# Run a grid search CV.
grid_search = GridSearchCV(sclf, param_grid, cv = 5, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'.format((time.time() - start)))
report(grid_search.cv_results_)


### Plot stacking results

In [None]:
# Use best values for the classifier.
clf1.n_neighbors = 21
clf2.C = 1.2
clf3.max_depth = 39
meta_clf.C = 100

# Plot accuracy +- std for each model separately
# and compare it with the stacking model.
clf_names = ['KNN', 'Linear-SVM', 'Decision Tree', 'Naive-Bayes',
             'Logistic Regression', 'Stacking']
models = clf1, clf2, clf3, clf4, meta_clf, sclf
plot_accuracy_stacking(clf_names, models, X_train, y_train)

*Note:* in this case, the Naive Bayes and Decision Tree classifiers seem to negatively affect the overall performance of the model.

In [None]:
# Plot the learning curves.
plot_learning_curve(models, clf_names, X_train, y_train)


### Predict using the Stacking Ensemble.
Create a classification report with the Stacking Ensemble.


In [None]:
# Fit and predict.
sclf.fit(X_train,y_train)
y_pred = sclf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## XGBoosting

In [None]:
# Create an XGBoost classifier; 'binary:logistic' is the logistic loss objective, not a LogisticRegression model.
xgb_clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=600, 
                            objective='binary:logistic', silent=True, nthread=1)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
params = {'min_child_weight': sp_randint(4, 30),
          'gamma': np.random.uniform(0.5, 4,size=5),
          'subsample': np.random.uniform(0, 1, size=4),
          'colsample_bytree': np.random.uniform(0, 1, size=4),
          'max_depth': sp_randint(1, 7)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(xgb_clf, params, candidates, cv=10,
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)


### Grid Search

In [None]:
# Specify parameter_grid for the search.
params = {'min_child_weight': range(8, 13),
          'gamma': [3.1, 3.2, 3.4,3.5],
          'max_depth': range(4, 8)}
xgb_clf.subsample = 0.97
xgb_clf.colsample_bytree = 0.79

# Run a grid search CV.
grid_search = GridSearchCV(xgb_clf, params , cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


*Note:* The Grid Search did not improve the accuracy score, but led to a lower standard deviation.


### Predict using the boosted model.
Make predictions with the XGBoost model, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
xgb_clf.min_child_weight = 8
xgb_clf.gamma = 3.1
xgb_clf.max_depth = 4

# Fit and predict.
xgb_clf.fit(X_train,y_train)
y_pred = xgb_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)


# Seeds



## Prepare the dataset.


In [None]:

# Read the dataset. The file has no header row, so header=None keeps the first sample.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt"
s = requests.get(url).content
dataset = read_csv(io.StringIO(s.decode('utf-8')), sep='\t+', engine='python',
                   lineterminator='\n', header=None)

# Get x and y.
X, y = dataset.iloc[:, :-1].values, dataset.iloc[:, -1].values

# Split to training and test pairs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    stratify=y, random_state=0)

# Scale data.
scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))
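
Passing `stratify=y` makes the split keep the class proportions of the full dataset in both parts. The sketch below checks this on a synthetic label vector with three balanced classes of 70 samples each. The data and the variable names are assumptions for this example.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): three balanced classes of 70 samples each.
y_demo = np.repeat([1, 2, 3], 70)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

_, _, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

print('train:', sorted(Counter(y_train_demo.tolist()).items()))
print('test: ', sorted(Counter(y_test_demo.tolist()).items()))
```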

## Bagging

In [None]:
# Define a Decision Tree classifier for the ensemble.
clf = DecisionTreeClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation, for the classifier which will be used in the bagging.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(1, 30),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X.shape[0] // 2),
              'criterion': ['gini', 'entropy']}
candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(7, 15),
              'max_features': range(1, X.shape[1]),
              'min_samples_split': range(35, 55),
              'criterion': ['gini', 'entropy']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=5, 
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Create the Bagging Ensemble.
Create the Bagging Ensemble, using the best parameters found by the procedure above.  
When several candidates share the same mean accuracy, we pick the one with the lowest standard deviation across the CV folds.  
Predictions are made on the test data that were held out by the train-test split, not on the training data.  
We use 10-fold cross-validation to visualize how accuracy changes with the number of classifiers.

In [None]:

# Add best values for the classifier.
clf.max_depth = 7
clf.max_features = 5
clf.min_samples_split = 45
clf.criterion = 'gini'

# Plot accuracy vs number of estimators.
estimators_vs_acc(clf, X_train, y_train, estimators_array=range(1, 1000, 100))

*Note:* On this dataset, the method seems to produce a model with high variance.

### Predict
Predict using the best number of classifiers, based on the previous cross validation method.  

In [None]:
# Create the bagging classifier.
bg_clf = BaggingClassifier(base_estimator=clf, n_estimators=600, random_state=0)

# Fit and predict.
bg_clf.fit(X_train, y_train)
y_pred = bg_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

Due to the high variance of the model, the result is lower than expected, but within acceptable limits.

## Random Forest

In [None]:
# Define a Random Forest classifier.
clf = RandomForestClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(2, 50),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X.shape[0] // 2),
              'n_estimators': sp_randint(20, 300),
              'criterion': ['gini', 'entropy']}
candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=8, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(23, 31),
              'max_features': range(4, 6),
              'min_samples_split': range(47, 51),
              'n_estimators': range(110, 160, 10),
              'criterion': ['entropy', 'gini']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=8,
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Predict using the Random Forest Ensemble.
Make predictions with the Random Forest Ensemble, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
clf.max_depth = 30
clf.max_features = 5
clf.min_samples_split = 50
clf.n_estimators = 130
clf.criterion = 'gini'
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## Stacking

In [None]:
# Define the stacking classifier.
clf1 = KNeighborsClassifier()
clf2 = LinearSVC(random_state=0)
clf3 = DecisionTreeClassifier(random_state=0)
clf4 = GaussianNB()
meta_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'auto', 
                              random_state=0)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4], 
                            meta_classifier=meta_clf)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'kneighborsclassifier__n_neighbors': sp_randint(1, X.shape[1]),
              'linearsvc__C': np.logspace(-3, 3),
              'decisiontreeclassifier__max_depth': sp_randint(2, 60),
              'meta-logisticregression__C': np.logspace(-3, 3)}

candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(sclf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=9, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds, for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)

### Grid Search

In [None]:
# Specify parameter_grid for the search.
param_grid = {'kneighborsclassifier__n_neighbors': range(1, 6),
              'linearsvc__C': range(170, 190, 4),
              'decisiontreeclassifier__max_depth': range(30, 35),
              'meta-logisticregression__C': [20, 30, 2]}

# Run a grid search CV.
grid_search = GridSearchCV(sclf, param_grid, cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'.format((time.time() - start)))
report(grid_search.cv_results_)
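
In the grid above, `[20, 30, 2]` is a list literal, so the meta-classifier's `C` is tried at exactly the three values 20, 30 and 2. Whether a stepped range was intended is not stated in the notebook. The sketch below only shows the difference between the two spellings, with made-up variable names.

```python
literal_values = [20, 30, 2]             # exactly three candidates: 20, 30 and 2
stepped_values = list(range(20, 30, 2))  # start at 20, stop before 30, step 2

print(literal_values)
print(stepped_values)
```

The value `C = 30` used in the next cell is only reachable with the list form, since `range(20, 30, 2)` stops before 30.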


### Plot stacking results

In [None]:
# Use best values for the classifier.
clf1.n_neighbors = 5
clf2.C = 170
clf3.max_depth = 33
meta_clf.C = 30

# Plot accuracy +- std for each model separately
# and compare it with the stacking model.
clf_names = ['KNN', 'Linear-SVM', 'Decision Tree', 'Naive-Bayes',
             'Logistic Regression', 'Stacking']
models = clf1, clf2, clf3, clf4, meta_clf, sclf
plot_accuracy_stacking(clf_names, models, X_train, y_train)

In [None]:
# Plot the learning curves.
plot_learning_curve(models, clf_names, X_train, y_train)


### Predict using the Stacking Ensemble.
Create a classification report with the Stacking Ensemble.


In [None]:
# Fit and predict.
sclf.fit(X_train,y_train)
y_pred = sclf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## XGBoosting

In [None]:
# Create an XGBoost classifier; 'binary:logistic' is the logistic loss objective, not a LogisticRegression model.
xgb_clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=600, 
                            objective='binary:logistic', silent=True, nthread=1)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
params = {'min_child_weight': sp_randint(4, 30),
          'gamma': np.random.uniform(0.5, 4,size=5),
          'subsample': np.random.uniform(0, 1, size=4),
          'colsample_bytree': np.random.uniform(0, 1, size=4),
          'max_depth': sp_randint(1, 7)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(xgb_clf, params, candidates, cv=10,
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)


### Grid Search

In [None]:
# Specify parameter_grid for the search.
params = {'min_child_weight': range(12, 16),
          'gamma': [0.5, 1.6, 0.2],
          'colsample_bytree': np.arange(0.5, 0.7, 0.05),
          'max_depth': range(4, 6)}
xgb_clf.subsample = 0.82
xgb_clf.colsample_bytree = 0.62

# Run a grid search CV.
grid_search = GridSearchCV(xgb_clf, params , cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)
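
The grid above builds the `colsample_bytree` candidates with `np.arange` and a fractional step. With floating-point steps the number of elements and their exact values can be surprising, so NumPy's documentation suggests `np.linspace` for such cases. A sketch of an equivalent spelling, assuming the four values from 0.5 to 0.65 are the intended ones:

```python
import numpy as np

# Four evenly spaced values from 0.5 to 0.65, with both endpoints included.
colsample_values = np.linspace(0.5, 0.65, 4).round(2)
print(colsample_values.tolist())
```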



### Predict using the boosted model.
Make predictions with the XGBoost model, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
xgb_clf.min_child_weight = 13
xgb_clf.gamma = 0.2
xgb_clf.max_depth = 4
xgb_clf.colsample_bytree = 0.65

# Fit and predict.
xgb_clf.fit(X_train,y_train)
y_pred = xgb_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)


# Glass Identification



## Prepare the dataset.


In [None]:

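# Read the dataset. The file has no header row, so header=None keeps the first sample.
# Note: the first column is an Id number, which is kept among the features here.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/glass/glass.data"
s = requests.get(url).content
dataset = read_csv(io.StringIO(s.decode('utf-8')), header=None)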

# Get x and y.
X, y = dataset.iloc[:, :-1].values, dataset.iloc[:, -1].values

# Split to training and test pairs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,shuffle=True,
                                                    random_state=0,stratify = y)

# Scale data.
scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))

## Bagging

In [None]:
# Define a Decision Tree classifier for the ensemble.
clf = DecisionTreeClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation, for the classifier which will be used in the bagging.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(1, 30),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X.shape[0] // 2),
              'criterion': ['gini', 'entropy']}
candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(12, 16),
              'max_features': range(8, 11),
              'min_samples_split': range(13, 16),
              'criterion': ['gini', 'entropy']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=5, 
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Create the Bagging Ensemble.
Create the Bagging Ensemble, using the best parameters found by the procedure above.  
When several candidates share the same mean accuracy, we pick the one with the lowest standard deviation across the CV folds.  
Predictions are made on the test data that were held out by the train-test split, not on the training data.  
We use 10-fold cross-validation to visualize how accuracy changes with the number of classifiers.

In [None]:

# Add best values for the classifier.
clf.max_depth = 12
clf.max_features = 10
clf.min_samples_split = 13
clf.criterion = 'gini'

# Plot accuracy vs number of estimators.
estimators_vs_acc(clf, X_train, y_train, estimators_array=range(1, 1000, 100))

### Predict
Predict using the best number of classifiers, based on the previous cross validation method.  
There seems to be no significant difference between the numbers of estimators, so for convenience
we choose 100 estimators.

In [None]:
# Create the bagging classifier.
bg_clf = BaggingClassifier(base_estimator=clf, n_estimators=100, random_state=0)

# Fit and predict.
bg_clf.fit(X_train, y_train)
y_pred = bg_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## Random Forest

In [None]:
# Define a Random Forest classifier.
clf = RandomForestClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(2, 50),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X.shape[0] // 2),
              'n_estimators': sp_randint(20, 300),
              'criterion': ['gini', 'entropy']}
candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=8, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 6 fold cross validation, for the area around the best results found with the Random Search.
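
The grid in the next cell was written by hand around the Random Search winner. As a sketch of how such a neighbourhood could be derived programmatically, the hypothetical helper `neighbourhood_grid` below builds a small integer range around each best value and keeps non-integer settings fixed. The `radius` and `minimum` arguments and the example values are assumptions, not results from this notebook.

```python
import numbers


def neighbourhood_grid(best_params, radius=1, minimum=1):
    """Hypothetical helper: integer ranges centred on each best value."""
    grid = {}
    for name, value in best_params.items():
        is_int = (isinstance(value, numbers.Integral)
                  and not isinstance(value, bool))
        if is_int:
            centre = int(value)
            low = max(minimum, centre - radius)
            grid[name] = list(range(low, centre + radius + 1))
        else:
            # Non-integer settings (e.g. the criterion) are kept fixed.
            grid[name] = [value]
    return grid


# Illustrative values only, not results from this notebook.
best = {'max_depth': 35, 'max_features': 5, 'criterion': 'entropy'}
print(neighbourhood_grid(best))
```

In practice the input would be `random_search.best_params_`, and categorical settings such as the criterion could be widened by hand.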

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(34, 37),
              'max_features': range(4, 7),
              'min_samples_split': range(11, 14),
              'n_estimators': range(195, 210, 5),
              'criterion': ['entropy', 'gini']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=6, n_jobs=-1, verbose=8,
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Predict using the Random Forest Ensemble.
Make predictions with the Random Forest Ensemble, using the best parameters found by the above procedure.


In [None]:

# Add best values for the classifier.
clf.max_depth = 34
clf.max_features = 4
clf.min_samples_split = 11
clf.n_estimators = 200
clf.criterion = 'entropy'
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## Stacking

In [None]:
# Define the stacking classifier.
clf1 = KNeighborsClassifier()
clf2 = LinearSVC(random_state=0)
clf3 = DecisionTreeClassifier(random_state=0)
clf4 = GaussianNB()
meta_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'auto', 
                              random_state=0)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4], 
                            meta_classifier=meta_clf)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'kneighborsclassifier__n_neighbors': sp_randint(1, X_train.shape[0]//4),
              'linearsvc__C': [0.01, 0.1, 1, 10, 100],
              'decisiontreeclassifier__max_depth': sp_randint(2, 60),
              'meta-logisticregression__C': [0.01, 0.1, 1, 10, 100, 1000]}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(sclf, param_dist, candidates, cv=5, 
                                   n_jobs=-1, verbose=9, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds, for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)

### Grid Search

In [None]:
# Specify parameter_grid for the search.
param_grid = {'kneighborsclassifier__n_neighbors': range(4, 9),
              'linearsvc__C': range(99, 103),
              'decisiontreeclassifier__max_depth': range(34, 38),
              'meta-logisticregression__C': [99.5,100, 101, 102]}

# Run a grid search CV.
grid_search = GridSearchCV(sclf, param_grid, cv = 6, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'.format((time.time() - start)))
report(grid_search.cv_results_)


### Plot stacking results

In [None]:
# Use best values for the classifier.
clf1.n_neighbors = 4
clf2.C = 102
clf3.max_depth = 34
meta_clf.C = 102

# Plot accuracy +- std for each model separately
# and compare it with the stacking model.
clf_names = ['KNN', 'Linear-SVM', 'Decision Tree', 'Naive-Bayes',
             'Logistic Regression', 'Stacking']
models = clf1, clf2, clf3, clf4, meta_clf, sclf
plot_accuracy_stacking(clf_names, models, X_train, y_train)

*Note:* in this case, the Naive Bayes classifier seems to hurt the overall performance of the stacking model.
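
One way to check whether a single base learner drags a stack down is a simple ablation: cross-validate the stack with and without it. The sketch below does this under several assumptions: it swaps in scikit-learn's `StackingClassifier` for mlxtend's `StackingCVClassifier`, uses the Wine data, and leaves out the Linear SVM for brevity. It makes no claim about which variant wins.

```python
import warnings

from sklearn.datasets import load_wine
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

warnings.filterwarnings('ignore')  # hide convergence warnings in this demo

X_demo, y_demo = load_wine(return_X_y=True)
base = [('knn', KNeighborsClassifier()),
        ('tree', DecisionTreeClassifier(random_state=0)),
        ('nb', GaussianNB())]


def stacking_accuracy(estimators):
    """Mean 5-fold accuracy of a stack built from the given base learners."""
    stack = StackingClassifier(
        estimators=estimators,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5)
    return cross_val_score(stack, X_demo, y_demo, cv=5).mean()


full = stacking_accuracy(base)
without_nb = stacking_accuracy([e for e in base if e[0] != 'nb'])
print('all base learners:   {:.3f}'.format(full))
print('without Naive Bayes: {:.3f}'.format(without_nb))
```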

In [None]:
# Plot the learning curves.
plot_learning_curve(models, clf_names, X_train, y_train)


### Predict using the Stacking Ensemble.
Create a classification report with the Stacking Ensemble.


In [None]:
# Fit and predict.
sclf.fit(X_train,y_train)
y_pred = sclf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## XGBoosting

In [None]:
# Create an XGBoost classifier with a multiclass softmax objective.
xgb_clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=600, 
                            objective='multi:softmax', silent=True, n_jobs=-1)

### Random Search
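
The cell below draws four or five fixed values per parameter with `np.random.uniform`, so the search can only pick from those values, and they change between runs because no seed is set. A possible alternative, sketched here, is to pass distribution objects so that every candidate draws a fresh value. The ranges are assumptions for illustration, and `ParameterSampler` is used only to show the sampled candidates without fitting a model.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# uniform(loc, scale) samples from [loc, loc + scale].
# The ranges below are assumptions for illustration, not tuned values.
param_distributions = {
    'min_child_weight': randint(4, 30),
    'gamma': uniform(0.5, 3.5),             # [0.5, 4.0]
    'subsample': uniform(0.5, 0.5),         # [0.5, 1.0]
    'colsample_bytree': uniform(0.5, 0.5),  # [0.5, 1.0]
    'max_depth': randint(1, 7),
}

samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=0))
for s in samples:
    print({k: round(float(v), 3) for k, v in s.items()})
```

The same dictionary could be passed to `RandomizedSearchCV` in place of `params`.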

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
params = {'min_child_weight': sp_randint(4, 30),
          'gamma': np.random.uniform(0.5, 4,size=5),
          'subsample': np.random.uniform(0, 1, size=4),
          'colsample_bytree': np.random.uniform(0, 1, size=4),
          'max_depth': sp_randint(1, 7)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(xgb_clf, params, candidates, cv=10,
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)


### Grid Search

In [None]:
# Specify parameter_grid for the search.
params = {'min_child_weight': range(4, 7),
          'gamma': [0.85, 0.87,0.90],
          'max_depth': range(4, 9)}
xgb_clf.subsample = 0.63
xgb_clf.colsample_bytree = 0.37

# Run a grid search CV.
grid_search = GridSearchCV(xgb_clf, params , cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


*Note:* The Grid Search did not improve the accuracy score, but it led to a lower std.


### Predict using the boosted model.
Make predictions with the XGBoosting method, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
xgb_clf.min_child_weight = 4
xgb_clf.gamma = 0.85
xgb_clf.max_depth = 4

# Fit and predict.
xgb_clf.fit(X_train,y_train)
y_pred = xgb_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)


# Tic Tac Toe



## Prepare the dataset.


In [None]:
# Read the dataset.
url="https://archive.ics.uci.edu/ml/machine-learning-databases/tic-tac-toe/tic-tac-toe.data"
s=requests.get(url).content
# The file has no header row, so header=None keeps the first sample as data.
dataset=read_csv(io.StringIO(s.decode('utf-8')), header=None)

# Get x and y.
X, y = dataset.iloc[:, :-1].values, dataset.iloc[:, -1].values

X_new = OrdinalEncoder().fit_transform(X)
y_new = LabelEncoder().fit_transform(y)

# Split to training and test pairs.
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.3,
                                                    shuffle=True, stratify=y_new,
                                                    random_state=0)


## Bagging

In [None]:
# Define a Decision Tree classifier for the ensemble.
clf = DecisionTreeClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation, for the classifier which will be used in the bagging.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(1, 30),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X.shape[0] // 2),
              'criterion': ['gini', 'entropy']}
candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(14, 17),
              'max_features': range(6, 9),
              'min_samples_split': range(2, 8),
              'criterion': ['gini', 'entropy']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=5, 
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Create the Bagging Ensemble.
Create the Bagging Ensemble, using the best parameters found from the above procedure.  
Since the mean accuracy is the same for the top candidates, we pick the one with the lowest std across the CV folds.  
Predictions are made on the test data held out by the train-test split, not on the training data.  
We use 10 fold cross validation to visualize the accuracy against the number of classifiers.
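
The tie-break described above (same mean accuracy, lowest std wins) can be sketched as a small function over a `cv_results_`-style dictionary. The name `best_with_min_std` and the numbers in `demo_results` are hypothetical and only illustrate the rule.

```python
import numpy as np


def best_with_min_std(cv_results, tol=1e-12):
    """Hypothetical helper: among top-mean candidates, pick the lowest std."""
    means = np.asarray(cv_results['mean_test_score'])
    stds = np.asarray(cv_results['std_test_score'])
    # Indices of all candidates tied (within tol) for the best mean score.
    tied = np.flatnonzero(means >= means.max() - tol)
    winner = tied[np.argmin(stds[tied])]
    return cv_results['params'][winner]


# Illustrative numbers only, shaped like GridSearchCV.cv_results_.
demo_results = {
    'mean_test_score': [0.90, 0.92, 0.92],
    'std_test_score': [0.01, 0.04, 0.02],
    'params': [{'max_depth': 14}, {'max_depth': 15}, {'max_depth': 16}],
}
print(best_with_min_std(demo_results))
```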

In [None]:

# Add best values for the classifier.
clf.max_depth = 14
clf.max_features = 7
clf.min_samples_split = 3
clf.criterion = 'gini'

# Plot accuracy vs number of estimators.
estimators_vs_acc(clf, X_train, y_train, estimators_array=range(1, 1000, 100))

### Predict
Predict using the best number of classifiers, based on the previous cross validation method.  
The best number of estimators seems to be 400: the accuracy is high and the std is low,  
so the lower bound of its deviation is better than the others, even though its upper bound is not the best.
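
The reasoning above compares the lower and upper bounds of the accuracy band (mean minus or plus std) for each ensemble size. A minimal sketch of that comparison follows; the arrays hold made-up numbers for illustration, not the values plotted in this notebook.

```python
import numpy as np

# Illustrative numbers only, not results from this notebook.
sizes = np.array([1, 101, 201, 301, 401])
means = np.array([0.85, 0.93, 0.94, 0.95, 0.94])
stds = np.array([0.05, 0.04, 0.04, 0.05, 0.02])

lower = means - stds  # pessimistic end of each accuracy band
upper = means + stds  # optimistic end of each accuracy band

by_lower = int(sizes[np.argmax(lower)])
by_upper = int(sizes[np.argmax(upper)])
print('best lower bound:', by_lower)
print('best upper bound:', by_upper)
```

Choosing by the lower bound favours a stable ensemble over one that only occasionally scores highest.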

In [None]:
# Create the bagging classifier.
bg_clf = BaggingClassifier(base_estimator=clf, n_estimators=400, random_state=0)

# Fit and predict.
bg_clf.fit(X_train, y_train)
y_pred = bg_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## Random Forest

In [None]:
# Define a Random Forest classifier.
clf = RandomForestClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(2, 50),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X.shape[0] // 2),
              'n_estimators': sp_randint(20, 300),
              'criterion': ['gini', 'entropy']}
candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=8, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 6 fold cross validation, for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(21, 24),
              'max_features': range(3, 6),
              'min_samples_split': range(11, 14),
              'n_estimators': range(290, 305, 5),
              'criterion': ['entropy', 'gini']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=6, n_jobs=-1, verbose=8,
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Predict using the Random Forest Ensemble.
Make predictions with the Random Forest Ensemble, using the best parameters found by the above procedure.


In [None]:

# Add best values for the classifier.
clf.max_depth = 21
clf.max_features = 5
clf.min_samples_split = 11
clf.n_estimators = 290
clf.criterion = 'entropy'
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## Stacking

In [None]:
# Define the stacking classifier.
clf1 = KNeighborsClassifier()
clf2 = LinearSVC(random_state=0)
clf3 = DecisionTreeClassifier(random_state=0)
clf4 = GaussianNB()
meta_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'auto', 
                              random_state=0)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4], 
                            meta_classifier=meta_clf)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'kneighborsclassifier__n_neighbors': sp_randint(1, 100),
              'linearsvc__C': [0.01, 0.1, 1, 10, 100],
              'decisiontreeclassifier__max_depth': sp_randint(2, 60),
              'meta-logisticregression__C': [0.01, 0.1, 1, 10, 100, 1000]}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(sclf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=9, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds, for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)

### Grid Search

In [None]:
# Specify parameter_grid for the search.
param_grid = {'kneighborsclassifier__n_neighbors': range(21,24),
              'linearsvc__C': range(98, 104, 2),
              'decisiontreeclassifier__max_depth': range(39, 43),
              'meta-logisticregression__C': [99,100,102,104]}

# Run a grid search CV.
grid_search = GridSearchCV(sclf, param_grid, cv = 6, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'.format((time.time() - start)))
report(grid_search.cv_results_)


### Plot stacking results

In [None]:
# Use best values for the classifier.
clf1.n_neighbors = 22
clf2.C = 100
clf3.max_depth = 42
meta_clf.C = 104

# Plot accuracy +- std for each model separately
# and compare it with the stacking model.
clf_names = ['KNN', 'Linear-SVM', 'Decision Tree', 'Naive-Bayes',
             'Logistic Regression', 'Stacking']
models = clf1, clf2, clf3, clf4, meta_clf, sclf
plot_accuracy_stacking(clf_names, models, X_train, y_train)

*Note:* in this case, the Linear SVM classifier seems to hurt the overall performance of the stacking model.

In [None]:
# Plot the learning curves.
plot_learning_curve(models, clf_names, X_train, y_train)


### Predict using the Stacking Ensemble.
Create a classification report with the Stacking Ensemble.


In [None]:
# Fit and predict.
sclf.fit(X_train,y_train)
y_pred = sclf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## XGBoosting

In [None]:
# Create an XGBoost classifier with a binary logistic objective.
xgb_clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=600, 
                            objective='binary:logistic', silent=True, nthread=1)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
params = {'min_child_weight': sp_randint(4, 30),
          'gamma': np.random.uniform(0.5, 4,size=5),
          'subsample': np.random.uniform(0, 1, size=4),
          'colsample_bytree': np.random.uniform(0, 1, size=4),
          'max_depth': sp_randint(1, 7)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(xgb_clf, params, candidates, cv=10,
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)


### Grid Search

In [None]:
# Specify parameter_grid for the search.
params = {'min_child_weight': range(5, 8),
          'gamma': [2.23, 2.3, 2.4],
          'max_depth': range(5, 9)}
xgb_clf.subsample = 0.959
xgb_clf.colsample_bytree = 0.9266

# Run a grid search CV.
grid_search = GridSearchCV(xgb_clf, params , cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


*Note:* The Grid Search did not improve the accuracy score, but it led to a lower std.


### Predict using the boosted model.
Make predictions with the XGBoosting method, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
xgb_clf.min_child_weight = 5
xgb_clf.gamma = 2.23
xgb_clf.max_depth = 5

# Fit and predict.
xgb_clf.fit(X_train,y_train)
y_pred = xgb_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')


# Wholesale Customers



## Prepare the dataset.
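
The cell below fits the `MinMaxScaler` on the training split and only applies it to the test split, which keeps the test data out of the scaling. During the cross-validated searches that follow, however, every fold shares that one scaler. A possible alternative, sketched here on the Breast Cancer data with a KNN model (both assumptions for illustration), is to put the scaler in a pipeline so it is re-fitted inside each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The scaler is re-fitted on the training part of every fold.
model = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
scores = cross_val_score(model, X_demo, y_demo, cv=5)
print('{:.3f} +- {:.3f}'.format(scores.mean(), scores.std()))
```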


In [None]:

# Read the dataset.
url="https://archive.ics.uci.edu/ml/machine-learning-databases/00292/Wholesale%20customers%20data.csv"
s=requests.get(url).content
dataset=read_csv(io.StringIO(s.decode('utf-8')))

# Get x and channel column as y.
X, y = dataset.iloc[:, 1:].values, dataset.iloc[:, 0].values  # y is the Channel column

# Split to training and test pairs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=0)

# Scale data.
scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))

## Bagging

In [None]:
# Define a Decision Tree classifier for the ensemble.
clf = DecisionTreeClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation, for the classifier which will be used in the bagging.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(1, 30),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X.shape[0] // 2),
              'criterion': ['gini', 'entropy']}
candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(1, 7),
              'max_features': range(1, 8),
              'min_samples_split': range(35, 55),
              'criterion': ['gini', 'entropy']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=5, 
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Create the Bagging Ensemble.
Create the Bagging Ensemble, using the best parameters found from the above procedure.  
Since the mean accuracy is the same for the top candidates, we pick the one with the lowest std across the CV folds.  
Predictions are made on the test data held out by the train-test split, not on the training data.  
We use 10 fold cross validation to visualize the accuracy against the number of classifiers.

In [None]:

# Add best values for the classifier.
clf.max_depth = 6
clf.max_features = 4
clf.min_samples_split = 48
clf.criterion = 'entropy'

# Plot accuracy vs number of estimators.
estimators_vs_acc(clf, X_train, y_train, estimators_array=range(1, 1000, 100))

### Predict
Predict using the best number of classifiers, based on the previous cross validation method.  
The best number of estimators seems to be between 500 and 900, so we use 600.

In [None]:
# Create the bagging classifier.
bg_clf = BaggingClassifier(base_estimator=clf, n_estimators=600, random_state=0)

# Fit and predict.
bg_clf.fit(X_train, y_train)
y_pred = bg_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## Random Forest

In [None]:
# Define a Random Forest classifier.
clf = RandomForestClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(2, 50),
              'max_features': sp_randint(1, X.shape[1]),
              'min_samples_split': sp_randint(2, X.shape[0] // 2),
              'n_estimators': sp_randint(20, 300),
              'criterion': ['gini', 'entropy']}
candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=8, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(28, 34),
              'max_features': range(3, 5),
              'min_samples_split': range(7, 9),
              'n_estimators': range(155, 160),
              'criterion': ['entropy', 'gini']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=8,
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Predict using the Random Forest Ensemble.
Make predictions with the Random Forest Ensemble, using the best parameters found by the above procedure.


In [None]:

# Add best values for the classifier.
clf.max_depth = 28
clf.max_features = 4
clf.min_samples_split = 7
clf.n_estimators = 155
clf.criterion = 'entropy'
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## Stacking

In [None]:
# Define the stacking classifier.
clf1 = KNeighborsClassifier()
clf2 = LinearSVC(random_state=0)
clf3 = DecisionTreeClassifier(random_state=0)
clf4 = GaussianNB()
meta_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'auto', 
                              random_state=0)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4], 
                            meta_classifier=meta_clf)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'kneighborsclassifier__n_neighbors': sp_randint(1, 80),
              'linearsvc__C': np.logspace(-3, 3, 1000),
              'decisiontreeclassifier__max_depth': sp_randint(2, 60),
              'meta-logisticregression__C': np.logspace(-3, 3, 1000)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(sclf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=9, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds, for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)

### Grid Search

In [None]:
# Specify parameter_grid for the search.
param_grid = {'kneighborsclassifier__n_neighbors': range(14, 22),
              'linearsvc__C': range(356, 362, 2),
              'decisiontreeclassifier__max_depth': range(51, 55),
              'meta-logisticregression__C': np.arange(0.01, 0.04, 0.01)}

# Run a grid search CV.
grid_search = GridSearchCV(sclf, param_grid, cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'.format((time.time() - start)))
report(grid_search.cv_results_)


### Plot stacking results

In [None]:
# Use best values for the classifier.
clf1.n_neighbors = 14
clf2.C = 356
clf3.max_depth = 51
meta_clf.C = 0.03

# Plot accuracy +- std for each model separately
# and compare it with the stacking model.
clf_names = ['KNN', 'Linear-SVM', 'Decision Tree', 'Naive-Bayes',
             'Logistic Regression', 'Stacking']
models = clf1, clf2, clf3, clf4, meta_clf, sclf
plot_accuracy_stacking(clf_names, models, X_train, y_train)

In [None]:
# Plot the learning curves.
plot_learning_curve(models, clf_names, X_train, y_train)


### Predict using the Stacking Ensemble.
Create a classification report with the Stacking Ensemble.


In [None]:
# Fit and predict.
sclf.fit(X_train,y_train)
y_pred = sclf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## XGBoosting

In [None]:
# Create an XGBoost classifier with a binary logistic objective.
xgb_clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=600, 
                            objective='binary:logistic', silent=True, nthread=1)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
params = {'min_child_weight': sp_randint(4, 30),
          'gamma': np.random.uniform(0.5, 4,size=5),
          'subsample': np.random.uniform(0, 1, size=4),
          'colsample_bytree': np.random.uniform(0, 1, size=4),
          'max_depth': sp_randint(1, 7)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(xgb_clf, params, candidates, cv=10,
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)


### Grid Search

In [None]:
# Specify parameter_grid for the search.
params = {'min_child_weight': range(2, 8),
          'gamma': np.arange(2, 4, 0.2),
          'max_depth': range(1, 8)}
xgb_clf.subsample = 0.24
xgb_clf.colsample_bytree = 0.74

# Run a grid search CV.
grid_search = GridSearchCV(xgb_clf, params , cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)



### Predict using the boosted model.
Make predictions with the XGBoosting method, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
xgb_clf.min_child_weight = 2
xgb_clf.gamma = 2
xgb_clf.max_depth = 1

# Fit and predict.
xgb_clf.fit(X_train,y_train)
y_pred = xgb_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

# Digits


## Prepare the dataset.
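
The cell below ends with `PCA(0.95)`. Passing a float between 0 and 1 asks scikit-learn's PCA to keep the smallest number of components whose cumulative explained variance reaches that fraction. The short sketch below shows this on the full digits data; unlike the notebook cell it does not split the data first, which is a simplification for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

X_demo, _ = load_digits(return_X_y=True)
X_scaled = MinMaxScaler().fit_transform(X_demo)

# A float in (0, 1) selects the number of components by explained variance.
pca_demo = PCA(0.95)
X_reduced = pca_demo.fit_transform(X_scaled)

explained = pca_demo.explained_variance_ratio_.sum()
print('components kept:', pca_demo.n_components_)
print('variance explained: {:.3f}'.format(explained))
```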


In [None]:
# Get x and y.
X, y = load_digits(return_X_y=True)

# Split to training and test pairs.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=0)

# Scale data.
scaler = preprocessing.MinMaxScaler()
X_train = scaler.fit_transform(X_train.astype(float))
X_test = scaler.transform(X_test.astype(float))

pca = PCA(0.95)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)

## Bagging

In [None]:
# Define a Decision Tree classifier for the ensemble.
clf = DecisionTreeClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation, for the classifier which will be used in the bagging.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(1, 80),
              'max_features': sp_randint(1, X_train.shape[1]),
              'min_samples_split': sp_randint(2, X_train.shape[0] // 2),
              'criterion': ['gini', 'entropy']}
candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(10, 16),
              'max_features': range(22, 28),
              'min_samples_split': range(10, 14),
              'criterion': ['gini', 'entropy']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=5, 
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Create the Bagging Ensemble.
Create the Bagging Ensemble, using the best parameters found from the above procedure.  
Since the mean accuracy is the same for the top candidates, we pick the one with the lowest std across the CV folds.  
Predictions are made on the test data held out by the train-test split, not on the training data.  
We use 10 fold cross validation to visualize the accuracy against the number of classifiers.

In [None]:

# Add best values for the classifier.
clf.max_depth = 15
clf.max_features = 24
clf.min_samples_split = 12
clf.criterion = 'gini'

# Plot accuracy vs number of estimators.
estimators_vs_acc(clf, X_train, y_train, estimators_array=range(1, 1000, 100))

### Predict
Predict using the best number of classifiers, based on the previous cross validation method.  
The best number of estimators seems to be 500: the accuracy is high,  
the lower bound of its deviation is better than the others, and its upper bound is the best.

In [None]:
# Create the bagging classifier.
bg_clf = BaggingClassifier(base_estimator=clf, n_estimators=500, random_state=0)

# Fit and predict.
bg_clf.fit(X_train, y_train)
y_pred = bg_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## Random Forest

In [None]:
# Define a Random Forest classifier.
clf = RandomForestClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(2, 50),
              'max_features': sp_randint(1, X_train.shape[1]),
              'min_samples_split': sp_randint(2, X_train.shape[0] // 2),
              'n_estimators': sp_randint(20, 300),
              'criterion': ['gini', 'entropy']}
candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=8, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(27, 29),
              'max_features': range(5, 7),
              'n_estimators': range(129, 139, 2),
              'criterion': ['entropy', 'gini']}

clf.min_samples_split = 7

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=8,
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Predict using the Random Forest Ensemble.
Make predictions with the Random Forest Ensemble, using the best parameters found by the above procedure.
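
The next cell copies the chosen values onto the classifier by hand. When the top-ranked candidate is the one wanted, the same step can be done with `best_params_` and `set_params`, as in the sketch below. The Iris data and the tiny grid are assumptions for illustration; note that `best_params_` ranks by mean score only, so it does not apply the lowest-std tie-break used in this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=0)

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      {'max_depth': [2, 3], 'n_estimators': [10, 20]}, cv=3)
search.fit(X_tr, y_tr)

# best_params_ holds the winning combination; set_params copies it over.
final_clf = RandomForestClassifier(random_state=0)
final_clf.set_params(**search.best_params_)
final_clf.fit(X_tr, y_tr)

test_accuracy = final_clf.score(X_te, y_te)
print(search.best_params_, '{:.3f}'.format(test_accuracy))
```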


In [None]:

# Add best values for the classifier.
clf.max_depth = 27
clf.max_features = 6
clf.n_estimators = 129
clf.criterion = 'gini'
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## Stacking

In [None]:
# Define the stacking classifier.
clf1 = KNeighborsClassifier()
clf2 = LinearSVC(random_state=0)
clf3 = DecisionTreeClassifier(random_state=0)
clf4 = GaussianNB()
meta_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'auto', 
                              random_state=0)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4], 
                            meta_classifier=meta_clf)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'kneighborsclassifier__n_neighbors': sp_randint(1, 80),
              'linearsvc__C': np.logspace(-3, 3, 1000),
              'decisiontreeclassifier__max_depth': sp_randint(2, 80),
              'meta-logisticregression__C': np.logspace(-3, 3, 1000)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(sclf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=9, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds, for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)

### Grid Search

In [None]:
# Specify parameter_grid for the search.
param_grid = {'kneighborsclassifier__n_neighbors': range(2, 4),
              'linearsvc__C': np.arange(0.1, 0.2, 0.02),
              'decisiontreeclassifier__max_depth': range(12, 14),
              'meta-logisticregression__C': range(120, 125)}

# Run a grid search CV.
grid_search = GridSearchCV(sclf, param_grid, cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'.format((time.time() - start)))
report(grid_search.cv_results_)


### Plot stacking results

In [None]:
# Use best values for the classifier.
clf1.n_neighbors = 3
clf2.C = 0.1
clf3.max_depth = 13
meta_clf.C = 120

# Plot accuracy +- std for each model separately
# and compare it with the stacking model.
clf_names = ['KNN', 'Linear-SVM', 'Decision Tree', 'Naive-Bayes',
             'Logistic Regression', 'Stacking']
models = clf1, clf2, clf3, clf4, meta_clf, sclf
plot_accuracy_stacking(clf_names, models, X_train, y_train)

*Note:* in this case, the Decision Tree classifier seems to negatively affect the overall performance of the stacking model.

In [None]:
# Plot the learning curves.
plot_learning_curve(models, clf_names, X_train, y_train)


### Predict using the Stacking Ensemble.
Create a classification report with the Stacking Ensemble.


In [None]:
# Fit and predict.
sclf.fit(X_train,y_train)
y_pred = sclf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)

## XGBoosting

In [None]:
# Create an XGBoost classifier with a logistic objective.
xgb_clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=600, 
                            objective='binary:logistic', silent=True, nthread=1)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
params = {'min_child_weight': sp_randint(4, 30),
          'gamma': np.random.uniform(0.5, 4,size=5),
          'subsample': np.random.uniform(0, 1, size=4),
          'colsample_bytree': np.random.uniform(0, 1, size=4),
          'max_depth': sp_randint(1, 30)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(xgb_clf, params, candidates, cv=10,
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)


### Grid Search

In [None]:
# Specify parameter_grid for the search.
params = {'min_child_weight': range(7, 9),
          'gamma': [0.66, 0.7, 0.02],
          'max_depth': range(10, 13)}
xgb_clf.subsample = 0.75
xgb_clf.colsample_bytree = 0.16

# Run a grid search CV.
grid_search = GridSearchCV(xgb_clf, params , cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)



### Predict using the boosted model.
Make predictions with the XGBoosting method, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
xgb_clf.min_child_weight = 7
xgb_clf.gamma = 0.02
xgb_clf.max_depth = 10

# Fit and predict.
xgb_clf.fit(X_train,y_train)
y_pred = xgb_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred)


# Chess (King-Rook vs. King-Pawn)



## Prepare the dataset.


In [None]:

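# Read the dataset. The UCI file has no header row, so pass header=None
# to keep the first record as data instead of using it as column names.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data"
s = requests.get(url).content
dataset = read_csv(io.StringIO(s.decode('utf-8')), header=None)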

# Get x and y.
X, y = dataset.iloc[:, :-1].values, dataset.iloc[:, -1].values

# Encode the categorical features and the class labels as integers.
X_new = OrdinalEncoder().fit_transform(X)
y_new = LabelEncoder().fit_transform(y)

# Split to training and test pairs.
X_train, X_test, y_train, y_test = train_test_split(X_new, y_new, test_size=0.3, 
                                                    random_state=0)


## Bagging

In [None]:
# Define a Decision Tree classifier for the ensemble.
clf = DecisionTreeClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation, for the classifier which will be used in the bagging.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(1, 30),
              'max_features': sp_randint(1, X_train.shape[1]),
              'min_samples_split': sp_randint(2, X_train.shape[0] // 2),
              'criterion': ['gini', 'entropy']}
candidates = 200

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 10 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(15, 19),
              'max_features': range(31, 34),
              'min_samples_split': range(14, 17),
              'criterion': ['gini', 'entropy']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=10, n_jobs=-1, verbose=5, 
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


### Create the Bagging Ensemble.
Create the Bagging Ensemble, using the best parameters found by the above procedure.  
Since several candidates share the same mean accuracy, we pick the one with the lowest standard deviation across the CV folds.  
Predictions are made on the test data that were held out by the train-test split, not on the training data.  
We use 10 fold cross validation to visualize how the accuracy changes with the number of classifiers.
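
The tie-breaking rule described above can be sketched as follows. The key names match those of `GridSearchCV.cv_results_`, but `best_with_min_std`, the tolerance and the scores are made up for illustration and are not the notebook's results.

```python
def best_with_min_std(cv_results, tol=1e-12):
    """Return the index of the top-scoring candidate, breaking ties by lowest std."""
    means = cv_results['mean_test_score']
    stds = cv_results['std_test_score']
    top = max(means)
    # Candidates whose mean accuracy equals the best one (within tol).
    tied = [i for i, m in enumerate(means) if abs(m - top) <= tol]
    return min(tied, key=lambda i: stds[i])


# Made-up numbers that mimic the layout of GridSearchCV.cv_results_.
cv_results = {
    'mean_test_score': [0.970, 0.975, 0.975, 0.960],
    'std_test_score': [0.010, 0.012, 0.008, 0.005],
    'params': [{'max_depth': 15}, {'max_depth': 16},
               {'max_depth': 17}, {'max_depth': 18}],
}
winner = best_with_min_std(cv_results)
print(cv_results['params'][winner])
```

Two candidates tie on mean accuracy here, and the one with the smaller spread across folds is selected.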

In [None]:

# Add best values for the classifier.
clf.max_depth = 15
clf.max_features = 31
clf.min_samples_split = 14
clf.criterion = 'entropy'

# Plot accuracy vs number of estimators.
estimators_vs_acc(clf, X_train, y_train, estimators_array=range(1, 1000, 100))

### Predict
Predict using the best number of classifiers, based on the previous cross validation.  
The best number of estimators seems to be 1000: the accuracy is high,  
the lower bound of its deviation is better than for the other values and the upper bound is the best among the values above 500.
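
One way to make this kind of visual judgement explicit is to rank the ensemble sizes by the lower edge of their accuracy band (mean minus std). This is only a sketch: `rank_by_lower_bound` is a hypothetical helper and the scores are invented, so the output says nothing about the actual plot.

```python
def rank_by_lower_bound(n_estimators, means, stds):
    """Sort ensemble sizes by the lower edge of their accuracy band (mean - std)."""
    bands = [(n, m - s, m + s) for n, m, s in zip(n_estimators, means, stds)]
    return sorted(bands, key=lambda band: band[1], reverse=True)


# Made-up cross-validation scores, not the notebook's results.
sizes = [1, 250, 500, 1000]
means = [0.950, 0.985, 0.988, 0.989]
stds = [0.020, 0.009, 0.008, 0.006]

ranked = rank_by_lower_bound(sizes, means, stds)
best_size = ranked[0][0]
for n, lower, upper in ranked:
    print('n_estimators={:4d}  band=[{:.3f}, {:.3f}]'.format(n, lower, upper))
```

Ranking by the lower bound favours sizes that are both accurate and stable, which is the same trade-off the paragraph above describes informally.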

In [None]:
# Create the bagging classifier.
bg_clf = BaggingClassifier(base_estimator=clf, n_estimators=1000, random_state=0)

# Fit and predict.
bg_clf.fit(X_train, y_train)
y_pred = bg_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## Random Forest

In [None]:
# Define a Random Forest classifier.
clf = RandomForestClassifier(random_state=0)

### Random Search
Run a Random Search using 10 fold cross validation.

In [None]:

# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'max_depth': sp_randint(2, 50),
              'max_features': sp_randint(1, X_train.shape[1]),
              'min_samples_split': sp_randint(2, X_train.shape[0] // 2),
              'n_estimators': sp_randint(20, 300),
              'criterion': ['gini', 'entropy']}
candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(clf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=8, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)



### Grid Search
Run a Grid search, using 5 fold cross validation for the area around the best results found with the Random Search.

In [None]:
# Specify parameter_grid for the search.
param_grid = {'max_depth': range(41, 45),
              'max_features': range(22, 26),
              'min_samples_split': range(11, 14),
              'n_estimators': range(50, 70, 5),
              'criterion': ['entropy', 'gini']}

# Run a grid search CV.
grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1, verbose=8,
                           iid=True)
start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)

### Predict using the Random Forest Ensemble.
Make predictions with the Random Forest Ensemble, using the best parameters found by the above procedure.


In [None]:

# Add best values for the classifier.
clf.max_depth = 41
clf.max_features = 22
clf.min_samples_split = 11
clf.n_estimators = 50
clf.criterion = 'entropy'
clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## Stacking

In [None]:
# Define the stacking classifier.
clf1 = KNeighborsClassifier()
clf2 = LinearSVC(random_state=0)
clf3 = DecisionTreeClassifier(random_state=0)
clf4 = GaussianNB()
meta_clf = LogisticRegression(solver = 'lbfgs', multi_class = 'auto', 
                              random_state=0)
sclf = StackingCVClassifier(classifiers=[clf1, clf2, clf3, clf4], 
                            meta_classifier=meta_clf)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
param_dist = {'kneighborsclassifier__n_neighbors': sp_randint(1, X_train.shape[0] // 3),
              'linearsvc__C': [0.01, 0.1, 1, 10, 100],
              'decisiontreeclassifier__max_depth': sp_randint(2, 60),
              'meta-logisticregression__C': [0.01, 0.1, 1, 10, 100, 1000]}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(sclf, param_dist, candidates, cv=10, 
                                   n_jobs=-1, verbose=9, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds, for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)

### Grid Search

In [None]:
# Specify parameter_grid for the search.
param_grid = {'kneighborsclassifier__n_neighbors': range(410, 415),
              'linearsvc__C': [10,12,14,16],
              'decisiontreeclassifier__max_depth': range(25,28),
              'meta-logisticregression__C': [99,100,101,104]}

# Run a grid search CV.
grid_search = GridSearchCV(sclf, param_grid, cv = 5, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'.format((time.time() - start)))
report(grid_search.cv_results_)


### Plot stacking results

In [None]:
# Use best values for the classifier.
clf1.n_neighbors = 410
clf2.C = 14
clf3.max_depth = 25
meta_clf.C = 100

# Plot accuracy +- std for each model separately
# and compare it with the stacking model.
clf_names = ['KNN', 'Linear-SVM', 'Decision Tree', 'Naive-Bayes',
             'Logistic Regression', 'Stacking']
models = clf1, clf2, clf3, clf4, meta_clf, sclf
plot_accuracy_stacking(clf_names, models, X_train, y_train)

*Note:* in this case, the Naive Bayes classifier seems to negatively affect the overall performance of the stacking model.

In [None]:
# Plot the learning curves.
plot_learning_curve(models, clf_names, X_train, y_train)

### Predict using the Stacking Ensemble.
Create a classification report with the Stacking Ensemble.


In [None]:
# Fit and predict.
sclf.fit(X_train,y_train)
y_pred = sclf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

## XGBoosting

In [None]:
# Create an XGBoost classifier with a logistic objective.
xgb_clf = xgb.XGBClassifier(learning_rate=0.03, n_estimators=600, 
                            objective='binary:logistic', silent=True, nthread=1)

### Random Search

In [None]:
# Specify parameters and distributions to sample from 
# and candidates to be created.
params = {'min_child_weight': sp_randint(4, 30),
          'gamma': np.random.uniform(0.5, 4,size=5),
          'subsample': np.random.uniform(0, 1, size=4),
          'colsample_bytree': np.random.uniform(0, 1, size=4),
          'max_depth': sp_randint(1, 7)}

candidates = 50

# Run a random search CV.
random_search = RandomizedSearchCV(xgb_clf, params, candidates, cv=10,
                                   n_jobs=-1, verbose=5, iid=True)
start = time.time()
random_search.fit(X_train, y_train)
print('RandomizedSearchCV took {:.2f} seconds for {} candidates.'
      .format((time.time() - start), candidates))
report(random_search.cv_results_)


### Grid Search

In [None]:
# Specify parameter_grid for the search.
params = {'min_child_weight': range(4, 7),
          'gamma': [0.75,0.79,0.84,0.9],
          'max_depth': range(4, 7)}
xgb_clf.subsample = 0.667
xgb_clf.colsample_bytree = 0.363

# Run a grid search CV.
grid_search = GridSearchCV(xgb_clf, params , cv = 10, n_jobs=-1, verbose=5,
                           iid=True)

start = time.time()
grid_search.fit(X_train, y_train)
print('GridSearchCV took {:.2f} seconds.'
      .format((time.time() - start)))
report(grid_search.cv_results_)


*Note:* The Grid Search procedure did not improve the accuracy score, but it led to a lower standard deviation.


### Predict using the boosted model.
Make predictions with the XGBoosting method, using the best parameters found by the procedure above.


In [None]:

# Add best values for the classifier.
xgb_clf.min_child_weight = 4
xgb_clf.gamma = 0.75
xgb_clf.max_depth = 6

# Fit and predict.
xgb_clf.fit(X_train,y_train)
y_pred = xgb_clf.predict(X_test)

# Print a classification report.
full_report(y_test, y_pred, 'binary')

# Results
![Comparison Table](https://github.com/Adamantios/Ensembles-CSL-Imb_Learning-Models_Comparison/blob/master/images/Comparison_Table.png?raw=1)

# Conclusion

As we can see, Bagging has the most wins, followed by Random Forest and XGBoost,
with Stacking in last position.  
To compare the algorithms further, we ran the following statistical tests:  

## Iman Davenport's correction of Friedman's rank sum

* Corrected Friedman's chi-squared = 0.13706
* df1 = 3, df2 = 27 - the test's degrees of freedom.
* p-value = 0.937
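
For reference, the statistic can be recomputed from the mean ranks with the standard Friedman and Iman-Davenport formulas. In the sketch below the mean ranks are an assumption: they are inferred from the Bagging vs Random Forest, Random Forest vs Stacking and Random Forest vs XGBoost differences in the Nemenyi table of the next section, together with the fact that the mean ranks of 4 algorithms sum to 10. The function name is hypothetical, and the p-value is not computed because that needs an F distribution from a library such as SciPy.

```python
def iman_davenport(mean_ranks, n_datasets):
    """Return (Friedman chi-squared, Iman-Davenport F, df1, df2)."""
    k = len(mean_ranks)
    n = n_datasets
    sum_sq = sum(r * r for r in mean_ranks)
    # Friedman chi-squared computed from the mean ranks.
    chi2 = 12.0 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)
    # Iman-Davenport correction, F-distributed with (df1, df2).
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat, k - 1, (k - 1) * (n - 1)


# Assumed mean ranks (Bagging, Random Forest, Stacking, XGBoost),
# inferred from the reported rank differences, not taken from raw results.
mean_ranks = [2.30, 2.65, 2.60, 2.45]
chi2, f_stat, df1, df2 = iman_davenport(mean_ranks, n_datasets=10)
print(round(chi2, 4), round(f_stat, 5), df1, df2)
```

Under these assumptions the corrected statistic and the degrees of freedom agree with the values listed above, which suggests the inferred ranks are consistent with the reported test.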

The p-value is higher than 0.05, so we cannot reject the null hypothesis that the 4 algorithms perform equally: 
there is no statistically significant difference among them.  
For completeness, we also ran the Nemenyi post-hoc test and the Friedman post-hoc test with Bergmann and Hommel’s correction.

## Nemenyi post-hoc

* Critical difference = 1.5549
* k = 4 - the number of groups (or treatments)
* df = 36 - the test's degrees of freedom.

| Algorithm         | Bagging| Random Forest| Stacking| XGBoost|
|:-----------------:|:------:|:------------:|:-------:|:------:|
| **Bagging**       | *0.00* | -0.35        | -0.30   | -0.15  |
| **Random Forest** | -0.35  | *0.00*       |  0.05   |  0.20  |
| **Stacking**      | -0.30  |  0.05        | *0.00*  |  0.15  |
| **XGBoost**       | -0.15  |  0.20        |  0.15   | *0.00* |

![Nemenyi](https://github.com/Adamantios/Ensembles-CSL-Imb_Learning-Models_Comparison/blob/master/images/Nemenyi.png?raw=1)  
Two algorithms differ significantly only if the distance between their mean ranks exceeds the CD.  
No pair does here: the largest absolute difference is 0.35, well below 1.5549.

## Friedman post-hoc test with Bergmann and Hommel’s correction

| Algorithm         | Bagging | Random Forest| Stacking| XGBoost |
|:-----------------:|:-------:|:------------:|:-------:|:-------:|
| **Bagging**       |  *NA*   | 0.5443701    |0.6033318|0.7950122|
| **Random Forest** |0.5443701| *NA*         |0.9309874|0.7290345|
| **Stacking**      |0.6033318| 0.9309874    |  *NA*   |0.7950122|
| **XGBoost**       |0.7950122| 0.7290345    |0.7950122|  *NA*   |

The above test reports the adjusted p-value for every pair of algorithms.  
All of them are well above 0.05, so this test finds no statistically significant difference either.
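
As a small check of the table, the adjusted p-values can be scanned against the significance level. The dictionary simply restates the upper triangle of the table; `ALPHA = 0.05` is assumed to be the level used throughout.

```python
ALPHA = 0.05  # assumed significance level

# Upper triangle of the adjusted p-value table above.
adjusted_p = {
    ('Bagging', 'Random Forest'): 0.5443701,
    ('Bagging', 'Stacking'): 0.6033318,
    ('Bagging', 'XGBoost'): 0.7950122,
    ('Random Forest', 'Stacking'): 0.9309874,
    ('Random Forest', 'XGBoost'): 0.7290345,
    ('Stacking', 'XGBoost'): 0.7950122,
}

# Pairs for which the null hypothesis of equal performance is rejected.
rejected = [pair for pair, p in adjusted_p.items() if p < ALPHA]
closest_to_rejection = min(adjusted_p, key=adjusted_p.get)
print(rejected, closest_to_rejection)
```

No pair falls below the threshold, and the smallest adjusted p-value belongs to Bagging vs Random Forest.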
