In [13]:
"""
1. Suppose 5 different models reach 0.95 accuracy on a certain dataset. Does it make sense to combine them?
    Yes, it is worth trying. The combination only leads to a higher accuracy if the models make different
    (ideally uncorrelated) mistakes. With hard or soft voting, the mistakes of one model can then be outvoted
    by the other models.

2. Difference between hard and soft voting?
    Hard voting: each classifier casts one vote with the same weight. The final prediction label
    equals the mode of all votes.
    Soft voting: requires that every classifier can estimate class probabilities. The probabilities are averaged
    over all classifiers and the class with the highest average wins, so confident predictions carry more weight.

3. Can we parallelize the ensemble training on multiple servers in the following cases?
a)  Bagging: Yes. Each estimator is trained independently on its own random sample of the training data, so the
    estimators can be trained simultaneously on different servers.
b)  Boosting: No. Each iteration depends on the results of the previous iteration, so training is sequential.
    Splitting it across multiple servers would gain little and would probably worsen the overall calculation speed.
c)  Random Forests: Yes. The trees are independent, so we could assign a separate server to each tree of the forest.
d)  Stacking: Partly. The models within one layer are independent and can be trained on different servers. The only
    restriction is that every model of a layer must finish training before we can continue with the next layer.

4. Benefit of OOB Evaluation?
    We don't need to hold out a separate validation set. Each estimator of a bagging ensemble is trained on a bootstrap
    sample that leaves out roughly 37% of the training instances on average, and these out-of-bag instances serve as a
    "pseudo test set" for that estimator. The estimate is most reliable if the dataset is fairly large and the classes
    are reasonably balanced.

5. Why are Extra-Trees more random than regular RandomForests?
    The key difference lies in how the split threshold is chosen. A regular RandomForest searches for the best threshold
    of each candidate feature at every node, whereas Extra-Trees draw a random threshold for each candidate feature and
    keep the best of these random splits. This makes them significantly faster to train than a regular random forest,
    since the search for the best threshold accounts for a good share of the total calculation cost. Extra-Trees trade
    a somewhat higher bias for a lower variance.

6. What hyperparameter adjustments to treat an underfitting AdaBoost ensemble?
    AdaBoost can underfit the data if there are too few iterations. First, we should increase the number of
    estimators as far as the calculation effort allows. After that we can try to increase the learning rate, which
    may be too low. If the base estimator is regularized, it is worth reducing that regularization.

7. How to adjust the learning rate of an overfitting Gradient-Boosting ensemble?
    We should decrease the learning rate. We can also lower the number of iterations, for example with early stopping.

"""


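The difference between hard and soft voting from question 2 can be illustrated with a minimal pure-Python sketch. The probabilities below are made up for a single sample, and the helper names `hard_vote` and `soft_vote` are hypothetical, not part of scikit-learn.

```python
from collections import Counter

# Hypothetical class-probability outputs of three classifiers for one sample
# (classes 0 and 1); the numbers are invented for illustration.
probas = [
    [0.45, 0.55],  # classifier A: weakly prefers class 1
    [0.40, 0.60],  # classifier B: weakly prefers class 1
    [0.95, 0.05],  # classifier C: strongly prefers class 0
]

def hard_vote(probas):
    """Each classifier votes for its most likely class; return the mode."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]

def soft_vote(probas):
    """Average the class probabilities; return the class with the highest mean."""
    n_classes = len(probas[0])
    means = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=means.__getitem__)

hard_label = hard_vote(probas)
soft_label = soft_vote(probas)
print(hard_label, soft_label)
```

Here two weak votes for class 1 win the hard vote, while the single very confident classifier tips the soft vote towards class 0.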
In [14]:
"""
8. Applying Ensemble to MNIST

- train RandomForest, ExtraTrees + SVM on MNIST
- try to combine them to an ensemble with hard and/or soft voting

Previous scores on MNIST
========================
KNN MNIST data: 0.9714
KNN with augmented MNIST data: 0.97815
SVM Classifier: {'svc__estimator__C': 0.07, 'svc__estimator__gamma': 0.1, 'svc__estimator__kernel': 'poly'}: 0.9519

TO DO:
"import" our augmented data creator
"import" our SVM Classifier
"import" our KNN classifier

- create RandomForest and ExtraTrees 

- hard/soft vote the results with VotingClassifier 
    or
- create Stacking engine based on the results of 3 classifiers


"""



In [2]:
"""Load the MNIST data (apply augmentation ONLY after successful testing)"""
from joblib import load
from sklearn.metrics import accuracy_score
import numpy as np

mnist_data = load("C:/Users/MaxB2/Documents/Machine_Is_Learning/mnist_dataset_784_v1.joblib")

X,y = mnist_data["data"],mnist_data["target"]
y = y.astype(np.uint8)
X_train, X_test, y_train, y_test = X[:60000],X[60000:],y[:60000],y[60000:] 

from sklearn.model_selection import StratifiedShuffleSplit
SS_split = StratifiedShuffleSplit(n_splits=5, test_size=(1/6), random_state=42)
# split() yields positional indices, so the rows are selected with .iloc.
# Each iteration overwrites the previous one; only the last of the 5 splits is kept.
for train_index, test_index in SS_split.split(X_train,y_train):
    X_train_50k = X_train.iloc[train_index]
    y_train_50k = y_train.iloc[train_index]
    X_validate_10k = X_train.iloc[test_index]
    y_validate_10k = y_train.iloc[test_index]

print(len(X_train_50k))
print(len(y_train_50k))
print(len(X_validate_10k))
print(len(y_validate_10k))
print(len(X_test))
print(len(y_test))

import pandas as pd

# Proportion of each class in the full dataset and in each split.
# Note: value_counts() sorts by frequency, so row i of the table holds the
# i-th most frequent class of each column, not necessarily the same digit.
total_digit_counts = mnist_data["target"].value_counts() / len(mnist_data["target"])
strat_validation_value_counts = y_validate_10k.value_counts()/ len(X_validate_10k)
train_set_value_counts = y_train_50k.value_counts()/ len(X_train_50k)
test_set_value_counts = y_test.value_counts()/ len(X_test)

# Collect the proportions in one DataFrame for a side-by-side comparison
df_comparison = pd.DataFrame({
    "Normal Data Proportion": total_digit_counts.values,
    "Validation Set Proportion": strat_validation_value_counts.values,
    "Train Set Proportion":train_set_value_counts.values,
    "Test Set Proportion":test_set_value_counts.values
})
df_comparison

50000
50000
10000
10000
10000
10000


Frequency rank,Normal Data Proportion,Validation Set Proportion,Train Set Proportion,Test Set Proportion
0,0.112529,0.1124,0.11236,0.1135
1,0.104186,0.1044,0.10442,0.1032
2,0.102014,0.1022,0.10218,0.1028
3,0.099857,0.0993,0.0993,0.101
4,0.0994,0.0992,0.09914,0.1009
5,0.098614,0.0987,0.09872,0.0982
6,0.098229,0.0986,0.09864,0.098
7,0.0975,0.0975,0.09752,0.0974
8,0.097486,0.0974,0.09736,0.0958
9,0.090186,0.0903,0.09036,0.0892
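
Because `value_counts()` orders each Series by frequency and `.values` discards the class labels, the rows of the table above are frequency ranks rather than digits. A minimal sketch of a label-aligned comparison follows; the toy Series `y_a` and `y_b` and the helper `class_proportions` are hypothetical stand-ins for the notebook's label sets.

```python
import pandas as pd

# Toy labels standing in for the notebook's label Series (hypothetical data).
y_a = pd.Series([1, 1, 1, 7, 7, 3])
y_b = pd.Series([7, 7, 7, 1, 1, 3])

def class_proportions(labels):
    """Share of each class, indexed and sorted by class label."""
    return labels.value_counts(normalize=True).sort_index()

# Building the frame from Series (not from .values) aligns rows on the class label.
df_aligned = pd.DataFrame({
    "Set A Proportion": class_proportions(y_a),
    "Set B Proportion": class_proportions(y_b),
})
df_aligned.index.name = "digit"
print(df_aligned)
```

With this layout every row compares the same digit across all sets, which is what a stratification check needs.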


In [4]:
type(mnist_data["target"])

pandas.core.series.Series

In [17]:
# KNN baseline (accuracy_score is already imported in the data-loading cell)
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=4, weights='distance')
knn.fit(X_train_50k,y_train_50k)
y_val_pred_knn = knn.predict(X_validate_10k)
acc_y_val_pred_knn = accuracy_score(y_validate_10k, y_val_pred_knn)
acc_y_val_pred_knn

0.9759

In [27]:
# Random forest classifier (the variable is named rfr)

from sklearn.ensemble import RandomForestClassifier

rfr = RandomForestClassifier(n_estimators=200,max_leaf_nodes=2048,n_jobs=-1)
rfr.fit(X_train_50k,y_train_50k)
y_val_pred_rfr = rfr.predict(X_validate_10k)
acc_y_val_pred_rfr = accuracy_score(y_validate_10k, y_val_pred_rfr)
acc_y_val_pred_rfr

0.9634
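
The out-of-bag evaluation from question 4 can be tried on a random forest directly. The sketch below is an illustration under assumptions: it uses scikit-learn's small bundled `load_digits` dataset instead of MNIST, and the hyperparameters are arbitrary.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

# Small 8x8 digits dataset bundled with scikit-learn (not the notebook's MNIST).
X_small, y_small = load_digits(return_X_y=True)

# oob_score=True scores every instance using only the trees whose
# bootstrap sample did not contain that instance.
forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
forest.fit(X_small, y_small)

oob_accuracy = forest.oob_score_
print(f"OOB accuracy: {oob_accuracy:.4f}")
```

The OOB score is obtained from the training data alone, so no rows have to be set aside for validation.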

In [28]:
# Extra-Trees classifier with the same settings as the random forest
from sklearn.ensemble import ExtraTreesClassifier

etc = ExtraTreesClassifier(n_estimators=200,max_leaf_nodes=2048,n_jobs=-1)
etc.fit(X_train_50k,y_train_50k)
y_val_pred_etc = etc.predict(X_validate_10k)
acc_y_val_pred_etc = accuracy_score(y_validate_10k, y_val_pred_etc)
acc_y_val_pred_etc

0.9603

In [52]:
X_validate_10k.iloc[0]

pixel1      0.0
pixel2      0.0
pixel3      0.0
pixel4      0.0
pixel5      0.0
           ... 
pixel780    0.0
pixel781    0.0
pixel782    0.0
pixel783    0.0
pixel784    0.0
Name: 20781, Length: 784, dtype: float64

In [61]:
type(X_validate_10k)


pandas.core.frame.DataFrame

In [69]:
estimators = [etc,rfr,knn]
data4blender = pd.DataFrame(columns=['data', 'label'])
def fill_dataframe(data, labels):
    # Check if the lengths of data and labels match
    if len(data) != len(labels):
        raise ValueError("Lengths of data and labels must match.")
    # Add data and labels to the DataFrame
    data4blender['data'] = data
    data4blender['label'] = labels



triple_predictions = []
temp_triple = []
true_labels = y_train_50k.copy()

for index in range(len(X_train_50k)):
    for estimator in estimators:
        temp_triple.append(estimator.predict(X_train_50k.iloc[index:index+1]))
    triple_predictions.append([temp_triple])
    temp_triple = []

triple_predictions
"""
def triple_pred_plus_label(X_set,y_set,index):
    pred1 = etc.predict(X_set.iloc[index])
    pred2 = rfr.predict(X_set.iloc[index])
    pred3 = knn.predict(X_set.iloc[index])
    return (pred1,pred2,pred3),y_set[index]
"""


KeyboardInterrupt: 
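
A blender is usually trained on predictions for data that the base estimators did not see during fitting. One possible way to get such predictions is `cross_val_predict`; the sketch below is an assumption-laden illustration on scikit-learn's small `load_digits` dataset (not MNIST), with a logistic-regression blender chosen only as an example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_small, y_small = load_digits(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_small, y_small, test_size=0.25, stratify=y_small, random_state=42
)

base_estimators = [
    ExtraTreesClassifier(n_estimators=50, random_state=42),
    RandomForestClassifier(n_estimators=50, random_state=42),
    KNeighborsClassifier(n_neighbors=4, weights="distance"),
]

# Out-of-fold class probabilities: every row is predicted by models
# that were fitted on the other folds only.
oof = np.hstack([
    cross_val_predict(est, X_tr, y_tr, cv=3, method="predict_proba")
    for est in base_estimators
])

# The blender learns how to combine the base estimators' outputs.
blender = LogisticRegression(max_iter=1000)
blender.fit(oof, y_tr)

# Refit each base estimator on the full training part, then blend on validation data.
val_features = np.hstack([
    est.fit(X_tr, y_tr).predict_proba(X_val) for est in base_estimators
])
blend_accuracy = blender.score(val_features, y_val)
print(f"Blended validation accuracy: {blend_accuracy:.4f}")
```

scikit-learn also ships `StackingClassifier`, which wraps the same out-of-fold idea in a single estimator.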

In [38]:
# Voting: soft-voting ensemble of KNN, random forest and Extra-Trees
from sklearn.ensemble import VotingClassifier

vtc = VotingClassifier(
    estimators=[("knn",knn),("rfr",rfr),("etc",etc)],
    voting="soft"
)

# VotingClassifier fits clones of the estimators, so all three are retrained here.
vtc.fit(X_train_50k,y_train_50k)
y_val_pred_vtc = vtc.predict(X_validate_10k)
acc_y_val_pred_vtc = accuracy_score(y_validate_10k, y_val_pred_vtc)

# Test-set evaluation (disabled):
# y_test_pred_knn = knn.predict(X_test)
# acc_y_test_pred_knn = accuracy_score(y_test, y_test_pred_knn)
# print("KNN Test Set Acc: ",acc_y_test_pred_knn)
#
# y_test_pred_vtc = vtc.predict(X_test)
# acc_y_test_pred_vtc = accuracy_score(y_test, y_test_pred_vtc)
# print("Voting Classifier Test Set Acc: ",acc_y_test_pred_vtc)

acc_y_val_pred_vtc

0.9755