## Machine Learning Chapter 7

### 1. If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?
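> Yes. Combining the models into a voting ensemble can still improve results, provided the models are diverse (for example an SVM, a Decision Tree, and a Logistic Regression classifier) so that they make different kinds of errors. Training them on different instances, as in bagging or pasting, helps further, but even on identical training data the ensemble can work as long as the models themselves differ.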
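
As a rough illustration of why voting can help, the sketch below computes how often a strict majority of five classifiers is right. It rests on two simplifying assumptions that are not part of the exercise: each classifier is correct on 95% of instances (accuracy rather than precision), and their errors are fully independent. Models trained on the same data make correlated errors, so the real gain is smaller.

```python
from math import comb

def majority_vote_accuracy(n_models: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent voters is right."""
    min_votes = n_models // 2 + 1  # smallest strict majority
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(min_votes, n_models + 1)
    )

ensemble_acc = majority_vote_accuracy(5, 0.95)
print(f"{ensemble_acc:.4f}")  # 0.9988
```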

### 2. What is the difference between hard and soft voting classifiers?
> Hard voting picks the class that receives the most votes. Soft voting averages the class probabilities estimated by each classifier and picks the class with the highest average. Soft voting often performs better because it gives more weight to confident votes, but it only works if every classifier can estimate class probabilities.
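
The two schemes can disagree on the same instance. The sketch below uses made-up probabilities from three hypothetical classifiers: two lean weakly toward class 1, one is very confident in class 0.

```python
from collections import Counter

# Hypothetical class probabilities [P(class 0), P(class 1)] from three
# classifiers for a single instance (made-up numbers for illustration).
probas = [
    [0.45, 0.55],
    [0.45, 0.55],
    [0.90, 0.10],
]

# Hard voting: each classifier votes for its most likely class.
votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
hard_winner = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then take the argmax.
n_classes = len(probas[0])
mean_proba = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
soft_winner = max(range(n_classes), key=mean_proba.__getitem__)

print(votes)        # [1, 1, 0]
print(hard_winner)  # 1
print(soft_winner)  # 0
```

Hard voting follows the two weak votes, while soft voting lets the single confident classifier pull the average to class 0.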

### 3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, Random Forests, or stacking ensembles?
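> A bagging ensemble can be trained in parallel across multiple servers because its predictors are independent of each other. The same holds for pasting ensembles and Random Forests. Boosting ensembles must be trained sequentially, since each predictor depends on the previous one, so distributing them brings little benefit. In a stacking ensemble, the predictors within a layer can be trained in parallel, but a layer can only be trained after the previous layer has finished.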
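
On a single machine, scikit-learn exposes this independence through the `n_jobs` parameter. The sketch below is a minimal example on a small synthetic dataset (not MNIST); the dataset, sample sizes, and number of workers are arbitrary choices for illustration. Spreading the work across several servers would need additional tooling that is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Small synthetic dataset, used here only to keep the sketch fast.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

# bootstrap=True gives bagging; bootstrap=False would give pasting.
# n_jobs=2 trains the independent trees in two parallel workers.
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=20,
    max_samples=100,
    bootstrap=True,
    n_jobs=2,
    random_state=42,
)
bag_clf.fit(X_demo, y_demo)
print(len(bag_clf.estimators_))  # 20
```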

### 4. What is the benefit of out-of-bag evaluation?
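> In out-of-bag evaluation, each predictor is evaluated on the instances it did not see during training (those left out of its bootstrap sample). This gives a fairly unbiased estimate of the ensemble's performance without setting aside a separate validation set, so more instances remain available for training.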
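
A minimal sketch of out-of-bag evaluation in scikit-learn, assuming a small synthetic dataset rather than MNIST. The held-out split exists only to compare the OOB estimate against a conventional evaluation; it is not needed to obtain `oob_score_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)
X_fit, X_holdout, y_fit, y_holdout = train_test_split(X_demo, y_demo, random_state=42)

oob_clf = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,   # OOB evaluation needs bootstrap sampling
    oob_score=True,   # score each instance using only trees that never saw it
    random_state=42,
)
oob_clf.fit(X_fit, y_fit)

print("OOB estimate:     ", oob_clf.oob_score_)
print("Held-out accuracy:", oob_clf.score(X_holdout, y_holdout))
```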

### 5. What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?
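> A Random Forest considers only a random subset of features at each split and searches for the best threshold for each of them. Extra-Trees also use a random subset of features, but pick a random threshold for each feature instead of searching for the best one. This extra randomness acts as regularization, so it can help when a Random Forest overfits the training data. Because finding the best threshold is one of the most time-consuming parts of growing a tree, Extra-Trees are faster to train than Random Forests; prediction speed is about the same.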
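
The difference in how a single feature's threshold is chosen can be sketched in plain Python. This is a simplified illustration on made-up data, not scikit-learn's implementation: the helper names are hypothetical, and a real Extra-Trees node would still compare the random splits of several candidate features and keep the best of those.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_cost(values, labels, threshold):
    """Weighted Gini impurity after splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Random Forest style: try every midpoint, keep the purest split."""
    points = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(points, points[1:])]
    return min(candidates, key=lambda t: split_cost(values, labels, t))

def random_threshold(values, rng):
    """Extra-Trees style: draw one threshold uniformly, no search."""
    return rng.uniform(min(values), max(values))

# Toy feature values and labels (made up for illustration).
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]

rng = random.Random(42)
print(best_threshold(values, labels))  # 5.0
print(random_threshold(values, rng))   # some value between 1.0 and 9.0
```

The exhaustive search evaluates every candidate threshold, while the random draw costs a single call, which is where the training speed-up comes from.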

### 6. If your AdaBoost ensemble underfits the training data, which hyperparameters should you tweak and how?
> If AdaBoost underfits, try increasing the number of estimators, reducing the regularization of the base estimator, or slightly increasing the learning rate.
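
A sketch of these adjustments, assuming a small synthetic dataset (`make_moons`) and arbitrary hyperparameter values chosen only to contrast a constrained setup with a looser one. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(n_samples=500, noise=0.3, random_state=42)

# Heavily constrained setup that may underfit: few stumps, small learning rate.
weak_ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=5,
    learning_rate=0.1,
    random_state=42,
)

# Adjusted setup: more estimators, a less regularized base estimator,
# and a somewhat higher learning rate.
strong_ada = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)

for name, clf in [("constrained", weak_ada), ("adjusted", strong_ada)]:
    clf.fit(X_demo, y_demo)
    print(name, clf.score(X_demo, y_demo))  # training accuracy
```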

### 7. If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?
> Decrease the learning rate. You can also use early stopping to find the right number of predictors, since an overfitting ensemble probably has too many.
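
Both remedies can be combined in scikit-learn, as in the sketch below. The synthetic regression dataset and the specific values (`learning_rate=0.05`, a patience of 10 rounds) are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)

gbrt = GradientBoostingRegressor(
    learning_rate=0.05,       # smaller steps, each tree contributes less
    n_estimators=500,         # upper bound on the number of trees
    n_iter_no_change=10,      # stop if validation loss stalls for 10 rounds
    validation_fraction=0.1,  # held out internally for early stopping
    random_state=42,
)
gbrt.fit(X_demo, y_demo)
print(gbrt.n_estimators_)  # trees actually trained, at most 500
```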

### 8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM classifier. Next, try to combine them into an ensemble that outperforms each individual classifier on the validation set, using soft or hard voting. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

In [2]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

mnist = fetch_openml('mnist_784', version=1)  # 70,000 images, 784 pixel features each

X = mnist.data.to_numpy()
y = mnist.target.astype(int).to_numpy()  # labels arrive as strings; cast to int

X_train, X_temp, y_train, y_temp = train_test_split(X, y, train_size=50000, stratify=y, random_state=42)  # 50,000 for training
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=10000, stratify=y_temp, random_state=42)  # 10,000 each for validation and test

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit the scaler on the training set only
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)

In [3]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.svm import SVC

rf_clf = RandomForestClassifier(random_state=42)
et_clf = ExtraTreesClassifier(random_state=42)
svm_clf = SVC(probability=True, random_state=42)  # probability=True enables predict_proba, required for soft voting

rf_clf.fit(X_train, y_train)
et_clf.fit(X_train, y_train)
svm_clf.fit(X_train, y_train)

In [4]:
from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(estimators=[('rf', rf_clf), ('et', et_clf), ('svm', svm_clf)], voting='soft')

voting_clf.fit(X_train, y_train)  # trains a fresh clone of each estimator, so all three are fitted again

In [5]:
from sklearn.metrics import accuracy_score

rf_pred = rf_clf.predict(X_val)
et_pred = et_clf.predict(X_val)
svm_pred = svm_clf.predict(X_val)
voting_pred = voting_clf.predict(X_val)

print("Random Forest accuracy:", accuracy_score(y_val, rf_pred))
print("Extra Trees accuracy:", accuracy_score(y_val, et_pred))
print("SVM accuracy:", accuracy_score(y_val, svm_pred))
print("Voting Classifier accuracy:", accuracy_score(y_val, voting_pred))

Random Forest accuracy: 0.967
Extra Trees accuracy: 0.9686
SVM accuracy: 0.9626
Voting Classifier accuracy: 0.9727


In [6]:
rf_pred_test = rf_clf.predict(X_test)
et_pred_test = et_clf.predict(X_test)
svm_pred_test = svm_clf.predict(X_test)
voting_pred_test = voting_clf.predict(X_test)

print("Random Forest test accuracy:", accuracy_score(y_test, rf_pred_test))
print("Extra Trees test accuracy:", accuracy_score(y_test, et_pred_test))
print("SVM test accuracy:", accuracy_score(y_test, svm_pred_test))
print("Voting Classifier test accuracy:", accuracy_score(y_test, voting_pred_test))

Random Forest test accuracy: 0.9653
Extra Trees test accuracy: 0.9679
SVM test accuracy: 0.9624
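Voting Classifier test accuracy: 0.9735

> On the test set, the soft voting classifier reaches 97.35% accuracy, about 0.6 percentage points above the best individual classifier (Extra-Trees at 96.79%) and roughly 1.1 points above the SVM (96.24%).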


### 9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Train a classifier on this new training set. Congratulations, you have just trained a blender, and together with the classifiers it forms a stacking ensemble! Now evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier?

In [None]:
import numpy as np

rf_pred_val = rf_clf.predict(X_val)
et_pred_val = et_clf.predict(X_val)
svm_pred_val = svm_clf.predict(X_val)

X_train_stacked = np.column_stack([rf_pred_val, et_pred_val, svm_pred_val])  # one column of predicted labels per classifier
y_train_stacked = y_val

In [None]:
from sklearn.linear_model import LogisticRegression

blender = LogisticRegression(random_state=42)
blender.fit(X_train_stacked, y_train_stacked)

ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [None]:
rf_pred_test = rf_clf.predict(X_test)
et_pred_test = et_clf.predict(X_test)
svm_pred_test = svm_clf.predict(X_test)

X_test_stacked = np.column_stack([rf_pred_test, et_pred_test, svm_pred_test])  # same column order as the blender's training set

blender_pred = blender.predict(X_test_stacked)

print("Stacking Ensemble accuracy:", accuracy_score(y_test, blender_pred))

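Stacking Ensemble accuracy: 0.9507

> The stacking ensemble reaches 95.07% on the test set, which is worse than the soft voting classifier (97.35%) and worse than each individual classifier. Two likely causes: the blender is fed hard class labels, which Logistic Regression treats as ordinal numbers, and its solver stopped before converging (see the warning above).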
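
The blender above receives predicted labels as plain numbers. One possible variation, sketched below, feeds it each classifier's class probabilities instead and raises `max_iter` so the solver has room to converge. This is an alternative under stated assumptions, not the approach used in this notebook: it runs on scikit-learn's small 8x8 `load_digits` dataset instead of MNIST, uses only two base classifiers to stay fast, and all variable names are illustrative. No claim is made about how it would score on MNIST.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small 8x8 digits dataset, standing in for MNIST to keep the sketch fast.
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_d, y_d, test_size=0.5, stratify=y_d, random_state=42)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

base_clfs = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    ExtraTreesClassifier(n_estimators=50, random_state=42),
]
for clf in base_clfs:
    clf.fit(X_tr, y_tr)

def proba_features(X):
    """Concatenate each base classifier's class probabilities column-wise."""
    return np.hstack([clf.predict_proba(X) for clf in base_clfs])

# The blender is trained on the validation set, as in the exercise.
demo_blender = LogisticRegression(max_iter=1000, random_state=42)
demo_blender.fit(proba_features(X_va), y_va)

print(proba_features(X_te).shape)  # (n_test_samples, 20): 2 models x 10 classes
print(demo_blender.score(proba_features(X_te), y_te))
```

scikit-learn also ships `sklearn.ensemble.StackingClassifier`, which builds the blender's training set from cross-validated predictions rather than a single hold-out set.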
