**Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 50,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?**

### Load the MNIST data

In [None]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, as_frame=False)

### Split it into a training set, a validation set, and a test set

In [None]:
from sklearn.model_selection import train_test_split

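# Hold out 10,000 instances for testing, then 10,000 of the remaining
# 60,000 for validation, which leaves 50,000 for training.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    mnist.data, mnist.target, test_size=10000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=10000, random_state=42)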
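
A quick check of the split sizes can catch mistakes in a two-step split like the one above. The sketch below is only an illustration: it assumes scikit-learn's small bundled `load_digits` dataset as a stand-in for MNIST (so it runs without a download) and uses hypothetical hold-out sizes of 300 instead of 10,000.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): 1,797 8x8 digit images, not MNIST.
X, y = load_digits(return_X_y=True)

# Split off the test set first, then split the remainder into
# training and validation sets.
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=300, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=300, random_state=42)

# The three sets should be disjoint and add up to the full dataset.
print(len(X_train), len(X_val), len(X_test))
```

The same `len` checks applied to the MNIST split should report 50,000, 10,000, and 10,000.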

### Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM.

#### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rfc.fit(X_train, y_train)

y_pred_rfc = rfc.predict(X_val)

In [None]:
from sklearn.metrics import accuracy_score

accuracy_score(y_val, y_pred_rfc)

0.9681

#### Extra-Trees Classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

etc = ExtraTreesClassifier(n_estimators=100, n_jobs=-1)
etc.fit(X_train, y_train)
y_pred_etc = etc.predict(X_val)

accuracy_score(y_val, y_pred_etc)

0.9716

#### SVM

In [None]:
from sklearn.svm import SVC

svm = SVC()
svm.fit(X_train, y_train)
y_pred_svm = svm.predict(X_val)

accuracy_score(y_val, y_pred_svm)

0.9788

### Try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier

In [None]:
from sklearn.ensemble import VotingClassifier
# voting="hard" is the default: each classifier casts one vote per instance.
vc = VotingClassifier([
    ("RandomForestClf", rfc),
    ("ExtraTreesClf", etc),
    ("SVMClf", svm),
])

In [None]:
vc.fit(X_train, y_train)

VotingClassifier(estimators=[('RandomForestClf',
                              RandomForestClassifier(n_jobs=-1)),
                             ('ExtraTreesClf', ExtraTreesClassifier(n_jobs=-1)),
                             ('SVMClf', SVC())])

In [None]:
y_pred_vc = vc.predict(X_val)

accuracy_score(y_val, y_pred_vc)

0.9747
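
The exercise also allows soft voting, which averages predicted class probabilities instead of counting votes. The sketch below shows one way to compare the two modes; it is not the notebook's own code. It assumes the small bundled `load_digits` dataset as a stand-in for MNIST, and it sets `probability=True` on `SVC`, because soft voting needs `predict_proba`. The helper `make_estimators` and the short estimator names are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): the bundled digits dataset, not MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=300, random_state=42)


def make_estimators():
    """Return fresh, unfitted estimators for one voting ensemble."""
    return [
        ("rf", RandomForestClassifier(n_estimators=100, random_state=42)),
        ("et", ExtraTreesClassifier(n_estimators=100, random_state=42)),
        # probability=True makes SVC expose predict_proba (slower to train).
        ("svm", SVC(probability=True, random_state=42)),
    ]


# Fit one ensemble per voting mode and record its validation accuracy.
scores = {}
for voting in ("hard", "soft"):
    clf = VotingClassifier(make_estimators(), voting=voting)
    clf.fit(X_train, y_train)
    scores[voting] = accuracy_score(y_val, clf.predict(X_val))

print(scores)
```

Which mode wins depends on the data and the estimators, so the comparison is worth rerunning on the real MNIST split before drawing conclusions.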

### Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

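Accuracy on the validation set: \
RandomForestClf: 96.81% \
ExtraTreesClf: 97.16% \
SVMClf: 97.88% \
Hard-voting ensemble: 97.47%

On the validation set, the ensemble beats both tree-based classifiers but not the SVM.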

In [None]:
y_final = vc.predict(X_test)

accuracy_score(y_test, y_final)

0.9702
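
To answer "how much better" fairly, the individual classifiers should be scored on the test set as well, since the per-classifier figures quoted in this section come from the validation set. The sketch below shows the pattern under stated assumptions: it uses the bundled `load_digits` dataset as a stand-in for MNIST and a hypothetical hold-out of 300 instances, so its numbers are not comparable to the ones in this notebook.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stand-in data (assumption): the bundled digits dataset, not MNIST.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=300, random_state=42)

estimators = [
    ("RandomForestClf", RandomForestClassifier(n_estimators=100, random_state=42)),
    ("ExtraTreesClf", ExtraTreesClassifier(n_estimators=100, random_state=42)),
    ("SVMClf", SVC()),
]
# Hard voting by default; VotingClassifier fits its own clones of the estimators.
voting = VotingClassifier(estimators)

# Fit every model on the same training data and score it on the same test set.
test_scores = {}
for name, clf in estimators + [("VotingClf", voting)]:
    clf.fit(X_train, y_train)
    test_scores[name] = accuracy_score(y_test, clf.predict(X_test))

for name, score in test_scores.items():
    print(f"{name}: {score:.2%}")
```

In the notebook itself, `rfc`, `etc`, and `svm` are already fitted, so calling `accuracy_score(y_test, clf.predict(X_test))` on each of them would give the corresponding test-set figures.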

The reference code reached 97.06%, while this code reached 97.02% on the test set. Note, however, that the reference code used `LinearSVC`, which is less accurate than `SVC` on this task (although `SVC` scales very poorly as the number of training instances grows).