# An Ensemble Learner

In [2]:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.svm import SVC

In [3]:
from sklearn.model_selection import train_test_split

In [4]:
gdb_clf = GradientBoostingClassifier()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

## Use scikit-learn to create a synthetic dataset for either classification or regression with 20,000 instances.

In [5]:
X, y = make_classification(n_samples=20000, n_features=20, random_state=91)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=91)
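
The `random_state` passed to `make_classification` seeds the data generator, so each value produces a different synthetic dataset, not a different view of the same one. The sketch below illustrates this under assumptions that are not part of the notebook: a smaller sample size (2,000) to keep it fast, a single `RandomForestClassifier` with 50 trees, and a hypothetical helper named `score_for_seed`. The exact scores will differ from the ones reported later.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def score_for_seed(seed, n_samples=2000):
    """Generate a dataset with the given seed and return test accuracy.

    Hypothetical helper for illustration; sizes are reduced for speed.
    """
    # A new seed means a new dataset, not just a new train/test split.
    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)

    # The model's own seed is fixed so only the data changes between runs.
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_train, y_train)
    return clf.score(X_test, y_test)


seed_scores = {seed: score_for_seed(seed) for seed in (0, 42, 91)}
for seed, score in seed_scores.items():
    print(f"random_state={seed}: test accuracy {score:.3f}")
```

Because each seed defines a different problem, a higher score for one seed suggests that dataset is easier to separate, not that the classifier itself improved.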

## Train an ensemble learner on the above dataset using three different machine learning algorithms.

In [6]:
voting_clf = VotingClassifier(
    estimators=[('gb', gdb_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard'
)
# Train the ensemble learner on the training data
voting_clf.fit(X_train, y_train)

VotingClassifier(estimators=[('gb', GradientBoostingClassifier()),
                             ('rf', RandomForestClassifier()), ('svc', SVC())])

## Evaluate the ensemble learner on the training data and test data.

In [10]:
from sklearn.metrics import accuracy_score

# Fit each base classifier and the voting ensemble, then report test-set accuracy
for clf in (gdb_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

GradientBoostingClassifier 0.974
RandomForestClassifier 0.9756
SVC 0.9726
VotingClassifier 0.9754
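
The cell above reports test accuracy only. To compare training and test performance, one option is to score the fitted ensemble on both splits. This is a minimal, self-contained sketch that assumes a smaller dataset (2,000 samples) so it runs quickly. The variable names `train_acc` and `test_acc` are introduced here for illustration, and the numbers it prints will not match the outputs above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Smaller dataset than the notebook's 20,000 instances (assumption, for speed).
X, y = make_classification(n_samples=2000, n_features=20, random_state=91)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=91)

voting_clf = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier()),
        ("rf", RandomForestClassifier()),
        ("svc", SVC()),
    ],
    voting="hard",
)
voting_clf.fit(X_train, y_train)

# Score the same fitted ensemble on both splits.
train_acc = accuracy_score(y_train, voting_clf.predict(X_train))
test_acc = accuracy_score(y_test, voting_clf.predict(X_test))
print(f"train accuracy: {train_acc:.4f}")
print(f"test accuracy:  {test_acc:.4f}")
```

A large gap between the two numbers would suggest overfitting, while similar values suggest the ensemble generalizes about as well as it fits.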


## Key observations
1. When I set `random_state` to 0 when creating `X` and `y`, the models' accuracy scores average around 0.89.
2. When I set it to 42, the scores are around 0.91.
3. The highest accuracy scores I found came with `random_state=91`, where the `VotingClassifier` scored 0.9754 on the test set. Changing `random_state` in `make_classification` generates a different dataset, so these differences reflect how hard each generated dataset is, not an improvement in the models.
4. I used three different algorithms. The Gradient Boosting Classifier combines weak learners into a strong learner, with each new learner trying to correct the errors of its predecessor. The Random Forest Classifier is an ensemble (group) of decision trees trained with bagging and/or pasting. The Support Vector Machine classifier is known to be effective in high-dimensional spaces. All three are classification algorithms because my dataset, generated with `make_classification`, is a classification dataset.
5. I used hard voting, in which each classifier votes for a class on every sample and the ensemble predicts the class that gets the most votes. Among the individual models, the Random Forest Classifier had the highest test accuracy (0.9756), slightly above the voting ensemble (0.9754).
6. My conclusion is that my model is okay. I would personally prefer an accuracy of 99.5% or above, but I can work with this for now.
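
To make the hard-voting rule in observation 5 concrete, here is a small sketch that computes a majority vote by hand. The prediction arrays are made-up values for three hypothetical classifiers on five samples, and `hard_vote` is an illustrative helper, not scikit-learn's internal implementation.

```python
import numpy as np

# Made-up binary predictions: one row per classifier, one column per sample.
preds = np.array([
    [0, 1, 1, 0, 1],  # hypothetical classifier A
    [0, 1, 0, 0, 1],  # hypothetical classifier B
    [1, 1, 0, 0, 0],  # hypothetical classifier C
])


def hard_vote(predictions):
    """Return the most-voted class label for each sample (column)."""
    n_classes = int(predictions.max()) + 1
    # votes[c, i] = number of classifiers that predicted class c for sample i
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    # argmax picks the class with the most votes; ties go to the lowest label
    return votes.argmax(axis=0)


majority = hard_vote(preds)
print(majority)  # [0 1 0 0 1]
```

The vote is taken per sample, so the ensemble can disagree with any single classifier on some samples while agreeing with the majority on each one.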

## Submit the Jupyter notebook.