# Ensemble Learning
You should build an end-to-end machine learning pipeline using an ensemble learning model. In particular, you should do the following:
- Load the `mnist` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Build an end-to-end machine learning pipeline, including an ensemble model, such as [random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) or [gradient boosting](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html).
- Optimize your pipeline by cross-validating your design decisions. 
- Test the best pipeline on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.
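
The steps listed above can be sketched end to end in a few lines. This is only an illustrative outline, not the notebook's solution. It assumes scikit-learn's built-in `load_digits` data as a small stand-in for `mnist.csv` so that it runs on its own. The `VarianceThreshold` step, which drops constant pixels, and the tiny grid are hypothetical choices made for the example.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Assumption: load_digits (8x8 digit images) stands in for mnist.csv
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Preprocessing and model live in one estimator, so each CV fold
# refits the preprocessing on its own training part only
pipe = Pipeline([
    ("drop_constant", VarianceThreshold()),
    ("model", RandomForestClassifier(random_state=1)),
])

# Step-prefixed names route each hyperparameter to the right pipeline step
param_grid = {
    "model__n_estimators": [25, 50],
    "model__criterion": ["gini", "entropy"],
}

# Cross-validate on the training split only; the test split stays unseen
search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the refitted best pipeline once on the held-out test set
test_accuracy = accuracy_score(y_test, search.predict(X_test))
print(search.best_params_)
print(f"test accuracy: {test_accuracy:.3f}")
```

Wrapping the preprocessing in the pipeline means the grid search tunes and validates the whole chain rather than the classifier alone.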

In [1]:
import warnings

import pandas as pd
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, roc_auc_score,
                             roc_curve)
from sklearn.model_selection import RandomizedSearchCV

# Silence library warnings to keep the cell outputs short
warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("../../datasets/mnist.csv")
df = df.set_index("id")
df.head()

id,class,pixel1,pixel2,pixel3,pixel4,pixel5,pixel6,pixel7,pixel8,pixel9,...,pixel775,pixel776,pixel777,pixel778,pixel779,pixel780,pixel781,pixel782,pixel783,pixel784
31953,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
34452,8,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
60897,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
36953,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1981,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [3]:
df.shape

(4000, 785)

In [4]:
x = df.drop('class', axis=1)
y = df['class'] 

In [5]:
from sklearn.model_selection import train_test_split

# Split the dataset into a training set (70%) and a test set (30%)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=1)


In [6]:
Model = []
Accuracy = []
F1Score = []

In [7]:
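from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid for the random forest
params = {
    'n_estimators': [50, 100, 150, 200],
    'criterion': ['gini', 'entropy'],
}

ran_for = RandomForestClassifier()

# Macro-averaged recall: the plain 'recall' scorer is defined for binary
# targets only and cannot score the 10-class digit labels.
# The search is fit on the training split so the test set stays unseen.
gs = GridSearchCV(estimator=ran_for, param_grid=params, cv=3, scoring='recall_macro', n_jobs=-1)
gs.fit(x_train, y_train)


GridSearchCV(cv=3, estimator=RandomForestClassifier(), n_jobs=-1,
             param_grid={'criterion': ['gini', 'entropy'],
                         'n_estimators': [50, 100, 150, 200]},
             scoring='recall_macro')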

In [8]:
gs.best_params_

{'criterion': 'gini', 'n_estimators': 50}
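
Before trusting `best_params_`, it can help to look at the per-candidate scores in `cv_results_`. If every `mean_test_score` is NaN (for example because the scorer does not support the target type), the reported "best" setting may simply be the first candidate in the grid. The sketch below shows one way to check this. It assumes `load_digits` as a small stand-in dataset and a reduced, hypothetical grid so that it runs quickly on its own.

```python
import pandas as pd
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumption: load_digits stands in for the notebook's MNIST sample
X, y = load_digits(return_X_y=True)

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {"n_estimators": [10, 30], "criterion": ["gini", "entropy"]},
    cv=3,
    scoring="recall_macro",  # a multiclass-capable scorer
)
search.fit(X, y)

# One row per candidate, best-ranked first
columns = ["param_criterion", "param_n_estimators",
           "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results)

# Sanity check: every candidate should have a finite score
all_finite = results["mean_test_score"].notna().all()
print("all scores finite:", all_finite)
```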

In [9]:
ran_for = RandomForestClassifier(**gs.best_params_)
ran_for.fit(x_train,y_train)
y_pred_rf = ran_for.predict(x_test)
y_pred_rf1 = ran_for.predict(x_train)

In [10]:
print("Train score: {}%".format( 100 * accuracy_score(y_train, y_pred_rf1)))
print()
print("f1 score {}%".format( 100 * f1_score(y_test, y_pred_rf, average=None)))
print()
print("accuracy score {}%".format( 100 * accuracy_score(y_test, y_pred_rf)))
print()


Train score: 100.0%

f1 score [95.2        92.66409266 88.0733945  91.72932331 91.4893617  87.17948718
 92.74193548 94.96402878 91.21338912 90.        ]%

accuracy score 91.66666666666666%
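
The task also asks for several evaluation metrics. Beyond per-class F1 and accuracy, a classification report and a confusion matrix show which digits are confused with one another. The following is a minimal sketch, again assuming `load_digits` as a stand-in dataset and a 50-tree forest. The numbers it prints are not the notebook's results.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Assumption: load_digits stands in for the notebook's MNIST sample
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

model = RandomForestClassifier(n_estimators=50, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# A single summary number: the unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_test, y_pred, average="macro")
print(f"macro F1: {macro_f1:.3f}")

# Precision, recall, F1 and support for each digit class
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```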



In [11]:


gb_params = {
    'n_estimators': 1500,
    'learning_rate': 0.25,
    'max_depth': 4,
    'min_samples_leaf': 2,
    'subsample': 1,
    'max_features': 'sqrt',  # number of features considered at each split
    'verbose': 0
}

 

In [12]:
from sklearn.ensemble import GradientBoostingClassifier

In [13]:


grad_boost = GradientBoostingClassifier(**gb_params)
grad_boost.fit(x_train, y_train)
grad_boost_pred = grad_boost.predict(x_test)
grad_boost_pred1 = grad_boost.predict(x_train)



In [14]:
print("Train score: {}%".format( 100 * accuracy_score(y_train, grad_boost_pred1)))
print()
print("f1 score {}%".format( 100 * f1_score(y_test, grad_boost_pred, average=None)))
print()
print("accuracy score {}%".format( 100 * accuracy_score(y_test, grad_boost_pred)))
print()

Train score: 100.0%

f1 score [98.30508475 95.58232932 92.01877934 93.48659004 94.2408377  91.2
 95.16129032 94.07665505 95.12195122 91.32420091]%

accuracy score 94.08333333333333%
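
A 100% training score alongside a lower test score suggests that not all boosting stages may be needed. One way to explore this is `staged_predict`, a method of `GradientBoostingClassifier` that yields predictions after each boosting stage. The sketch below assumes `load_digits` as a stand-in dataset, a separate validation split, and only 40 stages to keep the run short. It is not a tuned configuration.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assumption: load_digits stands in for the notebook's MNIST sample
X, y = load_digits(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, random_state=1
)

gb = GradientBoostingClassifier(
    n_estimators=40,       # far fewer stages than the notebook, for speed
    learning_rate=0.25,
    max_depth=4,
    max_features="sqrt",
    random_state=1,
)
gb.fit(X_train, y_train)

# Validation accuracy after 1, 2, ..., n_estimators boosting stages
val_scores = [
    accuracy_score(y_val, stage_pred)
    for stage_pred in gb.staged_predict(X_val)
]

# Smallest number of stages that reaches the best validation accuracy
best_n = val_scores.index(max(val_scores)) + 1
print(f"best validation accuracy {max(val_scores):.3f} at {best_n} stages")
```

Choosing the stage count on a validation split rather than the test set keeps the final test score an honest estimate.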

