## Matt Viteri & Yu Mo

In [10]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

### Problem 1: Multi-class Classification – MNIST

In this exercise you will explore the MNIST data set, which you will find here: https://www.openml.org/d/554. MNIST is a data set of handwritten digits, and is considered one of the “easiest” image recognition problems in computer vision.

* Use the fetch_openml command from sklearn.datasets to import the MNIST data set

* Use Random Forests to try to get the best possible test accuracy on MNIST. This involves getting acquainted with how Random Forests work, understanding their parameters, and then using Cross Validation to find the best settings. How well can you do? You should use the accuracy metric, since this is what you used in Lab 5; this will allow you to compare your results from Random Forests with your results from L1- and L2-Regularized Logistic Regression. What are the hyperparameters of your best model?

* Use Boosting to do the same. Take the time to understand how XGBoost works (and/or other boosting packages available). Try your best to tune your hyper-parameters. As added motivation: typically the winners and near-winners of the Kaggle competition are those that are best able to tune and cross-validate XGBoost. What are the hyperparameters of your best model?

* (Optional) Run multi-class logistic regression on these using the cross entropy loss. You may have to play around with the hyperparameters (especially the tolerance) to get it to converge in a reasonable amount of time. I recommend the SAGA solver. Try to optimize the hyperparameters. Report your training and test loss from above.

* (Optional) Choose an l1 regularizer (penalty), and see if you can get a sparse solution with almost as good accuracy.

* (Optional) Note that in Logistic Regression, the coefficients returned (i.e., the β’s) are the same dimension as the data. Therefore we can pretend that the coefficients of the solution are an image of the same dimension, and plot it. Do this for the 10 sets of coefficients that correspond to the 10 classes. You should observe that, at least for the sparse solutions, these “kind of” look like the digits they are classifying.
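The last optional item, viewing each class's coefficients as an image, can be sketched roughly as follows. This is only an illustration and not part of the solution in this notebook: it uses the small 8x8 digits set bundled with scikit-learn as a stand-in for MNIST, and the values of `C`, `tol` and `max_iter` are untuned assumptions. It also assumes a recent scikit-learn release, where the `saga` solver fits a multinomial model by default.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Stand-in data: 8x8 digits bundled with scikit-learn (MNIST would be 28x28).
X_small, y_small = load_digits(return_X_y=True)
X_small = StandardScaler().fit_transform(X_small)

# L1-penalized logistic regression; C, tol and max_iter are illustrative, not tuned.
clf = LogisticRegression(penalty='l1', solver='saga', C=0.05,
                         tol=1e-2, max_iter=200, random_state=0)
clf.fit(X_small, y_small)

# coef_ has one row per class and one column per pixel,
# so each row can be reshaped back into an image.
side = int(np.sqrt(X_small.shape[1]))
coef_images = clf.coef_.reshape(len(clf.classes_), side, side)

# Fraction of coefficients that the L1 penalty set exactly to zero.
sparsity = float(np.mean(clf.coef_ == 0))
print(coef_images.shape, sparsity)
```

Each `coef_images[k]` can then be passed to `matplotlib.pyplot.imshow` to view the weights for class `k`; for MNIST the same reshape would use a side length of 28.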

**Fetch openml**

In [2]:
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)

**Random Forest Accuracy**

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [4]:
rf = RandomForestClassifier(random_state=123)

# check hyperparameters
rf.get_params()

{'bootstrap': True,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 'warn',
 'n_jobs': None,
 'oob_score': False,
 'random_state': 123,
 'verbose': 0,
 'warm_start': False}

In [5]:
hyperparameters = {
    # for classifiers, 'auto' means sqrt(n_features), the same as 'sqrt'
    'max_features': ['auto', 'sqrt', 0.33],
    'min_samples_leaf': [1, 3, 5, 10]
}

In [6]:
# perform cross-validation with hyperparameters
model = GridSearchCV(rf, hyperparameters, cv=10, n_jobs=-1)

# fit model
model.fit(X_train, y_train)



GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators='warn', n_jobs=None,
            oob_score=False, random_state=123, verbose=0, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'max_features': ['auto', 'sqrt', 0.33], 'min_samples_leaf': [1, 3, 5, 10]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
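The cells above fit the scaler once on the full training set before cross-validating, so each validation fold has already influenced the scaling. One way to avoid this is to put the scaler inside a pipeline, using the `make_pipeline` helper that is imported at the top but not otherwise used. The sketch below shows the idea on the small bundled digits set; `n_estimators=20`, `cv=3` and the reduced grid are assumptions chosen only to keep the example fast. For tree-based models the practical difference is likely small.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data so the sketch runs quickly and offline.
X_small, y_small = load_digits(return_X_y=True)

pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=20, random_state=123),
)

# make_pipeline names each step after its lowercased class name,
# so grid keys take the form '<step>__<parameter>'.
pipe_grid = {
    'randomforestclassifier__max_features': ['sqrt', 0.33],
    'randomforestclassifier__min_samples_leaf': [1, 5],
}

# The scaler is now refit on the training part of every fold.
pipe_search = GridSearchCV(pipe, pipe_grid, cv=3)
pipe_search.fit(X_small, y_small)
print(pipe_search.best_params_)
```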

**This is the mean cross-validated accuracy of the best parameter setting, computed on the training data only. The test accuracy is calculated separately below, from predictions on the held-out test set.**

In [7]:
model.best_score_

0.9489464285714285
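To see where `best_score_` comes from, the sketch below recomputes it from the per-fold scores stored in `cv_results_`. It uses the small bundled digits set and a two-value grid as stand-ins (assumptions chosen for speed). In recent scikit-learn releases the score is the plain mean over folds; the older release used in this notebook may instead weight folds by their size under the `iid` setting.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X_small, y_small = load_digits(return_X_y=True)

n_folds = 3
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=123),
    {'min_samples_leaf': [1, 5]},
    cv=n_folds,
)
search.fit(X_small, y_small)

results = search.cv_results_
best = search.best_index_

# Validation accuracy of the winning parameter setting on each fold.
fold_scores = np.array(
    [results[f'split{i}_test_score'][best] for i in range(n_folds)]
)

print(fold_scores, fold_scores.mean(), search.best_score_)
```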

**Test accuracy score**

In [8]:
pred = model.predict(X_test)
accuracy_score(y_test, pred)

0.9544285714285714

**Best hyperparameters**

In [9]:
model.best_params_

{'max_features': 0.33, 'min_samples_leaf': 1}

**XGBoost Accuracy**

In [21]:
xgb_model = xgb.XGBClassifier(objective='multi:softmax', num_class=10)
xgb_model.get_params()

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'multi:softmax',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 1,
 'verbosity': 1,
 'num_class': 10}

In [65]:
# subset the data (first 5000 rows, first 350 features) due to
# time/memory constraints before running the xgb model

t = X_train[:5000, :350]
t.shape

y2 = y_train[:5000]
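Assuming the pixels are stored row by row, keeping the first 350 columns retains only roughly the top 12.5 rows of each 28x28 image. An alternative way to cut the cost, shown here only as a sketch, is to keep every pixel and draw a class-balanced random subsample of rows with `train_test_split`. The arrays below are random stand-ins with MNIST-like shape, and the names `X_demo`, `y_demo`, `X_sub` and `y_sub` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in arrays with MNIST-like width (not the real data).
rng = np.random.RandomState(0)
X_demo = rng.rand(6000, 784)
y_demo = rng.randint(0, 10, size=6000)

# Keep all 784 columns and draw 1000 rows, preserving class proportions.
X_sub, _, y_sub, _ = train_test_split(
    X_demo, y_demo, train_size=1000, stratify=y_demo, random_state=0
)

print(X_sub.shape, np.bincount(y_sub))
```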

In [63]:
parameters = {
  'max_depth': [3, 6],
  'n_estimators': [100, 200]
}

# perform cross-validation on xgb with hyperparameters
x_model = GridSearchCV(xgb_model, parameters, cv=10, n_jobs=-1)

# fit model
x_model.fit(t, y2)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=1, gamma=0, learning_rate=0.1,
       max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
       n_estimators=100, n_jobs=1, nthread=None, num_class=10,
       objective='multi:softmax', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=1, seed=None, silent=None,
       subsample=1, verbosity=1),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'max_depth': [3, 6], 'n_estimators': [100, 200]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

**Cross-validated accuracy score**

In [67]:
x_model.best_score_

0.8586

**Test accuracy score**

In [72]:
x_pred = x_model.predict(X_test[:5000, :350])
accuracy_score(y_test[:5000], x_pred)

0.8702

**Best hyperparameters**

In [73]:
x_model.best_params_

{'max_depth': 6, 'n_estimators': 200}

### Problem 2: CIFAR-10

In this problem you will explore the data set CIFAR-10, just as you did above for MNIST. Now that you have your pipeline set up, it should be easy to apply the above procedure to CIFAR-10. If you did something that takes significant computation time, keep in mind that CIFAR-10 is a few times larger.

* (Optional) You can read about the CIFAR-10 and CIFAR-100 data sets here: https://www.cs.toronto.edu/~kriz/cifar.html.

* (Optional) OpenML curates a number of data sets. You will use a subset of CIFAR-10 provided by them. Read here for a description: https://www.openml.org/d/40926.

* Use the fetch_openml command from sklearn.datasets to import the CIFAR-10-Small data set.

* Figure out how to display some of the images in this data set, and display a couple.  While not high resolution, these should be recognizable if you are doing it correctly.

* What is the best accuracy you can get on the test data, by tuning Random Forests?  What are the hyperparameters of your best model?

* What is the best accuracy you can get on the test data, by tuning XGBoost?  What are the hyperparameters of your best model?

* (Optional) You will run multi-class logistic regression on these using the cross entropy loss. You have to specify this explicitly (multi_class='multinomial'). Use cross validation to see how good your accuracy can be. In this case, cross validate to find as good regularization coefficients as you can, for `l1` & `l2` regularization (called penalties), which are natively supported in sklearn.linear_model.LogisticRegression. As with MNIST, I recommend you use the solver saga.

* (Optional) Report your training and test loss from above.

* (Optional) How sparse can you make your solutions without deteriorating your testing error too much? Here, I am asking you to try to obtain a sparse solution that has test accuracy that is close to the best solution you found.
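For the image-display item in the list above, a minimal sketch is given below. It assumes each row holds 3072 values laid out as all red pixels, then all green, then all blue, with each channel stored row by row; that layout should be checked against the OpenML description. The helper `row_to_image` and the test row `fake_row` are hypothetical names introduced for illustration.

```python
import numpy as np

def row_to_image(row, side=32, channels=3):
    """Reshape one flat CIFAR-style row into a (side, side, channels) image."""
    row = np.asarray(row).astype(np.uint8)
    # Assumed layout: all red values, then all green, then all blue,
    # each channel stored row by row.
    return row.reshape(channels, side, side).transpose(1, 2, 0)

# Hypothetical row encoding a pure red image, to check the channel order.
fake_row = np.concatenate([np.full(1024, 255), np.zeros(1024), np.zeros(1024)])
img = row_to_image(fake_row)
print(img.shape, img[0, 0])
```

With the real data, fetched for example with `fetch_openml(data_id=40926, return_X_y=True)` and converted to a NumPy array, `matplotlib.pyplot.imshow(row_to_image(X[0]))` should show a recognizable picture if the assumed layout is right.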