# Multiclass Classifiers
In this assignment you will load a dataset, train two models to perform multiclass classification, and compare their results. The dataset is the digits dataset available from sklearn's datasets module. It contains 1797 samples of handwritten digits. The goal is to correctly identify the digits 0 through 9.

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report, confusion_matrix

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

%matplotlib inline

## Load the data

1. Import the *load_digits* function from the *sklearn.datasets* module.
2. Invoke *load_digits* with the *return_X_y* parameter set to True and store the returned data in the variables **X** and **y**.

In [2]:
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

#dataset = load_digits()
#X, y = dataset.data, dataset.target
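For exploration it can help to keep the whole dataset object rather than only `X` and `y`. The following is a minimal sketch, assuming the default `load_digits()` call, which returns a Bunch whose `images` attribute stores each sample as an 8x8 array. The names `digits` and `flattened` are illustrative and do not appear in the assignment.

```python
from sklearn.datasets import load_digits
import numpy as np

# Load the full Bunch object instead of only (X, y)
digits = load_digits()

# Each sample is an 8x8 image; `data` holds the same pixels flattened to 64 features
print(digits.images.shape)
print(digits.data.shape)

# Flattening the images reproduces the feature matrix used as X
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))
```

This makes it clear where the 64 features come from: they are the 64 pixel intensities of each image.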


## Exploratory Data Analysis
Perform a few exploratory steps, including:

1. Display the number of rows of data returned.
2. Display the number of features in the dataset.
3. Use NumPy's **bincount** to display how many samples belong to each class. Is this a balanced dataset?

In [3]:
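# The target variable has 10 classes that are roughly balanced

# np.bincount returns one count per label, so enumerate pairs each label with its count
for class_label, class_count in enumerate(np.bincount(y)):
    print(class_label, class_count)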


0 178
1 182
2 177
3 183
4 181
5 182
6 181
7 179
8 174
9 180
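One way to answer the balance question more precisely is to turn the counts into proportions and compare the largest class with the smallest. The sketch below reuses the counts printed above; the variable names and the idea of a largest-to-smallest ratio are illustrative choices, not part of the assignment.

```python
import numpy as np

# Class counts copied from the output above (digits 0 through 9)
counts = np.array([178, 182, 177, 183, 181, 182, 181, 179, 174, 180])

# Share of the dataset held by each class
proportions = counts / counts.sum()

# Ratio of the largest class to the smallest; 1.0 would mean perfectly balanced
imbalance_ratio = counts.max() / counts.min()

print(np.round(proportions, 3))
print('Largest / smallest class: {:.3f}'.format(imbalance_ratio))
```

A ratio close to 1 supports treating the dataset as balanced, which in turn makes plain accuracy a reasonable metric for the models that follow.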


In [4]:
print('The number of rows in the dataset is {:d}'.format(X.shape[0]))
print('The number of features in the dataset is {:d}'.format(X.shape[1]))

The number of rows in the dataset is 1797
The number of features in the dataset is 64


## Prepare training and testing data
1. Use *train_test_split* to split the dataset into a training set and a test set. Set the proportion of test data to 20%. Set a random state value so that the results will be repeatable.

In [5]:
# Split the data into a training set (80%) and a test set (20%)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
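Because the classes are only approximately balanced, a stratified split can keep the class proportions the same in both subsets. The sketch below is an optional variation on the cell above, not the assignment's required solution. The only change is the `stratify=y` argument of `train_test_split`.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import numpy as np

X, y = load_digits(return_X_y=True)

# stratify=y keeps the class proportions the same in both splits
# (an optional addition, not part of the original cell)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Check the split sizes and the per-class counts in each subset
print(X_train.shape, X_test.shape)
print(np.bincount(y_train))
print(np.bincount(y_test))
```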

## Cross validation with Logistic Regression
In this step you will create a LogisticRegression classifier and evaluate it with 5-fold cross validation.

1. Import the *LogisticRegression* classifier from sklearn.
2. Instantiate a LogisticRegression classifier with the 'lbfgs' solver and the 'ovr' multiclass strategy. You may have to set the maximum number of iterations to 1000.
3. Perform cross validation on the model.
4. Print the cross validation scores and their mean.

In [6]:
# using OvR multiclass logistic regression

lr_clf = LogisticRegression(solver='lbfgs', multi_class='ovr', max_iter=1000, random_state=20)
lr_cv_scores = cross_val_score(lr_clf, X_train, y_train, cv=5)

print('Accuracy scores for the 5 folds: ', lr_cv_scores)
print('Mean cross validation score: {:.3f}'.format(np.mean(lr_cv_scores)))


Accuracy scores for the 5 folds:  [0.96527778 0.95486111 0.94773519 0.95121951 0.91637631]
Mean cross validation score: 0.947
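Recent scikit-learn releases deprecate the `multi_class` argument of `LogisticRegression` (from version 1.5 onward, as far as the release notes indicate), so the cell above may emit a warning or fail on newer installations. A possible alternative, sketched here under that assumption, is to wrap the classifier in `OneVsRestClassifier` from `sklearn.multiclass`. The names `ovr_clf` and `ovr_cv_scores` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
import numpy as np

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Wrap a binary logistic regression so that one classifier is fit per digit
ovr_clf = OneVsRestClassifier(LogisticRegression(solver='lbfgs', max_iter=1000))

# 5-fold cross validation on the training split, as in the cell above
ovr_cv_scores = cross_val_score(ovr_clf, X_train, y_train, cv=5)

print('Accuracy scores for the 5 folds: ', ovr_cv_scores)
print('Mean cross validation score: {:.3f}'.format(np.mean(ovr_cv_scores)))
```

The wrapper expresses the one-vs-rest strategy explicitly, so the scores should be close to those above, though small numerical differences are possible.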


## Cross validation with RandomForest
Perform the same steps as above but this time with a RandomForestClassifier.

In [7]:
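# Note: this cell cross-validates on the full X, y rather than X_train, y_train,
# and sets no random_state, so the scores vary between runs and are not
# directly comparable with the logistic regression scores above.
rf_clf = RandomForestClassifier(n_estimators=24)
rf_cv_scores = cross_val_score(rf_clf, X, y, cv=5)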

print('Accuracy scores for the 5 folds: ', rf_cv_scores)
print('Mean cross validation score: {:.3f}'.format(np.mean(rf_cv_scores)))


Accuracy scores for the 5 folds:  [0.925      0.88333333 0.95821727 0.97214485 0.91364903]
Mean cross validation score: 0.930
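For a like-for-like comparison with the logistic regression step, the random forest can be cross-validated on the training split only and given a fixed seed. The sketch below also evaluates the fitted model once on the held-out test set using `classification_report` and `confusion_matrix`, which the first cell imports. The seed `random_state=0` and the names `y_pred` and `cm` are arbitrary, illustrative choices.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split, cross_val_score
import numpy as np

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# random_state=0 is an arbitrary seed that makes the forest repeatable
rf_clf = RandomForestClassifier(n_estimators=24, random_state=0)

# Cross-validate on the training split only, matching the logistic regression cell
rf_cv_scores = cross_val_score(rf_clf, X_train, y_train, cv=5)
print('Mean cross validation score: {:.3f}'.format(np.mean(rf_cv_scores)))

# Fit on the training split and evaluate once on the held-out test split
rf_clf.fit(X_train, y_train)
y_pred = rf_clf.predict(X_test)

# Rows are true digits, columns are predicted digits
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

Keeping the test set out of cross validation means the final report reflects data that neither model selection nor training has seen.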
