# Regression Models - MNIST Dataset

This dataset is presented, together with an exploratory analysis, in another notebook of this repository: [An Unsupervised Approach to MNIST](https://github.com/LeviGuerra/Machine-Learning-Portfolio/blob/master/Codes_and_Datasets/06_An-Unsupervised-Approach-to-MNIST.ipynb).

Here the MNIST problem is tackled with supervised methods. Concretely, after a dimensionality reduction, we compare several regression and classification methods.

### Table of Contents
- [0. Loading the Dataset](#0.-Loading-the-Dataset)
- [1. Dimensional Reduction: PCA](#1.-Dimensional-Reduction:-PCA)
- [2. Solutions of the Problem: Regression Models](#2.-Solutions-of-the-Problem:-Regression-Models)
    - [2.1 Linear Regression](#2.1-Linear-Regression)
    - [2.2 Polynomial Regression](#2.2-Polynomial-Regression)
    - [2.3 Logistic Regression (Classification)](#2.3-Logistic-Regression-(Classification))

## 0. Loading the Dataset

The dataset was obtained from the Google Colab `sample_data` directory.

In [1]:
import numpy as np
import scipy as sc
import sklearn as sk
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn.decomposition import PCA
# Public import path (the private sklearn.linear_model.logistic module is not available in recent releases)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures

from sklearn.datasets import load_digits

In [2]:
mnist = pd.read_csv('Datasets/MNIST.csv', header=None)
mnist = np.array(mnist)
print('The shape of the dataset is:',mnist.shape)
print()
print(mnist)

X = mnist[:,1:]  # pixel values (784 columns)
Y = mnist[:,0]   # digit label (first column)

The shape of the dataset is: (20000, 785)

[[6 0 0 ... 0 0 0]
 [5 0 0 ... 0 0 0]
 [7 0 0 ... 0 0 0]
 ...
 [2 0 0 ... 0 0 0]
 [9 0 0 ... 0 0 0]
 [5 0 0 ... 0 0 0]]


## 1. Dimensional Reduction: PCA

We keep the smallest number of principal components that together explain at least 50% of the original variance. This reduces the system from 784 to 11 variables.

In [3]:
x=X
ipca = PCA(n_components=0.5)
ipca.fit(x)
xt = ipca.transform(x)

print(xt.shape)

(20000, 11)
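To make the selection rule concrete, here is a minimal NumPy sketch of how a variance threshold translates into a number of components. It is an illustration only, not scikit-learn's implementation: the helper `components_for_variance` and the toy data are hypothetical, and the explained-variance ratios are computed from the singular values of the centred data.

```python
import numpy as np

def components_for_variance(X, threshold=0.5):
    """Return the smallest number of principal components whose cumulative
    explained-variance ratio exceeds `threshold`, plus the cumulative ratios."""
    Xc = X - X.mean(axis=0)                      # centre each column
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, in descending order
    ratio = s**2 / np.sum(s**2)                  # explained-variance ratio per component
    cumulative = np.cumsum(ratio)
    # Index of the first running total above the threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold, side="right")) + 1
    return k, cumulative

# Toy data (hypothetical): three independent columns with variances of about 9, 4 and 1
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5000, 3)) * np.array([3.0, 2.0, 1.0])

k, cumulative = components_for_variance(X_demo, threshold=0.5)
print(k, np.round(cumulative, 3))
```

On the toy data the first component alone carries roughly 9/14 of the variance, so a 50% threshold keeps a single component. On the MNIST matrix the same rule would be expected to agree with the 11 components reported above.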


## 2. Solutions of the Problem: Regression Models

The accuracy of polynomial regression grows with the degree of the polynomial up to degree 4; at degree 5 a small decay appears, possibly due to overfitting, although each degree is evaluated on a single random split, so part of the difference may be noise. The highest accuracy is obtained with logistic regression; this is not surprising, since it is a classification technique, whereas the regression models treat the digit labels as numeric values.

### 2.1 Linear Regression

In [3]:
acclist = []
iterations=100

for i in range(iterations):
    X_train, X_test, Y_train, Y_test = train_test_split(xt,Y, test_size=0.3)
    regr_lineal = linear_model.LinearRegression()
    regr_lineal.fit(X_train, Y_train)
    Ypred = regr_lineal.predict(X_test)
    # A prediction counts as correct when it rounds to the true digit
    acc = np.sum(np.round(Ypred) == Y_test)/len(Y_test)
    acclist.append(acc)

print('Mean accuracy of Linear Regression after',iterations,'iterations =', round(np.mean(acclist)*100,2),'%')

Mean accuracy of Linear Regression after 100 iterations = 13.95 %
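The accuracy above comes from rounding a continuous prediction to the nearest integer, which treats the digit labels as magnitudes rather than categories. The sketch below isolates that scoring step on made-up numbers. The helper `rounded_accuracy` is hypothetical, and the clipping to the range 0 to 9 is an addition that the cell above does not perform.

```python
import numpy as np

def rounded_accuracy(y_pred, y_true, n_classes=10):
    """Fraction of continuous predictions that round to the true label.
    Clipping to [0, n_classes - 1] is an assumption added for this sketch."""
    labels = np.clip(np.rint(y_pred), 0, n_classes - 1).astype(int)
    return float(np.mean(labels == y_true))

# Made-up predictions: two land on the right digit, one misses by one,
# and one falls outside the valid range and is rescued only by clipping
y_true = np.array([0, 3, 7, 9])
y_pred = np.array([-0.4, 3.2, 5.6, 11.3])

acc = rounded_accuracy(y_pred, y_true)
print(acc)  # 0.75
```

A prediction of 5.6 for a true 7 is scored as a plain miss, even though the regression loss regards it as "close". This mismatch between the loss and the metric is one plausible reason for the low score.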


### 2.2 Polynomial Regression

In [4]:
max_grados = 5

for gr in range(1, max_grados+1):
    # Polynomial transformation of the PCA-reduced features
    pol = PolynomialFeatures(gr)
    x_pol = pol.fit_transform(xt)

    # Note: a single random split per degree, so the accuracies are not averaged
    X_train, X_test, Y_train, Y_test = train_test_split(x_pol,Y, test_size=0.3)

    model = linear_model.LinearRegression()
    model.fit(X_train, Y_train)
    y_pol_pred = model.predict(X_test)
    pol_acc = np.sum(np.round(y_pol_pred) == Y_test)/len(Y_test)
    print('Accuracy of Polynomial Regression for degree',gr,'=', round(pol_acc*100,2),'%')

Accuracy of Polynomial Regression for degree 1 = 13.43 %
Accuracy of Polynomial Regression for degree 2 = 27.02 %
Accuracy of Polynomial Regression for degree 3 = 35.6 %
Accuracy of Polynomial Regression for degree 4 = 43.03 %
Accuracy of Polynomial Regression for degree 5 = 41.3 %
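One way to judge how plausible overfitting is at higher degrees is to count the columns produced by the polynomial expansion. With `n` inputs and degree `d`, a full expansion that includes the bias column (the `PolynomialFeatures` default) has C(n + d, d) terms. The helper name below is hypothetical; the 11 inputs are the PCA components from section 1.

```python
from math import comb

def n_polynomial_features(n_inputs, degree):
    """Number of monomials of total degree <= `degree` in `n_inputs` variables,
    counting the constant (bias) term."""
    return comb(n_inputs + degree, degree)

# Feature counts for the 11 PCA components used in this notebook
counts = {d: n_polynomial_features(11, d) for d in range(1, 6)}
print(counts)  # {1: 12, 2: 78, 3: 364, 4: 1365, 5: 4368}
```

At degree 5 the model fits 4368 coefficients on about 14000 training rows (70% of 20000), so the number of parameters is no longer small compared with the number of samples.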


### 2.3 Logistic Regression (Classification)

In [6]:
iterations = 5
acc_log = []

for i in range(iterations):
    X_train, X_test, Y_train, Y_test = train_test_split(xt,Y, test_size=0.3)
    lo = LogisticRegression(multi_class='multinomial', solver="lbfgs", max_iter=5000, verbose=0)
    lo.fit(X_train, Y_train)
    acc_log.append(lo.score(X_test, Y_test))
print('Mean accuracy of Logistic Regression after',iterations,'iterations =', round(np.mean(acc_log)*100,2),'%')

Mean accuracy of Logistic Regression after 5 iterations = 80.77 %
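Multinomial logistic regression assigns one score per class and converts the scores into probabilities with the softmax function, so the prediction is the most probable class rather than a rounded number. The following is a minimal sketch of that final step only, not scikit-learn's internals; the scores are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: turns per-class scores into probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up scores for two samples and three classes
scores = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])

probs = softmax(scores)
pred = probs.argmax(axis=1)  # predicted class = highest probability
print(np.round(probs, 3), pred)
```

Each row of `probs` sums to 1, and equal scores give equal probabilities. No ordering between classes is assumed, which fits digit labels better than the rounding used for the regression models.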
