# Problem 1
Load the MNIST digits data set. Estimate baseline prediction accuracy with SGDClassifier (20 iterations), RandomForest(max_depth=3) and RandomForest(max_depth=15). Train each model on the training data and measure accuracy on the testing data. Record the amount of time needed to estimate each.

In [2]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
assert sklearn.__version__ >= "0.20"

# Common imports
import numpy as np
import os


from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1)
mnist.keys()
from sklearn.model_selection import train_test_split
X = mnist["data"]
y = mnist["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [12]:
from sklearn import linear_model
from sklearn.metrics import accuracy_score
clf = linear_model.SGDClassifier(max_iter=20)
import time
t1 = time.time()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
t2 = time.time()
acc_full = accuracy_score(y_pred, y_test)
tdelta = t2 - t1
print("Accuracy score in baseline SGD model is", acc_full)
print("The procedure took", tdelta, "seconds")

Accuracy score in baseline SGD model is 0.8767428571428572
The procedure took 17.249995231628418 seconds




In [13]:
from sklearn.ensemble import RandomForestClassifier
t1 = time.time()
clf = RandomForestClassifier(n_estimators=100, max_depth=15,random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
t2 = time.time()
acc_rf = accuracy_score(y_pred, y_test)
tdelta = t2 - t1
print("Accuracy score in baseline Random Forest with max_depth=15 is", acc_rf)
print("The procedure took", tdelta, "seconds")

Accuracy score in baseline Random Forest with max_depth=15 is 0.9627428571428571
The procedure took 35.92093062400818 seconds


In [14]:
from sklearn.ensemble import RandomForestClassifier
t1 = time.time()
clf = RandomForestClassifier(n_estimators=100, max_depth=3,random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
t2 = time.time()
acc_rf = accuracy_score(y_pred, y_test)
tdelta = t2 - t1
print("Accuracy score in baseline Random Forest with max_depth=3 is", acc_rf)
print("The procedure took", tdelta, "seconds")

Accuracy score in baseline Random Forest with max_depth=3 is 0.7274857142857143
The procedure took 9.654370069503784 seconds
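
The three cells above repeat the same start/stop timing pattern. One way to factor it out is sketched below using only the standard library. `timed` and `timings` are hypothetical names introduced here, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label]."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}  # hypothetical container for all measurements
with timed("demo", timings):
    # Stand-in workload; in the notebook this would be clf.fit(...) and clf.predict(...)
    total = sum(i * i for i in range(100_000))

print("The procedure took", timings["demo"], "seconds")
```

Keeping every measurement in one dictionary makes it easier to compare the models later without re-reading printed output.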


# Problem 2
Apply PCA to extract the principal components responsible for 80% of the variance. Apply the algorithms above to the components and report the new accuracy scores. Make sure to apply PCA to the data before the split into training and testing. Record the time of the PCA procedure, record separately the time and accuracy of each estimation, and report changes relative to Problem 1.

In [15]:
from sklearn.decomposition import PCA
X = mnist["data"]
y = mnist["target"]
t1_pca = time.time()
pca = PCA(n_components=0.80)
X_red = pca.fit_transform(X)
t2_pca = time.time()
tdelta_pca = t2_pca - t1_pca
print("The PCA procedure took", tdelta_pca, "seconds")
X_red_tr, X_red_test, y_train, y_test = train_test_split(X_red, y, random_state=42)
t1 = time.time()
clf = linear_model.SGDClassifier(max_iter=20)
clf.fit(X_red_tr, y_train)
y_pred = clf.predict(X_red_test)
t2 = time.time()
tdelta = t2 - t1
print("The SGD after PCA took", tdelta, "seconds")
print("Accuracy score")
accuracy_score(y_pred, y_test)

The PCA procedure took 13.868068933486938 seconds
The SGD after PCA took 3.1241767406463623 seconds
Accuracy score




0.7708

In [20]:
clf = RandomForestClassifier(n_estimators=100, max_depth=15,random_state=42)
t1 = time.time()
clf.fit(X_red_tr, y_train)
y_pred = clf.predict(X_red_test)
t2 = time.time()
tdelta = t2 - t1
print("The RF with max_depth=15 after PCA took", tdelta, "seconds")
print("Accuracy score")
accuracy_score(y_pred, y_test)

The RF with max_depth=15 after PCA took 43.31335258483887 seconds
Accuracy score


0.9438857142857143

In [19]:
clf = RandomForestClassifier(n_estimators=100, max_depth=3,random_state=42)
t1 = time.time()
clf.fit(X_red_tr, y_train)
y_pred = clf.predict(X_red_test)
t2 = time.time()
tdelta = t2 - t1
print("The RF with max_depth=3 after PCA took", tdelta, "seconds")
print("Accuracy score")
accuracy_score(y_pred, y_test)

The RF with max_depth=3 after PCA took 13.224504232406616 seconds
Accuracy score


0.7295428571428572

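Answer: Applying PCA did not improve accuracy. The SGD classifier dropped from 0.877 to 0.771 and the random forest with max_depth=15 from 0.963 to 0.944, while the random forest with max_depth=3 was essentially unchanged (0.727 vs. 0.730). The deeper random forest still exceeds the SGD classifier. There is no overall benefit in terms of time: SGD ran faster (3.1 s vs. 17.2 s), but both random forests were slower (43.3 s vs. 35.9 s and 13.2 s vs. 9.7 s), and PCA itself took another 13.9 s.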
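
The changes relative to Problem 1 can be tabulated directly from the accuracies and timings printed in the cells above. The sketch below copies those numbers by hand (rounded) and uses hypothetical names (`baseline`, `after_pca`, `changes`). It is bookkeeping only and does not rerun any model.

```python
# (accuracy, seconds) pairs copied from the printed outputs above, rounded.
baseline = {
    "SGD": (0.8767, 17.25),
    "RF max_depth=15": (0.9627, 35.92),
    "RF max_depth=3": (0.7275, 9.65),
}
after_pca = {
    "SGD": (0.7708, 3.12),
    "RF max_depth=15": (0.9439, 43.31),
    "RF max_depth=3": (0.7295, 13.22),
}
pca_seconds = 13.87  # one-off cost of fitting and applying PCA

changes = {}
for name, (acc0, t0) in baseline.items():
    acc1, t1 = after_pca[name]
    changes[name] = (acc1 - acc0, t1 - t0)
    print(f"{name}: accuracy {acc1 - acc0:+.4f}, time {t1 - t0:+.2f} s")

print(f"PCA itself: {pca_seconds:.2f} s (paid once, before any model is fit)")
```

The PCA cost is kept separate because it is paid once and shared by all three models, so adding it to each row would overstate it.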

In [None]:
X_red_tr.shape

(52500, 87)
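
When `n_components` is a float between 0 and 1, scikit-learn's `PCA` keeps the smallest number of components whose cumulative explained variance ratio exceeds that fraction, which is how the 784 pixel columns were reduced to the 87 columns reported above. The selection rule is sketched below in plain Python. The function name and the toy ratios are hypothetical and are not MNIST values. On a fitted model the same information is available from `pca.explained_variance_ratio_` and `pca.n_components_`.

```python
from itertools import accumulate


def components_for_variance(ratios, threshold):
    """Return the smallest k whose first k ratios sum to more than threshold."""
    for k, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative > threshold:
            return k
    return len(ratios)  # threshold never exceeded: keep everything


# Hypothetical explained-variance ratios, sorted in decreasing order.
toy_ratios = [0.40, 0.25, 0.12, 0.08, 0.06, 0.05, 0.04]
k = components_for_variance(toy_ratios, 0.80)
print("components kept:", k)
```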

# Problem 3

Load the same data. Extract 1000 observations using the code below. Try four different PCA algorithms that extract 100 principal components. Use the following PCA algorithms: PCA, Kernel PCA (linear), Kernel PCA (sigmoid, gamma=0.001), LLE (10 neighbors), Isomap. Then estimate a logistic regression on the training data and test its accuracy on the testing data.
* What accuracy score on the testing data do you find with each PCA algorithm?
* Which PCA algorithm has the highest prediction accuracy?
* What is the accuracy of the logistic regression applied to the 1000 observations without applying PCA?

In [22]:
# Randomly sample 1000 obs, otherwise it will get really slow.
np.random.seed(42)
smp = np.random.randint(50000, size=1000)
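# np.asarray keeps positional indexing working whether fetch_openml
# returned NumPy arrays or a pandas DataFrame/Series.
X_s = np.asarray(mnist["data"])[smp, :]
y_s = np.asarray(mnist["target"])[smp]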
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_s, y_s, random_state = 42)

In [None]:
#X_train_s.shape
#X_test_s.shape
#y_train_s.shape
#y_test_s.shape

In [23]:
from sklearn.decomposition import KernelPCA
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.decomposition import PCA
from sklearn.manifold import Isomap
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
pca = PCA(n_components=200, random_state=42)
lin_pca = KernelPCA(n_components = 200, kernel="linear", fit_inverse_transform=True, n_jobs = -1, random_state=42)
sig_pca = KernelPCA(n_components = 2, kernel="sigmoid", gamma=0.1, coef0=1, fit_inverse_transform=True, n_jobs = -1, random_state=42)
lle = LocallyLinearEmbedding(n_components=200, n_neighbors=10, random_state=42, n_jobs = -1)
isomap = Isomap(n_components=50)
log_clf = LogisticRegression(random_state=42, solver='lbfgs',  multi_class='multinomial', n_jobs = -1)

In [24]:
for c in (pca, lin_pca,  lle, isomap):
#for c in (pca,isomap):
    t1 = time.time()
    print(c)
    c.fit(X_train_s)
    X_reduced_tr = c.transform(X_train_s)
    X_reduced_test = c.transform(X_test_s)
    log_clf.fit(X_reduced_tr, y_train_s)
    y_pred = log_clf.predict(X_reduced_test)
    acc = accuracy_score(y_pred, y_test_s)
    print("accuracy score using", c, "PCA method is", acc)
    t2 = time.time()
    tdelta = t2 - t1
    print("The procedure took",tdelta , "seconds")

PCA(copy=True, iterated_power='auto', n_components=200, random_state=42,
    svd_solver='auto', tol=0.0, whiten=False)
accuracy score using PCA(copy=True, iterated_power='auto', n_components=200, random_state=42,
    svd_solver='auto', tol=0.0, whiten=False) PCA method is 0.844
The procedure took 1.5418107509613037 seconds
KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
          fit_inverse_transform=True, gamma=None, kernel='linear',
          kernel_params=None, max_iter=None, n_components=200, n_jobs=-1,
          random_state=42, remove_zero_eig=False, tol=0)
accuracy score using KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
          fit_inverse_transform=True, gamma=None, kernel='linear',
          kernel_params=None, max_iter=None, n_components=200, n_jobs=-1,
          random_state=42, remove_zero_eig=False, tol=0) PCA method is 0.844
The procedure took 1.356335163116455 seconds
LocallyLinearEmbedding(eigen_solver='auto', h

In [25]:
t1 = time.time()
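c = sig_pca
# Fit on the training split only, then reuse that mapping for the test split;
# calling fit_transform on the test data would refit the reducer on it.
X_reduced_tr = c.fit_transform(X_train_s)
X_reduced_test = c.transform(X_test_s)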
log_clf.fit(X_reduced_tr, y_train_s)
y_pred = log_clf.predict(X_reduced_test)
acc = accuracy_score(y_pred, y_test_s)
print("accuracy score using", c, "PCA method is", acc)
t2 = time.time()
tdelta = t2 - t1
print("The procedure took",tdelta , "seconds")

accuracy score using KernelPCA(alpha=1.0, coef0=1, copy_X=True, degree=3, eigen_solver='auto',
          fit_inverse_transform=True, gamma=0.1, kernel='sigmoid',
          kernel_params=None, max_iter=None, n_components=2, n_jobs=-1,
          random_state=42, remove_zero_eig=False, tol=0) PCA method is 0.116
The procedure took 1.0573294162750244 seconds


In [26]:
log_clf.fit(X_train_s, y_train_s)
y_pred = log_clf.predict(X_test_s)
acc = accuracy_score(y_pred, y_test_s)
print("accuracy score without PCA is", acc)

accuracy score without PCA is 0.832
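
When reducers are compared like this, the usual pattern is to fit the reducer on the training split only and apply it to the test split with `transform`. A scikit-learn pipeline enforces that automatically. The sketch below is a minimal illustration and not the notebook's experiment. It uses the small `load_digits` data bundled with scikit-learn as a stand-in so that it runs without downloading MNIST, and the names `X_demo`, `model`, and `demo_acc` are hypothetical.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Stand-in data: the 8x8 digits set bundled with scikit-learn (not MNIST).
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# fit() fits PCA on the training split only; score()/predict() call
# transform (never fit) on whatever data they are given.
model = make_pipeline(
    PCA(n_components=20, random_state=42),
    LogisticRegression(max_iter=2000),
)
model.fit(X_tr, y_tr)
demo_acc = model.score(X_te, y_te)
print("accuracy with PCA inside a pipeline:", demo_acc)
```

Swapping `PCA` for `KernelPCA`, `LocallyLinearEmbedding`, or `Isomap` in the pipeline should work the same way, since each of them provides `fit` and `transform`.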


# Answer
The best PCA method is linear PCA: both standard PCA and Kernel PCA with a linear kernel achieved an accuracy of 0.844. The logistic regression applied without PCA had an accuracy of 0.832.