# Q2

## Conceptual questions:
### Q2.1
Why should we do cross-validation of our models? What is its purpose?  

### Q2.2
Describe K-fold cross-validation. What is the benefit of having multiple, separate validation sets?
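
As a starting point for this question, the sketch below prints the splits that `KFold` produces on ten toy samples. The data and the choice of `n_splits=5` are illustrative assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples, identified only by their index (illustrative data).
samples = np.arange(10)

kfold = KFold(n_splits=5)
validation_folds = []
for train_idx, val_idx in kfold.split(samples):
    validation_folds.append(val_idx)
    print('train:', train_idx, 'validation:', val_idx)

# Collect every validation index to check how the folds cover the data.
all_validation = np.sort(np.concatenate(validation_folds))
```

Each sample lands in exactly one validation fold, so every sample is used for validation once and for training in the remaining folds.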


## Practical Exercises:
### Q2.3
Load the Boston housing prices (`sklearn.datasets.load_boston`), and fit a linear regressor. Perform 5-, 10-, and 100-fold cross-validation. Examine the returned performance metrics.  
Use ShuffleSplit cross-validation with varying training sizes; how does this compare to k-fold?

In [None]:
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell needs an older scikit-learn release.
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
import numpy as np

# Hold out 20% of the data as a test set; cross-validate on the rest.
x_data, y_labels = load_boston(return_X_y=True)
x_data_train, x_data_test, y_labels_train, y_labels_test = train_test_split(x_data, y_labels, train_size=0.8)

# cross_val_score uses the estimator's default score, which is R^2 for a regressor.
lrc = LinearRegression()
scores_5 = cross_val_score(lrc, x_data_train, y_labels_train, cv=5)
scores_10 = cross_val_score(lrc, x_data_train, y_labels_train, cv=10)
scores_100 = cross_val_score(lrc, x_data_train, y_labels_train, cv=100)
scores_ss = cross_val_score(lrc, x_data_train, y_labels_train, cv=ShuffleSplit(n_splits=100, train_size=0.8))
print('5-fold: ' + str(np.mean(scores_5)))
print('10-fold: ' + str(np.mean(scores_10)))
print('100-fold: ' + str(np.mean(scores_100)))
print('ShuffleSplit: ' + str(np.mean(scores_ss)))

print(x_data_train.shape)

# Plot the per-fold scores; the x-axis is rescaled to [0, 10] so that
# runs with different numbers of folds can share one figure.
plt.figure()
plt.plot(np.linspace(0, 10, 5), scores_5, '.-', label='5-fold')
plt.plot(np.linspace(0, 10, 10), scores_10, '.-', label='10-fold')
plt.legend()
plt.title('KFold 5 & 10')
plt.figure()
plt.plot(np.linspace(0, 10, 100), scores_100, '.-')
plt.title('KFold 100')
plt.figure()
plt.plot(np.linspace(0, 10, 100), scores_ss, '.-')
plt.title('ShuffleSplit')

5- and 10-fold CV give similar scores. 100-fold CV leaves only about 4 samples in each validation set, so the per-fold scores have high variance. ShuffleSplit with a large training size behaves similarly to 100-fold CV. Large values of `n_splits` in ShuffleSplit give redundant information, since the randomly drawn validation sets overlap.
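
The fold sizes and the overlap between validation sets can be checked directly on the splitters, without fitting a model. The sketch below assumes 404 training samples (80% of the 506 Boston samples); `max_times_validated` is a hypothetical helper written for this illustration, and `random_state=0` is set only for repeatability.

```python
import numpy as np
from sklearn.model_selection import KFold, ShuffleSplit

# Assumed training-set size: 80% of the 506 Boston samples.
n_train = 404
indices = np.arange(n_train)

# Validation-set sizes for each number of folds.
for k in (5, 10, 100):
    sizes = [len(val) for _, val in KFold(n_splits=k).split(indices)]
    print(f'{k}-fold: validation sizes from {min(sizes)} to {max(sizes)}')


def max_times_validated(splitter):
    """Return the largest number of validation sets any one sample appears in."""
    counts = np.zeros(n_train, dtype=int)
    for _, val in splitter.split(indices):
        counts[val] += 1
    return counts.max()


# K-fold validation sets are disjoint; ShuffleSplit draws each validation
# set independently, so the same samples are validated repeatedly.
kfold_max = max_times_validated(KFold(n_splits=100))
shuffle_max = max_times_validated(
    ShuffleSplit(n_splits=100, train_size=0.8, random_state=0))
print('KFold max appearances:', kfold_max)
print('ShuffleSplit max appearances:', shuffle_max)
```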

### Q2.4
Load the digits dataset (`sklearn.datasets.load_digits`). Note that each sample is an 8x8 image flattened to be 64x1. Select a classifier (e.g. logistic regression or support vector machine), and try to classify the digits. How well can you do? Where are the mistakes coming from?

Note: If you want to visualize the data, you can use numpy's reshape function:  
`import numpy as np; from matplotlib import pyplot as plt; import sklearn.datasets
x,y = sklearn.datasets.load_digits(return_X_y=True)
x_reshaped = np.reshape(x, (x.shape[0], 8, 8))
plt.imshow(x_reshaped[5, :, :])`

In [None]:
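from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator (available since 1.0) replaces it.
from sklearn.metrics import ConfusionMatrixDisplay

# Train on 60% of the digits and evaluate on the remaining 40%.
x_data, y_labels = load_digits(return_X_y=True)
x_data_train, x_data_test, y_labels_train, y_labels_test = train_test_split(x_data, y_labels, train_size=0.6)

logr = LogisticRegression(solver='liblinear')
logr.fit(x_data_train, y_labels_train)

# Off-diagonal entries show which digits are mistaken for which.
ConfusionMatrixDisplay.from_estimator(logr, x_data_test, y_labels_test)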
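
To see where the mistakes come from in numbers rather than in a plot, one option is to rank the off-diagonal entries of the confusion matrix. The sketch below is one possible approach, not the exercise's reference solution. It assumes a default `LogisticRegression` with a raised `max_iter` (chosen here for portability across scikit-learn versions) and a fixed `random_state` for repeatability.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

x, y = load_digits(return_X_y=True)
# random_state is fixed only to make this sketch repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.6, random_state=0)

# max_iter is raised (an assumption) so the default solver can converge.
clf = LogisticRegression(max_iter=5000)
clf.fit(x_train, y_train)

# Rows are true digits, columns are predicted digits.
cm = confusion_matrix(y_test, clf.predict(x_test))
print('test accuracy:', np.trace(cm) / cm.sum())

# Zero the diagonal so that only misclassifications remain.
errors = cm.copy()
np.fill_diagonal(errors, 0)

# List the three most frequent (true, predicted) confusions.
top = np.argsort(errors, axis=None)[::-1][:3]
for true_digit, predicted_digit in zip(*np.unravel_index(top, errors.shape)):
    print(f'true {true_digit} predicted as {predicted_digit}: '
          f'{errors[true_digit, predicted_digit]} times')
```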

### Q2.5
Return to Q2.4; instead of looking at all 64 pixels (dimensions), first use PCA to reduce the number of dimensions and then try to classify the samples. Justify your choice for the number of components (hint: this is quantifiable).

In [None]:
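import numpy as np
from matplotlib import pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, cross_val_score
# plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay instead.
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.decomposition import PCA

x_data, y_labels = load_digits(return_X_y=True)
x_data_train, x_data_test, y_labels_train, y_labels_test = train_test_split(x_data, y_labels, train_size=0.6)

# Cross-validate the classifier on the first i principal components, for i = 1..31.
n_components_range = range(1, 32)
score_list = []
for i in n_components_range:
    pca = PCA(n_components=i)
    pc = pca.fit_transform(x_data_train)
    logr = LogisticRegression(solver='liblinear')
    score_list.append(np.mean(cross_val_score(logr, pc, y_labels_train)))

# Plot against the actual number of components (starting at 1, not 0).
plt.plot(n_components_range, score_list, '.-')
plt.xlabel('Number of PCs')
plt.ylabel('Mean Accuracy')
# To evaluate on the test set, fit logr on the chosen components first:
#pc_test = pca.transform(x_data_test)
#ConfusionMatrixDisplay.from_estimator(logr, pc_test, y_labels_test)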

Using more than ~8 components doesn't improve the cross-validation accuracy by much. Any number of components between 8 and 16 would work.
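
Another way to quantify the choice is the cumulative explained variance of the principal components. The sketch below is a complementary check, not the exercise's reference answer. It fits PCA on the full dataset for simplicity, and the thresholds (80%, 90%, 95%) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

x, _ = load_digits(return_X_y=True)

# Fit PCA with all 64 components and accumulate the variance each explains.
pca = PCA().fit(x)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# The thresholds are illustrative choices, not values from the exercise.
components_needed = {}
for threshold in (0.80, 0.90, 0.95):
    # Index of the first component at which the threshold is reached, plus one.
    components_needed[threshold] = int(np.argmax(cumulative >= threshold)) + 1
    print(f'{threshold:.0%} of variance: '
          f'{components_needed[threshold]} components')
```

Comparing these counts with the cross-validation curve shows whether the components that carry most of the variance are also the ones that matter for classification.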