## Solutions to exercises

**Not all solutions are complete. Some solutions require functions or variables that are already defined in the main notebooks.**

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

### Exercise 1
We can also apply ``LinearRegression`` to multi-feature data. Create a dataset that has 5 features. Calculate the coefficients, then plot the response against the first feature together with the fitted line.

In [None]:
from sklearn import datasets
from sklearn.linear_model import LinearRegression

X_multi, y_multi = datasets.make_regression(n_samples=30, n_features=5, n_informative=5,
                                            random_state=0, noise=75)
# shift the data, just to have positive values only
X_multi = X_multi + 3
y_multi = y_multi + 310

regr_multi = LinearRegression()
regr_multi.fit(X_multi, y_multi)
print(regr_multi.intercept_, regr_multi.coef_)
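
The exercise also asks for a plot, which the cell above does not produce. Below is a minimal sketch, assuming that a reasonable way to draw a "fitted line" for a five-feature model is to vary the first feature while holding the other four at their sample means. This choice, and the names `x0`, `X_line` and `y_line`, are assumptions and not part of the original solution.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LinearRegression

# same data and model as in the cell above
X_multi, y_multi = datasets.make_regression(n_samples=30, n_features=5, n_informative=5,
                                            random_state=0, noise=75)
X_multi = X_multi + 3
y_multi = y_multi + 310
regr_multi = LinearRegression().fit(X_multi, y_multi)

# grid along the first feature; the remaining features are fixed at their means
x0 = np.linspace(X_multi[:, 0].min(), X_multi[:, 0].max(), 50)
X_line = np.tile(X_multi.mean(axis=0), (50, 1))
X_line[:, 0] = x0
y_line = regr_multi.predict(X_line)

plt.scatter(X_multi[:, 0], y_multi, label="data")
plt.plot(x0, y_line, "r", label="fit (other features at their mean)")
plt.xlabel("first feature")
plt.ylabel("response")
plt.legend()
```

The slope of this line equals the first coefficient of the model, so the scatter around it reflects both the noise and the contribution of the other four features.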

### Exercise 2
Run a classification on the Iris data available in ``datasets``. You can try running the model twice: first using two features that are highly correlated with the output, and then using two less correlated features. You might also read about and change the parameters of ``KNeighborsClassifier``.


In [None]:
from sklearn import datasets
iris = datasets.load_iris()

In [None]:
X_ir = iris.data
y_ir = iris.target
print(iris.DESCR)

In [None]:
# check how many classes you have
np.unique(y_ir)
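
To decide which features are "highly correlated with the output", one rough option is the Pearson correlation between each feature and the integer class label. This is only a sketch: the class labels are categories rather than quantities, so the numbers should be read as a heuristic. The name `corrs` is introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Pearson correlation between each feature column and the integer class label
corrs = [np.corrcoef(iris.data[:, j], iris.target)[0, 1]
         for j in range(iris.data.shape[1])]

for name, c in zip(iris.feature_names, corrs):
    print("{:20s} {: .2f}".format(name, c))
```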

Let's plot using only the two features that are highly correlated with the output:

In [None]:
plt.scatter(X_ir[:, 2], X_ir[:, 3], c=y_ir, s=20)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier()
clf.fit(X_ir[:,2:], y_ir)

Let's plot results:

In [None]:
from matplotlib.colors import ListedColormap

def plot_iris(X_ir, y_ir, ind_x, ind_y):
    # NOTE: this function uses the global ``clf``, which must already be
    # fitted on the two features selected by ind_x and ind_y
    x_min, x_max = X_ir[:, ind_x].min() - 0.2, X_ir[:, ind_x].max() + 0.2
    y_min, y_max = X_ir[:, ind_y].min() - 0.2, X_ir[:, ind_y].max() + 0.2
    # predict the class on a regular grid covering the two features
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 50),
                         np.linspace(y_min, y_max, 50))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure()
    cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X_ir[:, ind_x], X_ir[:, ind_y], c=y_ir, s=20)

plot_iris(X_ir, y_ir, ind_x=2, ind_y=3)

**We can check how the model would work if we chose the first two features:**

In [None]:
plt.scatter(X_ir[:, 0], X_ir[:, 1], c=y_ir, s=20)

We can see that the task will be harder.

In [None]:
clf = KNeighborsClassifier()
clf.fit(X_ir[:,:2], y_ir)

plot_iris(X_ir, y_ir, ind_x=0, ind_y=1)

Still, the algorithm correctly identified most of the points. You can also try changing the number of neighbors.
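
A minimal sketch of how the number of neighbors could be varied, using 5-fold cross-validation on the first two features. The values of `k` and the dictionary `mean_scores` are arbitrary choices made for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X2 = iris.data[:, :2]  # the two harder features used above

mean_scores = {}
for k in (1, 5, 15, 50):
    clf_k = KNeighborsClassifier(n_neighbors=k)
    # mean accuracy over 5 stratified folds
    mean_scores[k] = cross_val_score(clf_k, X2, iris.target, cv=5).mean()
    print("n_neighbors = {:2d}: mean accuracy = {:.2f}".format(k, mean_scores[k]))
```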

### Exercise 3
Use PCA for the Iris dataset.

In [None]:
from sklearn.decomposition import PCA

n_components = 2
pca = PCA(n_components=n_components)
X_pca = pca.fit_transform(X_ir)

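colors = ['navy', 'turquoise', 'darkorange']

# one scatter call per class, so that each class gets its own legend entry
for color, i, target_name in zip(colors, [0, 1, 2], iris.target_names):
    plt.scatter(X_pca[y_ir == i, 0], X_pca[y_ir == i, 1],
                color=color, label=target_name)
plt.legend()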
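
It can also be useful to check how much of the variance the two components keep. A short sketch using the ``explained_variance_ratio_`` attribute of the fitted ``PCA`` object (the name `ratio` is introduced here for illustration):

```python
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA(n_components=2)
X_pca = pca.fit_transform(iris.data)

# fraction of the total variance captured by each principal component
ratio = pca.explained_variance_ratio_
print("per component:", ratio)
print("total kept:   ", ratio.sum())
```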

### Exercise 4

Using the ``make_data`` function, generate a new dataset with a different sample size. Calculate the cross-validation score using one of the [splitter methods available in scikit-learn](http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection). See how the scores differ with the sample size.

In [None]:
X_new, y_new = make_data(N=50)

X_new_tr, X_new_ts, y_new_tr, y_new_ts = train_test_split(X_new, y_new)

poly2new = PolynomialRegression(2)
poly2new.fit(X_new_tr, y_new_tr)
plot_regr(X_new_tr, y_new_tr, poly2new)

In [None]:
plot_regr(X_new_ts, y_new_ts, poly2new, color="r")

In [None]:
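# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import ShuffleSplit, cross_val_score

# ShuffleSplit no longer takes the number of samples; n_splits=10 is its default
scores = cross_val_score(poly2new, X_new, y_new, cv=ShuffleSplit(n_splits=10))
print("Scores for regr: {}, mean score = {:03.2f}, std = {:03.2f}".format(scores, scores.mean(), scores.std()))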
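
To see how the scores change with the sample size, the same steps can be repeated in a loop. The sketch below is self-contained, so it does not use ``make_data`` or ``PolynomialRegression`` from the main notebook: `make_data_sketch` is a hypothetical stand-in that generates a noisy quadratic, and the pipeline is an assumed equivalent of a degree-2 polynomial regression.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def make_data_sketch(N, seed=0):
    # hypothetical stand-in for make_data: a noisy quadratic in one feature
    rng = np.random.RandomState(seed)
    X = rng.uniform(-3, 3, size=(N, 1))
    y = 0.5 * X.ravel() ** 2 + X.ravel() + rng.normal(scale=1.0, size=N)
    return X, y

summary = {}
for N in (20, 50, 200):
    X_s, y_s = make_data_sketch(N)
    model = make_pipeline(PolynomialFeatures(2), LinearRegression())
    cv = ShuffleSplit(n_splits=10, random_state=0)
    scores = cross_val_score(model, X_s, y_s, cv=cv)  # R^2 on each split
    summary[N] = (scores.mean(), scores.std())
    print("N = {:3d}: mean score = {:.2f}, std = {:.2f}".format(N, *summary[N]))
```

Comparing the standard deviations across `N` is usually more informative than comparing the means, because small test sets make individual scores very noisy.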

### Exercise 5

Change the number of neighbors in the ``KNeighborsClassifier`` model and run ``permutation_test_score`` again. Try a very large number, e.g. 300. Can you explain the result?

In [None]:
from sklearn import datasets
cancer = datasets.load_breast_cancer()

X_can = cancer.data
y_can = cancer.target

In [None]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import permutation_test_score
clf = KNeighborsClassifier(n_neighbors=300)
score, permutation_scores, pvalue = permutation_test_score(
    clf, X_can, y_can, scoring="accuracy", cv=None, n_permutations=1000, n_jobs=1)
print("Classification score %s (pvalue : %s)" % (score, pvalue))

In [None]:
plt.hist(permutation_scores, 20, label='Permutation scores',
         edgecolor='black', alpha=0.6)
ylim = plt.ylim()

plt.plot(2 * [score], ylim, '--g', linewidth=3,
         label='Classification Score')
plt.title("p_value = {:06.5f}".format(pvalue))
plt.ylim(ylim)
plt.legend()
plt.xlabel('Score')
plt.show()
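
One way to probe the effect of a very large number of neighbors is to compare the classifier with a baseline that always predicts the most frequent class. This is only a sketch: the use of ``DummyClassifier``, the 5-fold split and the chosen values of `k` are assumptions made for illustration.

```python
from sklearn import datasets
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

cancer = datasets.load_breast_cancer()
X_can, y_can = cancer.data, cancer.target

# accuracy obtained by always predicting the majority class
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"),
                           X_can, y_can, cv=5, scoring="accuracy").mean()
print("majority-class baseline: {:.2f}".format(baseline))

knn_scores = {}
for k in (5, 50, 300):
    clf_k = KNeighborsClassifier(n_neighbors=k)
    knn_scores[k] = cross_val_score(clf_k, X_can, y_can, cv=5,
                                    scoring="accuracy").mean()
    print("n_neighbors = {:3d}: accuracy = {:.2f}".format(k, knn_scores[k]))
```

As the number of neighbors approaches the size of the training set, each vote includes most of the training samples, so the predictions depend less on where the query point lies.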

### Exercise 6

Run a permutation test score for the model built for the Iris data. You can use the original data or the data after PCA.

In [None]:
clf_ir = KNeighborsClassifier()

score_ir, permutation_scores_ir, pvalue_ir = permutation_test_score(
    clf_ir, X_pca, y_ir, scoring="accuracy", cv=None, n_permutations=1000, n_jobs=1)
print("Classification score %s (pvalue : %s)" % (score_ir, pvalue_ir))

In [None]:
plt.hist(permutation_scores_ir, 20, label='Permutation scores',
         edgecolor='black', alpha=0.6)
ylim = plt.ylim()

plt.plot(2 * [score_ir], ylim, '--g', linewidth=3,
         label='Classification Score')
plt.title("p_value = {:06.5f}".format(pvalue_ir))
plt.ylim(ylim)
plt.legend()
plt.xlabel('Score')
plt.show()
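
The p-value shown in the title can be reproduced by hand from the permutation scores: it is the fraction of permutations that score at least as well as the original labels, with one added to the numerator and the denominator. A small sketch with fewer permutations to keep it fast (`manual` and `perm_scores` are names chosen for illustration):

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import permutation_test_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
n_permutations = 100

score, perm_scores, pvalue = permutation_test_score(
    KNeighborsClassifier(), iris.data, iris.target, scoring="accuracy",
    cv=5, n_permutations=n_permutations, random_state=0)

# (C + 1) / (n_permutations + 1), where C counts the permutations
# whose score is at least as high as the score on the true labels
manual = (np.sum(perm_scores >= score) + 1.0) / (n_permutations + 1)
print(pvalue, manual)
```

Because of the added one, the smallest p-value this test can report is `1 / (n_permutations + 1)`.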

### Exercise 6

Validate the model using ``cross_val_score``. Try different kernels for the SVM (you can read more [here](http://scikit-learn.org/stable/modules/svm.html)).

In [None]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score, ShuffleSplit, LeaveOneOut
from sklearn.svm import SVC

svc = SVC(kernel='linear')
# LeaveOneOut takes no arguments; the number of samples is inferred from the data
scores = cross_val_score(svc, fmri_masked_2lb, conditions_2lb, cv=LeaveOneOut())
print("Scores: {}, mean score = {:03.2f}".format(scores, scores.mean()))

# you can also try the default kernel
svc = SVC()
scores = cross_val_score(svc, fmri_masked_2lb, conditions_2lb, cv=LeaveOneOut())
print("Scores: {}, mean score = {:03.2f}".format(scores, scores.mean()))
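
The comparison of kernels can also be written as a loop. The sketch below is self-contained and therefore uses a small synthetic dataset from ``make_classification`` as a stand-in for `fmri_masked_2lb` and `conditions_2lb`; the dataset and the dictionary `kernel_scores` are assumptions, and the scores say nothing about the fMRI data.

```python
from sklearn import datasets
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.svm import SVC

# synthetic two-class stand-in for the masked fMRI data
X_syn, y_syn = datasets.make_classification(n_samples=40, n_features=20,
                                            random_state=0)

kernel_scores = {}
for kernel in ("linear", "rbf", "poly", "sigmoid"):
    svc_k = SVC(kernel=kernel)
    # leave-one-out: each score is 0 or 1, the mean is the accuracy
    scores_k = cross_val_score(svc_k, X_syn, y_syn, cv=LeaveOneOut())
    kernel_scores[kernel] = scores_k.mean()
    print("{:8s} mean score = {:.2f}".format(kernel, kernel_scores[kernel]))
```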

### Exercise 7
Check whether ``KNeighborsClassifier`` would work for this dataset. Validate the model in the same way as the SVC.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
clf_kn = KNeighborsClassifier()

In [None]:
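from sklearn.model_selection import cross_val_score, LeaveOneOut

# LeaveOneOut takes no arguments; the number of samples is inferred from the data
scores = cross_val_score(clf_kn, fmri_masked_2lb, conditions_2lb, cv=LeaveOneOut())
print("Scores: {}, mean score = {:03.2f}".format(scores, scores.mean()))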

### Exercise 8

Try to run the model using all conditions (except the rest state). This is multiclass classification: try the one-vs-all and one-vs-one strategies (you can read more [here](https://en.wikipedia.org/wiki/Multiclass_classification)). Which one should be faster?
Does the new model have as high a score as the one with only two conditions? Which condition is the easiest for the model to identify, and which one is the hardest?

In [None]:
# choosing new masks
conditions_new = conditions[conditions != b'rest']
fmri_masked_new = fmri_masked[conditions != b'rest']
fmri_masked_new.shape

In [None]:
# running one-vs-one multiclass classification
# note that this will take a while... can you explain why?
from sklearn.model_selection import cross_val_score, ShuffleSplit, LeaveOneOut
from sklearn.svm import SVC
svc_new_ovo = SVC(kernel='linear', decision_function_shape="ovo")
scores = cross_val_score(svc_new_ovo, fmri_masked_new, conditions_new, cv=LeaveOneOut())
print("Scores: {}, mean score = {:03.2f}".format(scores, scores.mean()))

In [None]:
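# let's try one-vs-all now
# note: scikit-learn's SVC always trains one-vs-one internally and
# decision_function_shape only changes the shape of the decision function,
# so do not expect a large difference in run time or scores from this setting
from sklearn.model_selection import cross_val_score, ShuffleSplit, LeaveOneOut
from sklearn.svm import SVC
svc_new_ovr = SVC(kernel='linear', decision_function_shape="ovr")
scores = cross_val_score(svc_new_ovr, fmri_masked_new, conditions_new, cv=LeaveOneOut())
print("Scores: {}, mean score = {:03.2f}".format(scores, scores.mean()))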
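
To compare the cost of the two strategies directly, one option is to use the explicit wrappers from ``sklearn.multiclass`` and count how many binary classifiers each one trains. The sketch below uses part of the digits dataset as a stand-in for the fMRI data (an assumption made to keep the example self-contained).

```python
import numpy as np
from sklearn import datasets
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

digits = datasets.load_digits()
X_d, y_d = digits.data[:500], digits.target[:500]
n_classes = len(np.unique(y_d))

ovo = OneVsOneClassifier(SVC(kernel='linear')).fit(X_d, y_d)
ovr = OneVsRestClassifier(SVC(kernel='linear')).fit(X_d, y_d)

# one-vs-one trains one classifier per pair of classes,
# one-vs-rest trains one classifier per class
print("classes:", n_classes)
print("one-vs-one estimators: ", len(ovo.estimators_))
print("one-vs-rest estimators:", len(ovr.estimators_))
```

The count alone does not settle which strategy is faster: each one-vs-one classifier sees only the samples of two classes, while each one-vs-rest classifier is trained on all samples.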

In [None]:
# let's split manually into two sets and see which conditions are easier to identify
# since one-vs-all gives the same results, we will use this model
from sklearn.model_selection import train_test_split
fmri_new_tr, fmri_new_ts, cond_new_tr, cond_new_ts = train_test_split(fmri_masked_new, conditions_new)
svc_new_ovr.fit(fmri_new_tr, cond_new_tr)
cond_new_pred = svc_new_ovr.predict(fmri_new_ts)

# for each condition: fraction of its test samples that were predicted correctly
acc_cond = {}
for cn in np.unique(cond_new_ts):
    is_cn = cond_new_ts == cn
    acc_cond[cn] = float(np.mean(cond_new_pred[is_cn] == cn))

print(acc_cond)
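
The per-condition values computed above are the per-class recall, which can also be read from a confusion matrix. A self-contained sketch on the Iris data, used here as a stand-in for the fMRI conditions (the variable names are chosen for illustration):

```python
import numpy as np
from sklearn import datasets
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_tr, X_ts, y_tr, y_ts = train_test_split(iris.data, iris.target,
                                          random_state=0, stratify=iris.target)
pred = SVC(kernel='linear').fit(X_tr, y_tr).predict(X_ts)

# rows are true classes, columns are predicted classes
cm = confusion_matrix(y_ts, pred)
# diagonal / row sum = fraction of each true class predicted correctly
per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class)
print(recall_score(y_ts, pred, average=None))
```

The off-diagonal entries additionally show which conditions are confused with each other, which the dictionary of accuracies does not.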