# Exercises 16: Matplotlib
Now let's make some nice plots!

## Exercise 16.1
Let's get back to our data on diabetes disease progression.

In [None]:
%matplotlib inline
import numpy as np
from sklearn.datasets import load_diabetes
try:
    diabetes = load_diabetes(scaled=False)
except TypeError:  # older scikit-learn versions do not accept the `scaled` argument
    diabetes = load_diabetes()
    diabetes.data[:,1] = np.where(diabetes.data[:,1] > 0, 2, 1)

Remember that a description of the data can be found in `diabetes.DESCR`. The data itself can be accessed with `diabetes.data`, a numpy array holding the values of the 10 measured features. The target value, here the progression of the illness, is in `diabetes.target`. The names of the features can be found in `diabetes.feature_names`.
### Exercise 16.1.1
We want to look at the correlation between the *BMI* and the disease progression.
* plot the disease progression (y-axis) as a function of BMI (x-axis).
* use green crosses as markers
* set the x and y axis labels (to "BMI" and "Diabetes progression")
* set the title of the plot to "Correlation between BMI and diabetes progression"
* set the fontsize to 14 for the axes labels and 16 for the title

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.figure()
plt.plot(diabetes.data[:,2], diabetes.target, 'gx')
plt.xlabel("BMI", fontsize=14)
plt.ylabel("Diabetes progression", fontsize=14)
plt.title("Correlation between BMI and diabetes progression", fontsize=16)
plt.show()
plt.close()

### Exercise 16.1.2 (Supplementary)
The axis tick labels are still a bit small to look good. Set their fontsize to 12.

*HINTs*: 
* use `matplotlib.rcParams` to change this font size. You will need to import `matplotlib`. I suggest you import it as `mpl`
* `mpl.rcParams.keys()` will show you all the available things that you can customize. Because there are so many, it's not easy to find what you want. Try a list comprehension to filter only relevant keys `[k for k in mpl.rcParams.keys() if "size" in k]`

In [None]:
import matplotlib as mpl
print([k for k in mpl.rcParams.keys() if "size" in k])

In [None]:
mpl.rcParams['ytick.labelsize'] = 12
mpl.rcParams['xtick.labelsize'] = 12
plt.figure()
plt.plot(diabetes.data[:,2], diabetes.target, 'gx')
plt.xlabel("BMI", fontsize=14)
plt.ylabel("Diabetes progression", fontsize=14)
plt.title("Correlation between BMI and diabetes progression", fontsize=16)
plt.show()
plt.close()
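
Changing `mpl.rcParams` affects every figure made afterwards. As an alternative, here is a minimal sketch that sets the tick label size on a single axes with `tick_params`. It uses synthetic stand-in data (not the diabetes dataset), and the names `rng`, `x`, `y`, `fig` and `ax` are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (NOT the diabetes dataset), just to have points to plot
rng = np.random.default_rng(0)
x = rng.normal(26, 4, size=50)
y = 5 * x + rng.normal(0, 20, size=50)

fig, ax = plt.subplots()
ax.plot(x, y, "gx")
ax.set_xlabel("BMI (synthetic)", fontsize=14)
ax.set_ylabel("Progression (synthetic)", fontsize=14)
# Change the tick label size for this axes only; mpl.rcParams is left untouched
ax.tick_params(axis="both", labelsize=12)
```

In a notebook with `%matplotlib inline`, the figure is rendered at the end of the cell.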

### Exercise 16.1.3
Let's see if our two categories *man* and *woman* overlap on that plot. Redo the plot above, but with *men* as green crosses and *women* as red circles.

*Hint: Remember the mask from exercise 15.2.2 e*
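
As a reminder of how a boolean mask works, here is a small sketch on toy arrays. The values and the names `category`, `values`, `is_two` and `is_one` are hypothetical and only serve the illustration.

```python
import numpy as np

# Toy arrays (hypothetical values): one category code and one measurement per row
category = np.array([1, 2, 2, 1, 2])
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

is_two = category == 2           # boolean array, True where the code is 2
is_one = np.logical_not(is_two)  # the complementary mask

print(values[is_two])  # [20. 30. 50.]
print(values[is_one])  # [10. 40.]
```

The same mask can index several arrays of the same length, which keeps x and y values aligned when plotting a subset.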

In [None]:
# First we prepare the mask
female = diabetes.data[:, 1] == 2
male = np.logical_not(female)

In [None]:
# Now we make the plot
plt.figure()
plt.plot(diabetes.data[male, 2], diabetes.target[male], 'gx', label="men")
plt.plot(diabetes.data[female, 2], diabetes.target[female], 'ro', label="women")
plt.xlabel("BMI", fontsize=14)
plt.ylabel("Diabetes progression", fontsize=14)
plt.title("Correlation between BMI and diabetes progression", fontsize=16)
plt.legend(loc="best")
plt.show()
plt.close()

### Exercise 16.1.4
Let's have a look at the distribution of diabetes progression.
#### Exercise 16.1.4a 
Make a histogram of the diabetes progression

In [None]:
plt.figure()
plt.hist(diabetes.target)
plt.xlabel("Diabetes progression", fontsize=14)
plt.show()
plt.close()

#### Exercise 16.1.4b
Now split the data into two parts, *men* and *women*, and show both histograms on the same plot.

In [None]:
plt.figure()
plt.hist([diabetes.target[male], diabetes.target[female]], density=True, label=['men', 'women'])
plt.xlabel("Diabetes progression", fontsize=14)
plt.legend()
plt.show()
plt.close()

### Exercise 16.1.5
It can be useful to look at whether different features are correlated with one another or not. So let's make an image showing the correlation matrix between all the features plus the target. I've prepared the matrix for you in `corr_mat`.
* To show an image, in this case a matrix, you can use the `plt.imshow` function
* the default colormap is ugly, many others can be found in `matplotlib.cm`. For correlations `matplotlib.cm.bwr` is nice (white for no correlation and red for positive, blue for negative). 
* use `vmin=-1` and `vmax=1` to set the min and max value for the colormap
* Finally, the tick labels should be the names of the features. I've prepared an array `labels` containing all the labels. You can set the labels using `plt.xticks` and `plt.yticks`. They take two arrays as arguments: the positions of the ticks and their labels.
* for the x-axis, the labels should be rotated to be readable. You can use `rotation=90` as supplementary argument in the `plt.xticks` call.
* `plt.colorbar` will show the colorbar
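
The correlation matrix itself comes from `np.corrcoef`, which by default treats each row of its input as one variable. The sketch below, on hypothetical random data rather than the diabetes set, illustrates that convention and the `rowvar=False` alternative for data stored with one variable per column. The names `X`, `corr_a` and `corr_b` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples (rows) of 3 variables (columns)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)  # make column 2 track column 0

# By default np.corrcoef expects one variable per ROW, hence the transpose
corr_a = np.corrcoef(X.T)
# Equivalent: declare that the variables are stored in the columns
corr_b = np.corrcoef(X, rowvar=False)

print(corr_a.shape)
print(np.allclose(corr_a, corr_b))
```

Either way the result is a square, symmetric matrix with ones on the diagonal, which is why `vmin=-1` and `vmax=1` cover its full range.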

In [None]:
corr_mat = np.corrcoef(np.concatenate((diabetes.data, diabetes.target[:,np.newaxis]),axis=1).T)
labels = np.append(diabetes.feature_names,"Progression")

In [None]:
plt.figure()
plt.imshow(corr_mat, vmin=-1, vmax=1, cmap=mpl.cm.bwr)
plt.xticks(range(11),labels, rotation=90)
plt.yticks(range(11),labels)
plt.colorbar()

In [None]:
help(plt.imshow)

# Supplementary
### Exercise 16.1.6
Now let's look at the distributions for the features:
* We first normalise the data; I've done that for you
* Make a histogram of all the features whose correlation coefficient with the target is above 0.4 in absolute value. I've already prepared the array of absolute correlation coefficients, `correlations`. Use 15 bins.
* Set the label of the y-axis to "Probability"
* This plot is a bit hard to read, we will try to improve on it in the next figure

In [None]:
diabetes.data = (diabetes.data - np.average(diabetes.data, axis=0))/np.std(diabetes.data, axis=0)
correlations = np.abs(np.corrcoef(np.concatenate((diabetes.data, diabetes.target[:,np.newaxis]),axis=1).T))[:-1,-1]
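
The normalisation used here subtracts each column's mean and divides by its standard deviation. A minimal sketch on hypothetical data (the array `data` and its scales are made up for illustration) shows the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 samples of 3 features on very different scales
data = rng.normal(loc=[50.0, 1.5, 200.0], scale=[10.0, 0.5, 40.0], size=(500, 3))

# Column-wise normalisation: axis=0 computes one mean and one std per feature
normalised = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(normalised.mean(axis=0), 0))  # every column now has mean 0
print(np.allclose(normalised.std(axis=0), 1))   # and standard deviation 1
```

Putting all features on the same scale is what makes it reasonable to draw their histograms on a shared x-axis.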

In [None]:
cutoff = 0.4
highly_correlated = diabetes.data[:, correlations > cutoff]
high_corr_labels = [feature for feature, corr in zip(diabetes.feature_names, correlations) if corr > cutoff]

In [None]:
plt.hist(highly_correlated, bins=15, density=True, label=high_corr_labels)
plt.legend(loc="best")
plt.ylabel("Probability")
plt.show()
plt.close()

### Exercise 16.1.7
Histograms represented with bars are hard to read. You could instead use `histtype='step'`, but I find even this hard to read. Instead, we will use the data generated from the `plt.hist` call to make a normal line plot.
* `plt.hist` returns 3 arrays: the values, the bin edges, and the patches. Retrieve those and use them to make a normal line plot
* label the curves and show the labels

So here first a plot with `histtype=step`:

In [None]:
plt.hist(highly_correlated, bins=15, density=True, histtype='step', label=high_corr_labels)
plt.legend(loc="best")
plt.ylabel("Probability")
plt.show()
plt.close()

Now let's make a line plot of the histograms.

In [None]:
y, x, p = plt.hist(highly_correlated, bins=15, density=True)
plt.close()
x = 0.5 * (x[1:] + x[:-1])
plt.figure()
for fname, yi in zip(high_corr_labels, y):
    plt.plot(x, yi, label=fname)
plt.ylabel("Probability")
plt.xlabel("Normalised feature value")
plt.legend(loc="best")
plt.show()
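
If you only need the numbers and not the bar plot, `np.histogram` computes the same kind of values and bin edges without drawing anything. The following is a sketch on hypothetical random samples standing in for one normalised feature. The names `samples`, `density`, `edges`, `centers` and `area` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)  # hypothetical stand-in for one normalised feature

# density=True rescales the counts so that the histogram integrates to 1
density, edges = np.histogram(samples, bins=15, density=True)

# There is one more edge than there are bins; use the bin centers as x values
centers = 0.5 * (edges[1:] + edges[:-1])
print(len(density), len(edges), len(centers))

# Sum of (height * bin width) over all bins
area = np.sum(density * np.diff(edges))
print(area)
```

Note that `np.histogram` works on one dataset at a time. To compare several features on common bins, as `plt.hist` does when given several columns, you would pass the same bin edges to each call.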

# Supplementary: Subplots
In this exercise we look at how to make subplots in matplotlib. You can find an example and some explanations at the end of the presentation on matplotlib.
## Exercise 16.2 

Make a figure with one plot for every feature except `sex` (which is categorical), showing the disease progression on the y-axis as a function of that feature on the x-axis. This is 9 plots, so arrange them as 3 rows of 3 plots.
#### Exercise 16.2.1
Make this plot and try to make it nice, with titles, axis labels, reasonable size, etc.

*HINT: for most features that are normally set through `plt.feature()`, for a subplot `ax` (an `AxesSubplot` object) the feature is set with `ax.set_feature`. So for example instead of `plt.title("My title")`, for a subplot `ax` it would be `ax.set_title("My title")`*
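
To illustrate the hint, here is a minimal sketch with two subplots on made-up data, using the `set_` methods of each axes where a single plot would use the `plt.` functions. The variables `x`, `fig`, `axes` and `power` are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 20)  # made-up data for the illustration

# plt.subplots returns the figure and an array of axes (here 1 row, 2 columns)
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
for ax, power in zip(axes, [1, 2]):
    ax.plot(x, x**power, "rx")
    ax.set_title("x to the power {}".format(power))  # instead of plt.title(...)
    ax.set_xlabel("x")                               # instead of plt.xlabel(...)
axes[0].set_ylabel("y")                              # instead of plt.ylabel(...)
```

With a 2D grid such as `plt.subplots(3, 3)`, `axes` is a 2D array, and `axes.flatten()` lets you loop over all subplots in one go.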

In [None]:
f, axes = plt.subplots(3, 3, figsize=(10, 10), sharey = True)
features = np.array(diabetes.feature_names)
mask =  features != "sex"
features = features[mask]
data = diabetes.data[:, mask]
for fname, xi, ax in zip(features, data.T, axes.flatten()):
    ax.plot(xi,diabetes.target,"rx")
    ax.set_title("{} vs {}".format("Progression",fname))
    ax.set_xlabel(fname)
for ax in axes[:,0]:
    ax.set_ylabel("Diabetes progression")
plt.subplots_adjust(wspace=0.05, hspace=0.35)

#### Exercise 16.2.2
Now we would like to have more explicit labels for the different features. For this we can use a dictionary mapping the feature keys to a nicer name. I've prepared the dictionary. Use it to redo the plot above but with the more explicit feature names for titles and labels...

In [None]:
fname_dict = dict(zip(features, ["Age", "Body Mass Index", "Blood Pressure",
                                 "Total Cholesterol", "Low-Density Lipoproteins", "High-Density Lipoproteins",
                                 "Total Cholesterol / HDL", "Serum Triglycerides", "Blood Sugar Level"]))

In [None]:
f, axes = plt.subplots(3, 3, figsize=(15, 10), sharey = True)
data = diabetes.data[:, mask]
for fname, xi, ax in zip(features, data.T, axes.flatten()):
    ax.plot(xi,diabetes.target,"rx")
    ax.set_title("{} vs {}".format("Progression",fname_dict[fname]))
    ax.set_xlabel(fname_dict[fname])
for ax in axes[:,0]:
    ax.set_ylabel("Diabetes progression")
plt.subplots_adjust(wspace=0.05, hspace=0.35)