# Exercises 17: scipy
Let's use scipy's optimization module to fit functions to some data.

As in the previous exercises, we will work with the diabetes progression dataset. We will also use matplotlib and numpy, which are essential when working with data. Let's import them below and make sure plots are displayed inline in this notebook:

In [None]:
%matplotlib inline
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
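try:
    # newer scikit-learn versions can return the raw, unscaled feature values
    diabetes = load_diabetes(scaled=False)
except TypeError:
    # older versions do not accept `scaled` and only return standardized
    # features, so we restore the two sex categories as 1 and 2
    diabetes = load_diabetes()
    diabetes.data[:,1] = np.where(diabetes.data[:,1] > 0, 2, 1)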

## Exercise 17.1
### Exercise 17.1.1
As we saw in Ex. 16.1.5, `s3` is strongly correlated with `s4`. Try to fit the function `exponential` I defined for you below to `s4` as a function of `s3`, using the `scipy.optimize.curve_fit` function. Then plot the data and the fitted function.

In [None]:
from scipy import optimize

def exponential(x, a, b, c): 
    return a * np.exp(- 0.01 * b * x) + c

In [None]:
# get the data to fit from the diabetes dataset
i = diabetes.feature_names.index("s3")
j = diabetes.feature_names.index("s4")
x_data = diabetes.data[:,i]
y_data = diabetes.data[:,j]

# fit exponential to the data
fit_result = optimize.curve_fit(exponential, x_data, y_data)
print(fit_result)
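
`curve_fit` returns a tuple: the best-fit parameters and their estimated covariance matrix. The sketch below unpacks both and turns the covariance diagonal into one-standard-deviation parameter errors. It uses synthetic data rather than the diabetes dataset; `true_params`, `x_demo`, `y_demo`, the noise level and the starting guess `p0` are all assumptions made for illustration.

```python
import numpy as np
from scipy import optimize

def exponential(x, a, b, c):
    return a * np.exp(-0.01 * b * x) + c

# Synthetic data (assumed values, for illustration only)
rng = np.random.default_rng(0)
true_params = (5.0, 3.0, 2.0)
x_demo = np.linspace(0, 100, 200)
y_demo = exponential(x_demo, *true_params) + rng.normal(scale=0.05, size=x_demo.size)

# p0 is a rough starting guess; if omitted, curve_fit starts from all ones
popt, pcov = optimize.curve_fit(exponential, x_demo, y_demo, p0=(4.0, 2.0, 1.0))

# The square root of the covariance diagonal estimates the parameter errors
perr = np.sqrt(np.diag(pcov))
for name, value, err in zip("abc", popt, perr):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

In the cell above, the same two pieces are available as `fit_result[0]` (parameters) and `fit_result[1]` (covariance).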

In [None]:
# Now we plot the data and the fitted function
plt.figure()

# here we plot the data
plt.plot(x_data, y_data, "x", label = "data")

# Now we plot the fitted function. We want to evaluate it for each
# point in our x_data. To be able to plot the function as a line, we
# sort the x_data first.
x_fit = np.sort(x_data)
# fit_result[0] holds the fitted parameters (a, b, c); * unpacks them
y_fit = exponential(x_fit, *fit_result[0])
plt.plot(x_fit, y_fit, "--", label = "fit")

# Finally we set some legends and labels for our plot
plt.legend(loc = "best")
plt.xlabel(diabetes.feature_names[i])
plt.ylabel(diabetes.feature_names[j])
plt.show()
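
The plot gives a visual impression of how well the function describes the data. One common numerical complement is the coefficient of determination R², which compares the residuals of the fit with the spread of the data around its mean. A minimal sketch follows; `r_squared` is a hypothetical helper, not a scipy function, and the small arrays are made-up values.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up observations and model predictions
y_obs = [1.0, 2.0, 3.0, 4.0]
y_model = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y_obs, y_model)
print(round(r2, 3))
```

For the fit above, this could be called as `r_squared(y_data, exponential(x_data, *fit_result[0]))`.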

### Exercise 17.1.2
Now let's look at `bmi` and the disease progression, which have a decent positive correlation. As above, use `scipy.optimize.curve_fit` to fit a function to the data, but this time fit a simple line. Then plot the data with the fitted line.

In [None]:
def line(x, a, b): 
    return a*x + b

# get the data to fit from the diabetes dataset
i = diabetes.feature_names.index("bmi")
x_data = diabetes.data[:,i]
y_data = diabetes.target

# fit line to the data
fit_result = optimize.curve_fit(line, x_data, y_data)
print(fit_result)
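
For a straight line, the least-squares solution also has a closed form, so the result of `curve_fit` can be cross-checked. The sketch below uses `scipy.stats.linregress`, which additionally reports the correlation coefficient. The data are synthetic (the slope, intercept and noise level are assumed values), so the numbers do not correspond to the diabetes dataset.

```python
import numpy as np
from scipy import optimize, stats

def line(x, a, b):
    return a * x + b

# Synthetic data (assumed values, for illustration only)
rng = np.random.default_rng(1)
x_demo = rng.uniform(18, 42, size=100)
y_demo = 10.0 * x_demo - 100.0 + rng.normal(scale=20.0, size=x_demo.size)

# Iterative least-squares fit
popt, pcov = optimize.curve_fit(line, x_demo, y_demo)

# Closed-form linear regression, which also returns the correlation
res = stats.linregress(x_demo, y_demo)

print("curve_fit :", popt)
print("linregress:", res.slope, res.intercept)
print("correlation coefficient r:", res.rvalue)
```

Both approaches minimize the same sum of squared residuals, so slope and intercept should agree up to numerical tolerance.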

In [None]:
plt.figure()
plt.plot(x_data, y_data, "x", label = "data")

# Now we plot the fitted function. As it is a line, we only need
# to evaluate it at two points to be able to plot it correctly. For that
# we simply use the current x-limits of the plot so that our fitted line will
# span the whole x-range of our plot.
x_fit = np.array(plt.xlim())
# fit_result[0] holds the fitted parameters (a, b); * unpacks them
y_fit = line(x_fit, *fit_result[0])
plt.plot(x_fit, y_fit, "--", label = "fit")

# Finally we set some legends and labels for our plot
plt.legend(loc = "best")
plt.xlabel(diabetes.feature_names[i])
plt.ylabel("Disease progression")
plt.show()

# Supplementary
We will now do some clustering with scipy. For these exercises you will need to look at the supplementary slides.

## Exercise 17.2
In our dataset there is one categorical variable (`"sex"`). We will use all the other features for clustering, generate 2 clusters and then look at them.
### Exercise 17.2.1
Use k-means clustering (`scipy.cluster.vq.kmeans2`) to generate 2 clusters from all the features except `"sex"`.

In [None]:
from scipy import cluster
# We use a mask to filter out the column for the sex feature
data = diabetes.data[:, np.array(diabetes.feature_names) != "sex"]
print("our data has dimensions", data.shape)

# Now we perform the kmeans clustering with two clusters
cluster_centroids, cluster_assignments = cluster.vq.kmeans2(data, 2)
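
Two practical points about `kmeans2` are worth a sketch. k-means is based on Euclidean distances, so features with large numerical ranges dominate unless the data are rescaled, and the result depends on the random initialization. The example below is one possible variant, not the solution expected by the exercise: it rescales with `scipy.cluster.vq.whiten`, uses the `"++"` initialization and fixes the `seed` keyword (available in SciPy 1.7 and later). The two synthetic blobs are made-up data.

```python
import numpy as np
from scipy.cluster import vq

# Two synthetic blobs (assumed data); the second feature has a much larger scale
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=[1.0, 100.0], size=(50, 2))
blob_b = rng.normal(loc=[10.0, 0.0], scale=[1.0, 100.0], size=(50, 2))
points = np.vstack([blob_a, blob_b])

# whiten divides each feature by its standard deviation
scaled = vq.whiten(points)

# k-means++ initialization with a fixed seed for reproducible assignments
centroids, labels = vq.kmeans2(scaled, 2, minit="++", seed=0)

print("feature std after whitening:", scaled.std(axis=0))
print("points per cluster:", np.bincount(labels, minlength=2))
```

Whether whitening is appropriate depends on the question being asked; the cell above clusters the unscaled features.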

### Exercise 17.2.2
Now make a plot of the disease progression (y-axis) as a function of the `"sex"` feature, but with each cluster plotted with its own marker and color (you can use a mask for that).

Are the clusters well separated on these variables?

In [None]:
plt.figure()

i = diabetes.feature_names.index('sex')
# cluster_assignments is an array of 0 and 1, assigning each data point to
# one of the two clusters. We can use it for masking
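mask1 = cluster_assignments == 1
plt.plot(diabetes.data[mask1,i], diabetes.target[mask1], "o", label = "cluster 1")

mask2 = np.logical_not(mask1)
plt.plot(diabetes.data[mask2,i], diabetes.target[mask2], "x", label = "cluster 0")

# label the axes and add a legend so the clusters can be told apart
plt.legend(loc = "best")
plt.xlabel(diabetes.feature_names[i])
plt.ylabel("Disease progression")
plt.show()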
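
Because `"sex"` takes only two values, many markers overlap in this plot. A small count table of cluster against category can complement it. The sketch below builds such a table with numpy; the arrays `labels` and `sex` are made-up stand-ins, and the coding of the categories as 1 and 2 is an assumption that matches the import cell at the top.

```python
import numpy as np

# Made-up stand-ins for cluster_assignments and the "sex" column
labels = np.array([0, 0, 1, 1, 1, 0, 1, 0])
sex = np.array([1, 2, 1, 2, 2, 1, 1, 2])

# Rows: cluster 0/1, columns: sex category 1/2
table = np.zeros((2, 2), dtype=int)
np.add.at(table, (labels, sex - 1), 1)  # count each (cluster, category) pair

print(table)
```

With the notebook's variables, `labels` would be `cluster_assignments` and `sex` would be `diabetes.data[:, i].astype(int)`. Similar counts in every cell suggest that the clusters do not follow this variable.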

### Exercise 17.2.3
Now make a plot of `"s3"` (y-axis) as a function of the `"s1"` feature, but with each cluster plotted with its own marker and color as above.

Are the clusters well separated on these two variables?

In [None]:
plt.figure()

# We reuse mask1 and mask2 from the previous exercise to select the
# data points of each of the two clusters
i = diabetes.feature_names.index('s1')
j = diabetes.feature_names.index('s3')
plt.plot(diabetes.data[mask1,i], diabetes.data[mask1,j], "o", label = "cluster 1")
plt.plot(diabetes.data[mask2,i], diabetes.data[mask2,j], "x", label = "cluster 0")
plt.legend(loc = "best")
plt.xlabel(diabetes.feature_names[i])
plt.ylabel(diabetes.feature_names[j])
plt.show()
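
To back up the visual impression with a number, one crude option is to compare the two cluster means of a feature with that feature's overall spread. The helper `separation` below is hypothetical (it is not part of scipy), and the example values are made up. Larger values indicate that the clusters differ more strongly on that feature.

```python
import numpy as np

def separation(values, labels):
    """Absolute difference of the two cluster means, in units of the overall std."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    mean0 = values[labels == 0].mean()
    mean1 = values[labels == 1].mean()
    return abs(mean0 - mean1) / values.std()

# Made-up feature values with a clean and a mixed cluster assignment
values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
well_separated = separation(values, [0, 0, 0, 1, 1, 1])
mixed = separation(values, [0, 1, 0, 1, 0, 1])
print(well_separated, mixed)
```

In the notebook this could be evaluated per feature, for example `separation(diabetes.data[:, i], cluster_assignments)`.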