In [None]:
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import os
import pandas as pd
import scipy.stats

#the datasets
import sklearn.datasets
import sklearn.cluster

#For better statistical plotting
import seaborn as sns

%matplotlib inline

# Using SVD and PCA to reduce dimensions

We will use the iris dataset that is included in the sklearn package and convert its data into a DataFrame.

In [None]:
iris = sklearn.datasets.load_iris()
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

## Task
 
 * Investigate what `iris` contains by reading `print(iris['DESCR'])`.
 * Investigate the structure of the data in the DataFrame with `describe()` and by looking at the head of the DataFrame.
 * Use `pairplot` from seaborn to investigate the relations between the features. Can you discover correlations between the different types of data by eye?

In [None]:
#code 

Singular value decomposition factorizes a matrix <br>
$X \in \mathbb{R}^{m \times n}$ <br> into <br> $X = U \, \Sigma \, V^T$ <br>
where:

* $U \in \mathbb{R}^{m \times m}$ contains the left singular vectors; its columns are orthonormal eigenvectors of $XX^T$
* $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix whose diagonal entries are the non-negative real singular values of $X$
* $V \in \mathbb{R}^{n \times n}$ contains the right singular vectors; its columns are orthonormal eigenvectors of $X^TX$
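
The properties listed above can be checked numerically. The following is a minimal sketch on a small random matrix (the matrix `X` and all variable names are assumptions made for illustration, not part of the lab data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # hypothetical m x n data matrix (m=6, n=4)

# full_matrices=True returns U as m x m and V^T as n x n
U, S, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild the m x n matrix Sigma from the 1D array of singular values
Sigma = np.zeros(X.shape)
Sigma[:len(S), :len(S)] = np.diag(S)

reconstruction_ok = np.allclose(X, U @ Sigma @ Vt)
U_orthonormal = np.allclose(U.T @ U, np.eye(U.shape[0]))
V_orthonormal = np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))

# The squared singular values are the eigenvalues of X^T X
eigvals = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]
eigen_ok = np.allclose(eigvals, S**2)

print(reconstruction_ok, U_orthonormal, V_orthonormal, eigen_ok)
```

All four checks should evaluate to `True` for any real matrix, up to floating point tolerance.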


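The SVD algorithm from the numpy package can calculate this quite efficiently. With `full_matrices=False` it returns the reduced form, in which $U$ is $m \times k$ and $V^T$ is $k \times n$ with $k = \min(m, n)$, and the singular values come back as a 1D array.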

In [None]:
U_iris, S_iris, Vt_iris = np.linalg.svd(df_iris, full_matrices=False)

## Task
Plot the cumulative sum of the entries in $S_{iris}^2$ divided by the sum of $S_{iris}^2$ and use equations 12.8 and 12.9 in the book to label the axes. What is the sum of $S_{iris}^2$? Use the answer to label the points in the legend of this plot.

In [None]:
#Code

## Task
Label the axes in the following plot. It seems as if the SVD has selected a vector along which the data is well separated. Try the other vectors and see if you can find others that separate the different flowers.

In [None]:
fig,ax=plt.subplots(1,2,figsize=(12,6))
selector=iris['target']

for i in np.unique(selector):
    selector_name=iris['target_names'][i]
    ax[0].scatter(U_iris[selector==i,0],U_iris[selector==i,1],label=selector_name)
    ax[1].scatter(df_iris.iloc[selector==i,0],df_iris.iloc[selector==i,1],label=selector_name)
for a in ax:
    a.legend()

In [None]:
iris['target_names']

### Illustration of SVD
To better understand the effect of SVD, we will look at a simplified dataset and normalize the data.
To simplify and illustrate the calculation of the projection onto an axis, we will rotate the axis system and use the y-value as the perpendicular projection $z_i$ in formula 12.5 or figure 12.2 in the book.

## Task

* Create a new DataFrame with the petal length and width that only contains the varieties 'versicolor' and 'virginica'.
* Center each of the axes (subtract the mean) and normalize the scale (divide by the standard deviation).
* Make a scatterplot of the normalized petal length against the normalized petal width.



## Task

Create a composite figure with a few plots, in each of which you rotate the axes by an angle $\alpha$. In polar form the coordinates of a point can be written as:
  <br> $x = r \cos(\theta)$ and $y = r \sin(\theta)$, where $\theta$ is the polar angle of the point <br>
  In the rotated axis system the polar angle becomes $\theta - \alpha$, so the rotation around the origin can be expressed as:<br>
  $$ \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos(\alpha) & \sin(\alpha) \\ -\sin(\alpha) & \cos(\alpha) \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}$$
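
The rotation matrix above could be applied to a whole set of points as sketched below. The helper name `rotate_axes` and the example points are assumptions for illustration only:

```python
import numpy as np

def rotate_axes(points, alpha):
    """Return the coordinates of points (N x 2) in an axis system rotated by alpha (radians)."""
    R = np.array([[np.cos(alpha), np.sin(alpha)],
                  [-np.sin(alpha), np.cos(alpha)]])
    # Each row of points is one (x, y) pair, so multiply with R transposed
    return points @ R.T

points = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rotated = rotate_axes(points, np.deg2rad(45))
print(rotated)
```

After rotating the axes by 45 degrees, the point (1, 1) lies on the new x-axis, so its new y-value vanishes. The distance of every point from the origin is unchanged by the rotation.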

## Task 

The y-value of each point after rotation corresponds to the $z_i$ value, i.e. the part that is not explained by the vector. We assume here that for the perfect vector the Euclidean distance is the same as the y-value. <br> The sum of the squared y-values corresponds to the variance that is not explained by the vector and that is to be minimized. The x-values (after rotation) correspond to the variance that is explained by the first independent vector. For $\alpha$ from 0 to 360 degrees, plot the sum of the squared y-values against the angle (note that the polar projection expects the angle in radians). With matplotlib this can e.g. be achieved with:

```
    fig, ax = plt.subplots(subplot_kw={'projection': 'polar'})
    ax.plot(alpha, r)
```
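
The quantities `alpha` and `r` for such a plot could be computed as in the following sketch. It uses synthetic correlated data as a stand-in for the normalized petal data (the data and all variable names are assumptions) and compares the angle of the minimum with the first right singular vector:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical centered, correlated 2D data (stand-in for the normalized petal data)
x = rng.normal(size=200)
pts = np.column_stack([x, 0.6 * x + 0.3 * rng.normal(size=200)])
pts -= pts.mean(axis=0)

alpha = np.deg2rad(np.arange(0, 360))    # angles in radians, 1 degree steps
# y' = -sin(alpha) * x + cos(alpha) * y for every angle (rows) and point (columns)
y_rot = -np.outer(np.sin(alpha), pts[:, 0]) + np.outer(np.cos(alpha), pts[:, 1])
r = (y_rot**2).sum(axis=1)               # unexplained sum of squares per angle

best_alpha = alpha[np.argmin(r)]

# Direction of the first right singular vector, folded into [0, 180) degrees
_, _, Vt = np.linalg.svd(pts, full_matrices=False)
svd_angle = np.arctan2(Vt[0, 1], Vt[0, 0]) % np.pi

print(np.rad2deg(best_alpha), np.rad2deg(svd_angle))
```

The curve repeats every 180 degrees because a vector and its negative describe the same axis, which is why the SVD direction is folded into the range from 0 to 180 degrees before comparing.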

## Task

Finally, compare whether the angle at the minimum of this plot matches the direction of the vector that you get back from a new SVD of this data.

# Limitations of SVD/PCA
It is a fair assumption to consider PCA a non-normalized version of SVD (check the math); however, there are other limitations that this part of the lab shall reveal. SVD is often used to look at measured (discrete) data and extract the normalized vectors from it. Here we shall first construct known data and then try to retrieve the components in it. For this we construct a set of non-overlapping spectra of different species and look at the progression of the kinetics:<br>
We excite the ground state (the excitation happens before the window of observation) and then see the components A, B, C decay back into the ground state:<br>
GS -> A -> B -> C -> GS
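
The scheme above, as simulated in the cell below, corresponds to first-order rate equations. Written out, assuming rate constants $k_i = 1/\tau_i$ with the lifetimes $\tau_1 = 0.1$, $\tau_2 = 10$ and $\tau_3 = 1000$ ps taken from the `rates` list in the code (the symbols $k_i$, $\tau_i$, $c_j$, $s_j$ and $D$ are notation introduced here for illustration):

$$ \frac{d[A]}{dt} = -k_1 [A], \qquad \frac{d[B]}{dt} = k_1 [A] - k_2 [B], \qquad \frac{d[C]}{dt} = k_2 [B] - k_3 [C] $$

The simulated data matrix is then the sum over all species of concentration profile times spectrum, with the ground state entering as a negative (bleach) contribution:

$$ D(t, \lambda) = \sum_{j \in \{A, B, C, GS\}} c_j(t) \, s_j(\lambda), \qquad c_{GS}(t) = -\left( c_A(t) + c_B(t) + c_C(t) \right) $$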

In [None]:
gauss = lambda x,mu,sigma=50: np.exp(-((x-mu)/sigma)**2)
x=np.arange(300,800,1)
spectra=pd.DataFrame({'A':gauss(x=x,mu=400),'B':gauss(x=x,mu=600),'C':gauss(x=x,mu=700),'GS':gauss(x=x,mu=500)},index=x)
spectra.index.name='wavelength in nm'
time=np.logspace(-2,4,int(1e4))
c=np.zeros((len(time),3))
rates=[1/0.1,1/10,1/1000]
c[0,0]=1
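for i,t in enumerate(time[1:]):
    prev=c[i,:]      # concentrations at the previous time point (only read)
    dc=prev.copy()   # work on a copy, otherwise row i is overwritten in place
    timestep=t-time[i]
    # explicit Euler step of the sequential decay A -> B -> C -> GS
    dc[0]+=-rates[0]*timestep*prev[0]
    dc[1]+=rates[0]*timestep*prev[0]-rates[1]*timestep*prev[1]
    dc[2]+=rates[1]*timestep*prev[1]-rates[2]*timestep*prev[2]
    dc[dc<0]=0       # clamp small negative values caused by large time steps
    c[i+1,:]=dc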
c=pd.DataFrame(c,index=time,columns=['A','B','C'])
c['GS']=-1*c.sum(axis=1)
c.index.name='time in ps'
data=0
for i,col in enumerate(c.columns):
    A,B=np.meshgrid(c.loc[:,col].values,spectra.loc[:,col].values)
    C=pd.DataFrame((A*B).T,index=c.index)
    if i==0: data=C
    else: data=data+C
data.columns=spectra.index.values

fig,ax=plt.subplots(1,3,figsize=(14,4))
spectra.plot(ax=ax[0])
c.plot(ax=ax[1])
ax[1].set_xscale('log')
X, Y = np.meshgrid(data.columns.values.astype(float), data.index.values.astype(float))
ax[2].pcolormesh(X,Y,data.values,cmap='jet')
ax[2].set_yscale('log')
ax[2].set_xlabel(spectra.index.name)
ax[2].set_ylabel(c.index.name)
fig.tight_layout()

## Task
Use SVD on the DataFrame `data` just generated and plot the spectra, the kinetics and the strength of the first 5 singular vectors in this data. Why are they different from the spectra and kinetics used to construct the data? Any idea how you would fix that?<br>Hint: Check that you have the right orientation of the matrices by comparing their dimensions to those of the DataFrame.

In [None]:
#**code**

# Clustering

In this lab we will look into both K-means clustering and hierarchical clustering. <br>
Please use the **df_iris** from above for these plots.

## K-Means clustering

Use the code below to create a plot with 5 rows and 5 columns. Now use `sklearn.cluster.KMeans` to calculate a new "selector" that separates the different flower types. In rows 2-5, vary the number of clusters you calculate over [1,2,3,4,5] (one per column) and the number of random initializations you test over [1,2,5,20] (one per row). Closely investigate the plots and formulate briefly where and when they differ.

Now use the same code but only give it two of the 4 columns (e.g. the first two) and observe the difference. Can you describe it?
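
As a starting point, the call to `sklearn.cluster.KMeans` could look like the following sketch. It runs on synthetic blobs from `make_blobs` instead of `df_iris` (the stand-in data and the variable names are assumptions), with `n_clusters` setting the number of clusters and `n_init` the number of random initializations:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in data: three blobs in 4 dimensions
X, _ = make_blobs(n_samples=150, centers=3, n_features=4, random_state=0)

inertia = {}
for n_clusters in [1, 2, 3, 4, 5]:
    km = KMeans(n_clusters=n_clusters, n_init=20, random_state=12345)
    labels = km.fit_predict(X)       # the new "selector": one cluster index per row
    inertia[n_clusters] = km.inertia_

print(inertia)
```

The `labels` array can be used in place of `selector_given` in the scatter plots. The `inertia_` attribute (sum of squared distances to the closest cluster center) is one way to compare runs with different settings.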
    


In [None]:
np.random.seed(12345)  # we start with the same random number to make the results reproducible
fig,ax=plt.subplots(5,5,figsize=(12,12))
#plot the selection as given in the data 
selector_given=iris['target']
for i in np.unique(selector_given):
    selector_name=iris['target_names'][i]
    for j in range(5):
        ax[0,j].scatter(df_iris.iloc[selector_given==i,2],df_iris.iloc[selector_given==i,3],label=selector_name,s=2)

#code here


# Choosing the right Clustering method

Try to separate the domains in all the images in the subfolder "Smileys" using the clustering methods K-means and spectral clustering from `sklearn.cluster`.
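
To get a feeling for how the two methods differ, the sketch below compares them on synthetic concentric circles from `make_circles` rather than on the images (the data set, the parameter values and the use of the adjusted Rand index as a score are assumptions for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: clusters that are not separable by distance to a center
X, y_true = make_circles(n_samples=300, factor=0.4, noise=0.03, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc_labels = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                               n_neighbors=10, random_state=0).fit_predict(X)

# Agreement with the known ring membership (1 = perfect, around 0 = random)
ari_kmeans = adjusted_rand_score(y_true, km_labels)
ari_spectral = adjusted_rand_score(y_true, sc_labels)
print(ari_kmeans, ari_spectral)
```

K-means assigns points to the nearest cluster center and therefore tends to cut the rings in half, while spectral clustering works on a neighborhood graph and can follow connected shapes. For images, the pixel coordinates and intensities would first have to be arranged as rows of a feature matrix.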


The domain image is from: 
Gorchy - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=4459327

Hint: for real images, separating the clusters using an edge detection method can sometimes help. Here is a simple example of edge detection using the Canny detector from OpenCV, which builds on convolutions with small kernels.
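
A plain kernel convolution for edge detection could be sketched as follows, using Sobel kernels and `scipy.ndimage.convolve` on a small synthetic image (the test image and the choice of the Sobel kernels are assumptions for illustration):

```python
import numpy as np
from scipy import ndimage

# Hypothetical test image: a bright square on a dark background
image = np.zeros((20, 20))
image[5:15, 5:15] = 1.0

# Sobel kernels respond to intensity changes along x and y
kernel_x = np.array([[-1, 0, 1],
                     [-2, 0, 2],
                     [-1, 0, 1]], dtype=float)
kernel_y = kernel_x.T

grad_x = ndimage.convolve(image, kernel_x, mode='nearest')
grad_y = ndimage.convolve(image, kernel_y, mode='nearest')
edges = np.hypot(grad_x, grad_y)     # gradient magnitude, large at edges
```

The result is zero in flat regions and non-zero along the border of the square. It can be displayed with `imshow` in the same way as the images in the next cell.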

In [None]:
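import numpy as np
from scipy import ndimage  # image processing tools (e.g. convolution filters)
import skimage.data        # sample images such as coins()
import cv2                 # OpenCV, provides the Canny edge detector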

coins = skimage.data.coins()
edges = cv2.Canny(coins, 100, 200)
fig,ax=plt.subplots(1,2,figsize=(16,8))
ax[0].imshow(coins, cmap='gray')
ax[1].imshow(edges, cmap='gray')