# Foundation Data Sciences
## Week 08: Principal Component Analysis

**Learning outcomes:** 
In this lab you will implement PCA from scratch and compare your outcomes to the standard scikit-learn PCA. By the end of the lab you should be able to:
- explain the difference between standardization, normalization, and scaling,
- explain why it is important to standardize data prior to PCA,
- implement PCA from scratch,
- use the sklearn library to get the principal components from a data set.

In this lab we will first apply PCA to a breast cancer data set to help predict whether a cancer is benign or malignant. As with the diabetes data set, this week we will take only the first two steps: plotting the initial data and preparing the data. We will, however, learn a new preprocessing step that makes it easier to inspect the data and to apply later steps more efficiently. In the next lab, we will run actual algorithms to predict the diagnosis of patients from their symptoms.
In the last part of this lab, we will return to the diabetes data set from week 05 to get one step closer to predicting whether patients have diabetes.

**Data set information:** The new data set is from [UCI](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)). It contains patient information taken from patients with benign and malignant breast cancer.

In [2]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#Importing sklearn functions
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

**Discussion:** With your lab partner, recall what PCA is useful for.

Your answer:

We start as usual by loading data, inspecting data, and cleaning data.

**Exercise 01:**
- Load the `breast_cancer.csv` file from `datasets`, and store it as `breast_cancer`.
- Print out the first few lines of the data set.
- We have a few problems in the data set:
    - we don't need the patients' ids, and for privacy reasons we want to remove them,
    - the diagnosis column contains the values 'M' and 'B', a format that makes it hard to apply mathematical operations, and
    - the last column 'Unnamed: 32' seems completely useless.
- Fix the problems as follows:
    - Remove the `id` column and the `Unnamed: 32` column.
    - Replace the 'M' and 'B' values with 1 and 0, respectively. (Hint: look up `replace` in the pandas documentation.)
    - Remove all rows that have NaN values.
- Plot the data set in a pair plot, where malignant and benign cancers are colored differently. (Hint: for speed and readability, use only 8 columns of your choice.)
- Save the diagnosis column in a separate pandas series, called `diagnosis` and remove it from the `breast_cancer` dataframe.
-  Finally, print out the first few entries to make sure that all changes you made to the data set are as expected.

In [88]:
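# Your code
breast_cancer = pd.read_csv('datasets/breast_cancer.csv')
print(breast_cancer.head())

# Drop the columns we do not need (do this before dropping NaN rows)
breast_cancer = breast_cancer.drop(columns=['id', 'Unnamed: 32'])

# Encode the diagnosis numerically: malignant -> 1, benign -> 0
breast_cancer['diagnosis'] = breast_cancer['diagnosis'].replace({'M': 1, 'B': 0})

# Remove all rows that still contain missing values
breast_cancer = breast_cancer.dropna()

# Pair plot of 8 columns, colored by diagnosis
sns.pairplot(breast_cancer, hue='diagnosis', vars=['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'symmetry_mean'])
plt.show()

# Keep the labels in a separate series and remove them from the features
diagnosis = breast_cancer['diagnosis']
breast_cancer = breast_cancer.drop(columns='diagnosis')
breast_cancer.head()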

At first glance, it already looks like there are some dimensions with a clear boundary between the two patient groups. Our goal will be to use PCA to make this distinction even clearer.

**Remark:** PCA can be especially helpful for data sets where the separation of classes is not that clear. But even in the case above, we might want to take several dimensions into consideration. Furthermore, visualising many more than the 8 variables above is challenging. PCA can help us with both problems.

The first step is to standardize the data, as described in the lecture notes. You may read about the similar operations of normalization or scaling when searching online - see this [short article](https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02) (7 minutes) for descriptions of the terms.
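
To make the three terms concrete, here is a minimal sketch on a made-up table (the column names and values are invented for illustration and are not taken from the lab data). It assumes the most common meanings: standardization gives every column mean 0 and standard deviation 1, min-max normalization maps every column to the range [0, 1], and scaling multiplies by a constant.

```python
import pandas as pd

# Invented toy data: two features on very different scales.
toy = pd.DataFrame({"radius": [10.0, 12.0, 14.0, 20.0],
                    "area": [300.0, 450.0, 600.0, 1250.0]})

# Standardization: subtract the column mean, divide by the column standard deviation.
standardized_toy = (toy - toy.mean()) / toy.std()

# Min-max normalization (one common meaning of "normalization"): map each column to [0, 1].
normalized_toy = (toy - toy.min()) / (toy.max() - toy.min())

# Scaling: multiply by a constant, for example to change units.
scaled_toy = toy * 0.1

print(standardized_toy.round(2))
print(normalized_toy.round(2))
```

Note that only standardization puts both columns on a comparable spread around zero.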

**Discussion:** Discuss with your lab partner why standardization is important before applying PCA to a data set.

Your answer:


**Exercise 02:** 

Write a function `standardize()` that takes as input a data set `df` and returns a standardized data set. Try not to copy and paste code from the internet, but rather think about the math that needs to be applied to the data set. If you are stuck, look up the Week 5 lecture notes (Statistical Preliminaries topic).

Call the function on the data set and print it to the screen.

In [122]:
# Your code
def standardize(df):
    """Return a copy of df in which every column has mean 0 and standard deviation 1."""
    mu = df.mean()
    sigma = df.std()  # pandas uses the sample standard deviation (ddof=1) by default
    return (df - mu) / sigma

standardized = standardize(breast_cancer)
print(standardized)

Now, we write the PCA function.

**Exercise 03:**

Write a function `def principal_component_analysis()` that takes as input a data set `df`. The function should first standardize the data with your `standardize()` function, then compute the covariance matrix of the standardized data set, and finally compute and return the eigenvalues and eigenvectors of the covariance matrix. 

You may use `np.cov()` and `np.linalg.eig()` to compute the covariance matrix and the eigenvalues and eigenvectors. However, be careful about the shapes of your matrices and vectors. 

Call the function on the `breast_cancer` data set and print the result to the screen.
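
As a hint about shapes, the sketch below uses random toy data (not the lab data) to show how `np.cov()` interprets its input. By default it treats each row as a variable, so a table with samples in rows needs `rowvar=False` (or a transpose) to produce a features-by-features covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))   # toy data: 100 samples (rows), 3 features (columns)

cov_default = np.cov(X)                 # treats each ROW as a variable
cov_features = np.cov(X, rowvar=False)  # treats each COLUMN as a variable

print(cov_default.shape)   # (100, 100)
print(cov_features.shape)  # (3, 3)
```

For PCA we want one eigenvalue per feature, so the covariance matrix should have as many rows and columns as the data set has features.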

In [123]:
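# Your code
def principal_component_analysis(df):
    standardized = standardize(df)
    # The variables are in the columns, so np.cov must not treat rows as variables
    cov = np.cov(standardized.values, rowvar=False)
    # Returns (eigenvalues, eigenvectors); eigenvector i is the column [:, i]
    return np.linalg.eig(cov)

print(principal_component_analysis(breast_cancer))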

**Exercise 04:** 

If you have read the documentation of `np.linalg.eig()` carefully, you will have seen that the eigenvalues are not sorted in any way. However, for PCA, we want the eigenvalues (and the corresponding eigenvectors) in descending order.

- Write a function `sort_eigenvalues()`, which takes as first input `eigenvalues`, an array of eigenvalues, and as second input `eigenvectors`, a 2-D array with the associated eigenvectors. The function should sort the eigenvalues and eigenvectors such that the first eigenvalue is the largest. 

Careful, look up the documentation of `np.linalg.eig()` again, to be sure how the eigenvalues are connected to the eigenvectors! Hint: `argsort()` from the numpy library might be helpful.
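
The relationship between the two return values can be checked on a small example. The sketch below uses a made-up symmetric 2x2 matrix and verifies that column `i` of the eigenvector array belongs to eigenvalue `i`, then reorders both with `argsort()`. The variable names are only illustrative.

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # made-up symmetric matrix
w, v = np.linalg.eig(A)

# Column i of v belongs to eigenvalue w[i]: A @ v[:, i] equals w[i] * v[:, i]
for i in range(len(w)):
    assert np.allclose(A @ v[:, i], w[i] * v[:, i])

order = np.argsort(w)[::-1]   # indices that sort w from largest to smallest
w_sorted = w[order]
v_sorted = v[:, order]        # reorder the COLUMNS, not the rows

print(w_sorted)
print(v_sorted)
```

Pairing eigenvalues with the rows of `v` (for example with `zip(w, v)`) would attach the wrong vector to each eigenvalue.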

In [124]:
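# Your code
def sort_eigenvalues(eigenvalues, eigenvectors):
    eigenvalues = np.asarray(eigenvalues)
    eigenvectors = np.asarray(eigenvectors)
    # Indices that order the eigenvalues from largest to smallest
    order = np.argsort(eigenvalues)[::-1]
    # np.linalg.eig stores eigenvector i in column i, so reorder the columns
    return eigenvalues[order], eigenvectors[:, order]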

**Exercise 05:** 
- Rewrite your PCA function, such that it returns the eigenvalues and eigenvectors sorted largest to smallest.
- The function should also print out the percentage of the variance explained by each principal component, before returning the eigenvalues and eigenvectors.
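
The share of the variance explained by a principal component is its eigenvalue divided by the sum of all eigenvalues. A minimal sketch with made-up eigenvalues (not results from the lab data):

```python
import numpy as np

eigenvalues = np.array([4.0, 3.0, 2.0, 1.0])   # made-up values, already sorted
explained = eigenvalues / eigenvalues.sum()     # fraction of total variance per component

for i, share in enumerate(explained, start=1):
    print(f"PC{i}: {share * 100:.1f}%")
# PC1: 40.0%
# PC2: 30.0%
# PC3: 20.0%
# PC4: 10.0%
```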

In [127]:
# Your code
def principal_component_analysis(df, n=4):
    standardized = standardize(df)
    # The variables are in the columns, so np.cov must not treat rows as variables
    cov = np.cov(standardized.values, rowvar=False)
    w, v = map(np.asarray, sort_eigenvalues(*np.linalg.eig(cov)))
    explained = w / np.sum(w)  # share of the total variance per component
    for share in explained[:n]:
        print(f'{share:.2f}')
    print(f'The first {n} principal components explain {explained[:n].sum()*100:.1f}% of the variance')
    return w, v
pca = principal_component_analysis(breast_cancer)
print(pca[0][:2])

0.44
0.19
0.09
0.07
The first 4 principal components explain 79.2% of the variance
[13.28160768  5.69135461]


Let us compare our largest two eigenvalues and eigenvectors to the output of the PCA model from the sklearn library. The sklearn PCA does not standardize the data automatically; it only centers the data around 0. To get the same results as our implementation, we need to standardize the data first.

In [126]:
standardized = standardize(breast_cancer)
pca = PCA(n_components=2).fit(standardized.values) # number of principal components that we are interested in
print(pca.explained_variance_) # Eigenvalues
print(pca.components_) # Eigenvectors (Careful: sklearn returns the eigenvectors in rows, whereas np.linalg.eig returns them in columns; signs may also differ)

[13.28160768  5.69135461]
[[ 0.21890244  0.10372458  0.22753729  0.22099499  0.14258969  0.23928535
   0.25840048  0.26085376  0.13816696  0.06436335  0.20597878  0.01742803
   0.21132592  0.20286964  0.01453145  0.17039345  0.15358979  0.1834174
   0.04249842  0.10256832  0.22799663  0.10446933  0.23663968  0.22487053
   0.12795256  0.21009588  0.22876753  0.25088597  0.12290456  0.13178394]
 [-0.23385713 -0.05970609 -0.21518136 -0.23107671  0.18611302  0.15189161
   0.06016536 -0.0347675   0.19034877  0.36657547 -0.10555215  0.08997968
  -0.08945723 -0.15229263  0.20443045  0.2327159   0.19720728  0.13032156
   0.183848    0.28009203 -0.21986638 -0.0454673  -0.19987843 -0.21935186
   0.17230435  0.14359317  0.09796411 -0.00825724  0.14188335  0.27533947]]
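
Note that the cell above standardizes with our own `standardize()` function rather than with sklearn's `StandardScaler`. The two are not identical: pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below, on made-up data, shows that the results differ only by a constant factor, which is why eigenvalues computed after `StandardScaler` would be slightly larger.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up toy data, for illustration only
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 30.0, 20.0, 60.0]})

by_hand = (toy - toy.mean()) / toy.std()           # pandas: sample std, ddof=1
by_sklearn = StandardScaler().fit_transform(toy)   # sklearn: population std, ddof=0

n = len(toy)
# The two results differ only by the constant factor sqrt(n / (n - 1)).
print(np.allclose(by_sklearn, by_hand.values * np.sqrt(n / (n - 1))))
```

For a data set with several hundred rows the factor is very close to 1, so the principal directions are unaffected and the eigenvalues change only marginally.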


Finally, we want to visualize our PCA results.

**Exercise 06:**
- Call your `principal_component_analysis()` function on the data set and store the return values.
- Apply the first two PC vectors to the standardized(!) data set and store it in a new variable `result`. (Careful about the orientation of your arrays.)
- Plot a scatterplot from the data in `result`, where you use `hue = diagnosis` to color the different data points according to the type of cancer they represent.
- Can you spot a clear separation between benign and malignant cancer?
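
As a hint about array orientation, here is a sketch on random toy data (not the lab data). It assumes the samples are in rows and the eigenvectors are in columns, as returned by `np.linalg.eig()`; the name `projected` is just illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                       # toy data: 50 samples, 4 features
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized columns

w, v = np.linalg.eig(np.cov(X, rowvar=False))
order = np.argsort(w)[::-1]                        # largest eigenvalue first
w, v = w[order], v[:, order]

# (50, 4) @ (4, 2) -> (50, 2): one row per sample, one column per principal component
projected = X @ v[:, :2]
print(projected.shape)   # (50, 2)
```

A quick sanity check is that the variance of the first projected column equals the largest eigenvalue.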

In [None]:
# Your code


Finally, let us apply the sklearn model we fitted above to the data. Again, we need to apply it to the standardized data. Because multiplying the data with the principal component vectors transforms the data, the method we need to call is `model.transform()`.

In [None]:
standardized = standardize(breast_cancer)
pca = PCA(n_components=30).fit(standardized.values)
pca_result = pca.transform(standardized.values)
sns.scatterplot(x=pca_result[:, 0], y=pca_result[:, 1], hue=diagnosis)
plt.xlabel('PC1')
plt.ylabel('PC2')
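
To see what `transform()` does, the sketch below (random toy data, default `PCA` settings assumed) checks that it is the same as centering the data with the fitted mean and multiplying by the transposed component matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))   # toy data: 40 samples, 5 features

model = PCA(n_components=2).fit(X)
by_method = model.transform(X)
# components_ holds one eigenvector per ROW, hence the transpose
by_hand = (X - model.mean_) @ model.components_.T

print(np.allclose(by_method, by_hand))
```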

**Exercise 07:**

In the week 05 lab, we started to look at a diabetes data set. The last activity in this lab is to see if PCA will also help us to visualize that data set.

a)

- Load the diabetes data set.
- Recreate the pair plot from that lab, with `hue='Outcome'`.
- Do you think that PCA will give a clear-cut distinction that is as good as for the breast cancer data set? Hint: Compare the pair plot of the diabetes data set with that of the breast cancer data set.
- See if you can remember how the data should be cleaned to remove data points that seem erroneous.

In [None]:
# Your code

b) 
- Create a new Pandas Series from the `Outcome` column, store it in `diagnosis_diabetes`, and drop the column from your data set.
- Apply your PCA model, and plot the scatter plot of the first two principal components.

In [None]:
# Your code


**Discussion:**

Discuss with your lab partner whether you can already make out two distinct clusters.
Look at the output of your PCA algorithm, and work out how many PCs you would need to take into account to explain as much of the variance as the first two components do in the breast cancer example.
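
One way to answer this kind of question is to accumulate the explained-variance shares until a target is reached. A minimal sketch with made-up shares and a made-up target (neither is taken from the lab data):

```python
import numpy as np

ratios = np.array([0.40, 0.25, 0.15, 0.12, 0.08])  # made-up explained-variance shares
target = 0.60                                       # made-up target share

cumulative = np.cumsum(ratios)
# Index of the first entry that reaches the target, plus 1 to get a count
n_needed = int(np.argmax(cumulative >= target)) + 1

print(cumulative)
print(n_needed)   # 2
```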

We cannot draw scatter plots in more than three dimensions, but we can at least plot three. Run the following script, and discuss whether it looks like we can distinguish patients with diabetes from those without. Hint: Remember that you can set the range of an axis with `ax.set_xlim([xmin,xmax])` to look more closely at a range of interest.

In [None]:
fig = plt.figure(figsize=(15,15))
ax = fig.add_subplot(111, projection='3d')

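# Assumes result_diabetes[i] returns the i-th principal component (e.g. a DataFrame with
# columns 0, 1, 2); for a NumPy array of shape (n_samples, n_components) use result_diabetes[:, i]
ax.scatter(result_diabetes[0], result_diabetes[1], result_diabetes[2], c=diagnosis_diabetes)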

ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')

plt.show()

In the next lab, we will use K-Means to try to cluster the patients with and without diabetes into two different groups, and then we will try to predict whether new patients have diabetes.

**We need your help:** This is a new course. In order for us to improve the labs for the next iteration, we need your feedback. Please fill out the following [form](https://forms.office.com/Pages/ResponsePage.aspx?id=sAafLmkWiUWHiRCgaTTcYZmGMCx4KxlMjSTITqjdcXpUNFlYTk1LNDBYODRKV0o5TlhCWVc4U0tLOC4u).

**Optional Exercise:**

Compare the output of your PCA algorithm to a PCA applied to the data set without having cleaned it first. Hint: You might have to reload the data set if you haven't stored the cleaned data set in a separate variable.

In [None]:
# Your code
