In [None]:
import numpy as np                # We always need this
import pandas as pd               # Pandas is a library for working with tables of data 
                                  # called Data Frames
import matplotlib.pyplot as plt   # We also want to plot stuff...

# Diagnosing tumours.

The .csv file "wdbc.csv" contains a large data set (the "Breast Cancer Wisconsin (Diagnostic) Data Set (WDBC)"). It can be obtained from https://www.openml.org/d/1510 .

This is a real data set (i.e. it contains actual measurements, not simulated data). It contains various measurements on breast cancer cells, 30 different variables in total, such as "radius", "texture", "perimeter", "area", "smoothness", "compactness", "concavity", etc.

In our data frame they are just called Var 1, Var 2, etc. :)

Each case also has a diagnosis: Malignant (M) or Benign (B).

There are 569 observations in total. The patient IDs are of course anonymized.

Let's read the data frame and take a look at the content:

In [None]:
df = pd.read_csv("wdbc.csv",index_col=0)

In [None]:
df         # Let's look at the Data Frame

# Problem:

Would it be possible for us to use the 30 given variables to predict (diagnose) whether we are looking at a malignant or a benign case? Let's try!


### Extracting the numerical data:

Since you haven't been taught in detail how data frames work, I'll do most of the manipulations and leave just the interpretation to you.

First we extract the columns that contain numerical data (the 30 variables).

In [None]:
df_data = df.loc[:,'Var 1':'Var 30'] 
df_data

As you can see, some variables contain much larger values than others (e.g. **Var 4** and **Var 5**). Because of this difference in scale, the variables with large numerical values will have larger variance and be overrepresented in our principal components. This introduces an undesirable bias that we would like to get rid of.

To get rid of that effect, we will work with normalized data instead, i.e. after centering (subtracting the mean from each column) we also normalize (divide each column by its standard deviation). This is easily done:

In [None]:
N = (df_data - df_data.mean()) / df_data.std()   # Column-wise: subtract the mean, divide by the standard deviation
N

As you can see, the numbers are now more comparable in size.
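If you want to convince yourself of what the normalization does, here is a minimal sketch on made-up data. The data frame `toy` and its column names `small` and `large` are invented for this illustration and are not part of the WDBC data set.

```python
import numpy as np
import pandas as pd

# Made-up data: two columns on very different scales (hypothetical names).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "small": rng.normal(loc=0.1, scale=0.02, size=200),
    "large": rng.normal(loc=650.0, scale=350.0, size=200),
})

# Same formula as for N: subtract the column means,
# then divide by the column standard deviations.
toy_N = (toy - toy.mean()) / toy.std()

print(toy.std())                 # very different spreads before normalizing
print(toy_N.mean().round(12))    # the means are now (numerically) zero
print(toy_N.std())               # the standard deviations are now 1
```

After normalizing, both columns have mean 0 and standard deviation 1, so neither of them dominates just because of its units.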

The covariance matrix can also be easily obtained. Note that since we're working with normalized data, the covariance matrix is actually the correlation coefficient matrix $R$:

In [None]:
R = N.cov()
np.round(R,3)
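
The claim that the covariance matrix of normalized data is the correlation matrix of the original data can be checked on a small example. The sketch below uses made-up data; the names `toy`, `a`, `b` and `c` are assumptions for the illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data with three columns on different scales (hypothetical names).
rng = np.random.default_rng(1)
x = rng.normal(size=300)
toy = pd.DataFrame({
    "a": x,
    "b": 50.0 * x + rng.normal(scale=20.0, size=300),  # correlated with a
    "c": rng.normal(scale=0.01, size=300),             # unrelated to a and b
})

# Normalize exactly as we did for N.
toy_N = (toy - toy.mean()) / toy.std()

cov_of_normalized = toy_N.cov()   # covariance matrix of the normalized data
corr_of_original = toy.corr()     # correlation matrix of the original data

print(np.round(cov_of_normalized, 3))
print(np.allclose(cov_of_normalized, corr_of_original))
```

The two matrices should agree up to rounding errors, and the diagonal consists of ones because every normalized column has variance 1.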

# Visualization

Now let's do a bit of plotting. From our original dataframe `df`, let's say we want to plot the observations of **Var 2** against the observations of **Var 1**. Easy!

In [None]:
df.plot.scatter('Var 1','Var 2',marker='.')
plt.show()

Hmm, it would also be nice to see which data points represent malignant cases and which represent benign ones.

Below is a small trick that assigns a red color to the malignant cases and a green color to the benign ones. You don't have to understand how it's done.

In [None]:
colors = {'M':'red','B':'green'}
colorlist = df["Diagnosis"].apply(lambda x : colors[x])
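
For the curious, here is what the trick does on a tiny made-up example. The series `toy_diagnosis` is invented for this illustration; in the notebook the labels come from `df["Diagnosis"]`.

```python
import pandas as pd

colors = {'M': 'red', 'B': 'green'}

# Four made-up diagnosis labels (hypothetical data).
toy_diagnosis = pd.Series(['M', 'B', 'B', 'M'], name="Diagnosis")

# apply calls the lambda once per entry and looks the label up in the dictionary.
toy_colorlist = toy_diagnosis.apply(lambda x: colors[x])
print(toy_colorlist.tolist())

# Series.map with a dictionary performs the same lookup.
print(toy_diagnosis.map(colors).tolist())
```

Each label is simply replaced by its color name, giving one color per observation that can be passed to the `c` argument of a scatter plot.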

In [None]:
df.plot.scatter('Var 1','Var 2',c=colorlist,marker='.')
plt.show()

Alternatively, we can plot the first two variables from our normalized dataframe `N`. The point cloud has the same shape, but the values on the axes are different.

In [None]:
N.plot.scatter('Var 1','Var 2',c=colorlist,marker='.')
plt.show()

From a visual inspection, we see that in these two variables, there is no clear separation between the malignant and benign cases. In other words, we can't use only these two variables to make a meaningful diagnosis.

# Exercise 2(a)

Look at the correlation coefficient matrix that we calculated above and pick three pairs of variables:

- Pick one pair that is highly positively correlated ($r>0.95$).
- Pick one pair that is somewhat negatively correlated ($r< -0.3$).
- Pick one pair that seems uncorrelated ($r\in (-0.05,0.05)$).

For each pair, make a plot like the one above. Use the same trick as above to assign a red/green color to the malignant/benign cases.

Submit all three plots in LAMS.

In [None]:
# Go ahead

# Performing PCA

As you have now seen, many of the 30 given variables are highly correlated with each other. This means that the data set actually contains a lot of redundancy (many variables are largely determined by each other), and doing a PCA should enable us to reduce the data set to a smaller number of uncorrelated variables. Let's try it.

As mentioned above, we want to do our analysis on the normalized variables, so we should look at eigenvalues and eigenvectors of the correlation coefficient matrix $R$.

In [None]:
l, Q = np.linalg.eigh(R)  # R is symmetric, so eigh gives real eigenvalues l and eigenvectors (the columns of Q)
# Then we sort in descending order.
idx = l.argsort()[::-1]
l = l[idx]
Q = Q[:,idx]
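
To see what the arrays `l` and `Q` contain, here is a minimal sketch on a small made-up correlation matrix. The matrix `R_toy` is invented for this illustration. The sketch uses `np.linalg.eigh`, NumPy's routine for symmetric matrices, which always returns real eigenvalues.

```python
import numpy as np

# A small made-up correlation matrix (symmetric, ones on the diagonal).
R_toy = np.array([[1.0, 0.8, 0.1],
                  [0.8, 1.0, 0.3],
                  [0.1, 0.3, 1.0]])

l_toy, Q_toy = np.linalg.eigh(R_toy)

# Sort eigenvalues in descending order and reorder the eigenvectors to match.
idx = l_toy.argsort()[::-1]
l_toy = l_toy[idx]
Q_toy = Q_toy[:, idx]

# The eigenvalues are now in descending order.
print(np.all(np.diff(l_toy) <= 0))

# Column j of Q_toy is an eigenvector for l_toy[j]: R Q = Q diag(l).
print(np.allclose(R_toy @ Q_toy, Q_toy * l_toy))

# The eigenvectors are orthonormal: Q^T Q is the identity matrix.
print(np.allclose(Q_toy.T @ Q_toy, np.eye(3)))
```

Note that the same index array `idx` is used for both `l_toy` and the columns of `Q_toy`, so each eigenvalue stays paired with its eigenvector after sorting.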

Let's take a look at the eigenvalues:

In [None]:
print(np.round(l,1))   # Let's look at our sorted list of eigenvalues.

As explained previously, each eigenvalue denotes the variance of the corresponding principal component. 

Recall also that in the normalized data set each variable has variance 1, so the total variance of the 30 variables is 30. The principal components may individually have a larger or smaller variance (the eigenvalues above), but their total variance is the same:

In [None]:
sum(l)  # Calculate the sum of the eigenvalues 
        # (i.e. the sum of the variance of the 30 principal components)
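
The reason the total is preserved is that the sum of the eigenvalues of a matrix equals its trace (the sum of its diagonal entries), and a correlation matrix has a 1 on the diagonal for every variable. A minimal sketch with a made-up $3\times 3$ correlation matrix `R_toy` (invented for this illustration), using `np.linalg.eigvalsh`, which returns only the eigenvalues of a symmetric matrix:

```python
import numpy as np

# A small made-up correlation matrix (symmetric, ones on the diagonal).
R_toy = np.array([[1.0, 0.8, 0.1],
                  [0.8, 1.0, 0.3],
                  [0.1, 0.3, 1.0]])

l_toy = np.linalg.eigvalsh(R_toy)   # eigenvalues only

print(sum(l_toy))        # sum of the eigenvalues
print(np.trace(R_toy))   # sum of the diagonal entries: one 1 per variable
```

Both numbers should equal the number of variables (here 3), up to rounding errors.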

# Exercise 2(b)

1. How much of the total variance is accounted for by the first two principal components?
2. How many principal components do we need to account for at least 90% of the total variance?

Submit your answers in LAMS.
 
- Recall that for example `l[0:3]` will give you the first three entries `l[0]`, `l[1]` and `l[2]` of `l`. 
- To get the sum of the three first entries you can use `sum(l[0:3])`.
- The sum of all eigenvalues is simply `sum(l)`.
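
The hints above can be tried out on a short made-up list of eigenvalues. The array `l_toy` is invented for this illustration and has nothing to do with the eigenvalues of the real data; the use of `np.cumsum` for running totals is an optional extra, not something the exercise requires.

```python
import numpy as np

# Made-up eigenvalues, already sorted in descending order (hypothetical values).
l_toy = np.array([2.5, 1.0, 0.3, 0.2])

# Share of the total variance carried by the first two entries.
share_first_two = sum(l_toy[0:2]) / sum(l_toy)
print(share_first_two)

# Running totals: entry k is the share carried by the first k+1 entries.
cumulative_share = np.cumsum(l_toy) / sum(l_toy)
print(np.round(cumulative_share, 3))
```

With these made-up numbers the first two entries carry about 87.5% of the total, and the running totals end at 1.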


In [None]:
# Go ahead

# Exercise 2(c)

1. Create a new dataframe `Np` containing the principal components of your normalized data `N`.
2. Create a scatter plot of the first two principal components against each other. If the columns of `Np` have no labels, you can access them by index instead. Just remember that the first column has index 0. Also, use the same trick as before to assign a red color to malignant cases and a green color to benign cases.
3. Does it look like we could use the first two principal components to guess whether an observation comes from a malignant or a benign case? Explain in a few words (just the idea, no detail needed).