# Sheet 03

## Preamble

Authors: Marten Ringwelski, Nico Ostermann, Simon Liessem

Note that the cells of this notebook MUST be executed in order, from top to bottom, for everything to work.
The tasks cannot be run individually.

Also, eCampus does not allow uploading nested directory structures, which makes it hard to organize the files properly. The files are expected to be in the `data` directory, which itself is placed next to this notebook.

If you extract the zip file we handed in everything should work just fine.

Enable autoformatting if `jupyter-black` is installed.

In [None]:
try:
    import black
    import jupyter_black

    jupyter_black.load(
        lab=False,
        line_length=79,
        verbosity="DEBUG",
        target_version=black.TargetVersion.PY310,
    )
except ImportError:
    pass

Import everything we need and more.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
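import sklearn as sk
import sklearn.decomposition  # makes sk.decomposition available
import sklearn.preprocessing  # makes sk.preprocessing available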
from sklearn.feature_selection import f_classif, SelectKBest
import math as m
import plotly.express as px

Set seaborn default theme

In [None]:
sns.set_theme()

If needed, tweak the parameters of matplotlib.
Here we increase the size and dpi to get a bigger but still high-res image.

In [None]:
mpl.rcParams["figure.dpi"] = 200
mpl.rcParams["figure.figsize"] = (20, 15)
%matplotlib inline

## Exercise 1

### a)

Read the dataset with Pandas and store it in a dataframe.
Then delete every row that does not belong to one of the classes c-CS-s or t-CS-s.

In [None]:
df = pd.read_excel("data/Data_Cortex_Nuclear.xls")
df_subgroups = df[
    np.logical_or(df["class"] == "c-CS-s", df["class"] == "t-CS-s")
].copy()

First we print the number of samples per class.
There are 135 samples for the c-CS-s mice and 105 for the t-CS-s mice.

In [None]:
df_subgroups["class"].value_counts()
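
The selection and the count can also be written with `Series.isin`, which scales better when more than two classes are wanted. The following is a minimal sketch on a made-up toy dataframe (the values and the name `toy` are assumptions for illustration, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "class": ["c-CS-s", "t-CS-s", "c-SC-m", "c-CS-s", "t-SC-s"],
        "pS6_N": [0.1, 0.2, 0.3, 0.4, 0.5],
    }
)

# Keep only the rows whose class is one of the two wanted labels.
toy_subgroups = toy[toy["class"].isin(["c-CS-s", "t-CS-s"])].copy()

# Count the remaining rows per class.
counts = toy_subgroups["class"].value_counts()
print(counts)
```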

### b)

Map the two classes to 0 and 1 so that the color scale can depend on the class.

In [None]:
colors = df_subgroups["class"].map(
    {
        "t-CS-s": 0,
        "c-CS-s": 1,
    },
)

Now we make a parallel coordinates plot. We plot the dataframe with the 5 proteins named in the task and use different colors for the 2 classes t-CS-s and c-CS-s.

In [None]:
fig = px.parallel_coordinates(
    df_subgroups,
    color=colors,
    dimensions=["pPKCG_N", "pP70S6_N", "pS6_N", "pGSK3B_N", "ARC_N"],
    color_continuous_scale=px.colors.diverging.Tealrose,
    color_continuous_midpoint=0.5,
)
fig.update_layout(
    coloraxis_colorbar=dict(
        title="Class",
        tickvals=[0, 1],
        ticktext=["t-CS-s", "c-CS-s"],
        lenmode="pixels",
        len=200,
    ),
)
fig.show()

### c)

By rearranging the axes we notice that the values of pS6_N and ARC_N are exactly equal.
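
A visual impression like this can be cross-checked numerically. The sketch below uses a made-up toy dataframe (only the column names are taken from the task, the values are assumptions) and compares two columns element by element with `np.array_equal`:

```python
import numpy as np
import pandas as pd

# Toy data: the first two columns are deliberately identical.
toy = pd.DataFrame(
    {
        "pS6_N": [0.10, 0.25, 0.40],
        "ARC_N": [0.10, 0.25, 0.40],
        "pPKCG_N": [1.5, 1.7, 1.2],
    }
)

# True only if both columns hold exactly the same values in the same order.
exactly_equal = np.array_equal(
    toy["pS6_N"].to_numpy(), toy["ARC_N"].to_numpy()
)
# A column with different values fails the same check.
different = np.array_equal(
    toy["pS6_N"].to_numpy(), toy["pPKCG_N"].to_numpy()
)
print(exactly_equal, different)
```

Applied to the real data, the same call on `df_subgroups["pS6_N"]` and `df_subgroups["ARC_N"]` should confirm or refute the observation from the plot.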

## Exercise 2

See the end of the pdf.

## Exercise 3

### a)

Note that we cannot use the "code" column as the index, as it contains 53 duplicates.

In [None]:
df = pd.read_excel("data/breast-cancer-wisconsin.xlsx")
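
One way to check a column for duplicates before using it as an index is `Series.duplicated`. This is a minimal sketch on made-up codes (the dataframe `toy` is an assumption). Note that `duplicated()` flags every repeat after the first occurrence, so the resulting number depends on this counting convention:

```python
import pandas as pd

# Toy data: code 1001 appears three times, code 1002 twice.
toy = pd.DataFrame(
    {
        "code": [1001, 1002, 1001, 1003, 1002, 1001],
        "class": [2, 2, 4, 2, 4, 2],
    }
)

# Number of rows that repeat an earlier code.
n_duplicates = int(toy["code"].duplicated().sum())
# A column is only suitable as a unique index if this is True.
is_unique = toy["code"].is_unique
print(n_duplicates, is_unique)
```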

From the output below we can see that there are 16 missing values in the column bareNuc (Bare Nuclei).

In [None]:
df.isna().sum()

As these make up less than 3% of all the data and about 7% of the patients with a malignant tumor, we decide to simply leave out the patients with missing values.

In [None]:
df = df.dropna()
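
Fractions like the ones quoted above can be computed directly from the missing-value mask. The following sketch uses a made-up toy dataframe (all values are assumptions) and reports the share of incomplete rows overall and per class before dropping them:

```python
import numpy as np
import pandas as pd

# Toy data with two missing values, one in each class.
toy = pd.DataFrame(
    {
        "bareNuc": [1.0, np.nan, 10.0, 3.0, np.nan, 8.0, 1.0, 2.0],
        "class": [2, 2, 4, 2, 4, 4, 2, 2],
    }
)

# Boolean mask: True for rows with at least one missing value.
missing_rows = toy.isna().any(axis=1)

# Share of incomplete rows in the whole dataset and within each class.
overall_fraction = missing_rows.mean()
fraction_per_class = missing_rows.groupby(toy["class"]).mean()

# Drop the incomplete rows, as done above.
cleaned = toy.dropna()
print(overall_fraction)
print(fraction_per_class)
```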

### b)

In [None]:
data_columns = df.columns.difference(["class", "code"])

In [None]:
n_components = len(data_columns)

In [None]:
df_wo_meta = df[data_columns]

In [None]:
pca = sk.decomposition.PCA(n_components=n_components)

Next we perform PCA with all 9 columns.
Using the PCA instance created above, we fit it to our data and then transform the data according to the PCA result.
To see how much variance is covered depending on the number of components, we plot the cumulative sum of the variance ratio explained by each component.

In [None]:
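pca.fit(df_wo_meta)
x_transformed = pca.transform(df_wo_meta)
# Use 1-based x values so the axis shows the actual number of components.
plt.plot(
    np.arange(1, n_components + 1),
    np.cumsum(pca.explained_variance_ratio_),
)
plt.xlabel("Number of components")
_ = plt.ylabel("Variance covered")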

To find out how many components we need to cover at least 90% of the variance, we compute a PCA that keeps just enough components to explain 90% of the variance.
After this we print the number of columns of the transformed data and see that we need 5 components to cover at least 90% of the variance.

In [None]:
pca_most = sk.decomposition.PCA(n_components=0.9)
pca_most.fit(df_wo_meta)
transformed_most = pca_most.transform(df_wo_meta)

In [None]:
n_principal_components = transformed_most.shape[1]

In [None]:
n_principal_components
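
The number of components can also be read off the cumulative sum directly. The sketch below uses synthetic data (the array `X` and its scales are assumptions, not the cancer dataset) and compares the manually determined count with the one scikit-learn chooses when `n_components` is a float between 0 and 1:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 6 features with decreasing spread (made up).
X = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.2])

# Fit a full PCA and accumulate the explained variance ratios.
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 90 %.
n_needed = int(np.argmax(cumulative > 0.9)) + 1

# With a float n_components, PCA selects the number of components itself.
auto = PCA(n_components=0.9).fit(X)
print(n_needed, auto.n_components_)
```

Both approaches should agree, which makes the plot from b) and the value of `n_principal_components` easy to compare.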

### c)

To make a scatter plot matrix we first create a dataframe from the PCA result.
We then add a column with the class names, which labels each sample as either malignant or benign.
Finally we use this dataframe to make a pair plot with class_name as hue.

In [None]:
df_most = pd.DataFrame(
    transformed_most,
    index=df.index,
    columns=[f"PC {i}" for i in range(1, n_principal_components + 1)],
)

In [None]:
df_most["class_name"] = df["class"].map({4: "malignant", 2: "benign"})

sns.pairplot(df_most, hue="class_name")

### d)

The first PCA mode shows the strongest difference between the distributions of the two classes.
That makes a lot of sense, since the first PCA mode covers the biggest fraction of the variance.

We take the first row of `pca_most.components_`, which holds the weights of the original features in the first component, and look up the index of its maximum value.
This index refers to a column of the original data.
In our case bareNuc has the biggest influence on the first component.

In [None]:
df_wo_meta.columns[np.argmax(pca_most.components_[0, :])]

Now the same for the minimum value.

In [None]:
df_wo_meta.columns[np.argmin(pca_most.components_[0, :])]
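
Since the sign of a principal component is arbitrary, it can be more robust to rank the features by the absolute value of their weights instead of taking the signed maximum and minimum. This is a minimal sketch on synthetic data (the dataframe `toy` and its column names are assumptions); it is an alternative view, not the method used above:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic data: column "a" has by far the largest spread (made up).
toy = pd.DataFrame(
    rng.normal(size=(100, 3)) * np.array([4.0, 1.0, 0.1]),
    columns=["a", "b", "c"],
)

pca_toy = PCA(n_components=2).fit(toy)

# Weights of the original features in the first component.
loadings = pd.Series(pca_toy.components_[0], index=toy.columns)

# Rank the features by absolute weight, largest first.
ranked = loadings.abs().sort_values(ascending=False)
strongest = ranked.index[0]
print(ranked)
```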

### e)

We use plotly and its "Box select" tool to highlight the outlier in all plots and see that it is
in fact an outlier in all components.
We also abuse `hover_data` to show all attributes of the data point.
In the same way we abuse `hover_name` to show the index of the data point.

In [None]:
fig = px.scatter_matrix(
    df_most,
    dimensions=df_most.columns.difference(["class_name"]),
    color="class_name",
    hover_name=df_most.index,
    hover_data=df_most,
)
fig.show()

Since we read the index of the outlier from the point name in the plot above, we can simply drop that row.

In [None]:
df_most_wo_outlier = df_most.drop(
    6,
    axis=0,
)

In [None]:
sns.pairplot(df_most_wo_outlier, hue="class_name")
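
Reading the index off the plot can be cross-checked programmatically. The sketch below is one possible heuristic, not the approach used above: it standardizes each column and picks the row with the largest distance from the mean. The data, the planted outlier and all names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Synthetic stand-in for the transformed data (made up).
toy = pd.DataFrame(rng.normal(size=(50, 2)), columns=["PC 1", "PC 2"])
toy.loc[6] = [25.0, -30.0]  # planted outlier at index 6

# Standardize each column, then measure the distance from the origin.
z = (toy - toy.mean()) / toy.std()
distance = np.sqrt((z**2).sum(axis=1))

# Index label of the most extreme row.
outlier_index = distance.idxmax()
toy_wo_outlier = toy.drop(index=outlier_index)
print(outlier_index)
```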

### f)

A huge difference in value ranges would affect PCA, because the axis with the bigger range has a much higher variance.
We can use normalisation to compare relative variances instead.
Without normalisation the first principal component would explain a lot of the variance,
but when PCA is computed on the normalised dataset the first component would most likely explain a lot less variance.
If, for example, the dataset we analysed in this task had such differing ranges,
this could lead us to select only 4 instead of 5 principal components to cover 90% of the variance.

So it would make sense to pre-process the data.
This can be done with the code below.

In [None]:
scaler = sk.preprocessing.StandardScaler()
scaler.fit(df_wo_meta)
x_scaled = scaler.transform(df_wo_meta)
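
The effect of scaling on the explained variance can be illustrated with a small experiment. This sketch uses two independent synthetic features (the array `X` and the factor 100 are assumptions), one of which has a much larger range, and compares the variance ratio of the first component before and after standardization:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Two independent made-up features; the second has a 100x larger range.
X = rng.normal(size=(300, 2)) * np.array([1.0, 100.0])

# Without scaling the large-range feature dominates the first component.
ratio_raw = PCA().fit(X).explained_variance_ratio_

# After standardization both features contribute about equally.
X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA().fit(X_scaled).explained_variance_ratio_

print(ratio_raw)
print(ratio_scaled)
```

On the raw data the first component covers almost all of the variance, while on the standardized data the variance is spread far more evenly across the components.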