# Sheet 04

## Preamble

Authors: Marten Ringwelski, Nico Ostermann, Simon Liessem

Note that the cells of this notebook MUST be executed in order, from top to bottom, for everything to work.
The tasks can't be run individually.

Also eCampus does not allow for uploading nested directory structures which makes it hard to properly organize the files. The files are expected to be in the `data` directory which itself is placed next to this notebook.

If you extract the zip file we handed in everything should work just fine.

Enable autoformatting if `jupyter-black` is installed.

In [None]:
try:
    import black
    import jupyter_black

    jupyter_black.load(
        lab=False,
        line_length=79,
        verbosity="DEBUG",
        target_version=black.TargetVersion.PY310,
    )
except ImportError:
    pass

Import all we need and more.

In [None]:
import seaborn as sns
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import sklearn as sk
from sklearn.feature_selection import f_classif, SelectKBest
import math as m
import plotly.express as px
import sklearn.decomposition
import sklearn.manifold
import scipy as sp
import scipy.sparse

Set seaborn default theme

In [None]:
sns.set_theme()

If needed tweak parameters of matplotlib.
Here we increase the size and dpi to get a bigger but still high-res image.

In [None]:
mpl.rcParams["figure.dpi"] = 200
mpl.rcParams["figure.figsize"] = (20, 15)
%matplotlib inline

Disable future warnings as we get a lot of them and don't really care for this sheet.

In [None]:
import warnings

warnings.simplefilter(action="ignore", category=FutureWarning)

## Exercise 1

### a)

Read the dataframe and replace missing values by the respective mean of the column.

In [None]:
df = pd.read_excel("data/breast-cancer-wisconsin.xlsx")
# numeric_only=True keeps this working if any column is non-numeric
df = df.fillna(df.mean(numeric_only=True))

df["class"] = df["class"].map({2: "benign", 4: "malignant"})
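
The mean imputation used above can be illustrated on a small made-up frame. This is only a sketch: the column names `a`, `b` and `label` and all values are hypothetical and have nothing to do with the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per numeric column (made-up data).
toy = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [10.0, 20.0, np.nan],
        "label": ["x", "y", "z"],
    }
)

# Column means are computed over the non-missing entries only.
col_means = toy.mean(numeric_only=True)

# Passing a Series to fillna fills each column with its own mean;
# columns that are not in the Series (here "label") stay untouched.
toy_filled = toy.fillna(col_means)
print(toy_filled)
```

Note that the mean is taken over the whole column, regardless of class.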

Now define the DataFrame for t-SNE and create a new one with the result.

In [None]:
data_columns = df.columns.difference(["class", "code"])

In [None]:
df_wo_meta = df[data_columns]

Do t-SNE with different perplexities as the task asked us to.


In [None]:
perplexities = [5, 10, 20, 30, 40, 50]
fig, axs = plt.subplots(
    nrows=2,
    ncols=m.ceil(len(perplexities) / 2),
)
for perplexity, ax in zip(perplexities, axs.flatten()):
    tsne = sk.manifold.TSNE(
        n_components=2,
        perplexity=perplexity,
        init="random",
        learning_rate="auto",
    )

    df_tsne = pd.DataFrame(
        tsne.fit_transform(df_wo_meta),
        columns=["x-tsne", "y-tsne"],
        index=df.index,
    )
    df_tsne[df.columns] = df

    ax.set_title(f"Perplexity: {perplexity}")
    ax.set_aspect("equal")
    sns.scatterplot(
        data=df_tsne,
        x="x-tsne",
        y="y-tsne",
        hue="class",
        ax=ax,
    )

Same as above, but with `init="pca"`.

In [None]:
perplexities = [5, 10, 20, 30, 40, 50]
fig, axs = plt.subplots(
    nrows=2,
    ncols=m.ceil(len(perplexities) / 2),
)
for perplexity, ax in zip(perplexities, axs.flatten()):
    tsne = sk.manifold.TSNE(
        n_components=2,
        perplexity=perplexity,
        init="pca",
        learning_rate="auto",
    )

    df_tsne = pd.DataFrame(
        tsne.fit_transform(df_wo_meta),
        columns=["x-tsne", "y-tsne"],
        index=df.index,
    )
    df_tsne[df.columns] = df

    ax.set_title(f"Perplexity: {perplexity}")
    ax.set_aspect("equal")
    sns.scatterplot(
        data=df_tsne,
        x="x-tsne",
        y="y-tsne",
        hue="class",
        ax=ax,
    )

TODO: What do we know?

### c)

Read data and use mean for missing data.

In [None]:
df = pd.read_excel(
    "data/Data_Cortex_Nuclear.xls",
    index_col="MouseID",
)
# The metadata columns are strings, so restrict the mean to numeric columns
df = df.fillna(df.mean(numeric_only=True))

In [None]:
meta_columns = ["Genotype", "Treatment", "Behavior", "class"]
df_wo_meta = df[df.columns.difference(meta_columns)]

df_scs = df_wo_meta[
    np.logical_or(
        df["class"] == "c-SC-s",
        df["class"] == "t-SC-s",
    )
].copy()

Now actually do PCA and create a DataFrame with the result.
We also use an equal axis scale for the plot, which makes sense because PCA is a linear projection and both axes are in the same units.

In [None]:
pca = sk.decomposition.PCA(
    n_components=2,
)
# XXX There must be a better way to do this
df_pca = pd.DataFrame(
    pca.fit_transform(df_scs),
    columns=["x-pca", "y-pca"],
    index=df_scs.index,
)
df_pca[df.columns] = df
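
Regarding the `XXX` comment above: one possible alternative for attaching the original columns is `DataFrame.join`, which aligns on the index. A small sketch with made-up data; the frames `full`, `subset` and `coords` and their values are hypothetical:

```python
import numpy as np
import pandas as pd

# Made-up "full" frame with an index playing the role of MouseID.
full = pd.DataFrame(
    {
        "f1": [0.0, 1.0, 2.0, 3.0],
        "f2": [1.0, 0.0, 1.0, 0.0],
        "class": ["a", "b", "a", "b"],
    },
    index=["m1", "m2", "m3", "m4"],
)

# Only a subset of rows gets embedded, like the two SC-s classes above.
subset = full[full["class"] == "a"]

# Pretend these are the 2D coordinates returned by fit_transform.
coords = pd.DataFrame(
    np.array([[0.1, 0.2], [0.3, 0.4]]),
    columns=["x-pca", "y-pca"],
    index=subset.index,
)

# A left join on the index keeps only the embedded rows.
joined = coords.join(full)
print(joined)
```

This assumes the index is unique; with duplicate index values the join would duplicate rows.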

In [None]:
plt.gca().set_aspect("equal")

sns.scatterplot(
    data=df_pca,
    x="x-pca",
    y="y-pca",
    hue="class",
)
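
To judge how much of the data a 2D PCA plot actually shows, one can look at the `explained_variance_ratio_` attribute of the fitted `PCA` object. A minimal sketch on synthetic data (the array `X` and its feature scales are made up, not taken from the mouse dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 100 samples, 6 features with decreasing spread.
rng = np.random.default_rng(0)
scales = np.array([5.0, 3.0, 1.0, 0.5, 0.5, 0.1])
X = rng.normal(size=(100, 6)) * scales

pca_demo = PCA(n_components=2)
coords = pca_demo.fit_transform(X)

# Fraction of the total variance captured by each kept component.
ratio = pca_demo.explained_variance_ratio_
print(ratio, ratio.sum())
```

If the sum is low, the scatter plot hides a large part of the variance and should be read with care.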

TODO
- isomap
- tsne with different settings

Now we do the same thing for Isomap.
As the warnings show, using 2 or 5 for `n_neighbors` is not a good choice, as the resulting neighborhood graph has more than one connected component.

In [None]:
# We don't care that the calculation is expensive
warnings.simplefilter(
    action="ignore",
    category=sp.sparse.SparseEfficiencyWarning,
)
n_neighbors_array = [2, 5, 7, 13, 17, 23, 29]

fig, axs = plt.subplots(
    nrows=2,
    ncols=m.ceil(len(n_neighbors_array) / 2),
)

for n_neighbors, ax in zip(n_neighbors_array, axs.flatten()):
    isomap = sk.manifold.Isomap(
        n_neighbors=n_neighbors,
    )

    df_isomap = pd.DataFrame(
        isomap.fit_transform(df_scs),
        columns=["x-isomap", "y-isomap"],
        index=df_scs.index,
    )

    df_isomap[df.columns] = df

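    # Use the current subplot; plt.gca() would always return the last axes
    ax.set_title(f"n_neighbors: {n_neighbors}")
    ax.set_aspect("equal")

    sns.scatterplot(
        data=df_isomap,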
        x="x-isomap",
        y="y-isomap",
        hue="class",
        ax=ax,
    )
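
The connected-component issue mentioned above can be checked directly by building a k-nearest-neighbor graph and counting its components. The following is a sketch on synthetic data, assuming that such a graph behaves similarly to the one Isomap builds internally; the two blobs and the helper `count_components` are hypothetical:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

# Two well-separated blobs of 20 points each (made-up data).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 3))
blob_b = rng.normal(loc=10.0, scale=0.1, size=(20, 3))
X = np.vstack([blob_a, blob_b])


def count_components(k):
    """Count connected components of the k-nearest-neighbor graph of X."""
    graph = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    n_comp, _ = connected_components(graph, directed=False)
    return n_comp


counts = {k: count_components(k) for k in (2, 5, 25)}
print(counts)
```

With few neighbors every point only links within its own blob, so the graph stays split. Once `k` exceeds the blob size, edges must cross between the blobs and the graph becomes connected.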