# Week 4: An Introduction to PCA

## Goals
- Learn about Python functions
- See PCA in action

### Example from the lecture

Let's look at the example from earlier.

Instead of loading and plotting the data by hand as we have done before, let's build a Python function to automate this.

#### Python functions (def)

In Python, we use `def` to indicate the start of a function. 

The subsequent lines in the function *must* be indented. The amount of indentation does not matter, but it must be consistent within the function (four spaces is the usual convention).

In [None]:
def MyFunction(x):      # first line 'def NAME(<input>):'
    return x + 1        # last line usually 'return <output>' 

This is similar to an example from last week: `lambda x: x + 1`.

In [None]:
MyFunction(3)

In [None]:
MyFunction(3.1415)
# MyFunction("one")     # uncomment to see the error: a string plus an integer is not defined
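The commented-out call is worth trying. Adding an integer to a string is not defined in Python, so the call raises a `TypeError`. A minimal sketch, which redefines `MyFunction` so that it runs on its own and catches the error in order to print it:

```python
def MyFunction(x):
    return x + 1

try:
    MyFunction("one")                   # str + int is not defined
except TypeError as err:
    message = str(err)                  # keep the error text so we can display it
    print("TypeError:", message)
```

The function itself never checks its input; the error comes from the `+` inside it.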

We can make more complicated functions by including more lines.

In [None]:
def GetWord(s, i):                      # Two input variables: s and i
    s_clean = s.replace('.', '')        # Remove '.' from string s
    words = s_clean.split(' ')          # Cut string s_clean into substrings
    return words[i]                     # Output the word at index i (counting from 0)

In [None]:
GetWord("Nitwit. Blubber. Oddment. Tweak.", 1)

In [None]:
GetWord("Repetition legitimizes. Repetition legitimizes.", 2)
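To see why the index `1` picks out the second word, it can help to run the steps of `GetWord` one at a time. The sketch below repeats the body of the function outside of it, using the first example string:

```python
s = "Nitwit. Blubber. Oddment. Tweak."

s_clean = s.replace('.', '')            # 'Nitwit Blubber Oddment Tweak'
words = s_clean.split(' ')              # ['Nitwit', 'Blubber', 'Oddment', 'Tweak']

print(words[0])                         # Nitwit  (Python counts from 0)
print(words[1])                         # Blubber
```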

---

Back to the goal at hand. We want a function that accomplishes the following.

**INPUT:** a pandas DataFrame with two columns of data (independent variable first, then dependent)

**OUTPUT:** a `matplotlib` plot of the data.

We will build this up in stages.

In [None]:
import pandas as pd

def plot_dataframe(df):
    import matplotlib.pyplot as plt         # locally load matplotlib
    
    print(df)                               # placeholder: just display the data for now

plot_dataframe(pd.read_csv("data/pcadata.csv"))

Now let's get the column names with `df.columns.values`.

In [None]:
def plot_dataframe(df):
    import matplotlib.pyplot as plt

    cvals = df.columns.values               # get the column headers 
    print(cvals)                            # placeholder: just display the headers for now

plot_dataframe(pd.read_csv("data/pcadata.csv"))
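If the CSV file is not to hand, the same idea can be tried on a small hand-made DataFrame. In this sketch the DataFrame `toy` and its column names `x` and `y` are made up for illustration; they are not the contents of `data/pcadata.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the real data; the names and values are invented.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.1, 5.9]})

cvals = toy.columns.values              # NumPy array holding the column headers
print(cvals)
print(toy[cvals[0]])                    # first column, selected by its header
print(toy[cvals[1]])                    # second column
```

Selecting columns through `cvals` rather than by a fixed name is what lets `plot_dataframe` work with any two-column DataFrame.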

Now we can build the plot within the function.

In [None]:
def plot_dataframe(df):
    import matplotlib.pyplot as plt         # locally load matplotlib

    cvals = df.columns.values               # get the column headers 
    fig, ax = plt.subplots()
    ax.scatter(df[cvals[0]], df[cvals[1]], c="blue")
    ax.grid()
    ax.set_xlabel(cvals[0])                 
    ax.set_ylabel(cvals[1])
    return fig

Let's load the data for our initial PCA example.

In [None]:
fig = plot_dataframe(pd.read_csv("data/pcadata.csv"))

**Challenge:** In class, I displayed a graph where the grid cells were $1\times 1$ squares. Can you achieve this?

Now let's load the data, build the matrix $X$, and construct its covariance matrix $C_X$. 

Recall that $x_1, \dots, x_n \in \mathbb{R}^m$ are our data points (as column vectors), which we assume have mean zero, so
$$ 
\begin{aligned}
    X &= \begin{pmatrix}
        x_1 & x_2 & \cdots & x_n 
    \end{pmatrix}, \\
    C_X &= \dfrac{1}{n} XX^{\mathrm{t}}. 
\end{aligned}
$$

In [None]:
import pandas as pd 
import numpy as np

In [None]:
df = pd.read_csv("data/pcadata.csv")
X = np.array(df).T                      # transpose so that each column of X is one data point
print(X.shape)                          # (m, n)

In [None]:
C_X = X @ X.T / len(df)                 # (1/n) X X^t, where n = len(df) is the number of data points
print(C_X)
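The product $\frac{1}{n} XX^{\mathrm{t}}$ is the covariance matrix only when each row of $X$ has mean zero. Whether `data/pcadata.csv` is already centred is not checked here, so the sketch below uses made-up random data (the names `X_raw` and `X_c` are illustrative) and compares the formula against `np.cov` with `bias=True`, which also divides by $n$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 2 variables (rows), 200 points (columns), deliberately not centred.
X_raw = rng.normal(loc=[[5.0], [-2.0]], scale=1.0, size=(2, 200))
n = X_raw.shape[1]

X_c = X_raw - X_raw.mean(axis=1, keepdims=True)   # subtract each row's mean

C_centred = X_c @ X_c.T / n             # the formula applied to centred data
C_uncentred = X_raw @ X_raw.T / n       # the same formula without centring

print(np.allclose(C_centred, np.cov(X_raw, bias=True)))
print(np.allclose(C_uncentred, np.cov(X_raw, bias=True)))
```

The first comparison should agree and the second should not, which is a quick way to check whether a data set needs centring first.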

In [None]:
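E = np.linalg.eig(C_X)                  # eigenvalues are not returned in any particular order
# Swap the two eigenvector columns and flip their signs
# (the .eigenvectors attribute needs a recent NumPy; on older versions use E[1])
P = -E.eigenvectors @ np.array([[0,1],[1,0]])
print(P)
print(np.linalg.det(P))                 # +1 or -1 up to rounding, since P is orthogonal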
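The column swap in the cell above is a choice made for this particular data set, because `np.linalg.eig` does not promise any ordering of the eigenvalues. One alternative, sketched here and not the method used in the lecture, is to call `np.linalg.eigh` (intended for symmetric matrices) and sort explicitly. The matrix `C` and the name `P_sorted` are made up for illustration.

```python
import numpy as np

# Made-up symmetric matrix standing in for C_X.
C = np.array([[2.0, 1.2],
              [1.2, 1.0]])

evals, evecs = np.linalg.eigh(C)        # columns of evecs are unit eigenvectors
order = np.argsort(evals)[::-1]         # indices that sort eigenvalues largest first
evals = evals[order]
P_sorted = evecs[:, order].T            # rows are eigenvectors, largest eigenvalue first

print(evals)
print(P_sorted @ C @ P_sorted.T)        # approximately diagonal, with evals on the diagonal
```

With the eigenvectors stored as rows, `P_sorted @ X` would give the coordinates of each data point along the principal directions. The signs of the eigenvectors are still arbitrary, so a plot may come out mirrored.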

In [None]:
Y = P.T @ X                             # columns of P are eigenvectors, so P.T @ X gives the new coordinates
df2 = pd.DataFrame({
    "PC1" : Y[0], 
    "PC2" : Y[1]
})
fig = plot_dataframe(df2)

**Note:** If you think this makes the data look worse, consider 'standardising' the grid in `plot_dataframe`. 