# Week 4: An Introduction to PCA

## Goals
- Learn about Python functions
- See PCA in action

### Example from the lecture

Let's look at the example from earlier.

Instead of loading and plotting like we have done before, let's build a Python function to automate this.

#### Python functions (def)

In Python, we use `def` to indicate the start of a function. 

The subsequent lines in the function *must* be indented. The amount of indentation does not matter, but it must be consistent within the function (four spaces is the usual convention).

In [None]:
def MyFunction(x):      # first line 'def NAME(<input>):'
    return x + 1        # last line usually 'return <output>' 

This is similar to an example from last week: `lambda x: x + 1`.
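To make the comparison with last week's `lambda` concrete, here is a small sketch that defines the same operation both ways and checks that they agree. The names `add_one` and `add_one_lambda` are made up for this illustration.

```python
def add_one(x):                      # hypothetical name; mirrors MyFunction
    return x + 1

add_one_lambda = lambda x: x + 1     # the same rule written as a lambda expression

# Apply both versions to a few sample inputs and pair up the results
results = [(add_one(v), add_one_lambda(v)) for v in (3, -1, 0.5)]
print(results)
```

A `def` is usually the better choice once the function needs a name, several lines, or comments.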

In [None]:
MyFunction(3)

In [None]:
MyFunction(3.1415)
# MyFunction("one")     # uncomment to see a TypeError: cannot add a str and an int

We can make more complicated functions by including more lines.

In [None]:
def GetWord(s, i):                      # Two input variables: s and i
    s_clean = s.replace('.', '')        # Remove '.' from string s
    words = s_clean.split(' ')          # Cut string s_clean into substrings
    return words[i]                     # Output the word at index i (counting from 0)

In [None]:
GetWord("Nitwit. Blubber. Oddment. Tweak.", 1)

In [None]:
GetWord("Repetition legitimizes. Repetition legitimizes.", 2)
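If the outputs above are surprising, it can help to look at the intermediate values. The sketch below is a variant of `GetWord` that also returns what each step produces; `get_word_steps` is a hypothetical helper written just for this illustration.

```python
def get_word_steps(s, i):
    s_clean = s.replace('.', '')     # drop every '.'
    words = s_clean.split(' ')       # split on single spaces
    return s_clean, words, words[i]  # indices start at 0

s_clean, words, word = get_word_steps("Nitwit. Blubber. Oddment. Tweak.", 1)
print(s_clean)
print(words)
print(word)
```

Because Python counts from 0, index `1` picks out the second word.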

---

Back to the goal at hand. We want a function that accomplishes the following.

**INPUT:** a pandas data frame with two columns of data (independent variable first, then dependent variable)

**OUTPUT:** a `matplotlib` plot of the data including the mean point.

We will build this up in stages.

In [None]:
import pandas as pd

def plot_dataframe(df):
    import matplotlib.pyplot as plt         # locally load matplotlib
    
    print(df)
    pass

df = pd.read_csv("data/pcadata.csv")
plot_dataframe(df)

Now let's get the column names with `df.columns.values`.

In [None]:
def plot_dataframe(df):
    import matplotlib.pyplot as plt

    cvals = df.columns.values               # get the column headers 
    print(cvals)
    xbar = sum(df[cvals[0]]) / len(df)      # mean of x-values
    ybar = sum(df[cvals[1]]) / len(df)      # mean of y-values
    print("mean : ({0}, {1})".format(xbar, ybar))
    pass

df = pd.read_csv("data/pcadata.csv")
plot_dataframe(df)

Now we can build the plot within the function.

In [None]:
def plot_dataframe(df):
    import matplotlib.pyplot as plt         # locally load matplotlib

    cvals = df.columns.values               # get the column headers 
    xbar = sum(df[cvals[0]]) / len(df)
    ybar = sum(df[cvals[1]]) / len(df) 
    fig, ax = plt.subplots()
    ax.scatter(df[cvals[0]], df[cvals[1]], c="blue", zorder=2)
    ax.scatter([xbar], [ybar], marker='x', c="red", zorder=3)
    ax.grid()
    ax.set_xlabel(cvals[0])                 
    ax.set_ylabel(cvals[1])
    return fig

Let's load the data for our initial PCA example.

In [None]:
df = pd.read_csv("data/pcadata.csv")
fig = plot_dataframe(df)

**Challenge:** In class, I displayed a graph where the grid cells were $1\times 1$ squares. Can you achieve this?

Now let's load the data, build the matrix $X$, and construct its covariance matrix $C_X$. 

Recall that $x_1, \dots, x_n \in \mathbb{R}^m$ are our data points (as column vectors, centered so that their mean is $0$), so
$$ 
\begin{aligned}
    X &= \begin{pmatrix}
        x_1 & x_2 & \cdots & x_n 
    \end{pmatrix}, \\
    C_X &= \dfrac{1}{n} XX^{\mathrm{t}}. 
\end{aligned}
$$
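As a quick check of the formula, here is a sketch on a tiny made-up data set (`Z_toy` is not the lecture data). It centers the rows, computes $\frac{1}{n}XX^{\mathrm{t}}$, and compares the result with NumPy's `np.cov` using `bias=True`, which also divides by $n$.

```python
import numpy as np

# Hypothetical toy data: m = 2 variables (rows), n = 4 data points (columns)
Z_toy = np.array([[1.0, 2.0, 3.0, 6.0],
                  [2.0, 1.0, 5.0, 4.0]])
n = Z_toy.shape[1]

X_toy = Z_toy - Z_toy.mean(axis=1, keepdims=True)   # subtract each row's mean
C_toy = X_toy @ X_toy.T / n                         # the formula for C_X

C_ref = np.cov(Z_toy, bias=True)                    # NumPy's covariance, dividing by n
print(C_toy)
print(np.allclose(C_toy, C_ref))
```

Note that `np.cov` divides by $n-1$ unless `bias=True` is given, so the two conventions differ slightly on small data sets.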

In [None]:
import pandas as pd 
import numpy as np

In [None]:
df = pd.read_csv("data/pcadata.csv")
Z = np.array(df).T                      # Columns are data points; not yet centered
print(Z.shape)

The covariance formula assumes the data have mean $(0,0)\in\mathbb{R}^2$, so we subtract the mean from each row.

In [None]:
m1bar = sum(Z[0]) / len(Z[0])
m2bar = sum(Z[1]) / len(Z[1])
M = np.array([
    [m1bar] * len(df),
    [m2bar] * len(df)
])
X = Z - M 
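The explicit matrix of means works fine. As an aside, NumPy broadcasting can do the same subtraction without building `M`. The sketch below uses a made-up array `Z_toy` and checks that the two approaches agree.

```python
import numpy as np

# Hypothetical toy data: 2 variables (rows), 4 data points (columns)
Z_toy = np.array([[1.0, 2.0, 3.0, 6.0],
                  [2.0, 1.0, 5.0, 4.0]])
n = Z_toy.shape[1]

# Explicit matrix of means, as in the cell above
M_toy = np.array([[Z_toy[0].mean()] * n,
                  [Z_toy[1].mean()] * n])
X_explicit = Z_toy - M_toy

# Broadcasting: a (2, 1) column of means is stretched across all columns
X_broadcast = Z_toy - Z_toy.mean(axis=1, keepdims=True)

print(np.allclose(X_explicit, X_broadcast))
print(X_broadcast.mean(axis=1))      # each row should now have mean (numerically) zero
```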

In [None]:
C_X = X @ X.T / len(df)
print(C_X)

In [None]:
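E = np.linalg.eig(C_X)                  # eigenvalues and eigenvectors (as columns) of C_X
P = -E.eigenvectors @ np.array([[0,1],[1,0]])   # swap the two columns and flip their signs
print(P)
print(np.linalg.det(P))                 # +1 (rotation) or -1 (reflection)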

In [None]:
Y = P.T @ X                             # columns of P are eigenvectors, so project with P.T
df2 = pd.DataFrame({
    "PC1" : Y[0], 
    "PC2" : Y[1]
})
fig = plot_dataframe(df2)
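One way to convince yourself that the change of basis did its job is to check that the covariance matrix of the new coordinates is diagonal. The sketch below does this on synthetic data (not `pcadata.csv`). It uses `np.linalg.eigh`, the symmetric-matrix variant of `eig`, and sorts the eigenvalues explicitly instead of swapping columns by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical correlated 2-D data: 200 points stored as columns
x = rng.normal(size=200)
Z_toy = np.vstack([x, 0.5 * x + 0.2 * rng.normal(size=200)])

X_toy = Z_toy - Z_toy.mean(axis=1, keepdims=True)   # center the data
C_toy = X_toy @ X_toy.T / X_toy.shape[1]            # covariance matrix

evals, evecs = np.linalg.eigh(C_toy)     # eigh returns eigenvalues in ascending order
order = np.argsort(evals)[::-1]          # reorder so the largest comes first
evals, evecs = evals[order], evecs[:, order]

Y_toy = evecs.T @ X_toy                  # coordinates in the eigenvector basis
C_Y = Y_toy @ Y_toy.T / Y_toy.shape[1]   # covariance of the new coordinates
print(np.round(C_Y, 6))
print(evals)
```

The diagonal entries of `C_Y` should match the eigenvalues, and the off-diagonal entries should vanish up to rounding.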

**Note:** If you think this makes the data look worse, consider 'standardising' the grid in `plot_dataframe`. 

### PCA on some 'real' data

Nothing is really changing, but perhaps now the data has some context.

If you want some 'real' data sets, you can browse [kaggle.com](https://www.kaggle.com).

We will take some [data concerning sustainability](https://www.kaggle.com/datasets/anshtanwar/global-data-on-sustainable-energy?resource=download) from Kaggle. This is provided by [Ansh Tanwar](https://www.kaggle.com/anshtanwar) ([CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)).

I cannot provide "hard-coded" data, so in order to run these examples you will need to download the file from Canvas, GitHub, or Kaggle itself (Kaggle requires an account).

Let's load the data set into Python.

In [None]:
df = pd.read_csv("data/global-data-on-sustainable-energy.csv")
print(df.columns.values)
print(df)

Take all the rows with `Entity` equal to `Ireland`.

In [None]:
df_ie = df.loc[df["Entity"] == "Ireland"] 
df_ie

In [None]:
Z = np.array([
    df_ie["Electricity from fossil fuels (TWh)"],
    df_ie["Electricity from renewables (TWh)"],
    df_ie["Primary energy consumption per capita (kWh/person)"],
    df_ie["gdp_per_capita"]
]).T
print(Z)

The variables are on vastly different scales (roughly 10s, 1s, 10000s, and 10000s), so we rescale each one to have mean 0 and variance 1.

In [None]:
from sklearn.preprocessing import StandardScaler

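scaler = StandardScaler()
scaler.fit(Z)                           # learn each column's mean and standard deviation
print(scaler.mean_)
X = scaler.transform(Z).T               # standardize, then transpose so rows are variables
print(X.T)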
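To see what the rescaling amounts to, here is a NumPy-only sketch on made-up numbers (`Z_toy` is hypothetical). It subtracts each column's mean and divides by each column's population standard deviation, which, as far as I understand it, is what `StandardScaler` does with its default settings.

```python
import numpy as np

# Hypothetical data: rows are observations, columns are variables on different scales
Z_toy = np.array([[10.0, 1.0, 30000.0],
                  [12.0, 3.0, 34000.0],
                  [17.0, 2.0, 32000.0],
                  [21.0, 6.0, 40000.0]])

col_mean = Z_toy.mean(axis=0)            # one mean per column
col_std = Z_toy.std(axis=0)              # population standard deviation (divides by n)
Z_scaled = (Z_toy - col_mean) / col_std  # broadcasting applies this column by column

print(Z_scaled.mean(axis=0))             # approximately 0 for every column
print(Z_scaled.std(axis=0))              # approximately 1 for every column
```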

In [None]:
C_X = X @ X.T / X.shape[1]
print(C_X)
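Because every variable was standardized, this covariance matrix is the same thing as the correlation matrix of the original variables. Its diagonal entries are all 1, and its off-diagonal entries lie between -1 and 1. The sketch below checks this against `np.corrcoef` on synthetic data (`A` is made up for the illustration).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(30, 3))             # hypothetical data: 30 observations, 3 variables
A[:, 1] += 2.0 * A[:, 0]                 # make two of the variables correlated

# Standardize each column, then transpose so rows are variables
X_toy = ((A - A.mean(axis=0)) / A.std(axis=0)).T
C_toy = X_toy @ X_toy.T / X_toy.shape[1]

R_ref = np.corrcoef(A.T)                 # correlation matrix; rows of A.T are variables
print(np.round(C_toy, 3))
print(np.allclose(C_toy, R_ref))
```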

In [None]:
E = np.linalg.eig(C_X)
print(E.eigenvalues)

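The first principal component is by far the most important (it has the largest eigenvalue).

The first two principal components account for about $93\%$ of the variance. We divide by 4 because the eigenvalues of the covariance matrix of 4 standardized variables sum to 4.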

In [None]:
sum(E.eigenvalues[:2])/4

And the first three PCs account for about $99\%$ of the variance.

In [None]:
sum(E.eigenvalues[:3])/4
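The slices `[:2]` and `[:3]` only give the intended answer if the eigenvalues come back in decreasing order, which `np.linalg.eig` does not promise in general. The sketch below, on synthetic standardized data, sorts the eigenvalues explicitly and computes the cumulative share of variance. It uses `np.linalg.eigvalsh` (the symmetric-matrix routine) rather than `eig`, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 4))             # hypothetical data: 50 observations, 4 variables
A[:, 1] += A[:, 0]                       # introduce some correlation

X_toy = ((A - A.mean(axis=0)) / A.std(axis=0)).T    # standardized, rows are variables
C_toy = X_toy @ X_toy.T / X_toy.shape[1]

evals = np.linalg.eigvalsh(C_toy)[::-1]  # eigvalsh is ascending, so reverse it
ratio = evals / evals.sum()              # share of variance per principal component
cumulative = np.cumsum(ratio)            # share captured by the first k components

print(evals.sum())                       # equals the number of variables
print(cumulative)
```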

Let's see what we can glean. Recall that our variables are:
1. Electricity from fossil fuels (TWh)
2. Electricity from renewables (TWh)
3. Primary energy consumption per capita (kWh/person)
4. GDP per capita

Recall that PC1 was the most significant, but the first two together carry most of the information.

In [None]:
P = E.eigenvectors                      # column j is the eigenvector for E.eigenvalues[j]
# print(P.T @ C_X @ P)        # Sanity check: should be (numerically) diagonal
print(P)

One might draw the following conclusions from the first column alone:
- variables (1) and (3) are positively correlated, 
- variables (2) and (4) are positively correlated,
- variables (1) and (2) are negatively correlated.

These findings do not apply to the second column, for example.
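When reading signs off a column of `P`, keep in mind that an eigenvector is only determined up to an overall sign: if $v$ is an eigenvector then so is $-v$. What carries meaning is whether two entries in the same column share a sign, not the sign itself. The sketch below illustrates this on a made-up $2\times 2$ correlation matrix.

```python
import numpy as np

# Hypothetical correlation matrix for two positively correlated variables
C_toy = np.array([[1.0, 0.7],
                  [0.7, 1.0]])

evals, evecs = np.linalg.eigh(C_toy)     # ascending eigenvalues
lam = evals[-1]                          # largest eigenvalue
v = evecs[:, -1]                         # a corresponding unit eigenvector

# Both v and -v satisfy the eigenvector equation
print(np.allclose(C_toy @ v, lam * v))
print(np.allclose(C_toy @ (-v), lam * (-v)))

# The entries share a sign in either case, matching the positive correlation
print(v[0] * v[1] > 0, (-v)[0] * (-v)[1] > 0)
```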