# Week 5: Our descent into PCA

## Goals
- Hands-on PCA
- `filter` and `None`
- Plenty of plots plotted in a plot

## PCA and sustainability

Let's load up the same data set from last week and play around with it. 

First let's load the modules we'll use.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df_big = pd.read_csv("data/global-data-on-sustainable-energy.csv")
df_big.head()             # Prints the first 5 rows

Let's focus on three columns:
1. "Electricity from fossil fuels (TWh)"
2. "Electricity from nuclear (TWh)"
3. "Electricity from renewables (TWh)"

In [None]:
df = pd.DataFrame({
    "f" : df_big["Electricity from fossil fuels (TWh)"], 
    "n" : df_big["Electricity from nuclear (TWh)"],
    "r" : df_big["Electricity from renewables (TWh)"]
})
print(df)

Now we want to make sure that all of our data is complete. We don't want any missing entries. Let's check.

### The `filter` function

We will use the `filter` function in Python.

Like `map`, the `filter` function runs through an iterable object (e.g. a list) and applies a function `f` to each entry. If `f` returns `True` for an entry, that entry is kept; otherwise it is discarded.

In [None]:
L = [-2, -1, 0, 1, 2]
is_pos = lambda x: x > 0
is_pos(-4)

In [None]:
list(filter(is_pos, L))     # Only the positive entries remain.
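A note on the `list(...)` wrapper: `filter` returns a lazy iterator rather than a list, so it can only be consumed once. The sketch below reuses `L` and `is_pos` from above (the other variable names are made up for illustration) and compares `filter` with the equivalent list comprehension.

```python
L = [-2, -1, 0, 1, 2]
is_pos = lambda x: x > 0

lazy = filter(is_pos, L)        # a filter object; nothing is computed yet
kept = list(lazy)               # consuming the iterator gives the kept entries
kept_again = list(lazy)         # the iterator is now exhausted

# A list comprehension with a condition does the same job as filter.
comprehension = [x for x in L if is_pos(x)]

print(kept)
print(kept_again)
print(comprehension)
```

Whether to use `filter` or a comprehension is mostly a matter of taste; both keep exactly the entries for which the test is true.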

### `None` objects

In Python there is a special object `None`, which represents the absence of a value.

In [None]:
x = None
print(x)
L = [1, 4, None, 3]
print(L[0])
print(L[2])

In [None]:
def IsNone(x):
    # Compare by identity: falsy values such as 0 or "" are not `None`.
    if x is not None:
        print("You gave me something.")
        return False
    else:
        print("You gave me `None`.")
        return True

In [None]:
print(IsNone(4))
print(IsNone(None))
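One thing to watch for: `None` is not the only value that Python treats as false in an `if` test. The sketch below is only an illustration (the helper `is_none` and the list `candidates` are hypothetical names, not part of the worksheet). It compares a truthiness test with an identity test using `is`.

```python
def is_none(x):
    """Return True only when x is the None object itself."""
    return x is None

candidates = [None, 0, "", [], 4]

# What a bare `if x:` test would treat as "nothing": every falsy value.
looks_empty = [not x for x in candidates]

# What the identity test reports: only None itself.
really_none = [is_none(x) for x in candidates]

print(looks_empty)
print(really_none)
```

So a test such as `if x:` cannot tell `0` apart from `None`, while `x is None` can.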

Let's purposefully create a data frame with missing entries.

In [None]:
miss = pd.DataFrame({
    "X" : [None, 6, None, 5], 
    "Y" : [7, 9, 11, 13]
})
print(miss)

In [None]:
print("The 0-entry in the 'X' column is: {0}".format(miss["X"][0]))
print(miss["X"][0] is None)     # False: the entry is no longer `None`
print(type(miss["X"][0]))       # a NumPy float, because pandas stored NaN

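Notice that `pandas` replaced `None` with `numpy.nan` ("not a number") in the numeric column. Since `nan` is not equal to anything, not even itself, we need a special function to check whether a value is a `numpy.nan`.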

In [None]:
print(np.isnan(miss["X"][0]))
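To make the `None`-to-`nan` conversion concrete, here is a small sketch on a `pandas` Series with the same values as `miss["X"]` (the names `s` and `value` are made up for illustration). It also shows `pd.isna`, a `pandas` function that works much like `np.isnan` but also accepts `None`.

```python
import numpy as np
import pandas as pd

s = pd.Series([None, 6, None, 5])   # same values as miss["X"]
value = s[0]

print(s.dtype)           # the column is stored as floats, with None converted to NaN
print(value == value)    # NaN is not equal to itself
print(np.isnan(value))   # the NumPy check used above
print(pd.isna(value))    # the pandas check gives the same answer
print(pd.isna(None))     # pd.isna also handles None, where np.isnan raises a TypeError
```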

Now back to the rows with missing entries: let's use `filter` to find all of them.

In [None]:
# `iterrows` yields (index, row) pairs, so pair[1] is the row itself.
has_nan = lambda row: any(map(np.isnan, row))
[(pair[0], has_nan(pair[1])) for pair in miss.iterrows()]

In [None]:
baddies = list(filter(lambda pair: has_nan(pair[1]), df.iterrows()))
print(len(baddies))
baddies

Rather than take this approach any further, we can use a [pandas method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) that does exactly what we want: `dropna`. Let's do that.

In [None]:
df_clean = df.dropna()
list(filter(lambda pair: has_nan(pair[1]), df_clean.iterrows()))   # Should be empty now.
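The row-by-row `filter` approach can also be written with `pandas` methods alone. The sketch below is one possible way, shown on the small `miss` data frame from earlier rather than on the real data (the names `row_has_nan` and `clean` are made up for illustration).

```python
import pandas as pd

miss = pd.DataFrame({
    "X": [None, 6, None, 5],
    "Y": [7, 9, 11, 13]
})

# isna() marks every missing cell; any(axis=1) asks "is any cell in this row missing?"
row_has_nan = miss.isna().any(axis=1)
print(row_has_nan.tolist())
print(int(row_has_nan.sum()))    # number of rows with a missing entry

# dropna() keeps only the complete rows; the original index labels survive.
clean = miss.dropna()
print(clean.index.tolist())
```

Note that `dropna` does not renumber the rows, so the index of the cleaned data frame may have gaps.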

## Problem 1

Build a Python function that does the following:

**Input:** Three `pandas` data frames `(df1, df2, df3)`, each with 2 columns.

**Output:** A single `matplotlib` figure containing all three scatter plots, arranged in a 2 x 2 grid of subplots.

Check out a [matplotlib example](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html#stacking-subplots-in-two-directions) of stacking subplots in both the horizontal and vertical directions.

In [None]:
# Try it with a group first
# Also build yourself a simple test case.

## Problem 2

- Take the current data frame we have, `df_clean`, and construct the three principal components. 
- Project the data onto every pair of principal components, so onto (PC1, PC2), (PC1, PC3) and (PC2, PC3).
- For each of the three different projections, build a `pandas` data frame with two columns.
- Pass these three data frames to your function from Problem 1.

In [None]:
# Try it with a group first

## (Bonus) Problem 3

Repeat Problem 2 but rescale the data by incorporating the following function.

In [None]:
from sklearn.preprocessing import StandardScaler

def mat_to_rescaled_mat(Z):
    # Rescale each column of Z to have mean 0 and standard deviation 1.
    scaler = StandardScaler()
    return scaler.fit_transform(Z)
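To see what this rescaling does, here is a sketch of the same computation by hand with `numpy`, on a small made-up matrix (the values of `Z` and the names `mu`, `sigma` and `Z_scaled` are for illustration only). It assumes no column is constant, since a constant column would mean dividing by zero here.

```python
import numpy as np

# Toy matrix, made up for illustration: two columns on very different scales.
Z = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

mu = Z.mean(axis=0)       # column means
sigma = Z.std(axis=0)     # column standard deviations (population version, ddof=0)

# Subtract each column's mean, then divide by its standard deviation.
Z_scaled = (Z - mu) / sigma

print(Z_scaled.mean(axis=0))   # approximately 0 in every column
print(Z_scaled.std(axis=0))    # approximately 1 in every column
```

After rescaling, every column contributes on the same scale, so a column with large raw values no longer dominates the principal components simply because of its units.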