# Week 5: Our descent into PCA

## Goals
- Hands-on PCA
- `filter` and `None`
- Plenty of plots plotted in a plot

## PCA and sustainability

Let's load up the same data set from last week and play around with it. 

First let's load the modules we'll use.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
df_big = pd.read_csv("data/global-data-on-sustainable-energy.csv")
df_big.head()             # Prints the first 5 rows

Unnamed: 0,Entity,Year,Access to electricity (% of population),Access to clean fuels for cooking,Renewable-electricity-generating-capacity-per-capita,Financial flows to developing countries (US $),Renewable energy share in the total final energy consumption (%),Electricity from fossil fuels (TWh),Electricity from nuclear (TWh),Electricity from renewables (TWh),...,Primary energy consumption per capita (kWh/person),Energy intensity level of primary energy (MJ/$2017 PPP GDP),Value_co2_emissions_kt_by_country,Renewables (% equivalent primary energy),gdp_growth,gdp_per_capita,Density\n(P/Km2),Land Area(Km2),Latitude,Longitude
0,Afghanistan,2000,1.613591,6.2,9.22,20000.0,44.99,0.16,0.0,0.31,...,302.59482,1.64,760.0,,,,60,652230.0,33.93911,67.709953
1,Afghanistan,2001,4.074574,7.2,8.86,130000.0,45.6,0.09,0.0,0.5,...,236.89185,1.74,730.0,,,,60,652230.0,33.93911,67.709953
2,Afghanistan,2002,9.409158,8.2,8.47,3950000.0,37.83,0.13,0.0,0.56,...,210.86215,1.4,1029.999971,,,179.426579,60,652230.0,33.93911,67.709953
3,Afghanistan,2003,14.738506,9.5,8.09,25970000.0,36.66,0.31,0.0,0.63,...,229.96822,1.4,1220.000029,,8.832278,190.683814,60,652230.0,33.93911,67.709953
4,Afghanistan,2004,20.064968,10.9,7.75,,44.24,0.33,0.0,0.56,...,204.23125,1.2,1029.999971,,1.414118,211.382074,60,652230.0,33.93911,67.709953


Let's focus on three columns:
1. "Electricity from fossil fuels (TWh)"
2. "Electricity from nuclear (TWh)"
3. "Electricity from renewables (TWh)"

In [3]:
df = pd.DataFrame({
    "f" : df_big["Electricity from fossil fuels (TWh)"], 
    "n" : df_big["Electricity from nuclear (TWh)"],
    "r" : df_big["Electricity from renewables (TWh)"]
})
print(df)

         f    n     r
0     0.16  0.0  0.31
1     0.09  0.0  0.50
2     0.13  0.0  0.56
3     0.31  0.0  0.63
4     0.33  0.0  0.56
...    ...  ...   ...
3644  3.50  0.0  3.32
3645  3.05  0.0  4.30
3646  3.73  0.0  5.46
3647  3.66  0.0  4.58
3648  3.40  0.0  4.19

[3649 rows x 3 columns]


Now we want to make sure that all of our data is complete. We don't want any missing entries. Let's check.

### The `filter` function

We will use the `filter` function in Python.

Similar to `map`, the `filter` function runs through an iterable object (e.g. a list) and applies a function `f` to each entry. If `f` returns `True` (more precisely, any truthy value) on an entry, that entry is kept; otherwise it is discarded.

In [5]:
L = [-2, -1, 0, 1, 2]
is_pos = lambda x: x > 0
is_pos(4)

True

In [6]:
list(filter(is_pos, L))     # Only the positive entries remain.

[1, 2]
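One detail worth knowing, assuming Python 3: `filter` returns a lazy iterator rather than a list, which is why the cell above wraps it in `list(...)`. The short sketch below also shows that a list comprehension gives the same result.

```python
L = [-2, -1, 0, 1, 2]
is_pos = lambda x: x > 0

lazy = filter(is_pos, L)            # a filter object; nothing is computed yet
from_filter = list(lazy)            # consuming the iterator builds the list
from_comprehension = [x for x in L if is_pos(x)]

print(from_filter == from_comprehension)   # True
print(list(lazy))                          # [] since the iterator is now exhausted
```

Which form to use is mostly a matter of taste; the comprehension is often easier to read when the test is a short expression.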

### `None` objects

In Python there is a special object `None`, which stands for the absence of a value.

In [10]:
x = None
# print(x)
L = [1, 4, None, 3]
print(L[0])
print(L[2])

1
None


In [12]:
def IsNone(x):
    if x is not None:   # a bare "if x" would also reject 0, "" and []
        print("You gave me something.")
        return False
    else:
        print("You gave me `None`.")
        return True

In [14]:
# print(IsNone(4))
print(IsNone(None))

You gave me `None`.
True


In [17]:
x = 1
x is None        # This is the way to decide if a variable is `None`.

False
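A truthiness test such as `if x:` is not the same as a `None` check, because `0`, `""` and `[]` are falsy as well. The sketch below compares the two tests; the list `candidates` is made up for illustration.

```python
candidates = [None, 0, "", [], 4]

for value in candidates:
    falsy = not value           # what a bare `if value:` effectively tests
    is_none = value is None     # identity test against the single None object
    print(repr(value), falsy, is_none)

only_none = [v for v in candidates if v is None]
falsy_values = [v for v in candidates if not v]
print(len(only_none), len(falsy_values))   # 1 4
```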

Let's purposefully create a data frame with missing entries.

In [18]:
miss = pd.DataFrame({
    "X" : [None, 6, None, 5], 
    "Y" : [7, 9, 11, 13]
})
print(miss)

     X   Y
0  NaN   7
1  6.0   9
2  NaN  11
3  5.0  13


In [19]:
print("The 0-entry in the 'X' column is: {0}".format(miss["X"][0]))
print(miss["X"][0] == None)
print(type(miss["X"][0]))

The 0-entry in the 'X' column is: nan
False
<class 'numpy.float64'>


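Notice that pandas stored the missing entries as `numpy.nan`, a floating-point value, not as `None`. To check whether a value is `numpy.nan` we need a special function, `np.isnan`.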

In [20]:
print(np.isnan(miss["X"][0]))

True
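The reason an equality test fails here is that NaN, under the IEEE 754 floating-point rules, compares unequal to everything, including itself. A small sketch, rebuilding `miss` so that the block is self-contained; `pd.isna` is the pandas counterpart, which also accepts `None`, whereas `np.isnan(None)` raises a `TypeError`.

```python
import numpy as np
import pandas as pd

miss = pd.DataFrame({"X": [None, 6, None, 5], "Y": [7, 9, 11, 13]})
x0 = miss.loc[0, "X"]       # the missing entry in row 0

print(x0 == x0)             # False: NaN is not equal to itself
print(np.isnan(x0))         # True
print(pd.isna(x0))          # True
print(pd.isna(None))        # True
```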


Back to the rows with missing entries.

Let's use `filter` to find all rows that have a missing entry. We first try the idea on the small data frame `miss`, then apply it to `df`.

In [25]:
has_nan = lambda row: any(map(np.isnan, row))   # True if any entry in the row is NaN
[(pair[0], has_nan(pair[1])) for pair in miss.iterrows()]
# [pair for pair in miss.iterrows()]

[(0, True), (1, False), (2, True), (3, False)]

In [22]:
baddies = list(filter(lambda pair: has_nan(pair[1]), df.iterrows()))
print(len(baddies))
baddies

126


[(693,
  f    20.42
  n      NaN
  r    18.48
  Name: 693, dtype: float64),
 (694,
  f    19.59
  n      NaN
  r    21.03
  Name: 694, dtype: float64),
 (695,
  f    19.87
  n      NaN
  r    22.50
  Name: 695, dtype: float64),
 (696,
  f    23.30
  n      NaN
  r    21.84
  Name: 696, dtype: float64),
 (697,
  f    27.79
  n      NaN
  r    20.87
  Name: 697, dtype: float64),
 (698,
  f    25.20
  n      NaN
  r    25.42
  Name: 698, dtype: float64),
 (699,
  f    25.54
  n      NaN
  r    28.03
  Name: 699, dtype: float64),
 (700,
  f    33.69
  n      NaN
  r    22.29
  Name: 700, dtype: float64),
 (701,
  f    32.59
  n      NaN
  r    23.78
  Name: 701, dtype: float64),
 (702,
  f    32.12
  n      NaN
  r    24.57
  Name: 702, dtype: float64),
 (703,
  f    36.71
  n      NaN
  r    21.55
  Name: 703, dtype: float64),
 (704,
  f    41.00
  n      NaN
  r    21.04
  Name: 704, dtype: float64),
 (705,
  f    45.17
  n      NaN
  r    20.59
  Name: 705, dtype: float64),
 ...

Rather than working further on this by hand, we can use a [pandas method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html), `dropna`, that does exactly what we want: it returns a copy of the data frame without the rows that contain missing entries.

In [23]:
df_clean = df.dropna()
list(filter(lambda pair: has_nan(pair[1]), df_clean.iterrows()))

[]
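To see which rows `dropna` removes, pandas can also flag the incomplete rows directly, without `iterrows`. A minimal sketch on the small `miss` frame from earlier, rebuilt here so that the block is self-contained; the names `row_has_nan` and `miss_clean` are just for illustration.

```python
import pandas as pd

miss = pd.DataFrame({"X": [None, 6, None, 5], "Y": [7, 9, 11, 13]})

row_has_nan = miss.isna().any(axis=1)    # one boolean per row
print(row_has_nan.tolist())              # [True, False, True, False]
print(int(row_has_nan.sum()))            # 2 rows have a missing entry

miss_clean = miss.dropna()               # keeps only the complete rows
print(miss_clean.index.tolist())         # [1, 3]
```

Note that `dropna` keeps the original row labels, so the index of the cleaned frame may have gaps.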

## Problem 1

Build a Python function that does the following:

**Input:** Three `pandas` data frames `(df1, df2, df3)`, each with 2 columns.

**Output:** A single `matplotlib` figure containing the three scatter plots, one per data frame, arranged in a 2 x 2 grid of subplots (so one cell of the grid stays empty).

Check out this [matplotlib example](https://matplotlib.org/stable/gallery/subplots_axes_and_figures/subplots_demo.html#stacking-subplots-in-two-directions) on stacking subplots in both the horizontal and vertical directions.
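As a reminder of the mechanics only, and not a solution to the problem, here is a minimal sketch of a 2 x 2 grid in which one cell is left unused. The random points and the panel titles are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)              # made-up data for illustration
fig, axs = plt.subplots(2, 2, figsize=(8, 8))

# axs is a 2 x 2 array of Axes; ravel() flattens it so we can loop over cells.
for k, ax in enumerate(axs.ravel()[:3]):
    pts = rng.normal(size=(50, 2))
    ax.scatter(pts[:, 0], pts[:, 1], s=10)
    ax.set_title(f"Panel {k + 1}")

axs[1, 1].set_visible(False)                # hide the unused fourth cell
fig.tight_layout()
```

Your function will need to take its points from the columns of the three data frames instead.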

In [None]:
# Try it with a group first
# Also build yourself a simple test case.

## Problem 2

- Take the current data frame we have, `df_clean`, and construct the three principal components. 
- Project the data onto every pair of principal components, so onto (PC1, PC2), (PC1, PC3) and (PC2, PC3).
- For each of the three different projections, build a `pandas` data frame with two columns.
- Input these three data frames into your function.
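As a refresher, one common way to obtain principal components is through the singular value decomposition of the centered data matrix. The sketch below uses a small made-up matrix rather than `df_clean`, the names `Z`, `Zc`, `scores` and `pc12` are illustrative, and the approach may differ from the one used in lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data: 100 rows, 3 correlated columns.
Z = rng.normal(size=(100, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.2]])

Zc = Z - Z.mean(axis=0)                          # center each column
U, S, Vt = np.linalg.svd(Zc, full_matrices=False)

components = Vt                                  # rows are PC1, PC2, PC3
scores = Zc @ Vt.T                               # coordinates of each row in the PC basis
explained_var = S**2 / (len(Z) - 1)              # sample variance along each PC

pc12 = scores[:, [0, 1]]                         # projection onto (PC1, PC2)
print(pc12.shape)                                # (100, 2)
```

The singular values come out sorted in decreasing order, so PC1 is the direction of largest variance.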

In [None]:
# Try it with a group first

## (Bonus) Problem 3

Repeat Problem 2 but rescale the data by incorporating the following function.

In [None]:
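from sklearn.preprocessing import StandardScaler

def mat_to_rescaled_mat(Z):
    """Rescale each column of Z to have mean 0 and standard deviation 1."""
    # fit_transform learns the column means and standard deviations, then applies them.
    return StandardScaler().fit_transform(Z)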
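For intuition, `StandardScaler` with its default settings subtracts each column's mean and divides by that column's standard deviation (the population version, `ddof=0`). A NumPy-only sketch of the same computation follows; `rescale_by_hand` is a hypothetical name, the small matrix is made up, and constant columns (zero standard deviation) are not handled.

```python
import numpy as np

def rescale_by_hand(Z):
    """Center each column and divide by its population standard deviation."""
    Z = np.asarray(Z, dtype=float)
    return (Z - Z.mean(axis=0)) / Z.std(axis=0)   # np.std defaults to ddof=0

Z = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
R = rescale_by_hand(Z)

print(np.allclose(R.mean(axis=0), 0.0))   # True: every column has mean 0
print(np.allclose(R.std(axis=0), 1.0))    # True: every column has standard deviation 1
```

Rescaling matters for PCA because columns measured on larger scales would otherwise dominate the leading components.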