In [1]:
%run Latex_macros.ipynb
%run beautify_plots.py


In [2]:
# My standard magic !  You will see this in almost all my notebooks.

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Reload all modules imported with %aimport
%load_ext autoreload
%autoreload 1

%matplotlib inline

In [3]:
import numpy as np
import os

import matplotlib.pyplot as plt

import class_helper
%aimport class_helper

import unsupervised_helper
%aimport unsupervised_helper


In [4]:
# Create files containing charts
create = False

if create:
    pca_h = unsupervised_helper.PCA_Helper(visible=False)
    file_map = pca_h.corr_features_charts()
    print(file_map)

# Correlated features
Consider the following set of examples with $2$ features

<table>
    <tr>
        <th><center>Two features: perfect correlation</center></th>
    </tr>
    <tr>
        <td><img src="images/features_perf_corr.png"></td>
    </tr>
</table>


As you can see
- $\x_2$ is perfectly correlated with $\x_1$
$$
\x_2^\ip = 2 * \x_1^\ip
$$

**Linear algebra**

A way to conceptualize $\x^\ip$
- As a point in the space spanned by unit basis vectors parallel to the horizontal and vertical axes.
$$\begin{array}{l}
\u_{(1)} = (1,0) \\
\u_{(2)} = (0,1) \\
\end{array}
$$
- With $\x^\ip$ having exposure 

$$
\begin{array}{l}
\x^\ip_1 \text{ to } \u_{(1)} \\
\x^\ip_2 \text{ to } \u_{(2)} \\
\end{array}
$$


So example $\x^\ip$ is
$$
\x^\ip = \sum_{j'=1}^2 { \x^\ip_{j'} * \u_{(j')} }
$$


That is:
- Our feature space is defined by the basis vectors ("axes")
$$\begin{array}{l}
\u_{(1)} = (1,0) \\
\u_{(2)} = (0,1) \\
\end{array}
$$
- $\x^\ip$ describes a point in the span of the basis vectors
    - $\x^\ip_1$ is the displacement of observation $\x^\ip$ along basis vector $\u_{(1)}$
    - $\x^\ip_2$ is the displacement of observation $\x^\ip$ along basis vector $\u_{(2)}$
- In general, for any length $n$ vector of features
$$
\x^\ip = \sum_{j'=1}^n { \x^\ip_{j'} * \u_{(j')} }
$$
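
A minimal numeric sketch of the sum above, for $n = 2$. The example values are hypothetical and chosen only for illustration.

```python
import numpy as np

# Standard basis vectors ("axes") of a 2-feature space
u1 = np.array([1.0, 0.0])
u2 = np.array([0.0, 1.0])

# A hypothetical example x with features (x_1, x_2)
x = np.array([3.0, 6.0])

# Rebuild x as the sum of (displacement along a basis vector) * (basis vector)
x_rebuilt = x[0] * u1 + x[1] * u2

print(np.allclose(x, x_rebuilt))
```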

One could easily imagine a *different* set of basis vectors to describe the feature space
- For example: a rotation of basis vectors $\u_{(1)}, \ldots,  \u_{(n)}$
- Let this alternate set of basis vectors be denoted by $\tilde{\v}_{(1)}, \ldots, \tilde{\v}_{(n)}$
- The basis vectors are mutually orthogonal
$$
\tilde{\v}_{(j)} \cdot \tilde{\v}_{(j')} = 0 \quad \text{for } j \ne j'
$$
- The displacements $\x^\ip_j$ need to be adjusted relative to the alternate basis, becoming $\tilde\x^\ip_j$

$$
\x^\ip = \sum_{j'=1}^n { \tilde\x^\ip_{j'} * \tilde{\v}_{(j')} }
$$
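
A small sketch of a change of basis by rotation. The rotation angle and the example values are hypothetical. The sketch also assumes the alternate basis vectors have unit length (orthonormal, not just orthogonal), so that each new displacement is simply a dot product.

```python
import numpy as np

# Alternate basis: the standard axes rotated by a hypothetical angle
theta = np.deg2rad(30.0)
v1 = np.array([np.cos(theta), np.sin(theta)])
v2 = np.array([-np.sin(theta), np.cos(theta)])

# A hypothetical example in the original basis
x = np.array([3.0, 6.0])

# With an orthonormal basis, the displacement along each
# alternate basis vector is the dot product with that vector
x_tilde = np.array([x @ v1, x @ v2])

# The same point, rebuilt from its displacements in the alternate basis
x_rebuilt = x_tilde[0] * v1 + x_tilde[1] * v2

print(np.isclose(v1 @ v2, 0.0), np.allclose(x, x_rebuilt))
```

The point itself does not move; only the numbers used to describe it change.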

PCA is a technique for finding particularly interesting alternate basis vectors.

The alternate basis is motivated by the fact that, for a given set of examples, there may be
pairwise correlation among features.

- If the correlation is *perfect* for some pair of features, they are redundant
    - May drop one feature


Consider the set of examples above.  Features 1 and 2 are perfectly correlated.
$$
\x_2^\ip = 2 * \x_1^\ip
$$

We can create an *alternate* basis vector (no longer parallel to the axes)
$$
\tilde{\v}_{(1)} = (1,2)
$$

such that example $\x^\ip$ is
$$
\x^\ip = \tilde\x^\ip_1 * \tilde{\v}_{(1)}
$$
where $\tilde\x^\ip_1 = \x^\ip_1$

That is, $\x^\ip$ has exposure $\tilde\x^\ip_1$ to the new, single basis vector.

So 
- Rather than representing $\x^\ip$ as a vector with 2 features (in the original basis)
- We can represent it as $\tilde\x^\ip$, a vector with 1 feature (in the new basis)

This is the essence of dimensionality reduction
- Changing bases to one with fewer basis vectors
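
A sketch of this reduction on a few made-up examples satisfying $\x_2^\ip = 2 * \x_1^\ip$. The values of the first feature are hypothetical.

```python
import numpy as np

# Hypothetical values for feature 1; feature 2 is exactly twice feature 1
x1 = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
X = np.column_stack([x1, 2 * x1])      # shape (5, 2): original representation

# Single alternate basis vector (not parallel to either axis)
v1 = np.array([1.0, 2.0])

# Reduced representation: one synthetic feature per example (equal to x_1)
X_tilde = X[:, [0]]                    # shape (5, 1)

# Each example is recovered exactly as (synthetic feature) * v1
X_rebuilt = X_tilde * v1               # broadcasts to shape (5, 2)

print(X_tilde.shape, np.allclose(X, X_rebuilt))
```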

Features are rarely perfectly correlated.

Let's modify the set of examples just a bit.

<table>
    <tr>
        <th><center>Two features: imperfect correlation</center></th>
    </tr>
    <tr>
        <td><img src="images/features_imperf_corr.png"></td>
    </tr>
</table>


We can still find an alternate basis of $2$ vectors to perfectly describe the set of examples.

$$
\x^\ip = \sum_{j'=1}^2 { \tilde\x^\ip_{j'} * \tilde{\v}_{(j')} }
$$

- The dark black line is the first alternate basis vector $\tilde{\v}_{(1)}$

<table>
    <tr>
        <th><center>Two features: imperfect correlation, alternate basis</center></th>
    </tr>
    <tr>
        <td><img src="images/features_basis.png"></td>
    </tr>
</table>


As you can see:
- The variation along $\tilde{\v}_{(1)}$ is much greater than that along $\tilde{\v}_{(2)}$
- Capturing the notion that the "main" relationship is along $\tilde{\v}_{(1)}$

In fact, if we dropped $\tilde{\v}_{(2)}$, leaving $\tilde\x^\ip$ with a single feature ($|| \tilde\x^\ip || = 1$)
- The examples would be projected onto the line $\tilde{\v}_{(1)}$
- With little information being lost

PCA finds alternate basis vectors and *orders them* by decreasing variation.
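
A rough sketch of this idea using only NumPy, on synthetic data. It is not the code behind the charts above: the data, the noise level, and the choice of an eigendecomposition of the covariance matrix are assumptions made for illustration.

```python
import numpy as np

# Synthetic, imperfectly correlated features (hypothetical data)
rng = np.random.default_rng(0)
m = 200
x1 = rng.normal(size=m)
x2 = 2 * x1 + rng.normal(scale=0.3, size=m)
X = np.column_stack([x1, x2])

# Center the features, then decompose their covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues

# Order the alternate basis vectors by decreasing variation
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
V = eigvecs[:, order]                    # columns are the alternate basis vectors

# Displacements of each example along the alternate basis vectors
X_tilde = Xc @ V
explained = eigvals / eigvals.sum()      # fraction of variation per basis vector

# Drop the second basis vector: project onto the line along the first
X_proj = np.outer(X_tilde[:, 0], V[:, 0])
rel_error = np.linalg.norm(Xc - X_proj) / np.linalg.norm(Xc)

print(explained, rel_error)
```

With both basis vectors kept, `X_tilde @ V.T` reproduces the centered examples exactly; keeping only the first loses the small amount of variation along the second.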

# Subsets of correlated features

It may not be the case that a group of features is correlated across *all* examples

Consider the MNIST digits
- The subset of examples corresponding to the digit "1"
- Have a particular set of correlated features (forming a vertical column of pixels near the middle of the image)
- Which *may not* be correlated with the same features in examples corresponding to *other* digits

Thus, a synthetic feature may encode a "concept" that occurs in many, but not all, examples

We will present a method to *discover* "concepts"
- It may not necessarily be the pattern of features that corresponds to an entire digit
- It may be a partial pattern common to several digits
    - Vertical band (0, 1, 4, 7)
    - Horizontal band at top (5, 7, 9)
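
A toy sketch of features that are correlated within one subset of examples but not another. It uses synthetic data rather than MNIST; the subsets, sample size, and noise level are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2000

# Subset A: the two features move together (a shared "concept")
a1 = rng.normal(size=m)
a2 = 2 * a1 + rng.normal(scale=0.3, size=m)

# Subset B: the same two features are unrelated
b1 = rng.normal(size=m)
b2 = rng.normal(size=m)

# Correlation between feature 1 and feature 2, computed per subset
corr_A = np.corrcoef(a1, a2)[0, 1]
corr_B = np.corrcoef(b1, b2)[0, 1]

print(corr_A, corr_B)
```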

In [5]:
print("Done")

Done
