# Data Science Mathematics
# Principal Component Analysis
# In-Class Activity

###1) Show that the following is true (remember that variance is the square of the standard deviation):

> *cov(x,x)=var(x)*
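
One way to show this, assuming the sample definitions with an *n - 1* divisor (the same convention `np.cov` uses by default), is to substitute *y = x* into the covariance formula. The symbols below (x-bar for the mean, s_x for the standard deviation) are notation chosen for this sketch:

```latex
\begin{aligned}
\operatorname{cov}(x, y) &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \\
\operatorname{cov}(x, x) &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})
  = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
  = s_x^2 = \operatorname{var}(x)
\end{aligned}
```

The same substitution works with a divisor of *n* (the population definitions), as long as covariance and variance use the same divisor.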


###2) Explain the benefits of dimensionality reduction on large data sets. In what ways might dimensionality reduction be detrimental? Recall from our Clustering lecture the "Curse of Dimensionality." How might the utility of dimensionality reduction be explained in this context?

In large data sets, reducing dimensionality allows for easier and faster computation, especially when not all factors / features may be of equal interest or significance. 

In the Clustering lecture, we discussed the Curse of Dimensionality: as the number of dimensions increases, the distances between a centroid and the points in n-dimensional space become nearly equal (i.e. it becomes much more difficult to identify clusters based on meaningful factors, and a model can "overfit" the data, which makes it functionally useless).

In this case, reducing dimensionality not only makes computation faster, but also increases the likelihood of finding meaningful clusters, which makes the data easier to analyze.
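
The distance effect described above can be illustrated with a small experiment. This is a minimal sketch, assuming uniformly random points (not the activity's data); the function name `relative_contrast` and its parameters are made up for this example. It measures how much the farthest point from the centroid differs from the nearest one, relative to the nearest distance.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min of the point-to-centroid distances."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))      # uniform points in the unit cube
    centroid = points.mean(axis=0)
    distances = np.linalg.norm(points - centroid, axis=1)
    return (distances.max() - distances.min()) / distances.min()

# The contrast shrinks as the number of dimensions grows
for n_dims in (2, 10, 100, 1000):
    print(n_dims, relative_contrast(500, n_dims))
```

A contrast close to zero means every point is roughly the same distance from the centroid, which is what makes distance-based clustering less informative in high dimensions.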

###3) Calculate (by hand) the eigenvalues and the associated eigenvectors of matrix A:

```
| 3     0     0 |
| -4    6     2 |
| 16   -15   -5 |
```

We know that the eigenvalues (**L** / lambda) of Matrix A are the solutions of *det(A - **L** x I) = 0*, where *I* is the identity matrix. The matrix *A - **L** x I* is the following:

```
| 3-L   0     0 |
| -4   6-L    2 |
| 16  -15  -5-L |
```

We set the determinant equal to zero and solve in order to find the eigenvalues:

```(3-L)*((6-L)*(-5-L)-(-15*2))-0*(-4*(-5-L)-(16*2))+0*(-4*-15-16*(6-L)) = 0```

Simplified, this comes out to:

`3*(6*-5 + -L*-5 + 6*-L + -L*-L + 30)-L(6*-5 + -L*-5 + 6*-L + -L*-L + 30) = 0`

`3*6*-5 + 3*-L*-5 + 3*6*-L + 3*-L*-L + 3*30  -  L*6*-5 + -L*-L*-5 + -L*6*-L + -L*-L*-L + -L*30 = 0`

`-90 + 15L + -18L + 3L^2 + 90 + 30L + -5L^2 + 6L^2 +-L^3 + -30L = 0`

`-L^3 + 3L^2 + -5L^2 + 6L^2 + 15L + -18L + 30L + -30L + -90 + 90 = 0`

`-L^3 + 4L^2 - 3L = 0`

`L^3 - 4L^2 + 3L = 0`

`L*(L^2 - 4L + 3) = 0`

`L*(L - 1)(L - 3) = 0`

Accordingly, **the eigenvalues are: 0, 1, 3**
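
As a cross-check of the hand calculation, the sketch below (assuming NumPy is available, as it is later in this activity) asks NumPy for the characteristic polynomial coefficients and the eigenvalues of matrix A. `np.poly` lists coefficients from the highest power down, so they can be compared directly with `L^3 - 4L^2 + 3L`.

```python
import numpy as np

A = np.array([[3, 0, 0],
              [-4, 6, 2],
              [16, -15, -5]], dtype=float)

# For a square matrix, np.poly returns the characteristic polynomial coefficients
char_poly = np.poly(A)

# Eigenvalues, sorted ascending; real_if_close drops negligible imaginary parts
eigenvalues = np.sort(np.real_if_close(np.linalg.eigvals(A)))

print(np.round(char_poly, 6))
print(np.round(eigenvalues, 6))
```

Floating-point results will only match 0, 1 and 3 approximately, so compare with `np.allclose` rather than `==`.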

Plugging each eigenvalue back into *(A - **L** x I)* and setting the product with the vector *{x,y,z}* equal to zero, we solve for the eigenvectors:

Solving for the **eigenvector for L = 0:**

```
| 3-0   0     0 | |x|
| -4   6-0    2 |*|y|
| 16  -15  -5-0 | |z|
```

Simplifies to:

```
3x = 0
-4x + 6y + 2z = 0
16x - 15y - 5z = 0
```

We recognize from the first equation that **x=0** must be a solution, so we substitute that back in to the lower two equations, add them, and solve for **z**:

```
6y-15y + 2z-5z = 0
-9y -3z = 0
-3y - z = 0
-3y = z
```

Any multiple of a solution is also a solution, so we are free to pick one value. Choosing **y=-1** gives **z=3**, making the **L=0 eigenvector {0,-1,3}**


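Solving for the **eigenvector for L = 1:**

```
| 3-1   0     0 | |x|
| -4   6-1    2 |*|y|
| 16  -15  -5-1 | |z|
```

The first row gives *2x = 0*, so **x=0**. The lower two rows then become *5y + 2z = 0* and *-15y - 6z = 0*, which both reduce to *z = -5y/2*. Choosing **y=2** gives **z=-5**, making the **L=1 eigenvector {0,2,-5}**

Solving for the **eigenvector for L = 3:**

```
| 3-3   0     0 | |x|
| -4   6-3    2 |*|y|
| 16  -15  -5-3 | |z|
```

The first row is all zeros, so it places no constraint on x. The lower two rows are *-4x + 3y + 2z = 0* and *16x - 15y - 8z = 0*. Adding four times the first to the second gives *-3y = 0*, so **y=0** and *z = 2x*. Choosing **x=1** gives **z=2**, making the **L=3 eigenvector {1,0,2}**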

###4) Compute (by hand) the determinant of matrix A

This is a 3x3 matrix:

```
| a1 b1 c1 |
| a2 b2 c2 |
| a3 b3 c3 |
```

The determinant of that 3x3 matrix can be expressed as:

```a1*[b2 c2 | b3 c3] - b1*[a2 c2 | a3 c3] + c1*[a2 b2 | a3 b3]```

This can then be simplified to:

```(a1*b2*c3 + b1*c2*a3 + c1*a2*b3) - (a3*b2*c1 + b3*c2*a1 + c3*a2*b1)```

This, for matrix A is:

```(3*6*-5 + 0*2*16 + 0*-4*-15)-(16*6*0 + -15*2*3 + -5*-4*0)```

In [13]:
x = ((3*6*-5) + (0*2*16) + (0*-4*-15))
y = ((16*6*0) + (-15*2*3) + (-5*-4*0))
print('Determinant:',x-y)

#The determinant is the factor by which the transformation scales volume (area in the 2D case).
#In this case, it is 0.
#This means the transformation collapses 3D space onto a plane, a line, or a point, all of which have volume 0.

Determinant: 0
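
The hand result can be compared against NumPy. This is a minimal sketch: `np.linalg.det` works in floating point, so the value may come back as a tiny non-zero number, and the rank shows how many dimensions survive the transformation.

```python
import numpy as np

A = np.array([[3, 0, 0],
              [-4, 6, 2],
              [16, -15, -5]], dtype=float)

det = np.linalg.det(A)            # floating-point determinant
rank = np.linalg.matrix_rank(A)   # number of linearly independent rows/columns

print(np.isclose(det, 0.0), rank)
```

A zero determinant is also consistent with question 3: the determinant equals the product of the eigenvalues, and one of them is 0.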


###5) You are a data scientist at a three-letter agency. You have been following a group of suspected ISIL members on social media, and have derived 4 features from various profiles. You are developing a supervised learning algorithm for identifying ISIL members based on these features, and need to project your data onto **two dimensions** for clustering analysis.

###Do the following:
>a) Derive a covariance matrix from this data set

>b) Calculate the feature vector of eigenvalues from the covariance matrix

>c) Project the data set into the appropriate principal component space

>d) Assuming the class of each record is known, explain how this reduced data set could be used to derive a supervised learning algorithm based on clustering

>e) BONUS: Graph the 2D principal components

Refer to your in-class handout for instructions.  You are going to do most of the coding yourself here.

Read about this library here:
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

First, let's import our relevant libraries.

In [3]:
import numpy as np
from sklearn.decomposition import PCA

Next, we need to instantiate our data set.

In [4]:
data = np.array([[5.1,3.5,1.4,0.2],
[4.9,3.0,1.4,0.2],
[4.7,3.2,1.3,0.2],
[4.6,3.1,1.5,0.2],
[5.0,3.6,1.4,0.2],
[5.4,3.9,1.7,0.4],
[4.6,3.4,1.4,0.3],
[5.0,3.4,1.5,0.2],
[4.4,2.9,1.4,0.2],
[4.9,3.1,1.5,0.1],
[5.4,3.7,1.5,0.2],
[4.8,3.4,1.6,0.2],
[4.8,3.0,1.4,0.1],
[4.3,3.0,1.1,0.1],
[5.8,4.0,1.2,0.2],
[5.7,4.4,1.5,0.4],
[5.4,3.9,1.3,0.4],
[5.1,3.5,1.4,0.3],
[5.7,3.8,1.7,0.3],
[5.1,3.8,1.5,0.3]])

###a) Derive a covariance matrix from this data set

Now, in the cell below, calculate your covariance matrix for the above data set. Each row of the data is a record and each column is a feature, so pass `rowvar=False` to get the 4x4 covariance matrix of the features:
> c = np.cov(x, rowvar=False)

In [5]:
#np.cov treats rows as variables by default, which here would give a 20x20 matrix over the records.
c = np.cov(data, rowvar=False)

Print the covariance matrix.

In [6]:
print(c)
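
To see what the feature covariance matrix contains, it can be built by hand and compared with `np.cov`. This is a minimal sketch that uses only the first six records of the data set as a stand-in (the name `sample` is made up for this example); it also shows how the `rowvar` argument changes the shape of the result.

```python
import numpy as np

# Stand-in: the first six records of the activity's data set
sample = np.array([[5.1, 3.5, 1.4, 0.2],
                   [4.9, 3.0, 1.4, 0.2],
                   [4.7, 3.2, 1.3, 0.2],
                   [4.6, 3.1, 1.5, 0.2],
                   [5.0, 3.6, 1.4, 0.2],
                   [5.4, 3.9, 1.7, 0.4]])

# Subtract each feature's mean, then average the products of deviations (n - 1 divisor)
centered = sample - sample.mean(axis=0)
manual_cov = centered.T @ centered / (sample.shape[0] - 1)

print(manual_cov.shape)                                       # one row/column per feature
print(np.allclose(manual_cov, np.cov(sample, rowvar=False)))  # matches np.cov on columns
print(np.cov(sample).shape)                                   # default treats rows as variables
```

The diagonal of `manual_cov` holds each feature's variance, which ties back to question 1.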

###b) Calculate the feature vector of eigenvalues from the covariance matrix

Now, in the cell below, calculate the eigenvectors and eigenvalues of the covariance matrix.

In [None]:
np.linalg.eig(c)
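
A covariance matrix is symmetric, so one option is `np.linalg.eigh`, which is intended for symmetric matrices and returns real eigenvalues in ascending order. The sketch below is one possible way to build a two-column feature vector, again using the first six records as a stand-in; the names `sample`, `order` and `feature_vector` are made up for this example.

```python
import numpy as np

# Stand-in: the first six records of the activity's data set
sample = np.array([[5.1, 3.5, 1.4, 0.2],
                   [4.9, 3.0, 1.4, 0.2],
                   [4.7, 3.2, 1.3, 0.2],
                   [4.6, 3.1, 1.5, 0.2],
                   [5.0, 3.6, 1.4, 0.2],
                   [5.4, 3.9, 1.7, 0.4]])

cov = np.cov(sample, rowvar=False)

# eigh returns eigenvalues in ascending order; eigenvectors are the columns
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the largest eigenvalue (most variance) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the two eigenvectors with the largest eigenvalues
feature_vector = eigenvectors[:, :2]
explained = eigenvalues / eigenvalues.sum()

print(feature_vector.shape)
print(explained)
```

The entries of `explained` play the same role as `explained_variance_ratio_` in scikit-learn's `PCA`.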

###c) Project the data set into the appropriate principle component space

Now calculate the principal components (reduce to 2 dimensions).  First, you need to instantiate your PCA object.

In [None]:
pca = PCA(n_components=2)

Now, in the cell below, train your model on your dataset:
> pca.fit(X)

In [None]:
pca.fit(data)

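The following attributes give your explained variance ratios (the fraction of variance explained by each of the selected components) and the principal axes (the directions onto which the data is projected). `pca.components_` holds the axes, not the projected data; the dimensionally-reduced data set itself comes from `pca.transform(data)`.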

Print these values in the cells below.

In [None]:
print(pca.explained_variance_ratio_)

In [None]:
print(pca.components_)
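
To get the projected data set for part c, the fitted model has to be applied to the records. This is a minimal sketch using the first six records as a stand-in (`sample`, `projected` and `manual` are names made up for this example). It also shows that, with the default settings, the projection is the centered data multiplied by the transposed principal axes.

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in: the first six records of the activity's data set
sample = np.array([[5.1, 3.5, 1.4, 0.2],
                   [4.9, 3.0, 1.4, 0.2],
                   [4.7, 3.2, 1.3, 0.2],
                   [4.6, 3.1, 1.5, 0.2],
                   [5.0, 3.6, 1.4, 0.2],
                   [5.4, 3.9, 1.7, 0.4]])

pca = PCA(n_components=2)
projected = pca.fit_transform(sample)   # fit, then project each record onto 2 components

# The same projection by hand: center on the fitted mean, then project onto the axes
manual = (sample - pca.mean_) @ pca.components_.T

print(projected.shape)
print(np.allclose(projected, manual))
```

On the full data set the equivalent call would be `pca.fit_transform(data)`, or `pca.transform(data)` after `pca.fit(data)`.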

###d) Assuming the class of each record is known, explain how this reduced data set could be used to derive a supervised learning algorithm based on clustering
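
One possible reading of this question, offered as a sketch rather than the expected answer: with known classes, the 2D points of each class form a cluster, each cluster can be summarized by its centroid, and a new record can be assigned to the class of the nearest centroid. Everything in the block below is hypothetical, including the projected points, the labels, and the helper names `fit_centroids` and `predict`.

```python
import numpy as np

# Hypothetical 2D projected records with known class labels (not from the activity's data)
projected = np.array([[-2.0, 0.1], [-1.8, -0.2], [-2.2, 0.3],
                      [1.9, 0.0], [2.1, -0.1], [2.0, 0.2]])
labels = np.array([0, 0, 0, 1, 1, 1])

def fit_centroids(points, labels):
    """Compute one centroid per class from labeled points."""
    classes = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(points, classes, centroids):
    """Assign each point to the class of its nearest centroid."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]

classes, centroids = fit_centroids(projected, labels)
new_points = np.array([[-1.5, 0.0], [1.5, 0.0]])
print(predict(new_points, classes, centroids))
```

New records would need to be projected with the same fitted PCA model before being compared with the centroids.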

###e) BONUS: Graph the 2D principal components

Bonus: Figure out how to plot your principal components as a scatter plot:

https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.scatter.html
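
A minimal sketch of one way to draw the scatter plot, assuming matplotlib is installed and using the first six records as a stand-in (the name `sample` and the axis labels are choices made for this example):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-in: the first six records of the activity's data set
sample = np.array([[5.1, 3.5, 1.4, 0.2],
                   [4.9, 3.0, 1.4, 0.2],
                   [4.7, 3.2, 1.3, 0.2],
                   [4.6, 3.1, 1.5, 0.2],
                   [5.0, 3.6, 1.4, 0.2],
                   [5.4, 3.9, 1.7, 0.4]])

# Project onto the first two principal components
projected = PCA(n_components=2).fit_transform(sample)

# One point per record: first component on x, second on y
fig, ax = plt.subplots()
ax.scatter(projected[:, 0], projected[:, 1])
ax.set_xlabel("Principal component 1")
ax.set_ylabel("Principal component 2")
ax.set_title("Records projected onto two principal components")
```

In a notebook the figure renders inline after the cell runs; in a standalone script, a call to `plt.show()` would be needed. If class labels were available, they could be passed to `scatter` through its `c` argument to color the points by class.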

***Now save your output.  Go to File -> Print Preview and save your final output as a PDF.  Turn it in to your instructor, along with any additional sheets.***