In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reading an external file with data

The .csv file "weight-height.csv" contains a fairly large data set with the height and weight of some population (probably American, because of the units used).

We will import the data from this file into a "Pandas data frame":

In [None]:
df = pd.read_csv("weight-height.csv",index_col=None)

Pandas is a library of functions that help us handle tabular data sets like this. Essentially, a Pandas data frame is like a matrix (2D NumPy array), but whereas all entries of a NumPy array share a single data type, each column of a data frame can have its own type. Let's look at the one we just imported:

In [None]:
df

As you can see, the data frame has 10000 observations on three different variables ($n=10000$, $p=3$). The first variable is non-numerical.
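
To make the "one type per column" idea concrete, here is a minimal sketch on a hypothetical three-row toy frame (the values are invented for illustration and are not taken from `weight-height.csv`). It prints the shape $(n, p)$ and the per-column data types, then shows what happens when the whole frame is squeezed into a single NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from weight-height.csv
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Height": [70.0, 64.5, 66.2],
    "Weight": [180.0, 130.5, 142.0],
})

print(toy.shape)    # (number of observations n, number of variables p)
print(toy.dtypes)   # one data type per column

# A NumPy array has a single dtype, so mixing text and numbers
# forces the generic "object" dtype for every entry.
as_array = toy.to_numpy()
print(as_array.dtype)
```

The same `shape` and `dtypes` attributes can be inspected on `df` itself.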

A neat feature of data frames is that we can assign labels to the columns (in this example "Gender", "Height" and "Weight"), and use these labels to access individual columns, e.g.:

In [None]:
df['Height']

Let's add two columns that are easier for us to interpret :)

In [None]:
df['Height_cm']=df['Height']*2.54
df['Weight_kg']=df['Weight']/2.20462

In [None]:
df

Let's extract the numerical columns we're interested in. That would be the 4th and 5th columns (indices 3 and 4, since counting starts at 0).
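
Columns can be picked either by position or by label. The sketch below uses a hypothetical two-row toy frame (invented values) with the same column layout as `df`, and checks that `iloc` with positions 3 and 4 returns the same data as `loc` with the two labels.

```python
import pandas as pd

# Hypothetical toy frame with the same column layout as df
toy = pd.DataFrame({
    "Gender": ["Male", "Female"],
    "Height": [70.0, 64.5],
    "Weight": [180.0, 130.5],
})
toy["Height_cm"] = toy["Height"] * 2.54
toy["Weight_kg"] = toy["Weight"] / 2.20462

by_position = toy.iloc[:, [3, 4]]                    # 4th and 5th columns
by_label = toy.loc[:, ["Height_cm", "Weight_kg"]]    # same columns, by name

same = by_position.equals(by_label)
print(same)
```

Selecting by label is usually more robust, because it keeps working if columns are later added or reordered.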

#### How about we plot the data points:

In [None]:
df.plot.scatter('Height_cm','Weight_kg')
plt.axis('equal')
plt.show()

#### Let's also look at the correlation coefficients. 

For a Pandas dataframe the correlation coefficient matrix $R$ can be calculated easily. Let's calculate the correlation matrix for the `Height_cm` and `Weight_kg` columns:

In [None]:
X = df.loc[:,['Height_cm','Weight_kg']] # This extracts the given columns from the dataframe
                                        # and assigns them to a new dataframe called X
S = X.cov()                             # Covariance matrix for X
R = X.corr()                            # Correlation matrix for X
print(S,'\n')
print(R)

As the plot above (and common sense) suggests, the height and weight of a person are highly correlated ($r\approx 0.92$).

From the diagonal of `S` we see that the variance of `Height_cm` is about 95 cm² (standard deviation $\approx 9.78$ cm) while the variance of `Weight_kg` is about 212 kg² (standard deviation $\approx 14.56$ kg).
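
The covariance matrix, the standard deviations and the correlation matrix are tied together: the standard deviations are the square roots of the diagonal of $S$, and $r_{ij} = s_{ij}/(s_i s_j)$. Here is a small sketch that checks this on synthetic data (randomly generated stand-ins, not the real measurements; the parameters of the random draws are arbitrary assumptions).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, NOT the contents of weight-height.csv
rng = np.random.default_rng(0)
h = rng.normal(170, 10, 500)
w = 0.9 * (h - 170) + rng.normal(75, 5, 500)
X = pd.DataFrame({"Height_cm": h, "Weight_kg": w})

S = X.cov()
R = X.corr()

sd = np.sqrt(np.diag(S))                     # standard deviations from the diagonal of S
R_from_S = S.to_numpy() / np.outer(sd, sd)   # r_ij = s_ij / (s_i * s_j)

print(np.allclose(sd, X.std().to_numpy()))
print(np.allclose(R_from_S, R.to_numpy()))
```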

## Calculating principal components

First, let's center the data!
`X.mean()` calculates the mean of each column of the dataframe `X`, so we can subtract it from every row.

In [None]:
Y = X-X.mean()
print(Y)
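
As a quick sanity check on centering, the sketch below (synthetic data with arbitrary, assumed parameters) confirms that the centered columns have mean zero and that subtracting the mean leaves the covariance matrix unchanged.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data, NOT the contents of weight-height.csv
rng = np.random.default_rng(1)
values = rng.normal(size=(200, 2)) * [10, 15] + [170, 75]
X = pd.DataFrame(values, columns=["Height_cm", "Weight_kg"])

Y = X - X.mean()   # subtract each column's mean from that column

means_are_zero = np.allclose(Y.mean().to_numpy(), 0)
cov_unchanged = np.allclose(Y.cov().to_numpy(), X.cov().to_numpy())
print(means_are_zero, cov_unchanged)
```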

The lecture tells us that we need the eigenvalues and eigenvectors of the covariance matrix `S`.

In [None]:
l, Q = np.linalg.eig(S)
# The following code sorts the eigenvalues by decreasing size and rearranges the columns of Q accordingly:
idx = l.argsort()[::-1]
l = l[idx]
Q = Q[:,idx]
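
To see what the sorted output means, here is a minimal sketch on a hypothetical symmetric $2\times 2$ matrix (the entries are made up, not our actual `S`). After sorting, each column $q_i$ of `Q` should still satisfy $Sq_i = \lambda_i q_i$.

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a covariance matrix
S = np.array([[4.0, 2.0],
              [2.0, 3.0]])

l, Q = np.linalg.eig(S)

# Same sorting as above: largest eigenvalue first, columns of Q reordered to match
idx = l.argsort()[::-1]
l = l[idx]
Q = Q[:, idx]

# Q * l scales column i of Q by l[i], i.e. it equals Q @ diag(l)
pairs_ok = np.allclose(S @ Q, Q * l)
print(pairs_ok)
```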

Let's take a look:

In [None]:
print(l)
print(Q)

Let's check that `Q` is an orthogonal matrix:

In [None]:
Q.transpose() @ Q
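
Because of rounding, the product is typically only approximately the identity matrix. A small sketch of a tolerance-based check, again on a hypothetical symmetric matrix rather than our `S`:

```python
import numpy as np

# Hypothetical symmetric matrix standing in for a covariance matrix
S = np.array([[4.0, 2.0],
              [2.0, 3.0]])
_, Q = np.linalg.eig(S)

# Orthogonal means Q^T Q = I, or equivalently Q^T is the inverse of Q
is_orthogonal = np.allclose(Q.T @ Q, np.eye(2))
inverse_is_transpose = np.allclose(np.linalg.inv(Q), Q.T)
print(is_orthogonal, inverse_is_transpose)
```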

The lecture tells us that the matrix
$$Y'=YQ$$
will contain our new variables (our principal components).

In [None]:
Yp = Y @ Q
Yp.columns=['Pc. 1','Pc. 2']  # We're also giving appropriate names to the columns.

In [None]:
Yp.plot.scatter('Pc. 1','Pc. 2')
plt.axis('equal')
plt.show()

Note that the eigenvalues are the same as the variances of the principal components, i.e. the diagonal entries of $S'$, the covariance matrix of $Y'$:

In [None]:
print(l)
print(Yp.cov())
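
Here is a short derivation of why this happens, as a sketch. It assumes the covariance is computed with the $1/(n-1)$ normalization (the pandas default), writes $^T$ for the transpose (the prime in $Y'$ is not a transpose), and sets $\Lambda = \operatorname{diag}(\lambda_1, \lambda_2)$. Since $Y$ is centered, so is $Y' = YQ$, and

$$S' = \frac{1}{n-1}(Y')^{T}Y' = \frac{1}{n-1}(YQ)^{T}(YQ) = Q^{T}\left(\frac{1}{n-1}Y^{T}Y\right)Q = Q^{T}SQ = Q^{T}Q\Lambda = \Lambda,$$

using $SQ = Q\Lambda$ and $Q^{T}Q = I$. The off-diagonal entries of $S'$ are therefore zero (up to rounding): the principal components are uncorrelated.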

Note that the total variance (sum of diagonal entries of covariance matrix) is the same for the original variables as for the principal components.

In [None]:
print(np.trace(Y.cov()))
print(np.trace(Yp.cov()))
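
This is no coincidence. The covariance matrix of $Y' = YQ$ is $Q^{T}SQ$, where $S$ is the covariance matrix of $Y$ and $^T$ denotes the transpose. Since the trace is unchanged under cyclic permutation of a product, and $QQ^{T} = I$,

$$\operatorname{tr}(Q^{T}SQ) = \operatorname{tr}(SQQ^{T}) = \operatorname{tr}(S).$$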

But for the principal components, most of the variance is found in the first principal component.

In [None]:
l

In [None]:
l[0]/sum(l)

More than 96% of the total variance is captured by the first principal component.
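
The same ratio can be computed for every component at once. Below is a minimal sketch with hypothetical eigenvalues (made-up numbers, already sorted in decreasing order) that gives the share of variance per component and the cumulative share.

```python
import numpy as np

# Hypothetical eigenvalues, sorted largest first (NOT the values computed above)
l = np.array([4.5, 0.5])

ratio = l / l.sum()            # share of total variance per principal component
cumulative = np.cumsum(ratio)  # share captured by the first k components

print(ratio)
print(cumulative)
```

With more than two variables, the cumulative share is a common guide for deciding how many principal components to keep.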