# Principal Component Analysis

The objective is to use principal component analysis (PCA) to reduce the dimensionality of the data.

The data contain nine socio-economic and health factors:
1. Deaths of children under 5 years of age per 1,000 live births
2. Exports of goods and services per capita, given as a percentage of GDP per capita
3. Total health spending per capita, given as a percentage of GDP per capita
4. Imports of goods and services per capita, given as a percentage of GDP per capita
5. Net income per person
6. Annual growth rate of total GDP
7. The average number of years a newborn child would live if current mortality patterns remained the same
8. The number of children that would be born to each woman if current age-fertility rates remained the same
9. GDP per capita, calculated as total GDP divided by total population

In [1]:
# used for manipulating directory paths
import os

# Scientific and vector computation for python
import numpy as np

from IPython.display import HTML, display, clear_output

# library written for this exam
import utilsPCA as utils

%load_ext autoreload
%autoreload 2

### Load the dataset

In [2]:
# Load the dataset into the variable X 
data = np.loadtxt(os.path.join('Data', 'PCACountryData.txt'))
X = data

m = X.shape[0] # number of training examples

In [3]:
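def pca(X):
    """Run PCA on a normalized data matrix X of shape (m x n).

    Returns the principal components U (n x n, one per column) and the
    eigenvalues S of the covariance matrix (length n, largest first).
    """
    # Useful values
    m, n = X.shape

    # You need to return the following variables correctly.
    U = np.zeros((n, n))
    S = np.zeros(n)

    # ====================== YOUR CODE HERE ======================
    # Covariance matrix of the normalized (mean-centered) data
    Sigma = (1 / m) * np.dot(X.T, X)
    # Sigma is symmetric, so its SVD gives the eigenvectors (columns of U)
    # and the eigenvalues (S); the third return value is not needed
    U, S, _ = np.linalg.svd(Sigma)
    # ============================================================
    return U, S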
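As a sanity check on the approach in `pca`, the sketch below compares the SVD of a covariance matrix with a direct eigendecomposition. The helper name `pca_via_eigh` and the random demo data are illustrative assumptions, not part of the notebook or of `utilsPCA`. For a symmetric positive semi-definite matrix the two should agree, with eigenvectors matching up to sign.

```python
import numpy as np

def pca_via_eigh(X):
    """Hypothetical cross-check: PCA via eigendecomposition of the covariance."""
    m = X.shape[0]
    Sigma = X.T @ X / m
    # eigh returns eigenvalues in ascending order, so reverse them
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    order = np.argsort(eigvals)[::-1]
    return eigvecs[:, order], eigvals[order]

# Random, mean-centered demo data (an assumption; not the country dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
X_demo -= X_demo.mean(axis=0)

U_svd, S_svd, _ = np.linalg.svd(X_demo.T @ X_demo / X_demo.shape[0])
U_eig, S_eig = pca_via_eigh(X_demo)

# Eigenvalues should match; eigenvectors should match up to sign
print(np.allclose(S_svd, S_eig))
print(np.allclose(np.abs(U_svd), np.abs(U_eig), atol=1e-6))
```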

In [4]:
X_norm, mu, sigma = utils.featureNormalize(X)
U, S = pca(X_norm)
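The implementation of `utils.featureNormalize` is not shown here. A minimal sketch of what such a function might do, assuming it standardizes each column to zero mean and unit standard deviation (the name `feature_normalize_sketch` and the choice of `ddof=1` are assumptions, not the library's confirmed behavior):

```python
import numpy as np

def feature_normalize_sketch(X):
    """Hypothetical stand-in: standardize each column of X."""
    mu = X.mean(axis=0)
    # ddof=1 (sample standard deviation) is an assumption
    sigma = X.std(axis=0, ddof=1)
    return (X - mu) / sigma, mu, sigma

# Small made-up matrix with columns on very different scales
X_demo = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 900.0]])
X_demo_norm, mu_demo, sigma_demo = feature_normalize_sketch(X_demo)
print(X_demo_norm.mean(axis=0))
print(X_demo_norm.std(axis=0, ddof=1))
```

Normalizing first matters here because the nine factors are measured in very different units, and PCA would otherwise be dominated by the largest-scale columns.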

In [5]:
print(X_norm[0,:])

[ 1.28765971 -1.13486665  0.2782514  -0.08220771 -0.80582187  0.15686445
 -1.61423717  1.89717646 -0.67714308]


In [6]:
def projectData(X, U, K):
   
    # You need to return the following variables correctly.
    Z = np.zeros((X.shape[0], K))

    # ====================== YOUR CODE HERE ======================
    Z = np.dot(X, U[:, :K])
    # =============================================================
    return Z

In [7]:
# Project the data onto the first K = 4 principal components
K = 4
Z = projectData(X_norm, U, K)

# Fraction of the total variance retained by the first K components
# Question: why do the slides say this should be >= 0.99?
print(np.sum(S[:K])/np.sum(S))

print('Projection : {:.6f} {:.6f} {:.6f} {:.6f}'.format(Z[0, 0], Z[0, 1], Z[0, 2], Z[0, 3]))

0.8719078614023906
Projection : -2.904290 -0.095334 -0.715965 1.002240


In [8]:
print(Z[0, 3])

1.0022403774544446


In [9]:
def recoverData(Z, U, K):
    """
    Recovers an approximation of the original data when using the 
    projected data.
    
    Parameters
    ----------
    Z : array_like
        The reduced data after applying PCA. This is a matrix
        of shape (m x K).
    
    U : array_like
        The eigenvectors (principal components) computed by PCA.
        This is a matrix of shape (n x n) where each column represents
        a single eigenvector.
    
    K : int
        The number of principal components retained
        (should be less than n).
    
    Returns
    -------
    X_rec : array_like
        The recovered data after transformation back to the original 
        dataset space. This is a matrix of shape (m x n), where m is 
        the number of examples and n is the dimensions (number of
        features) of the original dataset.
    
    Instructions
    ------------
    Compute the approximation of the data by projecting back
    onto the original space using the top K eigenvectors in U.
    For the i-th example Z[i,:], the (approximate)
    recovered data for dimension j is given as follows:

        v = Z[i, :]
        recovered_j = np.dot(v, U[j, :K])

    Notice that U[j, :K] is a vector of size K.
    """
    # You need to return the following variables correctly.
    X_rec = np.zeros((Z.shape[0], U.shape[0]))

    # ====================== YOUR CODE HERE ======================
    X_rec = np.dot(Z,U[:, :K].T)
    # =============================================================
    return X_rec

    # Note: "rec" is short for "reconstructed".

In [10]:
X_rec  = recoverData(Z, U, K)

In [11]:
print(X_rec[0,:])

[ 1.62943025 -0.86660956  0.42884917 -0.26227601 -0.97789998  0.25098613
 -1.54380803  1.55249949 -0.69960303]
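Comparing `X_rec[0,:]` with `X_norm[0,:]` shows that the recovered row is close to, but not identical to, the original. One way to quantify this is the relative squared reconstruction error, which for this covariance-based PCA equals one minus the retained-variance ratio. The sketch below illustrates that relationship on random demo data; all `*_demo` names are assumptions for illustration, not the notebook's variables.

```python
import numpy as np

# Random correlated demo data, standardized column-wise (not the country data)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 6)) @ rng.normal(size=(6, 6))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

m = X_demo.shape[0]
U_demo, S_demo, _ = np.linalg.svd(X_demo.T @ X_demo / m)

# Project onto the top K components, then map back to the original space
K_demo = 3
Z_demo = X_demo @ U_demo[:, :K_demo]
X_rec_demo = Z_demo @ U_demo[:, :K_demo].T

# Relative squared error versus 1 - retained variance
error_ratio = np.sum((X_demo - X_rec_demo) ** 2) / np.sum(X_demo ** 2)
retained = np.sum(S_demo[:K_demo]) / np.sum(S_demo)
print(error_ratio, 1 - retained)
```

The two printed numbers should agree up to floating-point error, which is why the ratio `np.sum(S[:K])/np.sum(S)` can be used to judge reconstruction quality without computing `X_rec`.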


In [13]:
K = 5
print(np.sum(S[:K])/np.sum(S))

0.945309975643951


In [14]:
K = 8
print(np.sum(S[:K])/np.sum(S))

0.9925694437691404
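The cells above try K = 4, 5, and 8 by hand. As a possible alternative, the smallest K that meets a retained-variance threshold can be found directly from the cumulative sum of `S`. This is a minimal sketch; the helper `smallest_k` and the eigenvalues in `S_demo` are made up for illustration and are not the values from this dataset.

```python
import numpy as np

def smallest_k(S, threshold=0.99):
    """Hypothetical helper: smallest K whose retained variance >= threshold."""
    ratios = np.cumsum(S) / np.sum(S)
    # Index of the first ratio that reaches the threshold, converted to a count
    k = int(np.searchsorted(ratios, threshold)) + 1
    # Guard against floating-point round-off when threshold is close to 1
    return min(k, len(S))

# Made-up eigenvalues in descending order (an assumption for the demo)
S_demo = np.array([4.0, 3.0, 2.0, 0.9, 0.1])
print(smallest_k(S_demo, threshold=0.95))
print(smallest_k(S_demo, threshold=0.80))
```

With the notebook's own `S`, a call such as `smallest_k(S, 0.99)` would report the smallest K that reaches the 0.99 target, including values not tried above.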
