# Synthetic Data

## Problem 1
Here is the Python code to create a matrix $A \in \mathbb{R}^{3 \times 2}$
whose individual entries are drawn iid from a Gaussian distribution with mean 0 and
variance 1.

In [113]:
import numpy as np
A = np.random.normal(0, 1, (3, 2))
print(A)

[[-0.17926302  0.31978681]
 [-1.57485409 -0.81205371]
 [ 0.80122514  0.73300686]]


In [114]:
print(np.linalg.matrix_rank(A))

2


Matrices with iid Gaussian entries are full rank with probability 1. Here, the rank is 2, the maximum possible for a $3 \times 2$ matrix.
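The full-rank claim can be checked empirically. The following is a minimal sketch, not part of the original notebook: it assumes Python 3 with a recent NumPy, and the generator `rng`, its seed, and the trial count are arbitrary choices made only for repeatability.

```python
import numpy as np

# Assumption: seed 0 and 1000 trials are arbitrary illustration choices.
rng = np.random.default_rng(0)

# Draw many 3x2 matrices with iid N(0, 1) entries and record each rank.
ranks = [int(np.linalg.matrix_rank(rng.normal(0, 1, (3, 2)))) for _ in range(1000)]

# Every draw should have the maximum possible rank, min(3, 2) = 2.
print(set(ranks))
```

A rank-deficient draw would require the two columns to be exactly collinear, which is an event of probability zero for continuous distributions.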

# Generation of Dataset 1

## Problem 1
Here is the generation of 500 data samples $X = \{x_i = Av_i\}_{i = 1}^{500}$ using the mathematical model that each $v_i \in \mathbb{R}^2$ is a random vector whose entries are iid Gaussian with mean 0 and variance 1. Stacking the samples as columns gives $X \in \mathbb{R}^{3 \times 500}$, which matches the shape printed below. We can also verify that $X$ has rank 2: every column lies in the 2-dimensional column space of $A$.

In [116]:
N = 500
# Each list element is one sample x_i = A v_i, stored as a 3x1 column
X = np.concatenate([np.matmul(A, np.random.normal(0, 1, (2, 1))) for _ in range(N)], axis=1)
print(X.shape)

(3, 500)


In [117]:
print(np.linalg.matrix_rank(X))

2
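The same dataset can be built with a single matrix product instead of a Python loop. This is a hedged sketch under the assumption of Python 3 and a seeded generator (`rng`, `V`, and `X_loop` are names introduced here for illustration, not taken from the notebook); it also confirms that the vectorized and column-by-column constructions agree.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: arbitrary seed for repeatability
A = rng.normal(0, 1, (3, 2))
N = 500

# Column i of V is v_i, so column i of A @ V is x_i = A v_i.
V = rng.normal(0, 1, (2, N))
X = A @ V

# Column-by-column construction, mirroring the loop used in the notebook cell.
X_loop = np.concatenate([A @ V[:, [i]] for i in range(N)], axis=1)

print(X.shape, np.allclose(X, X_loop), np.linalg.matrix_rank(X))
```

Because $X = AV$ with $A$ of rank 2, the rank of $X$ cannot exceed 2 no matter how many samples are drawn.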


# Singular Value and Eigenvalue Decomposition of Dataset 1

## Problem 1
Here is the computation of the SVD of $X$ and the eigenvalue decomposition of $XX^T$, along with my verification that:

1. The left singular vectors of $X$ are eigenvectors of $XX^T$ (up to sign and ordering).
2. The eigenvalues of $XX^T$ are the squares of the singular values of $X$.
3. The energy in $X$ (its squared Frobenius norm) is equal to the sum of the squares of the singular values of $X$.

In [118]:
# Calculation of the SVD of X and the EVD of XX^T
U, S, Vt = np.linalg.svd(X)
D, E = np.linalg.eig(np.matmul(X, X.transpose()))

In [119]:
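# Verification of condition 1 -- each column of U matches a column of E up to
# sign and column order (np.linalg.eig does not sort its output). Here
# U[:, 0] = E[:, 2], U[:, 1] = E[:, 0], and U[:, 2] = -E[:, 1].
print("Left singular vectors of X = ")
print(U)
print("Eigenvectors of XX^T = ")
print(E)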

Left singular vectors of X = 
[[ 0.0058102   0.83219762 -0.5544487 ]
 [-0.85658895  0.29022866  0.42664118]
 [ 0.51596668  0.47245576  0.71453757]]
Eigenvectors of XX^T = 
[[ 0.83219762  0.5544487   0.0058102 ]
 [ 0.29022866 -0.42664118 -0.85658895]
 [ 0.47245576 -0.71453757  0.51596668]]


In [126]:
# Verification of condition 2:
# The eigenvalues come back in a different order than the singular values,
# so match them up by magnitude. The values -3.26849658e-13 and 2.45239578e-29
# are both numerically zero: X has rank 2, so the third eigenvalue is exactly
# 0 in theory. The tiny nonzero values, including the negative sign, are due
# to floating-point rounding error and are unrelated to the sign flip between
# the corresponding pair of vectors above.
print("Eigenvalues of XX^T = ")
print(D)
print("Singular values of X = ")
print(S)
print("Square of the singular values of X = ")
print(np.square(S))

Eigenvalues of XX^T = 
[  9.40167420e+01  -3.26849658e-13   1.89506696e+03]
Singular values of X = 
[  4.35323668e+01   9.69622308e+00   4.95216698e-15]
Square of the singular values of X = 
[  1.89506696e+03   9.40167420e+01   2.45239578e-29]


In [122]:
# Verification of condition 3 (energy = squared Frobenius norm of X):
print("Energy of X = {}".format(np.square(np.linalg.norm(X))))
print("Sum of the squares of the singular values of X = {}".format(np.sum(np.square(S))))

Energy of X = 1989.08370217
Sum of the squares of the singular values of X = 1989.08370217
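
The three conditions can also be checked programmatically rather than by eye. The sketch below is an illustration under stated assumptions, not the notebook's own method: it regenerates a dataset with a seeded generator, swaps in `np.linalg.eigh` (suited to the symmetric matrix $XX^T$) in place of `np.linalg.eig`, sorts the eigenpairs in descending order, and aligns signs before comparing.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: arbitrary seed for repeatability
X = rng.normal(0, 1, (3, 2)) @ rng.normal(0, 1, (2, 500))

U, S, Vt = np.linalg.svd(X)

# eigh returns eigenvalues in ascending order; reorder to match the SVD.
D, E = np.linalg.eigh(X @ X.T)
order = np.argsort(D)[::-1]
D, E = D[order], E[:, order]

# Condition 1: compare the two directions with nonzero singular values,
# after flipping each eigenvector to agree in sign with its singular vector.
signs = np.sign(np.sum(U[:, :2] * E[:, :2], axis=0))
vectors_match = np.allclose(U[:, :2], E[:, :2] * signs)

# Condition 2: eigenvalues equal squared singular values (up to rounding).
values_match = np.allclose(D, S**2)

# Condition 3: squared Frobenius norm equals the sum of squared singular values.
energy_match = np.isclose(np.linalg.norm(X, 'fro')**2, np.sum(S**2))

print(vectors_match, values_match, energy_match)
```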


## Problem 2
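None of the computed singular values are exactly zero, but one of them is very close to zero (4.95e-15). Since $X$ has rank 2, its third singular value is exactly zero in theory; the computed value is nonzero only because of floating-point rounding error in the numerical computation. It is about $10^{-16}$ times the largest singular value, which is on the order of double-precision machine epsilon.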
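
To decide whether a small singular value should count as zero, one common approach is to compare it against a tolerance scaled by the largest singular value, the matrix dimensions, and machine epsilon. The sketch below uses the default tolerance described in the `np.linalg.matrix_rank` documentation; the seeded dataset and the names `tol` and `numerical_rank` are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: arbitrary seed for repeatability
X = rng.normal(0, 1, (3, 2)) @ rng.normal(0, 1, (2, 500))

# Singular values only, in descending order.
S = np.linalg.svd(X, compute_uv=False)

# Tolerance: largest singular value * larger dimension * machine epsilon.
eps = np.finfo(X.dtype).eps
tol = S.max() * max(X.shape) * eps

# Singular values at or below tol are treated as zero.
numerical_rank = int(np.sum(S > tol))
print(S, tol, numerical_rank)
```

Under this rule the third singular value falls below the tolerance, so the numerical rank agrees with the theoretical rank of 2.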



In [127]:
print(A)

[[-0.17926302  0.31978681]
 [-1.57485409 -0.81205371]
 [ 0.80122514  0.73300686]]


In [128]:
print(S)

[  4.35323668e+01   9.69622308e+00   4.95216698e-15]


In [129]:
print(U)

[[ 0.0058102   0.83219762 -0.5544487 ]
 [-0.85658895  0.29022866  0.42664118]
 [ 0.51596668  0.47245576  0.71453757]]


We know that the left singular vectors of $X$ corresponding to the two largest singular values are also eigenvectors of $XX^T$, and since $x_i = Av_i$ we know that

$$
XX^T = \sum_{i = 1}^{500}x_ix_i^T = A\left(\sum_{i = 1}^{500}v_iv_i^T\right)A^T
$$

So the relationship between singular value $\sigma_j$, its left singular vector $U_j$ and $A$ is:

$$
\left[A\left(\sum_{i = 1}^{500}v_iv_i^T\right)A^T\right]U_j = \sigma_j^2 U_j
$$
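
This relationship can be verified numerically. The following is a minimal sketch assuming a seeded generator and Python 3; `G` (the $2 \times 2$ matrix $\sum_i v_iv_i^T$), `M`, and `residuals` are names introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: arbitrary seed for repeatability
A = rng.normal(0, 1, (3, 2))
V = rng.normal(0, 1, (2, 500))  # column i is v_i
X = A @ V

# G = sum_i v_i v_i^T, computed as a single 2x2 matrix product.
G = V @ V.T
M = A @ G @ A.T                 # equals X X^T

U, S, _ = np.linalg.svd(X)

# Residual of M U_j = sigma_j^2 U_j for the two nonzero singular values.
residuals = [np.linalg.norm(M @ U[:, j] - S[j]**2 * U[:, j]) for j in range(2)]
print(residuals)
```

The residuals should be negligible relative to $\sigma_1^2$, differing from zero only by rounding error.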

# PCA of Dataset 1

## Problem 1
Since the data matrix $X$ has rank 2, two principal components suffice: each sample $x_i = Av_i$ has two degrees of freedom, not three, so the data lie in a 2-dimensional subspace of $\mathbb{R}^3$.
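
One way to confirm this is to project the data onto the top two left singular vectors and measure the reconstruction error. This is a sketch under assumptions: it uses a seeded, regenerated dataset and skips mean-centering because the model is zero-mean; `U2`, `Z`, `X_hat`, and `explained` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: arbitrary seed for repeatability
X = rng.normal(0, 1, (3, 2)) @ rng.normal(0, 1, (2, 500))

U, S, _ = np.linalg.svd(X, full_matrices=False)

U2 = U[:, :2]      # top two principal directions
Z = U2.T @ X       # 2 x 500 matrix of principal-component scores
X_hat = U2 @ Z     # reconstruction from two components

# Relative reconstruction error and fraction of energy per component.
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
explained = S**2 / np.sum(S**2)
print(rel_err, explained)
```

The first two components should account for essentially all of the energy, with a relative error at the level of rounding noise.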

## Problem 2
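Here is the proof that $E[x_k] = 0, k = 1, 2, 3$:
Each $x_k$ is the $k$-th entry of $x = Av$, so $x_k = a_{k1}v_1 + a_{k2}v_2$, where $v_1$ and $v_2$ are iid N(0, 1) random variables and $A$ is held fixed. By linearity of expectation, $E[x_k] = a_{k1}E[v_1] + a_{k2}E[v_2] = 0$ for $k = 1, 2, 3$.
Here is the computation of the sample mean vector. As we can see, the values are all close to 0.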

In [135]:
m = np.mean(X, axis=1)  # sample mean of each of the 3 coordinates
print(m)

[-0.01115677 -0.0607125   0.0275935 ]
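
How close to 0 the sample means should be can be quantified. Given $A$, each $x_k = a_{k1}v_1 + a_{k2}v_2$ has variance $a_{k1}^2 + a_{k2}^2$, so the sample mean over $N$ iid samples has standard deviation $\sqrt{(a_{k1}^2 + a_{k2}^2)/N}$. The sketch below plugs in the values of `A` and `m` printed in this notebook; `std_err` and `z` are names introduced here for illustration.

```python
import numpy as np

# Values copied from the printed outputs of A and m in this notebook.
A = np.array([[-0.17926302,  0.31978681],
              [-1.57485409, -0.81205371],
              [ 0.80122514,  0.73300686]])
m = np.array([-0.01115677, -0.0607125, 0.0275935])
N = 500

# Standard deviation of each coordinate x_k, conditional on A.
std_x = np.sqrt(np.sum(A**2, axis=1))

# Standard error of the sample mean over N iid samples.
std_err = std_x / np.sqrt(N)

# Number of standard errors each sample mean lies from 0.
z = m / std_err
print(std_err, z)
```

Each sample mean lies within one standard error of 0, which is consistent with $E[x_k] = 0$.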
