In [62]:
"""
1. Top reasons to reduce the dimensionality of a dataset
    To make the correlations in the data understandable, to visualize them, and to speed up the training of our ML model.
    Furthermore, it is empirically observed that most real-world high-dimensional data lies close to a
    lower-dimensional manifold, so it can be reduced to fewer dimensions without losing too much information.

2. What is the curse of dimensionality?
    Adding dimensions increases the distances between the datapoints and therefore lowers the data density.
    To tackle this we have two options: a) increase the number of datapoints, which would require an infeasibly
    large dataset even for just 100 dimensions, or b) reduce the number of dimensions.

3. Once dimensionality of a dataset is reduced, can this be reversed?
    Yes and no. Since the compression almost always loses some information, the original dataset cannot be restored
    exactly. But if we do it right, the compression ratio is much greater than the ratio of information lost.
    For example, if the reduced dataset preserves 95% of the variance of the original one, PCA's inverse
    transformation yields a reconstruction whose mean squared distance to the original corresponds to the
    remaining 5% of the variance.

4. Is PCA suitable to reduce the dimensions of a large non-linear dataset?
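    The cost of PCA grows roughly linearly with the number of instances, so the size of the dataset is not the problem.
    PCA is a linear projection, though: it can still remove useless dimensions from a non-linear dataset, but it cannot
    unroll a curved manifold such as the Swiss roll. For that we need a kernel PCA.
    Short answer: yes, as long as a linear projection keeps enough information.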

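5. Apply PCA to a 1000D dataset and set the explained variance ratio to 0.95. How many dimensions remain?
    This depends on how the variance is spread across the dataset's dimensions: anywhere between 1 dimension
    (points almost perfectly aligned) and close to 1000 (variance spread evenly in all directions).
    The number of remaining dimensions d depends solely on the dataset.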

6. In which case is each of the following variants ideal?
    (m = number of instances, n = number of original features, d = number of target dimensions)
    regular PCA: the default choice for linear datasets that fit in memory
    incremental PCA: very large linear datasets that do not fit in memory, where we apply the transformation to
    smaller batches; this is especially interesting for online applications
    randomized PCA: datasets that fit in memory where d << n, because it only approximates the first d components
    kernel PCA: non-linear datasets

7. How can one evaluate the performance of a dimensionality reduction algorithm on a dataset?
    If the algorithm provides an inverse transformation, we can measure the reconstruction error directly.
    Otherwise, or in addition, we can measure the performance of our estimator on the reduced dataset. That way
    we can choose the hyperparameters of the dimensionality reduction that work best for our dataset and
    the ML model we want to use.

8. Does it make sense to apply two PCA algorithms in a row?
    This depends on our goal. Given a 0.95 variance threshold, and supposing the first PCA was executed properly,
    a second PCA cannot find any remaining redundancy: the components are already uncorrelated, so reducing
    further only discards more variance. If we want to increase computation speed further and trade off
    some variance, we can apply another algorithm. Another reason would be to visualize the pre-compressed
    dataset in an even lower dimension (1, 2 or 3).

"""

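The effect described under question 2 can be checked numerically. The following is a minimal sketch, not part of the original exercise: it draws random points in a unit hypercube and reports how their average pairwise distance grows with the number of dimensions. The helper name `mean_pairwise_distance`, the point count and the seed are all assumptions made for illustration.

```python
import numpy as np

def mean_pairwise_distance(n_dims, n_points=200, seed=0):
    """Average Euclidean distance between random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    # Pairwise squared distances via |a - b|^2 = |a|^2 - 2 a.b + |b|^2
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] - 2 * points @ points.T + sq_norms[None, :]
    sq_dists = np.clip(sq_dists, 0, None)   # guard against tiny negative round-off
    upper = np.triu_indices(n_points, k=1)  # count each pair once, skip self-distances
    return np.sqrt(sq_dists[upper]).mean()

# Same number of points, increasingly many dimensions
distances = {d: mean_pairwise_distance(d) for d in (2, 10, 100, 1000)}
for d, dist in distances.items():
    print(f"{d:>5} dimensions: mean distance {dist:.2f}")
```

With the number of points held fixed, the mean distance should grow roughly like the square root of the number of dimensions, which is why the data becomes sparse.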

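The PCA variants from question 6 map onto scikit-learn roughly as sketched below. The dataset here is synthetic and purely an assumption for illustration (2000 instances, 50 features, about 5 informative directions); it is small enough to fit in memory, so the batching only imitates the out-of-core case.

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)
# Synthetic data (assumption): 5 latent directions mixed into 50 features, plus slight noise
latent = rng.normal(size=(2000, 5))
X_demo = latent @ rng.normal(size=(5, 50)) + 0.01 * rng.normal(size=(2000, 50))

# Regular PCA: full SVD on data that fits in memory
full_pca = PCA(n_components=5, svd_solver="full").fit(X_demo)

# Randomized PCA: approximates only the first d components
rnd_pca = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X_demo)

# Incremental PCA: feed the data batch by batch, as if it did not fit in memory
inc_pca = IncrementalPCA(n_components=5)
for batch in np.array_split(X_demo, 10):
    inc_pca.partial_fit(batch)

for name, model in [("full", full_pca), ("randomized", rnd_pca), ("incremental", inc_pca)]:
    print(f"{name:>11}: explained variance {model.explained_variance_ratio_.sum():.4f}")
```

On data like this all three variants should recover nearly the same subspace; the differences lie in memory use and run time, not in the result.
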
In [89]:
from sklearn.datasets import make_swiss_roll
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import KernelPCA

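# Generate a noisy 3D Swiss roll; `color` encodes the position along the roll
S, color = make_swiss_roll(n_samples=1000, noise=0.1)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Reduce the roll to 2D with an RBF-kernel PCA
pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
s2d = pca.fit_transform(S)

# Original roll, plus the 2D embedding drawn in the z=0 plane of the same axes
ax.scatter(S[:, 0], S[:, 1], S[:, 2], c=color, cmap=plt.cm.Spectral)
ax.scatter(s2d[:, 0], s2d[:, 1], c=color, marker='s', cmap=plt.cm.Spectral)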
# Set labels and title
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.set_title('Swiss Roll')

# Show the plot
plt.show()
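The value `gamma=0.04` above is one choice among many. One way to compare candidates, in the spirit of question 7, is the reconstruction (pre-image) error. This is a hedged sketch: the candidate gammas, the sample count and `random_state=42` are assumptions, and the pre-image that scikit-learn learns with `fit_inverse_transform=True` is itself only an approximation.

```python
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import KernelPCA
from sklearn.metrics import mean_squared_error

S_demo, _ = make_swiss_roll(n_samples=500, noise=0.1, random_state=42)

errors = {}
for gamma in (0.01, 0.04, 0.1):  # hypothetical candidate values
    kpca = KernelPCA(n_components=2, kernel="rbf", gamma=gamma,
                     fit_inverse_transform=True)
    reduced = kpca.fit_transform(S_demo)
    preimage = kpca.inverse_transform(reduced)  # approximate reconstruction in 3D
    errors[gamma] = mean_squared_error(S_demo, preimage)

best_gamma = min(errors, key=errors.get)
for gamma, err in errors.items():
    print(f"gamma={gamma}: pre-image MSE {err:.3f}")
print("lowest error at gamma =", best_gamma)
```

A low pre-image error does not guarantee a good downstream model, so this complements rather than replaces evaluating the final estimator on the reduced data.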

In [64]:
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles)/2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

"""
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c='b', marker='o')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')
ax.set_title('Data Visualization')

# Show the plot
%matplotlib inline
plt.show()"""

In [65]:

%matplotlib qt
from sklearn.decomposition import PCA

# Project the 3D data onto its first two principal components
pca = PCA(n_components=2)
x2d = pca.fit_transform(X)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

# Original 3D points (blue) and their 2D projection drawn in the z=0 plane (red)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c='b', marker='o')
ax.scatter(x2d[:, 0], x2d[:, 1], c='r', marker='s')

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('Z')

Text(0.5, 0, 'Z')
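To connect this projection with question 3, the sketch below maps the 2D points back into 3D with `inverse_transform` and compares the reconstruction error with the explained variance. It regenerates the same synthetic data so that it runs on its own; the names `X_rec`, `reconstruction_mse` and `total_variance` are introduced here for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Same synthetic 3D data as in the cells above
np.random.seed(4)
m = 60
w1, w2 = 0.1, 0.3
noise = 0.1
angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
X = np.empty((m, 3))
X[:, 0] = np.cos(angles) + np.sin(angles) / 2 + noise * np.random.randn(m) / 2
X[:, 1] = np.sin(angles) * 0.7 + noise * np.random.randn(m) / 2
X[:, 2] = X[:, 0] * w1 + X[:, 1] * w2 + noise * np.random.randn(m)

pca = PCA(n_components=2)
x2d = pca.fit_transform(X)
X_rec = pca.inverse_transform(x2d)  # back in 3D, but lying on the projection plane

explained = pca.explained_variance_ratio_.sum()
# Mean squared distance between each point and its reconstruction
reconstruction_mse = np.mean(np.sum((X - X_rec) ** 2, axis=1))
total_variance = np.sum(np.var(X, axis=0))

print(f"explained variance ratio: {explained:.4f}")
print(f"lost variance ratio:      {reconstruction_mse / total_variance:.4f}")
```

For PCA the two printed ratios add up to 1: the share of variance that is not explained is exactly what the reconstruction loses.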

In [81]:
m = 69
n = 1000

w1, w2 = 0.1, 0.3
noise = 0.1

angles = np.random.rand(m) * 3 * np.pi / 2 - 0.5
d1000 = np.empty((m, n))



In [83]:
type(d1000)
d1000.shape[0]


69
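The array `d1000` above is allocated but never filled, so how it was meant to be populated is unknown. The sketch below is only one possible continuation for experimenting with question 5: it assumes, hypothetically, that the 1000 features are noisy linear mixtures of `n_latent = 10` hidden directions, and then lets PCA keep 95% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
m, n = 69, 1000   # same shape as d1000 above
n_latent = 10     # hypothetical number of underlying directions

# Assumed data model: low-rank signal plus small isotropic noise
latent = rng.normal(size=(m, n_latent))
mixing = rng.normal(size=(n_latent, n))
d1000_demo = latent @ mixing + 0.1 * rng.normal(size=(m, n))

# A float n_components keeps the fewest components reaching that variance ratio
pca_95 = PCA(n_components=0.95, svd_solver="full")
reduced_95 = pca_95.fit_transform(d1000_demo)

print("remaining dimensions:", pca_95.n_components_)
print("explained variance:  ", pca_95.explained_variance_ratio_.sum())
```

The number of remaining dimensions follows the structure of the data, not the nominal 1000 features. With only 69 instances, at most 68 components can carry variance after centering, whatever the data looks like.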