# Principal Component Analysis

PCA (Principal Component Analysis) is a dimensionality reduction technique used to prepare data for visualization, analysis, or modeling.

PCA aims to reduce the dimensionality of the data by projecting it onto a new feature space composed of the directions of maximum variance in the original data.

The goal of PCA is to find a new representation of the data that preserves as much information as possible while reducing the number of variables. This is achieved by identifying the principal components, which are linear combinations of the original variables that explain the largest amount of variance in the data.
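
To make the idea of "directions of maximum variance" concrete, here is a minimal NumPy sketch of one common way to compute principal components: an eigen-decomposition of the covariance matrix. The synthetic dataset and every variable name are assumptions made for illustration, and this is not necessarily how any particular library implements PCA internally.

```python
import numpy as np

# Hypothetical synthetic data: 200 samples, 3 features, two of them correlated.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
data = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# 1. Center each feature at zero.
centered = data - data.mean(axis=0)

# 2. Covariance matrix of the features (3 x 3).
cov = np.cov(centered, rowvar=False)

# 3. Eigen-decomposition. eigh returns eigenvalues in ascending order,
#    so we reorder them from largest to smallest variance.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# 4. Keep the k directions with the most variance and project onto them.
k = 2
components = eigenvectors[:, :k]
projected = centered @ components

# Share of the total variance carried by each direction.
explained_ratio = eigenvalues / eigenvalues.sum()
print(projected.shape)
print(explained_ratio)
```

Each column of `components` holds the weights of one linear combination of the original variables, and `explained_ratio` shows how much of the total variance each direction accounts for.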

We will work with the Iris flowers dataset, which has 4 dimensions, each corresponding to a physical measurement of the flower (sepal length, sepal width, petal length, and petal width, in centimeters):

In [None]:
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
labels = iris.target

print(X.shape)
X[:10]


You can visualize them with the following function:

In [None]:
from utils import visualize_iris_pairplot

visualize_iris_pairplot(iris)


In Scikit-learn, to apply PCA to a dataset, we create an instance of the `PCA` class, which must be imported from the `sklearn.decomposition` module:

In [None]:
from sklearn.decomposition import PCA


One of the most important hyperparameters of PCA is the number of components to keep, set through the `n_components` argument. I recommend specifying it explicitly in most cases. Let's say we want to reduce our dataset to only two dimensions:

In [None]:
pca = PCA(n_components=2)
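
As a side note, `n_components` does not have to be an integer. The sketch below assumes you would rather state how much variance to keep and let Scikit-learn pick the number of components. The 0.95 threshold and the names `X_iris` and `pca_95` are illustrative choices, not part of the walkthrough.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data

# A float between 0 and 1 asks for enough components to explain at least
# that fraction of the variance (this mode uses the full SVD solver).
pca_95 = PCA(n_components=0.95, svd_solver="full")
pca_95.fit(X_iris)

# Number of components that were actually kept.
print(pca_95.n_components_)
print(pca_95.explained_variance_ratio_.sum())
```

In the rest of this walkthrough we stay with the explicit `n_components=2`, since a two-dimensional result is what we want to plot.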


With `n_components=2`, the data will go from 4 dimensions to 2 once we call the `fit` method and then `transform`:

In [None]:
pca.fit(X)
X_reduced = pca.transform(X)

print(X_reduced.shape)
X_reduced[:10]
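
The two calls can also be combined. The following sketch, which uses the illustrative names `two_step` and `one_step`, checks that `fit_transform` gives the same result as calling `fit` and then `transform` on the same data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X_iris = load_iris().data

# Two separate steps: learn the components, then project the data.
two_step = PCA(n_components=2).fit(X_iris).transform(X_iris)

# One step: learn and project in a single call.
one_step = PCA(n_components=2).fit_transform(X_iris)

print(np.allclose(two_step, one_step))
```

Keeping `fit` and `transform` separate is still useful when the components are learned on one dataset and later applied to new samples.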


Now we can visualize this new dataset, a low-dimensional version that captures most of the variation in the original data:

In [None]:
import matplotlib.pyplot as plt

# Plot the transformed data, colored by species label
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=labels)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()


Something important to note is that, after the transformation, the values in these two new dimensions no longer correspond to any single physical property. They are just "components": each one mixes all of the original measurements, so we can no longer speak of centimeters or petals.
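
To see why the components have no direct physical meaning, we can inspect the weights each one assigns to the original features. This is a small sketch with illustrative names (`iris_data`, `pca_2d`, `manual`). It relies on the `components_` and `mean_` attributes of a fitted `PCA` object and assumes the default setting without whitening.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris_data = load_iris()
X_iris = iris_data.data

pca_2d = PCA(n_components=2).fit(X_iris)

# Each row of components_ holds the weights of the 4 original features
# in one principal component.
for i, weights in enumerate(pca_2d.components_, start=1):
    named = {name: round(float(w), 2)
             for name, w in zip(iris_data.feature_names, weights)}
    print(f"PC{i}:", named)

# Reproduce transform by hand: center the data, then project onto the components.
manual = (X_iris - pca_2d.mean_) @ pca_2d.components_.T
print(np.allclose(manual, pca_2d.transform(X_iris)))
```

Every component draws on all four measurements at once, which is why its values cannot be read as a length or a width.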

## How do we measure how good PCA is?

It is difficult to quantify, in isolation, how good our choice of PCA hyperparameters is. In practice, PCA is often judged by how much it helps a machine learning model trained on its output, or by whether the plots we generate from it are informative.
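
One way to run that kind of check is to compare a model trained on all the features with the same model trained on the PCA output. The sketch below is only an illustration: logistic regression, 5-fold cross-validation, and all variable names are assumptions chosen for the example, not a recommended benchmark.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X_iris, y_iris = load_iris(return_X_y=True)

# Baseline: classifier trained on the 4 original features.
baseline = LogisticRegression(max_iter=1000)

# Same classifier, but trained on the 2 principal components.
# Putting PCA inside the pipeline refits it on each training fold.
with_pca = make_pipeline(PCA(n_components=2), LogisticRegression(max_iter=1000))

baseline_scores = cross_val_score(baseline, X_iris, y_iris, cv=5)
pca_scores = cross_val_score(with_pca, X_iris, y_iris, cv=5)

print("All 4 features:  ", baseline_scores.mean())
print("2 PCA components:", pca_scores.mean())
```

If accuracy with two components stays close to the baseline, the reduction has kept most of what the model needs. A large drop suggests trying a higher `n_components`.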

And there you have it: PCA is useful when we need to reduce the dimensionality of our data, either to train a model or simply to visualize it. On its own, its usefulness may not be so evident, but combined with a plot or a machine learning model it becomes much more valuable.