### PCA (Principal Component Analysis) + Mathematical Notation

The goal of PCA is to find a set of orthogonal vectors (principal components) that capture the maximum variance in the data. Let $X$ be your data matrix with $n$ samples and $p$ features, where each row represents a sample and each column represents a feature.

1.  **Mean Centering:** The first step is to center the data by subtracting the mean of each feature from the corresponding feature values.

    $$X_{centered} = X - \bar{X}$$
    where $\bar{X}$ is the row vector of per-feature (column) means, subtracted from every row of $X$.

2.  **Covariance Matrix:** Calculate the covariance matrix of the centered data.

    $$\Sigma = \frac{1}{n-1} X_{centered}^T X_{centered}$$
    The covariance matrix describes the relationships between the different features.


3.  **Eigenvalue Decomposition:** Find the eigenvalues and eigenvectors of the covariance matrix $\Sigma$.

    $$\Sigma v = \lambda v$$
    where $v$ is an eigenvector and $\lambda$ is the corresponding eigenvalue. The eigenvectors represent the directions of maximum variance (the principal components), and the eigenvalues represent the magnitude of the variance along those directions.


4.  **Sorting Eigenpairs:** Sort the eigenvectors in descending order based on their corresponding eigenvalues. The eigenvector with the largest eigenvalue is the first principal component, the eigenvector with the second largest eigenvalue is the second principal component, and so on.


5.  **Selecting Principal Components:** Choose the top $k$ eigenvectors (principal components) that correspond to the largest eigenvalues. The number of components $k$ is typically chosen based on the desired amount of variance to retain or by looking at a scree plot.


6.  **Projection:** Project the centered data onto the selected principal components to obtain the lower-dimensional representation of the data.
    $$X_{projected} = X_{centered} W$$
    where $W$ is the matrix formed by the top $k$ eigenvectors as columns.
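The six steps above can be sketched directly in NumPy. This is a minimal illustration on synthetic data, not the way scikit-learn implements PCA; the function name `pca_eig` and the random test matrix are assumptions made for the example.

```python
import numpy as np

def pca_eig(X, k):
    """Project X (n samples x p features) onto its top-k principal components."""
    # Step 1: center each feature (column) on its mean.
    X_centered = X - X.mean(axis=0)
    # Step 2: sample covariance matrix (p x p) with the 1/(n-1) normalization.
    n = X.shape[0]
    cov = X_centered.T @ X_centered / (n - 1)
    # Step 3: eigendecomposition; eigh is meant for symmetric matrices
    # and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 4: reorder the eigenpairs by descending eigenvalue.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 5: keep the top-k eigenvectors as the columns of W (p x k).
    W = eigvecs[:, :k]
    # Step 6: project the centered data onto W.
    return X_centered @ W, eigvals, W

# Synthetic, correlated data (hypothetical): 200 samples, 4 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

X_projected, eigvals, W = pca_eig(X, k=2)
print(X_projected.shape)  # (200, 2)
```

The sample variance of each projected column equals the corresponding eigenvalue, which is a convenient sanity check for an implementation like this.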

### PCA in `sklearn.decomposition.PCA`

The `sklearn.decomposition.PCA` class in scikit-learn automates these steps.

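-   `PCA(n_components)`: Initializes a PCA object. `n_components` specifies how many principal components to keep. An integer sets the exact number of components. A float strictly between 0 and 1 is treated as a target fraction of variance, and the smallest number of components whose cumulative explained variance exceeds that fraction is kept.
-   `fit(X)`: Fits the PCA model to the data `X`. It computes the feature means and the principal components with their variances. Depending on the `svd_solver` setting, scikit-learn typically obtains these from a singular value decomposition of the centered data rather than from an explicit eigendecomposition of the covariance matrix; the two routes give equivalent results.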
-   `transform(X)`: Projects the data `X` onto the principal components learned during the `fit` step.
-   `fit_transform(X)`: Combines the `fit` and `transform` steps.
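As a rough illustration of the methods listed above, the sketch below compares an integer and a float `n_components` and checks that `fit_transform` agrees with `fit` followed by `transform`. The data is synthetic and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): 300 samples, 6 features with very different variances.
X = rng.normal(size=(300, 6)) * np.array([10.0, 5.0, 2.0, 1.0, 0.5, 0.1])

# Integer: keep exactly 3 components.
pca_int = PCA(n_components=3)
X_int = pca_int.fit_transform(X)

# Float in (0, 1): keep as many components as needed to reach 95% of the variance.
pca_frac = PCA(n_components=0.95)
X_frac = pca_frac.fit_transform(X)

# fit followed by transform gives the same projection as fit_transform.
X_two_step = PCA(n_components=3).fit(X).transform(X)

print(X_int.shape)
print(pca_frac.n_components_, pca_frac.explained_variance_ratio_.sum())
print(np.allclose(X_int, X_two_step))
```

After fitting with a float, `n_components_` reports how many components were actually kept.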

After fitting, you can access attributes like:

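-   `components_`: The principal components (eigenvectors), stored as rows of an array with shape `(n_components, n_features)`.
-   `explained_variance_`: The variance explained by each selected component (the corresponding eigenvalues of the covariance matrix).
-   `explained_variance_ratio_`: The fraction of the total variance explained by each selected component (values between 0 and 1, not percentages).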
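To connect these attributes back to the mathematical steps, the following sketch fits PCA on synthetic data and compares the attributes with quantities computed from the sample covariance matrix. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic, correlated data (hypothetical): 250 samples, 5 features.
X = rng.normal(size=(250, 5)) @ rng.normal(size=(5, 5))

pca = PCA(n_components=3).fit(X)

# Sample covariance (1/(n-1) normalization) and its three largest eigenvalues.
cov = np.cov(X, rowvar=False)
top_eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1][:3]

print(pca.components_.shape)  # (3, 5): one principal component per row
print(np.allclose(pca.explained_variance_, top_eigvals))
# The ratio is each component's variance divided by the total variance, trace(cov).
print(np.allclose(pca.explained_variance_ratio_,
                  pca.explained_variance_ / np.trace(cov)))
```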

The notebook cells that follow demonstrate these concepts by fitting a `sklearn.decomposition.PCA` model to the data and displaying the reduced representation.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.decomposition import FactorAnalysis
from sklearn.decomposition import PCA

In [2]:
# Read the dataset into a DataFrame
df = pd.read_csv('/content/diabetes_dataset.csv')
print(df)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0              6      148             72             35        0  33.6   
1              1       85             66             29        0  26.6   
2              8      183             64              0        0  23.3   
3              1       89             66             23       94  28.1   
4              0      137             40             35      168  43.1   
..           ...      ...            ...            ...      ...   ...   
763           10      101             76             48      180  32.9   
764            2      122             70             27        0  36.8   
765            5      121             72             23      112  26.2   
766            1      126             60              0        0  30.1   
767            1       93             70             31        0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
0                       0.627   50        1  
1                  

In [3]:
#Check for Negative Values in numeric columns
numeric_cols = df.select_dtypes(include=np.number)
print(numeric_cols.lt(0).sum())

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


In [4]:
#Check Null Values
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64


### Factor Components for Dimensionality Reduction + Mathematical Notation

In Factor Analysis, the **factor components** (also known as factor loadings) represent the relationship between the original observed variables and the underlying latent factors. These latent factors are the new, reduced set of dimensions that the model identifies.

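Let $X$ be the observed data matrix with $p$ variables and $n$ samples, arranged here as $p \times n$ (variables in rows, the transpose of the $n \times p$ layout used in the PCA section). Factor Analysis models the mean-centered observed data as a linear combination of $k$ latent factors ($F$) plus a unique error term ($\epsilon$):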

$$X = \Lambda F + \epsilon$$

where:
- $X$ is the $p \times n$ data matrix.
- $\Lambda$ is the $p \times k$ matrix of **factor loadings** (the components we are discussing). Each element $\lambda_{ij}$ represents the loading of the $i$-th observed variable on the $j$-th latent factor.
- $F$ is the $k \times n$ matrix of latent factor scores.
- $\epsilon$ is the $p \times n$ matrix of unique error terms, representing the variance in $X$ not explained by the common factors.
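The generative model can be simulated in a few lines of NumPy. In this sketch the dimensions, the loadings, and the unique variances are all made-up values; the factors are drawn with unit variance and the noise is independent across variables, which are the usual Factor Analysis assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n = 6, 2, 100_000  # variables, latent factors, samples (arbitrary choices)

Lambda = rng.normal(size=(p, k))        # factor loadings, p x k
psi = rng.uniform(0.1, 0.5, size=p)     # unique (noise) variance of each variable
F = rng.normal(size=(k, n))             # latent factor scores, k x n, unit variance
eps = rng.normal(size=(p, n)) * np.sqrt(psi)[:, None]  # unique errors, p x n

X = Lambda @ F + eps                    # observed data, p x n

# Covariance implied by the model versus the sample covariance of the simulated data.
model_cov = Lambda @ Lambda.T + np.diag(psi)
sample_cov = np.cov(X)                  # np.cov treats rows as variables by default
print(X.shape)
print(np.abs(sample_cov - model_cov).max())  # small for large n
```

With enough samples, the sample covariance of the simulated data approaches $\Lambda \Lambda^T$ plus the diagonal matrix of unique variances.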

The process of using factor components for dimensionality reduction involves the following:

1.  **Identifying Latent Factors:** Factor Analysis assumes that the correlations between observed variables can be explained by a smaller number of unobserved (latent) factors. The `fit()` method in `FactorAnalysis` estimates the factor loadings ($\Lambda$) and the variances of the unique error terms ($\Psi$, where $\Psi$ is a diagonal matrix with the variances of $\epsilon$ on the diagonal). The model aims to reproduce the observed covariance matrix ($\Sigma$) with the estimated parameters:

    $$\Sigma \approx \Lambda \Lambda^T + \Psi$$

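2.  **Factor Loadings:** The `components_` attribute of the fitted `FactorAnalysis` object in scikit-learn holds the estimated factor loadings. It has shape `(n_components, n_features)`, one row per factor, so it corresponds to $\Lambda^T$ in the notation above. These loadings indicate how strongly each original variable is associated with each latent factor. High absolute loadings suggest that a variable is a good indicator of that factor. The estimated unique variances (the diagonal of $\Psi$) are available as `noise_variance_`.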

3.  **Representing Data in Reduced Space:** Once the factors and their loadings are determined, you can transform the original data into the lower-dimensional space defined by these latent factors. This is typically done using the `transform()` method of the `FactorAnalysis` object, which estimates the factor scores ($F$) for each sample based on the observed data and the learned loadings.

Essentially, Factor Analysis uses the factor loadings ($\Lambda$) to describe the underlying structure of the data and then creates a new representation of the data (the factor scores $F$) based on these latent factors, thereby achieving dimensionality reduction. Unlike PCA, which focuses on maximizing variance, Factor Analysis aims to model the covariance structure of the observed variables through the latent factors.
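The sketch below shows how these pieces map onto scikit-learn's `FactorAnalysis` for synthetic data generated from a two-factor model. The data, dimensions, and variable names are assumptions for illustration. Note that scikit-learn expects samples in rows, so the input here is $n \times p$.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n, p, k = 2000, 6, 2  # samples, observed variables, latent factors (arbitrary)

# Synthetic data (hypothetical) drawn from a two-factor model, samples in rows.
true_loadings = rng.normal(size=(p, k))
X = rng.normal(size=(n, k)) @ true_loadings.T + rng.normal(scale=0.5, size=(n, p))

fa = FactorAnalysis(n_components=k, random_state=0).fit(X)

print(fa.components_.shape)      # (2, 6): one row per factor, i.e. Lambda transposed
print(fa.noise_variance_.shape)  # (6,): the diagonal of Psi

# Covariance reproduced by the fitted model: Lambda Lambda^T + Psi.
Lambda_hat = fa.components_.T
model_cov = Lambda_hat @ Lambda_hat.T + np.diag(fa.noise_variance_)
print(np.allclose(model_cov, fa.get_covariance()))

# Factor scores: one k-dimensional row per sample.
scores = fa.transform(X)
print(scores.shape)              # (2000, 2)
```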

In [5]:
#Get the Size of DataFrame
df.shape

(768, 9)

In [6]:
# Fit a Factor Analysis model with the desired number of latent factors (5)
factors = FactorAnalysis(n_components=5).fit(df)

In [7]:
# Apply Factor Analysis and transform the data
df_reduced_fa = factors.transform(df)

In [8]:
# Display the shape of the original and reduced DataFrames
print("Original DataFrame shape:", df.shape)
print("Reduced DataFrame shape (Factor Analysis):", df_reduced_fa.shape)

Original DataFrame shape: (768, 9)
Reduced DataFrame shape (Factor Analysis): (768, 5)


In [9]:
# Show the reduced data (a NumPy array of factor scores)
df_reduced_fa

array([[-0.66346898, -1.19776448, -0.15015341,  0.69874981,  1.82292907],
       [-0.71225242,  0.92917876, -0.15605682, -0.08901717,  0.83392617],
       [-0.64315095, -2.23589325,  0.81672604, -0.59968938, -0.95890383],
       ...,
       [ 0.27774039,  0.10284187, -0.11751162, -0.26989135, -0.11256013],
       [-0.68703172, -0.45155179,  0.43664345,  1.34299119, -0.53382978],
       [-0.70522896,  0.69143763, -0.23120526, -0.94407145,  0.67932666]])

In [10]:
# Now an example with PCA: initialize it with the desired number of components
pca = PCA(n_components=5)

In [11]:
# Fit PCA on the DataFrame and transform the data
df_reduced_pca = pca.fit_transform(df)

In [12]:
# Display the shape of the original and reduced DataFrames
print("Original DataFrame shape:", df.shape)
print("Reduced DataFrame shape (PCA):", df_reduced_pca.shape)

Original DataFrame shape: (768, 9)
Reduced DataFrame shape (PCA): (768, 5)


In [13]:
# Show the reduced data (a NumPy array of principal component scores)
df_reduced_pca

array([[-75.71424916,  35.95494354,   7.26068338,  15.6705266 ,
         16.50797757],
       [-82.35846646, -28.90955895,   5.49664901,   9.00443012,
          3.48038132],
       [-74.63022933,  67.90963328, -19.46175322,  -5.65311372,
        -10.29917609],
       ...,
       [ 32.11298721,  -3.37922193,   1.58797191,  -0.87945128,
         -2.98161526],
       [-80.21409513,  14.19059537, -12.35142227, -14.29252832,
          8.53699105],
       [-81.30834662, -21.6230423 ,   8.15277416,  13.82124771,
         -4.91458328]])
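One way to judge how much information five components retain is to inspect `explained_variance_ratio_` after fitting. Since the CSV file is not available here, the sketch below uses a synthetic stand-in array with the same shape as the DataFrame; the per-column scales are arbitrary assumptions and do not reflect the real dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical stand-in for the data: 768 rows, 9 columns on very different scales.
scales = np.array([3.0, 30.0, 20.0, 15.0, 115.0, 8.0, 0.3, 12.0, 0.5])
X_demo = rng.normal(size=(768, 9)) * scales

pca_demo = PCA(n_components=5).fit(X_demo)
ratios = pca_demo.explained_variance_ratio_

print(ratios)           # fraction of total variance per component, in descending order
print(ratios.cumsum())  # cumulative fraction retained by the first 1..5 components
```

In the notebook itself, the same two lines can be applied to the fitted `pca` object from the cells above.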