# Data dimensionality reduction with PCA models

This project demonstrates dimensionality reduction using **Principal Component Analysis (PCA)** and **Locally Linear Embedding (LLE)** on two classic datasets:  
- `breast_cancer` (30 features)  
- `iris` (4 features)

## 1. Data import

In [1]:
from sklearn import datasets

data_breast_cancer = datasets.load_breast_cancer(as_frame=True)
print(data_breast_cancer['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

:Number of Instances: 569

:Number of Attributes: 30 numeric, predictive attributes and the class

:Attribute Information:
    - radius (mean of distances from center to points on the perimeter)
    - texture (standard deviation of gray-scale values)
    - perimeter
    - area
    - smoothness (local variation in radius lengths)
    - compactness (perimeter^2 / area - 1.0)
    - concavity (severity of concave portions of the contour)
    - concave points (number of concave portions of the contour)
    - symmetry
    - fractal dimension ("coastline approximation" - 1)

    The mean, standard error, and "worst" or largest (mean of the three
    worst/largest values) of these features were computed for each image,
    resulting in 30 features.  For instance, field 0 is Mean Radius, field
    10 is Radius SE, field 20 is Worst Radius.


In [2]:
df_data_breast_cancer = data_breast_cancer.frame
df_data_breast_cancer.head(10)

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |
| 5 | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | 0.2087 | 0.07613 | ... | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 | 0 |
| 6 | 18.25 | 19.98 | 119.6 | 1040.0 | 0.09463 | 0.109 | 0.1127 | 0.074 | 0.1794 | 0.05742 | ... | 27.66 | 153.2 | 1606.0 | 0.1442 | 0.2576 | 0.3784 | 0.1932 | 0.3063 | 0.08368 | 0 |
| 7 | 13.71 | 20.83 | 90.2 | 577.9 | 0.1189 | 0.1645 | 0.09366 | 0.05985 | 0.2196 | 0.07451 | ... | 28.14 | 110.6 | 897.0 | 0.1654 | 0.3682 | 0.2678 | 0.1556 | 0.3196 | 0.1151 | 0 |
| 8 | 13.0 | 21.82 | 87.5 | 519.8 | 0.1273 | 0.1932 | 0.1859 | 0.09353 | 0.235 | 0.07389 | ... | 30.73 | 106.2 | 739.3 | 0.1703 | 0.5401 | 0.539 | 0.206 | 0.4378 | 0.1072 | 0 |
| 9 | 12.46 | 24.04 | 83.97 | 475.9 | 0.1186 | 0.2396 | 0.2273 | 0.08543 | 0.203 | 0.08243 | ... | 40.68 | 97.65 | 711.4 | 0.1853 | 1.058 | 1.105 | 0.221 | 0.4366 | 0.2075 | 0 |


In [3]:
data_iris = datasets.load_iris(as_frame=True)
print(data_iris['DESCR'])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis

In [4]:
df_data_iris = data_iris.frame
df_data_iris.head(10)

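| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| 5 | 5.4 | 3.9 | 1.7 | 0.4 | 0 |
| 6 | 4.6 | 3.4 | 1.4 | 0.3 | 0 |
| 7 | 5.0 | 3.4 | 1.5 | 0.2 | 0 |
| 8 | 4.4 | 2.9 | 1.4 | 0.2 | 0 |
| 9 | 4.9 | 3.1 | 1.5 | 0.1 | 0 |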


## 2. Data dimension reduction

**PCA is used to reduce the number of features while preserving at least 90% of the total variance.  
This helps simplify the data structure without losing essential information.**

**Variance** measures how spread out the values in a dataset are.  
In PCA, it represents how much of the data’s information is captured by each principal component.

Mathematically, variance is defined as:

$$
\text{Var}(X) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2
$$

Where:
- $x_i$ is a single data point,
- $\bar{x}$ is the mean of the dataset,
- $n$ is the number of samples.

Higher variance means more information is captured along that direction in the feature space.
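As a quick check of the formula above, the following sketch computes the variance by hand and compares it with NumPy. The ten values are the sepal lengths from the iris preview in Section 1; the variable names are illustrative only.

```python
import numpy as np

# Sepal lengths (cm) of the first ten iris rows shown in Section 1
x = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])

mean = x.mean()
squared_deviations = (x - mean) ** 2

# Population variance: divide by n, as in the formula above
var_population = squared_deviations.sum() / len(x)
# Sample variance: divide by n - 1 instead
var_sample = squared_deviations.sum() / (len(x) - 1)

print(var_population, np.var(x))          # np.var divides by n by default
print(var_sample, np.var(x, ddof=1))      # ddof=1 switches to n - 1
```

scikit-learn's `PCA` reports `explained_variance_` with the n - 1 convention, but the choice does not affect `explained_variance_ratio_`, because the same divisor appears in every component and cancels in the ratio.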

**Reduction of unscaled data**

In [5]:
from sklearn.model_selection import train_test_split
X_bc = df_data_breast_cancer.iloc[:,:-1]
y_bc = df_data_breast_cancer['target']
X_bc_train, X_bc_test, y_bc_train, y_bc_test = train_test_split(X_bc, y_bc, test_size=0.2, shuffle = True)

In [6]:
from sklearn.decomposition import PCA

# for breast cancer data

pca_bc = PCA(n_components=0.90)  # keep enough components for 90% of the variance; PCA centers the data itself
X_bc_reduced = pca_bc.fit_transform(X_bc)

print("***BREAST CANCER DATA***")
print(f"Original data size: {X_bc.shape}")
print(f"Transformed data size: {X_bc_reduced.shape}")
print("Explained variance ratios:", pca_bc.explained_variance_ratio_)
print("Sum of variances:", sum(pca_bc.explained_variance_ratio_))

***BREAST CANCER DATA***
Original data size: (569, 30)
Transformed data size: (569, 1)
Explained variance ratios: [0.98204467]
Sum of variances: 0.982044671510662


In [7]:
from sklearn.model_selection import train_test_split
X_ir = df_data_iris.iloc[:,:-1]
y_ir = df_data_iris['target']
# y_ir = (y_ir == 2).astype(int)
X_ir_train, X_ir_test, y_ir_train, y_ir_test = train_test_split(X_ir, y_ir, test_size=0.2, shuffle = True)
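The train/test splits created in this notebook are not used by the PCA cells, which fit on the full feature matrix. If the reduced data were later fed to a model, one common approach would be to fit PCA on the training rows only and reuse that projection on the test rows. The sketch below illustrates this under assumptions: `random_state=0` and `stratify=y` are illustrative choices, not settings taken from the notebook.

```python
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# random_state and stratify are illustrative choices, not from the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0, stratify=y
)

pca = PCA(n_components=0.90)
X_train_reduced = pca.fit_transform(X_train)  # learn the projection from training rows only
X_test_reduced = pca.transform(X_test)        # apply the same projection to held-out rows

print(X_train_reduced.shape, X_test_reduced.shape)
```

Calling `transform` rather than `fit_transform` on the test rows keeps information from the held-out data out of the fitted components.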

In [8]:
# for iris data

pca_ir = PCA(n_components=0.90)  # keep enough components for 90% of the variance; PCA centers the data itself
X_ir_reduced = pca_ir.fit_transform(X_ir)

print("***IRIS DATA***")
print(f"Original data size: {X_ir.shape}")
print(f"Transformed data size: {X_ir_reduced.shape}")
print("Explained variance ratios:", pca_ir.explained_variance_ratio_)
print("Sum of variances:", sum(pca_ir.explained_variance_ratio_))

***IRIS DATA***
Original data size: (150, 4)
Transformed data size: (150, 1)
Explained variance ratios: [0.92461872]
Sum of variances: 0.9246187232017327


**Reduction of scaled data**

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_bc_scaled = scaler.fit_transform(X_bc)

pca_bc_scaled = PCA(n_components=0.90)  # keep enough components for 90% of the variance; PCA centers the data itself
X_bc_scaled_reduced = pca_bc_scaled.fit_transform(X_bc_scaled)

print("***BREAST CANCER DATA SCALED***")
print(f"Original data size: {X_bc_scaled.shape}")
print(f"Transformed data size: {X_bc_scaled_reduced.shape}")
print("Explained variance ratios:", pca_bc_scaled.explained_variance_ratio_)
print("Sum of variances:", sum(pca_bc_scaled.explained_variance_ratio_))

***BREAST CANCER DATA SCALED***
Original data size: (569, 30)
Transformed data size: (569, 7)
Explained variance ratios: [0.44272026 0.18971182 0.09393163 0.06602135 0.05495768 0.04024522
 0.02250734]
Sum of variances: 0.9100953006967311
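
scikit-learn documents a fractional `n_components` as selecting the number of components whose cumulative explained variance exceeds the given fraction. The sketch below reproduces that selection by hand from the seven ratios printed above; it is an illustration of the idea, not the library's internal code.

```python
import numpy as np

# Ratios copied from the scaled breast cancer output above (first 7 components)
ratios = np.array([0.44272026, 0.18971182, 0.09393163, 0.06602135,
                   0.05495768, 0.04024522, 0.02250734])

cumulative = np.cumsum(ratios)

# Position of the first cumulative value that reaches 90%, plus one to get a count
n_needed = int(np.argmax(cumulative >= 0.90)) + 1

print(cumulative.round(4))
print("components needed:", n_needed)
```

Six components stop just short of the threshold (about 88.8%), which is why the seventh is kept.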


In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_ir_scaled = scaler.fit_transform(X_ir)

pca_ir_scaled = PCA(n_components=0.90)  # keep enough components for 90% of the variance; PCA centers the data itself
X_ir_scaled_reduced = pca_ir_scaled.fit_transform(X_ir_scaled)

print("***IRIS DATA SCALED***")
print(f"Original data size: {X_ir_scaled.shape}")
print(f"Transformed data size: {X_ir_scaled_reduced.shape}")
print("Explained variance ratios:", pca_ir_scaled.explained_variance_ratio_)
print("Sum of variances:", sum(pca_ir_scaled.explained_variance_ratio_))

***IRIS DATA SCALED***
Original data size: (150, 4)
Transformed data size: (150, 2)
Explained variance ratios: [0.72962445 0.22850762]
Sum of variances: 0.9581320720000163


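**We apply `StandardScaler` to standardize every feature to mean 0 and standard deviation 1.  
Without scaling, PCA is dominated by the features with the largest numeric ranges, which distorts `explained_variance_ratio_`: on the unscaled breast cancer data a single component accounts for 98.2% of the variance, whereas the scaled data needs 7 components to reach 91.0%.  
For iris the count rises from 1 component (92.5%) to 2 components (95.8%), so scaling is needed here for the reduction to reflect all features rather than only the largest-valued ones.**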
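
The effect described above can be isolated on synthetic data. The sketch below builds two independent features on very different scales (hypothetical values, loosely resembling an area-like and a smoothness-like column) and compares the variance ratios before and after standardization.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two independent synthetic features on very different scales (hypothetical data)
X = np.column_stack([
    rng.normal(loc=1000.0, scale=300.0, size=500),  # large-valued feature
    rng.normal(loc=0.1, scale=0.02, size=500),      # small-valued feature
])

ratio_raw = PCA(n_components=2).fit(X).explained_variance_ratio_
X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print("unscaled:", ratio_raw)
print("scaled:  ", ratio_scaled)
```

On the raw data the first component captures almost all of the variance simply because one column has much larger numbers. After scaling, the two independent features share the variance roughly evenly.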

In [11]:
import pickle
# Save explained variance ratios from scaled PCA

with open("pca_bc.pkl", "wb") as f:
    pickle.dump(list(pca_bc_scaled.explained_variance_ratio_), f)

with open("pca_ir.pkl", "wb") as f:
    pickle.dump(list(pca_ir_scaled.explained_variance_ratio_), f)
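
To confirm that a saved file can be read back, a round trip along the following lines may help. This is a minimal sketch: it writes to a temporary directory instead of the working directory and uses the two iris ratios printed earlier as plain Python floats.

```python
import os
import pickle
import tempfile

# Ratios copied from the scaled iris output above, as plain Python floats
ratios = [0.72962445, 0.22850762]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pca_ir.pkl")
    with open(path, "wb") as f:
        pickle.dump(ratios, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

print(loaded)
```

Note that `list(array)` keeps NumPy scalar elements, so loading the notebook's files still requires NumPy; `array.tolist()` would store plain floats instead.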

## 3. Finding the most decisive features

**Compute importance of original features and save index list**

In [12]:
import numpy as np

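def get_sorted_feature_indices(pca):
    # Weight each component's absolute loadings by that component's explained variance ratio
    weights = np.abs(pca.components_ * pca.explained_variance_ratio_[:, np.newaxis])
    # Score each feature by its largest weighted loading, then sort from highest to lowest
    sorted_indices = np.argsort(np.max(weights, axis=0))[::-1]
    return sorted_indices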

# calculating the indexes for breast cancer dataset
idx_bc = get_sorted_feature_indices(pca_bc_scaled)
with open("idx_bc.pkl", "wb") as f:
    pickle.dump(idx_bc, f)

# calculating the indexes for iris dataset
idx_ir = get_sorted_feature_indices(pca_ir_scaled)
with open("idx_ir.pkl", "wb") as f:
    pickle.dump(idx_ir, f)
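
To see how this ranking behaves, the sketch below runs the same function on a tiny hand-made example. `toy_pca` is a hypothetical stand-in for a fitted PCA object with 2 components over 3 features; its numbers are invented for illustration.

```python
import numpy as np
from types import SimpleNamespace

def get_sorted_feature_indices(pca):
    weights = np.abs(pca.components_ * pca.explained_variance_ratio_[:, np.newaxis])
    return np.argsort(np.max(weights, axis=0))[::-1]

# Hypothetical stand-in for a fitted PCA: 2 components over 3 features
toy_pca = SimpleNamespace(
    components_=np.array([[0.80, 0.10, 0.59],
                          [0.10, 0.90, 0.42]]),
    explained_variance_ratio_=np.array([0.7, 0.3]),
)

# Weighted loadings:   [[0.56, 0.07, 0.413], [0.03, 0.27, 0.126]]
# Per-feature maxima:  [0.56, 0.27, 0.413]
order = get_sorted_feature_indices(toy_pca)
print(order)
```

Feature 1 has the largest raw loading (0.90), but it sits on the weaker component, so after weighting it ranks last. The score is a heuristic that favors features loading heavily on high-variance components.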

**Feature indexes, ordered from most to least decisive:**

In [13]:
print(idx_bc)
print(idx_ir)

print("bc explained var:", sum(pca_bc_scaled.explained_variance_ratio_))
print("ir explained var:", sum(pca_ir_scaled.explained_variance_ratio_))

[ 7  6 27  5 22 26 20  2 23  3  0 12 25 10 13 17 15  9 16  4  8 29 24 28
 19 21  1 14 11 18]
[2 3 0 1]
bc explained var: 0.9100953006967311
ir explained var: 0.9581320720000163
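
The index lists are easier to read once mapped back to column names. The sketch below does this for iris, using the column names from the table in Section 1 and the index order printed above.

```python
import numpy as np

# Column names as shown in the iris table in Section 1
iris_features = np.array([
    "sepal length (cm)", "sepal width (cm)",
    "petal length (cm)", "petal width (cm)",
])
idx_ir = np.array([2, 3, 0, 1])  # copied from the output above

ranked = iris_features[idx_ir].tolist()
print(ranked)
```

In the notebook itself, `X_ir.columns[idx_ir]` should give the same mapping. By the field order in the dataset description, the leading breast cancer indexes 7, 6 and 27 correspond to mean concave points, mean concavity and worst concave points.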


**Scaling puts every feature on the same footing, so no feature dominates PCA merely because of its units.  
Component selection and the variance ratios then reflect structure shared across features rather than raw magnitudes.**

**Dimensionality reduction is a powerful technique: here it compresses the scaled breast cancer data from 30 features to 7 and the scaled iris data from 4 features to 2 while retaining over 90% of the variance.  
Fewer features can speed up training and inference in machine learning models and reduce memory usage and data storage requirements.**
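
One way to quantify how much information survives the compression is to project the data back with `inverse_transform` and measure what was lost. The sketch below does this for scaled iris; it reloads the data so that it runs on its own, and the variable names are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = datasets.load_iris().data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X_scaled)
X_restored = pca.inverse_transform(X_reduced)  # back to the original 4 dimensions

# Fraction of the total centered variation lost by the projection
centered = X_scaled - X_scaled.mean(axis=0)
lost = np.sum((X_scaled - X_restored) ** 2) / np.sum(centered ** 2)

print(X_scaled.shape, "->", X_reduced.shape)
print("lost fraction:", lost)
print("1 - retained: ", 1 - pca.explained_variance_ratio_.sum())
```

The two printed fractions agree up to floating-point error: the reconstruction error of a PCA projection equals the share of variance in the discarded components.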