# LAB: Dimensionality Reduction

**Load necessary packages and apply custom configurations**

In [59]:
import warnings
warnings.filterwarnings("ignore")
warnings.simplefilter(action="ignore",category=UserWarning)
warnings.simplefilter(action="ignore",category=FutureWarning)

# Suppress valuewarning when fitting ARIMA model.
from statsmodels.tools.sm_exceptions import ValueWarning
warnings.simplefilter('ignore', ValueWarning)


# Interactive plots embedded within the notebook
#%matplotlib notebook 
# Static images of plots embedded within the notebook
# %matplotlib inline   
%config InlineBackend.figure_formats = {'png', 'retina'}

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from scipy import stats

import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import statsmodels as sm
from platform import python_version

#pd.options.plotting.backend = "plotly" 
# Conflict with options in original matplotlib.

print('Python version', python_version())
print('Numpy version', np.__version__)
print('Scipy version', sp.__version__)
print('Pandas version', pd.__version__)
print('Matplotlib version', mpl.__version__)
print('Seaborn version', sns.__version__)
###############################################

#plt.style.use('ggplot')
#plt.style.use('seaborn-v0_8-muted')
plt.rcParams['figure.figsize'] = (6, 6)
plt.rcParams['grid.linestyle'] = ':'   
plt.rcParams['axes.grid'] = False

sns.set_style("whitegrid", {'axes.grid' : False})
#sns.color_palette("RdBu", n_colors=10)
#sns.color_palette("RdBu_r") # Good for heatmap

Python version 3.13.1
Numpy version 2.2.2
Scipy version 1.15.1
Pandas version 2.2.3
Matplotlib version 3.10.0
Seaborn version 0.13.2


# Part I: Linear PCA

**Create a dataset**   
rows = instances, columns = features/variables  

In [60]:
x1 = np.array([9,15,25,14,10,18,0,16,5,19,16,20])
x2 = np.array([39,56,93,61,50,75,32,85,42,70,66,80])

D = np.vstack((x1,x2)).T 
print(D)

[[ 9 39]
 [15 56]
 [25 93]
 [14 61]
 [10 50]
 [18 75]
 [ 0 32]
 [16 85]
 [ 5 42]
 [19 70]
 [16 66]
 [20 80]]


Make the dataset zero-mean by subtracting its mean from each column.

In [61]:
D = np.vstack((x1, x2)).T    # data matrix, shape (12, 2)
mean_D = np.mean(D, axis=0)  # column means
D_zero_mean = D - mean_D     # subtract the column means from every row

print(D_zero_mean)

[[ -4.91666667 -23.41666667]
 [  1.08333333  -6.41666667]
 [ 11.08333333  30.58333333]
 [  0.08333333  -1.41666667]
 [ -3.91666667 -12.41666667]
 [  4.08333333  12.58333333]
 [-13.91666667 -30.41666667]
 [  2.08333333  22.58333333]
 [ -8.91666667 -20.41666667]
 [  5.08333333   7.58333333]
 [  2.08333333   3.58333333]
 [  6.08333333  17.58333333]]


Compute the sample covariance matrix $S$ from the zero-mean data using `np.cov`.  
Use the option `rowvar=False` so that each column is treated as a variable and each row as an observation.  
The sum is divided by $N-1$ by default (`bias=False`, equivalent to `ddof=1`).

In [62]:
Sample_covar_matrix = np.cov(D_zero_mean, rowvar=False)
print(Sample_covar_matrix)

[[ 47.71969697 122.9469697 ]
 [122.9469697  370.08333333]]
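
As a cross-check on the step above, the sample covariance can also be computed directly from its definition, $S = X^T X / (N-1)$, where $X$ is the zero-mean data. The sketch below rebuilds the toy dataset so that it runs on its own; the names `X`, `S_manual`, and `S_numpy` are illustrative and not part of the lab.

```python
import numpy as np

# Same toy data as in the lab
x1 = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
x2 = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
D = np.vstack((x1, x2)).T          # shape (12, 2)

X = D - D.mean(axis=0)             # zero-mean data
N = X.shape[0]

# Sample covariance from the definition: S = X^T X / (N - 1)
S_manual = (X.T @ X) / (N - 1)

# np.cov centers the data internally, so passing the raw data gives the same S
S_numpy = np.cov(D, rowvar=False)

print(S_manual)
print("Matches np.cov:", np.allclose(S_manual, S_numpy))
```

Because `np.cov` subtracts the column means itself, passing `D` or `D_zero_mean` should give the same matrix.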


Compute the eigenpairs of the covariance matrix $S$ to get the principal components.  
Show the eigenvectors sorted in descending order of their eigenvalues, together with the corresponding eigenvalues.

In [63]:
egval,egvec = np.linalg.eig(Sample_covar_matrix)

In [64]:
print(egval)
print(egvec)

[  6.18117609 411.62185422]
[[-0.94738969 -0.32008244]
 [ 0.32008244 -0.94738969]]


In [65]:
sorted_indices = np.argsort(egval)[::-1]
egval = egval[sorted_indices]
egvec = egvec[:, sorted_indices]
print(egval)
print(egvec)


[411.62185422   6.18117609]
[[-0.32008244 -0.94738969]
 [-0.94738969  0.32008244]]
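
Since a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it returns real eigenvalues in ascending order and orthonormal eigenvectors. The sketch below is one way to sort and sanity-check the eigenpairs; the matrix entries are copied from the printed output above, so they are rounded. Eigenvectors are only defined up to sign, so the signs may differ from those shown in the lab.

```python
import numpy as np

# Covariance matrix copied from the printed output (rounded values)
S = np.array([[47.71969697, 122.9469697],
              [122.9469697, 370.08333333]])

# eigh assumes a symmetric matrix and returns eigenvalues in ascending order
w, V = np.linalg.eigh(S)

# Reorder so that the largest eigenvalue comes first
order = np.argsort(w)[::-1]
w, V = w[order], V[:, order]

# Each column v of V should satisfy S v = lambda v
for lam, v in zip(w, V.T):
    print("S v == lambda v:", np.allclose(S @ v, lam * v))

# The eigenvectors should be orthonormal: V^T V = I
print("Orthonormal:", np.allclose(V.T @ V, np.eye(2)))
print(w)
```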


Result: Eigenvalues = 411.6219, 6.1812

Take $r=2$ principal components as the matrix `Pr`

In [66]:
# Select 2 principal components
r = 2
Pr = egvec[:, :r]  # first r columns of the sorted eigenvectors

# Show the result
print("Principal Components Matrix (Pr):")
print(Pr)


Principal Components Matrix (Pr):
[[-0.32008244 -0.94738969]
 [-0.94738969  0.32008244]]


Transform the data to obtain the reduced-dimension data $Z$ by multiplying  
    the zero-mean data $X$ by the matrix of principal components $Pr$.

In [67]:
# Transform the data into the reduced-dimension space
Z = D_zero_mean @ Pr

# Show the result
print("Reduced-Dimension Data (Z):")
print(Z)


Reduced-Dimension Data (Z):
[[ 23.75844731  -2.83726457]
 [  5.73232788  -3.08020118]
 [-32.52191518  -0.71104768]
 [  1.31546186  -0.53239927]
 [ 13.01707825  -0.26374738]
 [-13.22832361   0.15919617]
 [ 33.27091715   3.44866555]
 [-22.06205564   5.2548    ]
 [ 22.19660801   1.91254153]
 [ -8.81145759  -2.38860574]
 [ -4.06165149  -0.82676644]
 [-18.60543696  -0.13517099]]
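
One way to check the transformation is to look at the covariance of $Z$: the principal-component scores should be uncorrelated, with variances equal to the eigenvalues. The sketch below recomputes everything from the toy data so that it is self-contained; `V`, `w`, and `S_Z` are illustrative names.

```python
import numpy as np

x1 = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
x2 = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
D = np.vstack((x1, x2)).T

X = D - D.mean(axis=0)                 # zero-mean data
S = np.cov(X, rowvar=False)            # sample covariance

w, V = np.linalg.eigh(S)               # ascending eigenvalues
order = np.argsort(w)[::-1]            # descending order
w, V = w[order], V[:, order]

Z = X @ V                              # scores on the principal components

# Covariance of the scores: expected to be diag(eigenvalues)
S_Z = np.cov(Z, rowvar=False)
print(np.round(S_Z, 6))
print("Diagonal with eigenvalues:", np.allclose(S_Z, np.diag(w)))
```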


Inverse-transform the data by multiplying the transformed data $Z$ by the transpose of the matrix of principal components.

In [68]:
Pr_T = Pr.T  # transpose of Pr
print(Pr_T)

[[-0.32008244 -0.94738969]
 [-0.94738969  0.32008244]]
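
The cell above only forms the transpose. A minimal sketch of the full inverse transform, assuming the goal is to recover the original data, multiplies $Z$ by the transposed component matrix and adds the column means back. With $r=2$ the reconstruction is exact. With $r=1$ it is lossy, and the total squared error equals $(N-1)$ times the discarded eigenvalue. The names `D_back`, `D_approx`, and `sse` are illustrative.

```python
import numpy as np

x1 = np.array([9, 15, 25, 14, 10, 18, 0, 16, 5, 19, 16, 20])
x2 = np.array([39, 56, 93, 61, 50, 75, 32, 85, 42, 70, 66, 80])
D = np.vstack((x1, x2)).T
N = D.shape[0]

mean_D = D.mean(axis=0)
X = D - mean_D

w, V = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(w)[::-1]
w, V = w[order], V[:, order]

# r = 2: keep all components, so the inverse transform is exact
Z = X @ V
D_back = Z @ V.T + mean_D
print("Exact with r=2:", np.allclose(D_back, D))

# r = 1: keep only the first component, so the reconstruction is approximate
P1 = V[:, :1]
Z1 = X @ P1
D_approx = Z1 @ P1.T + mean_D

# Total squared reconstruction error versus (N - 1) * discarded eigenvalue
sse = np.sum((D - D_approx) ** 2)
print(sse, (N - 1) * w[1])
```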


<font color='blue'>Determine the variance explained by each principal component, the proportions of variance explained (PVEs), and the cumulative PVEs from the eigenvalues.</font>

In [69]:
total_variance = np.sum(egval)
pve = egval / total_variance

# Compute the cumulative PVE
cumulative_pve = np.cumsum(pve)

# Show the results
print("Eigenvalues:", egval)
print("Proportion of Variance Explained (PVE):", pve)
print("Cumulative PVE:", cumulative_pve)

Eigenvalues: [411.62185422   6.18117609]
Proportion of Variance Explained (PVE): [0.98520553 0.01479447]
Cumulative PVE: [0.98520553 1.        ]
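
The total variance used as the PVE denominator can be cross-checked against the covariance matrix, because the sum of the eigenvalues equals the trace of $S$ (the sum of the per-feature variances). A short sketch, using the rounded covariance values printed earlier:

```python
import numpy as np

# Covariance matrix copied from the printed output (rounded values)
S = np.array([[47.71969697, 122.9469697],
              [122.9469697, 370.08333333]])

# eigvalsh returns ascending eigenvalues of a symmetric matrix; reverse them
w = np.linalg.eigvalsh(S)[::-1]

total_variance = np.trace(S)           # var(x1) + var(x2)
print("Sum of eigenvalues == trace:", np.isclose(w.sum(), total_variance))

pve = w / total_variance
cumulative_pve = np.cumsum(pve)
print(pve)
print(cumulative_pve)
```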


# Part II: Dimensionality reduction with PCA

The dataset comes from sheet `MTCARS` in file `dimensionality-reduction.xlsx`. It contains 11 columns:
- `mpg`: Miles per gallon (fuel efficiency)
- `cyl`: Number of cylinders  
- `disp`: Displacement (proxy for the power generated by the engine)  
- `hp`: Gross horsepower (engine power output)
- `drat`: Rear axle ratio (number of turns of the drive shaft for every one rotation of the wheel axle). 
          High ratio = more torque
- `wt`: Weight in 1000 lbs
- `qsec`: 1/4 mile time (fastest time to travel 1/4 mile, in seconds)
- `vs`: Engine cylinder configuration (0 = V-shaped, 1 = straight)
- `am`: Transmission type (0 = automatic, 1 = manual)
- `gear`: Number of forward gears
- `carb`: Number of carburetors. Engines with higher displacement typically have a higher barrel configuration

Load and explore the dataset

In [70]:
data = pd.read_excel('dimensionality-reduction.xlsx')
data.head()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
0,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
1,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
2,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
3,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
4,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2


In [71]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     32 non-null     float64
 1   cyl     32 non-null     int64  
 2   disp    32 non-null     float64
 3   hp      32 non-null     int64  
 4   drat    32 non-null     float64
 5   wt      32 non-null     float64
 6   qsec    32 non-null     float64
 7   vs      32 non-null     int64  
 8   am      32 non-null     int64  
 9   gear    32 non-null     int64  
 10  carb    32 non-null     int64  
dtypes: float64(5), int64(6)
memory usage: 2.9 KB


In [72]:
data.describe()

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
count,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0,32.0
mean,20.090625,6.1875,230.721875,146.6875,3.596563,3.21725,17.84875,0.4375,0.40625,3.6875,2.8125
std,6.026948,1.785922,123.938694,68.562868,0.534679,0.978457,1.786943,0.504016,0.498991,0.737804,1.6152
min,10.4,4.0,71.1,52.0,2.76,1.513,14.5,0.0,0.0,3.0,1.0
25%,15.425,4.0,120.825,96.5,3.08,2.58125,16.8925,0.0,0.0,3.0,2.0
50%,19.2,6.0,196.3,123.0,3.695,3.325,17.71,0.0,0.0,4.0,2.0
75%,22.8,8.0,326.0,180.0,3.92,3.61,18.9,1.0,1.0,4.0,4.0
max,33.9,8.0,472.0,335.0,4.93,5.424,22.9,1.0,1.0,5.0,8.0


Compute the covariance matrix of the dataset

In [73]:
cov_matrix = data.cov()

print("Covariance Matrix:")
cov_matrix

Covariance Matrix:


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
mpg,36.324103,-9.172379,-633.097208,-320.732056,2.195064,-5.116685,4.509149,2.017137,1.803931,2.135685,-5.363105
cyl,-9.172379,3.189516,199.660282,101.931452,-0.668367,1.367371,-1.886855,-0.729839,-0.465726,-0.649194,1.520161
disp,-633.097208,199.660282,15360.799829,6721.158669,-47.064019,107.684204,-96.051681,-44.377621,-36.564012,-50.802621,79.06875
hp,-320.732056,101.931452,6721.158669,4700.866935,-16.451109,44.192661,-86.770081,-24.987903,-8.320565,-6.358871,83.03629
drat,2.195064,-0.668367,-47.064019,-16.451109,0.285881,-0.372721,0.087141,0.118649,0.190151,0.275988,-0.078407
wt,-5.116685,1.367371,107.684204,44.192661,-0.372721,0.957379,-0.305482,-0.273661,-0.338105,-0.421081,0.67579
qsec,4.509149,-1.886855,-96.051681,-86.770081,0.087141,-0.305482,3.193166,0.670565,-0.20496,-0.280403,-1.894113
vs,2.017137,-0.729839,-44.377621,-24.987903,0.118649,-0.273661,0.670565,0.254032,0.042339,0.076613,-0.46371
am,1.803931,-0.465726,-36.564012,-8.320565,0.190151,-0.338105,-0.20496,0.042339,0.248992,0.292339,0.046371
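gear,2.135685,-0.649194,-50.802621,-6.358871,0.275988,-0.421081,-0.280403,0.076613,0.292339,0.544355,0.326613
carb,-5.363105,1.520161,79.06875,83.03629,-0.078407,0.67579,-1.894113,-0.46371,0.046371,0.326613,2.608871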


Compute the eigenvectors and the eigenvalues of the covariance matrix

In [74]:
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)

print(eigenvalues)


[1.86412732e+04 1.45527582e+03 9.43114274e+00 1.70733638e+00
 8.21717176e-01 4.40286790e-01 9.52210464e-02 8.17733529e-02
 3.93719949e-02 4.43742434e-02 6.28491258e-02]


In [75]:
print(eigenvectors)

[[ 3.81181985e-02 -9.18484655e-03  9.82070847e-01  4.76347838e-02
  -8.83284292e-02 -1.43790084e-01 -3.92391738e-02  2.27104005e-02
  -1.58569365e-02  3.06303615e-02  2.79013881e-03]
 [-1.20351498e-02  3.37248716e-03 -6.34839420e-02 -2.27991962e-01
   2.38725898e-01 -7.93818050e-01  4.25011021e-01 -1.89040332e-01
   1.45445363e-01  1.31718534e-01 -4.26772061e-02]
 [-8.99568146e-01 -4.35372320e-01  3.14426562e-02 -5.08682642e-03
  -1.07359688e-02  7.42413761e-03  5.82397980e-04 -5.84146399e-04
   9.42026215e-04 -5.39913212e-03 -3.53271286e-03]
 [-4.34784387e-01  8.99307303e-01  2.50930486e-02  3.57156383e-02
   1.65519386e-02  1.65368455e-03 -2.21253798e-03  4.74808678e-06
  -2.15261018e-03  1.86255377e-03  3.73408459e-03]
 [ 2.66007737e-03  3.90020536e-03  3.97249277e-02 -5.71293572e-02
  -1.33327645e-01  2.27229260e-01  3.48474105e-02 -9.38581717e-01
  -9.73818815e-02  1.84102094e-01  1.41311095e-02]
 [-6.23940543e-03 -4.86102295e-03 -8.49102579e-02  1.27962867e-01
  -2.43542958e-01 -

Compute the PVE of principal components

In [76]:

pve = eigenvalues / np.sum(eigenvalues)
cumulative_pve = np.cumsum(pve)

print("Proportion of Variance Explained (PVE):")
print(pve)
print("\nCumulative PVE:")
print(cumulative_pve)

Proportion of Variance Explained (PVE):
[9.26998858e-01 7.23683953e-02 4.68994713e-04 8.49029388e-05
 4.08625997e-05 2.18947144e-05 4.73518094e-06 4.06644997e-06
 1.95790244e-06 2.20665577e-06 3.12538030e-06]

Cumulative PVE:
[0.92699886 0.99936725 0.99983625 0.99992115 0.99996201 0.99998391
 0.99998864 0.99999271 0.99999467 0.99999687 1.        ]
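
`np.linalg.eig` does not guarantee any ordering of the eigenvalues, and in the output above the last three are not in descending order. A hedged sketch of a helper that sorts first and then finds the smallest number of components reaching a target cumulative PVE follows. The function name `n_components_for` and its arguments are hypothetical; the eigenvalues are copied from the printed output.

```python
import numpy as np

# Eigenvalues copied from the printed output above (unsorted, as returned by eig)
eigenvalues = np.array([
    1.86412732e+04, 1.45527582e+03, 9.43114274e+00, 1.70733638e+00,
    8.21717176e-01, 4.40286790e-01, 9.52210464e-02, 8.17733529e-02,
    3.93719949e-02, 4.43742434e-02, 6.28491258e-02,
])

def n_components_for(eigenvalues, threshold):
    """Smallest number of PCs whose cumulative PVE is >= threshold."""
    lam = np.sort(eigenvalues)[::-1]            # descending order
    cum = np.cumsum(lam) / lam.sum()            # cumulative PVE
    # searchsorted (side='left') gives the first index with cum[i] >= threshold
    k = int(np.searchsorted(cum, threshold)) + 1
    return min(k, len(lam))                     # guard against round-off at 1.0

for t in (0.90, 0.999):
    print(t, n_components_for(eigenvalues, t))
```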


**<font color='darkorange'>Question 2.1</font>**
- (a) How many PCs are sufficient to explain at least approximately 90% of the total variation in the data?
- (b) Which of the original features get large weights (loadings) in PC1 ?

Standardize the data with `StandardScaler()` in scikit-learn.

Compute the covariance matrix of the standardized dataset

Compute the eigenvectors and the eigenvalues of the covariance matrix

Compute the PVE of principal components
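
The four steps above can be sketched on a small synthetic dataset. The data below are hypothetical and are not the MTCARS sheet. To keep the example dependency-free, the z-scoring is done in NumPy with the population standard deviation (`ddof=0`), which is what `StandardScaler()` uses by default. One consequence is that the sample covariance of the standardized data equals the correlation matrix scaled by $N/(N-1)$, not the correlation matrix itself. The PVEs are unaffected by this constant factor.

```python
import numpy as np

# Hypothetical stand-in data: three features on very different scales
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([
    1000.0 * base + rng.normal(scale=100.0, size=(50, 1)),  # large scale
    base + rng.normal(scale=0.5, size=(50, 1)),             # unit scale
    0.01 * rng.normal(size=(50, 1)),                        # tiny scale
])
n = X.shape[0]

# 1) Standardize: zero mean, unit population std (ddof=0) per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2) Sample covariance of the standardized data (divides by n - 1)
S_std = np.cov(X_std, rowvar=False)
R = np.corrcoef(X, rowvar=False)
print("S_std == R * n/(n-1):", np.allclose(S_std, R * n / (n - 1)))

# 3) Eigenpairs, sorted by descending eigenvalue
w, V = np.linalg.eigh(S_std)
order = np.argsort(w)[::-1]
w, V = w[order], V[:, order]

# 4) PVE and cumulative PVE
pve = w / w.sum()
cumulative_pve = np.cumsum(pve)
print(pve)
print(cumulative_pve)
```

After standardization every feature contributes the same variance, so no single large-scale column can dominate the first component purely because of its units.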

**<font color='darkorange'>Question 2.2</font>**
- (a) How many PCs are sufficient to explain at least approximately 90% of the total variation in the data?
- (b) Which of the original features get large weights (loadings) in PC1 ?
- (c) Do the PCA outputs differ significantly depending on whether the dataset is standardized before applying PCA? Explain the results.