# Essential Linear Algebra for Finance

## 2.5 Principal Component Analysis
  - 2.5.1 Definition of Principal Components
  - 2.5.2 Principal Component Representation
  - 2.5.3 Case Study: PCA of European Equity Indices

In [None]:
from visualisations import *

Principal Component Analysis (PCA) has a wide range of benefits and uses. While I was studying for my degree, it was often used to reduce training times, memory requirements, and overfitting in machine learning models. In essence, it is a technique for reducing dimensionality and removing multicollinearity among features, and it is one of the most popular techniques in financial risk management.  

Applications of PCA in financial risk management include:  
- Multi-factor option pricing models
- Predicting movements in implied volatility surfaces
- Computing large-dimensional covariance matrices
- Reducing dimensionality for stress testing and scenario analysis

PCA can be applied to any set of stationary time series, but it works best for highly correlated systems, such as zero-coupon yields or commodity futures of different maturities, and implied volatility surfaces.

### Case Study (European Equity Indices)

Before laying out the various components and structures of Principal Component Analysis

### 2.5.1 Definition of Principal Components

To analyse a set of time series variables we can create a $T \times n$ matrix $\mathbf{X}$ where  

$T$ represents the number of observations  
$n$ represents the number of financial variables of interest (asset, portfolio, or risk factor returns)  

The columns of $\mathbf{X}$ are denoted $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$, where each $\mathbf{x}_i \in \mathbb{R}^T$ is the time series of one of a correlated set of returns. Each column is standardised to have zero mean and unit variance; the unit variance prevents the scale of the variables from dominating the principal components.  

Each row of $\mathbf{X}$ holds one observation of every variable, so all variables are measured at the same frequency and cover the same period.  

With the columns standardised in this way, the correlation matrix of the returns has the form  

$$\mathbf{V} = T^{-1} \mathbf{X}^{\top} \mathbf{X}$$
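
As a minimal sketch of this construction, the cell below standardises simulated returns and forms $\mathbf{V}$ directly. The array `returns`, its dimensions, and the one-factor simulation are illustrative assumptions, not the case-study data. Because `std` divides by $T$ by default, $T^{-1}\mathbf{X}^{\top}\mathbf{X}$ should coincide with the sample correlation matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 250, 4  # illustrative dimensions (assumption)

# Simulated correlated returns: one common factor plus idiosyncratic noise
common = rng.normal(size=(T, 1))
returns = 0.02 * (common + 0.5 * rng.normal(size=(T, n)))

# Standardise each column to zero mean and unit variance (ddof=0 matches T^{-1})
X = (returns - returns.mean(axis=0)) / returns.std(axis=0)

# V = T^{-1} X'X
V = X.T @ X / T

# Compare with NumPy's correlation matrix of the raw returns
matches_corrcoef = np.allclose(V, np.corrcoef(returns, rowvar=False))
print(matches_corrcoef)
```

If the columns were only demeaned and not scaled, the same product would give the covariance matrix instead.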

Sometimes $T < n$, for example when we want to extract a common trend from the returns of many mutual funds that have only a few years of data. In this case $\mathbf{V}$ is singular, with one or more eigenvalues equal to zero, so a full set of $n$ principal components cannot be determined. This is acceptable because, as we shall see, only the first few principal components are needed to represent the system adequately.  

When $T \geq n$ and the columns of $\mathbf{X}$ are linearly independent, $\mathbf{V}$ is positive definite and therefore has $n$ positive eigenvalues $\lambda_{1}, \lambda_{2}, \dots , \lambda_{n}$. 
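
The two cases can be illustrated with random data. This is a sketch under assumed dimensions; the helper `correlation_eigenvalues` is hypothetical and exists only for this example. Demeaning removes one further degree of freedom, so with $T$ observations the matrix has rank at most $T - 1$.

```python
import numpy as np

rng = np.random.default_rng(1)

def correlation_eigenvalues(T, n):
    """Eigenvalues (descending) of V = X'X / T for standardised random data."""
    raw = rng.normal(size=(T, n))
    X = (raw - raw.mean(axis=0)) / raw.std(axis=0)
    V = X.T @ X / T
    return np.linalg.eigvalsh(V)[::-1]

short_history = correlation_eigenvalues(T=4, n=6)    # T < n
long_history = correlation_eigenvalues(T=200, n=6)   # T >= n

# Count eigenvalues that are zero up to floating-point tolerance
n_zero_short = int(np.sum(np.isclose(short_history, 0.0, atol=1e-10)))
n_zero_long = int(np.sum(np.isclose(long_history, 0.0, atol=1e-10)))
print(n_zero_short, n_zero_long)
```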


**Eigen-decomposition of the Covariance Matrix**
$$
\mathbf{V} = \mathbf{W \Lambda W^\top}
$$

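$\mathbf{V}$ is the covariance matrix of $\mathbf{X}$ (this equals the correlation matrix of the returns when the columns of $\mathbf{X}$ are standardised to unit variance)  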

$\mathbf{W}$ is the orthogonal matrix of eigenvectors of $\mathbf{V}$

$\Lambda$ is the diagonal matrix of eigenvalues ($\lambda_1, \lambda_2, \dots, \lambda_{n}$) of $\mathbf{V}$ 

  
We order the eigenvalues on the diagonal of $\Lambda$ such that:
$$
\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n
$$
and align the columns of $\mathbf{W}$ accordingly.
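
A minimal sketch of the decomposition and ordering described above, using a small correlation matrix whose values are assumed for illustration only. `np.linalg.eigh` is designed for symmetric matrices and returns eigenvalues in ascending order, so the order is reversed explicitly.

```python
import numpy as np

# Illustrative 3x3 correlation matrix (assumed values, not from the case study)
V = np.array([[1.0, 0.8, 0.6],
              [0.8, 1.0, 0.7],
              [0.6, 0.7, 1.0]])

# Eigenvalues come back in ascending order for symmetric input
eigenvalues, eigenvectors = np.linalg.eigh(V)

# Reorder so that lambda_1 >= lambda_2 >= ... and align the columns of W
order = np.argsort(eigenvalues)[::-1]
lam = eigenvalues[order]
W = eigenvectors[:, order]
Lambda = np.diag(lam)

reconstructs = np.allclose(W @ Lambda @ W.T, V)  # V = W Lambda W'
orthogonal = np.allclose(W.T @ W, np.eye(3))     # W'W = I
print(lam, reconstructs, orthogonal)
```

The signs of the eigenvector columns are arbitrary, so different routines may return columns of $\mathbf{W}$ that differ by a factor of $-1$.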

**Principal Components Matrix**

The **principal components** time series matrix $\mathbf{P} \in \mathbb{R}^{T \times n}$ is defined as:
$$
\mathbf{P} = \mathbf{X W}
$$
- Each column of $\mathbf{P}$ is the time series of one principal component.
- Each principal component is a linear combination of the columns of $\mathbf{X}$, with weights given by the corresponding column of $\mathbf{W}$.
- The first column corresponds to the largest eigenvalue, the second to the next largest, and so forth.
- The principal components are uncorrelated with each other.
- The first principal component 'explains' the largest share of the total variation in $\mathbf{X}$.
- Each additional component 'explains' the greatest amount of the remaining variation.
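
These properties can be checked numerically. The sketch below uses simulated data with assumed dimensions; it builds $\mathbf{P} = \mathbf{X W}$ and confirms that the components are uncorrelated with variances equal to the ordered eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 500, 3  # illustrative dimensions (assumption)

# Simulated correlated series, standardised column by column
raw = rng.normal(size=(T, 1)) + 0.6 * rng.normal(size=(T, n))
X = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Eigen-decomposition of V with eigenvalues in descending order
V = X.T @ X / T
eigenvalues, eigenvectors = np.linalg.eigh(V)
order = np.argsort(eigenvalues)[::-1]
lam, W = eigenvalues[order], eigenvectors[:, order]

# Principal components: each column of P is a linear combination of the columns of X
P = X @ W

# Covariance of the components: off-diagonals should vanish, diagonal should equal lam
P_cov = P.T @ P / T
uncorrelated = np.allclose(P_cov, np.diag(lam))
print(uncorrelated)
```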

**Variance Explained by Each Principal Component**

The variance of the $i$-th principal component, where $P_i$ denotes the $i$-th column of $\mathbf{P}$, is:
$$
\mathrm{Var}(P_i) = \lambda_i
$$
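
This follows in one step from the definitions above. Assuming the columns of $\mathbf{X}$ have zero mean, $T^{-1}\mathbf{P}^{\top}\mathbf{P}$ is the covariance matrix of the principal components, and since $\mathbf{W}$ is orthogonal ($\mathbf{W}^{\top}\mathbf{W} = \mathbf{I}$):

$$
T^{-1}\mathbf{P}^{\top}\mathbf{P} = T^{-1}\mathbf{W}^{\top}\mathbf{X}^{\top}\mathbf{X}\mathbf{W} = \mathbf{W}^{\top}\mathbf{V}\mathbf{W} = \mathbf{W}^{\top}\mathbf{W}\Lambda\mathbf{W}^{\top}\mathbf{W} = \Lambda
$$

The diagonal entries give the variances $\lambda_i$, and the zero off-diagonal entries show that the components are uncorrelated.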

The proportion of total variance explained by the $i$-th principal component is:
$$
\frac{\lambda_i}{\sum_{j=1}^n \lambda_j}
$$
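
As a short worked example, the proportions can be computed from the eigenvalues of the correlation matrix that are printed in the case-study output later in this notebook. The cumulative proportion is an addition here, included because it is the usual guide for deciding how many components to keep.

```python
import numpy as np

# Eigenvalues of the correlation matrix, copied from the case-study output
eigenvalues = np.array([5.30911288, 0.22269855, 0.16457789,
                        0.14111801, 0.1090868, 0.05340586])

# Proportion of total variance explained by each component
explained_ratio = eigenvalues / eigenvalues.sum()

# Cumulative proportion explained by the first k components
cumulative_ratio = np.cumsum(explained_ratio)

print(np.round(explained_ratio, 4))
print(np.round(cumulative_ratio, 4))
```

For a correlation matrix the eigenvalues sum to $n$ (here 6), so the first component accounts for roughly 88% of the total variation, in line with the explained variance ratio reported by the scikit-learn cell.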



**Summary Table**

| Symbol | Meaning |
|--------|---------|
| $X$ | Centered data matrix ($T \times n$) |
| $V$ | Covariance matrix of $X$ ($n \times n$) |
| $W$ | Eigenvector matrix of $V$ ($n \times n$) |
| $\Lambda$ | Diagonal matrix of eigenvalues ($n \times n$) |
| $P$ | Principal components time series matrix ($T \times n$) |
| $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ | Ordered eigenvalues (largest to smallest) |

---

This provides the core PCA decomposition and how principal components relate to the original data.



In [None]:
# Import numpy library
import numpy as np

corr_matrix = np.array([[1,	0.936586026,	0.88445323,	0.864977312,	0.822904373,	0.854890669],
                        [0.936586026,	1,	0.911694445,	0.893083504,	0.847068766,	0.896280308],
                        [0.88445323, 0.911694445, 1, 0.836719949, 0.830307112, 0.862673077],
                        [0.864977312, 0.893083504, 0.836719949,	1, 0.789697754,	0.855688987],
                        [0.822904373, 0.847068766, 0.830307112, 0.789697754, 1,	0.833184836],
                        [0.854890669, 0.896280308, 0.862673077,	0.855688987, 0.833184836, 1]])

# Compute the eigenvalues and eigenvectors.
# eigh is used because the correlation matrix is symmetric; it returns real values.
eigenvalues, eigenvectors = np.linalg.eigh(corr_matrix)

# Sort in descending order of eigenvalue and align the eigenvector columns
idx = eigenvalues.argsort()[::-1]
sorted_eigenvalues = eigenvalues[idx]
sorted_eigenvectors = eigenvectors[:, idx]

# Display results
print("Eigenvalues (sorted):")
print(sorted_eigenvalues)

print("\nCorresponding Eigenvectors (columns):")
print(sorted_eigenvectors)

In [None]:
# Proportion of total variance explained by the second component (index 1)
sorted_eigenvalues[1] / sorted_eigenvalues.sum()

In [5]:
import pandas as pd

df = pd.read_excel('PCA Case Study.xlsx')
print(df.head())

        Date     AEX      CAC      DAX    FTSE     IBEX    MIB
0 2001-01-02  635.80  5758.02  6382.31  6198.1   9601.4  42992
1 2001-01-08  639.86  5834.34  6490.03  6165.5   9890.2  44304
2 2001-01-15  628.88  5845.73  6651.53  6209.3   9955.5  44469
3 2001-01-22  632.57  5925.62  6695.20  6294.3  10090.1  44901
4 2001-01-29  630.98  5826.37  6638.20  6256.4   9848.5  44049


In [7]:
for col in df.columns:
    if col != 'Date':  # skip the date column
        base_price = df[col].iloc[0]
        df[col + '_WI'] = 100 * df[col] / base_price

print(df.head())


        Date     AEX      CAC      DAX    FTSE     IBEX    MIB      AEX_WI  \
0 2001-01-02  635.80  5758.02  6382.31  6198.1   9601.4  42992  100.000000   
1 2001-01-08  639.86  5834.34  6490.03  6165.5   9890.2  44304  100.638566   
2 2001-01-15  628.88  5845.73  6651.53  6209.3   9955.5  44469   98.911607   
3 2001-01-22  632.57  5925.62  6695.20  6294.3  10090.1  44901   99.491979   
4 2001-01-29  630.98  5826.37  6638.20  6256.4   9848.5  44049   99.241900   

       CAC_WI      DAX_WI     FTSE_WI     IBEX_WI      MIB_WI  
0  100.000000  100.000000  100.000000  100.000000  100.000000  
1  101.325456  101.687790   99.474032  103.007895  103.051731  
2  101.523267  104.218222  100.180701  103.688004  103.435523  
3  102.910723  104.902457  101.552089  105.089883  104.440361  
4  101.187040  104.009363  100.940611  102.573583  102.458597  


In [8]:
# Loop through wealth index columns (assuming they end with '_WI')
wi_cols = [col for col in df.columns if col.endswith('_WI')]

for col in wi_cols:
    ret_col = col.replace('_WI', '_Return')
    # Calculate simple returns using pct_change()
    df[ret_col] = df[col].pct_change()

print(df[[*wi_cols, *[c.replace('_WI', '_Return') for c in wi_cols]]].head())


       AEX_WI      CAC_WI      DAX_WI     FTSE_WI     IBEX_WI      MIB_WI  \
0  100.000000  100.000000  100.000000  100.000000  100.000000  100.000000   
1  100.638566  101.325456  101.687790   99.474032  103.007895  103.051731   
2   98.911607  101.523267  104.218222  100.180701  103.688004  103.435523   
3   99.491979  102.910723  104.902457  101.552089  105.089883  104.440361   
4   99.241900  101.187040  104.009363  100.940611  102.573583  102.458597   

   AEX_Return  CAC_Return  DAX_Return  FTSE_Return  IBEX_Return  MIB_Return  
0         NaN         NaN         NaN          NaN          NaN         NaN  
1    0.006386    0.013255    0.016878    -0.005260     0.030079    0.030517  
2   -0.017160    0.001952    0.024884     0.007104     0.006602    0.003724  
3    0.005868    0.013666    0.006565     0.013689     0.013520    0.009715  
4   -0.002514   -0.016749   -0.008514    -0.006021    -0.023944   -0.018975  


In [9]:
df

Unnamed: 0,Date,AEX,CAC,DAX,FTSE,IBEX,MIB,AEX_WI,CAC_WI,DAX_WI,FTSE_WI,IBEX_WI,MIB_WI,AEX_Return,CAC_Return,DAX_Return,FTSE_Return,IBEX_Return,MIB_Return
0,2001-01-02,635.80,5758.02,6382.31,6198.1,9601.4,42992,100.000000,100.000000,100.000000,100.000000,100.000000,100.000000,,,,,,
1,2001-01-08,639.86,5834.34,6490.03,6165.5,9890.2,44304,100.638566,101.325456,101.687790,99.474032,103.007895,103.051731,0.006386,0.013255,0.016878,-0.005260,0.030079,0.030517
2,2001-01-15,628.88,5845.73,6651.53,6209.3,9955.5,44469,98.911607,101.523267,104.218222,100.180701,103.688004,103.435523,-0.017160,0.001952,0.024884,0.007104,0.006602,0.003724
3,2001-01-22,632.57,5925.62,6695.20,6294.3,10090.1,44901,99.491979,102.910723,104.902457,101.552089,105.089883,104.440361,0.005868,0.013666,0.006565,0.013689,0.013520,0.009715
4,2001-01-29,630.98,5826.37,6638.20,6256.4,9848.5,44049,99.241900,101.187040,104.009363,100.940611,102.573583,102.458597,-0.002514,-0.016749,-0.008514,-0.006021,-0.023944,-0.018975
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
285,2006-06-19,429.14,4817.63,5529.74,5692.1,11274.2,35562,67.496068,83.668171,86.641670,91.836208,117.422459,82.717715,0.020498,0.026143,0.028596,0.016919,0.028405,0.023897
286,2006-06-26,440.25,4965.96,5683.31,5833.4,11548.1,36345,69.243473,86.244230,89.047853,94.115939,120.275168,84.538984,0.025889,0.030789,0.027772,0.024824,0.024294,0.022018
287,2006-07-03,440.70,4953.71,5681.85,5888.9,11626.7,36422,69.314250,86.031483,89.024977,95.011374,121.093799,84.718087,0.001022,-0.002467,-0.000257,0.009514,0.006806,0.002119
288,2006-07-10,428.33,4780.79,5422.22,5707.6,11240.2,35425,67.368669,83.028367,84.957014,92.086285,117.068344,82.399051,-0.028069,-0.034907,-0.045695,-0.030787,-0.033242,-0.027374


In [10]:
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Select the return columns
return_cols = [col for col in df.columns if col.endswith('_Return')]

# Drop rows with NaN (usually the first row after pct_change)
returns_data = df[return_cols].dropna()

# Standardize returns (zero mean, unit variance)
scaler = StandardScaler()
returns_scaled = scaler.fit_transform(returns_data)

# Fit PCA - keep all components or specify n_components
pca = PCA(n_components=len(return_cols))
principal_components = pca.fit_transform(returns_scaled)

# Create a DataFrame of principal components
pc_df = pd.DataFrame(
    principal_components, 
    columns=[f'PC{i+1}' for i in range(len(return_cols))],
    index=returns_data.index
)

print(pc_df.head())

# Explained variance ratio (how much variance each PC explains)
print("Explained variance ratio:", pca.explained_variance_ratio_)


        PC1       PC2       PC3       PC4       PC5       PC6
1  1.300486  0.958780  0.055228  0.656085 -0.283453  0.086685
2  0.404385  0.130551  0.074687  0.276044  0.827720  0.128769
3  0.971577 -0.005254  0.330825 -0.146256  0.063873  0.172298
4 -1.142580 -0.584006 -0.331975 -0.145502  0.047431 -0.302757
5 -1.366834  0.730934 -0.232202 -0.807004  0.049657 -0.068227
Explained variance ratio: [0.88485215 0.03711643 0.02742965 0.02351967 0.01818113 0.00890098]


In [12]:
returns_data.corr()

| | AEX_Return | CAC_Return | DAX_Return | FTSE_Return | IBEX_Return | MIB_Return |
|---|---|---|---|---|---|---|
| AEX_Return | 1.0 | 0.936586 | 0.884453 | 0.864977 | 0.822904 | 0.854891 |
| CAC_Return | 0.936586 | 1.0 | 0.911694 | 0.893084 | 0.847069 | 0.89628 |
| DAX_Return | 0.884453 | 0.911694 | 1.0 | 0.83672 | 0.830307 | 0.862673 |
| FTSE_Return | 0.864977 | 0.893084 | 0.83672 | 1.0 | 0.789698 | 0.855689 |
| IBEX_Return | 0.822904 | 0.847069 | 0.830307 | 0.789698 | 1.0 | 0.833185 |
| MIB_Return | 0.854891 | 0.89628 | 0.862673 | 0.855689 | 0.833185 | 1.0 |


In [13]:
returns_data.cov()

| | AEX_Return | CAC_Return | DAX_Return | FTSE_Return | IBEX_Return | MIB_Return |
|---|---|---|---|---|---|---|
| AEX_Return | 0.001075 | 0.000864 | 0.000996 | 0.000604 | 0.00071 | 0.000752 |
| CAC_Return | 0.000864 | 0.000792 | 0.000881 | 0.000535 | 0.000627 | 0.000677 |
| DAX_Return | 0.000996 | 0.000881 | 0.00118 | 0.000612 | 0.000751 | 0.000796 |
| FTSE_Return | 0.000604 | 0.000535 | 0.000612 | 0.000454 | 0.000443 | 0.000489 |
| IBEX_Return | 0.00071 | 0.000627 | 0.000751 | 0.000443 | 0.000693 | 0.000589 |
| MIB_Return | 0.000752 | 0.000677 | 0.000796 | 0.000489 | 0.000589 | 0.00072 |


In [22]:
import numpy as np

eigvals_cov, eigvecs_cov = np.linalg.eigh(returns_data.cov())

idx = eigvals_cov.argsort()[::-1]

eigvals_cov_sorted = eigvals_cov[idx]
eigvecs_cov_sorted = eigvecs_cov[:, idx]

# Diagonal matrix of the sorted eigenvalues
eigvals_cov_matrix = np.diag(eigvals_cov_sorted)

print("The eigenvalues of the covariance matrix are:", eigvals_cov_sorted)
print()
print(eigvals_cov_matrix)
print()
print("The eigenvectors of the covariance matrix are:", eigvecs_cov_sorted)


The eigenvalues of the covariance matrix are: [4.38184738e-03 1.59403922e-04 1.38777912e-04 1.17255851e-04
 7.26807935e-05 4.40486304e-05]

[[4.38184738e-03 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 1.59403922e-04 0.00000000e+00 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 1.38777912e-04 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 1.17255851e-04
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  7.26807935e-05 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00
  0.00000000e+00 4.40486304e-05]]

The eigenvectors of the covariance matrix are: [[ 0.47421644 -0.45356331  0.37371785 -0.46042398  0.31469713  0.3445425 ]
 [ 0.41371914 -0.17113779  0.14101733  0.01254856 -0.05440208 -0.88121813]
 [ 0.49581356 -0.14229564 -0.84429871  0.01950407 -0.06277944  0.12945609]
 [ 0.29404233 -0

In [27]:
import numpy as np

eigvals_corr, eigvecs_corr = np.linalg.eigh(returns_data.corr())

# Sort in descending order, using the indices computed for the correlation matrix
idx_corr = eigvals_corr.argsort()[::-1]

eigvals_corr_sorted = eigvals_corr[idx_corr]
eigvecs_corr_sorted = eigvecs_corr[:, idx_corr]

eigvals_corr_matrix = np.diag(eigvals_corr_sorted)

print("The eigenvalues of the correlation matrix are:", eigvals_corr_sorted)
print()
print(eigvals_corr_matrix)
print()
print("The eigenvectors of the correlation matrix are:", eigvecs_corr_sorted)

The eigenvalues of the correlation matrix are: [5.30911288 0.22269855 0.16457789 0.14111801 0.1090868  0.05340586]

[[5.30911288 0.         0.         0.         0.         0.        ]
 [0.         0.22269855 0.         0.         0.         0.        ]
 [0.         0.         0.16457789 0.         0.         0.        ]
 [0.         0.         0.         0.14111801 0.         0.        ]
 [0.         0.         0.         0.         0.1090868  0.        ]
 [0.         0.         0.         0.         0.         0.05340586]]

The eigenvectors of the correlation matrix are: [[ 4.12788178e-01 -2.20488289e-01  3.89216711e-01 -3.10855877e-01
  -5.60703462e-01 -4.67419981e-01]
 [ 4.22105563e-01 -1.64918360e-01  1.75044853e-01 -4.41163958e-02
  -1.88904006e-01  8.52265890e-01]
 [ 4.09744427e-01 -7.37371021e-04  5.47167886e-01  2.85859883e-01
   6.53225072e-01 -1.55876225e-01]
 [ 4.02996851e-01 -4.57885216e-01 -5.79694838e-01 -3.77150334e-01
   3.71975290e-01 -1.06210029e-01]
 [ 3.93466071e-0

In [28]:
import yfinance as yf
import pandas as pd

# Define the indices and their Yahoo Finance tickers
indices = {
    'AEX': '^AEX',
    'CAC 40': '^FCHI',
    'DAX': '^GDAXI',
    'FTSE 100': '^FTSE',
    'IBEX 35': '^IBEX',
    'FTSE MIB': 'FTSEMIB.MI'
}

# Set the date range
start_date = '2015-05-18'
end_date = '2025-05-18'

# Create an empty DataFrame to store the data
weekly_data = pd.DataFrame()

# Fetch weekly adjusted close prices for each index
for name, ticker in indices.items():
    data = yf.download(ticker, start=start_date, end=end_date, interval='1wk')
    weekly_data[name] = data['Adj Close']

# Save the data to a CSV file
weekly_data.to_csv('european_indices_weekly.csv')


Failed to get ticker '^AEX' reason: Expecting value: line 1 column 1 (char 0)
[*********************100%***********************]  1 of 1 completed

1 Failed download:
['^AEX']: YFTzMissingError('$%ticker%: possibly delisted; no timezone found')
Failed to get ticker '^FCHI' reason: Expecting value: line 1 column 1 (char 0)
[*********************100%***********************]  1 of 1 completed

1 Failed download:
['^FCHI']: YFTzMissingError('$%ticker%: possibly delisted; no timezone found')
Failed to get ticker '^GDAXI' reason: Expecting value: line 1 column 1 (char 0)
[*********************100%***********************]  1 of 1 completed

1 Failed download:
['^GDAXI']: YFTzMissingError('$%ticker%: possibly delisted; no timezone found')
Failed to get ticker '^FTSE' reason: Expecting value: line 1 column 1 (char 0)
[*********************100%***********************]  1 of 1 completed

1 Failed download:
['^FTSE']: YFTzMissingError('$%ticker%: possibly delisted; no timezone found')
Failed to ge

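The Alpha Vantage API key is read from an environment variable so that it is not stored in the notebook.

In [29]:
import os

# ALPHAVANTAGE_API_KEY is an assumed variable name; set it in your environment first
api_key = os.environ["ALPHAVANTAGE_API_KEY"]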

In [30]:
import requests
import pandas as pd
import time



# Index symbols in Alpha Vantage (replace with correct tickers if needed)
indices = {
    'AEX': '^AEX',
    'CAC 40': '^FCHI',
    'DAX': '^GDAXI',
    'FTSE 100': '^FTSE',
    'IBEX 35': '^IBEX',
    'FTSE MIB': 'FTSEMIB.MI'
}

# Function to fetch data and save as CSV
def fetch_index_data(symbol, name):
    function = 'TIME_SERIES_WEEKLY_ADJUSTED'
    url = f'https://www.alphavantage.co/query?function={function}&symbol={symbol}&apikey={api_key}&datatype=csv'

    response = requests.get(url)
    if response.status_code == 200:
        with open(f'{name}_weekly.csv', 'w') as file:
            file.write(response.text)
        print(f'Data for {name} downloaded successfully.')
    else:
        print(f'Failed to download data for {name}.')

# Loop through each index
for name, symbol in indices.items():
    fetch_index_data(symbol, name)
    time.sleep(12)  # Respect free tier API limit (5 calls per minute)



Data for AEX downloaded successfully.
Data for CAC 40 downloaded successfully.
Data for DAX downloaded successfully.
Data for FTSE 100 downloaded successfully.
Data for IBEX 35 downloaded successfully.
Data for FTSE MIB downloaded successfully.


In [31]:
import pandas as pd

# List of index names (must match your CSV filenames like 'AEX_weekly.csv')
indices = ['AEX', 'CAC 40', 'DAX', 'FTSE 100', 'IBEX 35', 'FTSE MIB']

# Initialize an empty DataFrame
combined_df = pd.DataFrame()

for index in indices:
    filename = f'{index}_weekly.csv'
    # Use a separate name so the case-study DataFrame `df` is not overwritten
    index_df = pd.read_csv(filename, usecols=['timestamp', 'adjusted_close'])
    index_df = index_df.rename(columns={'adjusted_close': index})
    index_df['timestamp'] = pd.to_datetime(index_df['timestamp'])

    if combined_df.empty:
        combined_df = index_df
    else:
        combined_df = pd.merge(combined_df, index_df, on='timestamp', how='outer')

# Sort by date
combined_df.sort_values('timestamp', inplace=True)
combined_df.reset_index(drop=True, inplace=True)

# Save to a consolidated CSV
combined_df.to_csv('combined_indices_weekly.csv', index=False)

print(combined_df.head())


ValueError: Usecols do not match columns, columns expected but not found: ['adjusted_close', 'timestamp']
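
The error indicates that at least one downloaded file lacks the `timestamp` and `adjusted_close` columns. One plausible cause, stated here as an assumption and not a confirmed diagnosis, is that the API returned a message instead of CSV data even though the HTTP status code was 200. A minimal sketch for checking each file's header before calling `pd.read_csv` follows; the helper `has_expected_columns` is hypothetical.

```python
import csv
from pathlib import Path

def has_expected_columns(path, expected=('timestamp', 'adjusted_close')):
    """Return True if the first row of the file at `path` contains all expected column names."""
    file_path = Path(path)
    if not file_path.exists():
        return False
    with file_path.open(newline='') as handle:
        # An empty file yields an empty header, so the check fails
        header = next(csv.reader(handle), [])
    return all(name in header for name in expected)

# Check the files written by the download cell
for index in ['AEX', 'CAC 40', 'DAX', 'FTSE 100', 'IBEX 35', 'FTSE MIB']:
    filename = f'{index}_weekly.csv'
    print(filename, has_expected_columns(filename))
```

Any file reported as `False` is worth opening in a text editor to see what was actually returned.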

In [34]:
import requests

# Query the weekly series for ^FTSE, reusing the api_key defined above.
# Passing params lets requests URL-encode the '^' in the symbol.
url = 'https://www.alphavantage.co/query'
params = {'function': 'TIME_SERIES_WEEKLY', 'symbol': '^FTSE', 'apikey': api_key}
r = requests.get(url, params=params)
data = r.json()

print(data)

{}
