
# **Project Plan**
Joshua Dollison

MAT421-16133

3/23/2025

## 1. Introduction to the Problem
Managing portfolio risk is central to finance and investment strategy. As investors build diversified portfolios of dozens or hundreds of stocks, analyzing the joint behavior of asset returns becomes difficult because many of the assets are highly correlated. Principal Component Analysis (PCA) simplifies this problem by transforming correlated asset returns into a set of uncorrelated components, ordered by how much of the total variance each one explains.

The goal of this project is to apply PCA to historical daily stock returns from a selected basket of equities (e.g., a subset of the S&P 500) to identify dominant patterns in market behavior and reduce dimensionality. This approach provides insight into latent market factors (e.g., the overall market trend or sector performance) and helps in understanding systematic risk exposures.

## 2. Related Work

PCA has been extensively used in financial econometrics for factor modeling, risk attribution, and return forecasting. Applications include:

- Interpreting the first principal component as a proxy for the market factor, with each stock's loading on it playing a role similar to its CAPM beta.
- Using PCA to construct eigenportfolios that represent orthogonal risk exposures.
- Risk decomposition and volatility clustering in portfolio construction.

Academic references include Connor and Korajczyk (1986), who applied principal components to asset pricing, and more recent machine learning work that combines PCA with forecasting models. This project builds on these foundations by providing a reproducible implementation in Python.

## 3. Proposed Methodology / Models

1. **Data Collection**:
   - Use `yfinance` to download 5 years of daily adjusted close prices for 30-100 S&P 500 companies.
   - Focus on diverse sectors to maximize variation.

2. **Preprocessing**:
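   - Handle missing prices using forward fill or interpolation, before any returns are computed.
   - Convert the cleaned prices to daily log returns.
   - Standardize each return series (mean = 0, std = 1), so that PCA operates on the correlation matrix rather than the covariance matrix.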

3. **PCA Application**:
   - Apply PCA to the standardized return matrix.
   - Analyze explained variance ratio to select the number of components.

4. **Risk Attribution**:
   - Visualize loadings (i.e., stock contributions to each PC).
   - Interpret PCs as market-wide or sectoral trends.

5. **Reconstruction and Evaluation**:
   - Reconstruct the return matrix using top N PCs.
   - Compare variance captured and reconstruction error.
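
Steps 2 through 5 can be sketched end to end as follows. This is a minimal illustration, not the final notebook code: it assumes synthetic prices built from one shared "market" factor in place of the `yfinance` download from step 1, and the ticker names, sample sizes, and the choice `N = 3` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for adjusted close prices (assumption: the real
# project would load these with yfinance instead).
rng = np.random.default_rng(0)
n_days, n_stocks = 500, 8
tickers = [f"STOCK{i}" for i in range(n_stocks)]          # hypothetical names
market = rng.normal(0.0, 0.01, size=(n_days, 1))          # shared factor
noise = rng.normal(0.0, 0.005, size=(n_days, n_stocks))   # stock-specific part
prices = pd.DataFrame(
    100.0 * np.exp(np.cumsum(market + noise, axis=0)),
    index=pd.bdate_range("2020-01-01", periods=n_days),
    columns=tickers,
)
prices.iloc[10, 2] = np.nan  # simulate one missing quote

# Step 2: fill gaps in prices, take daily log returns, standardize.
prices = prices.ffill()
returns = np.log(prices / prices.shift(1)).dropna()
Z = StandardScaler().fit_transform(returns)  # each column: mean 0, std 1

# Step 3: fit PCA and read off the explained variance ratio.
pca = PCA()
scores = pca.fit_transform(Z)
explained = pca.explained_variance_ratio_

# Step 4: loadings table, one row per component, one column per stock.
loadings = pd.DataFrame(
    pca.components_,
    columns=tickers,
    index=[f"PC{i + 1}" for i in range(pca.n_components_)],
)

# Step 5: reconstruct the standardized returns from the top N components.
N = 3
Z_hat = scores[:, :N] @ pca.components_[:N] + pca.mean_
rmse = np.sqrt(np.mean((Z - Z_hat) ** 2))

print(explained.round(3))
print(f"RMSE with {N} components: {rmse:.3f}")
```

Because the synthetic data contains a single common factor, the first component should dominate here; real equity returns would be expected to spread variance over more components.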

Optional extensions include comparing PCA results before and after major financial events (e.g., COVID crash) to see shifts in principal directions.
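
One possible way to quantify a shift in principal directions is the absolute cosine similarity between the first principal component estimated on two windows. The sketch below is an assumption about how this comparison could be done, not a method fixed by the plan; the helper names `first_pc` and `direction_similarity` and the synthetic "broad" and "concentrated" regimes are hypothetical.

```python
import numpy as np

def first_pc(returns):
    """Leading unit eigenvector of the correlation matrix of `returns`."""
    corr = np.corrcoef(returns, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    return eigvecs[:, -1]

def direction_similarity(a, b):
    """Absolute cosine similarity; an eigenvector's sign is arbitrary."""
    return float(abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
n_days, n_stocks = 400, 6

# Regime A: every stock loads equally on one common factor.
broad = rng.normal(size=(n_days, 1)) * np.ones(n_stocks) \
    + rng.normal(size=(n_days, n_stocks))

# Regime B: only the first three stocks load on the common factor.
betas = np.array([2.0, 2.0, 2.0, 0.0, 0.0, 0.0])
concentrated = rng.normal(size=(n_days, 1)) * betas \
    + rng.normal(size=(n_days, n_stocks))

# Two halves of the same regime should give nearly the same direction.
sim_stable = direction_similarity(first_pc(broad[:200]), first_pc(broad[200:]))
# Different regimes should give a visibly lower similarity.
sim_shift = direction_similarity(first_pc(broad), first_pc(concentrated))

print(f"same regime: {sim_stable:.3f}, across regimes: {sim_shift:.3f}")
```

A similarity near 1 indicates that the dominant direction is stable across the two windows, while lower values suggest that the stocks driving the first component have changed.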

## 4. Experiment Setups

- **Tools**: Python 3.x, `yfinance`, `pandas`, `numpy`, `scikit-learn` (imported as `sklearn`), `matplotlib`, `seaborn`
- **Evaluation Metrics**:
  - Scree plot and cumulative explained variance
  - RMSE between original and reconstructed returns
  - Heatmaps of component loadings for interpretation
- **Baseline**:
  - Raw correlation matrix visualization
- **Validation**:
  - Compare results using different subsets of stocks (e.g., tech vs industrials)
  - Time slicing: 2018-2020 vs 2021-2023, which requires a download window that covers 2018 through 2023
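
The two numeric metrics above, cumulative explained variance and RMSE, can be computed together from a single decomposition. The following is a rough sketch that uses a plain NumPy SVD instead of `sklearn` so that the arithmetic is visible; the function name `pca_metrics`, the synthetic input, and the 70% threshold used to pick `n_keep` are assumptions for illustration.

```python
import numpy as np

def pca_metrics(Z):
    """Cumulative explained variance and RMSE for 1..p components.

    Z is assumed to be a (days x stocks) matrix of standardized returns.
    """
    Zc = Z - Z.mean(axis=0)
    U, s, Vt = np.linalg.svd(Zc, full_matrices=False)
    var = s ** 2                       # proportional to component variances
    cum_explained = np.cumsum(var) / var.sum()
    rmse = []
    for n in range(1, len(s) + 1):
        Z_hat = (U[:, :n] * s[:n]) @ Vt[:n]   # rank-n reconstruction
        rmse.append(np.sqrt(np.mean((Zc - Z_hat) ** 2)))
    return cum_explained, np.array(rmse)

# Synthetic standardized returns: one common factor plus noise.
rng = np.random.default_rng(2)
raw = rng.normal(size=(300, 1)) + 0.5 * rng.normal(size=(300, 5))
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

cum_explained, rmse = pca_metrics(Z)

# Smallest number of components whose cumulative share reaches 70%.
n_keep = int(np.searchsorted(cum_explained, 0.70) + 1)

print(cum_explained.round(3), rmse.round(3), n_keep)
```

The `cum_explained` array is what a scree or cumulative-variance plot would display, and `rmse` shows how quickly reconstruction error falls as components are added. For standardized input the two are linked: the squared RMSE with n components equals one minus the cumulative explained variance at n.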

The notebook will be structured with modular sections for loading, preprocessing, modeling, visualization, and analysis.

## 5. Expected Results

We expect the first 3-5 principal components to explain more than 70% of the total variance. The first PC will likely correspond to general market movements, while later components may represent sector-specific effects. Loadings should help identify which stocks contribute most to each risk factor.
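
To make the variance threshold concrete, it can be written in terms of eigenvalues. The notation here ($R$, $p$, $\lambda_i$, $\mathrm{EVR}$) is introduced only for this illustration. With standardized returns for $p$ stocks, PCA diagonalizes the sample correlation matrix $R$, whose eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$ sum to $\operatorname{tr}(R) = p$. The share of variance explained by the first $k$ components is then

$$
\mathrm{EVR}(k) = \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{p} \lambda_i} = \frac{1}{p} \sum_{i=1}^{k} \lambda_i .
$$

As a hypothetical example, with $p = 50$ stocks the condition $\mathrm{EVR}(5) > 0.70$ requires $\lambda_1 + \dots + \lambda_5 > 35$, meaning the top five eigenvalues would need to average more than 7 against an overall average of 1.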

By reconstructing the return series using a limited number of PCs, we anticipate that the reduced model will approximate the full correlation structure with lower complexity. These insights can be used for portfolio optimization, hedging strategies, or risk visualization.

The project will demonstrate that PCA is not only useful for compression but also for uncovering interpretable patterns in financial data, aiding in better decision-making.