# Smart ETF: Mimic SOXX with a Small Subset of Constituents

**Objective.** Build a feasible pipeline to approximate the performance of **SOXX (iShares Semiconductor ETF)** using a much smaller subset of its constituents (e.g., 3–10 names) with price-only data from free sources (e.g., `yfinance`).

**Constraints.**
- Data: Free OHLCV + corporate actions via `yfinance` (dividends/splits). Limited fundamentals may exist but are not guaranteed.
- Focus: U.S. markets, semiconductors (SOXX).

**Approach.**
1. Acquire SOXX + constituent prices (daily adjusted close).
2. Engineer market-based features (momentum, volatility, beta, liquidity proxies).
3. Calibrate a low-cardinality linear tracker (OLS/LASSO) → *Smart ETF weights*.
4. Evaluate tracking error, correlation, stability, and risk.
5. Communicate assumptions and caveats responsibly.
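
Step 3 can be sketched on synthetic data. This is a minimal illustration, not the notebook's final model: it fits OLS weights with `numpy.linalg.lstsq`, keeps the `k` largest weights by absolute value, and refits on that subset. The hard-threshold step is a simple stand-in for LASSO. The function name `sparse_ols_tracker`, the tickers `A`..`F`, and the synthetic returns are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

def sparse_ols_tracker(r_etf: pd.Series, r_stk: pd.DataFrame, k: int = 3) -> pd.Series:
    """Fit OLS (no intercept), keep the k largest |weights|, refit on that subset."""
    X, y = r_stk.to_numpy(), r_etf.to_numpy()
    w_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    keep = np.argsort(np.abs(w_full))[-k:]            # positions of the k largest weights
    w_sub, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
    return pd.Series(w_sub, index=r_stk.columns[keep]).sort_values(ascending=False)

# Synthetic data (hypothetical tickers): the "ETF" is driven by three names plus noise.
rng = np.random.default_rng(0)
r_stk = pd.DataFrame(rng.normal(0, 0.02, size=(500, 6)), columns=list("ABCDEF"))
r_etf = 0.5 * r_stk["A"] + 0.3 * r_stk["B"] + 0.2 * r_stk["C"] + rng.normal(0, 0.001, 500)

weights = sparse_ols_tracker(r_etf, r_stk, k=3)
print(weights.round(3))
```

On real data the weights would be re-estimated at each rebalance date, and a constraint such as non-negativity may be worth adding.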

> **Delivery reminder:** This master notebook is structured in **12 stages**, each with clear markdown guidance and runnable code stubs.



## Table of Contents
1. [stage01 — problem-framing-and-scoping](#stage01)
2. [stage02 — tooling-setup & slides-outline](#stage02)
3. [stage03 — python-fundamentals](#stage03)
4. [stage04 — data-acquisition-and-ingestion](#stage04)
5. [stage05 — data-storage](#stage05)
6. [stage06 — data-preprocessing](#stage06)


# stage01_problem-framing-and-scoping
<a id='stage01'></a>

**Goal.** Approximate **SOXX** daily performance using a **small subset of its holdings** via price-based features and linear models.  
**Primary KPIs.**
- **Correlation** between ETF and Smart ETF daily returns (↑).
- **R²** of regression (↑).
- **Tracking error** (annualized std of return differences) (↓).
- **Cumulative performance gap** over the backtest (≈ 0).
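
The four KPIs above can be computed from two daily return series. The sketch below is one possible formulation: `tracking_kpis` is a hypothetical helper, the 252-day annualization factor is an assumption, and R² is taken as the squared correlation, which equals the R² of a one-regressor OLS with intercept.

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed annualization factor for daily data

def tracking_kpis(r_etf: pd.Series, r_smart: pd.Series) -> dict:
    """Correlation, R^2, annualized tracking error, and cumulative return gap."""
    df = pd.concat([r_etf, r_smart], axis=1, keys=["etf", "smart"]).dropna()
    diff = df["smart"] - df["etf"]
    corr = df["etf"].corr(df["smart"])
    cum_etf = (1 + df["etf"]).prod() - 1
    cum_smart = (1 + df["smart"]).prod() - 1
    return {
        "corr": corr,
        "r2": corr ** 2,  # R^2 of a one-regressor OLS with intercept
        "tracking_error": diff.std() * np.sqrt(TRADING_DAYS),
        "cum_gap": cum_smart - cum_etf,
    }

# Toy returns, for illustration only.
idx = pd.date_range("2024-01-01", periods=4, freq="B")
r_etf = pd.Series([0.010, -0.020, 0.015, 0.005], index=idx)
r_smart = pd.Series([0.012, -0.018, 0.013, 0.004], index=idx)
kpis = tracking_kpis(r_etf, r_smart)
print({name: round(float(value), 4) for name, value in kpis.items()})
```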

**Scope & Feasibility.**
- **In-scope:** SOXX + (top N) semis; daily Adj Close; 2010–present (tunable via the config `start` date).
- **Out-of-scope (for now):** Full accounting data, transaction costs, taxes, intraday microstructure.

**Assumptions (declare explicitly):**
- Rebalance schedule (e.g., monthly/quarterly) is fixed and rule-based.
- Survivorship bias risk acknowledged; holdings list frozen to current top names for the backtest unless historical holdings are sourced explicitly.
- Dividends approximated via adjusted close; explicit dividend modeling omitted.
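
A fixed, rule-based rebalance schedule can be derived directly from the price index. The sketch below assumes the rule "last available trading day of each period"; the helper name `rebalance_dates` and that convention are illustrative choices, not something the notebook has committed to.

```python
import pandas as pd

def rebalance_dates(index: pd.DatetimeIndex, freq: str = "M") -> pd.DatetimeIndex:
    """Last available trading day in each period ('M' = monthly, 'Q' = quarterly)."""
    s = index.to_series()
    return pd.DatetimeIndex(s.groupby(index.to_period(freq)).max())

# Business days in Q1 2024 (market holidays are not modeled here).
idx = pd.bdate_range("2024-01-01", "2024-03-29")
monthly = rebalance_dates(idx, "M")
quarterly = rebalance_dates(idx, "Q")
print(list(monthly.strftime("%Y-%m-%d")))
print(list(quarterly.strftime("%Y-%m-%d")))
```

Deriving the dates from the actual price index, rather than from a calendar, means rebalances only fall on days with data.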

**Success Criteria.**
- A subset of at most 5–10 names achieves daily-return correlation ≥ 0.9 and annualized tracking error ≤ 2–3% (targets are adjustable).


# stage02_tooling-setup_slides-outline
<a id='stage02'></a>

**Environment & Libraries.**
- Core: `python`, `pandas`, `numpy`, `matplotlib`, `scikit-learn`, `statsmodels`, `yfinance`, `pyyaml`.


In [1]:
# install if running locally
!pip install yfinance pandas numpy matplotlib scikit-learn statsmodels pyyaml openpyxl



In [7]:
import os, sys, math, json, time, datetime as dt, warnings
from dataclasses import dataclass
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')

# paths
PROJECT_ROOT = os.getcwd()
DATA_DIR = os.path.join(PROJECT_ROOT, 'data')
FIG_DIR = os.path.join(PROJECT_ROOT, 'figures')
os.makedirs(DATA_DIR, exist_ok=True)
os.makedirs(FIG_DIR, exist_ok=True)

np.random.seed(69)

print('Setup complete. Folders:', DATA_DIR, FIG_DIR)

Setup complete. Folders: /Users/paramshah/Desktop/Bootcamp/project/notebooks/data /Users/paramshah/Desktop/Bootcamp/project/notebooks/figures


# stage03_python-fundamentals
<a id='stage03'></a>

In [3]:
# daily simple returns from prices
def simple_returns(prices: pd.Series) -> pd.Series:
    r = prices.pct_change()
    return r

def log_returns(prices: pd.Series) -> pd.Series:
    lr = np.log(prices).diff()
    return lr

# example stub
_demo = pd.Series([100, 101, 98, 99, 102], index=pd.date_range('2020-01-01', periods=5, freq='D'))
print('Prices\n', _demo)
print('Simple returns\n', simple_returns(_demo))
print('Log returns\n', log_returns(_demo))

Prices
 2020-01-01    100
2020-01-02    101
2020-01-03     98
2020-01-04     99
2020-01-05    102
Freq: D, dtype: int64
Simple returns
 2020-01-01         NaN
2020-01-02    0.010000
2020-01-03   -0.029703
2020-01-04    0.010204
2020-01-05    0.030303
Freq: D, dtype: float64
Log returns
 2020-01-01         NaN
2020-01-02    0.009950
2020-01-03   -0.030153
2020-01-04    0.010152
2020-01-05    0.029853
Freq: D, dtype: float64
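
The two return definitions are linked by the identity log return = log(1 + simple return). Log returns add across time, whereas simple returns compound. A short check on the same demo prices (the variable names here are illustrative):

```python
import numpy as np
import pandas as pd

prices = pd.Series([100, 101, 98, 99, 102], dtype=float)
simple = prices.pct_change().dropna()
log_r = np.log(prices).diff().dropna()

# Identity: log return = log(1 + simple return)
assert np.allclose(log_r, np.log1p(simple))

# Summed log returns and compounded simple returns give the same total return.
total_from_log = np.exp(log_r.sum()) - 1
total_from_simple = (1 + simple).prod() - 1
print(round(total_from_log, 4), round(total_from_simple, 4))  # both ≈ 0.02 (= 102/100 - 1)
```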


# stage04_data-acquisition-and-ingestion
<a id='stage04'></a>

We’ll pull **SOXX** and a curated subset of **semiconductor constituents** from `yfinance`.  
> Note: `yfinance` provides ETF prices directly, but historical holdings are not reliably available for free programmatically, so we **parameterize** a curated list (top names) and store it in YAML/JSON for reproducibility.

**Chosen baseline tickers (adjust as needed):**
- ETF: `SOXX`
- Candidates (subset of semis): `AVGO, TSM, AMD, INTC, QCOM, TXN, ASML, MU, ADI`
- Excluded: `NVDA`. Its rise is so large that including it would leave little need for the other constituents.

**Storage formats.**
- YAML for the config, because it is more readable than JSON.
- Parquet instead of CSV for price tables, for faster reads and writes.

**Date range:** 2010-01-01 → today (configurable).
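
Because the candidate list is fixed while the start date is configurable, it may help to confirm that every ticker has data from the chosen start. A late listing would shorten any sample built on common dates. The sketch below is a hypothetical helper (`coverage_report`) run on made-up tickers `AAA` and `BBB`, not on downloaded data.

```python
import numpy as np
import pandas as pd

def coverage_report(prices: pd.DataFrame) -> pd.DataFrame:
    """Per-ticker first/last valid date and number of missing observations."""
    return pd.DataFrame({
        "first_valid": pd.Series({c: prices[c].first_valid_index() for c in prices}),
        "last_valid": pd.Series({c: prices[c].last_valid_index() for c in prices}),
        "n_missing": prices.isna().sum(),
    })

# Toy panel: BBB starts two business days later than AAA.
idx = pd.bdate_range("2010-01-04", periods=5)
prices = pd.DataFrame(
    {"AAA": [1.0, 2.0, 3.0, 4.0, 5.0], "BBB": [np.nan, np.nan, 3.0, 4.0, 5.0]},
    index=idx,
)
report = coverage_report(prices)
print(report)
```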


In [4]:
import yaml
import datetime as dt
import yfinance as yf

CONFIG = {
    'etf': 'SOXX',
    'candidates': ['AVGO','TSM','AMD','INTC','QCOM','TXN','ASML','MU','ADI'],
    # 'NVDA' excluded: its rise is so large that including it would leave little need for the other constituents
    'start': '2010-01-01',
    'end': None,
    'price_field': 'Adj Close'
}

# Save config for reproducibility
with open(os.path.join(DATA_DIR, 'config_soxx.yml'), 'w') as f:
    yaml.safe_dump(CONFIG, f)

def fetch_prices(tickers, start, end=None, price_field='Adj Close') -> pd.DataFrame:
    """Download price panel for tickers; returns a DataFrame of the chosen price field."""
    df = yf.download(tickers, start=start, end=end, auto_adjust=False, progress=False)
    # A single ticker can come back in a different shape than a list of tickers; standardize:
    single = isinstance(tickers, str) or len(tickers) == 1
    if single:
        name = tickers if isinstance(tickers, str) else tickers[0]
        df = df[[price_field]].rename(columns={price_field: name})
    else:
        df = df[price_field]
    return df

etf_prices = fetch_prices(CONFIG['etf'], CONFIG['start'], CONFIG['end'], CONFIG['price_field'])
stock_prices = fetch_prices(CONFIG['candidates'], CONFIG['start'], CONFIG['end'], CONFIG['price_field'])

print('ETF price shape:', etf_prices.shape)
print('Stocks price shape:', stock_prices.shape)

etf_prices.to_parquet(os.path.join(DATA_DIR, 'soxx_prices_raw.parquet'))
stock_prices.to_parquet(os.path.join(DATA_DIR, 'semi_prices_raw.parquet'))

ETF price shape: (3934, 1)
Stocks price shape: (3934, 9)


# stage05_data-storage
<a id='stage05'></a>

We store **raw** and **processed** tables separately and maintain a **data dictionary** (YAML).  
- Raw: `data/*_raw.parquet`
- Processed: `data/*_proc.parquet`
- Metadata: `data/datadict.yml`
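
The raw/processed naming convention can be enforced with a small path helper, so that a typo in a suffix fails early. This is only a sketch: `table_path` and its arguments are hypothetical names, not part of the notebook.

```python
from pathlib import Path

VALID_STAGES = {"raw", "proc"}  # suffixes used by the convention above

def table_path(name: str, stage: str, data_dir: str = "data") -> Path:
    """Build a parquet path of the form <data_dir>/<name>_<stage>.parquet."""
    if stage not in VALID_STAGES:
        raise ValueError(f"stage must be one of {sorted(VALID_STAGES)}, got {stage!r}")
    return Path(data_dir) / f"{name}_{stage}.parquet"

raw_path = table_path("soxx_prices", "raw")
proc_path = table_path("prices", "proc")
print(raw_path.as_posix(), proc_path.as_posix())
```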


In [5]:
datadict = {
    'tables': {
        'soxx_prices_raw': {'path': 'data/soxx_prices_raw.parquet', 'keys': ['date'], 'description': 'Daily Adj Close for SOXX'},
        'semi_prices_raw': {'path': 'data/semi_prices_raw.parquet', 'keys': ['date','ticker'], 'description': 'Daily Adj Close for selected semiconductor stocks'}
    },
    'notes': 'All prices pulled via yfinance; adjusted close used for return calculations.'
}
with open(os.path.join(DATA_DIR, 'datadict.yml'), 'w') as f:
    yaml.safe_dump(datadict, f)

print('Wrote data dictionary to', os.path.join(DATA_DIR, 'datadict.yml'))

Wrote data dictionary to /Users/paramshah/Desktop/Bootcamp/project/notebooks/data/datadict.yml


# stage06_data-preprocessing
<a id='stage06'></a>

**Tasks.**
- align calendars (outer-join then forward-fill if necessary, or inner-join intersection).
- handle missing values (drop early NaNs, forward-fill sporadic gaps cautiously).
- compute daily returns from adjusted close.

**Decision:** for comparability, we **inner-join** on dates available for all series; report dropped periods.
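
The two alignment options can be compared on a toy example. The series names, dates, and the `limit=1` cap on forward-filling are assumptions chosen for illustration.

```python
import pandas as pd

idx = pd.bdate_range("2024-01-01", periods=5)
etf = pd.Series([100.0, 101.0, 102.0, 103.0, 104.0], index=idx, name="ETF")
# Hypothetical stock that is missing the 3rd and 5th days.
stk = pd.Series([50.0, 51.0, 52.0], index=idx[[0, 1, 3]], name="STK")

# Option A: inner join keeps only dates present in every series.
inner = pd.concat([etf, stk], axis=1, join="inner")

# Option B: outer join, then forward-fill gaps (limit=1 caps how stale a filled price may be).
outer = pd.concat([etf, stk], axis=1, join="outer").ffill(limit=1)

# Report what the inner join discarded.
dropped = etf.index.difference(inner.index)
print("inner rows:", len(inner), "| dropped dates:", list(dropped.strftime("%Y-%m-%d")))
print(outer)
```

A forward-filled price produces a zero return on the filled day, which can understate volatility. This is one reason to prefer the inner join when few dates are lost.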


In [6]:
etf_raw = pd.read_parquet(os.path.join(DATA_DIR, 'soxx_prices_raw.parquet'))

col = CONFIG['etf']
etf = etf_raw[col]

# etf_raw[col] is a DataFrame if the stored columns are a MultiIndex, a Series if they are flat
if isinstance(etf, pd.DataFrame):
    etf = etf.iloc[:, 0]
etf = etf.rename(col)


semi = pd.read_parquet(os.path.join(DATA_DIR, 'semi_prices_raw.parquet'))

# align on common dates
df_prices = pd.concat([etf, semi], axis=1)
initial_shape = df_prices.shape
df_prices = df_prices.dropna(how='any')

print('Initial shape:', initial_shape, ' -> after dropna:', df_prices.shape)

# compute daily returns
rets = df_prices.pct_change().dropna()
rets.columns.name = None

# split into etf returns (target) and stock returns (features)
r_etf = rets['SOXX'].rename('r_etf')
r_stk = rets.drop(columns=['SOXX'])

df_prices.to_parquet(os.path.join(DATA_DIR, 'prices_proc.parquet'))
rets.to_parquet(os.path.join(DATA_DIR, 'returns_proc.parquet'))
print('Saved processed price & returns tables.')

Initial shape: (3934, 10)  -> after dropna: (3934, 10)
Saved processed price & returns tables.
