CQF Final project

In [4]:
import yfinance as yf
import pandas as pd
from datetime import datetime, timedelta

# --- 1. Configuration: Set Dates and Define Asset Tickers ---

# Set the time period for data download (last 5 years)
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)

# A dictionary to organize all the asset groups and their tickers
# Volatility pair updated to a more robust US Spot vs. Futures ETF pair.
asset_groups = {
    # Commodities
    "precious_metals_triple": ["GC=F", "SI=F", "PL=F"], # Gold, Silver, Platinum Futures
    "oil_pair": ["CL=F", "BZ=F"],                     # WTI, Brent Crude Futures
    "agri_pair": ["ZC=F", "ZS=F"],                    # Corn, Soybean Futures

    # Fixed Income & Currency
    "yield_pair": ["^TNX", "IGLT.L"],                 # US 10Y Yield, iShares UK Gilts ETF
    "currency_pair": ["AUDUSD=X", "CADUSD=X"],        # AUD/USD, CAD/USD

    # Volatility
    "volatility_pair": ["^VIX", "VIXY"],              # US VIX Index vs. Short-Term VIX Futures ETF
    
    # Country Indices
    "eu_index_pair_1": ["^FCHI", "^GDAXI"],           # CAC 40, DAX
    "eu_index_pair_2": ["^IBEX", "FTSEMIB.MI"],      # IBEX 35, FTSE MIB

    # Equities
    "fr_banking_pair": ["BNP.PA", "GLE.PA"],          # BNP Paribas, Société Générale
    "fast_fashion_pair": ["ITX.MC", "HM-B.ST"],       # Inditex, H&M
    "german_auto_triple": ["VOW3.DE", "MBG.DE", "BMW.DE"], # VW, Mercedes, BMW
    "investor_ab_pair": ["INVE-A.ST", "INVE-B.ST"],    # Investor A, Investor B
    "vw_porsche_pair": ["VOW3.DE", "P911.DE"],        # VW, Porsche AG
    "semiconductor_pair": ["ASML.AS", "IFX.DE"],      # ASML, Infineon

    # ETFs
    "sector_etf_pair": ["XLRE", "XLU"]                # Real Estate ETF, Utilities ETF
}

# --- 2. Data Download ---

# Create an empty dictionary to store the downloaded dataframes
all_data = {}

print("Starting data download...")

for group_name, tickers in asset_groups.items():
    print(f"--> Downloading data for: {group_name}")
    try:
        # Download daily data for the specified tickers
        data = yf.download(tickers,
                           start=start_date.strftime('%Y-%m-%d'),
                           end=end_date.strftime('%Y-%m-%d'),
                           interval="1d",
                           auto_adjust=True, # Automatically adjust for splits/dividends
                           group_by='ticker')

        # When a single ticker in a group fails, yfinance might return a DataFrame
        # with only the successful tickers. We need to handle this.
        if isinstance(data.columns, pd.MultiIndex):
            # If multiple tickers are downloaded, stack them into a clean format
            df_processed = data.stack(level=0, future_stack=True).rename_axis(['Date', 'Ticker']).reset_index(level=1)
            # We are interested in the 'Close' price
            price_data = df_processed.pivot(columns='Ticker', values='Close')
        else:
            # If only one ticker was successful, it won't have a multi-index
            price_data = data[['Close']]
            # Rename column to the correct ticker if there's only one
            if len(tickers) == 1:
                price_data.columns = tickers

        # Forward-fill to handle non-trading days and then drop any remaining NaNs
        price_data = price_data.ffill().dropna()

        if not price_data.empty:
            all_data[group_name] = price_data
        else:
            print(f"    No data for {group_name} after processing.")

    except Exception as e:
        print(f"    An error occurred while downloading {group_name}: {e}")

print("\nData download complete.")

# --- 3. Verification ---

print("\n--- Verification ---")
print(f"Successfully downloaded data for {len(all_data)} groups.")
print("The following data groups are now available:")
for name in sorted(all_data.keys()):
    print(f"- {name}")

# Display the first few rows to verify a successful download
print("\nSample Data for 'volatility_pair':")
if 'volatility_pair' in all_data:
    print(all_data['volatility_pair'].head())
else:
    print("Could not retrieve 'volatility_pair' data.")

Starting data download...
--> Downloading data for: precious_metals_triple
--> Downloading data for: oil_pair
--> Downloading data for: agri_pair
--> Downloading data for: yield_pair
--> Downloading data for: currency_pair
--> Downloading data for: volatility_pair
--> Downloading data for: eu_index_pair_1
--> Downloading data for: eu_index_pair_2
--> Downloading data for: fr_banking_pair
--> Downloading data for: fast_fashion_pair
--> Downloading data for: german_auto_triple
--> Downloading data for: investor_ab_pair
--> Downloading data for: vw_porsche_pair
--> Downloading data for: semiconductor_pair
--> Downloading data for: sector_etf_pair

Data download complete.

--- Verification ---
Successfully downloaded data for 15 groups.
The following data groups are now available:
- agri_pair
- currency_pair
- eu_index_pair_1
- eu_index_pair_2
- fast_fashion_pair
- fr_banking_pair
- german_auto_triple
- investor_ab_pair
- oil_pair
- precious_metals_triple
- sector_etf_pair
- semiconductor_pair
- volatility_pair
- vw_porsche_pair
- yield_pair

Sample Data for 'volatility_pair':
Ticker             VIXY       ^VIX
Date                              
2020-07-13  2314.399902  32.189999
2020-07-14  2176.800049  29.520000
2020-07-15  2119.199951  27.760000
2020-07-16  2080.800049  28.000000
2020-07-17  2006.400024  25.680000





A time series is weakly stationary, or integrated of order zero (I(0)), if its statistical properties (specifically its mean, variance, and autocovariance) are invariant with respect to time. Most financial price series do not have this property: they are typically non-stationary and contain a unit root, meaning they are integrated of order one (I(1)). This matters because standard regression techniques break down on I(1) series. Regressing one I(1) series on another can produce a "spurious regression", in which a high R-squared and statistically significant coefficients appear even when no genuine economic relationship exists between the variables. Each series must therefore be formally tested for stationarity before any regression is run.
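The spurious-regression problem can be illustrated with a small simulation. The sketch below is not part of the project pipeline: it draws pairs of independent series, regresses one on the other by OLS, and counts how often the slope looks significant at the nominal 5% level. The helper names (`slope_t_stat`, `rejection_rate`) and the simulation sizes are assumptions chosen for illustration.

```python
import numpy as np

def slope_t_stat(x, y):
    """OLS t-statistic for the slope b in y = a + b*x + e."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - 2)      # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)      # classical OLS covariance
    return coef[1] / np.sqrt(cov[1, 1])

def rejection_rate(integrated, n_sims=500, n_obs=250, seed=0):
    """Share of simulations with |t| > 1.96 for two independent series."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        x = rng.standard_normal(n_obs)
        y = rng.standard_normal(n_obs)
        if integrated:
            # Cumulative sums of white noise are random walks, i.e. I(1)
            x, y = np.cumsum(x), np.cumsum(y)
        hits += int(abs(slope_t_stat(x, y)) > 1.96)
    return hits / n_sims

rate_i1 = rejection_rate(integrated=True)    # independent random walks
rate_i0 = rejection_rate(integrated=False)   # independent white noise
print(f"Share of 'significant' slopes, I(1) series: {rate_i1:.2f}")
print(f"Share of 'significant' slopes, I(0) series: {rate_i0:.2f}")
```

For stationary inputs the share should sit near the nominal 5%, while for independent random walks it is far higher, even though the two series are unrelated by construction.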

The augmented Dickey–Fuller specification is:

$$
\Delta y_t = \alpha + \beta\,t + \gamma\,y_{t-1}
        + \sum_{i=1}^{p} \delta_i\,\Delta y_{t-i} + \varepsilon_t
$$

The hypotheses are:

- Null hypothesis: $H_0: \gamma = 0$
  (implying a unit root; the series is non-stationary)

- Alternative hypothesis: $H_1: \gamma < 0$
  (implying stationarity)
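As a rough illustration of the regression above, the following sketch estimates it by OLS with NumPy and returns the t-statistic on $\gamma$. It is a teaching aid under stated assumptions, not the project's test procedure: the function name `adf_t_stat`, the lag order `p=1`, and the simulated inputs are all hypothetical, and the value -3.41 is the approximate asymptotic 5% Dickey–Fuller critical value for the constant-plus-trend case. In practice a library routine such as `statsmodels.tsa.stattools.adfuller` with `regression="ct"` would also supply p-values and lag selection.

```python
import numpy as np

def adf_t_stat(y, p=1):
    """t-statistic on gamma in the ADF regression with constant and trend."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                        # dy[k] = y[k+1] - y[k]
    n = len(dy) - p                        # observations left after lagging
    target = dy[p:]                        # dependent variable: Delta y_t
    cols = [np.ones(n),                    # alpha (constant)
            np.arange(n, dtype=float),     # beta * t (linear trend)
            y[p:-1]]                       # gamma * y_{t-1}
    for i in range(1, p + 1):
        cols.append(dy[p - i:len(dy) - i]) # delta_i * Delta y_{t-i}
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    sigma2 = resid @ resid / (n - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return coef[2] / np.sqrt(cov[2, 2])

# Simulated inputs (assumptions for illustration only)
rng = np.random.default_rng(42)
eps = rng.standard_normal(500)
random_walk = np.cumsum(eps)               # I(1) by construction
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.5 * ar1[t - 1] + eps[t]     # stationary AR(1)

CRIT_5PCT = -3.41                          # approximate asymptotic value
stat_rw = adf_t_stat(random_walk)
stat_ar = adf_t_stat(ar1)
for name, stat in [("random walk", stat_rw), ("AR(1)", stat_ar)]:
    verdict = "reject H0" if stat < CRIT_5PCT else "fail to reject H0"
    print(f"{name}: t = {stat:.2f} -> {verdict} (unit root)")
```

The test is one-sided: only a sufficiently negative t-statistic counts as evidence against the unit root, which is why the comparison uses `<` rather than an absolute value.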


The Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test statistic for trend‐stationarity is given by

$$
\mathrm{KPSS} \;=\;
\frac{1}{T^2} \sum_{t=1}^T S_t^2 \;\bigg/\; \widehat{\sigma}^2
$$

where

$S_t = \sum_{i=1}^t \widehat{u}_i$

$\widehat{u}_i = y_i - \widehat{\beta}_0 - \widehat{\beta}_1\,i$
are the residuals from the OLS regression of $y_t$ on an intercept and time trend.  
$\widehat{\sigma}^2$ is a consistent estimate of the long‐run variance of $\widehat{u}_t$, often computed via a Newey–West estimator:
  $$
  \widehat{\sigma}^2
  = \frac{1}{T}\sum_{t=1}^T \widehat{u}_t^2
    \;+\; 2 \sum_{\ell=1}^L w\bigl(\ell,L\bigr)\,
    \frac{1}{T}\sum_{t=\ell+1}^T \widehat{u}_t\,\widehat{u}_{t-\ell},
  $$
  with Bartlett weights $w(\ell,L)=1-\ell/(L+1)$.
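The statistic and its long-run variance can be computed directly from these formulas. The sketch below follows them term by term using NumPy; the function name `kpss_trend_stat`, the default lag rule, and the simulated series are assumptions made for illustration. A library routine such as `statsmodels.tsa.stattools.kpss` with `regression="ct"` would normally be used instead, since it also reports critical values.

```python
import numpy as np

def kpss_trend_stat(y, n_lags=None):
    """KPSS statistic for trend-stationarity with a Bartlett-weighted variance."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    trend = np.arange(1, T + 1, dtype=float)
    X = np.column_stack([np.ones(T), trend])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u = y - X @ beta                       # residuals u_hat
    S = np.cumsum(u)                       # partial sums S_t
    if n_lags is None:
        # Rule-of-thumb truncation lag; an assumption, not from the text
        n_lags = int(4 * (T / 100) ** 0.25)
    sigma2 = u @ u / T                     # lag-0 term of the long-run variance
    for lag in range(1, n_lags + 1):
        w = 1.0 - lag / (n_lags + 1)       # Bartlett weight w(l, L)
        sigma2 += 2.0 * w * (u[lag:] @ u[:-lag]) / T
    return (S @ S) / (T ** 2 * sigma2)

# Simulated inputs (assumptions for illustration only)
rng = np.random.default_rng(7)
T = 500
noise = rng.standard_normal(T)
trend_stationary = 0.05 * np.arange(T) + noise   # deterministic trend + noise
random_walk = np.cumsum(rng.standard_normal(T))  # unit-root process

stat_ts = kpss_trend_stat(trend_stationary)
stat_rw = kpss_trend_stat(random_walk)
print(f"KPSS, trend-stationary series: {stat_ts:.3f}")
print(f"KPSS, random walk:             {stat_rw:.3f}")
```

For reference, the commonly tabulated 5% critical value for the trend case is about 0.146, so statistics well above that level count against the null.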

The hypotheses reverse those of the ADF:

- **Null hypothesis** (stationarity around a deterministic trend):  
  $$H_0:\; \{y_t\}\text{ is trend‐stationary}$$

- **Alternative hypothesis** (presence of a unit root):  
  $$H_1:\; \{y_t\}\text{ has a unit root (non‐stationary)}$$

**Interpretation:**  
- A large KPSS statistic leads to rejection of $H_0$, suggesting non‐stationarity.  
- Used alongside the ADF:  
  - **Fail to reject ADF null** (evidence of unit root) **and** **reject KPSS null** (evidence against stationarity) ⇒ strong confirmation that $y_t$ is $I(1)$.
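The joint reading of the two tests can be written as a small decision rule. Only the first branch (ADF fails to reject, KPSS rejects) comes from the text above; the other three branches follow a commonly used convention and are assumptions here, as is the function name `classify_integration`.

```python
def classify_integration(adf_rejects_unit_root: bool,
                         kpss_rejects_stationarity: bool) -> str:
    """Combine ADF and KPSS outcomes into a single verdict."""
    if not adf_rejects_unit_root and kpss_rejects_stationarity:
        # The case described in the text: both tests point to a unit root
        return "I(1): unit root supported by both tests"
    if adf_rejects_unit_root and not kpss_rejects_stationarity:
        # Assumed convention: both tests point to stationarity
        return "I(0): stationarity supported by both tests"
    if adf_rejects_unit_root and kpss_rejects_stationarity:
        # Assumed convention: the tests disagree, so inspect the series further
        return "conflicting: both nulls rejected"
    # Assumed convention: neither test has enough evidence to reject
    return "inconclusive: neither null rejected"

# Example: ADF p-value above 5% and KPSS statistic above its critical value
print(classify_integration(adf_rejects_unit_root=False,
                           kpss_rejects_stationarity=True))
```

Keeping the verdict as a plain string makes it easy to store one result per ticker when the rule is applied across all asset groups.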
