In [None]:
# from google.colab import drive
# drive.flush_and_unmount()           # ignore errors if already unmounted

#If cannot remount, simply delete the mounted drive and then remount
# rm -rf /content/drive


In [1]:
# Colab cell
from google.colab import drive

drive.mount('/content/drive', force_remount=True)



Mounted at /content/drive


In [2]:
# Adjust these two for YOUR repo
REPO_OWNER = "ywanglab"
REPO_NAME  = "STAT4160"   # e.g., unified-stocks-team1
BASE_DIR   = "/content/drive/MyDrive/dspt25"
CLONE_DIR  = f"{BASE_DIR}/{REPO_NAME}"
REPO_URL   = f"https://github.com/{REPO_OWNER}/{REPO_NAME}.git"

# if on my office computer

# REPO_NAME  = "lectureNotes"   # e.g., on my office computer
# BASE_DIR = r"E:\OneDrive - Auburn University Montgomery\teaching\AUM\STAT 4160 Productivity Tools" # on my office computer
# CLONE_DIR  = f"{BASE_DIR}\{REPO_NAME}"

import os, pathlib
pathlib.Path(BASE_DIR).mkdir(parents=True, exist_ok=True)


In [3]:
import os, subprocess, shutil, pathlib

if not pathlib.Path(CLONE_DIR).exists():
    !git clone {REPO_URL} {CLONE_DIR}
else:
    # If the folder exists, just ensure it's a git repo and pull latest
    os.chdir(CLONE_DIR)
    # !git status
    # !git pull --rebase # !git pull --ff-only
os.chdir(CLONE_DIR)
print("Working dir:", os.getcwd())

Working dir: /content/drive/MyDrive/dspt25/STAT4160


## Session 15 — Framing & Metrics

### Learning goals

By the end of class, students can:

1.  Specify **forecast horizon** $H$, **step** (stride), and choose between **expanding** vs **sliding** rolling‑origin evaluation with an **embargo** gap.
2.  Implement a **date‑based splitter** that yields `(train_idx, val_idx)` for all tickers at once.
3.  Compute **MAE**, **sMAPE**, **MASE** (with a proper **training‑window scale**), and aggregate **per‑ticker** and **across tickers** (macro vs micro/weighted).
4.  Produce a tidy CSV of baseline results to serve as your course’s ground truth.

------------------------------------------------------------------------

## Agenda

-    Forecasting setup: horizon $H$, step, rolling‑origin (expanding vs sliding), embargo
-    Metrics: MAE, sMAPE, MASE; aggregation across tickers (macro vs micro/weighted)
-    **In‑class lab**: implement a date‑based splitter → compute naive & seasonal‑naive baselines → MAE/sMAPE/MASE per split/ticker → save reports
-    Wrap‑up & homework brief
-    Buffer

------------------------------------------------------------------------



### Framing the forecast

-   **Target:** next‑day log return $r_{t+1}$ (you built this as `r_1d`).

-   **Horizon** $H$: 1 business day.

-   **Step (stride):** how far the **origin** moves forward each split (e.g., 63 trading days ≈ a quarter).

-   **Rolling‑origin schemes**

    -   **Expanding:** train start fixed; **train grows** over time.
    -   **Sliding (rolling):** fixed‑length train window **slides** forward.

-   **Embargo:** small **gap** (e.g., 5 days) between train end and validation start to avoid adjacency leakage.
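
The difference between the two schemes is easiest to see on integer positions into a sorted list of unique trading dates. The following is a minimal sketch under that assumption; the helper name `origin_windows` and its parameters are hypothetical and chosen only for illustration.

```python
def origin_windows(n_dates, train_len, val_size, step, embargo, scheme="expanding"):
    """Yield (train_start, train_end, val_start, val_end) positions, all inclusive.

    Hypothetical helper: positions index a sorted array of unique trading
    dates of length n_dates.
    """
    train_end = train_len - 1
    while True:
        val_start = train_end + embargo + 1        # skip `embargo` dates after train
        val_end = val_start + val_size - 1
        if val_end >= n_dates:                     # no room for a full validation window
            break
        if scheme == "expanding":
            train_start = 0                        # start fixed, window grows
        else:
            train_start = train_end - train_len + 1  # fixed length, window slides
        yield train_start, train_end, val_start, val_end
        train_end += step                          # move the origin forward


expanding = list(origin_windows(400, 252, 63, 63, 5, scheme="expanding"))
sliding = list(origin_windows(400, 252, 63, 63, 5, scheme="sliding"))
print(expanding)
print(sliding)
```

With 400 dates both schemes produce two splits with identical validation windows; only the train start of the second split differs (0 for expanding, 63 for sliding).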

### Metrics (scalar, easy to compare)

-   **MAE:** $\frac{1}{n}\sum |y - \hat{y}|$ — robust & interpretable.

-   **sMAPE:** $\frac{2}{n}\sum \frac{|y - \hat{y}|}{(|y| + |\hat{y}| + \epsilon)}$ — scale‑free, safe for near‑zero returns with $\epsilon$.

-   **MASE:** $\text{MASE}=\frac{\text{MAE}_\text{model}}{\text{MAE}_\text{naive (train)}}$ — \<1 means better than naive.

    -   For seasonality $s$, the **naive comparator** predicts $y_{t+1} \approx y_{t+1-s}$ (we’ll use $s=5$ for day‑of‑week seasonality on business days).
    -   **Scale** is computed on the **training window only**, per ticker.
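
The three definitions above can be sketched in a few lines of NumPy. This is an illustrative version, not the lab's final implementation: the function names, the default `eps`, and the toy numbers are assumptions.

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return float(np.mean(np.abs(y - yhat)))

def smape(y, yhat, eps=1e-8):
    """sMAPE on the 0..2 scale, with eps guarding a zero denominator."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    return float(np.mean(2.0 * np.abs(y - yhat) / (np.abs(y) + np.abs(yhat) + eps)))

def mase(y_val, yhat_val, y_train, season=1):
    """Validation MAE divided by the seasonal-naive MAE on the training window."""
    y_train = np.asarray(y_train, dtype=float)
    scale = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return mae(y_val, yhat_val) / float(scale)

# Toy returns for a single ticker (made-up numbers).
y_train = np.array([0.01, -0.02, 0.03, 0.00, -0.01])
y_val = np.array([0.02, -0.01])
yhat_val = np.array([0.00, 0.00])          # "always predict zero" forecast

m_mae = mae(y_val, yhat_val)
m_smape = smape(y_val, yhat_val)
m_mase = mase(y_val, yhat_val, y_train, season=1)
print(m_mae, m_smape, m_mase)
```

Note that a forecast of exactly zero pushes sMAPE to its upper bound of about 2 on this scale, which is worth keeping in mind when reading sMAPE for return forecasts.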

### Aggregation across tickers

-   **Per‑ticker metrics** first → then aggregate.
-   **Macro average:** mean of per‑ticker metrics (each ticker equal weight).
-   **Micro/weighted:** pool all rows (or weight tickers by sample count); for MAE, pooled MAE equals sample‑count weighted average of per‑ticker MAEs.
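
A small check of the macro vs micro distinction, using made-up absolute errors for two tickers with unequal sample counts. The column names `ticker` and `abs_err` are assumptions for this sketch.

```python
import numpy as np
import pandas as pd

# Toy absolute errors: ticker A has three rows, ticker B has one.
df = pd.DataFrame({
    "ticker": ["A", "A", "A", "B"],
    "abs_err": [0.01, 0.02, 0.03, 0.10],
})

# Per-ticker MAE and sample count first, then aggregate.
per_ticker = df.groupby("ticker")["abs_err"].agg(mae="mean", n="size")

macro_mae = per_ticker["mae"].mean()                                    # each ticker counts once
weighted_mae = np.average(per_ticker["mae"], weights=per_ticker["n"])   # weight by sample count
pooled_mae = df["abs_err"].mean()                                       # every row counts once

print(per_ticker)
print(macro_mae, weighted_mae, pooled_mae)
```

Here the macro average is pulled up by the single noisy row of ticker B, while the pooled and sample-count weighted values coincide.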

------------------------------------------------------------------------



## MAPE — *Mean Absolute Percentage Error*

### 📘 Definition

$$
\text{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n}
\left| \frac{y_t - \hat{y}_t}{y_t} \right|
$$

Where:

* $y_t$ = true (actual) value at time $t$
* $\hat{y}_t$ = predicted (forecasted) value
* $n$ = number of observations


* Lower is better.
* Example: MAPE = 5% → on average, predictions are off by 5%.

---

###  Example

| Time | Actual $y_t$ | Forecast $\hat{y}_t$ | Absolute % Error |
| ---- | ------------ | -------------------- | ---------------- |
| 1    | 100          | 105                  | 5%               |
| 2    | 200          | 180                  | 10%              |
| 3    | 150          | 155                  | 3.33%            |

$$
\text{MAPE} = \frac{5 + 10 + 3.33}{3} = 6.11\%
$$
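
The table can be reproduced with a few lines of NumPy. This is a small sketch using the three rows above; the variable names are arbitrary.

```python
import numpy as np

y = np.array([100.0, 200.0, 150.0])       # actual values from the table
yhat = np.array([105.0, 180.0, 155.0])    # forecasts from the table

ape = 100.0 * np.abs((y - yhat) / y)      # per-row absolute percentage error
mape = ape.mean()
print(np.round(ape, 2), round(float(mape), 2))
```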

---

###  Limitations of MAPE

1. **Division by zero problem:**
   If any $y_t = 0$, MAPE is undefined (division by zero).
2. **Asymmetry:**
   Over-predictions and under-predictions are penalized **unequally** in percentage terms.
   (e.g., predicting 50 instead of 100 = 50% error, but predicting 200 instead of 100 = 100% error.)
3. **Biased for small actual values:**
   When actual values are near zero, MAPE can explode to very large values.
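
The near-zero problem matters for this course because daily returns sit close to zero. A minimal sketch with made-up numbers (the helper `ape` is hypothetical) shows the same absolute miss producing very different percentage errors:

```python
def ape(y, yhat):
    """Absolute percentage error for one observation, in percent."""
    return 100.0 * abs((y - yhat) / y)

# An actual return of 0.0001 forecast as 0.01: an absolute miss of 0.0099.
near_zero = ape(0.0001, 0.01)
# The same absolute miss of 0.0099 against a larger actual of 0.05.
larger = ape(0.05, 0.0599)
print(near_zero, larger)
```

The first case comes out near 9900% and the second near 19.8%, even though the absolute error is the same.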

---

## 🧩 sMAPE — *Symmetric Mean Absolute Percentage Error*

To fix MAPE’s asymmetry, people use **sMAPE**, defined as:

$$
\text{sMAPE} =
\frac{100\%}{n} \sum_{t=1}^{n}
\frac{|\hat{y}_t - y_t|}{(|y_t| + |\hat{y}_t|)/2}
$$

or equivalently:

$$
\text{sMAPE} =
\frac{200\%}{n} \sum_{t=1}^{n}
\frac{|\hat{y}_t - y_t|}{|y_t| + |\hat{y}_t|}
$$



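* The denominator uses the *average of the absolute actual and predicted* values,
  so swapping $y_t$ and $\hat{y}_t$ leaves the metric unchanged. For a fixed actual, though, an under‑forecast still scores worse than an over‑forecast of the same size (actual 100: forecast 50 gives 66.7%, forecast 150 gives 40%).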
* Scale is still percentage-based.

 **Interpretation**

* 0% = perfect forecast
* 200% = the maximum under this scaling, reached when the forecast and the actual have opposite signs or exactly one of them is zero

---

### Example (same data)

| Time | Actual $y_t$ | Forecast $\hat{y}_t$ | sMAPE component |
| ---- | ------------ | -------------------- | --------------- |
| 1    | 100          | 105                  | $2 \times \lvert 5 \rvert / (100+105) = 4.88\%$ |
| 2    | 200          | 180                  | $2 \times \lvert 20 \rvert / (200+180) = 10.53\%$ |
| 3    | 150          | 155                  | $2 \times \lvert 5 \rvert / (150+155) = 3.28\%$ |

$$
\text{sMAPE} = \frac{4.88 + 10.53 + 3.28}{3} = 6.23\%
$$
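
The same three rows can be checked numerically. This is a small sketch of the 200% form of the formula, without an $\epsilon$ term because none of the values here are near zero.

```python
import numpy as np

y = np.array([100.0, 200.0, 150.0])       # actual values from the table
yhat = np.array([105.0, 180.0, 155.0])    # forecasts from the table

# Per-row sMAPE component in percent (200% scaling).
terms = 200.0 * np.abs(yhat - y) / (np.abs(y) + np.abs(yhat))
smape_pct = terms.mean()
print(np.round(terms, 2), round(float(smape_pct), 2))
```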



---

## TL;DR

| Metric    | Meaning                        | Key Idea                                                                                   |
| --------- | ------------------------------ | ------------------------------------------------------------------------------------------ |
| **MAPE**  | Mean Absolute Percentage Error | Average % deviation from actual values; undefined when any actual is 0                     |
| **sMAPE** | Symmetric MAPE                 | Same idea with a symmetric denominator; undefined only when actual and forecast are both 0 |

**Tip:** Use sMAPE for production forecasting metrics: it is bounded (0% to 200%) and numerically safer than MAPE, especially with a small $\epsilon$ in the denominator.

---




## In‑class lab (35 min, Colab‑friendly)



```python
np.r_[np.nan, np.diff(np.log(adj))]
```

### In words:

> “Take the **logarithm** of the array `adj`, compute its **first differences**, and then **prepend a NaN** so that the output has the same length as the original series.”

Assume you have:

```python
import numpy as np
adj = np.array([100, 105, 102, 110], dtype=float)
```


```python
np.log(adj)
# → [4.60517, 4.65396, 4.62497, 4.70048]
```

This is common in finance. We often work with **log prices** because:

* differences of logs are **log returns**, which approximate simple returns when moves are small,
* and logs make multiplicative changes additive.

---

###  `np.diff(np.log(adj))`

Computes the **difference between consecutive elements**.

```python
np.diff(np.log(adj))
# → [0.04879, -0.02899, 0.07551]
```

Mathematically:

$$
\text{diff}_t = \log(\text{adj}_t) - \log(\text{adj}_{t-1})
$$

which equals:

$$
\log\left(\frac{\text{adj}_t}{\text{adj}_{t-1}}\right)
$$

→ this is the **log return** between days $t-1$ and $t$.

---

###  `np.r_[np.nan, ...]`

`np.r_[]` concatenates arrays along the first axis.

Here, you’re prepending a single `np.nan` (Not-a-Number) value before the differences:

```python
np.r_[np.nan, np.diff(np.log(adj))]
# → [nan, 0.04879, -0.02899, 0.07551]
```

This aligns the array with your original data length.
Since there’s no previous day to compute a return for the **first** element, it’s set to `NaN`.



##  Equivalent longer version

```python
log_prices = np.log(adj)
diffs = np.diff(log_prices)
log_returns = np.insert(diffs, 0, np.nan)
```

`np.r_[]` is just a compact one-liner alternative.
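
In a long multi-ticker frame the same calculation has to be done per ticker, so that the first row of each ticker becomes `NaN` instead of a difference taken across two tickers. A minimal pandas sketch, assuming columns named `ticker` and `adj_close` and made-up prices:

```python
import numpy as np
import pandas as pd

# Toy prices for two tickers, already sorted by ticker and date.
prices = pd.DataFrame({
    "ticker": ["A", "A", "A", "B", "B", "B"],
    "adj_close": [100.0, 105.0, 102.0, 50.0, 55.0, 44.0],
})

# Log price, then first difference within each ticker.
log_price = np.log(prices["adj_close"])
prices["log_return"] = log_price.groupby(prices["ticker"]).diff()
print(prices)
```

The grouped `diff` leaves the first row of each ticker as `NaN`, which plays the same role as the prepended `np.nan` in the single-series version.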



In [4]:
import os, pathlib, numpy as np, pandas as pd
from pathlib import Path

# Load returns or create a tiny fallback
rpath = Path("data/processed/returns.parquet")
if rpath.exists():
    returns = pd.read_parquet(rpath)
else:
    # Fallback synthetic returns for 5 tickers, 320 business days
    rng = np.random.default_rng(0)
    dates = pd.bdate_range("2022-01-03", periods=320)
    frames=[]
    for tkr in ["AAPL","MSFT","GOOGL","AMZN","NVDA"]:
        eps = rng.normal(0, 0.012, size=len(dates)).astype("float32")
        adj = 100*np.exp(np.cumsum(eps))
        df = pd.DataFrame({
            "date": dates,
            "ticker": tkr,
            "adj_close": adj.astype("float32"),
            "log_return": np.r_[np.nan, np.diff(np.log(adj))].astype("float32")
        })
        df["r_1d"] = df["log_return"].shift(-1)
        df["weekday"] = df["date"].dt.weekday.astype("int8")
        df["month"]   = df["date"].dt.month.astype("int8")
        frames.append(df)
    returns = pd.concat(frames, ignore_index=True).dropna().reset_index(drop=True)
    returns["ticker"] = returns["ticker"].astype("category")
    rpath.parent.mkdir(parents=True, exist_ok=True)  # make sure data/processed exists before writing
    returns.to_parquet(rpath, index=False)

# Standardize
returns["date"] = pd.to_datetime(returns["date"])
returns = returns.sort_values(["ticker","date"]).reset_index(drop=True)
returns["ticker"] = returns["ticker"].astype("category")
returns.head()


Unnamed: 0,date,ticker,log_return,r_1d,weekday,month
0,2020-01-01,AAPL,,0.002987,2,1
1,2020-01-02,AAPL,0.002987,-0.002741,3,1
2,2020-01-03,AAPL,-0.002741,-0.008906,4,1
3,2020-01-06,AAPL,-0.008906,-0.004547,0,1
4,2020-01-07,AAPL,-0.004547,-0.009916,1,1


In [17]:
len(returns["date"].unique() )

180

### 1) Rolling‑origin date splitter (expanding windows + embargo)

In [18]:
import numpy as np, pandas as pd

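def make_rolling_origin_splits(dates: pd.Series,
                               train_min=252,   # ~1y of trading days
                               val_size=63,     # ~1 quarter
                               step=63,
                               embargo=5):
    """Return a list of (train_start, train_end, val_start, val_end) date tuples.

    Expanding scheme: train always starts at the first date; `embargo` unique
    dates are skipped between train end and validation start.
    """
    u = np.array(sorted(pd.to_datetime(dates.unique())))
    n = len(u)
    splits=[]
    i = train_min - 1                     # position of the first train end
    while i < n:
        tr_start, tr_end = u[0], u[i]
        vs_idx = i + embargo + 1          # first validation position after the gap
        ve_idx = vs_idx + val_size - 1
        if ve_idx >= n: break             # not enough dates left for a full validation window
        splits.append((tr_start, tr_end, u[vs_idx], u[ve_idx]))
        i += step                         # advance the origin
    return splits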

def splits_to_indices(df, split):
    """Map a date split to index arrays for the full multi-ticker frame."""
    a,b,c,d = split
    tr_idx = df.index[(df["date"]>=a) & (df["date"]<=b)].to_numpy()
    va_idx = df.index[(df["date"]>=c) & (df["date"]<=d)].to_numpy()
    # sanity: embargo => last train date < first val date
    assert b < c
    return tr_idx, va_idx

splits = make_rolling_origin_splits(returns["date"], train_min= 100, val_size=63, step=63, embargo=5)
len(splits), splits[:2]

(1,
 [(Timestamp('2020-01-01 00:00:00'),
   Timestamp('2020-05-19 00:00:00'),
   Timestamp('2020-05-27 00:00:00'),
   Timestamp('2020-08-21 00:00:00'))])