# Validate features.npy / prices.npy (Alpha Vantage sentiment)

This notebook performs quick sanity checks:

- file existence
- shapes & dtypes
- NaN/Inf checks
- basic distribution stats (incl. sentiment columns)
- per-ticker row counts from `build_debug.csv`


In [1]:
import os
import numpy as np
import pandas as pd

DATA_DIR = "data"  # adjust if your artifacts are elsewhere
FEATURES_PATH = os.path.join(DATA_DIR, "features.npy")
PRICES_PATH   = os.path.join(DATA_DIR, "prices.npy")
DEBUG_PATH    = os.path.join(DATA_DIR, "build_debug.csv")

print("FEATURES_PATH:", FEATURES_PATH, "exists:", os.path.exists(FEATURES_PATH))
print("PRICES_PATH  :", PRICES_PATH,   "exists:", os.path.exists(PRICES_PATH))
print("DEBUG_PATH   :", DEBUG_PATH,    "exists:", os.path.exists(DEBUG_PATH))


FEATURES_PATH: data/features.npy exists: True
PRICES_PATH  : data/prices.npy exists: True
DEBUG_PATH   : data/build_debug.csv exists: True
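
The cell above only prints whether each artifact exists, so later cells would still run (and fail at load time) if one were missing. A minimal fail-fast sketch follows; `missing_files` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path


def missing_files(paths):
    """Return the subset of `paths` that do not point to a regular file."""
    return [str(p) for p in paths if not Path(p).is_file()]


# Example with a path that should not exist on any machine.
print(missing_files(["no_such_dir/features.npy"]))
```

In the notebook this could be used as `assert not missing_files([FEATURES_PATH, PRICES_PATH, DEBUG_PATH])` before loading anything.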


In [2]:
features = np.load(FEATURES_PATH)
prices   = np.load(PRICES_PATH)
dbg      = pd.read_csv(DEBUG_PATH)

print("features shape:", features.shape, "dtype:", features.dtype)
print("prices   shape:", prices.shape,   "dtype:", prices.dtype)
print("\n=== build_debug.csv ===")
display(dbg)


features shape: (7450, 12) dtype: float32
prices   shape: (7450,) dtype: float32

=== build_debug.csv ===


Unnamed: 0,ticker,rows_raw,rows_features,feature_dim,first_date,last_date
0,AAPL,1509,1490,12,2018-01-30 00:00:00,2023-12-29 00:00:00
1,MSFT,1509,1490,12,2018-01-30 00:00:00,2023-12-29 00:00:00
2,NVDA,1509,1490,12,2018-01-30 00:00:00,2023-12-29 00:00:00
3,AMZN,1509,1490,12,2018-01-30 00:00:00,2023-12-29 00:00:00
4,GOOGL,1509,1490,12,2018-01-30 00:00:00,2023-12-29 00:00:00


In [3]:
# Hard sanity checks
assert features.ndim == 2, "features.npy must be 2D (N, D)"
assert prices.ndim == 1, "prices.npy must be 1D (N,)"
assert features.shape[0] == prices.shape[0], "Row mismatch between features and prices"
assert np.isfinite(features).all(), "features contains NaN/Inf"
assert np.isfinite(prices).all(), "prices contains NaN/Inf"
print("✅ Basic shape and finite checks passed.")


✅ Basic shape and finite checks passed.
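
When the finite check fails, the assertion does not say where the bad values are. The sketch below counts NaN/Inf entries per column so the offending feature can be identified; `nonfinite_report` is a hypothetical helper introduced here for illustration.

```python
import numpy as np


def nonfinite_report(arr):
    """Map column index -> number of NaN/Inf entries, only for columns that have any."""
    a = np.asarray(arr, dtype=np.float64)
    if a.ndim == 1:
        # Treat a 1D array (e.g. prices) as a single column.
        a = a.reshape(-1, 1)
    counts = (~np.isfinite(a)).sum(axis=0)
    return {int(j): int(c) for j, c in enumerate(counts) if c > 0}


demo = np.array([[1.0, np.nan], [np.inf, 2.0], [3.0, np.nan]])
print(nonfinite_report(demo))
```

An empty dict means every value is finite, which is the same condition the assertions above enforce.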


In [4]:
# Feature-level stats
D = features.shape[1]
stats = pd.DataFrame({
    "col": list(range(D)),
    "min": features.min(axis=0),
    "p01": np.quantile(features, 0.01, axis=0),
    "p50": np.quantile(features, 0.50, axis=0),
    "p99": np.quantile(features, 0.99, axis=0),
    "max": features.max(axis=0),
    "mean": features.mean(axis=0),
    "std": features.std(axis=0),
})
display(stats)


Unnamed: 0,col,min,p01,p50,p99,max,mean,std
0,0,3.281395,3.840408,98.29604,327.8353,371.5483,108.3724,77.6812
1,1,3.430977,3.825697,97.84822,327.3255,369.0352,107.9406,77.3671
2,2,3.180342,18.31196,55.6171,88.72548,97.41521,54.95644,16.24138
3,3,-3.661608,-2.280956,0.01203543,2.240789,3.392967,0.003215497,0.7662531
4,4,-0.3388267,-0.1088992,0.6422905,1.159621,1.397194,0.5839841,0.3191408
5,5,0.1106408,0.1325248,3.208072,13.97635,17.00283,4.200731,3.55721
6,6,-0.3231209,-0.1692586,0.01401317,0.1912656,0.3853832,0.01154363,0.06767584
7,7,-1839840000.0,-813472200.0,2244076000.0,49895140000.0,53800510000.0,6770009000.0,11814800000.0
8,8,5.719123,19.28123,53.75063,84.46156,96.77747,53.19068,15.36385
9,9,-290.6548,-172.6507,-56.94516,-1.848563,-0.0,-61.63253,39.77143


## Sentiment columns

Assumption: you appended 2 Alpha Vantage features at the end:

- `sentiment` (weighted ticker sentiment)
- `sentiment_mass` (sum of relevance scores for that day)

So the code below uses:
- `sentiment_col = D - 2`
- `mass_col = D - 1`

If you inserted them elsewhere, change these two indices in the next cell.
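
One way to keep the "last two columns" convention explicit is to derive the indices from a tuple of names. This is only a sketch; `trailing_column_indices` is a hypothetical helper, and the column names are taken from the list above.

```python
def trailing_column_indices(n_cols, names=("sentiment", "sentiment_mass")):
    """Map each name to its index, assuming the names occupy the last columns in order."""
    if n_cols < len(names):
        raise ValueError(f"need at least {len(names)} columns, got {n_cols}")
    offset = n_cols - len(names)
    return {name: offset + i for i, name in enumerate(names)}


# With D = 12 feature columns, the two sentiment features are columns 10 and 11.
print(trailing_column_indices(12))
```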


In [5]:
sentiment_col = D - 2
mass_col = D - 1

sent = features[:, sentiment_col]
mass = features[:, mass_col]

print("sentiment col index:", sentiment_col)
print("mass col index     :", mass_col)

print("\nSentiment stats:")
print(pd.Series(sent).describe())

print("\nMass stats:")
print(pd.Series(mass).describe())

# quick check: are they mostly zeros?
# NOTE: despite the "%" label, the printed values are fractions in [0, 1], not percentages.
print("\n% zeros (sentiment):", (sent == 0).mean())
print("% zeros (mass)     :", (mass == 0).mean())


sentiment col index: 10
mass col index     : 11

Sentiment stats:
count    7450.000000
mean        0.064321
std         0.149141
min        -0.790291
25%         0.000000
50%         0.000000
75%         0.114589
max         0.936708
dtype: float64

Mass stats:
count    7450.000000
mean        0.419530
std         0.743199
min         0.000000
25%         0.000000
50%         0.000000
75%         0.645141
max         8.950099
dtype: float64

% zeros (sentiment): 0.6507382550335571
% zeros (mass)     : 0.6507382550335571
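
The two zero fractions above are identical, which suggests (but does not prove) that sentiment and mass are zero on the same rows, presumably days without news coverage. The sketch below lists rows where exactly one of the two is zero; `zero_mismatch_rows` is a hypothetical helper. A mismatch is not necessarily an error, since a day with articles could in principle have a net sentiment of exactly zero.

```python
import numpy as np


def zero_mismatch_rows(sent, mass):
    """Return row indices where exactly one of sentiment/mass equals zero."""
    sent = np.asarray(sent)
    mass = np.asarray(mass)
    if sent.shape != mass.shape:
        raise ValueError("sent and mass must have the same shape")
    return np.flatnonzero((sent == 0) != (mass == 0))


# Rows 2 and 3 are inconsistent: one value is zero while the other is not.
print(zero_mismatch_rows([0.0, 0.2, 0.0, 0.1], [0.0, 0.5, 0.3, 0.0]))
```

In the notebook, `zero_mismatch_rows(sent, mass).size == 0` would confirm that the zeros coincide row by row.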


In [6]:
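# Simple relationship checks (not a trading claim)
# One-row return proxy from prices in the concatenated dataset:
# ret1[t] is the relative change from row t-1 to row t, so it is paired with the
# sentiment of row t (not a look-ahead return unless prices.npy is already shifted).
# NOTE: since the dataset is stacked across tickers, this return series crosses ticker
# boundaries, producing one meaningless return at the first row of each later ticker.
# It's only a crude sanity check that values vary; don't use it for evaluation.

# Row 0 has no predecessor, so it stays NaN and is dropped below.
ret1 = np.full(prices.shape, np.nan, dtype=np.float32)
ret1[1:] = (prices[1:] - prices[:-1]) / (prices[:-1] + 1e-9)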

tmp = pd.DataFrame({
    "sentiment": sent,
    "mass": mass,
    "ret1": ret1,
}).dropna()

print("Corr(sentiment, ret1):", tmp["sentiment"].corr(tmp["ret1"]))
print("Corr(mass, ret1)     :", tmp["mass"].corr(tmp["ret1"]))


Corr(sentiment, ret1): 0.03609971665861143
Corr(mass, ret1)     : 0.003534417872037365
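
The returns above cross ticker boundaries. If the per-ticker row counts are known, the first row of each block can be masked so that no return spans two tickers. This is a sketch under one assumption that the notebook does not verify: the blocks in `prices.npy` are stacked in the same order, and with the same lengths, as the rows of `build_debug.csv`. `segment_returns` is a hypothetical helper.

```python
import numpy as np


def segment_returns(prices, lengths):
    """One-row returns within consecutive blocks; the first row of each block is NaN."""
    prices = np.asarray(prices, dtype=np.float64)
    lengths = np.asarray(lengths, dtype=np.int64)
    n = prices.shape[0]
    if lengths.sum() != n:
        raise ValueError("block lengths must sum to the number of prices")
    ret = np.full(n, np.nan)
    ret[1:] = prices[1:] / prices[:-1] - 1.0
    # Start offset of every block; drop offsets that fall past the end (empty blocks).
    starts = np.concatenate(([0], np.cumsum(lengths)[:-1])).astype(np.int64)
    ret[starts[starts < n]] = np.nan
    return ret


# Two blocks of 3 and 2 rows: the jump from 12.1 to 100.0 is not reported as a return.
r = segment_returns([10.0, 11.0, 12.1, 100.0, 110.0], [3, 2])
print(r)
```

In the notebook the call would be `segment_returns(prices, dbg["rows_features"].to_numpy())`, still only as a sanity check rather than an evaluation.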


## Per-ticker row counts

`build_debug.csv` records how many feature rows each ticker contributed.
`features.npy` itself carries no ticker labels (the per-ticker blocks are simply concatenated),
so the check below only compares totals (here, 1490 rows per ticker × 5 tickers = 7450).
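
If the blocks were stacked in the same order as the rows of `build_debug.csv` (an assumption, since the array itself stores no labels), row ranges per ticker follow from a cumulative sum of the counts. A minimal sketch with a hypothetical helper `ticker_slices`:

```python
import numpy as np


def ticker_slices(tickers, row_counts):
    """Map each ticker to the slice of rows it would occupy in the stacked array."""
    counts = np.asarray(row_counts, dtype=np.int64)
    ends = np.cumsum(counts)
    starts = ends - counts
    return {t: slice(int(s), int(e)) for t, s, e in zip(tickers, starts, ends)}


slices = ticker_slices(["AAPL", "MSFT", "NVDA"], [1490, 1490, 1490])
print(slices["MSFT"])
```

With the notebook's variables, `ticker_slices(dbg["ticker"], dbg["rows_features"])` would give slices such as `features[slices["AAPL"]]`, valid only if the ordering assumption holds.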


In [7]:
dbg["rows_features"].sum(), features.shape[0]


(np.int64(7450), 7450)

In [8]:
print("Per-ticker rows:")
display(dbg[["ticker","rows_features","feature_dim","first_date","last_date"]])


Per-ticker rows:


Unnamed: 0,ticker,rows_features,feature_dim,first_date,last_date
0,AAPL,1490,12,2018-01-30 00:00:00,2023-12-29 00:00:00
1,MSFT,1490,12,2018-01-30 00:00:00,2023-12-29 00:00:00
2,NVDA,1490,12,2018-01-30 00:00:00,2023-12-29 00:00:00
3,AMZN,1490,12,2018-01-30 00:00:00,2023-12-29 00:00:00
4,GOOGL,1490,12,2018-01-30 00:00:00,2023-12-29 00:00:00
