# Price Feature Engineering & Target Construction

In this notebook, we engineer additional price-based features from the S&P 500 daily closing prices and define our prediction targets. These features help capture recent price trends, volatility, and momentum, while our target variable identifies actionable trading signals for classification.


## 1. Load and Align Data

In [47]:
import pandas as pd

# Load processed sentiment data (from previous notebook)
sentiment_df = pd.read_csv("../data/daily_sentiment.csv", parse_dates=['Date'])

# Load original price/headline data to get full price history
raw_df = pd.read_csv("../data/sp500_news.csv", parse_dates=['Date'])

# Only keep Date and CP (closing price), drop duplicates
price_df = raw_df[['Date', 'CP']].drop_duplicates().sort_values('Date').reset_index(drop=True)

# Merge sentiment with price
data = pd.merge(price_df, sentiment_df, on="Date", how="left")
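
Because the merge is a left join on `Date`, any price date without a sentiment row ends up with missing sentiment values. The sketch below shows one way to check this on toy data; the `sentiment` column name and all values are assumptions for illustration, not the contents of the real files.

```python
import pandas as pd

# Toy stand-ins for price_df and sentiment_df (values are illustrative only)
price_df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
    "CP": [3257.85, 3234.85, 3246.28],
})
sentiment_df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-02", "2020-01-06"]),
    "sentiment": [0.12, -0.05],  # hypothetical column name
})

# validate raises MergeError if Date is duplicated on either side;
# indicator adds a _merge column marking rows found only in price_df
merged = pd.merge(
    price_df, sentiment_df, on="Date", how="left",
    validate="one_to_one", indicator=True,
)

missing_sentiment = int((merged["_merge"] == "left_only").sum())
print(f"Price dates without sentiment: {missing_sentiment} of {len(merged)}")
```

If `validate="one_to_one"` fails on the real data, some date probably carries more than one distinct `CP` value, which `drop_duplicates()` on `['Date', 'CP']` would not collapse.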

## 2. Price-Based Features

We compute percentage returns and rolling statistics, both widely used in quantitative finance, to capture recent trends and volatility.

In [48]:
# Rolling returns & volatility
data['return_1d'] = data['CP'].pct_change()
data['return_5d'] = data['CP'].pct_change(5)
data['volatility_5d'] = data['return_1d'].rolling(5).std()

# Rolling max/min (momentum proxies)
data['roll_max_5d'] = data['CP'].rolling(5).max()
data['roll_min_5d'] = data['CP'].rolling(5).min()

# Simple lag features
data['cp_lag1'] = data['CP'].shift(1)
data['cp_lag2'] = data['CP'].shift(2)
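
To make the warm-up behavior of these features concrete, here is a small sketch on a toy price series (illustrative values, not S&P 500 data). It checks that `pct_change(5)` matches a manual 5-row comparison and shows where each feature first becomes non-missing.

```python
import numpy as np
import pandas as pd

# Toy closing prices (illustrative values, not S&P 500 data)
cp = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0, 104.0, 102.0], name="CP")

return_1d = cp.pct_change()
return_5d = cp.pct_change(5)
volatility_5d = return_1d.rolling(5).std()

# pct_change(5) is the same as comparing each price with the one 5 rows back
manual_5d = cp / cp.shift(5) - 1
assert np.allclose(return_5d.dropna(), manual_5d.dropna())

# Warm-up rows: the first index at which each feature is non-missing
print(return_1d.first_valid_index())      # 1
print(return_5d.first_valid_index())      # 5
print(volatility_5d.first_valid_index())  # 5
```

With these window sizes, the first five rows of any series lack at least one feature, so a later `dropna()` removes them.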

## 3. Target Variable Construction

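The `target` variable encodes the direction of the forward return over the next `FWD_DAYS` trading days, which is the quantity a model will later learn to predict:
- `1` = upward movement of at least `L_THRESH` (potential long trade)
- `0` = neutral, move stays between the thresholds (no trade)
- `-1` = downward movement to `S_THRESH` or below (potential short trade)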
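
Written out, the labeling rule looks as follows. The symbols are notation introduced here for illustration: $P_t$ is the close `CP` on row $t$, $k$ is `FWD_DAYS`, and $\theta_L$, $\theta_S$ stand for `L_THRESH` and `S_THRESH`.

$$
r_t^{(k)} = \frac{P_{t+k}}{P_t} - 1, \qquad
\text{target}_t =
\begin{cases}
1 & \text{if } r_t^{(k)} \ge \theta_L \\
-1 & \text{if } r_t^{(k)} \le \theta_S \\
0 & \text{otherwise}
\end{cases}
$$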

In [49]:
# Parameters for signal (adjust as needed)
L_THRESH = 0.005   # +0.5% for long trade
S_THRESH = -0.005  # -0.5% for short trade
FWD_DAYS = 3       # Forecast horizon: 3 rows (trading days) ahead

# Compute forward return
data['fwd_return'] = data['CP'].shift(-FWD_DAYS) / data['CP'] - 1

# Define multiclass signal
def signal(row):
    if row['fwd_return'] >= L_THRESH:
        return 1
    elif row['fwd_return'] <= S_THRESH:
        return -1
    else:
        return 0

data['target'] = data.apply(signal, axis=1)
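
The row-wise `apply` above labels rows whose `fwd_return` is missing as `0`, because comparisons against NaN are false; those rows are only removed later by `dropna()`. As an alternative, here is a minimal vectorized sketch using `np.select` that keeps such rows unlabeled. The toy prices are illustrative, and this is one possible formulation rather than the notebook's method.

```python
import numpy as np
import pandas as pd

L_THRESH, S_THRESH, FWD_DAYS = 0.005, -0.005, 3

# Toy closing prices (illustrative values only)
toy = pd.DataFrame({"CP": [100.0, 100.2, 99.0, 101.0, 100.3, 98.0]})
toy["fwd_return"] = toy["CP"].shift(-FWD_DAYS) / toy["CP"] - 1

# np.select evaluates the conditions in order; NaN compares False to both,
# so unlabeled tail rows would fall through to 0 without the mask below
labels = np.select(
    [toy["fwd_return"] >= L_THRESH, toy["fwd_return"] <= S_THRESH],
    [1, -1],
    default=0,
)
# Keep the label only where a forward return exists; other rows become NaN
toy["target"] = pd.Series(labels, index=toy.index).where(toy["fwd_return"].notna())

print(toy)
```

Here the first three rows are labeled `1`, `0`, and `-1`, and the last `FWD_DAYS` rows stay NaN, which makes it explicit that they have no label.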


## 4. Data Cleanup

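We drop rows with missing values. These come from the rolling windows and lags at the start of the series, the forward return at the end, and any price dates that have no matching sentiment row after the left merge.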

In [50]:
# Drop NAs from rolling features and targets
data = data.dropna().reset_index(drop=True)

# Save final modeling set
data.to_csv("../data/features.csv", index=False)
print(f"Saved modeling data: {data.shape}")

Saved modeling data: (1029, 22)
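
Note that the saved file still contains `fwd_return`, which determines `target` exactly, so it must not be used as a model input. The sketch below shows one way a downstream step might separate features from the label; the toy values and the `LEAKY_COLS` name are hypothetical.

```python
import pandas as pd

# Toy frame with the same kinds of columns as the saved modeling set
# (values are illustrative only)
data = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-09", "2020-01-10", "2020-01-13"]),
    "CP": [3274.70, 3265.35, 3288.13],
    "return_1d": [0.0067, -0.0029, 0.0070],
    "fwd_return": [0.0045, 0.0073, 0.0126],
    "target": [0, 1, 1],
})

# Columns that must not be model inputs: the label itself and the
# forward return it was computed from
LEAKY_COLS = ["fwd_return", "target"]

X = data.drop(columns=LEAKY_COLS + ["Date"])
y = data["target"]

print(list(X.columns))
```

Whether to keep `Date` or the raw `CP` level as inputs is a separate modeling choice; only the exclusion of `fwd_return` and `target` is essential here.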


## 5. Quick Exploratory Checks

- Distribution of target classes
- Basic stats for new price features

> These checks show how balanced the classes are and whether the feature ranges look reasonable for modeling.

In [51]:
print(data['target'].value_counts(normalize=True))
print(data[['return_1d', 'volatility_5d', 'target']].describe())

target
 1    0.453839
-1    0.314869
 0    0.231293
Name: proportion, dtype: float64
         return_1d  volatility_5d       target
count  1029.000000    1029.000000  1029.000000
mean      0.000535       0.011259     0.138970
std       0.014362       0.009494     0.866097
min      -0.119841       0.000904    -1.000000
25%      -0.005797       0.006059    -1.000000
50%       0.000812       0.009051     0.000000
75%       0.007539       0.013530     1.000000
max       0.093828       0.094495     1.000000


## 6. Summary & Takeaways

- **Price features and target variables were engineered to enable supervised learning on S&P 500 data.**
    - Returns (`return_1d`, `return_5d`), recent volatility (`volatility_5d`), rolling extremes (`roll_max_5d`, `roll_min_5d`), and lagged closes (`cp_lag1`, `cp_lag2`) were computed as explanatory features.
    - The `target` variable was generated to represent price movement classes:  
        - `1`: Predict upward movement (long signal)  
        - `0`: Neutral (no trade)  
        - `-1`: Predict downward movement (short signal)
- **Class distribution** is moderately imbalanced: about 45% of days are labeled upward (`1`), 31% downward (`-1`), and 23% neutral (`0`).
- **Important:**  
    - These labels are derived from the realized forward return over the next `FWD_DAYS` (3) trading days; they are the direction a model is trained to predict.  
    - Actual trading outcomes (take profit, stop loss, etc.) depend on how trades are executed and exited according to subsequent price action, not on the label itself.

This notebook establishes a solid base for supervised learning by providing clean features and an interpretable target, ready for model training and evaluation in the next step.