# Stock Market Direction Prediction – S&P 500

## Project Overview
The goal of this project is to predict the direction of the S&P 500 index (Up or Down) 5 trading days ahead using machine learning. The horizon is set by `PREDICTION_HORIZON_DAYS` in the data preparation code.

Instead of predicting the exact price, the model predicts whether the market will go up (1) or down (0) based on historical closing prices.

This is a binary classification problem applied to financial time-series data.

In [None]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.impute import SimpleImputer

from xgboost import XGBClassifier

import warnings
warnings.filterwarnings("ignore")

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

## Data Collection

The dataset used in this project is `SP500.csv`.

It contains:
- observation_date: The trading date
- SP500: The closing price of the S&P 500 index

The data is loaded using pandas and prepared for analysis.

In [None]:
from pathlib import Path

# Robust data path: works whether Jupyter is started from repo root or notebooks/
candidate_paths = [Path('data/SP500.csv'), Path('../data/SP500.csv')]
data_path = next((p for p in candidate_paths if p.exists()), None)
if data_path is None:
    raise FileNotFoundError('Could not find SP500.csv in data/ or ../data/')

df = pd.read_csv(data_path)
print('Using data file:', data_path.resolve())

date_col = df.columns[0]
df[date_col] = pd.to_datetime(df[date_col])
df = df.sort_values(date_col).reset_index(drop=True)
if date_col != 'observation_date':
    df = df.rename(columns={date_col: 'observation_date'})

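PREDICTION_HORIZON_DAYS = 5
future_price = df['SP500'].shift(-PREDICTION_HORIZON_DAYS)
# Target is 1 when the close PREDICTION_HORIZON_DAYS rows ahead is higher than today's close.
# Rows without a future price are set to NaN (a comparison with NaN would silently give 0),
# so the dropna() below removes them instead of labeling them "Down".
df['Target'] = (future_price > df['SP500']).astype(float).where(future_price.notna())

# Past returns over 1, 2 and 5 rows
df['ret_1'] = df['SP500'].pct_change(1, fill_method=None)
df['ret_2'] = df['SP500'].pct_change(2, fill_method=None)
df['ret_5'] = df['SP500'].pct_change(5, fill_method=None)
# Moving averages (only used to build the ratio features below)
df['ma_5'] = df['SP500'].rolling(5).mean()
df['ma_10'] = df['SP500'].rolling(10).mean()
df['ma_20'] = df['SP500'].rolling(20).mean()
# Rolling volatility of 1-day returns
df['vol_5'] = df['ret_1'].rolling(5).std()
# 5-row momentum (numerically the same quantity as ret_5)
df['momentum_5'] = df['SP500'] / df['SP500'].shift(5) - 1
# Distance of the current close from each moving average
df['ma_ratio_5'] = df['SP500'] / df['ma_5'] - 1
df['ma_ratio_10'] = df['SP500'] / df['ma_10'] - 1
df['ma_ratio_20'] = df['SP500'] / df['ma_20'] - 1
df['day_of_week'] = df['observation_date'].dt.dayofweek  # Monday = 0

# Drop warm-up rows (incomplete rolling windows) and rows without a label
df = df.dropna().reset_index(drop=True)
df['Target'] = df['Target'].astype(int)
feature_cols = ['ret_1','ret_2','ret_5','vol_5','momentum_5','ma_ratio_5','ma_ratio_10','ma_ratio_20','day_of_week']
X = df[feature_cols]
y = df['Target']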

print('Rows after feature engineering:', len(df))
print('Date range:', df['observation_date'].min().date(), '->', df['observation_date'].max().date())
print('Target mean (Up ratio):', round(y.mean(), 4))
print('Features:', feature_cols)


## Data Preparation

1. The date column is converted to datetime format.
2. The dataset is sorted chronologically.
3. A new target column is created:
   - Target = 1 if the price 5 rows (trading days) later is higher
   - Target = 0 if it is lower or unchanged
4. Rows with missing values are dropped with `dropna()`. The last 5 rows have no price 5 days ahead, so they cannot receive a valid label.
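
The labeling step can be sketched on a toy price series. This is a minimal illustration, not the notebook's exact code: the helper name `make_direction_target` and the toy prices are assumptions made for the example. It shows why rows near the end of the series need explicit handling, since a comparison against a missing future price evaluates to `False` rather than to a missing label.

```python
import numpy as np
import pandas as pd


def make_direction_target(prices: pd.Series, horizon: int) -> pd.Series:
    """Return 1.0 where the price `horizon` rows ahead is higher, 0.0 where it
    is lower or unchanged, and NaN where no future price exists.
    (Hypothetical helper, named for this example only.)"""
    future = prices.shift(-horizon)
    target = (future > prices).astype(float)
    # Comparisons against NaN evaluate to False, so mask those rows explicitly
    return target.where(future.notna())


# Toy closing prices (illustrative values, not real index data)
toy_prices = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0, 101.0, 104.0])
toy_target = make_direction_target(toy_prices, horizon=2)
print(toy_target.tolist())
```

With a horizon of 2, the last two entries come out as NaN, so a later `dropna()` removes them instead of counting them as "Down" days.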

In [None]:
split = int(len(df) * 0.8)

X_train = X.iloc[:split]
X_val   = X.iloc[split:]

y_train = y.iloc[:split]
y_val   = y.iloc[split:]

## Feature Selection

Features (X):
- Engineered technical indicators (returns, volatility, momentum, moving-average ratios, day of week)

Target (y):
- Binary direction (Up = 1, Down = 0)

Only numerical features are used for training.
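
A short sketch of how such indicators are built, and a check that they only look backwards in time. The helper `build_features`, the 3-row window, and the toy prices are assumptions chosen to keep the example small; the notebook itself uses windows of 5, 10 and 20 rows.

```python
import numpy as np
import pandas as pd


def build_features(prices: pd.Series, window: int = 3) -> pd.DataFrame:
    """Build a few backward-looking indicators from a price series.
    (Hypothetical helper for illustration.)"""
    out = pd.DataFrame(index=prices.index)
    # 1-row return: today's close relative to the previous close
    out["ret_1"] = prices.pct_change(1, fill_method=None)
    # Distance of today's close from its rolling mean
    moving_avg = prices.rolling(window).mean()
    out[f"ma_ratio_{window}"] = prices / moving_avg - 1
    # Rolling standard deviation of 1-row returns
    out[f"vol_{window}"] = out["ret_1"].rolling(window).std()
    return out


# Toy closing prices (illustrative values only)
toy_prices = pd.Series([100.0, 102.0, 101.0, 103.0, 106.0, 104.0])
features = build_features(toy_prices)

# Changing the final price must not alter any earlier row's features
altered = toy_prices.copy()
altered.iloc[-1] = 150.0
features_altered = build_features(altered)
no_lookahead = features.iloc[:-1].equals(features_altered.iloc[:-1])

print(features.round(4))
print("Earlier rows unchanged:", no_lookahead)
```

The first rows contain NaN because the rolling windows are not yet full, which is why the notebook drops missing rows after feature engineering.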


In [None]:
model = XGBClassifier(
    n_estimators=300,
    max_depth=3,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="logloss",
    random_state=RANDOM_STATE,
    n_jobs=-1
)

## Train-Test Split

The dataset is split chronologically:

- 80% for training
- 20% for validation

We do not shuffle the data because this is time-series data.
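
The split can be sketched as a small function on a toy frame. The function name `chronological_split` and its optional `gap` argument are assumptions that are not part of the notebook. The gap is one possible refinement: because each label looks 5 rows ahead, the last few training labels depend on prices inside the validation period, and dropping those rows avoids that overlap.

```python
import numpy as np
import pandas as pd


def chronological_split(frame: pd.DataFrame, train_fraction: float = 0.8, gap: int = 0):
    """Split a date-sorted frame into an earlier training part and a later
    validation part, optionally dropping `gap` rows just before the cut.
    (Hypothetical helper for illustration.)"""
    cut = int(len(frame) * train_fraction)
    train_end = max(cut - gap, 0)
    return frame.iloc[:train_end], frame.iloc[cut:]


# Toy frame with 20 business days (illustrative data only)
toy = pd.DataFrame({
    "observation_date": pd.date_range("2024-01-01", periods=20, freq="B"),
    "value": np.arange(20),
})

train_part, val_part = chronological_split(toy, train_fraction=0.8, gap=5)
print(len(train_part), len(val_part))
# Every training date precedes every validation date
print(train_part["observation_date"].max() < val_part["observation_date"].min())
```

With `gap=0` the function reproduces the plain 80/20 cut described above.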

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_p = scaler.fit_transform(X_train)
X_val_p   = scaler.transform(X_val)

print(X_train_p.shape, X_val_p.shape)

## Data Scaling

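StandardScaler standardizes each feature to zero mean and unit variance. It is fit on the training split only and then applied to the validation split, so no validation statistics leak into training.

Scaling matters most for distance- and gradient-based models such as KNN, SVM, and Logistic Regression. Tree-based models such as XGBoost and Random Forest are largely insensitive to it.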
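
A minimal sketch of the fit-on-train, transform-on-validation pattern, using synthetic arrays in place of the notebook's features (the array names and the random data are assumptions for the example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data: the validation set is drawn with a shifted mean on purpose
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
X_val_toy = rng.normal(loc=8.0, scale=2.0, size=(50, 3))

scaler = StandardScaler()
# Mean and standard deviation are learned from the training rows only
X_train_scaled = scaler.fit_transform(X_train_toy)
# The validation rows reuse the training statistics
X_val_scaled = scaler.transform(X_val_toy)

print("Train mean per feature:", X_train_scaled.mean(axis=0).round(6))
print("Val mean per feature:  ", X_val_scaled.mean(axis=0).round(2))
```

The scaled training features have zero mean by construction, while the validation means do not, because they are expressed in units of the training distribution. That is the intended behavior, since the model sees future data the same way at prediction time.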

## Model Selection

The model used is XGBoost Classifier.

Main parameters:
- n_estimators = 300
- max_depth = 3
- learning_rate = 0.05
- subsample = 0.8
- colsample_bytree = 0.8

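XGBoost is chosen because gradient-boosted trees perform well on tabular data and can capture non-linear interactions between features. The shallow trees (`max_depth = 3`), low learning rate, and row/column subsampling all limit overfitting on noisy data.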

In [None]:
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=3,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric="logloss"
)

model.fit(
    X_train_p,
    y_train,
    eval_set=[(X_val_p, y_val)],
    verbose=False
)

## Model Training

The model is trained using the training dataset.

The validation set is passed as `eval_set`, so XGBoost records the validation log loss after each boosting round. It is not used to fit the trees.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

log_model = LogisticRegression(max_iter=2000, random_state=42)

# Train
log_model.fit(X_train_p, y_train)

# Predict
y_pred_log = log_model.predict(X_val_p)

# Accuracy
acc_log = accuracy_score(y_val, y_pred_log)
print(f"Logistic Regression Accuracy: {acc_log:.1%}")  # rounds instead of truncating

# ROC-AUC
roc_log = roc_auc_score(y_val, log_model.predict_proba(X_val_p)[:,1])
print("Logistic Regression ROC-AUC:", roc_log)


In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_model = KNeighborsClassifier(n_neighbors=5)

# Train
knn_model.fit(X_train_p, y_train)

# Predict
y_pred_knn = knn_model.predict(X_val_p)

# Accuracy
acc_knn = accuracy_score(y_val, y_pred_knn)
print(f"KNN Accuracy: {acc_knn:.1%}")  # rounds instead of truncating

# ROC-AUC
roc_knn = roc_auc_score(y_val, knn_model.predict_proba(X_val_p)[:,1])
print("KNN ROC-AUC:", roc_knn)

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_model = DecisionTreeClassifier(random_state=42)

# Train
dt_model.fit(X_train_p, y_train)

# Predict
y_pred_dt = dt_model.predict(X_val_p)

# Accuracy
acc_dt = accuracy_score(y_val, y_pred_dt)
print(f"Decision Tree Accuracy: {acc_dt:.1%}")  # rounds instead of truncating

# ROC-AUC
roc_dt = roc_auc_score(y_val, dt_model.predict_proba(X_val_p)[:,1])
print("Decision Tree ROC-AUC:", roc_dt)

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train
rf_model.fit(X_train_p, y_train)

# Predict
y_pred_rf = rf_model.predict(X_val_p)

# Accuracy
acc_rf = accuracy_score(y_val, y_pred_rf)
print(f"Random Forest Accuracy: {acc_rf:.1%}")  # rounds instead of truncating

# ROC-AUC
roc_rf = roc_auc_score(y_val, rf_model.predict_proba(X_val_p)[:,1])
print("Random Forest ROC-AUC:", roc_rf)

In [None]:
from sklearn.svm import SVC

svm_model = SVC(probability=True, random_state=42)

svm_model.fit(X_train_p, y_train)

y_pred_svm = svm_model.predict(X_val_p)

acc_svm = accuracy_score(y_val, y_pred_svm)
print(f"SVM Accuracy: {acc_svm:.1%}")  # rounds instead of truncating

roc_svm = roc_auc_score(y_val, svm_model.predict_proba(X_val_p)[:,1])
print("SVM ROC-AUC:", roc_svm)

In [None]:
from sklearn.naive_bayes import GaussianNB

nb_model = GaussianNB()

nb_model.fit(X_train_p, y_train)

y_pred_nb = nb_model.predict(X_val_p)

acc_nb = accuracy_score(y_val, y_pred_nb)
print(f"Naive Bayes Accuracy: {acc_nb:.1%}")  # rounds instead of truncating

roc_nb = roc_auc_score(y_val, nb_model.predict_proba(X_val_p)[:,1])
print("Naive Bayes ROC-AUC:", roc_nb)

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

gb_model = GradientBoostingClassifier(random_state=42)

gb_model.fit(X_train_p, y_train)

y_pred_gb = gb_model.predict(X_val_p)

acc_gb = accuracy_score(y_val, y_pred_gb)
print(f"Gradient Boosting Accuracy: {acc_gb:.1%}")  # rounds instead of truncating

roc_gb = roc_auc_score(y_val, gb_model.predict_proba(X_val_p)[:,1])
print("Gradient Boosting ROC-AUC:", roc_gb)

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

def plot_confusion(model, X_val, y_val, model_name):
    
    y_pred = model.predict(X_val)
    cm = confusion_matrix(y_val, y_pred)
    
    disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                                   display_labels=["Down", "Up"])
    
    disp.plot(cmap="Blues")
    plt.title(f"Confusion Matrix - {model_name}")
    plt.show()

In [None]:
plot_confusion(log_model, X_val_p, y_val, "Logistic Regression")
plot_confusion(knn_model, X_val_p, y_val, "KNN")
plot_confusion(dt_model, X_val_p, y_val, "Decision Tree")
plot_confusion(rf_model, X_val_p, y_val, "Random Forest")
plot_confusion(svm_model, X_val_p, y_val, "SVM")
plot_confusion(nb_model, X_val_p, y_val, "Naive Bayes")
plot_confusion(gb_model, X_val_p, y_val, "Gradient Boosting")

In [None]:
import pandas as pd

comparison = pd.DataFrame({
    "Model": [
        "Logistic Regression",
        "KNN",
        "Decision Tree",
        "Random Forest",
        "SVM",
        "Naive Bayes",
        "Gradient Boosting"
    ],
    "Accuracy": [
        acc_log,
        acc_knn,
        acc_dt,
        acc_rf,
        acc_svm,
        acc_nb,
        acc_gb
    ],
    "ROC-AUC": [
        roc_log,
        roc_knn,
        roc_dt,
        roc_rf,
        roc_svm,
        roc_nb,
        roc_gb
    ]
})

comparison["Accuracy"] = (comparison["Accuracy"] * 100).round(0).astype(int)
comparison.sort_values(by="Accuracy", ascending=False)

In [None]:
import matplotlib.pyplot as plt

plt.figure()
plt.bar(comparison["Model"], comparison["Accuracy"])
plt.xticks(rotation=45)
plt.title("Model Accuracy Comparison (%)")
plt.show()

In [None]:
comparison = comparison.sort_values(by="Accuracy", ascending=True)
comparison

In [None]:
import matplotlib.pyplot as plt

plt.figure()

plt.barh(comparison["Model"], comparison["Accuracy"])

plt.xlabel("Accuracy (%)")
plt.title("Model Comparison by Accuracy")

plt.xlim(0, 100)  # Accuracy is stored in percent, so show the full 0 to 100 range

plt.show()


## Final Comparison and Conclusion

In this project, seven classification models (Logistic Regression, KNN, Decision Tree, Random Forest, SVM, Naive Bayes, and Gradient Boosting) were trained on the same features and compared on the validation set using accuracy, ROC-AUC, and confusion matrices.

With engineered technical features and a time-based split, the best validation accuracy is around 61%. This is an improvement over the single-feature baseline.

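Accuracy and ROC-AUC do not always rank the models identically, so the comparison table should be read on both columns: accuracy measures the share of correct Up/Down calls, while ROC-AUC measures how well the predicted probabilities separate the two classes.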

These results highlight the high level of noise and uncertainty in financial markets. Future improvements could include using more advanced feature engineering, additional financial indicators, or deep learning models.
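
An accuracy figure for a direction classifier is easier to interpret next to a naive baseline that always predicts the most frequent training class. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`. The toy labels are invented for illustration and are not results from this project; in the notebook, `X_train_p`, `y_train`, `X_val_p` and `y_val` would take their place.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (illustrative only): 70% "Up" in training, 60% "Up" in validation
y_train_toy = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
y_val_toy = np.array([1, 0, 1, 1, 0])
# The dummy model ignores the features, so placeholders are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_val_toy = np.zeros((len(y_val_toy), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_acc = accuracy_score(y_val_toy, baseline.predict(X_val_toy))
print(f"Majority-class baseline accuracy: {baseline_acc:.1%}")
```

A model only adds value to the extent that its validation accuracy exceeds this number, which equals the share of the majority class in the validation period.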
