# Predicting Short-Term Stock Price Movements Using Machine Learning (Milestone 1)

- Category: Application / Empirical Study
- Team Members: Jiho Hahn (jh2982), Ian Lau (icl8)

## Motivation

Stock markets are influenced by many interacting factors, from macroeconomic conditions to company-specific events, and short-term price movements are often considered unpredictable. Still, past research and industry practice suggest that certain technical indicators and statistical patterns can provide useful predictive signals. Our project aims to model daily stock returns as a binary classification problem (up vs. down) using machine learning techniques introduced in this course.


This problem is motivating because it is both practical and methodologically challenging: financial data is noisy and non-stationary, and models trained on it are prone to overfitting. By tackling it, we can assess how well machine learning methods generalize to a real-world, high-variance setting. Additionally, anyone able to predict stock movements consistently and accurately could profit by investing on those predictions.


This project is a detailed benchmark and analysis of multiple machine learning models on an existing financial dataset. We focus on evaluating their predictive performance and robustness in short-term stock movement prediction.

## Method & Setup

#### 1. Model and output

As a baseline, a logistic regression model is trained to predict whether a stock's price will go up or down on the following day, based on metrics from the past.
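For reference, the classifier fitted in the training code (scikit-learn's `LogisticRegression` on standardized features) has the standard logistic form sketched below. The symbols $x_t$, $w$, $b$, and $\hat{y}_{t+1}$ are notation introduced here for illustration only; $x_t$ denotes the standardized feature vector for day $t$, and $y_{t+1} = 1$ means the next day's return is positive.

$$
P(y_{t+1} = 1 \mid x_t) = \sigma(w^\top x_t + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat{y}_{t+1} = \mathbb{1}\left[\, \sigma(w^\top x_t + b) > 0.5 \,\right]
$$

Because the decision boundary $w^\top x_t + b = 0$ is linear in the features, this baseline can only pick up signals that are roughly linear in the chosen indicators.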

#### 2. Dataset

Features and labels are extracted from the [US Stock Market Data & Technical Indicators dataset (USSMD) on Kaggle](https://www.kaggle.com/datasets/nikhilkohli/us-stock-market-data-60-extracted-features/data). It provides 64 attributes per ticker for our model to train on.

#### 3. Feature & label extraction

We use 12 features, some taken directly from the dataset's 64 raw attributes and some engineered from them:

1. Close(t) – Captures the most representative daily price level, forming the basis for trends and returns.
2. Volume – Reflects trading activity and investor interest, often preceding large price movements.
3. MA10 – Short-term moving average that captures recent trend direction and short momentum shifts.
4. MA20 – Medium-term moving average that smooths noise and identifies broader trend alignment or crossovers.
5. RSI – Measures momentum and overbought/oversold conditions, signaling potential reversals.
6. MACD – Highlights changes in momentum by comparing short- and long-term EMAs, useful for trend shifts.
7. MACD_EMA – The smoothed version of MACD that indicates confirmed or weakening momentum signals.
8. Volatility_10d – Quantifies recent price variability, revealing risk regimes and market uncertainty.
9. Return_1d – Captures immediate momentum or mean-reversion tendencies from the prior day’s move.
10. Index_SP500_Return_1d – Represents overall market sentiment, accounting for macro-level correlations.
11. High_Low_Range – Measures intraday volatility, hinting at potential breakout or reversal conditions.
12. Return_5d – Encodes short-term momentum or drift over the past week, complementing single-day returns.

We will apply a range of machine learning models covered throughout the semester to these features, starting with a baseline logistic regression in this milestone.

The label is whether the following day's 1-day return is positive (1) or not (0).
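To make the return-based features and the label definition concrete, here is a minimal pure-Python sketch on a toy price series. The helper names (`pct_change`, `rolling_std`, `next_day_up_labels`) and the prices are hypothetical and chosen for illustration; the project's notebook computes the same quantities with pandas.

```python
from statistics import stdev

def pct_change(prices, lag=1):
    """Simple return over `lag` days; None where there is not enough history."""
    return [
        None if i < lag else prices[i] / prices[i - lag] - 1.0
        for i in range(len(prices))
    ]

def rolling_std(values, window):
    """Sample standard deviation over the trailing `window` values (None until full)."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(v is None for v in chunk):
            out.append(None)
        else:
            out.append(stdev(chunk))
    return out

def next_day_up_labels(returns):
    """Label for day t is 1 if the return on day t+1 is positive, else 0.

    The final day has no next-day return, so its label is None and that row
    should be dropped rather than counted as a "down" day.
    """
    labels = [None if r is None else int(r > 0) for r in returns[1:]]
    return labels + [None]

# Hypothetical closing prices, for illustration only
closes = [100.0, 102.0, 101.0, 101.0, 104.0]
ret_1d = pct_change(closes)               # plays the role of Return_1d
vol_3d = rolling_std(ret_1d, window=3)    # same idea as Volatility_10d, shorter window
labels = next_day_up_labels(ret_1d)
print(labels)
```

Note that a flat day (zero return) is labeled 0 under this definition, since the label only asks whether the return is strictly positive.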

#### 4. Train & Evaluation

- Training and testing are carried out per ticker (e.g. AAPL, MSFT). Each ticker has its own dynamics: some stocks (e.g. AAPL) have stronger momentum or trend reversals than others (e.g. utilities). Fitting a separate logistic regression per ticker lets the weights adapt to that stock's response to the technical indicators.
- Rows are split chronologically into training and test sets, with the first 80% of each ticker's date window used for training and the last 20% for testing. The ratio is a simple, standard rule of thumb.
- The COVID sell-off started around Feb 19, 2020, so we exclude all rows dated on or after Feb 1, 2020 to avoid extreme market volatility that could distort model training and hinder evaluation under typical market conditions.
- Models are evaluated on test-set prediction accuracy, together with the F1-score and confusion matrix.
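The date-window split described above can be sketched as follows. This is an illustrative pure-Python version with a hypothetical helper name (`split_by_date_window`) and made-up rows; it assumes the cut-off is placed 80% of the way through the calendar span, so the row counts are only approximately 80/20.

```python
from datetime import datetime, timedelta

def split_by_date_window(rows, train_ratio=0.8):
    """Chronological train/test split of (timestamp, payload) rows.

    The cut-off sits `train_ratio` of the way through the calendar span, so
    every training row precedes every test row.
    """
    rows = sorted(rows, key=lambda row: row[0])
    first, last = rows[0][0], rows[-1][0]
    train_end = first + (last - first) * train_ratio
    train = [row for row in rows if row[0] < train_end]
    test = [row for row in rows if row[0] >= train_end]
    return train, test

# Hypothetical data: 11 consecutive days with a dummy payload
start = datetime(2019, 1, 1)
rows = [(start + timedelta(days=i), i) for i in range(11)]
train, test = split_by_date_window(rows)
print(len(train), len(test))
```

Splitting by time rather than at random keeps future observations out of the training set, which matters here because the features are built from rolling windows of past prices.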



## Preliminary Experiments

#### Experiment implementation (Code)

The full Python code is given in the last section of this note.

#### Outcome & Analysis

Accuracy, F1-score, and the confusion matrix are reported per ticker as our evaluation metrics.

| **Ticker** | **Accuracy** | **F1-Score (Positive Class)** | **True Positives (TP)** | **False Positives (FP)** | **False Negatives (FN)** | **True Negatives (TN)** |
| ---------- | ------------ | ----------------------------- | ----------------------- | ------------------------ | ------------------------ | ----------------------- |
| **TSLA**   | 49.30%       | 0.562                         | 70                      | 64                       | 45                       | 36                      |
| **MSFT**   | 45.96%       | 0.236                         | 60                      | 39                       | 349                      | 270                     |
| **AMZN**   | 54.51%       | 0.699                         | 246                     | 205                      | 7                        | 8                       |
| **GOOGL**  | 55.03%       | 0.544                         | 165                     | 118                      | 159                      | 174                     |
| **GE**     | 48.35%       | 0.486                         | 237                     | 301                      | 200                      | 232                     |
| **GS**     | 49.72%       | 0.077                         | 15                      | 11                       | 350                      | 342                     |
| **IBM**    | 49.48%       | 0.389                         | 156                     | 132                      | 358                      | 324                     |
| **JPM**    | 49.69%       | 0.408                         | 168                     | 152                      | 336                      | 314                     |
| **FB**     | 52.09%       | 0.466                         | 45                      | 36                       | 67                       | 67                      |
| **AAPL**   | 45.26%       | 0.105                         | 23                      | 27                       | 366                      | 302                     |


- AMZN and GOOGL show the highest accuracy. AMZN also has the highest F1-Score (0.699), but the model predicts "Up" on 451 of its 466 test days, so this score mostly reflects near-constant positive predictions rather than an ability to separate up days from down days.

- MSFT, AAPL, and GS show very low F1-Scores for the positive class (predicting "Up"). The model is extremely conservative about issuing an Up prediction for these tickers, so recall is low. This is clearly visible in the high FN counts (MSFT: 349, GS: 350, AAPL: 366).

- TSLA, GE, and GOOGL have F1-Scores close to their accuracy, and their predictions are split more evenly between Up and Down, indicating a more balanced performance between Precision and Recall.
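The per-ticker metrics can be re-derived from the confusion-matrix counts alone. The sketch below uses a hypothetical helper, `binary_metrics`, and also reports two reference quantities that are not in the original table: the accuracy of always predicting "Up" and the share of days on which the model predicted "Up".

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive ("Up") class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        # Reference points: accuracy of predicting "Up" every day, and how
        # often the model actually predicted "Up".
        "always_up_accuracy": (tp + fn) / total,
        "predicted_up_share": (tp + fp) / total,
    }

# Counts copied from the AMZN and MSFT rows of the results table
amzn = binary_metrics(tp=246, fp=205, fn=7, tn=8)
msft = binary_metrics(tp=60, fp=39, fn=349, tn=270)

for name, m in [("AMZN", amzn), ("MSFT", msft)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Comparing `accuracy` with `always_up_accuracy` shows how much of a ticker's score could be obtained without using the features at all.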

#### Key Findings

- Near-Random Performance: Half of the tickers (TSLA: 49.3%, GE: 48.4%, GS: 49.7%, IBM: 49.5%, JPM: 49.7%) have accuracy within two percentage points of 50%, the expected accuracy of a random coin flip. This is a common initial outcome in stock market prediction, indicating that the current features (standard technical indicators) do not contain a linear signal strong enough for reliable classification by the current model.

- Systematic Errors (MSFT, AAPL): The results for MSFT (46.0%) and AAPL (45.3%) fall below the 50% coin-flip baseline, although a formal significance test would be needed to confirm that the gap is not noise. One possible explanation is that the model fits patterns in the training window that do not persist in the test window, causing it to be wrong more often than not. This highlights the need for better regularization or a different model.

- Marginal Outperformance (AMZN, GOOGL): The model shows its best performance on AMZN (54.5%) and GOOGL (55.0%). This may indicate that the current feature set captures some slight predictive signal specific to these high-growth tech stocks, which often have higher momentum or clearer price trends than older industrials (like GE, IBM). For AMZN, however, the accuracy is close to the share of up days in the test set (253 of 466, about 54.3%), so the gain over always predicting Up is small.
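One way to judge whether an accuracy differs meaningfully from a coin flip is a normal-approximation test against 50%. The sketch below is illustrative rather than part of the original experiments: the function names are hypothetical, it assumes test-day predictions behave like independent trials (which is optimistic for time series), and it applies no correction for testing ten tickers at once.

```python
import math

def accuracy_z_score(correct, total, p0=0.5):
    """z-score of the observed accuracy against chance level p0."""
    p_hat = correct / total
    standard_error = math.sqrt(p0 * (1 - p0) / total)
    return (p_hat - p0) / standard_error

def two_sided_p_value(z):
    """Two-sided p-value under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Correct predictions (TP + TN) and test-set sizes taken from the results table
tickers = {
    "MSFT": (60 + 270, 718),
    "GOOGL": (165 + 174, 616),
    "GS": (15 + 342, 718),
}

for name, (correct, total) in tickers.items():
    z = accuracy_z_score(correct, total)
    print(f"{name}: z = {z:.2f}, p = {two_sided_p_value(z):.3f}")
```

With several hundred test days per ticker, accuracies a few points away from 50% sit close to the edge of what such a test can distinguish from noise, so the per-ticker differences should be read cautiously.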

## Future Work

#### Feature Engineering and Data Augmentation

To move beyond the 50% barrier, we will incorporate features that capture multi-faceted market dynamics, including:

- Market sentiment data derived from financial news, social media mentions, or analyst reports. Sentiment can capture the psychological aspects of short-term trading which technical indicators miss.

- Global market indices such as the Euro Stoxx 50, the Shanghai Composite Index (SSE Composite), and the Hang Seng Index (HSI).

#### Model Complexity and Selection

We will move from simple classification models to more powerful ones suited to non-linear and sequential data, such as LSTMs and Transformers.

## Code

In [100]:
from google.colab import files
files.upload()   # select kaggle.json

# Create token in: https://www.kaggle.com/settings


In [101]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

In [102]:
!kaggle datasets download -d nikhilkohli/us-stock-market-data-60-extracted-features

Dataset URL: https://www.kaggle.com/datasets/nikhilkohli/us-stock-market-data-60-extracted-features
License(s): CC0-1.0
us-stock-market-data-60-extracted-features.zip: Skipping, found more recently modified local copy (use --force to force download)


In [103]:
!unzip us-stock-market-data-60-extracted-features.zip -d data/

Archive:  us-stock-market-data-60-extracted-features.zip
replace data/AAPL.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [104]:
import pandas as pd
import glob

# Combine all CSVs into one DataFrame
files = glob.glob('data/*.csv')

dfs = {}
for file in files:
  ticker = file.split('/')[-1].replace('.csv', '')
  temp = pd.read_csv(file)
  dfs[ticker] = temp

# Parse dates and sort

for ticker, df in dfs.items():
  df['Date'] = pd.to_datetime(df['Date'])
  # sort_values returns a new DataFrame, so store it back in the dict
  dfs[ticker] = df.sort_values('Date')

print(dfs.keys())
print("AAPL shape & head: ", dfs['AAPL'].shape)
dfs['AAPL'].iloc[:, :8].head()

dict_keys(['TSLA', 'MSFT', 'AMZN', 'GOOGL', 'GE', 'GS', 'IBM', 'JPM', 'FB', 'AAPL'])
AAPL shape & head:  (3732, 64)


Unnamed: 0,Date,Open,High,Low,Close(t),Volume,SD20,Upper_Band
0,2005-10-17,6.66,6.69,6.5,6.6,154208600,0.169237,6.827473
1,2005-10-18,6.57,6.66,6.44,6.45,152397000,0.168339,6.819677
2,2005-10-19,6.43,6.78,6.32,6.78,252170800,0.180306,6.861112
3,2005-10-20,6.72,6.97,6.71,6.93,339440500,0.202674,6.931847
4,2005-10-21,7.02,7.03,6.83,6.87,199181500,0.21668,6.97486


In [105]:
import pandas as pd

# Add target col (next day up/down) & Engineer features

# Declare features
FEATURE_COLS = [
  'Close(t)',            # basic price
  'Volume',              # volume
  'MA10',                # short trend
  'MA20',                # medium trend
  'RSI',                 # momentum indicator
  'MACD',                # momentum/trend change indicator
  'MACD_EMA',            # signal line of MACD
  'Volatility_10d',      # e.g., rolling std of return over 10 days
  'Return_1d',           # most recent 1-day return (close-to-close)
  'Index_SP500_Return_1d',   # market context
  'High_Low_Range',      # (High - Low) / Close on the same day
  'Return_5d'            # 5-day return (momentum)
]

# Declare label
LABEL_COL = 'label_up_next'

for ticker, df_current in dfs.items():
  # 1. 1-day returns per ticker
  df_current['Return_1d'] = df_current['Close(t)'].pct_change()
  # 2. 5-day returns per ticker
  df_current['Return_5d'] = df_current['Close(t)'].pct_change(5)
  # 3. 10-day volatility (std of daily returns)
  df_current['Volatility_10d'] = df_current['Return_1d'].rolling(10).std()
  # 4. S&P 500 1-day return (index context)
  df_current['Index_SP500_Return_1d'] = df_current['SnP_Close'].pct_change()
  # 5. Intraday high-low range, scaled by close
  df_current['High_Low_Range'] = (df_current['High'] - df_current['Low']) / df_current['Close(t)']

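  # Drop rows from February 2020 onward (COVID sell-off period)
  df_current = df_current[df_current['Date'] < '2020-02-01']

  # Keep only the model inputs; copy to avoid pandas' SettingWithCopyWarning
  df_current = df_current[FEATURE_COLS + ['Date']].copy()

  # Define label as next day up (1) or not (0).
  # Caveat: the last row has no next-day return (NaN), and NaN > 0 is False,
  # so that row is labeled 0 and is not removed by dropna() below.
  df_current[LABEL_COL] = (df_current['Return_1d'].shift(-1) > 0).astype(int)

  # `pct_change` and `rolling` introduce NaN values in the first rows, so drop them.
  df_current = df_current.dropna()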

  # Update the DataFrame in the dictionary
  dfs[ticker] = df_current




print(dfs['AAPL'][['Close(t)', 'Return_1d', 'Return_5d', 'Volatility_10d', 'label_up_next']].head(10))

    Close(t)  Return_1d  Return_5d  Volatility_10d  label_up_next
10      7.11   0.058036   0.014265        0.030474              0
11      7.10  -0.001406   0.024531        0.028794              1
12      7.40   0.042254   0.051136        0.027487              1
13      7.64   0.032432   0.116959        0.028213              0
14      7.55  -0.011780   0.123512        0.028460              0
15      7.44  -0.014570   0.046414        0.029160              0
16      7.39  -0.006720   0.040845        0.028879              1
17      7.42   0.004060   0.002703        0.028708              1
18      7.55   0.017520  -0.011780        0.026223              1
19      7.60   0.006623   0.006623        0.024431              0


In [106]:
# Split the data by date and remove 'Date' column for training

# Use validation set later
# TVT_TRAIN_RATIO = 0.7
# TVT_VAL_RATIO = 0.15
# TVT_TEST_RATIO = 0.15

# def split_by_time_tvt(df, train_end, val_end):
#   train = df[df['Date'] < train_end]
#   val = df[(df['Date'] >= train_end) & (df['Date'] < val_end)]
#   test = df[df['Date'] >= val_end]
#   return train, val, test

TT_TRAIN_RATIO = 0.8
TT_TEST_RATIO = 0.2

def split_by_time_tt(df, train_end):
  train = df[df['Date'] < train_end]
  test = df[df['Date'] >= train_end]
  return train, test

train_tests = {}

for ticker, df in dfs.items():
  date_diff = df['Date'].max() - df['Date'].min()
  train_end = df['Date'].min() + date_diff * TT_TRAIN_RATIO
  train, test = split_by_time_tt(df, train_end)

  train_tests[ticker] = {
    'train': train.drop('Date', axis=1),
    'test': test.drop('Date', axis=1)
  }

print(train_tests['AAPL']['train'].shape)
print(train_tests['AAPL']['test'].shape)

(2869, 13)
(718, 13)


In [107]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

models = {}

# Train
for ticker, train_test in train_tests.items():
  X_train = train_test['train'][FEATURE_COLS]
  y_train = train_test['train'][LABEL_COL]

  pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42))
  ])

  pipe.fit(X_train, y_train)

  models[ticker] = pipe

In [110]:
# Test
for ticker, pipe in models.items():
  X_test = train_tests[ticker]['test'][FEATURE_COLS]
  y_test = train_tests[ticker]['test'][LABEL_COL]

  y_pred = pipe.predict(X_test)
  y_prob = pipe.predict_proba(X_test)[:, 1] # Use only "Up" probability

  print(f"[{ticker}]\tAccuracy: {accuracy_score(y_test, y_pred)}")

  print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
  print("f-1 score:\n", f1_score(y_test, y_pred, average='binary'))

[TSLA]	Accuracy: 0.4930232558139535
Confusion matrix:
 [[36 64]
 [45 70]]
f-1 score:
 0.5622489959839357
[MSFT]	Accuracy: 0.4596100278551532
Confusion matrix:
 [[270  39]
 [349  60]]
f-1 score:
 0.23622047244094488
[AMZN]	Accuracy: 0.5450643776824035
Confusion matrix:
 [[  8 205]
 [  7 246]]
f-1 score:
 0.6988636363636364
[GOOGL]	Accuracy: 0.5503246753246753
Confusion matrix:
 [[174 118]
 [159 165]]
f-1 score:
 0.5436573311367381
[GE]	Accuracy: 0.48350515463917526
Confusion matrix:
 [[232 301]
 [200 237]]
f-1 score:
 0.48615384615384616
[GS]	Accuracy: 0.4972144846796657
Confusion matrix:
 [[342  11]
 [350  15]]
f-1 score:
 0.07672634271099744
[IBM]	Accuracy: 0.4948453608247423
Confusion matrix:
 [[324 132]
 [358 156]]
f-1 score:
 0.38902743142144636
[JPM]	Accuracy: 0.49690721649484537
Confusion matrix:
 [[314 152]
 [336 168]]
f-1 score:
 0.4077669902912621
[FB]	Accuracy: 0.5209302325581395
Confusion matrix:
 [[67 36]
 [67 45]]
f-1 score:
 0.46632124352331605
[AAPL]	Accuracy: 0.45264623