 Data Collection (Stock Prices + Sentiment Analysis)
 
  1. Stock Price Data ( alpha vantage API)
  We’ll use yfinance to collect historical stock prices (Open, High, Low, Close, Volume).

In [5]:
from alpha_vantage.timeseries import TimeSeries
import pandas as pd

API_KEY = "P3LQ4XJ9PACNYTHH"  

ts = TimeSeries(key=API_KEY, output_format="pandas")
stock_data, meta_data = ts.get_daily(symbol="JPM", outputsize="full")

# Save to CSV
stock_data.to_csv("jpm_stock_data.csv")
print("Stock data saved successfully!")




Stock data saved successfully!


 2. Preprocessing Data (Handling missing values, normalizing).

In [6]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Load stock data
df = pd.read_csv("jpm_stock_data.csv")
df = df.iloc[::-1]  # Reverse order so that oldest data is first

# Convert date column to datetime format
df.index = pd.to_datetime(df.index)

# Select only relevant columns (Open, High, Low, Close, Volume)
df = df[['1. open', '2. high', '3. low', '4. close', '5. volume']]
df.columns = ['Open', 'High', 'Low', 'Close', 'Volume']

# Check for missing values
print("Missing values:\n", df.isnull().sum())

# Normalize using MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
df_scaled = scaler.fit_transform(df)

# Convert back to DataFrame
df_scaled = pd.DataFrame(df_scaled, columns=df.columns, index=df.index)

print("\nPreprocessing Done! Data is now scaled.")
df_scaled.head()



Missing values:
 Open      0
High      0
Low       0
Close     0
Volume    0
dtype: int64

Preprocessing Done! Data is now scaled.


Unnamed: 0,Open,High,Low,Close,Volume
1970-01-01 00:00:00.000006375,0.270654,0.26817,0.260848,0.257505,0.017261
1970-01-01 00:00:00.000006374,0.262603,0.264873,0.26039,0.257996,0.012784
1970-01-01 00:00:00.000006373,0.259261,0.256347,0.256349,0.25327,0.008661
1970-01-01 00:00:00.000006372,0.260249,0.263698,0.261344,0.259622,0.014299
1970-01-01 00:00:00.000006371,0.268793,0.269155,0.269427,0.267675,0.012284


Feature Engineering ; 
Simple Moving Averages (SMA) – Averages of closing prices over different time windows 
Exponential Moving Averages (EMA) – Similar to SMA but gives more weight to recent prices.
Relative Strength Index (RSI) – Measures momentum to see if a stock is overbought/oversold.
MACD (Moving Average Convergence Divergence) – Identifies trend changes.

In [7]:
# Import required libraries
import pandas as pd

# Function to calculate Relative Strength Index (RSI)
def calculate_rsi(data, window=14):
    delta = data.diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=window).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=window).mean()
    rs = gain / loss
    return 100 - (100 / (1 + rs))

# Moving Averages
df_scaled['SMA_14'] = df_scaled['Close'].rolling(window=14).mean()
df_scaled['SMA_50'] = df_scaled['Close'].rolling(window=50).mean()
df_scaled['EMA_14'] = df_scaled['Close'].ewm(span=14, adjust=False).mean()

# RSI Indicator
df_scaled['RSI_14'] = calculate_rsi(df_scaled['Close'], window=14)

# MACD Calculation
ema_12 = df_scaled['Close'].ewm(span=12, adjust=False).mean()
ema_26 = df_scaled['Close'].ewm(span=26, adjust=False).mean()
df_scaled['MACD'] = ema_12 - ema_26

# Drop NaN values (from rolling calculations)
df_scaled.dropna(inplace=True)

print("Feature Engineering Done! New features added.")
df_scaled.head()


Feature Engineering Done! New features added.


Unnamed: 0,Open,High,Low,Close,Volume,SMA_14,SMA_50,EMA_14,RSI_14,MACD
1970-01-01 00:00:00.000006326,0.207538,0.204964,0.203195,0.20552,0.022308,0.221988,0.237996,0.218811,35.811919,-0.007702
1970-01-01 00:00:00.000006325,0.206361,0.206631,0.207962,0.207183,0.018817,0.220737,0.23699,0.21726,37.560451,-0.007955
1970-01-01 00:00:00.000006324,0.211829,0.21277,0.212003,0.211191,0.01773,0.219066,0.236054,0.216451,31.847507,-0.007742
1970-01-01 00:00:00.000006323,0.222501,0.225085,0.220583,0.220907,0.026391,0.218358,0.235406,0.217045,42.964554,-0.006712
1970-01-01 00:00:00.000006322,0.222501,0.220349,0.210821,0.210019,0.012735,0.216889,0.234414,0.216108,37.313433,-0.006697


Training ML Models ;
Defining the target (y) and features (X)
Splitting data into training & testing sets

Train models:
Linear Regression (Baseline model)
XGBoost (More advanced)


In [9]:
# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np
import xgboost as xgb

# Define target variable (Close Price) and feature set
X = df_scaled.drop(columns=['Close'])  # All features except the target
y = df_scaled['Close']  # Target variable

# Split data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# 1 Train Linear Regression Model
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# Make predictions
y_pred_lr = lr_model.predict(X_test)

# Evaluate Linear Regression
mae_lr = mean_absolute_error(y_test, y_pred_lr)
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))

print(f"Linear Regression - MAE: {mae_lr:.4f}, RMSE: {rmse_lr:.4f}")

# 2 Train XGBoost Model
xgb_model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=100)
xgb_model.fit(X_train, y_train)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate XGBoost
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
rmse_xgb = np.sqrt(mean_squared_error(y_test, y_pred_xgb))

print(f"XGBoost - MAE: {mae_xgb:.4f}, RMSE: {rmse_xgb:.4f}")



Linear Regression - MAE: 0.0026, RMSE: 0.0034
XGBoost - MAE: 0.0889, RMSE: 0.1519
