# Notebook 3: ML Model Training
This notebook performs the following tasks:
1. Loads processed data from the SQLite database.
2. Prepares the data for machine learning by splitting it into training and testing sets and scaling the features.
3. Trains a linear regression model to predict stock prices.
4. Evaluates the model's performance.
5. Saves the trained model and scaler for future use.


## Importing Necessary Libraries
Libraries used:
- `pandas`: For data handling.
- `sqlite3`: To fetch processed data from the SQLite database.
- `scikit-learn`: For splitting, scaling, model training, and evaluation.
- `joblib`: To save the trained model and scaler for reuse.


In [19]:
# Import necessary libraries
import pandas as pd
import sqlite3
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score
import joblib

# Path to SQLite database
db_path = 'database/stocks_data.db'

## Loading Processed Data
Loads the preprocessed stock data from the `processed_stocks` table in the SQLite database (`stocks_data.db`). Displays the number of rows to verify successful loading.


In [20]:
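# Step 1: Load processed data from SQLite
# closing() closes the connection on exit; sqlite3's own context manager only commits or rolls back
from contextlib import closing

with closing(sqlite3.connect(db_path)) as conn:
    query = "SELECT * FROM processed_stocks"
    data = pd.read_sql(query, conn)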

print(f"Loaded processed data: {data.shape[0]} rows")

Loaded processed data: 24090 rows
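
Beyond the row count, it can help to confirm that the columns used later in this notebook are present and free of missing values. The sketch below is one possible check, run against a tiny in-memory stand-in table with made-up values rather than the real `stocks_data.db`; the names `expected_columns`, `toy`, and `loaded` are illustrative assumptions.

```python
import sqlite3
import pandas as pd

# Columns this notebook relies on (the five features plus the target).
expected_columns = ['7-day MA', '14-day MA', 'Volatility', 'Lag_1', 'Lag_2', 'Adj Close']

# Toy stand-in for the processed_stocks table (made-up values, not real data).
toy = pd.DataFrame({col: [1.0, 2.0, 3.0] for col in expected_columns})
conn = sqlite3.connect(':memory:')
toy.to_sql('processed_stocks', conn, index=False)

# Same load pattern as the notebook cell above.
loaded = pd.read_sql("SELECT * FROM processed_stocks", conn)
conn.close()

# Report any expected column that is absent, and count NaN values per column.
missing_columns = [c for c in expected_columns if c not in loaded.columns]
nan_counts = loaded[[c for c in expected_columns if c in loaded.columns]].isna().sum()

print(f"Missing columns: {missing_columns}")
print(f"NaN values per column:\n{nan_counts}")
```

Pointing the same two checks at `data` would flag a renamed column or leftover `NaN` rows before they surface as errors in later cells.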


## Splitting Features and Target
- **Features (`X`)**: Includes engineered features such as moving averages, volatility, and lagged prices.
- **Target (`y`)**: Adjusted close price (`Adj Close`) to be predicted.
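
The feature columns are computed in an earlier notebook, so their exact definitions are not shown here. As a rough sketch of how features of this kind are commonly derived with pandas, assuming simple rolling means, a rolling standard deviation for volatility (the 7-row window is a hypothetical choice), and one- and two-row lags of `Adj Close`:

```python
import pandas as pd

# Toy price series (made-up values, not real market data).
prices = pd.DataFrame({'Adj Close': [float(p) for p in range(100, 120)]})

# Assumed definitions; the actual preprocessing may differ.
prices['7-day MA'] = prices['Adj Close'].rolling(window=7).mean()
prices['14-day MA'] = prices['Adj Close'].rolling(window=14).mean()
prices['Volatility'] = prices['Adj Close'].rolling(window=7).std()  # hypothetical window
prices['Lag_1'] = prices['Adj Close'].shift(1)  # previous row's price
prices['Lag_2'] = prices['Adj Close'].shift(2)  # price two rows back

# Rolling windows and lags leave NaN in the leading rows; drop them before modeling.
features_ready = prices.dropna().reset_index(drop=True)
print(features_ready.head())
```

If the table holds several tickers, calculations like these would likely need to be applied per ticker so that windows and lags do not cross from one stock into another.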


In [21]:
# Step 2: Define features (X) and target (y)
features = ['7-day MA', '14-day MA', 'Volatility', 'Lag_1', 'Lag_2']
target = 'Adj Close'

X = data[features]
y = data[target]

## Splitting Data into Training and Testing Sets
Divides the dataset into:
- Training set: 80% of the data, used to train the model.
- Testing set: 20% of the data, used to evaluate the model.

Evaluating on rows the model never saw during training gives an estimate of how well it generalizes, and fixing `random_state` makes the split reproducible.
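
To make the behavior of `train_test_split` concrete, here is a minimal sketch on a ten-row toy dataset. The names `toy_X` and `toy_y` and their values are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy rows (made-up values).
toy_X = pd.DataFrame({'feature': np.arange(10)})
toy_y = pd.Series(np.arange(10) * 2.0, name='target')

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)
print(len(X_tr), len(X_te))  # 8 training rows, 2 testing rows

# Features and target stay aligned because they are split together.
print(list(X_te.index) == list(y_te.index))

# The same random_state reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)
print(list(X_te.index) == list(X_te2.index))
```

Because `train_test_split` shuffles rows by default, neighboring dates can land on both sides of the split. Passing `shuffle=False` would instead hold out the last 20% of rows, which may be worth considering for time-ordered data.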

In [22]:
# Step 3: Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Data Normalization
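Standardizes the features using `StandardScaler` from `scikit-learn`. The scaler is fitted on the training set only, so the training features end up with a mean of 0 and a standard deviation of 1, and the testing set is transformed with those same training statistics to avoid leaking test information into preprocessing. For ordinary least-squares regression, scaling does not change the predictions, but it puts the coefficients on a comparable scale and keeps the pipeline ready for scale-sensitive models.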
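
A small sketch of what fitting on the training data and reusing those statistics on the testing data looks like. The arrays and the name `toy_scaler` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: two features on very different scales (made-up values).
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
test = np.array([[5.0, 500.0]])

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = toy_scaler.transform(test)        # reuses the training mean and std

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column

# The test row is scaled with training statistics, so it need not be centered at 0.
manual_test_scaled = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, manual_test_scaled))
```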

In [23]:
# Step 4: Normalize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Fit and transform on training data
X_test_scaled = scaler.transform(X_test)       # Transform testing data using the same scaler

## Training the Linear Regression Model
Uses the `LinearRegression` class from `scikit-learn` to train a regression model on the training data. This model predicts stock prices based on the input features.
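
As a quick illustration of what `LinearRegression` fits, the sketch below generates noise-free synthetic data from a known linear rule and checks that the fitted coefficients and intercept recover it. The data and the names `X_demo`, `y_demo`, and `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data from a known rule: y = 3*x0 - 2*x1 + 5 (no noise).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 5.0

demo_model = LinearRegression()
demo_model.fit(X_demo, y_demo)

# Ordinary least squares recovers the generating coefficients and intercept.
print(demo_model.coef_)
print(demo_model.intercept_)
```

After the notebook's own `model.fit(...)` call, `model.coef_` and `model.intercept_` can be inspected the same way; since the features were standardized, each coefficient reflects the change in predicted price per one standard deviation of that feature.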


In [24]:
# Step 5: Train a Linear Regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

## Evaluating Model Performance
Metrics calculated on the testing data:
1. **Mean Squared Error (MSE)**: Measures the average squared difference between actual and predicted prices.
2. **R-squared (R²)**: Indicates how well the model explains the variance in the target variable (closer to 1 is better).

Outputs these metrics to assess model accuracy.
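
To show what the two metrics compute, here is a small worked sketch that derives them by hand on four made-up values and compares the results with the `scikit-learn` functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy actual and predicted prices (made-up values).
actual = np.array([10.0, 12.0, 14.0, 16.0])
predicted = np.array([11.0, 12.0, 13.0, 17.0])

residuals = actual - predicted                  # [-1, 0, 1, -1]
mse_manual = np.mean(residuals ** 2)            # 3 / 4 = 0.75
rmse_manual = np.sqrt(mse_manual)               # same units as the target

ss_res = np.sum(residuals ** 2)                 # residual sum of squares: 3
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares: 20
r2_manual = 1.0 - ss_res / ss_tot               # 1 - 3/20 = 0.85

print(mse_manual, mean_squared_error(actual, predicted))
print(r2_manual, r2_score(actual, predicted))
print(rmse_manual)
```

The square root of the MSE (RMSE) is expressed in the same units as the target, which can make the error easier to interpret than the squared value.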


In [25]:
# Step 6: Evaluate the model
y_pred = model.predict(X_test_scaled)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Model Evaluation:")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared (R²): {r2:.2f}")

Model Evaluation:
Mean Squared Error (MSE): 3.40
R-squared (R²): 1.00


## Saving the Model and Scaler
Saves the trained model (`stock_price_model.pkl`) and the scaler (`scaler.pkl`) using `joblib`. This ensures the model and preprocessing pipeline can be reused without retraining.

In [26]:
# Step 7: Save the trained model and scaler
import os

os.makedirs('models', exist_ok=True)  # Create the output directory if it does not exist
joblib.dump(model, 'models/stock_price_model.pkl')
joblib.dump(scaler, 'models/scaler.pkl')
print("Trained model saved as 'stock_price_model.pkl'")
print("Scaler saved as 'scaler.pkl'")

Trained model saved as 'stock_price_model.pkl'
Scaler saved as 'scaler.pkl'
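
For reference, a later notebook could reload the two files roughly as sketched below. This is a self-contained round trip on a toy model written to a temporary directory, not the real `models/` folder; the data and the `demo_*` and `loaded_*` names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Toy training data (made-up values).
X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 35.0], [4.0, 40.0]])
y_demo = np.array([1.0, 2.0, 3.0, 4.0])

demo_scaler = StandardScaler().fit(X_demo)
demo_model = LinearRegression().fit(demo_scaler.transform(X_demo), y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'stock_price_model.pkl')
    scaler_path = os.path.join(tmp_dir, 'scaler.pkl')

    # Save both objects, then load them back as a later notebook would.
    joblib.dump(demo_model, model_path)
    joblib.dump(demo_scaler, scaler_path)
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New rows must go through the loaded scaler before prediction.
new_rows = np.array([[2.5, 25.0]])
original_pred = demo_model.predict(demo_scaler.transform(new_rows))
reloaded_pred = loaded_model.predict(loaded_scaler.transform(new_rows))
print(np.allclose(original_pred, reloaded_pred))
```

The scaler in this notebook was fitted on a DataFrame, so passing new data as a DataFrame with the same five feature columns in the same order avoids feature-name warnings from `scikit-learn`.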


## Verification and Next Steps
- Confirms that the model and scaler have been saved successfully.
- Prepares for Notebook 4, where the model will be used to make predictions and generate visualizations.

In [27]:
# Step 8: Verify saved files and next steps
import os

if os.path.exists("models/stock_price_model.pkl") and os.path.exists("models/scaler.pkl"):
    print("Model and scaler saved successfully.")
    print("Proceed to Notebook 4 for predictions and visualizations.")
else:
    print("Error: Model or scaler file not found. Check save steps.")

Model and scaler saved successfully.
Proceed to Notebook 4 for predictions and visualizations.
