# **TradeCare: Model Training Notebook**

## Objectives
* Load engineered features dataset
* Filter to recent data (2020-2025) for better model performance
* Train regression model (BR1) to predict price movement
* Train classification model (BR2) to predict trade profitability
* Evaluate both models with appropriate metrics
* Save both models for dashboard deployment

## Inputs
* **Data Source:** `inputs/datasets/processed/bitcoin_features.csv`
* **Features:** 14 technical indicators
* **Targets:** 2 variables (continuous return + binary profitable)
* **Full Dataset:** ~92,000 rows (2014-2025)
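
The two targets arrive precomputed in the CSV. As a rough sketch of how targets of this kind could be derived, the block below builds a 4-hour forward return from `CLOSE_PRICE` and a binary label from its sign. The helper name `add_targets`, the `horizon` parameter, and the "return > 0" rule for `target_profitable` are assumptions for illustration; the feature-engineering step may use a different rule (for example, a threshold that accounts for fees). The sample prices are the first five hourly closes in the dataset.

```python
import pandas as pd

def add_targets(df: pd.DataFrame, horizon: int = 4) -> pd.DataFrame:
    """Return a copy with a forward-return target and a binary target (illustrative)."""
    out = df.copy()
    # Forward return: price `horizon` rows ahead, relative to the current price.
    out["target_return_simple"] = out["CLOSE_PRICE"].shift(-horizon) / out["CLOSE_PRICE"] - 1
    # Assumption: a trade counts as profitable when the forward return is positive.
    out["target_profitable"] = (out["target_return_simple"] > 0).astype(int)
    # The last `horizon` rows have no future price, so they are dropped.
    return out.dropna(subset=["target_return_simple"])

prices = pd.DataFrame({"CLOSE_PRICE": [409.82, 405.72, 406.66, 404.80, 400.37]})
targets = add_targets(prices)
print(targets)
```

With hourly rows, `horizon=4` corresponds to the 4-hour window named in BR1; only the first row survives here because the other four have no price four hours ahead.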

## Outputs
* **BR1 Model:** Linear Regression saved as `outputs/models/regression_model.pkl`
* **BR2 Model:** Logistic Regression saved as `outputs/models/classification_model.pkl`
* **Evaluation metrics** for both models
* **Visualizations** showing model performance

## Business Requirements Addressed
* **BR1:** Price Movement Prediction (Regression)
  - Predict: % price change over 4 hours
  - Metrics: RMSE, MAE, R²
  - Use case: Setting stop-loss/take-profit levels

* **BR2:** Trade Profitability Assessment (Classification)
  - Predict: Profitable (1) or Not Profitable (0)
  - Metrics: Accuracy, Confusion Matrix, ROC-AUC
  - Use case: Binary decision support
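
The metrics listed for BR1 and BR2 can be illustrated on a handful of made-up values. This is a minimal sketch, not the notebook's evaluation code: the arrays below are invented for illustration, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, confusion_matrix, roc_auc_score,
)

# --- BR1: regression metrics on invented 4-hour returns ---
y_true_reg = np.array([0.01, -0.02, 0.03, 0.00])
y_pred_reg = np.array([0.02, -0.01, 0.01, 0.00])

rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # penalises large errors more
mae = mean_absolute_error(y_true_reg, y_pred_reg)           # average absolute error
r2 = r2_score(y_true_reg, y_pred_reg)                       # share of variance explained

# --- BR2: classification metrics on invented labels and probabilities ---
y_true_clf = np.array([0, 1, 1, 0])
y_prob_clf = np.array([0.2, 0.7, 0.4, 0.6])      # predicted P(profitable)
y_pred_clf = (y_prob_clf >= 0.5).astype(int)     # assumed 0.5 decision threshold

acc = accuracy_score(y_true_clf, y_pred_clf)
cm = confusion_matrix(y_true_clf, y_pred_clf)    # rows: actual, columns: predicted
auc = roc_auc_score(y_true_clf, y_prob_clf)      # uses probabilities, not hard labels

print(f"RMSE={rmse:.4f}  MAE={mae:.4f}  R2={r2:.4f}")
print(f"Accuracy={acc:.2f}  ROC-AUC={auc:.2f}")
print(cm)
```

ROC-AUC is computed from the predicted probabilities rather than the thresholded labels, so it does not depend on the chosen cut-off, whereas accuracy and the confusion matrix do.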

## CRISP-DM Phase
Modeling → Model Training and Evaluation

---

## Setup

Import dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, confusion_matrix, classification_report, roc_auc_score, roc_curve
)
import joblib

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

Change to project root

In [2]:
current_dir = os.getcwd()
if 'jupyter_notebooks' in current_dir:
    os.chdir(os.path.dirname(current_dir))
    print(f"Changed directory to: {os.getcwd()}")
else:
    print(f"Already in project root: {current_dir}")

Changed directory to: /Users/ilianamarquez/Documents/vscode-projects/trade-care


---

## Load Engineered Features

In [3]:
df = pd.read_csv('inputs/datasets/processed/bitcoin_features.csv')
df['timestamp'] = pd.to_datetime(df['timestamp'])

print(f"✓ Data loaded: {len(df):,} rows")
print(f"  Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"  Features: {len(df.columns) - 4} (excluding timestamp, price, 2 targets)")
df.head()

✓ Data loaded: 92,127 rows
  Date range: 2014-11-17 09:00:00 to 2025-11-21 19:00:00
  Features: 14 (excluding timestamp, price, 2 targets)


| | timestamp | CLOSE_PRICE | return_1h | return_4h | return_12h | return_24h | rsi | ma_10 | ma_20 | ma_50 | dist_from_ma10 | dist_from_ma20 | volume_change | volume_ratio | volatility_24h | price_range | target_return_simple | target_profitable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-11-17 09:00:00 | 409.82 | 0.020341 | 0.038229 | 0.061572 | 0.057682 | 81.257533 | 398.621 | 390.283 | 385.2682 | 0.028094 | 0.050059 | -0.293987 | 0.656453 | 0.010936 | 0.025328 | -0.023059 | 0 |
| 1 | 2014-11-17 10:00:00 | 405.72 | -0.010004 | 0.017046 | 0.052069 | 0.044647 | 71.564683 | 400.649 | 391.2355 | 385.4596 | 0.012657 | 0.037022 | 0.498773 | 0.94652 | 0.011225 | 0.013162 | -0.019003 | 0 |
| 2 | 2014-11-17 11:00:00 | 406.66 | 0.002317 | 0.01953 | 0.054944 | 0.061997 | 71.708447 | 401.9 | 392.6385 | 385.6498 | 0.011844 | 0.035711 | -0.1936 | 0.790797 | 0.0107 | 0.008631 | -0.014951 | 0 |
| 3 | 2014-11-17 12:00:00 | 404.8 | -0.004574 | 0.007843 | 0.050228 | 0.044942 | 69.582993 | 402.65 | 394.0305 | 385.7478 | 0.00534 | 0.027332 | 0.430754 | 1.180859 | 0.010612 | 0.016774 | -0.009313 | 0 |
| 4 | 2014-11-17 13:00:00 | 400.37 | -0.010944 | -0.023059 | 0.015781 | 0.038897 | 63.996992 | 402.347 | 395.195 | 385.904 | -0.004914 | 0.013095 | 0.603766 | 1.813744 | 0.010841 | 0.016085 | -0.014861 | 0 |
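
The feature count printed by the load cell is derived by subtracting 4 from the number of columns. A slightly more explicit sketch is to name the non-feature columns and select everything else. The helper name `split_columns` and the constant `NON_FEATURE_COLS` are hypothetical, and the toy frame below carries only two of the 14 indicators.

```python
import pandas as pd

# Columns that are not model inputs: the timestamp, the raw price, and the two targets.
NON_FEATURE_COLS = ["timestamp", "CLOSE_PRICE", "target_return_simple", "target_profitable"]

def split_columns(df: pd.DataFrame):
    """Split a features DataFrame into X, the regression target, and the classification target."""
    feature_cols = [c for c in df.columns if c not in NON_FEATURE_COLS]
    X = df[feature_cols]
    y_reg = df["target_return_simple"]   # BR1: continuous 4-hour return
    y_clf = df["target_profitable"]      # BR2: binary label
    return X, y_reg, y_clf

# Toy frame with the same column naming as the dataset (values from the first two rows).
toy = pd.DataFrame({
    "timestamp": pd.to_datetime(["2014-11-17 09:00:00", "2014-11-17 10:00:00"]),
    "CLOSE_PRICE": [409.82, 405.72],
    "return_1h": [0.020341, -0.010004],
    "rsi": [81.257533, 71.564683],
    "target_return_simple": [-0.023059, -0.019003],
    "target_profitable": [0, 0],
})

X, y_reg, y_clf = split_columns(toy)
print(list(X.columns))
```

Naming the excluded columns keeps the feature list correct if a column is later added to or removed from the CSV, which a hard-coded offset would not.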


---

## Filter to Recent Data (2020-2025)

**Why Filter to Recent Years:**
* Data cleaning showed that data from 2020 onward has 0 time gaps (continuous hourly series)
* Recent market patterns are more relevant to current trading
* A smaller dataset trains faster
* A model fit on recent market regimes is expected to generalize better to near-future predictions
* 2014-2019 data had infrastructure issues (early Bitcoin exchanges)

**Still includes:**
* COVID crash (March 2020)
* 2021 bull run and its then all-time high (ATH)
* 2022 bear market
* 2024-2025 recovery

**Expected:** ~51,600 rows (about 5.9 years of hourly data, 2020-01-01 through 2025-11-21)
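
The expected size of a gap-free hourly series follows directly from its endpoints, and the continuity claim can be checked by looking at the spacing between consecutive timestamps. The sketch below shows both on a toy series with one missing hour; the helper names `expected_hourly_rows` and `count_gaps` are hypothetical and are not part of the data-cleaning notebook.

```python
import pandas as pd

def expected_hourly_rows(start, end) -> int:
    """Number of hourly slots between start and end, both endpoints included."""
    span = pd.Timestamp(end) - pd.Timestamp(start)
    return int(span / pd.Timedelta(hours=1)) + 1

def count_gaps(timestamps: pd.Series) -> int:
    """Count places where consecutive timestamps are more than one hour apart."""
    diffs = timestamps.sort_values().diff().dropna()
    return int((diffs > pd.Timedelta(hours=1)).sum())

# Toy series: 03:00 is missing, so there is exactly one gap.
toy_ts = pd.Series(pd.to_datetime([
    "2020-01-01 00:00", "2020-01-01 01:00", "2020-01-01 02:00",
    "2020-01-01 04:00", "2020-01-01 05:00",
]))

print("expected slots:", expected_hourly_rows(toy_ts.min(), toy_ts.max()))
print("actual rows:   ", len(toy_ts))
print("gaps:          ", count_gaps(toy_ts))

# The same helper applied to the filter start and the dataset's last timestamp.
full_span_rows = expected_hourly_rows("2020-01-01 00:00:00", "2025-11-21 19:00:00")
print("full-span slots:", full_span_rows)
```

Comparing the expected slot count with `len(df)` after filtering gives a quick sanity check that the filter kept the intended range.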

In [4]:
rows_before = len(df)
df = df[df['timestamp'] >= '2020-01-01'].copy()
rows_after = len(df)

print("✓ Filtered to recent data (2020-2025)")
print(f"  Before: {rows_before:,} rows")
print(f"  After: {rows_after:,} rows")
print(f"  Dropped: {rows_before - rows_after:,} rows ({(rows_before-rows_after)/rows_before*100:.1f}%)")
print(f"\n  New date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
print(f"  Years covered: {(df['timestamp'].max() - df['timestamp'].min()).days / 365.25:.1f} years")

✓ Filtered to recent data (2020-2025)
  Before: 92,127 rows
  After: 51,643 rows
  Dropped: 40,484 rows (43.9%)

  New date range: 2020-01-01 00:00:00 to 2025-11-21 19:00:00
  Years covered: 5.9 years
