# **TradeCare: Feature Engineering Notebook**   

## Objectives
* Load cleaned Bitcoin OHLCV data
* Engineer technical indicators for ML model
* Create target variables (4-hour prediction horizon)
* Perform correlation analysis with target
* Save engineered features for model training

## Inputs
* **Data Source:** `inputs/datasets/processed/bitcoin_clean.csv`
* **Records:** ~92,180 hourly records (2014-2025)

## Outputs
* **Engineered features CSV:** `inputs/datasets/processed/bitcoin_features.csv`
* **Correlation study** showing which features correlate with profitability
* **Ready-to-train dataset** for classification model

## Features Created
**Price-based:**
* Returns (1h, 4h, 12h, 24h)

**Technical Indicators:**
* RSI (14-period)
* Moving Average (10-period)
* Volume change

**Target Variables:**
* `target_return`: 4-hour ahead price return (%)
* `target_profitable`: Binary (1 = profitable, 0 = not profitable)
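
A minimal sketch of how the two targets could be built from the closing price. The helper name `add_targets`, the multiplication by 100 (to match the "%" unit above), and the rule "profitable = strictly positive forward return" are assumptions made for illustration, not definitions taken from this notebook.

```python
import pandas as pd

HORIZON = 4  # rows ahead; equals 4 hours only if rows are exactly hourly


def add_targets(df: pd.DataFrame, price_col: str = "CLOSE_PRICE") -> pd.DataFrame:
    """Hypothetical helper: add forward-looking target columns."""
    out = df.copy()
    # Price HORIZON rows in the future, aligned to the current row.
    future_price = out[price_col].shift(-HORIZON)
    # Forward return in percent (assumption: the target is stored as %).
    out["target_return"] = (future_price / out[price_col] - 1) * 100
    # Assumption: "profitable" means a strictly positive forward return.
    # Rows with no future price stay NaN instead of being labelled 0.
    known = out["target_return"].notna()
    out["target_profitable"] = (out["target_return"] > 0).astype(float).where(known)
    return out


demo = pd.DataFrame({"CLOSE_PRICE": [100.0, 101.0, 99.0, 102.0, 105.0, 98.0]})
demo = add_targets(demo)
print(demo)
```

Because the target looks forward, the last `HORIZON` rows have no label and would need to be dropped before training.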

## CRISP-DM Phase
Data Preparation → Feature Engineering

---

## Setup

Import the required dependencies

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

Change to project root

In [2]:
current_dir = os.getcwd()
if 'jupyter_notebooks' in current_dir:
    os.chdir(os.path.dirname(current_dir))
    print(f"Changed directory to: {os.getcwd()}")
else:
    print(f"Already in project root: {current_dir}")

Changed directory to: /Users/ilianamarquez/Documents/vscode-projects/trade-care


---

## Load Cleaned Data

In [3]:
# Load cleaned data
df = pd.read_csv('inputs/datasets/processed/bitcoin_clean.csv')

# Convert timestamp
df['timestamp'] = pd.to_datetime(df['timestamp'])

# Sort by time (critical for time-series features)
df = df.sort_values('timestamp').reset_index(drop=True)

print(f"✓ Data loaded: {len(df):,} rows")
print(f"  Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")
df.head()

✓ Data loaded: 92,180 rows
  Date range: 2014-11-15 06:00:00 to 2025-11-21 23:00:00


| Unnamed: 0 | TIME_UNIX | DATE_STR | HOUR_STR | OPEN_PRICE | HIGH_PRICE | CLOSE_PRICE | LOW_PRICE | VOLUME_FROM | VOLUME_TO | timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1416031200 | 2014-11-15 | 6 | 395.88 | 398.12 | 396.15 | 394.43 | 459.6 | 182309.81 | 2014-11-15 06:00:00 |
| 1 | 1416034800 | 2014-11-15 | 7 | 396.15 | 397.49 | 397.15 | 395.96 | 428.88 | 170256.62 | 2014-11-15 07:00:00 |
| 2 | 1416038400 | 2014-11-15 | 8 | 397.15 | 399.99 | 399.9 | 396.91 | 445.96 | 178280.48 | 2014-11-15 08:00:00 |
| 3 | 1416042000 | 2014-11-15 | 9 | 399.9 | 399.9 | 392.56 | 391.83 | 494.09 | 195473.98 | 2014-11-15 09:00:00 |
| 4 | 1416045600 | 2014-11-15 | 10 | 392.56 | 393.1 | 391.83 | 390.03 | 437.84 | 171654.03 | 2014-11-15 10:00:00 |
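
The lag-based features in this notebook treat one row as one hour, so it is worth confirming that consecutive timestamps really are one hour apart. The date range above spans roughly 96,600 hourly slots, more than the 92,180 rows loaded, which suggests some hours may be missing. The sketch below is one way to list such gaps; `summarize_gaps` is a hypothetical helper, not part of the project code.

```python
import pandas as pd


def summarize_gaps(timestamps: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: list rows whose distance to the previous row is not 1 hour."""
    ts = pd.to_datetime(timestamps).sort_values().reset_index(drop=True)
    step = ts.diff()  # time elapsed since the previous row (NaT for row 0)
    # Keep only steps that differ from one hour; drop the NaT of the first row.
    gaps = step[step != pd.Timedelta(hours=1)].dropna()
    return pd.DataFrame({"gap_end": ts.loc[gaps.index], "gap_size": gaps})


# Toy example with 08:00 and 09:00 missing.
demo_ts = pd.Series(pd.to_datetime([
    "2014-11-15 06:00", "2014-11-15 07:00",
    "2014-11-15 10:00", "2014-11-15 11:00",
]))
report = summarize_gaps(demo_ts)
print(report)
```

On the notebook's data this would be called as `summarize_gaps(df['timestamp'])`. If it reports gaps, a lag of 4 rows spans more than 4 hours at those points, which is worth keeping in mind when reading the return columns.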


---

## Feature Engineering

### 1. Price Returns (Momentum Features)

In [5]:
# Calculate price returns at different time horizons
df['return_1h'] = df['CLOSE_PRICE'].pct_change(1)
df['return_4h'] = df['CLOSE_PRICE'].pct_change(4)
df['return_12h'] = df['CLOSE_PRICE'].pct_change(12)
df['return_24h'] = df['CLOSE_PRICE'].pct_change(24)

print("✓ Price returns calculated")
print(df[['timestamp', 'CLOSE_PRICE', 'return_1h', 'return_4h']].head(30))

✓ Price returns calculated
             timestamp  CLOSE_PRICE  return_1h  return_4h
0  2014-11-15 06:00:00       396.15        NaN        NaN
1  2014-11-15 07:00:00       397.15   0.002524        NaN
2  2014-11-15 08:00:00       399.90   0.006924        NaN
3  2014-11-15 09:00:00       392.56  -0.018355        NaN
4  2014-11-15 10:00:00       391.83  -0.001860  -0.010905
5  2014-11-15 11:00:00       389.82  -0.005130  -0.018457
6  2014-11-15 12:00:00       390.50   0.001744  -0.023506
7  2014-11-15 13:00:00       387.34  -0.008092  -0.013297
8  2014-11-15 14:00:00       376.47  -0.028063  -0.039201
9  2014-11-15 15:00:00       374.82  -0.004383  -0.038479
10 2014-11-15 16:00:00       374.63  -0.000507  -0.040640
11 2014-11-15 17:00:00       370.60  -0.010757  -0.043218
12 2014-11-15 18:00:00       371.20   0.001619  -0.013998
13 2014-11-15 19:00:00       374.51   0.008917  -0.000827
14 2014-11-15 20:00:00       372.78  -0.004619  -0.004938
15 2014-11-15 21:00:00       375.24   0.00659

**Expected Behavior:**
* **NaN values:** The first rows of each column are NaN, which is normal: there is no earlier price to compare against
  - `return_1h`: Row 0 is NaN (needs the price from 1 row earlier)
  - `return_4h`: Rows 0-3 are NaN (needs the price from 4 rows earlier)
  - `return_12h`: Rows 0-11 are NaN (needs the price from 12 rows earlier)
  - `return_24h`: Rows 0-23 are NaN (needs the price from 24 rows earlier)
* **Negative returns:** Expected, since the Bitcoin price sometimes drops
  - Negative value = price decreased (loss period)
  - Positive value = price increased (profit period)
* **Typical ranges** (returns are stored as fractions, so 0.05 means 5%):
  - 1h returns: -5% to +5% (normal volatility)
  - 4h returns: -10% to +10%
  - 24h returns: -20% to +20%
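
To make the NaN pattern and the sign convention concrete, the sketch below recomputes the 4-row return by hand, using the first five closing prices printed above. It shows that `pct_change(k)` is the same as dividing each price by the price `k` rows earlier and subtracting 1.

```python
import numpy as np
import pandas as pd

# First five closing prices from the output above.
close = pd.Series([396.15, 397.15, 399.90, 392.56, 391.83], name="CLOSE_PRICE")

k = 4
via_pct_change = close.pct_change(k)
via_shift = close / close.shift(k) - 1  # the same definition written out

# Both forms agree, including the k leading NaN values.
assert np.allclose(via_pct_change, via_shift, equal_nan=True)

print(via_pct_change.round(6).tolist())
# → [nan, nan, nan, nan, -0.010905]
```

The last value matches `return_4h` for row 4 in the table above: the price fell from 396.15 to 391.83, a drop of about 1.09%.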

**What to Check:**
* ✓ NaN only in first N rows (not scattered throughout)
* ✓ Returns show both positive and negative values
* ✓ No extreme outliers (beyond ±50%) except during crash events

**Validation:**

In [9]:
print("Return Statistics:")
print(df[['return_1h', 'return_4h', 'return_12h', 'return_24h']].describe())
print("\n✓ Ranges look realistic for crypto market")
print(f"✓ NaN count: {df['return_24h'].isna().sum():,} rows (expected from rolling window)")

Return Statistics:
          return_1h     return_4h    return_12h    return_24h
count  92179.000000  92176.000000  92168.000000  92156.000000
mean       0.000088      0.000343      0.001022      0.002082
std        0.007685      0.014849      0.025304      0.036665
min       -0.153590     -0.270015     -0.291917     -0.451041
25%       -0.002340     -0.004583     -0.008479     -0.013067
50%        0.000099      0.000278      0.000638      0.001342
75%        0.002585      0.005364      0.010653      0.017253
max        0.202311      0.235700      0.336792      0.353167

✓ Ranges look realistic for crypto market
✓ NaN count: 24 rows (expected from rolling window)
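
The three items under "What to Check" can also be tested programmatically rather than read off the summary table. The sketch below is one possible way to do that; `check_return_column` and the default 0.5 threshold (the 50% mentioned above, expressed as a fraction) are illustrative assumptions.

```python
import pandas as pd


def check_return_column(returns: pd.Series, lag: int,
                        outlier_threshold: float = 0.5) -> dict:
    """Hypothetical helper: run the three sanity checks on one return column."""
    is_nan = returns.isna()
    return {
        # NaN should occupy exactly the first `lag` rows and nothing else.
        "nan_only_at_start": bool(is_nan.iloc[:lag].all()
                                  and not is_nan.iloc[lag:].any()),
        # A plausible return series contains both gains and losses.
        "has_both_signs": bool((returns > 0).any() and (returns < 0).any()),
        # Number of moves larger than the threshold in either direction.
        "n_outliers": int((returns.abs() > outlier_threshold).sum()),
    }


# Toy example: first six closing prices from this notebook, 1-row returns.
prices = pd.Series([396.15, 397.15, 399.90, 392.56, 391.83, 389.82])
result = check_return_column(prices.pct_change(1), lag=1)
print(result)
```

On the full dataset the same function could be applied to each column with its own lag, for example `check_return_column(df['return_24h'], lag=24)`. A non-zero `n_outliers` is not necessarily an error, but those rows are worth inspecting against known crash dates.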
