# Gold Price Forecasting - Exploration and Analysis

This notebook explores three years of historical gold prices and walks through feature engineering, model comparison, risk analysis, and backtesting for gold price forecasting.

## Table of Contents
1. [Setup and Data Loading](#setup)
2. [Exploratory Data Analysis](#eda)
3. [Feature Engineering](#features)
4. [Model Training and Comparison](#models)
5. [Visualization and Results](#visualization)
6. [Performance Evaluation](#evaluation)
7. [Risk Analysis](#risk)
8. [Backtesting](#backtest)

## 1. Setup and Data Loading {#setup}

In [None]:
# Import necessary libraries
import sys

# Make the project root importable so the `src` and `config` packages resolve
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime, timedelta

# Import our custom modules
from src.data_collection import GoldDataCollector
from src.feature_engineering import FeatureEngineer
from src.models import create_model, ModelEnsemble
from src.visualization import GoldPriceVisualizer
from src.risk_management import RiskManager
from src.backtesting import Backtester, BacktestConfig, MovingAverageCrossoverStrategy
from config.settings import get_settings

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

print("✅ Setup complete!")

In [None]:
# Load configuration
settings = get_settings()
print(f"Configuration loaded - Default model: {settings.model.default_model}")
print(f"Data source: {settings.data_source.default_source}")

In [None]:
# Collect gold price data
collector = GoldDataCollector()

# Define date range
end_date = datetime.now().strftime("%Y-%m-%d")
start_date = (datetime.now() - timedelta(days=1095)).strftime("%Y-%m-%d")  # 3 years

print(f"Collecting data from {start_date} to {end_date}")

# Fetch data
raw_data = collector.get_gold_prices(start_date, end_date)
print(f"✅ Collected {len(raw_data)} records")
print(f"Date range: {raw_data.index.min()} to {raw_data.index.max()}")

# Display basic info
print("\nData shape:", raw_data.shape)
print("\nColumns:", raw_data.columns.tolist())
raw_data.head()

## 2. Exploratory Data Analysis {#eda}

In [None]:
# Basic statistics
print("=== BASIC STATISTICS ===")
print(raw_data.describe())

print("\n=== DATA INFO ===")
raw_data.info()  # info() prints its report directly and returns None

print("\n=== MISSING VALUES ===")
print(raw_data.isnull().sum())

In [None]:
# Price trend visualization
visualizer = GoldPriceVisualizer()
visualizer.plot_price_trend(raw_data, title="Gold Price Trend (3 Years)")

In [None]:
# Price distribution
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Closing price histogram
axes[0, 0].hist(raw_data['Close'], bins=50, alpha=0.7, edgecolor='black')
axes[0, 0].set_title('Distribution of Closing Prices')
axes[0, 0].set_xlabel('Price ($)')
axes[0, 0].set_ylabel('Frequency')

# Daily returns
returns = raw_data['Close'].pct_change().dropna()
axes[0, 1].hist(returns, bins=50, alpha=0.7, edgecolor='black')
axes[0, 1].set_title('Distribution of Daily Returns')
axes[0, 1].set_xlabel('Daily Return')
axes[0, 1].set_ylabel('Frequency')

# Volume distribution
if 'Volume' in raw_data.columns:
    axes[1, 0].hist(raw_data['Volume'], bins=50, alpha=0.7, edgecolor='black')
    axes[1, 0].set_title('Distribution of Volume')
    axes[1, 0].set_xlabel('Volume')
    axes[1, 0].set_ylabel('Frequency')

# Price volatility (30-day rolling std)
volatility = returns.rolling(30).std()
axes[1, 1].plot(volatility.index, volatility.values)
axes[1, 1].set_title('30-Day Rolling Volatility')
axes[1, 1].set_xlabel('Date')
axes[1, 1].set_ylabel('Volatility')

plt.tight_layout()
plt.show()

print(f"Average daily return: {returns.mean():.4f}")
print(f"Daily volatility: {returns.std():.4f}")
print(f"Annualized volatility: {returns.std() * np.sqrt(252):.4f}")
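
The annualized figure above scales the daily standard deviation by the square root of 252, which assumes roughly 252 trading days per year and daily returns that are approximately uncorrelated. A minimal sketch of the same calculation on a small synthetic series (the prices below are made up for illustration, not real gold data):

```python
import numpy as np
import pandas as pd

# Synthetic closing prices (illustrative only, not real gold data)
dates = pd.bdate_range("2024-01-01", periods=6)
close = pd.Series([100.0, 102.0, 101.0, 103.0, 104.0, 102.0], index=dates)

# Simple daily returns: P_t / P_{t-1} - 1; the first value is NaN and is dropped
demo_returns = close.pct_change().dropna()

# Sample standard deviation of daily returns (pandas defaults to ddof=1)
daily_vol = demo_returns.std()

# Scale by sqrt(252), assuming ~252 trading days and uncorrelated daily returns
annualized_vol = daily_vol * np.sqrt(252)

print(f"Daily volatility: {daily_vol:.4f}")
print(f"Annualized volatility: {annualized_vol:.4f}")
```

If returns are strongly autocorrelated, the square-root-of-time scaling can misstate the annual figure, so it is best read as a rough convention.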

## 3. Feature Engineering {#features}

In [None]:
# Create feature engineer
fe = FeatureEngineer()

# Apply feature engineering
print("Creating features...")
featured_data = fe.create_all_features(raw_data)

print("✅ Feature engineering complete!")
print(f"Original features: {raw_data.shape[1]}")
print(f"Engineered features: {featured_data.shape[1]}")
print(f"Added {featured_data.shape[1] - raw_data.shape[1]} new features")

# Count features by category (matched by substring in the column name)
print("\nFeature categories:")
feature_names = featured_data.columns.tolist()
technical_features = [f for f in feature_names if any(x in f for x in ['sma', 'ema', 'rsi', 'macd', 'bb_'])]
time_features = [f for f in feature_names if any(x in f for x in ['year', 'month', 'day', 'quarter', '_sin', '_cos'])]
lag_features = [f for f in feature_names if 'lag' in f]
rolling_features = [f for f in feature_names if 'rolling' in f]

print(f"Technical indicators: {len(technical_features)}")
print(f"Time features: {len(time_features)}")
print(f"Lag features: {len(lag_features)}")
print(f"Rolling features: {len(rolling_features)}")
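
The internals of `FeatureEngineer.create_all_features` are not shown in this notebook. As a rough illustration of the kinds of columns the name filters above look for, here is a sketch built with plain pandas. The function `add_example_features` and every column name in it are hypothetical and are not the project's actual implementation.

```python
import numpy as np
import pandas as pd

def add_example_features(df: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Hypothetical stand-in for a few features; not the project's FeatureEngineer."""
    out = df.copy()
    # Technical indicator: simple moving average of the close
    out[f"sma_{window}"] = out["Close"].rolling(window).mean()
    # Lag feature: the previous row's close
    out["close_lag_1"] = out["Close"].shift(1)
    # Rolling statistic: standard deviation of daily returns
    out[f"rolling_std_{window}"] = out["Close"].pct_change().rolling(window).std()
    # Cyclical time feature: month encoded on the unit circle
    month = out.index.month.to_numpy()
    out["month_sin"] = np.sin(2 * np.pi * month / 12)
    out["month_cos"] = np.cos(2 * np.pi * month / 12)
    return out

# Synthetic prices, for illustration only
dates = pd.bdate_range("2024-01-01", periods=10)
demo = pd.DataFrame({"Close": np.linspace(100.0, 109.0, 10)}, index=dates)
demo_features = add_example_features(demo)
print(demo_features.tail(3))
```

Rolling and lag features leave `NaN` values in the first rows of the frame, which is why the modeling step later drops incomplete rows.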

In [None]:
# Visualize technical indicators
visualizer.plot_technical_indicators(featured_data)

In [None]:
# Correlation analysis
# Select numeric features for correlation
numeric_features = featured_data.select_dtypes(include=[np.number]).columns.tolist()
# Keep only the first 30 numeric columns so the correlation matrix stays readable
sample_features = numeric_features[:30]

visualizer.plot_correlation_matrix(featured_data, sample_features)
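
Taking the first 30 columns depends on column order. One alternative, sketched here on a small synthetic frame, is to rank features by absolute correlation with `Close` and plot only the top few. The column names below are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic frame with hypothetical feature columns
close = np.arange(10, dtype=float)
demo = pd.DataFrame({
    "Close": close,
    "close_squared": close ** 2,             # strongly related to Close
    "alternating": np.tile([1.0, -1.0], 5),  # weakly related to Close
    "constant_offset": close + 5.0,          # perfectly correlated with Close
})

# Absolute Pearson correlation of every feature with the target column
ranked = (
    demo.corr()["Close"]
    .drop("Close")
    .abs()
    .sort_values(ascending=False)
)

# Keep the k most correlated features for the plot
top_features = ranked.head(2).index.tolist()
print(ranked)
print(top_features)
```

Pearson correlation only captures linear association, so a low rank here does not by itself mean a feature is useless to a non-linear model.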

## 4. Model Training and Comparison {#models}

In [None]:
# Prepare data for modeling
# Remove rows with NaN values
clean_data = featured_data.dropna()

# Features and target
X = clean_data.select_dtypes(include=[np.number]).drop('Close', axis=1, errors='ignore')
y = clean_data['Close']

print(f"Training data shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Feature names: {X.columns.tolist()[:10]}...")  # Show first 10 features
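
The cell above uses the same-day `Close` as the target, so any feature computed from that day's prices already carries information about it. If the aim is to forecast a future price, one common option is to shift the target forward by the forecast horizon. A minimal sketch on synthetic data, where the `sma_2` column and the one-day horizon are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic data with one hypothetical feature column
dates = pd.bdate_range("2024-01-01", periods=6)
demo = pd.DataFrame(
    {
        "Close": [100.0, 101.0, 103.0, 102.0, 105.0, 106.0],
        "sma_2": [np.nan, 100.5, 102.0, 102.5, 103.5, 105.5],
    },
    index=dates,
)

horizon = 1  # forecast one trading day ahead (an assumption)

# Target: the close `horizon` rows later, aligned to the row where features are known
demo["target"] = demo["Close"].shift(-horizon)

# Drop warm-up rows with missing features and the last `horizon` rows with no target
supervised = demo.dropna()
X_demo = supervised.drop(columns=["target"])
y_demo = supervised["target"]

print(X_demo.join(y_demo))
```

With this layout, today's `Close` becomes a legitimate feature, because it is known at the time the forecast is made.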

In [None]:
# Split data chronologically (no shuffling): the first 80% of rows are used for
# training and the most recent 20% for testing, which avoids look-ahead leakage
split_idx = int(0.8 * len(X))
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
print(f"Training period: {X_train.index[0]} to {X_train.index[-1]}")
print(f"Test period: {X_test.index[0]} to {X_test.index[-1]}")
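
A single 80/20 split gives one estimate of out-of-sample error. Walk-forward validation repeats the idea over several consecutive test blocks, each preceded by all earlier data. The helper `expanding_window_splits` below is a hypothetical sketch of that scheme; scikit-learn's `TimeSeriesSplit` provides a comparable splitter.

```python
import numpy as np

def expanding_window_splits(n_samples: int, n_splits: int = 3, test_size: int = 2):
    """Yield (train_idx, test_idx) pairs where each test block follows its training data."""
    for i in range(n_splits):
        # Test blocks are laid out back to back at the end of the series
        test_start = n_samples - (n_splits - i) * test_size
        if test_start <= 0:
            raise ValueError("Not enough samples for the requested splits")
        train_idx = np.arange(0, test_start)
        test_idx = np.arange(test_start, test_start + test_size)
        yield train_idx, test_idx

splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(f"train: 0..{train_idx[-1]}  test: {test_idx[0]}..{test_idx[-1]}")
```

The index arrays can be passed to `X.iloc[...]` and `y.iloc[...]` to fit and score a model on each fold.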

In [None]:
# Train multiple models
models = {
    'Linear Regression': create_model('linear'),
    'Ridge Regression': create_model('ridge', alpha=1.0),
    'Random Forest': create_model('random_forest', n_estimators=100),
    # Add more models as needed
}

# Train models and collect results
results = {}
predictions = {}

for name, model in models.items():
    print(f"\nTraining {name}...")
    
    try:
        # Train model
        model.fit(X_train, y_train)
        
        # Make predictions
        pred = model.predict(X_test)
        predictions[name] = pred
        
        # Evaluate
        metrics = model.evaluate(X_test, y_test)
        results[name] = metrics
        
        print(f"✅ {name} - RMSE: {metrics['rmse']:.2f}, R²: {metrics['r2']:.3f}")
        
    except Exception as e:
        print(f"❌ Error training {name}: {e}")

print(f"\n✅ Model training complete! Trained {len(results)} models.")
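
The loop above reads `rmse` and `r2` from the dictionary returned by `model.evaluate`, whose implementation is not shown here. The sketch below computes the standard definitions of both metrics with NumPy; the function `regression_metrics` is hypothetical and may differ from what the project's models return.

```python
import numpy as np

def regression_metrics(y_true, y_pred) -> dict:
    """Standard RMSE and R² definitions (hypothetical helper, not the project's evaluate)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    # Root mean squared error, in the same units as the target
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    # R² = 1 - SS_res / SS_tot, relative to always predicting the mean
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"rmse": rmse, "r2": r2}

demo_metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(demo_metrics)
```

For trending price levels, R² can look high even for a naive forecast, so comparing RMSE against a simple baseline such as the previous close is a useful sanity check.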

## 5. Visualization and Results {#visualization}

In [None]:
# Compare model performance
if results:
    visualizer.plot_model_comparison(results)

In [None]:
# Plot predictions vs actual
if predictions:
    visualizer.plot_predictions_vs_actual(y_test, predictions, 
                                         title="Model Predictions vs Actual Gold Prices")

In [None]:
# Feature importance (for tree-based models)
for name, model in models.items():
    if hasattr(model, 'get_feature_importance'):
        try:
            importance = model.get_feature_importance()
            visualizer.plot_feature_importance(importance, top_n=15, 
                                             title=f"Feature Importance - {name}")
        except Exception as e:
            print(f"Could not get feature importance for {name}: {e}")

## 6. Performance Evaluation {#evaluation}

In [None]:
# Detailed performance analysis
if results:
    print("=== MODEL PERFORMANCE SUMMARY ===")
    performance_df = pd.DataFrame(results).T
    print(performance_df.round(4))
    
    # Find best model
    best_model = performance_df['r2'].idxmax()
    print(f"\n🏆 Best model: {best_model} (R² = {performance_df.loc[best_model, 'r2']:.4f})")

In [None]:
# Residual analysis for best model
if predictions and results:
    best_model_name = performance_df['r2'].idxmax()
    best_predictions = predictions[best_model_name]
    
    visualizer.plot_residuals(y_test, best_predictions, best_model_name)

## 7. Risk Analysis {#risk}

In [None]:
# Risk analysis
risk_manager = RiskManager()

# Generate risk report
risk_report = risk_manager.generate_risk_report(featured_data['Close'])

print("=== RISK ANALYSIS REPORT ===")
print("\nReturn Metrics:")
for key, value in risk_report['return_metrics'].items():
    print(f"  {key}: {value:.4f}")

print("\nRisk Metrics:")
for key, value in risk_report['risk_metrics'].items():
    print(f"  {key}: {value:.4f}")

print("\nPosition Sizing:")
for key, value in risk_report['position_sizing'].items():
    print(f"  {key}: {value:.4f}")

In [None]:
# Drawdown analysis
drawdown_info = risk_manager.calculate_maximum_drawdown(featured_data['Close'])

fig, ax = plt.subplots(figsize=(12, 6))
ax.fill_between(drawdown_info['drawdown_series'].index, 
                drawdown_info['drawdown_series'].values, 0, 
                alpha=0.7, color='red', label='Drawdown')
ax.set_title('Gold Price Drawdown Over Time')
ax.set_xlabel('Date')
ax.set_ylabel('Drawdown (%)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Maximum Drawdown: {drawdown_info['max_drawdown']:.2f}%")
print(f"Max Drawdown Date: {drawdown_info['max_drawdown_date']}")
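
Drawdown measures how far the price has fallen from its running peak. The return structure and sign convention of `risk_manager.calculate_maximum_drawdown` are not shown in this notebook, so the sketch below is only one common formulation, using negative percentages and synthetic prices.

```python
import pandas as pd

# Synthetic prices, for illustration only
prices = pd.Series(
    [100.0, 120.0, 90.0, 110.0, 130.0, 104.0],
    index=pd.bdate_range("2024-01-01", periods=6),
)

# Highest price seen so far at each date
running_peak = prices.cummax()

# Percentage decline from the running peak (0 at new highs, negative otherwise)
drawdown_pct = (prices / running_peak - 1.0) * 100.0

# The deepest decline and the date on which it occurred
max_drawdown = drawdown_pct.min()
max_drawdown_date = drawdown_pct.idxmin()

print(f"Maximum drawdown: {max_drawdown:.2f}% on {max_drawdown_date.date()}")
```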

## 8. Backtesting {#backtest}

In [None]:
# Backtesting with simple moving average strategy
config = BacktestConfig(
    initial_capital=100000,
    commission=0.001,
    position_fraction=0.1
)

# Create strategy
strategy = MovingAverageCrossoverStrategy(short_window=10, long_window=20)

# Run backtest
backtester = Backtester(config)
backtest_results = backtester.run_backtest(featured_data, strategy)

print("=== BACKTESTING RESULTS ===")
print(f"Strategy: {backtest_results['strategy_name']}")
print(f"Total Return: {backtest_results['total_return']:.2f}%")
print(f"Annualized Return: {backtest_results['annualized_return']:.2f}%")
print(f"Volatility: {backtest_results['volatility']:.2f}%")
print(f"Sharpe Ratio: {backtest_results['sharpe_ratio']:.2f}")
print(f"Max Drawdown: {backtest_results['max_drawdown']['max_drawdown']:.2f}%")
print(f"Total Trades: {backtest_results['total_trades']}")
print(f"Win Rate: {backtest_results['win_rate']:.2f}%")
print(f"Final Portfolio Value: ${backtest_results['final_portfolio_value']:.2f}")
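
The logic inside `MovingAverageCrossoverStrategy` is not shown here. A typical crossover rule holds a long position while the short moving average is above the long one. The sketch below is a hypothetical version of that rule on synthetic prices, with the position delayed by one bar so that a signal computed at one close is only acted on at the next bar.

```python
import pandas as pd

def crossover_signals(close: pd.Series, short_window: int, long_window: int) -> pd.DataFrame:
    """Hypothetical long-only crossover rule; not the project's strategy class."""
    short_ma = close.rolling(short_window).mean()
    long_ma = close.rolling(long_window).mean()
    # 1 while the short MA is above the long MA; comparisons with NaN give False (0)
    raw_position = (short_ma > long_ma).astype(int)
    # Delay by one bar so the position never uses the same bar's close (no look-ahead)
    position = raw_position.shift(1).fillna(0).astype(int)
    # +1 marks an entry, -1 marks an exit
    trade = position.diff().fillna(0).astype(int)
    return pd.DataFrame(
        {"short_ma": short_ma, "long_ma": long_ma, "position": position, "trade": trade}
    )

# Synthetic prices, for illustration only
close = pd.Series(
    [10, 9, 8, 9, 11, 13, 12, 10, 8, 7],
    dtype=float,
    index=pd.bdate_range("2024-01-01", periods=10),
)
signals = crossover_signals(close, short_window=2, long_window=4)
print(signals)
```

Without the one-bar delay, a backtest would trade on information that is only available after the bar has closed, which tends to overstate returns.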

In [None]:
# Plot portfolio performance
portfolio_values = backtest_results['portfolio_values']
buy_hold_values = (featured_data['Close'] / featured_data['Close'].iloc[0]) * config.initial_capital

fig, ax = plt.subplots(figsize=(12, 8))

ax.plot(portfolio_values.index, portfolio_values.values, 
        label=f'{strategy.name} Strategy', linewidth=2)
ax.plot(buy_hold_values.index, buy_hold_values.values, 
        label='Buy & Hold', linewidth=2, alpha=0.7)

ax.set_title('Strategy Performance vs Buy & Hold')
ax.set_xlabel('Date')
ax.set_ylabel('Portfolio Value ($)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate buy & hold performance
buy_hold_return = (buy_hold_values.iloc[-1] / buy_hold_values.iloc[0] - 1) * 100
print(f"\nBuy & Hold Return: {buy_hold_return:.2f}%")
print(f"Strategy Outperformance: {backtest_results['total_return'] - buy_hold_return:.2f} percentage points")

## Summary and Conclusions

This notebook covered the following steps of a gold price forecasting workflow:

1. **Data Collection**: Historical gold price data spanning 3 years
2. **Feature Engineering**: Creation of technical indicators, time features, and statistical measures
3. **Model Training**: Comparison of multiple machine learning models
4. **Performance Evaluation**: Detailed analysis of model accuracy and reliability
5. **Risk Analysis**: Comprehensive risk metrics and drawdown analysis
6. **Backtesting**: Strategy performance evaluation with transaction costs

### Key Findings:
- [To be filled based on actual results]
- [Model performance insights]
- [Risk characteristics]
- [Strategy effectiveness]

### Next Steps:
1. Experiment with additional features (economic indicators, sentiment data)
2. Implement more sophisticated models (neural networks, ensemble methods)
3. Develop more complex trading strategies
4. Integrate real-time data feeds
5. Deploy models in production environment