# 1. RuneScape Item Price Exploration

This notebook explores the raw price data collected from the OSRS Wiki API. The goal is to visualize the data, understand its properties, and carry out the preprocessing and feature engineering needed before applying machine learning models.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (18, 8)

## 1. Load and Clean Data

First, we load the raw JSON data, parse it into a pandas DataFrame, and perform initial cleaning steps like converting the timestamp and setting it as the index.

In [None]:
RAW_DATA_PATH = Path('../data/raw')
ITEM_FILE = RAW_DATA_PATH / 'twisted_bow_1h.json'

if not ITEM_FILE.exists():
    print(f"File not found: {ITEM_FILE}")
    print("Please run `python -m src.main collect` from the root directory first.")
else:
    # The API data is nested under a 'data' key
    df_raw = pd.read_json(ITEM_FILE)
    df = pd.json_normalize(df_raw['data'])
    
    # Convert unix timestamp to datetime and set as index
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
    df = df.set_index('timestamp')
    
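    # Create a single average price column for simplicity.
    # Forward-fill hours with no recorded trades rather than filling with 0,
    # which would show up as fake price crashes in plots and rolling statistics.
    df['avgPrice'] = df[['avgHighPrice', 'avgLowPrice']].mean(axis=1).ffill()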
    
    # Check for missing values
    print("Missing values check:")
    print(df.isnull().sum())
    
    print("\nData Info:")
    df.info()  # info() prints its summary directly and returns None
    
    print("\nData Head:")
    display(df.head())
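
To make the parsing steps concrete, here is a minimal sketch that runs them on a small in-memory payload instead of the real file. The payload is hypothetical: its values are made up, and its layout (a list of records under `data` with `timestamp`, `avgHighPrice`, and `avgLowPrice`) is assumed from the columns used above.

```python
import pandas as pd

# Hypothetical payload mimicking the assumed file layout; the values are made up.
sample = {
    "data": [
        {"timestamp": 1700000000, "avgHighPrice": 1_500_000_000, "avgLowPrice": 1_490_000_000},
        {"timestamp": 1700003600, "avgHighPrice": None, "avgLowPrice": 1_492_000_000},
        {"timestamp": 1700007200, "avgHighPrice": None, "avgLowPrice": None},
    ]
}

demo = pd.json_normalize(sample["data"])
demo["timestamp"] = pd.to_datetime(demo["timestamp"], unit="s")
demo = demo.set_index("timestamp")

# mean(axis=1) skips NaN, so a row with only one side still gets a price;
# a row with neither side stays NaN.
demo["avgPrice"] = demo[["avgHighPrice", "avgLowPrice"]].mean(axis=1)
print(demo["avgPrice"])
```

Rows where both sides are missing are the ones that need an explicit decision (fill, interpolate, or drop) before any rolling statistics are computed.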

## 2. Initial Visualization

A simple line plot gives a quick first look at the time series: the overall level, the trend, and any obvious gaps or spikes.

In [None]:
if 'df' in locals():
    df['avgPrice'].plot(title='Twisted Bow Price Over Time', lw=2)
    plt.ylabel('Price (GP)')
    plt.xlabel('Date')
    plt.show()

## 3. Feature Engineering

Now we create features that our model can use to make predictions. Raw price alone is not enough; we need to provide context.

### 3.1 Define the Target Variable

The most important step is defining what we want to predict. A common goal in time series forecasting is to predict the next value. We'll create a `target` column that is the `avgPrice` of the *next* time step.

In [None]:
if 'df' in locals():
    # Shift the price from the next period to the current row
    df['target'] = df['avgPrice'].shift(-1)
    
    # The last row will have a NaN target, which is expected
    display(df[['avgPrice', 'target']].head())
    display(df[['avgPrice', 'target']].tail())
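
The effect of `shift(-1)` is easiest to see on a handful of rows. The sketch below uses a made-up four-hour series (the prices are illustrative only) and shows each row's `target` picking up the following hour's price.

```python
import pandas as pd

# Made-up hourly prices, purely to illustrate the alignment.
prices = pd.Series(
    [100.0, 102.0, 101.0, 105.0],
    index=pd.date_range("2024-01-01", periods=4, freq="h"),
    name="avgPrice",
)

toy = prices.to_frame()
# Each row's target is the avgPrice of the row that follows it.
toy["target"] = toy["avgPrice"].shift(-1)
print(toy)
```

The final row has no following hour, so its `target` is `NaN`.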

### 3.2 Rolling Features (Trend and Volatility)

- **Rolling Mean (Moving Average):** This smooths out short-term fluctuations and helps identify longer-term trends. Crossovers between short-term and long-term moving averages are classic trading signals.
- **Rolling Standard Deviation:** This measures volatility. Higher volatility means larger price swings, and so both more risk and more potential reward.

In [None]:
if 'df' in locals():
    # Using 1-day and 7-day windows for our 1-hour data.
    # Windows count rows, so 24 and 168 rows span 1 and 7 days only if no hours are missing.
    df['rolling_mean_24h'] = df['avgPrice'].rolling(window=24).mean()
    df['rolling_mean_168h'] = df['avgPrice'].rolling(window=168).mean() # 168 hours = 7 days
    
    df['rolling_std_24h'] = df['avgPrice'].rolling(window=24).std()

    # Plotting the rolling means
    fig, ax = plt.subplots()
    ax.plot(df.index, df['avgPrice'], label='Average Price', color='lightblue', lw=1)
    ax.plot(df.index, df['rolling_mean_24h'], label='24h Rolling Mean', color='orange', lw=2)
    ax.plot(df.index, df['rolling_mean_168h'], label='7d Rolling Mean', color='red', lw=2)
    ax.set_title('Price vs. Rolling Means')
    ax.set_ylabel('Price (GP)')
    ax.legend()
    plt.show()
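
The crossover idea mentioned above can be sketched as a simple sign-change check. This is an illustration on a synthetic random walk, not a trading rule taken from this project; the names `position` and `crossover` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic hourly prices (a seeded random walk), purely for illustration.
rng = np.random.default_rng(0)
idx = pd.date_range("2024-01-01", periods=24 * 30, freq="h")
price = pd.Series(1_000 + rng.normal(0, 5, len(idx)).cumsum(), index=idx)

short_ma = price.rolling(window=24).mean()
long_ma = price.rolling(window=168).mean()

# +1 where the short average is above the long one, -1 otherwise,
# and NaN during the warm-up period where either average is undefined.
valid = short_ma.notna() & long_ma.notna()
position = pd.Series(np.where(short_ma > long_ma, 1.0, -1.0), index=idx).where(valid)

# A crossover is any hour where the sign differs from the previous hour.
crossover = position.diff().fillna(0) != 0
print(f"Crossovers detected: {int(crossover.sum())}")
```

A boolean column like `crossover` could be added as a feature, but whether it carries any signal for this item is something the data would have to show.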

### 3.3 Time-Based Features

Player activity often follows weekly and daily cycles. For example, more players are online during evenings and weekends, which can affect prices. We can capture this by extracting features from the timestamp.

In [None]:
if 'df' in locals():
    df['hour'] = df.index.hour
    df['day_of_week'] = df.index.dayofweek # Monday=0, Sunday=6
    df['day_of_year'] = df.index.dayofyear
    df['is_weekend'] = (df.index.dayofweek >= 5).astype(int)
    
    display(df[['hour', 'day_of_week', 'is_weekend']].head())

Let's check visually whether there is a weekly pattern. A box plot per weekday shows how the price distribution shifts across the week.

In [None]:
if 'df' in locals():
    fig, ax = plt.subplots(figsize=(12, 6))
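    # Pin the category order so every weekday gets a slot, even if it is absent from the data
    sns.boxplot(x='day_of_week', y='avgPrice', data=df, order=list(range(7)), ax=ax)
    # Fix the tick positions before labelling so the names always line up with the days
    ax.set_xticks(range(7))
    ax.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])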
    ax.set_title('Average Price by Day of the Week')
    plt.show()
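
A numeric summary can complement the box plot when the differences between days are too small to judge by eye. The sketch below uses synthetic data with no weekly pattern built in, so it only demonstrates the mechanics; `toy` and `summary` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Four full weeks of synthetic hourly prices; 2024-01-01 is a Monday.
rng = np.random.default_rng(1)
idx = pd.date_range("2024-01-01", periods=24 * 28, freq="h")
toy = pd.DataFrame({"avgPrice": 1_000 + rng.normal(0, 10, len(idx))}, index=idx)
toy["day_of_week"] = toy.index.dayofweek  # Monday=0, Sunday=6

# One row per weekday with the statistics the box plot draws, plus the sample size.
summary = toy.groupby("day_of_week")["avgPrice"].agg(["median", "mean", "std", "count"])
summary.index = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
print(summary.round(1))
```

The `count` column is worth checking on real data: a weekday with far fewer observations than the others makes its median less trustworthy.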

### 3.4 Lag Features

The price at time `t` is often highly correlated with the price at `t-1`, `t-2`, etc. We can provide these past values directly to the model as features.

In [None]:
if 'df' in locals():
    # We'll create lags for 1 hour, 24 hours (1 day), and 168 hours (1 week) ago
    for lag in [1, 24, 168]:
        df[f'lag_{lag}h'] = df['avgPrice'].shift(lag)
        
    display(df[['avgPrice', 'lag_1h', 'lag_24h']].tail())
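
One way to check how strongly the price relates to its own past is `Series.autocorr`, which correlates a series with a lagged copy of itself. The sketch below runs it on a synthetic random walk rather than the real data, so the printed values say nothing about the Twisted Bow.

```python
import numpy as np
import pandas as pd

# Synthetic hourly prices (a seeded random walk), purely for illustration.
rng = np.random.default_rng(42)
idx = pd.date_range("2024-01-01", periods=24 * 60, freq="h")
price = pd.Series(1_000 + rng.normal(0, 5, len(idx)).cumsum(), index=idx)

# Same lags as the features above: 1 hour, 1 day, 1 week.
for lag in [1, 24, 168]:
    print(f"lag {lag:>3}h autocorrelation: {price.autocorr(lag=lag):.3f}")
```

A random walk is highly autocorrelated by construction, so a high value here does not by itself mean the next price is predictable. Running the same check on hourly price changes (`price.diff()`) is usually the more informative test.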

## 4. Final Processed DataFrame

After creating all these features, our DataFrame has many `NaN` (Not a Number) values at the beginning (from rolling features and lags) and one at the end (for the target). We must drop these rows before training a model.

In [None]:
if 'df' in locals():
    # Select only the features we will use for the model
    # We exclude the original high/low prices and volumes
    features = [
        'avgPrice',
        'rolling_mean_24h',
        'rolling_mean_168h',
        'rolling_std_24h',
        'hour',
        'day_of_week',
        'day_of_year',
        'is_weekend',
        'lag_1h',
        'lag_24h',
        'lag_168h',
        'target' # Keep the target column
    ]
    
    # Drop rows with NaNs in the selected columns only; a plain dropna() would also
    # discard rows where an unused raw column (e.g. avgHighPrice) is missing
    print(f"Shape before dropping NaNs: {df.shape}")
    df_processed = df.dropna(subset=features)
    print(f"Shape after dropping NaNs:  {df_processed.shape}")
    
    df_final = df_processed[features]
    
    print("\nFinal DataFrame Head:")
    display(df_final.head())

## 5. Next Steps

This `df_final` DataFrame is now ready for machine learning!

1.  **Formalize this logic:** Move the feature engineering steps from this notebook into a dedicated `src/feature_engineering.py` script.
2.  **Split the data:** Separate the data into training, validation, and testing sets. **Crucially, for time series, this must be a chronological split, not a random one.**
3.  **Train a model:** Start with a simple baseline like `LinearRegression` or a more powerful model like `XGBoost` to predict the `target` column using all other columns as features.
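
The chronological split in step 2 can be sketched with plain positional slicing. The helper `chronological_split` and its default fractions are hypothetical, chosen only for illustration, and the toy frame stands in for `df_final`.

```python
import numpy as np
import pandas as pd


def chronological_split(frame, train_frac=0.7, val_frac=0.15):
    """Split a time-indexed frame into train/validation/test by position, without shuffling."""
    frame = frame.sort_index()
    n = len(frame)
    train_end = int(n * train_frac)
    val_end = int(n * (train_frac + val_frac))
    return frame.iloc[:train_end], frame.iloc[train_end:val_end], frame.iloc[val_end:]


# Toy stand-in for df_final: an hourly index with a next-step target.
idx = pd.date_range("2024-01-01", periods=1_000, freq="h")
toy = pd.DataFrame({"avgPrice": np.arange(1_000, dtype=float)}, index=idx)
toy["target"] = toy["avgPrice"].shift(-1)
toy = toy.dropna()

train, val, test = chronological_split(toy)
print(len(train), len(val), len(test))
print(train.index.max(), "<", val.index.min())
```

Because every training timestamp precedes every validation and test timestamp, the model is never evaluated on data older than what it was trained on.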