# Data Loading and Preprocessing

This notebook demonstrates how to load, validate, and preprocess time series data in TimeSmith.

## What You'll Learn

- Loading and validating time series data
- Handling missing dates and values
- Resampling time series
- Outlier detection and removal
- Data quality checks

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from timesmith import (
    MissingDateFiller,
    MissingValueFiller,
    Resampler,
    OutlierRemover,
    assert_series,
    is_series,
)

np.random.seed(42)
print("Data loading and preprocessing tools loaded!")

## 1. Loading and Validating Data

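TimeSmith provides validators for checking that an object is a well-formed time series before you process it. The cell below builds a synthetic daily series, checks it with `is_series`, and validates it with `assert_series`.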

In [None]:
# Create sample data
dates = pd.date_range('2020-01-01', periods=100, freq='D')
values = np.random.randn(100).cumsum() + 100
y = pd.Series(values, index=dates, name='temperature')

print(f"Is SeriesLike: {is_series(y)}")
print(f"Data shape: {y.shape}")
print("\nFirst few values:")
print(y.head())

# Validate
assert_series(y, name='y')
print("\nData validation passed!")
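For intuition, the kind of checks a series validator typically performs can be sketched with plain pandas. The helper `check_series` below is hypothetical and is not TimeSmith's implementation; the rules that `assert_series` actually enforces may differ.

```python
import numpy as np
import pandas as pd


def check_series(y, name="y"):
    """Hypothetical validator: raise if `y` is not a usable time series."""
    if not isinstance(y, pd.Series):
        raise TypeError(f"{name} must be a pandas Series, got {type(y).__name__}")
    if not isinstance(y.index, pd.DatetimeIndex):
        raise TypeError(f"{name} must have a DatetimeIndex")
    if not y.index.is_monotonic_increasing:
        raise ValueError(f"{name} index must be sorted in increasing order")
    if y.index.has_duplicates:
        raise ValueError(f"{name} index contains duplicate timestamps")
    return y


dates = pd.date_range("2020-01-01", periods=100, freq="D")
y_demo = pd.Series(np.arange(100.0), index=dates, name="temperature")

check_series(y_demo)  # a sorted, unique DatetimeIndex passes silently

try:
    check_series(y_demo.iloc[::-1])  # reversed index is not increasing
except ValueError as err:
    print(f"Rejected: {err}")
```

Raising on failure, rather than returning `False`, makes a bad input stop the pipeline at the point where it enters.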

## 2. Handling Missing Dates

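Fill gaps in the time index. The cell below drops five consecutive days from the series and uses `MissingDateFiller` to restore them at daily frequency.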

In [None]:
# Introduce missing dates
y_missing_dates = y.drop(y.index[20:25])

print(f"Before: {len(y)} points")
print(f"After removing dates: {len(y_missing_dates)} points")

# Fill missing dates
filler = MissingDateFiller(frequency='D', method='forward')
y_filled = filler.fit_transform(y_missing_dates)

print(f"After filling: {len(y_filled)} points")
print("\nFilled values:")
print(y_filled.iloc[20:26])
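A rough pandas-only equivalent of this step is sketched below. It assumes that a forward-filling date filler reindexes the series onto a complete daily grid and carries the last observation forward; that behavior is an assumption about `MissingDateFiller`, not something confirmed here.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=100, freq="D")
y_demo = pd.Series(np.arange(100.0), index=dates)

# Drop five consecutive days to create a gap in the index
y_gappy = y_demo.drop(y_demo.index[20:25])

# Build the complete daily index, then forward-fill the re-inserted rows
full_index = pd.date_range(y_gappy.index.min(), y_gappy.index.max(), freq="D")
y_regular = y_gappy.reindex(full_index).ffill()

print(f"With gap: {len(y_gappy)} points, regular grid: {len(y_regular)} points")
print(y_regular.iloc[19:26])
```

In this sketch the five restored days all repeat the value from position 19, which is the last observation before the gap.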

## 3. Handling Missing Values

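Fill `NaN` values in the data. The cell below blanks out five observations and fills them with `MissingValueFiller(method='linear')`.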

In [None]:
# Add missing values to a copy so y_filled stays intact
y_with_nans = y_filled.copy()
y_with_nans.iloc[10:15] = np.nan

print(f"Missing values before: {y_with_nans.isna().sum()}")

# Fill missing values
value_filler = MissingValueFiller(method='linear')
y_complete = value_filler.fit_transform(y_with_nans)

print(f"Missing values after: {y_complete.isna().sum()}")
print("\nFilled values:")
print(y_complete.iloc[10:16])
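Linear filling can be illustrated with pandas' own `Series.interpolate`. This is a sketch of the technique only; whether `MissingValueFiller(method='linear')` delegates to it is an assumption.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=20, freq="D")
y_demo = pd.Series(np.arange(20.0) * 2, index=dates)  # perfectly linear series

# Blank out five consecutive observations on a copy
y_nan = y_demo.copy()
y_nan.iloc[10:15] = np.nan

# Draw a straight line between the nearest known neighbors
y_interp = y_nan.interpolate(method="linear")

print(f"NaNs before: {y_nan.isna().sum()}, after: {y_interp.isna().sum()}")
print(y_interp.iloc[9:16])
```

Because the demo series is itself linear, interpolation recovers the original values exactly. Note that `method="linear"` treats observations as equally spaced and ignores the index; pandas also offers `method="time"` for irregularly spaced timestamps.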

## 4. Resampling

Change the frequency of your time series. The cell below aggregates the daily series into weekly means and plots both versions.

In [None]:
# Resample to weekly
resampler = Resampler(target_frequency='W', method='mean')
y_weekly = resampler.fit_transform(y_complete)

print(f"Daily: {len(y_complete)} points")
print(f"Weekly: {len(y_weekly)} points")

# Visualize
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
axes[0].plot(y_complete.index, y_complete.values, linewidth=1.5, label='Daily')
axes[0].set_title('Daily Data', fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(y_weekly.index, y_weekly.values, linewidth=2, marker='o', label='Weekly')
axes[1].set_title('Weekly Resampled Data', fontweight='bold')
axes[1].set_xlabel('Date')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
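The same daily-to-weekly aggregation can be sketched with pandas' `resample`. Treating it as equivalent to `Resampler(target_frequency='W', method='mean')` is an assumption; the sketch is here to show how weekly bins are formed.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=100, freq="D")
y_demo = pd.Series(np.arange(100.0), index=dates)

# "W" means weeks ending on Sunday; each bin is labeled by its end date
y_weekly_demo = y_demo.resample("W").mean()

print(f"Daily: {len(y_demo)} points, weekly: {len(y_weekly_demo)} points")
print(y_weekly_demo.head(3))
```

The series starts on a Wednesday and ends on a Thursday, so the first and last bins cover partial weeks (five and four days). Their means rest on fewer observations than those of the full weeks in between.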

## 5. Outlier Removal

Detect and remove outliers using the IQR (interquartile range) method.

In [None]:
# Add outliers, using the mean and std of the clean series for both
y_with_outliers = y_complete.copy()
mu, sigma = y_complete.mean(), y_complete.std()
y_with_outliers.iloc[30] = mu + 5 * sigma
y_with_outliers.iloc[60] = mu - 4 * sigma

print("Outliers added at positions: 30, 60")

# Remove outliers
outlier_remover = OutlierRemover(method='iqr', factor=1.5)
y_clean = outlier_remover.fit_transform(y_with_outliers)

print(f"\nBefore: {len(y_with_outliers)} points")
print(f"After: {len(y_clean)} points")
print(f"Outliers removed: {len(y_with_outliers) - len(y_clean)}")

# Visualize
plt.figure(figsize=(12, 6))
plt.plot(y_with_outliers.index, y_with_outliers.values, 
         linewidth=1.5, alpha=0.5, label='With Outliers', color='red')
plt.plot(y_clean.index, y_clean.values, 
         linewidth=2, label='Cleaned', color='steelblue')
plt.title('Outlier Removal', fontsize=14, fontweight='bold')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
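The IQR rule itself is simple to write out with pandas: a point is kept only if it lies within `factor` times the interquartile range of the first and third quartiles. The sketch below shows the standard rule; it is not necessarily how `OutlierRemover` is implemented.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2020-01-01", periods=100, freq="D")
y_demo = pd.Series(rng.normal(100, 1, size=100), index=dates)
y_demo.iloc[30] = 150.0  # injected high outlier
y_demo.iloc[60] = 50.0   # injected low outlier

# Fences: Q1 - factor * IQR and Q3 + factor * IQR
q1, q3 = y_demo.quantile(0.25), y_demo.quantile(0.75)
iqr = q3 - q1
factor = 1.5
lower, upper = q1 - factor * iqr, q3 + factor * iqr

# Keep only the points inside the fences
is_inlier = y_demo.between(lower, upper)
y_inliers = y_demo[is_inlier]

print(f"Fences: [{lower:.2f}, {upper:.2f}]")
print(f"Removed {len(y_demo) - len(y_inliers)} of {len(y_demo)} points")
```

Dropping rows reintroduces gaps in the date index. If later steps need a regular grid, one option is to mask instead of drop (`y_demo.where(is_inlier)`) and then fill the resulting `NaN` values.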

## Summary

You've learned:
- How to validate time series data
- How to handle missing dates and values
- How to resample time series
- How to detect and remove outliers

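**Best Practices:**
- Always validate data before processing
- Choose a fill method that matches your data: forward fill repeats the last observation, while linear interpolation assumes a steady change between known points
- Choose a resampling aggregation based on your analysis needs, for example the mean for levels such as temperature or the sum for totals
- Be careful when removing outliers, since they might be important signals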