# Part 2: Pandas & Practical Data Manipulation

## Time Series Analysis in Python

---

## Setup and Imports

Before we begin, install the required packages:

```bash
pip install pandas numpy matplotlib seaborn scipy statsmodels scikit-learn
```

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

---

# 1. Time Series Data Structures in Pandas

## Overview

Pandas provides several specialized data structures for working with time series:

1. **Timestamp**: A single point in time (like `datetime`)
2. **DatetimeIndex**: An array of Timestamps (index for time series)
3. **Period**: A fixed-frequency interval (e.g., January 2023)
4. **PeriodIndex**: An array of Periods
5. **Timedelta**: A duration, i.e. the difference between two points in time
6. **TimedeltaIndex**: An array of Timedeltas
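
`Timedelta` and `TimedeltaIndex` are the two structures in this list that the cells below do not exercise, so here is a minimal sketch of how they interact with `Timestamp`. The variable names are illustrative only.

```python
import pandas as pd

# A Timedelta is a duration; adding it to a Timestamp yields another Timestamp
start = pd.Timestamp('2023-01-15')
one_and_a_half_days = pd.Timedelta(days=1, hours=12)
later = start + one_and_a_half_days
print(later)  # 2023-01-16 12:00:00

# Subtracting two Timestamps yields a Timedelta
gap = pd.Timestamp('2023-03-01') - pd.Timestamp('2023-01-01')
print(gap.days)  # 59 (31 days in January + 28 in February 2023)

# A TimedeltaIndex holds many durations; adding it to a Timestamp gives a DatetimeIndex
offsets = pd.to_timedelta([0, 1, 2], unit='D')
print(start + offsets)
```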

## Why Use DatetimeIndex?

- Efficient time-based indexing and slicing
- Automatic date arithmetic
- Built-in resampling and frequency conversion
- Integration with matplotlib for time-based plotting

## 1.1 Creating Timestamps and DatetimeIndex

In [None]:
# Single timestamp
timestamp = pd.Timestamp('2023-01-15')
print("Single Timestamp:")
print(timestamp)
print(f"Type: {type(timestamp)}")

# Creating DatetimeIndex from list
dates_list = ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']
dt_index = pd.DatetimeIndex(dates_list)
print("\nDatetimeIndex from list:")
print(dt_index)

# Using pd.to_datetime
dates_str = pd.Series(['2023/01/01', '2023/01/02', '2023/01/03'])
dates_converted = pd.to_datetime(dates_str)
print("\nConverted with to_datetime:")
print(dates_converted)

## 1.2 Using pd.date_range()

The most common way to create a DatetimeIndex:

In [None]:
# Daily frequency
daily = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
print("Daily dates:")
print(daily)

# Using periods instead of end date
weekly = pd.date_range(start='2023-01-01', periods=8, freq='W')
print("\nWeekly dates (8 periods):")
print(weekly)

# Business days only
business_days = pd.date_range(start='2023-01-01', periods=10, freq='B')
print("\nBusiness days:")
print(business_days)

# Hourly frequency
# Lowercase 'h' works across pandas versions; uppercase 'H' is deprecated since pandas 2.2
hourly = pd.date_range(start='2023-01-01', periods=24, freq='h')
print("\nHourly (first 5):")
print(hourly[:5])

## Common Frequency Strings

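| Code | Description | Code | Description |
|------|-------------|------|-------------|
| D | Calendar day | h | Hourly |
| B | Business day | min | Minute |
| W | Weekly (anchored on Sunday) | s | Second |
| ME | Month end | MS | Month start |
| QE | Quarter end | QS | Quarter start |
| YE | Year end | YS | Year start |
| bh | Business hour | ms | Millisecond |

pandas 2.2 deprecated the older spellings `H`, `T`, `S`, `L`, `BH`, `M`, `Q`, and `Y` for these offsets. On earlier versions, use `M`, `Q`, and `Y` in place of `ME`, `QE`, and `YE`. Period frequencies (section 1.3) still use `M`, `Q`, and `Y`.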
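
Frequency strings can also carry a multiplier or an anchor. The sketch below shows a few combinations; the variable names are illustrative.

```python
import pandas as pd

# A leading integer multiplies the base frequency
every_other_day = pd.date_range('2023-01-01', periods=4, freq='2D')
print(every_other_day)

# 'W' anchors on Sunday by default; 'W-MON' anchors on Monday instead
mondays = pd.date_range('2023-01-01', periods=3, freq='W-MON')
print(mondays)

# Multipliers combine with sub-daily codes as well
quarter_hours = pd.date_range('2023-01-01 09:00', periods=4, freq='15min')
print(quarter_hours)
```

Because 2023-01-01 is a Sunday, the `W-MON` range starts on the following day, 2023-01-02.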

## 1.3 Period and PeriodIndex

In [None]:
# Single period
period = pd.Period('2023-01', freq='M')
print("Single Period (month):")
print(period)
print(f"Start: {period.start_time}")
print(f"End: {period.end_time}")

# PeriodIndex - useful for quarterly/monthly data
periods = pd.period_range(start='2023-01', end='2023-12', freq='M')
print("\nMonthly periods for 2023:")
print(periods)

# Quarterly periods
quarters = pd.period_range(start='2023Q1', end='2024Q4', freq='Q')
print("\nQuarterly periods:")
print(quarters)
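
Timestamps and periods convert into each other. The following is a small sketch of that round trip, with illustrative variable names.

```python
import pandas as pd

# Month-start timestamps converted to monthly periods
stamps = pd.date_range('2023-01-01', periods=3, freq='MS')
as_periods = stamps.to_period('M')
print(as_periods)

# to_timestamp() defaults to the start of each period
back_to_stamps = as_periods.to_timestamp()
print(back_to_stamps)

# A Period spans an interval, so a timestamp can be tested against its bounds
january = pd.Period('2023-01', freq='M')
mid_month = pd.Timestamp('2023-01-15')
print(january.start_time <= mid_month <= january.end_time)  # True
```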

## 1.4 Time Zones

Pandas supports timezone-aware datetime objects:

In [None]:
# Create timezone-naive dates
dates_naive = pd.date_range('2023-01-01', periods=5, freq='D')
print("Timezone-naive:")
print(dates_naive)
print(f"Timezone: {dates_naive.tz}")

# Localize to a timezone
dates_utc = dates_naive.tz_localize('UTC')
print("\nLocalized to UTC:")
print(dates_utc)

# Convert to different timezone
dates_ny = dates_utc.tz_convert('America/New_York')
print("\nConverted to New York:")
print(dates_ny)
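
The difference between `tz_localize` and `tz_convert` is easier to see on a single value. This sketch assumes the time zone database is available on the system; the variable names are illustrative.

```python
import pandas as pd

# Noon UTC on a winter day; New York is 5 hours behind UTC in January
noon_utc = pd.Timestamp('2023-01-15 12:00', tz='UTC')
noon_ny = noon_utc.tz_convert('America/New_York')
print(noon_ny)  # 2023-01-15 07:00:00-05:00

# tz_convert changes the wall-clock reading, not the instant itself
print(noon_utc == noon_ny)  # True

# tz_localize attaches a zone to a naive value without shifting the clock reading
naive = pd.Timestamp('2023-01-15 12:00')
local = naive.tz_localize('America/New_York')
print(local.hour)                    # 12
print(local.tz_convert('UTC').hour)  # 17
```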

---

# 2. Data Loading & Preparation

## 2.1 Creating Time Series DataFrames

In [None]:
# Generate sample data
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=365, freq='D')
data = {
    'sales': np.random.randint(100, 500, size=365),
    'customers': np.random.randint(10, 100, size=365),
    'temperature': 15 + 10 * np.sin(2 * np.pi * np.arange(365) / 365) + np.random.normal(0, 2, 365)
}

df = pd.DataFrame(data, index=dates)
df['revenue'] = df['sales'] * np.random.uniform(10, 20, size=365)

print("Time Series DataFrame:")
print(df.head(10))
print(f"\nShape: {df.shape}")
print(f"Index type: {type(df.index)}")
print(f"\nIndex info:")
print(df.index)

## 2.2 Loading Time Series from CSV

Common patterns for reading time series data:

In [None]:
# Save sample data to CSV first
df.to_csv('sample_timeseries.csv')

# Method 1: Parse dates during read
df_loaded = pd.read_csv('sample_timeseries.csv', 
                        index_col=0,
                        parse_dates=True)
print("Loaded from CSV:")
print(df_loaded.head())
print(f"Index type: {type(df_loaded.index)}")

# Method 2: Set index after loading
df_alt = pd.read_csv('sample_timeseries.csv')
df_alt['date'] = pd.to_datetime(df_alt.iloc[:, 0])
df_alt = df_alt.set_index('date')
df_alt = df_alt.drop(df_alt.columns[0], axis=1)

print("\nAlternative method:")
print(df_alt.head())

## 2.3 Handling Missing Data

Time series often have missing values that need special treatment:

In [None]:
# Create data with missing values
df_missing = df.copy()
df_missing.loc['2023-01-05':'2023-01-07', 'sales'] = np.nan
df_missing.loc['2023-01-15', 'customers'] = np.nan

print("Data with missing values:")
print(df_missing.loc['2023-01-01':'2023-01-10'])

# Check for missing values
print("\nMissing values count:")
print(df_missing.isnull().sum())

# Forward fill - use last valid observation
df_ffill = df_missing.ffill()  # fillna(method='ffill') is deprecated in recent pandas
print("\nForward fill:")
print(df_ffill.loc['2023-01-01':'2023-01-10', 'sales'])

# Backward fill
df_bfill = df_missing.bfill()  # fillna(method='bfill') is deprecated in recent pandas
print("\nBackward fill:")
print(df_bfill.loc['2023-01-01':'2023-01-10', 'sales'])

# Interpolation - linear by default
df_interp = df_missing.interpolate(method='linear')
print("\nLinear interpolation:")
print(df_interp.loc['2023-01-01':'2023-01-10', 'sales'])

### Interpolation Methods

| Method | Description | Best For |
|--------|-------------|----------|
| linear | Linear interpolation | Default, general purpose |
| time | Time-weighted interpolation | Irregularly spaced time series |
| polynomial | Polynomial interpolation | Smooth curves |
| spline | Spline interpolation | Smooth curves with control |
| nearest | Nearest neighbor | Categorical-like data |
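
The difference between `linear` and `time` only shows up when observations are unevenly spaced. Here is a minimal sketch on a three-point series made up for illustration.

```python
import numpy as np
import pandas as pd

# Irregular spacing: 1 day between the first two points, 8 days before the third
idx = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-10'])
s = pd.Series([0.0, np.nan, 9.0], index=idx)

# 'linear' ignores the index and treats the points as equally spaced
linear_value = s.interpolate(method='linear').iloc[1]
print(linear_value)  # 4.5

# 'time' weights by elapsed time: 1 of the 9 days has passed
time_value = s.interpolate(method='time').iloc[1]
print(time_value)  # 1.0
```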

In [None]:
# Visualize different interpolation methods
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

methods = ['linear', 'polynomial', 'spline', 'nearest']
for ax, method in zip(axes.flatten(), methods):
    # polynomial and spline need an explicit order (both rely on SciPy)
    if method in ('polynomial', 'spline'):
        interpolated = df_missing['sales'].interpolate(method=method, order=2)
    else:
        interpolated = df_missing['sales'].interpolate(method=method)
    
    df_missing['sales'].plot(ax=ax, style='o', label='Original (with NaN)', alpha=0.5)
    interpolated.plot(ax=ax, label=f'{method.capitalize()} interpolation', linewidth=2)
    ax.set_title(f'{method.capitalize()} Interpolation', fontsize=12, fontweight='bold')
    ax.legend()
    ax.set_xlim('2023-01-01', '2023-01-31')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

# 3. Time-Based Indexing and Slicing

## 3.1 Basic Slicing

In [None]:
# Single date
print("Data for 2023-01-15:")
print(df.loc['2023-01-15'])

# Date range
print("\nData from Jan 1 to Jan 5:")
print(df.loc['2023-01-01':'2023-01-05'])

# Partial string indexing - very powerful!
print("\nAll data for January 2023:")
print(df.loc['2023-01'].head())
print(f"Shape: {df.loc['2023-01'].shape}")

# All data for Q1
print("\nFirst quarter (Jan-Mar):")
print(df.loc['2023-01':'2023-03'].shape)

## 3.2 Advanced Selection Methods

In [None]:
# Using .between_time() for intraday data
# First create hourly data
hourly_dates = pd.date_range('2023-01-01', periods=24*7, freq='h')  # 'H' is deprecated since pandas 2.2
hourly_data = pd.DataFrame({
    'value': np.random.randn(24*7)
}, index=hourly_dates)

# Select data between specific times
business_hours = hourly_data.between_time('09:00', '17:00')
print("Business hours (9am-5pm):")
print(business_hours.head(10))

# Using .at_time() for specific time
noon_data = hourly_data.at_time('12:00')
print("\nData at noon each day:")
print(noon_data)

## 3.3 DatetimeIndex Attributes

Extract useful information from the index:

In [None]:
# Add temporal features
df['year'] = df.index.year
df['month'] = df.index.month
df['day'] = df.index.day
df['day_of_week'] = df.index.dayofweek  # Monday=0, Sunday=6
df['day_name'] = df.index.day_name()
df['week_of_year'] = df.index.isocalendar().week
df['quarter'] = df.index.quarter
df['is_weekend'] = df.index.dayofweek >= 5

print("DataFrame with temporal features:")
print(df.head(10))

# Useful for grouping and analysis
print("\nAverage sales by day of week:")
print(df.groupby('day_name')['sales'].mean().sort_values(ascending=False))

---

# 4. Resampling and Frequency Conversion

## Theory

**Resampling** is the process of converting time series from one frequency to another:

- **Downsampling**: Higher to lower frequency (daily → monthly)
  - Requires aggregation (mean, sum, etc.)
  
- **Upsampling**: Lower to higher frequency (monthly → daily)
  - Requires filling missing values

## 4.1 Downsampling

In [None]:
# Create clean dataframe for resampling
df_clean = df[['sales', 'customers', 'revenue']].copy()

# Resample to weekly (sum)
weekly_sum = df_clean.resample('W').sum()
print("Weekly totals:")
print(weekly_sum.head())

# Resample to monthly (mean)
# 'ME' = month end (pandas >= 2.2; older versions spell it 'M')
monthly_mean = df_clean.resample('ME').mean()
print("\nMonthly averages:")
print(monthly_mean.head())

# Resample to monthly (multiple aggregations); use 'M' instead of 'ME' before pandas 2.2
monthly_agg = df_clean.resample('ME').agg({
    'sales': ['sum', 'mean', 'std'],
    'customers': ['sum', 'mean'],
    'revenue': 'sum'
})
print("\nMonthly aggregations:")
print(monthly_agg.head())

## 4.2 Common Aggregation Functions

| Function | Description | Use Case |
|----------|-------------|----------|
| sum() | Sum of values | Sales, revenue |
| mean() | Average | Temperature, prices |
| median() | Middle value | Robust average |
| std() | Standard deviation | Volatility |
| min() / max() | Extremes | Temperature ranges |
| first() / last() | First/last value | Opening/closing prices |
| count() | Number of observations | Data availability |
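
Several of these functions can be applied in one pass by giving `.agg()` a list of names. The sketch below uses a hypothetical price series covering two full Monday-to-Sunday weeks.

```python
import numpy as np
import pandas as pd

# Hypothetical daily prices 1.0 .. 14.0, starting on Monday 2023-01-02
prices = pd.Series(
    np.arange(1.0, 15.0),
    index=pd.date_range('2023-01-02', periods=14, freq='D'),
    name='price',
)

# One row per week (labelled by its closing Sunday), one column per aggregation
weekly = prices.resample('W').agg(['first', 'max', 'min', 'last', 'count'])
print(weekly)
```

The first week aggregates the values 1 through 7 and the second week 8 through 14, so `first` and `last` act like opening and closing prices.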

In [None]:
# Visualize daily vs weekly data
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

# Daily data
df_clean['sales'].plot(ax=axes[0], alpha=0.7, label='Daily')
axes[0].set_title('Daily Sales', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Sales', fontsize=11)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Weekly data
weekly_sum['sales'].plot(ax=axes[1], marker='o', linewidth=2, label='Weekly (sum)', color='orange')
axes[1].set_title('Weekly Sales (Resampled)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Sales', fontsize=11)
axes[1].set_xlabel('Date', fontsize=11)
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 4.3 Upsampling

In [None]:
# Start with monthly data ('ME' = month end; use 'M' before pandas 2.2)
monthly_data = df_clean.resample('ME').sum()
print("Monthly data:")
print(monthly_data.head())

# Upsample to daily and forward fill
daily_ffill = monthly_data.resample('D').ffill()
print("\nUpsampled to daily (forward fill):")
print(daily_ffill.head(35))  # Index starts at the first month-end label (2023-01-31)

# Upsample and interpolate
daily_interp = monthly_data.resample('D').interpolate(method='linear')
print("\nUpsampled to daily (interpolated):")
print(daily_interp.head(35))

## 4.4 Using .asfreq() for Frequency Conversion

In [None]:
# asfreq() vs resample()
# asfreq: changes frequency but doesn't aggregate
# resample: changes frequency AND aggregates

# Keep only the rows on the weekly anchor day ('W' = Sunday); no aggregation
weekly_asfreq = df_clean.asfreq('W')
print("Using asfreq('W'):")
print(weekly_asfreq.head())

# Compare with resample
weekly_resample = df_clean.resample('W').mean()
print("\nUsing resample('W').mean():")
print(weekly_resample.head())

---

# 5. Rolling Window Operations

## Theory

**Rolling windows** compute statistics over a moving window of data. Essential for:
- Smoothing noisy data
- Identifying trends
- Creating features for ML models
- Technical analysis in finance
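
Before applying rolling windows to the sales data, a tiny worked example helps show what the window does at the start of a series. The numbers are made up for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50])

# Default: the result is NaN until the window holds `window` observations
full_window = s.rolling(window=3).mean()
print(full_window.tolist())
# [nan, nan, 20.0, 30.0, 40.0]

# min_periods=1 emits a value as soon as one observation is available
partial_window = s.rolling(window=3, min_periods=1).mean()
print(partial_window.tolist())
# [10.0, 15.0, 20.0, 30.0, 40.0]
```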

## 5.1 Basic Rolling Statistics

In [None]:
# 7-day rolling mean
df_clean['sales_ma7'] = df_clean['sales'].rolling(window=7).mean()

# 30-day rolling mean
df_clean['sales_ma30'] = df_clean['sales'].rolling(window=30).mean()

# Rolling standard deviation (volatility)
df_clean['sales_std7'] = df_clean['sales'].rolling(window=7).std()

# Rolling min and max
df_clean['sales_min7'] = df_clean['sales'].rolling(window=7).min()
df_clean['sales_max7'] = df_clean['sales'].rolling(window=7).max()

print("DataFrame with rolling features:")
print(df_clean.head(10))

### Visualizing Rolling Windows

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(14, 8), sharex=True)

# Original vs smoothed
df_clean['sales'].plot(ax=axes[0], alpha=0.3, label='Original', color='gray')
df_clean['sales_ma7'].plot(ax=axes[0], label='7-day MA', linewidth=2)
df_clean['sales_ma30'].plot(ax=axes[0], label='30-day MA', linewidth=2)
axes[0].set_title('Sales with Moving Averages', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Sales', fontsize=11)
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Rolling standard deviation (volatility)
df_clean['sales_std7'].plot(ax=axes[1], color='red', linewidth=2)
axes[1].set_title('7-day Rolling Standard Deviation (Volatility)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Std Dev', fontsize=11)
axes[1].set_xlabel('Date', fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 5.2 Custom Rolling Functions

In [None]:
# Custom function: range (max - min)
def rolling_range(x):
    return x.max() - x.min()

df_clean['sales_range7'] = df_clean['sales'].rolling(window=7).apply(rolling_range)

# Custom function: coefficient of variation
def coef_variation(x):
    return x.std() / x.mean() if x.mean() != 0 else 0

df_clean['sales_cv7'] = df_clean['sales'].rolling(window=7).apply(coef_variation)

print("Custom rolling statistics:")
print(df_clean[['sales', 'sales_range7', 'sales_cv7']].dropna().head(10))

## 5.3 Expanding Windows

Unlike rolling windows (fixed size), expanding windows include all data from the start:

In [None]:
# Expanding mean - cumulative average
df_clean['sales_expanding_mean'] = df_clean['sales'].expanding().mean()

# Expanding sum - cumulative sum
df_clean['sales_cumsum'] = df_clean['sales'].expanding().sum()

# Or use cumsum() directly
df_clean['sales_cumsum2'] = df_clean['sales'].cumsum()

print("Expanding window statistics:")
print(df_clean[['sales', 'sales_expanding_mean', 'sales_cumsum']].head(10))

In [None]:
# Visualize expanding mean vs rolling mean
fig, ax = plt.subplots(figsize=(14, 6))

df_clean['sales'].plot(ax=ax, alpha=0.3, label='Original', color='gray')
df_clean['sales_ma30'].plot(ax=ax, label='30-day Rolling Mean', linewidth=2)
df_clean['sales_expanding_mean'].plot(ax=ax, label='Expanding Mean', linewidth=2)

ax.set_title('Comparison: Rolling vs Expanding Mean', fontsize=12, fontweight='bold')
ax.set_ylabel('Sales', fontsize=11)
ax.set_xlabel('Date', fontsize=11)
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 5.4 Exponentially Weighted Functions

Give more weight to recent observations:

In [None]:
# Exponentially weighted moving average (EWMA)
df_clean['sales_ewm_short'] = df_clean['sales'].ewm(span=7, adjust=False).mean()
df_clean['sales_ewm_long'] = df_clean['sales'].ewm(span=30, adjust=False).mean()

# Exponentially weighted standard deviation
df_clean['sales_ewm_std'] = df_clean['sales'].ewm(span=7, adjust=False).std()

print("Exponentially weighted statistics:")
print(df_clean[['sales', 'sales_ewm_short', 'sales_ewm_long']].head(15))
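
With `adjust=False`, the exponentially weighted mean follows a simple recursion in which `span` maps to a smoothing factor α = 2 / (span + 1). The sketch below checks that recursion by hand on a made-up series.

```python
import pandas as pd

x = pd.Series([10.0, 20.0, 30.0, 40.0])
span = 3
alpha = 2 / (span + 1)  # 0.5 for span=3

# Manual recursion: blend each new observation with the previous average
manual = [x.iloc[0]]
for value in x.iloc[1:]:
    manual.append(alpha * value + (1 - alpha) * manual[-1])

pandas_ewm = x.ewm(span=span, adjust=False).mean()
print(manual)               # [10.0, 15.0, 22.5, 31.25]
print(pandas_ewm.tolist())  # [10.0, 15.0, 22.5, 31.25]
```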

In [None]:
# Compare simple MA vs EWMA
fig, ax = plt.subplots(figsize=(14, 6))

df_clean['sales'].plot(ax=ax, alpha=0.3, label='Original', color='gray', linewidth=1)
df_clean['sales_ma7'].plot(ax=ax, label='7-day Simple MA', linewidth=2)
df_clean['sales_ewm_short'].plot(ax=ax, label='7-day EWMA', linewidth=2, linestyle='--')

ax.set_title('Simple Moving Average vs Exponential Moving Average', fontsize=12, fontweight='bold')
ax.set_ylabel('Sales', fontsize=11)
ax.set_xlabel('Date', fontsize=11)
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_xlim('2023-01-01', '2023-03-31')  # Focus on Q1
plt.tight_layout()
plt.show()

---

# 6. Lag and Shift Operations

## Theory

**Lagging** creates features from previous time steps - crucial for:
- Creating autoregressive features
- Computing changes and growth rates
- Building ML models

## 6.1 Basic Shift Operations

In [None]:
# Create clean dataframe for lag operations
df_lag = df[['sales', 'customers']].copy()

# Lag 1 (previous day)
df_lag['sales_lag1'] = df_lag['sales'].shift(1)

# Lag 7 (same day last week)
df_lag['sales_lag7'] = df_lag['sales'].shift(7)

# Lead 1 (next day) - negative shift; it holds future values, so use it as a target, not as a model input
df_lag['sales_lead1'] = df_lag['sales'].shift(-1)

print("Lag and lead features:")
print(df_lag.head(10))

## 6.2 Computing Changes and Growth Rates

In [None]:
# Difference from previous period
df_lag['sales_diff1'] = df_lag['sales'].diff(1)

# Percentage change
df_lag['sales_pct_change'] = df_lag['sales'].pct_change(1) * 100

# Week-over-week change
df_lag['sales_diff7'] = df_lag['sales'].diff(7)
df_lag['sales_pct_change7'] = df_lag['sales'].pct_change(7) * 100

print("Changes and growth rates:")
print(df_lag[['sales', 'sales_diff1', 'sales_pct_change', 'sales_pct_change7']].head(15))

In [None]:
# Visualize original vs changes
fig, axes = plt.subplots(3, 1, figsize=(14, 10), sharex=True)

# Original series
df_lag['sales'].plot(ax=axes[0], linewidth=1.5, color='steelblue')
axes[0].set_title('Original Sales', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Sales', fontsize=11)
axes[0].grid(True, alpha=0.3)

# First difference
df_lag['sales_diff1'].plot(ax=axes[1], linewidth=1, color='orange')
axes[1].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[1].set_title('Day-over-Day Change', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Change', fontsize=11)
axes[1].grid(True, alpha=0.3)

# Percentage change
df_lag['sales_pct_change'].plot(ax=axes[2], linewidth=1, color='green')
axes[2].axhline(y=0, color='red', linestyle='--', linewidth=1)
axes[2].set_title('Percentage Change', fontsize=12, fontweight='bold')
axes[2].set_ylabel('% Change', fontsize=11)
axes[2].set_xlabel('Date', fontsize=11)
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 6.3 Creating Multiple Lags for ML

In [None]:
# Create multiple lag features efficiently
def create_lags(df, column, lags):
    """
    Create multiple lag features for a column
    """
    for lag in lags:
        df[f'{column}_lag{lag}'] = df[column].shift(lag)
    return df

# Create lags 1-7 and 14, 21, 28 (weekly patterns)
df_ml = df[['sales']].copy()
lags = list(range(1, 8)) + [14, 21, 28]
df_ml = create_lags(df_ml, 'sales', lags)

print("ML-ready dataframe with lags:")
print(df_ml.head(30))
print(f"\nShape: {df_ml.shape}")
print(f"Columns: {df_ml.columns.tolist()}")
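
The leading rows of a lagged frame contain NaN, so they usually need to be dropped before fitting a model. The following sketch shows one way to do that and to separate features from the target; `demo`, `supervised`, `X`, and `y` are illustrative names.

```python
import numpy as np
import pandas as pd

# Small stand-in series with three lag columns
demo = pd.DataFrame(
    {'sales': np.arange(10.0)},
    index=pd.date_range('2023-01-01', periods=10, freq='D'),
)
for lag in (1, 2, 3):
    demo[f'sales_lag{lag}'] = demo['sales'].shift(lag)

# The first max(lags) rows contain NaN and cannot be used for fitting
supervised = demo.dropna()

X = supervised.drop(columns='sales')  # features: past values only
y = supervised['sales']               # target: current value
print(X.shape, y.shape)  # (7, 3) (7,)
```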

---

# 7. Simple Forecasting Methods

## Theory

Simple methods serve as baselines before applying complex models:

1. **Naive**: Last observed value
2. **Seasonal Naive**: Same period from last cycle
3. **Moving Average**: Average of recent observations
4. **Exponential Smoothing**: Weighted average giving more weight to recent data

## 7.1 Naive Methods

In [None]:
# Prepare train-test split
df_forecast = df[['sales']].copy()
train_size = int(len(df_forecast) * 0.8)
train = df_forecast[:train_size]
test = df_forecast[train_size:]

print(f"Train size: {len(train)}")
print(f"Test size: {len(test)}")

# Naive forecast: use last value
naive_forecast = pd.Series(
    train['sales'].iloc[-1], 
    index=test.index,
    name='naive'
)

# Seasonal naive: use same day from last week
# Tile the last observed week across the whole test period
last_week = train['sales'].iloc[-7:].to_numpy()
seasonal_naive_forecast = pd.Series(
    np.resize(last_week, len(test)),  # repeats the 7 values cyclically
    index=test.index,
    name='seasonal_naive'
)

print("\nNaive forecast (first 10 days):")
print(naive_forecast.head(10))
print("\nSeasonal naive forecast (first 10 days):")
print(seasonal_naive_forecast.head(10))

## 7.2 Moving Average Forecast

In [None]:
# Simple moving average forecast
window = 7
ma_forecast = pd.Series(
    train['sales'].iloc[-window:].mean(),
    index=test.index,
    name='moving_average'
)

print("Moving average forecast (first 10 days):")
print(ma_forecast.head(10))

## 7.3 Exponential Smoothing

### Single Exponential Smoothing (SES)

**Formula**: ŷₜ = α·yₜ₋₁ + (1-α)·ŷₜ₋₁

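Here ŷₜ is the forecast for time t, yₜ₋₁ is the most recent observation, and α (alpha) is the smoothing parameter (0 < α < 1). Values of α near 1 track recent observations closely; values near 0 smooth more heavily.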
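
The recursion can be written out by hand in a few lines. This is a minimal sketch, not the statsmodels implementation: the function name is hypothetical, and it assumes the first forecast equals the first observation, which is only one of several possible initializations.

```python
import numpy as np

def ses_forecast_manual(y, alpha):
    """Return the SES forecast for the step after the last observation.

    Assumption: the initial forecast is set to the first observation.
    """
    forecast = y[0]
    for obs in y:
        # New forecast = alpha * latest observation + (1 - alpha) * old forecast
        forecast = alpha * obs + (1 - alpha) * forecast
    return forecast

y = np.array([10.0, 20.0, 30.0])
print(ses_forecast_manual(y, alpha=0.5))  # 22.5
```

With `alpha=1.0` the function returns the last observation, which is the naive forecast from section 7.1.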

In [None]:
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing

# Single Exponential Smoothing
ses_model = SimpleExpSmoothing(train['sales'])
ses_fit = ses_model.fit(smoothing_level=0.2, optimized=False)
ses_forecast = ses_fit.forecast(steps=len(test))

print("Single Exponential Smoothing forecast:")
print(ses_forecast.head(10))
print(f"\nSmoothing level (alpha): {ses_fit.params['smoothing_level']:.4f}")

### Holt's Linear Trend (Double Exponential Smoothing)

Extends SES with a second smoothing equation that tracks the trend.

In [None]:
# Holt's method (with trend)
holt_model = ExponentialSmoothing(
    train['sales'], 
    trend='add',
    seasonal=None
)
holt_fit = holt_model.fit()
holt_forecast = holt_fit.forecast(steps=len(test))

print("Holt's Linear Trend forecast:")
print(holt_forecast.head(10))

### Holt-Winters (Triple Exponential Smoothing)

Adds a third smoothing equation so the model captures level, trend, and seasonality.

In [None]:
# Holt-Winters method (with trend and seasonality)
hw_model = ExponentialSmoothing(
    train['sales'],
    trend='add',
    seasonal='add',
    seasonal_periods=7  # weekly seasonality
)
hw_fit = hw_model.fit()
hw_forecast = hw_fit.forecast(steps=len(test))

print("Holt-Winters forecast:")
print(hw_forecast.head(10))
print(f"\nModel parameters:")
print(f"Smoothing level (alpha): {hw_fit.params['smoothing_level']:.4f}")
print(f"Smoothing trend (beta): {hw_fit.params['smoothing_trend']:.4f}")
print(f"Smoothing seasonal (gamma): {hw_fit.params['smoothing_seasonal']:.4f}")

## 7.4 Comparing Forecast Methods

In [None]:
# Visualize all forecasts
fig, ax = plt.subplots(figsize=(14, 6))

# Plot training data
train['sales'].plot(ax=ax, label='Train', linewidth=1.5, color='black')

# Plot test data
test['sales'].plot(ax=ax, label='Test (Actual)', linewidth=2, color='blue')

# Plot forecasts
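naive_forecast.plot(ax=ax, label='Naive', linestyle='--', linewidth=1.5, alpha=0.7)
seasonal_naive_forecast.plot(ax=ax, label='Seasonal Naive', linestyle='--', linewidth=1.5, alpha=0.7)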
ma_forecast.plot(ax=ax, label='Moving Avg', linestyle='--', linewidth=1.5, alpha=0.7)
ses_forecast.plot(ax=ax, label='SES', linestyle='--', linewidth=1.5, alpha=0.7)
holt_forecast.plot(ax=ax, label='Holt', linestyle='--', linewidth=1.5, alpha=0.7)
hw_forecast.plot(ax=ax, label='Holt-Winters', linestyle='--', linewidth=2, alpha=0.9)

ax.set_title('Comparison of Simple Forecasting Methods', fontsize=14, fontweight='bold')
ax.set_ylabel('Sales', fontsize=12)
ax.set_xlabel('Date', fontsize=12)
ax.legend(loc='best', fontsize=10)
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7.5 Evaluating Forecasts

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_forecast(actual, forecast, method_name):
    """
    Calculate MAE, RMSE, MAPE for a forecast
    """
    # Align indices
    common_idx = actual.index.intersection(forecast.index)
    actual_aligned = actual.loc[common_idx]
    forecast_aligned = forecast.loc[common_idx]
    
    mae = mean_absolute_error(actual_aligned, forecast_aligned)
    rmse = np.sqrt(mean_squared_error(actual_aligned, forecast_aligned))
    mape = np.mean(np.abs((actual_aligned - forecast_aligned) / actual_aligned)) * 100
    
    return {
        'Method': method_name,
        'MAE': mae,
        'RMSE': rmse,
        'MAPE': mape
    }

# Evaluate all methods
results = []
results.append(evaluate_forecast(test['sales'], naive_forecast, 'Naive'))
results.append(evaluate_forecast(test['sales'], seasonal_naive_forecast, 'Seasonal Naive'))
results.append(evaluate_forecast(test['sales'], ma_forecast, 'Moving Average'))
results.append(evaluate_forecast(test['sales'], ses_forecast, 'SES'))
results.append(evaluate_forecast(test['sales'], holt_forecast, 'Holt'))
results.append(evaluate_forecast(test['sales'], hw_forecast, 'Holt-Winters'))

results_df = pd.DataFrame(results)
results_df = results_df.sort_values('RMSE')

print("\nForecast Evaluation Results:")
print("=" * 70)
print(results_df.to_string(index=False))
print("\n" + "=" * 70)
print(f"Best method: {results_df.iloc[0]['Method']}")
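
To make the three metrics concrete, here is a hand-checkable sketch on two made-up points.

```python
import numpy as np

actual = np.array([100.0, 200.0])
forecast = np.array([110.0, 180.0])
errors = actual - forecast  # [-10, 20]

mae = np.mean(np.abs(errors))                  # (10 + 20) / 2 = 15
rmse = np.sqrt(np.mean(errors ** 2))           # sqrt((100 + 400) / 2), about 15.81
mape = np.mean(np.abs(errors / actual)) * 100  # (10% + 10%) / 2 = 10%

print(mae, round(rmse, 2), round(mape, 2))
```

RMSE exceeds MAE here because squaring gives the larger error more weight, while MAPE rates both misses equally since each is 10% of its actual value.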

---

# Summary and Key Takeaways

## Key Pandas Time Series Operations

### 1. Data Structures
- Use `DatetimeIndex` for time series data
- Create with `pd.date_range()` or `pd.to_datetime()`
- Consider `Period` for interval data (months, quarters)

### 2. Data Preparation
- Handle missing values with `ffill()`, `bfill()`, or `interpolate()`
- Always set datetime as index for time series operations
- Extract temporal features: day, month, day_of_week, etc.

### 3. Resampling
- **Downsampling**: Use `.resample('W').sum()` or `.mean()`
- **Upsampling**: Use `.resample('D').ffill()` or `.interpolate()`
- Choose appropriate aggregation function

### 4. Rolling Windows
- **Fixed windows**: `.rolling(window=7).mean()`
- **Expanding windows**: `.expanding().mean()`
- **Exponential**: `.ewm(span=7).mean()`
- Use for smoothing and creating features

### 5. Lag Operations
- Create lags: `.shift(1)` for previous period
- Calculate changes: `.diff()` and `.pct_change()`
- Essential for ML feature engineering

### 6. Simple Forecasting
- Start with naive methods as baseline
- Moving averages for simple smoothing
- Exponential smoothing (SES, Holt, Holt-Winters) for trend/seasonality

## Best Practices

1. **Always visualize** your data and transformations
2. **Check for missing values** before analysis
3. **Choose appropriate frequency** for your data
4. **Consider seasonality** when selecting window sizes
5. **Compare multiple methods** before choosing one
6. **Use proper train-test splits** (chronological)
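
As a reminder of what a chronological split looks like in practice, here is a minimal sketch on a made-up series; the 80/20 ratio mirrors the one used in section 7.1.

```python
import numpy as np
import pandas as pd

series = pd.Series(
    np.arange(100.0),
    index=pd.date_range('2023-01-01', periods=100, freq='D'),
)

# Chronological split: the first 80% trains, the last 20% tests (no shuffling)
cutoff = int(len(series) * 0.8)
train, test = series.iloc[:cutoff], series.iloc[cutoff:]

# Every training timestamp precedes every test timestamp, so no future data leaks in
print(train.index.max() < test.index.min())  # True
```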

## Common Pandas Time Series Functions

```python
# Date ranges
pd.date_range(start, end, freq)
pd.period_range(start, end, freq)

# Resampling
df.resample('W').sum()
df.asfreq('D')

# Rolling
df.rolling(window=7).mean()
df.expanding().mean()
df.ewm(span=7).mean()

# Shifting
df.shift(1)
df.diff()
df.pct_change()

# Time-based selection
df.loc['2023-01']
df.between_time('09:00', '17:00')
df.at_time('12:00')
```

---

## Next Steps

In **Part 3**, we'll cover:
- Feature engineering for machine learning models
- Traditional ML approaches (Random Forest, XGBoost, LightGBM)
- Deep learning for time series (LSTM, GRU, 1D CNN)
- Advanced tools (Prophet, NeuralProphet)
- Model evaluation and selection strategies
- Production deployment considerations

---

*End of Part 2*