# Phase 2, Step 5: Create Temporal Features

This notebook creates **temporal features** from the sample collection dates. Water quality varies with time due to:

- **Seasonality**: Rainfall, temperature, and runoff patterns differ by season
- **Hydrological cycles**: Wet vs. dry periods affect pollutant dilution and concentration
- **Year effects**: Long-term trends (e.g., land use change, climate)

## Objectives
1. Parse sample dates from the Landsat feature datasets
2. Create calendar-based features (year, month, quarter, season)
3. Add cyclical encodings for month and day-of-year (preserves periodicity)
4. Add region-specific features (wet/dry season for Southern Africa)
5. Save temporal feature datasets for model integration

## Data Format
Sample dates are in **DD-MM-YYYY** format (e.g., 02-01-2011 = 2 January 2011).
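
A string such as 02-01-2011 is also a valid month-first date, so the day-first convention has to be stated when parsing. The snippet below is a minimal sketch on made-up strings (not rows from the dataset); passing `format="%d-%m-%Y"` is one way to make the convention explicit.

```python
import pandas as pd

# Hypothetical sample strings in DD-MM-YYYY format (illustrative only)
dates = pd.Series(["02-01-2011", "15-07-2013", "31-12-2015"])

# An explicit format rules out a month-first reading of "02-01-2011"
parsed = pd.to_datetime(dates, format="%d-%m-%Y")

print(parsed.dt.day.tolist())    # [2, 15, 31]
print(parsed.dt.month.tolist())  # [1, 7, 12]
```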

## Step 1: Load Dependencies and Data

We use `pandas` for data handling and `numpy` for cyclical encoding. No external APIs are required: all temporal features are derived from the date column.

In [1]:
import warnings

import numpy as np
import pandas as pd

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [2]:
# Load Landsat training and validation data (source of Sample Date)
print("Loading sample locations and dates from Landsat feature files...")
landsat_train = pd.read_csv("landsat_features_training.csv")
landsat_val = pd.read_csv("landsat_features_validation.csv")

# Keep only merge keys and extract dates
train_df = landsat_train[["Latitude", "Longitude", "Sample Date"]].copy()
val_df = landsat_val[["Latitude", "Longitude", "Sample Date"]].copy()

print(f"Training samples: {len(train_df)}")
print(f"Validation samples: {len(val_df)}")
landsat_train.head()

Loading sample locations and dates from Landsat feature files...
Training samples: 9319
Validation samples: 200


Unnamed: 0,Latitude,Longitude,Sample Date,nir,green,swir16,swir22,NDMI,MNDWI
0,-28.760833,17.730278,02-01-2011,11190.0,11426.0,7687.5,7645.0,0.185538,0.195595
1,-26.861111,28.884722,03-01-2011,17658.5,9550.0,13746.5,10574.0,0.124566,-0.180134
2,-26.45,28.085833,03-01-2011,15210.0,10720.0,17974.0,14201.0,-0.083293,-0.252805
3,-27.671111,27.236944,03-01-2011,14887.0,10943.0,13522.0,11403.0,0.048048,-0.105416
4,-27.356667,27.286389,03-01-2011,16828.5,9502.5,12665.5,9643.0,0.141147,-0.142683


## Step 2: Parse Dates and Create Temporal Features

We parse the `Sample Date` column using `dayfirst=True` (DD-MM-YYYY) and create:

1. **Calendar features**: year, month, quarter, day_of_year
2. **Season**: Meteorological seasons for Southern Hemisphere (Dec–Feb=summer, Mar–May=autumn, Jun–Aug=winter, Sep–Nov=spring)
3. **Cyclical encodings**: `month_sin`/`month_cos` and `day_of_year_sin`/`day_of_year_cos` so that December and January are treated as adjacent
4. **Time trend**: `months_since_start` for a linear trend across the study period (2011–2015)
5. **Wet season flag**: For South Africa, the wet season is typically October–April (summer rainfall region); dry season is May–September
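
To see why the cyclical encoding in item 3 matters, the sketch below compares the gap between December and January in raw month numbers with their distance on the unit circle. The helper name `encode_month` is hypothetical and used only for illustration.

```python
import numpy as np

def encode_month(month: int) -> np.ndarray:
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * np.pi * month / 12
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jul = encode_month(12), encode_month(1), encode_month(7)

# Raw month numbers put December and January 11 units apart
raw_gap = abs(12 - 1)

# On the circle, December to January is one 30-degree step, like any adjacent pair
dist_dec_jan = np.linalg.norm(dec - jan)

# January and July sit on opposite sides of the circle
dist_jan_jul = np.linalg.norm(jan - jul)

print(raw_gap, round(dist_dec_jan, 3), round(dist_jan_jul, 3))
```

The December to January distance comes out at roughly 0.52, the same as for any other pair of neighbouring months, while January and July are a full diameter (2.0) apart.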

In [3]:
def create_temporal_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add temporal features derived from Sample Date.
    Expects Sample Date in DD-MM-YYYY format.
    Keeps original Sample Date string for merging.
    """
    out = df.copy()
    
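    # Parse dates (dayfirst=True for DD-MM-YYYY); keep original for merge key
    date_parsed = pd.to_datetime(out["Sample Date"], dayfirst=True, errors="coerce")
    # errors="coerce" turns unparseable strings into NaT; fail early with a clear message
    if date_parsed.isna().any():
        raise ValueError(f"{date_parsed.isna().sum()} unparseable Sample Date value(s)")
    dt = date_parsed.dt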
    
    # Calendar features
    out["year"] = dt.year
    out["month"] = dt.month
    out["quarter"] = dt.quarter
    out["day_of_year"] = dt.dayofyear
    out["week_of_year"] = dt.isocalendar().week.astype(int)
    
    # Meteorological seasons (Southern Hemisphere)
    # DJF = summer, MAM = autumn, JJA = winter, SON = spring
    month_to_season = {
        12: "summer", 1: "summer", 2: "summer",
        3: "autumn", 4: "autumn", 5: "autumn",
        6: "winter", 7: "winter", 8: "winter",
        9: "spring", 10: "spring", 11: "spring",
    }
    out["season"] = dt.month.map(month_to_season)
    
    # South Africa wet season: Oct–Apr (1 = wet, 0 = dry)
    wet_months = {10, 11, 12, 1, 2, 3, 4}
    out["is_wet_season"] = (dt.month.isin(wet_months)).astype(int)
    
    # Months since start of study (2011-01-01)
    start = pd.Timestamp("2011-01-01")
    out["months_since_start"] = (
        (out["year"] - start.year) * 12 + (dt.month - start.month)
    ).clip(lower=0)
    
    # Cyclical encoding: month (1–12) -> sin/cos so Dec and Jan are adjacent
    out["month_sin"] = np.sin(2 * np.pi * out["month"] / 12)
    out["month_cos"] = np.cos(2 * np.pi * out["month"] / 12)
    
    # Cyclical encoding: day_of_year (1–366) for within-year seasonality
    max_doy = 366  # fixed period that also covers leap years (e.g., 2012)
    out["day_of_year_sin"] = np.sin(2 * np.pi * out["day_of_year"] / max_doy)
    out["day_of_year_cos"] = np.cos(2 * np.pi * out["day_of_year"] / max_doy)
    
    return out

In [4]:
# Create temporal features for training and validation
train_temporal = create_temporal_features(train_df)
val_temporal = create_temporal_features(val_df)

print("Temporal features created:")
print([c for c in train_temporal.columns if c not in ["Latitude", "Longitude", "Sample Date"]])
print("\nFirst rows (training):")
train_temporal.head(10)

Temporal features created:
['year', 'month', 'quarter', 'day_of_year', 'week_of_year', 'season', 'is_wet_season', 'months_since_start', 'month_sin', 'month_cos', 'day_of_year_sin', 'day_of_year_cos']

First rows (training):


Unnamed: 0,Latitude,Longitude,Sample Date,year,month,quarter,day_of_year,week_of_year,season,is_wet_season,months_since_start,month_sin,month_cos,day_of_year_sin,day_of_year_cos
0,-28.760833,17.730278,02-01-2011,2011,1,1,2,52,summer,1,0,0.5,0.866025,0.034328,0.999411
1,-26.861111,28.884722,03-01-2011,2011,1,1,3,1,summer,1,0,0.5,0.866025,0.051479,0.998674
2,-26.45,28.085833,03-01-2011,2011,1,1,3,1,summer,1,0,0.5,0.866025,0.051479,0.998674
3,-27.671111,27.236944,03-01-2011,2011,1,1,3,1,summer,1,0,0.5,0.866025,0.051479,0.998674
4,-27.356667,27.286389,03-01-2011,2011,1,1,3,1,summer,1,0,0.5,0.866025,0.051479,0.998674
5,-27.010111,26.698083,04-01-2011,2011,1,1,4,1,summer,1,0,0.5,0.866025,0.068615,0.997643
6,-25.127778,27.628889,04-01-2011,2011,1,1,4,1,summer,1,0,0.5,0.866025,0.068615,0.997643
7,-25.20639,27.558,04-01-2011,2011,1,1,4,1,summer,1,0,0.5,0.866025,0.068615,0.997643
8,-24.69514,27.40906,04-01-2011,2011,1,1,4,1,summer,1,0,0.5,0.866025,0.068615,0.997643
9,-26.984722,26.632278,04-01-2011,2011,1,1,4,1,summer,1,0,0.5,0.866025,0.068615,0.997643
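
Row 0 shows `week_of_year` 52 for 2 January 2011. This is expected: `dt.isocalendar()` uses ISO 8601 week numbering, where a week belongs to the year that contains its Thursday, so the first days of January can fall in the last week of the previous ISO year. A short check on the two dates from the rows above:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["02-01-2011", "03-01-2011"]), format="%d-%m-%Y")
iso = dates.dt.isocalendar()  # DataFrame with columns: year, week, day

# 2 Jan 2011 is a Sunday in ISO week 52 of 2010; 3 Jan 2011 starts ISO week 1 of 2011
print(iso["year"].tolist())  # [2010, 2011]
print(iso["week"].tolist())  # [52, 1]
```

If `week_of_year` is used together with the calendar `year`, this mismatch may be worth handling, for example by relying on `day_of_year` instead.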


## Step 3: Inspect Temporal Feature Distributions

We check that the features look plausible by inspecting the season counts, the wet/dry balance, and the year distribution.

In [5]:
print("Season counts (training):")
print(train_temporal["season"].value_counts().sort_index())
print("\nWet vs dry season (training):")
print(train_temporal["is_wet_season"].value_counts())
print("\nYear distribution (training):")
print(train_temporal["year"].value_counts().sort_index())

Season counts (training):
season
autumn    2406
spring    2474
summer    2001
winter    2438
Name: count, dtype: int64

Wet vs dry season (training):
is_wet_season
1    5225
0    4094
Name: count, dtype: int64

Year distribution (training):
year
2011    1602
2012    1756
2013    1951
2014    2084
2015    1926
Name: count, dtype: int64
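
The cyclical encodings can be checked in a similar way: each (sin, cos) pair should lie on the unit circle, and each month should map to a distinct point. The sketch below runs on a small illustrative frame; the same expressions would presumably apply to `train_temporal`.

```python
import numpy as np
import pandas as pd

# Illustrative frame with one row per month (not the real training data)
demo = pd.DataFrame({"month": np.arange(1, 13)})
demo["month_sin"] = np.sin(2 * np.pi * demo["month"] / 12)
demo["month_cos"] = np.cos(2 * np.pi * demo["month"] / 12)

# Every (sin, cos) pair should satisfy sin^2 + cos^2 == 1
radius_sq = demo["month_sin"] ** 2 + demo["month_cos"] ** 2
on_circle = np.allclose(radius_sq, 1.0)

# Each month should map to its own point on the circle
n_unique = demo[["month_sin", "month_cos"]].round(6).drop_duplicates().shape[0]

print(on_circle, n_unique)
```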


## Step 4: Save Temporal Feature Datasets

We save CSVs with `Latitude`, `Longitude`, `Sample Date`, and all temporal features. These can be merged with Landsat, TerraClimate, and spatial features in Phase 2, Step 6.

**Note**: For modeling, you may prefer to drop redundant columns. Tree-based models can split on raw `month` and `day_of_year` directly, so the sin/cos pairs often add little for them; linear and distance-based models generally benefit from `month_sin`/`month_cos` and can drop the raw values. We save all features so you can choose during the combine step.
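
One way to act on this note is to keep a small helper that drops columns by model family. The helper name, the two family labels, and the column grouping below are assumptions for illustration, not part of the original pipeline.

```python
import pandas as pd

# Column groups taken from this notebook's output; the split itself is an assumption
RAW_PERIODIC = ["month", "quarter", "day_of_year", "week_of_year"]
CYCLICAL = ["month_sin", "month_cos", "day_of_year_sin", "day_of_year_cos"]

def drop_redundant_temporal(df: pd.DataFrame, model_family: str) -> pd.DataFrame:
    """Hypothetical helper: drop temporal columns a model family is unlikely to need."""
    if model_family == "tree":
        to_drop = CYCLICAL
    elif model_family == "linear":
        to_drop = RAW_PERIODIC
    else:
        raise ValueError(f"Unknown model_family: {model_family!r}")
    # Ignore absent columns so the helper also works on partial frames
    return df.drop(columns=[c for c in to_drop if c in df.columns])

demo = pd.DataFrame({"year": [2011], "month": [1], "month_sin": [0.5], "month_cos": [0.866025]})
print(list(drop_redundant_temporal(demo, "tree").columns))    # ['year', 'month']
print(list(drop_redundant_temporal(demo, "linear").columns))  # ['year', 'month_sin', 'month_cos']
```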

In [6]:
out_train = "temporal_features_training.csv"
out_val = "temporal_features_validation.csv"

# Sample Date is already DD-MM-YYYY (kept from input for merge consistency)
train_save = train_temporal.copy()
val_save = val_temporal.copy()

train_save.to_csv(out_train, index=False)
val_save.to_csv(out_val, index=False)

print(f"Saved {out_train} ({len(train_save)} rows)")
print(f"Saved {out_val} ({len(val_save)} rows)")
print("\nColumns:", list(train_save.columns))

Saved temporal_features_training.csv (9319 rows)
Saved temporal_features_validation.csv (200 rows)

Columns: ['Latitude', 'Longitude', 'Sample Date', 'year', 'month', 'quarter', 'day_of_year', 'week_of_year', 'season', 'is_wet_season', 'months_since_start', 'month_sin', 'month_cos', 'day_of_year_sin', 'day_of_year_cos']


## Summary of Temporal Features

| Feature | Description | Use case |
|---------|-------------|----------|
| `year` | Calendar year (2011–2015) | Long-term trends |
| `month` | 1–12 | Raw month |
| `quarter` | 1–4 | Coarser seasonality |
| `day_of_year` | 1–366 (366 only in leap years) | Within-year position |
| `week_of_year` | ISO week number, 1–53 | Weekly granularity |
| `season` | summer/autumn/winter/spring | Categorical season |
| `is_wet_season` | 1=wet, 0=dry (Oct–Apr) | Hydrological regime |
| `months_since_start` | Months since 2011-01 | Linear time trend |
| `month_sin`, `month_cos` | Cyclical month encoding | Preserves Dec–Jan continuity |
| `day_of_year_sin`, `day_of_year_cos` | Cyclical day encoding | Fine-grained seasonality |

## Next Steps

- Phase 2, Step 6: Combine all features (Landsat, TerraClimate, spatial, temporal)
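
As a rough preview of the combine step, the sketch below merges temporal features back onto a Landsat-style frame using the three key columns. The frames are tiny stand-ins with illustrative values, and the choice of a left join with `validate="many_to_one"` is an assumption about how Step 6 might be done.

```python
import pandas as pd

KEYS = ["Latitude", "Longitude", "Sample Date"]

# Tiny stand-in frames; values are illustrative, not taken from the real files
landsat = pd.DataFrame({
    "Latitude": [-28.76, -26.86],
    "Longitude": [17.73, 28.88],
    "Sample Date": ["02-01-2011", "03-01-2011"],
    "NDMI": [0.19, 0.12],
})
temporal = pd.DataFrame({
    "Latitude": [-28.76, -26.86],
    "Longitude": [17.73, 28.88],
    "Sample Date": ["02-01-2011", "03-01-2011"],
    "month": [1, 1],
    "is_wet_season": [1, 1],
})

# Temporal features depend only on Sample Date, so duplicate keys carry no extra information
temporal_unique = temporal.drop_duplicates(subset=KEYS)

# validate= raises a MergeError if the right side still has duplicate keys
combined = landsat.merge(temporal_unique, on=KEYS, how="left", validate="many_to_one")

print(combined.shape)  # (2, 6)
```

Merging on floating-point coordinates relies on exact equality, so it is safest when both files carry the coordinates unchanged from the same source.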