# WiDS Datathon 2026 - Data Profiling Analysis

This notebook performs automated data profiling using ydata-profiling on the training and test datasets.

## Dataset Overview
- **Training data**: 221 wildfire events (69 hits, 152 censored)
- **Test data**: 95 wildfire events
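- **Features**: 34 features across 6 categories (plus `event_id` in both files and the two target columns, `time_to_hit_hours` and `event`, in training data only)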
- **Target**: Predict probability of fire reaching evacuation zone at 12h, 24h, 48h, 72h
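
Because a fire that has reached the zone by 12h has also reached it by 24h, 48h, and 72h, the four predicted probabilities for one fire should be non-decreasing across horizons. The following is a minimal sketch of one way to enforce that after prediction. It assumes a hypothetical array of raw model outputs with one column per horizon; the name `enforce_monotone` and the sample values are illustrative and not part of the competition code.

```python
import numpy as np

HORIZONS_H = (12, 24, 48, 72)  # horizons named in the dataset overview


def enforce_monotone(probs):
    """Clip to [0, 1] and make each row non-decreasing across horizons."""
    clipped = np.clip(np.asarray(probs, dtype=float), 0.0, 1.0)
    # A running maximum along the horizon axis removes any later dip.
    return np.maximum.accumulate(clipped, axis=1)


# Hypothetical raw outputs: one row per fire, one column per horizon.
raw = np.array([
    [0.10, 0.08, 0.30, 0.25],
    [0.50, 0.60, 0.55, 1.20],
])
fixed = enforce_monotone(raw)
print(fixed)
```

A running maximum is only one option; it never lowers an earlier prediction, which may or may not suit a given model. Isotonic regression per row would be an alternative.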

## Setup and Imports

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport
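import warnings

# Suppress all warnings to keep the output readable; note that this also hides
# deprecation notices from pandas and ydata-profiling
warnings.filterwarnings('ignore')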

print("Libraries imported successfully!")

Libraries imported successfully!


## Load Data

In [2]:
# Load training data
train_df = pd.read_csv('Data/train.csv')
print(f"Training data shape: {train_df.shape}")
print(f"Training data columns: {train_df.shape[1]}")
print(f"Training data rows: {train_df.shape[0]}")

# Load test data
test_df = pd.read_csv('Data/test.csv')
print(f"\nTest data shape: {test_df.shape}")
print(f"Test data columns: {test_df.shape[1]}")
print(f"Test data rows: {test_df.shape[0]}")

# Load metadata
metadata_df = pd.read_csv('Data/metaData.csv')
print(f"\nMetadata loaded: {metadata_df.shape[0]} feature descriptions")

Training data shape: (221, 37)
Training data columns: 37
Training data rows: 221

Test data shape: (95, 35)
Test data columns: 35
Test data rows: 95

Metadata loaded: 37 feature descriptions


## Quick Data Overview

In [3]:
# Display first few rows of training data
print("Training Data - First 5 rows:")
display(train_df.head())

# Display basic info
print("\nTraining Data Info:")
train_df.info()

Training Data - First 5 rows:


| | event_id | num_perimeters_0_5h | dt_first_last_0_5h | low_temporal_resolution_0_5h | area_first_ha | area_growth_abs_0_5h | area_growth_rel_0_5h | area_growth_rate_ha_per_h | log1p_area_first | log1p_growth | ... | dist_fit_r2_0_5h | alignment_cos | alignment_abs | cross_track_component | along_track_speed | event_start_hour | event_start_dayofweek | event_start_month | time_to_hit_hours | event |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10892457 | 3 | 4.265188 | 0 | 79.696304 | 2.875935 | 0.036086 | 0.674281 | 4.390693 | 1.354787 | ... | 0.886373 | -0.054649 | 0.054649 | -1.937219 | -0.106026 | 19 | 4 | 5 | 18.892512 | 0 |
| 1 | 11757157 | 2 | 1.169918 | 0 | 8.946749 | 0.0 | 0.0 | 0.0 | 2.297246 | 0.0 | ... | 0.0 | -0.568898 | 0.568898 | -0.0 | -0.0 | 4 | 4 | 6 | 22.048108 | 1 |
| 2 | 11945086 | 4 | 4.777526 | 0 | 106.482638 | 0.0 | 0.0 | 0.0 | 4.677329 | 0.0 | ... | 0.0 | 0.882385 | 0.882385 | 0.0 | 0.0 | 22 | 4 | 8 | 0.888895 | 1 |
| 3 | 12044083 | 1 | 0.0 | 1 | 67.631125 | 0.0 | 0.0 | 0.0 | 4.228746 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20 | 5 | 8 | 60.953021 | 0 |
| 4 | 12052347 | 2 | 4.975273 | 0 | 35.632874 | 0.0 | 0.0 | 0.0 | 3.600946 | 0.0 | ... | 0.0 | 0.934634 | 0.934634 | -0.0 | 0.0 | 21 | 5 | 7 | 44.990274 | 0 |



Training Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 221 entries, 0 to 220
Data columns (total 37 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   event_id                      221 non-null    int64  
 1   num_perimeters_0_5h           221 non-null    int64  
 2   dt_first_last_0_5h            221 non-null    float64
 3   low_temporal_resolution_0_5h  221 non-null    int64  
 4   area_first_ha                 221 non-null    float64
 5   area_growth_abs_0_5h          221 non-null    float64
 6   area_growth_rel_0_5h          221 non-null    float64
 7   area_growth_rate_ha_per_h     221 non-null    float64
 8   log1p_area_first              221 non-null    float64
 9   log1p_growth                  221 non-null    float64
 10  log_area_ratio_0_5h           221 non-null    float64
 11  relative_growth_0_5h          221 non-null    float64
 12  radial_growth_m               221 non-null 

In [4]:
# Check target variable distribution
print("Target Variable Distribution:")
print("\nEvent (Hit within 72h):")
print(train_df['event'].value_counts())
print(f"\nHit rate: {train_df['event'].mean():.2%}")
print(f"Censored rate: {(1 - train_df['event'].mean()):.2%}")

# Summary covers all rows; for censored rows (event == 0) the value is
# presumably the censoring time rather than an observed hit time
print("\nTime to Hit (hours) - Summary Statistics:")
print(train_df['time_to_hit_hours'].describe())

Target Variable Distribution:

Event (Hit within 72h):
event
0    152
1     69
Name: count, dtype: int64

Hit rate: 31.22%
Censored rate: 68.78%

Time to Hit (hours) - Summary Statistics:
count    221.000000
mean      37.567626
std       25.902361
min        0.001220
25%       12.242322
50%       43.109830
75%       63.938706
max       66.994474
Name: time_to_hit_hours, dtype: float64
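
Since the task asks for hit probabilities at fixed horizons, it can help to turn the `event` and `time_to_hit_hours` columns into one binary label per horizon. The sketch below is illustrative only. It assumes that for censored rows (`event == 0`) `time_to_hit_hours` holds the last time the fire was known not to have hit, which the source does not state explicitly. Under that assumption, a row censored before a horizon has an unknown outcome at that horizon and is left as `NaN`. The function name `horizon_labels` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def horizon_labels(df, horizons=(12, 24, 48, 72)):
    """Binary hit-by-horizon labels; NaN where censoring hides the outcome."""
    t = df["time_to_hit_hours"].to_numpy(dtype=float)
    hit = df["event"].to_numpy(dtype=int) == 1
    out = {}
    for h in horizons:
        label = np.full(len(df), np.nan)
        label[hit & (t <= h)] = 1.0  # observed hit at or before the horizon
        label[t > h] = 0.0           # known not to have hit by the horizon
        out[f"hit_by_{h}h"] = label  # censored at or before h stays NaN
    return pd.DataFrame(out, index=df.index)


# Toy rows standing in for train_df (values are made up for illustration).
toy = pd.DataFrame({
    "time_to_hit_hours": [5.0, 30.0, 20.0, 60.0],
    "event": [1, 1, 0, 0],
})
labels = horizon_labels(toy)
print(labels)
```

Dropping or down-weighting the `NaN` rows per horizon is a simplification; survival models use the censored rows directly instead.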


## Generate Profiling Report for Training Data

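This section builds a profiling report for the training data that covers:
- Overview statistics
- Variable distributions
- Pearson and Spearman correlations
- Missing values
- Duplicate rows
- Alerts for issues such as constant, highly correlated, or zero-heavy columns

**Note**: `ProfileReport` is evaluated lazily, so most of the computation happens when the report is rendered or saved. That step may take a few minutes.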

In [5]:
# Generate profile report for training data
print("Generating profiling report for training data...")
print("This may take 2-3 minutes...")

train_profile = ProfileReport(
    train_df,
    title="WiDS Datathon 2026 - Training Data Profile",
    explorative=True,
    minimal=False,
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)

print("Training data profile generated!")

Generating profiling report for training data...
This may take 2-3 minutes...
Training data profile generated!


In [6]:
# Save training data profile to HTML
train_profile.to_file("train_data_profile.html")
print("Training data profile saved to: train_data_profile.html")
print("Open this file in a web browser to view the interactive report.")

Training data profile saved to: train_data_profile.html
Open this file in a web browser to view the interactive report.


## Generate Profiling Report for Test Data

In [7]:
# Generate profile report for test data
print("Generating profiling report for test data...")
print("This may take 1-2 minutes...")

test_profile = ProfileReport(
    test_df,
    title="WiDS Datathon 2026 - Test Data Profile",
    explorative=True,
    minimal=False,
    correlations={
        "pearson": {"calculate": True},
        "spearman": {"calculate": True},
        "kendall": {"calculate": False},
        "phi_k": {"calculate": False},
        "cramers": {"calculate": False},
    },
)

print("Test data profile generated!")

Generating profiling report for test data...
This may take 1-2 minutes...
Test data profile generated!


In [8]:
# Save test data profile to HTML
test_profile.to_file("test_data_profile.html")
print("Test data profile saved to: test_data_profile.html")
print("Open this file in a web browser to view the interactive report.")

Test data profile saved to: test_data_profile.html
Open this file in a web browser to view the interactive report.


## Compare Train and Test Distributions

Generate a comparison report to identify any distribution shifts between training and test data.
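
As a numeric complement to the visual comparison, a per-feature two-sample Kolmogorov-Smirnov statistic gives a quick ranking of which columns differ most between the two files. The sketch below computes the statistic directly with NumPy; the helper names `ks_statistic` and `shift_table` and the toy frames are hypothetical. With only 95 test rows, moderate values can arise by chance, so treat the ranking as a pointer for inspection rather than a test result.

```python
import numpy as np
import pandas as pd


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])  # the gap can only change at data points
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def shift_table(train, test):
    """KS statistic for each numeric column found in both frames, largest first."""
    numeric = train.select_dtypes(include="number").columns
    shared = [c for c in numeric if c in test.columns]
    stats = {c: ks_statistic(train[c].dropna(), test[c].dropna()) for c in shared}
    return pd.Series(stats, name="ks_stat").sort_values(ascending=False)


# Toy frames standing in for train_df and test_df.
toy_train = pd.DataFrame({"x": [1, 2, 3, 4], "y": [0.0, 0.0, 1.0, 1.0], "event": [0, 1, 0, 1]})
toy_test = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5.0, 6.0, 7.0, 8.0]})
shifts = shift_table(toy_train, toy_test)
print(shifts)
```

Columns that exist only in the training data, such as the targets, are skipped automatically because they are not shared.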

In [9]:
# Generate comparison report
print("Generating comparison report between train and test data...")
print("This may take 2-3 minutes...")

comparison_report = train_profile.compare(test_profile)

print("Comparison report generated!")

Generating comparison report between train and test data...
This may take 2-3 minutes...
Comparison report generated!


In [10]:
# Save comparison report to HTML
comparison_report.to_file("train_test_comparison.html")
print("Comparison report saved to: train_test_comparison.html")
print("Open this file in a web browser to view the interactive comparison.")

Comparison report saved to: train_test_comparison.html
Open this file in a web browser to view the interactive comparison.


## Summary

Three HTML reports have been generated:

1. **train_data_profile.html** - Comprehensive analysis of training data
2. **test_data_profile.html** - Comprehensive analysis of test data
3. **train_test_comparison.html** - Side-by-side comparison of train vs test distributions

### Key Things to Look For:

**In Training Data Profile:**
- Missing values (should be none)
- Feature distributions (many zeros for single-perimeter fires)
- Correlations between features
- Target variable distribution (31% hits, 69% censored)

**In Test Data Profile:**
- Similar distributions to training data
- No target variables (as expected)

**In Comparison Report:**
- Distribution shifts between train and test, feature by feature
- Differences large enough to affect model performance, such as a feature whose range or share of zeros changes sharply
- Feature stability across datasets

## Quick Feature Category Analysis

In [11]:
# Categorize features
temporal_coverage = ['num_perimeters_0_5h', 'dt_first_last_0_5h', 'low_temporal_resolution_0_5h']
growth_features = ['area_first_ha', 'area_growth_abs_0_5h', 'area_growth_rel_0_5h', 
                   'area_growth_rate_ha_per_h', 'log1p_area_first', 'log1p_growth',
                   'log_area_ratio_0_5h', 'relative_growth_0_5h', 'radial_growth_m', 
                   'radial_growth_rate_m_per_h']
kinematics = ['centroid_displacement_m', 'centroid_speed_m_per_h', 'spread_bearing_deg',
              'spread_bearing_sin', 'spread_bearing_cos']
distance = ['dist_min_ci_0_5h', 'dist_std_ci_0_5h', 'dist_change_ci_0_5h', 'dist_slope_ci_0_5h',
            'closing_speed_m_per_h', 'closing_speed_abs_m_per_h', 'projected_advance_m',
            'dist_accel_m_per_h2', 'dist_fit_r2_0_5h']
directionality = ['alignment_cos', 'alignment_abs', 'cross_track_component', 'along_track_speed']
temporal_metadata = ['event_start_hour', 'event_start_dayofweek', 'event_start_month']

print("Feature Categories:")
print(f"\nTemporal Coverage: {len(temporal_coverage)} features")
print(f"Growth: {len(growth_features)} features")
print(f"Kinematics: {len(kinematics)} features")
print(f"Distance: {len(distance)} features")
print(f"Directionality: {len(directionality)} features")
print(f"Temporal Metadata: {len(temporal_metadata)} features")
all_features = (temporal_coverage + growth_features + kinematics + distance
                + directionality + temporal_metadata)
print(f"\nTotal: {len(all_features)} features")

Feature Categories:

Temporal Coverage: 3 features
Growth: 10 features
Kinematics: 5 features
Distance: 9 features
Directionality: 4 features
Temporal Metadata: 3 features

Total: 34 features
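
Hand-written category lists are easy to get out of sync with the data, so a quick consistency check can be useful. The sketch below reports column names that appear in no category, names that appear in a category but not in the data, and names listed more than once. The helper `check_partition` and the toy inputs are hypothetical; in the notebook the inputs would be the six lists above and `test_df.columns`.

```python
def check_partition(categories, columns, ignore=()):
    """Return (missing, extra, duplicated) feature names for a category mapping."""
    listed = [name for names in categories.values() for name in names]
    skip = set(ignore)
    expected = [c for c in columns if c not in skip]
    missing = sorted(set(expected) - set(listed))  # in the data, in no category
    extra = sorted(set(listed) - set(expected))    # in a category, not in the data
    duplicated = sorted({n for n in listed if listed.count(n) > 1})
    return missing, extra, duplicated


# Toy inputs; one column is deliberately left uncategorized.
toy_categories = {
    "temporal_metadata": ["event_start_hour", "event_start_month"],
    "directionality": ["alignment_cos", "alignment_abs"],
}
toy_columns = ["event_id", "event_start_hour", "event_start_month",
               "alignment_cos", "alignment_abs", "along_track_speed"]
missing, extra, duplicated = check_partition(
    toy_categories, toy_columns, ignore=("event_id",)
)
print(missing, extra, duplicated)
```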


In [12]:
# Check for features with many zeros (single perimeter fires)
print("Features with high proportion of zeros (indicating single-perimeter fires):")
print("\n" + "="*70)

zero_proportions = (train_df == 0).sum() / len(train_df)
high_zero_features = zero_proportions[zero_proportions > 0.5].sort_values(ascending=False)

for feature, proportion in high_zero_features.items():
    if feature not in ['event_id', 'event', 'time_to_hit_hours']:
        print(f"{feature:40s}: {proportion:.1%} zeros")

Features with high proportion of zeros (indicating single-perimeter fires):

======================================================================
projected_advance_m                     : 91.9% zeros
closing_speed_abs_m_per_h               : 91.9% zeros
closing_speed_m_per_h                   : 91.9% zeros
dist_change_ci_0_5h                     : 91.9% zeros
dist_std_ci_0_5h                        : 91.4% zeros
dist_fit_r2_0_5h                        : 91.4% zeros
log1p_growth                            : 89.1% zeros
cross_track_component                   : 88.7% zeros
area_growth_abs_0_5h                    : 88.7% zeros
spread_bearing_sin                      : 88.7% zeros
spread_bearing_deg                      : 88.7% zeros
centroid_speed_m_per_h                  : 88.7% zeros
centroid_displacement_m                 : 88.7% zeros
radial_growth_rate_m_per_h              : 88.7% zeros
radial_growth_m                         : 88.7% zeros
relative_growth_0_5h                    : 88.7% zeros
log_area_ratio_0_5h                     : 88.7% zeros
area_

## Next Steps

After reviewing the profiling reports, consider:

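1. **Feature Engineering**:
   - Handle features with many zeros (single-perimeter fires)
   - Create interaction terms
   - Apply circular (sine/cosine) encoding to `event_start_hour` and `event_start_month`

2. **Feature Selection**:
   - Remove redundant features (in the rows shown above, `alignment_abs` is the absolute value of `alignment_cos`, and `log1p_area_first` is a log transform of `area_first_ha`)
   - Focus on the most predictive features

3. **Modeling Strategy**:
   - Survival analysis models (Cox proportional hazards, Random Survival Forest)
   - Handle censored data properly: 152 of the 221 training events are censored
   - Choose a cross-validation strategy suited to the small sample (221 events)

4. **Validation**:
   - Stratified splits (by event status)
   - Monitor C-index and Brier scores
   - Ensure predicted probabilities are non-decreasing across the 12h, 24h, 48h, and 72h horizons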
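
The circular encoding mentioned under Feature Engineering can be sketched as follows. It maps a periodic value onto the unit circle so that, for example, hour 23 and hour 0 end up close together, in the same way the dataset already provides `spread_bearing_sin` and `spread_bearing_cos`. The sketch assumes hours run 0-23 and months 1-12, which matches the sample rows shown earlier but is not confirmed by the source; the helper name `add_cyclical` is hypothetical.

```python
import numpy as np
import pandas as pd


def add_cyclical(df, column, period, offset=0):
    """Append sin/cos encodings of a periodic integer column; returns a copy."""
    out = df.copy()
    angle = 2.0 * np.pi * (out[column] - offset) / period
    out[f"{column}_sin"] = np.sin(angle)
    out[f"{column}_cos"] = np.cos(angle)
    return out


# Toy rows standing in for train_df.
toy = pd.DataFrame({"event_start_hour": [0, 6, 23], "event_start_month": [1, 7, 12]})
encoded = add_cyclical(toy, "event_start_hour", period=24)             # assumes 0-23
encoded = add_cyclical(encoded, "event_start_month", period=12, offset=1)  # assumes 1-12
print(encoded.round(3))
```

Tree-based models can often work with the raw integers, so this encoding matters most for linear models such as Cox regression.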