# 📊 Exploratory Data Analysis (EDA)

> **PM Accelerator Mission**: "By making industry-leading tools and education available to individuals from all backgrounds, we level the playing field for future PM leaders."

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/moazmo/weather-trend-forecasting/blob/main/presentation/02_EDA_Analysis.ipynb)
[![nbviewer](https://img.shields.io/badge/render-nbviewer-orange.svg)](https://nbviewer.org/github/moazmo/weather-trend-forecasting/blob/main/presentation/02_EDA_Analysis.ipynb)

This notebook covers:
1. Data Loading & Cleaning
2. Statistical Analysis
3. Temporal Patterns
4. Geographic Analysis
5. Anomaly Detection

In [7]:
import plotly.io as pio
pio.renderers.default = "notebook_connected"

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go

# Load data
df = pd.read_csv('../data/processed/weather_cleaned.csv', parse_dates=['date'])
print(f"📊 Dataset Shape: {df.shape}")
print(f"📅 Date Range: {df['date'].min()} to {df['date'].max()}")
print(f"🌍 Countries: {df['country'].nunique()}")

📊 Dataset Shape: (102652, 34)
📅 Date Range: 2024-06-15 00:00:00 to 2025-12-24 00:00:00
🌍 Countries: 186


## 1. Data Overview

### Data Cleaning Steps Performed
1. **Missing Values**: Forward-filled gaps within each country
2. **Country Names**: Fixed 30+ typos (e.g., "Untied States" → "United States")
3. **Outliers**: Removed temperatures outside physical bounds (-90°C to 60°C)
4. **Date Parsing**: Converted to datetime with proper timezone handling
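
The first three steps above could look roughly like the sketch below. It is not the notebook's actual cleaning code: the toy rows, the `COUNTRY_FIXES` map, and the `clean_weather` helper are hypothetical, and the order of operations (fix names, drop out-of-bounds readings, then forward-fill) is an assumption. Names are fixed first so that misspelled rows land in the right group before filling.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the raw weather table (values are made up).
raw = pd.DataFrame({
    "country": ["Untied States", "United States", "United States", "Egypt", "Egypt"],
    "date": pd.to_datetime(
        ["2024-06-15", "2024-06-16", "2024-06-17", "2024-06-15", "2024-06-16"]
    ),
    "temperature_celsius": [25.0, np.nan, 120.0, 31.0, np.nan],
})

# Hypothetical typo map; the notebook reports fixing 30+ names.
COUNTRY_FIXES = {"Untied States": "United States"}


def clean_weather(df: pd.DataFrame) -> pd.DataFrame:
    """Apply name fixes, physical-bounds filtering, and per-country forward fill."""
    out = df.copy()
    out["country"] = out["country"].replace(COUNTRY_FIXES)

    # Drop readings outside -90..60 °C, but keep missing values so they can be filled.
    temp = out["temperature_celsius"]
    out = out[temp.isna() | temp.between(-90, 60)]

    # Forward-fill in date order, separately for each country.
    out = out.sort_values(["country", "date"])
    out["temperature_celsius"] = out.groupby("country")["temperature_celsius"].ffill()
    return out.reset_index(drop=True)


cleaned = clean_weather(raw)
print(cleaned)
```

Forward fill cannot fill a gap at the very start of a country's series, so a real pipeline would still need to check for leading missing values.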

In [8]:
# Basic statistics
print("📊 Temperature Statistics:")
print(df['temperature_celsius'].describe())

print("\n📊 Missing Values:")
print(df.isnull().sum())

📊 Temperature Statistics:
count    102652.000000
mean         21.989384
std           9.078192
min         -24.900000
25%          17.000000
50%          24.200000
75%          28.100000
max          49.200000
Name: temperature_celsius, dtype: float64

📊 Missing Values:
country                 0
date                    0
temperature_celsius     0
humidity                0
pressure_mb             0
wind_kph                0
precip_mm               0
cloud                   0
uv_index                0
latitude                0
longitude               0
month                   0
day_of_month            0
day_of_week             0
day_of_year             0
week_of_year            0
quarter                 0
is_weekend              0
month_sin               0
month_cos               0
day_sin                 0
day_cos                 0
day_of_year_sin         0
day_of_year_cos         0
temp_lag_1              0
temp_lag_2              0
temp_lag_3              0
temp_lag_7           

## 2. Temperature Distribution Analysis

![Temperature Distribution](images/temp_distribution.png)

In [9]:
# Global temperature distribution (interactive version)
fig = px.histogram(
    df, x='temperature_celsius', 
    nbins=100,
    title='🌡️ Global Temperature Distribution',
    labels={'temperature_celsius': 'Temperature (°C)'},
    color_discrete_sequence=['#4facfe']
)
fig.update_layout(template='plotly_dark', showlegend=False)
fig.show()

print(f"\n📈 Mean Temperature: {df['temperature_celsius'].mean():.1f}°C")
print(f"📈 Median Temperature: {df['temperature_celsius'].median():.1f}°C")
print(f"📈 Std Deviation: {df['temperature_celsius'].std():.1f}°C")


📈 Mean Temperature: 22.0°C
📈 Median Temperature: 24.2°C
📈 Std Deviation: 9.1°C


### Key Insight
The temperature distribution appears **bimodal**, with peaks around:
- **15-20°C**: Temperate regions
- **25-30°C**: Tropical regions

This supports adding **climate zone encoding** as a feature in our model.
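
One simple way to encode climate zones is to bin absolute latitude, as in the sketch below. The notebook does not show how its climate zone feature is built, so the band edges (23.5° and 66.5°), the zone labels, and the `add_climate_zone` helper are all assumptions for illustration.

```python
import pandas as pd


def add_climate_zone(df: pd.DataFrame, lat_col: str = "latitude") -> pd.DataFrame:
    """Add a coarse 'climate_zone' category from absolute latitude (assumed bands)."""
    bins = [0.0, 23.5, 66.5, 90.0]               # assumed band edges in degrees
    labels = ["tropical", "temperate", "polar"]  # assumed zone names
    out = df.copy()
    out["climate_zone"] = pd.cut(
        out[lat_col].abs(), bins=bins, labels=labels, include_lowest=True
    )
    return out


sample = pd.DataFrame({"latitude": [0.0, -33.9, 51.5, 78.2]})
zones = add_climate_zone(sample)

# One-hot encode the zone so it can be fed to a model.
encoded = pd.get_dummies(zones["climate_zone"], prefix="zone")
print(zones.join(encoded))
```

Latitude bands are only a rough proxy, since altitude and proximity to the ocean also shape local climate.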

## 3. Monthly Temperature Patterns

![Monthly Pattern](images/monthly_pattern.png)

In [10]:
# Monthly temperature patterns (interactive version)
monthly_avg = df.groupby('month')['temperature_celsius'].mean().reset_index()

fig = px.bar(
    monthly_avg, x='month', y='temperature_celsius',
    title='📅 Average Temperature by Month (Global)',
    labels={'temperature_celsius': 'Avg Temp (°C)', 'month': 'Month'},
    color='temperature_celsius',
    color_continuous_scale='RdYlBu_r'
)
fig.update_layout(template='plotly_dark')
fig.show()

## 4. Hemisphere Seasonality

![Hemisphere Seasonality](images/hemisphere_seasonality.png)

In [11]:
# Seasonality by Hemisphere (interactive version)
df['hemisphere'] = np.where(df['latitude'] >= 0, 'Northern', 'Southern')
monthly_hemi = df.groupby(['month', 'hemisphere'])['temperature_celsius'].mean().reset_index()

fig = px.line(
    monthly_hemi, x='month', y='temperature_celsius', color='hemisphere',
    title='🌍 Seasonality: Northern vs Southern Hemisphere',
    labels={'temperature_celsius': 'Avg Temp (°C)', 'month': 'Month'},
    markers=True
)
fig.update_layout(template='plotly_dark')
fig.show()

### Key Insight
**Opposite seasonality** is clearly visible:
- Northern Hemisphere peaks in **July-August** (summer)
- Southern Hemisphere peaks in **January-February** (summer)

This validates the importance of **hemisphere encoding** in our features.
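
As a sketch of what hemisphere encoding might look like, the example below adds a binary hemisphere flag and a season-aligned month that shifts Southern Hemisphere months by six. The column names `is_southern`, `season_month`, `season_sin`, and `season_cos`, and the helper itself, are hypothetical and not taken from the notebook's feature set.

```python
import numpy as np
import pandas as pd


def add_hemisphere_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a hemisphere flag and a month index aligned across hemispheres."""
    out = df.copy()
    out["is_southern"] = (out["latitude"] < 0).astype(int)

    # Shift Southern months by 6 so the same index marks the same season
    # in both hemispheres (e.g. 7 = midsummer).
    out["season_month"] = (out["month"] - 1 + 6 * out["is_southern"]) % 12 + 1

    # Cyclical encoding so month 12 and month 1 end up close together.
    angle = 2 * np.pi * (out["season_month"] - 1) / 12
    out["season_sin"] = np.sin(angle)
    out["season_cos"] = np.cos(angle)
    return out


sample = pd.DataFrame({"latitude": [40.7, -33.9], "month": [7, 1]})
features = add_hemisphere_features(sample)
print(features)
```

With this alignment, a Northern July and a Southern January map to the same seasonal position, so a model does not have to learn the six-month offset on its own.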

## 5. Anomaly Detection

We applied multiple anomaly detection methods:

| Method | Description | Use Case |
|--------|-------------|----------|
| **Z-Score** | Statistical deviation from mean | Simple outliers |
| **IQR** | Interquartile range | Robust to skewness |
| **Isolation Forest** | Tree-based isolation | Multivariate anomalies |
| **LOF** | Local Outlier Factor | Density-based detection |
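
The two univariate methods in the table can be sketched in a few lines of NumPy. The function names, the thresholds, and the sample readings below are illustrative assumptions rather than the settings used in this analysis.

```python
import numpy as np


def zscore_outliers(values, threshold=3.0):
    """Flag points whose absolute z-score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        # All values identical: nothing can be an outlier.
        return np.zeros(values.shape, dtype=bool)
    return np.abs((values - values.mean()) / std) > threshold


def iqr_outliers(values, k=1.5):
    """Flag points more than k * IQR below Q1 or above Q3."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up temperature readings with one obvious spike.
temps = np.array([20.0, 21.0, 19.5, 22.0, 20.5, 21.5, 20.0, 55.0])

# A lower threshold is used here because z-scores are bounded in tiny samples.
z_flags = zscore_outliers(temps, threshold=2.0)
iqr_flags = iqr_outliers(temps)
print(temps[z_flags], temps[iqr_flags])
```

The IQR rule depends only on quartiles, so a single extreme value does not inflate its cutoffs the way it inflates the mean and standard deviation behind the z-score.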

In [12]:
from sklearn.ensemble import IsolationForest

# Prepare features for anomaly detection
features = ['temperature_celsius', 'humidity', 'pressure_mb']
X = df[features].dropna().copy()  # explicit copy so adding a column below is safe

# Fit Isolation Forest
clf = IsolationForest(contamination=0.01, random_state=42)
X['anomaly'] = clf.fit_predict(X)

anomalies = X[X['anomaly'] == -1]
print(f"🔍 Detected {len(anomalies)} anomalies ({len(anomalies)/len(X)*100:.2f}%)")
print("\n📊 Anomaly Temperature Range:")
print(f"   Min: {anomalies['temperature_celsius'].min():.1f}°C")
print(f"   Max: {anomalies['temperature_celsius'].max():.1f}°C")

🔍 Detected 1027 anomalies (1.00%)

📊 Anomaly Temperature Range:
   Min: -24.9°C
   Max: 49.2°C


---

*Continue to Notebook 03 for Model Evaluation →*