# Outlier Detection (Russian: Выбросы) - Complete Guide
## 5 Methods (Plus an Ensemble) for Robust Outlier Detection in Weather/Emergency Data

This notebook demonstrates:
1. ✅ **Z-Score** - Standard statistical method
2. ✅ **IQR (Tukey's Fences)** - Quartile-based detection
3. ✅ **Isolation Forest** - ML-based multivariate detection
4. ✅ **Elliptic Envelope** - Gaussian assumption method
5. ✅ **MAD** - Median Absolute Deviation (robust)
6. ✅ **Ensemble** - Combine multiple methods

**Use Cases:**
- Weather data anomalies (extreme temperatures, unusual precipitation)
- Emergency case outliers (rare events)
- Data quality assurance
- Sensor malfunction detection

In [None]:
# Install required packages
!pip install pandas numpy scikit-learn scipy matplotlib seaborn

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from src.utils.outlier_detection import OutlierDetector, quick_outlier_detection

# Set plot style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("✅ All libraries imported successfully")

## Generate Sample Data with Outliers

In [None]:
# Generate weather data with artificial outliers
np.random.seed(42)
n_samples = 1000

# Normal weather data
temperature = np.random.normal(15, 8, n_samples)  # Mean 15°C, std 8°C
precipitation = np.random.gamma(2, 5, n_samples)  # Gamma distribution
humidity = np.random.uniform(30, 80, n_samples)   # Uniform 30-80%
wind_speed = np.abs(np.random.normal(10, 5, n_samples))  # Mean 10 km/h

# Add artificial outliers (5% of data)
n_outliers = int(n_samples * 0.05)
outlier_indices = np.random.choice(n_samples, n_outliers, replace=False)

# Split the outlier indices into four groups. n_outliers (50) is not divisible
# by 4, so group sizes differ by one; sizing each draw with len() avoids a
# shape-mismatch error on assignment.
hot_idx, cold_idx, rain_idx, wind_idx = np.array_split(outlier_indices, 4)

# Extreme temperature outliers
temperature[hot_idx] = np.random.uniform(45, 55, len(hot_idx))
temperature[cold_idx] = np.random.uniform(-35, -25, len(cold_idx))

# Extreme precipitation outliers
precipitation[rain_idx] = np.random.uniform(100, 150, len(rain_idx))

# Extreme wind speed outliers
wind_speed[wind_idx] = np.random.uniform(80, 120, len(wind_idx))

# Create DataFrame
df = pd.DataFrame({
    'date': pd.date_range('2020-01-01', periods=n_samples, freq='D'),
    'temperature': temperature,
    'precipitation': precipitation,
    'humidity': humidity,
    'wind_speed': wind_speed
})

print(f"Generated {n_samples} samples with ~{n_outliers} artificial outliers")
print(f"\nData Summary:")
print(df.describe())

## Visualize Data Distribution

In [None]:
# Plot distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
features = ['temperature', 'precipitation', 'humidity', 'wind_speed']

for i, feature in enumerate(features):
    ax = axes[i//2, i%2]
    
    # Box plot
    ax.boxplot(df[feature], vert=False)
    ax.set_xlabel(feature.capitalize())
    ax.set_title(f'{feature.capitalize()} Distribution (Box Plot)')
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("📊 Box plots show potential outliers as points beyond whiskers")

## Method 1: Z-Score Detection

In [None]:
detector = OutlierDetector(contamination=0.05)

# Detect outliers using Z-score
df_zscore = detector.detect_zscore(
    df, 
    columns=['temperature', 'precipitation', 'humidity', 'wind_speed'],
    threshold=3.0
)

print(f"\n📊 Z-Score Results:")
print(df_zscore[['temperature', 'precipitation', 'outlier_zscore']].head(10))
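
`OutlierDetector.detect_zscore` is a project helper whose source is not shown in this notebook. As a rough illustration of the rule it is named after, the sketch below flags a row when any selected column lies more than `threshold` standard deviations from that column's mean. The function name `zscore_flags`, the use of the population standard deviation (`ddof=0`), and the "any column" rule are assumptions made for this example, not the helper's confirmed behaviour.

```python
import numpy as np
import pandas as pd


def zscore_flags(frame: pd.DataFrame, columns, threshold: float = 3.0) -> pd.Series:
    """Flag rows where any listed column is more than `threshold` std devs from its mean."""
    values = frame[columns]
    # Column-wise standardisation: (x - mean) / std
    z = (values - values.mean()) / values.std(ddof=0)
    return (z.abs() > threshold).any(axis=1)


# Small demo on synthetic data (hypothetical values, not the notebook's df)
rng = np.random.default_rng(0)
zscore_demo = pd.DataFrame({"temperature": rng.normal(15, 8, 500)})
zscore_demo.loc[0, "temperature"] = 90.0  # one injected extreme value

zscore_demo_flags = zscore_flags(zscore_demo, ["temperature"])
print(zscore_demo_flags.sum(), "rows flagged")
```

Because the mean and standard deviation are themselves pulled by extreme values, a cluster of large outliers can mask one another under this rule.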

## Method 2: IQR (Interquartile Range)

In [None]:
# Detect outliers using IQR
df_iqr = detector.detect_iqr(
    df,
    columns=['temperature', 'precipitation', 'humidity', 'wind_speed'],
    k=1.5  # Standard Tukey's fence
)

print(f"\n📊 IQR Results:")
print(df_iqr[['temperature', 'precipitation', 'outlier_iqr']].head(10))
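
For reference, Tukey's fences mark a value as an outlier when it falls below `Q1 - k * IQR` or above `Q3 + k * IQR`. The sketch below applies that rule column by column with pandas; `iqr_flags` is a hypothetical name, and flagging a row when any column is out of bounds is an assumption about how per-column results are combined.

```python
import pandas as pd


def iqr_flags(frame: pd.DataFrame, columns, k: float = 1.5) -> pd.Series:
    """Flag rows where any listed column falls outside Tukey's fences."""
    values = frame[columns]
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    below = values < (q1 - k * iqr)  # compared column by column
    above = values > (q3 + k * iqr)
    return (below | above).any(axis=1)


# Toy data: Q1 = 9.25, Q3 = 11.75, IQR = 2.5, fences = [5.5, 15.5]
iqr_demo = pd.DataFrame({"wind_speed": [8.0, 9.0, 10.0, 11.0, 12.0, 100.0]})
iqr_demo_flags = iqr_flags(iqr_demo, ["wind_speed"])
print(iqr_demo_flags.tolist())  # only the last value (100.0) is flagged
```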

## Method 3: Isolation Forest (ML-based)

In [None]:
# Detect outliers using Isolation Forest
df_iforest = detector.detect_isolation_forest(
    df,
    columns=['temperature', 'precipitation', 'humidity', 'wind_speed']
)

print(f"\n📊 Isolation Forest Results:")
print(df_iforest[['temperature', 'precipitation', 'outlier_iforest', 'anomaly_score_iforest']].head(10))

# Plot anomaly scores
plt.figure(figsize=(14, 4))
plt.plot(df_iforest['anomaly_score_iforest'], linewidth=0.5)
plt.axhline(y=df_iforest['anomaly_score_iforest'].quantile(0.05), 
            color='r', linestyle='--', label='5% threshold')
plt.xlabel('Sample Index')
plt.ylabel('Anomaly Score')
plt.title('Isolation Forest Anomaly Scores')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
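
The two output columns above can be reproduced directly with scikit-learn. This is a sketch under the assumption that the project helper wraps `sklearn.ensemble.IsolationForest`, which the notebook does not confirm; the demo frame `iso_demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic two-feature data with one clearly isolated row
rng = np.random.default_rng(0)
iso_demo = pd.DataFrame({
    "temperature": rng.normal(15, 8, 300),
    "wind_speed": np.abs(rng.normal(10, 5, 300)),
})
iso_demo.loc[0, ["temperature", "wind_speed"]] = [55.0, 110.0]

features = iso_demo[["temperature", "wind_speed"]]
model = IsolationForest(contamination=0.05, random_state=42)
iso_labels = model.fit_predict(features)        # -1 = outlier, 1 = inlier
iso_scores = model.decision_function(features)  # lower = more anomalous

iso_demo["outlier_iforest"] = iso_labels == -1
iso_demo["anomaly_score_iforest"] = iso_scores
print(iso_demo["outlier_iforest"].sum(), "rows flagged")
```

With `decision_function`, negative scores correspond to rows labelled as outliers, and `contamination` sets the share of training rows that end up below zero.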

## Method 4: Elliptic Envelope (Gaussian)

In [None]:
# Detect outliers using Elliptic Envelope
df_elliptic = detector.detect_elliptic_envelope(
    df,
    columns=['temperature', 'precipitation', 'humidity', 'wind_speed']
)

print(f"\n📊 Elliptic Envelope Results:")
print(df_elliptic[['temperature', 'precipitation', 'outlier_elliptic', 'mahalanobis_distance']].head(10))
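
A minimal sketch of the same idea with scikit-learn's `EllipticEnvelope`, assuming (without confirmation from the notebook) that the helper is built on it. The estimator fits a robust covariance to the data and flags the points farthest from the centre in Mahalanobis distance. The array `ee_X` is synthetic.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

# Synthetic Gaussian data in two dimensions with one far-away point
rng = np.random.default_rng(0)
ee_X = rng.normal(loc=[15.0, 10.0], scale=[8.0, 5.0], size=(300, 2))
ee_X[0] = [60.0, 60.0]

envelope = EllipticEnvelope(contamination=0.05, random_state=42)
ee_labels = envelope.fit_predict(ee_X)    # -1 = outlier, 1 = inlier
ee_distances = envelope.mahalanobis(ee_X)  # squared Mahalanobis distances

print((ee_labels == -1).sum(), "rows flagged")
```

Note that `mahalanobis` returns squared distances, so take the square root before comparing against thresholds expressed in standard-deviation units.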

## Method 5: MAD (Median Absolute Deviation)

In [None]:
# Detect outliers using MAD
df_mad = detector.detect_mad(
    df,
    columns=['temperature', 'precipitation', 'humidity', 'wind_speed'],
    threshold=3.5
)

print(f"\n📊 MAD Results:")
print(df_mad[['temperature', 'precipitation', 'outlier_mad']].head(10))
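
The threshold of 3.5 matches the commonly used modified Z-score, `0.6745 * (x - median) / MAD`. The sketch below implements that formula; whether `detect_mad` uses exactly this scaling is an assumption, and `mad_flags` is a hypothetical name.

```python
import numpy as np
import pandas as pd


def mad_flags(frame: pd.DataFrame, columns, threshold: float = 3.5) -> pd.Series:
    """Flag rows where any listed column has |modified Z-score| > threshold."""
    values = frame[columns]
    median = values.median()
    mad = (values - median).abs().median()
    # A MAD of zero (more than half the values identical) would divide by zero;
    # replacing it with NaN leaves that column unflagged instead.
    mad = mad.replace(0, np.nan)
    modified_z = 0.6745 * (values - median) / mad
    return (modified_z.abs() > threshold).any(axis=1)


# Toy data: median = 15.75, MAD = 1.0, so 50.0 scores about 23.1
mad_demo = pd.DataFrame({"temperature": [14.0, 15.0, 15.5, 16.0, 17.0, 50.0]})
mad_demo_flags = mad_flags(mad_demo, ["temperature"])
print(mad_demo_flags.tolist())  # only the last value (50.0) is flagged
```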

## Method 6: Ensemble (Combines All Methods)

In [None]:
# Create new detector for ensemble
detector_ensemble = OutlierDetector(contamination=0.05)

# Ensemble detection
df_ensemble = detector_ensemble.ensemble_detection(
    df,
    columns=['temperature', 'precipitation', 'humidity', 'wind_speed'],
    methods=['zscore', 'iqr', 'isolation_forest', 'mad'],
    voting='majority'  # At least 2 methods agree
)

# Generate comprehensive report
report = detector_ensemble.generate_outlier_report(df_ensemble)

print(f"\n📊 Ensemble Results:")
print(df_ensemble[['temperature', 'precipitation', 'outlier_ensemble']].head(10))
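
How many votes count as a "majority" matters here: with four methods, a strict majority needs three votes, while the comment on the `voting` argument above says two. The sketch below makes the vote count explicit so either rule can be applied; `vote` and `min_votes` are hypothetical names, and the flag values are invented.

```python
import pandas as pd


def vote(flags: pd.DataFrame, min_votes=None) -> pd.Series:
    """Combine boolean per-method flags; default is a strict majority."""
    if min_votes is None:
        min_votes = flags.shape[1] // 2 + 1  # e.g. 3 of 4 methods
    return flags.sum(axis=1) >= min_votes


vote_demo = pd.DataFrame({
    "outlier_zscore":  [True, False, True,  False],
    "outlier_iqr":     [True, True,  True,  False],
    "outlier_iforest": [True, False, False, False],
    "outlier_mad":     [True, True,  False, False],
})

strict_votes = vote(vote_demo)               # needs 3 of 4 votes
loose_votes = vote(vote_demo, min_votes=2)   # needs 2 of 4 votes
print(strict_votes.tolist())  # [True, False, False, False]
print(loose_votes.tolist())   # [True, True, True, False]
```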

## Compare All Methods

In [None]:
# Visualize comparison
outlier_cols = [col for col in df_ensemble.columns if col.startswith('outlier_')]

comparison = pd.DataFrame({
    'Method': [col.replace('outlier_', '').upper() for col in outlier_cols],
    'Outliers Detected': [df_ensemble[col].sum() for col in outlier_cols],
    'Percentage': [df_ensemble[col].sum() / len(df_ensemble) * 100 for col in outlier_cols]
})

print("\n📊 Method Comparison:")
print(comparison.to_string(index=False))

# Bar plot
plt.figure(figsize=(12, 6))
plt.bar(comparison['Method'], comparison['Outliers Detected'], color='steelblue')
plt.axhline(y=n_outliers, color='r', linestyle='--', label=f'True outliers: {n_outliers}')
plt.xlabel('Detection Method')
plt.ylabel('Number of Outliers Detected')
plt.title('Outlier Detection Method Comparison')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
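
Counting detections against `n_outliers` shows how many rows each method flags, but not whether they are the injected ones. Since the injected positions are known, precision and recall give a fuller picture. The helper below is a sketch with a hypothetical name (`precision_recall`) and toy inputs.

```python
import numpy as np


def precision_recall(true_mask, predicted_mask):
    """Return (precision, recall) for boolean outlier masks."""
    true_mask = np.asarray(true_mask, dtype=bool)
    predicted_mask = np.asarray(predicted_mask, dtype=bool)
    true_positives = np.sum(true_mask & predicted_mask)
    precision = true_positives / predicted_mask.sum() if predicted_mask.any() else 0.0
    recall = true_positives / true_mask.sum() if true_mask.any() else 0.0
    return float(precision), float(recall)


# Toy example: 2 true outliers, 2 flagged rows, 1 in common
truth = [True, True, False, False, False]
flagged = [True, False, True, False, False]
pr_demo = precision_recall(truth, flagged)
print(pr_demo)  # (0.5, 0.5)
```

In this notebook the true mask could be built as `df.index.isin(outlier_indices)` and compared against each `outlier_*` column.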

## Handle Outliers: Different Strategies

In [None]:
# Strategy 1: Remove outliers
print("\n🛠️ STRATEGY 1: REMOVE OUTLIERS")
df_removed = detector_ensemble.handle_outliers(
    df_ensemble, 
    method='remove',
    outlier_col='outlier_ensemble'
)

# Strategy 2: Cap outliers (Winsorization)
print("\n🛠️ STRATEGY 2: CAP OUTLIERS (WINSORIZATION)")
df_capped = detector_ensemble.handle_outliers(
    df_ensemble, 
    method='cap',
    outlier_col='outlier_ensemble'
)

# Strategy 3: Interpolate outliers
print("\n🛠️ STRATEGY 3: INTERPOLATE OUTLIERS")
df_interpolated = detector_ensemble.handle_outliers(
    df_ensemble, 
    method='interpolate',
    outlier_col='outlier_ensemble'
)

# Strategy 4: Flag only (no modification)
print("\n🛠️ STRATEGY 4: FLAG ONLY (NO MODIFICATION)")
df_flagged = detector_ensemble.handle_outliers(
    df_ensemble, 
    method='flag',
    outlier_col='outlier_ensemble'
)
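
What `handle_outliers` does internally is not shown here. The sketch below illustrates three of the strategies on a single pandas Series, under the assumptions that capping bounds come from the 1st and 99th percentiles of the unflagged values and that only flagged rows are modified. All names and values are illustrative.

```python
import pandas as pd

handle_demo = pd.Series([14.0, 15.0, 90.0, 16.0, 17.0], name="temperature")
handle_flags = pd.Series([False, False, True, False, False])

# Remove: drop flagged rows entirely
removed_demo = handle_demo[~handle_flags]

# Cap: clip flagged values to percentile bounds taken from unflagged values
lower, upper = handle_demo[~handle_flags].quantile([0.01, 0.99])
capped_demo = handle_demo.where(~handle_flags, handle_demo.clip(lower=lower, upper=upper))

# Interpolate: blank out flagged values, then fill linearly from neighbours
interpolated_demo = handle_demo.mask(handle_flags).interpolate(method="linear")

print(len(removed_demo), capped_demo.iloc[2], interpolated_demo.iloc[2])
```

Interpolation only makes sense when row order is meaningful, as with the daily `date` index in this notebook.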

## Visualize Before/After Outlier Handling

In [None]:
# Compare temperature distributions
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Original
axes[0].hist(df['temperature'], bins=50, color='steelblue', alpha=0.7, edgecolor='black')
axes[0].set_title('Original Data')
axes[0].set_xlabel('Temperature (°C)')
axes[0].set_ylabel('Frequency')
axes[0].axvline(df['temperature'].mean(), color='r', linestyle='--', label='Mean')
axes[0].legend()

# After removing outliers
axes[1].hist(df_removed['temperature'], bins=50, color='green', alpha=0.7, edgecolor='black')
axes[1].set_title('After Removing Outliers')
axes[1].set_xlabel('Temperature (°C)')
axes[1].axvline(df_removed['temperature'].mean(), color='r', linestyle='--', label='Mean')
axes[1].legend()

# After capping outliers
axes[2].hist(df_capped['temperature'], bins=50, color='orange', alpha=0.7, edgecolor='black')
axes[2].set_title('After Capping Outliers')
axes[2].set_xlabel('Temperature (°C)')
axes[2].axvline(df_capped['temperature'].mean(), color='r', linestyle='--', label='Mean')
axes[2].legend()

plt.tight_layout()
plt.show()

print("\n📊 Statistics Comparison:")
print(f"{'Metric':<20} {'Original':<15} {'Removed':<15} {'Capped':<15}")
print("-" * 65)
print(f"{'Sample Count':<20} {len(df):<15} {len(df_removed):<15} {len(df_capped):<15}")
print(f"{'Mean Temp':<20} {df['temperature'].mean():<15.2f} {df_removed['temperature'].mean():<15.2f} {df_capped['temperature'].mean():<15.2f}")
print(f"{'Std Temp':<20} {df['temperature'].std():<15.2f} {df_removed['temperature'].std():<15.2f} {df_capped['temperature'].std():<15.2f}")
print(f"{'Min Temp':<20} {df['temperature'].min():<15.2f} {df_removed['temperature'].min():<15.2f} {df_capped['temperature'].min():<15.2f}")
print(f"{'Max Temp':<20} {df['temperature'].max():<15.2f} {df_removed['temperature'].max():<15.2f} {df_capped['temperature'].max():<15.2f}")

## Quick Detection Function

In [None]:
# Quick one-liner for outlier detection
df_quick = quick_outlier_detection(
    df,
    columns=['temperature', 'precipitation', 'humidity', 'wind_speed'],
    method='ensemble'
)

print("\n✅ Quick detection complete!")
print(f"Detected {df_quick['outlier_ensemble'].sum()} outliers")

## Export Results

In [None]:
# Save results
df_ensemble.to_csv('../data/processed/outlier_detection_results.csv', index=False)
df_removed.to_csv('../data/processed/data_outliers_removed.csv', index=False)
df_capped.to_csv('../data/processed/data_outliers_capped.csv', index=False)

print("✅ Results saved:")
print("   • outlier_detection_results.csv (with all detection methods)")
print("   • data_outliers_removed.csv (outliers removed)")
print("   • data_outliers_capped.csv (outliers capped)")

## Summary & Recommendations

### When to Use Each Method:

1. **Z-Score**:
   - ✅ Fast, simple
   - ⚠️ Assumes normal distribution
   - Use for: Quick univariate checks

2. **IQR (Tukey's Fences)**:
   - ✅ Robust to extreme values, no normality assumption
   - ✅ Industry standard
   - Use for: General outlier detection

3. **Isolation Forest**:
   - ✅ Multivariate, handles complex patterns
   - ✅ No distribution assumption
   - Use for: High-dimensional data, complex outliers

4. **Elliptic Envelope**:
   - ✅ Good for Gaussian data
   - ⚠️ Assumes normal distribution
   - Use for: Roughly Gaussian features (e.g. temperature); skewed features such as precipitation violate the assumption

5. **MAD**:
   - ✅ Very robust to extreme outliers
   - ✅ More reliable than Z-score when outliers inflate the mean and standard deviation
   - Use for: Data with many outliers

6. **Ensemble**:
   - ✅ Less sensitive to any single method's blind spots (consensus of methods)
   - ⚠️ Slower
   - Use for: Production systems, critical decisions

### Handling Strategy Recommendations:

- **Remove**: Data quality issues, sensor errors
- **Cap (Winsorization)**: Legitimate extreme values, preserve information
- **Interpolate**: Time series with occasional bad readings
- **Flag**: Need to keep all data, analyze separately

### For This Project:
**Recommended: Ensemble with Capping**
- Use ensemble detection (majority voting)
- Cap outliers to the 1st-99th percentile range
- Preserve extreme weather events (they're legitimate data!)
- Only remove obvious sensor errors