# 🔍 Advanced Exploratory Data Analysis

> **PM Accelerator Mission**: "By making industry-leading tools and education available to individuals from all backgrounds, we level the playing field for future PM leaders. This is the PM Accelerator motto, as we grant aspiring and experienced PMs what they need most – Access. We introduce you to industry leaders, surround you with the right PM ecosystem, and discover the new world of AI product management skills."

---

## Objectives
1. **Anomaly Detection**: Identify and analyze outliers using multiple methods
2. **Data Quality Assessment**: Understand data patterns and distributions
3. **Visualization**: Create comprehensive visualizations of findings

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('dark_background')
print("✅ Libraries loaded successfully!")

✅ Libraries loaded successfully!


## 1. Data Loading & Overview

In [2]:
# Load the CLEANED dataset
DATA_PATH = "../data/weather_cleaned.csv"

# Read the CSV, parsing last_updated as a datetime column
df = pd.read_csv(DATA_PATH, parse_dates=['last_updated'])

print(f"📊 Dataset Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"📅 Date Range: {df['last_updated'].min()} to {df['last_updated'].max()}")
print(f"🌍 Countries: {df['country'].nunique()}")
df.head()

📊 Dataset Shape: 114,203 rows × 42 columns
📅 Date Range: 2024-05-16 01:45:00 to 2025-12-24 20:00:00
🌍 Countries: 204


Unnamed: 0,country,location_name,latitude,longitude,timezone,last_updated_epoch,last_updated,temperature_celsius,temperature_fahrenheit,condition_text,...,air_quality_PM10,air_quality_us-epa-index,air_quality_gb-defra-index,sunrise,sunset,moonrise,moonset,moon_phase,moon_illumination,date
0,Afghanistan,Kabul,34.52,69.18,Asia/Kabul,1970-01-01 00:00:01.715849100,2024-05-16 13:15:00,26.6,79.8,Partly Cloudy,...,26.6,1,1,04:50 AM,06:50 PM,12:12 PM,01:11 AM,Waxing Gibbous,55,2024-05-16
1,Albania,Tirana,41.33,19.82,Europe/Tirane,1970-01-01 00:00:01.715849100,2024-05-16 10:45:00,19.0,66.2,Partly Cloudy,...,2.0,1,1,05:21 AM,07:54 PM,12:58 PM,02:14 AM,Waxing Gibbous,55,2024-05-16
2,Algeria,Algiers,36.76,3.05,Africa/Algiers,1970-01-01 00:00:01.715849100,2024-05-16 09:45:00,23.0,73.4,Sunny,...,18.4,1,1,05:40 AM,07:50 PM,01:15 PM,02:14 AM,Waxing Gibbous,55,2024-05-16
3,Andorra,Andorra La Vella,42.5,1.52,Europe/Andorra,1970-01-01 00:00:01.715849100,2024-05-16 10:45:00,6.3,43.3,Light Drizzle,...,0.9,1,1,06:31 AM,09:11 PM,02:12 PM,03:31 AM,Waxing Gibbous,55,2024-05-16
4,Angola,Luanda,-8.84,13.23,Africa/Luanda,1970-01-01 00:00:01.715849100,2024-05-16 09:45:00,26.0,78.8,Partly Cloudy,...,262.3,5,10,06:12 AM,05:55 PM,01:17 PM,12:38 AM,Waxing Gibbous,55,2024-05-16


In [3]:
# Data types and missing values
info_df = pd.DataFrame({
    'Data Type': df.dtypes,
    'Non-Null Count': df.count(),
    'Null Count': df.isnull().sum(),
    'Null %': (df.isnull().sum() / len(df) * 100).round(2),
    'Unique Values': df.nunique()
})
info_df

Unnamed: 0,Data Type,Non-Null Count,Null Count,Null %,Unique Values
country,object,114203,0,0.0,204
location_name,object,114203,0,0.0,255
latitude,float64,114203,0,0.0,395
longitude,float64,114203,0,0.0,400
timezone,object,114203,0,0.0,198
last_updated_epoch,object,114203,0,0.0,1112
last_updated,datetime64[ns],114203,0,0.0,19523
temperature_celsius,float64,114203,0,0.0,662
temperature_fahrenheit,float64,114203,0,0.0,1090
condition_text,object,114203,0,0.0,46
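
In the preview above, `last_updated_epoch` reads `1970-01-01 00:00:01.715849100` and the dtype table lists it as `object`. One plausible reading (an assumption, not something the notebook confirms) is that Unix seconds were parsed as nanoseconds during cleaning. The sketch below recovers the seconds from one such string and converts them to UTC; the variable names are illustrative only.

```python
import pandas as pd

# One value copied from the preview table; assumed to be Unix seconds
# that were misread as nanoseconds somewhere upstream.
raw = pd.Series(["1970-01-01 00:00:01.715849100"])

# Timestamp.value is nanoseconds since the epoch, which here equals
# the original number of seconds.
seconds = raw.map(lambda s: pd.Timestamp(s).value)

# Re-interpret the integers as seconds and attach the UTC timezone.
fixed_utc = pd.to_datetime(seconds, unit="s", utc=True)

# Converting to the row's own timezone should reproduce last_updated.
fixed_local = fixed_utc.dt.tz_convert("Asia/Kabul")
print(fixed_utc.iloc[0], "|", fixed_local.iloc[0])
```

For the Kabul row the recovered local time matches the `last_updated` value shown in the preview (13:15), which supports the seconds-as-nanoseconds reading for that row at least.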


In [4]:
# Statistical summary for numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"📈 Numeric columns ({len(numeric_cols)}): {numeric_cols[:10]}...")
df[numeric_cols].describe()

📈 Numeric columns (29): ['latitude', 'longitude', 'temperature_celsius', 'temperature_fahrenheit', 'wind_mph', 'wind_kph', 'wind_degree', 'pressure_mb', 'pressure_in', 'precip_mm']...


Unnamed: 0,latitude,longitude,temperature_celsius,temperature_fahrenheit,wind_mph,wind_kph,wind_degree,pressure_mb,pressure_in,precip_mm,...,gust_kph,air_quality_Carbon_Monoxide,air_quality_Ozone,air_quality_Nitrogen_dioxide,air_quality_Sulphur_dioxide,air_quality_PM2.5,air_quality_PM10,air_quality_us-epa-index,air_quality_gb-defra-index,moon_illumination
count,114203.0,114203.0,114203.0,114203.0,114203.0,114203.0,114203.0,114203.0,114203.0,114203.0,...,114203.0,114203.0,114203.0,114203.0,114203.0,114203.0,114203.0,114203.0,114203.0,114203.0
mean,19.181513,22.009927,22.20536,71.97142,8.100401,13.039905,169.897437,1014.091241,29.945458,0.139141,...,18.294976,490.457107,60.13991,15.478621,10.865304,25.000947,50.228428,1.722949,2.665692,49.363134
std,24.43653,65.795019,9.073116,16.331409,7.527848,12.112072,102.986526,10.82243,0.319544,0.583829,...,14.178624,805.348434,31.590983,24.92088,38.220053,38.774609,154.86605,0.959204,2.497338,35.147339
min,-41.3,-175.2,-24.9,-12.8,2.2,3.6,1.0,947.0,27.96,0.0,...,3.6,-9999.0,0.0,0.0,-9999.0,0.168,-1848.15,1.0,1.0,0.0
25%,3.75,-6.8361,17.2,63.0,4.0,6.5,82.0,1010.0,29.83,0.0,...,10.4,220.15,40.0,1.48,0.95,7.215,10.25,1.0,1.0,14.0
50%,17.25,23.3167,24.3,75.7,6.9,11.2,163.0,1014.0,29.93,0.0,...,15.6,310.8,57.0,5.3,2.405,14.5,20.75,1.0,2.0,49.0
75%,40.4,50.58,28.2,82.8,11.2,18.0,255.0,1018.0,30.06,0.03,...,24.2,485.85,76.0,17.945,8.75,28.49,43.105,2.0,3.0,84.0
max,64.15,179.22,49.2,120.6,1841.2,2963.2,360.0,3006.0,88.77,42.24,...,2970.4,38879.398,480.7,427.7,521.33,1614.1,6037.29,6.0,10.0,100.0
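
The summary table above contains values that look like sentinels or sensor errors rather than weather: minimums of -9999 for carbon monoxide and sulphur dioxide, a negative PM10, a wind speed of 2963.2 kph, and a pressure of 3006 mb. A minimal sketch of masking such values before outlier detection follows. The plausibility limits in `PLAUSIBLE_RANGES` are assumptions chosen for illustration, and `mask_implausible` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the suspicious values in the summary table.
toy = pd.DataFrame({
    "air_quality_Carbon_Monoxide": [220.15, -9999.0, 310.8],
    "air_quality_PM10": [10.25, -1848.15, 43.105],
    "wind_kph": [6.5, 2963.2, 18.0],
    "pressure_mb": [1010.0, 3006.0, 1018.0],
})

# Assumed (low, high) limits per column; None means "no limit on this side".
PLAUSIBLE_RANGES = {
    "air_quality_Carbon_Monoxide": (0.0, None),
    "air_quality_PM10": (0.0, None),
    "wind_kph": (0.0, 400.0),
    "pressure_mb": (850.0, 1100.0),
}

def mask_implausible(frame, ranges):
    """Return a copy of frame with out-of-range values replaced by NaN."""
    out = frame.copy()
    for col, (low, high) in ranges.items():
        if col not in out.columns:
            continue
        bad = pd.Series(False, index=out.index)
        if low is not None:
            bad |= out[col] < low
        if high is not None:
            bad |= out[col] > high
        out.loc[bad, col] = np.nan
    return out

cleaned = mask_implausible(toy, PLAUSIBLE_RANGES)
print(cleaned.isna().sum())
```

Masking to NaN rather than dropping rows keeps the other columns of each record available; the later cells already call `dropna()` where complete rows are required.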


## 2. Anomaly Detection Methods

We'll implement multiple anomaly detection techniques:
1. **Z-Score Method**: Statistical approach using standard deviations
2. **IQR Method**: Interquartile range based detection
3. **Isolation Forest**: Machine learning ensemble method
4. **Local Outlier Factor (LOF)**: Density-based detection

In [5]:
# Focus on key numeric features for anomaly detection
anomaly_features = ['temperature_celsius', 'humidity', 'pressure_mb', 
                    'wind_kph', 'precip_mm', 'cloud', 'uv_index']

# Filter to existing columns
anomaly_features = [col for col in anomaly_features if col in df.columns]
print(f"🔍 Analyzing features: {anomaly_features}")

🔍 Analyzing features: ['temperature_celsius', 'humidity', 'pressure_mb', 'wind_kph', 'precip_mm', 'cloud', 'uv_index']


### 2.1 Z-Score Method

In [6]:
def detect_zscore_outliers(data, column, threshold=3):
    """Detect outliers using Z-score method."""
    z_scores = np.abs(stats.zscore(data[column].dropna()))
    outliers = z_scores > threshold
    return outliers, z_scores

# Apply Z-score to each feature
zscore_results = {}
for col in anomaly_features:
    # anomaly_features was already filtered to columns present in df
    outliers, z_scores = detect_zscore_outliers(df, col)
    zscore_results[col] = {
        'outliers': outliers.sum(),
        'percentage': (outliers.sum() / len(outliers) * 100)
    }

zscore_df = pd.DataFrame(zscore_results).T
zscore_df.columns = ['Outlier Count', 'Percentage (%)']
print("📊 Z-Score Anomaly Detection Results (threshold=3 std):")
zscore_df

📊 Z-Score Anomaly Detection Results (threshold=3 std):


Unnamed: 0,Outlier Count,Percentage (%)
temperature_celsius,525.0,0.459708
humidity,0.0,0.0
pressure_mb,79.0,0.069175
wind_kph,154.0,0.134848
precip_mm,1551.0,1.358108
cloud,0.0,0.0
uv_index,430.0,0.376523
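
The Z-score uses the mean and standard deviation, both of which are pulled toward the very extremes being searched for (the `wind_kph` maximum of 2963.2 in the summary table is one example). As a point of comparison, here is a sketch of the modified Z-score, which swaps in the median and the median absolute deviation (MAD). This is a different technique from the one used above; the 3.5 cutoff is the commonly quoted default, and `modified_zscore_outliers` is a hypothetical helper.

```python
import numpy as np

def modified_zscore_outliers(values, threshold=3.5):
    """Flag values whose median/MAD-based score exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # More than half the values equal the median (for example,
        # mostly-zero rainfall), so the score is undefined. Flag nothing
        # rather than divide by zero.
        return np.zeros(values.shape, dtype=bool)
    scores = 0.6745 * (values - median) / mad
    return np.abs(scores) > threshold

temps = [21.0, 22.5, 23.0, 24.0, 22.0, 23.5, 60.0]
print(modified_zscore_outliers(temps))
```

The `mad == 0` guard matters for a column like `precip_mm`, where the median is 0 and most readings are identical.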


### 2.2 IQR (Interquartile Range) Method

In [7]:
def detect_iqr_outliers(data, column, multiplier=1.5):
    """Detect outliers using IQR method."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    outliers = (data[column] < lower_bound) | (data[column] > upper_bound)
    return outliers, lower_bound, upper_bound

# Apply IQR to each feature
iqr_results = {}
for col in anomaly_features:
    if col in df.columns:
        outliers, lower, upper = detect_iqr_outliers(df, col)
        iqr_results[col] = {
            'outliers': outliers.sum(),
            'percentage': (outliers.sum() / len(df) * 100),
            'lower_bound': lower,
            'upper_bound': upper
        }

iqr_df = pd.DataFrame(iqr_results).T
print("📊 IQR Anomaly Detection Results (1.5×IQR):")
iqr_df

📊 IQR Anomaly Detection Results (1.5×IQR):


Unnamed: 0,outliers,percentage,lower_bound,upper_bound
temperature_celsius,2602.0,2.278399,0.7,44.7
humidity,0.0,0.0,-3.5,136.5
pressure_mb,3370.0,2.950886,998.0,1030.0
wind_kph,1786.0,1.563882,-10.75,35.25
precip_mm,21498.0,18.824374,-0.045,0.075
cloud,0.0,0.0,-112.5,187.5
uv_index,256.0,0.224162,-8.5,14.7
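
The 18.8% figure for `precip_mm` follows directly from its quartiles: Q1 is 0 and Q3 is 0.03, so the upper bound is 0.075 mm and almost any measurable rain counts as an outlier. One way to get a more informative result, sketched below, is to compute the bounds from wet observations only. Treating `precip_mm > 0` as "wet" is an assumption, and `iqr_outliers_nonzero` is a hypothetical helper.

```python
import pandas as pd

def iqr_outliers_nonzero(series, multiplier=1.5):
    """Flag unusually large values using IQR bounds from positive values only."""
    wet = series[series > 0]
    if wet.empty:
        return pd.Series(False, index=series.index)
    q1, q3 = wet.quantile(0.25), wet.quantile(0.75)
    upper = q3 + multiplier * (q3 - q1)
    # Zeros can never exceed the upper bound, so dry readings are not flagged.
    return series > upper

# Zero-inflated toy data: 30 dry readings, 7 ordinary wet ones, 1 downpour.
precip = pd.Series([0.0] * 30 + [0.2, 0.4, 0.5, 0.6, 0.8, 1.0, 1.2, 40.0])

# Plain IQR on the full series, as in the cell above, for comparison.
q1_all, q3_all = precip.quantile(0.25), precip.quantile(0.75)
plain_flags = precip > q3_all + 1.5 * (q3_all - q1_all)

wet_flags = iqr_outliers_nonzero(precip)
print("plain IQR:", int(plain_flags.sum()), "| wet-only IQR:", int(wet_flags.sum()))
```

On this toy series the plain method flags every wet reading, while the wet-only bounds flag just the downpour.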


### 2.3 Isolation Forest

In [8]:
# Prepare data for Isolation Forest
df_numeric = df[anomaly_features].dropna()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_numeric)

# Train Isolation Forest
iso_forest = IsolationForest(contamination=0.05, random_state=42, n_jobs=-1)
iso_predictions = iso_forest.fit_predict(X_scaled)

# -1 indicates outlier, 1 indicates inlier
iso_outliers = iso_predictions == -1

print(f"🌲 Isolation Forest Results:")
print(f"   - Total samples analyzed: {len(df_numeric):,}")
print(f"   - Outliers detected: {iso_outliers.sum():,} ({iso_outliers.sum()/len(iso_outliers)*100:.2f}%)")
print(f"   - Inliers: {(~iso_outliers).sum():,} ({(~iso_outliers).sum()/len(iso_outliers)*100:.2f}%)")

🌲 Isolation Forest Results:
   - Total samples analyzed: 114,203
   - Outliers detected: 5,711 (5.00%)
   - Inliers: 108,492 (95.00%)
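
The 5.00% result above is a consequence of `contamination=0.05`, which tells the model what fraction of rows to label as outliers; it is not an estimate of how many anomalies the data contains. A sketch of inspecting the raw anomaly scores instead follows, using synthetic data with one planted extreme point. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[0] = [10.0, 10.0]  # planted extreme point at row 0

# The default contamination='auto' is used, so no fraction is fixed up front.
forest = IsolationForest(random_state=42).fit(X)

# score_samples returns one score per row; lower means more anomalous.
scores = forest.score_samples(X)

# Rank rows from most to least anomalous and look at the score distribution.
ranked = np.argsort(scores)
print("Most anomalous rows:", ranked[:5])
print("Score quantiles (1%, 5%, 50%):", np.quantile(scores, [0.01, 0.05, 0.5]))
```

Looking at where the scores drop off sharply gives a data-driven cutoff, which can then be compared against the fixed 5% used in the cell above.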


### 2.4 Local Outlier Factor (LOF)

In [9]:
# Train LOF on a random sample for computational efficiency
# (seeded so the sample, and therefore the flagged rows, is reproducible)
sample_size = min(50000, len(df_numeric))
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(df_numeric), sample_size, replace=False)
X_sample = X_scaled[sample_idx]

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05, n_jobs=-1)
lof_predictions = lof.fit_predict(X_sample)

lof_outliers = lof_predictions == -1

print(f"🔬 Local Outlier Factor (LOF) Results:")
print(f"   - Samples analyzed: {sample_size:,}")
print(f"   - Outliers detected: {lof_outliers.sum():,} ({lof_outliers.sum()/len(lof_outliers)*100:.2f}%)")

🔬 Local Outlier Factor (LOF) Results:
   - Samples analyzed: 50,000
   - Outliers detected: 2,500 (5.00%)
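
Because LOF runs on a sample, its predictions are ordered by sample position rather than by the original row labels. The sketch below shows one way to map flagged positions back to index labels and to keep the per-row LOF scores. It uses synthetic data with a non-default index; all names are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
frame = pd.DataFrame(
    rng.normal(size=(300, 2)),
    columns=["a", "b"],
    index=np.arange(1000, 1300),  # labels differ from positions on purpose
)
frame.loc[1042] = [10.0, 10.0]  # planted extreme row

# Shuffle positions the way a random sample would
# (here the "sample" is the whole frame in random order).
sample_pos = rng.choice(len(frame), size=len(frame), replace=False)
X_sample = frame.to_numpy()[sample_pos]

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
is_outlier = lof.fit_predict(X_sample) == -1

# Sample position -> frame position -> original index label.
flagged_labels = frame.index[sample_pos[is_outlier]]

# negative_outlier_factor_ is lower (more negative) for stronger outliers.
lof_scores = pd.Series(lof.negative_outlier_factor_, index=frame.index[sample_pos])
print(lof_scores.sort_values().head())
```

With the labels recovered, flagged rows can be joined back to columns such as `country` or `last_updated` for inspection.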


## 3. Anomaly Visualization

In [10]:
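# Create anomaly detection summary visualization
# Note: the Z-Score and IQR totals sum per-feature flags (a row flagged in
# several features is counted several times), Isolation Forest counts rows,
# and LOF counts rows within its random sample, so the bars are not like-for-like.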
methods = ['Z-Score', 'IQR', 'Isolation Forest', 'LOF']
outlier_counts = [
    zscore_df['Outlier Count'].sum() if not zscore_df.empty else 0,
    iqr_df['outliers'].sum() if not iqr_df.empty else 0,
    iso_outliers.sum(),
    lof_outliers.sum()
]

fig = go.Figure(data=[
    go.Bar(
        x=methods,
        y=outlier_counts,
        marker_color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'],
        text=outlier_counts,
        textposition='outside'
    )
])

fig.update_layout(
    title='🔍 Anomaly Detection Methods Comparison',
    xaxis_title='Detection Method',
    yaxis_title='Number of Outliers Detected',
    template='plotly_dark',
    height=500
)
fig.show()
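
The Z-Score and IQR bars in this chart are sums of per-feature counts, so a row that is extreme in two features contributes twice, whereas the Isolation Forest and LOF bars count rows. A sketch of computing a row-level count for the Z-score method follows, which would put it on the same footing. The toy data and the `zscore_flags` helper are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def zscore_flags(frame, threshold=3.0):
    """Boolean frame: True where a value is more than threshold std from its column mean."""
    # ddof=0 matches the default of scipy.stats.zscore.
    z = (frame - frame.mean()) / frame.std(ddof=0)
    return z.abs() > threshold

rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["t", "w", "p"])
toy.loc[0, ["t", "w"]] = [15.0, 15.0]  # one row that is extreme in two features

flags = zscore_flags(toy)
per_feature_total = int(flags.sum().sum())   # counts row 0 twice
rows_flagged = int(flags.any(axis=1).sum())  # counts row 0 once
print("per-feature total:", per_feature_total, "| rows flagged:", rows_flagged)
```

Using `flags.any(axis=1)` also yields a boolean mask over rows, which can be compared directly with the Isolation Forest mask to see where the methods agree.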

In [11]:
# Visualize temperature distribution with outliers highlighted
if 'temperature_celsius' in df.columns:
    temp_data = df['temperature_celsius'].dropna()
    z_scores = np.abs(stats.zscore(temp_data))
    outlier_mask = z_scores > 3
    
    fig = go.Figure()
    
    # Normal points
    fig.add_trace(go.Histogram(
        x=temp_data[~outlier_mask],
        name='Normal',
        marker_color='#4ECDC4',
        opacity=0.7
    ))
    
    # Outliers
    fig.add_trace(go.Histogram(
        x=temp_data[outlier_mask],
        name='Outliers',
        marker_color='#FF6B6B',
        opacity=0.7
    ))
    
    fig.update_layout(
        title='🌡️ Temperature Distribution with Outliers Highlighted',
        xaxis_title='Temperature (°C)',
        yaxis_title='Frequency',
        template='plotly_dark',
        barmode='overlay',
        height=500
    )
    fig.show()

In [12]:
# Box plots for all anomaly features
fig = make_subplots(rows=2, cols=4, subplot_titles=anomaly_features[:8])

colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4', '#FFEAA7', '#DDA0DD', '#98D8C8', '#F7DC6F']

for i, col in enumerate(anomaly_features[:8]):
    if col in df.columns:
        row = i // 4 + 1
        col_idx = i % 4 + 1
        fig.add_trace(
            go.Box(y=df[col].dropna(), name=col, marker_color=colors[i % len(colors)]),
            row=row, col=col_idx
        )

fig.update_layout(
    title='📦 Box Plots: Identifying Outliers in Weather Features',
    template='plotly_dark',
    height=600,
    showlegend=False
)
fig.show()

In [13]:
# Anomaly counts by country (top 20)
if 'country' in df.columns and 'temperature_celsius' in df.columns:
    # Calculate outliers per country
    df_temp = df[['country', 'temperature_celsius']].dropna()
    
    def count_outliers(group):
        if len(group) < 10:
            return 0
        z_scores = np.abs(stats.zscore(group))
        return (z_scores > 3).sum()
    
    country_outliers = df_temp.groupby('country')['temperature_celsius'].apply(count_outliers)
    top_outlier_countries = country_outliers.sort_values(ascending=False).head(20)
    
    fig = go.Figure(data=[
        go.Bar(
            x=top_outlier_countries.values,
            y=top_outlier_countries.index,
            orientation='h',
            marker_color='#FF6B6B'
        )
    ])
    
    fig.update_layout(
        title='🌍 Top 20 Countries with Most Temperature Anomalies',
        xaxis_title='Number of Anomalies',
        yaxis_title='Country',
        template='plotly_dark',
        height=600
    )
    fig.show()

## 4. Insights Summary

In [14]:
# Generate summary statistics
print("="*60)
print("📊 ADVANCED EDA - SUMMARY INSIGHTS")
print("="*60)

print(f"\n📁 Dataset Overview:")
print(f"   • Total Records: {len(df):,}")
print(f"   • Total Features: {len(df.columns)}")
print(f"   • Countries Covered: {df['country'].nunique() if 'country' in df.columns else 'N/A'}")

print(f"\n🔍 Anomaly Detection Summary:")
print(f"   • Z-Score (3σ): ~{zscore_df['Percentage (%)'].mean():.2f}% average outliers per feature")
print(f"   • IQR Method: ~{iqr_df['percentage'].mean():.2f}% average outliers per feature")
print(f"   • Isolation Forest: {iso_outliers.sum()/len(iso_outliers)*100:.2f}% outliers")
print(f"   • LOF: {lof_outliers.sum()/len(lof_outliers)*100:.2f}% outliers")

print(f"\n🌡️ Temperature Statistics:")
if 'temperature_celsius' in df.columns:
    temp = df['temperature_celsius']
    print(f"   • Mean: {temp.mean():.2f}°C")
    print(f"   • Std Dev: {temp.std():.2f}°C")
    print(f"   • Range: {temp.min():.2f}°C to {temp.max():.2f}°C")

print("\n" + "="*60)

📊 ADVANCED EDA - SUMMARY INSIGHTS

📁 Dataset Overview:
   • Total Records: 114,203
   • Total Features: 42
   • Countries Covered: 204

🔍 Anomaly Detection Summary:
   • Z-Score (3σ): ~0.34% average outliers per feature
   • IQR Method: ~3.69% average outliers per feature
   • Isolation Forest: 5.00% outliers
   • LOF: 5.00% outliers

🌡️ Temperature Statistics:
   • Mean: 22.21°C
   • Std Dev: 9.07°C
   • Range: -24.90°C to 49.20°C



In [15]:
# Save anomaly data for use in other notebooks
import json
import os

anomaly_summary = {
    'zscore_results': zscore_df.to_dict() if not zscore_df.empty else {},
    'iqr_results': iqr_df.to_dict() if not iqr_df.empty else {},
    'isolation_forest_outliers': int(iso_outliers.sum()),
    'lof_outliers': int(lof_outliers.sum())
}

# Create the output directory if it does not exist yet
os.makedirs('../outputs', exist_ok=True)
with open('../outputs/anomaly_summary.json', 'w') as f:
    json.dump(anomaly_summary, f, indent=2)

print("✅ Anomaly summary saved to outputs/anomaly_summary.json")

✅ Anomaly summary saved to outputs/anomaly_summary.json
