# Exploratory Data Analysis (EDA)

This notebook explores the EcoNET Smart City IoT dataset, focusing on:
- Station coverage and missing data
- Sensor readings (e.g., soil moisture, temperature)
- Anomaly label distribution


In [None]:
# Explore the raw and processed dataset

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the raw sensor data, parsing the timestamp column as datetimes
df = pd.read_csv("../data/raw/mock_sensor_data.csv", parse_dates=['timestamp'])

# --- Basic Metadata ---
print("Data shape:", df.shape)
print("Columns:", df.columns.tolist())
print("Missing values:")
print(df.isna().sum())

# --- Time series plot ---
df.set_index('timestamp')[['sensor_A', 'sensor_B', 'sensor_C']].plot(figsize=(12, 4))
plt.title("Sensor Readings Over Time")
plt.xlabel("Time")
plt.ylabel("Value")
plt.tight_layout()
plt.show()

# --- Anomaly Distribution ---
sns.countplot(x='anomaly', data=df)
plt.title("Anomaly Class Balance")
plt.show()

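# --- Rolling statistics for a sensor ---
# window=12 counts rows, not elapsed time: this assumes rows are in
# chronological order and evenly sampled. The first 11 values are NaN
# because their windows are incomplete.
df['sensor_A_mean'] = df['sensor_A'].rolling(window=12).mean()
df['sensor_A_std'] = df['sensor_A'].rolling(window=12).std()

plt.figure(figsize=(12, 4))
plt.plot(df['timestamp'], df['sensor_A'], label='sensor_A')
plt.plot(df['timestamp'], df['sensor_A_mean'], label='Rolling Mean')
# Shaded band spans one rolling standard deviation either side of the mean
plt.fill_between(df['timestamp'],
                 df['sensor_A_mean'] - df['sensor_A_std'],
                 df['sensor_A_mean'] + df['sensor_A_std'],
                 alpha=0.2, label='±1 Rolling Std Dev')
plt.legend()
plt.title("Rolling Mean and Std for Sensor A")
plt.xlabel("Time")
plt.ylabel("Value")
plt.tight_layout()
plt.show()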
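
The row-based rolling window above corresponds to a fixed time span only if readings arrive at regular intervals. A minimal sketch of how sampling gaps could be checked, and how a time-based window could be used instead, follows. The `demo` frame and its values are hypothetical stand-ins for the real data, and flagging any step longer than the median interval is an assumed rule of thumb, not the notebook's method.

```python
import pandas as pd

# Hypothetical readings with one 20-minute gap in otherwise 5-minute sampling.
demo = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:05", "2024-01-01 00:10",
        "2024-01-01 00:30", "2024-01-01 00:35",
    ]),
    "sensor_A": [0.30, 0.31, 0.29, 0.35, 0.34],
}).sort_values("timestamp")

# Interval between consecutive readings; the median is taken as the typical step.
deltas = demo["timestamp"].diff()
typical_step = deltas.dropna().median()

# Timestamps that arrive after a longer-than-typical interval (assumed rule).
gaps = demo.loc[deltas > typical_step, "timestamp"]
print("Typical step:", typical_step)
print("Readings following a gap:", gaps.tolist())

# A time-based window averages over the last 10 minutes regardless of row count.
rolled = demo.set_index("timestamp")["sensor_A"].rolling("10min").mean()
print(rolled)
```

With a time-based window, a reading that follows a gap is averaged only with readings inside the same 10-minute span, rather than with rows that may be much older.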

## Notes

- Dataset used: `smart_city_iot.csv`, transformed from long → wide format.
- Exploratory plots helped identify sparsity and temporal irregularities.
- Only stations with sufficient temporal coverage were used for downstream models.
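
The long → wide transformation and the coverage-based station filter mentioned in the notes could look roughly like the sketch below. The column names (`station`, `sensor`, `value`), the sample records, and the `MIN_COVERAGE` threshold are all hypothetical, since the notes do not specify the actual schema or cutoff.

```python
import pandas as pd

# Hypothetical long-format records: one row per (timestamp, station, sensor).
long_df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00", "2024-01-01 00:00",
        "2024-01-01 01:00", "2024-01-01 01:00",
        "2024-01-01 00:00",
    ]),
    "station": ["S1", "S1", "S1", "S1", "S2"],
    "sensor": ["sensor_A", "sensor_B", "sensor_A", "sensor_B", "sensor_A"],
    "value": [0.31, 18.2, 0.29, 18.0, 0.40],
})

# Long -> wide: one row per (station, timestamp), one column per sensor.
# aggfunc="mean" collapses duplicate readings for the same key, if any.
wide_df = (
    long_df
    .pivot_table(index=["station", "timestamp"], columns="sensor",
                 values="value", aggfunc="mean")
    .reset_index()
)
wide_df.columns.name = None

# Coverage: share of all observed timestamps at which each station reported.
expected = long_df["timestamp"].nunique()
coverage = wide_df.groupby("station")["timestamp"].nunique() / expected

MIN_COVERAGE = 0.8  # assumed threshold, not taken from the notebook
kept_stations = coverage[coverage >= MIN_COVERAGE].index.tolist()
filtered = wide_df[wide_df["station"].isin(kept_stations)]

print(coverage)
print("Stations kept:", kept_stations)
```

Sensors that a station never reported appear as `NaN` in the wide table, which is one way the sparsity noted above becomes visible.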
