---
title: "Industrial Sensor Data Analysis"
format: html
---

# 📡 Sensor Data Analysis & Exploration
**Portfolio Project 1: Industrial Sensor Data**

---

## Objective
Explore, clean, and visualize multi-channel sensor data from a real industrial dataset.
Identify patterns, correlations, and temporal behaviour across sensors.

## Dataset
**UCI Machine Learning Repository: Gas Sensor Array Dataset**
- 16 chemical sensors exposed to gas mixtures
- 65,537 readings with timestamps and sensor responses
- Download: https://archive.ics.uci.edu/dataset/291

---

In [None]:
# 1. Imports & Configuration
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from matplotlib.gridspec import GridSpec
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-whitegrid')
FIG_W, FIG_H = 14, 6
print('All imports successful.')

## 1. Data Loading & Synthetic Fallback
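We first try to load the UCI gas sensor CSV from `DATA_PATH`. If the file is not available locally, we fall back to a synthetic stand-in with the same shape (16 sensors, 65,537 rows) and simulated drift, correlated noise, and transient spikes. The synthetic data is not fitted to the real dataset, so numbers derived from it are illustrative only.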

In [None]:
# 2. Load or synthesise data
DATA_PATH = 'gas_sensor_array.csv'


def generate_synthetic_sensor_data(n=65537, n_sensors=16, seed=42):
    """Synthetic gas-sensor array with realistic drift, noise, and cross-correlation."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0, 100, n)

    # Base drift per sensor (slow sinusoidal)
    base = np.column_stack([
        np.sin(t / (20 + i*3) + i*0.5) * (0.3 + 0.05*i)
        for i in range(n_sensors)
    ])

    # Correlated noise (shared + independent)
    shared_noise = rng.normal(0, 0.05, (n, 1))
    indep_noise = rng.normal(0, 0.02, (n, n_sensors))
    noise = shared_noise + indep_noise

    # Occasional transient spikes (gas pulses)
    spikes = np.zeros((n, n_sensors))
    spike_idx = rng.choice(n, size=200, replace=False)
    affected = rng.integers(0, n_sensors, size=200)
    spikes[spike_idx, affected] = rng.normal(1.5, 0.4, 200)

    data = base + noise + spikes
    data = np.clip(data, -2, 3)

    cols = [f'Sensor_{i+1:02d}' for i in range(n_sensors)]
    df = pd.DataFrame(data, columns=cols)
    df['Timestamp'] = pd.date_range('2023-01-01', periods=n, freq='5min')
    df['Gas_Label'] = rng.choice(['CH4', 'CO', 'NO2', 'H2S', 'NH3'], size=n, p=[
                                 0.3, 0.25, 0.2, 0.15, 0.1])
    return df


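try:
    df = pd.read_csv(DATA_PATH)
    # CSV timestamps load as strings; parse them so the time-axis plots work
    if 'Timestamp' in df.columns:
        df['Timestamp'] = pd.to_datetime(df['Timestamp'])
    print(f'Loaded from file: {df.shape}')
except FileNotFoundError:
    df = generate_synthetic_sensor_data()
    print(f'Synthetic data generated: {df.shape}')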

df.head()
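
The synthetic generator builds its noise from one term shared by all sensors (standard deviation 0.05) plus an independent term per sensor (standard deviation 0.02). Ignoring drift and spikes, that implies a pairwise noise correlation of 0.05² / (0.05² + 0.02²) ≈ 0.86. The sketch below checks this numerically. It mirrors the noise lines of `generate_synthetic_sensor_data` but is a standalone illustration; the names `expected_r` and `empirical_r` are introduced here and are not part of the notebook.

```python
import numpy as np

SIGMA_SHARED, SIGMA_INDEP = 0.05, 0.02  # values used in the generator above

# Closed form: cov(x_i, x_j) = sigma_shared^2,
# var(x_i) = sigma_shared^2 + sigma_indep^2
expected_r = SIGMA_SHARED**2 / (SIGMA_SHARED**2 + SIGMA_INDEP**2)

# Numerical check on two simulated noise channels
rng = np.random.default_rng(0)
n = 20_000
shared = rng.normal(0, SIGMA_SHARED, (n, 1))   # one column, broadcast to both
indep = rng.normal(0, SIGMA_INDEP, (n, 2))     # one column per channel
noise = shared + indep
empirical_r = np.corrcoef(noise[:, 0], noise[:, 1])[0, 1]

print(f"expected r  = {expected_r:.3f}")
print(f"empirical r = {empirical_r:.3f}")
```

In the full synthetic data the slow sinusoidal drift has a much larger amplitude than the noise, so the correlations between sensor columns are likely dominated by how the drift terms line up rather than by this noise figure.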

## 2. Basic Exploratory Data Analysis

In [None]:
# 3. Shape, types, nulls
print('Shape:', df.shape)
print('\nDtypes:\n', df.dtypes)
print('\nNull counts:\n', df.isnull().sum())
print('\nDescriptive statistics:')
df.drop(columns=['Timestamp', 'Gas_Label'],
        errors='ignore').describe().round(3)

In [None]:
# 4. Time-series overview: first 5 sensors
fig, axes = plt.subplots(5, 1, figsize=(FIG_W, 10), sharex=True)
sensor_cols = [c for c in df.columns if c.startswith('Sensor')]

for ax, col in zip(axes, sensor_cols[:5]):
    ax.plot(df['Timestamp'], df[col], linewidth=0.6, color='steelblue')
    ax.set_ylabel(col, fontsize=9)
    # 7 days at 5-min sampling = 7 * 288 = 2016 samples
    ax.set_xlim(df['Timestamp'].iloc[0],
                df['Timestamp'].iloc[min(7 * 288, len(df)-1)])

axes[-1].set_xlabel('Time')
fig.suptitle('Sensor Time Series: First 7 Days', fontsize=14, y=1.02)
plt.tight_layout()
plt.show()

## 3. Distribution & Correlation Analysis

In [None]:
# 5. Violin plots: distribution per sensor (sample for readability)
fig, ax = plt.subplots(figsize=(FIG_W, 5))
sample_cols = sensor_cols[:8]
# Cap the sample size so this also works on frames shorter than 5000 rows
data_to_plot = df[sample_cols].sample(min(5000, len(df)), random_state=0)
sns.violinplot(data=data_to_plot, ax=ax, palette='coolwarm', linewidth=0.8)
ax.set_title('Sensor Response Distributions', fontsize=13)
ax.set_ylabel('Response Magnitude')
plt.xticks(rotation=30)
plt.tight_layout()
plt.show()

In [None]:
# 6. Correlation heatmap
corr = df[sensor_cols].corr()
fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, vmin=-1, vmax=1, linewidths=0.5, ax=ax)
ax.set_title('Sensor Cross-Correlation Matrix', fontsize=13)
plt.tight_layout()
plt.show()
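
A 16×16 heatmap is hard to scan for the strongest relationships, so it can help to list the top pairs explicitly. The following is a minimal sketch, not part of the original analysis: `top_correlated_pairs` is a hypothetical helper, and the three-column demo frame is made up so the block runs on its own. In the notebook you would pass `df[sensor_cols]` instead.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame: pd.DataFrame, k: int = 5) -> pd.DataFrame:
    """Return the k column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    cols = corr.columns
    # Strictly-upper-triangle indices: each pair once, no self-pairs
    i, j = np.triu_indices(len(cols), k=1)
    pairs = pd.DataFrame({
        "sensor_a": cols[i],
        "sensor_b": cols[j],
        "r": corr.to_numpy()[i, j],
    })
    order = pairs["r"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(k).reset_index(drop=True)


# Hypothetical demo data: Sensor_02 tracks Sensor_01, Sensor_03 is independent
rng = np.random.default_rng(0)
a = rng.normal(0, 1, 500)
demo = pd.DataFrame({
    "Sensor_01": a,
    "Sensor_02": a + rng.normal(0, 0.1, 500),
    "Sensor_03": rng.normal(0, 1, 500),
})
top = top_correlated_pairs(demo, k=3)
print(top)
```

Sorting by absolute value keeps strong negative correlations in the list, which a plain descending sort on `r` would push to the bottom.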

## 4. Rolling Statistics & Drift Detection

In [None]:
# 7. Rolling mean + std for one sensor (detect drift)
WINDOW = 288  # 1 day at 5-min intervals
pick = 'Sensor_01'

df['roll_mean'] = df[pick].rolling(WINDOW, center=True).mean()
df['roll_std'] = df[pick].rolling(WINDOW, center=True).std()

fig, axes = plt.subplots(2, 1, figsize=(FIG_W, 7), sharex=True)

axes[0].plot(df['Timestamp'], df[pick], lw=0.5,
             color='steelblue', alpha=0.6, label='Raw')
axes[0].plot(df['Timestamp'], df['roll_mean'],
             color='crimson', lw=1.5, label='24h Rolling Mean')
axes[0].legend(loc='upper right')
axes[0].set_title('Sensor 01: Raw vs Rolling Mean')
axes[0].set_ylabel('Response')

axes[1].fill_between(df['Timestamp'], df['roll_std'],
                     color='orange', alpha=0.6)
axes[1].set_title('Rolling Standard Deviation (drift indicator)')
axes[1].set_ylabel('Std Dev')
axes[1].set_xlabel('Time')

plt.tight_layout()
plt.show()
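
The plots above show drift visually. One simple way to put a number on it is to fit a straight line to the rolling mean and report its slope. This is a sketch under stated assumptions: `rolling_drift_slope` is a hypothetical helper, the demo signal is made up, and a linear fit only suits drift that is roughly monotonic over the fitted span (the sinusoidal drift in the synthetic data would need a shorter span or a different model). In the notebook the call would be `rolling_drift_slope(df[pick], WINDOW)`.

```python
import numpy as np
import pandas as pd


def rolling_drift_slope(series: pd.Series, window: int) -> float:
    """Least-squares slope (units per sample) of the centred rolling mean."""
    smooth = series.rolling(window, center=True).mean()
    valid = smooth.notna().to_numpy()          # edges of a centred window are NaN
    x = np.arange(len(series))[valid]
    slope, _intercept = np.polyfit(x, smooth.to_numpy()[valid], deg=1)
    return float(slope)


# Hypothetical demo signal: known drift of 0.001 per sample plus noise
rng = np.random.default_rng(1)
n = 2000
demo = pd.Series(0.001 * np.arange(n) + rng.normal(0, 0.05, n))

slope = rolling_drift_slope(demo, window=50)
print(f"drift per sample: {slope:.5f}")
# 288 samples per day at 5-minute sampling
print(f"drift per day:    {slope * 288:.3f}")
```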

## 5. Gas Label Breakdown

In [None]:
# 8. Mean sensor response by gas label
mean_by_gas = df.groupby('Gas_Label')[sensor_cols].mean()

fig, ax = plt.subplots(figsize=(FIG_W, 5))
mean_by_gas.T.plot(kind='bar', ax=ax, width=0.8, colormap='tab10')
ax.set_title('Mean Sensor Response by Gas Label')
ax.set_ylabel('Mean Response')
ax.set_xlabel('Sensor')
ax.legend(title='Gas', bbox_to_anchor=(1.02, 1))
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
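
Bar heights alone do not show whether the per-gas differences are large relative to each sensor's overall spread. The sketch below computes a rough separation score: the range of per-label means divided by the column's overall standard deviation. `label_separation` is a hypothetical helper and the demo frame is made up. Note that in the synthetic fallback `Gas_Label` is drawn independently of the sensor values, so scores near zero are expected there.

```python
import numpy as np
import pandas as pd


def label_separation(frame: pd.DataFrame, label_col: str, value_cols: list) -> pd.Series:
    """Range of per-label means divided by the overall std, per column."""
    group_means = frame.groupby(label_col)[value_cols].mean()
    spread = group_means.max() - group_means.min()
    return spread / frame[value_cols].std()


# Hypothetical demo data: Sensor_02 responds to CO, Sensor_01 ignores the label
rng = np.random.default_rng(0)
n = 3000
labels = rng.choice(["CH4", "CO", "NO2"], size=n)
demo = pd.DataFrame({
    "Gas_Label": labels,
    "Sensor_01": rng.normal(0, 1, n),
    "Sensor_02": rng.normal(0, 1, n) + (labels == "CO") * 2.0,
})
sep = label_separation(demo, "Gas_Label", ["Sensor_01", "Sensor_02"])
print(sep.round(2))
```

In the notebook the equivalent call would be `label_separation(df, 'Gas_Label', sensor_cols)`.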

## Summary
- Explored 16 sensors across 65k+ readings
- Identified cross-sensor correlations and shared-noise structure
- Tracked sensor drift and variability via 24-hour rolling mean and standard deviation
- Segmented response patterns by gas label

These features and patterns feed directly into the **Anomaly Detection** and **ML Classification** notebooks.