# 03 Extensive EDA – Single-Station Lightning Waveforms ⚡

This notebook is an in-depth exploratory data analysis (EDA) of a single station. It follows the style of `01_eda.ipynb` and `02_eda.ipynb` but goes further, adding noise statistics, time-frequency views and simple event detectors. It assumes you have generated synthetic lightning data with `scripts/sim_make.py` or obtained real recordings.

We will load the waveform for one station, inspect noise characteristics, visualise several events, and experiment with classic time-series techniques to pinpoint lightning occurrences.  Later we will attempt a simple denoising approach and evaluate detection accuracy.

## 0. Imports and file paths
```python
import numpy as np
import json, pathlib
import matplotlib.pyplot as plt
import scipy.signal as sig
```

Adjust the paths below to point at your waveform (`.npy`) and its accompanying meta file.
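
As a quick check of the expected inputs, here is a minimal sketch that writes a tiny stand-in waveform and meta file to a temporary folder and loads them back. The metadata layout (`fs`, plus `events` entries with `t` and `amp`) is inferred from how this notebook reads the file, not from `scripts/sim_make.py`; the helper `load_station` and the `demo_*` names are hypothetical.

```python
import json
import pathlib
import tempfile

import numpy as np


def load_station(wave_path, meta_path):
    """Load a waveform (memory-mapped) and its JSON metadata."""
    wave = np.load(wave_path, mmap_mode="r")
    with open(meta_path) as fh:
        meta = json.load(fh)
    fs = int(meta["fs"])  # integer sampling rate so it can be used in slices
    return wave, fs, meta["events"]


# Stand-in files: the schema below is an assumption based on the keys
# this notebook accesses (meta['fs'], ev['t'], ev['amp']).
tmp = pathlib.Path(tempfile.mkdtemp())
np.save(tmp / "demo_LON.npy", np.zeros(1000, dtype=np.float32))
(tmp / "demo_meta.json").write_text(
    json.dumps({"fs": 1000, "events": [{"t": 0.25, "amp": 1.5}]})
)

demo_wave, demo_fs, demo_events = load_station(
    tmp / "demo_LON.npy", tmp / "demo_meta.json"
)
print(demo_wave.shape[0], demo_fs, demo_events[0])
```

If your real metadata uses different keys, only the two lookups inside `load_station` would need to change.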

In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 0. File paths and loading helpers                       ║
# ╚══════════════════════════════════════════════════════════╝
import numpy as np, json, pathlib, scipy.signal as sig, matplotlib.pyplot as plt
root = pathlib.Path('../data/synthetic')
wave_f = root / 'storm1_LON.npy'  # single station example
meta_f = root / 'storm1_meta.json'
wave = np.load(wave_f, mmap_mode='r')
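with open(meta_f) as fh:  # close the file handle once the JSON is read
    meta = json.load(fh)
fs = int(meta['fs'])  # integer so it can be used in slices such as wave[:fs]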
events = meta['events']
print('Samples:', wave.shape[0], 'Fs=', fs)
print('Total events:', len(events))
print('First event:', events[0])

## 1. First-look visualisation
Plot the first second of the waveform to see the noise and, if one falls in that window, a lightning burst. We also compute a short-time Fourier transform (spectrogram) to examine frequency content.
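
The spectrogram settings used in this section (`NFFT=2048`, `noverlap=1024`) fix the time and frequency resolution. The sketch below works out both and checks them against `scipy.signal.spectrogram` on one second of synthetic noise. The sampling rate `fs_demo` is an assumed value, not one read from the metadata.

```python
import numpy as np
import scipy.signal as sig

fs_demo = 100_000            # assumed sampling rate (Hz)
nfft, noverlap = 2048, 1024  # values used by the specgram call in this section

freq_res = fs_demo / nfft            # spacing between frequency bins (Hz)
hop_s = (nfft - noverlap) / fs_demo  # time step between spectrogram columns (s)

# One second of synthetic noise, only used to confirm the arithmetic
x = np.random.default_rng(0).standard_normal(fs_demo)
f, t, S = sig.spectrogram(x, fs=fs_demo, nperseg=nfft, noverlap=noverlap)

print(f"{freq_res:.1f} Hz per bin, {hop_s * 1e3:.2f} ms per column, S shape {S.shape}")
```

A shorter `NFFT` gives finer time steps but coarser frequency bins, which may suit short bursts better.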

In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 1. First-second waveform & spectrogram                  ║
# ╚══════════════════════════════════════════════════════════╝
sec = wave[:fs]
t = np.arange(sec.size)/fs
fig, ax = plt.subplots(2,1,figsize=(12,4),sharex=True)
ax[0].plot(t, sec)
ax[0].set_ylabel('E-field (arb.)'); ax[0].set_title('First second')
ax[1].specgram(sec, Fs=fs, NFFT=2048, noverlap=1024, cmap='viridis')
ax[1].set_xlabel('Time (s)'); ax[1].set_ylabel('Frequency (Hz)')
plt.show()

## 2. Noise statistics
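We quantify baseline noise to help choose detection thresholds: the mean, standard deviation, power spectral density (Welch) and amplitude envelope of the first five seconds. This treats the first five seconds as noise-only, which is only approximate if events fall inside that window.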
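
If a burst does fall inside the noise segment, the plain standard deviation is inflated. A minimal sketch of a more robust alternative, using the median absolute deviation (MAD) on synthetic data, is shown below. The noise level and burst size are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 50_000)  # synthetic unit-variance noise
x[10_000:10_050] += 40.0          # one large burst contaminating the segment

std_plain = x.std()
mad = np.median(np.abs(x - np.median(x)))
std_robust = 1.4826 * mad         # MAD scaled to match sigma for Gaussian noise

print(f"plain std={std_plain:.2f}, robust std={std_robust:.2f}")
```

The robust estimate stays close to the true noise level, while the plain one is pulled up by the 50 contaminated samples.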

In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 2. Noise characteristics                                 ║
# ╚══════════════════════════════════════════════════════════╝
noise_seg = wave[:fs*5]
mean = float(noise_seg.mean())
std = float(noise_seg.std())
f, Pxx = sig.welch(noise_seg, fs, nperseg=4096)
print('Mean', mean, 'Std', std)
plt.semilogy(f, Pxx); plt.xlabel('Hz'); plt.ylabel('PSD'); plt.show()
rect = np.abs(noise_seg)            # full-wave rectify
b, a = sig.butter(2, 500/(fs/2))    # 2nd-order low-pass, 500 Hz cut-off
env = sig.filtfilt(b, a, rect)      # zero-phase smoothing gives the envelope
plt.plot(np.arange(env.size)/fs, env); plt.title('Noise envelope'); plt.show()

## 3. Explore lightning events
Loop over the first three labelled events and plot a snippet around each (10 ms before to 50 ms after), marking the labelled event time and showing its amplitude in the title.

In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 3. Inspect several labelled events                       ║
# ╚══════════════════════════════════════════════════════════╝
for ev in events[:3]:
    i0 = int(ev['t'] * fs)
    start = max(i0 - int(0.01*fs), 0)  # clamp so early events do not wrap around
    snip = wave[start:i0 + int(0.05*fs)]
    t_snip = (np.arange(snip.size) + start - i0) / fs  # time relative to the event
    plt.figure(figsize=(10,3))
    plt.plot(t_snip*1000, snip)
    plt.axvline(0, color='r', ls='--', label='event t')
    plt.xlabel('ms'); plt.ylabel('Amplitude');
    plt.title(f"Event at t={ev['t']:.3f}s, amp={ev['amp']:.2f}")
    plt.legend(); plt.show()

## 4. Rolling statistics & envelope
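A moving-window view helps us detect bursts. We compute a rolling mean and rolling standard deviation over a 1 ms window, plus an amplitude envelope from the magnitude of the analytic signal (Hilbert transform). Only the envelope is plotted here.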
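
For reference, an exact per-window mean and standard deviation can be obtained from two moving averages, using var = E[x²] − E[x]². The sketch below is one way to do this on synthetic data, cross-checked against a direct windowed computation. The helper name `rolling_mean_std` is hypothetical, and it returns only full windows (`mode='valid'`), so the output is shorter than the input.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_mean_std(x, win):
    """Exact per-window mean and std over full windows (mode='valid')."""
    kernel = np.ones(win) / win
    m1 = np.convolve(x, kernel, mode="valid")      # E[x] in each window
    m2 = np.convolve(x * x, kernel, mode="valid")  # E[x^2] in each window
    var = np.clip(m2 - m1 * m1, 0.0, None)         # clip tiny negative round-off
    return m1, np.sqrt(var)


rng = np.random.default_rng(2)
x = rng.standard_normal(5_000)
mean_c, std_c = rolling_mean_std(x, win=100)

# Cross-check against a direct computation over every window
windows = sliding_window_view(x, 100)
print(np.allclose(mean_c, windows.mean(axis=1)),
      np.allclose(std_c, windows.std(axis=1)))
```

The E[x²] − E[x]² form can lose precision when the mean is large compared with the spread, so removing a DC offset first may help on real recordings.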

In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 4. Rolling stats and envelope                            ║
# ╚══════════════════════════════════════════════════════════╝
window_ms = 1
win = int(window_ms*1e-3*fs)
roll_mean = np.convolve(wave, np.ones(win)/win, mode='same')
roll_std = np.sqrt(np.convolve((wave-roll_mean)**2, np.ones(win)/win, mode='same'))
envelope = np.abs(sig.hilbert(wave))
env2 = envelope[:fs*2]
plt.plot(np.arange(env2.size)/fs, env2); plt.xlabel('Time (s)'); plt.title('Envelope – first two seconds'); plt.show()

## 5. Simple band-pass filter (denoising)
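We design a fourth-order Butterworth band-pass filter (3–9 kHz) to suppress low-frequency drift and high-frequency noise. It is applied forwards and backwards with `filtfilt`, so it adds no phase delay. The pass band requires a sampling rate above 18 kHz.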
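
Band-pass designs in `(b, a)` form can become numerically fragile at higher orders, so a common alternative is the second-order-sections (SOS) form. The sketch below builds the same 3–9 kHz pass band as SOS and checks it on two synthetic tones. The sampling rate and test tones are assumed values chosen for illustration.

```python
import numpy as np
import scipy.signal as sig

fs_demo = 50_000        # assumed sampling rate (Hz)
low, high = 3000, 9000  # pass band from the notebook (Hz)

# Passing fs lets butter() take cut-offs in Hz; output='sos' returns
# second-order sections instead of (b, a) polynomial coefficients.
sos = sig.butter(4, [low, high], btype="band", fs=fs_demo, output="sos")

t = np.arange(fs_demo) / fs_demo        # one second
in_band = np.sin(2 * np.pi * 6000 * t)  # tone inside the pass band
out_band = np.sin(2 * np.pi * 50 * t)   # low-frequency interference
y = sig.sosfiltfilt(sos, in_band + out_band)

# Away from the edges the output should match the in-band tone alone
core = slice(fs_demo // 10, -fs_demo // 10)
rms_err = np.sqrt(np.mean((y[core] - in_band[core]) ** 2))
print(f"RMS difference from the in-band tone: {rms_err:.2e}")
```

`sosfiltfilt` is zero-phase like `filtfilt`, so event timing is preserved.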

In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 5. Band-pass filter                                       ║
# ╚══════════════════════════════════════════════════════════╝
low, high = 3000, 9000
b, a = sig.butter(4, [low/(fs/2), high/(fs/2)], btype='band')
wave_filt = sig.filtfilt(b, a, wave)
plt.plot(wave[:fs], label='orig'); plt.plot(wave_filt[:fs], label='filt', alpha=0.7);
plt.legend(); plt.title('Band-pass filtered'); plt.show()

## 6. Burst detection via envelope threshold
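Using the filtered waveform we compute the envelope and set a threshold at the median plus six times the median absolute deviation (MAD) of the envelope. Rising edges of the thresholded envelope are taken as detections and matched to ground-truth event times within ±5 ms to count true positives, false positives and false negatives, from which precision and recall follow.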
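
The matching step can be sketched on hand-made times as follows. The convention is an assumption made for illustration: a true event counts as a true positive if at least one detection lies within the tolerance, and a detection counts as a false positive only if it is near no true event, so duplicate detections of one event are neither. The helper `score_detections` is hypothetical.

```python
import numpy as np


def score_detections(det_times, true_times, tol=0.005):
    """Count TP/FP/FN with a +/- tol (s) window; each true event counts once."""
    det_times = np.asarray(det_times, dtype=float)
    true_times = np.asarray(true_times, dtype=float)
    if det_times.size == 0 or true_times.size == 0:
        return 0, det_times.size, true_times.size
    # close[i, j] is True when detection i lies within tol of true event j
    close = np.abs(det_times[:, None] - true_times[None, :]) <= tol
    tp = int(close.any(axis=0).sum())     # true events with >= 1 detection
    fp = int((~close.any(axis=1)).sum())  # detections near no true event
    fn = true_times.size - tp
    return tp, fp, fn


# Two detections hit the first event, one is spurious, one event is missed
tp, fp, fn = score_detections([0.100, 0.101, 0.500, 0.900], [0.100, 0.300, 0.902])
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(tp, fp, fn, round(precision, 3), round(recall, 3))
```

Counting per true event keeps FN from going negative when a single burst crosses the threshold several times.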

In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 6. Envelope threshold detector                           ║
# ╚══════════════════════════════════════════════════════════╝
env_f = np.abs(sig.hilbert(wave_filt))
noise_level = np.median(env_f) + 6*np.median(np.abs(env_f-np.median(env_f)))
detections = env_f > noise_level
det_times = np.where(np.diff(detections.astype(int))==1)[0] / fs
print('Detected events:', len(det_times))
true_times = np.array([ev['t'] for ev in events])
tol = 0.005
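close = np.abs(det_times[:,None]-true_times) <= tol  # (n_det, n_true) match matrix
tp = int(close.any(axis=0).sum())     # true events with at least one detection
fn = len(true_times) - tp
fp = int((~close.any(axis=1)).sum())  # detections near no true event
print(f'TP={tp}, FP={fp}, FN={fn}')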

## 7. Extract clean waveforms
Extract short snippets around each detected event for further analysis or model training.
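
For model training it is often convenient to have equal-length snippets that stack into a single array. The sketch below is one possible variant: it uses the same 5 ms pre-trigger and 40 ms total length as this section, but skips windows that would run past either end of the recording. The helper `extract_snippets` and the ramp test signal are hypothetical.

```python
import numpy as np


def extract_snippets(x, det_times, fs, pre=0.005, post=0.035):
    """Return an (n_events, n_samples) array; windows cut by the edges are skipped."""
    n_pre, n_post = int(round(pre * fs)), int(round(post * fs))
    rows = []
    for t0 in det_times:
        i0 = int(round(t0 * fs))
        if i0 - n_pre < 0 or i0 + n_post > len(x):
            continue  # incomplete window near an edge
        rows.append(x[i0 - n_pre : i0 + n_post])
    return np.stack(rows) if rows else np.empty((0, n_pre + n_post))


fs_demo = 10_000                     # assumed sampling rate (Hz)
x = np.arange(fs_demo, dtype=float)  # 1 s ramp, so sample values equal indices
snips = extract_snippets(x, [0.001, 0.5, 0.999], fs_demo)
print(snips.shape, snips[0, 0])
```

Only the middle detection survives here, because the other two sit too close to the start and end of the signal.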

In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 7. Snippet extraction                                    ║
# ╚══════════════════════════════════════════════════════════╝
snippets = []
for t0 in det_times:
    i0 = int((t0 - 0.005)*fs)
    sl = slice(max(i0,0), i0 + int(0.04*fs))
    snippets.append(wave_filt[sl])
plt.plot(snippets[0]); plt.title('Example snippet'); plt.show()

## 8. Matched filter detection
A simple matched filter slides an idealised waveform template over the recording and correlates it with the signal. Peaks in the correlation indicate likely lightning events.
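
The idea can be tried on synthetic data where the answer is known. The sketch below buries three copies of a made-up damped-sine pulse in noise, correlates with the template, and picks peaks with a minimum spacing so each event yields one peak. The pulse shape, noise level and sampling rate are all assumptions for illustration.

```python
import numpy as np
import scipy.signal as sig

rng = np.random.default_rng(3)
fs_demo = 20_000                                # assumed sampling rate (Hz)
tt = np.arange(int(0.005 * fs_demo)) / fs_demo  # 5 ms template
template = np.exp(-tt / 0.001) * np.sin(2 * np.pi * 5000 * tt)  # hypothetical pulse

x = rng.normal(0.0, 0.1, fs_demo)               # 1 s of noise
true_idx = np.array([3000, 9000, 15000])
for i in true_idx:
    x[i : i + template.size] += template        # bury three pulses

# Correlating with the template equals convolving with its time reverse.
# Dropping the first len(template)-1 samples makes index n mean "pulse starts at n".
resp = sig.fftconvolve(x, template[::-1], mode="full")[template.size - 1 :]
resp /= np.sqrt(np.sum(template ** 2))          # unit-energy normalisation

# One peak per event: enforce a minimum spacing, then keep the strongest three
peaks, props = sig.find_peaks(resp, height=0, distance=template.size)
top = np.sort(peaks[np.argsort(props["peak_heights"])[-3:]])
print(top)
```

With `mode='same'`, as used in the notebook cell, the peaks appear roughly half a template length after the pulse onset.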


In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 8. Matched filter detection                               ║
# ╚══════════════════════════════════════════════════════════╝
template = snippets[0]  # crude template using first snippet
template = (template - template.mean()) / template.std()
xcorr = sig.fftconvolve(wave_filt, template[::-1], mode='same')
plt.figure(figsize=(12,3)); plt.plot(xcorr); plt.title('Matched filter response'); plt.show()
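# Take local maxima at least one template length apart, then keep the strongest len(events).
# With mode='same' each peak sits near the centre of the matched window, not at its start.
peaks, props = sig.find_peaks(xcorr, height=0, distance=len(template))
peak_idx = peaks[np.argsort(props['peak_heights'])[-len(events):]]
peak_times = np.sort(peak_idx)/fs
print('First correlation peak times:', peak_times[:5])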

## 9. Event amplitude distribution
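We compare the distribution of true event amplitudes from the metadata with the peak-to-peak amplitudes of the detected, band-pass-filtered snippets. The two are not on the same scale (filtering removes out-of-band energy, and detections need not map one-to-one onto true events), so compare the shapes rather than the absolute values.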


In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 9. Amplitude histograms                                   ║
# ╚══════════════════════════════════════════════════════════╝
true_amps = np.array([ev['amp'] for ev in events])
measured_amps = [snip.max() - snip.min() for snip in snippets]
plt.hist(true_amps, bins=20, alpha=0.6, label='true');
plt.hist(measured_amps, bins=20, alpha=0.6, label='measured');
plt.legend(); plt.title('Event amplitude distribution'); plt.show()

## 9.1 STFT around events
We view spectrograms of short snippets around each labelled event to analyse frequency content evolution.


In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 9.1 STFT per event                                        ║
# ╚══════════════════════════════════════════════════════════╝
win = sig.windows.hann(256)
for ev in events[:3]:
    i0 = int(ev['t']*fs)
    seg = wave[i0-int(0.01*fs):i0+int(0.02*fs)]
    f,t,S = sig.spectrogram(seg, fs=fs, window=win, nperseg=256, noverlap=128)
    plt.pcolormesh(t-0.01, f, 10*np.log10(S+1e-6), shading='auto')  # S is power, so dB = 10*log10
    plt.title(f'STFT around event at {ev["t"]:.3f}s')
    plt.ylabel('Hz'); plt.xlabel('Time (s)');
    plt.show()

## 10. Wavelet time-frequency view
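A continuous wavelet transform (here with the Mexican-hat wavelet `mexh`) localises short transients in time better than a fixed-window STFT, at the cost of coarser frequency resolution at high frequencies.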


In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 10. Wavelet transform                                     ║
# ╚══════════════════════════════════════════════════════════╝
import pywt
scales = np.arange(1, 128)
coef, freqs = pywt.cwt(wave[:fs*2], scales, 'mexh', sampling_period=1/fs)
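# freqs is proportional to 1/scale, so use pcolormesh to place each row at its true frequency
plt.pcolormesh(np.arange(coef.shape[1])/fs, freqs, np.abs(coef), shading='auto', cmap='turbo')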
plt.ylabel('Frequency (Hz)'); plt.xlabel('Time (s)'); plt.title('CWT of first 2s'); plt.show()

## 10.1 Cross-correlation check
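Cross-correlating each detected snippet against the first one checks how consistently the detector aligns events. A sharp peak near zero lag means the two snippets are well aligned; a peak at a non-zero lag gives their relative offset.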
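
Reading the offset from the correlation peak can be sketched on two synthetic pulses with a known shift. This uses `scipy.signal.correlation_lags` (available from SciPy 1.6) to build the lag axis that matches `correlate`. The Gaussian pulse shape and sampling rate are assumptions for illustration.

```python
import numpy as np
import scipy.signal as sig

fs_demo = 10_000                        # assumed sampling rate (Hz)
n = np.arange(400)
ref = np.exp(-((n - 100) / 10.0) ** 2)  # hypothetical pulse centred at sample 100
s = np.exp(-((n - 130) / 10.0) ** 2)    # same pulse arriving 30 samples later

corr = sig.correlate(s, ref, mode="full")
lags = sig.correlation_lags(len(s), len(ref), mode="full")
best_lag = lags[np.argmax(corr)]        # positive: s is delayed relative to ref
print(best_lag, "samples =", best_lag / fs_demo * 1e3, "ms")
```

Building the lag axis this way stays correct when the two snippets have different lengths, for example when one was truncated at the start of the recording.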


In [None]:
# ╔══════════════════════════════════════════════════════════╗
# ║ 10.1 Cross-correlation                                     ║
# ╚══════════════════════════════════════════════════════════╝
ref = snippets[0]
for i,s in enumerate(snippets[1:4], start=1):
    corr = sig.correlate(s, ref, mode='full')
    lags = sig.correlation_lags(len(s), len(ref), mode='full') / fs  # lag axis matching correlate(s, ref)
    plt.figure(figsize=(6,2))
    plt.plot(lags*1000, corr)
    plt.title(f'Corr snippet {i} vs 0'); plt.xlabel('lag ms');
    plt.tight_layout(); plt.show()

## 11. Further work
These analyses illustrate many avenues for lightning signal exploration. Extending to cross-station coherence or machine learning will build upon this foundation.


## 12. Conclusions
We applied a range of EDA techniques to a single-station recording: noise analysis, spectral and wavelet views, rolling statistics, band-pass denoising, and envelope-based and matched-filter detection. Together these give insight into lightning signal properties and form a baseline for future modelling.