EDA of Synthetic Lightning Waveforms (Multi-Station) ⚡

In this exploratory data analysis, we examine synthetic lightning electromagnetic waveform data from two stations (e.g. London and Paris). The analysis covers loading the dataset, visualising the waveforms (with zoom/pan), applying standard time-series analysis methods, computing signal/noise statistics, and implementing a basic burst detection algorithm.

Key steps include:

Data Loading and Description: Load single-channel voltage time-series (from .npy files) and the accompanying metadata (.json), then describe their structure (sampling rate, length, number of events, etc.).

Waveform Visualisation: Plot the raw waveforms, using interactive tools to zoom/pan, and highlight the lightning burst events to handle the wide dynamic range between noise and burst amplitudes.

Time-Series Analysis Tools: Apply classical analyses – power spectral density (FFT), short-time Fourier transform (spectrogram), peak/trough detection, rolling statistics (mean, std) and signal envelope, autocorrelation and zero-crossing rate, energy calculations, and a demonstration of matched filtering.

Signal Quality and Noise Stats: Quantify baseline noise level, signal-to-noise ratio, any DC offset, and other quality metrics.

Heuristic Burst Detection: Implement a simple non-ML lightning burst detector using thresholds (based on noise stats), rolling statistics, and an optional matched-filter approach.

Inter-Station Comparison: Compare the two station signals – e.g. compute time-lags via cross-correlation and compare signal amplitudes/energy to see attenuation and propagation delay between stations.

Clean Plotting and Saving: Ensure all plots are clearly labelled, sized appropriately, and saved to disk for documentation.

Below, we proceed through each step with code and discussion. The analysis is self-contained so you can run it on the provided synthetic dataset (e.g. storm5_LON.npy, storm5_PAR.npy with storm5_meta.json). We assume a sampling rate around 100 kHz (as given in metadata) and that the synthetic lightning bursts last on the order of a few tens of milliseconds.

1. Loading the Waveform Data and Metadata
First, we import necessary libraries (NumPy, JSON, Matplotlib, SciPy) and define file paths for the two station waveform files and the metadata. We use memory-mapped loading (mmap_mode='r') to avoid loading the entire waveform into RAM at once, since these files can be large (millions of samples). The metadata JSON contains useful information like the sample rate fs and a list of simulated lightning event parameters (time, location, amplitude, frequency, etc.). We load the metadata to retrieve fs and the events list. Then we print some basic info: number of samples per station and number of events embedded, along with a preview of a few event entries.

In [None]:
# 1. Imports and file paths
import numpy as np
import json
import matplotlib.pyplot as plt
import pathlib
import scipy.signal as sig

# Define data paths (adjust as needed)
root = pathlib.Path("../data/synthetic")    # base folder for synthetic data
storm_id = 5  # e.g. using "storm5" dataset (can change to the desired storm number)
npy_L = root / f"storm{storm_id}_LON.npy"   # London station waveform
npy_P = root / f"storm{storm_id}_PAR.npy"   # Paris station waveform
meta_f = root / f"storm{storm_id}_meta.json"

# Load waveforms with memory mapping (read-only)
lon = np.load(npy_L, mmap_mode='r')   # London waveform array (lazy-loaded)
par = np.load(npy_P, mmap_mode='r')   # Paris waveform array

# Load metadata JSON
with open(meta_f, 'r') as f:
    meta = json.load(f)
fs = meta["fs"]            # sampling frequency in Hz (e.g. 100000 Hz)
events = meta["events"]    # list of event dictionaries

# Print basic dataset info
print(f"Sampling rate: {fs/1000:.1f} kHz")
print(f"Samples per station: {lon.shape[0]:,d}")
print(f"Duration per station: {lon.shape[0]/fs:.2f} seconds")
print(f"Lightning events embedded: {len(events)}")
print("First 5 event entries from metadata:")
for ev in events[:5]:
    print(ev)
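As a small aside on the memory-mapped loading used above, the following sketch builds a toy stand-in for the dataset in a temporary folder and shows that only the slices you copy are pulled into memory. The file names and values here are made up for illustration; only the metadata fields (fs, events, t) mirror the ones used in this notebook.

```python
import json
import pathlib
import tempfile

import numpy as np

# Toy stand-in for the real files (hypothetical names and values).
tmp = pathlib.Path(tempfile.mkdtemp())
fs_demo = 100_000
toy_wave = np.random.default_rng(0).normal(0, 0.05, fs_demo).astype(np.float32)
np.save(tmp / "demo_LON.npy", toy_wave)
(tmp / "demo_meta.json").write_text(json.dumps({"fs": fs_demo, "events": [{"t": 0.2}]}))

# mmap_mode="r" returns a read-only np.memmap; the samples stay on disk until indexed.
wave = np.load(tmp / "demo_LON.npy", mmap_mode="r")
meta_demo = json.loads((tmp / "demo_meta.json").read_text())

# np.array(...) copies just this slice into an ordinary in-memory array.
chunk = np.array(wave[:1000])

# Simple sanity check: every event time should fall inside the recording.
duration = wave.shape[0] / meta_demo["fs"]
bad_events = [ev for ev in meta_demo["events"] if not (0 <= ev["t"] < duration)]

print(type(wave).__name__, chunk.shape, f"{duration:.2f} s", "out-of-range events:", len(bad_events))
```

The same range check can be run on the real `events` list before any event-based slicing, since an event time beyond the end of the array would otherwise produce an empty segment without raising an error.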


2. Waveform Visualisation and Dynamic Range

We will visualise the raw electric field waveforms for both stations. Plotting the entire 120 s record at full resolution (12 million points) in a single figure is impractical, so we begin with a shorter segment. It’s useful to plot the first second of data from each station to observe the background noise level and any early events. We also demonstrate how to zoom and pan to inspect details: in a Jupyter Notebook, you can use the interactive toolbar (e.g. the magnifying glass icon) to zoom into regions of interest, or use plotting libraries that support interactive zooming. We will highlight the lightning bursts on the plots to handle the dynamic range issue – the bursts are much higher in amplitude than the noise, so without highlighting or scaling, the small noise fluctuations may be invisible on the same y-axis scale.

In [None]:
# 2. Quick overview plot of first second of data for both stations
duration_view = 1.0  # seconds to plot initially
N_view = int(duration_view * fs)
t_axis = np.arange(N_view) / fs

plt.figure(figsize=(12, 3))
plt.plot(t_axis, lon[:N_view], label="London")
plt.plot(t_axis, par[:N_view], label="Paris", alpha=0.7)
plt.title("Raw E-field – first 1 second (London vs Paris)")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude (arb. units)")
plt.legend()
plt.tight_layout()
plt.savefig("eda_waveform_first1s.png", dpi=150)
plt.show()


The above code produces an overlay of the London and Paris signals in the first second. In many cases, the first second may contain no lightning events (if none occurred that early), so the plot would show mainly the background noise as a nearly flat line around zero. To see an example burst, we can zoom in time or choose a segment that includes an event. Below, we pick a specific event and zoom into it on both stations:

In [None]:
# Zoom into a window around the first lightning event in the list (events[0])
evt = events[0]
t0 = evt["t"]  # event time (s) from metadata
window = 0.1   # seconds to display around the event
t_start = max(0, t0 - window/2)
t_end   = min(lon.shape[0]/fs, t0 + window/2)
idx_start = int(t_start * fs)
idx_end   = int(t_end * fs)
t_win = np.arange(idx_start, idx_end) / fs

# Extract the segment for both stations
segment_LON = lon[idx_start:idx_end]
segment_PAR = par[idx_start:idx_end]

plt.figure(figsize=(10, 4))
plt.plot(t_win, segment_LON, label="London")
plt.plot(t_win, segment_PAR, label="Paris", alpha=0.8)
# Mark the peak of each waveform for clarity
lon_peak_idx = np.argmax(np.abs(segment_LON))
par_peak_idx = np.argmax(np.abs(segment_PAR))
plt.scatter(t_win[lon_peak_idx], segment_LON[lon_peak_idx], color='C0', marker='o')
plt.scatter(t_win[par_peak_idx], segment_PAR[par_peak_idx], color='C1', marker='o')
plt.title(f"Zoomed view of a lightning event at ~{t0:.2f} s (London vs Paris)")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.legend()
plt.tight_layout()
plt.savefig("eda_waveform_zoom_event.png", dpi=150)
plt.show()


In this zoomed plot, we can clearly see the waveform of a single lightning burst as recorded at both stations. One station’s signal reaches its peak slightly earlier than the other’s (one curve is shifted in time) and with a larger amplitude. This reflects the simulation: the station closer to the lightning strike detects it sooner and with less attenuation (higher amplitude), whereas the farther station’s signal is delayed and weaker. You can measure the time difference by eye or by analysis (we’ll quantify this later using cross-correlation).
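To make the idea of measuring the inter-station time difference concrete, here is a minimal sketch on toy data (not the storm files). It assumes a damped-sinusoid burst shape, a made-up delay of 110 samples and made-up amplitudes; `toy_burst` is a hypothetical helper, not part of the dataset tooling.

```python
import numpy as np
import scipy.signal as sig

fs_demo = 100_000
rng = np.random.default_rng(1)
t = np.arange(int(0.05 * fs_demo)) / fs_demo

def toy_burst(t, onset, f0=5_000.0, amp=1.0, tau=0.005):
    """Damped sinusoid starting at `onset` seconds (hypothetical burst shape)."""
    dt = np.clip(t - onset, 0, None)
    return amp * np.exp(-dt / tau) * np.sin(2 * np.pi * f0 * dt) * (t >= onset)

true_delay = 110  # samples, i.e. 1.1 ms at 100 kHz (assumed value)
near = toy_burst(t, 0.010) + rng.normal(0, 0.02, t.size)
far = toy_burst(t, 0.010 + true_delay / fs_demo, amp=0.6) + rng.normal(0, 0.02, t.size)

# Cross-correlate: a peak at positive lag means `far` is a delayed copy of `near`.
xc = sig.correlate(far, near, mode="full")
lags = np.arange(-(near.size - 1), far.size)
lag_samples = int(lags[np.argmax(xc)])

print(f"Estimated delay: {lag_samples} samples = {1000 * lag_samples / fs_demo:.2f} ms")
```

For a nearly sinusoidal burst the cross-correlation has side peaks one carrier period away from the main peak, so with weak signals or slow decay the estimate can slip by a whole period. Correlating the envelopes instead of the raw waveforms is one way to reduce that risk.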

Dynamic range highlighting: When plotting a long waveform, the brief high-amplitude bursts can make the low-amplitude noise nearly flat by comparison. One approach to highlight the bursts is to mark their time intervals on the plot. For example, we can overlay shaded regions or vertical spans during each event’s duration. Below, we plot the entire first second again but shade where lightning bursts occur (assuming ~40 ms duration each):

In [None]:
# Plot first second with bursts highlighted (shaded) for dynamic range clarity
duration_view = 1.0
N_view = int(duration_view * fs)
t_axis = np.arange(N_view) / fs
sig_L = lon[:N_view]
sig_P = par[:N_view]

fig, ax = plt.subplots(figsize=(12, 3))
ax.plot(t_axis, sig_L, label="London", color='C0')
ax.plot(t_axis, sig_P, label="Paris", color='C1', alpha=0.7)
# Highlight true event periods (from metadata) that fall in this interval
for ev in events:
    if ev["t"] < 0 or ev["t"] >= duration_view:
        continue  # skip events outside 0-1s
    start = ev["t"]
    end = start + 0.04  # assuming 40 ms event duration
    ax.axvspan(start, end, color='orange', alpha=0.3, label="Lightning burst")
# Avoid duplicate legend entries for multiple spans
handles, labels = ax.get_legend_handles_labels()
by_label = dict(zip(labels, handles))
ax.legend(by_label.values(), by_label.keys())
ax.set_title("Raw waveform (first 1s) with lightning bursts highlighted")
ax.set_xlabel("Time (s)")
ax.set_ylabel("Amplitude")
plt.tight_layout()
plt.savefig("eda_waveform_overview_highlight.png", dpi=150)
plt.show()


Figure: Raw electric field waveforms for the first second at both stations (London and Paris). The shaded orange regions indicate the times of lightning bursts. Because the lightning spikes are much larger than the background noise, the noise appears almost flat on this scale (the dynamic range is large). Highlighting the burst intervals helps identify where the events occur. One can zoom in interactively on these regions to see the waveform details of each burst.

In the figure, outside the shaded intervals the signal is essentially noise (a very small amplitude random fluctuation around zero). Within each burst (shaded), the signal swings to much higher positive and negative values in a characteristic oscillatory pattern. We will delve deeper into those oscillations in the next sections.

Interactive exploration: To fully examine the waveform, you can use interactive plotting. For example, using %matplotlib notebook or %matplotlib widget in a Jupyter environment allows panning and zooming. You can also plot the entire waveform in a decimated form (e.g. plot every 100th sample) to see all 120 s compressed, then zoom into interesting parts. Another approach is to use an interactive library like Plotly to create a zoomable time-series plot. In this static report, we rely on matplotlib’s interactive capabilities for zoom/pan and use the highlighting to mark event locations.
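One caveat with plotting every 100th sample is that a very short spike can fall between the retained samples. A possible alternative, sketched here on toy data, is to keep the minimum and maximum of each block so that extremes survive the decimation. `minmax_decimate` is a hypothetical helper written for this illustration.

```python
import numpy as np

def minmax_decimate(x, factor):
    """Return per-block minima and maxima so short spikes survive decimation."""
    n_blocks = len(x) // factor
    blocks = np.asarray(x[: n_blocks * factor]).reshape(n_blocks, factor)
    return blocks.min(axis=1), blocks.max(axis=1)

rng = np.random.default_rng(2)
x = rng.normal(0, 0.05, 1_000_000)   # toy noise record
x[123_457] = 0.9                     # one-sample spike that a stride of 100 skips

lo, hi = minmax_decimate(x, factor=100)
print("max of every 100th sample:", round(float(x[::100].max()), 3))
print("max of min/max envelope  :", round(float(hi.max()), 3))
```

The two returned arrays can be drawn as a filled band (for example with matplotlib's `fill_between` on a time axis of one point per block), which gives an overview of the full record with a hundredth of the points.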
3. Time-Series Analysis Tools
Next, we apply various traditional time-series analysis techniques to the lightning waveform data. Lightning signals are transient and oscillatory, so both the time-domain and frequency-domain characteristics are important to understand. These analyses will help us identify the frequency content of bursts, their time-frequency structure, periodicity, and other signal features.
3.1 Power Spectral Density (Frequency Analysis)
A power spectral density (PSD) shows how the signal’s power is distributed across frequencies. We expect the lightning bursts to contribute energy at certain frequencies (around the values given in metadata, e.g. a few kHz), whereas the background noise may be broadband or have a different spectrum. We use Welch’s method to estimate the PSD, as it averages the signal over time to reduce variance. Welch’s method divides the signal into overlapping segments, computes a periodogram (FFT) for each segment, and then averages them, yielding a smoother PSD estimate than a single large FFT. Below, we compute the PSD for a baseline noise segment (with no lightning) and for a segment containing a lightning burst, to compare them. We’ll take the first second of data as “baseline” (assuming it has no events, or very few) and a short window around a known event as “burst”.

In [None]:
# 3.1 Compute Power Spectral Density (PSD) using Welch's method
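# (np and sig were imported in the first cell; no extra imports are needed here)

# Choose a baseline segment (e.g. first 1s, assuming minimal lightning activity)
baseline_seg = lon[:int(fs)]  # first 1 second of London data; int() guards against a float fs from the JSON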
# Choose a burst segment (e.g. 0.1 s window around an event)
evt = events[0]
t0 = evt["t"]
seg_len = 0.1  # 100 ms segment length to capture the burst
idx0 = int(t0 * fs)
burst_seg = lon[idx0 : idx0 + int(seg_len * fs)]

# Welch PSD for baseline and burst segments
f_baseline, Pxx_baseline = sig.welch(baseline_seg, fs=fs, nperseg=8192)
f_burst, Pxx_burst = sig.welch(burst_seg, fs=fs, nperseg=2048)

# Convert power to dB scale for plotting
PSD_baseline_dB = 10 * np.log10(Pxx_baseline + 1e-12)
PSD_burst_dB    = 10 * np.log10(Pxx_burst + 1e-12)

# Plot the two PSDs
plt.figure(figsize=(8, 5))
plt.semilogx(f_baseline, PSD_baseline_dB, label="Baseline noise")
plt.semilogx(f_burst, PSD_burst_dB, label="During burst", color='orange')
plt.title("Power Spectral Density: Noise vs Lightning Burst")
plt.xlabel("Frequency (Hz)")
plt.ylabel("Power Spectral Density (dB/Hz)")
plt.legend()
plt.tight_layout()
plt.savefig("eda_psd_comparison.png", dpi=150)
plt.show()


In the PSD plot, we typically see that the baseline noise is relatively flat across frequencies (if the noise is approximately white). The burst segment, however, shows a prominent increase in power around the burst’s characteristic frequency. For example, if the lightning event has a frequency ~6 kHz, the PSD during the burst will have a peak or bump near 6 kHz that rises well above the noise floor. Meanwhile, at frequencies far from the burst frequency, the burst segment’s PSD returns to the baseline level (since the burst contributes little energy there). If we observe multiple bursts or longer data, the overall PSD of the entire signal might show several peaks corresponding to different event frequencies. This aligns with the synthetic nature of our data: each lightning event was simulated with a certain resonant frequency (e.g. 5 kHz, 8 kHz, etc.), and these appear as spikes in the spectrum. (Real lightning “sferics” are broadband impulses typically spanning a broad LF/VLF range, but our synthetic events are narrower-band for demonstration.) Lightning discharges indeed tend to have strong VLF components (3–30 kHz).
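To illustrate the segment-and-average idea behind Welch’s method described in this section, here is a hand-rolled sketch on a toy signal. It is intended to mirror the usual defaults of sig.welch (Hann window, 50% overlap, per-segment mean removal, density scaling) but is only an illustration; `welch_by_hand` is a hypothetical helper and the 6 kHz test tone is an assumption.

```python
import numpy as np
import scipy.signal as sig

def welch_by_hand(x, fs, nperseg=1024):
    """Average Hann-windowed periodograms over 50%-overlapping segments."""
    step = nperseg // 2
    win = sig.get_window("hann", nperseg)
    scale = 1.0 / (fs * np.sum(win**2))           # power spectral density scaling
    psds = []
    for start in range(0, len(x) - nperseg + 1, step):
        seg = x[start : start + nperseg]
        seg = (seg - seg.mean()) * win            # remove segment mean, then taper
        p = scale * np.abs(np.fft.rfft(seg)) ** 2
        p[1:-1] *= 2                              # one-sided spectrum (nperseg is even)
        psds.append(p)
    return np.fft.rfftfreq(nperseg, 1 / fs), np.mean(psds, axis=0)

fs_demo = 100_000
rng = np.random.default_rng(7)
t = np.arange(fs_demo) / fs_demo
x = 0.3 * np.sin(2 * np.pi * 6_000 * t) + rng.normal(0, 0.05, t.size)

f, pxx = welch_by_hand(x, fs_demo)
f_peak = float(f[np.argmax(pxx)])
_, pxx_ref = sig.welch(x, fs=fs_demo, nperseg=1024)
rel_diff = float(np.max(np.abs(pxx - pxx_ref) / pxx_ref))
print(f"peak at {f_peak:.0f} Hz; max relative difference vs sig.welch: {rel_diff:.2e}")
```

Averaging many short periodograms lowers the variance of each frequency bin at the cost of coarser frequency resolution (here about 98 Hz per bin), which is why the burst PSD in the cell above uses a shorter `nperseg` than the one-second baseline.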
3.2 Short-Time Fourier Transform (Spectrogram)
While the PSD gives an overall frequency content, it loses time information. A short-time Fourier transform (STFT) computes spectra over short moving windows, producing a time-frequency representation known as a spectrogram (mathworks.com). This is useful for visualising how the frequency content changes over time – for instance, we expect to see bursts as bright streaks in the spectrogram at their respective times and frequencies. We compute the spectrogram of a segment of the signal (e.g. the first second or a few seconds). We use a window size (e.g. 2048 samples) and overlap (e.g. 50%) to balance time and frequency resolution. The result is plotted with time on the x-axis, frequency on the y-axis, and colour indicating signal power in dB.


In [None]:
# 3.2 Compute and plot spectrogram of the first 1 second of London data
seg = lon[:int(fs)]  # first 1 second (int() in case fs was stored as a float)
plt.figure(figsize=(10, 4))
Pxx, freqs, bins, im = plt.specgram(seg, Fs=fs, NFFT=2048, noverlap=1024, cmap="viridis")
plt.title("Spectrogram – London station (0–1 s)")
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.colorbar(label="Power (dB)")
plt.tight_layout()
plt.savefig("eda_spectrogram_1s.png", dpi=150)
plt.show()


Figure: Time-frequency spectrogram of the London waveform (first 1 s). The colour intensity represents signal power at each time-frequency bin (yellow = higher power, blue = lower). In this example, two lightning bursts are visible as bright horizontal bands: one around 5 kHz at ~0.2 s, and another around 8 kHz at ~0.7 s. These correspond to two events that occurred in that interval. The background noise appears as low-level speckle across all frequencies (around –80 dB in this colour scale). The spectrogram clearly shows how each lightning event’s energy is concentrated in a specific frequency band and time period. Between bursts, only the noise is present, which has much lower power (darker colours).

The STFT is a powerful tool for analysing non-stationary signals like these bursts, since it captures the time-varying frequency content (mathworks.com). By adjusting the STFT parameters (window length, overlap), one can trade off time resolution versus frequency resolution. Here we used a 2048-sample window (≈20.48 ms at 100 kHz), which gives a frequency bin spacing of about 49 Hz (fs/NFFT), sufficient to localise a burst’s main frequency to within a few tens of Hz. The bursts appear as nearly horizontal lines because their frequency content is relatively stable during the 40 ms duration. In a real lightning recording (which might be more broadband), bursts might appear as broadband impulses or have multiple frequency components evolving over time.
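The time/frequency trade-off can be tabulated directly: a window of NFFT samples spans NFFT/fs seconds and yields frequency bins spaced fs/NFFT apart. A minimal sketch, assuming fs = 100 kHz as in the metadata (`stft_resolution` is a hypothetical helper):

```python
def stft_resolution(nfft, fs):
    """Return (window length in ms, frequency bin spacing in Hz) for an STFT window."""
    return 1000.0 * nfft / fs, fs / nfft

fs_demo = 100_000  # Hz; assumed to match the metadata value
table = {nfft: stft_resolution(nfft, fs_demo) for nfft in (256, 1024, 2048, 8192)}
for nfft, (win_ms, bin_hz) in table.items():
    print(f"NFFT={nfft:5d}  window={win_ms:7.2f} ms  bin spacing={bin_hz:7.2f} Hz")
```

A short window such as 256 samples pins down a burst onset to a few milliseconds but smears its frequency over several hundred Hz, whereas 8192 samples does the opposite and is longer than the ~40 ms burst itself.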

3.3 Peak and Trough Analysis
“Peak and trough analysis” refers to identifying local maxima and minima in the time series. For oscillatory signals, this can reveal the period or frequency (by measuring the time between peaks) and help characterise the waveform shape (e.g. symmetry of positive vs negative peaks). We can use SciPy’s find_peaks to detect local peaks. As an example, we’ll find the peaks in one lightning burst waveform.

In [None]:
# 3.3 Find peaks and troughs in a burst segment
evt = events[0]
t0 = evt["t"]
# Extract a 50 ms segment around the event (this covers the full burst)
idx0 = int(t0 * fs)
burst_window = lon[idx0 : idx0 + int(0.05 * fs)]
time_win = np.arange(burst_window.size) / fs + t0

# Use find_peaks to get indices of local maxima on the burst (above a threshold)
peak_indices, _ = sig.find_peaks(burst_window, height=np.max(burst_window)*0.3)
# For troughs, invert the signal
trough_indices, _ = sig.find_peaks(-burst_window, height=-np.min(burst_window)*0.3)

print(f"Found {len(peak_indices)} peaks and {len(trough_indices)} troughs in the burst segment.")
print("Peak times (ms):", [(time_win[i]-t0)*1000 for i in peak_indices][:5], "...")


Peak detection can thus help verify the signal’s frequency in the time domain. It also gives an idea of how many oscillation cycles are strong in the burst. For a 40 ms burst at ~5 kHz, there could be on the order of 200 cycles (5 kHz * 0.04 s = 200 cycles), though many will be small as the signal decays. In our simple detection above, we found 8 prominent peaks, which likely correspond to the initial strong cycles before the amplitude decayed below our 30% threshold. To get all peaks, one could lower the threshold, but that would include late tiny ripples which might just be noise.
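As a sketch of how peak spacing translates into a frequency estimate, the following toy example builds a damped 5 kHz sinusoid (an assumed burst shape, not the simulator’s actual model), finds its prominent peaks and converts the median peak-to-peak interval into Hz. The `distance` argument is an addition here to suppress double detections caused by noise near a crest.

```python
import numpy as np
import scipy.signal as sig

fs_demo = 100_000
f_true = 5_000.0                      # assumed burst frequency for the toy signal
t = np.arange(int(0.04 * fs_demo)) / fs_demo
rng = np.random.default_rng(3)
burst = np.exp(-t / 0.008) * np.sin(2 * np.pi * f_true * t) + rng.normal(0, 0.005, t.size)

# Keep peaks above 30% of the maximum, at least 5 samples apart.
peaks, _ = sig.find_peaks(burst, height=0.3 * burst.max(), distance=5)

# One full cycle separates consecutive peaks, so frequency = fs / spacing.
spacing = np.median(np.diff(peaks))
f_est = float(fs_demo / spacing)
print(f"{len(peaks)} peaks, median spacing {spacing:.0f} samples -> {f_est:.0f} Hz")
```

Using the median rather than the mean spacing makes the estimate tolerant of an occasional peak that noise shifts by a sample, though the resolution is still limited to whole-sample spacings unless the peaks are interpolated.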

3.4 Rolling Statistics and Signal Envelope
Rolling statistics (like a moving average or moving standard deviation) can characterize how the signal’s local behavior changes. For instance, a rolling RMS (root-mean-square) or standard deviation can highlight bursts by showing a jump in local energy during the event. We compute a simple rolling mean and rolling standard deviation on a segment to illustrate. We’ll also compute the envelope of the signal, which is the smooth curve tracing the signal’s amplitude peaks. A common way to get the envelope is to use the analytic signal via the Hilbert transform (the magnitude of the analytic signal gives the envelope). Another simpler approach is to take the absolute value of the signal and apply a low-pass filter or moving average. Below we compute the envelope of the entire waveform using a rectification (absolute value) and a moving average filter of width 1 ms (to smooth over a few cycles of the waveform):

In [None]:
# 3.4 Compute rolling mean, rolling std, and envelope for the London waveform
signal = lon[:]  # entire signal (memory-mapped)
N = signal.shape[0]

# Rolling window size (in samples)
win_size = int(0.001 * fs)  # 0.001 s = 1 ms window (~100 samples at 100 kHz)
if win_size < 1:
    win_size = 1

# Rolling mean and std (using convolution for efficiency)
window = np.ones(win_size) / win_size
rolling_mean = np.convolve(signal, window, mode='same')
# For rolling std, compute rolling mean of squared minus square of rolling mean
signal_sq = signal * signal
rolling_mean_sq = np.convolve(signal_sq, window, mode='same')
rolling_std = np.sqrt(np.maximum(0, rolling_mean_sq - rolling_mean**2))

# Envelope via moving average on absolute value
env = np.convolve(np.abs(signal), window, mode='same')

# Example: print baseline vs event envelope values
print("Baseline noise envelope (median):", float(np.median(env)))
print("Max envelope during an event:", float(np.max(env)))


The rolling mean here should be near zero everywhere if there is no DC offset (we’ll check DC in the noise section). The rolling standard deviation (rolling_std) will be low during pure noise and spike up during a burst (since the burst increases local variance drastically). The envelope env will be near the noise amplitude (a few hundredths) during baseline, and then shoot up to the order of the event amplitude (e.g. ~0.5 or more) during a burst. For example, the printed output might show the median baseline envelope ~0.03, and max envelope ~0.6 (comparable to the largest burst amplitude). Plotting the envelope over time would show a mostly flat line at the noise level with sharp peaks at each event’s time – essentially a detection-friendly representation. In fact, the envelope is a primary signal we will threshold in the detection step. Rolling statistics can also be used for noise estimation – e.g. a rolling standard deviation can track if the noise level changes over time. In our synthetic data, the noise level is static, but in real scenarios it might drift or have interference bursts.
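For comparison with the rectify-and-smooth envelope used above, here is a minimal sketch of the Hilbert-transform envelope on a toy burst with a known Gaussian envelope. The burst shape and its parameters are assumptions chosen so that the true envelope is available for checking.

```python
import numpy as np
import scipy.signal as sig

fs_demo = 100_000
t = np.arange(int(0.06 * fs_demo)) / fs_demo
true_env = np.exp(-0.5 * ((t - 0.03) / 0.005) ** 2)   # known envelope of the toy burst
x = true_env * np.cos(2 * np.pi * 5_000 * t)

# Analytic-signal envelope: magnitude of x + j * Hilbert{x}.
hilbert_env = np.abs(sig.hilbert(x))

# Rectify-and-smooth envelope with a 1 ms moving average (100 samples at 100 kHz).
win = np.ones(100) / 100
rect_env = np.convolve(np.abs(x), win, mode="same")

hilbert_err = float(np.max(np.abs(hilbert_env - true_env)))
rect_ratio = float(rect_env.max() / true_env.max())
print(f"Hilbert envelope max error: {hilbert_err:.2e}")
print(f"Rectified envelope peak / true peak: {rect_ratio:.3f}")
```

The rectified moving average sits at roughly 2/π ≈ 0.64 of the true amplitude, because that is the mean of |cos| over a cycle. This does not matter for thresholding as long as the noise level is measured with the same envelope, but it should be kept in mind when reading burst amplitudes off `env`.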

3.5 Autocorrelation and Zero-Crossing Rate
The autocorrelation of a signal measures how the signal correlates with itself at different time lags. It’s useful for finding repeating patterns or periodicity. For an oscillatory burst, the autocorrelation will oscillate as well and gradually decay. For random noise, the autocorrelation is near zero for non-zero lags (uncorrelated) and peaks at zero lag (where it equals the signal’s variance). The autocorrelation function’s Fourier transform is the power spectrum (by the Wiener–Khinchin theorem; en.wikipedia.org), connecting time and frequency analysis. We’ll compute the autocorrelation of a single burst segment and of baseline noise for contrast:

In [None]:
# 3.5 Autocorrelation of an event segment vs baseline noise
evt = events[0]
t0 = evt["t"]
seg_len = 0.05  # 50 ms segment
seg = lon[int(t0*fs) : int(t0*fs)+int(seg_len*fs)]
# Remove mean for clarity
seg = seg - np.mean(seg)
noise_seg = lon[0: int(0.05*fs)]  # 50 ms of noise from start (assuming no event there)
noise_seg = noise_seg - np.mean(noise_seg)

# Compute autocorrelation (via numpy.correlate)
auto_evt = np.correlate(seg, seg, mode='full')
auto_noise = np.correlate(noise_seg, noise_seg, mode='full')
lags = np.arange(-len(seg)+1, len(seg)) / fs * 1000  # lags in milliseconds

# Normalize by zero-lag (variance) for easier comparison
auto_evt = auto_evt / auto_evt[len(seg)-1]
auto_noise = auto_noise / auto_noise[len(noise_seg)-1]

# Print example autocorrelation values
print("Event autocorr (first few lags):", auto_evt[len(seg)-1:len(seg)+5])
print("Noise autocorr (first few lags):", auto_noise[len(noise_seg)-1:len(noise_seg)+5])
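The Wiener–Khinchin relationship mentioned above can be checked numerically. This sketch, on a short random toy signal, computes the autocorrelation once in the time domain and once as the inverse FFT of the power spectrum; zero-padding to length 2N−1 makes the circular result equal to the linear one.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=512)
x -= x.mean()
n = x.size

# Time domain: full linear autocorrelation, zero lag at index n-1.
acf_time = np.correlate(x, x, mode="full")

# Frequency domain: inverse FFT of |X|^2, with zero-padding to 2n-1 samples.
power = np.abs(np.fft.fft(x, n=2 * n - 1)) ** 2
acf_freq = np.fft.ifft(power).real
acf_freq = np.roll(acf_freq, n - 1)   # move zero lag from index 0 to index n-1

max_diff = float(np.max(np.abs(acf_time - acf_freq)))
print(f"max |time-domain - FFT-based| = {max_diff:.2e}")
```

The FFT route is also the practical choice for long segments, since np.correlate is a direct O(N²) computation.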


The zero-crossing rate (ZCR) is the rate at which the signal changes sign (crosses zero). This is a simple time-domain feature often used in audio processing to distinguish between high-frequency content (many zero crossings) and low-frequency or steady signals (few zero crossings). For our signals:

The noise will have a certain ZCR depending on its bandwidth. If it’s white noise, it could have a very high ZCR (many sign changes), but if it’s band-limited, ZCR will correspond to that band’s frequencies.
A sinusoidal burst at 5 kHz will cross zero twice every cycle (twice per ~0.2 ms), so roughly 10,000 crossings per second (ZCR ~10 kHz) during the burst.

We can calculate ZCR by counting sign changes over a window. Let’s compute the ZCR for a noise segment and a burst segment:

In [None]:
# 3.5 Zero-Crossing Rate for noise vs event segment
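def zero_crossing_rate(signal):
    # Count sign changes; nudge exact zeros to a tiny positive value so np.sign never returns 0
    x = np.array(signal, dtype=float)  # work on a copy (named x to avoid shadowing the scipy.signal alias `sig`)
    x[x == 0] = 1e-12
    crossings = np.where(np.diff(np.sign(x)) != 0)[0]
    return len(crossings) / (len(x) / fs)  # crossings per second (uses the global fs)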

noise_zcr = zero_crossing_rate(noise_seg)
event_zcr = zero_crossing_rate(seg)
print(f"Zero-crossing rate – noise: {noise_zcr:.1f} Hz, event: {event_zcr:.1f} Hz")


3.6 Energy and Matched Filtering
We define energy of a signal segment as the sum of the squared values (or the integral of power over time). For discrete signals, a segment’s energy $E = \sum_{n} x[n]^2$. Energy is useful for detecting bursts since a burst will contribute a sudden increase in energy compared to baseline noise. Let’s compute the energy of a typical lightning burst and compare it to the energy of noise in an equal-length window. We’ll also compute a quick signal-to-noise ratio (SNR) for one event.

In [None]:
# 3.6 Energy of event vs noise
evt = events[0]
t0 = evt["t"]
burst_samples = int(0.04 * fs)  # 40 ms of samples
s0 = int(t0 * fs)
burst_segment = lon[s0 : s0 + burst_samples]
noise_segment = lon[0 : burst_samples]  # first 40ms as noise sample (assuming no event)
E_burst = float(np.sum(burst_segment**2))
E_noise = float(np.sum(noise_segment**2))
print(f"Energy of burst segment: {E_burst:.4f}")
print(f"Energy of equal-length noise: {E_noise:.4f}")
print(f"Approx SNR (burst vs noise) = {10*np.log10(E_burst/E_noise):.1f} dB")


Now, matched filtering: A matched filter is a signal processing technique where you cross-correlate the input signal with a known template waveform. It is optimal for detecting a known signal in noise – it maximises the SNR of that signal in the presence of additive noise. In essence, if you know the shape of the signal you’re looking for (e.g. the expected waveform of a lightning burst), the matched filter will give a peak response when that waveform is present. In our case, each lightning event has a slightly different frequency, so an exact template would need to match each event. For demonstration, we’ll take one event’s waveform as a template and run a matched filter over a portion of the data to see if it detects that event.

In [None]:
# 3.6 Matched filtering demonstration – use the first event waveform as template
evt = events[0]
t0 = evt["t"]
template = lon[int(t0*fs) : int(t0*fs) + int(0.04*fs)]  # 40 ms template from event 0
template = template - np.mean(template)  # zero-mean the template
# Take a segment of data around the event (e.g. ±5 seconds) to search in
search_start = max(0, int((t0 - 5) * fs))
search_end = min(len(lon), int((t0 + 5) * fs))
search_segment = lon[search_start:search_end]
# Cross-correlate template with the search segment using FFT convolution for efficiency
corr = sig.fftconvolve(search_segment, template[::-1], mode='full')
corr = corr / np.linalg.norm(template)  # normalise by the template's L2 norm (square root of its energy)
# Find the index of maximum correlation
max_idx = np.argmax(corr)
detected_delay = max_idx - len(template) + 1
detected_time = (search_start + detected_delay) / fs
print(f"Matched filter detected a peak at t = {detected_time:.3f} s (actual event at {t0:.3f} s)")


In a realistic scenario, one might not know the exact waveform. Instead, one could use a bandpass filter as a simpler matched filter to pass the frequency range of interest (e.g. 3–10 kHz, covering all likely lightning frequencies in our case). This improves SNR by filtering out noise outside the band, though it’s not as optimal as a true matched filter for each event. Nonetheless, matched filtering confirms that when we correlate with the known signal shape, we get a clear detection at the correct time. Overall, the energy and matched filter analyses reinforce how well-defined the bursts are compared to noise. The high energy and strong correlation peaks mean these events are relatively easy to detect with the right methods.
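The band-pass alternative mentioned above could look like the following sketch, which applies a zero-phase Butterworth filter to a toy tone and to white noise separately and reports the resulting SNR change. The 3–10 kHz band comes from the text; the filter order and the toy signals are assumptions.

```python
import numpy as np
import scipy.signal as sig

fs_demo = 100_000
t = np.arange(fs_demo) / fs_demo                  # 1 s of toy data
rng = np.random.default_rng(5)
noise = rng.normal(0, 1.0, t.size)                # white noise
tone = np.sin(2 * np.pi * 5_000 * t)              # stand-in for a 5 kHz burst

# 4th-order Butterworth band-pass, applied forwards and backwards (zero phase).
sos = sig.butter(4, [3_000, 10_000], btype="bandpass", fs=fs_demo, output="sos")
noise_f = sig.sosfiltfilt(sos, noise)
tone_f = sig.sosfiltfilt(sos, tone)

snr_before = tone.var() / noise.var()
snr_after = tone_f.var() / noise_f.var()
gain_db = float(10 * np.log10(snr_after / snr_before))
tone_kept = float(tone_f.std() / tone.std())
print(f"tone amplitude kept: {tone_kept:.3f}, SNR gain: {gain_db:.1f} dB")
```

Because the filter passes about 7 kHz of the 50 kHz Nyquist band, most of the white-noise power is removed while an in-band tone is left almost untouched. Zero-phase filtering also avoids shifting the burst in time, which matters if arrival times are compared between stations.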

3.7 Event Duration Histograms
Finally, we consider the duration of lightning events. In the synthetic data, each burst was designed to last about 40 ms. We can attempt to measure each event’s duration from the signal and then compile a histogram of durations to see the distribution. A simple approach to measure duration is to see how long the signal stays above a certain threshold around the event time. For instance, define the start of an event when the envelope rises above some noise threshold, and the end when it falls back below it. Given that our events are all roughly the same length by design, we expect the duration histogram to be tightly clustered around 40 ms. Let’s verify this by using the envelope and a threshold to determine each event’s length:

In [None]:
# 3.7 Estimate event durations using envelope threshold
threshold = float(np.median(env) + 3*np.std(env))  # threshold a bit above baseline noise
durations = []
for ev in events:
    s0 = int(ev["t"] * fs)
    # scan forward from event start until envelope falls below threshold
    end_idx = s0
    while end_idx < len(env) and env[end_idx] > threshold:
        end_idx += 1
    duration = (end_idx - s0) / fs
    durations.append(duration * 1000)  # in milliseconds

print("Estimated event durations (ms):", [round(d, 1) for d in durations])
print(f"Average duration: {np.mean(durations):.1f} ms")


In [None]:
plt.figure(figsize=(4,3))
plt.hist(durations, bins=10, color='skyblue', edgecolor='black')
plt.title("Histogram of Lightning Burst Durations")
plt.xlabel("Duration (ms)")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig("eda_duration_hist.png", dpi=150)
plt.show()


4. Signal Quality and Noise Statistics
We now assess the baseline signal quality and noise characteristics. This involves checking for any DC bias, measuring the noise level, and comparing noise between stations.

Baseline noise level: We can estimate the noise’s mean (should be ~0) and standard deviation. We’ll use portions of the signal that do not contain any lightning events. Using the metadata, we can exclude all event intervals (here, the ~40 ms following each event’s start time) and compute statistics on the remaining samples.

In [None]:
# 4. Compute baseline noise statistics (excluding event periods)
mask = np.ones(len(lon), dtype=bool)
burst_length = int(0.04 * fs)
for ev in events:
    s0 = int(ev["t"] * fs)
    mask[s0 : s0 + burst_length] = False
baseline_samples_LON = lon[mask]
baseline_samples_PAR = par[mask]

mean_LON = float(np.mean(baseline_samples_LON))
std_LON = float(np.std(baseline_samples_LON))
mean_PAR = float(np.mean(baseline_samples_PAR))
std_PAR = float(np.std(baseline_samples_PAR))
print(f"London noise: mean={mean_LON:.4f}, std={std_LON:.4f}")
print(f"Paris noise: mean={mean_PAR:.4f}, std={std_PAR:.4f}")


In [None]:
plt.figure(figsize=(4,3))
plt.hist(baseline_samples_LON, bins=50, color='gray', density=True, alpha=0.7, label="London noise")
plt.hist(baseline_samples_PAR, bins=50, color='blue', density=True, alpha=0.5, label="Paris noise")
plt.title("Baseline Noise Amplitude Distribution")
plt.xlabel("Amplitude")
plt.ylabel("Probability Density")
plt.legend()
plt.tight_layout()
plt.savefig("eda_noise_hist.png", dpi=150)
plt.show()


We would likely get a bell-shaped distribution centered at 0, consistent with Gaussian noise. Indeed, the simulator probably added random noise with a normal distribution. The histograms for London and Paris noise should overlap almost perfectly (as both have same std). This suggests both stations have similar sensor noise characteristics in the simulation.

Signal quality issues: We should check if there are any anomalies like clipping or dropouts. Clipping would show up as the waveform hitting the same maximum/minimum value frequently (e.g. saturating at ±1.0). In our data, the max amplitudes were about 0.8 and -0.9 (from earlier checks), so no clipping at ±1.0 occurred. No dropouts (zero-filled gaps) were observed either – the noise is continuous. The data looks clean aside from the intentional bursts. If this were real data, we might also check for interference (e.g. mains hum at 50/60 Hz) in the PSD or spectrogram. In the synthetic data, no such interference was added, so the PSD is flat at low frequency (no strong 50 Hz line). The absence of any unexpected spectral lines or trends indicates a high-quality simulation. The only strong spectral content corresponds to the lightning events themselves.

Comparison between stations: The noise statistics being equal suggests no one station is inherently noisier than the other in simulation. In reality, different sensor environments could have different noise floors, but here both are ~0.05 std. We can also check if the noise is correlated between stations (e.g. do they share any common noise source). We can compute the correlation coefficient of the noise portions of the two station signals:

In [None]:
# Correlation of baseline noise between stations
corr_coef = np.corrcoef(baseline_samples_LON, baseline_samples_PAR)[0,1]
print("Correlation between London and Paris noise samples:", corr_coef)
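
The clipping and dropout checks described in this section can also be automated. The following is a minimal sketch, not part of the original notebook: the helper names `clipping_fraction` and `longest_constant_run`, the full-scale value of 1.0, and the demo signal are all assumptions made for illustration.

```python
import numpy as np

def clipping_fraction(x, full_scale=1.0, tol=1e-6):
    """Fraction of samples at or beyond +/- full_scale (assumed saturation level)."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(np.abs(x) >= full_scale - tol))

def longest_constant_run(x):
    """Length in samples of the longest run of identical consecutive values."""
    x = np.asarray(x)
    if x.size == 0:
        return 0
    change = np.flatnonzero(np.diff(x) != 0)             # last index of each run except the final one
    bounds = np.concatenate(([-1], change, [x.size - 1]))
    return int(np.max(np.diff(bounds)))

# Hypothetical demo: Gaussian noise with an artificial 100-sample zero-filled dropout
rng = np.random.default_rng(0)
demo = rng.normal(0.0, 0.05, 10_000)
demo[2000:2100] = 0.0
clipped = np.clip(demo * 30, -1.0, 1.0)                  # deliberately saturated copy

print("clipping fraction (clean):", clipping_fraction(demo))
print("clipping fraction (saturated):", clipping_fraction(clipped))
print("longest constant run (samples):", longest_constant_run(demo))
```

On the station arrays, a non-zero clipping fraction or a constant run much longer than a few samples would be worth a closer look; the exact limits depend on the sensor and are left open here.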


5. Heuristic Lightning Burst Detection (Non-ML)
Using the insights above, we can now construct a simple heuristic algorithm to detect lightning bursts in the time series. This is a rule-based approach that does not involve machine learning: we rely on thresholding and signal processing techniques such as rolling statistics and filtering (a common approach in classical signal detection).

Strategy: We know from the analysis that bursts produce a significant increase in the signal's amplitude and energy. A straightforward detector can be built by following these steps:
Preprocess the signal if needed: e.g. apply a band-pass filter around the expected frequency range to improve SNR. (Optional, since our synthetic data already has good SNR.)
Compute the signal envelope (as we did with env earlier) or alternatively a rolling RMS/variance. This provides a smooth measure of instantaneous signal amplitude.
Estimate the noise level from the envelope during non-event periods – for example, take the median or mean of the envelope, or use a robust measure like median absolute deviation (MAD).
Set a threshold somewhat above the noise level. One robust choice is threshold = median(envelope) + k * MAD(envelope). The factor k can be tuned (common values might be 5 or 6 to have very low false alarm probability for Gaussian noise; see crispinagar.github.io).
Scan through the envelope and flag segments where it exceeds the threshold. When the envelope goes above threshold, mark the start of a potential event; when it falls below, mark the end.
Use a hysteresis or gap tolerance to avoid splitting one event into multiple if the envelope dips briefly (we can simply merge detections that occur within, say, 5–10 ms of each other, as bursts are at least 40 ms apart in our case).
Output the list of detected event times (and durations if needed).
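
To make the choice of k more concrete, the sketch below relates MAD to the standard deviation for Gaussian noise. It uses synthetic noise generated on the spot (an assumption, not the notebook's data) with the 0.05 standard deviation quoted earlier. For a Gaussian, MAD is about 0.6745σ, so multiplying MAD by roughly 1.4826 recovers σ, and k = 6 corresponds to about 4σ. The smoothed, rectified envelope is not itself Gaussian, so this mapping should be read as a rough guide only.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 0.05                                   # noise std quoted in the text
noise = rng.normal(0.0, sigma, size=200_000)   # synthetic stand-in for baseline samples

med = np.median(noise)
mad = np.median(np.abs(noise - med))

sigma_from_mad = 1.4826 * mad                  # Gaussian consistency factor (1 / 0.6745)
k = 6
k_in_sigmas = k * mad / sigma                  # how many sigmas a k*MAD offset represents

print(f"MAD = {mad:.4f}, sigma estimated from MAD = {sigma_from_mad:.4f}")
print(f"k = {k} MADs is about {k_in_sigmas:.2f} sigma for Gaussian noise")
```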


In [None]:
# 5. Heuristic burst detection using envelope and threshold
# `env` is the envelope computed earlier (absolute value smoothed over 1 ms)
# Estimate noise level from envelope using median and MAD
env_vals = env[mask]  # envelope values on baseline (non-event) portions
median_env = float(np.median(env_vals))
mad_env = float(np.median(np.abs(env_vals - median_env)))
threshold = median_env + 6 * mad_env  # 6*MAD threshold (roughly ~4σ for Gaussian noise)

print(f"Envelope median (noise level): {median_env:.4f}, MAD: {mad_env:.4f}, threshold: {threshold:.4f}")

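# Detect where envelope exceeds threshold
above_thr = env > threshold
min_gap = int(0.005 * fs)  # merge runs separated by <5 ms
# Rising (+1) and falling (-1) edges of the mask; zero padding closes runs that touch either end
edges = np.diff(np.concatenate(([0], above_thr.astype(np.int8), [0])))
starts = np.flatnonzero(edges == 1)
ends = np.flatnonzero(edges == -1)  # exclusive end indices
detected_events = []
for s, e in zip(starts, ends):
    if detected_events and s - detected_events[-1][1] < min_gap:
        # Short gap since the previous run: extend the previous event instead of starting a new one
        detected_events[-1] = (detected_events[-1][0], int(e))
    else:
        detected_events.append((int(s), int(e)))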

print(f"Detected {len(detected_events)} events via thresholding.")
for j, (s, e) in enumerate(detected_events[:5], 1):
    print(f" Event {j}: time={s/fs:.3f}s to {e/fs:.3f}s")
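
The step list mentions hysteresis as an alternative to a gap tolerance. A minimal sketch of a two-threshold detector is shown below. The function name `hysteresis_detect`, the toy envelope, and the threshold values are illustrative assumptions and are not taken from the notebook.

```python
import numpy as np

def hysteresis_detect(envelope, high, low):
    """Return (start, end) index pairs with exclusive end.

    An event opens when the envelope rises above `high` and closes only
    once it drops below `low` (with low < high), so brief dips between
    the two thresholds do not split an event.
    """
    events = []
    start = None
    for i, v in enumerate(envelope):
        if start is None and v > high:
            start = i                      # event opens
        elif start is not None and v < low:
            events.append((start, i))      # event closes
            start = None
    if start is not None:                  # event still open at end of record
        events.append((start, len(envelope)))
    return events

# Toy envelope: the dip to 0.5 at index 3 stays above `low`, so it does not end the event
demo_env = np.array([0.1, 0.1, 0.9, 0.5, 0.9, 0.1, 0.1, 0.8, 0.1])
demo_events = hysteresis_detect(demo_env, high=0.7, low=0.3)
print(demo_events)
```

With a single threshold at 0.7, the same toy envelope would yield three separate runs; the lower release threshold joins the first two. On the real envelope, `high` could be the MAD-based threshold and `low` some fraction of it, which is a tuning choice left open here.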


6. Inter-Station Comparison and Time-Lag Analysis
Finally, we compare the recordings from the two stations to see how they relate for each lightning event. The primary differences are arrival time (due to propagation delay) and amplitude attenuation (due to distance). We will quantify these using cross-correlation for time delay and amplitude ratio for attenuation.

Time-lag via cross-correlation: For each event, we take a snippet of both station signals around the event time and cross-correlate them. The lag at which the cross-correlation is maximum indicates the time offset between the signals (see en.wikipedia.org). With the convention used in the code (London as the first argument, Paris as the second), a positive lag means the London signal lags the Paris signal, i.e. Paris receives the burst first; a negative lag means London leads (arrives first).

Amplitude comparison: We'll also record the peak amplitude at each station for each event, as a proxy for signal strength. The ratio of these amplitudes indicates relative attenuation.
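
Sign conventions for cross-correlation lags are easy to get wrong, so it can help to confirm them on a signal with a known delay. The sketch below uses a made-up Gaussian pulse and an assumed 100 kHz sampling rate; the names `fs_demo`, `sig_lon`, and `sig_par` are hypothetical stand-ins for the station arrays.

```python
import numpy as np

fs_demo = 100_000                                   # assumed sampling rate (Hz)
t = np.arange(2000) / fs_demo
pulse = np.exp(-((t - 0.005) / 0.0005) ** 2)        # toy pulse centred at 5 ms

delay = 115                                         # known delay in samples (1.15 ms)
sig_par = pulse                                     # "Paris" receives the pulse first
sig_lon = np.roll(pulse, delay)                     # "London" receives it `delay` samples later

# Same call pattern as the analysis cell: London first, Paris second
corr = np.correlate(sig_lon, sig_par, mode="full")
lag = int(np.argmax(corr)) - (len(sig_par) - 1)     # zero lag sits at index len(sig_par) - 1

print(f"recovered lag: {lag} samples = {lag / fs_demo * 1000:.2f} ms")
```

The recovered lag is positive when the first argument is the delayed signal, which matches the "positive means London lags Paris" reading.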

In [None]:
# 6. Time lag and amplitude comparison for each event
results = []
for ev in events:
    t0 = ev["t"]
    # Take a window of ±10 ms (20 ms total) around the event for each station
    win = 0.01  # half-window in seconds
    # Clamp the window to the valid sample range of both recordings
    sL = max(int((t0 - win) * fs), 0)
    eL = min(int((t0 + win) * fs), len(lon), len(par))
    # Remove the local mean so a DC offset does not bias the correlation
    seg_L = lon[sL:eL] - np.mean(lon[sL:eL])
    seg_P = par[sL:eL] - np.mean(par[sL:eL])
    # Cross-correlation
    corr = np.correlate(seg_L, seg_P, mode='full')
    lag_idx = np.argmax(corr)
    lag = lag_idx - (len(seg_P) - 1)  # positive -> LON lagging PAR
    lag_time = lag / fs
    # Peak amplitude in each segment
    amp_L = float(np.max(np.abs(seg_L)))
    amp_P = float(np.max(np.abs(seg_P)))
    results.append((t0, lag_time, amp_L, amp_P))

# Print summary for first few events
for t0, lag_time, amp_L, amp_P in results[:5]:
    if lag_time < 0:
        lead = "London leads Paris"
        dt = -lag_time
    else:
        lead = "Paris leads London"
        dt = lag_time
    ratio = amp_L / (amp_P + 1e-12)
    print(f"Event at t={t0:.3f}s: {lead} by {dt*1000:.2f} ms, "
          f"amp_L={amp_L:.3f}, amp_P={amp_P:.3f}, amp_ratio(L/P)={ratio:.2f}")
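
As a plausibility check on the measured lags, the arrival-time difference between two stations cannot exceed the straight-line travel time between them. The sketch below assumes an approximate London to Paris distance of 344 km and propagation at the vacuum speed of light. Both are assumptions; the simulator's actual geometry and propagation model are not stated in this section, and the helper `lag_is_plausible` is hypothetical.

```python
import math

SPEED_OF_LIGHT_KM_S = 299_792.458
baseline_km = 344.0        # ASSUMPTION: approximate London-Paris great-circle distance
fs_assumed = 100_000       # ASSUMPTION: sampling rate in Hz, as quoted in the introduction

# Largest possible arrival-time difference (source in line with both stations)
max_delay_s = baseline_km / SPEED_OF_LIGHT_KM_S
max_delay_samples = math.ceil(max_delay_s * fs_assumed)

def lag_is_plausible(lag_time_s, tolerance_s=1e-4):
    """True if |lag| does not exceed the baseline travel time plus a small tolerance."""
    return abs(lag_time_s) <= max_delay_s + tolerance_s

print(f"max physical delay: {max_delay_s * 1000:.3f} ms ({max_delay_samples} samples)")
```

A lag from the cross-correlation that falls well outside this bound more likely reflects a spurious correlation peak (for example, from noise or an overlapping event) than a real propagation delay.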
