# BangleBeat

The following notebook walks through the process of making a machine learning model to improve BangleJS2 heart rate accuracy based on prior data. The ML model runs directly on the BangleJS2 watch and should serve as an easy-to-use pipeline for folks to make their own Bangle more accurate for them. Further work may attempt to generalize this approach to yield an open source watch which actively learns how to improve its own heart rate measurements.



## Data Capture

First, we should assess how well the BangleJS2 performs compared to other wearable devices. This work uses the following devices:

- BangleJS2 Smart Watch
- Garmin Instinct 2X Smart Watch
- Polar H10 ECG Chest Strap

These are all supported to a sufficient degree by [Gadgetbridge](https://gadgetbridge.org/). However, in the initial phases of this project I found that the BangleJS2 does not sample heart rate data nearly as fast as the Garmin does. To give this model a fighting chance, I wrote [loglog](https://github.com/lucspec/BangleApps/tree/master/apps/loglog) -- a BangleJS2 app that prioritizes data collection over battery life. Initial testing yields a data density high enough to be more comparable with the other two sensors in this work.
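
One way to put a number on "data density" is to look at the gap between consecutive samples for each device. The snippet below is a minimal sketch, not part of the original pipeline: `sampling_summary` is a hypothetical helper, the toy readings are made up for illustration, and the column names `device_name`, `datetime`, and `HEART_RATE` are assumed to match the ones used later in this notebook.

```python
import pandas as pd

# Toy readings standing in for real recordings (values and timings are illustrative only).
t0 = pd.Timestamp("2024-01-01 10:00:00")
bangle_times = list(t0 + pd.to_timedelta([0, 10, 20, 30], unit="s"))
polar_times = list(t0 + pd.to_timedelta([0, 5, 10, 15, 20, 25, 30], unit="s"))

toy = pd.DataFrame({
    "device_name": ["BangleJS2"] * 4 + ["Polar H10"] * 7,
    "datetime": bangle_times + polar_times,
    "HEART_RATE": [72, 74, 73, 75, 71, 72, 73, 73, 74, 75, 75],
})


def sampling_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per device: sample count, median gap between samples, and samples per minute."""
    rows = []
    for device, group in df.sort_values("datetime").groupby("device_name"):
        # Seconds between each sample and the previous one from the same device
        gaps = group["datetime"].diff().dropna().dt.total_seconds()
        span_min = (group["datetime"].max() - group["datetime"].min()).total_seconds() / 60
        rows.append({
            "device_name": device,
            "samples": len(group),
            "median_gap_s": gaps.median() if not gaps.empty else float("nan"),
            "samples_per_min": len(group) / span_min if span_min > 0 else float("nan"),
        })
    return pd.DataFrame(rows)


summary = sampling_summary(toy)
print(summary.to_string(index=False))
```

The median gap is less sensitive than the mean to long pauses, such as a watch being taken off to charge, so it gives a fairer picture of the typical sampling rate.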

## Analysis

With our sensors and collection mechanisms in place, we can start looking at some data.

In [None]:
import sys
import sqlite3
from datetime import datetime, timedelta
import os
from typing import Optional, List
from pathlib import Path

try:
    import pandas as pd
    import matplotlib.pyplot as plt
    import numpy as np
    from scipy import stats
    import seaborn as sns
except ImportError as e:
    print(f"Error: {e}")
    print("Run: poetry install --with analysis")
    sys.exit(1)

from lib.hrDataLoaders import getGadgetbridgeData, getLoglogCsv

In [None]:
def combineHrData(*dataframes: pd.DataFrame) -> pd.DataFrame:
    """
    Combine multiple heart rate dataframes
    
    Parameters:
    -----------
    *dataframes : pd.DataFrame
        Variable number of dataframes to combine
    
    Returns:
    --------
    pd.DataFrame
        Combined dataframe with standard columns
    """
    # Filter out empty dataframes
    valid_dfs = [df for df in dataframes if not df.empty]
    
    if not valid_dfs:
        return pd.DataFrame()
    
    # Ensure all dataframes have the required columns
    required_cols = ['device_name', 'TIMESTAMP', 'HEART_RATE', 'datetime']
    
    standardized_dfs = []
    for df in valid_dfs:
        if all(col in df.columns for col in required_cols):
            # Keep the dataframe as-is: required columns plus any extras
            standardized_dfs.append(df)
        else:
            missing = [col for col in required_cols if col not in df.columns]
            print(f"Warning: Dataframe missing required columns {missing}, skipping")
    
    if standardized_dfs:
        return pd.concat(standardized_dfs, ignore_index=True)
    else:
        return pd.DataFrame()

With some boilerplate out of the way, let's load up our data and see what we have.

In [None]:
# Configuration
DB_PATH = "./data/Gadgetbridge.db"
# Path to the loglog CSV export (currently the same value as DB_PATH; update it to the CSV file)
LOGLOG_CSV = "./data/Gadgetbridge.db"

# Load Gadgetbridge data
gb_data = getGadgetbridgeData(DB_PATH)
print(f"Loaded {len(gb_data)} Gadgetbridge measurements")

# Load custom app data if path provided
if LOGLOG_CSV:
    loglog_data = getLoglogCsv(LOGLOG_CSV, device_name='loglog')
    print(f"Loaded {len(loglog_data)} custom app measurements")
else:
    print("\nNo loglog CSV provided")
    loglog_data = pd.DataFrame()

# Combine all data
print("\nCombining all data sources...")
hr_data = combineHrData(gb_data, loglog_data)

if hr_data.empty:
    print("No heart rate data found!")
else:
    # Only summarize when there is data; an empty frame has no 'datetime' column
    print(f"\nTotal HR measurements: {len(hr_data)}")
    print(f"Date range: {hr_data['datetime'].min()} to {hr_data['datetime'].max()}")
    print("\nDevices found:")
    for device in sorted(hr_data['device_name'].unique()):
        device_times = hr_data.loc[hr_data['device_name'] == device, 'datetime']
        print(f"  - {device}: {len(device_times):,} measurements ({device_times.min()} to {device_times.max()})")


In [None]:
# Calculate statistics
print("\n" + "="*80)
print("DEVICE STATISTICS")
print("="*80)
stats_df = calculate_device_statistics(hr_data)
print(stats_df.to_string(index=False, float_format=lambda x: f'{x:.2f}'))

# Compare with Polar H10
print("\n" + "="*80)
print("ACCURACY COMPARISON (vs Polar H10 Chest Strap)")
print("="*80)
polar_comparison = compare_with_polar(hr_data, tolerance_seconds=60)
if polar_comparison is not None and not polar_comparison.empty:
    print(polar_comparison.to_string(index=False, float_format=lambda x: f'{x:.2f}'))
    print("\nInterpretation:")
    print("  - MAE/RMSE: Lower is better (how many bpm off on average)")
    print("  - Correlation: Closer to 1.0 is better (how well it tracks changes)")
    print("  - Mean Diff: Positive = reads higher than Polar, Negative = reads lower")
else:
    print("Not enough simultaneous measurements to compare devices")
    print("(Devices need to be worn at the same time within 60 seconds)")

# Generate visualizations
print("\n" + "="*80)
print("GENERATING VISUALIZATIONS")
print("="*80)

print("\n1. Heart rate distributions (raw BPM)...")
plot_hr_distributions(hr_data, normalized=False)

print("2. Heart rate distributions (normalized 0-1)...")
plot_hr_distributions(hr_data, normalized=True)

print("3. Heart rate timeline (raw BPM)...")
plot_hr_timeline(hr_data, max_hours=48, normalized=False)

print("4. Heart rate timeline (normalized 0-1)...")
plot_hr_timeline(hr_data, max_hours=48, normalized=True)

if 'Polar H10' in hr_data['device_name'].values:
    print("5. Bland-Altman comparison (medical-grade accuracy plot)...")
    plot_bland_altman(hr_data, tolerance_seconds=60)

    print("6. Scatter plot comparison...")
    plot_scatter_comparison(hr_data, tolerance_seconds=60)

print("\n" + "="*80)
print("Analysis complete!")
print("="*80)
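
The comparison cell above calls helpers such as `compare_with_polar` and `plot_bland_altman` that are not defined in this notebook. As a rough illustration of what such a comparison involves, the sketch below pairs each reading from one device with the nearest-in-time Polar H10 reading using `pandas.merge_asof`, then computes the metrics named in the interpretation notes (MAE, RMSE, correlation, mean difference) plus the conventional Bland-Altman limits of agreement (mean difference ± 1.96 standard deviations). The function names `pair_with_reference` and `agreement_metrics` and the toy readings are hypothetical, and this is not necessarily how the project's own helpers work.

```python
import numpy as np
import pandas as pd


def pair_with_reference(df, device, reference="Polar H10", tolerance_seconds=60):
    """Match each reading from `device` to the nearest-in-time reference reading."""
    cols = ["datetime", "HEART_RATE"]
    dev = df.loc[df["device_name"] == device, cols].sort_values("datetime")
    ref = df.loc[df["device_name"] == reference, cols].sort_values("datetime")
    paired = pd.merge_asof(
        dev, ref, on="datetime", direction="nearest",
        tolerance=pd.Timedelta(seconds=tolerance_seconds),
        suffixes=("_dev", "_ref"),
    )
    # Readings with no reference sample inside the tolerance window are dropped
    return paired.dropna(subset=["HEART_RATE_ref"])


def agreement_metrics(paired):
    """Summary error metrics for paired device/reference readings."""
    diff = paired["HEART_RATE_dev"] - paired["HEART_RATE_ref"]
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return {
        "n": len(paired),
        "mae": diff.abs().mean(),
        "rmse": float(np.sqrt((diff ** 2).mean())),
        "mean_diff": bias,  # positive = device reads higher than the reference
        "correlation": paired["HEART_RATE_dev"].corr(paired["HEART_RATE_ref"]),
        "loa_lower": bias - 1.96 * sd,  # Bland-Altman limits of agreement
        "loa_upper": bias + 1.96 * sd,
    }


# Toy readings (illustrative only): the last BangleJS2 sample has no nearby reference
t0 = pd.Timestamp("2024-01-01 10:00:00")
toy = pd.DataFrame({
    "device_name": ["Polar H10"] * 4 + ["BangleJS2"] * 4,
    "datetime": list(t0 + pd.to_timedelta([0, 10, 20, 30], unit="s"))
    + list(t0 + pd.to_timedelta([2, 12, 22, 300], unit="s")),
    "HEART_RATE": [70, 72, 74, 76, 72, 73, 77, 90],
})

paired = pair_with_reference(toy, "BangleJS2", tolerance_seconds=5)
metrics = agreement_metrics(paired)
print(metrics)
```

Using `direction="nearest"` with a tolerance avoids interpolating either signal, at the cost of discarding unmatched samples. A tighter tolerance gives cleaner pairs but fewer of them, which matters when one device samples much more slowly than the other.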