# LDL Cholesterol Equation Comparison

This notebook compares four methods for calculating LDL cholesterol:

1. **Friedewald (1972)**: The traditional standard, assumes fixed 5:1 TG:VLDL ratio
2. **Martin-Hopkins**: Uses 180-cell lookup table for adjustable TG:VLDL factor
3. **Extended Martin-Hopkins**: Extended lookup table for TG 400-800 mg/dL
4. **Sampson (NIH Equation 2)**: Developed with beta-quantification, includes quadratic TG term

We'll compare these equations across a synthetic grid of TC, HDL, and TG values to understand when each method excels.
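As a rough orientation before importing the package, the closed-form equations can be sketched directly. This is a minimal sketch based on the published formulas, not the code in `ldlC.models`, whose signatures and edge-case handling may differ. The function names below are hypothetical, and the Martin-Hopkins lookup table is not reproduced, so its factor is passed in as an argument.

```python
def friedewald_ldl(tc, hdl, tg):
    """Friedewald (1972): LDL-C = TC - HDL-C - TG/5 (all values in mg/dL)."""
    return tc - hdl - tg / 5.0


def factor_based_ldl(tc, hdl, tg, factor):
    """Martin-Hopkins form: LDL-C = non-HDL-C - TG/factor.

    `factor` would come from the lookup table (by non-HDL-C and TG);
    the table itself is not reproduced in this sketch.
    """
    return (tc - hdl) - tg / factor


def sampson_ldl(tc, hdl, tg):
    """Sampson / NIH Equation 2, including the quadratic TG term (mg/dL)."""
    non_hdl = tc - hdl
    vldl_term = tg / 8.56 + tg * non_hdl / 2140.0 - tg ** 2 / 16100.0
    return tc / 0.948 - hdl / 0.971 - vldl_term - 9.44


print(friedewald_ldl(200, 50, 150))          # 120.0
print(factor_based_ldl(200, 50, 150, 5.0))   # same as Friedewald when factor = 5
print(round(sampson_ldl(200, 50, 150), 1))
```

With a factor of 5 the factor-based form reduces to Friedewald, which is why the two methods diverge only where the lookup table departs from 5.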

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend for notebook execution
import seaborn as sns
import sys
import warnings
sys.path.insert(0, '..')

from ldlC.models import (
    calc_ldl_friedewald,
    calc_ldl_martin_hopkins,
    calc_ldl_martin_hopkins_extended,
    calc_ldl_sampson
)

# Suppress Friedewald warnings for high TG
warnings.filterwarnings('ignore', category=UserWarning)

print('All imports successful!')

## 1. Create Synthetic Data Grid

We'll create a grid of clinically relevant values:
- **TC (Total Cholesterol)**: 150-290 mg/dL in 20 mg/dL steps
- **HDL**: 40-70 mg/dL in 10 mg/dL steps
- **TG (Triglycerides)**: 50-800 mg/dL in uneven steps (50 mg/dL steps up to 200, then 100 mg/dL steps)
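A full-factorial grid like this can also be built in one step with `itertools.product`. The sketch below only constructs the input combinations and assumes the same three value arrays defined in the next cell; the column names follow the ones used later in this notebook.

```python
from itertools import product

import numpy as np
import pandas as pd

tc_values = np.arange(150, 310, 20)       # 150, 170, ..., 290 (8 values)
hdl_values = np.array([40, 50, 60, 70])   # 4 values
tg_values = np.array([50, 100, 150, 200, 300, 400, 500, 600, 700, 800])  # 10 values

# Every (TC, HDL, TG) combination, one row per synthetic profile
grid = pd.DataFrame(
    list(product(tc_values, hdl_values, tg_values)),
    columns=["tc_mgdl", "hdl_mgdl", "tg_mgdl"],
)
grid["non_hdl_mgdl"] = grid["tc_mgdl"] - grid["hdl_mgdl"]

print(len(grid))  # 320 = 8 * 4 * 10
```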

In [None]:
# Define value ranges
tc_values = np.arange(150, 310, 20)  # 150, 170, 190, ..., 290
hdl_values = np.array([40, 50, 60, 70])  # Common HDL levels
tg_values = np.array([50, 100, 150, 200, 300, 400, 500, 600, 700, 800])  # Key TG thresholds

print(f"TC values: {tc_values}")
print(f"HDL values: {hdl_values}")
print(f"TG values: {tg_values}")
print(f"\nTotal combinations: {len(tc_values) * len(hdl_values) * len(tg_values)}")

In [None]:
# Generate all combinations
results = []

for tc in tc_values:
    for hdl in hdl_values:
        for tg in tg_values:
            # Calculate LDL with each equation
            ldl_friedewald = calc_ldl_friedewald(tc, hdl, tg)
            ldl_martin_hopkins = calc_ldl_martin_hopkins(tc, hdl, tg)
            ldl_extended_mh = calc_ldl_martin_hopkins_extended(tc, hdl, tg)
            ldl_sampson = calc_ldl_sampson(tc, hdl, tg)
            
            results.append({
                'tc_mgdl': tc,
                'hdl_mgdl': hdl,
                'tg_mgdl': tg,
                'non_hdl_mgdl': tc - hdl,
                'ldl_friedewald': ldl_friedewald,
                'ldl_martin_hopkins': ldl_martin_hopkins,
                'ldl_extended_mh': ldl_extended_mh,
                'ldl_sampson': ldl_sampson
            })

df = pd.DataFrame(results)
print(f"Generated {len(df)} combinations")
df.head(10)

## 2. Calculate Differences Between Equations

We'll first calculate how far each newer equation departs from Friedewald (the traditional baseline), then compare the modern equations with each other, using Martin-Hopkins as the reference (since it's widely considered more accurate than Friedewald).
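One detail worth keeping in mind: a signed difference is NaN whenever either input is NaN, and pandas skips NaN by default when averaging. The toy frame below uses made-up placeholder numbers (not outputs of any equation) purely to show that behaviour.

```python
import numpy as np
import pandas as pd

# Placeholder values for illustration only, not real equation outputs
toy = pd.DataFrame({
    "tg_mgdl": [150, 500],
    "ldl_friedewald": [120.0, np.nan],   # NaN stands in for an out-of-range result
    "ldl_martin_hopkins": [123.0, 95.0],
})

# Signed difference: positive means Martin-Hopkins gives the higher estimate
toy["diff_mh_vs_friedewald"] = toy["ldl_martin_hopkins"] - toy["ldl_friedewald"]

print(toy["diff_mh_vs_friedewald"].tolist())  # [3.0, nan]
print(toy["diff_mh_vs_friedewald"].mean())    # 3.0 (NaN rows are skipped)
```

Because of this, averages over strata that include out-of-range rows silently reflect only the rows where both equations returned a value.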

In [None]:
# Calculate differences from Friedewald (traditional baseline)
df['diff_mh_vs_friedewald'] = df['ldl_martin_hopkins'] - df['ldl_friedewald']
df['diff_emh_vs_friedewald'] = df['ldl_extended_mh'] - df['ldl_friedewald']
df['diff_sampson_vs_friedewald'] = df['ldl_sampson'] - df['ldl_friedewald']

# Calculate differences between modern equations
df['diff_sampson_vs_mh'] = df['ldl_sampson'] - df['ldl_martin_hopkins']
df['diff_emh_vs_mh'] = df['ldl_extended_mh'] - df['ldl_martin_hopkins']

# Split the grid at the Friedewald limit (TG = 400 mg/dL); Friedewald returns NaN above it
df_valid_friedewald = df[df['tg_mgdl'] <= 400].copy()
df_high_tg = df[df['tg_mgdl'] > 400].copy()

print(f"Valid Friedewald records (TG ≤ 400): {len(df_valid_friedewald)}")
print(f"High TG records (TG > 400): {len(df_high_tg)}")

## 3. Heatmaps: Equation Differences Across TG and TC Ranges

These heatmaps show how the equations diverge at different TG and TC levels.
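Each heatmap is fed by a TG-by-TC table in which the HDL dimension has been averaged out. A minimal sketch of that `groupby(...).mean().unstack()` step on a toy frame (the `diff` values are arbitrary placeholders):

```python
import pandas as pd

# Two TG levels x two TC levels x two HDL levels, with placeholder differences
toy = pd.DataFrame({
    "tg_mgdl":  [100, 100, 100, 100, 200, 200, 200, 200],
    "tc_mgdl":  [150, 150, 170, 170, 150, 150, 170, 170],
    "hdl_mgdl": [40, 50, 40, 50, 40, 50, 40, 50],
    "diff":     [1.0, 3.0, 2.0, 4.0, 5.0, 7.0, 6.0, 8.0],
})

# Mean over HDL within each (TG, TC) cell, then TC moves to the columns
pivot = toy.groupby(["tg_mgdl", "tc_mgdl"])["diff"].mean().unstack()

print(pivot.shape)           # (2, 2): rows are TG, columns are TC
print(pivot.loc[100, 150])   # 2.0, the mean of 1.0 and 3.0
```

Averaging over HDL keeps the plots readable, at the cost of hiding any HDL-dependent divergence within a cell.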

In [None]:
# Create pivot table for Martin-Hopkins vs Friedewald difference
# Average across HDL values
pivot_mh_fried = df_valid_friedewald.groupby(['tg_mgdl', 'tc_mgdl'])['diff_mh_vs_friedewald'].mean().unstack()

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Heatmap 1: Martin-Hopkins vs Friedewald
sns.heatmap(pivot_mh_fried, annot=True, fmt='.1f', cmap='RdBu_r', center=0,
            ax=axes[0], cbar_kws={'label': 'Difference (mg/dL)'})
axes[0].set_title('Martin-Hopkins minus Friedewald\n(TG ≤ 400 mg/dL)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('TC (mg/dL)')
axes[0].set_ylabel('TG (mg/dL)')

# Heatmap 2: Sampson vs Friedewald
pivot_samp_fried = df_valid_friedewald.groupby(['tg_mgdl', 'tc_mgdl'])['diff_sampson_vs_friedewald'].mean().unstack()
sns.heatmap(pivot_samp_fried, annot=True, fmt='.1f', cmap='RdBu_r', center=0,
            ax=axes[1], cbar_kws={'label': 'Difference (mg/dL)'})
axes[1].set_title('Sampson minus Friedewald\n(TG ≤ 400 mg/dL)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('TC (mg/dL)')
axes[1].set_ylabel('TG (mg/dL)')

# Heatmap 3: Sampson vs Martin-Hopkins
pivot_samp_mh = df_valid_friedewald.groupby(['tg_mgdl', 'tc_mgdl'])['diff_sampson_vs_mh'].mean().unstack()
sns.heatmap(pivot_samp_mh, annot=True, fmt='.1f', cmap='RdBu_r', center=0,
            ax=axes[2], cbar_kws={'label': 'Difference (mg/dL)'})
axes[2].set_title('Sampson minus Martin-Hopkins\n(TG ≤ 400 mg/dL)', fontsize=12, fontweight='bold')
axes[2].set_xlabel('TC (mg/dL)')
axes[2].set_ylabel('TG (mg/dL)')

plt.tight_layout()
plt.savefig('equation_comparison_heatmaps.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved: equation_comparison_heatmaps.png')

## 4. High TG Analysis (400-800 mg/dL)

For TG > 400 mg/dL, Friedewald is unreliable (returns NaN). Let's compare the modern equations in this range.

In [None]:
# Compare extended Martin-Hopkins vs Standard Martin-Hopkins at high TG
pivot_emh_mh = df_high_tg.groupby(['tg_mgdl', 'tc_mgdl'])['diff_emh_vs_mh'].mean().unstack()

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Heatmap: Extended M-H vs M-H
sns.heatmap(pivot_emh_mh, annot=True, fmt='.1f', cmap='RdBu_r', center=0,
            ax=axes[0], cbar_kws={'label': 'Difference (mg/dL)'})
axes[0].set_title('Extended M-H minus Standard M-H\n(TG > 400 mg/dL)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('TC (mg/dL)')
axes[0].set_ylabel('TG (mg/dL)')

# Heatmap: Sampson vs M-H at high TG
pivot_samp_mh_high = df_high_tg.groupby(['tg_mgdl', 'tc_mgdl'])['diff_sampson_vs_mh'].mean().unstack()
sns.heatmap(pivot_samp_mh_high, annot=True, fmt='.1f', cmap='RdBu_r', center=0,
            ax=axes[1], cbar_kws={'label': 'Difference (mg/dL)'})
axes[1].set_title('Sampson minus Martin-Hopkins\n(TG > 400 mg/dL)', fontsize=12, fontweight='bold')
axes[1].set_xlabel('TC (mg/dL)')
axes[1].set_ylabel('TG (mg/dL)')

plt.tight_layout()
plt.savefig('high_tg_comparison_heatmaps.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved: high_tg_comparison_heatmaps.png')

## 5. Line Plots: LDL Estimates Across TG Range

Visualize how each equation behaves as TG increases, for a fixed TC and HDL.

In [None]:
# Example patient profile: TC=190, HDL=50 (TC=200 is not on the grid)
df_example = df[(df['tc_mgdl'] == 190) & (df['hdl_mgdl'] == 50)].sort_values('tg_mgdl')

fig, ax = plt.subplots(figsize=(12, 7))

# Plot each equation
ax.plot(df_example['tg_mgdl'], df_example['ldl_friedewald'], 
        'o-', label='Friedewald', linewidth=2, markersize=8, color='#e74c3c')
ax.plot(df_example['tg_mgdl'], df_example['ldl_martin_hopkins'], 
        's-', label='Martin-Hopkins', linewidth=2, markersize=8, color='#3498db')
ax.plot(df_example['tg_mgdl'], df_example['ldl_extended_mh'], 
        '^-', label='Extended M-H', linewidth=2, markersize=8, color='#2ecc71')
ax.plot(df_example['tg_mgdl'], df_example['ldl_sampson'], 
        'D-', label='Sampson', linewidth=2, markersize=8, color='#9b59b6')

# Add vertical line at Friedewald threshold
ax.axvline(x=400, color='red', linestyle='--', alpha=0.7, label='Friedewald limit (400 mg/dL)')

ax.set_xlabel('Triglycerides (mg/dL)', fontsize=12)
ax.set_ylabel('Estimated LDL-C (mg/dL)', fontsize=12)
ax.set_title('LDL-C Estimates by Equation\n(TC=190, HDL=50 mg/dL)', fontsize=14, fontweight='bold')
ax.legend(loc='upper right', fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_xlim(0, 850)

plt.tight_layout()
plt.savefig('ldl_by_tg_lineplot.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved: ldl_by_tg_lineplot.png')

## 6. Summary Statistics by TG Stratum

Calculate mean differences between equations within clinically relevant TG strata.
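A possible alternative to a hand-written `if`/`elif` function is `pd.cut`, which returns an ordered categorical. With plain string labels, `groupby` sorts alphabetically and the `< 150` stratum ends up last; an ordered categorical keeps the strata in clinical order. The sketch below uses the same cut points and labels as this section.

```python
import pandas as pd

tg = pd.Series([50, 149, 150, 199, 200, 399, 400, 800])

bins = [0, 150, 200, 400, float("inf")]
labels = ['< 150 (Normal)', '150-199 (Borderline)',
          '200-399 (High)', '400-800 (Very High)']

# right=False makes each bin closed on the left: [0, 150), [150, 200), ...
stratum = pd.cut(tg, bins=bins, labels=labels, right=False)

# Grouping on the categorical preserves the label order given above
counts = tg.groupby(stratum, observed=True).count()
print(counts)
```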

In [None]:
# Define TG strata
def tg_stratum(tg):
    if tg < 150:
        return '< 150 (Normal)'
    elif tg < 200:
        return '150-199 (Borderline)'
    elif tg < 400:
        return '200-399 (High)'
    else:
        return '400-800 (Very High)'

df['tg_stratum'] = df['tg_mgdl'].apply(tg_stratum)

# Calculate summary statistics
summary_cols = ['ldl_friedewald', 'ldl_martin_hopkins', 'ldl_extended_mh', 'ldl_sampson']
summary = df.groupby('tg_stratum')[summary_cols].agg(['mean', 'std']).round(1)

# Flatten column names
summary.columns = ['_'.join(col).strip() for col in summary.columns.values]
print('Mean and SD of LDL-C (mg/dL) by TG Stratum:')
summary

In [None]:
# Mean absolute differences from Martin-Hopkins (reference)
df['abs_diff_fried_mh'] = np.abs(df['ldl_friedewald'] - df['ldl_martin_hopkins'])
df['abs_diff_samp_mh'] = np.abs(df['ldl_sampson'] - df['ldl_martin_hopkins'])
df['abs_diff_emh_mh'] = np.abs(df['ldl_extended_mh'] - df['ldl_martin_hopkins'])

diff_summary = df.groupby('tg_stratum').agg({
    'abs_diff_fried_mh': 'mean',
    'abs_diff_samp_mh': 'mean',
    'abs_diff_emh_mh': 'mean'
}).round(2)
diff_summary.columns = ['Friedewald vs M-H', 'Sampson vs M-H', 'Extended M-H vs M-H']

print('\nMean Absolute Difference from Martin-Hopkins (mg/dL):')
diff_summary

## 7. When Does Each Equation Excel?

### Friedewald (1972)
- ✅ **Best for**: Normal TG levels (< 150 mg/dL)
- ✅ **Advantages**: Simple, widely understood, validated in most clinical labs
- ❌ **Limitations**: Assumes fixed 5:1 TG:VLDL ratio; unreliable for TG > 400 mg/dL
- ⚠️ **Not recommended**: Diabetes, metabolic syndrome, very low LDL-C

### Martin-Hopkins
- ✅ **Best for**: Patients with elevated TG (150-400 mg/dL) or very low LDL-C
- ✅ **Advantages**: Uses individualized TG:VLDL factor based on patient's lipid profile
- ❌ **Limitations**: Requires lookup table; less validated at TG > 400
- ⭐ **Widely adopted**: Johns Hopkins, American Heart Association endorsed

### Extended Martin-Hopkins
- ✅ **Best for**: Very high TG (400-800 mg/dL)
- ✅ **Advantages**: Finer granularity at high TG levels
- ❌ **Limitations**: Less widely validated than standard M-H
- 🔬 **Use case**: Research settings with hypertriglyceridemia

### Sampson (NIH Equation 2)
- ✅ **Best for**: High TG (200-800 mg/dL), especially for research validation
- ✅ **Advantages**: Developed with beta-quantification (gold standard); accounts for quadratic TG effects
- ❌ **Limitations**: More complex formula; less clinical adoption currently
- 🔬 **NASA-endorsed**: Used in spaceflight medicine research

In [None]:
# Create a visual summary of equation recommendations
fig, ax = plt.subplots(figsize=(14, 6))

# Define TG ranges and colors
tg_ranges = ['< 150', '150-199', '200-399', '400-800']  # same strata as Section 6
equations = ['Friedewald', 'Martin-Hopkins', 'Extended M-H', 'Sampson']
colors = ['#e74c3c', '#3498db', '#2ecc71', '#9b59b6']

# Recommendation matrix (3=best, 2=good, 1=acceptable, 0=not recommended)
# Friedewald:     [3, 2, 1, 0]
# Martin-Hopkins: [3, 3, 3, 2]
# Extended M-H:   [2, 2, 2, 3]
# Sampson:        [2, 3, 3, 3]
recommendations = np.array([
    [3, 2, 1, 0],  # Friedewald
    [3, 3, 3, 2],  # Martin-Hopkins
    [2, 2, 2, 3],  # Extended M-H
    [2, 3, 3, 3],  # Sampson
])

# Create heatmap
# plt.get_cmap works across Matplotlib versions; plt.cm.get_cmap is gone in newer releases
cmap = plt.get_cmap('RdYlGn', 4)  # Red to Green with 4 levels
im = ax.imshow(recommendations, cmap=cmap, vmin=0, vmax=3, aspect='auto')

# Labels
ax.set_xticks(np.arange(len(tg_ranges)))
ax.set_yticks(np.arange(len(equations)))
ax.set_xticklabels(tg_ranges, fontsize=12)
ax.set_yticklabels(equations, fontsize=12)
ax.set_xlabel('Triglyceride Range (mg/dL)', fontsize=14)
ax.set_ylabel('Equation', fontsize=14)
ax.set_title('Equation Recommendation by TG Level', fontsize=16, fontweight='bold')

# Add text annotations
labels = ['Not Rec.', 'Acceptable', 'Good', 'Best']
for i in range(len(equations)):
    for j in range(len(tg_ranges)):
        text = labels[recommendations[i, j]]
        color = 'white' if recommendations[i, j] in [0, 3] else 'black'
        ax.text(j, i, text, ha='center', va='center', fontsize=10, fontweight='bold', color=color)

plt.tight_layout()
plt.savefig('equation_recommendations.png', dpi=150, bbox_inches='tight')
plt.show()
print('Saved: equation_recommendations.png')

## 8. Clinical Implications

### Key Takeaways

1. **For routine clinical use (TG < 400)**: Martin-Hopkins is preferred over Friedewald, especially when LDL-C is low or borderline.

2. **For hypertriglyceridemia (TG 400-800)**: Use Sampson or Extended Martin-Hopkins. Friedewald should not be used.

3. **Maximum differences**: The largest discrepancies between equations occur at:
   - High TG (> 300 mg/dL)
   - High TC (> 250 mg/dL)
   - Low HDL combined with high TG

4. **Hybrid approach**: Our ML model will use all equation outputs as features, learning the optimal combination for each patient profile.
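
The third takeaway can be checked against the grid by computing, for each row, the spread between the highest and lowest estimate and then listing the profiles where it is largest. The sketch below assumes a frame shaped like `df` from this notebook; the toy rows and their LDL values are arbitrary placeholders, not results.

```python
import pandas as pd

# Placeholder rows shaped like the notebook's `df`; the LDL values are made up
toy = pd.DataFrame({
    "tc_mgdl": [190, 190, 290],
    "hdl_mgdl": [50, 50, 40],
    "tg_mgdl": [100, 300, 400],
    "ldl_martin_hopkins": [121.0, 95.0, 190.0],
    "ldl_extended_mh": [121.0, 95.0, 190.0],
    "ldl_sampson": [120.0, 92.0, 184.0],
})

ldl_cols = ["ldl_martin_hopkins", "ldl_extended_mh", "ldl_sampson"]

# Row-wise disagreement between equations (max estimate minus min estimate)
toy["spread"] = toy[ldl_cols].max(axis=1) - toy[ldl_cols].min(axis=1)

# Profiles where the equations disagree most
top = toy.nlargest(2, "spread")[["tc_mgdl", "hdl_mgdl", "tg_mgdl", "spread"]]
print(top)
```

Running the same two lines on the real `df` would show which TC, HDL, and TG combinations drive the largest discrepancies.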

In [None]:
print('Notebook completed successfully!')
print('\nGenerated files:')
print('  - equation_comparison_heatmaps.png')
print('  - high_tg_comparison_heatmaps.png')
print('  - ldl_by_tg_lineplot.png')
print('  - equation_recommendations.png')