# Analyze Data Discrepancies Between Sources

**Purpose**: Identify and document conflicts between the 2018 paper (Supplementary Table S2) and the Fitness Browser database (`feba.db`)

**Why This Matters**:
- Understanding discrepancies validates our tiered data strategy
- Documents cases where sources disagree
- Helps explain why we can't assume "absence of data = no growth"
- Both sources come from the same lab (Morgan Price's team), so they are treated as equally trustworthy
- Key difference: the paper reports both Growth and No Growth; the database records Growth only

**Types of Discrepancies**:
1. **FB_NoData_Supp_Growth**: Fitness Browser has no data, but Supplementary Table shows growth
2. **FB_Growth_Supp_NoGrowth**: Fitness Browser shows growth, but Supplementary Table shows no growth
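
The two discrepancy types above reduce to a small decision rule on two inputs: whether the Fitness Browser has data for the pair, and the call in the supplementary table. A minimal sketch of that rule follows; the helper name `classify_pair` is hypothetical and is not used elsewhere in this notebook.

```python
from typing import Optional


def classify_pair(fb_has_data: bool, supp_call: str) -> Optional[str]:
    """Return the discrepancy type for one organism/carbon-source pair, or None.

    Hypothetical helper that mirrors the two types listed above.
    """
    if not fb_has_data and supp_call == "Growth":
        return "FB_NoData_Supp_Growth"
    if fb_has_data and supp_call == "No Growth":
        return "FB_Growth_Supp_NoGrowth"
    # Every other combination is not counted as a discrepancy.
    return None


for fb_flag, call in [(False, "Growth"), (True, "No Growth"), (True, "Growth")]:
    print(fb_flag, call, "->", classify_pair(fb_flag, call))
```

Note that a "No Growth" call with no Fitness Browser data returns `None`: the database records growth only, so missing data there does not contradict the paper.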

**Output File**:
- `results/data_source_discrepancies.csv` - All discrepancies with metadata

**Last updated**: 2025-10-14

## Setup

In [1]:
import sqlite3
import pandas as pd
import numpy as np
from pathlib import Path
import re

print("Imports successful")

Imports successful


## Configuration

In [2]:
# Paths
DB_PATH = Path("../data/source/feba.db")
RESULTS_DIR = Path("results")

# Input files
SUPP_TABLE_FILE = RESULTS_DIR / "supplementary_table_s2_clean.csv"
ORGANISM_METADATA_FILE = RESULTS_DIR / "organism_metadata.csv"

# Output file
OUTPUT_FILE = RESULTS_DIR / "data_source_discrepancies.csv"

print("Configuration set")
print(f"  Database: {DB_PATH}")
print(f"  Supplementary table: {SUPP_TABLE_FILE}")
print(f"  Output: {OUTPUT_FILE}")

Configuration set
  Database: ../data/source/feba.db
  Supplementary table: results/supplementary_table_s2_clean.csv
  Output: results/data_source_discrepancies.csv


## Load Data Sources

In [3]:
# Load supplementary table
print("Loading supplementary table...")
supp_data = pd.read_csv(SUPP_TABLE_FILE, index_col=0)
print(f"  Loaded: {supp_data.shape[0]} carbon sources × {supp_data.shape[1]} organisms")

# Load organism metadata
print("\nLoading organism metadata...")
metadata = pd.read_csv(ORGANISM_METADATA_FILE)
orgid_to_species = dict(zip(metadata['orgId'], metadata['Species_Name']))
species_to_orgid = dict(zip(metadata['Species_Name'], metadata['orgId']))
print(f"  Loaded: {len(metadata)} organisms")

# Load Fitness Browser data
print("\nLoading Fitness Browser data...")
conn = sqlite3.connect(str(DB_PATH))
query = """
SELECT DISTINCT
    e.orgId,
    e.condition_1 as carbon_source
FROM Experiment e
WHERE e.expGroup = 'carbon source'
  AND e.num > 0
"""
fb_data = pd.read_sql_query(query, conn)
conn.close()
print(f"  Loaded: {len(fb_data)} organism-carbon pairs")

Loading supplementary table...
  Loaded: 94 carbon sources × 28 organisms

Loading organism metadata...
  Loaded: 57 organisms

Loading Fitness Browser data...
  Loaded: 1089 organism-carbon pairs


## Create Lookup Sets for Fitness Browser

In [4]:
# Create set of (orgId, carbon_source) tuples for fast lookup
fb_pairs = set(zip(fb_data['orgId'], fb_data['carbon_source']))

print(f"Created Fitness Browser lookup set with {len(fb_pairs):,} pairs")

Created Fitness Browser lookup set with 1,089 pairs
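
Membership in this lookup set is tested by exact string equality, so a difference in case or whitespace between the two sources' carbon source names would show up as "No Data". The sketch below shows one way to check for such near misses before interpreting the counts. The names in it are toy values, and `normalize` is a hypothetical helper; whether the real sources differ in this way is not established here.

```python
def normalize(name: str) -> str:
    """Collapse whitespace and ignore case (hypothetical normalization rule)."""
    return " ".join(name.split()).casefold()


# Toy names for illustration only; not taken from either data source.
supp_names = ["Sucrose", "Sodium propionate", "D-Glucose "]
fb_names = ["sucrose", "Sodium propionate", "D-Xylose"]

fb_exact = set(fb_names)
fb_normalized = {normalize(n) for n in fb_names}

# Names that fail the exact match used by the lookup set.
exact_missing = [n for n in supp_names if n not in fb_exact]
# Of those, names that would match after normalization.
near_misses = [n for n in exact_missing if normalize(n) in fb_normalized]

print("Missing under exact match:", exact_missing)
print("Would match after normalization:", near_misses)
```

Any name reported as a near miss would be worth reviewing by hand before counting it as missing data.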


## Find Discrepancies

In [5]:
print("\nAnalyzing discrepancies between data sources...\n")

discrepancies = []

# Iterate through supplementary table
for carbon_supp in supp_data.index:
    for organism_supp in supp_data.columns:
        supp_call = supp_data.loc[carbon_supp, organism_supp]
        
        # Skip if no data in supplementary table
        if pd.isna(supp_call) or supp_call == '':
            continue
        
        # Get orgId for this organism
        org_id = species_to_orgid.get(organism_supp)
        if not org_id:
            continue
        
        # Check if this pair exists in Fitness Browser
        fb_has_data = (org_id, carbon_supp) in fb_pairs
        
        # Type 1: FB has no data, but Supp shows growth
        if not fb_has_data and supp_call == 'Growth':
            discrepancies.append({
                'Organism_ID': org_id,
                'Organism_Name': organism_supp,
                'Carbon_Source_Supp': carbon_supp,
                'Carbon_Source_Our': carbon_supp,
                'Fitness_Browser_Call': 'No Data',
                'Supplementary_Call': 'Growth',
                'Type': 'FB_NoData_Supp_Growth'
            })
        
        # Type 2: FB shows growth, but Supp shows no growth
        elif fb_has_data and supp_call == 'No Growth':
            discrepancies.append({
                'Organism_ID': org_id,
                'Organism_Name': organism_supp,
                'Carbon_Source_Supp': carbon_supp,
                'Carbon_Source_Our': carbon_supp,
                'Fitness_Browser_Call': 'Growth',
                'Supplementary_Call': 'No Growth',
                'Type': 'FB_Growth_Supp_NoGrowth'
            })

# Convert to DataFrame
discrepancy_df = pd.DataFrame(discrepancies)

print(f"Found {len(discrepancy_df)} total discrepancies")


Analyzing discrepancies between data sources...

Found 173 total discrepancies
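
The nested loop above can also be written without explicit iteration by reshaping the table to long format. The following is a sketch on toy data, not a replacement for the cell above. The frame contents, the organism names, and the name `sketch_df` are illustrative assumptions; only the column names and type labels are taken from the loop.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for supp_data, species_to_orgid and fb_pairs (illustrative values).
supp_data = pd.DataFrame(
    {"Org A": ["Growth", "No Growth"], "Org B": ["Growth", None]},
    index=["Sucrose", "Acetate"],
)
species_to_orgid = {"Org A": "orgA", "Org B": "orgB"}
fb_pairs = {("orgA", "Acetate"), ("orgB", "Sucrose")}

# One row per (carbon source, organism) cell; drop cells with no call.
long = (
    supp_data.rename_axis("Carbon_Source_Supp")
    .reset_index()
    .melt(
        id_vars="Carbon_Source_Supp",
        var_name="Organism_Name",
        value_name="Supplementary_Call",
    )
    .dropna(subset=["Supplementary_Call"])
)

# Map organism names to orgIds and drop organisms without a mapping.
long = long.assign(
    Organism_ID=long["Organism_Name"].map(species_to_orgid)
).dropna(subset=["Organism_ID"])

# Flag pairs that are present in the Fitness Browser lookup set.
fb_has_data = pd.Series(
    [pair in fb_pairs for pair in zip(long["Organism_ID"], long["Carbon_Source_Supp"])],
    index=long.index,
    dtype=bool,
)

is_type1 = ~fb_has_data & long["Supplementary_Call"].eq("Growth")
is_type2 = fb_has_data & long["Supplementary_Call"].eq("No Growth")
types = pd.Series(
    np.select(
        [is_type1, is_type2],
        ["FB_NoData_Supp_Growth", "FB_Growth_Supp_NoGrowth"],
        default="",
    ),
    index=long.index,
)

# Keep only the rows flagged as one of the two discrepancy types.
sketch_df = long.assign(Type=types)[types != ""]
print(sketch_df[["Organism_ID", "Carbon_Source_Supp", "Supplementary_Call", "Type"]])
```

For a table of this size the loop is fast enough. The long format mainly makes it easier to inspect the cells that are not flagged.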


## Analyze Discrepancy Types

In [6]:
print("\nDiscrepancy breakdown by type:")
print("="*70)

for disc_type in discrepancy_df['Type'].unique():
    count = (discrepancy_df['Type'] == disc_type).sum()
    print(f"\n{disc_type}: {count} cases")
    
    if disc_type == 'FB_NoData_Supp_Growth':
        print("  Meaning: Fitness Browser has NO data, but Supplementary Table shows GROWTH")
        print("  Implication: Experiment was done (organism grew), but not in feba.db")
        print("  Reason: Pre-publication data, different setup, or filtered from database")
    
    elif disc_type == 'FB_Growth_Supp_NoGrowth':
        print("  Meaning: Fitness Browser shows GROWTH, but Supplementary Table shows NO GROWTH")
        print("  Implication: Direct contradiction between sources")
        print("  Resolution: Prioritize Tier 1 data (Supplementary Table)")
    
    # Show first few examples
    examples = discrepancy_df[discrepancy_df['Type'] == disc_type].head(5)
    print(f"\n  First {len(examples)} examples:")
    for _, row in examples.iterrows():
        print(f"    {row['Organism_Name']} + {row['Carbon_Source_Supp']}")

print("\n" + "="*70)


Discrepancy breakdown by type:

FB_NoData_Supp_Growth: 170 cases
  Meaning: Fitness Browser has NO data, but Supplementary Table shows GROWTH
  Implication: Experiment was done (organism grew), but not in feba.db
  Reason: Pre-publication data, different setup, or filtered from database

  First 5 examples:
    Acidovorax sp. GW101-3H11 + Sucrose
    Burkholderia phytofirmans PsJN + Sucrose
    Caulobacter crescentus NA1000 + Sucrose
    Escherichia coli BW25113 + Sucrose
    Herbaspirillum seropedicae SmR1 + Sucrose

FB_Growth_Supp_NoGrowth: 3 cases
  Meaning: Fitness Browser shows GROWTH, but Supplementary Table shows NO GROWTH
  Implication: Direct contradiction between sources
  Resolution: Prioritize Tier 1 data (Supplementary Table)

  First 3 examples:
    Marinobacter adhaerens HP15 + Sodium Fumarate dibasic
    Burkholderia phytofirmans PsJN + 2-Deoxy-D-Ribose
    Burkholderia phytofirmans PsJN + Sodium propionate



## Calculate Statistics

In [7]:
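# Total comparisons possible: every non-missing cell in the supplementary table.
# Note: this also counts cells the discrepancy loop skips (organisms with no orgId mapping).
total_comparisons = supp_data.notna().sum().sum()
n_discrepancies = len(discrepancy_df)
# "Agreements" = every counted cell not flagged as a discrepancy. This includes
# Growth cells with Fitness Browser data and No Growth cells without it.
n_agreements = total_comparisons - n_discrepancies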

print("\nOverall statistics:")
print("="*70)
print(f"  Total comparisons: {total_comparisons:,}")
print(f"  Agreements: {n_agreements:,} ({100*n_agreements/total_comparisons:.1f}%)")
print(f"  Discrepancies: {n_discrepancies:,} ({100*n_discrepancies/total_comparisons:.1f}%)")
print("="*70)


Overall statistics:
  Total comparisons: 2,632
  Agreements: 2,459 (93.4%)
  Discrepancies: 173 (6.6%)
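
The agreement figure is computed as total minus discrepancies, so it combines two different situations: pairs where both sources show growth, and pairs where the paper shows no growth and the Fitness Browser has no data. A cross-tabulation separates them. The sketch below uses a toy frame with an assumed `fb_has_data` column; the counts are illustrative and are not the notebook's results.

```python
import pandas as pd

# Toy long-format data: one row per organism/carbon-source pair (illustrative only).
toy = pd.DataFrame(
    {
        "Supplementary_Call": ["Growth", "Growth", "No Growth", "No Growth", "Growth"],
        "fb_has_data": [True, False, True, False, True],
    }
)

# Rows: call in the supplementary table; columns: presence in the Fitness Browser.
table = pd.crosstab(toy["Supplementary_Call"], toy["fb_has_data"])
print(table)

# The off-diagonal cells are the two discrepancy types;
# the remaining two cells are what the summary counts as agreements.
n_type1 = table.loc["Growth", False]
n_type2 = table.loc["No Growth", True]
print(f"FB_NoData_Supp_Growth: {n_type1}, FB_Growth_Supp_NoGrowth: {n_type2}")
```

Reporting the four cells separately shows how much of the agreement percentage comes from pairs with positive evidence in both sources.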


## Save Discrepancies

In [8]:
print(f"\nSaving discrepancies to: {OUTPUT_FILE}")
discrepancy_df.to_csv(OUTPUT_FILE, index=False)
print(f"Saved {len(discrepancy_df)} discrepancies")


Saving discrepancies to: results/data_source_discrepancies.csv
Saved 173 discrepancies


## Summary

In [9]:
print("="*70)
print("DATA DISCREPANCY ANALYSIS COMPLETE")
print("="*70)

print(f"\nKey Findings:")

n_type1 = (discrepancy_df['Type'] == 'FB_NoData_Supp_Growth').sum()
n_type2 = (discrepancy_df['Type'] == 'FB_Growth_Supp_NoGrowth').sum()

print(f"\n1. Missing Fitness Browser Data ({n_type1} cases):")
print(f"   - Supplementary Table shows GROWTH")
print(f"   - Fitness Browser has NO DATA")
print(f"   - Proves: Cannot assume 'absence of data = no growth'")

print(f"\n2. Direct Contradictions ({n_type2} cases):")
print(f"   - Fitness Browser shows GROWTH")
print(f"   - Supplementary Table shows NO GROWTH")
print(f"   - Resolution: Prioritize Tier 1 data (Supplementary Table)")

print(f"\nImplications for Data Strategy:")
print(f"  - Use Supplementary Table S2 for BOTH growth and no-growth (Tier 1)")
print(f"  - Use Fitness Browser for GROWTH calls only (Tier 2)")
print(f"  - Do NOT assume absence of Fitness Browser data = no growth")
print(f"  - {n_type1} experiments exist in supp table but not in feba.db")

print(f"\nFiles created:")
print(f"  {OUTPUT_FILE}")

print(f"\nConsistency: {100*n_agreements/total_comparisons:.1f}%")

print("\n" + "="*70)

DATA DISCREPANCY ANALYSIS COMPLETE

Key Findings:

1. Missing Fitness Browser Data (170 cases):
   - Supplementary Table shows GROWTH
   - Fitness Browser has NO DATA
   - Proves: Cannot assume 'absence of data = no growth'

2. Direct Contradictions (3 cases):
   - Fitness Browser shows GROWTH
   - Supplementary Table shows NO GROWTH
   - Resolution: Prioritize Tier 1 data (Supplementary Table)

Implications for Data Strategy:
  - Use Supplementary Table S2 for BOTH growth and no-growth (Tier 1)
  - Use Fitness Browser for GROWTH calls only (Tier 2)
  - Do NOT assume absence of Fitness Browser data = no growth
  - 170 experiments exist in supp table but not in feba.db

Files created:
  results/data_source_discrepancies.csv

Consistency: 93.4%
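
The data strategy summarized above can be expressed as a small resolution function: take the supplementary table call when one exists (Tier 1), fall back to the Fitness Browser for growth only (Tier 2), and otherwise leave the pair undetermined. This is a minimal sketch. The function name `resolve_call` and the `"Unknown"` label are assumptions made for illustration and are not part of the notebook's outputs.

```python
from typing import Optional


def resolve_call(supp_call: Optional[str], fb_has_data: bool) -> str:
    """Combine the two sources into one call (hypothetical helper)."""
    # Tier 1: the supplementary table reports both Growth and No Growth.
    if supp_call in ("Growth", "No Growth"):
        return supp_call
    # Tier 2: the Fitness Browser only supports a Growth call.
    if fb_has_data:
        return "Growth"
    # No data in either source is NOT treated as No Growth.
    return "Unknown"


print(resolve_call("No Growth", True))   # Tier 1 wins over a conflicting Tier 2 record
print(resolve_call(None, True))          # falls back to Tier 2
print(resolve_call(None, False))         # stays undetermined
```

Returning a separate label for the undetermined case keeps "absence of data" from being converted into a no-growth call further downstream.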

