# Dark Matter Genes: Exploring the Unknown 37%

**Research Question**: What are the gene clusters with COG category "S" (Function unknown) actually doing?

The COG analysis revealed that novel genes are enriched in the S category (+1.64% enrichment, 69% consistency across species). These proteins have no assigned function: they are the "dark matter" of bacterial genomes.

## Objectives

1. Quantify the scale of unknown function genes across the pangenome
2. Analyze S-category distribution by core/auxiliary/singleton status
3. Look for composite COG categories involving S (e.g., "SV", "LS", "ST")
4. Examine phylum-specific patterns in dark matter
5. Identify the most common families of unknown proteins

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (14, 8)

# COG category descriptions
COG_DESCRIPTIONS = {
    'J': 'Translation, ribosomal structure',
    'K': 'Transcription',
    'L': 'Replication, recombination, repair (mobile elements)',
    'D': 'Cell cycle control, division',
    'V': 'Defense mechanisms',
    'T': 'Signal transduction',
    'M': 'Cell wall/membrane biogenesis',
    'N': 'Cell motility',
    'U': 'Intracellular trafficking',
    'O': 'Posttranslational modification, chaperones',
    'C': 'Energy production and conversion',
    'G': 'Carbohydrate transport and metabolism',
    'E': 'Amino acid transport and metabolism',
    'F': 'Nucleotide transport and metabolism',
    'H': 'Coenzyme transport and metabolism',
    'I': 'Lipid transport and metabolism',
    'P': 'Inorganic ion transport',
    'Q': 'Secondary metabolites biosynthesis',
    'R': 'General function prediction only',
    'S': 'Function unknown',
}

print("Imports complete")

In [None]:
# Initialize Spark session
# NOTE: This requires running in an environment with Spark access to BERDL
spark = get_spark_session()
print(f"Spark version: {spark.version}")

## 1. Scale: How prevalent is "unknown function"?

First, let's quantify how many annotations have COG category S (unknown function) vs known categories.
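
As a minimal sketch of the calculation this section performs, the plain-Python snippet below computes the share of a category among single-letter COG annotations. The counts in `toy_counts` are invented for illustration and are not query results, and `category_share` is a hypothetical helper, not part of the notebook.

```python
def category_share(counts, category):
    """Return the percentage of single-letter annotations assigned to `category`."""
    # Composite categories such as "LS" are excluded from the denominator
    single = {cog: n for cog, n in counts.items() if len(cog) == 1}
    total = sum(single.values())
    if total == 0:
        return 0.0
    return 100.0 * single.get(category, 0) / total

# Invented counts for illustration only (not from the pangenome database)
toy_counts = {"S": 300, "R": 100, "K": 400, "L": 200, "LS": 50}

s_share = category_share(toy_counts, "S")
r_share = category_share(toy_counts, "R")
print(f"S: {s_share:.1f}%  R: {r_share:.1f}%  S+R: {s_share + r_share:.1f}%")
```

Using `dict.get` with a default means a category that is absent from the results yields 0% rather than raising an error.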

In [None]:
# Query 1: Overall COG category distribution
query_overall = """
SELECT 
    COG_category,
    COUNT(*) as gene_count
FROM kbase_ke_pangenome.eggnog_mapper_annotations
WHERE COG_category IS NOT NULL 
    AND COG_category != '-'
GROUP BY COG_category
ORDER BY gene_count DESC
"""

print("Executing COG category distribution query...")
print("Expected time: 3-10 minutes")

In [None]:
%%time
df_cog_dist = spark.sql(query_overall).toPandas()
df_cog_dist['gene_count'] = pd.to_numeric(df_cog_dist['gene_count'])

print(f"Retrieved {len(df_cog_dist)} distinct COG categories")
print(f"Total annotated genes: {df_cog_dist['gene_count'].sum():,}")

In [None]:
# Separate single-letter COGs from composite
df_cog_dist['is_single'] = df_cog_dist['COG_category'].str.len() == 1
df_single = df_cog_dist[df_cog_dist['is_single']].copy()
df_composite = df_cog_dist[~df_cog_dist['is_single']].copy()

print(f"\nSingle-letter categories: {len(df_single)}")
print(f"Composite categories: {len(df_composite)}")

# Calculate proportions for single-letter COGs
total_single = df_single['gene_count'].sum()
df_single['proportion'] = df_single['gene_count'] / total_single * 100
df_single['description'] = df_single['COG_category'].map(COG_DESCRIPTIONS)
df_single = df_single.sort_values('gene_count', ascending=False)

print("\n" + "="*70)
print("SINGLE-LETTER COG CATEGORY DISTRIBUTION")
print("="*70)
for _, row in df_single.iterrows():
    desc = row['description'][:45] if pd.notna(row['description']) else 'Unknown'
    print(f"  {row['COG_category']}: {desc:45s} {row['gene_count']:>12,} ({row['proportion']:.2f}%)")

In [None]:
# THE KEY FINDING: What fraction is "unknown"?
s_count = df_single[df_single['COG_category'] == 'S']['gene_count'].values[0]
r_count = df_single[df_single['COG_category'] == 'R']['gene_count'].values[0]  # General function prediction
known_count = total_single - s_count - r_count

print("\n" + "="*70)
print("THE DARK MATTER QUESTION")
print("="*70)
print(f"\n  S (Unknown function):          {s_count:>12,} ({s_count/total_single*100:.1f}%)")
print(f"  R (General prediction only):   {r_count:>12,} ({r_count/total_single*100:.1f}%)")
print(f"  Known function:                {known_count:>12,} ({known_count/total_single*100:.1f}%)")
print(f"\n  → {(s_count+r_count)/total_single*100:.1f}% of single-category annotations have unknown or vaguely predicted function")

## 2. Dark Matter by Gene Class

Are S-category genes more common in singletons (novel genes) than in core genes?
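
One way to put a number on "more common" is an odds ratio with an approximate confidence interval. The sketch below assumes a simple 2x2 comparison (S vs. non-S, singleton vs. core) and uses the Woolf log method for the interval. The function name `s_enrichment` and the counts are hypothetical, chosen only to illustrate the arithmetic.

```python
import math

def s_enrichment(s_a, total_a, s_b, total_b):
    """Compare the S fraction of group A against group B.

    Returns (difference in percentage points, odds ratio, 95% CI of the odds ratio).
    """
    non_a = total_a - s_a
    non_b = total_b - s_b
    if min(s_a, non_a, s_b, non_b) <= 0:
        raise ValueError("all four cells of the 2x2 table must be positive")
    diff_pp = 100.0 * (s_a / total_a - s_b / total_b)
    odds_ratio = (s_a * non_b) / (non_a * s_b)
    # Woolf method: standard error of the log odds ratio
    se = math.sqrt(1 / s_a + 1 / non_a + 1 / s_b + 1 / non_b)
    low = math.exp(math.log(odds_ratio) - 1.96 * se)
    high = math.exp(math.log(odds_ratio) + 1.96 * se)
    return diff_pp, odds_ratio, (low, high)

# Invented counts: 400 of 1,000 singleton genes vs 200 of 1,000 core genes in S
diff_pp, odds_ratio, ci = s_enrichment(400, 1000, 200, 1000)
print(f"difference: {diff_pp:+.1f} pp, odds ratio: {odds_ratio:.2f}, "
      f"95% CI: {ci[0]:.2f}-{ci[1]:.2f}")
```

An interval that excludes 1 would suggest the difference is not explained by sampling noise alone, although with counts in the millions almost any difference will appear significant, so the effect size matters more than the interval.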

In [None]:
# Query 2: S-category by gene class (core/auxiliary/singleton)
query_by_class = """
SELECT 
    CASE 
        WHEN gc.is_core = 1 THEN 'Core'
        WHEN gc.is_singleton = 1 THEN 'Singleton'
        WHEN gc.is_auxiliary = 1 THEN 'Auxiliary'
        ELSE 'Unknown'
    END as gene_class,
    CASE 
        WHEN ann.COG_category = 'S' THEN 'S_Unknown'
        WHEN ann.COG_category = 'R' THEN 'R_General'
        WHEN ann.COG_category LIKE '%S%' THEN 'Contains_S'
        ELSE 'Known'
    END as knowledge_status,
    COUNT(*) as gene_count
FROM kbase_ke_pangenome.gene_cluster gc
JOIN kbase_ke_pangenome.gene_genecluster_junction j 
    ON gc.gene_cluster_id = j.gene_cluster_id
JOIN kbase_ke_pangenome.eggnog_mapper_annotations ann 
    ON j.gene_id = ann.query_name
WHERE ann.COG_category IS NOT NULL 
    AND ann.COG_category != '-'
GROUP BY 
    CASE 
        WHEN gc.is_core = 1 THEN 'Core'
        WHEN gc.is_singleton = 1 THEN 'Singleton'
        WHEN gc.is_auxiliary = 1 THEN 'Auxiliary'
        ELSE 'Unknown'
    END,
    CASE 
        WHEN ann.COG_category = 'S' THEN 'S_Unknown'
        WHEN ann.COG_category = 'R' THEN 'R_General'
        WHEN ann.COG_category LIKE '%S%' THEN 'Contains_S'
        ELSE 'Known'
    END
"""

print("Executing gene class × knowledge status query...")
print("This is a big join - expected time: 10-20 minutes")

In [None]:
%%time
df_class_status = spark.sql(query_by_class).toPandas()
df_class_status['gene_count'] = pd.to_numeric(df_class_status['gene_count'])

print(f"Retrieved {len(df_class_status)} combinations")
df_class_status

In [None]:
# Pivot to see the pattern
pivot = df_class_status.pivot_table(
    index='gene_class',
    columns='knowledge_status',
    values='gene_count',
    aggfunc='sum',  # pivot_table defaults to the mean; counts should be summed
    fill_value=0
)

# Calculate row totals and percentages
pivot['Total'] = pivot.sum(axis=1)
for col in ['S_Unknown', 'R_General', 'Contains_S', 'Known']:
    if col in pivot.columns:
        pivot[f'{col}_pct'] = pivot[col] / pivot['Total'] * 100

print("\n" + "="*70)
print("DARK MATTER BY GENE CLASS")
print("="*70)
print("\nRaw counts:")
print(pivot[['S_Unknown', 'R_General', 'Contains_S', 'Known', 'Total']])

print("\nPercentages:")
pct_cols = [c for c in pivot.columns if c.endswith('_pct')]
print(pivot[pct_cols].round(2))

In [None]:
# Visualize the pattern
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Stacked bar chart of counts
plot_data = pivot[['S_Unknown', 'R_General', 'Known']].loc[['Core', 'Auxiliary', 'Singleton']]
plot_data.plot(kind='bar', stacked=True, ax=axes[0], color=['#d62728', '#ff7f0e', '#2ca02c'])
axes[0].set_title('Gene Counts by Class and Knowledge Status')
axes[0].set_ylabel('Number of Genes')
axes[0].set_xlabel('Gene Class')
axes[0].legend(title='Knowledge Status')
axes[0].tick_params(axis='x', rotation=0)

# Percentage of unknown by class
pct_unknown = pivot['S_Unknown_pct'].loc[['Core', 'Auxiliary', 'Singleton']]
colors = ['#2ca02c', '#ff7f0e', '#d62728']  # Green to red
bars = axes[1].bar(pct_unknown.index, pct_unknown.values, color=colors)
axes[1].set_title('Percentage of S (Unknown) by Gene Class')
axes[1].set_ylabel('% Unknown Function')
axes[1].set_xlabel('Gene Class')

# Add value labels
for bar, val in zip(bars, pct_unknown.values):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, 
                f'{val:.1f}%', ha='center', fontsize=12, fontweight='bold')

plt.tight_layout()
Path('../figures').mkdir(parents=True, exist_ok=True)  # ensure the output directory exists
plt.savefig('../figures/dark_matter_by_class.png', dpi=300, bbox_inches='tight')
plt.show()

# Key finding
core_pct = pivot.loc['Core', 'S_Unknown_pct'] if 'Core' in pivot.index else 0
singleton_pct = pivot.loc['Singleton', 'S_Unknown_pct'] if 'Singleton' in pivot.index else 0
print(f"\n★ KEY FINDING: Singletons have {singleton_pct:.1f}% unknown vs {core_pct:.1f}% in core genes")
print(f"  → Singleton enrichment: {singleton_pct - core_pct:+.1f} percentage points")

## 3. Composite COG Categories: What pairs with "S"?

Composite COG categories (e.g., "SV", "LS") indicate genes with multiple functional annotations. What categories commonly co-occur with S (unknown)?
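
The decomposition of composite categories into partner letters can be sketched in plain Python as follows. The helper `partner_letters` and the toy counts are assumptions made for illustration and do not come from the database.

```python
from collections import Counter

def partner_letters(composite_counts, focus="S"):
    """Tally the letters that appear alongside `focus` in composite categories."""
    partners = Counter()
    for category, n in composite_counts.items():
        # Only multi-letter categories that include the focus letter qualify
        if len(category) > 1 and focus in category:
            for letter in set(category) - {focus}:
                partners[letter] += n
    return partners

# Invented counts for illustration only
toy = {"SV": 30, "LS": 50, "KLS": 10, "KT": 99, "S": 500}
partners = partner_letters(toy)
print(partners.most_common())
```

Note that a gene in a three-letter composite such as "KLS" contributes to two partner tallies, so partner totals can exceed the number of genes. Iterating over `set(category)` also prevents a repeated letter from being counted twice for the same category.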

In [None]:
# Filter to composite categories containing S
df_s_composite = df_composite[df_composite['COG_category'].str.contains('S')].copy()
df_s_composite = df_s_composite.sort_values('gene_count', ascending=False)

print("\n" + "="*70)
print("COMPOSITE COG CATEGORIES CONTAINING 'S' (Unknown)")
print("="*70)
print(f"\nFound {len(df_s_composite)} composite categories containing S")
print(f"Total genes in S-composites: {df_s_composite['gene_count'].sum():,}")

print("\nTop 20 S-containing composites:")
for i, (_, row) in enumerate(df_s_composite.head(20).iterrows()):
    cog = row['COG_category']
    # Extract non-S components
    partners = ''.join([c for c in cog if c != 'S'])
    partner_descs = [COG_DESCRIPTIONS.get(c, '?')[:20] for c in partners]
    print(f"  {i+1:2d}. {cog:10s} {row['gene_count']:>10,}  partners: {partners} ({', '.join(partner_descs)})")

In [None]:
# Analyze partner categories
partner_counts = {}
for _, row in df_s_composite.iterrows():
    cog = row['COG_category']
    count = row['gene_count']
    for c in cog:
        if c != 'S':
            partner_counts[c] = partner_counts.get(c, 0) + count

partner_df = pd.DataFrame([
    {'COG': k, 'count': v, 'description': COG_DESCRIPTIONS.get(k, 'Unknown')}
    for k, v in partner_counts.items()
]).sort_values('count', ascending=False)

print("\n" + "="*70)
print("CATEGORIES THAT CO-OCCUR WITH 'S' (Unknown)")
print("="*70)
print("\nThis tells us what functional domains are found alongside unknown domains:")
for _, row in partner_df.head(15).iterrows():
    print(f"  {row['COG']}: {row['description'][:40]:40s} {row['count']:>10,}")

# Descriptive check (not a statistical test): how often do L and V appear as partners of S?
# Percentages are shares of all partner-letter occurrences, so a gene in a
# three-letter composite such as "KLS" contributes to two partner tallies.
l_with_s = partner_counts.get('L', 0)
v_with_s = partner_counts.get('V', 0)
total_s_partners = sum(partner_counts.values()) or 1  # avoid division by zero if no S-composites exist

print("\n★ HYPOTHESIS CHECK: Do unknown genes associate with mobile elements and defense?")
print(f"  L (Mobile elements) co-occurs with S: {l_with_s:,} ({l_with_s/total_s_partners*100:.1f}%)")
print(f"  V (Defense) co-occurs with S:         {v_with_s:,} ({v_with_s/total_s_partners*100:.1f}%)")

In [None]:
# Visualize partner distribution
fig, ax = plt.subplots(figsize=(12, 6))

top_partners = partner_df.head(15)
colors = ['#d62728' if c in ['L', 'V'] else '#1f77b4' for c in top_partners['COG']]

bars = ax.barh(range(len(top_partners)), top_partners['count'], color=colors)
ax.set_yticks(range(len(top_partners)))
ax.set_yticklabels([f"{row['COG']}: {row['description'][:35]}" for _, row in top_partners.iterrows()])
ax.set_xlabel('Gene Count')
ax.set_title('Categories That Co-occur with S (Unknown Function)\nRed = Mobile/Defense (potential novel defense systems)')
ax.invert_yaxis()

plt.tight_layout()
Path('../figures').mkdir(parents=True, exist_ok=True)  # ensure the output directory exists
plt.savefig('../figures/s_partner_categories.png', dpi=300, bbox_inches='tight')
plt.show()

## 4. Species-Level Dark Matter Variation

Do some species/phyla have more "dark matter" than others?
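
The per-phylum comparison depends on parsing GTDB lineage strings, which prefix each rank with a letter and two underscores (for example `p__` for phylum). The sketch below shows one way to parse such strings and average a per-species percentage within each phylum. The helper names and the toy records are hypothetical, and missing taxonomy is assumed to arrive as `None`.

```python
from collections import defaultdict
from statistics import mean

def parse_gtdb_taxonomy(taxonomy):
    """Split a lineage such as 'd__Bacteria;p__Bacillota' into a {rank: name} dict."""
    ranks = {}
    for part in (taxonomy or "").split(";"):
        prefix, sep, name = part.strip().partition("__")
        if sep and name:
            ranks[prefix] = name
    return ranks

def mean_pct_by_phylum(records):
    """Average a per-species percentage within each phylum.

    `records` is an iterable of (taxonomy string, percentage) pairs.
    """
    grouped = defaultdict(list)
    for taxonomy, pct in records:
        phylum = parse_gtdb_taxonomy(taxonomy).get("p", "Unknown")
        grouped[phylum].append(pct)
    return {phylum: mean(values) for phylum, values in grouped.items()}

# Invented percentages for illustration only
toy_records = [
    ("d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria", 20.0),
    ("d__Bacteria;p__Pseudomonadota;c__Alphaproteobacteria", 24.0),
    ("d__Bacteria;p__Bacillota;c__Bacilli", 18.0),
    (None, 30.0),
]
phylum_means = mean_pct_by_phylum(toy_records)
print(phylum_means)
```

Averaging per species before grouping gives each species equal weight, so a phylum with a few heavily sequenced species does not dominate the comparison.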

In [None]:
# Sample a set of species for per-species analysis
# We'll use species with 100-500 genomes for good sampling

query_species_sample = """
SELECT 
    p.gtdb_species_clade_id,
    s.GTDB_species,
    s.GTDB_taxonomy,
    p.no_genomes
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s 
    ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
WHERE p.no_genomes >= 100 AND p.no_genomes <= 500
ORDER BY p.no_genomes DESC
"""

df_species = spark.sql(query_species_sample).toPandas()
print(f"Found {len(df_species)} species with 100-500 genomes")

# Parse phylum
def extract_phylum(tax):
    if pd.isna(tax):
        return 'Unknown'
    for part in tax.split(';'):
        if part.startswith('p__'):
            return part.replace('p__', '')
    return 'Unknown'

df_species['phylum'] = df_species['GTDB_taxonomy'].apply(extract_phylum)
print(f"Phyla represented: {df_species['phylum'].nunique()}")
print(df_species['phylum'].value_counts().head(10))

In [None]:
# Sample species from major phyla
np.random.seed(42)

sampled = []
target_per_phylum = 5
major_phyla = df_species['phylum'].value_counts().head(8).index.tolist()

for phylum in major_phyla:
    phylum_species = df_species[df_species['phylum'] == phylum]
    n_sample = min(target_per_phylum, len(phylum_species))
    sampled.append(phylum_species.sample(n_sample))

df_sampled = pd.concat(sampled, ignore_index=True)
print(f"Sampled {len(df_sampled)} species across {len(major_phyla)} phyla")
print(df_sampled['phylum'].value_counts())

In [None]:
# Query S-category proportion for each sampled species
species_list = df_sampled['gtdb_species_clade_id'].tolist()
species_in_clause = "', '".join(species_list)

query_species_s = f"""
SELECT 
    gc.gtdb_species_clade_id,
    SUM(CASE WHEN ann.COG_category = 'S' THEN 1 ELSE 0 END) as s_count,
    SUM(CASE WHEN ann.COG_category LIKE '%S%' THEN 1 ELSE 0 END) as contains_s_count,
    COUNT(*) as total_annotated
FROM kbase_ke_pangenome.gene_cluster gc
JOIN kbase_ke_pangenome.gene_genecluster_junction j 
    ON gc.gene_cluster_id = j.gene_cluster_id
JOIN kbase_ke_pangenome.eggnog_mapper_annotations ann 
    ON j.gene_id = ann.query_name
WHERE gc.gtdb_species_clade_id IN ('{species_in_clause}')
    AND ann.COG_category IS NOT NULL 
    AND ann.COG_category != '-'
GROUP BY gc.gtdb_species_clade_id
"""

print(f"Querying S-category proportions for {len(species_list)} species...")

In [None]:
%%time
df_species_s = spark.sql(query_species_s).toPandas()

for col in ['s_count', 'contains_s_count', 'total_annotated']:
    df_species_s[col] = pd.to_numeric(df_species_s[col])

df_species_s['pct_s'] = df_species_s['s_count'] / df_species_s['total_annotated'] * 100
df_species_s['pct_contains_s'] = df_species_s['contains_s_count'] / df_species_s['total_annotated'] * 100

# Merge with species metadata
df_species_s = df_species_s.merge(
    df_sampled[['gtdb_species_clade_id', 'GTDB_species', 'phylum', 'no_genomes']],
    on='gtdb_species_clade_id',
    how='left'
)

print(f"Retrieved dark matter stats for {len(df_species_s)} species")
df_species_s.head(10)

In [None]:
# Dark matter by phylum
phylum_stats = df_species_s.groupby('phylum').agg({
    'pct_s': ['mean', 'std', 'count'],
    'pct_contains_s': ['mean', 'std']
}).round(2)

phylum_stats.columns = ['mean_pct_s', 'std_pct_s', 'n_species', 'mean_pct_contains_s', 'std_contains_s']
phylum_stats = phylum_stats.sort_values('mean_pct_s', ascending=False)

print("\n" + "="*70)
print("DARK MATTER PERCENTAGE BY PHYLUM")
print("="*70)
print(phylum_stats)

In [None]:
# Visualize by phylum
fig, ax = plt.subplots(figsize=(12, 6))

phylum_order = phylum_stats.sort_values('mean_pct_s', ascending=True).index.tolist()
df_plot = df_species_s.copy()
df_plot['phylum'] = pd.Categorical(df_plot['phylum'], categories=phylum_order, ordered=True)

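sns.boxplot(data=df_plot, x='pct_s', y='phylum', ax=ax, palette='viridis')
# Overlay individual species so the "each point is a species" title holds
sns.stripplot(data=df_plot, x='pct_s', y='phylum', ax=ax, color='black', size=4, alpha=0.6)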
ax.set_xlabel('% of Genes with S (Unknown Function)')
ax.set_ylabel('Phylum')
ax.set_title('Dark Matter Gene Proportion by Phylum\n(each point is a species)')
ax.axvline(x=df_species_s['pct_s'].median(), color='red', linestyle='--', label='Overall median')
ax.legend()

plt.tight_layout()
Path('../figures').mkdir(parents=True, exist_ok=True)  # ensure the output directory exists
plt.savefig('../figures/dark_matter_by_phylum.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\n★ Overall dark matter: {df_species_s['pct_s'].mean():.1f}% ± {df_species_s['pct_s'].std():.1f}%")
print(f"  Range: {df_species_s['pct_s'].min():.1f}% - {df_species_s['pct_s'].max():.1f}%")

## 5. Summary and Conclusions

In [None]:
print("="*70)
print("DARK MATTER GENES: KEY FINDINGS")
print("="*70)

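# Fill the summary from values computed in the cells above instead of placeholders
top_partner_list = ', '.join(partner_df['COG'].head(5))
print(f"""
1. SCALE OF UNKNOWN (share of single-letter COG annotations)
   - S (Unknown function): {s_count/total_single*100:.1f}%
   - R (General prediction): {r_count/total_single*100:.1f}%
   - Combined "dark matter": {(s_count+r_count)/total_single*100:.1f}% have unknown/vague function

2. ENRICHMENT IN NOVEL GENES
   - Singleton genes have {singleton_pct:.1f}% unknown vs {core_pct:.1f}% in core genes
   - A higher singleton share is consistent with recently acquired genes being less characterized
   - The "innovation frontier" of bacterial evolution is largely unexplored

3. DARK MATTER ASSOCIATIONS
   - S most commonly pairs with: {top_partner_list}
   - Mobile elements (L) account for {l_with_s/total_s_partners*100:.1f}% of S partner occurrences
   - Defense (V) accounts for {v_with_s/total_s_partners*100:.1f}% of S partner occurrences
   - High L/V shares would suggest unknown proteins may be part of novel defense systems

4. PHYLOGENETIC PATTERNS (sampled species only)
   - Mean dark matter varies from {phylum_stats['mean_pct_s'].min():.1f}% to {phylum_stats['mean_pct_s'].max():.1f}% across phyla
   - {phylum_stats['mean_pct_s'].idxmax()} has the highest mean share of unknown genes
   - {phylum_stats['mean_pct_s'].idxmin()} has the lowest mean share of unknown genes

5. BIOLOGICAL IMPLICATIONS
   - The bacterial "dark genome" contains potentially novel:
     * Defense systems (CRISPR-like, toxin-antitoxin)
     * Mobile element cargo
     * Species-specific innovations
   - Where these genes are enriched in singletons, they may be recent evolutionary experiments
   - High-priority targets for experimental characterization
""")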

print("\n" + "="*70)
print("NEXT STEPS")
print("="*70)
print("""
1. Cluster S-category proteins by sequence similarity
   - Identify "dark matter families" that span species
   - These are likely functional but uncharacterized protein families

2. Cross-reference with defense system databases
   - Compare to DefenseFinder, CRISPRdb, TADB
   - Many "unknown" proteins may be novel defense components

3. Genomic context analysis
   - Are S-genes near mobile elements?
   - Do they cluster with known defense genes?

4. Expression/conservation analysis  
   - If conserved across species, probably functional
   - If expressed, not pseudogenes
""")

In [None]:
# Save results
Path('../data').mkdir(parents=True, exist_ok=True)  # ensure the output directory exists
df_cog_dist.to_csv('../data/cog_category_distribution.csv', index=False)
df_species_s.to_csv('../data/species_dark_matter_stats.csv', index=False)
partner_df.to_csv('../data/s_partner_categories.csv', index=False)

print("Data saved to ../data/")