## Data Source and Reference Set Construction  
Anchor sequences were obtained from the Spider Silkome Database (https://spider-silkome.org/about).  

### Environment Setup  
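Install the following dependencies with Pixi. Most of these tools are distributed through the bioconda channel, so it is added when the project is created. trimAl and gffread are included because later cells call them:
```bash
pixi init spider_silkome -c conda-forge -c bioconda
cd spider_silkome
pixi add python pandas biopython mafft hmmer samtools flye minimap2 seqkit cd-hit trimal gffread
```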

In [None]:
import os
import subprocess
import glob
import pandas as pd
from spider_silkome_module.config import PROCESSED_DATA_DIR, RAW_DATA_DIR, SCRIPTS_DIR, EXTERNAL_DATA_DIR, INTERIM_DATA_DIR

from spider_silkome_module import run_shell_command_with_check
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass

# Global Configuration
class Config:
    """
    Configuration class to hold paths and thresholds.
    Using a centralized config makes modifications easier.
    """
    PROJECT_NAME = "spider_silkome_20251222"
    # Paths
    SEEDS_QC_DIR = INTERIM_DATA_DIR / PROJECT_NAME / "spidroin_seeds_qc"
    HMMBUILD_INPUT_DIR = INTERIM_DATA_DIR / PROJECT_NAME / "hmmbuild_input"
    HMMBUILD_OUTPUT_DIR = INTERIM_DATA_DIR / PROJECT_NAME / "hmmbuild_output"
    HMMSEARCH_OUTPUT_DIR = INTERIM_DATA_DIR / PROJECT_NAME / "hmmsearch_output"
    MINIPROT_OUTPUT_DIR = INTERIM_DATA_DIR / PROJECT_NAME / "miniprot_output"
    spider_genome_path = RAW_DATA_DIR / "spider_genome"
    spider_genomes = [
        "Araneus_ventricosus",
        "Pardosa_pseudoannulata",
        "Evarcha_sp",
        "Pholcus_sp",
        "Heteropoda_venatoria",
        "Scorpiops_zhui",
        "Hippasa_lycosina",
        "Songthela_sp",
        "Pandercetes_sp",
        "Trichonephila_clavata",
    ]

    # Threads
    THREADS = 8

    # Thresholds
    E_VALUE_THRES = 1e-10
    MIN_GENE_LEN = 500       # Minimum distance between NTD and CTD
    MAX_GENE_LEN = 100000    # Maximum distance (100kb)

    def __init__(self):
        # Create directories if they don't exist
        os.makedirs(Config.SEEDS_QC_DIR, exist_ok=True)
        os.makedirs(Config.HMMBUILD_INPUT_DIR, exist_ok=True)
        os.makedirs(Config.HMMBUILD_OUTPUT_DIR, exist_ok=True)
        os.makedirs(Config.HMMSEARCH_OUTPUT_DIR, exist_ok=True)
        os.makedirs(Config.MINIPROT_OUTPUT_DIR, exist_ok=True)

# Initialize configuration
config = Config()

spidroin_seeds = EXTERNAL_DATA_DIR / "spidroin_seeds_collection.faa"

## Acquisition and cleaning of seed sequences

The spidroin family is large and encompasses multiple types, including major ampullate (MaSp), minor ampullate (MiSp), flagelliform (Flag), aciniform (AcSp), pyriform (PySp), tubuliform (TuSp/CySp), and aggregate (AgSp) spidroins. In addition, Schöneberg et al. (2025) reported a unique spidroin type found in the Araneoidea.

Procedure:

1. Download and classification:  
   Download the NTD and CTD amino acid sequences of each subfamily from the database.

2. Quality control:  
   Remove sequences that contain simple repeats or are abnormally short (<50 aa).

3. Redundancy removal:  
   Use CD-HIT with a 95% sequence-identity threshold to eliminate highly redundant sequences and prevent model overfitting.
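
The quality-control step can be illustrated with a minimal sketch. This is not the logic of `extract_terminal_domains.py`, which is not shown here. It assumes that "simple repeats" can be approximated by low compositional complexity (Shannon entropy of the residue composition). The function names, the 3.0-bit entropy cutoff, and the toy sequences are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per residue) of the residue composition."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def passes_qc(seq: str, min_len: int = 50, min_entropy: float = 3.0) -> bool:
    """Keep sequences that are long enough and not compositionally simple.

    min_entropy is a hypothetical cutoff, not a value taken from the pipeline.
    """
    return len(seq) >= min_len and shannon_entropy(seq) >= min_entropy


def filter_seeds(fasta_text: str, min_len: int = 50) -> Dict[str, str]:
    """Return {header: sequence} for the records that pass both checks."""
    return {h: s for h, s in read_fasta(fasta_text) if passes_qc(s, min_len)}


# Toy input invented for illustration: one plausible domain, one short
# fragment, and one dipeptide repeat.
example = (
    ">ok_domain\n"
    "MSWTTRLALALLAVLCTQSLFALGQANTPWSSKANADAFINSFISAASNTGSFSQDQMEDMSLIGNTLM\n"
    ">too_short\n"
    "MKTAYIAKQR\n"
    ">simple_repeat\n" + "GA" * 32 + "\n"
)
kept = filter_seeds(example)
print(sorted(kept))
```

Only `ok_domain` survives: `too_short` fails the length check and `simple_repeat` has an entropy of 1 bit. A composition-based filter will not catch every kind of repeat, so it is a rough stand-in rather than a replacement for a dedicated low-complexity tool.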

In [None]:
# Quality Control
qc_cmd = f"pixi run python3 {SCRIPTS_DIR}/extract_terminal_domains.py {spidroin_seeds} -o {config.SEEDS_QC_DIR} -l 50 --similarity 0.5"
run_shell_command_with_check(qc_cmd, config.SEEDS_QC_DIR/"processing_report.tsv", force=True)

In [None]:
# Remove highly redundant sequences using cd-hit
for input_path in Config.SEEDS_QC_DIR.glob("*TD.faa"):
    output_path = input_path.with_stem(f"{input_path.stem}_de_dup")
    de_dup_cmd = f"pixi run cd-hit -i {input_path} -o {output_path} -c 0.95 -T 0 -M 0"
    run_shell_command_with_check(de_dup_cmd, output_path)
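
To check how much redundancy was actually removed, the `.clstr` file that CD-HIT writes next to each output FASTA can be summarised. The sketch below assumes the standard protein `.clstr` layout (a `>Cluster N` header followed by one member per line, with the representative marked by `*`). The function name and the example records are hypothetical.

```python
import re
from typing import Dict, List

# Member lines look like: "0<TAB>152aa, >seq_id... *"  or  "... at 96.05%"
MEMBER_RE = re.compile(r">(.+?)\.\.\.\s+(\*|at\s+[\d.]+%)")


def parse_clstr(text: str) -> Dict[str, List[str]]:
    """Map each representative sequence ID to all member IDs of its cluster."""
    clusters: Dict[str, List[str]] = {}
    current: List[str] = []   # members of the cluster being read
    rep = None                # representative of the cluster being read
    # A sentinel header at the end flushes the last cluster.
    for line in text.splitlines() + [">Cluster END"]:
        if line.startswith(">Cluster"):
            if rep is not None:
                clusters[rep] = current
            current, rep = [], None
            continue
        m = MEMBER_RE.search(line)
        if m is None:
            continue
        current.append(m.group(1))
        if m.group(2) == "*":
            rep = m.group(1)
    return clusters


# Example records invented for illustration.
example_clstr = (
    ">Cluster 0\n"
    "0\t152aa, >MaSp_NTD_a... *\n"
    "1\t150aa, >MaSp_NTD_b... at 96.05%\n"
    ">Cluster 1\n"
    "0\t148aa, >MaSp_NTD_c... *\n"
)
clusters = parse_clstr(example_clstr)
for rep, members in clusters.items():
    print(rep, len(members))
```

Note that CD-HIT may truncate long sequence names in the `.clstr` file, so the IDs recovered here may need to be matched against the FASTA headers by prefix.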

## Multiple sequence alignment and pHMM construction  

To improve search sensitivity across distantly related species (for example, from Trichonephila clavata, formerly Nephila clavata, to other spiders of the Araneomorphae), profile HMMs (pHMMs) are built from the curated seed sequences.  

- Technical details:  
  - Tool selection: HMMER 3.4 suite (hmmbuild).  
  - Alignment strategy: Perform multiple sequence alignment (MSA) with the L-INS-i algorithm in MAFFT (--localpair --maxiterate 1000). L-INS-i is one of MAFFT's most accurate strategies and is well suited to capturing local structural features (such as conserved cysteine residues in the NTD or specific α-helical structures).  
  - Alignment trimming: Remove gap-rich columns with trimAl before model building.  
- Model training:  
  - Construction of subfamily-specific models (e.g., MaSp_NTD.hmm): used for precise classification.  
  - Indexing: Use hmmpress to compress the models into binary form and index them (producing the .h3m, .h3i, .h3f, and .h3p files used by hmmscan).

In [None]:
# multiple sequence alignment
for input_path in Config.SEEDS_QC_DIR.glob("*_de_dup.faa"):
    name = input_path.stem.split("_de")[0]
    mafft_output_path = Config.HMMBUILD_INPUT_DIR / f"{name}_aln.faa"
    mafft_cmd = f"mafft --maxiterate 1000 --localpair --thread -1 {input_path} > {mafft_output_path}"
    run_shell_command_with_check(mafft_cmd, mafft_output_path)


In [None]:
# Alignment trimming
for input_path in Config.HMMBUILD_INPUT_DIR.glob("*_aln.faa"):
    aln_trimed_output_path = input_path.with_name(input_path.stem + "_trimed.faa")
    aln_trimed_cmd = f"pixi run trimal -in {input_path} -out {aln_trimed_output_path} -gt 0.8 -cons 60"
    run_shell_command_with_check(aln_trimed_cmd, aln_trimed_output_path)
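
The effect of the gap threshold can be sketched in plain Python. This is an illustration only, assuming that `-gt 0.8` keeps the columns in which at least 80% of the sequences carry a residue. It does not reproduce the `-cons 60` safeguard, which asks trimAl to retain at least 60% of the original columns. The function name and the toy alignment are hypothetical.

```python
from typing import List


def trim_gappy_columns(rows: List[str], gap_threshold: float = 0.8) -> List[str]:
    """Keep columns in which at least `gap_threshold` of the rows are non-gap."""
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all aligned sequences must have the same length")
    n = len(rows)
    keep = [
        col for col in range(width)
        if sum(r[col] != "-" for r in rows) / n >= gap_threshold
    ]
    return ["".join(r[c] for c in keep) for r in rows]


# Toy alignment invented for illustration: column 3 is an insertion present
# in a single sequence, so only 20% of the rows carry a residue there.
aln = [
    "MK-TAYIAK",
    "MKQTA-IAK",
    "MK-TAYI-K",
    "MK-SAYIAK",
    "MK-TAYIAK",
]
trimmed = trim_gappy_columns(aln, gap_threshold=0.8)
for row in trimmed:
    print(row)
```

The insertion column is dropped, while the two columns with a single gap (80% occupancy) are kept. Removing such sparsely populated columns keeps hmmbuild from modelling them as match states supported by very few sequences.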

In [None]:
# Build a profile HMM from each trimmed alignment, then index it with hmmpress
for input_path in Config.HMMBUILD_INPUT_DIR.glob("*_aln_trimed.faa"):
    name = input_path.stem.split("_aln")[0]
    hmmbuild_output_path = Config.HMMBUILD_OUTPUT_DIR / f"{name}.hmm"
    hmmbuild_cmd = f"hmmbuild -n {name} --amino --cpu 70 {hmmbuild_output_path} {input_path}"
    run_shell_command_with_check(hmmbuild_cmd, hmmbuild_output_path)
    hmmpress_cmd = f"hmmpress {hmmbuild_output_path}"
    run_shell_command_with_check(hmmpress_cmd, f"{hmmbuild_output_path}.h3m")


## Genomic Scanning Strategy

Conventional BLAST searches often miss highly divergent sequences, producing many false negatives. Profile HMM searches are considerably more sensitive for such sequences while maintaining high specificity. Note that nhmmer (a component of HMMER) searches nucleotide queries against nucleotide targets (DNA-to-DNA); it does not map a protein pHMM onto nucleotide sequences. Protein pHMMs are therefore searched against translated sequences with hmmsearch.

Implementation scheme:
- Whole-genome scanning: Translate the annotated coding sequences of each target genome into proteins with gffread, then search them with each protein pHMM using hmmsearch. A direct nucleotide search with nhmmer is an alternative when DNA models are available.
- Parameter settings: Apply an E-value threshold of \(1 \times 10^{-10}\) to ensure high confidence, and require query (HMM) coverage greater than 70% to ensure that complete domains, rather than fragments, are detected. HMMER has no --qcov option, so coverage is computed from the hmm from, hmm to, and qlen columns of the --domtblout table.
- Result filtering: Remove low-complexity, nonspecific matches. Spider genomes, and spidroin genes in particular, contain large numbers of simple repeat sequences, which necessitates soft-filtering of HMM hits using the output of RepeatMasker.
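
The E-value and coverage criteria can be applied to the per-domain table written by `hmmsearch --domtblout`. The sketch below is one possible way to do this, not necessarily the filtering used later in the pipeline. It assumes that the independent E-value (i-Evalue) is the value to threshold and that coverage is `(hmm_to - hmm_from + 1) / qlen`. The function names and the example rows are hypothetical.

```python
from typing import Dict, List


def parse_domtblout(text: str) -> List[Dict[str, object]]:
    """Parse the whitespace-delimited per-domain table from hmmsearch."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        # 22 fixed fields, then a free-text description that may contain spaces
        f = line.split(maxsplit=22)
        qlen = int(f[5])                       # length of the query HMM
        hmm_from, hmm_to = int(f[15]), int(f[16])
        hits.append({
            "target": f[0],
            "model": f[3],
            "i_evalue": float(f[12]),
            "hmm_coverage": (hmm_to - hmm_from + 1) / qlen,
            "ali_from": int(f[17]),
            "ali_to": int(f[18]),
        })
    return hits


def filter_hits(hits, max_evalue: float = 1e-10, min_coverage: float = 0.7):
    """Keep domains that pass both the E-value and the HMM-coverage cutoff."""
    return [
        h for h in hits
        if h["i_evalue"] <= max_evalue and h["hmm_coverage"] >= min_coverage
    ]


# Example rows invented for illustration (same column order as --domtblout).
example_domtbl = """\
# target name accession tlen query name accession qlen ...
gene1.t1 - 3200 MaSp_NTD - 130 1.2e-45 150.3 0.1 1 1 3.0e-48 1.2e-45 150.1 0.1 3 128 25 150 23 152 0.97 hypothetical protein
gene2.t1 - 410 MaSp_NTD - 130 4.0e-12 45.0 0.0 1 1 1.0e-14 4.0e-12 44.8 0.0 60 120 10 70 8 72 0.90 -
gene3.t1 - 900 MaSp_NTD - 130 2.0e-04 18.2 0.3 1 1 5.0e-07 2.0e-04 18.0 0.3 5 125 40 160 38 162 0.80 -
"""
hits = parse_domtblout(example_domtbl)
kept_hits = filter_hits(hits)
print([h["target"] for h in kept_hits])
```

In this toy table only `gene1.t1` passes: `gene2.t1` covers less than half of the model and `gene3.t1` fails the E-value cutoff.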

In [None]:
# Emit a consensus sequence from each profile HMM
for input_path in Config.HMMBUILD_OUTPUT_DIR.glob("*.hmm"):
    output_path = input_path.with_name(input_path.stem + "_consensus.fa")
    cmd = f"pixi run hmmemit -c {input_path} > {output_path}"
    run_shell_command_with_check(cmd, output_path)

In [None]:
# Translate the annotated CDS of each genome to protein, then search every pHMM against it
for spider in Config.spider_genomes:
    hmm_dir = Config.HMMBUILD_OUTPUT_DIR
    spider_genome = Config.spider_genome_path / spider / f"{spider}.fa"
    spider_protein = spider_genome.with_name(spider + ".protein.fasta")
    spider_gff = spider_genome.with_name(spider + ".gff")
    gffread_cmd = f"pixi run gffread -g {spider_genome} -y {spider_protein} {spider_gff}"
    run_shell_command_with_check(gffread_cmd, spider_protein)
    spider_hmmsearch_output_dir = Config.HMMSEARCH_OUTPUT_DIR / spider
    # The shell redirect and hmmsearch both fail if the output directory is missing
    spider_hmmsearch_output_dir.mkdir(parents=True, exist_ok=True)
    for hmm in hmm_dir.glob("*.hmm"):
        prefix = spider_hmmsearch_output_dir / hmm.stem
        hmmsearch_cmd = f"pixi run hmmsearch --cpu 75 --tblout {prefix}.tbl --domtblout {prefix}.domtbl {hmm} {spider_protein} > {prefix}.out"
        run_shell_command_with_check(hmmsearch_cmd, spider_hmmsearch_output_dir / f"{hmm.stem}.out", force=True)