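# Data Sources and Reference Set Construction
Anchor sequences come from the Spider Silkome Database (https://spider-silkome.org/about).
## Environment Setup
Make sure the following dependencies are installed (via Pixi):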
```bash
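# bioconda provides the bioinformatics tools (mafft, hmmer, cd-hit, ...)
pixi init -c conda-forge -c bioconda spider_silkome
cd spider_silkome
pixi add python pandas biopython mafft hmmer cd-hit samtools flye minimap2 seqkit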
```
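
Before running the notebook it can help to confirm that the command-line tools are actually on `PATH`. The sketch below is only an illustration: the tool list mirrors the packages used in this notebook (including `cd-hit`, which the de-duplication step calls), and the executable names are assumptions about how each package exposes its binaries. The helper names `locate_tools` and `missing_tools` are hypothetical.

```python
import shutil
from typing import Dict, Iterable, List, Optional

# Executable names assumed to be provided by the installed packages.
REQUIRED_TOOLS = ("mafft", "hmmbuild", "hmmpress", "hmmsearch", "cd-hit", "seqkit")


def locate_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None when it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_tools(tools: Iterable[str] = REQUIRED_TOOLS) -> List[str]:
    """Return the names of the tools that could not be found."""
    return [name for name, path in locate_tools(tools).items() if path is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("All tools found" if not missing else f"Missing tools: {', '.join(missing)}")
```

Run it inside the Pixi environment (for example through `pixi run python`), otherwise tools installed only in that environment will be reported as missing.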

In [31]:
import os
import subprocess
import glob
import pandas as pd
from spider_silkome_module.config import PROCESSED_DATA_DIR, RAW_DATA_DIR, SCRIPTS_DIR, EXTERNAL_DATA_DIR, INTERIM_DATA_DIR

from spider_silkome_module import run_shell_command_with_check
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass

# Global Configuration
class Config:
    """
    Configuration class to hold paths and thresholds.
    Using a centralized config makes modifications easier.
    """
    PROJECT_NAME = "spider_silkome_20251130"
    # Paths
    SEEDS_QC_DIR = INTERIM_DATA_DIR / PROJECT_NAME / "spidroin_seeds_qc"
    HMMBUILD_INPUT_DIR = INTERIM_DATA_DIR / PROJECT_NAME / "hmmbuild_input"
    HMMBUILD_OUTPUT_DIR = INTERIM_DATA_DIR / PROJECT_NAME / "hmmbuild_output"
    HMMSEARCH_OUTPUT_DIR = INTERIM_DATA_DIR / PROJECT_NAME / "hmmsearch_output"

    # Threads
    THREADS = 8

    # Thresholds
    E_VALUE_THRES = 1e-10
    MIN_GENE_LEN = 500       # Minimum distance between NTD and CTD
    MAX_GENE_LEN = 100000    # Maximum distance (100kb)

    def __init__(self):
        # Create directories if they don't exist
        os.makedirs(Config.SEEDS_QC_DIR, exist_ok=True)
        os.makedirs(Config.HMMBUILD_INPUT_DIR, exist_ok=True)
        os.makedirs(Config.HMMBUILD_OUTPUT_DIR, exist_ok=True)
        os.makedirs(Config.HMMSEARCH_OUTPUT_DIR, exist_ok=True)

# Initialize configuration
config = Config()

spidroin_seeds = EXTERNAL_DATA_DIR / "spidroin_seeds_collection.faa"
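
The `MIN_GENE_LEN` and `MAX_GENE_LEN` thresholds in `Config` are described as bounds on the distance between an NTD and a CTD. One plausible way to apply them is sketched below; it is an assumption about how the thresholds would be used, not code from this project. The `DomainHit` record and `pair_terminal_domains` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple

MIN_GENE_LEN = 500       # same values as in Config
MAX_GENE_LEN = 100_000


@dataclass(frozen=True)
class DomainHit:
    contig: str
    start: int    # 1-based, start <= end regardless of strand
    end: int
    strand: str   # "+" or "-"
    domain: str   # "NTD" or "CTD"


def pair_terminal_domains(
    hits: Iterable[DomainHit],
    min_len: int = MIN_GENE_LEN,
    max_len: int = MAX_GENE_LEN,
) -> List[Tuple[DomainHit, DomainHit]]:
    """Pair NTD and CTD hits on the same contig and strand within the distance bounds."""
    hits = list(hits)
    ntds = [h for h in hits if h.domain == "NTD"]
    ctds = [h for h in hits if h.domain == "CTD"]
    pairs = []
    for ntd in ntds:
        for ctd in ctds:
            if ntd.contig != ctd.contig or ntd.strand != ctd.strand:
                continue
            # On the plus strand the NTD has the smaller coordinate;
            # on the minus strand the order is reversed.
            if ntd.strand == "+":
                distance = ctd.start - ntd.end
            else:
                distance = ntd.start - ctd.end
            if min_len <= distance <= max_len:
                pairs.append((ntd, ctd))
    return pairs
```

The nested loop is quadratic in the number of hits per genome, which is usually acceptable for terminal-domain hits; sorting by contig and position would be the natural optimisation if it is not.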

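## 1. Seed Sequence Acquisition and Cleaning
The spidroin family is large and includes major ampullate (MaSp), minor ampullate (MiSp), flagelliform (Flag), aciniform (AcSp), pyriform (PySp), tubuliform (TuSp/CySp), and aggregate (AgSp) spidroins, among others. In addition, Schöneberg et al. (2025) reported distinct spidroin types found in the Mesothelae.
Procedure:
1. Download by class: download the NTD and CTD amino acid sequences of each subfamily separately from the database.
2. Quality control: remove sequences that consist of simple repeats or are abnormally short (<50 aa).
3. Redundancy removal: cluster the sequences with CD-HIT at 95% sequence identity and keep one representative per cluster, so that the models do not overfit to over-represented, near-identical sequences.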
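
The quality-control step can be sketched in plain Python as follows. This is not the logic of `extract_terminal_domains.py`, which is not shown here; the low-complexity heuristic (the two most common residues exceeding half of the sequence) and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_low_complexity(seq: str, max_top2_fraction: float = 0.5) -> bool:
    """True when the two most common residues exceed the given fraction (assumed heuristic)."""
    if not seq:
        return True
    counts = Counter(seq.upper())
    top2 = sum(n for _, n in counts.most_common(2))
    return top2 / len(seq) > max_top2_fraction


def quality_filter(records: Iterable[Tuple[str, str]], min_len: int = 50) -> Dict[str, str]:
    """Keep sequences that are at least min_len residues long and not low-complexity."""
    kept = {}
    for header, seq in records:
        if len(seq) < min_len or is_low_complexity(seq):
            continue
        kept[header] = seq
    return kept
```

Spidroin terminal domains can be rich in a few residues, so the 0.5 cutoff would need checking against real seed sequences before use.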

In [24]:
# Quality control
qc_cmd = f"pixi run python3 {SCRIPTS_DIR}/extract_terminal_domains.py {spidroin_seeds} -o {config.SEEDS_QC_DIR} -l 50 --similarity 0.5"
run_shell_command_with_check(qc_cmd, config.SEEDS_QC_DIR/"processing_report.tsv", force=True)

2025-11-30 17:15:06.116 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: pixi run python3 /home/gyk/project/spider_silkome/scripts/extract_terminal_domains.py /home/gyk/project/spider_silkome/data/external/spidroin_seeds_collection.faa -o /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc -l 50 --similarity 0.5

Loaded 11155 sequences from /home/gyk/project/spider_silkome/data/external/spidroin_seeds_collection.faa

2025-11-30 17:16:54.692 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/processing_report.tsv


  MiSp_CTD: 754 sequences -> /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/MiSp_CTD.faa
  AcSp_CTD: 821 sequences -> /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/AcSp_CTD.faa
  TuSp_CySp_CTD: 267 sequences -> /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/TuSp_CySp_CTD.faa
  CrSp_CTD: 41 sequences -> /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/CrSp_CTD.faa
  Other_CTD: 330 sequences -> /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/Other_CTD.faa
  Flag_CTD: 371 sequences -> /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/Flag_CTD.faa
  MaSp1_CTD: 600 sequences -> /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/MaSp1_CTD.faa
  MaSp2_CTD: 531 sequences -> /home/gyk/project/spider_silkome/data/interim/s

True

In [None]:
# Remove highly redundant sequences with CD-HIT
for input_path in Config.SEEDS_QC_DIR.glob("*TD.faa"):
    output_path = input_path.with_stem(f"{input_path.stem}_de_dup")
    de_dup_cmd = f"pixi run cd-hit -i {input_path} -o {output_path} -c 0.95 -T 0 -M 0"
    run_shell_command_with_check(de_dup_cmd, output_path)

2025-11-30 17:19:39.486 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: pixi run cd-hit -i /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/AcSp_CTD_de_dup.faa -o /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/AcSp_CTD_de_dup_de_dup.faa -c 0.95 -T 0 -M 0
Program: CD-HIT, V4.8.1 (+OpenMP), Apr 24 2025, 22:00:32
Command:
         /home/gyk/project/spider_silkome/.pixi/envs/default/bin/cd-hit
         -i
         /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/AcSp_CTD_de_dup.faa
         -o
         /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/AcSp_CTD_de_dup_de_dup.faa
         -c 0.95 -T 0 -M 0

Started: Sun Nov 30 17:19:39 2025
                            Output                              
------------------------------------
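
CD-HIT also writes a `.clstr` file next to each output FASTA, which records which sequences were merged into each representative. A minimal parser sketch follows, assuming the usual layout (a `>Cluster N` header followed by member lines, with the representative marked by a trailing `*`); the function name `parse_clstr` is hypothetical.

```python
from typing import Dict, List


def parse_clstr(text: str) -> Dict[str, List[str]]:
    """Map each cluster representative to the IDs of all members of its cluster."""
    clusters: Dict[str, List[str]] = {}
    members: List[str] = []
    representative = None
    # A sentinel header at the end flushes the final cluster.
    for line in text.splitlines() + [">Cluster END"]:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            if members:
                clusters[representative or members[0]] = members
            members, representative = [], None
            continue
        if ">" not in line:
            continue
        # Member line, e.g. "0\t297aa, >seqA... *"
        seq_id = line.split(">", 1)[1].split("...", 1)[0]
        members.append(seq_id)
        if line.endswith("*"):
            representative = seq_id
    return clusters
```

Comparing `len(clusters)` with the number of input sequences gives a quick check of how much redundancy the 95% threshold removed for each domain set.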

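## 2. Multiple Sequence Alignment and pHMM Construction
To improve search sensitivity, especially for distantly related species (for example, spanning from orb-weaving spiders to the Mesothelae), profile HMMs (pHMMs) are built from the seed alignments.

- Technical details:
  - Tool: the HMMER 3.4 suite (hmmbuild).
  - Alignment strategy: multiple sequence alignment (MSA) with MAFFT's L-INS-i algorithm (--localpair --maxiterate 1000). MAFFT describes it as probably its most accurate, and very slow, strategy. It uses local pairwise alignment information, which suits capturing local structural features such as conserved cysteine residues in the NTD or specific α-helical structures.

- Model training:
  - Subfamily-specific models (e.g. MaSp_NTD.hmm): for precise classification.
  - Pan-spidroin models (e.g. PanSpidroin_NTD.hmm): for discovering unclassified or novel spidroins (such as the discordant spidroins of Ectatosticta davidi).

- Indexing: use hmmpress to compress the models into binary form and index them. HMMER3 calibrates the E-value parameters during hmmbuild, so no separate calibration step is needed.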
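
L-INS-i scales poorly with the number of sequences, and the pan-spidroin alignments are by far the slowest step in this notebook. One option, shown here only as a sketch and not as the approach used in this project, is to keep L-INS-i for small sets and fall back to `mafft --auto` for large ones. The cutoff of 200 sequences follows MAFFT's general guidance for L-INS-i and is an assumption; the function names are hypothetical.

```python
from typing import List

LINSI_MAX_SEQS = 200  # assumed cutoff for using L-INS-i


def count_fasta_records(fasta_text: str) -> int:
    """Count the sequences in FASTA-formatted text by counting header lines."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


def build_mafft_args(input_path: str, n_seqs: int, threads: int = -1) -> List[str]:
    """Return a MAFFT argument list: L-INS-i for small inputs, --auto otherwise."""
    if n_seqs <= LINSI_MAX_SEQS:
        strategy = ["--localpair", "--maxiterate", "1000"]
    else:
        strategy = ["--auto"]
    return ["mafft", *strategy, "--thread", str(threads), input_path]
```

MAFFT writes the alignment to standard output, so the caller still has to redirect it to the target file, for example by passing an open file handle as `stdout` to `subprocess.run`.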

In [33]:
# Alignment
for input_path in Config.SEEDS_QC_DIR.glob("*_de_dup.faa"):
    name = input_path.stem.split("_de")[0]
    mafft_output_path = Config.HMMBUILD_INPUT_DIR / f"{name}_aln.faa"
    mafft_cmd = f"mafft --maxiterate 1000 --localpair --thread -1 {input_path} > {mafft_output_path}"
    run_shell_command_with_check(mafft_cmd, mafft_output_path)


2025-11-30 20:00:17.313 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/AcSp_CTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/AcSp_CTD_aln.faa


OS = linux
The number of physical cores =  40
outputhat23=16
treein = 0
compacttree = 0
stacksize: 8192 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.526
alg=L, model=BLOSUM62, 2.00, -0.10, +0.10, noshift, amax=0.0
40 thread(s)

outputhat23=16
Loading 'hat3.seed' ... 
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ... 
  550 / 561
done.

Progressive alignment ... 
STEP   512 /560 (thread   11) 
Reallocating (by thread 8) ..done. *alloclen = 4425
STEP   547 /560 (thread    9) 
Reallocating (by thread 3) ..done. *alloclen = 5623
STEP   560 /560 (thread    2) 
done.
tbfast (aa) Version 7.526
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
16 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 8
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

  550 / 561
Seg

2025-11-30 20:04:43.835 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/AcSp_CTD_aln.faa
2025-11-30 20:04:43.836 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/CrSp_CTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/CrSp_CTD_aln.faa
2025-11-30 20:04:46.195 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/CrSp_CTD_aln.faa
2025-11-30 20:04:46.195 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/AgSp1_NTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/AgSp1_NTD_aln.faa
2025-11-30 20:04:47.911 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/AgSp1_NTD_aln.faa
2025-11-30 20:04:47.912 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/PanSpidroin_NTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/PanSpidroin_NTD_aln.faa
2025-11-30 22:38:52.127 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/PanSpidroin_NTD_aln.faa
2025-11-30 22:38:52.128 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/MaSp_NTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MaSp_NTD_aln.faa
2025-11-30 22:39:28.960 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MaSp_NTD_aln.faa
2025-11-30 22:39:28.961 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/Other_CTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/Other_CTD_aln.faa
2025-11-30 22:39:58.483 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/Other_CTD_aln.faa
2025-11-30 22:39:58.484 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/pFlag_NTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/pFlag_NTD_aln.faa
2025-11-30 22:39:58.655 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider

Strategy:
 L-INS-i (Probably most accurate, very slow)
 Iterative refinement method (<16) with LOCAL pairwise alignment information

2025-11-30 22:39:59.430 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MaSp2_NTD_aln.faa
2025-11-30 22:39:59.431 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/TuSp_CySp_CTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/TuSp_CySp_CTD_aln.faa
2025-11-30 22:40:04.361 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/TuSp_CySp_CTD_aln.faa
2025-11-30 22:40:04.361 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/MaSp2_CTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MaSp2_CTD_aln.faa
2025-11-30 22:40:18.301 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MaSp2_CTD_aln.faa
2025-11-30 22:40:18.302 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/MaSp1_NTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MaSp1_NTD_aln.faa
2025-11-30 22:40:27.988 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MaSp1_NTD_aln.faa
2025-11-30 22:40:27.988 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/TuSp_CySp_NTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/TuSp_CySp_NTD_aln.faa
2025-11-30 22:40:29.239 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/TuSp_CySp_NTD_aln.faa
2025-11-30 22:40:29.239 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/MaSp_CTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MaSp_CTD_aln.faa
2025-11-30 22:40:31.667 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MaSp_CTD_aln.faa
2025-11-30 22:40:31.668 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/AgSp2_CTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/AgSp2_CTD_aln.faa
2025-11-30 22:40:37.032 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/AgSp2_CTD_aln.faa
2025-11-30 22:40:37.032 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/AcSp_NTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/AcSp_NTD_aln.faa
2025-11-30 22:40:46.812 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/AcSp_NTD_aln.faa
2025-11-30 22:40:46.812 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/MiSp_NTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MiSp_NTD_aln.faa
2025-11-30 22:41:35.497 | SUCCESS  | spider_silkome_module.features:run_shell_command_with_check:53 - Command executed successfully, output file: /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/MiSp_NTD_aln.faa
2025-11-30 22:41:35.498 | INFO     | spider_silkome_module.features:run_shell_command_with_check:50 - Execute command: mafft --maxiterate 1000 --localpair --thread -1 /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/spidroin_seeds_qc/PanSpidroin_CTD_de_dup.faa > /home/gyk/project/spider_silkome/data/interim/spider_silkome_20251130/hmmbuild_input/PanSpidroin_CTD_aln.faa

KeyboardInterrupt: 

In [None]:
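# Build the HMM models
for input_path in Config.HMMBUILD_INPUT_DIR.glob("*_aln.faa"):
    # Keep the full prefix (e.g. "AcSp_CTD"); splitting on the first "_" would
    # give the NTD and CTD alignments the same name and overwrite their models.
    name = input_path.stem.split("_aln")[0]
    hmmbuild_output_path = Config.HMMBUILD_OUTPUT_DIR / f"{name}.hmm"
    hmmbuild_cmd = f"hmmbuild -n {name} --amino --cpu {Config.THREADS} {hmmbuild_output_path} {input_path}"
    run_shell_command_with_check(hmmbuild_cmd, hmmbuild_output_path)
    # hmmpress writes the .h3m/.h3i/.h3f/.h3p index files next to the .hmm file
    hmmpress_cmd = f"hmmpress {hmmbuild_output_path}"
    run_shell_command_with_check(hmmpress_cmd, f"{hmmbuild_output_path}.h3m")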
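
File names such as AcSp_CTD_aln.faa and AcSp_NTD_aln.faa share their first underscore-separated token, so a model name derived from that token alone would make the two domains write to the same .hmm file. A minimal sketch of a naming helper that keeps the full prefix is shown below; the function name `model_name` is hypothetical.

```python
from pathlib import Path
from typing import Union


def model_name(alignment_path: Union[str, Path], suffix: str = "_aln") -> str:
    """Return the alignment file stem with a trailing suffix (default "_aln") removed."""
    stem = Path(alignment_path).stem
    return stem[: -len(suffix)] if stem.endswith(suffix) else stem


if __name__ == "__main__":
    files = ["AcSp_CTD_aln.faa", "AcSp_NTD_aln.faa", "TuSp_CySp_CTD_aln.faa"]
    first_token = {Path(f).stem.split("_")[0] for f in files}
    full_prefix = {model_name(f) for f in files}
    # Three alignments collapse to two names with the first-token rule.
    print(len(first_token), len(full_prefix))
```

Checking that the number of distinct model names equals the number of alignment files is a cheap guard against silently overwritten models.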


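## 3. Genome Scanning Strategy
Conventional BLAST searches tend to produce many false negatives on highly divergent sequences. Profile HMM searches with HMMER raise sensitivity considerably while keeping specificity high. Note that nhmmer only compares nucleotide queries with nucleotide targets, so the protein pHMMs built above cannot be passed to it directly: they are searched with hmmsearch against a six-frame translation of the genome, whereas nhmmer requires DNA models built from nucleotide alignments (DNA-to-DNA search).

**Execution plan:**

- Whole-genome scan: depending on the model type, either run hmmsearch with the protein pHMMs against the six-frame translated target genome, or run nhmmer with DNA models directly against the nucleotide sequence.
- Parameters: set the E-value threshold to $10^{-10}$ for high confidence, and require query (model) coverage above 70% so that hits are complete domains rather than fragments. HMMER has no --qcov option; coverage is computed from the hmm from/hmm to columns of the tabular output relative to the model length.
- Result filtering: remove non-specific low-complexity matches. Spidroin genomic regions often contain large numbers of simple repeats, so the HMM hits should be soft-filtered against the RepeatMasker output.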
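
The E-value and coverage thresholds can be applied when parsing the search output. The sketch below assumes a table produced by `hmmsearch --domtblout` and applies the E-value cutoff to the per-domain independent E-value; that choice, the column positions relied on (qlen, i-Evalue, hmm from, hmm to), and the function name `filter_domtblout` are assumptions to be checked against the actual output.

```python
from typing import Dict, Iterable, List


def filter_domtblout(
    lines: Iterable[str],
    max_evalue: float = 1e-10,
    min_qcov: float = 0.7,
) -> List[Dict[str, object]]:
    """Keep domain hits passing the E-value cutoff and the model-coverage cutoff."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split(maxsplit=22)
        if len(fields) < 22:
            continue  # not a complete domtblout row
        qlen = int(fields[5])            # length of the query model
        i_evalue = float(fields[12])     # independent E-value of this domain
        hmm_from, hmm_to = int(fields[15]), int(fields[16])
        qcov = (hmm_to - hmm_from + 1) / qlen
        if i_evalue <= max_evalue and qcov >= min_qcov:
            kept.append({
                "target": fields[0],
                "query": fields[3],
                "i_evalue": i_evalue,
                "qcov": qcov,
                "ali_from": int(fields[17]),
                "ali_to": int(fields[18]),
            })
    return kept
```

The alignment coordinates returned here refer to the translated sequence, so they still have to be converted back to genomic coordinates using the frame of each translated target.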