nicwulab/SARS-CoV-2_Abs

Sequence analysis of SARS-CoV-2 antibodies

This README describes the analyses in:
A large-scale systematic survey reveals recurring molecular features of public antibody responses to SARS-CoV-2

Contents

Input files

Dependencies

Dependencies Installation

Install the dependencies with conda:

conda create -n Abs -c bioconda -c anaconda -c conda-forge \
  python=3.9 \
  biopython \
  pandas \
  openpyxl \
  distance \
  logomaker \
  igblast \
  anarci \
  mafft \
  fasttree

Local igblast setup

Before running any analysis, activate the environment:

conda activate Abs
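
Once the environment is active, it can be useful to confirm that the command-line tools used later in this README are on the PATH. The following is a minimal sketch, not part of the repository; the helper name `find_missing_tools` is hypothetical, and the executable names are taken from the commands shown in this README.

```python
import shutil

# Executable names as they appear in the commands of this README.
REQUIRED_TOOLS = ["igblastn", "makeblastdb", "ANARCI", "mafft", "FastTree"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All command-line tools found.")
```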

PyIR: An IgBLAST wrapper and parser

pip3 install crowelab_pyir

Set up the database in the PyIR library directory

pyir setup

Manually install IMGT REF database

  1. Download the sequences from http://www.imgt.org/vquest/refseqh.html#VQUEST

  2. Copy and paste the sequences and save them as FASTA (all V genes in one file, all D genes in one file, all J genes in one file)

  3. Clean the data (the raw edit_imgt_file.pl can be found in igblast-1.17.1xxx/bin)

edit_imgt_file.pl imgt_database/human_prot/imgt_raw/IGV.fasta > imgt_database/human_prot/IGV.fasta

  4. Create the database (use "-dbtype prot" for protein sequences and "-dbtype nucl" for DNA sequences). For example:

makeblastdb -parse_seqids -dbtype prot -in imgt_database/human_prot/IGV.fasta

makeblastdb -parse_seqids -dbtype nucl -in imgt_database/human_nuc/IGV.fasta

  5. Run PyIR for IgBLAST
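
The database creation step only shows the V gene file. As a rough sketch, the same makeblastdb call can be generated for the V, D, and J files together. The function names and the `dry_run` switch are hypothetical; the flags are copied from the commands above, and the IGD.fasta and IGJ.fasta file names are assumed from the igblastn call later in this README.

```python
import subprocess
from pathlib import Path


def makeblastdb_command(fasta_path, dbtype):
    """Build the makeblastdb command shown above for one FASTA file."""
    if dbtype not in ("prot", "nucl"):
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return ["makeblastdb", "-parse_seqids", "-dbtype", dbtype, "-in", str(fasta_path)]


def build_databases(db_dir, dbtype, genes=("IGV", "IGD", "IGJ"), dry_run=True):
    """Return one command per gene segment file; run them unless dry_run is True."""
    commands = [makeblastdb_command(Path(db_dir) / f"{gene}.fasta", dbtype) for gene in genes]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: only print the commands that would be executed.
    for command in build_databases("imgt_database/human_nuc", "nucl"):
        print(" ".join(command))
```

Keeping `dry_run=True` as the default makes it easy to inspect the commands before anything is written to disk.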

Run local igblast and CDR parser

  1. Run local IgBLAST with the Kabat numbering system
    igblastn -query result/test.fasta \
      -germline_db_V imgt_database/human_nuc/IGV.fasta \
      -germline_db_J imgt_database/human_nuc/IGJ.fasta \
      -germline_db_D imgt_database/human_nuc/IGD.fasta \
      -organism human -domain_system kabat \
      -auxiliary_data imgt_database/optional_file/human_gl.aux \
      -out result/igblast_output

  2. Parse the IgBLAST output
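
The parser used for this step is not named above. One option, offered only as a sketch, is to have igblastn write AIRR tab-delimited output (its `-outfmt 19` option, which the command above does not set) and read the gene calls with Python's csv module. The column names follow the AIRR rearrangement format; `read_airr_calls` is a hypothetical helper, and the one-row table is made-up toy data standing in for real output.

```python
import csv
import io


def read_airr_calls(handle, columns=("sequence_id", "v_call", "j_call", "cdr3_aa")):
    """Yield the selected columns from an AIRR-format tab-delimited table."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        # Missing columns are returned as empty strings rather than raising.
        yield {column: row.get(column, "") for column in columns}


# Toy table (not real data) standing in for an IgBLAST AIRR output file.
example = io.StringIO(
    "sequence_id\tv_call\tj_call\tcdr3_aa\n"
    "toy_Ab_1\tIGHV1-58*01\tIGHJ3*02\tAAAAAA\n"
)
records = list(read_airr_calls(example))
print(records[0]["v_call"])
```

For a real run, `example` would be replaced by an open file handle on the IgBLAST output.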

Baseline VDJ setup

  1. Download antibody repertoire data for healthy donors from cAb-Rep

  2. Cal_repertoire_freq.py is used to establish the baseline germline usage frequency
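
Cal_repertoire_freq.py is not reproduced here. As an illustration of what a germline usage frequency is, the sketch below counts how often each gene is called and divides by the total number of sequences. The function name and the toy gene calls are hypothetical and do not come from cAb-Rep.

```python
from collections import Counter


def germline_usage_frequency(gene_calls):
    """Return {gene: fraction of sequences assigned to that gene}."""
    counts = Counter(gene_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: count / total for gene, count in counts.items()}


# Toy gene calls (not real repertoire data).
calls = ["IGHV3-30", "IGHV3-30", "IGHV1-69", "IGHV3-53"]
freq = germline_usage_frequency(calls)
for gene, fraction in sorted(freq.items()):
    print(gene, fraction)
```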

Analysis of VDJ gene usage and V gene pairing

  1. Extract VDJ gene usage
    python3 code/VDJgene_freq_analysis.py
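
V gene pairing can be summarized in the same spirit by counting heavy/light V gene combinations. This is a sketch under assumptions, not the logic of VDJgene_freq_analysis.py: the record keys `heavy_v` and `light_v` are hypothetical, and the records are toy values.

```python
from collections import Counter


def pairing_frequency(antibodies):
    """Return {(heavy V gene, light V gene): fraction of antibodies with that pair}."""
    pairs = Counter((ab["heavy_v"], ab["light_v"]) for ab in antibodies)
    total = sum(pairs.values())
    if total == 0:
        return {}
    return {pair: count / total for pair, count in pairs.items()}


# Toy records (not taken from the antibody dataset).
toy_antibodies = [
    {"heavy_v": "IGHV1-58", "light_v": "IGKV3-20"},
    {"heavy_v": "IGHV1-58", "light_v": "IGKV3-20"},
    {"heavy_v": "IGHV3-53", "light_v": "IGKV1-9"},
    {"heavy_v": "IGHV3-30", "light_v": "IGLV2-14"},
]
pair_freq = pairing_frequency(toy_antibodies)
print(pair_freq[("IGHV1-58", "IGKV3-20")])
```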

CDR H3 clustering analysis

  1. Extract information from the antibody dataset for downstream analyses
    python3 code/parse_Ab_table.py

  2. Clustering CDR H3 sequences
    python3 code/CDRH3_clustering_optimal.py

  3. Analyzing CDR H3 clustering results
    python3 code/analyze_CDRH3_cluster.py
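
The clustering criterion used by CDRH3_clustering_optimal.py is not stated in this README. To illustrate the general idea, the sketch below performs single-linkage clustering of CDR H3 sequences, comparing only sequences of equal length and linking them when their identity reaches a threshold. Both the 0.8 default and the single-linkage rule are assumptions made for the example, and the sequences are toy strings.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_cdrh3(sequences, min_identity=0.8):
    """Single-linkage clustering of unique sequences; only equal lengths are compared."""
    seqs = sorted(set(sequences))
    parent = {s: s for s in seqs}

    def find(s):
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(seqs, 2):
        if len(a) == len(b) and identity(a, b) >= min_identity:
            parent[find(a)] = find(b)

    clusters = {}
    for s in seqs:
        clusters.setdefault(find(s), []).append(s)
    # Largest clusters first.
    return sorted(clusters.values(), key=len, reverse=True)


toy_cdrh3 = ["ARDLYYYGMDV", "ARDLYYYGMDA", "ARDAYYYGMDV", "ARGGSW"]
clusters = cluster_cdrh3(toy_cdrh3)
for members in clusters:
    print(len(members), members)
```

With single linkage, two sequences can end up in the same cluster through an intermediate sequence even if their direct identity falls below the threshold; a stricter rule would need a different linkage.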

Identification of recurring somatic hypermutation (SHM)

  1. Numbering SARS-CoV-2 antibody sequences according to Kabat numbering
    ANARCI --scheme kabat --csv -i Fasta/SARS-CoV-2-Ab.pep -o result/SARS-CoV-2-Ab

  2. Numbering germline sequences according to Kabat numbering
    ANARCI --scheme kabat --csv -i imgt_database/human_prot/IGV.fasta -o result/Human_IGV_gene_kabat_num

  3. Calling SHMs
    python3 code/SHM_analysis.py
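
Once the antibody and its germline V gene are both Kabat-numbered, calling SHMs amounts to comparing residues position by position. The sketch below shows that comparison on toy data. It is not the code in SHM_analysis.py: the function name, the `{position: amino acid}` input layout, and the example residues are all assumptions.

```python
def call_shm(antibody, germline):
    """Compare two {Kabat position: amino acid} maps and list substitutions.

    Positions absent from the antibody, or holding a gap ("-") in either
    sequence, are skipped. Each mutation is written as germline residue,
    position, antibody residue (for example "T28I").
    """
    mutations = []
    for position, germline_aa in germline.items():
        antibody_aa = antibody.get(position)
        if antibody_aa is None or "-" in (antibody_aa, germline_aa):
            continue
        if antibody_aa != germline_aa:
            mutations.append(f"{germline_aa}{position}{antibody_aa}")
    return mutations


# Toy numbering (not real antibody or germline residues).
toy_germline = {"27": "G", "28": "T", "29": "F", "30": "S", "31": "-"}
toy_antibody = {"27": "G", "28": "I", "29": "F", "30": "N", "31": "A"}
shm = call_shm(toy_antibody, toy_germline)
print(shm)
```

Positions are kept as strings so that Kabat insertion codes such as "52A" can be used as keys.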

Analysis of recurring SHM in IGHV1-58/IGKV3-20 antibodies

  1. Extracting light chain sequences from IGHV1-58/IGKV3-20 antibodies
    python3 code/extract_IGHV1-58_VL.py

  2. Multiple sequence alignment
    mafft Fasta/Cluster3_H158K320_LC.pep > Fasta/Cluster3_H158K320_LC.aln

  3. Identifying amino acid variants at VL residues 29 and 92
    python3 code/extract_IGHV1-58_aa.py

  4. Constructing phylogenetic tree
    FastTree Fasta/Cluster3_H158K320_LC.aln > result/Cluster3_H158K320_LC.tree
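
For the variant identification step, the underlying operation is a tally of the amino acids seen at VL residues 29 and 92 across the light chains. A minimal sketch follows, assuming each light chain is already available as a `{Kabat position: amino acid}` map. The function name and the toy chains are hypothetical and do not reflect the contents of extract_IGHV1-58_aa.py.

```python
from collections import Counter


def tally_variants(numbered_chains, positions=("29", "92")):
    """Count the amino acids observed at each requested position."""
    tallies = {position: Counter() for position in positions}
    for chain in numbered_chains:
        for position in positions:
            residue = chain.get(position)
            # Skip chains where the position is missing or a gap.
            if residue and residue != "-":
                tallies[position][residue] += 1
    return tallies


# Toy light chains (not real sequences).
toy_chains = [
    {"29": "V", "92": "D"},
    {"29": "I", "92": "D"},
    {"29": "V", "92": "-"},
]
tallies = tally_variants(toy_chains)
print(dict(tallies["29"]), dict(tallies["92"]))
```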

Clonotype assignment

  1. Antibodies that share the same IGHV, IGK(L)V, IGHJ, and IGK(L)J genes and belong to the same CDR H3 cluster are assigned to the same clonotype
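
The rule above can be expressed as grouping antibodies by a five-part key. The sketch below is one way to write it, with hypothetical field names (`ighv`, `igklv`, `ighj`, `igklj`, `cdrh3_cluster`, `name`) and toy records. It is not taken from the repository's code.

```python
from collections import defaultdict

# Hypothetical field names for the five properties that define a clonotype.
CLONOTYPE_FIELDS = ("ighv", "igklv", "ighj", "igklj", "cdrh3_cluster")


def assign_clonotypes(antibodies):
    """Group antibody names by (IGHV, IGK(L)V, IGHJ, IGK(L)J, CDR H3 cluster)."""
    groups = defaultdict(list)
    for ab in antibodies:
        key = tuple(ab[field] for field in CLONOTYPE_FIELDS)
        groups[key].append(ab["name"])
    return dict(groups)


# Toy records (not from the antibody dataset).
toy_records = [
    {"name": "Ab1", "ighv": "IGHV1-58", "igklv": "IGKV3-20", "ighj": "IGHJ3",
     "igklj": "IGKJ1", "cdrh3_cluster": 3},
    {"name": "Ab2", "ighv": "IGHV1-58", "igklv": "IGKV3-20", "ighj": "IGHJ3",
     "igklj": "IGKJ1", "cdrh3_cluster": 3},
    {"name": "Ab3", "ighv": "IGHV1-58", "igklv": "IGKV3-20", "ighj": "IGHJ3",
     "igklj": "IGKJ1", "cdrh3_cluster": 7},
]
clonotypes = assign_clonotypes(toy_records)
for key, names in clonotypes.items():
    print(key, names)
```

Ab3 lands in its own clonotype because it differs only in its CDR H3 cluster, which shows that all five properties must match.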

Deep learning model for antigen identification

The deep learning model is under CoV_Encoder

Plotting

  1. Plot summary statistics
    Rscript code/Plot_summary_Piechart.R

  2. Plot VDJ gene usage
    Rscript code/Plot_VDJgene_Freq.R

  3. Plot IGHV/IGK(L)V pairing frequency
    Rscript code/Plot_point_heatmap.R

  4. Plot CDR H3 cluster size
    Rscript code/plot_CDRH3_cluster_summary.R

  5. Generate sequence logo for different CDR H3 clusters
    python3 code/CDRH3_seqlogo.py

  6. Plot IGHV gene usage for CDR H3 cluster 7
    Rscript code/plot_cluster7_Vgenes.R

  7. Plot analysis results for IGHD1-26-encoded S2 antibodies
    Rscript code/plot_IGHD1-26_analysis.R

  8. Plot SHM
    Rscript code/plot_SHM.R

  9. Plot the number of neutralizing vs non-neutralizing antibodies in each IGHV/IGK(L)V pair
    Rscript code/Plot_basic_stat.R

  10. Plot phylogenetic tree for the light chains of IGHV1-58/IGKV3-20 antibodies
    Rscript code/plot_tree_KV320.R