Set of scripts to annotate Pfam domains and extract NLR plant immune receptors and their architectures as published in Sarris et al BMC Biology 2016:

Our basic pipeline

  1. Obtain protein sequences of species of interest and organise them into a directory.

We follow the Phytozome organisation of master_dir/species/annotation/species_version_proteins.fa, where each species is denoted by the first letter of the genus name followed by the full species name, for example Athaliana.
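The layout above can be checked before running the pipeline. The following is a minimal sketch, not part of the published scripts: the helper names `species_code` and `find_protein_fastas` are hypothetical, and it assumes that every protein file ends in `.fa` and sits exactly one level below an `annotation` directory.

```python
from pathlib import Path


def species_code(genus: str, species: str) -> str:
    """Build a Phytozome-style code: first letter of the genus plus the species name."""
    return genus[0].upper() + species.lower()


def find_protein_fastas(master_dir):
    """Map each species directory name to its protein FASTA files.

    Assumes the layout master_dir/<species>/annotation/<file>.fa described above.
    """
    found = {}
    for fasta in sorted(Path(master_dir).glob("*/annotation/*.fa")):
        species = fasta.parent.parent.name  # the directory above "annotation"
        found.setdefault(species, []).append(fasta)
    return found
```

For example, `species_code("Arabidopsis", "thaliana")` gives `Athaliana`, and `find_protein_fastas("master_dir")` lists the files that would be picked up for each species.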

  2. Pfam-based annotation of domains

usage: bash dir


  3. Parsing the pfamscan output with
  • The script parses the output of
  • The script extracts all domains for each protein and removes redundant nested hits with larger e-values.
  • Domains are printed out in the order of appearance in the query.
  • By default, Pfam_B domains are skipped.
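The filtering described in the bullets above can be illustrated with a short sketch. This is one interpretation, not the Perl script's actual logic: it treats a hit as redundant when it lies entirely inside another hit that has a smaller e-value. The `Hit` class, the function names, and the `Pfam-B` name prefix used to recognise Pfam_B domains are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    name: str      # Pfam domain name
    start: int     # start position in the query protein
    end: int       # end position in the query protein
    evalue: float


def is_nested(inner: Hit, outer: Hit) -> bool:
    """True if `inner` lies entirely within the coordinates of `outer`."""
    return outer.start <= inner.start and inner.end <= outer.end


def filter_hits(hits, skip_pfam_b=True):
    """Drop nested hits with larger e-values; return the rest in query order."""
    if skip_pfam_b:
        # Assumption: Pfam_B entries are recognisable by a "Pfam-B" name prefix.
        hits = [h for h in hits if not h.name.startswith("Pfam-B")]
    kept = []
    for hit in hits:
        redundant = any(
            other is not hit and is_nested(hit, other) and other.evalue < hit.evalue
            for other in hits
        )
        if not redundant:
            kept.append(hit)
    # Order of appearance in the query protein
    return sorted(kept, key=lambda h: (h.start, h.end))
```

With this rule, a weak hit at positions 160-300 would be dropped if a stronger NB-ARC hit spans 150-430, while a separate LRR hit further along the protein is kept.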

usage: perl <options>

-p|--pfam <pfamscan.out>

-e|--evalue <evalue cutoff>


-v|--verbose <T/F> default F. Display more information about each domain (start, stop, evalue)

We usually parse all pfam outputs of interest in parallel using xargs

  4. Identification of non-canonical NLR-ID domain combinations with
  • This script is configured to find any parsed pfam files in specified directory or its sub-directories.
  • The script will parse the output of
  • Note that in the current configuration, the script specifically scans input directories for filenames matching "pfamscanparsed.verbose". If your naming scheme is different, you might want to modify line 62.
  • Configuration of 'db_description' is highly important as the first check in the script is to match species_id in db_description to the one in the name of the file. If successful, the script will print species_id and family name to standard out.
  • NLR proteins are identified based on the presence of the NB-ARC domain.
  • Fusions are identified based on the presence of non-NBS, non-LRR domains that pass the specified e-value cutoff (default 1e-3).
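The two identification rules above can be sketched as follows. This is a simplified illustration under stated assumptions, not the script's implementation: the `classify` function is hypothetical, domains are given as `(name, evalue)` pairs, any domain whose name starts with `LRR` is treated as an LRR domain, and a domain passes the cutoff when its e-value is less than or equal to it. The actual script may exclude further canonical NLR domains.

```python
NB_ARC = "NB-ARC"


def is_canonical(name: str) -> bool:
    """Assumption: only NB-ARC and LRR-family names count as canonical here."""
    return name == NB_ARC or name.startswith("LRR")


def classify(domains, evalue_cutoff=1e-3):
    """Return None for non-NLR proteins, else the list of candidate fused domains.

    `domains` is a list of (name, evalue) pairs in query order.
    """
    names = [name for name, _ in domains]
    if NB_ARC not in names:
        return None  # no NB-ARC domain, so not treated as an NLR
    return [
        name
        for name, evalue in domains
        if not is_canonical(name) and evalue <= evalue_cutoff
    ]
```

An empty list means an NLR without candidate fusions, so callers can tell "not an NLR" (`None`) apart from "NLR with no integrated domains".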

usage: perl <options>

-i|--indir directory for batch retrieval of input *pfamscan*.parsed.verbose files

-e|--evalue evalue cutoff for determining domain fusions [default 1e-3]

-o|--output output directory

-d|--db_description description of datasets used in the analyses [Organism Species_ID NCBI_taxon_ID Family Database Date_aquired Restrictions Version Common_Name Source Reference]; for an example of this dataset, see Additional file 1 in Sarris et al BMC Biology 2016


  • Summary of the number of NLRs and NLR-IDs identified in each species (such as Additional file 2 in Sarris et al BMC Biology 2016)

  • Summary of integrated domains with species list for each domain (such as Additional file 3 in Sarris et al BMC Biology 2016)

  • Abundance list of integrated domains (counted once for each family) that can be used to generate a Wordcloud (such as Figure 2 in Sarris et al BMC Biology 2016)

  • Contingency tables (per ID domain) for each species and for all species combined, together with the results of the left-tailed Fisher's exact test
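For readers unfamiliar with the test named in the last item, here is a minimal sketch of a left-tailed Fisher's exact test on a generic 2x2 contingency table. It uses the standard hypergeometric formulation and only the Python standard library. The function name and the table layout `[[a, b], [c, d]]` are assumptions for illustration, and the way the pipeline itself builds its tables is not shown here.

```python
from math import comb


def fisher_exact_left(table):
    """Left-tailed Fisher's exact test p-value for a 2x2 table [[a, b], [c, d]].

    Returns P(X <= a), where X follows the hypergeometric distribution
    fixed by the row and column totals of the table.
    """
    (a, b), (c, d) = table
    row1 = a + b          # total of the first row
    col1 = a + c          # total of the first column
    col2 = b + d          # total of the second column
    total = a + b + c + d
    denominator = comb(total, row1)
    # Sum the probabilities of all tables whose top-left cell is <= a.
    # comb() returns 0 for impossible tables, so no extra bounds are needed.
    numerator = sum(comb(col1, k) * comb(col2, row1 - k) for k in range(a + 1))
    return numerator / denominator
```

A small left-tailed p-value indicates that the top-left count is lower than expected given the margins.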

Example datasets:

The example dataset directory contains input Arabidopsis data as well as the corresponding db_description file. It also contains the outputs from each stage of the analysis, so you can check your pipeline against them or test individual scripts.


