A set of scripts to annotate Pfam domains and to extract plant NLR immune receptors and their domain architectures, as published in Sarris et al., BMC Biology 2016: https://bmcbiol.biomedcentral.com/articles/10.1186/s12915-016-0228-7
Our basic pipeline
- Obtain protein sequences of species of interest and organise them into a directory.
We follow the Phytozome organisation of
master_dir/species/annotation/species_version_proteins.fa where each species is denoted by the first letter of the genus name and all letters in the species names, for example
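As a sketch of this naming rule, the helper below derives a species identifier from a binomial name and builds the expected file path. The function names and the version string "v1" are hypothetical and are not part of the pipeline scripts; Arabidopsis thaliana is used only because the example dataset contains Arabidopsis data.

```python
from pathlib import Path


def species_id(binomial: str) -> str:
    """Return the first letter of the genus followed by the full species name."""
    genus, species = binomial.split()[:2]
    return genus[0].upper() + species.lower()


def protein_fasta_path(master_dir: str, binomial: str, version: str) -> Path:
    """Build master_dir/species/annotation/species_version_proteins.fa."""
    sid = species_id(binomial)
    return Path(master_dir) / sid / "annotation" / f"{sid}_{version}_proteins.fa"


# "v1" is a placeholder version string, not a real annotation release.
print(species_id("Arabidopsis thaliana"))
print(protein_fasta_path("master_dir", "Arabidopsis thaliana", "v1").as_posix())
```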
- Pfam-based annotation of domains
bash run_pfam_scan.sh dir
- HMMER software (http://hmmer.janelia.org/) including pfam_scan.pl (part of HMMER). Place pfam_scan.pl in the same directory as this script, or set its path in the command string.
- Pfam database (http://pfam.xfam.org/)
- File names should be consistent with Phytozome and include Species_*_protein.fa
- Perl modules specified in the scripts (best installed with cpan: http://www.cpan.org/modules/)
- Parsing the pfamscan output with K-parse_Pfam_domains_v3.1.pl
- The script parses the output of pfam_scan.pl
- The script extracts all domains for each protein and removes redundant nested hits with larger e-values.
- Domains are printed out in the order of appearance in the query.
- By default, Pfam_B domains are skipped.
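The filtering described above can be sketched in Python as follows. This is an illustration, not the Perl implementation: it assumes that a hit is "nested" when one hit's coordinates lie entirely within another's, and the Hit type, the function names, and the example coordinates and e-values are all hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    name: str
    start: int
    end: int
    evalue: float


def is_nested(a: Hit, b: Hit) -> bool:
    """True if one hit lies entirely within the other."""
    a_in_b = a.start >= b.start and a.end <= b.end
    b_in_a = b.start >= a.start and b.end <= a.end
    return a_in_b or b_in_a


def filter_hits(hits: List[Hit], evalue_cutoff: float) -> List[Hit]:
    """Drop hits above the cutoff and nested hits with larger e-values."""
    kept: List[Hit] = []
    # Visit the best (smallest e-value) hits first so that they win.
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > evalue_cutoff:
            continue
        if any(is_nested(hit, k) for k in kept):
            continue
        kept.append(hit)
    # Report domains in order of appearance along the protein.
    return sorted(kept, key=lambda h: h.start)


# Made-up hits for a single protein, deliberately out of order.
hits = [
    Hit("LRR_4", 605, 645, 1e-5),   # nested inside LRR_8, larger e-value
    Hit("NB-ARC", 180, 450, 1e-60),
    Hit("LRR_8", 600, 660, 1e-8),
    Hit("TIR", 10, 150, 1e-30),
]
print([h.name for h in filter_hits(hits, evalue_cutoff=1e-3)])
```

Sorting by e-value before checking for nesting means the better-scoring hit is always the one retained, whichever order the hits arrive in.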
perl K-parse_Pfam_domains_v3.1.pl <options>
-e|--evalue <evalue cutoff>
-v|--verbose <T/F> default F. Display more information about each domain (start, stop, evalue)
We usually parse all pfam outputs of interest in parallel using
- Identification of non-canonical NLR-ID domain combinations with K-parse_Pfam_domains_NLR-fusions-v2.2.pl
- This script is configured to find any parsed pfam files in the specified directory or its sub-directories.
- The script will parse the output of K-parse_Pfam_domains_v3.1.pl.
- Note that in the current configuration, the script specifically scans input directories for filenames matching "pfamscanparsed.verbose". If your naming scheme is different, you may need to modify line 62.
- Configuration of 'db_description' is highly important, as the first check in the script matches the species_id in db_description to the one in the name of the file. If the match succeeds, the script prints the species_id and family name to standard output.
- NLR proteins are identified based on the presence of NB-ARC domain.
- Fusions are identified based on the presence of non-NBS, non-LRR domains that pass the specified e-value cutoff (default 1e-3).
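The two identification rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the logic of the Perl script: NBS and LRR domains are recognised here by the Pfam names "NB-ARC" and any name starting with "LRR", and the script itself may exclude a different set of domains. All function and variable names and the example e-values are hypothetical.

```python
from typing import List, Tuple

Domain = Tuple[str, float]  # (Pfam domain name, e-value)

# Assumption: domain name prefixes treated as canonical NBS/LRR domains.
CANONICAL_PREFIXES = ("NB-ARC", "LRR")


def is_nlr(domains: List[Domain]) -> bool:
    """A protein is called an NLR if it carries an NB-ARC domain."""
    return any(name == "NB-ARC" for name, _ in domains)


def integrated_domains(
    domains: List[Domain],
    evalue_cutoff: float = 1e-3,
    canonical_prefixes: Tuple[str, ...] = CANONICAL_PREFIXES,
) -> List[str]:
    """Return non-NBS, non-LRR domains of an NLR that pass the cutoff."""
    if not is_nlr(domains):
        return []
    return [
        name
        for name, evalue in domains
        if evalue <= evalue_cutoff and not name.startswith(canonical_prefixes)
    ]


# Made-up domain lists for two proteins.
protein_a = [("NB-ARC", 1e-50), ("LRR_8", 1e-6), ("HMA", 1e-12), ("Pkinase", 0.5)]
protein_b = [("Pkinase", 1e-40)]

print(integrated_domains(protein_a))  # HMA passes; Pkinase fails the cutoff
print(integrated_domains(protein_b))  # not an NLR, so nothing is reported
```

Passing a longer canonical_prefixes tuple is one way to exclude further domains from being reported as fusions.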
perl K-parse_Pfam_domains_NLR-fusions-v2.2.pl <options>
-i|--indir directory for batch retrieval of input *pfamscan*.parsed.verbose files
-e|--evalue evalue cutoff for determining domain fusions [default 1e-3]
-o|--output output directory
-d|--db_description description of datasets used in the analyses [Organism Species_ID NCBI_taxon_ID Family Database Date_aquired Restrictions Version Common_Name Source Reference]. For an example of this file, see Additional file 1 in Sarris et al., BMC Biology 2016
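To illustrate how a db_description file relates to the input file names, here is a Python sketch that loads the eleven columns listed above and looks up the species for a file. It assumes a tab-delimited file and assumes that input file names begin with the Species_ID followed by an underscore; the Perl script may match differently. The row shown is a placeholder with "NA" in most fields, not a copy of Additional file 1, and the file name is hypothetical.

```python
import csv
import io
from typing import Dict, Optional, Tuple

DB_COLUMNS = [
    "Organism", "Species_ID", "NCBI_taxon_ID", "Family", "Database",
    "Date_aquired", "Restrictions", "Version", "Common_Name", "Source",
    "Reference",
]


def load_db_description(handle) -> Dict[str, Dict[str, str]]:
    """Index db_description rows by Species_ID, skipping a header row if present."""
    reader = csv.DictReader(handle, fieldnames=DB_COLUMNS, delimiter="\t")
    return {
        row["Species_ID"]: row
        for row in reader
        if row["Species_ID"] != "Species_ID"
    }


def match_species(filename: str, db: Dict[str, Dict[str, str]]) -> Optional[Tuple[str, str]]:
    """Return (species_id, family) if the file name starts with a known Species_ID."""
    for sid, row in db.items():
        if filename.startswith(sid + "_"):
            return sid, row["Family"]
    return None


# Placeholder content: one header row and one illustrative data row.
example = io.StringIO(
    "\t".join(DB_COLUMNS) + "\n"
    + "\t".join(["Arabidopsis thaliana", "Athaliana", "3702", "Brassicaceae"] + ["NA"] * 7)
    + "\n"
)
db = load_db_description(example)
print(match_species("Athaliana_example.parsed.verbose", db))
print(match_species("Unknown_example.parsed.verbose", db))
```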
Summary of the number of NLRs and NLR-IDs identified in each species (such as Additional file 2 in Sarris et al BMC Biology 2016)
Summary of integrated domains with species list for each domain (such as Additional file 3 in Sarris et al BMC Biology 2016)
Abundance list of integrated domains (counted once for each family) that can be used to generate a Wordcloud (such as Figure 2 in Sarris et al BMC Biology 2016)
Contingency tables (per ID domain) for each species as well as for all species, with the results of a left-tailed Fisher's exact test
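For readers who want to check a value from the contingency tables by hand, a left-tailed Fisher's exact test on a 2x2 table can be computed from the hypergeometric distribution. The sketch below uses only the Python standard library. The table layout (which counts go in which cell) is an assumption, since it depends on how the pipeline arranges its tables, and the counts are made up.

```python
from math import comb


def fisher_exact_left(a: int, b: int, c: int, d: int) -> float:
    """Left-tailed Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X <= a), where X is the top-left cell count under the
    hypergeometric distribution with all row and column totals fixed.
    """
    row1 = a + b
    row2 = c + d
    col1 = a + c
    total = row1 + row2
    # Smallest top-left count compatible with the fixed margins.
    lowest = max(0, col1 - row2)
    numerator = sum(
        comb(row1, x) * comb(row2, col1 - x) for x in range(lowest, a + 1)
    )
    return numerator / comb(total, col1)


# Made-up counts: the possible tables with top-left cell 0 or 1 contribute
# 1 + 16 of the 70 equally likely arrangements.
p_left = fisher_exact_left(1, 3, 3, 1)
print(round(p_left, 4))  # 17/70, about 0.2429
```

Which tail corresponds to enrichment of a domain depends on how the rows and columns of the table are ordered, so the orientation should be checked against the pipeline's own output before the p-value is interpreted.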
The example dataset directory contains input Arabidopsis data as well as corresponding db_description file. It also contains the outputs from each stage of the analyses, so you can check your pipeline against them or test individual scripts.