A toolkit for immunoglobulin gene lineage tree analysis.
Hadas Neuman and Ramit Mehr
IgTreeZ is a comprehensive analysis tool for Immunoglobulin lineage trees.
To run IgTreeZ, you will need Python 3 and the Python packages NumPy, pandas, ETE3, and matplotlib.
To use the 'draw' sub-program, you will need Graphviz version 2.30.1 or higher
- Clone the repo
git clone https://github.com/neumanh/IgTreeZ.git
- Download IgTreeZ_1.8.3.tar.gz. To extract the file in the local directory, use
tar -zxvf IgTreeZ*.tar.gz
- Download IgTreeZ_1.8.3.zip. To extract the file in the local directory, use
unzip IgTreeZ*.zip
The IgTreeZ program includes five sub-programs:
- mutations
- poptree
- mtree
- filter
- draw
You can list them using
igtreez.py -h
- All sub-programs must be given an analysis name of your choice using the -n parameter.
- The -t parameter, which receives Newick trees, accepts files or directories containing the files. If a directory name is given, the program will try to read all the files inside the directory as Newick files.
- The program creates an output directory named "IgTreeZ_output", in which the output files and the log files are created.
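The way the -t parameter treats files and directories can be pictured with a minimal Python sketch. The helper name `collect_tree_files` is hypothetical and is not part of IgTreeZ; the sketch only illustrates the rule described above, and it assumes that sub-directories are not searched.

```python
import os
import tempfile


def collect_tree_files(paths):
    """Expand file and directory paths into a flat list of files.

    Hypothetical helper: a directory contributes every regular file directly
    inside it (no recursion, which is an assumption); a file contributes itself.
    """
    files = []
    for path in paths:
        if os.path.isdir(path):
            for name in sorted(os.listdir(path)):
                full = os.path.join(path, name)
                if os.path.isfile(full):
                    files.append(full)
        else:
            files.append(path)
    return files


# Demonstration with a temporary directory holding two toy Newick files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.nw", "b.nw"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("(A,B)GL;\n")
    found = [os.path.basename(p) for p in collect_tree_files([tmp, "single_tree.nw"])]

print(found)
```

Because every file in a given directory is treated as a Newick file, it is safest to keep tree directories free of other file types.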
IgTreeZ-mutations counts and profiles the mutations in a repertoire, based on tree topology and input sequences. Three input types are available for IgTreeZ-mutations:
IgTreeZ-mutations can process trees in Newick format and sequences in AIRR format or in the old Change-O format. For an AIRR format dataset:
igtreez.py mutations -n example_name -t ../examples/*nw -d ../examples/F1-control_germ-pass.tsv -dbf airr
- The input database can be in Change-O or AIRR format. Use the -dbf parameter to specify the input format, using -dbf airr or -dbf changeo. The default is the (old) Change-O format (with the tab extension).
- The program uses the following database fields (Change-O / AIRR format):
  - SEQUENCE_IMGT / sequence_alignment
  - GERMLINE_IMGT_D_MASK / germline_alignment
  - CLONE / clone_id
  - SEQUENCE_ID / sequence_id
- You can specify different field names using the parameters -sf, -gf, -cf, and -if.
- If you send the sequences using a database or Fasta files, and the sequence names contain the : or the ; characters (as in Illumina output), you can use the --illumina parameter to automatically replace the colons and semicolons with dashes.
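The name clean-up performed by --illumina can be sketched in a few lines of Python. The function name `sanitize_sequence_name` and the example read name are illustrative assumptions, not IgTreeZ code. The replacement matters because both the colon and the semicolon are structural characters in the Newick format (branch-length separator and tree terminator).

```python
def sanitize_sequence_name(name):
    """Replace colons and semicolons with dashes (hypothetical helper)."""
    return name.replace(":", "-").replace(";", "-")


# An Illumina-style read name, made up for illustration.
example = "M01234:55:000000000-ABCDE:1:1101:15589:1332;size=3"
clean = sanitize_sequence_name(example)
print(clean)
```

The same replacement has to be applied consistently to the names in the trees and in the sequence files, otherwise the two will no longer match.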
If the CDR3_IMGT / cdr3_imgt column exists in the database and the --nocdr3 parameter is not used, the program defines mutations in the CDR3 region as well.
The program profiles the mutation regions based on the IMGT region definitions.
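Before running an analysis, it can help to confirm that a database contains the default columns listed above for its format. The following is a minimal sketch using only the standard library; the dictionary layout, the `missing_fields` helper, and the toy table are assumptions made for illustration and do not mirror IgTreeZ internals.

```python
import csv
import io

# Default column names per format, as listed in this README.
DEFAULT_FIELDS = {
    "changeo": ["SEQUENCE_IMGT", "GERMLINE_IMGT_D_MASK", "CLONE", "SEQUENCE_ID"],
    "airr": ["sequence_alignment", "germline_alignment", "clone_id", "sequence_id"],
}


def missing_fields(header, db_format):
    """Return the default columns of `db_format` that are absent from `header`."""
    return [field for field in DEFAULT_FIELDS[db_format] if field not in header]


# A toy tab-separated AIRR table standing in for a real database file.
toy_airr = (
    "sequence_id\tsequence_alignment\tgermline_alignment\tclone_id\n"
    "s1\tACGT\tACGA\t1\n"
)
reader = csv.DictReader(io.StringIO(toy_airr), delimiter="\t")
missing_for_airr = missing_fields(reader.fieldnames, "airr")
missing_for_changeo = missing_fields(reader.fieldnames, "changeo")
print(missing_for_airr, missing_for_changeo)
```

If columns are reported as missing only because they carry different names, the field-name parameters described above are the intended way to point the program at them.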
IgTreeZ-mutations can process trees in Newick format and sequences in Fasta format:
igtreez.py mutations -n example_name -t ../examples/1004.nw -f ../examples/1004_aligned.fasta
Each Fasta file represents the sequences of one tree. The number of trees must be identical to the number of Fasta files.
The sequence names in the Fasta files should appear in the corresponding tree's node names. The Fasta file should also include a germline sequence.
The program assumes the germline sequence is named GL. You can define another name using the -gl parameter. All Fasta files must include one germline sequence.
All the sequences in one Fasta file must be aligned and have the same length.
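The Fasta requirements above (one germline sequence, all sequences aligned to the same length) can be checked before running the program. This is a minimal sketch with a hand-written Fasta reader; `read_fasta` and `check_alignment` are hypothetical helpers, not IgTreeZ functions, and the toy sequences are made up.

```python
def read_fasta(text):
    """Parse Fasta text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # use the first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def check_alignment(records, germline_name="GL"):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if germline_name not in records:
        problems.append(f"no germline sequence named {germline_name!r}")
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) > 1:
        problems.append(f"sequences have different lengths: {sorted(lengths)}")
    return problems


good = ">GL\nACGTAC\n>seq1\nACGTAA\n>seq2\nACG-AA\n"
bad = ">germ\nACGTAC\n>seq1\nACGT\n"
good_problems = check_alignment(read_fasta(good))
bad_problems = check_alignment(read_fasta(bad))
print(good_problems)
print(bad_problems)
```

For a file whose germline carries another name, pass that name as `germline_name`, mirroring the -gl parameter.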
IgTreeZ-mutations can analyze a dataset in the AIRR clone and lineage tree schema, which includes both trees and sequences:
igtreez.py mutations -n example_name -j ../examples/full_schema_dataset_example.json
The AIRR schema structure is described here
The mutation counts of IgTreeZ-mutations can be used for a selection analysis by creating an additional mutation count file using the --selection parameter:
igtreez.py mutations -n example_name -t ../examples/*nw -d ../examples/F1-control_germ-pass.tab --selection
Multiple output datasets can be passed to an R script that quantifies selection by running SHazaM's selection test on the given mutation counts:
Rscript run_shazam_on_trees_with_CDR3.r example_name1_for_selection.csv example_name2_for_selection.csv
If the script finds the 'CDR3_length' column in the datasets, it accounts for the Ig sequence up to the end of the CDR3 region.
IgTreeZ-mutations creates an output directory named IgTreeZ_output/example_name. Inside it, it creates these CSV files:
- IgTreeZ_output/example_name/example_name_mutations.csv - a dataset of 82 mutation properties. Each line in the file refers to one tree and each column to a different mutation property. The properties are described here
- IgTreeZ_output/example_name/example_name_for_selection.csv - Created when using the --selection parameter. A dataset of 8 columns. Each line in the file refers to one tree and each column to a different mutation property. The properties are described here.
- XXX.log - A progress log file
- XXX_errors.log - An error log file (created only if the program encounters an error)
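When scripting around IgTreeZ-mutations, the CSV locations listed above can be derived from the analysis name. The following is a small sketch; the function name `mutation_outputs` is hypothetical, and the paths simply follow the naming pattern given in this README.

```python
from pathlib import Path


def mutation_outputs(analysis_name, selection=False):
    """Return the expected CSV paths for an IgTreeZ-mutations analysis."""
    out_dir = Path("IgTreeZ_output") / analysis_name
    files = [out_dir / f"{analysis_name}_mutations.csv"]
    if selection:
        # Only expected when the --selection parameter was used.
        files.append(out_dir / f"{analysis_name}_for_selection.csv")
    return files


expected = [p.as_posix() for p in mutation_outputs("example_name", selection=True)]
print(expected)
```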
In addition to the mutation counts, the program can generate descriptive plots using the --plot parameter.
All the plots are created in the local directory IgTreeZ_output/example_name/mutation_plots. The program generates more than 70 plots for each analysis. A short explanation of each plot can be found here
For example - a pie that describes the distribution of the mutations in the CDRs and FWRs in the repertoire:
Another example - A pie plot of the distribution of the replacement mutations that lead to amino-acid hydrophobic/hydrophilic change, relative to the number of mutations that did not, in the repertoire:
IgTreeZ-poptree counts and profiles the populations and population transitions in a repertoire, based on tree node names. For example, it counts the occurrences of each given population, the number of times population X transforms into population Y (that is, the number of times population Y is a direct or indirect descendant of population X), and more. The program also counts the number of mutations between the populations as a transition distance.
IgTreeZ-poptree accepts two types of input: Newick trees and the AIRR schema. The program also receives the population names to search for using the -p parameter.
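The idea of counting population transitions from node names can be illustrated with a short Python sketch. It is not the IgTreeZ implementation: the tree is a plain child-to-parent dictionary, a node belongs to a population if the population name appears in its name, and each labelled node is paired with its nearest labelled ancestor (so unlabelled intermediate nodes are skipped). All three choices are assumptions made for the example.

```python
from collections import Counter


def find_population(node_name, populations):
    """Return the first population whose name appears in the node name, else None."""
    for pop in populations:
        if pop in node_name:
            return pop
    return None


def count_transitions(parent_of, populations):
    """Count (source, target) pairs between each labelled node and its
    nearest labelled ancestor."""
    counts = Counter()
    for node in parent_of:
        target = find_population(node, populations)
        if target is None:
            continue
        ancestor = parent_of.get(node)
        while ancestor is not None:
            source = find_population(ancestor, populations)
            if source is not None:
                counts[(source, target)] += 1
                break
            ancestor = parent_of.get(ancestor)
    return counts


# Toy tree: GL -> s1_IgM -> n1 -> s2_IgG -> s4_IgG, and s1_IgM -> s3_IgA.
parent_of = {
    "GL": None,
    "s1_IgM": "GL",
    "n1": "s1_IgM",
    "s2_IgG": "n1",
    "s3_IgA": "s1_IgM",
    "s4_IgG": "s2_IgG",
}
transitions = count_transitions(parent_of, ["IgM", "IgA", "IgG"])
print(dict(transitions))
```

In this toy tree, IgG is reached from IgM through the unlabelled node n1, which is the kind of indirect descent mentioned above.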
IgTreeZ-poptree can analyze trees in Newick format:
igtreez.py poptree -n example_name -t ../examples/*nw -p IgM IgA IgG
IgTreeZ-poptree can also receive lineage trees in the AIRR clone and lineage tree schema:
igtreez.py poptree -n example_name -j ../examples/full_schema_dataset_example.json -p IgM IgA IgG
IgTreeZ-poptree creates an output directory named IgTreeZ_output/example_name/PopTree_results. These CSV files are created in the directory:
- XXX_population_levels.csv - Each column represents a population and each value represents the number of mutations from the root to a single occurrence of the population.
- XXX_populations_count_by_tree.csv - Each row represents a single tree, and each column represents a population. Each value represents the number of times the population was found in the tree.
- XXX_populations_levle_summary.csv - Summary statistics of the population occurrences in all trees.
- XXX_transition_count_normalized_by_destination.csv - Each column represents a transition type that was found in the trees. Each value represents the number of times this transition was found in all the given trees, divided by the number of times the target population was found in the given trees, and multiplied by a constant (for the convenience of working with numbers greater than one).
- XXX_transition_count_normalized_by_source.csv - Each column represents a transition type that was found in the trees. Each value represents the number of times this transition was found in all the given trees, divided by the number of times the source population was found in the given trees, and multiplied by a constant (for the convenience of working with numbers greater than one).
- XXX_transition_distances.csv - Each column represents a transition type that was found in the trees. Each line represents one transition and each value represents the number of mutations involved in this transition.
- XXX_transitions_count_by_tree.csv - Each row represents a single tree, and each column represents a transition type. Each value represents the number of times the transition was found in the tree.
- XXX_transitions_distance_summary.csv - Summary statistics of the transition occurrences in all trees.
- XXX_transitions_summary_normalized_by_destination.csv - Summary statistics of the transition occurrences in all trees, divided by the number of times the target population was found in the given trees, and multiplied by a constant.
- XXX_transitions_summary_normalized_by_source.csv - Summary statistics of the transition occurrences in all trees, divided by the number of times the source population was found in the given trees, and multiplied by a constant.
In addition, these files are created in IgTreeZ_output/example_name:
- XXX.log - A progress log file
- XXX_errors.log - An error log file (created only if the program encounters an error)
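The normalization used in the "normalized_by_source" and "normalized_by_destination" files can be written out as a small worked example. The value of the constant is not stated here, so the sketch uses 100 purely as a placeholder; the function name and the toy counts are also assumptions.

```python
def normalize_transitions(transition_counts, population_counts, by="source", constant=100):
    """Divide each transition count by the count of its source or target
    population and multiply by a constant (placeholder value)."""
    normalized = {}
    for (source, target), count in transition_counts.items():
        reference = source if by == "source" else target
        normalized[(source, target)] = constant * count / population_counts[reference]
    return normalized


# Toy numbers, not real data.
transition_counts = {("IgM", "IgG"): 4, ("IgM", "IgA"): 2}
population_counts = {"IgM": 8, "IgG": 5, "IgA": 4}

by_source = normalize_transitions(transition_counts, population_counts, by="source")
by_destination = normalize_transitions(transition_counts, population_counts, by="destination")
print(by_source)
print(by_destination)
```

Normalizing by source answers "how often does a given population give rise to this transition", while normalizing by destination answers "how much of the target population arose through it".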
By using the --plot parameter, the program creates descriptive plots.
All the plots are created in the local directory IgTreeZ_output/example_name/mutation_plots. The program generates 10 plots for each analysis. A short explanation of each plot can be found here
For example:
igtreez.py poptree -n example_name -t ../examples/*nw -p IgM IgA IgG --plot
IgTreeZ-mtree quantifies the shape properties of immunoglobulin gene lineage trees by measuring 9 features, seven of which were found to correlate significantly with several B cell response parameters by Shahaf et al., 2008.
Like IgTreeZ-poptree, IgTreeZ-mtree accepts two types of input: Newick trees and the AIRR schema.
IgTreeZ-mtree can analyze trees in Newick format:
igtreez.py mtree -n example_name -t ../examples/*nw
IgTreeZ-mtree can also receive lineage trees in the AIRR clone and lineage tree schema:
igtreez.py mtree -n example_name -j ../examples/full_schema_dataset_example.json
IgTreeZ-mtree creates an output directory named IgTreeZ_output/example_name. Inside it, it creates these files:
- XXX_mtree.csv - a dataset of 9 tree shape properties. Each line in the file refers to one tree and each column to a different shape property. The properties are described here
- XXX.log - A progress log file
- XXX_errors.log - An error log file (created only if the program encounters an error)
IgTreeZ-filter filters trees by population composition or by tree size (number of nodes or number of leaves).
Like IgTreeZ-mtree, IgTreeZ-filter accepts two types of input: Newick trees and the AIRR schema (see above). Only one filtering type can be applied in each run (however, the output of one run can be the input of another).
IgTreeZ-filter can filter trees based on population composition using three logic gates: AND, OR, and NOT.
Using the -AND parameter, the program chooses trees that include all the given populations. For example:
igtreez.py filter -n example_name -t ../examples/*nw -AND IgA1 IgA2
Using the -OR parameter, the program chooses trees that include at least one of the given populations. For example:
igtreez.py filter -n example_name -t ../examples/*nw -OR IgA1 IgA2
Using the -NOT parameter, the program chooses trees that include none of the given populations. For example:
igtreez.py filter -n example_name -t ../examples/*nw -NOT IgM IgG
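The three gates can be summarized with a short Python sketch. It assumes, for illustration only, that a tree is represented by its list of node names and that a population is present when its name appears inside a node name; `passes_filter` is a hypothetical helper, not an IgTreeZ function.

```python
def tree_populations(node_names, populations):
    """Return the subset of `populations` that appear in any node name."""
    return {pop for pop in populations if any(pop in name for name in node_names)}


def passes_filter(node_names, populations, gate):
    """Apply an AND, OR, or NOT population filter to one tree."""
    found = tree_populations(node_names, populations)
    if gate == "AND":
        return found == set(populations)  # every population is present
    if gate == "OR":
        return len(found) > 0             # at least one population is present
    if gate == "NOT":
        return len(found) == 0            # none of the populations is present
    raise ValueError(f"unknown gate: {gate}")


nodes = ["GL", "s1_IgA1", "s2_IgM"]
and_result = passes_filter(nodes, ["IgA1", "IgA2"], "AND")
or_result = passes_filter(nodes, ["IgA1", "IgA2"], "OR")
not_result = passes_filter(nodes, ["IgM", "IgG"], "NOT")
print(and_result, or_result, not_result)
```

The toy tree contains IgA1 and IgM but no IgA2, so it fails the AND filter, passes the OR filter, and fails the NOT filter on IgM and IgG.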
IgTreeZ-filter can also filter trees by size, based on the number of nodes or leaves. To set a minimum number of nodes or leaves, send only one value.
For example, to choose trees with at least 2 leaves, type:
igtreez.py filter -n example_name -t ../examples/*nw -leaves 2
The program can also filter by a range of nodes or leaves when given 2 values.
For example, to choose trees with 3 to 100 nodes (inclusive), type:
igtreez.py filter -n example_name -t ../examples/*nw -nodes 3 100
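The one-value and two-value forms of the size filter follow a simple rule, sketched below. The helper name `passes_size_filter` is hypothetical; the inclusive range follows the description above.

```python
def passes_size_filter(size, bounds):
    """Check a node or leaf count against one value (minimum) or two values
    (inclusive range)."""
    if len(bounds) == 1:
        return size >= bounds[0]
    if len(bounds) == 2:
        low, high = bounds
        return low <= size <= high
    raise ValueError("expected one or two values")


# "-leaves 2": at least 2 leaves.
at_least_two = [passes_size_filter(n, [2]) for n in (1, 2, 50)]
# "-nodes 3 100": between 3 and 100 nodes, both ends included.
in_range = [passes_size_filter(n, [3, 100]) for n in (2, 3, 100, 101)]
print(at_least_two, in_range)
```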
You can save the filtered trees themselves, in addition to listing them, by using the --copy parameter.
This parameter creates a new directory, named after the filter used, that contains the chosen trees. Use:
igtreez.py filter -n example_name -t ../examples/*nw -leaves 6 7 --copy
This command will create a directory with the Newick tree files, named more_than_6_leaves_and_less_than_7_nodes.csv, in the output directory.
IgTreeZ-filter creates an output directory named IgTreeZ_output/example_name/Filtered_files. Inside it, it creates these files:
- XXX.csv - A list of the tree IDs that passed the filter.
- XXX/ - A directory of the Newick tree files that passed the filter (when using the --copy parameter)
- XXX.log - A progress log file
- XXX_errors.log - An error log file (created only if the program encounters an error)
IgTreeZ-draw draws trees using Graphviz and the graph description language DOT.
Like IgTreeZ-mtree, IgTreeZ-draw accepts two types of input: Newick trees and the AIRR schema (see above). The drawn tree will include all the node names, unless you use the -p parameter.
For example:
igtreez.py draw -n example_name -t ../examples/*nw
You can color nodes, or remove node names, using the -p parameter. The -p parameter receives population names that appear in the tree nodes. If you wish to omit the node names from the output figure without coloring the nodes, just add a string that does not appear in any node, or a string that appears in all the nodes.
For coloring the tree nodes by isotypes, for example:
igtreez.py draw -n example_name -t ../examples/*nw -p IgD IgM IgA IgG IgE
If you would like to use specific colors, you can send them using the -c parameter. The first color will color the first population name, the second will color the second, and so on. The number of colors must be equal to the number of populations. The colors can be given in HEX format, or as color names as described here. For example:
igtreez.py draw -n example_name -t ../examples/*nw -p IgD IgM IgA IgG IgE -c blue cyan deeppink '#873943' green
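To make the pairing of populations and colors concrete, here is a minimal Python sketch that maps each population to its color and writes a toy tree as DOT text. It is an illustration under stated assumptions, not the IgTreeZ drawing code: the tree is a child-to-parent dictionary, and `assign_colors` and `to_dot` are hypothetical helpers.

```python
def assign_colors(populations, colors):
    """Pair populations with colors by position; the counts must match."""
    if len(populations) != len(colors):
        raise ValueError("the number of colors must equal the number of populations")
    return dict(zip(populations, colors))


def to_dot(parent_of, color_of):
    """Render a child-to-parent dictionary as DOT text with filled nodes."""
    lines = ["digraph tree {"]
    for node in parent_of:
        # Use the color of the first population found in the node name, if any.
        fill = next((c for pop, c in color_of.items() if pop in node), None)
        if fill is not None:
            lines.append(f'  "{node}" [style=filled, fillcolor="{fill}"];')
        else:
            lines.append(f'  "{node}";')
    for node, parent in parent_of.items():
        if parent is not None:
            lines.append(f'  "{parent}" -> "{node}";')
    lines.append("}")
    return "\n".join(lines)


color_of = assign_colors(["IgM", "IgG"], ["blue", "#873943"])
parent_of = {"GL": None, "s1_IgM": "GL", "s2_IgG": "s1_IgM"}
dot_text = to_dot(parent_of, color_of)
print(dot_text)
```

On the command line, a HEX color should be quoted, because an unquoted word starting with # is treated by the shell as the start of a comment.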
By default, the program creates PNG files. However, any output format supported by the DOT program is accepted. For example:
igtreez.py draw -n example_name -t ../examples/*nw -p IgD IgM IgA IgG IgE --format svg
IgTreeZ-draw creates an output directory named IgTreeZ_output/example_name. Inside it, it creates these files:
- Drawn_trees - A directory of all the tree figures
- Drawn_trees/legend.XXX - A legend file (when using the -p parameter)
- XXX.log - A progress log file
- XXX_errors.log - An error log file (created only if the program encounters an error)
To cite IgTreeZ in publications, please use:
Neuman H, Arrouasse J, Kedmi M, Cerutti A, Magri G, Mehr R. IgTreeZ, A Toolkit for Immunoglobulin Gene Lineage Tree-Based Analysis, Reveals CDR3s Are Crucial for Selection Analysis. Front Immunol. 2022 Oct 26;13:822834. doi: 10.3389/fimmu.2022.822834. PMID: 36389731; PMCID: PMC9643157.
The paper can be found here.
Distributed under the AGPL3 License. See LICENSE for more information.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Hadas Neuman hadas.doron@gmail.com
Prof. Ramit Mehr ramit.mehr@biu.ac.il
Project Link: https://github.com/neumanh/IgTreeZ