
T cell Receptor Immunoglobulin Profiler (TRIP)

Fotis E. Psomopoulos edited this page Mar 30, 2020 · 1 revision

Documentation

TRIP is a software framework that provides analytics services on antigen receptor (B cell receptor immunoglobulin, BcR IG | T cell receptor, TR) gene sequence data. It is a web application written in R Shiny, an R package for building interactive web apps straight from R. It takes as input the output files of the IMGT/HighV-Quest tool (https://www.imgt.org/HighV-QUEST/login.action). Data from a given sample is organized into a folder containing 10 individual files in text (.txt) format. IMGT/HighV-Quest has a submission threshold of 500,000 sequences. If a single sample has more sequences, the data should be split into batches of at most 500,000 sequences, and each batch should be submitted to IMGT/HighV-Quest separately. Hence, multiple folders will be generated for such a sample. These folders should be named with the same identifier followed by a different suffix of the form “_0”, “_1” etc. TRIP processes the data according to user-selected parameters and provides visualization of the results. Datasets containing different samples can be processed together. Users can choose to analyze the data from each input sample separately, or the combined data from all samples, and visualize the results accordingly.
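
The batch naming convention described above can be illustrated with a short Python sketch. It is not part of TRIP (which is written in R); the function names and the example folder names are hypothetical, and the only rules taken from the text are the 500,000-sequence limit and the “_0”, “_1” suffixes.

```python
import re
from collections import defaultdict

# A folder name is assumed to look like "<sample identifier>_<batch number>".
BATCH_SUFFIX = re.compile(r"^(?P<sample>.+)_(?P<batch>\d+)$")

def split_into_batches(sequences, batch_size=500_000):
    """Split a list of sequences into consecutive batches of at most batch_size."""
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]

def group_batches(folder_names):
    """Map each sample identifier to its batch folders, ordered by batch number."""
    groups = defaultdict(list)
    for name in folder_names:
        match = BATCH_SUFFIX.match(name)
        if match is None:
            # No numeric suffix: treat the folder as a sample with a single batch.
            groups[name].append((0, name))
        else:
            groups[match["sample"]].append((int(match["batch"]), name))
    return {sample: [n for _, n in sorted(batches)] for sample, batches in groups.items()}

# Hypothetical folder names for two samples.
print(group_batches(["T1_1", "T1_0", "B2_0"]))
```

Sorting by the numeric suffix rather than by the folder name keeps “_10” after “_9” when a sample has many batches.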

The first step is to upload the input data. Immediately after the upload, the input data is automatically checked for data columns with a different or unknown title. If any are found, users are asked to replace the names of these columns with the appropriate ones. Data columns that will not be used in the downstream analysis are removed at the very beginning of the process to reduce the overall complexity. Next, data Preselection (curation) and Selection (filtering) are applied, according to the user’s preferences.

As a last step, users can select the tool pipeline that they want to apply to the curated and filtered dataset(s). The default pipeline includes the following procedures: clonotype computation, repertoire analysis, multiple variable comparison, sequence alignment and amino acid position-based frequency estimation.

Detailed documentation

The tool web interface is organized in 12 major tabs.

Home

In this tab users can import their data by selecting the directory (“choose directory” button) where the data is stored. The tool takes as input the 10 output files of the IMGT/HighV-Quest tool in text format. Users can also choose only some of the files depending on the type of the downstream analysis. Previous sessions can also be loaded (“Restore Previous Sessions” button).

There are 2 options regarding the cell type (T cell and B cell) as well as 2 options based on the amount of available data (high- or low-throughput). Concerning the latter, the main difference lies in how the preselection and selection steps are applied. In the case of high-throughput data, all filters are applied sequentially (i.e. if a sequence fails more than one selection criterion, only the first unsatisfied criterion will be reported), whereas for low-throughput data all criteria are applied at the same time.
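
The difference between the two modes can be sketched in Python as follows. This is an illustration of the behaviour described above, not TRIP's own code; the criterion names and record fields are hypothetical.

```python
# Each criterion is a (name, predicate) pair applied to one sequence record.
# The names and record fields below are made up for this sketch.
CRITERIA = [
    ("functional_v_gene", lambda r: r["v_functionality"] == "F"),
    ("no_special_characters", lambda r: set(r["cdr3"]) <= set("ACDEFGHIKLMNPQRSTVWY")),
    ("productive", lambda r: r["productive"]),
]

def apply_sequential(record, criteria):
    """High-throughput mode: stop at, and report only, the first failed criterion."""
    for name, check in criteria:
        if not check(record):
            return [name]
    return []

def apply_simultaneous(record, criteria):
    """Low-throughput mode: evaluate every criterion and report all failures."""
    return [name for name, check in criteria if not check(record)]

record = {"v_functionality": "P", "cdr3": "CASSX", "productive": True}
print(apply_sequential(record, CRITERIA))    # ['functional_v_gene']
print(apply_simultaneous(record, CRITERIA))  # ['functional_v_gene', 'no_special_characters']
```

In both modes an empty list means the sequence is retained.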

High throughput Analysis/ NGS data Analysis

Preselection

The Preselection process comprises 4 different criteria:

  1. Only take into account Functional V-Gene: Only sequences utilizing a functional V gene are included into the downstream analysis. Sequences with pseudogenes (P) or open reading frame (ORF) genes are excluded from further analysis.
  2. Only take into account CDR3 with no Special Characters (X, *, #): Only sequences without ambiguities (i.e. characters other than those of the 20 amino acids) are included in the analysis.
  3. Only take into account Productive Sequences: Only productive sequences (without stop codons and frameshifts) are included in the analysis.
  4. Only take into account CDR3 with valid start/end landmarks: Start/end CDR3 landmarks (anchors) can be customized by the user based on the type of data (BcR/TR, heavy/light chain). More than one valid landmark can be used. The different letters should be separated with a vertical bar (e.g. F|D). Sequences with landmarks other than the chosen ones are excluded from the analysis.
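
Criteria 2 and 4 above operate on the CDR3 amino acid string and can be sketched in Python. The function names are hypothetical and the landmark values in the usage lines are illustrative only; the vertical-bar syntax is taken from the text.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # one-letter codes of the 20 amino acids

def has_only_amino_acids(cdr3):
    """Criterion 2: reject CDR3 strings containing any other character."""
    return bool(cdr3) and set(cdr3) <= AMINO_ACIDS

def has_valid_landmarks(cdr3, start, end):
    """Criterion 4: start and end are landmark letters separated by '|', e.g. 'F|D'."""
    starts = tuple(start.split("|"))
    ends = tuple(end.split("|"))
    # str.startswith and str.endswith accept a tuple of alternatives.
    return cdr3.startswith(starts) and cdr3.endswith(ends)

print(has_only_amino_acids("CASSX*F"))              # False
print(has_valid_landmarks("CASSLGF", "C", "F|D"))   # True
print(has_valid_landmarks("CASSLGW", "C", "F|D"))   # False
```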

Users can visualize the results of the preselection (first cleaning) process in the “Preselection” tab. In the case of multi-sample datasets, results are provided for each individual sample separately, or for the combined dataset. The output consists of 4 table files: (i) a summary table with both the included and excluded sequences for each different criterion (“Summary”), (ii) the entire set of data (“All Data table”), (iii) the sequences that meet the preselection criteria and are included in the analysis (“Clean table”) and (iv) the excluded sequences (“Clean out table”). The last column of the “Clean out table” refers to the unsatisfied criteria. All 4 tables can be downloaded as text files.

Selection

The sequences that passed through the Preselection process (“Clean table”) are used as input for the data Selection (filtering) process. This step comprises 6 different filters:

  1. V-REGION identity %: Sequences whose percent identity to the germline does not fall within the range set by the user are excluded from the analysis.
  2. Select Specific V Gene
  3. Select Specific J Gene
  4. Select Specific D Gene

Using the above 3 filters the user can select for sequences that carry one or more particular V, J and D genes or gene alleles, respectively. Different genes/gene alleles should be separated with a vertical line (|), e.g. TRBV11-2|TRBV29-1*03.

  5. Select CDR3 length range: Only sequences with the selected CDR3 lengths are included in the analysis.
  6. Only select CDR3 containing specific amino-acid sequence: Sequences with the specific CDR3 amino acid motif provided by the user are included in the analysis.

The results of the Selection (filtering) process are presented in the “Selection” tab. This process provides 4 output files: (i) a summary table with both the included and excluded sequences for each filter (“Summary”), (ii) the data used as input after the Preselection process (“All Data table”), (iii) the sequences that passed through the selection filters (“Filter in table”) and (iv) the excluded sequences (“Filter out table”). The last column of the “Filter out table” refers to the filters that were not passed by each individual sequence. All the tables can be downloaded as text files.
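
As a rough illustration of how such filters combine, the following Python sketch applies an identity range, a V gene selection, a CDR3 length range and a CDR3 motif to one record. It rests on several assumptions not stated in the text: the gene call is already reduced to a bare allele name such as TRBV29-1*03, both range limits are inclusive, and an entry without “*” selects every allele of that gene. All function and field names are hypothetical.

```python
def matches_gene(allele, selection):
    """Check an allele name against a '|'-separated list of genes or gene alleles."""
    gene = allele.split("*")[0]
    for wanted in selection.split("|"):
        if "*" in wanted:
            if allele == wanted:      # allele-level entry: exact allele required
                return True
        elif gene == wanted:          # gene-level entry: any allele of the gene
            return True
    return False

def passes_selection(record, identity_range, v_genes, cdr3_lengths, motif):
    """Return True if the record satisfies all four illustrative filters."""
    low, high = identity_range
    shortest, longest = cdr3_lengths
    return (
        low <= record["v_identity"] <= high
        and matches_gene(record["v_allele"], v_genes)
        and shortest <= len(record["cdr3"]) <= longest
        and motif in record["cdr3"]
    )

record = {"v_identity": 98.5, "v_allele": "TRBV29-1*03", "cdr3": "CASSLGF"}
print(passes_selection(record, (95, 100), "TRBV11-2|TRBV29-1*03", (5, 10), "SLG"))
```

Comparing whole gene or allele names avoids the false positives that plain substring matching would produce, for example a selection of TRBV1 matching TRBV11-2.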

Pipeline

Users can select the workflow that they want to apply to their dataset(s).

Scenario 1 – T cell receptor gene sequence data analysis

There are 7 different tools in the pipeline tab:

  1. Clonotype computation: The frequencies for all unique clonotypes of each sample are computed. There are 10 different options for clonotype definition. The results are presented in the “Clonotypes” tab in the form of a table, where the clonotype, the count, the frequency and the convergent evolution (if feasible) are given. Each clonotype is also a link that provides a table with all relevant immunogenetic data for that particular clonotype, based on the uploaded files. This table consists of all reads/sequences assigned to that clonotype and all relevant information. Each clonotype is given a unique cluster identifier (cluster ID).
  2. Highly similar clonotypes computation: Frequencies for all highly similar clonotypes are computed. The user can set the number of mismatches allowed for each CDR3 length found in the dataset and a clonotype frequency threshold (range: 0-1). Only clonotypes with a frequency above the applied threshold will be used in the subsequent grouping. The whole process can be performed with or without taking into account the rearranged V-gene. The results are presented in the “Highly Similar Clonotypes” tab as a table. A second table is also provided containing information regarding the clonotype grouping.
  3. Repertoires extraction: The number of clonotypes using each V, J or D gene/allele is computed over the total number of clonotypes based on the clonotype definition given in the previous “clonotype computation” step. If multiple samples are analyzed together the tool provides a total repertoire as well as the repertoire for each individual sample. Results are provided in the “Repertoires” tab as tables. Each table includes the gene/allele and information concerning the absolute count and frequency of sequences expressing that particular gene/allele.
  4. Highly similar repertoires extraction: Same as above except for the fact that the tool uses as input the clonotypes as computed in the “highly similar clonotypes computation”.
  5. Multiple value comparison: The tool performs cross-tabulation analysis between 2 selected variables. The variables available for this analysis depend on the input files selected in the Home tab. The results are presented in the “Multiple value comparison” tab as tables. Each table contains the values that were found to be associated and the relevant frequency.
  6. CDR3 with 1 amino acid length difference: This tool can be applied for datasets that consist of sequences with highly similar CDR3. The tool is able to align and create sequence logos for sequences with the same length as well as for sequences that differ by a single amino acid in terms of length.
  7. Logo: This tool creates an amino acid frequency table for the selected sequence region (CDR3, VDJ REGION, VJ REGION) of a given length. The frequency table is computed by counting the frequency of appearance of each of the 20 different amino acids at each position of the sequence. Users can choose between the total frequency table and the table of the top clusters, ranked by clonotype frequency. A logo is created using the selected frequency table. The color code of the amino acids is based on the 11 IMGT amino acid physicochemical classes (http://www.imgt.org/IMGTeducation/Aide-memoire/_UK/aminoacids/IMGTclasses.html).
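
Two of the computations above, clonotype frequencies (tool 1) and the position-based amino acid frequency table behind the logo (tool 7), can be sketched in Python. This is a minimal illustration, not TRIP's implementation: the record fields are hypothetical, and the clonotype definition used here (V gene plus CDR3) is just one example of the definitions a user might choose.

```python
from collections import Counter

def clonotype_frequencies(records, key=("v_gene", "cdr3")):
    """Count identical clonotypes and return (clonotype, count, frequency) rows."""
    counts = Counter(tuple(r[k] for k in key) for r in records)
    total = sum(counts.values())
    # most_common() orders rows from the most to the least frequent clonotype.
    return [(clonotype, n, n / total) for clonotype, n in counts.most_common()]

def position_frequencies(sequences):
    """For equal-length sequences, return one {amino acid: frequency} dict per position."""
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same length")
    table = []
    for pos in range(length):
        column = Counter(s[pos] for s in sequences)
        table.append({aa: n / len(sequences) for aa, n in column.items()})
    return table

records = [
    {"v_gene": "TRBV5-1", "cdr3": "CASSF"},
    {"v_gene": "TRBV5-1", "cdr3": "CASSF"},
    {"v_gene": "TRBV7-2", "cdr3": "CASSW"},
]
for row in clonotype_frequencies(records):
    print(row)
print(position_frequencies(["CAF", "CAW"]))
```

The same counting approach extends to repertoire extraction by counting clonotypes per gene instead of reads per clonotype.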

In the “Visualization” tab different types of charts (scatter plots, bar charts etc.) are available for the visualization of the analysis results. Clonotypes are presented as bars and the user can select the frequency above which the clonotypes will be presented. The convergent evolution is also available for visualization, with more than one chart type option. The computed repertoires are presented as pie charts and the user can again select the minimum frequency of the gene/allele that will be presented. Regarding the “Multiple value comparison” tool, a plot of the 2 selected variables is presented. All the tables that are presented to the user can be downloaded in text format, whereas the plots and the graphics can be downloaded in png format.

Scenario 2 – B cell receptor immunoglobulin gene sequence data analysis

There are 4 additional tools in this pipeline tab:

  1. Insert identity groups: Input sequences are grouped into different categories based on the V-region identity percent. The user can determine the number and the identity percent range of the mutational groups. Each group includes sequences with an identity percent greater than or equal to its low limit (≥) and strictly below its high limit (<).
  2. Somatic hypermutation status: The relative frequency of each germline identity group is computed. If the user has not defined any groups based on the somatic hypermutation (SHM) status using the “Insert identity groups” tool, the tool will group together only sequences that display the exact SHM status (e.g. sequences with an identity percent of 98.6% will be grouped together whereas sequences with 98.7% identity will form a distinct group). Relative frequencies for each SHM group will be computed based on the total number of sequences.
  3. Alignment: An alignment table is created for the user-selected region (VDJ REGION, VJ REGION). Sequences that are identical at the amino acid or nucleotide level are grouped together in order to create the grouped alignment table. Alignments for the selected region can be provided at the nucleotide or amino acid level or both. Default reference sequences are extracted from the IMGT reference directory (http://www.imgt.org/vquest/refseqh.html). Reference sequences can be used either at the gene or gene allele level. At the gene level, allele *01 is considered as reference. Users can also submit their own reference sequence. It is also possible to align only a number of selected clonotypes through the “Select topN clonotypes” option, or to select those clonotypes that have an individual frequency above a given percent cutoff. Results are presented in the “Alignment” tab as tables. Each table can be downloaded in txt format.
  4. Somatic hypermutations: A table with all somatic hypermutations for all samples together as well as for each individual sample is computed based on the alignment table provided by the previous tool. The output table includes: (i) the mutation type, (ii) the position of the change, (iii) the region where the change occurs, (iv) the number of sequences carrying each change and (v) the frequency of the change for every gene or allele based on the grouped alignment table, regardless of the clonotype. It is possible to analyze only a number of clonotypes by choosing the “Select topN clonotypes” or the “Select threshold for clonotypes” option, or even some clonotypes separately by choosing the “Select clonotypes separately” option. Different clonotype/cluster identifiers (cluster IDs) should be separated by commas (e.g. 1,3,7). Results are given in the “Mutations” tab as tables. When different clonotypes are selected separately, different tables are created for each given clonotype. Each table can be downloaded in text format.
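
The grouping logic of the first two tools above can be sketched in Python. The sketch follows the limits given in the text (low limit inclusive, high limit exclusive) and the fallback to exact identity values when no groups are defined; the function names and the example group limits are hypothetical.

```python
from collections import Counter

def assign_identity_group(identity, groups):
    """Return the (low, high) group with low <= identity < high, or None if none fits."""
    for low, high in groups:
        if low <= identity < high:
            return (low, high)
    return None

def shm_status(identities, groups=None):
    """Relative frequency of each identity group over the total number of sequences."""
    if groups:
        labels = [assign_identity_group(x, groups) for x in identities]
    else:
        # No user-defined groups: each distinct identity value forms its own group.
        labels = list(identities)
    counts = Counter(labels)
    return {label: n / len(identities) for label, n in counts.items()}

identities = [98.6, 98.6, 98.7, 100.0]
print(shm_status(identities))
print(shm_status(identities, groups=[(97.0, 99.0), (99.0, 100.1)]))
```

Because the high limit is exclusive, a group meant to contain sequences with 100% identity needs a high limit above 100, as in the second call.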

Step Dependencies in Pipeline:
  1. In order to apply “Highly Similar Clonotypes computation”, “Clonotypes computation” should have been selected previously.
  2. In order to apply “Repertoires Extraction”, “Clonotypes computation” should have run previously. If “Highly Similar Clonotypes computation” has been selected, repertoires will be extracted for both total clonotypes and highly similar clonotypes.
  3. The “Somatic hypermutation status” is applied using the groups that have been selected at “Insert Identity groups”.
  4. If both “Alignment” and “Clonotypes computation” have been selected, the cluster ID in the alignment table corresponds to the cluster ID in the clonotype table. Otherwise, all elements in the “cluster_ID” column of the alignment table are assigned to zero.
  5. In order to apply “Alignment” using the “Select top N clonotypes” option, “Clonotypes computation” should have run previously.
  6. In order to apply “Mutations”, “Alignment” should have run previously, using the corresponding “AA or Nt” option. The Mutation table is computed based on the grouped alignment table.
  7. In order to apply “Mutations” using the “Select top N clonotypes” or the “Select clonotypes separately” option, “Clonotypes computation” should have previously run.
  8. In order to apply “Logo” using the “Select top N clonotypes” option, “Clonotypes computation” should have run previously.
  9. In order to run the “Shared Clonotype computation” and the “Repertoire comparison” steps, the user must have loaded more than one dataset.
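
Some of the dependencies listed above can be expressed as a simple lookup, sketched here in Python. This is only an illustration of the rules in the list, not how TRIP enforces them; the dictionaries cover dependencies 1, 2 and 5 to 8, and the function name is hypothetical.

```python
# Steps that require another step to have been selected (dependencies 1, 2 and 6).
PREREQUISITES = {
    "Highly Similar Clonotypes computation": {"Clonotypes computation"},
    "Repertoires Extraction": {"Clonotypes computation"},
    "Mutations": {"Alignment"},
}

# Options of a step that require another step (dependencies 5, 7 and 8).
OPTION_PREREQUISITES = {
    ("Alignment", "Select top N clonotypes"): {"Clonotypes computation"},
    ("Mutations", "Select top N clonotypes"): {"Clonotypes computation"},
    ("Mutations", "Select clonotypes separately"): {"Clonotypes computation"},
    ("Logo", "Select top N clonotypes"): {"Clonotypes computation"},
}

def missing_prerequisites(selected_steps, selected_options=()):
    """Return {step: set of required steps that were not selected}."""
    selected = set(selected_steps)
    missing = {}
    for step in selected:
        lacking = PREREQUISITES.get(step, set()) - selected
        if lacking:
            missing[step] = lacking
    for step, option in selected_options:
        lacking = OPTION_PREREQUISITES.get((step, option), set()) - selected
        if lacking:
            missing.setdefault(step, set()).update(lacking)
    return missing

print(missing_prerequisites(["Mutations"]))
```

An empty result means the selected pipeline satisfies all the dependencies encoded in the two tables.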