Skip to content

michelleapaz/T3E

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

77 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Transposable Element Enrichment Estimator (T3E)

A tool for characterising the epigenetic profile of transposable elements using ChIP-seq data

T3E Image

Read about T3E in:

Almeida da Paz, M., Taher, L. T3E: a tool for characterising the epigenetic profile of transposable elements using ChIP-seq data. Mobile DNA 13, 29 (2022). https://doi.org/10.1186/s13100-022-00285-z

Software requirements

T3E was developed for UNIX environments, written and tested with the following versions:

  • Python 3.8.5
    • sys 3.8.5 (depending on sys version: os, math, time)
    • argparse 1.1
    • numpy 1.19.0
    • random (default version = 2)
  • Perl 5.30.0
  • R 3.6.3
    • dplyr 1.0.9
    • ggplot2 3.3.6
  • bedtools 2.27.1
  • bedmap 2.4.37
  • samtools 1.10
  • git 2.25.1
  • conda 4.13.0

External tool dependencies

T3E also requires external tools to run. They should be installed using a package manager or following their website instructions.

Bedtools

Bedtools is a toolset for a wide-range of genomics analysis tasks. To verify if bedtools is installed (and its version), run this command:

bedtools --help

If the usage instructions are printed to the terminal, it is installed. Check its version with the command (it is not recommended to use an older version than the one tested (v2.27.1), although older versions may work well):

bedtools --version

Bedmap

Bedmap is a program used to retrieve and process regions of interest in BED files. To verify if bedmap is installed (and its version), run this command:

bedmap --help

If the usage instructions are printed to the terminal, it is installed. Check its version with the command (it is not recommended to use an older version than the one tested (v2.4.37), although older versions may work well):

bedmap --version

Samtools

Samtools is a suit of programs for interacting with high-throughput sequencing data. To verify if samtools is installed (and its version), run this command:

samtools --help

If the usage instructions are printed to the terminal, it is installed. Check its version with the command (it is not recommended to use an older version than the one tested (v1.10), although older versions may work well):

samtools --version

R with required packages

T3E runs R script to filter out regions of extremely high signals (if asked to). The dplyr and ggplot2 are required. To verify if R is installed (and its version), run this command:

R

If the usage instructions are printed to the terminal, it is installed. To check if the required packages are already installed, try to load them:

library(dplyr)
library(ggplot2)

If the packages load without any error, they are already installed. Otherwise, install them in R:

install.packages("dplyr")
install.packages("ggplot2")

Installation

Cloning T3E repository from GitHub

Before using Git, make sure it is available. To verify if Git is installed, run this command:

git

If the usage instructions are printed to the terminal, it is installed. To clone T3E repository into a new directory, download a .zip file from GitHub or run the command:

git clone https://github.com/michelleapaz/T3E

Installing with conda

We recommend using conda environment (with defined dependencies) to run T3E. To verify if conda is installed, run this command:

conda

If the usage instructions are printed to the terminal, it is installed. Dependencies are defined in the file environment.yml. Create the environment using the command:

conda env create --file environment.yml

To see a list of all environments, run the command:

conda env list

The recently created environment called t3e-env should appear in the list. Then, activate this environment using the command:

conda activate t3e-env

In case any dependency is not successfully installed, it can be installed, for example, using conda:

conda install numpy

Or pip:

pip install numpy

After running T3E, do not forget to deactivate the t3e-env environment, running the command:

conda deactivate

Usage

We provide input files as examples and the repeat annotation files for human and mouse genomes used in our study. These files are available here: https://cloud.tugraz.at/index.php/s/Ec8HfnoasnMzcPS
To run T3E the content of five folders should be considered:

  1. ./bam/ - should contain the alignments (BAM files) for ChIP-seq samples and their corresponding input control. It is important that secondary reads are reported by the used mapper. Two files are provided as example (test_sample.bam and test_control.bam)

  2. ./references/ - contains

  • control_sample.txt file that contains the ChIP-seq input control and sample names (if more than one ChIP-seq sample, separate them using commas) separated by tab:
    control sample_1,sample_2,...sample_n
    For example:
    test_control test_sample

  • parameters.txt file containing the following parameters:

Arguments Explanation
species hg38 (Homo sapiens) or mm10 (Mus musculus)
iterations number of iterations [Example: 100]
alpha level of significance to report enrichment [Example: 0.05]
enrichment log2FC threshold to report enrichment [Example: 1.0]
filter filter out regions of extremely high signals (0 for NO and 1 for YES)

Example:

species	hg38
iterations	100
alpha	0.05
enrichment	1.0
filter	0

Like that, T3E considers the repeat annotation for Homo sapiens (hg38), simulates 100 input libraries for each ChIP-seq sample, considers level of significance of 0.05 and log2FC of 1.0 for reporting enrichments, and it does not filter out regions of extremely high signals

It is recommended to filter out regions of extremely high signals for the mouse genome (filter = 1)!

  • chromosome size files [hg38.genome (Homo sapiens) and mm10.genome (Mus musculus)]
    Content of hg38.genome file (first 5 lines):
chrom	size
chr1	248956422
chr2	242193529
chr3	198295559
chr4	190214555
  • path_dataset.csv file that is created by T3E and contains the path, library size and read length for each BAM file, separated by semicolon. Example:
./T3E/bam/test_control.bam;697382;76
./T3E/bam/test_sample.bam;186923;76
  1. ./repeats/ - contains transposable elements annotation [rmsk_hg38.bed (Homo sapiens) and rmsk_mm10.bed (Mus musculus)]. Custom repeat annotations for other species may be added and should follow the same format file (BED file) with information of TE individual copies. The order of the three first columns should be respected and must contain the chromosome, start coordinate on the chromosome and end coordinate on the chromosome. The fourth column should contain the information about the repeat (e.g. TE family/subfamily)
    Content of rmsk_hg38.bed file (first 5 lines):
chr1	11504	11675	L1MC5a
chr1	11677	11780	MER5B
chr1	15264	15355	MIR3
chr1	18906	19048	L2a
chr1	19971	20405	L3
  1. ./results/ - contains the output files (one folder for each control and sample BAM files)
  2. ./scripts/ - contains all Python, Perl and R scripts

The main.sh code uses the information contained in two files (parameters.txt and control_sample.txt) in ./references folder, processes the datasets and runs T3E scripts automatically

Run main.sh code:

nohup bash main.sh > log_file.txt 2>&1 &

This command is all you have to run since you managed to configure everything until now. You can check the status of the run printing the end of the log file with the command:

tail log_file.txt

T3E also creates a log file (e.g. log_test_sample.txt) which can be checked in the same manner. It is also possible to run each script separately (but it not necessary! T3E does it for you!). The scripts consist in three steps:

Calculate input-based background probability distribution:

It estimates the probability of a read starting at an effective genomic position in the ChIP-seq input control experiment

probabilities.py [-h] [--version] [--control <control_file>]
                 [--readlen <readlen>] [--species <species>]
                 [--outputfolder <outputfolder>]
Arguments Explanation
-h, --help shows help message and exits
--version shows version message and exits
--control ChIP-seq input control experiment [BED format]
--readlen ChIP-seq input control experiment read length in base pairs [Example --readlen 76]
--species hg38 (Homo sapiens) or mm10 (Mus musculus) [Example --species hg38]
--outputfolder output folder path [Example: /probabilities]


Example of input BED file (test_control.bed) for --control parameter (first 5 lines):

chr1	10004	10080	NS500343:103:H72MMBGXY:1:21109:21707:18958
chr1	10016	10092	NS500343:103:H72MMBGXY:1:21109:21707:18958
chr1	10022	10098	NS500343:103:H72MMBGXY:1:21109:21707:18958
chr1	10028	10104	NS500343:103:H72MMBGXY:1:21109:21707:18958
chr1	10034	10110	NS500343:103:H72MMBGXY:1:21109:21707:18958

The file contains the chromosome (column 1), start (column 2) and end (column 3) coordinates on the chromosome and the read ID (column 4) for each read in the ChIP-seq input control library (note that all loci should be reported for multimappers)


Example of command:

python3 ./T3E/scripts/probabilities.py --control ./T3E/results/test_control/test_control.bed --readlen 76 --species hg38 --outputfolder ./T3E/results/test_control/probabilities

The output files are created in the specified folder. In total, each chromosome has one .txt output file containing the genomic position (column 1) and the corresponding cumulative probability (column 2). In the example above, 24 .txt files were generated. Example of chr1_prob.txt output file (first 5 lines):

10004 	 5.821325789847425e-10
10005 	 1.164265157969485e-09
10006 	 1.7463977369542275e-09
10007 	 2.32853031593897e-09
10008 	 2.910662894923712e-09

Construct the background probability distribution of read mappings:

It is the core script of T3E and it computes the background distribution of read mappings by randomly sampling read mappings based on the structure of the input control library

t3e.py [-h] [--version] [--repeat <repeat_file>] [--sample <sample_file>] 
       [--readlen <readlen>] [--control <control_file>]
       [--controlcounts <control_counts>] [--probability <probability_folder>] 
       [--iter <iter>] [--species <species>]
       [--outputfolder <outputfolder>] [--outputprefix <outputprefix>]
Arguments Explanation
-h, --help shows help message and exits
--version shows version message and exits
--repeat transposable elements annotation [rmsk_hg38.bed (Homo sapiens) or rmsk_mm10.bed (Mus musculus)]
--sample ChIP-seq sample experiment [BED format]
--readlen ChIP-seq input control experiment read length in base pairs [Example --readlen 76]
--control ChIP-seq input control experiment [BED format]
--controlcounts ChIP-seq input control experiment counts [.txt format]
--probability probability folder path [Example: /control/probability/]
--iter number of iterations [Example: 100]
--species hg38 (Homo sapiens) or mm10 (Mus musculus) [Example --species hg38]
--outputfolder output folder path [Example: /results]
--outputprefix prefix name of your analysis [Example: test_sample]


Example of .txt input file (test_control_counts.txt) for --controlcounts parameter (first 5 lines):

Alu	37.9599074240402
AluJb	8086.88216090022
AluJo	4291.85264145515
AluJr	4867.75105757094
AluJr4	1122.24456393148

The file contains the TE family/subfamily (column 1) and the corresponding read mapping counts (column 2) for the ChIP-seq input control


Example of command:

python3 ./T3E/scripts/t3e.py --repeat ./T3E/repeats/rmsk_hg38.bed --sample ./T3E/results/test_sample/test_sample.bed --readlen 76 --control ./T3E/results/test_control/test_control.bed --controlcounts ./T3E/results/test_control/test_control_counts.txt --probability ./T3E/results/test_control/probabilities --iter 100 --species hg38 --outputfolder ./T3E/results/test_sample/ --outputprefix test_sample > ./T3E/log_test_sample.txt

The output files are created in the specified folder. In the example, the background file is important for the next script for computing TE families/subfamilies enrichments. Example of test_sample_background.txt output file (first 5 lines):

iter1	Alu	8.782564266947237
iter1	AluJb	2219.1462339576756
iter1	AluJo	1164.8765166651572
iter1	AluJr	1304.9989875595957
iter1	AluJr4	313.5173699906347

The file contains the number of iteration (column 1), TE family/subfamily (column 2) and the corresponding read mapping counts for the simulated background (column 3)

Calculate TE families/subfamilies enrichments:

It computes ChIP-seq enrichment at TE families/subfamilies relative to a background

enrichment.py [-h] [--version] [--background <background>]
              [--signal <signal>] [--iter <iter>] [--alpha <alpha>]
              [--enrichment <enrichment>] [--outputfolder <outputfolder>]
              [--outputprefix <outputprefix>]
Arguments Explanation
-h, --help shows help message and exits
--version shows version message and exits
--background background file created by T3E [Example: sample001_background.txt]
--signal ChIP-seq sample experiment counts [.txt format]
--iter number of iterations [Example: 100]
--alpha level of significance to report enrichment [Example: 0.05]
--enrichment log2FC threshold to report enrichment [Example: 1.0]
--outputfolder output folder path [Example: /results]
--outputprefix prefix name of your analysis [Example: test_sample]


Example of .txt input file (test_sample_counts.txt) for --signal parameter (first 5 lines):

Alu	14.2787735236149
AluJb	2088.6238435312
AluJo	1200.62110262026
AluJr	1333.68223686911
AluJr4	299.617299043992

The file contains the TE family/subfamily (column 1) and the corresponding read mapping counts for the ChIP-seq sample experiments (column 2)


Example of command:

python3 ./T3E/scripts/enrichment.py --background ./T3E/results/test_sample/test_sample_background.txt --signal ./T3E/results/test_sample/test_sample_counts.txt --iter 100 --alpha 0.05 --enrichment 1.0 --outputfolder ./T3E/results/test_sample/ --outputprefix test_sample

The output file contains all the enrichments and it is created in the specified folder. Example of test_sample_enrichment.txt output file (first 5 lines):

Alu	0.02	0.39804917554376207
AluJb	1.0	-0.07102969494637044
AluJo	0.09	0.046781617267046424
AluJr	0.34	0.014120002340119377
AluJr4	0.59	-0.015369887385721339

Note that the file contains the TE family/subfamily (column 1) and its corresponding P-value (column 2) and log2FC (column 3). But also, T3E prints the enriched TE families/subfamilies considering the chosen level of significance and log2FC thresholds:

AluYh7   0.04	1.2034230038490004
Charlie4 0.02	1.6156846715708846
DNA1_Mam 0.04	2.400492454970415
Eulor1   0.01	3.120769507990133
Eulor12  0.05	2.651369092553242

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published