#### Install MATES and required packages

In [None]:
%%bash
git clone https://github.com/mcgilldinglab/MATES.git
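# -y skips the confirmation prompt, which a non-interactive notebook cell cannot answer
conda create -y -n mates_env python=3.9
# Load conda's shell functions so that `conda activate` works inside this cell
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate mates_env
conda install -y -c bioconda samtools
pip install pysam
conda install -y -c bioconda bedtools
pip install pybedtools
cd MATES
pip install .

# Register the environment as a Jupyter kernel
conda install -y ipykernel
python -m ipykernel install --user --name=mates_env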

Cloning into 'MATES'...
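
Before moving on, it can help to confirm that the tools installed above are visible from the active kernel. The following is a minimal sketch, not part of MATES: `check_environment` is a hypothetical helper that looks up the command-line tools on `PATH` and asks Python whether each module can be located, without importing it.

```python
import importlib.util
import shutil


def check_environment(tools=("samtools", "bedtools"),
                      modules=("pysam", "pybedtools", "MATES")):
    """Return a dict mapping each requirement to True (found) or False."""
    status = {}
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        status[tool] = shutil.which(tool) is not None
    for module in modules:
        # find_spec returns None when a top-level module cannot be located
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If anything is reported as missing, the most likely cause is that the notebook is running on a kernel other than `mates_env`.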


In [1]:
import warnings
warnings.filterwarnings("ignore")

**Download the sample data to the same folder as this notebook.**

#### Build TE reference (this may take a few minutes)

In [None]:
%%bash
### Edit the path to build_reference.py according to where you place this notebook
python ../build_reference.py --species Mouse
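
Once the build finishes, a quick check that the reference file exists can save a failed run later. The sketch below uses the file names given in this notebook's API description (`TE_nooverlap.csv` for 'exclusive' mode, `TE_full.csv` for 'inclusive' mode). The helper `expected_reference` is hypothetical, and the assumption that the files are written to the current working directory should be checked against your own setup.

```python
from pathlib import Path

# Default file names as listed under ref_path in the API description
DEFAULT_REFERENCES = {"exclusive": "TE_nooverlap.csv", "inclusive": "TE_full.csv"}


def expected_reference(te_mode, directory="."):
    """Return the path where the default TE reference for te_mode is expected.

    Assumption: the reference files live in `directory` (default: current folder).
    """
    if te_mode not in DEFAULT_REFERENCES:
        raise ValueError(
            f"TE_mode must be one of {sorted(DEFAULT_REFERENCES)}, got {te_mode!r}"
        )
    return Path(directory) / DEFAULT_REFERENCES[te_mode]


if __name__ == "__main__":
    ref = expected_reference("exclusive")
    print(f"{ref}: {'found' if ref.is_file() else 'not found'}")
```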

#### Run MATES

In [3]:
from MATES import MATES_pipeline
mates = MATES_pipeline(
    'exclusive', '10X', 'test_samplelist.txt', 'test_bam_path.txt',
    threads_num=5, bc_ind='CR',
    bc_path_file='test_cb_path.txt', ref_path='TE_nooverlap.csv',
)
mates.preprocessing()
mates.run()

#### APIs

MATES_pipeline(TE_mode, data_mode, sample_list_file, bam_path_file, bc_ind='CB', threads_num=1, bc_path_file=None, bin_size=5, proportion=80, cut_off=50, ref_path='Default')

Initializes the MATES pipeline with the following parameters:

- TE_mode: str
    The mode of TE, either 'inclusive' or 'exclusive'.

- data_mode: str
    The data format, either '10X' or 'Smart_seq'. '10X': one sample (.bam file) contains multiple cells; 'Smart_seq': one sample (.bam file) contains **only** one cell.

- sample_list_file: str
    The path to the sample list file. If data_mode is '10X', the file should contain the sample names. If data_mode is 'Smart_seq', the file should contain the cell names.

- bam_path_file: str
    The path to the file containing the paths to the .bam files. Each row in this file is the bam file directory for the corresponding row in sample_list_file.

- bc_path_file: str
    Only VALID for '10X' format. The path to the file containing the paths to the barcode files. Each row in this file is the barcode file directory for the corresponding row in sample_list_file.

- bc_ind: str
    Only VALID for '10X' format. The barcode field indicator in the bam file. Default is 'CB'.

- threads_num: int
    The number of threads to use for processing the bam files. Default is 1.

- bin_size: int
    The bin size for the coverage vector. Default is 5.

- proportion: int
    The proportion used to determine whether bins are unique-mapping or multi-mapping for training. Default is 80.

- cut_off: int
    The minimum number of TE reads a TE sub-family must have to be considered informative in the dataset. Default is 50.

- ref_path: str
    The path to the TE reference file. Default is 'Default'. If 'Default', the reference file will be 'TE_nooverlap.csv' for 'exclusive' mode and 'TE_full.csv' for 'inclusive' mode.
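
Because `sample_list_file`, `bam_path_file`, and `bc_path_file` are matched row by row, a mismatch in the number of rows is an easy mistake to make. The sketch below writes three such files and checks that they stay aligned. It assumes plain text files with one entry per line; the helper names, the file name prefix, and the example paths are all hypothetical and not part of MATES.

```python
import tempfile
from pathlib import Path


def write_input_lists(rows, out_dir=".", prefix="example"):
    """Write three row-aligned text files from (sample, bam, barcodes) tuples."""
    out_dir = Path(out_dir)
    names = ("samplelist", "bam_path", "cb_path")
    paths = [out_dir / f"{prefix}_{name}.txt" for name in names]
    for column, path in enumerate(paths):
        # One entry per line; line i of every file refers to the same sample
        path.write_text("\n".join(str(row[column]) for row in rows) + "\n")
    return paths


def count_rows(path):
    """Count the non-empty lines of a list file."""
    return sum(1 for line in Path(path).read_text().splitlines() if line.strip())


def check_aligned(paths):
    """Return the shared row count, or raise ValueError if the counts differ."""
    counts = {str(p): count_rows(p) for p in paths}
    if len(set(counts.values())) != 1:
        raise ValueError(f"List files are not row-aligned: {counts}")
    return next(iter(counts.values()))


if __name__ == "__main__":
    # Hypothetical sample names and paths, written to a throwaway directory
    rows = [
        ("sample1", "data/sample1.bam", "data/sample1_barcodes.tsv"),
        ("sample2", "data/sample2.bam", "data/sample2_barcodes.tsv"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        files = write_input_lists(rows, out_dir=tmp)
        print(f"{check_aligned(files)} aligned rows in {len(files)} files")
```

For 'Smart_seq' data the barcode file is not used, so only the first two files would be relevant.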

MATES_pipeline.preprocessing()

Preprocesses the data for MATES training and TE quantification.

MATES_pipeline.run(quantify_locus_TE=True, BATCH_SIZE=256, AE_LR=1e-6, MLP_LR=1e-6, AE_EPOCHS=150, MLP_EPOCHS=150, DEVICE='cpu')

Runs the MATES pipeline and quantifies TEs at the sub-family level. Locus-level TEs are also quantified by default.

- quantify_locus_TE: bool
    If True, quantifies the TE loci. Quantifying locus-level TEs requires more running time and computational resources. Default is True.

- BATCH_SIZE: int
    The batch size for training the model. Default is 256.

- AE_LR: float
    The learning rate for training the autoencoder. Default is 1e-6.

- MLP_LR: float
    The learning rate for training the MLP. Default is 1e-6.

- AE_EPOCHS: int
    The number of epochs for training the autoencoder. Default is 150.

- MLP_EPOCHS: int
    The number of epochs for training the MLP. Default is 150.

- DEVICE: str
    The device to use for training the model. Default is 'cpu'.
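
When only a few of these settings need to change, one option is to collect them in a dictionary first, so that a mistyped parameter name is caught before a long run starts. This is a minimal sketch: `build_run_kwargs` is a hypothetical helper, and the defaults it holds are copied from the `run()` signature documented above.

```python
# Defaults copied from the documented MATES_pipeline.run() signature
RUN_DEFAULTS = {
    "quantify_locus_TE": True,
    "BATCH_SIZE": 256,
    "AE_LR": 1e-6,
    "MLP_LR": 1e-6,
    "AE_EPOCHS": 150,
    "MLP_EPOCHS": 150,
    "DEVICE": "cpu",
}


def build_run_kwargs(**overrides):
    """Return the documented defaults with the given overrides applied."""
    unknown = set(overrides) - set(RUN_DEFAULTS)
    if unknown:
        # Reject names that run() does not document, e.g. a typo such as AE_EPOCH
        raise TypeError(f"Unknown run() parameter(s): {sorted(unknown)}")
    return {**RUN_DEFAULTS, **overrides}


if __name__ == "__main__":
    kwargs = build_run_kwargs(quantify_locus_TE=False, AE_EPOCHS=50)
    print(kwargs)
    # With a pipeline object created as in "Run MATES", this could be passed as:
    # mates.run(**kwargs)
```

Setting `quantify_locus_TE=False`, as in the usage above, skips the locus-level step that the description notes is the more expensive part of the run.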