Skip to content

Python script to get an unified list of Ribo-Seq ORFs from a list of compiled ORF datasets.

Notifications You must be signed in to change notification settings

jorruior/gencode-riboseqORFs

Repository files navigation

README

This readme is a guideline for any user that wants to generate an unified set of Ribo-seq ORFs and reproduced the Phase I ORF list publised in this article: https://doi.org/10.1038/s41587-022-01369-0

While this script is designed to unify independent sets of ORFs and map them to a specific Ensembl version, it is not a tool to analyze ribosome-profiling data. However, GENCODE plans to develop a Phase II to re-analyze ribosome-profiling datasets and call ORFs.

Version 1.1 (2022.09.19): -The updated version also includes and considers annotated CDS sequences without annotated start and/or stop codons. Phase I contains 48 Ribo-seq ORFs overlapping to incomplete proteins.

-Unitary pseudogenes are now considered as a separate category and included in the output files. GENCODE plans to include this category in next updates, but they were not considered initially in Phase I.

DEPENDENCIES:

This script is based on python3 and shell, requiring some additional packages to correctly run all steps:

-gffread (v0.1.10) http://ccb.jhu.edu/software/stringtie/dl/gffread-0.12.6.tar.gz

-bedtools (v2.27.1 or newer) https://github.com/arq5x/bedtools2

-Python3 packages: Biopython

INPUT ANNOTATION FILES:

This pipeline requires a series of files in the correct format to analyze the data. First of all, the user will need to collect all annotation files in a single folder that will be required as input for the main script. These files can be downloaded from Ensembl or GENCODE; we also included a bash script to automatically download and convert all files for a specific Ensembl release:

bash scripts/retrieve_ensembl_data.sh <ENSEMBL_RELEASE (e.g. 101)> <GENOME_ASSEMBLY (e.g. GRCh38)>

Files will be stored in a new named 'Ens + number of release'. The files are:

PROTEOME_FASTA: Fastq file with all annotated proteins. The protein ID should be the only element in the header.

TRANSCRIPTOME_FASTA: Fasta file with all transcripts. It should include both mRNA and ncRNA, and the transcript ID should be the only element in the header.

SORTED_TRANSCRIPTOME_GTF: Sorted GTF file with exon and CDS sequences for the transcripts included in TRANSCRIPTOME_FASTA -Ensembl format-. Transcript IDs (field 'transcript_id') and protein IDs (field 'protein_id') should match with the IDs in the fasta files. A GTF can be sorted running this command in bash:

sort -k1,1 -k4,4n -k5,5n <TRANSCRIPTOME_GTF> > <SORTED_TRANSCRIPTOME_GTF>

TRANSCRIPT_SUPPORT: Tab-delimited file containing all transcripts IDs, Transcript Support Level (TSL) scores, and APPRIS support scores. This file will be used to prioritize transcripts that translate each ORF. <This file has to be downloaded from Ensembl Biomart, a downloaded compatible file based on Ensembl v.101 is located in the folder Ens101>

PSITES_BED: File including coordinates of P-sites for the annotated proteins. This file can be obtained from the fasta file by running:

python3 ../scripts/calculate_frame_bed.py <SORTED_TRANSCRIPTOME_GTF>

INPUT ORF FILES:

In addition, the pipeline will require input files with the protein sequences and coordinates of the ORFs that will be inspected and unified, tagged by study. The name of the ORF needs to be concatenated with the name of the study by '--' in the FASTA file (to avoid name duplication from different studies). e.g.:

>A1BG_58858387_46aa--6_Chen2020
MQPRAQGAVGVLRSAGDSGLAPSPPVAAQGRGLWGAGEASLIPPRN*
>A1BG_58858945_25aa--6_Chen2020
MPSCAARDPSPTSPSSCCARARRRP*
>A1BG:ENST00000598345.1.3797--3_Raj2016
MPSCAARDPSPTSPSSCCARARRRP*

Additionally, both the name of the ORF and the study have to be present in the fourth and fifth fields of the BED files. e.g.:

19      58346882        58347022        A1BG_58858387_46aa      6_Chen2020      -
19      58347503        58347580        A1BG_58858945_25aa      6_Chen2020      -
19      58347503        58347580        A1BG:ENST00000598345.1.3797     3_Raj2016       -

Ideally both sequences and exonic coordinates should be included as input for the main script, but ORF studies are very heterogeneous and sometimes only sequences or coordinates are available. If only the protein sequences are available, please use the -a option in the main script. If only the exonic coordinates are available, it is possible to run this in-house bash script to convert 1-based (Important: All BED/GTF coordinate files should be 1-based) BED coordinates (<BED_FILE>) into protein sequences with the format needed to run the pipeline:

bash scripts/bed1_to_fasta.sh <BED_FILE>

Outputs: <BED_FILE>.nucl.fa and <BED_FILE>.prot.fa

MAIN SCRIPT - ORF_mapper_to_GENCODE.py:

python3 ORF_mapper_to_GENCODE.py -d <FOLDER> -f <ORFS_FA_FILE> -b <ORFS_BED_FILE (1-based)>
Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -d FOLDER, --input_dir=FOLDER
                        (Required) Directory with required Ensembl files
                        (transcriptome gft and fasta, proteome fasta, tab file
                        with APPRIS and TSL support, protein psites generated
                        by 'calculate_frame_bed.py'
  -f ORFS_FA_FILE, --input_fasta=ORFS_FA_FILE
                        (Required) File with all translated candidate ORFs. A
                        FASTA can be generated from a BED file using the
                        script 'bed1_to_fasta.sh'
  -b ORFS_BED_FILE, --bed=ORFS_BED_FILE
                        (Required) File with 1-based BED coordinates of all
                        translated candidate ORFs. ORF names should match in
                        both fasta and bed files. If -a is activated, the BED
                        coordinates will be written into this file.
  -o OUT_NAME, --output=OUT_NAME
                        Tag name for the generated output files. Default: BED
                        filename
  -l LEN_CUTOFF, --min_len_cutoff=LEN_CUTOFF
                        Minimum ORF length threshold in amino acids (without
                        stop codon). default=16
  -L MAX_LEN_CUTOFF, --max_len_cutoff=MAX_LEN_CUTOFF
                        Maximum ORF length threshold in amino acids (without
                        stop codon). default=none
  -c COL_THR, --collapse_cutoff=COL_THR
                        Minimum required fraction to collapse ORFs with
                        similar stretches of overlapping amino-acid sequences.
                        default=0.9
  -m METHOD, --collapse_method=METHOD
                        Method to cluster ORF variants. 'longest_string' will
                        collapse ORFs if the longest shared string is above -c
                        threshold and the start codon and/or stop codon are 
                        shared (default). 'psite_overlap' is a slower method 
                        that collapses ORFs if the fraction of shared psites
                        is above -c threshold
  -a CALCULATE_COORDINATES, --make_annot_bed=CALCULATE_COORDINATES
                        If ORF BED file is not available, generate it from the
                        FASTA file. WARNING: BED file will be writen to the
                        filename given by -b/--bed. (ATG/NTG/XTG/no, default =
                        no)
  --multiple=MULT       If -a option is activated, include ORFs that map to
                        multiple genomic coordinates. (yes/no, default = yes)
  -g GENOMIC, --genomic=GENOMIC
                        If -a option is activated, this optional argument uses
                        a BED file with ORF genomic coordinates to help
                        mapping ORF sequences to the correct exon-intron
                        positions. Use 'none' if no file is given or -a is not
                        activated. (default = none)
  -G FGENOMIC, --force_genomic=FGENOMIC
                        If -a option is activated, this optional argument uses
                        a BED file with ORF genomic coordinates to FORCE
                        mapping ORF sequences to the correct exon-intron
                        positions. ORFs that cannot be mapped to these genomic
                        regions will be discarded. Use 'none' if no file is
                        given or -a is not activated. (default = none)
  -C CDS_CASES, --add_cds=CDS_CASES
                        If 'yes', include CDSs and non-unitary pseudogenes in
                        the output GTF. (default = 'no')

                        

If the BED file including ORF coordinates is not available, the option -a should be enabled and the BED coordinates will be written into the file specified by the option -b. If -a is set to no (default), -b BED file will be used as input for extracting ORF coordinates. -a has three different writing options: ATG will write all ORF starting with ATG codon; NTG will write all ORFs starting with non-ATG codons, XTG will write all ORFs regardless of the translation initiation codon. The first Phase I ORF dataset was generated including ATG ORFs. Additionally, there is the possibility of including an additional 'guide' BED file (-g) to limit ORF mapping to the ranges defined by this file. Unlike the main BED file, this annotation file should not include intron coordinates and should define the genomic loci coordinates where the ORF can be found, discarding ORFs that mapped to alternative genomic loci. This parameter is optional and included for the studies that only generated output sequences and genomic coordinates without introns.

Moreover, the script includes the parameter -m which offers two options for collapsing ORFs located in the same locus and sharing some degree of similarity. 'longest_string' is the default parameter used to calculate Phase I ORFs and will collapse pairs of overlapping ORFs when the length of the longest shared amino acid string divided by the length of the short ORF is higher or equal than -c (default: 0.9). 'psite_overlap' will collapse pairs of overlapping ORFs sharing a minimum fraction of P-site positions higher or equal than -c (default: 0.9). In both cases the longest ORF will be selected as representative and the shorter ORFs will be considered as variants, with the possibility of a variant being assigned to two or more independent ORFs.

Outputs:

<OUT_NAME>.orfs.fa: Protein sequences for all unified ORFs (including pseudogenes and annotated CDS)

<OUT_NAME>.orfs.bed: 1-based BED file with all unified ORFs (including pseudogenes and annotated CDS)

<OUT_NAME>.orfs.gtf: 1-based GTF file with all unified ORFs (excluding pseudogenes and annotated CDS)

<OUT_NAME>.orfs.out: Table with features of all unified ORFs:

orf_id: Unique identifier for each Ribo-seq ORF

phase_id: Identifer for the Ribo-seq ORF in Phase I ('unknown' otherwise)

chrm: Chromosome

starts: Start exon coordinates

ends: End exon coordinates

strand: Strand

trans: Ensembl transcript ID (v.101) of the main host transcript assigned to the ORF. When more than one transcript could be assigned to an ORF, the one with the highest APPRIS or Transcript Support Level (TSL) support was prioritized

gene: Ensembl gene ID (v.101) of the main host gene

gene_name: Ensembl gene name (v.101) of the main host gene

orf_biotype: ORF biotype based on 6 outlined categories (Box 1): Upstream ORFs (uORFs), Upstream overlapping ORFs (uoORFs), Downstream ORFs (dORFs), Downstream overlapping ORFs (doORFs), Internal out-of-frame ORFs (intORFs), Long non-coding RNA ORFs (lncRNA-ORFs)

gene_biotype: Ensembl gene biotype (v.101) of the main host gene: protein-coding or non-coding. LncRNA-ORFs in protein-coding genes are mapped to processed transcripts

pep: Amino acid protein sequence

orf_length: ORF length, it includes the stop codon

n_datasets: Number of Ribo-seq datasets with evidence of translation of the ORF (main ORF or any alternative clustered ORF)

X_study: 1: The ORF was identified in this study. 0: The ORF was not identified in this study. A column per study is added to the table.

pseudogene_ov: 1: The ORF overlaps a pseudogene. 0: The ORF does not overlap any pseudogene

CDS_ov: 1: The ORF overlaps an annotated CDS in any frame. 0: The ORF does not overlap any CDS

CDS_as_ov: 1: The ORF overlaps an annotated CDS in an alternative antisense frame. 0: The ORF does not overlap any antisense CDS

all_trans: Ensembl transcript IDs (v.101) of all the transcripts that could be assigned to the ORF

all_genes: Ensembl gene IDs (v.101) of all the genes that could be assigned to the ORF

all_gene_names: Ensembl gene names (v.101) of all the genes that could be assigned to the ORF

n_variants: Total number of additional clustered ORF isoforms

seq_variants: Amino acid protein sequences of the alternative ORF isoforms

all_orf_names: Original names of the identified ORFs in each Ribo-seq dataset

phase_id: Identifer for the Ribo-seq ORF in Phase I

phase_biotype: Identifer for the Ribo-seq ORF biotype in Phase I

If -a is enabled, the script generates two additional files '<OUT_NAME>.altmapped' and '<OUT_NAME>.unmapped' with additional statistics about ORFs that did not map to the transcriptome or that map to multiple genomic regions (in this case, a random region is selected if --multiple is set to yes).

Example of an altmapped output file:

BET1L_180332_35aa--6_Chen2020   altmap  MAPPEGARSRPNQLPAEVPEAARSFNFLRAARGSC*    138230979       138231086       180226  180333  ENST00000446912
BRCA1:NM_007297:chr17:-:194:274--1_Ji2015       altexons        MDLSALRVEEVQNVINAMQKILECPI*     43121676;43124017       43121679;43124096       43106533;43124017       43106536;43124096       ENST00000652672

In this example, the first ORF maps to two alternative genomic loci (the selected X:138230979-138231086 and the alternative X:180226-180333). The second ORF maps to the same genomic loci but there are two different possible 1st exon combinations (the selected two-exon ORF X:43121676-43121679 and X:43124017-43124096 and the alternative X:43106533-43106536 and X:43124017-43124096).

Example of an unmapped output file:

ENST00000075120_71_149--4_VanHeesch2019 unmapped        MIPARPRDAVMVRLWILPEDVEKTCC*
ENST00000616540_42_66--7_Gaertner2020   nanoORF MSARSPTS*
A1BG-AS1_58865324_38aa--6_Chen2020      NTG     LVHSAWCDPPRLWSRISTQVIQLRPALPRPTRDMCSVT*

In this example, the first ORF was not fully map to any transcript isoform (unmapped), the second ORF was shorter than the minimum length cut-off (nanoORF), and the third ORF started with a near-cognate ORF (NTG, these cases can be included with the option -a NTG or -a XTG). If --multiple=no, altmapped ORFs are also included in this file (tag multiple_coords). If --max_len_cutoff is used, excluded ORFs are also included in this file (tags longORF).

TEST: GENERATING THE PHASE I RIBO-SEQ ORF LIST -Ensembl v.101-

bash scripts/retrieve_ensembl_data.sh 101 GRCh38
python3 ORF_mapper_to_GENCODE.py -d Ens101 -f phaseI/sORFs_genomic_hg38.prot.20210329.fa -b phaseI/sORFs_genomic_hg38.prot.20210329.bed -a no
#Pseudogenes and annotated CDS should be excluded from the output

About

Python script to get an unified list of Ribo-Seq ORFs from a list of compiled ORF datasets.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published