Skip to content

Latest commit

 

History

History
executable file
·
138 lines (95 loc) · 12.6 KB

Data_preparation.md

File metadata and controls

executable file
·
138 lines (95 loc) · 12.6 KB

Preparing simulated datasets for RNA benchmark

Several scripts from RNAseqEval project were used for our benchmark of RNA mappers on third generation sequencing reads. Mapers were tested on a set of real and synthetic datasets. To simulate synthetic datasets, we used PBSIM version 1.0.3, downloaded from https://code.google.com/archive/p/pbsim/.

All datasets, references, gene annotations, gene expressions data and simulation data are available on FigShare.

Synthetic datasets were created from the following organisms:

  • Saccharomyces cerevisiae S288 (baker’s yeast)
  • Drosophila melanogaster r6 (fruit fly)
  • Homo Sapiens GRCh38.p7 (human) Reference genomes for all organisms were downloaded from http://www.ncbi.nlm.nih.gov.

PBSIM is intended to be used as a genomic reads simulator, taking as input a reference sequence and a set of simulation parameters (e.g., coverage, read length, error profile). To generate RNA-seq reads, PBSIM was applied to a set of transcripts generated from a particular genome using gene annotations downloaded from https://genome.ucsc.edu/cgi-bin/hgTables. To make the datasets as realistic as possible, real datasets were analyzed and used to determine simulation parameters. Real gene expression datasets were used to select a set of transcripts for simulation (downloaded from http://bowtie-bio.sourceforge.net/recount/; core (human), nagalakshmi (yeast) and modencodefly (fruit fly) datasets were used).

Simulated data preparation

Simulated datasets were generated using the following workflow:

  1. Analyze real datasets to determine error profiles.
  2. Filter annotations (keep only primary assembly information) and unify chromosome names.
  3. Separate annotations for genes with one isoform and genes with alternative splicing, keeping up to 3 isoforms randomly for each gene with alternative splicing.
  4. Generate a transcriptome from processed annotations and a reference genome.
  5. Analyze gene expression data and determine gene coverage histogram (see Figure).
  6. Approximate gene coverage histogram with 3 points to determine coverage and number of genes in simulated dataset (see Figure). Scale coverages proportionally down to make a smaller dataset, more suitable for testing.
  7. Extract 6 subsets of sequences from generated transcriptome, 3 for genes with single splicing and 3 for genes with alternative splicing. Each set contains a number of transcripts corresponding to the number of genes from a previous step.
  8. Using PBSIM, simulate reads on each generated subset of transcriptome, using coverages determined in step 6 and error profiles determined in step 1.
  9. Combine generated reads into a single generated dataset.

1. Error profile of real datasets

Real datasets were analyzed to determine error profiles. For simulation of PacBio reads, PBSIM parameters (read length, error probability by type, etc) were set to match those of a dataset containing reads of insert (see Supplement table 1). For simulation of MinION ONT reads, PBSIM parameters (read length, error probability by type etc.) were set to match those for MinION reads from a R9 chemistry dataset obtained from the Loman lab website (http://lab.loman.net/2016/07/30/nanopore-r9-data-release). Only 2d reads statistics were used.

Error profiles were determined by mapping the reads to the reference using GraphMap (https://github.com/isovic/graphmap), and by running errorrates.py script from https://github.com/isovic/samscripts. The script take SAM file with alignments and a reference as input, and calculated error rates by examining CIGAR strings for alignments and by comparing the corresponding bases in the read to the ones in the reference. Since PacBio error profile was determined from RNA reads, the reference was a transcriptome (S.Cerevisiae, D. Melanogaster and H. Sapiens), while for ONT MinION reads, the reference was a genome (D. Melanogaster).

Usage example for errorrates.py:

errorrates.py base d_melanogaster_transcriptome.fa pacBio_ROI_dataset.sam

2. Filtering annotations any unifying chromosome names

Annotation and genome reference fles were additionally processed. Chromosome names, which were different for annotations and reference we unified, transforming them into form Chr[designation]. Only sequences representing actual chromosomes were kept for both reference and annotations. E.g. unfinished scaffolds and annotation with an unknown reference were removed. This was done using process_data.py script within this repository.

3. Separating annotations into groups

Annotations for each organism were separated into two groups. The first group contained annotations for genes with a single isoform and the second group contained annotations for genes with alternative splicing. Whether two annotations belong to the same gene was determined based on their chromosome, strand and position overlap. This was done because those two groups were simulated diferently. Additionally, duplicate annotation definitions were removed keeping only the first definition and only up to three randomly chosen isoforms were kept for each gene with alternate splicing. Anntation groupig was done using the process_data.py script in this repository and by using mode split-alternate. The number of isoforms to keep can be adjusted by setting the value of the constant ALTERNATE_SPLICINGS_TO_KEEP in the script. Since S. Cerevisiae has very few annotations with alternate splicing, they were disregarded and only annotations for genes with a single isoform were used for the testing.

4. Generating transcriptomes

Transcriptomes are generated from processed annotations and genome reference using the script generate_transcriptome.py in this repository. Since the annotations were separated into two groups, a transcriptom (or a set of transcripts) was generated for each group.

Usage example:

generate_transcriptome.py dmelanogaster_annotations_single_P.gtf dmelanogaster_ref_P.fasta dmelanogaster_transcriptome_single.fasta

Since PBSIM requires that a reference for simulation is at least 100 base-pairs long, all transcripts shorter then 100bp were removed from transcriptomes. This was done using fastqfilter.py script from https://github.com/isovic/samscripts with option minlen.

Usage example:

fastqfilter.py minlen 100 dmelanogaster_transcriptome_single.fasta dmelanogaster_transcriptome_single_P.fasta

5. Gene expression data

To make the simulations more realistic, real gene expression datasets were used to model a set of transcripts for simulation (downloaded from http://bowtie-bio.sourceforge.net/recount/; core (human), nagalakshmi (yeast) and modencodefly (fruit fly) datasets were used). For each organism, a gene coverage histogram was drawn. Figure below shows such histogram for D. Melanogaster exptrssion data.

6. Approximation of gene expression data

Gene expression histogram was approximated with three points (red squares in the figure), each point was described by a total number of genes and average coverage of those genes. Coverages were then scaled down to values more suitable to testing. These three points were later used for simulation. Since we hadgene expression information, and annotation describe transcripts, the simulations we done separately for single isoform genes and for genes with alternative splicing. For simplicity, we rounded the coverage and number of genes from each point. For example, the Table below shows the numbers used to generate dataset 2 (D. melanogaster). The annotations includes roughly 23,000 genes with a single isoform and 3,000 genes with alternative splicing. Rounding up the ratio, we have decided to simulate 1/10 genes with alternative splicing and 9/10 genes without. We considered that each gene undergoing alternative splicing gave rise to three different isoforms with equal expression.

Group Total no. genes Coverage Genes without alternative splicing Genes with alternative splicing Transcripts with alternative splicing Coverage for AS transcripts
1 5000 5 4500 500 1500 2
2 2000 50 1750 250 750 15
3 2000 100 1750 250 750 30

S. Cerevisiae has a negligible number of alternatly spliced genes and in that case they were ignored. The transcriptome was divided into three groups (approximation points in gene expression histogram) with 4000, 1000 and 1000 genes. The groups were simulated with coverage of 5, 40 and 100 respectively.

The gene expression histogram approximation data for human chromosome 19 is given in the table below:

Group Total no. genes Coverage Genes without alternative splicing Genes with alternative splicing Transcripts with alternative splicing Coverage for AS transcripts
1 1200 20 400 800 2400 6
2 250 100 100 150 450 30
3 70 400 20 50 150 130

7. Extracting transcriptome subsets for simulation

After approximating gene expression histogram with three points, we have determined the number of transcripts and coverage for each group. There were 3 greoups for S. Cerevisiae because genes with alternative splicing were disregarded, and 6 groups for D. Melanogaster and human chromosome 19.

For each organism, transcriptome generated in the step 4 was divided into a 3 groups. This was done using a process_data.py script in this repository, with the option trans-split. The number of transcripts in each group is set within the script. The script will generate 3 new files with the same name as the original transcriptome file with appended '_G1', '_G2' and '_G3'.

For D. Melanogaster transcriptome for genes with alternate splicing, the script is invoked by:

prepare_data.py trans-split dmelanogaster_transcriptome_alternte.fasta

The script will then generate 3 new files:

dmelanogaster_transcriptome_alternte_G1.fasta
dmelanogaster_transcriptome_alternte_G2.fasta
dmelanogaster_transcriptome_alternte_G3.fasta

8. Simulation using PBSIM

Trascriptomes generated in the presious step and coverages determined in step 6 were used for simulation. Simulation parameters were set acording to the error profiles callculated in the step 1.

Simulation running example:

pbsim-1.0.3-Linux-amd64/Linux-amd64/bin/pbsim \
--data-type CLR --depth 5 \
--model_qc pbsim-1.0.3-Linux-amd64/data/model_qc_clr \
--length-mean 3080 \
--length-sd 2211 \
--length-min 50 \
--length-max 50000 \
--accuracy-mean 0.95 \
--accuracy-sd 0.11 \
--accuracy-min 0.7 \
--difference-ratio 47:38:15 \
../s_cerevisiae_transcriptome_F_G1.fa

9. Combining simulation results into final datasets

When running a simulation, PBSIM will generate 3 files for each reference used.

  • REF fie containing a single reference sequence
  • FASTQ file containing all reads simulated from the corresponding reference
  • MAF file describing how exactly were all reads from the corresponding reference generated

Since we used PBSIM on a transcriptome which consists of a large number of sequences, PBSIM will generate a large number of files. Additionally, a simulation of a single dataset consists of several groups, for which the simulation results have to be accumulated together into a single file.

For each partial simulation (group), generated FASTQ files are conactenated together into a single file. For example:

cat group1/*.fastq > s_cerevisiae_dataset_G1.fastq

A prefox is then added to all headers for that partial simulation. This will later enable the evaluation script to determine to which group does a read belong and to find appropriate MAF file describing how the read was generated. For example, headers could be modified like this:

sed -i -e 's/ @S/ @SimG1_S/' s_cerevisiae_dataset_G1.fastq

In the end, to create the final dataset, all FASTQ file belonging to separate groups are combined together. For example:

cat s_cerevisiae_dataset_G1.fastq s_cerevisiae_dataset_G2.fastq s_cerevisiae_dataset_G3.fastq > s_cerevisiae_dataset.fastq

Final dataset (s_cerevisiae_dataset.fastq) can now be used for testing. It generation is somewhat complex making it more realistic then simply generating reads from all annotations with equal probability. The FASTQ headers have also been modified enable the evaluation script (Process_pbsim_data.py) to correctly evaluate the mapping results.