diff --git a/docs/Usage/Pick.rst b/docs/Usage/Pick.rst
index 4f86c52c1..2bbd08cb7 100644
--- a/docs/Usage/Pick.rst
+++ b/docs/Usage/Pick.rst
@@ -319,18 +319,21 @@ Usage::
     usage: Mikado pick [-h] [--start-method {fork,spawn,forkserver}] [-p PROCS]
                        --json-conf JSON_CONF [--scoring-file SCORING_FILE]
                        [-i INTRON_RANGE INTRON_RANGE] [--pad]
+                       [--pad-max-splices PAD_MAX_SPLICES]
+                       [--pad-max-distance PAD_MAX_DISTANCE]
                        [--subloci_out SUBLOCI_OUT] [--monoloci_out MONOLOCI_OUT]
                        [--loci_out LOCI_OUT] [--prefix PREFIX] [--no_cds]
-                       [--source SOURCE] [--flank FLANK] [--purge]
-                       [--subloci-from-cds-only] [--monoloci-from-simple-overlap]
-                       [-db SQLITE_DB] [-od OUTPUT_DIR] [--single] [-l LOG]
-                       [-v | -nv] [-lv {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
+                       [--source SOURCE] [--flank FLANK] [--purge] [--cds-only]
+                       [--monoloci-from-simple-overlap]
+                       [--consider-truncated-for-retained] [-db SQLITE_DB]
+                       [-od OUTPUT_DIR] [--single] [-l LOG] [-v | -nv]
+                       [-lv {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                        [--mode {nosplit,stringent,lenient,permissive,split}]
                        [gff]
-
+
     positional arguments:
       gff
-
+
     optional arguments:
       -h, --help            show this help message and exit
       --start-method {fork,spawn,forkserver}
@@ -350,6 +353,12 @@ Usage::
                             outside of this range will be penalised. Default: (60,
                             900) (default: None)
       --pad                 Whether to pad transcripts in loci. (default: False)
+      --pad-max-splices PAD_MAX_SPLICES
+                            Maximum splice sites that can be crossed during
+                            transcript padding. (default: None)
+      --pad-max-distance PAD_MAX_DISTANCE
+                            Maximum amount of bps that transcripts can be padded
+                            with (per side). (default: None)
       --subloci_out SUBLOCI_OUT
       --monoloci_out MONOLOCI_OUT
       --loci_out LOCI_OUT   This output file is mandatory. If it is not specified
@@ -357,7 +366,7 @@ Usage::
                             (default: None)
       --prefix PREFIX       Prefix for the genes. Default: Mikado (default: None)
       --no_cds              Flag. If set, not CDS information will be printed out
-                            in the GFF output files. (default: None)
+                            in the GFF output files. (default: False)
       --source SOURCE       Source field to use for the output files. (default:
                             None)
       --flank FLANK         Flanking distance (in bps) to group non-overlapping
@@ -366,8 +375,7 @@ Usage::
       --purge               Flag. If set, the pipeline will suppress any loci
                             whose transcripts do not pass the requirements set in
                             the JSON file. (default: False)
-      --subloci-from-cds-only
-                            "Flag. If set, Mikado will only look for overlap in
+      --cds-only            "Flag. If set, Mikado will only look for overlap in
                             the coding features when clustering transcripts
                             (unless one transcript is non-coding, in which case
                             the whole transcript will be considered). Default:
@@ -378,6 +386,11 @@ Usage::
                             transcripts by simple overlap, not by looking at the
                             presence of shared introns. Default: False. (default:
                             False)
+      --consider-truncated-for-retained
+                            Flag. If set, Mikado will consider as retained intron
+                            events also transcripts which lack UTR but whose CDS
+                            ends within a CDS intron of another model. (default:
+                            False)
       -db SQLITE_DB, --sqlite-db SQLITE_DB
                             Location of an SQLite database to overwrite what is
                             specified in the configuration file. (default: None)
@@ -399,7 +412,7 @@ Usage::
                             but also split when both ORFs lack BLAST hits - split:
                             split multi-orf transcripts regardless of what BLAST
                             data is available. (default: None)
-
+
     Log options:
       -l LOG, --log LOG     File to write the log to. Default: decided by the
                             configuration file. (default: None)
@@ -412,6 +425,7 @@ Usage::
                             file. (default: None)
 
 
+
 .. block end
 
 
diff --git a/docs/Usage/Prepare.rst b/docs/Usage/Prepare.rst
index 4b05b05fb..96d000afd 100644
--- a/docs/Usage/Prepare.rst
+++ b/docs/Usage/Prepare.rst
@@ -48,16 +48,12 @@ Command line usage:
                           [-s | -sa STRAND_SPECIFIC_ASSEMBLIES] [--list LIST]
                           [-l LOG] [--lenient] [-m MINIMUM_LENGTH] [-p PROCS]
                           [-scds] [--labels LABELS] [--single] [-od OUTPUT_DIR]
-                          [-o OUT] [-of OUT_FASTA] [--json-conf JSON_CONF]
+                          [-o OUT] [-of OUT_FASTA] [--json-conf JSON_CONF] [-k]
                           [gff [gff ...]]
-
-    Mikado prepare analyses an input GTF file and prepares it for the picking
-    analysis by sorting its transcripts and performing some simple consistency
-    checks.
-
+
     positional arguments:
       gff                   Input GFF/GTF file(s).
-
+
     optional arguments:
       -h, --help            show this help message and exit
       --fasta FASTA         Genome FASTA file. Required.
@@ -79,7 +75,7 @@ Command line usage:
       -m MINIMUM_LENGTH, --minimum_length MINIMUM_LENGTH
                             Minimum length for transcripts. Default: 200 bps.
       -p PROCS, --procs PROCS
-                            Number of processors to use (default 1)
+                            Number of processors to use (default None)
       -scds, --strip_cds    Boolean flag. If set, ignores any CDS/UTR segment.
       --labels LABELS       Labels to attach to the IDs of the transcripts of the
                             input files, separated by comma.
@@ -91,6 +87,9 @@ Command line usage:
                             Output file. Default: mikado_prepared.fasta.
       --json-conf JSON_CONF
                             Configuration file.
+      -k, --keep-redundant  Boolean flag. If invoked, Mikado prepare will retain
+                            redundant models.
+
 
 Collection of transcripts from the annotation files
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -124,4 +123,3 @@ Mikado prepare will produce two files:
 * a FASTA file of the transcripts, in the proper cDNA orientation.
 
 .. warning:: contrary to other tools such as eg gffread from Cufflinks [Cufflinks]_, Mikado prepare will **not** try to calculate the loci for the transcripts. This task will be performed later in the pipeline. As such, the GTF file is formally incorrect, as multiple transcripts in the same locus but coming from different assemblies will *not* have the same gene_id but rather will have kept their original one. Moreover, if two gene_ids were identical but discrete in the input files (ie located on different sections of the genome), this error will not be corrected. If you desire to use this GTF file for any purpose, please use a tool like gffread to calculate the loci appropriately.
-
diff --git a/docs/Usage/Serialise.rst b/docs/Usage/Serialise.rst
index 2832f3cfe..f25bca090 100644
--- a/docs/Usage/Serialise.rst
+++ b/docs/Usage/Serialise.rst
@@ -85,19 +85,16 @@ Usage::
     $ mikado serialise --help
     usage: Mikado serialise [-h] [--start-method {fork,spawn,forkserver}]
                             [--orfs ORFS] [--transcripts TRANSCRIPTS]
-                            [-mr MAX_REGRESSION]
+                            [-mr MAX_REGRESSION] [--codon-table CODON_TABLE]
                             [--max_target_seqs MAX_TARGET_SEQS]
-                            [--blast_targets BLAST_TARGETS] [--discard-definition]
-                            [--xml XML] [-p PROCS] [--single-thread]
-                            [--genome_fai GENOME_FAI] [--junctions JUNCTIONS]
-                            [-mo MAX_OBJECTS] [-f] --json-conf JSON_CONF
-                            [-l [LOG]] [-od OUTPUT_DIR]
+                            [--blast_targets BLAST_TARGETS] [--xml XML] [-p PROCS]
+                            [--single-thread] [--genome_fai GENOME_FAI]
+                            [--junctions JUNCTIONS]
+                            [--external-scores EXTERNAL_SCORES] [-mo MAX_OBJECTS]
+                            [-f] --json-conf JSON_CONF [-l [LOG]] [-od OUTPUT_DIR]
                             [-lv {DEBUG,INFO,WARN,ERROR}]
                             [db]
 
-    Mikado serialise creates the database used by the pick program. It handles
-    Junction and ORF BED12 files as well as BLAST XML results.
-
     optional arguments:
       -h, --help            show this help message and exit
       --start-method {fork,spawn,forkserver}
@@ -123,13 +120,14 @@ Usage::
                             "Amount of sequence in the ORF (in %) to backtrack in
                             order to find a valid START codon, if one is absent.
                             Default: None
+      --codon-table CODON_TABLE
+                            Codon table to use. Default: 0 (ie Standard, NCBI #1,
+                            but only ATG is considered a valid stop codon.
       --max_target_seqs MAX_TARGET_SEQS
                             Maximum number of target sequences.
       --blast_targets BLAST_TARGETS
                             Target sequences
-      --discard-definition  Flag. If set, the sequences IDs instead of their
-                            definition will be used for serialisation.
       --xml XML             XML file(s) to parse. They can be provided in three
                             ways: - a comma-separated list - as a base folder -
                             using bash-like name expansion (*,?, etc.). In this
@@ -146,6 +144,11 @@ Usage::
       --genome_fai GENOME_FAI
       --junctions JUNCTIONS
 
+      --external-scores EXTERNAL_SCORES
+                            Tabular file containing external scores for the
+                            transcripts. Each column should have a distinct name,
+                            and transcripts have to be listed on the first column.
+
       -mo MAX_OBJECTS, --max-objects MAX_OBJECTS
                             Maximum number of objects to cache in memory before
                             committing to the database. Default: 100,000 i.e.
@@ -157,17 +160,17 @@ Usage::
       -l [LOG], --log [LOG]
                             Optional log file. Default: stderr
       -lv {DEBUG,INFO,WARN,ERROR}, --log_level {DEBUG,INFO,WARN,ERROR}
-                            Log level. Default: INFO
+                            Log level. Default: derived from the configuration; if
+                            absent, INFO
       db                    Optional output database. Default: derived from
                             json_conf
 
 
-
 Technical details
 ~~~~~~~~~~~~~~~~~
 
-The schema of the database is quite simple, as it is composed only of 7 discrete tables in two groups. The first group, *chrom* and *junctions*, serialises the information pertaining to the reliable junctions - ie information which is not relative to the transcripts but rather to their genomic locations.
-The second group serialises the data regarding ORFs and BLAST files. The need of using a database is mainly driven by the latter, as querying a relational database is faster than retrieving the information from the XML files themselves at runtime.
+The schema of the database is quite simple, as it is composed only of 9 discrete tables in two groups. The first group, *chrom* and *junctions*, serialises the information pertaining to the reliable junctions - ie information which is not relative to the transcripts but rather to their genomic locations.
+The second group serialises the data regarding ORFs, BLAST files and external arbitrary data. The need of using a database is mainly driven by the latter, as querying a relational database is faster than retrieving the information from the XML files themselves at runtime.
 
 .. database figure generated with `SchemaCrawler `_, using the following command line:
     schemacrawler -c graph -url=jdbc:sqlite:sample_data/mikado.db -o docs/Usage/database_schema.png --outputformat=png -infolevel=maximum