Updated command lines for the three main programs in the documentation (EI-CoreBioinformatics#136)
lucventurini · Oct 26, 2018 · 959a9e9
1 parent 392e2af
Showing 3 changed files with 49 additions and 34 deletions.
diff --git a/docs/Usage/Pick.rst b/docs/Usage/Pick.rst
@@ -319,18 +319,21 @@ Usage::
     usage: Mikado pick [-h] [--start-method {fork,spawn,forkserver}] [-p PROCS]
                        --json-conf JSON_CONF [--scoring-file SCORING_FILE]
                        [-i INTRON_RANGE INTRON_RANGE] [--pad]
+                       [--pad-max-splices PAD_MAX_SPLICES]
+                       [--pad-max-distance PAD_MAX_DISTANCE]
                        [--subloci_out SUBLOCI_OUT] [--monoloci_out MONOLOCI_OUT]
                        [--loci_out LOCI_OUT] [--prefix PREFIX] [--no_cds]
-                       [--source SOURCE] [--flank FLANK] [--purge]
-                       [--subloci-from-cds-only] [--monoloci-from-simple-overlap]
-                       [-db SQLITE_DB] [-od OUTPUT_DIR] [--single] [-l LOG]
-                       [-v | -nv] [-lv {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
+                       [--source SOURCE] [--flank FLANK] [--purge] [--cds-only]
+                       [--monoloci-from-simple-overlap]
+                       [--consider-truncated-for-retained] [-db SQLITE_DB]
+                       [-od OUTPUT_DIR] [--single] [-l LOG] [-v | -nv]
+                       [-lv {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                        [--mode {nosplit,stringent,lenient,permissive,split}]
                        [gff]
-
+    
     positional arguments:
       gff
-
+    
     optional arguments:
       -h, --help            show this help message and exit
       --start-method {fork,spawn,forkserver}
@@ -350,14 +353,20 @@ Usage::
                             outside of this range will be penalised. Default: (60,
                             900) (default: None)
       --pad                 Whether to pad transcripts in loci. (default: False)
+      --pad-max-splices PAD_MAX_SPLICES
+                            Maximum splice sites that can be crossed during
+                            transcript padding. (default: None)
+      --pad-max-distance PAD_MAX_DISTANCE
+                            Maximum amount of bps that transcripts can be padded
+                            with (per side). (default: None)
       --subloci_out SUBLOCI_OUT
       --monoloci_out MONOLOCI_OUT
       --loci_out LOCI_OUT   This output file is mandatory. If it is not specified
                             in the configuration file, it must be provided here.
                             (default: None)
       --prefix PREFIX       Prefix for the genes. Default: Mikado (default: None)
       --no_cds              Flag. If set, not CDS information will be printed out
-                            in the GFF output files. (default: None)
+                            in the GFF output files. (default: False)
       --source SOURCE       Source field to use for the output files. (default:
                             None)
       --flank FLANK         Flanking distance (in bps) to group non-overlapping
@@ -366,8 +375,7 @@ Usage::
       --purge               Flag. If set, the pipeline will suppress any loci
                             whose transcripts do not pass the requirements set in
                             the JSON file. (default: False)
-      --subloci-from-cds-only
-                            "Flag. If set, Mikado will only look for overlap in
+      --cds-only            "Flag. If set, Mikado will only look for overlap in
                             the coding features when clustering transcripts
                             (unless one transcript is non-coding, in which case
                             the whole transcript will be considered). Default:
@@ -378,6 +386,11 @@ Usage::
                             transcripts by simple overlap, not by looking at the
                             presence of shared introns. Default: False. (default:
                             False)
+      --consider-truncated-for-retained
+                            Flag. If set, Mikado will consider as retained intron
+                            events also transcripts which lack UTR but whose CDS
+                            ends within a CDS intron of another model. (default:
+                            False)
       -db SQLITE_DB, --sqlite-db SQLITE_DB
                             Location of an SQLite database to overwrite what is
                             specified in the configuration file. (default: None)
@@ -399,7 +412,7 @@ Usage::
                             but also split when both ORFs lack BLAST hits - split:
                             split multi-orf transcripts regardless of what BLAST
                             data is available. (default: None)
-
+    
     Log options:
       -l LOG, --log LOG     File to write the log to. Default: decided by the
                             configuration file. (default: None)
@@ -412,6 +425,7 @@ Usage::
                             file. (default: None)
 
 
+
 .. block end
 
 

diff --git a/docs/Usage/Prepare.rst b/docs/Usage/Prepare.rst
@@ -48,16 +48,12 @@ Command line usage:
                           [-s | -sa STRAND_SPECIFIC_ASSEMBLIES] [--list LIST]
                           [-l LOG] [--lenient] [-m MINIMUM_LENGTH] [-p PROCS]
                           [-scds] [--labels LABELS] [--single] [-od OUTPUT_DIR]
-                          [-o OUT] [-of OUT_FASTA] [--json-conf JSON_CONF]
+                          [-o OUT] [-of OUT_FASTA] [--json-conf JSON_CONF] [-k]
                           [gff [gff ...]]
-
-    Mikado prepare analyses an input GTF file and prepares it for the picking
-    analysis by sorting its transcripts and performing some simple consistency
-    checks.
-
+    
     positional arguments:
       gff                   Input GFF/GTF file(s).
-
+    
     optional arguments:
       -h, --help            show this help message and exit
       --fasta FASTA         Genome FASTA file. Required.
@@ -79,7 +75,7 @@ Command line usage:
       -m MINIMUM_LENGTH, --minimum_length MINIMUM_LENGTH
                             Minimum length for transcripts. Default: 200 bps.
       -p PROCS, --procs PROCS
-                            Number of processors to use (default 1)
+                            Number of processors to use (default None)
       -scds, --strip_cds    Boolean flag. If set, ignores any CDS/UTR segment.
       --labels LABELS       Labels to attach to the IDs of the transcripts of the
                             input files, separated by comma.
@@ -91,6 +87,9 @@ Command line usage:
                             Output file. Default: mikado_prepared.fasta.
       --json-conf JSON_CONF
                             Configuration file.
+      -k, --keep-redundant  Boolean flag. If invoked, Mikado prepare will retain
+                            redundant models.
+
 
 Collection of transcripts from the annotation files
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -124,4 +123,3 @@ Mikado prepare will produce two files:
 * a FASTA file of the transcripts, in the proper cDNA orientation.
 
 .. warning:: contrary to other tools such as eg gffread from Cufflinks [Cufflinks]_, Mikado prepare will **not** try to calculate the loci for the transcripts. This task will be performed later in the pipeline. As such, the GTF file is formally incorrect, as multiple transcripts in the same locus but coming from different assemblies will *not* have the same gene_id but rather will have kept their original one. Moreover, if two gene_ids were identical but discrete in the input files (ie located on different sections of the genome), this error will not be corrected. If you desire to use this GTF file for any purpose, please use a tool like gffread to calculate the loci appropriately.
-
diff --git a/docs/Usage/Serialise.rst b/docs/Usage/Serialise.rst
@@ -85,19 +85,16 @@ Usage::
     $ mikado serialise --help
     usage: Mikado serialise [-h] [--start-method {fork,spawn,forkserver}]
                             [--orfs ORFS] [--transcripts TRANSCRIPTS]
-                            [-mr MAX_REGRESSION]
+                            [-mr MAX_REGRESSION] [--codon-table CODON_TABLE]
                             [--max_target_seqs MAX_TARGET_SEQS]
-                            [--blast_targets BLAST_TARGETS] [--discard-definition]
-                            [--xml XML] [-p PROCS] [--single-thread]
-                            [--genome_fai GENOME_FAI] [--junctions JUNCTIONS]
-                            [-mo MAX_OBJECTS] [-f] --json-conf JSON_CONF
-                            [-l [LOG]] [-od OUTPUT_DIR]
+                            [--blast_targets BLAST_TARGETS] [--xml XML] [-p PROCS]
+                            [--single-thread] [--genome_fai GENOME_FAI]
+                            [--junctions JUNCTIONS]
+                            [--external-scores EXTERNAL_SCORES] [-mo MAX_OBJECTS]
+                            [-f] --json-conf JSON_CONF [-l [LOG]] [-od OUTPUT_DIR]
                             [-lv {DEBUG,INFO,WARN,ERROR}]
                             [db]
 
-    Mikado serialise creates the database used by the pick program. It handles
-    Junction and ORF BED12 files as well as BLAST XML results.
-
     optional arguments:
       -h, --help            show this help message and exit
       --start-method {fork,spawn,forkserver}
@@ -123,13 +120,14 @@ Usage::
                             "Amount of sequence in the ORF (in %) to backtrack in
                             order to find a valid START codon, if one is absent.
                             Default: None
+      --codon-table CODON_TABLE
+                            Codon table to use. Default: 0 (ie Standard, NCBI #1,
+                            but only ATG is considered a valid stop codon.
 
       --max_target_seqs MAX_TARGET_SEQS
                             Maximum number of target sequences.
       --blast_targets BLAST_TARGETS
                             Target sequences
-      --discard-definition  Flag. If set, the sequences IDs instead of their
-                            definition will be used for serialisation.
       --xml XML             XML file(s) to parse. They can be provided in three
                             ways: - a comma-separated list - as a base folder -
                             using bash-like name expansion (*,?, etc.). In this
@@ -146,6 +144,11 @@ Usage::
       --genome_fai GENOME_FAI
       --junctions JUNCTIONS
 
+      --external-scores EXTERNAL_SCORES
+                            Tabular file containing external scores for the
+                            transcripts. Each column should have a distinct name,
+                            and transcripts have to be listed on the first column.
+
       -mo MAX_OBJECTS, --max-objects MAX_OBJECTS
                             Maximum number of objects to cache in memory before
                             committing to the database. Default: 100,000 i.e.
@@ -157,17 +160,17 @@ Usage::
       -l [LOG], --log [LOG]
                             Optional log file. Default: stderr
       -lv {DEBUG,INFO,WARN,ERROR}, --log_level {DEBUG,INFO,WARN,ERROR}
-                            Log level. Default: INFO
+                            Log level. Default: derived from the configuration; if
+                            absent, INFO
       db                    Optional output database. Default: derived from
                             json_conf
 
 
-
 Technical details
 ~~~~~~~~~~~~~~~~~
 
-The schema of the database is quite simple, as it is composed only of 7 discrete tables in two groups. The first group, *chrom* and *junctions*, serialises the information pertaining to the reliable junctions - ie information which is not relative to the transcripts but rather to their genomic locations.
-The second group serialises the data regarding ORFs and BLAST files. The need of using a database is mainly driven by the latter, as querying a relational database is faster than retrieving the information from the XML files themselves at runtime.
+The schema of the database is quite simple, as it is composed only of 9 discrete tables in two groups. The first group, *chrom* and *junctions*, serialises the information pertaining to the reliable junctions - ie information which is not relative to the transcripts but rather to their genomic locations.
+The second group serialises the data regarding ORFs, BLAST files and external arbitrary data. The need of using a database is mainly driven by the latter, as querying a relational database is faster than retrieving the information from the XML files themselves at runtime.
 
 .. database figure generated with `SchemaCrawler <http://sualeh.github.io/SchemaCrawler/>`_, using the following command line:
     schemacrawler -c graph -url=jdbc:sqlite:sample_data/mikado.db -o docs/Usage/database_schema.png --outputformat=png -infolevel=maximum
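
Note on the new ``--external-scores`` option: the help text above only says that the file is tabular, that each column should have a distinct name, and that transcripts are listed in the first column. A minimal Python sketch of a pre-flight check for such a file could look like the following. It is not Mikado's own parser. The function name ``check_external_scores`` is hypothetical, and the tab delimiter, the numeric values, and the rejection of repeated transcript IDs are all assumptions made for illustration.

```python
import csv
import io


def check_external_scores(handle, delimiter="\t"):
    """Read a tabular score file and return {transcript_id: {column: value}}.

    Hypothetical helper, not part of Mikado. Assumes a header row, a tab
    delimiter, and numeric score values.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader, None)
    if header is None or len(header) < 2:
        raise ValueError("Expected a header with an ID column and at least one score column")

    # The help text asks for distinct column names.
    score_names = header[1:]
    if len(set(score_names)) != len(score_names):
        raise ValueError("Score columns must have distinct names")

    scores = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        # The help text asks for transcripts in the first column.
        tid, values = row[0], row[1:]
        if tid in scores:  # assumption: one row per transcript
            raise ValueError("Duplicated transcript ID: {}".format(tid))
        if len(values) != len(score_names):
            raise ValueError("Wrong number of fields for {}".format(tid))
        scores[tid] = dict(zip(score_names, (float(v) for v in values)))
    return scores


if __name__ == "__main__":
    sample = io.StringIO("tid\tfoo\tbar\nt1\t0.5\t1\nt2\t0.1\t0\n")
    for tid, row in check_external_scores(sample).items():
        print(tid, row)
```

Running a check of this kind before ``mikado serialise`` would surface duplicated column names or ragged rows early, before anything is written to the database.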