pfasta

pfasta is a command-line tool for filtering and sanitizing FASTA files based on various criteria. This includes:

Filtering out sequences that contain invalid amino acids

Replacing or fixing invalid characters in sequences that contain them

Filtering a set of sequences by a maximum and/or minimum sequence length

Sub-sampling a set of sequences to build a reduced set of randomly selected sequences

At its baseline, pfasta takes a single sequence file as input and writes a new output sequence file. A series of flags can be applied, as outlined in the Usage section below.

Usage

pfasta <flags> filename.fasta

-o <output filename> (default: output.fasta)
  Define the name of the output FASTA file

--non-unique-header
  Flag that, if provided, allows multiple FASTA records to have identical headers

--duplicate-record (default: fail)
  Flag that provides a keyword that defines how duplicate FASTA records are dealt with.
  Options are:
      fail   : throws an exception and exits the parsing
      ignore : duplicate records are retained
      remove : duplicate records are removed

--duplicate-sequence (default: fail)
  Flag that provides a keyword that defines how duplicate sequences are dealt with.
  Options are:
      fail   : throws an exception and exits the parsing
      ignore : duplicate sequences are retained
      remove : duplicate sequences are removed
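
The three keywords shared by --duplicate-record and --duplicate-sequence can be pictured with a short Python sketch. This is not pfasta's own code: the function name handle_duplicates, the (header, sequence) tuple layout, and the key argument are assumptions made for illustration. The sketch also assumes that "remove" keeps the first occurrence of each duplicate.

```python
# Hypothetical sketch of the fail / ignore / remove policies.
# Not pfasta's actual implementation.

def handle_duplicates(records, policy="fail", key=lambda rec: rec):
    """Apply a duplicate policy to a list of (header, sequence) tuples.

    key selects what counts as a duplicate: the whole record by default,
    or e.g. `lambda rec: rec[1]` to compare sequences only.
    """
    if policy not in ("fail", "ignore", "remove"):
        raise ValueError(f"unknown policy {policy!r}")
    seen = set()
    kept = []
    for rec in records:
        k = key(rec)
        if k in seen:
            if policy == "fail":
                raise ValueError(f"duplicate found: {rec[0]}")
            if policy == "remove":
                continue  # drop later copies, keep the first (assumption)
        seen.add(k)
        kept.append(rec)
    return kept
```

With key=lambda rec: rec[1] the same function compares sequences only, which is the distinction the two flags draw.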

--invalid-sequence (default: fail)
  Flag that provides a keyword that defines how invalid sequences are dealt with.
  Options are:
      fail                : throws an exception and exits the parsing
      ignore              : invalid sequences are retained
      remove              : invalid sequences are removed
      convert-all         : invalid residues are converted according to the standard conversion table
                            (shown below), but if OTHER invalid residues are found an exception is raised
                                B->N,    U->C,    X->G,    Z->Q,    '*'->'',    '-'->''
      convert-res         : invalid residues are converted according to the standard conversion table,
                            with the exception of sequence-alignment gaps ('-')
      convert-all-ignore  : invalid residues are converted according to the standard conversion table,
                            and if OTHER invalid residues are found they are ignored
      convert-res-ignore  : invalid residues are converted according to the standard conversion table,
                            with the exception of the sequence-alignment gap ('-') character, but
                            if OTHER invalid residues are found they are ignored
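
As a rough illustration of how the four convert-* options differ, the following Python sketch applies the conversion table listed above. It is not pfasta's implementation: the names convert_sequence, keep_gaps, and ignore_other are hypothetical, and the sketch assumes upper-case sequences and that the 20 standard amino acids are the only valid residues.

```python
# Hypothetical sketch of the convert-* behaviours; not pfasta's actual code.

# Assumption: the valid alphabet is the 20 standard amino acids, upper case.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

# Conversion table as listed above; '*' and '-' map to the empty string.
CONVERSION = {"B": "N", "U": "C", "X": "G", "Z": "Q", "*": "", "-": ""}


def convert_sequence(seq, keep_gaps=False, ignore_other=False):
    """Convert invalid residues in seq.

    keep_gaps=True    mimics the convert-res variants ('-' left untouched).
    ignore_other=True mimics the *-ignore variants (characters outside the
                      table are passed through instead of raising).
    """
    out = []
    for ch in seq:
        if ch in STANDARD_AA:
            out.append(ch)
        elif ch == "-" and keep_gaps:
            out.append(ch)
        elif ch in CONVERSION:
            out.append(CONVERSION[ch])
        elif ignore_other:
            out.append(ch)
        else:
            raise ValueError(f"invalid residue {ch!r}")
    return "".join(out)
```

Under these assumptions, convert_sequence("MBUX-Z*") gives "MNCGQ", and the same call with keep_gaps=True gives "MNCG-Q".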

--number-lines (default: 60)
  Flag that defines the line length (number of residues per line) used for sequences in the output FASTA file
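
The default of 60 suggests that this value sets how many residues are written per line. Assuming that reading is correct, the wrapping might look like the sketch below. The function name format_record and its parameters are hypothetical and are not taken from pfasta.

```python
# Hypothetical sketch: wrap a sequence at a fixed number of residues per line.
# Assumes --number-lines controls line width; not pfasta's actual code.

def format_record(header, seq, line_length=60):
    """Return one FASTA record as text, wrapping seq every line_length residues."""
    lines = [f">{header}"]
    for start in range(0, len(seq), line_length):
        lines.append(seq[start:start + line_length])
    return "\n".join(lines) + "\n"
```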

--shortest-seq-lines (default: None)
  Flag that sets the minimum length of sequences returned; shorter sequences are filtered out

--longest-seq-lines (default: None)
  Flag that sets the maximum length of sequences returned; longer sequences are filtered out
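
The two length filters can be combined in a single pass, as in this minimal Python sketch. The name filter_by_length is hypothetical, and treating both bounds as inclusive is an assumption, since the text above does not say whether a sequence exactly at the limit is kept.

```python
# Hypothetical sketch of the length filters; not pfasta's actual code.
# Assumption: both bounds are inclusive, and None means "no limit".

def filter_by_length(records, shortest=None, longest=None):
    """Keep (header, sequence) tuples whose length lies within the bounds."""
    kept = []
    for header, seq in records:
        if shortest is not None and len(seq) < shortest:
            continue
        if longest is not None and len(seq) > longest:
            continue
        kept.append((header, seq))
    return kept
```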

--random-subsample (default: None)
  Flag that defines the number of sequences to randomly sub-sample. This allows a smaller FASTA
  file to be generated as a subset for testing analysis pipelines
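
Random sub-sampling without replacement can be sketched with Python's standard library. The function name random_subsample and the seed parameter are hypothetical additions for reproducibility; the text above does not say whether pfasta offers a seed. Returning every record when the request exceeds the input size is also an assumption.

```python
# Hypothetical sketch of random sub-sampling; not pfasta's actual code.
import random


def random_subsample(records, n, seed=None):
    """Return n records chosen at random without replacement.

    If n is at least the number of records, all records are returned
    (assumption). seed is an illustrative extra for reproducible draws.
    """
    records = list(records)
    if n >= len(records):
        return records
    rng = random.Random(seed)
    return rng.sample(records, n)
```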

--print-statistics
  Flag that, if provided, means statistics about the FINAL set of sequences written are printed

--no-outputfile
  Flag that, if provided, means NO output file is generated.

--silent
  Flag that, if provided, means pfasta generates ZERO output to STDOUT
