Installation

This was the alpha version of matchbox. The release version is now available, with more features and ongoing support! This repository is deprecated, and exists purely for archival purposes.

⚠️ WARNING ⚠️ This is an alpha version, and I guarantee I will make breaking changes to matchbox syntax and behaviour. The final version will feel either a little bit different, or completely different. Use at your own risk! Italicised annotations on this readme explain features I'm looking to redesign. If you have any ideas about any of this, please share them with me!

Installation

Currently, you need to clone this repo and build it with cargo.

git clone ..
cd matchbox-alpha
cargo build --release
./target/release/matchbox-alpha --help

Usage

A matchbox script can be executed across your reads with the run subcommand.

matchbox run "read.id" reads.fq

When you want to write longer matchbox code, put it in a .mb file and use the script subcommand.

matchbox script "trim.mb" reads.fq

To process paired FASTQs, provide your read2 file using the command-line parameter paired-with:

matchbox script my_script.mb read1.fq --paired-with read2.fq

Scripting language reference

Matchbox doesn't do anything on its own - it requires a script to tell it what to do with each read in your input, written in the matchbox scripting language.

Matchbox scripts are made of statements, of which there are a few different types.

Send statements

A send statement has an expression on the left to evaluate, and an output location on the right, to which the resulting value will be sent. The expression and the output location are separated by a pipe |>.

Example: Send the name of each read to an output file called names.txt.

read.id |> file('ids.txt')

You can omit the output location, which will send the value to stdout.

Example: Print the full sequence of each read to stdout.

read.seq

A list of output locations is given below.

Assignment with `:`

Using a colon :, you can assign a name to a value, so that it can be used as shorthand elsewhere in the script. Names can contain letters, numbers and underscores (though they cannot start with a number).

Example: Print out the ID of each read to a file called ids.txt.

filename: 'ids.txt'
output: file(filename)

read.id |> output

Conditional statements with `if`

Statements can be conditional using if, only executing the branch on the right of the arrow => when their conditions are met.

Example: If a read is longer than 1000 bases, send it to a file long.fq.

if read.seq.len() > 1000 => read |> file('long.fq')

Pattern matching

Conditional statements can include pattern-matching, allowing reads to be taken apart structurally. The pattern, enclosed in square brackets [], describes a geometry which must match the read for the statement on the right of the arrow => to execute. Fixed sequences can be used in patterns (or variables which refer to fixed sequences, such as primer in the below example). Additionally, variable regions denoted with underscores _ can be used, each consisting of any number of nucleotides.

Example: If a read contains a primer (with any number of nucleotides on either side), send it to a file filtered.fq.

primer: AGTCAGTGCTAG

if read is [_ primer _] => read |> file('filtered.fq')

When a pattern matches a read multiple times (e.g. when multiple primers are present), only the first match of the pattern is executed. I'm working on optionally allowing you to execute the branch each time the pattern is matched within each read.

Edit distance is allowed when matching, and the command-line argument error-rate allows the user to decide what proportion of each sequence needs to match. I'm working on a feature to allow you to specify error rate for each individual sequence, if you wish to.

When there are multiple patterns to be matched, they are tried in sequence, and the first successful branch is executed.

Example: If a read contains a primer, send it to a file forward.fq. Otherwise, if it contains the reverse complement of the primer, send it to reverse.fq.

primer: AGTCAGTGCTAG

if read is
    [_ primer _] => read |> file('forward.fq')
    [_ -primer _] => read |> file('reverse.fq')

Extracting regions with `:`

Often, you want to extract specific parts of a read, while trimming off other parts. Using a colon :, you can bind a user-defined name to a slice of the read. This slice retains the metadata of the read but includes only the part of the sequence that falls within the specified region; hence, it's the simplest way to trim reads in matchbox.

Example: If a read contains AGAG and later CTCT, then call the region in between them mid, and then print out this trimmed region mid to a file filtered.fa.

if read is [_ AGAG mid:_ CTCT _] => mid |> file('filtered.fa')

For FASTQ and SAM, the quality score is also trimmed to only contain the variable region. Other SAM fields like the modification string are not yet correctly handled by this slicing. Bugs galore

Fixed-length regions with `|n|`

To specify a region of a fixed length n, vertical bars |n| can be used.

Example: If a read contains the sequence AGAG and later CTCT, then extract four bases following AGAG, and call these bc. Extract everything after this until CTCT, and call it mid. Then, tag this trimmed region mid with the barcode sequence that was found, and print the trimmed reads to a file filtered.fa.

if read is [_ AGAG bc:|4| mid:_ CTCT _] =>
    mid.tag('barcode={bc.seq}') |> file('filtered.fa')

Example: Take the first four bases of each read, and accumulate them into counts. This will report how frequently a read starts with each possible set of four bases.

if read is [first:|4| _] => first.seq |> counts

Parameterising patterns with `for`

Sometimes, a read contains a sequence from a known list of sequences (e.g. a barcode list). With the for syntax, you can introduce a named parameter to handle such regions. For each value from the list that can be used match the pattern, the branch is executed. I'm adding a feature to specify whether you want to execute the branch for all matching barcodes, or just the best matching one.

Example: Load a list of barcodes from barcodes.tsv. If a read contains a primer immediately followed by a barcode sequence, send the read to filtered.fq. Also, send the name of the matched barcode to counts, to quantify which barcodes appeared most frequently. If a read did not contain the primer and barcode structure, send it to bad.fq. (For this script to work, your barcode TSV must contain columns called 'name' and 'seq'.)

primer: AGTCAGTGCTAG
bcs: tsv('barcodes.tsv')

'total' |> counts

if read is
    [_ primer bc.seq _] for bc in bcs => {
        bc.name |> counts
        read |> file('filtered.fq')
    }
    [_] => read |> file('bad.fq')

Grouping `{}` and separation `;`

As shown above, curly braces {} can be used to group several statements together so that they are all executed. Statements are separated either by newlines, or by semicolons ;.

Example: For reads over 1000 bases in length, collect them in a file filtered.fq, and print their names to stdout.

if read.seq.len() > 1000 => { read.id; read |> file('filtered.fq') }

Descriptive analysis with `as`

Pattern matching is great when you're already sure of what structure your reads have. First, though, it's good to search through your reads to discover what structure is present. An as expression allows you to infer a read's structure in terms of some fixed sequences you expect to find.

Example: Describe a read in terms of the presence of TSO sequences and poly(A) tails. Accumulate these into counts, to get a summary of general read structure.

tso: AGCTGATCG
read as { poly_a: AAAAAAAA, tso } |> counts

Outputs

There are a few built-in output locations to which you can send values. They accumulate output from across all of your reads, and they are the way of getting filtered or summarised information out of matchbox.

These are a weird part of matchbox, and I'm thinking about how to redesign them. E.g. you should be able to send values into counts and then send the counts output into a file, but you can't chain outputs in this way. Also, you should be able to count different types of things - e.g. you have a read structure with two barcodes, and want to count occurrences of each barcode type seperately - but you can't, because counts is a single global output. Clearly there are limitations to this model.

Name	Description
`stdout`	Values are printed directly to stdout.
`file(exp)`	Values are printed to a file whose name is given by exp. If the filename indicates a FASTA, FASTQ or SAM/BAM file, the values will be formatted as appropriate. If the values can't be treated as FASTA/FASTQ/SAM reads, matchbox will crash.
`counts`	Values are counted and, after processing your reads, the number of times each unique value was sent to `counts` is printed to stdout.
`average`	Values must be numeric. After processing your reads, the average and standard deviation of all accumulated values is printed to stdout.

Expressions

Variables

An expression can be a variable name, assuming that name has been bound in this scope, or in any looser scope.

Input read formats

read is a built-in variable which refers to the current read being processed. It has fields that can be accessed using dot . notation.

FASTA, FASTQ and SAM reads contain slightly different fields. Depending on the file type you are inputting into matchbox, different fields will be available to you.

Field	Type	Description	FASTA	FASTQ	SAM	Paired FASTQ
`seq`	Str	The raw nucleotide sequence of the read.	✅	✅	✅
`desc`	Str	The entire description line of the read, after the ID. For FASTA, sometimes referred to as the definition line.	✅	✅
`id`	Str	The name of the read. For FASTA, this is anything from the `>` to the first whitespace in the description line. For FASTQ, this is anything from the `@` to the first whitespace. For SAM, this is the `qname` column.	✅	✅	✅
`qual`	Str	The quality score of the read, as a string of ASCII-encoded Phred scores. Found in the fourth line of a FASTQ read, or the `qual` field of a SAM.		✅	✅
`flag`	Num	The flag, which encodes a lot of metadata about the read.			✅
`rname`	Str	The reference name.			✅
`mapq`	Num	The quality of the mapping, if appropriate. -1 otherwise.			✅
`rnext`	Num	The next read, if applicable.			✅
`pnext`	Num	The starting position of the next read.			✅
`tlen`	Num	The template length.			✅
`r1`	Rec	The first read of a paired FASTQ. `r1` has all the fields of a regular FASTQ record.				✅
`r2`	Rec	The second read of a paired FASTQ. `r2` all the fields of a regular FASTQ record.				✅

Functions

A function call is another type of expression. Matchbox's small set of built-in functions allow you to transform data.

To call a function in matchbox, the syntax is similar to most other languages: the function's name, followed by parentheses which contain a comma-separated list of arguments.

x: len(read.seq)

slice(read, 0, x / 2) |> file('halved_reads.fq')

To allow for programming in an OOP method-chaining style, functions can be called as though they are methods of their first argument (i.e. their first argument is moved before the dot):

x: read.seq.len()

read.slice(0, x / 2) |> file('halved_reads.fq')

The full function list is given here:

Name	Argument types	Return type	Description
`len` / `length`	Str / Seq / List	Num	Find the length of a string, sequence or list.
`rename`	Rec, Str	Rec	Create a new record, with an updated 'name' field.
`slice`	Rec / Seq / Str, Num, Num	Rec / Seq / Str	Take a slice of a read, sequence or string, from a starting index to an ending index. Note that slicing is inclusive of the start and exclusive of the end, so `slice(read, n, m)` is the interval `n <= x < m`. Hence, you can say `slice(read, 0, len(read.seq))`.
`translate`	Seq	Str	Convert a sequence to its amino acid string.
`concat`	Rec, Rec	Rec	Concatenate two reads (or slices of a read) together. The metadata of the first read is kept (ID, description, etc.) while the sequences (and quality scores, if applicable) of the two reads are concatenated.
`to_str`	Bool / Num / Seq	Str	Convert a value to string format.
`to_seq`	Str	Seq	Convert a value to a sequence.
`csv`	Str	List(Rec)	Given a filename, open the file as CSV and read all the rows into a list of records. Each record has fields named after the columns of the CSV, with values from the given row.
`tsv`	Str	List(Rec)	Given a filename, open the file as TSV and read all the rows into a list of records. Each record has fields named after the columns of the TSV, with values from the given row.

todo: write up the rest of the functions

Operators

Operators are the same as functions, but are often infix

Operator	Description
Num `+` Num	Addition
Num `-` Num	Subtraction
Num `` Num*	Multiplication
Num `/` Num	Division
`-` Num	Negation
`-` (Seq / Rec)	Reverse complementation (of a sequence or a read)
Num `%` Num	Modulo
Any `\|>` Output	Send a value to the output

Todo

Features I'm currently adding to matchbox.

Give command line args to a matchbox script matchbox --args "frw4 = AGCTAGTCGATG"
- Or draw from nextflow like matchbox --threads 4 -frw4 ACGATGCTGATGCTG
- Ability to access input file name in the script (so you can call your outputs like '{input_file_name}_trimmed.fq')
Handle SAM optional fields
Ability to combine strings/seqs with binary + operator. so read.tag('bc={bc1.seq+bc2.seq+bc3.seq}') works intuitively
- Maybe make tag take many args?
Concatenate two parts of a read together with an operator, e.g. when rotating a plasmid [part1:_ part2:(AGCTAGTCG _)] => part2 + part1 |> file('rotated.fq')
Figure out the file size estimation problem on gzipped files
Make sequence values case insensitive
- Make sequences permit IUPAC codes
Add comments
Adding headers to csvs?
- Maybe

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
examples		examples
figures		figures
src		src
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Installation

Usage

Scripting language reference

Send statements

Assignment with `:`

Conditional statements with `if`

Pattern matching

Extracting regions with `:`

Fixed-length regions with `|n|`

Parameterising patterns with `for`

Grouping `{}` and separation `;`

Descriptive analysis with `as`

Outputs

Expressions

Variables

Input read formats

Functions

Operators

Todo

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Installation

Usage

Scripting language reference

Send statements

Assignment with :

Conditional statements with if

Pattern matching

Extracting regions with :

Fixed-length regions with |n|

Parameterising patterns with for

Grouping {} and separation ;

Descriptive analysis with as

Outputs

Expressions

Variables

Input read formats

Functions

Operators

Todo

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Assignment with `:`

Conditional statements with `if`

Extracting regions with `:`

Fixed-length regions with `|n|`

Parameterising patterns with `for`

Grouping `{}` and separation `;`

Descriptive analysis with `as`

Packages