This was the alpha version of matchbox. The release version is now available, with more features and ongoing support! This repository is deprecated, and exists purely for archival purposes.
Currently, you need to clone this repo and build it with cargo.
git clone ..
cd matchbox-alpha
cargo build --release
./target/release/matchbox-alpha --help
A matchbox script can be executed across your reads with the run subcommand.
matchbox run "read.id" reads.fqWhen you want to write longer matchbox code, put it in a .mb file and use the script subcommand.
matchbox script "trim.mb" reads.fqTo process paired FASTQs, provide your read2 file using the command-line parameter paired-with:
matchbox script my_script.mb read1.fq --paired-with read2.fq
Matchbox doesn't do anything on its own - it requires a script to tell it what to do with each read in your input, written in the matchbox scripting language.
Matchbox scripts are made of statements, of which there are a few different types.
A send statement has an expression on the left to evaluate, and an output location on the right, to which the resulting value will be sent. The expression and the output location are separated by a pipe |>.
Example: Send the name of each read to an output file called names.txt.
read.id |> file('ids.txt')You can omit the output location, which will send the value to stdout.
Example: Print the full sequence of each read to stdout.
read.seqA list of output locations is given below.
Using a colon :, you can assign a name to a value, so that it can be used as shorthand elsewhere in the script. Names can contain letters, numbers and underscores (though they cannot start with a number).
Example: Print out the ID of each read to a file called ids.txt.
filename: 'ids.txt'
output: file(filename)
read.id |> outputStatements can be conditional using if, only executing the branch on the right of the arrow => when their conditions are met.
Example: If a read is longer than 1000 bases, send it to a file long.fq.
if read.seq.len() > 1000 => read |> file('long.fq')Conditional statements can include pattern-matching, allowing reads to be taken apart structurally. The pattern, enclosed in square brackets [], describes a geometry which must match the read for the statement on the right of the arrow => to execute. Fixed sequences can be used in patterns (or variables which refer to fixed sequences, such as primer in the below example). Additionally, variable regions denoted with underscores _ can be used, each consisting of any number of nucleotides.
Example: If a read contains a primer (with any number of nucleotides on either side), send it to a file filtered.fq.
primer: AGTCAGTGCTAG
if read is [_ primer _] => read |> file('filtered.fq')When a pattern matches a read multiple times (e.g. when multiple primers are present), only the first match of the pattern is executed. I'm working on optionally allowing you to execute the branch each time the pattern is matched within each read.
Edit distance is allowed when matching, and the command-line argument error-rate allows the user to decide what proportion of each sequence needs to match. I'm working on a feature to allow you to specify error rate for each individual sequence, if you wish to.
When there are multiple patterns to be matched, they are tried in sequence, and the first successful branch is executed.
Example: If a read contains a primer, send it to a file forward.fq. Otherwise, if it contains the reverse complement of the primer, send it to reverse.fq.
primer: AGTCAGTGCTAG
if read is
[_ primer _] => read |> file('forward.fq')
[_ -primer _] => read |> file('reverse.fq')Often, you want to extract specific parts of a read, while trimming off other parts. Using a colon :, you can bind a user-defined name to a slice of the read. This slice retains the metadata of the read but includes only the part of the sequence that falls within the specified region; hence, it's the simplest way to trim reads in matchbox.
Example: If a read contains AGAG and later CTCT, then call the region in between them mid, and then print out this trimmed region mid to a file filtered.fa.
if read is [_ AGAG mid:_ CTCT _] => mid |> file('filtered.fa')For FASTQ and SAM, the quality score is also trimmed to only contain the variable region. Other SAM fields like the modification string are not yet correctly handled by this slicing. Bugs galore
To specify a region of a fixed length n, vertical bars |n| can be used.
Example: If a read contains the sequence AGAG and later CTCT, then extract four bases following AGAG, and call these bc. Extract everything after this until CTCT, and call it mid. Then, tag this trimmed region mid with the barcode sequence that was found, and print the trimmed reads to a file filtered.fa.
if read is [_ AGAG bc:|4| mid:_ CTCT _] =>
mid.tag('barcode={bc.seq}') |> file('filtered.fa')Example: Take the first four bases of each read, and accumulate them into counts. This will report how frequently a read starts with each possible set of four bases.
if read is [first:|4| _] => first.seq |> countsSometimes, a read contains a sequence from a known list of sequences (e.g. a barcode list). With the for syntax, you can introduce a named parameter to handle such regions. For each value from the list that can be used match the pattern, the branch is executed. I'm adding a feature to specify whether you want to execute the branch for all matching barcodes, or just the best matching one.
Example: Load a list of barcodes from barcodes.tsv. If a read contains a primer immediately followed by a barcode sequence, send the read to filtered.fq. Also, send the name of the matched barcode to counts, to quantify which barcodes appeared most frequently. If a read did not contain the primer and barcode structure, send it to bad.fq. (For this script to work, your barcode TSV must contain columns called 'name' and 'seq'.)
primer: AGTCAGTGCTAG
bcs: tsv('barcodes.tsv')
'total' |> counts
if read is
[_ primer bc.seq _] for bc in bcs => {
bc.name |> counts
read |> file('filtered.fq')
}
[_] => read |> file('bad.fq')As shown above, curly braces {} can be used to group several statements together so that they are all executed. Statements are separated either by newlines, or by semicolons ;.
Example: For reads over 1000 bases in length, collect them in a file filtered.fq, and print their names to stdout.
if read.seq.len() > 1000 => { read.id; read |> file('filtered.fq') }Pattern matching is great when you're already sure of what structure your reads have. First, though, it's good to search through your reads to discover what structure is present. An as expression allows you to infer a read's structure in terms of some fixed sequences you expect to find.
Example: Describe a read in terms of the presence of TSO sequences and poly(A) tails. Accumulate these into counts, to get a summary of general read structure.
tso: AGCTGATCG
read as { poly_a: AAAAAAAA, tso } |> countsThere are a few built-in output locations to which you can send values. They accumulate output from across all of your reads, and they are the way of getting filtered or summarised information out of matchbox.
These are a weird part of matchbox, and I'm thinking about how to redesign them. E.g. you should be able to send values into counts and then send the counts output into a file, but you can't chain outputs in this way. Also, you should be able to count different types of things - e.g. you have a read structure with two barcodes, and want to count occurrences of each barcode type seperately - but you can't, because counts is a single global output. Clearly there are limitations to this model.
| Name | Description |
|---|---|
stdout |
Values are printed directly to stdout. |
file(exp) |
Values are printed to a file whose name is given by exp. If the filename indicates a FASTA, FASTQ or SAM/BAM file, the values will be formatted as appropriate. If the values can't be treated as FASTA/FASTQ/SAM reads, matchbox will crash. |
counts |
Values are counted and, after processing your reads, the number of times each unique value was sent to counts is printed to stdout. |
average |
Values must be numeric. After processing your reads, the average and standard deviation of all accumulated values is printed to stdout. |
An expression can be a variable name, assuming that name has been bound in this scope, or in any looser scope.
read is a built-in variable which refers to the current read being processed. It has fields that can be accessed using dot . notation.
FASTA, FASTQ and SAM reads contain slightly different fields. Depending on the file type you are inputting into matchbox, different fields will be available to you.
| Field | Type | Description | FASTA | FASTQ | SAM | Paired FASTQ |
|---|---|---|---|---|---|---|
seq |
Str | The raw nucleotide sequence of the read. | ✅ | ✅ | ✅ | |
desc |
Str | The entire description line of the read, after the ID. For FASTA, sometimes referred to as the definition line. | ✅ | ✅ | ||
id |
Str | The name of the read. For FASTA, this is anything from the > to the first whitespace in the description line. For FASTQ, this is anything from the @ to the first whitespace. For SAM, this is the qname column. |
✅ | ✅ | ✅ | |
qual |
Str | The quality score of the read, as a string of ASCII-encoded Phred scores. Found in the fourth line of a FASTQ read, or the qual field of a SAM. |
✅ | ✅ | ||
flag |
Num | The flag, which encodes a lot of metadata about the read. | ✅ | |||
rname |
Str | The reference name. | ✅ | |||
mapq |
Num | The quality of the mapping, if appropriate. -1 otherwise. | ✅ | |||
rnext |
Num | The next read, if applicable. | ✅ | |||
pnext |
Num | The starting position of the next read. | ✅ | |||
tlen |
Num | The template length. | ✅ | |||
r1 |
Rec | The first read of a paired FASTQ. r1 has all the fields of a regular FASTQ record. |
✅ | |||
r2 |
Rec | The second read of a paired FASTQ. r2 all the fields of a regular FASTQ record. |
✅ |
A function call is another type of expression. Matchbox's small set of built-in functions allow you to transform data.
To call a function in matchbox, the syntax is similar to most other languages: the function's name, followed by parentheses which contain a comma-separated list of arguments.
x: len(read.seq)
slice(read, 0, x / 2) |> file('halved_reads.fq')To allow for programming in an OOP method-chaining style, functions can be called as though they are methods of their first argument (i.e. their first argument is moved before the dot):
x: read.seq.len()
read.slice(0, x / 2) |> file('halved_reads.fq')The full function list is given here:
| Name | Argument types | Return type | Description |
|---|---|---|---|
len / length |
Str / Seq / List | Num | Find the length of a string, sequence or list. |
rename |
Rec, Str |
Rec | Create a new record, with an updated 'name' field. |
slice |
Rec / Seq / Str, Num, Num |
Rec / Seq / Str | Take a slice of a read, sequence or string, from a starting index to an ending index. Note that slicing is inclusive of the start and exclusive of the end, so slice(read, n, m) is the interval n <= x < m. Hence, you can say slice(read, 0, len(read.seq)). |
translate |
Seq | Str | Convert a sequence to its amino acid string. |
concat |
Rec, Rec |
Rec | Concatenate two reads (or slices of a read) together. The metadata of the first read is kept (ID, description, etc.) while the sequences (and quality scores, if applicable) of the two reads are concatenated. |
to_str |
Bool / Num / Seq | Str | Convert a value to string format. |
to_seq |
Str | Seq | Convert a value to a sequence. |
csv |
Str | List(Rec) | Given a filename, open the file as CSV and read all the rows into a list of records. Each record has fields named after the columns of the CSV, with values from the given row. |
tsv |
Str | List(Rec) | Given a filename, open the file as TSV and read all the rows into a list of records. Each record has fields named after the columns of the TSV, with values from the given row. |
todo: write up the rest of the functions
Operators are the same as functions, but are often infix
| Operator | Description |
|---|---|
Num + Num |
Addition |
Num - Num |
Subtraction |
Num * Num |
Multiplication |
Num / Num |
Division |
- Num |
Negation |
- (Seq / Rec) |
Reverse complementation (of a sequence or a read) |
Num % Num |
Modulo |
Any |> Output |
Send a value to the output |
Features I'm currently adding to matchbox.
- Give command line args to a matchbox script
matchbox --args "frw4 = AGCTAGTCGATG"- Or draw from nextflow like
matchbox --threads 4 -frw4 ACGATGCTGATGCTG - Ability to access input file name in the script (so you can call your outputs like
'{input_file_name}_trimmed.fq')
- Or draw from nextflow like
- Handle SAM optional fields
- Ability to combine strings/seqs with binary
+operator. soread.tag('bc={bc1.seq+bc2.seq+bc3.seq}')works intuitively- Maybe make
tagtake many args?
- Maybe make
- Concatenate two parts of a read together with an operator, e.g. when rotating a plasmid
[part1:_ part2:(AGCTAGTCG _)] => part2 + part1 |> file('rotated.fq') - Figure out the file size estimation problem on gzipped files
- Make sequence values case insensitive
- Make sequences permit IUPAC codes
- Add comments
- Adding headers to csvs?
- Maybe
