Add tests and update docs
Make it clear in the docs and the help screen that when using mmseqs,
users can provide a premade target DB and it will work fine.  Also
include new tests to make sure this works okay.
mooreryan committed Jan 26, 2022
1 parent 66e616e commit 1f61396
Showing 6 changed files with 140 additions and 12 deletions.
19 changes: 19 additions & 0 deletions README.md
@@ -105,6 +105,25 @@ A couple of things to note here:

For more info on command line usage, see the help screen by running `cerise --help`.

### Premade target databases

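If you use `mmseqs`, you can pass a premade target database in place of the target FASTA file. Give Cerise the path to the DB as the second positional argument, and it will skip the database construction step. The example below builds the target DB with `mmseqs createdb` first, then runs Cerise as before.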

```
$ mmseqs createdb clustered_targets.fasta clustered_targets.db
$ cerise \
clustered_queries.fasta \
clustered_targets.db \
--query-clusters=query_clusters.tsv \
--target-clusters=target_clusters.tsv \
--all-queries=queries.fasta \
--all-targets=targets.fasta \
--search-config='--threads 4 -s 7 --num-iterations 3' \
--search-program=mmseqs
```

*Using premade DBs in this way is not yet supported when using `blast` or `diamond`.*
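
Because the second positional argument accepts either a FASTA file or an MMseqs2 database, a wrapper script can choose between them at run time. The following is a rough sketch, not something Cerise ships: `pick_targets` is a hypothetical helper, and it assumes the premade DB sits at a single file path, as in the `createdb` example above.

```shell
# Hypothetical helper (not part of Cerise): print the premade DB path if
# that file exists, otherwise fall back to the FASTA path.
pick_targets() {
  fasta=$1
  db=$2
  if [ -f "$db" ]; then
    printf '%s\n' "$db"
  else
    printf '%s\n' "$fasta"
  fi
}
```

With this in place, the targets argument could be written as `"$(pick_targets clustered_targets.fasta clustered_targets.db)"`, so the same command line works whether or not the DB has been built yet.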

### More examples

There are a ton of examples on how to run (and how to break) Cerise in the [test](https://github.com/mooreryan/cerise/tree/main/cerise/test) directory of this repository. In this directory, you will see directories ending in `*.t`. Each of these specifies a self contained set of tests, including all the files needed to run tests in that directory. You will also find a `run.t` in each of the subdirectories. These files are where you will find the examples exercising the behavior of the `cerise` command line app. See [here](https://bitheap.org/cram/) for info on how to read these files.
7 changes: 6 additions & 1 deletion cerise/lib/cli.ml
@@ -52,7 +52,12 @@ let queries_term =
Arg.(required & pos 0 (some non_dir_file) None & info [] ~docv:"QUERIES" ~doc)

let targets_term =
let doc = "Path to target sequences" in
let doc =
"Path to target sequences/db. If you use the 'mmseqs' option, this can \
either be the path to sequences in a FASTA file, or an MMseqs2 sequence \
database. If you use an already existing database, then it will skip the \
database construction step."
in
Arg.(required & pos 1 (some non_dir_file) None & info [] ~docv:"TARGETS" ~doc)

let outdir_term =
27 changes: 16 additions & 11 deletions cerise/lib/clusters.ml
@@ -44,7 +44,7 @@ let maybe_keep_cluster_members ~centroid ~clusters ~keep =
(* After the first homology search, run this to figure out which seqs to use in
the next round. *)
let get_new_search_input_seq_ids ~query_clusters ~target_clusters btab_fname =
let init =
let clusters =
match (query_clusters, target_clusters) with
| None, None ->
failwith "you should provide either query or target clusters"
@@ -53,13 +53,18 @@ let get_new_search_input_seq_ids ~query_clusters ~target_clusters btab_fname =
| Some _, Some _ ->
(Some (Set.empty (module String)), Some (Set.empty (module String)))
in
In_channel.with_file btab_fname ~f:(fun ic ->
In_channel.fold_lines ic ~init
~f:(fun (keep_queries, keep_targets) line ->
match String.split ~on:'\t' line with
| query :: target :: _rest ->
( maybe_keep_cluster_members ~centroid:query
~clusters:query_clusters ~keep:keep_queries,
maybe_keep_cluster_members ~centroid:target
~clusters:target_clusters ~keep:keep_targets )
| _ -> failwith "bad btab file"))
let count, clusters =
In_channel.with_file btab_fname ~f:(fun ic ->
In_channel.fold_lines ic ~init:(0, clusters)
~f:(fun (i, (keep_queries, keep_targets)) line ->
match String.split ~on:'\t' line with
| query :: target :: _rest ->
( i + 1,
( maybe_keep_cluster_members ~centroid:query
~clusters:query_clusters ~keep:keep_queries,
maybe_keep_cluster_members ~centroid:target
~clusters:target_clusters ~keep:keep_targets ) )
| _ -> failwith "bad btab file"))
in
if count > 0 then clusters
else Utils.abort "ERROR: there were no hits of your queries to your targets"
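
The counting fold above can be mirrored in shell for a quick check outside Cerise. This is a minimal sketch, not part of the commit: `count_hits` and `check_hits` are hypothetical helper names, the exit codes are illustrative choices, and it assumes the search output is tab-separated with the query and target in the first two columns, as the `String.split ~on:'\t'` match implies.

```shell
# Hypothetical helpers; names and exit codes are illustrative only.

# Print the number of hit lines in a btab file.
# A line with fewer than two tab-separated fields is treated as malformed.
count_hits() {
  awk -F '\t' '
    NF < 2 { bad = 1; exit 2 }
    { n++ }
    END {
      if (bad) exit 2
      print n + 0
    }
  ' "$1"
}

# Report the hit count, or fail the way the OCaml code does:
# status 2 for a malformed file, status 1 when there are no hits.
check_hits() {
  if ! n=$(count_hits "$1"); then
    echo "bad btab file" >&2
    return 2
  fi
  if [ "$n" -gt 0 ]; then
    echo "$n hits"
  else
    echo "ERROR: there were no hits of your queries to your targets" >&2
    return 1
  fi
}
```

For instance, `check_hits cerise_out/cerise.first_search.tsv` could be run after a search, assuming that file is the btab input the OCaml function reads.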
4 changes: 4 additions & 0 deletions cerise/lib/utils.ml
@@ -1,5 +1,9 @@
open! Core

let abort ?(exit_code = 1) msg =
let () = eprintf "%s\n" msg in
Caml.exit exit_code

(* See
https://github.com/ocaml/dune/commit/154272b779fe8943a9ce1b4afabb30150ab94ba6 *)

91 changes: 91 additions & 0 deletions cerise/test/mmseqs_search.t/run.t
@@ -97,3 +97,94 @@ Bad target clusters.
$ grep -A1 Failure cerise_oe | sed -E 's/^ +//'
(Failure "bad line in clusters file 'apple\tpie\tgood'")


Trying to do clustered targets but missing --all-targets.

$ if [ -d cerise_out ]; then rm -r cerise_out; fi
$ cerise -vv clustered_queries.fasta clustered_targets.fasta --target-clusters target_clusters.tsv --search-config='-s=1 --threads=4' 2> err
[2]
$ grep -A1 Failure err | sed -E 's/^ +//'
(Failure
"--target-clusters and --all-targets must both be present, or neither should be present")

Trying to do clustered targets but missing --target-clusters.

$ if [ -d cerise_out ]; then rm -r cerise_out; fi
$ cerise -vv clustered_queries.fasta clustered_targets.fasta --all-targets targets.fasta --search-config='-s=1 --threads=4' 2> err
[2]
$ grep -A1 Failure err | sed -E 's/^ +//'
(Failure "you need to have at least one of query or target clusters")


Trying to do clustered queries but missing --all-queries.

$ if [ -d cerise_out ]; then rm -r cerise_out; fi
$ cerise -vv clustered_queries.fasta clustered_targets.fasta --query-clusters query_clusters.tsv --search-config='-s=1 --threads=4' 2> err
[2]
$ grep -A1 Failure err | sed -E 's/^ +//'
(Failure
"--query-clusters and --all-queries must both be present, or neither should be present")

Trying to do clustered queries but missing --query-clusters.

$ if [ -d cerise_out ]; then rm -r cerise_out; fi
$ cerise -vv clustered_queries.fasta clustered_targets.fasta --all-queries queries.fasta --search-config='-s=1 --threads=4' 2> err
[2]
$ grep -A1 Failure err | sed -E 's/^ +//'
(Failure "you need to have at least one of query or target clusters")


Premade target DB; both queries and targets clustered.

$ if [ -d cerise_out ]; then rm -r cerise_out; fi
$ mmseqs createdb clustered_targets.fasta clustered_targets.db > /dev/null 2>&1
$ cerise -vv clustered_queries.fasta clustered_targets.db --query-clusters query_clusters.tsv --target-clusters target_clusters.tsv --all-queries queries.fasta --all-targets targets.fasta --search-config='-s=1 --threads=4' > cerise_oe 2>&1
$ ls cerise_out
cerise.first_search.tsv
cerise.new_queries.fasta
cerise.new_targets.fasta
cerise.second_search.tsv
command_logs.txt
$ grep '^>' cerise_out/cerise.new_queries.fasta | cut -f1 -d' ' | sort | diff - expected_new_queries__both.txt
$ grep '^>' cerise_out/cerise.new_targets.fasta | cut -f1 -d' ' | sort | diff - expected_new_targets__both.txt
$ sort -k1,2 cerise_out/cerise.first_search.tsv | cut -f1,2 | diff - expected_first_search__both.tsv
$ sort -k1,2 cerise_out/cerise.second_search.tsv | cut -f1,2 | diff - expected_second_search__both.tsv

Premade target DB; just queries clustered.

$ if [ -d cerise_out ]; then rm -r cerise_out; fi
$ mmseqs createdb clustered_targets.fasta clustered_targets.db > /dev/null 2>&1
$ cerise clustered_queries.fasta clustered_targets.db --query-clusters query_clusters.tsv --all-queries queries.fasta --search-config='-s=1 --threads=4' > cerise_oe 2>&1
$ ls cerise_out
cerise.first_search.tsv
cerise.new_queries.fasta
cerise.second_search.tsv
command_logs.txt
$ grep '^>' cerise_out/cerise.new_queries.fasta | cut -f1 -d' ' | sort | diff - expected_new_queries__clustered_queries.txt
$ sort -k1,2 cerise_out/cerise.first_search.tsv | cut -f1,2 | diff - expected_first_search__clustered_queries.tsv
$ sort -k1,2 cerise_out/cerise.second_search.tsv | cut -f1,2 | diff - expected_second_search__clustered_queries.tsv

Premade target DB; just targets clustered.

$ if [ -d cerise_out ]; then rm -r cerise_out; fi
$ mmseqs createdb clustered_targets.fasta clustered_targets.db > /dev/null 2>&1
$ cerise clustered_queries.fasta clustered_targets.db --target-clusters target_clusters.tsv --all-targets targets.fasta --search-config='-s=1 --threads=4' > cerise_oe 2>&1
$ ls cerise_out
cerise.first_search.tsv
cerise.new_targets.fasta
cerise.second_search.tsv
command_logs.txt
$ grep '^>' cerise_out/cerise.new_targets.fasta | cut -f1 -d' ' | sort | diff - expected_new_targets__clustered_targets.txt
$ sort -k1,2 cerise_out/cerise.first_search.tsv | cut -f1,2 | diff - expected_first_search__clustered_targets.tsv
$ sort -k1,2 cerise_out/cerise.second_search.tsv | cut -f1,2 | diff - expected_second_search__clustered_targets.tsv

Queries have no hits in the DB.

$ if [ -d cerise_out ]; then rm -r cerise_out; fi
$ cerise silly_queries.faa clustered_targets.fasta --target-clusters target_clusters.tsv --all-targets targets.fasta --search-config='-s=1 --threads=4' > cerise_oe 2>&1
[1]
$ cat cerise_oe
ERROR: there were no hits of your queries to your targets
$ ls cerise_out
cerise.first_search.tsv
command_logs.txt
4 changes: 4 additions & 0 deletions cerise/test/mmseqs_search.t/silly_queries.faa
@@ -0,0 +1,4 @@
>s1
xxa
>s2
VVVVVVVVVVVVVVV
