Update RESULTS, add CHANGELOG, new Clojure versions of reverse-complement benchmark
1 parent 46a0b86, commit c5cc477
Showing 6 changed files with 293 additions and 30 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
Jul 29, 2009 | ||
|
||
Added some Clojure programs for the reverse-complement benchmark. See | ||
rcomp/README for some memory issues I'm still trying to solve. | ||
|
||
Found a significant speedup for the Clojure program for the | ||
k-nucleotide problem. Lesson: if you call the same constant map as a | ||
function many times, assign it to a Var once and let-bind it to a | ||
local name where it is used. Don't put the map literal itself in the | ||
first position of a form -- Clojure will create a new map each time | ||
before calling it. |
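The lesson in this CHANGELOG entry can be sketched as follows. This is a minimal illustration, not the k-nucleotide program itself: `base-index`, `sum-indices-inline` and `sum-indices` are hypothetical names, and the sketch does not measure whether a given Clojure version actually rebuilds the literal on each call.

```clojure
;; Hypothetical example of the pattern described in the CHANGELOG entry.

;; The constant map is assigned to a Var once.
(def base-index {\A 0, \C 1, \G 2, \T 3})

(defn sum-indices-inline
  "Pattern the entry warns against: the map literal sits in the
  first position of a form that runs once per character."
  [s]
  (reduce (fn [acc c] (+ acc ({\A 0, \C 1, \G 2, \T 3} c 0))) 0 s))

(defn sum-indices
  "Pattern the entry recommends: let-bind the Var's value to a local
  name once, then call that local as a function."
  [s]
  (let [idx base-index]
    (reduce (fn [acc c] (+ acc (idx c 0))) 0 s)))
```

Both functions return the same result; only the place where the map is created differs.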
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
July 29, 2009 | ||
|
||
revcomp.clj-4.clj is the best I've got so far, but it runs out of | ||
memory on the full size benchmark. | ||
|
||
long-input.txt contains 3 DNA sequences in FASTA format. The first is | ||
50,000,000 characters long, the second is 75,000,000 characters long, | ||
and the third is 125,000,000 characters long. Each needs to be | ||
reversed, have each character replaced with a different one, and | ||
printed out, so we need to store each of the strings one at a time, | ||
but it is acceptable to deallocate/garbage-collect the previous one | ||
when starting on the next. I think my code should be doing that, but | ||
I don't know how to verify that. | ||
|
||
I've read that a Java String takes 2 bytes per character, plus about | ||
38 bytes of overhead per string. That is about 250 Mbytes for the | ||
longest one. I also read in a seq of lines, and these long strings | ||
are split into lines with 60 characters (plus a newline) each. Thus | ||
the string's data needs to be stored at least twice temporarily -- | ||
once for the many 60-character strings, plus the final long one. | ||
Also, the Java StringBuilder that Clojure's (str ...) function uses | ||
probably needs to be copied and reallocated periodically as it | ||
outgrows its current allocation. So I could imagine needing about 3 * | ||
250 Mbytes temporarily, but that doesn't explain why my 1536 Mbytes of | ||
JVM memory are being exhausted. | ||
|
||
It would be possible to improve things by not creating all of the | ||
separate strings, one for each line, and then concatenating them | ||
together. But first I'd like to explain why it is using so much, | ||
because I must be missing something. |
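The estimate in this README can be written out as arithmetic. The sketch below uses only the figures quoted above (2 bytes per character, about 38 bytes of overhead per string, 60-character lines, 125,000,000 characters in the longest sequence). All names are hypothetical, and the result is a rough lower bound: it ignores the cells of the line seq and any StringBuilder reallocation.

```clojure
;; Figures taken from the README text; names are illustrative only.
(def bytes-per-char 2)
(def per-string-overhead 38)       ; approximate, as quoted in the README
(def line-len 60)
(def longest-seq-chars 125000000)

;; One long string holding the whole sequence.
(def long-string-bytes (* bytes-per-char longest-seq-chars))

;; Number of 60-character lines needed for the longest sequence
;; (integer ceiling division).
(def line-count (quot (+ longest-seq-chars (dec line-len)) line-len))

;; The same characters held as many short strings: character data
;; plus the per-string overhead for every line.
(def line-strings-bytes
  (+ long-string-bytes (* line-count per-string-overhead)))

;; The README's rough "3 * 250 Mbytes" figure.
(def readme-estimate-bytes (* 3 long-string-bytes))

(println "long string:     " long-string-bytes "bytes")
(println "per-line strings:" line-strings-bytes "bytes")
(println "README estimate: " readme-estimate-bytes "bytes")
```

Under these assumptions the per-line strings cost somewhat more than the single long string because of the per-string overhead, but the total still falls well short of the 1536 Mbytes being exhausted, which is consistent with the README's suspicion that something else is holding memory.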
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,159 @@ | ||
;; Author: Andy Fingerhut (andy_fingerhut@alum.wustl.edu) | ||
;; Date: Jul 29, 2009 | ||
|
||
;; I don't see why revcomp.clj-2.clj is using so much memory. My next | ||
;; idea is to memory map the input file, and only copy small parts of | ||
;; it at a time into other buffers, reversed and complemented, for | ||
;; writing. Not necessarily very Clojure-ish, but the input data is | ||
;; too big for the most straightforward kinds of solutions. | ||
|
||
;; Unfortunately, I don't think we can memory-map standard input. | ||
|
||
;; The next most memory-efficient thing I can think of is to slurp in | ||
;; part of the input file at a time -- only one DNA sequence's worth | ||
;; at a time, not as a sequence of lines, but as a string or byte | ||
;; array. | ||
|
||
|
||
;;(set! *warn-on-reflection* true) | ||
|
||
(defn fasta-description-line | ||
"Return true when the line l is a FASTA description line" | ||
[l] | ||
(= \> (first (seq l)))) | ||
|
||
|
||
(defn split-with-2 | ||
[pred coll] | ||
(loop [s (seq coll) | ||
reversed-take-while '()] | ||
(if s | ||
(if (pred (first s)) | ||
(recur (next s) (cons (first s) reversed-take-while)) | ||
[(reverse reversed-take-while) s]) | ||
[(reverse reversed-take-while) '()]))) | ||
|
||
|
||
;; TBD: Try avoiding the use of when-let, in case it might be causing | ||
;; me to hold on to a head of a sequence when I don't want to. | ||
|
||
(defn fasta-desc-dna-str-pairs | ||
"Take a sequence of lines from a FASTA format file. Return a lazy sequence of 2-element vectors [desc dna-seq], where desc is a FASTA description line, and dna-seq is the concatenated FASTA DNA sequence with that description. If the input file is big, you'll save lots of memory if you call this function in a with-open for the file, and don't hold on to the head of the lines parameter." | ||
[lines] | ||
(lazy-seq | ||
(let [lines (seq (drop-while (fn [l] (not (fasta-description-line l))) | ||
lines))] | ||
(when lines | ||
(let [[lines-before-next-desc next-desc-line-onwards] | ||
(split-with-2 (fn [l] (not (fasta-description-line l))) | ||
(rest lines))] | ||
(cons [(first lines) (apply str lines-before-next-desc)] | ||
(fasta-desc-dna-str-pairs next-desc-line-onwards))))))) | ||
|
||
|
||
(comment | ||
|
||
(defn fasta-desc-dna-str-pairs | ||
"Take a sequence of lines from a FASTA format file. Return a lazy sequence of 2-element vectors [desc dna-seq], where desc is a FASTA description line, and dna-seq is the concatenated FASTA DNA sequence with that description. If the input file is big, you'll save lots of memory if you call this function in a with-open for the file, and don't hold on to the head of the lines parameter." | ||
[lines] | ||
(lazy-seq | ||
(when-let [lines (drop-while (fn [l] | ||
(not (fasta-description-line l))) | ||
lines)] | ||
(when-let [lines (seq lines)] | ||
(let [[lines-before-next-desc next-desc-line-onwards] | ||
(split-with-2 (fn [l] (not (fasta-description-line l))) | ||
(rest lines))] | ||
(cons [(first lines) (apply str lines-before-next-desc)] | ||
(fasta-desc-dna-str-pairs next-desc-line-onwards))))))) | ||
) | ||
|
||
|
||
(def complement-of-dna-char | ||
{\w \W, \W \W, | ||
\s \S, \S \S, | ||
\a \T, \A \T, | ||
\t \A, \T \A, | ||
\u \A, \U \A, | ||
\g \C, \G \C, | ||
\c \G, \C \G, | ||
\y \R, \Y \R, | ||
\r \Y, \R \Y, | ||
\k \M, \K \M, | ||
\m \K, \M \K, | ||
\b \V, \B \V, | ||
\d \H, \D \H, | ||
\h \D, \H \D, | ||
\v \B, \V \B, | ||
\n \N, \N \N }) | ||
|
||
|
||
(defn print-reverse-complement-of-str-in-lines [#^java.io.BufferedWriter bw | ||
#^java.lang.String s | ||
max-len] | ||
(let [comp complement-of-dna-char | ||
len (int (count s)) | ||
max-len (int max-len)] | ||
(when (> len 0) | ||
(loop [start (int (dec len)) | ||
to-print-before-nl (int max-len)] | ||
(let [next-start (int (dec start)) | ||
next-to-print-before-nl (int (dec to-print-before-nl))] | ||
(. bw write (int (comp (. s charAt start)))) | ||
(when (zero? next-to-print-before-nl) | ||
(. bw newLine)) | ||
(when (not (zero? start)) | ||
(if (zero? next-to-print-before-nl) | ||
(recur next-start max-len) | ||
(recur next-start next-to-print-before-nl))))) | ||
;; Need one more newline at the end if the string was not a | ||
;; multiple of max-len characters. | ||
(when (not= 0 (rem len max-len)) | ||
(. bw newLine)) | ||
))) | ||
|
||
|
||
(defn println-string-to-buffered-writer [#^java.io.BufferedWriter bw | ||
#^java.lang.String s] | ||
(. bw write (.toCharArray s) 0 (count s)) | ||
(. bw newLine)) | ||
|
||
|
||
(let [max-dna-chars-per-line 60] | ||
;; (with-open [br (java.io.BufferedReader. *in*) | ||
;; bw (java.io.BufferedWriter. *out*)] | ||
(let [br (java.io.BufferedReader. *in*) | ||
bw (java.io.BufferedWriter. *out*)] | ||
(doseq [[desc dna-seq] (fasta-desc-dna-str-pairs (line-seq br))] | ||
(println-string-to-buffered-writer bw desc) | ||
;; (println-string-to-buffered-writer bw dna-seq) | ||
(print-reverse-complement-of-str-in-lines bw dna-seq | ||
max-dna-chars-per-line) | ||
(. bw flush)) | ||
)) | ||
|
||
|
||
(comment | ||
|
||
(let [max-dna-chars-per-line 60] | ||
;; (with-open [br (java.io.BufferedReader. *in*) | ||
;; bw (java.io.BufferedWriter. *out*)] | ||
(let [br (java.io.BufferedReader. *in*) | ||
bw (java.io.BufferedWriter. *out*)] | ||
(doseq [[dna-seq-num [desc dna-seq]] | ||
(map (fn [x y] [x y]) | ||
(iterate inc 1) | ||
(fasta-desc-dna-str-pairs (line-seq br)))] | ||
(println-string-to-buffered-writer bw (format "%d" dna-seq-num)) | ||
(println-string-to-buffered-writer bw desc) | ||
;; (. bw flush) | ||
(print-reverse-complement-of-str-in-lines bw dna-seq | ||
max-dna-chars-per-line) | ||
;; (. bw flush) | ||
) | ||
(. bw flush) | ||
)) | ||
) | ||
|
||
|
||
(. System (exit 0)) |
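To make the expected output of the program above easier to check by hand, here is a small in-memory sketch of the same transformation on a short string. It is not the streaming approach used in the file: `complement-of` is a reduced, hypothetical table covering only A, C, G and T, and `reverse-complement-lines` is a hypothetical helper that builds the whole result eagerly, so it is only suitable for tiny inputs.

```clojure
;; Reduced complement table for illustration; the file above uses a
;; fuller table that also covers lowercase and ambiguity codes.
(def complement-of {\A \T, \T \A, \C \G, \G \C})

(defn reverse-complement-lines
  "Return the reverse complement of string s as a vector of strings,
  each at most max-len characters long. Characters missing from
  complement-of map to nil and are silently dropped by str."
  [s max-len]
  (let [rc (apply str (map complement-of (reverse s)))]
    (mapv #(apply str %) (partition-all max-len rc))))

;; Usage: reverse "AACGT" to "TGCAA", complement to "ACGTT",
;; then wrap at 2 characters per line.
(doseq [line (reverse-complement-lines "AACGT" 2)]
  (println line))
```

Comparing this against the output of `print-reverse-complement-of-str-in-lines` on the same short input is one way to sanity-check the line wrapping, including the final partial line.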