Navigation Menu

Skip to content

Commit

Permalink
Update RESULTS, add CHANGELOG, new Clojure versions of reverse-comple…
Browse files Browse the repository at this point in the history
…ment benchmark
  • Loading branch information
jafingerhut committed Jul 30, 2009
1 parent 46a0b86 commit c5cc477
Show file tree
Hide file tree
Showing 6 changed files with 293 additions and 30 deletions.
11 changes: 11 additions & 0 deletions CHANGELOG
@@ -0,0 +1,11 @@
Jul 29, 2009

Added some Clojure programs for the reverse-complement benchmark. See
rcomp/README for some memory issues I'm still trying to solve.

Found a significant speedup for the Clojure program for the
k-nucleotide problem. Lesson: If you want to use the same constant
map like a function many times, assign it to a Var once, and let-bind
it where it is used to a name. Don't just have the map there in the
first element of a form -- Clojure will recreate a new map each time
before calling it.
51 changes: 32 additions & 19 deletions RESULTS
Expand Up @@ -18,39 +18,35 @@ Times are real / user / sys on my iMac
-----------------------------------------------------
noT noT | noT
long test on my iMac:
rcomp | 8.3 | 11.9 | | 8.5 | no implementation yet
| 4.8 | 7.9 | | 3.6 |
| 2.1 | 2.3 | | 1.4 |
rcomp | 8.3 | 11.9 | | 8.5 | current
| 4.8 | 7.9 | | 3.6 | implementation
| 2.1 | 2.3 | | 1.4 | crashes on long test

long test on my iMac:
mand- | wrong | out of | 32.7 | 28.6 | 340.4
elbrot | output | mem | 59.3 | 54.4 | 350.5
| | (?) | 0.8 | 0.4 | 4.7
mand- | wrong | out of | 32.7 | 28.6 | 325.0
elbrot | output | mem | 59.3 | 54.4 | 326.1 clj =
| | (?) | 0.8 | 0.4 | 3.2 6.01 x java

noT noT | T
long test on my iMac:
k-nuc- | 190.9 | 306.0 | 90.5 | 52.4 | 1677.6 (27m 57.6s)
leotide | 187.9 | 302.7 | 130.8 | 89.6 | 2245.1 (37m 25.1s)
| 2.4 | 1.9 | 4.6 | 1.8 | 24.2 ( 24.2s)

Disappointing result: Replacing dna-char-to-code-val with a macro did
not speed things up, and may have slowed them down:
| 1800.0 (30m 0.0s)
| 2317.0 (38m 37.0s)
| 30.6 ( 30.6s)
k-nuc- | 190.9 | 306.0 | 90.5 | 53.5 | 1286.4 (21m 26.4s)
leotide | 187.9 | 302.7 | 130.8 | 91.8 | 1118.1 (18m 38.1s) clj =
| 2.4 | 1.9 | 4.6 | 2.1 | 21.8 ( 21.8s) 12.1 x java

k-nucleotide long test on benchmark shootout machine:
| 164.9 | 249.8 | 52.0 | 20.6 |
| 164.9 | 246.7 | 112.5 | 58.8 |
| ? | ? | ? | ? |

k-nucleotide medium test on my iMac:
| 8.6 | 12.7 | 3.9 | 3.9 | 64.2 / 69.6 / 69.1 / 67.1
| 7.9 | 12.5 | 5.4 | 5.7 | 98.4 / 92.9 / 93.1 / 88.1
| 0.6 | 0.1 | 0.3 | 0.2 | 1.5 / 1.6 / 1.6 / 1.7
| 8.6 | 12.7 | 3.9 | 3.9 | 43.8
| 7.9 | 12.5 | 5.4 | 5.7 | 46.5 clj =
| 0.6 | 0.1 | 0.3 | 0.2 | 1.5 8.1 x java

k-nucleotide medium test, all clj, modified-pmap with specified number
of parallel threads, on my iMac:
of parallel threads, on my iMac (note, these are from
knucleotide.clj-5.clj, not the slightly improved knucleotide.clj-6.clj
version):
| 1 | 2 | 3 | 4 | 5 | 6 |
| 74.9 | 70.7 | 77.2 | 76.8 | 82.5 | 77.8 |
| 125.9 | 122.1 | 134.6 | 134.0 | 143.4 | 134.0 |
Expand All @@ -61,6 +57,23 @@ fasta
thread-ring


The longest few benchmark run on my iMac are:

mins benchmark test language

127 mandelbrot long perl
21 knuc long clj (21m 26.4s / 18m 38.1s / 21.8s)
9 fasta long perl
5 mandelbrot long clj
5 knuc long perl
3 knuc long sbcl
2 mandelbrot long sbcl
1.5 knuc long ghc
0.9 knuc long java
? rcomp long clj (runs out of memory or crahes with IOException)



mandelbrot notes:

sbcl version implemented with threads. I run it with only 1 thread.
Expand Down
30 changes: 30 additions & 0 deletions rcomp/README
@@ -0,0 +1,30 @@
July 29, 2009

revcomp.clj-4.clj is the best I've got so far, but it runs out of
memory on the full size benchmark.

long-input.txt contains 3 DNA sequences in FASTA format. The first is
50,000,000 characters long, the second is 75,000,000 characters long,
and the third is 125,000,000 characters long. Each needs to be
reversed, have each character replaced with a different one, and
printed out, so we need to store each of the strings one at a time,
but it is acceptable to deallocate/garbage-collect the previous one
when starting on the next. I think my code should be doing that, but
I don't know how to verify that.

I've read that a Java String takes 2 bytes per character, plus about
38 bytes of overhead per string. That is about 250 Mbytes for the
longest one. I also read in a seq of lines, and these long strings
are split into lines with 60 characters (plus a newline) each. Thus
the string's data needs to be stored at least twice temporarily --
once for the many 60-character strings, plus the final long one.
Also, the Java StringBuilder that Clojure's (str ...) function uses
probably needs to be copied and reallocated periodically as it
outgrows its current allocation. So I could imagine needing about 3 *
250 Mbytes temporarily, but that doesn't explain why my 1536 Mbytes of
JVM memory are being exhausted.

It would be possible to improve things by not creating all of the
separate strings, one for each line, and then concatenating them
together. But first I'd like to explain why it is using so much,
because I must be missing something.
2 changes: 1 addition & 1 deletion rcomp/clj-run.sh
Expand Up @@ -5,4 +5,4 @@

source ../env.sh

$JAVA -server -Xmx1280m ${JAVA_PROFILING} -cp ${CLOJURE_CLASSPATH} clojure.main revcomp.clj-2.clj "$@"
$JAVA -server -Xmx1536m ${JAVA_PROFILING} -cp ${CLOJURE_CLASSPATH} clojure.main revcomp.clj-4.clj "$@"
70 changes: 60 additions & 10 deletions rcomp/revcomp.clj-2.clj
Expand Up @@ -10,20 +10,41 @@
(= \> (first (seq l))))


;; TBD: Try avoiding the use of when-let, in case it might be causing
;; me to hold on to a head of a sequence when I don't want to.

(defn fasta-desc-dna-str-pairs
"Take a sequence of lines from a FASTA format file. Return a lazy sequence of 2-element vectors [desc dna-seq], where desc is a FASTA description line, and sdna-seq is the concatenated FASTA DNA sequence with that description. If input file is big, you'll save lots of memory if you call this function in a with-open for the file, and don't hold on to the head of the lines parameter."
[lines]
(when-let [x (drop-while (fn [l]
(not (fasta-description-line l)))
lines)]
(when-let [x (seq x)]
(lazy-seq
(lazy-seq
(let [lines (seq (drop-while (fn [l] (not (fasta-description-line l)))
lines))]
(when lines
(let [[lines-before-next-desc next-desc-line-onwards]
(split-with (fn [l] (not (fasta-description-line l))) (rest x))]
(cons [(first x) (apply str lines-before-next-desc)]
(split-with (fn [l] (not (fasta-description-line l)))
(rest lines))]
(cons [(first lines) (apply str lines-before-next-desc)]
(fasta-desc-dna-str-pairs next-desc-line-onwards)))))))


(comment

(defn fasta-desc-dna-str-pairs
"Take a sequence of lines from a FASTA format file. Return a lazy sequence of 2-element vectors [desc dna-seq], where desc is a FASTA description line, and sdna-seq is the concatenated FASTA DNA sequence with that description. If input file is big, you'll save lots of memory if you call this function in a with-open for the file, and don't hold on to the head of the lines parameter."
[lines]
(lazy-seq
(when-let [lines (drop-while (fn [l]
(not (fasta-description-line l)))
lines)]
(when-let [lines (seq lines)]
(let [[lines-before-next-desc next-desc-line-onwards]
(split-with (fn [l] (not (fasta-description-line l)))
(rest lines))]
(cons [(first lines) (apply str lines-before-next-desc)]
(fasta-desc-dna-str-pairs next-desc-line-onwards)))))))
)


(def complement-of-dna-char
{\w \W, \W \W,
\s \S, \S \S,
Expand Down Expand Up @@ -75,11 +96,40 @@


(let [max-dna-chars-per-line 60]
(with-open [br (java.io.BufferedReader. *in*)
bw (java.io.BufferedWriter. *out*)]
;; (with-open [br (java.io.BufferedReader. *in*)
;; bw (java.io.BufferedWriter. *out*)]
(let [br (java.io.BufferedReader. *in*)
bw (java.io.BufferedWriter. *out*)]
(doseq [[desc dna-seq] (fasta-desc-dna-str-pairs (line-seq br))]
(println-string-to-buffered-writer bw desc)
(println-string-to-buffered-writer bw dna-seq)
;; (print-reverse-complement-of-str-in-lines bw dna-seq
;; max-dna-chars-per-line)
(. bw flush))
))


(comment

(let [max-dna-chars-per-line 60]
;; (with-open [br (java.io.BufferedReader. *in*)
;; bw (java.io.BufferedWriter. *out*)]
(let [br (java.io.BufferedReader. *in*)
bw (java.io.BufferedWriter. *out*)]
(doseq [[dna-seq-num [desc dna-seq]]
(map (fn [x y] [x y])
(iterate inc 1)
(fasta-desc-dna-str-pairs (line-seq br)))]
(println-string-to-buffered-writer bw (format "%d" dna-seq-num))
(println-string-to-buffered-writer bw desc)
;; (. bw flush)
(print-reverse-complement-of-str-in-lines bw dna-seq
max-dna-chars-per-line))))
max-dna-chars-per-line)
;; (. bw flush)
)
(. bw flush)
))
)


(. System (exit 0))
159 changes: 159 additions & 0 deletions rcomp/revcomp.clj-4.clj
@@ -0,0 +1,159 @@
;; Author: Andy Fingerhut (andy_fingerhut@alum.wustl.edu)
;; Date: Jul 29, 2009

;; I don't see why revcomp.clj-2.clj is using so much memory. My next
;; idea is to memory map the input file, and only copy small parts of
;; it at a time into other buffers, reversed and complemented, for
;; writing. Not necessarily very Clojure-ish, but the input data is
;; too big for the most straightforward kinds of solutions.

;; Unfortunately, I don't think we can memory-map standard input.

;; The next most memory-efficient thing I can think of is to slurp in
;; part of the input file at a time -- only one DNA sequence's worth
;; at a time, not a as a sequence of lines, but as a string or byte
;; array.


;;(set! *warn-on-reflection* true)

(defn fasta-description-line
"Return true when the line l is a FASTA description line"
[l]
(= \> (first (seq l))))


(defn split-with-2
[pred coll]
(loop [s (seq coll)
reversed-take-while '()]
(if s
(if (pred (first s))
(recur (next s) (cons (first s) reversed-take-while))
[(reverse reversed-take-while) s])
[(reverse reversed-take-while) '()])))


;; TBD: Try avoiding the use of when-let, in case it might be causing
;; me to hold on to a head of a sequence when I don't want to.

(defn fasta-desc-dna-str-pairs
"Take a sequence of lines from a FASTA format file. Return a lazy sequence of 2-element vectors [desc dna-seq], where desc is a FASTA description line, and sdna-seq is the concatenated FASTA DNA sequence with that description. If input file is big, you'll save lots of memory if you call this function in a with-open for the file, and don't hold on to the head of the lines parameter."
[lines]
(lazy-seq
(let [lines (seq (drop-while (fn [l] (not (fasta-description-line l)))
lines))]
(when lines
(let [[lines-before-next-desc next-desc-line-onwards]
(split-with-2 (fn [l] (not (fasta-description-line l)))
(rest lines))]
(cons [(first lines) (apply str lines-before-next-desc)]
(fasta-desc-dna-str-pairs next-desc-line-onwards)))))))


(comment

(defn fasta-desc-dna-str-pairs
"Take a sequence of lines from a FASTA format file. Return a lazy sequence of 2-element vectors [desc dna-seq], where desc is a FASTA description line, and sdna-seq is the concatenated FASTA DNA sequence with that description. If input file is big, you'll save lots of memory if you call this function in a with-open for the file, and don't hold on to the head of the lines parameter."
[lines]
(lazy-seq
(when-let [lines (drop-while (fn [l]
(not (fasta-description-line l)))
lines)]
(when-let [lines (seq lines)]
(let [[lines-before-next-desc next-desc-line-onwards]
(split-with-2 (fn [l] (not (fasta-description-line l)))
(rest lines))]
(cons [(first lines) (apply str lines-before-next-desc)]
(fasta-desc-dna-str-pairs next-desc-line-onwards)))))))
)


(def complement-of-dna-char
{\w \W, \W \W,
\s \S, \S \S,
\a \T, \A \T,
\t \A, \T \A,
\u \A, \U \A,
\g \C, \G \C,
\c \G, \C \G,
\y \R, \Y \R,
\r \Y, \R \Y,
\k \M, \K \M,
\m \K, \M \K,
\b \V, \B \V,
\d \H, \D \H,
\h \D, \H \D,
\v \B, \V \B,
\n \N, \N \N })


(defn print-reverse-complement-of-str-in-lines [#^java.io.BufferedWriter bw
#^java.lang.String s
max-len]
(let [comp complement-of-dna-char
len (int (count s))
max-len (int max-len)]
(when (> len 0)
(loop [start (int (dec len))
to-print-before-nl (int max-len)]
(let [next-start (int (dec start))
next-to-print-before-nl (int (dec to-print-before-nl))]
(. bw write (int (comp (. s charAt start))))
(when (zero? next-to-print-before-nl)
(. bw newLine))
(when (not (zero? start))
(if (zero? next-to-print-before-nl)
(recur next-start max-len)
(recur next-start next-to-print-before-nl)))))
;; Need one more newline at the end if the string was not a
;; multiple of max-len characters.
(when (not= 0 (rem len max-len))
(. bw newLine))
)))


(defn println-string-to-buffered-writer [#^java.io.BufferedWriter bw
#^java.lang.String s]
(. bw write (.toCharArray s) 0 (count s))
(. bw newLine))


(let [max-dna-chars-per-line 60]
;; (with-open [br (java.io.BufferedReader. *in*)
;; bw (java.io.BufferedWriter. *out*)]
(let [br (java.io.BufferedReader. *in*)
bw (java.io.BufferedWriter. *out*)]
(doseq [[desc dna-seq] (fasta-desc-dna-str-pairs (line-seq br))]
(println-string-to-buffered-writer bw desc)
;; (println-string-to-buffered-writer bw dna-seq)
(print-reverse-complement-of-str-in-lines bw dna-seq
max-dna-chars-per-line)
(. bw flush))
))


(comment

(let [max-dna-chars-per-line 60]
;; (with-open [br (java.io.BufferedReader. *in*)
;; bw (java.io.BufferedWriter. *out*)]
(let [br (java.io.BufferedReader. *in*)
bw (java.io.BufferedWriter. *out*)]
(doseq [[dna-seq-num [desc dna-seq]]
(map (fn [x y] [x y])
(iterate inc 1)
(fasta-desc-dna-str-pairs (line-seq br)))]
(println-string-to-buffered-writer bw (format "%d" dna-seq-num))
(println-string-to-buffered-writer bw desc)
;; (. bw flush)
(print-reverse-complement-of-str-in-lines bw dna-seq
max-dna-chars-per-line)
;; (. bw flush)
)
(. bw flush)
))
)


(. System (exit 0))

0 comments on commit c5cc477

Please sign in to comment.