Permalink
Browse files

Update RESULTS, add CHANGELOG, new Clojure versions of reverse-comple…

…ment benchmark
  • Loading branch information...
1 parent 46a0b86 commit c5cc47790e2e7320c4eae47630bd6d3258b09c16 @jafingerhut committed Jul 30, 2009
Showing with 293 additions and 30 deletions.
  1. +11 −0 CHANGELOG
  2. +32 −19 RESULTS
  3. +30 −0 rcomp/README
  4. +1 −1 rcomp/clj-run.sh
  5. +60 −10 rcomp/revcomp.clj-2.clj
  6. +159 −0 rcomp/revcomp.clj-4.clj
View
@@ -0,0 +1,11 @@
+Jul 29, 2009
+
+Added some Clojure programs for the reverse-complement benchmark. See
+rcomp/README for some memory issues I'm still trying to solve.
+
+Found a significant speedup for the Clojure program for the
+k-nucleotide problem. Lesson: If you want to use the same constant
+map like a function many times, assign it to a Var once, and let-bind
+it where it is used to a name. Don't just have the map there in the
+first element of a form -- Clojure will recreate a new map each time
+before calling it.
View
@@ -18,39 +18,35 @@ Times are real / user / sys on my iMac
-----------------------------------------------------
noT noT | noT
long test on my iMac:
-rcomp | 8.3 | 11.9 | | 8.5 | no implementation yet
- | 4.8 | 7.9 | | 3.6 |
- | 2.1 | 2.3 | | 1.4 |
+rcomp | 8.3 | 11.9 | | 8.5 | current
+ | 4.8 | 7.9 | | 3.6 | implementation
+ | 2.1 | 2.3 | | 1.4 | crashes on long test
long test on my iMac:
-mand- | wrong | out of | 32.7 | 28.6 | 340.4
-elbrot | output | mem | 59.3 | 54.4 | 350.5
- | | (?) | 0.8 | 0.4 | 4.7
+mand- | wrong | out of | 32.7 | 28.6 | 325.0
+elbrot | output | mem | 59.3 | 54.4 | 326.1 clj =
+ | | (?) | 0.8 | 0.4 | 3.2 6.01 x java
noT noT | T
long test on my iMac:
-k-nuc- | 190.9 | 306.0 | 90.5 | 52.4 | 1677.6 (27m 57.6s)
-leotide | 187.9 | 302.7 | 130.8 | 89.6 | 2245.1 (37m 25.1s)
- | 2.4 | 1.9 | 4.6 | 1.8 | 24.2 ( 24.2s)
-
-Disappointing result: Replacing dna-char-to-code-val with a macro did
-not speed things up, and may have slowed them down:
- | 1800.0 (30m 0.0s)
- | 2317.0 (38m 37.0s)
- | 30.6 ( 30.6s)
+k-nuc- | 190.9 | 306.0 | 90.5 | 53.5 | 1286.4 (21m 26.4s)
+leotide | 187.9 | 302.7 | 130.8 | 91.8 | 1118.1 (18m 38.1s) clj =
+ | 2.4 | 1.9 | 4.6 | 2.1 | 21.8 ( 21.8s) 12.1 x java
k-nucleotide long test on benchmark shootout machine:
| 164.9 | 249.8 | 52.0 | 20.6 |
| 164.9 | 246.7 | 112.5 | 58.8 |
| ? | ? | ? | ? |
k-nucleotide medium test on my iMac:
- | 8.6 | 12.7 | 3.9 | 3.9 | 64.2 / 69.6 / 69.1 / 67.1
- | 7.9 | 12.5 | 5.4 | 5.7 | 98.4 / 92.9 / 93.1 / 88.1
- | 0.6 | 0.1 | 0.3 | 0.2 | 1.5 / 1.6 / 1.6 / 1.7
+ | 8.6 | 12.7 | 3.9 | 3.9 | 43.8
+ | 7.9 | 12.5 | 5.4 | 5.7 | 46.5 clj =
+ | 0.6 | 0.1 | 0.3 | 0.2 | 1.5 8.1 x java
k-nucleotide medium test, all clj, modified-pmap with specified number
-of parallel threads, on my iMac:
+of parallel threads, on my iMac (note, these are from
+knucleotide.clj-5.clj, not the slightly improved knucleotide.clj-6.clj
+version):
| 1 | 2 | 3 | 4 | 5 | 6 |
| 74.9 | 70.7 | 77.2 | 76.8 | 82.5 | 77.8 |
| 125.9 | 122.1 | 134.6 | 134.0 | 143.4 | 134.0 |
@@ -61,6 +57,23 @@ fasta
thread-ring
+The longest few benchmark run on my iMac are:
+
+mins benchmark test language
+
+127 mandelbrot long perl
+ 21 knuc long clj (21m 26.4s / 18m 38.1s / 21.8s)
+ 9 fasta long perl
+ 5 mandelbrot long clj
+ 5 knuc long perl
+ 3 knuc long sbcl
+ 2 mandelbrot long sbcl
+ 1.5 knuc long ghc
+ 0.9 knuc long java
+ ? rcomp long clj (runs out of memory or crahes with IOException)
+
+
+
mandelbrot notes:
sbcl version implemented with threads. I run it with only 1 thread.
View
@@ -0,0 +1,30 @@
+July 29, 2009
+
+revcomp.clj-4.clj is the best I've got so far, but it runs out of
+memory on the full size benchmark.
+
+long-input.txt contains 3 DNA sequences in FASTA format. The first is
+50,000,000 characters long, the second is 75,000,000 characters long,
+and the third is 125,000,000 characters long. Each needs to be
+reversed, have each character replaced with a different one, and
+printed out, so we need to store each of the strings one at a time,
+but it is acceptable to deallocate/garbage-collect the previous one
+when starting on the next. I think my code should be doing that, but
+I don't know how to verify that.
+
+I've read that a Java String takes 2 bytes per character, plus about
+38 bytes of overhead per string. That is about 250 Mbytes for the
+longest one. I also read in a seq of lines, and these long strings
+are split into lines with 60 characters (plus a newline) each. Thus
+the string's data needs to be stored at least twice temporarily --
+once for the many 60-character strings, plus the final long one.
+Also, the Java StringBuilder that Clojure's (str ...) function uses
+probably needs to be copied and reallocated periodically as it
+outgrows its current allocation. So I could imagine needing about 3 *
+250 Mbytes temporarily, but that doesn't explain why my 1536 Mbytes of
+JVM memory are being exhausted.
+
+It would be possible to improve things by not creating all of the
+separate strings, one for each line, and then concatenating them
+together. But first I'd like to explain why it is using so much,
+because I must be missing something.
View
@@ -5,4 +5,4 @@
source ../env.sh
-$JAVA -server -Xmx1280m ${JAVA_PROFILING} -cp ${CLOJURE_CLASSPATH} clojure.main revcomp.clj-2.clj "$@"
+$JAVA -server -Xmx1536m ${JAVA_PROFILING} -cp ${CLOJURE_CLASSPATH} clojure.main revcomp.clj-4.clj "$@"
@@ -10,20 +10,41 @@
(= \> (first (seq l))))
+;; TBD: Try avoiding the use of when-let, in case it might be causing
+;; me to hold on to a head of a sequence when I don't want to.
+
(defn fasta-desc-dna-str-pairs
"Take a sequence of lines from a FASTA format file. Return a lazy sequence of 2-element vectors [desc dna-seq], where desc is a FASTA description line, and sdna-seq is the concatenated FASTA DNA sequence with that description. If input file is big, you'll save lots of memory if you call this function in a with-open for the file, and don't hold on to the head of the lines parameter."
[lines]
- (when-let [x (drop-while (fn [l]
- (not (fasta-description-line l)))
- lines)]
- (when-let [x (seq x)]
- (lazy-seq
+ (lazy-seq
+ (let [lines (seq (drop-while (fn [l] (not (fasta-description-line l)))
+ lines))]
+ (when lines
(let [[lines-before-next-desc next-desc-line-onwards]
- (split-with (fn [l] (not (fasta-description-line l))) (rest x))]
- (cons [(first x) (apply str lines-before-next-desc)]
+ (split-with (fn [l] (not (fasta-description-line l)))
+ (rest lines))]
+ (cons [(first lines) (apply str lines-before-next-desc)]
(fasta-desc-dna-str-pairs next-desc-line-onwards)))))))
+(comment
+
+(defn fasta-desc-dna-str-pairs
+ "Take a sequence of lines from a FASTA format file. Return a lazy sequence of 2-element vectors [desc dna-seq], where desc is a FASTA description line, and sdna-seq is the concatenated FASTA DNA sequence with that description. If input file is big, you'll save lots of memory if you call this function in a with-open for the file, and don't hold on to the head of the lines parameter."
+ [lines]
+ (lazy-seq
+ (when-let [lines (drop-while (fn [l]
+ (not (fasta-description-line l)))
+ lines)]
+ (when-let [lines (seq lines)]
+ (let [[lines-before-next-desc next-desc-line-onwards]
+ (split-with (fn [l] (not (fasta-description-line l)))
+ (rest lines))]
+ (cons [(first lines) (apply str lines-before-next-desc)]
+ (fasta-desc-dna-str-pairs next-desc-line-onwards)))))))
+)
+
+
(def complement-of-dna-char
{\w \W, \W \W,
\s \S, \S \S,
@@ -75,11 +96,40 @@
(let [max-dna-chars-per-line 60]
- (with-open [br (java.io.BufferedReader. *in*)
- bw (java.io.BufferedWriter. *out*)]
+;; (with-open [br (java.io.BufferedReader. *in*)
+;; bw (java.io.BufferedWriter. *out*)]
+ (let [br (java.io.BufferedReader. *in*)
+ bw (java.io.BufferedWriter. *out*)]
(doseq [[desc dna-seq] (fasta-desc-dna-str-pairs (line-seq br))]
(println-string-to-buffered-writer bw desc)
+ (println-string-to-buffered-writer bw dna-seq)
+;; (print-reverse-complement-of-str-in-lines bw dna-seq
+;; max-dna-chars-per-line)
+ (. bw flush))
+ ))
+
+
+(comment
+
+(let [max-dna-chars-per-line 60]
+;; (with-open [br (java.io.BufferedReader. *in*)
+;; bw (java.io.BufferedWriter. *out*)]
+ (let [br (java.io.BufferedReader. *in*)
+ bw (java.io.BufferedWriter. *out*)]
+ (doseq [[dna-seq-num [desc dna-seq]]
+ (map (fn [x y] [x y])
+ (iterate inc 1)
+ (fasta-desc-dna-str-pairs (line-seq br)))]
+ (println-string-to-buffered-writer bw (format "%d" dna-seq-num))
+ (println-string-to-buffered-writer bw desc)
+ ;; (. bw flush)
(print-reverse-complement-of-str-in-lines bw dna-seq
- max-dna-chars-per-line))))
+ max-dna-chars-per-line)
+ ;; (. bw flush)
+ )
+ (. bw flush)
+ ))
+)
+
(. System (exit 0))
@@ -0,0 +1,159 @@
+;; Author: Andy Fingerhut (andy_fingerhut@alum.wustl.edu)
+;; Date: Jul 29, 2009
+
+;; I don't see why revcomp.clj-2.clj is using so much memory. My next
+;; idea is to memory map the input file, and only copy small parts of
+;; it at a time into other buffers, reversed and complemented, for
+;; writing. Not necessarily very Clojure-ish, but the input data is
+;; too big for the most straightforward kinds of solutions.
+
+;; Unfortunately, I don't think we can memory-map standard input.
+
+;; The next most memory-efficient thing I can think of is to slurp in
+;; part of the input file at a time -- only one DNA sequence's worth
+;; at a time, not a as a sequence of lines, but as a string or byte
+;; array.
+
+
+;;(set! *warn-on-reflection* true)
+
+(defn fasta-description-line
+ "Return true when the line l is a FASTA description line"
+ [l]
+ (= \> (first (seq l))))
+
+
+(defn split-with-2
+ [pred coll]
+ (loop [s (seq coll)
+ reversed-take-while '()]
+ (if s
+ (if (pred (first s))
+ (recur (next s) (cons (first s) reversed-take-while))
+ [(reverse reversed-take-while) s])
+ [(reverse reversed-take-while) '()])))
+
+
+;; TBD: Try avoiding the use of when-let, in case it might be causing
+;; me to hold on to a head of a sequence when I don't want to.
+
+(defn fasta-desc-dna-str-pairs
+ "Take a sequence of lines from a FASTA format file. Return a lazy sequence of 2-element vectors [desc dna-seq], where desc is a FASTA description line, and sdna-seq is the concatenated FASTA DNA sequence with that description. If input file is big, you'll save lots of memory if you call this function in a with-open for the file, and don't hold on to the head of the lines parameter."
+ [lines]
+ (lazy-seq
+ (let [lines (seq (drop-while (fn [l] (not (fasta-description-line l)))
+ lines))]
+ (when lines
+ (let [[lines-before-next-desc next-desc-line-onwards]
+ (split-with-2 (fn [l] (not (fasta-description-line l)))
+ (rest lines))]
+ (cons [(first lines) (apply str lines-before-next-desc)]
+ (fasta-desc-dna-str-pairs next-desc-line-onwards)))))))
+
+
+(comment
+
+(defn fasta-desc-dna-str-pairs
+ "Take a sequence of lines from a FASTA format file. Return a lazy sequence of 2-element vectors [desc dna-seq], where desc is a FASTA description line, and sdna-seq is the concatenated FASTA DNA sequence with that description. If input file is big, you'll save lots of memory if you call this function in a with-open for the file, and don't hold on to the head of the lines parameter."
+ [lines]
+ (lazy-seq
+ (when-let [lines (drop-while (fn [l]
+ (not (fasta-description-line l)))
+ lines)]
+ (when-let [lines (seq lines)]
+ (let [[lines-before-next-desc next-desc-line-onwards]
+ (split-with-2 (fn [l] (not (fasta-description-line l)))
+ (rest lines))]
+ (cons [(first lines) (apply str lines-before-next-desc)]
+ (fasta-desc-dna-str-pairs next-desc-line-onwards)))))))
+)
+
+
+(def complement-of-dna-char
+ {\w \W, \W \W,
+ \s \S, \S \S,
+ \a \T, \A \T,
+ \t \A, \T \A,
+ \u \A, \U \A,
+ \g \C, \G \C,
+ \c \G, \C \G,
+ \y \R, \Y \R,
+ \r \Y, \R \Y,
+ \k \M, \K \M,
+ \m \K, \M \K,
+ \b \V, \B \V,
+ \d \H, \D \H,
+ \h \D, \H \D,
+ \v \B, \V \B,
+ \n \N, \N \N })
+
+
+(defn print-reverse-complement-of-str-in-lines [#^java.io.BufferedWriter bw
+ #^java.lang.String s
+ max-len]
+ (let [comp complement-of-dna-char
+ len (int (count s))
+ max-len (int max-len)]
+ (when (> len 0)
+ (loop [start (int (dec len))
+ to-print-before-nl (int max-len)]
+ (let [next-start (int (dec start))
+ next-to-print-before-nl (int (dec to-print-before-nl))]
+ (. bw write (int (comp (. s charAt start))))
+ (when (zero? next-to-print-before-nl)
+ (. bw newLine))
+ (when (not (zero? start))
+ (if (zero? next-to-print-before-nl)
+ (recur next-start max-len)
+ (recur next-start next-to-print-before-nl)))))
+ ;; Need one more newline at the end if the string was not a
+ ;; multiple of max-len characters.
+ (when (not= 0 (rem len max-len))
+ (. bw newLine))
+ )))
+
+
+(defn println-string-to-buffered-writer [#^java.io.BufferedWriter bw
+ #^java.lang.String s]
+ (. bw write (.toCharArray s) 0 (count s))
+ (. bw newLine))
+
+
+(let [max-dna-chars-per-line 60]
+;; (with-open [br (java.io.BufferedReader. *in*)
+;; bw (java.io.BufferedWriter. *out*)]
+ (let [br (java.io.BufferedReader. *in*)
+ bw (java.io.BufferedWriter. *out*)]
+ (doseq [[desc dna-seq] (fasta-desc-dna-str-pairs (line-seq br))]
+ (println-string-to-buffered-writer bw desc)
+;; (println-string-to-buffered-writer bw dna-seq)
+ (print-reverse-complement-of-str-in-lines bw dna-seq
+ max-dna-chars-per-line)
+ (. bw flush))
+ ))
+
+
+(comment
+
+(let [max-dna-chars-per-line 60]
+;; (with-open [br (java.io.BufferedReader. *in*)
+;; bw (java.io.BufferedWriter. *out*)]
+ (let [br (java.io.BufferedReader. *in*)
+ bw (java.io.BufferedWriter. *out*)]
+ (doseq [[dna-seq-num [desc dna-seq]]
+ (map (fn [x y] [x y])
+ (iterate inc 1)
+ (fasta-desc-dna-str-pairs (line-seq br)))]
+ (println-string-to-buffered-writer bw (format "%d" dna-seq-num))
+ (println-string-to-buffered-writer bw desc)
+ ;; (. bw flush)
+ (print-reverse-complement-of-str-in-lines bw dna-seq
+ max-dna-chars-per-line)
+ ;; (. bw flush)
+ )
+ (. bw flush)
+ ))
+)
+
+
+(. System (exit 0))

0 comments on commit c5cc477

Please sign in to comment.