----------------------------------------------------------------------
On Feb 21, 2013, the fastest Java and Clojure versions on the Alioth
web site for the 64-bit 4 core machine were:
    revcomp.clojure = my revcomp.clj-15.clj except for one comment line
    revcomp.java-3.java = my revcomp.java-3.java except for one comment line
----------------------------------------------------------------------
As of May 28, 2012, the fastest Java and Clojure versions on the
Alioth web site were:
Benchmark machine   Fastest Java prog     Fastest Clojure prog
32-bit 1 core       revcomp.java-6.java   revcomp.clojure-4.clojure
32-bit 4 core       revcomp.java-3.java   revcomp.clojure-4.clojure
64-bit 1 core       revcomp.java-6.java   revcomp.clojure-4.clojure
64-bit 4 core       revcomp.java-3.java   revcomp.clojure-4.clojure
revcomp.clojure-4.clojure from the web site is what I call
revcomp.clj-13.clj in my github repository.
These were all run on Ubuntu with jdk1.7.0_04 and clojure-1.4.0.
What I call revcomp.clj-14.clj in my github repository was written by
Stuart Halloway and added to the test.benchmark namespace on May 28,
2012. I will compare its performance to the others above soon.
----------------------------------------------------------------------
Mac OS X 10.5.8 with java 1.5.0_24 failed with this exception when
running revcomp.clj-10.clj. Perhaps this program relies on a change
made in java 1.6?
Exception in thread "main" java.lang.IllegalArgumentException: Can't call public method of non-public class: public void java.lang.AbstractStringBuilder.setCharAt(int,char)
        at clojure.lang.Reflector.invokeMatchingMethod(Reflector.java:85)
        at clojure.lang.Reflector.invokeInstanceMethod(Reflector.java:28)
        at revcomp$revcomp_buf_and_write.invoke(revcomp.clj:59)
        at revcomp$_main.doInvoke(revcomp.clj:77)
        at clojure.lang.RestFn.invoke(RestFn.java:398)
        at clojure.lang.AFn.applyToHelper(AFn.java:159)
        at clojure.lang.RestFn.applyTo(RestFn.java:133)
        at revcomp.main(Unknown Source)
----------------------------------------------------------------------
July 29, 2009
revcomp.clj-4.clj is the best I've got so far, but it runs out of
memory on the full size benchmark.
long-input.txt contains 3 DNA sequences in FASTA format. The first is
50,000,000 characters long, the second is 75,000,000 characters long,
and the third is 125,000,000 characters long. Each needs to be
reversed, have each character replaced with a different one, and
printed out, so we need to store each of the strings one at a time,
but it is acceptable to deallocate/garbage-collect the previous one
when starting on the next. I think my code should be doing that, but
I don't know how to verify that.
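One rough way to check whether the previous sequence really is being
collected is to log the used heap before, during, and after holding a
large buffer. The following is only a sketch in Java: the class and
method names are hypothetical, the 50,000,000-byte array merely stands
in for one DNA sequence, and the gc() call is a hint that the JVM is
free to ignore, so the numbers are indicative rather than exact.

```java
// Hypothetical helper, not part of any revcomp program in this repository.
final class HeapCheck {
    /** Requests a GC (a hint only) and returns the heap bytes still in use. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        rt.gc();
        return rt.totalMemory() - rt.freeMemory();
    }

    public static void main(String[] args) {
        long before = usedHeapBytes();
        byte[] seq = new byte[50_000_000]; // stand-in for one DNA sequence
        seq[0] = 1;                        // touch it so it is really used
        long during = usedHeapBytes();
        seq = null;                        // drop the only reference
        long after = usedHeapBytes();
        System.out.printf("before=%d during=%d after=%d%n", before, during, after);
    }
}
```

If "after" stays close to "during" when the same pattern is applied
between sequences, something is still holding a reference to the old
data.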
I've read that a Java String takes 2 bytes per character, plus about
38 bytes of overhead per string. That is about 250 Mbytes for the
longest one. I also read in a seq of lines, and these long strings
are split into lines with 60 characters (plus a newline) each. Thus
the string's data needs to be stored at least twice temporarily --
once for the many 60-character strings, plus the final long one.
Also, the Java StringBuilder that Clojure's (str ...) function uses
probably needs to be copied and reallocated periodically as it
outgrows its current allocation. So I could imagine needing about 3 *
250 Mbytes temporarily, but that doesn't explain why my 1536 Mbytes of
JVM memory are being exhausted.
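The reallocation effect can be seen directly by watching a
StringBuilder's capacity while 60-character lines are appended. This
is a small sketch in Java with hypothetical names; it only counts how
often the capacity changes, on the assumption that each change means a
larger backing array was allocated and the old contents copied into
it, so both arrays exist for a moment.

```java
import java.util.Arrays;

// Hypothetical demo, not taken from any revcomp program.
final class BuilderGrowth {
    /** Appends `lines` 60-character lines and counts capacity changes. */
    static int countReallocations(int lines) {
        char[] chars = new char[60];
        Arrays.fill(chars, 'A');
        String line = new String(chars);

        StringBuilder sb = new StringBuilder(); // default initial capacity
        int capacity = sb.capacity();
        int reallocations = 0;
        for (int i = 0; i < lines; i++) {
            sb.append(line);
            if (sb.capacity() != capacity) { // backing array was replaced
                capacity = sb.capacity();
                reallocations++;
            }
        }
        return reallocations;
    }

    public static void main(String[] args) {
        System.out.println(countReallocations(100_000) + " reallocations");
    }
}
```

Constructing the builder with the final length, new
StringBuilder(totalChars), avoids the intermediate copies when the
total size is known in advance.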
It would be possible to improve things by not creating all of the
separate strings, one for each line, and then concatenating them
together. But first I'd like to explain why it is using so much,
because I must be missing something.
----------------------------------------------------------------------
Jul 30, 2009
Using jconsole, I watched the heap usage during this run:
    ./batch.sh clj long
with clj-run.sh using revcomp.clj-5.clj.
Size of the 3 FASTA DNA sequences in the long-input.txt file:
>ONE Homo sapiens alu
    lines: 833,334, 60 characters each except for last
    total characters: 50,000,000
>TWO IUB ambiguity codes
    lines: 1,250,000, 60 characters each
    total characters: 75,000,000
>THREE Homo sapiens frequency
    lines: 2,083,334, 60 characters each except for last
    total characters: 125,000,000
During processing of the first FASTA DNA sequence, the long term
heap usage was always at least 300 Mbytes. Why not closer to 100
Mbytes?
During processing of the second FASTA DNA sequence, the long term heap
usage was always at least about 500 Mbytes.
During the third, it goes up to around 900 Mbytes, and spends way way
too much time garbage collecting, at least with these parameters on
the 'java' command line:
----------------------------------------
JAVA=java
JVM_MEM_OPTS="-Xmx1536m -XX:NewRatio=2 -XX:+UseParallelGC"
JMX_MONITORING=-Dcom.sun.management.jmxremote
JAVA_PROFILING=
CLOJURE_JAR_DIR=/Users/andy/.clojure
CLOJURE_CLASSPATH=$CLOJURE_JAR_DIR/clojure-1.1.0-alpha-SNAPSHOT.jar:$CLOJURE_JAR_DIR/clojure-contrib.jar
CLJ_PROG=revcomp.clj-5.clj
$JAVA -server $JVM_MEM_OPTS $JMX_MONITORING ${JAVA_PROFILING} -cp ${CLOJURE_CLASSPATH} clojure.main $CLJ_PROG "$@"
----------------------------------------
Why does it generate *so much* garbage? By comparison, the program
batchcat.clj-1.clj runs in this much time on long-input.txt.
real 2m32.357s
user 1m23.952s
sys 0m14.407s
----------------------------------------------------------------------
July 31, 2009
revcomp.clj-6.clj is the best I've got so far. It doesn't allocate
memory very quickly at all, especially compared to revcomp.clj-5.clj,
improving performance dramatically.
Probably the best improvement that could be made now is to stop
reading the input file with line-seq and use something more like
slurp instead, but specific to FASTA format files, so that it stops at
the end of each DNA sequence and returns it as one long string. That
could significantly reduce the memory required.
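A FASTA-specific reader of that kind could look roughly like the
following Java sketch. Everything here is hypothetical
(FastaSequenceReader, Record, next), and it assumes ASCII input with
Unix newlines and a '>' header line at the start of every record. It
returns each sequence as one byte array with the newlines removed, so
no per-line strings or list cells are kept.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

// Hypothetical reader, not one of the revcomp programs in this repository.
final class FastaSequenceReader {
    /** One FASTA record: the header line and the sequence without newlines. */
    static final class Record {
        final String header;
        final byte[] data;
        Record(String header, byte[] data) { this.header = header; this.data = data; }
    }

    private final PushbackInputStream in;

    FastaSequenceReader(InputStream source) {
        this.in = new PushbackInputStream(new BufferedInputStream(source));
    }

    /** Returns the next record, or null at end of input. */
    Record next() throws IOException {
        int c = in.read();
        if (c == -1) return null;
        if (c != '>') throw new IOException("expected '>' at start of record");

        StringBuilder header = new StringBuilder(">");
        while ((c = in.read()) != -1 && c != '\n') header.append((char) c);

        ByteArrayOutputStream data = new ByteArrayOutputStream();
        boolean atLineStart = true;
        while ((c = in.read()) != -1) {
            if (c == '\n') { atLineStart = true; continue; }
            if (atLineStart && c == '>') { in.unread(c); break; } // next record
            atLineStart = false;
            data.write(c);
        }
        return new Record(header.toString(), data.toByteArray());
    }
}
```

Note that toByteArray() makes one final copy, so the peak is still
about twice the sequence size for a moment. Sizing the buffer from the
file length would be one way around that.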
This is a very useful program for determining where your heap memory
is going:
jmap -histo <java_process_id>
Or put a -F before -histo if the first attempt fails.
Also, jconsole is very nice to attach to a running program to get an
idea of how fast it is allocating memory, and garbage collecting.
I learned that Clojure's cons and LazyPersistentList are 48 bytes per
element, which for revcomp.clj-6.clj is one for each line of the
current DNA sequence. That turns out to be almost as much memory as
the DNA sequence data itself in the input file! Also it looks like
Java Strings are 2 bytes per character, plus 40 bytes of overhead per
String. It all adds up pretty quickly. Let's look at the memory
required for the longest of the three DNA sequences in the long input
file:
>THREE Homo sapiens frequency
    lines: 2,083,334, 60 characters each except for last
    total characters: 125,000,000
125,000,000 characters * (2 bytes/char)                    = 250,000,000 bytes
2,083,334 lines * (2 Clojure cons/line) * (48 bytes/cons)  = 200,000,064 bytes
That much more for the LazyPersistentList                  = 200,000,064 bytes
  (One is for the lines in the original order, the other for the
  reversed version. I don't know yet why there are 2 more per line.)
2,083,334 Java Strings * (40 bytes/string)                 =  83,333,360 bytes
Grand total                                                = 733,333,488 bytes
                                                             (about 733 Mbytes)
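The same accounting can be applied to all three sequences. The sketch
below is a hypothetical Java helper that just multiplies out the
per-object figures quoted in these notes (2 bytes per character, 48
bytes per cons or lazy list cell, 2 of each per line, 40 bytes of
overhead per String). Those sizes are taken as assumptions; the code
does not measure anything.

```java
// Hypothetical helper; the per-object sizes are the figures quoted in
// these notes, not values measured by this code.
final class RevcompMemoryEstimate {
    static final int LINE_LENGTH = 60;
    static final int BYTES_PER_CHAR = 2;
    static final int BYTES_PER_CELL = 48;
    static final int CELLS_PER_LINE = 2;
    static final int STRING_OVERHEAD = 40;

    /** Number of 60-character lines needed, counting a short last line. */
    static long lines(long chars) {
        return (chars + LINE_LENGTH - 1) / LINE_LENGTH;
    }

    /** Estimated bytes held while one sequence of `chars` characters is live. */
    static long estimateBytes(long chars) {
        long n = lines(chars);
        long charData = chars * BYTES_PER_CHAR;
        long consCells = n * CELLS_PER_LINE * BYTES_PER_CELL;
        long lazyCells = consCells; // "that much more" for the lazy list
        long stringOverhead = n * STRING_OVERHEAD;
        return charData + consCells + lazyCells + stringOverhead;
    }

    public static void main(String[] args) {
        long[] sizes = {50_000_000L, 75_000_000L, 125_000_000L};
        for (long chars : sizes) {
            System.out.println(chars + " chars -> " + estimateBytes(chars) + " bytes");
        }
    }
}
```

Under those assumptions the three sequences come to 293,333,488,
440,000,000, and 733,333,488 bytes respectively.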
Obviously I could save quite a bit if I just stored a DNA sequence as
an array of Java bytes, with no lists or lazy sequences involved.
That would get it down to about 150 Mbytes.
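With the sequence held as a byte array, the reverse-complement itself
can be done in place with two indices walking toward the middle. The
following Java sketch uses hypothetical names and assumes ASCII input
with the newlines already stripped. The complement table is the
standard IUPAC one, with lower case mapped to the upper-case
complement; it is not copied from any of the revcomp programs here.

```java
// Hypothetical sketch, not one of the revcomp programs in this repository.
final class ByteRevComp {
    private static final byte[] COMPLEMENT = new byte[128];
    static {
        for (int i = 0; i < COMPLEMENT.length; i++) COMPLEMENT[i] = (byte) i;
        String from = "ACGTUMRWSYKVHDBN";
        String to   = "TGCAAKYWSRMBDHVN";
        for (int i = 0; i < from.length(); i++) {
            char f = from.charAt(i);
            COMPLEMENT[f] = (byte) to.charAt(i);
            COMPLEMENT[Character.toLowerCase(f)] = (byte) to.charAt(i);
        }
    }

    /** Reverses seq and complements every base, using no extra array. */
    static void reverseComplementInPlace(byte[] seq) {
        int i = 0;
        int j = seq.length - 1;
        while (i < j) {
            byte left = COMPLEMENT[seq[i]];
            seq[i++] = COMPLEMENT[seq[j]];
            seq[j--] = left;
        }
        if (i == j) seq[i] = COMPLEMENT[seq[i]]; // middle element, odd length
    }

    public static void main(String[] args) {
        byte[] seq = "ACGTN".getBytes(java.nio.charset.StandardCharsets.US_ASCII);
        reverseComplementInPlace(seq);
        System.out.println(new String(seq, java.nio.charset.StandardCharsets.US_ASCII)); // NACGT
    }
}
```

The output still has to be broken back into 60-character lines when it
is written, which can be done while writing rather than by building
new strings.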