
Reduce needed memory and speed up #2

Merged

Conversation

MartinGuehmann
Contributor

This pull request builds on #1, and accepting it will also accept #1 if that has not been accepted before.

This pull request aims to reduce the memory needed for loading and running CLANS. It also cleans up and simplifies some code that was in the way; this is minor refactoring, which is not the main aim here and should otherwise be the subject of another pull request.

The main source of memory waste right now is using Strings as hash keys in HashMaps, even though Integers or the value object itself could be used as the key.

Right now, I consider this pull request a work in progress, so it should not be merged yet.

HashMap requires a key and a value, and both must be Objects. That means both
are pointers in the HashMap and require 8 bytes each in 64-bit Java.

Additionally, there is the memory for the Objects themselves; if we use the same object for
key and value, we can save that memory.

To achieve that, we need to be able to use MinimalHsp as a key in a HashMap. Since
we only want to use query and hit of MinimalHsp as the key, the overridden
hashCode and equals methods should depend only on those.
And query and hit should be final so that they cannot be changed once a
MinimalHsp is in a HashMap, since changing them would break the HashMap.
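A minimal sketch of that pattern (the val field and any members beyond query and hit are assumptions here, not the actual CLANS code):

```java
// Sketch only: a MinimalHsp that can safely serve as its own HashMap key.
public class MinimalHsp {

    // Final, so the hash code cannot change while the object sits in a HashMap.
    public final int query;
    public final int hit;

    // Mutable payload; deliberately NOT part of equals/hashCode.
    public double[] val;

    public MinimalHsp(int query, int hit) {
        this.query = query;
        this.hit = hit;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof MinimalHsp)) return false;
        MinimalHsp o = (MinimalHsp) other;
        return query == o.query && hit == o.hit;
    }

    @Override
    public int hashCode() {
        return 31 * query + hit;
    }
}
```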
In particular, declare local variables as close as possible to where they are used,
and keep them in the narrowest scope.
… key and value in the HashTable to save memory
…trings

Strings need a lot of memory to represent two numbers separated by an
underscore. However, the value for the key is already contained in the
MinimalAttractionValue object itself.

To use a MinimalAttractionValue as its own key, its equals and hashCode
methods must depend on the values of query and hit. Since the att field is
supposed to be the value part in the HashTable, it is ignored by equals and hashCode.

This is a bit weird, but HashMap does not allow primitive types; otherwise a long,
or a pair of two ints, would be the key of choice, if Java allowed storing
the object itself rather than a pointer to it.
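The self-keyed lookup then works roughly like this (a sketch; the MinimalAttractionValue constructor and the att field name are assumptions, and query, hit, and att are assumed to come from the parsed input line):

```java
// Sketch: the object serves as both key and value, so no separate String key is allocated.
HashMap<MinimalAttractionValue, MinimalAttractionValue> attractions = new HashMap<>();

// query, hit, and att come from the parsed input line.
MinimalAttractionValue probe = new MinimalAttractionValue(query, hit);
MinimalAttractionValue existing = attractions.get(probe);
if (existing == null) {
    probe.att = att;
    attractions.put(probe, probe);  // key == value: one object instead of key + value + String
} else {
    existing.att = att;             // att is ignored by equals/hashCode, so updating it is safe
}
```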
…mory

Internally the HashSet also uses a HashMap, which is filled with a pointer to
a static dummy object, so we save memory on the value object, but not on the
pointer itself, which is not a smart design.
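For reference, a HashSet in OpenJDK boils down to roughly this (simplified):

```java
import java.util.HashMap;

// Simplified: every entry's value slot points to the same shared dummy object, so the
// value object itself costs nothing extra, but each entry still stores a pointer to it.
public class SimplifiedHashSet<E> {
    private static final Object PRESENT = new Object();
    private final HashMap<E, Object> map = new HashMap<>();

    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }

    public boolean contains(Object o) {
        return map.containsKey(o);
    }
}
```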
Temporary variables should not be members of a class, especially if they are only used locally.
…f it contains a gap

String.replaceAll can be implemented in a way that returns a new String even if the original
String does not contain the gap character. That wastes time and memory allocating the new
String.
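A guard along these lines avoids that allocation; the method name and the gap-character parameter here are made up for illustration:

```java
// Only pay for a new String when the gap character actually occurs in the sequence.
static String removeGaps(String sequence, char gapChar) {
    if (sequence.indexOf(gapChar) < 0) {
        return sequence; // nothing to replace, keep the original String
    }
    // Plain replace instead of replaceAll also avoids compiling a regex.
    return sequence.replace(String.valueOf(gapChar), "");
}
```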
…ction.java to reduce memory usage

However, since hashkeys[i] = i, this looks superfluous.
This wasn't really a map; it just used memory without need
and made the code harder to read.
… it is used as such anyway

This simplifies the code and saves the memory for the HashMap and the wrapper objects it requires for
primitives. In fact, the Integer objects were basically used as indices.
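The pattern looks roughly like this (hypothetical names; the point is that the Integer keys were simply the indices 0..N-1):

```java
// Before: every index gets boxed into an Integer key, plus per-entry HashMap overhead.
HashMap<Integer, float[]> positionsByIndex = new HashMap<>();
positionsByIndex.put(i, position);
float[] p = positionsByIndex.get(i);

// After: a plain array gives the same lookup without the map or the boxed keys.
float[][] positions = new float[numSequences][];
positions[i] = position;
float[] q = positions[i];
```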
…mplexity

This avoids adding dummy objects or fields. In principle, this could reduce memory
needs; however, HashSet internally uses a HashMap and fills the value part with a
static dummy Object, which is not a very nice implementation.

However, this is now out of the programmer's sight, so other code issues become clearer.
…nterests on ConvexClustering

In the worst case, each node has an attraction value to every other node. That is O(N²) attraction
values, if N is the number of nodes (aka sequences). The old version looked, for each node, at all the
connections instead of just the connections of that particular node, which thus needed O(N²) loop
iterations. With the new implementation, the inner loop only needs O(N) iterations, which improves
the overall algorithm from O(N⁴) to O(N³).

This is a big improvement in speed. However, it costs O(N²) extra memory, which was transiently
needed to load the data anyway.
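A rough sketch of the change (hypothetical names; the real ConvexClustering code is more involved):

```java
// Old: for every node, scan ALL O(N^2) attraction values.
// for (int node = 0; node < numSeq; node++)
//     for (MinimalAttractionValue a : allAttractions)
//         if (a.query == node || a.hit == node) { /* use a */ }

// New: group the values per node once (the O(N^2) extra memory), then each node
// only walks its own O(N) connections in the inner loop.
List<List<MinimalAttractionValue>> perNode = new ArrayList<>(numSeq);
for (int node = 0; node < numSeq; node++) {
    perNode.add(new ArrayList<>());
}
for (MinimalAttractionValue a : allAttractions) {
    perNode.get(a.query).add(a);
    perNode.get(a.hit).add(a);
}
for (int node = 0; node < numSeq; node++) {
    for (MinimalAttractionValue a : perNode.get(node)) {
        // ... work on this node's connections only ...
    }
}
```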
This is not only a useful feature for checking that the cluster algorithm
produces the same output after a modification, for instance adding
multithreading, but it is also useful for the user.
This helps to check whether changes, such as adding multithreading
to the clustering code, indeed speed it up.
MartinGuehmann changed the title from "WIP: Reduce needed memory and speed up" to "Reduce needed memory and speed up" on Oct 13, 2020
@MartinGuehmann
Contributor Author

MartinGuehmann commented Oct 13, 2020

This PR now accelerates loading by reducing the needed memory. The original implementation used Strings as keys in HashMaps, even though those keys actually represented numbers: each String stored two numbers separated by an underscore, each number being a node or sequence ID. With 90000 sequences such a key is 10 chars plus the separator char, and since Java uses two bytes per char that is 22 bytes per edge between nodes, which is a lot if I want to load 600 million edges.

However, the standard HashMap of Java does not accept primitives, only objects, and each object requires a pointer in the HashMap; in 64-bit Java that is another 8 bytes wasted. Therefore the HashMaps used for loading the data now use the object they contain as both key and value. This is a workaround for Java's limitations that saves more memory.

This PR also accelerates detecting clusters with the "convex" method. For that, the convex clustering code has been refactored into its own subclass and cleaned up so that it became possible to work with it. The convex clustering itself was accelerated by using more memory and adding multithreading.

Additionally, the user interface around the clustering was cleaned up, so that its buttons are clearer about what they do.

For now, I consider this PR complete. The changes should not change the output of CLANS.
