
Reduce needed memory and speed up #2

Merged

Conversation

MartinGuehmann
Contributor

This pull request builds on #1, and accepting it will also accept #1 if that has not been accepted before.

This pull request aims to reduce the memory needed for loading and running CLANS. It also cleans up and simplifies some code that was in the way; this is minor refactoring, which is not the main aim here and should otherwise be the subject of another pull request.

The main source of memory waste right now is using Strings as hash keys in HashMaps, even though Integers or the value object itself could be used as the key.

Right now, I consider this pull request a work in progress, so it should not be merged yet.

HashMap requires a key and a value, and both must be Objects. That means both
are pointers in the HashMap and require 8 bytes each in 64-bit Java.

Additionally, there is the memory for the Objects themselves; if we use the same object for
key and value, we can save that memory.

To achieve that, we need to be able to use MinimalHsp as a key in a HashMap. Since
we only want to use query and hit of MinimalHsp as the key, the overridden
hashCode and equals methods should depend only on those.
And query and hit should be final so that they cannot be changed once a
MinimalHsp is in a HashMap, since changing them would break the HashMap.
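A minimal sketch of that pattern (the val field and any members beyond query and hit are assumptions here, not the actual CLANS code):

```java
// Sketch only: a MinimalHsp that can safely serve as its own HashMap key.
public class MinimalHsp {

    // Final, so the hash code cannot change while the object sits in a HashMap.
    public final int query;
    public final int hit;

    // Mutable payload; deliberately NOT part of equals/hashCode.
    public double[] val;

    public MinimalHsp(int query, int hit) {
        this.query = query;
        this.hit = hit;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof MinimalHsp)) return false;
        MinimalHsp o = (MinimalHsp) other;
        return query == o.query && hit == o.hit;
    }

    @Override
    public int hashCode() {
        return 31 * query + hit;
    }
}
```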
In particular, declare local variables as close as possible to where they are used,
and keep them in the narrowest scope.
… key and value in the HashTable to save memory
…trings

Strings need a lot of memory to represent two numbers separated by an
underscore. However, the value for the key is already contained in the
MinimalAttractionValue object itself.

To use a MinimalAttractionValue as its own key, its equals and hashCode
methods must depend on the values of query and hit. Since the att field is
supposed to be the value part in the HashTable, it is ignored by equals and hashCode.

This is a bit weird, but HashMap does not allow primitive types; otherwise a long,
or a pair of two ints, would be the key of choice, if Java allowed storing
the object itself rather than a pointer to it.
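The self-keyed lookup then works roughly like this (a sketch; the MinimalAttractionValue constructor and the att field name are assumptions, and query, hit, and att are assumed to come from the parsed input line):

```java
// Sketch: the object serves as both key and value, so no separate String key is allocated.
HashMap<MinimalAttractionValue, MinimalAttractionValue> attractions = new HashMap<>();

// query, hit, and att come from the parsed input line.
MinimalAttractionValue probe = new MinimalAttractionValue(query, hit);
MinimalAttractionValue existing = attractions.get(probe);
if (existing == null) {
    probe.att = att;
    attractions.put(probe, probe);  // key == value: one object instead of key + value + String
} else {
    existing.att = att;             // att is ignored by equals/hashCode, so updating it is safe
}
```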
…mory

Internally the HashSet also uses a HashMap, which is filled with a pointer to
a static dummy object, so we save memory on the value object, but not on the
pointer itself, which is not a smart design.
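For reference, a HashSet in OpenJDK boils down to roughly this (simplified):

```java
import java.util.HashMap;

// Simplified: every entry's value slot points to the same shared dummy object, so the
// value object itself costs nothing extra, but each entry still stores a pointer to it.
public class SimplifiedHashSet<E> {
    private static final Object PRESENT = new Object();
    private final HashMap<E, Object> map = new HashMap<>();

    public boolean add(E e) {
        return map.put(e, PRESENT) == null;
    }

    public boolean contains(Object o) {
        return map.containsKey(o);
    }
}
```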
Temporary variables should not be members of a class, especially if they are only used locally.
…f it contains a gap

String.replaceAll can be implemented in a way that returns a new String even if the original
String does not contain the gap character. That wastes time and memory allocating the new
String.
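A guard along these lines avoids that allocation; the method name and the gap-character parameter here are made up for illustration:

```java
// Only pay for a new String when the gap character actually occurs in the sequence.
static String removeGaps(String sequence, char gapChar) {
    if (sequence.indexOf(gapChar) < 0) {
        return sequence; // nothing to replace, keep the original String
    }
    // Plain replace instead of replaceAll also avoids compiling a regex.
    return sequence.replace(String.valueOf(gapChar), "");
}
```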
…ction.java to reduce memory usage

However, since hashkeys[i] = i, this looks superfluous.
This wasn't really a map; it just used memory without need
and made the code harder to read.
… it is used as such anyway

This simplifies the code and saves the memory for the HashMap and the wrapper objects it requires for
primitives. In fact, the Integer objects were basically used as indices.
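The pattern looks roughly like this (hypothetical names; the point is that the Integer keys were simply the indices 0..N-1):

```java
// Before: every index gets boxed into an Integer key, plus per-entry HashMap overhead.
HashMap<Integer, float[]> positionsByIndex = new HashMap<>();
positionsByIndex.put(i, position);
float[] p = positionsByIndex.get(i);

// After: a plain array gives the same lookup without the map or the boxed keys.
float[][] positions = new float[numSequences][];
positions[i] = position;
float[] q = positions[i];
```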
…mplexity

This avoids adding dummy objects or fields. In principle, this could reduce memory
needs; however, HashSet internally uses a HashMap and fills the value part with a
static dummy Object, which is not a very nice implementation.

However, this is now out of the programmer's sight, so other code issues become clearer.
…nterests on ConvexClustering

In the worst case, each node has an attraction value to every other node. That is O(N²) attraction
values, if N is the number of nodes (aka sequences). The old version looked, for each node, at all the
connections instead of just the connections of that particular node, which thus needed O(N²) loop
iterations. With the new implementation, the inner loop only needs O(N) iterations, which improves
the overall algorithm from O(N⁴) to O(N³).

This is a big improvement in speed. However, it costs O(N²) extra memory, which was transiently
needed to load the data anyway.
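A rough sketch of the change (hypothetical names; the real ConvexClustering code is more involved):

```java
// Old: for every node, scan ALL O(N^2) attraction values.
// for (int node = 0; node < numSeq; node++)
//     for (MinimalAttractionValue a : allAttractions)
//         if (a.query == node || a.hit == node) { /* use a */ }

// New: group the values per node once (the O(N^2) extra memory), then each node
// only walks its own O(N) connections in the inner loop.
List<List<MinimalAttractionValue>> perNode = new ArrayList<>(numSeq);
for (int node = 0; node < numSeq; node++) {
    perNode.add(new ArrayList<>());
}
for (MinimalAttractionValue a : allAttractions) {
    perNode.get(a.query).add(a);
    perNode.get(a.hit).add(a);
}
for (int node = 0; node < numSeq; node++) {
    for (MinimalAttractionValue a : perNode.get(node)) {
        // ... work on this node's connections only ...
    }
}
```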
This is not only a useful feature for checking that the cluster algorithm
produces the same output after a modification, for instance adding
multithreading, but it is also useful for the user.
This helps to check whether changes, such as adding multithreading
to the clustering code, indeed speed it up.
MartinGuehmann changed the title from "WIP: Reduce needed memory and speed up" to "Reduce needed memory and speed up" on Oct 13, 2020
@MartinGuehmann
Contributor Author

MartinGuehmann commented Oct 13, 2020

This PR now accelerates loading by reducing the needed memory. The original implementation used Strings as keys in HashMaps, even though those keys actually represented numbers: each String stored two numbers separated by an underscore, each number being a node or sequence ID. With 90000 sequences such a key is 10 chars plus the separator char, and since Java uses two bytes per char that is 22 bytes per edge between nodes, which is a lot if I want to load 600 million edges.

However, the standard HashMap of Java does not accept primitives, only objects, and each object requires a pointer in the HashMap; in 64-bit Java that is another 8 bytes wasted. Therefore the HashMaps used for loading the data now use the object they contain as both key and value. This is a workaround for Java's limitations that saves more memory.

This PR also accelerates detecting clusters with the "convex" method. For that, the convex clustering code has been refactored into its own subclass and cleaned up so that it became possible to work with it. The convex clustering itself was accelerated by using more memory and adding multithreading.

Additionally, the user interface around the clustering was cleaned up, so that its buttons are clearer about what they do.

For now, I consider this PR complete. The changes should not change the output of CLANS.
