Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hashcons NormalizedFilePath values for efficient heap usage #340

Merged
merged 6 commits into from Jun 26, 2021

Conversation

pepeiborra
Copy link
Collaborator

NormalizedFilePath is one of the biggest costs in terms of heap space. The solution adopted here relies on the MemoTrie package to create a global trie of normalised file paths. The trie is branched by dir name and not by character, which hopefully leads to a more compact representation.

The profiles below have been collected using an Edit experiment on a synthetic repo containing 300 empty modules with maximal imports, (every module imports all the modules below it).

BEFORE

image

AFTER

image

@pepeiborra pepeiborra changed the title Hashcons NormalizedFilePath values Hashcons NormalizedFilePath values for efficient heap usage Jun 23, 2021
@wz1000
Copy link
Collaborator

wz1000 commented Jun 23, 2021

The trie is branched by dir name and not by character, which hopefully leads to a more compact representation.

The HasTrie instance for lists looks like HasTrie x => HasTrie [x]. Won't this end up branching on characters anyway?

Perhaps you can store the filepath chunks as a ByteString and define a HasTrie instance for ByteString in terms of a hashmap.

@pepeiborra
Copy link
Collaborator Author

You are right, it will end up branching on the characters, which is a bit silly.

I implemented your suggestion here: pepeiborra@49120e8

However, the heap profile doesn't change. Either I've done something wrong, or it's likely that the heap cost is dominated by the interned FilePath values, and the trie is just a blip. We still may want the HashMap trie for faster performance though

@pepeiborra
Copy link
Collaborator Author

@wz1000 I don't think there's a material perf difference given most dir names are ~10 characters, and the cost of converting a String to Text is probably the same as 10 function calls in the Trie

@pepeiborra
Copy link
Collaborator Author

pepeiborra commented Jun 23, 2021

Ended up rolling my own internIO for simplicity. It would be better to intern every path component separately, I might give that a go tomorrow. But I'll need a more realistic benchmark, because now toNormalizedFilePath doesn't even show in the top 15:

image

@pepeiborra
Copy link
Collaborator Author

I tested several variations but didn't find any to be an overall improvement:

  1. Represent the file path as Text
  2. Represent the file path as path segments ([Text]) and intern the segments
  3. Represent the uri as Maybe NormalizedUri

3 was a very minor improvement but not worth the extra complexity.

For niceness, I have move the intern to a smart constructor and made the real constructor abstract. I could add a pattern synonym instead if that's preferred?

@pepeiborra
Copy link
Collaborator Author

This is ready to merge and upload to Hackage please

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants