In investigating recent high-memory-usage bugs, file content often appears near the top of the heap profile. One report showed 500MB of live heap consumed by file content.
At first blush, this seems ridiculous: why should we keep (e.g.) the entire source text of the standard library in memory? We already keep the (exported) AST, so why should we need the raw text? At worst we should read it back in when we need it.
The most basic problem is that we can't guarantee that the file we read is in the state we want it to be. If a file changes on disk twice in quick succession, we may get the second contents when we meant to read the first. Depending on what we're doing, that could result in a user facing error or a corruption of the metadata cache.
A less fundamental but still serious problem is that we assume that the file contents are available whenever we need them, for free. Naive changes to load the file on demand will probably cause performance problems. Minimizing the length of time file data is held in memory, while also minimizing the number of times it is read from disk, will require refactoring. I think that refactoring will be fairly invasive.
Off the top of my head, here are the places we might need .go file contents. (I'm deliberately ignoring go.mod files; we work with them very differently, so they're a separate topic. But also they're tiny and not worth worrying about.)
AFAIK, the AST does not pin the file contents in memory. I'm confident in this belief because the parser supports a Reader as input.
During snapshot.clone, we are by definition looking at files that have changed and we can't get their old contents. However, we need a small and known subset of the file's data, all of which should be contained in its AST. If we depend on the old content anywhere, we should be able to just stop, either by guaranteeing we have the ParseHeader AST of the old file, or by explicitly summarizing it into a file metadata struct.
The killer: UTF-8 to UTF-16 offset translation. The ColumnMapper type requires the content of the file, and for good reason. ColumnMappers are spread liberally throughout the codebase. To avoid poor performance, we would need to delay Range creation, working only with Spans until the end of the request. This is the major refactoring I mentioned above.
I'm probably missing other places, of course.
Once we have the necessary foundation laid, I can think of some ideas for how to reduce churn: an actual (LRU?) cache for file content, compressing less frequently used source text, etc. But first we need to get file content lifetime under control, the same way I had to get snapshot lifetime under control to deal with growth in the memoize.Cache.
Good question, I should've at least tried to put the cost in context.
From what I've seen it's typically in the neighborhood of 10% of the heap, where most of the rest is AST and type information. But it really varies. Part of the problem is that we can trim the AST and type information for non-workspace packages, but we can't trim the source code. So I think it's a scaling issue for people who have a lot of external dependencies.
I think we can save memory here more easily than we can elsewhere. But your implication is correct; this is not a huge win in the grand scheme of things. If you have concrete ideas for things to do about the elephants in the room we should talk about them. (In another issue or on Slack.)