encoding/csv: Read returns strings which has reference of string of whole line #30222
Comments
The current implementation deliberately pins a large string and creates sub-slices out of the string. Doing this provides a significant performance speed-up, at the expense of potentially pinning a lot of memory (as you are running into). For a little bit of extra complexity, we can do something where it batch allocates fields up to a certain size, providing an upper bound on the maximum amount of memory that is pinned, but still getting most of the performance benefits.
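For illustration, here is a minimal sketch of the technique described above (an assumed simplification, not the actual encoding/csv source): a single string backs the whole line, and each field is a sub-slice of it, so one allocation serves every field, but retaining any field keeps the entire line reachable.

package main

import (
	"fmt"
	"strings"
)

// splitAliased returns fields that are sub-slices of line. Every field
// shares line's backing array, so retaining any one field keeps the
// whole line's memory alive.
func splitAliased(line string) []string {
	return strings.Split(line, ",") // strings.Split does not copy
}

func main() {
	line := "id," + strings.Repeat("x", 1<<20) // one tiny field, one ~1 MiB field
	fields := splitAliased(line)
	id := fields[0]
	fields = nil
	// id is two bytes of useful data, but it still pins the ~1 MiB
	// backing array for as long as it remains reachable.
	fmt.Println(id)
}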
Thanks for your quick reply, @dsnet. I understand it's a trade-off between CPU performance and memory usage. Let me describe the real situation in which I encountered this issue, and what I want.
My real situation: my program reads a 16-million-line CSV (about 1 GB). I want to...
As a general principle, I think we should fight hard against documentation (or API knobs) that constricts the implementation towards one given approach.
Correct. I think there is room for improvement here, though I would like to avoid API expansion or documenting implementation details.
Understood. It was only a request. I have already worked around the problem with a deep copy, but I worry that other people will run into this situation.
Thanks.
You can always deep-copy the strings if you don't want the aliasing behavior. On the other hand, if we made copies unprompted, users who want aliasing would have no way to recover that behavior.
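As a concrete sketch of both options (strings.Clone was added later, in Go 1.18; the []byte round-trip works on any version):

package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

func main() {
	r := csv.NewReader(strings.NewReader("a,b,c\n"))
	record, err := r.Read()
	if err != nil {
		panic(err)
	}

	aliased := record[1]                   // zero-copy: shares the line's backing array
	copied := strings.Clone(record[1])     // Go 1.18+: detached deep copy
	copiedOld := string([]byte(record[1])) // pre-1.18 equivalent of Clone

	fmt.Println(aliased, copied, copiedOld)
}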
On the third hand, the memory behavior of string-copying is an implementation detail of the garbage collector. In theory, the GC could notice that only a small fraction of the string is live during garbage collection and allocate smaller substrings at that point. (CC @RLH @aclements for GC knowledge.)
I would bet that there is already an issue filed somewhere for chopping up large retained strings, but I can't find it.
@bcmills I already understand it's a tradeoff between performance and memory usage. My requests are "…
My point is, at some point the runtime itself may do that deep-copy for you. That would save the expense of copying if the data turns out to be short-lived, and would not require any new options or API surface. I'm also not sure that it merits a mention in the package documentation: string operations are pervasive in Go programs, and they all alias by default. (Allocating new strings is the exception, not the rule.) Given that, I don't think that repeating that information in every individual API would be productive either.
It was discussed a little here. I think you and I may have been talking in person about how GC could recognize that only part of a string is reachable and chop it up. As far as I know, there's no issue filed for this. This is of course actually quite hard to do, since you need to not only recognize when you've reached a string, but keep track of extra information on what part of it you've reached. It's extra hard if you want to chop strings down from both ends, since then you need to recognize not just the string object, but the string header that points to the string object so you can read the length from the header. And even if you can do all of this, updating all of the references to point to the new string(s) is quite difficult (and probably would require specializing the write barrier).
You don't have to update all references, though. You just have to somehow decide when this is worth doing, and then let each instance take care of itself.
I'm not sure I follow. If you don't update all of the references to a moved string, you'll have to retain the old string, which will just increase the memory footprint rather than decreasing it.
I came across this issue today and spent many hours troubleshooting it. Having written the exact same program in both Rust and Go, I found that the Go version was consuming 2 GB of memory while the Rust version was consuming 400 MB, with virtually the same performance. Personally, I think the current behaviour is extremely confusing. In the end, I am working around it by doing something similar to @mtgto, but condensed to one line: items = append(items, string([]byte(record[3])))
csv.Reader returns strings pointing to the same backing array. Therefore, using one of them directly will hold memory for the entire read line instead of just the column of interest. ref: golang/go#30222
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
"encoding/csv"
Reader.Read()
returnsrecord []string
, which has reference to the string of whole line.I wrote my code hold only one field in
record
, and found the usage of memory is huger than expects me.To confirm it, I wrote some sample code:
https://play.golang.org/p/OpsuPDjMRfJ
This line says that each field of record has reference of line string.
https://github.com/golang/go/blob/go1.11.5/src/encoding/csv/reader.go#L386
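Since the playground code itself isn't quoted here, the following is a hedged reconstruction of the experiment using the sizes from this report (10000 lines shaped like "0,<100000 zeros>"); the exact playground program may differ.

package main

import (
	"encoding/csv"
	"fmt"
	"runtime"
	"strings"
)

// readFirstColumn keeps only the first field of every record. Each field
// returned by Read aliases one string holding the whole line's contents,
// so every kept one-byte field pins its ~100 KB line.
func readFirstColumn(input string) []string {
	r := csv.NewReader(strings.NewReader(input))
	var kept []string
	for {
		record, err := r.Read()
		if err != nil {
			break // io.EOF (or a parse error) ends this sketch's loop
		}
		kept = append(kept, record[0])
		// Workaround: kept = append(kept, string([]byte(record[0])))
	}
	return kept
}

func main() {
	// Build 10000 lines of "0,<100000 zeros>". Note that the input string
	// itself also needs ~1 GB while it is alive.
	var sb strings.Builder
	big := strings.Repeat("0", 100000)
	for i := 0; i < 10000; i++ {
		sb.WriteString("0,")
		sb.WriteString(big)
		sb.WriteByte('\n')
	}

	kept := readFirstColumn(sb.String())

	runtime.GC()
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	// Roughly a gigabyte of heap stays live even though we kept only
	// 10000 one-byte fields.
	fmt.Printf("kept %d fields, HeapAlloc = %d MiB\n", len(kept), m.HeapAlloc/(1<<20))
}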
What did you expect to see?
Only the fields of the record that I actually use would be referenced. In the playground sample, the first field (= "0") is held, but the second field ("0 ... 0") is not.
What did you see instead?
It held over 100002*10000 bytes in memory.