Description
Currently Blob.Size computes the blob size by loading the blob content into memory. This is not ideal for repositories with very large objects. We have scenarios where some objects are several GB in size and we hit an OutOfMemoryException while reading the blob content. We would like to use Blob.Size to screen out such items in our processing, but Blob.Size has the same issue.
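For context, the native libgit2 object database exposes a header-only read, git_odb_read_header(), which reports an object's size and type without inflating its content. Below is a minimal sketch of using it from C; the repository path and object id are placeholders, and older libgit2 releases spell the type enum git_otype rather than git_object_t.

```c
#include <stdio.h>
#include <git2.h>

int main(void)
{
    git_libgit2_init();

    git_repository *repo = NULL;
    git_odb *odb = NULL;
    git_oid oid;
    size_t size = 0;
    git_object_t type;

    /* Placeholder path and id for illustration only. */
    if (git_repository_open(&repo, "/path/to/repo") < 0 ||
        git_repository_odb(&odb, repo) < 0 ||
        git_oid_fromstr(&oid, "aabbccddeeff00112233445566778899aabbccdd") < 0)
        goto cleanup;

    /* Reads only the object header (type + size); the blob body is never
     * decompressed into memory, so multi-GB blobs are safe to query. */
    if (git_odb_read_header(&size, &type, odb, &oid) == 0)
        printf("blob size: %zu bytes\n", size);

cleanup:
    git_odb_free(odb);
    git_repository_free(repo);
    git_libgit2_shutdown();
    return 0;
}
```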
IMHO Blob.Size should not load the entire content into memory. I had a discussion with @ethomson on this earlier; below is an excerpt. While I believe that changing the Blob.Size implementation is the cleaner fix (two different APIs for computing blob size would be confusing for consumers), I am also worried that it might affect checkout performance. Please let me know your suggestions.
[@ethomson]: If you really only want the size of a file, then we should be able to get that without loading the whole thing into RAM. (Though upon further reflection, I don't think that Blob.Size is the place to do this. Getting only the size, and never doing anything more with the blob, is a pretty rare use case. Generally speaking, you get the Blob because you want to act upon it, and Blob.Size not loading the whole object would penalize people who call Blob.Size first before calling Blob.Contents, which is a pretty compelling use case.)
[@ravindp]: I am not sure why, or by how much, it would be slower to call Blob.Contents after Blob.Size. If Blob.Size is an inexpensive call, then I believe the effective time would remain almost the same; the cost would just move from Blob.Size to Blob.Contents.
[@ethomson]: Worth exploring. I agree that it shouldn't add a whole lot: the worst case is to open all the .idx files, binary-search through them for the ID being requested, and open the packfile that corresponds to the .idx once it is found. I just want to avoid adding a lot of overhead to something like checkout, which does this with great frequency.
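To make the cost question concrete: the worst-case lookup described above (scan the .idx files, binary-search for the id, open the matching packfile) is paid on every object read, so a size-first / contents-later pattern pays that lookup twice but inflates the blob only once. A hedged sketch of that pattern against the native odb, reusing handles obtained as in the previous snippet (the size limit and helper name are purely illustrative):

```c
#include <stdio.h>
#include <git2.h>

/* Illustrative "size first, contents later" pattern from the discussion.
 * The pack-index lookup happens in both calls, but the blob content is
 * decompressed only in the second one. */
static int filter_then_read(git_odb *odb, const git_oid *oid, size_t limit)
{
    size_t size = 0;
    git_object_t type;

    /* Cheap screening step: header only, no content in memory. */
    if (git_odb_read_header(&size, &type, odb, oid) < 0)
        return -1;

    if (size > limit) {
        printf("skipping large blob (%zu bytes)\n", size);
        return 0;
    }

    /* Full read only for objects that passed the size filter. */
    git_odb_object *obj = NULL;
    if (git_odb_read(&obj, odb, oid) < 0)
        return -1;

    printf("read %zu bytes of content\n", git_odb_object_size(obj));
    git_odb_object_free(obj);
    return 0;
}
```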