I think models leak #112

Closed
Miezhiko opened this issue Dec 15, 2020 · 5 comments

Comments

@Miezhiko

I suggest adding a memory-leak test for the models, disabled by default: use ::new to load the models in a scope several times and do things with them, then check memory usage after all loops are done (a rough sketch of what I mean is below).

For my service/bot I currently store all the models in a lazy_static mutex as a workaround. (In my case this also gives a nice speed-up, but in general I think leaks are bad.)
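Roughly what I have in mind (load_model_and_predict is just a placeholder for any of the pipeline constructors, the RSS reading is Linux-specific, and the threshold is arbitrary):

use std::fs;

// Placeholder for any rust_bert pipeline constructor (QA, translation, ...);
// a real test would call the actual ::new(), run a prediction, and drop the
// model at the end of the scope.
fn load_model_and_predict() {
    // let model = SomePipeline::new(Default::default()).unwrap();
    // let _ = model.predict(...);
}

// Resident set size in kB, read from /proc/self/status (Linux only).
fn rss_kb() -> u64 {
    fs::read_to_string("/proc/self/status")
        .ok()
        .and_then(|status| {
            status
                .lines()
                .find(|line| line.starts_with("VmRSS:"))
                .and_then(|line| line.split_whitespace().nth(1))
                .and_then(|value| value.parse::<u64>().ok())
        })
        .unwrap_or(0)
}

#[test]
#[ignore] // disabled by default; run with `cargo test -- --ignored`
fn models_do_not_leak() {
    load_model_and_predict(); // warm-up so one-off caches do not count as growth
    let baseline = rss_kb();
    for _ in 0..100 {
        load_model_and_predict();
    }
    // Arbitrary slack for allocator fragmentation; tune to the model size.
    assert!(rss_kb() < baseline + 500_000, "RSS grew by more than ~500 MB");
}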

@guillaume-be
Owner

guillaume-be commented Dec 15, 2020

Hello @Qeenon,

Thank you for raising this issue. I tried reproducing the error and noticed that the sanitizer identifies a leak of 20 bytes when variables are stored on the GPU. The problem does not seem to appear when loading on the CPU.

I tried looping over a sequence of model loading/inference (see gist https://gist.github.com/guillaume-be/76e0d287dc125592e8a2088cc48f7066). No memory leak is visible after ~5000 iterations. My intuition is that the issue could come from the tokenizer loading rather than from the model itself. Which tokenizer are you loading?

Are you using a GPU for your service? Is there a model in particular for which the memory leak is more severe? Would you be able to share a snippet of code to reproduce the issue?

I will raise the issue of the GPU memory leak with the author of the torch bindings and see if it could come from there. Note that the models are not meant to be reloaded for every query; lazy_static or the batched_fn crate (see the example at https://github.com/epwalsh/rust-dl-webserver) is the appropriate way to serve such models (a minimal sketch is below).
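For illustration, the pattern looks roughly like this (MyModel is a placeholder standing in for whichever pipeline is served, not an actual type in the library):

use std::sync::Mutex;

use lazy_static::lazy_static;

// Placeholder for whichever pipeline is actually served (translation, QA, ...);
// the real type would come from rust_bert's pipelines.
struct MyModel;

impl MyModel {
    fn new() -> Self {
        // load weights, tokenizer, and configuration exactly once
        MyModel
    }

    fn predict(&self, _input: &str) -> String {
        String::from("...")
    }
}

lazy_static! {
    // Loaded lazily on first use and kept alive for the lifetime of the process.
    static ref MODEL: Mutex<MyModel> = Mutex::new(MyModel::new());
}

fn handle_request(input: &str) -> String {
    // Every request locks and reuses the same model instead of reloading it.
    MODEL.lock().unwrap().predict(input)
}

fn main() {
    println!("{}", handle_request("Hello"));
}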

@Miezhiko
Author

Miezhiko commented Dec 16, 2020

I'm not yet 100% sure the leaks are related to the models; so far I'm trying to determine whether that's the case.

https://github.com/Qeenon/Amadeus/blob/d20660596042d8929050bc8154fe16e0cf91ff15/src/steins/ai/bert.rs#L80

This file loads the QA / conversation / translation models. At this commit they were loaded on demand, and after some time the bot would grow to around 25 GB of RAM. I've now changed the code to keep them inside lazy_static, and so far it's going okay-ish, but I'll let it keep running for some more time to be sure the growth was related to the models.

They were running on CPU (I didn't set up CUDA properly on the host machine).

@guillaume-be
Owner

guillaume-be commented Dec 16, 2020

Translation example (see the gist at https://gist.github.com/guillaume-be/60d4a4a61ec16d21478ba497d517a054):
This does not raise any warning when run with RUSTFLAGS=-Zsanitizer=leak cargo run -Zbuild-std --target x86_64-unknown-linux-gnu translation.

Below are the logs from valgrind target/debug/examples/translation --leak-check=full, which seem to indicate that some memory is not freed at exit: https://gist.github.com/guillaume-be/c278ff8c9665264ef901736ea53ab88f

Some of the warnings seem to be caused by the reqwest library. I re-ran a minimal example:

extern crate anyhow;


use rust_bert::resources::{Resource, RemoteResource};
use rust_bert::marian::{MarianModelResources, MarianVocabResources, MarianSpmResources, MarianConfigResources};

fn main() -> anyhow::Result<()> {
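    // Declare the remote Marian English-to-Russian resources (model weights, vocab,
    // sentencepiece model, config); resolving their local paths below is where the
    // reqwest-related valgrind warnings show up in this minimal example.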

    let model_resource = Resource::Remote(RemoteResource::from_pretrained(MarianModelResources::ENGLISH2RUSSIAN));
    let vocab_resource = Resource::Remote(RemoteResource::from_pretrained(MarianVocabResources::ENGLISH2RUSSIAN));
    let merge_resource = Resource::Remote(RemoteResource::from_pretrained(MarianSpmResources::ENGLISH2RUSSIAN));
    let config_resource = Resource::Remote(RemoteResource::from_pretrained(MarianConfigResources::ENGLISH2RUSSIAN));

    let _out1 = model_resource.get_local_path();
    let _out2 = vocab_resource.get_local_path();
    let _out3 = merge_resource.get_local_path();
    let _out4 = config_resource.get_local_path();
    Ok(())
}
valgrind log (minimal example)
==28563== Memcheck, a memory error detector
==28563== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==28563== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==28563== Command: target/debug/examples/resource_download --leak-check=full
==28563== 
==28563== Warning: set address range perms: large range [0x4dab000, 0x40fba000) (defined)
==28563== Warning: set address range perms: large range [0x40fba000, 0x51cf0000) (defined)
==28563== Source and destination overlap in memcpy_chk(0x1ffefff5c0, 0x1ffefff5c0, 5)
==28563==    at 0x4843BF0: __memcpy_chk (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==28563==    by 0x44EC0F83: cpuinfo_linux_parse_cpulist (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x44EBFC96: cpuinfo_linux_get_max_possible_processor (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x44EBDFA1: cpuinfo_x86_linux_init (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x51FDF47E: __pthread_once_slow (pthread_once.c:116)
==28563==    by 0x44EBA3B6: cpuinfo_initialize (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x41D1B157: at::native::compute_cpu_capability() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x41D1B30C: at::native::get_cpu_capability() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x42706CB8: THFloatVector_startup::THFloatVector_startup() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x4197ED85: _GLOBAL__sub_I_THVector.cpp (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x4011B89: call_init.part.0 (dl-init.c:72)
==28563==    by 0x4011C90: call_init (dl-init.c:30)
==28563==    by 0x4011C90: _dl_init (dl-init.c:119)
==28563== 
==28563== Source and destination overlap in memcpy_chk(0x1ffefff5c0, 0x1ffefff5c0, 5)
==28563==    at 0x4843BF0: __memcpy_chk (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==28563==    by 0x44EC0F83: cpuinfo_linux_parse_cpulist (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x44EBFD16: cpuinfo_linux_get_max_present_processor (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x44EBDFAC: cpuinfo_x86_linux_init (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x51FDF47E: __pthread_once_slow (pthread_once.c:116)
==28563==    by 0x44EBA3B6: cpuinfo_initialize (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x41D1B157: at::native::compute_cpu_capability() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x41D1B30C: at::native::get_cpu_capability() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x42706CB8: THFloatVector_startup::THFloatVector_startup() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x4197ED85: _GLOBAL__sub_I_THVector.cpp (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x4011B89: call_init.part.0 (dl-init.c:72)
==28563==    by 0x4011C90: call_init (dl-init.c:30)
==28563==    by 0x4011C90: _dl_init (dl-init.c:119)
==28563== 
==28563== Source and destination overlap in memcpy_chk(0x1ffefff5b0, 0x1ffefff5b0, 5)
==28563==    at 0x4843BF0: __memcpy_chk (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==28563==    by 0x44EC0F83: cpuinfo_linux_parse_cpulist (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x44EBFD99: cpuinfo_linux_detect_possible_processors (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x44EBE00D: cpuinfo_x86_linux_init (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x51FDF47E: __pthread_once_slow (pthread_once.c:116)
==28563==    by 0x44EBA3B6: cpuinfo_initialize (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x41D1B157: at::native::compute_cpu_capability() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x41D1B30C: at::native::get_cpu_capability() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x42706CB8: THFloatVector_startup::THFloatVector_startup() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x4197ED85: _GLOBAL__sub_I_THVector.cpp (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x4011B89: call_init.part.0 (dl-init.c:72)
==28563==    by 0x4011C90: call_init (dl-init.c:30)
==28563==    by 0x4011C90: _dl_init (dl-init.c:119)
==28563== 
==28563== Source and destination overlap in memcpy_chk(0x1ffefff5b0, 0x1ffefff5b0, 5)
==28563==    at 0x4843BF0: __memcpy_chk (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==28563==    by 0x44EC0F83: cpuinfo_linux_parse_cpulist (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x44EBFDF9: cpuinfo_linux_detect_present_processors (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x44EBE02F: cpuinfo_x86_linux_init (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x51FDF47E: __pthread_once_slow (pthread_once.c:116)
==28563==    by 0x44EBA3B6: cpuinfo_initialize (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x41D1B157: at::native::compute_cpu_capability() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x41D1B30C: at::native::get_cpu_capability() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x42706CB8: THFloatVector_startup::THFloatVector_startup() (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x4197ED85: _GLOBAL__sub_I_THVector.cpp (in /home/guillaume/libtorch/lib/libtorch_cpu.so)
==28563==    by 0x4011B89: call_init.part.0 (dl-init.c:72)
==28563==    by 0x4011C90: call_init (dl-init.c:30)
==28563==    by 0x4011C90: _dl_init (dl-init.c:119)
==28563== 
==28563== Thread 2 reqwest-internal:
==28563== Syscall param statx(file_name) points to unaddressable byte(s)
==28563==    at 0x522579FE: statx (statx.c:29)
==28563==    by 0xAB4B00: statx (weak.rs:134)
==28563==    by 0xAB4B00: std::sys::unix::fs::try_statx (fs.rs:123)
==28563==    by 0xAB30A7: std::sys::unix::fs::stat (fs.rs:1105)
==28563==    by 0x510D3D: std::fs::metadata (fs.rs:1567)
==28563==    by 0x5126E1: openssl_probe::find_certs_dirs::{{closure}} (lib.rs:31)
==28563==    by 0x5125ED: core::ops::function::impls:: for &mut F>::call_mut (function.rs:269)
==28563==    by 0x510E3E: core::iter::traits::iterator::Iterator::find::check::{{closure}} (iterator.rs:2227)
==28563==    by 0x514733: core::iter::adapters::map::map_try_fold::{{closure}} (map.rs:87)
==28563==    by 0x5116F9: core::iter::traits::iterator::Iterator::try_fold (iterator.rs:1888)
==28563==    by 0x514454:  as core::iter::traits::iterator::Iterator>::try_fold (map.rs:113)
==28563==    by 0x514642: core::iter::traits::iterator::Iterator::find (iterator.rs:2231)
==28563==    by 0x510C5C:  as core::iter::traits::iterator::Iterator>::next (filter.rs:55)
==28563==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==28563== 
==28563== Syscall param statx(buf) points to unaddressable byte(s)
==28563==    at 0x522579FE: statx (statx.c:29)
==28563==    by 0xAB4B00: statx (weak.rs:134)
==28563==    by 0xAB4B00: std::sys::unix::fs::try_statx (fs.rs:123)
==28563==    by 0xAB30A7: std::sys::unix::fs::stat (fs.rs:1105)
==28563==    by 0x510D3D: std::fs::metadata (fs.rs:1567)
==28563==    by 0x5126E1: openssl_probe::find_certs_dirs::{{closure}} (lib.rs:31)
==28563==    by 0x5125ED: core::ops::function::impls:: for &mut F>::call_mut (function.rs:269)
==28563==    by 0x510E3E: core::iter::traits::iterator::Iterator::find::check::{{closure}} (iterator.rs:2227)
==28563==    by 0x514733: core::iter::adapters::map::map_try_fold::{{closure}} (map.rs:87)
==28563==    by 0x5116F9: core::iter::traits::iterator::Iterator::try_fold (iterator.rs:1888)
==28563==    by 0x514454:  as core::iter::traits::iterator::Iterator>::try_fold (map.rs:113)
==28563==    by 0x514642: core::iter::traits::iterator::Iterator::find (iterator.rs:2231)
==28563==    by 0x510C5C:  as core::iter::traits::iterator::Iterator>::next (filter.rs:55)
==28563==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==28563== 
==28563== 
==28563== HEAP SUMMARY:
==28563==     in use at exit: 1,897,131 bytes in 30,570 blocks
==28563==   total heap usage: 379,202 allocs, 348,632 frees, 54,173,510 bytes allocated
==28563== 
==28563== LEAK SUMMARY:
==28563==    definitely lost: 114 bytes in 1 blocks
==28563==    indirectly lost: 0 bytes in 0 blocks
==28563==      possibly lost: 2,120 bytes in 8 blocks
==28563==    still reachable: 1,894,897 bytes in 30,561 blocks
==28563==         suppressed: 0 bytes in 0 blocks
==28563== Rerun with --leak-check=full to see details of leaked memory
==28563== 
==28563== For lists of detected and suppressed errors, rerun with: -s
==28563== ERROR SUMMARY: 6 errors from 6 contexts (suppressed: 0 from 0)
   

I am not quite sure what is going on here; this goes a bit beyond my comfort zone.
@jerry73204 I see you did some troubleshooting on the tch-rs crate related to a memory leak; do you have an idea of what may be happening here?
@proycon, @epwalsh: reaching out to you as well for support, in case you have some time.

Note that I also ran the following script for ~200 iterations and did not notice a significant increase in memory consumption (stable at ~2544 MB): https://gist.github.com/guillaume-be/34a982ca33749ba4be2951836ab36b97

I also ran an end-to-end translation example, reloading the entire model and tokenizer at each iteration (see https://gist.github.com/guillaume-be/06bbc56639522d8745f2d357b310bc17). I ran the script for 20 minutes (500 full model reloads and translations), and the memory consumption remained stable at ~2560 MB. I could run it longer, but I am unlikely to reach 25 GB of memory use in a realistic amount of time.

@guillaume-be
Owner

@Qeenon I ran a few more experiments on my end.
Looking at the valgrind logs, there are two potential sources of leaks, and both are entirely related to model loading. I compared these values with experiments that run inference and can now rule out memory leaks during prediction.

For model loading, the warnings seem to appear when registering variables or modules in the variable store. As a validation, I ran a quick experiment: even a very basic module creation using the underlying tch-rs library raises a memory-leak warning in valgrind:

use tch::{nn, Device};

fn main() -> anyhow::Result<()> {
    // Creating a variable store and registering a single linear layer is
    // enough to trigger the valgrind warning.
    let device = Device::cuda_if_available();
    let vs = nn::VarStore::new(device);
    let _module = nn::linear(&vs.root() / "dense", 1024, 1024, Default::default());

    Ok(())
}

Since running this for more than a million iterations does not lead to any actual memory growth when monitoring the resources consumed by the process, I believe this is a spurious error.
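In sketch form, the repeated construction is just the snippet above wrapped in a loop, with memory monitored externally (e.g. with top); the iteration count is arbitrary:

use tch::{nn, Device};

fn main() {
    // Repeatedly create and drop a variable store plus a linear module; the
    // process memory (watched externally) stays flat across iterations.
    for i in 0..1_000_000 {
        let vs = nn::VarStore::new(Device::cuda_if_available());
        let _module = nn::linear(&vs.root() / "dense", 1024, 1024, Default::default());
        if i % 100_000 == 0 {
            println!("iteration {}", i);
        }
    }
}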

Here is a summary of my investigations so far:

  • The LeakSanitizer does not return any memory-leak error for the models I tested (masked language model and translation).
  • Valgrind indicates a potential moderate memory leak at model loading (not prediction). This is linked to the registration of variables in the tch-rs variable store and does not seem to result in an actual memory leak (false positive).
  • I ran several hundred rounds of model loading followed by one prediction on the translation model and did not see a noticeable increase in memory consumption.

Based on the above, there does not seem to be an obvious memory leak in the models from the library. Loading the model once and running predictions on demand is indeed the right way to use them. Is this working for you?

@Miezhiko
Author

Thank you for your investigations here.

Right now I really can't be sure about it, and next time I will run tests on my side first.
