Memory used to store models from closed browser sessions persists. #1324

Closed
ml-l opened this issue Jan 23, 2024 · 9 comments

ml-l commented Jan 23, 2024

When running h2oGPT through Docker (gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0) without pre-selecting a base model, so that models can be chosen dynamically after connecting to the instance through a browser session, the memory allocated for any models loaded in that session remains allocated and inaccessible after the browser session is closed without first unloading the model.

  • This is the case regardless of running with or without GPUs.
  • I am uncertain whether this issue exists when h2oGPT is installed and run by other methods.
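
For reference, a minimal sketch of monitoring the buildup from Python (illustrative only, not part of h2oGPT; it assumes the nvidia-ml-py / pynvml package and an NVIDIA GPU; in CPU-only mode the same buildup shows up in the container's system memory instead):

# Illustrative monitor: print GPU0 memory usage once per second.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU0 used: {info.used // 2**20} MiB / {info.total // 2**20} MiB")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
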
pseudotensor (Collaborator) commented

Are you referring to CPU or GPU memory? Thanks!

ml-l (Author) commented Jan 31, 2024

@pseudotensor Both.
i.e. if I run in GPU mode, the buildup occurs in GPU memory; if I run in CPU-only mode, the buildup is in CPU/system memory.

pseudotensor (Collaborator) commented

Hi @ml-l, thanks for finding this. Some clean-up a while back led to the issue. I pushed a fix for the case where loading a new model or unloading a model left memory still being consumed.

Does that solve your problem? Sorry for the long delay in fixing it.
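
For context, the general pattern for releasing a model's memory in PyTorch looks roughly like the sketch below (illustrative only, not the actual h2oGPT change; the Linear layer stands in for a loaded base model):

# Illustrative sketch: GPU memory is only reclaimed once all Python references
# to the model are dropped and the CUDA cache is released.
import gc
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(4096, 4096).to(device)   # stand-in for a loaded base model

# "Unload": drop the reference, garbage-collect, then release cached CUDA blocks
# so the freed memory is visible in nvidia-smi again.
model = None
gc.collect()
if device == "cuda":
    torch.cuda.empty_cache()
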

ml-l (Author) commented Feb 12, 2024

No worries regarding the delay.
I've tried the latest Docker image (tags 0.1.0-324 / latest), and the issue still seems to be there.

pseudotensor (Collaborator) commented

I'm confident the continued GPU use is fixed. I confirmed it was there and that I fixed it.

As for continued CPU use, I also saw that was fixed.

If you give me the specific sequence of steps you are doing, I can take a look.

@pseudotensor pseudotensor self-assigned this Feb 12, 2024
ml-l (Author) commented Feb 13, 2024

It could very well be how I've configured things, or perhaps where/how I'm checking doesn't align with where you've put the fix.

My sequence of actions when running in GPU mode is as follows:

1: Removed my current gcr.io/vorvan/h2oai/h2ogpt-runtime:latest Docker image to ensure that the latest one is downloaded.
2: Ran nvidia-smi -l 1 in a separate shell to monitor GPU usage (idle usage output below):

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:17:00.0 Off |                    0 |
| N/A   34C    P0              43W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:65:00.0 Off |                    0 |
| N/A   34C    P0              46W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

3: Ran the following two commands to run h2oGPT in GPU mode with the ability to choose models dynamically:

export GRADIO_SERVER_PORT=7860
sudo docker run \
    --gpus device=0 \
    --runtime=nvidia \
    --shm-size=2g \
    -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
    --rm --init \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v /mnt/alpha/.cache:/workspace/.cache \
    -v /mnt/alpha/h2ogpt_share/save:/workspace/save \
    -v /mnt/alpha/h2ogpt_share/user_path:/workspace/user_path \
    -v /mnt/alpha/h2ogpt_share/db_dir_UserData:/workspace/db_dir_UserData \
    -v /mnt/alpha/h2ogpt_share/users:/workspace/users \
    -v /mnt/alpha/h2ogpt_share/db_nonusers:/workspace/db_nonusers \
    -v /mnt/alpha/h2ogpt_share/llamacpp_path:/workspace/llamacpp_path \
    -v /mnt/alpha/h2ogpt_share/h2ogpt_auth:/workspace/h2ogpt_auth \
    -e USER=someone \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:latest /workspace/generate.py \
       --use_safetensors=True \
       --save_dir='/workspace/save/' \
       --use_gpu_id=False \
       --user_path=/workspace/user_path \
       --langchain_mode="LLM" \
       --langchain_modes="['UserData', 'LLM']" \
       --score_model=None \
       --max_max_new_tokens=2048 \
       --max_new_tokens=1024

At this point, idle GPU0 usage is 2641MiB / 40960MiB (GPU1 remained at 4MiB / 40960MiB, as intended given the command).

4: Opened an incognito/private browser window (Firefox in my case, but I don't think this should matter) to the hosted h2oGPT instance.

5: Opened the Models tab and entered the following parameters:

  • Base Model = HuggingFaceH4/zephyr-7b-beta
  • LORA = None (default)
  • Enter Server = None (default)
  • Prompt Type = zephyr

6: Clicked Load (Download) Model. GPU0 usage is now 17195MiB / 40960MiB.

7: Closed the browser that's connected to h2ogpt. GPU0 usage remains 17195MiB / 40960MiB.

8: Re-opened the browser in incognito/private mode and went to the hosted h2ogpt instance again. GPU0 usage remains 17195MiB / 40960MiB.

9: Repeated steps 5 and 6 to load another zephyr model, to see whether the fix works by preventing multiple copies of the same model from being loaded. GPU0 usage is now 31619MiB / 40960MiB.

10: Clicked UnLoad Model, which brings GPU0 usage back down to 17195MiB / 40960MiB.

11: Closed the browser session and checked again; GPU0 usage remains 17195MiB / 40960MiB.

Only when I stop the Docker container running h2oGPT does GPU0 memory usage go back to 4MiB / 40960MiB.

And double-checking the hash of the Docker image I'm using, the output of

sudo docker inspect --format='{{index .RepoDigests 0}}' gcr.io/vorvan/h2oai/h2ogpt-runtime:latest

is the following:

gcr.io/vorvan/h2oai/h2ogpt-runtime@sha256:806b31aadbd0ca24f1e0e2822c9f38a6a51b1e0f45c56a290081f35c04997dc4

EDIT: Step 9 means repeating steps 5 and 6 (loading another zephyr model, to check whether the fix works by preventing extra memory from being allocated when the same model is loaded again), not repeating step 3 (spinning up another Docker container).

pseudotensor (Collaborator) commented Feb 14, 2024

Ah yes, if you do the step:

Closed the browser that's connected to h2ogpt. GPU0 usage remains 17195MiB / 40960MiB.

The server loses track of who you are, and the model stays associated with that prior user.

The problem is this: gradio-app/gradio#4016
gradio-app/gradio#7227

I'm unsure how to work around it.
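
Roughly, the situation is the pattern sketched below (illustrative only, not h2oGPT's actual code): the loaded model lives in per-session state, and since the server gets no event when the tab closes, nothing ever drops that session's reference.

# Illustrative sketch of the failure mode with per-session state in gradio.
import gradio as gr

def load_model(name, session):
    # Stand-in for loading real model weights into the user's session state.
    session["model"] = f"<weights for {name}>"
    return f"loaded {name}", session

with gr.Blocks() as demo:
    session = gr.State({})                     # per-browser-session storage
    name = gr.Textbox(label="Base Model")
    status = gr.Textbox(label="Status")
    gr.Button("Load Model").click(load_model, [name, session], [status, session])

demo.launch()

When the tab closes, that session copy of the state (and the model it references) is simply orphaned on the server.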

pseudotensor (Collaborator) commented

Ok, I'm following along. First I have:

  1. docker pulled
docker pull gcr.io/vorvan/h2oai/h2ogpt-runtime:0.1.0
  2. run watch -n 1 nvidia-smi

[screenshot: nvidia-smi output, idle GPUs]

  3. Then I ran my version of your run:
export GRADIO_SERVER_PORT=7860
docker run \
    --gpus device=0 \
    --runtime=nvidia \
    --shm-size=2g \
    -p $GRADIO_SERVER_PORT:$GRADIO_SERVER_PORT \
    --rm --init \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v /home/jon/.cache/huggingface/hub:/workspace/.cache/huggingface/hub \
    -v /home/jon/.cache/huggingface/modules:/workspace/.cache/huggingface/modules \
    -v /home/jon/h2ogpt/save:/workspace/save \
    -v /home/jon/h2ogpt/user_path:/workspace/user_path \
    -v /home/jon/h2ogpt/db_dir_UserData:/workspace/db_dir_UserData \
    -v /home/jon/h2ogpt/users:/workspace/users \
    -v /home/jon/h2ogpt/db_nonusers:/workspace/db_nonusers \
    -v /home/jon/h2ogpt/llamacpp_path:/workspace/llamacpp_path \
    -v /home/jon/h2ogpt/h2ogpt_auth:/workspace/h2ogpt_auth \
    -e USER=jon \
    gcr.io/vorvan/h2oai/h2ogpt-runtime:latest /workspace/generate.py \
       --use_safetensors=True \
       --save_dir='/workspace/save/' \
       --use_gpu_id=False \
       --user_path=/workspace/user_path \
       --langchain_mode="LLM" \
       --langchain_modes="['UserData', 'LLM']" \
       --score_model=None \
       --max_max_new_tokens=2048 \
       --max_new_tokens=1024
  4. open browser localhost:7860

  5. clicked load model on zephyr 7b beta like you

  6. check nvidia-smi, see 18236MB used

[screenshot: nvidia-smi output showing 18236MB used]

  7. click unload and then wait 20s, then see 3.6GB used (embedding etc.)

[screenshot: nvidia-smi output after unload, about 3.6GB used]

pseudotensor (Collaborator) commented

I modified gradio to be able to do this now.

pseudotensor added a commit that referenced this issue Feb 15, 2024
Fixes #1324 -- clear memory when browser tab closes
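
For readers hitting this later: the shape of the fix is a tab-close handler that drops the session's model references, roughly as sketched below. This is an illustrative sketch, not the actual h2oGPT code; it assumes a gradio version exposing the Blocks unload event (added around gradio 4.19), whereas the fix referenced above used a modified gradio and h2oGPT's own bookkeeping.

# Illustrative sketch: free the loaded model when the browser tab closes.
import gc
import torch
import gradio as gr

LOADED = {"model": None}           # hypothetical holder for the session's loaded model

def free_model():
    # Fired when the browser tab is closed or refreshed: drop the model and
    # release cached GPU memory so nvidia-smi reflects the change.
    LOADED["model"] = None
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

with gr.Blocks() as demo:
    gr.Markdown("h2oGPT-style UI stub")
    demo.unload(free_model)        # tab-close event handler

demo.launch()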