StableVideoDiffusionPipeline : CUDA out of memory on SageMaker #6956

krokoko · 2024-02-12T20:14:04Z

### Describe the bug

Hi team!
I am using Diffusers and following this documentation; however, the link to open a notebook in Studio Lab doesn't work (the notebook is not present in the GitHub repo).

I am trying to deploy https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt to a SageMaker endpoint. I can deploy the model and run inference against it, but only the first request succeeds. Subsequent calls fail with a CUDA out-of-memory error.

Is there anything I could tweak on the Diffusers side? Thank you.

### Reproduction

In the notebook, I am using the following workflow:

Download the model artifacts from HF
Add an inference script, with the following code:

import base64
import torch
from io import BytesIO
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image


def model_fn(model_dir):
    pipe = StableVideoDiffusionPipeline.from_pretrained(model_dir, torch_dtype=torch.float16, variant="fp16")
    pipe.enable_model_cpu_offload()
    pipe.unet.enable_forward_chunking() #https://huggingface.co/docs/diffusers/using-diffusers/svd#reduce-memory-usage

    return pipe


def predict_fn(data, pipe):
    
    #available parameters: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py

    # get prompt & parameters
    prompt = data.pop("inputs", data)
    
    #motion_bucket_id=180
    #noise_aug_strength=0.1
    seed = data.pop("seed", 42)
    decode_chunk_size = data.pop("decode_chunk_size", 8)
    
    image = load_image(prompt)
    image = image.resize((1024, 576))

    generator = torch.manual_seed(seed)
    frames = pipe(image, decode_chunk_size=decode_chunk_size, generator=generator).frames[0]

    # encode each frame as a base64 JPEG string
    encoded_frames = []
    for image in frames:
        buffered = BytesIO()
        image.save(buffered, format="JPEG")
        encoded_frames.append(base64.b64encode(buffered.getvalue()).decode())

    # create response
    return {"frames": encoded_frames}

Compress model artifacts and inference script
Deploy the model to a SageMaker endpoint using sagemaker-huggingface-inference-toolkit:

import json
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.s3 import s3_path_join

# create async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join("s3://",sagemaker_session_bucket,"async_svd_inference/output"), # Where our results will be stored
)

hub = {
    'SM_NUM_GPUS': json.dumps(8),
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   #model_data=s3_model_uri,      # path to your model and script
   model_data="s3://BUCKET/model.tar.gz",
   role=role,                    # iam role with permissions to create an Endpoint
   transformers_version="4.37.0",  # transformers version used
   pytorch_version="2.1.0",       # pytorch version used
   py_version='py310',            # python version used
   env=hub
)

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    async_inference_config=async_config
    )

### Logs

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacty of 22.20 GiB of which 199.12 MiB is free. Process 22167 has 17.79 GiB memory in use. Process 22175 has 4.21 GiB memory in use. Of the allocated memory 2.68 GiB is allocated by PyTorch, and 157.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
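The memory figures in this message can be tallied with a short script. The following is only a sketch for reading the log, not part of the original report: the helper names `per_process_usage_gib` and `total_capacity_gib` are hypothetical, and the message is abridged to the sentences that carry numbers. It shows that two distinct processes together hold 17.79 + 4.21 = 22.00 GiB of the 22.20 GiB on GPU 0, which leaves almost nothing for a new 288 MiB allocation.

```python
import re

# Error text copied from the log above, abridged to the memory figures.
OOM_MESSAGE = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory. "
    "Tried to allocate 288.00 MiB. "
    "GPU 0 has a total capacty of 22.20 GiB of which 199.12 MiB is free. "
    "Process 22167 has 17.79 GiB memory in use. "
    "Process 22175 has 4.21 GiB memory in use."
)

# Conversion factors to GiB for the units PyTorch prints in this message.
UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def per_process_usage_gib(message):
    """Return {pid: GiB in use} for each 'Process N has X <unit> memory in use'."""
    pattern = re.compile(r"Process (\d+) has ([\d.]+) (MiB|GiB) memory in use")
    return {
        int(pid): float(amount) * UNIT_TO_GIB[unit]
        for pid, amount, unit in pattern.findall(message)
    }


def total_capacity_gib(message):
    """Return the GPU's total capacity in GiB, or None if it is not reported."""
    # 'capac\w*' matches both the 'capacty' spelling in this log and 'capacity'.
    match = re.search(r"total capac\w* of ([\d.]+) (MiB|GiB)", message)
    if match is None:
        return None
    return float(match.group(1)) * UNIT_TO_GIB[match.group(2)]


usage = per_process_usage_gib(OOM_MESSAGE)
for pid, gib in usage.items():
    print(f"process {pid}: {gib:.2f} GiB")
print(f"in use: {sum(usage.values()):.2f} GiB of {total_capacity_gib(OOM_MESSAGE):.2f} GiB")
```

Seeing two process IDs on the same device suggests that more than one worker loaded the pipeline onto GPU 0, although the log alone does not say why.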



### System Info

Platform: AWS
Instance: ml.g5.48xlarge
Diffusers==0.26.2
Transformers==4.37
Accelerate==0.27.0 

### Who can help?

@sayakpaul @DN6

DN6 · 2024-02-21T08:40:02Z

@krokoko You could try reducing the decode chunk size to 4. Looking at the traceback, it looks like multiple processes are using the same GPU. Is that the case here?
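Because `predict_fn` in the reproduction script reads `seed` and `decode_chunk_size` from the request body, the smaller chunk size can be tried without redeploying. The snippet below is a minimal sketch of building such a request body; the `build_request` helper and the example image URL are hypothetical, and only the three keys come from the script above.

```python
import json


def build_request(image_url, seed=42, decode_chunk_size=4):
    """Build a JSON body using the keys that predict_fn pops from the request."""
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be at least 1")
    return json.dumps(
        {
            "inputs": image_url,  # passed to load_image() in predict_fn
            "seed": seed,  # passed to torch.manual_seed()
            "decode_chunk_size": decode_chunk_size,  # frames decoded per VAE pass
        }
    )


# Hypothetical input image; replace with a URL the endpoint can reach.
print(build_request("https://example.com/input.png"))
```

A smaller `decode_chunk_size` lowers peak memory during frame decoding, but it does not help if a second process already occupies most of the GPU.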

krokoko · 2024-02-27T03:33:38Z

@DN6 thanks! I tried a lower decode chunk size, but the outcome was the same. I will try deploying the model with a different container.

github-actions · 2024-03-22T15:03:12Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

DN6 · 2024-03-28T06:19:11Z

Closing this issue for now. Feel free to reopen if the problem persists.

krokoko added the bug label (Something isn't working) on Feb 12, 2024

github-actions bot added the stale label (Issues that haven't received updates) on Mar 22, 2024

DN6 closed this as completed on Mar 28, 2024
