StableVideoDiffusionPipeline : CUDA out of memory on SageMaker #6956

Closed
krokoko opened this issue Feb 12, 2024 · 4 comments
Labels: bug, stale

krokoko commented Feb 12, 2024

Describe the bug

Hi team!
I am using Diffusers and following this documentation. The link to open the notebook in Studio Lab doesn't work, because the notebook is not present in the GitHub repo.

I am trying to deploy https://huggingface.co/stabilityai/stable-video-diffusion-img2vid-xt to a SageMaker endpoint. I can deploy the model and run inference against it, but only the first request succeeds. Subsequent calls fail with an out-of-memory error.

Is there anything I could tweak on the diffusers side? Thank you.

Reproduction

In the notebook, I am using the following workflow:

  • Download the model artifacts from HF
  • Add an inference script, with the following code:
import base64
import torch
from io import BytesIO
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image


def model_fn(model_dir):
    pipe = StableVideoDiffusionPipeline.from_pretrained(model_dir, torch_dtype=torch.float16, variant="fp16")
    pipe.enable_model_cpu_offload()
    pipe.unet.enable_forward_chunking() #https://huggingface.co/docs/diffusers/using-diffusers/svd#reduce-memory-usage

    return pipe


def predict_fn(data, pipe):
    # Available parameters: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py

    # "inputs" carries the source image as a URL or path that load_image accepts
    image_source = data.pop("inputs", data)

    # motion_bucket_id=180
    # noise_aug_strength=0.1
    seed = data.pop("seed", 42)
    decode_chunk_size = data.pop("decode_chunk_size", 8)

    image = load_image(image_source)
    image = image.resize((1024, 576))

    generator = torch.manual_seed(seed)
    frames = pipe(image, decode_chunk_size=decode_chunk_size, generator=generator).frames[0]

    # Encode each generated frame as a base64 JPEG string
    encoded_frames = []
    for frame in frames:
        buffered = BytesIO()
        frame.save(buffered, format="JPEG")
        encoded_frames.append(base64.b64encode(buffered.getvalue()).decode())

    # create response
    return {"frames": encoded_frames}
  • Deploy the model to an asynchronous endpoint, with the following code:
import json
from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.async_inference.async_inference_config import AsyncInferenceConfig
from sagemaker.s3 import s3_path_join

# create async endpoint configuration
async_config = AsyncInferenceConfig(
    output_path=s3_path_join("s3://",sagemaker_session_bucket,"async_svd_inference/output"), # Where our results will be stored
)

hub = {
    'SM_NUM_GPUS': json.dumps(8),
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
   #model_data=s3_model_uri,      # path to your model and script
   model_data="s3://BUCKET/model.tar.gz",
   role=role,                    # iam role with permissions to create an Endpoint
   transformers_version="4.37.0",  # transformers version used
   pytorch_version="2.1.0",       # pytorch version used
   py_version='py310',            # python version used
   env=hub
)

# deploy the endpoint
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",
    async_inference_config=async_config
    )
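
One thing that may be worth checking with this setup: as far as I can tell, `enable_model_cpu_offload()` uses the first GPU unless a `gpu_id` is passed, so if the serving container starts several worker processes, each of them could place the pipeline on GPU 0. The sketch below spreads workers across GPUs. `pick_gpu_index` is a hypothetical helper, and using the process ID as a worker identifier is an assumption, not something the SageMaker container is known to guarantee.

```python
import os


def pick_gpu_index(worker_id, gpu_count):
    """Map a worker identifier onto one of the available GPUs (round-robin)."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return worker_id % gpu_count


def model_fn(model_dir):
    # Imported lazily so the helper above can be exercised without a GPU stack.
    import torch
    from diffusers import StableVideoDiffusionPipeline

    # Assumption: the PID stands in for a worker index, which the
    # container may or may not expose in a more reliable way.
    gpu_id = pick_gpu_index(os.getpid(), torch.cuda.device_count())

    pipe = StableVideoDiffusionPipeline.from_pretrained(
        model_dir, torch_dtype=torch.float16, variant="fp16"
    )
    # Offload to the chosen GPU instead of the default one.
    pipe.enable_model_cpu_offload(gpu_id=gpu_id)
    pipe.unet.enable_forward_chunking()
    return pipe
```

Process IDs are not guaranteed to spread evenly: with 8 GPUs, the PIDs 22167 and 22175 both map to index 7. A real worker index, if the container provides one, would be a better key.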

Logs

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. GPU 0 has a total capacty of 22.20 GiB of which 199.12 MiB is free. Process 22167 has 17.79 GiB memory in use. Process 22175 has 4.21 GiB memory in use. Of the allocated memory 2.68 GiB is allocated by PyTorch, and 157.72 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
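
The message above names two separate processes holding memory on GPU 0. A small parser makes that easier to see. This is only a sketch: it assumes the message format shown in this log, which may differ between PyTorch versions, and `summarize_oom` is a hypothetical helper, not part of PyTorch.

```python
import re

# Relevant part of the log line from the report above.
OOM_MESSAGE = (
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB. "
    "GPU 0 has a total capacty of 22.20 GiB of which 199.12 MiB is free. "
    "Process 22167 has 17.79 GiB memory in use. "
    "Process 22175 has 4.21 GiB memory in use. "
    "Of the allocated memory 2.68 GiB is allocated by PyTorch, "
    "and 157.72 MiB is reserved by PyTorch but unallocated."
)

_UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def to_gib(value, unit):
    """Convert a size such as ("288.00", "MiB") to GiB."""
    return float(value) * _UNIT_TO_GIB[unit]


def summarize_oom(message):
    """Extract the GPU index, its capacity, and per-process usage in GiB."""
    # "capac\w*" tolerates the "capacty" spelling that appears in this log.
    gpu = re.search(r"GPU (\d+) has a total capac\w* of ([\d.]+) (MiB|GiB)", message)
    processes = {
        int(pid): to_gib(value, unit)
        for pid, value, unit in re.findall(
            r"Process (\d+) has ([\d.]+) (MiB|GiB) memory in use", message
        )
    }
    return {
        "gpu_index": int(gpu.group(1)) if gpu else None,
        "total_gib": to_gib(gpu.group(2), gpu.group(3)) if gpu else None,
        "processes_gib": processes,
    }


summary = summarize_oom(OOM_MESSAGE)
print(summary)
```

In this log the two processes together hold about 22.0 GiB of the 22.20 GiB on GPU 0, which leaves almost nothing for a new 288 MiB allocation.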


System Info

Platform: AWS
Instance: ml.g5.48xlarge
Diffusers==0.26.2
Transformers==4.37
Accelerate==0.27.0

Who can help?

@sayakpaul @DN6 

krokoko added the bug label Feb 12, 2024

DN6 (Collaborator) commented Feb 21, 2024

@krokoko You could try reducing the decode chunk size to 4. Looking at the traceback, it seems that multiple processes are trying to use the same GPU. Is that the case here?
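
With the inference script from the report, `decode_chunk_size` is read from the request body, so a smaller value can be tried without redeploying. A minimal sketch of such a request body follows; the image URL is a placeholder and `build_payload` is a hypothetical helper.

```python
import json


def build_payload(image_url, seed=42, decode_chunk_size=4):
    """Build the JSON body whose keys predict_fn in the report pops from `data`."""
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be a positive integer")
    return json.dumps(
        {"inputs": image_url, "seed": seed, "decode_chunk_size": decode_chunk_size}
    )


# Placeholder URL; replace it with an image that load_image can fetch.
payload = build_payload("https://example.com/input.png")
print(payload)
```

Smaller chunks decode fewer frames at once, which lowers peak memory during decoding at the cost of speed. It would not help if the memory is held by another process on the same GPU.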

krokoko (Author) commented Feb 27, 2024

@DN6 Thanks! I tried a lower decode chunk size, but the outcome was the same. I will try deploying the model with a different container.

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label Mar 22, 2024

DN6 (Collaborator) commented Mar 28, 2024

Closing this issue for now. Feel free to reopen if the problem persists.

DN6 closed this as completed Mar 28, 2024