
feat: add CuBLAS support in Docker images #403

Merged: 4 commits into mudler:master from dockerfile-cublas on May 29, 2023

Conversation

@sebastien-prudhomme (Contributor) commented on May 28, 2023

Description

This PR partially fixes #280. The CI also needs to be modified to build multiple Docker images corresponding to the different compilation options.

Notes for Reviewers

I've chosen CUDA 11 as the default version. This can be changed when building the Docker image by using the CUDA_MAJOR_VERSION and CUDA_MINOR_VERSION build args:

docker build --build-arg CUDA_MAJOR_VERSION=12 --build-arg CUDA_MINOR_VERSION=1 ...

I've also chosen to install the needed libraries only when BUILD_TYPE is "cublas". I've adapted the "openblas" and "stablediffusion" options in the same way.

Be careful: the rebuild performed when the image starts will only allow rebuilding with the same options that were provided at build time.
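
To make the build-arg and conditional-install idea concrete, here is a minimal Dockerfile-style sketch (not the actual diff in this PR); the Debian/Ubuntu base, the apt package names, and the CUDA minor version default are assumptions:

# Sketch only: defaults and package names are illustrative, not the PR's exact Dockerfile.
# Assumes the NVIDIA CUDA apt repository has already been added to the image.
ARG BUILD_TYPE=
ARG CUDA_MAJOR_VERSION=11
ARG CUDA_MINOR_VERSION=7

# Pull in CUDA build dependencies only for the cublas build type.
RUN if [ "${BUILD_TYPE}" = "cublas" ]; then \
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION}; \
    fi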

For people who want to test on Linux: you need an NVIDIA card, a recent NVIDIA driver and the nvidia-container-toolkit. Then just launch the container with docker run --gpus all ... and don't forget to configure "gpu_layers" in your model definition.
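
As a concrete example of such a launch (the image tag and host path here are hypothetical, and the port mapping assumes the API listens on 8080):

# --gpus all exposes the host GPUs to the container (requires nvidia-container-toolkit).
docker run --gpus all \
  -p 8080:8080 \
  -v $PWD/models:/build/models \
  local-ai:cublas

The /build/models mount matches the model path shown in the log below.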

You should see VRAM offloading when the model is loaded:

llama.cpp: loading model from /build/models/vicuna
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 1024
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 1932.71 MB (+ 2052.00 MB per state)
llama_model_load_internal: [cublas] offloading 32 layers to GPU
llama_model_load_internal: [cublas] total VRAM used: 3475 MB

Signed commits

  • Yes, I signed my commits.

@ghost commented on May 28, 2023

Hello there! How do you set how many layers are offloaded to GPU?

@sebastien-prudhomme (Contributor, Author)

> Hello there! How do you set how many layers are offloaded to GPU?

Hi, you need to set "gpu_layers" in the model definition:

backend: llama
context_size: 1024
name: vicuna
parameters:
  model: vicuna
  temperature: 0.2
  top_k: 80
  top_p: 0.7
template:
  chat: vicuna-chat
  completion: vicuna-completion
gpu_layers: 32
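
Once the model definition above is in place, a quick way to confirm offloading is to send a request and watch the container logs; this example against the OpenAI-compatible chat endpoint assumes the API is reachable on localhost:8080:

# Send a minimal chat completion request to the "vicuna" model defined above.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "vicuna", "messages": [{"role": "user", "content": "Hello"}]}'

If cuBLAS is active, you should see "[cublas] offloading ... layers to GPU" lines in the logs, as in the PR description.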

@sebastien-prudhomme (Contributor, Author)

@marianbastiUNRN, regarding your problem described on Discord: which CUDA version and which NVIDIA driver version are you using? See the nvidia-smi output.

Try to build the image with the same CUDA_MAJOR_VERSION and CUDA_MINOR_VERSION as your driver.

CUDA and driver compatibility: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#minor-version-compatibility
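
A hedged way to check this in practice (the image tag is illustrative, and passing BUILD_TYPE as a build arg is an assumption; the version args are the ones introduced by this PR):

# The nvidia-smi header reports the highest CUDA version the installed driver supports.
nvidia-smi

# Then build with matching version args, e.g. for a driver reporting CUDA 12.1:
docker build \
  --build-arg BUILD_TYPE=cublas \
  --build-arg CUDA_MAJOR_VERSION=12 \
  --build-arg CUDA_MINOR_VERSION=1 \
  -t local-ai:cublas .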

@mudler (Owner) commented on May 29, 2023

@sebastien-prudhomme that's amazing, thank you! This is looking good at a first pass; I'll review it later and try to give it a shot locally too.

@ghost commented on May 29, 2023

> @marianbastiUNRN, regarding your problem described on Discord: which CUDA version and which NVIDIA driver version are you using? See the nvidia-smi output.
> Try to build the image with the same CUDA_MAJOR_VERSION and CUDA_MINOR_VERSION as your driver.
> CUDA and driver compatibility: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#minor-version-compatibility

Thanks for the reply!
I'm using the image nvidia/cuda:12.1.1-cudnn8-devel-ubuntu20.04 (CUDA 12.1.1, cuDNN 8 as you can tell), the same as my driver. Last night I was able to offload work to the GPU by hardcoding the GPU layers in bindings.cpp and options.go and then restarting the container, forcing a rebuild. So the problem is definitely with my config-file.yaml.
As I mentioned on Discord, I get an error like

  line 1: cannot unmarshal !!map into []*api.Config

My config file looks like this (UTF-8 encoded and validated):

---
name: gpt-3.5-turbo
description: |
  Manticore 13B - (previously Wizard Mega) 
license: N/A
config_file: |
  backend: llama
  parameters:
    model: manticore
    top_k: 80
    temperature: 0.2
    top_p: 0.7
  context_size: 1024
  f16: true
  template:
    completion: manticore-completion
    chat: manticore-chat
prompt_templates:
  - name: manticore-completion
    content: |
      ### Instruction: Complete the following sentence: {{.Input}}

      ### Assistant:
  - name: manticore-chat
    content: |
      ### Instruction: {{.Input}}

      ### Assistant:

gpu_layers: 60

Another relevant comment of mine is here.
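
For context on the error itself: "cannot unmarshal !!map into []*api.Config" is the Go YAML decoder saying the file parsed as a single mapping while the loader expected a list of model configs. A hypothetical list-shaped file, reusing the values from the earlier reply, would start each entry with a dash:

# Illustrative list-form config: each entry is one model definition.
- name: vicuna
  backend: llama
  context_size: 1024
  parameters:
    model: vicuna
    temperature: 0.2
    top_k: 80
    top_p: 0.7
  template:
    chat: vicuna-chat
    completion: vicuna-completion
  gpu_layers: 32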

Dockerfile review comment (outdated, resolved)
@mudler (Owner) left a comment:

Fantastic, thanks @sebastien-prudhomme!

@mudler merged commit 2272324 into mudler:master on May 29, 2023 (3 checks passed).
@mudler added the enhancement (New feature or request) label on May 30, 2023.
@sebastien-prudhomme deleted the dockerfile-cublas branch on May 31, 2023.
Labels: enhancement (New feature or request)
Merging this pull request may close: feature: docker-cuda images