
Sharded adapters not working #46

Closed · markovalexander opened this issue Nov 20, 2023 · 2 comments · Fixed by #47
Labels: bug (Something isn't working)

@markovalexander

System Info

Model info:

{
  "model_id": "mistralai/Mistral-7B-Instruct-v0.1",
  "model_sha": "7ad5799710574ba1c1d953eba3077af582f3a773",
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 1102544,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "version": "0.1.0",
  "sha": null,
  "docker_label": null
}

2× A100 GPUs. NVIDIA-SMI 535.104.05, Driver Version 535.104.05, CUDA Version 12.2 (outside Docker).

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the Mistral example with Docker on 2 GPUs:

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --num-shard 2

Then try to generate:

❯ curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
    -H 'Content-Type: application/json'
{"error":"Request failed during generation: Server error: local variable 'lora_b' referenced before assignment","error_type":"generation"}%

Basically, the issue is that when multiplying by the first lora_a matrix, it arrives sharded with shape [2048, r], while the input is not sharded and has shape [49, 4096], so the matmul fails.
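The mismatch can be reproduced in isolation. This is a hypothetical sketch (plain numpy, not LoRAX's actual code); `hidden_size`, `rank`, and `num_shards` are assumed values matching the shapes reported above:

```python
import numpy as np

# Assumed dimensions matching the report: hidden size 4096, 2 tensor-parallel
# shards, LoRA rank r = 16, batch of 49 tokens.
hidden_size = 4096
num_shards = 2
rank = 16
seq_len = 49

# Unsharded activations: [49, 4096]
x = np.zeros((seq_len, hidden_size), dtype=np.float16)

# lora_a sharded along its input dimension: [2048, r] per shard.
lora_a_shard = np.zeros((hidden_size // num_shards, rank), dtype=np.float16)

try:
    _ = x @ lora_a_shard  # [49, 4096] @ [2048, r] -> dimension mismatch
    failed = False
except ValueError as e:
    failed = True
    print("matmul fails:", e)

# For comparison, an unsharded lora_a lines up with the unsharded input:
lora_a_full = np.zeros((hidden_size, rank), dtype=np.float16)
out = x @ lora_a_full  # [49, 4096] @ [4096, r] -> [49, r]
print(out.shape)
```

Either the input must be sharded to match, or lora_a's input dimension must stay unsharded; with the mismatch, the code path that would set `lora_b` is never reached, hence the "local variable 'lora_b' referenced before assignment" error.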

Expected behavior

Generation completed successfully

@tgaddair (Contributor)

Thanks for reporting this issue, @markovalexander. That was definitely a recent regression, let me take a look and get this fixed today.

@tgaddair tgaddair self-assigned this Nov 20, 2023
@tgaddair tgaddair added the bug Something isn't working label Nov 20, 2023
@tgaddair (Contributor)

Hey @markovalexander and @abhibst, thanks for your patience with this. I just put up #47, which should address this issue. Feel free to test it out. Alternatively, I'll try to land this tonight, so new Docker images should hopefully be available shortly (within the next couple of hours).
