
Sharded adapters not working #46

Closed · markovalexander opened this issue Nov 20, 2023 · 2 comments · Fixed by #47
Labels: bug (Something isn't working)

@markovalexander

System Info

Model info:

{
  "model_id": "mistralai/Mistral-7B-Instruct-v0.1",
  "model_sha": "7ad5799710574ba1c1d953eba3077af582f3a773",
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": "text-generation",
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 1102544,
  "max_waiting_tokens": 20,
  "validation_workers": 2,
  "version": "0.1.0",
  "sha": null,
  "docker_label": null
}

2× A100 GPUs. NVIDIA-SMI 535.104.05, Driver Version 535.104.05, CUDA Version 12.2 (outside Docker).

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Run the Mistral example with Docker on 2 GPUs:

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/predibase/lorax:latest --model-id $model --num-shard 2

Then try to generate:

❯ curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
    -H 'Content-Type: application/json'
{"error":"Request failed during generation: Server error: local variable 'lora_b' referenced before assignment","error_type":"generation"}%

Basically, the issue is that when multiplying by the first lora_a matrix, it arrives sharded with shape [2048, r], while the input is not sharded and has shape [49, 4096], so the matmul fails.
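The mismatch can be reproduced in isolation. This is a hypothetical sketch (plain numpy, not LoRAX's actual code); `hidden_size`, `rank`, and `num_shards` are assumed values matching the shapes reported above:

```python
import numpy as np

# Assumed dimensions matching the report: hidden size 4096, 2 tensor-parallel
# shards, LoRA rank r = 16, batch of 49 tokens.
hidden_size = 4096
num_shards = 2
rank = 16
seq_len = 49

# Unsharded activations: [49, 4096]
x = np.zeros((seq_len, hidden_size), dtype=np.float16)

# lora_a sharded along its input dimension: [2048, r] per shard.
lora_a_shard = np.zeros((hidden_size // num_shards, rank), dtype=np.float16)

try:
    _ = x @ lora_a_shard  # [49, 4096] @ [2048, r] -> dimension mismatch
    failed = False
except ValueError as e:
    failed = True
    print("matmul fails:", e)

# For comparison, an unsharded lora_a lines up with the unsharded input:
lora_a_full = np.zeros((hidden_size, rank), dtype=np.float16)
out = x @ lora_a_full  # [49, 4096] @ [4096, r] -> [49, r]
print(out.shape)
```

Either the input must be sharded to match, or lora_a's input dimension must stay unsharded; with the mismatch, the code path that would set `lora_b` is never reached, hence the "local variable 'lora_b' referenced before assignment" error.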

Expected behavior

Generation completed successfully

@tgaddair (Contributor)

Thanks for reporting this issue, @markovalexander. That was definitely a recent regression, let me take a look and get this fixed today.

@tgaddair tgaddair self-assigned this Nov 20, 2023
@tgaddair tgaddair added the bug Something isn't working label Nov 20, 2023
@tgaddair (Contributor)

Hey @markovalexander and @abhibst, thanks for your patience with this. I just put up #47, which should address this issue. Feel free to test it out. Alternatively, I'll try to land this tonight, so new Docker images should hopefully be available shortly (within the next couple of hours).
