Fix for MPS support on Apple Silicon #393

Merged
merged 4 commits into from
Mar 18, 2023

Conversation

WojtekKowaluk
Contributor

@WojtekKowaluk WojtekKowaluk commented Mar 18, 2023

First, install macOS Ventura 13.3 Beta (or later); it does not work on 13.2.
Then install the dev (nightly) version of torch; it does not work on 2.0.0.
pip install -U --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
With the changes from this PR, it should then work.
Tested with the facebook/opt-2.7b model.
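
The macOS requirement above can be checked before installing anything. This is a minimal sketch, not part of the PR: the helper names `parse_version` and `macos_meets_minimum` are made up for illustration, and the 13.3 threshold is simply the one stated in this comment.

```python
import platform


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version such as '13.2.1' into (13, 2, 1)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def macos_meets_minimum(version: str, minimum: tuple[int, ...] = (13, 3)) -> bool:
    """Return True when `version` is at least `minimum`.

    An empty string (what platform.mac_ver() reports off macOS) counts as False.
    """
    parsed = parse_version(version)
    return bool(parsed) and parsed >= minimum


if __name__ == "__main__":
    current = platform.mac_ver()[0]  # '' on non-macOS systems
    print("macOS version:", current or "n/a")
    print("meets the 13.3 minimum:", macos_meets_minimum(current))
```

Versions are compared as integer tuples rather than strings, so "13.10" would sort after "13.3" as expected.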

@WojtekKowaluk WojtekKowaluk changed the title Fixed for MPS support on Apple Silicon Fix for MPS support on Apple Silicon Mar 18, 2023
@oobabooga
Owner

Does it not work if you install pytorch with

conda install pytorch torchvision torchaudio -c pytorch

as recommended here? https://pytorch.org/get-started/locally/

@WojtekKowaluk
Contributor Author

WojtekKowaluk commented Mar 18, 2023

Does it not work if you install pytorch with

conda install pytorch torchvision torchaudio -c pytorch

as recommended here? https://pytorch.org/get-started/locally/

I get this error with your conda instructions:

NotImplementedError: The operator 'aten::cumsum.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
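
The temporary fix named in the error message can also be applied from Python instead of the shell. A minimal sketch, assuming (as is commonly advised, but not confirmed in this thread) that the variable has to be set before `torch` is imported; `fallback_enabled` is a hypothetical helper used only to show the result.

```python
import os

# Set the fallback flag before any `import torch` happens in this process.
# setdefault() keeps a value the user has already exported in their shell.
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


def fallback_enabled() -> bool:
    """Report whether the CPU fallback for unimplemented MPS ops is requested."""
    return os.environ.get("PYTORCH_ENABLE_MPS_FALLBACK") == "1"


if __name__ == "__main__":
    print("MPS CPU fallback requested:", fallback_enabled())
```

As the message itself warns, ops that fall back to the CPU run slower than they would natively on MPS, so this is a workaround rather than a replacement for the nightly build.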

@oobabooga
Owner

Thanks for the information and the PR. That was really helpful!

@oobabooga oobabooga merged commit bcd8afd into oobabooga:main Mar 18, 2023
@WojtekKowaluk
Contributor Author

I will provide my full instructions in a sec.

@WojtekKowaluk
Contributor Author

WojtekKowaluk commented Mar 18, 2023

I have Python 3.10.10 installed.

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
pip install -U --pre torch torchvision -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html

Note that you will get this error:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
rwkv 0.4.2 requires torch~=1.13.1, but you have torch 2.1.0.dev20230317 which is incompatible.

Ignore it and run:

python server.py

@oobabooga
Owner

I have linked this thread in the README for macOS users to refer to.

TheTerrasque pushed a commit to TheTerrasque/text-generation-webui that referenced this pull request Mar 19, 2023
commit 0cbe2dd
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Mar 18 12:24:54 2023 -0300

    Update README.md

commit 36ac7be
Merge: d2a7fac 705f513
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Mar 18 11:57:10 2023 -0300

    Merge pull request oobabooga#407 from ThisIsPIRI/gitignore

    Add loras to .gitignore

commit d2a7fac
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Mar 18 11:56:04 2023 -0300

    Use pip instead of conda for pytorch

commit 705f513
Author: ThisIsPIRI <thisispiri@gmail.com>
Date:   Sat Mar 18 23:33:24 2023 +0900

    Add loras to .gitignore

commit a0b1a30
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Mar 18 11:23:56 2023 -0300

    Specify torchvision/torchaudio versions

commit c753261
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Mar 18 10:55:57 2023 -0300

    Disable stop_at_newline by default

commit 7c945cf
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Mar 18 10:55:24 2023 -0300

    Don't include PeftModel every time

commit 86b9900
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Mar 18 10:27:52 2023 -0300

    Remove rwkv dependency

commit a163807
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Sat Mar 18 03:07:27 2023 -0300

    Update README.md

commit a7acfa4
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 22:57:46 2023 -0300

    Update README.md

commit bcd8afd
Merge: dc35861 e26763a
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 22:57:28 2023 -0300

    Merge pull request oobabooga#393 from WojtekKowaluk/mps_support

    Fix for MPS support on Apple Silicon

commit e26763a
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 22:56:46 2023 -0300

    Minor changes

commit 7994b58
Author: Wojtek Kowaluk <wojtek@Wojteks-MacBook-Pro.local>
Date:   Sat Mar 18 02:27:26 2023 +0100

    clean up duplicated code

commit dc35861
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 21:05:17 2023 -0300

    Update README.md

commit 30939e2
Author: Wojtek Kowaluk <wojtek@Wojteks-MacBook-Pro.local>
Date:   Sat Mar 18 00:56:23 2023 +0100

    add mps support on apple silicon

commit 7d97da1
Author: Wojtek Kowaluk <wojtek@Wojteks-MacBook-Pro.local>
Date:   Sat Mar 18 00:17:05 2023 +0100

    add venv paths to gitignore

commit f2a5ca7
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 20:50:27 2023 -0300

    Update README.md

commit 8c8286b
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 20:49:40 2023 -0300

    Update README.md

commit 0c05e65
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 20:25:42 2023 -0300

    Update README.md

commit adc2003
Merge: 20f5b45 66e8d12
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 20:19:33 2023 -0300

    Merge branch 'main' of github.com:oobabooga/text-generation-webui

commit 20f5b45
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 20:19:04 2023 -0300

    Add parameters reference oobabooga#386 oobabooga#331

commit 66e8d12
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 19:59:37 2023 -0300

    Update README.md

commit 9a87111
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 19:52:22 2023 -0300

    Update README.md

commit d4f38b6
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 18:57:48 2023 -0300

    Update README.md

commit ad7c829
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 18:55:01 2023 -0300

    Update README.md

commit 4426f94
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 18:51:07 2023 -0300

    Update the installation instructions. Tldr use WSL

commit 9256e93
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 17:45:28 2023 -0300

    Add some LoRA params

commit 9ed2c45
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 16:06:11 2023 -0300

    Use markdown in the "HTML" tab

commit f0b2645
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 13:07:17 2023 -0300

    Add a comment

commit 7da742e
Merge: ebef4a5 02e1113
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 12:37:23 2023 -0300

    Merge pull request oobabooga#207 from EliasVincent/stt-extension

    Extension: Whisper Speech-To-Text Input

commit ebef4a5
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 11:58:45 2023 -0300

    Update README

commit cdfa787
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 11:53:28 2023 -0300

    Update README

commit 3bda907
Merge: 4c13067 614dad0
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 11:48:48 2023 -0300

    Merge pull request oobabooga#366 from oobabooga/lora

    Add LoRA support

commit 614dad0
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 11:43:11 2023 -0300

    Remove unused import

commit a717fd7
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 11:42:25 2023 -0300

    Sort the imports

commit 7d97287
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 11:41:12 2023 -0300

    Update settings-template.json

commit 29fe7b1
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 11:39:48 2023 -0300

    Remove LoRA tab, move it into the Parameters menu

commit 214dc68
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 11:24:52 2023 -0300

    Several QoL changes related to LoRA

commit 4c13067
Merge: ee164d1 53b6a66
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Fri Mar 17 09:47:57 2023 -0300

    Merge pull request oobabooga#377 from askmyteapot/Fix-Multi-gpu-GPTQ-Llama-no-tokens

    Update GPTQ_Loader.py

commit 53b6a66
Author: askmyteapot <62238146+askmyteapot@users.noreply.github.com>
Date:   Fri Mar 17 18:34:13 2023 +1000

    Update GPTQ_Loader.py

    Correcting decoder layer for renamed class.

commit 0cecfc6
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 21:35:53 2023 -0300

    Add files

commit 104293f
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 21:31:39 2023 -0300

    Add LoRA support

commit ee164d1
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 18:22:16 2023 -0300

    Don't split the layers in 8-bit mode by default

commit 0a2aa79
Merge: dd1c596 e085cb4
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 17:27:03 2023 -0300

    Merge pull request oobabooga#358 from mayaeary/8bit-offload

    Add support for memory maps with --load-in-8bit

commit e085cb4
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 13:34:23 2023 -0300

    Small changes

commit dd1c596
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 12:45:27 2023 -0300

    Update README

commit 38d7017
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 12:44:03 2023 -0300

    Add all command-line flags to "Interface mode"

commit 83cb20a
Author: awoo <awoo@awoo>
Date:   Thu Mar 16 18:42:53 2023 +0300

    Add support for --gpu-memory witn --load-in-8bit

commit 23a5e88
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 11:16:17 2023 -0300

    The LLaMA PR has been merged into transformers

    huggingface/transformers#21955

    The tokenizer class has been changed from

    "LLaMATokenizer"

    to

    "LlamaTokenizer"

    It is necessary to edit this change in every tokenizer_config.json
    that you had for LLaMA so far.
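
    The rename described in this commit message can be scripted rather than edited by hand. A minimal sketch, assuming the class name is stored under the "tokenizer_class" key of each tokenizer_config.json; the function names and the idea of scanning a models directory are illustrative and not taken from the repository.

```python
import json
from pathlib import Path


def fix_tokenizer_class(config_path: Path) -> bool:
    """Rewrite 'LLaMATokenizer' to 'LlamaTokenizer'; return True if the file changed."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    if config.get("tokenizer_class") != "LLaMATokenizer":
        return False  # already updated, or not a LLaMA tokenizer config
    config["tokenizer_class"] = "LlamaTokenizer"
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True


def fix_all(models_dir: Path) -> list[Path]:
    """Apply the fix to every tokenizer_config.json below models_dir."""
    return [
        path
        for path in sorted(models_dir.rglob("tokenizer_config.json"))
        if fix_tokenizer_class(path)
    ]
```

    Calling `fix_all(Path("models"))` would return the list of files it rewrote; "models" is a placeholder path here, and a second run returns an empty list because nothing is left to change.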

commit d54f3f4
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 10:19:00 2023 -0300

    Add no-stream checkbox to the interface

commit 1c37896
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 10:18:34 2023 -0300

    Remove unused imports

commit a577fb1
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Thu Mar 16 00:46:59 2023 -0300

    Keep GALACTICA special tokens (oobabooga#300)

commit 25a00ea
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 23:43:35 2023 -0300

    Add "Experimental" warning

commit 599d313
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 23:34:08 2023 -0300

    Increase the reload timeout a bit

commit 4d64a57
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 23:29:56 2023 -0300

    Add Interface mode tab

commit b501722
Merge: ffb8986 d3a280e
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 20:46:04 2023 -0300

    Merge branch 'main' of github.com:oobabooga/text-generation-webui

commit ffb8986
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 20:44:34 2023 -0300

    Mini refactor

commit d3a280e
Merge: 445ebf0 0552ab2
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 20:22:08 2023 -0300

    Merge pull request oobabooga#348 from mayaeary/feature/koboldai-api-share

    flask_cloudflared for shared tunnels

commit 445ebf0
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 20:06:46 2023 -0300

    Update README.md

commit 0552ab2
Author: awoo <awoo@awoo>
Date:   Thu Mar 16 02:00:16 2023 +0300

    flask_cloudflared for shared tunnels

commit e9e76bb
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 19:42:29 2023 -0300

    Delete WSL.md

commit 09045e4
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 19:42:06 2023 -0300

    Add WSL guide

commit 9ff5033
Merge: 66256ac 055edc7
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 19:37:26 2023 -0300

    Merge pull request oobabooga#345 from jfryton/main

    Guide for Windows Subsystem for Linux

commit 66256ac
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 19:31:27 2023 -0300

    Make the "no GPU has been detected" message more descriptive

commit 055edc7
Author: jfryton <35437877+jfryton@users.noreply.github.com>
Date:   Wed Mar 15 18:21:14 2023 -0400

    Update WSL.md

commit 89883a3
Author: jfryton <35437877+jfryton@users.noreply.github.com>
Date:   Wed Mar 15 18:20:21 2023 -0400

    Create WSL.md guide for setting up WSL Ubuntu

    Quick start guide for Windows Subsystem for Linux (Ubuntu), including port forwarding to enable local network webui access.

commit 67d6247
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 18:56:26 2023 -0300

    Further reorganize chat UI

commit ab12a17
Merge: 6a1787a 3028112
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 18:31:39 2023 -0300

    Merge pull request oobabooga#342 from mayaeary/koboldai-api

    Extension: KoboldAI api

commit 3028112
Author: awoo <awoo@awoo>
Date:   Wed Mar 15 23:52:46 2023 +0300

    KoboldAI api

commit 6a1787a
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 16:55:40 2023 -0300

    CSS fixes

commit 3047ed8
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 16:41:38 2023 -0300

    CSS fix

commit 87b84d2
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 16:39:59 2023 -0300

    CSS fix

commit c1959c2
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 16:34:31 2023 -0300

    Show/hide the extensions block using javascript

commit 348596f
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 15:11:16 2023 -0300

    Fix broken extensions

commit c5f14fb
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 14:19:28 2023 -0300

    Optimize the HTML generation speed

commit bf812c4
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 14:05:35 2023 -0300

    Minor fix

commit 658849d
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 13:29:00 2023 -0300

    Move a checkbutton

commit 05ee323
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 13:26:32 2023 -0300

    Rename a file

commit 40c9e46
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 13:25:28 2023 -0300

    Add file

commit d30a140
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 13:24:54 2023 -0300

    Further reorganize the UI

commit ffc6cb3
Merge: cf2da86 3b62bd1
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 12:56:21 2023 -0300

    Merge pull request oobabooga#325 from Ph0rk0z/fix-RWKV-Names

    Fix rwkv names

commit cf2da86
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 12:51:13 2023 -0300

    Prevent *Is typing* from disappearing instantly while streaming

commit 4146ac4
Merge: 1413931 29b7c5a
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 12:47:41 2023 -0300

    Merge pull request oobabooga#266 from HideLord/main

    Adding markdown support and slight refactoring.

commit 29b7c5a
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 12:40:03 2023 -0300

    Sort the requirements

commit ec972b8
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 12:33:26 2023 -0300

    Move all css/js into separate files

commit 693b53d
Merge: 63c5a13 1413931
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 12:08:56 2023 -0300

    Merge branch 'main' into HideLord-main

commit 1413931
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 12:01:32 2023 -0300

    Add a header bar and redesign the interface (oobabooga#293)

commit 9d6a625
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Wed Mar 15 11:04:30 2023 -0300

    Add 'hallucinations' filter oobabooga#326

    This breaks the API since a new parameter has been added.
    It should be a one-line fix. See api-example.py.

commit 3b62bd1
Author: Forkoz <59298527+Ph0rk0z@users.noreply.github.com>
Date:   Tue Mar 14 21:23:39 2023 +0000

    Remove PTH extension from RWKV

    When loading the current model was blank unless you typed it out.

commit f0f325e
Author: Forkoz <59298527+Ph0rk0z@users.noreply.github.com>
Date:   Tue Mar 14 21:21:47 2023 +0000

    Remove Json from loading

    no more 20b tokenizer

commit 128d18e
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Mar 14 17:57:25 2023 -0300

    Update README.md

commit 1236c7f
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Mar 14 17:56:15 2023 -0300

    Update README.md

commit b419dff
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Mar 14 17:55:35 2023 -0300

    Update README.md

commit 72d207c
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Mar 14 16:31:27 2023 -0300

    Remove the chat API

    It is not implemented, has not been tested, and this is causing confusion.

commit afc5339
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Mar 14 16:04:17 2023 -0300

    Remove "eval" statements from text generation functions

commit 5c05223
Merge: b327554 87192e2
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Mar 14 08:05:24 2023 -0300

    Merge pull request oobabooga#295 from Zerogoki00/opt4-bit

    Add support for quantized OPT models

commit 87192e2
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Mar 14 08:02:21 2023 -0300

    Update README

commit 265ba38
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Mar 14 07:56:31 2023 -0300

    Rename a file, add deprecation warning for --load-in-4bit

commit 3da73e4
Merge: 518e5c4 b327554
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Mar 14 07:50:36 2023 -0300

    Merge branch 'main' into Zerogoki00-opt4-bit

commit b327554
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Tue Mar 14 00:18:13 2023 -0300

    Update bug_report_template.yml

commit 33b9a15
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 23:03:16 2023 -0300

    Delete config.yml

commit b5e0d3c
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 23:02:25 2023 -0300

    Create config.yml

commit 7f301fd
Merge: d685332 02d4075
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 22:41:21 2023 -0300

    Merge pull request oobabooga#305 from oobabooga/dependabot/pip/accelerate-0.17.1

    Bump accelerate from 0.17.0 to 0.17.1

commit 02d4075
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Mar 14 01:40:42 2023 +0000

    Bump accelerate from 0.17.0 to 0.17.1

    Bumps [accelerate](https://github.com/huggingface/accelerate) from 0.17.0 to 0.17.1.
    - [Release notes](https://github.com/huggingface/accelerate/releases)
    - [Commits](huggingface/accelerate@v0.17.0...v0.17.1)

    ---
    updated-dependencies:
    - dependency-name: accelerate
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

commit d685332
Merge: 481ef3c df83088
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 22:39:59 2023 -0300

    Merge pull request oobabooga#307 from oobabooga/dependabot/pip/bitsandbytes-0.37.1

    Bump bitsandbytes from 0.37.0 to 0.37.1

commit 481ef3c
Merge: a0ef82c 715c3ec
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 22:39:22 2023 -0300

    Merge pull request oobabooga#304 from oobabooga/dependabot/pip/rwkv-0.4.2

    Bump rwkv from 0.3.1 to 0.4.2

commit df83088
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Mar 14 01:36:18 2023 +0000

    Bump bitsandbytes from 0.37.0 to 0.37.1

    Bumps [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) from 0.37.0 to 0.37.1.
    - [Release notes](https://github.com/TimDettmers/bitsandbytes/releases)
    - [Changelog](https://github.com/TimDettmers/bitsandbytes/blob/main/CHANGELOG.md)
    - [Commits](https://github.com/TimDettmers/bitsandbytes/commits)

    ---
    updated-dependencies:
    - dependency-name: bitsandbytes
      dependency-type: direct:production
      update-type: version-update:semver-patch
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

commit 715c3ec
Author: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Date:   Tue Mar 14 01:36:02 2023 +0000

    Bump rwkv from 0.3.1 to 0.4.2

    Bumps [rwkv](https://github.com/BlinkDL/ChatRWKV) from 0.3.1 to 0.4.2.
    - [Release notes](https://github.com/BlinkDL/ChatRWKV/releases)
    - [Commits](https://github.com/BlinkDL/ChatRWKV/commits)

    ---
    updated-dependencies:
    - dependency-name: rwkv
      dependency-type: direct:production
      update-type: version-update:semver-minor
    ...

    Signed-off-by: dependabot[bot] <support@github.com>

commit a0ef82c
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 22:35:28 2023 -0300

    Activate dependabot

commit 3fb8196
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 22:28:00 2023 -0300

    Implement "*Is recording a voice message...*" for TTS oobabooga#303

commit 0dab2c5
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 22:18:03 2023 -0300

    Update feature_request.md

commit 79e519c
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 20:03:08 2023 -0300

    Update stale.yml

commit 1571458
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 19:39:21 2023 -0300

    Update stale.yml

commit bad0b0a
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 19:20:18 2023 -0300

    Update stale.yml

commit c805843
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 19:09:06 2023 -0300

    Update stale.yml

commit 60cc7d3
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:53:11 2023 -0300

    Update stale.yml

commit 7c17613
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:47:31 2023 -0300

    Update and rename .github/workflow/stale.yml to .github/workflows/stale.yml

commit 47c941c
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:37:35 2023 -0300

    Create stale.yml

commit 511b136
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:29:38 2023 -0300

    Update bug_report_template.yml

commit d6763a6
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:27:24 2023 -0300

    Update feature_request.md

commit c6ecb35
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:26:28 2023 -0300

    Update feature_request.md

commit 6846427
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:19:07 2023 -0300

    Update feature_request.md

commit bcfb7d7
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:16:18 2023 -0300

    Update bug_report_template.yml

commit ed30bd3
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:14:54 2023 -0300

    Update bug_report_template.yml

commit aee3b53
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:14:31 2023 -0300

    Update bug_report_template.yml

commit 7dbc071
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:09:58 2023 -0300

    Delete bug_report.md

commit 69d4b81
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:09:37 2023 -0300

    Create bug_report_template.yml

commit 0a75584
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 18:07:08 2023 -0300

    Create issue templates

commit 02e1113
Author: EliasVincent <riesyeti@outlook.de>
Date:   Mon Mar 13 21:41:19 2023 +0100

    add auto-transcribe option

commit 518e5c4
Author: oobabooga <112222186+oobabooga@users.noreply.github.com>
Date:   Mon Mar 13 16:45:08 2023 -0300

    Some minor fixes to the GPTQ loader

commit 8778b75
Author: Ayanami Rei <wennadocta@protonmail.com>
Date:   Mon Mar 13 22:11:40 2023 +0300

    use updated load_quantized

commit a6a6522
Author: Ayanami Rei <wennadocta@protonmail.com>
Date:   Mon Mar 13 22:11:32 2023 +0300

    determine model type from model name

commit b6c5c57
Author: Ayanami Rei <wennadocta@protonmail.com>
Date:   Mon Mar 13 22:11:08 2023 +0300

    remove default value from argument

commit 63c5a13
Merge: 683556f 7ab45fb
Author: Alexander Hristov Hristov <polimonom@gmail.com>
Date:   Mon Mar 13 19:50:08 2023 +0200

    Merge branch 'main' into main

commit e1c952c
Author: Ayanami Rei <wennadocta@protonmail.com>
Date:   Mon Mar 13 20:22:38 2023 +0300

    make argument non case-sensitive

commit b746250
Author: Ayanami Rei <wennadocta@protonmail.com>
Date:   Mon Mar 13 20:18:56 2023 +0300

    Update README

commit 3c9afd5
Author: Ayanami Rei <wennadocta@protonmail.com>
Date:   Mon Mar 13 20:14:40 2023 +0300

    rename method

commit 1b99ed6
Author: Ayanami Rei <wennadocta@protonmail.com>
Date:   Mon Mar 13 20:01:34 2023 +0300

    add argument --gptq-model-type and remove duplicate arguments

commit edbc611
Author: Ayanami Rei <wennadocta@protonmail.com>
Date:   Mon Mar 13 20:00:38 2023 +0300

    use new quant loader

commit 345b6de
Author: Ayanami Rei <wennadocta@protonmail.com>
Date:   Mon Mar 13 19:59:57 2023 +0300

    refactor quant models loader and add support of OPT

commit 48aa528
Author: EliasVincent <riesyeti@outlook.de>
Date:   Sun Mar 12 21:03:07 2023 +0100

    use Gradio microphone input instead

commit 683556f
Author: HideLord <polimonom@gmail.com>
Date:   Sun Mar 12 21:34:09 2023 +0200

    Adding markdown support and slight refactoring.

commit 3b41459
Merge: 1c0bda3 3375eae
Author: Elias Vincent Simon <riesyeti@outlook.de>
Date:   Sun Mar 12 19:19:43 2023 +0100

    Merge branch 'oobabooga:main' into stt-extension

commit 1c0bda3
Author: EliasVincent <riesyeti@outlook.de>
Date:   Fri Mar 10 11:47:16 2023 +0100

    added installation instructions

commit a24fa78
Author: EliasVincent <riesyeti@outlook.de>
Date:   Thu Mar 9 21:18:46 2023 +0100

    tweaked Whisper parameters

commit d5efc06
Merge: 00359ba 3341447
Author: Elias Vincent Simon <riesyeti@outlook.de>
Date:   Thu Mar 9 21:05:34 2023 +0100

    Merge branch 'oobabooga:main' into stt-extension

commit 00359ba
Author: EliasVincent <riesyeti@outlook.de>
Date:   Thu Mar 9 21:03:49 2023 +0100

    interactive preview window

commit 7a03d0b
Author: EliasVincent <riesyeti@outlook.de>
Date:   Thu Mar 9 20:33:00 2023 +0100

    cleanup

commit 4c72e43
Author: EliasVincent <riesyeti@outlook.de>
Date:   Thu Mar 9 12:46:50 2023 +0100

    first implementation
@GundamWing

GundamWing commented Mar 21, 2023

It's a little fussy to get set up properly, but runs amazingly well once it's functional.

I'm using the Pyg-6b model on an M2 Mac mini with 64 GB of RAM. The response times with the default settings are pretty quick as you can see below:
Output generated in 13.07 seconds (2.98 tokens/s, 39 tokens)
Output generated in 12.56 seconds (4.30 tokens/s, 54 tokens)
Output generated in 18.12 seconds (4.03 tokens/s, 73 tokens)
Output generated in 9.62 seconds (5.41 tokens/s, 52 tokens)
Output generated in 7.62 seconds (3.67 tokens/s, 28 tokens)
Output generated in 11.51 seconds (3.91 tokens/s, 45 tokens)
Output generated in 11.89 seconds (2.69 tokens/s, 32 tokens)
Output generated in 12.61 seconds (3.65 tokens/s, 46 tokens)
Output generated in 9.70 seconds (2.78 tokens/s, 27 tokens)
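
The per-response figures above can be folded into one overall throughput number. A minimal sketch that parses log lines in exactly the format shown; the function name `overall_tokens_per_second` is made up for illustration.

```python
import re

# Matches lines such as:
#   Output generated in 13.07 seconds (2.98 tokens/s, 39 tokens)
LOG_LINE = re.compile(
    r"Output generated in ([\d.]+) seconds \(([\d.]+) tokens/s, (\d+) tokens\)"
)


def overall_tokens_per_second(log_text: str) -> float:
    """Total tokens divided by total seconds across all matching log lines."""
    seconds = 0.0
    tokens = 0
    for match in LOG_LINE.finditer(log_text):
        seconds += float(match.group(1))
        tokens += int(match.group(3))
    if seconds == 0:
        raise ValueError("no 'Output generated' lines found")
    return tokens / seconds


if __name__ == "__main__":
    sample = (
        "Output generated in 13.07 seconds (2.98 tokens/s, 39 tokens)\n"
        "Output generated in 12.56 seconds (4.30 tokens/s, 54 tokens)\n"
    )
    print(f"{overall_tokens_per_second(sample):.2f} tokens/s")
```

Applied to all nine lines above (396 tokens in 106.70 seconds), this works out to roughly 3.7 tokens/s. Dividing the totals weights long responses more heavily than averaging the nine per-line rates would.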

Also, another important note is that if you're using the interface on a different Mac with the --listen option, Safari will error out for some reason. Using Chrome will work fine in that case. Also, the --no-stream option will error out with a CUDA error so stay away from that for now.

@juwalter

juwalter commented Mar 26, 2023

@oobabooga would instructions from here https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model also work for MPS / Silicon?

General thinking:

  • https://github.com/qwopqwop200/GPTQ-for-LLaMa does:
    • llama weights to HF (hugging face) model format
    • apply GPTQ magic
    • model weights still in HF format? (would guess so)
  • oobabooga/text-generation-webui does:
    • use hugging face transformers library to load model and perform inference?

So, I guess my question comes down to: is the hugging face transformers library in principle all that is needed? Assuming "hugging face transformers" + most recent pytorch w/ support for MPS 64bit (coming in macOS 13.3 (next week))? Or is there a hard dependency on cuda, since:

python setup_cuda.py install
https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model#step-1-install-gptq-for-llama

in other words: is this step a convenience for installing cuda related dependencies, or does integration of https://github.com/qwopqwop200/GPTQ-for-LLaMa into oobabooga/text-generation-webui have the same dependency on quant_cuda.cpp from the qwopqwop200/GPTQ-for-LLaMa project?

many thanks in advance!

@oobabooga
Owner

dependency on quant_cuda.cpp

Yes, there is this dependency as far as I understand. GPTQ-for-LLaMa uses a custom CUDA kernel that must be compiled with nvcc in your operating system.

In the end, the only low-level requirements for this project are PyTorch with a GPU backend (MPS on macOS) and nvcc. If you have both set up, it should just work.
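
Whether the nvcc half of that requirement is met can be checked quickly. A minimal sketch that only looks for an `nvcc` executable on PATH; it says nothing about whether the GPTQ-for-LLaMa kernel would actually compile, and the helper names are hypothetical.

```python
import shutil
from typing import Optional


def find_nvcc() -> Optional[str]:
    """Return the full path of nvcc if it is on PATH, otherwise None."""
    return shutil.which("nvcc")


def describe_toolchain() -> str:
    """Summarize whether a CUDA compiler is reachable from this environment."""
    nvcc_path = find_nvcc()
    if nvcc_path is None:
        return "nvcc not found: custom CUDA kernels cannot be built here"
    return f"nvcc found at {nvcc_path}"


if __name__ == "__main__":
    print(describe_toolchain())
```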

@juwalter

ah, yes: "custom CUDA kernel" - I think this is the crucial part here, which then makes it non-portable to AMD or Apple/MPS

thank you!

@tillhanke

I am not sure where I made a mistake. I simply followed the instructions given by @WojtekKowaluk, but my Mac always shows 100% CPU activity and basically no GPU activity. I have an M2 Pro 14" MacBook.
I set up my env with python 3.10.10 and the nightly build of pytorch.
Do I have to set the device somewhere? I just called python server.py.
Thanks!
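
One quick way to narrow down the device question is to ask PyTorch directly which backend it can see. This is only a minimal sketch: `pick_device` is a hypothetical helper name, not something from this repository, and it assumes a PyTorch build recent enough to expose `torch.backends.mps`.

```python
def pick_device():
    """Return 'mps', 'cuda' or 'cpu' depending on what this PyTorch build can use."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment at all.
        return "cpu"

    mps = getattr(torch.backends, "mps", None)
    # is_built(): this PyTorch binary was compiled with MPS support.
    # is_available(): the OS and hardware can actually run it.
    if mps is not None and mps.is_built() and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


if __name__ == "__main__":
    print("PyTorch would use:", pick_device())
```

If this prints `cpu` on an Apple Silicon machine, the installed PyTorch build or the macOS version is a more likely culprit than the web UI flags. If it prints `mps`, the model may simply not be getting moved to that device.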

@cannin

cannin commented Apr 1, 2023

Using the latest version on an M1 Max, I get this issue:

CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?

Have not found instructions on how to compile it for MacOS.

EDIT: This stable diffusion project had a similar issue: d8ahazard/sd_dreambooth_extension#1103 and they seemed to have worked around the issue.

@dogjamboree

Any plans on fixing this or should Mac users look elsewhere? Thanks!!

@GundamWing

Any plans on fixing this or should Mac users look elsewhere? Thanks!!

@dogjamboree Fixing what exactly? It works well on an M1/M2 Mac using both GPU and CPU cores. Anything that requires CUDA specific calls will not work on a Mac. If you can give more details on what isn't working for you, maybe someone can provide more specific help.
The CUDA specific calls are for NVIDIA graphics cards only and don't work with the Mac since it's either Apple GPUs or AMD on current machines. This means that the bitsandbytes stuff will not work on a Mac regardless.

@WojtekKowaluk
Contributor Author

WojtekKowaluk commented Apr 6, 2023

Not everything is supported on Mac yet; there is still work to do in lower-level tools and libraries, e.g. bitsandbytes-foundation/bitsandbytes#257, to get accelerated 4-bit/8-bit mode.

EDIT: This stable diffusion project had a similar issue: d8ahazard/sd_dreambooth_extension#1103 and they seemed to have worked around the issue.

@cannin There is no workaround. For now, bitsandbytes is an NVIDIA-specific optimisation that lets you run in 8-bit and 4-bit mode; without it you have to run in 16-bit mode, which is already supported.

The other thing you can do is run 4-bit/8-bit mode on the CPU only, which is already provided by the built-in llama.cpp support.

@GaidamakUA

What is the correct way to run it? It runs really slowly on my M1 Mac.
I'm running it with python server.py --model alpaca-13b-ggml-q4_0-lora-merged --chat

@WojtekKowaluk
Contributor Author

WojtekKowaluk commented Apr 8, 2023

@GaidamakUA Yes, for me it is also slower than standalone llama.cpp, trying to figure out why.

OK, I have figured it out. We need to set the number of threads correctly for Apple Silicon. Basically, you should set it to the number of performance cores (P-cores) on your CPU to get the best performance. Use the --threads n parameter to set this value.

M1/M2: --threads 4
M1/M2 Pro (8 cores): --threads 6
M1/M2 Pro (10 cores): --threads 8
M1/M2 Max: --threads 8
M1 Ultra: --threads 16
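
If you would rather not look the number up, the P-core count can probably be read from the system. The sketch below assumes that the `hw.perflevel0.physicalcpu` sysctl key reports performance cores, which should hold on recent macOS releases on Apple Silicon but is worth verifying on your machine. `performance_core_count` is a hypothetical helper name, and it falls back to the total logical CPU count elsewhere.

```python
import os
import subprocess


def performance_core_count():
    """Best-effort count of performance cores; falls back to all logical CPUs."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "hw.perflevel0.physicalcpu"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        return int(out)
    except (OSError, subprocess.CalledProcessError, ValueError):
        # Not macOS, not Apple Silicon, or the key is missing.
        return os.cpu_count() or 1


if __name__ == "__main__":
    # Prints a flag that can be pasted onto the server.py command line.
    print(f"--threads {performance_core_count()}")
```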

@reblevins

reblevins commented Apr 8, 2023

Thank you @WojtekKowaluk for your help. I followed your instructions, but I'm getting the following error trying to load the vicuna-13b-GPTQ-4bit-128g model.

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /Users/***/dev/learning/text-generation-webui/server.py:308 in <module>                       │
│                                                                                                  │
│   305 │   │   i = int(input()) - 1                                                               │
│   306 │   │   print()                                                                            │
│   307 │   shared.model_name = available_models[i]                                                │
│ ❱ 308 shared.model, shared.tokenizer = load_model(shared.model_name)                             │
│   309 if shared.args.lora:                                                                       │
│   310 │   add_lora_to_model(shared.args.lora)                                                    │
│   311                                                                                            │
│                                                                                                  │
│ /Users/***/dev/learning/text-generation-webui/modules/models.py:50 in load_model              │
│                                                                                                  │
│    47 │   # Default settings                                                                     │
│    48 │   if not any([shared.args.cpu, shared.args.load_in_8bit, shared.args.wbits, shared.arg   │
│    49 │   │   if any(size in shared.model_name.lower() for size in ('13b', '20b', '30b')):       │
│ ❱  50 │   │   │   model = AutoModelForCausalLM.from_pretrained(Path(f"{shared.args.model_dir}/   │
│    51 │   │   else:                                                                              │
│    52 │   │   │   model = AutoModelForCausalLM.from_pretrained(Path(f"{shared.args.model_dir}/   │
│    53 │   │   │   if torch.has_mps:                                                              │
│                                                                                                  │
│ /Users/***/opt/miniconda3/lib/python3.9/site-packages/transformers/models/auto/auto_factory.p │
│ y:471 in from_pretrained                                                                         │
│                                                                                                  │
│   468 │   │   │   )                                                                              │
│   469 │   │   elif type(config) in cls._model_mapping.keys():                                    │
│   470 │   │   │   model_class = _get_model_class(config, cls._model_mapping)                     │
│ ❱ 471 │   │   │   return model_class.from_pretrained(                                            │
│   472 │   │   │   │   pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs,   │
│   473 │   │   │   )                                                                              │
│   474 │   │   raise ValueError(                                                                  │
│                                                                                                  │
│ /Users/***/opt/miniconda3/lib/python3.9/site-packages/transformers/modeling_utils.py:2681 in  │
│ from_pretrained                                                                                  │
│                                                                                                  │
│   2678 │   │   │   │   │   key: device_map[key] for key in device_map.keys() if key not in modu  │
│   2679 │   │   │   │   }                                                                         │
│   2680 │   │   │   │   if "cpu" in device_map_without_lm_head.values() or "disk" in device_map_  │
│ ❱ 2681 │   │   │   │   │   raise ValueError(                                                     │
│   2682 │   │   │   │   │   │   """                                                               │
│   2683 │   │   │   │   │   │   Some modules are dispatched on the CPU or the disk. Make sure yo  │
│   2684 │   │   │   │   │   │   the quantized model. If you want to dispatch the model on the CP  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: 
                        Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
                        the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
                        these modules in 32-bit, you need to set `load_in_8bit_fp32_cpu_offload=True` and pass a custom
                        `device_map` to `from_pretrained`. Check
                        https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
                        for more details.

@0xdevalias

0xdevalias commented Apr 9, 2023

I installed it just now as follows (I use pyenv and pyenv-virtualenv, though those parts probably aren't needed to make it work) on my 2019 Intel MacBook Pro:

⇒ git clone git@github.com:oobabooga/text-generation-webui.git
# ..snip..

⇒ cd text-generation-webui
# ..snip..

⇒ pyenv install miniconda3-latest
# ..snip..

⇒ pyenv local miniconda3-latest
# ..snip..

⇒ conda create -n textgen python=3.10.9
# ..snip..

⇒ conda activate textgen
# ..snip..

⇒ pyenv local miniconda3-latest/envs/textgen
# ..snip..

⇒ pip3 install torch torchvision torchaudio
# ..snip..

⇒ pip install -r requirements.txt
# ..snip..

Then downloaded the following model:

⇒ python download-model.py anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g
# ..snip..

But then when I try to run it, I get the following error (ModuleNotFoundError: No module named 'llama_inference_offload'):

⇒ python server.py --auto-devices --chat --wbits 4 --groupsize 128 --model anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen(/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so, 0x0006): tried: '/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file), '/System/Volumes/Preboot/Cryptexes/OS/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (no such file), '/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file)
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen(/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so, 0x0006): tried: '/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file), '/System/Volumes/Preboot/Cryptexes/OS/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (no such file), '/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file)
/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g...
Traceback (most recent call last):
  File "/Users/devalias/dev/AI/text-generation-webui/server.py", line 302, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/Users/devalias/dev/AI/text-generation-webui/modules/models.py", line 100, in load_model
    from modules.GPTQ_loader import load_quantized
  File "/Users/devalias/dev/AI/text-generation-webui/modules/GPTQ_loader.py", line 14, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

Googling for that module led me to the following:

So I tried installing the requirements for that into the same conda environment, but just ended up with another error (ERROR: No matching distribution found for triton==2.0.0):

⇒ cd ..
# ..snip..

⇒ git clone git@github.com:qwopqwop200/GPTQ-for-LLaMa.git
# ..snip..

⇒ cd GPTQ-for-LLaMa
# ..snip..

⇒ pyenv local miniconda3-latest/envs/textgen
# ..snip..

⇒ pip install -r requirements.txt
Collecting git+https://github.com/huggingface/transformers (from -r requirements.txt (line 4))
  Cloning https://github.com/huggingface/transformers to /private/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T/pip-req-build-_6j4_tu0
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /private/var/folders/j4/kxtq1cjs1l98xfqncjbsbx1c0000gn/T/pip-req-build-_6j4_tu0
  Resolved https://github.com/huggingface/transformers to commit 656e869a4523f6a0ce90b3aacbb05cc8fb5794bb
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: safetensors==0.3.0 in /Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages (from -r requirements.txt (line 1)) (0.3.0)
Collecting datasets==2.10.1
  Downloading datasets-2.10.1-py3-none-any.whl (469 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 469.0/469.0 kB 6.8 MB/s eta 0:00:00
Requirement already satisfied: sentencepiece in /Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen/lib/python3.10/site-packages (from -r requirements.txt (line 3)) (0.1.97)
Collecting accelerate==0.17.1
  Using cached accelerate-0.17.1-py3-none-any.whl (212 kB)
ERROR: Could not find a version that satisfies the requirement triton==2.0.0 (from versions: none)
ERROR: No matching distribution found for triton==2.0.0

Edit: Ok, it seems you can just install triton from source and it will work:

Note that I also re-setup my conda environment on python 3.9.x as part of this:

⇒ python --version
Python 3.9.16

Though that now just leads me into this rabbithole of issues :(

⇒ python server.py --chat --wbits 4 --groupsize 128 --model ./models/anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g

..snip..

Loading ./models/anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g...
Could not find the quantized model in .pt or .safetensors format, exiting...

We can see that load_model is called by server.py here:

load_model is defined here:

When loading a quantized model that sets --wbits to anything greater than 0, it calls load_quantized:

load_quantized is defined here:

And we can see the logic it uses to try and find the relevant model file to load here (which has some special handling for llama-*- models, otherwise defaulting to {model_name}-{shared.args.wbits}bit; which is why the above reddit post says you need to add the -4bit suffix to the filename):
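
As a rough illustration of the default naming rule described above (not the actual code from GPTQ_loader.py, and ignoring the special handling for llama-*- models), the lookup can be sketched like this. `find_quantized_checkpoint` is a hypothetical name, and searching only inside the model's own folder is an assumption based on the "Found models/..." log line further down.

```python
from pathlib import Path


def find_quantized_checkpoint(models_dir, model_name, wbits):
    """Return the first matching checkpoint path, or None if nothing matches."""
    stem = f"{model_name}-{wbits}bit"
    for ext in (".pt", ".safetensors"):
        candidate = Path(models_dir) / model_name / f"{stem}{ext}"
        if candidate.exists():
            return candidate
    # Corresponds to the "Could not find the quantized model" case.
    return None
```

With this rule, a file named gpt-x-alpaca-13b-native-4bit-128g.pt is never matched for the model folder anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g, because the expected stem is the folder name plus -4bit.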

So we can work around that by making a symlink of the model file (or renaming it), as follows:

⇒ cd models/anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g
# ..snip..

⇒ ln -s gpt-x-alpaca-13b-native-4bit-128g.pt anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g-4bit.pt

Then we can re-run server.py

⇒ cd ../..
# ..snip..

⇒ python server.py --auto-devices --chat --wbits 4 --groupsize 128 --model anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

..snip..

Loading anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g...
Found models/anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g/anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g-4bit.pt
Loading model ...
Done.
Traceback (most recent call last):
  File "/Users/devalias/dev/AI/text-generation-webui/server.py", line 302, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/Users/devalias/dev/AI/text-generation-webui/modules/models.py", line 102, in load_model
    model = load_quantized(model_name)
  File "/Users/devalias/dev/AI/text-generation-webui/modules/GPTQ_loader.py", line 153, in load_quantized
    model = model.to(torch.device('cuda:0'))
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1888, in to
    return super().to(*args, **kwargs)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

Seems it doesn't like running with --auto-devices, so let's try --cpu:

⇒ python server.py --cpu --chat --wbits 4 --groupsize 128 --model anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...

..snip..

Loading anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g...
Found models/anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g/anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g-4bit.pt
Loading model ...
Done.
Loaded the model in 7.46 seconds.
Loading the extension "gallery"... Ok.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.

Success!


Edit2: Almost success. It was enough to get the webui running, but it still fails when trying to generate from a prompt, with an AssertionError: Torch not compiled with CUDA enabled, despite passing the --cpu flag to the webui:

Traceback (most recent call last):
  File "/Users/devalias/dev/AI/text-generation-webui/modules/callbacks.py", line 66, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/Users/devalias/dev/AI/text-generation-webui/modules/text_generation.py", line 220, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/transformers/generation/utils.py", line 1485, in generate
    return self.sample(
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/transformers/generation/utils.py", line 2524, in sample
    outputs = self(
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/devalias/dev/AI/text-generation-webui/repositories/GPTQ-for-LLaMa/quant.py", line 450, in forward
    out = QuantLinearFunction.apply(x.reshape(-1,x.shape[-1]), self.qweight, self.scales,
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/cuda/amp/autocast_mode.py", line 106, in decorate_fwd
    return fwd(*args, **kwargs)
  File "/Users/devalias/dev/AI/text-generation-webui/repositories/GPTQ-for-LLaMa/quant.py", line 364, in forward
    output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/Users/devalias/dev/AI/text-generation-webui/repositories/GPTQ-for-LLaMa/quant.py", line 336, in matmul248
    output = torch.empty((input.shape[0], qweight.shape[1]), device='cuda', dtype=torch.float16)
  File "/Users/devalias/.pyenv/versions/miniconda3-latest/envs/textgen_py3_9_16/lib/python3.9/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
Output generated in 0.31 seconds (0.00 tokens/s, 0 tokens, context 67)

@0xdevalias

It is because in chat mode, every time we call generate we reset the state and process the whole chat context from the start. Instead of calling generate, it should just append new tokens with eval and call reset only when needed (i.e. when the chat history has changed because the user changed character or cleared it). Currently calls to the LM are stateless, so we process the whole chat history every time we want to generate the next response. There is room for improvement here, but it will require changes to the application's architecture.
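
To make the idea in the quote concrete, here is a minimal sketch of prefix reuse. It is not code from this project: `PrefixCachingSession` is a hypothetical name, and `eval_fn` and `reset_fn` are hypothetical callbacks standing in for whatever eval/reset operations the backend actually provides.

```python
class PrefixCachingSession:
    """Tracks which tokens a model has already evaluated (hypothetical helper)."""

    def __init__(self, eval_fn, reset_fn):
        self._eval = eval_fn      # called with the list of new tokens to evaluate
        self._reset = reset_fn    # called when the cached state must be discarded
        self._seen = []           # tokens evaluated so far

    def feed(self, tokens):
        """Evaluate only what is new; return how many tokens were evaluated."""
        tokens = list(tokens)
        if tokens[:len(self._seen)] != self._seen:
            # History changed (edited, cleared, new character): start over.
            self._reset()
            self._seen = []
        new_tokens = tokens[len(self._seen):]
        if new_tokens:
            self._eval(new_tokens)
            self._seen.extend(new_tokens)
        return len(new_tokens)
```

As long as each new prompt extends the previous one, only the appended tokens are evaluated. A reset happens only when the history diverges from what was already processed.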

@WojtekKowaluk The following may be relevant to this:

@0xdevalias

@mozzipa For the "ModuleNotFoundError: No module named 'llama_inference_offload'" error you can see the steps I took to resolve it in my earlier comment

For the Could not find the quantized model in .pt or .safetensors format, exiting part of your issue, my earlier comment also links to a bunch of resources related to that, and goes through step by step how to figure out and fix it, based on the model you're using, etc.

@mozzipa

mozzipa commented Apr 14, 2023

@0xdevalias, thanks for your advice.
However, I could not solve the issues. I followed your steps as below.

`conda create -n textgen python=3.9`

`pip3 install torch torchvision torchaudio`

`pip install -r requirements.txt` for text-generation-webui

`pip install -r requirements.txt` for GPTQ-for-LLaMa

`git clone https://github.com/openai/triton.git;` under GPTQ-for-LLaMa

`cd triton/python;`

`pip install cmake;`

`pip install -e .` ⇒ Error

`python download-model.py anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g`

`cd models`

`mv anon8231489123_gpt4-x-alpaca-13b-native-4bit-128g gpt4-x-alpaca-13b-native-4bit-128g`

`python server.py --chat --wbits 4 --groupsize 128 --model gpt4-x-alpaca-13b-native-4bit-128g --cpu`

`cd models`

`cd gpt4-x-alpaca-13b-native-4bit-128g`

`mv gpt-x-alpaca-13b-native-4bit-128g-cuda.pt gpt-x-alpaca-13b-native-4bit-128g-4bit.pt`

The following error occurs after typing a prompt when running python server.py --chat --wbits 4 --groupsize 128 --model gpt4-x-alpaca-13b-native-4bit-128g --cpu:

CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen(/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so, 0x0006): tried: '/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file), '/System/Volumes/Preboot/Cryptexes/OS/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (no such file), '/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file)
/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading gpt4-x-alpaca-13b-native-4bit-128g...
CUDA extension not installed.
Found the following quantized model: models/gpt4-x-alpaca-13b-native-4bit-128g/gpt-x-alpaca-13b-native-4bit-128g-4bit.pt
Loading model ...
Done.
Loaded the model in 16.43 seconds.
Loading the extension "gallery"... Ok.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
  File "/Users/macmini/Documents/Coding/Solidity/Page/cloudRun/oobabooga/text-generation-webui/modules/callbacks.py", line 66, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/Users/macmini/Documents/Coding/Solidity/Page/cloudRun/oobabooga/text-generation-webui/modules/text_generation.py", line 251, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/transformers/generation/utils.py", line 1508, in generate
    return self.sample(
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/transformers/generation/utils.py", line 2547, in sample
    outputs = self(
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/macmini/Documents/Coding/Solidity/Page/cloudRun/oobabooga/text-generation-webui/repositories/GPTQ-for-LLaMa/quant.py", line 426, in forward
    quant_cuda.vecquant4matmul(x, self.qweight, y, self.scales, self.qzeros, self.groupsize)
NameError: name 'quant_cuda' is not defined
Output generated in 0.20 seconds (0.00 tokens/s, 0 tokens, context 36, seed 2012656668)

Although I already have another 13B model, I downloaded gpt4-x-alpaca-13b-native-4bit-128g to get a .pt file. That approach only got me as far as the web UI starting up.

I also wanted to try the other approach, following @WojtekKowaluk's advice, but I could not see how it differs from what I did, and it failed as well:

python server.py --model llamacpp-13b

CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen(/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so, 0x0006): tried: '/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file), '/System/Volumes/Preboot/Cryptexes/OS/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (no such file), '/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file)
/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading llamacpp-13b...
Traceback (most recent call last):
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/llama_cpp.py", line 36, in _load_shared_library
    return ctypes.CDLL(str(_lib_path))
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/ctypes/__init__.py", line 382, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/libllama.so, 0x0006): tried: '/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/libllama.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/libllama.so' (no such file), '/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/libllama.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/macmini/Documents/Coding/Solidity/Page/cloudRun/oobabooga/text-generation-webui/server.py", line 471, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/Users/macmini/Documents/Coding/Solidity/Page/cloudRun/oobabooga/text-generation-webui/modules/models.py", line 107, in load_model
    from modules.llamacpp_model_alternative import LlamaCppModel
  File "/Users/macmini/Documents/Coding/Solidity/Page/cloudRun/oobabooga/text-generation-webui/modules/llamacpp_model_alternative.py", line 9, in <module>
    from llama_cpp import Llama
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/__init__.py", line 1, in <module>
    from .llama_cpp import *
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/llama_cpp.py", line 46, in <module>
    _lib = _load_shared_library(_lib_base_name)
  File "/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/llama_cpp.py", line 38, in _load_shared_library
    raise RuntimeError(f"Failed to load shared library '{_lib_path}': {e}")
RuntimeError: Failed to load shared library '/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/libllama.so': dlopen(/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/libllama.so, 0x0006): tried: '/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/libllama.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/libllama.so' (no such file), '/Users/macmini/miniconda/envs/textgen/lib/python3.9/site-packages/llama_cpp/libllama.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))

@WojtekKowaluk
Contributor Author

WojtekKowaluk commented Apr 14, 2023

I decided to write a summary, because this thread has become very chaotic:

Thank you, 0xdevalias, for listing the related issues; this is very useful.

@mozzipa

mozzipa commented Apr 14, 2023

It is probably not usable on an M1 with miniconda.

@phnessu4

phnessu4 commented Apr 14, 2023

It seems that --auto-devices does not work when loading a model directly.

This command fails with an error, because Apple Silicon does not support CUDA:
python3 server.py --chat --model llama-7b --auto-devices

This will be extremely slow:

python3 server.py --chat --model llama-7b --load-in-8bit --cpu
...
Output generated in 16.23 seconds (0.80 tokens/s, 13 tokens, context 738, seed 1231793443)
Output generated in 12.82 seconds (0.62 tokens/s, 8 tokens, context 761, seed 1386274341)

Still slow, even slower than sd-web-ui with the sd2.1 model.

python3 server.py --chat --model llama-7b --cpu
...
Output generated in 8.13 seconds (1.48 tokens/s, 12 tokens, context 37, seed 862240921)
Output generated in 6.73 seconds (1.49 tokens/s, 10 tokens, context 56, seed 1597084080)
Output generated in 7.55 seconds (1.46 tokens/s, 11 tokens, context 72, seed 2062780666)

Tested on an M2 Max with 96 GB.

I'm not sure why it's so slow. I'll try llama.cpp next. Has anyone tested the speed? How fast can it be on a Mac, and how can it be optimized?

@WojtekKowaluk, any suggestions?

@mozzipa

mozzipa commented Apr 24, 2023

In my case, it was solved with the steps below.

  1. remove llama-cpp-python
    pip uninstall llama-cpp-python
  2. install llama-cpp-python as arm64
    arch -arm64 python -m pip install llama-cpp-python --no-cache
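
The earlier traceback reports a libllama.so built for x86_64 being loaded where arm64 is needed, which is why reinstalling under `arch -arm64` helps. Before reinstalling, it can be useful to check which architecture the Python interpreter itself reports. The sketch below is only an illustration: the helper name `describe_arch` is made up for this example, and the Rosetta remark is a general observation about Apple Silicon rather than something confirmed in this thread.

```python
import platform


def describe_arch(machine: str) -> str:
    """Explain what a platform.machine() value usually means on an Apple Silicon Mac."""
    if machine == "arm64":
        return "native arm64: pip normally builds or selects arm64 binaries"
    if machine == "x86_64":
        return "x86_64: on Apple Silicon this usually means Python runs under Rosetta"
    return f"unrecognised architecture: {machine}"


if __name__ == "__main__":
    # Run this with the same interpreter that launches server.py.
    print(describe_arch(platform.machine()))
```

If the interpreter reports x86_64, packages with compiled extensions such as llama-cpp-python are likely to be built for x86_64 as well, which would match the "incompatible architecture" message above.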

@aaronturnershr

In my case, it was solved with the steps below.

  1. remove llama-cpp-python
    pip uninstall llama-cpp-python
  2. install llama-cpp-python as arm64
    arch -arm64 python -m pip install llama-cpp-python --no-cache

mozzipa, would you provide the complete setup for the M1 Max?

@AntonioCiolino

In my case I was on Python 3.11, and going back to 3.10 made the app run after the nightly torch install.

@ArtProGZ

I'm running this on an M1 MacBook Air. I downloaded a WizardLM model and ran this command: python3 server.py --cpu --chat --wbits 4 --groupsize 128 --model WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors

This is the error message I get:

Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/configuration_utils.py", line 658, in _get_config_dict
config_dict = cls._dict_from_json_file(resolved_config_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/configuration_utils.py", line 745, in _dict_from_json_file
text = reader.read()
^^^^^^^^^^^^^
File "", line 322, in decode
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 1: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Users/ufodisk/Documents/AI/oobabooga_macos/text-generation-webui/server.py", line 952, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ufodisk/Documents/AI/oobabooga_macos/text-generation-webui/modules/models.py", line 74, in load_model
shared.model_type = find_model_type(model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/ufodisk/Documents/AI/oobabooga_macos/text-generation-webui/modules/models.py", line 62, in find_model_type
config = AutoConfig.from_pretrained(Path(f'{shared.args.model_dir}/{model_name}'), trust_remote_code=shared.args.trust_remote_code)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 916, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/configuration_utils.py", line 573, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/transformers/configuration_utils.py", line 661, in _get_config_dict
raise EnvironmentError(
OSError: It looks like the config file at 'models/WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors' is not a valid JSON file.
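
Reading the traceback, `AutoConfig.from_pretrained` was handed the path of the .safetensors file itself, and the UnicodeDecodeError suggests that this binary file was then read as if it were a JSON config. One plausible reading (an assumption, not confirmed in this thread) is that `--model` should name a folder under models/ that contains a config.json next to the weights. The sketch below uses a hypothetical helper, `diagnose_model_path`, to check a path before launching.

```python
from pathlib import Path


def diagnose_model_path(path: Path) -> str:
    """Classify a model path; the categories are illustrative, not part of the web UI."""
    if path.is_dir():
        if (path / "config.json").is_file():
            return "ok: directory with config.json"
        return "directory without config.json"
    if path.is_file():
        # A single weights file has no config next to it for the loader to read.
        return "file, not a directory: a loader expecting a model folder may try to parse it as the config"
    return "path does not exist"


if __name__ == "__main__":
    # Path taken from the error message above.
    target = Path("models/WizardLM-7B-uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors")
    print(diagnose_model_path(target))
```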

@newjxmaster

I am not sure, where I made a mistake. I simply followed the instructions given by @WojtekKowaluk. But my mac always shows 100% CPU activity and basically no GPU activity. I got a M2-Pro Macbook 14"
I set up my env with python 3.10.10 and the nightly build of pytorch.
Do I have to set the device somewhere? I just called python server.py.
Thanks!

Did you manage to fix your issue?

@ibehnam ibehnam mentioned this pull request May 21, 2023
1 task
@WojtekKowaluk
Contributor Author

WojtekKowaluk commented Jun 10, 2023

You can follow these instructions to get GPU acceleration with llama.cpp:

  • pip uninstall llama-cpp-python
  • CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install llama-cpp-python
  • set n-gpu-layers to anything other than 0
  • currently the following quantisation methods are supported: Q4_0, Q4_1, Q2_K, Q4_K, and Q6_K
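
The two pip steps above can also be scripted. The following is a minimal sketch, not part of the project: the function names `metal_reinstall_steps` and `run_steps` are hypothetical, the `-y` flag is added so that pip does not prompt, and the environment variables are exactly the ones given in the list. By default it only prints what it would run.

```python
import os
import subprocess
import sys


def metal_reinstall_steps():
    """Return (command, extra_env) pairs mirroring the two pip steps above."""
    pip = [sys.executable, "-m", "pip"]
    return [
        (pip + ["uninstall", "-y", "llama-cpp-python"], {}),
        (pip + ["install", "llama-cpp-python"],
         {"CMAKE_ARGS": "-DLLAMA_METAL=on", "FORCE_CMAKE": "1"}),
    ]


def run_steps(steps, dry_run=True):
    """Print each step, or execute it when dry_run is False."""
    for cmd, extra_env in steps:
        if dry_run:
            prefix = [f"{key}={value}" for key, value in extra_env.items()]
            print(" ".join(prefix + cmd))
        else:
            # Extend the current environment so PATH and friends stay intact.
            subprocess.run(cmd, env={**os.environ, **extra_env}, check=True)


if __name__ == "__main__":
    run_steps(metal_reinstall_steps())  # pass dry_run=False to actually reinstall
```

Using `sys.executable -m pip` keeps the reinstall inside the same environment that runs the script, which matters when several conda or venv environments are present.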

I get the following results with a MacBook Pro M1:

  • 13B (Q4_0) with Metal enabled:
llama_print_timings:        load time =  4306.79 ms
llama_print_timings:      sample time =   235.66 ms /   138 runs   (    1.71 ms per token)
llama_print_timings: prompt eval time =  4306.74 ms /    16 tokens (  269.17 ms per token)
llama_print_timings:        eval time = 24826.94 ms /   137 runs   (  181.22 ms per token)
llama_print_timings:       total time = 30742.29 ms
Output generated in 31.30 seconds (4.38 tokens/s, 137 tokens, context 16, seed 187622703)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  4306.79 ms
llama_print_timings:      sample time =   206.16 ms /   138 runs   (    1.49 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time = 24233.16 ms /   138 runs   (  175.60 ms per token)
llama_print_timings:       total time = 25971.79 ms
Output generated in 26.41 seconds (5.19 tokens/s, 137 tokens, context 16, seed 296171994)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  4306.79 ms
llama_print_timings:      sample time =   436.11 ms /   138 runs   (    3.16 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time = 24098.52 ms /   138 runs   (  174.63 ms per token)
llama_print_timings:       total time = 26263.08 ms
Output generated in 26.59 seconds (5.15 tokens/s, 137 tokens, context 16, seed 690694720)
  • 13B (Q4_0) with CPU:
llama_print_timings:        load time =  6570.52 ms
llama_print_timings:      sample time =   119.31 ms /   138 runs   (    0.86 ms per token)
llama_print_timings: prompt eval time =  6570.48 ms /    16 tokens (  410.65 ms per token)
llama_print_timings:        eval time = 25307.78 ms /   137 runs   (  184.73 ms per token)
llama_print_timings:       total time = 32359.11 ms
Output generated in 32.56 seconds (4.21 tokens/s, 137 tokens, context 16, seed 83893567)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6570.52 ms
llama_print_timings:      sample time =   111.39 ms /   138 runs   (    0.81 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time = 23952.44 ms /   138 runs   (  173.57 ms per token)
llama_print_timings:       total time = 24449.80 ms
Output generated in 24.69 seconds (5.55 tokens/s, 137 tokens, context 16, seed 976567250)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6570.52 ms
llama_print_timings:      sample time =   121.34 ms /   138 runs   (    0.88 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time = 24275.60 ms /   138 runs   (  175.91 ms per token)
llama_print_timings:       total time = 24800.14 ms
Output generated in 24.98 seconds (5.48 tokens/s, 137 tokens, context 16, seed 1030662172)
  • 7B (Q4_0) with Metal enabled:
llama_print_timings:        load time =  6887.87 ms
llama_print_timings:      sample time =   133.05 ms /    77 runs   (    1.73 ms per token)
llama_print_timings: prompt eval time =  6887.83 ms /    16 tokens (  430.49 ms per token)
llama_print_timings:        eval time =  8762.61 ms /    76 runs   (  115.30 ms per token)
llama_print_timings:       total time = 16282.09 ms
Output generated in 16.53 seconds (4.60 tokens/s, 76 tokens, context 16, seed 1703054888)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6887.87 ms
llama_print_timings:      sample time =   229.93 ms /    77 runs   (    2.99 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  7226.16 ms /    77 runs   (   93.85 ms per token)
llama_print_timings:       total time =  8139.00 ms
Output generated in 8.44 seconds (9.01 tokens/s, 76 tokens, context 16, seed 1286945878)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6887.87 ms
llama_print_timings:      sample time =   133.86 ms /    77 runs   (    1.74 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  7927.91 ms /    77 runs   (  102.96 ms per token)
llama_print_timings:       total time =  8573.75 ms
Output generated in 8.84 seconds (8.60 tokens/s, 76 tokens, context 16, seed 708232749)
  • 7B (Q4_0) with CPU:
llama_print_timings:        load time =  2274.95 ms
llama_print_timings:      sample time =    64.64 ms /    91 runs   (    0.71 ms per token)
llama_print_timings: prompt eval time =  2274.92 ms /    16 tokens (  142.18 ms per token)
llama_print_timings:        eval time =  8862.54 ms /    90 runs   (   98.47 ms per token)
llama_print_timings:       total time = 11383.13 ms
Output generated in 11.58 seconds (7.77 tokens/s, 90 tokens, context 16, seed 465242178)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  2274.95 ms
llama_print_timings:      sample time =    65.01 ms /    91 runs   (    0.71 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  9130.53 ms /    91 runs   (  100.34 ms per token)
llama_print_timings:       total time =  9402.73 ms
Output generated in 9.58 seconds (9.39 tokens/s, 90 tokens, context 16, seed 180162858)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  2274.95 ms
llama_print_timings:      sample time =    65.14 ms /    91 runs   (    0.72 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  9392.93 ms /    91 runs   (  103.22 ms per token)
llama_print_timings:       total time =  9651.00 ms
Output generated in 9.85 seconds (9.14 tokens/s, 90 tokens, context 16, seed 774229344)
  • 7B (Q6_K) with Metal enabled:
llama_print_timings:        load time =  6842.40 ms
llama_print_timings:      sample time =   124.85 ms /    74 runs   (    1.69 ms per token)
llama_print_timings: prompt eval time =  6842.36 ms /    16 tokens (  427.65 ms per token)
llama_print_timings:        eval time = 10257.48 ms /    73 runs   (  140.51 ms per token)
llama_print_timings:       total time = 17754.36 ms
Output generated in 18.01 seconds (4.05 tokens/s, 73 tokens, context 16, seed 1028397680)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6842.40 ms
llama_print_timings:      sample time =   166.95 ms /    74 runs   (    2.26 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  8654.03 ms /    74 runs   (  116.95 ms per token)
llama_print_timings:       total time =  9367.92 ms
Output generated in 9.61 seconds (7.60 tokens/s, 73 tokens, context 16, seed 1790436034)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  6842.40 ms
llama_print_timings:      sample time =   124.14 ms /    74 runs   (    1.68 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time =  7982.41 ms /    74 runs   (  107.87 ms per token)
llama_print_timings:       total time =  8608.89 ms
Output generated in 8.84 seconds (8.26 tokens/s, 73 tokens, context 16, seed 2034046597)
  • 7B (Q6_K) with CPU:
llama_print_timings:        load time =  9696.51 ms
llama_print_timings:      sample time =    70.16 ms /    93 runs   (    0.75 ms per token)
llama_print_timings: prompt eval time =  9696.47 ms /    16 tokens (  606.03 ms per token)
llama_print_timings:        eval time = 32414.95 ms /    92 runs   (  352.34 ms per token)
llama_print_timings:       total time = 42499.44 ms
Output generated in 42.76 seconds (2.15 tokens/s, 92 tokens, context 16, seed 207008041)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  9696.51 ms
llama_print_timings:      sample time =    69.48 ms /    93 runs   (    0.75 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time = 35336.74 ms /    93 runs   (  379.96 ms per token)
llama_print_timings:       total time = 35719.30 ms
Output generated in 36.00 seconds (2.56 tokens/s, 92 tokens, context 16, seed 471859824)
Llama.generate: prefix-match hit

llama_print_timings:        load time =  9696.51 ms
llama_print_timings:      sample time =    71.53 ms /    93 runs   (    0.77 ms per token)
llama_print_timings: prompt eval time =     0.00 ms /     1 tokens (    0.00 ms per token)
llama_print_timings:        eval time = 38983.90 ms /    93 runs   (  419.18 ms per token)
llama_print_timings:       total time = 39360.85 ms
Output generated in 39.67 seconds (2.32 tokens/s, 92 tokens, context 16, seed 943906759)

So it seems there is no performance improvement for Q4_0 at this time, but an over 3x improvement for Q6_K.
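
The 3x figure can be checked from the logs themselves. The sketch below parses the `eval time` lines of the 7B (Q6_K) runs copied from above and compares milliseconds per generated token; the regular expression and the helper name `ms_per_token` are illustrative choices, not part of llama.cpp.

```python
import re

# Matches "eval time" lines but not "prompt eval time" lines.
EVAL_LINE = re.compile(r"llama_print_timings:\s+eval time =\s*([\d.]+) ms /\s*(\d+) runs")

Q6K_METAL = """
llama_print_timings:        eval time = 10257.48 ms /    73 runs   (  140.51 ms per token)
llama_print_timings:        eval time =  8654.03 ms /    74 runs   (  116.95 ms per token)
llama_print_timings:        eval time =  7982.41 ms /    74 runs   (  107.87 ms per token)
"""

Q6K_CPU = """
llama_print_timings:        eval time = 32414.95 ms /    92 runs   (  352.34 ms per token)
llama_print_timings:        eval time = 35336.74 ms /    93 runs   (  379.96 ms per token)
llama_print_timings:        eval time = 38983.90 ms /    93 runs   (  419.18 ms per token)
"""


def ms_per_token(log: str) -> float:
    """Total eval milliseconds divided by total runs over all eval lines in the log."""
    pairs = [(float(ms), int(runs)) for ms, runs in EVAL_LINE.findall(log)]
    return sum(ms for ms, _ in pairs) / sum(runs for _, runs in pairs)


if __name__ == "__main__":
    speedup = ms_per_token(Q6K_CPU) / ms_per_token(Q6K_METAL)
    print(f"Q6_K speedup with Metal: {speedup:.2f}x")
```

Pasting the Q4_0 logs into the same function gives a ratio close to 1, which matches the observation that Q4_0 shows no improvement.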

@MeinDeutschkurs

MeinDeutschkurs commented Jun 20, 2023

I'm sorry: it is not the AUTOMATIC1111 for text generation on Mac. Whatever I tried, I was never able to run it on Apple Silicon devices.

I give up.

🏳️

@ryanniccolls

Prompt Engineering has compiled most of the issues above into a video.
https://youtu.be/btmVhRuoLkc

@MeinDeutschkurs

Prompt Engineering has compiled most of the issues above into a video. https://youtu.be/btmVhRuoLkc

I'll give it a try! Thank you for the link.

@noobmldude

I see this issue when loading an AutoGPTQ model:

/text-generation-webui/uienv/lib/python3.10/site-packages/torch/cuda/__init__.py", line 239, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
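
This assertion is raised when code asks PyTorch for a CUDA device on a build without CUDA, which is always the case on Apple Silicon. A common way to avoid it in your own scripts is to choose the device at runtime. The sketch below is a generic illustration and not the loader used by AutoGPTQ or by this project; `pick_device` and `detect_device` are hypothetical names, while `torch.cuda.is_available()` and `torch.backends.mps.is_available()` are existing PyTorch calls.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then MPS, then fall back to the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for the available backends; return 'cpu' if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not ship the mps backend module at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


if __name__ == "__main__":
    print(detect_device())
```

Whether a given loader accepts an "mps" device is a separate question; libraries that call CUDA-only kernels will still fail even when the device is selected this way.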

@unixwzrd

Using this over the past few months, I've been impressed with what it does, but it definitely does not use the M1/M2 GPU and unified memory. I think I may have tracked it down and am working on a fix once and for all. I'm keeping track of my progress and hope to push to the repository in the next couple of days. The problem seems to be that tensor processing falls back to the CPU when the device gets changed (I think) to CUDA: not finding CUDA on my ARM64 M2, it falls back to the CPU rather than to MPS. Stress testing of my Python modules confirmed that I can use 96% of my M2's GPU, and I learned way more than I ever wanted to about the ARM64 M2 Max and unified memory.

So far I have completely rebuilt all of the Python modules and stress tested them with complete success. So, I know it's not my modules or environment. I built my own OpenBLAS and LAPACK, I have instructions for that, but I think all can be solved easier than picking apart module builds and dependencies. I am a complete GitHub Open Source contribution virgin, so if someone would like to reach out as a contact in case I have any questions about the process, I don't want to screw the repository up making my changes.

Fingers crossed, I'll have MPS working with all this soon and fixes uploaded with instructions to follow.

Cheers!

@RDearnaley

RDearnaley commented Jul 14, 2023

Since the main installation README.md for oobabooga on Mac links to this thread, which is quite long and mostly several months old, and refers to developer builds of a macOS version that has now been released, it would be really nice if someone could give a concise, up-to-date set of instructions summarizing how to enable MPS when installing oobabooga on Mac. It would be even nicer if they could add it to the main installation README.md instead of linking to this thread.

@MeinDeutschkurs

MeinDeutschkurs commented Jul 14, 2023

Since the main installation README.md for oobabooga on Mac links to this thread, which is quite long and mostly several months old, and refers to developer builds of a macOS version that has now been released, it would be really nice if someone could give a concise, up-to-date set of instructions summarizing how to enable MPS when installing oobabooga on Mac. It would be even nicer if they could add it to the main installation README.md instead of linking to this thread.

I tried to do it like the link above (video https://youtu.be/btmVhRuoLkc), but I'm still struggling. I started to try my luck with lm-sys/FastChat

Edit: And for my use case, FastChat works as intended.

@unixwzrd

unixwzrd commented Jul 14, 2023

Here's my writeup as a work in progress.

https://github.com/unixwzrd/oobabooga-macOS

UPDATE: Created a repository for oobabooga-macOS with my raw text notes, will get more added to it soon, but figured to publish my notes as they might help someone now.

UPDATE: Made it through a lot of this and created a very comprehensive guide to the tools you will need, using Conda and venvs, and have everything working now for LLaMA models. The GPU is processing; I can actually see it when monitoring performance. There's still something hitting the CPU, and Gradio chews up CPU and GPU in Firefox (like the orange "working" bar, about 30% of the GPU when visible); there's got to be a better way to do some of this stuff. I discovered several other modules which need extra attention along the way, and I did have a little trouble with the request module after rebasing and merging my local working copy of oobabooga to current. I'm not sure why, but tinkering with it got it working.

Fun fact, on a 30B with no quantization and 3k context, I'm getting about 4-6 tokens per second. I'm sure there is some tuning I could do, but this is pretty good and very usable.

I'm finishing up my writeup. I've been building up my environment and testing everything along the way. I should have something soon; I had a couple of things I had to back out, and I've had a couple of extensions tear down my venv or stop working for one reason or another. There are a lot of interdependencies, and even version interdependencies, which make some of the extensions incompatible. Whisper is one which sticks out for me: it insists on downgrading things like NumPy. I'll try to have my notes up by the end of the weekend. I have learned quite a bit about all the moving parts here, but even so, my first attempt at a macOS README will most likely not be comprehensive.

I've been sifting through the codebase here and have turned up inconsistencies in how Torch devices are handled, and I'm trying to get a handle on that.

I have got llama.cpp working standalone and built for MPS, but for some reason I am having difficulty with the llama_cpp_python module, which is, on a simple level, just a Python wrapper around llama.cpp (actually, it's a dylib so you can have its functionality in Python). The Python module seems to have refused to build with Metal support the last time I installed it, so either something changed or I botched the installation.

There are about three major items on my to-do list: the Python modules and building them, Torch devices and the codebase, and creating a README with all my notes and steps. I have a couple of stress-testing/probing scripts which do nothing meaningful other than manipulate data types and tensors using PyTorch, showing it working and consuming GPU resources on Apple Silicon. If anyone would find those useful, let me know and I'll create a repository for them.

@jessyone

MacBook Pro (Retina, 15-inch, Late 2013)
2 GHz quad-core Intel Core i7
With the configuration above, running ./start_macos.sh in the terminal produces the following error:
******************************************************************

  • WARNING: You haven't downloaded any model yet.
  • Once the web UI launches, head over to the bottom of the "Model" tab and download one.

Traceback (most recent call last):
File "/Users/macbookpro/Desktop/oobabooga_macos/text-generation-webui/server.py", line 12, in <module>
import gradio as gr
ModuleNotFoundError: No module named 'gradio'
(base) macbookdembp:oobabooga_macos macbookpro$ ./start_macos.sh


  • WARNING: You haven't downloaded any model yet.
  • Once the web UI launches, head over to the bottom of the "Model" tab and download one.

Traceback (most recent call last):
File "/Users/macbookpro/Desktop/oobabooga_macos/text-generation-webui/server.py", line 12, in <module>
import gradio as gr
ModuleNotFoundError: No module named 'gradio'

Please help me find a solution, thank you.

@hsheraz

hsheraz commented Aug 7, 2023

Getting this error when running: python server.py --thread 4

Intel MKL WARNING: Support of Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) enabled only processors has been deprecated. Intel oneAPI Math Kernel Library 2025.0 will require Intel(R) Advanced Vector Extensions (Intel(R) AVX) instructions.
Intel MKL WARNING: Support of Intel(R) Streaming SIMD Extensions 4.2 (Intel(R) SSE4.2) enabled only processors has been deprecated. Intel oneAPI Math Kernel Library 2025.0 will require Intel(R) Advanced Vector Extensions (Intel(R) AVX) instructions.
Traceback (most recent call last):
File "/Users/<USER_NAME>/Documents/Random Work/LLM/text-generation-webui/server.py", line 27, in <module>
import modules.extensions as extensions_module
File "/Users/<USER_NAME>/Documents/Random Work/LLM/text-generation-webui/modules/extensions.py", line 8, in <module>
import modules.shared as shared
File "/Users/<USER_NAME>/Documents/Random Work/LLM/text-generation-webui/modules/shared.py", line 239, in <module>
args.loader = fix_loader_name(args.loader)
File "/Users/<USER_NAME>/Documents/Random Work/LLM/text-generation-webui/modules/shared.py", line 202, in fix_loader_name
name = name.lower()
AttributeError: 'NoneType' object has no attribute 'lower'

I am trying to run model Llama-2-7B-Chat-GGML
MacOS M1 chip, 8 GB
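
The AttributeError means that `args.loader` was None when `fix_loader_name` called `name.lower()`. The body of the real function is not shown in this thread, so the following is only a hypothetical stand-in (the name `normalize_loader_name` is invented here) showing the kind of None guard that avoids this particular crash. It is not the project's actual fix.

```python
from typing import Optional


def normalize_loader_name(name: Optional[str]) -> Optional[str]:
    """Hypothetical stand-in for fix_loader_name: lowercases, but tolerates None."""
    if name is None:
        # No loader was specified; leave the decision to later code.
        return None
    return name.lower()


if __name__ == "__main__":
    print(normalize_loader_name(None))
    print(normalize_loader_name("LLaMA.cpp"))
```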

@clearsitedesigns

Is this approach still valid even though torch is now at 2.0.1 stable?

@manikantapothuganti

Is this approach still valid even though torch is now at 2.0.1 stable?

The latest install command for PyTorch includes the following note, which suggests that the previously described approach may no longer be required. Could you please confirm this, @WojtekKowaluk? Thank you!

# MPS acceleration is available on MacOS 12.3+
pip3 install torch torchvision torchaudio

@WojtekKowaluk
Contributor Author

WojtekKowaluk commented Aug 10, 2023

@clearsitedesigns @manikantapothuganti:
The PyTorch dev version is still required. It will no longer be required once PyTorch 2.1.0 is released.

While PyTorch 2.0.1 has some cumsum op support, you will get this error (a different error than with 2.0.0):

RuntimeError: MPS does not support cumsum op with int64 input
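
The rule stated above (a dev build is needed until 2.1.0) can be turned into a small check. This is a sketch under that single assumption; the helper name `needs_dev_build` is hypothetical, and it only inspects the major and minor numbers, so a 2.1.0 nightly string counts as new enough.

```python
import re


def needs_dev_build(torch_version: str) -> bool:
    """True if the version is older than 2.1, following the comment above."""
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"unrecognised version string: {torch_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (2, 1)


if __name__ == "__main__":
    for version in ("2.0.0", "2.0.1", "2.1.0"):
        print(version, needs_dev_build(version))
```

In practice the argument would be `torch.__version__` from the environment that runs the web UI.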

@bbecausereasonss

bbecausereasonss commented Aug 11, 2023

Hmm. Tried this and am getting this error?

GGML_ASSERT: /private/var/folders/hl/v2rnh0fx3hjcctc41tzk7rkw0000gn/T/pip-install-mdqd4bv2/llama-cpp-python_dff92b844a124189bb952ea0fbc93386/vendor/llama.cpp/ggml-metal.m:738: false && "not implemented"
/bin/sh: line 1: 51430 Abort trap: 6 python server.py --chat

@oobabooga
Owner

FYI, I have created a new general thread for discussing Mac/Metal setup: #3760

It is now pinned at the top of the Issues section.
