Disabling exllama on old compute. #986

Merged: 2 commits, Sep 6, 2023
17 changes: 12 additions & 5 deletions server/text_generation_server/utils/layers.py
@@ -18,13 +18,20 @@

 from text_generation_server.utils.gptq.quant_linear import QuantLinear

-HAS_EXLLAMA = True
-if os.getenv("DISABLE_EXLLAMA") == "True":
-    HAS_EXLLAMA = False
 try:
-    from text_generation_server.utils.gptq.exllama import Ex4bitLinear
-except ImportError:
+    major, _minor = torch.cuda.get_device_capability()
+except Exception:
+    major = 1
 HAS_EXLLAMA = False
+CAN_EXLLAMA = major >= 8
+if os.getenv("DISABLE_EXLLAMA") == "True":
+    HAS_EXLLAMA = False
+elif CAN_EXLLAMA:
+    try:
+        from text_generation_server.utils.gptq.exllama import Ex4bitLinear
+        HAS_EXLLAMA = True
+    except ImportError:
+        pass

 from typing import Optional
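For readers skimming the diff: the exllama GPTQ kernels require a GPU with CUDA compute capability 8.0 (Ampere) or newer, so the module now probes the device once at import time and keeps two flags: CAN_EXLLAMA (the hardware could run the kernels) and HAS_EXLLAMA (the kernels are importable and not disabled via the DISABLE_EXLLAMA environment variable). A minimal standalone sketch of the same gate follows, assuming only that torch is installed; the _load_kernels helper is hypothetical and stands in for the real Ex4bitLinear import:

import os

import torch


def _load_kernels() -> None:
    # Hypothetical stand-in for:
    #   from text_generation_server.utils.gptq.exllama import Ex4bitLinear
    raise ImportError("exllama kernels are not built in this sketch")


def detect_exllama() -> tuple[bool, bool]:
    """Return (CAN_EXLLAMA, HAS_EXLLAMA) following the gating added in this PR."""
    try:
        major, _minor = torch.cuda.get_device_capability()
    except Exception:
        major = 1  # no usable CUDA device: treat as unsupported hardware

    can_exllama = major >= 8  # Ampere (sm_80) or newer
    has_exllama = False

    if os.getenv("DISABLE_EXLLAMA") == "True":
        pass  # explicit user opt-out: never enable, even on capable hardware
    elif can_exllama:
        try:
            _load_kernels()
            has_exllama = True
        except ImportError:
            pass  # kernels unavailable; callers fall back to the slower path

    return can_exllama, has_exllama

Splitting the two flags lets the hardware question (CAN_EXLLAMA) be answered independently of the installation question (HAS_EXLLAMA), which the weights.py change below relies on; setting DISABLE_EXLLAMA=True in the environment forces the fallback even on supported GPUs.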
4 changes: 2 additions & 2 deletions server/text_generation_server/utils/weights.py
@@ -170,10 +170,10 @@ def get_multi_weights_row(self, prefix: str, quantize: str):
                     "Cannot load `gptq` weight, make sure the model is already quantized, or quantize it with `text-generation-server quantize ORIGINAL_MODEL_ID NEW_MODEL_ID`"
                 )

-            from text_generation_server.utils.layers import HAS_EXLLAMA
+            from text_generation_server.utils.layers import HAS_EXLLAMA, CAN_EXLLAMA

             if use_exllama:
-                if not HAS_EXLLAMA:
+                if not HAS_EXLLAMA and CAN_EXLLAMA:
                     logger.warning(
                         "Exllama GPTQ cuda kernels (which are faster) could have been used, but are not currently installed, try using BUILD_EXTENSIONS=True"
                     )
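This is the payoff of the new CAN_EXLLAMA flag: previously a missing exllama build triggered the warning on every GPTQ load, including on pre-Ampere GPUs where installing the kernels would not have helped; now the warning fires only when the hardware could actually use them. A condensed sketch of the check, assuming loguru's logger as used by the TGI server, and assuming (the diff is truncated here) that the caller falls back to the plain GPTQ path when the kernels are absent:

from loguru import logger


def resolve_use_exllama(use_exllama: bool, has_exllama: bool, can_exllama: bool) -> bool:
    """Decide whether to use exllama kernels, warning only when they would help."""
    if use_exllama and not has_exllama:
        if can_exllama:
            # Capable GPU, kernels missing: the warning is actionable.
            logger.warning(
                "Exllama GPTQ cuda kernels (which are faster) could have been "
                "used, but are not currently installed, try using "
                "BUILD_EXTENSIONS=True"
            )
        return False  # assumption: fall back when the kernels are absent
    return use_exllama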