Refactored exl2 method to add LoRA, 8bit cache, and other features supported by exllama #729
Conversation
Great! Is this ready for review?
Updates: Added LoRA support. LoRAs can now be hot-swapped dynamically as needed. Here is an example of how to use the LoRA feature:

```python
from outlines import models, generate

# Load the base model
model = models.exl2(model_path="/path/to/mistral_openorca", max_seq_len=8192, device="cuda", gpu_split="auto", verbose=True)
generator = generate.text(model)
answer = generator("Почему трава зеленая?", max_tokens=100)  # "Why is the grass green?"
print(answer)

# Hot-swap in a LoRA adapter
model.update_lora("/path/to/russian_openorca")
generator = generate.text(model)
answer = generator("Почему трава зеленая?", max_tokens=100)
print(answer)

# Unload the adapter and fall back to the base model
model.update_lora(None)
generator = generate.text(model)
answer = generator("Почему трава зеленая?", max_tokens=100)
print(answer)
```

This is a demonstration showing the new loading/unloading capabilities. The following models/adapters were used in this demo:

Model:
Adapter:
This is really awesome! Let me know when I can review.
Yeah, the inputs were messed up. I put the try/except input block inside update_lora. It may be a little slower to start, but subsequent loads still finish in 0.0s. The latest update will fix the error.
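For context, a minimal sketch of what a try/except block inside update_lora might look like. This is an illustration only: the ExLlamaV2Lora import path, the from_directory loader, and the self.lora/self.model attribute names are assumptions about the exllamav2 API and this branch, not the PR's actual code.

```python
from exllamav2.lora import ExLlamaV2Lora  # assumption: import path may differ


class ExLlamaV2Model:
    # ... rest of the model wrapper elided ...

    def update_lora(self, lora_path):
        """Load, hot-swap, or (with lora_path=None) unload a LoRA adapter."""
        if lora_path is None:
            # Unload: subsequent generations use the base weights only.
            self.lora = None
            return
        try:
            # The first load reads the adapter from disk and is slower;
            # repeat loads are near-instant once the files are cached.
            self.lora = ExLlamaV2Lora.from_directory(self.model, lora_path)
        except Exception as exc:
            raise ValueError(
                f"Failed to load LoRA adapter from {lora_path}"
            ) from exc
```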
Please feel free to review this branch. It is ready.
I can make all these changes and push ASAP.
I pushed the changes and updated my branch. There are some issues between my implementation and the regex changes pushed last week. A week ago this code was able to run without errors; however, now I receive the following error:
Can you clear the cache using
I don't have this problem locally, so it must be the cache. I still need to try the LoRA hot-swapping functionality; I will take a look tomorrow and hopefully merge this.
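Assuming the stale artifacts live in outlines' on-disk cache, one hedged way to clear it might be the clear_cache helper below; the import path is an assumption about the outlines caching module, not the exact command suggested above (which is elided in this thread).

```python
# Assumption: outlines keeps a disk cache of compiled artifacts and
# exposes a clear_cache helper; verify the import against your version.
from outlines.caching import clear_cache

clear_cache()  # remove stale cached entries before re-running the repro
```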
Great work, thank you!
Refactored the exl2 function in exllamav2.py.
The new version offers the following benefits (see the sketch below):
- LoRA support, with adapters that can be hot-swapped or unloaded dynamically via update_lora
- 8-bit cache support
- other features supported by exllamav2
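As a sketch of how the refactored loader might be invoked with these features, based on the parameters shown in the demo above; the cache_8bit keyword is a hypothetical name for the 8-bit cache option named in the PR title, not a confirmed parameter:

```python
from outlines import models, generate

# cache_8bit is a hypothetical keyword for the 8-bit KV cache named in
# the PR title; check the merged signature for the actual parameter.
model = models.exl2(
    model_path="/path/to/mistral_openorca",
    max_seq_len=8192,
    device="cuda",
    gpu_split="auto",
    cache_8bit=True,  # assumption: roughly halves KV-cache memory vs. FP16
    verbose=True,
)
generator = generate.text(model)
print(generator("Why is the grass green?", max_tokens=100))
```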
Future effort.