
Use llama.cpp or TextGen-webui? #1

Open
MichaelMartinez opened this issue Aug 21, 2023 · 2 comments

@MichaelMartinez

Cool project!!!

There are a lot of models out there that will probably perform way better than vanilla llama 2. To get an idea, have a look at this HF space: https://huggingface.co/spaces/mike-ravkine/can-ai-code-results

Also, running a 13b model locally is relatively trivial at this point, even on modest hardware, if you use a quantized version. I am looking at your code base to see where something like llama.cpp or textgen-webui could be hooked in.
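
For what it's worth, here is a rough sketch of what a llama.cpp entry point could look like using the llama-cpp-python bindings (pip install llama-cpp-python). The model path, prompt, and generation parameters below are placeholders, not anything from this repo:

# Minimal sketch: load a quantized model with llama-cpp-python and run a completion.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/codellama-13b-instruct.Q4_K_M.gguf",  # placeholder path to any quantized model file
    n_ctx=4096,        # context window size
    n_gpu_layers=-1,   # offload all layers to GPU if built with GPU support; 0 for CPU only
)

output = llm(
    "[INST] Write a Python function that reverses a string. [/INST]",  # illustrative prompt only
    max_tokens=512,
    temperature=0.7,
)
print(output["choices"][0]["text"])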

@jawerty
Owner

jawerty commented Aug 21, 2023 via email

@metantonio

Change this block of code:

from transformers import AutoTokenizer
import transformers
import torch

model = "meta-llama/Llama-2-13b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
) 

To this:

from transformers import AutoTokenizer
import transformers
import torch
from auto_gptq import AutoGPTQForCausalLM


# model = "meta-llama/Llama-2-13b-chat-hf"  # this works, but needs roughly 25-30 GB of VRAM
model = "TheBloke/CodeLlama-7B-Instruct-GPTQ"  # quantized model; fits on GPUs with under 15 GB of VRAM
model_basename = "model"
use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model, use_fast=True)

# Load the GPTQ-quantized weights instead of the full-precision checkpoint
quantized_model = AutoGPTQForCausalLM.from_quantized(
    model,
    model_basename=model_basename,
    use_safetensors=True,
    trust_remote_code=True,
    device="cuda:0",
    use_triton=use_triton,
    quantize_config=None,
)

pipeline = transformers.pipeline(
    "text-generation",
    model=quantized_model,
    tokenizer=tokenizer,
    max_new_tokens=4096,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)

With this change you can use quantized models, CodeLlama, etc.
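
The pipeline keeps the same text-generation interface as the original one, so the rest of the code base should not need further changes. For example (the prompt below is just an illustration; the real prompt format comes from the rest of the repo):

# Example call: the quantized pipeline is a drop-in replacement for the original one.
prompt = "[INST] Write a Python function that reverses a string. [/INST]"  # illustrative prompt only

outputs = pipeline(prompt)
print(outputs[0]["generated_text"])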
