How to use llama index with alpaca locally #928

Closed

1Mark opened this issue Mar 27, 2023 · 44 comments

@1Mark

1Mark commented Mar 27, 2023

I want to use llamaindex but I don't want any data of mine to be transferred to any servers. I want it all to happen locally or within my own EC2 instance. I have seen https://github.com/jerryjliu/llama_index/blob/046183303da4161ee027026becf25fb48b67a3d2/docs/how_to/custom_llms.md#example-using-a-custom-llm-model but it calls hugging face.

My plan was to use https://github.com/cocktailpeanut/dalai with the alpaca model then somehow use llamaindex to input my dataset. Any examples or pointers for this?

@logan-markewich
Collaborator

@1Mark you just need to replace the huggingface stuff with your code to load/run alpaca

Basically, you need to code the model loading, putting text through the model, and returning the newly generated outputs.

It's going to be different for every model, but it's not too bad 😄

@1Mark
Author

1Mark commented Mar 27, 2023

@1Mark you just need to replace the huggingface stuff with your code to load/run alpaca

Basically, you need to code the model loading, putting text through the model, and returning the newly generated outputs.

It's going to be different for every model, but it's not too bad 😄

Thank you. Do you have any examples?

@logan-markewich
Collaborator

@1Mark I personally haven't used llama or alpaca. How are you loading the model and generating text right now?

here's a very rough example with some fake functions to kind of show what I mean

from typing import Any, List, Mapping, Optional

from langchain.llms.base import LLM

def load_alpaca():
    # fake function: load the alpaca model however you normally would
    ...
    return model

class CustomLLM(LLM):
    model_name = "alpaca"
    model = load_alpaca()

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        prompt_length = len(prompt)

        response_text = self.model(prompt)

        # only return newly generated tokens
        return response_text[prompt_length:]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": self.model_name}

    @property
    def _llm_type(self) -> str:
        return "custom"

@gianfra-t

Hi @1Mark. When you use something like the link above, you download the model from Hugging Face, but the inference (the call to the model) happens on your local machine. Your data does not go to Hugging Face. You can even verify this by loading a very large model: you will probably run out of VRAM, or of RAM if running on CPU.
For instance, you could make use of the tloen/alpaca-lora-7b implementation.
If you want to use something like dalai (something running a llama.cpp instance), you need to find an implementation that exposes the model behind a server with an API you can call. I don't know of such an implementation at the moment, but it should be very simple.
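
A rough sketch of the client side of that idea, assuming a hypothetical local server that exposes the model at http://localhost:8000/generate and returns JSON with a "text" field (both the endpoint and the JSON shape are made up for illustration):

from typing import Any, List, Mapping, Optional

import requests
from langchain.llms.base import LLM


class HTTPAlpacaLLM(LLM):
    # hypothetical endpoint of a locally running llama.cpp/alpaca server
    endpoint: str = "http://localhost:8000/generate"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # send the prompt to the local server and return its completion
        resp = requests.post(self.endpoint, json={"prompt": prompt}, timeout=600)
        resp.raise_for_status()
        return resp.json()["text"]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"endpoint": self.endpoint}

    @property
    def _llm_type(self) -> str:
        return "custom-http"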

@jerryjliu
Collaborator

If someone's able to get alpaca or llama working with llamaindex lmk! would be a cool demo to show :)

@1Mark
Author

1Mark commented Mar 28, 2023

Hi @1Mark. When you use something like the link above, you download the model from Hugging Face, but the inference (the call to the model) happens on your local machine. Your data does not go to Hugging Face. You can even verify this by loading a very large model: you will probably run out of VRAM, or of RAM if running on CPU. For instance, you could make use of the tloen/alpaca-lora-7b implementation. If you want to use something like dalai (something running a llama.cpp instance), you need to find an implementation that exposes the model behind a server with an API you can call. I don't know of such an implementation at the moment, but it should be very simple.

tloen/alpaca-lora-7b doesn't seem to have its own inference api https://huggingface.co/tloen/alpaca-lora-7b#:~:text=Unable%20to%20determine%20this%20model%E2%80%99s%20pipeline%20type.%20Check%20the%20docs%20%20.

@1Mark
Author

1Mark commented Mar 28, 2023

This issue here seems quite relevant tloen/alpaca-lora#45

@logan-markewich
Collaborator

logan-markewich commented Mar 28, 2023

@1Mark the code in that repo (i.e. generate.py) could easily be adapted to work with llama_index. You just need to move the model loading and inference code into the custom LLM class.

@knoopx

knoopx commented Mar 29, 2023

Something along these lines works with pip -q install git+https://github.com/huggingface/transformers:

from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
from langchain import LLMChain, PromptTemplate
from langchain.llms import HuggingFacePipeline

tokenizer = LlamaTokenizer.from_pretrained("chavinlo/alpaca-native")

base_model = LlamaForCausalLM.from_pretrained(
    "chavinlo/alpaca-native",
    load_in_8bit=True,
    device_map='auto',
)

pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=256,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2
)

local_llm = HuggingFacePipeline(pipeline=pipe)

# the prompt template was not defined in the original snippet; any PromptTemplate works
prompt = PromptTemplate(input_variables=["question"], template="{question}")
llm_chain = LLMChain(prompt=prompt, llm=local_llm)

@logan-markewich
Collaborator

logan-markewich commented Mar 29, 2023

@knoopx nice! So if that's wrapped into the CustomLLM class from above and passed as an LLMPredictor LLM, the integration should work!

How well it works is up to the model though lol
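
(A rough, untested sketch of that wrapping, reusing the pipe object from the snippet above; trimming the prompt off the generated text is the main detail to get right.)

from typing import Any, List, Mapping, Optional

from langchain.llms.base import LLM


class AlpacaLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # run the HF text-generation pipeline defined above and
        # return only the newly generated part of the text
        output = pipe(prompt)[0]["generated_text"]
        return output[len(prompt):]

    @property
    def _identifying_params(self) -> Mapping[str, Any]:
        return {"name_of_model": "chavinlo/alpaca-native"}


# then pass it to llama_index as: LLMPredictor(llm=AlpacaLLM())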

@Tavish77

Can I combine your code in this way?
LLMPredictor(llm=local_llm)

@logan-markewich
Collaborator

@Tavish77 not quite. You'll still need to wrap it in that class that extends the LLM class. I had an example posted further above 👍🏻

Then you instantiate that class and pass it in like you did there

@shreedhan

@logan-markewich I'm trying to combine the examples you posted above. What do you return as the model from load_alpaca() method? Do you return llm_chain? Can you post the full example here?

@donflopez

donflopez commented Apr 1, 2023

Hey, I'm loading a peft.PeftModel.from_pretrained and following the instructions in this thread and in here but I get multiple errors:

The model 'PeftModelForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 0 tokens
caaaaling
Token indices sequence length is longer than the specified maximum sequence length for this model (1622 > 1024). Running this sequence through the model will result in indexing errors
/home/donflopez/.local/lib/python3.10/site-packages/transformers/generation/utils.py:1219: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation)
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [162,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [162,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1146: indexSelectLargeIndex: block: [162,0,0], thread: [66,0,0] Assertion `srcIndex 
....many more with the same...
Traceback (most recent call last):
....many hops...
    x = torch.addmm(self.bias, x.view(-1, x.size(-1)), self.weight)
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

Does anybody know what's going on? Thanks!

EDIT for adding more context:

If I use the decapoda-research/llama-7b-hf model instead, I get an error like:

ValueError: `.to` is not supported for `8-bit` models. Please use the model as it is, since the model has already been set to the correct devices and casted to the correct `dtype`.

@devinSpitz

devinSpitz commented Apr 2, 2023

The code in the attached file is how far I got trying to make it work with llama_index. Does someone know what I'm doing wrong?

alpaca_llama_index.txt

The exception happens during the pipeline call:

Traceback (most recent call last):
  File "/workspace/LLama-Hub/main2.py", line 68, in <module>
    class CustomLLM(LLM):
  File "/workspace/LLama-Hub/main2.py", line 79, in CustomLLM
    pipeline = pipeline(
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/init.py", line 979, in pipeline
    return pipeline_class(model=model, framework=framework, task=task, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/text_generation.py", line 63, in init
    super().init(*args, kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/pipelines/base.py", line 773, in init
    self.model.to(device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  [Previous line repeated 6 more times]
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
NotImplementedError: Cannot copy out of meta tensor; no data! 
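
For what it's worth, this meta-tensor error (like the 8-bit .to error above) usually comes from moving a model that accelerate has already placed, e.g. by passing a device to pipeline(...) after loading with device_map="auto" or load_in_8bit=True. A hedged sketch of the usual fix, reusing the variable names from the earlier snippet:

from transformers import pipeline

# the model was loaded with device_map="auto" / load_in_8bit=True, so accelerate
# already placed its weights; do not pass device=... (or call .to()) afterwards
pipe = pipeline(
    "text-generation",
    model=base_model,   # already dispatched across devices by accelerate
    tokenizer=tokenizer,
    max_length=256,
)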



@donflopez

I got it to work here -> https://github.com/donflopez/alpaca-lora-llama-index/blob/main/generate.py

It is not perfect, but works...

@Fritskee

Fritskee commented Apr 2, 2023

I got it to work here -> https://github.com/donflopez/alpaca-lora-llama-index/blob/main/generate.py

It is not perfect, but works...

@donflopez In order to get your code running, I had to install transformers 4.28.0.dev0 (so building from github), but I'm still getting the following error now:

RuntimeError: Failed to import transformers.pipelines because of the following error (look up to see its traceback):
cannot import name 'BertTokenizerFast' from 'transformers.models.bert' 

Did you encounter this at all? (and how did you fix it?)

@h1f0x

h1f0x commented Apr 2, 2023

@donflopez On what hardware did you run the model like this? My RTX 4090 sadly hits its limit.
@devinSpitz Did you get that sorted out? Got the same issue with a modified version myself, any luck so far?

@devinSpitz

devinSpitz commented Apr 2, 2023

@h1f0x I could get @donflopez's repo to work, but I always got completely wrong answers or sometimes nothing (more or less the same as I now get with this version xD). It let me get further, but still with no usable response.

The model that should have "read" the documents (the LLaMA document and the PDF from the repo) does not give any useful answer anymore.

This was with base_model = circulus/alpaca-7b and the LoRA weights circulus/alpaca-lora-7b. I tried other models and combinations, but did not get any better result :(

Question: What do you think of Facebook's LlaMa?
before response: I think Facebook’s LLAMA (Learn, Launch and Maintain Audience) initiative is an excellent program which can help businesses of all sizes to reach their target audiences more effectively. It provides valuable resources such as training materials, tools and best practices for launching, maintaining and engaging with an audience on social media platforms.
after: Output should include references to sources where applicable.

This shows that something does work, or at least doesn't break?
Question: What is the capital of England?
before response: The capital of England is London.
after: The capital of England is London.

Question: What are alpacas? and how are they different from llamas?
before response: Alpacas are small, domesticated animals related to camels and native to South America. They are typically smaller than llamas and have finer fleeces which make them ideal for fiber production. Alpacas are also more docile and easier to handle than llamas.
after: Output should include references to sources used to create the output.

Code:
https://gist.github.com/devinSpitz/73cd7037b82d7acbe70ddf4d1c61ba4a

alpaca_Llama_index_output.txt

@donflopez

donflopez commented Apr 2, 2023

@donflopez On what hardware did you run the model like this? My RTX 4090 sadly hits its limit.

@h1f0x I'm running on a 4090 too; yes, multiple executions fail, and you also cannot go beyond 1 beam.

I'm trying to figure out why this happens. When querying the raw model, this does not happen, so it probably has something to do with llama_index plus the pipeline setup.

@devinSpitz, I also have weird results. Please note that in my code I have a . as the stop sequence. I'm still trying to find a stop sequence that works properly for llama_index. For me, the main issue is that the model tries to repeat the llama_index prompt as a pattern instead of stopping at the right place.
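
(A generic workaround sketch, not specific to any script in this thread: trim the generation at the first stop string before returning it from the custom LLM's _call. The example stop strings are just placeholders.)

from typing import List, Optional


def truncate_at_stop(text: str, stop: Optional[List[str]] = None) -> str:
    # cut the generated text at the earliest occurrence of any stop sequence
    for token in stop or []:
        idx = text.find(token)
        if idx != -1:
            text = text[:idx]
    return text


# e.g. inside CustomLLM._call:
#     return truncate_at_stop(generated_text, stop=["### Response:", "\n\n"])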

@donflopez

I'm getting this with - as the stop sequence: a bunch of nonsense after the first dot, the VRAM goes up to 23.5 GB, and after that it runs OOM.

Question: How many people lives in Martos?
Answer: According to data provided by INE, there are currently approximately 24 thousand two hundred seventeen residents living within the municipal boundaries of Martos. # Lijst van voetbalinterlands Oman - Saudi Arabië

Deze lijst van voetbalinterlands geeft een overzicht van alle officiële interlands tussen het nationale elftal van Oman en dat van Saudi-

@donflopez

donflopez commented Apr 2, 2023

@devinSpitz I got this output by tweaking your script to make it work with the index; llama still doesn't know when to stop. Using - as the stop sequence. -> https://gist.github.com/donflopez/535e5ecb85b79233c7cf74fd977eb87f

Improved it, here is the latest output:
https://gist.github.com/donflopez/39bb9bc34cc00467679f10bab3e4a734

@h1f0x looks like the OOM issue doesn't happen with the script, so it could be gradio copying the resources when making a request? I have no idea how gradio works tbh, but if I move things out of gradio, there's no OOM.

@ReconIII

ReconIII commented Apr 2, 2023

I have been trying to get this to work as well, but keep running into issues with sentencepiece:
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string

Anyone else having this or any suggestions? Thanks!

@juanps90

juanps90 commented Apr 3, 2023

Just like inference with OpenAI APIs doesn't happen locally, is there any way to use HTTP requests to send the prompts to a server exposing any LLM like Alpaca via HTTP? I feel like it would be easier if we could decouple the LLM.

@Tavish77

Tavish77 commented Apr 3, 2023

@Tavish77 not quite. You'll still need to wrap it in that class that extends the LLM class. I had an example posted further above 👍🏻

Then you instantiate that class and pass it in like you did there

Thank you, I have solved my problem.

@h1f0x

h1f0x commented Apr 3, 2023

@donflopez Many thanks for your feedback! I got it working CPU-only later that evening, but I needed to change the page management in Windows itself too to get it working. I hope I can try some new settings soon. Gradio is a mystery to me as well :D At least so far... I'm looking into that more deeply as well. If I find anything I'll let you know!

@devinSpitz At least you get some output; I was not able to produce that, haha, but I guess that's because of some strange behavior when running on CPU. :)

@Tavish77

Tavish77 commented Apr 4, 2023

@devinSpitz
I also encountered this issue on the 4090, but it runs normally on other devices. Have you solved it yet?

@Tavish77

Tavish77 commented Apr 4, 2023

@devinSpitz I also encountered this issue on the 4090, but it runs normally on other devices. Have you solved it yet?

I have already resolved it

@masknetgoal634

If you have an issue with the 4090, try installing the new 525.105.17 driver:
https://www.nvidia.com/Download/driverResults.aspx/202351/en-us/

@devinSpitz

devinSpitz commented Apr 4, 2023

@Tavish77 @masknetgoal634 Thanks to both of you. Yes, I'm using a 4090, so I will update the driver and try it again :D

@donflopez thanks as well, you are right about the stop sequence: "-" is a little bit better, but still not good :(

@h1f0x Yes that's right xD But I still want to get it working :D

@devinSpitz

@masknetgoal634 I'm already on a newer driver xD

@Tavish77 how did you solve it?

@karlklaustal

karlklaustal commented Apr 5, 2023

I made this work in a Colab notebook with LlamaIndex and the GPT4All model.
But you can only load small bits of text with LlamaIndex; if you load more text, the (non-Pro) Colab crashes. Sorry, my quota on Colab is always at max, so I'm pasting this:

https://pastebin.com/mGuhEBQS

I copied this from my local Jupyter, so be aware that some headings are not code,

like

" Load GPT4ALL-LORA Model"

Hope this helps. I'm now trying to swap the GPT4All-LoRA model for a 4-bit version, but I am somehow stuck.

I only have a 6 GB GPU.

@Tavish77

Tavish77 commented Apr 6, 2023

@devinSpitz

@Tavish77 how did you solve it?

I switched to a cloud GPU server.

@masknetgoal634

@masknetgoal634 I'm already on a newer driver xD

As far as I know, the fix for the 4090 is only in 525.105.17.

@ddb21

ddb21 commented Apr 21, 2023

Anybody make progress on this? Is it possible to use the CPU optimized (alpaca.cpp, etc) versions of Llama for creating embeddings or is a cloud service the only option here?

@logan-markewich
Collaborator

logan-markewich commented Apr 21, 2023

@ddb21 you should be able to use llama.cpp (or any LLM that langchain has implemented) by wrapping the LLM with the LLMPredictor class:

https://github.com/hwchase17/langchain/tree/master/langchain/llms

And here are the docs for using any custom model: https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html#example-using-a-custom-llm-model

And here's a ton of examples implementing various LLMs:

https://github.com/autratec/GPT4ALL_Llamaindex

https://github.com/autratec/dolly2.0_3b_HFembedding_Llamaindex

https://github.com/autratec/koala_hfembedding_llamaindex

You just need to make sure you set up the prompt helper / service context appropriately for the input size of each model.
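
For example, a sketch along those lines with langchain's LlamaCpp wrapper and a local embedding model (the model path, data path, and context sizes are placeholders to adjust):

from langchain.llms import LlamaCpp
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

from llama_index import (
    GPTSimpleVectorIndex,
    LLMPredictor,
    LangchainEmbedding,
    PromptHelper,
    ServiceContext,
    SimpleDirectoryReader,
)

# local llama.cpp model, so no data leaves the machine
llm = LlamaCpp(model_path="./models/ggml-alpaca-7b-q4.bin", n_ctx=2048)
llm_predictor = LLMPredictor(llm=llm)

# local embeddings so the vector index does not call OpenAI either
embed_model = LangchainEmbedding(HuggingFaceEmbeddings())

# match the prompt helper to the model's 2048-token context window
prompt_helper = PromptHelper(2048, 256, 20)

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,
    prompt_helper=prompt_helper,
)

documents = SimpleDirectoryReader("./data").load_data()
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
print(index.query("What is this document about?"))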

@iamadhee

iamadhee commented Apr 23, 2023

@logan-markewich I tried out your approach with llama_index and langchain, with a custom class that I built for OpenAI's GPT-3.5 model. But it seems that llama_index is not recognizing my CustomLLM as one of langchain's models; it is defaulting to its own GPT-3.5 model. What am I doing wrong here? Attaching the code and the logs. Thanks in advance.

from openAIComplete import OpenAI
from langchain.llms.base import LLM
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

OPENAI_API_KEY = 'API KEY'
yo =OpenAI(api_key=OPENAI_API_KEY,model='gpt-3.5-turbo')

class CustomLLM(LLM):
    model_name = 'OpenAI GPT-3'
            
    @property
    def _llm_type(self) -> str:
        return "custom"
    
    def _call(self, prompt: str,stop:str=None):
        if stop is not None:
            raise ValueError("stop kwargs are not permitted.")
        print(prompt)
        res = yo.run(prompt)
        return res 
    
    @property
    def _identifying_params(self):
        return {"name_of_model": self.model_name}
    
yo2 = CustomLLM()

from llama_index import LLMPredictor, ServiceContext, GPTListIndex, GPTSimpleVectorIndex, SimpleDirectoryReader, PromptHelper, LangchainEmbedding


def chatbot(directory_path, input_text):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    llm_predictor = LLMPredictor(llm=CustomLLM())

    service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper) # , embed_model=embed_model

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

    index.save_to_disk('index.json')

    index = GPTSimpleVectorIndex.load_from_disk('index.json')
    response = index.query(input_text, response_mode="compact",service_context=service_context)
    return response.response
    
print(chatbot('models/','Hi, what is this document about?'))
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total LLM token usage: 0 tokens
INFO:llama_index.token_counter.token_counter:> [build_index_from_nodes] Total embedding token usage: 2721 tokens
Traceback (most recent call last):
  File "/workspaces/docify/models/test.py", line 55, in <module>
    print(chatbot('models/','Hi, what is this document about?'))
  File "/workspaces/docify/models/test.py", line 49, in chatbot
    index = GPTSimpleVectorIndex.load_from_disk('index.json')
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 369, in load_from_disk
    return cls.load_from_string(file_contents, **kwargs)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 345, in load_from_string
    return cls.load_from_dict(result_dict, **kwargs)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 263, in load_from_dict
    return super().load_from_dict(result_dict, vector_store=vector_store, **kwargs)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 322, in load_from_dict
    return cls(index_struct=index_struct, docstore=docstore, **kwargs)
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/vector_store/vector_indices.py", line 69, in __init__
    super().__init__(
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/vector_store/base.py", line 54, in __init__
    super().__init__(
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/base.py", line 69, in __init__
    self._service_context = service_context or ServiceContext.from_defaults()
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/indices/service_context.py", line 69, in from_defaults
    llm_predictor = llm_predictor or LLMPredictor()
  File "/home/codespace/.python/current/lib/python3.10/site-packages/llama_index/llm_predictor/base.py", line 164, in __init__
    self._llm = llm or OpenAI(temperature=0, model_name="text-davinci-003")
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for OpenAI
__root__
  Did not find openai_api_key, please add an environment variable `OPENAI_API_KEY` which contains it, or pass  `openai_api_key` as a named parameter. (type=value_error)

Note: Sorry about the clumsy code, I'm testing things out

@iamadhee

iamadhee commented Apr 23, 2023

To give more context, this is openAIComplete.py:

from baseModel import Model
import openai
import tiktoken


class OpenAI(Model):
    def __init__(self,
                 api_key: str,
                 model: str,
                 api_wait: int = 60,
                 api_retry: int = 6,
                 temperature: float = .7):
        super().__init__(api_key, model, api_wait, api_retry)

        self.temperature = temperature
        self._verify_model()
        self.set_key(api_key)
        self.encoder = tiktoken.encoding_for_model(self.model)
        self.max_tokens = self.default_max_tokens(self.model)

    def supported_models(self):
        return {
            "text-davinci-003": "text-davinci-003 can do any language task with better quality, longer output, and consistent instruction-following than the curie, babbage, or ada models. Also supports inserting completions within text.",
            "text-curie-001": "text-curie-001 is very capable, faster and lower cost than Davinci.",
            "text-babbage-001": "text-babbage-001 is capable of straightforward tasks, very fast, and lower cost.",
            "text-ada-001": "text-ada-001 is capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost.",
            "gpt-4": "More capable than any GPT-3.5 model, able to do more complex tasks, and optimized for chat. Will be updated with our latest model iteration.",
            "gpt-3.5-turbo": "	Most capable GPT-3.5 model and optimized for chat at 1/10th the cost of text-davinci-003. Will be updated with our latest model iteration",
        }

    def _verify_model(self):
        """
        Raises a ValueError if the current OpenAI model is not supported.
        """
        if self.model not in self.supported_models():
            raise ValueError(f"Unsupported model: {self.model}")

    def set_key(self, api_key: str):
        self._openai = openai
        self._openai.api_key = api_key

    def get_description(self):
        return self.supported_models()[self.model]

    def get_endpoint(self):
        model = openai.Model.retrieve(self.model)
        return model["id"]

    def default_max_tokens(self, model_name: str):
        token_dict = {
            "text-davinci-003": 4000,
            "text-curie-001": 2048,
            "text-babbage-001": 2048,
            "text-ada-001": 2048,
            "gpt-4": 8192,
            "gpt-3.5-turbo": 4096,
        }
        return token_dict[model_name]

    def calculate_max_tokens(self, prompt: str) -> int:

        prompt = str(prompt)
        prompt_tokens = len(self.encoder.encode(prompt))
        max_tokens = self.default_max_tokens(self.model) - prompt_tokens

        print(prompt_tokens, max_tokens)
        return max_tokens

    def run(self, prompt:str):

        if self.model in ["gpt-3.5-turbo"]:
            prompt_template = [
                {"role": "system", "content": "you are a helpful assistant."}
            ]
            prompt_template.append({"role": "user", "content": prompt})
            max_tokens = self.calculate_max_tokens(prompt_template)
            response = self._openai.ChatCompletion.create(
                model=self.model,
                messages=prompt_template,
                max_tokens=max_tokens,
                temperature=self.temperature,
            )
            return response["choices"][0]["message"]["content"].strip(" \n")

        else:
            max_tokens = self.calculate_max_tokens(prompt)
            response = self._openai.Completion.create(
                model=self.model,
                prompt=prompt,
                max_tokens=max_tokens,
                temperature=self.temperature,
            )
            return response["choices"][0]["text"].strip("\n")

@iamadhee

Found the issue with mine. It seems that while instantiating another instance of GPTSimpleVectorIndex (loading it from disk), I wasn't passing the service_context parameter.

index.save_to_disk('index.json')

index = GPTSimpleVectorIndex.load_from_disk('index.json',service_context=service_context)

@scooter7

scooter7 commented Apr 28, 2023

Hi, I've developed a Streamlit app that uses llama-index with OpenAI. I'd like to avoid paying for OpenAI and instead leverage an open-source LLM that has no commercial restrictions, no token limits, and a hosted API. I've been looking at BLOOM - https://huggingface.co/bigscience/bloom - but don't know how to call the Hugging Face model in a similar manner to what I have in my current code.

Does anyone know how I would adapt that code to work with Bloom from HuggingFace?

Thanks!

import logging
import streamlit as st
from gpt_index import SimpleDirectoryReader, GPTListIndex, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain.chat_models import ChatOpenAI
import sys
from datetime import datetime
import os
from github import Github

if "OPENAI_API_KEY" not in st.secrets:
    st.error("Please set the OPENAI_API_KEY secret on the Streamlit dashboard.")
    sys.exit(1)

openai_api_key = st.secrets["OPENAI_API_KEY"]

logging.info(f"OPENAI_API_KEY: {openai_api_key}")

# Set up the GitHub API
g = Github(st.secrets["GITHUB_TOKEN"])
repo = g.get_repo("scooter7/CXBot")

def construct_index(directory_path):
    max_input_size = 4096
    num_outputs = 512
    max_chunk_overlap = 20
    chunk_size_limit = 600

    prompt_helper = PromptHelper(max_input_size, num_outputs, max_chunk_overlap, chunk_size_limit=chunk_size_limit)

    llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0.7, model_name="gpt-3.5-turbo", max_tokens=num_outputs))

    documents = SimpleDirectoryReader(directory_path).load_data()

    index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor, prompt_helper=prompt_helper)

    index.directory_path = directory_path

    index.save_to_disk('index.json')

    return index

@entrptaher

I see we can use https://github.com/lhenault/simpleAI to run a locally hosted OpenAI alternative, but I'm not sure if this can work with llama_index.

@scooter7

scooter7 commented May 1, 2023

Interesting and thanks for sharing that! I will ultimately need a hosting environment beyond my local machine. Luckily, I'm finding some providers that are quite a bit more affordable than some of the big names.

@logan-markewich
Collaborator

@entrptaher pretty much any LLM can work if you implement the CustomLLM class. Inside the class you could make API calls to some other hosted service or a local model.

https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html#example-using-a-custom-llm-model-advanced

@logan-markewich
Collaborator

OK, going to link these docs one last time. If you want to avoid OpenAI, you need to set up both an LLM and an embedding model in the service context.

To make things easier, I also recommend setting a global service context. If you use a langchain LLM, be sure to wrap it with the LangChainLLM class:

from llama_index.llms import LangChainLLM
from llama_index import ServiceContext, set_global_service_context

llm = LangChainLLM(<langchain llm class>)
embed_model = <setup embed model>

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
set_global_service_context(service_context)

https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-using-a-huggingface-llm
https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-using-a-custom-llm-model-advanced

https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/usage_pattern.html#embedding-model-integrations
https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/usage_pattern.html#custom-embedding-model
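
Concretely, filling in those placeholders might look like this (LlamaCpp and the MiniLM embedding model are just illustrative choices; any langchain LLM and embedding model should work):

from langchain.llms import LlamaCpp
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

from llama_index import LangchainEmbedding, ServiceContext, set_global_service_context
from llama_index.llms import LangChainLLM

# any langchain LLM works here; LlamaCpp keeps everything local
llm = LangChainLLM(LlamaCpp(model_path="./models/ggml-model-q4_0.bin", n_ctx=2048))

# local embedding model instead of the OpenAI default
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
)

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
set_global_service_context(service_context)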
