How to use llama index with alpaca locally #928
Comments
@1Mark you just need to replace the huggingface stuff with your code to load/run alpaca. Basically, you need to code the model loading, putting text through the model, and returning the newly generated outputs. It's going to be different for every model, but it's not too bad 😄 |
Thank you. Do you have any examples? |
@1Mark I personally haven't used llama or alpaca. How are you loading the model and generating text right now? Here's a very rough example with some fake functions to kind of show what I mean:
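(For reference, a minimal sketch of that idea, with placeholder load_alpaca_model / run_alpaca functions standing in for whatever loading and inference code you actually use; the class and import names follow the custom-LLM docs of the time and may differ across llama_index/langchain versions.)

from langchain.llms.base import LLM
from llama_index import LLMPredictor

# placeholder ("fake") functions -- replace with your real alpaca loading / generation code
def load_alpaca_model():
    ...

def run_alpaca(model, prompt: str) -> str:
    ...

model = load_alpaca_model()

class AlpacaLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop=None) -> str:
        # push the prompt through the local model and return only the newly generated text
        return run_alpaca(model, prompt)

    @property
    def _identifying_params(self):
        return {"name_of_model": "alpaca"}

llm_predictor = LLMPredictor(llm=AlpacaLLM())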
|
Hi @1Mark. When you use something like in the link above, you download the model from huggingface but the inference (the call to the model) happens on your local machine. Your data does not go to huggingface. You could even verify this by loading a very large model: you will probably run out of VRAM, or RAM if on CPU. |
If someone's able to get alpaca or llama working with llamaindex lmk! would be a cool demo to show :) |
tloen/alpaca-lora-7b doesn't seem to have its own inference api https://huggingface.co/tloen/alpaca-lora-7b#:~:text=Unable%20to%20determine%20this%20model%E2%80%99s%20pipeline%20type.%20Check%20the%20docs%20%20. |
This issue here seems quite relevant tloen/alpaca-lora#45 |
@1Mark the code in that repo could easily be adapted to work with llama index. (I.e. |
Something along these lines works:

from transformers import LlamaTokenizer, LlamaForCausalLM, pipeline
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline

tokenizer = LlamaTokenizer.from_pretrained("chavinlo/alpaca-native")
base_model = LlamaForCausalLM.from_pretrained(
    "chavinlo/alpaca-native",
    load_in_8bit=True,
    device_map='auto',
)

pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=256,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.2
)

local_llm = HuggingFacePipeline(pipeline=pipe)

# the original snippet assumes a prompt template defined elsewhere, e.g.:
prompt = PromptTemplate(input_variables=["query"], template="{query}")
llm_chain = LLMChain(prompt=prompt, llm=local_llm) |
@knoopx nice! So if that's wrapped into the CustomLLM class from above and passed as an LLMPredictor LLM, the integration should work! How well it works is up to the model though lol |
Can I combine your code in this way? |
@Tavish77 not quite. You'll still need to wrap it in that custom LLM class from the example above. Then you instantiate that class and pass it in like you did there |
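(For illustration, a rough, untested sketch of that wrapping, reusing the pipe object from the snippet above; the hypothetical AlpacaLLM name and the 0.5.x-era ServiceContext/GPTSimpleVectorIndex calls are assumptions, so adjust to your installed versions.)

from langchain.llms.base import LLM
from llama_index import LLMPredictor, ServiceContext, GPTSimpleVectorIndex, SimpleDirectoryReader

class AlpacaLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop=None) -> str:
        # run the prompt through the transformers pipeline built earlier
        output = pipe(prompt)[0]["generated_text"]
        # return only the newly generated text, not the echoed prompt
        return output[len(prompt):]

    @property
    def _identifying_params(self):
        return {"name_of_model": "chavinlo/alpaca-native"}

llm_predictor = LLMPredictor(llm=AlpacaLLM())
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)

documents = SimpleDirectoryReader("data").load_data()
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)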
@logan-markewich I'm trying to combine the examples you posted above. What do you return as the model from |
Hey, I'm loading a
Does anybody know what's going on? Thanks! EDIT for adding more context: If I use this model
|
The code in the file is as far as I got working with llama_index. Does someone know what I'm doing wrong? The exception happens during the pipeline command:
|
I got it to work here -> https://github.com/donflopez/alpaca-lora-llama-index/blob/main/generate.py It is not perfect, but works... |
@donflopez In order to get your code running, I had to install
Did you encounter this at all? (and how did you fix it?) |
@donflopez On what hardware did you run the model like this? My RTX 4090 hits its limit, sadly. |
@h1f0x I could get @donflopez's repo to work, but I always got completely wrong answers or sometimes nothing (roughly the same as I now get with this version xD). With it I was able to get further, but still with no usable response. The model that should have "read" the documents (the Llama document and the PDF from the repo) does not give any useful answer anymore. This was with base_model = circulus/alpaca-7b and the LoRA weights circulus/alpaca-lora-7b. I did try other models and combinations, but I did not get any better results :( Question: What do you think of Facebook's LLaMA? This shows that something does work, or at least doesn't break? Question: What are alpacas? And how are they different from llamas? Code: |
@h1f0x I'm running on a 4090 too; yes, multiple executions fail, and you also cannot go beyond 1 beam. I'm trying to figure out why this happens. When querying the raw model this does not happen, so it probably has something to do with llama_index + the pipeline setup. @devinSpitz, I also have weird results. Please note that in my code I have a |
I'm getting this with
|
@devinSpitz I got this output tweaking your script to make it work with the index; llama still doesn't know when to stop. Using a stop sequence improved it, here is the latest output: @h1f0x looks like the OOM issue doesn't happen on the script, so it could be gradio that copies the resources when making a request? I have no idea how gradio works tbh, but if I move things out of gradio, there's no OOM. |
I have been trying to get this to work as well, but keep running into issues with sentencepiece: Anyone else having this or any suggestions? Thanks! |
Just as inference with the OpenAI APIs doesn't happen locally, is there any way to use HTTP requests to send the prompts to a server exposing any LLM, like Alpaca, over HTTP? I feel like it would be easier if we could decouple the LLM. |
Thank you, I have solved my problem |
@donflopez Many thanks for your feedback! I got it working with CPU only later that evening, but I needed to change the page-file management in Windows itself too to get it working. I hope I can try some new settings soon. Gradio is a mystery to me as well :D At least so far.. looking into that deeper as well. If I find anything I'll let you know! @devinSpitz At least you get some output; I was not able to produce that haha, but I guess that's because of some strange behavior when running on CPU. :) |
@devinSpitz |
I have already resolved it |
If you have an issue with a 4090, try installing the new driver 525.105.17: |
@Tavish77 @masknetgoal634 Thanks both of you, yes I'm using a 4090 so I will update the driver and try it again :D @donflopez thanks as well, you are right about the stop sequence: "-" is a little bit better but still not good :( @h1f0x Yes that's right xD But I still want to get it working :D |
@masknetgoal634 I'm already on a newer driver xD @Tavish77 how did you solve it? |
I made this work in a Colab notebook with LlamaIndex and the GPT4All model. I copied this from my local Jupyter, so be aware that some headings are not code, like "Load GPT4ALL-LORA Model". Hope this helps. I am now trying to exchange the GPT4ALL-LoRA for a 4-bit version, but I am somehow stuck. I only have a 6GB GPU. |
I replaced it with a cloud GPU server |
As far as I know, the only fix for the 4090 is in 525.105.17 |
Anybody make progress on this? Is it possible to use the CPU optimized (alpaca.cpp, etc) versions of Llama for creating embeddings or is a cloud service the only option here? |
@ddb21 you should be able to use llama cpp (or any LLM that langchain has implemented) by wrapping the LLM with the LLMPredictor class: https://github.com/hwchase17/langchain/tree/master/langchain/llms
Here's the docs for using any custom model: https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html#example-using-a-custom-llm-model
And here's a ton of examples implementing random LLMs:
https://github.com/autratec/GPT4ALL_Llamaindex
https://github.com/autratec/dolly2.0_3b_HFembedding_Llamaindex
https://github.com/autratec/koala_hfembedding_llamaindex
You just need to make sure you set up the prompt helper/service context appropriately for the input size of each model (see the sketch below). |
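(A rough sketch of that wiring for a llama.cpp model, assuming langchain's LlamaCpp wrapper, a local ggml model file at ./models/ggml-alpaca-7b-q4.bin, and the 0.5/0.6-era PromptHelper arguments; all of those are assumptions, not something confirmed in this thread.)

from langchain.llms import LlamaCpp
from llama_index import LLMPredictor, PromptHelper, ServiceContext

# hypothetical path to a locally downloaded ggml/alpaca.cpp-style model file
llm = LlamaCpp(model_path="./models/ggml-alpaca-7b-q4.bin", n_ctx=2048)
llm_predictor = LLMPredictor(llm=llm)

# size these to the model's real context window (2048 tokens for llama/alpaca)
prompt_helper = PromptHelper(max_input_size=2048, num_output=256, max_chunk_overlap=20)

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    prompt_helper=prompt_helper,
)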
@logan-markewich I tried out your approach with llama_index and langchain, with a custom class that I built for OpenAI's GPT3.5 model. But it seems that llama_index is not recognizing my CustomLLM as one of langchain's models; it is defaulting to its own GPT3.5 model. What am I doing wrong here? Attaching the code and the logs. Thanks in advance.
Note: Sorry about the clumsy code, I'm testing things out |
To set more context, this is
|
Found the issue with mine. Seems while instantiating another instance of GPTSimpleVectorIndex, I wasn't passing the service_context parameter.
|
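(A short sketch of the fix described above, assuming documents and service_context are already set up as in the earlier snippets and a 0.5.x-era GPTSimpleVectorIndex API.)

from llama_index import GPTSimpleVectorIndex

# every place that builds or re-loads the index should get the same service_context,
# otherwise llama_index silently falls back to the default OpenAI models
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)
index.save_to_disk("index.json")

# ...including when a saved index is loaded again later
index = GPTSimpleVectorIndex.load_from_disk("index.json", service_context=service_context)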
Hi, I've developed a streamlit app that uses llama-index with openai. I'd like not to pay for openai and be able to leverage an open source llm that has no commercial restrictions, no token limits, and a hosted api. I've been looking at bloom - https://huggingface.co/bigscience/bloom - but don't know how to call the huggingface model in a similar manner to what I have in my current code. Does anyone know how I would adapt that code to work with Bloom from HuggingFace? Thanks!

import logging

if "OPENAI_API_KEY" not in st.secrets:
    ...

openai_api_key = st.secrets["OPENAI_API_KEY"]
logging.info(f"OPENAI_API_KEY: {openai_api_key}")

# Set up the GitHub API
g = Github(st.secrets["GITHUB_TOKEN"])

def construct_index(directory_path):
|
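(Not from the thread, but one possible direction: load the model through a transformers pipeline and wrap it exactly like the alpaca example further up. The full 176B bigscience/bloom will not fit on a local machine, so the smaller bigscience/bloom-560m checkpoint is used here purely for illustration.)

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain.llms import HuggingFacePipeline
from llama_index import LLMPredictor, ServiceContext

model_name = "bigscience/bloom-560m"  # illustrative small checkpoint; swap in whatever you can host
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
local_llm = HuggingFacePipeline(pipeline=pipe)

llm_predictor = LLMPredictor(llm=local_llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor)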
I see we can use https://github.com/lhenault/simpleAI to run a locally hosted openai alternative, but not sure if this can work with llama_index. |
Interesting and thanks for sharing that! I will ultimately need a hosting environment beyond my local machine. Luckily, I'm finding some providers that are quite a bit more affordable than some of the big names. |
@entrptaher pretty much any LLM can work if you implement the CustomLLM class. Inside the class you could make API calls to some other hosted service or a local model |
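(A rough sketch of that idea: a custom LLM class whose _call just POSTs the prompt to a self-hosted endpoint. The URL and the {"prompt": ...} / {"text": ...} request and response shapes are made up here -- adapt them to whatever API your server actually exposes.)

import requests
from langchain.llms.base import LLM
from llama_index import LLMPredictor

class RemoteLLM(LLM):
    # hypothetical endpoint of your own model server (Alpaca, llama.cpp, etc.)
    endpoint: str = "http://localhost:8000/generate"

    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop=None) -> str:
        # forward the prompt to the hosted model and return its completion
        response = requests.post(self.endpoint, json={"prompt": prompt}, timeout=120)
        response.raise_for_status()
        return response.json()["text"]

    @property
    def _identifying_params(self):
        return {"endpoint": self.endpoint}

llm_predictor = LLMPredictor(llm=RemoteLLM())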
Ok, going to link these docs one last time. If you want to avoid openai, you need to set up both an LLM and an embedding model in the service context. To make things easier, I also recommend setting a global service context. If you use a langchain LLM, be sure to wrap it with the LangChainLLM class:

from llama_index.llms import LangChainLLM
from llama_index import ServiceContext, set_global_service_context

llm = LangChainLLM(<langchain llm class>)
embed_model = <setup embed model>

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
set_global_service_context(service_context)

https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-using-a-huggingface-llm
https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/embeddings/usage_pattern.html#embedding-model-integrations |
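(One way to fill in the <setup embed model> placeholder with a local model, assuming the langchain/sentence-transformers integration is installed; the MiniLM checkpoint is just an example.)

from llama_index import LangchainEmbedding
from langchain.embeddings import HuggingFaceEmbeddings

# small local embedding model so neither documents nor queries leave your machine
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
)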
I want to use llamaindex but I don't want any data of mine to be transferred to any servers. I want it all to happen locally or within my own EC2 instance. I have seen https://github.com/jerryjliu/llama_index/blob/046183303da4161ee027026becf25fb48b67a3d2/docs/how_to/custom_llms.md#example-using-a-custom-llm-model but it calls hugging face.
My plan was to use https://github.com/cocktailpeanut/dalai with the alpaca model then somehow use llamaindex to input my dataset. Any examples or pointers for this?