CPU

Google Colab

A Google Colab version of a 7B LLaMa CPU model is at:

h2oGPT CPU

A local copy of that CPU Google Colab is h2oGPT_CPU.ipynb.


Local

CPU support is obtained by installing the base requirements plus the optional requirements files below. This does not preclude GPU support; it just adds CPU support:

  • Install base, LangChain, GPT4All, and Python LLaMa dependencies:
git clone https://github.com/h2oai/h2ogpt.git
cd h2ogpt
for fil in requirements.txt reqs_optional/requirements_optional_langchain.txt reqs_optional/requirements_optional_gpt4all.txt reqs_optional/requirements_optional_langchain.gpllike.txt reqs_optional/requirements_optional_langchain.urls.txt ; do pip install -r $fil --extra-index-url https://download.pytorch.org/whl/cpu ; done
# Optional: support docx, pptx, ArXiv, etc.
sudo apt-get install -y libmagic-dev poppler-utils tesseract-ocr libreoffice
# Optional: for supporting unstructured package
python -m nltk.downloader all

See the GPT4All documentation for installation details if you encounter any issues.
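As a quick sanity check after the installs (a minimal sketch; the torch, gpt4all, and langchain package names are assumed from the requirements files above), one can confirm the CPU-only environment imports cleanly:

import torch        # CPU wheel pulled from the extra index above
import gpt4all      # from requirements_optional_gpt4all.txt
import langchain    # from requirements_optional_langchain.txt

# The CPU-only torch build should report that CUDA is unavailable.
print("torch", torch.__version__, "CUDA available:", torch.cuda.is_available())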

  • Change .env_gpt4all model name if desired.
model_path_llama=WizardLM-7B-uncensored.ggmlv3.q8_0.bin
model_path_gptj=ggml-gpt4all-j-v1.3-groovy.bin
model_name_gpt4all_llama=ggml-wizardLM-7B.q4_2.bin

For gptj and gpt4all_llama, you can choose a different model than our default choice by going to the GPT4All Model Explorer and picking a GPT4All-J compatible model. One does not need to download it manually; the gpt4all package will download the model at runtime and put it into .cache, like Hugging Face would. However, the gptj model often gives no output, even outside h2oGPT.
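For reference, this is roughly what that runtime download looks like when using the gpt4all Python package directly (a minimal sketch; the model file name comes from .env_gpt4all above, and the exact generate() arguments may differ between gpt4all versions):

from gpt4all import GPT4All

# First use downloads the model into the local cache (e.g. ~/.cache/gpt4all)
# if it is not already present, similar to how Hugging Face caches models.
model = GPT4All("ggml-gpt4all-j-v1.3-groovy.bin")
print(model.generate("What is the capital of France?", max_tokens=32))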

Because of that, for chatting, a better instruct fine-tuned LLaMa-based model for llama.cpp can be downloaded from TheBloke, for example 13B WizardLM Quantized or 7B WizardLM Quantized. TheBloke offers a variety of model types, quantization bit depths, and memory footprints; choose what best fits your system's specs. However, be aware that LLaMa-based models are not viable for commercial use.

For the 7B case, download WizardLM-7B-uncensored.ggmlv3.q8_0.bin into a local path:

wget https://huggingface.co/TheBloke/WizardLM-7B-uncensored-GGML/resolve/main/WizardLM-7B-uncensored.ggmlv3.q8_0.bin

Then set model_path_llama in .env_gpt4all to that file name (this is currently the default).
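To check that the downloaded file loads outside h2oGPT, here is a minimal sketch using llama-cpp-python (installed by the optional GPT4All requirements above; ggmlv3 files require a llama-cpp-python release that still supports the GGML format):

from llama_cpp import Llama

# Load the quantized 7B model on CPU; a small context keeps RAM usage down.
llm = Llama(model_path="WizardLM-7B-uncensored.ggmlv3.q8_0.bin", n_ctx=1024)
out = llm("Q: Name the planets in the solar system. A:", max_tokens=64)
print(out["choices"][0]["text"])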

  • Run generate.py

For LangChain support using documents in user_path folder, run h2oGPT like:

python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None --langchain_mode='UserData' --user_path=user_path

See the LangChain Readme for more details. To run without LangChain document support (the LangChain package is still used as the model wrapper), run:

python generate.py --base_model='llama' --prompt_type=wizard2 --score_model=None

When using llama.cpp based CPU models, for computers with low system RAM or slow CPUs, we recommend adding to .env_gpt4all:

use_mlock=False
n_ctx=1024

where use_mlock=True is the default (to avoid slowness) and n_ctx=2048 is the default (for large-context handling). For computers with plenty of system RAM, we recommend adding to .env_gpt4all:

n_batch=1024

for faster handling. On some systems this has no strong effect, but on others it may increase speed quite a bit.
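These .env_gpt4all keys correspond to llama.cpp model-loading options; a minimal sketch of what they control, using llama-cpp-python directly (an illustration of the parameters only, not of how h2oGPT itself reads the file; the low-RAM and high-RAM settings are combined here just to show the names):

from llama_cpp import Llama

llm = Llama(
    model_path="WizardLM-7B-uncensored.ggmlv3.q8_0.bin",
    use_mlock=False,  # low-RAM systems: do not lock model pages in memory (h2oGPT default: True)
    n_ctx=1024,       # low-RAM systems: smaller context window (h2oGPT default: 2048)
    n_batch=1024,     # systems with plenty of RAM: larger batch for faster prompt handling
)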

Also, for slow and low-memory systems, we recommend using a smaller embedding model by passing to generate.py:

python generate.py ... --hf_embedding_model=sentence-transformers/all-MiniLM-L6-v2

where ... means any other options one should add, like --base_model etc. This simpler embedding model is about half the size of the default instruct-large model, so it uses less disk, CPU memory, and GPU memory (if using GPUs).
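As a rough illustration of the difference, the smaller model's embedding width can be checked with the sentence-transformers package (a minimal sketch; all-MiniLM-L6-v2 produces 384-dimensional embeddings, smaller than those of the default instruct-large model):

from sentence_transformers import SentenceTransformer

# Downloads the small embedding model and reports its embedding dimension.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(model.get_sentence_embedding_dimension())  # 384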

See also Low Memory for more information about low-memory recommendations.