
Migrate to Llama.cpp for Offline Chat #680

Merged
merged 10 commits into master from migrate-to-llama-cpp-for-offline-chat on Apr 2, 2024

Conversation

@debanjum (Collaborator) commented on Mar 20, 2024

Benefits

  • Support all GGUF format chat models
  • Support more GPUs like AMD, Nvidia, Mac and Vulkan (previously just Vulkan, Mac)
  • Support more capabilities like a larger context window, schema enforcement, speculative decoding, etc.

Changes

Major

  • 978ebfe Use llama.cpp (via llama-cpp-python) for offline chat models
    • Support a larger context window
    • Automatically apply the appropriate chat template, so offline chat models that don't use the llama2 prompt format are now supported (a minimal usage sketch follows this list)
    • Shiny new default offline chat model
  • f8ba541 Enable extract queries actor to improve notes search with offline chat
  • aafb878 Update documentation to use llama.cpp for offline chat in Khoj
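For reference, a minimal sketch of the llama-cpp-python flow this enables; the repo id, filename glob, and parameter values below are illustrative, not Khoj's actual defaults:

```python
# Minimal sketch (illustrative values, not Khoj's defaults).
from llama_cpp import Llama

# Download a GGUF model from HuggingFace, or reuse it from the local cache.
llm = Llama.from_pretrained(
    repo_id="NousResearch/Hermes-2-Pro-Mistral-7B-GGUF",  # example repo
    filename="*Q4_K_M.gguf",                              # glob over quantization variants
    n_ctx=4096,                                           # larger context window than before
)

# create_chat_completion() formats messages with the chat template stored in the
# GGUF metadata, so models that don't use the llama2 prompt format work too.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful personal assistant."},
        {"role": "user", "content": "Summarize my notes on llama.cpp."},
    ],
    max_tokens=256,
)
print(response["choices"][0]["message"]["content"])
```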

Minor

@debanjum force-pushed the migrate-to-llama-cpp-for-offline-chat branch 2 times, most recently from f8ba541 to 5f8e494, on March 20, 2024 23:05
@debanjum requested a review from sabaimran on March 20, 2024 23:27
@sabaimran (Collaborator) left a comment:
Exciting getting llama.cpp running & having extract_questions working with the offline models.

Review threads:
  • src/khoj/database/models/__init__.py (outdated, resolved)
  • pyproject.toml (resolved)
  • documentation/docs/get-started/setup.mdx (resolved)
  • src/khoj/processor/conversation/offline/utils.py (outdated, resolved)
  • documentation/docs/features/chat.md (outdated, resolved)
@debanjum force-pushed the migrate-to-llama-cpp-for-offline-chat branch from 5f8e494 to b68d88a on March 24, 2024 11:18
- Benefits of moving to llama-cpp-python from gpt4all:
  - Support for all GGUF format chat models
  - Support for AMD, Nvidia, Mac, Vulkan GPU machines (instead of just Vulkan, Mac)
  - Support for models with more capabilities like tools, schema
    enforcement, speculative decoding, image generation, etc.
- Upgrade default chat model, prompt size, tokenizer for the newly
  supported chat models

- Load the offline chat model, when present on disk, without requiring internet
  - Load the model onto GPU if not disabled and the device has a GPU
  - Fall back to loading the model onto CPU if loading onto GPU fails
  - Create a helper function to check for and load the model from disk when the
    model glob is present on disk (see the sketch after this list).

    `Llama.from_pretrained` needs internet to get repo info from
    HuggingFace. This isn't required if the model is already downloaded.

    Didn't find any existing HF or llama.cpp method that looks for the model
    glob on disk without internet.
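A minimal sketch of the helper described above, assuming the standard HuggingFace cache layout under ~/.cache/huggingface/hub; the function name and defaults are illustrative, not the actual Khoj implementation:

```python
import glob
import os

from llama_cpp import Llama


def load_offline_chat_model(repo_id: str, filename_glob: str, use_gpu: bool = True) -> Llama:
    """Load a GGUF chat model from the local HF cache if present, else download it."""
    # HuggingFace hub caches repos under ~/.cache/huggingface/hub/models--<org>--<name>/
    cache_dir = os.path.expanduser("~/.cache/huggingface/hub")
    repo_dir = os.path.join(cache_dir, "models--" + repo_id.replace("/", "--"))
    matches = glob.glob(os.path.join(repo_dir, "**", filename_glob), recursive=True)

    kwargs = {"n_gpu_layers": -1 if use_gpu else 0, "n_ctx": 4096}
    try:
        if matches:
            # Model already on disk: load it directly, no internet required
            return Llama(model_path=matches[0], **kwargs)
        # Model not on disk: let llama-cpp-python fetch it from HuggingFace
        return Llama.from_pretrained(repo_id=repo_id, filename=filename_glob, **kwargs)
    except Exception:
        if not use_gpu:
            raise
        # Loading onto the GPU failed (e.g. out of VRAM): retry on CPU
        return load_offline_chat_model(repo_id, filename_glob, use_gpu=False)
```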
- How to pip install khoj to run offline chat on GPU
  After the migration to llama-cpp-python more GPU types are supported, but
  they require a build step, so document how
- New default offline chat model
- Where to get supported chat models from on HuggingFace
Previously we were skipping the extract questions step for offline
chat, as the default offline chat model wasn't good enough at producing
proper JSON to justify the time it took to extract questions.

The new default offline chat model gives JSON much more reliably, and
with date filters, so the extract questions step becomes worth its
impact on latency.
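A minimal sketch of what such an extract-questions actor can look like with the new model; the prompt, JSON shape, and function name are illustrative, not Khoj's actual implementation:

```python
import json

from llama_cpp import Llama


def extract_questions(llm: Llama, user_message: str) -> list[str]:
    """Ask the offline chat model for a JSON list of search queries for a user message."""
    response = llm.create_chat_completion(
        messages=[
            {
                "role": "system",
                "content": 'Convert the user message into a JSON object of the form '
                           '{"queries": ["..."]} containing search queries, adding date '
                           'filters like dt>="2024-03-01" where relevant.',
            },
            {"role": "user", "content": user_message},
        ],
        # Constrain output to valid JSON via llama.cpp's grammar-based enforcement
        response_format={"type": "json_object"},
        temperature=0,
        max_tokens=200,
    )
    content = response["choices"][0]["message"]["content"]
    try:
        return json.loads(content).get("queries", [])
    except json.JSONDecodeError:
        # Fall back to searching with the raw user message on malformed JSON
        return [user_message]
```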
@debanjum force-pushed the migrate-to-llama-cpp-for-offline-chat branch from b68d88a to 4912c0e on March 27, 2024 05:03
@debanjum merged commit 3c3e48b into master on Apr 2, 2024
9 checks passed
@debanjum deleted the migrate-to-llama-cpp-for-offline-chat branch on April 2, 2024 16:02