
Migrate to Llama.cpp for Offline Chat #680

Merged
merged 10 commits on Apr 2, 2024
6 changes: 3 additions & 3 deletions documentation/docs/features/chat.md
@@ -14,16 +14,16 @@ You can configure Khoj to chat with you about anything. When relevant, it'll use

### Setup (Self-Hosting)
#### Offline Chat
Offline chat stays completely private and works without internet using open-source models.
Offline chat stays completely private and can work without internet using open-source models.

> **System Requirements**:
> - Minimum 8 GB RAM. Recommend **16 GB VRAM**
> - Minimum **5 GB of Disk** available
> - A CPU supporting [AVX or AVX2 instructions](https://en.wikipedia.org/wiki/Advanced_Vector_Extensions) is required
> - A Mac M1+ or [Vulkan-supported GPU](https://vulkan.gpuinfo.org/) should significantly speed up chat response times
> - An Nvidia GPU, AMD GPU or a Mac M1+ machine would significantly speed up chat response times

1. Open your [Khoj offline settings](http://localhost:42110/server/admin/database/offlinechatprocessorconversationconfig/) and click *Enable* on the Offline Chat configuration.
2. Open your [Chat model options](http://localhost:42110/server/admin/database/chatmodeloptions/) and add a new option for the offline chat model you want to use. Make sure to use `Offline` as its type. We currently only support offline models that use the [Llama chat prompt](https://replicate.com/blog/how-to-prompt-llama#wrap-user-input-with-inst-inst-tags) format. We recommend using `mistral-7b-instruct-v0.1.Q4_0.gguf`.
2. Open your [Chat model options settings](http://localhost:42110/server/admin/database/chatmodeloptions/) and add any [GGUF chat model](https://huggingface.co/models?library=gguf) to use for offline chat. Make sure to use `Offline` as its type. For a balanced chat model that runs well on standard consumer hardware, we recommend [Hermes-2-Pro-Mistral-7B by NousResearch](https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF).
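
To sanity check that the recommended model downloads and loads on your machine before wiring it into Khoj, you can exercise llama-cpp-python directly. This is an optional check rather than part of the official setup; it assumes `llama-cpp-python >= 0.2.55` (for `Llama.from_pretrained`) and `huggingface-hub` are installed, and the quantization filename glob below is only an example.

```python
# Optional sanity check: download and load a GGUF chat model with llama-cpp-python.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="NousResearch/Hermes-2-Pro-Mistral-7B-GGUF",
    filename="*Q4_K_M.gguf",  # example glob; pick the quantization you actually want
    n_ctx=4096,               # context window to allocate
    n_gpu_layers=-1,          # offload all layers to GPU if one is available
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}]
)
print(response["choices"][0]["message"]["content"])
```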


:::tip[Note]
Expand Down
28 changes: 24 additions & 4 deletions documentation/docs/get-started/setup.mdx
@@ -101,24 +101,44 @@ sudo -u postgres createdb khoj --password

##### Local Server Setup
- *Make sure [python](https://realpython.com/installing-python/) and [pip](https://pip.pypa.io/en/stable/installation/) are installed on your machine*
- Check [llama-cpp-python setup](https://python.langchain.com/docs/integrations/llms/llamacpp#installation) if you hit any llama-cpp issues with the installation

Run the following command in your terminal to install the Khoj backend.

```mdx-code-block
<Tabs groupId="operating-systems">
<TabItem value="macos" label="MacOS">
```shell
# ARM/M1+ Machines
CMAKE_ARGS="-DLLAMA_METAL=on" python -m pip install khoj-assistant

# Intel Machines
python -m pip install khoj-assistant
```
</TabItem>
<TabItem value="win" label="Windows">
```shell
py -m pip install khoj-assistant
# 1. (Optional) To use NVIDIA (CUDA) GPU
$env:CMAKE_ARGS = "-DLLAMA_CUBLAS=on"
# 1. (Optional) To use AMD (ROCm) GPU
$env:CMAKE_ARGS = "-DLLAMA_HIPBLAS=on"
# 1. (Optional) To use Vulkan GPU
$env:CMAKE_ARGS = "-DLLAMA_VULKAN=on"

# 2. Install Khoj
py -m pip install khoj-assistant
```
</TabItem>
<TabItem value="unix" label="Linux">
```shell
python -m pip install khoj-assistant
# CPU
python -m pip install khoj-assistant
# NVIDIA (CUDA) GPU
CMAKE_ARGS="DLLAMA_CUBLAS=on" FORCE_CMAKE=1 python -m pip install khoj-assistant
# AMD (ROCm) GPU
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" FORCE_CMAKE=1 python -m pip install khoj-assistant
# Vulkan GPU
CMAKE_ARGS="-DLLAMA_VULKAN=on" FORCE_CMAKE=1 python -m pip install khoj-assistant
```
</TabItem>
</Tabs>
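
If installation completes but offline chat misbehaves, a quick way to confirm the llama-cpp-python backend is importable in the same environment is shown below. This is an optional troubleshooting aid, not a required step.

```python
# Optional: confirm llama-cpp-python is installed in the same environment as Khoj.
import llama_cpp

print("llama-cpp-python version:", llama_cpp.__version__)
```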
@@ -179,13 +199,13 @@ If you're using a custom domain, you must use an SSL certificate. You can use [L
1. Go to http://localhost:42110/server/admin and login with your admin credentials.
1. Go to [OpenAI settings](http://localhost:42110/server/admin/database/openaiprocessorconversationconfig/) in the server admin settings to add an OpenAI processor conversation config. This is where you set your API key. Alternatively, you can go to the [offline chat settings](http://localhost:42110/server/admin/database/offlinechatprocessorconversationconfig/) and simply create a new setting with `Enabled` set to `True`.
2. Go to the ChatModelOptions if you want to add additional models for chat.
- Set the `chat-model` field to a supported chat model[^1] of your choice. For example, you can specify `gpt-4-turbo-preview` if you're using OpenAI or `mistral-7b-instruct-v0.1.Q4_0.gguf` if you're using offline chat.
- Set the `chat-model` field to a supported chat model[^1] of your choice. For example, you can specify `gpt-4-turbo-preview` if you're using OpenAI or `NousResearch/Hermes-2-Pro-Mistral-7B-GGUF` if you're using offline chat.
- Make sure to set the `model-type` field to `OpenAI` or `Offline` respectively.
   - The `tokenizer` and `max-prompt-size` fields are optional. Set them only when using a non-standard model (i.e., not a Mistral, GPT or Llama 2 based model); a programmatic sketch of adding a chat model entry follows this list.
1. Select files and folders to index [using the desktop client](/get-started/setup#2-download-the-desktop-client). When you click 'Save', the files will be sent to your server for indexing.
- Select Notion workspaces and Github repositories to index using the web interface.
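
For reference, the same offline chat model entry can be created programmatically with the Django ORM instead of the admin panel. Treat this as an illustration rather than a supported interface: the import path and the assumption that Khoj's Django settings are already initialized in your Python process are not documented guarantees, though the field names match the `ChatModelOptions` model changed later in this PR.

```python
# Illustrative only: create an offline chat model entry via the Django ORM.
# Assumes Khoj's Django app and settings are already configured in this process.
from khoj.database.models import ChatModelOptions

ChatModelOptions.objects.create(
    chat_model="NousResearch/Hermes-2-Pro-Mistral-7B-GGUF",  # any GGUF chat model
    model_type=ChatModelOptions.ModelType.OFFLINE,
    max_prompt_size=None,  # optional; only set for non-standard models
    tokenizer=None,        # optional; only set for non-standard models
)
```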

[^1]: Khoj, by default, can use [OpenAI GPT3.5+ chat models](https://platform.openai.com/docs/models/overview) or [GPT4All chat models that follow Llama2 Prompt Template](https://github.com/nomic-ai/gpt4all/blob/main/gpt4all-chat/metadata/models2.json). See [this section](/miscellaneous/advanced#use-openai-compatible-llm-api-server-self-hosting) to use non-standard chat models.
[^1]: Khoj, by default, can use [OpenAI GPT3.5+ chat models](https://platform.openai.com/docs/models/overview) or [GGUF chat models](https://huggingface.co/models?library=gguf). See [this section](/miscellaneous/advanced#use-openai-compatible-llm-api-server-self-hosting) to use non-standard chat models.

:::tip[Note]
Using Safari on Mac? You might not be able to login to the admin panel. Try using Chrome or Firefox instead.
2 changes: 1 addition & 1 deletion documentation/docs/miscellaneous/credits.md
@@ -10,4 +10,4 @@ Many Open Source projects are used to power Khoj. Here's a few of them:
- Charles Cave for [OrgNode Parser](http://members.optusnet.com.au/~charles57/GTD/orgnode.html)
- [Org.js](https://mooz.github.io/org-js/) to render Org-mode results on the Web interface
- [Markdown-it](https://github.com/markdown-it/markdown-it) to render Markdown results on the Web interface
- [GPT4All](https://github.com/nomic-ai/gpt4all) to chat with local LLM
- [Llama.cpp](https://github.com/ggerganov/llama.cpp) to chat with local LLM
3 changes: 1 addition & 2 deletions pyproject.toml
@@ -62,8 +62,7 @@ dependencies = [
"pymupdf >= 1.23.5",
"django == 4.2.10",
"authlib == 1.2.1",
"gpt4all == 2.1.0; platform_system == 'Linux' and platform_machine == 'x86_64'",
"gpt4all == 2.1.0; platform_system == 'Windows' or platform_system == 'Darwin'",
"llama-cpp-python == 0.2.56",
"itsdangerous == 2.1.2",
"httpx == 0.25.0",
"pgvector == 0.2.4",
6 changes: 3 additions & 3 deletions src/khoj/database/adapters/__init__.py
@@ -43,7 +43,7 @@
from khoj.search_filter.file_filter import FileFilter
from khoj.search_filter.word_filter import WordFilter
from khoj.utils import state
from khoj.utils.config import GPT4AllProcessorModel
from khoj.utils.config import OfflineChatProcessorModel
from khoj.utils.helpers import generate_random_name, is_none_or_empty


@@ -705,8 +705,8 @@ def get_valid_conversation_config(user: KhojUser, conversation: Conversation):
    conversation_config = ConversationAdapters.get_default_conversation_config()

    if offline_chat_config and offline_chat_config.enabled and conversation_config.model_type == "offline":
        if state.gpt4all_processor_config is None or state.gpt4all_processor_config.loaded_model is None:
            state.gpt4all_processor_config = GPT4AllProcessorModel(conversation_config.chat_model)
        if state.offline_chat_processor_config is None or state.offline_chat_processor_config.loaded_model is None:
            state.offline_chat_processor_config = OfflineChatProcessorModel(conversation_config.chat_model)

    return conversation_config
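
The hunk above only shows the call site switching from `GPT4AllProcessorModel` to `OfflineChatProcessorModel`. For readers unfamiliar with the pattern, here is a minimal sketch of what such a lazy-loading wrapper plausibly does with llama-cpp-python; it is an illustration under stated assumptions, not Khoj's actual implementation.

```python
# Illustrative sketch, not Khoj's actual implementation: lazily download and
# load a GGUF chat model with llama-cpp-python the first time offline chat runs.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama


class OfflineChatProcessorModel:
    def __init__(self, chat_model: str = "NousResearch/Hermes-2-Pro-Mistral-7B-GGUF"):
        self.chat_model = chat_model
        # The filename is an example; real code would discover the available
        # quantizations in the repo rather than hard-coding one.
        model_path = hf_hub_download(
            repo_id=chat_model, filename="Hermes-2-Pro-Mistral-7B.Q4_K_M.gguf"
        )
        self.loaded_model = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1)
```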

2 changes: 1 addition & 1 deletion src/khoj/database/models/__init__.py
@@ -80,7 +80,7 @@ class ModelType(models.TextChoices):

    max_prompt_size = models.IntegerField(default=None, null=True, blank=True)
    tokenizer = models.CharField(max_length=200, default=None, null=True, blank=True)
    chat_model = models.CharField(max_length=200, default="mistral-7b-instruct-v0.1.Q4_0.gguf")
    chat_model = models.CharField(max_length=200, default="NousResearch/Hermes-2-Pro-Mistral-7B-GGUF")
    model_type = models.CharField(max_length=200, choices=ModelType.choices, default=ModelType.OFFLINE)


71 changes: 71 additions & 0 deletions src/khoj/migrations/migrate_offline_chat_default_model_2.py
@@ -0,0 +1,71 @@
"""
Current format of khoj.yml
---
app:
  ...
content-type:
  ...
processor:
  conversation:
    offline-chat:
      enable-offline-chat: false
      chat-model: mistral-7b-instruct-v0.1.Q4_0.gguf
  ...
search-type:
  ...

New format of khoj.yml
---
app:
  ...
content-type:
  ...
processor:
  conversation:
    offline-chat:
      enable-offline-chat: false
      chat-model: NousResearch/Hermes-2-Pro-Mistral-7B-GGUF
  ...
search-type:
  ...
"""
import logging

from packaging import version

from khoj.utils.yaml import load_config_from_file, save_config_to_file

logger = logging.getLogger(__name__)


def migrate_offline_chat_default_model(args):
    schema_version = "1.7.0"
    raw_config = load_config_from_file(args.config_file)
    previous_version = raw_config.get("version")

    if "processor" not in raw_config:
        return args
    if raw_config["processor"] is None:
        return args
    if "conversation" not in raw_config["processor"]:
        return args
    if "offline-chat" not in raw_config["processor"]["conversation"]:
        return args
    if "chat-model" not in raw_config["processor"]["conversation"]["offline-chat"]:
        return args

    if previous_version is None or version.parse(previous_version) < version.parse(schema_version):
        logger.info(
            f"Upgrading config schema to {schema_version} from {previous_version} to change default (offline) chat model to mistral GGUF"
        )
        raw_config["version"] = schema_version

        # Update offline chat model to use Nous Research's Hermes-2-Pro GGUF in path format suitable for llama-cpp
        offline_chat_model = raw_config["processor"]["conversation"]["offline-chat"]["chat-model"]
        if offline_chat_model == "mistral-7b-instruct-v0.1.Q4_0.gguf":
            raw_config["processor"]["conversation"]["offline-chat"][
                "chat-model"
            ] = "NousResearch/Hermes-2-Pro-Mistral-7B-GGUF"

        save_config_to_file(raw_config, args.config_file)

    return args
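
For context, config migrations like this receive the parsed CLI args that carry the path to `khoj.yml`. A hypothetical standalone invocation, assuming the default config location and a simple namespace standing in for Khoj's real args object, could look like the sketch below; Khoj normally runs its migrations automatically on startup.

```python
# Hypothetical standalone run of the migration; the path and args object are illustrative.
from pathlib import Path
from types import SimpleNamespace

from khoj.migrations.migrate_offline_chat_default_model_2 import (
    migrate_offline_chat_default_model,
)

args = SimpleNamespace(config_file=Path("~/.khoj/khoj.yml").expanduser())
migrate_offline_chat_default_model(args)
```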