Offline chat: Quality and Reliability Improvements #393

sabaimran · 2023-08-01T03:58:41Z

Incoming

Major

Fix Prompt Size Exceeded Issue

Fix issues related to prompt size, Closes Prompt context window exceeded: The prompt is <n > 2048> tokens and the context window is 2048 #386. Use the correct tokenizer to calculate whether the input needs to be truncated or not.

Improve Llama 2 Model Download

Use the correct download link for LlamaV2 -- should have been using the small model, but was using the medium
Add better downloading logic to retry download if it failed, Closes Failure to download model llama-2 (Invalid model file) #379

Fix Segmentation Fault due to Race

Add a lock around generating chat responses from the offline model to avoid segmentation faults. Closes Segmentation Fault in Offline Chat Functionality #367.
Add a loading symbol to the web chat UI when the model is thinking. Closes Improve loading message in chat window #392

Improve Chat Response Latency

Improve performance of offline chat by increasing batch size (via n_batch) to automatically engage more cores/GPU, using smaller model and fixing prompt vs response token generation numbers. Closes Explore performance improvements for offline chat model #363

Fix Fake Dialogue Continuation

Fix formatting of user query with offline chat, this was contributing to Llama 2 generates fake dialogue continuations #398
Stop Llama 2 from Creating Fake Dialogue Continuations. Closes Llama 2 generates fake dialogue continuations #398

Minor

Improve default message for Chat window on web when it's not configured. Include hint to use offline chat.
Add null check in perform_chat_checks method
Add offline chat director unit tests

Performance Analysis (Time to First Token)

	v0.10.0	this branch
Query 1	52s	28s
Query 2	33s	42s
Query 3	67s	38s

Previously the system message was getting dropped when the context size with chat history would be more than the max prompt size supported by the cat model Now only the previous chat messages are dropped or the current message is truncated but the system message is kept to provide guidance to the chat model

- Fix download url -- was mapping to q3_K_M, but fixed to use q4_K_S - Use a proper Llama Tokenizer for counting tokens for truncation with Llama - Add additional null checks when running

…found

…e (and implicitly engage GPU)

…etting bombarded and stealing a bunch of compute resources - This also solves #367

debanjum

PR looks great with all the bug fixes and quality improvements based on user feedback! 🚀

tests/test_conversation_utils.py

tests/conftest.py

src/khoj/processor/conversation/gpt4all/utils.py

src/khoj/processor/conversation/gpt4all/chat_model.py

src/khoj/processor/conversation/gpt4all/utils.py

src/khoj/migrations/migrate_offline_model.py

src/khoj/processor/conversation/gpt4all/utils.py

tests/test_gpt4all_chat_actors.py

- Use same batch_size in extract question actor as the chat actor - Log final location the chat model is to be stored in, instead of it's temp filename while it is being downloaded

It would previously some times start generating fake dialogue with it's internal prompt patterns of <s>[INST] in responses. This is a jarring experience. Stop generation response when hit <s> Resolves #398

Create regression text to ensure it does not throw the prompt size exceeded context window error

…aming

- Only make them update config when it's run conditions are satisfies - Use static schema version to simplify reasoning about run conditions

This should ease readability, indicates which version this migration script will update the schema to once applied

Not just the chat response streaming

debanjum and others added 7 commits July 31, 2023 17:21

Fix format of user query during general conversation with Llama 2

ded606c

Update chat hint message at first run

ca19509

Misc. quality improvements for Llama V2

2d6c3cd

- Fix download url -- was mapping to q3_K_M, but fixed to use q4_K_S - Use a proper Llama Tokenizer for counting tokens for truncation with Llama - Add additional null checks when running

Resolve merge conflicts: let Khoj fail if the model tokenizer is not …

209975e

…found

Add new director tests for the offline chat model with llama v2

8dd5756

Make the fake message longer

95c7b07

sabaimran added fix Fix something that isn't working as expected upgrade New feature or request labels Aug 1, 2023

sabaimran requested a review from debanjum August 1, 2023 03:58

sabaimran added 4 commits July 31, 2023 21:07

Add better error handling for download processes incase of failure

2335f11

Fix unit tests and truncation logic

e55e9a7

Use n_batch parameter to increase resource consumption on host machin…

8054bdc

…e (and implicitly engage GPU)

Add a loading symbol to web chat. Closes #392

c14cbe9

sabaimran mentioned this pull request Aug 1, 2023

Segmentation Fault in Offline Chat Functionality #367

Closed

sabaimran added 2 commits August 1, 2023 00:21

Disable the input bar when chat response is in flight

6c30740

add a lock around chat operations to prevent the offline model from g…

1c52a69

…etting bombarded and stealing a bunch of compute resources - This also solves #367

debanjum approved these changes Aug 1, 2023

View reviewed changes

sabaimran added 8 commits August 1, 2023 08:52

Switch spinner snake case -> camel case

f7e03f6

Update comments and add explanations

90efc2e

Add migration script for getting the new offline model

3a5d93d

Add additional check for chat_messages length in UT

48363ec

Add a logline when the offline model migration script runs

778df6b

Add log line for time to first response

b11b00a

Update some of the extract question prompts for llamav2

f409e16

Update chat actor unit tests for greater accuracy and benchmarking

d8fa967

debanjum reviewed Aug 1, 2023

View reviewed changes

src/khoj/migrations/migrate_offline_model.py Outdated Show resolved Hide resolved

src/khoj/processor/conversation/gpt4all/utils.py Outdated Show resolved Hide resolved

sabaimran mentioned this pull request Aug 2, 2023

The context length of Llama2 is 2k or 4k ? #394

Closed

sabaimran commented Aug 2, 2023

View reviewed changes

tests/test_gpt4all_chat_actors.py Show resolved Hide resolved

Fix offline model migration script to run for version < 0.10.1

aa68463

- Use same batch_size in extract question actor as the chat actor - Log final location the chat model is to be stored in, instead of it's temp filename while it is being downloaded

debanjum added 4 commits August 1, 2023 20:52

Make Llama 2 stop generating response on hitting specified stop words

6e4050f

It would previously some times start generating fake dialogue with it's internal prompt patterns of <s>[INST] in responses. This is a jarring experience. Stop generation response when hit <s> Resolves #398

Fix context, response size for Llama 2 to stay within max token limits

c2b7a14

Create regression text to ensure it does not throw the prompt size exceeded context window error

Update local Chat Actor and Director tests expected to fail

95acb15

Remove old chat setup timer. It is mislabelled, irrelevant since stre…

185a1fb

…aming

debanjum force-pushed the improve-llama-2-perf-and-quality-and-fixes branch from 8a94dba to 185a1fb Compare August 2, 2023 03:52

debanjum added 3 commits August 1, 2023 21:28

Simplify migration scripts management. Make them use static version

b993754

- Only make them update config when it's run conditions are satisfies - Use static schema version to simplify reasoning about run conditions

Extract new schema version for each migration script into a variable

1812473

This should ease readability, indicates which version this migration script will update the schema to once applied

Put offline model response generation behind the chat lock as well

44292af

Not just the chat response streaming

debanjum merged commit 16c6bfc into master Aug 2, 2023
4 checks passed

debanjum deleted the improve-llama-2-perf-and-quality-and-fixes branch August 2, 2023 05:11

debanjum mentioned this pull request Aug 2, 2023

Initialization fails after running out of disk space #369

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Offline chat: Quality and Reliability Improvements #393

Offline chat: Quality and Reliability Improvements #393

sabaimran commented Aug 1, 2023 •

edited by debanjum

debanjum left a comment

Offline chat: Quality and Reliability Improvements #393

Offline chat: Quality and Reliability Improvements #393

Conversation

sabaimran commented Aug 1, 2023 • edited by debanjum

Incoming

Major

Fix Prompt Size Exceeded Issue

Improve Llama 2 Model Download

Fix Segmentation Fault due to Race

Improve Chat Response Latency

Fix Fake Dialogue Continuation

Minor

Performance Analysis (Time to First Token)

debanjum left a comment

Choose a reason for hiding this comment

sabaimran commented Aug 1, 2023 •

edited by debanjum