
llama-python-streamingllm

Overall Effect

  • Video demonstration: chrome_hH9tj3BgAX.mp4
  • Chat page (screenshot)
  • Settings page (screenshot)
  • The debugging interface prints the sampling probabilities (screenshot)
Huggingface Spaces

  • First, check that Huggingface itself is up: status.huggingface.co
  • Limour/llama-python-streamingllm
  • Only one person can use the Space at a time. Before use, click the Reset button to restore the initial kv_cache. If nothing happens after clicking Submit, someone else is using it; wait a while and then Reset again.
  • Using more than one window at once will crash the Space, and you need to go to Settings and Restart this Space to recover.
  • To use it yourself, Duplicate the Space and set your copy to private.

Kaggle

Colab

Local Installation

conda create -n llamaCpp libcublas cuda-toolkit git -c nvidia -c conda-forge
conda activate llamaCpp
conda install python=3.10 gradio -c conda-forge
# Then download the matching wheel from the releases page:
# https://github.com/Limour-dev/llama-cpp-python-cuBLAS-wheels/releases
pip install --force-reinstall llama_cpp_python-0.2.39+cu122-cp310-cp310-win_amd64.whl
git clone --depth=1 https://github.com/Limour-dev/llama-python-streamingllm.git
cd llama-python-streamingllm
mkdir cache
mkdir models
cd models
D:\aria2\aria2c.exe --all-proxy='http://127.0.0.1:7890' -o 'causallm_14b.IQ3_XS.gguf' --max-download-limit=6M "https://huggingface.co/Limour/CausalLM-14B-GGUF/resolve/main/causallm_14b.IQ3_XS.gguf?download=true"
cd ..
python .\gradio_streamingllm.py

Core Idea

Use the two llama.cpp APIs, kv_cache_seq_rm and kv_cache_seq_shift, to perform token-level operations on the kv_cache.
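As a rough illustration, removing a span of cached tokens and closing the gap can be wrapped in one helper. This is a minimal sketch that assumes a hypothetical model wrapper whose kv_cache_seq_rm(seq_id, p0, p1) and kv_cache_seq_shift(seq_id, p0, p1, delta) methods mirror the llama.cpp C API; the repo's actual wrapper may differ.

# Minimal sketch: evict cached tokens [p0, p1) of sequence 0 and shift the
# surviving suffix left so that cache positions stay contiguous.
def evict_range(model, p0, p1, n_past):
    model.kv_cache_seq_rm(0, p0, p1)                  # drop keys/values for p0..p1-1
    model.kv_cache_seq_shift(0, p1, n_past, p0 - p1)  # move [p1, n_past) back by p1-p0
    return n_past - (p1 - p0)                         # new logical cache length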

Methods whose names start with venv mark token ranges in the kv_cache, e.g. tokens 51-100 are RAG-injected content, 101-150 are user input, 151-200 are narration, and so on.
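One way to keep such marks, sketched below with illustrative names (not the repo's actual venv_* API): record how many tokens each named layer owns, so a layer's absolute token range can be recomputed at any time.

# Sketch of venv-style bookkeeping; names and fields are illustrative.
venv = [0]           # venv[i] = number of cached tokens owned by layer i
venv_idx_map = {}    # layer name -> index into venv, e.g. 'rag', 'user'

def venv_create(name):
    venv.append(0)                  # open a new, empty layer
    venv_idx_map[name] = len(venv) - 1

def venv_append(n_new_tokens):
    venv[-1] += n_new_tokens        # freshly evaluated tokens go to the newest layer

def venv_range(name):
    i = venv_idx_map[name]
    p0 = sum(venv[:i])              # absolute position of the layer's first token
    return p0, p0 + venv[i]         # half-open range [p0, p1)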

On top of this, tokens that are no longer needed can be removed dynamically to save kv_cache space.
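Combining the two sketches above, releasing a layer that is no longer needed might look like this (still hypothetical helpers, not the repo's exact code):

# Sketch: free a finished layer and update the bookkeeping.
def venv_remove(model, name, n_past):
    i = venv_idx_map.pop(name)
    p0 = sum(venv[:i])                            # layer's first token position
    n_past = evict_range(model, p0, p0 + venv[i], n_past)
    del venv[i]                                   # forget its token count
    for k in venv_idx_map:                        # later layers shifted left by one
        if venv_idx_map[k] > i:
            venv_idx_map[k] -= 1
    return n_past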

Finally, when the kv_cache is full, skip over the content marked for permanent retention (such as the system prompt), then remove no-longer-used tokens starting from the beginning of the kv_cache.
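A hedged sketch of that eviction policy, simplified to dropping the oldest unpinned span (the repo removes only ranges already marked unused; keep is assumed to be the length of the pinned prefix):

# Sketch: when fewer than `need` slots are free, pin the first `keep` tokens
# (e.g. the system prompt) and evict just enough of the oldest unpinned ones.
def ensure_room(model, keep, need, n_past, n_ctx):
    if n_past + need > n_ctx:
        overflow = n_past + need - n_ctx
        n_past = evict_range(model, keep, keep + overflow, n_past)
    return n_past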
