Please add support for neural-chat-7b-v3-1 #1284
Comments
This is a Mistral model, so it should be supported, correct?
I was able to get around the error where text-generation-launcher tries to read the path as if it were a PEFT LoRA model by commenting out the following lines from text-generation-inference/server/text_generation_server/cli.py (lines 153 to 163 at 3c02262).
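For anyone hitting the same thing, a quick way to sanity-check what the model directory looks like from that code path's point of view (this is just an illustration I'm sketching here, not the TGI code being commented out; the path is my local clone):

```python
# Illustrative check only (not TGI code): a full model directory ships a
# config.json plus weight files, while a PEFT adapter ships adapter_config.json.
from pathlib import Path

def describe_model_dir(model_path: str) -> str:
    p = Path(model_path)
    has_config = (p / "config.json").is_file()
    has_adapter = (p / "adapter_config.json").is_file()
    has_weights = any(p.glob("*.safetensors")) or any(p.glob("*.bin"))
    if has_adapter and not has_config:
        return "looks like a PEFT adapter"
    if has_config and has_weights:
        return "looks like a full model"
    return "incomplete or unrecognized layout"

print(describe_model_dir("/home/thomas/src/neural-chat-7b-v3-1"))
```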
Currently stuck installing the dropout-layer-norm package. Had to …
I was able to build dropout-layer-norm after creating a fresh environment. Steps I've taken:

```bash
# mise en place
git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference
# Create a new environment
conda create -n text-generation-inference python=3.10
conda activate text-generation-inference
# build tgi server
BUILD_EXTENSION=True DISABLE_CUSTOM_KERNELS=True make install
# This fails for me the first time because nvcc is not installed
# So we install nvcc
conda install -c "nvidia/label/cuda-12.1.1" cuda-toolkit
# Install it again
BUILD_EXTENSION=True DISABLE_CUSTOM_KERNELS=True make install
# that should work now that nvcc is installed
# build flash attention v2
cd server
make install-flash-attention-v2
# The build uses ninja, so cap MAX_JOBS to keep compilation from exhausting RAM
export MAX_JOBS=2
# now build dropout-layer-norm
cd flash-attention-v2/csrc/layer_norm
python -m pip install .
# okay figured it would be a good time to try it out
cd ../../../..
# we should be back in text-generation-inference root directory now
```

```bash
target/release/text-generation-launcher --model-id /home/thomas/src/neural-chat-7b-v3-1 --port=8080 --quantize bitsandbytes-nf4
```

and I'm now seeing
so it appears I cannot get away with neglecting text-generation-inference/server/Makefile-vllm (lines 1 to 2 at 3c02262). Because this installed every …

Should I …
So I was able to install vllm from a newer commit that is mentioned in #1285, and the only problem I'm seeing from pip is
but I figured let's try it out anyway, so I go back to the root directory and run
and now I'm getting
so it looks like I need to
doing that, and now I'm seeing
which is a new error and a sign of the forward march of progress!
Seems to be a vllm interop issue, which makes sense as I'm working on main, lol.
I changed paged_attention.py to be sure to pass
so I
and I still see
which looks an awful lot like a problem with flash-attention's … If I try to run
I get an OOM error,
so I guess I need to go unit-test varlen_fwd in the flash-attention repo.
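A minimal standalone check of that path might look something like this (a sketch built on flash_attn's flash_attn_varlen_func Python wrapper, which sits on top of varlen_fwd; the head count, head dim, and sequence lengths below are made up for illustration):

```python
# Tiny packed-sequence ("varlen") smoke test for flash-attention v2.
import torch
from flash_attn import flash_attn_varlen_func

device, dtype = "cuda", torch.float16
nheads, headdim = 32, 128
seqlens = [5, 7, 3]                    # three short "requests" packed into one batch
total = sum(seqlens)

q = torch.randn(total, nheads, headdim, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)
# cu_seqlens holds the cumulative start offsets of each sequence in the packed tensors.
cu_seqlens = torch.tensor([0, 5, 12, 15], device=device, dtype=torch.int32)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    causal=True,
)
torch.cuda.synchronize()               # surface any async CUDA error right here
print(out.shape)                       # expect (15, 32, 128)
```

If even a tiny case like this throws an illegal memory access, the problem is in the kernel or the build rather than in how tgi sizes its warmup batch.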
Okay so I just ran
and after installing the correct versions of torchvision and timm along with building the fused_dense_lib module, I'm getting only a single error
which doesn't give me much help in tracking down why I'm getting illegal memory access errors from varlen_fwd when tgi warms up. How big a batch is tgi using? Could this be a simple OOM that's getting obfuscated somehow?
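One general way to make these failures less opaque (a standard PyTorch/CUDA debugging trick, nothing tgi-specific) is to force synchronous kernel launches so the error is raised at the call that actually caused it, and to log allocation levels alongside; checked below is just a hypothetical helper name:

```python
# Force synchronous CUDA launches so an illegal-memory-access error surfaces at
# the offending call instead of at some later, unrelated synchronization point.
# Must be set before CUDA is initialized, i.e. before the first CUDA operation.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def checked(step_name, fn, *args, **kwargs):
    """Run fn, sync the GPU, and report how much memory is allocated afterwards."""
    out = fn(*args, **kwargs)
    torch.cuda.synchronize()  # any async kernel error is raised here, tied to this step
    allocated_mib = torch.cuda.memory_allocated() / 2**20
    print(f"{step_name}: ok, allocated={allocated_mib:.0f} MiB")
    return out
```

A genuine OOM usually shows up as torch.cuda.OutOfMemoryError with allocation numbers, whereas a real out-of-bounds access keeps reporting "illegal memory access" even with tiny inputs.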
Thinking this is my problem, as I am not using an A100: text-generation-inference/router/client/src/client.rs (lines 98 to 102 at 3c02262)
I set
I've fiddled around trying to get rid of the warm-up step. I even took out the self.generate_token(batch) call in warmup in flash_causal_lm.py, since pulling it out of the Rust code entirely left me without an initialized cache manager. Then I started getting OOMs from initializing the cache manager; putting it back in, I get the above errors. So I guess I'm working on a low-resource warmup right now. The thing is, I can use the model through Python, and the flash-attention unit tests are all passing. After some extensive testing with the model in Python, I find I can only process sequences of around 2000 tokens with the 8 GB mobile GPU; any more and I OOM. If I set …
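For scale, a rough back-of-the-envelope (assuming the stock Mistral-7B geometry of 32 layers, 8 KV heads, and head dim 128, which neural-chat-7b-v3-1 inherits, and roughly half a byte per parameter for NF4 weights; activations and allocator overhead are ignored):

```python
# Back-of-the-envelope memory estimate; all numbers are approximations.
layers, kv_heads, head_dim = 32, 8, 128   # stock Mistral-7B geometry
kv_bytes_per_elem = 2                     # fp16 KV cache
params = 7.2e9                            # ~7B parameters

def kv_cache_gib(tokens: int) -> float:
    # K and V, per layer, per KV head, per head-dim element, per token
    return 2 * layers * kv_heads * head_dim * kv_bytes_per_elem * tokens / 2**30

weights_nf4_gib = params * 0.5 / 2**30    # ~4 bits/param with bitsandbytes NF4

for tokens in (512, 2048, 4096):
    print(f"{tokens:5d} tokens: KV cache ~{kv_cache_gib(tokens):.2f} GiB, "
          f"weights ~{weights_nf4_gib:.1f} GiB, plus activations and overhead")
```

At these lengths the KV cache is small, which suggests the ~2000-token ceiling on 8 GB comes mostly from the quantized weights plus prefill activations and whatever the desktop is already holding.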
I encountered this exact error while running it from Docker.
If you run it locally, you can comment out the offending code here.
Hi, I run the command below and get an error:

```bash
text-generation-launcher --model-id ~/models/beluga --port=8080 --quantize bitsandbytes
```

```
RuntimeError: unable to mmap 9976578928 bytes from file </home/models/beluga/model-00001-of-00002.safetensors>: Cannot allocate memory (12)
```
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Any update on this? I get the same error with the Intel/neural-chat-7b-v3-3 model.
Model description
I'm using neural-chat-7b-v3-1 locally on my laptop and it would sure be sweet if I could serve it through tgi.
I can currently use it with Python using the pattern
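(something along these lines; this is an illustrative sketch of the usual transformers NF4 loading rather than an exact snippet)

```python
# Illustrative local-loading sketch; the path and generation settings are examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_path = "/home/thomas/src/neural-chat-7b-v3-1"
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, quantization_config=bnb, device_map="auto"
)

inputs = tokenizer("Tell me about gravity.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```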
but when I try to pass the path of the repo I cloned through to tgi, I get
So I'm seeing an error that appears to be related to #1283, in addition to tgi complaining there's no adapter_config.json, which is odd because the repo has the full model and is not a PEFT adapter. But it doesn't even look like it can see the local repo, so I don't know.

Open source status
Provide useful links for the implementation
https://huggingface.co/Intel/neural-chat-7b-v3-1