OPT model #165

Closed · lcw99 opened this issue Apr 10, 2023 · 17 comments

@lcw99 commented Apr 10, 2023

I've tried to run OPT-13B. The model loads successfully, but at inference time the following error occurs.

2023-04-10T07:46:27.646766Z ERROR batch{batch_size=1}:prefill:prefill{id=1 size=1}:prefill{id=1 size=1}: text_generation_client: router/client/src/lib.rs:29: Server error: forward() got an unexpected keyword argument 'position_ids'
2023-04-10T07:46:27.649028Z ERROR HTTP request{otel.name=POST / http.client_ip= http.flavor=1.1 http.host=127.0.0.1:8080 http.method=POST http.route=/ http.scheme=HTTP http.target=/ http.user_agent=python-requests/2.28.1 otel.kind=server trace_id=15ce15d2e6483d4df26053f09b713b6b http.status_code=200 otel.status_code="OK"}:compat_generate{default_return_full_text=Extension(false) req=Json(CompatGenerateRequest { inputs: "A와 B가 진지한 대화 중이다. \n두사람의 대화를 자연스럽게 연결하시오.\nB: hi\nA:", parameters: GenerateParameters { best_of: None, temperature: Some(0.5), repetition_penalty: Some(1.1), top_k: None, top_p: Some(0.9), typical_p: Some(0.95), do_sample: false, max_new_tokens: 1024, return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, seed: None }, stream: true })}:generate_stream{req=Json(GenerateRequest { inputs: "A와 B가 진지한 대화 중이다. \n두사람의 대화를 자연스럽게 연결하시오.\nB: hi\nA:", parameters: GenerateParameters { best_of: None, temperature: Some(0.5), repetition_penalty: Some(1.1), top_k: None, top_p: Some(0.9), typical_p: Some(0.95), do_sample: false, max_new_tokens: 1024, return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, seed: None } })}:async_stream:generate_stream{request=GenerateRequest { inputs: "A와 B가 진지한 대화 중이다. \n두사람의 대화를 자연스럽게 연결하시오.\nB: hi\nA:", parameters: GenerateParameters { best_of: None, temperature: Some(0.5), repetition_penalty: Some(1.1), top_k: None, top_p: Some(0.9), typical_p: Some(0.95), do_sample: false, max_new_tokens: 1024, return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: true, seed: None } }}:infer{batch_size=1}:send_error: text_generation_router::infer: router/src/infer.rs:384: Request failed during generation: Server error: forward() got an unexpected keyword argument 'position_ids'
2023-04-10T07:46:27.666632Z ERROR shard-manager: text_generation_launcher: "Method Prefill encountered an error.
Traceback (most recent call last):
  File \"/home/chang/anaconda3/envs/hf-tgi/bin/text-generation-server\", line 8, in <module>
    sys.exit(app())
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
    return self.main(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/core.py\", line 778, in main
    return _main(
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
    rv = self.invoke(ctx)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
    return __callback(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/cli.py\", line 55, in serve
    server.serve(model_id, revision, sharded, quantize, uds_path)
  File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/server.py\", line 135, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize))
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/runners.py\", line 44, in run
    return loop.run_until_complete(main)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
    self.run_forever()
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
    self._run_once()
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
    handle._run()
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/events.py\", line 80, in _run
    self._context.run(self._callback, *self._args)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/grpc_interceptor/server.py\", line 153, in invoke_intercept_method
    return await self.intercept(
> File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/interceptor.py\", line 20, in intercept
    return await response
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 82, in _unary_interceptor
    raise error
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/server.py\", line 46, in Prefill
    generations, next_batch = self.model.generate_token(batch)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/contextlib.py\", line 79, in inner
    return func(*args, **kwds)
  File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/models/causal_lm.py\", line 341, in generate_token
    logits, past = self.forward(
  File \"/home/chang/AI/llm/tests/text-generation-inference/server/text_generation_server/models/causal_lm.py\", line 325, in forward
    outputs = self.model.forward(
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/accelerate/hooks.py\", line 156, in new_forward
    output = old_forward(*args, **kwargs)
TypeError: forward() got an unexpected keyword argument 'position_ids'
" rank=0

@OlivierDehaene (Member)

Solved in the latest release (v0.5.0).

@lcw99 (Author) commented Apr 12, 2023

@OlivierDehaene thanks for your hard work. I have tested OPT-13B again. It works fine for a few inference calls, but then it causes an OOM on the GPU, as shown below. I loaded it with the --quantize option; it occupies 15G of memory on start, but after a few inference calls it runs out of memory. I have an RTX 4090, and the GPU memory is 24G.

2023-04-12T07:31:32.293255Z ERROR shard-manager: text_generation_launcher: "Method Decode encountered an error.
Traceback (most recent call last):
  File \"/home/chang/anaconda3/envs/hf-tgi/bin/text-generation-server\", line 8, in <module>
    sys.exit(app())
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 1130, in __call__
    return self.main(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/core.py\", line 778, in main
    return _main(
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/core.py\", line 216, in _main
    rv = self.invoke(ctx)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/click/core.py\", line 760, in invoke
    return __callback(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File \"/home/chang/AI/llm/text-generation-inference/server/text_generation_server/cli.py\", line 55, in serve
    server.serve(model_id, revision, sharded, quantize, uds_path)
  File \"/home/chang/AI/llm/text-generation-inference/server/text_generation_server/server.py\", line 135, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize))
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/runners.py\", line 44, in run
    return loop.run_until_complete(main)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete
    self.run_forever()
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever
    self._run_once()
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once
    handle._run()
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/asyncio/events.py\", line 80, in _run
    self._context.run(self._callback, *self._args)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/grpc_interceptor/server.py\", line 153, in invoke_intercept_method
    return await self.intercept(
> File \"/home/chang/AI/llm/text-generation-inference/server/text_generation_server/interceptor.py\", line 20, in intercept
    return await response
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 82, in _unary_interceptor
    raise error
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py\", line 73, in _unary_interceptor
    return await behavior(request_or_iterator, context)
  File \"/home/chang/AI/llm/text-generation-inference/server/text_generation_server/server.py\", line 70, in Decode
    generations, next_batch = self.model.generate_token(batch)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/contextlib.py\", line 79, in inner
    return func(*args, **kwds)
  File \"/home/chang/AI/llm/text-generation-inference/server/text_generation_server/models/causal_lm.py\", line 355, in generate_token
    logits, past = self.forward(
  File \"/home/chang/AI/llm/text-generation-inference/server/text_generation_server/models/opt.py\", line 40, in forward
    outputs = self.model.forward(
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/accelerate/hooks.py\", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/models/opt/modeling_opt.py\", line 970, in forward
    outputs = self.model.decoder(
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/accelerate/hooks.py\", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/models/opt/modeling_opt.py\", line 725, in forward
    layer_outputs = decoder_layer(
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/accelerate/hooks.py\", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/models/opt/modeling_opt.py\", line 344, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/torch/nn/modules/module.py\", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/accelerate/hooks.py\", line 156, in new_forward
    output = old_forward(*args, **kwargs)
  File \"/home/chang/anaconda3/envs/hf-tgi/lib/python3.9/site-packages/transformers-4.28.0.dev0-py3.9.egg/transformers/models/opt/modeling_opt.py\", line 199, in forward
    key_states = torch.cat([past_key_value[0], key_states], dim=2)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 23.68 GiB total capacity; 21.19 GiB already allocated; 51.12 MiB free; 21.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
" rank=0
2023-04-12T07:31:32.293384Z ERROR batch{batch_size=6}:decode:decode{size=6}:decode{size=6}: text_generation_client: router/client/src/lib.rs:29: Server error: CUDA out of memory. Tried to allocate 54.00 MiB (GPU 0; 23.68 GiB total capacity; 21.19 GiB already allocated; 51.12 MiB free; 21.36 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@OlivierDehaene (Member)

Thanks!
I see that you OOMed with a batch size == 6.
What type of load do you send to the endpoint? (sequence length, max_new_tokens, number of requests / sec)
You need to tune the launcher max_input_length, max_total_tokens and max_batch_size for your setup to make sure you don't OOM.

You can use the benchmarking tool https://github.com/huggingface/text-generation-inference/tree/main/benchmark to help you.
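For example (illustrative values only; the right numbers depend on your GPU and traffic), a lower-limit configuration could look like:

text-generation-launcher --model-id facebook/opt-13b --quantize --port 8080 --max-input-length 1000 --max-total-tokens 1512 --max-batch-size 8

where --max-batch-size caps how many requests are batched together at once.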

@lcw99 (Author) commented Apr 12, 2023

@OlivierDehaene my launch parameters are as follows.

text-generation-launcher --num-shard 1 --quantize --port 8080 --model-id "/home/chang/AI/llm/Models/OPT-13B-instruct-native" --max-input-length 1500 --max-total-tokens 2048
I have tried --max-batch-size 5; in that case the client side hits an HTTP read timeout and the server faults with an OOM after a few minutes.
BTW, I use streaming.

generate error = HTTPConnectionPool(host='127.0.0.1', port=8080): Read timed out.
Traceback (most recent call last):
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/site-packages/urllib3/response.py", line 758, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/home/chang/anaconda3/envs/hf38/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)
socket.timeout: timed out
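One client-side mitigation I could try (a sketch, assuming the timeout argument of the text_generation Python Client; it does not address the server-side OOM) is to raise the read timeout so queued streaming requests are not cut off:

from text_generation import Client

# Placeholder endpoint; raise the read timeout well above the default (which is short).
client = Client("http://127.0.0.1:8080", timeout=120)

# Streaming keeps the HTTP connection open between tokens, so a short timeout
# can trip while the request is still waiting in the server-side queue.
for event in client.generate_stream("B: hi\nA:", max_new_tokens=64):
    print(event.token.text, end="", flush=True)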

@MichaMucha

Thank you very much for your work!
I was also interested to know under what conditions torch.cuda.empty_cache() gets called. I've seen code around clearing the cache, but I don't understand how and when that gets triggered inside the RPC mechanism.
I'm hitting the OOM after enough prompts have been sent and am wondering what I am doing wrong.

I tried GPT-J 6B on a 24GB card. Exactly like @lcw99 said:

it occupies 15G of memory on start, but after a few inference calls it causes an OOM.

Would love your advice.

@OlivierDehaene (Member)

Are you also using quantization?

@lcw99 (Author) commented Apr 12, 2023

@OlivierDehaene Yes I am using quantization.

@lcw99 (Author) commented Apr 13, 2023

@OlivierDehaene I've tested several cases. The OOM occurs only when I use streaming. With streaming, even when there is no inference call, the GPU keeps running (GPU-util at 50%) for a long time after the call has finished, and sometimes an OOM occurs without any client calls.

@OlivierDehaene (Member)

@lcw99, @MichaMucha,

Can you both:

  1. tell me the model of GPU you are running on
  2. give me the launcher command you are using
  3. provide me a sample script with a client that triggers the error

@lcw99 (Author) commented Apr 13, 2023

I have tested two configurations; both have the same issue.

  1. Two RTX 4090 24G
  2. text-generation-launcher --num-shard 2 --quantize --port 8080 --model-id "./Models/GPT-NeoX-20B-instruct-native/checkpoint-100" --max-input-length 1500 --max-total-tokens 2048
  3. This is part of my code: it calls generate_stream and loops over the response until there is no more generation or a stop condition is hit. This routine is called for every user input (Telegram chat bot).

# Excerpt from the Telegram bot. Helpers such as reply_text() and search_stop_word(),
# and variables like hf_tgi_api_base, contents, kwargs, gen_text_concat, gen_text_to_reply,
# sent_message, speed and generation_chunk are defined elsewhere in the bot.
import time

from text_generation import Client

client = Client(hf_tgi_api_base)
response = client.generate_stream(contents, max_new_tokens=1024, **kwargs)

# Stream Answer
temp_gen_text_concat = ""           # buffer flushed to the chat every `generation_chunk` characters
temp_gen_text_concat_start_pos = 0  # position in gen_text_concat that has already been flushed
no_gen_count = 0                    # consecutive empty tokens seen so far
stopped = False                     # set once a stop word or a stop command is detected
for event in response:
    gen_text = event.token.text
    # if len(gen_text) > 0:
    #     print(f"finish_reason = {event['choices'][0]['finish_reason']}, {gen_text}, {ord(gen_text[0])}")
    time.sleep(speed)
    if len(gen_text) == 0 and not stopped:
        no_gen_count += 1
        print(f"no gen text={no_gen_count}")
        if no_gen_count > 5:
            reply_text(context, message, gen_text_to_reply, gen_text_concat, sent_message, True)
            break
        continue
    no_gen_count = 0
    prev_len = len(gen_text_concat)
    gen_text_concat += gen_text
    if not stopped:
        gen_text_concat, stopped = search_stop_word(gen_text_concat)
    gen_text = gen_text_concat[prev_len:]
    if len(gen_text) > 0:
        temp_gen_text_concat += gen_text
        if len(temp_gen_text_concat) < generation_chunk:
            continue
        gen_text_to_reply += temp_gen_text_concat
        temp_gen_text_concat_start_pos += len(temp_gen_text_concat)
        print(f"[{temp_gen_text_concat}]={temp_gen_text_concat_start_pos}")
        temp_gen_text_concat = ""
        gen_text_to_reply, sent_message = reply_text(context, message, gen_text_to_reply, gen_text_concat, sent_message)
    if 'stop_generation' in context.user_data:
        print('stop_generation detected...')
        context.user_data.pop('stop_generation', None)
        stopped = True
    if stopped:
        print(f"{len(gen_text_concat)=}, {temp_gen_text_concat_start_pos=}")
        stop_pos = len(gen_text_concat) - temp_gen_text_concat_start_pos + 1
        if stop_pos < 0:
            stop_pos = len(gen_text_concat)
        temp_gen_text_concat = temp_gen_text_concat[:stop_pos]
        gen_text_to_reply += temp_gen_text_concat
        reply_text(context, message, gen_text_to_reply, gen_text_concat, sent_message, True)
        break

@OlivierDehaene (Member)

Can you run the following commands:

make install-benchmark
text-generation-launcher --num-shard 2 --quantize --port 8080 --model-id "./Models/GPT-NeoX-20B-instruct-native/checkpoint-100" --max-input-length 1500 --max-total-tokens 2048

and then

text-generation-benchmark --tokenizer-name EleutherAI/gpt-neox-20b --batch-size 32 --sequence-length 1500 --decode-length 548

If the benchmarking command fails it means that your setup cannot handle the maximum load you might be sending to it. As I stated above:

What type of load do you send to the endpoint? (sequence length, max_new_tokens, number of requests / sec)
You need to tune the launcher max_input_length, max_total_tokens and max_batch_size for your setup to make sure you don't OOM.

It is entirely possible that you don't OOM while the load on the system is low, because the batches stay small; once usage grows, you then OOM.

Also, since the sequence lengths are dynamic, one batch of size N with small sequences might go through but another batch of the same size N with longer sequences might fail.

That's why you need to make sure that your max_input_length / max_total_tokens / max_batch_size combination works in the worst-case scenario ahead of time.
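As a rough illustration of why the worst case matters, here is a back-of-the-envelope KV-cache estimate (a sketch; the layer count and hidden size are approximate GPT-NeoX-20B values, the weights and activations are not counted, and the fp16 cache assumption ignores quantization):

# Approximate GPT-NeoX-20B dimensions: 44 decoder layers, hidden size 6144.
num_layers = 44
hidden_size = 6144
bytes_per_value = 2  # fp16 keys and values
kv_per_token = 2 * num_layers * hidden_size * bytes_per_value  # K and V, across all layers

def kv_cache_gib(batch_size: int, total_tokens: int) -> float:
    """Worst-case KV-cache size in GiB when every sequence in the batch reaches total_tokens."""
    return batch_size * total_tokens * kv_per_token / 1024**3

print(f"batch  8 x 2048 tokens: {kv_cache_gib(8, 2048):.1f} GiB")   # ~16.5 GiB
print(f"batch 32 x 2048 tokens: {kv_cache_gib(32, 2048):.1f} GiB")  # ~66.0 GiB, far beyond a 24G card

The cache grows linearly with both batch size and sequence length, which is why a batch that fits with short sequences can OOM once the same number of requests arrives with long ones.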

@lcw99 (Author) commented Apr 13, 2023

@OlivierDehaene I've tried your command settings, and I get an OOM immediately. It works with a max batch size of 8. I guess that in the streaming case each retrieval of an event generates a batch on the server side, which would mean generate_stream needs more GPU memory on the server side than a normal non-streaming generate. Is that right?

@OlivierDehaene (Member)

No, they both have the same memory requirements, as generate uses generate_stream in the backend.
I opened an issue to make this whole process clearer in the readme/documentation.
Feel free to re-open this issue if you still have problems.

@MichaMucha commented Apr 21, 2023

The advice around benchmarking, and in particular setting the batch size, helped me. Thanks.

I still see the server continue to print tokens (into the log) beyond max_new_tokens in streaming mode, even though the client's for loop is done. I believe @lcw99 mentioned experiencing something similar.

Thanks again for your help, Olivier.

@lcw99 (Author) commented Apr 21, 2023

I understand @OlivierDehaene's explanation and I understand the benchmarks. However, in a real situation, even though there is more than 15G of GPU memory left and streaming calls are made sequentially one by one, it is somewhat difficult to understand why a GPU OOM is triggered.

@OlivierDehaene (Member)

However, in a real situation [...] streaming calls are made sequentially one by one

What do you mean?

@lcw99 (Author) commented Apr 21, 2023

However, in a real situation [...] streaming calls are made sequentially one by one

What do you mean?

I have tested under the following conditions:

  1. Start the HF text-generation server (tested on one RTX 4090 GPU with 24G; right after loading the model, 15G of VRAM is left).
  2. Run a generate_stream client.
  3. Loop until streaming is stopped by the max length or a stop code.
  4. Wait 3 seconds and repeat from step 2.

I've run just one instance of the client, and I sometimes get an OOM. A stripped-down version of the loop is sketched below.
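A sketch (the endpoint and prompt are placeholders, and the stop-word handling from my bot is omitted):

import time

from text_generation import Client

client = Client("http://127.0.0.1:8080")  # placeholder endpoint for the local TGI server

prompt = "B: hi\nA:"  # placeholder; the real bot builds the prompt from the chat history

for _ in range(100):  # step 4: repeat the call, pausing between iterations
    text = ""
    # steps 2-3: stream until the server stops (max length or stop sequence)
    for event in client.generate_stream(prompt, max_new_tokens=1024):
        text += event.token.text
    print(text)
    time.sleep(3)     # wait 3 seconds before the next call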
