Support for mosaicml/mpt-30b-instruct model #491

Closed
maziyarpanahi opened this issue Jun 23, 2023 · 19 comments · Fixed by #514

Comments

@maziyarpanahi
Contributor

Feature request

I was wondering if there will be support for the newly released mpt-30b-instruct.

Motivation

It's not possible to use the mosaicml/mpt-30b-instruct model:

ValueError: sharded is not supported for AutoModel

Your contribution

I am not sure how support for new LLM models can be added. (If there is a step-by-step guide on where to start, that would be great, and I could contribute.)

@mantrakp04

Did you try --trust-remote-code while running the Docker container?

@tim-a-davis

Did you try --trust-remote-code while running the Docker container?

It's very slow. This model is not supported for sharding at the moment in text-generation-inference.

@mantrakp04

Then try a rudimentary implementation of it: you can use Rust or JS for the router and Python for inference. Copy the custom kernels from the repo and modify them as needed. MPT already has a flash attention implementation in its "remote code" file, so use that, with batch_encode_plus for tokenization and batch_decode for decoding. Implement batching on the router server and voilà, you have your own server ready for inference.
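The Python inference piece could look roughly like the untested sketch below (the generate_batch helper and the settings are just placeholders, not a TGI implementation, and device_map='auto' assumes accelerate is installed):

import torch
import transformers

name = 'mosaicml/mpt-30b-instruct'

tokenizer = transformers.AutoTokenizer.from_pretrained(name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # MPT's tokenizer has no pad token by default

model = transformers.AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map='auto',  # assumes accelerate is installed
)

def generate_batch(prompts, max_new_tokens=128):
    # batch_encode_plus pads the prompts in the batch to a common length
    enc = tokenizer.batch_encode_plus(prompts, return_tensors='pt', padding=True)
    enc = {k: v.to(model.device) for k, v in enc.items()}
    out = model.generate(**enc, max_new_tokens=max_new_tokens, pad_token_id=tokenizer.pad_token_id)
    # batch_decode turns the generated ids back into strings
    return tokenizer.batch_decode(out, skip_special_tokens=True)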

@tim-a-davis

Then try a rudimentary implementation of it: you can use Rust or JS for the router and Python for inference. Copy the custom kernels from the repo and modify them as needed. MPT already has a flash attention implementation in its "remote code" file, so use that, with batch_encode_plus for tokenization and batch_decode for decoding. Implement batching on the router server and voilà, you have your own server ready for inference.

Maybe you could write me one as an example?

@mantrakp04

mantrakp04 commented Jun 23, 2023

I'm working on one right now, if you would like to help out (Discord: mantrakp).

@SinanAkkoyun

I am also very interested in this.
I know the router side, but how do you actually batch-compute multiple requests at once "on the fly" with transformers?
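The naive approach I can picture (purely illustrative, not what TGI does) is to collect whatever requests arrive within a short window and run them through one padded generate call, where generate_batch below stands for any function mapping a list of prompts to a list of texts:

import queue
import threading
import time

request_queue = queue.Queue()  # items: (prompt, holder dict)

def batching_loop(generate_batch, max_batch_size=8, window_s=0.05):
    while True:
        batch = [request_queue.get()]              # block until the first request arrives
        deadline = time.monotonic() + window_s
        while len(batch) < max_batch_size:         # then gather more requests for a short window
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        texts = generate_batch([prompt for prompt, _ in batch])
        for (_, holder), text in zip(batch, texts):
            holder['text'] = text                  # hand the result back to the caller
            holder['done'].set()

def submit(prompt):
    # Called from the router side; blocks until the batcher has answered.
    holder = {'done': threading.Event()}
    request_queue.put((prompt, holder))
    holder['done'].wait()
    return holder['text']

But real continuous batching interleaves requests token by token, which is the part I don't see how to do with plain transformers.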

@SinanAkkoyun

(And can we expect an optimized TGI implementation soon?)

@Narsil
Collaborator

Narsil commented Jun 26, 2023

Maybe take example from the other models we have done in server/text_generation_server/models/custom_modeling/*.py?

There are also some files in server/text_generation_server/models/*.py. Those declare the model as being flash-enabled (the batching happens differently when a model supports flash).

If you succeed, PRs are welcome!

@louis030195

Is MPT even supported? (#290)

@Narsil
Collaborator

Narsil commented Jun 28, 2023

It's supported on a "best effort" basis.

I started some work to actually support it, but it means rewriting flash attention (the CUDA version) with an added bias, which may take some time.
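(For context, the "added bias" is MPT's ALiBi: a per-head linear bias added to the attention scores before the softmax. In eager PyTorch, ignoring the fused kernel entirely, it looks roughly like the sketch below; the function names are only for illustration.)

import math
import torch

def alibi_slopes(num_heads):
    # Geometric slopes from the ALiBi paper; assumes num_heads is a power of two.
    start = 2 ** (-8 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def attention_with_alibi_bias(q, k, v):
    # q, k, v: [batch, heads, seq, head_dim]
    _, heads, seq, head_dim = q.shape
    scores = q @ k.transpose(-2, -1) / math.sqrt(head_dim)            # [batch, heads, seq, seq]
    pos = torch.arange(seq, device=q.device)
    distance = pos[None, :] - pos[:, None]                            # j - i, negative for past tokens
    bias = alibi_slopes(heads).to(q.device)[:, None, None] * distance # [heads, seq, seq]
    scores = scores + bias.to(scores.dtype)   # this extra additive term is what the fused kernel lacks
    causal = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v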

@mantrakp04

Sad news: I didn't succeed. The MPT model is a bit different; I tried loading it, but it didn't work as expected and keeps mixing up tokens. I am looking forward to your implementation, Narsil, sorry for the wait.

Thanks in advance to Narsil :D

@ankit201

It's supported on a "best effort" basis.

I started some work to actually support it, but it means rewriting flash attention (the CUDA version) with an added bias, which may take some time.

Can you give some guidance on how you started writing the flash attention part, and what are your thoughts on implementing dynamic batching for this, as it only supports 1 concurrent request for now on AutoModel?
A little guidance would be really great; maybe we can collaborate and try this out.

@Narsil
Collaborator

Narsil commented Jun 29, 2023

on implementing dynamic batching for this as it only supports 1 concurrent request for now on AutoModel.

This won't require work once we have flash attention.

@ankit201

on implementing dynamic batching for this as it only supports 1 concurrent request for now on AutoModel.

This won't require work once we have flash attention.

Please correct me if I'm wrong, but do we need to implement this, since the mpt-30b models already have flash attention usage built into their config?
mpt-30b-chat

import torch
import transformers

name = 'mosaicml/mpt-30b-chat'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.attn_config['attn_impl'] = 'triton'  # change this to use triton-based FlashAttention
config.init_device = 'cuda:0' # For fast initialization directly on GPU!

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  torch_dtype=torch.bfloat16, # Load model weights in bfloat16
  trust_remote_code=True
)

@Narsil
Collaborator

Narsil commented Jun 30, 2023

Because it doesn't implement the flash attention we want.

This is Triton's flash attention, which doesn't support "unpadded" batching, which is the one needed to work nicely with TGI (removing padding removes a LOT of issues and unnecessary memory, and speeds up inference much more than flash by itself).

Flash attention actually doesn't play that big of a role in speeding things up at inference, since most of the time is spent in decode where it doesn't really help. But the no padding thing is extremely important.
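To make the "unpadded" layout concrete (purely illustrative, not the actual TGI code): instead of padding every sequence in the batch to the longest one, the tokens of all sequences are concatenated into one flat tensor, and the kernel is told where each sequence starts and ends via cumulative lengths:

import torch

# Three requests of different lengths, as token-id tensors (made-up ids).
seqs = [torch.tensor([5, 9, 2]), torch.tensor([7, 1]), torch.tensor([4, 4, 8, 3, 6])]

# Padded layout: a [batch, max_len] tensor, largely filled with wasted pad positions.
padded = torch.nn.utils.rnn.pad_sequence(seqs, batch_first=True, padding_value=0)

# Unpadded ("varlen") layout: one flat [total_tokens] tensor plus cumulative
# sequence lengths, which is what variable-length flash attention kernels consume.
input_ids = torch.cat(seqs)                                  # shape [3 + 2 + 5] = [10]
cu_seqlens = torch.tensor([0, 3, 5, 10], dtype=torch.int32)  # start offset of each sequence
max_seqlen = max(len(s) for s in seqs)                       # 5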

@Narsil
Collaborator

Narsil commented Jul 1, 2023

Here is the non-flash version (as a temporary measure, since modifying the kernel is taking more time than I anticipated): #514

This should enable sharding at least.

@Narsil Narsil mentioned this issue Jul 1, 2023
@ankit201

ankit201 commented Jul 1, 2023

Here is the non-flash version (as a temporary measure, since modifying the kernel is taking more time than I anticipated): #514

This should enable sharding at least.

Many thanks for this. Looking forward to the flash class too.
Cheers!

OlivierDehaene pushed a commit that referenced this issue Jul 3, 2023
# What does this PR do?

This adds a non-flash version of MPT.
Flash is harder because we need to create a bias-ready CUDA kernel of
flash attention.

Fixes #361
Fixes #491
Fixes #290
@ConProgramming

Because it doesn't implement the flash attention we want.

This is Triton's flash attention, which doesn't support "unpadded" batching, which is the one needed to work nicely with TGI (removing padding removes a LOT of issues and unnecessary memory, and speeds up inference much more than flash by itself).

Flash attention actually doesn't play that big of a role in speeding things up at inference, since most of the time is spent in decode where it doesn't really help. But the no padding thing is extremely important.

Triton is the only flash attention implementation that supports ALiBi, if I understand this correctly.

So for TGI, if we want to use MPT with ALiBi, does that leave us with just the native PyTorch implementation?
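(To frame the question a bit more: in stock PyTorch 2.x, torch.nn.functional.scaled_dot_product_attention accepts an additive float attn_mask, so the ALiBi bias plus the causal mask can at least be expressed there. A rough sketch, without any of the unpadded batching discussed above:)

import torch
import torch.nn.functional as F

def sdpa_with_alibi(q, k, v, slopes):
    # q, k, v: [batch, heads, seq, head_dim]; slopes: [heads] tensor of ALiBi slopes.
    seq = q.shape[-2]
    pos = torch.arange(seq, device=q.device)
    distance = pos[None, :] - pos[:, None]                       # j - i
    bias = slopes.to(q.device)[:, None, None] * distance         # [heads, seq, seq]
    causal = torch.triu(torch.full((seq, seq), float('-inf'), device=q.device), diagonal=1)
    attn_mask = (bias + causal).to(q.dtype)                      # additive float mask
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)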

@OlivierDehaene
Member

OlivierDehaene commented Jul 4, 2023

We will fork and add it ourselves to the flash attention cuda kernels.
