Impl simple mamba model #1480

drbh · 2024-01-25T02:11:12Z

This draft PR is a work-in-progress implementation of the Mamba model. It currently loads weights and produces correct logits after a single forward pass.

Remaining work: integrate the model so that it generates tokens as expected, and optimize the runtime path to avoid copies and other unnecessary operations.
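For context on what a forward pass computes and why cached state matters for token-by-token decoding, here is a minimal pure-Python sketch of the selective-scan recurrence for a single channel. It follows the simplified discretization used in reference implementations such as mamba-minimal (A_bar = exp(delta * A), B_bar ≈ delta * B). The function names `selective_scan_step` and `selective_scan` are illustrative assumptions; this is not the fused kernel or the code in this PR.

```python
import math


def selective_scan_step(h, x_t, delta_t, A, B_t, C_t, D):
    """Advance one channel's hidden state by one token.

    h, A, B_t, C_t are lists of length N (the state size); x_t, delta_t, D are floats.
    Returns (new_state, y_t).
    """
    # Discretize and update: h_t = exp(delta * A) * h_{t-1} + delta * B_t * x_t
    new_h = [
        math.exp(delta_t * a) * h_n + delta_t * b * x_t
        for a, h_n, b in zip(A, h, B_t)
    ]
    # Read out: y_t = C_t . h_t + D * x_t (D acts as a skip connection)
    y_t = sum(c * h_n for c, h_n in zip(C_t, new_h)) + D * x_t
    return new_h, y_t


def selective_scan(xs, deltas, A, Bs, Cs, D, h0=None):
    """Run the recurrence over a whole sequence; returns (outputs, final_state)."""
    h = list(h0) if h0 is not None else [0.0] * len(A)
    ys = []
    for x_t, delta_t, B_t, C_t in zip(xs, deltas, Bs, Cs):
        h, y_t = selective_scan_step(h, x_t, delta_t, A, B_t, C_t, D)
        ys.append(y_t)
    return ys, h
```

Because each step depends only on the previous state, a decoder can keep the final state from the prompt and call `selective_scan_step` once per new token instead of re-running the scan over the whole sequence.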

Helpful resources

Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu and Tri Dao): https://arxiv.org/abs/2312.00752
https://github.com/johnma2006/mamba-minimal
https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs
huggingface/transformers#28094

Notes: this development work currently targets state-spaces/mamba-130m, so please use that model when testing. Additionally, the prefill must be limited when starting the router: cargo run -- --max-batch-prefill-tokens 768 --max-input-length 768

Update / Current State

Integration tests have been added and basic functionality such as model loading is supported.

cd integration-tests
pytest -vv models/test_fused_kernel_mamba.py

Fetch the models tested during development:

text-generation-server download-weights state-spaces/mamba-130m
text-generation-server download-weights state-spaces/mamba-1.4b
text-generation-server download-weights state-spaces/mamba-2.8b

Run the server:

cd server
MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b

Start the router:

cargo run

Make a request:

curl -s localhost:3000/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' | jq

Response:

{
  "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data."
}
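The same request can also be sent from Python. The following is a minimal sketch using only the standard library; it assumes the router is listening on localhost:3000 as in the curl example, and the helper names `build_generate_request` and `generate` are hypothetical, not part of this PR.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=20, base_url="http://localhost:3000"):
    """Build a POST request for the /generate endpoint shown in the curl example."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_new_tokens=20, base_url="http://localhost:3000", timeout=60):
    """Send the request and return the "generated_text" field of the JSON response."""
    request = build_generate_request(prompt, max_new_tokens, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["generated_text"]
```

With the server and router running, `print(generate("What is Deep Learning?"))` should print the generated text.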

RonanKMcGovern · 2024-01-30T15:44:26Z

thanks, this will be a great addition as we see more mamba architectures

server/text_generation_server/models/__init__.py

server/text_generation_server/models/custom_modeling/mamba_modeling.py

server/text_generation_server/models/mamba.py


Narsil

LGTM

This draft PR is a work in progress implementation of the mamba model. This PR currently loads weights, and produces correct logits after a single pass.

This PR still needs to correctly integrate this model so it produces tokens as expected, and apply optimization to avoid all copies during runtime/unnecessary operations.

[Mamba: Linear-Time Sequence Modeling with Selective State Spaces (Albert Gu and Tri Dao)](https://arxiv.org/abs/2312.00752)
https://github.com/johnma2006/mamba-minimal
https://github.com/huggingface/candle/blob/main/candle-examples/examples/mamba-minimal/model.rs
huggingface/transformers#28094

Notes: this dev work is currently targeting `state-spaces/mamba-130m`, so if you want to test please use that model. Additionally when starting the router the prefill needs to be limited: `cargo run -- --max-batch-prefill-tokens 768 --max-input-length 768`

Integration tests have been added and basic functionality such as model loading is supported.

```bash
cd integration-tests
pytest -vv models/test_fused_kernel_mamba.py
```

- [x] add tests
- [x] load model
- [x] make simple request
- [ ] resolve warmup issue
- [ ] resolve output issues

fetching models tested during dev

```bash
text-generation-server download-weights state-spaces/mamba-130m
text-generation-server download-weights state-spaces/mamba-1.4b
text-generation-server download-weights state-spaces/mamba-2.8b
```

The server can be run

```bash
cd server
MASTER_ADDR=127.0.0.1 MASTER_PORT=5555 python text_generation_server/cli.py serve state-spaces/mamba-2.8b
```

router

```bash
cargo run
```

make a request

```bash
curl -s localhost:3000/generate \
    -X POST \
    -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}' \
    -H 'Content-Type: application/json' | jq
```

response

```json
{
  "generated_text": "\n\nDeep learning is a machine learning technique that uses a deep neural network to learn from data."
}
```

---------

Co-authored-by: Nicolas Patry <patry.nicolas@protonmail.com>

drbh added 5 commits January 23, 2024 01:37

feat: initial weight load

bcf1457

feat: mvp single inference and explore integration

35939a2

feat: prefer custom model and produce correct output

1c32d53

feat: build custom selective-scan kernels

c2681b2

feat: use fused kernel in forward pass

966f3ba

drbh mentioned this pull request Jan 29, 2024

Support for Mamba (SSM) #1507

Closed

feat: add optimization and first pass of integration test

5b6f925

drbh commented Jan 30, 2024

server/text_generation_server/models/__init__.py

drbh commented Jan 30, 2024

server/text_generation_server/models/custom_modeling/mamba_modeling.py (outdated)

drbh and others added 7 commits February 1, 2024 05:00

fix: start to add caching of previous states

2d67462

feat: use cache when decoding

3a42765

fix: revise non batching tests

0f124cb

feat: avoid triton selective_state_update

a4f1916

fix: improve step to use batch

63bc4c5

feat: support batching

3caa9b9

Fix mamba load.

8319e85

drbh commented Feb 6, 2024

server/text_generation_server/models/mamba.py (outdated)

drbh added 8 commits February 6, 2024 20:38

feat: prefer triton ops and batch conv

5e10218

fix: rename tests and snapshots

36a4853

feat: update docker for mamba

50ca04b

fix: update selective state Makefile

5b30a42

Merge branch 'main' into impl-simple-mamba-model

9146ba0

fix: adjust typos and docker build

deed8e8

Merge branch 'impl-simple-mamba-model' of github.com:huggingface/text…

48624fe

…-generation-inference into impl-simple-mamba-model

fix: add missing accepted_ids to batch_top_tokens

2c6ef7c

drbh marked this pull request as ready for review February 7, 2024 04:35

feat: conditionally include mamba

b99f784

Narsil approved these changes Feb 8, 2024


Narsil merged commit bd405e0 into main Feb 8, 2024
7 checks passed

Narsil deleted the impl-simple-mamba-model branch February 8, 2024 09:19
