# MPT-7B-chat

---

🚨 **Note: this must be run on a GPU. If you run this on a CPU, even a very fast one, it can take many hours to answer a single question!**

---

This notebook is a reworked version of the code reviewed in this post:
https://hackernoon.com/how-to-run-mpt-7b-on-aws-sagemaker-mosaicmls-chatgpt-competitor

In [1]:
%%time
!pip install -qU transformers accelerate einops langchain xformers

CPU times: user 59.6 ms, sys: 22 ms, total: 81.6 ms
Wall time: 5.25 s


In [2]:
import time

t_start_time = time.time()

In [3]:
%%time
from torch import cuda, bfloat16
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig
import transformers

CPU times: user 1.74 s, sys: 164 ms, total: 1.9 s
Wall time: 1.92 s


In [4]:
%%time

device = f'cuda:{cuda.current_device()}' if cuda.is_available() else 'cpu'

# name = 'mosaicml/mpt-30b-chat'
name = 'mosaicml/mpt-7b-chat'

tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)

CPU times: user 82.7 ms, sys: 54.1 ms, total: 137 ms
Wall time: 205 ms


In [5]:
# %%time


# config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
# config.init_device = 'meta' # Deferred initialization; intended for use with Composer + FSDP

In [6]:
%%time
config = AutoConfig.from_pretrained(name, trust_remote_code=True)
config.init_device = device  # Initialize the weights directly on the GPU when one is available

CPU times: user 12.5 ms, sys: 0 ns, total: 12.5 ms
Wall time: 80.9 ms


In [7]:
%%time
model = AutoModelForCausalLM.from_pretrained(name,
                                             trust_remote_code=True,
                                             config=config,
                                             torch_dtype=bfloat16)

print(f"device={device}")
print('model loaded')

Instantiating an MPTForCausalLM model from /home/ec2-user/.cache/huggingface/modules/transformers_modules/mosaicml/mpt-7b-chat/64e5c9c9fb53a8e89690c2dee75a5add37f7113e/modeling_mpt.py
You are using config.init_device='cuda:0', but you can also use config.init_device="meta" with Composer + FSDP for fast initialization.


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

device=cuda:0
model loaded
CPU times: user 3.32 s, sys: 9.36 s, total: 12.7 s
Wall time: 12.8 s


In [8]:
%%time
# Make sure the model is on the selected device (nearly instant when it was already initialized there)
model.to(device)

CPU times: user 2.04 ms, sys: 3.37 ms, total: 5.41 ms
Wall time: 3.53 ms


MPTForCausalLM(
  (transformer): MPTModel(
    (wte): SharedEmbedding(50432, 4096)
    (emb_drop): Dropout(p=0, inplace=False)
    (blocks): ModuleList(
      (0-31): 32 x MPTBlock(
        (norm_1): LPLayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attn): MultiheadAttention(
          (Wqkv): Linear(in_features=4096, out_features=12288, bias=False)
          (out_proj): Linear(in_features=4096, out_features=4096, bias=False)
        )
        (norm_2): LPLayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (ffn): MPTMLP(
          (up_proj): Linear(in_features=4096, out_features=16384, bias=False)
          (act): GELU(approximate='none')
          (down_proj): Linear(in_features=16384, out_features=4096, bias=False)
        )
        (resid_attn_dropout): Dropout(p=0, inplace=False)
        (resid_ffn_dropout): Dropout(p=0, inplace=False)
      )
    )
    (norm_f): LPLayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  )
)
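The layer shapes in the printout above are enough to estimate the model's size. The following is a minimal sketch, not taken from the original notebook. It assumes that each `LPLayerNorm` holds a weight vector only (no bias) and that the output projection reuses the `wte` embedding matrix, since no separate output layer appears in the printout. It counts weights only, so activations and the attention cache come on top of this figure.

```python
# Layer shapes copied from the printed MPTForCausalLM module.
VOCAB_SIZE = 50432
D_MODEL = 4096
N_BLOCKS = 32
FFN_HIDDEN = 16384
BYTES_PER_BF16 = 2  # bfloat16 stores each value in 16 bits


def count_parameters() -> int:
    """Estimate the parameter count from the printed layer shapes."""
    embedding = VOCAB_SIZE * D_MODEL                          # wte
    attention = D_MODEL * 3 * D_MODEL + D_MODEL * D_MODEL     # Wqkv + out_proj (bias=False)
    ffn = 2 * D_MODEL * FFN_HIDDEN                            # up_proj + down_proj (bias=False)
    # Assumption: each LPLayerNorm has a weight vector but no bias.
    norms_per_block = 2 * D_MODEL                             # norm_1 + norm_2
    final_norm = D_MODEL                                      # norm_f
    return embedding + N_BLOCKS * (attention + ffn + norms_per_block) + final_norm


total = count_parameters()
weights_gib = total * BYTES_PER_BF16 / 2**30
print(f"{total:,} parameters, about {weights_gib:.1f} GiB in bfloat16")
# 6,649,286,656 parameters, about 12.4 GiB in bfloat16
```

Under these assumptions the weights alone need roughly 12.4 GiB, which is a useful lower bound when choosing a GPU.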

In [9]:
%%time
import time
from IPython.display import Markdown, display
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

# MPT-7B is trained to add "<|endoftext|>" at the end of generations
stop_token_ids = [tokenizer.eos_token_id]

# define custom stopping criteria object.
# Source: https://github.com/pinecone-io/examples/blob/master/generation/llm-field-guide/mpt-7b/mpt-7b-huggingface-langchain.ipynb
class StopOnTokens(StoppingCriteria):
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor,
                 **kwargs) -> bool:
        for stop_id in stop_token_ids:
            if input_ids[0][-1] == stop_id:
                return True
        return False

stopping_criteria = StoppingCriteriaList([StopOnTokens()])

def ask_question(question, max_new_tokens=100):
    start_time = time.time()

    # Encode the question
    input_ids = tokenizer.encode(question, return_tensors='pt')

    # Move the input to the same device as the model
    input_ids = input_ids.to(device)

    # Generate a response; temperature only takes effect when sampling is enabled
    output = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        temperature=0.9,
        stopping_criteria=stopping_criteria
    )

    # Decode the response
    response = tokenizer.decode(output[:, input_ids.shape[-1]:][0],
                                skip_special_tokens=True)

    end_time = time.time()
    duration = end_time - start_time

    display(Markdown(response))

    print("Function duration:", duration, "seconds")

CPU times: user 72 µs, sys: 119 µs, total: 191 µs
Wall time: 198 µs
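To make the role of the stopping criteria in the cell above more concrete, here is a toy decoding loop in plain Python. It is only a sketch of the idea, not how `model.generate` is implemented: `toy_generate`, `should_stop`, the scripted token stream, and the stop id `0` are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List, Sequence


def should_stop(input_ids: Sequence[Sequence[int]], stop_ids: Sequence[int]) -> bool:
    """Same check as StopOnTokens: is the newest token of the first sequence a stop token?"""
    return input_ids[0][-1] in stop_ids


def toy_generate(next_token: Callable[[List[int]], int],
                 prompt_ids: Sequence[int],
                 stop_ids: Sequence[int],
                 max_new_tokens: int) -> List[int]:
    """Append one token at a time until a stop token appears or the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(next_token(ids))
        if should_stop([ids], stop_ids):
            break
    # Return only the newly generated tokens, like slicing off the prompt before decoding.
    return ids[len(prompt_ids):]


STOP_ID = 0  # hypothetical stop token id, used only for this illustration
scripted = iter([11, 12, STOP_ID, 13])  # stand-in for a model's next-token choices
new_tokens = toy_generate(lambda ids: next(scripted), [1, 2, 3], [STOP_ID], max_new_tokens=10)
print(new_tokens)  # [11, 12, 0]
```

The token `13` is never consumed, because the check fires as soon as the stop token is appended. The stop token itself stays in the output, which is why the notebook decodes with `skip_special_tokens=True`.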


In [10]:
# Ask a question
# ask_question("What is the capital of France?")
ask_question("Explain to me the difference between nuclear fission and fusion.", 200)
# ask_question("write python code that converts a csv into a pdf", 400)




Nuclear fission is a process in which the nucleus of an atom is split into two smaller nuclei, releasing a large amount of energy in the process. This process is used in nuclear power plants to generate electricity.
Nuclear fusion is a process in which two or more nuclei combine to form a single, heavier nucleus, releasing energy in the process. This process is the same process that occurs in the sun and other stars, and scientists hope that it can be harnessed to provide a virtually unlimited source of clean energy.
In summary, nuclear fission is the process of splitting atoms to release energy, while nuclear fusion is the process of combining atoms to release energy.

Function duration: 29.659615993499756 seconds


In [11]:
t_end_time = time.time()
t_duration = t_end_time - t_start_time
print("Total duration:", t_duration, "seconds")

Total duration: 44.686596393585205 seconds
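The notebook measures durations by hand with `time.time()` at the start and end of a span. One possible way to reduce that bookkeeping is a small context manager. This is a sketch only: the `timed` helper and the `durations` dictionary are hypothetical names that do not appear in the original notebook, and `time.sleep` stands in for real work such as loading the model.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Measure the wall-clock time of the enclosed block and print it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the value for later comparison
        print(f"{label}: {elapsed:.2f} seconds")


durations = {}
with timed("example step", durations):
    time.sleep(0.01)  # placeholder for real work
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring intervals.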
