This notebook downloads "mistralai/Mixtral-8x7B" and then uses it for inference.

This model is huge, so this notebook was written on a machine with 8x A100s (320 GB of GPU memory in total).
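
To see why that much GPU memory is useful, here is a rough back-of-the-envelope sketch of the memory needed for the weights alone. The parameter count (about 46.7 billion) is an assumption taken from the commonly quoted figure for Mixtral 8x7B, and the helper name is made up for illustration. Activations, the KV cache, and framework overhead are ignored, so treat the numbers as a lower bound.

```python
def estimate_weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# Assumption: roughly 46.7 billion parameters in total for Mixtral 8x7B.
MIXTRAL_PARAMS = 46.7e9

# From the text above: 8 GPUs with 320 GB in total, i.e. 40 GB each.
TOTAL_GPU_GB = 8 * 40

for dtype_name, nbytes in [("float32", 4), ("float16/bfloat16", 2)]:
    needed = estimate_weight_memory_gb(MIXTRAL_PARAMS, nbytes)
    fits = "fits" if needed < TOTAL_GPU_GB else "does not fit"
    print(f"{dtype_name}: ~{needed:.1f} GB of weights, {fits} in {TOTAL_GPU_GB} GB")
```

Under these assumptions the weights fit on this machine even in 32-bit precision, but would not fit on a single 40 GB GPU in either precision, which is why the model is spread across devices.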

In [1]:
import torch
torch.cuda.is_available()

True

In [2]:
import time

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "mistralai/Mixtral-8x7B"

# Load the tokenizer and the model. device_map="auto" relies on the accelerate
# package to spread the weights across the available GPUs.
t1 = time.perf_counter()
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
t2 = time.perf_counter()
print(f"Loading tokenizer and model: took {t2-t1} seconds to execute.")

# Create a text-generation pipeline around the already-loaded model
pipe = pipeline('text-generation', model=model, tokenizer=tokenizer)
t3 = time.perf_counter()
print(f"Creating pipeline: took {t3-t2} seconds to execute.")


Loading checkpoint shards:   0%|          | 0/19 [00:00<?, ?it/s]

Loading tokenizer and model: took 68.24930608200066 seconds to execute.
Creating pipeline: took 0.0003688449996843701 seconds to execute.
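
After loading with device_map="auto", it can be useful to check where the weights ended up. Models loaded this way expose the placement as a dictionary in model.hf_device_map. The sketch below counts entries per device. The helper name and the example placement are hypothetical and only illustrate the shape of the data; they are not output from this notebook.

```python
from collections import Counter

def summarize_device_map(device_map):
    """Count how many module entries were placed on each device."""
    return dict(Counter(str(device) for device in device_map.values()))

# Hypothetical placement, for illustration only (not taken from this run).
example_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": "cpu",
}
print(summarize_device_map(example_map))

# In the notebook itself this would be called as:
# summarize_device_map(model.hf_device_map)
```

Any entries reported on "cpu" or "disk" would mean part of the model did not fit on the GPUs and was offloaded, which usually slows inference considerably.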


In [3]:
# Generate a text completion for a prompt typed in by the user
print("\n=========Please type in your prompt=========================\n")
# strip() returns a new string, so the result has to be kept
user_content = input("\nPrompt: ").strip()
# Instruction-style prompt format, left disabled here
#prompt=f"[INST] {user_content} [/INST]"
prompt = user_content
print("Running inference on your input.\n")
t1 = time.perf_counter()
# max_length counts the prompt tokens as well as the generated ones
output = pipe(prompt, max_length=256)[0]['generated_text']
t2 = time.perf_counter()
print(f"Inferencing using the model: took {t2-t1} seconds to execute.")
print(output)






Prompt:  The accelerate python library can be used to load very large AI models across multiple GPUs. The way it works is 


Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Running inference on your input.

Inferencing using the model: took 26.911936888999662 seconds to execute.
The accelerate python library can be used to load very large AI models across multiple GPUs. The way it works is 1) it loads the model into memory, 2) it splits the model into chunks, 3) it loads the chunks into the GPUs, 4) it runs the model on the GPUs, 5) it unloads the chunks from the GPUs, 6) it unloads the model from memory.

The problem is that the model is loaded into memory before it is split into chunks. This means that the model must fit into memory before it can be loaded onto the GPUs. This is a problem because the model is too large to fit into memory.

The solution is to use the accelerate python library to load the model into memory, split the model into chunks, and load the chunks into the GPUs. This will allow the model to be loaded onto the GPUs without having to fit into memory.

## How to use the accelerate python library to load very large AI models across mu