<a href="https://colab.research.google.com/github/piesauce/llm-playbooks/blob/loading_llms%2Fateng/Loading_LLMs_Basic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An Introduction to Loading Large Language Models

In this notebook, we will explore how to load LLMs in different sizes, which will be helpful in different compute setups. Specifically, we will take a look at **Llama2**.

*This notebook is written as a companion to, and extension of, Suhas Pai's book [Designing Large Language Model Applications](https://www.oreilly.com/library/view/designing-large-language/9781098150495/).*

The complete repo for this project can be viewed [here](https://github.com/piesauce/llm-playbooks).

**System Requirements**:
This notebook was initially tested using the Google Colab free tier, but the engineers ran into issues when loading larger models and running inference.

The current notebook assumes a basic subscription of [Google Colab Pro](https://colab.research.google.com/signup) as of September 2023.  

Authored by [Amber Teng](https://www.linkedin.com/in/angelavteng) in collaboration with [Yenson Lau](https://www.linkedin.com/in/yensonlau/) and advised by [Suhas Pai](https://www.linkedin.com/in/piesauce/).

In [None]:
# import locale
# locale.getpreferredencoding = lambda: "UTF-8"
# locale.getpreferredencoding()
# !cd
# !pwd
# workaround for error: https://github.com/googlecolab/colabtools/issues/3409

## Resources

- Python 3 Google Compute Engine backend (GPU)
- System RAM: 6.7 / 83.5 GB
- GPU RAM: 16.8 / 40.0 GB
- Disk Space: 80.0 / 166.8 GB
- Hosted Runtime Type: A100

To stay within the GPU RAM and disk space limits, we cleared GPU memory and freed disk space after every model run.
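The notebook does not show how the GPU memory was cleared. One common approach, sketched here as an assumption rather than the exact procedure we used, is to drop the Python references to the model, run the garbage collector, and then ask PyTorch to release its cached CUDA blocks. The helper name `free_gpu_memory` is hypothetical.

```python
import gc


def free_gpu_memory():
    """Collect garbage, then release PyTorch's cached CUDA memory if possible.

    Returns True when the CUDA cache was emptied, False otherwise.
    """
    gc.collect()
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return False
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


# Delete your own references first (for example `del model, tokenizer`),
# otherwise the weights stay reachable and cannot be freed.
released = free_gpu_memory()
print("CUDA cache released:", released)
```

Restarting the Colab runtime is a blunter alternative when memory still appears to be held after this call.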

## Note about Llama2-70B
Currently, the Llama2 70 Billion Parameter Model doesn't work with Google Colab Pro. Our team is exploring AWS as an alternative for loading models of this size. Stay tuned for our later posts! :)

# Install Libraries

Note that we're installing `bitsandbytes` manually. We use this library for 4-bit [quantization](https://huggingface.co/docs/optimum/concept_guides/quantization).

The original paper can be viewed [here](https://arxiv.org/abs/2305.14314), and the HuggingFace blog post can be viewed [here](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
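To see why quantization matters for the compute setups in this notebook, here is a rough back-of-the-envelope estimate of the memory needed just to hold the weights. It is a sketch under stated assumptions: the parameter counts are the nominal 7B, 13B, and 70B figures, and the estimate ignores activations, the KV cache, and the small overhead of quantization constants.

```python
# Approximate storage cost per parameter for each precision (assumed values).
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "4-bit": 0.5}


def weight_memory_gb(num_params, dtype):
    """Return the weights-only memory footprint in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Nominal parameter counts; the real models differ slightly.
for name, num_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    row = ", ".join(
        f"{dtype}: {weight_memory_gb(num_params, dtype):.1f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"Llama2 {name} -> {row}")
```

Comparing these numbers with the GPU RAM listed under Resources gives a quick first check on whether a model has a chance of fitting before you start a long download.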

In [1]:
!pip install huggingface_hub
!pip install transformers
!pip install auto_gptq

!pip install text-generation
!pip install langchain

# Note: `pipeline` is imported from transformers, so no separate package is needed.

%pip install git+https://github.com/bigscience-workshop/petals

!pip install sentencepiece

# Quantization stack: bitsandbytes from PyPI, the other libraries from source.
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git

Collecting huggingface_hub
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Installing collected packages: huggingface_hub
Successfully installed huggingface_hub-0.17.3
Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)

# HuggingFace Login

For this notebook, we will use the HuggingFace libraries. To log in with your own HuggingFace token, please see [this](https://huggingface.co/docs/hub/security-tokens) guide.

In [4]:
from huggingface_hub import notebook_login
notebook_login()


# Loading Llama2 Chat GPTQ Using HuggingFace

Approval to use Llama2 usually takes a day or two after the initial request. To get started while waiting, I tried out a HuggingFace community version of Llama2's 7 Billion Parameter Chat model, quantized with GPTQ, built by [TheBlokeAI](https://github.com/TheBlokeAI) and supported by a16z. The model card can be viewed [here](https://huggingface.co/TheBloke/Llama-2-7b-Chat-GPTQ).

In [6]:
from transformers import AutoTokenizer, pipeline, logging
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

In [18]:
model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"
model_basename = "model"

use_triton = False

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)



In [20]:
prompt = "Outline the steps to load a large language model using Python"
system_message = """You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information."""
prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
# print(tokenizer.decode(output[0]))

# Prevent printing spurious transformers error when using pipeline with AutoGPTQ
logging.set_verbosity(logging.CRITICAL)

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1000,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(pipe(prompt_template)[0]['generated_text'])



*** Generate:
*** Pipeline:
[INST] <<SYS>>
You are a helpful, respectful and honest assistant.
Always answer as helpfully as possible, while being safe.
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct.
If you don't know the answer to a question, please don't share false information.
<</SYS>>

Outline the steps to load a large language model using Python [/INST]  I'm glad you're interested in loading a large language model using Python! However, I must point out that it's important to use ethical and responsible AI practices when working with large language models. Here are some steps you can follow to load a large language model safely and responsibly:
1. Choose a reputable source for your language model: There are many op

**Question Answering:**
The prompt above asks the model about loading LLMs using Python.

**Question:**
Outline the steps to load a large language model using Python

**Answer:**
 I'm glad you're interested in loading a large language model using Python! However, I must point out that it's important to use ethical and responsible AI practices when working with large language models. Here are some steps you can follow to load a large language model safely and responsibly:
1. Choose a reputable source for your language model: There are many open-source language models available online, such as BERT, RoBERTa, and XLNet. Make sure to choose a model from a reputable source to avoid any potential intellectual property issues.
2. Check the model's size and compatibility: Before attempting to load a large language model, make sure it's compatible with your hardware and software setup. You can check the model's size by looking at its documentation or checking the repository where it was hosted.
3. Use a suitable programming environment: To work with large language models, you'll need a programming environment that can handle the necessary computations. Python is a popular choice due to its extensive libraries and frameworks for machine learning and natural language processing (NLP).
4. Load the model using a relevant library: There are several Python libraries available for loading and manipulating large language models, including transformers, Hugging Face's Transformers, and spaCy. Each library has its own strengths and weaknesses, so choose one that best fits your needs.
5. Preprocess the data: Depending on the type of task you want to perform with the language model, you may need to preprocess the input data before feeding it into the model. This could involve tokenization, padding, or other forms of data preparation.
6. Fine-tune the model (optional): If you want to adapt the language model to a specific task or domain, you can fine-tune it using a small amount of task-specific training data. This process involves adjusting the model's weights to better fit the new dataset.
7. Test and evaluate the model: Once you've loaded and fine-tuned the language model, you can test and evaluate its performance on a validation set to see how well it performs.
By following these steps, you can safely and responsibly load a large language model using Python. Remember to always use ethical and responsible AI practices when working with these powerful tools!
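The chat prompt in the cell above follows the Llama2 chat layout: a system message wrapped in `<<SYS>>` tags, nested inside an `[INST] ... [/INST]` block together with the user message. A minimal sketch of a helper that reproduces that layout is shown here; the function name `build_llama2_chat_prompt` is hypothetical and only covers the single-turn case used in this notebook.

```python
def build_llama2_chat_prompt(user_message, system_message):
    """Format a single-turn prompt in the layout used by the cell above."""
    return (
        f"[INST] <<SYS>>\n"
        f"{system_message}\n"
        f"<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )


example = build_llama2_chat_prompt(
    "Outline the steps to load a large language model using Python",
    "You are a helpful, respectful and honest assistant.",
)
print(example)
```

Keeping the template in one function makes it easier to vary the question while leaving the system message untouched.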


# Loading Llama2


In [21]:
from transformers import AutoTokenizer, pipeline
import transformers
import torch

## 7 Billion Parameter Chat Model (HF)

[Model Card](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

[Llama2](https://ai.meta.com/llama/) is an open source large language model developed by Meta.

**Llama2 Chat Models are optimized for dialogue use cases.**

From the [model](https://ai.meta.com/llama/#inside-the-model) website: "Llama 2 pretrained models are trained on 2 trillion tokens, and have double the context length than Llama 1. Its fine-tuned models have been trained on over 1 million human annotations."

Note that Llama2 is a Gated Model, so please do register for a license on the [Meta website](https://ai.meta.com/resources/models-and-libraries/llama-downloads/) before using it.

Note: `max_new_tokens` is set to 1000 for each of these models so that results can be compared for the same question or slight variations of it.

In [22]:
model = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [24]:
sequences = pipeline(
    'Outline the steps to load a large language model using Python.\n',
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    # max_length=200,
    max_new_tokens = 1000
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")


Result: Outline the steps to load a large language model using Python.

Introduction:
Large language models have gained significant attention in recent years due to their impressive performance in various natural language processing (NLP) tasks. These models require significant computational resources and memory, making it challenging to load them onto a local machine. In this article, we will outline the steps to load a large language model using Python.
Step 1: Choose a Language Model
The first step is to choose a large language model that you want to load. There are several popular models available, including BERT, RoBERTa, and XLNet. You can choose one of these models or experiment with different variations.
Step 2: Download the Model
Once you have chosen a language model, you need to download it. You can download the pre-trained model from the model's official website or from a reputable data repository. Make sure to download the correct version of the model, which may vary depend

**Question Answering:**
The prompt above asks the model about loading LLMs using Python.

**Question:**
Outline the steps to load a large language model using Python

**Answer:**

Introduction:
Large language models have gained significant attention in recent years due to their impressive performance in various natural language processing (NLP) tasks. These models require significant computational resources and memory, making it challenging to load them onto a local machine. In this article, we will outline the steps to load a large language model using Python.

Step 1: Choose a Language Model
The first step is to choose a large language model that you want to load. There are several popular models available, including BERT, RoBERTa, and XLNet. You can choose one of these models or experiment with different variations.

Step 2: Download the Model
Once you have chosen a language model, you need to download it. You can download the pre-trained model from the model's official website or from a reputable data repository. Make sure to download the correct version of the model, which may vary depending on the task you are working on.

Step 3: Prepare the Model
After downloading the model, you need to prepare it for use. This involves loading the model into memory and formatting it according to the required format. You can use the `torch` library in Python to load the model and perform other NLP tasks.

Step 4: Load the Model
To load the model into memory, you can use the `torch.load()` function. This function takes the path to the model file and loads it into memory. Here is an example:
```
import torch
model = torch.load("path/to/model.pth")
```

Step 5: Fine-tune the Model
After loading the model, you may want to fine-tune it for your specific task. This involves adjusting the model's weights to improve its performance on your task. You can use the `torch.optim` module to optimize the model's weights. Here is an example:
```
import torch
model = torch.load("path/to/model.pth")
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```

Step 6: Use the Model
Once the model is loaded and fine-tuned, you can use it for your NLP task. You can use the `model()` function to make predictions or perform other NLP tasks. Here is an example:
```
import torch
model = torch.load("path/to/model.pth")
input_text = "This is a sample input text."
output = model(input_text)

```

Conclusion:
In this article, we outlined the steps to load a large language model using Python. These steps involve choosing a language model, downloading the model, preparing the model, loading the model into memory, fine-tuning the model, and using the model for your NLP task. By following these steps, you can leverage the power of large language models for your NLP tasks.

## 7 Billion Parameter Model (HF)

[Model Card](https://huggingface.co/meta-llama/Llama-2-7b-hf)

Note that for this model, we use `bitsandbytes` to load the weights with 4-bit quantization. For more information on model quantization, see the HuggingFace documentation [here](https://huggingface.co/docs/transformers/main_classes/quantization).
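To build some intuition for what 4-bit quantization does to a block of weights, here is a toy sketch in plain Python. It uses simple blockwise absmax rounding to signed 4-bit integers, which is an illustrative simplification and not the NF4 scheme that `bitsandbytes` applies when `bnb_4bit_quant_type="nf4"`. The function names and sample weights are hypothetical.

```python
def quantize_block(values, bits=4):
    """Map floats to signed integer codes using one absmax scale per block."""
    levels = 2 ** (bits - 1) - 1  # 7 for signed 4-bit codes
    scale = max(abs(v) for v in values) / levels
    if scale == 0.0:
        scale = 1.0  # an all-zero block needs no scaling
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.50, 0.33, 0.07]  # made-up sample values
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("restored:", [round(r, 3) for r in restored])
print("max error:", round(max_error, 4))
```

Each weight is stored as a small integer plus one shared scale per block, which is where the memory savings come from; the price is a rounding error of at most half a quantization step per weight.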

In [25]:
import torch  # the config below refers to torch.bfloat16
import transformers
from transformers import AutoTokenizer, pipeline, AutoModelForCausalLM, BitsAndBytesConfig

In [26]:
model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [31]:
text = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.


Question: What are the steps to load a large language model using Python? \n

Answer: """

device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model_4bit.generate(**inputs, max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.


Question: What are the steps to load a large language model using Python? 
 

Answer: 
You can use the following steps to load a large language model using Python:

1.Download the model: You can download the model from the internet and save it on your computer.

2.Import the necessary libraries: You will need to import the necessary libraries for loading the model, such as numpy, pandas, and tensorflow.

3.Define the model: You will need to define the model by specifying the vocabula

**Question Answering:**
The prompt above asks the model about loading LLMs using Python.

**Question:**
What are the steps to load a large language model using Python?

**Answer:**

You can use the following steps to load a large language model using Python:

1.Download the model: You can download the model from the internet and save it on your computer.

2.Import the necessary libraries: You will need to import the necessary libraries for loading the model, such as numpy, pandas, and tensorflow.

3.Define the model: You will need to define the model by specifying the vocabulary and the number of dimensions.

4.Load the model: You will need to load the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

5.Train the model: You will need to train the model by providing the input text and the desired output text.

6.Evaluate the model: You will need to evaluate the model by providing the input text and the desired output text and comparing the output to the desired output.

7.Save the model: You will need to save the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

8.Load the model: You will need to load the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

9.Evaluate the model: You will need to evaluate the model by providing the input text and the desired output text and comparing the output to the desired output.

10.Save the model: You will need to save the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

11.Load the model: You will need to load the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

12.Evaluate the model: You will need to evaluate the model by providing the input text and the desired output text and comparing the output to the desired output.

13.Save the model: You will need to save the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

14.Load the model: You will need to load the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

15.Evaluate the model: You will need to evaluate the model by providing the input text and the desired output text and comparing the output to the desired output.

16.Save the model: You will need to save the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

17.Load the model: You will need to load the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

18.Evaluate the model: You will need to evaluate the model by providing the input text and the desired output text and comparing the output to the desired output.

19.Save the model: You will need to save the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

20.Load the model: You will need to load the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

21.Evaluate the model: You will need to evaluate the model by providing the input text and the desired output text and comparing the output to the desired output.

22.Save the model: You will need to save the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

23.Load the model: You will need to load the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

24.Evaluate the model: You will need to evaluate the model by providing the input text and the desired output text and comparing the output to the desired output.

25.Save the model: You will need to save the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

26.Load the model: You will need to load the model by specifying the vocabulary, the number of dimensions, and the file path where the model is saved.

27.Evaluate the model: You will need to evaluate the model by providing the input text and the desired output text and comparing the output to the desired


<i>**Note:** The output here is interesting because the model appears to get stuck in a loop, repeating the load, evaluate, and save steps until the output is cut off. This is to be discussed and explored further in a blog post.</i>
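As a starting point for that exploration, the looping can be quantified with a few lines of plain Python. The sketch below strips the leading step numbers and reports what fraction of the non-empty lines occur more than once. The helper name `repeated_line_fraction` and the sample text are hypothetical; the metric is only a rough indicator.

```python
import re
from collections import Counter


def repeated_line_fraction(text):
    """Return the fraction of non-empty lines that appear more than once,
    ignoring leading step numbers such as '12.'."""
    lines = []
    for raw in text.splitlines():
        line = re.sub(r"^\s*\d+\.\s*", "", raw).strip()
        if line:
            lines.append(line)
    if not lines:
        return 0.0
    counts = Counter(lines)
    repeated = sum(n for n in counts.values() if n > 1)
    return repeated / len(lines)


sample = "\n".join([
    "1.Download the model",
    "2.Load the model",
    "3.Save the model",
    "4.Load the model",
    "5.Save the model",
])
print(repeated_line_fraction(sample))
```

A value close to 1 suggests the generation has collapsed into a cycle, which makes it easy to compare decoding settings against each other.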

## 13 Billion Parameter Model (HF)

[Model Card](https://huggingface.co/meta-llama/Llama-2-13b-hf)




In [32]:
model_id = "meta-llama/Llama-2-13b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

In [34]:
text = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

Q: What are the steps to load a large language model using Python? Please explain the answer to me in terms that a 5th grader can understand. Thank you. \n

A: """

device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model_4bit.generate(**inputs, max_new_tokens=1000)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))


You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

Q: What are the steps to load a large language model using Python? Please explain the answer to me in terms that a 5th grader can understand. Thank you. 
 

A: 

  1. Download the language model. 
  2. Import the language model. 
  3. Load the language model. 
  4. Use the language model. 

 



**Question Answering:**
The prompt above asks the model about loading LLMs using Python.

**Question:**
What are the steps to load a large language model using Python? Please explain the answer to me in terms that a 5th grader can understand. Thank you.

**Answer:**

A:

  1. Download the language model.
  2. Import the language model.
  3. Load the language model.
  4. Use the language model.


## 70 Billion Parameter Model (HF) [In Progress]

[Model Card](https://huggingface.co/meta-llama/Llama-2-70b-hf)

In [None]:
from petals import AutoDistributedModelForCausalLM

In [None]:
model_name = "meta-llama/Llama-2-70b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_name, add_bos_token=False)
# tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False, add_bos_token=False)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)
model = model.cuda()

Downloading (…)okenizer_config.json:   0%|          | 0.00/776 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/414 [00:00<?, ?B/s]



Downloading (…)lve/main/config.json:   0%|          | 0.00/609 [00:00<?, ?B/s]

Sep 03 23:35:12.791 [INFO] Make sure you follow the LLaMA's terms of use: https://bit.ly/llama2-license for LLaMA 2, https://bit.ly/llama-license for LLaMA 1
Sep 03 23:35:12.792 [INFO] Using DHT prefix: Llama-2-70b-hf


Downloading (…)fetensors.index.json:   0%|          | 0.00/66.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/9.85G [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/9.50G [00:00<?, ?B/s]

Downloading (…)of-00015.safetensors:   0%|          | 0.00/524M [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]



Downloading (…)neration_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [None]:
inputs = tokenizer('A cat in French is "', return_tensors="pt")["input_ids"].cuda()
outputs = model.generate(inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))

Sep 03 23:36:42.091 [WARN] [petals.client.inference_session.step:327] Caught exception when running inference via None (retry in 0 sec): MissingBlocksError("No servers holding blocks [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79] are online. You can check the public swarm's state at https://health.petals.dev If there are not enough servers, please connect your GPU: https://github.com/bigscience-workshop/petals#connect-your-gpu-and-increase-petals-capacity ")
Sep 03 23:36:45.525 [WARN] [petals.client.inference_session.step:327] Caught exception when running inference via None (retry in 1 sec): MissingBlocksError("No servers holding blocks [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 5

KeyboardInterrupt: ignored

In [None]:
model_id = "meta-llama/Llama-2-70b-hf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

[Download progress bars omitted: tokenizer and config files, the safetensors index, and 15 safetensors shards (fourteen of roughly 9.5 to 9.97 GB each and one of 524 MB).]

Sep 04 02:35:26.526 [INFO] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).


ValueError: ignored

In [None]:
model_4bit = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", max_memory={0: "38GIB", "cpu": "38GIB"})

Loading checkpoint shards:   0%|          | 0/15 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/188 [00:00<?, ?B/s]

In [None]:
text = """
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.

Question:
How do you load a large language model like Llama2 using Hugging Face transformers?"""

device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)

outputs = model_4bit.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

RuntimeError: ignored

In [None]:
inputs

{'input_ids': tensor([[    1, 29871,    13,  3492,   526,   263,  8444, 29892,  3390,  1319,
           322, 15993, 20255, 29889, 29849,  1234,   408,  1371,  3730,   408,
          1950, 29892,  1550,  1641,  9109, 29889, 29871,  3575,  6089,   881,
           451,  3160,   738, 10311,  1319, 29892,   443,   621,   936, 29892,
         11021,   391, 29892,  7916,   391, 29892,   304, 27375, 29892, 18215,
         29892,   470, 27302,  2793, 29889,  3529,  9801,   393,   596, 20890,
           526,  5374,   635,   443,  5365,  1463,   322,  6374,   297,  5469,
         29889,    13,    13,  3644,   263,  1139,   947,   451,  1207,   738,
          4060, 29892,   470,   338,   451,  2114,  1474, 16165,   261,   296,
         29892,  5649,  2020,  2012,   310, 22862,  1554,   451,  1959, 29889,
           960,   366,  1016, 29915, 29873,  1073,   278,  1234,   304,   263,
          1139, 29892,  3113,  1016, 29915, 29873,  6232,  2089,  2472, 29889,
            13,    13, 16492, 29901, 2

# Loading StarCoder [In Progress]

I also tried loading a code-focused model, [StarCoder](https://huggingface.co/blog/starcoder). The model card can be viewed [here](https://huggingface.co/bigcode/starcoder).

Note that this is also a Gated Model, so please apply for access before running this code.

From the model card: "The StarCoder models are 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2), with opt-out requests excluded. The model uses Multi Query Attention, a context window of 8192 tokens, and was trained using the Fill-in-the-Middle objective on 1 trillion tokens."

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

In [4]:
checkpoint = "bigcode/starcoder"

model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Loading checkpoint shards:   0%|          | 0/7 [00:00<?, ?it/s]

In [7]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0)
print( pipe("def hello():") )

OutOfMemoryError: ignored

In [5]:
pipe = pipeline("text-generation", model=model, max_new_tokens=1000,
                tokenizer=tokenizer, device=0)
print( pipe("Write code to load an LLM.") )

OutOfMemoryError: ignored
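One plausible reason for the `OutOfMemoryError` above, offered as an assumption rather than a confirmed diagnosis: `from_pretrained` loads weights in float32 unless told otherwise, and 15.5B parameters at 4 bytes each come to about 62 GB, which is more than the 40 GB of GPU RAM listed under Resources. The sketch below shows the arithmetic and one way to request half precision instead. The helper `load_starcoder_fp16` is hypothetical and has not been run as part of this notebook, so treat it as something to try, not a verified fix.

```python
CHECKPOINT = "bigcode/starcoder"
NUM_PARAMS = 15.5e9   # parameter count quoted from the model card
GPU_RAM_GB = 40.0     # GPU RAM listed under Resources


def weights_gb(num_params, bytes_per_param):
    """Weights-only memory estimate in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def load_starcoder_fp16(checkpoint=CHECKPOINT):
    """Load the model in float16 and let accelerate place it on the GPU."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(
        checkpoint,
        torch_dtype=torch.float16,  # 2 bytes per parameter instead of 4
        device_map="auto",          # placement handled by accelerate
    )
    return model, tokenizer


print("float32 weights (GB):", weights_gb(NUM_PARAMS, 4))
print("float16 weights (GB):", weights_gb(NUM_PARAMS, 2))
```

If the model is loaded this way, the `pipeline` call should omit `device=0`, since the model has already been placed by `device_map="auto"`. The 4-bit `BitsAndBytesConfig` used for Llama2 earlier is another option to explore.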