<a href="https://colab.research.google.com/github/itsual/AI_Experiments/blob/main/Mistral_7B_LLM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Running Mistral-7B on a Single GPU with Google Colab**

The code below registers a custom CSS style that is applied before every cell runs, so that long output lines wrap instead of overflowing the cell.

In [1]:
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))

get_ipython().events.register('pre_run_cell', set_css)


The code below installs bitsandbytes from PyPI and the latest development versions of transformers, peft, and accelerate from their GitHub repositories.

In [2]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m92.6/92.6 MB[0m [31m11.9 MB/s[0m eta [36m0:00:00[0m
[?25h  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m302.0/302.0 kB[0m [31m5.3 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m55.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m73.7 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.0/295.0 kB[0m [31m28.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Building wheel for transformers (pyproject.toml) ... [?25l[?25hdone
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproje
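Because three of the four packages are installed from their development branches, the exact versions change from one run to the next. A minimal sketch for recording what was actually installed, using only the standard library; the helper name `installed_versions` is hypothetical and not part of the original notebook:

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string}, or None for packages that are not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# The four packages installed by the cell above.
print(installed_versions(["bitsandbytes", "transformers", "peft", "accelerate"]))
```

Keeping this printout next to the results makes it easier to reproduce a run later.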

This code defines the quantization parameters that reduce the model's memory footprint and computation cost.
The settings are passed to the model loader so that the weights from the Hugging Face Hub are quantized as they are loaded.


In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

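bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the linear-layer weights in 4 bits
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16   # dtype used for the actual computation
)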
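To make the idea behind these settings more concrete, here is a minimal pure-Python sketch of block-wise quantization: a block of weights is stored as small integer codes plus one scale, and approximate floats are recovered at compute time. This is an illustration under simplifying assumptions, not what bitsandbytes does internally: it uses a uniform 4-bit grid rather than the NF4 code book, and the function names and the sample weights are made up for the example.

```python
# Illustrative only: uniform 4-bit absmax quantization, NOT the NF4 code book.

def quantize_block(values, bits=4):
    """Map floats to signed integer codes plus one scale for the whole block."""
    max_code = 2 ** (bits - 1) - 1            # 7 for 4 bits
    absmax = max(abs(v) for v in values)
    scale = absmax / max_code if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the codes and the block scale."""
    return [c * scale for c in codes]

# Hypothetical weights for one small block.
weights = [0.12, -0.50, 0.33, 0.07, -0.21, 0.44, -0.02, 0.29]
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(codes)
print(f"scale={scale:.4f}, max abs error={max_error:.4f}")
```

The rounding error per weight is at most half of the block scale, which is why keeping one scale per small block preserves more precision than a single scale for a whole tensor.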

Load the model and its tokenizer, applying the quantization settings at load time.

In [4]:
model_id = "mistralai/Mistral-7B-Instruct-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)


In [5]:
# This cell prints the model's architecture to show the changes made by the quantization.
print(model)

MistralForCausalLM(
  (model): MistralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MistralDecoderLayer(
        (self_attn): MistralAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MistralRotaryEmbedding()
        )
        (mlp): MistralMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): MistralRMSNorm()
        (post_attention_layernorm): MistralRMSNorm()
      )
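The shapes in the printed architecture are enough for a rough estimate of how much memory the quantized layers save. The sketch below counts only the `Linear4bit` weights listed above; it ignores the embedding, the normalization layers, anything cut off from the printed output, and the extra storage for quantization constants, so the real footprint is somewhat larger.

```python
# (in_features, out_features) of each Linear4bit in one decoder layer,
# copied from the printed architecture.
LINEAR_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # "(0-31): 32 x MistralDecoderLayer"

params_per_layer = sum(i * o for i, o in LINEAR_SHAPES.values())
quantized_params = params_per_layer * NUM_LAYERS

GIB = 1024 ** 3
mem_16bit_gib = quantized_params * 2 / GIB    # 2 bytes per weight
mem_4bit_gib = quantized_params * 0.5 / GIB   # half a byte per weight

print(f"quantized weights: {quantized_params:,}")
print(f"16-bit storage: {mem_16bit_gib:.2f} GiB")
print(f"4-bit storage:  {mem_4bit_gib:.2f} GiB")
```

Under these assumptions the linear layers shrink from about 13 GiB to about 3.25 GiB, which is what lets the model fit on a single Colab GPU.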

In [6]:
model.hf_device_map

{'': 0}

Test the model by running inference on a few prompts.

In [10]:
device = "cuda:0"

messages = [
    {"role": "user", "content": "What is death spiral in context of boiler water treatment?"}
]


encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)


generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] What is death spiral in context of boiler water treatment? [/INST] A death spiral may occur in the context of boiler water treatment when the pH of the water becomes too acidic, too basic, or too high in dissolved solids. This can have a negative impact on the performance and efficiency of the boiler, as well as on the health and safety of the operation.

When water becomes acidic due to the presence of acidic chemicals such as hydrochloric acid, the corrosion rate of the boiler system increases. As a result, the system can undergo a death spiral of corrosion and deterioration, which can ultimately lead to costly repairs or replacement.

Similarly, when water becomes too basic due to the presence of chemicals such as sodium hydroxide or lye, the system can become susceptible to scaling or mineral buildup, which can affect the efficiency of the boiler and increase the risk of equipment failure.

High dissolved solids in the water can also contribute to a death spiral in boile
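The generation calls in this notebook pass `do_sample=True`, so rerunning a cell can produce a different answer each time. The toy sketch below contrasts greedy selection with sampling from a softmax distribution. It is a simplified illustration with made-up logits and hypothetical helper names; the real `generate` method may apply further filtering (such as top-k or top-p) depending on the generation configuration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def pick_token(logits, do_sample, rng, temperature=1.0):
    """Greedy: always take the top score. Sampling: draw according to the probabilities."""
    if not do_sample:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0]   # made-up scores for a 4-token vocabulary
rng = random.Random(0)

greedy = [pick_token(toy_logits, False, rng) for _ in range(5)]
sampled = [pick_token(toy_logits, True, rng) for _ in range(5)]
print("greedy :", greedy)
print("sampled:", sampled)
```

Greedy selection returns the same token every time, while sampling occasionally picks lower-scored tokens, which is the source of run-to-run variation in the outputs shown here.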

In [11]:
messages = [
    {"role": "user", "content": "write a python script to print triangle pattern"}
    ]

encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] write a python script to print triangle pattern [/INST] Here is a simple Python script to print a triangle pattern using a loop:
```python
n = 5

for i in range(n):
    # Print spaces for the horizontal line of the triangle
    for j in range(n - i - 1):
        print(" ", end="")
    
    # Print the stars for the top down triangles
    for j in range(2 * i + 1):
        print("*", end="")
        
    # Move to the next line
    print()
```
You can run this script and adjust the 'n' variable value to print triangles with different number of rows.</s>
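The decoded text echoes the prompt between `[INST]` and `[/INST]` and keeps the `<s>` and `</s>` markers. A small helper can isolate just the reply. This is a sketch based only on the output format visible above; the function name `extract_reply` is hypothetical, and the sample string is shortened for the example.

```python
def extract_reply(decoded, inst_close="[/INST]", eos="</s>"):
    """Return the text after the last instruction-closing tag, without the end marker."""
    # rpartition leaves the whole string in the last slot if the tag is absent.
    _, _, reply = decoded.rpartition(inst_close)
    reply = reply.strip()
    if reply.endswith(eos):
        reply = reply[: -len(eos)]
    return reply.strip()

sample = (
    "<s> [INST] write a python script to print triangle pattern [/INST] "
    "Here is a simple Python script.</s>"
)
print(extract_reply(sample))
```

The tokenizer's decode methods also accept `skip_special_tokens=True`, which drops `<s>` and `</s>` at decoding time; the `[INST]` tags are ordinary text, so the echoed prompt still has to be removed separately.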


In [12]:
PROMPT = """ ### Instruction: Act as a data science expert.
### Question:
Explain what is generative AI. Assume that I am a teenager

### Answer:
"""

encodeds = tokenizer(PROMPT, return_tensors="pt", add_special_tokens=True)
model_inputs = encodeds.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True, pad_token_id=tokenizer.eos_token_id)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])



<s>  ### Instruction: Act as a data science expert.
### Question:
Explain what is generative AI. Assume that I am a teenager

### Answer:
Generative AI is a type of artificial intelligence technology where an AI system can create brand new content, such as images or videos, that are similar to a dataset of content that it was trained on. This is different than other types of AI, called "discriminative AI," where the AI system can classify something into categories or identify patterns in the existing data, but can't create new, unique content. With generative AI, the AI system can "generate" new content that looks a lot like the content in the dataset it was trained on, and it can do this without being explicitly programmed to make a certain image or video.

Here is an example to make it easier to understand: Imagine you train an AI system with a large dataset of pictures of cats. The AI system can learn to recognize different types of cats, like a Siamese or a Persian cat, by looking 