<a href="https://colab.research.google.com/github/lamcnguyen89/CAP_6411_Assignments/blob/main/Assignment_04/Assignment_04_Mistral_and_Mixtral.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Assignment 4: Mistral vs Mixtral Pipeline**



*   Go to https://people.eecs.berkeley.edu/~hendrycks/data.tar
*   Colab should have both Mixtral and Mistral 7B pipelines
*   Compare both in terms of speed and accuracy
*   Bonus: 100 points for applying SMoE to other models
*   Due 03Sep2024
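
One way to make the speed comparison concrete is to time generation and report tokens per second. The following is a minimal sketch, not the notebook's actual benchmark: `measure_throughput` and `fake_generate` are hypothetical names, and `generate_fn` is assumed to be any callable that takes a prompt and returns the number of newly generated tokens.

```python
import time


def measure_throughput(generate_fn, prompt, n_runs=3):
    """Time a text-generation callable and report tokens per second.

    generate_fn is a hypothetical callable: it receives the prompt string
    and returns the number of tokens it generated.
    """
    total_tokens = 0
    start = time.perf_counter()
    for _ in range(n_runs):
        total_tokens += generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return {
        "seconds": elapsed,
        "tokens": total_tokens,
        "tokens_per_second": total_tokens / elapsed if elapsed > 0 else float("inf"),
    }


# Stand-in for a real model call, used only to exercise the helper.
def fake_generate(prompt):
    time.sleep(0.01)  # pretend the model is working
    return 20         # pretend 20 new tokens were produced


stats = measure_throughput(fake_generate, "Hello", n_runs=3)
print(stats)
```

With a real model, `generate_fn` could wrap `model.generate` and return the length of the output minus the length of the prompt. Running the same prompts and the same `max_new_tokens` through both models keeps the comparison fair.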





**Mistral-7b** \
Mistral-7B is a 7-billion-parameter Large Language Model released by Mistral AI, trained on a large dataset of text and code.

Link: https://huggingface.co/mistralai/Mistral-7B-v0.1/discussions/13


**Mixtral 8x7b** \
Mixtral 8x7b is a high-quality sparse mixture of experts (SMoE) model with open weights created by Mistral AI. It outperforms Llama 2 70B on most benchmarks and matches or beats GPT-3.5 on most standard benchmarks.

Link: https://github.com/dvmazur/mixtral-offloading/blob/master/notebooks/demo.ipynb
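
To illustrate what "sparse mixture of experts" means, here is a toy sketch of top-2 routing in plain Python. In Mixtral, each layer has 8 experts and a learned router selects 2 of them per token; in this sketch the router scores are simply given as a list, the "experts" are toy scaling functions, and the names `top_k_route` and `smoe_layer` are made up for illustration.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)  # subtract the max for numerical stability
    exps = [math.exp(router_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def smoe_layer(x, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # the other experts are never evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Eight toy "experts": expert i scales its input by (i + 1).
experts = [lambda v, s=i + 1: [s * a for a in v] for i in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]  # made-up router scores

routes = top_k_route(logits)
output = smoe_layer([1.0, 2.0], experts, logits)
print(routes, output)
```

Only the selected experts run for a given input, which is why an SMoE model can hold many parameters while spending compute on a fraction of them per token.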


# **Load Mistral-7B Model and Test**

In [4]:
# Install the dependencies for the Mistral-7B pipeline
!pip -q install git+https://github.com/huggingface/transformers # need to install from github
!pip -q install bitsandbytes accelerate xformers einops
!pip -q install langchain
!pip install datasets
!pip install --upgrade huggingface_hub

# In Google Colab, create a secret named HF_TOKEN (Secrets panel) that holds a Hugging Face access token; it is used to download the pretrained models
from google.colab import userdata
secret = userdata.get('HF_TOKEN')
!huggingface-cli login --token {secret}

# Clear the verbose install and login output from the cell
from IPython.display import clear_output
clear_output()



In [2]:
!nvidia-smi

Tue Sep  3 02:07:11 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA A100-SXM4-40GB          Off | 00000000:00:04.0 Off |                    0 |
| N/A   32C    P0              45W / 400W |      2MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                    

In [3]:
import torch
import transformers
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments

torch.set_default_device('cuda')



In [4]:
# Import the pretrained model and load it into memory
# Note: I required a GPU with 40 GB of memory for this project.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1",
                                             torch_dtype="auto")

# The tokenizer holds no weights, so it takes no torch_dtype argument
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [5]:
# How to Prompt Mistral AI Models: https://community.aws/content/2dFNOnLVQRhyrOrMsloofnW0ckZ/how-to-prompt-mistral-ai-models-and-why


text = "<s>[INST] What is your favourite condiment? [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> [INST] Do you have mayonnaise recipes? [/INST]"

encodeds = tokenizer(text, return_tensors="pt", add_special_tokens=False)

device = 'cuda'
model_inputs = encodeds.to(device)
model.to(device)

generated_ids = model.generate(**model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Starting from v4.46, the `logits` model output will have the same type as the model (except at train time, where it will always be FP32)


<s>[INST] What is your favourite condiment? [/INST]Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!</s> [INST] Do you have mayonnaise recipes? [/INST] Oh, absolutely! Here's a classic homemade mayonnaise recipe that's sure to impress:

Ingredients:

* 2 large egg yolks
* 1 tablespoon dijon mustard
* 2 tablespoons white wine vinegar
* 1 clove garlic, minced
* 1/2 teaspoon salt
* 1/2 cup olive oil
* 1/4 cup vegetable oil
* Fresh herbs or spices, to taste (optional)

Instructions:

1. In a small bowl, whisk together the egg yolks, dijon mustard, white wine vinegar, minced garlic, and salt until well combined.
2. Slowly stream in the olive oil and vegetable oil, whisking continuously to create a smooth, creamy sauce.
3. Taste and adjust seasoning as desired.
4. Store any leftover mayonnaise in an airtight container in the fridge for up to 3 days.

And there you have it! A simple, yet de
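
The prompt string used in the cell above follows the Mistral instruct format: each user turn is wrapped in `[INST] ... [/INST]` and each completed assistant turn ends with `</s>`. A small helper can build that string from a list of turns. This is a sketch only; `build_mistral_prompt` is a hypothetical name, and it reproduces the format of the hand-written prompt above rather than any official template.

```python
def build_mistral_prompt(turns):
    """Build an [INST]-style prompt from (user, assistant) pairs.

    The final pair may use assistant=None, which leaves the prompt open
    so the model writes the next reply.
    """
    prompt = "<s>"
    for user, assistant in turns:
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f"{assistant}</s> "
    return prompt


turns = [
    ("What is your favourite condiment?",
     "Well, I'm quite partial to a good squeeze of fresh lemon juice. "
     "It adds just the right amount of zesty flavour to whatever I'm "
     "cooking up in the kitchen!"),
    ("Do you have mayonnaise recipes?", None),
]
prompt = build_mistral_prompt(turns)
print(prompt)
```

Because the string already contains `<s>`, it should be tokenized with `add_special_tokens=False`, as in the cell above. Recent `transformers` releases also offer `tokenizer.apply_chat_template` for the same purpose.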

# **Load and Process Dataset**


In [6]:
# Source: https://github.com/mddunlap924/PyTorch-LLM/blob/main/notebooks/training.ipynb
!ls
validation_dataset = load_dataset("Assignment_04_Dataset")
print(validation_dataset)

Assignment_04_Dataset  sample_data
DatasetDict({
    validation: Dataset({
        features: ['Question', 'Answer_01', 'Answer_02', 'Answer_03', 'Answer_04', 'Answer_05'],
        num_rows: 1516
    })
})
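
For the accuracy half of the comparison, each row has to be turned into a multiple-choice prompt and the model's reply scored against the key. The sketch below shows one way to do that. It rests on an assumption that is not confirmed by the output above: that `Answer_01` to `Answer_04` hold options A to D and `Answer_05` holds the correct letter. The example row is invented for illustration and is not taken from the dataset.

```python
import re

# ASSUMPTION: these four columns are the options A-D ...
CHOICE_COLUMNS = ["Answer_01", "Answer_02", "Answer_03", "Answer_04"]
# ... and this column stores the correct letter.
LABEL_COLUMN = "Answer_05"


def format_question(row):
    """Render one dataset row as a lettered multiple-choice prompt."""
    lines = [row["Question"]]
    for letter, column in zip("ABCD", CHOICE_COLUMNS):
        lines.append(f"{letter}. {row[column]}")
    lines.append("Answer:")
    return "\n".join(lines)


def extract_letter(generated_text):
    """Return the first standalone capital A-D in the model output, or None."""
    match = re.search(r"\b([A-D])\b", generated_text)
    return match.group(1) if match else None


def accuracy(predictions, references):
    """Fraction of predictions that equal the reference letter."""
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Invented row, for illustration only.
row = {"Question": "What is 2 + 2?", "Answer_01": "3", "Answer_02": "4",
       "Answer_03": "5", "Answer_04": "6", "Answer_05": "B"}

print(format_question(row))
print(accuracy([extract_letter("The answer is B.")], [row[LABEL_COLUMN]]))
```

The letter extraction is deliberately simple and would misread a reply that starts with the article "A"; constraining the model to answer with a single letter, or comparing the logits of the four letter tokens, would be more robust.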


# **Load Mixtral Pipeline**

In [1]:
!pip install aqlm[gpu]==1.0.1
!pip install git+https://github.com/huggingface/accelerate.git@main
!pip install git+https://github.com/BlackSamorez/transformers.git@aqlm

Collecting aqlm==1.0.1 (from aqlm[gpu]==1.0.1)
  Downloading aqlm-1.0.1-py3-none-any.whl.metadata (1.6 kB)
Collecting transformers==4.37.0 (from aqlm==1.0.1->aqlm[gpu]==1.0.1)
  Downloading transformers-4.37.0-py3-none-any.whl.metadata (129 kB)
Collecting triton>=2.1 (from aqlm[gpu]==1.0.1)
  Downloading triton-3.0.0-1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.3 kB)
Collecting ninja (from aqlm[gpu]==1.0.1)
  Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5_x86_64.whl.metadata (5.3 kB)
Collecting tokenizers<0.19,>=0.14 (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1)
  Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading aqlm-1.0.1-py3-none-any.whl (10 kB)
Downloading transformers-4.37.0-py3-none-any.whl (8.4 MB)

In [2]:
!pip install transformers==4.37.0

# Source: https://medium.com/gitconnected/the-2-bit-quantization-is-insane-see-how-to-run-mixtral-8x7b-on-free-tier-colab-2803e39b9b9d


Collecting transformers==4.37.0
  Using cached transformers-4.37.0-py3-none-any.whl.metadata (129 kB)
Using cached transformers-4.37.0-py3-none-any.whl (8.4 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.38.0.dev0
    Uninstalling transformers-4.38.0.dev0:
      Successfully uninstalled transformers-4.38.0.dev0
Successfully installed transformers-4.37.0


In [5]:
from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained(
    "BlackSamorez/Mixtral-8x7b-AQLM-2Bit-1x16-hf-test-dispatch",
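    torch_dtype="auto", device_map="cuda", low_cpu_mem_usage=True, trust_remote_code=True,
)
# The tokenizer comes from the original Mixtral repository, which lives under the mistralai organisation
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")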

KeyboardInterrupt: 
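
A rough weight-memory estimate shows why the Mixtral pipeline uses a 2-bit quantized checkpoint on a 40 GB GPU. The sketch below assumes roughly 46.7 billion total parameters for Mixtral 8x7B and about 7.2 billion for Mistral-7B (approximate published figures), and it counts weights only: activations, the KV cache, and quantization overhead such as codebooks are ignored. `weight_memory_gib` is a hypothetical helper name.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


MIXTRAL_PARAMS = 46.7e9  # approximate total parameter count (assumption)
MISTRAL_PARAMS = 7.2e9   # approximate parameter count (assumption)

mixtral_16bit = weight_memory_gib(MIXTRAL_PARAMS, 16)
mixtral_2bit = weight_memory_gib(MIXTRAL_PARAMS, 2)
mistral_16bit = weight_memory_gib(MISTRAL_PARAMS, 16)

print(f"Mixtral 8x7B, 16-bit: {mixtral_16bit:.1f} GiB")
print(f"Mixtral 8x7B,  2-bit: {mixtral_2bit:.1f} GiB")
print(f"Mistral-7B,   16-bit: {mistral_16bit:.1f} GiB")
```

Under these assumptions the 16-bit Mixtral weights alone exceed the 40 GB available on the A100 shown earlier, while the 2-bit version fits with room to spare.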