<a href="https://colab.research.google.com/github/kaiu85/llm-workshop/blob/main/Transformers/Mistral_Pipe_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# State of the Art Models at your Fingertips

## Getting state-of-the-art open-source models
Again, we will use the wonderful services provided by [HuggingFace](https://huggingface.co/) to download and run state-of-the-art open-source models locally and to adapt their behavior by appropriately structuring their inputs and outputs.

## Model background
This notebook uses a small member (7 billion trainable parameters) of the open-source Mistral model family, which was trained by the [Mistral.ai](https://mistral.ai/) team. While these models are free and open-source, the company makes money by serving them on its own platform as a paid service. You can find more information about this model and the model family in the company's [blog post](https://mistral.ai/news/announcing-mistral-7b/).

You can also choose a different version of this model or another model later on (e.g., Llama 3 8B, trained by Meta, or a Phi model, trained by Microsoft). Please take some time to read the corresponding HuggingFace model cards, which also provide basic information on potential biases, risks, and harms.

## Creating a web-based app
To create a small chatbot window that interfaces with the models you download and run in this notebook, we will use [Gradio](https://www.gradio.app/). For demos, this service is free, but for rolling out your apps on a larger scale, you would have to pay. We will use a very simple **Blocks** layout; you can find the relevant documentation [here](https://www.gradio.app/docs/gradio/blocks). And don't forget, you can also ask an LLM to read it and help you out (see the next section).

# Reminder: Using a large-language model as a coding resource

Alternatively, you can go with the flow and ask one of the many available large language models to help you, e.g., by copying some code into the model's prompt and asking it to find errors or improve your code. Here you can also experiment with different ways of **prompting**, i.e., asking or instructing your model. Asking the model to first think through a problem step by step before giving its final answer often substantially improves performance on more complex reasoning tasks (similar to asking a human to think a problem through carefully before committing to a definite answer). One very impressive model in this regard is the one by [Perplexity AI](https://www.perplexity.ai/).
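
As a small illustration of the "think first, answer later" idea, the sketch below wraps an arbitrary task in such an instruction. The function name `build_step_by_step_prompt` and the exact wording of the instruction are assumptions made for this example; feel free to experiment with other phrasings.

```python
def build_step_by_step_prompt(task: str) -> str:
    """Wrap a task in an instruction that asks for reasoning before the answer."""
    return (
        "Think through the following problem step by step. "
        "Write down your reasoning first, then give the final answer "
        "on a separate line starting with 'Answer:'.\n\n"
        f"Problem: {task.strip()}"
    )


# Example: the resulting text can be pasted into any chat model's prompt
prompt = build_step_by_step_prompt("Why does my Python loop never terminate?")
print(prompt)
```

You can paste the printed text into the chat window of the model of your choice, followed by the code you want it to inspect.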

**Step 1: Setting Up the Environment**

In [1]:
!pip install -q -U transformers bitsandbytes accelerate gradio


In [2]:
import torch
import os

from transformers import BitsAndBytesConfig, AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline


In [3]:
os.environ["HF_TOKEN"]='' # Get a token at huggingface.co and put it here

**Step 2: Initializing the Language Model**

In [4]:
MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"
# MODEL_NAME ="mistralai/Mistral-7B-Instruct-v0.2"
# MODEL_NAME ="meta-llama/Meta-Llama-3-8B"
# MODEL_NAME ="microsoft/Phi-3-mini-4k-instruct"
# MODEL_NAME ="microsoft/phi-1_5"

# Quantization reduces the memory and computation requirements of deep learning
# models by storing the weights with fewer bits (here: 4 bits per weight)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Initialization of a tokenizer for the language model,
# necessary to preprocess text data for input
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
tokenizer.pad_token = tokenizer.eos_token

# Initialization of the pre-trained language model
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
    quantization_config=quantization_config
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


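
To get a feeling for why 4-bit quantization matters on a free Colab GPU, here is a rough back-of-the-envelope estimate of the memory needed for the model weights alone. It is a sketch under simplifying assumptions: it uses the round figure of 7 billion parameters quoted above and ignores activations, the attention cache, and the small overhead of the quantization constants. The helper `weight_memory_gb` is made up for this example.

```python
def weight_memory_gb(n_parameters: float, bits_per_parameter: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_parameters * bits_per_parameter / 8 / 1e9


N_PARAMETERS = 7e9  # round figure; the exact parameter count is somewhat higher

for label, bits in [("float16", 16), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(N_PARAMETERS, bits):.1f} GB")
# float16: ~14.0 GB
# 4-bit: ~3.5 GB
```

The actual footprint at runtime will be larger than these numbers, but the ratio explains why the quantized model fits on much smaller GPUs.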

**Step 3: Configuring Generation Settings**

In [5]:
# Configuration of some generation-related settings
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
generation_config.max_new_tokens = 1024 # maximum number of new tokens that can be generated by the model
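generation_config.temperature = 0.7 # randomness of the generated text
generation_config.top_p = 0 # nucleus (top-p) sampling threshold: 0 effectively keeps only the most likely token, values closer to 1 (e.g., 0.9) allow more diverse text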
generation_config.do_sample = True # sampling during the generation process
generation_config.pad_token_id = tokenizer.pad_token_id
# generation_config.repetition_penalty = 1.15 # the degree to which the model should avoid repeating tokens in the generated text
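
To make the roles of `temperature` and `top_p` more concrete, the following plain-Python sketch applies both to a toy list of scores (logits). It illustrates the general idea only and is not the code used inside `transformers`, whose implementation differs in detail. The function names and the toy logits are assumptions made for this example.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities; lower temperatures sharpen the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p,
    then renormalize. Returns a dict mapping token index to probability."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a vocabulary of four tokens

# A lower temperature concentrates probability on the highest-scoring token
for temperature in (0.7, 1.5):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])

probs = softmax_with_temperature(logits, 0.7)
print(top_p_filter(probs, 0.9))  # several candidate tokens remain
print(top_p_filter(probs, 0.0))  # only the most likely token remains
```

In this sketch, a `top_p` of 0 always leaves exactly one candidate, so sampling becomes deterministic no matter which temperature is chosen.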

**Step 4: Creating the Pipeline**

In [6]:
# A pipeline is an object that works as an API for calling the model.
# The pipeline is made of (1) the tokenizer instance, (2) the model instance, and
# (3) some post-processing settings. Here, it is configured to return only the newly
# generated text (return_full_text=False), not the prompt followed by the generated text.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    #return_full_text=True,
    return_full_text=False,
    generation_config=generation_config,
)
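
When the pipeline receives a list of `{"role": ..., "content": ...}` messages, it relies on the tokenizer's chat template to turn them into a single prompt string. The sketch below approximates what this looks like for the Mistral instruct format, using `[INST] ... [/INST]` markers. It is a simplified illustration: the function `render_mistral_style_prompt` is made up for this example, and the exact whitespace and special tokens are defined by each model's own template and may differ. To see the real prompt, you can call `tokenizer.apply_chat_template(messages, tokenize=False)` in this notebook.

```python
def render_mistral_style_prompt(messages, bos_token="<s>", eos_token="</s>"):
    """Approximate rendering of alternating user/assistant messages."""
    parts = [bos_token]
    for position, message in enumerate(messages):
        # The conversation has to start with a user turn and then alternate
        expected_role = "user" if position % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("Roles must alternate user/assistant, starting with user.")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            parts.append(f" {message['content']}{eos_token}")
    return "".join(parts)


example_messages = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Ahoy, matey!"},
    {"role": "user", "content": "Where is the treasure?"},
]
print(render_mistral_style_prompt(example_messages))
# <s>[INST] Hi! [/INST] Ahoy, matey!</s>[INST] Where is the treasure? [/INST]
```

Note that this format has no separate slot for general instructions, which is one reason to put such context into the first user message.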

In [7]:
import gradio as gr
import random
import time

with gr.Blocks() as demo:
    gr.Markdown("Start typing below and then hit **Enter** to start a conversation with a friendly pirate.")
    chatbot = gr.Chatbot()
    msg = gr.Textbox(label = "Type your message here and press 'Enter'.")
    clear = gr.ClearButton([msg, chatbot])

    def respond(message, chat_history):

        # Some context to prime the language model, which is
        # prepended to the first user input
        context_prompt = "Imagine you are a friendly but stubborn pirate, who is asked random things by a fifth grader.\n"

        if len(chat_history) > 0:
            messages = list()
            for i, pair in enumerate(chat_history):

                # Prepend the context prompt to the first user input!
                if i == 0:
                    user_content = context_prompt + pair[0]
                else:
                    user_content = pair[0]

                messages.append({"role": "user", "content": user_content})
                messages.append({"role": "assistant", "content": pair[1]})
            messages.append({"role": "user", "content": message})
        else:
            content = context_prompt + message

            messages = [{"role": "user", "content": content}]

        # Generate the reply; max_new_tokens=256 overrides the value set in
        # generation_config for this call
        bot_message = pipe(messages, max_new_tokens=256)[0]['generated_text']
        chat_history.append((message, bot_message))
        time.sleep(2)

        return "", chat_history

    msg.submit(respond, [msg, chatbot], [msg, chatbot])

    # Create first output only based on the initialization prompt
    #demo.load(respond, [msg, chatbot], [msg, chatbot])

demo.launch(share = True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://da45e9559bae3cfcef.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
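
The core of the `respond` function is the conversion of Gradio's chat history (a list of user/bot pairs) into the list of role/content messages that the pipeline expects. As a sketch, this step can be factored into a small helper that can be tried out without loading the model. The name `build_messages` and the shortened example prompt are assumptions made for this illustration.

```python
def build_messages(context_prompt, chat_history, message):
    """Convert (user, bot) pairs plus the new user message into chat messages."""
    messages = []
    for user_message, assistant_message in chat_history:
        messages.append({"role": "user", "content": user_message})
        messages.append({"role": "assistant", "content": assistant_message})
    messages.append({"role": "user", "content": message})

    # Only the very first user message carries the context prompt
    messages[0]["content"] = context_prompt + messages[0]["content"]
    return messages


history = [("Hi!", "Ahoy, matey!")]
for entry in build_messages("You are a pirate.\n", history, "Where is the treasure?"):
    print(entry)
```

Because the whole history is rebuilt and sent again on every turn, the prompt grows with the length of the conversation.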




# Make your app more easily accessible

The https://XXXXX12312345.gradio.live links can be quite hard to type. You can use a free web service to generate QR codes (e.g., [https://qr.io/](https://qr.io/)) and add them to posters or presentations, or you can create one directly in your [Chrome browser](https://support.google.com/chrome/answer/10051760?hl=en&co=GENIE.Platform%3DDesktop#zippy=%2Cshare-pages-with-a-qr-code).