<a href="https://colab.research.google.com/github/mike-maclaverty/financial-advisor-chatbot/blob/main/Demo_ML_projects_using_Gradio.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Welcome to the accompanying notebook for my Medium article
# *How I Created Easy Machine Learning Demos for Non-Technical Colleagues and Users* ([link](https://levelup.gitconnected.com/how-i-created-easy-gen-ai-demos-for-non-technical-colleagues-and-users-f522aec7f98a))
The code below is largely inspired by the Gradio documentation, notably [How to Create a Chatbot with Gradio](https://www.gradio.app/guides/creating-a-chatbot-fast).

# 1. Install Gradio and other necessary packages

In [None]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"

In [None]:
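!pip install -q gradio
# The examples below use the legacy openai.ChatCompletion API, which was removed in openai 1.0
!pip install -q "openai<1"
!pip install -q tiktoken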


# 2. Example 1 - Using OpenAI as the backend LLM and Gradio's ChatInterface to create the UI

In [None]:
OPENAI_API_KEY="sk-XXXX" # Replace with your key


The function below converts the conversation history to the OpenAI message format, appends the new user message, and sends everything to the GPT-3.5 Turbo model. It returns the model's reply, so the user can interact with the model in a chat-like manner.

In [None]:
import openai
import gradio as gr

openai.api_key = OPENAI_API_KEY

def get_completion(message, history):
    history_openai_format = []
    for human, assistant in history:
        history_openai_format.append({"role": "user", "content": human })
        history_openai_format.append({"role": "assistant", "content":assistant})
    history_openai_format.append({"role": "user", "content": message})

    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages= history_openai_format,
        temperature=0,
    )
    return response.choices[0].message["content"]
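
The history conversion is the only part of `get_completion` that does not depend on the OpenAI API, so it can be tried on its own. The sketch below isolates it, assuming the history arrives as `[user, assistant]` pairs as in the function above. The helper name `to_openai_messages` and the optional `system_prompt` argument are illustrative assumptions, not part of the notebook.

```python
# Illustrative helper: the name `to_openai_messages` and the optional
# `system_prompt` argument are assumptions, not part of the notebook.
def to_openai_messages(message, history, system_prompt=None):
    """Convert Gradio-style [user, assistant] pairs into role/content dicts."""
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    for human, assistant in history:
        messages.append({"role": "user", "content": human})
        messages.append({"role": "assistant", "content": assistant})
    # The newest user message always goes last.
    messages.append({"role": "user", "content": message})
    return messages


history = [["What is an ETF?", "An ETF is a fund traded on an exchange."]]
messages = to_openai_messages("Is it risky?", history)
for m in messages:
    print(m["role"], "->", m["content"])
```

Keeping the conversion in a separate function makes it easy to check the message order without spending API credits.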

In [None]:
gr.close_all()
gr.ChatInterface(get_completion).queue().launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://4c1b0ef008d6195181.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




Let's improve the user experience further with streaming output, so the user no longer has to wait for the LLM to finish the whole answer before seeing any text.

In [None]:
import openai
import gradio as gr

openai.api_key = OPENAI_API_KEY

def get_completion_with_streaming(message, history):
    history_openai_format = []
    for human, assistant in history:
        history_openai_format.append({"role": "user", "content": human })
        history_openai_format.append({"role": "assistant", "content":assistant})
    history_openai_format.append({"role": "user", "content": message})

    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages= history_openai_format,
        temperature=1.0,
        stream=True
    )

    partial_message = ""
    for chunk in response:
        # Some chunks carry no text (for example a role-only first chunk or an
        # empty final chunk), so read "content" defensively.
        content = chunk['choices'][0]['delta'].get('content')
        if content:
            partial_message += content
            yield partial_message
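
The accumulation pattern can be checked without calling the API. In the sketch below, `fake_stream` is a hypothetical stand-in for the streamed response, and the chunk shapes are an assumption modeled on the dictionary access used in the function above. The point it shows is that the generator yields the whole message so far, not only the newest fragment.

```python
# `fake_stream` is a hypothetical stand-in for the streamed API response.
def fake_stream():
    yield {"choices": [{"delta": {"role": "assistant"}}]}  # role only, no text
    yield {"choices": [{"delta": {"content": "Hel"}}]}
    yield {"choices": [{"delta": {"content": "lo!"}}]}
    yield {"choices": [{"delta": {}}]}                      # empty final chunk


def accumulate(chunks):
    """Yield the growing message after every chunk that carries text."""
    partial_message = ""
    for chunk in chunks:
        content = chunk["choices"][0]["delta"].get("content")
        if content:
            partial_message += content
            yield partial_message


updates = list(accumulate(fake_stream()))
print(updates)
```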


In [None]:
gr.close_all()
gr.ChatInterface(get_completion_with_streaming).queue().launch()





# 3. Example 2 - Chat with our fine-tuned Mistral 7B or any other open-source LLM

## Step 1. Load and Run the fine-tuned model with quantization
To run the LLM locally, you can use the Transformers library or Text Generation Inference (TGI).

In this example, we'll use the Transformers library and run the model on the Colab instance:
* I recommend using a GPU runtime for this example. In the Colab menu bar, choose Runtime > Change runtime type and select GPU under Hardware accelerator.
* You also need a Colab Pro account to have enough memory to load and run the Mistral 7B model.
* We'll load the model with 4-bit quantization to reduce its memory footprint.
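
To get a feel for why quantization matters here, the back-of-the-envelope sketch below estimates the memory taken by the weights alone at different precisions. It assumes a round 7 billion parameters and ignores activations, the KV cache, and quantization overhead, so real usage will be higher. The function name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory needed for the weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # assumption: "7B" rounded to 7 billion parameters

for label, bits in [("float16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions, 4-bit weights need roughly a quarter of the memory of 16-bit weights.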

In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git



Set up the configuration for quantization

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

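bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights in 4-bit precision
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # use the NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16   # run computations in bfloat16
)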

Load and run the finetuned Mistral-7b model!

In [None]:
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

peft_model_id = "Ronal999/mistralai-7B-v01-based-finetuned-using-ludwig-notmerged"
config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path, quantization_config=bnb_config, device_map={"":0})
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

# Load the Lora model
model = PeftModel.from_pretrained(model, peft_model_id)


## Step 2. Define a chat function which will be used by our Gradio Chatbot interface
- Adding memory to the chatbot: we pass the entire chat history to the model as context.
- Improving the user experience: first, we stream the response so the user doesn't have to wait for the whole message to be generated. Second, the user message appears immediately in the chat history while the chatbot's response is being generated. The `predict` function defined in this step handles the streaming.
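
As a warm-up for the memory part, here is a minimal sketch of how past turns can be flattened into a single prompt string. It uses the same `<human>`/`<bot>` tags as the `predict` function in this step. The helper name `build_prompt` is hypothetical, and a different fine-tuned model may expect a different prompt template.

```python
def build_prompt(message, history):
    """Flatten past turns plus the new message into one prompt string."""
    # The new message gets an empty bot slot for the model to fill in.
    turns = history + [[message, ""]]
    return "".join(f"\n<human>:{human}\n<bot>:{bot}" for human, bot in turns)


history = [["Hello", "Hi, how can I help?"]]
prompt = build_prompt("What is a bond?", history)
print(prompt)
```

Because the prompt ends with an open `<bot>:` tag, the model continues the text as the bot's next reply.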

In [None]:
import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria, StoppingCriteriaList, TextIteratorStreamer
from threading import Thread

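# The quantized model was already placed on GPU 0 by device_map at load time;
# calling .to() on a 4-bit bitsandbytes model raises an error, so it is omitted.

class StopOnTokens(StoppingCriteria):
    """Stop generation as soon as the last generated token is a stop token."""
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # The hard-coded IDs [29, 0] belong to a different tokenizer;
        # use this tokenizer's own end-of-sequence token instead.
        stop_ids = [tokenizer.eos_token_id]
        return input_ids[0][-1].item() in stop_ids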

def predict(message, history):

    history_transformer_format = history + [[message, ""]]
    stop = StopOnTokens()

    messages = "".join(["".join(["\n<human>:"+item[0], "\n<bot>:"+item[1]])  #curr_system_message +
                for item in history_transformer_format])

    model_inputs = tokenizer([messages], return_tensors="pt").to("cuda")
    streamer = TextIteratorStreamer(tokenizer, timeout=10., skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        model_inputs,
        streamer=streamer,
        max_new_tokens=1024,
        do_sample=True,
        top_p=0.95,
        top_k=1000,
        temperature=1.0,
        num_beams=1,
        stopping_criteria=StoppingCriteriaList([stop])
        )
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    partial_message  = ""
    for new_token in streamer:
        if new_token != '<':
            partial_message += new_token
            yield partial_message
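
`predict` runs `model.generate` in a background thread because generation blocks until it finishes, while the main thread reads text from the streamer as it arrives. The sketch below mimics that producer/consumer arrangement with the standard library only. `slow_producer` and `stream` are hypothetical stand-ins, and this is a conceptual illustration rather than the actual Transformers implementation.

```python
import queue
import time
from threading import Thread

_DONE = object()  # sentinel marking the end of the stream


def slow_producer(out_queue, tokens):
    """Stand-in for model.generate: emits tokens one by one, then the sentinel."""
    for token in tokens:
        time.sleep(0.01)  # pretend each token takes time to compute
        out_queue.put(token)
    out_queue.put(_DONE)


def stream(tokens, timeout=10.0):
    """Run the producer in a background thread and yield the growing message."""
    q = queue.Queue()
    worker = Thread(target=slow_producer, args=(q, tokens))
    worker.start()
    partial_message = ""
    while True:
        item = q.get(timeout=timeout)  # raises queue.Empty if the producer stalls
        if item is _DONE:
            break
        partial_message += item
        yield partial_message
    worker.join()


updates = list(stream(["The", " answer", " is", " 42"]))
print(updates[-1])
```

The timeout plays the same role as the `timeout=10.` passed to the streamer above: if the producer stalls, the consumer fails instead of hanging forever.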





## Step 3. Create Gradio Chatbot interface with advanced options

In [None]:
gr.close_all()
gr.ChatInterface(predict).queue().launch()

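Alternatively, if you want to further customize your chatbot, you can pass additional arguments to `gr.ChatInterface`. The cells below demonstrate this with a `predict` function that returns a canned answer instead of calling the model: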

In [None]:
import gradio as gr

In [None]:
def predict(message, history):
  # Canned response (no model call), used only to demonstrate the UI options below
  response = """
  The document titled "Circular CSSF 12/552 as amended by Circulars CSSF 13/563 to CSSF 22/807" is extensive, covering comprehensive regulations and guidelines for the central administration, internal governance and risk management of credit institutions and professionals performing lending operations. Here is a summary of the key points:

1. Central Administration and Internal Governance: Institutions must have a robust central administration, including a clear organizational structure with well-defined, transparent and consistent responsibilities. Effective processes for detecting, managing and reporting risks are essential. Adequate internal control mechanisms, sound administrative and accounting procedures, and remuneration policies that promote sound risk management are also required.

2. Risk Management: Detailed guidelines are provided for measuring and managing different types of risk, including concentration risk, credit risk, and risks related to shadow banking entities. This section also covers the risk of secured exposures and interest rate risk, among others. Institutions must have sound internal principles for controlling high-risk exposures and non-performing exposures.

3. Specific Requirements: These include guidance on organizational structure and legal entities, the management of conflicts of interest, the approval process for new products, and outsourcing policies.

4. Legal Reporting: Institutions must comply with legal reporting requirements, ensuring transparency and accountability.

This circular reflects a comprehensive approach to managing financial institutions, emphasizing robust governance, risk management, and compliance with legal and regulatory standards.
  """
  return response

In [None]:
gr.close_all()
gr.ChatInterface(
    predict,
    chatbot=gr.Chatbot(height=1000),
    textbox=gr.Textbox(placeholder="Ask a question", container=False, scale=7),
    title="Regulatory requirements analysis assistant based on the Mistral 7B model",
    description="Ask questions about Circular CSSF 12/552",
    theme="soft",
    examples=["Summarize the circular"],
    cache_examples=True,
    retry_btn=None,
    undo_btn="Delete Previous",
    clear_btn="Clear",
).queue().launch()



