# Introduction
This notebook compares four small language models (SLMs) on a question-answering task: Mistral-7B, Zephyr-7B, Orca-2-7B, and Phi-2 (a 2.7B-parameter model). The ground-truth answers are obtained by passing the same questions to Gemini-pro, a trusted large language model (LLM). At the end of the notebook, graphs show how each SLM performs on six commonly used metrics: exact match, F1 score, BLEU, ROUGE, BERT-based similarity, and Sentence-Transformer-based similarity.

Note: All models are quantized to fit within memory constraints.

Some key takeaways:

- **Exact match** is too strict for long, complex sentences, so it is not effective for our use case.
- **F1 score** is acceptable because it shows whether the produced answer contains significant portions of the ground truth. However, it is not a comprehensive measure of overall correctness.
- **BLEU and ROUGE scores** share a limitation: they compare n-grams and ignore the semantics of the sentences, so they do not give the complete picture. Nonetheless, they quantify surface-level similarity.
- **Similarity scores** (BERT-based and Sentence-Transformer-based) are the most suitable metrics for this scenario because they take the semantic meaning of the produced answers into account.
- All the language models perform similarly on this task, and the competition is very close. The scores are not exceptionally high, probably because the models are heavily quantized. Zephyr performs best on average, with Orca and Mistral following closely and Phi-2 trailing.
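
To make the first two takeaways concrete, here is a minimal sketch of exact match and token-level F1 in the style commonly used for question answering. The helper names `normalize`, `exact_match`, and `token_f1` are illustrative assumptions and are not necessarily the implementations used to produce the graphs in this notebook.

```python
import string
from collections import Counter

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def exact_match(prediction, reference):
    """1.0 only if the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction, reference):
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

reference = "One frog gave up while the other persisted."
prediction = "The other frog persisted, but one frog gave up."
print(exact_match(prediction, reference))         # 0.0: same meaning, different wording
print(round(token_f1(prediction, reference), 3))  # high: most tokens are shared
```

The two sentences say the same thing, yet exact match gives no credit, while token F1 rewards the shared vocabulary without checking word order or meaning.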

## Libraries

In [1]:
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U einops
!pip install -q -U safetensors
!pip install -q -U torch
!pip install -q -U xformers
!pip install -q -U langchain
!pip install -q -U ctransformers[cuda]
!pip install sentence-transformers
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install pypdf

!pip install ninja
!pip install fastparquet
!pip install "torch>=2.1.0"
!pip install "safetensors>=0.3.2"
!pip install "sentencepiece>=0.1.97"
!pip install pygments
!pip install websockets
!pip install regex
!pip install chromadb
!pip install --upgrade --quiet  langchain-google-genai pillow
!pip install -U langchain-community
!pip install rouge
!pip install plotly
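
Because a failed `pip` command is easy to overlook in a long install log, it can help to confirm which packages actually ended up in the environment. The sketch below uses only the standard library; the helper name `installed_versions` is an illustrative assumption, and the package list is a sample of the distributions installed in this cell.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Sample of distribution names from the install commands in this cell
wanted = ["peft", "accelerate", "transformers", "chromadb"]
for name, version in installed_versions(wanted).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```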


## Imports

In [2]:
from langchain.prompts import PromptTemplate
from langchain.output_parsers import StructuredOutputParser, ResponseSchema
from langchain.llms import CTransformers
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceBgeEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains.summarize import load_summarize_chain
import torch
from accelerate import Accelerator
import time
import re
import json
import math
import random
from abc import ABC,abstractmethod
from langchain_core.runnables import RunnablePassthrough
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.document_loaders import TextLoader

from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig, pipeline, BitsAndBytesConfig , CodeGenTokenizer
from langchain.llms import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain
from transformers import AutoTokenizer , AutoModelForCausalLM

import locale
locale.getpreferredencoding = lambda: "UTF-8"

Reference for using Phi-2 from a local directory:
https://colab.research.google.com/drive/14_mVXXdXmDiFshVArDQlWeP-3DKzbvNI?usp=sharing

## A simple illustration of an end-to-end pipeline using Phi-2

In [None]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)

prompt = """
Context: Artificial intelligence (AI) is technology that enables computers and digital devices to learn, read, write, talk, see, create, play, analyze, make recommendations and do other things humans do.
In addition, AI refers to the field of computer science focused on developing these technologies. Yet, at its simplest form, artificial intelligence is a field which combines computer science and robust datasets to enable problem-solving. It also encompasses sub-fields of machine learning and deep learning, which are frequently mentioned in conjunction with artificial intelligence. These disciplines are comprised of AI algorithms which seek to create expert systems to make predictions or classifications based on input data.
Artificial intelligence has gone through many cycles of hype, but even to skeptics, the release of OpenAI’s ChatGPT seems to mark a turning point. The last time generative AI loomed this large, the breakthroughs were in computer vision, but now the leap forward is in natural language processing (NLP). And it’s not just human language: Generative models can also learn the grammar of software code, molecules, natural images, and a variety of other data types. Some no-code interfaces enable people without coding skills to use visual interfaces and intuitive controls including drag-and-drop to create and modify applications quickly and efficiently while the actual code remains hidden in the background.

Instruct: You are a helpful and informative bot that answers questions using text from the reference passage included below. \
Be sure to respond in a complete sentence, being comprehensive, including all relevant background information. \
However, you are talking to a non-technical audience, so be sure to break down complicated concepts and \
strike a friendly and conversational tone. \
If the passage is irrelevant to the answer, you may ignore it.

Question: What are the applications of AI\n
Output:
"""

def split_text_into_sections(text):
  '''
  Splits text into sections based on the presence of certain keywords.
  '''
  # Keywords to look for, in order
  keywords = ['Context:', 'Instruct:', 'Question:', 'Output:']

  # Dictionary to hold the split sections
  sections = {}

  # Find the starting positions of each keyword
  start_positions = {keyword: text.find(keyword) for keyword in keywords}

  # Iterate over the keywords and their start positions
  for i, (keyword, start_pos) in enumerate(start_positions.items()):
      # If the keyword was found in the text
      if start_pos != -1:
          # Find the end position, which is either the start of the next keyword or the end of the text
          end_pos = None
          if i < len(keywords) - 1:  # If this is not the last keyword
              next_keyword = keywords[i + 1]
              next_keyword_pos = start_positions[next_keyword]
              if next_keyword_pos != -1:
                  end_pos = next_keyword_pos
          if end_pos is None:  # If this is the last keyword or no more keywords are found
              end_pos = len(text)

          # Extract and store the section, trimming the keyword itself and any leading/trailing whitespace
          section_text = text[start_pos + len(keyword):end_pos].strip()
          sections[keyword[:-1].lower()] = section_text

  return sections


tokenizer.pad_token = tokenizer.eos_token # set the padding tokens to end of sentence tokens
model_inputs = tokenizer(
    prompt, return_tensors="pt", padding=True
).to("cuda")
model = model.to("cuda")

# the model returns token ids, which are then converted to text
generated_ids = model.generate(**model_inputs, max_new_tokens=1000)

# converts ids to text
output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
sections = split_text_into_sections(output)
sections['output']


'AI has a wide range of applications across various industries. In healthcare, AI is used for medical diagnosis, drug discovery, and personalized treatment plans. In finance, AI is used for fraud detection, risk assessment, and algorithmic trading. In transportation, AI is used for autonomous vehicles, traffic management, and logistics optimization. In education, AI is used for personalized learning, intelligent tutoring systems, and automated grading. In entertainment, AI is used for virtual assistants, recommendation systems, and content creation. In manufacturing, AI is used for predictive maintenance, quality control, and supply chain optimization. In agriculture, AI is used for crop monitoring, yield prediction, and precision farming. In retail, AI is used for customer segmentation, demand forecasting, and inventory management. In energy, AI is used for smart grid management, renewable energy optimization, and energy efficiency. In security, AI is used for facial recognition, intr

In [None]:
sections.keys()

dict_keys(['context', 'instruct', 'question', 'output'])
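
As an alternative to the position-based splitter, the same sectioning can be sketched with a single regular expression. This is a simplified variant under stated assumptions: the name `split_sections_regex` is hypothetical, and unlike the original it keeps only the first occurrence of each keyword and ends the `output` section at any later keyword the model may generate.

```python
import re

def split_sections_regex(text, keywords=("Context", "Instruct", "Question", "Output")):
    """Split text at 'Keyword:' markers and return {keyword_lowercase: section_text}."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keywords)) + r"):")
    matches = list(pattern.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section runs until the next marker, or to the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # setdefault keeps the first occurrence if a keyword repeats
        sections.setdefault(match.group(1).lower(), text[match.end():end].strip())
    return sections

sample = "Context: a b\nInstruct: be brief\nQuestion: why?\nOutput: because.\nQuestion: extra"
print(split_sections_regex(sample))
```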

In [None]:
# formatted outputs based on sections
sections = split_text_into_sections(output)

# Print the sections
for key, value in sections.items():
    print(f"{key.capitalize()}:\n{value}\n")

Context:
Artificial intelligence (AI) is technology that enables computers and digital devices to learn, read, write, talk, see, create, play, analyze, make recommendations and do other things humans do.
In addition, AI refers to the field of computer science focused on developing these technologies. Yet, at its simplest form, artificial intelligence is a field which combines computer science and robust datasets to enable problem-solving. It also encompasses sub-fields of machine learning and deep learning, which are frequently mentioned in conjunction with artificial intelligence. These disciplines are comprised of AI algorithms which seek to create expert systems to make predictions or classifications based on input data.
Artificial intelligence has gone through many cycles of hype, but even to skeptics, the release of OpenAI’s ChatGPT seems to mark a turning point. The last time generative AI loomed this large, the breakthroughs were in computer vision, but now the leap forward is i

In [None]:
model

PhiForCausalLM(
  (model): PhiModel(
    (embed_tokens): Embedding(51200, 2560)
    (embed_dropout): Dropout(p=0.0, inplace=False)
    (layers): ModuleList(
      (0-31): 32 x PhiDecoderLayer(
        (self_attn): PhiSdpaAttention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (k_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (v_proj): Linear(in_features=2560, out_features=2560, bias=True)
          (dense): Linear(in_features=2560, out_features=2560, bias=True)
          (rotary_emb): PhiRotaryEmbedding()
        )
        (mlp): PhiMLP(
          (activation_fn): NewGELUActivation()
          (fc1): Linear(in_features=2560, out_features=10240, bias=True)
          (fc2): Linear(in_features=10240, out_features=2560, bias=True)
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (final_layernorm): LayerNorm((256
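
The layer shapes in this printout are enough to estimate the size of Phi-2 by hand. The sketch below simply sums weights and biases per layer; since the printout is cut off before the output head, the `lm_head` shape (a biased linear layer from 2560 to 51200) is an assumption.

```python
def linear_params(in_features, out_features, bias=True):
    """Number of learnable parameters in a Linear layer."""
    return in_features * out_features + (out_features if bias else 0)

hidden, ffn, vocab, n_layers = 2560, 10240, 51200, 32

attention = 4 * linear_params(hidden, hidden)                  # q_proj, k_proj, v_proj, dense
mlp = linear_params(hidden, ffn) + linear_params(ffn, hidden)  # fc1, fc2
layer_norm = 2 * hidden                                        # LayerNorm weight and bias
per_layer = attention + mlp + layer_norm

embedding = vocab * hidden
# Assumption: the output head is Linear(hidden, vocab) with bias (truncated in the printout)
lm_head = linear_params(hidden, vocab)

total = embedding + n_layers * per_layer + layer_norm + lm_head
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the total comes to roughly 2.78 billion parameters, which is why Phi-2 is usually described as a 2.7B model.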

## LLMs initialization

In [3]:
models = {}

### Mistral

In [4]:
!huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF mistral-7b-instruct-v0.1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False

Downloading 'mistral-7b-instruct-v0.1.Q4_K_M.gguf' to '.huggingface/download/mistral-7b-instruct-v0.1.Q4_K_M.gguf.14466f9d658bf4a79f96c3f3f22759707c291cac4e62fea625e80c7d32169991.incomplete'
mistral-7b-instruct-v0.1.Q4_K_M.gguf: 100% 4.37G/4.37G [00:52<00:00, 83.7MB/s]
Download complete. Moving file to mistral-7b-instruct-v0.1.Q4_K_M.gguf
mistral-7b-instruct-v0.1.Q4_K_M.gguf


In [5]:
accelerator = Accelerator()

#5000, 16000
config = {'max_new_tokens': 5000, 'repetition_penalty': 1.1, 'context_length': 10000, 'temperature':0, 'gpu_layers': 50}
llm = CTransformers(model = "./mistral-7b-instruct-v0.1.Q4_K_M.gguf", model_type = "mistral", gpu_layers=50, config=config, mlock=True)

llm_mistral, config = accelerator.prepare(llm, config)

print("LLM Initialized...")

LLM Initialized...


In [6]:
if "mistral" not in models:
  models['mistral'] = llm_mistral

### Zephyr

In [None]:
!huggingface-cli download TheBloke/zephyr-7B-beta-GGUF zephyr-7b-beta.Q5_K_M.gguf --local-dir . --local-dir-use-symlinks False

Downloading 'zephyr-7b-beta.Q5_K_M.gguf' to '.huggingface/download/zephyr-7b-beta.Q5_K_M.gguf.37894e5b171bd7228f4af7bd5bb0758dd29d6f07fb8e4742e387720f66bac434.incomplete'
zephyr-7b-beta.Q5_K_M.gguf: 100% 5.13G/5.13G [03:44<00:00, 22.9MB/s]
Download complete. Moving file to zephyr-7b-beta.Q5_K_M.gguf
zephyr-7b-beta.Q5_K_M.gguf


In [None]:
accelerator = Accelerator()

config = {'max_new_tokens': 50000, 'repetition_penalty': 1.1, 'context_length': 16000, 'temperature':0, 'gpu_layers': 50}
llm = CTransformers(model = "./zephyr-7b-beta.Q5_K_M.gguf", model_type = "mistral", gpu_layers=50, config=config, mlock=True)

llm_zephyr, config = accelerator.prepare(llm, config)

print("LLM Initialized...")

LLM Initialized...


In [None]:
if "zephyr" not in models:
  models['zephyr'] = llm_zephyr

### Orca-2

In [None]:
!huggingface-cli download TheBloke/Orca-2-7B-GGUF orca-2-7b.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False


Downloading 'orca-2-7b.Q4_K_M.gguf' to '.huggingface/download/orca-2-7b.Q4_K_M.gguf.eab8e78882fe5fc9d8336ecd3e12b280b26e89d05ea97eaf1d75e5d2a1f618e4.incomplete'
orca-2-7b.Q4_K_M.gguf: 100% 4.08G/4.08G [01:00<00:00, 67.8MB/s]
Download complete. Moving file to orca-2-7b.Q4_K_M.gguf
orca-2-7b.Q4_K_M.gguf


In [None]:
accelerator = Accelerator()

config = {'max_new_tokens': 50000, 'repetition_penalty': 1.1, 'context_length': 16000, 'temperature':0, 'gpu_layers': 50}
llm = CTransformers(model = "./orca-2-7b.Q4_K_M.gguf", model_type = "orca", gpu_layers=50, config=config, mlock=True)

llm_orca, config = accelerator.prepare(llm, config)

print("LLM Initialized...")

LLM Initialized...


In [None]:
if "orca" not in models:
  models['orca'] = llm_orca
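
The download sizes reported in the logs give a rough sense of how aggressive each quantization is. The sketch below divides file size by parameter count; the parameter counts (about 7.24B for the Mistral-based models and 6.74B for the Llama-2-based Orca-2) are assumed approximate values, and the sizes are read as decimal gigabytes.

```python
def bits_per_weight(file_size_gb, n_params_billion):
    """Average number of bits stored per parameter (decimal GB, billions of parameters)."""
    return file_size_gb * 8 / n_params_billion

# (file size in GB from the download logs, assumed parameter count in billions)
models_info = {
    "mistral-7b-instruct-v0.1.Q4_K_M": (4.37, 7.24),
    "zephyr-7b-beta.Q5_K_M": (5.13, 7.24),
    "orca-2-7b.Q4_K_M": (4.08, 6.74),
}
for name, (size_gb, params_b) in models_info.items():
    print(f"{name}: {bits_per_weight(size_gb, params_b):.2f} bits/weight")
```

For comparison, an unquantized half-precision checkpoint stores 16 bits per weight, so these files are roughly a third of that size.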

### Phi-2

In [None]:
base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    torch_dtype=torch.float32,
    device_map='auto',
    # quantization_config=quantization_config
)


In [None]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)



In [None]:
pipe = pipeline(
    "text-generation",
    model=base_model,
    tokenizer=tokenizer,
    max_length=2048,
    temperature=0,
    top_p=0.95,
    repetition_penalty=1.2
)
pipe.model.config.pad_token_id = pipe.model.config.eos_token_id
llm_phi = HuggingFacePipeline(pipeline=pipe)




In [None]:
if "phi" not in models:
  models['phi'] = llm_phi

### Google Gemini

In [None]:
import os

In [None]:
if "GOOGLE_API_KEY" not in os.environ:
    # Never hard-code API keys in a notebook; prompt for the key instead
    import getpass
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API key: ")
llm_google = ChatGoogleGenerativeAI(model="gemini-pro")

In [None]:
if "gemini" not in models:
  models['gemini'] = llm_google

## Question Generation using Gemini-pro

In [None]:
import re
import os
import ast
import json
import math
import time
import random
import logging
from typing import List, Dict, Optional, Any, Tuple
from langchain_core.runnables import RunnableLambda

# logging.basicConfig(level=logging.INFO)

class QuestionGenerator():
    """
    Class for generating question and answer pairs from a given text.
    """

    def __init__(self, chunk_size: int, chunk_overlap: int, llm: Any, number_of_questions: int):
        self.llm = llm
        self.number_of_questions = number_of_questions
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def text_splitter(self, data: str) -> List[str]:
        """
        Splits the input text into chunks based on specified chunk size and overlap.
        """
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size, chunk_overlap=self.chunk_overlap, add_start_index=True
        )
        all_splits = text_splitter.split_text(data)
        return all_splits

    def file_processing(self, loader: Any) -> List[str]:
        """
        Processes the input file using the provided loader.
        """
        data = loader.load()
        text = ''.join(page.page_content for page in data)
        return self.text_splitter(data=text)

    def parse_json_like(self, text: str):
        """
        Parses JSON-like strings from the input text.
        """
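        # Expect the model to reply with something like: response = {...}
        pattern = re.compile(r'response\s*=\s*({.*?})', re.DOTALL)

        match = pattern.search(text)
        if not match:
            print("No match found")
            return None

        try:
            # literal_eval evaluates the dictionary literal without executing code
            return ast.literal_eval(match.group(1))
        except (ValueError, SyntaxError):
            print("Could not parse the matched dictionary")
            return None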

    def instantiate_gemini(self):
        """
        Instantiates the Gemini-pro chat model, prompting for an API key if none is set.
        """
        if "GOOGLE_API_KEY" not in os.environ:
            # Never hard-code API keys in a notebook; prompt for the key instead
            import getpass
            os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API key: ")
        llm = ChatGoogleGenerativeAI(model="gemini-pro")
        return llm

    def get_prompt(self, chunk):
      prompt = f"""Read the following text: {chunk} After analyzing the text, generate a thought-provoking question related to the content.
      Then, answer the question you formulated.
      Generate just one question and answer that is factual in realm of the given context.
      The answer should be short and concise.
      Structure your response as a Python dictionary with keys: question, answer."""
      return prompt


    def generate_qa(self, file_path: Optional[str] = None, text: Optional[str] = None) -> List[Dict[str, str]]:
        """
        Generates quiz questions based on the provided file or text data.
        """
        if file_path and text:
            print("Input either a file or text data. Not both.")
            return []

        if file_path:
            if file_path.endswith(".txt"):
                loader = TextLoader(file_path)
            elif file_path.endswith(".pdf"):
                loader = PyPDFLoader(file_path)
            else:
                print("Unsupported file type.")
                return []

            chunks = self.file_processing(loader)
        elif text:
            chunks = self.text_splitter(text)
        else:
            print("Please provide either a file path or text data.")
            return []

        print(f"Length of chunks: {len(chunks)}")
        if self.llm == "gemini":
            model = self.instantiate_gemini()
        else:
            # Otherwise assume an already-instantiated chat model was passed in
            model = self.llm
        print("LLM Chain created")

        start = time.time()
        results = []
        random.shuffle(chunks)

        for chunk in chunks:
            # Stop before calling the model again once enough pairs are collected
            if len(results) >= self.number_of_questions:
                break

            prompt = self.get_prompt(chunk)
            result = model.invoke(prompt).content
            if not result.strip():
                continue

            result = self.parse_json_like(result)
            if result is None:
                # Skip responses that could not be parsed
                continue
            results.append(result)

        # Save the pairs even if fewer than the requested number were generated
        with open('q&a.json', 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2)
        print(f"Generated {len(results)} Q&A pairs in {time.time() - start:.1f}s")
        return results

In [None]:
qa = QuestionGenerator(1000, 100, 'gemini', 30)
qa_res = qa.generate_qa(file_path="./stories.pdf")

Length of chunks: 257
LLM Chain created
Result: {'question': 'What is the significance of beauty in the text?', 'answer': 'Beauty is subjective and can be perceived in different ways. It is not limited to physical appearance but can also be found in inner qualities and actions.'}
Result: {'question': 'Why were the citizens of the village unhappy despite their wishes being fulfilled?', 'answer': "They were jealous of each other's possessions and there were no gardens for the children to play in."}
Result: {'question': "What was the difference between the two frogs' responses to the adversity?", 'answer': 'One frog gave up while the other persisted despite the pain and exhaustion.'}
Result: {'question': 'What does the story suggest about the nature of true beauty?', 'answer': 'The story suggests that true beauty lies not in physical perfection but in the love and compassion that flows from one heart to another.'}
Result: {'question': 'What was the purpose of the merchant loading the donk
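
Model replies do not always follow the requested format exactly, so a more tolerant parser can reduce the number of discarded chunks. The sketch below is one possible approach, not the notebook's method: the hypothetical helper `parse_qa` takes the outermost `{...}` span and tries JSON first, then a Python literal.

```python
import ast
import json
import re

def parse_qa(raw):
    """Return a dict with 'question' and 'answer' keys, or None if parsing fails."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)  # outermost {...} span
    if not match:
        return None
    candidate = match.group(0)
    for parser in (json.loads, ast.literal_eval):
        try:
            parsed = parser(candidate)
        except (ValueError, SyntaxError, TypeError):
            continue  # try the next parser
        if isinstance(parsed, dict) and {"question", "answer"}.issubset(parsed):
            return parsed
    return None

print(parse_qa('response = {"question": "Q?", "answer": "A."}'))
print(parse_qa("Here you go:\n{'question': 'Q?', 'answer': 'A.'}"))
print(parse_qa("Sorry, I cannot answer that."))
```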

## File pre-processing

In [7]:
from pypdf import PdfReader

def load_pdf(file_path):
    """
    Reads the text content from a PDF file and returns it as a single string.

    Parameters:
    - file_path (str): The file path to the PDF file.

    Returns:
    - str: The concatenated text content of all pages in the PDF.
    """
    # Logic to read pdf
    reader = PdfReader(file_path)

    # Loop over each page and store it in a variable
    text = ""
    for page in reader.pages:
        text += page.extract_text()

    return text

# replace the path with your file path
pdf_text = load_pdf(file_path="/content/stories.pdf")

In [8]:
def doc_splitter(docs):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100, add_start_index=True
    )
    docs = text_splitter.split_documents(docs)
    return docs

In [9]:
def text_splitter(data):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800, chunk_overlap=50, add_start_index=True
    )
    all_splits = text_splitter.split_text(data)
    return all_splits
chunkedtext = text_splitter(pdf_text)
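
To illustrate what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the function name `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Slice text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the overlap plays in keeping context across chunk boundaries.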

## Persistent DB

In [10]:
import chromadb
from typing import List
def create_chroma_db(documents:List, path:str, name:str):
    """
    Creates a Chroma database using the provided documents, path, and collection name.

    Parameters:
    - documents: An iterable of documents to be added to the Chroma database.
    - path (str): The path where the Chroma database will be stored.
    - name (str): The name of the collection within the Chroma database.

    Returns:
    - Tuple[chromadb.Collection, str]: A tuple containing the created Chroma Collection and its name.
    """
    chroma_client = chromadb.PersistentClient(path=path)
    # db = chroma_client.create_collection(name=name, embedding_function=some_func())
    db = chroma_client.create_collection(name=name)

    for i, d in enumerate(documents):
        db.add(documents=d, ids=str(i))

    return db, name

db, name = create_chroma_db(documents=chunkedtext,
                          path="/content/db", #replace with your path
                          name="database")

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:19<00:00, 4.26MiB/s]


In [11]:
def load_chroma_collection(path, name):
    """
    Loads an existing Chroma collection from the specified path with the given name.

    Parameters:
    - path (str): The path where the Chroma database is stored.
    - name (str): The name of the collection within the Chroma database.

    Returns:
    - chromadb.Collection: The loaded Chroma Collection.
    """
    chroma_client = chromadb.PersistentClient(path=path)
    db = chroma_client.get_collection(name=name)

    return db

db=load_chroma_collection(path="/content/db", name="database")

In [12]:
def get_relevant_passage(query, db, n_results):
  passage = db.query(query_texts=[query], n_results=n_results)['documents'][0]
  return passage
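
Conceptually, the query embeds the question and returns the stored chunks whose embeddings are closest to it. The toy sketch below shows the idea with bag-of-words vectors and cosine similarity; it is an illustration only, since Chroma uses a neural embedding model (the `all-MiniLM-L6-v2` download appears in the log above) and an approximate nearest-neighbour index. The names `embed`, `cosine`, and `top_k` are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query, documents, k):
    """Return the k documents most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "the frog kept swimming and escaped",
    "the merchant loaded salt on the donkey",
    "the king planted seeds of honesty",
]
print(top_k("what did the merchant load on the donkey", docs, k=1))
```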

## Prompts

In [21]:
def make_rag_prompt(query, model, relevant_passage):
  context = relevant_passage.replace("'", "").replace('"', "").replace("\n", " ")
  # print(f"context: {escaped}")
  match model:
    case 'mistral':
      prompt = ("""<s>[INST]
You are an AI assistant tasked with answering questions based only on the provided context.
Here is an excerpt from a document and a question. Your job is to give a detailed answer using only the information from the context. Do not use any external knowledge or make up information.
Think clearly before answering the question.
Structure your response as a Python dictionary with keys: answer.

Context:
{context}

Question:
{query}
[/INST]
""").format(query=query, context=context)

    case 'zephyr':
      prompt = ("""
    Instruction:
You are an AI assistant tasked with answering questions based only on the provided context.
Here is an excerpt from a document and a question. Your job is to give a detailed answer using only the information from the context. Do not use any external knowledge or make up information.
Think clearly before answering the question.
The answer is usually present in the given context. However, if the question is not remotely related to the context, then respond with "The answer cannot be found in the provided context".
Structure your response as a Python dictionary with keys: answer.
=======
{context}
=======
Question: {query}
Output:\n
  """).format(query=query, context=context)

    case 'phi':
      prompt = ("""
    Instruction:
You are an AI assistant tasked with answering questions based only on the provided context.
Here is an excerpt from a document and a question. Your job is to give a detailed answer using only the information from the context. Do not use any external knowledge or make up information.
Think clearly before answering the question.
The answer is usually present in the given context. However, if the question is not remotely related to the context, then respond with "The answer cannot be found in the provided context".
Structure your response as a Python dictionary with keys: answer.
=========================================
Context: {context}
=========================================
Question: {query}

Output:\n
  """).format(query=query, context=context)


    case 'orca':
      system_message = """You are an AI assistant tasked with answering questions based only on the provided context.
Here is an excerpt from a document and a question. Your job is to give a detailed answer using only the information from the context. Do not use any external knowledge or make up information.
Think clearly before answering the question.
The answer is usually present in the given context. However, if the question is not remotely related to the context, then respond with "The answer cannot be found in the provided context".
Structure your response as a Python dictionary with keys: answer."""

      user_message = """=======
      {context}
      =======
      Question: {query}""".format(query=query, context=context)
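      prompt = f"<|im_start|>system\n{system_message}<|im_end|>\n<|im_start|>user\n{user_message}<|im_end|>\n<|im_start|>assistant"

    case _:
      # Fail loudly instead of hitting an unbound `prompt` below
      raise ValueError(f"Unsupported model: {model}")

  return prompt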
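
The `match` statement requires Python 3.10 or later. A dictionary of templates is one alternative that works on older versions and keeps the prompt text separate from the selection logic. The sketch below is an assumption-laden illustration: `TEMPLATES` and `build_prompt` are hypothetical names, and the templates are shortened stand-ins for the full prompts defined above.

```python
# Hypothetical, shortened templates; the real prompts are the ones in make_rag_prompt
TEMPLATES = {
    "mistral": (
        "<s>[INST] Answer using only the context.\n"
        "Context: {context}\nQuestion: {query} [/INST]"
    ),
    "orca": (
        "<|im_start|>system\nAnswer using only the context.<|im_end|>\n"
        "<|im_start|>user\nContext: {context}\nQuestion: {query}<|im_end|>\n"
        "<|im_start|>assistant"
    ),
}

def build_prompt(model_name, query, context):
    """Look up the template for a model and fill in the query and context."""
    try:
        template = TEMPLATES[model_name]
    except KeyError:
        raise ValueError(f"No prompt template for model: {model_name}") from None
    return template.format(query=query, context=context)

print(build_prompt("mistral", query="Who persisted?", context="One frog persisted."))
```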

## Generate Answer

In [14]:
def get_answer(model, prompt):
  answer = model.invoke(prompt)
  return answer

def generate_answer(models, model, db, query):
    # Retrieve the top 3 relevant text chunks
    relevant_text = get_relevant_passage(query, db, n_results=3)
    # Join the chunks with a space so words at chunk boundaries do not fuse together
    prompt = make_rag_prompt(query, model,
                             relevant_passage=" ".join(relevant_text))
    answer = get_answer(models[model], prompt)

    return answer

## Run

In [3]:
import warnings
import re
warnings.filterwarnings('ignore')

def read_reference_questions(path):
  with open(path, 'r', encoding='utf-8') as f:
    data = json.load(f)

  # Skip entries that failed to parse during question generation
  data = [item for item in data if item is not None]
  queries = [item['question'] for item in data]
  # Answers are occasionally non-string values; convert everything to str
  gemini_answers = [str(item['answer']) for item in data]

  return queries, gemini_answers


In [22]:
def run(models, model, queries):

  slm_answers = []
  start = time.time()
  for query in queries:
    print(f"query: {query}")
    slm_answer = generate_answer(models, model, db, query=query)
    if model == "zephyr":
      slm_answer = slm_answer.split(":")[-1]
      slm_answer = re.sub(r'[^a-zA-Z0-9\s]', '', slm_answer)
      slm_answer = ' '.join(slm_answer.strip().split())
    elif model == "orca":
      marker = "|im_start|>user"
      slm_answer = slm_answer['answer']
      m = slm_answer.find(marker)
      if m != -1:
        slm_answer = slm_answer[m+len(marker):]
    print(f"slm answer: {slm_answer}")
    slm_answers.append(slm_answer)
  end = time.time()
  answers_dict_list = [{'answer': answer} for answer in slm_answers]


  with open(f'{model}.json', 'w', encoding='utf-8') as f:
      json.dump(answers_dict_list, f, indent=2)
  print("running time: ", end-start)
  return answers_dict_list

In [4]:
queries, gemini_answers = read_reference_questions("q&a.json")

In [8]:
gemini_answers

['Beauty is subjective and can be perceived in different ways. It is not limited to physical appearance but can also be found in inner qualities and actions.',
 "They were jealous of each other's possessions and there were no gardens for the children to play in.",
 'One frog gave up while the other persisted despite the pain and exhaustion.',
 'The story suggests that true beauty lies not in physical perfection but in the love and compassion that flows from one heart to another.',
 'To make the load heavier when the donkey fell into the river, thus exposing its trick.',
 'Planting honesty leads to reaping trust.',
 'He asked about the cleansing power of the five daily prayers.',
 'The merchant learned that to be with the ones you love, you must be ready to give up everything, even life itself.',
 'The text suggests that the stranger may be an angel sent by God to deliver a message to the man.',
 'He believed that the 2,000 known elements at the time were not suitable for making a good 

In [23]:
db=load_chroma_collection(path="/content/db", #replace with path of your persistent directory
                            name="database") #replace with the collection name

slm_answers = run(models, "mistral", queries)

query: What is the significance of beauty in the text?
slm answer: {
"answer": "The significance of beauty in the text is that it is a concept that is often overlooked or judged based on external standards. The author emphasizes that true beauty lies within a person's character and can be seen when one looks beyond physical appearance. The text also highlights the importance of remembering the beautiful things that Allah has created and praising the beholder. Additionally, the author's mother had a unique ability to see the beauty in every person she met, regardless of their physical appearance, and this is portrayed as a positive trait."
}
query: Why were the citizens of the village unhappy despite their wishes being fulfilled?
slm answer: {
"answer": "The citizens of the village were unhappy despite their wishes being fulfilled because they were jealous of each other's possessions. The person who had a palace but no gold and the person who had gold but no palace were not happy. Addit
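
The prompt asks each model to structure its response as a dictionary with an `answer` key, so the raw replies above arrive as dictionary-shaped strings. The notebook does not show how those strings were cleaned before evaluation; the following is only a minimal sketch of one way to do it, assuming the reply contains a JSON-compatible object. The helper name `extract_answer` is hypothetical and not part of the original notebook.

```python
import json
import re

def extract_answer(raw_output):
    """Return the 'answer' value from a dictionary-shaped reply, or the raw text."""
    text = raw_output.strip()
    # Find the outermost {...} span; the model may add text around it.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if isinstance(parsed, dict) and "answer" in parsed:
                return str(parsed["answer"]).strip()
        except json.JSONDecodeError:
            pass  # not valid JSON (e.g. single quotes); fall back to the raw text
    return text

print(extract_answer('{\n"answer": "Planting honesty leads to reaping trust."\n}'))
```

Replies that are not valid JSON, such as truncated output or Python-style single-quoted dictionaries, are returned unchanged so that nothing is silently dropped.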

## Retrieve data from json

In [35]:
def get_answers(path, model):
  with open(path, 'r', encoding='utf-8') as f:
    data = json.load(f)
  return [item['answer'] for item in data]

In [22]:
mistral_answers = get_answers("filtered_mistral.json", "mistral")
zephyr_answers = get_answers("zephyr.json", "zephyr")
phi_answers = get_answers("filtered_phi.json", "phi")
orca_answers = get_answers("orca.json", "orca")

In [23]:
slms_answers = {"mistral": mistral_answers,
                "zephyr": zephyr_answers,
                "phi": phi_answers,
                "orca": orca_answers}

# Evaluations

In [19]:
evaluations = {
    "exact_match": {},
    "f1_score": {},
    "bleu": {},
    "rouge": {},
    "bert_cosine": {},
    "sentence_transformer_cosine": {}
}

## Semantic Similarity

### BERT and cosine sim

In [32]:
import warnings
from transformers import BertTokenizer, BertModel
import torch
from scipy.spatial.distance import cosine
import numpy as np

# Suppress specific warnings
warnings.filterwarnings("ignore", message="A parameter name that contains `beta` will be renamed internally to `bias`.")
warnings.filterwarnings("ignore", message="A parameter name that contains `gamma` will be renamed internally to `weight`.")

# Function to encode text to get embeddings
def get_bert_embeddings(model, tokenizer, text):
    # Encode text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    # Get hidden states
    with torch.no_grad():
        outputs = model(**inputs)
    # Only use the embeddings of the [CLS] token (at position 0)
    embeddings = outputs.last_hidden_state[:, 0, :].squeeze()  # Ensure it's 1-D
    return embeddings

# Function to normalize embeddings
def normalize_embeddings(embeddings):
    norms = torch.norm(embeddings, p=2, dim=-1, keepdim=True)
    normalized_embeddings = embeddings / norms
    return normalized_embeddings

def init_bert():
    # Load pre-trained model and tokenizer
    model_name = "bert-base-uncased"
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)
    return model, tokenizer

def bert_cosine_sim(slm_answers, gemini_answers, model, tokenizer):

  # Load pre-trained model and tokenizer

  sim = []

  for i in range(len(gemini_answers)):

      if gemini_answers[i] is None or slm_answers[i] is None:
          sim.append(0)
          # print(f"{i}: Cosine similarity: {0}")
          continue

      # Get embeddings
      embeddings1 = get_bert_embeddings(model, tokenizer, slm_answers[i])
      embeddings2 = get_bert_embeddings(model, tokenizer, gemini_answers[i])

      # Normalize embeddings
      embeddings1 = normalize_embeddings(embeddings1)
      embeddings2 = normalize_embeddings(embeddings2)

      # Convert to NumPy arrays
      embeddings1 = embeddings1.numpy().flatten()
      embeddings2 = embeddings2.numpy().flatten()

      # Compute cosine similarity
      similarity = 1 - cosine(embeddings1, embeddings2)
      sim.append(similarity)

      # print(f"{i}: Cosine similarity: {similarity}")

  # Output the results
  sim = [round(score, 6) for score in sim]
  return sim


In [27]:
model, tokenizer = init_bert()  # init_bert() is the loader defined above



In [30]:
for slm, answers in slms_answers.items():
  sim_score = bert_cosine_sim(answers, gemini_answers, model, tokenizer)
  evaluations["bert_cosine"][slm] = sim_score

### SentenceTransformer and cosine sim

In [38]:
from sentence_transformers import SentenceTransformer, util
import torch

def normalize_embeddings(embeddings):
    norms = torch.norm(embeddings, p=2, dim=1, keepdim=True)
    normalized_embeddings = embeddings / norms
    return normalized_embeddings

# Load the model
def init_sentence_transformer():
  model = SentenceTransformer('all-MiniLM-L6-v2')
  return model

# Compute embeddings
def sentence_transformers_cosine_sim(slm_answers, gemini_answers, model):
  res = []
  for i in range(len(gemini_answers)):
    # Encode and convert to PyTorch tensors
    original_result = torch.tensor(model.encode(gemini_answers[i], convert_to_tensor=False)).unsqueeze(0)
    rag_result = torch.tensor(model.encode(slm_answers[i], convert_to_tensor=False)).unsqueeze(0)

    # Normalize the embeddings
    original_result_normalized = normalize_embeddings(original_result)
    rag_result_normalized = normalize_embeddings(rag_result)

    # Compute similarity
    similarity = util.pytorch_cos_sim(original_result_normalized, rag_result_normalized)

    # Append the similarity score
    res.append(similarity.item())

  # Round the similarity scores for precision
  res = [round(score, 6) for score in res]
  return res


In [34]:
sentence_transformer = init_sentence_transformer()


In [39]:
for slm, answers in slms_answers.items():
  sim_score = sentence_transformers_cosine_sim(answers, gemini_answers, sentence_transformer)
  evaluations["sentence_transformer_cosine"][slm] = sim_score

## Accuracy

### Exact matches (EM)

In [49]:
from collections import Counter
def exact_match(ground_truths, predictions):
  assertions = []
  for i in range(len(ground_truths)):
    prediction = predictions[i]
    ground_truth = ground_truths[i]
    assertions.append(int(prediction.strip().lower() == ground_truth.strip().lower()))
  return assertions

In [50]:
for slm, answers in slms_answers.items():
  em_score = exact_match(gemini_answers, answers)
  evaluations["exact_match"][slm] = em_score

### F1 score

In [51]:
def f1_score(ground_truths, predictions):
  f1_scores = []
  for i in range(len(ground_truths)):
    pred_tokens = predictions[i].split()
    gt_tokens = ground_truths[i].split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())

    if num_same == 0:
        f1_scores.append(0)
        continue

    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    f1_scores.append(round(f1, 3))
  return f1_scores


In [53]:
for slm, answers in slms_answers.items():
  f1 = f1_score(gemini_answers, answers)
  evaluations["f1_score"][slm] = f1

## For Readability and Fluency

### BLEU score

In [48]:
from nltk.translate.bleu_score import sentence_bleu
def compute_bleu(ground_truths, predictions):
    # Sentence-level BLEU of each prediction against its single reference, on whitespace tokens
    bleu_scores = []
    for i in range(len(ground_truths)):
      bleu_scores.append(sentence_bleu([ground_truths[i].split()], predictions[i].split()))
    return bleu_scores

In [54]:
evaluations.keys()

dict_keys(['exact_match', 'f1_score', 'bleu', 'rouge', 'bert_cosine', 'sentence_transformer_cosine'])

In [55]:
for slm, answers in slms_answers.items():
  bleu_score = compute_bleu(gemini_answers, answers)
  evaluations["bleu"][slm] = bleu_score

### ROUGE score

In [57]:
from rouge import Rouge

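def compute_rouge(ground_truths, predictions):
    rouge = Rouge()  # create the scorer once and reuse it for every pair
    rouge_scores = []
    for i in range(len(ground_truths)):
      # get_scores takes the hypothesis first, then the reference
      rouge_scores.append(rouge.get_scores(predictions[i], ground_truths[i], avg=True))
    return rouge_scores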


In [58]:
for slm, answers in slms_answers.items():
  rouge_score = compute_rouge(gemini_answers, answers)
  evaluations["rouge"][slm] = rouge_score

In [60]:
with open("evaluations.json", "w") as f:
  json.dump(evaluations, f, indent=2)

## Observations


- **BLEU, ROUGE-1, ROUGE-2, ROUGE-L, Exact match, and F1 scores are low, as expected**: The models generate long, free-form answers, so metrics based only on overlap, precision, and recall are not sufficient to quantify their performance.
  - **Exact match**: Nearly 0 for every sentence and every model.
  - **F1-score**: Zephyr performs the best in this category, with a median score of around 0.2.
  - **BLEU score**: The median score is 0 for all the models. However, according to the overall distribution, Zephyr performs the best.
  - **ROUGE score**: No surprises here either, since Zephyr takes the win for ROUGE-1, ROUGE-2, and ROUGE-L.
- **Semantic similarity scores**: These metrics account for the meaning of the answers rather than surface overlap.
  - **BERT similarity**: Phi edges past Zephyr to become the best performing model in this category, with a median value of 0.85.
  - **Sentence-Transformer similarity**: Zephyr is the best, with a median score of 0.6.
- **It is important to note that all the models are very close in scores**: Some models outperform others in certain evaluations.
- **The overall scores are likely low because of heavy quantization**: The models are 4-bit quantized to fit space constraints. Testing the full-precision models would probably yield noticeably better scores, although this notebook does not measure that.
- **This notebook is meant to experiment with and compare the models**: Even though the absolute scores are low, they still show how the models rank relative to one another.
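
The medians quoted above can be read off the box plots, but they can also be computed directly. The following is a minimal sketch, assuming a dictionary shaped like `evaluations` (metric name, then model name, then a list of numeric scores). The helper `median_by_model` and the toy values are illustrative assumptions, not results from this notebook, and the helper does not apply to the `rouge` entry, whose items are nested dictionaries.

```python
from statistics import median

def median_by_model(evaluations, metric):
    """Return the median score of each model for one flat (non-ROUGE) metric."""
    return {model: round(median(scores), 3)
            for model, scores in evaluations[metric].items()}

# Toy values for illustration only (not taken from the notebook's results)
toy_evaluations = {
    "f1_score": {
        "model_a": [0.1, 0.2, 0.3],
        "model_b": [0.0, 0.4, 0.2, 0.6],
    }
}

print(median_by_model(toy_evaluations, "f1_score"))
```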


In [None]:
import json
with open("evaluations.json", "r") as f:
  evaluations = json.load(f)

## Tabular Representation

In [None]:
def round_values(d):
    if isinstance(d, dict):
        return {k: round_values(v) for k, v in d.items()}
    elif isinstance(d, list):
        return [round_values(v) for v in d]
    elif isinstance(d, (int, float)):
        return round(d, 3)
    return d

rounded_data = round_values(evaluations)
print(rounded_data['bert_cosine'])

In [26]:
import pandas as pd
data = rounded_data
exact_match_df = pd.DataFrame(data['exact_match'])
f1_score_df = pd.DataFrame(data['f1_score'])
bleu_df = pd.DataFrame(data['bleu'])
bert_similarity_df = pd.DataFrame(data['bert_cosine'])
sentence_transformer_similarity_df = pd.DataFrame(data['sentence_transformer_cosine'])

# Print the DataFrames
print("BERT similarity Score DataFrame:")
print(bert_similarity_df)

print("\n Sentence-Transformer similarity Score DataFrame:")
print(sentence_transformer_similarity_df)

print("\nExact Match DataFrame:")
print(exact_match_df)

print("\nF1 Score DataFrame:")
print(f1_score_df)

print("\nBLEU DataFrame:")
print(bleu_df)

rouge_data = []

# Process ROUGE data
for model, scores in data['rouge'].items():
    for idx, score in enumerate(scores):
        rouge_data.append({
            'model': model,
            'rouge-1_r': score['rouge-1']['r'],
            'rouge-1_p': score['rouge-1']['p'],
            'rouge-1_f': score['rouge-1']['f'],
            'rouge-2_r': score['rouge-2']['r'],
            'rouge-2_p': score['rouge-2']['p'],
            'rouge-2_f': score['rouge-2']['f'],
            'rouge-l_r': score['rouge-l']['r'],
            'rouge-l_p': score['rouge-l']['p'],
            'rouge-l_f': score['rouge-l']['f']
        })

# Convert to DataFrame
rouge_df = pd.DataFrame(rouge_data)

# Display the DataFrame
print("\nROUGE DataFrame:")
print(rouge_df)

BERT similarity Score DataFrame:
    mistral  zephyr    phi   orca
0     0.770   0.804  0.823  0.877
1     0.630   0.764  0.836  0.795
2     0.667   0.896  0.924  0.769
3     0.877   0.941  0.870  0.774
4     0.599   0.863  0.895  0.712
5     0.773   0.859  0.945  0.642
6     0.611   0.832  0.677  0.820
7     0.756   0.798  0.949  0.954
8     0.717   0.934  0.849  0.781
9     0.711   0.832  0.887  0.801
10    0.630   0.657  0.738  0.767
11    0.829   0.863  0.855  0.843
12    0.577   0.668  0.660  0.547
13    0.719   0.805  0.729  0.820
14    0.596   0.762  0.753  0.679
15    0.662   0.730  0.782  0.758
16    0.764   0.849  0.869  0.710
17    0.674   0.918  0.899  0.601
18    0.699   0.796  0.840  0.776
19    0.760   0.832  0.669  0.757
20    0.639   0.743  0.844  0.836
21    0.644   0.707  0.756  0.680
22    0.661   0.662  0.950  0.706
23    0.662   0.949  0.821  0.809
24    0.795   0.903  0.837  0.879
25    0.630   0.837  1.000  0.761
26    0.654   0.853  0.899  0.742
27    0.707   0

## Plot graphs based on metrics

This section plots three kinds of graphs, namely line, box, and scatter plots, for different purposes. The line plot shows the general trend of each model's performance across all the sentences, although it is a little busy. The scatter plot shows each model's score for every sentence clearly. Lastly, the box plot, which I consider the most important, summarizes each model's overall performance in a single box and shows the median and distribution of its scores.
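
To make concrete what a box plot condenses, here is a small sketch of the five-number summary (minimum, quartiles, maximum) for one model's scores. It uses the standard library's `statistics.quantiles` with the `inclusive` method, which is an assumption for illustration; Plotly may compute quartiles slightly differently. The helper name `five_number_summary` and the toy scores are hypothetical.

```python
from statistics import quantiles

def five_number_summary(scores):
    """Return min, first quartile, median, third quartile, and max of a score list."""
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "q1": round(q1, 3),
        "median": round(q2, 3),
        "q3": round(q3, 3),
        "max": max(scores),
    }

# Toy scores for illustration only
print(five_number_summary([0.1, 0.2, 0.3, 0.4, 0.5]))
```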

In [41]:
import plotly.graph_objects as go

def create_line_graph(metric, data):
    fig = go.Figure()

    # Add traces for each model
    for model in data[metric].keys():
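        # Derive the x-axis from the number of scores instead of hard-coding 30 sentences
        fig.add_trace(go.Scatter(x=list(range(1, len(data[metric][model]) + 1)), y=data[metric][model], mode='lines+markers', name=model))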

    # Update layout
    fig.update_layout(
        title=f"Comparison of {metric.replace('_', ' ').title()} Scores for Different SLMs (Line Graph)",
        xaxis_title="Sentence",
        yaxis_title="Score",
        legend_title="Small Language Models",
        hovermode="x unified",
        xaxis = dict(
        tickmode = 'linear',
        tick0 = 1,
        dtick = 1
    )
    )

    return fig

import plotly.express as px

def create_box_plot(metric, data):
    fig = go.Figure()

    # Add traces for each model
    for model in data[metric].keys():
        fig.add_trace(go.Box(y=data[metric][model], name=model))

    # Update layout
    fig.update_layout(
        title=f"Comparison of {metric.replace('_', ' ').title()} Scores for Different SLMs (Box Plot)",
        xaxis_title="Model",
        yaxis_title="Score",
        legend_title="Small Language Models",
        hovermode="x unified",
  )

    return fig

def create_scatter_plot(metric, data):
    fig = go.Figure()

    # Add traces for each model
    for model in data[metric].keys():
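        # Derive the x-axis from the number of scores instead of hard-coding 30 sentences
        fig.add_trace(go.Scatter(x=list(range(1, len(data[metric][model]) + 1)), y=data[metric][model], mode='markers', name=model))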

    # Update layout
    fig.update_layout(
        title=f"Comparison of {metric.replace('_', ' ').title()} Scores for Different SLMs (Scatter Plot)",
        xaxis_title="Sentence",
        yaxis_title="Score",
        hovermode="x unified",
        legend_title="Small Language Models",
  )

    return fig

def create_violin_plot(metric, data):
    fig = go.Figure()

    # Add traces for each model
    for model in data[metric].keys():
        fig.add_trace(go.Violin(y=data[metric][model], name=model, box_visible=True, meanline_visible=True))

    # Update layout
    fig.update_layout(
        title=f"Comparison of {metric.replace('_', ' ').title()} Scores for Different SLMs (Violin Plot)",
        xaxis_title="Model",
        yaxis_title="Score",
        legend_title="Small Language Models",
        hovermode="x unified"
    )

    return fig


In [42]:
data = evaluations
metrics = data.keys()

for metric in metrics:
    if metric == "rouge":
        continue
    fig = create_line_graph(metric, data)
    fig.show()

    fig = create_box_plot(metric, data)
    fig.show()

    fig = create_scatter_plot(metric, data)
    fig.show()


## Plot for ROUGE scores

In [39]:
import plotly.graph_objects as go
import plotly.express as px

# Helper function to extract scores
def extract_rouge_scores(rouge_scores, metric):
    scores = {'Model': [], 'Sentence': [], 'Recall': [], 'Precision': [], 'F1': []}
    for model, sentences in rouge_scores.items():
        for i, sentence in enumerate(sentences):
            scores['Model'].append(model)
            scores['Sentence'].append(i + 1)
            scores['Recall'].append(sentence[metric]['r'])
            scores['Precision'].append(sentence[metric]['p'])
            scores['F1'].append(sentence[metric]['f'])
    return scores

# Prepare data for plotting
rouge_1_scores = extract_rouge_scores(data['rouge'], 'rouge-1')
rouge_2_scores = extract_rouge_scores(data['rouge'], 'rouge-2')
rouge_l_scores = extract_rouge_scores(data['rouge'], 'rouge-l')

# Function to create line plot for a given metric
def create_line_plot(data, score_type, title):
    fig = go.Figure()
    for model in set(data['Model']):
        model_data = {key: [val for i, val in enumerate(data[key]) if data['Model'][i] == model] for key in data}
        fig.add_trace(go.Scatter(x=model_data['Sentence'], y=model_data[score_type], mode='lines+markers', name=model))

    fig.update_layout(
        title=title,
        xaxis_title='Sentence',
        yaxis_title=score_type,
        hovermode='x unified'
    )
    return fig

# Function to create scatter plot for a given metric
def create_scatter_plot(data, score_type, title):
    fig = px.scatter(data, x='Sentence', y=score_type, color='Model', title=title)
    fig.update_layout(
        xaxis_title='Sentence',
        yaxis_title=score_type,
        hovermode='x unified'
    )
    return fig

# Function to create box plot for a given metric
def create_box_plot(data, score_type, title):
    fig = px.box(data, x='Model', y=score_type, points="all", title=title)
    fig.update_layout(
        xaxis_title='Model',
        yaxis_title=score_type,
        hovermode='x unified'
    )
    return fig

# Pair each ROUGE variant with its extracted scores
rouge_score_sets = [
    ('ROUGE-1', rouge_1_scores),
    ('ROUGE-2', rouge_2_scores),
    ('ROUGE-L', rouge_l_scores),
]
score_types = ['Recall', 'Precision', 'F1']

# Create and show line plots for each ROUGE variant and score type
for name, scores in rouge_score_sets:
    for score_type in score_types:
        create_line_plot(scores, score_type, f'{name} {score_type} Scores').show()

# Create and show scatter plots for each ROUGE variant and score type
for name, scores in rouge_score_sets:
    for score_type in score_types:
        create_scatter_plot(scores, score_type, f'{name} {score_type} Scores Scatter').show()

# Create and show box plots for each ROUGE variant and score type
for name, scores in rouge_score_sets:
    for score_type in score_types:
        create_box_plot(scores, score_type, f'{name} {score_type} Scores Box').show()
