# **Questions**

In [1]:
Questions = \
["What is the bias-variance trade-off?",
"Which factors dominate the model's error at which stages?",
"What is the bias-variance trade-off?",
"Illustrate how the specific model achieves the bias-variance trade-off with examples.",
"What is a confusion matrix?",
"What kinds of problems is the confusion matrix used for?",
"What can be derived from the confusion matrix?",
"What are correlation and covariance in statistics?",
"What does a plot of large negative covariance look like?",
"What does a plot of nearly zero covariance look like?",
"What does a plot of large positive covariance look like?",
"What does a plot of positive correlation look like?",
"What does a plot of zero correlation look like?",
"What does a plot of negative correlation look like?",
"What is the p-value?",
"Based on what behavior of the p-value do we accept or reject the null hypothesis?",
"How can overfitting and underfitting be avoided?",
"What is selection bias?",
"What does the ideal ROC curve look like?",
"What is the formula of the softmax function?",
"How much of the time does data pre-processing take?",
"What classes of classification algorithms are there? List concrete algorithm(s) under each class.",
"Which algorithm can handle cases involving both numerical and categorical data?",
"What is the separating hyperplane?",
"What's the equation of the separating hyperplane for SVM?",
"Give a concrete example of a decision tree.",
"What is the containment relationship between artificial intelligence, machine learning and deep learning?"]

# **Package installation**

In [2]:
!pip install pdf2image
!mkdir -p static/
!pip install pymupdf

!pip install pydantic==1.10.9
!pip install openai
!pip install llama-index
!pip install llama-index-readers-file

!apt-get install -y poppler-utils

!pip install llama-index-vector-stores-qdrant
!pip install -U qdrant_client fastembed
!pip install llama-index-vector-stores-chroma
!pip install llama-index-vector-stores-opensearch
!pip install llama-index-multi-modal-llms-azure-openai

Collecting pydantic==1.10.9
  Using cached pydantic-1.10.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (147 kB)
Using cached pydantic-1.10.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.1 MB)
Installing collected packages: pydantic
  Attempting uninstall: pydantic
    Found existing installation: pydantic 2.8.2
    Uninstalling pydantic-2.8.2:
      Successfully uninstalled pydantic-2.8.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albumentations 1.4.14 requires pydantic>=2.7.0, but you have pydantic 1.10.9 which is incompatible.
llama-index-core 0.11.5 requires pydantic<3.0.0,>=2.7.0, but you have pydantic 1.10.9 which is incompatible.
Successfully installed pydantic-1.10.9
Collecting pydantic<3.0.0,>=2.7.0 (from llama-index-core<0.12.0,>=0.11.5->llama-index)
  Downloading pydantic-2.9.0-py3-none-a

In [4]:
import os
from os import getenv

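# Prefer a key already set in the environment; "sk-" is the redacted placeholder from the original cell
OPENAI_API_KEY = getenv("OPENAI_API_KEY", "sk-")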
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [5]:
from llama_index.core import VectorStoreIndex
from pathlib import Path
from llama_index.embeddings.openai import OpenAIEmbedding

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import (DocxReader,
                                      HWPReader,
                                      PDFReader,
                                      EpubReader,
                                      FlatReader,
                                      HTMLTagReader,
                                      ImageCaptionReader,
                                      ImageReader,
                                      ImageVisionLLMReader,
                                      IPYNBReader,
                                      MarkdownReader,
                                      MboxReader,
                                      PptxReader,
                                      PandasCSVReader,
                                      VideoAudioReader,
                                      UnstructuredReader,
                                      PyMuPDFReader,
                                      ImageTabularChartReader,
                                      XMLReader,
                                      PagedCSVReader,
                                      CSVReader,
                                      RTFReader)

from llama_index.core.node_parser import (SentenceSplitter,
                                          SemanticSplitterNodeParser,
                                          SemanticDoubleMergingSplitterNodeParser,
                                          LanguageConfig)

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.multi_modal_llms.azure_openai import AzureOpenAIMultiModal

import qdrant_client

In [6]:
import chromadb
from chromadb.config import Settings

from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.vector_stores.opensearch import (OpensearchVectorStore, OpensearchVectorClient)
from llama_index.vector_stores.qdrant import QdrantVectorStore

# **Models**

In [7]:
from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

# llm = OpenAI(temperature=0, model="gpt-4o")
Settings.llm = OpenAI(temperature=0, model="gpt-4o")

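# Global embedding model used when building the index and embedding queries
Settings.embed_model = OpenAIEmbedding(model = "text-embedding-3-small", dimensions = 512)
# Separate embedding model with the library's default settings, passed to the semantic splitter in Configs
embed_model = OpenAIEmbedding()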

# **Maps**

In [8]:
# Reader
reader_map = {'docx_reader': DocxReader,
              'HWP_reader':  HWPReader,
              'PDF_reader':  PDFReader,
              'Epub_reader': EpubReader,
              'Flat_reader': FlatReader,
              'HTMLTag_reader': HTMLTagReader,
              'ImageCaption_reader': ImageCaptionReader,
              'Image_reader': ImageReader,
              'ImageVision_reader': ImageVisionLLMReader,
              'IPYNB_reader': IPYNBReader,
              'Markdown_reader': MarkdownReader,
              'Mbox_reader': MboxReader,
              'Pptx_reader': PptxReader,
              'PandasCSV_reader': PandasCSVReader,
              'VideoAudio_reader': VideoAudioReader,
              'Unstructured_reader': UnstructuredReader,
              'PyMuPDF_reader': PyMuPDFReader,
              'ImageTabularChart_reader': ImageTabularChartReader,
              'XML_reader': XMLReader,
              'PagedCSV_reader': PagedCSVReader,
              'CSV_reader': CSVReader,
              'RTF_reader': RTFReader}

# Parser
node_parser_map = {'SentenceSplitter': SentenceSplitter,
                   'SemanticSplitter': SemanticSplitterNodeParser,
                   'SemanticDoubleMergingSplitter': SemanticDoubleMergingSplitterNodeParser}

# Multi-modal LLM
LLM_models = {'OpenAIMultiModal': OpenAIMultiModal(model="gpt-4-turbo-2024-04-09", api_key = OPENAI_API_KEY, max_new_tokens=4096),
              'AzureOpenAIMultiModal': AzureOpenAIMultiModal(engine         = "gpt-4-vision-preview",
                                                             api_version    = "2023-12-01-preview",
                                                             model          = "gpt-4-vision-preview",
                                                             max_new_tokens = 300)}

# vector_stores
supported_vector_stores   = ['Chroma', 'Opensearch', 'Qdrant']

vector_stores = {'Chroma': ChromaVectorStore,
                 'Opensearch': OpensearchVectorStore,
                 'Qdrant': QdrantVectorStore}
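
The maps above work as name-to-class registries: a string from the configuration selects the class to instantiate. A minimal sketch of that lookup pattern follows. The stub classes and the `build_reader` helper are hypothetical stand-ins, not part of llama-index.

```python
# Hypothetical stand-ins for reader classes; the real ones come from llama_index.readers.file.
class PdfReaderStub:
    def load(self, file_path):
        return [f"pdf document from {file_path}"]


class DocxReaderStub:
    def load(self, file_path):
        return [f"docx document from {file_path}"]


# Registry keyed by the same kind of string names used in reader_map.
reader_registry = {"PDF_reader": PdfReaderStub, "docx_reader": DocxReaderStub}


def build_reader(name, registry):
    """Look up a class by name and instantiate it, failing loudly on unknown names."""
    try:
        reader_cls = registry[name]
    except KeyError:
        raise ValueError(f"Unknown reader {name!r}; choose one of {sorted(registry)}") from None
    return reader_cls()


reader = build_reader("PDF_reader", reader_registry)
docs = reader.load("example.pdf")
print(docs)
```

Validating the name at lookup time gives a clearer message than the bare `KeyError` a plain dictionary access would raise.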

# **Configuration**

In [63]:
class Configs:

    # Reader
    selected_reader_type = 'PyMuPDF_reader'

    # Parser
    selected_node_parser = 'SemanticSplitter'

    Node_parsers_parameters = {'SentenceSplitter': {'chunk_size': 1024, 'chunk_overlap': 20},
                               'SemanticSplitter': {'buffer_size': 1, 'breakpoint_percentile_threshold': 95, 'embed_model': embed_model},
                               'SemanticDoubleMergingSplitter': {'language_config': LanguageConfig(language = "english", spacy_model = "en_core_web_md"),
                                                                 'initial_threshold': 0.4,
                                                                 'appending_threshold': 0.5,
                                                                 'merging_threshold': 0.5,
                                                                 'max_chunk_size': 5000}
                               }

    selected_splitter_parameters = Node_parsers_parameters[selected_node_parser]

    # MultiModal LLM model
    selected_LLM_model = 'OpenAIMultiModal'

    # Vector store
    selected_vector_store = 'Chroma'

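    # Use chromadb.config.Settings explicitly: the bare name `Settings` is rebound to llama_index.core.Settings in the Models section
    vector_store_parameters = {'Chroma': {'chroma_collection': chromadb.EphemeralClient(settings = chromadb.config.Settings(allow_reset = True)).create_collection("demo_3")},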

                               'Qdrant': {'collection_name': "demo",
                                          'client': qdrant_client.QdrantClient(host = "localhost", port = 6333),
                                          'aclient': qdrant_client.AsyncQdrantClient(location=":memory:"),
                                          'prefer_grpc': True}
                               }

    selected_vector_store_parameters = vector_store_parameters[selected_vector_store]

configs = Configs()
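
The selected `SemanticSplitter` is configured with `breakpoint_percentile_threshold = 95`. Roughly, the idea is to start a new chunk wherever the embedding distance between neighbouring sentences is unusually large. Below is a simplified sketch of that idea using hand-made 2-D vectors in place of real embeddings. The helper names and the data are assumptions for illustration, and this is not the library's exact implementation.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Linear-interpolation percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower, upper = math.floor(rank), math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_by_distance(sentences, vectors, pct):
    """Start a new chunk where the distance to the previous sentence exceeds the percentile."""
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data: the first two sentences point one way, the last two another.
sentences = ["Bias is error from overly simple models.",
             "Variance is error from overly sensitive models.",
             "A ROC curve plots TPR against FPR.",
             "AUC summarizes the ROC curve."]
vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]

chunks = split_by_distance(sentences, vectors, pct=95)
for chunk in chunks:
    print(chunk)
```

A lower percentile would produce more, smaller chunks, since more of the sentence-to-sentence distances would exceed the threshold.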

# **Load Data**

## Load image data

In [10]:
import os
from pdf2image import convert_from_path

def pdf2images(pdf_file):

    '''Convert each PDF page into a PNG image'''

    # The saved path = original PDF file name (without the extension)
    output_directory_path, _        = os.path.splitext(pdf_file)
    doc_pages_output_directory_path = output_directory_path + "/doc_pages"

    os.makedirs(doc_pages_output_directory_path, exist_ok=True)

    # Convert PDF to images
    images = convert_from_path(pdf_file)

    # Save images as PNG files
    for page_number, image in enumerate(images):
        image.save(f"{doc_pages_output_directory_path}/page_{page_number + 1}.png")

    return doc_pages_output_directory_path
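
To make the output location concrete, the path logic of `pdf2images` can be reproduced without converting anything. The `doc_pages_dir` helper below is a hypothetical name that mirrors only the string handling; it does not touch the filesystem.

```python
import os


def doc_pages_dir(pdf_file):
    """Mirror the path logic of pdf2images: strip the extension, append /doc_pages."""
    base, _extension = os.path.splitext(pdf_file)
    return base + "/doc_pages"


pages_dir = doc_pages_dir("/content/DataScience Interview Questions.pdf")
first_page = f"{pages_dir}/page_1.png"
print(pages_dir)
print(first_page)
```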

## text ---(through Reader)---> documents ---(through Splitter)---> text chunks ---> nodes

In [11]:
from llama_index.core.schema import TextNode

In [64]:
class DataLoader:

    def __init__(self, file_path, configs, if_image=True):

      self.if_image    = if_image
      self.configs     = configs
      self.text_chunks = []
      self.doc_idxs    = []
      self.nodes       = []

      # For image understanding and reasoning
      if if_image:
          self.doc_pages_output_directory_path = pdf2images(file_path)
          self.LLM_model                       = LLM_models[configs.selected_LLM_model]

      # For text loading
      loader         = reader_map[configs.selected_reader_type]()
      self.documents = loader.load(file_path = file_path)
      self.splitter  = node_parser_map[configs.selected_node_parser](**configs.selected_splitter_parameters)

    def fill_chunks(self, docs):

      for doc_idx, doc in enumerate(docs):
          cur_text_chunks = self.splitter.split_text(doc.text)
          self.text_chunks.extend(cur_text_chunks)
          self.doc_idxs.extend([doc_idx] * len(cur_text_chunks))

    def fill_nodes_for_text(self):
      # Fill text chunks
      #self.fill_chunks(self.documents)

      nodes = self.splitter.get_nodes_from_documents(self.documents)
      self.nodes.extend(nodes)

      #for idx, text_chunk in enumerate(self.text_chunks):
      #    node          = TextNode(text = text_chunk)
      #    src_doc       = self.documents[self.doc_idxs[idx]]
      #    node.metadata = src_doc.metadata
      #    self.nodes.append(node)

    def fill_nodes_for_image_description(self):
        # Describe each rendered page image with the multi-modal LLM and store the description as a text node
        for img_file in Path(self.doc_pages_output_directory_path).glob("*.png"):

            image_documents = SimpleDirectoryReader(input_files = [img_file]).load_data()
            response        = self.LLM_model.complete(prompt          = "Please briefly describe the information in the image",
                                                      image_documents = image_documents,)
            self.nodes.append(TextNode(text=str(response)))

    def fill_nodes(self):
        # Fill nodes
        self.fill_nodes_for_text()
        if self.if_image:
            self.fill_nodes_for_image_description()
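
The documents → text chunks → nodes flow in `DataLoader` can be illustrated with plain dictionaries. The sketch below is a deliberately simplified, word-based splitter with overlap; real splitters such as `SentenceSplitter` work on tokens and sentence boundaries. The `chunk_words` helper and the sample documents are assumptions for illustration.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size words that share chunk_overlap words."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical documents, each with text and metadata, as a reader might return them.
documents = [
    {"text": "a b c d e f g", "metadata": {"source": "page 1"}},
    {"text": "h i j", "metadata": {"source": "page 2"}},
]

# Each chunk becomes a node that keeps the metadata of its source document.
nodes = []
for doc in documents:
    for chunk in chunk_words(doc["text"], chunk_size=4, chunk_overlap=1):
        nodes.append({"text": chunk, "metadata": doc["metadata"]})

for node in nodes:
    print(node)
```

Carrying the source metadata on every node is what later allows an answer to be traced back to the page it came from.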

In [65]:
data_loader = DataLoader("/content/DataScience Interview Questions.pdf", configs)

In [66]:
data_loader.fill_nodes()

In [67]:
len(data_loader.nodes)

176

# **Vector Store**

In [42]:
class Vector_store:

    def __init__(self, selected_vector_store, nodes):

        if selected_vector_store not in supported_vector_stores:
            raise Exception(f"Sorry, {selected_vector_store} is not in supported vector stores")

        # Create vector store
        selected_vector_store = vector_stores[selected_vector_store](**configs.selected_vector_store_parameters)

        # Storage Context is the storage container of Vector Store, which is used to store text, index, vector and other data.
        storage_context       = StorageContext.from_defaults(vector_store = selected_vector_store)

        # Create index through connecting the storage context to the selected vector store
        self.index = VectorStoreIndex(nodes, storage_context = storage_context)

        self.chat_engine = self.index.as_chat_engine()

    def chat(self, question):

        # Return the response so that callers can use the answer
        response = self.chat_engine.chat(question)
        return response


In [43]:
vector_store = Vector_store(configs.selected_vector_store, data_loader.nodes)
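
Conceptually, a vector index answers a query by comparing the query embedding with the stored node embeddings and returning the closest ones. A minimal in-memory sketch with hand-made 3-D vectors is shown below. The vectors, texts, and helper names are made up for illustration, and real stores use approximate search over much higher-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vector, indexed, k):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index of (text, embedding) pairs.
indexed = [
    ("bias-variance trade-off", (0.9, 0.1, 0.0)),
    ("confusion matrix", (0.1, 0.9, 0.0)),
    ("softmax function", (0.0, 0.2, 0.9)),
]

results = top_k((0.8, 0.2, 0.0), indexed, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```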

# **Ranker**

In [46]:
!pip install torch sentence-transformers
from llama_index.core.postprocessor import SentenceTransformerRerank

Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.0.1


In [47]:
reranker = SentenceTransformerRerank(model="BAAI/bge-reranker-large", top_n=2)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



# **RAG Fusion Retriever**

In [48]:
from llama_index.core.retrievers import QueryFusionRetriever

In [49]:
fusion_retriever = QueryFusionRetriever([vector_store.index.as_retriever()],
                                         similarity_top_k = 5, # number of top-k results to retrieve
                                         num_queries = 3,  # number of queries to generate
                                         use_async = True)
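
A fusion retriever runs several query variants and has to merge their result lists into one ranking. One simple merging strategy, sketched below under the assumption that each result is a `(node_id, score)` pair, is to deduplicate by node and keep each node's best score. The `fuse_results` helper and the sample scores are hypothetical, and the library may merge results differently depending on its fusion mode.

```python
def fuse_results(result_lists, top_k):
    """Merge several ranked lists: keep the best score per node, then take the top_k."""
    best = {}
    for results in result_lists:
        for node_id, score in results:
            if node_id not in best or score > best[node_id]:
                best[node_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Made-up results for three query variants of the same question.
results_per_query = [
    [("n1", 0.9), ("n2", 0.5)],
    [("n2", 0.8), ("n3", 0.4)],
    [("n3", 0.7), ("n1", 0.6)],
]

fused = fuse_results(results_per_query, top_k=2)
print(fused)
```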

# **Query Engine**

In [50]:
from llama_index.core.query_engine import RetrieverQueryEngine

In [51]:
query_engine = RetrieverQueryEngine.from_args(fusion_retriever,
                                              node_postprocessors = [reranker])
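
The reranker acts as a post-processing step: retrieved passages are rescored against the query and only the best `top_n` are passed on. The sketch below shows that flow with a toy word-overlap score standing in for the cross-encoder model. The scoring function and passages are assumptions for illustration only.

```python
def overlap_score(query, passage):
    """Toy relevance score: Jaccard overlap of lowercase word sets (stand-in for a cross-encoder)."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    union = query_words | passage_words
    return len(query_words & passage_words) / len(union) if union else 0.0


def rerank(query, passages, top_n):
    """Rescore every passage against the query and keep the top_n best."""
    ranked = sorted(passages, key=lambda passage: overlap_score(query, passage), reverse=True)
    return ranked[:top_n]


# Passages in the (made-up) order the retriever returned them.
retrieved = [
    "the softmax function normalizes scores",
    "precision and recall come from the confusion matrix",
    "a confusion matrix tabulates predictions against labels",
]

kept = rerank("what is a confusion matrix", retrieved, top_n=2)
for passage in kept:
    print(passage)
```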

# **Chat Engine**

In [55]:
import nest_asyncio
nest_asyncio.apply() # Only needed in Jupyter notebook environments; without it an error is raised
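
The `nest_asyncio.apply()` call is presumably needed because Jupyter already runs an event loop, and code that starts another one from inside it fails. The standard-library sketch below reproduces that underlying error outside a notebook. The function names are hypothetical and it deliberately does not use `nest_asyncio`.

```python
import asyncio


async def inner():
    return 42


async def outer():
    """Try to start a second event loop while one is already running."""
    coro = inner()
    try:
        asyncio.run(coro)  # not allowed: outer() itself runs inside an event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```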

In [52]:
from llama_index.core.chat_engine import CondenseQuestionChatEngine

In [53]:
chat_engine = CondenseQuestionChatEngine.from_defaults(query_engine=query_engine, # condense_question_prompt=... # the chat message prompt template can be customized
                                                       )
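
A condense-question chat engine rewrites each follow-up message, together with the chat history, into a standalone question before it reaches the query engine. The sketch below only shows how such a prompt could be assembled. The template text and the `build_condense_prompt` helper are hypothetical, not the library's default prompt.

```python
# Hypothetical template; the library ships its own default and accepts a custom one.
CONDENSE_TEMPLATE = (
    "Given the conversation so far and a follow-up message, rewrite the message "
    "as a standalone question.\n\n"
    "Chat history:\n{history}\n\n"
    "Follow-up: {question}\n\n"
    "Standalone question:"
)


def build_condense_prompt(history, question):
    """Format (role, text) history pairs and the follow-up into one prompt string."""
    history_text = "\n".join(f"{role}: {text}" for role, text in history)
    return CONDENSE_TEMPLATE.format(history=history_text, question=question)


history = [
    ("user", "What is a confusion matrix?"),
    ("assistant", "A table comparing predicted and actual classes."),
]

prompt = build_condense_prompt(history, "What can be derived from it?")
print(prompt)
```

With the history included, a follow-up such as "What can be derived from it?" can be resolved to the confusion matrix before retrieval runs.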

In [56]:
for i, question in enumerate(Questions):

    response = chat_engine.chat(question)

    print("{}:\n Query: {}\n Answer: {}\n".format(i, question, response))

0:
 Query: What is the bias-variance trade-off?
 Answer: The bias-variance trade-off refers to the balance that needs to be achieved in machine learning models between the error introduced by oversimplification (bias) and the error introduced by sensitivity to fluctuations in the training data (variance). The goal is to find a model that has low bias and low variance to ensure good prediction performance. Increasing model complexity typically reduces bias but increases variance, leading to a trade-off where minimizing one type of error may increase the other, ultimately impacting the total error of the model.

1:
 Query: Which factors dominate the model's error at which stages?
 Answer: Bias dominates a model's error when the model is too simple or has high bias, leading to underfitting. On the other hand, variance dominates a model's error when the model is too complex or has high variance, causing overfitting. The trade-off between bias and variance aims to find the optimal balance w