# allganize-RAG-Evaluation data + multimodal hybrid search
## Methodology
1. Read PDF files with `Reader`
    * Try `DoclingPDFReader` with `PDF2ImageReader` as fallback
2. Chunk each `Document` into single-node `Document` instances
3. Embed the chunk `Document` instances
    * dense: `Visualized_BGE`
    * sparse: BM42 (fastembed `SparseTextEmbedding`)
4. Insert into the `QdrantSingleHybridVectorStore` vector store
5. Test retrieval with queries

## Setting
* parser:
    * IBM [Docling](https://github.com/DS4SD/docling) v2.22.0
    * docling-v2 pdf parser backend
* dense embedding model: `baai/bge-visualized` (bge-m3 weights)
    * https://huggingface.co/BAAI/bge-visualized
* data: real-world PDF files from `allganize-RAG-Evaluation-Dataset-KO`
    * https://huggingface.co/datasets/allganize/RAG-Evaluation-Dataset-KO
    * uses the 10 PDF files in the 'finance' domain

In [1]:
import json
from pathlib import Path
import time
from typing import Any, Dict, List, Optional

import jsonlines
import pandas as pd
from tqdm import tqdm

from config import settings

In [2]:
import sys
import os

parent_dir = os.path.dirname(os.getcwd())
core_src_dir = os.path.join(parent_dir, "src/psiking")
sys.path.append(core_src_dir)

In [3]:
## Import Core Schemas
from core.base.schema import Document, TextNode, ImageNode, TableNode

# 1. Read Data
* 10 PDF files
* Try conversion with Docling first; use pdf2image as a fallback when it fails
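
The try-then-fall-back pattern described above can be sketched independently of the actual readers. This is a minimal illustration, not the notebook's implementation: `read_with_fallback`, `primary_reader`, and `fallback_reader` are hypothetical names, and the stub readers only imitate a failing parser and a working one.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Hypothetical type: a reader is any callable that takes a file path and
# returns a parsed document, raising an exception on failure.
Reader = Callable[[str], Any]


def read_with_fallback(
    file_path: str,
    readers: Sequence[Tuple[str, Reader]],
) -> Tuple[Optional[Any], List[str]]:
    """Try each (name, reader) in order.

    Returns the first successful result and the names of readers that failed.
    """
    failed: List[str] = []
    for name, reader in readers:
        try:
            return reader(file_path), failed
        except Exception as exc:  # broad on purpose: parser errors vary widely
            print(f"[{name}] failed {file_path} - {exc}")
            failed.append(name)
    return None, failed


# Stub readers standing in for the real ones (illustration only).
def primary_reader(path: str) -> str:
    raise ValueError("Invalid code point")


def fallback_reader(path: str) -> str:
    return f"images:{path}"


document, failed = read_with_fallback(
    "sample.pdf",
    [("PRIMARY", primary_reader), ("FALLBACK", fallback_reader)],
)
print(document, failed)
```

Returning the list of failed reader names alongside the result makes it easy to keep per-reader failure logs, as the loop below does with its two `*_failed_fnames` lists.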

In [4]:
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    PictureDescriptionApiOptions
)
from core.reader.pdf.docling_reader import DoclingPDFReader

format_options = PdfPipelineOptions()
format_options.images_scale = 1.5
format_options.generate_page_images = True
format_options.generate_picture_images = True

format_options.do_ocr = False
format_options.do_table_structure = True

# Image description
print("VLM MODEL:", settings.vlm_model)

# Use VLM for image description (ImageNode.text)
image_description_options = PictureDescriptionApiOptions(
    url=f"{settings.vlm_base_url}/v1/chat/completions",
    params=dict(
        model=settings.vlm_model,
        seed=42,
        max_completion_tokens=512,
        temperature=0.9
    ),
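    # Prompt (Korean), roughly: "Describe the image in detail in about 3 lines.
    # If the image contains no information, do not write a description."
    prompt="이미지에 대해 3줄 정도로 자세히 설명해 주세요. 이미지에 정보가 없다면 설명 텍스트를 작성하지 않습니다",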
    timeout=90,
    bitmap_area_threshold=0.05 # 5% of page area
)
format_options.do_picture_description = True
format_options.picture_description_options = image_description_options

docling_reader = DoclingPDFReader()

VLM MODEL: Qwen2-VL-72B-Instruct-GPTQ-Int4


In [5]:
from core.reader import PDF2ImageReader

# testing on macOS, provide poppler path manually
poppler_path = "/opt/homebrew/Cellar/poppler/25.01.0/bin"
pdf2img_reader = PDF2ImageReader(poppler_path=poppler_path)

In [6]:
# PDF File directory
pdf_dir = os.path.join(settings.data_dir, "allganize-RAG-Evaluation-Dataset-KO/finance")
pdf_fnames = [x for x in os.listdir(pdf_dir) if x.endswith(".pdf")]
print("num files:", len(pdf_fnames))
pdf_fnames[:10]

num files: 10


['★2019 제1회 증시콘서트 자료집_최종★.pdf',
 '240409(보도자료) 금융위 핀테크 투자 생태계 활성화 나선다.pdf',
 '2024년 3월_3. 향후 통화신용정책 방향.pdf',
 '133178946057443204_WP22-05.pdf',
 '240130(보도자료) 지방은행의 시중은행 전환시 인가방식 및 절차.pdf',
 '130292099630937500_KIFVIP2013-10.pdf',
 '2024년 3월_2. 통화신용정책 운영.pdf',
 '[별첨] 지방은행의 시중은행 전환시 인가방식 및 절차.pdf',
 '240320(보도자료) 금융권의 상생금융 추진현황.pdf',
 '한-호주 퇴직연금 포럼_책자(최종).pdf']

In [7]:

# Read each PDF with Docling; fall back to PDF2ImageReader (page images) on failure
documents = []
docling_failed_fnames = []
pdf2img_failed_fnames = []
for doc_i, fname in tqdm(enumerate(pdf_fnames)):
    file_path = os.path.join(pdf_dir, fname)
    print(fname)
    extra_info = {
        "source_id": f"allganize-RAG-Evaluation-Dataset-KO/finance/{doc_i}", # arbitrary id
        "domain": "finance",
        "source_file": fname
    }
    try:
        document = docling_reader.run(
            file_path,
            extra_info=extra_info
        )
        documents.append(document)
        continue
    except Exception as e:
        print("[DOCLING READER] failed {} - {}".format(fname, str(e)))
        docling_failed_fnames.append(fname)
    
    try:
        document = pdf2img_reader.run(
            file_path,
            extra_info=extra_info
        )
        documents.append(document)
    except Exception as e:
        print("[PDF2IMG READER] failed {} - {}".format(fname, str(e)))
        pdf2img_failed_fnames.append(fname)
    
# Inspect the node types of the last document read
for node in document.nodes[:3]:
    print(type(node))

0it [00:00, ?it/s]

★2019 제1회 증시콘서트 자료집_최종★.pdf


1it [00:27, 27.40s/it]

240409(보도자료) 금융위 핀테크 투자 생태계 활성화 나선다.pdf


2it [00:29, 12.76s/it]

2024년 3월_3. 향후 통화신용정책 방향.pdf


3it [00:45, 13.92s/it]

133178946057443204_WP22-05.pdf


Encountered an error during conversion of document 02616dbc4dc47f992b7008e68e4f1d4cb49ccece229e7fad02a38a3470346a63:
Traceback (most recent call last):

  File "/opt/miniconda3/envs/docling/lib/python3.10/site-packages/docling/pipeline/base_pipeline.py", line 163, in _build_document
    for p in pipeline_pages:  # Must exhaust!

  File "/opt/miniconda3/envs/docling/lib/python3.10/site-packages/docling/pipeline/base_pipeline.py", line 127, in _apply_on_pages
    yield from page_batch

  File "/opt/miniconda3/envs/docling/lib/python3.10/site-packages/docling/models/page_assemble_model.py", line 60, in __call__
    for page in page_batch:

  File "/opt/miniconda3/envs/docling/lib/python3.10/site-packages/docling/models/table_structure_model.py", line 175, in __call__
    yield from page_batch

  File "/opt/miniconda3/envs/docling/lib/python3.10/site-packages/docling/models/layout_model.py", line 146, in __call__
    for page in page_batch:

  File "/opt/miniconda3/envs/docling/lib/python3

[DOCLING READER] failed 133178946057443204_WP22-05.pdf - Invalid code point


4it [00:55, 12.53s/it]

240130(보도자료) 지방은행의 시중은행 전환시 인가방식 및 절차.pdf


5it [00:58,  8.98s/it]

130292099630937500_KIFVIP2013-10.pdf


6it [01:15, 11.92s/it]

2024년 3월_2. 통화신용정책 운영.pdf


7it [02:38, 35.01s/it]

[별첨] 지방은행의 시중은행 전환시 인가방식 및 절차.pdf


8it [02:43, 25.56s/it]

240320(보도자료) 금융권의 상생금융 추진현황.pdf


9it [02:46, 18.52s/it]Encountered an error during conversion of document ce014774ce984417127bff298a0e883db7ad2652e7cb66d49bbbb2423cc4176c:
Traceback (most recent call last):

  File "/opt/miniconda3/envs/docling/lib/python3.10/site-packages/docling/pipeline/base_pipeline.py", line 163, in _build_document
    for p in pipeline_pages:  # Must exhaust!

  File "/opt/miniconda3/envs/docling/lib/python3.10/site-packages/docling/pipeline/base_pipeline.py", line 127, in _apply_on_pages
    yield from page_batch

  File "/opt/miniconda3/envs/docling/lib/python3.10/site-packages/docling/models/page_assemble_model.py", line 60, in __call__
    for page in page_batch:

  File "/opt/miniconda3/envs/docling/lib/python3.10/site-packages/docling/models/table_structure_model.py", line 175, in __call__
    yield from page_batch

  File "/opt/miniconda3/envs/docling/lib/python3.10/site-packages/docling/models/layout_model.py", line 146, in __call__
    for page in page_batch:

  File "/opt/miniconda3/en

한-호주 퇴직연금 포럼_책자(최종).pdf
[DOCLING READER] failed 한-호주 퇴직연금 포럼_책자(최종).pdf - Invalid code point


10it [02:52, 17.29s/it]

<class 'core.base.schema.ImageNode'>
<class 'core.base.schema.ImageNode'>
<class 'core.base.schema.ImageNode'>





In [8]:
document.metadata

{'source_id': 'allganize-RAG-Evaluation-Dataset-KO/finance/9',
 'domain': 'finance',
 'source_file': '한-호주 퇴직연금 포럼_책자(최종).pdf'}

In [9]:
# image = document.nodes[0].image

# # Crop to half
# width, height = image.size
# left_half = image.crop((0, 0, width, height//2))
# left_half

# 2. Process Document into Chunks
1. merge text nodes with `TextNodeMerger`
2. split texts into chunks with `LangchainRecursiveCharacterTextSplitter`
3. drop chunks whose text is shorter than a minimum length
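
To make the roles of the chunk size, the overlap, and the minimum length concrete, here is a simplified sketch. It uses plain fixed-size character windows, which is not the recursive separator-based algorithm of `LangchainRecursiveCharacterTextSplitter`; `split_with_overlap` and `filter_short` are hypothetical helper names used only for illustration.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size character windows.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def filter_short(chunks: List[str], min_text_length: int) -> List[str]:
    """Keep only chunks whose stripped length reaches the threshold."""
    return [c for c in chunks if len(c.strip()) >= min_text_length]


text = "".join(str(i % 10) for i in range(2000))
pieces = split_with_overlap(text, chunk_size=1024, chunk_overlap=128)
print([len(p) for p in pieces])
print(filter_short(["short", "x" * 30], min_text_length=30))
```

A recursive splitter additionally prefers to cut at separators such as paragraph or line breaks, so real chunk boundaries will differ from these fixed windows.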

In [10]:
from core.processor.document.text_merger import TextNodeMerger

# Merge the text nodes of each document at page level
merger = TextNodeMerger()

merged_documents = []
for document in documents:
    merged_document = merger.run(document)
    merged_documents.append(merged_document)

In [11]:
# merged_documents[0]
merged_documents[0].nodes[0]

TextNode(id_='2b241214-d411-41e2-8410-c0e88a62ab49', metadata={'page_no': 1}, text_type=<TextType.PLAIN: 'plain'>, label=<TextLabel.PLAIN: 'plain'>, resource=MediaResource(data=None, text='증권사 리서치센터장, 자산운용사 대표와 함께하는 제1회 증시 콘서트\n2019 하반기 증시 대전망\n|\xa0일\xa0시\xa0| 2019.\xa07.\xa02\xa0(화)\xa014:30\n|\xa0장\xa0소\xa0| 금융투자협회\xa03층\xa0불스홀', path=None, url=None, mimetype=None))

In [12]:
[x.id_ for x in merged_documents]

['6fac7f25-bc33-4842-bbdf-fc1ec7a91274',
 'a782bc1c-011d-402b-9ed2-4f2439c6e4e2',
 '0295a537-4856-4564-81a4-d81e272b0398',
 'a602289c-abf7-49d9-b49a-747b65dbeea9',
 '640081d0-57ad-4b74-b632-769176c21062',
 'a738059e-cea4-4fc2-b669-a38458fdb168',
 '6ec2cb86-a114-4f1b-8c62-c00f32534c75',
 '62af0f2e-c08b-45db-a017-09bd88f68060',
 'dbe3a4a8-5669-41de-aaf9-e9031b174a22',
 '32c09f9c-d78e-4d39-a8b7-ea646aaf4952']

In [13]:
# Split merged text nodes into chunks and drop chunks that are too short
from core.splitter.text.langchain_text_splitters import LangchainRecursiveCharacterTextSplitter

splitter = LangchainRecursiveCharacterTextSplitter(
    chunk_size = 1024,
    chunk_overlap = 128
)

min_text_length = 30
chunks = []
for document in merged_documents:
    document_chunks = []
    source_id = document.id_
    for i, node in enumerate(document.nodes):
        # Run Splitter
        if isinstance(node, TextNode):
            try:
                split_nodes = splitter.run(node)
            except Exception as e:
                # Print the offending node for debugging, then re-raise
                print(i, node)
                print(str(e))
                raise
        else:
            split_nodes = [node]
        
        # Create New Document
        for split_node in split_nodes:
            ## Filter TextNodes with short lengths
            if isinstance(split_node, TextNode) and len(split_node.text.strip())<min_text_length:
                continue
            
            # Each Document contains single node
            chunk = Document(
                nodes=[split_node],
                
                metadata={
                    "source_id": source_id,
                    "domain": document.metadata["domain"],
                    "source_file": document.metadata['source_file'],
                }
            )
            document_chunks.append(chunk)
    chunks.extend(document_chunks)
print(len(chunks))

1010


In [14]:
chunk_ids = [x.id_ for x in chunks]
print(len(chunk_ids), len(set(chunk_ids)))

1010 1010


# 3. Embed Using VisualizedBGE + BM42

In [15]:
# Initialize Text Formatter
from core.formatter.document.simple import SimpleTextOnlyFormatter

# use default templates
text_formatter = SimpleTextOnlyFormatter()

## 3-1. Dense Embedding VisualizedBGE

In [16]:
## Load Model
import torch
from visual_bge.modeling import Visualized_BGE

# Paths to the bge-m3 base model and the Visualized BGE weights
bge_m3_model_dir = os.path.join(
    settings.model_weight_dir, "bge-m3"
)
visualized_model_dir = os.path.join(
    settings.model_weight_dir, "baai-bge-visualized/Visualized_m3.pth"
)

dense_embedding_model = Visualized_BGE(
    model_name_bge = bge_m3_model_dir,
    model_weight= visualized_model_dir
)
dense_embedding_model.eval()
print("Loaded Dense Embedding Model")
dense_embedding_model.dtype



Loaded Dense Embedding Model


torch.float32

In [17]:
from core.embedder.flagembedding import (
    VisualizedBGEInput, 
    LocalVisualizedBGEEmbedder
)
dense_embedder = LocalVisualizedBGEEmbedder(
    model=dense_embedding_model
)

In [18]:
def prepare_visualized_bge_input(text_formatter, chunk: Document):
    # Each chunk holds a single node, so format it alone and take the first result
    formatted_text = text_formatter.run([chunk])[0]
    
    node = chunk.nodes[0]
    if isinstance(node, TextNode):
        return VisualizedBGEInput(text=formatted_text)
    elif isinstance(node, (ImageNode, TableNode)):
        return VisualizedBGEInput(
            text=formatted_text,
            image=node.image
        )
    else:
        raise ValueError("Unknown node type error {}".format(type(node)))
    
visualized_bge_inputs = [prepare_visualized_bge_input(text_formatter, x) for x in chunks]

In [19]:
visualized_bge_inputs[0]

VisualizedBGEInput(text='증권사 리서치센터장, 자산운용사 대표와 함께하는 제1회 증시 콘서트\n2019 하반기 증시 대전망\n|\xa0일\xa0시\xa0| 2019.\xa07.\xa02\xa0(화)\xa014:30\n|\xa0장\xa0소\xa0| 금융투자협회\xa03층\xa0불스홀', image=None)

In [20]:
dense_embeddings = dense_embedder.run(visualized_bge_inputs, batch_size = 4, disable_tqdm=False)

100%|██████████| 114/114 [03:12<00:00,  1.69s/it]
100%|██████████| 140/140 [03:48<00:00,  1.63s/it]


In [21]:
# (num_chunks, embedding_dim): one dense vector per chunk
print(len(dense_embeddings))
print(len(dense_embeddings[0]))

1010
1024
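
Each chunk is represented by one 1024-dimensional vector, and the collection created later compares such vectors with cosine distance. As a minimal sketch of what that comparison computes, the snippet below ranks toy 3-dimensional vectors with a hand-written `cosine_similarity` (a hypothetical helper; the vectors and chunk names are made up, not real embeddings).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy vectors for illustration only.
query = [1.0, 0.0, 1.0]
chunk_vectors = {"chunk-a": [2.0, 0.0, 2.0], "chunk-b": [0.0, 3.0, 0.0]}

# Rank chunk ids by similarity to the query, most similar first.
ranked = sorted(
    chunk_vectors,
    key=lambda name: cosine_similarity(query, chunk_vectors[name]),
    reverse=True,
)
print(ranked)
```

Because cosine similarity ignores vector length, `chunk-a` ranks first even though it is twice as long as the query.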


## 3-2. Sparse Embedding (BM42)
* Embed using BM42 Sparse embedder model
    * https://huggingface.co/Qdrant/all_miniLM_L6_v2_with_attentions
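
A sparse embedding is a pair of parallel lists: token indices and their weights. A minimal sketch of how two such vectors can be scored against each other follows, assuming the score is a dot product over shared indices. The optional `idf` mapping is a hypothetical stand-in for the IDF weighting that the vector store is configured with later; how Qdrant applies it internally is not shown in this notebook, and the toy indices below are not real BM42 output.

```python
from typing import Dict, Optional, Sequence


def sparse_dot(
    q_indices: Sequence[int],
    q_values: Sequence[float],
    d_indices: Sequence[int],
    d_values: Sequence[float],
    idf: Optional[Dict[int, float]] = None,
) -> float:
    """Sum q * d over indices present in both vectors.

    If idf is given, each matching term is scaled by idf[index]
    (indices missing from idf get a weight of 1.0).
    """
    doc = dict(zip(d_indices, d_values))
    score = 0.0
    for idx, q_val in zip(q_indices, q_values):
        if idx in doc:
            weight = 1.0 if idf is None else idf.get(idx, 1.0)
            score += weight * q_val * doc[idx]
    return score


# Toy vectors; the indices are arbitrary token ids.
query = ([11, 42], [0.5, 0.25])
doc_a = ([42, 7], [0.4, 0.9])
doc_b = ([3, 7], [0.6, 0.1])

print(sparse_dot(*query, *doc_a))                  # only index 42 overlaps
print(sparse_dot(*query, *doc_b))                  # no overlapping indices
print(sparse_dot(*query, *doc_a, idf={42: 2.0}))   # same match, doubled by IDF
```

Only tokens that appear in both the query and the document contribute, which is why sparse retrieval behaves like keyword matching.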

### Loading model from pre-downloaded directory
* Load the model by passing `specific_model_path`
    * `specific_model_path` (Optional[str]): path to the ONNX model directory, used when the model should be loaded from somewhere other than the cache
    * with this argument set, the `download_model` method skips the download phase (available in versions after v0.5.1)
        * https://github.com/qdrant/fastembed/blob/a931f143ef3543234bc9d8d0c305496c67199972/fastembed/common/model_management.py#L367
    * build from source at commit `a931f143ef3543234bc9d8d0c305496c67199972`
* `cache_dir` (str, optional): path to the cache directory.
    Can be set using the `FASTEMBED_CACHE_PATH` env variable.
    Defaults to `fastembed_cache` in the system's temp directory.
```
cd poetry
poetry build
pip install --force-reinstall fastembed-0.5.1-py3-none-any.whl
```

In [22]:
os.environ["FASTEMBED_CACHE_PATH"] = str(os.path.join(os.getcwd(), "fastembed"))
print(os.environ["FASTEMBED_CACHE_PATH"])
sparse_model_dir = os.path.join(settings.model_weight_dir, "fastembed/sparse/all_miniLM_L6_v2_with_attentions")
os.listdir(sparse_model_dir)

/Users/id4thomas/github/psi-king/examples/fastembed


['tokenizer_config.json',
 'special_tokens_map.json',
 'config.json',
 'tokenizer.json',
 'README.md',
 'vocab.txt',
 'model.onnx',
 '.gitattributes',
 '.git',
 'stopwords.txt']

In [23]:
# Load fastembed model
from fastembed import SparseTextEmbedding

# test specific_model_path function
downloaded_dir = SparseTextEmbedding.download_model(
    model={},
    cache_dir=os.environ["FASTEMBED_CACHE_PATH"],
    specific_model_path=sparse_model_dir,
)
print(downloaded_dir)

sparse_model = SparseTextEmbedding(
    model_name="Qdrant/bm42-all-minilm-l6-v2-attentions",
    specific_model_path=sparse_model_dir,
    cuda=False,
    lazy_load=False
)

test_embeddings = list(sparse_model.embed(["hi"]))
print(test_embeddings)
test_embeddings[0].values.tolist(), test_embeddings[0].indices.tolist()

/Users/id4thomas/models/fastembed/sparse/all_miniLM_L6_v2_with_attentions
[SparseEmbedding(values=array([0.30918342]), indices=array([948991206]))]


([0.3091834199811786], [948991206])

In [24]:
# Load Embedder
from core.embedder.fastembed.local_sparse import LocalFastEmbedSparseEmbedder

sparse_embedder = LocalFastEmbedSparseEmbedder(
    model=sparse_model
)

In [25]:
def prepare_sparse_input(text_formatter, chunk: Document):
    # Each chunk holds a single node, so format it alone and take the first result
    formatted_text = text_formatter.run([chunk])[0]
    return formatted_text

sparse_inputs = [prepare_sparse_input(text_formatter, x) for x in chunks]
sparse_embedding_values, sparse_embedding_indices = sparse_embedder.run(
    sparse_inputs, batch_size=256
)

In [26]:
print(sparse_embedding_values[0][:5])
print(sparse_embedding_indices[0][:5])

[0.27762097595534047, 0.2596218248069528, 0.29100913226138186, 0.2296326768039164, 0.11464637476009029]
[186075762, 777355938, 1724316797, 214838547, 1558169044]


In [27]:
# [x.id_ for x in chunks]

# Make DocumentStore

In [28]:
from core.storage.docstore import InMemoryDocumentStore

docstore = InMemoryDocumentStore()
docstore.add(chunks)
print(docstore.count())

1010


# 4. Insert into VectorStore
* initialize Qdrant in in-memory mode

In [29]:
from qdrant_client import QdrantClient
from core.storage.vectorstore.qdrant import QdrantSingleHybridVectorStore


# initialize client
client = QdrantClient(":memory:")
collection_name = "allganize-finance"

vector_store = QdrantSingleHybridVectorStore(
    collection_name=collection_name,
    client=client
)

In [30]:
## Create Collection
from qdrant_client.http import models

# bge-m3 1024 dim
dense_embedding_dim=1024
dense_vectors_config = models.VectorParams(
    size=dense_embedding_dim,
    distance=models.Distance.COSINE,
    on_disk=True,
)

# Sparse BM42 Embedding
sparse_vectors_config = models.SparseVectorParams(
    modifier=models.Modifier.IDF, ## uses indices from bm42 embedder
)

# Create VectorStore
vector_store.create_collection(
    dense_vector_config=dense_vectors_config,
    sparse_vector_config=sparse_vectors_config,
    on_disk_payload=True,
)

# Create Index
vector_store.create_index(
    field_name="text",
    field_schema=models.TextIndexParams(
        type="text",
        tokenizer=models.TokenizerType.MULTILINGUAL,
    ),
)



In [31]:
vector_store.add(
    documents=chunks,
    texts=sparse_inputs,
    dense_embeddings=dense_embeddings,
    sparse_embedding_values=sparse_embedding_values,
    sparse_embedding_indices=sparse_embedding_indices,
    metadata_keys=["source_file", "source_id", "title"]
)

In [32]:
# check collection
collection_info = vector_store._client.get_collection(
    collection_name=vector_store.collection_name
)
print(collection_info.model_dump_json(indent=4))

{
    "status": "green",
    "optimizer_status": "ok",
    "vectors_count": null,
    "indexed_vectors_count": 0,
    "points_count": 1010,
    "segments_count": 1,
    "config": {
        "params": {
            "vectors": {
                "vector_dense": {
                    "size": 1024,
                    "distance": "Cosine",
                    "hnsw_config": null,
                    "quantization_config": null,
                    "on_disk": true,
                    "datatype": null,
                    "multivector_config": null
                }
            },
            "shard_number": null,
            "sharding_method": null,
            "replication_factor": null,
            "write_consistency_factor": null,
            "read_fan_out_factor": null,
            "on_disk_payload": null,
            "sparse_vectors": {
                "vector_sparse": {
                    "index": null,
                    "modifier": "idf"
                }
            }
        },
 

In [33]:
chunks[0].id_

'c0e9699c-a4b2-4498-9886-fffd5fcb7c4f'

In [34]:
points = vector_store._client.retrieve(
    collection_name=vector_store.collection_name,
    ids=[chunks[0].id_],
    with_vectors=True
)

In [35]:
print(points[0].id)
print(points[0].payload)

# Dense Vector
print(len(points[0].vector['vector_dense']))

# Sparse Vector
print(len(points[0].vector['vector_sparse'].indices))
print(len(points[0].vector['vector_sparse'].values))

c0e9699c-a4b2-4498-9886-fffd5fcb7c4f
{'source_id': '6fac7f25-bc33-4842-bbdf-fc1ec7a91274', 'source_file': '★2019 제1회 증시콘서트 자료집_최종★.pdf'}
1024
20
20


# 5. Test Retrieval with Query

In [36]:
# Embed Query
query = "시중은행, 지방은행, 인터넷은행의 인가 요건 및 절차에 차이가 있는데 그 차이점은 무엇인가요?"

query_dense_embedding = dense_embedder.run(
    [VisualizedBGEInput(text=query)],
    batch_size = 4,
    disable_tqdm=False
)

query_sparse_embedding_values, query_sparse_embedding_indices = sparse_embedder.run(
    [query], batch_size = 1
)

print(len(query_dense_embedding[0]))
print(len(query_sparse_embedding_values[0]), query_sparse_embedding_values[0])
print(len(query_sparse_embedding_indices[0]), query_sparse_embedding_indices[0])

100%|██████████| 1/1 [00:00<00:00,  6.59it/s]

1024
8 [0.31059375711328135, 0.31304877079908167, 0.19882314306607887, 0.1964898348134671, 0.32203981694197703, 0.3009219191747245, 0.10172041730715874, 0.33753322982893214]
8 [1024444394, 1285937098, 693871510, 376689346, 332251539, 1798584096, 1061271926, 1903036828]





In [58]:
# Hybrid Query
results = vector_store.query(
    mode="hybrid",
    dense_embedding=query_dense_embedding[0],
    sparse_embedding_values=query_sparse_embedding_values[0],
    sparse_embedding_indices=query_sparse_embedding_indices[0],
    limit=10
)
print(len(results.points))

10


In [59]:
print(results.points[0])
results.points[0].payload

id='dcd44cdf-858d-481d-af0c-36f56b5aa29d' version=0 score=0.8333333333333333 payload={'source_id': '62af0f2e-c08b-45db-a017-09bd88f68060', 'source_file': '[별첨] 지방은행의 시중은행 전환시 인가방식 및 절차.pdf'} vector=None shard_key=None order_value=None


{'source_id': '62af0f2e-c08b-45db-a017-09bd88f68060',
 'source_file': '[별첨] 지방은행의 시중은행 전환시 인가방식 및 절차.pdf'}

In [56]:
for point in results.points[:5]:
    point_id = point.id
    point_chunk = docstore.get([point_id])[0]
    print("{} - score {:.3f}".format(point_id, point.score))
    print(point_chunk.metadata)
    print(type(point_chunk.nodes[0]))
    print(repr(point_chunk.nodes[0].text[:100]))
    print('-'*30)

dcd44cdf-858d-481d-af0c-36f56b5aa29d - score 0.833
{'source_id': '62af0f2e-c08b-45db-a017-09bd88f68060', 'domain': 'finance', 'source_file': '[별첨] 지방은행의 시중은행 전환시 인가방식 및 절차.pdf'}
<class 'core.base.schema.TextNode'>
'- *  (은행법  §8 ➀ )  은행업을 경영하려는 자는 금융위원회의 인가를 받아야 한다.\n\t- ㅇ 시중은행전국영업뿐만 아니라 지방은행 및 인터넷은행도 모두 ( ) 동일한 조항제'
------------------------------
730ab9ca-6ba4-4041-9ae5-189cdee17b79 - score 0.611
{'source_id': '640081d0-57ad-4b74-b632-769176c21062', 'domain': 'finance', 'source_file': '240130(보도자료) 지방은행의 시중은행 전환시 인가방식 및 절차.pdf'}
<class 'core.base.schema.TextNode'>
'금융위원회\n보도자료\n보도시점\n20 2 4 . 1 . 3 1 . ( 수  금\n)\n융위  회 의   후\n(별도공지)\n배포\n2024.1.30.(화) 10:00\n지방은행의 시중은행 전환시'
------------------------------
81338709-1091-404c-8940-469c22aea391 - score 0.417
{'source_id': '62af0f2e-c08b-45db-a017-09bd88f68060', 'domain': 'finance', 'source_file': '[별첨] 지방은행의 시중은행 전환시 인가방식 및 절차.pdf'}
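
The top hybrid score printed above, 0.833, equals 1/2 + 1/3, which is what reciprocal rank fusion (RRF) yields for a point ranked first in one result list and second in the other. Whether `QdrantSingleHybridVectorStore` fuses with RRF is not shown in this notebook, so the sketch below is an assumption: `reciprocal_rank_fusion`, the constant `k=2`, and the 0-based ranks are chosen only because they reproduce that number, and the ids are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Fuse ranked id lists.

    Every list contributes 1 / (k + rank) to an id's score, where rank
    starts at 0 for the best hit. Ids missing from a list get nothing from it.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, point_id in enumerate(ranking):
            scores[point_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder ids: "a" is second in the dense list and first in the sparse list.
dense_ranking = ["b", "a", "c"]
sparse_ranking = ["a", "d", "e"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
for point_id, score in fused[:3]:
    print(f"{point_id} - score {score:.3f}")
```

RRF uses only rank positions, so it sidesteps the fact that dense cosine scores and sparse scores live on different scales.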

In [None]:
# Dense-Only Query
results = vector_store.query(
    mode="dense",
    dense_embedding=query_dense_embedding[0],
    limit=100
)
print(len(results.points))

100


In [64]:
for point in results.points[:5]:
    point_id = point.id
    point_chunk = docstore.get([point_id])[0]
    print("{} - score {:.3f}".format(point_id, point.score))
    print(point_chunk.metadata)
    print(type(point_chunk.nodes[0]))
    print(repr(point_chunk.nodes[0].text[:100]))
    print('-'*30)

730ab9ca-6ba4-4041-9ae5-189cdee17b79 - score 0.785
{'source_id': '640081d0-57ad-4b74-b632-769176c21062', 'domain': 'finance', 'source_file': '240130(보도자료) 지방은행의 시중은행 전환시 인가방식 및 절차.pdf'}
<class 'core.base.schema.TextNode'>
'금융위원회\n보도자료\n보도시점\n20 2 4 . 1 . 3 1 . ( 수  금\n)\n융위  회 의   후\n(별도공지)\n배포\n2024.1.30.(화) 10:00\n지방은행의 시중은행 전환시'
------------------------------
dcd44cdf-858d-481d-af0c-36f56b5aa29d - score 0.784
{'source_id': '62af0f2e-c08b-45db-a017-09bd88f68060', 'domain': 'finance', 'source_file': '[별첨] 지방은행의 시중은행 전환시 인가방식 및 절차.pdf'}
<class 'core.base.schema.TextNode'>
'- *  (은행법  §8 ➀ )  은행업을 경영하려는 자는 금융위원회의 인가를 받아야 한다.\n\t- ㅇ 시중은행전국영업뿐만 아니라 지방은행 및 인터넷은행도 모두 ( ) 동일한 조항제'
------------------------------
ee332e75-0250-442f-b71e-4624a130bc11 - score 0.731
{'source_id': '640081d0-57ad-4b74-b632-769176c21062', 'domain': 'finance', 'source_file': '240130(보도자료) 지방은행의 시중은행 전환시 인가방식 및 절

In [None]:
# Sparse-Only Query
results = vector_store.query(
    mode="sparse",
    sparse_embedding_values=query_sparse_embedding_values[0],
    sparse_embedding_indices=query_sparse_embedding_indices[0],
    limit=100
)
print(len(results.points))

In [71]:
for point in results.points[:5]:
    point_id = point.id
    point_chunk = docstore.get([point_id])[0]
    print("{} - score {:.3f}".format(point_id, point.score))
    print(point_chunk.metadata)
    print(type(point_chunk.nodes[0]))
    print(repr(point_chunk.nodes[0].text[:100]))
    print('-'*30)

dcd44cdf-858d-481d-af0c-36f56b5aa29d - score 0.647
{'source_id': '62af0f2e-c08b-45db-a017-09bd88f68060', 'domain': 'finance', 'source_file': '[별첨] 지방은행의 시중은행 전환시 인가방식 및 절차.pdf'}
<class 'core.base.schema.TextNode'>
'- *  (은행법  §8 ➀ )  은행업을 경영하려는 자는 금융위원회의 인가를 받아야 한다.\n\t- ㅇ 시중은행전국영업뿐만 아니라 지방은행 및 인터넷은행도 모두 ( ) 동일한 조항제'
------------------------------
4319d2e6-c485-48fb-9746-cbf5184e8597 - score 0.489
{'source_id': '6ec2cb86-a114-4f1b-8c62-c00f32534c75', 'domain': 'finance', 'source_file': '2024년 3월_2. 통화신용정책 운영.pdf'}
<class 'core.base.schema.TextNode'>
'③ 2023년 미 SVB 사태와의 차이점\n2023년 미 SVB 사태와 미 CRE발 리스크의 가장 큰 차이점은 전자의 경우 중소 지역은행들에 대한 예 금인출 사태(bank-run)로'
------------------------------
81338709-1091-404c-8940-469c22aea391 - score 0.475
{'source_id': '62af0f2e-c08b-45db-a017-09bd88f68060', 'domain': 'finance', 'source_file': '[별첨] 지방은행의 시중은행 전환시 인가방식 및 절차.pdf'}
<class 'core.base.schema.TextNode'>
'나. 