### **Installing Milvus**

Because this is only a test, we use Milvus Lite, which is installed as a dependency of `pymilvus`.

In [1]:
!pip install pymilvus

Collecting pymilvus
  Downloading pymilvus-2.5.5-py3-none-any.whl.metadata (5.7 kB)
Collecting grpcio<=1.67.1,>=1.49.1 (from pymilvus)
  Downloading grpcio-1.67.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.9 kB)
Collecting python-dotenv<2.0.0,>=1.0.1 (from pymilvus)
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting ujson>=2.0.0 (from pymilvus)
  Downloading ujson-5.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.3 kB)
Collecting milvus-lite>=2.4.0 (from pymilvus)
  Downloading milvus_lite-2.4.11-py3-none-manylinux2014_x86_64.whl.metadata (9.2 kB)
Downloading pymilvus-2.5.5-py3-none-any.whl (223 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m223.7/223.7 kB[0m [31m11.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading grpcio-1.67.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.9 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m5.9/5.9 MB[0m [31m79.5 M

### **Creating a Milvus vector database**

In [2]:
from pymilvus import MilvusClient
from pymilvus import connections, db

client = MilvusClient("./milvus_demo.db")  # creates a database file named milvus_demo.db in the current folder

### **Creating a collection inside the vector database**

Step 1. Create the schema

In [3]:
# 3. Create a collection in customized setup mode
from pymilvus import MilvusClient, DataType

# 3.1. Create schema
schema = MilvusClient.create_schema(
    auto_id=False,
    enable_dynamic_field=True,
)
# 3.2. Add fields to schema
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=False)
schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="author", datatype=DataType.VARCHAR, max_length=512)
schema.add_field(field_name="my_vector", datatype=DataType.FLOAT_VECTOR, dim=768)

{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'title', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 512}}, {'name': 'author', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 512}}, {'name': 'my_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}], 'enable_dynamic_field': True}
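
As a rough illustration of what this schema expects, the sketch below checks a single row before it is inserted. `check_row`, `EXPECTED_DIM`, and `MAX_VARCHAR_LENGTH` are hypothetical names introduced here and are not part of pymilvus; the sketch also assumes the `max_length` limit is counted in UTF-8 bytes, which matters for Korean titles.

```python
# Hypothetical helper (not part of pymilvus): sanity-check one row against
# the schema defined above before inserting it.
EXPECTED_DIM = 768         # dim of "my_vector" in the schema
MAX_VARCHAR_LENGTH = 512   # max_length of "title" and "author"


def check_row(row: dict) -> list:
    """Return a list of problems found in one row; an empty list means it looks fine."""
    problems = []
    # The primary key is INT64 with auto_id=False, so the caller must supply it.
    if not isinstance(row.get("id"), int):
        problems.append("id must be an int")
    for field in ("title", "author"):
        value = row.get(field)
        if not isinstance(value, str):
            problems.append(f"{field} must be a string")
        # Assumption: the VARCHAR limit is measured in UTF-8 bytes.
        elif len(value.encode("utf-8")) > MAX_VARCHAR_LENGTH:
            problems.append(f"{field} is longer than {MAX_VARCHAR_LENGTH} bytes")
    vector = row.get("my_vector")
    if not isinstance(vector, list) or len(vector) != EXPECTED_DIM:
        problems.append(f"my_vector must be a list of {EXPECTED_DIM} floats")
    return problems


row = {"id": 0, "title": "햄릿", "author": "이대숙", "my_vector": [0.0] * 768}
print(check_row(row))
```

Because `enable_dynamic_field=True`, extra keys in a row are not treated as errors in this sketch.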

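Step 2. Set the index parameters (optional)

Creating an index on a specific field speeds up searches against that field.

An index records the order of the entities in a collection.

In Milvus you can use AUTOINDEX as the `index_type` and choose COSINE, L2, or IP as the metric type, depending on your needs.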
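
To make the three metric names concrete, here is a minimal pure-Python sketch of their textbook definitions. The function names are illustrative only, and the values Milvus reports internally may be computed or scaled differently. For L2 a smaller value means a closer match, while for COSINE and IP a larger value means a closer match.

```python
import math


def inner_product(a, b):
    """IP: sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """L2: Euclidean distance between the two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """COSINE: inner product divided by the product of the vector norms."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, but twice as long
print(inner_product(a, b))      # 2.0
print(l2_distance(a, b))        # 1.0
print(cosine_similarity(a, b))  # 1.0
```

The example shows why the choice matters: `a` and `b` point in the same direction, so COSINE treats them as identical, while L2 still sees a gap because their lengths differ.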

In [4]:
# 3.3. Prepare index parameters
index_params = client.prepare_index_params()
index_params.add_index(
    field_name="my_vector",
    index_type="AUTOINDEX",
    metric_type="COSINE"
)

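Step 3. Create the collection

In [5]:
# 3.4. Create the collection.
# Note: index_params from step 2 is not passed here, so no index is built at this point;
# the vector index is created explicitly in a later cell.
client.create_collection(
    collection_name="work2_collection",
    schema=schema,
    consistency_level="Bounded"  # consistency level (can be changed to "Strong")
)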

# Load the collection
client.load_collection(
    collection_name="work2_collection"
)

res = client.get_load_state(
    collection_name="work2_collection"
)

print(res)

{'state': <LoadState: Loaded>}


### **Setting up the embedding model**

In [6]:
import torch
from sentence_transformers import SentenceTransformer


# Initialize torch settings for device-agnostic code.
N_GPU = torch.cuda.device_count()
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the model from huggingface model hub.
model_name = "BM-K/KoSimCSE-bert-multitask"
encoder = SentenceTransformer(model_name, device=DEVICE)
# There are two candidate models:
# Candidate 1: jhgan/ko-sroberta-multitask
# Candidate 2: BM-K/KoSimCSE-bert-multitask

# Get the model parameters and save for later.
EMBEDDING_DIM = encoder.get_sentence_embedding_dimension()
MAX_SEQ_LENGTH_IN_TOKENS = encoder.get_max_seq_length()

# Inspect model parameters.
print(f"model_name: {model_name}")
print(f"EMBEDDING_DIM: {EMBEDDING_DIM}")
print(f"MAX_SEQ_LENGTH_IN_TOKENS: {MAX_SEQ_LENGTH_IN_TOKENS}")

model_name: BM-K/KoSimCSE-bert-multitask
EMBEDDING_DIM: 768
MAX_SEQ_LENGTH_IN_TOKENS: 512


### **Uploading the file (used for testing only)**

In [7]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
file_path = '/content/drive/My Drive/Colab Notebooks/RAG_MilVus/WORK_TEST9.csv'

In [9]:
import pandas as pd
# Load the CSV file
data = pd.read_csv(file_path)

### **Inserting data into the Milvus collection**

In [10]:
!pip install langchain



In [11]:
!pip install -U langchain-community

In [12]:
# Convert the CSV rows into LangChain Documents
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document
documents = [
    Document(
        page_content=f"Title: {row['TITLE']}, Author: {row['AUTHOR']}"
    )
    for _, row in data.iterrows()
]
# Split the text into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)

# Print one of the split documents (index 213)
print(docs[213])

page_content='Title: 「論語」叢書.14,渋沢子爵活論語, Author: 아다치, 다이스케'


In [13]:
# Embed the document content (page_content)
def get_embedding(text):
    return encoder.encode(text, convert_to_numpy=True).tolist()

# Generate an embedding for every document
for doc in docs:
    doc.metadata["_embedding"] = get_embedding(doc.page_content)

# Check the result
print(docs[0].metadata["_embedding"][:5])  # first few values of the first vector

[0.00635561952367425, -0.39188334345817566, -0.2344215214252472, 0.5315841436386108, 0.0023850202560424805]


In [14]:
import pymilvus
from pymilvus import MilvusClient

# Connect to Milvus
mc = MilvusClient(uri="./milvus_demo.db")  # path to the local Milvus Lite database file

# Collection name
COLLECTION_NAME = "work2_collection"

# Convert the documents into a list of dicts for insertion into Milvus
dict_list = [
    {
        "id": idx,  # document ID
        "title": doc.page_content.split("Title: ")[1].split(", Author:")[0],
        "author": doc.page_content.split("Author: ")[1],
        "my_vector": doc.metadata["_embedding"]  # 768-dimensional vector
    }
    for idx, doc in enumerate(docs)
]

# Insert the data into Milvus
mc.insert(
    COLLECTION_NAME,
    data=dict_list,
    progress_bar=True
)

print(f"Inserted {len(dict_list)} documents into {COLLECTION_NAME}")

Inserted 235 documents into work2_collection
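
The chained `split` calls used above to recover the title and author work for this data set, but they can misfire if a title itself contains the text `Author: `. A slightly more defensive variant is sketched below; `parse_title_author` is a hypothetical helper introduced here, and it assumes every `page_content` has exactly the `Title: ..., Author: ...` shape built earlier.

```python
def parse_title_author(page_content: str) -> tuple:
    """Split 'Title: <title>, Author: <author>' into (title, author)."""
    prefix = "Title: "
    separator = ", Author: "
    if not page_content.startswith(prefix):
        raise ValueError(f"unexpected page_content: {page_content!r}")
    body = page_content[len(prefix):]
    # rpartition splits on the LAST separator, so commas (or even the word
    # "Author") inside the title do not confuse the parser.
    title, found, author = body.rpartition(separator)
    if not found:
        raise ValueError(f"no author part in: {page_content!r}")
    return title, author


sample = "Title: 「論語」叢書.14,渋沢子爵活論語, Author: 아다치, 다이스케"
print(parse_title_author(sample))
```

Another option, if the original DataFrame is still available, would be to read the title and author straight from the CSV columns instead of parsing them back out of the text.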


### **Searching for a book in the Milvus DB**

Step 1. Convert the query text into an embedding

In [15]:
from pymilvus import connections, utility

connections.connect("default", uri="./milvus_demo.db")  # connect to the local Milvus Lite file

print(utility.list_collections())  # list the existing collections


['work2_collection']


In [16]:
from pymilvus import Collection

collection = Collection("work2_collection")  # load the existing collection

index_params = {
    "metric_type": "L2",  # or "COSINE" (using COSINE requires a new collection)
    "index_type": "AUTOINDEX"
}

# Add an index on the vector field
collection.create_index(field_name="my_vector", index_params=index_params)

print("✅ Index added.")


✅ Index added.


In [17]:
index_info = collection.indexes
for index in index_info:
    print(index.to_dict())  # print the index information

{'collection': 'work2_collection', 'field': 'my_vector', 'index_name': 'my_vector', 'index_param': {'M': '18', 'efConstruction': '240', 'index_type': 'AUTOINDEX', 'metric_type': 'L2', 'dim': '768'}}


In [18]:
from pymilvus import connections, Collection
from sentence_transformers import SentenceTransformer

# 1. Connect to Milvus and load the collection
connections.connect("default", uri="./milvus_demo.db")
collection = Collection("work2_collection")  # name of the collection created earlier
collection.load()

# 2. Combine the query title and author, then create the embedding vector
query_title = "햄릿"
query_author = "이대숙"
query_data = {
    "title": query_title,
    "author": query_author
}
combined_query = f"{query_data['title']} {query_data['author']}"  # join title and author into one string
model = SentenceTransformer('BM-K/KoSimCSE-bert-multitask')
query_vec = model.encode(combined_query)  # embedding vector (numpy.ndarray)
query_vec = query_vec.tolist()  # convert the numpy array to a list

# 3. Search Milvus by L2 distance (top-3 results), matching the metric of the index
search_params = {"metric_type": "L2", "params": {"nprobe": 10}}

top_k = 3

results = collection.search(
    data=[query_vec],                  # list of query vectors
    anns_field="my_vector",            # vector field name
    param=search_params,
    limit=top_k,                       # return the top_k results
    expr=None,                         # no filter: search all entities
    output_fields=["title", "author"]  # also return each hit's title and author
)


# 4. Print the search results (id, distance, title, author)
print(results)

# results[0] holds the top_k hits for the single query vector.
# results[1], results[2], etc. do not exist because only one query vector was sent.
extracted_data_list = []
if results and results[0]:  # process only when there are hits
    for hit in results[0]:
        extracted_data = {
            "title": hit.entity.get("title"),
            "author": hit.entity.get("author")
        }
        extracted_data_list.append(extracted_data)
        print(extracted_data)



data: ['["id: 70, distance: 303.63397216796875, entity: {\'author\': \'여석기\', \'title\': \'햄릿과의 여행, 리어와의 만남\'}", "id: 55, distance: 335.72216796875, entity: {\'author\': \'윤,성근\', \'title\': \'나는 햄릿이다\'}", "id: 233, distance: 347.01910400390625, entity: {\'author\': \'안장환\', \'title\': \'햄릿 공연사 연구의 종합적 미학\'}"]']
{'title': '햄릿과의 여행, 리어와의 만남', 'author': '여석기'}
{'title': '나는 햄릿이다', 'author': '윤,성근'}
{'title': '햄릿 공연사 연구의 종합적 미학', 'author': '안장환'}
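
Since the index uses L2, a smaller `distance` means a closer match. One possible refinement, sketched below, is to drop distant hits before handing them to the LLM. The hit values are copied from the output above as plain dicts; `filter_hits` and the cut-off of 340.0 are hypothetical and chosen only for illustration, so a real threshold would need to be tuned on labelled examples.

```python
# Hits copied from the search output above, as plain dicts.
hits = [
    {"id": 70, "distance": 303.63397216796875,
     "title": "햄릿과의 여행, 리어와의 만남", "author": "여석기"},
    {"id": 55, "distance": 335.72216796875,
     "title": "나는 햄릿이다", "author": "윤,성근"},
    {"id": 233, "distance": 347.01910400390625,
     "title": "햄릿 공연사 연구의 종합적 미학", "author": "안장환"},
]


def filter_hits(hits, max_distance):
    """Keep only the hits whose L2 distance is at or below max_distance."""
    return [hit for hit in hits if hit["distance"] <= max_distance]


MAX_L2_DISTANCE = 340.0  # hypothetical cut-off, not taken from the source
candidates = filter_hits(hits, MAX_L2_DISTANCE)
print([hit["id"] for hit in candidates])  # [70, 55]
```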


In [19]:
print(f"🔎 Number of documents found: {len(results[0])}")  # check how many documents were returned

🔎 Number of documents found: 3


### **Verifying the vector DB search results with an LLM**

In [20]:
!pip install --upgrade --quiet tokenizers


In [21]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 1. Set the model ID
base_model_id = 'LGAI-EXAONE/EXAONE-3.5-2.4B-Instruct'

# 2. Load the model
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto")

In [22]:
# 3. Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    model_max_length=512,  # maximum token length
    padding_side="left",   # pad on the left side
    add_eos_token=True     # append an EOS token
)

tokenizer.pad_token = tokenizer.eos_token

In [23]:
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline
# Set up the LLM pipeline (integrated with LangChain)
llm_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=500
)
llm = HuggingFacePipeline(pipeline=llm_pipeline)

Device set to use cuda:0


In [24]:
SIMPLEST = """
Tell me whether the bibliographic record with {query_data} belongs to the same FRBR WORK as the record(s) from my WORK database with {extracted_data}. Consider the definition of WORK according to the Functional Requirements for Bibliographic Records (FRBR). Answer with exactly "yes" or "no" (no additional text or explanation).
[Judgment Criteria]
- If items belong to the same Work group, output "yes" (e.g., "Harry Potter series").
- If items belong to different Work groups, output "no".
- Conditions for belonging to the same Work group:
  - When the Title and Author are the same
  - Translations of the same original work
  - Revised editions by the same author
  - Different volumes in a series (e.g., "Harry Potter" volumes 1-8 grouped as "Harry Potter series")
  - Items may belong to the same Work even with different titles (e.g., "The Vegetarian" (Han Kang, Changbi, 2007) and "The Vegetarian: A Novel by Han Kang" (Han Kang, Changbi, 2022))

 [Strong System Rules]
  - Output must follow this exact format:
    1) 일치 여부:
    같은 Work 그룹이면 "yes"를 출력 (다른 Work 그룹이면 "no"를 출력)
    2) 판단 이유:
    판단한 이유를 3 문장으로 명확히 설명.

"""
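
Because the prompt itself contains the words "yes" and "no", the generated text is easier to post-process if any echoed prompt is removed first and the verdict is then extracted. The sketch below is one possible way to do that; `strip_prompt` and `parse_verdict` are hypothetical helpers, and they assume the model's answer contains a standalone "yes" or "no" as requested by the format above.

```python
import re


def strip_prompt(prompt: str, generated: str) -> str:
    """Remove the prompt from the start of the generated text, if it was echoed."""
    if generated.startswith(prompt):
        return generated[len(prompt):]
    return generated


def parse_verdict(answer: str):
    """Return 'yes' or 'no' for the first standalone match, or None if neither is found."""
    # The lookarounds reject 'yes'/'no' embedded in longer English words
    # (for example 'know'), while still accepting a Korean particle right after.
    match = re.search(r"(?<![A-Za-z])(yes|no)(?![A-Za-z])", answer, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


prompt_text = 'Answer with exactly "yes" or "no".\n'
generated_text = prompt_text + "1) 일치 여부: no\n2) 판단 이유: ..."
answer_text = strip_prompt(prompt_text, generated_text)
print(parse_verdict(answer_text))  # no
```

Without the stripping step, the first match would come from the instructions rather than from the model's answer.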

In [25]:
from langchain_core.prompts import PromptTemplate
# Configure the PromptTemplate (variable names must match the placeholders in SIMPLEST)
prompt = PromptTemplate(
    template=SIMPLEST,
    input_variables=["query_data", "extracted_data"]
)

In [29]:
# Build the LLM chain from the model and the prompt
from langchain.chains import LLMChain

llm_chain = LLMChain(llm=llm, prompt=prompt)

In [30]:
# 🔹 List for storing the LLM results
llm_results = []

# 🔹 Run the LLM once per search result
for extracted_data in extracted_data_list:
    # Run the LLM
    response_text = llm_chain.run({
        "query_data": f"{query_data['title']} by {query_data['author']}",  # query book
        "extracted_data": f"{extracted_data['title']} by {extracted_data['author']}"  # candidate book
    })

    # Store the result
    llm_results.append({
        "query_title": query_data["title"],  # query book title
        "query_author": query_data["author"],  # query book author
        "extracted_title": extracted_data["title"],  # candidate book title
        "extracted_author": extracted_data["author"],  # candidate book author
        "llm_response": response_text  # LLM response
    })

# 🔍 Print all results
print("\n🔎 LLM comparison results:")
for idx, result in enumerate(llm_results):
    print(f"\n📌 [Comparison {idx+1}]")
    print(f"▶ Query book: {result['query_title']} ({result['query_author']})")
    print(f"▶ Candidate book: {result['extracted_title']} ({result['extracted_author']})")
    print(f"▶ Verdict: {result['llm_response']}")  # LLM response



🔎 LLM comparison results:

📌 [Comparison 1]
▶ Query book: 햄릿 (이대숙)
▶ Candidate book: 햄릿과의 여행, 리어와의 만남 (여석기)
▶ Verdict: 
Tell me whether the bibliographic record with 햄릿 by 이대숙 belongs to the same FRBR WORK as the record(s) from my WORK database with 햄릿과의 여행, 리어와의 만남 by 여석기. Consider the definition of WORK according to the Functional Requirements for Bibliographic Records (FRBR). Answer with exactly "yes" or "no" (no additional text or explanation).
[Judgment Criteria]
- If items belong to the same Work group, output "yes" (e.g., "Harry Potter series").
- If items belong to different Work groups, output "no".
- Conditions for belonging to the same Work group:
  - When the Title and Author are the same
  - Translations of the same original work
  - Revised editions by the same author
  - Different volumes in a series (e.g., "Harry Potter" volumes 1-8 grouped as "Harry Potter series")
  - Items may belong to the same Work even with different titles (e.g., "The Vegetarian" (Han Kang, Changbi, 2007) and "The Vegetarian: A Nov