
# RAG with Chroma

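1. Read the document.
2. Split the document into chunks.
3. Embed the chunks and store them in a vector database.
4. When a question comes in, run a similarity search against the vector database.
5. Pass the documents returned by the similarity search to the LLM together with the question.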

## 1. Read the document

In [59]:
%pip install -qU docx2txt langchain-community langchain-text-splitters unstructured python-docx

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Note: you may need to restart the kernel to use updated packages.


In [60]:
from langchain_community.document_loaders import Docx2txtLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the .docx file
loader = Docx2txtLoader("./data/2023년도 맞춤형복지제도 업무처리 세부지침.docx")

# Split into chunks of up to 1000 characters, with 200 characters shared between neighbors
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
document_list = loader.load_and_split(text_splitter)
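
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-window character splitter. It is a simplification, not how `RecursiveCharacterTextSplitter` is implemented: the real splitter prefers to break on separators such as paragraph and line breaks, whereas the hypothetical `split_with_overlap` helper below simply cuts at fixed offsets.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so text near a
    boundary appears in both neighboring chunks.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 25 characters of sample text: "0123456789012345678901234"
sample = "".join(str(i % 10) for i in range(25))
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(len(chunk), chunk)
```

With `chunk_size=1000` and `chunk_overlap=200`, as in the cell above, each window would advance by 800 characters under this simplified scheme.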


In [61]:
document_list


[Document(metadata={'source': './data/2023년도 맞춤형복지제도 업무처리 세부지침.docx'}, page_content='[붙임 \n\n\n\n\n\n\n\n[붙임 \n\n[붙임 \n\n\n\n\n\n\n\n\n\n2023년도 전라북도교육감 소속 공무원 \n\n맞춤형복지제도 업무처리 세부지침\n\n\n\n2023. 2.\n\n\n\n     행정국 재무과\n\nⅠ. 맞춤형 복지제도 개요·······················································1\n\nⅡ. 맞춤형복지 항목 설계·······················································2\n\nⅢ. 복지점수 배정 ·······························································6\n\n차  례\n\nⅣ. 복지점수 집행 및 정산···················································15\n\nⅤ. 행정사항·······································································16\n\n【붙임 1】 맞춤형복지비 적용 배제 또는 제한자 현황···························18\n\n【붙임 2】 2023년 온누리상품권 의무구매 비율 의견수렴 결과············19\n\n【붙임 3】 가족 복지점수 추가 배정 신청서···········································20\n\n【붙임 4】 출산축하 및 태아·산모검진 지원 복지점수 신청 서식················21\n\n【붙임 5】 난임지원 복지점수 신청 서식················································22\n\n\n\n2023년도 전라북도교육감 소속 공무원\n\n맞춤형복지제도 업무처리 세부지침\n\n전년 대비 주요 변동 내용\n\n

## Embedding
### sentence-transformers

In [42]:
%pip install sentence-transformers langchain-huggingface

Collecting sentence-transformers
  Downloading sentence_transformers-5.1.2-py3-none-any.whl.metadata (16 kB)
Collecting transformers<5.0.0,>=4.41.0 (from sentence-transformers)
  Downloading transformers-4.57.3-py3-none-any.whl.metadata (43 kB)
Collecting tqdm (from sentence-transformers)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting torch>=1.11.0 (from sentence-transformers)
  Downloading torch-2.9.1-cp314-cp314-macosx_11_0_arm64.whl.metadata (30 kB)
Collecting scikit-learn (from sentence-transformers)
  Downloading scikit_learn-1.7.2-cp314-cp314-macosx_12_0_arm64.whl.metadata (11 kB)
Collecting scipy (from sentence-transformers)
  Downloading scipy-1.16.3-cp314-cp314-macosx_14_0_arm64.whl.metadata (62 kB)
Collecting huggingface-hub>=0.20.0 (from sentence-transformers)
  Downloading huggingface_hub-1.1.6-py3-none-any.whl.metadata (13 kB)
Collecting Pillow (from sentence-transformers)
  Downloading pillow-12.0.0-cp314-cp314-macosx_11_0_arm64.whl

In [62]:
from langchain_huggingface import HuggingFaceEmbeddings

# Korean sentence-embedding model, downloaded from the Hugging Face Hub on first use
embeddings = HuggingFaceEmbeddings(model_name='jhgan/ko-sroberta-multitask')

## Create the Chroma DB and save it to disk

In [64]:

from langchain_chroma import Chroma

database = Chroma.from_documents(
    document_list,
    embeddings,
    collection_name="gwp2003",
    persist_directory="./chroma_db",
)


## Load the Chroma DB

In [89]:
from langchain_chroma import Chroma

database = Chroma(
    collection_name="gwp2003",
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)
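
The load cell repeats `collection_name` and `persist_directory` because those two values identify which stored data to reopen. The toy sketch below illustrates the idea with one JSON file per collection. This is not Chroma's actual storage format, and `save_collection`, `load_collection`, and the misspelled name `gwp2023` are hypothetical. To my understanding, the LangChain wrapper opens an unknown collection name as a new, empty collection instead of raising an error, which is what the sketch imitates.

```python
import json
import tempfile
from pathlib import Path


def save_collection(directory: Path, name: str, records: list[dict]) -> None:
    """Write one collection (texts plus their vectors) to <directory>/<name>.json."""
    directory.mkdir(parents=True, exist_ok=True)
    (directory / f"{name}.json").write_text(json.dumps(records), encoding="utf-8")


def load_collection(directory: Path, name: str) -> list[dict]:
    """Read a collection back; an unknown name yields an empty collection."""
    path = directory / f"{name}.json"
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    store_dir = Path(tmp) / "toy_db"
    records = [
        {"text": "chunk one", "vector": [0.1, 0.9]},
        {"text": "chunk two", "vector": [0.8, 0.2]},
    ]
    save_collection(store_dir, "gwp2003", records)

    reloaded = load_collection(store_dir, "gwp2003")  # same name: data comes back
    missing = load_collection(store_dir, "gwp2023")   # different name: nothing found
    print(len(reloaded), len(missing))
```

The stored vectors are only useful if queries are embedded with the same model that produced them, which is why the load cell passes `embedding_function=embeddings` again.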

In [57]:
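# Korean for "What are the mandatory insurance items?"
query = '보험 필수 항목은 무엇인가요?'

# Requires `database` from the load cell above. This cell's execution count (57) predates
# the cells that define `database`, which is why it originally raised NameError.
retd_docs = database.similarity_search(query)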
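
`similarity_search` embeds the query with the same embedding model and returns the stored chunks whose vectors are closest to it (LangChain's default is `k=4`). The sketch below shows only the ranking step, on hand-made 3-dimensional vectors. The vectors, the texts, and the `top_k` helper are illustrative assumptions; it uses cosine similarity as one common choice, and Chroma's default distance metric may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 4) -> list[str]:
    """Return the texts of the k stored vectors most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Hand-made vectors standing in for real embeddings (assumption for illustration).
store = [
    ("insurance coverage rules", [0.9, 0.1, 0.0]),
    ("holiday schedule", [0.0, 0.2, 0.9]),
    ("mandatory insurance items", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of an insurance question

results = top_k(query_vec, store, k=2)
print(results)
```

A real vector store avoids scanning every vector by using an index, but the returned ranking follows the same idea.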

In [90]:
from dotenv import load_dotenv
from langchain_google_genai import GoogleGenerativeAI

# Load environment variables (presumably the Google API key) from a local .env file
load_dotenv()

llm = GoogleGenerativeAI(model="gemini-2.5-flash")



In [91]:
# %pip install langchain_core

from langchain_core.prompts import ChatPromptTemplate

# from_template() takes the template text itself, not a LangChain Hub ID, so passing
# "rlm/rag-prompt" produced a literal prompt with input_variables=[]. The inline template
# below is an assumption modeled on that hub prompt; {input} is the question.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the retrieved context. "
    "If you don't know the answer, say so.\n\n"
    "Question: {input}\nContext: {context}\nAnswer:"
)

In [92]:
prompt
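
A RAG prompt needs placeholders for the question and for the retrieved context; a template without placeholders sends the same fixed text to the model every time. The sketch below uses plain `str.format` to show the difference. It is a stand-in for illustration, not `ChatPromptTemplate` itself, and the placeholder names `input` and `context` are assumptions chosen to match what `create_retrieval_chain` conventionally expects.

```python
from string import Formatter


def placeholders(template: str) -> list[str]:
    """List the {named} fields a format string expects."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


literal = "rlm/rag-prompt"  # plain text: nothing to fill in
rag_template = "Question: {input}\nContext: {context}\nAnswer:"

print(placeholders(literal))
print(placeholders(rag_template))

# Filling the template is what happens just before the text is sent to the model.
filled = rag_template.format(
    input="What are the mandatory insurance items?",
    context="(retrieved chunks would be pasted here)",
)
print(filled)
```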

In [96]:
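from langchain_classic.chains import create_retrieval_chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain

# Assumption: langchain >= 1.0 is installed, where `langchain.chains` no longer exists
# (hence the original ModuleNotFoundError); the helpers ship in `langchain-classic`.
# The prompt must expose {context} (retrieved chunks) and {input} (the question).
combine_docs_chain = create_stuff_documents_chain(llm, prompt)

# create_retrieval_chain takes (retriever, combine_docs_chain), not an LLM or chain_type_kwargs.
qa_chain = create_retrieval_chain(database.as_retriever(), combine_docs_chain)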
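
Conceptually, a retrieval chain does three things: fetch chunks for the question, "stuff" them into the prompt's context slot, and send the filled prompt to the model. The sketch below mirrors that flow with plain functions. `fake_retriever` and `fake_llm` are hypothetical stand-ins so the example runs without an API key or a database, and the result keys (`input`, `context`, `answer`) follow what I understand `create_retrieval_chain` to return.

```python
from typing import Callable

PROMPT = "Question: {input}\nContext: {context}\nAnswer:"


def fake_retriever(question: str) -> list[str]:
    """Stand-in for database.as_retriever(): returns canned chunks."""
    return ["Chunk A about insurance.", "Chunk B about welfare points."]


def fake_llm(prompt_text: str) -> str:
    """Stand-in for the LLM: reports how much text it was given."""
    return f"(model saw {len(prompt_text)} characters)"


def retrieval_chain(
    question: str,
    retriever: Callable[[str], list[str]],
    llm: Callable[[str], str],
) -> dict:
    docs = retriever(question)        # 1. retrieve chunks for the question
    context = "\n\n".join(docs)       # 2. stuff all chunks into one context string
    prompt_text = PROMPT.format(input=question, context=context)
    answer = llm(prompt_text)         # 3. generate an answer from the filled prompt
    return {"input": question, "context": docs, "answer": answer}


result = retrieval_chain("What are the mandatory insurance items?", fake_retriever, fake_llm)
print(result["answer"])
```

Once a real chain has been built successfully, the equivalent call would be `qa_chain.invoke({"input": query})`, with the answer under the `answer` key.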