# Create VectorDB
- Building the VectorDB on a local machine takes a long time.
- This ipynb notebook builds the VectorDB on a Google Colab GPU instead.

## Library Install
- The libraries below were confirmed to work on Colab as of 2025.10.24.

In [1]:
!pip install langchain langchain-chroma langchain-huggingface langchain-openai langchain-text-splitters langchain-community chromadb openai sentence-transformers


## Google Drive Mount
- The official pandas and scikit-learn documentation files have already been uploaded to a personal Google Drive.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Langchain: Data Load

In [3]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader

In [4]:
## The _sources folder contains separate pandas and scikit-learn subfolders

doc_path = "/content/drive/MyDrive/공부/모두연/RAGthon/_sources/"

In [5]:
loader = DirectoryLoader(
    doc_path,
    glob="**/*.rst.txt",
    loader_cls=TextLoader,
    loader_kwargs={'encoding': 'utf-8'},
    show_progress=True,
    use_multithreading=True,
)
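
The recursive glob `**/*.rst.txt` is what lets a single loader pick up both the pandas and scikit-learn subfolders. A minimal sketch of what that pattern matches, using only the standard library on a hypothetical miniature of the `_sources` layout (the file names and the `count_rst_sources` helper are made up for illustration, and `DirectoryLoader` may differ in details such as hidden-file handling):

```python
import tempfile
from collections import Counter
from pathlib import Path


def count_rst_sources(root: Path) -> Counter:
    """Count files matching **/*.rst.txt, grouped by top-level subfolder of root."""
    counts: Counter = Counter()
    for path in root.glob("**/*.rst.txt"):
        # First path component below root, e.g. "pandas" or "scikit-learn".
        counts[path.relative_to(root).parts[0]] += 1
    return counts


# Hypothetical miniature of the _sources layout, built in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    demo_files = [
        "pandas/reference/api/a.rst.txt",
        "pandas/index.rst.txt",
        "scikit-learn/modules/b.rst.txt",
        "scikit-learn/notes.md",  # wrong suffix, so the glob skips it
    ]
    for rel in demo_files:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("stub", encoding="utf-8")
    demo_counts = count_rst_sources(root)

print(dict(demo_counts))
```

Running the same helper against `doc_path` before calling `loader.load()` is a cheap way to confirm that both subfolders are visible from the mounted Drive.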

In [6]:
## Number of official pandas + scikit-learn documents (.rst.txt files): 3365

print("Starting document load")

documents = loader.load()

print(f"Load complete. Loaded {len(documents)} .rst.txt files in total")

Starting document load


100%|██████████| 3365/3365 [12:27<00:00,  4.50it/s]

Load complete. Loaded 3365 .rst.txt files in total





## Langchain: Create Chunks

In [7]:
## separators parameter left at its default

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
)

In [8]:
print("Starting document chunking")

splits = text_splitter.split_documents(documents)

print(f"Split complete. {len(documents)} documents were split into {len(splits)} chunks")

Starting document chunking
Split complete. 3365 documents were split into 19803 chunks
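
To make `chunk_size=1000` and `chunk_overlap=150` concrete, here is a simplified fixed-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter first breaks text on separators such as blank lines and spaces and then merges the pieces, so its chunks are usually shorter than 1000 characters and the overlap is at most 150. The `sliding_window_chunks` helper is a hypothetical name used only for this sketch.

```python
def sliding_window_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 150) -> list[str]:
    """Cut text into fixed-size windows where each window re-reads the
    last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))  # 2500 characters of dummy text
chunks = sliding_window_chunks(sample)

print([len(c) for c in chunks])            # → [1000, 1000, 800]
print(chunks[0][-150:] == chunks[1][:150])  # → True
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk. In the run above, 3365 files became 19803 chunks, or roughly 5.9 chunks per file.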


## Langchain: Create VectorDB
- Uses an embedding model published on HuggingFace.
- The VectorDB is created in a personal Google Drive.

In [10]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

embeddings = HuggingFaceEmbeddings(model_name='intfloat/multilingual-e5-large-instruct')
print("Embedding model ready")
print("=================")

persist_directory = '/content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn'
vectorstore = Chroma.from_documents(
    documents=splits,           # the chunked text pieces
    embedding=embeddings,       # embedding model to use
    persist_directory=persist_directory # on-disk path where the DB is stored
)

print(f"Vector DB saved to '{persist_directory}'.")

## Takes 22 minutes on a T4 GPU

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embedding model ready
Vector DB saved to '/content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn'.
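
The timings recorded in this notebook give a rough throughput figure, which helps when estimating how long a larger corpus would take. A small worked calculation, assuming the 22 minutes covers all 19803 chunks (it also includes the model download, so the real embedding rate is somewhat higher); the 50,000-chunk corpus is a hypothetical example:

```python
def throughput(items: int, minutes: float) -> float:
    """Items processed per second, given a wall-clock duration in minutes."""
    return items / (minutes * 60)


# Embedding + indexing: 19803 chunks in 22 minutes on a T4.
chunks_per_second = throughput(19803, 22)
print(f"{chunks_per_second:.1f} chunks/s")  # → 15.0 chunks/s

# Loading from Drive: 3365 files in 12 min 27 s, matching tqdm's 4.50it/s.
files_per_second = throughput(3365, 12 + 27 / 60)
print(f"{files_per_second:.2f} files/s")    # → 4.50 files/s

# Hypothetical larger corpus at the same rate.
estimated_minutes = 50_000 / chunks_per_second / 60
print(f"{estimated_minutes:.0f} min")       # → 56 min
```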


## Langchain: Retriever Test

In [11]:
embeddings = HuggingFaceEmbeddings(model_name='intfloat/multilingual-e5-large-instruct')
persist_directory = '/content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn'

vectorstore = Chroma(persist_directory=persist_directory,
                     embedding_function=embeddings,
                     )

retriever = vectorstore.as_retriever(search_kwargs={'k': 5})

queries = [
    "Pandas: loc and iloc difference",
    "Pandas의 loc와 iloc의 차이",

    "pandas and scikit-learn difference",
    "Pandas와 Scikit-Learn은 근본적으로 어떤 역할 차이가 있어?",
]

for query in queries:
    print(f"\n--- [Test query]: {query} ---")
    retrieved_docs = retriever.invoke(query)

    if not retrieved_docs:
        print(">>> [Result]: No relevant documents found! (DB build failed)")
    else:
        print(f">>> [Result]: Found {len(retrieved_docs)} documents.")
        for i, doc in enumerate(retrieved_docs):
            print(f"--- Doc {i+1} (Source: {doc.metadata.get('source')}) ---")
            print(doc.page_content[:200] + "...")

--- [Test query]: Pandas: loc and iloc difference ---
>>> [Result]: Found 5 documents.
--- Doc 1 (Source: /content/drive/MyDrive/공부/모두연/RAGthon/_sources/pandas/reference/api/pandas.MultiIndex.get_locs.rst.txt) ---
pandas.MultiIndex.get\_locs

.. currentmodule:: pandas

.. automethod:: MultiIndex.get_locs...
--- Doc 2 (Source: /content/drive/MyDrive/공부/모두연/RAGthon/_sources/pandas/reference/api/pandas.MultiIndex.get_loc.rst.txt) ---
pandas.MultiIndex.get\_loc

.. currentmodule:: pandas

.. automethod:: MultiIndex.get_loc...
--- Doc 3 (Source: /content/drive/MyDrive/공부/모두연/RAGthon/_sources/pandas/reference/api/pandas.Index.slice_locs.rst.txt) ---
pandas.Index.slice\_locs

.. currentmodule:: pandas

.. automethod:: Index.slice_locs...
--- Doc 4 (Source: /content/drive/MyDrive/공부/모두연/RAGthon/_sources/pandas/reference/api/pandas.DataFrame.loc.rst.txt) ---
pandas.DataFrame.loc

.. currentmodule:: pandas

.. autoproperty:: DataFrame.loc...
--- Doc 5 (Source: /content/drive/MyD
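
The top hits for the first query are short autodoc stubs whose titles merely contain "loc". One thing that may be worth trying: the model card for `intfloat/multilingual-e5-large-instruct` describes an instruction-prefixed format for queries, while documents are embedded without a prefix. The sketch below assumes that format and uses a generic retrieval task description. The `format_e5_instruct_query` helper is a hypothetical name, and whether it improves these particular results has not been tested here.

```python
# Generic task description, assumed here; a task-specific sentence can be swapped in.
DEFAULT_TASK = "Given a web search query, retrieve relevant passages that answer the query"


def format_e5_instruct_query(query: str, task: str = DEFAULT_TASK) -> str:
    """Wrap a raw query in the 'Instruct: ... / Query: ...' template."""
    return f"Instruct: {task}\nQuery: {query}"


instructed = format_e5_instruct_query("Pandas: loc and iloc difference")
print(instructed)
```

Because only the query text changes, the existing index can be reused: pass the formatted string to `retriever.invoke(...)` in place of the raw query and compare the retrieved sources.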

## Zip & Download

In [12]:
!zip -r /content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn.zip /content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn

  adding: content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn/ (stored 0%)
  adding: content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn/chroma.sqlite3 (deflated 53%)
  adding: content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn/169b02a3-b889-45ec-9df0-95eb7f2b5762/ (stored 0%)
  adding: content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn/169b02a3-b889-45ec-9df0-95eb7f2b5762/header.bin (deflated 55%)
  adding: content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn/169b02a3-b889-45ec-9df0-95eb7f2b5762/data_level0.bin (deflated 9%)
  adding: content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn/169b02a3-b889-45ec-9df0-95eb7f2b5762/length.bin (deflated 41%)
  adding: content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn/169b02a3-b889-45ec-9df0-95eb7f2b5762/link_lists.bin (deflated 80%)
  adding: content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn/169b02a3-b889-45ec-9df0-95eb7f2b5762/index_meta
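
As the `adding: content/drive/MyDrive/...` lines show, zipping with an absolute path stores the whole Drive path inside the archive, so unzipping elsewhere recreates that nested folder tree. A possible alternative is `shutil.make_archive` with `root_dir` and `base_dir`, which keeps only the DB folder name. This is a sketch on a throwaway directory; `zip_directory` and the `chroma_db_demo` folder are hypothetical names for illustration.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_directory(src_dir: Path, zip_stem: Path) -> Path:
    """Zip src_dir into '<zip_stem>.zip', storing paths relative to src_dir's parent."""
    archive = shutil.make_archive(
        base_name=str(zip_stem),       # output path without the .zip suffix
        format="zip",
        root_dir=str(src_dir.parent),  # archive paths are relative to this directory
        base_dir=src_dir.name,         # only this folder is added
    )
    return Path(archive)


# Demo on a temporary stand-in for the persisted Chroma directory.
with tempfile.TemporaryDirectory() as tmp:
    db_dir = Path(tmp) / "chroma_db_demo"
    db_dir.mkdir()
    (db_dir / "chroma.sqlite3").write_bytes(b"demo")
    archive_path = zip_directory(db_dir, Path(tmp) / "chroma_db_demo")
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(archived_names)
```

For the notebook itself, `src_dir` would be `Path(persist_directory)` and `zip_stem` the same path without a suffix, which produces `chroma_db_pd_sklearn.zip` next to the DB folder.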

In [13]:
from google.colab import files
files.download('/content/drive/MyDrive/공부/모두연/RAGthon/chroma_db_pd_sklearn.zip')
