

---



## **[Mission]**: Test HuggingFace embeddings

Using the preceding material as a reference, prepare a new PDF file and try out the Hugging Face embedding functionality.

### **1. Install libraries**

In [None]:
# Install
%pip install -q "sentence-transformers>=3.0.0" "langchain-community" "chromadb==1.0.21" "requests==2.32.4"

In [None]:
%pip install -q langchain tiktoken openai pypdf langchain_openai

### **2. Import libraries**

In [4]:
import os, urllib.request
from langchain_community.document_loaders import PyPDFLoader  # loaders live in langchain_community
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

### **3. Load the document to embed**

In [5]:
PDF_URL = "https://arxiv.org/pdf/1801.04293.pdf"
PDF_NAME = "critical_success_factors_gamedev.pdf"
PERSIST_DIR = "./chroma_db_csf_gamedev"
TOP_K = 3
THRESHOLD = 0.55  # confidence-warning threshold

In [6]:
print("[1/6] Downloading PDF ...")
urllib.request.urlretrieve(PDF_URL, filename=PDF_NAME)
print(f"  → saved: {os.path.abspath(PDF_NAME)}")

[1/6] Downloading PDF ...
  → saved: /content/critical_success_factors_gamedev.pdf
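
Before parsing, it can help to confirm that the saved file really is a PDF, since a failed request may store an HTML error page under the same name. The sketch below is an optional check that is not part of the original notebook; `looks_like_pdf` is a hypothetical helper name, and the demo runs on temporary files rather than on the downloaded paper.

```python
import os
import tempfile


def looks_like_pdf(path: str) -> bool:
    """Return True if the file starts with the PDF magic bytes '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


# Self-contained demo on temporary files (no network access needed).
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.pdf")
    bad = os.path.join(tmp, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.5\n...")
    with open(bad, "wb") as f:
        f.write(b"<html>error</html>")
    good_ok = looks_like_pdf(good)
    bad_ok = looks_like_pdf(bad)

print(good_ok, bad_ok)
```

In the notebook, the same function could be called as `looks_like_pdf(PDF_NAME)` right after the download step.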


### **4. Inspect the document**

In [7]:
print("[2/6] Loading PDF and splitting text ...")
loader = PyPDFLoader(PDF_NAME)
pages = loader.load()  # one Document per page

[2/6] Loading PDF and splitting text ...


In [8]:
# Pages in the paper are long, so re-split them into character-length chunks for better retrieval
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,   # maximum chunk length, in characters
    chunk_overlap=120, # characters shared between adjacent chunks
    separators=["\n\n", "\n", ". ", " "]
)
docs = splitter.split_documents(pages)
print("  Original page count :", len(pages))
print("  Chunk count         :", len(docs))

  Original page count : 51
  Chunk count         : 163
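
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break at the listed separators, so its chunks are usually shorter than `chunk_size` and the overlap is not an exact character count. `sliding_chunks` is a hypothetical name used only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size windows; each starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one, which is the role the 120-character overlap plays in the cell above: text near a chunk boundary stays retrievable from either side.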


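### **5. Embed the document (embed, then store in a vector DB)**
1. **Choose the embedding model**
    - **intfloat/multilingual-e5-base**
        - Microsoft, October 2022
        - A high-performance multilingual embedding model covering around 100 languages. It converts text into vectors that capture meaning and is mainly used for semantic search and similarity measurement.
2. **Embed, then store in a vector DB (ChromaDB)**
- 💡 **Note:** Depending on how much text is being embedded, this step can take a long time.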
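
The E5 model card recommends prefixing inputs with `query: ` or `passage: `, which this notebook does not do. The sketch below shows one possible way to add those prefixes by wrapping any object that exposes `embed_documents` and `embed_query`. `PrefixedEmbeddings` and `EchoEmbedder` are hypothetical names; the echo class merely stands in for a real model so the example runs offline. Whether such a wrapper can be passed directly to `Chroma.from_documents` is an assumption; subclassing LangChain's `Embeddings` base class may be the safer route.

```python
class PrefixedEmbeddings:
    """Wrap an embedder and add E5-style prefixes before delegating to it."""

    def __init__(self, base, query_prefix="query: ", passage_prefix="passage: "):
        self.base = base
        self.query_prefix = query_prefix
        self.passage_prefix = passage_prefix

    def embed_documents(self, texts):
        # Indexed chunks are treated as passages.
        return self.base.embed_documents([self.passage_prefix + t for t in texts])

    def embed_query(self, text):
        # Search strings are treated as queries.
        return self.base.embed_query(self.query_prefix + text)


class EchoEmbedder:
    """Stand-in for a real model: records the strings it receives."""

    def __init__(self):
        self.seen = []

    def embed_documents(self, texts):
        self.seen.extend(texts)
        return [[float(len(t))] for t in texts]  # dummy one-dimensional vectors

    def embed_query(self, text):
        self.seen.append(text)
        return [float(len(text))]


echo = EchoEmbedder()
wrapped = PrefixedEmbeddings(echo)
wrapped.embed_documents(["Game development success factors"])
wrapped.embed_query("게임 개발 성공 요인은 무엇인가?")
print(echo.seen)
```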

In [9]:
print("[3/6] Loading embedding model (intfloat/multilingual-e5-base, normalize=True) ...")
embedding = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-base",
    encode_kwargs={"normalize_embeddings": True}  # L2-normalize each vector
)

print("[4/6] Creating and saving the Chroma vector DB ...")
vectordb = Chroma.from_documents(
    documents=docs,
    embedding=embedding,
    persist_directory=PERSIST_DIR
)
# No explicit vectordb.persist() call: Chroma 0.4+ writes to persist_directory automatically.
print(f"  → persisted dir: {os.path.abspath(PERSIST_DIR)}")

[3/6] Loading embedding model (intfloat/multilingual-e5-base, normalize=True) ...

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.

[4/6] Creating and saving the Chroma vector DB ...
  → persisted dir: /content/chroma_db_csf_gamedev
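
The cell above passes `normalize_embeddings=True`, so every stored vector has unit length. The sketch below, in plain Python with made-up two-dimensional vectors, illustrates why that is convenient: for unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance equals 2 − 2·cos, so ranking by either measure gives the same order. The helper names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


# Toy vectors for illustration; real E5 embeddings have far more dimensions.
a = [3.0, 4.0]
b = [4.0, 3.0]

cos_raw = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
ua, ub = l2_normalize(a), l2_normalize(b)
cos_unit = dot(ua, ub)                              # plain dot product on unit vectors
sq_l2 = sum((x - y) ** 2 for x, y in zip(ua, ub))   # squared Euclidean distance

print(round(cos_raw, 6), round(cos_unit, 6), round(sq_l2, 6))
```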


### **6. Query test**

In [10]:
# ---------- 6) Query test ----------
#   "k": TOP_K (3) --> return only the three most relevant documents (chunks)
#   Queries stay in Korean, as originally run; English glosses are in the comments.
retriever = vectordb.as_retriever(search_kwargs={"k": TOP_K})

# Example 1: "What are the success factors in game development?"
docs = retriever.invoke("게임 개발 성공 요인은 무엇인가?")
print([d.page_content[:120] for d in docs])

# Example 2: "How does team communication affect development outcomes?"
docs = retriever.invoke("팀 커뮤니케이션이 개발 성과에 미치는 영향은?")
print([d.page_content[:120] for d in docs])

# Example 3: "How important are testing and quality assurance (QA)?"
docs = retriever.invoke("테스트와 품질보증(QA)의 중요성은?")
print([d.page_content[:120] for d in docs])

# Example 4: "How does managing requirement changes affect project quality?"
docs = retriever.invoke("요구사항 변경 관리가 프로젝트 품질에 주는 영향은?")
print([d.page_content[:120] for d in docs])

# Example 5: "What are the recommendations for development tools and pipelines?"
docs = retriever.invoke("개발 도구와 파이프라인 관련 권장 사항은?")
print([d.page_content[:120] for d in docs])



['opment.  Nevertheless, the differences between \nsoftware engineering and games development are \nnot exclusive; it seems ', 'choice is to consider the developer perspective to produce good -quality software games by improving the \ngame developme', 'investigation and research into the individual  \n \ncomponents of a system. Researchers do not have \nresources to develop']
['Journal of Computer Science and Technology, 31(5):925-950, DOI: 10.007/s11390-016-1673-z, Springer, September 2016. \n- 8', 'Journal of Computer Science and Technology, 31(5):925-950, DOI: 10.007/s11390-016-1673-z, Springer, September 2016. \n- 2', 'The term “collaboration” can be defined as the \nlevel of shared understanding and coordination \namong teams and the main']
['Journal of Computer Science and Technology, 31(5):925-950, DOI: 10.007/s11390-016-1673-z, Springer, September 2016. \nand', 'product. Typically, in a particular game project, \nthe leader dedicates a specific amount of time for \nquality assu
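
`THRESHOLD = 0.55` is defined in the configuration cell as a confidence-warning cutoff, but none of the cells shown here apply it. One way it might be used is sketched below, assuming each hit comes with a relevance score where higher means more relevant (for example from LangChain's `similarity_search_with_relevance_scores`, whose scale depends on the store's distance metric). `flag_low_confidence` is a hypothetical helper, and the scores are made up for illustration; they are not results from this notebook.

```python
TOP_K = 3
THRESHOLD = 0.55  # same values as the configuration cell


def flag_low_confidence(hits, threshold=THRESHOLD, top_k=TOP_K):
    """hits: iterable of (text, score) pairs, higher score = more relevant.

    Returns the top_k hits sorted by score, plus a flag that is True when
    even the best hit scores below the threshold (or there are no hits).
    """
    ranked = sorted(hits, key=lambda pair: pair[1], reverse=True)[:top_k]
    warn = (not ranked) or ranked[0][1] < threshold
    return ranked, warn


# Made-up scores for illustration only.
sample_hits = [("chunk A", 0.41), ("chunk B", 0.52), ("chunk C", 0.38), ("chunk D", 0.47)]
ranked, warn = flag_low_confidence(sample_hits)
print(ranked, warn)
```

A check like this would be one way to notice queries whose top results are mostly reference-list entries rather than relevant body text.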

In [11]:
# Print the full Document objects returned by the last query (example 5)
for d in docs:
    print(d)
    print('-' * 50)

page_content='[9] Gredler, M.E. (2004). Games and 
simulations and their relationship to 
learning. Handbook of Research on 
Educational Communications and 
Technology, pp. 571–581. 
[10] Rieber, L.P. (2005). Multimedia learning 
in games, simulations, and micro-worlds. 
Cambridge Handbook of Multimedia 
Learning, Cambridge University Press, 
U.K., pp. 549–567. 
[11] Keith, C. (2010). Agile Game 
Development with Scrum. Boston: 
Addison-Wesley. 
[12] Pressman, R.S. (2001). Software 
Engineering: A Practitioner Approach, 5th 
ed., New York: Wiley. 
[13] Petrillo, F., Pimenta, M., Trindade, F. 
(2009). What went wrong? A survey of 
problems in game development. Computers 
in Entertainment, ACM Digital Library, 
Vol. 7, No. 1, pp. 13.1–13.22. 
[14] Ramadan, R., Widyani, Y. (2013). Game 
development life-cycle guidelines. 
Proceedings of 5
th International Conference 
on Advanced Computer Science and 
Information Systems (ICACIS), IEEE 
Computer Society, Jakarta, Indonesia,' metadata={'pag



---

