### File Loader

- text loader
- pdf loader
- unstructured loader, among many others
  - The unstructured loader can be used regardless of file extension.
  - Reference for the NLTK download SSL certificate error: https://stackoverflow.com/questions/38916452/nltk-download-ssl-certificate-verify-failed
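
As a rough illustration of what a loader does, the sketch below reads a plain-text file and wraps it in a document-like record. `SimpleDocument` and `load_text_file` are hypothetical names made up for this example, not LangChain classes. They only mimic the `page_content` and `metadata` fields that LangChain's `Document` objects carry.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document (not a LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str, encoding: str = "utf-8") -> List[SimpleDocument]:
    """Read one text file and return it as a single-document list."""
    text = Path(path).read_text(encoding=encoding)
    # Record where the text came from so later steps can cite the source.
    return [SimpleDocument(page_content=text, metadata={"source": path})]
```

Real loaders for PDF or other formats do the same job with a format-specific parsing step in place of `read_text`.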


### Text splitters

- Used to split the text loaded by a loader into smaller chunks
- More efficient, because only the relevant chunks need to be cut out and searched
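
To make the chunk size and overlap idea concrete, here is a minimal sketch of a fixed-width character splitter. `split_with_overlap` is a hypothetical helper written for this note. It is a simplification: LangChain's splitters split on separators first and can measure length in tokens rather than characters, so their output will differ.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into character windows; each window repeats the last
    `chunk_overlap` characters of the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

With an overlap, text cut at a chunk boundary still appears whole in the neighbouring chunk, at the cost of storing some characters twice.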


In [None]:
from langchain.document_loaders import UnstructuredFileLoader
from langchain.chat_models import ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

chat = ChatOpenAI(temperature=0.1)

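#? chunk_size : sets the size of each chunk the document is split into
#? chunk_overlap : to compensate for chunks that cut off mid-sentence, the tail of the previous chunk is prepended to the next chunk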
# splitter = RecursiveCharacterTextSplitter(
#   chunk_size=200,
#   chunk_overlap=50
# )
splitter = CharacterTextSplitter.from_tiktoken_encoder(
  separator="\n",
  chunk_size=600,
  chunk_overlap=100,
  # length_function=len
)

loader = UnstructuredFileLoader("./files/test.pdf")
docs = loader.load_and_split(text_splitter=splitter)


### Embedder

- Represents data as vectors according to its attributes
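
One reason vector representations are useful is that similarity between texts becomes a numeric comparison. The sketch below uses cosine similarity on hand-made 3-dimensional vectors. Both the vectors and the choice of cosine similarity are assumptions for illustration only; real embeddings come from a model and have far more dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up toy vectors, NOT real model output.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9],
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["kitten"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

In this toy data the "cat" and "kitten" vectors point in nearly the same direction, so their score is higher than the "cat" and "car" score.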


In [None]:
from langchain.embeddings import OpenAIEmbeddings
embedder = OpenAIEmbeddings()
vector = embedder.embed_documents([
  "Hi",
  "How",
  "are",
  "you",
  "You can embed quite long sentences"
])

for i in vector:
  print(len(i))


[-0.03629858192333018, -0.007224538187570188, -0.03371885554109727, -0.02866363267807191, -0.026865641732513695, 0.03460482274185763, -0.012318847263635718, -0.007752209747023993, 0.0019380524367559983, -0.002701873068082294, 0.02478101390138119, -0.002477124199887517, -0.005732726535614382, -0.002905449946508664, 0.006677323288765644, -0.003032482117949758, 0.03384914384922044, -0.0015032120884641703, 0.021093827586875228, -0.008996472123429598, -0.021719216308744023, 0.01038405247696104, 0.006244111590891486, 0.00708122021044435, -0.012312332661965037, 0.0008998100308185962, 0.005876044512740219, -0.009888952994538026, -0.0030731974470689016, -0.02457255037320985, 0.01074234811826759, -0.013810659381252829, -0.02442923286174532, -0.014110324538845866, 0.0024347802203507035, -0.018878911447619554, 0.0005618723451099323, -0.011270018746398786, 0.018110203351641003, -0.009967126351940971, 0.013028923944578141, -0.011328649230112302, -0.00913327596454606, -0.009654432922329186, -0.026539