## Build a semantic search engine

This guide focuses on retrieval of text data. We will cover the following concepts:

 - Documents and document loaders;
 - Text splitters;
 - Embeddings;
 - Vector stores and retrievers.


pip install langchain-community langchain-text-splitters langchain-openai pypdf

In [2]:
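from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; PyPDFLoader returns one Document per page.
file_path = "../data/DeepSeek_V3.pdf"
loader = PyPDFLoader(file_path)
docs = loader.load()
print(docs)
# Each Document carries metadata such as the source path and page number.
print(docs[0].metadata)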

[Document(metadata={'source': '../data/DeepSeek_V3.pdf', 'page': 0}, page_content='DeepSeek-V3 Technical Report\nDeepSeek-AI\nresearch@deepseek.com\nAbstract\nWe present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total\nparameters with 37B activated for each token. To achieve efficient inference and cost-effective\ntraining, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architec-\ntures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers\nan auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training\nobjective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and\nhigh-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to\nfully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms\nother open-source models and achieves performance comparable to leading closed-source\nmodels. Desp
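
Printing the full list of documents produces a very long output, as above. A minimal sketch of a more compact view follows, assuming only that each document exposes `page_content` and `metadata`. The `PageDoc` class and the `summarize` helper are hypothetical stand-ins used for illustration, not part of LangChain.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a loaded page: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize(docs, preview_chars=40):
    """Return one short line per document: page number, length, text preview."""
    lines = []
    for doc in docs:
        page = doc.metadata.get("page", "?")
        # Flatten newlines so each preview stays on a single line.
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"page {page}: {len(doc.page_content)} chars | {preview}")
    return lines


sample_docs = [
    PageDoc("DeepSeek-V3 Technical Report\nDeepSeek-AI", {"page": 0}),
    PageDoc("1. Introduction", {"page": 1}),
]
summary = summarize(sample_docs)
for line in summary:
    print(line)
```

The same helper should work on the `docs` list returned by the loader, since it only reads the two attributes named above.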

In [3]:
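from langchain_text_splitters import RecursiveCharacterTextSplitter

# Split pages into chunks of up to 1000 characters with 200 characters of overlap.
# add_start_index=True records each chunk's offset in its source page as "start_index".
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)
print(all_splits)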

[Document(metadata={'source': '../data/DeepSeek_V3.pdf', 'page': 0, 'start_index': 0}, page_content='DeepSeek-V3 Technical Report\nDeepSeek-AI\nresearch@deepseek.com\nAbstract\nWe present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total\nparameters with 37B activated for each token. To achieve efficient inference and cost-effective\ntraining, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architec-\ntures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers\nan auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training\nobjective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and\nhigh-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to\nfully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms\nother open-source models and achieves performance comparable to leading closed-so
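
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of overlapping fixed-size windows in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and lines); the `chunk_text` function is a hypothetical illustration of the overlap idea only.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Returns (start_index, chunk) pairs, similar in spirit to add_start_index=True.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=4)
for start, chunk in chunks:
    print(start, chunk)
```

Each window begins `chunk_size - chunk_overlap` characters after the previous one, so the tail of one chunk reappears at the head of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.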

In [20]:
from langchain_core.vectorstores import InMemoryVectorStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Requires the OPENAI_API_KEY environment variable to be set.
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Embed the first two chunks; each embedding is a list of floats.
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)
print(vector_1)

# Index every chunk in an in-memory vector store.
vector_store = InMemoryVectorStore(embeddings)
ids = vector_store.add_documents(documents=all_splits)

# Embed the question and retrieve the most similar chunks.
question = "How long did DeepSeek-V3 pre-training take?"
query_embedding = embeddings.embed_query(question)
results = vector_store.similarity_search_by_vector(query_embedding)
print(results)

# Pass the top chunk to the chat model as context for the question.
llm = ChatOpenAI(model="gpt-4o-mini")
prompt = f"Context:\n{results[0].page_content}\n\nQuestion: {question}"
ret = llm.invoke(prompt)
print(ret.content)


[0.010783931240439415, -0.019750265404582024, -0.012887763790786266, -0.028981368988752365, 0.0006113816634751856, 0.006848189979791641, -0.03105657733976841, 0.05461378023028374, -0.012558593414723873, 0.03549323230981827, 0.0189917404204607, -0.017059650272130966, 0.027092212811112404, -0.04138968884944916, -0.0039142738096416, 0.016615984961390495, 0.012351072393357754, -0.009445779025554657, 0.005950125399976969, -0.055157627910375595, -0.027950920164585114, -0.020465854555368423, -0.021739603951573372, 0.008637163788080215, 0.042334266006946564, 0.015056000091135502, 0.017503315582871437, 0.011241909116506577, -0.04994813725352287, 0.019778888672590256, 0.031571801751852036, 0.027077900245785713, -0.012916387990117073, -0.01020430400967598, -0.008830372244119644, -0.014132889918982983, 0.04762962833046913, 0.022498128935694695, -0.01883431151509285, 0.010211460292339325, 0.02183978632092476, 0.008265056647360325, -0.03065584786236286, -0.00948871485888958, 0.020322738215327263, -0
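
An embedding is just a list of floats like the one printed above, and two embeddings are commonly compared with cosine similarity. The sketch below uses toy 3-dimensional vectors as stand-ins for real embeddings; the `cosine_similarity` helper is written here for illustration and is not imported from any library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of three pieces of text.
toy_1 = [1.0, 0.0, 1.0]
toy_2 = [1.0, 0.0, 0.0]
toy_3 = [0.0, 1.0, 0.0]

sim_12 = cosine_similarity(toy_1, toy_2)  # partially aligned
sim_23 = cosine_similarity(toy_2, toy_3)  # orthogonal
print(sim_12, sim_23)
```

The same function could be applied to two vectors returned by `embed_query` to check how close two chunks are in embedding space.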

In [15]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain


# Wrapping the function with @chain turns it into a Runnable, which adds .invoke and .batch.
@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


rets = retriever.batch(
    [
        "How many authors does the DeepSeek-V3 report have?",
        "Who is the first author?",
    ],
)
print(rets)


[[Document(id='0c3d5473-7b96-4c72-afec-6506d1a5208d', metadata={'source': '../data/DeepSeek_V3.pdf', 'page': 34, 'start_index': 2336}, page_content='DeepSeek-R1 series of models. Comprehensive evaluations demonstrate that DeepSeek-V3 has\nemerged as the strongest open-source model currently available, and achieves performance com-\nparable to leading closed-source models like GPT-4o and Claude-3.5-Sonnet. Despite its strong\nperformance, it also maintains economical training costs. It requires only 2.788M H800 GPU\nhours for its full training, including pre-training, context length extension, and post-training.\nWhile acknowledging its strong performance and cost-effectiveness, we also recognize that\nDeepSeek-V3 has some limitations, especially on the deployment. Firstly, to ensure efficient\ninference, the recommended deployment unit for DeepSeek-V3 is relatively large, which might\npose a burden for small-sized teams. Secondly, although our deployment strategy for DeepSeek-\nV3 has 
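
Conceptually, a similarity search with `k=1` scores every stored vector against the query vector and keeps the best match. The sketch below shows that idea with a plain list as the "store" and hand-written toy vectors. The `top_k` helper, the store contents, and the query vectors are all hypothetical; a real vector store may use a different metric or an approximate index.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, store, k=1):
    """Return the k (score, text) pairs whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy store: (text, vector) pairs standing in for embedded chunks.
store = [
    ("training cost", [0.9, 0.1, 0.0]),
    ("author list", [0.0, 0.2, 0.9]),
    ("model architecture", [0.1, 0.9, 0.1]),
]

# Toy query vectors standing in for embedded questions.
queries = {
    "Who wrote the report?": [0.0, 0.1, 1.0],
    "How expensive was training?": [1.0, 0.0, 0.1],
}

# Run every query against the store, one result list per query.
batch_results = [top_k(vec, store, k=1) for vec in queries.values()]
for question, hits in zip(queries, batch_results):
    print(question, "->", hits[0][1])
```

Note that the top match is only the closest chunk, not necessarily one that answers the question, which is why inspecting the retrieved text (as in the output above) is worthwhile.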