# [Build a semantic search engine](https://python.langchain.com/docs/tutorials/retrievers/)

> *Build a search engine over a PDF document. This will allow us to retrieve passages in the PDF that are similar to an input query.*


* [Document Loader](https://python.langchain.com/docs/concepts/document_loaders/)
* Text splitters
* [Embedding](https://python.langchain.com/docs/concepts/embedding_models/)
* [Vector Store](https://python.langchain.com/docs/concepts/vectorstores/) abstractions & Retrievers

→ These abstractions support data retrieval from (vector) databases and other sources, for integration with LLM workflows.



In [1]:
%pip install -qU langchain-community pypdf 

Note: you may need to restart the kernel to use updated packages.


In [2]:
import getpass
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass()

#### Documents and Document Loaders

The [Document](https://python.langchain.com/api_reference/core/documents/langchain_core.documents.base.Document.html) abstraction represents a unit of text and its associated metadata.


Attributes:
* `page_content`: a string representing the content;
* `metadata`: a dict containing arbitrary metadata;
* `id`: (optional) a string identifier for the document.

> *Note that an individual Document object often represents a chunk of a larger document.*

In [2]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

## Loading documents
Load a PDF into a **sequence of Document objects**.

In [10]:
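# Fetch via /raw/ so wget saves the PDF itself; the /blob/ URL returns GitHub's HTML page for the file.
!wget https://github.com/langchain-ai/langchain/raw/master/docs/docs/example_data/nke-10k-2023.pdf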

--2024-12-23 11:12:34--  https://github.com/langchain-ai/langchain/blob/master/docs/docs/example_data/nke-10k-2023.pdf
Resolving github.com (github.com)... 20.87.245.0
Connecting to github.com (github.com)|20.87.245.0|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘nke-10k-2023.pdf’

nke-10k-2023.pdf        [   <=>              ] 221.04K   530KB/s    in 0.4s    

2024-12-23 11:12:35 (530 KB/s) - ‘nke-10k-2023.pdf’ saved [226347]



In [11]:
from langchain_community.document_loaders import PyPDFLoader
import os

file_path = os.path.join(os.getcwd(), "nke-10k-2023.pdf")
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

107


> #### [ℹ️ How to load PDFs Guide](https://python.langchain.com/docs/how_to/document_loader_pdf/)

In [13]:
print(f"{docs[0].page_content[:200]}\n")
print(docs[0].metadata)

Table of Contents
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
FORM 10-K
(Mark One)
☑  ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934
F

{'source': '/Users/neelanpather/dev/test-langchain/nke-10k-2023.pdf', 'page': 0}


### [Text Splitters](https://python.langchain.com/docs/concepts/text_splitters/)

* For **information retrieval** and **downstream question-answering purposes**, a page may be too coarse a representation
* Splitting pages into smaller chunks helps ensure that the meanings of relevant portions of the document are not "washed out" by surrounding text

> #### Simple Text Splitter 
* Partitions based on characters
* Split our documents into chunks of 1000 characters with 200 characters of overlap between chunks
* The overlap helps mitigate the possibility of separating a statement from important context related to it
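
To make the chunk-size and overlap arithmetic concrete, here is a minimal pure-Python sketch of character-based splitting. The function name `split_with_overlap` is hypothetical and this is not LangChain's implementation; it only illustrates how a 1000-character window advancing by 800 characters leaves 200 characters shared between neighbouring chunks.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Cut text into fixed-size character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# A synthetic 2500-character string stands in for a page of text.
text = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(text)
print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next.
print(chunks[0][-200:] == chunks[1][:200])
```

A statement that straddles a chunk boundary therefore appears whole in at least one chunk, as long as it is shorter than the overlap.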


> #### [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/how_to/recursive_text_splitter/)

* Recursively splits the document using common separators (such as new lines) until each chunk is an appropriate size
* **‼ Recommended text splitter for generic text use cases**
* `add_start_index=True` → the character index at which each split Document starts is saved in its metadata
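
The recursive idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the library's algorithm: the function name `recursive_split` and the separator order are chosen for the example, and it omits chunk overlap and start-index tracking. It tries the coarsest separator first and falls back to finer ones only for pieces that are still too long.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Split on the coarsest separator, recursing to finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
print(recursive_split(sample, chunk_size=8))
```

In this toy input the paragraph break is used first, and only the second paragraph, which is still longer than 8 characters, is split again on spaces.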



In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
all_splits = text_splitter.split_documents(docs)

len(all_splits)

516

## [Embeddings](https://python.langchain.com/docs/integrations/text_embedding/)

* Vector search is a common way to store and search over unstructured data (such as unstructured text). 
* Store numeric vectors associated with text
* Embed query as vector 
* Use vector similarity metrics (such as cosine similarity) to identify related text
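
As a small illustration of the similarity metric named above, the following sketch computes cosine similarity in plain Python. The three-dimensional vectors are made up for the example; real embeddings have hundreds or thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up "embeddings", for illustration only.
query = [1.0, 0.0, 1.0]
related = [2.0, 0.0, 2.0]    # same direction as the query, different length
unrelated = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, related))    # close to 1: same direction
print(cosine_similarity(query, unrelated))  # 0: nothing in common
```

Because the measure depends only on direction, a vector and a scaled copy of it count as maximally similar.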

In [16]:
%pip install -qU langchain-openai

Note: you may need to restart the kernel to use updated packages.


In [17]:
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

In [18]:
vector_1 = embeddings.embed_query(all_splits[0].page_content)
vector_2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector_1) == len(vector_2)
print(f"Generated vectors of length {len(vector_1)}\n")
print(vector_1[:10])

Generated vectors of length 3072

[0.009358575567603111, -0.01612727716565132, 0.00027407926972955465, 0.006443404126912355, 0.02054382488131523, -0.039210930466651917, -0.007369252387434244, 0.04108764976263046, -0.008169986307621002, 0.060004983097314835]


Armed with a model for generating text embeddings, we can next store them in a special data structure that supports efficient similarity search.

## Vector stores

* VectorStore objects contain methods for adding text and Document objects to the store, and for querying them using various similarity metrics.
* They are often initialized with an embedding model, which determines how text data is translated to numeric vectors.

There is a variety of vector store technologies; some are hosted, some run locally.
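
To show what such an object does internally, here is a toy store in plain Python. Everything in it is hypothetical: `ToyVectorStore` is not a LangChain class, and `toy_embed` (letter counts) is a crude stand-in for a real embedding model. The sketch only mirrors the shape described above: the store is initialized with an embedding function, embeds texts as they are added, and ranks them against an embedded query.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: counts of the letters a-z."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and ranks them by brute force."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._items: List[Tuple[str, Vector]] = []

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            self._items.append((text, self._embed(text)))

    def similarity_search(self, query: str, k: int = 4) -> List[str]:
        query_vector = self._embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine(query_vector, item[1]),
            reverse=True,  # most similar first
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore(toy_embed)
store.add_texts(["banana bread", "kiwi", "bandana"])
print(store.similarity_search("banana", k=2))
```

Production vector stores replace the linear scan with an index built for fast nearest-neighbour search, but the interface stays similar.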

> #### In Memory Vector Store

In [19]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

In [20]:
# add_documents returns the list of IDs assigned to the stored documents
ids = vector_store.add_documents(documents=all_splits)

Most vector store implementations will allow you to connect to an existing vector store, e.g., by providing a client, index name, or other information.

> #### Querying

[Vector Store](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html) includes methods for querying:

* Synchronously and asynchronously;
* By string query and by vector;
* With and without returning similarity scores;
* By similarity and [maximum marginal relevance](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.base.VectorStore.html#langchain_core.vectorstores.base.VectorStore.max_marginal_relevance_search) (to balance similarity to the query against diversity in the retrieved results).

The methods will generally include a list of Document objects in their outputs.
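
The trade-off behind maximum marginal relevance can be sketched with a greedy selection loop. This is an illustrative pure-Python version under stated assumptions, not the library's code: the function name `mmr_select` and the two-dimensional vectors are made up, and `lambda_mult` is used here as the weight on relevance (1.0 means relevance only, lower values favour diversity).

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 2,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k document indices, trading relevance against redundancy."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best, best_score = candidates[0], -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that has already been picked.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1.0 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 0.0]
doc_vecs = [
    [1.0, 0.10],   # highly relevant
    [1.0, 0.12],   # near-duplicate of the first document
    [0.6, -0.80],  # less relevant, but different
]
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5))
print(mmr_select(query_vec, doc_vecs, k=2, lambda_mult=1.0))
```

With `lambda_mult=0.5` the near-duplicate is skipped in favour of the more distinct third document, giving `[0, 2]`; with `lambda_mult=1.0` the selection reduces to plain similarity ranking and gives `[0, 1]`.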


Embeddings typically represent text as a **"dense" vector** such that texts with similar meanings are geometrically close. This lets us retrieve relevant information just by passing in a question, without knowledge of any specific key-terms used in the document.

In [21]:
results = vector_store.similarity_search(
    "How many distribution centers does Nike have in the US?"
)

print(results[0])

page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our
wholesale, NIKE Direct and merchandising strategies in the region, among other functions.
In the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and three of which are
leased. Two other distribution centers, one located in Indianapolis, Indiana and one located in Dayton, Tennessee, are leased and operated by third-party logistics
providers. One distribution center for Converse is located in Ontario, California, which is leased. NIKE has a number of distribution facilities outside the United States,
some of which are leased and operated by third-party logistics providers. The most significant distribution facilities outside the United States are located in Laakdal,' metadata={'source': '/Users/neelanpather/dev/test-langchain/n

> #### Async query

In [22]:
results = await vector_store.asimilarity_search("When was Nike incorporated?")

print(results[0])

page_content='Table of Contents
PART I
ITEM 1. BUSINESS
GENERAL
NIKE, Inc. was incorporated in 1967 under the laws of the State of Oregon. As used in this Annual Report on Form 10-K (this "Annual Report"), the terms "we," "us," "our,"
"NIKE" and the "Company" refer to NIKE, Inc. and its predecessors, subsidiaries and affiliates, collectively, unless the context indicates otherwise.
Our principal business activity is the design, development and worldwide marketing and selling of athletic footwear, apparel, equipment, accessories and services. NIKE is
the largest seller of athletic footwear and apparel in the world. We sell our products through NIKE Direct operations, which are comprised of both NIKE-owned retail stores
and sales through our digital platforms (also referred to as "NIKE Brand Digital"), to retail accounts and to a mix of independent distributors, licensees and sales' metadata={'source': '/Users/neelanpather/dev/test-langchain/nke-10k-2023.pdf', 'page': 3, 'start_index': 0

> #### Return scores

In [23]:
# Note that providers implement different scores; the score here
# is a distance metric that varies inversely with similarity.

results = vector_store.similarity_search_with_score("What was Nike's revenue in 2023?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 0.6881573672429626

page_content='Table of Contents
FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTSThe following tables present NIKE Brand revenues disaggregated by reportable operating segment, distribution channel and major product line:
FISCAL 2023 COMPARED TO FISCAL 2022
• NIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.
The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6,
2 and 1 percentage points to NIKE, Inc. Revenues, respectively.
• NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This
increase was primarily due to higher revenues in Men's, the Jordan Brand, Women's and Kids' which grew 17%, 35%,11% and 10%, respectively, on a wholesale
equivalent basis.' metad

## Retrievers

* VectorStore objects do not subclass [Runnable](https://python.langchain.com/api_reference/core/index.html#langchain-core-runnables) (a unit of work that can be invoked, batched, streamed, transformed and composed).

* LangChain Retrievers are Runnables, so they implement a standard set of methods such as `invoke` and `batch`. A simple one can be created without writing a subclass, once a retrieval method has been chosen.

* Although we can construct retrievers from vector stores, **retrievers can interface with non-vector store sources of data, as well (such as external APIs)**

In [26]:
from typing import List

from langchain_core.documents import Document
from langchain_core.runnables import chain

# The @chain decorator turns the function below into a Runnable
@chain
def retriever(query: str) -> List[Document]:
    return vector_store.similarity_search(query, k=1)


retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(id='ad8f98fa-181b-4768-ab52-7eea10a1e572', metadata={'source': '/Users/neelanpather/dev/test-langchain/nke-10k-2023.pdf', 'page': 26, 'start_index': 804}, page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our\nwholesale, NIKE Direct and merchandising strategies in the region, among other functions.\nIn the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and three of which are\nleased. Two other distribution centers, one located in Indianapolis, Indiana and one located in Dayton, Tennessee, are leased and operated by third-party logistics\nproviders. One distribution center for Converse is located in Ontario, California, which is leased. NIKE has a number of distribution facilities outside the United States,\nsome of which are leased and operated by third-party logisti

Notice that the `@chain` decorator turns the function into a Runnable and sets the Runnable's name to the name of the function. Any Runnables called by the function will be traced as dependencies.


Vector stores implement an `as_retriever` method that generates a Retriever, specifically a `VectorStoreRetriever`.

* It includes specific `search_type` and `search_kwargs` attributes that identify which methods of the underlying vector store to call, and how to parameterize them.

In [27]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many distribution centers does Nike have in the US?",
        "When was Nike incorporated?",
    ],
)

[[Document(id='ad8f98fa-181b-4768-ab52-7eea10a1e572', metadata={'source': '/Users/neelanpather/dev/test-langchain/nke-10k-2023.pdf', 'page': 26, 'start_index': 804}, page_content='operations. We also lease an office complex in Shanghai, China, our headquarters for our Greater China geography, occupied by employees focused on implementing our\nwholesale, NIKE Direct and merchandising strategies in the region, among other functions.\nIn the United States, NIKE has eight significant distribution centers. Five are located in or near Memphis, Tennessee, two of which are owned and three of which are\nleased. Two other distribution centers, one located in Indianapolis, Indiana and one located in Dayton, Tennessee, are leased and operated by third-party logistics\nproviders. One distribution center for Converse is located in Ontario, California, which is leased. NIKE has a number of distribution facilities outside the United States,\nsome of which are leased and operated by third-party logisti