## Documents and Document Loaders

In [1]:
from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

### Loading documents

In [2]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "Global_status_report_on_road_safety_2023.pdf"
loader = PyPDFLoader(file_path)

docs = loader.load()

print(len(docs))

96


In [3]:
print(f"{docs[3].page_content[:200]}\n")
print(docs[3].metadata)

Global status report on road safety 2023
ISBN 978-92-4-008651-7 (electronic version) 
ISBN 978-92-4-008652-4 (print version)
© World Health Organization 2023
Some rights reserved. This work is availab

{'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 19.4 (Windows)', 'creationdate': '2024-05-03T11:38:40+07:00', 'moddate': '2024-07-04T14:36:20+02:00', 'trapped': '/False', 'source': 'Global_status_report_on_road_safety_2023.pdf', 'total_pages': 96, 'page': 3, 'page_label': 'd'}


### Splitting

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)

all_splits = text_splitter.split_documents(docs)

len(all_splits)

283
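
To see what `chunk_size`, `chunk_overlap` and `add_start_index` mean, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification, not the library's algorithm: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph and line breaks, so its chunks are at most `chunk_size` long and the overlap is approximate. The helper `split_with_overlap` and the sample text are made up for illustration.

```python
def split_with_overlap(
    text: str, chunk_size: int = 1000, chunk_overlap: int = 200
) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs using a simple sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Hypothetical sample: 2500 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2500))
chunks = split_with_overlap(sample)

for start, chunk in chunks:
    print(start, len(chunk))
```

Each chunk starts 800 characters after the previous one, so consecutive chunks share 200 characters. The recorded start offset plays the same role as the `start_index` key that `add_start_index=True` adds to each split's metadata.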

## Embeddings

In [5]:
from langchain_ollama import OllamaEmbeddings

embeddings = OllamaEmbeddings(model="nomic-embed-text")

In [6]:
vector1 = embeddings.embed_query(all_splits[0].page_content)
vector2 = embeddings.embed_query(all_splits[1].page_content)

assert len(vector1) == len(vector2)
print(f"Generated vectors of length: {len(vector1)}\n")
print(vector1[:10])

Generated vectors of length: 768

[-0.018381232, 0.005806942, -0.17286521, -0.028823044, 0.07203584, 0.01872876, 0.026769688, 0.03038094, 0.042006318, -0.034976497]
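
Because both vectors have the same length, they can be compared directly. A common measure is cosine similarity. The sketch below uses plain Python and tiny hand-made vectors rather than real 768-dimensional embeddings; the `cosine_similarity` helper is defined here for illustration and is not imported from LangChain.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hand-made stand-ins for embeddings (assumption: real ones come from embed_query).
same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0, 1.0], [0.0, 1.0, 0.0])

print(same_direction)  # close to 1: the vectors point the same way
print(orthogonal)      # 0: the vectors share no direction
```

With real embeddings, the same function could be applied to `vector1` and `vector2` to get a rough sense of how related the first two chunks are.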


## Vector Store

In [7]:
from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)

In [8]:
ids = vector_store.add_documents(documents=all_splits)

### Usage

In [9]:
results = vector_store.similarity_search(
    "Which regions are the most affected by traffic accidents?"
)

print(results[0])

page_content='nean 
Region
South-
East Asia 
Region Western 
Paciﬁc 
Region
European 
Region
7Section 1. The global burden of road traffic deaths
Fatality counts and rates, by 
region and country-income 
level
The vast majority of road traffic deaths, 92%, 
occur in upper-middle, lower-middle, and 
low-income countries combined. Seventy-
nine percent of road traffic deaths occur in 
lower-middle-income countries and upper-
middle-income countries combined (44% and 
35% respectively), with low-income countries 
accounting for 13%, and high-income 
countries accounting for the remaining 8%. 
Relative to the size of countries’ motor 
vehicle fleets and road networks, there is a 
disproportionately high number of fatalities in 
low- and middle-income countries compared 
to high-income countries. For example, high-
income countries have 16% of the world’s 
population, 28% of the world’s vehicle fleet, 
88% of all paved inter-urban roads, and 8% of 
fatalities; by contrast, low-income countr

In [14]:
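# Top-level `await` works in a notebook; in a plain script, call this inside
# an async function and run it with asyncio.run().
results = await vector_store.asimilarity_search(
    "what are the primary causes of accidents?"
)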

print(results[0])

page_content='include deaths that occur at the scene, or 
within a limited time period from the date of 
the crash. 
Producing a global morbidity figure for 
road traffic crashes is challenging, because 
around a third of countries report no measure 
for nonfatal cases, while the other two 
thirds report using a variety of operational 
definitions. Only 114 countries report having 
a specific definition for injuries that result 
from a road traffic crash. More than half of 
these countries (57%) use either the need for 
hospitalization as the operational definition 
(or hospitalization plus another condition) 
or require three or more days of leave from 
work. The next most common definition 
used by more than 10% of countries relates 
to standardized injury definitions such as the 
Maximum Abbreviated Injury Scale (MAIS) 
(39), the Revised Trauma Score (RTS) (40), or 
the Mechanism/Glasgow Coma Score (Age/
Pressure (MGAP) (41). The remaining countries 
report using a variety of defini

In [15]:
results = vector_store.similarity_search_with_score("Who wrote this report?")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 0.6684215718992772

page_content='from the WHO Division of Data, Analytics and Delivery 
for Impact. Overall supervision was provided by Nhan 
Tran and Etienne Krug.
Advisory Board members actively participated in the study 
design and the review of the report: Shrikant Bangdiwala, 
McMaster University, Canada; Said Dahdah, Global Road 
Safety Facility at the World Bank, Washington (DC), 
United States of America (USA); Jennifer Ellis, Bloomberg 
Philanthropies, New York, United States of America 
(USA); Eduard Fernandez, CITA – the International Motor 
Vehicle Inspection Committee, Brussels, Belgium; Floor 
Lieshout, Youth for Road Safety (YOURS), Culemborg, the 
Netherlands; John Milton, Road Safety Committee World 
Road Association (Permanent International Association of 
Road Congresses -PIARC), Paris, France; Fernanda Rivera, 
Road Safety and Sustainable Urban Mobility, Mexico City’s 
Mobility Ministry; David Studdert, Stanford University, 
California, United States of Amer

In [16]:
embedding = embeddings.embed_query("who are responsible for road accidents")

results = vector_store.similarity_search_by_vector(embedding)
print(results[0])

page_content='Section 4.  
Measures to 
strengthen 
road safety 
governance
171 countries have a dedicated road safety agency.
117 countries report having a national road safety 
strategy, while just 16 of these strategies are 
fully funded.
More than half of countries use general taxes on 
vehicle purchases, insurance, fuel and alcoholic 
beverages among others, to finance road 
safety activities.
About half of all countries use dedicated taxes 
and fines from traffic violations to finance road 
safety activities.
Differences between reported counts of road traffic 
deaths and WHO estimated counts exist in 120 
countries. In some cases the WHO estimates are 10 
times higher, and in one case, 49 times higher, than 
self-reported figures.
Only 114 countries report having a specific definition 
for injuries that result from a road traffic crash.' metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 19.4 (Windows)', 'creationdate': '2024-05-03T11:38:40+07:00', 'modda
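
Conceptually, each of the searches above embeds the query, scores it against every stored vector and returns the best matches. The sketch below shows that brute-force ranking in plain Python. It is an illustration under assumptions, not the `InMemoryVectorStore` implementation: the `toy_store` texts and vectors are invented, and the score is the dot product of unit-length vectors, which equals cosine similarity.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def top_k(query, store, k=1):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(vec))), text)
        for text, vec in store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return scored[:k]


# Invented 3-dimensional "embeddings"; real ones come from the embedding model.
toy_store = [
    ("regional fatality rates", [0.9, 0.1, 0.0]),
    ("report contributors", [0.0, 0.2, 0.9]),
    ("road safety funding", [0.4, 0.8, 0.1]),
]

hits = top_k([1.0, 0.2, 0.0], toy_store, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

Returning the score alongside the text mirrors what `similarity_search_with_score` exposes, and passing a precomputed query vector mirrors `similarity_search_by_vector`.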

## Retrievers

In [17]:
from langchain_core.documents import Document
from langchain_core.runnables import chain


# @chain wraps the function in a Runnable, which provides .invoke() and .batch().
@chain
def retriever(query: str) -> list[Document]:
    return vector_store.similarity_search(query, k=1)

retriever.batch(
    [
        "How many regions are involved in this report?",
        "When was this report made?"
    ]
)

[[Document(id='de4d0902-00d6-4fd9-8454-07683874f46c', metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 19.4 (Windows)', 'creationdate': '2024-05-03T11:38:40+07:00', 'moddate': '2024-07-04T14:36:20+02:00', 'trapped': '/False', 'source': 'Global_status_report_on_road_safety_2023.pdf', 'total_pages': 96, 'page': 7, 'page_label': 'vi', 'start_index': 2421}, page_content='the List of contributors) and approximately 800 or more \ncollaborators. Country1 data delivery involved the use of \na data entry platform and securing consensus among \nparticipants. \nCountry-level participation was facilitated through the \nWHO designated Regional Data Focal Points who were \nresponsible for training and supervision of country \ncollaborators, and who also reviewed the report: Eunice \nChomi and Idrissa Talla (African Region), Alessandra \nSenisse (Region of the Americas), Rania Saad (South-\nEast Asia Region and Eastern Mediterranean Region), \nand Roseanne Vandermeer (Wester

In [18]:
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(
    [
        "How many regions are involved in this report?",
        "When was this report made?"
    ]
)

[[Document(id='de4d0902-00d6-4fd9-8454-07683874f46c', metadata={'producer': 'Adobe PDF Library 17.0', 'creator': 'Adobe InDesign 19.4 (Windows)', 'creationdate': '2024-05-03T11:38:40+07:00', 'moddate': '2024-07-04T14:36:20+02:00', 'trapped': '/False', 'source': 'Global_status_report_on_road_safety_2023.pdf', 'total_pages': 96, 'page': 7, 'page_label': 'vi', 'start_index': 2421}, page_content='the List of contributors) and approximately 800 or more \ncollaborators. Country1 data delivery involved the use of \na data entry platform and securing consensus among \nparticipants. \nCountry-level participation was facilitated through the \nWHO designated Regional Data Focal Points who were \nresponsible for training and supervision of country \ncollaborators, and who also reviewed the report: Eunice \nChomi and Idrissa Talla (African Region), Alessandra \nSenisse (Region of the Americas), Rania Saad (South-\nEast Asia Region and Eastern Mediterranean Region), \nand Roseanne Vandermeer (Wester
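
Both retrievers return one result list per query, in the same order as the inputs. As a rough mental model, and assuming the default `batch` behavior of running each input through `invoke` on a thread pool, the call can be sketched in plain Python as below. `toy_retriever`, `CORPUS` and `batch` are hypothetical stand-ins: the toy retriever ranks by shared words, not by embeddings.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented two-passage corpus standing in for the vector store.
CORPUS = ["a passage about regions", "a passage about dates"]


def toy_retriever(query: str) -> list[str]:
    """Return the single passage sharing the most words with the query (k=1)."""
    words = set(query.lower().split())
    best = max(CORPUS, key=lambda doc: len(words & set(doc.lower().split())))
    return [best]


def batch(fn, inputs):
    """Apply fn to every input concurrently, keeping results in input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(fn, inputs))


results = batch(toy_retriever, ["which regions appear", "which dates appear"])
print(results)
```

The outer list has one entry per query and each inner list holds that query's matches, which is the same nesting seen in the `[[Document(...)], ...]` output above.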