# Aoororachain - Load data for semantic search

In [2]:
require "aoororachain"

true

## Configuration

In [24]:
Aoororachain.logger = Logger.new($stdout)
Aoororachain.log_level = Aoororachain::LEVEL_DEBUG

chroma_host = "http://localhost:8000"
collection_name = "ley-fintech"

"ley-fintech"

## Load data

With data loaders you can pass a parser class to help you clean, format or extract metadata from loaded documents.

In [1]:
# PDFDocParser: parses text extracted from a PDF.
# Forces the text to UTF-8 and collapses runs of whitespace.
class PDFDocParser
  def self.parse(text)
    metadata = {}
    
    text = text.encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
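    # gsub! returns nil when nothing is replaced, so chaining strip! on it can
    # raise NoMethodError; the non-bang versions always return a String.
    text = text.gsub(/\s+/, " ").strip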
    
    [text, metadata]
  end
end

:parse
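
Because `parse` returns a `[text, metadata]` pair, the same interface can also carry information about each document. The sketch below is a hypothetical parser, not part of the library: the class name `MetadataDocParser` and the `"characters"` and `"words"` keys are assumptions chosen for illustration, and how the loader uses the returned hash is not confirmed here.

```ruby
# Hypothetical parser with the same interface as PDFDocParser:
# self.parse(text) returns [text, metadata]. The metadata keys are
# illustrative assumptions, not names required by the library.
class MetadataDocParser
  def self.parse(text)
    # Non-bang methods always return a String, so the chain is safe even
    # when there is nothing to replace or strip.
    clean = text
      .encode("UTF-8", invalid: :replace, undef: :replace, replace: "")
      .gsub(/\s+/, " ")
      .strip

    metadata = {
      "characters" => clean.length,
      "words" => clean.split(" ").size
    }

    [clean, metadata]
  end
end

text, metadata = MetadataDocParser.parse("  Ley   Fintech\n\nArtículo 1  ")
puts text      # Ley Fintech Artículo 1
puts metadata["words"] # 4
```

It would be passed to the loader the same way as `PDFDocParser`, through the `parser:` argument.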

With a _DirectoryLoader_, a glob selects the documents to load. The _loader_ is set for PDF files, and the _PDFDocParser_ cleans the loaded documents. **NOTE: Update the path to point to your files.**

In [5]:
directory_loader = Aoororachain::Loaders::DirectoryLoader.new(path: "./files", glob: "**/*.pdf", loader: Aoororachain::Loaders::PDFLoader, parser: PDFDocParser)
files = directory_loader.load
files.size

D, [2023-06-27T15:19:32.738787 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:32.790326 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:32.836711 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:32.886452 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:32.934852 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:32.981511 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:33.031347 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:33.080203 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:33.124492 #6809] DEBUG -- : message

D, [2023-06-27T15:19:38.891187 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:38.980687 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:39.082565 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:39.157284 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:39.273334 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:39.383840 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:39.458649 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:39.548850 #6809] DEBUG -- : message=Extracted metadata using parser Class additional_metadata={}
D, [2023-06-27T15:19:39.621418 #6809] DEBUG -- : message

2
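
To check which files a `path` and `glob` pair would select before loading them, plain Ruby is enough. This is a minimal sketch that assumes the loader resolves the glob relative to `path` in the same way `Dir.glob` does; the helper name `matching_files` is hypothetical.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: lists the files under `path` that match `glob`.
# Assumes the glob is resolved relative to `path`, as Dir.glob would do.
def matching_files(path, glob)
  Dir.glob(File.join(path, glob)).sort
end

# Demonstration on a throwaway directory, so the example is self-contained.
Dir.mktmpdir do |dir|
  FileUtils.mkdir_p(File.join(dir, "laws"))
  FileUtils.touch(File.join(dir, "a.pdf"))
  FileUtils.touch(File.join(dir, "laws", "b.pdf"))
  FileUtils.touch(File.join(dir, "notes.txt"))

  # "**/" matches zero or more directories, so both PDFs are found.
  puts matching_files(dir, "**/*.pdf").map { |f| f.delete_prefix("#{dir}/") }
end
```

With your own data, `matching_files("./files", "**/*.pdf")` gives a quick preview of what the loader is expected to pick up.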

## Split files

Using _RecursiveTextSplitter_, loaded documents are split into smaller chunks of text.

In [6]:
text_splitter = Aoororachain::RecursiveTextSplitter.new(size: 512, overlap: 0)

texts = []
files.each do |file|
  texts.concat(text_splitter.split_documents(file))
end

texts.size

577
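
To build intuition for the `size` and `overlap` parameters, here is a deliberately simple fixed-size chunker. It is not the _RecursiveTextSplitter_ implementation, which may pick split points differently; it assumes `size` counts characters, and the function name `chunk_text` is hypothetical.

```ruby
# Illustrative fixed-size chunker. NOT the RecursiveTextSplitter algorithm;
# it only shows what `size` and `overlap` mean, assuming both count characters.
def chunk_text(text, size:, overlap: 0)
  raise ArgumentError, "overlap must be smaller than size" if overlap >= size

  step = size - overlap # how far the window advances each time
  chunks = []
  start = 0
  while start < text.length
    chunks << text[start, size]
    start += step
  end
  chunks
end

sample = "abcdefghij"
p chunk_text(sample, size: 4)             # => ["abcd", "efgh", "ij"]
p chunk_text(sample, size: 4, overlap: 2) # => ["abcd", "cdef", "efgh", "ghij", "ij"]
```

With `overlap: 0`, as in the cell above, consecutive chunks share no text. A positive overlap repeats the tail of one chunk at the head of the next, which can help when a relevant sentence straddles a chunk boundary, at the cost of storing more chunks.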

## Create embeddings

Embeddings are generated for the texts; both the texts and their embeddings are stored in Chroma DB.

In [5]:
model = Aoororachain::Embeddings::LocalPythonEmbedding::MODEL_INSTRUCTOR_L
device = "mps" # cuda or cpu

"mps"

In [6]:
embedder = Aoororachain::Embeddings::LocalPythonEmbedding.new(model:, device:)
vector_database = Aoororachain::VectorStores::Chroma.new(embedder: embedder, options: {host: chroma_host})

vector_database.from_documents(texts, index: collection_name)

I, [2023-07-11T11:23:01.916847 #2227]  INFO -- : message=Using data={:model=>"hkunlp/instructor-large", :device=>"mps"}
I, [2023-07-11T11:23:01.917040 #2227]  INFO -- : message=This embedding calls Python code using system call.


#<Aoororachain::VectorStores::Chroma:0x0000000106e85358 @embedder=#<Aoororachain::Embeddings::LocalPythonEmbedding:0x0000000106e87f18 @model="hkunlp/instructor-large", @device="mps">>

## Query documents

With documents and embeddings in Chroma, you can query the documents using semantic search.

In [25]:
vector_database.from_index(collection_name)

retriever = Aoororachain::VectorStores::Retriever.new(vector_database)

D, [2023-07-11T11:27:52.417419 #2227] DEBUG -- : message=Sending a request method=post uri=http://localhost:8000/api/v1/collections params={:name=>"ley-fintech", :metadata=>{:embedder=>"Aoororachain::Embeddings::LocalPythonEmbedding : hkunlp/instructor-large : mps"}, :get_or_create=>true}
I, [2023-07-11T11:27:52.426084 #2227]  INFO -- : message=Successful response code=200


#<Aoororachain::VectorStores::Retriever:0x0000000106447918 @vector_store=#<Aoororachain::VectorStores::Chroma:0x0000000106e85358 @embedder=#<Aoororachain::Embeddings::LocalPythonEmbedding:0x0000000106e87f18 @model="hkunlp/instructor-large", @device="mps">, @store=#<Chroma::Resources::Collection:0x0000000106449088 @id="b3f2ff9f-c359-45e9-9f8f-d62c99c465fd", @name="ley-fintech", @metadata={"embedder"=>"Aoororachain::Embeddings::LocalPythonEmbedding : hkunlp/instructor-large : mps"}>>, @search_type=:similarity, @results=3>

In [26]:
documents = retriever.search("¿Qué es una institución de tecnología financiera?", results: 4)

I, [2023-07-11T11:28:15.315757 #2227]  INFO -- : message=First time usage might take long time due to models download.
D, [2023-07-11T11:28:20.013592 #2227] DEBUG -- : message=Text embedded
D, [2023-07-11T11:28:20.015502 #2227] DEBUG -- : message=Query embeddings ¿Qué es una institución de tecnología financiera? data={:embeddings=>[[-0.027840066701173782, -0.015734124928712845, 0.01961270347237587, 0.008476519025862217, 0.027325643226504326, 0.03806598111987114, 0.01359499804675579, 0.03649275377392769, 0.003697786247357726, 0.006635308265686035, 0.014541883952915668, 0.007061609998345375, -0.0008789448766037822, 0.04689152538776398, -0.03610367700457573, -0.0057502505369484425, -0.023546848446130753, -0.03421090543270111, -0.06470227241516113, -0.002786332741379738, 0.020043276250362396, -0.010822630487382412, -0.0008251603576354682, 0.026602277532219887, -0.008874745108187199, 0.029100699350237846, 0.021924568340182304, 0.0008391602314077318, 0.024778123944997787, -0.0294620171189308

D, [2023-07-11T11:28:20.019291 #2227] DEBUG -- : message=Sending a request method=post uri=http://localhost:8000/api/v1/collections/b3f2ff9f-c359-45e9-9f8f-d62c99c465fd/query params={:query_embeddings=>[[-0.027840066701173782, -0.015734124928712845, 0.01961270347237587, 0.008476519025862217, 0.027325643226504326, 0.03806598111987114, 0.01359499804675579, 0.03649275377392769, 0.003697786247357726, 0.006635308265686035, 0.014541883952915668, 0.007061609998345375, -0.0008789448766037822, 0.04689152538776398, -0.03610367700457573, -0.0057502505369484425, -0.023546848446130753, -0.03421090543270111, -0.06470227241516113, -0.002786332741379738, 0.020043276250362396, -0.010822630487382412, -0.0008251603576354682, 0.026602277532219887, -0.008874745108187199, 0.029100699350237846, 0.021924568340182304, 0.0008391602314077318, 0.024778123944997787, -0.029462017118930817, 0.007106659933924675, -0.05771395191550255, -0.050318311899900436, -0.055999428033828735, -0.039833538234233856, 0.034173998981

I, [2023-07-11T11:28:20.040722 #2227]  INFO -- : message=Successful response code=200
D, [2023-07-11T11:28:20.041468 #2227] DEBUG -- : message=Building embeddings from {"ids"=>[["ecc98395-796c-4d23-99e9-afd060bbdc29", "2b04b516-1163-41af-bad4-b4df8288ff52", "cfa3a96c-0db9-4f2b-b7c8-65b51268cc73", "7a964dd1-4687-4b16-a2d8-7d6eaa80c447"]], "embeddings"=>nil, "documents"=>[["supervisadas por las Autoridades Financieras. Las expresiones “institución de tecnología financiera”, “ITF”, “institución de financiamiento colectivo”, “institución de fondos de pago electrónico” u otras que expresen ideas semejantes en cualquier idioma, referidas a dichos conceptos o a marcas y productos que correspondan a ellos, por las que pueda inferirse la realización de las actividades propias de las referidas entidades, no podrán ser usadas en el nombre, denominación, razón social o publicidad de", "tecnología financiera”, “ITF”, “institución de financiamiento colectivo”, “institución de fondos de pago electrón

[#<Chroma::Resources::Embedding:0x00000001064b5cd8 @id="ecc98395-796c-4d23-99e9-afd060bbdc29", @embedding=nil, @metadata={"source"=>"./files/LRITF_200521.pdf", "pages"=>71, "page"=>6}, @document="supervisadas por las Autoridades Financieras. Las expresiones “institución de tecnología financiera”, “ITF”, “institución de financiamiento colectivo”, “institución de fondos de pago electrónico” u otras que expresen ideas semejantes en cualquier idioma, referidas a dichos conceptos o a marcas y productos que correspondan a ellos, por las que pueda inferirse la realización de las actividades propias de las referidas entidades, no podrán ser usadas en el nombre, denominación, razón social o publicidad de", @distance=0.1925981193780899>, #<Chroma::Resources::Embedding:0x00000001064b5c38 @id="2b04b516-1163-41af-bad4-b4df8288ff52", @embedding=nil, @metadata={"source"=>"./files/LRITF_200521.pdf", "pages"=>71, "page"=>55}, @document="tecnología financiera”, “ITF”, “institución de financiamiento cole

In [27]:
documents.size

4

The returned documents are the ones that seem most likely to contain the answer to the question. Each result includes the chunk of text, its metadata, and the cosine distance between the text and the question.

In [29]:
documents.first

#<Chroma::Resources::Embedding:0x00000001064b5cd8 @id="ecc98395-796c-4d23-99e9-afd060bbdc29", @embedding=nil, @metadata={"source"=>"./files/LRITF_200521.pdf", "pages"=>71, "page"=>6}, @document="supervisadas por las Autoridades Financieras. Las expresiones “institución de tecnología financiera”, “ITF”, “institución de financiamiento colectivo”, “institución de fondos de pago electrónico” u otras que expresen ideas semejantes en cualquier idioma, referidas a dichos conceptos o a marcas y productos que correspondan a ellos, por las que pueda inferirse la realización de las actividades propias de las referidas entidades, no podrán ser usadas en el nombre, denominación, razón social o publicidad de", @distance=0.1925981193780899>
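
For reference, cosine distance is commonly defined as one minus the cosine similarity of two vectors, so a smaller value means the texts point in a more similar direction. The sketch below computes it in plain Ruby on toy vectors; it illustrates the metric only and is not how Chroma computes or indexes distances internally.

```ruby
# Cosine distance = 1 - (a . b) / (|a| * |b|).
# Plain-Ruby illustration; zero-length vectors are not handled.
def cosine_distance(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.length == b.length

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  1.0 - dot / (norm_a * norm_b)
end

puts cosine_distance([1.0, 0.0], [1.0, 0.0]) # same direction: 0.0
puts cosine_distance([1.0, 0.0], [0.0, 1.0]) # orthogonal: 1.0
```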

**These are the texts that you send as context to LLMs to create a response using Natural Language Processing.**
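
As a rough sketch of that last step, the retrieved chunks can be joined into a context block and wrapped in a prompt. Everything here is an assumption for illustration: `RetrievedChunk` is a hypothetical stand-in that only mirrors the `document`, `metadata` and `distance` fields visible in the output above, and the prompt wording is not a template shipped with the library.

```ruby
# Hypothetical stand-in for the objects returned by retriever.search; it only
# mirrors the document, metadata and distance fields shown in the output.
RetrievedChunk = Struct.new(:document, :metadata, :distance, keyword_init: true)

# Joins the retrieved chunks, closest first, into one context string.
def build_context(chunks)
  chunks
    .sort_by(&:distance)
    .map { |chunk| "(#{chunk.metadata["source"]}, page #{chunk.metadata["page"]})\n#{chunk.document}" }
    .join("\n\n")
end

# Wraps the context and the question in a simple prompt (assumed wording).
def build_prompt(question, chunks)
  <<~PROMPT
    Answer the question using only the context below.

    Context:
    #{build_context(chunks)}

    Question: #{question}
  PROMPT
end

# Placeholder data, not real query results.
chunks = [
  RetrievedChunk.new(document: "Second closest chunk.", metadata: {"source" => "./files/example.pdf", "page" => 55}, distance: 0.21),
  RetrievedChunk.new(document: "Closest chunk.", metadata: {"source" => "./files/example.pdf", "page" => 6}, distance: 0.19)
]

puts build_prompt("¿Qué es una institución de tecnología financiera?", chunks)
```

Keeping the source file and page next to each chunk makes it easier to trace an answer back to the original document.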