In [None]:
# https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-file
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import PDFReader

# Use PDFReader for the PDFs that SimpleDirectoryReader finds in "data".
# file_extractor is keyed by file extension (e.g. ".pdf"), not by file name.
parser = PDFReader()
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "data",
    file_extractor=file_extractor,
).load_data()

In [None]:
len(documents)

[How to Choose the Right Embedding Model for Your LLM Application](https://www.mongodb.com/developer/products/atlas/choose-embedding-model-rag/)

There are many fields in [MTEB](https://huggingface.co/spaces/mteb/leaderboard) that we can mostly ignore. The fields that matter most for those of us using these models in the real world are:

* Score: focus on "average" and "retrieval average". The two are highly correlated, so either one works.
* Sequence length: the number of tokens a model can consume and compress into a single embedding. Generally speaking, we wouldn't recommend stuffing more than a paragraph into a single embedding, so models supporting up to 512 tokens are usually more than enough.
* Model size: the size of a model indicates how easy it will be to run. All models near the top of MTEB are reasonably sized. One of the largest is instructor-xl (requiring 4.96GB of memory), which we can easily run on consumer hardware.
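
To make the sequence-length point concrete, here is a rough sketch that flags paragraphs likely to exceed a 512-token budget. The helper names and the 0.75 words-per-token ratio are assumptions for illustration only; the real count depends on the tokenizer of whichever embedding model we pick.

```python
import math

# Assumption: roughly 0.75 English words per token. Real tokenizers differ per model.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Crude token estimate based on a whitespace word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def paragraphs_over_budget(text: str, max_tokens: int = 512) -> list:
    """Return indices of paragraphs whose estimated token count exceeds max_tokens."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [i for i, p in enumerate(paragraphs) if estimate_tokens(p) > max_tokens]


# Hypothetical input: one short paragraph and one 600-word paragraph.
sample = "A short paragraph.\n\n" + "word " * 600
print(paragraphs_over_budget(sample))
```

Paragraphs flagged this way are candidates for further splitting before embedding. For an exact count, we would run the model's own tokenizer instead of this heuristic.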

* https://docs.llamaindex.ai/en/stable/examples/embeddings/huggingface/

YouTube video - https://www.youtube.com/watch?v=TcRRfcbsApw

Uses [Semantic-router](https://pypi.org/project/semantic-router/)

Other Library

Semantic chunkers allow us to build more context aware chunks of information. We can use this for RAG, splitting video, audio, and much more.

In this example, we will stick with a simple RAG-focused example. We will learn about four different types of chunkers available to us: StatisticalChunker, ConsecutiveChunker, CumulativeChunker, and RegexChunker.


* https://github.com/aurelio-labs/semantic-chunkers
* https://github.com/aurelio-labs/semantic-chunkers/blob/main/docs/00-chunkers-intro.ipynb
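
To build some intuition for semantic chunking, here is a minimal sketch of the consecutive idea: compare each sentence with the one before it and start a new chunk when similarity drops below a threshold. This is not the semantic-chunkers implementation. `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the 0.2 threshold is an arbitrary value chosen for the toy data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def consecutive_chunks(sentences, threshold=0.2):
    """Start a new chunk whenever a sentence is dissimilar to its predecessor."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(curr)) < threshold:
            chunks.append([curr])  # topic shift: open a new chunk
        else:
            chunks[-1].append(curr)  # same topic: extend the current chunk
    return chunks


sentences = [
    "Cats purr when they are happy.",
    "Happy cats purr and knead blankets.",
    "GPUs speed up matrix multiplication.",
    "Matrix multiplication dominates most GPU workloads.",
]
for chunk in consecutive_chunks(sentences):
    print(chunk)
```

With a real embedding model in place of `toy_embed`, the same loop groups sentences by meaning rather than by shared words.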




In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# loads BAAI/bge-small-en
# embed_model = HuggingFaceEmbedding()

# loads Alibaba-NLP/gte-base-en-v1.5 (needs trust_remote_code=True for its custom model code)
embed_model = HuggingFaceEmbedding(model_name="Alibaba-NLP/gte-base-en-v1.5", trust_remote_code=True)

In [None]:
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
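# buffer_size: number of neighboring sentences grouped together when comparing similarity
# breakpoint_percentile_threshold: split where the distance between adjacent groups
# exceeds this percentile of all distances
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)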
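
The percentile threshold can be illustrated with a small standalone sketch. It assumes we already have the distance between each pair of adjacent sentences (the values below are made up) and splits wherever a distance exceeds the chosen percentile. The function names are hypothetical, and this is a simplified picture of the idea rather than LlamaIndex's actual code.

```python
def percentile(values, pct):
    """Linear-interpolation percentile, with pct in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_at_breakpoints(sentences, distances, pct=95):
    """Split sentences where distances[i] (between sentence i and i + 1) exceeds the percentile."""
    if len(distances) != len(sentences) - 1:
        raise ValueError("need exactly one distance per adjacent sentence pair")
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(current)  # large jump in meaning: close the chunk
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


# Made-up distances: only the jump between s2 and s3 stands out.
toy_sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
toy_distances = [0.10, 0.12, 0.80, 0.11, 0.09]
print(split_at_breakpoints(toy_sentences, toy_distances, pct=95))
```

A higher percentile yields fewer, larger chunks; a lower one splits more aggressively.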

In [None]:
nodes = splitter.get_nodes_from_documents(documents)

Inspecting the Chunks

In [None]:
len(nodes)

In [None]:
print(nodes[8].get_content())

In [None]:
from llama_index.core import VectorStoreIndex
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine()

In [None]:
response = query_engine.query(
    "Tell me 10 different new and unique findings that will help to start a new business in combating social isolation using AI"
)

In [None]:
print(str(response))

In [None]:
response = query_engine.query("What are the findings of study 3?")
print(str(response))
