<h2 style='text-align: center;'>RAG: Indexing</h2>

<p align='center'>
  <img src='./indexing_and_retrieval.png' width=480 style='border-radius: 14px;' />
</p>

<p align='justify'>&nbsp;&nbsp;&nbsp;In this notebook, I focus on the Indexing part of this deep dive into RAG.</p>

<br>

<h3>Setting Up</h3>

> Notes:
> - [Token counter](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb), assuming [~4 characters per token](https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them)
> - [Embedding models](https://python.langchain.com/docs/integrations/text_embedding/openai)
> - [Cosine similarity](https://platform.openai.com/docs/guides/embeddings/frequently-asked-questions) is the recommended metric for OpenAI embeddings (a value of 1 means the two embeddings are identical)
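
The ~4 characters per token rule from the notes can be sketched as a quick estimate. This is only a heuristic for English text, not a tokenizer; `estimate_tokens` and `chars_per_token` are hypothetical names introduced here for illustration.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
  '''Rough token estimate from the character count (heuristic only).'''
  # Round up so that any non-empty text counts as at least one token
  return math.ceil(len(text) / chars_per_token)

question = 'What kind of movies do I like?'
print(f'Estimated tokens: {estimate_tokens(question)}')
```

For exact counts, an actual tokenizer such as `tiktoken` is still needed; the estimate is mainly useful for sizing chunks or budgets before running one.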

In [1]:
import sys
sys.path.append('D:/Deep Learning/Diver')

from config.constants import *
from config.config import *
from config.utils import *

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_PROJECT'] = 'RAG'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = settings.LANGSMITH_API_KEY



<h3>Tokens Counter</h3>

In [2]:
question = 'What kind of movies do I like?'
document = 'Naomi Lago loves sci-fi movies, as well as thrillers and comedy.'

def tokens_counter(text: str, encoding_name: str) -> int:
  '''Returns the number of tokens in a given text.'''

  # Look up the tokenizer by name (e.g. 'cl100k_base') without shadowing the argument
  encoding = tiktoken.get_encoding(encoding_name)

  return len(encoding.encode(text))

print(f'Question tokens: {tokens_counter(question, "cl100k_base")}')

Question tokens: 8


<h3><a href='https://python.langchain.com/docs/integrations/text_embedding/openai'>Embeddings and Similarity</a></h3>

In [3]:
def similarity_calculator(vector_a: Sequence, vector_b: Sequence) -> float:
  '''Returns the calculated similarity between two vectors.'''

  dot_product = np.dot(vector_a, vector_b)
  norm_a = np.linalg.norm(vector_a)
  norm_b = np.linalg.norm(vector_b)
  
  return dot_product / (norm_a * norm_b)

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

query_result = embeddings.embed_query(question)
document_result = embeddings.embed_query(document)

similarity = similarity_calculator(query_result, document_result)

print(f'Cosine similarity: {similarity}')

Cosine similarity: 0.48588228298678326
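
To put a score like the one above in context, here is a minimal sketch of the reference points of cosine similarity, using toy 2-D vectors instead of real embeddings. The vectors and the helper name `cosine_similarity` are assumptions made for illustration; the formula is the same dot product over the product of norms.

```python
import math

def cosine_similarity(vector_a, vector_b):
  '''Cosine of the angle between two equal-length, non-zero vectors.'''
  dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
  norm_a = math.sqrt(sum(a * a for a in vector_a))
  norm_b = math.sqrt(sum(b * b for b in vector_b))
  return dot_product / (norm_a * norm_b)

# Same direction, different magnitude: close to 1
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite directions: close to -1
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, not magnitude, a value near 0.49 suggests the question and the document are related but far from identical in meaning.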


<h3><a href='https://python.langchain.com/docs/integrations/document_loaders/'>Document Loaders</a></h3>

In [4]:
loader = WebBaseLoader(
  web_paths=[
    'https://naomilago.com/',
    'https://naomilago.com/about.html',
    'https://naomilago.com/posts/so-what-is-data-science/',
    'https://naomilago.com/posts/what-is-natural-language-processing/',
    'https://naomilago.com/posts/getting-geographic-coordinates-with-python-and-google/',
    'https://naomilago.com/posts/perceptron-fundamentals-and-applications/',
    'https://naomilago.com/posts/information-retrieval-with-vector-search/',
    'https://naomilago.com/posts/image-classification-with-deep-learning-made-easy/',
    'https://naomilago.com/posts/vector-search-with-facebook-ai/',
    'https://naomilago.com/posts/recognizing-handwriting-digits/',
    'https://naomilago.com/posts/preprocessing-unstructured-data/',
    'https://github.com/naomilago/',
    'https://github.com/naomilago?tab=repositories'
  ], 
)

documents = loader.load()
documents

[Document(page_content='\n\n\n\n\n\nNaomi Lago - Data dives and beyond\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNaomi Lago\n\n\n\n\n\n\n\n\n\n\nAbout\n\n\n \n\nGithub\n\n\n \n\nLinkedIn\n\n\n \n\n\n\n\n \n\n\n\n \n\n\n\n\n\n\nData dives and beyond\n\n\nWelcome to my personal blog. Here I’ll share my learning notes and some great resources. Join me as we journey through this fascinating realm, uncovering valuable resources together.\n        \n\n\n\n\n\n\n\n\n\nCategoriesAll (9)COMPUTER VISION (1)DATA SCIENCE (5)DEEP LEARNING (3)GEOLOCATION (1)NLP (3)VECTOR SEARCH (2)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPreprocessing Unstructured Data\n\n\n\n\n\n\nDATA SCIENCE\n\n\nNLP\n\n\n\nReady for another dive? Today we’ll be exploring a vital component in building today’s powerful LLMs, playing a significant role in RAG systems.\n\n\n\n\n\nMay 1, 2024\n\n\nNaomi Lago\n\n\n\n\n\n\n\n\n\n\n\n\nRecognizing Handwriting Digits\n\n\n\n\n\n\nDEEP

<h3><a href='https://python.langchain.com/docs/modules/data_connection/document_transformers/recursive_text_splitter'>Splitter</a></h3>

> - This text splitter is the one recommended for generic text. It is parameterized by a list of characters and tries to split on them, in order, until the chunks are small enough. The default list is `["\n\n", "\n", " ", ""]`. The effect is to keep paragraphs (then sentences, then words) together for as long as possible, since those are generally the most strongly semantically related pieces of text.
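
The idea described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it measures chunk size in characters rather than tokens, ignores overlap, and `recursive_split` and `max_len` are hypothetical names.

```python
def recursive_split(text, separators=('\n\n', '\n', ' ', ''), max_len=40):
  '''Split on the first separator; recurse with the next one on oversized pieces.'''
  if len(text) <= max_len:
    return [text] if text else []

  # No usable separator left: fall back to a hard cut every max_len characters
  if not separators or separators[0] == '':
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

  separator, remaining = separators[0], separators[1:]
  chunks, current = [], ''

  for piece in text.split(separator):
    candidate = piece if not current else current + separator + piece
    if len(candidate) <= max_len:
      # Keep merging neighbouring pieces while they still fit in one chunk
      current = candidate
      continue
    if current:
      chunks.append(current)
      current = ''
    if len(piece) <= max_len:
      current = piece
    else:
      # The piece alone is too long: try the next, finer-grained separator
      chunks.extend(recursive_split(piece, remaining, max_len))

  if current:
    chunks.append(current)
  return chunks

text = 'First paragraph here.\n\nSecond paragraph is here too.'
print(recursive_split(text, max_len=30))
```

With `max_len=30`, the two paragraphs do not fit together, so the sketch splits at the blank line and keeps each paragraph intact rather than cutting mid-sentence.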

In [5]:
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
  chunk_size=300,
  chunk_overlap=50
)

splits = text_splitter.split_documents(documents)
splits

[Document(page_content='Naomi Lago - Data dives and beyond\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNaomi Lago\n\n\n\n\n\n\n\n\n\n\nAbout\n\n\n \n\nGithub\n\n\n \n\nLinkedIn\n\n\n \n\n\n\n\n \n\n\n\n \n\n\n\n\n\n\nData dives and beyond\n\n\nWelcome to my personal blog. Here I’ll share my learning notes and some great resources. Join me as we journey through this fascinating realm, uncovering valuable resources together.\n        \n\n\n\n\n\n\n\n\n\nCategoriesAll (9)COMPUTER VISION (1)DATA SCIENCE (5)DEEP LEARNING (3)GEOLOCATION (1)NLP (3)VECTOR SEARCH (2)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPreprocessing Unstructured Data\n\n\n\n\n\n\nDATA SCIENCE\n\n\nNLP\n\n\n\nReady for another dive? Today we’ll be exploring a vital component in building today’s powerful LLMs, playing a significant role in RAG systems.\n\n\n\n\n\nMay 1, 2024\n\n\nNaomi Lago\n\n\n\n\n\n\n\n\n\n\n\n\nRecognizing Handwriting Digits\n\n\n\n\n\n\nDEEP LEARNING\n\

<h3><a href='https://python.langchain.com/docs/integrations/vectorstores/'>Vector Store</a></h3>


In [6]:
vector_store = Chroma.from_documents(
  documents=splits,
  embedding=embeddings,
  persist_directory='../data/vector_store'
)

vector_store

<langchain_community.vectorstores.chroma.Chroma at 0x21aed7b3b20>