# CodingMindset Lab - LangChain - RAG 102

## Prerequisites

1. Install the required dependencies

In [None]:
#!pip install -qU langchain_experimental langchain_openai langchain_community langchain ragas faiss-cpu tiktoken

2. Set the OpenAI API key, if it has not been set already

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

## Downloading the knowledge base

> NOTE: Any text or PDF document at hand will work. Otherwise, Gutenberg.org offers freely available books for download.

- Download a book about how the brain works in txt format and save it as `the_brain.txt`

In [None]:
!wget https://gutenberg.org/cache/epub/14586/pg14586.txt -O the_brain.txt

In [None]:
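# Read the whole book into memory; the encoding is stated explicitly so the
# result does not depend on the platform default.
with open("./the_brain.txt", encoding="utf-8") as f:
  the_brain = f.read()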

## Chunking Strategies

### Recursive Splitting

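Also known as "naive chunking", this is usually the default strategy in most LangChain tutorials and in the "Naive RAG" approach.<br> It relies on syntactic rules: the text is split on a prioritized list of separators (paragraph breaks, line breaks, spaces) until each piece fits within `chunk_size`.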
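To make the idea concrete, here is a minimal pure-Python sketch of recursive splitting. It is not LangChain's implementation: `recursive_split` is a hypothetical helper that ignores `chunk_overlap` and drops the separators at chunk boundaries. The default separator order mirrors the one `RecursiveCharacterTextSplitter` uses by default.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators first and falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts into one chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg\n\nhhh"
for piece in recursive_split(sample, chunk_size=10):
    print(repr(piece))
```

Note that the split points depend only on characters and lengths, never on meaning, which is why a chunk can end in the middle of an argument.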


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,
    chunk_overlap=0,
    length_function=len,
)

In [None]:
naive_chunks = text_splitter.split_text(the_brain)

In [None]:
for chunk in naive_chunks[40:55]:
  print(chunk + "\n")

We already used the term "external reality" which is not defined yet. This
fundamental term is considered as a source of information, which is not
localized in the structure of models of the brain. I want to emphasize that
the external reality is not a source of information, but is just considered so
by any brain.

Thus, one of the main hardware functions of the brain is to make models of the
external reality and to predict, by simulation on the model, the possible
evolution of the associated external reality.

We already defined the reality as all the information which is or could be
generated by a model. This means that we understand the external reality by
the reality, which is generated by a model, which is associated with the
external reality.

Example: For a given external reality, any person makes an associated model.
Any person has his/her own model associated to the same external reality. We
think and act based on our own reality and not based directly on the external
reality.

### Semantic Chunk

Semantic chunking splits the text into cohesive, meaningful segments based on semantic content. Each chunk is meant to represent a single coherent topic or idea, which improves relevance and context during retrieval and generation.

1. Initialize the class, passing in the embedding model to use

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

semantic_chunker = SemanticChunker(OpenAIEmbeddings(model="text-embedding-3-large"), breakpoint_threshold_type="percentile")


The second parameter, `breakpoint_threshold_type`, offers three common methods for semantic chunking, according to the [LangChain documentation](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker):

- **Percentile method (`percentile`)**: All distances between consecutive sentences are computed, and any distance above the X-th percentile becomes a split point.
- **Standard deviation method (`standard_deviation`)**: Splits wherever the distance exceeds X standard deviations.
- **Interquartile method (`interquartile`)**: Uses the interquartile range to determine the split points.
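The three rules can be sketched with the standard library alone. This is an illustration, not the library's code: `percentile` and `breakpoint_threshold` are hypothetical helpers, the formulas (mean plus a multiple of the spread for the last two methods) are an assumption about how the thresholds are derived, and the `amount` values in the usage lines are example settings.

```python
import statistics


def percentile(values, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def breakpoint_threshold(distances, method, amount):
    """Distance above which two neighbouring sentence groups are split apart."""
    if method == "percentile":
        return percentile(distances, amount)
    if method == "standard_deviation":
        # Population standard deviation, scaled by `amount`, above the mean.
        return statistics.mean(distances) + amount * statistics.pstdev(distances)
    if method == "interquartile":
        iqr = percentile(distances, 75) - percentile(distances, 25)
        return statistics.mean(distances) + amount * iqr
    raise ValueError(f"unknown method: {method}")


# Made-up distances between consecutive sentence groups.
distances = [0.10, 0.20, 0.10, 0.90, 0.15, 0.20, 0.10, 0.80, 0.10, 0.20]
for method, amount in [("percentile", 95), ("standard_deviation", 3), ("interquartile", 1.5)]:
    threshold = breakpoint_threshold(distances, method, amount)
    split_after = [i for i, d in enumerate(distances) if d > threshold]
    print(method, round(threshold, 3), split_after)
```

A higher threshold yields fewer, longer chunks, so the choice of method and amount directly controls chunk granularity.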

#### Basic steps:

1. **Document splitting:** Breaks the text into sentences based on terminal punctuation such as `.`, `?`, and `!`.
2. **Sentence indexing:** Assigns each sentence an index according to its position in the text.
3. **Buffering:** Adds `buffer_size` (int) neighboring sentences around each sentence to give it context.
4. **Distance calculation:** Measures the semantic distance between consecutive sentence groups.
5. **Similarity-based merging:** Merges sentence groups that are similar, using the thresholds set by the chunking methods above.
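The five steps above can be sketched end to end without calling any API. Everything here is illustrative: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, `semantic_chunks_sketch` is not the `SemanticChunker` implementation, and the threshold is passed in directly rather than derived from the distances.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: word-count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks_sketch(text, threshold, buffer_size=1):
    # Steps 1-2: split into sentences; the list position is the index.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if not sentences:
        return []
    # Step 3: join each sentence with buffer_size neighbours on each side.
    windows = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    # Step 4: distance between each window and the one that follows it.
    vectors = [toy_embed(w) for w in windows]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    # Step 5: keep merging sentences until a distance exceeds the threshold.
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


text = "Cats purr. Cats sleep. Stocks fell. Stocks rose."
print(semantic_chunks_sketch(text, threshold=0.9, buffer_size=0))
```

With a real embedding model the distances reflect meaning rather than shared words, but the control flow stays the same.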

> **Note:** This method is experimental and may receive improvements and updates in the near future.


2. Create the chunks

In [None]:
semantic_chunks = semantic_chunker.create_documents([the_brain])

In [None]:
for semantic_chunk in semantic_chunks:
  if "MDT is associated with the basic" in semantic_chunk.page_content:
    print(semantic_chunk.page_content)
    print(len(semantic_chunk.page_content))

So, the only guarantee of a "correct" prediction is the confidence
in that structure of models. MDT is associated with the basic hardware functions of the brain. Once we
described the hardware structure, everything what the MDT predicts is based on
what the hardware is able to do. What MDT says about knowledge is not another
theory on knowledge but what the hardware is able to do. Any experiment is based on a model.
419


## RAG APP

We will build a RAG app with LangChain and LCEL, using our new chunking strategy.

### Retrieval

This time we will use **FAISS (Facebook AI Similarity Search)**, a library optimized for fast search over large collections of vectors held in memory. Its speed and scalability make it well suited to information retrieval and recommendation systems.
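Conceptually, a vector store answers one question: which stored vectors are closest to the query vector? The sketch below does this by exhaustive search in plain Python. `top_k` is a hypothetical helper, and the use of Euclidean distance is an assumption about the default index type; FAISS performs the same kind of search with optimized native code and optional approximate indexes.

```python
import math


def top_k(query, vectors, k=1):
    """Indices of the k stored vectors closest to the query (Euclidean distance)."""
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return ranked[:k]


# Made-up 2-dimensional "embeddings" for three chunks.
stored = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(top_k([0.9, 1.2], stored, k=2))
```

The indices returned map back to the original chunks, which is what the retriever hands to the prompt.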

In [None]:
from langchain_community.vectorstores import FAISS

semantic_chunk_vectorstore = FAISS.from_documents(semantic_chunks, embedding=OpenAIEmbeddings(model="text-embedding-3-large"))

We limit the retriever built on `semantic_chunk_vectorstore` to k = 1 to demonstrate the strength of the semantic chunking strategy, while keeping the token count similar between the semantically retrieved context and the naively retrieved context.

In [None]:
semantic_chunk_retriever = semantic_chunk_vectorstore.as_retriever(search_kwargs={"k" : 1})

In [None]:
semantic_chunk_retriever.invoke("what is MDT?")

[Document(page_content="generated by the model). They have only descriptive definitions. Once the fundamental terms are introduced by description, all the other terms\nhave normal definitions, which are generated by the symbolic model, by logical\nand mathematical operations. Let's see the definitions of the terms used by the MDT theory. Model: this is a term used on large scale in science and technology. The MDT\ntheory accepts the definition used there. A model means some fundamental elements and some fundamental relations between\nthe elements. The elements could be of any type (physical objects, the representation of any\nobject in any form, including pictures of any type or images of any type or\nmathematical symbols of any type and so on). In fact, an element could be\nassociated with anything which can be considered as an entity. The elements\nhave some properties, which must be specified somehow. There are a number of\nrelations between the elements, which must also be specifie

### Augmented

- We pull the prompt from the LangChain Hub for convenience, but we could create our own prompt as follows:
````python
from langchain_core.prompts import ChatPromptTemplate

rag_template = """\
Use the following context to answer the user's query. If you cannot answer, please respond with 'I don't know'.

User's Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(rag_template)

````

In [None]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

### Generation

- We use `ChatOpenAI()` to keep the example simple

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI()

### LCEL RAG Chain

In [None]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

semantic_rag_chain = (
    {"context" : semantic_chunk_retriever, "question" : RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
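To see what the `|` operator and the leading dict are doing, here is a toy re-creation in plain Python. `Step` and `fan_out` are hypothetical stand-ins written for this illustration, not LangChain classes, and the fake retriever, prompt, and model only build strings.

```python
class Step:
    """Tiny stand-in for an LCEL runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # a | b builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


def fan_out(**branches):
    """Mimic the dict at the head of the chain: every branch gets the same input."""
    return Step(lambda value: {name: step.invoke(value) for name, step in branches.items()})


fake_retriever = Step(lambda question: "(chunk for: " + question + ")")
passthrough = Step(lambda question: question)
fake_prompt = Step(lambda d: "Context: {context}\nQuestion: {question}".format(**d))
fake_llm = Step(lambda prompt_text: "ANSWER BASED ON -> " + prompt_text)

toy_chain = fan_out(context=fake_retriever, question=passthrough) | fake_prompt | fake_llm
print(toy_chain.invoke("what is MDT?"))
```

The real chain follows the same shape: the question is sent both to the retriever and straight through, the prompt receives both values, and each later stage consumes the previous stage's output.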

In [None]:
semantic_rag_chain.invoke("what is MDT?")

'MDT stands for Model-Driven Theory. This theory accepts the definition of a model as fundamental elements and relations between them, which could be of any type. The elements in MDT have properties that must be specified, and there are relations between the elements that must also be specified.'