In [1]:
import re
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.schema import TransformComponent
from llama_index.llms.openai import OpenAI
from llama_index.agent.openai import OpenAIAgent
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings
from llama_index.core.objects import ObjectIndex
from typing import Sequence
from llama_index.core.tools import BaseTool, FunctionTool
import os
import yaml
import json
import utils

In [2]:
print("Testing LLM...")
utils.test_llm(utils.get_api_key(name='openai'))

Testing LLM...
LLM test successful with model: gpt-4o-mini
Response: Hello! How can I assist you today?


True

In [3]:
print("\nTesting Embedding Model...")
utils.test_embeddings(utils.get_api_key(name='openai'))


Testing Embedding Model...
Embedding test successful!
Embedding shape: 3072


True

### LlamaIndex configuration

Models and embeddings

In [4]:
Settings.llm = OpenAI(api_key=utils.get_api_key(name='openai'), model="gpt-4o-mini", temperature=0.7, max_tokens=1024)
Settings.embed_model = OpenAIEmbedding(api_key=utils.get_api_key(name='openai'), model="text-embedding-3-large", embed_batch_size=10)

### Loading the RAG documents

In [5]:
book = SimpleDirectoryReader("../data/books").load_data()
len(book)
print(book[10])

Doc ID: a8b23985-3bcd-4eba-9954-07436a1de5c3
Text: What is Data Science? 3 •When you start with the question you
often discover that you need to collect new data or design an experi-
ment to confirm you are getting the right answer. •It is easy to
discover structure or networks in a data set. There will always be
correlations for a thousand reasons if you collect enough data.
Understanding wheth...


### Ingestion pipeline

* Splits the documents into chunks of at most 512 tokens, with a 40-token overlap between consecutive chunks
* The embeddings for these chunks are not computed by the pipeline itself, since no embedding model is listed among its transformations; they are generated later, when the index is built
* The result is a list of nodes
* The text of each node is cleaned
* `IngestionPipeline` only accepts transformations that inherit from `TransformComponent`
* Reference: https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/transformations/#custom-transformations
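The chunk size and overlap in the first bullet can be illustrated with a toy sliding window. This is only a sketch under simplifying assumptions: it treats each list element as one token and ignores sentence boundaries, whereas `SentenceSplitter` uses a real tokenizer and tries to keep sentences whole. The function name `chunk_with_overlap` is hypothetical and not part of LlamaIndex.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of at most chunk_size tokens.

    Each window starts chunk_size - chunk_overlap tokens after the previous
    one, so it repeats the last chunk_overlap tokens of its predecessor.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Toy input: 1200 placeholder "tokens" (made up for illustration)
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_with_overlap(tokens, chunk_size=512, chunk_overlap=40)
print([len(c) for c in chunks])  # [512, 512, 256]
```

With these settings, the last 40 tokens of each chunk reappear at the start of the next one, which helps preserve context that would otherwise be cut at a chunk boundary.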

In [6]:
class TextCleaner(TransformComponent):
    """
    A transformation component that cleans text by removing non-alphanumeric characters.

    This component takes in a list of nodes, removes any non-alphanumeric characters
    (except for spaces) from the text of each node, and returns the cleaned nodes.

    Args:
        nodes (list): A list of nodes to be cleaned.

    Returns:
        list: The list of cleaned nodes.
    """
    def __call__(self, nodes, **kwargs):
        for node in nodes:
            # Keep only ASCII letters, digits and spaces; punctuation and newlines are dropped
            node.text = re.sub(r"[^0-9A-Za-z ]", "", node.text)
        return nodes


pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=40),
        TextCleaner(),
    ],
)
nodes = pipeline.run(documents=book)
len(nodes)

157
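To make the effect of `TextCleaner` concrete, the same regular expression can be applied to a plain string outside the pipeline. The sample strings below are made up for illustration. The pattern drops newlines and non-ASCII letters along with punctuation, so words separated only by a line break end up joined, and accented text loses characters.

```python
import re

# Same pattern used inside TextCleaner
PATTERN = r"[^0-9A-Za-z ]"

sample = "What is Data Science? 3 •It's easy,\nreally."
cleaned = re.sub(PATTERN, "", sample)
print(cleaned)  # What is Data Science 3 Its easyreally

# Accented letters fall outside A-Za-z, so they are removed too
accented = re.sub(PATTERN, "", "análisis")
print(accented)  # anlisis
```

If joined words or lost accents turn out to hurt retrieval quality, one option would be to replace the removed characters with a space instead of an empty string, or to widen the character class; whether that is worthwhile depends on the documents being indexed.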

### Building the in-memory embedding index

* How the index is built is an internal mechanism of the library
* What matters is that this index is later used to find the chunks most relevant to the user's query by cosine similarity
* https://www.reddit.com/r/LocalLLaMA/comments/1bvo5l4/the_more_i_use_llamaindex_the_less_i_like_it/
* Reference: https://docs.llamaindex.ai/en/stable/understanding/loading/loading/
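The retrieval step mentioned above, ranking chunks by cosine similarity to the query, can be sketched in plain Python. This is not how the library implements it internally: the 3-dimensional vectors, the chunk ids, and the helper names `cosine_similarity` and `top_k` are all hypothetical, and real embeddings from the model used here have 3072 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(chunk_id, score) for score, chunk_id in scored[:k]]


# Made-up embeddings for three chunks and one query
chunk_vecs = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.6, 0.8, 0.0],
    "chunk_c": [0.0, 0.0, 1.0],
}
query_vec = [0.9, 0.1, 0.0]

results = top_k(query_vec, chunk_vecs, k=2)
print([chunk_id for chunk_id, _ in results])  # ['chunk_a', 'chunk_b']
```

The text of the top-ranked chunks is what gets passed to the LLM as context for answering the query.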

In [7]:
vector_index = VectorStoreIndex(
        nodes=nodes,
        show_progress=True,
    )

Generating embeddings: 100%|██████████| 157/157 [00:16<00:00,  9.50it/s]


### Validating the RAG with the library's built-in methods

In [13]:
query_engine = vector_index.as_query_engine()
response = query_engine.query(
    "which are the Specific stages of a data science project according to the book?"
)
print(response)

The specific stages of a data science project are as follows:

1. Question
2. Exploratory data analysis
3. Formal modeling
4. Interpretation
5. Communication


### Creating the chat

* Based on this reference: https://docs.llamaindex.ai/en/stable/examples/chat_engine/chat_engine_openai/

In [9]:
chat_engine = vector_index.as_chat_engine(chat_mode="openai", verbose=True)

In [10]:
response = chat_engine.chat("Hi")
print(response)

Added user message to memory: Hi
Hello! How can I assist you today?


In [14]:
response = chat_engine.chat(
    "which are the Specific stages of a data science project?"
)
print(response)

Added user message to memory: which are the Specific stages of a data science project?
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"What are the specific stages of a data science project?"}
Got output: The specific stages of a data science project are:

1. Question
2. Exploratory data analysis
3. Formal modeling
4. Interpretation
5. Communication

The specific stages of a data science project are:

1. **Question**: Defining the problem or question to be addressed.
2. **Exploratory Data Analysis**: Analyzing the data to understand its structure, patterns, and insights.
3. **Formal Modeling**: Building and validating models using statistical and machine learning techniques.
4. **Interpretation**: Interpreting the results of the models and understanding their implications.
5. **Communication**: Presenting the findings in a clear and actionable manner to stakeholders.


In [15]:
response = chat_engine.chat("thanks, and now can you translate it to spanish?")
print(response)

Added user message to memory: thanks, and now can you translate it to spanish?
Of course! Here’s the translation:

Las etapas específicas de un proyecto de ciencia de datos son:

1. **Pregunta**: Definir el problema o la pregunta que se va a abordar.
2. **Análisis exploratorio de datos**: Analizar los datos para entender su estructura, patrones e insights.
3. **Modelado formal**: Construir y validar modelos utilizando técnicas estadísticas y de aprendizaje automático.
4. **Interpretación**: Interpretar los resultados de los modelos y entender sus implicaciones.
5. **Comunicación**: Presentar los hallazgos de manera clara y accionable a las partes interesadas.
