In [1]:
import re
import openai
import numpy as np
from llama_index.core import Document
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.schema import TransformComponent
from llama_index.llms.openai import OpenAI
from llama_index.agent.openai import OpenAIAgent
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings, StorageContext
from llama_index.core.objects import ObjectIndex
from typing import Sequence
from llama_index.core.tools import BaseTool, FunctionTool
import pickle
import os
import yaml
import json
import utils

In [2]:
# Configure the default LLM and embedding model used by LlamaIndex
Settings.llm = OpenAI(api_key=utils.get_api_key(name="openai"), model="gpt-4o-mini", temperature=0.5, max_tokens=1024)
Settings.embed_model = OpenAIEmbedding(api_key=utils.get_api_key(name="openai"), model="text-embedding-3-large", embed_batch_size=10, dimensions=1024)

# Load every document found in the books folder
book = SimpleDirectoryReader("../data/books").load_data()


class TextCleaner(TransformComponent):
    """Remove every character that is not an ASCII letter, digit, or space."""

    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = re.sub(r"[^0-9A-Za-z ]", "", node.text)
        return nodes


# Split the documents into overlapping chunks, then clean each chunk
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=1000, chunk_overlap=120),
        TextCleaner(),
    ],
)
nodes = pipeline.run(documents=book)

# Embed the nodes and build the vector index
vector_index = VectorStoreIndex(
    nodes=nodes,
    show_progress=True,
)

# Persist the vector index through its storage context
vector_index.storage_context.persist("../data/models/")

Generating embeddings: 100%|██████████| 153/153 [00:10<00:00, 14.40it/s]
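
The `TextCleaner` step is aggressive, so it can help to see what its pattern does to a piece of text before trusting the indexed chunks. The sketch below applies the same regular expression with the standard library only; `clean_text` and the sample string are hypothetical names introduced for illustration, not part of the notebook.

```python
import re


def clean_text(text: str) -> str:
    # Same pattern as TextCleaner: keep only ASCII letters, digits, and spaces.
    return re.sub(r"[^0-9A-Za-z ]", "", text)


# Hypothetical sample with an accent, punctuation, a newline, and hyphens.
sample = "Análisis exploratorio: 5 stages,\nstep-by-step!"
cleaned = clean_text(sample)
print(cleaned)
```

Accented letters, punctuation, hyphens, and newlines are all dropped rather than replaced, so words on either side of a line break end up joined. That may matter for non-English books or for text where punctuation carries meaning.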


### Creating the Chat

* Now we check that the chat engine really "has memory", that is, that a follow-up question can rely on earlier turns.
* Reference used: https://docs.llamaindex.ai/en/stable/examples/chat_engine/chat_engine_openai/
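
Since the index was persisted to `../data/models/` earlier, a restarted session would not need to re-embed the books before building the chat engine. The following is a minimal sketch of reloading it, assuming `Settings.embed_model` is configured the same way as when the index was built; `load_persisted_index` is a hypothetical helper name, and the import is kept inside the function so the definition itself does not require the library.

```python
def load_persisted_index(persist_dir: str = "../data/models/"):
    """Reload a previously persisted index (hypothetical helper, not in the notebook)."""
    # Imported lazily so that defining the helper has no side effects.
    from llama_index.core import StorageContext, load_index_from_storage

    # Point a storage context at the folder written by storage_context.persist(...)
    storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
    # Rebuild the index object from the stored docstore, index store, and vectors
    return load_index_from_storage(storage_context)
```

Usage would look like `vector_index = load_persisted_index()` followed by the same `as_chat_engine(...)` call used in the next cell.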

In [3]:
chat_engine = vector_index.as_chat_engine(chat_mode="openai", verbose=True)
response = chat_engine.chat("which are the Specific stages of a data science project?")
print(response)

Added user message to memory: which are the Specific stages of a data science project?
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"What are the specific stages of a data science project?"}
Got output: The specific stages of a data science project are:

1. Question
2. Exploratory data analysis
3. Formal modeling
4. Interpretation
5. Communication

The specific stages of a data science project are:

1. **Question**: Define the problem or question you want to address.
2. **Exploratory Data Analysis (EDA)**: Analyze and visualize the data to understand its structure, patterns, and relationships.
3. **Formal Modeling**: Develop models using statistical or machine learning techniques to make predictions or gain insights.
4. **Interpretation**: Analyze the results of the models and understand their implications in the context of the original question.
5. **Communication**: Present the findings and insights to stakeholders in a clear and actionable manner.

In [4]:
response = chat_engine.chat("thanks, and now can you translate it to spanish?")
print(response)

Added user message to memory: thanks, and now can you translate it to spanish?
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"Translate the specific stages of a data science project to Spanish."}
Got output: Las cinco fases de un proyecto de ciencia de datos son:

1. Pregunta
2. Análisis exploratorio de datos
3. Modelado formal
4. Interpretación
5. Comunicación

Las etapas específicas de un proyecto de ciencia de datos son:

1. **Pregunta**
2. **Análisis exploratorio de datos**
3. **Modelado formal**
4. **Interpretación**
5. **Comunicación**
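
The follow-up "translate it" only makes sense because the earlier turns are sent along with the new message. The toy sketch below illustrates that idea with a plain list of messages; it is not LlamaIndex's implementation, and `ToyChatMemory` is a hypothetical class introduced only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyChatMemory:
    """Toy stand-in for a chat memory buffer (illustrative only)."""

    messages: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        # Every turn is appended, so later turns can refer back to earlier ones.
        self.messages.append({"role": role, "content": content})

    def as_prompt(self) -> str:
        # The whole history is flattened into the context given to the model.
        return "\n".join(f"{m['role']}: {m['content']}" for m in self.messages)


memory = ToyChatMemory()
memory.add("user", "which are the Specific stages of a data science project?")
memory.add("assistant", "Question, Exploratory data analysis, Formal modeling, Interpretation, Communication")
memory.add("user", "thanks, and now can you translate it to spanish?")
print(memory.as_prompt())
```

Because the assistant's earlier answer is still in the prompt, "it" in the last message can be resolved to the list of stages. In LlamaIndex itself, the engine's `chat_history` property and `reset()` method should expose and clear the stored turns, though that is worth confirming against the installed version.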


In [5]:
response = chat_engine.chat("give me a list of roles in a data science team?")
print(response)

Added user message to memory: give me a list of roles in a data science team?
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"What are the roles in a data science team?"}
Got output: A data science team typically includes data scientists, data engineers, and a data science manager. Data scientists focus on solving individual projects or subproblems, while data engineers develop and maintain the infrastructure necessary for data science work. The data science manager is responsible for building the team, recruiting members, setting goals, and prioritizing projects. Additionally, team members often interact with external stakeholders to communicate about the data science infrastructure and ongoing projects.

A data science team typically includes the following roles:

1. **Data Scientist**: Focuses on solving individual projects or subproblems using statistical and machine learning techniques.
2. **Data Engineer**: Develops and maintains the infrastructu

In [6]:
response = chat_engine.chat("how can i be a good data scientist?")
print(response)

Added user message to memory: how can i be a good data scientist?
=== Calling Function ===
Calling function: query_engine_tool with args: {"input":"What are the tips to become a good data scientist?"}
Got output: To become a good data scientist, it is important to develop a range of skills and qualities. Here are some tips:

1. **Master Data Analysis**: Learn how to extract, clean, and analyze data sets effectively. Familiarize yourself with statistical inference and prediction techniques.

2. **Enhance Communication Skills**: Be able to communicate your findings clearly. This includes creating visualizations that effectively convey the results and the level of uncertainty involved.

3. **Learn Programming Languages**: Gain proficiency in languages commonly used in data science, such as R or Python, and understand SQL for database interactions.

4. **Understand Visualization Tools**: Familiarize yourself with visualization tools, including interactive options like D3.js, to present dat