### Using the OpenAI API

Check that the environment variables are set for either OpenAI or Gemini:


```bash
export OPENAI_API_KEY=<YOUR_API_KEY_HERE>
# or
export GEMINI_API_KEY=<YOUR_API_KEY_HERE>
```
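
Before creating a client, it can help to confirm from Python that the key is visible to the notebook process. This is a minimal sketch; the helper name `missing_keys` is made up for this example and is not part of either SDK.

```python
import os

def missing_keys(names, env=None):
    """Return the variable names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Only one provider is needed, so check the one you plan to use.
print(missing_keys(["OPENAI_API_KEY"]))
```

An empty list means the variable is set; a non-empty list names what still has to be exported.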

**Import the module**

Import the official OpenAI API client.

In [1]:
from openai import OpenAI

**Create a client to use the OpenAI models**

Create a client instance to interact with the models. The client reads your API key automatically from the `OPENAI_API_KEY` environment variable.

In [2]:
client = OpenAI()

**Request a text response from the gpt-4o-mini model**

- Calls the chat completions endpoint (`chat.completions.create`).
- Uses the `gpt-4o-mini` model, a smaller, lower-cost variant of GPT-4o.
- The `store=True` option asks OpenAI to store the completion on its side for later review or analysis (subject to its usage policies).
- Sends a single message with the "user" role containing a specific instruction.

In [3]:
completion = client.chat.completions.create(
  model="gpt-4o-mini",
  store=True,
  messages=[
    {"role": "user", "content": "Escribe un cuento de una sola oración sobre un unicornio para dormir."}
  ]
)

**Print the text generated by the model**

Print the model's response by taking the first returned result (`choices[0]`) and reading its content (`message.content`).

In [4]:
print(completion.choices[0].message.content)

En la calma de la noche estrellada, un suave unicornio con crines de arcoíris danzaba entre los sueños de los niños, llevando consigo la promesa de aventuras mágicas y un mundo lleno de esperanza y amor.
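
The attribute path `choices[0].message.content` can be tried out without calling the API. The sketch below builds a stand-in object with the same three attributes; `fake_completion` and `first_text` are hypothetical names, and a real response carries many more fields than shown here.

```python
from types import SimpleNamespace

def first_text(completion):
    """Return the text of the first choice, or None if there are no choices."""
    if not completion.choices:
        return None
    return completion.choices[0].message.content

# Stand-in with the same attribute path as a chat completion response.
fake_completion = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Once upon a time..."))]
)

print(first_text(fake_completion))
```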


### Using the Gemini API

**Import the module**
- This line imports the `genai` module, part of Google's SDK for working with generative AI models such as Gemini.
- The SDK provides a simple way to send requests to Google's LLMs.

In [5]:
from google import genai

**Create a client to use the Gemini models**

Create a `client` object from the `Client` class. It needs an API key to authenticate with Google's AI services; since no `api_key` argument is passed here, the SDK is expected to pick the key up from the environment variable set earlier.

In [6]:
client = genai.Client()

**Request a text response from the gemini-2.0-flash model**

- The input is the message "Explica cómo funciona la IA en pocas palabras" ("Explain how AI works in a few words").
- The `generate_content()` method processes this input and returns a response generated by the model.

In [7]:
response = client.models.generate_content(
    model="gemini-2.0-flash", 
    contents="Explica cómo funciona la IA en pocas palabras"
)

**Print the text generated by the model**
- This line prints the generated response to the console.
- The content comes from `response.text`, which holds the model's main response.

In [8]:
print(response.text)

La IA funciona enseñando a las computadoras a aprender patrones de datos. Luego, usan esos patrones para hacer predicciones, tomar decisiones o resolver problemas, sin necesidad de ser programadas explícitamente para cada tarea.



### Building our knowledge base with minsearch

In [9]:
import minsearch
import json

In [10]:
! mkdir doc

mkdir: cannot create directory ‘doc’: File exists


In [11]:
! wget -P doc/ https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/refs/heads/main/01-intro/documents.json

--2025-06-15 00:34:23--  https://raw.githubusercontent.com/DataTalksClub/llm-zoomcamp/refs/heads/main/01-intro/documents.json
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 658332 (643K) [text/plain]
Saving to: ‘doc/documents.json.2’


2025-06-15 00:34:23 (6.86 MB/s) - ‘doc/documents.json.2’ saved [658332/658332]



In [12]:
with open('doc/documents.json', 'rt') as f_in:
    docs_raw = json.load(f_in)

In [13]:
len(docs_raw)

3

In [14]:
for docs in docs_raw:
    print(docs.keys())

dict_keys(['course', 'documents'])
dict_keys(['course', 'documents'])
dict_keys(['course', 'documents'])


In [15]:
documents = []

for course_dict in docs_raw:
    for doc in course_dict['documents']:
        doc['course'] = course_dict['course']
        documents.append(doc)

In [16]:
documents[:3]

[{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
  'section': 'General course-related questions',
  'question': 'Course - When will the course start?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites',
  'section': 'General course-related questions',
  'question': 'Course - What are the prerequisites for this course?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines 

In [17]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

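The `keyword_fields` are used for exact-match filtering, much like a `WHERE` clause in SQL (the table name `documents` is only illustrative):

```sql
SELECT * FROM documents WHERE course = 'data-engineering-zoomcamp';
```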
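
The same exact-match idea can be written as plain Python over a list of dictionaries. This is a toy sketch with made-up documents and a hypothetical `filter_docs` helper; it shows the concept and is not minsearch's implementation.

```python
def filter_docs(docs, filter_dict):
    """Keep documents whose fields equal every value in `filter_dict`."""
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in filter_dict.items())
    ]

# Made-up documents, only for illustration.
toy_docs = [
    {"course": "data-engineering-zoomcamp", "question": "When does it start?"},
    {"course": "mlops-zoomcamp", "question": "Which tools are used?"},
]
selected = filter_docs(toy_docs, {"course": "data-engineering-zoomcamp"})
print(selected)
```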

In [18]:
index.fit(documents)

<minsearch.minsearch.Index at 0x7fdb6e54f4d0>

In [19]:
query = 'the course has already started, can I still enroll?'

In [20]:
filter_config = {'course': 'data-engineering-zoomcamp'}

boost_config = {'question': 3.0, 'section': 0.5}

n_result = 2

results = index.search(
    query=query,
    filter_dict=filter_config,
    boost_dict=boost_config,
    num_results=n_result
)

results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'}]
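
The effect of `boost_dict` can be pictured as a weighted sum of per-field scores, where fields without an entry get a weight of 1.0. The sketch below uses a naive word-overlap count as a stand-in for the real relevance score (minsearch computes TF-IDF cosine similarity per field), so the function names and the numbers are illustrative only.

```python
def overlap(query, text):
    """Naive relevance: number of distinct query words that appear in the text."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def boosted_score(query, doc, text_fields, boost_dict):
    """Sum per-field scores, each multiplied by its boost (default 1.0)."""
    return sum(
        overlap(query, doc.get(field, "")) * boost_dict.get(field, 1.0)
        for field in text_fields
    )

doc = {"question": "can I still enroll", "text": "yes you can", "section": "general"}
score = boosted_score(
    "can I enroll",
    doc,
    ["question", "text", "section"],
    {"question": 3.0, "section": 0.5},
)
print(score)
```

With a boost of 3.0, a match in `question` counts three times as much as the same match in `text`, which is why FAQ entries whose question resembles the query rank first.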

If we ask the LLM about the course without giving it context from our knowledge base, it returns a generic, noncommittal answer:

In [21]:
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": query}]
)

print(response.choices[0].message.content)

Whether you can still enroll in a course that has already started depends on the institution or platform offering the course. Many educational institutions have specific policies regarding late enrollment. Here are some steps you can take:

1. **Check Enrollment Policies**: Visit the course or institution's website to find information on late enrollment.

2. **Contact the Instructor or Administration**: Reach out directly to the course instructor or the admissions office. They can provide the most accurate information regarding late enrollment options.

3. **Ask About Any Alternatives**: If late enrollment is not allowed, inquire if there are similar courses offered in the future or if there are resources available for catching up on missed content.

4. **Look for Online Options**: If the course is available online, some platforms allow you to enroll and start learning from recorded lectures or materials even after the live class has begun.

Being proactive in reaching out will give yo

**Using the knowledge base to improve the LLM's response**

Build the prompt to send to the LLM:

In [22]:
prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()

context = ""

for doc in results:
    context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

prompt = prompt_template.format(question=query, context=context).strip()

print(prompt)

You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: the course has already started, can I still enroll?

CONTEXT: 
section: General course-related questions
question: Course - Can I still join the course after the start date?
answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

section: General course-related questions
question: Course - Can I follow the course after it finishes?
answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.


Send the query to the LLM:

In [23]:
client = OpenAI()

response_rag = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}]
)

print(response_rag.choices[0].message.content)

Yes, you can still enroll in the course after the start date. Even if you don't register, you're still eligible to submit the homework. Just keep in mind that there will be deadlines for turning in the final projects, so it's best not to leave everything for the last minute.


### Cleaning up the code and building the RAG flow

Function that searches the knowledge base:

In [24]:
def search(query):
    filter_config = {'course': 'data-engineering-zoomcamp'}
    boost_config = {'question': 3.0, 'section': 0.5}
    n_result = 5

    results = index.search(
        query=query,
        filter_dict=filter_config,
        boost_dict=boost_config,
        num_results=n_result
    )

    return results

Function that builds the prompt (the user's query plus the relevant entries from the knowledge base):

In [25]:
def build_prompt(query, search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT: 
{context}
""".strip()

    context = ""
    
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"
    
    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

Function that asks the LLM for a completion based on the given prompt:

In [26]:
def open_ai_llm(prompt):
    client = OpenAI()
    
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )
    
    return response.choices[0].message.content

Basic flow showing how a RAG pipeline works:

In [27]:
def rag(query):
    search_results = search(query)
    prompt = build_prompt(query, search_results)
    answer = open_ai_llm(prompt)
    return answer
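
Because the flow is just three function calls in sequence, each step can be swapped out. The sketch below is one possible way to do that, assuming a hypothetical `make_rag` factory and stub functions that stand in for the search index and the LLM, so the flow can be exercised without an API key.

```python
def make_rag(search_fn, prompt_fn, llm_fn):
    """Build a rag(query) function from three interchangeable steps."""
    def rag(query):
        search_results = search_fn(query)
        prompt = prompt_fn(query, search_results)
        return llm_fn(prompt)
    return rag

# Stubs that stand in for the search index and the LLM.
def fake_search(query):
    return [{"text": "Yes, you can still join."}]

def fake_prompt(query, results):
    context = "\n".join(doc["text"] for doc in results)
    return f"QUESTION: {query}\nCONTEXT: {context}"

def fake_llm(prompt):
    # A real LLM call would go here; the stub just upper-cases the prompt.
    return prompt.upper()

rag_stub = make_rag(fake_search, fake_prompt, fake_llm)
print(rag_stub("can I enroll?"))
```

With the notebook's own functions, the equivalent call would be `make_rag(search, build_prompt, open_ai_llm)`.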

Final result: these functions and the flow above can easily be moved into a Python script for use in pipelines or AI applications.

In [28]:
query = 'how do I run kafka?'

result_example_1 = rag(query)

print(result_example_1)

To run Kafka, you can execute the following command in your project directory for a Java producer:

```bash
java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java
```

For Python, ensure that you create a virtual environment and run the necessary requirements. Here are the steps:

1. Create a virtual environment (run only once):
   ```bash
   python -m venv env
   source env/bin/activate  # for MacOS/Linux
   # or
   env\Scripts\activate  # for Windows
   ```
2. Install the required packages:
   ```bash
   pip install -r ../requirements.txt
   ```

3. Whenever you need to use the virtual environment, activate it with:
   ```bash
   source env/bin/activate  # for MacOS/Linux
   # or
   env\Scripts\activate  # for Windows
   ```

4. To deactivate the virtual environment, run:
   ```bash
   deactivate
   ``` 

Make sure Docker images are up and running if you're using Docker.


In [29]:
result_example_2 = rag('the course has already started, can I still enroll?')

print(result_example_2)

Yes, you can still enroll in the course even though it has already started. You are eligible to submit homework assignments, but keep in mind that there will be deadlines for turning in final projects, so be sure not to leave everything until the last minute.


### Building our knowledge base with Elasticsearch

Elasticsearch is an open source, distributed search and analytics engine built for speed, scale, and AI applications. As a retrieval platform, it stores structured, unstructured, and vector data in real time, supporting fast hybrid and vector search, security and observability analytics, and AI-powered applications.

We use Docker to start the service; run the following command in a terminal:

```bash
docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:9.0.0
```

**Import the module**

Import the Elasticsearch module to interact with the Elasticsearch server.

In [30]:
from elasticsearch import Elasticsearch

**Connect to the Elasticsearch server**

Create a client instance that lets you send requests to the Elasticsearch server.

In [31]:
es_client = Elasticsearch('http://localhost:9200')

Get basic information about the cluster:

In [32]:
info = es_client.info()
info

ObjectApiResponse({'name': '58a1aee08cde', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'WBO0PMdCRReVvBHQTptbLQ', 'version': {'number': '9.0.0', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '112859b85d50de2a7e63f73c8fc70b99eea24291', 'build_date': '2025-04-08T15:13:46.049795831Z', 'build_snapshot': False, 'lucene_version': '10.1.0', 'minimum_wire_compatibility_version': '8.18.0', 'minimum_index_compatibility_version': '8.0.0'}, 'tagline': 'You Know, for Search'})

In [33]:
info.keys()

dict_keys(['name', 'cluster_name', 'cluster_uuid', 'version', 'tagline'])

In [34]:
info['version']

{'number': '9.0.0',
 'build_flavor': 'default',
 'build_type': 'docker',
 'build_hash': '112859b85d50de2a7e63f73c8fc70b99eea24291',
 'build_date': '2025-04-08T15:13:46.049795831Z',
 'build_snapshot': False,
 'lucene_version': '10.1.0',
 'minimum_wire_compatibility_version': '8.18.0',
 'minimum_index_compatibility_version': '8.0.0'}

**Create an index in Elasticsearch**

*settings*: defines the index's internal configuration:
- `number_of_shards`: 1 → The index is stored in a single shard. Shards are how Elasticsearch distributes data.
- `number_of_replicas`: 0 → No replicas (redundant copies) are created. This is useful in development environments to save resources.

*mappings*: defines the structure of the documents (the fields and their types):
- `text` → type `text`; analyzed for full-text search (suited to fields that hold sentences).
- `section` → also `text`; the FAQ section the entry belongs to.
- `question` → also `text`; the question itself.
- `course` → type `keyword`; not analyzed, suited to exact-match filtering or aggregations (for example `data-engineering-zoomcamp`).

In [35]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}
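
The practical difference between `text` and `keyword` fields can be pictured with a deliberately simplified model. In the sketch below, the "analysis" step is only lowercasing and splitting on whitespace, which is far cruder than what Elasticsearch's analyzers do; the two function names are hypothetical.

```python
def matches_text(field_value, query):
    """Analyzed match: lowercase, split into words, match if any word is shared."""
    return bool(set(field_value.lower().split()) & set(query.lower().split()))

def matches_keyword(field_value, query):
    """Keyword match: the whole value must be identical."""
    return field_value == query

question = "Course - Can I still join the course after the start date?"
course = "data-engineering-zoomcamp"

print(matches_text(question, "JOIN"))                # a single shared word is enough
print(matches_keyword(course, "data-engineering"))   # a partial value does not match
print(matches_keyword(course, course))               # only the exact value matches
```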

Define the name of the index where the documents will be stored.

In [36]:
index_name = "course-questions"

Use the Elasticsearch client (`es_client`) to create the `course-questions` index with the settings and mappings defined above.

In [50]:
# to delete the index
# es_client.indices.delete(index=index_name)

In [51]:
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [38]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [39]:
from tqdm.auto import tqdm

**Insert (index) documents into an Elasticsearch index**

Elasticsearch stores each document in the `course-questions` index. It assigns an `_id` automatically (unless you specify one) and makes the document available for search.

In [40]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]
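
Because no `_id` is passed, running the loop above a second time indexes every document again under a new ID, and searches can then return the same entry more than once. One way to make indexing repeatable is to derive the ID from the document's content, as in this sketch (`make_doc_id` is a hypothetical helper, not part of the Elasticsearch client):

```python
import hashlib
import json

def make_doc_id(doc):
    """Derive a stable ID from the document's content."""
    # sort_keys makes the serialization independent of key order.
    payload = json.dumps(doc, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()

sample = {"course": "data-engineering-zoomcamp", "question": "Q?", "text": "A."}
print(make_doc_id(sample))
```

The ID could then be passed along when indexing, for example `es_client.index(index=index_name, id=make_doc_id(doc), document=doc)`, so that re-running the loop overwrites existing documents instead of adding copies.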

In [42]:
query = 'I just disovered the course. Can I still join it?'

**Define a compound query in Elasticsearch**

This query is designed to:
- Search for documents related to a question or phrase (`query`).
- Prioritize matches in the `question` field.
- Filter the results so that only the `data-engineering-zoomcamp` course is included.
- Return at most 5 results.

In [43]:
search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": query,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}
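
The course name and result count are hard-coded in the query above. A small builder function, sketched here under the assumption that only those two values need to vary, makes the same query reusable; `build_search_query` is a hypothetical name.

```python
def build_search_query(query, course, size=5):
    """Return a bool query: multi_match on three fields, filtered by course."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        # question^3 boosts matches in the question field.
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                "filter": {"term": {"course": course}},
            }
        },
    }

example = build_search_query("Can I still join?", "data-engineering-zoomcamp", size=3)
print(example["query"]["bool"]["filter"])
```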

Run a search in Elasticsearch using the index and the query defined above.

In [44]:
response = es_client.search(index=index_name, body=search_query)

Explore the response returned by the search:

In [45]:
response['hits']['hits']

[{'_index': 'course-questions',
  '_id': 'DNeocZcBS89bJI-BNzA5',
  '_score': 73.286255,
  '_source': {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
   'section': 'General course-related questions',
   'question': 'Course - Can I still join the course after the start date?',
   'course': 'data-engineering-zoomcamp'}},
 {'_index': 'course-questions',
  '_id': 'WNd_cZcBS89bJI-BLSzy',
  '_score': 73.286255,
  '_source': {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
   'section': 'General course-related questions',
   'question': 'Course - Can I still join the course after the start date?',
   'course': 'data-engineering-zoomcamp'}},
 {'_index'
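
Each hit wraps the original document in `_source` and carries its relevance in `_score`. The sketch below pulls both out of a hand-written dictionary that mimics the nesting of the response; `fake_response` and `hits_to_rows` are hypothetical names and the scores are invented.

```python
def hits_to_rows(response):
    """Return (score, question) pairs from a search-response-shaped mapping."""
    return [
        (hit["_score"], hit["_source"]["question"])
        for hit in response["hits"]["hits"]
    ]

# Hand-written stand-in with the same nesting as a search response.
fake_response = {
    "hits": {
        "hits": [
            {"_id": "a", "_score": 2.5, "_source": {"question": "Can I still join?"}},
            {"_id": "b", "_score": 1.0, "_source": {"question": "When does it start?"}},
        ]
    }
}
rows = hits_to_rows(fake_response)
print(rows)
```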

We can wrap this in a function and then use it in the basic RAG flow:

In [46]:
def elastic_search(query):
    index_name = "course-questions"
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    
    response = es_client.search(index=index_name, body=search_query)
    result_docs = [hit['_source'] for hit in response['hits']['hits']]
    
    return result_docs

In [47]:
def rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    answer = open_ai_llm(prompt)
    return answer

In [48]:
result_example_3 = rag('the course has already started, can I still enroll?')

print(result_example_3)

Yes, you can still enroll in the course after it has started. Even if you don't register, you are eligible to submit the homeworks. However, be mindful that there will be deadlines for turning in the final projects, so it's advisable not to leave everything until the last minute.
