## Introduction

This notebook builds a dataset of automatically generated questions from the FAQ documents of several DataTalksClub courses, taken from the LLM Zoomcamp repository.
The dataset is intended for evaluating language-model systems on information retrieval tasks such as *retrieval-augmented generation (RAG)*.

The process runs from loading and preprocessing the documents, through generating questions with an OpenAI model, to exporting the dataset in CSV format.
The stages of the workflow are described below:

### 1. Downloading and structuring the documents

The FAQ documents are downloaded from the course repository and flattened into a single list, with the name of the course added to each entry.

In [2]:
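import requests

# "?raw=1" makes GitHub return the raw JSON file instead of the HTML page.
docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'
docs_response = requests.get(docs_url)
docs_response.raise_for_status()  # stop early if the download failed
documents_raw = docs_response.json()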

documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)
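
The flattening step can be tried without network access on a small in-memory sample. The sketch below is only an illustration: `sample_raw`, its course names, and the helper `flatten_documents` are hypothetical and merely mirror the nested layout that the loop above expects. Unlike the notebook's loop, it copies each entry instead of modifying it in place.

```python
# Hypothetical sample mirroring the nested layout of documents.json:
# a list of courses, each holding its own list of FAQ entries.
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Answer 1", "section": "General", "question": "Question 1?"},
            {"text": "Answer 2", "section": "General", "question": "Question 2?"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Answer 3", "section": "Setup", "question": "Question 3?"},
        ],
    },
]


def flatten_documents(raw):
    """Return one flat list of entries, each tagged with its course name."""
    flat = []
    for course in raw:
        for doc in course["documents"]:
            # Copy the entry so the raw input is left unmodified.
            flat.append({**doc, "course": course["course"]})
    return flat


sample_documents = flatten_documents(sample_raw)
print(len(sample_documents))          # 3
print(sample_documents[2]["course"])  # course-b
```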

### 2. Generating unique identifiers

An identifier (`id`) is generated for each document by taking the MD5 hash of the course name, the question, and the first 10 characters of the answer, then keeping the first 8 hexadecimal characters. This gives each document a stable ID that later steps can use as a reference.

In [4]:
import hashlib

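def generate_document_id(doc):
    """Return an 8-character ID derived from the course, question, and answer prefix."""
    # Earlier variant, kept for reference: course and question only.
    # combined = f"{doc['course']}-{doc['question']}"
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    hash_hex = hashlib.md5(combined.encode()).hexdigest()
    # The first 8 hex characters are used as a short ID.
    return hash_hex[:8]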

In [5]:
for doc in documents:
    doc['id'] = generate_document_id(doc)

In [6]:
documents[3]

{'text': "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.",
 'section': 'General course-related questions',
 'question': 'Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?',
 'course': 'data-engineering-zoomcamp',
 'id': '0bbf41ec'}
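
Because only the first 10 characters of the answer enter the hash, two entries with the same course, the same question, and the same answer prefix receive the same ID. The following sketch illustrates this with made-up entries (`base`, `same_prefix`, and `other_answer` are hypothetical and not taken from the FAQ data).

```python
import hashlib


def generate_document_id(doc):
    # Same recipe as the notebook: course, question, first 10 characters of the answer.
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]


# Hypothetical entries, not taken from the FAQ data.
base = {
    "course": "course-a",
    "question": "How do I install it?",
    "text": "Use pip to install the package.",
}
same_prefix = {**base, "text": "Use pip to install it, then restart."}
other_answer = {**base, "text": "Download the installer instead."}

id_base = generate_document_id(base)
id_same_prefix = generate_document_id(same_prefix)
id_other = generate_document_id(other_answer)

print(len(id_base))               # 8
print(id_base == id_same_prefix)  # True: both answers start with "Use pip to"
print(id_base, id_other)          # expected to differ, since the answer prefix differs
```

The ID is also deterministic: running the function again on the same entry yields the same value, which is what makes it usable as a reference across runs.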

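### 3. Checking for duplicates

The documents are grouped by ID in a dictionary to check for duplicates (documents that produce the same hash). This helps spot redundancy in the processed data: here 947 distinct IDs cover 948 documents, and the single shared ID, `593f7569`, belongs to two entries that differ only in a trailing attribution line.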

In [8]:
from collections import defaultdict

In [9]:
hashes = defaultdict(list)

for doc in documents:
    doc_id = doc['id']
    hashes[doc_id].append(doc)

In [10]:
len(hashes), len(documents)

(947, 948)

In [18]:
for k, values in hashes.items():
    if len(values) > 1:
        print(k, len(values))

593f7569 2


In [19]:
hashes['593f7569']

[{'text': "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'},
 {'text': "They both do the same, it's just less typing from the script.",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'}]
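
The notebook keeps both entries in `documents`. The later generation loop skips IDs it has already processed, and the `doc_index` dictionary built in step 7 keeps the last entry for each ID. If an explicit deduplication step is preferred, one option is sketched below; the helper `deduplicate_by_id`, the sample data, and the "keep the first entry" policy are all assumptions, not part of the original workflow.

```python
def deduplicate_by_id(docs):
    """Keep the first document seen for each id, preserving input order."""
    unique = {}
    for doc in docs:
        # setdefault only stores the document if the id has not been seen yet.
        unique.setdefault(doc["id"], doc)
    return list(unique.values())


# Hypothetical sample with one repeated id.
sample = [
    {"id": "aaaa1111", "text": "First answer, with attribution."},
    {"id": "bbbb2222", "text": "Another answer."},
    {"id": "aaaa1111", "text": "First answer."},
]

deduplicated = deduplicate_by_id(sample)
print(len(deduplicated))        # 2
print(deduplicated[0]["text"])  # First answer, with attribution.
```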

### 4. Exporting the documents with IDs

The documents, now enriched with their identifier, are exported to a JSON file (`documents-with-ids.json`) for later use.

In [20]:
import json

In [21]:
with open('data/documents-with-ids.json', 'wt') as f_out:
    json.dump(documents, f_out, indent=2)
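
By default `json.dump` escapes non-ASCII characters, so typographic quotes in the FAQ text are written as sequences such as `\u201c`. The file is still valid JSON and loads back to the same text. As a minimal sketch with a hypothetical record, the `ensure_ascii=False` option keeps such characters readable; when writing to a file with that option, opening the file with `encoding='utf-8'` is advisable.

```python
import json

# Hypothetical record containing typographic quotes like those in the FAQ text.
record = {"text": "Join the \u201cOffice Hours\u201d session"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keeps the characters as-is

print(escaped)
print(readable)

# Both forms decode to the same Python object.
print(json.loads(escaped) == json.loads(readable))  # True
```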

In [22]:
!head data/documents-with-ids.json

[
  {
    "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "course": "data-engineering-zoomcamp",
    "id": "c02e79ef"
  },
  {
    "text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",


### 5. Defining the prompt for question generation

A `prompt_template` is defined that instructs the language model to generate 5 complete questions per document while reusing as few words from the record as possible.

In [23]:
prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

### 6. Generating questions with OpenAI

The documents are sent to OpenAI's `gpt-4o-mini` model, which returns a JSON array of 5 questions per document. The results are saved for later analysis.

In [27]:
import os
from dotenv import load_dotenv

load_dotenv()

openai_key = os.getenv("OPENAI_API_KEY")

if openai_key is None:
    print("The OPENAI_API_KEY variable is not defined in the .env file")
else:
    print("The OPENAI_API_KEY variable is defined correctly.")

The OPENAI_API_KEY variable is defined correctly.


In [29]:
from openai import OpenAI
client = OpenAI()

In [30]:
def generate_questions(doc):
    prompt = prompt_template.format(**doc)

    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{"role": "user", "content": prompt}]
    )

    json_response = response.choices[0].message.content
    return json_response
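
The call `prompt_template.format(**doc)` passes every key of the document as a keyword argument. `str.format` ignores keyword arguments that have no matching placeholder, so the extra `course` and `id` keys do no harm, while a missing `section`, `question`, or `text` key would raise a `KeyError`. A small sketch with a shortened stand-in template and a hypothetical document:

```python
# Shortened stand-in for the notebook's prompt_template (hypothetical wording).
template = "section: {section}\nquestion: {question}\nanswer: {text}"

# Hypothetical document carrying the extra 'course' and 'id' keys.
doc = {
    "text": "Yes, sessions are recorded.",
    "section": "General",
    "question": "Are the sessions recorded?",
    "course": "course-a",
    "id": "abcd1234",
}

# Keyword arguments without a matching placeholder are ignored by str.format.
prompt = template.format(**doc)
print(prompt)
```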

In [31]:
from tqdm.auto import tqdm

In [32]:
results = {}

In [33]:
for doc in tqdm(documents): 
    doc_id = doc['id']
    if doc_id in results:
        continue

    questions = generate_questions(doc)
    results[doc_id] = questions

  0%|          | 0/948 [00:00<?, ?it/s]
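
The `if doc_id in results: continue` guard has two effects: re-running the cell after an interruption does not repeat API calls for documents that already have questions, and a repeated ID is processed only once. The sketch below illustrates this behavior with a hypothetical stand-in, `fake_generate_questions`, in place of the real API call, and made-up document IDs.

```python
calls = []


def fake_generate_questions(doc):
    """Hypothetical stand-in for the API call; records which ids were requested."""
    calls.append(doc["id"])
    return '["q1", "q2", "q3", "q4", "q5"]'


# Hypothetical documents; the last one repeats the first id.
sample_docs = [{"id": "a1"}, {"id": "b2"}, {"id": "a1"}]

# Pretend an earlier, interrupted run already produced questions for 'b2'.
results = {"b2": '["cached"]'}

for doc in sample_docs:
    doc_id = doc["id"]
    if doc_id in results:
        continue  # already generated, or a duplicate id
    results[doc_id] = fake_generate_questions(doc)

print(calls)            # ['a1']
print(sorted(results))  # ['a1', 'b2']
```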

In [35]:
import pickle

with open('data/results.bin', 'wb') as f_out:
    pickle.dump(results, f_out)

### 7. Processing and cleaning the results

The generated results are parsed and linked back to their source document and course. The information is arranged as a list of triples: question, course, and document ID.


In [36]:
import pickle

with open('data/results.bin', 'rb') as f_in:
    results = pickle.load(f_in)

In [37]:
results['1f6520ca']

'["What prior knowledge or skills do I need before enrolling in the course?", "Where can I find the list of prerequisites for the course?", "Are there any specific tools or technologies I should be familiar with before taking this course?", "Is there a resource that outlines the necessary qualifications for this course?", "What are the essential requirements for participation in this course?"]'

In [38]:
parsed_results = {}

for doc_id, json_questions in results.items():
    parsed_results[doc_id] = json.loads(json_questions)

In [40]:
doc_index = {d['id']: d for d in documents}

In [41]:
final_results = []

for doc_id, questions in parsed_results.items():
    course = doc_index[doc_id]['course']
    for q in questions:
        final_results.append((q, course, doc_id))

In [42]:
final_results[:5]

[('What is the exact date and time for the commencement of the course?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('How can I stay updated about course announcements?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('What should I do before the course begins?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('Where can I find the link to register for the course?',
  'data-engineering-zoomcamp',
  'c02e79ef'),
 ('Is there a specific platform for live Office Hours?',
  'data-engineering-zoomcamp',
  'c02e79ef')]
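
The parsing step assumes that every response is valid JSON in the requested shape. Model output can occasionally be malformed, in which case `json.loads` raises `json.JSONDecodeError`. A more defensive variant is sketched below; the helper `parse_questions`, the sample responses, and the rule of requiring exactly 5 strings are assumptions rather than part of the notebook.

```python
import json


def parse_questions(raw, expected=5):
    """Return the list of questions, or None if the response is unusable."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, list) or len(parsed) != expected:
        return None
    if not all(isinstance(q, str) for q in parsed):
        return None
    return parsed


# Hypothetical raw responses keyed by document id.
raw_results = {
    "ok000001": '["q1", "q2", "q3", "q4", "q5"]',
    "bad00002": '["q1", "q2"',            # truncated JSON
    "bad00003": '{"questions": ["q1"]}',  # valid JSON, wrong shape
}

parsed, failed = {}, []
for doc_id, raw in raw_results.items():
    questions = parse_questions(raw)
    if questions is None:
        failed.append(doc_id)
    else:
        parsed[doc_id] = questions

print(sorted(parsed))  # ['ok000001']
print(failed)          # ['bad00002', 'bad00003']
```

The IDs collected in `failed` could then be sent to the model again or inspected by hand.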

### 8. Creating and exporting the final dataset

A DataFrame is built from the generated questions and exported to a CSV file (`ground-truth-data.csv`), which serves as the ground truth for future evaluation *benchmarks*.

In [44]:
import pandas as pd

In [45]:
df = pd.DataFrame(final_results, columns=['question', 'course', 'document'])

In [46]:
df.to_csv('data/ground-truth-data.csv', index=False)

In [47]:
!head data/ground-truth-data.csv

question,course,document
What is the exact date and time for the commencement of the course?,data-engineering-zoomcamp,c02e79ef
How can I stay updated about course announcements?,data-engineering-zoomcamp,c02e79ef
What should I do before the course begins?,data-engineering-zoomcamp,c02e79ef
Where can I find the link to register for the course?,data-engineering-zoomcamp,c02e79ef
Is there a specific platform for live Office Hours?,data-engineering-zoomcamp,c02e79ef
What prior knowledge or skills do I need before enrolling in the course?,data-engineering-zoomcamp,1f6520ca
Where can I find the list of prerequisites for the course?,data-engineering-zoomcamp,1f6520ca
Are there any specific tools or technologies I should be familiar with before taking this course?,data-engineering-zoomcamp,1f6520ca
Is there a resource that outlines the necessary qualifications for this course?,data-engineering-zoomcamp,1f6520ca
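
A quick sanity check on the exported file is to count the questions per document, since each document is expected to contribute 5 rows. The sketch below uses only the standard library and a hypothetical in-memory CSV in the same three-column layout; with the real file, the `io.StringIO` object would be replaced by an opened `data/ground-truth-data.csv`. It also shows that a question containing a comma is read back intact as long as it is quoted.

```python
import csv
import io
from collections import Counter

# Hypothetical CSV content in the same three-column layout as ground-truth-data.csv.
sample_csv = io.StringIO(
    "question,course,document\n"
    "When does it start?,course-a,aaaa1111\n"
    '"If I join late, can I still submit?",course-a,aaaa1111\n'
    "Where is the syllabus?,course-a,bbbb2222\n"
)

rows = list(csv.DictReader(sample_csv))
per_document = Counter(row["document"] for row in rows)

print(len(rows))                 # 3
print(per_document["aaaa1111"])  # 2
print(rows[1]["question"])       # If I join late, can I still submit?
```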
