# Purpose of this notebook: generating a set of questions for each record in the FAQ and formatting them properly

In [1]:
import requests

# URL of the JSON file containing the course documents information
docs_url = 'https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/documents.json?raw=1'

# Send a GET request to the URL to retrieve the JSON data
docs_response = requests.get(docs_url)

# Parse the JSON response into a Python dictionary
documents_raw = docs_response.json()

# Initialize an empty list to store the processed document information
documents = []

# Loop through each course in the raw documents data
for course in documents_raw:
    # Extract the course name for the current course
    course_name = course['course']

    # Loop through the documents associated with the current course
    for doc in course['documents']:
        # Add the course name to each document's dictionary
        doc['course'] = course_name
        
        # Append the updated document (with course name) to the documents list
        documents.append(doc)

# Now 'documents' contains a list of all documents with the associated course name added


In [None]:
# Generating an id for each record in the FAQ

In [2]:
n = len(documents)

for i in range(n):
    documents[i]['id'] = i

documents[2]

{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'section': 'General course-related questions',
 'question': 'Course - Can I still join the course after the start date?',
 'course': 'data-engineering-zoomcamp',
 'id': 2}

But this approach ties each ID to the record's position, so reordering, adding, or removing FAQ records would change the IDs. We therefore take a different approach.
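A minimal sketch of the problem with positional IDs, using three made-up records (the `records` list and its contents are illustrative, not taken from the FAQ): after a reorder, deriving the IDs again gives the same record a different ID.

```python
# Illustrative records; the questions are placeholders, not real FAQ entries.
records = [
    {"question": "Question A"},
    {"question": "Question B"},
    {"question": "Question C"},
]

def positional_ids(docs):
    """Map each question to its index in the list, as the loop above does."""
    return {doc["question"]: i for i, doc in enumerate(docs)}

ids_before = positional_ids(records)

# Reorder the records (here: reverse them) and derive the IDs again.
ids_after = positional_ids(list(reversed(records)))

# Collect the questions whose ID changed after the reorder.
changed = [q for q in ids_before if ids_before[q] != ids_after[q]]
print(changed)  # ['Question A', 'Question C']
```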

The following function generates an ID for each document from its content, so the ID stays the same regardless of the document's position in the list. It hashes the course name, the question, and the first 10 characters of the text with MD5, so even a slight difference in those inputs (e.g., a different course name) produces a completely different hash. Truncating the hash to its first 8 hexadecimal characters trades uniqueness for brevity: the ID is short, and with 32 bits of the hash retained, the chance that two different inputs collide remains small for a dataset of this size.

In [3]:
import hashlib

def generate_document_id(doc):
    # Create a string combining 'course', 'question', and the first 10 characters of 'text'
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    
    # Create an MD5 hash object from the combined string
    hash_object = hashlib.md5(combined.encode())
    
    # Convert the hash object into its hexadecimal representation
    hash_hex = hash_object.hexdigest()
    
    # Take the first 8 characters of the hash as the document ID
    document_id = hash_hex[:8]
    
    # Return the generated document ID
    return document_id
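
To put a rough number on the trade-off of truncating the hash to 8 characters, the sketch below uses the standard birthday-bound approximation. It assumes the truncated hashes behave like uniformly random 32-bit values, which is an idealization; `collision_probability` is a hypothetical helper, not part of the notebook.

```python
import math

def collision_probability(n_items: int, hex_chars: int) -> float:
    """Approximate chance that at least two of n_items random IDs coincide."""
    id_space = 16 ** hex_chars            # number of possible IDs
    pairs = n_items * (n_items - 1) / 2   # number of distinct pairs of items
    # Birthday-bound approximation: 1 - exp(-pairs / id_space)
    return -math.expm1(-pairs / id_space)

# 948 documents, 8 hexadecimal characters (32 bits)
p = collision_probability(948, 8)
print(f"{p:.2e}")
```

Under that assumption, an accidental collision among 948 documents is unlikely (on the order of 1 in 10,000). Two documents with the same course, the same question, and the same first 10 characters of text will still always receive the same ID, because their hash input is identical.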


In [5]:
for doc in documents:
    doc['id'] = generate_document_id(doc)

In [6]:
documents[3]

{'text': "You don't need it. You're accepted. You can also just start learning and submitting homework without registering. It is not checked against any registered list. Registration is just to gauge interest before the start date.",
 'section': 'General course-related questions',
 'question': 'Course - I have registered for the Data Engineering Bootcamp. When can I expect to receive the confirmation email?',
 'course': 'data-engineering-zoomcamp',
 'id': '0bbf41ec'}

Purpose of the following code:

- Grouping by Document ID: This code groups all documents by their id and checks for duplicates. Each id serves as a key in the hashes dictionary, and the associated list contains all documents with that id.
- Finding Duplicates: The second loop checks for IDs that appear more than once (duplicates).
- Retrieving Documents by ID: The final line retrieves all documents with a specific id ('593f7569' in this case).

In [7]:
from collections import defaultdict

hashes = defaultdict(list)

# Iterate through each document in 'documents'
for doc in documents:
    # Extract the 'id' from the document
    doc_id = doc['id']
    
    # Append the document to the list corresponding to its ID in 'hashes'
    hashes[doc_id].append(doc)

# Return the number of unique keys in 'hashes' and the total number of documents
len(hashes), len(documents)


(947, 948)

So we actually have one fewer unique ID than documents, which means two documents share the same ID. That's OK for our purposes.

In [8]:
# Check for duplicate documents based on their ID
for k, values in hashes.items():
    if len(values) > 1:
        print(k, len(values))  # Print the document ID and how many duplicates exist

# Retrieve all documents with the specific ID '593f7569', which we know is the duplicate
hashes['593f7569']

593f7569 2


[{'text': "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'},
 {'text': "They both do the same, it's just less typing from the script.",
  'section': '6. Decision Trees and Ensemble Learning',
  'question': 'Does it matter if we let the Python file create the server or if we run gunicorn directly?',
  'course': 'machine-learning-zoomcamp',
  'id': '593f7569'}]
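
The two records above differ only in the attribution appended to the first one's text. Because the ID is built from the course, the question, and only the first 10 characters of the text, both records produce the same hash input. The sketch below repeats `generate_document_id` so that it runs on its own; `full_text_id` is a hypothetical variant, shown only to illustrate that hashing the whole text would most likely separate the two.

```python
import hashlib

def generate_document_id(doc):
    # Same logic as the function defined earlier in the notebook
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]

def full_text_id(doc):
    # Hypothetical variant: hash the whole text instead of its first 10 characters
    combined = f"{doc['course']}-{doc['question']}-{doc['text']}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]

base = {
    "course": "machine-learning-zoomcamp",
    "question": "Does it matter if we let the Python file create the server or if we run gunicorn directly?",
}
doc_a = {**base, "text": "They both do the same, it's just less typing from the script.\nAsked by Andrew Katoch, Added by Edidiong Esu"}
doc_b = {**base, "text": "They both do the same, it's just less typing from the script."}

# Both texts start with the same 10 characters, so the hash inputs are identical
print(doc_a["text"][:10] == doc_b["text"][:10])                    # True
print(generate_document_id(doc_a) == generate_document_id(doc_b))  # True

# With the full text in the hash input, the inputs differ
print(full_text_id(doc_a) == full_text_id(doc_b))
```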

The following code writes the list of documents, now including their IDs, to a JSON file:

In [9]:
import json

# Open a file called 'documents-with-ids.json' in write text mode ('wt')
with open('documents-with-ids.json', 'wt') as f_out:
    # Write the 'documents' data as a JSON file with an indentation of 2 spaces
    json.dump(documents, f_out, indent=2)


In [10]:
!head documents-with-ids.json

[
  {
    "text": "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  \u201cOffice Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon\u2019t forget to register in DataTalks.Club's Slack and join the channel.",
    "section": "General course-related questions",
    "question": "Course - When will the course start?",
    "course": "data-engineering-zoomcamp",
    "id": "c02e79ef"
  },
  {
    "text": "GitHub - DataTalksClub data-engineering-zoomcamp#prerequisites",


# Prompt

The following prompt generates questions that a student might ask based on an FAQ record. It minimizes the use of specific words from the record and keeps the questions general and concise.

In [11]:

prompt_template = """
You emulate a student who's taking our course.
Formulate 5 questions this student might ask based on a FAQ record. The record
should contain the answer to the questions, and the questions should be complete and not too short.
If possible, use as few words as possible from the record.

The record:

section: {section}
question: {question}
answer: {text}

Provide the output in parsable JSON without using code blocks:

["question1", "question2", ..., "question5"]
""".strip()

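A minimal sketch of how the placeholders are filled in. It uses a shortened stand-in template (`short_template`) and a made-up record, both illustrative. `str.format(**doc)` ignores keys that the template does not reference, so extra fields such as `course` and `id` do no harm.

```python
# Shortened stand-in for the template above, with the same three placeholders
short_template = """
section: {section}
question: {question}
answer: {text}
""".strip()

# Made-up record with the same keys as the FAQ documents
doc = {
    "text": "Example answer text.",
    "section": "Example section",
    "question": "Example question?",
    "course": "example-course",
    "id": "00000000",
}

# Unpack the dictionary into keyword arguments; unused keys are ignored
prompt = short_template.format(**doc)
print(prompt)
```

A template that contains literal curly braces (for instance a JSON object in the output example) would need them doubled as `{{` and `}}`. The template above avoids this by asking for a JSON array.
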
The following code generates questions based on a document by formatting a prompt and sending it to the OpenAI GPT-4o model for completion. It extracts and returns the generated response, which is a list of questions formatted as a JSON string.

In [12]:
from openai import OpenAI  

client = OpenAI() 

def generate_questions(doc):
    # Creates a prompt by formatting the prompt_template using the 'doc' dictionary
    prompt = prompt_template.format(**doc)

    # Sends a request to the OpenAI chat completions API with the generated prompt
    response = client.chat.completions.create(
        model='gpt-4o',  # Specifies the model to use
        messages=[{"role": "user", "content": prompt}]  # Defines the message for the model (as a user input)
    )

    # Extracts the response from the OpenAI API by accessing the first completion choice and the content of the message
    json_response = response.choices[0].message.content

    # Returns the generated response, which should be a JSON-formatted string containing the 5 questions
    return json_response



The following code iterates through the list of documents, generating and storing questions for each unique document ID. If an ID has already been processed, the document is skipped to avoid duplication. The cell is commented out here; the saved results are loaded from a pickle file below instead.

In [13]:
# from tqdm.auto import tqdm  # Import the tqdm progress bar for tracking the progress of a loop.

# results = {}  # Initialize an empty dictionary to store the results.

# # Loop through each document in the 'documents' list, displaying a progress bar using tqdm.
# for doc in tqdm(documents):  
#     doc_id = doc['id']  # Extract the 'id' from each document.
    
#     # If the document ID is already in the 'results' dictionary, skip to the next iteration.
#     if doc_id in results:
#         continue  # Skip any further processing for this document if it's already processed.

#     # Generate questions for the document using the 'generate_questions' function.
#     questions = generate_questions(doc)
    
#     # Store the generated questions in the 'results' dictionary, using the document ID as the key.
#     results[doc_id] = questions
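
The skip logic can be tried without calling the API. The sketch below swaps in a stub, `fake_generate_questions` (hypothetical), that only records which documents it was asked about; the sample documents are made up.

```python
calls = []

def fake_generate_questions(doc):
    # Stand-in for generate_questions: records the call instead of hitting the API
    calls.append(doc["question"])
    return '["placeholder question"]'

# Made-up documents; the first and third share an ID
sample_docs = [
    {"id": "aaaa1111", "question": "first"},
    {"id": "bbbb2222", "question": "second"},
    {"id": "aaaa1111", "question": "third"},
]

sample_results = {}
for doc in sample_docs:
    doc_id = doc["id"]
    if doc_id in sample_results:
        continue  # already processed: skip the duplicate ID
    sample_results[doc_id] = fake_generate_questions(doc)

print(len(sample_results), calls)  # 2 ['first', 'second']
```

Only the first document with a given ID is sent to the model, so for a duplicated ID the stored questions come from that first document.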



In [16]:
# import pickle

# # Save the 'results' dictionary to a pickle file.
# with open('results.pkl', 'wb') as f:
#     pickle.dump(results, f)

In [17]:
import pickle  # needed here because the import in the cell above is commented out

with open('results.pkl', 'rb') as f_in:
    results = pickle.load(f_in)

In [18]:
results['1f6520ca']

'[\n  "Where can I find the prerequisites for this class?",\n  "Is there a link to the course requirements?",\n  "Where are details about necessary prior knowledge for this course?",\n  "How can I check if I meet the course prerequisites?",\n  "Where should I look for the entry requirements of this course?"\n]'

### Parse JSON strings into Python lists

In [19]:
import json 
parsed_results = {}  # Initialize an empty dictionary to store the parsed results.

# Loop through the 'results' dictionary, where each key is a 'doc_id' and each value is a JSON string.
for doc_id, json_questions in results.items():
    
    # Parse the JSON string 'json_questions' into a native Python object (here, a list of question strings).
    # json.loads() converts a JSON-formatted string into a corresponding Python object.
    parsed_results[doc_id] = json.loads(json_questions)

    # Now 'parsed_results[doc_id]' will store the Python object instead of a JSON string.
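
`json.loads` raises `json.JSONDecodeError` if a response is not valid JSON, which can happen with model output. A hedged sketch of one way to handle that is to collect the failures instead of stopping at the first one. The helper `parse_all` and the sample strings are illustrative, not part of the notebook.

```python
import json

def parse_all(raw_results):
    """Parse each JSON string; return (parsed, failed) dictionaries."""
    parsed, failed = {}, {}
    for doc_id, json_questions in raw_results.items():
        try:
            parsed[doc_id] = json.loads(json_questions)
        except json.JSONDecodeError as err:
            # Keep the raw string and the error message for manual inspection
            failed[doc_id] = (json_questions, str(err))
    return parsed, failed

# Made-up responses: one valid, one cut off in the middle of a string
sample = {
    "doc-1": '["What is X?", "How do I do Y?"]',
    "doc-2": '["Where is Z?", "How',
}
parsed, failed = parse_all(sample)
print(list(parsed), list(failed))  # ['doc-1'] ['doc-2']
```

The entries in `failed` can then be fixed by hand or regenerated for just those documents.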


In [29]:
# parsed_results

In [25]:
# Inspect the results
## Get the first key-value pair from the parsed_results dictionary
first_key, first_value = next(iter(parsed_results.items()))

## Print the first key and the corresponding value (parsed JSON object)
print("First document ID:", first_key)
print("First parsed value:", first_value)


First document ID: c02e79ef
First parsed value: ['When does the course begin?', 'What time will the course start on January 15th?', "How can I subscribe to the course's Google Calendar?", 'Where should I register before the course starts?', 'How do I join the course Telegram channel?']


# Mapping Questions to Course and Document Information for DataFrame Organization

The following code processes a set of questions extracted from documents, associates them with their respective course and document ID, and organizes the data into a structured pandas DataFrame. This allows for easier analysis, manipulation, or export of the data for further use.

In [31]:
import pandas as pd

# Create a dictionary mapping document IDs to their corresponding documents
doc_index = {d['id']: d for d in documents}

# Initialize an empty list for final results
final_results = []

# Loop through parsed questions by document ID
for doc_id, questions in parsed_results.items():
    course = doc_index[doc_id]['course']  # Get course info for the document
    for q in questions:
        final_results.append((q, course, doc_id))  # Append question, course, and doc_id to the results

# Convert results list to a DataFrame
df = pd.DataFrame(final_results, columns=['question', 'course', 'document'])
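
One detail worth knowing: when two documents share an ID, the dictionary comprehension keeps only the last one. A small sketch with made-up documents (plain tuples instead of pandas, to keep it self-contained; all names prefixed with `sample_` are illustrative):

```python
# Made-up documents; two of them share the ID "aaaa1111"
sample_documents = [
    {"id": "aaaa1111", "course": "course-a", "text": "first version"},
    {"id": "bbbb2222", "course": "course-b", "text": "other"},
    {"id": "aaaa1111", "course": "course-a", "text": "second version"},
]
sample_parsed = {
    "aaaa1111": ["Question 1?", "Question 2?"],
    "bbbb2222": ["Question 3?"],
}

# Later entries overwrite earlier ones with the same key
sample_index = {d["id"]: d for d in sample_documents}
print(len(sample_index), sample_index["aaaa1111"]["text"])  # 2 second version

# One (question, course, document ID) row per generated question
sample_rows = [
    (q, sample_index[doc_id]["course"], doc_id)
    for doc_id, questions in sample_parsed.items()
    for q in questions
]
print(sample_rows[0])  # ('Question 1?', 'course-a', 'aaaa1111')
```

Only the course is read from the index here, and in this dataset the two documents with the shared ID belong to the same course, so the overwrite does not change the resulting rows.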


In [34]:
df.head()

                                            question                     course  document
0                        When does the course begin?  data-engineering-zoomcamp  c02e79ef
1   What time will the course start on January 15th?  data-engineering-zoomcamp  c02e79ef
2  How can I subscribe to the course's Google Cal...  data-engineering-zoomcamp  c02e79ef
3  Where should I register before the course starts?  data-engineering-zoomcamp  c02e79ef
4         How do I join the course Telegram channel?  data-engineering-zoomcamp  c02e79ef


In [32]:
df.to_csv('ground-truth-data.csv', index=False)

In [33]:
!head ground-truth-data.csv

question,course,document
When does the course begin?,data-engineering-zoomcamp,c02e79ef
What time will the course start on January 15th?,data-engineering-zoomcamp,c02e79ef
How can I subscribe to the course's Google Calendar?,data-engineering-zoomcamp,c02e79ef
Where should I register before the course starts?,data-engineering-zoomcamp,c02e79ef
How do I join the course Telegram channel?,data-engineering-zoomcamp,c02e79ef
Where can I find the prerequisites for this class?,data-engineering-zoomcamp,1f6520ca
Is there a link to the course requirements?,data-engineering-zoomcamp,1f6520ca
Where are details about necessary prior knowledge for this course?,data-engineering-zoomcamp,1f6520ca
How can I check if I meet the course prerequisites?,data-engineering-zoomcamp,1f6520ca
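
As a quick sanity check on the exported file, one could count the questions per document. The sketch below uses the standard `csv` module on an in-memory sample (a few rows copied from the output above) rather than reading `ground-truth-data.csv` from disk, so it runs on its own.

```python
import csv
import io
from collections import Counter

# A few rows copied from the CSV output above
sample_csv = """question,course,document
When does the course begin?,data-engineering-zoomcamp,c02e79ef
How do I join the course Telegram channel?,data-engineering-zoomcamp,c02e79ef
Where can I find the prerequisites for this class?,data-engineering-zoomcamp,1f6520ca
"""

# DictReader uses the header row for the field names
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count how many questions each document ID received
per_document = Counter(row["document"] for row in rows)
print(per_document)
```

On the full file, a document whose count differs from 5 would suggest that the model's response for it did not follow the prompt.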
