# Creating Exam Questions
This notebook builds a pipeline that generates a set of exam questions from presentation slide decks.

To reproduce the results, you need an API key for the LLM services from ScaDS.AI, provided via the `SCADS_KEY` environment variable.

In [17]:
from openai import OpenAI
import os
from caching import get_zenodo_ids_from_yaml, get_zenodo_pdfs, download_pdf
import requests
import pdfplumber
from pdf2image import convert_from_bytes
import io
import base64

my_api_key = os.getenv("SCADS_KEY")
token_limit = 131000  # Token limit for the prompt

# OpenAI client
openai_client = OpenAI(base_url="https://llm.scads.ai/v1", api_key=my_api_key)

# Find the first model with "Qwen/" in its name
for model in openai_client.models.list().data:
    model_name = model.id
    if "Qwen/" in model_name:
        print(model_name)
        break


Qwen/Qwen2-VL-7B-Instruct
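
If no model ID contains the search string, the loop above ends without a match and `model_name` simply holds the last ID in the list. A minimal sketch of a stricter lookup follows. It assumes only that model IDs are plain strings; `pick_model` is a hypothetical helper and not part of the OpenAI client, and the first example ID is made up for illustration.

```python
def pick_model(model_ids, needle):
    """Return the first model ID containing `needle`, or raise if none matches."""
    for model_id in model_ids:
        if needle in model_id:
            return model_id
    raise LookupError(f"No model ID contains {needle!r}")


# Illustrative IDs only; the real list would come from openai_client.models.list().data
example_ids = ["example-org/Llama-demo", "Qwen/Qwen2-VL-7B-Instruct"]
print(pick_model(example_ids, "Qwen/"))
```

Raising an error makes a missing model visible immediately instead of sending requests to an unintended one.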


## Before starting, all Zenodo records (i.e. slide decks) are extracted from the training material database so that a presentation can be chosen for generating the questions

In [16]:
file_url = "https://raw.githubusercontent.com/NFDI4BIOIMAGE/training/main/resources/nfdi4bioimage.yml"
yaml_file = "nfdi4bioimage.yml"

# Download the current training material YAML file from the Git repository
response = requests.get(file_url, timeout=30)
response.raise_for_status()  # Stop here if the download failed
with open(yaml_file, "wb") as file:
    file.write(response.content)
print(f"File downloaded successfully as {yaml_file}")

# Extract the Zenodo Record IDs
zenodo_ids = get_zenodo_ids_from_yaml(yaml_file)
print(f"Found {len(zenodo_ids)} Zenodo records: {zenodo_ids}")

File downloaded successfully as nfdi4bioimage.yml
Found 38 Zenodo records: ['10008464', '10008465', '10083555', '10654775', '10679054', '10687658', '10793699', '10815329', '10816895', '10886749', '10939519', '10942559', '10970869', '10972692', '10990107', '11031746', '11066250', '11107798', '11265038', '11396199', '11472148', '11474407', '11548617', '12623730', '3778431', '4317149', '4328911', '4330625', '4334697', '4461261', '4630788', '4748510', '4748534', '4778265', '8323588', '8329305', '8329306', '8414318']
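
The implementation of `get_zenodo_ids_from_yaml` lives in the local `caching` module and is not shown here. As a rough sketch of how such an extraction could work, the hypothetical helper below scans text for Zenodo record URLs and DOIs with a regular expression. The pattern, the helper name, and the sample entries are assumptions for illustration, not the module's actual logic.

```python
import re

# Matches record URLs such as https://zenodo.org/records/12623730 and
# DOIs such as 10.5281/zenodo.3778431 (the pattern is an assumption).
ZENODO_ID = re.compile(r"zenodo(?:\.org/records?/|\.)(\d+)")


def extract_zenodo_ids(text):
    """Return the unique Zenodo record IDs found in `text`, sorted as strings."""
    return sorted(set(ZENODO_ID.findall(text)))


# Made-up YAML entries that reuse two record IDs from the output above
sample = """
- name: Example deck
  url: https://zenodo.org/records/12623730
- name: Another deck
  url: https://doi.org/10.5281/zenodo.3778431
"""
print(extract_zenodo_ids(sample))
```

Sorting the IDs as strings gives the same lexicographic order as the printed list above, where `'12623730'` precedes `'3778431'`.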


In [11]:
from pdf_utilities import save_images, download_zenodo_pdf

# Either choose a specific record from the training material, or place the desired PDF
# in this repository and set pdf_path to the corresponding filename
zenodo_record_id = "12623730"  # Change to the desired Record
pdf_number = 2  # Change to the desired PDF number

# Step 1: Download PDF
pdf_path = download_zenodo_pdf(zenodo_record_id, pdf_number, "downloaded_images")

# Step 2: Save Images
save_images("downloaded_images", pdf_path)

 Downloaded PDF saved at: zenodo_12623730_pdf2.pdf


## First, all images (that were previously downloaded) are encoded and sent to the VLM to create one question per slide

In [12]:
from PIL import Image

def encode_image(image_path, max_size=(512, 512), quality=75, convert_to_jpeg=True):
    """
    Resize and compress an image before encoding it to Base64.
    
    Parameters:
    - image_path (str): Path to the image.
    - max_size (tuple): Maximum width and height (default: 512x512).
    - quality (int): JPEG quality (1-100), lower = smaller size.
    - convert_to_jpeg (bool): Save the image as JPEG to reduce size.

    Returns:
    - str: Base64 encoded string of the optimized image.
    """
    with Image.open(image_path) as img:
        # Convert to RGB before saving as JPEG (JPEG supports neither alpha nor palettes)
        if convert_to_jpeg and img.mode != "RGB":
            img = img.convert("RGB")

        # Resize image while maintaining aspect ratio
        img.thumbnail(max_size, Image.Resampling.LANCZOS)  # Use high-quality resizing

        # Save to a BytesIO buffer
        img_buffer = io.BytesIO()
        img_format = "JPEG" if convert_to_jpeg else img.format  # Save as JPEG if converting
        img.save(img_buffer, format=img_format, quality=quality, optimize=True)

        # Convert to Base64
        img_buffer.seek(0)
        return base64.b64encode(img_buffer.getvalue()).decode("utf-8")
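
Base64 turns every 3 input bytes into 4 output characters, so the encoded payload is about a third larger than the image file. That is the reason for resizing and compressing before encoding. The sketch below illustrates the arithmetic and the data URL form used when an image is sent to the model. `base64_length` and `to_data_url` are hypothetical helpers, and the payload is placeholder bytes rather than a real image.

```python
import base64
import math


def base64_length(n_bytes):
    """Length of the padded Base64 text produced for n_bytes of input."""
    return 4 * math.ceil(n_bytes / 3)


def to_data_url(raw_bytes, mime="image/jpeg"):
    """Wrap raw bytes in a data URL such as data:image/jpeg;base64,<payload>."""
    encoded = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime};base64,{encoded}"


payload = bytes(range(256)) * 100  # 25,600 placeholder bytes, not a real image
url = to_data_url(payload)
prefix = "data:image/jpeg;base64,"
print(len(payload), len(url) - len(prefix), base64_length(len(payload)))
```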


In [13]:
def get_image_paths(folder):
    image_paths = sorted([
        os.path.join(folder, f) for f in os.listdir(folder) if f.endswith(".png")
    ])
    return image_paths

# Example usage
image_folder = "downloaded_images"
image_files = get_image_paths(image_folder)
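
`sorted` orders file names lexicographically, so a name containing `10` sorts before one containing `2` unless the numbers are zero-padded. Since each response is later reported by image position, the order matters. The naming scheme produced by `save_images` is not shown, so the file names below are hypothetical. The sketch shows a natural-sort key that could be passed to `sorted` if the names turn out not to be zero-padded.

```python
import re


def natural_key(name):
    """Split a name into text and integer chunks so that 2 sorts before 10."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


# Hypothetical file names; the actual naming used by save_images is not shown
names = ["page_10.png", "page_2.png", "page_1.png"]
print(sorted(names))                   # lexicographic order
print(sorted(names, key=natural_key))  # numeric order
```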

In [14]:
def ask_vlm_one_by_one(image_paths, knowledge_level="intermediate"):
    """Send each slide image to the VLM and return one exam question per image."""
    # Use the first model with "Qwen/" in its name
    for model in openai_client.models.list().data:
        model_name = model.id
        if "Qwen/" in model_name:
            break

    system_prompt = f"You are an AI assistant that analyzes slide presentations and creates a set of Exam Questions from them. Formulate the questions depending on the knowledge level, to make it easier or more detailed. The level is {knowledge_level}."
    prompt = "Take a look at the Slide and suggest an Exam Question for College Students concerning the topic of the current Slide. Output ONLY the Question, no additional information or explanations. Also output EXACTLY one Question per Image."
    
    responses = [] 

    for img_path in image_paths:
        base64_image = encode_image(img_path)  # Convert image to Base64

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}
            ]}
        ]

        # Send request
        response = openai_client.chat.completions.create(
            model=model_name,
            messages=messages
        )

        # Store response
        responses.append(response.choices[0].message.content)

    return responses 
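
The function sends one request per slide, so a single failed request would raise and discard every response collected so far. One possible way to keep the results aligned with the slides is sketched below. `ask_each` is a hypothetical helper, `fake_ask` stands in for the real API call, and catching every `Exception` is a deliberately broad assumption that may need narrowing.

```python
def ask_each(items, ask, placeholder=None):
    """Call `ask` for each item; store `placeholder` on failure instead of aborting."""
    results = []
    for item in items:
        try:
            results.append(ask(item))
        except Exception as exc:  # broad on purpose in this sketch
            print(f"Request failed for {item}: {exc}")
            results.append(placeholder)
    return results


def fake_ask(path):
    """Stand-in for the VLM request; fails for paths containing 'bad'."""
    if "bad" in path:
        raise RuntimeError("simulated failure")
    return f"Question for {path}"


results = ask_each(["slide_1.png", "bad_slide.png", "slide_3.png"], fake_ask)
print(results)
```

Because a placeholder is stored for each failure, the position of every entry still matches the slide number.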



In [18]:
responses = ask_vlm_one_by_one(image_files, "beginner")
for i, res in enumerate(responses):
    print(f"Response for Image {i+1}: {res}")

Response for Image 1: What are the key components of research data management as discussed in the presentation?
Response for Image 2: What are the key stages in the RDM Life Cycle and what are their respective responsibilities?
Response for Image 3: What is the importance of licensing in the RDM life cycle and how does it impact the next cycle or acquisition?
Response for Image 4: What are the key components of good RDM (Research Data Management) as outlined in the slide?
Response for Image 5: What are the key responsibilities of a domain specialist in the context of scientific questions related to the physical world?
Response for Image 6: What are the key components of a Data Management Plan (DMP)?
Response for Image 7: What are the key responsibilities and procedures that should be defined early in a Data Management Plan (DMP)?
Response for Image 8: What is the difference between archiving and backup?
Response for Image 9: What is the difference between a role and a job profile in th

## Second, all questions are passed to an LLM, which condenses them into a smaller set of questions covering the most important points.

In [19]:
def ask_for_summary(responses_per_slide):
    openai_client = OpenAI(base_url="https://llm.scads.ai/v1", api_key=my_api_key)

    # Find model with "llama" in name
    for model in openai_client.models.list().data:
        model_name = model.id
        if "Llama" in model_name:
            break
    
    number_of_slides = len(responses_per_slide)
    max_questions = max(1, number_of_slides // 3)  # Whole number: about one question per three slides
    
    questions = ", ".join(responses_per_slide)
    
    prompt = f"""
    Here is a list of Exam Questions: {questions}, relating to one specific deck of presentation Slides. 
    Each Slide got converted into a question. Go through the list and extract up to {max_questions} reasonable Exam Questions for College Students on that specific topic.
    Output the Questions as a numbered list and output nothing else than those questions (no extra explanation, etc.):
    1. Question 1
    2. Question 2
    ...

    Try to stick as closely to the topics from the list as possible and extract the key points into new questions.
    """
    
    messages = [
        {"role": "user", "content": [
            {"type": "text", "text": prompt}
        ]}
    ]

    # Send request
    response = openai_client.chat.completions.create(model=model_name, messages=messages)

    return response.choices[0].message.content
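
The first cell defines `token_limit = 131000` for the prompt, but nothing checks the joined questions against it. A rough sketch of such a check follows. The ratio of about 4 characters per token is a crude heuristic and an assumption, not the model's real tokenizer, and `estimate_tokens`, `fits_budget` and the `reserve` parameter are hypothetical names.

```python
def estimate_tokens(text, chars_per_token=4):
    """Crude token estimate; the characters-per-token ratio is an assumption."""
    return -(-len(text) // chars_per_token)  # ceiling division


def fits_budget(text, token_limit=131000, reserve=2000):
    """True if the estimated prompt size leaves `reserve` tokens for the reply."""
    return estimate_tokens(text) <= token_limit - reserve


question = "What is the difference between archiving and backup?"
print(estimate_tokens(question), fits_budget(question))
```

For an exact count, the tokenizer of the selected model would be needed; the estimate only flags prompts that are clearly too long.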

In [20]:
summary = ask_for_summary(responses)
print(summary)

1. What are the key components of research data management?
2. What are the stages in the RDM Life Cycle and their respective responsibilities?
3. What is the importance of licensing in the RDM life cycle?
4. What are the key components of a good Research Data Management plan?
5. What is the role of a domain specialist in scientific research?
6. What are the key components of a Data Management Plan?
7. What is the difference between archiving and backup?
8. What is the role and job profile of a Data Steward?
9. What are the different types of content that can be shared and licensed?
10. What is the difference between Reproducibility and Replicability in scientific experiments?
11. What are the potential long-term impacts on research software sustainability?
12. What are the FAIR principles for sharing data?
13. What are the four FAIR principles for making data findable?
14. What does "resolution" in microscopy imaging describe?
15. What are the different types of metadata and their pur
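
The summary comes back as a single string. To reuse the questions individually, the numbered lines can be split into a list. A minimal sketch, assuming the model follows the requested `1. Question` format; `parse_numbered_questions` is a hypothetical helper and the sample reuses two questions from the output above.

```python
import re


def parse_numbered_questions(text):
    """Return the question text of every line of the form '<number>. <question>'."""
    questions = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+\.\s+(.*\S)", line)
        if match:
            questions.append(match.group(1))
    return questions


sample = """1. What are the key components of research data management?
7. What is the difference between archiving and backup?"""
for question in parse_numbered_questions(sample):
    print(question)
```

Lines that do not start with a number and a period are ignored, so stray explanations from the model do not end up in the list.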

## Third, questions can also be created for a different knowledge level

In [21]:
# Test if changing the knowledge level works
responses = ask_vlm_one_by_one(image_files, "expert")
summary = ask_for_summary(responses)

In [22]:
print(summary)

1. What are the key components of research data management in the context of scalable data analytics and artificial intelligence?
2. What are the key stages involved in the RDM Life Cycle and their respective responsibilities?
3. What is the importance of licensing in the RDM life cycle and its impact on the next cycle or acquisition?
4. What are the key components of good Research Data Management as outlined in the slide?
5. What are the key responsibilities of a data analyst and how do they differ from those of a domain specialist and an IT specialist?
6. What are the key components of a Data Management Plan?
7. What is the difference between archiving and backup in the context of regularly copying files to a remote place?
8. What are the key responsibilities and skills required for a Data Scientist role?
9. What is the role and job profile of a Data Steward?
10. What are the different types of content that can be shared and licensed according to the slide?
11. What is the difference