# Question Filtering and SDG Classification

## Overview

This notebook describes the **filtering** and **classification** of the questions generated in the previous stage.  
The objective is to ensure question quality and assign each valid question to one or more **Sustainable Development Goals (SDGs)**.

## Filtering Process

Questions are evaluated according to **four metrics** defined in the referenced paper.  
Only those satisfying **all four criteria** are retained, while the others are discarded.  

A function is also provided to **automatically apply** the filtering process to all files within a folder.  
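
The retention rule described above can be sketched as follows. This is a minimal illustration, assuming each question comes with a list of four 0/1 scores where 1 means the question passes that metric; the helper name `passes_all_metrics` and the sample scores are hypothetical and not part of the notebook.

```python
def passes_all_metrics(scores, expected_len=4):
    """Return True only if there are exactly `expected_len` scores and all equal 1."""
    return len(scores) == expected_len and all(s == 1 for s in scores)


# Hypothetical scores for three questions (not real model output).
examples = {
    "q1": [1, 1, 1, 1],  # passes every metric, so it is kept
    "q2": [1, 0, 1, 1],  # fails the second metric, so it is discarded
    "q3": [1, 1, 1],     # malformed score list, so it is discarded
}

kept = [name for name, scores in examples.items() if passes_all_metrics(scores)]
print(kept)
```

Checking the length explicitly means a truncated score list cannot be mistaken for a pass.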

## SDG Classification

Filtered questions are **classified into SDGs** to support further analysis and visualization.  
This process can also be **executed in batch mode** to handle multiple files simultaneously.
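
As an illustration of what the classification output represents, the sketch below turns a 17-element 0/1 vector into the names of the matching SDGs. The goal names follow the order used later in this notebook; the helper `scores_to_sdg_names` is a hypothetical convenience function and an assumption of this example.

```python
SDG_NAMES = [
    "No Poverty", "Zero Hunger", "Good Health and Well-being", "Quality Education",
    "Gender Equality", "Clean Water and Sanitation", "Affordable and Clean Energy",
    "Decent Work and Economic Growth", "Industry, Innovation, and Infrastructure",
    "Reduced Inequality", "Sustainable Cities and Communities",
    "Responsible Consumption and Production", "Climate Action", "Life Below Water",
    "Life on Land", "Peace, Justice, and Strong Institutions", "Partnerships for the Goals",
]


def scores_to_sdg_names(scores):
    """Map a 0/1 vector (position i corresponds to SDG i+1) to the selected SDG names."""
    if len(scores) != len(SDG_NAMES):
        raise ValueError(f"Expected {len(SDG_NAMES)} scores, got {len(scores)}")
    return [name for name, score in zip(SDG_NAMES, scores) if score == 1]


# Hypothetical vector marking SDG 1 and SDG 8 as relevant.
example_scores = [0] * 17
example_scores[0] = 1
example_scores[7] = 1
print(scores_to_sdg_names(example_scores))
```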

## Configuration

At the beginning of the notebook, update the **path variables** to specify the input and output directories used throughout the workflow.
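
One possible way to keep these variables consistent is to derive every path from a single base directory. The sketch below is only a suggestion: the use of `pathlib`, the variable names `base_dir` and `split`, and reading the key from an `OPENAI_API_KEY` environment variable are assumptions of this example, while the directory and file names are the ones used in the configuration cell.

```python
import os
from pathlib import Path

# Base directory and data split shared by all stages (names assumed for this sketch).
base_dir = Path("./Results/Questions/Test questions different prompts")
split = "Dev set"
file_name = "Indonesia_Floods and volcanic activity in Indonesia-Week 20 2024-prompt-1"

# Stage directories, built from the shared base.
questions_dir = base_dir / "1-Questions generated" / split
filtered_dir = base_dir / "3-Filtered questions" / split

# Single-file paths follow the same naming pattern as the configuration cell.
question_file = questions_dir / f"questions-{file_name}.json"
filtered_file = filtered_dir / f"final_questions-{file_name}.json"

# Read the API key from the environment instead of hard-coding it (assumption).
openaikey = os.environ.get("OPENAI_API_KEY", "")

print(question_file)
print(filtered_file)
```

Deriving the paths this way means a change of data split or prompt folder only has to be made in one place.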



In [None]:

file_name = "Indonesia_Floods and volcanic activity in Indonesia-Week 20 2024-prompt-1"
country = "Indonesia"
question_path_one_file = f"./Results/Questions/Test questions different prompts/1-Questions generated/Dev set/questions-{file_name}.json"
questions_path = "./Results/Questions/Test questions different prompts/1-Questions generated/Dev set"
filtered_questions_path = "./Results/Questions/Test questions different prompts/3-Filtered questions/Dev set"
filtered_questions_path_one_file = f"./Results/Questions/Test questions different prompts/3-Filtered questions/Dev set/final_questions-{file_name}.json"

filtered_questions_path_SDGs = "./Results/Questions/Test questions different prompts/4-Filtred questions with SDGs/Dev set/"
filtered_questions_path_SDGs_one_file = f"./Results/Questions/Test questions different prompts/4-Filtred questions with SDGs/Dev set/final_questions-SDGs-{file_name}.json"


openaikey = ""


In [None]:
import re
import json
import openai
import numpy as np


def call_openai(prompt):
    """Send a single-turn prompt to gpt-4o and return the reply text."""
    client = openai.OpenAI(api_key=openaikey)
    # Request a chat completion for the prompt
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content



# Metrics to remove irrelevant questions


In [None]:
def evaluate_question(question, country):
  relevance_prompt = f"""
    You are an AI filter evaluating questions for a humanitarian situational report (SitRep) focused on {country}.
  Analyze this question: "{question}" and evaluate the following criteria:

  
  1.  **Not Specific to {country}:** Does the question mention another place or country?
  2.  **Too Political:** Does it focus heavily on political causes, express strong opinions, assign blame, propose political solutions or strategies, or exhibit bias instead of focusing on neutral humanitarian impact and response?
  3.  **Long term/Historical:** Does the question focus on cumulative past events or speculative future scenarios, rather than immediate, actionable issues? Discard if it lacks clear relevance to the current humanitarian context or if it talks about a long term project. If there is a specific period mentioned in the question it's likely that you should discard it. 
  4.  **Too general/too specific:** Is it overly broad, abstract, or too specific? Even if the question is not related to the given country, it may still have a good generality level. 
  Each score should be evaluated independently from the others. 
  
  Based strictly on these rules, respond with a JSON object in this format:
    {{
      "score": [0, 0, 0, 0],
      "reason": ["", "", "", ""]
    }}
    
  For the "score" array:
  - Provide a 0 if the question meets the corresponding discard criterion.
  - Provide a 1 if the question *does not* meet the corresponding discard criterion (meaning it's acceptable for that metric).

  For the "reason" array:
  - If a score for a criterion is 0, provide a brief explanation of why the question failed that specific criterion in the corresponding position in the "reason" array.
  - If a score for a criterion is 1, leave the corresponding string in the "reason" array empty.
    """
  # Alternative backend: response = call_google_gemini(relevance_prompt)
  response = call_openai(relevance_prompt)
  response = re.sub(r"```(?:json)?", "", response).strip().strip("`")

  try:
      result = json.loads(response)
      # Validate the structure of the reply: a JSON object with a 4-element "score" list
      if not isinstance(result, dict) or not isinstance(result.get("score"), list) or len(result["score"]) != 4:
          print(f"Warning: JSON response does not contain a 4-element 'score' array: {result}")
          # Treat a malformed reply as failing all four metrics
          return {"score": [0, 0, 0, 0], "reason": ["Invalid 'score' array structure from AI response."] * 4}

      return result
  except json.JSONDecodeError as e:
      print("JSON parse error:", e)
      # When parsing fails, return a default structure with all scores as 0
      return {"score": [0, 0, 0, 0], "reason": ["Could not parse JSON structure from AI response."] * 4}
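
The parsing step in `evaluate_question` can be exercised without calling the API. The sketch below is a minimal, offline illustration of the same idea (strip Markdown code fences, parse the JSON, validate the score list); the helper name `parse_filter_response`, the fallback message, and the sample reply are assumptions made for this example.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker that models often wrap JSON in


def failed_result(message):
    """Fallback result: the question fails all four metrics."""
    return {"score": [0, 0, 0, 0], "reason": [message] * 4}


def parse_filter_response(raw):
    """Strip code fences, parse the JSON reply, and validate the 4-element score list."""
    cleaned = re.sub(FENCE + r"(?:json)?", "", raw).strip()
    try:
        result = json.loads(cleaned)
    except json.JSONDecodeError:
        return failed_result("Could not parse JSON.")
    scores = result.get("score") if isinstance(result, dict) else None
    if not isinstance(scores, list) or len(scores) != 4:
        return failed_result("Invalid 'score' array.")
    if any(s not in (0, 1) for s in scores):
        return failed_result("Scores must be 0 or 1.")
    return result


# Hypothetical model reply wrapped in a fenced JSON block.
sample = FENCE + 'json\n{"score": [1, 0, 1, 1], "reason": ["", "Assigns blame", "", ""]}\n' + FENCE
parsed = parse_filter_response(sample)
print(parsed["score"])
print(parse_filter_response("not json")["score"])
```

Returning the same dictionary shape in every branch keeps the caller free of special cases: a malformed reply simply looks like a question that failed every metric.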


 
  

# Filtering process

## Filtering one file 

In [None]:


with open(question_path_one_file, 'r') as file:
    questions_data = json.load(file)

In [None]:
filtered_questions = {}

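for key, item in questions_data.items():
    fquestions = []
    # Fall back to an empty list when a section has no 'picked_questions'
    questions = item.get('picked_questions') or []
    for question in questions:
        # Drop the two-character prefix in front of each question
        # (assumed to be a list marker such as "- ")
        question = question[2:]
        print(question)
        result = evaluate_question(question, country)
        scores = result['score']
        reasons = result['reason']
        print(scores)
        print(reasons)

        # Keep the question only if it passes all four metrics
        if np.sum(scores) == 4:
            fquestions.append(question)

    filtered_questions[key] = fquestions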

In [None]:
filtered_questions

## Save file 

In [None]:

with open(filtered_questions_path_one_file, 'w') as f:
    json.dump(filtered_questions, f)

In [None]:
filtered_questions


## Filter questions in a folder

In [None]:

# import json
# import os
# import numpy as np
# from pathlib import Path 



# def process_files_in_folder(input_folder_path_str, output_folder_path_str):
#     """
#     Processes all JSON files in the input folder, filters questions,
#     and saves them to the output folder.
#     """
#     input_folder = Path(input_folder_path_str)
#     output_folder = Path(output_folder_path_str)
#     print(input_folder)
#     output_folder.mkdir(parents=True, exist_ok=True)

#     for input_file_path in input_folder.glob('*.json'):
#         file_name_with_extension = input_file_path.name
#         file_name_stem = input_file_path.stem

#         print(f"\n--- Processing file: {file_name_with_extension} ---")

#         try:
#             country = file_name_stem.split('_')[0]
#             if not country:
#                 print(f"Warning: Could not extract country from filename: {file_name_with_extension}. Skipping.")
#                 continue
#         except IndexError:
#             print(f"Warning: Could not extract country from filename (no '_'): {file_name_with_extension}. Skipping.")
#             continue

#         print(f"Extracted Country: {country}")

#         try:
#             with open(input_file_path, 'r', encoding='utf-8') as file:
#                 questions_data = json.load(file)
#         except json.JSONDecodeError:
#             print(f"Error: Could not decode JSON from {file_name_with_extension}. Skipping.")
#             continue
#         except Exception as e:
#             print(f"Error reading file {file_name_with_extension}: {e}. Skipping.")
#             continue

#         filtered_questions_for_file = {}
#         for key, item in questions_data.items():
#             fquestions = []
#             questions_list = item.get('picked_questions')
#             if not isinstance(questions_list, list):
#                 continue

#             for question_full_string in questions_list:
#                 if not isinstance(question_full_string, str):
#                     continue
                
#                 # --- MODIFIED LOGIC FOR QUESTION PROCESSING ---
#                 question_processed = "" 
#                 if not question_full_string: # Handle empty string case
#                     print(f"\nOriginal question string: '{question_full_string}'")
#                     print(f"Warning: Original question string is empty. Skipping this question.")
#                     continue 

#                 if question_full_string[0].isalpha():
#                     question_processed = question_full_string
#                 else:
#                     # If not a letter, remove the first two characters.
#                     # Python's slice [2:] gracefully handles strings shorter than 2 chars by returning an empty string.
#                     question_processed = question_full_string[2:]
#                 # --- END OF MODIFIED LOGIC ---

#                 print(f"\nOriginal question string: '{question_full_string}'")
#                 print(f"Processed question: '{question_processed}'")

#                 # Evaluate the processed question
#                 result = evaluate_question(question_processed, country)
#                 scores = result.get('score')
#                 reasons = result.get('reason', 'No reason provided.')

#                 if scores is None or not isinstance(scores, list) or len(scores) != 4:
#                     print(f"Warning: Invalid scores format for question '{question_processed}'. Scores: {scores}. Skipping.")
#                     print(f"Reason: {reasons}")
#                     continue
                
#                 print(f"Scores: {scores}")
#                 print(f"Reason: {reasons}")

#                 try:
#                     if np.sum(scores) == 4:
#                         fquestions.append(question_processed) # Storing the processed question
#                         print("Question PASSED filter.")
#                     else:
#                         print("Question FAILED filter.")
#                 except Exception as e:
#                     print(f"Error during score summation or appending for question '{question_processed}': {e}")
#                     continue
            
#             if fquestions:
#                 filtered_questions_for_file[key] = fquestions

#         output_file_name = f"final_questions-{file_name_stem}.json"
#         filtered_questions_path = output_folder / output_file_name

#         if not filtered_questions_for_file:
#             print(f"No questions passed the filter for {file_name_with_extension}. Output file will be empty or not created if it doesn't contain any keys.")
        
#         try:
#             with open(filtered_questions_path, 'w', encoding='utf-8') as f:
#                 json.dump(filtered_questions_for_file, f, indent=4)
#             print(f"Filtered questions saved to: {filtered_questions_path}")
#         except Exception as e:
#             print(f"Error writing output file {filtered_questions_path}: {e}")

In [None]:

# process_files_in_folder(questions_path, filtered_questions_path)

# SDGs classification

## Import filtered questions:

In [None]:

with open(filtered_questions_path_one_file, 'r') as f: 
    filtered_questions= json.load(f)

In [None]:
filtered_questions

In [None]:
sdg_descriptions = {
    "No Poverty": "Eradicate poverty in all its forms globally, with a focus on ensuring basic human needs such as food, shelter, and clean water, and increasing access to social protection and economic resources.",
    "Zero Hunger": "End hunger and malnutrition by promoting sustainable agriculture, food security, improved nutrition, and equitable access to sufficient, nutritious food year-round.",
    "Good Health and Well-being": "Ensure healthy lives and well-being for all by reducing maternal and child mortality, ending epidemics, improving healthcare systems, and ensuring universal access to health services.",
    "Quality Education": "Provide inclusive, equitable, and high-quality education for all and promote lifelong learning opportunities to ensure literacy, numeracy, and access to skills for sustainable development.",
    "Gender Equality": "Achieve gender equality by eliminating all forms of discrimination, violence, and harmful practices against women and girls, and empowering them through equal opportunities in leadership, education, and the workforce.",
    "Clean Water and Sanitation": "Ensure universal access to clean water and adequate sanitation by improving water quality, managing water resources, reducing pollution, and promoting sustainable practices.",
    "Affordable and Clean Energy": "Ensure universal access to affordable, reliable, and modern energy by increasing renewable energy production, improving energy efficiency, and promoting sustainable energy consumption.",
    "Decent Work and Economic Growth": "Promote sustained, inclusive, and sustainable economic growth by providing productive employment, protecting labor rights, ensuring decent work conditions, and promoting economic equality.",
    "Industry, Innovation, and Infrastructure": "Build resilient infrastructure, promote inclusive and sustainable industrialization, and foster innovation to drive economic growth, technological progress, and sustainable development.",
    "Reduced Inequality": "Reduce income inequality within and among countries by promoting social, economic, and political inclusion, as well as equal opportunities for marginalized and disadvantaged populations.",
    "Sustainable Cities and Communities": "Make cities and human settlements inclusive, safe, resilient, and sustainable by ensuring affordable housing, reducing urban pollution, improving infrastructure, and promoting sustainable urban planning.",
    "Responsible Consumption and Production": "Ensure sustainable consumption and production patterns by reducing waste, increasing recycling, promoting sustainable business practices, and encouraging responsible consumer behavior.",
    "Climate Action": "Take urgent action to combat climate change and its impacts by reducing greenhouse gas emissions, building resilience to climate-related hazards, and integrating climate policies into national development strategies.",
    "Life Below Water": "Conserve and sustainably use the oceans, seas, and marine resources by reducing marine pollution, protecting coastal ecosystems, and promoting sustainable fishing practices.",
    "Life on Land": "Protect, restore, and promote sustainable use of terrestrial ecosystems, manage forests sustainably, combat desertification, halt biodiversity loss, and protect natural habitats.",
    "Peace, Justice, and Strong Institutions": "Promote peaceful and inclusive societies by ensuring access to justice, reducing violence, building accountable institutions, and fostering good governance at all levels.",
    "Partnerships for the Goals": "Strengthen global partnerships for sustainable development by mobilizing resources, sharing knowledge and technology, and promoting international cooperation for achieving the SDGs."
}

In [None]:
from openai import OpenAI
def check_relevance_1sdg(question, sdg, description):
    
    
    
    # CoT Prompt
    prompt = (
        f"I need to analyze whether the following question is related to a specific Sustainable Development Goal (SDG).\n\n"
        f"Question: {question}\n\n"
        f"SDG: {sdg}\n"
        f"Description: {description}\n\n"
        "Your task is to determine if the question directly relates to this SDG. You can use your general knowledge, "
        "along with the provided description, to compare the question to the SDG.\n\n"
        "Please follow these steps:\n"
        "1. Identify key terms in the question.\n"
        "2. Consider how these terms relate to the SDG description.\n"
        "3. Based on your analysis, return only the following:\n"
        "   - Return '1' if the question is directly related to this SDG, or '0' if it is not.\n"
        #"   - The reasoning for your choice.\n"
        
        "Provide your response in this format: "
        " Score: \n "
        "Reason: "
    )
    

    # Send the prompt to the LLM, using the key set in the configuration cell
    client = OpenAI(api_key=openaikey)
    response = client.chat.completions.create(
        model="gpt-4o",  
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    
   
    
        
    return response.choices[0].message.content.strip()
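
`check_relevance_1sdg` returns free text in the `Score: ... Reason: ...` layout requested by the prompt. A minimal sketch of how such a reply could be split into its two parts is shown below; the helper `parse_score_reason` and the sample reply are hypothetical, and a real reply may deviate from the requested layout.

```python
import re


def parse_score_reason(text):
    """Extract the 0/1 score and the free-text reason from a 'Score: / Reason:' reply."""
    score_match = re.search(r"Score:\s*([01])\b", text)
    reason_match = re.search(r"Reason:\s*(.*)", text, flags=re.DOTALL)
    score = int(score_match.group(1)) if score_match else None
    reason = reason_match.group(1).strip() if reason_match else ""
    return score, reason


# Hypothetical reply in the requested format (not real model output).
sample_reply = "Score: 1\nReason: The question asks about access to clean water."
print(parse_score_reason(sample_reply))
```

A score of `None` signals that the reply did not follow the format, so the caller can decide whether to retry or discard it.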

In [None]:
def create_sdg_classification_prompt(question, sdgs_description_str):
    prompt = f"""You are an expert in Sustainable Development Goals (SDGs).

TASK OVERVIEW:
Your job is to determine which SDGs are directly relevant to a given question. This classification will be used to group similar questions.

CLASSIFICATION CRITERIA:
- Score 1: The question directly addresses, mentions, or requires knowledge about this specific SDG's core themes, targets, or indicators
- Score 0: The question does not directly relate to this SDG, even if there might be indirect or tangential connections

EVALUATION PROCESS:
1. Read the question carefully
2. For each of the 17 SDGs, independently assess whether the question directly relates to that SDG's primary focus areas
3. Be precise - only mark as relevant (score 1) if there is a clear, direct connection
4. Avoid marking SDGs as relevant based on weak or indirect associations

QUESTION TO CLASSIFY:
{question}

SDG DESCRIPTIONS:
{sdgs_description_str}

OUTPUT FORMAT:
Return your response as a python array containing exactly 17 integers (0 or 1), where each position corresponds to SDGs 1-17 in order.

Example:
- If the question relates to SDG 1 (No Poverty) and SDG 8 (Decent Work), return: [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- If the question relates only to SDG 3 (Good Health), return: [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Your response must be only the python array, nothing else."""

    return prompt

In [None]:
from openai import OpenAI
import json

def check_relevance_all_sdgs(question, sdgs_dict):
    client = OpenAI(api_key=openaikey)  # Uses the key set in the configuration cell
    
    
    sdgs_description_str = ""
    for sdg_name, sdg_description in sdgs_dict.items():
        sdgs_description_str += f"- **{sdg_name}**: {sdg_description}\n"
    
    prompt = create_sdg_classification_prompt(question, sdgs_description_str)
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": prompt}
        ],
    )
    
    content = response.choices[0].message.content.strip()
    
    # Parse the Python array response
    # Remove any potential backticks or code block markers
    content = content.replace('```python', '').replace('```', '').strip()
    
    # Parse as Python literal (safer than eval)
    import ast
    parsed_results = ast.literal_eval(content)
    
    # Validate that we have exactly 17 scores
    if len(parsed_results) != 17:
        raise ValueError(f"Expected 17 SDG scores, got {len(parsed_results)}")
    
    # Validate that all scores are 0 or 1
    for i, score in enumerate(parsed_results):
        if score not in [0, 1]:
            raise ValueError(f"Invalid score {score} at position {i}. Scores must be 0 or 1.")
    
    return parsed_results
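
The function above assumes the reply contains nothing but the array. If the model adds surrounding text, `ast.literal_eval` fails on the whole string. The sketch below shows one possible, more tolerant variant that first extracts the bracketed list; the helper name `parse_sdg_scores` and the sample reply are assumptions of this example, not part of the notebook.

```python
import ast
import re


def parse_sdg_scores(content, n_goals=17):
    """Find the first flat [...] list in the reply and validate it as n_goals 0/1 scores."""
    match = re.search(r"\[[^\[\]]*\]", content)
    if match is None:
        raise ValueError("No list found in the model response.")
    scores = ast.literal_eval(match.group(0))
    if len(scores) != n_goals:
        raise ValueError(f"Expected {n_goals} SDG scores, got {len(scores)}")
    if any(s not in (0, 1) for s in scores):
        raise ValueError("Scores must be 0 or 1.")
    return list(scores)


# Hypothetical reply with extra text around the array.
expected = [0] * 17
expected[0] = 1
expected[7] = 1
reply = f"Here is the classification: {expected}"
print(parse_sdg_scores(reply))
```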

In [None]:
filtered_questions

In [None]:
sdg_descriptions.keys()

In [None]:
import re



for key, questions in filtered_questions.items():
    updated_questions = []
    for q in questions: 
        result_q = check_relevance_all_sdgs(q, sdg_descriptions)
        # Create the new dictionary structure for each question
        question_data = {
            'question': q,
            'sdg_scores': result_q
        }
        updated_questions.append(question_data)
    
    # Replace the old list of strings with the new list of dictionaries
    filtered_questions[key] = updated_questions
        
        

## Save file with SDGs

In [None]:

with open(filtered_questions_path_SDGs_one_file, 'w') as f:
    json.dump(filtered_questions, f)
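
Once saved, each entry has the shape `{'question': ..., 'sdg_scores': [...]}`. As a sketch of the kind of analysis this structure supports, the code below counts how many questions were assigned to each SDG. The helper names `count_sdg_hits` and `one_hot` and the sample data are made up for illustration.

```python
def count_sdg_hits(classified, n_goals=17):
    """Sum the 0/1 scores per SDG position over every question in every section."""
    totals = [0] * n_goals
    for questions in classified.values():
        for entry in questions:
            for i, score in enumerate(entry["sdg_scores"]):
                totals[i] += score
    return totals


def one_hot(*goal_numbers, n_goals=17):
    """Build a 0/1 vector with ones at the given 1-based SDG numbers."""
    return [1 if i + 1 in goal_numbers else 0 for i in range(n_goals)]


# Made-up classified questions in the same shape as the saved file.
example = {
    "section_a": [
        {"question": "How many people lack safe drinking water?", "sdg_scores": one_hot(6)},
        {"question": "Which health facilities are damaged?", "sdg_scores": one_hot(3, 9)},
    ],
    "section_b": [
        {"question": "Are sanitation services restored?", "sdg_scores": one_hot(6)},
    ],
}

totals = count_sdg_hits(example)
print(totals)
```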

## SDGs classification for folder

In [None]:

import json
import os
import numpy as np
from pathlib import Path 

def classify_sdgs_in_folder(input_folder_path_str, output_folder_path_str):
    """
    Processes all JSON files in the input folder, classifies each question
    against the 17 SDGs, and saves the results to the output folder.
    """
    input_folder = Path(input_folder_path_str)
    output_folder = Path(output_folder_path_str)
    print(input_folder)
    output_folder.mkdir(parents=True, exist_ok=True)

    for input_file_path in input_folder.glob('*.json'):
        file_name_with_extension = input_file_path.name
        file_name_stem = input_file_path.stem

        print(f"\n--- Processing file: {file_name_with_extension} ---")

        # Filtered files are named "final_questions-<Country>_<title>.json", so drop
        # that prefix before reading the country (used here for logging only).
        prefix = "final_questions-"
        source_stem = file_name_stem[len(prefix):] if file_name_stem.startswith(prefix) else file_name_stem
        country = source_stem.split('_')[0]
        if not country:
            print(f"Warning: Could not extract country from filename: {file_name_with_extension}. Skipping.")
            continue

        print(f"Extracted Country: {country}")

        try:
            with open(input_file_path, 'r', encoding='utf-8') as file:
                filtered_questions = json.load(file)
        except json.JSONDecodeError:
            print(f"Error: Could not decode JSON from {file_name_with_extension}. Skipping.")
            continue
        except Exception as e:
            print(f"Error reading file {file_name_with_extension}: {e}. Skipping.")
            continue
        file = filtered_questions.copy()
        for key, questions in filtered_questions.items():
            updated_questions = []
            for q in questions: 
                result_q = check_relevance_all_sdgs(q, sdg_descriptions)
                # Create the new dictionary structure for each question
                question_data = {
                    'question': q,
                    'sdg_scores': result_q
                }
                updated_questions.append(question_data)
            
            # Replace the old list of strings with the new list of dictionaries
            file[key] = updated_questions
        
        
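        # Strip an existing "final_questions-" prefix so the output name matches
        # the single-file pattern final_questions-SDGs-<file_name>.json
        base_stem = file_name_stem[len("final_questions-"):] if file_name_stem.startswith("final_questions-") else file_name_stem
        output_file_name = f"final_questions-SDGs-{base_stem}.json"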
        filtered_questions_path = output_folder / output_file_name

        try:
            with open(filtered_questions_path, 'w', encoding='utf-8') as f:
                json.dump(file, f, indent=4)
            print(f"Filtered questions saved to: {filtered_questions_path}")
        except Exception as e:
            print(f"Error writing output file {filtered_questions_path}: {e}")
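
The batch functions read the country from the file name. Note that `str.split('_')` always returns at least one element, so it never raises `IndexError`, and that a stage prefix such as `questions-` or `final_questions-` ends up in front of the country if it is not removed first. The sketch below shows one way to handle both points; the helper `extract_country` and the prefix list are assumptions based on the file names configured at the top of the notebook.

```python
def extract_country(stem, prefixes=("final_questions-", "questions-")):
    """Return the country part of '<prefix><Country>_<title>', or None if it is missing."""
    for prefix in prefixes:
        if stem.startswith(prefix):
            stem = stem[len(prefix):]
            break
    country, separator, _ = stem.partition("_")
    # Without an underscore there is no reliable way to tell country from title.
    return country if separator and country else None


stems = [
    "questions-Indonesia_Floods and volcanic activity in Indonesia-Week 20 2024-prompt-1",
    "final_questions-Indonesia_Floods and volcanic activity in Indonesia-Week 20 2024-prompt-1",
    "no-underscore-here",
]
for stem in stems:
    print(extract_country(stem))
```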
        
        

In [None]:


# classify_sdgs_in_folder(filtered_questions_path, filtered_questions_path_SDGs)