# Objective 5: Survey Question Generator (Notebook)

This notebook implements the generator design for Objective 5. It provides:
- a TF-IDF based retrieval pipeline that matches requirement text to questions,
- a greedy constraint-based selector that filters candidates by category and difficulty,
- optional hooks for sentence-transformer embeddings for improved semantic matching.

Note: the code cells are written to be runnable but have not been executed in this notebook. See the 'Next steps' section at the end to run them locally.

## Roadmap & Requirements (short)

The generator accepts a structured `requirements` object, for example:

````python
requirements = {
    'question_count': 10,
    'categories': ['service', 'satisfaction'],
    'difficulty_range': [1, 3],  # inclusive min/max
    'lang': 'en'
}
````

Design goals: select unique questions, respect category and difficulty constraints, avoid near-duplicates, and favor high semantic-match to the requirement text.
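
Because every later step reads these keys, it can help to normalize the `requirements` object up front. The sketch below is one possible way to do that. `normalize_requirements` is a hypothetical helper, not part of this notebook's code, and its defaults (10 questions, `'en'`) simply mirror the ones used in `generate_questionnaire`.

```python
from typing import Any, Dict, List, Optional, Tuple


def normalize_requirements(req: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `req` with defaults filled in and basic checks applied.

    Hypothetical helper for illustration; key names follow the example above.
    """
    count = int(req.get('question_count', 10))
    if count <= 0:
        raise ValueError('question_count must be a positive integer')

    # Treat an empty list the same as "no category constraint"
    categories: Optional[List[str]] = req.get('categories') or None

    diff: Optional[Tuple[float, float]] = None
    if req.get('difficulty_range'):
        low, high = req['difficulty_range']
        if low > high:
            raise ValueError('difficulty_range must be [min, max] with min <= max')
        diff = (float(low), float(high))

    return {
        'question_count': count,
        'categories': categories,
        'difficulty_range': diff,
        'lang': req.get('lang', 'en'),
        'requirement_text': (req.get('requirement_text') or '').strip(),
    }


example = normalize_requirements({
    'question_count': 10,
    'categories': ['service', 'satisfaction'],
    'difficulty_range': [1, 3],
    'lang': 'en',
})
print(example['difficulty_range'])
```

Failing early on a malformed `difficulty_range` keeps the error close to the caller instead of surfacing later as an empty selection.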

In [None]:
# Imports and optional dependencies
import json
from typing import List, Dict, Any, Optional, Tuple
import math
from collections import Counter, defaultdict

# sklearn TF-IDF for baseline semantic retrieval
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Local database API in this repo
from survey_database import SurveyDatabase

# Optional: sentence-transformers (better semantic matching).
# We'll try to import but keep it optional; if not present, TF-IDF will be used.
try:
    from sentence_transformers import SentenceTransformer
    SENTENCE_TRANSFORMER_AVAILABLE = True
except Exception:
    SentenceTransformer = None
    SENTENCE_TRANSFORMER_AVAILABLE = False

# Helper: normalize text for vectorization (lightweight)
def prepare_text(s: Optional[str]) -> str:
    if s is None: return ''
    return str(s).strip().lower()

In [None]:
# Build a corpus from the database and create vector representations
def build_corpus_and_vectorizer(db: SurveyDatabase, lang: str = 'en') -> Tuple[List[Dict[str, Any]], Optional[TfidfVectorizer], Any]:
    """
    Returns: (questions_list, vectorizer, matrix_or_embeddings)
    - questions_list: list of question dicts (same order as vectors)
    - vectorizer: a trained TfidfVectorizer (or None if using embeddings)
    - matrix_or_embeddings: TF-IDF matrix or embeddings ndarray
    """
    questions = db.get_all_questions()
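    # Filter by language when `lang` is given. A question is kept if it has no tags,
    # if `lang` appears in its tags, or if it has a `question_type` set. The last
    # clause keeps the filter permissive because many items have no language recorded.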
    if lang:
        filtered = [q for q in questions if q.get('tags') is None or lang in q.get('tags') or q.get('question_type') ]
    else:
        filtered = questions

    # Create a `text` field used for vectorization
    corpus = []
    for q in filtered:
        text = q.get('question_text') or q.get('text') or ''
        opts = q.get('options') or q.get('options_text') or ''
        # combine text and options to give weight to option-based questions
        combined = prepare_text(text) + ' ' + prepare_text(str(opts))
        corpus.append({'q': q, 'text': combined})

    texts = [c['text'] for c in corpus]

    # Use sentence-transformer embeddings if available for better semantic similarity
    if SENTENCE_TRANSFORMER_AVAILABLE:
        model = SentenceTransformer('all-MiniLM-L6-v2')
        embeddings = model.encode(texts, show_progress_bar=False)
        return corpus, None, embeddings

    # Fallback to TF-IDF
    vectorizer = TfidfVectorizer(stop_words='english')
    matrix = vectorizer.fit_transform(texts)
    return corpus, vectorizer, matrix
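
Note that the embedding path constructs `SentenceTransformer('all-MiniLM-L6-v2')` in this function and again in `rank_questions`, so the model is loaded more than once per request. One possible remedy, sketched here under the assumption that a single process-wide instance per model name is acceptable, is a small cached factory. `get_encoder` is a hypothetical name and is not part of this notebook.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_encoder(model_name: str = 'all-MiniLM-L6-v2'):
    """Load a SentenceTransformer once per model name and reuse it afterwards.

    The import is done lazily so that this code still imports when the
    optional `sentence-transformers` package is not installed.
    """
    from sentence_transformers import SentenceTransformer
    return SentenceTransformer(model_name)


# Possible usage inside the notebook's functions (only when the package is available):
#   embeddings = get_encoder().encode(texts, show_progress_bar=False)
#   req_emb = get_encoder().encode([req], show_progress_bar=False)
```

Calling `get_encoder()` the same way at both sites matters, because `lru_cache` keys on the arguments exactly as passed.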

In [None]:
# Rank questions by semantic similarity to the requirement text
def rank_questions(corpus, matrix_or_embeddings, vectorizer, requirement_text: str, top_k: int = 200):
    req = prepare_text(requirement_text)

    if SENTENCE_TRANSFORMER_AVAILABLE and vectorizer is None:
        # matrix_or_embeddings is embeddings ndarray
        model = SentenceTransformer('all-MiniLM-L6-v2')
        req_emb = model.encode([req], show_progress_bar=False)
        sims = cosine_similarity(req_emb, matrix_or_embeddings).flatten()
    else:
        # TF-IDF path
        req_vec = vectorizer.transform([req])
        sims = cosine_similarity(req_vec, matrix_or_embeddings).flatten()

    # Pair each corpus index with its similarity and keep the top_k best matches
    scored = [(idx, float(score)) for idx, score in enumerate(sims)]
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]

# Greedy selector: walks candidates in score order and keeps those that pass the category and difficulty filters
def select_with_constraints(corpus, scored_indices, question_count: int, categories: Optional[List[str]] = None, difficulty_range: Optional[List[float]] = None):
    selected = []
    used_texts = set()
    cat_counter = Counter()

    for idx, score in scored_indices:
        if len(selected) >= question_count:
            break
        q = corpus[idx]['q']
        q_text = (q.get('question_text') or q.get('text') or '').strip()

        # Skip duplicates by exact text
        if q_text in used_texts:
            continue

        # Category constraint (if provided)
        if categories:
            q_cat = q.get('category') or q.get('original_category') or 'Uncategorized'
            if q_cat not in categories:
                continue

        # Difficulty constraint (if provided); a missing difficulty defaults to 2
        if difficulty_range:
            diff = q.get('difficulty') or q.get('difficulty_score') or 2
            if isinstance(diff, str):
                try:
                    diff = float(diff)
                except ValueError:
                    diff = 2  # non-numeric label: fall back to the default
            if diff < difficulty_range[0] or diff > difficulty_range[1]:
                continue

        # Passed filters, add to selected set
        selected.append({'question': q, 'score': score})
        used_texts.add(q_text)
        # Tally per category (informational only; it is not used as a quota)
        cat_counter[q.get('category') or q.get('original_category') or 'Uncategorized'] += 1

    return selected
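
The selector above filters by category membership but does not cap how many questions each category contributes. If per-category quotas are wanted, a minimal sketch could look like the following. `select_with_quotas` and the `quotas` mapping are hypothetical. The function takes plain `(question, score)` pairs that are already sorted by descending score.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def select_with_quotas(
    ranked: List[Tuple[Dict[str, Any], float]],
    quotas: Dict[str, int],
    question_count: int,
) -> List[Dict[str, Any]]:
    """Greedily pick questions in score order without exceeding any category quota.

    Questions whose category is not listed in `quotas` are skipped.
    """
    picked: List[Dict[str, Any]] = []
    per_category: Counter = Counter()
    for question, _score in ranked:
        if len(picked) >= question_count:
            break
        category = question.get('category') or 'Uncategorized'
        if per_category[category] >= quotas.get(category, 0):
            continue  # quota for this category is already full (or zero)
        picked.append(question)
        per_category[category] += 1
    return picked


# Toy data, sorted by descending score
ranked = [
    ({'id': 1, 'category': 'service'}, 0.9),
    ({'id': 2, 'category': 'service'}, 0.8),
    ({'id': 3, 'category': 'satisfaction'}, 0.7),
    ({'id': 4, 'category': 'service'}, 0.6),
    ({'id': 5, 'category': 'pricing'}, 0.5),
]
chosen = select_with_quotas(ranked, {'service': 2, 'satisfaction': 1}, question_count=3)
print([q['id'] for q in chosen])  # [1, 2, 3]
```

This stays greedy: a high-scoring question is never displaced later, so the result may fall short of `question_count` when the quotas are tight.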

In [None]:
# High-level generation API
def generate_questionnaire(requirements: Dict[str, Any], db_path: str = 'convert_data.json') -> Dict[str, Any]:
    """
    requirements keys:
      - question_count (int)
      - categories (Optional[List[str]])
      - difficulty_range (Optional[List[min,max]])
      - lang (Optional[str])
      - requirement_text (Optional[str]) -- short natural language description
    Returns a dict with 'question_ids' and 'questions' (list of question dicts).
    """
    db = SurveyDatabase(db_path)
    lang = requirements.get('lang', 'en')
    question_count = int(requirements.get('question_count', 10))
    categories = requirements.get('categories')
    difficulty_range = requirements.get('difficulty_range')
    req_text = requirements.get('requirement_text', '')

    corpus, vectorizer, matrix_or_embeddings = build_corpus_and_vectorizer(db, lang=lang)

    # If no requirement text is provided, skip ranking: every question becomes a
    # candidate in corpus order with a neutral score (no usage-based or random ordering yet)
    if not req_text:
        candidates = [(i, 0.0) for i in range(len(corpus))]
    else:
        candidates = rank_questions(corpus, matrix_or_embeddings, vectorizer, req_text, top_k=1000)

    selected = select_with_constraints(corpus, candidates, question_count, categories, difficulty_range)

    # Build return structure
    question_ids = []
    questions_out = []
    for it in selected:
        q = it['question']
        question_ids.append(q.get('id') or q.get('question_id'))
        questions_out.append(q)

    return {'question_ids': question_ids, 'questions': questions_out}

# Example usage (not executed here)
# req = {'question_count': 10, 'categories': ['satisfaction','service'], 'difficulty_range':[1,3], 'lang':'en', 'requirement_text':'measure customer satisfaction after check-in experience'}
# result = generate_questionnaire(req, db_path='convert_data.json')
# print(result['question_ids'])
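
When `requirement_text` is empty, `generate_questionnaire` takes candidates in plain corpus order. An alternative is to prefer less-used questions or to sample randomly. The sketch below shows one way to do either. It assumes each question dict may carry a `usage_count` field, which this notebook does not confirm, and `fallback_candidates` is a hypothetical helper.

```python
import random
from typing import Any, Dict, List, Optional, Tuple


def fallback_candidates(
    corpus: List[Dict[str, Any]],
    shuffle_seed: Optional[int] = None,
) -> List[Tuple[int, float]]:
    """Order corpus indices for the no-requirement-text case.

    With `shuffle_seed` set, indices are shuffled reproducibly; otherwise they
    are sorted by ascending `usage_count` (missing counts are treated as 0).
    """
    indices = list(range(len(corpus)))
    if shuffle_seed is not None:
        random.Random(shuffle_seed).shuffle(indices)
    else:
        # sorted() is stable, so ties keep their original corpus order
        indices = sorted(indices, key=lambda i: corpus[i]['q'].get('usage_count') or 0)
    return [(i, 0.0) for i in indices]


# Same shape as the corpus built above: one dict per question under the 'q' key
toy_corpus = [
    {'q': {'id': 'a', 'usage_count': 5}},
    {'q': {'id': 'b'}},                    # never used
    {'q': {'id': 'c', 'usage_count': 2}},
]
print(fallback_candidates(toy_corpus))  # [(1, 0.0), (2, 0.0), (0, 0.0)]
```

The returned `(index, score)` pairs have the same shape as the output of `rank_questions`, so they could be passed to `select_with_constraints` unchanged.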

## Next steps (what you should run locally)

- Install required packages:
```bash
pip install scikit-learn sentence-transformers
```
- `sentence-transformers` is optional. Install it for better semantic matching; if it is not installed, TF-IDF is used automatically.
- Open this notebook in Jupyter or run the `generate_questionnaire` function from a Python script. Example:
```python
from Q5_Codes_vLST import generate_questionnaire  # if exported as a module; otherwise copy the function into your script
req = {'question_count': 10, 'categories': ['satisfaction'], 'difficulty_range': [1, 3], 'lang': 'en', 'requirement_text': 'post-checkin satisfaction'}
res = generate_questionnaire(req, db_path='convert_data.json')
print(len(res['questions']))
```
- Validate the output and add more selection constraints if you need them (e.g., avoid overlapping keywords between selected questions).
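
As one example of such an extra constraint, the sketch below drops a question when its keywords overlap too much with a question that was already kept. It uses token-set Jaccard similarity rather than embeddings. `drop_near_duplicates`, the tiny stop-word list, and the 0.6 threshold are all illustrative assumptions that would need tuning on real data.

```python
import re
from typing import List, Set

# Tiny illustrative stop-word list; a real one would be larger
STOP_WORDS = {'a', 'an', 'the', 'is', 'are', 'was', 'were', 'you', 'your',
              'with', 'how', 'our', 'of', 'to'}


def keywords(text: str) -> Set[str]:
    """Lower-case the text and return its set of non-stop-word tokens."""
    return {t for t in re.findall(r'[a-z0-9]+', text.lower()) if t not in STOP_WORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Intersection size divided by union size; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_near_duplicates(texts: List[str], threshold: float = 0.6) -> List[str]:
    """Keep texts in order, skipping any that is too similar to one already kept."""
    kept: List[str] = []
    kept_keywords: List[Set[str]] = []
    for text in texts:
        kw = keywords(text)
        if any(jaccard(kw, other) >= threshold for other in kept_keywords):
            continue
        kept.append(text)
        kept_keywords.append(kw)
    return kept


questions = [
    'How satisfied are you with the check-in process?',
    'How satisfied were you with our check-in process?',
    'Was the staff friendly?',
]
print(drop_near_duplicates(questions))
```

Here the second question is dropped because, after stop-word removal, it has the same keyword set as the first. Because texts are processed in order, passing them sorted by score keeps the best-scoring member of each near-duplicate group.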

Possible follow-up work: a stricter deduplication step (embedding-based clustering) and unit tests that validate the category and difficulty constraints.
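
A unit test for these constraints could start from a small checker like the one below. This is only a sketch: `find_violations` is a hypothetical helper, and it assumes questions expose `question_text`, `category`, and a numeric `difficulty`, as `select_with_constraints` does.

```python
from typing import Any, Dict, List, Optional, Sequence


def find_violations(
    questions: List[Dict[str, Any]],
    categories: Optional[Sequence[str]] = None,
    difficulty_range: Optional[Sequence[float]] = None,
) -> List[str]:
    """Return a human-readable description of every constraint violation found."""
    problems: List[str] = []
    seen_texts = set()
    for q in questions:
        qid = q.get('id')
        text = (q.get('question_text') or '').strip()
        if text in seen_texts:
            problems.append(f'{qid}: duplicate text')
        seen_texts.add(text)
        if categories and q.get('category') not in categories:
            problems.append(f'{qid}: category {q.get("category")!r} not allowed')
        if difficulty_range:
            low, high = difficulty_range
            diff = float(q.get('difficulty', 2))  # same default as the selector
            if not low <= diff <= high:
                problems.append(f'{qid}: difficulty {diff} outside [{low}, {high}]')
    return problems


sample = [
    {'id': 1, 'question_text': 'Rate the service.', 'category': 'service', 'difficulty': 2},
    {'id': 2, 'question_text': 'Rate the price.', 'category': 'pricing', 'difficulty': 5},
]
assert find_violations(sample[:1], ['service'], [1, 3]) == []
print(find_violations(sample, ['service'], [1, 3]))
```

In a test suite, the same check could be applied to `generate_questionnaire(...)['questions']` with an assertion that the returned list is empty.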