DOCUMENT TO QUESTION GENERATOR - NLP PARSING UTILITY
A Python-based command-line tool that reads document files, sanitizes the extracted text, and applies rule-based logic to automatically generate contextual study questions.
PROJECT OVERVIEW
Developed as a technical showcase for my SIWES engineering portfolio, this application demonstrates backend data processing and text manipulation. Instead of calling external AI APIs, it uses custom regular expressions (regex) and trigger-word mapping to parse academic or technical text and generate relevant questions programmatically.
KEY FEATURES
Multi-Format File Parsing: Safely extracts text from both raw .txt files and complex .pdf documents using the pypdf library.
Data Sanitization: Cleans the extracted data by collapsing whitespace and stripping page numbers or noisy numeric lines to prevent processing errors.
Rule-Based NLP Logic: Scans each sentence for specific linguistic triggers (e.g., "results in", "functions by") and maps each match to a question category (What, Explain, How, Why).
Dual Data Export: Saves the generated questions into two formats simultaneously: a human-readable .txt file and a structured .json file for database or API integration.
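The sanitization and rule-based steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in document_reader.py: the names clean_text, generate_questions and TRIGGERS, the question templates, and the "is defined as" trigger are all assumptions made for the example. Only "results in" and "functions by" come from the feature list.

```python
import re

# Hypothetical trigger table: (phrase, category, question template).
# The real tool may use different phrases and wording.
TRIGGERS = [
    ("results in", "Why", "Why does {subject} result in {rest}?"),
    ("functions by", "How", "How does {subject} function?"),
    ("is defined as", "What", "What is {subject}?"),
]


def clean_text(raw):
    """Drop numeric-only lines (page numbers and similar) and collapse whitespace."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        # Skip lines such as "12", "- 12 -" or "Page 12".
        if re.fullmatch(r"(?:page\s+)?[-\s\d./]+", stripped, flags=re.IGNORECASE):
            continue
        kept.append(stripped)
    return re.sub(r"\s+", " ", " ".join(kept)).strip()


def generate_questions(text):
    """Return one question dict per sentence that contains a known trigger."""
    questions = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        for phrase, category, template in TRIGGERS:
            match = re.search(re.escape(phrase), sentence, flags=re.IGNORECASE)
            if match is None:
                continue
            subject = sentence[:match.start()].strip()
            rest = sentence[match.end():].strip().rstrip(".!?")
            if not subject:
                continue  # nothing before the trigger to ask about
            questions.append({
                "category": category,
                "question": template.format(subject=subject, rest=rest),
                "source": sentence,
            })
            break  # at most one question per sentence in this sketch
    return questions
```

The templates reuse the sentence text verbatim, so capitalization and grammar in the generated questions are only as good as the input. A fuller implementation would likely normalize the subject before filling the template.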
TECHNICAL STACK
Language: Python 3.x
Core Concepts: File Input/Output, Regular Expressions (re), String Manipulation, Algorithm Design, JSON Serialization.
Dependencies: pypdf (for PDF extraction).
INSTALLATION AND USAGE
Install Dependencies
Open your terminal and install the required PDF library:
    pip install pypdf
Run the Application
    python document_reader.py
How to Use:
Run the script and input the path to your document when prompted.
Alternatively, pass the file path directly in the terminal (e.g., python document_reader.py notes.pdf).
The engine extracts the text, processes the sentences, and writes two files, questions.txt and questions.json, to the same folder.
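The two ways of supplying a path, and the dual export, could look something like the sketch below. The function names resolve_input_path and export_questions, the prompt text, and the shape of each question record (a dict with "category" and "question" keys) are assumptions for illustration. The actual script may structure its output differently.

```python
import json
from pathlib import Path


def resolve_input_path(argv):
    """Use the first command-line argument if given, otherwise prompt for a path."""
    if len(argv) > 1:
        return Path(argv[1])
    return Path(input("Path to document: ").strip())


def export_questions(questions, out_dir="."):
    """Write the same questions to questions.txt and questions.json."""
    out_dir = Path(out_dir)
    txt_path = out_dir / "questions.txt"
    json_path = out_dir / "questions.json"

    # Human-readable form: one numbered question per line.
    lines = [
        f"{number}. [{q['category']}] {q['question']}"
        for number, q in enumerate(questions, start=1)
    ]
    txt_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Structured form: the list of dicts, serialized as-is.
    json_path.write_text(
        json.dumps(questions, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return txt_path, json_path
```

A caller would typically pass sys.argv to resolve_input_path, so that python document_reader.py notes.pdf skips the prompt.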
CODE ARCHITECTURE HIGHLIGHTS
Text extraction, data cleaning, and question generation are isolated in separate functions, so each function has exactly one job. This follows the Single Responsibility Principle (SRP) and keeps the codebase modular and easy to maintain.
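As a rough illustration of that separation, the sketch below keeps extraction in one function and wires the stages together in another. The names extract_text and run_pipeline are hypothetical and are not taken from document_reader.py. The PDF branch uses pypdf's PdfReader and page.extract_text(), imported lazily so that .txt input works even when pypdf is not installed.

```python
from pathlib import Path


def extract_text(path):
    """Return the raw text of a .txt or .pdf file; reject anything else."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8", errors="replace")
    if suffix == ".pdf":
        from pypdf import PdfReader  # lazy import: only needed for PDFs

        reader = PdfReader(str(path))
        # extract_text() can return an empty result for image-only pages.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")


def run_pipeline(path, extract, clean, generate):
    """Chain three single-purpose stages: extract, then clean, then generate."""
    return generate(clean(extract(path)))
```

Because run_pipeline receives the stages as arguments, each one can be tested or replaced on its own, for example by swapping in a different cleaning function without touching extraction.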
Status: SIWES Portfolio Project (2026)