GitHub - hosannaute/Question-Generator: A Python NLP utility that extracts text from PDFs and generates study questions

DOCUMENT TO QUESTION GENERATOR - NLP PARSING UTILITY

A Python-based command-line tool that reads document files, sanitizes the extracted text, and applies rule-based logic to automatically generate contextual study questions.

PROJECT OVERVIEW

Developed as a technical showcase for my SIWES engineering portfolio, this application demonstrates backend data processing and text manipulation. Instead of relying on external AI APIs, it uses custom Regular Expressions (Regex) and trigger-word mapping to parse academic or technical text and programmatically generate relevant questions.

KEY FEATURES

Multi-Format File Parsing: Safely extracts text from both raw .txt files and complex .pdf documents using the pypdf library.

Data Sanitization: Cleans the extracted data by collapsing whitespace and stripping page numbers or noisy numeric lines to prevent processing errors.

Rule-Based NLP Logic: Scans sentences for specific linguistic triggers (e.g., "results in", "functions by") to intelligently generate categorized questions (What, Explain, How, Why).

Dual Data Export: Saves the generated questions into two formats simultaneously: a human-readable .txt file and a structured .json file for database or API integration.
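The sanitization and rule-based steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (clean_text, generate_questions), the noise-line pattern, the question templates, and the "is defined as" and "refers to" triggers are all assumptions made for the example. Only "results in" and "functions by" come from the feature list.

```python
import re

# Assumed noise pattern: lines made only of digits/punctuation, optionally
# prefixed with "page" (e.g. "12", "- 3 -", "Page 4").
NOISE_LINE = re.compile(r"(?:page\s*)?[\d\s.\-/|]+", re.IGNORECASE)

# Hypothetical trigger table: (trigger phrase, category, question template).
TRIGGER_RULES = [
    ("results in", "Why", "Why does {subject} result in {remainder}?"),
    ("functions by", "How", "How does {subject} function?"),
    ("is defined as", "What", "What is {subject}?"),
    ("refers to", "Explain", "Explain what {subject} refers to."),
]


def clean_text(raw: str) -> str:
    """Drop empty and numeric-only lines, then collapse all whitespace."""
    kept = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped or NOISE_LINE.fullmatch(stripped):
            continue
        kept.append(stripped)
    return re.sub(r"\s+", " ", " ".join(kept)).strip()


def _lower_article(subject: str) -> str:
    """Lowercase a leading article so the subject reads well mid-question."""
    first, _, rest = subject.partition(" ")
    if first in {"The", "A", "An"}:
        return f"{first.lower()} {rest}".strip()
    return subject


def generate_questions(text: str) -> list[dict]:
    """Emit at most one question per sentence, using the first matching trigger."""
    questions = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        body = sentence.strip().rstrip(".!?")
        lowered = body.lower()
        for trigger, category, template in TRIGGER_RULES:
            idx = lowered.find(trigger)
            if idx <= 0:  # trigger missing, or no subject before it
                continue
            subject = _lower_article(body[:idx].strip())
            remainder = body[idx + len(trigger):].strip()
            questions.append({
                "category": category,
                "question": template.format(subject=subject, remainder=remainder),
                "source": sentence.strip(),
            })
            break
    return questions
```

Keeping the triggers in a plain table means a new question type can be added by appending one tuple, without touching the matching loop.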

TECHNICAL STACK

Language: Python 3.x

Core Concepts: File Input/Output, Regular Expressions (re), String Manipulation, Algorithm Design, JSON Serialization.

Dependencies: pypdf (for PDF extraction).
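For illustration, text extraction for the two supported formats might look like the sketch below. The function name extract_text and the error handling are assumptions, not the repository's code. The pypdf calls used (PdfReader, reader.pages, page.extract_text()) are part of the library's public API.

```python
from pathlib import Path


def extract_text(path: str) -> str:
    """Return the raw text of a .txt or .pdf file (hypothetical helper)."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()

    if suffix == ".txt":
        # errors="replace" keeps going if the file contains undecodable bytes.
        return file_path.read_text(encoding="utf-8", errors="replace")

    if suffix == ".pdf":
        # Imported lazily so plain-text input works even without pypdf installed.
        from pypdf import PdfReader

        reader = PdfReader(str(file_path))
        # Fall back to "" for pages where no text could be extracted.
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
```

Scanned PDFs that contain only images will generally yield little or no text this way, since pypdf does not perform OCR.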

INSTALLATION AND USAGE

Install Dependencies

Open your terminal and install the required PDF library:

pip install pypdf

Run the Application

python document_reader.py

How to Use:

Run the script and input the path to your document when prompted.

Alternatively, pass the file path directly in the terminal (e.g., python document_reader.py notes.pdf).

The engine extracts the text, processes the sentences, and writes questions.txt and questions.json to the same folder.
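The usage flow above (path from the command line or from a prompt, then two output files) could be wired up along these lines. This is a sketch under stated assumptions: resolve_input_path and export_questions are hypothetical names, the question dictionaries are assumed to carry "category" and "question" keys, and the output folder defaults to the current working directory.

```python
import json
from pathlib import Path


def resolve_input_path(argv: list[str], prompt=input) -> str:
    """Use the first CLI argument if present, otherwise ask the user."""
    if len(argv) > 1:
        return argv[1]
    return prompt("Enter the path to your document: ").strip()


def export_questions(questions: list[dict], out_dir: str = ".") -> tuple[Path, Path]:
    """Write the same questions to questions.txt and questions.json."""
    out = Path(out_dir)
    txt_path = out / "questions.txt"
    json_path = out / "questions.json"

    # Human-readable form: one numbered, categorized question per line.
    lines = [
        f"{i}. [{q['category']}] {q['question']}"
        for i, q in enumerate(questions, start=1)
    ]
    txt_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

    # Structured form for database or API use.
    json_path.write_text(
        json.dumps(questions, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return txt_path, json_path
```

In a script entry point, the path would typically be obtained with resolve_input_path(sys.argv); passing the prompt function as a parameter keeps the helper easy to test without real keyboard input.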

CODE ARCHITECTURE HIGHLIGHTS

Text extraction, data cleaning, and question generation are each isolated in a separate function. This follows the Single Responsibility Principle (SRP) and keeps the codebase modular and easy to maintain.

Status: SIWES Portfolio Project (2026)

