An automated Python scraper that downloads educational worksheets from the K5 Learning website. It finds PDF worksheets and downloads them in parallel for speed.
- 🚀 Parallel Downloads: Downloads up to 20 PDFs concurrently (configurable via MAX_CONCURRENT_DOWNLOADS)
- 📁 Organized Structure: Automatically organizes worksheets by topic and grade
- 🔄 Smart Resume: Skips already downloaded files
- 🛡️ Robust Error Handling: Automatic retry logic with exponential backoff
- 🎯 Flexible Scraping: Supports both direct worksheet pages and topic-based pages
- 📊 Progress Tracking: Real-time download status and statistics
k5learning/
├── main.py # Main scraper script
├── split_folders.py # Utility to restructure downloaded folders
├── requirements.txt # Python dependencies
├── downloaded_worksheets/ # Downloaded PDFs (ignored by git)
└── README.md # This file
- Python 3.7+
- pip or uv (for package management)
- Clone the repository:
git clone <repository-url>
cd k5learning
- Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
- Install dependencies:
# Using uv (recommended)
uv pip install -r requirements.txt
# Or using pip
pip install -r requirements.txt
- Configure URLs: Edit main.py and modify the ROOT_URLS list in the main() function:
ROOT_URLS = [
"https://www.k5learning.com/free-math-worksheets/first-grade-1",
"https://www.k5learning.com/reading-comprehension-worksheets/first-grade-1",
# Add more URLs as needed
]
- Run the scraper:
python main.py
- Find your downloads: All PDFs will be saved in the downloaded_worksheets/ directory.
If you need to reorganize folders with underscores into nested directory structures:
python split_folders.py /path/to/downloaded_worksheets
This will convert folders like grade1_math_addition into grade1/math/addition/.
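The restructuring described above could look roughly like the sketch below. It is an illustration only, not the actual contents of split_folders.py: the function names `nested_path_for` and `split_folders` are hypothetical, and it assumes every underscore in a folder name marks a new directory level.

```python
from pathlib import Path
import shutil


def nested_path_for(folder_name: str) -> Path:
    """Map an underscore-separated folder name to a nested relative path."""
    parts = [part for part in folder_name.split("_") if part]
    return Path(*parts)


def split_folders(root: Path) -> None:
    """Move each underscore-named folder directly under root into a nested tree."""
    # Materialize the listing first so newly created folders are not revisited.
    candidates = sorted(p for p in root.iterdir() if p.is_dir() and "_" in p.name)
    for folder in candidates:
        relative = nested_path_for(folder.name)
        if len(relative.parts) < 2:
            continue  # nothing to split (e.g. a name with only a stray underscore)
        target = root / relative
        target.mkdir(parents=True, exist_ok=True)
        for item in folder.iterdir():
            shutil.move(str(item), str(target / item.name))
        folder.rmdir()  # the original folder is empty once its contents have moved
```

Under these assumptions, calling `split_folders(Path("downloaded_worksheets"))` would move the files in `grade1_math_addition/` to `grade1/math/addition/`.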
Modify the MAX_CONCURRENT_DOWNLOADS constant in main.py:
MAX_CONCURRENT_DOWNLOADS = 20 # Default value
Update the OUTPUT_DIR variable in the main() function:
OUTPUT_DIR = "downloaded_worksheets" # Default location
- Math worksheets
- Reading comprehension
- Vocabulary
- Spelling
- Grammar
- Science worksheets
- Async I/O: Uses aiohttp for parallel HTTP requests
- Beautiful Soup 4: HTML parsing and navigation
- Semaphore Pattern: Controls concurrent download limits
- Retry Logic: Automatic retry with configurable attempts
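As a rough illustration of the semaphore pattern listed above, the sketch below caps how many downloads run at once. It is not taken from main.py: `fake_download` is a hypothetical stand-in that sleeps instead of making a real aiohttp request, and `download_all` is an assumed name.

```python
import asyncio


async def fake_download(url: str) -> str:
    """Stand-in for an aiohttp request; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return f"downloaded {url}"


async def download_all(urls, max_concurrent=20):
    """Run downloads concurrently, never exceeding max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        async with semaphore:  # waits here while max_concurrent downloads are active
            return await fake_download(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(url) for url in urls))


if __name__ == "__main__":
    sample_urls = [f"https://example.com/{i}.pdf" for i in range(5)]
    print(asyncio.run(download_all(sample_urls, max_concurrent=2)))
```

A lower limit keeps the load on the server small at the cost of total run time, which is the trade-off MAX_CONCURRENT_DOWNLOADS controls.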
The scraper distinguishes between two types of pages:
- Direct Worksheet Pages: Pages with immediate PDF download links
- Topic-Based Pages: Pages with categorized topics containing sub-pages with worksheets
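One plausible way to tell the two page types apart is to check whether a page links to any PDF directly. The sketch below assumes that rule, which may differ from what main.py actually does, and uses the standard library's `html.parser` instead of Beautiful Soup so that it runs without extra dependencies. `LinkCollector` and `classify_page` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def classify_page(html: str) -> str:
    """Return 'direct' if the page links to any PDF, otherwise 'topic'."""
    collector = LinkCollector()
    collector.feed(html)
    # Ignore any query string when checking the file extension.
    has_pdf = any(
        link.lower().split("?")[0].endswith(".pdf") for link in collector.links
    )
    return "direct" if has_pdf else "topic"
```

A page classified as `topic` would then have its sub-page links followed and classified in turn.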
- Automatic retry (up to 3 attempts) for failed downloads
- Skip existing files to avoid re-downloading
- Detailed logging of success/failure statistics
- Graceful handling of network errors and timeouts
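The retry behaviour above can be sketched as a small helper that retries a coroutine with exponentially growing delays. This is an assumption-laden illustration, not the code in main.py: `with_retries`, its parameters, and the choice of caught exceptions are hypothetical, and a version built on aiohttp would likely also catch `aiohttp.ClientError`.

```python
import asyncio


async def with_retries(make_attempt, max_attempts=3, base_delay=1.0):
    """Await make_attempt(), retrying with exponentially growing delays on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_attempt()
        except (OSError, asyncio.TimeoutError):
            if attempt == max_attempts:
                raise  # out of attempts; let the caller record the failure
            # Wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

Here `make_attempt` is a zero-argument callable that returns a fresh coroutine each time, for example `lambda: download(url)` for some hypothetical `download` coroutine.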
- requests: HTTP library for synchronous requests
- beautifulsoup4: HTML parsing
- aiohttp: Async HTTP client/server
- aiofiles: Async file I/O operations
- The scraper includes respectful delays between requests (time.sleep(0.5))
- User-Agent header is set to avoid being blocked
- PDF files are excluded from git tracking (see .gitignore)
This project is licensed under the MIT License - see the LICENSE file for details.
Note: This project is for educational purposes only. Please respect K5 Learning's terms of service and copyright policies when using this tool.
- Increase MAX_CONCURRENT_DOWNLOADS (but be respectful to the server)
- Check your internet connection
- The website might be rate-limiting requests
- Try reducing MAX_CONCURRENT_DOWNLOADS
- Check if the website structure has changed
- Verify the URL is correct
- The website structure may have changed
- Check console output for parsing errors
Feel free to submit issues or pull requests for improvements.
This tool is for personal educational use only. Always comply with the website's terms of service and robots.txt file. The author is not responsible for any misuse of this tool.