K5 Learning Worksheet Downloader

An automated Python scraper that downloads educational worksheets from the K5 Learning website. It collects PDF worksheet links and downloads them in parallel for faster retrieval.

Features

🚀 Parallel Downloads: Downloads multiple PDFs concurrently (up to 20 at a time by default)
📁 Organized Structure: Automatically organizes worksheets by topic and grade
🔄 Smart Resume: Skips already downloaded files
🛡️ Robust Error Handling: Automatic retry logic with exponential backoff
🎯 Flexible Scraping: Supports both direct worksheet pages and topic-based pages
📊 Progress Tracking: Real-time download status and statistics

Project Structure

k5learning/
├── main.py                      # Main scraper script
├── split_folders.py             # Utility to restructure downloaded folders
├── requirements.txt             # Python dependencies
├── downloaded_worksheets/       # Downloaded PDFs (ignored by git)
└── README.md                    # This file

Prerequisites

Python 3.7+
pip or uv (for package management)

Installation

Clone the repository:

git clone <repository-url>
cd k5learning

Create and activate a virtual environment:

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

Install dependencies:

# Using uv (recommended)
uv pip install -r requirements.txt

# Or using pip
pip install -r requirements.txt

Usage

Basic Usage

Configure URLs: Edit main.py and modify the ROOT_URLS list in the main() function:

ROOT_URLS = [
    "https://www.k5learning.com/free-math-worksheets/first-grade-1",
    "https://www.k5learning.com/reading-comprehension-worksheets/first-grade-1",
    # Add more URLs as needed
]

Run the scraper:

python main.py

Find your downloads: All PDFs will be saved in the downloaded_worksheets/ directory.

Restructuring Downloaded Folders

To convert downloaded folders whose names contain underscores into nested directories, run:

python split_folders.py /path/to/downloaded_worksheets

This will convert folders like grade1_math_addition into grade1/math/addition/.
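
The renaming rule above can be illustrated with a short sketch. This is not the code in split_folders.py; the function name split_folder and its arguments are hypothetical, and it assumes every underscore in a folder name marks a directory boundary.

```python
import shutil
from pathlib import Path


def split_folder(root: Path, folder_name: str) -> Path:
    """Move root/'a_b_c' to root/'a'/'b'/'c' and return the new path.

    Hypothetical helper: assumes each underscore is a directory boundary.
    """
    parts = [part for part in folder_name.split("_") if part]
    source = root / folder_name
    if len(parts) < 2:
        return source  # no underscore, nothing to restructure

    target = root.joinpath(*parts)
    target.mkdir(parents=True, exist_ok=True)

    # Move the contents one by one so an existing target folder is merged into.
    for child in source.iterdir():
        shutil.move(str(child), str(target / child.name))
    source.rmdir()  # the original flat folder is now empty
    return target
```

Calling split_folder(Path("downloaded_worksheets"), "grade1_math_addition") would move the folder's files into downloaded_worksheets/grade1/math/addition/.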

Configuration

Adjust Concurrent Downloads

Modify the MAX_CONCURRENT_DOWNLOADS constant in main.py:

MAX_CONCURRENT_DOWNLOADS = 20  # Default value

Change Output Directory

Update the OUTPUT_DIR variable in the main() function:

OUTPUT_DIR = "downloaded_worksheets"  # Default location

Supported Worksheet Types

Math worksheets
Reading comprehension
Vocabulary
Spelling
Grammar
Science worksheets

Technical Details

Architecture

Async I/O: Uses aiohttp for parallel HTTP requests
Beautiful Soup 4: HTML parsing and navigation
Semaphore Pattern: Controls concurrent download limits
Retry Logic: Automatic retry with configurable attempts
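
The semaphore pattern listed above can be sketched as follows. This is a minimal illustration, not the code in main.py: fake_download and download_all are hypothetical names, and asyncio.sleep stands in for the real aiohttp request so the example runs without network access.

```python
import asyncio

MAX_CONCURRENT_DOWNLOADS = 20  # mirrors the constant named in this README


async def fake_download(url: str, semaphore: asyncio.Semaphore, stats: dict) -> str:
    """Hypothetical stand-in for a download; records how many run at once."""
    async with semaphore:  # waits here while `limit` downloads are in flight
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # placeholder for the HTTP request
        stats["active"] -= 1
    return url


async def download_all(urls, limit=MAX_CONCURRENT_DOWNLOADS):
    """Start every download at once; the semaphore caps real concurrency."""
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_download(url, semaphore, stats) for url in urls)
    )
    return results, stats["peak"]
```

All tasks are created up front, but only `limit` of them hold the semaphore at any moment, which is why lowering MAX_CONCURRENT_DOWNLOADS reduces load on the server.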

Page Detection

The scraper distinguishes two types of pages:

Direct Worksheet Pages: Pages with immediate PDF download links
Topic-Based Pages: Pages with categorized topics containing sub-pages with worksheets
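
One way to tell the two page types apart is to check whether a page links directly to any PDF. The sketch below is an assumption about how such a check could work, not the project's actual logic: classify_page and LinkCollector are hypothetical names, the ".pdf" suffix test is a guessed heuristic, and the standard library's HTMLParser is used in place of Beautiful Soup so the example has no dependencies.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def classify_page(html: str) -> str:
    """Return "direct" if the page links to a PDF, otherwise "topic".

    Assumed heuristic: a link counts as a PDF if its path ends in ".pdf".
    """
    collector = LinkCollector()
    collector.feed(html)
    has_pdf = any(
        link.lower().split("?")[0].endswith(".pdf") for link in collector.links
    )
    return "direct" if has_pdf else "topic"
```

A page classified as "topic" would then have its sub-page links followed and each sub-page checked in the same way.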

Error Handling

Automatic retry (up to 3 attempts) for failed downloads
Skip existing files to avoid re-downloading
Detailed logging of success/failure statistics
Graceful handling of network errors and timeouts
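
Retry with exponential backoff, as described above, can be sketched like this. The helper name with_retries and the base_delay parameter are hypothetical; the default of 3 attempts comes from this README, while the delay values are assumptions.

```python
import asyncio


async def with_retries(operation, max_attempts=3, base_delay=0.5):
    """Await operation(); on failure, wait and try again.

    The wait doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            # A real scraper would likely catch narrower errors, such as
            # aiohttp.ClientError and asyncio.TimeoutError.
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

With the defaults, a download that keeps failing is attempted three times, with waits of 0.5 s and 1 s between attempts, before the error is reported.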

Dependencies

requests: HTTP library for synchronous requests
beautifulsoup4: HTML parsing
aiohttp: Async HTTP client/server
aiofiles: Async file I/O operations

Notes

The scraper includes respectful delays between requests (time.sleep(0.5))
User-Agent header is set to avoid being blocked
PDF files are excluded from git tracking (see .gitignore)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Note: This project is for educational purposes only. Please respect K5 Learning's terms of service and copyright policies when using this tool.

Troubleshooting

Issue: Downloads are too slow

Increase MAX_CONCURRENT_DOWNLOADS (but be respectful to the server)
Check your internet connection

Issue: Many failed downloads

The website might be rate-limiting requests
Try reducing MAX_CONCURRENT_DOWNLOADS
Check if the website structure has changed

Issue: No worksheets found

Verify the URL is correct
The website structure may have changed
Check console output for parsing errors

Contributing

Feel free to submit issues or pull requests for improvements.

Disclaimer

This tool is for personal educational use only. Always comply with the website's terms of service and robots.txt file. The author is not responsible for any misuse of this tool.

Repository: gndps/k5learning_worksheet_scraper