gndps/k5learning_worksheet_scraper

K5 Learning Worksheet Downloader

An automated Python scraper that downloads educational worksheets from the K5 Learning website. It scrapes worksheet pages and downloads the PDFs in parallel for faster downloads.

Features

  • 🚀 Parallel Downloads: Download multiple PDFs concurrently (up to 20 concurrent downloads)
  • 📁 Organized Structure: Automatically organizes worksheets by topic and grade
  • 🔄 Smart Resume: Skips already downloaded files
  • 🛡️ Robust Error Handling: Automatic retry logic with exponential backoff
  • 🎯 Flexible Scraping: Supports both direct worksheet pages and topic-based pages
  • 📊 Progress Tracking: Real-time download status and statistics

Project Structure

k5learning/
├── main.py                      # Main scraper script
├── split_folders.py             # Utility to restructure downloaded folders
├── requirements.txt             # Python dependencies
├── downloaded_worksheets/       # Downloaded PDFs (ignored by git)
└── README.md                    # This file

Prerequisites

  • Python 3.7+
  • pip or uv (for package management)

Installation

  1. Clone the repository:
git clone <repository-url>
cd k5learning
  2. Create and activate a virtual environment:
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  3. Install dependencies:
# Using uv (recommended)
uv pip install -r requirements.txt

# Or using pip
pip install -r requirements.txt

Usage

Basic Usage

  1. Configure URLs: Edit main.py and modify the ROOT_URLS list in the main() function:
ROOT_URLS = [
    "https://www.k5learning.com/free-math-worksheets/first-grade-1",
    "https://www.k5learning.com/reading-comprehension-worksheets/first-grade-1",
    # Add more URLs as needed
]
  2. Run the scraper:
python main.py
  3. Find your downloads: All PDFs will be saved in the downloaded_worksheets/ directory.

Restructuring Downloaded Folders

If you need to reorganize folders with underscores into nested directory structures:

python split_folders.py /path/to/downloaded_worksheets

This will convert folders like grade1_math_addition into grade1/math/addition/.
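
The conversion described above can be sketched roughly as follows. This is a minimal illustration, not the code in split_folders.py: the function names split_folder and split_all are hypothetical, and the sketch assumes every underscore in a folder name marks a directory boundary.

```python
from pathlib import Path
import shutil


def split_folder(folder: Path) -> Path:
    """Move the contents of 'a_b_c' into 'a/b/c' beside it; return the new path."""
    parts = [p for p in folder.name.split("_") if p]
    if len(parts) < 2:
        return folder  # no underscore, nothing to split
    target = folder.parent.joinpath(*parts)
    target.mkdir(parents=True, exist_ok=True)
    for item in folder.iterdir():
        shutil.move(str(item), str(target / item.name))
    folder.rmdir()  # the original folder is empty at this point
    return target


def split_all(root: Path) -> None:
    # Snapshot the listing first, because split_folder creates new directories in root.
    for folder in [p for p in root.iterdir() if p.is_dir()]:
        split_folder(folder)
```

Calling split_all(Path("downloaded_worksheets")) would then restructure every top-level folder in place. The sketch does not handle name collisions in the target directory.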

Configuration

Adjust Concurrent Downloads

Modify the MAX_CONCURRENT_DOWNLOADS constant in main.py:

MAX_CONCURRENT_DOWNLOADS = 20  # Default value

Change Output Directory

Update the OUTPUT_DIR variable in the main() function:

OUTPUT_DIR = "downloaded_worksheets"  # Default location

Supported Worksheet Types

  • Math worksheets
  • Reading comprehension
  • Vocabulary
  • Spelling
  • Grammar
  • Science worksheets

Technical Details

Architecture

  • Async I/O: Uses aiohttp for parallel HTTP requests
  • Beautiful Soup 4: HTML parsing and navigation
  • Semaphore Pattern: Controls concurrent download limits
  • Retry Logic: Automatic retry with configurable attempts

Page Detection

The scraper distinguishes two types of pages:

  1. Direct Worksheet Pages: Pages with immediate PDF download links
  2. Topic-Based Pages: Pages with categorized topics containing sub-pages with worksheets
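
One possible way to tell the two page types apart is to check whether a page links to any PDF files. The sketch below is an assumption about the rule, not the logic in main.py, and it uses the standard library's html.parser in place of Beautiful Soup so that it runs without extra dependencies. LinkCollector and classify_page are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def classify_page(html: str) -> str:
    """Return 'direct' if the page links to at least one PDF, else 'topic'."""
    parser = LinkCollector()
    parser.feed(html)
    # Ignore any query string when checking the file extension.
    has_pdf = any(h.lower().split("?")[0].endswith(".pdf") for h in parser.hrefs)
    return "direct" if has_pdf else "topic"
```

A page classified as "topic" would then have its sub-page links followed, and each sub-page checked the same way.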

Error Handling

  • Automatic retry (up to 3 attempts) for failed downloads
  • Skip existing files to avoid re-downloading
  • Detailed logging of success/failure statistics
  • Graceful handling of network errors and timeouts
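
The retry behavior can be sketched as a small wrapper around any coroutine function. This is an illustration under assumptions: the name with_retries, the base_delay parameter, and the doubling schedule are hypothetical; only the limit of 3 attempts comes from this README.

```python
import asyncio


async def with_retries(job, attempts: int = 3, base_delay: float = 1.0):
    """Await job() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return await job()
        except (OSError, asyncio.TimeoutError) as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller record the failure
            delay = base_delay * 2 ** attempt
            print(f"attempt {attempt + 1} failed ({exc!r}); retrying in {delay}s")
            await asyncio.sleep(delay)
```

With aiohttp, the except clause would likely also need aiohttp.ClientError, since not every client error derives from OSError.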

Dependencies

  • requests: HTTP library for synchronous requests
  • beautifulsoup4: HTML parsing
  • aiohttp: Async HTTP client/server
  • aiofiles: Async file I/O operations

Notes

  • The scraper pauses between requests (time.sleep(0.5)) to limit load on the server
  • A User-Agent header is set to avoid being blocked
  • PDF files are excluded from git tracking (see .gitignore)

License

This project is licensed under the MIT License - see the LICENSE file for details.

Note: This project is for educational purposes only. Please respect K5 Learning's terms of service and copyright policies when using this tool.

Troubleshooting

Issue: Downloads are too slow

  • Increase MAX_CONCURRENT_DOWNLOADS (but be respectful to the server)
  • Check your internet connection

Issue: Many failed downloads

  • The website might be rate-limiting requests
  • Try reducing MAX_CONCURRENT_DOWNLOADS
  • Check if the website structure has changed

Issue: No worksheets found

  • Verify the URL is correct
  • The website structure may have changed
  • Check console output for parsing errors

Contributing

Feel free to submit issues or pull requests for improvements.

Disclaimer

This tool is for personal educational use only. Always comply with the website's terms of service and robots.txt file. The author is not responsible for any misuse of this tool.
