This is the code from a private research project. The project's aim is to gather and analyze data from the German-language cooking website Chefkoch. Note: As of mid-October 2016 the chefkoch.de terms of use contained no provision restricting scraping the website for information, so as far as I can tell this is all legal.
Inspecting their robots.txt reveals that Chefkoch restricts crawling for some sub-folders (helpful hint from ScrapeHero). I don't have insight into their folder structure, so I have to guess which parts they are restricting. Most (e.g., photos, user pictures) are uninteresting. I'm guessing "produkte" refers to the products being sold on the website (not to recipes being referred to as "products"). The only one that might be tricky is /user/, because I do want some info on the users (experience level etc.).
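Rather than guessing by hand on every run, the restrictions can be checked programmatically with the standard library's robots.txt parser. A minimal sketch; the `Disallow` lines below are a hypothetical excerpt (the real file should be fetched from chefkoch.de/robots.txt before crawling):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt of chefkoch.de's robots.txt, based on the
# sub-folders mentioned above -- not the actual file contents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /produkte/
Disallow: /user/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check individual paths before requesting them.
print(parser.can_fetch("*", "https://www.chefkoch.de/user/12345/"))     # False: /user/ is disallowed
print(parser.can_fetch("*", "https://www.chefkoch.de/rezepte/12345/"))  # True: not matched by any rule
```

In the real script, `RobotFileParser.set_url(...)` plus `.read()` would load the live file instead of a hardcoded string.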
In an old discussion on API usage, a Chefkoch admin says that "grabbing information and putting it on other platforms" is disallowed. The wording leaves it unclear whether each activity on its own is disallowed or only the two in combination, so I'm assuming it's the combination (frankly because it suits me best, and the data is only being gathered for research purposes anyway).
This awesome guide by Jason Austwick is a useful resource.
Background
Recipe pages on Chefkoch are organized into seven overlapping categories:
- Baking and sweets (Backen & Süßspeisen)
- Drinks (Getränke)
- Type of recipe (Menüart), e.g. starter or main course
- Regional cuisine (Regional)
- Seasonal recipes (Saisonal)
- Special recipes (Spezielles), e.g. baby food or camping
- Method of preparation (Zubereitungsarten), though I don't fully understand what this category covers
Categories comprise 12k-260k recipes. Links to the recipes are listed in batches of 30 on 'category-list-pages' (like this).
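As a quick sanity check on crawl size, the number of list pages per category follows directly from the batch size. A small sketch using the 12k/260k figures from above:

```python
import math

RECIPES_PER_PAGE = 30  # links come in batches of 30, as noted above

def n_list_pages(n_recipes, per_page=RECIPES_PER_PAGE):
    """Number of category-list-pages needed to cover a category."""
    return math.ceil(n_recipes / per_page)

# Smallest and largest categories:
print(n_list_pages(12_000))   # 400
print(n_list_pages(260_000))  # 8667
```

Summed over all seven categories this is consistent with the ~27k list pages reported below.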
crawl_category_subpages.py
This script uses a list of the category syntax and a list of user-defined path preferences in config/. It downloads the HTML code of each category-list-page and stores it in a local .txt file.
The user-agents are courtesy of [willdrevo][user_agents].
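The core of the script can be sketched with the standard library alone. The user-agent strings and the file-naming scheme below are placeholders of mine (the real lists live in config/), and `download_page` is a hypothetical helper, not the script's actual function name:

```python
import random
import time
from pathlib import Path
from urllib.request import Request, urlopen

# Placeholder pool; the project loads its user-agents from config/.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def page_filename(category, page_index, out_dir="html"):
    """Map a category/page pair to a local file path (naming is my guess)."""
    return Path(out_dir) / f"{category}_page{page_index:05d}.txt"

def download_page(url, dest):
    """Fetch one category-list-page with a random user-agent and save it."""
    req = Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
    html = urlopen(req, timeout=30).read().decode("utf-8", errors="replace")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(html, encoding="utf-8")
    time.sleep(1.0)  # be polite: pause between consecutive requests
```

Rotating the user-agent per request makes the crawler look less like a single bot hammering the site; the sleep keeps the request rate modest.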
CAUTION
- As of Nov 2016 there are ~27k category-list-pages (30 recipes each), weighing in at ~5 GB
- Downloading took several hours (with a reasonably fast connection)
- I still need to check out the grequests library for asynchronous requests.
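Until grequests is evaluated, the same speed-up can be sketched with the standard library's thread pool; `fetch_fn` stands in for any single-URL downloader (a hypothetical parameter, not part of the current script):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch_fn, max_workers=8):
    """Apply fetch_fn(url) to all urls concurrently.

    Results come back in the same order as `urls`. The gain comes from
    overlapping the network wait of one request with the others.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_fn, urls))

# Example with a stand-in fetcher (no network needed):
print(fetch_all(["page1", "page2"], str.upper))  # ['PAGE1', 'PAGE2']
```

grequests offers a similar interface (`grequests.map`) built on gevent; the thread-pool version above has the advantage of needing no third-party dependency.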