This is the code from a private research project. The project's aim is to gather and analyze data from the German-language cooking website Chefkoch. Note: As of mid-October 2016 the chefkoch.de terms of use contained no provision restricting scraping the website for information, so as far as I can tell this is all legal.
Inspecting their robots.txt reveals that Chefkoch restricts crawling for some sub-folders (helpful hint from ScrapeHero). I don't have insight into their folder structure, so I have to guess which parts they are restricting. Most (e.g., photos, user pictures) are uninteresting. I'm guessing "produkte" refers to the products being sold on the website (not to recipes being referred to as "products"). The only one that might be tricky is /user/, because I do want some info on the users (experience level etc.).
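Rather than guessing by hand on every run, the restrictions can be checked programmatically with the standard library's robots.txt parser. A minimal sketch; the `Disallow` lines below are a hypothetical excerpt (the real file should be fetched from chefkoch.de/robots.txt before crawling):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt of chefkoch.de's robots.txt, based on the
# sub-folders mentioned above -- not the actual file contents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /produkte/
Disallow: /user/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check individual paths before requesting them.
print(parser.can_fetch("*", "https://www.chefkoch.de/user/12345/"))     # False: /user/ is disallowed
print(parser.can_fetch("*", "https://www.chefkoch.de/rezepte/12345/"))  # True: not matched by any rule
```

In the real script, `RobotFileParser.set_url(...)` plus `.read()` would load the live file instead of a hardcoded string.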
In an old discussion on API usage, a Chefkoch admin says that "grabbing information and putting it on other platforms" is disallowed. The wording leaves it unclear whether each activity on its own is disallowed or only the two in combination, so I'm assuming it's the combination (frankly because it suits me best, and the data is only being gathered for research purposes anyway).
This awesome guide by Jason Austwick is a useful resource.
Background
Recipe pages on Chefkoch are organized into seven overlapping categories:
- Baking and sweets (Backen & Süßspeisen)
- Drinks (Getränke)
- Type of recipe (Menüart), e.g. starter or main course
- Regional cuisine (Regional)
- Seasonal recipes (Saisonal)
- Special recipes (Spezielles), e.g. baby food or camping
- Method of preparation (Zubereitungsarten), though I don't fully understand what this category covers
Categories comprise 12k-260k recipes. Links to the recipes are listed in batches of 30 on 'category-list-pages' (like this).
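As a quick sanity check on crawl size, the number of list pages per category follows directly from the batch size. A small sketch using the 12k/260k figures from above:

```python
import math

RECIPES_PER_PAGE = 30  # links come in batches of 30, as noted above

def n_list_pages(n_recipes, per_page=RECIPES_PER_PAGE):
    """Number of category-list-pages needed to cover a category."""
    return math.ceil(n_recipes / per_page)

# Smallest and largest categories:
print(n_list_pages(12_000))   # 400
print(n_list_pages(260_000))  # 8667
```

Summed over all seven categories this is consistent with the ~27k list pages reported below.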
crawl_category_subpages.py
This script uses a list of the category syntax and a list of user-defined path preferences in config/. It downloads the HTML code of each category-list-page and stores it in a local .txt file.
The user-agents are courtesy of [willdrevo][user_agents].
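The core of the script can be sketched with the standard library alone. The user-agent strings and the file-naming scheme below are placeholders of mine (the real lists live in config/), and `download_page` is a hypothetical helper, not the script's actual function name:

```python
import random
import time
from pathlib import Path
from urllib.request import Request, urlopen

# Placeholder pool; the project loads its user-agents from config/.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def page_filename(category, page_index, out_dir="html"):
    """Map a category/page pair to a local file path (naming is my guess)."""
    return Path(out_dir) / f"{category}_page{page_index:05d}.txt"

def download_page(url, dest):
    """Fetch one category-list-page with a random user-agent and save it."""
    req = Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})
    html = urlopen(req, timeout=30).read().decode("utf-8", errors="replace")
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(html, encoding="utf-8")
    time.sleep(1.0)  # be polite: pause between consecutive requests
```

Rotating the user-agent per request makes the crawler look less like a single bot hammering the site; the sleep keeps the request rate modest.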
CAUTION
- As of Nov 2016 there are ~27k category-list-pages (30 recipes each), weighing in at ~5 GB
- Downloading took several hours (with a reasonably fast connection)
- I still need to check out the grequests library for asynchronous requests.
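Until grequests is evaluated, the same speed-up can be sketched with the standard library's thread pool; `fetch_fn` stands in for any single-URL downloader (a hypothetical parameter, not part of the current script):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch_fn, max_workers=8):
    """Apply fetch_fn(url) to all urls concurrently.

    Results come back in the same order as `urls`. The gain comes from
    overlapping the network wait of one request with the others.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_fn, urls))

# Example with a stand-in fetcher (no network needed):
print(fetch_all(["page1", "page2"], str.upper))  # ['PAGE1', 'PAGE2']
```

grequests offers a similar interface (`grequests.map`) built on gevent; the thread-pool version above has the advantage of needing no third-party dependency.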