Commit

Merge e1b3ba6 into c6887ca
jknndy committed Jun 8, 2024
2 parents (c6887ca + e1b3ba6); commit 1f3ba78
Showing 5 changed files with 68 additions and 70 deletions.
docs/README.md (6 changes: 3 additions & 3 deletions)
@@ -2,13 +2,13 @@

## Goal

-This library has the goals of
+The goals of this library are

* making recipe information **accessible**,
* ensuring the author is **attributed** correctly,
-* representing the recipes **accurately** and **authentically**
+* representing recipes **accurately** and **authentically**

-Sometimes it is simple and straightforward to achieve all these goals, and sometimes it is more difficult (which is why this library exists). Where some interpretation or creativity is required to scrape a recipe, we should always keep those goals in mind. Occasionally, that might mean that we can't support a particular website.
+Sometimes it is simple and straightforward to achieve all these goals, while other times it is more difficult (which is why this library exists). When some interpretation or creativity is required to scrape a recipe, we should always keep these goals in mind. Occasionally, this might mean that we can't support a particular website.

## Contents

docs/how-to-develop-scraper.md (18 changes: 9 additions & 9 deletions)
@@ -2,9 +2,9 @@

## 1. Find a website

-If you have found a website you want to scrape the recipes from, first of all check to see if the website is already supported.
+If you have found a website from which you want to scrape recipes, first check to see if the website is already supported.

-The project [README](https://github.com/hhursev/recipe-scrapers/blob/main/README.rst) has a list of the hundreds of websites already supported.
+For a comprehensive list of supported websites, refer to the project's [README](https://github.com/hhursev/recipe-scrapers/blob/main/README.rst).

You can also check from within Python:

@@ -20,13 +20,13 @@ You can also check from within Python:

```
recipe_scrapers.bbcgoodfood.BBCGoodFood
```
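The code producing that output is collapsed in this hunk. As a hedged sketch (not part of this commit), a support check might use the `SCRAPERS` registry that the `recipe_scrapers` package exports, which maps hostnames to scraper classes:

```python
# Hedged sketch: look up a host in the SCRAPERS registry.
# The hostname below is just an example.
from recipe_scrapers import SCRAPERS

scraper_class = SCRAPERS.get("bbcgoodfood.com")
print(scraper_class)  # e.g. <class 'recipe_scrapers.bbcgoodfood.BBCGoodFood'>
```

A `None` result from `.get()` would mean the host is not supported.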

-It's a good idea to file an [issue](https://github.com/hhursev/recipe-scrapers/issues/new/choose) on GitHub to track support for the website, and to indicate whether you are working on it.
+Before starting development, consider filing an [issue](https://github.com/hhursev/recipe-scrapers/issues/new/choose) on GitHub. This helps track the website's support status and lets others know you are working on it.

## 2. Fork the recipe-scrapers repository and clone

If this is your first time contributing to this repository then you will need to create a fork of the repository and clone it to your computer.

-To create a fork, click the Fork button near the top of page on the project GitHub page. This will create a copy of the repository under your GitHub user.
+To create a fork, click the Fork button near the top of the project's GitHub page. This will create a copy of the repository under your GitHub account.

You can then clone the fork to your computer and set it up for development.
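The actual commands are collapsed below; a hedged sketch of the usual steps (replace `<username>` with your GitHub username):

```
git clone https://github.com/<username>/recipe-scrapers.git
cd recipe-scrapers
```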

@@ -148,7 +148,7 @@ If the website supports Recipe Schema, then this is mostly done for you already.

Some additional functionality may be required in the scraper functions to make the output match the recipe on the website.

-An in-depth guide on all the functions a scraper can support and what their output should be can be found [here](in-depth-guide-scraper-functions.md). The automatically generated scraper does not include all of these functions be default, so you may wish to add some of the additional functions listed if the website can support them.
+An in-depth guide on all the functions a scraper can support and what their output should be can be found [here](in-depth-guide-scraper-functions.md). The automatically generated scraper does not include all of these functions by default, so you may wish to add some of the additional functions listed if the website can support them.

If the website does not support Recipe Schema, or the schema does not include all of the recipe information, then you can scrape the information out of the website HTML. Each scraper has a `bs4.BeautifulSoup` object made available in `self.soup` which contains the parsed HTML. This can be used to extract the recipe information needed.
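What that looks like in practice is collapsed in this diff; as a hedged sketch (the method and the `total-time` CSS class are hypothetical, using helpers from the library's `_utils` module):

```python
from recipe_scrapers._utils import get_minutes, normalize_string

def total_time(self):
    # Hypothetical fallback: the schema lacks totalTime, so read it
    # from a site-specific element in the parsed HTML instead.
    tag = self.soup.find(class_="total-time")
    return get_minutes(normalize_string(tag.get_text()))
```

Inside a real scraper module these helpers are imported relatively, as `from ._utils import ...`.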

@@ -165,9 +165,9 @@ A test case was automatically created when the scraper class was created.
The test case comprises two parts:

1. testhtml file containing the html from the URL used to generate the scraper
-2. json file containing the expected output from the scraper when the scraper is run on the testhtml file.
+2. JSON file containing the expected output from the scraper when the scraper is run on the testhtml file.

-The generated json file will look something like this, with only the host field populated:
+The generated JSON file will look something like this, with only the host field populated:

```json
{
```

@@ -190,9 +190,9 @@

Each of the fields in this file has the same name as the related scraper function. You will need to add the correct output from the scraper to each of these fields.
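The populated file is not shown in this diff; as a hedged illustration (the field names mirror common scraper functions, and every value is made up), a filled-in test file might look like:

```json
{
    "host": "example.com",
    "author": "Example Author",
    "title": "Example Recipe",
    "total_time": 30,
    "yields": "4 servings",
    "ingredients": ["1 cup example ingredient"],
    "instructions": "Example instruction step."
}
```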

-If the scraper implements any of the optional functions listed in the [Scraper Functions guide](in-depth-guide-scraper-functions.md), then you should add the appropriate fields to the json file.
+If the scraper implements any of the optional functions listed in the [Scraper Functions guide](in-depth-guide-scraper-functions.md), then you should add the appropriate fields to the JSON file.

-In some cases, a scraper is not able to support one or more of the mandatory functions because the website doesn't provide the information. In these cases, remove the field from the json file. What will happen is that the test case will check to see if the scraper raises an exception if any of the unsupported functions are called.
+In some cases, a scraper is not able to support one or more of the mandatory functions because the website doesn't provide the information. In these cases, remove the field from the JSON file; the test case will then check that the scraper raises an exception if any of the unsupported functions are called.

You can check whether your scraper is passing the tests by running

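The command itself is collapsed in this view. Assuming the project's standard tooling (an assumption, since the hunk is truncated), it is likely the built-in unittest runner:

```
python -m unittest
```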
docs/in-depth-guide-html-scraping.md (16 changes: 8 additions & 8 deletions)
@@ -1,25 +1,25 @@
-# In Depth Guide: HTML Scraping
+# In-Depth Guide: HTML Scraping

-The preferred method of scraping recipe information from a web page is to use the schema.org Recipe data. This is a machine readable, structured format intended to provide a standardised method of extracting information. However, whilst most recipe websites use the schema.org Recipe format, not all do, and for those websites that do, it does not always include all the information we are looking for. In these cases, we can use HTML scraping to extract the information from the HTML markup.
+The preferred method of scraping recipe information from a web page is to use the schema.org Recipe data. This is a machine-readable, structured format intended to provide a standardized method of extracting information. However, while most recipe websites use the schema.org Recipe format, not all websites do, and for those websites that do, it does not always include all the information we are looking for. In these cases, HTML scraping is used to extract the information from the HTML markup.

## `soup`

Each scraper has a `BeautifulSoup` object that can be accessed using the `self.soup` attribute. The `BeautifulSoup` object is a representation of the web page HTML that has been parsed into a format that we can query and extract information from.

-The [Beautiful Soup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is the best resource for learning how to use `BeautifulSoup` objects to interact with HTML documents.
+The [BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is the best resource for learning how to use `BeautifulSoup` objects to interact with HTML documents.

This guide covers a number of common patterns that are used in this library.

## Finding a single element

-The `self.soup.find()` function returns the first element matching the arguments. This is useful if you are trying to extract some information that should only occur once, for example the prep time or total time.
+The `self.soup.find` function returns the first element matching the arguments. This is useful if you are trying to extract some information that should only occur once, for example, the prep time or total time.

```python
# To find a particular element
self.soup.find("h1") # Returns the first h1 element

# To find an element with particular class (note the underscore at the end of class_)
self.soup.find(class_"total-time") # Returns the first element with total-time class.
self.soup.find(class_="total-time") # Returns the first element with total-time class.

# To find an element with a particular ID
self.soup.find(id="total-time")
@@ -29,7 +29,7 @@ self.soup.find(id="total-time")
self.soup.find("h1", class_="title")
```

-`self.soup` returns a `bs4.element.Tag` object. Usually we just want the text from the selected element and the best way to do that is to use `.get_text()`.
+`self.soup.find` returns a `bs4.element.Tag` object. Usually, we just want the text from the selected element, and the best way to do that is to use `.get_text()`.

```python
title_tag = self.soup.find("h1") # bs4.element.Tag object
```

@@ -74,7 +74,7 @@ The Beautiful Soup documentation for `find` is [here](https://www.crummy.com/sof

### Normalizing strings

-A convenience function called `normalize_string()` is provided in the `_utils` package. This function will convert any characters escaped for HTML to their actual character (e.g. `&amp;` to `&`) and remove unnecessary white space. It is best practice to always use this when extracting text from the HTML.
+A convenience function called `normalize_string()` is provided in the `_utils` package. This function will convert any characters escaped for HTML to their actual character (e.g., `&amp;` to `&`) and remove unnecessary white space. It is best practice to always use this when extracting text from the HTML.

```python
from ._utils import normalize_string
```

@@ -139,7 +139,7 @@ The Beautiful Soup documentation for `find_all` is [here](https://www.crummy.com
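The rest of that example is collapsed; a hedged usage sketch of `normalize_string` (the input string is made up), consistent with the description above:

```python
from recipe_scrapers._utils import normalize_string

# Unescape HTML entities and collapse extra whitespace.
print(normalize_string("Chicken &amp; Chorizo   Paella"))  # Chicken & Chorizo Paella
```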

## Using CSS selectors

-If you are already familiar with CSS selectors, then you can use `select()` to achieve the same result as `find_all()`, or `select_one()` to achieve the same result as `find`.
+If you are already familiar with CSS selectors, then you can use `self.soup.select()` to achieve the same result as `self.soup.find_all()`, or `self.soup.select_one()` to achieve the same result as `self.soup.find`.

```python
ingredient_tag = self.soup.select("li.wprm-recipe-ingredient") # Match all li elements with wprm-recipe-ingredient class
```
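The remainder of the block is collapsed; a hedged companion sketch for `select_one` (the selectors are hypothetical):

```python
# select_one returns the first matching element, like find
title_tag = self.soup.select_one("h1.recipe-title")

# select returns all matching elements, like find_all
ingredient_tags = self.soup.select("li.wprm-recipe-ingredient")
```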