Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for thecookingguy.com #1067

Merged
merged 34 commits into from
May 7, 2024
Merged

Conversation

rmdluo
Copy link
Contributor

@rmdluo rmdluo commented Apr 19, 2024

This PR adds support for thecookingguy.com, using https://www.thecookingguy.com/recipes/chili-cumin-lamb as the test case.

Resolves #957

@jknndy
Copy link
Collaborator

jknndy commented Apr 19, 2024

Great start! The site_name should be returned by the schema as the expected key, I tested it locally and the scraper is expected Sam The Cooking Guy, answering #957 here to consolidate the discussion.

edit: I was mistaken! For instances where site_name is not automatically populating you can define it by adding a declaration to the scraper class, like this...

    def site_name(self):
        return "Sam The Cooking Guy"

note: should the description of this PR read thecookingguy.com instead of bestrecipes.com.au?

@rmdluo
Copy link
Contributor Author

rmdluo commented Apr 19, 2024

Oh yes, sorry about that I copied and pasted from another one so I could get the format right

Copy link
Contributor

@mlduff mlduff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, just a couple of minor comments (which are just suggestions, happy to be wrong).

recipe_scrapers/thecookingguy.py Outdated Show resolved Hide resolved
recipe_scrapers/thecookingguy.py Outdated Show resolved Hide resolved
@jknndy
Copy link
Collaborator

jknndy commented Apr 22, 2024

@rmdluo , as this has been under active development let me know when you're ready for a review!

@rmdluo
Copy link
Contributor Author

rmdluo commented Apr 23, 2024

I think we're ready for review now!

@rmdluo rmdluo marked this pull request as ready for review April 23, 2024 21:02
@rmdluo
Copy link
Contributor Author

rmdluo commented Apr 23, 2024

Oops wait forgot to add it to the list of websites on the readme

@jayaddison
Copy link
Collaborator

Oops wait forgot to add it to the list of websites on the readme

Good catch 😄

Comment on lines 19 to 20
def total_time(self):
return None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that no recipes from thecookingguy.com seem to have total times listed? (I can believe it, but just checking that that's the intended behaviour here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so yes, they have times within the instructions but not a single field for total time. See here for an example: https://www.thecookingguy.com/recipes/chili-cumin-lamb

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't something I've suggested before, so it would be a bit novel, but: what do you think about emitting a Python warning -- perhaps even one that would be ignored by default -- before returning the empty (not-known-to-be-available) value?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the risk of seeming spammy or self-absorbed, let me provide some more context here; I generally have a few types of end-user in mind for this library:

  1. Enthusiasts/hobbyists who want to find an accessible way to read a small number of recipes from the web, or who may be learning some fun/straightforward Python code, and who may be running the code partly/entirely from the Python interpreter.
  2. Slightly more advanced users -- let's call them tech parents of the 2020s -- who are able to create a spreadsheet and run some Python code, and may be doing things like importing a group of recipes that the family likes, or some candidate options, into a spreadsheet for filtering/reference purposes.
  3. Larger, probably automated sites that are crawling the web itself, or perhaps using CommonCrawl, to assess larger numbers of recipes for academic/commercial/research purposes.

As a maintainer and developer, I generally want all three of these groups to have a good developer/user experience, but also to encourage them to behave well, and to notify them about potential problems in their usage patterns and the data.

In particular, for this field method that we always expect to return None -- so it's kinda pointless to invoke -- I'd like use-cases (1) and (3) to get some kind of indication -- whether it's in their individual console or their haystack of logs -- that for a given website, there's no information available from a particular field. For use-case (2), it's probably fine to pass by silently -- ideally we'd perhaps autopopulate a comment or note in the spreadsheet to indicate that an individual row isn't available from that source (phrased in a clearer way than I just attempted), but doing so does incur a risk of distracting the person from their objectives.

I'm also likely to be biased because my own use-case is (3) as a small-business owner that crawls the web -- but to a certain extent that also makes me want to boost the other use-cases because I want the project to remain healthy (and I think that that depends on the other two cases).

Copy link
Contributor Author

@rmdluo rmdluo Apr 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can do that, could we get some more details on what that looks like? Would it just be calling something like

warnings.filterwarnings('ignore', message='Not known to be available') 
warnings.warn("Not known to be available")

before returning?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's the behaviour I had in mind, yep - was it possible to achieve that without quirky/unusual code? (if the code to achieve it seem complicated, then I think we might want to go with a simpler approach instead. in any case though, thank you for investigating this)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was really simple, just an extra if statement at the top of the file. Not sure how it interacts with other parts of the repo though. I'll push it and you can take a look.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and made the warning message more descriptive as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it also displays the warning when calling python -m unittest -k thecookingguy, not sure if that is desirable

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some ways that is quite useful I think, because it provides us with some context about how many scraper fields are affected by this. However: that output can seem a bit noisy when compared to the fairly clean default output of unittest.

Again I'm afraid I don't have neat simple answers about how to handle this yet. We could add some filtering in our continuous integration to filter the message there -- or collect them so that they're somehow aggregated in the test report output -- and/or do the same for local development.

warns users about total_time returning null, by default the message is
ignored. The message gets displayed if any warning flags except -Wignore
is used.
Comment on lines 10 to 11
if not sys.warnoptions:
warnings.simplefilter("ignore")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, neat, thank you @rmdluo. I'm thinking through some of the implications of this, but it's good to know that this is possible.

A request: could we narrow the filtering for this to the specific method that is affected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most recent commit should do this, I moved the if statement into the function. If no flags are set, then it will ignore the warning. If a flag is set, then it will follow the default behavior of that flag (as described in https://docs.python.org/3/library/warnings.html#the-warnings-filter). There's no more module level filtering now and I tested that by putting a temporary warning in the title() method and saw that, when run with no flags, the title warning printed as normal and the total_time warning was ignored.

filters warnings specifically in total_time, switching from setting the
default filter for the whole module. Maintains the same functionality
(ignore if no flag set, if flag is set follow flag behavior and show on
-Wdefault)
Comment on lines 31 to 34
if sys.warnoptions:
warnings.warn(
null_return_warning.format(self.host(), "total_time", BUG_REPORT_LINK)
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is slightly different to my suggestion, I think; I reckon we should always emit the warning (no if condition gating it), but because I'm wary of people copy-and-pasting code, I don't think we should include a warnings.simplefilter line that ignores all warnings (even -- or perhaps especially -- when it is only enabled if sys.warnoptions is set).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh you're right, I'll go update it

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem :) thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or wait no, I think we were using the ignore filter to ignore warnings by default when no flags are set. I can change it to just follow the default behavior of the flag and always emitting with no flag if you want?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I probably overcomplicated the requirements, to be honest :/

Let's begin by emitting a warning -- without applying any conditional filtering -- and see what the results of that are like.

Copy link
Contributor Author

@rmdluo rmdluo Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without any conditional filtering, the warning always gets emitted except when the -Wignore flag is used. How many times it gets emitted depends on which flag is used.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I think that's acceptable; perhaps an improvement - and it's simpler to reason about. Does that seem OK to you too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems good to me! Either way if we decide to go back to conditional filtering, the code is somewhere in the previous commits. I'll push the simplified one now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that could be done in the future is creating a centralized custom warnings set:
https://angwalt12.medium.com/custom-warnings-in-python-aa61022c725c

@mlduff mlduff mentioned this pull request Apr 29, 2024
Copy link
Collaborator

@jayaddison jayaddison left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thanks @rmdluo!

I'll likely merge and release this later this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Website scraper - www.thecookingguy.com
8 participants