-
Notifications
You must be signed in to change notification settings - Fork 500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for thecookingguy.com #1067
Conversation
generates initial web scraper framework
Adds custom ingredients method
Great start! edit: I was mistaken! For instances where def site_name(self):
return "Sam The Cooking Guy" note: should the description of this PR read |
Oh yes, sorry about that I copied and pasted from another one so I could get the format right |
…o thecookingguy-description
implemented ingredient_groups
thecookingguy missing fields
currently fetches instruction div form html and outputs a single string when calling scrapper.instructions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good, just a couple of minor comments (which are just suggestions, happy to be wrong).
…groups Thecookingguy refactor ingredient groups
conditonal check for checking for headers was incorrect and was outputting nothing before for pages with no headers for instructions
Thecookingguy refactor instructions
Thecookingguy tests
@rmdluo , as this has been under active development let me know when you're ready for a review! |
Thecookingguy description
I think we're ready for review now! |
Oops wait forgot to add it to the list of websites on the readme |
Good catch 😄 |
recipe_scrapers/thecookingguy.py
Outdated
def total_time(self): | ||
return None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that no recipes from thecookingguy.com seem to have total times listed? (I can believe it, but just checking that that's the intended behaviour here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so yes, they have times within the instructions but not a single field for total time. See here for an example: https://www.thecookingguy.com/recipes/chili-cumin-lamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't something I've suggested before, so it would be a bit novel, but: what do you think about emitting a Python warning -- perhaps even one that would be ignored by default -- before returning the empty (not-known-to-be-available) value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the risk of seeming spammy or self-absorbed, let me provide some more context here; I generally have a few types of end-user in mind for this library:
- Enthusiasts/hobbyists who want to find an accessible way to read a small number of recipes from the web, or who may be learning some fun/straightforward Python code, and who may be running the code partly/entirely from the Python interpreter.
- Slightly more advanced users -- let's call them tech parents of the 2020s -- who are able to create a spreadsheet and run some Python code, and may be doing things like importing a group of recipes that the family likes, or some candidate options, into a spreadsheet for filtering/reference purposes.
- Larger, probably automated sites that are crawling the web itself, or perhaps using CommonCrawl, to assess larger numbers of recipes for academic/commercial/research purposes.
As a maintainer and developer, I generally want all three of these groups to have a good developer/user experience, but also to encourage them to behave well, and to notify them about potential problems in their usage patterns and the data.
In particular, for this field method that we always expect to return None
-- so it's kinda pointless to invoke -- I'd like use-cases (1) and (3) to get some kind of indication -- whether it's in their individual console or their haystack of logs -- that for a given website, there's no information available from a particular field. For use-case (2), it's probably fine to pass by silently -- ideally we'd perhaps autopopulate a comment or note in the spreadsheet to indicate that an individual row isn't available from that source (phrased in a clearer way than I just attempted), but doing so does incur a risk of distracting the person from their objectives.
I'm also likely to be biased because my own use-case is (3) as a small-business owner that crawls the web -- but to a certain extent that also makes me want to boost the other use-cases because I want the project to remain healthy (and I think that that depends on the other two cases).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can do that, could we get some more details on what that looks like? Would it just be calling something like
warnings.filterwarnings('ignore', message='Not known to be available')
warnings.warn("Not known to be available")
before returning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's the behaviour I had in mind, yep - was it possible to achieve that without quirky/unusual code? (if the code to achieve it seem complicated, then I think we might want to go with a simpler approach instead. in any case though, thank you for investigating this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was really simple, just an extra if statement at the top of the file. Not sure how it interacts with other parts of the repo though. I'll push it and you can take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went ahead and made the warning message more descriptive as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it also displays the warning when calling python -m unittest -k thecookingguy
, not sure if that is desirable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some ways that is quite useful I think, because it provides us with some context about how many scraper fields are affected by this. However: that output can seem a bit noisy when compared to the fairly clean default output of unittest
.
Again I'm afraid I don't have neat simple answers about how to handle this yet. We could add some filtering in our continuous integration to filter the message there -- or collect them so that they're somehow aggregated in the test report output -- and/or do the same for local development.
warns users about total_time returning null, by default the message is ignored. The message gets displayed if any warning flags except -Wignore is used.
recipe_scrapers/thecookingguy.py
Outdated
if not sys.warnoptions: | ||
warnings.simplefilter("ignore") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, neat, thank you @rmdluo. I'm thinking through some of the implications of this, but it's good to know that this is possible.
A request: could we narrow the filtering for this to the specific method that is affected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most recent commit should do this, I moved the if statement into the function. If no flags are set, then it will ignore the warning. If a flag is set, then it will follow the default behavior of that flag (as described in https://docs.python.org/3/library/warnings.html#the-warnings-filter). There's no more module level filtering now and I tested that by putting a temporary warning in the title() method and saw that, when run with no flags, the title warning printed as normal and the total_time warning was ignored.
filters warnings specifically in total_time, switching from setting the default filter for the whole module. Maintains the same functionality (ignore if no flag set, if flag is set follow flag behavior and show on -Wdefault)
recipe_scrapers/thecookingguy.py
Outdated
if sys.warnoptions: | ||
warnings.warn( | ||
null_return_warning.format(self.host(), "total_time", BUG_REPORT_LINK) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is slightly different to my suggestion, I think; I reckon we should always emit the warning (no if
condition gating it), but because I'm wary of people copy-and-pasting code, I don't think we should include a warnings.simplefilter
line that ignores all warnings (even -- or perhaps especially -- when it is only enabled if sys.warnoptions
is set).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh you're right, I'll go update it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem :) thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or wait no, I think we were using the ignore filter to ignore warnings by default when no flags are set. I can change it to just follow the default behavior of the flag and always emitting with no flag if you want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I probably overcomplicated the requirements, to be honest :/
Let's begin by emitting a warning -- without applying any conditional filtering -- and see what the results of that are like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without any conditional filtering, the warning always gets emitted except when the -Wignore flag is used. How many times it gets emitted depends on which flag is used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I think that's acceptable; perhaps an improvement - and it's simpler to reason about. Does that seem OK to you too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good to me! Either way if we decide to go back to conditional filtering, the code is somewhere in the previous commits. I'll push the simplified one now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something that could be done in the future is creating a centralized custom warnings set:
https://angwalt12.medium.com/custom-warnings-in-python-aa61022c725c
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me, thanks @rmdluo!
I'll likely merge and release this later this week.
This PR adds support for thecookingguy.com, using https://www.thecookingguy.com/recipes/chili-cumin-lamb as the test case.
Resolves #957