Product Information? #17
Comments
Hey @oyeanuj, nice, I agree! Related to #11, I think it would be nice to have these different types of rules bundled as separate plugins, since they're very specific. And it doesn't really make sense for articles to be given so much weight over other types of content by being part of core. I just did it that way since it was my first needed use case. If you end up hacking on a product bundle of scraping rules, I'd be down to split them out!
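For illustration, here is a minimal sketch of what a separately bundled set of product rules could look like. The `Rule`/`RuleSet` types and the `($, url) => value` convention are assumptions made for this sketch, not this library's actual plugin API, and the selectors are just common Open Graph and microdata fallbacks:

```ts
// Hypothetical shape for a standalone bundle of product-scraping rules.
// The Rule/RuleSet types and the ($, url) => value convention are
// assumptions for illustration, not this library's actual plugin API.
import type { CheerioAPI } from 'cheerio'

type Rule = ($: CheerioAPI, url: string) => string | undefined
type RuleSet = Record<string, Rule[]>

const productRules: RuleSet = {
  // Try Open Graph product metadata first, then fall back to microdata.
  name: [
    ($) => $('meta[property="og:title"]').attr('content'),
    ($) => $('[itemtype*="schema.org/Product"] [itemprop="name"]').first().text() || undefined,
  ],
  price: [
    ($) => $('meta[property="product:price:amount"]').attr('content'),
    ($) => $('[itemprop="price"]').attr('content'),
  ],
  currency: [
    ($) => $('meta[property="product:price:currency"]').attr('content'),
    ($) => $('[itemprop="priceCurrency"]').attr('content'),
  ],
}

export default productRules
```

Keeping a bundle like this out of core would let article-centric rules and product-centric rules evolve independently, which seems to be the point being made above.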
@oyeanuj I already have https://github.com/blakeembrey/node-scrappy which is parsing out product information (JSON-LD and microdata), if you're interested. It just needs to be extracted from the resulting data set (Scrappy uses a two-phase scraping process: first it scrapes all information, then it creates snippets). Here's an example of product information from Airbnb (https://github.com/blakeembrey/node-scrappy/blob/master/test/fixtures/airbnb-ny-apartment/result.json#L62-L75). @ianstormtaylor Sorry to cross-promote, we had this discussion a while back, I think. My goal is to extract known information from the page, while this library's goal was slightly different. I'd still be down to try to normalize them if possible. Edit: Note that my goal is also only using standardised metadata for now; it's not scraping unknowns. Edit 2: It's also parsing favicons, so you may want to replicate that logic here - https://github.com/blakeembrey/node-scrappy/blob/master/src/rules/html.ts#L415-L421 and https://github.com/blakeembrey/node-scrappy/blob/master/src/rules/html.ts#L533-L556.
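As a rough illustration of the JSON-LD half of that (this is not Scrappy's actual implementation, just the general idea using cheerio), pulling schema.org/Product data out of a page can be as simple as parsing the `application/ld+json` script tags and filtering by `@type`:

```ts
// Sketch of extracting schema.org/Product data from JSON-LD script tags.
// Not Scrappy's implementation -- just the general idea, using cheerio.
import * as cheerio from 'cheerio'

interface ProductInfo {
  name?: string
  description?: string
  image?: string
  offers?: unknown
}

function extractJsonLdProducts (html: string): ProductInfo[] {
  const $ = cheerio.load(html)
  const products: ProductInfo[] = []

  $('script[type="application/ld+json"]').each((_, el) => {
    try {
      const data = JSON.parse($(el).text())
      // A script tag can hold a single object or an array of objects.
      const nodes = Array.isArray(data) ? data : [data]
      for (const node of nodes) {
        if (node && node['@type'] === 'Product') {
          products.push({
            name: node.name,
            description: node.description,
            image: node.image,
            offers: node.offers,
          })
        }
      }
    } catch {
      // Ignore malformed JSON-LD blocks.
    }
  })

  return products
}
```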
Nice! No worries about cross-promotion at all :)
Thanks 😄 FWIW, all the major product pages in the linked Ruby app seem to have decent metadata already on the page. I ran it on the current version of Scrappy and it extracted product information from them all (borderless/unfurl@612dff2) - all of them are using microdata. Someone just needs to use that microdata. Edit: See
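To make the "someone just needs to use that microdata" part concrete, here is a hedged sketch (again with cheerio; the field names and the `prop()` helper are just illustrative, not a fixed schema) of reading `itemprop` values from a schema.org/Product microdata scope:

```ts
// Sketch of reading schema.org/Product microdata with cheerio.
// Field names and the prop() helper are illustrative, not a fixed schema.
import * as cheerio from 'cheerio'

function extractMicrodataProduct (html: string) {
  const $ = cheerio.load(html)
  const scope = $('[itemscope][itemtype*="schema.org/Product"]').first()
  if (scope.length === 0) return undefined

  // Microdata values live either in a `content` attribute or in the text.
  const prop = (name: string): string | undefined => {
    const el = scope.find(`[itemprop="${name}"]`).first()
    if (el.length === 0) return undefined
    return el.attr('content') || el.text().trim() || undefined
  }

  return {
    name: prop('name'),
    image: scope.find('[itemprop="image"]').attr('src') || prop('image'),
    price: prop('price'),
    priceCurrency: prop('priceCurrency'),
  }
}
```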
@blakeembrey Very cool! I'll try to go over the commit and play around with node-scrappy soon!
Please check #41 😄
Hi @ianstormtaylor, I'm not sure if this is completely out of scope for this library - if yes, apologies.
But in case it isn't, it would be amazing to treat product pages as distinct from articles by getting product-specific information from the sites (at least the main ones have it standardized). Here is a library (though a bit outdated) I found which does some of that - https://github.com/hulihanapplications/fletcher/blob/master/lib/fletcher/models/
Thank you - your library looks great! :)