Product Information? #17

Closed
oyeanuj opened this issue Jul 16, 2016 · 6 comments

@oyeanuj

oyeanuj commented Jul 16, 2016

Hi @ianstormtaylor, I'm not sure if this is completely out of scope for this library - if yes, apologies.

But in case it isn't, it would be amazing to treat product pages as distinct from articles by extracting product-specific information from the sites (at least the main ones have it standardized). Here is a library (though a bit outdated) I found which does some of that: https://github.com/hulihanapplications/fletcher/blob/master/lib/fletcher/models/

Thank you - your library looks great! :)

@ianstormtaylor

Hey @oyeanuj, nice, I agree!

Related to #11, I think it would be nice to have these different types of rules bundled as separate plugins, since they're very specific. And it doesn't really make sense for articles to be given so much weight over other types of content by being part of core. I just did it that way since it was the first use case I needed.

If you end up hacking on a product bundle of scraping rules, I'd be down to split them out!

@blakeembrey

blakeembrey commented Jul 17, 2016

@oyeanuj I already have https://github.com/blakeembrey/node-scrappy, which parses out product information (JSON-LD and microdata), if you're interested. It just needs to be extracted from the resulting data set (Scrappy uses a two-phase scraping process: first it scrapes all information, then it creates snippets). Here's an example of product information from Airbnb (https://github.com/blakeembrey/node-scrappy/blob/master/test/fixtures/airbnb-ny-apartment/result.json#L62-L75).
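
For readers unfamiliar with JSON-LD, here is a minimal sketch of what pulling product metadata out of a page can look like. It is not Scrappy's actual code: `extractJsonLd` and `findProducts` are hypothetical names, and the regex-based scan is an assumption made to keep the example dependency-free (a real scraper would use an HTML parser).

```typescript
// Hypothetical helpers for illustration; not part of node-scrappy or this library.
interface JsonLdNode {
  "@type"?: string | string[];
  [key: string]: unknown;
}

// Collect every JSON-LD object embedded in <script type="application/ld+json"> blocks.
function extractJsonLd(html: string): JsonLdNode[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const nodes: JsonLdNode[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      const parsed: unknown = JSON.parse(match[1]);
      // A block may hold a single object or an array of objects.
      const items = Array.isArray(parsed) ? parsed : [parsed];
      for (const item of items) {
        if (item !== null && typeof item === "object") {
          nodes.push(item as JsonLdNode);
        }
      }
    } catch {
      // Skip blocks that are not valid JSON.
    }
  }
  return nodes;
}

// Keep only the nodes whose @type is (or includes) "Product".
function findProducts(nodes: JsonLdNode[]): JsonLdNode[] {
  return nodes.filter((node) => {
    const type = node["@type"];
    return Array.isArray(type) ? type.indexOf("Product") !== -1 : type === "Product";
  });
}
```

Calling `findProducts(extractJsonLd(html))` mirrors the two-phase idea in a very small way: first gather everything, then narrow it down to the snippet you care about.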

@ianstormtaylor Sorry to cross-promote, we had this discussion a while back, I think. My goal is to extract known information from the page, while this library's goal was slightly different. I'd still be down to try to normalize them if possible.

Edit: Note that my goal is also only using standardised metadata for now, it's not scraping unknowns.

Edit 2: It's also parsing favicons, so you may want to replicate that logic here: https://github.com/blakeembrey/node-scrappy/blob/master/src/rules/html.ts#L415-L421 and https://github.com/blakeembrey/node-scrappy/blob/master/src/rules/html.ts#L533-L556.
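
As a rough idea of what favicon discovery involves, the sketch below scans `<link>` tags for icon relations. It is not a copy of the linked Scrappy code: `IconLink`, `readAttr`, and `extractIcons` are hypothetical names, and the regex matching assumes quoted attribute values.

```typescript
// Hypothetical helpers for illustration; not taken from node-scrappy.
interface IconLink {
  href: string;
  rel: string;
  sizes?: string;
}

// Read a quoted attribute value from a single tag string.
function readAttr(tag: string, name: string): string | undefined {
  const m = new RegExp("\\b" + name + "\\s*=\\s*[\"']([^\"']*)[\"']", "i").exec(tag);
  return m ? m[1] : undefined;
}

// Collect <link> tags whose rel token list contains an icon relation.
function extractIcons(html: string): IconLink[] {
  const icons: IconLink[] = [];
  const pattern = /<link\b[^>]*>/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const tag = match[0];
    const rel = readAttr(tag, "rel");
    const href = readAttr(tag, "href");
    if (!rel || !href) continue;
    // rel is a space-separated token list, e.g. "shortcut icon".
    const tokens = rel.toLowerCase().split(/\s+/);
    const isIcon =
      tokens.indexOf("icon") !== -1 || tokens.indexOf("apple-touch-icon") !== -1;
    if (!isIcon) continue;
    icons.push({ href, rel, sizes: readAttr(tag, "sizes") });
  }
  return icons;
}
```

The returned `href` values are left as written in the page; resolving them against the page URL would be a separate step.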

@ianstormtaylor

Nice! No worries about cross-promotion at all :)

@blakeembrey

blakeembrey commented Jul 18, 2016

Thanks 😄

FWIW, all the major product pages in the linked Ruby app seem to have decent metadata already on the page. I ran them through the current version of Scrappy and it extracted product information from all of them (borderless/unfurl@612dff2); all of them are using microdata. Someone just needs to use that microdata.
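
To illustrate what "using that microdata" could mean, here is a small sketch that reads `itemprop` values from flat microdata markup. `extractItemProps` is a hypothetical helper, not Scrappy's API, and it assumes simple markup: it ignores nested `itemscope` items and keeps only the first value per property.

```typescript
// Hypothetical helper for illustration; handles only flat, simple microdata.
function extractItemProps(html: string): Record<string, string> {
  const props: Record<string, string> = {};
  // Match an opening tag carrying itemprop, plus any text directly after it.
  const pattern = /<(\w+)([^>]*\bitemprop=["']([^"']+)["'][^>]*)>([^<]*)/gi;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    const name = match[3];
    // Prefer an explicit content attribute (as on <meta>), else the element text.
    const contentAttr = /\bcontent=["']([^"']*)["']/i.exec(match[2]);
    const value = contentAttr ? contentAttr[1] : match[4].trim();
    if (!(name in props)) {
      props[name] = value;
    }
  }
  return props;
}
```

For a product page this would give a flat map such as name and price, which is roughly the kind of raw data a later normalization step could shrink into a snippet.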

Edit: See result.json, that's the raw extracted data before it's shrunk into a normalized snippet.

@oyeanuj
Author

oyeanuj commented Jul 18, 2016

@blakeembrey Very cool! I'll try to go over the commit and play around with node-scrappy soon!

@Kikobeats
Member

Please check #41 😄
