Product Information? #17

oyeanuj · 2016-07-16T17:23:03Z

Hi @ianstormtaylor, I'm not sure if this is completely out of scope for this library - if yes, apologies.

But in case it isn't, it would be amazing to treat product pages as distinct from articles by getting product-specific information from the sites (at least the main ones have it standardized). Here is a library (though a bit outdated) I found which does some of that: https://github.com/hulihanapplications/fletcher/blob/master/lib/fletcher/models/

Thank you - your library looks great! :)

ianstormtaylor · 2016-07-17T21:32:02Z

Hey @oyeanuj, nice, I agree!

Related to #11, I think it would be nice to have these different types of rules bundled as separate plugins, since they're very specific. And it doesn't really make sense for articles to be given so much weight over other types of content by being part of core. I just did it that way since it was my first needed use case.

If you end up hacking on a product bundle of scraping rules, I'd be down to split them out!

blakeembrey · 2016-07-17T23:43:25Z

@oyeanuj I already have https://github.com/blakeembrey/node-scrappy which is parsing out product information (JSON-LD and microdata) if you're interested. It just needs to be extracted from the resulting data set (Scrappy uses a two-phase scraping process: the first phase scrapes all information, the second creates snippets). Here's an example of product information from Airbnb (https://github.com/blakeembrey/node-scrappy/blob/master/test/fixtures/airbnb-ny-apartment/result.json#L62-L75).

@ianstormtaylor Sorry to cross-promote, we had this discussion a while back, I think. My goal is to extract known information from the page, while this library's goal was slightly different. I'd still be down to try to normalize them if possible.

Edit: Note that my goal is also to use only standardised metadata for now; it's not scraping unknowns.

Edit 2: It's also parsing favicons, so you may want to replicate that logic here - https://github.com/blakeembrey/node-scrappy/blob/master/src/rules/html.ts#L415-L421 and https://github.com/blakeembrey/node-scrappy/blob/master/src/rules/html.ts#L533-L556.
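
For readers following along, here is a rough sketch of what a two-phase approach like the one described above could look like for JSON-LD product data. It is not Scrappy's or this library's actual code: the names `extractJsonLd`, `toProductSnippet`, and `ProductSnippet` are hypothetical, the snippet shape is an assumption, and a real implementation would use an HTML parser rather than a regular expression.

```typescript
// Hypothetical shape of a normalized product snippet (an assumption for this sketch).
interface ProductSnippet {
  name?: string;
  price?: string;
  currency?: string;
}

type JsonLdNode = Record<string, unknown>;

// Small helper: return the value only if it is a string.
function asString(value: unknown): string | undefined {
  return typeof value === "string" ? value : undefined;
}

// Phase 1: collect every JSON-LD block on the page without interpreting it.
// A regular expression keeps the sketch dependency-free; it is not robust HTML parsing.
function extractJsonLd(html: string): JsonLdNode[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const nodes: JsonLdNode[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      const parsed: unknown = JSON.parse(match[1]);
      const items: unknown[] = Array.isArray(parsed) ? parsed : [parsed];
      for (const item of items) {
        if (item !== null && typeof item === "object") {
          nodes.push(item as JsonLdNode);
        }
      }
    } catch (err) {
      // Malformed JSON-LD is skipped rather than failing the whole scrape.
    }
  }
  return nodes;
}

// Phase 2: shrink the raw data into a small, normalized snippet.
// Only a top-level node whose "@type" is exactly "Product" is considered.
function toProductSnippet(nodes: JsonLdNode[]): ProductSnippet | undefined {
  const product = nodes.find((node) => node["@type"] === "Product");
  if (!product) return undefined;
  const offers = product["offers"];
  const offer = (Array.isArray(offers) ? offers[0] : offers) as
    | JsonLdNode
    | undefined;
  return {
    name: asString(product["name"]),
    price:
      offer && offer["price"] !== undefined ? String(offer["price"]) : undefined,
    currency: offer ? asString(offer["priceCurrency"]) : undefined,
  };
}
```

Calling `toProductSnippet(extractJsonLd(html))` on a page with no `Product` node returns `undefined`, so a caller could fall back to other rules (for example, article rules) in that case.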

ianstormtaylor · 2016-07-17T23:56:47Z

Nice! No worries about cross-promotion at all :)

blakeembrey · 2016-07-18T00:02:22Z

Thanks 😄

FWIW, all the major product pages in the linked Ruby app seem to have decent metadata already on the page. I ran them through the current version of Scrappy and it extracted product information from them all (borderless/unfurl@612dff2) - all of them are using microdata. Someone just needs to use that microdata.

Edit: See result.json, that's the raw extracted data before it's shrunk into a normalized snippet.

oyeanuj · 2016-07-18T00:19:52Z

@blakeembrey Very cool! I'll try to go over the commit and play around with node-scrappy soon!

Kikobeats · 2017-12-13T20:13:19Z

Please check #41 😄

oyeanuj mentioned this issue Mar 12, 2017

Product information borderless/unfurl#35

Open

Kikobeats closed this as completed Dec 16, 2017
