Unknown JSON-LD item #10

Closed
danjaywing opened this Issue Feb 10, 2017 · 10 comments

danjaywing commented Feb 10, 2017

Hi

I'm looking to build a script that sees what data it can glean from any given url, microdata first, then content. Your parser seems perfect for that, but I've noticed a case where an error is thrown in certain situations.

I'm giving the following url:
http://www.currys.co.uk/gbuk/computing/laptops/laptops/lenovo-yoga-510-14-2-in-1-black-10146249-pdt.html

And I'm getting the following warning:

Warning: get_class() expects parameter 1 to be object, array given in C:\Users\danm\Documents\Websites\page-scraper-analyser\vendor\jkphl\micrometa\src\Jkphl\Micrometa\Parser\JsonLD.php on line 217
Unknown JSON-LD item: {"items":[{"id":"_:b0","types":["http:\/\/schema.org\/BreadcrumbList"]

Is it finding microdata but attempting to parse it as JSON-LD?

I've also noticed cases where no data is obtained even though microdata is used on the page. Is this indicative of poor configuration on their end?

Thanks in advance

EDIT

Here's a list of urls with data that either isn't being returned, or is buggy:

I appreciate that some of these may be down to the implementation of the microdata on the pages themselves.

Owner

jkphl commented Feb 10, 2017

Thanks for reporting! I'm on holidays right now but will look into this as soon as possible!

danjaywing commented Feb 10, 2017

I've got around the warning by defining the parsers manually and excluding the JsonLD parser. I'm less interested in JsonLD, but it would be nice to find a fix at some point. Enjoy your holidays!
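
For readers following along, leaving one parser out can be expressed as a bit mask. The sketch below is only an illustration under stated assumptions: the three constants are local stand-ins that reuse the values 1, 2 and 4 the maintainer quotes further down this thread, and `isEnabled` is a hypothetical helper, not part of micrometa.

```php
<?php
// Local stand-ins for the library's parser flags (values as quoted in the thread).
define('PARSE_MICROFORMATS', 1);
define('PARSE_MICRODATA', 2);
define('PARSE_JSONLD', 4);

// Enable Microformats and Microdata, but leave JSON-LD out.
$parsers = PARSE_MICROFORMATS | PARSE_MICRODATA;

// Hypothetical helper: true when the given flag is set in the mask.
function isEnabled(int $parsers, int $flag): bool
{
    return ($parsers & $flag) === $flag;
}

var_dump(isEnabled($parsers, PARSE_MICRODATA)); // flag is set
var_dump(isEnabled($parsers, PARSE_JSONLD));    // flag is not set
```

Because each flag occupies its own bit, any combination of formats can be passed as a single integer.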

@jkphl jkphl closed this in 77fd011 Feb 11, 2017

Owner

jkphl commented Feb 11, 2017

@danjaywing Several things:

  1. The warning you've seen (get_class() expects parameter 1 to be object, array given) was due to the fact that the JSON-LD parser didn't support value lists yet (as needed for the breadcrumb navigation). I just added that feature.

  2. There is JSON-LD embedded into the page you provided (2 external blocks) and the error was thrown when that JSON-LD was parsed. And no, there's no HTML microdata on that page. Please don't confuse the different formats (i.e. Microformats, HTML Microdata and JSON-LD) and the vocabularies (e.g. http://schema.org). The schema.org vocabulary can be expressed with both HTML microdata and JSON-LD (which is the case here).

  3. You don't have to use a workaround but can easily control which formats are parsed by using the appropriate parser constant(s):

$parsers =  \Jkphl\Micrometa\Parser\Microformats2::PARSE |  // Microformats = 1
            \Jkphl\Micrometa\Parser\Microdata::PARSE |      // Microdata = 2
            \Jkphl\Micrometa\Parser\JsonLD::PARSE;          // JSON-LD = 4
$micrometaParser = new \Jkphl\Micrometa($url, null, $parsers);

  4. If you encounter cases where no data can be obtained but you think it should, please leave me some example URLs. Thanks! :)
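
To illustrate point 1: `get_class()` only accepts objects, so a JSON-LD property whose value is a list (as in a breadcrumb's `itemListElement`) has to be handled entry by entry. The following is a rough sketch of that idea, not the code from the actual fix; `describeValue` is a hypothetical function and the JSON sample is made up for the example.

```php
<?php
// Hypothetical normaliser, not micrometa's implementation: walks a decoded
// JSON-LD value that may be an object, a scalar, or a list of either.
function describeValue($value)
{
    if (is_array($value)) {
        // A value list: process each entry on its own instead of handing
        // the whole array to get_class(), which would raise the warning.
        return array_map('describeValue', $value);
    }
    if (is_object($value)) {
        return [
            'class'      => get_class($value),
            'properties' => array_map('describeValue', get_object_vars($value)),
        ];
    }
    return $value; // scalar: keep as-is
}

// Made-up sample with a list-valued property.
$json = '{"@type":"BreadcrumbList","itemListElement":['
      . '{"@type":"ListItem","position":1},'
      . '{"@type":"ListItem","position":2}]}';

$result = describeValue(json_decode($json));
print_r($result);
```

The key design choice is checking `is_array()` before `is_object()`, so lists are unpacked recursively and only genuine objects ever reach `get_class()`.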

danjaywing commented Feb 13, 2017

Thanks for the fix.

The following is an example of a page containing microdata that isn't parsed:
http://www.argos.co.uk/product/6707596

As you can see from the source code, there is a product type, but when I attempt to parse the url, no data for the product is retrieved.

Possibly an issue with their code. If your parser identifies 'mainEntityOfPage' does it begin parsing inside it?
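
One way to check, independently of any parser, which markup a page's source actually declares is to count JSON-LD script blocks and list the `itemtype` values of HTML Microdata items. The sketch below assumes PHP's DOM extension is available; `summariseMarkup` is a hypothetical helper and the HTML sample is invented. It does not answer the nesting question about `mainEntityOfPage`.

```php
<?php
// Hypothetical helper: summarises the structured-data markup in an HTML string.
function summariseMarkup(string $html): array
{
    libxml_use_internal_errors(true); // tolerate real-world, invalid HTML
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    $xpath = new DOMXPath($dom);

    // HTML Microdata items are elements carrying itemscope (and usually itemtype).
    $itemTypes = [];
    foreach ($xpath->query('//*[@itemscope][@itemtype]') as $node) {
        $itemTypes[] = $node->getAttribute('itemtype');
    }

    return [
        // JSON-LD is embedded in script elements of this type.
        'jsonLdBlocks'       => $xpath->query('//script[@type="application/ld+json"]')->length,
        'microdataItemTypes' => $itemTypes,
    ];
}

// Invented sample: one Microdata item, no JSON-LD.
$html = '<html><body><div itemscope itemtype="http://schema.org/Product">'
      . '<span itemprop="name">Example</span></div></body></html>';

$summary = summariseMarkup($html);
print_r($summary);
```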

@jkphl jkphl reopened this Feb 13, 2017

Owner

jkphl commented Feb 13, 2017

@danjaywing Thanks, I'll dig into it! :)

danjaywing commented Feb 13, 2017

Sorry, updated my last comment as there IS microdata coming through but none for the main product

danjaywing commented Feb 13, 2017

I've edited the original post with all current examples I've found, sorry if it seems nitpicky!

Owner

jkphl commented Feb 21, 2017

@danjaywing Thanks for your edits — and sorry, I found them only now. I released a new parser for RDFa Lite and HTML Microdata just yesterday which I plan to integrate into micrometa soon. I'll get back to this issue as soon as the new parser's working under the hood ...

danjaywing commented Feb 22, 2017

Ok thanks!

@jkphl jkphl self-assigned this Mar 24, 2017

@jkphl jkphl added the bug label Mar 24, 2017

@jkphl jkphl added this to the Second generation milestone Mar 24, 2017

jkphl added a commit that referenced this issue May 30, 2017

Owner

jkphl commented May 30, 2017

@danjaywing FYI: I just published the next major release with improved support for additional formats. I did a rough check with your list of example files. They all yield results now, though I think there are still some issues with HTML Microdata parsing. I will track these further over at jkphl/rdfa-lite-microdata#6. Thanks again for this valuable set of examples!

@jkphl jkphl closed this May 30, 2017
