Unknown JSON-LD item #10

Closed
danjaywing opened this issue Feb 10, 2017 · 10 comments

@danjaywing danjaywing commented Feb 10, 2017

Hi

I'm looking to build a script that sees what data it can glean from any given url, microdata first, then content. Your parser seems perfect for that, but I've noticed a case where an error is thrown in certain situations.

I'm giving the following url:
http://www.currys.co.uk/gbuk/computing/laptops/laptops/lenovo-yoga-510-14-2-in-1-black-10146249-pdt.html

And I'm getting the following warning:

Warning: get_class() expects parameter 1 to be object, array given in C:\Users\danm\Documents\Websites\page-scraper-analyser\vendor\jkphl\micrometa\src\Jkphl\Micrometa\Parser\JsonLD.php on line 217
Unknown JSON-LD item: {"items":[{"id":"_:b0","types":["http:\/\/schema.org\/BreadcrumbList"]

Is it finding microdata but attempting to parse it as JSON-LD?

I've also noticed cases where no data is obtained even though microdata is used on the page. Is this indicative of poor configuration on their end?

Thanks in advance

EDIT

Here's a list of urls with data that either isn't being returned, or is buggy:

I appreciate that some of these may be down to the implementation of the microdata on the pages themselves.

Owner

@jkphl jkphl commented Feb 10, 2017

Thanks for reporting! I'm on holidays right now but will look into this as soon as possible!

Author

@danjaywing danjaywing commented Feb 10, 2017

I've got around the warning by defining the parsers manually and excluding the JsonLD parser. I'm less interested in JsonLD, but it would be nice to find a fix at some point. Enjoy your holidays!

@jkphl jkphl closed this in 77fd011 Feb 11, 2017
Owner

@jkphl jkphl commented Feb 11, 2017

@danjaywing Several things:

  1. The warning you've seen (get_class() expects parameter 1 to be object, array given) was due to the fact that the JSON-LD parser didn't support value lists yet (as needed for the breadcrumb navigation). I just added that feature.

  2. There is JSON-LD embedded into the page you provided (2 external blocks) and the error was thrown when that JSON-LD was parsed. And no, there's no HTML microdata on that page. Please don't confuse the different formats (i.e. Microformats, HTML Microdata and JSON-LD) and the vocabularies (e.g. http://schema.org). The schema.org vocabulary can be expressed with both HTML microdata and JSON-LD (which is the case here).

  3. You don't have to use a workaround but can easily control which formats are parsed by using the appropriate parser constant(s):

$parsers =  \Jkphl\Micrometa\Parser\Microformats2::PARSE |  // Microformats = 1
            \Jkphl\Micrometa\Parser\Microdata::PARSE |      // Microdata = 2
            \Jkphl\Micrometa\Parser\JsonLD::PARSE;          // JSON-LD = 4
$micrometaParser = new \Jkphl\Micrometa($url, null, $parsers);
  4. If you encounter cases where no data is obtained but you think there should be some, please leave me some example URLs. Thanks! :)
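
As a rough illustration of how the parser constants in point 3 combine, here is a minimal, self-contained sketch. The `PARSE_*` constants and the `isEnabled()` helper are hypothetical stand-ins, not part of micrometa; only the values 1, 2 and 4 are taken from the comments in the snippet above. Leaving a constant out of the bitwise OR is what excludes that format, for example JSON-LD:

```php
<?php
// Hypothetical stand-ins for the library's PARSE constants
// (values copied from the comments in the snippet above).
const PARSE_MICROFORMATS = 1;
const PARSE_MICRODATA    = 2;
const PARSE_JSONLD       = 4;

// Combine only the wanted formats with bitwise OR; JSON-LD is left out.
$parsers = PARSE_MICROFORMATS | PARSE_MICRODATA;

// Hypothetical helper: a format is enabled when its bit is set in the mask.
function isEnabled(int $mask, int $flag): bool
{
    return ($mask & $flag) === $flag;
}

var_dump(isEnabled($parsers, PARSE_MICRODATA)); // bool(true)
var_dump(isEnabled($parsers, PARSE_JSONLD));    // bool(false)
```

With the real library, the same idea would presumably mean passing only `Microformats2::PARSE | Microdata::PARSE` as the third constructor argument shown above.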
Author

@danjaywing danjaywing commented Feb 13, 2017

Thanks for the fix.

The following is an example of a page containing microdata that isn't parsed:
http://www.argos.co.uk/product/6707596

As you can see from the source code, there is a product type, but when I attempt to parse the url, no data for the product is retrieved.

Possibly an issue with their code. If your parser identifies 'mainEntityOfPage' does it begin parsing inside it?

@jkphl jkphl reopened this Feb 13, 2017
Owner

@jkphl jkphl commented Feb 13, 2017

@danjaywing Thanks, I'll dig into it! :)

Author

@danjaywing danjaywing commented Feb 13, 2017

Sorry, updated my last comment as there IS microdata coming through but none for the main product

Author

@danjaywing danjaywing commented Feb 13, 2017

I've edited the original post with all current examples I've found, sorry if it seems nitpicky!

Owner

@jkphl jkphl commented Feb 21, 2017

@danjaywing Thanks for your edits — and sorry, I found them only now. I released a new parser for RDFa Lite and HTML Microdata just yesterday which I plan to integrate into micrometa soon. I'll get back to this issue as soon as the new parser's working under the hood ...

Author

@danjaywing danjaywing commented Feb 22, 2017

Ok thanks!

@jkphl jkphl self-assigned this Mar 24, 2017
@jkphl jkphl added the bug label Mar 24, 2017
@jkphl jkphl added this to the Second generation milestone Mar 24, 2017
jkphl added a commit that referenced this issue May 30, 2017
Owner

@jkphl jkphl commented May 30, 2017

@danjaywing FYI: I just published the next major release with improved support for additional formats. I did a rough check with your list of example files. They all yield results now, though I think there are still some issues with HTML Microdata parsing. I will track these further over at jkphl/rdfa-lite-microdata#6. Thanks again for this valuable set of examples!

@jkphl jkphl closed this May 30, 2017