Added support for custom HTML parsing error handler (closes #4)
jkphl committed Mar 24, 2017
1 parent 624e7e0 commit cc8002f
Showing 6 changed files with 89 additions and 11 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -5,6 +5,7 @@ All Notable changes to *jkphl/rdfa-lite-microdata* will be documented in this fi
## [0.3.1] - 2017-03-24
### Added
* Dedicated exception for HTML parsing errors
* Added custom HTML parsing error handling ([#4](https://github.com/jkphl/rdfa-lite-microdata/issues/4))

## [0.3.0] - 2017-03-17
### Added
35 changes: 34 additions & 1 deletion doc/index.md
@@ -19,6 +19,11 @@ $rdfaItems = $rdfaParser->parseHtmlFile('/path/to/file.html');
// Parse an HTML string
$rdfaItems = $rdfaParser->parseHtml('<html><head>...</head><body vocab="http://schema.org/">...</body>');

// Parse an HTML file with custom error handler
$rdfaItems = $rdfaParser->parseHtmlFile('/path/to/file.html', function(\LibXMLError $error) {
    // ...
});

// Parse a DOM document (here: created from an HTML string)
$rdfaDom = new \DOMDocument();
$rdfaDom->loadHTML('<html><head>...</head><body vocab="http://schema.org/">...</body>');
@@ -163,7 +168,7 @@ The [Microdata](https://www.w3.org/TR/microdata/) format isn't specified for non
$microdataParser = new \Jkphl\RdfaLiteMicrodata\Ports\Parser\Microdata();

// Parse an HTML file
-$microdataItems = $microdataParser->parseHtmlFile('/path/to/file.html');
+$microdataItems = $microdataParser->parseHtmlFile('/path/to/file.html' /*, $customErrorHandler*/);

// Parse an HTML string
$microdataItems = $microdataParser->parseHtml('<html><head>...</head><body itemscope itemtype="http://schema.org/Movie">...</body>');
@@ -178,6 +183,34 @@ $microdataParserIri = new \Jkphl\RdfaLiteMicrodata\Ports\Parser\Microdata(true);
$microdataItems = $microdataParser->parseHtmlFile('/path/to/file.html');
```

## HTML parsing error handling

The parser uses PHP's bundled [libxml2](http://php.net/manual/de/book.libxml.php) extension to parse web documents, which requires the documents to be reasonably well-formed and valid. Unfortunately, libxml2's HTML parser isn't up to date with HTML5 and fails, for example, on the newer HTML5 elements. It also raises errors for various common HTML problems, such as

* invalid HTML entities (e.g. an unescaped ampersand `"&"`) or
* multiple attributes with the same name on a single element.

Ignoring these errors wholesale isn't a good idea, however, as it could lead to fundamentally wrong parsing results. The parser therefore ships with an internal HTML5 compatibility layer and additionally accepts a custom parsing error handler as the second argument of all `parseHtml*()` methods. The handler is called for each parsing error that the compatibility layer doesn't already absorb, and is expected to return `true` for allowable errors and `false` otherwise; any error that isn't allowed raises an HTML parsing exception.

```php
$customErrorHandler = function(\LibXMLError $error) {
    // Allow elements with unknown names
    if ($error->code == 801) {
        return true;
    }
    return false;
};
$parser->parseHtml('<html>...</html>', $customErrorHandler);
```
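
To find out which codes a particular document triggers before deciding what to allow, the errors can be collected from libxml directly. The following is a minimal sketch that doesn't use this library at all; the helper name `collectHtmlParsingErrors()` is hypothetical, and the exact codes and messages reported depend on the libxml2 version PHP was built against.

```php
<?php

/**
 * Collect the libxml errors raised while parsing an HTML string.
 * Hypothetical helper for illustration; not part of jkphl/rdfa-lite-microdata.
 *
 * @param string $html HTML source
 * @return \LibXMLError[] Parsing errors
 */
function collectHtmlParsingErrors($html)
{
    // Buffer errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new \DOMDocument();
    $dom->loadHTML($html);

    // Fetch the buffered errors, then restore the previous libxml state
    $errors = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $errors;
}

$html = '<html><head><title>Test</title></head><body><invalid>Test</invalid></body></html>';
foreach (collectHtmlParsingErrors($html) as $error) {
    printf("level %d, code %d: %s\n", $error->level, $error->code, trim($error->message));
}
```

Each printed line corresponds to one `\LibXMLError` object of the kind a custom handler receives, so the output can serve as a starting point for the list of codes to allow.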

The various error codes are described within the official [xmlError API documentation](http://www.xmlsoft.org/html/libxml-xmlerror.html#xmlParserErrors). To completely suppress all parsing errors you might use a handler like this:

```php
$customErrorHandler = function(\LibXMLError $error) {
    return true;
};
```
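
Rather than suppressing everything, a handler can tolerate a short list of known codes and keep a record of what it let through. The sketch below assumes that approach; `makeTolerantErrorHandler()` and the `$tolerated` log are hypothetical names, and 801 is simply the unknown-element code used in the example above.

```php
<?php

/**
 * Build an error handler that tolerates the given libxml error codes
 * and records every error it lets through.
 * Hypothetical helper for illustration; not part of jkphl/rdfa-lite-microdata.
 *
 * @param int[] $allowedCodes libxml error codes to tolerate
 * @param array $tolerated Receives a description of each tolerated error
 * @return callable Error handler
 */
function makeTolerantErrorHandler(array $allowedCodes, array &$tolerated)
{
    return function (\LibXMLError $error) use ($allowedCodes, &$tolerated) {
        if (!in_array($error->code, $allowedCodes, true)) {
            return false; // Not on the list: the error is not allowed
        }
        // Remember what was tolerated so it can be logged afterwards
        $tolerated[] = sprintf('line %d: %s', $error->line, trim($error->message));
        return true;
    };
}

$tolerated = [];
// 801 is the unknown-element code from the example above
$customErrorHandler = makeTolerantErrorHandler([801], $tolerated);
```

The resulting `$customErrorHandler` is passed as the second argument to any of the `parseHtml*()` methods, and `$tolerated` can be inspected or logged once parsing has finished.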

## Installation

This library requires PHP 5.5 or later. I recommend using the latest available version of PHP as a matter of principle. It has no userland dependencies. It's installable and autoloadable via [Composer](https://getcomposer.org/) as [jkphl/rdfa-lite-microdata](https://packagist.org/packages/jkphl/rdfa-lite-microdata).
@@ -191,6 +191,22 @@ class HtmlDocumentFactory implements DocumentFactoryInterface
         'wbr',
         'xmp'
     ];
+    /**
+     * Custom HTML parsing error handler
+     *
+     * @var callable|null
+     */
+    protected $errorHandler;
+
+    /**
+     * Constructor
+     *
+     * @param callable|null $errorHandler Custom HTML parsing error handler
+     */
+    public function __construct(callable $errorHandler = null)
+    {
+        $this->errorHandler = $errorHandler;
+    }

     /**
      * Create a DOM document from a source
@@ -219,14 +235,14 @@ protected function processParsingErrors(array $errors)
     {
         /** @var \LibXMLError $error */
         foreach ($errors as $error) {
-            if ($this->isNotInvalidHtml5TagError($error)) {
+            if ($this->isNotInvalidHtml5TagError($error) && $this->isNotAllowedError($error)) {
                 throw new HtmlParsingException($error);
             }
         }
     }

     /**
-     * Test if a parsing error is not because of an "invalid" HTML5 tag
+     * Test whether a parsing error is not because of an "invalid" HTML5 tag
      *
      * @param \LibXMLError $error Parsing error
      * @return bool Error is not because of an "invalid" HTML5 tag
@@ -239,4 +255,15 @@ protected function isNotInvalidHtml5TagError(\LibXMLError $error)
             !in_array($tag[1], self::$html5)
         );
     }
+
+    /**
+     * Test whether a parsing error is allowed per custom HTML parser error handler
+     *
+     * @param \LibXMLError $error Parsing error
+     * @return bool Error is not allowed
+     */
+    protected function isNotAllowedError(\LibXMLError $error)
+    {
+        return !(is_callable($this->errorHandler) && call_user_func($this->errorHandler, $error));
+    }
 }
10 changes: 6 additions & 4 deletions src/RdfaLiteMicrodata/Ports/Parser/Microdata.php
@@ -55,24 +55,26 @@ class Microdata extends AbstractParser
      * Parse an HTML file
      *
      * @param string $file HTML file path
+     * @param callable|null $errorHandler Custom HTML parsing error handler
      * @return \stdClass Extracted things
      */
-    public function parseHtmlFile($file)
+    public function parseHtmlFile($file, callable $errorHandler = null)
     {
-        return $this->parseHtml($this->getFileContents($file));
+        return $this->parseHtml($this->getFileContents($file), $errorHandler);
     }

     /**
      * Parse an HTML string
      *
      * @param string $string HTML string
+     * @param callable|null $errorHandler Custom HTML parsing error handler
      * @return \stdClass Extracted things
      */
-    public function parseHtml($string)
+    public function parseHtml($string, callable $errorHandler = null)
     {
         return $this->parseSource(
             $string,
-            new HtmlDocumentFactory(),
+            new HtmlDocumentFactory($errorHandler),
             new MicrodataElementProcessor(),
             new MicrodataContext()
         );
10 changes: 6 additions & 4 deletions src/RdfaLiteMicrodata/Ports/Parser/RdfaLite.php
@@ -82,24 +82,26 @@ public function parseXml($string)
      * Parse an HTML file
      *
      * @param string $file HTML file path
+     * @param callable|null $errorHandler Custom HTML parsing error handler
      * @return \stdClass Extracted things
      */
-    public function parseHtmlFile($file)
+    public function parseHtmlFile($file, callable $errorHandler = null)
     {
-        return $this->parseHtml($this->getFileContents($file));
+        return $this->parseHtml($this->getFileContents($file), $errorHandler);
     }

     /**
      * Parse an HTML string
      *
      * @param string $string HTML string
+     * @param callable|null $errorHandler Custom HTML parsing error handler
      * @return \stdClass Extracted things
      */
-    public function parseHtml($string)
+    public function parseHtml($string, callable $errorHandler = null)
     {
         return $this->parseSource(
             $string,
-            new HtmlDocumentFactory(),
+            new HtmlDocumentFactory($errorHandler),
             (new RdfaLiteElementProcessor())->setHtml(true),
             new RdfaLiteContext()
         );
@@ -86,4 +86,17 @@ public function testHtmlDocumentParsingError()
             $this->assertEquals('Tag invalid invalid', trim($parsingError->message));
         }
     }
+
+    /**
+     * Test a custom HTML parsing error handler
+     */
+    public function testHtmlDocumentCustomErrorHandler()
+    {
+        $customErrorHandler = function(\LibXMLError $error) {
+            return ($error->level == 2) && ($error->code == 801) && (trim($error->message) == 'Tag invalid invalid');
+        };
+        $htmlDocumentFactory = new HtmlDocumentFactory($customErrorHandler);
+        $htmlSource = '<html><head><title>Test</title></head><body><invalid>Test</invalid></body></html>';
+        $htmlDocumentFactory->createDocumentFromSource($htmlSource);
+    }
 }
