
Refined spiegel.de enhancer and added androidpolice.com #664


chaotix- commented Nov 22, 2014

This patch adds an androidpolice.com xpathenhancer and refines the existing spiegel.de xpathenhancer: it removes the empty bullet points from the top of articles and adds an author credit at the bottom.

chaotix- added some commits Nov 22, 2014

Refined spiegel.de xpatharticleenhancer
This xpathenhancer no longer produces empty bullet points at the top of the article. It also adds the author's name/sign at the end of the article.
// If this fails look for charset in <meta http-equiv="Content-Type" ...>
// As a last resort use mb_detect_encoding()
// Use UTF-8 if mb_detect_encoding does not return anything (or the HTML page is messed up)
$csregex = "/<meta\s+[^>]*charset\s*=\s*['\"]([^>]*)['\"][^>]*>/i";
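The fallback chain described in those comments could be sketched roughly like this. This is an illustrative, self-contained example, not the PR's actual code: the function name is hypothetical, and the capture group is loosened slightly so it also accepts unquoted charset values in the http-equiv form.

```php
<?php
// Hedged sketch (not the PR's code): charset detection via regex, with
// mb_detect_encoding() as fallback and UTF-8 as the last resort.
function detectCharset(string $body): string {
    // Matches both <meta charset="..."> and
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    $csregex = "/<meta\s+[^>]*charset\s*=\s*['\"]?([^'\">]+)['\"]?[^>]*>/i";
    if (preg_match($csregex, $body, $matches)) {
        return strtoupper(trim($matches[1]));
    }
    // No declaration found: guess, defaulting to UTF-8
    $enc = mb_detect_encoding($body);
    return $enc !== false ? $enc : "UTF-8";
}
```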

BernhardPosselt commented Nov 22, 2014

I think rather than using a regex we should use XPath; see https://github.com/owncloud/news/blob/master/articleenhancer/globalarticleenhancer.php#L35 for how to do this without running into XXE issues.

BernhardPosselt commented Nov 22, 2014

By the way, you can create a single XPath query for your two regexes by separating them with |,
e.g. https://github.com/owncloud/news/blob/master/articleenhancer/xpathenhancers.json#L20
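For illustration, the union operator mentioned above combines both charset-declaration styles into one query. This is an assumed, self-contained sketch (the HTML and variable names are made up for the example), not code from the PR:

```php
<?php
// Illustrative sketch of the XPath "|" (union) operator, matching both the
// HTML4-style and the HTML5-style charset declaration in a single query.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '<meta charset="iso-8859-1">'
      . '</head><body><p>hi</p></body></html>';

$dom = new DOMDocument();
@$dom->loadHTML($html);   // "@" silences warnings about sloppy real-world HTML

$xpath = new DOMXPath($dom);
$query = "/html/head/meta[@http-equiv='Content-Type']/@content"
       . " | /html/head/meta/@charset";

// Both attribute nodes come back from the one combined query
$result = $xpath->evaluate($query);
```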


chaotix- commented Nov 23, 2014

I revised the patch and now use a single regex. I don't know if the additional computational cost of an XPath query is worth it: you would have to create a DOM tree, get the charset information (if available), convert $body to the right encoding, and create another DOM tree to extract the article information.

A possible alternative would be to create the DOM tree without converting first, and then convert the extracted result of the xpathenhancer to the right encoding. But I don't know whether creating the DOM tree without first converting the HTML to the proper encoding could have unwanted side effects.


BernhardPosselt commented Nov 23, 2014

I actually think the cost is negligible, since updating is mostly I/O-bound. XPath is easier to read and avoids encoding/escaping issues. And we've got enough of those XPath queries already, so don't worry ;)


BernhardPosselt commented Nov 23, 2014

PS: we could share the DOM creation, since we're running an XPath query on it later anyway ;)


BernhardPosselt commented Nov 23, 2014

Although there may be encoding issues; can you look that up?


BernhardPosselt commented Nov 23, 2014

Let's worry about this afterwards if you like. The PR looks fine enough for me to merge 👍

BernhardPosselt pushed a commit that referenced this pull request Nov 23, 2014

Merge pull request #664 from chaotix-/androidpolice-spiegel-enhancers
Refined spiegel.de enhancer and added androidpolice.com

@BernhardPosselt BernhardPosselt merged commit bbd765b into owncloudarchive:master Nov 23, 2014

2 checks passed:
ci/scrutinizer: Scrutinizer canceled; tests passed
continuous-integration/travis-ci: The Travis CI build passed

chaotix- commented Nov 24, 2014

Mmmh. I tried doing it the XPath way, but it's messy. libxml2 is supposed to detect the encoding of HTML documents by itself, but older versions (< 2.8.0) don't support HTML5-style <meta charset> declarations. (According to http://www.glenscott.co.uk/blog/html5-character-encodings-and-domdocument-loadhtml-and-loadhtmlfile/)

So it looks like this:

    // Look for either an HTML4-style or an HTML5-style charset declaration
    $encSearch = "/html/head/meta[@http-equiv='Content-Type']/@content | /html/head/meta/@charset";
    $xpath = new DOMXpath($dom);
    $encResult = $xpath->evaluate($encSearch);

    if ($encResult->length == 0) {
        // No declaration found: guess the encoding, defaulting to UTF-8
        $enc = mb_detect_encoding($body);
        $enc = $enc ? $enc : "UTF-8";
        $body = mb_convert_encoding($body, 'HTML-ENTITIES', $enc);
        $dom = $this->loadHTML($body);
        $xpath = new DOMXpath($dom);
    } elseif ((LIBXML_VERSION < 20800)
              && ($encResult->length == 1)
              && ($encResult->item(0)->name == "charset")) {
        // Old libxml2 (< 2.8.0) ignores HTML5-style <meta charset>:
        // convert manually and rebuild the DOM tree
        $body = mb_convert_encoding($body, 'HTML-ENTITIES',
                    strtoupper($encResult->item(0)->value));
        $dom = $this->loadHTML($body);
        $xpath = new DOMXpath($dom);
    }

    $xpathResult = $xpath->evaluate($search);

The problem is that I have articles from heise.de with messed-up characters. My libxml is > 2.8.0, so it should work, but it doesn't for about 20% of the articles. (And the charset definitions look identical to me in the HTML source.)

I could drop the LIBXML_VERSION check, but I really don't like the idea of creating the DOM tree twice in the regular case. It's an expensive operation, and although I/O may be the limiting factor in how long a feed update takes, it's a waste of CPU resources. (Although I have done just that on my server now, to check whether it works at all.)

I also don't think creating the DOM tree once and just converting the XPath query result would be less trouble. Because libxml2 tries to determine the encoding by itself, we don't know whether it succeeded. And in the query results all special characters are converted to entities (with hex numbers, like &#xAA;). So even if you knew that libxml2 messed up the conversion, you would have to use mb_convert_encoding() to convert the entities back to the original encoding (I'm not sure whether you need to know the source charset that libxml2 guessed for that) and then convert again using the right source charset.


BernhardPosselt commented Nov 24, 2014

I see, then let's keep the regex solution :)

chaotix- deleted the androidpolice-spiegel-enhancers branch Nov 24, 2014


chaotix- commented Nov 26, 2014

Just for the record: the code above works (for me) if I disable the LIBXML_VERSION check. That way, every HTML5-style charset definition leads to the creation of two DOM trees.

So if the regex solution poses problems in the future, this may be a possible fix. It should be slow but reliable.
