This repository has been archived by the owner on Jul 26, 2023. It is now read-only.

Please add support for "Web View" #14

Open
greglaun opened this issue Nov 2, 2013 · 5 comments

Comments

@greglaun

greglaun commented Nov 2, 2013

Pocket's "Article View" mangles most of what I want to read into an unreadable format: it deletes images and sometimes grabs only a fraction of the article I added. The Pocket app on Android has a "Web View" reading mode that isn't subject to these bugs.

Since it seems to be non-trivial to scrape Pocket's Article View and in particular since that scraping frequently removes essential images from the article, it would be very useful to have a way to set Calibre to download the article the way it was meant to be viewed in the browser.

@onlyhavecans
Owner

I've found that rendering the entire webpage into an epub creates nearly unreadable content, which is why we clean and reflow content in the first place. Because the pages would come from many arbitrary sources instead of something predictable like the Article Viewer, or for example the feed from a single website, trying to clean and render each article cleanly would be very nontrivial.

I would be willing to review and push a PR, but I am not willing to write this myself because it is an edge case that will generate a LOT of code.

@greglaun
Author

greglaun commented Nov 4, 2013

I definitely wouldn't call it an edge case. The issue is that rendering the article view into an epub creates nearly unreadable content. I used this recipe to download ~30 articles and nearly all of them had major errors or were missing key parts of the article I had saved. No Wikipedia article rendered properly. Articles from other websites had strange non-printing characters showing up. All of them render perfectly readably on my phone in web view and would certainly render much more readably than the article view on my kindle.

Maybe I'm underestimating the complexity of adding this option. I can look into the code at some point in the future, but I'm currently swamped with other projects. But I'm surprised that it would take more than a few lines to take each link in the rss feed and download the content of that link.

If this change is really going to generate a lot of code, would it be possible to change the script to download all images in an article instead? This would fix about 80% of the readability issues for Wikipedia articles.

@onlyhavecans
Owner

> I definitely wouldn't call it an edge case.

One complaint in a few years is an edge case to me. I understand this affects you, and that's not good, but I haven't had the problems you've had. Then again, I don't Pocket wiki links, just news.

> but I'm currently swamped with other projects

Likewise

> I'm surprised that it would take more than a few lines to take each link in the rss feed and download the content of that link

Making it just pull the source website instead of the Article View is EASY. I have accidentally done it a few times in dev builds. The problem is reliably reflowing nearly random source content and making that readable on the Nook/Kindle. If you just take the raw page and cram that into an epub it will be nightmarish because it will grab EVERYTHING on that page, and I mean everything. The power of the calibre recipe is its reflowing of the page content into a proper epub, not just taping the page into the file.

> would it be possible to change the script to download all images in an article instead? This would fix about 80% of the readability issues for Wikipedia articles.

Yes, and I have this on my list of things to do, but it's not high priority, sorry.

@greglaun
Author

greglaun commented Nov 6, 2013

For anyone interested in reading content that Pocket's Article View doesn't work for, you can download the web view as follows. In the recipe, set articles_are_obfuscated to False and in the parse_index function, change

'url':          u'{0}/a/read/{1}'.format(self.index_url, pocket_article[0]),
'real_url':     pocket_article[1]['resolved_url'],

to

#'url':          u'{0}/a/read/{1}'.format(self.index_url, pocket_article[0]),
'url':     pocket_article[1]['resolved_url'],

I did this and all of the articles are now readable, which is a big improvement for my needs. The primary downside is that site navigation markup now shows up in the articles. Perhaps in a few months, when I have time to learn Python's HTML libraries, I'll add functionality to strip those nav elements. I'd also make Web View optional, since the changes above permanently replace Article View with Web View.
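For anyone who wants the switch to be optional rather than permanent, here is a minimal sketch of how the two lines above could sit behind a flag. It is not code from the recipe: `build_article_entry` and `use_web_view` are hypothetical names, and `https://pocket.example` is only a placeholder for the recipe's `index_url`. It assumes `pocket_article` is an `(item_id, item_dict)` pair, as the indexing in the snippet above suggests.

```python
# Sketch only: build_article_entry and use_web_view are hypothetical names,
# not part of the existing recipe.

def build_article_entry(index_url, pocket_article, use_web_view=False):
    """Return the 'url' and 'real_url' values for one Pocket item.

    pocket_article is assumed to be an (item_id, item_dict) pair.
    """
    item_id, item = pocket_article
    real_url = item['resolved_url']
    if use_web_view:
        # Web View: fetch the original page directly.
        url = real_url
    else:
        # Article View: fetch Pocket's cleaned-up rendering.
        url = u'{0}/a/read/{1}'.format(index_url, item_id)
    return {'url': url, 'real_url': real_url}


# Placeholder data for illustration only.
sample = ('12345', {'resolved_url': 'https://example.com/story'})
article_view = build_article_entry('https://pocket.example', sample)
web_view = build_article_entry('https://pocket.example', sample, use_web_view=True)
```

As described above, `articles_are_obfuscated` would presumably also have to be `False` whenever the flag is on, so in practice both settings would need to change together.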

@ssd2

ssd2 commented Feb 17, 2016

It might be interesting to add webview to the branch that pulls by tag. Use the webview on articles tagged with 'webview' ...
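A rough sketch of what that per-article choice might look like. Everything here is an assumption rather than code from the recipe or the tag branch: `wants_web_view` and `pick_url` are hypothetical helpers, each item is assumed to carry a `tags` collection that contains (or is keyed by) tag names, and `https://pocket.example` is a placeholder base URL.

```python
WEBVIEW_TAG = 'webview'  # tag name taken from the suggestion above


def wants_web_view(item, tag=WEBVIEW_TAG):
    """Return True if the item is tagged for Web View.

    Assumption: item['tags'], when present, is a dict keyed by tag name
    or a list/set of tag names; a missing or empty value means no tags.
    """
    tags = item.get('tags') or ()
    return tag in tags


def pick_url(index_url, pocket_article):
    """Choose the original page for tagged items, Article View otherwise."""
    item_id, item = pocket_article
    if wants_web_view(item):
        return item['resolved_url']
    return u'{0}/a/read/{1}'.format(index_url, item_id)


# Placeholder data for illustration only.
tagged = ('1', {'resolved_url': 'https://example.com/a', 'tags': {'webview': {}}})
untagged = ('2', {'resolved_url': 'https://example.com/b'})
tagged_url = pick_url('https://pocket.example', tagged)
untagged_url = pick_url('https://pocket.example', untagged)
```

How this would interact with `articles_are_obfuscated`, which the workaround above sets for the whole recipe rather than per article, is left open here.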
