JPG compression and lightweight html nbconvert template for blogs #4448

Closed

dbarbeau
Contributor

These commits make publishing on blog sites (e.g. Blogger) easier.

Rationale

Using "ipython nbconvert" produces self-contained HTML (with data URIs for images), which is very handy for blog posts: one just copies the body and pastes it into the new blog post. However, hosts limit the size per post, so for this purpose a more compact HTML representation of the notebook is desired.

Strategy

Two places where bytes can be easily saved have currently been identified:

  • Data URIs for matplotlib figures are currently PNG by default, or optionally SVG. Both can quickly get bigger than a data-URI-encoded JPG.
  • Code cells are pre-highlighted, which introduces many tags for even small code blocks.
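
To make the size concern concrete, here is a minimal sketch of how an image payload grows once it is embedded as a base64 data URI. The helper names `data_uri` and `data_uri_length` are hypothetical and are not part of IPython or nbconvert; the only fact relied on is that base64 emits 4 characters for every 3 input bytes.

```python
import base64


def data_uri(payload, mime_type):
    """Embed raw bytes as a base64 data URI (hypothetical helper)."""
    encoded = base64.b64encode(payload).decode("ascii")
    return "data:{};base64,{}".format(mime_type, encoded)


def data_uri_length(payload_size, mime_type):
    """Predict the length of the data URI without building it."""
    prefix = len("data:{};base64,".format(mime_type))
    # base64 emits 4 characters per 3 input bytes, rounded up to a full group.
    return prefix + 4 * ((payload_size + 2) // 3)


if __name__ == "__main__":
    payload = bytes(range(256)) * 4  # 1024 bytes of stand-in image data
    uri = data_uri(payload, "image/png")
    print(len(payload), len(uri), data_uri_length(len(payload), "image/png"))
```

Every byte saved in the image therefore saves roughly 1.33 characters in the post, which is why a smaller image format pays off quickly.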

The solution works in the scenario where the user starts ipython notebook, creates and saves notebooks, then uses nbconvert to convert the notebook to html. Only the body is useful.

JPG compression for images

By starting the notebook with "--InlineBackend.figure_format=jpg" the figures are transported in the JPG format. The user also has access to "--InlineBackend.quality=XX" to set the level of compression. Of course, JPG is lossy but many figures can do with some compression. This is only enabled if PIL is installed.

This can also be useful in the general use case, to save some bandwidth.
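
As an illustration only, the following sketch re-encodes PNG bytes as a JPEG data URI with Pillow. It is not the code from this PR, which changes the inline backend in the kernel; the function name `png_to_jpeg_data_uri` and its `quality` default are assumptions. The import is optional, mirroring the "only enabled if PIL is installed" behaviour described above.

```python
import base64
import io

try:
    from PIL import Image  # Pillow; optional dependency
except ImportError:
    Image = None


def png_to_jpeg_data_uri(png_bytes, quality=75):
    """Re-encode PNG bytes as a JPEG data URI (hypothetical helper)."""
    if Image is None:
        raise RuntimeError("Pillow (PIL) is required for JPEG conversion")
    img = Image.open(io.BytesIO(png_bytes))
    if img.mode in ("RGBA", "LA", "P"):
        # JPEG has no alpha channel, so flatten onto a white background.
        rgba = img.convert("RGBA")
        background = Image.new("RGB", rgba.size, (255, 255, 255))
        background.paste(rgba, mask=rgba.split()[3])
        img = background
    else:
        img = img.convert("RGB")
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    encoded = base64.b64encode(buf.getvalue()).decode("ascii")
    return "data:image/jpeg;base64," + encoded
```

Lower `quality` values give smaller output at the cost of visible artifacts, which tend to be most noticeable on sharp lines and text in plots.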

In browser code highlighting

Syntax highlighting in HTML introduces many tags that consume precious bytes. The strategy chosen here is to convert code blocks to HTML-escaped raw code blocks and assign them CSS classes that let google-code-prettify run on the code in the browser. This is done with a new nbconvert template: "lightweight_blog".
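
A rough sketch of the idea, not the template from this PR: the hypothetical helper below escapes a cell's source and wraps it in a `<pre>` element carrying the `prettyprint`, `lang-*` and `linenums` classes that google-code-prettify looks for. The function name and parameters are assumptions made for illustration.

```python
from html import escape


def code_to_prettify_block(source, language=None, line_numbers=False):
    """Wrap raw code in a <pre> that google-code-prettify can highlight."""
    classes = ["prettyprint"]
    if language:
        # e.g. "lang-python"; taken from the cell rather than hardcoded.
        classes.append("lang-" + language)
    if line_numbers:
        classes.append("linenums")
    return '<pre class="{}">{}</pre>'.format(" ".join(classes), escape(source))


if __name__ == "__main__":
    print(code_to_prettify_block("if a < b:\n    print(a & b)", "python"))
```

The output contains one tag pair per cell plus the escaped text, instead of one `<span>` per token as with server-side highlighting.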

Doubts

The user must still edit his blog's template settings to add the required JS and CSS to make the blog post look as much as possible like the notebook. The generated html includes instructions on what should go where but those bits shouldn't be included in the blog post to save bytes. As they are common to many blog posts, it is better to put those resources into the template than into each post. Are there ways to remove the need for user intervention? This is unlikely with the constraints given. Are there better ways to tell the user to edit his blog's template (an nbconvert post-processor that prints an info message)?

@Carreau
Member

Carreau commented Oct 28, 2013

Hi,

Thanks !

I don't think the choice of JPEG should be made in the kernel, before nbconvert processing.
In any case this is a mechanism we will update soon, so the part of the patch that allows inline-backend=jpeg will probably be refused (or at least it should be discussed in an orthogonal PR). We will move toward arbitrary MIME types in the new notebook format.

I think it would be much better to have a preprocessor in nbconvert that converts png->jpeg.

I'll re-check, but highlighting in the browser is something that makes sense and that I was planning to do on nbviewer too, to reduce page size. But I'd like to avoid hardcoding lang-python, as ipynb can also have ruby/julia/haskell...

Will comment more later.

@dbarbeau
Contributor Author

Hello,

Indeed, I just stumbled upon stubs (in the nbconvert's code) that convert pictures as a preprocessor. This is definitely the way to go to convert to a blog format (or anything else).

However, the InlineBackend.figure_format option also allows for JPG compression during normal notebook (not nbconvert-targeted) use. Anyway, if arbitrary mime-types are planned then jpeg is included, so I'm fine with that ^^.

Concerning hard-coded css classes, I wasn't aware so many languages were supported. It certainly makes sense not to hardcode the language. I will check to see if this can be handled more gracefully.

@dbarbeau
Contributor Author

The two previous commits enable in-browser syntax highlighting both in notebook and in nbconvert in markdown and input cells.

  • In notebooks it will highlight markdown code blocks with highlight.js' language autodetection. If fenced blocks are used it might be able to use the specified language (untested). It doesn't touch input cells (handled by CodeMirror).
  • In nbconvert's output it will highlight markdown code blocks with google_code_prettify language autodetection. It will highlight input cells using the cell's language attribute.

The current situation, where I use both prettify and highlight.js, arose because my first choice was prettify and I suspected that highlighting in the notebook was being done by pygments on the server side. I then found out that highlight.js was already in the code base. I'm undecided about which is the better choice.

@minrk
Member

minrk commented Oct 28, 2013

You have restored old behavior for highlighting with no language specified, and we found this to be problematic, hence the current behavior. The auto language detection should be removed.

@dbarbeau
Contributor Author

I was suspecting that. The code was screaming something like: "i've been disabled for some reason". This morning I found both the commit that disabled it and a justification (there is no other way to disable highlighting locally, or in other words don't highlight by default). Will revert.

@dbarbeau
Contributor Author

I reverted notebook highlighting to its old behaviour: for code blocks inside markdown cells, if no language is given we don't highlight.

The nbconvert "lightweight_blog" template now follows the same rule. The only difference is the highlighter being used. The notebook uses highlight.js while the template uses prettify. I think I prefer prettify for one main reason: line numbers can be enabled.

On a side note, I think I've been misusing git fetch/merge upstream to keep up-to-date with upstream. Maybe rebase would have been better. If this causes headaches, I'll just create a good old patch. But I really need to understand this git thingy more deeply!

@damianavila
Member

Yep! This needs a rebase ;-)

dbarbeau added 11 commits November 2, 2013 10:35
…cludes some comments to guide the user in publishing the blog post
=========
- Enable syntax highlighting inside markdown code. Uses google code prettify
because it doesn't need <code> tags inside <pre> tags. Hum... is this a good reason?
- Fix Javascript tags and code.
========
- Enable in-browser syntax highlighting inside markdown. It solely relies
on highlight.js' language autodetection. A better way would be to use fenced code blocks
which can include info about the language.
do not highlight code in markdown that doesn't have a language specified.

There is a bug in pandoc < 1.12.1 where the --no-highlight option is not honored
in some situations. The IPython.nbconvert.utils.pandoc module prints a warning
if the minimal version is not satisfied.
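
A minimal sketch of such a version check, under stated assumptions: version strings are purely numeric and dotted, and the helper names (`parse_version`, `meets_minimum`, `warn_if_too_old`) are hypothetical rather than the actual contents of IPython.nbconvert.utils.pandoc. It uses `warnings.warn` where the commit message says the module prints a warning.

```python
import warnings

MINIMAL_PANDOC_VERSION = "1.12.1"  # taken from the commit message above


def parse_version(text):
    """Turn a purely numeric dotted version such as '1.12.1' into (1, 12, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(found, minimum=MINIMAL_PANDOC_VERSION):
    """Return True if `found` is at least `minimum`, comparing numerically."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.12' compares as '1.12.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def warn_if_too_old(found):
    """Emit a warning, rather than failing, when the version is too old."""
    if not meets_minimum(found):
        warnings.warn(
            "pandoc %s is older than %s; --no-highlight may not be honored"
            % (found, MINIMAL_PANDOC_VERSION)
        )
```

Comparing integer tuples instead of raw strings matters here: as strings, "1.9" sorts after "1.12.1", which would wrongly accept an old pandoc.
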
@dbarbeau
Contributor Author

dbarbeau commented Nov 2, 2013

Hello,

After many headaches I finally think I have rebased the branch on top of ipython/master. Tell me if there's anything.
I reordered commits so that the JPG figure_format patch comes last, in one single commit, and can (hopefully) be easily skipped.

Daniel

@ivanov
Member

ivanov commented Dec 5, 2013

Sorry about not having communicated this more clearly before, but in order to speed up the distribution of nbconvert templates and make it simpler to share such contributions, we encourage sharing those links here.

Could you please put a link to your IPython/nbconvert/templates/lightweight_blog.tpl file on the wiki?

(I've now added some documentation about this in #4650)

@Carreau
Member

Carreau commented Dec 6, 2013

To be a little more specific about what ivanov said:

We discussed this PR yesterday on Google Hangout; if you want to hear exactly what was said, it is available on YouTube. To recap, we believe that this PR mixes many things, and that they should probably be separated.

  • JPEG formatter for inline figures
  • lightweight blog templates
  • minimal pandoc version check
  • some fix for highlight
  • filters for nbconvert.

The template itself will not be accepted, as we try to keep only a bare minimum of templates in IPython itself.
If you cannot do something with the --template flag or config, then there is probably something we need to fix.

We agreed that highlighting should be fixed; we don't know how yet.


I have no strong feeling about pandoc version

def escape_for_html(text):
 +    return html_escape(text)

why not keep html_escape ?

I would also like to apologize for responding late; we usually respond to PRs faster.
I hope that splitting this into smaller chunks will help us review each of them more quickly.

Thanks.

@dbarbeau
Contributor Author

dbarbeau commented Dec 6, 2013

Hello guys,

Thanks for the feedback. I'm all for splitting this into chunks and putting the template somewhere else.

Carreau: Do you mean that the name html_escape was better? I think you're right, I have no clue as to why I suddenly changed it! If you mean there's already an html_escape equivalent function somewhere, I probably missed it OR it had limitations that I have since forgotten about (and that I should have documented).

What I'll do, if you agree, is redo this work cleanly :) It was my first github collaboration and I think my workflow was not good (should have worked in branches of course!). So, this PR can be closed and I'll submit new ones when specific points (from your bullet list) are ready.

@Carreau
Member

Carreau commented Dec 6, 2013

Do you mean that the name html_escape was better

No, the name was right, and in any case, filters get named by the dictionary that maps them.
So in the end you have multiple layers of indirection:

import stuff as thing

def myfun(arg):
    return thing(arg)

# I could also have written 
myfun = thing

#finally 
filter_dict['i_m_a_filter'] = myfun

It is less verbose, and more readable to do

import stuff as i_am_a_filter
filter_dict['i_m_a_filter'] = i_am_a_filter

In your case the following was enough:

try:
    from html import escape as escape_for_html  # Python 3
except ImportError:
    from cgi import escape as escape_for_html  # Python 2 fallback

@Carreau
Member

Carreau commented Dec 6, 2013

Also,

It was my first github collaboration and I think my workflow was not good (should have worked in branches of course!). So, this PR can be closed and I'll submit new ones when specific points (from your bullet list) are ready.

Well, for a first contribution it was quite good; feel free to close this and submit new PRs as you like.
And thanks for contributing!

@dbarbeau
Contributor Author

New PRs are coming so I'm closing this one!


5 participants