
HTML cleaner for Rheinwerk (ex-Galileo) openbooks

This is a tool for cleaning up Rheinwerk openbooks (formerly known as Galileo openbooks) before converting them to EPUB or PDF format.

Current state of development: v1.2.0-SNAPSHOT is feature complete, i.e. it can download, MD5-verify, unpack and convert all 37 openbooks available at release time.

History: If you want to know details about what has changed in which version, please take a look at the change log.

Download: A precompiled, executable JAR file is available here.


$ java -jar openbook_cleaner-1.2.0-SNAPSHOT.jar --help

OpenbookCleaner usage: java ... [options] <book_id>*

Option                       Description
------                       -----------
-?, --help                   Display this help text
-c, --check-avail            Check Galileo homepage for available books,
                               compare with known ones
-d, --download-dir <File>    Download directory for openbooks; must exist
                               (default: .)
-l, --log-level <Integer>    Log level (0=normal, 1=verbose, 2=debug, 3=trace)
                               (default: 0)
-m, --check-md5              Download all known books without storing them,
                               verifying their MD5 checksums (slow! >1 Gb
-t, --threading <Integer>    Threading mode (0=single, 1=multi); single is
                               slower, but better for diagnostics) (default: 1)
-w, --write-config           Write editable book list to config.xml, enabling
                               you to update MD5 checksums or add new books

book_id1 book_id2 ...        Books to be downloaded & converted

Legal book IDs:
  all (magic value: all books), actionscript_1_und_2, actionscript_einstieg,
  apps_iphone_ios5, apps_iphone_ios6, asp_net, c_von_a_bis_z, dreamweaver_8,
  excel_2007, hdr_fotografie, it_handbuch, javascript_ajax, java_7, java_insel,
  joomla_1_5, linux, linux_unix_prog, microsoft_netzwerk, oop, photoshop_cs2,
  photoshop_cs4, php_pear, ruby_on_rails_2, shell_prog, ubuntu_10_04,
  ubuntu_11_04, ubuntu_12_04, unix_guru, vb_2008, vb_2008_einstieg,
  vb_2010_einstieg, vb_2012_einstieg, vcsharp_2008, vcsharp_2010, vcsharp_2012,
  vmware, windows_server_2008, windows_server_2012
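
The --check-md5 option above verifies the MD5 checksums of downloaded books. As a rough illustration of what such a check involves, here is a minimal sketch that uses only JDK classes. It is not taken from the openbook cleaner's source code; the class name Md5Check and the methods md5Hex and matches are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of the openbook cleaner itself.
class Md5Check {

    /** Reads the stream to its end and returns the MD5 digest as lowercase hex. */
    static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int read;
        // Feed the digest chunk by chunk so large archives need not fit in memory
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            // Mask to 0..255 so negative bytes still print as two hex digits
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    /** Compares the stream's digest with an expected checksum, ignoring case. */
    static boolean matches(InputStream in, String expectedMd5)
            throws IOException, NoSuchAlgorithmException {
        return md5Hex(in).equalsIgnoreCase(expectedMd5);
    }
}
```

Because the data is streamed through the digest, a download could in principle be verified without storing it first, which is what the option description suggests the tool does.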

Dependencies: Openbook cleaner was developed in Java 7. It also uses a few open source libraries:

  • jsoup 1.8.3 for parsing the "dirty" openbook HTML, selecting DOM elements and editing them, removing navigation elements, ads and other types of clutter, and finally writing a clean, pretty-printed HTML document back to disk
  • JOpt Simple 4.9 for parsing command-line parameters and showing a help page (usage info)
  • Apache Commons Compress 1.10 for unzipping downloaded openbook archives. Note: When Java 7 is available on MacOS, this library might be removed again so that we can revert to using the built-in Java classes.
  • XStream 1.4.8 for parsing the config.xml file containing openbook metadata
  • AspectJ 1.8.13 for cross-cutting concerns like logging, timing, tracing which are not part of the main application logic. This helps to keep the core code clean and free from scattered code addressing secondary concerns.
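
To illustrate the unpacking step mentioned in the list above, here is a minimal sketch of extracting a ZIP archive into a directory. Note that it deliberately uses the built-in java.util.zip classes rather than Apache Commons Compress, so it shows the general technique and not the tool's actual code. The class name UnzipSketch and the method unzip are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper based on JDK classes only (not Commons Compress).
class UnzipSketch {

    /** Extracts all entries below targetDir and returns the number of files written. */
    static int unzip(InputStream zipped, Path targetDir) throws IOException {
        int fileCount = 0;
        Path root = targetDir.toAbsolutePath().normalize();
        try (ZipInputStream zip = new ZipInputStream(zipped)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                Path target = root.resolve(entry.getName()).normalize();
                // Reject entries such as "../x" that would land outside the target
                if (!target.startsWith(root)) {
                    throw new IOException("Entry escapes target directory: " + entry.getName());
                }
                if (entry.isDirectory()) {
                    Files.createDirectories(target);
                } else {
                    Files.createDirectories(target.getParent());
                    // Copies only the current entry; the ZIP stream itself stays open
                    Files.copy(zip, target, StandardCopyOption.REPLACE_EXISTING);
                    fileCount++;
                }
            }
        }
        return fileCount;
    }
}
```

The path check is a defensive addition of this sketch; whether the openbook cleaner performs a similar check is not stated here.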

Development environment:

  • IDE: I originally started developing this project with Eclipse but have switched to IntelliJ IDEA, which I personally prefer because of its superior Maven support. On the other hand, Eclipse has better AspectJ integration. So if you want to change any of the aspect code, you might want to use Eclipse anyway.
  • Git support is needed in your IDE of choice (or at least from the command line) if you want to interact with the source code repository and not just download a ZIP archive from GitHub.
  • Maven is used for dependency management and the whole build and packaging cycle. Any Maven 3 version should be safe, but I recommend using the latest stable version. It is totally up to you whether you build from the command line or via IDE integration. In IntelliJ IDEA you should install the original Maven plugins; for Eclipse you need m2e and also the AspectJ Maven Configurator (can be installed from
  • AspectJ support is available for both Eclipse (AJDT, AspectJ Development Tools) and IntelliJ IDEA. I do not know about NetBeans or other IDEs, though. So please make sure to install the corresponding IDE plugins for AspectJ support if you want to edit the aspect code comfortably. This is optional, however, because Maven can still build the project, fetching all necessary dependencies including AspectJ.

Because I might later want to use this Git repository as a refactoring showcase for my developer workshops, I am going to do any refactoring step by step, documenting progress in small, fine-grained Git changesets, so that later on I can review the evolutionary progress with others.

As you can see, I am mostly doing this little project for myself, but I like to share the results and receive some user feedback. I hope the openbook cleaner is useful to you. Enjoy! :-)

Alexander Kriegisch