Real-world example demonstrating advanced JAXB techniques to unmarshal very large XML documents with a very low memory footprint.

Dictionary builder

About

Dictionary builder is a demonstration of advanced JAXB techniques to unmarshal very large XML documents with a very low memory footprint. This project allows you to build dictionaries based on Wiktionary entries.
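To give an idea of why streaming keeps the memory footprint low, here is a minimal sketch that is not taken from this project's source. It uses only StAX (javax.xml.stream) to walk a MediaWiki-style dump one event at a time and passes each page title to a callback, so the whole document is never held in memory. The class name PageTitleStreamer and the method forEachTitle are hypothetical names chosen for this illustration. In a JAXB-based pipeline, the reader positioned on a page element could instead be passed to Unmarshaller.unmarshal(XMLStreamReader, Class) so that only one page is bound to objects at a time.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not part of dictionary-builder.
class PageTitleStreamer {

    /** Calls handler with the text of every <title> found inside a <page> element. */
    static void forEachTitle(Reader xml, Consumer<String> handler) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            boolean inPage = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("page".equals(name)) {
                        inPage = true;
                    } else if (inPage && "title".equals(name)) {
                        // Reads the element text and leaves the reader on </title>.
                        handler.accept(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "page".equals(reader.getLocalName())) {
                    inPage = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML input", e);
        }
    }

    public static void main(String[] args) {
        String sample = "<mediawiki>"
                + "<page><title>cat</title><revision><text>...</text></revision></page>"
                + "<page><title>dog</title><revision><text>...</text></revision></page>"
                + "</mediawiki>";
        // Each title is handled as soon as it is read; nothing is accumulated.
        forEachTitle(new StringReader(sample), System.out::println);
    }
}
```

Using a callback rather than returning a list is deliberate: with a dump of several gigabytes, collecting every entry before processing would defeat the purpose of streaming.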

dictionary-builder is an EDLA project.

The purpose of edla.org is to promote the state of the art in various domains.

How to use it

  1. Java 8u60 or later is required

  2. Get a fresh Wiktionary backup
    Choose your favorite language and download the dump containing the current versions of article content.
    Example for the English dump: http://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-articles-multistream.xml.bz2

  3. Uncompress the freshly downloaded dump somewhere (note that you need up to 5 GB of free disk space)

  4. Edit dico.properties to indicate the language you chose, where the dump is located and, last but not least, where the dictionary should be generated.
    (Note that you need some free disk space to store your dictionary)
    (dico.properties is located here: dictionary-builder/src/main/resources/org/edla/dico/construct/dico.properties)
    Note 1: If you are using Windows, you need to escape \ like this: C:\\some_folder\\some_subfolder\\some_file
    Or you can use / like this: C:/some_folder/some_subfolder/some_file
    Note 2: If your language name is not written in the Latin alphabet, you need to write the language property in ISO 8859-1 using Unicode escape sequences.
    Example for Nepali: set language=\u0928\u0947\u092A\u093E\u0932\u0940

  5. Build the project: mvn clean package
    (You need to rebuild the project each time you modify the dico.properties file)

  6. Launch the program: java -jar target/dictionary-builder.jar

  7. Some results:
    For the English dictionary, 549207 entries are generated in less than 20 min, and the dictionary requires 2 GB of disk space.
    For the French dictionary, 1205597 entries are generated in less than 30 min, and the dictionary requires 5 GB of disk space.
    For the Nepali dictionary, 1062 entries are generated in 3 seconds, and the dictionary requires 5 MB of disk space.
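The escaping rules from step 4 can be checked with a small standalone program. The sketch below is only an illustration of how java.util.Properties treats backslashes and \uXXXX sequences; it is not the project's loading code. The class name DicoPropertiesEscapes and the keys escapedPath, slashPath and brokenPath are hypothetical. Only the language key comes from this README.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration only; not part of dictionary-builder.
class DicoPropertiesEscapes {

    /** Parses properties-file content with the standard java.util.Properties rules. */
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Each Java string below is one line as it would appear in the file
        // (Java source needs its own extra level of backslash escaping).
        String content = String.join("\n",
                "language=\\u0928\\u0947\\u092A\\u093E\\u0932\\u0940",
                "escapedPath=C:\\\\some_folder\\\\some_file",
                "slashPath=C:/some_folder/some_file",
                "brokenPath=C:\\some_folder\\some_file");
        Properties props = parse(content);

        // The \uXXXX sequences are decoded to the Nepali word for "Nepali".
        System.out.println(props.getProperty("language"));
        // Doubled backslashes become single ones: C:\some_folder\some_file
        System.out.println(props.getProperty("escapedPath"));
        // Forward slashes pass through unchanged.
        System.out.println(props.getProperty("slashPath"));
        // Single backslashes are silently dropped: C:some_foldersome_file
        System.out.println(props.getProperty("brokenPath"));
    }
}
```

The last case shows why an unescaped Windows path fails quietly: no error is reported, the path simply loses its separators.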

That's it.