ZIMs Naming Convention

benoit74 edited this page Mar 7, 2024 · 13 revisions
  • When publishing a ZIM, it's important to pay attention to its metadata, since that is how other people will distinguish it from other content.
  • Metadata lists the common and required metadata expected for a ZIM file.
  • None of them needs to be unique. ZIMs already include an identifier (called ID, which is a UUID) that is generated automatically during creation. That doesn't diminish the value of the other metadata, though: you still want readers to be able to choose ZIMs easily and confidently based on them.
  • We need to ensure that collisions do not happen (typically two different websites leading to the same ZIM Name) and that users understand which source content they are downloading or using.
  • Choosing good and appropriate metadata can be difficult, but it's not what this document is about.

This document is about setting valid Name metadata and filename for openZIM-created ZIMs (usually via the Zimfarm).

Why do we care?

  • We create thousands of ZIMs every month. Convention is essential to be able to automate some tasks.
  • Convention means applying a pattern, so no need to find what to use: simpler, faster.
  • We use Name metadata to match Zimfarm-produced ZIMs with Titles in the CMS.
  • We use Name metadata to set the ZIM filename in most scrapers.
  • Many scripts depend on the filenames to maintain the central library: build the XML library, move files to the appropriate folder, evict older files, generate redirects, etc.
  • Offspot YAML catalog uses Human IDs that are derived from the filenames.

ZIM Name Metadata

Format: {project}_{lang}_{selection}

The _ character is reserved as separator between the parts.

The parts must only contain alphanums or - or . characters.

The parts must be all lowercase.

Part | Description | Example
project | Domain name (or project name) ^1 | android.stackexchange.com, wikipedia
lang | ISO-639-1 (2 chars) language code | en, fr, zh, mul^2
selection | A short, slug-like string indicating the selection over the project | all, top, football
  • ^1: Domain name by default; project names are exceptions (basically valid only if we have at least a dedicated category for this project). Use the domain name if unsure or, better, ask on Slack. Should the domain name contain characters that are illegal under this convention, it is encoded with Punycode (see e.g. https://www.punycoder.com/).
  • ^2: mul is to be used for multiple-language ZIMs. Note that the ZIM Language metadata lists the languages (ISO-639-3) instead of using mul.
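
The Name rules above can be sketched as a small check. This is only an illustration under stated assumptions, not the validation any scraper or the Zimfarm actually runs: `parse_name` and `encode_project` are hypothetical helpers, the language test only checks the shape of the code (two lowercase letters or `mul`) rather than looking it up in ISO-639-1, and "Punycode" is assumed to mean the per-label IDNA encoding provided by Python's built-in `idna` codec.

```python
import re
from typing import Optional

# One part: lowercase alphanumerics, "-" or "."; "_" is reserved as the separator.
PART = r"[a-z0-9.-]+"

# {project}_{lang}_{selection}
# lang is only shape-checked here (two lowercase letters, or "mul").
NAME_RE = re.compile(
    rf"(?P<project>{PART})_(?P<lang>[a-z]{{2}}|mul)_(?P<selection>{PART})"
)


def parse_name(name: str) -> Optional[dict]:
    """Return the three parts of a Name, or None if it breaks the convention."""
    match = NAME_RE.fullmatch(name)
    return match.groupdict() if match else None


def encode_project(domain: str) -> str:
    """Lowercase a domain and encode non-ASCII labels (assumed IDNA/Punycode)."""
    return domain.lower().encode("idna").decode("ascii")


print(parse_name("android.stackexchange.com_en_all"))
print(parse_name("Wikipedia_en_all"))  # uppercase is rejected
print(encode_project("bücher.example"))
```

Using `fullmatch` rather than `match` matters here: it rejects names with a trailing fourth part, which would otherwise be silently accepted.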

ZIM filename

Format: {Name}[_{flavour}]_{period}.zim

The _ character is reserved as separator between the parts.

The parts must only contain alphanums or - or . characters.

The filename must be all lowercase.

Part | Description | Example
Name | The Name metadata described above^1 | wikipedia_fr_top, wikihow_th_all, stackoverflow.com_en_all
flavour | Optional. One of the existing flavours, indicating a modification of the content for size reasons | mini, nopic, maxi
period | The period when the ZIM was created, in YYYY-MM (year-month) format | 2019-03, 2022-12
  • ^1: It doesn't need to be equal to the Name metadata, but the requirements are identical.
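
As an illustration of the filename format, the following sketch splits a filename into its Name, optional flavour and period. It is a hypothetical helper (`parse_filename`), not code taken from any openZIM script. It assumes the Name portion always has exactly three parts, and it accepts any well-formed part as a flavour because the full list of existing flavours is not given here.

```python
import re
from typing import Optional

# One part: lowercase alphanumerics, "-" or "."; "_" is reserved as the separator.
PART = r"[a-z0-9.-]+"

# {Name}[_{flavour}]_{period}.zim
FILENAME_RE = re.compile(
    rf"(?P<name>{PART}_(?:[a-z]{{2}}|mul)_{PART})"  # project_lang_selection
    rf"(?:_(?P<flavour>{PART}))?"                   # optional, e.g. mini, nopic, maxi
    r"_(?P<period>\d{4}-(?:0[1-9]|1[0-2]))"         # YYYY-MM, month 01 to 12
    r"\.zim"
)


def parse_filename(filename: str) -> Optional[dict]:
    """Return name, flavour (or None) and period, or None if invalid."""
    match = FILENAME_RE.fullmatch(filename)
    return match.groupdict() if match else None


print(parse_filename("wikipedia_fr_top_2019-03.zim"))
print(parse_filename("wikipedia_en_all_maxi_2022-12.zim"))
print(parse_filename("wikipedia_en_all_2022-13.zim"))  # month 13 is rejected
```

Because the Name is fixed at three parts, a filename has either four or five `_`-separated parts, which is what makes the optional flavour unambiguous to detect.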

Zimfarm

Depending on the scraper, setting the Name metadata in the Zimfarm can be mandatory (follow above instructions) or optional. When optional, the scraper usually properly sets it according to the convention. Should it not, open a ticket on the scraper repo and set it manually in the recipe until it is fixed.

Filenames are also optional in the Zimfarm, but the common behavior is to append the period part (ex: _2022-01) after the value of the Name metadata. If you customized the Name, make sure the filename will remain valid, or set it manually.

Important: when setting the filename manually, you are responsible for the whole filename, including the period part. Most scrapers allow inserting a special {period} string that will be replaced with the year-month value. Ex: supersite.com_en_all_{period}.zim
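
The two behaviors described in this section (default filename derived from the Name, or a manual template containing {period}) can be sketched as follows. This is a rough approximation for illustration only: `resolve_filename` is a hypothetical function, and the real substitution logic lives in each scraper and may differ.

```python
from datetime import date
from typing import Optional


def resolve_filename(
    name: str,
    template: Optional[str] = None,
    today: Optional[date] = None,
) -> str:
    """Build a ZIM filename from the Name or from a manual template."""
    # Period is the creation year and month, formatted as YYYY-MM.
    period = (today or date.today()).strftime("%Y-%m")
    if template is None:
        # Assumed default behavior: append the period part to the Name.
        return f"{name}_{period}.zim"
    # Manual filename: only the {period} placeholder is substituted;
    # everything else in the template is the author's responsibility.
    return template.replace("{period}", period)


created = date(2022, 1, 15)
print(resolve_filename("supersite.com_en_all", today=created))
print(resolve_filename("ignored", "supersite.com_en_all_{period}.zim", created))
```

Note that a manual template without {period} is returned unchanged in this sketch, which mirrors the warning above: the period part is not added for you.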