## Modifying ZIM Files

#### The Larger Picture
* Kiwix scrapes many useful sources, but sometimes the chunks are too big for IIAB.
* Using the zimdump program, the highly compressed ZIM files can be flattened into a file tree, modified, and then re-packaged as a ZIM file.
* This Notebook has a collection of tools which help in the above process.


#### How to Use this notebook
* There are install steps that only need to happen once. The cells containing these steps are set to "Raw" in the right most dropdown so that they do not execute automatically each time the notebook starts.
* The following bash script successfully installed zimtools on Ubuntu 20.04.It only needs to be run once. I think it's easier to do it from the command line, with tab completion. The script is at  
/opt/iiab/iiab-factory/content/kiwix/generic/install-zim-tools.sh. 
```
./install-zim-tools.sh
```

* **Some conventions**: Jupyter does not want to run as root. We will create a file structure that exists in the users home directory -- so the application will be able to write all the files it needs to function.
```
<home directory>
├── new_zim
├── tree
└── working
```
In general terms, this program will dump the zim data into "tree", modify it, gather additional data into "working"
, and create a ZIM file in "new_zim"
* For testing purposes, the user will need to link from the server's document root to her home directory:
```
cd
mkdir -p zimtest
ln -s /home/<user name>/zimtest /library/www/html/zimtest 
```


### Declare input and output environment
* The ZIM file names tend to be long and hard to remember. The PROJECT_NAME, initialized below, is used to create output path names. All of the output of the zimdump program is placed in /library/www/html/zimtest/\<PROJECT_NAME\>. All if the intermediate downloads, and data, are placed in /library/working/kiwix/\<PROJECT_NAME\>. If you use the IIAB Admin Console to download ZIMS, you will find them in /library/zims/content/.

In [1]:
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import os,sys
import youtube_dl
import pprint as pprint

# Declare a short project name (ZIM files are often long strings
PROJECT_NAME = 'teded'
# Input the full path of the downloaded ZIM file
ZIM_PATH = '/library/www/html/teded/teded_en_all_2020-06.zim'

# The rest of the paths are computed and represent the standard layout
# Jupyter sets a working director as part of it's setup. We need it's value
CWD = !pwd
WORKING_DIR = '~/zimtest/' + PROJECT_NAME + '/working'
PROJECT_DIR = '~/zimtest/' + PROJECT_NAME + '/tree'
dir_list = ['new-zim','tree','working']
for f in dir_list: 
    if not os.path.isdir('~/zimtest/' + PROJECT_NAME +'/' + f):
       os.makedirs('~/zimtest/' + PROJECT_NAME +'/' + f)

# abort if the input file cannot be found
if not os.path.exists(ZIM_PATH):
    print('%s path not found. Quitting. . .')
    sys.exit(2)


In [2]:
# First we need to get a current copy of the script
cwd=!pwd
%cp /opt/iiab/iiab-factory/content/kiwix/de-namespace.sh {cwd[0]}

In [3]:
# The following command will zimdump to the "tree" directory
#   and remove the namespace directories
# It will return without doing anything if the "tree' is not empty
!./de-namespace.sh {ZIM_PATH} {PROJECT_NAME}

+ DOCROOT=/library/www/html
+ '[' 2 -lt 2 ']'
+ '[' '!' -f /library/www/html/teded/teded_en_all_2020-06.zim ']'
+ '[' -d /library/www/html/zimtest ']'
+ '[' '!' -f /library/www/html/zimtest/de-namespace ']'
++ ls /library/www/html/zimtest/teded/tree
++ wc -l
+ contents=1744
+ '[' 1744 -ne 0 ']'
+ echo 'The /library/www/html/zimtest/teded/tree is not empty. Delete if you want to repeat this step.'
The /library/www/html/zimtest/teded/tree is not empty. Delete if you want to repeat this step.
+ exit 0


* The next step is a manual one that you will need to do with your browser. That is: to verify that after the namespace directories were removed, and all of the html links have been adjusted correctly. Point your browser to <hostname>/zimtest/\<PROJECT_NAME\>/tree.
* If everything is working, it's time to go fetch the information about each video from youtube.

In [4]:
ydl = youtube_dl.YoutubeDL()

downloaded = 0
skipped = 0
# Create a list of youtube id's
yt_id_list = os.listdir(PROJECT_DIR + '/videos/')
for yt_id in iter(yt_id_list):
    if os.path.exists(WORKING_DIR + '/' + PROJECT_NAME + '/' + yt_id + '.json'):
        # skip over items that are already downloadd
        skipped += 1
        continue
    with ydl:
       result = ydl.extract_info(
                'http://www.youtube.com/watch?v=%s'%yt_id,
                download=False # We just want to extract the info
                )
       downloaded += 1

    with open(WORKING_DIR + '/' + PROJECT_NAME + '/' + yt_id + '.json','w') as fp:
        fp.write(json.dumps(result))
    #pprint.pprint(result['upload_date'],result['view_count'])
print('%s skipped and %s downloaded'%(skipped,downloaded))

FileNotFoundError: [Errno 2] No such file or directory: '~/zimtest/teded/tree/videos/'