jaap-karssenberg edited this page Oct 25, 2013 · 1 revision

Document all your digital life in Zim

(for the latest version & more information see the project page at http://www.inrim.it/~magni/zimDMS.htm)

For a long time I have been looking for a system that can document all my digital data in a wiki-like way. The purpose is to let me find my way around the huge and ever-growing data structure where I keep all my digital life.

Data Management Systems (DMS) seem to provide this, but they are cumbersome and oriented to networked, multi-user setups. I am interested in personal use instead, but for directory trees on the order of thousands of subfolders.

The wiki concept, and Zim in particular, is by itself a very good fit for this problem, since links to subfolders ("node children") are easy to create and to navigate.

Its only problem (in my opinion) is that Zim considers any *.txt file to be a Zim node, so it writes a Zim-related header into every text file it finds. The only workaround I found is this Python script I wrote, which crawls a given directory (D) and creates a Zim-compatible mirror structure (in ZD), leaving the files in D untouched.
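As a rough illustration of the mirroring idea, here is a minimal sketch of how a directory in D could map to its node in ZD. The helper name `zim_paths` and the sample paths are made up for this example, and it is written for Python 3, while the script further down targets Python 2. It follows the same two steps as the script: spaces become underscores, and the root prefix is swapped for the repository prefix.

```python
def zim_paths(dirpath, rootname, zimdepo):
    """Return (node directory, node text file) in the mirror for a real directory."""
    compat = dirpath.replace(' ', '_')                # Zim-compatible name: ' ' -> '_'
    zimpath = compat.replace(rootname, zimdepo, 1)    # same relative path under the repository
    return zimpath, zimpath + '.txt'

# hypothetical paths, only for illustration
print(zim_paths('/home/me/My Docs/2010', '/home/me', '/home/me/zim'))
```

Each directory thus gets a `.txt` page plus a folder of the same name that holds the pages of its subdirectories.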

When launched, it crawls the folder structure of D (skipping any paths excluded on the command line, for stuff you are not interested in documenting). When it reaches a folder that is not yet present as a node in ZD, it creates the corresponding node there. A Zim node is simply a title and the list of its children. Symlinks are supported.
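To make "a title and the list of its children" concrete, the following is a simplified sketch of the text of a freshly generated node. The function name `node_text` is invented for this example (Python 3). It leaves out the parent links and the editable body that the real script also writes, and keeps only the header lines, the title link and the children list.

```python
VERSION = '0.96'

def node_text(path, children, version=VERSION):
    """Build the text of a new node: Zim header, title link, children links."""
    lines = [
        'Content-Type: text/x-zim-wiki',
        'Wiki-Format: zim 0.4',
        'X-Zimdms: ' + version,                 # marks the page as generated by the script
        '',
        '[[file://%s|%s]]' % (path, path),      # the title links back to the real directory
        '**CHILDREN NODES**',
    ]
    for child in sorted(children):
        lines.append('* [[+%s|%s]]' % (child, child))   # '+name' is a link to a sub-page
    return '\n'.join(lines) + '\n'

print(node_text('/data/photos', ['2010', '2009']))
```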

Subsequent runs are much faster. New nodes, and nodes whose subdirectories changed, are rewritten with the current list of subdirectories. Any subfolder present in ZD but no longer in D is deleted from ZD, //if and only if// it has never been manually edited. Otherwise it is marked as belonging to a deleted subtree and left alone. This makes the script well suited to periodic runs, e.g. nightly via crontab.
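The "never manually edited" test can be sketched as follows. This is a simplified Python 3 illustration with invented names (`user_body`, `on_missing_directory`), not the script's own routine: it treats everything between the title link and the PARENT NODES line as the user's body, and only a body made of blank lines counts as untouched. The real script additionally requires that the node has no remaining sub-nodes before deleting it.

```python
import re

TITLE = re.compile(r'.*\[\[file')      # the title line holds the file:// link

def user_body(lines):
    """Lines between the title link and the PARENT NODES marker."""
    body, in_body = [], False
    for line in lines:
        if not in_body:
            in_body = bool(TITLE.match(line))   # the body starts after the title line
            continue
        if line == '**PARENT NODES**\n':
            break
        body.append(line)
    return body

def on_missing_directory(lines):
    """Return 'delete' for an untouched node, 'mark' for an edited one."""
    edited = any(line.strip() for line in user_body(lines))
    return 'mark' if edited else 'delete'

untouched = ['X-Zimdms: 0.96\n', '\n', '[[file:///d/a|/d/a]]\n', '\n', '\n',
             '**PARENT NODES**\n', '* [[Home|Home]]\n']
edited = untouched[:4] + ['my notes\n'] + untouched[4:]
print(on_missing_directory(untouched), on_missing_directory(edited))
```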

You are of course free to create new nodes in the resulting ZD structure: later runs of the script will leave those nodes alone.

command line switches

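  • -z (--zimdepo): location of the Zim repository (default ~/zim)
  • -r (--rootname): location of the root directory to explore (default ~)
  • -f (--force): force rewriting of all nodes, while preserving any custom modification (default False)
  • -b (--backup): back up the repository into the directory zimdepo_backup (default False)
  • -d (--scandotdir): also scan inside directories whose name starts with . (default False)
  • d1 ... dn (positional arguments): directories excluded from crawling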
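As an example of how these switches combine, here is a small Python 3 sketch that rebuilds a reduced version of the option parser and feeds it a sample argument list. The function `build_parser` and the sample paths are assumptions made for this illustration; only the option names and defaults are taken from the script.

```python
import os
from optparse import OptionParser

def build_parser():
    """Reduced copy of the script's parser: option names and defaults only."""
    parser = OptionParser(usage="Usage: %prog [options] [d1 ... dn]")
    parser.add_option("-z", "--zimdepo", dest="zimdepo",
                      default=os.path.join(os.path.expanduser('~'), "zim"))
    parser.add_option("-r", "--rootname", dest="rootname",
                      default=os.path.expanduser('~'))
    parser.add_option("-f", "--force", action="store_true", dest="force", default=False)
    parser.add_option("-b", "--backup", action="store_true", dest="bck", default=False)
    return parser

# mirror /data into /data/zim, make a backup first, skip two directories
options, excluded = build_parser().parse_args(
    ['-r', '/data', '-z', '/data/zim', '-b', '/data/tmp', '/data/cache'])
print(options.rootname, options.zimdepo, options.bck, options.force, excluded)
```

Anything that is not an option ends up in the list of excluded directories, so the directories to skip simply follow the switches on the command line.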

example execution times

(HP Proliant ML350 quad-core Xeon CPU 1.86GHz)

  • n. directories: 4600
  • initial scan: ~14 sec
  • initial notebook upgrade (only once): ~13 min
  • zim folder total size: 37 MB
  • maintenance scans: ~13 sec

History

  • 0.7 change zimfolder structure, make it hierarchical as a map of your rootname
  • 0.7 opt arglist of directories not to scan
  • 0.7 even if arglist empty, DO NOT attempt to scan zimdepo anyway
  • 0.7 remove subtree: only if empty
  • 0.94 remove node: only if body empty - else tag it as deleted in body and do not list it in the 'children' section
  • 0.7 able to identify zimDMS/not zimDMS nodes - based on title
  • 0.7 non-zimDMS nodes never deleted, never added CHILDREN section; remove only if never modified from default generation
  • 0.7 write down children in node files
  • 0.91 include symlinks
  • 0.7 print warning not to mess up after CHILDREN NODES
  • 0.7 add -f switch to force the update of all the nodes - in case of script update
  • 0.7 change in all zimDMS nodes ' ' to '_', correct equality tests
  • 0.8 set children links in Home.txt too
  • 0.9 write down parent node
  • 0.91 better formatting
  • 0.92 title sends you to filemanager
  • to do: collect readme.txt and directory.jpg files when crawling?
  • OK check: moving wiki subtree, then moving directories works ?
  • OK check: is the wiki portable on devices ?
  • 0.92 backup wiki
  • 0.93 better routine iszimDMSnode to check if a node is a zimDMS node
  • 0.94 improved update() routine
  • 0.95 don't write children if they're on the noscan list
  • 0.95 added option to exclude .* directories
  • 0.96 add info after run (% of edited nodes, n. of deleted-in-root nodes etc)
  • 0.96 implement backup rotation scheme (see http://www.computer-repair.com/Backup.htm, possibly GFS scheme): added as external routine
  • to do: alternate/add new way to exclude dirs: .zimDontScan files in root dir

UNSOLVED

  • OK after some modifications (e.g. delete subtrees) seems mandatory do a zim --index zimdir -> solved, header problem
  • OK set default node background/color (not possible here: change GTK theme) -> set env

The script

#! /usr/bin/env python

"""
Copyright 2010 Alessandro Magni magni@inrim.it

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

 
      zimDMS

      DIRECTORY CRAWLER
      GENERATING A ZIM-COMPATIBLE WIKI STRUCTURE

 



crawler program scanning a given (rootname) root directory,
and creating (in zimdepo) a Zim-compatible mirror structure of rootname

example execution times (HP Proliant ML350 quad-core Xeon CPU 1.86GHz)

n.directories: ~4800.
initial scan: ~ 14sec.
initial notebook upgrade (only once): ~ 13min.
zim folder total dimension: 37MB.
maintenance scans: ~ 17sec.
force scans: ~ 25sec.



TODO ('-' == yet to do)
0.7 change zimfolder structure, make it hierarchical as a map of your rootname
0.7 opt arglist of directories not to scan
0.7    even if arglist empty, DO NOT attempt to scan zimdepo anyway
0.7 remove subtree: only if empty
0.94 remove node: only if body empty - else tag it as deleted in body
                                       and do not list it in the 'children' section
0.7        able to identify zimDMS/not zimDMS nodes - based on title
0.7           non-zimDMS nodes never deleted, never added CHILDREN section
         remove only if never modified from default generation
0.7 write down children in node files
0.91 include symlinks
0.7 print warning not to mess up after CHILDREN NODES
0.7 add --f switch to force the update of all the nodes - in case of script update
0.7 change in all zimDMS nodes ' ' to '_', correct equality tests
0.8 set children links in Home.txt too
0.9 write down parent node
0.91 better formatting
0.92   title sends you to filemanager
-   collect readme.txt and directory.jpg files when crawling ?
OK   check: moving wiki subtree, then moving directories works ?
OK   check: is the wiki portable on devices ?
0.92 backup wiki
0.93 better routine iszimDMSnode to check if a node is a zimDMS node
0.94 improved update() routine    
0.95 don't write children if they're on the noscan list
0.95 added option to exclude .* directories
0.96 add info after run (% of edited nodes, n. of deleted-in-root nodes etc)
0.96 implement backup rotation scheme (see http://www.computer-repair.com/Backup.htm, possibly GFS scheme)
       > added as external routine
-   alternate/add new way to exclude dirs: .zimDontScan files in root dir


UNSOLVED
OK after some modifications (e.g. delete subtrees) seems mandatory do a zim --index zimdir -> solved, header problem
OK set default node background/color (not possible here: change GTK theme) -> set env

"""

VERSION="0.96"

import os, sys, errno
from subprocess import call
import shutil
import re
import tarfile
import glob
import datetime
import time
import pickle
from optparse import OptionParser

defrootname=os.path.expanduser('~')
defzimdepo=os.path.join(os.path.expanduser('~'),"zim")
rootname=''
zimdepo=''
scandotdir=False

dotre=re.compile('.*/\.')

totn, newn, deln, edtn = 0,0,0,0

# key: current directory
# value: list
#         0:list of children directories
#         1:list of parent directories
# used to check if a tree under/above a given node is changed
mappa={}


# all the symlinks under rootname.
# key:  symlink; value: target
symlinks={}

# list of directories not to scan
noscan=[]


# -------------------------------------------------------------------------------------------------------------------
#  -------------------------------------------- FUNC DEFINITIONS --------------------------------------------------
# -------------------------------------------------------------------------------------------------------------------

# mkdir -p behaviour
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc: # Python >2.5
        if exc.errno != errno.EEXIST:
            raise

def addheader(x,n):
    # add Zim-like header in empty x file for nodename n
    F=open(x,'w')
    F.write('Content-Type: text/x-zim-wiki\n')
    F.write('Wiki-Format: zim 0.4\n')
    F.write('X-Zimdms: '+VERSION+'\n\n')
#    t=datetime.datetime.now()
#    F.write('Creation-Date: '+str(t).replace(' ','T')+'\n')       # format: Creation-Date: 2010-09-17T14:09:00.910925
    F.write('[[file://'+n+'|'+n+']]\n')   # title line: link to the real directory
    F.close()


def bodyinsert(zt,m):
    # inserts the string m just after the X-Zimdms header - if not already present
    os.rename( zt, zt+"~" )
    destination= open( zt, "w" )
    source= open( zt+"~", "r" )
    for line in source:
        destination.write( line )
        if 'X-Zimdms:' in line:
            l=source.next()
            if not m in l:
                destination.write(m)
            destination.write(l)
    source.close()
    destination.close()
    os.remove(zt+"~")


def getchildren(node):
    # returns an array to be given as children to mappa dictionary: 
    # dont report dotdirs if necessary, dont report children in noscan[]
    ch=[]
    ld=os.listdir(node)
    for child in ld:
        report=1
        s=os.path.join(node,child)
        if os.path.isdir(s):
            if not scandotdir and re.match(dotre, s): 
                report=0
                continue
            for ns in noscan:
                if s.startswith(ns):
                    report=0
                    break      # one matching noscan entry is enough
            if report==1: ch.append(s)
    return ch


def getparents(node,osnode):
    # returns an array containing all the nodes linking to node
    # it is called also with the real OS name, to look for symlinks
    global symlinks
    par=[os.path.dirname(node)]   # init it with the direct parent node
    for k,v in symlinks.items():
        if v==osnode:
            par.append(os.path.dirname(k))  # append the directory where the simlink is (key: symlink; value: target)
    return par


            
def lll(dirname):
    # find all symlinks under dirname
    sy={}
    for root,dirs,files in os.walk(dirname):
        if dirs:
            for node in dirs:
                full=os.path.join(root,node)
                if os.path.islink(full):
                    sy[full]=os.readlink(full)
    return sy
                    

def iszimDMSnode(zn):
    # check by header if the file zn is a zimDMS file
    # go for something better in files header...
    if not os.path.isfile(zn):
        return False
    F=open(zn,'r')
    try:
        line=F.readline()
        if line=="Content-Type: text/x-zim-wiki\n":   # touch only zimfiles
            for i in range(2):
                line=F.readline()
    finally:
        F.close()     # close before returning (the close used to sit after the return statements)
    # the third header line of a generated node is 'X-Zimdms: <version>'
    return 'X-Zimdms:' in line

def recdelete(znode):
    # recursively deletes the zimDMSfiles and the relative directories under znode
    # starting from node and descending into the children
    # watchout for symlinks exiting from the subtree!
    global deln
    
    for dname in os.listdir(znode):
        d=os.path.join(znode,dname)
        if os.path.isdir(d):
            recdelete(d)
    a=os.listdir(znode)
    zt=znode+'.txt'
    if len(a)==0:
        body=retbody(zt)
        #if iszimDMSnode(zt):
        if body==[]:
            deln+=1
            os.rmdir(znode)
            os.remove(zt)
            orignode=str.replace(znode,zimdepo,rootname)
            if orignode in mappa:
                del mappa[orignode]
        else:
            bodyinsert(zt,'__ORIGINAL TREE NOT FOUND__\n')
    else:
        bodyinsert(zt,'__ORIGINAL TREE NOT FOUND__\n')
        print'warning - zim node ',znode,' not empty\nIt will remain in your wiki as long as you will not delete additional material manually'
    
    
    
def retbody(zf):
    # returns the body of a zimDMS file,
    # or None if it doesnt exist, or it isnt a zimDMS file,
    # or [] if body is empty
    if os.path.isfile(zf) and iszimDMSnode(zf):
        body=[]
        x=re.compile('.*\[\[file')
            
        F=open(zf,'r')
        for line in F:
            if re.match(x, line):
                break                
        for line in F:                       # now we collect the body
            if line=='**PARENT NODES**\n': 
                break  
            else:
                body.append(line)     
        F.close()
        
        for l in body:
            if not l.isspace():
                return body
        return []
    else:
        return None
    
def update(zf,n,c,p):
    # update zimDMSfile zn, node n, with list of children c, parents p    
    global edtn
    
    body=retbody(zf)
    if body is not None:
        if body==[]:
            body='\n'*5
        else: edtn+=1
    
        addheader(zf,n)
        F=open(zf,'a')
        for a in body:
            F.write(a)
    
        # write down the parent&children links of node    
        F.write('**PARENT NODES**\n')
        for x in sorted(p):
            if x==rootname:
                F.write('* [[Home|Home]]\n')
            elif x==p[0]:            # the 1st is the real parent
                F.write('* [['+os.path.basename(x)+'|'+os.path.basename(x)+']]\n') 
            else:                    # the rest are symlinks
                x=compatzn(x)
                x=str.replace(x,rootname+'/','')
                x=str.replace(x,'/',':')
                F.write('* [['+x+'|'+x+']](**symlink**)\n')                    
        F.write('**CHILDREN NODES**\n')
        for x in sorted(c):
            if os.path.islink(x):
                if x in symlinks:
                    x=symlinks[x]
                    x=compatzn(x)
                    x=str.replace(x,rootname+'/','')
                    x=str.replace(x,'/',':')
                    F.write('* [['+x+'|'+x+']](**symlink**)\n')  # not using '+' it is absolute addressing
            else:
                F.write('* [[+'+os.path.basename(x)+'|'+os.path.basename(x)+']]\n')  # links '+name' are resolved UNDER the current page1
        F.close()
    else:
        pass     # zf doesnt exist or isnt a zimDMS file

    
def compatzn(f):    
    # converts a name to be zim compatible
    # i.e. at the moment a name where ' ' -> '_'
    f = f.replace(' ', '_')
    return f

def doBackup():
    # does a tar jcvf to the repository
    bck=zimdepo+'_backup'
    if not os.path.exists(bck):
        mkdir_p(bck)
        
    if os.path.exists(zimdepo):
        today = datetime.datetime.now()
        today=today.strftime("%Y-%m-%d")
        destination='repo'+today+'.bz2'
        destination = os.path.join(bck,destination)
        print'backup to dir ',destination
        if os.path.exists(destination):
            os.remove(destination)
        out = tarfile.TarFile.open(destination, 'w:bz2')
        out.add(zimdepo, arcname=os.path.basename(zimdepo))
        out.close()

        
        
# -------------------------------------------------------------------------------------------------------------------
#   ------------------------------------------  MAIN  ------------------------------------------------------------
# -------------------------------------------------------------------------------------------------------------------
            
def main():
 
    global symlinks, mappa
    global noscan
    global totn, newn, deln
    global rootname,zimdepo,scandotdir   # scandotdir is read by getchildren(), so it must be set globally

    parser = OptionParser(usage="Usage: %prog [options] [d1 ... dn]",
                          description="""alexxx.magni@gmail.com Alessandro Magni
                          
                          Version """+VERSION+"""
                         
                          
                          crawler program scanning a given (rootname) root directory,
                          and creating (in zimdepo) a Zim-compatible mirror structure of rootname
                          d1..dn are directories excluded from crawling""")
    parser.add_option("-z", "--zimdepo",
                      dest="zimdepo",default=defzimdepo,
                      help="position of Zim repository (default ~/zim)")
    parser.add_option("-r", "--rootname",
                      dest="rootname",default=defrootname,
                      help="position of root directory to explore (default ~)")
    parser.add_option("-f", "--force",
                      action="store_true", dest="force",
                      help="force rewriting all nodes - preserves however any custom modification (default False)")
    parser.add_option("-b", "--backup",
                      action="store_true", dest="bck",
                      help="backup in directory zimdepo_backup (default False)")
    parser.add_option("-d", "--scandotdir",
                      action="store_true", dest="scandotdir",
                      help="scan also inside directories starting with . (default False)")
    
    (options, args) = parser.parse_args()
    r=parser.rargs
    zimdepo=options.zimdepo
    rootname=options.rootname
    force=options.force
    bck=options.bck
    scandotdir=options.scandotdir
    
    if force:
        print'updating all the nodes in the structure'
    if scandotdir:
        print'scanning of dot-directories enabled'
    
    # strip trailing slashes first: doBackup() derives its target names from zimdepo
    if zimdepo[-1]=='/': zimdepo=zimdepo[:-1]
    if rootname[-1]=='/': rootname=rootname[:-1]

    # Backup Wiki ---------------------------------------------------
    if bck:
        doBackup()

    dictdumpfile=os.path.join(zimdepo,'mappa.dump')

    if not os.path.exists(zimdepo):
        mkdir_p(zimdepo)
    


    for a in args:
        ba=os.path.abspath(a)
        noscan.append(ba)
    noscan.append(zimdepo)
    print'Directories not to be crawled:'
    for a in noscan:
        print '> '+a

    t0 = datetime.datetime.now()

    # load mappa dictionary if present -------------------------------------------
    if os.path.isfile(dictdumpfile) and os.path.getsize(dictdumpfile) > 0:  # cannot pickle empty file
        F=open(dictdumpfile,"r")
        mappa=pickle.load(F)
        F.close()
        for k in list(mappa):     # iterate over a copy of the keys: entries are deleted in the loop
            if not scandotdir and re.match(dotre, k): del mappa[k]
        print len(mappa),' entries on map file'
    else:
        print'map file not present'

        
    symlinks=lll(rootname)


    # recursive walk on zimdepo to possibly delete subtrees
    #    given a node.txt and node dir: does it exists in rootname?
    #         yes
    #         no: is it empty?
    #              yes -> delete node.txt & directory
    #              no  -> do nothing, just warn. 
    for zimroot,zimdirs,files in os.walk(zimdepo):
        if zimdirs:
            for znode in zimdirs:
                if znode!='.zim':                # .zim is always present
                    znode=os.path.join(zimroot,znode)
                    if not scandotdir and re.match(dotre, znode): 
                        os.rmdir(znode)
                        if os.path.exists(znode+'.txt'): os.remove(znode+'.txt')
                        continue     # the node is gone: nothing left to compare or delete
                    realdir=str.replace(znode,zimdepo,rootname)
                    realdir=realdir.replace('[', '[[]')   # 2 lines to compare independently
                    realdir=realdir.replace('_', '[ _]')  # of spaces in dirnames                  
                    # if not os.path.exists(realdir):
                    if not glob.glob(realdir):
                        recdelete(znode)

    # --- Check the directory structure ---
    # 1st the root node
    x=os.path.join(zimdepo,'Home.txt')
    addheader(x,rootname)       # Home.txt is rewritten from scratch at every run
    ch=getchildren(rootname)
    F=open(x,'a')
    F.write('\n\n             {{../Home.jpg}}')
    F.write('\n\n\n\n\n\n**CHILDREN NODES**\n')
    for c in sorted(ch):
        F.write('* [['+os.path.basename(c)+'|'+os.path.basename(c)+']]\n')  # no '+' here: children of rootname are top-level pages, next to Home
    F.close()
    
    # then recursive walk from the children downward
    for root,dirs,files in os.walk(rootname):
        if dirs:
            for node in dirs:
                proceed=1
                nodeupdate=0
                osnode=os.path.join(root,node)  # osnode before conversion
                node=compatzn(osnode)
                if os.path.islink(osnode): proceed=0       # test the real name, not the converted one
                if not scandotdir and re.match(dotre, node): proceed=0
                for ns in noscan:
                    if osnode.startswith(ns):              # noscan holds real paths
                        proceed=0
                if proceed:
                    totn+=1
                    zimpath=str.replace(node,rootname,zimdepo)
                    zimname=zimpath+'.txt'
                    ch=getchildren(osnode)
                    pa=getparents(node,osnode)
                    if not os.path.exists(zimpath):
                        print'node ',node,' is new'
                        newn+=1
                        nodeupdate=1
                        mkdir_p(zimpath)
                        addheader(zimname,osnode)
                        #mappa[node]=[ch,pa]
                    if node in mappa:
                        d=mappa[node]
                        savedch,savedpa = d[0],d[1]
                        if sorted(ch) != sorted(savedch) or sorted(pa) != sorted(savedpa):
                            nodeupdate=1
                            print 'children/parents of node ',node,' changed'
                            #mappa[node]=[ch,pa]
                    else:
                        nodeupdate=1
                    mappa[node]=[ch,pa]
                    if nodeupdate==1 or force:
                        update(zimname,osnode,ch,pa)

    # save a clean dictionary
    mappaclean={}
    for root,dirs,files in os.walk(rootname):
        if dirs:
            for node in dirs:
                node=compatzn(os.path.join(root,node))
                if node in mappa:
                    mappaclean[node]=mappa[node]
    F=open(dictdumpfile,"w")
    pickle.dump(mappaclean,F)
    F.close()
    if len(mappaclean)!=totn:
        print len(mappaclean),' entries on map file, ',totn,' total nodes'
    
    print "Scanned a total of %d nodes, among which %d were new and %d resulted deleted" % (totn,newn,deln)
    if force and totn:      # skip the percentage when no node was scanned
        print"number of non-default (edited) nodes is %d (%.1f %%)" % (edtn,100.*edtn/totn)
    print "\n\nEdit whatever you want, but only between the title and the CHILDREN NODES line"
    
    delta_t = datetime.datetime.now() - t0 
    print "Time needed (h:mm:ss) ",delta_t

if __name__ == '__main__':
    main()
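
One detail of the deletion pass in main() may deserve a closer look: since spaces become underscores in node names, the script cannot simply test whether the original directory exists. It turns the node path into a glob pattern instead. The sketch below (Python 3, with the invented helper name `real_dir_pattern` and a throwaway temporary tree) shows the idea in isolation.

```python
import glob
import os
import tempfile

def real_dir_pattern(znode, zimdepo, rootname):
    """Glob pattern for the real directory that a wiki node stands for."""
    pattern = znode.replace(zimdepo, rootname, 1)  # back from the mirror to the real root
    pattern = pattern.replace('[', '[[]')          # keep a literal '[' literal for glob
    return pattern.replace('_', '[ _]')            # '_' may have been ' ' or '_'

# throwaway tree: one real directory with a space in its name
root = tempfile.mkdtemp()
os.mkdir(os.path.join(root, 'My Docs'))
depo = os.path.join(root, 'zim')

present = glob.glob(real_dir_pattern(os.path.join(depo, 'My_Docs'), depo, root))
missing = glob.glob(real_dir_pattern(os.path.join(depo, 'Old_Stuff'), depo, root))
print(present, missing)
```

An empty result means that no directory in the real tree corresponds to the node, which is the condition that triggers the deletion check.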

