

Windows Installation

  1. Ensure you have Java JDK 6 installed (not tried with Java 7).

    Download from http://www.oracle.com/technetwork/java/javase/downloads/index.html

    Follow the default installation.

    To check that it works, type the following from the command line:

    java -version

  2. Ensure you have Python 2.7.3 installed

    Can be installed from http://www.python.org/ftp/python/2.7.3/python-2.7.3.msi

    Follow the default installation.

  3. Install WinCDEmu (only needed if using ISORunner.py)

    Download from http://wincdemu.sysprogs.org/

  4. Copy scripts and Java files to an appropriate installation directory <INSTALL_DIR> (e.g. C:/SPRUCE/TikaFileIdentifier)

  5. If you don't have it, download the Tika JAR file to <INSTALL_DIR>/tika

    Tika available from: http://tika.apache.org/download.html

  6. Compile the Java Tika wrapper. From command line, type:

    cd <INSTALL_DIR>/java
    javac -cp ../tika/tika-app-1.1.jar TikaWrapper.java

    (The Tika JAR may be named differently.)

  7. Update the variables in config.py to suit your installation:

    TIKA = ""
    IGNORE = List of metadata headings to ignore. These usually cause problems with formatting the CSV
    MOUNTER =
    MOUNT_POINT = Free directory to mount an ISO image to, e.g. V:
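Before running the scripts, it can help to confirm that the prerequisites from the steps above are in place. The following is a minimal sketch and not part of the distributed scripts: the helper names `find_on_path` and `check_setup` are hypothetical, and the JAR location in the usage section simply follows the example install directory and JAR name mentioned above.

```python
import os


def find_on_path(program, path=None):
    """Return the full path of an executable found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows, executables carry extensions such as .exe or .bat
    extensions = [""] + os.environ.get("PATHEXT", "").split(os.pathsep)
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        for ext in extensions:
            candidate = os.path.join(directory, program + ext)
            if os.path.isfile(candidate):
                return candidate
    return None


def check_setup(tika_jar):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    if find_on_path("java") is None:
        problems.append("java not found on PATH")
    if find_on_path("javac") is None:
        problems.append("javac not found on PATH (is a JDK installed?)")
    if not os.path.isfile(tika_jar):
        problems.append("Tika JAR not found: " + tika_jar)
    return problems


if __name__ == "__main__":
    # Hypothetical location, following the example install directory above
    for problem in check_setup("C:/SPRUCE/TikaFileIdentifier/tika/tika-app-1.1.jar"):
        print(problem)
```

An empty result only means these basic checks passed; it does not verify the Java or Python versions.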


There are 5 scripts: 1 configuration script, 3 scripts that do the bulk of the work, and 1 ISO file workflow script.


config.py contains user settings. Edit these values so that they reflect your system setup.


To run ISORunner.py, type the following from the command line:

cd <INSTALL_DIR>/python
python ISORunner.py <input ISO file directory> <temp output directory> <summary CSV to create>


For example:

python ISORunner.py C:/SPRUCE/Data/Seven_Stories/ C:/SPRUCE/Output/Seven_Stories C:/SPRUCE/Output/SS_Summary.csv

This script identifies all .ISO files within the specified directory and mounts each one to make its content available. It then runs TikaRunner.py over the files contained in the ISO, writing the results to the temporary output folder. Each ISO's results are aggregated into a single CSV file using CSVFormatter.py.
Once all ISOs have been processed, Summariser.py summarises the data into one summary CSV file.

Note: ISORunner.py creates two sub directories under the <temp output directory>: /TikaRunner and /Aggregated. TikaRunner.py results files are placed in the first, and the aggregated CSVs in the second. The summary CSV can therefore be placed in the <temp output directory> itself without the problem mentioned in the CSVFormatter.py notes below.
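The first step of this workflow and the directory layout from the note can be sketched as follows. This is an illustration only, not the code in ISORunner.py: the function names are hypothetical, and the sketch assumes that only the top level of the input directory is searched, which the description above does not confirm.

```python
import os


def find_iso_files(directory):
    """Return sorted paths of files ending in .iso (any case) directly inside directory."""
    found = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.lower().endswith(".iso"):
            found.append(path)
    return sorted(found)


def output_layout(temp_output_dir):
    """Return the paths of the /TikaRunner and /Aggregated sub directories."""
    return (os.path.join(temp_output_dir, "TikaRunner"),
            os.path.join(temp_output_dir, "Aggregated"))
```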

The following 2 scripts can also be run standalone, in order, on a directory of files.


To run TikaRunner.py, type the following from the command line:

python TikaRunner.py <input file directory> <temp output directory>


For example:

python TikaRunner.py C:/SPRUCE/Data/Seven_Stories/ C:/SPRUCE/Output/Data/Seven_Stories

This script runs Tika over all files in the input file directory, reporting an estimate of the time remaining.
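How TikaRunner.py computes the expected time left is not described here. One common approach, shown as a sketch under that caveat, is to multiply the average time per processed file by the number of files still to do. The function name is hypothetical.

```python
def estimate_seconds_left(elapsed_seconds, files_done, files_total):
    """Estimate remaining time from the average time per file so far."""
    if files_done <= 0:
        return None  # no basis for an estimate yet
    average = float(elapsed_seconds) / files_done
    return average * (files_total - files_done)
```

For example, if 10 of 40 files took 30 seconds, the estimate for the remaining 30 files is 90 seconds. The estimate is rough when file sizes vary widely.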


To run CSVFormatter.py, type the following from the command line:

python CSVFormatter.py <input file directory> <temp output directory> <output csv file>


For example:

python CSVFormatter.py C:/SPRUCE/Data/Seven_Stories/ C:/SPRUCE/Output/Data/Seven_Stories C:/SPRUCE/Output/results.csv

Note: DO NOT save the output CSV file in the temp output directory. Otherwise, if you run the script repeatedly, it will attempt to aggregate this CSV into itself.

Note 2: The input file directory allows the filenames listed in the CSV to reflect the actual original files, rather than the temp output files.
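A guard against the first problem could look like the sketch below, which checks whether the chosen output CSV path falls inside the temp output directory before any aggregation starts. This is a hypothetical helper, not something CSVFormatter.py is known to contain.

```python
import os


def is_inside(path, directory):
    """Return True if path lies inside directory (or is the directory itself)."""
    path = os.path.normcase(os.path.abspath(path))
    directory = os.path.normcase(os.path.abspath(directory))
    # Compare with a trailing separator so that /out/temp2 is not
    # mistaken for a child of /out/temp
    return path == directory or path.startswith(directory.rstrip(os.sep) + os.sep)
```

A caller could refuse to continue when `is_inside(output_csv, temp_output_dir)` is true and ask for a different output location.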


To run Summariser.py, type one of the following from the command line:

python Summariser.py <output csv file>

python Summariser.py <output csv file> <summarised csv file>

For example:

python Summariser.py C:/SPRUCE/Output/results.csv

python Summariser.py C:/SPRUCE/Output/results.csv C:/SPRUCE/Output/summary.csv

The first approach only prints summary statistics to the command line. The second approach also writes the summary as a single row in a CSV file.
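The statistics that Summariser.py produces are not listed here. As a rough illustration of the kind of summary that can be drawn from an aggregated CSV, the sketch below counts how often each value appears in one column. The function names are hypothetical, and the column name "Content-Type" in the usage note is an assumption about the CSV headings, not something confirmed above.

```python
import csv
from collections import Counter


def count_column_values(rows, column):
    """Count how often each value occurs in the given column."""
    return Counter(row.get(column, "") for row in rows)


def summarise_csv(csv_path, column):
    """Read a CSV file with a header row and count the values in one column."""
    with open(csv_path) as handle:
        return count_column_values(csv.DictReader(handle), column)
```

Calling `summarise_csv("C:/SPRUCE/Output/results.csv", "Content-Type")` would then return a mapping from each value in that column to its number of occurrences.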