PKP XML Parsing Service

Module Description

  • User

    • Authentication

    • Registration

    • New password

    • Generate API access token

  • Admin

    • Confirm registrations

    • Set a user's document conversion rate

    • Delete user

    • Edit User

    • System log viewer

  • Manager

    • Receives conversion jobs

    • Job list

    • Handles job distribution to queues

  • PathFinder

    • Determines which conversion path to take
  • DocxConversion

    • Converts documents to DocX format
  • WpPdfConversion

    • Converts word processing documents directly to PDF
  • NlmxmlConversion

    • Uses meTypeset to perform part of the NLM XML conversion
  • Cermine

    • Uses CERMINE to perform part of the NLM XML conversion
  • Grobid

    • Uses Grobid to perform part of the NLM XML conversion
  • ReferenceConversion

    • Parses references from the DocX document into a separate XML file
  • BibtexConversion

    • Converts references from the previous step into Bibtex
  • BibtexreferenceConversion

    • Converts Bibtex references into NLMXML and merges the converted references into the NLMXML document
  • HtmlConversion

    • Converts the NLMXML document into HTML
  • CitationStyleConversion

    • Formats the citations in the HTML document according to the citation style requested by the user
  • PdfConversion

    • Converts the HTML document into PDF
  • XmpConversion

    • Adds an XMP sidecar with metadata from the NLMXML to the PDF document
  • EpubConversion

    • Converts the NLMXML document into Epub format
  • MergeXMLOutputs

    • Combines the CERMINE and meTypeset outputs (as modified by BibTeX processing, if appropriate) into a better NLM XML document
  • ZipConversion

    • Zips all documents
  • API

    • Simple REST API to submit and retrieve jobs and to provide functionality for the frontend's AJAX callbacks.

 

Requirements

  • Apache mod_headers needs to be installed and enabled

  • meTypeset needs the python lxml library

  • Java VM needs to be installed

  • sendmail needs to be installed and configured

  • zip and unzip need to be installed (they are not installed by default on Ubuntu)

  • MySQL

  • Citation parsing has a variety of requirements, as listed in the ParsCit documentation. At minimum, you need Perl XML Twig and XML Writer, a working Ruby installation, and CRF++ binaries.

  • xml2bib needs to be installed

  • Pandoc, pandoc-citeproc, and libghc-citeproc-hs-data need to be installed

  • The XMP conversion needs Exiftool to be installed

  • The DocX conversion needs LibreOffice with unoconv installed. The server is tested to work with LibreOffice 4.2.4 through 5.x.

  • The PHP5 xsl and curl modules must be installed.
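
As a quick pre-installation check, the sketch below reports which command-line tools cannot be found on the PATH. The executable names in ASSUMED_TOOLS are assumptions inferred from the list above and may differ on your distribution. The check does not cover Apache modules, PHP modules, the Python lxml library, or the ParsCit dependencies.

```python
import shutil
from typing import Callable, Iterable, List, Optional

# Assumed executable names for the tools listed above; adjust for your system.
ASSUMED_TOOLS = [
    "java", "sendmail", "zip", "unzip", "mysql", "perl", "ruby",
    "xml2bib", "pandoc", "pandoc-citeproc", "exiftool", "unoconv",
]


def missing_tools(
    names: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(ASSUMED_TOOLS)
    print("Missing:", ", ".join(missing) if missing else "none")
```

The lookup function is injectable so the check can be exercised without touching the real PATH.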

 

Installation

  • Copy the source

# git clone https://github.com/pkp/xmlps.git
# cd xmlps

  • Install the dependencies

# php composer.phar self-update
# php composer.phar install

  • Ensure there is a MySQL database called xmlps, and a MySQL user with access to it.

  • Configure the environment:

  • Copy the local.php template, then edit it:

    # cp config/autoload/local.php.dist config/autoload/local.php
    
  • Provide the MySQL user and password.

  • If you want debug messages to be displayed to the user (i.e., if developing locally), set these options:

        'view_manager' => array(
            'display_not_found_reason' => true,
            'display_exceptions' => true,
        ),
  • If using a version of unoconv installed via package manager (such as apt), you may need to override the command, e.g.:

        'conversion' => array(
            'docx' => array(
                'unoconv' => array(
                    'command' => 'unoconv',
                ),
            ),
        ),
  • Make sure the cache directories are writable by your Web server process, and empty the cache directory, e.g.:

# chmod -R go+w var
# rm var/cache/zfcache-ea/*

  • Update the database schema:

# vendor/doctrine/doctrine-module/bin/doctrine-module orm:schema-tool:update --force

  • Start the job queues:

# ./start_queues.sh

 

Sample sites-available/httpd.conf:

 

<VirtualHost *:80>
    ServerAdmin webmaster@localhost

    DocumentRoot /var/www/html/public
    <Directory />
        Options FollowSymLinks
        AllowOverride All
    </Directory>
    <Directory /var/www/html/public>
        Options -Indexes +FollowSymLinks +MultiViews
        AllowOverride All
        Order allow,deny
        allow from all
    </Directory>



    ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
    <Directory "/usr/lib/cgi-bin">
        AllowOverride None
        Options +ExecCGI -MultiViews +SymLinksIfOwnerMatch
        Order allow,deny
        Allow from all
    </Directory>

    ErrorLog ${APACHE_LOG_DIR}/error.log

    # Possible values include: debug, info, notice, warn, error, crit,
    # alert, emerg.
    LogLevel warn

    CustomLog ${APACHE_LOG_DIR}/access.log combined

    Alias /doc/ "/usr/share/doc/"
    <Directory "/usr/share/doc/">
        Options Indexes MultiViews FollowSymLinks
        AllowOverride None
        Order deny,allow
        Deny from all
        Allow from 127.0.0.0/255.0.0.0 ::1/128
    </Directory>

</VirtualHost>

 

Note that you will probably want to block port 8080 from non-localhost connections so that external submissions cannot be made directly to the Grobid module service.
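
One way to confirm the port is closed is to run a small connection check from a machine outside the server. The sketch below is a minimal illustration; the hostname in the usage note is a placeholder, and a successful connection only shows that something is listening, not that it is Grobid.

```python
import socket


def port_accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Run from a remote machine, `port_accepts_connections("xmlps.example.org", 8080)` should return False once the port is blocked.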

 

Unit tests

After a successful installation the unit tests should complete without errors:

# ./unittest.sh

You will need to re-empty the cache directories afterward, if your Web server runs as a different user than you.

 

Developer notes

  • SASS compilation and CSS and JavaScript compression and unification are done using Guard (http://guardgem.org)

  • After making changes to JavaScript (javascript/) or style files (style/scss/), recompile and recompress the style and JavaScript files by running

# guard

 

API

There is a simple REST API for submitting jobs to the server, checking their status, and retrieving the results.

Submit

Submit a job to the server. The citationStyleHash is an internal identifier for the requested citation style. A list of hashes can be retrieved through the citationStyleList API. The API returns the job id, which can be used to query the server for the job status or to retrieve the completed job later.

  • URL: api/job/submit

  • Request type: POST

  • Parameters:

    • email

    • access_token

    • fileName

    • fileContent

    • citationStyleHash

    • fileMetadata

E.g.:

http://example.com/api/job/submit
POST parameters:
    'email' => 'user@example.com'
    'access_token' => 'access_token'
    'fileName' => 'document.docx'
    'citationStyleHash' => 'c6de5efe3294b26391ea343053c19a84'
    'fileContent' => '...'
    'fileMetadata' => *OPTIONAL* known good metadata like https://raw.githubusercontent.com/pkp/xmlps/master/module/MergeXMLOutputs/test/assets/metadata.json 

Example response:

{"status":"success","id":123}
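
The request above could be assembled from a client roughly as follows. This is a minimal sketch using only the Python standard library, not the project's own client. The helper names are hypothetical, and sending the fields as a URL-encoded form body is an assumption: the source does not state how fileContent must be encoded.

```python
import json
import urllib.parse
import urllib.request


def build_submit_request(base_url, email, access_token, file_name,
                         file_content, citation_style_hash, file_metadata=None):
    """Build (but do not send) a POST request for api/job/submit."""
    fields = {
        "email": email,
        "access_token": access_token,
        "fileName": file_name,
        # Assumption: the content is passed as a plain form field.
        "fileContent": file_content,
        "citationStyleHash": citation_style_hash,
    }
    if file_metadata is not None:
        # Optional known-good metadata, as described above.
        fields["fileMetadata"] = file_metadata
    body = urllib.parse.urlencode(fields).encode("ascii")
    url = base_url.rstrip("/") + "/api/job/submit"
    return urllib.request.Request(url, data=body, method="POST")


def parse_submit_response(text):
    """Return the job id from a response such as {"status":"success","id":123}."""
    payload = json.loads(text)
    if payload.get("status") != "success":
        raise RuntimeError("Submission failed: %r" % (payload,))
    return payload["id"]
```

To send the request, pass the returned object to `urllib.request.urlopen` and hand the decoded response body to `parse_submit_response`.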

Status

Returns the current status for a job. Only completed jobs can be retrieved from the server. A full list of statuses can be found here.

  • URL: api/job/status

  • Request type: GET

  • Parameters:

    • email

    • access_token

    • id

E.g.:

http://example.com/api/job/status?email=user@example.com&access_token=access_token&id=123

Example response:

{"status":"success","jobStatus":0,"jobStatusDescription":"Pending"}
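
A client might build the status URL and read the response along these lines. This is a sketch with hypothetical helper names. Unlike the example URL above, it percent-encodes the query values, so the @ in the email address becomes %40.

```python
import json
import urllib.parse


def build_status_url(base_url, email, access_token, job_id):
    """Return the GET URL for api/job/status with encoded query parameters."""
    query = urllib.parse.urlencode(
        {"email": email, "access_token": access_token, "id": job_id}
    )
    return base_url.rstrip("/") + "/api/job/status?" + query


def parse_status_response(text):
    """Return (jobStatus, jobStatusDescription) from a status response."""
    payload = json.loads(text)
    if payload.get("status") != "success":
        raise RuntimeError("Status query failed: %r" % (payload,))
    return payload["jobStatus"], payload["jobStatusDescription"]
```

For the example response above, `parse_status_response` returns the numeric status together with its description, which a client can poll until the job is reported as completed.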

Citation Style List

Returns a list of available citation styles and their internal ids. We support all citation styles from citationstyles.org.

  • URL: api/job/citationStyleList

  • Request type: GET

E.g.:

http://example.com/api/job/citationStyleList

Example response:

{"status":"success","citationStyles":{"c6de5efe3294b26391ea343053c19a84":"ACM SIG Proceedings (\u0022et al.\u0022 for 15+ authors)"...
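
Since the submit call expects a hash rather than a style name, a client will usually need to look the hash up first. The following sketch (the function name is hypothetical) searches the citationStyles mapping from a response like the one above for styles whose name contains a given fragment.

```python
import json


def find_citation_style_hashes(response_text, name_fragment):
    """Return the hashes of all styles whose name contains name_fragment."""
    payload = json.loads(response_text)
    styles = payload.get("citationStyles", {})
    needle = name_fragment.lower()
    # Case-insensitive substring match on the human-readable style name.
    return [style_hash for style_hash, name in styles.items()
            if needle in name.lower()]
```

A search may match several styles, so the function returns a list and leaves the final choice to the caller.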

Retrieve

Retrieve a converted document. The conversionStage parameter specifies which conversion output is returned. A full list of conversion stages can be found here. The "final" XML produced by our pipeline is stage 20.

  • URL: api/job/retrieve

  • Request type: GET

  • Parameters:

    • email

    • access_token

    • id

    • conversionStage

E.g.:

http://example.com/api/job/retrieve?email=user@example.com&access_token=access_token&id=123&conversionStage=10

Example response:

The requested document or a JSON string with an error message.
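
Because the response is either the document itself or a JSON error, a client has to tell the two apart. The sketch below shows one way to do that. It assumes that an error is always a JSON object and that a converted document never is; the source does not specify the shape of the error payload, and the helper names are hypothetical.

```python
import json
import urllib.parse


def build_retrieve_url(base_url, email, access_token, job_id, conversion_stage):
    """Return the GET URL for api/job/retrieve with encoded query parameters."""
    query = urllib.parse.urlencode({
        "email": email,
        "access_token": access_token,
        "id": job_id,
        "conversionStage": conversion_stage,
    })
    return base_url.rstrip("/") + "/api/job/retrieve?" + query


def interpret_retrieve_body(body):
    """Return ("error", payload) for a JSON object, else ("document", body)."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except ValueError:
        # Not UTF-8 or not JSON: treat the bytes as the requested document.
        return "document", body
    if isinstance(payload, dict):
        return "error", payload
    return "document", body
```

Checking the Content-Type header of the response, if the server sets one, would be a more robust test than parsing the body.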