# The Genius of Pyjnius
### Using Pyjnius with Tika and the Stanford Core NLP
#### Don MacMillen

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

[pyjnius](https://github.com/kivy/pyjnius) is a Python module to access Java classes that live in compiled jar files by using the Java Native Interface ([JNI](https://en.wikipedia.org/wiki/Java_Native_Interface)).

pyjnius is a subproject of the [kivy](https://github.com/kivy) project, an open source, cross-platform Python framework for developing applications with user interfaces.

I use pyjnius for interfacing with Apache [Tika](https://tika.apache.org/) and the Stanford [NLP](http://nlp.stanford.edu/) project.

Since the Stanford NLP code requires Java 1.8, we need to have that installed.  On Ubuntu 14.04, I found it difficult to find a trusted PPA for OpenJDK, so I went with the Oracle JDK.

To install it, do the following:

    sudo add-apt-repository ppa:webupd8team/java
    sudo apt-get update
    sudo apt-get install oracle-java8-installer
    
You can verify the installation by typing java -version.

In [1]:
!java -version

java version "1.8.0_66"
Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)
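
The same check can be made from Python, which is handy at the top of a notebook. This is only a minimal sketch and not part of the original setup: the helper names parse_java_version and installed_java_version are made up for illustration. Note that java -version writes its banner to stderr rather than stdout.

```python
import re
import subprocess


def parse_java_version(banner):
    """Return the major Java version found in a 'java -version' banner, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
    if match is None:
        return None
    first = int(match.group(1))
    second = int(match.group(2) or 0)
    # Legacy numbering: "1.8.0_66" means Java 8; newer releases use "11.0.2" style.
    return second if first == 1 else first


def installed_java_version():
    """Run 'java -version' and parse its banner; None if java is not on the PATH."""
    try:
        proc = subprocess.run(['java', '-version'],
                              stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                              universal_newlines=True)
    except FileNotFoundError:
        return None
    return parse_java_version(proc.stderr)


print(parse_java_version('java version "1.8.0_66"'))
```

With the banner shown above, the parser reports Java 8, which is what the Stanford NLP jars used later expect.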


You will also need to make certain that your JAVA_HOME environment variable is set.  I put it into my .bashrc:

    export JAVA_HOME=/usr/lib/jvm/java-8-oracle
    
and after sourcing .bashrc you can check that it is set.

In [2]:
!env | grep JAVA

JAVA_HOME=/usr/lib/jvm/java-8-oracle
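
It can also be worth checking JAVA_HOME from inside the Python process itself, since a notebook kernel may have been started from an environment that never sourced .bashrc. A small sketch follows; the function name java_home_status is hypothetical.

```python
import os


def java_home_status(env=None):
    """Return (path, exists) for JAVA_HOME, or (None, False) if it is unset."""
    env = os.environ if env is None else env
    path = env.get('JAVA_HOME')
    if not path:
        return (None, False)
    # The variable may be set but point at a directory that no longer exists.
    return (path, os.path.isdir(path))


print(java_home_status())
```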


Now we need to install pyjnius.  For whatever reason, the version on PyPI is 3 years old and limited to Python 2.7.  It is an active project, however, so we can clone the current source from GitHub and build it:

    git clone https://github.com/kivy/pyjnius.git
    cd pyjnius
    make
    make test
    python setup.py install
    
The use of pyjnius with Tika was first described (for me, at least) [here](http://www.hackzine.org/using-apache-tika-from-python-with-jnius.html).

We need the Tika application jar.  We can get that [here](https://tika.apache.org/download.html) and put it someplace convenient.

Now we are ready to extract some text.

We have to set the classpath, which can be done either through the CLASSPATH environment variable (via os.environ) or with the jnius_config module.  I have also added the directory for the Stanford NLP Core library example that is used later.  You will have to download those jars from the NLP website mentioned earlier.

jnius_config must be imported and configured **before** importing the jnius module.  That's because importing jnius actually launches a JVM, and the JVM options cannot be changed once it is running.  The option -Xmx4G tells the JVM that its heap may grow up to 4GB. The default maximum heap size is too small for what we do here.
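
The ordering constraint is easy to get wrong in a long notebook, so here is a hedged sketch of a guard around it. The helpers build_classpath and configure_jvm are my own illustrative names, not part of pyjnius; the only pyjnius call used is jnius_config.add_options, the same one that appears in the cell further down.

```python
import os
import sys


def build_classpath(jar_dirs):
    """Join jar directories into one classpath string, using the JVM's '*' wildcard."""
    return os.pathsep.join(os.path.join(d, '*') for d in jar_dirs)


def configure_jvm(jar_dirs, max_heap='4G'):
    """Set the heap limit and classpath; must be called before jnius is imported."""
    if 'jnius' in sys.modules:
        raise RuntimeError('jnius is already imported, so the JVM options are fixed')
    import jnius_config  # imported lazily so this sketch loads without pyjnius
    jnius_config.add_options('-Xmx' + max_heap)
    os.environ['CLASSPATH'] = build_classpath(jar_dirs)


print(build_classpath(['/home/fermi/tikajars']))
```

Checking sys.modules is only a heuristic for "the JVM has started", but it catches the common mistake of re-running cells out of order.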

We also need to raise the maximum number of characters to consider.  This must be done separately for the input and the output.  The following code shows how.

In [3]:
import os
import os.path
import re
import sys


os.environ['CLASSPATH'] = ("/home/fermi/tikajars/*:"
                           "/home/fermi/snlp/stanford-corenlp-full-2015-12-09/*")

import jnius_config
jnius_config.add_options('-Xmx4G')

import jnius
from jnius import autoclass
from jnius import JavaException

# Import Java classes
Tika = autoclass('org.apache.tika.Tika')
Metadata = autoclass('org.apache.tika.metadata.Metadata')
AutoDetectParser = autoclass('org.apache.tika.parser.AutoDetectParser')
ParseContext = autoclass('org.apache.tika.parser.ParseContext')
BodyContentHandler = autoclass('org.apache.tika.sax.BodyContentHandler')
LanguageIdentifier = autoclass('org.apache.tika.language.LanguageIdentifier')
FileInputStream = autoclass('java.io.FileInputStream')

MAX_CHARACTERS = 30*1024*1024

# Besides getting the metadata, this also extracts the text
def get_meta(fname):
    '''
    Return the file meta data as a dict as well as the body text
    of the file.  Tika will automatically figure out file type and
    extract the text.
    '''
    tika = Tika()
    
    tika.setMaxStringLength(MAX_CHARACTERS)
    parser = AutoDetectParser()
    handler = BodyContentHandler(MAX_CHARACTERS)  # number is max chars to write
    meta = Metadata()
    inputstream = FileInputStream(fname)
    context = ParseContext()
    
    try:
        parser.parse(inputstream, handler, meta, context)
    finally:
        inputstream.close()  # release the file handle even if parsing fails
    
    mdict = dict((name, meta.get(name)) for name in meta.names())
    return (mdict, handler.toString())

In [4]:
mdict, txt = get_meta('./data/nnnn.pdf')

In [5]:
for k, v in mdict.items():
    print(k, v)

access_permission:extract_content true
access_permission:extract_for_accessibility true
access_permission:modify_annotations true
created Thu Mar 13 05:25:09 PDT 2008
Content-Type application/pdf
Last-Save-Date 2008-03-13T12:27:15Z
access_permission:assemble_document true
Creation-Date 2008-03-13T12:25:09Z
dcterms:modified 2008-03-13T12:27:15Z
meta:creation-date 2008-03-13T12:25:09Z
producer Acrobat Distiller 6.0.1 (Windows)
xmp:CreatorTool Arbortext Advanced Print Publisher 9.0.114/W
dc:title 10055_2008_84_12_1-web 17..25
access_permission:fill_in_form true
dcterms:created 2008-03-13T12:25:09Z
pdf:encrypted false
access_permission:can_print_degraded true
date 2008-03-13T12:27:15Z
pdf:PDFVersion 1.3
X-Parsed-By org.apache.tika.parser.DefaultParser
access_permission:can_print true
title 10055_2008_84_12_1-web 17..25
modified 2008-03-13T12:27:15Z
xmpTPg:NPages 9
access_permission:can_modify true
meta:save-date 2008-03-13T12:27:15Z
dc:format application/pdf; version=1.3
Last-Modified 2008
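
The flat listing mixes several metadata vocabularies (dc:, dcterms:, pdf:, and so on), with the prefix before the colon acting as a namespace. As a small sketch of how to make that easier to browse, the following groups a metadata dict by prefix. The function name group_by_namespace is made up, and the sample dict just copies a few entries from the output above.

```python
from collections import defaultdict


def group_by_namespace(mdict):
    """Group 'prefix:name' metadata keys by prefix; unprefixed keys go under ''."""
    groups = defaultdict(dict)
    for key, value in mdict.items():
        namespace, sep, name = key.partition(':')
        if not sep:
            # No colon in the key, e.g. 'Content-Type'.
            namespace, name = '', key
        groups[namespace][name] = value
    return dict(groups)


sample = {
    'dc:title': '10055_2008_84_12_1-web 17..25',
    'dcterms:created': '2008-03-13T12:25:09Z',
    'pdf:PDFVersion': '1.3',
    'pdf:encrypted': 'false',
    'Content-Type': 'application/pdf',
}
groups = group_by_namespace(sample)
for namespace in sorted(groups):
    print(repr(namespace), groups[namespace])
```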

In [6]:
print(txt[:200])


ORIGINAL ARTICLE

FARMTASIA: an online game-based learning environment
based on the VISOLE pedagogy

Kevin K. F. Cheung Æ Morris S. Y. Jong Æ
F. L. Lee Æ Jimmy H. M. Lee Æ Eric T. H. Luk Æ
Junjie Sha


One of the 'problems' with using pyjnius is that the objects that are returned are not very 'Pythonic'.  For example, a Java list is **not** a Python list, as the following clearly shows.

In [7]:
ArrayList = autoclass('java.util.ArrayList')
jl = ArrayList()
print(dir(jl))
jl.add('aaa')
jl.add('eee')
jl.add('hhh')
print(jl.toString())
print('jl.__class__', jl.__class__)

['__class__', '__cls_storage', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__javaclass__', '__javaconstructor__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__pyx_vtable__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'add', 'addAll', 'clear', 'clone', 'contains', 'containsAll', 'ensureCapacity', 'equals', 'forEach', 'get', 'getClass', 'hashCode', 'indexOf', 'isEmpty', 'iterator', 'lastIndexOf', 'listIterator', 'mpty', 'notify', 'notifyAll', 'parallelStream', 'remove', 'removeAll', 'removeIf', 'replaceAll', 'retainAll', 'set', 'size', 'sort', 'spliterator', 'stream', 'subList', 'toArray', 'toString', 'trimToSize', 'wait']
[aaa, eee, hhh]
jl.__class__ <class 'jnius.reflect.java.util.ArrayList'>


I am not going to try to reproduce the full list object behavior for the Java ArrayList type, but we can make a Python class that will make iteration over these Javaesque objects just a little less ugly (and be useful for other classes that have the Java iterator() method).

In [8]:
# Make iteration over Java objects a little less ugly
class Jiter():
    def __init__(self, jobj):
        self.jobj = jobj

    def __iter__(self):
        it = self.jobj.iterator()  # avoid shadowing the builtin iter()
        while it.hasNext():
            yield it.next()
            
    def __len__(self):
        return self.jobj.size()

pl = Jiter(jl)
for item in pl:
    print(item)
    
print(len(pl))

aaa
eee
hhh
3
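
Going one step further, the same wrapper idea can expose indexing and slicing through the Java get(int) method. The sketch below is an assumption-laden illustration rather than anything pyjnius provides: JList is a hypothetical name, and FakeArrayList and FakeIterator are pure-Python stand-ins for the Java objects so that the example runs without a JVM.

```python
class JList:
    """Read-only, Python-flavoured view of an object with java.util.List methods."""

    def __init__(self, jobj):
        self.jobj = jobj

    def __iter__(self):
        it = self.jobj.iterator()
        while it.hasNext():
            yield it.next()

    def __len__(self):
        return self.jobj.size()

    def __getitem__(self, index):
        n = len(self)
        if isinstance(index, slice):
            return [self.jobj.get(i) for i in range(*index.indices(n))]
        if index < 0:
            index += n  # support Python-style negative indices
        if not 0 <= index < n:
            raise IndexError('JList index out of range')
        return self.jobj.get(index)


class FakeIterator:
    """Stand-in for java.util.Iterator (hypothetical, for illustration only)."""

    def __init__(self, items):
        self._items = items
        self._pos = 0

    def hasNext(self):
        return self._pos < len(self._items)

    def next(self):
        item = self._items[self._pos]
        self._pos += 1
        return item


class FakeArrayList:
    """Stand-in for java.util.ArrayList (hypothetical, for illustration only)."""

    def __init__(self, items):
        self._items = list(items)

    def size(self):
        return len(self._items)

    def get(self, i):
        return self._items[i]

    def iterator(self):
        return FakeIterator(self._items)


pl = JList(FakeArrayList(['aaa', 'eee', 'hhh']))
print(pl[0], pl[-1], pl[0:2], len(pl))
```

With a real JVM, the idea would be to wrap the jl object from the earlier cell in the same way, i.e. JList(jl).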


So now we show the Python version of ShiftReduceDemo.java that is shipped with the Core NLP code.  One (of the many) differences in the Python version is that we do not need all the abstract classes that are strewn around in the Java code.

In [9]:
StringReader = autoclass('java.io.StringReader')

ShiftReduceParser = autoclass('edu.stanford.nlp.parser.shiftreduce.ShiftReduceParser')
DocumentPreprocessor = autoclass('edu.stanford.nlp.process.DocumentPreprocessor')
MaxentTagger = autoclass('edu.stanford.nlp.tagger.maxent.MaxentTagger')

model_path = "edu/stanford/nlp/models/srparser/englishSR.ser.gz"
tagger_path = "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger"

In [10]:
demo_text = ("My dog likes to shake his stuffed chickadee toy.\n"
             "The truth will set you free only when the truth is freely available.")
    
tagger = MaxentTagger(tagger_path)
model = ShiftReduceParser.loadModel(model_path)
strdr = StringReader(demo_text)

So now we come up against a bug in pyjnius.

We want to create a new tokenizer from the DocumentPreprocessor class.  The class constructor can take either a Reader object (but this is an abstract class), or a string that is a path name to a file.  We'd like to pass in a StringReader object (subclassed from Reader) and this should be OK but pyjnius thinks it is not and throws an error.

    ---> 84     tokenizer = DocumentPreprocessor(StringReader(text))
         85 
         86     print('strdr class: ', strdr.__class__)

    jnius/jnius_export_class.pxi in jnius.JavaClass.__init__ (jnius/jnius.c:18610)()

    jnius/jnius_export_class.pxi in jnius.JavaClass.call_constructor (jnius/jnius.c:19719)()

    jnius/jnius_conversion.pxi in jnius.populate_args (jnius/jnius.c:8323)()

    jnius/jnius_utils.pxi in jnius.check_assignable_from (jnius/jnius.c:5607)()

    JavaException: Invalid instance of 'java/io/StringReader' passed for a 'java/lang/String'

There are actually five different signatures for the constructor for DocumentPreprocessor.  They are:

    public DocumentPreprocessor(Reader input) {
    public DocumentPreprocessor(Reader input, DocType t) {
    public DocumentPreprocessor(String docPath) {
    public DocumentPreprocessor(String docPath, DocType t) {
    public DocumentPreprocessor(String docPath, DocType t, String encoding) {
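
To see why the Reader overload ought to match, here is a toy model of overload resolution. It is emphatically not pyjnius's actual code: the PARENTS table, is_assignable and pick_overload are all invented for illustration, and only the two single-argument constructors are modelled. Class names use the same slash notation as the error message above.

```python
# Toy class hierarchy: child -> parent.
PARENTS = {
    'java/io/StringReader': 'java/io/Reader',
    'java/io/Reader': 'java/lang/Object',
    'java/lang/String': 'java/lang/Object',
}


def is_assignable(actual, declared):
    """True if 'actual' is 'declared' or one of its subclasses in the toy hierarchy."""
    while actual is not None:
        if actual == declared:
            return True
        actual = PARENTS.get(actual)
    return False


def pick_overload(overloads, arg_types):
    """Return the first parameter tuple that accepts every argument, else None."""
    for params in overloads:
        if len(params) == len(arg_types) and all(
                is_assignable(a, p) for a, p in zip(arg_types, params)):
            return params
    return None


constructors = [('java/lang/String',), ('java/io/Reader',)]
print(pick_overload(constructors, ('java/io/StringReader',)))
```

In this model a StringReader is rejected by the String overload and accepted by the Reader one, which is the behavior we would like from pyjnius.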

So what to do? Well, we hack and fool pyjnius into thinking it is OK (and yes, I did file an issue on pyjnius).

In [11]:
Reader = autoclass('java.io.Reader')
strdr.__class__ = Reader  # I will burn in the Python hell for this.
tokenizer = DocumentPreprocessor(strdr)

So the tokenizer is an iterator over the sentences in the source text.  Each sentence is first passed to the tagger; the tagged sentence can then be parsed into a grammatical parse tree.

In [12]:
trees = []
for sentence in Jiter(tokenizer):
    tagged = tagger.tagSentence(sentence)
    trees.append(model.apply(tagged))

Now we can take a look at our parse trees.

In [13]:
for tree in trees:
    print(tree.toString(), '\n')

(ROOT (S (NP (PRP$ My) (NN dog)) (VP (VBZ likes) (S (VP (TO to) (VP (VB shake) (NP (PRP$ his) (VBN stuffed) (NN chickadee) (NN toy)))))) (. .))) 

(ROOT (S (NP (DT The) (NN truth)) (VP (MD will) (VP (VB set) (NP (PRP you)) (ADVP (JJ free)) (SBAR (WHADVP (RB only) (WRB when)) (S (NP (DT the) (NN truth)) (VP (VBZ is) (ADJP (RB freely) (JJ available))))))) (. .))) 
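
The bracketed strings are easy to pull back into Python structures if you want to work with them without going through the Java Tree API. The following is a minimal sketch under that assumption: parse_tree and tagged_words are hypothetical helper names, and the parser only handles well-formed bracketed output like that shown above.

```python
import re


def parse_tree(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r'\(|\)|[^\s()]+', text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != '(':
            return token  # a leaf word
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ')':
            children.append(read())
        pos += 1  # consume the closing ')'
        return (label, children)

    return read()


def tagged_words(node):
    """Collect (word, part-of-speech tag) pairs from the leaves, left to right."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        pairs.extend(tagged_words(child))
    return pairs


tree = parse_tree('(ROOT (S (NP (PRP$ My) (NN dog)) (VP (VBZ likes)) (. .)))')
for word, tag in tagged_words(tree):
    print(word, tag)
```

The short input string here is a trimmed version of the first tree; passing tree.toString() from the loop above should work the same way.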



Very cool.