# Detecting (unscored) Data Fields

In this document and the next, we follow the pipeline described in [1].

Given a set of words, how do we convert them into data fields? This document explores that question.

The data fields here are unscored, meaning they can't yet be used for learning. In the next section (Classifying Data Fields) we will assign scores to them.

In [1]:
# Preparing required modules.
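import os
import sys
import inspect

# Make the parent of the current working directory importable.
sys.path.append(os.path.dirname(os.path.abspath('.')))

# Resolve the project root: one level above the directory of the running code.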
base_path = os.path.realpath(
    os.path.abspath(
        os.path.join(
            os.path.split(
                inspect.getfile(
                    inspect.currentframe()
                )
            )[0],
            '..'
        )
    )
)

sys.path.append(base_path)

class ListTable(list):
    """ Overridden list class which takes a 2-dimensional list of 
        the form [[1,2,3],[4,5,6]], and renders an HTML Table in 
        IPython Notebook. """
    
    def _repr_html_(self):
        html = ["<table>"]
        for row in self:
            html.append("<tr>")
            
            for col in row:
                html.append("<td>{0}</td>".format(col))
            
            html.append("</tr>")
        html.append("</table>")
        return ''.join(html).decode('utf-8')

## 1. First version: Use a dumb clusterer

In this first version we will just use a dumb clusterer, which finds multi-word expressions (MWEs) by computing the distribution of trigrams and bigrams in the corpus. It also allows custom MWEs to be set up.

If that sounds dumb, that's because it is. This clusterer was just made to create a baseline for our entire pipeline.
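To make the idea concrete, here is a minimal sketch of n-best MWE selection. It is not the `DumbClusterer` implementation, whose internals are not shown in this document. It assumes that n-grams are ranked by raw frequency, and the names `ngrams`, `nbest_ngrams`, `find_mwes` and `custom_mwes` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return list(zip(*[tokens[i:] for i in range(n)]))


def nbest_ngrams(tokens, n, nbest):
    """Return the `nbest` most frequent n-grams as token lists."""
    counts = Counter(ngrams(tokens, n))
    return [list(gram) for gram, _count in counts.most_common(nbest)]


def find_mwes(tokens, trigram_nbest, bigram_nbest, custom_mwes=None):
    """Collect the top trigrams, then the top bigrams, then custom MWEs.

    Assumption: ranking is by raw frequency. The real clusterer may use
    a different scoring function.
    """
    mwes = nbest_ngrams(tokens, 3, trigram_nbest)
    mwes += nbest_ngrams(tokens, 2, bigram_nbest)
    for mwe in custom_mwes or []:
        if list(mwe) not in mwes:
            mwes.append(list(mwe))
    return mwes


tokens = "heat pump water well heat pump single family heat pump".split()
print(find_mwes(tokens, trigram_nbest=0, bigram_nbest=1))
# [['heat', 'pump']]
```

Under this kind of ranking, an expression that occurs only a few times in the corpus sits far down the list, so it is selected only if the cutoff is large enough or if it is supplied as a custom MWE.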

In [14]:
# Prepare corpus if it hasn't been created.
from libs.arthur.reader import create_corpus, read
from libs.arthur.clusterer import DumbClusterer
from libs.arthur import ArthurDocument

zip_path = os.path.join(base_path, 'exploration', 'DRAS_sample_v1_20150605.zip')
corpus_dir = 'corpus'
create_corpus(zip_path, corpus_dir)
clusterer = DumbClusterer(corpus_dir, setup_mwes=False)

In [19]:
clusterer.setup_mwes(trigram_nbest=1000, bigram_nbest=15000)
pdf_path = 'pdfs/05337591.pdf'
with open(pdf_path, 'rb') as f:
    document = ArthurDocument(f.read(), doctype='pdf', name='test')
field_data = read(document, clusterer)

List of MWEs:

In [25]:
table1 = ListTable()
table1.append(['id', 'mwe'])
for counter, record in enumerate(clusterer.mwes[0:10]):
    table1.append([counter, record])
print("# of MWEs: %i" % len(clusterer.mwes))
print("Sample of first 10:")
table1

# of MWEs: 16000
Sample of first 10:


| id | mwe |
| --- | --- |
| 0 | [u'bonnymuir', u'pl'] |
| 1 | [u'covered', u'boatslips'] |
| 2 | [u'name', u'annex'] |
| 3 | [u'permit', u'this'] |
| 4 | [u'backs', u'onto'] |
| 5 | [u'whalesback', u'dynamic', u'contemporary'] |
| 6 | [u'imperial', u'without'] |
| 7 | [u'also', u'ideal'] |
| 8 | [u'landscaping', u'impressive'] |


List of (unscored) field data:

In [21]:
table1 = ListTable()
# Header labels match the record keys read below.
table1.append(['id', 'text', 'x', 'x1', 'y', 'y1'])
for counter, record in enumerate(field_data):
    table1.append([counter, record['text'], record['x'], record['x1'], record['y'], record['y1']])
table1

| id | text | x | x1 | y | y1 |
| --- | --- | --- | --- | --- | --- |
| 0 | 1191 | 53.9999784 | 77.879968848 | 13.020311592 | 23.9283072288 |
| 1 | ROUTE | 80.879967648 | 120.969935924 | 13.020311592 | 23.9283072288 |
| 2 | 785 | 123.95984928 | 141.95984208 | 13.020311592 | 23.9283072288 |
| 3 | , | 141.95984208 | 144.95984088 | 13.020311592 | 23.9283072288 |
| 4 | UTOPIA | 147.95983968 | 191.899776256 | 13.020311592 | 23.9283072288 |
| 5 | , | 191.873760626 | 194.873759426 | 13.020311592 | 23.9283072288 |
| 6 | NEW BRUNSWICK | 197.867898857 | 300.255637014 | 13.020311592 | 23.9283072288 |
| 7 | , | 300.229592104 | 303.229590904 | 13.020311592 | 23.9283072288 |
| 8 | New Brunswick | 306.223730334 | 383.867261645 | 13.020311592 | 23.9283072288 |


As seen above, values like "Property Type", "Single Family", "Heat Pump", "Water Well", etc. were correctly extracted from this document. Trigrams like "Age of Building", however, were not captured, even with trigram_nbest raised above 20,000, possibly because this trigram is rare compared to other trigrams.
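The effect of a missing MWE on the resulting fields can be sketched as follows. This is an illustration only, not the logic of `read` or `DumbClusterer`. It assumes each word arrives as a dict with the same keys as the field records above (`text`, `x`, `x1`, `y`, `y1`) and that adjacent words are merged greedily, longest match first. The function name `merge_mwes` and the coordinates in the example are hypothetical.

```python
def merge_mwes(words, mwes):
    """Merge runs of adjacent words that match a known MWE.

    `words` is a list of dicts (text, x, x1, y, y1) in reading order.
    `mwes` is a list of lowercase token lists.
    Words that match no MWE become single-word fields.
    """
    known = set(tuple(mwe) for mwe in mwes)
    max_len = max([len(mwe) for mwe in known] or [1])
    fields = []
    i = 0
    while i < len(words):
        # Try the longest run first so trigrams win over bigrams.
        for size in range(min(max_len, len(words) - i), 0, -1):
            run = words[i:i + size]
            key = tuple(w['text'].lower() for w in run)
            if size == 1 or key in known:
                fields.append({
                    'text': ' '.join(w['text'] for w in run),
                    'x': run[0]['x'],                 # left edge of first word
                    'x1': run[-1]['x1'],              # right edge of last word
                    'y': min(w['y'] for w in run),
                    'y1': max(w['y1'] for w in run),
                })
                i += size
                break
    return fields


def word(text, x, x1):
    """Build a word box on a single text line (made-up coordinates)."""
    return {'text': text, 'x': x, 'x1': x1, 'y': 10.0, 'y1': 20.0}


words = [word('NEW', 0.0, 30.0), word('BRUNSWICK', 33.0, 100.0),
         word('Age', 110.0, 130.0), word('of', 133.0, 143.0),
         word('Building', 146.0, 190.0)]

fields = merge_mwes(words, [['new', 'brunswick']])
print([f['text'] for f in fields])
# ['NEW BRUNSWICK', 'Age', 'of', 'Building']
```

In this sketch, "Age of Building" stays split into three single-word fields until `['age', 'of', 'building']` is added to the MWE list, which is one reason a custom MWE entry may be a more practical fix than raising the n-best cutoff further.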