# Learning Documents With NLTK

Let's explore the possibility of learning the documents by creating a corpus for them in python's NLTK package. This is important to help Arthur learns [collocations](http://www.nltk.org/howto/collocations.html) (expressions with multiple words), so it knows when to split tokens with more than one word.

The goal here is to split this text, for example:

```
Finest property in New Brunswick! Modern, luxurious, architectural home takes full advantage of center stage on Lake Utopia in St. George. Panoramic views of the lake, 27,000 sq.ft.under roof and 100+ acres of unspoiled natural beauty. Experience resort-style living with 3 homes, 2 tournament quality outdoor tennis courts, and 1 stadium quality indoor tennis court with state-of-the-art indoor stadium lighting, water park including 2 pools & wading pool with umbrella feature, beach volleyball court, baseball field, custom go-kart track, driving range, indoor basketball court, playground, private dock with boat lift and 3 private beaches. Main home offers expanses of glass flooding the interior with brilliant light, sleek contemporary design, dramatic master suite with custom shower/central tub showcasing unparalleled views, master lanai with drapery screening and built-in Jacuzzi. Two guest homes provide luxurious privacy for visitors enjoying this exquisite estate. Welcome to paradise!
```

Into this:

```
[('finest', ), 'property', 'in', '']
```

Then with human aid turn it into concepts:

```
[
    {
        name: 'location',
        values: ['new brunswick', 'st. george'],
        hints: ['in']
    },
    {
        name: 'features',
        values: ['modern', 'luxurious', 'architectural', 'tennis courts', 'built-in jacuzzi', ...]
    },
    {
        name: 'bulding size',
        values: ['27,000 sq.ft'],
        hints: ['under roof']
    },
    {
        name: 'land size',
        values: ['100+ acres'],
        hints: ['unspoiled natural beauty']
    },
    ...
]
```

## 1. Collocations in prebuilt corpus

First we will calculate collocations in a corpus that came with NLTK, just to see if this works.

In [5]:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
finder.nbest(bigram_measures.pmi, 10)

[(u'Allon', u'Bacuth'),
 (u'Ashteroth', u'Karnaim'),
 (u'Ben', u'Ammi'),
 (u'En', u'Mishpat'),
 (u'Jegar', u'Sahadutha'),
 (u'Salt', u'Sea'),
 (u'Whoever', u'sheds'),
 (u'appoint', u'overseers'),
 (u'aromatic', u'resin'),
 (u'cutting', u'instrument')]

How are corpus words stored and used anyway?

In [6]:
print(nltk.corpus.genesis.words('english-web.txt'))

[u'In', u'the', u'beginning', u'God', u'created', ...]


Alright, turns out we simply need to pass tokenized words into `BigramCollocationFinder.from_words()`.

## 2. Building our own corpus from plaintext

In the above, we inputted `nltk.corpus.genesis.words` into `BigramCollocationFinder`. How do we create our own corpus from some plaintext files? Let's explore it below. This was all from [this stackoverflow discussion](http://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk).

In [7]:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Let's create a corpus with 2 texts in different textfile.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1,txt2]

# Make new dir for the corpus.
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory.
filename = 0
for text in corpus:
    filename+=1
    with open(corpusdir+str(filename)+'.txt','w') as fout:
        print>>fout, text

# Check that our corpus do exist and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)),corpus):
    assert open(corpusdir+infile,'r').read().strip() == text.strip()

# Create a new corpus by specifying the parameters
# (1) directory of the new corpus
# (2) the fileids of the corpus
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

import pdb
# Access each file in the corpus.
for infile in sorted(newcorpus.fileids()):
    print infile # The fileids of each file.
    fin = newcorpus.open(infile) # Opens the file.
    print fin.read().strip() # Prints the content of the file
    fin.close()
print

1.txt
This is a foo bar sentence.
And this is the first txtfile in the corpus.
2.txt
Are you a foo bar? Yes I am. Possibly, everyone is.



In [8]:
# Access the plaintext; outputs pure string/basestring.
print newcorpus.raw().strip()

This is a foo bar sentence.
And this is the first txtfile in the corpus.
Are you a foo bar? Yes I am. Possibly, everyone is.


In [10]:
# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: NLTK automatically calls nltk.tokenize.sent_tokenize and 
#       nltk.tokenize.word_tokenize.
#
# Each element in the outermost list is a paragraph, and
# Each paragraph contains sentence(s), and
# Each sentence contains token(s)
print newcorpus.paras()

[[[u'This', u'is', u'a', u'foo', u'bar', u'sentence', u'.'], [u'And', u'this', u'is', u'the', u'first', u'txtfile', u'in', u'the', u'corpus', u'.']], [[u'Are', u'you', u'a', u'foo', u'bar', u'?'], [u'Yes', u'I', u'am', u'.'], [u'Possibly', u',', u'everyone', u'is', u'.']]]


In [11]:
# To access pargraphs of a specific fileid.
print newcorpus.paras(newcorpus.fileids()[0])

[[[u'This', u'is', u'a', u'foo', u'bar', u'sentence', u'.'], [u'And', u'this', u'is', u'the', u'first', u'txtfile', u'in', u'the', u'corpus', u'.']]]


In [12]:
# Access sentences in the corpus. (list of list of strings)
# NOTE: That the texts are flattened into sentences that contains tokens.
print newcorpus.sents()

[[u'This', u'is', u'a', u'foo', u'bar', u'sentence', u'.'], [u'And', u'this', u'is', u'the', u'first', u'txtfile', u'in', u'the', u'corpus', u'.'], ...]


In [13]:
# To access sentences of a specific fileid.
print newcorpus.sents(newcorpus.fileids()[0])

[[u'This', u'is', u'a', u'foo', u'bar', u'sentence', u'.'], [u'And', u'this', u'is', u'the', u'first', u'txtfile', u'in', u'the', u'corpus', u'.']]


In [14]:
# Access just tokens/words in the corpus. (list of strings)
print newcorpus.words()

[u'This', u'is', u'a', u'foo', u'bar', u'sentence', ...]


In [15]:
# To access tokens of a specific fileid.
print newcorpus.words(newcorpus.fileids()[0])

[u'This', u'is', u'a', u'foo', u'bar', u'sentence', ...]


## 3. Collocations in custom corpus

Putting the words from our custom corpus into `BigramAssocMeasures`, we get:

In [16]:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = BigramCollocationFinder.from_words(
    newcorpus.words())
finder.nbest(bigram_measures.pmi, 10)

[(u',', u'everyone'),
 (u'?', u'Yes'),
 (u'And', u'this'),
 (u'Are', u'you'),
 (u'I', u'am'),
 (u'Possibly', u','),
 (u'Yes', u'I'),
 (u'first', u'txtfile'),
 (u'txtfile', u'in'),
 (u'a', u'foo')]

In [17]:
len(newcorpus.words())

32

Looks good. Next, we will look into implementing this with our documents.

## 4. BigramAssocMeasures in extracted pdf documents

We are going to follow the following pipeline:

- Use ArthurReader to export documents into plaintexts.
- Get words from them.
- Use words to create collocations.
- Use collocations in word splitting.

The reason ArthurReader is needed when exporting documents is due to the bolded texts issue in pdf i.e. **A Text** would be written as "A TextA TextA TextA Text" if we just extract them as they are (when extracted from ArthurDocument, that is). Ideally later on we should keep bold information as feature, but for now let's just remove the duplicates.

### 4.1. Exporting to plaintexts

This section attempts to extract documents and remove duplications.

In [19]:
# Preparing required modules.
import os
os.sys.path.append(os.path.dirname(os.path.abspath('.')))

import sys
import inspect
base_path = os.path.realpath(
    os.path.abspath(
        os.path.join(
            os.path.split(
                inspect.getfile(
                    inspect.currentframe()
                )
            )[0],
            '..',
            'Arthur.workspace'
        )
    )
)

sys.path.append(base_path)

class ListTable(list):
    """ Overridden list class which takes a 2-dimensional list of 
        the form [[1,2,3],[4,5,6]], and renders an HTML Table in 
        IPython Notebook. """
    
    def _repr_html_(self):
        html = ["<table>"]
        for row in self:
            html.append("<tr>")
            
            for col in row:
                html.append("<td>{0}</td>".format(col.encode('utf-8')))
            
            html.append("</tr>")
        html.append("</table>")
        return ''.join(html).decode('utf-8')

In [20]:
from libs.arthur import ArthurDocument
from zipfile import ZipFile
from libs.arthur.errors import BatchReadingError
import numpy as np
import os
from scipy.spatial import cKDTree

# These are prototypes for ArthurReader's methods.

def correct_block(block, return_details=False):
    """Corrects block elements.
    
    Args:
        block(np.array): Block to correct
        
        return_details(bool): If True, return list instead of only corrected block. This list
                              contains details needed for debugging:
                              - removed features
                              - added features
                              Defaults to False
    """
    fxid = ArthurDocument.get_feature_id('x')
    fyid = ArthurDocument.get_feature_id('y')
    positions = block[:,[fxid,fyid]]
    tree = cKDTree(positions)

    # Removes duplicate elements that are close together
    radius = 0.4
    neighbors = tree.query_ball_point(positions, radius)
    neighbors = np.unique(neighbors)
    # This returns numpy array like:
    # [[0, 13, 26, 39] [1, 14, 27, 40] [5, 31, 44, 18] [11, 24, 37, 50]
    # [16, 29, 42, 3] [17, 30, 43, 4] [21, 8, 34, 47] [22, 35, 48, 9]
    # [32, 45, 19, 6] [36, 23, 10, 49] [38, 12, 25, 51] [41, 28, 2, 15]
    # [46, 33, 7, 20] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62]
    # [63] [64]]
    #
    # Which we will then remove duplicates e.g. remove index 13, 26, 39, 14, 27, etc.
    removed = []
    for n in neighbors:
        removed.extend(np.sort(n)[1:])
    
    # Removes image elements
    removed.extend(np.where(block[:,ArthurDocument.get_feature_id('img_width')] != -1)[0].tolist())
    
    cblock = np.delete(block, removed, axis=0)

    if return_details:
        return (cblock, removed)
    else:
        return cblock

def get_texts(document):
    """Get corrected texts from a document.
    Returns:
        list: one text instance per block.
    """
    blocks = DefaultCluster.create_blocks(document)

    texts = []
    for idx, block in enumerate(blocks):
        cblock = correct_block(block)
        texts.append(document.get_text(cblock))
    return texts

def create_corpus(zip_path, corpus_dir, batch_size=100, start_batch=0, stdout=None, recreate=False):
    """Create corpus from zip file. A corpus is basically just a list of text files.
    Args:
        zip_path(str):    Path of zip file to load.
        corpus_dir(str):  Path to corpus dir where the files will be written into.
        batch_size(int):  Size of batch to be processed. If we have one million documents,
                          we'd want to process them in batches. Defaults to 100.
        start_batch(int): When an error happens in batch processing, reader will return
                          index of the last batch processed. enter that index value to
                          start processing from that batch index. Defaults to 0.
        stdout(Object):   Pass sys.stdout to print progress, or pass any object with `write`
                          method to pass printed progress to it.
        overwrite(bool):  Overwrite files as they are created?
    """
    zipfile = ZipFile(zip_path, 'r')
    namelist = zipfile.namelist()
    jobs_total = len(namelist)
    jobs_left = jobs_total - start_batch*batch_size
    
    def process_batch(zipfile, corpus_dir, batch, total, counter=0, overwrite=False):
        for docname in batch:
            counter += 1
            filename = os.path.join(corpus_dir, docname+'.txt')
            if os.path.isfile(filename) and not overwrite:
                stdout.write("%s already exists (%i/%i)\n" % (docname, counter, total))
            else:
                content = zipfile.read(docname)
                stdout.write("processing %s (%i/%i)\n" % (docname, counter, total))
                document = ArthurDocument(content, name=docname)
                texts = get_texts(document)
                if len(texts) > 0:
                    if not os.path.isdir(corpus_dir):
                        os.mkdir(corpus_dir)

                    with open(filename,'w') as fout:
                        for text in texts:
                            print>>fout, text
                else:
                    stdout.write("    empty text! moving on...\n")

    while jobs_left > 0:
        job_start = jobs_total - jobs_left
        job_end = job_start + batch_size
        batch = namelist[job_start:job_end]
        process_batch(zipfile, corpus_dir, batch, jobs_total, job_start)
        jobs_left -= batch_size

    zipfile.close()


In [21]:
import sys
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.collocations import *
from libs.arthur.errors import BatchReadingError

zip_path = 'DRAS_sample_v1_20150605.zip'
corpus_dir = 'corpus'
try:
    create_corpus(zip_path, corpus_dir, stdout=sys.stdout)        
except BatchReadingError as e:
    print(e.msg)
    print('last batch was: %i' % e.last_batch)
    # Maybe do some retries here.

NameError: global name 'overwrite' is not defined

### 4.2. Create Collocations

Here's the plan:
1. Find collocations in our custom corpus.
2. Use them to tokenize a document's content with [MWETokenizer](http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.mwe).

In [None]:
corpus = PlaintextCorpusReader(corpus_dir, '.*')
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
bigram_finder = BigramCollocationFinder.from_words(corpus.words())
trigram_finder = TrigramCollocationFinder.from_words(corpus.words())

In [None]:
table1 = ListTable()
table1.append(['word1', 'word2'])
for pmi in bigram_finder.nbest(bigram_measures.pmi, 10):
    table1.append(pmi)
table1

In [None]:
table2 = ListTable()
table2.append(['word1', 'word2', 'word3'])
for pmi in trigram_finder.nbest(trigram_measures.pmi, 10):
    table2.append(pmi)
table2

### 4.3. Tokenize A Document with MWETokenizer

The next step is to tokenize our document, extracting all the data fields from them. Each 


keep the results of these bigrams and trigrams, then use them for MWETokenizer.

In [None]:
from nltk import word_tokenize
from nltk.tokenize import MWETokenizer

tokenizer = MWETokenizer(bigram_finder.nbest(bigram_measures.pmi, 30000))

pdf_path = os.path.join(base_path, 'exploration', 'pdfs', '10086622.pdf')
with open(pdf_path, 'r') as f:
    document = ArthurDocument(f.read())
    texts = get_texts(document)
    text = ' '.join(texts)
    tokenized = tokenizer.tokenize(word_tokenize(text))
    print tokenized