# Learning Documents With NLTK

Let's explore whether Arthur can learn from the documents by building a corpus for them with Python's NLTK package. This matters because it helps Arthur learn [collocations](http://www.nltk.org/howto/collocations.html) (expressions made of multiple words), so it knows how to handle tokens that span more than one word when splitting text.

The goal here is to split this text, for example:

```
Finest property in New Brunswick! Modern, luxurious, architectural home takes full advantage of center stage on Lake Utopia in St. George. Panoramic views of the lake, 27,000 sq.ft.under roof and 100+ acres of unspoiled natural beauty. Experience resort-style living with 3 homes, 2 tournament quality outdoor tennis courts, and 1 stadium quality indoor tennis court with state-of-the-art indoor stadium lighting, water park including 2 pools & wading pool with umbrella feature, beach volleyball court, baseball field, custom go-kart track, driving range, indoor basketball court, playground, private dock with boat lift and 3 private beaches. Main home offers expanses of glass flooding the interior with brilliant light, sleek contemporary design, dramatic master suite with custom shower/central tub showcasing unparalleled views, master lanai with drapery screening and built-in Jacuzzi. Two guest homes provide luxurious privacy for visitors enjoying this exquisite estate. Welcome to paradise!
```

Into this:

```
[('finest', ), 'property', 'in', '']
```
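
As a rough illustration of this first step, here is a minimal sketch that lowercases the text and splits it into word and punctuation tokens. The `simple_tokenize` helper and its regular expression are assumptions made for this example; they are not Arthur's actual tokenizer.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # \w+ grabs runs of letters/digits; [^\w\s] grabs single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Finest property in New Brunswick!")
print(tokens)
```

Note that `new` and `brunswick` end up as separate tokens here, which is exactly the problem collocations are meant to address.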

Then with human aid turn it into concepts:

```
[
    {
        name: 'location',
        values: ['new brunswick', 'st. george'],
        hints: ['in']
    },
    {
        name: 'features',
        values: ['modern', 'luxurious', 'architectural', 'tennis courts', 'built-in jacuzzi', ...]
    },
    {
        name: 'building size',
        values: ['27,000 sq.ft'],
        hints: ['under roof']
    },
    {
        name: 'land size',
        values: ['100+ acres'],
        hints: ['unspoiled natural beauty']
    },
    ...
]
```
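
To make the concept structure more concrete, the sketch below stores two of the concepts as Python dictionaries and looks their values up in a piece of text with plain substring matching. The `find_concept_values` helper is hypothetical and ignores the `hints` field; it only shows how such a structure could be queried, not how Arthur does it.

```python
# Hypothetical concept definitions, mirroring the structure shown above.
concepts = [
    {'name': 'location', 'values': ['new brunswick', 'st. george'], 'hints': ['in']},
    {'name': 'land size', 'values': ['100+ acres'], 'hints': ['unspoiled natural beauty']},
]

def find_concept_values(text, concepts):
    """Return {concept name: [values found in text]} using plain substring search."""
    lowered = text.lower()
    found = {}
    for concept in concepts:
        hits = [value for value in concept['values'] if value in lowered]
        if hits:
            found[concept['name']] = hits
    return found

sample = "Center stage on Lake Utopia in St. George, with 100+ acres of unspoiled natural beauty."
print(find_concept_values(sample, concepts))
```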

## 1. Collocations in prebuilt corpus

First we will compute collocations on a corpus that ships with NLTK, just to check that this works.

In [None]:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = BigramCollocationFinder.from_words(
    nltk.corpus.genesis.words('english-web.txt'))
finder.nbest(bigram_measures.pmi, 10)

How are corpus words stored and used anyway?

In [None]:
print(nltk.corpus.genesis.words('english-web.txt'))

Alright, turns out we simply need to pass tokenized words into `BigramCollocationFinder.from_words()`.
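
To make the `pmi` ranking less of a black box, here is a small hand-rolled sketch of pointwise mutual information for adjacent word pairs. It assumes probabilities are estimated by dividing every count by the number of words; the `bigram_pmi` helper and the toy sentence are made up for illustration, and NLTK's own scores may differ in detail.

```python
import math
from collections import Counter

def bigram_pmi(words):
    """Score each adjacent word pair with pointwise mutual information (log base 2)."""
    total = len(words)
    word_counts = Counter(words)
    pair_counts = Counter(zip(words, words[1:]))
    scores = {}
    for (w1, w2), n_pair in pair_counts.items():
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) ), with counts normalised by `total`.
        ratio = float(n_pair) * total / (word_counts[w1] * word_counts[w2])
        scores[(w1, w2)] = math.log(ratio, 2)
    return scores

words = "new brunswick is in canada and new brunswick is big".split()
scores = bigram_pmi(words)
for pair, score in sorted(scores.items(), key=lambda item: -item[1])[:3]:
    print(pair, round(score, 3))
```

In this toy example, pairs made of words that occur only once (such as `in canada`) score higher than the repeated pair `new brunswick`, which shows how strongly PMI favors rare words.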

## 2. Building our own corpus from plaintext

In the section above, we passed `nltk.corpus.genesis.words` to `BigramCollocationFinder`. How do we create our own corpus from plaintext files? Let's explore that below. The code comes from [this Stack Overflow discussion](http://stackoverflow.com/questions/4951751/creating-a-new-corpus-with-nltk).

In [None]:
import os
from nltk.corpus.reader.plaintext import PlaintextCorpusReader

# Let's create a corpus with 2 texts, each in its own text file.
txt1 = """This is a foo bar sentence.\nAnd this is the first txtfile in the corpus."""
txt2 = """Are you a foo bar? Yes I am. Possibly, everyone is.\n"""
corpus = [txt1,txt2]

# Make new dir for the corpus.
corpusdir = 'newcorpus/'
if not os.path.isdir(corpusdir):
    os.mkdir(corpusdir)

# Output the files into the directory (1.txt, 2.txt, ...).
for filename, text in enumerate(corpus, 1):
    with open(corpusdir + str(filename) + '.txt', 'w') as fout:
        fout.write(text + '\n')

# Check that our corpus does exist and the files are correct.
assert os.path.isdir(corpusdir)
for infile, text in zip(sorted(os.listdir(corpusdir)), corpus):
    with open(corpusdir + infile, 'r') as fin:
        assert fin.read().strip() == text.strip()

# Create a new corpus by specifying the parameters
# (1) directory of the new corpus
# (2) the fileids of the corpus
# NOTE: in this case the fileids are simply the filenames.
newcorpus = PlaintextCorpusReader('newcorpus/', '.*')

# Access each file in the corpus.
for infile in sorted(newcorpus.fileids()):
    print(infile) # The fileid of each file.
    fin = newcorpus.open(infile) # Opens the file.
    print(fin.read().strip()) # Prints the content of the file.
    fin.close()
print('')

In [None]:
# Access the plaintext; returns the whole corpus as one plain string.
print(newcorpus.raw().strip())

In [None]:
# Access paragraphs in the corpus. (list of list of list of strings)
# NOTE: PlaintextCorpusReader tokenizes for us. By default it splits
#       sentences with the Punkt tokenizer and words with WordPunctTokenizer.
#
# Each element in the outermost list is a paragraph, and
# each paragraph contains sentence(s), and
# each sentence contains token(s).
print(newcorpus.paras())

In [None]:
# To access paragraphs of a specific fileid.
print(newcorpus.paras(newcorpus.fileids()[0]))

In [None]:
# Access sentences in the corpus. (list of list of strings)
# NOTE: The texts are flattened into sentences, each of which contains tokens.
print(newcorpus.sents())

In [None]:
# To access sentences of a specific fileid.
print(newcorpus.sents(newcorpus.fileids()[0]))

In [None]:
# Access just tokens/words in the corpus. (list of strings)
print(newcorpus.words())

In [None]:
# To access tokens of a specific fileid.
print(newcorpus.words(newcorpus.fileids()[0]))

## 3. Collocations in custom corpus

Passing the words from our custom corpus into `BigramCollocationFinder` and ranking them with `BigramAssocMeasures`, we get:

In [None]:
import nltk
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
finder = BigramCollocationFinder.from_words(
    newcorpus.words())
finder.nbest(bigram_measures.pmi, 10)

In [None]:
len(newcorpus.words())

Looks good. Next, we will look into implementing this with our documents.

## 4. Collocations in extracted PDF documents

We will follow this pipeline:

- Use ArthurReader to export documents into plaintexts.
- Get words from them.
- Use words to create collocations.
- Use collocations in word splitting.
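
The last step of the pipeline could look roughly like the sketch below, which greedily joins adjacent tokens that form a known two-word collocation. The `merge_collocations` helper and the example collocation set are assumptions for illustration only; the real splitting logic is still to be designed.

```python
def merge_collocations(tokens, collocations):
    """Greedily join adjacent tokens that form a known two-word collocation."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in collocations:
            # Keep the collocation together as a single token.
            merged.append(' '.join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

collocations = {('new', 'brunswick'), ('tennis', 'courts')}
tokens = ['finest', 'property', 'in', 'new', 'brunswick', 'with', 'tennis', 'courts']
print(merge_collocations(tokens, collocations))
```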

ArthurReader is needed when exporting documents because of how bold text is stored in PDFs: **A Text** comes out as "A TextA TextA TextA Text" if we extract it as is (that is, straight from ArthurDocument). Ideally we should later keep the bold information as a feature, but for now let's just remove the duplicates.
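
The idea behind removing these duplicates can be sketched without any PDF library: treat every character as a point and drop any character that sits within a small radius of one already kept. The `drop_overprinted` helper, the coordinates, and the offsets below are invented for illustration; only the radius of 0.4 is borrowed from the prototype further down.

```python
def drop_overprinted(chars, radius=0.4):
    """Keep the first character at each position; drop later ones within `radius`.

    `chars` is a list of (character, x, y) tuples in extraction order.
    """
    kept = []
    for ch, x, y in chars:
        is_duplicate = any(
            (x - kx) ** 2 + (y - ky) ** 2 <= radius ** 2
            for _, kx, ky in kept
        )
        if not is_duplicate:
            kept.append((ch, x, y))
    return ''.join(ch for ch, _, _ in kept)

# "A Text" drawn four times with tiny offsets, as a bold font might be rendered.
word = "A Text"
chars = [(ch, i * 5.0 + shift, 0.0)
         for shift in (0.0, 0.1, 0.2, 0.3)
         for i, ch in enumerate(word)]
raw = ''.join(ch for ch, _, _ in chars)
print(raw)                      # the raw extraction, with repeats
print(drop_overprinted(chars))  # repeats removed
```

This version compares every pair of points, so it only suits small blocks; a spatial index is the better fit for larger ones.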

### 4.1. Exporting to plaintexts

This section attempts to extract documents and remove duplications.

In [21]:
# Preparing required modules.
import os
os.sys.path.append(os.path.dirname(os.path.abspath('.')))

import sys
import inspect
base_path = os.path.realpath(
    os.path.abspath(
        os.path.join(
            os.path.split(
                inspect.getfile(
                    inspect.currentframe()
                )
            )[0],
            '..'
        )
    )
)

sys.path.append(base_path)

class ListTable(list):
    """ Overridden list class which takes a 2-dimensional list of 
        the form [[1,2,3],[4,5,6]], and renders an HTML Table in 
        IPython Notebook. """
    
    def _repr_html_(self):
        html = ["<table>"]
        for row in self:
            html.append("<tr>")
            
            for col in row:
                html.append("<td>{0}</td>".format(col.encode('utf-8')))
            
            html.append("</tr>")
        html.append("</table>")
        return ''.join(html).decode('utf-8')

In [2]:
from libs.arthur import ArthurDocument
from zipfile import ZipFile
from libs.arthur.errors import BatchReadingError
import numpy as np
import os
from scipy.spatial import cKDTree
from libs.arthur import ArthurReader
from libs.arthur.cluster import DefaultCluster

# These are prototypes for ArthurReader's methods.

def correct_block(block, return_details=False):
    """Corrects block elements.
    
    Args:
        block(np.array): Block to correct
        
        return_details(bool): If True, return list instead of only corrected block. This list
                              contains details needed for debugging:
                              - removed features
                              - added features
                              Defaults to False
    """
    fxid = ArthurDocument.get_feature_id('x')
    fyid = ArthurDocument.get_feature_id('y')
    positions = block[:,[fxid,fyid]]
    tree = cKDTree(positions)

    # Removes duplicate elements that are close together
    radius = 0.4
    neighbors = tree.query_ball_point(positions, radius)
    neighbors = np.unique(neighbors)
    # This returns numpy array like:
    # [[0, 13, 26, 39] [1, 14, 27, 40] [5, 31, 44, 18] [11, 24, 37, 50]
    # [16, 29, 42, 3] [17, 30, 43, 4] [21, 8, 34, 47] [22, 35, 48, 9]
    # [32, 45, 19, 6] [36, 23, 10, 49] [38, 12, 25, 51] [41, 28, 2, 15]
    # [46, 33, 7, 20] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62]
    # [63] [64]]
    #
    # Which we will then remove duplicates e.g. remove index 13, 26, 39, 14, 27, etc.
    removed = []
    for n in neighbors:
        removed.extend(np.sort(n)[1:])
    cblock = np.delete(block, removed, axis=0)
    
    if return_details:
        return (cblock, removed)
    else:
        return cblock
    
def create_corpus(zip_path, corpus_dir, batch_size=100, start_batch=0, stdout=None, recreate=False):
    """Create corpus from zip file. A corpus is basically just a list of text files.
    Args:
        zip_path(str):    Path of zip file to load.
        corpus_dir(str):  Path to corpus dir where the files will be written into.
        batch_size(int):  Size of batch to be processed. If we have one million documents,
                          we'd want to process them in batches. Defaults to 100.
        start_batch(int): When an error happens in batch processing, the reader returns
                          the index of the last batch processed. Pass that index here to
                          resume processing from that batch. Defaults to 0.
        stdout(Object):   Pass sys.stdout to print progress, or pass any object with a `write`
                          method to receive the progress messages.
        recreate(bool):   Whether to regenerate text files that already exist. Defaults to False.
    """
    zipfile = ZipFile(zip_path, 'r')
    namelist = zipfile.namelist()
    jobs_total = len(namelist)
    jobs_left = jobs_total - start_batch*batch_size
    
    def process_batch(zipfile, corpus_dir, batch, total, counter=0):
        # Progress goes to `stdout` when given; otherwise it is dropped.
        log = stdout.write if stdout is not None else (lambda msg: None)
        for docname in batch:
            counter += 1
            filename = os.path.join(corpus_dir, docname + '.txt')
            # Skip documents that were already exported unless `recreate` is set.
            if os.path.isfile(filename) and not recreate:
                log("%s already exists (%i/%i)\n" % (docname, counter, total))
            else:
                content = zipfile.read(docname)
                log("processing %s (%i/%i)\n" % (docname, counter, total))
                document = ArthurDocument(content, name=docname)
                # get_texts takes only the document (no stopwords argument).
                texts = get_texts(document)
                if len(texts) > 0:
                    if not os.path.isdir(corpus_dir):
                        os.mkdir(corpus_dir)

                    with open(filename, 'w') as fout:
                        for text in texts:
                            fout.write(text + '\n')
                else:
                    log("    empty text! moving on...\n")
    
    def get_texts(document):
        blocks = DefaultCluster.create_blocks(document)
        
        texts = []
        for idx, block in enumerate(blocks):
            cblock = ArthurReader.correct_block(block)
            texts.append(document.get_text(cblock))
        return texts

    while jobs_left > 0:
        job_start = jobs_total - jobs_left
        job_end = job_start + batch_size
        batch = namelist[job_start:job_end]
        process_batch(zipfile, corpus_dir, batch, jobs_total, job_start)
        jobs_left -= batch_size

    zipfile.close()


In [13]:
import sys
import nltk
from nltk.corpus.reader.plaintext import PlaintextCorpusReader
from nltk.collocations import *
from libs.arthur.errors import BatchReadingError

zip_path = 'DRAS_sample_v1_20150605.zip'
corpus_dir = 'corpus'
try:
    create_corpus(zip_path, corpus_dir, stdout=sys.stdout)        
except BatchReadingError as e:
    print(e.msg)
    print('last batch was: %i' % e.last_batch)
    # Maybe do some retries here.

V1123680.pdf already exists (1/300)
V1119135.pdf already exists (2/300)
V1102480.pdf already exists (3/300)
V1123741.pdf already exists (4/300)
V1119430.pdf already exists (5/300)
V1122760.pdf already exists (6/300)
V1123301.pdf already exists (7/300)
F1439898.pdf already exists (8/300)
V1116097.pdf already exists (9/300)
V1124383.pdf already exists (10/300)
V1120838.pdf already exists (11/300)
V1117770.pdf already exists (12/300)
V1104999.pdf already exists (13/300)
V1099142.pdf already exists (14/300)
V1123148.pdf already exists (15/300)
V1081878.pdf already exists (16/300)
V1104283.pdf already exists (17/300)
V1114927.pdf already exists (18/300)
V1099185.pdf already exists (19/300)
V1106593.pdf already exists (20/300)
F1440892.pdf already exists (21/300)
V1084554.pdf already exists (22/300)
V1121393.pdf already exists (23/300)
V1114732.pdf already exists (24/300)
348906.pdf already exists (25/300)
V1123093.pdf already exists (26/300)
V1125886.pdf already exists (27/300)
V1106471.pdf

### 4.2. Create Collocations

In [48]:

text = nltk.Text(corpus.words())




AttributeError: 'Text' object has no attribute 'lower'

In [47]:
import string
list(string.punctuation)

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [43]:
text.collocations(100)
len(text._collocations)

British Columbia; http ://; http www; :// www; www realtor; Single
Family; Walk Score; Real Estate; aspx PropertyId; propertyDetails
aspx; Vancouver British; Residential Single; Data provided; displayed
believed; warranties representations; independently verified;
representations made; Show measurements; information displayed; kind
Data; Bathrooms Total; Floor Space; Title Freehold; measurements
Imperial; Real Board; Estate Board; 000 Listing; Land Size; Property
Single; Fax 604; Style Detached; Building House; Car Dependent;
Appliances Included; Property Type; Description Building; Score Walk;
Space sqft; Basement Unknown; Board Greater; Building Type; Type
House; Amenities Nearby; Description Type; Score Car; Fire Protection;
Type Single; Details Nearby; Greater http; Type Family; Basement
Features; West Vancouver; House Built; Basement Type; Details
Amenities; Greater Vancouver; Personal Real; Real Corporation;
Architecture Style; Estate Corporation; Personal Estate; Measurements
av

100

In [4]:
from nltk.corpus import stopwords
corpus = PlaintextCorpusReader(corpus_dir, '.*')
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
bigram_finder = BigramCollocationFinder.from_words(corpus.words())
trigram_finder = TrigramCollocationFinder.from_words(corpus.words())

In [27]:
table1 = ListTable()
table1.append(['word1', 'word2'])
for pmi in bigram_finder.nbest(bigram_measures.pmi, 100):
    table1.append(pmi)
table1

| word1 | word2 |
|---|---|
| +. | Learn |
| /# | tab |
| 0096 | MARK |
| 0702 | Raymar |
| 0928 | Prudential |
| 10171 | Saskatchewan |
| 1020 | Austin |
| 10X10 | 36X18 |
| 1206 | CENTRE |


In [28]:
table2 = ListTable()
table2.append(['word1', 'word2', 'word3'])
for pmi in trigram_finder.nbest(trigram_measures.pmi, 100):
    table2.append(pmi)
table2

| word1 | word2 | word3 |
|---|---|---|
| +. | Learn | Success |
| 0096 | MARK | ROLLINS |
| 0928 | Prudential | Power |
| 1716 | Craig | Doherty |
| 1726 | JAMIE | BLAIR |
| 2X16 | GUEST | ROOM |
| Antifaev | Jon | Moss |
| BRIGHT | VERSATILE | BSMT |
| Berwick | whose | practice |


### 4.3. Use in Splits