# Tutorial: Phrase Extraction with Manatee

This tutorial shows how to extract phrases from corpora using explicit rules (grammar patterns) and how to learn their distributed representations with POcs2Vec.

## Prerequisites

You will need the <a href="https://nlp.fi.muni.cz/trac/noske">Manatee</a> corpus manager with a compiled SUSANNE corpus.  As for background knowledge, you should be comfortable with the basics of <a href="https://www.sketchengine.co.uk/documentation/corpus-querying/">CQL</a>.

## Imports

In [1]:
from itertools import islice

from diphra.lex import Lexicon
from diphra.pocs import POcs, VerticalPOcs, HDF5POcs
from diphra.models import pocs2vec
from diphra.extractors import ManateeExtractor
from diphra.extractors.manatee_extractor import match_cql

## Simple Pattern Matching

In [6]:
matches = match_cql(
    corpus='susanne',
    struct='p',
    lex_attr='lemma',
    cql_attr='tag',
    query='1:"VV.*" 2:"RP" "AT.*"? 3:"N.*"',
    template='to {1.lemma} {2.word} {3.word} | {3.tag}'
)

In [7]:
for struct_id, label2pos, label2lex_id, text in islice(matches, 10):
    print(text)

to take over bank | NNJ1c
to cover up places | NN2
to carry out obligations | NN2
to take up matter | NN1n
to take over Johnston | NP1s
to point out state | NNL1n
to bring on state | NNL1n
to report out Tuesday | NPD1
to toss in towel | NN1c
to carry on battle | NN1n


As we can see, `match_cql` cares only about labeled tokens: there is currently no way to get an unlabeled token into the template (such a feature is planned, as it would have nice use cases).

Each quadruple yielded by `match_cql` consists of:

 - a sentence ID (numbered from 0)
 - a dict mapping labels to in-sentence positions (numbered from 0)
 - a dict mapping labels to lemma IDs
 - a filled in template
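
To make the template syntax concrete, here is a minimal sketch of how a template such as `to {1.lemma} {2.word} {3.word} | {3.tag}` could be filled from labeled tokens. It is not the code used by `match_cql`; the helper `fill_template` and the token attributes below are hypothetical and serve only as an illustration.

```python
import re

# Matches placeholders of the form {label.attribute}, e.g. {1.lemma}.
PLACEHOLDER = re.compile(r'\{(\d+)\.(\w+)\}')


def fill_template(template, labeled_tokens):
    """Replace every {label.attr} with that attribute of the labeled token."""
    def substitute(match):
        label, attr = int(match.group(1)), match.group(2)
        return labeled_tokens[label][attr]
    return PLACEHOLDER.sub(substitute, template)


# Hypothetical labeled tokens for one match (attribute values are made up).
tokens = {
    1: {'word': 'took', 'lemma': 'take', 'tag': 'VVDv'},
    2: {'word': 'over', 'lemma': 'over', 'tag': 'RP'},
    3: {'word': 'bank', 'lemma': 'bank', 'tag': 'NNJ1c'},
}

filled = fill_template('to {1.lemma} {2.word} {3.word} | {3.tag}', tokens)
```

Only labeled tokens are available to the template, which mirrors the limitation described above.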

## ManateeExtractor *class*

ManateeExtractor manages the process of extracting many different kinds of phrases from huge corpora.  To achieve scalability (currently mainly memory-friendliness), it relies on techniques like *external sort* and *bounded counting*.  In the end, you get a large dataset of phrase occurrences (pocs) that you can put into POcs2Vec for machine interpretation.

The motivation behind this extractor was: *knowledge of CQL should be enough to experiment with distributional semantics of phrases*.

### Phrase Types

<img src="dep-tree.png" style="float: right;margin-left: 20px;" />

A phrase type is **a tree of word categories**.  It is crudely based on the concept of *dependency syntactic trees*.  Its main purpose is to augment phrases with some structure, so that we can decompose phrases to heads and collocates.  Root is the head, subtrees are collocates.

The phrase type depicted on the right covers phrases like TO&nbsp;TAKE&nbsp;THE&nbsp;BULL&nbsp;BY&nbsp;THE&nbsp;HORNS or TO&nbsp;TAKE&nbsp;THE&nbsp;WORLD&nbsp;BY&nbsp;STORM (determiners can be ignored in phrase types).

Each phrase in ManateeExtractor is eventually identified by:

 - a phrase type tree
 - an assignment of lexical IDs (usually lemmas) to nodes of the tree

An important implication is that the textual representation can be ambiguous, e.g. DUCK with phrase type `('noun', )` is a different phrase from DUCK with phrase type `('verb', )`.
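
The identification scheme can be sketched in a few lines. This is only an illustration of the idea, not the data structure used by ManateeExtractor; the `Phrase` tuple and the lemma ID are hypothetical.

```python
from collections import namedtuple

# A phrase is identified by its phrase type tree plus the lexical IDs
# assigned to the nodes of that tree; the text alone is not a unique key.
Phrase = namedtuple('Phrase', ['phrase_type', 'lex_ids'])

DUCK_LEMMA_ID = 4242  # hypothetical lemma ID for "duck"

duck_noun = Phrase(phrase_type=('noun',), lex_ids=(DUCK_LEMMA_ID,))
duck_verb = Phrase(phrase_type=('verb',), lex_ids=(DUCK_LEMMA_ID,))

# Both phrases share the same lemma, yet they are distinct dictionary keys,
# so their frequencies are counted separately.
frequencies = {duck_noun: 0, duck_verb: 0}
frequencies[duck_noun] += 1
```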

### Workspace

ManateeExtractor uses disk space for many intermediate results.  Firstly, you need to make a workspace &mdash; a directory with a configuration file.

In [None]:
ManateeExtractor.make(
    directory="./extraction-directory", corpus="susanne",
    struct="p", lex_attr="lemma", cql_attr="tag"
)

Initialize an extractor just by providing a workspace directory:

In [2]:
extractor = ManateeExtractor('./extraction-directory')

### Phrase Grammar

You have to decide which phrase types you want and write CQL queries for each of them.  ManateeExtractor will handle everything else.  You only give it a dictionary object `grammar`, whose keys are phrase type trees in *Polish Notation* (technically nested tuples of strings).

Each item in the `grammar` must contain a list of `patterns`.  Each pattern consists of a `query` and a `template`, which are processed by the `match_cql` function introduced above.  However, we also need to connect the labeled tokens with the nodes of the phrase type tree.  The convention is simple: a&nbsp;token labeled `i` corresponds to the `i`-th node of the phrase type tree in pre-order traversal.
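
The labeling convention can be illustrated with a short sketch. It assumes that reading the strings of the nested tuple from left to right yields the pre-order sequence of nodes (which is what Polish Notation implies); the helper names are hypothetical and not part of diphra.

```python
def preorder_nodes(phrase_type):
    """Flatten a nested-tuple phrase type into its node categories."""
    nodes = []
    for item in phrase_type:
        if isinstance(item, tuple):
            nodes.extend(preorder_nodes(item))  # descend into a subtree
        else:
            nodes.append(item)                  # a single word category
    return nodes


def label_to_category(phrase_type):
    """Map token labels 1..n to the categories of the tree nodes."""
    return {label: category
            for label, category in enumerate(preorder_nodes(phrase_type), start=1)}


mapping = label_to_category((('verb', 'adv'), 'noun'))
```

For the phrase type `(('verb', 'adv'), 'noun')`, label `1` thus stands for the verb, `2` for the adverb and `3` for the noun.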

Also, you can have additional labeled tokens and use them in templates only (like determiners) &mdash; ManateeExtractor will automatically select the most common textual form for each phrase.
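
Selecting the most common textual form amounts to counting the filled-in templates per phrase. The following is a rough sketch of that idea with made-up occurrences; it does not reproduce how ManateeExtractor does it internally.

```python
from collections import Counter, defaultdict

# Hypothetical (phrase key, filled-in template) pairs, as they might be
# produced by several patterns of the same phrase type.
occurrences = [
    ('toss+in+towel', 'to toss in the towel'),
    ('toss+in+towel', 'to toss in towel'),
    ('toss+in+towel', 'to toss in the towel'),
    ('toss+in+towel', 'to toss the towel in'),
]

# Count how often each textual form occurs for each phrase.
forms = defaultdict(Counter)
for key, text in occurrences:
    forms[key][text] += 1

# Keep the most frequent form as the phrase's textual representation.
canonical = {key: counter.most_common(1)[0][0] for key, counter in forms.items()}
```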

In [6]:
grammar = {
    (('verb', 'adv'), 'noun'): dict(
        # The best way to start is to list several
        # concrete examples of what we eventually want
        # (diversity encouraged).
        examples=[
            'to pick up speed',
            'to get off the ground', 
            'to take time out',
            'to keep an eye out'
        ],
        # Now craft some patterns that will cover
        # the defined examples.
        patterns=[
            dict(  # verb adv noun
                query='1:"VV.*" 2:"RP" 3:"N.*"',
                template='to {1.lemma} {2.word} {3.word}'
            ),
            dict(  # verb adv det noun
                query='1:"VV.*" 2:"RP" 4:"AT.*" 3:"N.*"',
                template='to {1.lemma} {2.word} {4.word} {3.word}'
            ),
            dict(  # verb noun adv
                query='1:"VV.*" 3:"N.*" 2:"RP"',
                template='to {1.lemma} {3.word} {2.word}'
            ),
            dict(  # verb det noun adv
                query='1:"VV.*" 4:"AT.*" 3:"N.*" 2:"RP"',
                template='to {1.lemma} {4.word} {3.word} {2.word}'
            ),
        ]
    )
    # ...
    # ... analogously add more phrase types
}

Words are incorporated into the grammar in exactly the same way as phrases.  After all, a word is just a phrase of length 1.

In [9]:
grammar.update({
    ('det', ): dict(patterns=[dict(query='1:"AT.*"', template='{1.lemma}')]),
    ('noun', ): dict(patterns=[dict(query='1:"N.*"', template='{1.lemma}')]),
    ('adj', ): dict(patterns=[dict(query='1:"J.*"', template='{1.lemma}')]),
    ('prep', ): dict(patterns=[dict(query='1:"I.*"', template='{1.lemma}')]),
    ('verb', ): dict(patterns=[dict(query='1:"V.*"', template='to {1.lemma}')]),
    # ...
    # ... add all other word categories you wish to incorporate (preferably all)
})

### Grammar Check

Phrase grammars can grow quite large, and the extraction may take tens of hours for >1B corpora.  Before we run anything, we want to be sure that the grammar does what we intend, and if there is a problem, we want to know exactly where it occurs.  So first, we pass the grammar to `ManateeExtractor.check_grammar`, which tries to execute all the included CQL queries and prints the specified number of outputs for each of them.

What one fool can mess up, one call to `check_grammar` should detect.
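
Some mistakes can be caught even before any query is sent to Manatee. The sketch below is a purely static check written for this tutorial, not a diphra function: the hypothetical `lint_grammar` verifies that every tree node has a labeled token in the query and that templates only use labels defined in the query. The regular expressions assume the simple `label:"regex"` query style used above.

```python
import re

# Labels appear as `1:` in queries and as `{1.attr}` in templates.
QUERY_LABEL = re.compile(r'(?<!\S)(\d+):')
TEMPLATE_LABEL = re.compile(r'\{(\d+)\.\w+\}')


def count_nodes(phrase_type):
    """Count the nodes of a phrase type given as nested tuples of strings."""
    return sum(count_nodes(item) if isinstance(item, tuple) else 1
               for item in phrase_type)


def lint_grammar(grammar):
    """Return a list of human-readable problems found in the grammar."""
    problems = []
    for phrase_type, spec in grammar.items():
        node_labels = set(range(1, count_nodes(phrase_type) + 1))
        for number, pattern in enumerate(spec['patterns'], start=1):
            in_query = {int(x) for x in QUERY_LABEL.findall(pattern['query'])}
            in_template = {int(x) for x in TEMPLATE_LABEL.findall(pattern['template'])}
            where = '%r / #%d' % (phrase_type, number)
            for label in sorted(node_labels - in_query):
                problems.append('%s: tree node %d is not labeled in the query'
                                % (where, label))
            for label in sorted(in_template - in_query):
                problems.append('%s: template uses label %d missing from the query'
                                % (where, label))
    return problems
```

Calling `lint_grammar(grammar)` on a well-formed grammar should return an empty list. Such a check complements `check_grammar` rather than replacing it, since it cannot tell whether a query actually matches the intended phrases.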

In [10]:
extractor.check_grammar(grammar, nb_outputs=6)

('adj',) / #1

	grand
	recent
	primary
	executive
	overall
	superior

('prep',) / #1

	of
	in
	of
	of
	of
	for

('noun',) / #1

	Fulton
	county
	jury
	Friday
	investigation
	Atlanta

('det',) / #1

	the
	an
	no
	the
	the
	the

(('verb', 'adv'), 'noun') / #1

	to take over bank
	to cover up places
	to carry out obligations
	to bring on state
	to report out Tuesday
	to cut down expenses

(('verb', 'adv'), 'noun') / #2

	to take up the matter
	to take over the Johnston
	to point out the state
	to toss in the towel
	to carry on the battle
	to draw up a plan

(('verb', 'adv'), 'noun') / #3

	to take petitions out
	to move Cooke across
	to give time off
	to take time out
	to get Miller out
	to put coffee on

(('verb', 'adv'), 'noun') / #4

	to pass the bill on
	to keep the people in
	to keep the pressure on
	to get a day off
	to get the secrets off
	to shoot the bastards down

('verb',) / #1

	to say
	to produce
	to take
	to say
	to have
	to deserve



### Phrase Extraction

In [11]:
extractor.extract_phrases(grammar)

Number of phrase types: 6
There are colliding phrase types (3):
	- ('noun',)
	- ('verb',)
	- (('verb', 'adv'), 'noun')
Colliding phrase types will be overwritten.

('adj',) / #1
[[ Progress: 100%, Occurrences: 9,311 ]]

('prep',) / #1
[[ Progress: 100%, Occurrences: 16,488 ]]]

('noun',) / #1
[[ Progress: 100%, Occurrences: 35,415 ]]]

('det',) / #1
[[ Progress: 100%, Occurrences: 13,319 ]]]

(('verb', 'adv'), 'noun') / #1
[[ Progress: 100%, Occurrences: 27 ]]

(('verb', 'adv'), 'noun') / #2
[[ Progress: 100%, Occurrences: 52 ]]

(('verb', 'adv'), 'noun') / #3
[[ Progress: 100%, Occurrences: 9 ]]

(('verb', 'adv'), 'noun') / #4
[[ Progress: 100%, Occurrences: 18 ]]

('verb',) / #1
[[ Progress: 100%, Occurrences: 23,523 ]]]

Phrase extraction succesfully finished!


Each phrase type is processed independently.  The call above is equivalent to the following:

In [12]:
# for k, v in grammar.items():
#     extractor.extract_phrases({k: v})

### Build Lexicon

At this point, we have a separate (and potentially huge) vocabulary for each phrase type.  What needs to be done is to **unify** them into a single lexicon and **restrict** its size, e.g. throw away phrases with frequency&nbsp;&lt;&nbsp;60 and/or keep only the 1&nbsp;million most frequent items.
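
The unify-and-restrict idea can be sketched as follows. This is a simplified illustration with made-up frequencies, not the `build_lexicon` implementation: the function name and the input layout are assumptions, and any further processing the real method performs is left out.

```python
from collections import Counter


def unify_and_restrict(vocabularies, min_count, max_items):
    """Merge per-type vocabularies and keep only the most frequent phrases.

    `vocabularies` maps a phrase type to a dict {lex_ids: frequency}.
    Returns a list of ((phrase_type, lex_ids), frequency) pairs whose
    list index can serve as the phrase ID.
    """
    merged = Counter()
    for phrase_type, vocabulary in vocabularies.items():
        for lex_ids, frequency in vocabulary.items():
            merged[(phrase_type, lex_ids)] += frequency

    # Drop rare phrases, then keep at most `max_items` of the rest.
    frequent = [(key, freq) for key, freq in merged.items() if freq >= min_count]
    frequent.sort(key=lambda item: -item[1])
    return frequent[:max_items]


# Hypothetical vocabularies for two phrase types.
vocabularies = {
    ('noun',): {(1,): 100, (2,): 3},
    ('verb',): {(1,): 70, (5,): 60},
}
lexicon_items = unify_and_restrict(vocabularies, min_count=60, max_items=2)
```

Note that the same lexical ID under two different phrase types yields two separate lexicon entries.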

Now... SUSANNE is a tiny, tiny corpus.  The configuration here is therefore a bit funny.

In [13]:
extractor.build_lexicon(min_count=1, max_items=int(3e5))

Collecting phrases from raw vocabularies ...
	Phrase Type: ('prep',) 
	Phrase Type: ('noun',) 
	Phrase Type: (('verb', 'adv'), 'noun') 
	Phrase Type: ('det',) 
	Phrase Type: ('verb',) 
	Phrase Type: ('adj',) 

Detecting vassals (coeff=0.75) ...
[[ Progress: 100%, Phrases: 10,644, Vassals: 7 ]]]

Persisting lexicon to disk ...


In [14]:
%%bash
head extraction-directory/lexicon.vert

0	3	0	the	9616
1	4	41	to be	4875
2	0	9	of	4668
3	3	63	a	2951
4	0	24	in	2723
5	4	32	to have	1429
6	0	51	to	1358
7	0	39	for	1067
8	1	11	-	996
9	0	179	with	895


In [15]:
%%bash
grep -P "\t[0-9]+,[0-9]+,[0-9]+\t" extraction-directory/lexicon.vert |head -5

8812	2	1582,702,3291	to move the flights over	1
8813	2	1713,702,1805	to look the setup over	1
8814	2	1812,702,182	to hand the money over	1
8815	2	21,669,9047	to take a carbine down	1
8816	2	2438,669,8948	to ease the Winchester down	1


Columns in `lexicon.vert` have the following interpretation:

 - phrase ID
 - phrase type ID
 - comma-separated lex. attribute IDs (for phrase type tree nodes in pre-order)
 - textual representation
 - frequency

The list of phrase types is pickled in `phrase_types.pickle` file.
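
If you want to process `lexicon.vert` yourself, the column layout above translates into a small parser. This is a minimal sketch assuming tab-separated columns (as the `grep` pattern above suggests); `LexiconEntry` and `parse_lexicon_line` are hypothetical names, not part of diphra.

```python
from collections import namedtuple

LexiconEntry = namedtuple(
    'LexiconEntry',
    ['phrase_id', 'phrase_type_id', 'lex_ids', 'text', 'frequency'])


def parse_lexicon_line(line):
    """Parse one tab-separated line of lexicon.vert into a LexiconEntry."""
    phrase_id, type_id, lex_ids, text, frequency = line.rstrip('\n').split('\t')
    return LexiconEntry(
        phrase_id=int(phrase_id),
        phrase_type_id=int(type_id),
        # One lexical ID per node of the phrase type tree, in pre-order.
        lex_ids=tuple(int(i) for i in lex_ids.split(',')),
        text=text,
        frequency=int(frequency),
    )


entry = parse_lexicon_line('8812\t2\t1582,702,3291\tto move the flights over\t1\n')
```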

### Iterate over Phrase Occurrences

Now that we have a lexicon of bounded size, we can iterate over **unsorted** phrase occurrences:

In [16]:
for phrase, sentence_i, positions in islice(extractor.iter_pocs(), 6):
    print('%s %s %s' % (phrase, sentence_i, positions))

of 0 (9,)
in 1 (4,)
of 1 (19,)
of 1 (29,)
of 1 (32,)
for 1 (35,)


... or, better, with textual representations replaced by phrase IDs:

In [17]:
for phrase, sentence_i, positions in islice(extractor.iter_pocs(use_ids=True), 6):
    print('%s %s %s' % (phrase, sentence_i, positions))

2 0 (9,)
4 1 (4,)
2 1 (19,)
2 1 (29,)
2 1 (32,)
7 1 (35,)


### Sort Phrase Occurrences

We still need to sort the pocs.  To do that, we first persist them to disk:

In [18]:
POcs(func=lambda: extractor.iter_pocs(use_ids=True)).save_to_vertical("my_pocs.unsorted.vert.gz")

In [19]:
%%bash
zcat my_pocs.unsorted.vert.gz |head

2	0	9
4	1	4
2	1	19
2	1	29
2	1	32
7	1	35
4	1	38
12	2	9
2	2	19
4	2	24



gzip: stdout: Broken pipe


... and apply an external sort that is happy with compressed files:

In [20]:
VerticalPOcs.pocs_sort(
    inp_fn="my_pocs.unsorted.vert.gz",
    out_fn="my_pocs.sorted.vert.gz",
    line_chunk_size=int(2e4)
)

Sorting lines 0 - 20,000
Sorting lines 20,000 - 40,000
Sorting lines 40,000 - 60,000
Sorting lines 60,000 - 80,000
Sorting lines 80,000 - 100,000
Number of chunks: 5
Merging chunks 0 - 5


Set `line_chunk_size` according to your RAM.  Here it is only 20k, solely to demonstrate the chunking mechanism on small data.  10&ndash;100 million lines is more realistic for >1B corpora.
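
For readers unfamiliar with external sorting, here is a minimal sketch of the chunk-then-merge idea. It is not the `VerticalPOcs.pocs_sort` implementation: it works on plain uncompressed lines, and the sort key (sentence ID, then positions, with multiple positions assumed to be comma-separated) is an assumption based on the sorted output shown in this tutorial.

```python
import heapq
import itertools
import os
import tempfile


def sort_key(line):
    """Order pocs by sentence ID, then by positions; the phrase ID is ignored."""
    _phrase_id, sentence_i, positions = line.rstrip('\n').split('\t')
    return int(sentence_i), tuple(int(p) for p in positions.split(','))


def external_sort(lines, chunk_size):
    """Yield `lines` in sorted order, holding at most `chunk_size` lines in RAM."""
    chunk_paths = []
    lines = iter(lines)
    try:
        # Phase 1: sort fixed-size chunks in memory and spill them to disk.
        while True:
            chunk = sorted(itertools.islice(lines, chunk_size), key=sort_key)
            if not chunk:
                break
            with tempfile.NamedTemporaryFile('w', suffix='.chunk', delete=False) as f:
                f.writelines(chunk)
            chunk_paths.append(f.name)
        # Phase 2: lazily merge the sorted chunk files.
        files = [open(path) for path in chunk_paths]
        try:
            for line in heapq.merge(*files, key=sort_key):
                yield line
        finally:
            for f in files:
                f.close()
    finally:
        for path in chunk_paths:
            os.remove(path)
```

The chunk size bounds the memory used in the first phase, while the merge phase only needs one line per chunk in memory at a time.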

In [21]:
%%bash
zcat my_pocs.sorted.vert.gz |head

0	0	0
929	0	1
137	0	2
1298	0	3
365	0	4
18	0	5
970	0	6
15	0	7
900	0	8
2	0	9



gzip: stdout: Broken pipe


### Compile POcs

This step is optional, though highly recommended -- it significantly speeds up POcs2Vec.

In [23]:
HDF5POcs.build(
    pocs=VerticalPOcs("my_pocs.sorted.vert.gz"),
    output_name="my_pocs.sorted.hdf5",
    max_len=3,  # max. number of nodes in a phrase type tree in your grammar
    use_ids=True
)

## POcs2Vec &mdash; Distributional Semantics of Phrases

In [34]:
my_pocs = HDF5POcs("my_pocs.sorted.hdf5")

pocs2vec(pocs=my_pocs, output_name="phrase_model",
         dim=32, nb_epochs=128, window=6, sample=int(1e3))

[[ PROGRESS: 93.05% | 1,057,834 pocs/s | 2,353,660 tc/s | 155,883 downsampled/s ]]
worker thread finished; awaiting finish of 3 more threads
worker thread finished; awaiting finish of 2 more threads
worker thread finished; awaiting finish of 1 more threads
worker thread finished; awaiting finish of 0 more threads


## Lexicon

In [35]:
lexicon = Lexicon.load("extraction-directory/")

In [36]:
lexicon.load_vecs("phrase_model.target.npy")

In [38]:
for p in lexicon.most_similar(lexicon["to have"][0]):
    print(p.text)

to awaken
repetition
to mess
gregarious
secret
doubt
encouragement
visit
uncertain
stink
to cheat
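
A nearest-neighbour query over phrase vectors is commonly based on cosine similarity. The sketch below shows that idea in plain Python with made-up two-dimensional vectors; whether `Lexicon.most_similar` uses cosine similarity is an assumption, and the function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, topn=3):
    """Return the `topn` phrases whose vectors are closest to the query's."""
    scored = [(cosine(vectors[query], vector), text)
              for text, vector in vectors.items() if text != query]
    return [text for _score, text in sorted(scored, reverse=True)[:topn]]


# Hypothetical toy vectors; real models use many more dimensions.
vectors = {
    'to have': [1.0, 0.0],
    'to own': [0.9, 0.1],
    'to hold': [0.7, 0.7],
    'the': [0.0, 1.0],
}
neighbours = most_similar('to have', vectors, topn=2)
```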
