In this assignment, you will perform your first text mining analysis with the MeTA toolkit.

In [2]:
!pip install --upgrade pip

Collecting pip
  Downloading https://files.pythonhosted.org/packages/d7/41/34dd96bd33958e52cb4da2f1bf0818e396514fd4f4725a79199564cd0c20/pip-19.0.2-py2.py3-none-any.whl (1.4MB)
Installing collected packages: pip
  Found existing installation: pip 18.0
    Uninstalling pip-18.0:
      Successfully uninstalled pip-18.0
Successfully installed pip-19.0.2


In [7]:
!pip install metapy pytoml

Collecting metapy
  Downloading https://files.pythonhosted.org/packages/59/cd/e5299611320b6ea281911cf871ccd91f04e2f71f2a434c7e6c5b5d7443bf/metapy-0.2.13-cp36-cp36m-macosx_10_6_intel.whl (13.5MB)
Collecting pytoml
  Downloading https://files.pythonhosted.org/packages/35/35/da1123673c54b6d701453fcd20f751d6a1fae43339b3993ae458875576e4/pytoml-0.1.20.tar.gz
Building wheels for collected packages: pytoml
  Building wheel for pytoml (setup.py) ... done
  Stored in directory: /Users/zhoukewei/Library/Caches/pip/wheels/d7/88/22/c6bd63ab856af808b7f7c2442fdd9eb8846027c35e37f9d9ee
Successfully built pytoml
Installing collected packages: metapy, pytoml
Successfully installed metapy-0.2.13 pytoml-0.1.20


In [9]:
import metapy
metapy.log_to_stderr()

In [10]:
doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")

## Tokenization
MeTA provides a stream-based interface for performing document tokenization.
Each stream starts with a Tokenizer object, and in most cases you should use the ICUTokenizer, which follows the Unicode standard.


In [11]:
tok = metapy.analyzers.ICUTokenizer()

Tokenizers operate on raw text and provide an Iterable that spits out the individual text tokens.
Let's try running just the ICUTokenizer to see what it does.

In [15]:
tok.set_content(doc.content()) # this could be any string
tokens = [token for token in tok]
print(tokens)

['<s>', 'I', 'said', 'that', 'I', "can't", 'believe', 'that', 'it', 'only', 'costs', '$', '19.95', '!', '</s>']



One thing you will likely notice right away is the pair of pseudo-XML tags inserted into the token list.
These are called “sentence boundary tags”.
As a side effect of tokenizing, a default-constructed ICUTokenizer detects the sentences in a document and delimits them with these tags.
Let's try tokenizing a multi-sentence document to see what that looks like.


In [16]:
doc.content("I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.")
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

['<s>', 'I', 'said', 'that', 'I', "can't", 'believe', 'that', 'it', 'only', 'costs', '$', '19.95', '!', '</s>', '<s>', 'I', 'could', 'only', 'find', 'it', 'for', 'more', 'than', '$', '30', 'before', '.', '</s>']


Most of the information retrieval techniques you have likely been learning about in this class don't need to concern themselves with finding the boundaries between separate sentences in a document, but later today we'll explore a scenario where this might matter more.
Let's pass a flag to the ICUTokenizer constructor to disable sentence boundary tags for now.

In [17]:
tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

['I', 'said', 'that', 'I', "can't", 'believe', 'that', 'it', 'only', 'costs', '$', '19.95', '!', 'I', 'could', 'only', 'find', 'it', 'for', 'more', 'than', '$', '30', 'before', '.']



As mentioned earlier, MeTA treats tokenization as a streaming process that starts with a tokenizer.
It is often beneficial to modify the raw underlying tokens of a document, and thus change its representation.
The “intermediate” steps in the tokenization stream are represented with objects called Filters.
Each filter consumes the content of a previous filter (or a tokenizer) and modifies the tokens coming out of the stream in some way.
Let's start by using a simple filter that can help eliminate a lot of noise that we might encounter when tokenizing web documents: a LengthFilter.


In [18]:
tok = metapy.analyzers.LengthFilter(tok, min=2, max=30)
tok.set_content(doc.content())

tokens = [token for token in tok]
print(tokens)

['said', 'that', "can't", 'believe', 'that', 'it', 'only', 'costs', '19.95', 'could', 'only', 'find', 'it', 'for', 'more', 'than', '30', 'before']


Here, we can see that the LengthFilter is consuming our original ICUTokenizer.
It modifies the token stream by emitting only tokens whose length is at least 2 and at most 30.
This can get rid of a lot of punctuation tokens, but also excessively long tokens such as URLs.
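
To make the streaming idea concrete, here is a minimal pure-Python sketch of a length filter written as a generator. It only illustrates the behavior described above and is not MeTA's implementation; the name `length_filter` and its parameters are made up for this example.

```python
def length_filter(tokens, min_len=2, max_len=30):
    """Yield only the tokens whose length lies in [min_len, max_len]."""
    for token in tokens:
        if min_len <= len(token) <= max_len:
            yield token

# Tokens of the first sentence, as produced earlier with suppress_tags=True.
raw_tokens = ['I', 'said', 'that', 'I', "can't", 'believe', 'that', 'it',
              'only', 'costs', '$', '19.95', '!']

# The filter pulls tokens from the upstream iterator one at a time.
kept = list(length_filter(iter(raw_tokens)))
print(kept)
```

Because the filter is a generator, it never needs the whole document in memory, which is the point of treating tokenization as a stream.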

In [35]:
import wget
wget.download('https://raw.githubusercontent.com/meta-toolkit/meta/master/data/lemur-stopwords.txt')

'lemur-stopwords.txt'

In [36]:
tok = metapy.analyzers.ListFilter(tok, "lemur-stopwords.txt", metapy.analyzers.ListFilter.Type.Reject)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

["can't", 'believe', 'costs', '19.95', 'find', '30']


Here we've downloaded a common list of stopwords and created a ListFilter to reject any tokens that occur in that list of words.
You can see how much of a difference removing stopwords can make on the size of a document's token stream!
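
A reject-style list filter can be sketched the same way in plain Python. This is an illustration under assumptions, not MeTA's code: `reject_filter` is a hypothetical name, and the tiny stop list below is hand-picked to cover the words dropped in the output above rather than taken from the Lemur file.

```python
def reject_filter(tokens, rejected):
    """Yield every token that is NOT in the rejected set."""
    for token in tokens:
        if token not in rejected:
            yield token

# Hypothetical mini stop list; the real lemur-stopwords.txt is much longer.
stopwords = {'said', 'that', 'it', 'only', 'could', 'for', 'more', 'than', 'before'}

# Output of the length filter from the previous step.
tokens = ['said', 'that', "can't", 'believe', 'that', 'it', 'only', 'costs',
          '19.95', 'could', 'only', 'find', 'it', 'for', 'more', 'than', '30',
          'before']

remaining = list(reject_filter(tokens, stopwords))
print(remaining)
print(len(tokens), '->', len(remaining))
```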

Another common kind of filter is a stemmer (a lemmatizer plays a similar role).
This kind of filter tries to modify individual tokens in such a way that different inflected forms of a word all reduce to the same representation.
This lets you, for example, find documents containing “run” when you search for “running” or “runs”.
A common stemmer is the Porter2 stemmer, which MeTA implements.


In [40]:
tok = metapy.analyzers.Porter2Filter(tok)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

["can't", 'believ', 'cost', '19.95', 'find', '30']
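
To see why stemming helps matching, here is a deliberately tiny suffix stripper. It is a toy written for this illustration and is NOT the Porter2 algorithm, which has many more rules (for example, Porter2 maps “believe” to “believ”, while this toy leaves it unchanged). The function name `toy_stem` is hypothetical.

```python
def toy_stem(word):
    """Strip '-ing' or a plural '-s'. A toy illustration, not Porter2."""
    if word.endswith('ing') and len(word) > 5:
        stem = word[:-3]
        # Undo consonant doubling left behind by '-ing' ("runn" -> "run").
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in 'aeiou':
            stem = stem[:-1]
        return stem
    if word.endswith('s') and not word.endswith('ss') and len(word) > 3:
        return word[:-1]
    return word

# All three inflected forms collapse to the same representation.
stems = [toy_stem(w) for w in ['run', 'runs', 'running']]
print(stems)
```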


## N-grams

Finally, after you've got the token stream configured the way you'd like, it's time to analyze the document by consuming each token from its token stream and performing some actions based on these tokens.
In the simplest case, our action can simply be counting how many times these tokens occur.
For clarity, let's switch back to a simpler token stream first.
We will write a token stream that tokenizes with ICUTokenizer, and then lowercases each token.


In [41]:
tok = metapy.analyzers.ICUTokenizer(suppress_tags=True)
tok = metapy.analyzers.LowercaseFilter(tok)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

['i', 'said', 'that', 'i', "can't", 'believe', 'that', 'it', 'only', 'costs', '$', '19.95', '!', 'i', 'could', 'only', 'find', 'it', 'for', 'more', 'than', '$', '30', 'before', '.']



Now, let's count how often each individual token appears in the stream.
This representation is called the “bag of words” representation, or “unigram word counts”.
In MeTA, classes that consume a token stream and emit a document representation are called Analyzers.


In [42]:
ana = metapy.analyzers.NGramWordAnalyzer(1, tok)
print(doc.content())
unigrams = ana.analyze(doc)
print(unigrams)

I said that I can't believe that it only costs $19.95! I could only find it for more than $30 before.
{'30': 1, 'more': 1, 'it': 2, 'before': 1, '.': 1, 'said': 1, 'only': 2, 'that': 2, 'find': 1, 'i': 3, '!': 1, '$': 2, 'for': 1, 'could': 1, "can't": 1, '19.95': 1, 'costs': 1, 'than': 1, 'believe': 1}
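
For comparison, the same unigram counts can be reproduced from the lowercased token list with Python's standard `collections.Counter`. This is just a sketch of what “counting tokens” means, not how MeTA's analyzer is implemented.

```python
from collections import Counter

# Lowercased tokens printed in the previous step.
tokens = ['i', 'said', 'that', 'i', "can't", 'believe', 'that', 'it', 'only',
          'costs', '$', '19.95', '!', 'i', 'could', 'only', 'find', 'it',
          'for', 'more', 'than', '$', '30', 'before', '.']

# A bag of words maps each distinct token to its number of occurrences.
unigram_counts = Counter(tokens)
print(unigram_counts['i'], unigram_counts['that'], unigram_counts['believe'])
print(len(unigram_counts), 'distinct tokens out of', len(tokens))
```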


If you noticed the name of the analyzer, you might have realized that you can count not just individual tokens, but groups of them.
“Unigram” means “1-gram”, and we count individual tokens. “Bigram” means “2-gram”, and we count adjacent tokens together as a group.
Let's try that now.

In [43]:
ana = metapy.analyzers.NGramWordAnalyzer(2, tok)
bigrams = ana.analyze(doc)
print(bigrams)

{('$', '30'): 1, ('it', 'only'): 1, ('than', '$'): 1, ("can't", 'believe'): 1, ('before', '.'): 1, ('i', 'could'): 1, ('that', 'it'): 1, ('it', 'for'): 1, ('believe', 'that'): 1, ('i', "can't"): 1, ('said', 'that'): 1, ('19.95', '!'): 1, ('could', 'only'): 1, ('i', 'said'): 1, ('more', 'than'): 1, ('$', '19.95'): 1, ('for', 'more'): 1, ('costs', '$'): 1, ('30', 'before'): 1, ('!', 'i'): 1, ('find', 'it'): 1, ('only', 'find'): 1, ('that', 'i'): 1, ('only', 'costs'): 1}


Now the individual “tokens” we're counting are pairs of tokens.
Sometimes looking at n-grams of characters is useful.

In [44]:
tok = metapy.analyzers.CharacterTokenizer()
ana = metapy.analyzers.NGramWordAnalyzer(4, tok)
fourchar_ngrams = ana.analyze(doc)
print(fourchar_ngrams)

{('i', 't', ' ', 'f'): 1, ('1', '9', '.', '9'): 1, ('l', 'y', ' ', 'c'): 1, (' ', 'b', 'e', 'f'): 1, ('i', 'n', 'd', ' '): 1, ('o', 'u', 'l', 'd'): 1, ('t', ' ', 'i', 't'): 1, ('f', 'o', 'r', 'e'): 1, ('y', ' ', 'c', 'o'): 1, ('$', '3', '0', ' '): 1, ('s', 'a', 'i', 'd'): 1, ('y', ' ', 'f', 'i'): 1, (' ', 'b', 'e', 'l'): 1, ('5', '!', ' ', 'I'): 1, ('e', ' ', 't', 'h'): 2, ('l', 'i', 'e', 'v'): 1, ('u', 'l', 'd', ' '): 1, ('n', 'l', 'y', ' '): 2, ('a', 'i', 'd', ' '): 1, (' ', 'I', ' ', 'c'): 2, ('a', 'n', "'", 't'): 1, ('n', ' ', '$', '3'): 1, ("'", 't', ' ', 'b'): 1, ('b', 'e', 'f', 'o'): 1, ('o', 'r', ' ', 'm'): 1, ('o', 'r', 'e', ' '): 1, ('n', 'd', ' ', 'i'): 1, ('3', '0', ' ', 'b'): 1, ('t', 'h', 'a', 't'): 2, ('a', 't', ' ', 'i'): 1, ('b', 'e', 'l', 'i'): 1, ('i', 't', ' ', 'o'): 1, ('a', 't', ' ', 'I'): 1, ('o', 's', 't', 's'): 1, (' ', 'f', 'i', 'n'): 1, (' ', 'f', 'o', 'r'): 1, ('t', 's', ' ', '$'): 1, ('e', 'v', 'e', ' '): 1, (' ', 'c', 'a', 'n'): 1, ('t', ' ', 'b', 'e'): 1,
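
Word n-grams and character n-grams are the same operation applied to different item sequences. The sketch below shows this with one small helper; `ngram_counts` is a hypothetical name introduced for this example, and the code does not use MeTA.

```python
from collections import Counter

def ngram_counts(items, n):
    """Count every run of n adjacent items in a sequence."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))

tokens = ['i', 'said', 'that', 'i', "can't", 'believe', 'that', 'it', 'only',
          'costs', '$', '19.95', '!', 'i', 'could', 'only', 'find', 'it',
          'for', 'more', 'than', '$', '30', 'before', '.']
text = ("I said that I can't believe that it only costs $19.95! "
        "I could only find it for more than $30 before.")

word_bigrams = ngram_counts(tokens, 2)   # items are words
char_fourgrams = ngram_counts(text, 4)   # items are single characters

print(word_bigrams[('i', 'could')])
print(char_fourgrams[('t', 'h', 'a', 't')])
```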

## POS tagging

Now, let's explore something a little bit different.
MeTA also has a natural language processing (NLP) component, which currently supports two major NLP tasks: part-of-speech tagging and syntactic parsing.
POS tagging is a task in NLP that involves identifying a type for each word in a sentence.
For example, POS tagging can be used to identify all of the nouns in a sentence, or all of the verbs, or adjectives, or…
This is useful as a first step towards developing an understanding of the meaning of a particular sentence.
MeTA places its POS tagging component in its “sequences” library.
Let's play with some sequences first to get an idea of how they work.
We'll start off by creating a sequence.

In [45]:
seq = metapy.sequence.Sequence()

Now, we can add individual words to this sequence.
Sequences consist of a list of Observations, which are essentially (word, tag) pairs.
If we don't yet know the tags for a Sequence, we can just add individual words and leave the tags unset.
Words are called “symbols” in the library terminology.

In [46]:
for word in ["The", "dog", "ran", "across", "the", "park", "."]:
    seq.add_symbol(word)
print(seq)

(The, ???), (dog, ???), (ran, ???), (across, ???), (the, ???), (park, ???), (., ???)


The printed form of the sequence shows that we do not yet know the tags for each word.
Let's fill them in by using a pre-trained POS-tagger model that's distributed with MeTA.

In [58]:
wget.download('https://github.com/meta-toolkit/meta/releases/download/v3.0.1/greedy-perceptron-tagger.tar.gz')


'greedy-perceptron-tagger.tar.gz'

In [59]:

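import tarfile  # needed here and by the later extraction cells
tf = tarfile.open("greedy-perceptron-tagger.tar.gz")
tf.extractall()  # unpack the model files into the working directory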

In [61]:
tagger = metapy.sequence.PerceptronTagger("perceptron-tagger/")
tagger.tag(seq)
print(seq)

(The, DT), (dog, NN), (ran, VBD), (across, IN), (the, DT), (park, NN), (., .)
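
Once a sentence is tagged, picking out all the nouns or verbs is a simple filter over (word, tag) pairs. The sketch below uses plain tuples copied from the output above instead of a MeTA Sequence; it relies on the Penn Treebank convention that noun tags start with `NN` and verb tags start with `VB`.

```python
# (word, tag) pairs copied from the tagger output above.
tagged = [('The', 'DT'), ('dog', 'NN'), ('ran', 'VBD'), ('across', 'IN'),
          ('the', 'DT'), ('park', 'NN'), ('.', '.')]

# Penn Treebank noun tags: NN, NNS, NNP, NNPS. Verb tags: VB, VBD, VBG, ...
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]

print(nouns)
print(verbs)
```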





Each tag indicates the type of a word, and this particular tagger was trained to output the tags present in the Penn Treebank tagset.
But what if we want to POS-tag a document?


In [67]:
doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")
tok = metapy.analyzers.ICUTokenizer() # keep sentence boundaries!
tok = metapy.analyzers.PennTreebankNormalizer(tok)
tok.set_content(doc.content())
tokens = [token for token in tok]
print(tokens)

['<s>', 'I', 'said', 'that', 'I', 'ca', "n't", 'believe', 'that', 'it', 'only', 'costs', '$', '19.95', '!', '</s>']


Now, we will write a function that takes a token stream containing sentence boundary tags and returns a list of Sequence objects.
We will not include the sentence boundary tags in the actual Sequence objects.

In [68]:
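def extract_sequences(tok):
    """Split a token stream with sentence boundary tags into Sequences."""
    sequences = []
    for token in tok:
        if token == '<s>':
            # A sentence starts: open a new, empty Sequence.
            sequences.append(metapy.sequence.Sequence())
        elif token != '</s>':
            # Ordinary token: append it to the current sentence.
            sequences[-1].add_symbol(token)
    return sequences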

doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")
tok.set_content(doc.content())
for seq in extract_sequences(tok):
    tagger.tag(seq)
    print(seq)

(I, PRP), (said, VBD), (that, IN), (I, PRP), (ca, MD), (n't, RB), (believe, VB), (that, IN), (it, PRP), (only, RB), (costs, VBZ), ($, $), (19.95, CD), (!, .)
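
The splitting logic is easier to check on a document with more than one sentence. Here is a sketch of the same loop using plain Python lists in place of Sequence objects, so it runs without MeTA; `split_sentences` is a hypothetical stand-in for `extract_sequences`.

```python
def split_sentences(tokens):
    """Group tokens into one list per sentence, dropping <s> and </s>."""
    sentences = []
    for token in tokens:
        if token == '<s>':
            sentences.append([])          # start a new sentence
        elif token != '</s>':
            sentences[-1].append(token)   # add to the current sentence
    return sentences

# Token stream of the two-sentence document from the tokenization section.
tokens = ['<s>', 'I', 'said', 'that', 'I', "can't", 'believe', 'that', 'it',
          'only', 'costs', '$', '19.95', '!', '</s>',
          '<s>', 'I', 'could', 'only', 'find', 'it', 'for', 'more', 'than',
          '$', '30', 'before', '.', '</s>']

sentences = split_sentences(tokens)
for sentence in sentences:
    print(len(sentence), sentence)
```

Note that both versions assume the stream opens with `<s>`; a token arriving before any sentence start would raise an IndexError.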


## Config.toml file: setting up a pipeline

In practice, it is often beneficial to combine multiple feature sets together.
We can do this with a MultiAnalyzer. Let's combine unigram words, bigram POS tags, and rewrite rules for our document feature representation.
We can certainly do this programmatically, but doing so can become tedious quite quickly.
Instead, let's use MeTA's configuration file format to specify our analyzer, which we can then load in one line of code.
MeTA uses TOML configuration files for all of its configuration. If you haven't heard of TOML before, don't panic! It's a very simple, readable format.
Open a text editor and copy the text below, being careful not to modify the contents. Save it as `config.toml`.
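
The configuration text itself is not shown in this notebook. As a sketch only, the cell below writes a `config.toml` that is consistent with the analyzer output shown at the end of this section (stemmed unigrams, POS-tag bigrams, and `subtree-` features). The keys and values are an assumption recalled from MeTA's tutorial material, and the paths `crf`, `perceptron-tagger/`, and `parser/` are assumed to match the archives extracted in this notebook; please check them against the MeTA documentation before relying on them.

```python
from pathlib import Path

# ASSUMPTION: reconstructed configuration, not taken from this notebook.
CONFIG_TEXT = '''\
stop-words = "lemur-stopwords.txt"

[[analyzers]]
method = "ngram-word"
ngram = 1
filter = "default-unigram-chain"

[[analyzers]]
method = "ngram-pos"
ngram = 2
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
crf-prefix = "crf"

[[analyzers]]
method = "tree"
filter = [{type = "icu-tokenizer"}, {type = "ptb-normalizer"}]
features = ["subtree"]
tagger = "perceptron-tagger/"
parser = "parser/"
'''

# Write the file next to the notebook so it can be loaded by name.
Path("config.toml").write_text(CONFIG_TEXT)
```

Each `[[analyzers]]` table describes one feature set, and listing several of them is what makes the loaded analyzer a MultiAnalyzer.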


In [82]:
wget.download('https://github.com/meta-toolkit/meta/releases/download/v3.0.2/crf.tar.gz')


'crf.tar.gz'

In [83]:
tf = tarfile.open("crf.tar.gz")
tf.extractall()

In [93]:
wget.download('https://github.com/meta-toolkit/meta/releases/download/v3.0.2/greedy-constituency-parser.tar.gz')
tf = tarfile.open("greedy-constituency-parser.tar.gz")
tf.extractall()

In [98]:
ana = metapy.analyzers.load('config.toml')
doc = metapy.index.Document()
doc.content("I said that I can't believe that it only costs $19.95!")
print(ana.analyze(doc))



{'subtree-(VP (VBZ) (NP))': 1, 'subtree-(ADVP (RB))': 1, 'subtree-(VB)': 1, 'VBD_IN': 1, 'subtree-(NP (PRP))': 3, 'subtree-(VBD)': 1, 'PRP_VBD': 1, 'subtree-(VP (VB) (SBAR))': 1, 'subtree-(CD)': 1, 'subtree-(.)': 1, 'subtree-(S (NP) (ADVP) (VP))': 1, '$_CD': 1, 'PRP_MD': 1, 'cost': 1, 'subtree-(IN)': 2, 'subtree-($)': 1, 'subtree-(S (NP) (VP))': 1, 'believ': 1, 'subtree-(VP (MD) (RB) (VP))': 1, 'IN_PRP': 2, 'subtree-(S (NP) (VP) (.))': 1, 'subtree-(ROOT (S))': 1, 'subtree-(RB)': 2, 'subtree-(SBAR (IN) (S))': 2, "can't": 1, 'subtree-(PRP)': 3, 'VBZ_$': 1, 'subtree-(NP ($) (CD))': 1, 'VB_IN': 1, 'PRP_RB': 1, 'subtree-(VBZ)': 1, 'CD_.': 1, 'subtree-(MD)': 1, 'subtree-(VP (VBD) (SBAR))': 1, 'RB_VBZ': 1, 'MD_RB': 1, 'RB_VB': 1}
