# Day 5: Advanced Processing, Bring Your Own Corpora

Na-Rae Han (`naraehan@pitt.edu`) and David J. Birnbaum (`djbpitt@pitt.edu`) 

June 25-29, [NASSLLI 2018 at CMU](https://www.cmu.edu/nasslli2018/) 

This tutorial is available at https://github.com/naraehan/NASSLLI2018-Corpus-Linguistics. 
- Jump to: [Day 1](day1.ipynb), [Day 2](day2.ipynb), [Day 3](day3.ipynb), [Day 4](day4.ipynb), [Day 5](day5.ipynb)

## Advanced processing: lemmatization
- NLTK's WordNet lemmatizer 
- It works well for nouns. Verbs are tricky: the default POS is noun (`'n'`), so verbs must be explicitly passed as `'v'`. 
- For a better, knowledge-rich, context-aware solution, you might need to venture outside Python/NLTK and try a full-scale NLP suite such as [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/). 

In [1]:
import nltk
wnl = nltk.WordNetLemmatizer()

In [2]:
# try 'geese', 'walks', 'walked', 'walking' 
wnl.lemmatize('cats')

'cat'

In [3]:
# 'n' (noun; default), 'v' (verb), 'a' (adjective), 'r' (adverb)
wnl.lemmatize('walking', 'v')

'walk'

In [4]:
# From this page: http://www.pitt.edu/~naraehan/python3/text-samples.txt
moby = """Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation."""

In [5]:
%pprint
nltk.word_tokenize(moby)

Pretty printing has been turned OFF


['Call', 'me', 'Ishmael', '.', 'Some', 'years', 'ago', '--', 'never', 'mind', 'how', 'long', 'precisely', '--', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', ',', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', ',', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '.', 'It', 'is', 'a', 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', '.']

In [6]:
[wnl.lemmatize(t) for t in nltk.word_tokenize(moby)]
# Only 'years' changed: without a POS for each token, every word is lemmatized as a noun. 
# Any way to identify verbs?

['Call', 'me', 'Ishmael', '.', 'Some', 'year', 'ago', '--', 'never', 'mind', 'how', 'long', 'precisely', '--', 'having', 'little', 'or', 'no', 'money', 'in', 'my', 'purse', ',', 'and', 'nothing', 'particular', 'to', 'interest', 'me', 'on', 'shore', ',', 'I', 'thought', 'I', 'would', 'sail', 'about', 'a', 'little', 'and', 'see', 'the', 'watery', 'part', 'of', 'the', 'world', '.', 'It', 'is', 'a', 'way', 'I', 'have', 'of', 'driving', 'off', 'the', 'spleen', 'and', 'regulating', 'the', 'circulation', '.']

## Advanced processing: POS tagging
- `nltk.pos_tag` is NLTK's default POS tagger.  
- Default tagset is the [Penn Treebank ('wsj') tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). 
- A word of warning: it is not state of the art, as it was built on limited data. 

In [7]:
chom = 'colorless green ideas sleep furiously'.split()
chom

['colorless', 'green', 'ideas', 'sleep', 'furiously']

In [8]:
nltk.pos_tag(chom)

[('colorless', 'NN'), ('green', 'JJ'), ('ideas', 'NNS'), ('sleep', 'VBP'), ('furiously', 'RB')]

In [9]:
nltk.pos_tag(nltk.word_tokenize(moby))

[('Call', 'VB'), ('me', 'PRP'), ('Ishmael', 'NNP'), ('.', '.'), ('Some', 'DT'), ('years', 'NNS'), ('ago', 'RB'), ('--', ':'), ('never', 'RB'), ('mind', 'VB'), ('how', 'WRB'), ('long', 'JJ'), ('precisely', 'RB'), ('--', ':'), ('having', 'VBG'), ('little', 'JJ'), ('or', 'CC'), ('no', 'DT'), ('money', 'NN'), ('in', 'IN'), ('my', 'PRP$'), ('purse', 'NN'), (',', ','), ('and', 'CC'), ('nothing', 'NN'), ('particular', 'JJ'), ('to', 'TO'), ('interest', 'NN'), ('me', 'PRP'), ('on', 'IN'), ('shore', 'NN'), (',', ','), ('I', 'PRP'), ('thought', 'VBD'), ('I', 'PRP'), ('would', 'MD'), ('sail', 'VB'), ('about', 'IN'), ('a', 'DT'), ('little', 'JJ'), ('and', 'CC'), ('see', 'VB'), ('the', 'DT'), ('watery', 'JJ'), ('part', 'NN'), ('of', 'IN'), ('the', 'DT'), ('world', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('way', 'NN'), ('I', 'PRP'), ('have', 'VBP'), ('of', 'IN'), ('driving', 'VBG'), ('off', 'RP'), ('the', 'DT'), ('spleen', 'NN'), ('and', 'CC'), ('regulating', 'VBG'), ('the', 'DT

In [10]:
help(nltk.pos_tag)

Help on function pos_tag in module nltk.tag:

pos_tag(tokens, tagset=None, lang='eng')
    Use NLTK's currently recommended part of speech tagger to
    tag the given list of tokens.
    
        >>> from nltk.tag import pos_tag
        >>> from nltk.tokenize import word_tokenize
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."))
        [('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is', 'VBZ'),
        ("n't", 'RB'), ('all', 'PDT'), ('that', 'DT'), ('bad', 'JJ'), ('.', '.')]
        >>> pos_tag(word_tokenize("John's big idea isn't all that bad."), tagset='universal')
        [('John', 'NOUN'), ("'s", 'PRT'), ('big', 'ADJ'), ('idea', 'NOUN'), ('is', 'VERB'),
        ("n't", 'ADV'), ('all', 'DET'), ('that', 'DET'), ('bad', 'ADJ'), ('.', '.')]
    
    NB. Use `pos_tag_sents()` for efficient tagging of more than one sentence.
    
    :param tokens: Sequence of tokens to be tagged
    :type tokens: list(str)
    :param tagset: the tagset to be u
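
The question raised in the lemmatization section, how to tell the lemmatizer which tokens are verbs, can be answered by feeding it the tagger's output. Below is a minimal sketch, assuming a hypothetical helper `penn_to_wordnet` (not part of NLTK) that maps Penn Treebank tags onto the four POS codes the WordNet lemmatizer accepts. Tags outside those classes fall back to the noun default. The sample pairs are copied from the tagged *Moby Dick* output above.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code: 'n', 'v', 'a' or 'r'.

    Hypothetical helper for illustration; it is not an NLTK function.
    """
    if tag.startswith('VB'):   # VB, VBD, VBG, VBN, VBP, VBZ
        return 'v'
    if tag.startswith('JJ'):   # JJ, JJR, JJS
        return 'a'
    if tag.startswith('RB'):   # RB, RBR, RBS (but not RP, the particle tag)
        return 'r'
    return 'n'                 # everything else uses the lemmatizer's default

# A few (word, tag) pairs taken from the pos_tag output above
tagged = [('I', 'PRP'), ('thought', 'VBD'), ('driving', 'VBG'), ('off', 'RP'),
          ('watery', 'JJ'), ('precisely', 'RB'), ('years', 'NNS')]

pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each pair can then be passed on as `wnl.lemmatize(word, pos)`, with `wnl` being the lemmatizer created at the top of this notebook.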

## Bring Your Own Corpora (1): Treebanks
- Treebanks are syntactically annotated sentences. 
- They are used in training POS-taggers and syntactic parsers. 
- NLTK includes a sample of the Penn English Treebank: 3,914 sentences, about 10% of the entire corpus. 
- For more details on Treebanks and how to interact with tree structure, see [this NLTK book section](http://www.nltk.org/book/ch08.html#treebanks-and-grammars). 

In [11]:
from nltk.corpus import treebank
treebank.words()

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', ...]

In [12]:
treebank.sents()

[['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.'], ['Mr.', 'Vinken', 'is', 'chairman', 'of', 'Elsevier', 'N.V.', ',', 'the', 'Dutch', 'publishing', 'group', '.'], ...]

In [13]:
treebank.tagged_sents()

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], ...]
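
Tagged sentences are lists of `(word, tag)` tuples, so standard Python tools apply to them. As a small sketch, the block below copies the first tagged sentence from the output above into a variable called `first_sent` (a name chosen here for illustration) and counts its tags with `collections.Counter`. With the corpus loaded, the same counting should work on `treebank.tagged_words()`.

```python
from collections import Counter

# First tagged sentence, copied from the treebank.tagged_sents() output above
first_sent = [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
              ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'),
              ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'),
              ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'),
              ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')]

# Count how often each tag occurs
tag_counts = Counter(tag for word, tag in first_sent)
print(tag_counts.most_common(3))

# Collect every noun: NN, NNS, NNP and NNPS all start with 'NN'
nouns = [word for word, tag in first_sent if tag.startswith('NN')]
print(nouns)
```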

In [14]:
treebank.parsed_sents()

[Tree('S', [Tree('NP-SBJ', [Tree('NP', [Tree('NNP', ['Pierre']), Tree('NNP', ['Vinken'])]), Tree(',', [',']), Tree('ADJP', [Tree('NP', [Tree('CD', ['61']), Tree('NNS', ['years'])]), Tree('JJ', ['old'])]), Tree(',', [','])]), Tree('VP', [Tree('MD', ['will']), Tree('VP', [Tree('VB', ['join']), Tree('NP', [Tree('DT', ['the']), Tree('NN', ['board'])]), Tree('PP-CLR', [Tree('IN', ['as']), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])])]), Tree('NP-TMP', [Tree('NNP', ['Nov.']), Tree('CD', ['29'])])])]), Tree('.', ['.'])]), Tree('S', [Tree('NP-SBJ', [Tree('NNP', ['Mr.']), Tree('NNP', ['Vinken'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP-PRD', [Tree('NP', [Tree('NN', ['chairman'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('NP', [Tree('NNP', ['Elsevier']), Tree('NNP', ['N.V.'])]), Tree(',', [',']), Tree('NP', [Tree('DT', ['the']), Tree('NNP', ['Dutch']), Tree('VBG', ['publishing']), Tree('NN', ['group'])])])])])]), Tree('.', ['.'])]), ...]

In [15]:
# Note: displaying the first tree directly as cell output gives an "unable to find the gs file" error. 
#    Saving it into t works, however. 
# https://stackoverflow.com/questions/36942270/nltk-was-unable-to-find-the-gs-file/37160385
# In short: you need to install Ghostscript and add it to your system's PATH. 

t = treebank.parsed_sents()[0]

In [16]:
# Trees are composed of subtrees, each of which itself is a Tree. 
print(t)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
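
Since every subtree is itself a Tree, the same methods apply at each level. The sketch below rebuilds the tree printed above from its bracketed string with `nltk.Tree.fromstring`, so it runs without the corpus data, and then inspects it with `label()`, `leaves()`, `pos()` and list-style indexing. The variable names `bracketed` and `first` are only illustrative.

```python
import nltk

bracketed = """
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
"""
first = nltk.Tree.fromstring(bracketed)

print(first.label())          # category of the root node
print(len(first))             # number of immediate children
print(first[0].label())       # children are accessed by index, like list items
print(first[0][0].leaves())   # words under the first NP inside the subject
print(first.pos()[:3])        # (word, tag) pairs read off the preterminal nodes
```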


In [17]:
# Opens up a new window. Close it before moving to the next cell. 
t.draw()

In [18]:
# "said" is a verb (VBD) that takes a clausal complement (S). 
#   Both nodes are children of the same VP node. 
print(treebank.parsed_sents()[7])

(S
  (NP-SBJ (DT A) (NNP Lorillard) (NN spokewoman))
  (VP
    (VBD said)
    (, ,)
    (`` ``)
    (S
      (NP-SBJ (DT This))
      (VP (VBZ is) (NP-PRD (DT an) (JJ old) (NN story)))))
  (. .))


In [19]:
# myfilter returns True if the current Tree is a VP node with an S child, and False otherwise. 
# You can define your own functions with the def keyword. 

def myfilter(tree):
    child_nodes = [child.label() for child in tree if isinstance(child, nltk.Tree)]
    return  (tree.label() == 'VP') and ('S' in child_nodes)
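
The same pattern generalizes to other parent/child configurations. Here is a minimal sketch, assuming a hypothetical factory `make_filter` (not an NLTK function) that builds a filter for any parent label and child label. It is tried out on the "said" sentence shown earlier, rebuilt with `nltk.Tree.fromstring` so that no corpus download is needed.

```python
import nltk

def make_filter(parent_label, child_label):
    """Return a filter accepting parent_label nodes that have a child_label child.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    def tree_filter(tree):
        # Leaves are plain strings, so only Tree children carry a label
        child_labels = [child.label() for child in tree if isinstance(child, nltk.Tree)]
        return tree.label() == parent_label and child_label in child_labels
    return tree_filter

sent = nltk.Tree.fromstring("""
(S
  (NP-SBJ (DT A) (NNP Lorillard) (NN spokewoman))
  (VP
    (VBD said)
    (, ,)
    (`` ``)
    (S
      (NP-SBJ (DT This))
      (VP (VBZ is) (NP-PRD (DT an) (JJ old) (NN story)))))
  (. .))
""")

vp_with_s = make_filter('VP', 'S')        # same condition as myfilter
matches = list(sent.subtrees(vp_with_s))  # subtrees() accepts a filter function
for subtree in matches:
    print(subtree.leaves())
```

Swapping in `make_filter('VP', 'NP-PRD')` would pick out the inner "is an old story" VP instead.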

In [20]:
# For every full tree in the Treebank, recurse through its subtrees 
#    and keep only those that meet the filter condition. 
# Searching through the first 20 sentences only: remove [:20] for a full search. 

%pprint 
[subtree for tree in treebank.parsed_sents()[:20]
             for subtree in tree.subtrees(myfilter)]

Pretty printing has been turned ON


[Tree('VP', [Tree('VBN', ['named']), Tree('S', [Tree('NP-SBJ', [Tree('-NONE-', ['*-1'])]), Tree('NP-PRD', [Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['nonexecutive']), Tree('NN', ['director'])]), Tree('PP', [Tree('IN', ['of']), Tree('NP', [Tree('DT', ['this']), Tree('JJ', ['British']), Tree('JJ', ['industrial']), Tree('NN', ['conglomerate'])])])])])]),
 Tree('VP', [Tree('VBD', ['said']), Tree(',', [',']), Tree('``', ['``']), Tree('S', [Tree('NP-SBJ', [Tree('DT', ['This'])]), Tree('VP', [Tree('VBZ', ['is']), Tree('NP-PRD', [Tree('DT', ['an']), Tree('JJ', ['old']), Tree('NN', ['story'])])])])]),
 Tree('VP', [Tree('VBD', ['said']), Tree('S', [Tree('-NONE-', ['*T*-1'])])]),
 Tree('VP', [Tree('VBN', ['expected']), Tree('S', [Tree('-NONE-', ['*?*'])])]),
 Tree('VP', [Tree('VBD', ['said']), Tree('S', [Tree('-NONE-', ['*T*-1'])])]),
 Tree('VP', [Tree('VBZ', ['appears']), Tree('S', [Tree('NP-SBJ', [Tree('-NONE-', ['*-1'])]), Tree('VP', [Tree('TO', ['to']), Tree('VP', [Tree('VB', ['be']), Tree('

In [21]:
found = [subtree for tree in treebank.parsed_sents()[:50]
             for subtree in tree.subtrees(myfilter)]

In [22]:
for t in found:
    print(t)

(VP
  (VBN named)
  (S
    (NP-SBJ (-NONE- *-1))
    (NP-PRD
      (NP (DT a) (JJ nonexecutive) (NN director))
      (PP
        (IN of)
        (NP (DT this) (JJ British) (JJ industrial) (NN conglomerate))))))
(VP
  (VBD said)
  (, ,)
  (`` ``)
  (S
    (NP-SBJ (DT This))
    (VP (VBZ is) (NP-PRD (DT an) (JJ old) (NN story)))))
(VP (VBD said) (S (-NONE- *T*-1)))
(VP (VBN expected) (S (-NONE- *?*)))
(VP (VBD said) (S (-NONE- *T*-1)))
(VP
  (VBZ appears)
  (S
    (NP-SBJ (-NONE- *-1))
    (VP
      (TO to)
      (VP
        (VB be)
        (NP-PRD
          (NP (DT the) (JJS highest))
          (PP
            (IN for)
            (NP
              (NP (DT any) (NN asbestos) (NNS workers))
              (RRC
                (VP
                  (VBN studied)
                  (NP (-NONE- *))
                  (PP-LOC
                    (IN in)
                    (NP
                      (JJ Western)
                      (VBN industrialized)
                      (NNS countries)))))

### Treebanks in languages other than English
- A sample of 'Sinica Treebank' (Chinese) is available as part of NLTK's data. 
- You should download it first. 

In [23]:
nltk.download('sinica_treebank')

[nltk_data] Downloading package sinica_treebank to
[nltk_data]     D:\narae/nltk_data...
[nltk_data]   Package sinica_treebank is already up-to-date!


True

In [24]:
from nltk.corpus import sinica_treebank as chtb
print(chtb.parsed_sents()[3450])

(VP
  (Ndabe 同時)
  (Dd 就)
  (VC32 帶)
  (Di 了)
  (NP
    (NP (DM 四張) (VH11 熟) (Nab 牛皮))
    (Caa 和)
    (NP (DM 十二頭) (VH16 肥) (Nab 牛))))


In [25]:
chtb.parsed_sents()[3450].draw()    # Opens a new window

## Bring Your Own Corpora (2): CHILDES
**CHAT vs. XML**
- CHILDES uses its own corpus format: CHAT. 
- Many data sets also come in XML format (https://childes.talkbank.org/data-xml/), which NLTK can read in.
- If no XML version is provided, you can use a converter called Chatter: https://talkbank.org/software/chatter.html

**Getting the data**
1. Navigate to <https://childes.talkbank.org/data-xml/>.
1. Click on the link to the language that interests you, e.g., `Eng-NA` (North American English). These directories hold `zip` archives of subcorpora in the designated language.
1. Download one or more of the zip files, say `Valian.zip`. 
1. Create a new directory named `CHILDES` on your Desktop. Unzip the downloaded file into it. 
1. Now you should have a `Valian` directory inside `CHILDES`. 
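
The last two steps can also be done from Python. Here is a minimal sketch using the standard library's `zipfile` and `pathlib`. The helper name `unzip_corpus` and the example paths are assumptions, so adjust them to wherever your download landed.

```python
import zipfile
from pathlib import Path

def unzip_corpus(zip_path, dest_dir):
    """Extract a downloaded corpus archive into dest_dir, creating it if needed.

    Hypothetical helper for illustration. Returns the names of the
    top-level entries found in dest_dir after extraction.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(p.name for p in dest.iterdir())

# Example call (hypothetical paths, change as needed):
# unzip_corpus(Path.home() / 'Downloads' / 'Valian.zip',
#              Path.home() / 'Desktop' / 'CHILDES')
```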

**Starter code:**

In [26]:
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = 'C:/Users/narae/Desktop/CHILDES/Valian' # change path as needed
valian = CHILDESCorpusReader(corpus_root, '.*.xml')
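
The second argument to the reader is a regular expression that selects files under `corpus_root`. One detail worth knowing, sketched below with Python's `re` module and made-up file names: an unescaped dot matches any character, so `'.*.xml'` is slightly more permissive than the escaped `r'.*\.xml'`. For a directory that holds only `.xml` files, both patterns select the same set. Using `re.fullmatch` here is an assumption, meant to approximate matching against whole file names.

```python
import re

# File names for illustration; the last two are made up to show the difference
names = ['01a.xml', '21c.xml', 'notes_xml', 'readme.txt']

# Unescaped dot: any single character may precede 'xml'
loose = [n for n in names if re.fullmatch('.*.xml', n)]

# Escaped dot: a literal '.' must precede 'xml'
strict = [n for n in names if re.fullmatch(r'.*\.xml', n)]

print(loose)
print(strict)
```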

In [27]:
%pprint
valian.fileids()         # returns list of filenames

Pretty printing has been turned OFF


['01a.xml', '01b.xml', '02a.xml', '02b.xml', '03a.xml', '03b.xml', '04a.xml', '04b.xml', '04c.xml', '05a.xml', '06a.xml', '06b.xml', '07a.xml', '08a.xml', '08b.xml', '09a.xml', '09b.xml', '09c.xml', '10a.xml', '10b.xml', '10c.xml', '11a.xml', '11b.xml', '12a.xml', '12b.xml', '13a.xml', '13b.xml', '14a.xml', '14b.xml', '15a.xml', '15b.xml', '16a.xml', '16b.xml', '17a.xml', '17b.xml', '18a.xml', '19a.xml', '19b.xml', '20a.xml', '20b.xml', '21a.xml', '21b.xml', '21c.xml']

- If that all works, navigate to <http://www.nltk.org/howto/childes.html> and begin at the line that reads “Printing properties of the corpus files”.
- More CHILDES & Python tutorials:
  - http://ling-blogs.bu.edu/lx390f17/standoff-annotation-xml-and-more-childes/
  - http://aaronstevenwhite.io/language-acquisition/working-with-childes-part1/

## How about...?
- Files in MS Word or PDF? (See [this NLTK book section](http://www.nltk.org/book/ch03.html#extracting-text-from-pdf-msword-and-other-binary-formats))
- Non-English corpora? (See [this NLTK book section](http://www.nltk.org/book/ch02.html#corpora-in-other-languages))
- Corpora in XML format? (See [this NLTK book section](http://www.nltk.org/book/ch11.html#working-with-xml))
- Looking to load your own annotated corpus (POS-tagged, Treebanks, etc.)? NLTK provides specialized corpus loaders for such formats: see [this NLTK how-to page](http://www.nltk.org/howto/corpus.html).  


## What next?
Take a Python course! There are many online courses available on [Coursera](http://www.coursera.org), [edX](https://www.edx.org/), [Udemy](https://www.udemy.com/courses/), [DataCamp](https://www.datacamp.com/courses), and more.

The NLTK book "Natural Language Processing with Python" is available here: http://www.nltk.org/book/