# Day 5: Advanced Processing, Bring Your Own Corpora

Na-Rae Han (`naraehan@pitt.edu`) and David J. Birnbaum (`djbpitt@pitt.edu`) 

June 25-29, [NASSLLI 2018 at CMU](https://www.cmu.edu/nasslli2018/) 

This tutorial is available at <https://github.com/naraehan/NASSLLI2018-Corpus-Linguistics>. 
- Jump to: [Day 1](day1.ipynb), [Day 2](day2.ipynb), [Day 3](day3.ipynb), [Day 4](day4.ipynb), [Day 5](day5.ipynb)

## Advanced processing: lemmatization
- NLTK's WordNet lemmatizer 
- It works well for nouns. Verbs are trickier: the default POS is noun (`'n'`), so verbs must be explicitly marked with `'v'`. 
- For a better, knowledge-rich, context-aware solution, you might need to venture outside Python/NLTK and try a full-scale NLP suite such as [Stanford's CoreNLP](https://stanfordnlp.github.io/CoreNLP/). 

In [None]:
import nltk
wnl = nltk.WordNetLemmatizer()   # initialize a lemmatizer

In [None]:
# try 'geese', 'walks', 'walked', 'walking' 
wnl.lemmatize('cats')

In [None]:
wnl.lemmatize('walking', 'v')

In [None]:
# From this page: http://www.pitt.edu/~naraehan/python3/text-samples.txt
moby = """Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation."""

In [None]:
%pprint
nltk.word_tokenize(moby)

In [None]:
[wnl.lemmatize(t) for t in nltk.word_tokenize(moby)]
# Output isn't very intelligent without us supplying individual tokens with their correct POS 
# Any way to identify verbs?

## Advanced processing: POS tagging
- `nltk.pos_tag` is NLTK's default POS tagger.  
- Default tagset is the [Penn Treebank ('wsj') tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). 
- A word of warning: it is not state-of-the-art. (Built on limited data.) 

In [None]:
chom = 'colorless green ideas sleep furiously'.split()
chom

In [None]:
nltk.pos_tag(chom)

In [None]:
nltk.pos_tag(nltk.word_tokenize(moby))

In [None]:
help(nltk.pos_tag)
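
POS tags offer one way to tell the lemmatizer which tokens are verbs. The sketch below is one possible approach, not part of NLTK itself: `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names, and the mapping from Penn Treebank tag prefixes to WordNet POS letters is a deliberately coarse assumption (anything that is not a verb, adjective, or adverb falls back to noun).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('n', 'v', 'a', 'r')."""
    if tag.startswith('VB'):    # VB, VBD, VBG, VBN, VBP, VBZ
        return 'v'
    if tag.startswith('JJ'):    # JJ, JJR, JJS
        return 'a'
    if tag.startswith('RB'):    # RB, RBR, RBS
        return 'r'
    return 'n'                  # fall back to noun, the lemmatizer's own default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (token, Penn tag) pairs with any object exposing lemmatize(word, pos)."""
    return [lemmatizer.lemmatize(token, penn_to_wordnet(tag))
            for token, tag in tagged_tokens]


# Usage in this notebook (needs the tagger and WordNet data to be installed):
# lemmatize_tagged(nltk.pos_tag(nltk.word_tokenize(moby)), wnl)
```

The result is only as good as the tags: a mistagged token is lemmatized under the wrong POS.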

## Bring Your Own Corpora (1): Treebanks
- Treebanks are syntactically annotated sentences. 
- They are used in training POS-taggers and syntactic parsers. 
- NLTK includes a sample section of the Penn English Treebank (3,914 sentences, about 10% of the entire corpus). 
- For more details on Treebanks and how to interact with tree structure, see [this NLTK book section](http://www.nltk.org/book/ch08.html#treebanks-and-grammars). 

In [None]:
from nltk.corpus import treebank
treebank.words()

In [None]:
treebank.sents()

In [None]:
treebank.tagged_sents()

In [None]:
treebank.parsed_sents()

In [None]:
# Note: displaying the first tree directly as cell output may give you an "unable to find the gs file" error. 
#    Saving it into t works, however. 
# https://stackoverflow.com/questions/36942270/nltk-was-unable-to-find-the-gs-file/37160385
# In short: you need to install Ghostscript and add it to your system's PATH. 

t = treebank.parsed_sents()[0]

In [None]:
# Trees are composed of subtrees, each of which itself is a Tree. 
print(t)

In [None]:
# Opens up a new window. Close it before moving to next cell. 
t.draw()

In [None]:
# "said" is a verb (VBD) that takes a clausal complement (S). 
#   The nodes are children of a VP node. 
print(treebank.parsed_sents()[7])

In [None]:
# myfilter: returns True if the current Tree is a VP node with an S child, False otherwise. 
# You can define your own function with the def keyword. 

def myfilter(tree):
    # Collect the labels of child nodes that are themselves Trees (skipping leaf strings). 
    child_nodes = [child.label() for child in tree if isinstance(child, nltk.Tree)]
    return (tree.label() == 'VP') and ('S' in child_nodes)

In [None]:
# For every full tree in the Treebank, recurse through its subtrees 
#    and keep only those that match the configuration. 
# Searching through first 50 sentences only: remove [:50] for a full search. 

%pprint 
[subtree for tree in treebank.parsed_sents()[:50]
             for subtree in tree.subtrees(myfilter)]
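
The same filter-and-recurse pattern can be tried on a small hand-written tree, which needs no Treebank data. This is a minimal sketch: the bracketed sentence is invented for illustration, and `is_vp_with_s_child` is a hypothetical name for a filter equivalent to the one defined above.

```python
from nltk import Tree

def is_vp_with_s_child(tree):
    """True if tree is a VP node with at least one S node among its children."""
    child_labels = [child.label() for child in tree if isinstance(child, Tree)]
    return tree.label() == 'VP' and 'S' in child_labels

# A toy parse written by hand (not taken from the Penn Treebank).
sent = Tree.fromstring("""
(S
  (NP (NNP Kim))
  (VP (VBD said)
      (S (NP (PRP she))
         (VP (VBD left)))))
""")

# subtrees() visits the tree itself and every subtree below it,
# yielding only those for which the filter returns True.
matches = list(sent.subtrees(is_vp_with_s_child))
for m in matches:
    print(' '.join(m.leaves()))
```

Only the outer VP matches here. The inner VP has a verb child but no S child, and the root node is labeled S rather than VP.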

### Non-English Treebanks
- A sample of 'Sinica Treebank' (Chinese) is available as part of NLTK's data. 
- You should download it first. 

In [None]:
nltk.download('sinica_treebank')

In [None]:
from nltk.corpus import sinica_treebank as chtb
print(chtb.parsed_sents()[3450])

In [None]:
chtb.parsed_sents()[3450].draw()    # Opens a new window

## Bring Your Own Corpora (2): CHILDES
**CHAT vs. XML**
- CHILDES uses its own corpus format: CHAT. 
- Many data sets also come in XML format (https://childes.talkbank.org/data-xml/), which NLTK can read in.
- If no XML version is provided, you can use a converter called Chatter: https://talkbank.org/software/chatter.html

**Getting the data**
1. Navigate to <https://childes.talkbank.org/data-xml/>.
1. Click on the link to the language that interests you, e.g., `Eng-NA` (North American English). These directories hold `zip` archives of subcorpora in the designated language.
1. Download one or more of the zip files, say `Valian.zip`. 
1. Create a new directory named `CHILDES` on your Desktop. Unzip the downloaded file into it. 
1. Now you should have a `Valian` directory inside `CHILDES`. 
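
The steps above should leave a `Valian` directory containing `.xml` files. The sketch below is one way to confirm that using only the standard library. `list_xml_files` is a hypothetical helper, and the Desktop path is an assumption based on the steps above, so adjust it to your own machine.

```python
from pathlib import Path

def list_xml_files(corpus_root):
    """Return the .xml files under corpus_root as sorted, '/'-separated relative paths."""
    root = Path(corpus_root)
    if not root.is_dir():
        raise FileNotFoundError(f'No such directory: {root}')
    # rglob also descends into subdirectories, in case the archive unpacks into nested folders.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob('*.xml'))

# Assumed location, following the steps above; change as needed.
corpus_root = Path.home() / 'Desktop' / 'CHILDES' / 'Valian'
if corpus_root.is_dir():
    xml_files = list_xml_files(corpus_root)
    print(len(xml_files), 'XML files found under', corpus_root)
else:
    print('Directory not found:', corpus_root)
```

Building the path from `Path.home()` avoids hard-coding a user name and works the same way on Windows, macOS, and Linux.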

**Starter code:**

In [None]:
import nltk
from nltk.corpus.reader import CHILDESCorpusReader
corpus_root = 'C:/Users/narae/Desktop/CHILDES/Valian' # change path as needed
valian = CHILDESCorpusReader(corpus_root, r'.*\.xml')   # regex: every file ending in .xml

valian.fileids()         # returns list of filenames
len(valian.fileids())    # returns count of files in corpus 

- If that all works, navigate to <http://www.nltk.org/howto/childes.html> and begin at the line that reads “Printing properties of the corpus files”.
- More CHILDES & Python tutorials:
  - http://ling-blogs.bu.edu/lx390f17/standoff-annotation-xml-and-more-childes/
  - http://aaronstevenwhite.io/language-acquisition/working-with-childes-part1/

## How about...?
- Files in MS Word or PDF? (See [this NLTK book section](http://www.nltk.org/book/ch03.html#extracting-text-from-pdf-msword-and-other-binary-formats))
- Non-English corpora? (See [this NLTK book section](http://www.nltk.org/book/ch02.html#corpora-in-other-languages))
- Corpora in XML format? (See [this NLTK book section](http://www.nltk.org/book/ch11.html#working-with-xml))
- Looking to load your own annotated corpus (POS-tagged, Treebanks, etc.)? NLTK provides specialized corpus loaders for such formats: see [this NLTK how-to page](http://www.nltk.org/howto/corpus.html).  


## What next?
Take a Python course! There are many online courses available on [Coursera](http://www.coursera.org), [edX](https://www.edx.org/), [Udemy](https://www.udemy.com/courses/), [DataCamp](https://www.datacamp.com/courses), and more.

The NLTK book "Natural Language Processing with Python" is available here: http://www.nltk.org/book/