# Chapter 2: Accessing Text Corpora and Lexical Resources

Near the end of the last notebook, I had you switch over to reading the text of the NLTK textbook in one browser window and working through the code examples in the notebook in the other. We're going to stick with that method here. This should make the notebook easier to read and to keep track of. You are welcome to add cells of your own with notes, either as markdown or as comments in the code using a hash (#). The section numbers and the titles/subtitles from each chapter are included so that you can follow along.

## 1 Accessing Text Corpora 
### 1.1 Gutenberg Corpus

In [None]:
import nltk
nltk.corpus.gutenberg.fileids()

In [None]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')
len(emma)

In [None]:
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
emma.concordance("surprize")

In [None]:
from nltk.corpus import gutenberg
gutenberg.fileids()

In [None]:
emma = gutenberg.words('austen-emma.txt')

In [None]:
for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))
    num_words = len(gutenberg.words(fileid))
    num_sents = len(gutenberg.sents(fileid))
    num_vocab = len(set(w.lower() for w in gutenberg.words(fileid)))
    print(round(num_chars/num_words), round(num_words/num_sents), round(num_words/num_vocab), fileid)

**Respond:** In a new cell, explain in your own words what this for loop does. 

In [None]:
macbeth_sentences = gutenberg.sents('shakespeare-macbeth.txt')
macbeth_sentences

In [None]:
macbeth_sentences[1116]

In [None]:
longest_len = max(len(s) for s in macbeth_sentences)
[s for s in macbeth_sentences if len(s) == longest_len]

### 1.2 Web and Chat Text

In [None]:
from nltk.corpus import webtext
for fileid in webtext.fileids():
    print(fileid, webtext.raw(fileid)[:65], '...')

In [None]:
from nltk.corpus import nps_chat
chatroom = nps_chat.posts('10-19-20s_706posts.xml')
chatroom[123]

### 1.3 Brown Corpus

In [None]:
from nltk.corpus import brown
brown.categories()

In [None]:
brown.words(categories='news')

In [None]:
brown.words(fileids=['cg22'])

In [None]:
brown.sents(categories=['news', 'editorial', 'reviews'])

In [None]:
from nltk.corpus import brown
news_text = brown.words(categories='news')
fdist = nltk.FreqDist(w.lower() for w in news_text)
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')

**Your Turn:** Choose a different section of the Brown Corpus and adapt the previous example to count a selection of wh words such as what, when, where, who, and why. 

In [None]:
cfd = nltk.ConditionalFreqDist(
    (genre,word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modals)

### 1.4 Reuters Corpus

In [None]:
from nltk.corpus import reuters
reuters.fileids()

In [None]:
reuters.categories()

In [None]:
reuters.categories('training/9865')

In [None]:
reuters.categories(['training/9865','training/9880'])

In [None]:
reuters.fileids('barley')

In [None]:
reuters.fileids(['barley', 'corn'])

In [None]:
reuters.words('training/9865')[:14]

In [None]:
reuters.words(['training/9865', 'training/9880'])

In [None]:
reuters.words(categories='barley')

In [None]:
reuters.words(categories=['barley', 'corn'])

### 1.5 Inaugural Address Corpus

In [None]:
from nltk.corpus import inaugural
inaugural.fileids()

In [None]:
[fileid[:4] for fileid in inaugural.fileids()]

In [None]:
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

### 1.6 Annotated Text Corpora

### 1.7 Corpora in Other Languages

In [None]:
nltk.corpus.cess_esp.words()

In [None]:
nltk.corpus.floresta.words()

In [None]:
nltk.corpus.indian.words('hindi.pos')

In [None]:
nltk.corpus.udhr.fileids()

In [None]:
nltk.corpus.udhr.words('Javanese-Latin1')[11:]

In [None]:
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.plot(cumulative=True)

### 1.8 Text Corpus Structure

In [None]:
raw = gutenberg.raw("burgess-busterbrown.txt")
raw[1:20]

In [None]:
words = gutenberg.words("burgess-busterbrown.txt")
words[1:20]

In [None]:
sents = gutenberg.sents("burgess-busterbrown.txt")
sents[1:20]

### 1.9 Loading your own Corpus

In [None]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/usr/share/dict'
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()

In [None]:
wordlists.words('connectives')

In [None]:
# You can skip this exercise, as it depends on knowing the location of the Wall Street Journal files in 
# your downloaded NLTK corpus. If you understand how to locate files on your computer from the command line,
# you might want to play around with this exercise to see if you can find a folder with text files. (My path
# would be /usr/local/share/nltk_data/corpora/treebank/raw.)
from nltk.corpus import BracketParseCorpusReader
corpus_root = "../../../nltk_data/corpora/treebank/raw/"
file_pattern = "wsj_.*"
ptb = BracketParseCorpusReader(corpus_root, file_pattern)
ptb.fileids()
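
If you are unsure where a folder of corpus files lives, a small standard-library search can help you find candidate folders. The sketch below is only an illustration: the helper name `dirs_with_matching_files` and the throwaway demo folder are made up for this example and are not part of NLTK.

```python
import re
import tempfile
from pathlib import Path

def dirs_with_matching_files(root, file_pattern):
    """Return sorted directories under root that hold a file whose name matches file_pattern."""
    regex = re.compile(file_pattern)
    found = set()
    for path in Path(root).rglob('*'):
        if path.is_file() and regex.fullmatch(path.name):
            found.add(str(path.parent))
    return sorted(found)

# Demo on a temporary folder (hypothetical layout) so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / 'treebank' / 'raw'
    raw_dir.mkdir(parents=True)
    (raw_dir / 'wsj_0001').write_text('Pierre Vinken will join the board.')
    (Path(tmp) / 'notes.txt').write_text('not a corpus file')
    matches = dirs_with_matching_files(tmp, r'wsj_.*')
    demo_result = [Path(m).name for m in matches]
    print(demo_result)
```

Pointing `root` at your own `nltk_data` folder and reusing the `wsj_.*` pattern from the cell above should list candidate values for `corpus_root`.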

## 2 Conditional Frequency Distributions

### 2.1 Conditions and Events

In [None]:
text = ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said']
pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County')]
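
A conditional frequency distribution counts events separately for each condition, so every pair above has the form (condition, event). The plain-Python sketch below mimics that idea with the standard library. It illustrates the concept and is not NLTK's implementation; the name `count_by_condition` and the extra `'romance'` pairs are invented for this example.

```python
from collections import Counter, defaultdict

def count_by_condition(pairs):
    """Group (condition, event) pairs into one Counter of events per condition."""
    counts = defaultdict(Counter)
    for condition, event in pairs:
        counts[condition][event] += 1
    return counts

# Toy data: the 'news' pairs come from the cell above, the 'romance' pairs are made up.
toy_pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'),
             ('romance', 'The'), ('romance', 'The')]
toy_cfd = count_by_condition(toy_pairs)
print(sorted(toy_cfd))            # the conditions
print(toy_cfd['romance']['The'])  # how often 'The' occurs under 'romance'
```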

### 2.2 Counting Words by Genre

In [None]:
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

In [None]:
genre_word = [(genre, word)
             for genre in ['news', 'romance']
             for word in brown.words(categories=genre)]
len(genre_word)

In [None]:
genre_word[:4]

In [None]:
genre_word[-4:]

In [None]:
cfd = nltk.ConditionalFreqDist(genre_word)
cfd

In [None]:
cfd.conditions()

In [None]:
print(cfd['news'])

In [None]:
print(cfd['romance'])

In [None]:
cfd['romance'].most_common(20)

In [None]:
cfd['romance']['could']

### 2.3 Plotting and Tabulating Distributions

In [None]:
from nltk.corpus import inaugural
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))
cfd.plot()

In [None]:
from nltk.corpus import udhr
languages = ['Chickasaw', 'English', 'German_Deutsch', 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))
cfd.plot()

In [None]:
cfd.tabulate(conditions=['English', 'German_Deutsch'],
             samples = range(10), cumulative=True)

In [None]:
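from nltk.corpus import brown
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Rebuild cfd from the Brown Corpus; the previous cfd holds UDHR word lengths keyed by language.
cfd = nltk.ConditionalFreqDist(
    (genre, word) for genre in ['news', 'romance'] for word in brown.words(categories=genre))
cfd.plot(conditions=['romance', 'news'], samples=days)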

In [None]:
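# These files appear to be plain text (e.g. COGS_bylaws.txt), so PlaintextCorpusReader is used
# here rather than BracketParseCorpusReader, which expects bracketed parse trees.
from nltk.corpus import PlaintextCorpusReader
corpus_root = '/Users/kfitzpatrick/Documents/archive/projects/MSUdocs/governance'
docs = PlaintextCorpusReader(corpus_root, '.*')
docs.fileids()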

In [None]:
len(docs.words())

In [None]:
docs.words(fileids='COGS_bylaws.txt')

In [None]:
fdist = nltk.FreqDist(w.lower() for w in docs.words())  # iterate over the words, not the reader object
modals = ['can', 'could', 'may', 'might', 'must', 'will']
for m in modals:
    print(m + ':', fdist[m], end=' ')

### 2.4 Generating Random Text with Bigrams

In [None]:
sent = ['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']
list(nltk.bigrams(sent))

In [None]:
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print(word, end=' ')
        word = cfdist[word].max()
        
text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)

cfd['living']

In [None]:
generate_model(cfd, 'living')
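
`generate_model` always moves to the single most likely next word, so its output is deterministic. The sketch below reproduces the same idea without NLTK on a toy word list; `toy_words`, `build_followers`, and `generate_greedy` are names made up for this illustration.

```python
from collections import Counter, defaultdict

def build_followers(words):
    """Map each word to a Counter of the words that immediately follow it."""
    followers = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        followers[first][second] += 1
    return followers

def generate_greedy(followers, word, num=6):
    """Collect up to num words, always moving to the most frequent follower."""
    output = []
    for _ in range(num):
        output.append(word)
        if not followers[word]:  # dead end: no word ever followed this one
            break
        word = followers[word].most_common(1)[0][0]
    return output

toy_words = 'the cat saw the cat saw the dog'.split()
followers = build_followers(toy_words)
print(generate_greedy(followers, 'the'))
# prints ['the', 'cat', 'saw', 'the', 'cat', 'saw']
```

The repeating output shows how this greedy approach can get stuck in a loop, which you may also notice in the Genesis output above.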

## 3 More Python: Reusing Code
**NOTE:** This section introduces you to creating longer Python programs in a text editor rather than composing and running code on the fly in the interpreter. This is a good opportunity to experiment with Spyder, the Python IDE available in Anaconda Navigator.
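
As a small illustration of the kind of function worth saving in its own file, here is a sketch of a pluralizing helper. The file name `text_helpers.py` is hypothetical, and the rules in `plural` are deliberately simplified; English has many exceptions that they miss.

```python
# Contents of a hypothetical file named text_helpers.py

def plural(word):
    """Return a rough English plural of word (simplified rules, many exceptions)."""
    if word.endswith('y'):
        return word[:-1] + 'ies'
    elif word[-1] in 'sx' or word[-2:] in ('sh', 'ch'):
        return word + 'es'
    elif word.endswith('an'):
        return word[:-2] + 'en'
    else:
        return word + 's'

if __name__ == '__main__':
    # Runs only when the file is executed directly, not when it is imported.
    for w in ['fairy', 'wish', 'woman']:
        print(w, '->', plural(w))
```

Saved this way, a script or notebook cell in the same folder could reuse the function with `from text_helpers import plural`.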

## 4 Lexical Resources

### 4.1 Wordlist Corpora

In [None]:
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab - english_vocab
    return sorted(unusual)

unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))

In [None]:
unusual_words(nltk.corpus.nps_chat.words())

In [None]:
from nltk.corpus import stopwords
stopwords.words('english')

In [None]:
def content_fraction(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)

content_fraction(nltk.corpus.reuters.words())

In [None]:
puzzle_letters = nltk.FreqDist('egivrvonl')
obligatory = 'r'
wordlist = nltk.corpus.words.words()
[w for w in wordlist if len(w) >= 6
    and obligatory in w
    and nltk.FreqDist(w) <= puzzle_letters]

In [None]:
names = nltk.corpus.names
names.fileids()

How might this process be helpful for someone interested in questions of "gender"? What might be misleading? 

In [None]:
male_names = names.words('male.txt')
female_names = names.words('female.txt')
[w for w in male_names if w in female_names]

In [None]:
cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))

cfd.plot()

### 4.2 A Pronouncing Dictionary

In [None]:
entries = nltk.corpus.cmudict.entries()
len(entries)

In [None]:
for entry in entries[42371:42379]:
    print(entry)

In [None]:
for word, pron in entries:
    if len(pron) == 3:
        ph1, ph2, ph3 = pron
        if ph1 == 'P' and ph3 == 'T':
            print(word, ph2, end=' ')

In [None]:
syllable = ['N', 'IH0', 'K', 'S']
[word for word, pron in entries if pron[-4:] == syllable]

In [None]:
[w for w, pron in entries if pron[-1] == 'M' and w[-1] == 'n']

In [None]:
sorted(set(w[:2] for w, pron in entries if pron[0] == 'N' and w[0] != 'n'))

In [None]:
def stress(pron):
    return [char for phone in pron for char in phone if char.isdigit()]
[w for w, pron in entries if stress(pron) == ['0', '2', '0', '1', '0']]

In [None]:
p3 = [(pron[0]+'-'+pron[2], word)
     for (word, pron) in entries
     if pron[0] == 'P' and len(pron) == 3]
cfd = nltk.ConditionalFreqDist(p3)
for template in sorted(cfd.conditions()):
    if len(cfd[template]) > 10:
        words = sorted(cfd[template])
        wordstring = ' '.join(words)
        print(template, wordstring[:70] + "...")
        

In [None]:
prondict = nltk.corpus.cmudict.dict()
prondict['fire']

In [None]:
prondict['blog']

In [None]:
text = ['natural', 'language', 'processing']
[ph for w in text for ph in prondict[w][0]]

### 4.3 Comparative Wordlists

In [None]:
from nltk.corpus import swadesh
swadesh.fileids()

In [None]:
swadesh.words('en')

In [None]:
fr2en = swadesh.entries(['fr', 'en'])
fr2en

In [None]:
translate = dict(fr2en)
translate['chien']

In [None]:
translate['jeter']

In [None]:
de2en = swadesh.entries(['de','en'])
es2en = swadesh.entries(['es', 'en'])
translate.update(dict(de2en))
translate.update(dict(es2en))
translate['Hund']

In [None]:
translate['perro']

In [None]:
languages = ['en', 'de', 'nl', 'es', 'fr', 'pt', 'la']
for i in [139, 140, 141, 142]:
    print(swadesh.entries(languages)[i])

### 4.4 Shoebox and Toolbox Lexicons

In [None]:
from nltk.corpus import toolbox
toolbox.entries('rotokas.dic')

## 5 WordNet

### 5.1 Senses and Synonyms

In [None]:
from nltk.corpus import wordnet as wn
wn.synsets('motorcar')

In [None]:
wn.synset('car.n.01').lemma_names()

In [None]:
wn.synset('car.n.01').examples()

In [None]:
wn.synset('car.n.01').lemmas()

In [None]:
wn.lemma('car.n.01.automobile')

In [None]:
wn.lemma('car.n.01.automobile').synset()

In [None]:
wn.lemma('car.n.01.automobile').name()

In [None]:
wn.synsets('car')

In [None]:
for synset in wn.synsets('car'):
    print(synset.lemma_names())

In [None]:
wn.lemmas('car')

### 5.2 The WordNet Hierarchy

In [None]:
motorcar = wn.synset('car.n.01')
types_of_motorcar = motorcar.hyponyms()
types_of_motorcar[0]

In [None]:
sorted(lemma.name() for synset in types_of_motorcar for lemma in synset.lemmas())

In [None]:
motorcar.hypernyms()

In [None]:
paths = motorcar.hypernym_paths()
len(paths)

In [None]:
[synset.name() for synset in paths[0]]

In [None]:
[synset.name() for synset in paths[1]]

In [None]:
motorcar.root_hypernyms()

### 5.3 More Lexical Relations

In [None]:
wn.synset('tree.n.01').part_meronyms()

In [None]:
wn.synset('tree.n.01').substance_meronyms()

In [None]:
wn.synset('tree.n.01').member_holonyms()

In [None]:
for synset in wn.synsets('mint', wn.NOUN):
    print(synset.name() + ':', synset.definition())

In [None]:
wn.synset('mint.n.04').part_holonyms()

In [None]:
wn.synset('mint.n.04').substance_holonyms()

In [None]:
wn.synset('walk.v.01').entailments()

In [None]:
wn.synset('eat.v.01').entailments()

In [None]:
wn.synset('tease.v.03').entailments()

In [None]:
wn.lemma('supply.n.02.supply').antonyms()

In [None]:
wn.lemma('rush.v.01.rush').antonyms()

In [None]:
wn.lemma('horizontal.a.01.horizontal').antonyms()

In [None]:
wn.lemma('staccato.r.01.staccato').antonyms()

### 5.4 Semantic Similarity

In [None]:
right = wn.synset('right_whale.n.01')

In [None]:
orca = wn.synset('orca.n.01')

In [None]:
minke = wn.synset('minke_whale.n.01')

In [None]:
tortoise = wn.synset('tortoise.n.01')

In [None]:
novel = wn.synset('novel.n.01')

In [None]:
right.lowest_common_hypernyms(minke)

In [None]:
right.lowest_common_hypernyms(orca)

In [None]:
right.lowest_common_hypernyms(tortoise)

In [None]:
right.lowest_common_hypernyms(novel)

In [None]:
wn.synset('baleen_whale.n.01').min_depth()

In [None]:
wn.synset('whale.n.02').min_depth()

In [None]:
wn.synset('vertebrate.n.01').min_depth()

In [None]:
wn.synset('entity.n.01').min_depth()

In [None]:
right.path_similarity(minke)

In [None]:
right.path_similarity(orca)

In [None]:
right.path_similarity(tortoise)

In [None]:
right.path_similarity(novel)

## 8 Exercises

**Challenge:** Look at the list of exercises in the NLTK book and see if there is one that you can do in the cell below. If this still feels like too much, which one would you be interested in trying?