# NLTK — Additional Examples

http://www.nltk.org/book/ch03.html

### Accessing Text Corpora and Lexical Resources

NLTK already gives you access to many texts, including the Gutenberg Corpus, Web and Chat Text, the Brown Corpus, the Reuters Corpus, the Inaugural Address Corpus, various annotated corpora (e.g., WordNet), as well as corpora in languages other than English.

The following is a simple example that shows how you can use a corpus (in this case the **Gutenberg Corpus**).

If you are interested in other corpora, see the following page for example code.

> http://www.nltk.org/book/ch02.html

In [51]:
import nltk
from nltk.corpus import gutenberg
tokens = gutenberg.words('bible-kjv.txt')   # get the words of the King James Bible. It returns a list-like sequence of tokens.
#print(tokens[0:50])
kjv = nltk.Text(tokens)                     # Convert the sequence of tokens into an nltk.Text object.
print("{}\n".format(kjv[25:36]))            # print the tokens at indexes 25 through 35.
kjv.concordance("beget")                    # create concordances with 'beget'


['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']

Displaying 10 of 10 matches:
xceedingly ; twelve princes shall he beget , and I will make him a great nation
jealous God . 4 : 25 When thou shalt beget children , and children ' s children
 cast his fruit . 28 : 41 Thou shalt beget sons and daughters , but thou shalt 
l issue from thee , which thou shalt beget , shall they take away ; and they sh
 is an evil disease . 6 : 3 If a man beget an hundred children , and live many 
l issue from thee , which thou shalt beget , shall they take away ; and they sh
of them ; 29 : 6 Take ye wives , and beget sons and daughters ; and take wives 
, saith the Lord GOD . 18 : 10 If he beget a son that is a robber , a shedder o
 upon him . 18 : 14 Now , lo , if he beget a son , that seeth all his father ' 
that sojourn among you , which shall beget children among you : and they shall 


In [33]:
import nltk                              # ask Python to load the NLTK package
from nltk.corpus import gutenberg        # from the nltk.corpus module, import the 'gutenberg' corpus reader
gutenberg.fileids()                      # list the text files included in the gutenberg corpus

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
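
The file IDs listed above appear to follow a `prefix-title.txt` pattern, so they can be grouped with ordinary string handling. The following is a minimal sketch, not part of NLTK: the helper name `group_by_prefix` is hypothetical, and the list is a hand-copied subset of the output above rather than the result of calling `gutenberg.fileids()`.

```python
from collections import defaultdict

# A hand-copied subset of the file IDs shown above (assumption: not fetched from NLTK).
fileids = [
    "austen-emma.txt",
    "austen-persuasion.txt",
    "austen-sense.txt",
    "bible-kjv.txt",
    "shakespeare-caesar.txt",
    "shakespeare-hamlet.txt",
    "shakespeare-macbeth.txt",
]

def group_by_prefix(ids):
    """Map the part before the first '-' to the list of titles that follow it."""
    groups = defaultdict(list)
    for fileid in ids:
        stem = fileid.rsplit(".", 1)[0]          # drop the '.txt' extension
        prefix, _, title = stem.partition("-")   # split at the first hyphen
        groups[prefix].append(title)
    return dict(groups)

by_prefix = group_by_prefix(fileids)
print(by_prefix["austen"])        # the three Austen titles
print(sorted(by_prefix))          # the distinct prefixes
```

Note that the prefix is usually an author's surname but not always (`bible` is not an author), so treat the grouping as a convenience only.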

## Converting a text into an nltk.Text object

If you convert your own text (from a file) into an nltk.Text object, you can do a few things easily. One of the most useful tools is the concordance tool, which shows every occurrence of a word together with its surrounding context. The following is an example that shows how you can use the concordance tool on your text.

I have also included a few more useful tools below in separate code cells.

In [11]:
import nltk
from nltk.tokenize import word_tokenize   

fin = open("text/clinton_dnc_speech_2016.txt")   # open a file
tokens = word_tokenize(fin.read())               # tokenize the text in the file
words = [w.lower() for w in tokens]              # 'lowercase' all the words
text = nltk.Text(words)                          # create an nltk.Text object
text.concordance('our')                          # get concordances for 'our'
print("--")
text.collocations()                              # collocations() prints its results itself and returns None

Displaying 25 of 76 matches:
me . thank you for bringing marc into our family , and charlotte and aidan into
ight . and to those of you who joined our campaign this week , thank you . what
use of his friendship . we heard from our terrific vice president , the one-and
n . he spoke from his big heart about our party 's commitment to working people
 lady michelle obama reminded us that our children are watching , and the presi
e 'll make the whole country proud as our vice president . and i want to thank 
who threw their hearts and souls into our primary . you 've put economic and so
now , i 've heard you . your cause is our cause . our country needs your ideas 
heard you . your cause is our cause . our country needs your ideas , energy , a
on . that is the only way we can turn our progressive platform into real change
e to philadelphia – the birthplace of our nation – because what happened in thi
hat took courage . they had courage . our founders embraced the enduring truth 
espect are 
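
To make the idea behind a concordance concrete, here is a minimal keyword-in-context sketch in plain Python. It is not how `nltk.Text.concordance` is implemented: the function name `kwic` and the `window` parameter are assumptions made for illustration, and the sketch uses a window of tokens where NLTK formats each line to a character width. The sample tokens are taken from the output above.

```python
def kwic(tokens, keyword, window=4):
    """Return (left, keyword, right) context tuples for every hit of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])     # up to `window` tokens before
            right = " ".join(tokens[i + 1:i + 1 + window])    # up to `window` tokens after
            hits.append((left, tok, right))
    return hits

# Tokens copied from one of the concordance lines above.
tokens = "your cause is our cause . our country needs your ideas".split()

for left, word, right in kwic(tokens, "our", window=3):
    print(f"{left:>20} {word} {right}")   # right-align the left context so keywords line up
```

Right-aligning the left context is what makes the keyword appear in a single column, as it does in the NLTK output.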

In [82]:
# More nltk.Text functions

import nltk
from nltk.tokenize import word_tokenize   
fin = open("text/clinton_dnc_speech_2016.txt")   # open a file
fin.seek(0)
tokens = word_tokenize(fin.read())               # tokenize the text in the file
words = [w.lower() for w in tokens]              # 'lowercase' all the words
text = nltk.Text(words)                          # create an nltk.Text object

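print(text.similar('energy'))   # similar() prints words that appear in contexts like those of 'energy', then returns None (hence the "None" in the output)
print("---")
print(text[3])                  # 4th token (index 3)
print("---")
print(text.count('our'))        # number of times 'our' occurs
print("---")
print(len(set(text)))           # set(text) is the set of unique tokens (types), so this is the number of types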


lifetimes talents cause
None
---
thank
---
76
---
1378
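
The counts above (occurrences of one word, and the number of distinct types) are often combined into a type-token ratio as a rough measure of vocabulary variety. The sketch below works on a plain Python list, since the operations used (`len`, `count`, `set`) behave the same way on a list as they do on the `nltk.Text` object in the cell above. The sample is the first 13 tokens of the speech as they appear in the n-gram output later in this notebook; the variable names are illustrative.

```python
# First 13 lowercased tokens of the speech, copied by hand.
tokens = "thank you ! thank you all very much ! thank you for that".split()

n_tokens = len(tokens)              # total number of tokens, repeats included
types = set(tokens)                 # unique tokens (types)
n_types = len(types)

print(tokens.count("thank"))        # occurrences of one word
print(n_tokens, n_types)            # tokens vs. types
print(round(n_types / n_tokens, 3)) # type-token ratio
```

On such a short sample the ratio is not meaningful; it is only shown to make the arithmetic explicit.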


In [88]:
# more nltk.Text functions...

import nltk
from nltk.tokenize import word_tokenize   
fin = open("text/clinton_dnc_speech_2016.txt")   # open a file
fin.seek(0)
tokens = word_tokenize(fin.read())               # tokenize the text in the file
words = [w.lower() for w in tokens]              # 'lowercase' all the words
text = nltk.Text(words)                          # create an nltk.Text object

res = [w for w in text if w.isalpha()]           # keep only tokens made up entirely of alphabetic characters
# print(res)
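
It can be useful to see what the `isalpha()` filter actually discards. The following small sketch uses tokens copied from the concordance output earlier in this notebook; the variable names `kept` and `dropped` are illustrative only.

```python
# Tokens as word_tokenize produced them in the output above (copied by hand).
tokens = ["our", "party", "'s", "commitment", ",", "philadelphia", "–", "you", "'ve"]

kept = [w for w in tokens if w.isalpha()]         # purely alphabetic tokens
dropped = [w for w in tokens if not w.isalpha()]  # punctuation and clitics such as 's and 've

print(kept)
print(dropped)
```

Because contraction pieces like `'s` and `'ve` contain an apostrophe, they are dropped along with the punctuation, which may or may not be what you want.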

In [101]:
# word frequencies with nltk.FreqDist

import nltk
from nltk.tokenize import word_tokenize   
fin = open("text/clinton_dnc_speech_2016.txt")   # open a file
fin.seek(0)
tokens = word_tokenize(fin.read())               # tokenize the text in the file
words = [w.lower() for w in tokens]              # 'lowercase' all the words
text = nltk.Text(words)                          # create an nltk.Text object

fd = nltk.FreqDist(text)             # count how often each token occurs in the text
print( fd['our'] )                   # number of occurrences of 'our'
#print( fd.keys() )                   # the unique tokens (types)
#print( fd.items() )                  # (token, count) pairs, one per type (punctuation tokens included)
#fd.plot(10, cumulative=False)        # generate a chart of the 10 most frequent words
fd.freq('the')                        # relative frequency of 'the' (count / total tokens); not displayed unless printed
res = [len(w) for w in text]          # list of word lengths
#print(res)



76
                 .                  ,                and                the                 to                 we                  a                 of                you                  i                 in                our               that                 's                for                 it                 he                 is                n't               will               from                are             people                who               with                all                not                 do               have                 us                 be                 as                but                 so                 my                  –            america            country               just                can               what                 at               when                  ?               this                 on              trump               they               your          president                now                 me          
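
`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the core ideas can be sketched with the standard library alone. The example below is a rough illustration on a tiny hand-copied sample, not a replacement for `FreqDist`; the relative frequency is computed by hand as count divided by total tokens, which is what `fd.freq()` reports.

```python
from collections import Counter

# First 13 lowercased tokens of the speech, copied by hand.
tokens = "thank you ! thank you all very much ! thank you for that".split()

counts = Counter(tokens)              # token -> number of occurrences
total = sum(counts.values())          # total number of tokens

print(counts["thank"])                # absolute count of one token
print(counts["thank"] / total)        # relative frequency of that token
print(counts.most_common(3))          # most frequent tokens with their counts
print(counts["missing"])              # unseen tokens count as 0 instead of raising KeyError
```

The same indexing and `most_common()` calls work on an `nltk.FreqDist` object.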

### N-grams
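
An n-gram is a run of n consecutive tokens: bigrams have n = 2, trigrams n = 3. As a rough sketch of what the NLTK helpers used in the cell below compute, the following plain-Python function slides a window of size n over a token list. The name `simple_ngrams` is hypothetical and this is not NLTK's implementation; the sample is the first six tokens of the speech.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so a list of length L yields L - n + 1 tuples.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = "thank you ! thank you all".split()

print(simple_ngrams(tokens, 2))       # bigrams
print(simple_ngrams(tokens, 3))       # trigrams
print(len(simple_ngrams(tokens, 5)))  # number of 5-grams in a 6-token list
```

If n is larger than the number of tokens, the result is simply an empty list.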


In [125]:
import pprint                                  # pretty printing for debugging
pp = pprint.PrettyPrinter(indent=4)            # create a pretty-printing object used for debugging

import nltk
from nltk.tokenize import word_tokenize   
fin = open("text/clinton_dnc_speech_2016.txt")   # open a file
fin.seek(0)
tokens = word_tokenize(fin.read())               # tokenize the text in the file
words = [w.lower() for w in tokens]              # 'lowercase' all the words
text = nltk.Text(words)                          # create an nltk.Text object

res = nltk.bigrams(text)
pp.pprint(list(res)[:12])
print("")
res = nltk.trigrams(text)
pp.pprint(list(res)[:12])
print("")
res = nltk.ngrams(text, 5)
pp.pprint(list(res)[:12])


[   ('thank', 'you'),
    ('you', '!'),
    ('!', 'thank'),
    ('thank', 'you'),
    ('you', 'all'),
    ('all', 'very'),
    ('very', 'much'),
    ('much', '!'),
    ('!', 'thank'),
    ('thank', 'you'),
    ('you', 'for'),
    ('for', 'that')]

[   ('thank', 'you', '!'),
    ('you', '!', 'thank'),
    ('!', 'thank', 'you'),
    ('thank', 'you', 'all'),
    ('you', 'all', 'very'),
    ('all', 'very', 'much'),
    ('very', 'much', '!'),
    ('much', '!', 'thank'),
    ('!', 'thank', 'you'),
    ('thank', 'you', 'for'),
    ('you', 'for', 'that'),
    ('for', 'that', 'amazing')]

[   ('thank', 'you', '!', 'thank', 'you'),
    ('you', '!', 'thank', 'you', 'all'),
    ('!', 'thank', 'you', 'all', 'very'),
    ('thank', 'you', 'all', 'very', 'much'),
    ('you', 'all', 'very', 'much', '!'),
    ('all', 'very', 'much', '!', 'thank'),
    ('very', 'much', '!', 'thank', 'you'),
    ('much', '!', 'thank', 'you', 'for'),
    ('!', 'thank', 'you', 'for', 'that'),
    ('thank', 'you', 'for', 'th