# Word and Sentence Tokenization with NLTK

This notebook uses `word_tokenize` and `sent_tokenize` from `nltk.tokenize` to split a Python string, the first scene of Monty Python's Holy Grail, into sentences and words.

### Import functions and download the required resource

In [21]:
# Import the tokenizer functions and the resource downloader
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk import download

# Download the Punkt models that the tokenizers rely on
print('DOWNLOADING RESOURCES')
download('punkt')

DOWNLOADING RESOURCES
[nltk_data] Downloading package punkt to /Users/richard/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Load the first scene of Monty Python's Holy Grail

In [22]:
# Text to tokenize
scene_one = "SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there!  [clop clop clop] \nSOLDIER #1: Halt!  Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.\nSOLDIER #1: What?  Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You're using coconuts!\nARTHUR: What?\nSOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.\nARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--\nSOLDIER #1: Where'd you get the coconuts?\nARTHUR: We found them.\nSOLDIER #1: Found them?  In Mercea?  The coconut's tropical!\nARTHUR: What do you mean?\nSOLDIER #1: Well, this is a temperate zone.\nARTHUR: The swallow may fly south with the sun or the house martin or the plover may seek warmer climes in winter, yet these are not strangers to our land?\nSOLDIER #1: Are you suggesting coconuts migrate?\nARTHUR: Not at all.  They could be carried.\nSOLDIER #1: What?  A swallow carrying a coconut?\nARTHUR: It could grip it by the husk!\nSOLDIER #1: It's not a question of where he grips it!  It's a simple question of weight ratios!  A five ounce bird could not carry a one pound coconut.\nARTHUR: Well, it doesn't matter.  Will you go and tell your master that Arthur from the Court of Camelot is here.\nSOLDIER #1: Listen.  In order to maintain air-speed velocity, a swallow needs to beat its wings forty-three times every second, right?\nARTHUR: Please!\nSOLDIER #1: Am I right?\nARTHUR: I'm not interested!\nSOLDIER #2: It could be carried by an African swallow!\nSOLDIER #1: Oh, yeah, an African swallow maybe, but not a European swallow.  That's my point.\nSOLDIER #2: Oh, yeah, I agree with that.\nARTHUR: Will you ask your master if he wants to join my court at Camelot?!\nSOLDIER #1: But then of course a-- African swallows are non-migratory.\nSOLDIER #2: Oh, yeah...\nSOLDIER #1: So they couldn't bring a coconut back anyway...  [clop clop clop] \nSOLDIER #2: Wait a minute!  Supposing two swallows carried it together?\nSOLDIER #1: No, they'd have to have it on a line.\nSOLDIER #2: Well, simple!  They'd just use a strand of creeper!\nSOLDIER #1: What, held under the dorsal guiding feathers?\nSOLDIER #2: Well, why not?\n"
print('\nSCENE ONE')
print(scene_one)


SCENE ONE
SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!  [clop clop clop] 
SOLDIER #1: Halt!  Who goes there?
ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.  King of the Britons, defeator of the Saxons, sovereign of all England!
SOLDIER #1: Pull the other one!
ARTHUR: I am, ...  and this is my trusty servant Patsy.  We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.  I must speak with your lord and master.
SOLDIER #1: What?  Ridden on a horse?
ARTHUR: Yes!
SOLDIER #1: You're using coconuts!
ARTHUR: What?
SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.
ARTHUR: So?  We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--
SOLDIER #1: Where'd you get the coconuts?
ARTHUR: We found them.
SOLDIER #1: Found them?  In Mercea?  The coconut's tropical!
ARTHUR: What do you mean?
SOLDIER #1: Well, this is a tempera

### Split the scene into sentences

In [27]:
# Split scene_one into a list of sentences
sentences = sent_tokenize(scene_one)
print("\nSPLIT SCENE INTO SENTENCES")
for i, sentence in enumerate(sentences):
    print(f'[{i}] {sentence}')


SPLIT SCENE INTO SENTENCES
[0] SCENE 1: [wind] [clop clop clop] 
KING ARTHUR: Whoa there!
[1] [clop clop clop] 
SOLDIER #1: Halt!
[2] Who goes there?
[3] ARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot.
[4] King of the Britons, defeator of the Saxons, sovereign of all England!
[5] SOLDIER #1: Pull the other one!
[6] ARTHUR: I am, ...  and this is my trusty servant Patsy.
[7] We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot.
[8] I must speak with your lord and master.
[9] SOLDIER #1: What?
[10] Ridden on a horse?
[11] ARTHUR: Yes!
[12] SOLDIER #1: You're using coconuts!
[13] ARTHUR: What?
[14] SOLDIER #1: You've got two empty halves of coconut and you're bangin' 'em together.
[15] ARTHUR: So?
[16] We have ridden since the snows of winter covered this land, through the kingdom of Mercea, through--
SOLDIER #1: Where'd you get the coconuts?
[17] ARTHUR: We found them.
[18] SOLDIER #1: Found them?
[19
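
In the output above, sentences [0] and [16] each span two lines of the script, because a line break on its own is not treated as a sentence boundary. One possible workaround, assuming every script line is a separate utterance, is to split on line breaks first and then split each line into sentences. In this sketch the helper `split_lines_then_sentences` is a hypothetical name, and `naive_sentence_split` is a simplified regex stand-in so the example runs without NLTK data; in this notebook `sent_tokenize` could be passed as `sentence_splitter` instead.

```python
import re


def naive_sentence_split(text):
    """Stand-in splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def split_lines_then_sentences(text, sentence_splitter=naive_sentence_split):
    """Split text into lines first, then split each non-empty line into sentences."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            sentences.extend(sentence_splitter(line))
    return sentences


# First three lines of the scene, copied from scene_one
sample = ("SCENE 1: [wind] [clop clop clop] \n"
          "KING ARTHUR: Whoa there!  [clop clop clop] \n"
          "SOLDIER #1: Halt!  Who goes there?\n")

result = split_lines_then_sentences(sample)
for i, sentence in enumerate(result):
    print(f'[{i}] {sentence}')
```

With this approach the stage direction and King Arthur's first line end up in separate entries rather than being merged into one.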

### Tokenize the fourth sentence into words

In [28]:
# Tokenize the fourth sentence (index 3) into words and punctuation
tokenized_sent = word_tokenize(sentences[3])
print('\nTOKENIZE FOURTH SENTENCE')
for i, token in enumerate(tokenized_sent):
    print(f'[{i}] {token}')


TOKENIZE FOURTH SENTENCE
[0] ARTHUR
[1] :
[2] It
[3] is
[4] I
[5] ,
[6] Arthur
[7] ,
[8] son
[9] of
[10] Uther
[11] Pendragon
[12] ,
[13] from
[14] the
[15] castle
[16] of
[17] Camelot
[18] .
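
As the listing shows, punctuation marks such as `:`, `,` and `.` come back as tokens of their own. If only the words are wanted, one simple option is to keep the tokens that contain at least one letter or digit. The sketch below hard-codes the token list from the output above so it runs on its own; the variable name `words_only` is just illustrative.

```python
# Tokens copied from the output above (result of word_tokenize(sentences[3]))
tokenized_sent = ['ARTHUR', ':', 'It', 'is', 'I', ',', 'Arthur', ',', 'son',
                  'of', 'Uther', 'Pendragon', ',', 'from', 'the', 'castle',
                  'of', 'Camelot', '.']

# Keep tokens with at least one alphanumeric character; this drops pure
# punctuation but would keep tokens such as "air-speed" or "'re"
words_only = [token for token in tokenized_sent
              if any(ch.isalnum() for ch in token)]

print(len(tokenized_sent), 'tokens,', len(words_only), 'words')
print(words_only)
```

A stricter filter such as `token.isalpha()` would also discard hyphenated words and contraction fragments, so the right choice depends on the downstream task.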


### Build the set of unique tokens in the entire scene

In [29]:
# Build the set of unique tokens in the entire scene
unique_tokens = set(word_tokenize(scene_one))
print('\nSET OF UNIQUE TOKENS')
for i, unique_token in enumerate(unique_tokens):
    print(f'[{i}] {unique_token}')


SET OF UNIQUE TOKENS
[0] all
[1] Not
[2] ratios
[3] The
[4] servant
[5] clop
[6] Pull
[7] strand
[8] guiding
[9] [
[10] 'd
[11] since
[12] Found
[13] It
[14] warmer
[15] wings
[16] dorsal
[17] using
[18] Listen
[19] martin
[20] not
[21] Where
[22] on
[23] yeah
[24] castle
[25] do
[26] goes
[27] agree
[28] land
[29] there
[30] you
[31] 's
[32] are
[33] them
[34] European
[35] must
[36] ...
[37] ]
[38] two
[39] bring
[40] ,
[41] found
[42] tell
[43] horse
[44] coconuts
[45] What
[46] times
[47] house
[48] air-speed
[49] back
[50] coconut
[51] We
[52] grip
[53] mean
[54] sovereign
[55] its
[56] You
[57] No
[58] Whoa
[59] Well
[60] grips
[61] go
[62] a
[63] got
[64] winter
[65] Britons
[66] kingdom
[67] 're
[68] Will
[69] Saxons
[70] Oh
[71] In
[72] matter
[73] simple
[74] in
[75] be
[76] one
[77] son
[78] creeper
[79] second
[80] Am
[81] they
[82] beat
[83] get
[84] held
[85] King
[86] then
[87] have
[88] course
[89] if
[90] could
[91] wants
[92] with
[93] or
[94] migrate
[95] SOLDIER
[9
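
The indices in this listing carry no meaning: a Python `set` is unordered, and the iteration order of a set of strings can change from one run to the next. The set is also case-sensitive, which is why both `The` and `In` appear alongside their lowercase forms. The following sketch, which uses a small hand-written sample of tokens rather than the real `word_tokenize` output, shows one way to get a reproducible listing with `sorted` and to merge case variants by lowercasing.

```python
# Hand-picked sample of tokens in the style of the scene (assumption:
# not the actual output of word_tokenize(scene_one))
tokens = ['The', 'coconut', "'s", 'tropical', '!', 'the', 'coconuts',
          'The', 'swallow', 'coconut']

unique_tokens = set(tokens)

# sorted() returns a list in a stable, reproducible order
sorted_tokens = sorted(unique_tokens)

# Lowercasing before building the set merges 'The' and 'the'
unique_lower = sorted({token.lower() for token in tokens})

for i, token in enumerate(sorted_tokens):
    print(f'[{i}] {token}')
print(len(unique_tokens), 'case-sensitive vs', len(unique_lower), 'lowercased')
```

Note that the default sort compares code points, so punctuation and capitalized tokens come before lowercase ones.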