https://github.com/japerk/nltk3-cookbook

In [1]:
import nltk
from nltk.tokenize import sent_tokenize

para = '''Hello World. It's good to see you. Thanks for buying this book.'''

In [2]:
nltk.__version__

'3.4'

In [3]:
sent_tokenize(para)

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

### Tokenizing sentences into words
The first tokenizer used below is an instance of the TreebankWordTokenizer class. It separates words using spaces and punctuation, following conventions found in the Penn Treebank corpus. One of the most significant conventions is to separate contractions.

PunktWordTokenizer splits on punctuation but keeps it with the word instead of creating separate tokens:

    ['Can', "'t", 'is', 'a', 'contraction.']

WordPunctTokenizer splits all punctuation into separate tokens.

In [4]:
from nltk.tokenize import TreebankWordTokenizer
from nltk.tokenize import WordPunctTokenizer

tree_tokenizer = TreebankWordTokenizer()
word_tokenizer = WordPunctTokenizer()
print('''Original sentence = "Can't is a contraction"''')
print('treebank word tokenizer= ', tree_tokenizer.tokenize("Can't is a contraction"))
print('word tokenizer= ', word_tokenizer.tokenize("Can't is a contraction"))


Original sentence = "Can't is a contraction"
treebank word tokenizer=  ['Ca', "n't", 'is', 'a', 'contraction']
word tokenizer=  ['Can', "'", 't', 'is', 'a', 'contraction']
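
The two splitting conventions seen in this output can be approximated with the standard-library re module. The sketch below is an illustration only, not NLTK's implementation: the function names are hypothetical, and the Treebank-style rule handles just the n't contraction from this example.

```python
import re


def split_like_treebank(text):
    """Simplified Treebank-style rule: detach n't, then split on whitespace."""
    # Insert a space before n't so the contraction becomes its own token.
    return re.sub(r"(\w)(n't)\b", r"\1 \2", text).split()


def split_like_wordpunct(text):
    """Runs of word characters and runs of punctuation become separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text)


sentence = "Can't is a contraction"
print(split_like_treebank(sentence))
print(split_like_wordpunct(sentence))
```

For this sentence the two helpers reproduce the token lists printed above, which makes the difference easy to see: one keeps n't together as a unit, the other isolates the apostrophe.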


### Tokenizing sentences using regular expressions

In [5]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")  # raw string so \w reaches the regex engine unchanged

print('using regex= ', tokenizer.tokenize("Can't is a contraction."))

using regex=  ["Can't", 'is', 'a', 'contraction']


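A simple whitespace tokenizer: with gaps=True, the pattern describes the gaps between tokens rather than the tokens themselves. If we used gaps=False, the pattern would instead be used to identify the tokens.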

In [6]:
white_tokenizer = RegexpTokenizer(r'\s+', gaps=True)
print('''Original sentence = "Can't is a contraction"''')
print('treebank word tokenizer= ', tree_tokenizer.tokenize("Can't is a contraction"))
print('word tokenizer= ', word_tokenizer.tokenize("Can't is a contraction"))
print('using regex= ', tokenizer.tokenize("Can't is a contraction."))
print('this is tokenizing using regex whitespace= ', white_tokenizer.tokenize("Can't is a contraction."))

Original sentence = "Can't is a contraction"
treebank word tokenizer=  ['Ca', "n't", 'is', 'a', 'contraction']
word tokenizer=  ['Can', "'", 't', 'is', 'a', 'contraction']
using regex=  ["Can't", 'is', 'a', 'contraction']
this is tokenizing using regex whitespace=  ["Can't", 'is', 'a', 'contraction.']
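
The difference between a pattern that matches tokens and a pattern that matches gaps can be sketched with the standard-library re module. This is only an approximation of the two modes, under the assumption that matching tokens behaves like re.findall and matching gaps behaves like re.split; the helper names are hypothetical and are not part of NLTK.

```python
import re


def tokenize_by_pattern(text, pattern):
    """Treat the pattern as a description of the tokens (like gaps=False)."""
    return re.findall(pattern, text)


def tokenize_by_gaps(text, pattern):
    """Treat the pattern as a description of the separators (like gaps=True)."""
    # Drop empty strings that appear when the text starts or ends with a gap.
    return [token for token in re.split(pattern, text) if token]


sentence = "Can't is a contraction."
print(tokenize_by_pattern(sentence, r"[\w']+"))
print(tokenize_by_gaps(sentence, r"\s+"))
```

Splitting on whitespace keeps the final period attached to the last word, while matching word characters drops it, which is the same difference visible in the last two output lines above.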


### Train on raw text to produce a custom sentence tokenizer

NLTK provides the **PunktSentenceTokenizer class**, which you can train on raw text to produce a custom sentence tokenizer.

In [19]:
# page 14
from nltk.tokenize import PunktSentenceTokenizer
from nltk.corpus import webtext

# The corpus reader resolves the file ID relative to the webtext corpus directory.
text = webtext.raw('overheard.txt')
print('text object = ', type(text))
for line in text[:100].split('\n'):  # preview the first 100 characters of the raw string
    print(line)

text object =  <class 'str'>
White guy: So, do you have any plans for this evening?
Asian girl: Yeah, being angry!
White guy: Oh,


The PunktSentenceTokenizer class uses an unsupervised learning algorithm to learn what constitutes a sentence break.
Passing training text to the constructor creates a custom PunktSentenceTokenizer based on that input, in this case the long raw string stored in the variable 'text'.
The specific technique used here is called sentence boundary detection. It works by counting punctuation and tokens that commonly end a sentence, such as a period or newline, and then using the resulting frequencies to decide what the sentence boundaries should actually look like.
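
The frequency-counting idea can be illustrated with a toy example. The sketch below is not the Punkt algorithm and does not use NLTK: it only counts how often each word appears with a trailing period, on the assumption that a word which always carries a period is more likely an abbreviation than a sentence end. The function name, the min_count threshold, and the sample text are all made up for illustration.

```python
from collections import Counter


def frequent_period_words(text, min_count=2):
    """Return words that occur at least min_count times and always end in a period."""
    total = Counter()
    with_period = Counter()
    for token in text.split():
        word = token.rstrip(".").lower()
        if not word:
            continue  # skip tokens made only of periods
        total[word] += 1
        if token.endswith("."):
            with_period[word] += 1
    return sorted(
        word
        for word, count in total.items()
        if count >= min_count and with_period[word] == count
    )


sample = "Dr. Smith met Dr. Jones. Jones left. Smith stayed with Dr. Lee."
print(frequent_period_words(sample))
```

In this sample only "dr" qualifies, so a period after it would not be treated as a sentence break. "jones" appears both with and without a period, which is the pattern expected of an ordinary word that sometimes ends a sentence.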

In [22]:
sent_tokenizer = PunktSentenceTokenizer(text)  # train a custom tokenizer on the raw text


In [34]:
custom_sents = sent_tokenizer.tokenize(text)

In [27]:
from nltk.tokenize import sent_tokenize

In [41]:
default_sents = sent_tokenize(text)

In [43]:
print('''custom tokenizer results=\n''', custom_sents[0], '\n', custom_sents[678])

print('''\ndefault tokenizer=\n''', default_sents[0], '\n', default_sents[678])

custom tokenizer results=
 White guy: So, do you have any plans for this evening? 
 Girl: But you already have a Big Mac...

default tokenizer=
 White guy: So, do you have any plans for this evening? 
 Girl: But you already have a Big Mac...
Hobo: Oh, this is all theatrical.
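
In the output above, index 678 of the default result still holds two speakers' lines in one sentence, while the custom result at the same index holds only the first of them. To find where two sentence lists start to diverge without guessing an index, a small helper along these lines may be useful. The function name is hypothetical, and the two short lists are hand-built from the lines printed above rather than taken from the real tokenizer results.

```python
def first_difference(left, right):
    """Return the first index at which two lists differ, or None if they are equal."""
    for index, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return index
    # One list may simply be longer than the other.
    if len(left) != len(right):
        return min(len(left), len(right))
    return None


# Hand-built stand-ins for the two tokenizer results.
custom = [
    "White guy: So, do you have any plans for this evening?",
    "Girl: But you already have a Big Mac...",
    "Hobo: Oh, this is all theatrical.",
]
default = [
    "White guy: So, do you have any plans for this evening?",
    "Girl: But you already have a Big Mac...\nHobo: Oh, this is all theatrical.",
]
print(first_difference(custom, default))
```

With these stand-in lists the helper reports index 1, the position where the default-style list keeps two lines together.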


***Example of reading overheard.txt directly instead of using the raw() corpus method.***

In [44]:
with open('/Users/alessandropiccolo/nltk_data/corpora/webtext/overheard.txt',
          encoding='ISO-8859-2') as f:
    text = f.read()  # reuse the name 'text' so the following cells train on this string

In [45]:
custom2_tokenizer = PunktSentenceTokenizer(text)


In [55]:
sents = custom2_tokenizer.tokenize(text)

In [56]:
sents[0]

'White guy: So, do you have any plans for this evening?'