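## 3.8 Segmentation
Chapter goals
1) Use NLTK functions to split text into sentences, and see what problems sentence segmentation runs into.
2) Walk through a word segmentation example and how the program works.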


## Sentence segmentation exercise: sent_tokenize

1)  Read in a text
2)  Split it into sentences with nltk.sent_tokenize
3)  Inspect the result



In [148]:
import nltk
import pprint
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')

sents = nltk.sent_tokenize(text)

pprint.pprint(sents[0:10])


['[The Man Who Was Thursday by G. K. Chesterton 1908]\n'
 '\n'
 'To Edmund Clerihew Bentley\n'
 '\n'
 'A cloud was on the mind of men, and wailing went the weather,\n'
 'Yea, a sick cloud upon the soul when we were boys together.',
 'Science announced nonentity and art admired decay;\n'
 'The world was old and ended: but you and I were gay;\n'
 'Round us in antic order their crippled vices came--\n'
 'Lust that had lost its laughter, fear that had lost its shame.',
 'Like the white lock of Whistler, that lit our aimless gloom,\n'
 'Men showed their own white feather as proudly as a plume.',
 'Life was a fly that faded, and death a drone that stung;\n'
 'The world was very old indeed when you and I were young.',
 'They twisted even decent sin to shapes not to be named:\n'
 'Men were ashamed of honour; but we were not ashamed.',
 'Weak if we were and foolish, not thus we failed, not thus;\n'
 'When that black Baal blocked the heavens he had no hymns from us\n'
 'Children we were--our for

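Notes
1) raw() returns the raw content of a corpus file as a single string
2) sent_tokenize splits text into sentences
3) pprint prints the result in a more readable layout, one list item per line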
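
The difference between `print` and `pprint` is easiest to see side by side. The sketch below reuses three lines from the output above as sample data; `pprint.pformat` is the standard-library call that returns the same layout as a string instead of printing it.

```python
import pprint

# Three lines taken from the output above, used as sample data
lines = [
    'Science announced nonentity and art admired decay;',
    'The world was old and ended: but you and I were gay;',
    'Round us in antic order their crippled vices came--',
]

print(lines)           # the whole list on one long line
pprint.pprint(lines)   # one list item per line once the list exceeds the line width

# pformat returns the pretty-printed text as a string instead of printing it
formatted = pprint.pformat(lines)
```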

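## Problems with sentence segmentation

1) English abbreviations contain periods, and such a period can be mistaken for the end of a sentence, so the text gets split there.
2) Other splits are simply wrong. In the output above, for instance, the title line, the dedication and the first two lines of verse end up in a single "sentence". Chapter 6 returns to sentence segmentation.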
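
The abbreviation problem is easy to reproduce with a deliberately naive splitter. The sketch below is not how `nltk.sent_tokenize` works; `naive_sent_split` and the sample sentence are hypothetical, made up for illustration. It breaks after every `.`, `!` or `?` that is followed by whitespace, so the initials in an author's name are treated as sentence ends.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace (naive on purpose)."""
    return re.split(r'(?<=[.!?])\s+', text.strip())

# Hypothetical sample sentence containing the abbreviations "G." and "K."
sample = "The novel was written by G. K. Chesterton. It was published in 1908."

pieces = naive_sent_split(sample)
for piece in pieces:
    print(repr(piece))
```

The two real sentences come back as four pieces, because the periods after "G" and "K" look exactly like sentence ends to this rule.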

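## Word Segmentation

Section goal
1)  Understand how word segmentation is carried out

The sample text is four utterances written with no spaces between the words:
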
a.		doyouseethekitty

b.		seethedoggy

c.		doyoulikethekitty

d.		likethedoggy


## Defining a custom segmentation function, so text can be segmented however you choose

In [140]:
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == '1':
            words.append(text[last:i+1])
            last = i+1
    words.append(text[last:])
    return words

In [141]:

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"

seg1 = "0000000000000001000000000010000000000000000100000000000"

segment(text, seg1)

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

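## Notes

1) text is the string to segment.

2) segs marks the word boundaries. Each character of segs lines up with a character of text; a '1' means a word ends after that character. segs is one character shorter than text, because the final character always ends a word.

3) Start by scanning segs:
0000000000000001000000000010000000000000000100000000000

4) The first '1' is at segs[15].

5) So the first word is 16 characters long.

6) The first word is therefore text[0:16].

7) The next word starts at index 16; the next '1' is at segs[26], so that word is text[16:27].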
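
The walk-through above can be sketched as a helper that lists the slice each word occupies. `slice_bounds` is a hypothetical name introduced here for illustration; it is not part of the notebook's `segment` function, and the example uses only the first two utterances to keep the output short.

```python
def slice_bounds(segs):
    """Return the (start, end) slice for each word implied by a boundary string."""
    # A '1' at index i means a word ends after text[i], so the cut falls at i + 1.
    cuts = [i + 1 for i, bit in enumerate(segs) if bit == '1']
    starts = [0] + cuts
    ends = cuts + [len(segs) + 1]   # the text is one character longer than segs
    return list(zip(starts, ends))

text = "doyouseethekittyseethedoggy"
segs = "0" * 15 + "1" + "0" * 10    # a single boundary, at segs[15]

for start, end in slice_bounds(segs):
    print(f"text[{start}:{end}] = {text[start:end]!r}")
```

The single '1' at index 15 gives the slices text[0:16] and text[16:27], the same two slices described in the notes.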

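## Defining a scoring function to compare segmentations: the lower the score, the better

Score = number of words in the segmented text + size of the lexicon, where the lexicon size is the total length of the distinct words plus one extra character per distinct word.

A large lexicon (many distinct words, or long ones) signals a poor segmentation: the text has barely been broken down at all.

## How the score is computed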

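Segmentation 1:
['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']

4 words + lexicon size 60 = 64

Segmentation 2:
['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']

16 words + lexicon size 32 = 48

Segmentation 3:
['doyou', 'see', 'thekitt', 'y', 'see', 'thedogg', 'y', 'doyou', 'like', 'thekitt', 'y', 'like', 'thedogg', 'y']

14 words + lexicon size 33 = 47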
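
The three totals can be checked directly from the word lists, without going through boundary strings. `score` below is a hypothetical helper for this check. The `+ 1` per distinct word is taken from the notebook's own scoring code; reading it as a one-character end-of-word marker in the lexicon is an assumption.

```python
def score(words):
    """Return (text_size, lexicon_size, total) for a list of segmented words."""
    text_size = len(words)                                    # number of tokens
    lexicon_size = sum(len(word) + 1 for word in set(words))  # distinct words, +1 each
    return text_size, lexicon_size, text_size + lexicon_size

seg1_words = ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
seg2_words = 'do you see the kitty see the doggy do you like the kitty like the doggy'.split()
seg3_words = 'doyou see thekitt y see thedogg y doyou like thekitt y like thedogg y'.split()

for name, words in [('1', seg1_words), ('2', seg2_words), ('3', seg3_words)]:
    print(name, score(words))
```

Segmentation 3 scores one point lower than the linguistically correct segmentation 2, so this score is only a rough proxy for segmentation quality.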


In [109]:
def evaluate(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size


In [138]:
import pprint
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"

seg1 = "0000000000000001000000000010000000000000000100000000000"

seg2 = "0100100100100001001001000010100100010010000100010010000"

seg3 = "0000100100000011001000000110000100010000001100010000001"

print(evaluate(text, seg1))
print(evaluate(text, seg2))
print(evaluate(text, seg3))

64
48
47


In [152]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
words = segment(text, seg1)
print(words)
text_size = len(words)
print(text_size)

lexicon_size = sum(len(word) + 1 for word in set(words))
print(lexicon_size)

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
4
60


## Using a program to search for the best segmentation: the segmentation with the lowest score is taken as the best


In [142]:
from random import randint

def flip(segs, pos):
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]

def flip_n(segs, n):
    for i in range(n):
        segs = flip(segs, randint(0, len(segs)-1))
    return segs

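def anneal(text, segs, iterations, cooling_rate):
    # Start hot: the temperature equals the number of boundary positions (55 here)
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            # The hotter it is, the more boundary bits are flipped per guess
            guess = flip_n(segs, round(temperature))
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs

        # Cool down so that later rounds make smaller changes
        temperature = temperature / cooling_rate

        print(evaluate(text, segs), segment(text, segs))
    print()
    return segs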

In [153]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
anneal(text, seg1, 5000, 1.2)

64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
63 ['doyouse', 'et', 'hek', 'ittyseet', 'hedoggy', 'doyoulik', 'et', 'hek', 'i', 'ttyliket', 'hedoggy']
63 ['doyouse', 'et', 'hek', 'ittyseet', 'hedoggy', 'doyoulik', 'et', 'hek', 'i', 'ttyliket', 'hedoggy']
62 ['doyo', 'u', 'se', 'et', 'heki', 'tty', 'se', 'et', 'hedoggy', 'doyoulik', 'et', 'heki', 'ttyliket', 'hedoggy']
61 ['doy', 'o', 'u', 'se', 'et', 'heki', 'tty', 'se', 'et', 'hedoggy

'0000101010000001010100000010000100101000000100101000000'
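
Because the search is random, every run prints a different trace. The sketch below is a self-contained variant that takes a seed and returns the final score instead of printing each round. `anneal_quiet`, its `seed` parameter and the bit-list version of `flip_n` are assumptions added here for illustration, not part of the notebook's code.

```python
import random

def segment(text, segs):
    """Split text after every position where segs has a '1'."""
    words, last = [], 0
    for i, bit in enumerate(segs):
        if bit == '1':
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words

def evaluate(text, segs):
    """Score = number of tokens + lexicon size (lower is better)."""
    words = segment(text, segs)
    return len(words) + sum(len(word) + 1 for word in set(words))

def flip_n(segs, n, rng):
    """Flip n randomly chosen boundary bits, using the supplied generator."""
    bits = list(segs)
    for _ in range(n):
        pos = rng.randrange(len(bits))
        bits[pos] = '1' if bits[pos] == '0' else '0'
    return ''.join(bits)

def anneal_quiet(text, segs, iterations, cooling_rate, seed=0):
    """Same search as anneal, but seeded and silent; returns (score, segs)."""
    rng = random.Random(seed)          # private generator, so runs are repeatable
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for _ in range(iterations):
            guess = flip_n(segs, round(temperature), rng)
            score = evaluate(text, guess)
            if score < best:           # keep a guess only if it strictly improves
                best, best_segs = score, guess
        segs = best_segs
        temperature /= cooling_rate
    return evaluate(text, segs), segs

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0" * 15 + "1" + "0" * 10 + "1" + "0" * 16 + "1" + "0" * 11

final_score, final_segs = anneal_quiet(text, seg1, iterations=500, cooling_rate=1.2)
print(final_score, segment(text, final_segs))
```

A guess is kept only when it lowers the score, so the result can never be worse than the starting score of 64. How far below 64 it gets depends on the seed and on the number of iterations.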