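### Assignment goal: understand how an N-gram model estimates probabilities from a text corpus

### Why is the Markov assumption needed to simplify the language-model computation?

Without any simplification, a language model factors the joint probability of a sentence with the chain rule of probability: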
$$
\begin{aligned}
&W = (W_1W_2W_3W_4…W_m) \\
&P(W_1,W_2,W_3,W_4,…,W_m) = P(W_1)*P(W_2|W_1)*P(W_3|W_1,W_2)*…*P(W_m|W_1,…,W_{m-1})
\end{aligned}
$$

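Why do we introduce the Markov assumption, under which each word depends only on the previous n-1 words, so that the conditional probability simplifies to: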
$$
P(W_m|W_1,W_2,W_3,…,W_{m-1}) = P(W_m|W_{m-n+1},W_{m-n+2},…,W_{m-1})
$$
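
For the bigram case (n = 2) used in the rest of this exercise, the simplification keeps only the immediately preceding word. The block below is a sketch of that special case together with the usual count-based estimate. $C(\cdot)$ denotes a count in the corpus and $W_0$ stands for the start-of-sentence marker; both are notation introduced here for illustration.

$$
\begin{aligned}
&P(W_m|W_1,W_2,…,W_{m-1}) = P(W_m|W_{m-1}) \\
&P(W_1,W_2,…,W_m) = \prod_{k=1}^{m} P(W_k|W_{k-1}) \\
&P(W_k|W_{k-1}) = \frac{C(W_{k-1}W_k)}{C(W_{k-1})}
\end{aligned}
$$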

In [138]:
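'''
Each word depends most strongly on the words nearest to it; the farther back a word is, the weaker the dependence, so dropping it changes the probability very little.
In addition, a long accumulated history may make the probability impossible to estimate (the exact string never appears in the corpus).
'''

'\nEach word depends most strongly on the words nearest to it; the farther back a word is, the weaker the dependence, so dropping it changes the probability very little.\nIn addition, a long accumulated history may make the probability impossible to estimate (the exact string never appears in the corpus).\n'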

### Judging whether a sentence is plausible under a bigram model

The following probabilities are given:
1. p(i|_start_) = 0.25
2. p(english|want) = 0.0011
3. p(food|english) = 0.5
4. p(_end_|food) = 0.68
5. p(want|_start_) = 0.25
6. p(english|i) = 0.0011

In [90]:
import numpy as np
import pandas as pd
words = ['i', 'want', 'to', 'eat', 'chinese', 'food', 'lunch', 'spend']
word_cnts = np.array([2533, 927, 2417, 746, 158, 1093, 341, 278]).reshape(1, -1)
df_word_cnts = pd.DataFrame(word_cnts, columns=words)
df_word_cnts

|   | i | want | to | eat | chinese | food | lunch | spend |
|---|---|---|---|---|---|---|---|---|
| 0 | 2533 | 927 | 2417 | 746 | 158 | 1093 | 341 | 278 |


In [6]:
# Bigram counts: each row is the previous word, each column is the current word
bigram_word_cnts = [[5, 827, 0, 9, 0, 0, 0, 2], [2, 0, 608, 1, 6, 6, 5, 1], [2, 0, 4, 686, 2, 0, 6, 211],
                    [0, 0, 2, 0, 16, 2, 42, 0], [1, 0, 0, 0, 0, 82, 1, 0], [15, 0, 15, 0, 1, 4, 0, 0],
                    [2, 0, 0, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0]]

df_bigram_word_cnts = pd.DataFrame(bigram_word_cnts, columns=words, index=words)
df_bigram_word_cnts

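|   | i | want | to | eat | chinese | food | lunch | spend |
|---|---|---|---|---|---|---|---|---|
| i | 5 | 827 | 0 | 9 | 0 | 0 | 0 | 2 |
| want | 2 | 0 | 608 | 1 | 6 | 6 | 5 | 1 |
| to | 2 | 0 | 4 | 686 | 2 | 0 | 6 | 211 |
| eat | 0 | 0 | 2 | 0 | 16 | 2 | 42 | 0 |
| chinese | 1 | 0 | 0 | 0 | 0 | 82 | 1 | 0 |
| food | 15 | 0 | 15 | 0 | 1 | 4 | 0 | 0 |
| lunch | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| spend | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |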


The table above shows that when the previous word (row) is "want", the current word (column) is "to" 608 times in the corpus.
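
One way a count table like this could be tallied from tokenized text is sketched below. The toy sentences and the names `toy_sentences` and `pair_counts` are made up for illustration; they are not the corpus behind the table above.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized (an assumption for illustration).
toy_sentences = [
    ["i", "want", "to", "eat"],
    ["i", "want", "chinese", "food"],
    ["i", "want", "to", "spend"],
]

# Count every (previous word, current word) pair inside each sentence.
pair_counts = Counter(
    (prev, cur)
    for sentence in toy_sentences
    for prev, cur in zip(sentence, sentence[1:])
)

# Look up a count the same way the table is read: row = previous, column = current.
print(pair_counts[("want", "to")])
print(pair_counts[("to", "want")])  # unseen pairs count as 0
```

A `Counter` returns 0 for pairs that never occur, which corresponds to the zero cells in the table.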

In [134]:
df_bigram_prob = df_bigram_word_cnts.copy()
df_bigram_prob.loc['want'] # number of times 'want to' appears together in the corpus = 608

i            2
want         0
to         608
eat          1
chinese      6
food         6
lunch        5
spend        1
Name: want, dtype: int64

In [135]:
df_word_cnts.loc[0]

i          2533
want        927
to         2417
eat         746
chinese     158
food       1093
lunch       341
spend       278
Name: 0, dtype: int64

In [136]:
df_word_cnts.loc[0, 'want'] # number of times 'want' appears in the corpus

927

In [137]:
df_bigram_prob.loc['want']/df_word_cnts.loc[0, 'want']
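# Probability that the current word is 'to' given that the previous word is 'want':
# p(to|want) = cnt_want_to/cnt_want = 608/927 = 0.655879
# cnt_want_to: number of times 'want to' appears in this order
# cnt_want: number of times 'want' appears on its own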

i          0.002157
want       0.000000
to         0.655879
eat        0.001079
chinese    0.006472
food       0.006472
lunch      0.005394
spend      0.001079
Name: want, dtype: float64

In [127]:
# Divide each row (previous word) by that word's own count: p(cur|prev) = cnt(prev cur)/cnt(prev)
df_bigram_prob = df_bigram_word_cnts.div(df_word_cnts.loc[0], axis=0)

df_bigram_prob

|   | i | want | to | eat | chinese | food | lunch | spend |
|---|---|---|---|---|---|---|---|---|
| i | 0.001974 | 0.32649 | 0.0 | 0.003553 | 0.0 | 0.0 | 0.0 | 0.00079 |
| want | 0.002157 | 0.0 | 0.655879 | 0.001079 | 0.006472 | 0.006472 | 0.005394 | 0.001079 |
| to | 0.000827 | 0.0 | 0.001655 | 0.283823 | 0.000827 | 0.0 | 0.002482 | 0.087298 |
| eat | 0.0 | 0.0 | 0.002681 | 0.0 | 0.021448 | 0.002681 | 0.0563 | 0.0 |
| chinese | 0.006329 | 0.0 | 0.0 | 0.0 | 0.0 | 0.518987 | 0.006329 | 0.0 |
| food | 0.013724 | 0.0 | 0.013724 | 0.0 | 0.000915 | 0.00366 | 0.0 | 0.0 |
| lunch | 0.005865 | 0.0 | 0.0 | 0.0 | 0.0 | 0.002933 | 0.0 | 0.0 |
| spend | 0.003597 | 0.0 | 0.003597 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
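
The rows of this table do not sum to 1. Presumably this is because each word's total count also includes occurrences followed by words outside the eight listed here. The sketch below checks this in plain Python by rebuilding the same counts and computing how much probability mass each row covers; `row_mass` is a name introduced here for illustration.

```python
words = ['i', 'want', 'to', 'eat', 'chinese', 'food', 'lunch', 'spend']
word_cnts = [2533, 927, 2417, 746, 158, 1093, 341, 278]
bigram_word_cnts = [[5, 827, 0, 9, 0, 0, 0, 2], [2, 0, 608, 1, 6, 6, 5, 1], [2, 0, 4, 686, 2, 0, 6, 211],
                    [0, 0, 2, 0, 16, 2, 42, 0], [1, 0, 0, 0, 0, 82, 1, 0], [15, 0, 15, 0, 1, 4, 0, 0],
                    [2, 0, 0, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0]]

# For each previous word: sum of its bigram counts divided by its own count,
# i.e. the share of its probability mass that lands on the eight listed words.
row_mass = {prev: sum(row) / cnt
            for prev, row, cnt in zip(words, bigram_word_cnts, word_cnts)}

for prev, mass in row_mass.items():
    print('{:<8s}{:.6f}'.format(prev, mass))
```

Every value printed is below 1, so the table is only a slice of the full conditional distribution.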


---
### Using the given probabilities and the computed ones (df_bigram_prob), decide which of the two sentences below is more plausible

s1 = "i want english food"

s2 = "want i english food"
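
Under the bigram model, each sentence probability is the product of one conditional probability per adjacent word pair, including the start and end markers. The factorization can be written out as follows, where $\text{start}$ and $\text{end}$ stand for the `_start_` and `_end_` markers (shortened here for readability):

$$
\begin{aligned}
&P(s_1) = p(\text{i}|\text{start}) * p(\text{want}|\text{i}) * p(\text{english}|\text{want}) * p(\text{food}|\text{english}) * p(\text{end}|\text{food}) \\
&P(s_2) = p(\text{want}|\text{start}) * p(\text{i}|\text{want}) * p(\text{english}|\text{i}) * p(\text{food}|\text{english}) * p(\text{end}|\text{food})
\end{aligned}
$$

The two products differ only in the second factor, $p(\text{want}|\text{i})$ versus $p(\text{i}|\text{want})$, since the given values for the first and third factors are equal.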

In [67]:
s1 = "i want english food"
s2 = "want i english food"

p_i_start = 0.25
p_eng_want = 0.0011
p_food_eng = 0.5
p_end_food = 0.68
p_want_start = 0.25
p_eng_i = 0.0011

In [74]:
words_s1 = s1.split(' ')
print(words_s1)

prob_s1 = p_i_start * df_bigram_prob['want']['i'] * p_eng_want * p_food_eng * p_end_food
print('{:.10f}'.format(prob_s1))

['i', 'want', 'english', 'food']
0.0000305268


In [75]:
words_s2 = s2.split(' ')
print(words_s2)

prob_s2 = p_want_start * df_bigram_prob['i']['want'] * p_eng_i * p_food_eng * p_end_food
print('{:.10f}'.format(prob_s2))

['want', 'i', 'english', 'food']
0.0000002017


In [88]:
print('{:.10f}'.format(max(prob_s1, prob_s2))) # s1 has the higher probability

0.0000305268
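
The two hand-written products can be generalized into one scoring function. The sketch below is a self-contained version that does not depend on the DataFrames above. The dictionary `bigram_p` and the function `sentence_log_prob` are names introduced here for illustration, and summing log-probabilities instead of multiplying raw probabilities is a choice made here to avoid underflow on longer sentences.

```python
import math

# p(cur|prev), keyed as (prev, cur). Six values are the given ones; the other
# two are taken from the count tables: p(want|i) = 827/2533, p(i|want) = 2/927.
bigram_p = {
    ("_start_", "i"): 0.25,
    ("_start_", "want"): 0.25,
    ("i", "want"): 827 / 2533,
    ("want", "i"): 2 / 927,
    ("want", "english"): 0.0011,
    ("i", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "_end_"): 0.68,
}

def sentence_log_prob(sentence, probs):
    """Sum log p(cur|prev) over adjacent pairs, with start and end markers added."""
    tokens = ["_start_"] + sentence.split() + ["_end_"]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        p = probs.get((prev, cur), 0.0)
        if p == 0.0:
            return float("-inf")  # a missing or zero-probability bigram rules the sentence out
        total += math.log(p)
    return total

for s in ("i want english food", "want i english food"):
    print(s, '{:.10f}'.format(math.exp(sentence_log_prob(s, bigram_p))))
```

Comparing the log-probabilities directly gives the same ranking as comparing the products, so the exponentiation is only needed for display.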
