### Assignment goal: understand how an N-gram model computes probabilities from text

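### Why is the Markov assumption needed to simplify language-model computation?

The original language model factorizes the joint probability of a word sequence with the chain rule of probability: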
$$
\begin{aligned}
&W = (W_1 W_2 W_3 W_4 \dots W_m) \\
&P(W_1,W_2,W_3,W_4,\dots,W_m) = P(W_1) \cdot P(W_2|W_1) \cdot P(W_3|W_1,W_2) \cdots P(W_m|W_1,\dots,W_{m-1})
\end{aligned}
$$

Why do we need to introduce the Markov assumption, under which each conditional probability simplifies to:
$$
P(W_m|W_1,W_2,W_3,\dots,W_{m-1}) = P(W_m|W_{m-n+1},W_{m-n+2},\dots,W_{m-1})
$$
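
As a concrete instance, assuming a bigram model ($n = 2$), each word depends only on its immediate predecessor, so the chain-rule product above reduces to:

$$
P(W_1,W_2,\dots,W_m) \approx P(W_1) \cdot \prod_{i=2}^{m} P(W_i|W_{i-1})
$$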

In [None]:
'''
###<your answer>###
'''
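
One way to see the motivation, offered as a rough sketch rather than the expected answer: the number of distinct histories a model must estimate grows exponentially with the history length. The helper `num_contexts` and the 10-word sentence length are illustrative assumptions; the vocabulary size of 8 matches the toy word list used later in this notebook.

```python
# Illustrative only: count how many distinct histories a model would need to cover.
vocab_size = 8  # size of the toy vocabulary used later in this notebook


def num_contexts(vocab_size: int, history_len: int) -> int:
    """Number of distinct word histories of the given length."""
    return vocab_size ** history_len


# Full chain rule: the last word of a 10-word sentence conditions on 9 previous words.
full_history = num_contexts(vocab_size, 9)
# Bigram (n = 2): every word conditions on exactly one previous word.
bigram_history = num_contexts(vocab_size, 1)

print(full_history, bigram_history)
```

Most long histories never appear in a finite corpus, so their counts would be zero; shortening the history keeps the number of estimated quantities manageable.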

### Judging whether a sentence is plausible under a bigram model

The following probabilities are given:
1. p(i|_start_) = 0.25
2. p(english|want) = 0.0011
3. p(food|english) = 0.5
4. p(_end_|food) = 0.68
5. p(want|_start_) = 0.25
6. p(english|i) = 0.0011

In [32]:
import numpy as np
import pandas as pd
words = ['i', 'want', 'to', 'eat', 'english', 'food', 'lunch', 'spend']
word_cnts = np.array([2533, 927, 2417, 746, 158, 1093, 341, 278]).reshape(1, -1)
df_word_cnts = pd.DataFrame(word_cnts, columns=words)
df_word_cnts

|  | i | want | to | eat | english | food | lunch | spend |
|---|---|---|---|---|---|---|---|---|
| 0 | 2533 | 927 | 2417 | 746 | 158 | 1093 | 341 | 278 |


In [33]:
# Bigram counts: how often each current word (column) follows each previous word (row)
bigram_word_cnts = [[5, 827, 0, 9, 0, 0, 0, 2], [2, 0, 608, 1, 6, 6, 5, 1], [2, 0, 4, 686, 2, 0, 6, 211],
                    [0, 0, 2, 0, 16, 2, 42, 0], [1, 0, 0, 0, 0, 82, 1, 0], [15, 0, 15, 0, 1, 4, 0, 0],
                    [2, 0, 0, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0]]

df_bigram_word_cnts = pd.DataFrame(bigram_word_cnts, columns=words, index=words)
df_bigram_word_cnts

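|  | i | want | to | eat | english | food | lunch | spend |
|---|---|---|---|---|---|---|---|---|
| i | 5 | 827 | 0 | 9 | 0 | 0 | 0 | 2 |
| want | 2 | 0 | 608 | 1 | 6 | 6 | 5 | 1 |
| to | 2 | 0 | 4 | 686 | 2 | 0 | 6 | 211 |
| eat | 0 | 0 | 2 | 0 | 16 | 2 | 42 | 0 |
| english | 1 | 0 | 0 | 0 | 0 | 82 | 1 | 0 |
| food | 15 | 0 | 15 | 0 | 1 | 4 | 0 | 0 |
| lunch | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| spend | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |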


Reading the table above: when the previous word (row) is want, the current word (column) to follows it 608 times in the text.
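
The row/column convention can be checked with a label-based lookup. This minimal sketch rebuilds the count table under the hypothetical name `bigram_counts` so that it runs on its own; the numbers are copied from the cell above.

```python
import pandas as pd

words = ['i', 'want', 'to', 'eat', 'english', 'food', 'lunch', 'spend']
# Same counts as in the notebook cell; `bigram_counts` is a name chosen for this sketch.
bigram_counts = pd.DataFrame(
    [[5, 827, 0, 9, 0, 0, 0, 2], [2, 0, 608, 1, 6, 6, 5, 1],
     [2, 0, 4, 686, 2, 0, 6, 211], [0, 0, 2, 0, 16, 2, 42, 0],
     [1, 0, 0, 0, 0, 82, 1, 0], [15, 0, 15, 0, 1, 4, 0, 0],
     [2, 0, 0, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0]],
    index=words, columns=words)

# Row label = previous word, column label = current word.
count_want_to = bigram_counts.loc['want', 'to']
print(count_want_to)  # 608
```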

In [34]:
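# Using the given unigram counts (df_word_cnts) and the bigram counts (df_bigram_word_cnts), compute the probability of every word pair (e.g. p(want|i))

###<your code>###

# Normalize each row by its own total so that every row of df_bigram_prob sums to 1.
# Note: this divides by the bigram row total, not by the unigram count in df_word_cnts.
row_totals = df_bigram_word_cnts.sum(axis=1)
df_bigram_prob = df_bigram_word_cnts.div(row_totals, axis=0)

df_bigram_prob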

|  | i | want | to | eat | english | food | lunch | spend |
|---|---|---|---|---|---|---|---|---|
| i | 0.005931 | 0.98102 | 0.0 | 0.010676 | 0.0 | 0.0 | 0.0 | 0.002372 |
| want | 0.00318 | 0.0 | 0.966614 | 0.00159 | 0.009539 | 0.009539 | 0.007949 | 0.00159 |
| to | 0.002195 | 0.0 | 0.004391 | 0.753019 | 0.002195 | 0.0 | 0.006586 | 0.231614 |
| eat | 0.0 | 0.0 | 0.032258 | 0.0 | 0.258065 | 0.032258 | 0.677419 | 0.0 |
| english | 0.011905 | 0.0 | 0.0 | 0.0 | 0.0 | 0.97619 | 0.011905 | 0.0 |
| food | 0.428571 | 0.0 | 0.428571 | 0.0 | 0.028571 | 0.114286 | 0.0 | 0.0 |
| lunch | 0.666667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.333333 | 0.0 | 0.0 |
| spend | 0.5 | 0.0 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
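
The values in this table sum to 1 across each row, which means each bigram count was divided by its row total. Since the exercise prompt also mentions the unigram counts, an alternative estimate is p(cur|prev) = C(prev, cur) / C(prev). The sketch below shows that variant under the hypothetical names `unigram_counts`, `bigram_counts`, and `bigram_mle`; it is not the notebook's own solution.

```python
import pandas as pd

words = ['i', 'want', 'to', 'eat', 'english', 'food', 'lunch', 'spend']
# Counts copied from the notebook cells; the variable names are chosen for this sketch.
unigram_counts = pd.Series([2533, 927, 2417, 746, 158, 1093, 341, 278], index=words)
bigram_counts = pd.DataFrame(
    [[5, 827, 0, 9, 0, 0, 0, 2], [2, 0, 608, 1, 6, 6, 5, 1],
     [2, 0, 4, 686, 2, 0, 6, 211], [0, 0, 2, 0, 16, 2, 42, 0],
     [1, 0, 0, 0, 0, 82, 1, 0], [15, 0, 15, 0, 1, 4, 0, 0],
     [2, 0, 0, 0, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0]],
    index=words, columns=words)

# Divide every row by the unigram count of its row word (the previous word).
bigram_mle = bigram_counts.div(unigram_counts, axis=0)

print(bigram_mle.loc['i', 'want'])  # 827 / 2533
```

With this variant the rows need not sum to 1, because the table covers only eight words while the unigram counts include every continuation seen in the corpus.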


Using the given probabilities and the computed ones (df_bigram_prob), decide which of the following two sentences is more plausible.

s1 = "i want english food"

s2 = "want i english food"
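
Written out under the bigram assumption, and assuming the sentence boundaries are modeled with the _start_ and _end_ tokens from the list of given probabilities, the two sentence probabilities factorize as:

$$
\begin{aligned}
P(s_1) &= p(\text{i}|\_\text{start}\_) \cdot p(\text{want}|\text{i}) \cdot p(\text{english}|\text{want}) \cdot p(\text{food}|\text{english}) \cdot p(\_\text{end}\_|\text{food}) \\
P(s_2) &= p(\text{want}|\_\text{start}\_) \cdot p(\text{i}|\text{want}) \cdot p(\text{english}|\text{i}) \cdot p(\text{food}|\text{english}) \cdot p(\_\text{end}\_|\text{food})
\end{aligned}
$$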

In [47]:
'''
###<your answer>###
'''
s1 = "i want english food"
s2 = "want i english food"

def n_gram_split(str_, n):
    # Split on whitespace and return every run of n consecutive words as a tuple
    tokens = str_.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


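# Multiply the bigram probabilities p(cur|prev) of every adjacent word pair in the sentence
s1_prob = pd.Series(df_bigram_prob.loc[prev, cur] for prev, cur in n_gram_split(s1, 2)).prod()
s2_prob = pd.Series(df_bigram_prob.loc[prev, cur] for prev, cur in n_gram_split(s2, 2)).prod()

# Compare the computed probabilities (not the sentence strings) to pick the more plausible sentence
better, worse = ('s1', 's2') if s1_prob > s2_prob else ('s2', 's1')
print(f"s1's probability :{s1_prob}\ns2's probability :{s2_prob}", f"\n{better} makes more sense than {worse}")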



s1's probability :0.00913509580036689
s2's probability :0.0 
s1 makes more sense than s2
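
The result above multiplies only the within-sentence bigrams. A fuller comparison would also use the given boundary probabilities. The following is a minimal sketch, assuming the two pairs missing from the given list, p(want|i) and p(i|want), are estimated as C(prev, cur) / C(prev) from the count tables; the names `given`, `estimated`, and `sentence_prob` are hypothetical.

```python
# Known probabilities from the list in this notebook, keyed as (previous, current).
given = {
    ('_start_', 'i'): 0.25,
    ('want', 'english'): 0.0011,
    ('english', 'food'): 0.5,
    ('food', '_end_'): 0.68,
    ('_start_', 'want'): 0.25,
    ('i', 'english'): 0.0011,
}
# Assumption: the two missing pairs are estimated as C(prev, cur) / C(prev)
# from the count tables (827 / 2533 and 2 / 927).
estimated = {('i', 'want'): 827 / 2533, ('want', 'i'): 2 / 927}
probs = {**given, **estimated}


def sentence_prob(sentence, probs):
    """Multiply bigram probabilities over the sentence, including boundary tokens."""
    tokens = ['_start_'] + sentence.split() + ['_end_']
    result = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        result *= probs[(prev, cur)]
    return result


s1_full = sentence_prob("i want english food", probs)
s2_full = sentence_prob("want i english food", probs)
print(s1_full > s2_full)  # True
```

Under these assumptions s2 receives a small nonzero probability, because the given list supplies p(english|i) = 0.0011 even though that pair has a count of 0 in the table, while s1 still scores higher.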
