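# Natural Language Processing (NLP) in Practice

### NLP mainly covers text-related applications

- Text classification: for example sentiment analysis, topic classification, spam detection, and even chatbots.
- Text generation: for example text summarization, writing lyrics, composing music, generating fake news, and image captioning.
- Text translation: converting between multiple languages.
- Other: cloze (fill-in-the-blank) tasks, typo correction, named entity recognition (NER), and authorship style comparison, for example whether the last chapters of Dream of the Red Chamber were written by Cao Xueqin.

## Code references: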
- https://keras.io/api/layers/core_layers/embedding/
- https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
- https://keras.io/guides/working_with_rnns/


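### Simple RNN

RNNs are mainly used for time-series data such as stock prices and climate data, or for context-sensitive data such as sentences in an article, where each word relates to what comes before and after it. Earlier observations serve as input for predicting the current or a future value.

Because the data depend on context, an RNN takes as input not only the features (X) but also the hidden-layer output of the previous step: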
![](https://ithelp.ithome.com.tw/upload/images/20200925/20001976GzQlGkpwOL.png)

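The current step is influenced by the previous step, which is in turn influenced by the step before it. Because of this recursion-like structure, the model is called a "Recurrent Neural Network" (RNN).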
![](https://ithelp.ithome.com.tw/upload/images/20200925/20001976pjlz1ErdbF.png)
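
The recurrence described above can be sketched in a few lines of plain Python. This is only an illustration with scalar inputs and a scalar hidden state; the function names `rnn_step` and `run_rnn`, the weight values, and the choice of `tanh` as activation are assumptions made for the example, not code from the referenced sources.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step of a scalar RNN cell: h_t = tanh(w_x * x_t + w_h * h_prev + b)."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(xs, w_x=0.5, w_h=0.8, b=0.0):
    """Feed a sequence through the cell, reusing the same weights at every step."""
    h = 0.0  # initial hidden state
    states = []
    for x in xs:
        h = rnn_step(x, h, w_x, w_h, b)  # previous output is fed back in
        states.append(h)
    return states

# Only the first input is non-zero, yet later states are still non-zero
# because each step also receives the previous hidden state.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

The second and third hidden states stay non-zero even though their inputs are zero, which is the "influenced by the previous step" behaviour in miniature.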

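An RNN shares the same weights across all time steps, so the recursion multiplies by the weight W over and over. Roughly speaking, when |W| > 1 this causes "exploding gradients", and when |W| < 1 it causes "vanishing gradients". Improved architectures such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) address this with gating; LSTM in particular maintains an additional "memory" path.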
![](https://ithelp.ithome.com.tw/upload/images/20200925/20001976TlanB2yqVi.png)
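
The effect of repeated multiplication is easy to check numerically. The sketch below treats W as a single scalar, which is a simplification of the real matrix case; the helper name `repeated_factor` and the step count of 50 are assumptions chosen for illustration.

```python
def repeated_factor(w, steps):
    """Multiply by the same weight `steps` times, i.e. compute w ** steps."""
    factor = 1.0
    for _ in range(steps):
        factor *= w
    return factor

# A weight slightly below 1 shrinks toward zero, slightly above 1 blows up.
for w in (0.9, 1.0, 1.1):
    print(w, repeated_factor(w, 50))
```

After 50 steps the factor for 0.9 falls below 0.01 while the factor for 1.1 exceeds 100, so even small deviations from 1 matter on long sequences.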

## Embedding

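Keras implements the RNN, LSTM, and GRU layers as SimpleRNN, LSTM, and GRU in the tensorflow.keras.layers namespace. For text input, the first layer of the model is normally an Embedding layer, which maps each integer word index to a dense real-valued vector so that the following layers can compute on it.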

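The key parameters of the Embedding layer are:

- input_dim: int > 0. Size of the vocabulary.
- output_dim: int > 0. Dimension of the word vectors.
- input_length: Length of the input sequences. Required if Flatten and then Dense layers follow. In Keras 3 this argument is deprecated.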
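
Conceptually, an embedding layer is a lookup table with `input_dim` rows and `output_dim` columns. The following plain-Python sketch shows how the output shape (batch, sequence length, output_dim) arises; `make_embedding` and `embed` are hypothetical helpers, and the random initial range is an assumption for the example.

```python
import random

def make_embedding(input_dim, output_dim, seed=0):
    """Create a table with one small random vector per vocabulary index."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(output_dim)]
            for _ in range(input_dim)]

def embed(table, batch):
    """Replace every integer index in the batch with its row of the table."""
    return [[table[idx] for idx in seq] for seq in batch]

table = make_embedding(input_dim=1000, output_dim=64)
batch = [[3, 17, 999], [0, 3, 3]]  # 2 sequences of 3 word indices each
out = embed(table, batch)

# Shape is (batch size, sequence length, output_dim)
print(len(out), len(out[0]), len(out[0][0]))
```

The same index always maps to the same vector, so `out[0][0]` and `out[1][1]` (both index 3) are identical. In a trained model these rows are learned weights rather than fixed random values.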

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [2]:
model = tf.keras.Sequential()

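# Vocabulary size 1000, output dimension 64; each input sequence below has 10 tokens
model.add(layers.Embedding(input_dim=1000, output_dim=64))

# Generate random data: 32 samples, each with 10 integers in [0, 1000)
input_array = np.random.randint(1000, size=(32, 10))

# Specify the optimizer and loss function
model.compile('rmsprop', 'mse')

# Predict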
output_array = model.predict(input_array)
print(output_array.shape)
output_array[0]

(32, 10, 64)


array([[ 0.01953734, -0.0157074 ,  0.00996393, -0.00673544, -0.02587519,
         0.03053397, -0.04191018,  0.04229479, -0.01038377,  0.00563044,
        -0.00477694,  0.00464076, -0.01616655, -0.04523896, -0.02540561,
         0.02133313,  0.00652258,  0.00679905,  0.01225382,  0.0344512 ,
        -0.0403166 ,  0.00016652,  0.03814531,  0.02289743,  0.04196259,
        -0.00600504,  0.03465677, -0.02778099,  0.03221213, -0.00787107,
        -0.04769366,  0.0346876 ,  0.01677204, -0.01855751, -0.00592537,
        -0.0323254 , -0.04971549, -0.00043731,  0.00608679, -0.01320542,
         0.03071863, -0.01272152, -0.0167508 ,  0.03141818,  0.02908845,
         0.01691998, -0.04993988, -0.01873283, -0.02002305,  0.02539049,
        -0.01617659, -0.0321295 , -0.02310722, -0.02385777,  0.04154817,
        -0.01848627, -0.01114576, -0.0079337 ,  0.0008026 ,  0.021645  ,
        -0.0138871 , -0.0208204 , -0.01179491,  0.01101322],
       [ 0.03860272, -0.00536673, -0.00034314, -0.01738031, -0.

## Using real data

In [3]:
import tensorflow as tf
from tensorflow.keras import layers
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Sample documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']


vocab_size = 50
maxlen = 4

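# Hash each word to an integer index in [1, vocab_size - 1]
# (despite its name, one_hot returns integer indices, not one-hot vectors)
encoded_docs = [one_hot(d, vocab_size) for d in docs]

# Pad to a fixed length; shorter sequences get zeros appended at the end
padded_docs = pad_sequences(encoded_docs, maxlen=maxlen, padding='post')

# The model consists of only an Embedding layer
model = tf.keras.Sequential()
model.add(layers.Embedding(vocab_size, 64, input_length=maxlen))
model.compile('rmsprop', 'mse')

# Predict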
output_array = model.predict(padded_docs)
output_array.shape


(10, 4, 64)
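
To make the two preprocessing steps concrete, here is a plain-Python sketch of hashing words to integer indices and padding the result at the end. It is an approximation, not the Keras implementation: `hash_encode` and `pad_post` are hypothetical names, and MD5 is used only to get a hash that is stable between runs.

```python
import hashlib
import string

def hash_encode(text, vocab_size):
    """Map each word to an integer in [1, vocab_size - 1] using a stable hash."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    ids = []
    for word in cleaned.split():
        digest = hashlib.md5(word.encode("utf8")).hexdigest()
        ids.append(int(digest, 16) % (vocab_size - 1) + 1)  # 0 is kept for padding
    return ids

def pad_post(seq, maxlen, value=0):
    """Keep at most the last maxlen items, then append `value` up to maxlen."""
    seq = seq[-maxlen:]
    return seq + [value] * (maxlen - len(seq))

docs = ['Well done!', 'Weak', 'Could have done better.']
encoded = [hash_encode(d, 50) for d in docs]
padded = [pad_post(e, 4) for e in encoded]
print(padded)
```

Because indices come from a hash, two different words can collide on the same index when the vocabulary size is small. A fitted tokenizer avoids this at the cost of having to store the vocabulary.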

## Adding a Dense layer

In [4]:
import tensorflow as tf
from tensorflow.keras import layers
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

# define documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']

# define class labels (1: positive, 0: negative)
labels = array([1,1,1,1,1,0,0,0,0,0])

vocab_size = 50
maxlen = 4
encoded_docs = [one_hot(d, vocab_size) for d in docs]
padded_docs = pad_sequences(encoded_docs, maxlen=maxlen, padding='post')

model = tf.keras.Sequential()
model.add(layers.Embedding(vocab_size, 8, input_length=maxlen))
model.add(layers.Flatten())
# Add an ordinary fully connected (Dense) layer
model.add(layers.Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))


Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten (Flatten)            (None, 32)                0         
_________________________________________________________________
dense (Dense)                (None, 1)                 33        
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None
Accuracy: 80.000001
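
The parameter counts in the summary above can be reproduced by hand. The short calculation below uses the same sizes as the model (vocabulary 50, embedding dimension 8, sequence length 4); the variable names are chosen for this example only.

```python
vocab_size, embed_dim, maxlen = 50, 8, 4

embedding_params = vocab_size * embed_dim   # one 8-dim vector per vocabulary index
flatten_units = maxlen * embed_dim          # Flatten only reshapes, it has no weights
dense_params = flatten_units * 1 + 1        # one weight per input plus one bias

print(embedding_params, flatten_units, dense_params)
print(embedding_params + dense_params)      # total trainable parameters
```

This also shows why the Flatten layer needs a known sequence length: the Dense layer's weight count depends on `maxlen * embed_dim`.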


## Adding an RNN layer

In [5]:
import tensorflow as tf
from tensorflow.keras import layers
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

# define documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']

# define class labels (1: positive, 0: negative)
labels = array([1,1,1,1,1,0,0,0,0,0])

vocab_size = 50
maxlen = 4
encoded_docs = [one_hot(d, vocab_size) for d in docs]
padded_docs = pad_sequences(encoded_docs, maxlen=maxlen, padding='post')

model = tf.keras.Sequential()
model.add(layers.Embedding(vocab_size, 8, input_length=maxlen))
# Add a RNN layer with 128 internal units.
model.add(layers.SimpleRNN(128))
model.add(layers.Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))


Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 128)               17536     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 129       
Total params: 18,065
Trainable params: 18,065
Non-trainable params: 0
_________________________________________________________________
None
Accuracy: 89.999998
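
The 17,536 parameters reported for the SimpleRNN layer follow from its three weight groups: input weights, recurrent weights, and biases. The helper `simple_rnn_params` below is a hypothetical name used only to spell out that arithmetic.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + biases of a plain RNN layer."""
    return units * input_dim + units * units + units

embedding_params = 50 * 8                       # vocabulary size * embedding dimension
rnn_params = simple_rnn_params(input_dim=8, units=128)
dense_params = 128 * 1 + 1                      # 128 RNN outputs plus one bias

print(embedding_params, rnn_params, dense_params)
print(embedding_params + rnn_params + dense_params)
```

Note that the count does not depend on the sequence length of 4, because the same weights are reused at every time step.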


## Using pretrained word vectors (GloVe)

In [6]:
import tensorflow as tf
from tensorflow.keras import layers
from numpy import array
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer


## Load the 100-dimensional GloVe word vectors into a dictionary for fast lookup

[Download location](http://nlp.stanford.edu/data/glove.6B.zip)

In [7]:
# load the whole embedding into memory
embeddings_index = dict()
with open('./glove/glove.6B.100d.txt', encoding='utf8') as f:
	for line in f:
		values = line.split()
		word = values[0]
		coefs = np.array(values[1:], dtype='float32')
		embeddings_index[word] = coefs

## Tokenization

In [8]:
vocab_size = 50
maxlen = 4


# define documents
docs = ['Well done!',
		'Good work',
		'Great effort',
		'nice work',
		'Excellent!',
		'Weak',
		'Poor effort!',
		'not good',
		'poor work',
		'Could have done better.']

# define class labels (1: positive, 0: negative)
labels = array([1,1,1,1,1,0,0,0,0,0])

# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(docs)
# integer encode the documents
encoded_docs = t.texts_to_sequences(docs)

padded_docs = pad_sequences(encoded_docs, maxlen=maxlen, padding='post')
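
Unlike hashing, a fitted tokenizer builds an explicit word-to-index dictionary, with more frequent words receiving smaller indices and index 0 left free for padding. The sketch below approximates that behaviour in plain Python; `build_word_index` and `texts_to_ids` are hypothetical helpers, and stripping everything in `string.punctuation` is a simplification of the real filter list.

```python
from collections import Counter
import string

def _words(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    return text.lower().translate(str.maketrans("", "", string.punctuation)).split()

def build_word_index(texts):
    """Assign indices starting at 1, most frequent words first."""
    counts = Counter()
    for text in texts:
        counts.update(_words(text))
    # most_common sorts by count; ties keep first-seen order
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

def texts_to_ids(texts, word_index):
    """Replace each known word with its index, skipping unknown words."""
    return [[word_index[w] for w in _words(t) if w in word_index] for t in texts]

docs = ['Well done!', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort!', 'not good', 'poor work', 'Could have done better.']
word_index = build_word_index(docs)
print(word_index)
print(texts_to_ids(docs, word_index))
```

The ten documents contain 14 distinct words, so any vocabulary size of at least 15 (14 words plus the padding index) would be enough here.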

## Map words to their 100-dimensional GloVe vectors

In [9]:
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocab_size, 100))
for word, i in t.word_index.items():
	embedding_vector = embeddings_index.get(word)
	if embedding_vector is not None:
		embedding_matrix[i] = embedding_vector
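
The loop above fills row i of the matrix with the pretrained vector of the word whose index is i, and leaves rows of unknown words as zeros. The plain-Python sketch below shows the same idea on toy data; the three-dimensional vectors and the word "xyzzy" are made-up values for illustration, not real GloVe entries, and `build_embedding_matrix` is a hypothetical helper.

```python
def build_embedding_matrix(word_index, vectors, vocab_size, dim):
    """Row i holds the vector of the word with index i; missing words stay zero."""
    matrix = [[0.0] * dim for _ in range(vocab_size)]
    for word, i in word_index.items():
        vec = vectors.get(word)
        if vec is not None and i < vocab_size:  # skip unknown or out-of-range words
            matrix[i] = list(vec)
    return matrix

# Made-up 3-dimensional vectors standing in for pretrained embeddings
toy_vectors = {"good": [0.1, 0.2, 0.3], "work": [0.4, 0.5, 0.6]}
toy_index = {"work": 1, "good": 2, "xyzzy": 3}  # "xyzzy" has no pretrained vector

matrix = build_embedding_matrix(toy_index, toy_vectors, vocab_size=5, dim=3)
for row in matrix:
    print(row)
```

Row 0 stays zero because index 0 is reserved for padding, and row 3 stays zero because the word has no pretrained vector.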

## Freeze the Embedding layer and feed it the pretrained vectors directly

In [10]:
model = tf.keras.Sequential()

# trainable=False
model.add(layers.Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=maxlen, trainable=False))

# Add a LSTM layer with 128 internal units.
model.add(layers.LSTM(128))
model.add(layers.Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 4, 100)            5000      
_________________________________________________________________
lstm (LSTM)                  (None, 128)               117248    
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 129       
Total params: 122,377
Trainable params: 117,377
Non-trainable params: 5,000
_________________________________________________________________
None
Accuracy: 100.000000
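
The summary's split between trainable and non-trainable parameters can be checked by hand. An LSTM layer has four weight sets (three gates plus the candidate cell state), each shaped like a plain RNN layer; `lstm_params` below is a hypothetical helper that spells out this arithmetic.

```python
def lstm_params(input_dim, units):
    """Four weight sets, each with input weights, recurrent weights, and biases."""
    per_set = units * input_dim + units * units + units
    return 4 * per_set

frozen_embedding = 50 * 100                     # pretrained vectors, trainable=False
lstm = lstm_params(input_dim=100, units=128)
dense = 128 * 1 + 1

print(frozen_embedding, lstm, dense)
print("trainable:", lstm + dense)
print("total:", frozen_embedding + lstm + dense)
```

Only the LSTM and Dense weights are updated during training; the 5,000 embedding values stay fixed at their pretrained values.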
