# ⚜️ 《C3 Natural Language Processing》

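### Sentiment in Text

#### Word-based encoding

Assign each distinct word a unique integer code. A word that appears again reuses its existing code, and a new word gets the next free one.  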
```
I Love My Dog.
1 2    3  4
I Love My Cat.
1 2    3  5
==>
My Cat Love Me
3  5   2    6
```
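The numbering above can be reproduced in a few lines of plain Python. This is only a minimal sketch and not a TensorFlow API: the helper names `simple_tokens`, `build_word_index`, and `encode` are hypothetical, and the sketch assumes case-insensitive matching with surrounding punctuation stripped.

```python
import string


def simple_tokens(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(string.punctuation) for w in sentence.lower().split())
    return [w for w in words if w]


def build_word_index(sentences):
    """Give each previously unseen word the next free integer ID, starting at 1."""
    word_index = {}
    for sentence in sentences:
        for word in simple_tokens(sentence):
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def encode(sentence, word_index):
    """Map a sentence to its list of IDs (raises KeyError for unseen words)."""
    return [word_index[word] for word in simple_tokens(sentence)]


corpus = ["I Love My Dog.", "I Love My Cat.", "My Cat Love Me"]
word_index = build_word_index(corpus)
encoded = encode("My Cat Love Me", word_index)
print(word_index)
print(encoded)
```

Under these assumptions, "My Cat Love Me" encodes to `[3, 5, 2, 6]`, matching the hand-numbered example.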



### Tokenization
In TensorFlow NLP, words are usually converted into one of the following two forms:  

1. Word-level encoding
   - Each word is mapped to a unique integer ID.
   - Example: {"I": 1, "Love": 2, "My": 3, "Dog": 4, "Cat": 5}

2. Subword-level encoding (generally preferred)
   - Sentences are split into smaller subword units (for example, BERT's WordPiece or GPT's Byte-Pair Encoding).

Example text: "unbelievably fast"

| Encoding level | Encoding example |
|:--------|:-------|
| Word-level | unbelievably=100, fast=101 |
| Subword-level | un=1, believ=2, ably=3, fast=4 |
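To make the subword row of the table concrete, here is a toy greedy longest-match splitter. It is a sketch under stated assumptions, not real WordPiece or Byte-Pair Encoding: the function `subword_encode` and the four-entry vocabulary are hypothetical, and real tokenizers learn their vocabularies from data and mark word-internal pieces differently.

```python
def subword_encode(text, vocab):
    """Split each word into the longest known pieces, left to right."""
    ids = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest remaining piece first, then shrink it.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in vocab:
                    ids.append(vocab[piece])
                    start = end
                    break
            else:
                # No known piece starts at this position.
                raise ValueError(f"cannot tokenize {word[start:]!r}")
    return ids


# Toy vocabulary taken from the table above (illustrative only).
toy_vocab = {"un": 1, "believ": 2, "ably": 3, "fast": 4}
subword_ids = subword_encode("unbelievably fast", toy_vocab)
print(subword_ids)
```

With this toy vocabulary the text is split into `un`, `believ`, `ably`, and `fast`, so four IDs cover a word that a word-level vocabulary would need a dedicated entry for.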





```python
import tensorflow as tf

# Sample corpus used to build the vocabulary
sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
    ]

# Learn the vocabulary from the corpus
vectorize_layer = tf.keras.layers.TextVectorization()
vectorize_layer.adapt(sentences)

# Full vocabulary, including the padding token '' and the unknown-word token '[UNK]'
vocabulary = vectorize_layer.get_vocabulary()
print(vocabulary)

# Vocabulary without the special tokens
vocab = vectorize_layer.get_vocabulary(include_special_tokens=False)
print(vocab)
```

```
['', '[UNK]', 'my', 'love', 'dog', 'you', 'i', 'think', 'is', 'do', 'cat', 'amazing']
['my', 'love', 'dog', 'you', 'i', 'think', 'is', 'do', 'cat', 'amazing']
```


```python
# Vectorize a single sentence
test_new_seq = vectorize_layer("I LOVE My Bird")
# Note: 'bird' is not in the vocabulary, so it is encoded as the '[UNK]' token (index 1)
print(test_new_seq)

# Vectorize a list of sentences
test_arr_seq = vectorize_layer(['I love my son', 'My Son Love Me'])
print(test_arr_seq)

# Vectorize a tf.data.Dataset by mapping the layer over it
sentence_dataset = tf.data.Dataset.from_tensor_slices(sentences)
sequences = sentence_dataset.map(vectorize_layer)
for sentence, sequence in zip(sentences, sequences):
    print(f'{sentence} ---> {sequence}')
```

```
tf.Tensor([6 3 2 1], shape=(4,), dtype=int64)
tf.Tensor(
[[6 3 2 1]
 [2 1 3 1]], shape=(2, 4), dtype=int64)
I love my dog ---> [6 3 2 4]
I love my cat ---> [ 6  3  2 10]
You love my dog! ---> [5 3 2 4]
Do you think my dog is amazing? ---> [ 9  5  7  2  4  8 11]
```


```python
# Pad the sequences to equal length (default is 'pre': zeros are added on the left)
sequences_pre = tf.keras.utils.pad_sequences(sequences=sequences)
print(sequences_pre)

# Example: pad with zeros on the right and truncate from the end, keeping only the first 5 tokens
sequences_post = tf.keras.utils.pad_sequences(sequences=sequences, padding='post', maxlen=5, truncating='post')
print(sequences_post)
```

```
[[ 0  0  0  6  3  2  4]
 [ 0  0  0  6  3  2 10]
 [ 0  0  0  5  3  2  4]
 [ 9  5  7  2  4  8 11]]
[[ 6  3  2  4  0]
 [ 6  3  2 10  0]
 [ 5  3  2  4  0]
 [ 9  5  7  2  4]]
```
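
To show what the `padding`, `maxlen`, and `truncating` options do, the same behavior can be sketched in plain Python. The helper `pad_to_length` is hypothetical and only mimics the options used above. It is not the Keras implementation and returns nested lists rather than a NumPy array.

```python
def pad_to_length(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence so they all have length maxlen."""
    sequences = [list(seq) for seq in sequences]
    if maxlen is None:
        # Default to the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        if len(seq) > maxlen:
            # 'pre' drops tokens from the start, 'post' drops them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the fill values on the left, 'post' on the right.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


# The four encoded sentences from the output above.
raw = [[6, 3, 2, 4], [6, 3, 2, 10], [5, 3, 2, 4], [9, 5, 7, 2, 4, 8, 11]]
pre_padded = pad_to_length(raw)
post_padded = pad_to_length(raw, maxlen=5, padding="post", truncating="post")
print(pre_padded)
print(post_padded)
```

On these four sequences the sketch produces the same two matrices as the printed output above.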
