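# Preprocessing

Before text can be fed to a machine learning model, natural language processing requires preprocessing steps such as word tokenization and sentence segmentation.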

- [Tokenization](#Tokenization)
  - [split by space](#split-by-space)
  - [nltk.tokenize](#nltk.tokenize)
  - [sklearn.CountVectorizer](#sklearn.CountVectorizer)
  - [keras.text_to_word_sequence](#keras.text_to_word_sequence) (recommended)
- [Word to Vector](#Word-to-Vector)
  - [sklearn.LabelEncoder](#sklearn.LabelEncoder)
  - [sklearn.OneHotEncoder](#sklearn.OneHotEncoder)
  - [keras.to_categorical](#keras.to_categorical)
- [Document to Vector](#Document-to-Vector)
  - [sklearn.CountVectorizer](#sklearn.CountVectorizer)
  - [sklearn.TfidfVectorizer](#sklearn.TfidfVectorizer)
  - [sklearn.TruncatedSVD](#sklearn.TruncatedSVD)
  - [sklearn.PCA](#sklearn.PCA)

## Tokenization

The examples below all use the following `text`:

In [1]:
text = "Yeah, I know when I compliment her she won't believe me \
        And it's so sad to think that she don't see what I see \
        But every time she asks me do I look okay? \
        I say When I see your face \
        There's not a thing that I would change \
        Cause you're amazing Just the way you are"

### split by space

In [12]:
print(text.split())

['Yeah,', 'I', 'know', 'when', 'I', 'compliment', 'her', 'she', "won't", 'believe', 'me', 'And', "it's", 'so', 'sad', 'to', 'think', 'that', 'she', "don't", 'see', 'what', 'I', 'see', 'But', 'every', 'time', 'she', 'asks', 'me', 'do', 'I', 'look', 'okay?', 'I', 'say', 'When', 'I', 'see', 'your', 'face', "There's", 'not', 'a', 'thing', 'that', 'I', 'would', 'change', 'Cause', "you're", 'amazing', 'Just', 'the', 'way', 'you', 'are']


### nltk.tokenize

- Commas and question marks become separate tokens
- The contractions `n't`, `'s`, and `'re` are split off as separate tokens

In [9]:
import nltk
from nltk.tokenize import word_tokenize
from os.path import abspath

# prepare the `punkt` tokenizer models (newer NLTK releases may ask for `punkt_tab` instead)
nltk.download('punkt', download_dir='./nltk_data')
nltk.data.path = [abspath('./nltk_data')]

print(word_tokenize(text))

[nltk_data] Downloading package punkt to ./nltk_data...
[nltk_data]   Package punkt is already up-to-date!
['Yeah', ',', 'I', 'know', 'when', 'I', 'compliment', 'her', 'she', 'wo', "n't", 'believe', 'me', 'And', 'it', "'s", 'so', 'sad', 'to', 'think', 'that', 'she', 'do', "n't", 'see', 'what', 'I', 'see', 'But', 'every', 'time', 'she', 'asks', 'me', 'do', 'I', 'look', 'okay', '?', 'I', 'say', 'When', 'I', 'see', 'your', 'face', 'There', "'s", 'not', 'a', 'thing', 'that', 'I', 'would', 'change', 'Cause', 'you', "'re", 'amazing', 'Just', 'the', 'way', 'you', 'are']


### sklearn.CountVectorizer

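- Commas and question marks are removed
- The contraction suffixes `'t` and `'s` are removed, but `'re` survives as a separate token `re`
- The default `token_pattern` (`(?u)\b\w\w+\b`) only keeps tokens of two or more word characters, which is also why `I` and `a` disappear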

In [11]:
from sklearn.feature_extraction.text import CountVectorizer
print(CountVectorizer().build_tokenizer()(text))

['Yeah', 'know', 'when', 'compliment', 'her', 'she', 'won', 'believe', 'me', 'And', 'it', 'so', 'sad', 'to', 'think', 'that', 'she', 'don', 'see', 'what', 'see', 'But', 'every', 'time', 'she', 'asks', 'me', 'do', 'look', 'okay', 'say', 'When', 'see', 'your', 'face', 'There', 'not', 'thing', 'that', 'would', 'change', 'Cause', 'you', 're', 'amazing', 'Just', 'the', 'way', 'you', 'are']
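The behaviour listed above follows from the default token pattern. As a minimal sketch, assuming the tokenizer amounts to calling `findall` with that pattern, the same splitting can be reproduced with the standard `re` module. The `sample` sentence is made up for illustration and is not part of the original notebook.

```python
import re

# The default `token_pattern` documented for CountVectorizer:
# a run of two or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# Hypothetical sample sentence, chosen to contain contractions and one-letter words.
sample = "I know she won't believe it's true, you're amazing"

tokens = TOKEN_PATTERN.findall(sample)
print(tokens)
# ['know', 'she', 'won', 'believe', 'it', 'true', 'you', 're', 'amazing']
```

The one-character pieces `I`, `t`, and `s` never match because the pattern needs at least two word characters, while `re` is long enough to be kept.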


### keras.text_to_word_sequence

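- Converts everything to lowercase
- Removes commas and question marks
- Keeps contractions intact

This looks like the best result of the four approaches, although it is also the slowest.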

In [15]:
from keras.preprocessing.text import text_to_word_sequence
print(text_to_word_sequence(text))

Using TensorFlow backend.


['yeah', 'i', 'know', 'when', 'i', 'compliment', 'her', 'she', "won't", 'believe', 'me', 'and', "it's", 'so', 'sad', 'to', 'think', 'that', 'she', "don't", 'see', 'what', 'i', 'see', 'but', 'every', 'time', 'she', 'asks', 'me', 'do', 'i', 'look', 'okay', 'i', 'say', 'when', 'i', 'see', 'your', 'face', "there's", 'not', 'a', 'thing', 'that', 'i', 'would', 'change', 'cause', "you're", 'amazing', 'just', 'the', 'way', 'you', 'are']
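Contractions survive because the apostrophe is not among the characters that `text_to_word_sequence` filters out by default. The sketch below is a rough standard-library approximation of that behaviour, not the Keras implementation. The helper name `simple_word_sequence` is hypothetical, and `FILTERS` is copied from the documented default of the `filters` argument.

```python
# Default value of the `filters` argument documented for text_to_word_sequence.
# Note that the apostrophe is absent, so "won't" stays in one piece.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=FILTERS, lower=True, split=' '):
    """Hypothetical helper: lowercase, blank out filtered characters, then split."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the split character.
    table = str.maketrans({ch: split for ch in filters})
    # Drop the empty strings produced by consecutive separators.
    return [tok for tok in text.translate(table).split(split) if tok]


print(simple_word_sequence("Yeah, I won't look okay?"))
# ['yeah', 'i', "won't", 'look', 'okay']
```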


## Word to Vector

The examples below use the following list of words, `words`:

In [17]:
words = ['cute', 'beauty', 'cold', 'cold', 'cold', 'hot', 'beauty', 'cute']

### sklearn.LabelEncoder

In [42]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
word_ids = le.fit_transform(words)
print(word_ids)

[2 0 1 1 1 3 0 2]
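The ids above come from sorting the distinct labels and numbering them in that order. A minimal pure-Python sketch of the same mapping, where the names `classes` and `word_to_id` are introduced only for illustration:

```python
words = ['cute', 'beauty', 'cold', 'cold', 'cold', 'hot', 'beauty', 'cute']

# Sort the distinct labels, then number them in sorted order.
classes = sorted(set(words))
word_to_id = {word: idx for idx, word in enumerate(classes)}

word_ids = [word_to_id[word] for word in words]
print(classes)   # ['beauty', 'cold', 'cute', 'hot']
print(word_ids)  # [2, 0, 1, 1, 1, 3, 0, 2]
```

Because the ids follow alphabetical order, they carry no semantic meaning; `cold` is not "between" `beauty` and `cute` in any useful sense, which is one reason to one-hot encode them afterwards.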


### sklearn.OneHotEncoder

In [43]:
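from sklearn.preprocessing import OneHotEncoder
# `sparse=False` returns a dense array; scikit-learn >= 1.2 renames it to `sparse_output`
one_hot = OneHotEncoder(sparse=False)
# OneHotEncoder expects a 2-D array of shape (n_samples, n_features)
print(one_hot.fit_transform(word_ids.reshape(-1, 1)))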

[[ 0.  0.  1.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  1.  0.]]


### keras.to_categorical

In [44]:
from keras.utils import to_categorical
print(to_categorical(word_ids))

[[ 0.  0.  1.  0.]
 [ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  1.  0.]]
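Both encoders produce the same matrix: row `i` has a single 1 in column `word_ids[i]`. A minimal sketch in plain Python, assuming the number of classes is taken as the largest id plus one (which is how `to_categorical` infers it when `num_classes` is not given):

```python
word_ids = [2, 0, 1, 1, 1, 3, 0, 2]

# Assumption: ids are 0-based and contiguous, so the class count is max id + 1.
num_classes = max(word_ids) + 1

# One row per word, with a single 1.0 at the position of its id.
one_hot = [[1.0 if col == idx else 0.0 for col in range(num_classes)]
           for idx in word_ids]

for row in one_hot:
    print(row)
```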


## Document to Vector

The examples below use the following list of documents, `docs`:

In [30]:
docs = ['bobo is cute',
        'bobo is smart',
        'bobo is humorous']

### sklearn.CountVectorizer

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
doc_term = cv.fit_transform(docs)
# scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
print(cv.get_feature_names())
print(doc_term.toarray())

['bobo', 'cute', 'humorous', 'is', 'smart']
[[1 1 0 1 0]
 [1 0 0 1 1]
 [1 0 1 1 0]]
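Each row of the matrix above counts how often each vocabulary term occurs in one document. A minimal sketch of the same counting in plain Python, assuming whitespace tokenization is good enough for these three documents (`CountVectorizer` additionally lowercases and drops one-character tokens, which makes no difference here):

```python
docs = ['bobo is cute', 'bobo is smart', 'bobo is humorous']

# Naive whitespace tokenization; an assumption that holds for these documents.
tokenized = [doc.split() for doc in docs]

# The vocabulary is the sorted set of all terms, matching the column order above.
vocabulary = sorted({tok for toks in tokenized for tok in toks})

# One row per document, one column per vocabulary term.
doc_term = [[toks.count(term) for term in vocabulary] for toks in tokenized]

print(vocabulary)
for row in doc_term:
    print(row)
```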


### sklearn.TfidfVectorizer

- tf: term frequency
- idf: inverse document frequency

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()
doc_term = tv.fit_transform(docs)
# scikit-learn >= 1.0 provides get_feature_names_out(); get_feature_names() was removed in 1.2
print(tv.get_feature_names())
print(doc_term.toarray())

['bobo', 'cute', 'humorous', 'is', 'smart']
[[ 0.45329466  0.76749457  0.          0.45329466  0.        ]
 [ 0.45329466  0.          0.          0.45329466  0.76749457]
 [ 0.45329466  0.          0.76749457  0.45329466  0.        ]]
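The values 0.4533 and 0.7675 can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults: raw counts as tf, the smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper `l2_normalize` is a name introduced here for illustration.

```python
import math

docs = ['bobo is cute', 'bobo is smart', 'bobo is humorous']
vocab = sorted({word for doc in docs for word in doc.split()})

# tf: raw term counts, one row per document.
counts = [[doc.split().count(term) for term in vocab] for doc in docs]

# idf with smoothing: ln((1 + n_docs) / (1 + df)) + 1
n_docs = len(docs)
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def l2_normalize(row):
    """Scale a row so that its Euclidean length is 1."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]
for row in tfidf:
    print([round(v, 8) for v in row])
```

`bobo` and `is` appear in every document, so their idf is `ln(4/4) + 1 = 1`, while `cute`, `smart`, and `humorous` each appear once and get `ln(4/2) + 1 ≈ 1.693`. After normalization the rare term dominates its row, which is the effect tf-idf is meant to have.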


### sklearn.TruncatedSVD

In [45]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=3)
print(svd.fit_transform(doc_term.toarray()))

[[ 0.77929545  0.          0.62665669]
 [ 0.77929545 -0.54270061 -0.31332835]
 [ 0.77929545  0.54270061 -0.31332835]]


### sklearn.PCA

In [46]:
from sklearn.decomposition import PCA
pca = PCA(n_components=3, svd_solver='full')
print(pca.fit_transform(doc_term.toarray()))

[[  6.26656690e-01   4.67720313e-17   1.54037282e-16]
 [ -3.13328345e-01  -5.42700613e-01   1.54037282e-16]
 [ -3.13328345e-01   5.42700613e-01   1.54037282e-16]]
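The two outputs differ mainly because PCA centers the data before factorizing it, while TruncatedSVD works on the matrix as given. A sketch of the relationship, where $X$ is the $n \times d$ tf-idf matrix ($n = 3$ documents), $\mu$ is the column-mean vector, and $k$ is `n_components` (notation introduced here, not taken from the notebook):

$$
\begin{aligned}
\text{TruncatedSVD:}\quad X &\approx U_k \Sigma_k V_k^\top, \qquad Z_{\text{svd}} = U_k \Sigma_k = X V_k \\
\text{PCA:}\quad X_c &= X - \mathbf{1}\mu^\top = \tilde{U} \tilde{\Sigma} \tilde{V}^\top, \qquad Z_{\text{pca}} = \tilde{U}_k \tilde{\Sigma}_k = X_c \tilde{V}_k \\
\mathbf{1}^\top X_c &= 0 \;\Rightarrow\; \operatorname{rank}(X_c) \le n - 1 = 2 \;\Rightarrow\; \tilde{\sigma}_3 = 0
\end{aligned}
$$

Because the centered rows sum to zero, three documents can span at most two directions, which is why the third PCA column above is only floating-point noise (about `1.5e-16`). In this particular example the remaining two PCA columns match the second and third TruncatedSVD columns, while the first TruncatedSVD column (0.779 for every document) captures the part shared by all documents that centering removes.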
