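# Preprocessing

Before a machine learning model can be applied to natural language, the text needs some preprocessing, such as word tokenization and sentence segmentation.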

- [Tokenization](#Tokenization)
  - [split by space](#split-by-space)
  - [nltk.tokenize](#nltk.tokenize)
  - [sklearn.CountVectorizer](#sklearn.CountVectorizer)
  - [keras.text_to_word_sequence](#keras.text_to_word_sequence) (recommended)
- [Word to Vector](#Word-to-Vector)
  - [sklearn.LabelEncoder](#sklearn.LabelEncoder)
  - [sklearn.OneHotEncoder](#sklearn.OneHotEncoder)
  - [keras.to_categorical](#keras.to_categorical)
- [Document to Vector](#Document-to-Vector)
  - [sklearn.CountVectorizer](#sklearn.CountVectorizer)
  - [sklearn.TfidfVectorizer](#sklearn.TfidfVectorizer)
  - [sklearn.TruncatedSVD](#sklearn.TruncatedSVD)
  - [sklearn.PCA](#sklearn.PCA)

## Tokenization

Take the following `text` as an example:

In [None]:
text = "Yeah, I know when I compliment her she won't believe me \
        And it's so sad to think that she don't see what I see \
        But every time she asks me do I look okay? \
        I say When I see your face \
        There's not a thing that I would change \
        Cause you're amazing Just the way you are"

### split by space

In [None]:
print(text.split())
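
Splitting on whitespace leaves punctuation attached to the neighbouring word. The short sketch below shows this on a sample sentence that is an assumption chosen for illustration, not the `text` variable above.

```python
# Illustrative sentence (an assumption, not the `text` variable above).
sample = "Yeah, I know she won't believe me. Do I look okay?"

tokens = sample.split()
print(tokens)

# Tokens that still carry leading or trailing punctuation.
attached = [t for t in tokens if t.strip(",.?!") != t]
print(attached)
```

Contractions such as `won't` survive intact, but `Yeah,` and `okay?` would be counted as different words from `Yeah` and `okay`.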

### nltk.tokenize

- Commas and question marks become separate tokens
- The contractions `n't`, `'s`, and `'re` are split off as separate tokens

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from os.path import abspath

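# Download the tokenizer models into a local directory.
# Recent NLTK releases load `punkt_tab`; older ones load `punkt`.
nltk.download('punkt', download_dir='.nltk_data')
nltk.download('punkt_tab', download_dir='.nltk_data')
nltk.data.path = [abspath('.nltk_data')]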

print(word_tokenize(text))

### sklearn.CountVectorizer

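- Commas and question marks are removed
- The contraction fragments `'t` and `'s` are dropped (single-character tokens are discarded by default), while `'re` is kept as the separate token `re`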

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
print(CountVectorizer().build_tokenizer()(text))
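
The behaviour listed above follows from the default `token_pattern` of `CountVectorizer`, `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. Here is a minimal sketch that applies the same pattern with the standard `re` module; the sample sentence is an assumption chosen for illustration.

```python
import re

# Default token pattern used by scikit-learn's CountVectorizer.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Illustrative sentence (an assumption, not the `text` variable above).
sample = "I know she won't believe it's true, you're amazing"

tokens = re.findall(TOKEN_PATTERN, sample)
print(tokens)
```

The single-character pieces `I`, `t`, and `s` never match, while `re` has two characters and is kept. Note that `build_tokenizer()` does not lowercase; lowercasing happens in a separate preprocessing step of the vectorizer.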

### keras.text_to_word_sequence

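- Converts everything to lowercase
- Removes commas and question marks
- Keeps contractions intact

It appears to give the best result of the four, but it is also the slowest.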

In [None]:
from keras.preprocessing.text import text_to_word_sequence
print(text_to_word_sequence(text))
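
To make the three points above concrete, here is a rough standard-library approximation of what `text_to_word_sequence` does with its default arguments. The helper name `simple_word_sequence` is hypothetical, and the sketch is not the Keras implementation; it only mirrors the default `filters` string, which does not contain the apostrophe.

```python
# Default `filters` string of keras' text_to_word_sequence (no apostrophe in it).
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Hypothetical helper: lowercase, blank out filtered characters, then split."""
    if lower:
        text = text.lower()
    # Map every filtered character to the split character.
    table = str.maketrans({ch: split for ch in filters})
    # Drop the empty strings left behind by consecutive separators.
    return [tok for tok in text.translate(table).split(split) if tok]


result = simple_word_sequence("Cause you're amazing, Just the way you are?")
print(result)
```

Because the apostrophe is not filtered, `you're` stays a single token.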

## Word to Vector

Take the following `words` list as an example:

In [None]:
words = ['cute', 'beauty', 'cold', 'cold', 'cold', 'hot', 'beauty', 'cute']

### sklearn.LabelEncoder

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
word_ids = le.fit_transform(words)
print(word_ids)
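
`LabelEncoder` sorts the distinct labels and assigns each one its position in that sorted order. A minimal pure-Python sketch of the same mapping (the names `classes` and `word_to_id` are illustrative, not part of scikit-learn):

```python
words = ['cute', 'beauty', 'cold', 'cold', 'cold', 'hot', 'beauty', 'cute']

# Sorted unique labels play the role of `le.classes_`.
classes = sorted(set(words))
word_to_id = {word: idx for idx, word in enumerate(classes)}

word_ids = [word_to_id[w] for w in words]
print(classes)
print(word_ids)
```

The ids carry no notion of similarity: `cold` and `cute` being 1 and 2 is only an artifact of alphabetical order, which is one reason to convert them to one-hot vectors next.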

### sklearn.OneHotEncoder

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
le = LabelEncoder()
word_ids = le.fit_transform(words)
# `sparse_output` replaced `sparse` in scikit-learn 1.2; use `sparse=False` on older versions
one_hot = OneHotEncoder(sparse_output=False)
print(one_hot.fit_transform(word_ids.reshape(-1, 1)))

### keras.to_categorical

In [None]:
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
word_ids = le.fit_transform(words)
print(to_categorical(word_ids))
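
Both `OneHotEncoder` and `to_categorical` turn each integer id into a row that is 1 at the id's index and 0 elsewhere. A small pure-Python sketch of that encoding, assuming the ids that label encoding yields for `words`:

```python
# Ids that label encoding yields for `words` (classes sorted alphabetically).
word_ids = [2, 0, 1, 1, 1, 3, 0, 2]

# One column per class; classes are assumed to be numbered 0..max(word_ids).
num_classes = max(word_ids) + 1
one_hot = [[1.0 if col == idx else 0.0 for col in range(num_classes)]
           for idx in word_ids]

for row in one_hot:
    print(row)
```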

## Document to Vector

Take the following `docs` list as an example:

In [None]:
docs = ['bobo is cute',
        'bobo is smart',
        'bobo is humorous']

### sklearn.CountVectorizer

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
doc_term = cv.fit_transform(docs)
# `get_feature_names()` was removed in scikit-learn 1.2; `get_feature_names_out()` replaces it
print(cv.get_feature_names_out())
print(doc_term.toarray())
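
The document-term matrix printed above is a bag-of-words count: one row per document, one column per vocabulary term. A minimal sketch with `collections.Counter`, assuming plain whitespace splitting instead of scikit-learn's token pattern (the two agree on these short documents):

```python
from collections import Counter

docs = ['bobo is cute', 'bobo is smart', 'bobo is humorous']

# Simplified tokenization: lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# Sorted vocabulary, one column per term.
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# Count how often each vocabulary term occurs in each document.
doc_term = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in doc_term:
    print(row)
```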

### sklearn.TfidfVectorizer

- tf: term frequency
- idf: inverse document frequency

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()
doc_term = tv.fit_transform(docs)
# `get_feature_names()` was removed in scikit-learn 1.2; `get_feature_names_out()` replaces it
print(tv.get_feature_names_out())
print(doc_term.toarray())
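
To show how tf and idf combine, the sketch below recomputes the weights by hand. It assumes the default `TfidfVectorizer` settings (`smooth_idf=True`, `sublinear_tf=False`, `norm='l2'`), under which idf is `ln((1 + n) / (1 + df)) + 1` and each row is scaled to unit length. Whitespace splitting stands in for the real tokenizer.

```python
import math
from collections import Counter

docs = ['bobo is cute', 'bobo is smart', 'bobo is humorous']

tokenized = [doc.lower().split() for doc in docs]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df = {term: sum(term in doc for doc in tokenized) for term in vocabulary}

# Smoothed inverse document frequency.
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

tfidf = []
for doc in tokenized:
    counts = Counter(doc)
    row = [counts[term] * idf[term] for term in vocabulary]
    # L2-normalize each document vector.
    norm = math.sqrt(sum(v * v for v in row))
    tfidf.append([v / norm for v in row])

print(vocabulary)
for row in tfidf:
    print([round(v, 4) for v in row])
```

Terms that occur in every document (`bobo`, `is`) get the minimum idf of 1, so the distinguishing word of each document (`cute`, `smart`, `humorous`) ends up with the largest weight in its row.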

### sklearn.TruncatedSVD

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()
doc_term = tv.fit_transform(docs)
svd = TruncatedSVD(n_components=3)
# TruncatedSVD accepts the sparse document-term matrix directly, so no `.toarray()` is needed
print(svd.fit_transform(doc_term))

### sklearn.PCA

In [None]:
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer()
doc_term = tv.fit_transform(docs)
pca = PCA(n_components=3, svd_solver='full')
print(pca.fit_transform(doc_term.toarray()))