Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
162 lines (116 sloc) 5.93 KB

文本数据表示为词袋

下面利用CountVectorizer对原始本文输入进行词袋统计,从而转换成稀疏的特征向量作为模型输入。

from sklearn.feature_extraction.text import CountVectorizer

# 2行文本数据
bards_words =["The fool doth think he is wise,", "but the wise man knows himself to be a fool"]

# 对每个文档分词, 生成样本中所有词的表, 但是忽略掉那些只出现在1个样本中的单词, 并且忽略掉停词
vect = CountVectorizer(min_df=2, stop_words='english')
vect.fit(bards_words)

# 打印词表
print(vect.vocabulary_)
# 打印特征向量的构成
print(vect.get_feature_names())

# 文本数据转换成词袋特征向量
bag_of_words = vect.transform(bards_words)

# 转成稠密矩阵输出2行特征向量
print(bag_of_words.toarray())

{'fool': 0, 'wise': 1}
['fool', 'wise']
[[1 1]
 [1 1]]

min_df令词表仅保留了在不同文档中出现过至少2次的单词,另外stop_words指定忽略掉英文中的停词,其实上述过程就是舍弃我们认为不重要的单词。

tf-idf缩放数据

即词频-逆向文档频率,如果一个单词在某个特定文档中经常出现,但在许多文档中却不常出现,那么这个单词可能是对文档的很好的描述。

from sklearn.feature_extraction.text import TfidfVectorizer

# 2行文本数据
bards_words =["The fool doth think he is wise,", "but the wise man knows himself to be a fool"]

# 文本统计tf/idf
vect = TfidfVectorizer()
vect.fit(bards_words)

# 打印词表
print(vect.vocabulary_)
# 打印特征向量的构成
print(vect.get_feature_names())

# 将输入文本分词,并用每个分词的tf/idf作为特征值
tfidf_X = vect.transform(bards_words)

# 转成稠密矩阵输出
print(tfidf_X.toarray())

{'the': 9, 'fool': 3, 'doth': 2, 'think': 10, 'he': 4, 'is': 6, 'wise': 12, 'but': 1, 'man': 8, 'knows': 7, 'himself': 5, 'to': 11, 'be': 0}
['be', 'but', 'doth', 'fool', 'he', 'himself', 'is', 'knows', 'man', 'the', 'think', 'to', 'wise']
[[0.         0.         0.42567716 0.30287281 0.42567716 0.
  0.42567716 0.         0.         0.30287281 0.42567716 0.
  0.30287281]
 [0.36469323 0.36469323 0.         0.25948224 0.         0.36469323
  0.         0.36469323 0.36469323 0.25948224 0.         0.36469323
  0.25948224]]

基于样本数据生成每个单词的tf/idf,然后对任意文本输入即可将其分词并用每个分词的tf/idf值作为特征值,构成特征向量。

上述第一个特征向量的第一列为0,其代表单词be,在一个文本中的确没有be。

�多个单词的词袋

只考虑1个单词的词袋,会出现这样的问题:

"it’s bad, not good at all"与"it’s good, not bad at all"经过1元词袋处理后的表示完全相同,但是它们实际的语义完全相反。

说明一元词袋无法考虑到单词的上下文,所以我们需要多元词袋:

from sklearn.feature_extraction.text import TfidfVectorizer

# 2行文本数据
bards_words =["The fool doth think he is wise,", "but the wise man knows himself to be a fool"]

# 文本统计tf/idf
vect = TfidfVectorizer(ngram_range=(1,2))
vect.fit(bards_words)

# 打印特征向量的构成
print(vect.get_feature_names())

# 将输入文本分词,并用每个分词的tf/idf作为特征值
tfidf_X = vect.transform(bards_words)

# 转成稠密矩阵输出
print(tfidf_X.toarray())
['be', 'be fool', 'but', 'but the', 'doth', 'doth think', 'fool', 'fool doth', 'he', 'he is', 'himself', 'himself to', 'is', 'is wise', 'knows', 'knows himself', 'man', 'man knows', 'the', 'the fool', 'the wise', 'think', 'think he', 'to', 'to be', 'wise', 'wise man']
[[0.         0.         0.         0.         0.29464404 0.29464404
  0.20964166 0.29464404 0.29464404 0.29464404 0.         0.
  0.29464404 0.29464404 0.         0.         0.         0.
  0.20964166 0.29464404 0.         0.29464404 0.29464404 0.
  0.         0.20964166 0.        ]
 [0.25384691 0.25384691 0.25384691 0.25384691 0.         0.
  0.18061417 0.         0.         0.         0.25384691 0.25384691
  0.         0.         0.25384691 0.25384691 0.25384691 0.25384691
  0.18061417 0.         0.25384691 0.         0.         0.25384691
  0.25384691 0.18061417 0.25384691]]

我们可以看出,ngram_range(1,2)参数指定了生成1元词袋和2元词袋,这样的特征向量会体现出词的上下文关系。

下面利用网格搜索最佳的N元分词,以及模型的正则化参数:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV 

# 样本数据
X_train =["good boy", "rubbish boy", "good girl", "rubbish girl", "cute girl", "cute boy"]
# 对应的分类
y_train = [1, 0, 1, 0, 1, 1]

# 先转换tfidf,再训练lr分类
pipe = Pipeline([('tfidf_vect', TfidfVectorizer()), ('lr', LogisticRegression())])

# 网格搜索参数: tfidf的N元词袋, 以及LR模型的C参数
param_grid = {'tfidf_vect__ngram_range': [(1,1),(1,2),(1,3)], 'lr__C': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(X_train, y_train)

print(grid.best_estimator_)
Pipeline(memory=None,
     steps=[('tfidf_vect', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])

小结与展望

对文本应用词袋模型或者tf/idf模型,分别对应CountVectorizer与TfidfVectorizer,可以实现文本的简单处理。

更高级的文本处理涉及到NLP领域,比如递归神经网络RNN能够输出文本,而不是分类,适合于自动翻译与摘要,未来可以学习tensorflow框架来接触到这些模型。

You can’t perform that action at this time.