Computing TF-IDF features for documents with sklearn (by contrast, jieba relies on a built-in or user-supplied IDF dictionary)

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer

#### TF-IDF for English text

In [42]:
# Raw text
document_en = ["I have a pen.",
            "I have an apple."]

In [43]:
"""
TfidfVectorizer(input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.float64, norm='l2', use_idf=True, smooth_idf=True,
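                 sublinear_tf=False)
Most parameters are the same as CountVectorizer's; the ones worth noting here are:
use_idf: default True, i.e. term counts are reweighted by inverse document frequency
smooth_idf: default True, applies add-one (Laplace-style) smoothing to the document frequencies, as if one extra document contained every term
tokenizer: None or a callable that overrides the tokenization step; only applies when analyzer='word' (CountVectorizer accepts it as well)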
"""
vector = TfidfVectorizer()

In [45]:
result1 = vector.fit_transform(document_en)
print(result1)

  (0, 2)	0.5797386715376657
  (0, 3)	0.8148024746671689
  (1, 2)	0.4494364165239821
  (1, 0)	0.6316672017376245
  (1, 1)	0.6316672017376245


In [46]:
# Vocabulary
vector.vocabulary_

{'an': 0, 'apple': 1, 'have': 2, 'pen': 3}
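Note that "I" and "a" are missing from the vocabulary. The default `token_pattern` only matches tokens of two or more word characters, and `lowercase=True` folds case before matching. The sketch below imitates that step with the standard `re` module; it approximates the vectorizer's analyzer and is not its actual code path.

```python
import re

document_en = ["I have a pen.", "I have an apple."]

# Same regular expression as the default token_pattern.
default_pattern = re.compile(r"(?u)\b\w\w+\b")

# Lowercase first, then extract tokens, mirroring lowercase=True.
tokens = [default_pattern.findall(doc.lower()) for doc in document_en]
print(tokens)  # [['have', 'pen'], ['have', 'an', 'apple']]

# Feature indices follow the sorted order of the distinct terms.
distinct_terms = sorted({term for doc in tokens for term in doc})
vocabulary = {term: idx for idx, term in enumerate(distinct_terms)}
print(vocabulary)  # {'an': 0, 'apple': 1, 'have': 2, 'pen': 3}
```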

#### TF-IDF for Chinese text

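Key differences from processing English text:<br>
(1) The text must be segmented into words manually, since Chinese has no whitespace between words<br>
(2) The token_pattern regular expression may need to be changed, because the default pattern discards single-character tokens, which are often meaningful words in Chinese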
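To see why point (2) matters, compare the default pattern with a relaxed one on a sentence that has already been split into space-separated words. The segmented string is written out by hand here as an assumed example of segmenter output, and only the standard `re` module is used.

```python
import re

# A sentence already segmented into space-separated words (assumed input).
segmented = "我 是 一条 天狗 呀 ！"

# Default pattern: at least two word characters per token.
default_tokens = re.findall(r"(?u)\b\w\w+\b", segmented)
# Relaxed pattern: one or more word characters per token.
relaxed_tokens = re.findall(r"(?u)\b\w+\b", segmented)

print(default_tokens)  # ['一条', '天狗']
print(relaxed_tokens)  # ['我', '是', '一条', '天狗', '呀']
```

With the default pattern, three of the five words would never reach the vocabulary. Punctuation such as "！" is dropped by both patterns.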

In [49]:
document_cn = ["我是一条天狗呀！","我把月来吞了。","我把日来吞了。","我把一切的星球来吞了。","我把全宇宙来吞了。","我便是我了！"]

In [50]:
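# Chinese text must be segmented into words first
import jieba

def word_segment(sent):
    # Join the jieba tokens with spaces so the vectorizer can split on them
    return " ".join(jieba.lcut(sent))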

text_seg = list(map(word_segment, document_cn))
text_seg

['我 是 一条 天狗 呀 ！',
 '我 把 月 来 吞 了 。',
 '我 把 日来 吞 了 。',
 '我 把 一切 的 星球 来 吞 了 。',
 '我 把 全宇宙 来 吞 了 。',
 '我 便是 我 了 ！']

In [58]:
vector2 = TfidfVectorizer(max_features=10, token_pattern=r"(?u)\b\w+\b")
result2 = vector2.fit_transform(text_seg)

In [56]:
print(result2)

  (0, 7)	0.2994779590123645
  (0, 1)	0.6746528559436284
  (0, 6)	0.6746528559436284
  (1, 7)	0.3463385195969726
  (1, 8)	0.46287181591384574
  (1, 9)	0.5401550231336202
  (1, 5)	0.46287181591384574
  (1, 2)	0.3997268378432121
  (2, 7)	0.41154075931008827
  (2, 8)	0.5500127990559459
  (2, 5)	0.5500127990559459
  (2, 2)	0.4749800471343644
  (3, 7)	0.2730597726436735
  (3, 8)	0.3649368050763702
  (3, 9)	0.4258683324651302
  (3, 5)	0.3649368050763702
  (3, 2)	0.3151521222301723
  (3, 0)	0.6151389439974322
  (4, 7)	0.2730597726436735
  (4, 8)	0.3649368050763702
  (4, 9)	0.4258683324651302
  (4, 5)	0.3649368050763702
  (4, 2)	0.3151521222301723
  (4, 4)	0.6151389439974322
  (5, 7)	0.6199649234576673
  (5, 2)	0.35776646893886044
  (5, 3)	0.6983170106657491


In [57]:
print(vector2.vocabulary_)

{'我': 7, '一条': 1, '呀': 6, '把': 8, '来': 9, '吞': 5, '了': 2, '一切': 0, '全宇宙': 4, '便是': 3}
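`max_features=10` keeps the terms with the highest total count across the corpus, so it helps to look at those counts. The sketch below recounts the segmented text with `collections.Counter`. It only approximates the vectorizer's selection: it shows which terms tie, not how scikit-learn breaks the tie.

```python
import re
from collections import Counter

text_seg = [
    "我 是 一条 天狗 呀 ！",
    "我 把 月 来 吞 了 。",
    "我 把 日来 吞 了 。",
    "我 把 一切 的 星球 来 吞 了 。",
    "我 把 全宇宙 来 吞 了 。",
    "我 便是 我 了 ！",
]

pattern = re.compile(r"(?u)\b\w+\b")

# Total occurrences of every term over the whole corpus.
term_counts = Counter(term for doc in text_seg for term in pattern.findall(doc))
print(term_counts.most_common(5))

# Terms that occur exactly once compete for the remaining slots.
singletons = [term for term, count in term_counts.items() if count == 1]
print(len(term_counts), len(singletons))  # 16 11
```

Five terms (我, 了, 把, 吞, 来) clearly outrank the rest, while eleven terms tie with a single occurrence for the five remaining slots. Which of those survive is therefore an implementation detail, and here 是, 天狗, 月, 日来, 的 and 星球 were dropped.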


**Terms that appear in most documents carry little discriminating information; the max_df parameter can be set to drop them**

In [59]:
vector2_2 = TfidfVectorizer(max_features=10, token_pattern=r"(?u)\b\w+\b", max_df=0.6)
result2_2 = vector2_2.fit_transform(text_seg)
print(result2_2)

  (0, 8)	0.5
  (0, 1)	0.5
  (0, 5)	0.5
  (0, 4)	0.5
  (1, 9)	1.0
  (2, 6)	1.0
  (3, 9)	0.43968119520278715
  (3, 0)	0.6350907205215048
  (3, 7)	0.6350907205215048
  (4, 9)	0.5692126078464125
  (4, 3)	0.8221903715494888
  (5, 2)	1.0


In [60]:
print(vector2_2.vocabulary_)

{'是': 8, '一条': 1, '天狗': 5, '呀': 4, '来': 9, '日来': 6, '一切': 0, '星球': 7, '全宇宙': 3, '便是': 2}
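To check which terms `max_df=0.6` removes, compute each term's document frequency and compare it with the threshold. This is a standard-library sketch under the assumption that a float `max_df` is read as a proportion of documents and that terms strictly above it are dropped.

```python
import re
from collections import Counter

text_seg = [
    "我 是 一条 天狗 呀 ！",
    "我 把 月 来 吞 了 。",
    "我 把 日来 吞 了 。",
    "我 把 一切 的 星球 来 吞 了 。",
    "我 把 全宇宙 来 吞 了 。",
    "我 便是 我 了 ！",
]

pattern = re.compile(r"(?u)\b\w+\b")
n_docs = len(text_seg)
max_df = 0.6

# Document frequency: count each term at most once per document.
doc_freq = Counter()
for doc in text_seg:
    doc_freq.update(set(pattern.findall(doc)))

# Terms found in more than max_df * n_docs documents are discarded.
too_common = {term for term, count in doc_freq.items() if count > max_df * n_docs}
print(too_common)
```

The set contains 我 (6 of 6 documents), 了 (5 of 6), 把 and 吞 (4 of 6 each). 来 appears in exactly half of the documents and is kept, which is why it is the only nonzero entry left in row 1 of the output above.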


**Filler words such as 是 and 呀 can be removed by supplying a stop-word list**

In [61]:
vector2_3 = TfidfVectorizer(max_features=10, token_pattern=r"(?u)\b\w+\b", max_df=0.6, stop_words=["是","呀"])
result2_3 = vector2_3.fit_transform(text_seg)
print(result2_3)

  (0, 1)	0.7071067811865476
  (0, 4)	0.7071067811865476
  (1, 7)	0.8221903715494888
  (1, 8)	0.5692126078464125
  (2, 5)	1.0
  (3, 8)	0.3711559310287487
  (3, 0)	0.5361104596573927
  (3, 9)	0.5361104596573927
  (3, 6)	0.5361104596573927
  (4, 8)	0.5692126078464125
  (4, 3)	0.8221903715494888
  (5, 2)	1.0


In [62]:
print(vector2_3.vocabulary_)

{'一条': 1, '天狗': 4, '月': 7, '来': 8, '日来': 5, '一切': 0, '的': 9, '星球': 6, '全宇宙': 3, '便是': 2}
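Compared with the previous vocabulary, 月 and 的 have reappeared. A plausible explanation is that once the stop words and the overly common terms are gone, exactly ten candidates remain, so `max_features=10` no longer cuts anything. The sketch below counts the survivors with the standard library only; the variable names are illustrative and the filtering mimics, rather than calls, the vectorizer.

```python
import re
from collections import Counter

text_seg = [
    "我 是 一条 天狗 呀 ！",
    "我 把 月 来 吞 了 。",
    "我 把 日来 吞 了 。",
    "我 把 一切 的 星球 来 吞 了 。",
    "我 把 全宇宙 来 吞 了 。",
    "我 便是 我 了 ！",
]

pattern = re.compile(r"(?u)\b\w+\b")
stop_words = {"是", "呀"}
max_df = 0.6
n_docs = len(text_seg)

# Drop stop words at tokenization time, then count document frequencies.
doc_freq = Counter()
for doc in text_seg:
    doc_freq.update({term for term in pattern.findall(doc) if term not in stop_words})

# Keep only terms that do not exceed the max_df threshold.
kept = sorted(term for term, count in doc_freq.items() if count <= max_df * n_docs)
print(len(kept), kept)
```

The ten surviving terms are the same ten keys that appear in `vector2_3.vocabulary_`.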
