## How TF-IDF Works

TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.

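TF: how often a given term occurs in a given document.
IDF: a measure of how important a term is across the whole collection; the fewer documents a term appears in, the higher its IDF.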

A term that occurs frequently within one document but appears in few documents across the collection receives a high TF-IDF weight.
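One common textbook formulation is sketched below. It is an assumption for illustration only: the text above does not fix a specific formula, and libraries typically use variants with smoothing and normalization. Here $f_{t,d}$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

```latex
\begin{align}
\mathrm{tf}(t,d) &= \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \\
\mathrm{idf}(t) &= \log\frac{N}{\mathrm{df}(t)} \\
\text{tf-idf}(t,d) &= \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
\end{align}
```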

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Corpus
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
# Convert the text into a term-count matrix
vectorizer = CountVectorizer()
# Count how many times each word occurs in each document
X = vectorizer.fit_transform(corpus)
# Vocabulary, one entry per column of X (get_feature_names() was removed in scikit-learn 1.2)
word = vectorizer.get_feature_names_out()
# Show the term counts
print(X.toarray())

# Reweight the raw counts into TF-IDF values
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
# TF-IDF matrix
tfidf.toarray()

[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]


array([[0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674],
       [0.        , 0.27230147, 0.        , 0.27230147, 0.        ,
        0.85322574, 0.22262429, 0.        , 0.27230147],
       [0.55280532, 0.        , 0.        , 0.        , 0.55280532,
        0.        , 0.28847675, 0.55280532, 0.        ],
       [0.        , 0.43877674, 0.54197657, 0.43877674, 0.        ,
        0.        , 0.35872874, 0.        , 0.43877674]])
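To see where these numbers come from, the sketch below recomputes the TF-IDF matrix by hand with NumPy. It assumes `TfidfTransformer` runs with its default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), which give the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row. The names `counts`, `df`, `idf`, `weighted`, and `manual_tfidf` are illustrative and do not appear in the original cell.

```python
import numpy as np

# Term counts copied from the printed output above
# (rows = documents, columns = vocabulary terms in alphabetical order).
counts = np.array([
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 2, 1, 0, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 0],
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
], dtype=float)

n_docs = counts.shape[0]

# Document frequency: number of documents in which each term occurs.
df = (counts > 0).sum(axis=0)

# Smoothed IDF, assuming the transformer's default smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the raw counts by IDF, then scale each row to unit L2 norm.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(manual_tfidf, 8))
```

A term that occurs in every document (`the`, column 7) gets the minimum IDF of 1, which is why it has the smallest nonzero weight in each row.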

## Mutual Information

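Mutual information is a useful measure from information theory. It can be read as the amount of information one random variable contains about another, or equivalently as the reduction in uncertainty about one random variable once the other is known.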

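The chain rule for joint entropy gives

$$H(x,y) = H(x) + H(y \mid x) = H(y) + H(x \mid y)$$

and rearranging yields the mutual information:

$$I(x;y) = H(x) - H(x \mid y) = H(y) - H(y \mid x)$$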
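The identities above can be checked numerically on a small discrete example. The following is a minimal sketch, not a library routine: the helper names `entropy` and `mutual_information` and the joint table `joint` are hypothetical, and entropies are measured in nats. The conditional entropies are obtained from the chain rule, H(x|y) = H(x,y) - H(y).

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats of a probability table (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_information(joint):
    """Return (H(x) - H(x|y), H(y) - H(y|x)) for a joint table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    h_xy = entropy(joint)
    h_x = entropy(joint.sum(axis=1))  # marginal of x (sum over y)
    h_y = entropy(joint.sum(axis=0))  # marginal of y (sum over x)
    h_x_given_y = h_xy - h_y          # chain rule: H(x,y) = H(y) + H(x|y)
    h_y_given_x = h_xy - h_x          # chain rule: H(x,y) = H(x) + H(y|x)
    return h_x - h_x_given_y, h_y - h_y_given_x

# Hypothetical joint distribution over two binary variables.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])

mi_from_x, mi_from_y = mutual_information(joint)
print(mi_from_x, mi_from_y)  # the two expressions agree up to rounding
```

For independent variables both expressions come out as zero, and for two identical fair binary variables they equal ln 2.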

In [7]:
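from sklearn import datasets
from sklearn.feature_selection import mutual_info_classif

# Load the iris data: four continuous features and a class label
iris = datasets.load_iris()
x = iris.data
label = iris.target

# Estimate the mutual information between each feature and the label;
# the estimator adds small random noise, so exact values vary between runs
mutual_info = mutual_info_classif(x, label, discrete_features=False)
print(mutual_info)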

[0.50467907 0.22145349 0.99041711 0.99483244]
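The scores are easier to read when paired with feature names. The sketch below ranks the iris features by estimated mutual information with the label. Passing `random_state=0` is an assumption added here so that repeated runs give the same estimate; the original cell does not set it, so the exact values may differ slightly from the output above. The name `ranked` is illustrative.

```python
from sklearn import datasets
from sklearn.feature_selection import mutual_info_classif

iris = datasets.load_iris()

# random_state fixes the small noise the estimator adds to continuous features
# (an assumption for reproducibility; not set in the original cell).
scores = mutual_info_classif(
    iris.data, iris.target, discrete_features=False, random_state=0
)

# Pair each score with its feature name and sort from most to least informative.
ranked = sorted(zip(iris.feature_names, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

As in the output above, the two petal measurements carry far more information about the species than the sepal measurements, with sepal width the least informative.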
