### TF-IDF (Term Frequency-Inverse Document Frequency)

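TF: term frequency, i.e. how often each term occurs within a document.

IDF: inverse document frequency. It reflects how informative a term is and corrects the feature value that raw term frequency alone would give. In short, IDF captures how widely a term is spread across the corpus: a term that appears in many documents should get a low IDF, while a term that appears in few documents should get a high IDF. Specialized terms such as "Machine Learning" fall in the second group. In the extreme case where a term appears in every document, its IDF is 0.

The basic IDF formula for a term x is:
> IDF(x) = log(N / N(x))
>> N is the total number of documents in the corpus; N(x) is the number of documents that contain x.

This formula cannot be evaluated when the denominator is 0 (for example, a rare term that never occurs in the corpus), so it is smoothed to:
> IDF(x) = log((N + 1) / (N(x) + 1))

The TF-IDF value of a term is then:
> TF-IDF(x) = TF(x) * IDF(x)
>> TF(x) is the frequency of term x in the current document.

[Reference article](https://www.cnblogs.com/pinard/p/6693230.html)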
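
The formulas above can be checked on a tiny example. The sketch below is only an illustration: the corpus is made up, and it assumes TF is the relative frequency of a term within a document (other definitions, such as raw counts, are also common).

```python
import math

# Toy corpus (hypothetical): each document is already tokenized.
docs = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "is", "popular"],
    ["cooking", "is", "fun"],
]

def tf(term, doc):
    """Relative frequency of term within one document (an assumed TF definition)."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Basic IDF: log(N / N(x)). Fails when no document contains the term."""
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def idf_smooth(term, docs):
    """Smoothed IDF: log((N + 1) / (N(x) + 1)). Defined even when N(x) = 0."""
    n_containing = sum(term in doc for doc in docs)
    return math.log((len(docs) + 1) / (n_containing + 1))

def tf_idf(term, doc, docs):
    """TF-IDF(x) = TF(x) * IDF(x), using the smoothed IDF."""
    return tf(term, doc) * idf_smooth(term, docs)

print(idf("is", docs))                   # term occurs in every document
print(idf("machine", docs))              # term occurs in one document
print(idf_smooth("unseen", docs))        # no division by zero for an unseen term
print(tf_idf("machine", docs[0], docs))
```

A term that occurs in every document ("is") gets a basic IDF of exactly 0, while the smoothed variant stays finite for a term the corpus has never seen.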
   
### Sampling a small dataset, vectorizing the text with a bag-of-words model, and computing TF-IDF

In [32]:
import pandas as pd
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import metrics as mr
from sklearn.feature_selection import mutual_info_classif

In [2]:
train = pd.read_table('./cnews/cnews.val.txt',sep='\t',encoding='utf-8',header=None,names=['label','content'])
train.head()

Unnamed: 0,label,content
0,体育,黄蜂vs湖人首发：科比带伤战保罗 加索尔救赎之战 新浪体育讯北京时间4月27日，NBA季后赛...
1,体育,1.7秒神之一击救马刺王朝于危难 这个新秀有点牛！新浪体育讯在刚刚结束的比赛中，回到主场的马...
2,体育,1人灭掘金！神般杜兰特！ 他想要分的时候没人能挡新浪体育讯在NBA的世界里，真的猛男，敢于直...
3,体育,韩国国奥20人名单：朴周永领衔 两世界杯国脚入选新浪体育讯据韩联社首尔9月17日电 韩国国奥...
4,体育,天才中锋崇拜王治郅 周琦：球员最终是靠实力说话2月14日从土耳其男篮邀请赛回到北京之后，周琦...


In [3]:
df1 = train[train.label=='体育'].sample(frac=0.01,random_state=1)  # '体育' = sports
df2 = train[train.label=='娱乐'].sample(frac=0.01,random_state=1)  # '娱乐' = entertainment
print(df1.shape,df2.shape)

(5, 2) (5, 2)


In [4]:
df = pd.concat([df1,df2])
df.shape

(10, 2)

In [5]:
df.head(1)

Unnamed: 0,label,content
304,体育,23+7+6！韦德一条龙暴扣 三巨头就只有他还在战斗新浪体育讯23分、7助攻、6篮板、1抢断...


In [6]:
stop_words = pd.read_csv('./cnews/stopwords.txt',index_col=False, quoting=3,sep='\t',
                        names=['stopword'],encoding='utf-8')
stop_words.head(2)

Unnamed: 0,stopword
0,!
1,""""


In [7]:
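# Tokenize with jieba and remove stop words
content = df.content.values.tolist()
label = df.label.values.tolist()
stopwords = set(stop_words.stopword.values.tolist())  # a set gives fast membership tests

def preprocess_text(content, label, result):
    """Append (space-joined tokens, label) for each document to result."""
    for text, lab in zip(content, label):
        try:
            segs = jieba.lcut(text)
            # Drop single-character tokens and stop words
            segs = [w for w in segs if len(w) > 1 and w not in stopwords]
            result.append((" ".join(segs), lab))
        except Exception:
            # Print the document that failed to tokenize and move on
            print(text)
            continue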

In [8]:
result = []
preprocess_text(content,label,result)

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/0d/j7b0cl_s2qx7rt2p7_vp4sfr0000gp/T/jieba.cache
Loading model cost 0.792 seconds.
Prefix dict has been built succesfully.


In [9]:
result[:1]

[('韦德 一条龙 暴扣 巨头 新浪 体育讯 助攻 篮板 抢断 盖帽 热火 最终 客场 无力回天 打出 数据 德维恩 韦德 问心无愧 热火 主力阵容 唯一 发挥 一刻 勇士 只不过 主场 作战 凯尔特人 信心 飙升 双拳 难敌 四手 韦德 只得 吞下 失利 苦果 两场 搞定 老辣 绿衫 韦德 进攻 予取予求 走上 罚球线 侵略性 可见一斑 凯尔特人 碰到 暴走 韦德 确实 办法 这场 比赛 回到 主场 凯尔特人 开场 攻击性 十足 保罗 皮尔斯 凯文 加内特 开场 哨响 卯足了劲 热火 韦德 是从 松懈 客场 志在 总冠军 韦德 阻碍 韦德 超级 巨星 得分 之外 进攻 端的 作用 关键 第二节 剩下 分多钟 乔尔 安东尼 马里奥 查尔 默斯 传球 得分 进攻 归功于 韦德 热火 进攻 配合 韦德 篮筐 右侧 接球 原本 另一侧 凯文 加内特 协防 韦德 并未 纠缠 第一 时间 将球 传到 外线 查尔 默斯 查尔 默斯 停顿 塞进 内线 安东尼 防守 安东尼 加内特 顾忌 查尔 默斯 传球 韦德 并未 回防 到位 安东尼 面对 补防 梅因 奥尼尔 篮下 轻松 得分 热火 阵容 查尔 默斯 安东尼 两名 替补 合力 表现 抢眼 两位 替补 发挥 之外 第二节 韦德 创造 机会 韦德 进攻 端起 推进 作用 助攻 可惜 詹姆斯 这方面 不好 克里斯 波什 睡醒 模样 一己 之力 单挑 凯尔特人 皮尔斯 加内特 韦德 竭尽所能 凯尔特人 追回 一场 热火 并不需要 紧张 绿衫 经验丰富 球队 棒子 打死 热火 关键 一点 韦德 发挥 替补 发挥 下一场 胜利 奠定 基础 XWT185',
  '体育')]

In [10]:
x,y = zip(*result)

In [11]:
vec = CountVectorizer(max_features=200)  # for readability, keep only the 200 most frequent terms
trans = TfidfTransformer()
tfidf = trans.fit_transform(vec.fit_transform(x))
print(tfidf)

  (0, 191)	0.6610270076661171
  (0, 71)	0.04131418797913232
  (0, 101)	0.02369036158888635
  (0, 12)	0.03213552402585269
  (0, 27)	0.08262837595826464
  (0, 159)	0.04131418797913232
  (0, 128)	0.2891993158539262
  (0, 112)	0.036145032506955825
  (0, 64)	0.0971994068500506
  (0, 104)	0.04131418797913232
  (0, 82)	0.04131418797913232
  (0, 35)	0.1445801300278233
  (0, 6)	0.08262837595826464
  (0, 18)	0.18072516253477913
  (0, 52)	0.04131418797913232
  (0, 99)	0.04131418797913232
  (0, 166)	0.07229006501391165
  (0, 185)	0.2429985171251265
  (0, 183)	0.04131418797913232
  (0, 169)	0.04131418797913232
  (0, 25)	0.04131418797913232
  (0, 120)	0.03213552402585269
  (0, 47)	0.04131418797913232
  (0, 76)	0.0971994068500506
  (0, 14)	0.04131418797913232
  :	:
  (9, 98)	0.05488399330623517
  (9, 37)	0.05488399330623517
  (9, 51)	0.10976798661247034
  (9, 55)	0.07667707481026079
  (9, 79)	0.21953597322494067
  (9, 147)	0.10976798661247034
  (9, 116)	0.05488399330623517
  (9, 58)	0.054883993306235

In [12]:
'''get_feature_names() lists every term in the vocabulary'''
print(vec.get_feature_names())  # removed in scikit-learn 1.2; use get_feature_names_out() there

['tm', '一人', '一位', '一场', '一年', '一点', '主场', '主帅', '主角奖', '人民币', '伊朗', '传球', '体育讯', '作品', '保罗', '儿子', '克劳德', '入殓', '凯尔特人', '凯撒', '凯文', '制片', '制片人', '前往', '剧情', '办法', '加内特', '助攻', '北京', '医生', '千万元', '华纳', '单防', '原本', '去世', '发挥', '发现', '发行', '发高烧', '受伤', '变得', '只会', '台湾', '合作', '合同', '名单', '哈尼', '回到', '国际', '外语片', '多年', '大奖', '失利', '夺冠', '奖座秀', '奥斯卡', '好消息', '好莱坞', '威廉', '娱乐', '季后赛', '孩子', '安东尼', '实力', '客场', '家中', '对手', '对抗', '导演', '将会', '山猫', '巨头', '巴黎', '希望', '帕金斯', '并未', '开场', '弗斯', '影坛', '影帝', '影片', '得分', '德维恩', '心诚则灵', '总冠军', '总裁', '情况', '感情', '感言', '打死', '打球', '执导', '扮演', '投入', '报道', '拍摄', '拿下', '接受', '接过', '搞定', '故事', '新浪', '新闻', '新闻节目', '无力回天', '日本', '时间', '晚间', '普里', '更好', '替补', '最佳', '最终', '期间', '未来', '本木雅弘', '机场', '李岗', '查尔', '横扫', '比赛', '法国', '法国电影', '法拉', '波士顿', '淘汰', '湖人', '演员', '热火', '爆出', '父亲', '父母', '片中', '球员', '球迷', '球队', '电影', '电影节', '男人', '留在', '百万富翁', '皮尔斯', '真诚', '眼神', '知名', '短暂', '短片', '票房', '禁止', '福尔摩斯', '科林斯', '童星', '第一', '第一场', '第一部', '第二节', '第六场', '第四节', '策略', '篮

In [13]:
'''vocabulary_ maps every term in the vocabulary to its column index'''
print(vec.vocabulary_)

{'韦德': 191, '巨头': 71, '新浪': 101, '体育讯': 12, '助攻': 27, '篮板': 159, '热火': 128, '最终': 112, '客场': 64, '无力回天': 104, '德维恩': 82, '发挥': 35, '主场': 6, '凯尔特人': 18, '失利': 52, '搞定': 99, '绿衫': 166, '进攻': 185, '走上': 183, '罚球线': 169, '办法': 25, '比赛': 120, '回到': 47, '开场': 76, '保罗': 14, '皮尔斯': 141, '凯文': 20, '加内特': 26, '总冠军': 84, '得分': 81, '第二节': 155, '安东尼': 62, '查尔': 118, '默斯': 199, '传球': 11, '原本': 33, '并未': 75, '第一': 152, '时间': 106, '面对': 190, '阵容': 189, '替补': 110, '表现': 179, '一场': 3, '紧张': 161, '球队': 135, '打死': 89, '一点': 5, '胜利': 173, '受伤': 39, '闪电侠': 188, 'tm': 0, '顽强': 192, '北京': 28, '波士顿': 124, '球迷': 134, '打球': 90, '一位': 2, '对抗': 67, '身体': 184, '球员': 133, '新闻': 102, '膝盖': 174, '肯定': 172, '医生': 29, '情况': 86, '魔兽': 196, '科林斯': 150, '单防': 32, '魔术': 197, '季后赛': 60, '首轮': 194, '山猫': 70, '帕金斯': 74, '对手': 66, '横扫': 119, '老鹰': 171, '策略': 158, '系列赛': 160, '一人': 1, '第四节': 157, '落后': 177, '留在': 139, '黄蜂': 198, '第六场': 156, '湖人': 126, '实力': 63, '第一场': 153, '爆出': 129, '拿下': 96, '蓝色': 178, '眼神': 143, '未来': 114, '将

In [14]:
'''toarray() shows the term-count matrix'''
print(vec.fit_transform(x).toarray())

[[0 0 0 ... 0 0 5]
 [2 0 2 ... 0 0 0]
 [0 2 0 ... 6 0 0]
 ...
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [15]:
print(tfidf.toarray())

[[0.         0.         0.         ... 0.         0.         0.24299852]
 [0.18949997 0.         0.14093671 ... 0.         0.         0.        ]
 [0.         0.10362647 0.         ... 0.31087942 0.         0.        ]
 ...
 [0.         0.         0.06909125 ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]]


In [29]:
# For easier inspection, convert to a DataFrame and eyeball the feature distribution (200 term features in total)
tfidf_df = pd.DataFrame(tfidf.toarray())
tfidf_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
0,0.0,0.0,0.0,0.036145,0.0,0.041314,0.082628,0.0,0.0,0.0,...,0.036145,0.661027,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.242999
1,0.1895,0.0,0.140937,0.070468,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.563823,0.161092,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.103626,0.0,0.0,0.0,0.088092,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.044046,0.0,0.829012,0.310879,0.0,0.0
3,0.0,0.0,0.0,0.043304,0.0,0.0,0.197989,0.0,0.0,0.0,...,0.0,0.0,0.049497,0.0,0.049497,0.0,0.0,0.0,0.756935,0.0
4,0.0,0.0,0.0,0.0,0.159337,0.0,0.0,0.159337,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.028974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.028974,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.496027,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.069091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.078972,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.111719,0.0,...,0.0,0.0,0.0,0.111719,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.109768,0.0,...,0.048017,0.0,0.0,0.0,0.0,0.129125,0.0,0.0,0.0,0.0


In [30]:
# This approach seems better suited to extracting the keywords of each document

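### Feature selection with mutual information

Pointwise mutual information (PMI): measures the association between two events.
> As an NLP example, suppose we want to gauge the polarity of the word "like" (positive or negative sentiment). We can pick some positive-sentiment words in advance, such as "good", and then compute the PMI between "like" and "good".

Mutual information (MI): measures the dependence between two random variables, i.e. how much information one random variable carries about the other.
> Mutual information is the weighted sum of the pointwise mutual information (PMI) over all possible values of X and Y.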
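
Written out, the two definitions above take the following standard form, where p(x, y) is the joint probability and p(x), p(y) are the marginals. The symbols are introduced here for illustration and do not appear in the original text.

$$
\mathrm{PMI}(x; y) = \log \frac{p(x, y)}{p(x)\,p(y)}
$$

$$
I(X; Y) = \sum_{x} \sum_{y} p(x, y)\,\mathrm{PMI}(x; y) = \sum_{x} \sum_{y} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)}
$$

PMI is 0 when x and y are independent, so independent variables have zero mutual information.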

[Reference article](https://blog.csdn.net/u013710265/article/details/72848755)

[mutual_info_classif documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.mutual_info_classif.html)

[mutual_info_score documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html)

In [16]:
# For easier inspection, convert the term-count matrix to a DataFrame (200 term features in total)
vec_df = pd.DataFrame(vec.fit_transform(x).toarray())
vec_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,190,191,192,193,194,195,196,197,198,199
0,0,0,0,1,0,1,2,0,0,0,...,1,16,0,0,0,0,0,0,0,5
1,2,0,2,1,0,0,0,0,0,0,...,0,7,2,0,0,0,0,0,0,0
2,0,2,0,0,0,2,0,0,0,0,...,0,0,0,0,1,0,16,6,0,0
3,0,0,0,1,0,0,4,0,0,0,...,0,0,1,0,1,0,0,0,13,0
4,0,0,0,0,3,0,0,3,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,3,...,0,0,0,0,0,0,0,0,0,0
7,0,0,1,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
8,0,0,0,0,0,0,0,0,1,0,...,0,0,0,1,0,0,0,0,0,0
9,0,0,0,0,0,0,0,0,2,0,...,1,0,0,0,0,2,0,0,0,0


In [70]:
'''Mutual information between a single feature and the label'''
xx = vec.fit_transform(x).toarray()
print(xx)
print(y)

print(mr.mutual_info_score(xx[:,0], y))

[[0 0 0 ... 0 0 5]
 [2 0 2 ... 0 0 0]
 [0 2 0 ... 6 0 0]
 ...
 [0 0 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
('体育', '体育', '体育', '体育', '体育', '娱乐', '娱乐', '娱乐', '娱乐', '娱乐')
0.0748817616223546


In [80]:
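# Note: term counts are discrete, yet discrete_features=False makes scikit-learn treat them as
# continuous and use a nearest-neighbor estimator. With only 10 samples that estimate is noisy:
# scores above ln(2) ≈ 0.693 (the entropy of a balanced two-class label) exceed the true upper bound on MI.
mutual_info = mutual_info_classif(xx,y,discrete_features=False,random_state=1)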
print(mutual_info)

[0.03873016 0.         0.         0.36039683 0.         0.06015873
 0.         0.22218254 0.14944444 0.00123016 0.08277778 0.
 1.20301587 0.02777778 0.0218254  0.         0.         0.
 0.         0.20611111 0.13944444 0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.08253968 0.18349206
 0.12277778 0.         0.         0.         0.         0.
 0.07801587 0.         0.         0.         0.01968254 0.08801587
 0.04444444 0.         0.         0.         0.00944444 0.
 0.         0.76230159 0.         0.24468254 0.08444444 0.18539683
 0.         0.         0.         0.         0.         0.
 0.         0.14039683 0.06206349 0.         0.02706349 0.
 0.03634921 0.         0.         0.         0.12539683 0.
 0.04087302 0.13015873 0.05777778 0.13253968 0.16587302 0.08373016
 0.21444444 0.         0.         0.         0.         0.02444444
 0.         0.03253968 0.08444444 0.         0.         0.18539683


In [77]:
print(mutual_info.max())
print(mutual_info.argmax())  # index of the feature with the highest score

1.2030158730158729
12


In [78]:
print(list(mutual_info)[55])
list(vec.vocabulary_.keys())[list(vec.vocabulary_.values()).index(55)]

0.7623015873015871


'奥斯卡'

In [79]:
print(list(mutual_info)[12])
list(vec.vocabulary_.keys())[list(vec.vocabulary_.values()).index(12)]

1.2030158730158729


'体育讯'

In [75]:
print(list(mutual_info)[2])
list(vec.vocabulary_.keys())[list(vec.vocabulary_.values()).index(2)]

0.0


'一位'
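
The lookups above search the list of vocabulary values each time. A possibly simpler alternative is to invert the term-to-index mapping once and then rank features by score. The sketch below uses a small hand-copied excerpt of `vec.vocabulary_` and of the scores shown above; the variable names are made up for the example.

```python
# Hypothetical excerpt of vec.vocabulary_ and of the MI scores, copied from the outputs above.
vocabulary = {"体育讯": 12, "奥斯卡": 55, "一位": 2}
scores = {12: 1.2030158730158729, 55: 0.7623015873015871, 2: 0.0}

# Invert the term -> index mapping once instead of searching the values on every lookup.
index_to_term = {index: term for term, index in vocabulary.items()}

# Rank the features by mutual information, highest first.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for index, score in ranked:
    print(index_to_term[index], round(score, 4))
```

With the full vocabulary and the full `mutual_info` array, the same pattern would list the highest-scoring terms first.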

In [31]:
# This approach is better suited to feature selection

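### Other references
[Some notes on mutual information](https://www.douban.com/note/621588501/)
> This note explains in detail how mutual information is used as a feature-selection criterion, the subtle pitfalls of using it for that purpose, and why normalized mutual information is a more reasonable criterion.

[Feature selection](https://www.cnblogs.com/stevenlk/p/6543628.html)
> This article covers the maximal information coefficient (MIC), which addresses two weaknesses of mutual information: it is hard to normalize, and it is sensitive to how the data is discretized. MIC first searches for an optimal discretization and then converts the mutual information into a measure that takes values in [0, 1]. The minepy package provides an MIC implementation.

[How to do feature selection (theory): pitfalls you will run into in machine learning](https://baijiahao.baidu.com/s?id=1604074325918456186&wfr=spider&for=pc)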

In [81]:
'''Normalized mutual information (NMI)'''
print(mr.normalized_mutual_info_score(xx[:,0], y))

0.15774885315354795
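
NMI divides the mutual information by a mean of the two entropies, and the choice of mean changes the result. The hand computation below, on column 0 of the term-count matrix and the labels shown earlier, suggests the value above was obtained with the geometric mean, which was the default `average_method` in older scikit-learn releases; since version 0.22 the default is the arithmetic mean, which gives a smaller value on this data. Helper names are made up for the example.

```python
import math
from collections import Counter

# Column 0 of the term-count matrix and the labels, copied from the notebook outputs.
feature = [0, 2, 0, 0, 0, 0, 0, 0, 0, 0]
labels = ["体育"] * 5 + ["娱乐"] * 5

def entropy(values):
    """Shannon entropy in nats of a discrete sequence."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """MI in nats between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

mi = mutual_information(feature, labels)
h_x, h_y = entropy(feature), entropy(labels)

nmi_geometric = mi / math.sqrt(h_x * h_y)   # normalize by the geometric mean of the entropies
nmi_arithmetic = mi / ((h_x + h_y) / 2)     # normalize by the arithmetic mean of the entropies
print(nmi_geometric)
print(nmi_arithmetic)
```

When comparing NMI values across environments, it may help to pass `average_method` explicitly so the normalization does not depend on the installed version.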
