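# Topic Tagging Test with Simple TF-IDF

The basic idea is to pool the words from each question's title and description and treat them as one document, compute the IDF of every word over all questions, and then use each tag's (topic's) words as keywords to score the similarity between the tag and the question.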

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time 
import operator
%matplotlib inline

In [None]:
from collections import Counter

## Read in data sets

In [None]:
#data_path='D:/AI.Data/Data/ieee_zhihu_cup/'
data_path = 'ieee_zhihu_cup/'
df_question_topic = pd.read_csv(data_path+'question_topic_train_set.txt', header=None, names=['question_id', 'topic_id'],sep='\t')
df_topics = pd.read_csv(data_path+'topic_info.txt', header=None, names=['topic_id', 'pid', 'cn', 'wn', 'cd', 'wd'],sep='\t')
df_questions = pd.read_csv(data_path+'question_train_set.txt',header=None, names=['question_id', 'ct', 'wt','cd','wd'], sep='\t')

## Prepare df_questions

In [None]:
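def split(row):
    # Merge title words (wt) and description words (wd); a missing field is NaN (a float), so skip it.
    fields = [f for f in (row.wt, row.wd) if isinstance(f, str)]
    return ','.join(fields).split(',') if fields else []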

In [None]:
df_questions['wt_list'] = df_questions.apply(split, axis = 1)

In [None]:
#df_questions['wd_list'] = df_questions.wd.apply(split)

In [None]:
# wt_list already holds the title and description words merged by split(); wd_list is never built.
df_questions['words'] = df_questions.wt_list

In [None]:
df_questions['wt_counter'] = df_questions.words.apply(Counter)

In [None]:
def to_dict(row):
    d = {}
    for word in row:
        if word not in d.keys():
            d[word] = 1
        else:
            d[word] += 1
    return d

In [None]:
df_questions['word_dict'] = df_questions.words.apply(to_dict)

In [None]:
# wd_list is never created above, so ignore any column that is missing.
df_questions = df_questions.drop(columns=['wt_list', 'wd_list'], errors='ignore')
df_questions.head(5)

## Count word occurrences across all questions, including title and description

In [None]:
def CountWords(row):
    # Count each word once per question so that word_dict holds D_w, the document frequency.
    for w in set(row):
        word_dict[w] = word_dict.get(w, 0) + 1

word_dict = {}
_ = df_questions.words.apply(CountWords)

In [None]:
len(word_dict)

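## Compute the inverse document frequency (IDF)

$$ IDF_w = \log_2\left(\frac{D}{D_w}\right) $$

$D$: the total number of Questions.

$D_w$: the number of Questions in which word $w$ appears.

For example, 的 appears in almost every question, so its IDF is close to zero.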
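As a small illustration of the definition above, the following sketch computes IDF on a toy corpus. The word IDs and the `toy_` names are hypothetical and used only for this example; the base-2 logarithm matches the `np.log2` call used in this notebook.

```python
import math
from collections import Counter

# Toy corpus (hypothetical word IDs): each question is a list of words.
toy_questions = [
    ['w1', 'w2', 'w2'],
    ['w1', 'w3'],
    ['w1', 'w2', 'w4'],
    ['w1'],
]

# Document frequency D_w: count each word at most once per question.
toy_doc_freq = Counter()
for words in toy_questions:
    toy_doc_freq.update(set(words))

toy_D = len(toy_questions)
toy_idf = {w: math.log2(toy_D / n) for w, n in toy_doc_freq.items()}
```

Here `w1` occurs in all four questions and gets an IDF of 0, while `w3` occurs in only one and gets an IDF of 2. Note that `w2` is counted once for the first question even though it occurs there twice.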

In [None]:
idf_dict={}

In [None]:
D = len(df_questions)
for k,v in word_dict.items():
    idf_dict[k] = np.log2(float(D)/v)

In [None]:
idf_dict

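# Compute the relevance between a Question and a Topic with TF-IDF

$$ \text{TF-IDF} = TF_1 \cdot IDF_1 + TF_2 \cdot IDF_2 + \dots + TF_N \cdot IDF_N $$

$TF_x$: the frequency of word $x$ in this Question:

$$ TF_x = \frac{\text{number of times word } x \text{ occurs in this Question}}{\text{total number of words in this Question}} $$

The words here are those taken from the Topic. For example, if the Topic consists of w32,w1234, we compute each Question's relevance to w32 and w1234.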
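The scoring rule can be sketched on toy data as follows. The helper `toy_tf_idf`, the IDF values, and the word IDs are all hypothetical and chosen only to make the arithmetic easy to follow; this is not the notebook's own implementation.

```python
from collections import Counter

def toy_tf_idf(topic_words, question_words, idf):
    """Sum TF * IDF over the topic words that occur in the question."""
    counts = Counter(question_words)
    total = len(question_words)
    if total == 0:
        return 0.0
    return sum(idf[w] * counts[w] / total for w in topic_words if w in counts)

# Hypothetical IDF values and word IDs, for illustration only.
example_idf = {'w1': 0.0, 'w2': 1.0, 'w3': 2.0}
example_question = ['w1', 'w2', 'w2', 'w3']

# w2: 1.0 * 2/4 = 0.5, w3: 2.0 * 1/4 = 0.5, so the score is 1.0
example_score = toy_tf_idf(['w2', 'w3'], example_question, example_idf)
```

A topic whose words never occur in the question scores 0, which is why the score can serve as a rough relevance signal.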



In [None]:
df_topics

In [None]:
df_topics[df_topics['topic_id'] == 738845194850773558].wd.iloc[0]

In [None]:
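def tf_idf(topic_id, question_id):
    # Select the matching row by position; label-based [0] only works when the match is the first row.
    topic = df_topics[df_topics['topic_id'] == topic_id].iloc[0]
    # Topic name words (wn) and description words (wd); a missing field is NaN (a float), so skip it.
    topic_word = [w for f in (topic.wn, topic.wd) if isinstance(f, str) for w in f.split(',')]

    # Word counts of the question (title and description combined).
    word_dict = df_questions[df_questions['question_id'] == question_id].word_dict.iloc[0]
    total_word = sum(word_dict.values())
    if total_word == 0:
        return 0.0

    tf_idf_value = 0
    for word in topic_word:
        if word in word_dict:
            tf_idf_value += idf_dict[word]*word_dict[word]/total_word
    return tf_idf_value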

In [None]:
tf_idf(738845194850773558, 6555699376639805223)

In [None]:
def split_to_list(row):
    # str() guards against single IDs that pandas may have parsed as integers.
    return str(row).split(',')
df_question_topic['topic_id_list'] = df_question_topic.topic_id.apply(split_to_list)

In [None]:
df_question_topic['topic_count'] = df_question_topic.topic_id_list.apply(len)

In [None]:
# Copy the slice so that adding a column later does not trigger SettingWithCopyWarning.
sub_df = df_question_topic[df_question_topic['topic_count'] == 1].copy()

In [None]:
sub_df.iloc[1].topic_id

In [None]:
def tf_idf2(ser):
    return tf_idf(int(ser.topic_id_list[0]), ser.question_id)


In [None]:
sub_df.head(5).apply(tf_idf2, axis = 1)

In [None]:
sub_df['tf_idf'] = sub_df.apply(tf_idf2, axis = 1)

## Examine the distribution of TF-IDF

### Compute the TF-IDF distribution for single-topic Questions

### Compute the TF-IDF distribution for multi-topic Questions

In [None]:
# your code here ...

## Analyze the relationship between a Topic's position and its TF-IDF for multi-Topic Questions

In [None]:
# your code here ...

## Study how the Topic inheritance hierarchy affects Topic assignment

## Study how synonyms affect Topic assignment