# Tag Testing Based on Simple TF-IDF

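The basic idea is to pool the words from each question's title and description and treat them as one document, compute the IDF over all questions, and then treat the words of each tag (topic) as keywords to score how similar the tag is to the question.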

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time 
import operator
from collections import Counter
from ast import literal_eval
import os
import gc
import dask
import dask.dataframe as dd
from dask import threaded, multiprocessing
%matplotlib inline

## Parallelization of pandas.apply() 

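Starting from https://stackoverflow.com/questions/37078880/status-of-parallelization-of-pandas-apply

I found the following two pages on parallel computation:

https://github.com/pandas-dev/pandas/issues/13111

http://www.racketracer.com/2016/07/06/pandas-in-parallel/

I tried dask and it was extremely fast. Several cells below do the work first with plain pandas and then with dask, so you can compare the two; ideally, also check whether the results are identical. If they are, the plain pandas implementation can be deleted.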




## Read in data sets

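Store the data path in an environment variable so that the code does not have to be edited every time.

nrows: an error was once raised at row 328877, so load at least 350,000 rows while developing. The value is passed to read_csv.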

In [2]:
#data_path=os.environ.get('zhihu_data_path')+'/' 
data_path = 'ieee_zhihu_cup/'
nrows=350 * 1000 
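A minimal sketch of the environment-variable approach, assuming the variable is named `zhihu_data_path` as in the commented-out line above. The helper name `resolve_data_path` and the fallback to the local `ieee_zhihu_cup` folder are assumptions added here so that the lookup does not fail when the variable is unset.

```python
import os

def resolve_data_path(default='ieee_zhihu_cup'):
    """Return the data directory with a trailing path separator.

    Reads the zhihu_data_path environment variable and falls back to
    `default` (an assumed local folder) when the variable is not set.
    """
    base = os.environ.get('zhihu_data_path', default)
    # Joining with an empty component appends a separator only if one is missing
    return os.path.join(base, '')

data_path = resolve_data_path()
print(data_path)
```

Without a default, `os.environ.get(...)` returns `None` for an unset variable, and concatenating `'/'` to it raises a `TypeError`.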

In [3]:
start_time = time.time()
print('Start time:', start_time)
#df_questions = pd.read_csv(data_path+'question_train_set.txt',header=None, names=['question_id', 'ct', 'wt','cd','wd'], sep='\t', nrows=nrows)
df_questions = pd.read_csv(data_path+'question_train_set.txt',header=None, names=['question_id', 'ct', 'wt','cd','wd'], sep='\t')
print('time cost:', time.time() - start_time)

Start time: 1498020018.3561568
time cost: 31.47870373725891


### Convert the DataFrame to dask

In [4]:
start_time = time.time()
ddf_questions = dd.from_pandas(df_questions, npartitions = 4)
print('time cost:', time.time() - start_time)

time cost: 15.976133108139038


## Prepare df_questions

In [5]:
def split(row):
    return [] if type(row) == float else row.split(',')

### Two data anomalies
Row 3 has no description.

Row 328877 has a title whose characters do not form any words.

In [None]:
print (df_questions.loc[3])
print (df_questions.loc[328877])

In [None]:
df_questions.head(5)

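### Two ways to convert to lists, with a dramatic difference in speed

Note that dask evaluates lazily: the sub-second timings of the dask cells below only measure building the task graph, and the actual work runs when `.compute()` is called.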

In [6]:
#start_time = time.time()

#df_questions['wt_list'] = df_questions.wt.apply(split)

#print('time cost:', time.time() - start_time)

start_time = time.time()

ddf_questions['wt_list'] = ddf_questions.wt.apply(split)

print('time cost:', time.time() - start_time)

time cost: 0.017102718353271484


  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result


In [7]:
#start_time = time.time()

#df_questions['wd_list'] = df_questions.wd.apply(split)

#print('time cost:', time.time() - start_time)

start_time = time.time()

ddf_questions['wd_list'] = ddf_questions.wd.apply(split)

print('time cost:', time.time() - start_time)

time cost: 0.003017425537109375




In [None]:
start_time = time.time()
df_questions['bag_of_words'] = df_questions.apply(lambda x : x['wt_list'] + x['wd_list'], axis = 1)
print('time cost:', time.time() - start_time)

In [8]:
start_time = time.time()
ddf_questions['bag_of_words'] = ddf_questions.apply(lambda x : x['wt_list'] + x['wd_list'], axis = 1)
print('time cost:', time.time() - start_time)

time cost: 0.0050008296966552734




In [None]:
ddf_questions.head(5)

In [9]:
start_time = time.time()
df=ddf_questions[['question_id', 'bag_of_words']].compute()
df.to_pickle(data_path+'question_words_bag.pickle')
print('time cost:', time.time() - start_time)

time cost: 1319.7347674369812


### Remove all DataFrames

In [10]:
start_time = time.time()
del df
del df_questions
gc.collect()
print('time cost:', time.time() - start_time)

KeyboardInterrupt: 

### Reload data

Can we load the data to dask directly?

In [None]:
start_time = time.time()
df_bag=pd.read_pickle(data_path+'question_words_bag.pickle')
print('time cost:', time.time() - start_time)
print('End time:', time.time())

In [None]:
df_bag.head(5)

The next step is surprisingly fast: counting the words for the whole data set took just over 5 seconds.

In [None]:
start_time = time.time()
df_bag['wt_counter'] = df_bag.bag_of_words.apply(Counter)
print('time cost:', time.time() - start_time)

df_bag.head(5)

In [None]:
df_bag.wt_counter.loc[0]

## Count the number of questions in which each word occurs (title and description combined)

In [None]:
def CountWords(row):
    for w in row:
        if w not in word_dict.keys():
            word_dict[w] = 1
        else:
            word_dict[w] += 1
    return
word_dict = {}
#word_dict = dict.fromset
start_time = time.time()
_ = df_bag.wt_counter.apply(CountWords)
print('time cost:', time.time() - start_time)

print(len(word_dict))
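Because `CountWords` iterates over the keys of each per-question counter, every question adds at most 1 per distinct word, so `word_dict` holds document frequencies. The same idea can be sketched with `collections.Counter` on toy data; the word ids and the names `bags`, `counters` and `doc_freq` are made up for illustration.

```python
from collections import Counter

# Toy bags of words for three questions (made-up word ids)
bags = [['w1', 'w2', 'w1'], ['w2', 'w3'], ['w2']]

# Per-question term counts, like the wt_counter column
counters = [Counter(bag) for bag in bags]

# Document frequency: each question contributes at most 1 per distinct word
doc_freq = Counter()
for counter in counters:
    doc_freq.update(counter.keys())

print(doc_freq['w1'], doc_freq['w2'], doc_freq['w3'])  # 1 3 1
```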

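## Compute the inverse document frequency (IDF)

$$ IDF = \log_2\left(\frac{D}{D_w}\right) $$

D: the total number of questions

Dw: the number of questions in which word w occurs

For example, a word like 的 appears in almost every question, so its IDF is close to zero.

This cell runs in about a second.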

In [None]:
idf_dict={}
D = len(df_bag)
for k,v in word_dict.items():
    idf_dict[k] = np.log2(float(D)/v)
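A small worked example of the formula, using a toy corpus of four questions. The words and their document frequencies are made up.

```python
import numpy as np

D = 4  # toy corpus of four questions
doc_freq = {'common': 4, 'half': 2, 'rare': 1}  # made-up document frequencies

# IDF = log2(D / D_w): a word found in every question scores 0
idf = {word: float(np.log2(D / df)) for word, df in doc_freq.items()}
print(idf)  # {'common': 0.0, 'half': 1.0, 'rare': 2.0}
```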

In [None]:
idf_dict

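# Compute the relevance between a Question and a Topic using TF-IDF

$$ \text{TF-IDF} = TF_1 \cdot IDF_1 + TF_2 \cdot IDF_2 + \dots + TF_N \cdot IDF_N $$

$TF_1$: the frequency of word 1 in this Question, where $$ TF_x = \frac{\text{number of times word } x \text{ occurs in this Question}}{\text{total number of words in this Question}} $$

The words here are the words of the Topic. For example, if the Topic is w32,w1234, compute the relevance of each question to w32 and w1234.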
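The scoring rule can be sketched on toy data as follows. The function name `tf_idf_score`, the word counts and the IDF values are illustrative assumptions, not values from the data set.

```python
from collections import Counter

def tf_idf_score(topic_words, question_counter, idf):
    """Sum TF * IDF over the topic words that occur in the question."""
    total_words = sum(question_counter.values())
    if total_words == 0:
        return 0.0  # an empty question matches nothing
    score = 0.0
    for word in topic_words:
        if word in question_counter:
            score += idf[word] * question_counter[word] / total_words
    return score

question = Counter(['w32', 'w7', 'w32', 'w9'])           # 4 words in total
idf = {'w32': 2.0, 'w7': 1.0, 'w9': 0.5, 'w1234': 3.0}   # made-up IDF values

# Only w32 occurs in the question: 2.0 * (2 / 4) = 1.0
print(tf_idf_score(['w32', 'w1234'], question, idf))
```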



Load these files only when they are needed, to avoid unnecessary memory use.

In [11]:
df_topics = pd.read_csv(data_path+'topic_info.txt', header=None, names=['topic_id', 'pid', 'cn', 'wn', 'cd', 'wd'],sep='\t')
df_question_topic = pd.read_csv(data_path+'question_topic_train_set.txt', header=None, names=['question_id', 'topic_id'],sep='\t')

In [None]:
df_topics[df_topics['topic_id'] == 738845194850773558].wd[0]

### Data preprocessing

1. Add a column to df_bag holding the total number of words in each question, i.e. total_word below.
2. Process the topics the same way as the questions; they also need wt_counter and total_word.


#### Adding bag_of_words, wt_counter and total_words to df_topics

In [12]:
def topic_word_bag(row):
    if type(row.wn) == float and type(row.wd) == float:
        return []
    elif type(row.wn) == float:
        return row.wd.split(',')
    elif type(row.wd) == float:
        return row.wn.split(',')
    return (row.wn + ',' + row.wd).split(',') 

df_topics['bag_of_words'] = df_topics.apply(topic_word_bag, axis = 1)

In [15]:
df2=df_topics[['topic_id', 'bag_of_words']]
df2.to_pickle(data_path+'topic_words_bag.pickle')

In [None]:
def topic_wt_counter(row):
    d = dict()
    for word in row:
        if word not in d.keys():
            d[word] = 1
        else:
            d[word] += 1
    return d
df_topics['wt_counter'] = df_topics.bag_of_words.apply(topic_wt_counter)

In [None]:
def total_word(row):
    return sum(row.values())
df_topics['total_word'] = df_topics.wt_counter.apply(total_word)

#### Adding total_words to df_bag

In [None]:
df_bag['total_word'] = df_bag.wt_counter.apply(total_word)

In [None]:
def tf_idf(topic_id, question_id):
    index1 = df_topics[df_topics['topic_id'] == topic_id].index[0]
    topic_word = df_topics[df_topics['topic_id'] == topic_id].bag_of_words[index1]

    index2 = df_bag[df_bag['question_id'] == question_id].index[0]
    word_dict = df_bag[df_bag['question_id'] == question_id].wt_counter[index2]

    total_word = df_bag[df_bag['question_id'] == question_id].total_word[index2]

    
    tf_idf_value = 0
    for word in topic_word:
        if word in word_dict:
            tf_idf_value += idf_dict[word]*word_dict[word]/total_word
    return tf_idf_value
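Each call to `tf_idf` above filters the full frames several times with boolean masks. One possible alternative, sketched here on toy data, is to index both frames by id once and look up single cells with `.at`. The frame contents and the names `topics_by_id`, `bags_by_id` and `tf_idf_indexed` are assumptions for illustration.

```python
import pandas as pd
from collections import Counter

# Toy stand-ins for df_topics and df_bag
topics = pd.DataFrame({'topic_id': [10, 20],
                       'bag_of_words': [['w1', 'w2'], ['w3']]})
bags = pd.DataFrame({'question_id': [1, 2],
                     'wt_counter': [Counter({'w1': 2, 'w5': 2}),
                                    Counter({'w3': 1})]})
bags['total_word'] = bags.wt_counter.apply(lambda c: sum(c.values()))
idf = {'w1': 1.0, 'w2': 2.0, 'w3': 0.5, 'w5': 3.0}  # made-up IDF values

# Build the id indexes once, outside the scoring function
topics_by_id = topics.set_index('topic_id')
bags_by_id = bags.set_index('question_id')

def tf_idf_indexed(topic_id, question_id):
    """Score one (topic, question) pair using index lookups."""
    topic_words = topics_by_id.at[topic_id, 'bag_of_words']
    counts = bags_by_id.at[question_id, 'wt_counter']
    total = bags_by_id.at[question_id, 'total_word']
    return sum(idf[w] * counts[w] / total for w in topic_words if w in counts)

print(tf_idf_indexed(10, 1))  # 1.0 * (2 / 4) = 0.5
```

This assumes the ids are unique; with duplicate ids, `.at` would return more than one row.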

In [None]:
tf_idf(-3149765934180654494, 2887834264226772863)

In [None]:
def split_to_list(row):
    return row.split(',')
df_question_topic['topic_id_list'] = df_question_topic.topic_id.apply(split_to_list)

In [None]:
df_question_topic['topic_count'] = df_question_topic.topic_id_list.apply(len)

In [None]:
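# .copy() makes sub_df an independent frame, so adding a column to it later does not trigger SettingWithCopyWarning
sub_df = df_question_topic[df_question_topic['topic_count'] == 1].copy()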

In [None]:
sub_df.loc[1][1]

In [None]:
def tf_idf2(ser):
    return tf_idf(int(ser.topic_id_list[0]), ser.question_id)

In [None]:
sub_df.head(5)

In [None]:
df_topics[df_topics['topic_id'] == -3149765934180654494]

In [None]:
df_topics[ df_topics['topic_id'] == 738845194850773558]

In [None]:
print('df_topics[df_topics[\'topic_id\'] == 738845194850773558].wn=', df_topics[df_topics['topic_id'] == 738845194850773558].wn)
print('df_topics[df_topics[\'topic_id\'] == 738845194850773558].wn[0]=', df_topics[df_topics['topic_id'] == 738845194850773558].wn[0])
print()

print('df_topics[df_topics[\'topic_id\'] == -3149765934180654494].wn=', df_topics[df_topics['topic_id'] == -3149765934180654494].wn)
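print('df_topics[df_topics[\'topic_id\'] == -3149765934180654494].wn[769]=', df_topics[df_topics['topic_id'] == -3149765934180654494].wn[769])

# Why 769 and not 0? Indexing a filtered Series with [] looks up the original row label, not the position.
# This topic sits at row label 769; use .iloc[0] to take the first match by position.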
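The behaviour in the cell above comes from label-based indexing: a filtered Series keeps the row labels of the original frame. A toy illustration follows, with made-up ids and values.

```python
import pandas as pd

topics = pd.DataFrame({'topic_id': [111, 222, 333],
                       'wn': ['w1', 'w2', 'w3']})

matched = topics[topics['topic_id'] == 333].wn
print(matched.index.tolist())  # [2]: the original row label is kept
print(matched.iloc[0])         # w3: first match by position
print(matched.loc[2])          # w3: the same value by label

try:
    matched.loc[0]             # label 0 is not in the filtered Series
except KeyError:
    print('label 0 not found')
```

Using `.iloc[0]` avoids having to know the row label in advance.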

In [None]:
df_topics[df_topics['topic_id'] == -3149765934180654494].index[0]

In [None]:
#tf_idf2(sub_df.loc[1])
print(int(sub_df.loc[1].topic_id_list[0]),sub_df.loc[1].question_id)
tf_idf(-3149765934180654494 ,2887834264226772863)

In [None]:
start_time = time.time()
a = sub_df.head(10000).apply(tf_idf2, axis = 1)
print('time cost:', time.time() - start_time)

In [None]:
sub_df.head(114900)

In [None]:
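# Series.sort() is no longer available in current pandas; sort_values() returns a sorted copy
a = a.sort_values()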

In [None]:
a

In [None]:
sub_df['tf_idf'] = sub_df.apply(tf_idf2, axis = 1)

## Examine the distribution of TF-IDF

### TF-IDF distribution for single-topic Questions

### TF-IDF distribution for multi-topic Questions

In [None]:
# your code here ...

## For multi-topic Questions, analyze the relationship between a Topic's position and its TF-IDF

In [None]:
# your code here ...

## Study the effect of the Topic inheritance hierarchy on Topic assignment

## Study the effect of synonyms on Topic assignment