# Overview

https://tianchi.aliyun.com/competition/entrance/231785/rankingList

https://zhuanlan.zhihu.com/p/127336206

## Background
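This track focuses on fairness of exposure, i.e. how to recommend items that were rarely exposed in the past, in order to counter the Matthew effect commonly seen in recommender systems. In particular, reducing bias when training on click data is critical to success in this task. Just as there is a gap between the click data logged by a modern recommender system and the actual online environment, there is also a gap between the training data and the test data, mainly in trends and item popularity.

* A winning solution needs to perform well on items that have historically received little exposure.

* The training and test data were collected over several periods, including a large-scale sales campaign. Because trends shift, bias reduction is unavoidable for reliable prediction.

* We provide multi-modal item features and some (anonymized) key user features to help participants explore solutions that resist data bias and handle under-explored items well.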

## Schedule
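March 30, 2020: A sample of the training data is released and registration opens.

April 7, 2020: The complete training and validation data are released.

April 13, 2020: The (unlabeled) test set A is released (for development only; not considered when deciding the final winning solutions).

May 27, 2020: Registration deadline.

June 4, 2020: The (unlabeled) test set B is officially released.

June 11, 2020: Submissions close; deadline for the technical report.

June 20, 2020: Winners are notified.

August 23 to 27, 2020: KDD conference.

All deadlines are at 11:59 PM UTC on the given day. The organizers reserve the right to update the schedule if necessary.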

## Dataset
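The files are in CSV format with UTF-8 encoding. The columns of a CSV file can be:

* item_id: the unique identifier of an item

* txt_vec: the text feature of the item, a 128-dimensional real-valued vector produced by a pre-trained model

* img_vec: the image feature of the item, a 128-dimensional real-valued vector produced by a pre-trained model

* user_id: the unique identifier of a user

* time: the timestamp of the click event, i.e. (unix_timestamp - random_number_1) / random_number_2

* user_age_level: the age group the user belongs to

* user_gender: the gender of the user, which may be empty

* user_city_level: the tier of the city where the user lives

The data were collected over more than ten days, including one sales campaign. They involve more than 1 million clicks, 100 thousand items, and 30 thousand users. The total size of the dataset is about 500 MB.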

### Training data: underexpose_train.zip

underexpose_item_feat.csv, with columns: item_id, txt_vec, img_vec

underexpose_user_feat.csv, with columns: user_id, user_age_level, user_gender, user_city_level

The archive also contains ten encrypted files named underexpose_train_click-T.zip, where T = 0, 1, 2, …, 9 denotes phase T of the competition. When the competition enters phase T, the password for underexpose_train_click-T.zip is released in the forum. Each underexpose_train_click-T.zip contains underexpose_train_click-T.csv, with columns: user_id, item_id, time

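### Test data: underexpose_test.zip

The archive contains ten encrypted files named underexpose_test_click-T.zip, where T = 0, 1, 2, …, 9 denotes phase T of the competition. When the competition enters phase T, the password for underexpose_train_click-T.zip is released in the forum. Each underexpose_test_click-T.zip contains underexpose_test_click-T.csv and underexpose_test_qtime-T.csv.

* underexpose_test_click-T.csv has columns: user_id, item_id, time

* underexpose_test_qtime-T.csv has columns: user_id, query_time
  Here query_time is the timestamp at which the user clicks the next item.

The task of this competition is to predict the next item clicked by each user listed in underexpose_test_qtime-T.csv. Specifically, participants must recommend fifty items for each user. If any of the fifty recommended items matches the ground truth, the participant receives a positive score.

We guarantee that the true next item is in underexpose_item_feat.csv. However, it may have been clicked zero times in the training data, although this is unlikely.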

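## Submission
When the competition enters phase T, participants need to submit predictions for underexpose_test_qtime-0,1,2,…,T.csv.

* Submission file name: underexpose_submit-T.csv

* The submission should be a CSV file with 51 columns. No header (i.e. column names) is needed. The 51 columns are:

    * user_id, item_id_01, item_id_02, …, item_id_50

    * Here item_id_01, item_id_02, …, item_id_50 are the fifty items recommended for user_id. The order of the fifty items matters. Put the items most likely to be clicked first; in other words, item_id_01 should be the most likely one.

    * We guarantee that no user_id appears in more than one phase, so you do not need to specify which phase each row of the submission belongs to.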
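A minimal sketch of writing a file in this format, assuming the recommendations are held in a dict that maps each user_id to a ranked list of item ids. `write_submission` and `recs` are hypothetical names used for illustration, not part of the official tooling.

```python
import csv


def write_submission(recs, path, k=50):
    """Write one header-less row per user: user_id followed by k distinct item ids."""
    with open(path, "w", newline="") as fout:
        writer = csv.writer(fout)
        for user_id, items in recs.items():
            # Mirror the checks described above: exactly k items, all distinct.
            if len(items) != k or len(set(items)) != k:
                raise ValueError("user %d needs %d distinct items" % (user_id, k))
            writer.writerow([user_id] + list(items))
```

Items are written in list order, so the most likely item should come first in each list.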
         
## The official evaluation script, as posted in the forum
https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.6c3f5619NDeQ04&postId=102089
### [Must Read!] Official Script for Evaluation

In [None]:
# coding=utf-8
from __future__ import division
from __future__ import print_function

import datetime
import json
import sys
import time
from collections import defaultdict

import numpy as np

#### evaluate_each_phase

In [None]:
# higher scores mean better performance
def evaluate_each_phase(predictions, answers):
    list_item_degrees = []
    for user_id in answers:
        item_id, item_degree = answers[user_id]
        list_item_degrees.append(item_degree)
    list_item_degrees.sort()
    median_item_degree = list_item_degrees[len(list_item_degrees) // 2]

    num_cases_full = 0.0
    ndcg_50_full = 0.0
    ndcg_50_half = 0.0
    num_cases_half = 0.0
    hitrate_50_full = 0.0
    hitrate_50_half = 0.0
    for user_id in answers:
        item_id, item_degree = answers[user_id]
        rank = 0
        while rank < 50 and predictions[user_id][rank] != item_id:
            rank += 1
        num_cases_full += 1.0
        if rank < 50:
            ndcg_50_full += 1.0 / np.log2(rank + 2.0)
            hitrate_50_full += 1.0
        if item_degree <= median_item_degree:
            num_cases_half += 1.0
            if rank < 50:
                ndcg_50_half += 1.0 / np.log2(rank + 2.0)
                hitrate_50_half += 1.0
    ndcg_50_full /= num_cases_full
    hitrate_50_full /= num_cases_full
    ndcg_50_half /= num_cases_half
    hitrate_50_half /= num_cases_half
    return np.array([ndcg_50_full, ndcg_50_half,
                     hitrate_50_full, hitrate_50_half], dtype=np.float32)

#### evaluate

In [None]:
# submit_fname is the path to the file submitted by the participants.
# debias_track_answer.csv is the standard answer, which is not released.
def evaluate(stdout, submit_fname,
             answer_fname='debias_track_answer.csv', current_time=None):
    schedule_in_unix_time = [
        0,  # ........ 1970-01-01 08:00:00 (T=0)
        1586534399,  # 2020-04-10 23:59:59 (T=1)
        1587139199,  # 2020-04-17 23:59:59 (T=2)
        1587743999,  # 2020-04-24 23:59:59 (T=3)
        1588348799,  # 2020-05-01 23:59:59 (T=4)
        1588953599,  # 2020-05-08 23:59:59 (T=5)
        1589558399,  # 2020-05-15 23:59:59 (T=6)
        1590163199,  # 2020-05-22 23:59:59 (T=7)
        1590767999,  # 2020-05-29 23:59:59 (T=8)
        1591372799  # .2020-06-05 23:59:59 (T=9)
    ]
    assert len(schedule_in_unix_time) == 10
    for i in range(1, len(schedule_in_unix_time) - 1):
        # 604800 == one week
        assert schedule_in_unix_time[i] + 604800 == schedule_in_unix_time[i + 1]

    if current_time is None:
        current_time = int(time.time())
    print('current_time:', current_time)
    print('date_time:', datetime.datetime.fromtimestamp(current_time))
    current_phase = 0
    while (current_phase < 9) and (
            current_time > schedule_in_unix_time[current_phase + 1]):
        current_phase += 1
    print('current_phase:', current_phase)

    try:
        answers = [{} for _ in range(10)]
        with open(answer_fname, 'r') as fin:
            for line in fin:
                line = [int(x) for x in line.split(',')]
                phase_id, user_id, item_id, item_degree = line
                assert user_id % 11 == phase_id
                # exactly one test case for each user_id
                answers[phase_id][user_id] = (item_id, item_degree)
    except Exception as _:
        return report_error(stdout, 'server-side error: answer file incorrect')

    try:
        predictions = {}
        with open(submit_fname, 'r') as fin:
            for line in fin:
                line = line.strip()
                if line == '':
                    continue
                line = line.split(',')
                user_id = int(line[0])
                if user_id in predictions:
                    return report_error(stdout, 'submitted duplicate user_ids')
                item_ids = [int(i) for i in line[1:]]
                if len(item_ids) != 50:
                    return report_error(stdout, 'each row need have 50 items')
                if len(set(item_ids)) != 50:
                    return report_error(
                        stdout, 'each row need have 50 DISTINCT items')
                predictions[user_id] = item_ids
    except Exception as _:
        return report_error(stdout, 'submission not in correct format')

    scores = np.zeros(4, dtype=np.float32)

    # The final winning teams will be decided based on phase T=7,8,9 only.
    # We thus fix the scores to 1.0 for phase 0,1,2,...,6 at the final stage.
    if current_phase >= 7:  # if at the final stage, i.e., T=7,8,9
        scores += 7.0  # then fix the scores to 1.0 for phase 0,1,2,...,6
    phase_beg = (7 if (current_phase >= 7) else 0)
    phase_end = current_phase + 1
    for phase_id in range(phase_beg, phase_end):
        for user_id in answers[phase_id]:
            if user_id not in predictions:
                return report_error(
                    stdout, 'user_id %d of phase %d not in submission' % (
                        user_id, phase_id))
        try:
            # We sum the scores from all the phases, instead of averaging them.
            scores += evaluate_each_phase(predictions, answers[phase_id])
        except Exception as _:
            return report_error(stdout, 'error occurred during evaluation')

    return report_score(
        stdout, score=float(scores[0]),
        ndcg_50_full=float(scores[0]), ndcg_50_half=float(scores[1]),
        hitrate_50_full=float(scores[2]), hitrate_50_half=float(scores[3]))

#### _create_answer_file_for_evaluation

In [None]:
# FYI. You can create a fake answer file for validation based on this. For example,
# you can mask the latest ONE click made by each user in underexpose_test_click-T.csv,
# and use those masked clicks to create your own validation set, i.e.,
# a fake underexpose_test_qtime_with_answer-T.csv for validation.
def _create_answer_file_for_evaluation(answer_fname='debias_track_answer.csv'):
    train = 'underexpose_train_click-%d.csv'
    test = 'underexpose_test_click-%d.csv'

    # underexpose_test_qtime-T.csv contains only <user_id, query_time>
    # underexpose_test_qtime_with_answer-T.csv contains <user_id, item_id, time>
    answer = 'underexpose_test_qtime_with_answer-%d.csv'  # not released

    item_deg = defaultdict(lambda: 0)
    with open(answer_fname, 'w') as fout:
        for phase_id in range(10):
            with open(train % phase_id) as fin:
                for line in fin:
                    user_id, item_id, timestamp = line.split(',')
                    user_id, item_id, timestamp = (
                        int(user_id), int(item_id), float(timestamp))
                    item_deg[item_id] += 1
            with open(test % phase_id) as fin:
                for line in fin:
                    user_id, item_id, timestamp = line.split(',')
                    user_id, item_id, timestamp = (
                        int(user_id), int(item_id), float(timestamp))
                    item_deg[item_id] += 1
            with open(answer % phase_id) as fin:
                for line in fin:
                    user_id, item_id, timestamp = line.split(',')
                    user_id, item_id, timestamp = (
                        int(user_id), int(item_id), float(timestamp))
                    assert user_id % 11 == phase_id
                    print(phase_id, user_id, item_id, item_deg[item_id],
                          sep=',', file=fout)
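The comment on `_create_answer_file_for_evaluation` suggests masking each user's latest click to build a local validation set. A minimal sketch of that idea, assuming a click DataFrame with the columns user_id, item_id, time; `split_last_click` is a hypothetical helper, not part of the official script.

```python
import pandas as pd


def split_last_click(click_df):
    """Hold out each user's latest click as a validation answer."""
    # Stable sort so that ties in time keep their original order.
    ordered = click_df.sort_values(["user_id", "time"], kind="mergesort")
    # After sorting, the last row of each user is that user's latest click.
    is_last = ~ordered.duplicated("user_id", keep="last")
    answers = ordered[is_last].reset_index(drop=True)
    history = ordered[~is_last].reset_index(drop=True)
    return history, answers
```

`history` can then be used for training, and `answers` plays the role of the unreleased underexpose_test_qtime_with_answer-T.csv.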

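## Evaluation

For this competition we use NDCG@50 to measure the quality of the recommendation lists.

* We compute two metrics: NDCG@50-full and NDCG@50-rare.

    * NDCG@50-full is computed on the whole test set, i.e. all test cases in underexpose_test_qtime-T.csv.

    * NDCG@50-rare is computed on half of the test cases in underexpose_test_qtime-T.csv. The selected half consists of the cases whose next item to predict was explored less in the past training data (underexpose_train_click-0.zip, underexpose_train_click-1.zip, …, underexpose_train_click-T.zip) than the items of the other half.

* Phases T = 0, 1, 2, …, 6 are for development. The final ranking of participants is computed on T = 7, 8, 9.

    * A winning team needs to rank in the top 10% on NDCG@50-full and must achieve the best NDCG@50-rare among the teams that qualify.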
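Because each test case has exactly one correct item, the ideal DCG is 1, and the per-phase metric computed by `evaluate_each_phase` reduces to the form below. The symbols are notation introduced here for illustration: $U_T$ is the set of test users in phase $T$ and $r_u$ is the zero-based position of the true item in the list submitted for user $u$ (with $r_u \ge 50$ when the item is missing).

$$
\mathrm{NDCG@50}_T = \frac{1}{|U_T|} \sum_{u \in U_T} \frac{\mathbb{1}[r_u < 50]}{\log_2(r_u + 2)}
$$

In the script, the rare variant applies the same sum to the users whose true item has a degree no greater than the median degree, and the reported score is the sum of the per-phase values rather than their average.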

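# Approach

The available data consist of user features, item features, and user-item click data.

Task: predict which item the user will click next.
## Approach 1
UserCF or ItemCF can serve as a baseline. Given 鱼佬's remarks in the reference section, ItemCF is the tentative choice.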
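A minimal ItemCF sketch, assuming clicks have already been grouped into a dict from user_id to a list of clicked item ids. `item_cf_similarity` and `recommend` are hypothetical names, and the square-root popularity normalisation is one common choice rather than anything specified by the competition.

```python
import math
from collections import defaultdict


def item_cf_similarity(user_items):
    """Co-occurrence similarity between items, normalised by item popularity."""
    co_counts = defaultdict(lambda: defaultdict(float))
    item_counts = defaultdict(int)
    for items in user_items.values():
        unique_items = set(items)
        for i in unique_items:
            item_counts[i] += 1
            for j in unique_items:
                if i != j:
                    co_counts[i][j] += 1.0
    sim = {}
    for i, related in co_counts.items():
        sim[i] = {j: c / math.sqrt(item_counts[i] * item_counts[j])
                  for j, c in related.items()}
    return sim


def recommend(user_items, sim, user_id, k=50):
    """Rank unseen items by their summed similarity to the user's clicked items."""
    seen = set(user_items.get(user_id, []))
    scores = defaultdict(float)
    for i in seen:
        for j, s in sim.get(i, {}).items():
            if j not in seen:
                scores[j] += s
    # Sort by descending score, breaking ties by item id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]
```

This can return fewer than `k` items for users with little history, so a real submission would still need to pad each list to fifty distinct items.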
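## Approach 2
The training data provide many click sequences, so the problem can be approached along the lines of NLG.

Training data:
* abcdegf
* acdef
* ...

Test data (? marks the item to predict):
* abc?
* cdf?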
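A minimal sketch of turning the click table into such sequences, assuming a DataFrame with the columns user_id, item_id, time. `build_sequences` and `next_item_pairs` are hypothetical helpers introduced for illustration.

```python
import pandas as pd


def build_sequences(click_df):
    """Return {user_id: [item_id, ...]} with each user's clicks ordered by time."""
    ordered = click_df.sort_values(["user_id", "time"], kind="mergesort")
    return ordered.groupby("user_id")["item_id"].apply(list).to_dict()


def next_item_pairs(seq):
    """Turn one sequence into (prefix, next_item) training examples."""
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]
```

Each pair corresponds to one "abc?" style example: the prefix is the observed history and the second element is the item to predict.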

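## References
### 鱼佬
https://zhuanlan.zhihu.com/p/127336206  
The task mainly examines how to remove bias in AI by recommending items with few historical clicks. Among traditional recall methods such as collaborative filtering (item CF and user CF), user CF tends to recommend popular items, whereas item CF offers good novelty and is good at recommending long-tail items, so it may be worth trying.

Recommending by vector similarity is another direction to try, but the items a user clicks consecutively turn out not to be very similar, which gives me pause. More analysis may be needed, for example by bringing in attributes such as time. In addition, adjacent clicks only describe the current interest, so one could try extracting long-term interest for recommendation.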
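A minimal sketch of checking how similar consecutive clicks are, assuming item vectors are available in a dict from item_id to a feature vector (for example txt_vec). `consecutive_cosine` and `item_vecs` are hypothetical names; pairs where either item has no vector are skipped.

```python
import numpy as np


def consecutive_cosine(seq, item_vecs):
    """Cosine similarity between each pair of adjacent items in a click sequence."""
    sims = []
    for a, b in zip(seq[:-1], seq[1:]):
        if a not in item_vecs or b not in item_vecs:
            continue  # no feature vector for one of the two items
        va = np.asarray(item_vecs[a], dtype=float)
        vb = np.asarray(item_vecs[b], dtype=float)
        denom = np.linalg.norm(va) * np.linalg.norm(vb)
        sims.append(float(va @ vb / denom) if denom > 0 else 0.0)
    return sims
```

Averaging these values over many users gives a rough measure of whether adjacent clicks are close in feature space.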

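Deep-learning models for recall, such as YouTube's recommendation algorithm or the DSSM two-tower model, are also good options.

For the ranking stage a model is normally used, and what matters is the features: how to distinguish popular items and how to promote items with low historical frequency becomes the key. The goal is to recommend items that are both correct and sufficiently novel.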
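One hypothetical way to promote rarely clicked items is to divide each candidate's score by a function of its click count before the final sort. The sketch below is an illustrative assumption, not the method described in the referenced post: `rerank_by_popularity`, the logarithmic penalty, and the `alpha` parameter are all introduced here.

```python
import math


def rerank_by_popularity(scored_items, item_degree, alpha=1.0, k=50):
    """Divide each score by log(2 + degree) ** alpha so rare items move up."""
    adjusted = []
    for item, score in scored_items:
        # log(2 + degree) is always positive, even for items never clicked.
        penalty = math.log(2.0 + item_degree.get(item, 0)) ** alpha
        adjusted.append((item, score / penalty))
    adjusted.sort(key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in adjusted[:k]]
```

Setting `alpha=0` disables the penalty, which makes it easy to compare NDCG@50-full and NDCG@50-rare with and without the adjustment.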

In [2]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 200)
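pd.set_option('display.max_columns', 100)  # maximum number of columns to display, so output is not truncated with an ellipsis
pd.set_option('expand_frame_repr', False)  # do not wrap the output when there are many columns

import seaborn as sns
sns.set(font='Arial Unicode MS')  # work around missing CJK glyphs in Seaborn plots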
import sys
sys.path.append('/Users/luoyonggui/PycharmProjects/mayiutils_n1/mayiutils/data_prepare')
from data_explore import DataExplore as de



# load data

In [3]:
path = './data_origin/'

## train_user_df

In [4]:
train_user_df = pd.read_csv(path+'underexpose_train/underexpose_user_feat.csv', names=['user_id','user_age_level','user_gender','user_city_level'])

In [3]:
train_user_df.head()

Unnamed: 0,user_id,user_age_level,user_gender,user_city_level
0,17,8.0,M,4.0
1,26,7.0,M,2.0
2,35,6.0,F,4.0
3,40,6.0,M,1.0
4,49,6.0,M,1.0


In [13]:
de.describe(train_user_df)

num of records: 6789, num of columns: 4


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,6786,0,0.0,14818,2,0.0294594,17247.4,10062.6,10.0,8637.0,17196.0,25856.0,35432.0
user_age_level,float64,8,83,1.2,4,1425,20.9898,4.53579,1.80313,1.0,3.0,5.0,6.0,8.0
user_gender,object,2,81,1.2,F,5211,76.7565,,,,,,,
user_city_level,float64,6,22,0.3,6,1870,27.5446,3.70844,1.79852,1.0,2.0,3.0,6.0,6.0


## train_item_df

In [4]:
train_item_df = pd.read_csv(path+'underexpose_train/underexpose_item_feat.csv', sep=r',\s+|,\[|\],\[', names=['item_id']+list(range(256)), engine='python')  # a regex separator needs the python engine


In [5]:
train_item_df.iloc[:, -1] = train_item_df.iloc[:, -1].str.replace(']', '', regex=False).map(float)  # strip the trailing bracket left on the last column
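Judging from the separator pattern, each line of underexpose_item_feat.csv appears to have the shape `item_id,[v1, v2, ...],[v1, v2, ...]`. That layout is inferred from the regular expression, not taken from an official specification. Under that assumption, one line can also be parsed without a regex; `parse_item_feat_line` is a hypothetical helper.

```python
def parse_item_feat_line(line):
    """Split 'item_id,[txt_vec],[img_vec]' into an id and two lists of floats."""
    item_id, rest = line.strip().split(",", 1)
    # The two bracketed vectors are joined by the literal text "],[".
    txt_part, img_part = rest.split("],[")
    txt_vec = [float(x) for x in txt_part.lstrip("[").split(",")]
    img_vec = [float(x) for x in img_part.rstrip("]").split(",")]
    return int(item_id), txt_vec, img_vec
```

Keeping the two vectors separate avoids having to remember that columns 0 to 127 are text features and 128 to 255 are image features.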

In [25]:
train_item_df.head()

Unnamed: 0,item_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,...,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
0,42844,4.514945,-2.38372,0.500414,0.407068,-1.995229,0.109078,-0.691775,2.22746,-6.437974,-0.824897,-0.138724,-0.379329,0.62766,0.418377,4.441218,0.299819,0.578557,-4.699289,-0.39474,-2.391651,0.370532,-1.355466,-1.074178,-2.32164,-0.332456,0.123886,-2.439156,-0.345599,-3.304347,1.485284,0.909802,-1.643002,5.037034,2.780115,4.776496,2.255275,3.769707,-3.661684,-0.649405,4.199636,-0.634806,2.43034,-2.874019,-0.786178,-0.504916,-6.007789,1.498495,1.530613,2.379655,...,0.312405,3.444607,-0.88625,-1.343637,0.954459,0.630835,-2.394722,0.683487,1.149004,-1.351173,2.0239,1.599198,1.382868,1.605678,1.880667,-0.508161,0.24284,-0.260849,1.875943,0.206135,0.186973,2.047446,-0.575472,3.01641,2.757146,3.353721,-0.457272,-0.125337,2.332963,3.858967,-2.07549,-0.705496,0.203452,1.719733,2.925039,-0.388639,1.225732,-1.773137,0.052655,1.279922,-3.374727,-1.506969,-1.82018,-3.024644,0.445263,0.013933,-1.300239,2.759948,2.056171,0.508703
1,67898,-2.002905,-0.929881,0.790017,-1.380895,-0.510463,-1.810096,1.363962,0.497401,-4.038903,-3.057872,0.758558,-1.012155,2.816802,2.086895,-1.464331,-1.840496,-2.089971,-1.566872,1.54539,1.284341,-2.270262,0.780126,1.615594,-0.546058,1.37075,-1.178124,1.346842,0.442434,-1.49854,-0.589944,2.008351,-0.497135,-1.64423,3.140623,3.492178,0.335395,1.810923,-4.01208,2.419593,0.190941,-0.630611,3.289332,-1.446719,-0.61134,0.700662,-2.465656,-0.596773,2.49821,3.682916,...,2.39942,2.024863,0.170483,-0.039203,-1.506677,-1.945932,-0.020228,-0.495499,-0.141013,-1.617521,2.624676,-2.581922,0.220891,0.328793,0.647758,0.23199,1.101486,1.079527,2.953102,-0.528682,-1.1406,-0.373299,0.109811,2.813541,0.596998,1.754836,-1.359771,0.466501,2.377417,-0.180653,-3.259304,0.120833,2.225643,2.220507,-1.178944,-0.821367,0.717239,-1.455829,-1.260584,2.623467,-0.53833,-2.620164,1.277195,0.601015,-0.345312,0.993457,1.351633,2.162675,2.768597,-0.937197
2,66446,4.221673,-1.497139,1.13357,-2.745607,-4.197045,-0.542392,-1.396256,1.838419,-6.066454,-2.191799,0.752804,0.868623,6.187662,1.725745,2.887859,-1.486026,-0.182256,-3.710785,1.512866,-0.636434,0.288435,-3.369717,-0.265998,-3.549319,3.375338,-0.901461,-1.558371,1.695343,-4.450464,0.545495,1.000096,-3.468751,3.327641,1.55689,4.493203,0.369089,0.167196,-4.837062,1.216016,4.699153,-1.094529,3.015942,-1.322741,-0.829172,0.555047,-5.592765,1.254898,3.18245,3.053574,...,-0.005492,3.827181,-0.358198,-2.009379,-0.224391,0.803851,-0.909498,0.96281,2.601583,0.056328,1.859474,-0.316134,-1.131286,1.701278,2.305405,-1.941271,1.248002,0.291,0.792067,1.361166,1.129005,1.947404,-0.859423,2.023223,2.348651,4.506127,0.684437,2.064992,0.022901,3.464243,-2.325273,0.131324,-1.876178,1.770354,2.925176,-1.851054,-0.092587,-0.580742,-0.422019,0.923714,-4.582711,-1.05691,-2.568084,-2.038061,2.508719,-0.764789,-0.657116,3.252782,2.687366,0.844332
3,63651,2.65797,-0.941863,1.121529,-5.109496,-0.279041,-0.351968,-1.086983,2.703607,-6.494977,-0.746769,-0.068571,-3.89467,4.937046,-1.863204,-1.955068,1.900193,1.743841,-6.02479,1.460414,-2.206104,-1.997572,-3.414536,-0.178739,0.987313,1.255347,-1.187136,2.070518,2.191021,-2.936702,2.617733,0.919181,-3.087907,-0.358938,-0.428679,3.815598,2.440558,1.281061,-0.73253,1.517067,2.790302,-2.019122,2.419042,-2.044806,0.649187,1.940526,-4.965359,0.93046,-1.152011,0.167594,...,-0.733038,-0.913736,-1.29619,4.821739,0.687235,-1.406431,0.669184,-1.847598,-0.817075,-1.181172,2.588552,-1.123118,0.232427,2.170325,0.579414,2.601421,-0.596196,-1.798693,0.312326,0.387486,-2.207365,-1.029329,-1.274233,2.04723,1.928517,2.102633,-0.559383,-0.951418,-2.021749,1.366272,-1.947211,-2.114419,1.140394,-0.796024,1.906361,-0.35752,3.352968,-3.996377,1.520331,-0.000716,-0.487683,-1.889119,0.943015,-2.834418,1.633184,2.001801,-2.333152,2.645595,2.280233,-0.694448
4,46824,3.192195,-1.936676,1.199909,-2.562152,-2.573456,0.575841,-2.358653,1.620844,-4.302936,-0.487575,0.020896,-0.763327,4.341694,0.698798,3.33458,0.607683,-0.718644,-2.730188,0.193828,-1.706196,-0.468727,-2.281904,-1.837274,-2.84914,0.195873,-0.459765,-0.768752,1.033489,-2.490896,2.077521,-0.171984,-3.406347,2.61667,0.713099,4.450222,0.606497,0.160672,-2.604218,2.110272,4.714019,-2.297905,1.700881,-0.195633,-0.404006,2.140779,-5.351576,1.592488,1.312723,1.610867,...,1.086525,3.235504,-1.360041,-0.573626,-0.343873,-0.862111,-3.03626,-1.01661,-1.14977,0.46414,2.412406,-0.654135,0.486894,1.168054,-0.27142,0.667091,1.163557,-0.132236,0.668463,-0.228185,0.155768,1.546821,0.30764,2.542941,1.240985,2.144569,-1.079753,0.335191,0.245159,2.150169,-1.99144,-2.330727,0.736855,2.126931,0.556061,-1.611493,2.722133,-1.18594,0.399201,1.598617,-0.621475,-2.09141,0.5016,-3.083864,-1.060091,2.0536,-2.025008,2.399251,2.562317,0.694134


In [27]:
de.describe(train_item_df)

num of records: 108916, num of columns: 257


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
item_id,int64,108916,0,0.0,1.000000,1,0.000918139,58485.345982,33065.656602,1.000000,30189.750000,58643.500000,87014.250000,117538.000000
0,float64,108745,0,0.0,0.619635,3,0.00275442,1.458865,1.738504,-8.101768,0.285449,1.489091,2.647692,8.362688
1,float64,108810,0,0.0,-1.928095,3,0.00275442,-0.317904,1.874272,-8.306049,-1.594721,-0.405978,0.838072,8.153514
2,float64,108771,0,0.0,-1.808081,3,0.00275442,0.863508,1.635397,-7.136532,-0.162674,0.919081,1.920463,8.325392
3,float64,108649,0,0.0,-2.129149,3,0.00275442,-2.251081,1.686890,-9.212244,-3.422497,-2.393868,-1.209849,11.561003
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,float64,108802,0,0.0,-0.310374,3,0.00275442,0.283913,1.766579,-10.643577,-0.726043,0.431549,1.425704,7.465464
252,float64,108791,0,0.0,-0.136049,3,0.00275442,0.507123,1.846375,-10.683113,-0.831946,0.641637,1.909708,6.809214
253,float64,108513,0,0.0,2.266967,3,0.00275442,2.334019,1.022302,-3.334639,1.690839,2.370883,3.020225,6.749975
254,float64,108610,0,0.0,1.178804,3,0.00275442,1.827126,1.169730,-5.783994,1.132481,1.855712,2.559368,7.211661


## click_df

### train_click_df

#### train_click_0_df

In [5]:
train_click_0_df = pd.read_csv(path+'underexpose_train/underexpose_train_click-0.csv',names=['user_id','item_id','time'])
train_click_0_df.head()

Unnamed: 0,user_id,item_id,time
0,4965,18,0.983763
1,20192,34,0.983772
2,30128,91,0.98378
3,29473,189,0.98393
4,10625,225,0.983925


In [7]:
de.describe(train_click_0_df)

num of records: 241784, num of columns: 3


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,16842,0,0.0,5701.0,154,0.0636932,16667.586548,10023.164428,1.0,7944.0,16251.0,25159.0,35419.0
item_id,int64,40772,0,0.0,113569.0,215,0.0889223,56465.234151,34170.27959,1.0,26117.0,55525.0,86465.0,117283.0
time,float64,161457,0,0.0,0.983842,8,0.00330874,0.983859,6.2e-05,0.98374,0.9838,0.983858,0.983907,0.983958


In [9]:
train_click_0_df.drop_duplicates('user_id item_id'.split()).shape

(241784, 3)

In [28]:
train_click_0_df.groupby('user_id')['item_id'].count()

user_id
1        13
2        15
4         6
7         5
9         7
         ..
35389     3
35391     5
35393    18
35399     3
35419     5
Name: item_id, Length: 16842, dtype: int64

#### train_click_1_df

In [6]:
train_click_1_df = pd.read_csv(path+'underexpose_train/underexpose_train_click-1.csv',names=['user_id','item_id','time'])
train_click_1_df.head()

Unnamed: 0,user_id,item_id,time
0,12836,18,0.984007
1,4965,18,0.983991
2,12421,80,0.984008
3,21919,80,0.983994
4,28229,146,0.98401


In [14]:
de.describe(train_click_1_df)

num of records: 242132, num of columns: 3


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,16946,0,0.0,5701.0,139,0.0574067,16586.010944,10006.197506,2.0,7868.5,16154.0,25013.0,35424.0
item_id,int64,41403,0,0.0,52766.0,202,0.0834256,56323.570945,34226.778555,3.0,25804.25,55314.0,86222.0,117448.0
time,float64,161596,0,0.0,0.983899,8,0.00330398,0.983913,6.3e-05,0.983794,0.983854,0.983908,0.983961,0.984012


In [15]:
train_click_1_df.drop_duplicates('user_id item_id'.split()).shape

(242132, 3)

#### merge train_click_df

In [7]:
train_click_df = pd.concat([train_click_0_df, train_click_1_df], ignore_index=True)
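The two-phase concatenation generalises to any number of phases. A minimal sketch, assuming the files follow a `%d` template like the one used in the official script; `load_click_phases` is a hypothetical helper, and dropping exact duplicate rows is an added choice because the phase files overlap.

```python
import pandas as pd


def load_click_phases(template, phases, names=("user_id", "item_id", "time")):
    """Read one header-less click CSV per phase and stack them without duplicate rows."""
    frames = [pd.read_csv(template % t, names=list(names)) for t in phases]
    merged = pd.concat(frames, ignore_index=True)
    return merged.drop_duplicates().reset_index(drop=True)
```

For example, `load_click_phases(path + 'underexpose_train/underexpose_train_click-%d.csv', range(2))` would cover phases 0 and 1.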

In [18]:
de.describe(train_click_df)

num of records: 483916, num of columns: 3


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,20934,0,0.0,5701.0,293,0.0605477,16626.769415,10014.751172,1.0,7896.0,16191.0,25075.0,35424.0
item_id,int64,51907,0,0.0,52766.0,398,0.0822457,56394.351611,34198.599073,1.0,25957.0,55401.0,86350.0,117448.0
time,float64,220214,0,0.0,0.983899,15,0.00309971,0.983886,6.8e-05,0.98374,0.983833,0.983888,0.983941,0.984012


In [19]:
train_click_df.drop_duplicates('user_id item_id'.split()).shape

(346101, 3)

### test_click_df

In [8]:
test_click_0_df = pd.read_csv(path+'underexpose_test/underexpose_test_click-0/underexpose_test_click-0.csv', names=['user_id','item_id','time'])
test_click_1_df = pd.read_csv(path+'underexpose_test/underexpose_test_click-1/underexpose_test_click-1.csv', names=['user_id','item_id','time'])

test_click_df = pd.concat([test_click_0_df, test_click_1_df], ignore_index=True)

In [41]:
test_click_df.user_id.nunique()

3389

#### Most user_ids in test_click_df also appear in train_click_df

In [39]:
len(set(test_click_df.user_id).intersection(set(train_click_df.user_id)))

2866

#### Some user_ids in test_click_0_df also appear in train_click_1_df, so the phase-1 train_click data can be used to improve the phase-0 results

In [42]:
len(set(test_click_0_df.user_id).intersection(set(train_click_1_df.user_id)))

1402

In [12]:
test_click_0_df.head()

Unnamed: 0,user_id,item_id,time
0,1133,221,0.983812
1,17864,253,0.983783
2,6941,309,0.983785
3,34089,358,0.983781
4,21659,536,0.983793


In [11]:
de.describe(test_click_0_df)

num of records: 21216, num of columns: 3


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,1663,0,0.0,35123.0,98,0.461916,16435.22516,10072.244892,11.0,7722.0,15939.0,24585.0,35398.0
item_id,int64,15670,0,0.0,113569.0,19,0.0895551,56308.638763,34119.203232,1.0,25690.5,55536.5,86055.75,117069.0
time,float64,20419,0,0.0,0.983876,4,0.0188537,0.983854,6.1e-05,0.98374,0.983794,0.983848,0.9839,0.983958


In [21]:
de.describe(test_click_1_df)

num of records: 24465, num of columns: 3


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,1726,0,0.0,4643.0,93,0.380135,16739.294666,9831.424398,1.0,8449.0,16248.0,24982.0,35421.0
item_id,int64,17295,0,0.0,52766.0,30,0.122624,56504.339792,34130.032894,23.0,26155.0,55472.0,86155.0,117283.0
time,float64,23408,0,0.0,0.983818,3,0.0122624,0.983909,6.1e-05,0.983794,0.983849,0.983902,0.983955,0.984012


In [25]:
de.describe(test_click_df)

num of records: 45681, num of columns: 3


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,3389,0,0.0,35123.0,98,0.214531,16598.073181,9945.043089,1.0,8107.0,16061.0,24839.0,35421.0
item_id,int64,27195,0,0.0,52766.0,47,0.102887,56413.448764,34124.769689,1.0,25955.0,55495.0,86099.0,117283.0
time,float64,42497,0,0.0,0.983876,4,0.00875638,0.983883,6.7e-05,0.98374,0.983832,0.983885,0.983938,0.984012


In [29]:
test_click_df.groupby('user_id')['item_id'].count().describe()

count    3389.000000
mean       13.479197
std        12.119118
min         2.000000
25%         6.000000
50%        10.000000
75%        17.000000
max        98.000000
Name: item_id, dtype: float64

### merge

In [9]:
click_df = pd.concat([train_click_df, test_click_df], ignore_index=True)

In [31]:
click_df.shape

(529597, 3)

In [32]:
# drop duplicate rows
click_df = click_df.drop_duplicates()

In [33]:
click_df.shape

(368885, 3)

In [35]:
de.describe(click_df)

num of records: 368885, num of columns: 3




Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,21457,0,0.0,8330.0,194,0.0525909,16654.90162,10020.543539,1.0,7951.0,16196.0,25102.0,35424.0
item_id,int64,51914,0,0.0,52766.0,302,0.0818683,56475.735823,34213.124027,1.0,26092.0,55524.0,86413.0,117448.0
time,float64,227130,0,0.0,0.983897,9,0.00243978,0.983887,7.5e-05,0.98374,0.983826,0.983889,0.98395,0.984012


In [15]:
click_df

Unnamed: 0,user_id,item_id,time
0,4965,18,0.983763
1,20192,34,0.983772
2,30128,91,0.983780
3,29473,189,0.983930
4,10625,225,0.983925
...,...,...,...
529592,10990,116199,0.983983
529593,9098,116330,0.983899
529594,5336,116572,0.983900
529595,23992,116713,0.983952


In [10]:
click_df.groupby('user_id')['item_id'].count()

user_id
1        22
2        34
3        12
4         6
6         9
         ..
35399     3
35417     4
35419    12
35421     4
35424     4
Name: item_id, Length: 21457, dtype: int64

In [19]:
click_df.time.describe()

count    529597.000000
mean          0.983886
std           0.000068
min           0.983740
25%           0.983833
50%           0.983888
75%           0.983941
max           0.984012
Name: time, dtype: float64

In [17]:
click_df.loc[click_df['user_id']==1]

Unnamed: 0,user_id,item_id,time
3123,1,47611,0.983887
19709,1,76240,0.98377
19829,1,78142,0.983742
20480,1,89568,0.983763
20968,1,97795,0.983877
56362,1,78380,0.98379
74513,1,17887,0.983894
84964,1,69359,0.983942
108033,1,87533,0.98379
111177,1,18522,0.983887


## test_qtime_df

In [10]:
test_qtime_0_df = pd.read_csv(path+'underexpose_test/underexpose_test_click-0/underexpose_test_qtime-0.csv', names=['user_id','query_time'])

test_qtime_1_df = pd.read_csv(path+'underexpose_test/underexpose_test_click-1/underexpose_test_qtime-1.csv', names=['user_id','query_time'])


In [11]:
test_qtime_0_df.head()

Unnamed: 0,user_id,query_time
0,11,0.983869
1,22,0.983956
2,44,0.983924
3,55,0.983953
4,66,0.983895


In [12]:
de.describe(test_qtime_0_df)

num of records: 1663, num of columns: 2


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,1663,0,0.0,11.0,1,0.0601323,16845.425135,10166.465104,11.0,7986.0,16357.0,25228.5,35398.0
query_time,float64,1653,0,0.0,0.983936,2,0.120265,0.983919,4.3e-05,0.98374,0.983897,0.983937,0.983951,0.983958


In [22]:
de.describe(test_qtime_1_df)

num of records: 1726, num of columns: 2


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,1726,0,0.0,1.0,1,0.0579374,17258.209154,10017.568892,1.0,8605.75,17249.0,25592.5,35421.0
query_time,float64,1709,0,0.0,0.984003,2,0.115875,0.983975,4.3e-05,0.983796,0.983954,0.983991,0.984006,0.984012


In [23]:
test_qtime_df = pd.concat([test_qtime_0_df, test_qtime_1_df], ignore_index=True)

In [24]:
de.describe(test_qtime_df)

num of records: 3389, num of columns: 2


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,3389,0,0.0,1.0,1,0.0295072,17055.65388,10091.528118,1.0,8317.0,16875.0,25477.0,35421.0
query_time,float64,3352,0,0.0,0.983869,2,0.0590145,0.983947,5.1e-05,0.98374,0.983926,0.983952,0.983992,0.984012
