# Overview

https://tianchi.aliyun.com/competition/entrance/231785/information

https://zhuanlan.zhihu.com/p/127336206

## Dataset
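The files are in CSV format, encoded in UTF-8. The columns of the CSV files can be:

     item_id: the unique identifier of the item

     txt_vec: the item's text feature, a 128-dimensional real-valued vector produced by a pre-trained model

     img_vec: the item's image feature, a 128-dimensional real-valued vector produced by a pre-trained model

     user_id: the unique identifier of the user

     time: the timestamp of the click event, i.e., (unix_timestamp - random_number_1) / random_number_2

     user_age_level: the age group to which the user belongs

     user_gender: the gender of the user, which can be empty

     user_city_level: the tier of the city in which the user lives

The data was collected over more than ten days, including a sales campaign. It involves over 1 million clicks, 100 thousand items, and 30 thousand users. The total size of the dataset is about 500MB.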
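
The `time` column is an affine transform of the Unix timestamp. The sketch below illustrates what that implies: assuming `random_number_2` is positive (the description does not say so), click order and relative gaps survive the transform even though absolute times are hidden. The two constants are made-up placeholders, not the organisers' values.

```python
# Hypothetical constants: the real values are not disclosed.
RANDOM_NUMBER_1 = 1000000.0
RANDOM_NUMBER_2 = 1600000000.0  # assumed positive


def anonymize(unix_timestamp):
    """Apply the documented transform (unix_timestamp - r1) / r2."""
    return (unix_timestamp - RANDOM_NUMBER_1) / RANDOM_NUMBER_2


# Three raw Unix times in ascending order: gaps of one hour and one day.
clicks = [1586534400, 1586538000, 1586624400]
anonymized = [anonymize(t) for t in clicks]

# A positive scale keeps the order of the clicks.
order_preserved = anonymized == sorted(anonymized)

# The ratio between consecutive gaps is unchanged by an affine map.
gap_ratio_raw = (clicks[2] - clicks[1]) / (clicks[1] - clicks[0])
gap_ratio_anon = ((anonymized[2] - anonymized[1])
                  / (anonymized[1] - anonymized[0]))
print(order_preserved, gap_ratio_raw, round(gap_ratio_anon, 6))
```

In practice this means sorting by `time` recovers each user's click sequence, while converting `time` back to a calendar date is not possible without the two constants.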

### Training data: underexpose_train.zip

underexpose_item_feat.csv, whose columns are: item_id, txt_vec, img_vec

underexpose_user_feat.csv, whose columns are: user_id, user_age_level, user_gender, user_city_level

The archive also contains ten encrypted files named underexpose_train_click-T.zip, where T = 0,1,2,…,9 denotes phase T of the competition. When the competition enters phase T, the password for underexpose_train_click-T.zip is released in the forum. Each underexpose_train_click-T.zip contains underexpose_train_click-T.csv, whose columns are: user_id, item_id, time

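### Test data: underexpose_test.zip

     The archive contains ten encrypted files named underexpose_test_click-T.zip, where T = 0,1,2,…,9 denotes phase T of the competition. When the competition enters phase T, the password for underexpose_test_click-T.zip is released in the forum. Each underexpose_test_click-T.zip contains underexpose_test_click-T.csv and underexpose_test_qtime-T.csv.

     The columns of underexpose_test_click-T.csv are: user_id, item_id, time

     The columns of underexpose_test_qtime-T.csv are: user_id, query_time

         Here query_time is the timestamp at which the user clicks the next item. The task of this competition is to predict the next item clicked by each user listed in underexpose_test_qtime-T.csv. Specifically, participants need to recommend fifty items for each user. A participant receives a positive score if any one of the fifty recommended items matches the ground truth.

         We ensure that the ground-truth next item is in underexpose_item_feat.csv. However, although unlikely, that item may have zero observed clicks in the training data.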

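## Submission
When the competition enters phase T, participants need to submit predictions for underexpose_test_qtime-0,1,2,…,T.csv.

     Submission file name: underexpose_submit-T.csv

     The submitted file should be a CSV file with 51 columns. No header (i.e., column names) is needed. The 51 columns should be:

         user_id, item_id_01, item_id_02, …, item_id_50

         Here item_id_01, item_id_02, …, item_id_50 are the fifty items recommended for user_id. The order of the fifty items matters: put the items most likely to be clicked first. In other words, item_id_01 should be the most likely one.

         We ensure that no user_id appears in more than one phase, so you do not need to indicate which phase each row of the submission belongs to.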
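
The submission format described above can be produced with a few lines of standard-library Python. This is a minimal sketch rather than an official tool: `write_submission`, the toy user and item ids, and the temporary output location are all illustrative assumptions. The 50-distinct-items check mirrors what the official evaluation script enforces.

```python
import csv
import os
import tempfile


def write_submission(recommendations, path):
    """Write {user_id: [50 item_ids]} as a header-less, 51-column CSV."""
    with open(path, 'w', newline='') as fout:
        writer = csv.writer(fout)
        for user_id, item_ids in recommendations.items():
            # The evaluation script rejects rows without 50 distinct items.
            if len(item_ids) != 50 or len(set(item_ids)) != 50:
                raise ValueError('user %d needs 50 distinct items' % user_id)
            # Column 1 is user_id; columns 2..51 are ranked best-first.
            writer.writerow([user_id] + list(item_ids))


# Toy recommendations with made-up ids, most likely item first.
toy_recommendations = {
    11: list(range(1, 51)),
    22: list(range(51, 101)),
}
submission_path = os.path.join(tempfile.gettempdir(),
                               'underexpose_submit-0.csv')
write_submission(toy_recommendations, submission_path)
```

Validating rows before writing keeps format errors local instead of surfacing them only after upload.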
         
## The official evaluation script, as posted in the forum
https://tianchi.aliyun.com/forum/postDetail?spm=5176.12586969.1002.3.6c3f5619NDeQ04&postId=102089
### [Must Read!] Official Script for Evaluation

In [None]:
# coding=utf-8
from __future__ import division
from __future__ import print_function

import datetime
import json
import sys
import time
from collections import defaultdict

import numpy as np

#### evaluate_each_phase

In [None]:
# Higher scores indicate better performance.
def evaluate_each_phase(predictions, answers):
    list_item_degress = []
    for user_id in answers:
        item_id, item_degree = answers[user_id]
        list_item_degress.append(item_degree)
    list_item_degress.sort()
    median_item_degree = list_item_degress[len(list_item_degress) // 2]

    num_cases_full = 0.0
    ndcg_50_full = 0.0
    ndcg_50_half = 0.0
    num_cases_half = 0.0
    hitrate_50_full = 0.0
    hitrate_50_half = 0.0
    for user_id in answers:
        item_id, item_degree = answers[user_id]
        rank = 0
        while rank < 50 and predictions[user_id][rank] != item_id:
            rank += 1
        num_cases_full += 1.0
        if rank < 50:
            ndcg_50_full += 1.0 / np.log2(rank + 2.0)
            hitrate_50_full += 1.0
        if item_degree <= median_item_degree:
            num_cases_half += 1.0
            if rank < 50:
                ndcg_50_half += 1.0 / np.log2(rank + 2.0)
                hitrate_50_half += 1.0
    ndcg_50_full /= num_cases_full
    hitrate_50_full /= num_cases_full
    ndcg_50_half /= num_cases_half
    hitrate_50_half /= num_cases_half
    return np.array([ndcg_50_full, ndcg_50_half,
                     hitrate_50_full, hitrate_50_half], dtype=np.float32)





#### evaluate

In [None]:
# submit_fname is the path to the file submitted by the participants.
# debias_track_answer.csv is the standard answer, which is not released.
def evaluate(stdout, submit_fname,
             answer_fname='debias_track_answer.csv', current_time=None):
    schedule_in_unix_time = [
        0,  # ........ 1970-01-01 08:00:00 (T=0)
        1586534399,  # 2020-04-10 23:59:59 (T=1)
        1587139199,  # 2020-04-17 23:59:59 (T=2)
        1587743999,  # 2020-04-24 23:59:59 (T=3)
        1588348799,  # 2020-05-01 23:59:59 (T=4)
        1588953599,  # 2020-05-08 23:59:59 (T=5)
        1589558399,  # 2020-05-15 23:59:59 (T=6)
        1590163199,  # 2020-05-22 23:59:59 (T=7)
        1590767999,  # 2020-05-29 23:59:59 (T=8)
        1591372799  # .2020-06-05 23:59:59 (T=9)
    ]
    assert len(schedule_in_unix_time) == 10
    for i in range(1, len(schedule_in_unix_time) - 1):
        # 604800 == one week
        assert schedule_in_unix_time[i] + 604800 == schedule_in_unix_time[i + 1]

    if current_time is None:
        current_time = int(time.time())
    print('current_time:', current_time)
    print('date_time:', datetime.datetime.fromtimestamp(current_time))
    current_phase = 0
    while (current_phase < 9) and (
            current_time > schedule_in_unix_time[current_phase + 1]):
        current_phase += 1
    print('current_phase:', current_phase)

    try:
        answers = [{} for _ in range(10)]
        with open(answer_fname, 'r') as fin:
            for line in fin:
                line = [int(x) for x in line.split(',')]
                phase_id, user_id, item_id, item_degree = line
                assert user_id % 11 == phase_id
                # exactly one test case for each user_id
                answers[phase_id][user_id] = (item_id, item_degree)
    except Exception as _:
        return report_error(stdout, 'server-side error: answer file incorrect')

    try:
        predictions = {}
        with open(submit_fname, 'r') as fin:
            for line in fin:
                line = line.strip()
                if line == '':
                    continue
                line = line.split(',')
                user_id = int(line[0])
                if user_id in predictions:
                    return report_error(stdout, 'submitted duplicate user_ids')
                item_ids = [int(i) for i in line[1:]]
                if len(item_ids) != 50:
                    return report_error(stdout, 'each row need have 50 items')
                if len(set(item_ids)) != 50:
                    return report_error(
                        stdout, 'each row need have 50 DISTINCT items')
                predictions[user_id] = item_ids
    except Exception as _:
        return report_error(stdout, 'submission not in correct format')

    scores = np.zeros(4, dtype=np.float32)

    # The final winning teams will be decided based on phase T=7,8,9 only.
    # We thus fix the scores to 1.0 for phase 0,1,2,...,6 at the final stage.
    if current_phase >= 7:  # if at the final stage, i.e., T=7,8,9
        scores += 7.0  # then fix the scores to 1.0 for phase 0,1,2,...,6
    phase_beg = (7 if (current_phase >= 7) else 0)
    phase_end = current_phase + 1
    for phase_id in range(phase_beg, phase_end):
        for user_id in answers[phase_id]:
            if user_id not in predictions:
                return report_error(
                    stdout, 'user_id %d of phase %d not in submission' % (
                        user_id, phase_id))
        try:
            # We sum the scores from all the phases, instead of averaging them.
            scores += evaluate_each_phase(predictions, answers[phase_id])
        except Exception as _:
            return report_error(stdout, 'error occurred during evaluation')

    return report_score(
        stdout, score=float(scores[0]),
        ndcg_50_full=float(scores[0]), ndcg_50_half=float(scores[1]),
        hitrate_50_full=float(scores[2]), hitrate_50_half=float(scores[3]))
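
`evaluate` calls `report_error` and `report_score`, which are not defined anywhere in this notebook. To run the script locally, stand-ins along the following lines should be enough. Both bodies are hypothetical: the JSON layout is an assumption made for illustration, not the format used by the competition server.

```python
import io
import json


def report_error(stdout, message):
    """Hypothetical stand-in: log the error as one JSON line, return None."""
    stdout.write(json.dumps({'error': message}) + '\n')
    return None


def report_score(stdout, **scores):
    """Hypothetical stand-in: log the scores as one JSON line, return them."""
    stdout.write(json.dumps({'scores': scores}) + '\n')
    return scores


# Usage with an in-memory stream; pass sys.stdout to print to the console.
error_stream = io.StringIO()
error_result = report_error(error_stream, 'submission not in correct format')

score_stream = io.StringIO()
score_result = report_score(score_stream, score=1.5, ndcg_50_full=1.5)
```

The signatures match how `evaluate` invokes them: a positional message for errors and keyword arguments for scores.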

#### _create_answer_file_for_evaluation

In [None]:
# FYI. You can create a fake answer file for validation based on this. For example,
# you can mask the latest ONE click made by each user in underexpose_test_click-T.csv,
# and use those masked clicks to create your own validation set, i.e.,
# a fake underexpose_test_qtime_with_answer-T.csv for validation.
def _create_answer_file_for_evaluation(answer_fname='debias_track_answer.csv'):
    train = 'underexpose_train_click-%d.csv'
    test = 'underexpose_test_click-%d.csv'

    # underexpose_test_qtime-T.csv contains only <user_id, query_time>
    # underexpose_test_qtime_with_answer-T.csv contains <user_id, item_id, time>
    answer = 'underexpose_test_qtime_with_answer-%d.csv'  # not released

    item_deg = defaultdict(lambda: 0)
    with open(answer_fname, 'w') as fout:
        for phase_id in range(10):
            with open(train % phase_id) as fin:
                for line in fin:
                    user_id, item_id, timestamp = line.split(',')
                    user_id, item_id, timestamp = (
                        int(user_id), int(item_id), float(timestamp))
                    item_deg[item_id] += 1
            with open(test % phase_id) as fin:
                for line in fin:
                    user_id, item_id, timestamp = line.split(',')
                    user_id, item_id, timestamp = (
                        int(user_id), int(item_id), float(timestamp))
                    item_deg[item_id] += 1
            with open(answer % phase_id) as fin:
                for line in fin:
                    user_id, item_id, timestamp = line.split(',')
                    user_id, item_id, timestamp = (
                        int(user_id), int(item_id), float(timestamp))
                    assert user_id % 11 == phase_id
                    print(phase_id, user_id, item_id, item_deg[item_id],
                          sep=',', file=fout)
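
The comment at the top of this cell suggests masking each user's latest click to build a local validation set. A minimal pandas sketch of that idea follows. The helper name `split_last_click` and the toy rows are assumptions, and the input is assumed to have a unique index and the columns user_id, item_id, time.

```python
import pandas as pd


def split_last_click(click_df):
    """Hold out each user's latest click as a validation answer.

    Assumes click_df has a unique index and the columns
    user_id, item_id, time.
    """
    ordered = click_df.sort_values(['user_id', 'time'])
    # Last row per user after sorting by time == latest click.
    answers = ordered.groupby('user_id').tail(1)
    # Everything else stays available as click history.
    history = ordered.drop(answers.index)
    return history.reset_index(drop=True), answers.reset_index(drop=True)


# Toy clicks with made-up ids and times.
toy_clicks = pd.DataFrame({
    'user_id': [11, 11, 11, 22, 22],
    'item_id': [5, 7, 9, 7, 3],
    'time': [0.9831, 0.9835, 0.9833, 0.9832, 0.9836],
})
history, answers = split_last_click(toy_clicks)
print(answers)
```

The `answers` frame has the same `<user_id, item_id, time>` layout as the unreleased underexpose_test_qtime_with_answer-T.csv, so it can be written out without a header and fed to `_create_answer_file_for_evaluation`.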

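## Evaluation

For this competition, we use NDCG@50 to measure the quality of the recommended list.

     We compute two metrics: NDCG@50-full and NDCG@50-rare.

         NDCG@50-full is computed on the whole test set, i.e., all test cases in underexpose_test_qtime-T.csv.

         NDCG@50-rare is computed on half of the test cases in underexpose_test_qtime-T.csv. The selected half consists of the cases whose next item to be predicted is less exposed than those of the other half in the past training data, i.e., underexpose_train_click-0.zip, underexpose_train_click-1.zip, …, underexpose_train_click-T.zip.

     Phases T = 0,1,2,…,6 are for development. The final ranking of the participants is computed based on T = 7,8,9.

         A winning team needs to rank in the top 10% on NDCG@50-full while achieving the best NDCG@50-rare among the qualifying teams.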
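
Because every test case has exactly one ground-truth item, the ideal DCG is 1 and NDCG@50 reduces to 1 / log2(rank + 2) for a hit at zero-based position `rank`, and 0 for a miss. The sketch below reproduces the full and rare variants on toy data using the same median split as the official script. The function names and all ids and degrees are made up for illustration.

```python
import math


def ndcg_at_50(predicted_items, true_item):
    """NDCG@50 for a single ground-truth item: 1 / log2(rank + 2) or 0."""
    for rank, item_id in enumerate(predicted_items[:50]):
        if item_id == true_item:
            return 1.0 / math.log2(rank + 2.0)
    return 0.0


def ndcg_full_and_rare(predictions, answers):
    """answers maps user_id -> (true_item, item_degree)."""
    degrees = sorted(degree for _, degree in answers.values())
    median_degree = degrees[len(degrees) // 2]
    full = [ndcg_at_50(predictions[u], item)
            for u, (item, _) in answers.items()]
    # The rare half keeps cases whose true item has degree <= median.
    rare = [ndcg_at_50(predictions[u], item)
            for u, (item, degree) in answers.items()
            if degree <= median_degree]
    return sum(full) / len(full), sum(rare) / len(rare)


# Toy lists are shorter than 50 items for readability.
predictions = {11: [5, 7, 9], 22: [3, 8, 1], 33: [2, 4, 6], 44: [10, 11, 12]}
answers = {11: (5, 100), 22: (1, 3), 33: (99, 1), 44: (11, 50)}
ndcg_full, ndcg_rare = ndcg_full_and_rare(predictions, answers)
print(ndcg_full, ndcg_rare)
```

Here user 11 hits at rank 0 but its item has the highest degree, so it counts toward the full metric only. The rare metric averages the other three users.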

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 100)  # cap the number of displayed columns so fewer are elided with "…"
pd.set_option('expand_frame_repr', False)  # do not wrap the frame when there are many columns

import seaborn as sns
sns.set(font='Arial Unicode MS')  # lets Seaborn render Chinese characters
import sys
sys.path.append('/Users/luoyonggui/PycharmProjects/mayiutils_n1/mayiutils/data_prepare')
from data_explore import DataExplore as de



# load data

In [2]:
path = './data_origin/'

## train

In [None]:
train_user_df = pd.read_csv(path+'underexpose_train/underexpose_user_feat.csv', names=['user_id','user_age_level','user_gender','user_city_level'])

In [3]:
train_user_df.head()

Unnamed: 0,user_id,user_age_level,user_gender,user_city_level
0,17,8.0,M,4.0
1,26,7.0,M,2.0
2,35,6.0,F,4.0
3,40,6.0,M,1.0
4,49,6.0,M,1.0


In [13]:
de.describe(train_user_df)

num of records: 6789, num of columns: 4


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
user_id,int64,6786,0,0.0,14818,2,0.0294594,17247.4,10062.6,10.0,8637.0,17196.0,25856.0,35432.0
user_age_level,float64,8,83,1.2,4,1425,20.9898,4.53579,1.80313,1.0,3.0,5.0,6.0,8.0
user_gender,object,2,81,1.2,F,5211,76.7565,,,,,,,
user_city_level,float64,6,22,0.3,6,1870,27.5446,3.70844,1.79852,1.0,2.0,3.0,6.0,6.0


In [21]:
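# A regex separator is only supported by the Python parser, so request it explicitly.
train_item_df = pd.read_csv(path+'underexpose_train/underexpose_item_feat.csv', sep=r',\s+|,\[|\],\[', names=['item_id']+list(range(256)), engine='python')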



In [24]:
# Strip the trailing ']' left on the last column by the regex separator, then cast to float.
train_item_df[255] = train_item_df[255].str.replace(']', '', regex=False).astype(float)
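
To see what the regex separator does, here is a self-contained sketch on an in-memory sample. It assumes each line of underexpose_item_feat.csv looks like `item_id,[v, v, ...],[v, v, ...]`, which is inferred from the separator pattern rather than documented. The sample uses made-up 2-dimensional vectors instead of the real 128-dimensional ones.

```python
import io

import pandas as pd

# Made-up sample in the assumed layout: item_id,[txt_vec],[img_vec]
sample = "1,[0.5, -1.0],[2.0, 3.5]\n2,[1.5, 0.0],[-2.0, 4.0]\n"
dim = 2  # 128 in the real file
columns = ['item_id'] + list(range(2 * dim))

# ',[' , ', ' and '],[' all act as field separators.
item_df = pd.read_csv(io.StringIO(sample), sep=r',\s+|,\[|\],\[',
                      names=columns, engine='python')

# The final field still ends with ']' and is parsed as text.
last_col = columns[-1]
item_df[last_col] = (item_df[last_col]
                     .str.replace(']', '', regex=False)
                     .astype(float))

# Split the flat columns back into the two feature matrices.
txt_vec = item_df[list(range(dim))].to_numpy()
img_vec = item_df[list(range(dim, 2 * dim))].to_numpy()
print(txt_vec.shape, img_vec.shape)
```

With the real data, columns 0 to 127 would hold `txt_vec` and columns 128 to 255 would hold `img_vec`.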

In [25]:
train_item_df.head()

Unnamed: 0,item_id,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,...,206,207,208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
0,42844,4.514945,-2.38372,0.500414,0.407068,-1.995229,0.109078,-0.691775,2.22746,-6.437974,-0.824897,-0.138724,-0.379329,0.62766,0.418377,4.441218,0.299819,0.578557,-4.699289,-0.39474,-2.391651,0.370532,-1.355466,-1.074178,-2.32164,-0.332456,0.123886,-2.439156,-0.345599,-3.304347,1.485284,0.909802,-1.643002,5.037034,2.780115,4.776496,2.255275,3.769707,-3.661684,-0.649405,4.199636,-0.634806,2.43034,-2.874019,-0.786178,-0.504916,-6.007789,1.498495,1.530613,2.379655,...,0.312405,3.444607,-0.88625,-1.343637,0.954459,0.630835,-2.394722,0.683487,1.149004,-1.351173,2.0239,1.599198,1.382868,1.605678,1.880667,-0.508161,0.24284,-0.260849,1.875943,0.206135,0.186973,2.047446,-0.575472,3.01641,2.757146,3.353721,-0.457272,-0.125337,2.332963,3.858967,-2.07549,-0.705496,0.203452,1.719733,2.925039,-0.388639,1.225732,-1.773137,0.052655,1.279922,-3.374727,-1.506969,-1.82018,-3.024644,0.445263,0.013933,-1.300239,2.759948,2.056171,0.508703
1,67898,-2.002905,-0.929881,0.790017,-1.380895,-0.510463,-1.810096,1.363962,0.497401,-4.038903,-3.057872,0.758558,-1.012155,2.816802,2.086895,-1.464331,-1.840496,-2.089971,-1.566872,1.54539,1.284341,-2.270262,0.780126,1.615594,-0.546058,1.37075,-1.178124,1.346842,0.442434,-1.49854,-0.589944,2.008351,-0.497135,-1.64423,3.140623,3.492178,0.335395,1.810923,-4.01208,2.419593,0.190941,-0.630611,3.289332,-1.446719,-0.61134,0.700662,-2.465656,-0.596773,2.49821,3.682916,...,2.39942,2.024863,0.170483,-0.039203,-1.506677,-1.945932,-0.020228,-0.495499,-0.141013,-1.617521,2.624676,-2.581922,0.220891,0.328793,0.647758,0.23199,1.101486,1.079527,2.953102,-0.528682,-1.1406,-0.373299,0.109811,2.813541,0.596998,1.754836,-1.359771,0.466501,2.377417,-0.180653,-3.259304,0.120833,2.225643,2.220507,-1.178944,-0.821367,0.717239,-1.455829,-1.260584,2.623467,-0.53833,-2.620164,1.277195,0.601015,-0.345312,0.993457,1.351633,2.162675,2.768597,-0.937197
2,66446,4.221673,-1.497139,1.13357,-2.745607,-4.197045,-0.542392,-1.396256,1.838419,-6.066454,-2.191799,0.752804,0.868623,6.187662,1.725745,2.887859,-1.486026,-0.182256,-3.710785,1.512866,-0.636434,0.288435,-3.369717,-0.265998,-3.549319,3.375338,-0.901461,-1.558371,1.695343,-4.450464,0.545495,1.000096,-3.468751,3.327641,1.55689,4.493203,0.369089,0.167196,-4.837062,1.216016,4.699153,-1.094529,3.015942,-1.322741,-0.829172,0.555047,-5.592765,1.254898,3.18245,3.053574,...,-0.005492,3.827181,-0.358198,-2.009379,-0.224391,0.803851,-0.909498,0.96281,2.601583,0.056328,1.859474,-0.316134,-1.131286,1.701278,2.305405,-1.941271,1.248002,0.291,0.792067,1.361166,1.129005,1.947404,-0.859423,2.023223,2.348651,4.506127,0.684437,2.064992,0.022901,3.464243,-2.325273,0.131324,-1.876178,1.770354,2.925176,-1.851054,-0.092587,-0.580742,-0.422019,0.923714,-4.582711,-1.05691,-2.568084,-2.038061,2.508719,-0.764789,-0.657116,3.252782,2.687366,0.844332
3,63651,2.65797,-0.941863,1.121529,-5.109496,-0.279041,-0.351968,-1.086983,2.703607,-6.494977,-0.746769,-0.068571,-3.89467,4.937046,-1.863204,-1.955068,1.900193,1.743841,-6.02479,1.460414,-2.206104,-1.997572,-3.414536,-0.178739,0.987313,1.255347,-1.187136,2.070518,2.191021,-2.936702,2.617733,0.919181,-3.087907,-0.358938,-0.428679,3.815598,2.440558,1.281061,-0.73253,1.517067,2.790302,-2.019122,2.419042,-2.044806,0.649187,1.940526,-4.965359,0.93046,-1.152011,0.167594,...,-0.733038,-0.913736,-1.29619,4.821739,0.687235,-1.406431,0.669184,-1.847598,-0.817075,-1.181172,2.588552,-1.123118,0.232427,2.170325,0.579414,2.601421,-0.596196,-1.798693,0.312326,0.387486,-2.207365,-1.029329,-1.274233,2.04723,1.928517,2.102633,-0.559383,-0.951418,-2.021749,1.366272,-1.947211,-2.114419,1.140394,-0.796024,1.906361,-0.35752,3.352968,-3.996377,1.520331,-0.000716,-0.487683,-1.889119,0.943015,-2.834418,1.633184,2.001801,-2.333152,2.645595,2.280233,-0.694448
4,46824,3.192195,-1.936676,1.199909,-2.562152,-2.573456,0.575841,-2.358653,1.620844,-4.302936,-0.487575,0.020896,-0.763327,4.341694,0.698798,3.33458,0.607683,-0.718644,-2.730188,0.193828,-1.706196,-0.468727,-2.281904,-1.837274,-2.84914,0.195873,-0.459765,-0.768752,1.033489,-2.490896,2.077521,-0.171984,-3.406347,2.61667,0.713099,4.450222,0.606497,0.160672,-2.604218,2.110272,4.714019,-2.297905,1.700881,-0.195633,-0.404006,2.140779,-5.351576,1.592488,1.312723,1.610867,...,1.086525,3.235504,-1.360041,-0.573626,-0.343873,-0.862111,-3.03626,-1.01661,-1.14977,0.46414,2.412406,-0.654135,0.486894,1.168054,-0.27142,0.667091,1.163557,-0.132236,0.668463,-0.228185,0.155768,1.546821,0.30764,2.542941,1.240985,2.144569,-1.079753,0.335191,0.245159,2.150169,-1.99144,-2.330727,0.736855,2.126931,0.556061,-1.611493,2.722133,-1.18594,0.399201,1.598617,-0.621475,-2.09141,0.5016,-3.083864,-1.060091,2.0536,-2.025008,2.399251,2.562317,0.694134


In [27]:
de.describe(train_item_df)

num of records: 108916, num of columns: 257


Unnamed: 0,Data Type,Unique Values,count Missing,% Missing,Mode,Count Mode,% Mode,mean,std,min,25%,50%,75%,max
item_id,int64,108916,0,0.0,1.000000,1,0.000918139,58485.345982,33065.656602,1.000000,30189.750000,58643.500000,87014.250000,117538.000000
0,float64,108745,0,0.0,0.619635,3,0.00275442,1.458865,1.738504,-8.101768,0.285449,1.489091,2.647692,8.362688
1,float64,108810,0,0.0,-1.928095,3,0.00275442,-0.317904,1.874272,-8.306049,-1.594721,-0.405978,0.838072,8.153514
2,float64,108771,0,0.0,-1.808081,3,0.00275442,0.863508,1.635397,-7.136532,-0.162674,0.919081,1.920463,8.325392
3,float64,108649,0,0.0,-2.129149,3,0.00275442,-2.251081,1.686890,-9.212244,-3.422497,-2.393868,-1.209849,11.561003
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,float64,108802,0,0.0,-0.310374,3,0.00275442,0.283913,1.766579,-10.643577,-0.726043,0.431549,1.425704,7.465464
252,float64,108791,0,0.0,-0.136049,3,0.00275442,0.507123,1.846375,-10.683113,-0.831946,0.641637,1.909708,6.809214
253,float64,108513,0,0.0,2.266967,3,0.00275442,2.334019,1.022302,-3.334639,1.690839,2.370883,3.020225,6.749975
254,float64,108610,0,0.0,1.178804,3,0.00275442,1.827126,1.169730,-5.783994,1.132481,1.855712,2.559368,7.211661


In [10]:
train_click_0_df = pd.read_csv(path+'underexpose_train/underexpose_train_click-0.csv',names=['user_id','item_id','time'])
train_click_0_df.head()

Unnamed: 0,user_id,item_id,time
0,4965,18,0.983763
1,20192,34,0.983772
2,30128,91,0.98378
3,29473,189,0.98393
4,10625,225,0.983925


## test

In [None]:
test_qtime_0_df = pd.read_csv(path+'underexpose_test/underexpose_test_click-0/underexpose_test_qtime-0.csv', names=['user_id','query_time'])
test_click_0_df = pd.read_csv(path+'underexpose_test/underexpose_test_click-0/underexpose_test_click-0.csv', names=['user_id','item_id','time'])

In [11]:
test_qtime_0_df.head()

Unnamed: 0,user_id,query_time
0,11,0.983869
1,22,0.983956
2,44,0.983924
3,55,0.983953
4,66,0.983895


In [12]:
test_click_0_df.head()

Unnamed: 0,user_id,item_id,time
0,1133,221,0.983812
1,17864,253,0.983783
2,6941,309,0.983785
3,34089,358,0.983781
4,21659,536,0.983793
