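## Understanding the Competition Task

Goals of this section:

- Understand the idea behind the task and get a clear view of its business logic, which helps in building effective features and models

- Use the overview and the data to analyze the task and sketch a rough approach

- Understand the metric used to evaluate models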

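### 1. Task Overview

> - Overview: a user-behavior prediction challenge in a news recommendation setting, based on news recommendation in a news app
> - Goal: given each user's historical news click data, predict the user's future click behavior, namely the last news article the user clicks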


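### 2. Data Overview

> - Source: user interaction data from a news app
>
> - Scale: 300,000 users, 3 million clicks, and 360,000 distinct news articles; each article has a corresponding embedding vector
>
> - Train and test: the click logs of 200,000 users form the training set, the click logs of 50,000 users form test set A, and the click logs of another 50,000 users form test set B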

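### 3. Understanding the Evaluation

Start from the final submission file. According to sample.submit.csv, the submission format is:

> - For each user, give 5 recommended articles, ordered from highest to lowest click probability
>
> - Each user's true last click is exactly one article, so the evaluation checks whether that article appears among the 5 recommendations and, if so, at which rank
>
> - Example for User1: User1, article1, article2, article3, article4, article5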
>
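> - Evaluation metric:
>   $$
>   score(user) = \sum_{k=1}^{5} \frac{s(user, k)}{k}
>   $$
>   Here $s(user, k) = 1$ if the $k$-th recommended article is the article the user actually clicked last, and $s(user, k) = 0$ otherwise. For example, if User1's true article is recommended at position $i$, then $s(user1, i) = 1$ and $s(user1, k) = 0$ for every other $k$, so $score(user1) = 1/i$.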
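
The metric can be checked by hand with a few lines of Python. This is a minimal sketch: the function name `mrr_at_5` and the toy article names are assumptions for illustration, not part of the competition kit. The score for one user is the reciprocal of the rank at which the true article appears among the five recommendations, or 0 if it does not appear.

```python
def mrr_at_5(recommended, true_article):
    """Reciprocal rank of true_article within the first 5 recommendations (0.0 if absent)."""
    for k, article in enumerate(recommended[:5], start=1):
        if article == true_article:
            return 1.0 / k
    return 0.0


recs = ["article1", "article2", "article3", "article4", "article5"]
score_hit = mrr_at_5(recs, "article2")   # true article recommended at rank 2
score_miss = mrr_at_5(recs, "article9")  # true article not recommended at all
print(score_hit, score_miss)
```

Because the contribution is divided by the rank, placing the true article first is worth five times as much as placing it fifth.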

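### 4. Understanding the Task

#### 4.1 Competition Goal

Predict the last news article each user clicks, based on the user's historical news click data.

#### 4.2 Characteristics of the Competition

- Goal: we recommend news articles to users, rather than predicting a number or assigning a sample to a class
- Data: the data is not given as features plus labels; it is a user click log taken from a real business scenario

#### 4.3 Interpreting the Task

First recast the task as a classification problem. The label is whether a user clicks a given article, and the features should describe both the user and the article. The classifier we train predicts the probability that a given user's last click is a given article.

#### 4.4 Approach

Convert the prediction task into a supervised learning problem (features plus labels), then model it with machine learning or deep learning methods.

#### 4.5 Questions to Consider

> - How can the task be turned into a supervised learning problem?
> - What kind of supervised learning problem should it become?
> - Which features are available?
> - Which models are worth trying?
> - With hundreds of thousands of candidate articles, which strategies keep recommendation tractable?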

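#### 4.6 What Kind of Supervised Learning Problem

> - Predicting one article out of 360,000 is a 1-in-360,000 choice; can it be recast as a classification problem?
> - Solve it by ranking articles by the probability that each one is the user's last click. This is a click-through-rate prediction problem, that is, a supervised classification problem that can be handled with, for example, logistic regression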
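
One common way to get labels out of a raw click log is to hold out each user's last click as the target and keep the earlier clicks as history. The source does not spell this step out, so the sketch below is an assumption: the helper name `split_last_click` and the toy data are hypothetical, and only the column names follow the competition's click log.

```python
import pandas as pd


def split_last_click(click_df):
    """Split a click log into (history, last_click), holding out each user's latest click."""
    click_df = click_df.sort_values(["user_id", "click_timestamp"])
    last_click = click_df.groupby("user_id").tail(1)  # latest row per user
    history = click_df.drop(last_click.index)         # everything before it
    return history, last_click


clicks = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "click_article_id": [10, 11, 12, 10, 13],
    "click_timestamp": [100, 200, 300, 150, 250],
})

history, last_click = split_last_click(clicks)
# The held-out article per user plays the role of the positive label.
labels = dict(zip(last_click["user_id"].tolist(),
                  last_click["click_article_id"].tolist()))
print(labels)
```

Each (user, candidate article) pair can then be labelled 1 if the candidate equals the held-out article and 0 otherwise, which gives the binary classification setup described above.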

## Baseline

In [6]:
# import packages
import time, math, os
from tqdm import tqdm
import gc
import pickle
import random
from datetime import datetime
from operator import itemgetter
import numpy as np
import pandas as pd
import warnings
from collections import defaultdict
warnings.filterwarnings('ignore')

In [7]:
data_path = './data_raw/'
save_path = './tmp_results/'

### 1 Memory-Saving Function for DataFrames

In [9]:
# A standard helper for saving memory.
# For large amounts of numeric data, replace int64/float64 columns with the smallest
# int/float type (32, 16 or 8 bits) that can hold the column's value range,
# which reduces memory usage.
def reduce_mem(df):
    starttime = time.time()
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            # Skip columns whose min or max is missing
            if pd.isnull(c_min) or pd.isnull(c_max):
                continue
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Note: float16 keeps only about 3 significant decimal digits
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('-- Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction),time spend:{:2.2f} min'.format(end_mem, 
                                                                        100*(start_mem - end_mem)/start_mem,
                                                                        (time.time() - starttime)/60))
    return df
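
To see the effect of downcasting on a small example, the sketch below uses pandas' built-in `pd.to_numeric(..., downcast=...)` instead of `reduce_mem`. This is a swapped-in technique for illustration, not the function above, and the toy column names are assumptions. Unlike `reduce_mem`, `downcast="float"` never goes below float32, so it avoids the precision loss of float16.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "user_id": np.arange(1000, dtype=np.int64),  # values 0..999
    "score": np.linspace(0.0, 1.0, 1000),        # float64 by default
})

before = df.memory_usage(deep=True).sum()

# Pick the smallest integer / float dtype that can hold each column.
df["user_id"] = pd.to_numeric(df["user_id"], downcast="integer")
df["score"] = pd.to_numeric(df["score"], downcast="float")

after = df.memory_usage(deep=True).sum()
print(df.dtypes)
print(before, after)
```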

### 2.1 Reading Sampled Data

In [47]:
# Debug mode: take a subset of the training set to debug the code
def get_all_click_sample(data_path='./data_raw/', sample_nums=10000):
    '''
    Sample part of the training set for debugging.
    data_path: path where the raw data is stored
    sample_nums: number of users to sample (users are sampled because of machine memory limits)
    '''
    all_click = pd.read_csv(data_path + 'train_click_log.csv')  # read the CSV file
    all_user_ids = all_click.user_id.unique()  # distinct user ids

    # numpy.random.choice(a, size=None, replace=True, p=None)
    # draws `size` random elements from the 1-D array `a`.
    # replace=False samples without replacement, so no user id is drawn twice.
    # p gives the probability of each element; by default all are equally likely.
    sample_user_ids = np.random.choice(all_user_ids, size=sample_nums, replace=False)
    all_click = all_click[all_click['user_id'].isin(sample_user_ids)]

    # DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
    # subset: column labels used to identify duplicates (optional)
    # keep: {'first', 'last', False}, default 'first'
    #     'first': keep the first occurrence and drop the later duplicates
    #     'last':  keep the last occurrence and drop the earlier duplicates
    #     False:   drop every row that has a duplicate
    # inplace: if True, modify the DataFrame in place and return None;
    #          the default False returns a new de-duplicated DataFrame
    all_click = all_click.drop_duplicates((['user_id', 'click_article_id', 'click_timestamp']))
    return all_click
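
The sampling and de-duplication steps can be tried on a toy click log. This is a sketch under assumptions: the data is made up, and it uses NumPy's `default_rng` generator with a fixed seed for reproducibility rather than the legacy `np.random.choice` call in the function above.

```python
import numpy as np
import pandas as pd

clicks = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "click_article_id": [10, 10, 20, 21, 30],
    "click_timestamp": [100, 100, 200, 210, 300],
})

# Sample 2 of the 3 users without replacement, then keep only their clicks.
rng = np.random.default_rng(0)
sample_user_ids = rng.choice(clicks["user_id"].unique(), size=2, replace=False)
sampled = clicks[clicks["user_id"].isin(sample_user_ids)]

# Rows 0 and 1 are exact duplicates on the three key columns.
keys = ["user_id", "click_article_id", "click_timestamp"]
deduped_first = clicks.drop_duplicates(keys)              # keeps row 0, drops row 1
deduped_none = clicks.drop_duplicates(keys, keep=False)   # drops both rows 0 and 1
print(len(deduped_first), len(deduped_none))
```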

### 2.2 Reading the Full Dataset

In [48]:
# Read the click data for either the offline or the online setting.
# To produce an online submission, merge the test-set clicks into the full data.
# To validate a model or a feature offline, the training set alone is enough.
def get_all_click_df(data_path='./data_raw/', offline=True):
    if offline:
        all_click = pd.read_csv(data_path + 'train_click_log.csv')
    else:
        trn_click = pd.read_csv(data_path + 'train_click_log.csv')
        tst_click = pd.read_csv(data_path + 'testA_click_log.csv')
        
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        all_click = pd.concat([trn_click, tst_click])
    
    all_click = all_click.drop_duplicates((['user_id', 'click_article_id']))
    return all_click

In [49]:
# Full dataset
all_click_df = get_all_click_df(offline=False)
print(all_click_df)

        user_id  click_article_id  click_timestamp  click_environment  \
0        199999            160417    1507029570190                  4   
1        199999              5408    1507029571478                  4   
2        199999             50823    1507029601478                  4   
3        199998            157770    1507029532200                  4   
4        199998             96613    1507029671831                  4   
...         ...               ...              ...                ...   
518005   221924             70758    1508211323220                  4   
518006   207823            331116    1508211542618                  4   
518007   207823            234481    1508211850103                  4   
518008   207823            211442    1508212189949                  4   
518009   207823            211401    1508212315718                  4   

        click_deviceGroup  click_os  click_country  click_region  \
0                       1        17              1     

### 3 Building the User-Article-Click Time Dictionary

In [50]:
# Build each user's click sequence, ordered by click time
# {user1: [(item1, time1), (item2, time2), ...], ...}
def get_user_item_time(click_df):
    click_df = click_df.sort_values('click_timestamp')
    
    def make_item_time_pair(df):
        return list(zip(df['click_article_id'], df['click_timestamp']))
    
    # Select the columns with a list: recent pandas versions reject a bare tuple of column names here
    user_item_time_df = click_df.groupby('user_id')[['click_article_id', 'click_timestamp']].apply(
                lambda x: make_item_time_pair(x)).reset_index().rename(columns={0: 'item_time_list'})
    user_item_time_dict = dict(zip(user_item_time_df['user_id'], user_item_time_df['item_time_list']))
    
    return user_item_time_dict
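
The shape of the resulting dictionary is easiest to see on toy data. The sketch below builds the same structure with a plain loop instead of `groupby(...).apply(...)`; the helper name `build_user_item_time` and the data are assumptions for illustration.

```python
import pandas as pd


def build_user_item_time(click_df):
    """Return {user_id: [(article_id, click_timestamp), ...]} ordered by click time."""
    click_df = click_df.sort_values("click_timestamp")
    user_item_time = {}
    for user, item, ts in zip(click_df["user_id"].tolist(),
                              click_df["click_article_id"].tolist(),
                              click_df["click_timestamp"].tolist()):
        user_item_time.setdefault(user, []).append((item, ts))
    return user_item_time


clicks = pd.DataFrame({
    "user_id": [1, 2, 1, 2],
    "click_article_id": [11, 20, 10, 21],
    "click_timestamp": [200, 150, 100, 250],
})

user_item_time = build_user_item_time(clicks)
print(user_item_time)
```

Note that each value is a list of `(article, time)` tuples, not a nested dict, so membership tests on article ids need the ids extracted first.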
    

### 4. Getting the Top-k Most-Clicked Articles

In [51]:
# Get the k most-clicked articles in click_df (value_counts sorts by count, descending)
def get_item_topk_click(click_df, k):
    topk_click = click_df['click_article_id'].value_counts().index[:k]
    return topk_click
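
A quick check of the same `value_counts` pattern on a toy column (the data is made up for illustration):

```python
import pandas as pd

clicks = pd.DataFrame({"click_article_id": [10, 11, 10, 12, 10, 11]})

# Article 10 has 3 clicks, article 11 has 2, article 12 has 1.
top2 = clicks["click_article_id"].value_counts().index[:2].tolist()
print(top2)
```

The function counts clicks over whatever DataFrame it receives, so restricting the result to recent articles would require filtering `click_df` by timestamp first.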

### 5 Item Similarity Computation for itemcf

In [71]:
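def itemcf_sim(df):
    '''
    Compute the article-to-article similarity matrix.
    :param df: click log DataFrame
    :return: article-to-article similarity matrix, as a nested dict
    Approach: item-based collaborative filtering; the multi-channel recall part adds association-rule recall strategies
    '''
    
    user_item_time_dict = get_user_item_time(df)
    
    # Accumulate co-occurrence weights between articles
    i2i_sim = {}
    item_cnt = defaultdict(int)
    for user, item_time_list in tqdm(user_item_time_dict.items()):
        # tqdm is a fast, extensible progress bar that shows progress information for long Python loops
        # Click time could also be taken into account when optimizing item-based collaborative filtering
        for i, i_click_time in item_time_list:
            item_cnt[i] += 1
            i2i_sim.setdefault(i, {})
            for j, j_click_time in item_time_list:
                if i == j:
                    continue
                i2i_sim[i].setdefault(j, 0)
                
                # Down-weight co-occurrences that come from users with long click histories
                i2i_sim[i][j] += 1 / math.log(len(item_time_list) + 1)

    # Normalize by item popularity, once, after all users have been processed.
    # A new dict is built so the raw weights in i2i_sim are not overwritten.
    i2i_sim_ = {}
    for i, related_items in i2i_sim.items():
        i2i_sim_[i] = {}
        for j, wij in related_items.items():
            i2i_sim_[i][j] = wij / math.sqrt(item_cnt[i] * item_cnt[j])
                    
    # Save the similarity matrix to disk
    with open(save_path + 'itemcf_i2i_sim.pkl', 'wb') as f:
        pickle.dump(i2i_sim_, f)
        
    return i2i_sim_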
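
The similarity this function aims for can be written as $sim(i, j) = \frac{\sum_{u} 1/\ln(1 + |N(u)|)}{\sqrt{cnt(i)\,cnt(j)}}$, where the sum runs over users who clicked both $i$ and $j$, $|N(u)|$ is the length of user $u$'s click list, and $cnt(i)$ is the number of clicks on article $i$. The sketch below reproduces it on a two-user toy example; the function name `toy_itemcf_sim` and the data are assumptions for illustration, and it skips the progress bar and the pickle step.

```python
import math
from collections import defaultdict


def toy_itemcf_sim(user_items):
    """Item-item similarity from {user: [item, ...]} using the log-damped co-occurrence weight."""
    item_cnt = defaultdict(int)
    co_weight = defaultdict(lambda: defaultdict(float))
    for items in user_items.values():
        weight = 1 / math.log(len(items) + 1)  # long histories count less
        for i in items:
            item_cnt[i] += 1
            for j in items:
                if i != j:
                    co_weight[i][j] += weight
    # Normalize once, after all users have been accumulated.
    return {
        i: {j: w / math.sqrt(item_cnt[i] * item_cnt[j]) for j, w in related.items()}
        for i, related in co_weight.items()
    }


user_items = {"u1": ["a", "b"], "u2": ["a", "b", "c"]}
sim = toy_itemcf_sim(user_items)
print(sim["a"])
```

Articles `a` and `b` co-occur for both users, while `a` and `c` co-occur only in the longer history, so `sim["a"]["b"]` comes out larger than `sim["a"]["c"]`.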
                
        

In [72]:
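# Assign the result so the notebook does not echo the whole similarity dict as cell output
i2i_sim = itemcf_sim(all_click_df)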

### 6 ItemCF article recommendation


In [None]:
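# Item-based recall (i2i)
def item_based_recommend(user_id, user_item_time_dict, i2i_sim, sim_item_topk, recall_item_num, item_topk_click):
    """
        Recall based on article (item) collaborative filtering.
        :param user_id: user id
        :param user_item_time_dict: dict mapping each user to their click sequence ordered by click time,
                                    {user1: [(item1, time1), (item2, time2), ...], ...}
        :param i2i_sim: dict, article similarity matrix {item_i: {item_j: w_ij, ...}, ...}
        :param sim_item_topk: int, number of most similar articles taken for each clicked article
        :param recall_item_num: int, number of articles to recall in the end
        :param item_topk_click: list, most-clicked articles, used to pad the recall list
        :return: recalled articles sorted by score, [(item1, score1), (item2, score2), ...]
        Note: this is item-based collaborative filtering (for details, see the previous team-study series on
        recommender system basics); association-rule recall strategies are added in the multi-channel recall part.
    """

    # Articles the user has interacted with, as (article, click_time) pairs
    user_hist_items = user_item_time_dict[user_id]
    # Article ids only, so the "already clicked" check compares ids rather than (id, time) pairs
    user_hist_item_ids = {item_id for item_id, _ in user_hist_items}

    item_rank = {}
    for loc, (i, click_time) in enumerate(user_hist_items):
        for j, wij in sorted(i2i_sim[i].items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]:
            if j in user_hist_item_ids:
                continue

            item_rank.setdefault(j, 0)
            item_rank[j] += wij

    # Fewer than recall_item_num candidates: pad with popular articles
    if len(item_rank) < recall_item_num:
        for i, item in enumerate(item_topk_click):
            if item in item_rank:  # a padded item must not already be among the candidates
                continue
            item_rank[item] = - i - 100  # any negative score works, so padded items rank last
            if len(item_rank) == recall_item_num:
                break

    item_rank = sorted(item_rank.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]

    return item_rank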
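
To make the scoring rule concrete, here is a minimal self-contained sketch of the same idea on toy data. The function name `toy_itemcf_recall`, the similarity values, and the article ids are all made up for illustration; it assumes the user history is already a plain list of article ids rather than (article, time) pairs.

```python
def toy_itemcf_recall(hist_items, i2i_sim, sim_item_topk, recall_item_num, popular_items):
    """Toy item-based recall: sum similarity weights over each clicked article's top-k neighbours."""
    seen = set(hist_items)
    scores = {}
    for i in hist_items:
        # Take the sim_item_topk most similar articles of article i
        neighbours = sorted(i2i_sim.get(i, {}).items(), key=lambda x: x[1], reverse=True)[:sim_item_topk]
        for j, wij in neighbours:
            if j in seen:  # never recommend something the user already clicked
                continue
            scores[j] = scores.get(j, 0.0) + wij
    # Pad with popular articles, using negative scores so they rank after real candidates
    for rank, item in enumerate(popular_items):
        if len(scores) >= recall_item_num:
            break
        if item in scores:
            continue
        scores[item] = -rank - 100
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:recall_item_num]


# Hypothetical similarity matrix and click history
toy_sim = {
    'a': {'b': 0.9, 'c': 0.4},
    'b': {'a': 0.9, 'c': 0.5, 'd': 0.2},
}
result = toy_itemcf_recall(['a', 'b'], toy_sim, sim_item_topk=2,
                           recall_item_num=3, popular_items=['c', 'e', 'f'])
print(result)
```

Article `c` collects weight from both clicked articles, `d` falls outside the top 2 neighbours of `b`, and the two remaining slots are padded with popular articles `e` and `f`.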

### 7 Recommend articles to each user with item-based collaborative filtering

In [None]:
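# Container for each user's recalled articles
# (defaultdict is imported directly above, so there is no `collections` name to reference)
user_recall_items_dict = defaultdict(dict)

# Build the user -> (article, click time) dictionary
user_item_time_dict = get_user_item_time(all_click_df)

# Load the article similarity matrix
with open(save_path + 'itemcf_i2i_sim.pkl', 'rb') as f:
    i2i_sim = pickle.load(f)

# Number of similar articles taken per clicked article
sim_item_topk = 10

# Number of articles to recall per user
recall_item_num = 10

# Most-clicked articles, used to pad short recall lists
item_topk_click = get_item_topk_click(all_click_df, k=50)

for user in tqdm(all_click_df['user_id'].unique()):
    user_recall_items_dict[user] = item_based_recommend(user, user_item_time_dict, i2i_sim, 
                                                        sim_item_topk, recall_item_num, item_topk_click)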

### 8 Convert the dictionary to a DataFrame

In [None]:
# Flatten the recall dictionary into (user, article, score) rows and build a DataFrame
user_item_score_list = []

for user, items in tqdm(user_recall_items_dict.items()):
    for item, score in items:
        user_item_score_list.append([user, item, score])

recall_df = pd.DataFrame(user_item_score_list, columns=['user_id', 'click_article_id', 'pred_score'])

### 9 Generate the submission file

In [73]:
# Generate the submission file
def submit(recall_df, topk=5, model_name=None):
    recall_df = recall_df.sort_values(by=['user_id', 'pred_score'])
    recall_df['rank'] = recall_df.groupby(['user_id'])['pred_score'].rank(ascending=False, method='first')
    
    # Check that every user has at least topk articles
    tmp = recall_df.groupby('user_id').apply(lambda x: x['rank'].max())
    assert tmp.min() >= topk
    
    del recall_df['pred_score']
    # One row per user, one column per rank
    submit = recall_df[recall_df['rank'] <= topk].set_index(['user_id', 'rank']).unstack(-1).reset_index()
    
    submit.columns = [int(col) if isinstance(col, int) else col for col in submit.columns.droplevel(0)]
    # Rename the columns to match the submission format
    submit = submit.rename(columns={'': 'user_id', 1: 'article_1', 2: 'article_2', 
                                                  3: 'article_3', 4: 'article_4', 5: 'article_5'})
    
    # model_name must be a string; it becomes part of the output file name
    save_name = save_path + model_name + '_' + datetime.today().strftime('%m-%d') + '.csv'
    submit.to_csv(save_name, index=False, header=True)
    

In [None]:
# Load the test set
tst_click = pd.read_csv(data_path + 'testA_click_log.csv')
tst_users = tst_click['user_id'].unique()

# Select the test-set users from the full recall results
tst_recall = recall_df[recall_df['user_id'].isin(tst_users)]

# Generate the submission file
submit(tst_recall, topk=5, model_name='itemcf_baseline')
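
The reshaping step inside `submit` (rank within each user, keep the top k, spread ranks into columns) can be hard to picture. Below is a small sketch on made-up data with `topk = 2`. It uses `DataFrame.pivot` in place of the `set_index(...).unstack(...)` chain above, which is a swapped-in but equivalent way to go from long to wide format; the ids and scores are hypothetical.

```python
import pandas as pd

# Hypothetical recall results in long format: one row per (user, article)
toy_recall = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2, 2],
    'click_article_id': [10, 11, 12, 20, 21, 22],
    'pred_score': [0.2, 0.9, 0.5, 0.7, 0.1, 0.3],
})

topk = 2

# Rank articles within each user, highest score first
toy_recall['rank'] = (
    toy_recall.groupby('user_id')['pred_score']
    .rank(ascending=False, method='first')
    .astype(int)
)

# Keep the top-k rows per user and spread the ranks into columns
top = toy_recall[toy_recall['rank'] <= topk]
wide = top.pivot(index='user_id', columns='rank', values='click_article_id')
wide.columns = [f'article_{r}' for r in wide.columns]
wide = wide.reset_index()
print(wide)
```

Each user ends up on one row, with `article_1` holding the highest-scored article and `article_2` the next one, which is the layout the submission format asks for.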

### 10 Summary

#### 10.1 What understanding the task involves


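> Understanding the task: sort out the problem intuitively and analyze its objective
>
> Understanding the data: learn the data fields relevant to the task and their types, and roughly work out which data are useful for solving the problem; this supports data analysis and feature engineering
>
> Understanding the evaluation metric: the metric is the standard for judging how good our method's results are. Building a reasonable local validation set and a matching validation metric is a key step in a competition and saves a lot of time; different metrics also shift what the predictions should focus on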
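
As a rough illustration of a local validation metric, the sketch below applies the competition score defined earlier, score(user) = Σ s(user, k) / k over the 5 recommended positions. Because each user has exactly one true last click, the sum reduces to 1 / rank of that article, or 0 if it is missing. The function names `mrr_at_k` and `mean_mrr` and the toy data are assumptions for illustration, and holding out each user's last click as the validation label is one possible setup, not something the original notebook prescribes.

```python
def mrr_at_k(recommended, true_item, k=5):
    """Score for one user: 1/position of the true article within the top k, else 0."""
    for pos, item in enumerate(recommended[:k], start=1):
        if item == true_item:
            return 1.0 / pos
    return 0.0


def mean_mrr(user_recs, user_truth, k=5):
    """Average the per-user score over all users that have a held-out true article."""
    scores = [mrr_at_k(user_recs.get(user, []), truth, k) for user, truth in user_truth.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical recommendations and held-out last clicks
toy_recs = {
    'u1': ['a', 'b', 'c', 'd', 'e'],   # true article at position 2 -> 0.5
    'u2': ['x', 'y', 'z', 'w', 'v'],   # true article at position 1 -> 1.0
    'u3': ['m', 'n', 'o', 'p', 'r'],   # true article not recommended -> 0.0
}
toy_truth = {'u1': 'b', 'u2': 'x', 'u3': 'q'}

print(mean_mrr(toy_recs, toy_truth))
```

Computing this on held-out clicks before submitting gives a quick check on whether a change to the recall step is likely to help.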

#### 10.2 What to do after understanding the task

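> Sort out the approach and the overall framework
>
> Analyze the difficulties and key points, and where better features can be mined
>
> Decide which methods to use if overfitting or other problems appear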