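Some thoughts on the business logic of user recommendation:

1. For new users, i.e. the user cold-start problem, there are two cases to consider:
    * Recommendations needed immediately: recommend from leaderboards, such as users ranked by follower count or trending posts;
    * Recommendations that can wait: once the user has performed some relevant actions, collect the user data the model needs and let the model produce the recommendations; these are not limited to post-based or user-based recommendation;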
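2. For users who already have behavior data, for example follow and follower relations, or posts they have viewed, liked, reposted, or commented on:
    * Recommend first from the posts the user is interested in; this is the most fundamental and most effective kind of recommendation:
        * Without a model, candidates are prioritized by the user's interactions, in this order (highest first):
            1. commented + reposted + liked + clicked the user
            2. clicked the user + commented + reposted
            3. clicked the user + commented + liked
            4. clicked the user + commented
            5. clicked the user + reposted + liked
            6. clicked the user + reposted
            7. clicked the user + liked
            8. clicked the user
            9. commented
            10. reposted
            11. liked
            12. viewed
        * With a model: once the user has post click and view records, feed this data into the model to predict whether the user will be interested in the posts they browse, and recommend posts in descending order of the predicted click probability. When a recommended post is clicked, liked, reposted, or commented on, or the user even clicks through to its author, users can then be recommended by the same logic as above;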
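    * After post-based recommendation, the recommendation list will contain some recommended users. If they do not fill the list, it can be topped up with user-based recommendation:
        * Post view history enables post-based recommendation; once the user follows other users, a user-based recommendation model can be used, which is the NeuralCF model I am working on here;
        * The concrete logic and method are as follows:
            * Simple logic:
                * Feed every user that the target user does not yet follow into the model, get a score for each, and recommend in descending score order;
            * Complex logic:
                * Take the list of users the target user follows and use the model to obtain a vector for each of these follow relations. Then search the stored relations for those whose vectors are most similar; how many neighbors to retrieve is a tunable parameter. From the retrieved relations, keep those whose followed user is someone I (the target user) also follow. The other users followed by the follower in such a relation, and whom I do not follow yet, are the candidates. Pair me with each candidate, feed the pairs into the model to get the probability that I would follow them, and recommend in descending score order;
                * If the recommendation list is still not full, the following strategies can be used:
                    * For each user I also follow, fetch the list of users they follow, feed it into the model to get the corresponding relation vectors, and rank the users they follow that I do not (excluding those from the previous step) by predicted follow probability;
                    * Alternatively, take the users who follow the users I also follow, and recommend from the lists of users they follow, using the same logic as above;
                * If the first step finds no user that I also follow, recommend the retrieved users directly in descending score order. If the list is still not full, apply the same logic as above: recommend users followed by the followee, and users followed by the followee's followers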
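
The non-model priority order in item 2 can be read as a lexicographic sort over the interaction flags. The sketch below is one possible encoding, not the production logic: `Interaction`, `priority_key`, and `rank` are hypothetical names, "clicked the user" is read here as clicking through to a post's author, and combinations the list does not mention (for example commented + liked without a click on the user) are assumed to fall between their neighbors in the same order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record of how the user interacted with one candidate."""
    candidate: str
    clicked_author: bool = False
    commented: bool = False
    reposted: bool = False
    liked: bool = False
    viewed: bool = False


def priority_key(x: Interaction) -> tuple:
    # Tuples compare left to right, so clicked_author dominates, then
    # commented, reposted, liked, and finally viewed.
    return (x.clicked_author, x.commented, x.reposted, x.liked, x.viewed)


def rank(interactions):
    """Return candidate names ordered from highest to lowest priority."""
    ordered = sorted(interactions, key=priority_key, reverse=True)
    return [x.candidate for x in ordered]


demo = [
    Interaction("viewed_only", viewed=True),
    Interaction("liked_only", liked=True, viewed=True),
    Interaction("author_and_like", clicked_author=True, liked=True, viewed=True),
    Interaction("commented_only", commented=True, viewed=True),
    Interaction("all_four", True, True, True, True, True),
]
ranked = rank(demo)
print(ranked)
```

Sorting on a tuple keeps the rule declarative: changing the business priority only means reordering the fields in `priority_key`.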
            
The above is only an outline; the concrete business logic can be designed to fit your own needs.
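
For the cold-start case that needs immediate recommendations, a leaderboard can be as simple as counting followers. This is a minimal sketch, assuming follow relations are available as `(uid, fid)` pairs meaning "uid follows fid"; `top_followed` is a hypothetical helper name and the pairs are toy data.

```python
from collections import Counter


def top_followed(follow_pairs, k=10):
    """Return the k most-followed user ids, most followed first."""
    counts = Counter(fid for _, fid in follow_pairs)
    return [fid for fid, _ in counts.most_common(k)]


pairs = [(1, 9), (2, 9), (3, 9), (1, 7), (2, 7), (3, 5)]
leaderboard = top_followed(pairs, k=2)
print(leaderboard)
```

The same counting approach would work for trending posts by swapping the follow pairs for (user, post) interaction pairs.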

References and code:

1. Paper:
    * https://arxiv.org/pdf/1708.05031.pdf
2. Articles and code:
    * https://blog.csdn.net/wuzhongqiang/article/details/108985457
    * https://www.cnblogs.com/sxzhou/p/14585324.html
    * https://github.com/supkoon/neuralCF_tf2

In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.keras as keras
from sklearn.model_selection import train_test_split
from sklearn.metrics import ndcg_score, precision_score, recall_score, f1_score, accuracy_score

In [2]:
import warnings
warnings.filterwarnings('ignore')

In [3]:
class NeuralMF:
    """NeuMF: a GMF branch and an MLP branch fused before a sigmoid output."""

    def __init__(self, num_users, num_items, latent_features=8, mlp_layers_units=(64, 32, 16, 8)):
        self.num_users = num_users
        self.num_items = num_items
        self.latent_features = latent_features
        self.mlp_layers_units = mlp_layers_units

        user_input = keras.layers.Input(shape=(1,), dtype='int32')
        item_input = keras.layers.Input(shape=(1,), dtype='int32')

        user_embedding_gmf = keras.layers.Embedding(self.num_users, self.latent_features,
                                                    name = 'user_embedding_gmf')(user_input)
        item_embedding_gmf = keras.layers.Embedding(self.num_items, self.latent_features,
                                                    name = 'item_embedding_gmf')(item_input)
        user_latent_gmf = keras.layers.Flatten()(user_embedding_gmf)
        item_latent_gmf = keras.layers.Flatten()(item_embedding_gmf)

        result_gmf = keras.layers.Multiply()([user_latent_gmf, item_latent_gmf])

        user_embedding_mlp = keras.layers.Embedding(self.num_users, self.latent_features,
                                                    name='user_embedding_mlp')(user_input)
        item_embedding_mlp = keras.layers.Embedding(self.num_items, self.latent_features, 
                                                    name='item_embedding_mlp')(item_input)

        user_latent_mlp = keras.layers.Flatten()(user_embedding_mlp)
        item_latent_mlp = keras.layers.Flatten()(item_embedding_mlp)

        # MLP branch: concatenate the two embeddings, then apply the ReLU tower.
        result_mlp = keras.layers.concatenate([user_latent_mlp, item_latent_mlp])

        for unit in self.mlp_layers_units:
            result_mlp = keras.layers.Dense(unit, activation='relu')(result_mlp)

        # Fuse the GMF and MLP branches and predict the interaction probability.
        concat = keras.layers.concatenate([result_gmf, result_mlp])

        output = keras.layers.Dense(1, name='output', activation='sigmoid')(concat)

        self.model = keras.Model(inputs=[user_input, item_input], outputs=[output])

    def get_model(self):
        return self.model
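
To make the data flow of `NeuralMF` concrete, here is a dependency-free sketch of one forward pass for a single (user, item) pair. The GMF branch multiplies the two embeddings element-wise, the MLP branch concatenates them and applies the ReLU layers, and the two results are concatenated before a sigmoid output. The weights are random placeholders and biases are omitted, so the score itself is meaningless; only the shapes and the wiring mirror the Keras model. The helper names `rand_vec` and `dense` are made up for this illustration.

```python
import math
import random

random.seed(0)
LATENT = 8
MLP_UNITS = [64, 32, 16, 8]


def rand_vec(n):
    return [random.uniform(-0.05, 0.05) for _ in range(n)]


def relu(v):
    return max(0.0, v)


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense(x, n_out, activation):
    """Fully connected layer with random weights and no bias (illustration only)."""
    out = [sum(w * v for w, v in zip(rand_vec(len(x)), x)) for _ in range(n_out)]
    return [activation(v) for v in out]


# One embedding per branch and per side, as in the Keras model.
user_gmf, item_gmf = rand_vec(LATENT), rand_vec(LATENT)
user_mlp, item_mlp = rand_vec(LATENT), rand_vec(LATENT)

# GMF branch: element-wise product of the two embeddings.
result_gmf = [u * i for u, i in zip(user_gmf, item_gmf)]

# MLP branch: concatenate, then pass through the ReLU tower.
result_mlp = user_mlp + item_mlp
for units in MLP_UNITS:
    result_mlp = dense(result_mlp, units, relu)

# Fuse both branches and map the result to a probability.
concat = result_gmf + result_mlp
score = dense(concat, 1, sigmoid)[0]
print(len(result_gmf), len(result_mlp), len(concat), score)
```

With `latent_features=8` and a final MLP width of 8, the fused vector has 16 dimensions; this is the vector the later cells extract from the `concatenate_1` layer.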

In [4]:
class Metrics(keras.callbacks.Callback):
    def __init__(self, valid_data):
        super().__init__()
        self.validation_data = valid_data
        
    def on_train_begin(self, logs=None):
        self.val_f1s = []
        self.val_recalls = []
        self.val_precisions = []

    def on_epoch_end(self, epoch, logs=None):
        val_predict = (np.asarray(self.model.predict([
            self.validation_data[0], self.validation_data[1]]))).round()
        val_targ = self.validation_data[2]
        _val_f1 = f1_score(val_targ, val_predict)
        _val_recall = recall_score(val_targ, val_predict)
        _val_precision = precision_score(val_targ, val_predict)
        self.val_f1s.append(_val_f1)
        self.val_recalls.append(_val_recall)
        self.val_precisions.append(_val_precision)
        print('- val_precision: %.4f - val_recall %.4f - val_f1: %.4f' %
              (_val_precision, _val_recall, _val_f1))

In [5]:
def get_data_dict(data, cols=('u_index_value', 'f_index_value')):
    """Return {(user_index, item_index): 1} for every row of `data`."""
    return {(u, f): 1 for u, f in zip(data[cols[0]], data[cols[1]])}

In [6]:
def get_data_instances(train, num_negatives, num_items):
    """Build (user, item, label) lists with `num_negatives` sampled negatives per positive."""
    user_input, item_input, labels = [], [], []
    for (u, i) in train.keys():
        # Observed interaction: positive example.
        user_input.append(u)
        item_input.append(i)
        labels.append(1)

        # Draw items uniformly and resample until the pair is unobserved.
        for _ in range(num_negatives):
            j = np.random.randint(num_items)
            while (u, j) in train:
                j = np.random.randint(num_items)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels
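
A tiny run shows what the negative sampling produces: each positive pair is followed by `num_negatives` sampled pairs that do not occur among the positives. The function is repeated here with the standard `random` module instead of NumPy so the snippet runs on its own; the toy `positives` dictionary is invented.

```python
import random


def get_data_instances(train, num_negatives, num_items):
    user_input, item_input, labels = [], [], []
    for (u, i) in train:
        user_input.append(u)
        item_input.append(i)
        labels.append(1)
        for _ in range(num_negatives):
            j = random.randrange(num_items)
            while (u, j) in train:  # resample until the pair is unobserved
                j = random.randrange(num_items)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels


random.seed(42)
positives = {(0, 1): 1, (0, 2): 1, (1, 3): 1}  # toy (user, item) pairs
users, items, labels = get_data_instances(positives, num_negatives=4, num_items=10)

# 3 positives, each followed by 4 negatives.
print(len(labels), sum(labels))
```

With 4 negatives per positive, only one sample in five is positive, so a classifier that always predicts 0 already reaches 80% accuracy. That is worth keeping in mind when reading the accuracy figures further down.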

In [7]:
follows = pd.read_csv('follows.csv')
follows.drop('rank', axis=1, inplace=True)
follows.drop_duplicates(inplace=True)

Encode the user uid and fid values as integer indices

In [8]:
# Assign each distinct uid / fid a dense integer index in order of first appearance.
uid_codes = follows.uid.drop_duplicates().reset_index()
fid_codes = follows.fid.drop_duplicates().reset_index()
uid_codes.rename(columns={'index':'uid_index'}, inplace=True)
fid_codes.rename(columns={'index':'fid_index'}, inplace=True)
uid_codes['u_index_value'] = list(uid_codes.index)
fid_codes['f_index_value'] = list(fid_codes.index)
# Attach the indices to every follow relation.
follows = pd.merge(follows, uid_codes, how='left', on='uid')
follows = pd.merge(follows, fid_codes, how='left', on='fid')
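
The encoding step boils down to giving each distinct raw id a dense integer in order of first appearance, which is what the embedding layers need as input. A dependency-free sketch of the same idea follows; `encode_ids` is a hypothetical helper and the raw ids are invented.

```python
def encode_ids(raw_ids):
    """Map each distinct id to 0..n-1 in order of first appearance."""
    mapping = {}
    for raw in raw_ids:
        mapping.setdefault(raw, len(mapping))
    return mapping


uids = [501, 501, 777, 42, 777]
u_index = encode_ids(uids)
encoded = [u_index[u] for u in uids]
print(u_index, encoded)
```

Because the indices run from 0 to `len(mapping) - 1`, the embedding table needs exactly `len(mapping)` rows. Note that uid and fid are encoded independently in the notebook, so the same raw user id can map to different values in the two index spaces.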

In [9]:
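train, test = train_test_split(follows, test_size=0.1)
train_dict, test_dict = get_data_dict(train), get_data_dict(test)
# Sample negatives over the full item index range (0 .. max index inclusive),
# not just up to the largest index present in each split.
n_items_total = follows['f_index_value'].max() + 1
train_set = get_data_instances(train_dict, 4, n_items_total)
test_set = get_data_instances(test_dict, 4, n_items_total)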

In [10]:
X1, X2 = np.array(train_set[0]), np.array(train_set[1])

X1_train, X1_val, X2_train, X2_val, y_train, y_val = train_test_split(X1, X2, np.array(train_set[2]), test_size=0.1)

In [11]:
num_users = follows.u_index_value.nunique()
num_items = follows.f_index_value.nunique()

In [12]:
callbacks_list = [
    keras.callbacks.EarlyStopping(
        monitor='val_acc',
        patience=5),
    keras.callbacks.ModelCheckpoint(
        filepath='models/',
        monitor='val_loss',
        save_best_only=True),
    keras.callbacks.TensorBoard(
        log_dir='logs/',
        histogram_freq=0,
        write_graph=True,
        write_images=True)]

In [13]:
BATCH = 512

metrics = Metrics((X1_val, X2_val, y_val))
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = NeuralMF(num_users, num_items, 8, [64,32,16,8]).get_model()
    model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])

model.fit([X1_train, X2_train],
          y_train,
          batch_size=BATCH,
          epochs=100, verbose=1, validation_data=([X1_val, X2_val], y_val),
          callbacks=[metrics] + callbacks_list)

2021-11-25 20:35:34.608589: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set


INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')


2021-11-25 20:35:34.608849: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-25 20:35:34.609303: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.8GHz coreCount: 46 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.23GiB/s
2021-11-25 20:35:34.609445: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:941] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-25 20:35:34.609860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
pciBusID: 0000:03:00.0 name: NVIDIA GeForce RTX 2080 computeCapability: 7.5
coreClock: 1.8GHz coreCount: 46 deviceMemorySize: 7.79GiB deviceMemoryBandwidth

INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).


2021-11-25 20:35:38.116950: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:656] In AUTO-mode, and switching to DATA-based sharding, instead of FILE-based sharding as we cannot find appropriate reader dataset op(s) to shard. Error: Did not find a shardable source, walked to a node which is not a dataset: name: "FlatMapDataset/_9"
op: "FlatMapDataset"
input: "PrefetchDataset/_8"
attr {
  key: "Targuments"
  value {
    list {
    }
  }
}
attr {
  key: "f"
  value {
    func {
      name: "__inference_Dataset_flat_map_slice_batch_indices_437"
    }
  }
}
attr {
  key: "output_shapes"
  value {
    list {
      shape {
        dim {
          size: -1
        }
      }
    }
  }
}
attr {
  key: "output_types"
  value {
    list {
      type: DT_INT64
    }
  }
}
. Consider either turning off auto-sharding or switching the auto_shard_policy to DATA to shard this dataset. You can do this by creating a new `tf.data.Options()` object then setting `options.experimental_distribute.auto

Epoch 1/100
INFO:tensorflow:batch_all_reduce: 10 all-reduces with algorithm = nccl, num_packs = 1
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').

2021-11-25 20:35:40.761463: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.10


   11/88670 [..............................] - ETA: 30:49 - loss: 0.6920 - acc: 0.5518

2021-11-25 20:35:42.986590: I tensorflow/core/profiler/lib/profiler_session.cc:136] Profiler session initializing.
2021-11-25 20:35:42.986624: I tensorflow/core/profiler/lib/profiler_session.cc:155] Profiler session started.
2021-11-25 20:35:42.986697: E tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1415] function cupti_interface_->Subscribe( &subscriber_, (CUpti_CallbackFunc)ApiCallback, this)failed with error CUPTI_ERROR_INSUFFICIENT_PRIVILEGES
2021-11-25 20:35:43.002932: I tensorflow/core/profiler/lib/profiler_session.cc:71] Profiler session collecting data.
2021-11-25 20:35:43.005224: I tensorflow/core/profiler/internal/gpu/cupti_collector.cc:228]  GpuTracer has collected 0 callback api events and 0 activity events. 
2021-11-25 20:35:43.006655: I tensorflow/core/profiler/lib/profiler_session.cc:172] Profiler session tear down.
2021-11-25 20:35:43.008014: I tensorflow/core/profiler/rpc/client/save_profile.cc:137] Creating directory: logs/train/plugins/profile/2021_11_25_20_3



- val_precision: 0.8698 - val_recall 0.6678 - val_f1: 0.7556


2021-11-25 21:00:44.277480: W tensorflow/python/util/util.cc:348] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.


INFO:tensorflow:Assets written to: models/assets
Epoch 2/100


- val_precision: 0.8831 - val_recall 0.6700 - val_f1: 0.7620
INFO:tensorflow:Assets written to: models/assets
Epoch 3/100


- val_precision: 0.8766 - val_recall 0.6783 - val_f1: 0.7648
Epoch 4/100


- val_precision: 0.8685 - val_recall 0.6876 - val_f1: 0.7676
Epoch 5/100


- val_precision: 0.8610 - val_recall 0.6885 - val_f1: 0.7651
Epoch 6/100


- val_precision: 0.8651 - val_recall 0.6838 - val_f1: 0.7638
Epoch 7/100


- val_precision: 0.8404 - val_recall 0.6962 - val_f1: 0.7615
Epoch 8/100


- val_precision: 0.8396 - val_recall 0.6969 - val_f1: 0.7617
Epoch 9/100


- val_precision: 0.8340 - val_recall 0.6975 - val_f1: 0.7596


<tensorflow.python.keras.callbacks.History at 0x7fab0150fb80>

In [14]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 1)]          0                                            
__________________________________________________________________________________________________
user_embedding_mlp (Embedding)  (None, 1, 8)         8185288     input_1[0][0]                    
__________________________________________________________________________________________________
item_embedding_mlp (Embedding)  (None, 1, 8)         10979616    input_2[0][0]                    
______________________________________________________________________________________________

In [15]:
# Labels are required as the second argument: without them Keras has nothing to
# score against, and in this run evaluate() returned (0.0, 0.0).
loss, acc = model.evaluate([np.array(test_set[0]), np.array(test_set[1])], np.array(test_set[2]), batch_size=BATCH, verbose=1)

In [16]:
loss, acc

In [17]:
y_pred = model.predict([np.array(test_set[0]), np.array(test_set[1])]).round()

In [18]:
accuracy_score(np.array(test_set[2]), y_pred)

0.9068501753847581

In [19]:
precision_score(np.array(test_set[2]), y_pred)

0.8149611765473107

In [20]:
recall_score(np.array(test_set[2]), y_pred)

0.6911863360464743

In [21]:
f1_score(np.array(test_set[2]), y_pred)

0.7479878629787061
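
As a sanity check, the F1 score above is the harmonic mean of the precision (cell 19) and recall (cell 20) on the test set, with both values rounded to five decimals:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.81496 \times 0.69119}{0.81496 + 0.69119}
    \approx 0.7480
```

This agrees with the value returned by `f1_score`.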

In [22]:
# ndcg_score([test_set[2]], [[int(v[0]) for v in y_pred]], k=10)

Simple logic implementation

In [23]:
import random

In [24]:
random_test_id = random.choice(follows['u_index_value'].unique())

In [25]:
random_test_df = follows[follows['u_index_value'] == random_test_id]

In [26]:
test_fids = set(follows[follows['u_index_value'] == random_test_id]['f_index_value'])

In [67]:
len(recom_fid)

1372450

In [28]:
recom_df.head(20)

Unnamed: 0,uid,fid,score
983,884310,983,0.999738
874,884310,874,0.999605
902,884310,902,0.99951
1338,884310,1338,0.999485
1348,884310,1348,0.999455
939,884310,939,0.999443
1055,884310,1055,0.999436
1849,884310,1850,0.999432
1746,884310,1747,0.999319
958,884310,958,0.999313
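
The cell that builds `recom_df` is not shown, so the following is only a sketch of the simple logic as described in the introduction: score every user the target does not follow yet, then sort by score. `recommend_simple` is a hypothetical helper, `toy_score` is a stand-in for `model.predict`, and the toy ids are invented and share one id space.

```python
def recommend_simple(target, all_items, followed, score_fn, k=10):
    """Score every not-yet-followed item for `target` and return the top k."""
    candidates = [i for i in all_items if i not in followed and i != target]
    scored = [(i, score_fn(target, i)) for i in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def toy_score(user, item):
    # Stand-in for the trained model: deterministic pseudo-scores in (0, 1].
    return 1.0 / (1 + abs(user - item))


top = recommend_simple(target=5, all_items=range(10), followed={4, 6},
                       score_fn=toy_score, k=3)
print([item for item, _ in top])
```

Scoring every candidate is linear in the number of users per request, which is why the introduction also describes a retrieval step that narrows the candidates first.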


Complex logic implementation

In [29]:
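# Use the tensorflow.keras API imported above rather than standalone Keras;
# mixing the two packages is not guaranteed to work.
# 'concatenate_1' is the fused GMF + MLP vector that feeds the output layer.
intermediate_layer_model = keras.Model(inputs=model.input, outputs=model.get_layer('concatenate_1').output)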

In [30]:
intermediate_output = intermediate_layer_model.predict([np.array(follows['u_index_value']), np.array(follows['f_index_value'])])

In [31]:
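# Key each relation by its (u_index_value, f_index_value) pair in the row order
# of `follows`, which is the order `intermediate_output` was computed in.
# Merging train_dict and test_dict would give the shuffled split order instead.
# Assumes each pair occurs once in `follows` (duplicates were dropped earlier).
data = dict.fromkeys(zip(follows['u_index_value'], follows['f_index_value']), 1)

In [32]:
vec_dict = dict(zip(data.keys(), intermediate_output))

# Position in the vector list -> (u_index_value, f_index_value) pair.
index_dict = dict(enumerate(vec_dict.keys()))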

In [33]:
test_intermediate_output = intermediate_layer_model.predict([np.array(random_test_df['u_index_value']), np.array(random_test_df['f_index_value'])])

In [34]:
vecs = []
for i in vec_dict.values():
    vecs.append(list(i))

In [35]:
cosine_loss = tf.keras.losses.CosineSimilarity(axis=1, reduction=tf.keras.losses.Reduction.NONE)

In [36]:
test_vec = [list(v) for v in test_intermediate_output]

In [37]:
data_vecs = [list(v) for v in vecs]

In [38]:
import heapq
import copy

In [39]:
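n = 200  # number of nearest relation vectors to retrieve per query vector

max_indexes = []
for vec in test_vec:
    # The Keras CosineSimilarity loss returns the *negative* cosine similarity,
    # so negate it to get a score where larger means more similar.
    sims = -cosine_loss([vec], data_vecs).numpy()
    # Indices of the n most similar vectors, most similar first.
    max_index = np.argsort(-sims)[:n].tolist()
    max_indexes.append(max_index)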
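
The nearest-relation search can be illustrated without TensorFlow. This is a minimal sketch of top-n cosine retrieval in which a larger similarity means a closer vector; `cosine` and `top_n_similar` are hypothetical helpers and the vectors are toy data.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_n_similar(query, vectors, n):
    """Return indices of the n vectors most similar to `query`, best first."""
    sims = [(cosine(query, v), idx) for idx, v in enumerate(vectors)]
    return [idx for _, idx in heapq.nlargest(n, sims)]


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
nearest = top_n_similar([1.0, 0.0], vectors, n=2)
print(nearest)
```

Pairing each similarity with its index avoids looking positions up again afterwards. For millions of relation vectors, a brute-force scan like this becomes slow, and an approximate nearest-neighbor index would be the usual next step.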

In [40]:
drop_dup_indexes = []
for idx in max_indexes:
    if not idx in drop_dup_indexes:
        drop_dup_indexes.append(idx)

In [41]:
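recom_list = []
for idxs in drop_dup_indexes:
    for idx in idxs:
        follower, followee = index_dict[idx]
        # Keep relations whose followee the target also follows and take the
        # follower as the candidate. `follower` is a u_index_value while
        # `test_fids` holds f_index_value entries, so this test assumes the
        # two index spaces line up.
        if (followee in test_fids) and (follower not in test_fids) and (follower != random_test_id):
            recom_list.append(follower)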

In [56]:
temp_list = copy.deepcopy(recom_list)

In [57]:
recom_list

[1021762]

In [58]:
K = 20           # target length of the recommendation list
MAX_ROUNDS = 10  # upper bound on expansion rounds

for _ in range(MAX_ROUNDS):
    if not temp_list or len(recom_list) >= K:
        break
    t_list = []
    for fid in temp_list:
        # Follow relations of a user found in the previous round.
        temp = follows[follows['u_index_value'] == fid]
        if temp.empty:
            continue
        temp_fids = set(temp['f_index_value'])
        temp_vecs = intermediate_layer_model.predict([np.array(temp['u_index_value']), np.array(temp['f_index_value'])])
        for vec in temp_vecs:
            # The Keras loss is the negative cosine similarity, so negate it
            # before taking the n most similar relations.
            sims = -cosine_loss([list(vec)], data_vecs).numpy()
            for idx in np.argsort(-sims)[:n]:
                cand, followee = index_dict[int(idx)]
                if (followee in temp_fids) and (cand not in temp_fids) and (cand != fid) \
                        and (cand != random_test_id) and (cand not in recom_list):
                    recom_list.append(cand)
                    t_list.append(cand)
    # Expand from the newly found users in the next round.
    temp_list = t_list

In [59]:
# score every candidate for the test user and sort by the model score
if len(recom_list) != 0:
    uids = [random_test_id] * len(recom_list)
    pred = model.predict([np.array(uids), np.array(recom_list)])
    recom_df = pd.DataFrame({'uid': uids, 'fid': recom_list, 'score': [p[0] for p in pred]})
    recom_df.sort_values('score', ascending=False, inplace=True)

In [60]:
recom_df

      uid      fid     score
0  884310  1021762  0.205373


In [61]:
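# Fallback: start again from an empty recommendation list and pair the test user
# with either ID of every similar relation that the test user does not follow yet.
recom_list = []
candidate_pairs = []
if len(recom_list) <= K:
    for idxs in drop_dup_indexes:
        for idx in idxs:
            for cand in (index_dict[idx][1], index_dict[idx][0]):
                if cand not in test_fids and cand != random_test_id:
                    candidate_pairs.append([random_test_id, cand])

# drop duplicate pairs while keeping the original order
drop_dup_recom = []
for r in candidate_pairs:
    if r not in drop_dup_recom:
        drop_dup_recom.append(r)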

In [62]:
pairs = np.array(drop_dup_recom)  # column 0: test user, column 1: candidate
pred = model.predict([pairs[:, 0], pairs[:, 1]])

recom_df = pd.DataFrame({'uid': list(pairs[:, 0]), 'fid': list(pairs[:, 1]), 'score': [p[0] for p in pred]})

recom_df.sort_values('score', ascending=False, inplace=True)

In [63]:
recom_df.head(int(K-len(recom_list)))

        uid   fid     score
43   884310   874  0.999605
287  884310   939  0.999443
34   884310  1055  0.999436
139  884310  1850  0.999432
115  884310  1019  0.999182
47   884310  1417  0.999148
343  884310   778  0.999098
105  884310  1703  0.999061
321  884310   312  0.999017
226  884310   745  0.998678


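Conclusions from the tests above:
1. With GPU acceleration, recommendations stay timely even for hundreds of thousands to millions of users, so the simple logic can be used directly; the complex logic, judging by these results, does not perform as well;
2. Without a GPU, the candidate list can be filtered further, for example by removing inactive users, blacklisted users, and reported users, to shrink it and keep recommendations as fast as possible;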
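
The filtering in point 2 can be sketched as follows. This is a minimal illustration under assumptions: `filter_candidates` and the `inactive`, `blacklisted`, and `reported` ID sets are hypothetical names, and how those sets are built depends on the business data.

```python
def filter_candidates(candidates, followed, *excluded_sets, self_id=None):
    """Drop already-followed users, the user itself, excluded IDs and repeats, keeping order."""
    excluded = set(followed).union(*excluded_sets)
    if self_id is not None:
        excluded.add(self_id)
    seen = set()
    kept = []
    for uid in candidates:
        if uid in excluded or uid in seen:
            continue
        seen.add(uid)
        kept.append(uid)
    return kept

# illustrative IDs only
candidates = [874, 939, 1055, 874, 1850, 1019]
followed = {939}
inactive = {1055}
blacklisted = set()
reported = {1019}
filtered = filter_candidates(candidates, followed, inactive, blacklisted, reported, self_id=884310)
print(filtered)  # [874, 1850]
```

The filtered list would then be scored with `model.predict` in the same way as in the cells above, so fewer pairs reach the model.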