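# MovieLens-1M CTR Model
In personalized recommendation, the ranking problem is often modeled as a CTR (click-through rate) prediction problem. This notebook shows how to use a neural network model in MetaSpore for offline training, prediction, and export of a CTR model, using the [Wide & Deep](https://arxiv.org/abs/1606.07792) model proposed by Google as the example.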

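**Note**: Before running the steps below, we assume you have completed the data preparation described in [Data Exploration](./data_exploration.ipynb), i.e. the MovieLens data has been uploaded to S3 cloud storage.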

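### 1. Feature Generation
Here we split the data into **training/test** sets, generate the feature columns used by the CTR model, and upload the feature column descriptions to S3 so that MetaSpore can read them.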

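#### 1.1 Reading the Data and Splitting Training/Test Sets
For an online recommender system, the split is usually time-based: for example, days [-N, -2] form the training set, and the test and validation sets are randomly sampled from the last day. Because MovieLens-1M is a medium-sized movie dataset with a modest data volume, we keep the processing simple and apply no special tricks for the recall and ranking stages. We use a Next-Item split: for each user, the interactions at positions [-N, -2] (all but the most recent) form the training set, and the most recent interaction is the test sample: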

<center>
    <img src="./resources/split_data.png" alt="split data" width="600"/>
</center>
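The Next-Item split described above can be sketched in plain Python. This is only an illustration on toy data: the function name `split_next_item` and the tuple layout are assumptions made for this sketch, while the notebook itself performs the split with a Spark SQL `ROW_NUMBER()` window.

```python
from collections import defaultdict

def split_next_item(interactions):
    """Hold out each user's most recent interaction as the test sample.

    `interactions` is an iterable of (user_id, movie_id, timestamp) tuples.
    (Hypothetical helper for illustration; not part of MetaSpore.)
    """
    by_user = defaultdict(list)
    for user_id, movie_id, ts in interactions:
        by_user[user_id].append((ts, movie_id))

    train, test = [], []
    for user_id, events in by_user.items():
        events.sort()              # oldest first
        *history, last = events    # everything but the last goes to training
        train.extend((user_id, movie_id, ts) for ts, movie_id in history)
        test.append((user_id, last[1], last[0]))
    return train, test

# Toy data: user 1 has three interactions, user 2 has two.
toy = [(1, 10, 100), (1, 11, 200), (1, 12, 300), (2, 20, 50), (2, 21, 60)]
train, test = split_next_item(toy)
```

Each user contributes exactly one test row, so the test set has as many rows as there are users before any label filtering.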

In [1]:
import metaspore as ms
import yaml
import argparse
import sys
import os
import subprocess
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType, LongType, StringType
from functools import reduce

sys.path.append('../../../../') 
from python.algos.widedeep_net import WideDeep

def load_config(path):
    params = dict()
    with open(path, 'r') as stream:
        params = yaml.load(stream, Loader=yaml.FullLoader)
        print('Debug -- load config: ', params)
    return params

def init_spark():
    subprocess.run(['zip', '-r', os.getcwd()+'/python.zip', 'python'], cwd='../../../../')
    spark_confs={
        "spark.submit.pyFiles":"python.zip",
        "spark.network.timeout":"500",
        "spark.ui.showConsoleProgress": "true",
        "spark.kubernetes.executor.deleteOnTermination":"true",
        # "spark.kubernetes.namespace":"xxx" # put namespace params here if we want to use computing cluster
    }
    spark_session = ms.spark.get_session(local=local,
                                         app_name=app_name,
                                         batch_size=batch_size,
                                         worker_count=worker_count,
                                         server_count=server_count,
                                         worker_memory=worker_memory,
                                         server_memory=server_memory,
                                         coordinator_memory=coordinator_memory,
                                         spark_confs=spark_confs)
    sc = spark_session.sparkContext
    print('Debug -- spark init')
    print('Debug -- version:', sc.version)   
    print('Debug -- applicaitonId:', sc.applicationId)
    print('Debug -- uiWebUrl:', sc.uiWebUrl)
    return spark_session

def stop_spark(spark):
    print('Debug -- spark stop')
    spark.sparkContext.stop()

def read_dataset(**kwargs):
    ### read movies
    movies_schema = StructType([
            StructField("movie_id", LongType(), True),
            StructField("title", StringType(), True),
            StructField("genre", StringType(), True)
    ])

    movies = spark.read.csv(movies_path, sep='::',inferSchema=False, header=False, schema=movies_schema)
    print('Debug -- movies sample:')
    movies.show(10)

    ### read ratings
    ratings_schema = StructType([
            StructField("user_id", LongType(), True),
            StructField("movie_id", LongType(), True),
            StructField("rating", FloatType(), True),
            StructField("timestamp", LongType(), True)
    ])

    ratings = spark.read.csv(ratings_path, sep='::', inferSchema=False, header=False, schema=ratings_schema)
    print('Debug -- ratings sample:')
    ratings.show(10)

    ### read users
    users_schema = StructType([
            StructField("user_id", LongType(), True),
            StructField("gender", StringType(), True),
            StructField("age", IntegerType(), True),
            StructField("occupation", StringType(), True),
            StructField("zip", StringType(), True)
    ])

    users = spark.read.csv(users_path, sep='::', inferSchema=False, header=False, schema=users_schema)
    print('Debug -- users sample:')
    users.show(10)

    return users, movies, ratings

def merge_dataset(users, movies, ratings):
    # merge movies, users, ratings
    dataset = ratings.join(users, on=ratings.user_id==users.user_id, how='leftouter').drop(users.user_id)
    dataset = dataset.join(movies, on=dataset.movie_id==movies.movie_id,how='leftouter').drop(movies.movie_id)
    dataset = dataset.select('user_id', \
                            'gender', \
                            'age', \
                            'occupation', \
                            'zip', \
                            'movie_id', \
                            'title', \
                            'genre', \
                            'rating', \
                            'timestamp'
                            )
    print('Debug -- dataset sample:')
    dataset.show(10)
    return dataset

def split_train_test(dataset):
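    # registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is its replacement
    dataset.createOrReplaceTempView('dataset')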
    query ="""
    select 
        *
    from
    (
        select
            *,
            ROW_NUMBER() OVER(PARTITION BY user_id ORDER BY timestamp DESC) as sample_id
        from
            dataset
    ) ta
    where ta.sample_id = 1
    order by user_id ASC
    """
    test_dataset = spark.sql(query)
    test_dataset = test_dataset.drop('sample_id')
    train_dataset = dataset.exceptAll(test_dataset)
    return train_dataset, test_dataset

In [2]:
print('Debug -- Movielens Feature Generation Demo')
params = load_config('./2-ctr_prediction.yaml')
locals().update(params)
spark = init_spark()

users, movies, ratings = read_dataset(**params)
merged_dataset = merge_dataset(users, movies, ratings)
train_dataset, test_dataset = split_train_test(merged_dataset)

Debug -- Movielens Feature Generation Demo
Debug -- load config:  {'app_name': 'MovieLens-1M CTR', 'local': True, 'worker_count': 1, 'server_count': 1, 'batch_size': 128, 'worker_memory': '5G', 'server_memory': '5G', 'coordinator_memory': '5G', 'movies_path': 's3://alphaide-demo/movielens/ml-1m/movies.dat', 'ratings_path': 's3://alphaide-demo/movielens/ml-1m/ratings.dat', 'users_path': 's3://alphaide-demo/movielens/ml-1m/users.dat', 'rank_train_dataset_path': 's3://alphaide-demo/movielens/ml-1m/rank/train.parquet', 'rank_test_dataset_path': 's3://alphaide-demo/movielens/ml-1m/rank/test.parquet', 'column_name_path': 's3://alphaide-demo/movielens/ml-1m/schema/widedeep/column_schema', 'combine_schema_path': 's3://alphaide-demo/movielens/ml-1m/schema/widedeep/deep_combine_column_schema', 'wide_combine_schema_path': 's3://alphaide-demo/movielens/ml-1m/schema/widedeep/wide_combine_column_schema', 'model_in_path': None, 'model_out_path': 's3://alphaide-demo/movielens/ml-1m/schema/widedeep/mod

22/05/23 20:05:37 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


Debug -- spark init
Debug -- version: 3.1.2
Debug -- applicaitonId: local-1653307538681
Debug -- uiWebUrl: http://movielens-102-0:4040
Debug -- movies sample:
+--------+--------------------+--------------------+
|movie_id|               title|               genre|
+--------+--------------------+--------------------+
|       1|    Toy Story (1995)|Animation|Childre...|
|       2|      Jumanji (1995)|Adventure|Childre...|
|       3|Grumpier Old Men ...|      Comedy|Romance|
|       4|Waiting to Exhale...|        Comedy|Drama|
|       5|Father of the Bri...|              Comedy|
|       6|         Heat (1995)|Action|Crime|Thri...|
|       7|      Sabrina (1995)|      Comedy|Romance|
|       8| Tom and Huck (1995)|Adventure|Children's|
|       9| Sudden Death (1995)|              Action|
|      10|    GoldenEye (1995)|Action|Adventure|...|
+--------+--------------------+--------------------+
only showing top 10 rows

Debug -- ratings sample:
+-------+--------+------+---------+
|user_id|mov

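#### 1.2 Labeling Positive/Negative Samples and Saving the Training/Test Data to S3
MovieLens contains only user ratings; it lacks the (sample, impression, click) style of user feedback found in typical recommendation scenarios. Here we follow the dataset labeling scheme used in [AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks](https://arxiv.org/abs/1810.11921):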

| Rating | Number of samples | Label |
|---|---|---|
| > 3 | 575281 | Positive |
| = 3 | 261197 | Discarded (neutral) |
| < 3 | 163731 | Negative |

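With this rule we can label the positive and negative samples for the CTR model. The scheme is deliberately simple, and it makes prediction easier than in a real scenario, because a user who chose to watch a movie was usually already fairly interested in it.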
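As a minimal sketch of the labeling rule, following `prepare_rank_train` below, which filters out rating 3 and marks ratings above 3 as positive: the helper name `rating_to_label` is hypothetical, and the counts are copied from the table above.

```python
def rating_to_label(rating):
    """Map a 1-5 star rating to a binary CTR label; None means 'drop the row'."""
    if rating > 3:
        return 1
    if rating < 3:
        return 0
    return None  # rating == 3 is treated as neutral and removed

ratings = [5, 4, 3, 2, 1]
labels = [rating_to_label(r) for r in ratings]

# Sample counts taken from the table above.
positives, neutral, negatives = 575281, 261197, 163731
remaining = positives + negatives      # rows left after dropping rating 3
positive_rate = positives / remaining  # share of label 1 among the remaining rows
```

The 739012 remaining rows equal the train and test sample sizes printed later in this notebook (734372 + 4640), which is a quick way to confirm that rating-3 rows are dropped rather than labeled.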

In [3]:
import time
def prepare_rank_train(spark, dataset, verbose=True, mode='train'):
    start = time.time()
    dataset = dataset.filter(dataset['rating'] != 3)
    dataset = dataset.select(F.when(F.col('rating') > 3, '1').otherwise('0').alias('label'), '*')
    dataset = dataset.withColumn('rand', F.rand(seed=100)).orderBy('rand')
    dataset = dataset.drop('rand', 'timestamp', 'rating')
    dataset = dataset.select(*(F.col(c).cast('string').alias(c) for c in dataset.columns))
    print('Debug -- prepare_rank_train cost time:', time.time() - start)
    if verbose:
        print('Debug -- rank %s sample size:'% mode, dataset.count())
        print('Debug -- rank %s data types:'% mode, dataset.dtypes)
        print('Debug -- rank %s sample:'% mode)
        dataset.show(10)
        print('Debug -- prepare_rank_train total cost time:', time.time() - start)
    return dataset

def prepare_rank_test(spark, dataset, verbose=True):
    return prepare_rank_train(spark, dataset, verbose=verbose, mode='test')

def write_dataset_to_s3(rank_train_dataset, rank_test_dataset, **kwargs):
    start = time.time()
    rank_train_dataset.write.parquet(rank_train_dataset_path, mode="overwrite")
    print('Debug -- write_dataset_to_s3 train cost time:', time.time() - start)
    start = time.time()
    rank_test_dataset.write.parquet(rank_test_dataset_path, mode="overwrite")
    print('Debug -- write_dataset_to_s3 test cost time:', time.time() - start)
    return True

In [4]:
rank_train_dataset = prepare_rank_train(spark, train_dataset)
rank_test_dataset = prepare_rank_test(spark, test_dataset)
write_dataset_to_s3(rank_train_dataset, rank_test_dataset, **params)

Debug -- prepare_rank_train cost time: 0.18449735641479492


                                                                                

Debug -- rank train sample size: 734372
Debug -- rank train data types: [('label', 'string'), ('user_id', 'string'), ('gender', 'string'), ('age', 'string'), ('occupation', 'string'), ('zip', 'string'), ('movie_id', 'string'), ('title', 'string'), ('genre', 'string')]
Debug -- rank train sample:


                                                                                

+-----+-------+------+---+----------+-----+--------+--------------------+--------------------+
|label|user_id|gender|age|occupation|  zip|movie_id|               title|               genre|
+-----+-------+------+---+----------+-----+--------+--------------------+--------------------+
|    1|   2305|     M| 50|        20|48104|    1641|Full Monty, The (...|              Comedy|
|    0|   5605|     F| 18|         2|95008|    3499|       Misery (1990)|              Horror|
|    1|   2387|     M| 25|        17|32224|    1120|People vs. Larry ...|               Drama|
|    1|   1624|     M| 25|         0|06810|      10|    GoldenEye (1995)|Action|Adventure|...|
|    1|   3377|     M| 25|        17|03570|    1200|       Aliens (1986)|Action|Sci-Fi|Thr...|
|    1|   2427|     F| 25|        14|94010|    2997|Being John Malkov...|              Comedy|
|    1|    454|     M| 25|        20|55092|    3160|     Magnolia (1999)|               Drama|
|    1|    339|     M| 50|         7|80207|     11

                                                                                

Debug -- rank test sample size: 4640
Debug -- rank test data types: [('label', 'string'), ('user_id', 'string'), ('gender', 'string'), ('age', 'string'), ('occupation', 'string'), ('zip', 'string'), ('movie_id', 'string'), ('title', 'string'), ('genre', 'string')]
Debug -- rank test sample:


                                                                                

+-----+-------+------+---+----------+-----+--------+--------------------+--------------------+
|label|user_id|gender|age|occupation|  zip|movie_id|               title|               genre|
+-----+-------+------+---+----------+-----+--------+--------------------+--------------------+
|    1|   5966|     F| 35|         9|10021|     497|Much Ado About No...|      Comedy|Romance|
|    0|   4173|     F| 18|        11|29063|      95| Broken Arrow (1996)|     Action|Thriller|
|    1|   1718|     M| 18|         4|43235|    1396|     Sneakers (1992)|  Crime|Drama|Sci-Fi|
|    0|    962|     F| 25|         1|80020|    3896|Way of the Gun, T...|      Crime|Thriller|
|    1|    937|     M| 25|        15|60513|     750|Dr. Strangelove o...|          Sci-Fi|War|
|    1|   3212|     F| 25|        17|11215|    2959|   Fight Club (1999)|               Drama|
|    1|   4368|     M| 18|        17|22043|    1645|Devil's Advocate,...|Crime|Horror|Myst...|
|    1|   2678|     M| 25|         6|49707|    239

                                                                                

Debug -- write_dataset_to_s3 train cost time: 32.89811635017395


                                                                                

Debug -- write_dataset_to_s3 test cost time: 21.252429008483887


True

In [5]:
!aws s3 ls s3://alphaide-demo/movielens/ml-1m/rank/

                           PRE test.parquet/
                           PRE train.parquet/


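#### 1.3 Uploading the Feature Column and Feature Combination Schemas
We upload the feature column schema and the feature combination schemas to S3 cloud storage. Three points are worth noting:
 * The description file `column_schema` lists the name of every column;
 * The `combine_column_schema` files (here `wide_combine_column_schema` and `deep_combine_column_schema`) list the features the model actually uses, one feature per line;
 * To use a combined (cross) feature in the model, simply add a line of the form `feas1#feas2#feas3` to the corresponding `combine_column_schema` file.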
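To make the file format concrete, here is a small sketch that reads combine-schema text into feature groups. The helpers `parse_combine_schema` and `cross_value` are hypothetical and only illustrate the one-feature-per-line and `#`-joined conventions described above; how MetaSpore parses and hashes these features internally is not shown in this notebook.

```python
def parse_combine_schema(text):
    """Turn combine-schema text into a list of feature groups.

    Each non-empty line is one feature; '#' separates the columns
    that make up a combined (cross) feature.
    """
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            groups.append(line.split('#'))
    return groups

def cross_value(row, group):
    """Join the column values of one feature group (illustrative encoding only)."""
    return '#'.join(str(row[col]) for col in group)

# Two single-column features plus one hypothetical cross feature.
schema_text = "user_id\nmovie_id\nuser_id#movie_id\n"
groups = parse_combine_schema(schema_text)

row = {'user_id': '2305', 'movie_id': '1641'}
values = [cross_value(row, g) for g in groups]
```

The schema files used in this notebook contain only the two single-column lines `user_id` and `movie_id`; the third line in the sketch is added purely to show the cross-feature syntax.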

For demonstration purposes, we use only two features here, user id and movie id. The column description file:

In [6]:
!cat ./schema/column_schema

0 label
1 user_id
2 gender
3 age
4 occupation
5 zip
6 movie_id
7 title
8 genre

Features used by the wide part of the model:

In [7]:
!cat ./schema/wide_combine_column_schema

user_id
movie_id

Features used by the deep part of the model:

In [8]:
!cat ./schema/deep_combine_column_schema

user_id
movie_id

Upload the schema files:

In [9]:
!aws s3 cp --recursive schema/  s3://alphaide-demo/movielens/ml-1m/schema/widedeep/

upload: schema/.ipynb_checkpoints/wide_combine_column_schema-checkpoint to s3://alphaide-demo/movielens/ml-1m/schema/widedeep/.ipynb_checkpoints/wide_combine_column_schema-checkpoint
upload: schema/.ipynb_checkpoints/deep_combine_column_schema-checkpoint to s3://alphaide-demo/movielens/ml-1m/schema/widedeep/.ipynb_checkpoints/deep_combine_column_schema-checkpoint
upload: schema/.ipynb_checkpoints/column_schema-checkpoint to s3://alphaide-demo/movielens/ml-1m/schema/widedeep/.ipynb_checkpoints/column_schema-checkpoint
upload: schema/wide_combine_column_schema to s3://alphaide-demo/movielens/ml-1m/schema/widedeep/wide_combine_column_schema
upload: schema/column_schema to s3://alphaide-demo/movielens/ml-1m/schema/widedeep/column_schema
upload: schema/deep_combine_column_schema to s3://alphaide-demo/movielens/ml-1m/schema/widedeep/deep_combine_column_schema


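### 2. Model Training
We use the Wide & Deep model to train, test, and export a CTR model, with `Cross Entropy` as the loss. The Wide & Deep network structure is already defined in the `MetaSpore` code base, so to use it we only need to specify the features and a few hyperparameters such as the learning rate and the number of training epochs.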
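For readers unfamiliar with the loss, the following sketch computes binary cross entropy by hand on a few toy predictions. It is a plain-Python illustration of the formula, not the code MetaSpore runs; the function name and the clipping constant `eps` are assumptions of this sketch.

```python
import math

def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all samples."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

labels = [1, 0, 1, 1]
confident = binary_cross_entropy(labels, [0.9, 0.1, 0.8, 0.7])
uninformed = binary_cross_entropy(labels, [0.5, 0.5, 0.5, 0.5])
```

A model that always predicts 0.5 scores `log(2)` (about 0.693), so any useful CTR model should end up with a lower loss than that.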

#### 2.1 Defining the Training Procedure and Parameters
We define the functions needed for training and set the hyperparameters.

In [10]:
def train(spark, train_dataset, **model_params):
    ## init wide and deep model
    module = WideDeep(use_wide=True,
                      wide_embedding_dim=embedding_size,
                      deep_embedding_dim=embedding_size,
                      wide_column_name_path=column_name_path,
                      wide_combine_schema_path=wide_combine_schema_path,
                      deep_column_name_path=column_name_path,
                      deep_combine_schema_path=combine_schema_path,
                      dnn_hidden_units=dnn_hidden_units,
                      ftrl_l1=ftrl_l1,
                      ftrl_l2=ftrl_l2,
                      ftrl_alpha=ftrl_alpha,
                      ftrl_beta=ftrl_beta)
    
    estimator = ms.PyTorchEstimator(module=module,
                                    worker_count=worker_count,
                                    server_count=server_count,
                                    model_out_path=model_out_path,
                                    model_export_path=model_export_path,
                                    model_version=model_version,
                                    experiment_name=experiment_name,
                                    input_label_column_index=input_label_column_index,
                                    metric_update_interval=100)
    ## dnn learning rate: set the updater before fit() so that it takes effect during training
    estimator.updater = ms.AdamTensorUpdater(adam_learning_rate)
    model = estimator.fit(train_dataset)
    return model

def transform(spark, model, test_dataset):
    test_result = model.transform(test_dataset)
    print('Debug -- test result sample:')
    test_result.show(20)
    return test_result

def evaluate(spark, test_result):
    evaluator = pyspark.ml.evaluation.BinaryClassificationEvaluator()
    auc = evaluator.evaluate(test_result)
    return auc

#### 2.2 Model Training and Export
Once training finishes, the model is exported in `ONNX` format to the predefined `model_export_path`, ready to be loaded later by `MetaSpore Serving`.

In [11]:
model = train(spark, rank_train_dataset, **params)

Get aws endpoint from env: obs.cn-southwest-2.myhuaweicloud.com
[WARN] 2022-05-23 12:07:15.719 STSAssumeRoleWithWebIdentityCredentialsProvider [140013911443264] Token file must be specified to use STS AssumeRole web identity creds provider.
[2022-05-23 20:07:15.719] [info] [s3_sdk_filesys.cpp:357] Try to open S3 stream: s3://alphaide-demo/movielens/ml-1m/schema/widedeep/column_schema, read_only true
[2022-05-23 20:07:15.728] [info] [s3_sdk_filesys.cpp:380] Opened read-only stream for object: s3://alphaide-demo/movielens/ml-1m/schema/widedeep/column_schema with total length 78
[2022-05-23 20:07:15.731] [info] [s3_sdk_filesys.cpp:419] Read S3 object s3://alphaide-demo/movielens/ml-1m/schema/widedeep/column_schema with size 78 at position 0 larger than total size: 78, change size to 78
[2022-05-23 20:07:15.735] [info] [s3_sdk_filesys.cpp:413] Read S3 object s3://alphaide-demo/movielens/ml-1m/schema/widedeep/column_schema reached end 78
[2022-05-23 20:07:15.736] [info] [s3_sdk_filesys.cpp:

[2022-05-23 20:07:17.162] [info] PS job with coordinator address 172.16.0.198:39951 started.
[2022-05-23 20:07:17.162] [info] PSRunner::RunPS: pid: 8383, tid: 8390, thread: 0x7fa6b266b700
[2022-05-23 20:07:17.162] [info] PSRunner::RunPSWorker: pid: 8383, tid: 8390, thread: 0x7fa6b266b700
ps agent registered for process 8383 thread 0x7fa6d4083740
[2022-05-23 20:07:17.163] [info] ActorProcess::Receiving: Worker pid: 8383, tid: 8393, thread: 0x7fa6b0e68700
[2022-05-23 20:07:17.164] [info] PS job with coordinator address 172.16.0.198:39951 started.
[2022-05-23 20:07:17.164] [info] PSRunner::RunPS: pid: 8382, tid: 8382, thread: 0x7fa6d4083740
[2022-05-23 20:07:17.164] [info] PSRunner::RunPSServer: pid: 8382, tid: 8382, thread: 0x7fa6d4083740
[2022-05-23 20:07:17.165] [info] ActorProcess::Receiving: Server pid: 8382, tid: 8396, thread: 0x7fa6b1669700
[2022-05-23 20:07:17.166] [info] S[0]:10 has connected to others.
PS Server node S[0]:10 is ready.
[2022-05-23 20:0

[2022-05-23 20:07:17.165] [info] C[0]:9: The coordinator has connected to 1 servers and 1 workers.
PS Coordinator node C[0]:9 is ready.


init dense tensor dnn.dnn.3.bias with shape (512, 1), updater AdamTensorUpdater(1e-05, 0.9, 0.999, 1e-08) and initializer DefaultTensorInitializer()
init dense tensor dnn.dnn.5.weight with shape (256, 512), updater AdamTensorUpdater(1e-05, 0.9, 0.999, 1e-08) and initializer DefaultTensorInitializer()
init dense tensor dnn.dnn.5.bias with shape (256, 1), updater AdamTensorUpdater(1e-05, 0.9, 0.999, 1e-08) and initializer DefaultTensorInitializer()
init dense tensor dnn.dnn.7.weight with shape (128, 256), updater AdamTensorUpdater(1e-05, 0.9, 0.999, 1e-08) and initializer DefaultTensorInitializer()
init dense tensor dnn.dnn.7.bias with shape (128, 1), updater AdamTensorUpdater(1e-05, 0.9, 0.999, 1e-08) and initializer DefaultTensorInitializer()
init dense tensor dnn.dnn.9.weight with shape (1, 128), updater AdamTensorUpdater(1e-05, 0.9, 0.999, 1e-08) and initializer DefaultTensorInitializer()
[38;5;046mi

2022-05-23 20:07:46.403 -- auc: 0.49422377225527303, Δauc: 0.49422377225527303, pcoc: 0.7836304284398992, Δpcoc: 0.7836304284398992, #instance: 12533



2022-05-23 20:07:50.401 -- auc: 0.49890712418103467, Δauc: 0.497904490442715, pcoc: 0.8683975595478327, Δpcoc: 0.9515381753143699, #instance: 25272



2022-05-23 20:07:54.466 -- auc: 0.5287786246112418, Δauc: 0.642067587764524, pcoc: 0.90941921775345, Δpcoc: 0.9915295151353235, #instance: 37806



2022-05-23 20:07:58.466 -- auc: 0.5888355307627691, Δauc: 0.7513183425581421, pcoc: 0.9320910861776546, Δpcoc: 1.000450662390994, #instance: 50341



2022-05-23 20:08:02.436 -- auc: 0.6396572668453981, Δauc: 0.7937302880246894, pcoc: 0.9470856641611272, Δpcoc: 1.0070823574348688, #instance: 62980



2022-05-23 20:08:06.473 -- auc: 0.6742934204180482, Δauc: 0.8047745367908747, pcoc: 0.956052296589052, Δpcoc: 1.0007457830250033, #instance: 75580



2022-05-23 20:08:10.416 -- auc: 0.7006104337414019, Δauc: 0.8205197035683711, pcoc: 0.9624447314459359, Δpcoc: 1.000958561750649, #instance: 88099



2022-05-23 20:08:14.453 -- auc: 0.721036653930323, Δauc: 0.8314867025046908, pcoc: 0.9672436316047122, Δpcoc: 1.0008212347064718, #instance: 100555



2022-05-23 20:08:18.407 -- auc: 0.7360609505618323, Δauc: 0.8332981718158742, pcoc: 0.9713106417069155, Δpcoc: 1.003957587936223, #instance: 113144



2022-05-23 20:08:22.394 -- auc: 0.748514233165499, Δauc: 0.8391457025607515, pcoc: 0.9741423670836263, Δpcoc: 0.9993435213555709, #instance: 125799



2022-05-23 20:08:26.389 -- auc: 0.7588248959939028, Δauc: 0.8437465538368657, pcoc: 0.9769481618805929, Δpcoc: 1.0054529424897742, #instance: 138336



2022-05-23 20:08:30.445 -- auc: 0.7674185914844701, Δauc: 0.8453951452211194, pcoc: 0.9787015810761907, Δpcoc: 0.9977837106455927, #instance: 150987



2022-05-23 20:08:34.449 -- auc: 0.7742558175839999, Δauc: 0.8442803128560836, pcoc: 0.9803677148358024, Δpcoc: 1.0003642125444767, #instance: 163538



2022-05-23 20:08:38.401 -- auc: 0.7807961080018897, Δauc: 0.8512976513711672, pcoc: 0.9818010484564217, Δpcoc: 1.0002652802931011, #instance: 176230



2022-05-23 20:08:42.414 -- auc: 0.7860887028092984, Δauc: 0.8498806257375501, pcoc: 0.9829759262143312, Δpcoc: 0.9996190370944708, #instance: 188749



2022-05-23 20:08:46.358 -- auc: 0.79147000412859, Δauc: 0.8593051655420713, pcoc: 0.9840582408852915, Δpcoc: 1.0001892871750242, #instance: 201409



2022-05-23 20:08:50.355 -- auc: 0.7960490818889361, Δauc: 0.8583267956617575, pcoc: 0.984971073708194, Δpcoc: 0.9994507584901441, #instance: 214131



2022-05-23 20:08:54.310 -- auc: 0.7997948233523399, Δauc: 0.8554564624016072, pcoc: 0.9859496461635634, Δpcoc: 1.0025498097551206, #instance: 226760



2022-05-23 20:08:58.323 -- auc: 0.8032455380859316, Δauc: 0.8571086709279416, pcoc: 0.9864773623756669, Δpcoc: 0.9960891844823386, #instance: 239243



2022-05-23 20:09:02.278 -- auc: 0.8060230381641149, Δauc: 0.851965315615598, pcoc: 0.9873046011333659, Δpcoc: 1.0030271214045288, #instance: 251847



2022-05-23 20:09:06.241 -- auc: 0.8089253229479643, Δauc: 0.8598191732923223, pcoc: 0.9877668545595187, Δpcoc: 0.9969752154134528, #instance: 264473



2022-05-23 20:09:10.206 -- auc: 0.8114624954279942, Δauc: 0.8597447048081837, pcoc: 0.9884736587118131, Δpcoc: 1.003316257359617, #instance: 277049



2022-05-23 20:09:14.195 -- auc: 0.8141558753859014, Δauc: 0.8662475028469393, pcoc: 0.9888086279058417, Δpcoc: 0.9961381659944848, #instance: 289743




2022-05-23 20:09:18.185 -- auc: 0.8162431765845226, Δauc: 0.8600389005711304, pcoc: 0.9893339411507548, Δpcoc: 1.0016104762822373, #instance: 302194




2022-05-23 20:09:22.149 -- auc: 0.8182342918518255, Δauc: 0.8613128029867511, pcoc: 0.9898772179238875, Δpcoc: 1.0029007894660915, #instance: 314763




2022-05-23 20:09:26.140 -- auc: 0.820061304168748, Δauc: 0.8621031652971993, pcoc: 0.9904282302819968, Δpcoc: 1.0043947964227964, #instance: 327204




2022-05-23 20:09:30.088 -- auc: 0.8218061586296854, Δauc: 0.8632396108616541, pcoc: 0.9907622314317446, Δpcoc: 0.9993186113638799, #instance: 339911




2022-05-23 20:09:34.102 -- auc: 0.8234311215090402, Δauc: 0.8638005214885462, pcoc: 0.991096399717256, Δpcoc: 1.0001451540452253, #instance: 352477




2022-05-23 20:09:38.041 -- auc: 0.8252315015181787, Δauc: 0.8706568910715318, pcoc: 0.9914616811757274, Δpcoc: 1.0017763027237907, #instance: 365035




2022-05-23 20:09:42.055 -- auc: 0.8265862361383689, Δauc: 0.8628463204855064, pcoc: 0.9917454213277063, Δpcoc: 0.9999692803161297, #instance: 377625




2022-05-23 20:09:46.058 -- auc: 0.8280680646117167, Δauc: 0.8683902532904393, pcoc: 0.9920244667240932, Δpcoc: 1.0003717009508144, #instance: 390164




2022-05-23 20:09:50.064 -- auc: 0.8291447116444933, Δauc: 0.8605978257977581, pcoc: 0.9922559001298171, Δpcoc: 0.9994491518166327, #instance: 402624




2022-05-23 20:09:54.070 -- auc: 0.8303347022431606, Δauc: 0.8659471315001728, pcoc: 0.9925274070550318, Δpcoc: 1.001115468509988, #instance: 415243




2022-05-23 20:09:58.140 -- auc: 0.8317132478841627, Δauc: 0.8727909171877761, pcoc: 0.9928398010852153, Δpcoc: 1.0032019918579578, #instance: 427793




2022-05-23 20:10:02.137 -- auc: 0.8329857937538236, Δauc: 0.8725854841723397, pcoc: 0.9931091411343312, Δpcoc: 1.0022777451680074, #instance: 440358




2022-05-23 20:10:06.186 -- auc: 0.8341153830133963, Δauc: 0.871082122514588, pcoc: 0.9933546925519995, Δpcoc: 1.0019100763769402, #instance: 452962




2022-05-23 20:10:10.145 -- auc: 0.8352178469658977, Δauc: 0.8717008928680813, pcoc: 0.9934653879152017, Δpcoc: 0.9974760955804275, #instance: 465545




2022-05-23 20:10:14.146 -- auc: 0.8363562082154956, Δauc: 0.8748263468465465, pcoc: 0.9936237784984041, Δpcoc: 0.9994308562009495, #instance: 478195




2022-05-23 20:10:18.175 -- auc: 0.8374285848081741, Δauc: 0.8746925927358653, pcoc: 0.9938281099291347, Δpcoc: 1.0015512030415454, #instance: 490788




2022-05-23 20:10:22.167 -- auc: 0.8384347649793672, Δauc: 0.8741079564088388, pcoc: 0.9939581412253754, Δpcoc: 0.9990240443659734, #instance: 503456




2022-05-23 20:10:26.140 -- auc: 0.8392615799894467, Δauc: 0.870214609862788, pcoc: 0.9941442723970149, Δpcoc: 1.0016284645123625, #instance: 515983




2022-05-23 20:10:30.087 -- auc: 0.8400273999548222, Δauc: 0.8694263778462346, pcoc: 0.9943016989878417, Δpcoc: 1.0007573305665807, #instance: 528516




2022-05-23 20:10:34.136 -- auc: 0.8409779967933007, Δauc: 0.8774578123609937, pcoc: 0.9943880892001025, Δpcoc: 0.9979920792973751, #instance: 541180




2022-05-23 20:10:38.082 -- auc: 0.8416458645896819, Δauc: 0.8685627418600791, pcoc: 0.9944302490645893, Δpcoc: 0.9962524671024203, #instance: 553730




2022-05-23 20:10:42.047 -- auc: 0.8424669988673367, Δauc: 0.8760784496668605, pcoc: 0.9946032368395659, Δpcoc: 1.0022303430503592, #instance: 566301




2022-05-23 20:10:46.105 -- auc: 0.8431529343766698, Δauc: 0.8719080079389613, pcoc: 0.9947184926681282, Δpcoc: 0.9999218158004934, #instance: 578849




2022-05-23 20:10:50.054 -- auc: 0.8438298259547852, Δauc: 0.8732212970792731, pcoc: 0.9947883596911961, Δpcoc: 0.9979668541844416, #instance: 591465




2022-05-23 20:10:54.055 -- auc: 0.8444170013517426, Δauc: 0.8701641096953727, pcoc: 0.994931524332662, Δpcoc: 1.001681122944879, #instance: 604073




2022-05-23 20:10:58.039 -- auc: 0.8449374801224437, Δauc: 0.8689805442878972, pcoc: 0.9950920884666914, Δpcoc: 1.0027847531032776, #instance: 616726




2022-05-23 20:11:02.005 -- auc: 0.8455860860085385, Δauc: 0.8752171997892438, pcoc: 0.9952085817049114, Δpcoc: 1.0009941312974435, #instance: 629205




2022-05-23 20:11:06.010 -- auc: 0.8461483619907917, Δauc: 0.8729239022356212, pcoc: 0.9950554580387787, Δpcoc: 0.9873406658178495, #instance: 641738




2022-05-23 20:11:09.937 -- auc: 0.8467937047464346, Δauc: 0.8775884323358968, pcoc: 0.9951825435744156, Δpcoc: 1.001597049189549, #instance: 654406




2022-05-23 20:11:13.936 -- auc: 0.8473807842706049, Δauc: 0.8764561942535799, pcoc: 0.9952860441191634, Δpcoc: 1.0006923074966823, #instance: 666910




2022-05-23 20:11:17.861 -- auc: 0.8479029661206665, Δauc: 0.8740445480143987, pcoc: 0.9953728788864882, Δpcoc: 0.9999203624692625, #instance: 679586




2022-05-23 20:11:21.824 -- auc: 0.8484101718883047, Δauc: 0.8743503426642422, pcoc: 0.9954749602693321, Δpcoc: 1.0009456122416736, #instance: 692204




2022-05-23 20:11:25.839 -- auc: 0.8488859261872121, Δauc: 0.87354896456235, pcoc: 0.9955914919338288, Δpcoc: 1.001945999804905, #instance: 704845




2022-05-23 20:11:29.819 -- auc: 0.8493746646307279, Δauc: 0.8748765376148551, pcoc: 0.9956389968071322, Δpcoc: 0.998297133300831, #instance: 717448




2022-05-23 20:11:33.800 -- auc: 0.8498796457504295, Δauc: 0.8771850143050635, pcoc: 0.9957329827907089, Δpcoc: 1.0011051315793287, #instance: 729994


                                                                                

2022-05-23 20:11:35.484 -- auc: 0.8500847201867379, Δauc: 0.8817758831923865, pcoc: 0.9957564136858917, Δpcoc: 0.9996463119234915, #instance: 734372


saving model to s3://alphaide-demo/movielens/ml-1m/schema/widedeep/model_out/
Get aws endpoint from env: obs.cn-southwest-2.myhuaweicloud.com
[WARN] 2022-05-23 12:11:35.511 STSAssumeRoleWithWebIdentityCredentialsProvider [140354498606912] Token file must be specified to use STS AssumeRole web identity creds provider.
[2022-05-23 20:11:35.511] [info] [s3_sdk_filesys.cpp:357] Try to open S3 stream: s3://alphaide-demo/movielens/ml-1m/schema/widedeep/model_out/lr_sparse__sparse_meta.json, read_only false
Get aws endpoint from env: obs.cn-southwest-2.myhuaweicloud.com
[WARN] 2022-05-23 12:11:35.582 STSAssumeRoleWithWebIdentityCredentialsProvider [140353917589248] Token file must be specified to use STS AssumeRole web identity creds provider.
[2022-05-23 20:11:35.582] [info] [s3_sdk_filesys.cpp:357] Try to open S3 stream: s3://alphaide-demo/movielens/ml-1m/schema/widedeep/model_out/lr_sparse__sparse_0.dat, read_only false
[WARN] 2022-05-23 12:11:35.640 STSAssumeRoleWithWebIdent

[2022-05-23 20:11:37.428] [info] C[0]:9 has stopped.
[2022-05-23 20:11:37.429] [info] PS job with coordinator address 172.16.0.198:39951 stopped.
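
The metric lines printed during training follow a fixed layout (timestamp, `auc`, `Δauc`, `pcoc`, `Δpcoc`, `#instance`), so they can be turned into structured records for plotting or monitoring. The following is a minimal sketch that assumes the log format stays exactly as shown in the output above; `METRIC_LINE` and `parse_metric_line` are hypothetical helper names and are not part of MetaSpore.

```python
import re

# Regular expression for metric lines of the form
# "<timestamp> -- auc: X, Δauc: X, pcoc: X, Δpcoc: X, #instance: N".
# Assumption: the layout matches the training output shown in this notebook.
METRIC_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) -- "
    r"auc: (?P<auc>[^,\s]+), Δauc: (?P<delta_auc>[^,\s]+), "
    r"pcoc: (?P<pcoc>[^,\s]+), Δpcoc: (?P<delta_pcoc>[^,\s]+), "
    r"#instance: (?P<instances>\d+)$"
)


def parse_metric_line(line):
    """Return a dict of metrics for one log line, or None for any other line."""
    match = METRIC_LINE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "timestamp": fields["timestamp"],
        # float() also accepts "nan", which the log emits for an empty batch.
        "auc": float(fields["auc"]),
        "delta_auc": float(fields["delta_auc"]),
        "pcoc": float(fields["pcoc"]),
        "delta_pcoc": float(fields["delta_pcoc"]),
        "instances": int(fields["instances"]),
    }
```

Applying `parse_metric_line` to each line of the captured output and dropping the `None` results gives one record per logged step; values such as `Δpcoc: nan` are kept as `float('nan')` rather than rejected.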


graph(%lr_sparse : Float(*, 20, strides=[20, 1], requires_grad=0, device=cpu),
      %dnn_sparse : Float(*, 20, strides=[20, 1], requires_grad=0, device=cpu),
      %dnn.dnn.0.weight : Float(20, strides=[1], requires_grad=0, device=cpu),
      %dnn.dnn.0.bias : Float(20, strides=[1], requires_grad=0, device=cpu),
      %dnn.dnn.0.running_mean : Float(20, strides=[1], requires_grad=0, device=cpu),
      %dnn.dnn.0.running_var : Float(20, strides=[1], requires_grad=0, device=cpu),
      %dnn.dnn.1.weight : Float(1024, 20, strides=[20, 1], requires_grad=0, device=cpu),
      %dnn.dnn.1.bias : Float(1024, strides=[1], requires_grad=0, device=cpu),
      %dnn.dnn.3.weight : Float(512, 1024, strides=[1024, 1], requires_grad=0, device=cpu),
      %dnn.dnn.3.bias : Float(512, strides=[1], requires_grad=0, device=cpu),
      %dnn.dnn.5.weight : Float(256, 512, strides=[512, 1], requires_grad=0, device=cpu),
      %dnn.dnn.5.bias : Float(256, strides=[1], requires_grad=0, device=cpu),
      %dnn

#### 2.3 Model Validation
We print the model's AUC on the test set.

In [12]:
# Score the test set with the trained model
test_result = transform(spark, model, rank_test_dataset)
# Compute AUC on the scored test set
test_auc = evaluate(spark, test_result)
print('Debug -- Test AUC: ', test_auc)

[2022-05-23 20:11:37.476] [info] PS job with coordinator address 172.16.0.198:39951 stopped.
ps agent deregistered for process 8383 thread 0x7fa6d4083740
[2022-05-23 20:11:37.485] [info] PS job with coordinator address 172.16.0.198:44305 started.
[2022-05-23 20:11:37.485] [info] PSRunner::RunPS: pid: 8383, tid: 8383, thread: 0x7fa6d4083740
[2022-05-23 20:11:37.485] [info] PSRunner::RunPSServer: pid: 8383, tid: 8383, thread: 0x7fa6d4083740
[2022-05-23 20:11:37.485] [info] ActorProcess::Receiving: Server pid: 8383, tid: 9033, thread: 0x7fa6b1669700
[2022-05-23 20:11:37.487] [info] PS job with coordinator address 172.16.0.198:44305 started.
[2022-05-23 20:11:37.487] [info] PSRunner::RunPS: pid: 8382, tid: 9037, thread: 0x7fa6b266b700
[2022-05-23 20:11:37.487] [info] PSRunner::RunPSWorker: pid: 8382, tid: 9037, thread: 0x7fa6b266b700
ps agent registered for process 8382 thread 0x7fa6d4083740
[2022-05-23 20:11:37.488] [info] ActorProcess::Receiving: Worker pid: 8

[2022-05-23 20:11:37.464] [info] PS job with coordinator address 172.16.0.198:44305 started.
[2022-05-23 20:11:37.464] [info] PSRunner::RunPS: pid: 7895, tid: 9024, thread: 0x7f56f2cc6700
[2022-05-23 20:11:37.464] [info] PSRunner::RunPSCoordinator: pid: 7895, tid: 9024, thread: 0x7f56f2cc6700
[2022-05-23 20:11:37.467] [info] ActorProcess::Receiving: Coordinator pid: 7895, tid: 9027, thread: 0x7f5721151700
[2022-05-23 20:11:37.488] [info] C[0]:9: The coordinator has connected to 1 servers and 1 workers.
PS Coordinator node C[0]:9 is ready.


init sparse tensor lr_sparse with slice shape (10,), updater FTRLTensorUpdater(1.0, 1.0, 0.02, 1.0) and initializer NormalTensorInitializer(0.0, 0.01)
init sparse tensor dnn_sparse with slice shape (10,), updater FTRLTensorUpdater(1.0, 1.0, 0.02, 1.0) and initializer NormalTensorInitializer(0.0, 0.01)
init dense tensor dnn.dnn.0.weight with shape (20, 1), updater AdamTensorUpdater(1e-05, 0.9, 0.999, 1e-08) and initializer OneTensorInitializer()
init dense tensor dnn.dnn.0.bias with shape (20, 1), updater AdamTensorUpdater(1e-05, 0.9, 0.999, 1e-08) and initializer ZeroTensorInitializer()
init dense tensor dnn.dnn.1.weight with shape (1024, 20), updater AdamTensorUpdater(1e-05, 0.9, 0.999, 1e-08) and initializer DefaultTensorInitializer()
init dense tensor dnn.dnn.1.bias with shape (1024, 1), updater AdamTensorUpdater(1e-05, 0.9, 0.999, 1e-08) and initializer DefaultTensorInitializer()
init den

2022-05-23 20:11:53.751 -- auc: 0.8590159808331953, Δauc: 0.8590159808331953, pcoc: 0.9939411351199142, Δpcoc: 0.9939411351199142, #instance: 2321
2022-05-23 20:11:59.074 -- auc: 0.8564639950217534, Δauc: 0.8539929548498687, pcoc: 0.9888738322643614, Δpcoc: 0.9837838695860895, #instance: 4640
2022-05-23 20:11:59.156 -- auc: 0.8564639950217534, Δauc: 1.0, pcoc: 0.9888738322643614, Δpcoc: nan, #instance: 4640
[2022-05-23 20:11:59.165] [info] C[0]:9 has stopped.
[2022-05-23 20:11:59.166] [info] PS job with coordinator address 172.16.0.198:44305 stopped.
Debug -- test result sample:
+-----+-------+------+---+----------+-----+--------+--------------------+--------------------+--------------------+
|label|user_id|gender|age|occupation|  zip|movie_id|               title|               genre|       rawPrediction|
+-----+-------+------+---+----------+-----+--------+--------------------+--------------------+--------------------+
|  1.0|   5966|     F| 35|         9|10021|     497|Much Ado About No...|      Comedy|Romance|  0.7502685785293579|
|  0.0|   4173|     F| 18|        11|29063|      95| Broken Arrow (1996)|     Action|Thriller|  0

[2022-05-23 20:11:59.165] [info] W[0]:12 has stopped.
[2022-05-23 20:11:59.165] [info] S[0]:10 has stopped.
[2022-05-23 20:11:59.166] [info] PS job with coordinator address 172.16.0.198:44305 stopped.
[2022-05-23 20:11:59.223] [info] PS job with coordinator address 172.16.0.198:44305 stopped.
ps agent deregistered for process 8382 thread 0x7fa6d4083740

Debug -- Test AUC:  0.8564576452173804
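
To make the two logged metrics concrete, here is a small pure-Python sketch of how AUC and PCOC can be computed from `label` and `rawPrediction` values like those in the result sample above. It rests on assumptions: AUC is computed with the standard rank-sum (Mann-Whitney) formulation, and PCOC is read as the ratio of summed predictions to summed labels (predicted clicks over observed clicks). The helper names `auc_score` and `pcoc` are hypothetical, and the implementation inside MetaSpore and the notebook's `evaluate` function may differ.

```python
def auc_score(labels, scores):
    """AUC via the rank-sum formulation, using average ranks for tied scores."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    pos = sum(1 for _, label in pairs if label == 1)
    neg = n - pos
    if pos == 0 or neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    rank_sum = 0.0
    i = 0
    while i < n:
        # Find the end of the run of samples that share the same score.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Average of the 1-based ranks i+1 .. j assigned to this tied group.
        avg_rank = (i + 1 + j) / 2.0
        positives_in_group = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum += avg_rank * positives_in_group
        i = j
    return (rank_sum - pos * (pos + 1) / 2.0) / (pos * neg)


def pcoc(labels, scores):
    """Predicted clicks over observed clicks; values near 1 suggest good calibration."""
    observed = sum(labels)
    if observed == 0:
        raise ValueError("PCOC is undefined when there are no positive labels")
    return sum(scores) / observed


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print('AUC:', auc_score(labels, scores))
print('PCOC:', pcoc(labels, scores))
```

In this toy example three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75. On a Spark DataFrame the same quantities would normally be computed with distributed aggregations rather than by collecting rows to the driver.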


In [13]:
spark.stop()