## Instructions

1. Run the notebook in a krylov namespace. It has been verified with the following krylov configuration:
```json
{
  "application": "jupyterlab",
  "description": "",
  "workspaceConfiguration": {
    "image": "ecr.vip.ebayc3.com/ppetrov/krylov-passion:latest",
    "hadoop": {
      "batchUser": "b_selling_research",
      "hadoopCluster": "apollo-rno"
    }
  }
}
```

2. Set up git access from the krylov workspace (do this only once)
    - Go to https://github.ebay.com/settings/tokens
    - Click `Generate new token` and copy the token
    - Open a terminal in the krylov workspace
    - Save the token in your home directory, e.g. in the `~/.gittoken` file


3. Clone `pretrainer_utils` under some `<repos_path>` on krylov:
```bash
cd <repos_path>
git clone https://`cat ~/.gittoken`@github.ebay.com/dbasin/pretrainer_utils.git
```
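The clone command above reads the token from `~/.gittoken` and embeds it in the HTTPS URL. The same step can be sketched in Python, which makes it easier to validate the token before calling git. `read_token` and `build_clone_url` are hypothetical helper names introduced here for illustration; they are not part of `pretrainer_utils`.

```python
from pathlib import Path
from urllib.parse import quote


def read_token(path: str = "~/.gittoken") -> str:
    """Read the personal access token saved in step 2, stripping whitespace."""
    return Path(path).expanduser().read_text().strip()


def build_clone_url(token: str, host: str, repo: str) -> str:
    """Return an HTTPS clone URL that carries the token as the user part."""
    if not token:
        raise ValueError("empty token")
    # quote() escapes characters that are not valid in the user part of a URL.
    return f"https://{quote(token, safe='')}@{host}/{repo}.git"
```

A possible usage is `build_clone_url(read_token(), "github.ebay.com", "dbasin/pretrainer_utils")`, with the result passed to `git clone`. Note that a URL with an embedded token is stored in the clone's `.git/config`, so treat the cloned directory as sensitive.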

## Installs

In [2]:
import os
os.environ['HTTP_PROXY'] = 'http://httpproxy-tcop.vip.ebay.com:80'
os.environ['HTTPS_PROXY']='http://httpproxy-tcop.vip.ebay.com:80'

In [7]:
! pip3 install xgboost

Defaulting to user installation because normal site-packages is not writeable
Collecting xgboost
  Downloading xgboost-1.6.1-py3-none-manylinux2014_x86_64.whl (192.9 MB)
     |████████████████████████████████| 192.9 MB 68.9 MB/s            
Installing collected packages: xgboost
Successfully installed xgboost-1.6.1
You should consider upgrading via the '/opt/conda/bin/python -m pip install --upgrade pip' command.
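`! pip3 install` runs whichever `pip3` is first on the shell `PATH`, which is not necessarily tied to the interpreter that runs the kernel. That is one possible cause of the `ModuleNotFoundError: No module named 'xgboost'` shown at the `xgb_utils` import cell in this notebook. A minimal sketch, assuming a user-level install is acceptable, that binds pip to the kernel's interpreter and checks the result; the helper names are hypothetical.

```python
import importlib
import importlib.util
import subprocess
import sys
from typing import List, Optional


def is_importable(module_name: str) -> bool:
    """True if the interpreter running this kernel can locate the module."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package: str) -> List[str]:
    """Build a pip command bound to the interpreter that runs this kernel."""
    return [sys.executable, "-m", "pip", "install", "--user", package]


def ensure_installed(module_name: str, package: Optional[str] = None) -> bool:
    """Install the package only if the module is missing.

    Returns whether the module is importable afterwards.
    """
    if not is_importable(module_name):
        subprocess.run(pip_install_command(package or module_name), check=True)
        importlib.invalidate_caches()  # let the import system see new files
    return is_importable(module_name)
```

Calling `ensure_installed("xgboost")` before the import cells would then either confirm the package is visible or install it. If it still returns `False` after installing, restarting the kernel is the usual next step.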


## Imports

#### General

In [3]:
import pickle
import pandas as pd
from pyspark.sql import functions as F, Row
from fsspec.implementations import hdfs
from functools import partial

import xgboost as xgb
from sklearn.model_selection import train_test_split

ImportError: cannot import name 'hdfs' from 'fsspec.implementations' (/opt/conda/lib/python3.10/site-packages/fsspec/implementations/__init__.py)
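The `ImportError` above indicates that the installed fsspec no longer ships `fsspec.implementations.hdfs`. One way to keep the notebook working across image versions is to try several import locations in order. The sketch below shows that pattern with a hypothetical helper, `import_first_available`. The second candidate location is an assumption about newer, pyarrow-based fsspec releases and should be checked against the version installed in your image.

```python
import importlib


def import_first_available(candidates):
    """Return (module_path, attribute) for the first candidate that imports.

    candidates: iterable of (module_path, attribute_name) pairs, tried in order.
    """
    errors = []
    for module_path, attr in candidates:
        try:
            module = importlib.import_module(module_path)
            return module_path, getattr(module, attr)
        except (ImportError, AttributeError) as exc:
            errors.append(f"{module_path}.{attr}: {exc}")
    raise ImportError("none of the candidates could be imported:\n" + "\n".join(errors))


# Candidate locations of HadoopFileSystem. The first is the one this notebook
# uses; the second is an ASSUMPTION to verify against your fsspec version.
HDFS_CANDIDATES = [
    ("fsspec.implementations.hdfs", "HadoopFileSystem"),
    ("fsspec.implementations.arrow", "HadoopFileSystem"),
]
```

With this in place, `_, HadoopFileSystem = import_first_available(HDFS_CANDIDATES)` would give a class to instantiate. Later cells that call `hdfs.HadoopFileSystem()` would then need to use that name instead.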

#### Pretrainer

In [4]:
import sys
repos_path = '/home/mmandelbrod/repositories/'
utils_path = f'{repos_path}/pretrainer_utils/utils'
sys.path.append(utils_path)

In [5]:
from pretrainer import load_npy_path, Fetcher
from spark_utils import load_spark
from pretrainer_utils import numpy_data_to_pdf, label_extract_processor, parse_category, calc_active_features, extract_label
from hdfs_utils import HDFS

In [6]:
from xgb_utils import pretrainer_train_test_split, create_dmatrix, calc_feature_imp, RecordEval, load_bst_model
from xgb_utils import calc_rank, sale_rank_stats, calc_pred_score, calc_sale_rank, calc_comb_score, model_vs_prods_ranks, calc_groups

ModuleNotFoundError: No module named 'xgboost'

## Spark setup

In [7]:
spark = load_spark(queue='hdlq-struct-default')

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/apache/releases/spark-3.1.1.1.0.0-bin-ebay/jars/parquet-kms-client-0.2.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/apache/releases/spark-3.1.1.1.0.0-bin-ebay/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]


2024-09-02T18:42:58.106+0000: [GC (Allocation Failure) 
Desired survivor size 20447232 bytes, new threshold 7 (max 15)
[PSYoungGen: 122880K->14980K(142848K)] 122880K->14996K(469504K), 0.0175472 secs] [Times: user=0.04 sys=0.01, real=0.02 secs] 
2024-09-02T18:42:58.159+0000: [GC (Metadata GC Threshold) 
Desired survivor size 20447232 bytes, new threshold 7 (max 15)
[PSYoungGen: 21032K->6230K(265728K)] 21048K->6254K(592384K), 0.0062523 secs] [Times: user=0.01 sys=0.01, real=0.01 secs] 
2024-09-02T18:42:58.165+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 6230K->0K(265728K)] [ParOldGen: 24K->6148K(152064K)] 6254K->6148K(417792K), [Metaspace: 20235K->20235K(1067008K)], 0.0301337 secs] [Times: user=0.06 sys=0.00, real=0.03 secs] 


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


2024-09-02T18:42:59.480+0000: [GC (Metadata GC Threshold) 
Desired survivor size 20447232 bytes, new threshold 7 (max 15)
[PSYoungGen: 157324K->12053K(265728K)] 163473K->18210K(417792K), 0.0126452 secs] [Times: user=0.01 sys=0.01, real=0.01 secs] 
2024-09-02T18:42:59.493+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 12053K->0K(265728K)] [ParOldGen: 6156K->17079K(216064K)] 18210K->17079K(481792K), [Metaspace: 33462K->33456K(1079296K)], 0.0627781 secs] [Times: user=0.14 sys=0.01, real=0.06 secs] 
2024-09-02T18:43:01.560+0000: [GC (Metadata GC Threshold) 
Desired survivor size 20447232 bytes, new threshold 7 (max 15)
[PSYoungGen: 219378K->19964K(409088K)] 236457K->38577K(625152K), 0.0211153 secs] [Times: user=0.03 sys=0.00, real=0.02 secs] 
2024-09-02T18:43:01.581+0000: [Full GC (Metadata GC Threshold) [PSYoungGen: 19964K->0K(409088K)] [ParOldGen: 18612K->29174K(282112K)] 38577K->29174K(691200K), [Metaspace: 56486K->56486K(1101824K)], 0.0854573 secs] [Times: user=0.18 sys=0.02, real

24/09/02 18:43:02 WARN HiveConf: DEPRECATED: hive.metastore.ds.retry.* no longer has any effect.  Use hive.hmshandler.retry.* instead
24/09/02 18:43:02 WARN HiveConf: HiveConf of name hive.metastore.local does not exist
24/09/02 18:43:02 WARN HiveConf: HiveConf of name hive.enforce.sorting does not exist
24/09/02 18:43:02 WARN HiveConf: HiveConf of name hive.server2.proxyuser.hue.groups does not exist
24/09/02 18:43:02 WARN HiveConf: HiveConf of name hive.server2.proxyuser.hue.hosts does not exist
24/09/02 18:43:02 WARN HiveConf: HiveConf of name hive.metastore.ds.retry.interval does not exist
24/09/02 18:43:02 WARN HiveConf: HiveConf of name hive.enforce.bucketing does not exist
24/09/02 18:43:02 WARN HiveConf: HiveConf of name hive.metastore.ds.retry.attempts does not exist
24/09/02 18:43:02 WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist
24/09/02 18:43:04 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading li

2024-09-02T18:43:07.142+0000: [GC (Allocation Failure) 
Desired survivor size 31457280 bytes, new threshold 6 (max 15)
[PSYoungGen: 389120K->19936K(453632K)] 418294K->56127K(735744K), 0.0442534 secs] [Times: user=0.04 sys=0.06, real=0.04 secs] 


24/09/02 18:43:09 ERROR SparkContext: Error initializing SparkContext.
org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User b_perso does not have permission to submit application_1724912156839_757521 to queue hdlq-struct-default
	at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:513)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:390)
	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:742)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:290)
	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:617)
	at org.apache.h

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User b_perso does not have permission to submit application_1724912156839_757521 to queue hdlq-struct-default
	at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:513)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:390)
	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:742)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:290)
	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:617)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:698)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:666)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:650)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1105)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1164)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1091)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3173)
Caused by: org.apache.hadoop.security.AccessControlException: User b_perso does not have permission to submit application_1724912156839_757521 to queue hdlq-struct-default
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:516)
	... 14 more

	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
	at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateYarnException(RPCUtil.java:75)
	at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:116)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:304)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:438)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:168)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:160)
	at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
	at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:365)
	at com.sun.proxy.$Proxy21.submitApplication(Unknown Source)
	at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.submitApplication(YarnClientImpl.java:318)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:207)
	at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:579)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.YarnException): org.apache.hadoop.security.AccessControlException: User b_perso does not have permission to submit application_1724912156839_757521 to queue hdlq-struct-default
	at org.apache.hadoop.yarn.ipc.RPCUtil.getRemoteException(RPCUtil.java:38)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:513)
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.submitApplication(RMAppManager.java:390)
	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.submitApplication(ClientRMService.java:742)
	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.submitApplication(ApplicationClientProtocolPBServiceImpl.java:290)
	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:617)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:698)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:666)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:650)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1105)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1164)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1091)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3173)
Caused by: org.apache.hadoop.security.AccessControlException: User b_perso does not have permission to submit application_1724912156839_757521 to queue hdlq-struct-default
	at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:516)
	... 14 more

	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1926)
	at org.apache.hadoop.ipc.Client.call(Client.java:1852)
	at org.apache.hadoop.ipc.Client.call(Client.java:1795)
	at org.apache.hadoop.ipc.Client.call(Client.java:1670)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:270)
	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:144)
	at com.sun.proxy.$Proxy20.submitApplication(Unknown Source)
	at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.submitApplication(ApplicationClientProtocolPBClientImpl.java:301)
	... 27 more
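The long trace above repeats a single root cause: the submitting user is not allowed to use the requested YARN queue. Note that the message names `b_perso`, while the workspace configuration in the Instructions lists `b_selling_research` as `batchUser`. A small sketch that pulls the user, application id, and queue out of such a message, so the relevant facts can be reported without the full trace; `parse_queue_denial` is a hypothetical helper.

```python
import re

# Pattern for the AccessControlException message seen in the trace above.
ACCESS_DENIED = re.compile(
    r"User (?P<user>\S+) does not have permission to submit "
    r"(?P<app>application_\d+_\d+) to queue (?P<queue>\S+)"
)


def parse_queue_denial(message: str):
    """Return a dict with 'user', 'app' and 'queue', or None if no match."""
    match = ACCESS_DENIED.search(message)
    return match.groupdict() if match else None


sample = (
    "org.apache.hadoop.security.AccessControlException: User b_perso does not have "
    "permission to submit application_1724912156839_757521 to queue hdlq-struct-default"
)
info = parse_queue_denial(sample)
```

In a notebook, the message could be taken from `str(exc)` inside an `except` block around `load_spark(...)`.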


In [33]:
# spark.stop()

In [2]:
!ls /apache


confs	    hbase-1.4	     optimus	      spark-3.1.1.1.0  zookeeper
hadoop	    hive	     pig	      spark2.3
hadoop-2.7  hive-hadoop-2.7  releases	      spark2.4
hadoop-3.3  hive-hadoop-3.3  spark	      spark3.1
hbase	    java	     spark-3.1.1.0.9  spark_bak


## Constants

#### Parameters description

- `base_path` - should reference your model base path, e.g. `/apps/b_perso/vlp/simplark/pretrainer/RecommendedBrandOutletWithMLR` (note: no `viewfs` prefix here)
- `base_out_path` - specify only if you use `Extender` to add features. It defines the output location for the numpy files generated by `Extender`, which creates a folder with the run timestamp for each run.
- `start_date`, `end_date` - the date range of the loaded training data (inclusive)
- `num_workers` - the number of Spark executors used for fetching the data
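To make the inclusive date range concrete, the sketch below expands `start_date` and `end_date` into one directory per day, using the `<base_path>/YYYY/MM/DD` layout that appears in the `fetcher.paths` output in this notebook. It only illustrates the date arithmetic and is not `Fetcher`'s actual implementation; `daily_dirs` is a hypothetical name.

```python
from datetime import datetime, timedelta


def daily_dirs(base_path: str, start_date: str, end_date: str):
    """List one `<base_path>/YYYY/MM/DD` directory per day, endpoints included."""
    start = datetime.strptime(start_date, "%Y%m%d").date()
    end = datetime.strptime(end_date, "%Y%m%d").date()
    if end < start:
        raise ValueError("end_date precedes start_date")
    n_days = (end - start).days + 1  # +1 because the range is inclusive
    return [f"{base_path}/{(start + timedelta(days=i)):%Y/%m/%d}" for i in range(n_days)]


dirs = daily_dirs(
    "/apps/b_perso/hp/simplark/pretrainer/PersonalizedTopicsV2WithMetaOrganicPRecall",
    "20220801",
    "20220819",
)
```

For the values used in this notebook, the range covers 19 days, from `2022/08/01` through `2022/08/19`.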



#### Values

In [6]:
start_date = '20220801'
end_date = '20220819'

root_path = '/apps/b_perso/hp/simplark/pretrainer'
models = ['PersonalizedTopicsV2WithMetaOrganicPRecall','PersonalizedTopicsV2WithTopicMLR']
base_paths = [f'{root_path}/{m}' for m in models]

target_label = 'labelPurchase'

#### PIYI features

In [58]:
piyi_v5_features=[
  "BibowatchRelPosition",
  "RecallSourceBullseye",
  "RecallSourceTora",
  "TitleCosineSimilarityToShoppingcartCentroid",
  "FreqSameLeafCatIdInWatchBadge",
  "MaxViewedItemTitleJaccardBigrams",
  "NumSameRviInLastWeek",
  "AvgSameLeafRviPriceRatio",
  "ItemSalesOverImpPricePrior7DayDecayLogSmoothDomesticWebAndMobile",
  "ItemVariantSalesOverImpressions7DayDecayLogSmoothDomesticWebAndMobileV2",
  "MaxViewedItemTitleJaccard",
  "ItemTimeOnSiteV2",
  "ItemWatchesOverImp7DayDecayLogSmoothDomesticWebAndMobileV2",
  "PriceDiffMedianRecall",
  "FreqSameItemInWatchBadge",
  "RecallSourceBestMatch",
  "ItemSalesOverImpPricePrior7DayDecayLogSmoothInternationalWebAndMobileNorm",
  "PoissonNextEventProbSameItemInWatch",
  "FreqWatchPriceBellowItemPrice",
  "MaxSameLeafRvihPriceDiff",
  "MerchImpressionsDecayed",
  "PlImpressionsDecayed",
  "AvgSameLeafRviPriceDiff",
  "ItemSalesOverImpPricePrior7DayDecayLogSmoothDomesticWebAndMobileNorm",
  "BullseyeRelRVILeafCatMedianPriceDiffV2",
  "BullseyeAbsRVILeafCatMedianPriceDiffV2",
  "BullseyeRVILeafCatMedianPriceV2",
  "LeafCatRVICondition",
  "ItemConditionOrdinal",
  "ItemConditionNorm",
  "SameItemConditionInRvi"
]

## Inspecting sample of data

In [7]:
fetcher = Fetcher(base_paths[0], start_date, end_date, hdfs.HadoopFileSystem(), num_workers=128)

  """Entry point for launching an IPython kernel.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/apache/releases/hadoop-2.7.3.2.6.4.2.0.18/share/hadoop/common/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/apache/releases/hadoop-2.7.3.2.6.4.2.0.18/share/hadoop/httpfs/tomcat/webapps/webhdfs/WEB-INF/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/apache/releases/hadoop-2.7.3.2.6.4.2.0.18/share/hadoop/kms/tomcat/webapps/kms/WEB-INF/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
22/09/21 11:48:29 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/09/21 11:48:29 WARN shortcircuit.Domai

In [8]:
fetcher.paths[:5]

['/apps/b_perso/hp/simplark/pretrainer/PersonalizedTopicsV2WithMetaOrganicPRecall/2022/08/01/data/part-00000-21a60939-e7fe-497d-887e-e222a3c8e0c0.npy.gz',
 '/apps/b_perso/hp/simplark/pretrainer/PersonalizedTopicsV2WithMetaOrganicPRecall/2022/08/01/data/part-00001-21a60939-e7fe-497d-887e-e222a3c8e0c0.npy.gz',
 '/apps/b_perso/hp/simplark/pretrainer/PersonalizedTopicsV2WithMetaOrganicPRecall/2022/08/01/data/part-00002-21a60939-e7fe-497d-887e-e222a3c8e0c0.npy.gz',
 '/apps/b_perso/hp/simplark/pretrainer/PersonalizedTopicsV2WithMetaOrganicPRecall/2022/08/01/data/part-00003-21a60939-e7fe-497d-887e-e222a3c8e0c0.npy.gz',
 '/apps/b_perso/hp/simplark/pretrainer/PersonalizedTopicsV2WithMetaOrganicPRecall/2022/08/01/data/part-00004-21a60939-e7fe-497d-887e-e222a3c8e0c0.npy.gz']

In [9]:
len(fetcher.paths)

550

In [10]:
data = load_npy_path(fetcher.paths[0])

In [11]:
type(data)

numpy.ndarray

In [None]:
data.dtype.descr

In [13]:
len(data['features'].dtype.names)

1363

In [14]:
calc_active_features??

Signature: calc_active_features(paths)
Docstring: <no docstring>
Source:   
def calc_active_features(paths):
    data_list = [load_npy_path(p) for p in paths]
    feature_cols = [d[0] for d in data_list[0]['features'].dtype.descr if len(d) == 2 and d[1] == '<f4']
    data_pdfs =

In [15]:
active_feature_cols = calc_active_features(fetcher.paths[:2])
len(active_feature_cols)

667
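The visible part of `calc_active_features` selects feature columns by inspecting the structured dtype: it keeps fields whose `descr` entry has exactly two elements and the type string `'<f4'`. The rest of the function is truncated in the output above, so how the list is further reduced to 667 columns is not shown here. The toy example below, with made-up field names, illustrates only that dtype filter.

```python
import numpy as np

# Toy record layout: field names are invented, only the nesting mirrors the notebook.
features_dtype = np.dtype([
    ("price_ratio", "<f4"),      # scalar float32: kept
    ("view_count", "<i8"),       # integer: dropped by the '<f4' test
    ("embedding", "<f4", (3,)),  # sub-array: its descr entry has 3 elements, dropped
])
record_dtype = np.dtype([("meta", [("itemId", "<i8")]), ("features", features_dtype)])
toy = np.zeros(2, dtype=record_dtype)

# Same expression shape as in the visible source of calc_active_features.
float_cols = [d[0] for d in toy["features"].dtype.descr if len(d) == 2 and d[1] == "<f4"]
```

Here `float_cols` contains only `price_ratio`: the integer field fails the type test and the sub-array field fails the length test.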

In [None]:
numpy_data_to_pdf??

In [16]:
numpy_data_to_pdf(data, feature_cols=active_feature_cols)

Unnamed: 0_level_0,meta,meta,meta,meta,meta,meta,labels,labels,labels,features,features,features,features,features,features,features,features,features,features,features,features
Unnamed: 0_level_1,itemId,meid,userId,siteId,rank,category,labelClick,labelPurchase,labelCombined,NormItemViewCount7DayDecayDomesticWebAndMobile,...,MaxSameLeafRvihPriceDiff,FreqWatchPriceBellowItemPrice,MaxTransactionPriceRatio,UserLowPricePrpnstyDiff,TitleCosineSimilarityCentroidRvisInLastDay,AvgWatchPriceBidRatioBadge,AvgTransactionPriceRatio,NumSameRviLeafCatInLastTwoDay,TimeSinceAddedWatch,MaxWatchPriceBinRatioBadge
0,314073542053,b'0031e70828624a469b0500fd142ab59e',2392490943,3,0,b'63861',0,0,0,0.009248,...,1.657025,4.467643e-09,1.886038,-1.0,0.227002,1.427802,2.402736,1.0,25.906553,0.917736
1,333338270853,b'0031e70828624a469b0500fd142ab59e',2392490943,3,1,b'63861',0,0,0,0.151473,...,6.397025,2.680586e-09,2.936545,-1.0,0.279175,0.917026,1.543191,1.0,25.906553,1.428907
2,164781472281,b'0031e70828624a469b0500fd142ab59e',2392490943,3,2,b'63861',1,0,0,0.028934,...,5.837026,2.680586e-09,2.755237,-1.0,0.233450,0.977371,1.644741,1.0,25.906553,1.340684
3,394134436416,b'0031e70828624a469b0500fd142ab59e',2392490943,3,3,b'63861',0,0,0,0.025835,...,2.747025,3.574115e-09,2.055099,-1.0,0.249568,1.310345,2.205077,1.0,25.906553,1.000000
4,392881016823,b'0031e70828624a469b0500fd142ab59e',2392490943,3,4,b'63861',0,0,0,0.043820,...,6.157025,2.680586e-09,2.856000,-1.0,0.223298,0.942888,1.586713,1.0,25.906553,1.389714
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
67417,303160964134,b'fff2ffec53b0445fa6f36472ebdabde6',0,0,5,b'33712',0,0,0,0.000642,...,4.900002,-1.000000e+00,-1.000000,-1.0,0.183556,-1.000000,-1.000000,2.0,-1.000000,-1.000000
67418,304555384526,b'fff2ffec53b0445fa6f36472ebdabde6',0,0,6,b'33712',0,0,0,0.001063,...,6.990002,-1.000000e+00,-1.000000,-1.0,0.220267,-1.000000,-1.000000,2.0,-1.000000,-1.000000
67419,233745264868,b'fff2ffec53b0445fa6f36472ebdabde6',0,0,7,b'33712',0,0,0,0.000000,...,10.000000,-1.000000e+00,-1.000000,-1.0,0.380970,-1.000000,-1.000000,2.0,-1.000000,-1.000000
67420,195155559280,b'fff2ffec53b0445fa6f36472ebdabde6',0,0,8,b'33712',0,0,0,0.000000,...,26.000002,-1.000000e+00,-1.000000,-1.0,0.390130,-1.000000,-1.000000,2.0,-1.000000,-1.000000


In [18]:
label_extract_processor??

Signature:
label_extract_processor(
    data,
    target_label,
    feature_cols,
    meta_cols=['itemId', 'meid', 'userId', 'siteId', 'rank', 'category'],
    label_cols=['labelClick', 'labelPurchase', 'labelCombined'],
)
Docstring: <no docstring>
Source:   
def label_extract_processor(data, target_label, feature_cols, meta_cols =

In [17]:
label_extract_processor(data, target_label='labelPurchase', feature_cols=active_feature_cols)

Unnamed: 0_level_0,meta,meta,meta,meta,meta,meta,labels,labels,labels,features,features,features,features,features,features,features,features,features,features,features,features
Unnamed: 0_level_1,itemId,meid,userId,siteId,rank,category,labelPurchase,labelClick,labelCombined,NormItemViewCount7DayDecayDomesticWebAndMobile,...,MaxSameLeafRvihPriceDiff,FreqWatchPriceBellowItemPrice,MaxTransactionPriceRatio,UserLowPricePrpnstyDiff,TitleCosineSimilarityCentroidRvisInLastDay,AvgWatchPriceBidRatioBadge,AvgTransactionPriceRatio,NumSameRviLeafCatInLastTwoDay,TimeSinceAddedWatch,MaxWatchPriceBinRatioBadge
919,234506805927,b'4120b73a3201499fb48e4186efb83e4a',2434096930,0,0,b'43961',1,1,1,0.003148,...,0.000000,0.000003,-1.000000,-1.0,0.000000,2.6396,-1.000000,0.0,0.007478,-1.000000
920,195243638026,b'4120b73a3201499fb48e4186efb83e4a',2434096930,0,1,b'43961',0,0,0,0.002032,...,5.039997,0.000003,-1.000000,-1.0,0.000000,2.4380,-1.000000,0.0,0.007478,-1.000000
921,194965666729,b'4120b73a3201499fb48e4186efb83e4a',2434096930,0,2,b'43961',0,0,0,0.004533,...,1.040001,0.000003,-1.000000,-1.0,0.000000,2.5980,-1.000000,0.0,0.007478,-1.000000
922,324847563039,b'4120b73a3201499fb48e4186efb83e4a',2434096930,0,3,b'43961',0,0,0,0.001133,...,20.989998,0.000003,-1.000000,-1.0,0.000000,1.8000,-1.000000,0.0,0.007478,-1.000000
923,155086381712,b'4120b73a3201499fb48e4186efb83e4a',2434096930,0,4,b'43961',0,0,0,0.000855,...,36.989998,0.000003,-1.000000,-1.0,0.000000,1.1600,-1.000000,0.0,0.007478,-1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
65933,324803642979,b'96e7e87b78dd49a8aa197ea727d5cd36',2347710389,0,4,b'170098',0,0,0,0.000287,...,17.139999,-1.000000,20.681080,-1.0,0.429356,-1.0000,0.131846,7.0,-1.000000,11.351351
65934,224839234785,b'96e7e87b78dd49a8aa197ea727d5cd36',2347710389,0,5,b'170098',0,0,0,0.000455,...,16.500000,-1.000000,15.365461,-1.0,0.436646,-1.0000,0.177457,7.0,-1.000000,8.433735
65935,224678965763,b'96e7e87b78dd49a8aa197ea727d5cd36',2347710389,0,6,b'170098',0,0,0,0.001963,...,16.400000,-1.000000,14.772201,-1.0,0.461304,-1.0000,0.184584,7.0,-1.000000,8.108109
65936,224611318421,b'96e7e87b78dd49a8aa197ea727d5cd36',2347710389,0,7,b'170098',0,0,0,0.000597,...,16.500000,-1.000000,15.365461,-1.0,0.389225,-1.0000,0.177457,7.0,-1.000000,8.433735


## Fetching train data

In [19]:
base_paths

['/apps/b_perso/hp/simplark/pretrainer/PersonalizedTopicsV2WithMetaOrganicPRecall',
 '/apps/b_perso/hp/simplark/pretrainer/PersonalizedTopicsV2WithTopicMLR']

In [20]:
fetchers = [Fetcher(base_path, start_date, end_date, hdfs.HadoopFileSystem(), num_workers=128) for base_path in base_paths]

  """Entry point for launching an IPython kernel.


In [21]:
pdfs = [ft.fetch_pandas_df(spark, partial(label_extract_processor, target_label=target_label,feature_cols=active_feature_cols)) for ft  in fetchers]

                                                                                

In [22]:
pdf = pd.concat(pdfs, ignore_index=True)

In [23]:
pdf

Unnamed: 0_level_0,meta,meta,meta,meta,meta,meta,labels,labels,labels,features,features,features,features,features,features,features,features,features,features,features,features
Unnamed: 0_level_1,itemId,meid,userId,siteId,rank,category,labelCombined,labelClick,labelPurchase,NormItemViewCount7DayDecayDomesticWebAndMobile,...,MaxSameLeafRvihPriceDiff,FreqWatchPriceBellowItemPrice,MaxTransactionPriceRatio,UserLowPricePrpnstyDiff,TitleCosineSimilarityCentroidRvisInLastDay,AvgWatchPriceBidRatioBadge,AvgTransactionPriceRatio,NumSameRviLeafCatInLastTwoDay,TimeSinceAddedWatch,MaxWatchPriceBinRatioBadge
0,234506805927,b'4120b73a3201499fb48e4186efb83e4a',2434096930,0,0,b'43961',1,1,1,0.003148,...,0.000000,3.051753e-06,-1.000000,-1.0,0.000000,2.6396,-1.000000,0.0,0.007478,-1.000000
1,195243638026,b'4120b73a3201499fb48e4186efb83e4a',2434096930,0,1,b'43961',0,0,0,0.002032,...,5.039997,3.051753e-06,-1.000000,-1.0,0.000000,2.4380,-1.000000,0.0,0.007478,-1.000000
2,194965666729,b'4120b73a3201499fb48e4186efb83e4a',2434096930,0,2,b'43961',0,0,0,0.004533,...,1.040001,3.051753e-06,-1.000000,-1.0,0.000000,2.5980,-1.000000,0.0,0.007478,-1.000000
3,324847563039,b'4120b73a3201499fb48e4186efb83e4a',2434096930,0,3,b'43961',0,0,0,0.001133,...,20.989998,3.051753e-06,-1.000000,-1.0,0.000000,1.8000,-1.000000,0.0,0.007478,-1.000000
4,155086381712,b'4120b73a3201499fb48e4186efb83e4a',2434096930,0,4,b'43961',0,0,0,0.000855,...,36.989998,3.051753e-06,-1.000000,-1.0,0.000000,1.1600,-1.000000,0.0,0.007478,-1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
717358,175354598698,b'ded3590672d149bfba10c5eecf3f0683',671499032,77,5,b'31388',0,0,0,0.004635,...,111.409424,0.000000e+00,0.026271,-1.0,0.509175,-1.0000,69.554443,2.0,37.460117,0.020227
717359,125465238852,b'ded3590672d149bfba10c5eecf3f0683',671499032,77,6,b'31388',0,0,0,0.001070,...,-243.120544,3.089699e-10,0.018915,-1.0,0.406181,-1.0000,96.603050,2.0,37.460117,0.014563
717360,175385450683,b'ded3590672d149bfba10c5eecf3f0683',671499032,77,7,b'31388',0,0,0,0.000892,...,-191.460632,3.089699e-10,0.019720,-1.0,0.513200,-1.0000,92.661697,2.0,37.460117,0.015183
717361,165635082173,b'ded3590672d149bfba10c5eecf3f0683',671499032,77,8,b'31388',0,0,0,0.000353,...,-181.330627,3.089699e-10,0.019885,-1.0,0.449050,-1.0000,91.888832,2.0,37.460117,0.015311


In [24]:
pdf[('meta', 'meid')] = pdf.meta.meid.map(lambda v: v.decode())
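The cell above decodes the `meid` bytes into strings under a two-level column header. Running it a second time would fail, because `str` has no `decode` method. A minimal sketch on toy data, assuming only that the column holds `bytes` on the first run, that makes the step safe to re-run:

```python
import pandas as pd

# Toy frame with the same two-level column layout as `pdf` (values are made up).
toy = pd.DataFrame({
    ("meta", "meid"): [b"0031e708", b"fff2ffec"],
    ("meta", "rank"): [0, 1],
    ("labels", "labelPurchase"): [0, 1],
})

# A tuple key addresses one leaf column of the two-level header.
# The isinstance guard leaves already-decoded strings untouched.
toy[("meta", "meid")] = toy[("meta", "meid")].map(
    lambda v: v.decode() if isinstance(v, bytes) else v
)
```

Selecting with the tuple `("meta", "meid")` on both sides avoids attribute-style access, which can be ambiguous when a column name collides with a DataFrame method.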

In [26]:
pdf['features']

Unnamed: 0,NormItemViewCount7DayDecayDomesticWebAndMobile,ItemViewsOverImp7DayDecayLogSmoothDomesticWebAndMobileNorm,PlSellerSalesOverImpressions,ItemViewsOverImp7DayDecay,AtmometerPpm,ShippingIsPlus,ItemSaleCount7DayDecayDomesticWebAndMobile,ItemVariantWatchOverImpressions7DayDecayLogSmoothDomesticWebAndMobileNorm,ItemSaleCount7DayDecayAll,IsSellerMarketingOfferMessage,...,MaxSameLeafRvihPriceDiff,FreqWatchPriceBellowItemPrice,MaxTransactionPriceRatio,UserLowPricePrpnstyDiff,TitleCosineSimilarityCentroidRvisInLastDay,AvgWatchPriceBidRatioBadge,AvgTransactionPriceRatio,NumSameRviLeafCatInLastTwoDay,TimeSinceAddedWatch,MaxWatchPriceBinRatioBadge
0,0.003148,0.541955,-3.755054,0.008665,0.000000,0.0,2.35,0.643256,1.01,0.0,...,0.000000,3.051753e-06,-1.000000,-1.0,0.000000,2.6396,-1.000000,0.0,0.007478,-1.000000
1,0.002032,0.510849,-3.516945,0.003504,0.000000,0.0,1.48,0.683541,0.00,0.0,...,5.039997,3.051753e-06,-1.000000,-1.0,0.000000,2.4380,-1.000000,0.0,0.007478,-1.000000
2,0.004533,0.500786,-3.516945,0.007544,0.000499,0.0,3.97,0.669890,1.49,0.0,...,1.040001,3.051753e-06,-1.000000,-1.0,0.000000,2.5980,-1.000000,0.0,0.007478,-1.000000
3,0.001133,0.585978,-3.746583,0.021535,0.000000,0.0,0.77,0.559003,0.00,0.0,...,20.989998,3.051753e-06,-1.000000,-1.0,0.000000,1.8000,-1.000000,0.0,0.007478,-1.000000
4,0.000855,0.421043,-4.174220,0.002273,0.000000,0.0,0.00,0.578653,0.00,0.0,...,36.989998,3.051753e-06,-1.000000,-1.0,0.000000,1.1600,-1.000000,0.0,0.007478,-1.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
717358,0.004635,0.505994,-3.473964,0.006382,0.000000,0.0,0.00,0.559738,0.00,0.0,...,111.409424,0.000000e+00,0.026271,-1.0,0.509175,-1.0000,69.554443,2.0,37.460117,0.020227
717359,0.001070,0.351522,-3.473964,0.005162,0.000000,0.0,0.00,0.593666,0.00,0.0,...,-243.120544,3.089699e-10,0.018915,-1.0,0.406181,-1.0000,96.603050,2.0,37.460117,0.014563
717360,0.000892,0.306725,-3.473964,0.004711,0.000000,0.0,0.00,0.373950,0.00,0.0,...,-191.460632,3.089699e-10,0.019720,-1.0,0.513200,-1.0000,92.661697,2.0,37.460117,0.015183
717361,0.000353,0.448197,-3.473964,0.015793,0.000000,0.0,0.00,0.469752,0.00,0.0,...,-181.330627,3.089699e-10,0.019885,-1.0,0.449050,-1.0000,91.888832,2.0,37.460117,0.015311


## Prepare train/test data

#### All non-leaking feature cols

In [47]:
active_cols = pdf.features.columns
exclusion_cols = ['PurchasedQuantity', 
                  'InvertedRank', 
                  'InvertedRankV2',
                  'FinalScore', 
                  'LegacyMlrScoreLogStd', 
                  'LegacyMlrScoreLogMax', 
                  'MlrModelScore', 
                  'LegacyMlrScoreLogMeanInteractMax', 
                  'LegacyMlrScoreLogMean',
                  'LegacyMlrScoreLogMeanInteractMedian',
                  'LegacyMlrScoreLogMedian',
                  'LegacyMlrScoreLogMin'
                 ]
candidate_cols = list(set(active_cols) - set(exclusion_cols))
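
Iterating over a `set` of strings does not give a stable order across Python processes, so `list(set(...) - set(...))` can return the feature columns in a different order on each run. A minimal sketch of an order-preserving alternative follows; the helper name `drop_excluded` is hypothetical, and the toy list only reuses a few column names from this notebook.

```python
def drop_excluded(columns, excluded):
    """Drop excluded names while keeping the original column order."""
    excluded = set(excluded)  # set membership checks are O(1)
    return [c for c in columns if c not in excluded]


# Toy example: hypothetical subset of columns, not the real feature list.
active = ["RecallSize", "FinalScore", "BibowatchRelPosition", "MlrModelScore"]
leaking = ["FinalScore", "MlrModelScore"]
kept = drop_excluded(active, leaking)
print(kept)
```

With the real data this would be called as `drop_excluded(pdf.features.columns, exclusion_cols)`.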

#### Input for training

In [6]:
pretrainer_train_test_split??

Signature: pretrainer_train_test_split(pdf, test_size, random_stats=7)
Docstring: <no docstring>
Source:
def pretrainer_train_test_split(pdf, test_size, random_stats=7):
    partition_keys = np.unique(pdf['meta']['meid'])
    train_keys, valid_keys = train_test_split(partition_keys, test_size=test_size, random_state=7)
    train_set, ...

In [27]:
train, valid = pretrainer_train_test_split(pdf, test_size=0.1)
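
The source printed by `??` splits on unique `meid` values rather than on rows, so all rows of one `meid` land on the same side of the split. The rest of that source is cut off, so the following is only a sketch of the same idea, not the actual implementation. It uses a flat toy frame instead of the notebook's multi-level columns, a NumPy permutation instead of `train_test_split`, and a hypothetical helper name `split_by_key`.

```python
import numpy as np
import pandas as pd


def split_by_key(df, key, test_size, seed=7):
    """Split rows so that every key value lands entirely in one side."""
    keys = df[key].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(keys)
    n_valid = max(1, int(round(len(keys) * test_size)))
    valid_keys = set(shuffled[:n_valid])
    is_valid = df[key].isin(valid_keys)
    # Boolean masks keep the original row order, so rows that were
    # adjacent for one key stay adjacent after the split.
    return df[~is_valid], df[is_valid]


# Toy data: four groups of two rows each (values are made up).
toy = pd.DataFrame({
    "meid": ["a", "a", "b", "b", "c", "c", "d", "d"],
    "labelPurchase": [0, 1, 0, 0, 1, 0, 0, 1],
})
train_toy, valid_toy = split_by_key(toy, "meid", test_size=0.25)
print(sorted(valid_toy["meid"].unique()))
```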

In [28]:
valid.head(2)

Unnamed: 0_level_0,meta,meta,meta,meta,meta,meta,labels,labels,labels,features,features,features,features,features,features,features,features,features,features,features,features
Unnamed: 0_level_1,itemId,meid,userId,siteId,rank,category,labelCombined,labelClick,labelPurchase,NormItemViewCount7DayDecayDomesticWebAndMobile,...,MaxSameLeafRvihPriceDiff,FreqWatchPriceBellowItemPrice,MaxTransactionPriceRatio,UserLowPricePrpnstyDiff,TitleCosineSimilarityCentroidRvisInLastDay,AvgWatchPriceBidRatioBadge,AvgTransactionPriceRatio,NumSameRviLeafCatInLastTwoDay,TimeSinceAddedWatch,MaxWatchPriceBinRatioBadge
39,313901486252,75f96c6bc16e4d94aa4a7d2b759c2fce,2411690625,3,0,b'38659',0,0,0,0.010951,...,2.272552,0.0,1.981707,-1.0,-1.0,-1.0,1.550552,0.0,46.884651,549.032043
40,154213799196,75f96c6bc16e4d94aa4a7d2b759c2fce,2411690625,3,1,b'38659',1,1,1,0.002059,...,-1.377449,0.0,1.783265,-1.0,-1.0,-1.0,1.723099,0.0,46.884651,494.053497


In [29]:
candidate_cols = active_feature_cols

In [33]:
y_train = train.labels.labelPurchase
y_valid = valid.labels.labelPurchase

In [None]:
X_train = train.features[candidate_cols]
X_valid = valid.features[candidate_cols]

# IMPORTANT: train and valid must be ordered by meid for the group calculations
group_train = calc_groups(train)
group_valid = calc_groups(valid)
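
`XGBRanker.fit` takes `group` as a list of group sizes, which it applies to consecutive rows; that is why the rows of each `meid` have to be adjacent. The source of `calc_groups` is not shown here, so the sketch below is only an assumption about what such a helper might do. The name `group_sizes` is hypothetical, and the contiguity check is an addition for illustration.

```python
from typing import List

import pandas as pd


def group_sizes(query_ids: pd.Series) -> List[int]:
    """Return the sizes of consecutive runs of equal query ids."""
    # A new run starts wherever the id differs from the previous row.
    run_id = (query_ids != query_ids.shift()).cumsum()
    sizes = query_ids.groupby(run_id).size().tolist()
    # If an id shows up in more than one run, its rows are not adjacent.
    if len(sizes) != query_ids.nunique():
        raise ValueError("rows of the same query id are not contiguous")
    return sizes


# Toy ids (made up): three groups of sizes 2, 1 and 3.
meid = pd.Series(["a", "a", "b", "c", "c", "c"])
sizes = group_sizes(meid)
print(sizes)
```

The sizes sum to the number of rows, which is a cheap sanity check before passing them to `fit`.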

In [49]:
eval_set = [(X_train, y_train), (X_valid, y_valid)]
eval_group = [group_train, group_valid]
eval_metric = ['map', 'ndcg@10-']
eval_result = {}
# cbs = [RecordEval()]
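
In `'ndcg@10-'`, `@10` truncates the metric to the top 10 positions, and, according to the XGBoost parameter documentation, the trailing `-` makes a list without any positive sample score 0 instead of 1. The sketch below computes a textbook NDCG@k for a single query to illustrate both points. It is not claimed to reproduce XGBoost's numbers exactly, and the function name and the `empty_value` parameter are hypothetical.

```python
import numpy as np


def ndcg_at_k(labels_in_ranked_order, k=10, empty_value=0.0):
    """NDCG@k for one query; labels are listed in the model's ranking order."""
    labels = np.asarray(labels_in_ranked_order, dtype=float)
    gains = 2.0 ** labels - 1.0                     # exponential gain
    discounts = 1.0 / np.log2(np.arange(2, len(labels) + 2))
    dcg = np.sum((gains * discounts)[:k])
    ideal_gains = np.sort(gains)[::-1]              # best possible ordering
    idcg = np.sum((ideal_gains * discounts)[:k])
    if idcg == 0:
        # No positive label in the list: mimic the '-' suffix with 0.0.
        return empty_value
    return dcg / idcg


best = ndcg_at_k([1, 0, 0])      # positive item ranked first
second = ndcg_at_k([0, 1, 0])    # positive item ranked second
empty = ndcg_at_k([0, 0, 0])     # no positive item at all
print(best, second, empty)
```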

## Simple XGB fit

In [50]:
params = {
    'objective': "rank:pairwise",
    'nthread': -1
}
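
The `rank:pairwise` objective trains on pairs of items from the same group, pushing the more relevant item above the less relevant one. As an illustration only, here is one common form of pairwise loss (a RankNet-style logistic loss on score differences) for a single group. It is a sketch under that assumption, not a description of XGBoost's internals, and `pairwise_logistic_loss` is a hypothetical name.

```python
import numpy as np


def pairwise_logistic_loss(scores, labels):
    """Mean logistic loss over all (better, worse) pairs in one group."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    # better[i, j] is True when item i is more relevant than item j.
    better = labels[:, None] > labels[None, :]
    if not better.any():
        return 0.0  # no informative pair in this group
    margins = scores[:, None] - scores[None, :]
    # log(1 + exp(-margin)), computed stably via logaddexp.
    losses = np.logaddexp(0.0, -margins[better])
    return float(losses.mean())


# Toy scores (made up): the purchased item is first in the label list.
good = pairwise_logistic_loss([2.0, 0.0, -1.0], [1, 0, 0])   # ranked on top
bad = pairwise_logistic_loss([-1.0, 0.0, 2.0], [1, 0, 0])    # ranked last
print(good < bad)
```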

In [51]:
ranker = xgb.XGBRanker(tree_method='gpu_hist', **params)

In [None]:
model = ranker.fit(
    X_train,
    y_train,
    group=group_train,
    eval_set=eval_set,
    eval_group=eval_group,
    eval_metric=eval_metric,
    early_stopping_rounds=50,
    # callbacks=cbs,
    verbose=True,
)

In [53]:
top_weight_features = calc_feature_imp(model, imp_type='weight')
top_weight_features.reset_index(drop=True)[:30]

Unnamed: 0,feature,score
0,BullseyeRelRVILeafCatMedianPriceDiffV2,188.0
1,MaxCartItemTitleJaccard,93.0
2,BibowatchRelPosition,93.0
3,RecallSize,92.0
4,PriceDiffMedianRecall,91.0
5,TitleCosineSimilarityToShoppingcartCentroid,86.0
6,FreqSameLeafCatIdInWatchBadge,84.0
7,ItemWatchOverImpLogSmoothAllNorm,84.0
8,SeedItemTimeLeftSec,75.0
9,ItemTimeOnSiteV2Std,67.0
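
`Booster.get_score(importance_type=...)` returns a plain dict that maps feature name to score. The source of `calc_feature_imp` is not shown here, so the following is only a sketch of how such a dict could be turned into a sorted table like the one above. The helper name `importance_frame` is hypothetical, and the dict reuses three rows from the output above.

```python
import pandas as pd


def importance_frame(score_by_feature, top_n=30):
    """Sort a feature -> score dict by descending score and keep top_n rows."""
    ordered = sorted(score_by_feature.items(), key=lambda kv: kv[1], reverse=True)
    frame = pd.DataFrame(ordered, columns=["feature", "score"])
    return frame.head(top_n)


# Three rows copied from the 'weight' table above, in shuffled order.
scores = {
    "RecallSize": 92.0,
    "BullseyeRelRVILeafCatMedianPriceDiffV2": 188.0,
    "SeedItemTimeLeftSec": 75.0,
}
top = importance_frame(scores, top_n=2)
print(top)
```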


In [54]:
top_gain_features = calc_feature_imp(model, imp_type='gain')
top_gain_features.reset_index(drop=True)[:30]

Unnamed: 0,feature,score
0,BibowatchRelPosition,2255.786865
1,RecallSourceBullseye,1089.329468
2,TimeDiffFromLastRvi,1044.473633
3,MaxViewedItemTitleJaccardBigrams,805.235596
4,IsItemAuctionPure,633.687561
5,AvgSameItemWatchPriceBidDiffBadge,507.469971
6,RecallSourceTora,477.740265
7,ItemVariantSalesOverImpressions7DayDecayLogSmo...,468.413025
8,TitleCosineSimilarityToShoppingcartCentroid,413.959442
9,MaxSameItemWatchPriceRatio,281.034485


In [55]:
model_vs_prods_ranks(model, valid, candidate_cols, dmatrix=False)

Unnamed: 0,clicks,prod_clicks,purchases,prod_purchases
count,7471.0,7471.0,7471.0,7471.0
mean,2.266899,2.541695,2.533128,2.998394
std,1.97636,2.124025,2.187085,2.433809
min,1.0,1.0,1.0,1.0
25%,1.0,1.0,1.0,1.0
50%,1.0,2.0,1.0,2.0
75%,3.0,3.0,3.0,4.0
max,10.0,10.0,10.0,10.0
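
The values in the table above range from 1 to 10, which is consistent with per-query rank positions of the clicked or purchased item under the model versus production. The source of `model_vs_prods_ranks` is not shown here, so this reading is an assumption. Under that assumption, the per-query rank could be computed as in the sketch below; `first_positive_rank` and the toy frame are hypothetical.

```python
import pandas as pd


def first_positive_rank(df, group_col, score_col, label_col):
    """Per group, the best rank of a positive row when sorted by score descending."""
    ranks = df.groupby(group_col)[score_col].rank(method="first", ascending=False)
    positives = df[label_col] > 0
    # Smallest rank among the positive rows of each group.
    return ranks[positives].groupby(df.loc[positives, group_col]).min()


# Toy data (made up): two queries with one clicked item each.
toy_scores = pd.DataFrame({
    "meid": ["a", "a", "a", "b", "b"],
    "score": [0.2, 0.9, 0.5, 0.7, 0.1],
    "labelClick": [0, 0, 1, 1, 0],
})
click_ranks = first_positive_rank(toy_scores, "meid", "score", "labelClick")
print(click_ranks.describe())
```

Calling `describe()` on the resulting series gives the same kind of count/mean/quantile summary as the table; lower mean rank is better.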


## Eval saved models

In [56]:
m = xgb.Booster()
m.load_model('./simplex_piyi_v5.bst')

In [None]:
model_vs_prods_ranks(m, valid, piyi_v5_features)