<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

In [1]:
# !cd .. && pip install -e .

# DKN : Deep Knowledge-Aware Network for News Recommendation

DKN \[1\] is a deep learning model that incorporates information from a knowledge graph for better news recommendation. Specifically, DKN uses the TransX \[2\] method for knowledge graph representation learning, then applies a CNN framework, named KCNN, to combine entity embeddings with word embeddings and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer.

## Properties of DKN:

- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering.
- It makes use of knowledge entities and common sense in news content via joint learning from semantic-level and knowledge-level representations of news articles.
- DKN uses an attention module to dynamically calculate a user's aggregated historical representation.


## Data format

DKN takes several files as input as follows:

- **training / validation / test files**: each line in these files represents one instance. `impressionid` is used to evaluate performance within an impression session, so it is only needed for evaluation; you can set it to 0 for training data. The format is: <br>
`[label] [userid] [CandidateNews]%[impressionid] `<br>
e.g., `1 train_U1 N1%0` <br>

- **user history file**: each line in this file represents a user's click history. Set the `history_size` parameter in the config file to the maximum number of clicks to use per user. If a user's click history is longer than `history_size`, only the last `history_size` clicks are kept automatically; if it is shorter, it is automatically padded with 0. The format is: <br>
`[Userid] [newsid1,newsid2...]`<br>
e.g., `train_U1 N1,N2` <br>

- **document feature file**: it contains the word and entity features of news articles. News articles are represented by aligned title words and title entities. As a quick example, for a news title such as <i>"Trump to deliver State of the Union address next week"</i>, the title words value may be `CandidateNews:34,45,334,23,12,987,3456,111,456,432` and the title entity value may be `entity:45,0,0,0,0,0,0,0,0,0`. Only the first value of the entity vector is non-zero, due to the word "Trump". The title values and entity values are hashed from 1 to `n` (where `n` is the number of distinct words or entities). Each feature length should be fixed at k (the `doc_size` parameter): if a document has more than k words, truncate it to k words; if it has fewer, pad with 0 at the end.
The format is: <br>
`[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]`

- **word embedding / entity embedding / context embedding files**: these are `*.npy` files of pretrained embeddings. After loading, each file is a two-dimensional `[n+1, k]` matrix, where n is the number of words (or entities) in the hash dictionary and k is the embedding dimension. Row 0 is reserved for zero padding.
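The instance and history formats above can be illustrated with a short sketch. The helper names `parse_instance` and `fit_history` are hypothetical, and padding on the right is an assumption (the text only says that short histories are padded with 0); this is not the iterator code used by the library.

```python
from typing import List, Tuple


def parse_instance(line: str) -> Tuple[int, str, str, str]:
    """Split '[label] [userid] [CandidateNews]%[impressionid]' into its parts."""
    label, user_id, news_and_impression = line.split()
    news_id, impression_id = news_and_impression.split("%")
    return int(label), user_id, news_id, impression_id


def fit_history(clicked: List[str], history_size: int, pad: str = "0") -> List[str]:
    """Keep the last `history_size` clicks; pad with `pad` when there are fewer.

    Assumes history_size >= 1. Padding on the right is an assumption.
    """
    recent = clicked[-history_size:]
    return recent + [pad] * (history_size - len(recent))


instance = parse_instance("1 train_U1 N1%0")
short_history = fit_history(["N1", "N2"], history_size=4)
long_history = fit_history(["N1", "N2", "N3"], history_size=2)
```

Here `instance` holds the label, user id, candidate news id and impression id of the example line, `short_history` is padded up to four entries, and `long_history` keeps only the two most recent clicks.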

In this experiment, we used GloVe\[4\] vectors to initialize the word embedding. We trained the entity embedding using TransE\[2\] on the knowledge graph, and the context embedding of an entity is the average of the embeddings of its neighbors in the knowledge graph.
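As a toy illustration of the context embedding (the mean of the neighbors' entity embeddings), consider the sketch below. The entity names, the `neighbors` mapping and the 4-dimensional vectors are made up for the example, and falling back to a zero vector for entities without known neighbors is an assumption.

```python
import numpy as np

# Hypothetical toy data: 4-dimensional entity embeddings and a neighbor list
entity_vectors = {
    "A": np.array([1.0, 0.0, 0.0, 0.0]),
    "B": np.array([0.0, 1.0, 0.0, 0.0]),
    "C": np.array([0.0, 0.0, 1.0, 0.0]),
}
neighbors = {"A": ["B", "C"], "B": ["A"], "C": []}


def context_vector(entity: str, dim: int = 4) -> np.ndarray:
    """Average the embeddings of an entity's neighbors (zeros if none are known)."""
    known = [entity_vectors[n] for n in neighbors.get(entity, []) if n in entity_vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


context_a = context_vector("A")  # mean of the vectors of B and C
context_c = context_vector("C")  # no neighbors, so the zero vector
```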

## MIND dataset

The MIND dataset\[3\] is a large-scale English news dataset. It was collected from anonymized behavior logs of the Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression.

[MIND-small](https://azure.microsoft.com/en-us/services/open-datasets/catalog/microsoft-news-dataset/) is a smaller version of the MIND dataset, created by randomly sampling 50,000 users and their behavior logs from MIND.

The dataset contains these files for both training and validation data:

#### behaviors.tsv

The behaviors.tsv file contains the impression logs and users' news click histories. It has 5 columns divided by the tab symbol:

+ Impression ID. The ID of an impression.
+ User ID. The anonymous ID of a user.
+ Time. The impression time with format "MM/DD/YYYY HH:MM:SS AM/PM".
+ History. The news click history (ID list of clicked news) of this user before this impression.
+ Impressions. List of news displayed in this impression and user's click behaviors on them (1 for click and 0 for non-click).

One simple example:

`1    U82271    11/11/2019 3:28:58 PM    N3130 N11621 N12917 N4574 N12140 N9748    N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0 `
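A minimal sketch of reading one such row is shown below. `parse_behavior` is a hypothetical helper, not part of the dataset tooling, and the sample row is a shortened version of the example above with only three candidates kept.

```python
from typing import Dict, List, Tuple


def parse_behavior(line: str) -> Dict[str, object]:
    """Split one tab-separated behaviors.tsv row into typed fields."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    candidates: List[Tuple[str, int]] = []
    for item in impressions.split():
        # Each candidate looks like "N15368-1": news id, dash, click label
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time,
        "history": history.split(),
        "impressions": candidates,
    }


# Shortened version of the example row (columns separated by tabs)
row = "1\tU82271\t11/11/2019 3:28:58 PM\tN3130 N11621\tN13390-0 N15368-1 N481-0"
parsed = parse_behavior(row)
clicked = [news for news, label in parsed["impressions"] if label == 1]
```

`clicked` then contains the ids of the news the user clicked in this impression.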

#### news.tsv

The news.tsv file contains the detailed information of news articles involved in the behaviors.tsv file. It has 8 columns, which are divided by the tab symbol:

+ News ID
+ Category
+ SubCategory
+ Title
+ Abstract
+ URL
+ Title Entities (entities contained in the title of this news)
+ Abstract Entities (entities contained in the abstract of this news)

One simple example:

`N46466    lifestyle    lifestyleroyals    The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By    Shop the notebooks, jackets, and more that the royals can't live without.    https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata    [{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]    [] `
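The two entity columns hold JSON lists, so they can be decoded with the standard library. The sketch below extracts the Wikidata ids from one such column; `wikidata_ids` and its `min_confidence` parameter are hypothetical names introduced for illustration, and the sample value is one entry taken from the example row above.

```python
import json
from typing import List

# One entity entry copied from the example row above
title_entities_column = (
    '[{"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", '
    '"Confidence": 0.97, "OccurrenceOffsets": [11], '
    '"SurfaceForms": ["Queen Elizabeth"]}]'
)


def wikidata_ids(column: str, min_confidence: float = 0.0) -> List[str]:
    """Return the WikidataId of every entity at or above `min_confidence`."""
    return [
        entity["WikidataId"]
        for entity in json.loads(column)
        if entity["Confidence"] >= min_confidence
    ]


ids = wikidata_ids(title_entities_column)
strict_ids = wikidata_ids(title_entities_column, min_confidence=0.99)
no_ids = wikidata_ids("[]")  # an empty column, as in the Abstract Entities example
```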

#### entity_embedding.vec & relation_embedding.vec

The entity_embedding.vec and relation_embedding.vec files contain the 100-dimensional embeddings of the entities and relations learned from the subgraph (from WikiData knowledge graph) by TransE method. In both files, the first column is the ID of entity/relation, and the other columns are the embedding vector values.

One simple example:

`Q42306013  0.014516 -0.106958 0.024590 ... -0.080382`
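A small sketch of turning such a line into an id and a vector follows. `parse_vec_line` is a hypothetical helper, splitting on any whitespace is an assumption about the separator, and the sample line is shortened to three values rather than the full 100.

```python
from typing import Tuple

import numpy as np


def parse_vec_line(line: str) -> Tuple[str, np.ndarray]:
    """Split '<id> v1 v2 ...' into the id and a float32 vector."""
    entity_id, *values = line.split()
    return entity_id, np.array([float(v) for v in values], dtype=np.float32)


# Shortened sample: only the first three values of the example line
entity_id, vector = parse_vec_line("Q42306013\t0.014516\t-0.106958\t0.024590")
```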


## DKN architecture

The following figure shows the architecture of DKN.

![](https://recodatasets.z20.web.core.windows.net/images/dkn_architecture.png)

DKN takes one piece of candidate news and one piece of a user’s clicked news as input. For each piece of news, a specially designed KCNN is used to process its title and generate an embedding vector. KCNN is an extension of traditional CNN that allows flexibility in incorporating symbolic knowledge from a knowledge graph into sentence representation learning.

With the KCNN, we obtain a set of embedding vectors for a user’s clicked history. To get the final embedding of the user with respect to the current candidate news, we use an attention-based method to automatically match the candidate news to each piece of the user’s clicked news, and aggregate the user’s historical interests with different weights. The candidate news embedding and the user embedding are concatenated and fed into a deep neural network (DNN) to calculate the predicted probability that the user will click the candidate news.
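The attention-based aggregation can be sketched in NumPy as follows. This is a simplified illustration, not the DKN implementation: the scoring function passed in as `score_fn` is a placeholder (a plain dot product here) standing in for the attention network, and all names and vectors are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    shifted = np.exp(x - np.max(x))
    return shifted / shifted.sum()


def aggregate_user(candidate: np.ndarray, clicked: np.ndarray, score_fn):
    """Weight each clicked-news embedding by its match with the candidate.

    candidate: shape (d,), clicked: shape (n, d).
    Returns the weighted sum of clicked embeddings and the attention weights.
    """
    scores = np.array([score_fn(np.concatenate([c, candidate])) for c in clicked])
    weights = softmax(scores)
    return weights @ clicked, weights


d = 2
# Placeholder scorer: dot product between the two halves of the concatenation
dot_score = lambda pair: float(pair[:d] @ pair[d:])

clicked_news = np.array([[1.0, 0.0], [0.0, 1.0]])
candidate_news = np.array([1.0, 0.0])
user_vector, attention = aggregate_user(candidate_news, clicked_news, dot_score)

# Input to the final click-prediction network: candidate and user embeddings
dnn_input = np.concatenate([candidate_news, user_vector])
```

The clicked article that matches the candidate more closely receives the larger attention weight, so it dominates `user_vector`.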

## Global settings and imports

In [4]:
%load_ext autoreload
%autoreload 2

In [5]:
import os
import sys
import logging
from pathlib import Path
import zipfile
from requests import HTTPError, get
from time import time

import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)
LOG.setLevel(logging.INFO)

# logging.basicConfig(level=logging.INFO)

import numpy as np
import polars as pl

from recommenders.datasets.download_utils import maybe_download
from recommenders.models.deeprec.deeprec_utils import prepare_hparams
from recommenders.models.deeprec.models.dkn import DKN
from recommenders.models.deeprec.io.dkn_iterator import DKNTextIterator

from gensim.models import KeyedVectors
from nltk import RegexpTokenizer
import dacy

from group_33.test import calculate_rankings

print(f"System version: {sys.version}")
print(f"Tensorflow version: {tf.__version__}")

2024-07-04 20:39:12.937172: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-04 20:39:12.978896: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-04 20:39:13.174518: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-04 20:39:13.174613: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-04 20:39:13.210560: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to

System version: 3.11.7 (main, Jan 22 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
Tensorflow version: 2.15.1


In [31]:
# DKN parameters
epochs = 10
history_size = 50
batch_size = 1000

DATASET_NAME = "small" # one of: demo, small, large

# prepare tmp dir
tmp_path = Path("..", "tmp", "dkn")
tmp_data_path = tmp_path / DATASET_NAME
(tmp_data_path / "validation").mkdir(exist_ok=True, parents=True)
(tmp_data_path / "train").mkdir(exist_ok=True, parents=True)
(tmp_data_path / "evaluation").mkdir(exist_ok=True, parents=True)

tmp_test_path = tmp_path / "test"
tmp_test_path.mkdir(exist_ok=True, parents=True)

# train & validation & evaluation
data_path = Path("..", "downloads", DATASET_NAME)
train_file = tmp_data_path / "train" / "behaviours.txt"
valid_file = tmp_data_path / "validation" / "behaviours.txt"
evaluation_file = tmp_data_path / "evaluation" / "behaviors.txt"
user_history_file = tmp_data_path / "user_history.txt"
articles_file = data_path / "articles.parquet"
articles_tokenized_file = tmp_data_path / "articles_tokenized.parquet"
word_embeddings_file = tmp_data_path / "word_embeddings.npy"
entity_embeddings_file = tmp_data_path / "entity_embeddings.npy"
context_embeddings_file = tmp_data_path / "context_embeddings.npy"
news_feature_file = tmp_data_path / "news_feature.txt"
infer_embedding_file = tmp_data_path / "infer_embedding.txt"

# test
test_raw_file = data_path / ".." / "ebnerd_testset" / "test" / "behaviors.parquet"
test_file = tmp_test_path / "behavior.txt"
test_articles_file = tmp_test_path / "articles.parquet"
test_articles_tokenized_file = tmp_test_path / "articles_tokenized.parquet"

# prediction
indexed_behaviors_file = tmp_data_path / "indexed_behaviors.parquet"
scores_file = tmp_data_path / "scores.txt"
predictions_file = tmp_data_path / "predictions.txt"

LOG.info(data_path)
LOG.info(tmp_path)

pl.Config.set_tbl_rows(100)

INFO:__main__:../downloads/demo
INFO:__main__:../tmp/dkn


polars.config.Config

## Data preparation

In this example, we walk through a real case of applying DKN to a raw news dataset from the very beginning. The cells below read the raw behaviors, history and articles data from local parquet files under `../downloads` (including the `ebnerd_testset` directory) and process them into DKN's input data format, which is described above.

In [5]:
def transform_behaviors(behaviors_raw: pl.LazyFrame, skip_impression=False):
    LOG.info(f"Starting transform_behaviors")
    df = (behaviors_raw
        .select('article_ids_inview', 'article_ids_clicked', 'impression_id', 'user_id')
        .with_columns(
            pl.struct(['article_ids_inview', 'article_ids_clicked'])
                .map_elements(lambda x: [(article, 1) if article in x['article_ids_clicked'] else (article, 0) for article in set(x['article_ids_inview'])])
                .alias('inview_label_combined'))
        .select('impression_id', 'user_id', 'inview_label_combined')
        .explode(['inview_label_combined']) # workaround, got error when exploding both columns directly
        .with_columns(pl.col('inview_label_combined').list[0].alias('article_id'))
        .with_columns(pl.col('inview_label_combined').list[1].alias('label'))
        .select('impression_id', 'user_id', 'article_id', 'label')
    )

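    if skip_impression:
        article_impression = pl.col('article_id').alias('article_impression')
    else:
        # DKN expects the candidate as "[CandidateNews]%[impressionid]"
        article_impression = (
            pl.struct(['article_id', 'impression_id'])
                .map_elements(lambda x: f"{x['article_id']}%{x['impression_id']}")
                .alias('article_impression'))

    # Return the same three columns, in DKN's column order, for both branches
    return df.with_columns(article_impression).select('label', 'user_id', 'article_impression')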
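To make the output of `transform_behaviors` concrete, here is a plain-Python sketch of the same idea for a single impression. `impression_to_lines` is a hypothetical helper; unlike the cell above, which deduplicates with `set`, it uses `dict.fromkeys` so that the order of the in-view articles is preserved.

```python
from typing import Iterable, List


def impression_to_lines(impression_id, user_id, inview: Iterable, clicked: Iterable) -> List[str]:
    """Emit one '[label] [userid] [CandidateNews]%[impressionid]' line per in-view article."""
    clicked_set = set(clicked)
    return [
        f"{int(article in clicked_set)} {user_id} {article}%{impression_id}"
        for article in dict.fromkeys(inview)  # deduplicate, keep first-seen order
    ]


lines = impression_to_lines(7, "U1", inview=[10, 11, 10], clicked=[11])
```

Each in-view article becomes one training instance, labelled 1 if it was clicked and 0 otherwise.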

In [6]:
def transform_history(*input_files):
    LOG.info(f"Starting transform_history for input files: {input_files}")
    df = (
        pl.concat([pl.scan_parquet(file) for file in input_files])
        .select('user_id', 'article_id_fixed')
        .with_columns(pl
            .col('article_id_fixed')
            .map_elements(lambda ids: ','.join(map(str, ids)))
            .alias('article_id_fixed'))
    )
    return df

In [83]:
nlp = dacy.load("large")

def tokenize_articles(articles_file: str, tokenized_articles_file: str):
    tokenizer = RegexpTokenizer(r"\w+")
    articles = (pl.scan_parquet(articles_file)
        .select("article_id", "title")
        .with_columns(pl.col("title").str.to_lowercase().alias("title"))
        .with_columns(pl.col("title")
            .map_elements(lambda title: tokenizer.tokenize(title))
            .alias("word_tokens")
        )
        .with_columns(pl.col("title")
            .map_elements(lambda title: title or " ")
            .map_batches(lambda titles: pl.Series(nlp.pipe(titles)))
            .map_elements(lambda doc: [{"id": ent.kb_id_, "start": ent.start, "end": ent.end} for ent in doc.ents if ent.kb_id_ != 'NIL'])
            .alias("entities")
        )
        .drop("title")
    )

    articles.collect(streaming=True).write_parquet(tokenized_articles_file)


In [6]:
entity_embeddings = {}

def get_entity_embedding(entity: str):
    if entity not in entity_embeddings:
        try:
            response = get(f'https://wembedder.toolforge.org/api/vector/{entity}')
            response.raise_for_status()

            entity_embeddings[entity] = np.array(response.json()['vector'])
        except HTTPError as e:
            if e.response.status_code != 404:
                raise e
            
            entity_embeddings[entity] = None
    
    return entity_embeddings[entity]

In [9]:
context_embeddings = {}

def get_context_embedding(entity: str):
    if entity not in context_embeddings:
        try:
            response = get(f'https://www.wikidata.org/w/rest.php/wikibase/v0/entities/items/{entity}/statements')
            response.raise_for_status()

            statements = [statement for property_statements in response.json().values() for statement in property_statements]
            context = {statement['value']['content'] for statement in statements if statement['property']['data-type'] == 'wikibase-item' and 'value' in statement and 'content' in statement['value']}
            embeddings = [get_entity_embedding(entity) for entity in context]
            embeddings = [embedding for embedding in embeddings if embedding is not None]

            context_embeddings[entity] = np.mean(embeddings, axis=0) if embeddings else None
        except HTTPError as e:
            if e.response.status_code not in [400, 404]:
                raise e
            
            context_embeddings[entity] = None
    
    return context_embeddings[entity]
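The neighbor extraction inside `get_context_embedding` can be tried offline on a hand-written response. The sketch below assumes the statements payload is a mapping from property id to a list of statements, each with `property` and `value` objects, as in the output shown further down; the sample data is illustrative and `neighbor_items` is a hypothetical helper.

```python
from typing import Dict, List, Set

# Hand-written sample shaped like the statements response (illustrative values)
sample_statements: Dict[str, List[dict]] = {
    "P30": [{
        "property": {"id": "P30", "data-type": "wikibase-item"},
        "value": {"type": "value", "content": "Q46"},
    }],
    "P2924": [{
        "property": {"id": "P2924", "data-type": "external-id"},
        "value": {"type": "value", "content": "1940197"},
    }],
    "P999": [{
        # A statement without a concrete value carries no "content" key
        "property": {"id": "P999", "data-type": "wikibase-item"},
        "value": {"type": "somevalue"},
    }],
}


def neighbor_items(statements_by_property: Dict[str, List[dict]]) -> Set[str]:
    """Collect the ids of all items referenced by wikibase-item statements."""
    found = set()
    for statements in statements_by_property.values():
        for statement in statements:
            if statement["property"]["data-type"] != "wikibase-item":
                continue
            content = statement.get("value", {}).get("content")
            if isinstance(content, str):
                found.add(content)
    return found


item_neighbors = neighbor_items(sample_statements)
```

Only statements whose property links to another item contribute a neighbor; external ids and valueless statements are skipped.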

In [10]:
get_context_embedding("Q35")

{'P2924': [{'id': 'Q35$bcb33894-4e55-0332-3200-4905b22aaafd', 'rank': 'normal', 'qualifiers': [{'property': {'id': 'P1810', 'data-type': 'string'}, 'value': {'type': 'value', 'content': 'ДАНИЯ'}}], 'references': [], 'property': {'id': 'P2924', 'data-type': 'external-id'}, 'value': {'type': 'value', 'content': '1940197'}}], 'P1082': [{'id': 'Q35$c728ce0a-47b2-a152-55b3-91e33876c5ec', 'rank': 'normal', 'qualifiers': [{'property': {'id': 'P585', 'data-type': 'time'}, 'value': {'type': 'value', 'content': {'time': '+2014-07-01T00:00:00Z', 'precision': 11, 'calendarmodel': 'http://www.wikidata.org/entity/Q1985727'}}}], 'references': [{'hash': '0b90985e90b8ba5bb33bac275e87d58a45eb3e7f', 'parts': [{'property': {'id': 'P854', 'data-type': 'url'}, 'value': {'type': 'value', 'content': 'http://www.statistikbanken.dk/FOLK1'}}, {'property': {'id': 'P585', 'data-type': 'time'}, 'value': {'type': 'value', 'content': {'time': '+2014-07-01T00:00:00Z', 'precision': 11, 'calendarmodel': 'http://www.wiki

KeyboardInterrupt: 

In [10]:
if not (tmp_path / "model.bin").exists():
    maybe_download("http://vectors.nlpl.eu/repository/20/38.zip", tmp_path / "word2vec.zip")

    with zipfile.ZipFile(tmp_path / "word2vec.zip", 'r') as zip_ref:
        zip_ref.extractall(tmp_path)

word2vec = KeyedVectors.load_word2vec_format(tmp_path / "model.bin", binary=True)

def get_word_embedding(word):
    return word2vec[word] if word in word2vec else None

INFO:gensim.models.keyedvectors:loading projection weights from ../tmp/model.bin
INFO:gensim.utils:KeyedVectors lifecycle event {'msg': 'loaded (1655886, 100) matrix of type float32 from ../tmp/model.bin', 'binary': True, 'encoding': 'utf8', 'datetime': '2024-06-20T20:17:23.204223', 'gensim': '4.3.2', 'python': '3.11.7 (main, Jan 22 2024, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]', 'platform': 'Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.34', 'event': 'load_word2vec_format'}


In [52]:
def create_embeddings(tokens, get_embedding, default_embedding=np.zeros(100)):
    embedding_tokens = [None]
    embeddings = [default_embedding]

    for token in tokens:
        embedding = get_embedding(token)

        if embedding is not None:
            embedding_tokens.append(token)
            embeddings.append(embedding)

    embeddings = np.stack(embeddings)
    token2id = {token: i for i, token in enumerate(embedding_tokens)}

    return embeddings, token2id

In [77]:
def entities_to_ids(entities, entity2id):
    ids = []

    for entity in entities:
        if entity["id"] in entity2id and entity["end"] > len(ids):
            if entity["start"] > len(ids):
                ids.extend([0] * (entity["start"] - len(ids)))

            ids.extend([entity2id[entity["id"]] for _ in range(entity["end"] - len(ids))])

    return ids
    

def create_feature_file(tokenized_articles_file, test_tokenized_articles_file, word_embeddings_file, entity_embeddings_file, context_embeddings_file, news_feature_file, doc_size):
    articles = pl.concat([
        pl.scan_parquet(tokenized_articles_file),
        pl.scan_parquet(test_tokenized_articles_file)
    ])

    words = (articles
        .select("word_tokens")
        .explode("word_tokens")
        .unique("word_tokens")
        .collect()         
    )["word_tokens"]

    entities = (articles
        .select("entities")
        .explode("entities")
        .drop_nulls("entities")
        .with_columns(pl.col("entities").map_elements(lambda e: e["id"]).alias("entities"))
        .select("entities")
        .unique()
        .collect()
    )["entities"]

    word_embeddings, word2id = create_embeddings(words, get_word_embedding)
    np.save(word_embeddings_file, word_embeddings)
    del word_embeddings

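    entity_embeddings, entity2id = create_embeddings(entities, get_entity_embedding)
    np.save(entity_embeddings_file, entity_embeddings)

    # Build the context matrix row-aligned with the entity matrix so that one
    # entity id indexes both files; entities without a context embedding keep
    # the zero vector.
    context_embeddings = np.zeros_like(entity_embeddings)
    del entity_embeddings
    for entity, row in entity2id.items():
        embedding = get_context_embedding(entity) if entity is not None else None
        if embedding is not None:
            context_embeddings[row] = embedding
    np.save(context_embeddings_file, context_embeddings)
    del context_embeddings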

    encoded_articles = (articles
        .with_columns(
            pl.col("word_tokens")
                .map_elements(lambda tokens: [word2id[token] if token in word2id else 0 for token in tokens])
                .map_elements(lambda tokens: list(tokens[:doc_size]) + [0] * (doc_size - len(tokens)))
                .map_elements(lambda tokens: ','.join(map(str, tokens)))
                .alias("word_tokens"),
            pl.col("entities")
                .map_elements(lambda entities: entities_to_ids(entities, entity2id))
                .map_elements(lambda entities: list(entities[:doc_size]) + [0] * (doc_size - len(entities)))
                .map_elements(lambda entities: ','.join(map(str, entities)))
                .alias("entities")
        )
    )

    encoded_articles.sink_csv(news_feature_file, separator=' ', quote_style='never', include_header=False)
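The entity column of the feature file places each entity id at the token positions covered by the entity and 0 elsewhere, padded or truncated to `doc_size`. The following standalone sketch shows that idea in a simplified form. It is not the `entities_to_ids` function above: `align_entities` is a hypothetical helper that assumes `start` and `end` are token positions with `end` exclusive, and that earlier entities win when spans overlap.

```python
from typing import Dict, List


def align_entities(entities: List[dict], entity2id: Dict[str, int], doc_size: int) -> List[int]:
    """Return a fixed-length list with entity ids at their token positions."""
    ids = [0] * doc_size
    for entity in entities:
        entity_id = entity2id.get(entity["id"], 0)  # unknown entities stay 0
        for position in range(entity["start"], min(entity["end"], doc_size)):
            if ids[position] == 0:
                ids[position] = entity_id
    return ids


toy_entities = [
    {"id": "Q1", "start": 0, "end": 2},  # covers tokens 0 and 1
    {"id": "Q9", "start": 3, "end": 4},  # not in the dictionary below
]
aligned = align_entities(toy_entities, {"Q1": 5}, doc_size=6)
feature_column = ",".join(map(str, aligned))
```

`feature_column` is the comma-separated form that would go into the `[e1,e2,e3...ek]` field of the document feature file.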

In [7]:
from group_33.util import train_test_split
pl.Config.set_streaming_chunk_size(500_000)
force_reload = False

if not train_file.exists() or force_reload:
    train = transform_behaviors(pl.scan_parquet(data_path / 'train' / 'behaviors.parquet'))
    train.sink_csv(train_file, separator=' ', quote_style='never', include_header=False)
    # train_test.collect(streaming=True).write_csv(valid_file, separator=' ', quote_style='never', include_header=False)

if not (valid_file.exists() and evaluation_file.exists()) or force_reload:
    validation_behaviors = pl.scan_parquet(data_path / 'validation' / 'behaviors.parquet')
    # Split the validation behaviors 50/50 into a validation and an evaluation set
    validation, evaluation = train_test_split(validation_behaviors, 0.5)

    validation_transformed = transform_behaviors(validation)
    validation_transformed.collect(streaming=True).write_csv(valid_file, separator=' ', quote_style='never', include_header=False)

    evaluation_transformed = transform_behaviors(evaluation)
    evaluation_transformed.collect(streaming=True).write_csv(evaluation_file, separator=' ', quote_style='never', include_header=False)

if not user_history_file.exists() or force_reload:
    user_history = transform_history(
        data_path / 'train' / 'history.parquet',
        data_path / 'validation' / 'history.parquet',
        data_path / '..' / 'ebnerd_testset' / 'test' / 'history.parquet'
    )
    user_history.sink_csv(user_history_file, separator=' ', quote_style='never', include_header=False)

if not articles_tokenized_file.exists() or force_reload:
    tokenize_articles(articles_file, articles_tokenized_file)

if not test_articles_tokenized_file.exists() or force_reload:
    tokenize_articles(test_articles_file, test_articles_tokenized_file)

if not news_feature_file.exists() or force_reload:
    create_feature_file(
        articles_tokenized_file, test_articles_tokenized_file,
        word_embeddings_file, entity_embeddings_file,
        context_embeddings_file, news_feature_file, 10
    )

## Create hyper-parameters

In [15]:
yaml_file = maybe_download(url="https://recodatasets.z20.web.core.windows.net/deeprec/deeprec/dkn/dkn_MINDsmall.yaml",
                           work_directory=data_path)
hparams = prepare_hparams(yaml_file,
                          seed=33,
                          show_step=100,
                          news_feature_file=news_feature_file.as_posix(),
                          user_history_file=user_history_file.as_posix(),
                          wordEmb_file=word_embeddings_file.as_posix(),
                          entityEmb_file=entity_embeddings_file.as_posix(),
                          contextEmb_file=context_embeddings_file.as_posix(),
                          epochs=epochs,
                          save_model=True,
                          MODEL_DIR=(tmp_path / "model" / f"{int(time())}_e{epochs}_h{history_size}").as_posix(),
                          history_size=history_size,
                          batch_size=batch_size)

INFO:recommenders.datasets.download_utils:File ../downloads/demo/dkn_MINDsmall.yaml already downloaded


## Train the DKN model

In [None]:
model = DKN(hparams, DKNTextIterator)

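Number of training instances per dataset:

- small: 2,585,747
- large: 133,810,641 (a factor of roughly 52 larger than small)

Observed training time on the small dataset with batch size 1000 (about 2,600 steps per epoch):

- `history_size = 50`: about 50 min per epoch
- `history_size = 5`: about 10 min per epoch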
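The step counts in the notes above follow directly from the instance counts and the batch size. The short calculation below reproduces them and extrapolates the epoch time to the large dataset; the extrapolation assumes that the time per epoch scales linearly with the number of instances, which is an assumption and not a measurement.

```python
import math

small_instances = 2_585_747
large_instances = 133_810_641
batch_size = 1000

steps_small = math.ceil(small_instances / batch_size)  # 2586 steps per epoch
steps_large = math.ceil(large_instances / batch_size)  # 133811 steps per epoch
size_ratio = large_instances / small_instances         # between 51 and 52

# Assumption: epoch time grows linearly with the number of instances
minutes_small_hist50 = 50
hours_large_hist50 = minutes_small_hist50 * size_ratio / 60
```

Under that assumption, one epoch on the large dataset with `history_size = 50` would take on the order of 43 hours on the same hardware.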


In [None]:
model.fit(train_file, valid_file)

## Load model checkpoint

In [None]:
# model.load_model("../tmp/model/epoch_10")

## Evaluate the DKN model

In [None]:
res = model.run_eval(str(evaluation_file))
print(res)

## Predict for RecSys Challenge Testdata

In [None]:
def transform_behaviors_test(test_raw_file: str, indexed_behaviors_file: str, test_file: str):
    LOG.info(f"Writing index behaviors data to: {indexed_behaviors_file}")
    indexed_behaviors = (pl
        .scan_parquet(test_raw_file)
        .with_row_index()
        .select('index', 'article_ids_inview', 'impression_id', 'user_id')
        .rename({'article_ids_inview': 'article_id'})
        .with_columns(pl.col('article_id').map_elements(lambda ids: list(range(len(ids)))).alias('article_index'))
        .explode('article_id', 'article_index')
    )
    indexed_behaviors.collect(streaming=True).write_parquet(indexed_behaviors_file)

    LOG.info(f"Starting transform_behaviors for input: {test_raw_file}, writing to: {test_file}")
    test_data = (pl
        .scan_parquet(indexed_behaviors_file)
        .with_columns(
            pl.struct(['article_id', 'impression_id'])
                .map_elements(lambda x: f"{x['article_id']}%{x['impression_id']}")
                .alias('article_impression'),
            pl.lit(0).alias('label')
        )
        .select('label', 'user_id', 'article_impression')
    )
    test_data.sink_csv(test_file, separator=' ', quote_style='never', include_header=False)

if not (test_file.exists() and indexed_behaviors_file.exists()) or force_reload:
    transform_behaviors_test(str(test_raw_file), indexed_behaviors_file, test_file)

In [None]:
model.predict(str(test_file), scores_file)

Index the raw data so that the original order of the test samples can be reconstructed for the predictions.

In [1]:
# combine scores with impression id
rankings = calculate_rankings(indexed_behaviors_file, scores_file)
rankings.write_csv(predictions_file, separator=" ", include_header=False)

NameError: name 'calculate_rankings' is not defined

Sort the rankings by the original order (given by the index) and persist the resulting rankings.
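`calculate_rankings` comes from a project-local module whose implementation is not shown in this notebook. A minimal sketch of the underlying idea, assuming that a higher score means a more likely click and that rank 1 is the best candidate, could look like this; `ranks_from_scores` is a hypothetical helper and works on the scores of a single impression, kept in the original in-view order.

```python
from typing import List, Sequence


def ranks_from_scores(scores: Sequence[float]) -> List[int]:
    """Return the rank of each candidate, in the original candidate order."""
    # Candidate indices sorted from highest to lowest score
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


impression_ranks = ranks_from_scores([0.1, 0.9, 0.5])
```

The second candidate has the highest score and therefore gets rank 1, while the positions in the list still correspond to the original article order.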



In [None]:
# Check if produced prediction matches with original impression row order
# (pl.scan_parquet(test_raw_file)
#     .select("impression_id", "article_ids_inview")
#     .with_row_index()
#     .with_columns(pl.col("article_ids_inview").map_elements(lambda el: len(el)).alias("len_a_ids"))
#    # .filter(pl.col("impression_id") == 0)
# ).collect().head(1)

In [None]:
# d = (pl.scan_csv(test_file, has_header=False, separator=' ')
#     .filter(pl.col("column_3").str.contains("6451383"))
# 
# )
#     
# d.collect()

## Document embedding inference API

After training, you can obtain document embeddings through the document embedding inference API. The input file format is the same as that of the document feature file. The output file format is: `[Newsid] [embedding]`

In [None]:
model.run_get_embedding(news_feature_file, infer_embedding_file)

## Results on large MIND dataset

Here are performances using the large MIND dataset (1,000,000 users, 161,013 news articles and 15,777,377 impression logs).

| Models | g-AUC | MRR | NDCG@5 | NDCG@10 |
| :------ | :------: | :------: | :------: | :------: |
| LibFM | 0.5993 | 0.2823 | 0.3005 | 0.3574 |
| Wide&Deep | 0.6216 | 0.2931 | 0.3138 | 0.3712 |
| DKN | 0.6436 | 0.3128 | 0.3371 | 0.3908 |


Note that the DKN results were obtained with Microsoft Recommenders, while the results of the first two models come from the MIND paper \[3\].
All results are compared on the same test dataset.
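For readers unfamiliar with the metrics in the table, the sketch below computes AUC, MRR and NDCG@k for a single impression; g-AUC and the other reported numbers are then averages over impressions. This is an illustrative NumPy version written for this walkthrough, not the evaluation code of the library, and the labels and scores are made up.

```python
import numpy as np


def auc(labels, scores) -> float:
    """Fraction of (clicked, non-clicked) pairs ranked correctly (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrr(labels, scores) -> float:
    """Mean reciprocal rank of the clicked items."""
    ranked = np.take(labels, np.argsort(scores)[::-1])
    reciprocal = ranked / (np.arange(len(ranked)) + 1)
    return float(reciprocal.sum() / ranked.sum())


def ndcg_at_k(labels, scores, k: int) -> float:
    """Discounted cumulative gain of the top k, normalized by the ideal ordering."""
    ranked = np.take(labels, np.argsort(scores)[::-1])[:k]
    ideal = np.sort(labels)[::-1][:k]
    discounts = np.log2(np.arange(2, len(ranked) + 2))
    dcg = np.sum((2.0 ** ranked - 1) / discounts)
    idcg = np.sum((2.0 ** ideal - 1) / discounts)
    return float(dcg / idcg)


# Toy impression: the clicked item (label 1) is scored second highest
labels = np.array([0, 1, 0, 0])
scores = np.array([0.2, 0.6, 0.9, 0.1])

impression_auc = auc(labels, scores)
impression_mrr = mrr(labels, scores)
impression_ndcg5 = ndcg_at_k(labels, scores, k=5)
```

In this toy impression the clicked item is ranked second, so the reciprocal rank is one half and the AUC is two thirds.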

One epoch takes 6381.3s (5066.6s for training, 1314.7s for evaluating) for DKN on GPU. Hardware specification for running the large dataset: <br>
GPU: Tesla P100-PCIE-16GB <br>
CPU: 6 cores Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz

## References

\[1\] Hongwei Wang, Fuzheng Zhang, Xing Xie and Minyi Guo, "DKN: Deep Knowledge-Aware Network for News Recommendation", in Proceedings of the 2018 World Wide Web Conference (WWW), 2018, https://arxiv.org/abs/1801.08284. <br>
\[2\] Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE. https://github.com/thunlp/KB2E <br>
\[3\] Fangzhao Wu et al., "MIND: A Large-scale Dataset for News Recommendation", Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, https://msnews.github.io/competition.html. <br>
\[4\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/