<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

# DKN : Deep Knowledge-Aware Network for News Recommendation

DKN \[1\] is a deep learning model that incorporates information from a knowledge graph to improve news recommendation. Specifically, DKN uses a TransX \[2\] method for knowledge graph representation learning, then applies a CNN framework, named KCNN, to combine entity embeddings with word embeddings and generate a final embedding vector for a news article. Click-through rate (CTR) prediction is made via an attention-based neural scorer. 

## Properties of DKN:

- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering. 
- It makes use of knowledge entities and common sense in news content via joint learning from semantic-level and knowledge-level representations of news articles.
- DKN uses an attention module to dynamically calculate a user's aggregated historical representation.


## Data format

DKN takes the following files as input:

- **training / validation / test files**: each line in these files represents one instance. The impression ID is used to evaluate performance within an impression session, so it is only needed for evaluation; you can set it to 0 for training data. The format is: <br> 
`[label] [userid] [CandidateNews]%[impressionid] `<br> 
e.g., `1 train_U1 N1%0` <br> 

- **user history file**: each line in this file represents one user's click history. Set the `history_size` parameter in the config file to the maximum number of clicked articles to use per user. If a user has more than `history_size` clicks, only the last `history_size` are kept; if a user has fewer, the history is padded with 0. The format is: <br> 
`[Userid] [newsid1,newsid2...]`<br>
e.g., `train_U1 N1,N2` <br> 

- **document feature file**: It contains the word and entity features for news articles. News articles are represented by aligned title words and title entities. As a quick example, for the news title <i>"Trump to deliver State of the Union address next week"</i>, the title words value may be `CandidateNews:34,45,334,23,12,987,3456,111,456,432` and the title entity value may be `entity:45,0,0,0,0,0,0,0,0,0`. Only the first value of the entity vector is non-zero, because only the word "Trump" maps to an entity. Word and entity values are hashed to integers from 1 to `n` (where `n` is the number of distinct words or entities). Each feature must have a fixed length k (the `doc_size` parameter): if a document has more than k words, truncate it to k words; if it has fewer, pad with 0 at the end. 
The format is: <br> 
`[Newsid] [w1,w2,w3...wk] [e1,e2,e3...ek]`

- **word embedding / entity embedding / context embedding files**: These are `*.npy` files of pretrained embeddings. After loading, each file is a two-dimensional matrix of shape `[n+1, k]`, where n is the number of words (or entities) in the corresponding hash dictionary and k is the embedding dimension. Row 0 is kept as the embedding for zero padding. 
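
The text formats listed above can be parsed with a few string splits. The following is a minimal sketch rather than the library's own reader (`DKNTextIterator` does the real work). The helper names `parse_instance`, `parse_history` and `parse_doc_feature` are hypothetical, and padding short histories on the left is an assumption, since the description only says that padding uses 0.

```python
def parse_instance(line):
    """Parse '[label] [userid] [CandidateNews]%[impressionid]'.

    The '%[impressionid]' suffix is treated as optional and defaults to "0".
    """
    label, user_id, rest = line.strip().split(" ")
    news_id, _, impression_id = rest.partition("%")
    return int(label), user_id, news_id, impression_id or "0"


def parse_history(line, history_size):
    """Parse '[Userid] [newsid1,newsid2...]' and fix the length to history_size."""
    user_id, clicks = line.strip().split(" ")
    news_ids = clicks.split(",")[-history_size:]  # keep the most recent clicks
    # Assumption: pad on the left with 0; the padding side is not specified.
    padded = [0] * (history_size - len(news_ids)) + news_ids
    return user_id, padded


def parse_doc_feature(line):
    """Parse '[Newsid] [w1,...,wk] [e1,...,ek]' into integer index lists."""
    news_id, words, entities = line.strip().split(" ")
    word_ids = [int(w) for w in words.split(",")]
    entity_ids = [int(e) for e in entities.split(",")]
    return news_id, word_ids, entity_ids


print(parse_instance("1 train_U1 N1%0"))
print(parse_history("train_U1 N1,N2", history_size=4))
print(parse_doc_feature("N1 34,45,334 45,0,0"))
```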

In this experiment, we used GloVe\[4\] vectors to initialize the word embeddings. We trained the entity embeddings using TransE\[2\] on the knowledge graph; the context embedding of an entity is the average of the embeddings of its neighbors in the knowledge graph.<br>
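
To make the context embedding concrete, here is a small sketch that averages the embeddings of each entity's neighbors. The embedding values and the `neighbors` adjacency list are toy data made up for illustration, and giving a zero vector to entities without neighbors is an assumption.

```python
import numpy as np

# Toy data (made up): 3 entities with 3-dimensional embeddings, plus padding row 0.
entity_embeddings = np.array([
    [0.0, 0.0, 0.0],   # index 0 is reserved for padding
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [5.0, 4.0, 4.0],
])
# Hypothetical adjacency list: entity index -> indexes of its graph neighbors.
neighbors = {1: [2, 3], 2: [1], 3: []}

context_embeddings = np.zeros_like(entity_embeddings)
for entity, adjacent in neighbors.items():
    if adjacent:  # assumption: entities without neighbors keep a zero vector
        context_embeddings[entity] = entity_embeddings[adjacent].mean(axis=0)

print(context_embeddings[1])  # mean of rows 2 and 3 -> [4. 3. 2.]
```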

## MIND dataset

The MIND dataset\[3\] is a large-scale English news dataset collected from anonymized behavior logs of the Microsoft News website. MIND contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of the user before the impression.

[MIND-small](https://azure.microsoft.com/en-us/services/open-datasets/catalog/microsoft-news-dataset/) is a smaller version of the dataset, built by randomly sampling 50,000 users and their behavior logs from MIND.

The datasets contain these files for both training and validation data:

#### behaviors.tsv

The behaviors.tsv file contains the impression logs and users' news click histories. It has 5 columns separated by tabs:

+ Impression ID. The ID of an impression.
+ User ID. The anonymous ID of a user.
+ Time. The impression time with format "MM/DD/YYYY HH:MM:SS AM/PM".
+ History. The news click history (ID list of clicked news) of this user before this impression.
+ Impressions. List of news displayed in this impression and user's click behaviors on them (1 for click and 0 for non-click).

One simple example: 

`1    U82271    11/11/2019 3:28:58 PM    N3130 N11621 N12917 N4574 N12140 N9748    N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0 `
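
As an illustration of the five-column layout, the sketch below splits one row and separates clicked from non-clicked news. `parse_behavior` is a hypothetical helper, not part of `reco_utils`, and the sample row is a shortened version of the example above.

```python
def parse_behavior(line):
    """Split one behaviors.tsv row into its five tab-separated columns."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    clicked, not_clicked = [], []
    for item in impressions.split():
        news_id, label = item.rsplit("-", 1)   # e.g. "N15368-1" -> ("N15368", "1")
        (clicked if label == "1" else not_clicked).append(news_id)
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time,
        "history": history.split(),
        "clicked": clicked,
        "not_clicked": not_clicked,
    }


row = "1\tU82271\t11/11/2019 3:28:58 PM\tN3130 N11621\tN13390-0 N15368-1 N481-0"
print(parse_behavior(row))
```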

#### news.tsv

The news.tsv file contains the detailed information of the news articles involved in the behaviors.tsv file. It has 8 columns separated by tabs:

+ News ID
+ Category
+ SubCategory
+ Title
+ Abstract
+ URL
+ Title Entities (entities contained in the title of this news)
+ Abstract Entities (entities contained in the abstract of this news)

One simple example: 

`N46466    lifestyle    lifestyleroyals    The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By    Shop the notebooks, jackets, and more that the royals can't live without.    https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata    [{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]    [] `

#### entity_embedding.vec & relation_embedding.vec

The entity_embedding.vec and relation_embedding.vec files contain the 100-dimensional embeddings of the entities and relations learned from a subgraph of the WikiData knowledge graph by the TransE method. In both files, the first column is the ID of the entity/relation, and the other columns are the embedding vector values.

One simple example: 

`Q42306013  0.014516 -0.106958 0.024590 ... -0.080382`
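
A row in this layout can be turned into an ID and a NumPy vector as sketched below. `parse_vec_line` is a hypothetical helper, and the sample row carries only four values for brevity, whereas the real files hold 100 values per row.

```python
import numpy as np


def parse_vec_line(line):
    """Return (id, vector) from one row of an embedding .vec file."""
    parts = line.split()  # split on any whitespace, so tabs and spaces both work
    return parts[0], np.asarray(parts[1:], dtype=float)


entity_id, vector = parse_vec_line("Q42306013\t0.014516\t-0.106958\t0.024590\t-0.080382")
print(entity_id, vector.shape)
```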


## DKN architecture

The following figure shows the architecture of DKN.

![](https://recodatasets.blob.core.windows.net/images/dkn_architecture.png)

DKN takes one piece of candidate news and one piece of a user’s clicked news as input. For each piece of news, a specially designed KCNN is used to process its title and generate an embedding vector. KCNN is an extension of traditional CNN that allows flexibility in incorporating symbolic knowledge from a knowledge graph into sentence representation learning. 
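
The idea of a multi-channel convolution over aligned word and entity views can be sketched in plain NumPy. This is an illustration under stated assumptions, not the model's TensorFlow code: all embeddings are random toy values, entity and context embeddings are assumed to already share the word embedding dimension, and a single random kernel per window size stands in for the learned filters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, dim = 10, 8  # title length and embedding dimension (toy sizes)

# Three aligned views of the same title, one row per word position.
word_emb = rng.normal(size=(n_words, dim))
entity_emb = rng.normal(size=(n_words, dim))   # would be zero where a word has no entity
context_emb = rng.normal(size=(n_words, dim))

# Stack the views as channels: shape (n_words, dim, 3).
multi_channel = np.stack([word_emb, entity_emb, context_emb], axis=-1)


def conv_max_pool(x, kernel):
    """Slide a (window, dim, channels) kernel over word positions, apply ReLU, max-pool."""
    window = kernel.shape[0]
    features = [
        np.sum(x[i:i + window] * kernel)       # one scalar per window position
        for i in range(x.shape[0] - window + 1)
    ]
    return max(0.0, max(features))             # ReLU followed by max-over-time pooling


# One random kernel per window size; a trained model learns many filters per size.
kernels = [rng.normal(size=(w, dim, 3)) for w in (1, 2, 3)]
title_vector = np.array([conv_max_pool(multi_channel, k) for k in kernels])
print(title_vector.shape)  # (3,)
```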

With the KCNN, we obtain a set of embedding vectors for a user’s click history. To get the final embedding of the user with respect to the current candidate news, we use an attention-based method to automatically match the candidate news to each piece of the user's clicked news, and aggregate the user’s historical interests with different weights. The candidate news embedding and the user embedding are concatenated and fed into a deep neural network (DNN) to calculate the predicted probability that the user will click the candidate news.
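
The attention-based aggregation can be illustrated with a few lines of NumPy. This is a sketch under assumptions: the embeddings are random toy values, and a plain dot product stands in for the learned attention network, which is not reproduced here.

```python
import numpy as np


def softmax(scores):
    shifted = scores - scores.max()   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()


rng = np.random.default_rng(0)
dim = 8
candidate = rng.normal(size=dim)      # embedding of the candidate news
clicked = rng.normal(size=(5, dim))   # embeddings of 5 clicked news articles

# Assumption: a dot product replaces the learned attention scorer.
weights = softmax(clicked @ candidate)   # one weight per clicked article
user = weights @ clicked                 # weighted sum of the click history

# The concatenation is what would be fed to the final DNN scorer.
dnn_input = np.concatenate([candidate, user])
print(weights.round(3), dnn_input.shape)
```

Because the weights depend on the candidate, the same click history yields a different user embedding for each candidate article.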

## Global settings and imports

In [1]:
import sys
sys.path.append("../../")

import os
from tempfile import TemporaryDirectory
import logging
import papermill as pm
import tensorflow as tf

from reco_utils.dataset.download_utils import maybe_download
from reco_utils.dataset.mind import (download_mind, 
                                     extract_mind, 
                                     read_clickhistory, 
                                     get_train_input, 
                                     get_valid_input, 
                                     get_user_history,
                                     get_words_and_entities,
                                     generate_embeddings) 
from reco_utils.recommender.deeprec.deeprec_utils import prepare_hparams
from reco_utils.recommender.deeprec.models.dkn import DKN
from reco_utils.recommender.deeprec.io.dkn_iterator import DKNTextIterator

print(f"System version: {sys.version}")
print(f"Tensorflow version: {tf.__version__}")

System version: 3.6.11 | packaged by conda-forge | (default, Nov 27 2020, 18:57:37) 
[GCC 9.3.0]
Tensorflow version: 1.15.2


In [2]:
# Temp dir
tmpdir = TemporaryDirectory()

# Logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter("%(asctime)s %(levelname)s: %(message)s", datefmt='%I:%M:%S')
handler.setFormatter(formatter)
logger.handlers = [handler]

In [3]:
# Mind parameters
MIND_SIZE = "small"

# DKN parameters
epochs = 10
history_size = 50
batch_size = 100

# Paths
data_path = os.path.join(tmpdir.name, "mind-dkn")
train_file = os.path.join(data_path, "train_mind.txt")
valid_file = os.path.join(data_path, "valid_mind.txt")
user_history_file = os.path.join(data_path, "user_history.txt")
infer_embedding_file = os.path.join(data_path, "infer_embedding.txt")


## Data preparation

In this example, we walk through a real case of applying DKN to a raw news dataset from the very beginning. We download a copy of the open-source MIND dataset in its original raw format, then process the raw data files into DKN's input data format described above. 

In [4]:
train_zip, valid_zip = download_mind(size=MIND_SIZE, dest_path=data_path)
train_path, valid_path = extract_mind(train_zip, valid_zip)

100%|██████████| 51.7k/51.7k [00:06<00:00, 8.47kKB/s]
100%|██████████| 30.2k/30.2k [00:03<00:00, 7.60kKB/s]


In [5]:
train_path, valid_path

('MINDsmall_train.zip/train', 'MINDsmall_train.zip/valid')

In [6]:
train_session, train_history = read_clickhistory(train_path, "behaviors.tsv")
valid_session, valid_history = read_clickhistory(valid_path, "behaviors.tsv")

11:26:05 INFO: Train file /tmp/tmpvt_ozvte/mind-dkn/train_mind.txt successfully generated
11:26:07 INFO: Validation file /tmp/tmpvt_ozvte/mind-dkn/valid_mind.txt successfully generated
11:26:07 INFO: User history file /tmp/tmpvt_ozvte/mind-dkn/user_history.txt successfully generated


In [15]:
len(train_session), len(valid_session)

(156965, 73152)

In [19]:
assert all([len(s[2]) != 0 for s in train_session])
assert all([len(s[2]) != 0 for s in valid_session])

In [31]:
uid = 'U13740'
sessions = [s for s in train_session if s[0] == uid]

for s in sessions:
    print(s[:2])
    print(s[-2])
    print(s[-1][:5])
    print('-' * 10)

print(train_history[uid])

['U13740', ['N55189', 'N42782', 'N34694', 'N45794', 'N18445', 'N63302', 'N10414', 'N19347', 'N31801']]
['N55689']
['N35729']
----------
['U13740', ['N55189', 'N42782', 'N34694', 'N45794', 'N18445', 'N63302', 'N10414', 'N19347', 'N31801']]
['N28910']
['N20020', 'N3737', 'N43202', 'N18708', 'N30125']
----------
['U13740', ['N55189', 'N42782', 'N34694', 'N45794', 'N18445', 'N63302', 'N10414', 'N19347', 'N31801']]
['N58133']
['N13907', 'N8509', 'N47061', 'N51048', 'N22417']
----------
['N55189', 'N42782', 'N34694', 'N45794', 'N18445', 'N63302', 'N10414', 'N19347', 'N31801']


In [33]:
v_sessions = [s for s in valid_session if s[0] == uid]

for s in v_sessions:
    print(s[:2])
    print(s[-2])
    print(s[-1][:5])
    print('-' * 10)

print(valid_history.get(uid, None))

None


In [26]:
len(train_history), len(valid_history)

(50000, 50000)

In [49]:
train_uids = set(train_history.keys())
valid_uids = set(valid_history.keys())

join = train_uids & valid_uids
n_join = len(join)
n_join, len(train_uids) - n_join, len(valid_uids) - n_join

(5943, 44057, 44057)

In [None]:
get_train_input(train_session, train_file)
get_valid_input(valid_session, valid_file)
get_user_history(train_history, valid_history, user_history_file)

In [36]:
!ls -lh {data_path}

total 975M
-rw-rw-r-- 1 ec2-user ec2-user 1.1K Jan  6 11:47 dkn_MINDsmall.yaml
-rw-rw-r-- 1 ec2-user ec2-user 4.6M Jan  6 11:33 doc_feature.txt
-rw-rw-r-- 1 ec2-user ec2-user  14M Jan  6 11:33 entity_embeddings_5w_100.npy
drwxrwxr-x 2 ec2-user ec2-user 4.0K Jan  6 11:33 glove
-rw-rw-r-- 1 ec2-user ec2-user 823M Jan  6 11:32 glove.6B.zip
-rw-rw-r-- 1 ec2-user ec2-user  25M Jan  6 11:26 train_mind.txt
-rw-rw-r-- 1 ec2-user ec2-user  16M Jan  6 11:26 user_history.txt
-rw-rw-r-- 1 ec2-user ec2-user  73M Jan  6 11:26 valid_mind.txt
-rw-rw-r-- 1 ec2-user ec2-user  23M Jan  6 11:33 word_embeddings_5w_100.npy


In [41]:
!head -10 {train_file}

1 train_U13740 N55689
0 train_U13740 N35729
0 train_U13740 N35729
0 train_U13740 N35729
0 train_U13740 N35729
1 train_U91836 N17059
0 train_U91836 N22407
0 train_U91836 N39317
0 train_U91836 N33677
0 train_U91836 N20678
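
The first lines of the train file above show one clicked article followed by four non-clicked lines for the same user. The sketch below reproduces that pattern under assumptions inferred only from this output: four negatives per positive, drawn with replacement from the non-clicked list of the same session. `make_train_lines` is a hypothetical helper, not the `get_train_input` implementation.

```python
import random


def make_train_lines(session, n_negatives=4, seed=0):
    """Turn one session [user_id, history, clicked, not_clicked] into training lines."""
    rng = random.Random(seed)
    user_id, _history, clicked, not_clicked = session
    lines = []
    for positive in clicked:
        lines.append(f"1 train_{user_id} {positive}")
        if not_clicked:
            # Sampling with replacement lets a short list still fill n_negatives slots.
            for negative in rng.choices(not_clicked, k=n_negatives):
                lines.append(f"0 train_{user_id} {negative}")
    return lines


session = ["U13740", ["N55189", "N42782"], ["N55689"], ["N35729"]]
print("\n".join(make_train_lines(session)))
```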


In [43]:
!head -20 {valid_file}

1 valid_U80234 N31958%0
0 valid_U80234 N28682%0
0 valid_U80234 N48740%0
0 valid_U80234 N34130%0
0 valid_U80234 N6916%0
0 valid_U80234 N5472%0
0 valid_U80234 N50775%0
0 valid_U80234 N24802%0
0 valid_U80234 N19990%0
0 valid_U80234 N33176%0
0 valid_U80234 N62365%0
0 valid_U80234 N5940%0
0 valid_U80234 N6400%0
0 valid_U80234 N58098%0
0 valid_U80234 N42844%0
0 valid_U80234 N49285%0
0 valid_U80234 N51470%0
0 valid_U80234 N53572%0
0 valid_U80234 N11930%0
0 valid_U80234 N21679%0


In [51]:
!head -5 {data_path}/user_history.txt

train_U13740 N55189,N42782,N34694,N45794,N18445,N63302,N10414,N19347,N31801
train_U91836 N31739,N6072,N63045,N23979,N35656,N43353,N8129,N1569,N17686,N13008,N21623,N6233,N14340,N48031,N62285,N44383,N23061,N16290,N6244,N45099,N58715,N59049,N7023,N50528,N42704,N46082,N8275,N15710,N59026,N8429,N30867,N56514,N19709,N31402,N31741,N54889,N9798,N62612,N2663,N16617,N6087,N13231,N63317,N61388,N59359,N51163,N30698,N34567,N54225,N32852,N55833,N64467,N3142,N13912,N29802,N44462,N29948,N4486,N5398,N14761,N47020,N65112,N31699,N37159,N61101,N14761,N3433,N10438,N61355,N21164,N22976,N2511,N48390,N58224,N48742,N35458,N24611,N37509,N21773,N41011,N19041,N25785
train_U73700 N10732,N25792,N7563,N21087,N41087,N5445,N60384,N46616,N52500,N33164,N47289,N24233,N62058,N26378,N49475,N18870
train_U34670 N45729,N2203,N871,N53880,N41375,N43142,N33013,N29757,N31825,N51891
train_U8125 N10078,N56514,N14904,N33740


In [53]:
!ls -lh {train_path}

total 153M
-rw-rw-r-- 1 ec2-user ec2-user   88M Jan  6 11:25 behaviors.tsv
-rw-rw-r-- 1 ec2-user ec2-user   25M Jan  6 11:25 entity_embedding.vec
-rw-rw-r-- 1 ec2-user ec2-user   40M Jan  6 11:25 news.tsv
-rw-rw-r-- 1 ec2-user ec2-user 1021K Jan  6 11:25 relation_embedding.vec


In [7]:
train_news = os.path.join(train_path, "news.tsv")
valid_news = os.path.join(valid_path, "news.tsv")
news_words, news_entities = get_words_and_entities(train_news, valid_news)

In [55]:
!head -1 {train_news}

N55528	lifestyle	lifestyleroyals	The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By	Shop the notebooks, jackets, and more that the royals can't live without.	https://assets.msn.com/labs/mind/AAGH0ET.html	[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]	[]


In [63]:
len(news_words), type(news_words)

(65238, dict)

In [60]:
news_words['N55528']

['the',
 'brands',
 'queen',
 'elizabeth',
 'prince',
 'charles',
 'and',
 'prince',
 'philip',
 'swear',
 'by']

In [64]:
len(news_entities), type(news_entities)

(65238, dict)

In [62]:
news_entities['N55528']

[(['Prince Philip'], 'Q80976'),
 (['Prince Charles'], 'Q43274'),
 (['Queen Elizabeth'], 'Q9682')]

In [75]:
import json
from nltk.tokenize import RegexpTokenizer

def _read_news1(filepath, news_words, news_entities, tokenizer):

    with open(filepath, encoding="utf-8") as f:
        lines = f.readlines()
    for line in lines:
        print('line:')
        print(line)
        print('-' * 50)
        splitted = line.strip("\n").split("\t")
        print('splitted:')
        print(splitted)
        print('-' * 50)
        new_id, category, subcategory, title, abstract, url, title_entities, abstract_entities = splitted
        news_words[new_id] = tokenizer.tokenize(title.lower())
        news_entities[new_id] = []
        for entity in json.loads(title_entities):
            print("entity:")
            print(entity)
            news_entities[new_id].append(
                (entity["SurfaceForms"], entity["WikidataId"])
            )
        break
    return news_words, news_entities

def get_words_and_entities1(train_news, valid_news):
    """Load words and entities
    Args:
        train_news (str): News train file.
        valid_news (str): News validation file.
    Returns: 
        dict, dict: Words and entities dictionaries.
    """
    news_words = {}
    news_entities = {}
    tokenizer = RegexpTokenizer(r"\w+")
    news_words, news_entities = _read_news1(
        train_news, news_words, news_entities, tokenizer
    )
#     news_words, news_entities = _read_news(
#         valid_news, news_words, news_entities, tokenizer
#     )
    return news_words, news_entities

In [76]:
news_words1, news_entities1 = get_words_and_entities1(train_news, valid_news)

line:
N55528	lifestyle	lifestyleroyals	The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By	Shop the notebooks, jackets, and more that the royals can't live without.	https://assets.msn.com/labs/mind/AAGH0ET.html	[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]	[]

--------------------------------------------------
splitted:
['N55528', 'lifestyle', 'lifestyleroyals', 'The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By', "Shop the notebooks, jackets, and more that the royals can't live without.", 'https://assets.msn.com/labs/mind/AAGH0E

In [77]:
news_words1

{'N55528': ['the',
  'brands',
  'queen',
  'elizabeth',
  'prince',
  'charles',
  'and',
  'prince',
  'philip',
  'swear',
  'by']}

In [78]:
news_entities1

{'N55528': [(['Prince Philip'], 'Q80976'),
  (['Prince Charles'], 'Q43274'),
  (['Queen Elizabeth'], 'Q9682')]}

In [8]:
train_entities = os.path.join(train_path, "entity_embedding.vec")
valid_entities = os.path.join(valid_path, "entity_embedding.vec")
news_feature_file, word_embeddings_file, entity_embeddings_file = generate_embeddings(
    data_path,
    news_words,
    news_entities,
    train_entities,
    valid_entities,
    max_sentence=10,
    word_embedding_dim=100,
)

11:26:22 INFO: Downloading glove...
100%|██████████| 842k/842k [06:30<00:00, 2.16kKB/s] 
11:33:11 INFO: Loading glove with embedding dimension 100...
11:33:24 INFO: Reading train entities...
11:33:25 INFO: Reading valid entities...
11:33:26 INFO: Generating word and entity indexes...
11:33:28 INFO: Generating word embeddings...
11:33:28 INFO: Generating entity embeddings...
11:33:28 INFO: Saving word and entity features in /tmp/tmpvt_ozvte/mind-dkn/doc_feature.txt
11:33:29 INFO: Saving word embeddings in /tmp/tmpvt_ozvte/mind-dkn/word_embeddings_5w_100.npy
11:33:29 INFO: Saving word embeddings in /tmp/tmpvt_ozvte/mind-dkn/entity_embeddings_5w_100.npy


In [86]:
!ls -lh {data_path}/glove

total 2.1G
-rw-rw-r-- 1 ec2-user ec2-user 332M Jan  6 11:32 glove.6B.100d.txt
-rw-rw-r-- 1 ec2-user ec2-user 662M Jan  6 11:33 glove.6B.200d.txt
-rw-rw-r-- 1 ec2-user ec2-user 990M Jan  6 11:33 glove.6B.300d.txt
-rw-rw-r-- 1 ec2-user ec2-user 164M Jan  6 11:32 glove.6B.50d.txt


In [90]:
!head -3 {data_path}/glove/glove.6B.50d.txt

the 0.418 0.24968 -0.41242 0.1217 0.34527 -0.044457 -0.49688 -0.17862 -0.00066023 -0.6566 0.27843 -0.14767 -0.55677 0.14658 -0.0095095 0.011658 0.10204 -0.12792 -0.8443 -0.12181 -0.016801 -0.33279 -0.1552 -0.23131 -0.19181 -1.8823 -0.76746 0.099051 -0.42125 -0.19526 4.0071 -0.18594 -0.52287 -0.31681 0.00059213 0.0074449 0.17778 -0.15897 0.012041 -0.054223 -0.29871 -0.15749 -0.34758 -0.045637 -0.44251 0.18785 0.0027849 -0.18411 -0.11514 -0.78581
, 0.013441 0.23682 -0.16899 0.40951 0.63812 0.47709 -0.42852 -0.55641 -0.364 -0.23938 0.13001 -0.063734 -0.39575 -0.48162 0.23291 0.090201 -0.13324 0.078639 -0.41634 -0.15428 0.10068 0.48891 0.31226 -0.1252 -0.037512 -1.5179 0.12612 -0.02442 -0.042961 -0.28351 3.5416 -0.11956 -0.014533 -0.1499 0.21864 -0.33412 -0.13872 0.31806 0.70358 0.44858 -0.080262 0.63003 0.32111 -0.46765 0.22786 0.36034 -0.37818 -0.56657 0.044691 0.30392
. 0.15164 0.30177 -0.16763 0.17684 0.31719 0.33973 -0.43478 -0.31086 -0.44999 -0.29486 0.16608 0.11963 -0.41328 -0.42353

In [93]:
data_path

'/tmp/tmpvt_ozvte/mind-dkn'

In [175]:
news_words['N55528']

['the',
 'brands',
 'queen',
 'elizabeth',
 'prince',
 'charles',
 'and',
 'prince',
 'philip',
 'swear',
 'by']

In [173]:
news_entities['N55528']

[(['Prince Philip'], 'Q80976'),
 (['Prince Charles'], 'Q43274'),
 (['Queen Elizabeth'], 'Q9682')]

In [176]:
train_entities

'MINDsmall_train.zip/train/entity_embedding.vec'

In [119]:
import numpy as np 

def generate_embeddings1(
    data_path,
    news_words,
    news_entities,
    train_entities,
    valid_entities,
    max_sentence=10,
    word_embedding_dim=100,
):
    """Generate embeddings.
    Args:
        data_path (str): Data path.
        news_words (dict): News word dictionary.
        news_entities (dict): News entity dictionary.
        train_entities (str): Train entity file.
        valid_entities (str): Validation entity file.
        max_sentence (int): Max sentence size.
        word_embedding_dim (int): Word embedding dimension.
    Returns:
        str, str, str, dict: File paths to news features, word embeddings and entity embeddings, plus a dict of intermediate variables for inspection.
    """
    variables = {}
    
    embedding_dimensions = [50, 100, 200, 300]
    if word_embedding_dim not in embedding_dimensions:
        raise ValueError(
            f"Wrong embedding dimension, available options are {embedding_dimensions}"
        )

    logger.info("Downloading glove...")
    glove_path = os.path.join(data_path, "glove")  # reuse the GloVe files already extracted under data_path

    word_set = set()
    word_embedding_dict = {}
    entity_embedding_dict = {}

    logger.info(f"Loading glove with embedding dimension {word_embedding_dim}...")
    glove_file = "glove.6B." + str(word_embedding_dim) + "d.txt"
    fp_pretrain_vec = open(os.path.join(glove_path, glove_file), "r", encoding="utf-8")
    for i, line in enumerate(fp_pretrain_vec):
        linesplit = line.split(" ")
        word_set.add(linesplit[0])
        word_embedding_dict[linesplit[0]] = np.asarray(list(map(float, linesplit[1:])))
        
#         if i > 0:
#             break
#     print('word_set:')
#     print(word_set)
#     print('word_embedding_dict:')
#     print(word_embedding_dict)
    variables['word_set'] = word_set
    variables['word_embedding_dict'] = word_embedding_dict
    
    fp_pretrain_vec.close()

    logger.info("Reading train entities...")
#     print("train_entities:", train_entities)
    fp_entity_vec_train = open(train_entities, "r", encoding="utf-8")
    for i, line in enumerate(fp_entity_vec_train):
        linesplit = line.split()
        entity_embedding_dict[linesplit[0]] = np.asarray(
            list(map(float, linesplit[1:]))
        )
#         if i > 0:
#             break
    
    
    fp_entity_vec_train.close()

    logger.info("Reading valid entities...")
#     print("valid_entities:", valid_entities)
    fp_entity_vec_valid = open(valid_entities, "r", encoding="utf-8")
    for i, line in enumerate(fp_entity_vec_valid):
        linesplit = line.split()
        entity_embedding_dict[linesplit[0]] = np.asarray(
            list(map(float, linesplit[1:]))
        )
    
    variables['entity_embedding_dict'] = entity_embedding_dict
    fp_entity_vec_valid.close()

    logger.info("Generating word and entity indexes...")
    word_dict = {}
    word_index = 1  # index 0 is reserved for padding and for words not found in GloVe
    news_word_string_dict = {}
    news_entity_string_dict = {}
    entity2index = {}
    entity_index = 1
    for k, doc_id in enumerate(news_words): # { doc_id: ['the','brands','queen','elizabeth','prince'], NID2: {} ...}
        news_word_string_dict[doc_id] = [0 for n in range(max_sentence)]
        news_entity_string_dict[doc_id] = [0 for n in range(max_sentence)]
        surfaceform_entityids = news_entities[doc_id]
        for item in surfaceform_entityids:   # [(['Prince Philip'], 'Q80976'), (['Prince Charles'], 'Q43274'), (['Queen Elizabeth'], 'Q9682')]
            if item[1] not in entity2index and item[1] in entity_embedding_dict:
                entity2index[item[1]] = entity_index
                entity_index = entity_index + 1
        for i in range(len(news_words[doc_id])):  # ['the','brands','queen','elizabeth','prince']
            if news_words[doc_id][i] in word_embedding_dict:
                if news_words[doc_id][i] not in word_dict:
                    word_dict[news_words[doc_id][i]] = word_index
                    word_index = word_index + 1
                    news_word_string_dict[doc_id][i] = word_dict[news_words[doc_id][i]]
                else:
                    news_word_string_dict[doc_id][i] = word_dict[news_words[doc_id][i]]
                for item in surfaceform_entityids:
                    for surface in item[0]:
                        for surface_word in surface.split(" "):
                            if news_words[doc_id][i] == surface_word.lower():
                                if item[1] in entity_embedding_dict:
                                    news_entity_string_dict[doc_id][i] = entity2index[
                                        item[1]
                                    ]
            if i == max_sentence - 1:
                break
        
        
#         if k > 0:
#             break
    variables['word_dict'] = word_dict
    variables['news_word_string_dict'] = news_word_string_dict
    variables['news_entity_string_dict'] = news_entity_string_dict
    variables['entity2index'] = entity2index
    

    logger.info("Generating word embeddings...")
    word_embeddings = np.zeros([word_index, word_embedding_dim])
    for word in word_dict:
        word_embeddings[word_dict[word]] = word_embedding_dict[word]
    
    variables['word_embeddings'] = word_embeddings


    logger.info("Generating entity embeddings...")
    entity_embeddings = np.zeros([entity_index, word_embedding_dim])
    for entity in entity2index:
        entity_embeddings[entity2index[entity]] = entity_embedding_dict[entity]
    
    variables['entity_embeddings'] = entity_embeddings

    news_feature_path = os.path.join(data_path, "doc_feature.txt")
    logger.info(f"Saving word and entity features in {news_feature_path}")
    # Use a context manager so the feature file is flushed and closed before returning.
    with open(news_feature_path, "w", encoding="utf-8") as fp_doc_string:
        for doc_id in news_word_string_dict:
            fp_doc_string.write(
                doc_id
                + " "
                + ",".join(list(map(str, news_word_string_dict[doc_id])))
                + " "
                + ",".join(list(map(str, news_entity_string_dict[doc_id])))
                + "\n"
            )

    word_embeddings_path = os.path.join(
        data_path, "word_embeddings_5w_" + str(word_embedding_dim) + ".npy"
    )
    logger.info(f"Saving word embeddings in {word_embeddings_path}")
    np.save(word_embeddings_path, word_embeddings)

    entity_embeddings_path = os.path.join(
        data_path, "entity_embeddings_5w_" + str(word_embedding_dim) + ".npy"
    )
    logger.info(f"Saving word embeddings in {entity_embeddings_path}")
    np.save(entity_embeddings_path, entity_embeddings)

    return news_feature_path, word_embeddings_path, entity_embeddings_path, variables

In [120]:
news_feature_file, word_embeddings_file, entity_embeddings_file, variables = generate_embeddings1(
    data_path,
    news_words,
    news_entities,
    train_entities,
    valid_entities,
    max_sentence=10,
    word_embedding_dim=100,
)

03:48:09 INFO: Downloading glove...
03:48:09 INFO: Loading glove with embedding dimension 100...
03:48:22 INFO: Reading train entities...
03:48:23 INFO: Reading valid entities...
03:48:24 INFO: Generating word and entity indexes...
03:48:26 INFO: Generating word embeddings...
03:48:27 INFO: Generating entity embeddings...
03:48:27 INFO: Saving word and entity features in /tmp/tmpvt_ozvte/mind-dkn/doc_feature.txt
03:48:27 INFO: Saving word embeddings in /tmp/tmpvt_ozvte/mind-dkn/word_embeddings_5w_100.npy
03:48:27 INFO: Saving word embeddings in /tmp/tmpvt_ozvte/mind-dkn/entity_embeddings_5w_100.npy


In [122]:
for k, v in variables.items():
    print(k, type(v))

word_set <class 'set'>
word_embedding_dict <class 'dict'>
entity_embedding_dict <class 'dict'>
word_dict <class 'dict'>
news_word_string_dict <class 'dict'>
news_entity_string_dict <class 'dict'>
entity2index <class 'dict'>
word_embeddings <class 'numpy.ndarray'>
entity_embeddings <class 'numpy.ndarray'>


In [127]:
!head -5 {news_feature_file}

N55528 1,2,3,4,5,6,7,5,8,9 0,0,3,3,2,2,0,2,1,0
N19639 10,11,12,13,14,15,0,0,0,0 0,0,0,0,4,4,0,0,0,0
N61837 1,16,17,18,19,20,21,22,1,23 0,0,0,0,0,0,0,0,0,0
N53526 24,25,26,27,28,29,19,30,31,32 0,0,0,0,0,0,0,0,0,0
N38324 30,33,34,35,17,36,37,38,33,39 0,0,0,0,0,5,5,0,0,0


In [167]:
word_set = variables['word_set']
word_embedding_dict = variables['word_embedding_dict']
entity_embedding_dict = variables['entity_embedding_dict']
word_dict = variables['word_dict']
news_word_string_dict = variables['news_word_string_dict']
news_entity_string_dict = variables['news_entity_string_dict']
entity2index = variables['entity2index']
word_embeddings = variables['word_embeddings']
entity_embeddings = variables['entity_embeddings']

In [177]:
len(word_set), len(word_embedding_dict), len(entity_embedding_dict), entity_embeddings.shape, len(word_dict)

(400000, 400000, 31451, (17043, 100), 30003)

In [159]:
len(news_words), len(news_word_string_dict), len(entity2index)

(65238, 65238, 17042)

In [153]:
nid = 'N55528'
news_words[nid]

['the',
 'brands',
 'queen',
 'elizabeth',
 'prince',
 'charles',
 'and',
 'prince',
 'philip',
 'swear',
 'by']

In [154]:
news_word_string_dict[nid]

[1, 2, 3, 4, 5, 6, 7, 5, 8, 9]

In [156]:
news_entity_string_dict[nid]

[0, 0, 3, 3, 2, 2, 0, 2, 1, 0]

In [162]:
index2entity = {v: k for k, v in entity2index.items()}
l = [index2entity.get(idx, None) for idx in news_entity_string_dict[nid]]
l

[None,
 None,
 'Q9682',
 'Q9682',
 'Q43274',
 'Q43274',
 None,
 'Q43274',
 'Q80976',
 None]

In [168]:
len(entity_embedding_dict), entity_embeddings.shape

(31451, (17043, 100))
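
The numbers above are consistent with the embedding matrix holding one row per indexed entity (17042) plus one extra row for the padding index 0, while the dictionary keeps vectors for all 31451 entities in the knowledge graph file. This is an inference from the shapes, not a statement about the library internals. A minimal sketch of such a construction, using plain lists, toy 2-dimensional vectors, and a hypothetical helper `build_embedding_matrix`:

```python
def build_embedding_matrix(item2index, embedding_dict, dim):
    """Return a (len(item2index) + 1) x dim matrix; row 0 is the all-zero padding row."""
    matrix = [[0.0] * dim for _ in range(len(item2index) + 1)]
    for item, idx in item2index.items():
        vec = embedding_dict.get(item)
        if vec is not None:  # items without a vector keep the zero row
            matrix[idx] = list(vec)
    return matrix


# Index assignment taken from the lookup shown earlier; the vectors are toy values.
entity2index_toy = {"Q80976": 1, "Q43274": 2, "Q9682": 3}
entity_vectors_toy = {
    "Q80976": [0.1, 0.2],
    "Q43274": [0.3, 0.4],
    "Q9682": [0.5, 0.6],
    "Q41": [0.7, 0.8],  # present in the dictionary but never indexed, so it gets no row
}
matrix = build_embedding_matrix(entity2index_toy, entity_vectors_toy, dim=2)
print(len(matrix), matrix[0], matrix[3])
```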

In [141]:
eid = 'Q41'
entity_embedding_dict[eid]

array([-0.063388, -0.181451,  0.057501, -0.091254, -0.076217, -0.052525,
        0.0505  , -0.224871, -0.018145,  0.030722,  0.064276,  0.073063,
        0.039489,  0.159404, -0.128784,  0.016325,  0.026797,  0.13709 ,
        0.001849, -0.059103,  0.012091,  0.045418,  0.000591,  0.211337,
       -0.034093, -0.074582,  0.014004, -0.099355,  0.170144,  0.109376,
       -0.014797,  0.071172,  0.080375,  0.045563, -0.046462,  0.070108,
        0.015413, -0.020874, -0.170324, -0.00113 ,  0.05981 ,  0.054342,
        0.027358, -0.028995, -0.224508,  0.066281, -0.200006,  0.018186,
        0.082396,  0.167178, -0.136239,  0.055134, -0.080195, -0.00146 ,
        0.031078, -0.017084, -0.091176, -0.036916,  0.124642, -0.098185,
       -0.054836,  0.152483, -0.053712,  0.092816, -0.112044, -0.072247,
       -0.114896, -0.036541, -0.186339, -0.16061 ,  0.037342, -0.133474,
        0.11008 ,  0.070678, -0.005586, -0.046667, -0.07201 ,  0.086424,
        0.026165,  0.030561,  0.077888, -0.117226, 

In [135]:
word = 'the'
word_embedding_dict[word]

array([-0.038194, -0.24487 ,  0.72812 , -0.39961 ,  0.083172,  0.043953,
       -0.39141 ,  0.3344  , -0.57545 ,  0.087459,  0.28787 , -0.06731 ,
        0.30906 , -0.26384 , -0.13231 , -0.20757 ,  0.33395 , -0.33848 ,
       -0.31743 , -0.48336 ,  0.1464  , -0.37304 ,  0.34577 ,  0.052041,
        0.44946 , -0.46971 ,  0.02628 , -0.54155 , -0.15518 , -0.14107 ,
       -0.039722,  0.28277 ,  0.14393 ,  0.23464 , -0.31021 ,  0.086173,
        0.20397 ,  0.52624 ,  0.17164 , -0.082378, -0.71787 , -0.41531 ,
        0.20335 , -0.12763 ,  0.41367 ,  0.55187 ,  0.57908 , -0.33477 ,
       -0.36559 , -0.54857 , -0.062892,  0.26584 ,  0.30205 ,  0.99775 ,
       -0.80481 , -3.0243  ,  0.01254 , -0.36942 ,  2.2167  ,  0.72201 ,
       -0.24978 ,  0.92136 ,  0.034514,  0.46745 ,  1.1079  , -0.19358 ,
       -0.074575,  0.23353 , -0.052062, -0.22044 ,  0.057162, -0.15806 ,
       -0.30798 , -0.41625 ,  0.37972 ,  0.15006 , -0.53212 , -0.2055  ,
       -1.2526  ,  0.071624,  0.70565 ,  0.49744 , 

In [145]:
word_dict[word]

1
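
The title of `N55528` has 11 tokens, but its word feature has only 10 indices because `max_sentence=10`. The sketch below reproduces that feature under three assumptions that are not confirmed by the source: indices are assigned in order of first appearance starting at 1, unknown words map to 0, and short titles are padded with 0 at the end. The helper names are hypothetical.

```python
def build_word_dict(titles):
    """Assign each distinct token an index, starting at 1, in order of first appearance."""
    word_dict = {}
    for tokens in titles:
        for tok in tokens:
            if tok not in word_dict:
                word_dict[tok] = len(word_dict) + 1
    return word_dict


def encode_title(tokens, word_dict, max_len=10):
    """Truncate to max_len tokens, map tokens to indices, and pad with 0 at the end."""
    ids = [word_dict.get(tok, 0) for tok in tokens[:max_len]]
    return ids + [0] * (max_len - len(ids))


title = ["the", "brands", "queen", "elizabeth", "prince", "charles",
         "and", "prince", "philip", "swear", "by"]
toy_word_dict = build_word_dict([title])
encoded = encode_title(title, toy_word_dict)
print(encoded)  # [1, 2, 3, 4, 5, 6, 7, 5, 8, 9]
```

The eleventh token (`by`) is dropped by truncation, and the repeated word `prince` maps to the same index (5) both times.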

## Create hyper-parameters

In [9]:
yaml_file = maybe_download(url="https://recodatasets.blob.core.windows.net/deeprec/deeprec/dkn/dkn_MINDsmall.yaml", 
                           work_directory=data_path)
hparams = prepare_hparams(yaml_file,
                          news_feature_file=news_feature_file,
                          user_history_file=user_history_file,
                          wordEmb_file=word_embeddings_file,
                          entityEmb_file=entity_embeddings_file,
                          epochs=epochs,
                          history_size=history_size,
                          batch_size=batch_size)

100%|██████████| 2.00/2.00 [00:00<00:00, 774KB/s]

The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.





11:47:21 INFO: NumExpr defaulting to 8 threads.


In [128]:
!cat {yaml_file}

data:
  doc_size: 10
  history_size: 50
  word_size: 30004
  entity_size: 17043
  data_format: dkn
  
info:
  metrics:
  - auc
  pairwise_metrics:
  - group_auc
  - mean_mrr
  - ndcg@5;10
  show_step: 10000
  
model:
  method : classification
  activation:
  - sigmoid
  attention_activation: relu
  attention_dropout: 0.0
  attention_layer_sizes: 100
  dim: 100
  use_entity: true
  use_context: false
 
  entity_dim: 100
  entity_embedding_method: TransE
  transform: true
 
  dropout:
  - 0.0
  filter_sizes:
  - 1
  - 2
  - 3
  layer_sizes:
  - 300
  # model_type: DKN_without_context
  model_type: dkn
  num_filters: 100
  infer_model_name : epoch_2

train:
  batch_size: 100
  embed_l1: 0.000
  embed_l2: 0.000
  epochs: 10
  init_method: xavier_normal
  init_value: 0.1
  layer_l1: 0.000
  layer_l2: 0.000001
  learning_rate: 0.0003
  loss: log_loss
  is_clip_norm: False
  max_grad_norm: 0.2
  optimizer: adam
  save_model: False
  save_epoch : 2
  enable_BN : True


## Train the DKN model

In [10]:
model = DKN(hparams, DKNTextIterator)

Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.




Instructions for updating:
Use keras.layers.BatchNormalization instead.  In particular, `tf.control_dependencies(tf.GraphKeys.UPDATE_OPS)` should not be used (consult the `tf.keras.layers.batch_normalization` documentation).




Instructions for updating:
Please use `layer.__call__` method instead.




Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where




In [None]:
model.fit(train_file, valid_file)

## Evaluate the DKN model

In [None]:
res = model.run_eval(valid_file)
print(res)

In [None]:
pm.record("res", res)
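
The configuration lists `group_auc` as a pairwise metric, which is evaluated within each impression session and then averaged. As a minimal pure-Python sketch of that idea (not the library's implementation; skipping impressions that contain only one class is an assumption made here):

```python
from collections import defaultdict


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else (0.5 if p == n else 0.0) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def group_auc(labels, scores, impression_ids):
    """Average the per-impression AUC over all impressions where it is defined."""
    groups = defaultdict(lambda: ([], []))
    for l, s, g in zip(labels, scores, impression_ids):
        groups[g][0].append(l)
        groups[g][1].append(s)
    values = [auc(ls, ss) for ls, ss in groups.values()]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


# Toy data: impression 1 is ranked perfectly, impression 2 is ranked wrongly.
labels = [1, 0, 0, 0, 1]
scores = [0.9, 0.3, 0.5, 0.6, 0.4]
impressions = [1, 1, 1, 2, 2]
toy_group_auc = group_auc(labels, scores, impressions)
print(toy_group_auc)  # 0.5
```

Unlike a global AUC, this metric never compares scores of candidates shown to different users, which is why the impression ID matters at evaluation time.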

## Document embedding inference API

After training, you can obtain document embeddings through the document embedding inference API. The input file has the same format as the document feature file. The output file format is: `[Newsid] [embedding]`

In [None]:
model.run_get_embedding(news_feature_file, infer_embedding_file)
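
Once the embedding file is written, it can be loaded for tasks such as finding similar articles. The sketch below assumes each line is a news ID followed by the vector values separated by commas or whitespace. The exact delimiter is not specified above, so the parser accepts both. The helper names and the sample lines are illustrative only.

```python
import math
import re


def read_doc_embeddings(lines):
    """Parse '[Newsid] [embedding]' lines into a dict of news_id -> list of floats."""
    embeddings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        news_id, rest = line.split(maxsplit=1)
        embeddings[news_id] = [float(x) for x in re.split(r"[,\s]+", rest) if x]
    return embeddings


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy lines standing in for the contents of the embedding file.
sample_lines = ["N1 1.0,0.0", "N2 0.0,2.0", "N3 3.0,0.0"]
doc_emb = read_doc_embeddings(sample_lines)
print(cosine_similarity(doc_emb["N1"], doc_emb["N3"]))  # 1.0
print(cosine_similarity(doc_emb["N1"], doc_emb["N2"]))  # 0.0
```

With a real file, the same function can be called as `read_doc_embeddings(open(infer_embedding_file))`.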

## Results on large MIND dataset

The table below reports performance on the large MIND dataset (1,000,000 users, 161,013 news articles and 15,777,377 impression logs).

| Models | g-AUC | MRR | NDCG@5 | NDCG@10 |
| :------ | :------: | :------: | :------: | :------: |
| LibFM | 0.5993 | 0.2823 | 0.3005 | 0.3574 |
| Wide&Deep | 0.6216 | 0.2931 | 0.3138 | 0.3712 |
| DKN | 0.6436 | 0.3128 | 0.3371 | 0.3908 |


Note that the DKN results were obtained with Microsoft Recommenders, while the results of the first two models come from the MIND paper \[3\].
All results are compared on the same test dataset.

One epoch of DKN takes 6381.3s on GPU (5066.6s for training, 1314.7s for evaluation). Hardware specification for running the large dataset: <br>
GPU: Tesla P100-PCIE-16GB <br>
CPU: 6 cores Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz

## References

\[1\] Wang, Hongwei, et al. "DKN: Deep Knowledge-Aware Network for News Recommendation." Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018.<br>
\[2\] Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE. https://github.com/thunlp/KB2E <br>
\[3\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020. https://msnews.github.io/competition.html <br>
\[4\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/