# Introduction

This repository proposes two extensions to the implementation of [KRED: Knowledge-Aware Document Representation for News Recommendations](https://arxiv.org/abs/1910.11494) [1].


## Model description



KRED is a knowledge-enhanced framework that enriches a document embedding with information from a knowledge graph for multiple news recommendation tasks. The framework has two main parts: representation enhancement (left in the figure below) and multi-task training (right).

![](./framework.PNG)

Two extensions to this model have been implemented:
- replacing the attention layer over the entities with a multi-head attention layer;
- adding the embedding of the news category to the context embedding layer. Each news article has a first, general category and a second, more specific category; the two categories are embedded separately.

The multi-head attention, the embedding of the first news category and the embedding of the second news category can each be enabled or disabled in the config.
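As a rough illustration of how these switches combine, the sketch below mirrors the four config keys set later in this notebook (`use_mh_attention`, `mh_number_of_heads`, `use_entity_category`, `use_second_entity_category`). The `describe_variant` helper is hypothetical and not part of the repository, and it assumes that `use_entity_category` controls the first (general) category and `use_second_entity_category` the second (specific) one.

```python
# Hypothetical helper (not part of the repository): summarise which KRED
# variant a configuration selects. The key names mirror the ones used in
# the "Create hyper-parameters" cell of this notebook.

def describe_variant(config):
    model_cfg = config["model"]
    data_cfg = config["data"]

    parts = []
    if model_cfg["use_mh_attention"]:
        parts.append(f"multi-head attention ({model_cfg['mh_number_of_heads']} heads)")
    # Assumption: this flag controls the first (general) category.
    if data_cfg["use_entity_category"]:
        parts.append("first category embedding")
    # Assumption: this flag controls the second (specific) category.
    if data_cfg["use_second_entity_category"]:
        parts.append("second category embedding")

    # With every switch off, the original KRED model is used.
    return "KRED + " + " + ".join(parts) if parts else "KRED (original)"


baseline = {
    "model": {"use_mh_attention": False, "mh_number_of_heads": 6},
    "data": {"use_entity_category": False, "use_second_entity_category": False},
}
extended = {
    "model": {"use_mh_attention": True, "mh_number_of_heads": 6},
    "data": {"use_entity_category": True, "use_second_entity_category": False},
}

print(describe_variant(baseline))
print(describe_variant(extended))
```

The number of heads only matters when `use_mh_attention` is enabled, which is why the helper ignores it otherwise.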

## Initialization

If you download this notebook and run it on Colab, run the following commands first.

In [None]:
# Clone repo
!git clone https://github.com/lolloberga/Recommendation_System_KRED.git
%cd Recommendation_System_KRED

In [None]:
# Install required libraries
!pip install -r requirements.txt

In [None]:
import argparse
import os
import sys

from utils.util import *
from train_test import *
from parse_config import ConfigParser

## Dataset description and download

The MIND dataset [2] is a large-scale English news dataset collected from anonymized behavior logs of the Microsoft News website. It contains 1,000,000 users, 161,013 news articles and 15,777,377 impression logs. Every news article has rich textual content, including title, abstract, body, category and entities. Each impression log records the clicked and non-clicked news of one impression, together with the user's news click history before that impression.
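For illustration, the sketch below reads rows in the `news.tsv` column layout documented for MIND (news ID, category, subcategory, title, abstract, URL, title entities, abstract entities) and builds separate index vocabularies for the two category levels, which is the kind of input a pair of separate category embeddings would need. The sample rows are made up, and `parse_news_line` and `build_vocab` are hypothetical helpers rather than functions from this repository.

```python
import json

NEWS_COLUMNS = ["news_id", "category", "subcategory", "title",
                "abstract", "url", "title_entities", "abstract_entities"]


def parse_news_line(line):
    """Split one tab-separated news.tsv row into a dict."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(NEWS_COLUMNS, fields))
    # Entity columns hold JSON lists; an empty column means no entities.
    for key in ("title_entities", "abstract_entities"):
        record[key] = json.loads(record[key]) if record[key] else []
    return record


def build_vocab(values):
    """Map each distinct value to an integer index, reserving 0 for unknown."""
    vocab = {}
    for value in values:
        vocab.setdefault(value, len(vocab) + 1)
    return vocab


# Made-up rows for illustration only.
rows = [
    "N1\tsports\tfootball_nfl\tSome title\tSome abstract\thttps://example.com/1\t[]\t[]",
    "N2\tsports\tbasketball_nba\tAnother title\t\thttps://example.com/2\t"
    '[{"WikidataId": "Q1", "Label": "Example"}]\t[]',
]
news = [parse_news_line(row) for row in rows]

# One vocabulary per category level, so each level can get its own embedding.
category_vocab = build_vocab(n["category"] for n in news)
subcategory_vocab = build_vocab(n["subcategory"] for n in news)
print(category_vocab)
print(subcategory_vocab)
```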

For quicker training and evaluation, we use the MINDsmall dataset, a sample of 50k users drawn from MINDlarge. MINDsmall has the same file format as MINDlarge.

MINDsmall_train is used for training and MINDsmall_dev for evaluation. Each split consists of a news file and a behaviors file. A more detailed description of the data is available in the [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md).
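As a small illustration of the behaviors file, the sketch below parses one row in the layout documented for MIND (impression ID, user ID, time, click history, impressions), where each impression entry is a news ID followed by `-1` (clicked) or `-0` (not clicked). The row is made up and `parse_behavior_line` is a hypothetical helper, not the loader used by this repository.

```python
def parse_behavior_line(line):
    """Parse one tab-separated behaviors.tsv row."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")

    clicked, not_clicked = [], []
    for item in impressions.split():
        # Each entry looks like "N50-1": news ID, dash, click label.
        news_id, label = item.rsplit("-", 1)
        (clicked if label == "1" else not_clicked).append(news_id)

    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time,
        "history": history.split(),  # news clicked before this impression
        "clicked": clicked,
        "not_clicked": not_clicked,
    }


# Made-up row for illustration only.
row = "1\tU100\t11/11/2019 9:05:58 AM\tN10 N20 N30\tN40-0 N50-1 N60-0"
log = parse_behavior_line(row)
print(log["history"], log["clicked"], log["not_clicked"])
```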

In [4]:
# Options: demo, small, large
MIND_type = 'small'
data_path = "./data/"

train_news_file = os.path.join(data_path, 'train', r'news.tsv')
train_behaviors_file = os.path.join(data_path, 'train', r'behaviors.tsv')
valid_news_file = os.path.join(data_path, 'valid', r'news.tsv')
valid_behaviors_file = os.path.join(data_path, 'valid', r'behaviors.tsv')
knowledge_graph_file = os.path.join(data_path, 'kg/wikidata-graph', r'wikidata-graph.tsv')
entity_embedding_file = os.path.join(data_path, 'kg/wikidata-graph', r'entity2vecd100.vec')
relation_embedding_file = os.path.join(data_path, 'kg/wikidata-graph', r'relation2vecd100.vec')

mind_url, mind_train_dataset, mind_dev_dataset, _ = get_mind_data_set(MIND_type)

kg_url = "https://kredkg.blob.core.windows.net/wikidatakg/"

if not os.path.exists(train_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'train'), mind_train_dataset)

if not os.path.exists(valid_news_file):
    download_deeprec_resources(mind_url, os.path.join(data_path, 'valid'), mind_dev_dataset)

if not os.path.exists(knowledge_graph_file):
    download_deeprec_resources(kg_url, os.path.join(data_path, 'kg'), "kg.zip")
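The cell above downloads each archive only when one representative file is missing. The sketch below shows that existence check in isolation, using a temporary directory so no real data is touched. The `missing_files` helper is hypothetical, and checking one file per archive assumes that an earlier download was extracted completely.

```python
import os
import tempfile


def missing_files(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


# Demonstration in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as data_path:
    train_news = os.path.join(data_path, "train", "news.tsv")
    valid_news = os.path.join(data_path, "valid", "news.tsv")

    os.makedirs(os.path.dirname(train_news))
    open(train_news, "w").close()  # pretend the training split is present

    still_needed = missing_files([train_news, valid_news])
    print([os.path.relpath(p, data_path) for p in still_needed])
```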

## Loading config

In [None]:
sys.path.append('')
sys.argv = [''] # reset argv so argparse does not see the notebook kernel's own arguments

parser = argparse.ArgumentParser(description='KRED')


parser.add_argument('-c', '--config', default="./config.json", type=str,
                    help='config file path (default: ./config.json)')
parser.add_argument('-r', '--resume', default=None, type=str,
                    help='path to latest checkpoint (default: None)')
parser.add_argument('-d', '--device', default=None, type=str,
                    help='indices of GPUs to enable (default: all)')

config = ConfigParser.from_args(parser)
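The `sys.argv` reset in the cell above is a common workaround in notebooks, where the kernel's own command-line arguments would otherwise reach `argparse`. The sketch below shows the underlying issue with plain `argparse` and two standard ways around it. The kernel arguments are made-up placeholders, and this notebook does not show whether `ConfigParser.from_args` can accept pre-parsed arguments, so treat this as an illustration rather than a drop-in replacement.

```python
import argparse

parser = argparse.ArgumentParser(description="KRED")
parser.add_argument("-c", "--config", default="./config.json", type=str)
parser.add_argument("-r", "--resume", default=None, type=str)
parser.add_argument("-d", "--device", default=None, type=str)

# Arguments similar to what a Jupyter kernel may leave in sys.argv
# (the path is a made-up placeholder).
kernel_argv = ["-f", "/tmp/kernel-1234.json"]

# Option 1: parse an explicit empty list so sys.argv is never consulted.
args = parser.parse_args([])

# Option 2: keep known options and collect everything else separately.
known, unknown = parser.parse_known_args(kernel_argv)

print(args.config, known.config, unknown)
```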


## Create hyper-parameters

In [3]:
epochs = 10
batch_size = 64
train_type = "single_task"
task = "user2item" # one of: user2item, item2item, vert_classify, pop_predict

config['trainer']['epochs'] = epochs
config['data_loader']['batch_size'] = batch_size
config['trainer']['training_type'] = train_type
config['trainer']['task'] = task
config['trainer']['save_period'] = epochs // 2  # integer division keeps save_period an int
# The following parameters define which of the extensions are used;
# with all of them set to False, the original KRED model is executed.
config['model']['use_mh_attention'] = False
config['model']['mh_number_of_heads'] = 6
config['data']['use_entity_category'] = False
config['data']['use_second_entity_category'] = False



## Process dataset


To speed up execution, you can save the sentence embeddings to Google Drive.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=False)
# WARNING: the following folder must exist, otherwise it will raise errors when
# saving sentence embeddings
sentence_embedding_folder = "/content/drive/MyDrive/Dataset/"

In [None]:
# Run this cell only once to save the news embeddings in the drive folder
save_embedding_news("train", sentence_embedding_folder)
save_embedding_news("valid", sentence_embedding_folder)

In [None]:
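# Pass the embeddings folder if it was defined in the Drive cell above;
# otherwise call load_data_mind without it.
try:
    data = load_data_mind(config, sentence_embedding_folder)
except NameError:
    data = load_data_mind(config)
test_data = data[-1]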

In [None]:
# Limit the user2item validation dataset, since otherwise the validation during training takes too long to run
def limit_user2item_validation_data(data, size):
    test_data = data[-1]
    test_data_reduced = {key: test_data[key][:size] for key in test_data.keys()}
    # Concatenate the old tuple with the updated validation data
    return data[:-1] + (test_data_reduced,)

data = limit_user2item_validation_data(data, 10000)
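To make the effect of the truncation concrete, the sketch below runs the same function on a toy tuple. It assumes, as the function itself does, that the last element of `data` is a dict of equal-length sequences; the key names and values are placeholders, not the ones produced by `load_data_mind`.

```python
def limit_user2item_validation_data(data, size):
    test_data = data[-1]
    test_data_reduced = {key: test_data[key][:size] for key in test_data}
    # Concatenate the old tuple with the reduced validation data.
    return data[:-1] + (test_data_reduced,)


# Toy stand-in for the loaded data; the real tuple holds more elements
# and different keys.
toy_data = (
    "train placeholder",
    {
        "item1": ["U1", "U2", "U3", "U4"],
        "item2": ["N1", "N2", "N3", "N4"],
        "label": [1, 0, 0, 1],
    },
)

limited = limit_user2item_validation_data(toy_data, 2)
print(limited[-1])
```

The function keeps the first `size` entries of every field rather than drawing a random sample, and it returns a new tuple without modifying the original validation dict.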

## Train the KRED model

In [None]:
if train_type == "single_task":
    single_task_training(config, data)
else:
    multi_task_training(config, data)

## Evaluate the KRED model

In [None]:
testing(test_data, config)
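The user2item results reported further down use AUC and NDCG@10. For readers unfamiliar with the second metric, the sketch below computes NDCG@k for a single impression using the common exponential-gain definition. It is a generic reference implementation with made-up labels and scores, not the code that `testing` runs, and the way this repository aggregates the metric across impressions is not shown here.

```python
import math


def dcg_at_k(labels_in_ranked_order, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(
        (2 ** label - 1) / math.log2(position + 2)
        for position, label in enumerate(labels_in_ranked_order[:k])
    )


def ndcg_at_k(labels, scores, k=10):
    """NDCG@k for one impression: labels are 0/1 clicks, scores are model outputs."""
    # Order the labels by descending model score.
    pairs = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    ranked = [label for _, label in pairs]
    # The ideal ranking puts every clicked item first.
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Made-up impression: the clicked item is ranked second by the model.
labels = [0, 1, 0]
scores = [0.9, 0.5, 0.1]
ndcg = ndcg_at_k(labels, scores)
print(round(ndcg, 4))
```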

## Performance of user2item on MINDsmall

For reference, these are the results of the user2item task on the MINDsmall dataset:

| Model | AUC | NDCG@10 |
| :------- | :------- | :------- |
| KRED (single-task training) | 0.6231 | 0.3628 |
| KRED + multi-head attention (6 heads) | 0.5844 | 0.3452 |
| KRED + multi-head attention (12 heads) | 0.5845 | 0.3470 |
| KRED + news first category embedding | 0.6421 | 0.3837 |
| KRED + news second category embedding | 0.6218 | 0.3576 |
| KRED + news first and second categories embedding | 0.6397 | 0.3777 |
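To make the comparison with the baseline easier to read, the short script below computes the absolute difference of each variant from single-task KRED, using only the numbers in the table above. By these figures, the first-category embedding and the combined first-and-second-category embedding improve on the baseline in both metrics, while the multi-head attention variants and the second-category-only variant score lower.

```python
# (AUC, NDCG@10) pairs copied from the table above.
results = {
    "KRED (single-task training)": (0.6231, 0.3628),
    "KRED + multi-head attention (6 heads)": (0.5844, 0.3452),
    "KRED + multi-head attention (12 heads)": (0.5845, 0.3470),
    "KRED + news first category embedding": (0.6421, 0.3837),
    "KRED + news second category embedding": (0.6218, 0.3576),
    "KRED + news first and second categories embedding": (0.6397, 0.3777),
}

base_auc, base_ndcg = results["KRED (single-task training)"]

# Absolute difference from the baseline, rounded to the table's precision.
deltas = {
    name: (round(auc - base_auc, 4), round(ndcg - base_ndcg, 4))
    for name, (auc, ndcg) in results.items()
}

for name, (d_auc, d_ndcg) in deltas.items():
    print(f"{name}: AUC {d_auc:+.4f}, NDCG@10 {d_ndcg:+.4f}")
```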


## References

[1] Liu, Danyang, et al. "KRED: Knowledge-Aware Document Representation for News Recommendations." Fourteenth ACM Conference on Recommender Systems. 2020.

[2] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.