# LSTUR: Neural News Recommendation with Long- and Short-term User Representations
LSTUR \[1\] is a news recommendation approach that captures both a user's long-term preferences and short-term interests. We use this algorithm to rank the articles of the EB-NeRD dataset.

## Data format
The data format and the available data are described [here](https://recsys.eb.dk/dataset/). The dataset comes in demo, small and large versions, and a separate test set is available. We transformed the data by manipulating, reordering and dropping columns to obtain the following format, which should suit the algorithm's implementation.
 
### article data
This file contains the article information: article ID, category, title, body and URL.
One simple example: <br>

`3044020	underholdning	Prins Harry tvunget til dna-test	Den britiske tabloidavis The Sun fortsætter med at lække historier fra den kommende bog om prinsesse Diana, skrevet af prinsessens veninde Simone Simmons.Onsdag er det historien om, at det britiske kongehus lod prins Harry dna-teste for at sikre sig, at prins Charles var far til ham.Hoffet frygtede, at Dianas tidligere elsker James Hewitt, var far til Harry.Dna-testen fandt sted, da Harry var 11 år gammel.Det var en slet skjult hemmelighed, at Diana og Hewitt hyggede sig i sengen, og der var simpelthen en udbredt frygt på Buckingham Palace for, at lidenskaben havde resulteret i rødhårede Harry.Diana selv afviste rygterne og påpegede, at hvis man regnede på datoerne, kunne Hewitt ikke være far til Harry, men frygten for arvefølgen var så stor, at den 11-årige Harry måtte tage testen trods Dianas forsikringer om hans fædrene ophav.	https://ekstrabladet.dk/underholdning/udlandkendte/article3044020.ece
`
<br>

In general, each line of the data file describes one article: <br>

`[Article ID] [Category] [Article Title] [Article Body] [Article URL]`
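
A minimal sketch of reading one line in this layout, assuming the five fields are separated by tabs as in the example above. `Article` and `parse_article_line` are hypothetical names used only for illustration, and the body is shortened to a placeholder.

```python
from typing import NamedTuple


class Article(NamedTuple):
    article_id: str
    category: str
    title: str
    body: str
    url: str


def parse_article_line(line: str) -> Article:
    """Split one tab-separated article line into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return Article(*fields)


# Shortened version of the example line (body replaced by a placeholder)
example = (
    "3044020\tunderholdning\tPrins Harry tvunget til dna-test\t(body text)\t"
    "https://ekstrabladet.dk/underholdning/udlandkendte/article3044020.ece\n"
)
article = parse_article_line(example)
print(article.article_id, article.category)
```

Raising on an unexpected field count makes it easy to spot lines where a tab or newline inside the title or body has shifted the columns.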

<br>

We generate a word_dict file to transform the words in news titles into word indexes, and an embedding matrix is initialized from pretrained GloVe embeddings.

### behaviors data
One simple example: <br>
`1	U82271	11/11/2019 3:28:58 PM	N3130 N11621 N12917 N4574 N12140 N9748	N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0`
<br>

In general, each line of the data file represents one impression. The format is: <br>

`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`

<br>

User Click History lists the news the user clicked before Impression Time. Impression News lists the news displayed in the impression, in the following format:<br>

`[News ID 1]-[label1] ... [News ID n]-[labeln]`

<br>
Label indicates whether the user clicked the news. All information about the news in User Click History and Impression News can be found in the article data file.
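
The layout above can be read back with a few string splits. This is a minimal sketch, assuming tab-separated columns and space-separated IDs as in the example line; `parse_behavior_line` is a hypothetical helper and the example is shortened.

```python
def parse_behavior_line(line: str):
    """Split one tab-separated behaviors line into its parts."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    clicked_history = history.split()

    candidates, labels = [], []
    for item in impressions.split():
        # Each item looks like "<News ID>-<label>"
        news_id, label = item.rsplit("-", 1)
        candidates.append(news_id)
        labels.append(int(label))

    return impression_id, user_id, time, clicked_history, candidates, labels


# Shortened version of the example line
example = "1\tU82271\t11/11/2019 3:28:58 PM\tN3130 N11621\tN13390-0 N15368-1 N481-0"
_, user, _, hist, cands, labels = parse_behavior_line(example)
print(user, hist, cands, labels)
```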

## Imports and Global settings

In [2]:
import os, sys, zipfile, logging
import numpy as np
import polars as pl
import tensorflow as tf

from pathlib import Path
from tqdm import tqdm

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.lstur import LSTURModel
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.11.8 (main, Feb  6 2024, 21:21:21) [Clang 15.0.0 (clang-1500.1.0.2.5)]
Tensorflow version: 2.15.1


### Configure logging settings

In [3]:
# configurations
tf.get_logger().setLevel('ERROR') # only show error messages

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)
LOG.setLevel(logging.INFO)

pl.Config.set_tbl_rows(100)
pl.Config.set_streaming_chunk_size(500_000)

polars.config.Config

## Prepare Parameters and File Locations

In [4]:
# LSTUR parameters
EPOCHS = 5
SEED = 40
BATCH_SIZE = 32

# whether to re-compute the dataset
FORCE_RELOAD = True

# path to the dataset
DATASET_NAME = "demo" # one of: demo, small, large
TEMP_DIR = "tmp"
GROUP_PROJECT_PATH = "/Users/maxkleinegger/Downloads/"

In [5]:
# dataset parquet files path
PATH = Path(os.path.join(GROUP_PROJECT_PATH, DATASET_NAME))
train_behaviors_path = os.path.join(PATH, 'train', 'behaviors.parquet')
train_history_path = os.path.join(PATH, 'train', 'history.parquet')
val_behaviors_path = os.path.join(PATH, 'validation', 'behaviors.parquet')
val_history_path = os.path.join(PATH, 'validation', 'history.parquet')
articles_path = os.path.join(PATH, 'articles.parquet')

LOG.info(PATH)

INFO:__main__:/Users/maxkleinegger/Downloads/demo


In [6]:
# artifacts file path
TMP_PATH = Path(os.path.join(GROUP_PROJECT_PATH, TEMP_DIR))
tmp_train_path = Path(os.path.join(TMP_PATH, DATASET_NAME, 'train'))
tmp_val_path = Path(os.path.join(TMP_PATH, DATASET_NAME, 'val'))

# create directories if not exist
tmp_train_path.mkdir(exist_ok=True, parents=True)
tmp_val_path.mkdir(exist_ok=True, parents=True)

train_behaviors_file = os.path.join(tmp_train_path, 'behaviors.tsv')
val_behaviors_file = os.path.join(tmp_val_path, 'behaviors.tsv')
articles_file = os.path.join(TMP_PATH, 'articles.tsv')

# hyperparameters
yaml_file = os.path.join('../src/group_33/configs/lstur.yaml')
user_dict_file = os.path.join(tmp_train_path, 'user_dict.pkl')
words_dict_file = os.path.join(tmp_train_path, 'words_dict.pkl')
word_embeddings_file = os.path.join(tmp_train_path, 'word_embeddings.npy')

## Transform parquet files

In [13]:
COL_IMPRESSION_ID_IDX = 0
COL_USER_ID_IDX = 8
COL_USER_ID = "user_id"
COL_IMPRESSION_TIME_IDX = 2
COL_INVIEW_ARTICLE_IDS_IDX = 6
COL_CLICKED_ARTICLE_IDS_IDX = 7


def transfrom_behavior_file(
    behaviors_path, history_path, result_path, history_size=None, sample_size=1
):
    def transform_row(row):
        impression_id = row[COL_IMPRESSION_ID_IDX]
        # Prefix the raw user ID with "U" so it matches the keys of user_history
        # (assumes the parquet files store plain numeric IDs)
        user_id = f"U{row[COL_USER_ID_IDX]}"
        impression_time = row[COL_IMPRESSION_TIME_IDX]
        clicked_articles = user_history.get(user_id, {}).get("article_id", [])
        timestamps = user_history.get(user_id, {}).get("impression_time", [])

        # Filter click history to include only clicks before the impression time.
        # IDs are written as "N<id>" strings because MINDIterator strips the
        # first character of every news ID it reads from the behaviors file.
        user_click_history = [
            f"N{article_id}"
            for article_id, timestamp in zip(clicked_articles, timestamps)
            if timestamp < impression_time
        ]

        user_click_history = user_click_history[-history_size:]
        user_click_history_str = " ".join(user_click_history)

        # Prepare impression news
        inview_articles = row[COL_INVIEW_ARTICLE_IDS_IDX]
        clicked_articles = row[COL_CLICKED_ARTICLE_IDS_IDX]
        impression_news = [
            f"N{article_id}-{1 if article_id in clicked_articles else 0}"
            for article_id in inview_articles
        ]
        impression_news_str = " ".join(impression_news)

        return (
            impression_id,
            user_id,
            impression_time,
            user_click_history_str,
            impression_news_str,
        )

    behaviors = pl.read_parquet(behaviors_path)
    history = pl.read_parquet(history_path)

    if history_size is None:
        history_size = history.shape[0]

    # Transform history to a dictionary for fast lookup
    user_history = {}
    for row in history.iter_rows(named=True):
        user_history[f"U{row['user_id']}"] = {
            "article_id": row["article_id_fixed"],
            "impression_time": row["impression_time_fixed"],
        }

    result_behavior_df = pl.DataFrame(behaviors.map_rows(transform_row))
    result_behavior_df.columns = [
        "Impression ID",
        "User ID",
        "Impression Time",
        "User Click History",
        "Impression News",
    ]
    result_behavior_df.sample(fraction=sample_size).write_csv(
        result_path, quote_style="never", include_header=False, separator="\t"
    )
    return result_behavior_df



if not Path(train_behaviors_file).exists() or FORCE_RELOAD:
    behavior_df = transfrom_behavior_file(train_behaviors_path, train_history_path, train_behaviors_file)
if not Path(val_behaviors_file).exists() or FORCE_RELOAD:
    behavior_df_val = transfrom_behavior_file(val_behaviors_path, val_history_path, val_behaviors_file)

SyntaxError: f-string: unmatched '[' (1512826099.py, line 55)
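
The history filter inside `transform_row` keeps only clicks that happened strictly before the impression and then truncates to the most recent ones. A minimal sketch on toy data, assuming timestamps are comparable `datetime` values; `clicks_before` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime


def clicks_before(article_ids, timestamps, impression_time, history_size=None):
    """Return the most recent clicks made strictly before the impression."""
    history = [
        article_id
        for article_id, timestamp in zip(article_ids, timestamps)
        if timestamp < impression_time
    ]
    # Keep everything when no limit is given, otherwise the last `history_size` clicks
    return history if history_size is None else history[-history_size:]


ids = [101, 102, 103, 104]
times = [datetime(2023, 5, 1, hour) for hour in (8, 9, 10, 11)]
recent = clicks_before(ids, times, datetime(2023, 5, 1, 10, 30), history_size=2)
print(recent)  # [102, 103]
```

Filtering on the timestamp keeps clicks made after the impression out of the history, so the model never sees future behavior for the impression it has to rank.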

In [14]:
COL_ARTICLE_ID = "article_id"
COL_ARTICLE_CATEGORY = "category_str"
COL_ARTICLE_TITLE = "title"
COL_ARTICLE_BODY = "body"
COL_ARTICLE_URL = "url"

def transform_articles_file(articles_path, result_path):
    def clean_text(column):
        return column.str.replace_all("\n", "").str.replace_all("\t", " ")

    articles = pl.read_parquet(articles_path)

    # Select relevant columns and apply the cleaning function to 'title' and 'body'
    articles = articles.select(
        [
            COL_ARTICLE_ID,
            COL_ARTICLE_CATEGORY,
            COL_ARTICLE_TITLE,
            COL_ARTICLE_BODY,
            COL_ARTICLE_URL,
        ]
    ).with_columns(
        [
            clean_text(articles[COL_ARTICLE_TITLE]),
            clean_text(articles[COL_ARTICLE_BODY]),
        ]
    )

    articles.write_csv(
        result_path, quote_style="never", include_header=False, separator="\t"
    )
    return articles



if not Path(articles_file).exists() or FORCE_RELOAD:
    train_news = transform_articles_file(articles_path, articles_file)
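
Because the file is written with `quote_style="never"`, a tab or newline inside a title or body would shift or break the columns. The sketch below shows the effect on a single toy row using plain Python strings; `clean_field` is a hypothetical stand-in for `clean_text`.

```python
def clean_field(text: str) -> str:
    """Plain-string counterpart of clean_text: drop newlines, turn tabs into spaces."""
    return text.replace("\n", "").replace("\t", " ")


raw_title = "Prins Harry\ttvunget til\ndna-test"
row = ["3044020", "underholdning", raw_title, "body", "url"]

dirty_line = "\t".join(row)
clean_line = "\t".join(clean_field(value) for value in row)

print(len(dirty_line.split("\t")))  # 6 fields: the tab in the title adds a column
print(len(clean_line.split("\t")))  # 5 fields, as expected
```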

## Create hyper-parameters

In [10]:
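import pickle

# The transformed frame names its user column "User ID"; cast to string so the
# keys match the user IDs that MINDIterator reads back from behaviors.tsv
user_id_mapping = {
    user_id: i
    for i, user_id in enumerate(behavior_df["User ID"].cast(pl.Utf8).unique())
}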

# Dump the dictionary as a pkl file
with open(user_dict_file, "wb") as f:
    pickle.dump(user_id_mapping, f)

NameError: name 'behavior_df' is not defined

In [175]:
import pickle
words = train_news['title'].str.split(' ').explode()
words_id_mapping = {word: i for i, word in enumerate(words.unique())}

# Dump the dictionary as a pkl file
with open(words_dict_file, 'wb') as f:
    pickle.dump(words_id_mapping, f)

In [279]:
from recommenders.models.newsrec.newsrec_utils import word_tokenize

# Tokenize the words
tokens = train_news['title'].map_elements(word_tokenize)
word2id = {word: i for i, word in enumerate(tokens.explode().unique())}
# Dump the dictionary as a pkl file
with open(words_dict_file, 'wb') as f:
    pickle.dump(word2id, f)

In [210]:
from transformers import AutoTokenizer, AutoModel
from ebrec.utils._nlp import generate_embeddings_with_transformers

TRANSFORMER_MODEL_NAME = "FacebookAI/xlm-roberta-base"
MAX_TITLE_LENGTH = 30
batch_size = 8
text_list = train_news['title'].to_list()
transformer_model = AutoModel.from_pretrained(TRANSFORMER_MODEL_NAME)
transformer_tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL_NAME)
t = generate_embeddings_with_transformers(
    transformer_model, transformer_tokenizer, text_list, batch_size, "cpu"
)

Encoding: 100%|██████████| 2593/2593 [03:32<00:00, 12.22it/s]


In [217]:
np.save(word_embeddings_file, t)

In [285]:
import numpy as np

# Function to load GloVe embeddings
def load_glove_embeddings(glove_file_path):
    embeddings_index = {}
    with open(glove_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

# Load GloVe embeddings
glove_file_path = os.path.abspath('/Users/maxkleinegger/Downloads/glove.6B/glove.6B.100d.txt')
embeddings_index = load_glove_embeddings(glove_file_path)

# Create embedding matrix
embedding_dim = 100
embedding_matrix = np.zeros((len(word2id), embedding_dim))

for word, idx in word2id.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[idx] = embedding_vector
    else:
        # Words not found in the embedding index will be initialized randomly
        embedding_matrix[idx] = np.random.normal(scale=0.6, size=(embedding_dim,))

# Save the embedding matrix to a file
np.save(word_embeddings_file, embedding_matrix)
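
Words without a pretrained vector are initialized randomly, so it can be useful to check how much of the vocabulary GloVe actually covers. The `glove.6B` vectors were trained on English text, so coverage may be limited for Danish titles. A minimal sketch on toy data; `embedding_coverage` is a hypothetical helper.

```python
def embedding_coverage(word2id, embeddings_index):
    """Return the fraction of vocabulary words that have a pretrained vector."""
    if not word2id:
        return 0.0
    found = sum(1 for word in word2id if word in embeddings_index)
    return found / len(word2id)


# Toy vocabulary and toy "pretrained" index, for illustration only
toy_vocab = {"prins": 0, "harry": 1, "dna": 2, "test": 3}
toy_glove = {"harry": [0.1], "dna": [0.2], "test": [0.3]}
print(embedding_coverage(toy_vocab, toy_glove))  # 0.75
```

On the real data the same function could be called as `embedding_coverage(word2id, embeddings_index)`.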

In [284]:
embedding_matrix

array([[ 0.17757   , -0.29135001, -0.70521998, ...,  0.41793999,
        -0.30325001,  0.26774999],
       [-0.36452862, -0.07568185, -0.41076382, ..., -0.37178079,
         0.5245951 ,  1.03171141],
       [ 0.02057405, -0.50906059,  0.16129882, ..., -0.32023004,
         0.00815788, -0.86302826],
       ...,
       [-0.1811851 , -0.45838818,  0.47729788, ..., -0.35097379,
         0.11257933, -0.54628964],
       [-0.79680997, -0.27430999,  0.46555001, ..., -1.16209996,
        -0.035306  , -0.070289  ],
       [-0.23470999, -0.64236999, -0.19595   , ...,  0.43999001,
        -0.36699   ,  0.29971001]])

In [176]:
from transformers import AutoTokenizer, AutoModel
import numpy as np
import torch

from ebrec.utils._articles import convert_text2encoding_with_transformers
from ebrec.utils._polars import concat_str_columns, slice_join_dataframes
from ebrec.utils._articles import create_article_id_to_value_mapping
from ebrec.utils._nlp import get_transformers_word_embeddings
from ebrec.utils._python import write_submission_file, rank_predictions_by_score



# LOAD HUGGINGFACE:
transformer_model = AutoModel.from_pretrained(TRANSFORMER_MODEL_NAME)
transformer_tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL_NAME)

# Initialize the word embeddings from the transformer model
word2vec_embedding = get_transformers_word_embeddings(transformer_model)

tokenized_titles = train_news['title'].map_elements(lambda x: transformer_tokenizer.tokenize(x))

# Create dictionary mapping each word to an ID
all_tokens = [token for title in tokenized_titles for token in title]
unique_tokens = list(set(all_tokens))
word2id = {word: idx for idx, word in enumerate(unique_tokens)}

embeddings = {}
for word, idx in word2id.items():
    input_ids = torch.tensor([transformer_tokenizer.convert_tokens_to_ids(word)]).unsqueeze(0)  # Batch size 1
    with torch.no_grad():
        outputs = transformer_model(input_ids)
    embeddings[word] = outputs.last_hidden_state.squeeze().mean(dim=0).numpy()




In [182]:
embedding_array = np.vstack([embeddings[word] for word in unique_tokens])

In [209]:
np.save(word_embeddings_file, embedding_array)
print(len(embedding_array[max(word2id.values())]))
print(max(word2id.values()))

1
12133
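
The printed length of 1 is consistent with how the loop above reduces the model output. For a single token the hidden state presumably has shape `(1, 1, hidden)`; `squeeze()` removes both singleton axes, so `mean(dim=0)` averages over the hidden axis and leaves a scalar. The NumPy sketch below reproduces only the shape arithmetic; the hidden size of 768 is an assumption for illustration.

```python
import numpy as np

hidden_size = 768  # assumed hidden size, for illustration only
last_hidden_state = np.ones((1, 1, hidden_size))  # (batch, tokens, hidden)

# squeeze() drops both singleton axes, so the mean runs over the hidden axis
collapsed = last_hidden_state.squeeze().mean(axis=0)

# Squeezing only the batch axis keeps the token axis for the mean
per_token_mean = last_hidden_state.squeeze(0).mean(axis=0)

print(collapsed.shape, per_token_mean.shape)
print(np.vstack([collapsed, collapsed]).shape)  # scalars stack into one column
```

Squeezing only the batch axis would keep one vector of length `hidden` per token, which would also have to match `word_emb_dim` in the hyper-parameters.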


In [304]:
hparams = prepare_hparams(yaml_file, 
                          wordEmb_file=word_embeddings_file,
                          wordDict_file=words_dict_file, 
                          userDict_file=user_dict_file,
                          batch_size=BATCH_SIZE,
                          epochs=EPOCHS)
print(hparams)

HParams object with values {'support_quick_scoring': True, 'dropout': 0.2, 'attention_hidden_dim': 200, 'head_num': 4, 'head_dim': 100, 'filter_num': 400, 'window_size': 3, 'vert_emb_dim': 100, 'subvert_emb_dim': 100, 'gru_unit': 400, 'type': 'ini', 'user_emb_dim': 50, 'learning_rate': 0.0001, 'optimizer': 'adam', 'epochs': 5, 'batch_size': 32, 'show_step': 100000, 'title_size': 30, 'his_size': 50, 'data_format': 'news', 'npratio': 4, 'metrics': ['group_auc', 'mean_mrr', 'ndcg@5;10'], 'word_emb_dim': 100, 'cnn_activation': 'relu', 'model_type': 'lstur', 'loss': 'cross_entropy_loss', 'wordEmb_file': '/Users/maxkleinegger/Downloads/tmp/word_embeddings.npy', 'wordDict_file': '/Users/maxkleinegger/Downloads/tmp/words_dict.pkl', 'userDict_file': '/Users/maxkleinegger/Downloads/tmp/user_dict.pkl'}
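
The hyper-parameters tie together the word dictionary, the embedding file and `word_emb_dim`, so these three have to agree. A minimal sanity check, assuming the embedding matrix is indexed directly by the values of the word dictionary; `check_embedding_matrix` is a hypothetical helper, not part of `recommenders`.

```python
import numpy as np


def check_embedding_matrix(matrix, word_dict, word_emb_dim):
    """Raise if the matrix cannot serve every index in word_dict."""
    rows, dim = matrix.shape
    if dim != word_emb_dim:
        raise ValueError(f"embedding dim {dim} != word_emb_dim {word_emb_dim}")
    max_index = max(word_dict.values(), default=-1)
    if max_index >= rows:
        raise ValueError(f"index {max_index} has no row (matrix has {rows} rows)")
    return True


# Toy inputs, for illustration only
toy_dict = {"prins": 0, "harry": 1, "test": 2}
ok = check_embedding_matrix(np.zeros((3, 100)), toy_dict, word_emb_dim=100)
print(ok)
```

On the real artifacts this could be run on `np.load(word_embeddings_file)` and the unpickled word dictionary before building the model.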


In [305]:
# Copyright (c) Recommenders contributors.
# Licensed under the MIT License.

import tensorflow as tf
import numpy as np
import pickle

from recommenders.models.deeprec.io.iterator import BaseIterator
from recommenders.models.newsrec.newsrec_utils import word_tokenize, newsample

__all__ = ["MINDIterator"]


class MINDIterator(BaseIterator):
    """Train data loader for NAML model.
    The model requires a special data format, where each instance contains a label, impression id, user id,
    the candidate news articles and the user's clicked news articles. Articles are represented by title words,
    body words, verts and subverts.

    Iterator will not load the whole data into memory. Instead, it loads data into memory
    per mini-batch, so that large files can be used as input data.

    Attributes:
        col_spliter (str): column splitter in one line.
        ID_spliter (str): ID splitter in one line.
        batch_size (int): number of samples in one batch.
        title_size (int): max number of words in a news title.
        his_size (int): max number of clicked news in the user click history.
        npratio (int): negative-to-positive ratio used in negative sampling. -1 means no negative sampling.
    """

    def __init__(
        self,
        hparams,
        npratio=-1,
        col_spliter="\t",
        ID_spliter="%",
    ):
        """Initialize an iterator. Create necessary placeholders for the model.

        Args:
            hparams (object): Global hyper-parameters. Some key settings such as head_num and head_dim are there.
            npratio (int): negative-to-positive ratio used in negative sampling. -1 means no negative sampling.
            col_spliter (str): column splitter in one line.
            ID_spliter (str): ID splitter in one line.
        """
        self.col_spliter = col_spliter
        self.ID_spliter = ID_spliter
        self.batch_size = hparams.batch_size
        self.title_size = hparams.title_size
        self.his_size = hparams.his_size
        self.npratio = npratio

        self.word_dict = self.load_dict(hparams.wordDict_file)
        self.uid2index = self.load_dict(hparams.userDict_file)

    def load_dict(self, file_path):
        """load pickle file

        Args:
            file_path (str): path of the pickle file

        Returns:
            object: pickle loaded object
        """
        with open(file_path, "rb") as f:
            return pickle.load(f)

    def init_news(self, news_file):
        """init news information given news file, such as news_title_index and nid2index.
        Args:
            news_file: path of news file
        """

        self.nid2index = {}
        news_title = [""]

        with tf.io.gfile.GFile(news_file, "r") as rd:
            for line in rd:
                nid, vert, title, ab, url = line.strip("\n").split(
                    self.col_spliter
                )
                if nid in self.nid2index:
                    continue

                self.nid2index[nid] = len(self.nid2index) + 1
                title = word_tokenize(title)
                news_title.append(title)

        self.news_title_index = np.zeros(
            (len(news_title), self.title_size), dtype="int32"
        )

        for news_index in range(len(news_title)):
            title = news_title[news_index]
            for word_index in range(min(self.title_size, len(title))):
                if title[word_index] in self.word_dict:
                    self.news_title_index[news_index, word_index] = self.word_dict[
                        title[word_index].lower()
                    ]

    def init_behaviors(self, behaviors_file):
        """init behavior logs given behaviors file.

        Args:
        behaviors_file: path of behaviors file
        """
        self.histories = []
        self.imprs = []
        self.labels = []
        self.impr_indexes = []
        self.uindexes = []

        with tf.io.gfile.GFile(behaviors_file, "r") as rd:
            impr_index = 0
            for line in rd:
                uid, time, history, impr = line.strip("\n").split(self.col_spliter)[-4:]

                history = [self.nid2index[i[1:]] for i in history.split()]
                history = [0] * (self.his_size - len(history)) + history[
                    -self.his_size :
                ]

                impr_news = [self.nid2index[(i.split("-")[0])[1:]] for i in impr.split()]
                label = [int(i.split("-")[1]) for i in impr.split()]
                uindex = self.uid2index[uid] if uid in self.uid2index else 0

                self.histories.append(history)
                self.imprs.append(impr_news)
                self.labels.append(label)
                self.impr_indexes.append(impr_index)
                self.uindexes.append(uindex)
                impr_index += 1

    def parser_one_line(self, line):
        """Parse one behavior sample into feature values.
        If npratio is larger than 0, return the negatively sampled result.

        Args:
            line (int): sample index.

        Yields:
            list: Parsed results including label, impression id , user id,
            candidate_title_index, clicked_title_index.
        """
        if self.npratio > 0:
            impr_label = self.labels[line]
            impr = self.imprs[line]

            poss = []
            negs = []

            for news, click in zip(impr, impr_label):
                if click == 1:
                    poss.append(news)
                else:
                    negs.append(news)

            for p in poss:
                candidate_title_index = []
                impr_index = []
                user_index = []
                label = [1] + [0] * self.npratio

                n = newsample(negs, self.npratio)
                candidate_title_index = self.news_title_index[[p] + n]
                click_title_index = self.news_title_index[self.histories[line]]
                impr_index.append(self.impr_indexes[line])
                user_index.append(self.uindexes[line])

                yield (
                    label,
                    impr_index,
                    user_index,
                    candidate_title_index,
                    click_title_index,
                )

        else:
            impr_label = self.labels[line]
            impr = self.imprs[line]

            for news, label in zip(impr, impr_label):
                candidate_title_index = []
                impr_index = []
                user_index = []
                label = [label]

                candidate_title_index.append(self.news_title_index[news])
                click_title_index = self.news_title_index[self.histories[line]]
                impr_index.append(self.impr_indexes[line])
                user_index.append(self.uindexes[line])

                yield (
                    label,
                    impr_index,
                    user_index,
                    candidate_title_index,
                    click_title_index,
                )

    def load_data_from_file(self, news_file, behavior_file):
        """Read and parse data from news file and behavior file.

        Args:
            news_file (str): A file containing information about the news.
            behavior_file (str): A file containing information about user impressions.

        Yields:
            object: An iterator that yields parsed results, in the format of dict.
        """

        if not hasattr(self, "news_title_index"):
            self.init_news(news_file)

        if not hasattr(self, "impr_indexes"):
            self.init_behaviors(behavior_file)

        label_list = []
        imp_indexes = []
        user_indexes = []
        candidate_title_indexes = []
        click_title_indexes = []
        cnt = 0

        indexes = np.arange(len(self.labels))

        if self.npratio > 0:
            np.random.shuffle(indexes)

        for index in indexes:
            for (
                label,
                imp_index,
                user_index,
                candidate_title_index,
                click_title_index,
            ) in self.parser_one_line(index):
                candidate_title_indexes.append(candidate_title_index)
                click_title_indexes.append(click_title_index)
                imp_indexes.append(imp_index)
                user_indexes.append(user_index)
                label_list.append(label)

                cnt += 1
                if cnt >= self.batch_size:
                    yield self._convert_data(
                        label_list,
                        imp_indexes,
                        user_indexes,
                        candidate_title_indexes,
                        click_title_indexes,
                    )
                    label_list = []
                    imp_indexes = []
                    user_indexes = []
                    candidate_title_indexes = []
                    click_title_indexes = []
                    cnt = 0

        if cnt > 0:
            yield self._convert_data(
                label_list,
                imp_indexes,
                user_indexes,
                candidate_title_indexes,
                click_title_indexes,
            )

    def _convert_data(
        self,
        label_list,
        imp_indexes,
        user_indexes,
        candidate_title_indexes,
        click_title_indexes,
    ):
        """Convert data into numpy arrays that are good for further model operation.

        Args:
            label_list (list): a list of ground-truth labels.
            imp_indexes (list): a list of impression indexes.
            user_indexes (list): a list of user indexes.
            candidate_title_indexes (list): the candidate news titles' words indices.
            click_title_indexes (list): words indices for user's clicked news titles.

        Returns:
            dict: A dictionary, containing multiple numpy arrays that are convenient for further operation.
        """

        labels = np.asarray(label_list, dtype=np.float32)
        imp_indexes = np.asarray(imp_indexes, dtype=np.int32)
        user_indexes = np.asarray(user_indexes, dtype=np.int32)
        candidate_title_index_batch = np.asarray(
            candidate_title_indexes, dtype=np.int64
        )
        click_title_index_batch = np.asarray(click_title_indexes, dtype=np.int64)
        return {
            "impression_index_batch": imp_indexes,
            "user_index_batch": user_indexes,
            "clicked_title_batch": click_title_index_batch,
            "candidate_title_batch": candidate_title_index_batch,
            "labels": labels,
        }

    def load_user_from_file(self, news_file, behavior_file):
        """Read and parse user data from news file and behavior file.

        Args:
            news_file (str): A file containing information about the news.
            behavior_file (str): A file containing information about user impressions.

        Yields:
            object: An iterator that yields parsed user feature, in the format of dict.
        """

        if not hasattr(self, "news_title_index"):
            self.init_news(news_file)

        if not hasattr(self, "impr_indexes"):
            self.init_behaviors(behavior_file)

        user_indexes = []
        impr_indexes = []
        click_title_indexes = []
        cnt = 0

        for index in range(len(self.impr_indexes)):
            click_title_indexes.append(self.news_title_index[self.histories[index]])
            user_indexes.append(self.uindexes[index])
            impr_indexes.append(self.impr_indexes[index])

            cnt += 1
            if cnt >= self.batch_size:
                yield self._convert_user_data(
                    user_indexes,
                    impr_indexes,
                    click_title_indexes,
                )
                user_indexes = []
                impr_indexes = []
                click_title_indexes = []
                cnt = 0

        if cnt > 0:
            yield self._convert_user_data(
                user_indexes,
                impr_indexes,
                click_title_indexes,
            )

    def _convert_user_data(
        self,
        user_indexes,
        impr_indexes,
        click_title_indexes,
    ):
        """Convert data into numpy arrays that are good for further model operation.

        Args:
            user_indexes (list): a list of user indexes.
            click_title_indexes (list): words indices for user's clicked news titles.

        Returns:
            dict: A dictionary, containing multiple numpy arrays that are convenient for further operation.
        """

        user_indexes = np.asarray(user_indexes, dtype=np.int32)
        impr_indexes = np.asarray(impr_indexes, dtype=np.int32)
        click_title_index_batch = np.asarray(click_title_indexes, dtype=np.int64)

        return {
            "user_index_batch": user_indexes,
            "impr_index_batch": impr_indexes,
            "clicked_title_batch": click_title_index_batch,
        }

    def load_news_from_file(self, news_file):
        """Read and parse user data from news file.

        Args:
            news_file (str): A file containing information about the news.

        Yields:
            object: An iterator that yields parsed news feature, in the format of dict.
        """
        if not hasattr(self, "news_title_index"):
            self.init_news(news_file)

        news_indexes = []
        candidate_title_indexes = []
        cnt = 0

        for index in range(len(self.news_title_index)):
            news_indexes.append(index)
            candidate_title_indexes.append(self.news_title_index[index])

            cnt += 1
            if cnt >= self.batch_size:
                yield self._convert_news_data(
                    news_indexes,
                    candidate_title_indexes,
                )
                news_indexes = []
                candidate_title_indexes = []
                cnt = 0


        if cnt > 0:
            yield self._convert_news_data(
                news_indexes,
                candidate_title_indexes,
            )

    def _convert_news_data(
        self,
        news_indexes,
        candidate_title_indexes,
    ):
        """Convert data into numpy arrays that are good for further model operation.

        Args:
            news_indexes (list): a list of news indexes.
            candidate_title_indexes (list): the candidate news titles' words indices.

        Returns:
            dict: A dictionary, containing multiple numpy arrays that are convenient for further operation.
        """

        news_indexes_batch = np.asarray(news_indexes, dtype=np.int32)
        candidate_title_index_batch = np.asarray(
            candidate_title_indexes, dtype=np.int32
        )

        return {
            "news_index_batch": news_indexes_batch,
            "candidate_title_batch": candidate_title_index_batch,
        }

    def load_impression_from_file(self, behaivors_file):
        """Read and parse impression data from behaivors file.

        Args:
            behaivors_file (str): A file containing information about user behaviors.

        Yields:
            object: An iterator that yields parsed impression data, in the format of dict.
        """

        if not hasattr(self, "histories"):
            self.init_behaviors(behaivors_file)

        indexes = np.arange(len(self.labels))

        for index in indexes:
            impr_label = np.array(self.labels[index], dtype="int32")
            impr_news = np.array(self.imprs[index], dtype="int32")

            yield (
                self.impr_indexes[index],
                impr_news,
                self.uindexes[index],
                impr_label,
            )
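
The news loader above accumulates items, yields every full batch, and flushes whatever remains at the end. A minimal, standalone sketch of that pattern follows; the helper name `batched` is hypothetical and not part of the iterator class.

```python
from typing import Iterable, Iterator, List


def batched(items: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield lists of at most batch_size items, including a final partial batch."""
    batch: List[int] = []
    for item in items:
        batch.append(item)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    # Flush the remainder, mirroring the `if cnt > 0` branch above.
    if batch:
        yield batch


batches = list(batched(range(5), batch_size=2))
print(batches)
```

Without the final flush, the last partial batch of news items would never reach the model.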

In [306]:
iterator = MINDIterator

## Train the LSTUR model

In [307]:
model = LSTURModel(hparams, iterator, seed=SEED)



Tensor("conv1d_22/Relu:0", shape=(None, 30, 400), dtype=float32)
Tensor("att_layer2_22/Sum_1:0", shape=(None, 400), dtype=float32)



In [308]:
print(model.run_eval(train_news_file, val_behaviors_file))

649it [00:07, 89.76it/s] 

{'group_auc': 0.5154, 'mean_mrr': 0.3221, 'ndcg@5': 0.3538, 'ndcg@10': 0.4407}


In [309]:
%%time
model.fit(train_news_file, train_behaviors_file, train_news_file, val_behaviors_file)


at epoch 1
train info: logloss loss:1.6152885620658461
eval info: group_auc:0.546, mean_mrr:0.3452, ndcg@10:0.4607, ndcg@5:0.3835
at epoch 1 , train time: 42.5 eval time: 13.2


74it [00:36,  2.02it/s]
649it [00:02, 302.86it/s]
77it [00:09,  7.70it/s]
2446it [00:00, 4348.66it/s]


at epoch 2
train info: logloss loss:1.5847850651354403
eval info: group_auc:0.5428, mean_mrr:0.3417, ndcg@10:0.4585, ndcg@5:0.3785
at epoch 2 , train time: 36.6 eval time: 13.8


74it [00:36,  2.04it/s]
649it [00:01, 370.87it/s]
77it [00:10,  7.43it/s]
2446it [00:00, 5481.42it/s]


at epoch 3
train info: logloss loss:1.5530094182169116
eval info: group_auc:0.5381, mean_mrr:0.3381, ndcg@10:0.456, ndcg@5:0.3763
at epoch 3 , train time: 36.2 eval time: 13.6


74it [00:35,  2.07it/s]
649it [00:01, 373.28it/s]
77it [00:09,  7.84it/s]
2446it [00:00, 13926.87it/s]


at epoch 4
train info: logloss loss:1.539530462509877
eval info: group_auc:0.5345, mean_mrr:0.335, ndcg@10:0.4521, ndcg@5:0.3731
at epoch 4 , train time: 35.8 eval time: 12.8


74it [00:35,  2.08it/s]
649it [00:02, 311.29it/s]
77it [00:10,  7.50it/s]
2446it [00:00, 21643.75it/s]


at epoch 5
train info: logloss loss:1.5066116033373653
eval info: group_auc:0.5429, mean_mrr:0.34, ndcg@10:0.4578, ndcg@5:0.3789
at epoch 5 , train time: 35.6 eval time: 13.6
CPU times: user 16min 11s, sys: 2min 34s, total: 18min 45s
Wall time: 4min 13s


<recommenders.models.newsrec.models.lstur.LSTURModel at 0x357090090>

In [311]:
res_syn = model.run_eval(articles_file, val_behaviors_file)
print(res_syn)

649it [00:01, 376.64it/s]
77it [00:09,  8.07it/s]
2446it [00:00, 15752.11it/s]


{'group_auc': 0.5429, 'mean_mrr': 0.34, 'ndcg@5': 0.3789, 'ndcg@10': 0.4578}
CPU times: user 45.1 s, sys: 5.72 s, total: 50.8 s
Wall time: 12.5 s


In [312]:
# Record results for tests - ignore this cell
store_metadata("group_auc", res_syn['group_auc'])
store_metadata("mean_mrr", res_syn['mean_mrr'])
store_metadata("ndcg@5", res_syn['ndcg@5'])
store_metadata("ndcg@10", res_syn['ndcg@10'])
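
The reported values are ranking metrics. The sketch below shows how `mean_mrr` and `ndcg@k` could be computed for a single impression, assuming the standard definitions (reciprocal rank of clicked items, and discounted gain normalised by the ideal ordering). The function names are illustrative, and the evaluation code used by the model may differ in details. We assume the final numbers are averages of such per-impression scores.

```python
import numpy as np


def mrr_score(labels, scores):
    """Reciprocal rank of the clicked items, averaged over the clicks."""
    order = np.argsort(scores)[::-1]          # candidate positions, best score first
    ranked = np.take(labels, order)           # labels in predicted order
    reciprocal_ranks = ranked / (np.arange(len(ranked)) + 1)
    return float(np.sum(reciprocal_ranks) / np.sum(ranked))


def dcg_at_k(labels, scores, k):
    """Discounted cumulative gain of the top-k candidates."""
    order = np.argsort(scores)[::-1]
    ranked = np.take(labels, order[:k])
    gains = 2 ** ranked - 1
    discounts = np.log2(np.arange(len(ranked)) + 2)
    return float(np.sum(gains / discounts))


def ndcg_at_k(labels, scores, k):
    """DCG normalised by the DCG of the ideal ordering."""
    return dcg_at_k(labels, scores, k) / dcg_at_k(labels, labels, k)


# One impression with three candidates; the clicked article gets the lowest score.
labels = np.array([1, 0, 0])
scores = np.array([0.1, 0.9, 0.5])
mrr = mrr_score(labels, scores)
ndcg5 = ndcg_at_k(labels, scores, k=5)
print(round(mrr, 4), round(ndcg5, 4))
```

In this example the clicked article is ranked last of three, so the reciprocal rank is 1/3 and the nDCG is 0.5.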

## Save the model

In [None]:
model_path = os.path.join(PATH, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "lstur_native_model"))

## Output Prediction File
This code segment generates the `prediction.zip` file, which follows the format described in the [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Set the `MIND_type` parameter to `large` if you want to submit your predictions to the [MIND Competition](https://msnews.github.io/competition.html).
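
The cells below turn the predicted scores of each impression into ranks by applying `np.argsort` twice. The following sketch isolates that step on made-up scores; the helper name `scores_to_ranks` is hypothetical.

```python
import numpy as np


def scores_to_ranks(preds):
    """Return 1-based ranks, where rank 1 marks the highest score."""
    descending_order = np.argsort(preds)[::-1]   # candidate positions, best first
    # The argsort of that ordering gives each candidate's place in it.
    return (np.argsort(descending_order) + 1).tolist()


preds = [0.1, 0.7, 0.4]
ranks = scores_to_ranks(preds)
line = "1 [" + ",".join(str(r) for r in ranks) + "]"
print(line)  # 1 [3,1,2]
```

The second candidate has the highest score and therefore receives rank 1, while the first candidate receives rank 3.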

In [None]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(valid_news_file, valid_behaviors_file)

In [None]:
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

In [None]:
# The context manager closes the archive even if writing fails.
with zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')

## Reference
\[1\] Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu and Xing Xie: Neural News Recommendation with Long- and Short-term User Representations, ACL 2019<br>
\[2\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>
\[3\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/