<i>Copyright (c) Recommenders contributors.</i>

<i>Licensed under the MIT License.</i>

# LSTUR: Neural News Recommendation with Long- and Short-term User Representations
LSTUR \[1\] is a news recommendation approach that captures both the long-term preferences and the short-term interests of users. Its core consists of a news encoder and a user encoder. The news encoder learns news representations from titles. The user encoder learns long-term
user representations from the embeddings of user IDs, and short-term user representations from recently browsed news via a GRU network. LSTUR offers two methods for combining
the long-term and short-term user representations. The first uses the long-term user representation to initialize the hidden state of the GRU network that produces the short-term representation. The second concatenates the
long- and short-term user representations into a unified user vector.
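
The two combination methods can be illustrated with a small, dependency-free sketch. It is not the LSTUR implementation: the GRU below is a toy with untrained, shared scalar weights applied element-wise, the news vectors are assumed to have the same size as the user embedding, and the function names (`user_vector_ini`, `user_vector_con`) and the concatenation order are chosen here for illustration only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gru_step(h, x, w=0.5, u=0.5):
    """One step of a toy element-wise GRU (scalar, untrained weights)."""
    new_h = []
    for h_i, x_i in zip(h, x):
        z = sigmoid(w * x_i + u * h_i)              # update gate
        r = sigmoid(w * x_i + u * h_i)              # reset gate (same toy weights)
        cand = math.tanh(w * x_i + u * (r * h_i))   # candidate hidden state
        new_h.append((1.0 - z) * h_i + z * cand)
    return new_h


def short_term(news_vectors, h0):
    """Run the toy GRU over the browsed news and return the last hidden state."""
    h = list(h0)
    for x in news_vectors:
        h = gru_step(h, x)
    return h


def user_vector_ini(news_vectors, long_term):
    # Method 1: the long-term embedding initializes the GRU hidden state.
    return short_term(news_vectors, h0=long_term)


def user_vector_con(news_vectors, long_term):
    # Method 2: the GRU starts from zeros and its output is concatenated
    # with the long-term embedding.
    short = short_term(news_vectors, h0=[0.0] * len(long_term))
    return short + list(long_term)


browsed = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1]]   # hypothetical news vectors
long_term = [0.2, 0.1, -0.3]                      # hypothetical user-ID embedding

print(len(user_vector_ini(browsed, long_term)))   # same size as the GRU state
print(len(user_vector_con(browsed, long_term)))   # twice that size
```

With initialization the user vector keeps the size of the GRU state, whereas concatenation doubles it, so the layer that scores candidate news has to match the chosen method.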

## Properties of LSTUR:
- LSTUR captures both the long-term and the short-term preferences of users.
- It learns long-term user representations from the embeddings of user IDs.
- It learns short-term user representations from users' recently browsed news via a GRU network.

## Data format:
For quicker training and evaluation, we sample the MINDdemo dataset of 5k users from the [MIND small dataset](https://msnews.github.io/). The MINDdemo dataset has the same file format as MINDsmall and MINDlarge. To run experiments on MINDsmall or MINDlarge, change the download source by setting the MIND_type parameter to one of ['large', 'small', 'demo'].
 
**MINDdemo_train** is used for training, and **MINDdemo_dev** is used for evaluation. Both the training data and the evaluation data consist of a news file and a behaviors file. A more detailed description of the data is available in the [MIND repo](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md).

### news data
This file contains news information: news ID, category, subcategory, news title, news abstract, news URL, entities in the news title, and entities in the news abstract.
One simple example: <br>

`N46466	lifestyle	lifestyleroyals	The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By	Shop the notebooks, jackets, and more that the royals can't live without.	https://www.msn.com/en-us/lifestyle/lifestyleroyals/the-brands-queen-elizabeth,-prince-charles,-and-prince-philip-swear-by/ss-AAGH0ET?ocid=chopendata	[{"Label": "Prince Philip, Duke of Edinburgh", "Type": "P", "WikidataId": "Q80976", "Confidence": 1.0, "OccurrenceOffsets": [48], "SurfaceForms": ["Prince Philip"]}, {"Label": "Charles, Prince of Wales", "Type": "P", "WikidataId": "Q43274", "Confidence": 1.0, "OccurrenceOffsets": [28], "SurfaceForms": ["Prince Charles"]}, {"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97, "OccurrenceOffsets": [11], "SurfaceForms": ["Queen Elizabeth"]}]	[]`
<br>

In general, each line in the data file describes one piece of news: <br>

`[News ID] [Category] [Subcategory] [News Title] [News Abstract] [News Url] [Entities in News Title] [Entities in News Abstract] ...`

<br>
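
As a minimal sketch of reading this layout, the snippet below splits one tab-separated line into the eight fields listed above and decodes the two entity columns as JSON. The field names in `NEWS_FIELDS`, the function name, and the shortened sample line (including its placeholder URL) are assumptions made for illustration.

```python
import json

# Field names chosen here for illustration; the file itself has no header row.
NEWS_FIELDS = [
    "news_id", "category", "subcategory", "title",
    "abstract", "url", "title_entities", "abstract_entities",
]


def parse_news_line(line):
    """Split one tab-separated news line and decode the entity columns."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(NEWS_FIELDS, values))
    for key in ("title_entities", "abstract_entities"):
        # Entity columns hold a JSON list; treat a missing or empty value as [].
        record[key] = json.loads(record.get(key) or "[]")
    return record


# Shortened, hypothetical sample line in the same layout as the example above.
sample = "\t".join([
    "N46466",
    "lifestyle",
    "lifestyleroyals",
    "The Brands Queen Elizabeth Swears By",
    "Shop the notebooks, jackets, and more.",
    "https://example.com/news",
    '[{"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682"}]',
    "[]",
])

news = parse_news_line(sample)
print(news["news_id"], news["category"], news["title_entities"][0]["Label"])
```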

We generate a word_dict file to transform the words in news titles into word indexes, and an embedding matrix is initialized from pretrained GloVe embeddings.
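
The idea can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code that produces the shipped word_dict file: the tokenizer regex, the choice of index 0 for padding and unknown words, and the helper names are all hypothetical, and `pretrained` stands in for vectors loaded from a GloVe file.

```python
import re


def tokenize(title):
    """Lowercase a title and split it into word and punctuation tokens."""
    return re.findall(r"[\w]+|[.,!?;|]", title.lower())


def build_word_dict(titles):
    """Assign each distinct token an index, reserving 0 for padding/unknown."""
    word_dict = {}
    for title in titles:
        for token in tokenize(title):
            if token not in word_dict:
                word_dict[token] = len(word_dict) + 1
    return word_dict


def encode_title(title, word_dict, title_size):
    """Map a title to a fixed-length list of word indexes."""
    ids = [word_dict.get(token, 0) for token in tokenize(title)][:title_size]
    return ids + [0] * (title_size - len(ids))


def build_embedding_matrix(word_dict, pretrained, dim):
    """Row i holds the pretrained vector of the word with index i, else zeros."""
    matrix = [[0.0] * dim for _ in range(len(word_dict) + 1)]
    for word, index in word_dict.items():
        if word in pretrained:
            matrix[index] = list(pretrained[word])
    return matrix


titles = ["The Brands Queen Elizabeth Swear By", "Queen of the notebooks"]
word_dict = build_word_dict(titles)
encoded = encode_title("Queen Elizabeth rocks", word_dict, title_size=5)
pretrained = {"queen": [0.1, 0.2]}   # hypothetical 2-dimensional vectors
matrix = build_embedding_matrix(word_dict, pretrained, dim=2)
print(encoded)
```

Words without a pretrained vector keep a zero row in this sketch; a random initialization is another common choice.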

### behaviors data
One simple example: <br>
`1	U82271	11/11/2019 3:28:58 PM	N3130 N11621 N12917 N4574 N12140 N9748	N13390-0 N7180-0 N20785-0 N6937-0 N15776-0 N25810-0 N20820-0 N6885-0 N27294-0 N18835-0 N16945-0 N7410-0 N23967-0 N22679-0 N20532-0 N26651-0 N22078-0 N4098-0 N16473-0 N13841-0 N15660-0 N25787-0 N2315-0 N1615-0 N9087-0 N23880-0 N3600-0 N24479-0 N22882-0 N26308-0 N13594-0 N2220-0 N28356-0 N17083-0 N21415-0 N18671-0 N9440-0 N17759-0 N10861-0 N21830-0 N8064-0 N5675-0 N15037-0 N26154-0 N15368-1 N481-0 N3256-0 N20663-0 N23940-0 N7654-0 N10729-0 N7090-0 N23596-0 N15901-0 N16348-0 N13645-0 N8124-0 N20094-0 N27774-0 N23011-0 N14832-0 N15971-0 N27729-0 N2167-0 N11186-0 N18390-0 N21328-0 N10992-0 N20122-0 N1958-0 N2004-0 N26156-0 N17632-0 N26146-0 N17322-0 N18403-0 N17397-0 N18215-0 N14475-0 N9781-0 N17958-0 N3370-0 N1127-0 N15525-0 N12657-0 N10537-0 N18224-0`
<br>

In general, each line in the data file represents one impression. The format is: <br>

`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`

<br>

User Click History lists the news the user clicked before Impression Time. Impression News lists the news displayed in the impression, in the following format:<br>

`[News ID 1]-[label1] ... [News ID n]-[labeln]`

<br>
Label indicates whether the user clicked the news. All information about the news in User Click History and Impression News can be found in the news data file.
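
A behaviors line can be read back with a short parser. The sketch below assumes the five tab-separated columns described above; the function name, the dictionary keys, and the shortened sample line are hypothetical.

```python
def parse_behavior_line(line):
    """Split one behaviors line into its five tab-separated columns."""
    impression_id, user_id, time_str, history, impressions = (
        line.rstrip("\n").split("\t")
    )
    candidates = []
    for item in impressions.split():
        # Split on the last "-" so the label is separated from the news ID.
        news_id, label = item.rsplit("-", 1)
        candidates.append((news_id, int(label)))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time_str,
        "history": history.split(),   # empty list when there is no history
        "candidates": candidates,     # (news ID, clicked label) pairs
    }


# Shortened, hypothetical sample in the same layout as the example above.
sample = "1\tU82271\t11/11/2019 3:28:58 PM\tN3130 N11621\tN13390-0 N15368-1 N481-0"
impression = parse_behavior_line(sample)
clicked = [news_id for news_id, label in impression["candidates"] if label == 1]
print(impression["user_id"], impression["history"], clicked)
```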

## Imports and Global settings

In [2]:
import os
import sys
import numpy as np
import polars as pl
import zipfile
import logging
from pathlib import Path
from tqdm import tqdm
import tensorflow as tf

from recommenders.models.deeprec.deeprec_utils import download_deeprec_resources 
from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.lstur import LSTURModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.newsrec_utils import get_mind_data_set
from recommenders.utils.notebook_utils import store_metadata

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))



System version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Tensorflow version: 2.15.1


In [3]:
# configurations
tf.get_logger().setLevel('ERROR') # only show error messages

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)
LOG.setLevel(logging.INFO)

pl.Config.set_tbl_rows(100)
pl.Config.set_streaming_chunk_size(500_000)

polars.config.Config

## Prepare Parameters and File Locations

In [15]:
# LSTUR parameters
EPOCHS = 5
SEED = 40
BATCH_SIZE = 32

FORCE_RELOAD = False

DATASET_NAME = "demo" # one of: demo, small, large
GROUP_PROJECT_PATH = '/home/e11920555/Group_33'

In [112]:
# dataset parquet files path
data_path = Path(os.path.join(GROUP_PROJECT_PATH, "data"))
train_behaviors_path = os.path.join(data_path, 'train', 'behaviors.parquet')
train_history_path = os.path.join(data_path, 'train', 'history.parquet')
val_behaviors_path = os.path.join(data_path, 'validation', 'behaviors.parquet')
val_history_path = os.path.join(data_path, 'validation', 'history.parquet')
articles_path = os.path.join(data_path, 'articles.parquet')
embedding_path = os.path.join(data_path, 'artifacts', 'document_vector.parquet')

# artifacts file path
tmp_path = Path(os.path.join(GROUP_PROJECT_PATH, "tmp"))
tmp_train_path = Path(os.path.join(tmp_path, DATASET_NAME, 'train'))
tmp_val_path = Path(os.path.join(tmp_path, DATASET_NAME, 'val'))
tmp_eval_path = Path(os.path.join(tmp_path, DATASET_NAME, 'eval'))

(tmp_train_path).mkdir(exist_ok=True, parents=True)
(tmp_val_path).mkdir(exist_ok=True, parents=True)
(tmp_eval_path).mkdir(exist_ok=True, parents=True)

train_news_file = os.path.join(tmp_train_path, 'news.tsv')
train_behaviors_file = os.path.join(tmp_train_path, 'behaviors.tsv')
val_news_file = os.path.join(tmp_val_path, 'news.tsv')
val_behaviors_file = os.path.join(tmp_val_path, 'behaviors.tsv')
eval_behaviors_file = os.path.join(tmp_eval_path, 'behaviors.tsv')
yaml_file = os.path.join(tmp_path, 'lstur.yaml')

LOG.info(data_path)

INFO:__main__:/home/e11920555/Group_33/data


In [115]:
df = pl.read_parquet(embedding_path)
df

article_id,document_vector
i32,list[f32]
3000022,"[0.065424, -0.047425, … 0.035706]"
3000063,"[0.028815, -0.000166, … 0.027167]"
3000613,"[0.037971, 0.033923, … 0.063961]"
3000700,"[0.046524, 0.002913, … 0.023423]"
3000840,"[0.014737, 0.024068, … 0.045991]"
3001278,"[0.014249, -0.026272, … 0.068203]"
3001299,"[0.046163, -0.034065, … 0.01523]"
3001353,"[0.055219, 0.011371, … 0.007982]"
3001457,"[0.057401, 0.018003, … 0.023905]"
3001459,"[0.023249, -0.062233, … -0.002709]"


## Transform parquet files

In [88]:
COL_IMPRESSION_ID = 0
COL_USER_ID = 8
COL_IMPRESSION_TIME = 2
COL_INVIEW_ARTICLE_IDS = 6
COL_CLICKED_ARTICLE_IDS = 7

def create_behavior_file(behaviors, history, file_path):
    
    # Transform history to a dictionary for fast lookup
    user_history = {}
    for row in history.iter_rows(named=True):
        user_history[f"U{row['user_id']}"] = {
            'article_id': row['article_id_fixed'],
            'impression_time': row['impression_time_fixed']
        }
    
    def transform_row(row):

        impression_id = row[COL_IMPRESSION_ID]
        user_id = f"U{row[COL_USER_ID]}"
        impression_time = row[COL_IMPRESSION_TIME]
        clicked_articles = user_history.get(user_id, {}).get('article_id', [])
        timestamps = user_history.get(user_id, {}).get('impression_time', [])
        
        # Filter click history to include only clicks before the impression time
        user_click_history = [
            f"N{article_id}" for article_id, timestamp in zip(clicked_articles, timestamps) if timestamp < impression_time
        ]
        user_click_history_str = ' '.join(user_click_history)
        
        # Prepare impression news
        inview_articles = row[COL_INVIEW_ARTICLE_IDS]
        clicked_articles = row[COL_CLICKED_ARTICLE_IDS]
        impression_news = [
            f"N{article_id}-{1 if article_id in clicked_articles else 0}" for article_id in inview_articles
        ]
        impression_news_str = ' '.join(impression_news)
        
        return impression_id, user_id, impression_time, user_click_history_str, impression_news_str
    

    behavior_df = pl.DataFrame(behaviors.map_rows(transform_row))
    behavior_df.columns = ["Impression ID", "User ID", "Impression Time", "User Click History", "Impression News"]
    behavior_df.write_csv(file_path, quote_style='never', include_header=False, separator='\t')
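
# The per-row logic of `transform_row` can be checked in isolation with a
# plain-Python sketch. It uses hypothetical toy values and a hypothetical
# helper name, and it assumes that history timestamps and the impression time
# are comparable `datetime` objects.

```python
from datetime import datetime


def build_behavior_line(impression_id, user_id, impression_time,
                        inview_ids, clicked_ids, history_ids, history_times):
    """Build one MIND-style behaviors line from raw IDs and timestamps."""
    # Keep only history clicks that happened before the impression.
    history = [
        f"N{article_id}"
        for article_id, clicked_at in zip(history_ids, history_times)
        if clicked_at < impression_time
    ]
    clicked = set(clicked_ids)  # set membership is O(1) per lookup
    impressions = [
        f"N{article_id}-{1 if article_id in clicked else 0}"
        for article_id in inview_ids
    ]
    return "\t".join([
        str(impression_id),
        f"U{user_id}",
        str(impression_time),
        " ".join(history),
        " ".join(impressions),
    ])


line = build_behavior_line(
    impression_id=7,
    user_id=42,
    impression_time=datetime(2023, 5, 24, 12, 0, 0),
    inview_ids=[101, 102, 103],
    clicked_ids=[102],
    history_ids=[90, 91, 92],
    history_times=[
        datetime(2023, 5, 23, 9, 0, 0),
        datetime(2023, 5, 24, 11, 59, 0),
        datetime(2023, 5, 24, 13, 0, 0),   # after the impression, dropped
    ],
)
print(line)
```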

In [111]:
# Process training data
if not Path(train_behaviors_file).exists() or FORCE_RELOAD:
    train_behaviors = pl.read_parquet(train_behaviors_path)
    train_history = pl.read_parquet(train_history_path)
    print(train_behaviors.columns)
    print(train_history.columns)

    # create_behavior_file writes the TSV file and returns nothing
    create_behavior_file(train_behaviors, train_history, train_behaviors_file)
    
# Target news layout:
# `[News ID] [Category] [Subcategory] [News Title] [News Abstract] [News Url] [Entities in News Title] [Entities in News Abstract] ...`

# if not Path(train_news_file).exists() or FORCE_RELOAD:
train_news = pl.read_parquet(articles_path)
print(train_news.columns)

train_news = train_news.select(["article_id", "category_str", "title", "body", "url"])
train_news.write_csv(train_news_file, quote_style='never', include_header=False, separator='\t')

['article_id', 'title', 'subtitle', 'last_modified_time', 'premium', 'body', 'published_time', 'image_ids', 'article_type', 'url', 'ner_clusters', 'entity_groups', 'topics', 'category', 'subcategory', 'category_str', 'total_inviews', 'total_pageviews', 'total_read_time', 'sentiment_score', 'sentiment_label']
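
Two details of the news export may be worth checking against the news layout described earlier: that layout lists eight tab-separated fields, while the cell above writes five columns, and the behaviors file prefixes article IDs with `N`, while the news file writes the raw `article_id`. In addition, `quote_style='never'` leaves any tab or newline inside a title or body unescaped. The sketch below shows one way a single article could be turned into an eight-field row. It is an assumption about what a reader of the news file expects, not a confirmed requirement, and the helper names and the empty placeholder values are hypothetical.

```python
import re


def clean_field(value):
    """Collapse tabs, newlines and repeated spaces so a field stays on one line."""
    return re.sub(r"\s+", " ", "" if value is None else str(value)).strip()


def to_mind_news_row(article):
    """Format one article (a dict of column values) as an eight-field TSV row."""
    fields = [
        f"N{article['article_id']}",            # same prefix as the behaviors file
        clean_field(article.get("category_str")),
        "",                                     # subcategory left empty here
        clean_field(article.get("title")),
        clean_field(article.get("body")),       # body stands in for the abstract
        clean_field(article.get("url")),
        "[]",                                   # no title entities
        "[]",                                   # no abstract entities
    ]
    return "\t".join(fields)


row = to_mind_news_row({
    "article_id": 3000022,
    "category_str": "sport",
    "title": "A\ttitle",
    "body": "line one\nline two",
    "url": "https://example.com/a",
})
print(row.split("\t"))
```

With polars, rows of this shape could be produced by iterating `train_news.iter_rows(named=True)` and writing one line per article.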


## Create hyper-parameters

In [None]:
hparams = prepare_hparams(yaml_file, 
                          wordEmb_file=wordEmb_file,
                          wordDict_file=wordDict_file, 
                          userDict_file=userDict_file,
                          batch_size=BATCH_SIZE,
                          epochs=EPOCHS)
print(hparams)

In [None]:
iterator = MINDIterator

## Train the LSTUR model

In [None]:
model = LSTURModel(hparams, iterator, seed=SEED)

In [None]:
print(model.run_eval(val_news_file, val_behaviors_file))

In [None]:
%%time
model.fit(train_news_file, train_behaviors_file, val_news_file, val_behaviors_file)

In [None]:
%%time
res_syn = model.run_eval(val_news_file, val_behaviors_file)
print(res_syn)

In [None]:
# Record results for tests - ignore this cell
store_metadata("group_auc", res_syn['group_auc'])
store_metadata("mean_mrr", res_syn['mean_mrr'])
store_metadata("ndcg@5", res_syn['ndcg@5'])
store_metadata("ndcg@10", res_syn['ndcg@10'])
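
# The ranking metrics recorded above are computed per impression and then
# averaged. The sketch below uses the conventional MRR and nDCG@k definitions
# commonly applied to MIND; treat the exact formulas as an assumption, not as
# the library's implementation. Labels and scores are hypothetical.

```python
import math


def rank_labels(labels, scores):
    """Return the click labels ordered by descending predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def mrr(labels, scores):
    """Mean reciprocal rank over the clicked items of one impression."""
    ranked = rank_labels(labels, scores)
    total = sum(label / (pos + 1) for pos, label in enumerate(ranked))
    return total / sum(ranked)


def dcg_at_k(ranked, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(
        (2 ** label - 1) / math.log2(pos + 2)
        for pos, label in enumerate(ranked[:k])
    )


def ndcg_at_k(labels, scores, k):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(rank_labels(labels, scores), k) / ideal


labels = [0, 1, 0, 1]            # 1 = clicked
scores = [0.2, 0.9, 0.4, 0.3]    # model scores for the same candidates
print(round(mrr(labels, scores), 4), round(ndcg_at_k(labels, scores, 5), 4))
```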

## Save the model

In [None]:
model_path = os.path.join(data_path, "model")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "lstur_ckpt"))

## Output Prediction File
This code segment generates the prediction.zip file, which follows the format described in the [MIND Competition Submission Tutorial](https://competitions.codalab.org/competitions/24122#learn_the_details-submission-guidelines).

Please change the `MIND_type` parameter to `large` if you want to submit your prediction to [MIND Competition](https://msnews.github.io/competition.html).

In [None]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(val_news_file, val_behaviors_file)

In [None]:
with open(os.path.join(data_path, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')
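
# The double `argsort` above converts scores into ranks: the candidate with the
# highest score gets rank 1. A dependency-free sketch of the same idea for
# distinct scores is shown here; the function names are hypothetical and tie
# handling is not meant to match NumPy.

```python
def prediction_ranks(scores):
    """Return the 1-based rank of each candidate, highest score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def format_prediction_line(impression_index, scores):
    """Format one line of prediction.txt: '<impression index> [r1,r2,...]'."""
    ranks = prediction_ranks(scores)
    return f"{impression_index} [{','.join(str(r) for r in ranks)}]"


print(format_prediction_line(1, [0.1, 0.9, 0.5]))
```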

In [None]:
# The context manager closes the archive even if writing fails
with zipfile.ZipFile(os.path.join(data_path, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(data_path, 'prediction.txt'), arcname='prediction.txt')

## Reference
\[1\] Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu and Xing Xie: Neural News Recommendation with Long- and Short-term User Representations, ACL 2019<br>
\[2\] Wu, Fangzhao, et al. "MIND: A Large-scale Dataset for News Recommendation" Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. https://msnews.github.io/competition.html <br>
\[3\] GloVe: Global Vectors for Word Representation. https://nlp.stanford.edu/projects/glove/