# LSTUR: Neural News Recommendation with Long- and Short-term User Representations
LSTUR \[1\] is a news recommendation approach that captures both a user's long-term preferences and short-term interests. We use this algorithm to rank the candidate articles of the EB-NeRD dataset.

## Data format
The data format and the available data are defined in \[2\]. You can choose between the demo, small and large variants, and an extra test set is available. We transformed the data by manipulating, reordering and dropping columns into the following format, which should suit the algorithm's implementation.
 
### article data
This file contains the article information: article ID, category, title, body and URL.
A simple example: <br>

`3044020	underholdning	Prins Harry tvunget til dna-test	Den britiske tabloidavis The Sun fortsætter med at lække historier fra den kommende bog om prinsesse Diana, skrevet af prinsessens veninde Simone Simmons.Onsdag er det historien om, at det britiske kongehus lod prins Harry dna-teste for at sikre sig, at prins Charles var far til ham.Hoffet frygtede, at Dianas tidligere elsker James Hewitt, var far til Harry.Dna-testen fandt sted, da Harry var 11 år gammel.Det var en slet skjult hemmelighed, at Diana og Hewitt hyggede sig i sengen, og der var simpelthen en udbredt frygt på Buckingham Palace for, at lidenskaben havde resulteret i rødhårede Harry.Diana selv afviste rygterne og påpegede, at hvis man regnede på datoerne, kunne Hewitt ikke være far til Harry, men frygten for arvefølgen var så stor, at den 11-årige Harry måtte tage testen trods Dianas forsikringer om hans fædrene ophav.	https://ekstrabladet.dk/underholdning/udlandkendte/article3044020.ece
`
<br>

In general, each line in the data file describes one article: <br>

`[Article ID] [Category] [Article Title] [Article Body] [Article URL]`

<br>
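
The layout above can be read back with a few lines of Python. This is a minimal sketch that assumes the fields are separated by tabs, as in the example line; `Article` and `parse_article_line` are hypothetical helper names, and the body text in the sample is shortened for readability.

```python
from typing import NamedTuple


class Article(NamedTuple):
    """One row of the articles file (hypothetical helper type)."""
    article_id: str
    category: str
    title: str
    body: str
    url: str


def parse_article_line(line: str) -> Article:
    """Split one tab-separated line of the articles file into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    return Article(*fields)


# Sample line with a shortened body.
sample = (
    "3044020\tunderholdning\tPrins Harry tvunget til dna-test\t"
    "Den britiske tabloidavis The Sun ...\t"
    "https://ekstrabladet.dk/underholdning/udlandkendte/article3044020.ece"
)
article = parse_article_line(sample)
print(article.article_id, article.category, article.title)
```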

We generate a word_dict file that maps the words in article titles to word indexes, and we initialise an embedding matrix from pretrained Danish embeddings \[3\].
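
The idea can be sketched as follows. This is not the project's `generate_word_dict` or `generate_word_embeddings`; the tokenisation rule, the reserved padding index 0, the random initialisation of unknown words and the toy `pretrained` dictionary are all assumptions made for illustration.

```python
import re

import numpy as np


def build_word_dict(titles):
    """Map each distinct lower-cased token to an index starting at 1 (0 is kept for padding)."""
    word_dict = {}
    for title in titles:
        for token in re.findall(r"\w+", title.lower()):
            if token not in word_dict:
                word_dict[token] = len(word_dict) + 1
    return word_dict


def build_embedding_matrix(word_dict, pretrained, dim, seed=0):
    """Copy pretrained vectors where available and randomly initialise the rest."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(word_dict) + 1, dim))
    matrix[0] = 0.0  # padding row
    for word, idx in word_dict.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
    return matrix


titles = ["Prins Harry tvunget til dna-test"]
# Toy stand-in for the pretrained Danish vectors (assumption).
pretrained = {"prins": np.ones(4), "harry": np.full(4, 2.0)}

word_dict = build_word_dict(titles)
embeddings = build_embedding_matrix(word_dict, pretrained, dim=4)
print(word_dict)
print(embeddings.shape)
```

Words without a pretrained vector keep their random row, so the model can still learn a representation for them during training.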

### behaviors data
A simple example: <br>
`326175906	615468	2023-05-21T07:44:40.000000	9740058 9739999 9740046 9740003 9739999 9735311 9739864 9739893 9739837 9647575 9735579 9737243 9739883 9739870 9739471 9739783 9739802 9741910 9741871 9741802 9741788 9741832 9741819 9742027 9741850 9742001 9741896 9742681 9667501 9742693 9742225 9742161 9742261 9742764 9743386 9741848 9740576 9736862 9743298 9743755 9743733 9743767 9740845 9743692 9743574 9739399 9737199 9745221 9745367 9744347 9744733 9744988 9745034 9745034 9744897 9745016 9745016 9746395 9747119 9747119 9747092 9747074 9747762 9747781 9747789 9747404 9745848 9746105 9747781 9746105 9746342 9747437 9737243 9747437 9730564 9747480 9747495 9747320 9747220 9749392 9750815 9750708 9751284 9751284 9749154 9751135 9751220 9752155 9751962 9752146 9751786 9751975 9752323 9725978 9748035 9749966 9751786 9752400 9752402 9752403 9752350 9752320 9750873 9750891 9752243 9753543 9753503 9753526 9753455 9753479 9753351 9750107 9752332 9752332 9753168 9740047 9752994 9752905 9754413 9754271 9754490 9754269 9752685 9750772 9752463 9752463 9752463 9752463 9755364 9755361 9755361 9755298 9753905 9753995 9753775 9754350 9755178 9753653 9754603 9754925 9754882 9753207 9753207 9759461 9759309 9759389 9759164 9760934 9760521 9760829 9758182 9757801 9760857 9760528 9761083 9761087 9761047 9760112 9760962 9760563 9761031 9760944 9759782 9760747 9760747 9761772 9761683 9761531 9761620 9761359 9760796 9761638 9759157 9760290 9759433 9761635 9685790 9761363 9761422 9761469 9754571 9761288 9762353 9762225 9761914 9759708 9761561 9762028 9761803 9761858 9763150 9763159 9763120 9763634 9763656 9763489 9763401 9763634 9763702 9763401 9763247 9761588 9761768 9765804 9765753 9765675 9763400 9765894 9766949 9762678 9766592 9759476 9759476 9766307 9766307 9767426 9767399 9766225 9767233	9773857-0 9771996-0 9772545-0 9774297-0 9774187-0 9774142-1 9770028-0`
<br>

In general, each line in the data file represents one impression. The format is: <br>

`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`

<br>

User Click History lists the articles the user clicked before Impression Time. Impression News lists the articles displayed in the impression, in the following format:<br>

`[Article ID 1]-[label1] ... [Article ID n]-[labeln]`

<br>
Label indicates whether the user clicked the article (1) or not (0). Full information about the articles in User Click History and Impression News can be found in the article data file.
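
A behaviors line can be split into its parts as shown below. This is a minimal sketch that assumes tab-separated fields and space-separated article lists, as in the example; `parse_behavior_line` is a hypothetical helper, and the click history in the sample is shortened to three entries.

```python
def parse_behavior_line(line):
    """Parse one behaviors line into ids, click history, candidates and labels."""
    impression_id, user_id, impression_time, history, impressions = (
        line.rstrip("\n").split("\t")
    )
    candidates, labels = [], []
    for item in impressions.split():
        # Each item looks like "<article id>-<label>".
        article_id, label = item.rsplit("-", 1)
        candidates.append(article_id)
        labels.append(int(label))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "impression_time": impression_time,
        "history": history.split(),
        "candidates": candidates,
        "labels": labels,
    }


# Sample line with a shortened click history.
sample = (
    "326175906\t615468\t2023-05-21T07:44:40.000000\t"
    "9740058 9739999 9740046\t"
    "9773857-0 9771996-0 9772545-0 9774297-0 9774187-0 9774142-1 9770028-0"
)
impression = parse_behavior_line(sample)
print(impression["candidates"])
print(impression["labels"])
```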

## Imports and Global settings

In [1]:
import os, sys, zipfile, logging
import numpy as np
import tensorflow as tf

from pathlib import Path
from tqdm import tqdm

from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.lstur import LSTURModel

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.11.8 (main, Feb  6 2024, 21:21:21) [Clang 15.0.0 (clang-1500.1.0.2.5)]
Tensorflow version: 2.15.1


### Configure logging settings

In [2]:
# configurations
tf.get_logger().setLevel('ERROR') # only show error messages

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)
LOG.setLevel(logging.INFO)

## Prepare Parameters and File Locations

In [3]:
# LSTUR parameters
EPOCHS = 5
SEED = 40
BATCH_SIZE = 32

# whether to re-compute the dataset
FORCE_RELOAD = True
SAMPLE_SIZE = 0.01

# path to the dataset
DATASET_NAME = "demo" # one of: demo, small, large
TEMP_DIR = "tmp"
GROUP_PROJECT_PATH = "/Users/maxkleinegger/Downloads/"

In [4]:
# dataset parquet files path
PATH = Path(os.path.join(GROUP_PROJECT_PATH, DATASET_NAME))
train_behaviors_path = os.path.join(PATH, 'train', 'behaviors.parquet')
train_history_path = os.path.join(PATH, 'train', 'history.parquet')
val_behaviors_path = os.path.join(PATH, 'validation', 'behaviors.parquet')
val_history_path = os.path.join(PATH, 'validation', 'history.parquet')
articles_path = os.path.join(PATH, 'articles.parquet')

embedding_path = os.path.abspath('/Users/maxkleinegger/Downloads/38/model.txt')
LOG.info(PATH)

INFO:__main__:/Users/maxkleinegger/Downloads/demo


In [5]:
# artifacts file path
TMP_PATH = Path(os.path.join(GROUP_PROJECT_PATH, TEMP_DIR))
tmp_train_path = Path(os.path.join(TMP_PATH, DATASET_NAME, 'train'))
tmp_val_path = Path(os.path.join(TMP_PATH, DATASET_NAME, 'val'))

# create directories if not exist
tmp_train_path.mkdir(exist_ok=True, parents=True)
tmp_val_path.mkdir(exist_ok=True, parents=True)

train_behaviors_file = os.path.join(tmp_train_path, 'behaviors.tsv')
val_behaviors_file = os.path.join(tmp_val_path, 'behaviors.tsv')
articles_file = os.path.join(TMP_PATH, 'articles.tsv')

# hyperparameters
yaml_file = os.path.join('../src/group_33/configs/lstur.yaml')
user_dict_file = os.path.join(tmp_train_path, 'user_dict.pkl')
words_dict_file = os.path.join(tmp_train_path, 'words_dict.pkl')
word_embeddings_file = os.path.join(tmp_train_path, 'word_embeddings.npy')

## Transform parquet files
This section transforms the provided `parquet` files into readable `tsv` files.

In [6]:
from group_33.preprocessing import transfrom_behavior_file, transform_articles_file

In [7]:
if not Path(train_behaviors_file).exists() or FORCE_RELOAD:
    df_behavior_train = transfrom_behavior_file(train_behaviors_path, train_history_path, train_behaviors_file, sample_size=SAMPLE_SIZE)
if not Path(val_behaviors_file).exists() or FORCE_RELOAD:
    df_behavior_val = transfrom_behavior_file(val_behaviors_path, val_history_path, val_behaviors_file, sample_size=SAMPLE_SIZE)

if not Path(articles_file).exists() or FORCE_RELOAD:
    df_articles = transform_articles_file(articles_path, articles_file)
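
The project's `transfrom_behavior_file` is not shown here. As a rough illustration of the kind of transformation involved, the sketch below joins a toy behaviors table with a toy history table and renders the five-column layout described in the data format section. The column names (`article_ids_inview`, `article_ids_clicked`, `article_id_fixed`) and the helper `to_behaviors_tsv` are assumptions, and users without a history are simply dropped by the inner join.

```python
import pandas as pd

# Toy stand-ins for behaviors.parquet and history.parquet (column names are assumptions).
behaviors = pd.DataFrame({
    "impression_id": [1],
    "user_id": [10],
    "impression_time": ["2023-05-21T07:44:40.000000"],
    "article_ids_inview": [[101, 102, 103]],
    "article_ids_clicked": [[102]],
})
history = pd.DataFrame({"user_id": [10], "article_id_fixed": [[90, 91]]})


def to_behaviors_tsv(behaviors, history):
    """Join impressions with click histories and render one TSV line per impression."""
    merged = behaviors.merge(history, on="user_id", how="inner")
    merged["history"] = merged["article_id_fixed"].apply(
        lambda ids: " ".join(str(i) for i in ids)
    )
    merged["impressions"] = merged.apply(
        lambda row: " ".join(
            f"{a}-{int(a in set(row['article_ids_clicked']))}"
            for a in row["article_ids_inview"]
        ),
        axis=1,
    )
    columns = ["impression_id", "user_id", "impression_time", "history", "impressions"]
    return merged[columns].to_csv(sep="\t", header=False, index=False)


tsv = to_behaviors_tsv(behaviors, history)
print(tsv)
```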

## Create hyper-parameters

In [8]:
from group_33.preprocessing import generate_user_mapping, generate_word_dict, generate_word_embeddings

if not Path(user_dict_file).exists() or FORCE_RELOAD:
    generate_user_mapping(df_behavior_train, user_dict_file)


In [9]:
if not Path(words_dict_file).exists() or FORCE_RELOAD:
    words_id_mapping = generate_word_dict(df_articles, words_dict_file)
    
if not Path(word_embeddings_file).exists() or FORCE_RELOAD:
    generate_word_embeddings(words_id_mapping, embedding_path, word_embeddings_file)


In [10]:
hparams = prepare_hparams(
    yaml_file,
    wordEmb_file=word_embeddings_file,
    wordDict_file=words_dict_file,
    userDict_file=user_dict_file,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
)

In [11]:
from group_33.ebnerditerator import EbnerdIterator

iterator = EbnerdIterator

## Train the LSTUR model

In [12]:
model = LSTURModel(hparams, iterator, seed=SEED)



Tensor("conv1d/Relu:0", shape=(None, 30, 400), dtype=float32)
Tensor("att_layer2/Sum_1:0", shape=(None, 400), dtype=float32)




### Score without Training

In [13]:
print(model.run_eval(articles_file, val_behaviors_file))

369it [00:01, 297.87it/s]

{'group_auc': 0.4628, 'mean_mrr': 0.2842, 'ndcg@5': 0.2953, 'ndcg@10': 0.4015}
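
To make the reported numbers easier to read, here is a small worked example of the three metric families for a single impression, following their usual definitions. It is a sketch, not the evaluation code of the `recommenders` library, whose implementation may differ in details; the assumption is that each metric is computed per impression and then averaged over all impressions.

```python
import numpy as np


def auc_score(labels, scores):
    """Fraction of (clicked, non-clicked) pairs in which the clicked article scores higher."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))


def mrr_score(labels, scores):
    """Mean reciprocal rank of the clicked articles."""
    order = np.argsort(scores)[::-1]
    ranked = np.take(labels, order)
    return (ranked / (np.arange(len(ranked)) + 1)).sum() / ranked.sum()


def dcg_at_k(labels, scores, k):
    """Discounted cumulative gain of the top-k articles ordered by score."""
    order = np.argsort(scores)[::-1]
    ranked = np.take(labels, order[:k])
    gains = 2.0 ** ranked - 1
    discounts = np.log2(np.arange(len(ranked)) + 2)
    return (gains / discounts).sum()


def ndcg_at_k(labels, scores, k):
    """DCG normalised by the DCG of a perfect ordering."""
    return dcg_at_k(labels, scores, k) / dcg_at_k(labels, labels, k)


# One impression with four candidates; the third one was clicked.
labels = np.array([0, 0, 1, 0])
scores = np.array([0.2, 0.9, 0.6, 0.1])

print(auc_score(labels, scores))     # 2 of the 3 non-clicked articles score lower: 2/3
print(mrr_score(labels, scores))     # clicked article is ranked second: 0.5
print(ndcg_at_k(labels, scores, 5))  # 1 / log2(3), roughly 0.63
```

An AUC near 0.5 means clicked articles are ranked no better than chance, which matches what an untrained model would be expected to produce.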


### Score with Training

In [14]:
model.fit(articles_file, train_behaviors_file, articles_file, val_behaviors_file)


at epoch 1
train info: logloss loss:1.610147163271904
eval info: group_auc:0.4749, mean_mrr:0.2756, ndcg@10:0.404, ndcg@5:0.2916
at epoch 1 , train time: 4.9 eval time: 2.5


8it [00:04,  1.93it/s]
369it [00:01, 308.96it/s]
8it [00:01,  6.82it/s]
253it [00:00, 3442.97it/s]


at epoch 2
train info: logloss loss:1.5991408824920654
eval info: group_auc:0.4895, mean_mrr:0.2808, ndcg@10:0.4117, ndcg@5:0.3042
at epoch 2 , train time: 4.1 eval time: 2.6


8it [00:05,  1.55it/s]
369it [00:01, 236.44it/s]
8it [00:01,  7.02it/s]
253it [00:00, 22937.53it/s]


at epoch 3
train info: logloss loss:1.5816838890314102
eval info: group_auc:0.5042, mean_mrr:0.2977, ndcg@10:0.426, ndcg@5:0.3266
at epoch 3 , train time: 5.2 eval time: 2.9


8it [00:03,  2.10it/s]
369it [00:01, 299.77it/s]
8it [00:01,  6.72it/s]
253it [00:00, 24731.03it/s]


at epoch 4
train info: logloss loss:1.5750245451927185
eval info: group_auc:0.506, mean_mrr:0.3041, ndcg@10:0.4308, ndcg@5:0.3294
at epoch 4 , train time: 3.8 eval time: 2.6


8it [00:03,  2.01it/s]
369it [00:01, 344.69it/s]
8it [00:01,  6.57it/s]
253it [00:00, 28427.96it/s]


at epoch 5
train info: logloss loss:1.5603198260068893
eval info: group_auc:0.5116, mean_mrr:0.315, ndcg@10:0.4381, ndcg@5:0.3432
at epoch 5 , train time: 4.0 eval time: 2.5


<recommenders.models.newsrec.models.lstur.LSTURModel at 0x2db083310>

In [15]:
res_syn = model.run_eval(articles_file, val_behaviors_file)
print(res_syn)

369it [00:01, 315.04it/s]
8it [00:01,  6.29it/s]
253it [00:00, 17212.08it/s]


{'group_auc': 0.5116, 'mean_mrr': 0.315, 'ndcg@5': 0.3432, 'ndcg@10': 0.4381}


## Save the model

In [16]:
model_path = os.path.join(PATH, "model", "lstur-native")
os.makedirs(model_path, exist_ok=True)

model.model.save_weights(os.path.join(model_path, "lstur_native_model"))

## Output Prediction File
This code segment generates the prediction.zip file in the format described in the [Ebnerd Competition Submission Guidelines](https://recsys.eb.dk/#dataset).


In [17]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(articles_file, val_behaviors_file)

369it [00:01, 365.34it/s]
8it [00:01,  5.76it/s]
253it [00:00, 26087.44it/s]


In [18]:
with open(os.path.join(PATH, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')

253it [00:00, 130028.05it/s]
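
The double `argsort` in the cell above turns scores into ranks: the first `argsort` (reversed) lists the candidate indices from best to worst, and the second one returns the position of each candidate in that list. A small worked example, using a hypothetical helper `scores_to_ranks` and made-up scores:

```python
import numpy as np


def scores_to_ranks(preds):
    """Return the 1-based rank of each candidate, where rank 1 is the highest score."""
    order = np.argsort(preds)[::-1]  # candidate indices, best first
    ranks = np.argsort(order) + 1    # position of each candidate in that order
    return ranks.tolist()


preds = [0.1, 0.9, 0.5]  # made-up scores for three candidates
ranks = scores_to_ranks(preds)

# Same line layout as written to prediction.txt: "<impression index> [r1,r2,...]"
line = " ".join(["1", "[" + ",".join(str(r) for r in ranks) + "]"])
print(line)  # 1 [3,1,2]
```

The ranks stay aligned with the original candidate order, so the second candidate (score 0.9) gets rank 1 and the first candidate (score 0.1) gets rank 3.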


In [19]:
# the context manager closes the archive even if writing fails
with zipfile.ZipFile(os.path.join(PATH, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(PATH, 'prediction.txt'), arcname='prediction.txt')
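
Before uploading, it can be worth checking that the archive holds `prediction.txt` at its root. The sketch below runs the same zipping step in a temporary directory and reads the archive back; `zip_prediction` is a hypothetical helper and the file content is a made-up single line.

```python
import os
import tempfile
import zipfile


def zip_prediction(txt_path, zip_path):
    """Store txt_path in zip_path under the fixed archive name 'prediction.txt'."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(txt_path, arcname="prediction.txt")


with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "prediction.txt")
    zip_path = os.path.join(tmp, "prediction.zip")
    with open(txt_path, "w") as f:
        f.write("1 [3,1,2]\n")  # made-up prediction line

    zip_prediction(txt_path, zip_path)

    # Read the archive back to confirm its layout and content.
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        content = archive.read("prediction.txt").decode()

print(names)
print(content.strip())
```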

## Reference
\[1\] Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu and Xing Xie: Neural News Recommendation with Long- and Short-term User Representations, ACL 2019<br>
\[2\] EB-NeRD benchmark data and description: https://recsys.eb.dk/dataset/ <br>
\[3\] Pre-Trained embeddings: http://vectors.nlpl.eu/repository/#