# LSTUR: Neural News Recommendation with Long- and Short-term User Representations
## Important:
This notebook served only as a hands-on exercise in implementing the LSTUR algorithm, to gain a better understanding of its architecture and functionality. It also allowed us to fully understand how the data needs to be pre-processed and provided. Further optimizations and the final predictions were done in the `lstur.ipynb` notebook. For this reason, this notebook is only executable with a Jupyter kernel and does not provide final results. Please do not take this notebook into consideration for grading (see Report).

## LSTUR
LSTUR \[1\] is a news recommendation approach that captures both the long-term preferences and the short-term interests of users. We use this algorithm to perform the required ranking on the Ebnerd dataset.

## Data format:
The data format and the available data are defined in \[2\]; you can choose between the demo, small and large variants, with an extra test set available. We transformed the data by manipulating, reordering and dropping columns into the following format, which should be suitable for the algorithm's implementation.
 
### article data
This file contains the news information: article ID, category, title, body and URL.
One simple example: <br>

`3044020	underholdning	Prins Harry tvunget til dna-test	Den britiske tabloidavis The Sun fortsætter med at lække historier fra den kommende bog om prinsesse Diana, skrevet af prinsessens veninde Simone Simmons.Onsdag er det historien om, at det britiske kongehus lod prins Harry dna-teste for at sikre sig, at prins Charles var far til ham.Hoffet frygtede, at Dianas tidligere elsker James Hewitt, var far til Harry.Dna-testen fandt sted, da Harry var 11 år gammel.Det var en slet skjult hemmelighed, at Diana og Hewitt hyggede sig i sengen, og der var simpelthen en udbredt frygt på Buckingham Palace for, at lidenskaben havde resulteret i rødhårede Harry.Diana selv afviste rygterne og påpegede, at hvis man regnede på datoerne, kunne Hewitt ikke være far til Harry, men frygten for arvefølgen var så stor, at den 11-årige Harry måtte tage testen trods Dianas forsikringer om hans fædrene ophav.	https://ekstrabladet.dk/underholdning/udlandkendte/article3044020.ece
`
<br>

In general, each line in the data file describes one news article: <br>

`[Article ID] [Category] [Article Title] [Article Body] [Article URL]`

<br>
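
To make the layout concrete, here is a minimal sketch of reading one such line. It assumes the fields are separated by tabs (the files are written as `.tsv`); the `Article` tuple and `parse_article_line` are hypothetical names for illustration and are not part of `group_33`. The article body in the example is shortened.

```python
from typing import NamedTuple


class Article(NamedTuple):
    """Hypothetical container for the five fields of one article line."""
    article_id: str
    category: str
    title: str
    body: str
    url: str


def parse_article_line(line: str) -> Article:
    """Split one tab-separated line of the article file into its five fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 tab-separated fields, got {len(fields)}")
    return Article(*fields)


# Shortened version of the example line above (body truncated for readability).
example = (
    "3044020\tunderholdning\tPrins Harry tvunget til dna-test\t"
    "Den britiske tabloidavis The Sun ...\t"
    "https://ekstrabladet.dk/underholdning/udlandkendte/article3044020.ece"
)
article = parse_article_line(example)
print(article.article_id, article.category, article.title)
```

Raising on an unexpected field count is a deliberate choice here: a stray tab inside a title or body would otherwise shift the columns silently.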

We generate a word_dict file to transform the words in news titles into word indexes, and an embedding matrix is initialized from pretrained Danish embeddings \[3\].
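
The following is a minimal sketch of that idea, not the actual `generate_word_dict` / `generate_word_embeddings` implementation. It assumes whitespace tokenization with lower-casing, index 0 reserved for padding, and small random vectors for words missing from the pretrained set; all function names and the toy vectors are hypothetical.

```python
import random


def build_word_dict(titles):
    """Give each distinct lower-cased token an index starting at 1 (0 = padding)."""
    word_dict = {}
    for title in titles:
        for token in title.lower().split():  # assumption: whitespace tokenization
            if token not in word_dict:
                word_dict[token] = len(word_dict) + 1
    return word_dict


def init_embedding_matrix(word_dict, pretrained, dim, seed=0):
    """Build one row per word index; row 0 stays all zeros for padding."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(word_dict) + 1)]
    for word, index in word_dict.items():
        if word in pretrained:
            # Copy the pretrained vector when the word is known.
            matrix[index] = list(pretrained[word])
        else:
            # Assumption: unknown words get a small random vector.
            matrix[index] = [rng.gauss(0.0, 0.1) for _ in range(dim)]
    return matrix


titles = ["Prins Harry tvunget til dna-test", "Harry og Diana"]
word_dict = build_word_dict(titles)
pretrained = {"harry": [0.1, 0.2, 0.3]}  # toy stand-in for the Danish vectors [3]
matrix = init_embedding_matrix(word_dict, pretrained, dim=3)
print(word_dict["harry"], matrix[word_dict["harry"]])  # 2 [0.1, 0.2, 0.3]
```

Keeping row 0 at zero means padded title positions contribute nothing to the encoder input, which is why the word indexes start at 1 in this sketch.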

### behaviors data
One simple example: <br>
`326175906	615468	2023-05-21T07:44:40.000000	9740058 9739999 9740046 9740003 9739999 9735311 9739864 9739893 9739837 9647575 9735579 9737243 9739883 9739870 9739471 9739783 9739802 9741910 9741871 9741802 9741788 9741832 9741819 9742027 9741850 9742001 9741896 9742681 9667501 9742693 9742225 9742161 9742261 9742764 9743386 9741848 9740576 9736862 9743298 9743755 9743733 9743767 9740845 9743692 9743574 9739399 9737199 9745221 9745367 9744347 9744733 9744988 9745034 9745034 9744897 9745016 9745016 9746395 9747119 9747119 9747092 9747074 9747762 9747781 9747789 9747404 9745848 9746105 9747781 9746105 9746342 9747437 9737243 9747437 9730564 9747480 9747495 9747320 9747220 9749392 9750815 9750708 9751284 9751284 9749154 9751135 9751220 9752155 9751962 9752146 9751786 9751975 9752323 9725978 9748035 9749966 9751786 9752400 9752402 9752403 9752350 9752320 9750873 9750891 9752243 9753543 9753503 9753526 9753455 9753479 9753351 9750107 9752332 9752332 9753168 9740047 9752994 9752905 9754413 9754271 9754490 9754269 9752685 9750772 9752463 9752463 9752463 9752463 9755364 9755361 9755361 9755298 9753905 9753995 9753775 9754350 9755178 9753653 9754603 9754925 9754882 9753207 9753207 9759461 9759309 9759389 9759164 9760934 9760521 9760829 9758182 9757801 9760857 9760528 9761083 9761087 9761047 9760112 9760962 9760563 9761031 9760944 9759782 9760747 9760747 9761772 9761683 9761531 9761620 9761359 9760796 9761638 9759157 9760290 9759433 9761635 9685790 9761363 9761422 9761469 9754571 9761288 9762353 9762225 9761914 9759708 9761561 9762028 9761803 9761858 9763150 9763159 9763120 9763634 9763656 9763489 9763401 9763634 9763702 9763401 9763247 9761588 9761768 9765804 9765753 9765675 9763400 9765894 9766949 9762678 9766592 9759476 9759476 9766307 9766307 9767426 9767399 9766225 9767233	9773857-0 9771996-0 9772545-0 9774297-0 9774187-0 9774142-1 9770028-0`
<br>

In general, each line in the data file represents one impression. The format is: <br>

`[Impression ID] [User ID] [Impression Time] [User Click History] [Impression News]`

<br>

User Click History lists the news the user clicked before Impression Time. Impression News lists the news displayed in an impression, in the following format:<br>

`[Article ID 1]-[label1] ... [Article ID n]-[labeln]`

<br>
Label indicates whether the article was clicked by the user. All information about the articles in User Click History and Impression News can be found in the article data file.
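
As an illustration of this layout, the sketch below splits one behaviors line into its parts. It assumes tab-separated fields and space-separated article IDs, as in the example above; `parse_behavior_line` is a hypothetical helper and the click history is shortened.

```python
def parse_behavior_line(line: str) -> dict:
    """Split one behaviors line into IDs, click history, candidates and labels."""
    impression_id, user_id, time, history, impressions = line.rstrip("\n").split("\t")
    candidates, labels = [], []
    for item in impressions.split():
        # Each item has the form "<article id>-<label>", where label is 0 or 1.
        article_id, label = item.rsplit("-", 1)
        candidates.append(article_id)
        labels.append(int(label))
    return {
        "impression_id": impression_id,
        "user_id": user_id,
        "time": time,
        "history": history.split(),  # an empty history yields an empty list
        "candidates": candidates,
        "labels": labels,
    }


# Shortened version of the example line above.
example = (
    "326175906\t615468\t2023-05-21T07:44:40.000000\t"
    "9740058 9739999 9740046\t"
    "9773857-0 9771996-0 9774142-1"
)
parsed = parse_behavior_line(example)
print(parsed["candidates"], parsed["labels"])
```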

## Imports and Global settings
Within this section, we import all the necessary packages and configure the logger. 

In [1]:
import os, sys, zipfile, logging
import numpy as np
import tensorflow as tf

from pathlib import Path
from tqdm import tqdm

from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.models.lstur import LSTURModel

from group_33.native_lstur import transfrom_behavior_file, transform_articles_file
from group_33.native_lstur import generate_user_mapping, generate_word_dict, generate_word_embeddings
from group_33.ebnerditerator import EbnerdIterator

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

2024-07-08 18:59:14.818574: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-08 18:59:14.818636: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-08 18:59:14.820685: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-08 18:59:14.834585: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


System version: 3.11.9 | packaged by conda-forge | (main, Apr 19 2024, 18:36:13) [GCC 12.3.0]
Tensorflow version: 2.15.1


### Configure logging settings

In [2]:
# configurations
tf.get_logger().setLevel('ERROR') # only show error messages

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger(__name__)
LOG.setLevel(logging.INFO)

## Prepare Parameters and File Locations
First, we prepare all necessary paths and parameters. This section also creates any directories that do not yet exist and is the main place to configure the execution. Note that we provide the respective word embeddings within the native-lstur directory, so they do not need to be downloaded.

In [3]:
# LSTUR parameters
EPOCHS = 5
SEED = 40
BATCH_SIZE = 32

# whether to re-compute the dataset
FORCE_RELOAD = False
SAMPLE_SIZE = 0.01

# path to the dataset
DATASET_NAME = "small" # one of: demo, small, large
# expanduser() resolves "~"; Path does not expand it on its own
DATA_PATH = Path("~/shared/194.035-2024S/groups/Gruppe_33/Group_33/data").expanduser()

In [4]:
# dataset parquet files path
PATH = Path(os.path.join(DATA_PATH, DATASET_NAME))
train_behaviors_path = os.path.join(PATH, 'train', 'behaviors.parquet')
train_history_path = os.path.join(PATH, 'train', 'history.parquet')
val_behaviors_path = os.path.join(PATH, 'validation', 'behaviors.parquet')
val_history_path = os.path.join(PATH, 'validation', 'history.parquet')
articles_path = os.path.join(PATH, 'articles.parquet')

embedding_path = os.path.join(DATA_PATH, "native-lstur", "embeddings.txt")

In [5]:
# artifacts file path
TMP_PATH = Path(os.path.join(DATA_PATH, "native-lstur"))
tmp_train_path = Path(os.path.join(TMP_PATH, DATASET_NAME, 'train'))
tmp_val_path = Path(os.path.join(TMP_PATH, DATASET_NAME, 'val'))

# create directories if not exist
tmp_train_path.mkdir(exist_ok=True, parents=True)
tmp_val_path.mkdir(exist_ok=True, parents=True)

train_behaviors_file = os.path.join(tmp_train_path, 'behaviors.tsv')
val_behaviors_file = os.path.join(tmp_val_path, 'behaviors.tsv')
articles_file = os.path.join(TMP_PATH, 'articles.tsv')

# hyperparameters
yaml_file = '../src/group_33/configs/lstur.yaml'
user_dict_file = os.path.join(tmp_train_path, 'user_dict.pkl')
words_dict_file = os.path.join(tmp_train_path, 'words_dict.pkl')
word_embeddings_file = os.path.join(tmp_train_path, 'word_embeddings.npy')

## Transform parquet files
This section transforms the provided `parquet` files into readable `tsv` files in the format specified above.
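
To illustrate the target format, here is a minimal sketch of how one behaviors row could be assembled from an impression record and a per-user click history. It is not the `transfrom_behavior_file` implementation: the dictionary keys (`impression_id`, `user_id`, `impression_time`, `article_ids_inview`, `article_ids_clicked`) are assumed column names and should be checked against the parquet files, and `to_behavior_row` is a hypothetical helper.

```python
def to_behavior_row(impression: dict, history_by_user: dict) -> str:
    """Build one tab-separated behaviors line from an impression record."""
    clicked = set(impression["article_ids_clicked"])
    history = history_by_user.get(impression["user_id"], [])
    # Label every displayed article with 1 if it was clicked, else 0.
    impression_news = " ".join(
        f"{aid}-{int(aid in clicked)}" for aid in impression["article_ids_inview"]
    )
    return "\t".join([
        str(impression["impression_id"]),
        str(impression["user_id"]),
        impression["impression_time"],
        " ".join(str(aid) for aid in history),
        impression_news,
    ])


# Toy records with assumed column names.
impression = {
    "impression_id": 326175906,
    "user_id": 615468,
    "impression_time": "2023-05-21T07:44:40.000000",
    "article_ids_inview": [9773857, 9771996, 9774142],
    "article_ids_clicked": [9774142],
}
history_by_user = {615468: [9740058, 9739999]}
row = to_behavior_row(impression, history_by_user)
print(row)
```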

In [8]:
if not Path(train_behaviors_file).exists() or FORCE_RELOAD:
    df_behavior_train = transfrom_behavior_file(train_behaviors_path, train_history_path, train_behaviors_file, fraction=SAMPLE_SIZE)
if not Path(val_behaviors_file).exists() or FORCE_RELOAD:
    df_behavior_val = transfrom_behavior_file(val_behaviors_path, val_history_path, val_behaviors_file, fraction=SAMPLE_SIZE)

if not Path(articles_file).exists() or FORCE_RELOAD:
    df_articles = transform_articles_file(articles_path, articles_file)

## Create hyper-parameters
In the next step, we create the mapping from user IDs to unique indexes, as well as the word-to-index mapping and the word-embedding matrix. Afterwards, we pass them all to the `hparams` object.
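
A user mapping of this kind can be sketched as follows. This is an illustration rather than the `generate_user_mapping` implementation; it assumes that indexes start at 1 so that 0 can stand for users not seen during training, and `build_user_mapping` is a hypothetical name.

```python
import pickle


def build_user_mapping(user_ids):
    """Map each distinct user ID to an index starting at 1 (0 = unknown user)."""
    mapping = {}
    for uid in user_ids:
        if uid not in mapping:
            mapping[uid] = len(mapping) + 1
    return mapping


user_ids = ["615468", "22779", "615468", "150224"]
mapping = build_user_mapping(user_ids)

# Round trip through pickle, mirroring how a .pkl file would be written and read.
restored = pickle.loads(pickle.dumps(mapping))

print(restored["615468"])         # 1
print(restored.get("999999", 0))  # 0, user not present in the training data
```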

In [9]:
if not Path(user_dict_file).exists() or FORCE_RELOAD:
    generate_user_mapping(df_behavior_train, user_dict_file)

In [None]:
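# NOTE: `or True` forces regeneration on every run, so `words_id_mapping` is
# always defined for the embedding step below; this needs `df_articles` from
# the transform cell above.
if not Path(words_dict_file).exists() or True: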
    words_id_mapping = generate_word_dict(df_articles, words_dict_file)
    
if not Path(word_embeddings_file).exists() or FORCE_RELOAD:
    generate_word_embeddings(words_id_mapping, embedding_path, word_embeddings_file)

In [None]:
hparams = prepare_hparams(
    yaml_file,
    wordEmb_file=word_embeddings_file,
    wordDict_file=words_dict_file,
    userDict_file=user_dict_file,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS
)

## Train the LSTUR model

### Iterator
Finally, we had to adapt the data iterator to make it work with our data. This required changing a few small functions that load articles and behaviors; afterwards, we could use it directly.

In [None]:
iterator = EbnerdIterator

### Model
Lastly, we create the model itself.

In [None]:
model = LSTURModel(hparams, iterator, seed=SEED)

#### Score without Training

In [None]:
print(model.run_eval(articles_file, val_behaviors_file))

#### Score with Training

In [None]:
model.fit(articles_file, train_behaviors_file, articles_file, val_behaviors_file)

In [None]:
res_syn = model.run_eval(articles_file, val_behaviors_file)
print(res_syn)

### Save the model
Lastly, we save the model's weights for reproducibility.

In [None]:
# expanduser() resolves "~"; Path does not expand it on its own
MODEL_PATH = Path("~/shared/194.035-2024S/groups/Gruppe_33/Group_33/model/LSTUR/native").expanduser()
os.makedirs(MODEL_PATH, exist_ok=True)

model.model.save_weights(os.path.join(MODEL_PATH, "lstur_native_model"))

## Output Prediction File
This code segment generates the prediction.zip file, which follows the format described in the [Ebnerd Competition Submission Guidelines](https://recsys.eb.dk/#dataset).


In [None]:
group_impr_indexes, group_labels, group_preds = model.run_fast_eval(articles_file, val_behaviors_file)

In [None]:
with open(os.path.join(TMP_PATH, 'prediction.txt'), 'w') as f:
    for impr_index, preds in tqdm(zip(group_impr_indexes, group_preds)):
        impr_index += 1
        # double argsort turns scores into 1-based ranks (highest score gets rank 1)
        pred_rank = (np.argsort(np.argsort(preds)[::-1]) + 1).tolist()
        pred_rank = '[' + ','.join([str(i) for i in pred_rank]) + ']'
        f.write(' '.join([str(impr_index), pred_rank])+ '\n')
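
The ranking step in the cell above can be traced with a small worked example. The sketch below uses plain Python instead of NumPy and a hypothetical helper `rank_by_score`; for distinct scores it should produce the same ranks as the double `argsort` expression, while tied scores may be ordered differently.

```python
def rank_by_score(scores):
    """Return the 1-based rank of each score, where the highest score gets rank 1."""
    # Indexes of the scores, ordered from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


scores = [0.12, 0.05, 0.80, 0.33]  # toy prediction scores for one impression
ranks = rank_by_score(scores)

# Same line layout as written to prediction.txt: "<impression index> [r1,r2,...]"
line = " ".join(["1", "[" + ",".join(str(r) for r in ranks) + "]"])
print(line)  # 1 [3,4,1,2]
```

The third article has the highest score and therefore rank 1, while the second has the lowest score and rank 4.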

In [None]:
# the context manager closes the archive even if writing fails
with zipfile.ZipFile(os.path.join(TMP_PATH, 'prediction.zip'), 'w', zipfile.ZIP_DEFLATED) as f:
    f.write(os.path.join(TMP_PATH, 'prediction.txt'), arcname='prediction.txt')

## Reference
\[1\] Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu and Xing Xie: Neural News Recommendation with Long- and Short-term User Representations, ACL 2019<br>
\[2\] ACM RecSys Challenge 24 - Data: https://recsys.eb.dk/dataset/ <br>
\[3\] Pre-Trained embeddings: http://vectors.nlpl.eu/repository/#