An evaluation framework that simulates a real-life news recommendation scenario, in which recommendation algorithms receive user clicks and newly published articles in chronological order. After each click, the algorithms generate recommendation lists, which are then evaluated by comparing them with the user's subsequent clicks. Most of the algorithms are implemented so that they learn incrementally from every click.
This framework is described in detail in the following research paper:
Michael Jugovac, Dietmar Jannach, and Mozhgan Karimi. 2018. StreamingRec: A Framework for Benchmarking Stream-based News Recommenders. In Proceedings of the Twelfth ACM Conference on Recommender Systems (RecSys ’18), October 2–7, 2018, Vancouver, BC, Canada. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3240323.3240384
A preprint is available on the author's homepage.
If you are using StreamingRec in your paper, please cite the above-mentioned reference in your bibliography.
- Most Popular
- Recently Published
- Recently Popular
- Recently Clicked
- Item-Item CF
- Co-Occurrence
- k-Nearest Neighbor
- Sequential Pattern Mining
- Lucene
- Keyword-based
- Bayesian Personalized Ranking
- Download the pre-compiled jar from the releases tab
- Acquire item and click CSV input files (see below)
- Create algorithm and metric JSON config files (see below)
- Run with
java -jar StreamingRec.jar <parameters>
- Commonly, at least the following parameters should be used:
java -jar StreamingRec.jar --items=<path_to_item_meta_data_file> --clicks=<path_to_click_data_file> --algorithm-config=<path_to_algorithm_json_config_file> --metrics-config=<path_to_metrics_json_config_file> --session-inactivity-threshold
- For a full list and description of the available parameters, run with -h
- For systems with little RAM, adjusting the
--thread-count=<N>
parameter can help. By default, it is set to the number of available CPU cores - 1, but in general, fewer concurrent threads result in lower RAM usage.
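For example, combining the parameters from above with a reduced number of worker threads (all paths are placeholders; the value 2 is only an illustration):
java -jar StreamingRec.jar --items=<path_to_item_meta_data_file> --clicks=<path_to_click_data_file> --algorithm-config=<path_to_algorithm_json_config_file> --metrics-config=<path_to_metrics_json_config_file> --thread-count=2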
The Outbrain data set is publicly available. To use it with StreamingRec, the following steps are necessary.
- Register an account at Kaggle and download the data set files from
https://www.kaggle.com/c/outbrain-click-prediction/data.
Only the following files are necessary:
- page_views.csv.zip
- documents_meta.csv.zip
- documents_categories.csv.zip
- documents_entities.csv.zip
- documents_topics.csv.zip
- Put all above-mentioned files in one folder.
- Run the StreamingRec import script with at least the following parameters (a filled-in example is shown after this list):
java -cp StreamingRec.jar tudo.streamingrec.data.loading.ReadOutbrain --input-folder=<folder_to_outbrain_files> --out-items=<path_to_item_output_file> --out-clicks=<path_to_clicks_output_file> --publisher=<id_of_publisher_to_be_extracted>
- After processing, two data set files (one for the item meta data and one for the click data) will be created that can be used with StreamingRec.
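For illustration, a hypothetical invocation of the import script (the folder and output file names are made up; the publisher id is left as a placeholder):
java -cp StreamingRec.jar tudo.streamingrec.data.loading.ReadOutbrain --input-folder=outbrain --out-items=outbrain-items.csv --out-clicks=outbrain-clicks.csv --publisher=<id_of_publisher_to_be_extracted>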
The 2016 Plista challenge data set is not publicly available anymore. However, for researchers who still have access to it, the following steps need to be executed to use this data set.
- Collect all the .gz files of the data set in one folder
- Run the first stage import script with
java -cp StreamingRec.jar tudo.streamingrec.data.loading.ReadPlista <parameters>
For help about the parameters, run with -h
- This generates intermediate input files. To create the final input files, run the second stage import script with
java -cp StreamingRec.jar tudo.streamingrec.data.loading.JoinPlistaTransactionsWithMetaInfo <parameters>
For help about the parameters, run with -h
- After processing, two data set files (one for the item meta data and one for the click data) will be created that can be used with StreamingRec. The intermediate input files can be deleted.
Algorithms and metrics are configured via JSON files, one for algorithms, one for metrics.
Each of the files contains a JSON array that is made up of one JSON object per algorithm/metric.
A general metrics configuration can be found in the project folder at config/metrics-config.json.
For a quick test, a simple algorithm configuration is included at config/algorithm-config-simple.json.
To configure algorithms or metrics manually, a new JSON config file can be created. The format of the JSON object representing the algorithm or metric is always the same:
{
"name": "Co-Occurrence",
"algorithm": ".FastSessionCoOccurrence",
"wholeSession": false
}
- name: the pretty-print name of the algorithm that will appear as an identifier in the result list
- algorithm: the qualified name of the algorithm class, relative to the tudo.streamingrec.algorithms package
- wholeSession: an additional parameter specific to this algorithm (Co-Occurrence)
- "Additional parameters" are all parameters that the respective algorithm class offers via its setter methods. To find out which parameters are offered by which algorithm, consult the javadoc from the releases tab.
- Download/Clone the source code
- Import the project into Eclipse via Import -> Existing Maven Projects
- Follow the above-mentioned instructions on how to acquire input files and configure the algorithms and metrics
- To set parameters, adjust the class attributes directly in tudo.Framework.StreamingRec or set the call parameters via Eclipse's Run Configurations menu
- Run the class tudo.Framework.StreamingRec via Run as ... -> Java Application
Custom algorithms have to extend the class tudo.streamingrec.algorithms.Algorithm.
The implementation shall update its model every time the method
void trainInternal(List<Item> items, List<ClickData> transactions)
is called, and it shall generate a recommendation list every time the method
LongArrayList recommend(ClickData clickData)
is called.
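A minimal sketch of such a subclass is shown below. The class name is hypothetical, and the package placement, the method visibility, and the import paths for Item, ClickData, and the fastutil LongArrayList are assumptions; consult the javadoc for the exact signatures.

package tudo.streamingrec.algorithms;

import java.util.List;

import it.unimi.dsi.fastutil.longs.LongArrayList;
// the import paths for the data classes are assumed
import tudo.streamingrec.data.ClickData;
import tudo.streamingrec.data.Item;

public class MyAlgorithm extends Algorithm {

    @Override
    public void trainInternal(List<Item> items, List<ClickData> transactions) {
        // incrementally update the internal model from newly published items
        // and/or the latest user clicks
    }

    @Override
    public LongArrayList recommend(ClickData clickData) {
        // return a ranked list of recommended item ids for the current click;
        // this placeholder simply returns an empty list
        return new LongArrayList();
    }
}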
Custom metrics have to extend the class tudo.streamingrec.evaluation.metrics.Metric.
The implementation shall calculate, update, and store its current metric value every time the method
void evaluate(Transaction transaction, LongArrayList recommendations, LongOpenHashSet userTransactions)
is called. The final metric value shall be returned when the method
double getResults()
is called. In any case, the value of the class attribute k shall be observed.
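As an illustration, a sketch of a simple precision-style metric is given below. The class name is hypothetical; the package placement, the import path of Transaction, and the assumption that the class attribute k is an integer accessible to subclasses are not confirmed by the source and should be checked against the javadoc.

package tudo.streamingrec.evaluation.metrics;

import it.unimi.dsi.fastutil.longs.LongArrayList;
import it.unimi.dsi.fastutil.longs.LongOpenHashSet;
// the import path for Transaction is assumed
import tudo.streamingrec.data.Transaction;

public class SimplePrecision extends Metric {

    // running sums for averaging the per-click precision
    private double sum = 0;
    private int count = 0;

    @Override
    public void evaluate(Transaction transaction, LongArrayList recommendations, LongOpenHashSet userTransactions) {
        if (recommendations.isEmpty()) {
            return; // nothing was recommended for this click
        }
        // only the top-k recommendations are considered (class attribute k, assumed accessible here)
        int cutoff = Math.min(k, recommendations.size());
        int hits = 0;
        for (int i = 0; i < cutoff; i++) {
            // count recommended items that the user actually clicked
            if (userTransactions.contains(recommendations.getLong(i))) {
                hits++;
            }
        }
        sum += hits / (double) cutoff;
        count++;
    }

    @Override
    public double getResults() {
        // average precision over all evaluated clicks
        return count == 0 ? 0 : sum / count;
    }
}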
Copyright [2017,2018] [Mozhgan Karimi, Michael Jugovac, Dietmar Jannach]
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.