Here is an implementation of collaborative filtering system, which is the most popular algorithm for recommender systems.
More information can be obtained from the slider: report/OpenCF-report.pdf
OpenCF implemented both user-based CF and item-based CF, and optimizing methods like:
- Row normalization for Rating matrix.
- Similarity functions:
- raw cosine
- adjusted cosine
- pearson correlation
- Similarity Summing:
- Direct summing
- Normalized(1-order) similarity summing
- Probability similarity summing
compact
is used to compact discontinuous user-id and item-id.
# compact data/uir to data/uir.compact, -U and -I specifies mapping file names, which is used to restore ids.
./compact -f data/uir -o data/uir.compact -U data/user.map -I data/item.map
# help
./compact -h
similarity
computes similarity matrix, with various methods.
# help
./similarity -h
# raw cosine similarity
./similarity -f data/uir.compact -o data/ii.cos
# adjusted cosine similarity
./similarity -a 1 data/uir.compact -o data/ii.acos
# pearson correlation similarity
./similarity -a 2 data/uir.compact -o data/ii.corr
The data format similarity
accept is:
userid1 itemid1 rating1
userid2 itemid2 rating2
...
where userid
and itemid
are int
compatible, rating
s are float
compatible.
prediction
uses similarity-file and rating-file to compute prediction matrix.
./predict -s data/ii.cos -f data/uir.compact -o data/uip.ii.compact
Use compact -r
to restore user/item ids, mapping-files should be specified.
./compact -r -f data/uip.ii.compact -o data/uip.ii -U data/user.map -I data/item.map
Dataset data/train
and data/test
are extracted from data/t_alibaba_data.csv
manually. The format in train
is not compatible. rating
is used to generate compatible rating-file from data-files with this format:
userid1 itemid1 operation1 month1 day1
userid2 itemid2 operation2 month2 day2
...
# generate compatible rating-file: data/uir
rating/rating -i data/train -o data/uir -l data/dealdate.last -c data/deal.count
# help
rating/rating
postprocess
is used to de-emphasize items that already purchased by user.
postprocess/postprocess -p data/uip.ii -l data/dealdate.last -c data/deal.count -o data/uip.ii.mod
# prepare test-file, generates data/test.ui
evaluate/prepare_test.sh data/test
# sort and filter prediction-file data/uip.ii to data/ii.ui
evaluate/sort_prediction.sh data/uip.ii
# evaluate sorted-prediction-file data/uip.ii.ui
evaluate/evaluate -p data/uip.ii.ui -t data/test.ui -o data/ii.curve