Merge branch 'release/v1.2.1'
ummae committed Nov 29, 2020
2 parents 411aecd + d806b11 commit faa7f8d
Showing 18 changed files with 189 additions and 52 deletions.
14 changes: 7 additions & 7 deletions benchmark/README.md
@@ -61,7 +61,7 @@ KakaoReco730M | 21,940,315 | 1,467,298 | 730M

Note that there is no Python version of QMF. Since we ran the benchmark from a Python script, we had to capture the printed datetime information from QMF's standard output.
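For reference, a minimal sketch of how such timings can be scraped from a subprocess's standard output; the binary name, flags, and timestamp format below are placeholders, not QMF's actual interface:

```python
import re
import subprocess

# Run the external trainer and keep everything it prints (command and flags are placeholders).
result = subprocess.run(['qmf_trainer', '--config', 'bench.conf'],
                        capture_output=True, text=True, check=True)

# Collect timestamps printed as "YYYY-mm-dd HH:MM:SS"; the first and last bracket the training run.
stamps = re.findall(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', result.stdout)
if stamps:
    print(f'captured {len(stamps)} timestamps: start={stamps[0]}, end={stamps[-1]}')
```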

There is a restriction such that the number of the latent dimensions must be multiple of 32 when using GPU in implicit. For example, 80 demensions has been upscaled to 96 but not for 160. Therefore, it is not an accurate comparison between implicit-gpu and buffalo-gpu.
There is a restriction that the number of latent dimensions must be a multiple of 32 when using the GPU in implicit. For example, 80 dimensions is rounded up to 96, while 160 is left unchanged. Therefore, the comparison between implicit-gpu and buffalo-gpu is not exact.
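As an illustration of that rounding (a sketch, not implicit's actual code):

```python
def round_up_to_multiple(d, base=32):
    """Round the latent dimension up to the next multiple of `base`."""
    return ((d + base - 1) // base) * base

assert round_up_to_multiple(80) == 96    # 80 gets padded up to 96
assert round_up_to_multiple(160) == 160  # 160 is already a multiple of 32
```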

#### 2.1.1. KakaoBrunch12M

@@ -84,7 +84,7 @@ implicit | 212.646 | 156.561 | 128.528 | 122.587 | 125.323
qmf | 201.709 | 113.166 | 73.3526 | 124.546 | 144.251
pyspark | 370.907 | 193.428 | 116.088 | 77.8977 | 55.7786

- D setted as 20.
- D set to 20.


#### 2.1.2. Movielens20M
@@ -108,7 +108,7 @@ implicit | 126.473 | 94.0671 | 72.4117 | 54.702 | 39.5668
pyspark | 422.733 | 218.339 | 123.377 | 77.7848 | 54.8635
qmf | 168.467 | 87.7365 | 46.8157 | 31.0115 | 33.9857

- D setted as 20.
- D set to 20.

#### 2.1.3. KakaoReco730M
KakaoReco730M is the biggest of our datasets. Within the given system resources, only Buffalo and Implicit managed to train on it in a reasonable amount of time. Owing to a lack of GPU device memory, even implicit could not run in GPU accelerator mode. For buffalo-gpu, the memory management option `batch_mb` works consistently in GPU accelerator mode, allowing it to handle KakaoReco730M even though the data does not fit in memory.
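For illustration, a sketch of how such a run could be configured; `d` and `batch_mb` follow the option keys used elsewhere in this benchmark, while the `accelerator` flag name and the constructor call are assumptions, not taken from this commit:

```python
from buffalo.algo.als import ALS
from buffalo.algo.options import ALSOption

opt = ALSOption().get_default_option()
opt.update({'d': 32,
            'accelerator': True,   # assumed key for enabling GPU mode
            'batch_mb': 4098})     # cap the per-batch working set (MB) so the data is streamed
als = ALS(opt)                     # assumed constructor usage; data options omitted
```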
@@ -145,22 +145,22 @@ Implicit also provides a GPU accelerator mode for BPRMF, but buffalo doesn't. Im

#### 2.2.1. KakaoBrunch12M

<center><img src="./fig/20190828.buffalo.bpr.kakaobrunch12m.d.png" width="1024px"></center>
<center><img src="./fig/20190828.buffalo.bpr.kakaobrunch12m.t.png" width="1024px"></center>
<center><img src="./fig/20200712.buffalo.bpr.kakaobrunch12m.d.png" width="1024px"></center>
<center><img src="./fig/20200712.buffalo.bpr.kakaobrunch12m.t.png" width="1024px"></center>

method | D=10 | D=20 | D=40 | D=80 | D=160
-- | -- | -- | -- | -- | --
buffalo | 17.1951 | 14.6433 | 15.6937 | 16.6561 | 23.426
implicit | 15.0314 | 16.1355 | 19.3006 | 25.9833 | 39.4239
qmf | 67.006193 | 76.501249 | 99.842923 | 139.275130666667 | 193.918801
lightfm | 4480.07857577006 | 4499.68288469315 | 4465.49154909452 | 4491.95924011866 | 4585.76058634122
lightfm | 58.1398 | 71.8523 | 97.1582 | 143.212 | 231.268

method | T=1 | T=2 | T=4 | T=8 | T=16
-- | -- | -- | -- | -- | --
buffalo | 59.4573 | 36.8466 | 22.5258 | 16.9438 | 26.7515
implicit | 90.2548 | 42.4105 | 24.4276 | 15.6033 | 13.4407
qmf | 85.493298 | 75.46227 | 75.4510053333333 | 79.250403 | 76.7110853333333
lightfm | 4170.78732784589 | 3468.09006055196 | 3411.35963026683 | 4552.11646389961 | 5788.33071891467
lightfm | 431.583 | 225.155 | 128.233 | 83.8259 | 67.8295

#### 2.2.2. Movielens20M
tbw.
7 changes: 4 additions & 3 deletions benchmark/accuracy_warp.md
@@ -24,17 +24,18 @@ WARP | 0.17361 | 0.62401 | 0.25332 | 0.12941
Please run the following command to reproduce this experiment: `$> python test_accuracy.py compare_warp_brp ml20m`

## Compare with LightFM
*(this experiment was not run with hyper-parameter tuning)*

**Parameters**

- `num_iters`: 100
- `num_iters`: 30
- `d`: 40

**Top10** accuracy of validation samples for MovieLens100K:

method | NDCG | AUC | ACCURACY | MAP
-- | -- | -- | -- | --
BUFFALO| 0.16562 | 0.62012 | 0.00610| 0.16562
LIGHTFM| 0.03657 | 0.50008 | 0.24548| 0.00365
BUFFALO| 0.15890| 0.62473| 0.25480| 0.11059
LIGHTFM| 0.15827| 0.61191| 0.22909| 0.12027

Please run the following command to reproduce this experiment: `$> python test_accuracy.py accuracy warp ml100k --libs=buffalo,lightfm`
Binary files changed (contents not shown).
18 changes: 10 additions & 8 deletions benchmark/models.py
@@ -87,6 +87,7 @@ def get_option(self, lib_name, algo_name, **kwargs):
from buffalo.algo.options import BPRMFOption
opt = BPRMFOption().get_default_option()
opt.update({'d': kwargs.get('d', 100),
'lr': kwargs.get('lr', 0.05),
'validation': kwargs.get('validation'),
'num_iters': kwargs.get('num_iters', 10),
'num_workers': kwargs.get('num_workers', 10),
@@ -96,6 +97,7 @@ def get_option(self, lib_name, algo_name, **kwargs):
from buffalo.algo.options import WARPOption
opt = WARPOption().get_default_option()
opt.update({'d': kwargs.get('d', 100),
'lr': kwargs.get('lr', 0.05),
'validation': kwargs.get('validation'),
'num_iters': kwargs.get('num_iters', 10),
'max_trials': 100,
@@ -168,7 +170,7 @@ def __init__(self):

def get_database(self, name, **kwargs):
if name in ['ml20m', 'ml100k', 'kakao_reco_730m', 'kakao_brunch_12m']:
db = h5py.File(DB[name])
db = h5py.File(DB[name], 'r')
ratings = db_to_coo(db)
db.close()
return ratings
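The explicit `'r'` above opens the benchmark database read-only; a minimal standalone sketch (file name is hypothetical):

```python
import h5py

# Read-only access: the benchmark never needs to mutate the prepared database,
# and this also works when the file sits on a read-only or shared volume.
with h5py.File('ml20m.h5py', 'r') as db:
    print(list(db.keys()))  # inspect the stored groups/datasets
```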
@@ -314,7 +316,7 @@ def __init__(self):

def get_database(self, name, **kwargs):
if name in ['ml20m', 'ml100k', 'kakao_reco_730m', 'kakao_brunch_12m']:
db = h5py.File(DB[name])
db = h5py.File(DB[name], 'r')
ratings = db_to_coo(db)
db.close()
return ratings
@@ -327,8 +329,8 @@ def bpr(self, database, **kwargs):
opts = self.get_option('lightfm', 'bpr', **kwargs)
data = self.get_database(database, **kwargs)
bpr = LightFM(loss='bpr',
no_components=kwargs.get('num_workers'))
elapsed, mem_info = self.run(bpr.fit, data, data, **opts)
no_components=kwargs.get('d'))
elapsed, mem_info = self.run(bpr.fit, data, **opts)
if kwargs.get('return_instance'):
return bpr
bpr = None
@@ -354,12 +356,12 @@ def warp(self, database, **kwargs):
data = self.get_database(database, **kwargs)
warp = LightFM(loss='warp',
learning_schedule='adagrad',
no_components=kwargs.get('num_workers'),
no_components=kwargs.get('d'),
max_sampled=100)
elapsed, mem_info = self.run(warp.fit, data, data, **opts)
elapsed, mem_info = self.run(warp.fit, data, **opts)
if kwargs.get('return_instance'):
return warp
bpr = None
warp = None
return elapsed, mem_info


@@ -482,7 +484,7 @@ def __init__(self):

def get_database(self, name, **kwargs):
if name in ['ml20m', 'ml100k', 'kakao_reco_730m', 'kakao_brunch_12m']:
db = h5py.File(DB[name])
db = h5py.File(DB[name], 'r')
ratings = db_to_dataframe(db, kwargs.get('spark'), kwargs.get('context'))
db.close()
return ratings
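The two LightFM fixes above (passing `d` rather than `num_workers` as `no_components`, and dropping the duplicated data argument from `fit`) amount to the usual call pattern; a minimal sketch with synthetic data:

```python
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# Tiny synthetic user-item interaction matrix.
rows, cols = np.array([0, 1, 2, 2]), np.array([1, 0, 2, 3])
interactions = coo_matrix((np.ones(4), (rows, cols)), shape=(3, 4))

model = LightFM(loss='bpr', no_components=40)      # latent dimension, not the worker count
model.fit(interactions, epochs=10, num_threads=4)  # interactions are passed once
```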
1 change: 1 addition & 0 deletions benchmark/test_accuracy.py
@@ -57,6 +57,7 @@ def _get_validation_score(algo_name, lib, database):
'validation': {'topk': 10},
'd': 40},
'warp': {'num_workers': 8,
'lr': 0.2,
'batch_mb': 4098,
'compute_loss_on_training': False,
'num_iters': 100,
34 changes: 17 additions & 17 deletions buffalo/algo/options.py
@@ -12,7 +12,7 @@ def get_default_option(self):
:ivar bool evaluation_on_learning: Set True to run evaluation during the training phase. (default: True)
:ivar bool compute_loss_on_training: Set True to calculate the loss during the training phase. (default: True)
:ivar int early_stopping_rounds: The number of additional epochs allowed after the minimum training loss is reached. If set to 0, early stopping is disabled. (default: 0)
:ivar bool save_best: Whenver the loss improved, save the model.
:ivar bool save_best: Whenever the loss improved, save the model.
:ivar int evaluation_period: How often evaluation is run, in epochs. (default: 1)
:ivar int save_period: How often the save_best routine is run, in epochs. (default: 10)
:ivar int random_seed: Random Seed
@@ -82,9 +82,9 @@ def get_default_option(self):
:ivar float eps: epsilon for numerical stability (default: 1e-10)
:ivar float cg_tolerance: tolerance of conjugate gradient for early stopping iterations (default: 1e-10)
:ivar str optimizer: The name of optimizer, should be in [llt, ldlt, manual_cg, eigen_cg, eigen_bicg, eigen_gmres, eigen_dgmres, eigen_minres]. (default: manual_cg)
:ivar int num_cg_max_iters: The number of maximum iterations for conjuaget gradient optimizer. (default: 3)
:ivar int num_cg_max_iters: The number of maximum iterations for conjugate gradient optimizer. (default: 3)
:ivar str model_path: Where to save model.
:ivar dict data_opt: This options will be used to load data if given.
:ivar dict data_opt: This option will be used to load data if given.
"""

opt = super().get_default_option()
@@ -118,7 +118,7 @@ def get_default_optimize_option(self):
:ivar int min_trials: The minimum experiments before deploying model. (Since the best parameter may not be found after `min_trials`, the first best parameter is always deployed)
:ivar bool deployment: Set True to train the model with the best parameter. During the optimization, it tries to dump the model whenever it beats the previous best loss.
:ivar bool start_with_default_parameters: If set to True, the loss value of the default parameter is used as the starting loss to beat.
:ivar dict space: The parameter space definition. For more information, pleases reference hyperopt's express. Note) Due to hyperopt's `randint` does not provide lower value, we had to implement it a bait tricky. Pleases see optimize.py to check how we deal with `randint`.k
:ivar dict space: The parameter space definition. For more information, please refer to hyperopt's expressions. Note: because hyperopt's `randint` does not support a lower bound, we had to implement it in a slightly tricky way. Please see optimize.py to check how we deal with `randint`.
"""
opt = super().get_default_optimize_option()
opt.update({
@@ -156,9 +156,9 @@ def get_default_option(self):
:ivar float alpha: The coefficient for giving more weight to losses on positive samples. (default: 8.0)
:ivar float l: The relative weight of loss on user-item relation over item-context relation. (default: 1.0)
:ivar str optimizer: The name of optimizer, should be in [llt, ldlt, manual_cg, eigen_cg, eigen_bicg, eigen_gmres, eigen_dgmres, eigen_minres]. (default: manual_cg)
:ivar int num_cg_max_iters: The number of maximum iterations for conjuaget gradient optimizer. (default: 3)
:ivar int num_cg_max_iters: The number of maximum iterations for conjugate gradient optimizer. (default: 3)
:ivar str model_path: Where to save model. (default: '')
:ivar dict data_opt: This options will be used to load data if given. (default: {})
:ivar dict data_opt: This option will be used to load data if given. (default: {})
"""
opt = super().get_default_option()
opt.update({
@@ -190,7 +190,7 @@ def get_default_optimize_option(self):
:ivar int min_trials: Minimum experiments before deploying model. (Since the best parameter may not be found after `min_trials`, the first best parameter is always deployed)
:ivar bool deployment: Set True to train the model with the best parameter. During the optimization, it tries to dump the model whenever it beats the previous best loss.
:ivar bool start_with_default_parameters: If set to True, the loss value of the default parameter is used as the starting loss to beat.
:ivar dict space: Parameter space definition. For more information, pleases reference hyperopt's express. Note) Due to hyperopt's `randint` does not provide lower value, we had to implement it a bait tricky. Pleases see optimize.py to check how we deal with `randint`.k
:ivar dict space: Parameter space definition. For more information, please refer to hyperopt's expressions. Note: because hyperopt's `randint` does not support a lower bound, we had to implement it in a slightly tricky way. Please see optimize.py to check how we deal with `randint`.
"""
opt = super().get_default_optimize_option()
opt.update({
@@ -203,9 +203,9 @@ def get_default_optimize_option(self):
'd': ['randint', ['d', 10, 30]],
'reg_u': ['uniform', ['reg_u', 0.1, 1]],
'reg_i': ['uniform', ['reg_i', 0.1, 1]],
'reg_c': ['uniform', ['reg_i', 0.1, 1]],
'reg_c': ['uniform', ['reg_c', 0.1, 1]],
'alpha': ['randint', ['alpha', 1, 32]],
'l': ['randint', ['alpha', 1, 32]]
'l': ['randint', ['l', 1, 32]]
}
})
return Option(opt)
@@ -245,12 +245,12 @@ def get_default_option(self):
:ivar float min_lr: The minimum learning rate, to keep learning-rate decay from reaching zero. (default: 0.0001)
:ivar float beta1: The parameter of Adam optimizer. (default: 0.9)
:ivar float beta2: The parameter of Adam optimizer. (default: 0.999)
:ivar bool per_coordinate_normalize: This is a bit tricky option for Adam optimizer. Before update factors with graidents, do normalize gradients per class by its number of contributed samples. (default: False)
:ivar float sampling_power: This paramemter control sampling distribution. When it set to 0, it draw negative items from uniform distribution, while to set 1, it draw from the given data popularation. (default: 0.0)
:ivar bool per_coordinate_normalize: This is a slightly tricky option for the Adam optimizer. Before updating the factors with gradients, normalize the gradients per class by the number of contributed samples. (default: False)
:ivar float sampling_power: This parameter controls the sampling distribution. When set to 0, negative items are drawn from a uniform distribution; when set to 1, they are drawn according to item popularity in the given data. (default: 0.0)
:ivar bool random_positive: Set True to draw positive samples uniformly at random instead of using them straightforwardly (only implemented in CUDA mode). The original paper suggests True, but we found that False usually produces better results. (default: False)
:ivar bool verify_neg: Set True to ensure that a negative sample does not belong to the positive samples. (default: True)
:ivar str model_path: Where to save model.
:ivar dict data_opt: This options will be used to load data if given.
:ivar dict data_opt: This option will be used to load data if given.
"""
opt = super().get_default_option()
opt.update({
@@ -331,10 +331,10 @@ def get_default_option(self):
:ivar float min_lr: The minimum learning rate, to keep learning-rate decay from reaching zero. (default: 0.0001)
:ivar float beta1: The parameter of Adam optimizer. (default: 0.9)
:ivar float beta2: The parameter of Adam optimizer. (default: 0.999)
:ivar bool per_coordinate_normalize: This is a bit tricky option for Adam optimizer. Before update factors with graidents, do normalize gradients per class by its number of contributed samples. (default: False)
:ivar bool per_coordinate_normalize: This is a slightly tricky option for the Adam optimizer. Before updating the factors with gradients, normalize the gradients per class by the number of contributed samples. (default: False)
:ivar bool random_positive: Set True to draw positive samples uniformly at random instead of using them straightforwardly (only implemented in CUDA mode). The original paper suggests True, but we found that False usually produces better results. (default: False)
:ivar str model_path: Where to save model.
:ivar dict data_opt: This options will be used to load data if given.
:ivar dict data_opt: This option will be used to load data if given.
"""
opt = super().get_default_option()
opt.update({
@@ -366,7 +366,7 @@ def get_default_option(self):
return Option(opt)

def get_default_optimize_option(self):
"""Optimization options for BPRMF.
"""Optimization options for WARP.
"""
opt = super().get_default_optimize_option()
opt.update({
@@ -380,7 +380,7 @@ def get_default_optimize_option(self):
'threshold': ['uniform', ['threshold', 0.5, 5.0]],
'reg_u': ['uniform', ['reg_u', 0.01, 1.0]],
'reg_i': ['uniform', ['reg_i', 0.0, 0.001]],
'reg_j': ['uniform', ['reg_i', 0.0, 0.001]]
'reg_j': ['uniform', ['reg_j', 0.0, 0.001]]
}
})
return Option(opt)
@@ -402,7 +402,7 @@ def get_default_option(self):
:ivar float sample: The sampling ratio to downsample the frequent words. (default: 0.001)
:ivar float lr: The learning rate.
:ivar str model_path: Where to save model.
:ivar dict data_opt: This options will be used to load data if given.
:ivar dict data_opt: This option will be used to load data if given.
"""
opt = super().get_default_option()
opt.update({
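The hyper-parameter `space` entries fixed above follow an `[expression_name, [label, *args]]` convention. Presumably (this is an assumption about optimize.py, not code from this commit) they translate to hyperopt expressions roughly as follows, with `randint` handled specially as the docstring notes:

```python
from hyperopt import hp

def to_hyperopt(space):
    """Sketch: map [expr_name, [label, *args]] entries onto hyperopt expressions."""
    return {key: getattr(hp, expr)(*args) for key, (expr, args) in space.items()}

# Uniform entries map directly; randint needs extra care because older hyperopt
# versions only accept an upper bound (see optimize.py per the docstring above).
space = to_hyperopt({'reg_u': ['uniform', ['reg_u', 0.1, 1.0]],
                     'reg_c': ['uniform', ['reg_c', 0.1, 1.0]]})
```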
2 changes: 1 addition & 1 deletion buffalo/cli.py
@@ -17,7 +17,7 @@ def run(self, opt_path):
self.logger.info(f'ALS finished with loss({loss}).')
if opt.save_factors:
self.logger.info(f'Saving model to {opt.model_path}.')
als.dump(opt.model_path)
als.save(opt.model_path)

def optimize(self, opt_path):
als = _ALS(opt_path)
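For context, the corrected `als.save(opt.model_path)` call pairs with a matching load on the algorithm object; a minimal sketch (the no-argument constructor and the `load` signature are assumptions):

```python
from buffalo.algo.als import ALS

als = ALS()            # assumed: default options when no option path is given
als.load('als.model')  # hypothetical path; counterpart of als.save(...) in run()
```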