Merge branch 'release/v1.2.1'
ummae committed Nov 29, 2020
2 parents 411aecd + d806b11 commit faa7f8d
Showing 18 changed files with 189 additions and 52 deletions.
14 changes: 7 additions & 7 deletions benchmark/README.md
@@ -61,7 +61,7 @@ KakaoReco730M | 21,940,315 | 1,467,298 | 730M

Note that there is no Python version of QMF. Since we ran the benchmark from a Python script, we had to capture the printed datetime information from QMF's standard output.
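For reference, a minimal sketch of how such timings can be scraped from a subprocess's standard output; the binary name, flags, and timestamp format below are placeholders, not QMF's actual interface:

```python
import re
import subprocess

# Run the external trainer and keep everything it prints (command and flags are placeholders).
result = subprocess.run(['qmf_trainer', '--config', 'bench.conf'],
                        capture_output=True, text=True, check=True)

# Collect timestamps printed as "YYYY-mm-dd HH:MM:SS"; the first and last bracket the training run.
stamps = re.findall(r'\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}', result.stdout)
if stamps:
    print(f'captured {len(stamps)} timestamps: start={stamps[0]}, end={stamps[-1]}')
```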

There is a restriction such that the number of the latent dimensions must be multiple of 32 when using GPU in implicit. For example, 80 demensions has been upscaled to 96 but not for 160. Therefore, it is not an accurate comparison between implicit-gpu and buffalo-gpu.
There is a restriction that the number of latent dimensions must be a multiple of 32 when using the GPU in implicit. For example, 80 dimensions is rounded up to 96, while 160 is left unchanged. Therefore, the comparison between implicit-gpu and buffalo-gpu is not exact.
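As an illustration of that rounding (a sketch, not implicit's actual code):

```python
def round_up_to_multiple(d, base=32):
    """Round the latent dimension up to the next multiple of `base`."""
    return ((d + base - 1) // base) * base

assert round_up_to_multiple(80) == 96    # 80 gets padded up to 96
assert round_up_to_multiple(160) == 160  # 160 is already a multiple of 32
```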

#### 2.1.1. KakaoBrunch12M

@@ -84,7 +84,7 @@ implicit | 212.646 | 156.561 | 128.528 | 122.587 | 125.323
qmf | 201.709 | 113.166 | 73.3526 | 124.546 | 144.251
pyspark | 370.907 | 193.428 | 116.088 | 77.8977 | 55.7786

- D setted as 20.
- D set to 20.


#### 2.1.2. Movielens20M
@@ -108,7 +108,7 @@ implicit | 126.473 | 94.0671 | 72.4117 | 54.702 | 39.5668
pyspark | 422.733 | 218.339 | 123.377 | 77.7848 | 54.8635
qmf | 168.467 | 87.7365 | 46.8157 | 31.0115 | 33.9857

- D setted as 20.
- D set to 20.

#### 2.1.3. KakaoReco730M
KakaoReco730M is the biggest of our datasets. Within the given system resources, only Buffalo and Implicit managed to train on it in a reasonable amount of time. Owing to a lack of GPU device memory, even implicit could not run in GPU accelerator mode. For buffalo-gpu, the memory management option `batch_mb` works consistently in GPU accelerator mode, allowing it to handle KakaoReco730M even though the data does not fit in memory.
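For illustration, a sketch of how such a run could be configured; `d` and `batch_mb` follow the option keys used elsewhere in this benchmark, while the `accelerator` flag name and the constructor call are assumptions, not taken from this commit:

```python
from buffalo.algo.als import ALS
from buffalo.algo.options import ALSOption

opt = ALSOption().get_default_option()
opt.update({'d': 32,
            'accelerator': True,   # assumed key for enabling GPU mode
            'batch_mb': 4098})     # cap the per-batch working set (MB) so the data is streamed
als = ALS(opt)                     # assumed constructor usage; data options omitted
```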
@@ -145,22 +145,22 @@ Implicit also provides a GPU accelerator mode for BPRMF, but buffalo doesn't. Im

#### 2.2.1. KakaoBrunch12M

<center><img src="./fig/20190828.buffalo.bpr.kakaobrunch12m.d.png" width="1024px"></center>
<center><img src="./fig/20190828.buffalo.bpr.kakaobrunch12m.t.png" width="1024px"></center>
<center><img src="./fig/20200712.buffalo.bpr.kakaobrunch12m.d.png" width="1024px"></center>
<center><img src="./fig/20200712.buffalo.bpr.kakaobrunch12m.t.png" width="1024px"></center>

method | D=10 | D=20 | D=40 | D=80 | D=160
-- | -- | -- | -- | -- | --
buffalo | 17.1951 | 14.6433 | 15.6937 | 16.6561 | 23.426
implicit | 15.0314 | 16.1355 | 19.3006 | 25.9833 | 39.4239
qmf | 67.006193 | 76.501249 | 99.842923 | 139.275130666667 | 193.918801
lightfm | 4480.07857577006 | 4499.68288469315 | 4465.49154909452 | 4491.95924011866 | 4585.76058634122
lightfm | 58.1398 | 71.8523 | 97.1582 | 143.212 | 231.268

method | T=1 | T=2 | T=4 | T=8 | T=16
-- | -- | -- | -- | -- | --
buffalo | 59.4573 | 36.8466 | 22.5258 | 16.9438 | 26.7515
implicit | 90.2548 | 42.4105 | 24.4276 | 15.6033 | 13.4407
qmf | 85.493298 | 75.46227 | 75.4510053333333 | 79.250403 | 76.7110853333333
lightfm | 4170.78732784589 | 3468.09006055196 | 3411.35963026683 | 4552.11646389961 | 5788.33071891467
lightfm | 431.583 | 225.155 | 128.233 | 83.8259 | 67.8295

#### 2.2.2. Movielens20M
tbw.
7 changes: 4 additions & 3 deletions benchmark/accuracy_warp.md
@@ -24,17 +24,18 @@ WARP | 0.17361 | 0.62401 | 0.25332 | 0.12941
Please run the following command to reproduce this experiment: `$> python test_accuracy.py compare_warp_brp ml20m`

## Compare with LightFM
*(this experiment was not run with hyper-parameter tuning)*

**Parameters**

- `num_iters`: 100
- `num_iters`: 30
- `d`: 40

**Top10** accuracy of validation samples for MovieLens100K:

method | NDCG | AUC | ACCURACY | MAP
-- | -- | -- | -- | --
BUFFALO| 0.16562 | 0.62012 | 0.00610| 0.16562
LIGHTFM| 0.03657 | 0.50008 | 0.24548| 0.00365
BUFFALO| 0.15890| 0.62473| 0.25480| 0.11059
LIGHTFM| 0.15827| 0.61191| 0.22909| 0.12027

Please run the following command to reproduce this experiment: `$> python test_accuracy.py accuracy warp ml100k --libs=buffalo,lightfm`
Binary files changed (contents not shown).
18 changes: 10 additions & 8 deletions benchmark/models.py
@@ -87,6 +87,7 @@ def get_option(self, lib_name, algo_name, **kwargs):
from buffalo.algo.options import BPRMFOption
opt = BPRMFOption().get_default_option()
opt.update({'d': kwargs.get('d', 100),
'lr': kwargs.get('lr', 0.05),
'validation': kwargs.get('validation'),
'num_iters': kwargs.get('num_iters', 10),
'num_workers': kwargs.get('num_workers', 10),
@@ -96,6 +97,7 @@ def get_option(self, lib_name, algo_name, **kwargs):
from buffalo.algo.options import WARPOption
opt = WARPOption().get_default_option()
opt.update({'d': kwargs.get('d', 100),
'lr': kwargs.get('lr', 0.05),
'validation': kwargs.get('validation'),
'num_iters': kwargs.get('num_iters', 10),
'max_trials': 100,
@@ -168,7 +170,7 @@ def __init__(self):

def get_database(self, name, **kwargs):
if name in ['ml20m', 'ml100k', 'kakao_reco_730m', 'kakao_brunch_12m']:
db = h5py.File(DB[name])
db = h5py.File(DB[name], 'r')
ratings = db_to_coo(db)
db.close()
return ratings
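The explicit `'r'` above opens the benchmark database read-only; a minimal standalone sketch (file name is hypothetical):

```python
import h5py

# Read-only access: the benchmark never needs to mutate the prepared database,
# and this also works when the file sits on a read-only or shared volume.
with h5py.File('ml20m.h5py', 'r') as db:
    print(list(db.keys()))  # inspect the stored groups/datasets
```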
@@ -314,7 +316,7 @@ def __init__(self):

def get_database(self, name, **kwargs):
if name in ['ml20m', 'ml100k', 'kakao_reco_730m', 'kakao_brunch_12m']:
db = h5py.File(DB[name])
db = h5py.File(DB[name], 'r')
ratings = db_to_coo(db)
db.close()
return ratings
@@ -327,8 +329,8 @@ def bpr(self, database, **kwargs):
opts = self.get_option('lightfm', 'bpr', **kwargs)
data = self.get_database(database, **kwargs)
bpr = LightFM(loss='bpr',
no_components=kwargs.get('num_workers'))
elapsed, mem_info = self.run(bpr.fit, data, data, **opts)
no_components=kwargs.get('d'))
elapsed, mem_info = self.run(bpr.fit, data, **opts)
if kwargs.get('return_instance'):
return bpr
bpr = None
@@ -354,12 +356,12 @@ def warp(self, database, **kwargs):
data = self.get_database(database, **kwargs)
warp = LightFM(loss='warp',
learning_schedule='adagrad',
no_components=kwargs.get('num_workers'),
no_components=kwargs.get('d'),
max_sampled=100)
elapsed, mem_info = self.run(warp.fit, data, data, **opts)
elapsed, mem_info = self.run(warp.fit, data, **opts)
if kwargs.get('return_instance'):
return warp
bpr = None
warp = None
return elapsed, mem_info


@@ -482,7 +484,7 @@ def __init__(self):

def get_database(self, name, **kwargs):
if name in ['ml20m', 'ml100k', 'kakao_reco_730m', 'kakao_brunch_12m']:
db = h5py.File(DB[name])
db = h5py.File(DB[name], 'r')
ratings = db_to_dataframe(db, kwargs.get('spark'), kwargs.get('context'))
db.close()
return ratings
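The two LightFM fixes above (passing `d` rather than `num_workers` as `no_components`, and dropping the duplicated data argument from `fit`) amount to the usual call pattern; a minimal sketch with synthetic data:

```python
import numpy as np
from scipy.sparse import coo_matrix
from lightfm import LightFM

# Tiny synthetic user-item interaction matrix.
rows, cols = np.array([0, 1, 2, 2]), np.array([1, 0, 2, 3])
interactions = coo_matrix((np.ones(4), (rows, cols)), shape=(3, 4))

model = LightFM(loss='bpr', no_components=40)      # latent dimension, not the worker count
model.fit(interactions, epochs=10, num_threads=4)  # interactions are passed once
```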
1 change: 1 addition & 0 deletions benchmark/test_accuracy.py
@@ -57,6 +57,7 @@ def _get_validation_score(algo_name, lib, database):
'validation': {'topk': 10},
'd': 40},
'warp': {'num_workers': 8,
'lr': 0.2,
'batch_mb': 4098,
'compute_loss_on_training': False,
'num_iters': 100,
34 changes: 17 additions & 17 deletions buffalo/algo/options.py
@@ -12,7 +12,7 @@ def get_default_option(self):
:ivar bool evaluation_on_learning: Set True to run evaluation during the training phase. (default: True)
:ivar bool compute_loss_on_training: Set True to calculate the loss during the training phase. (default: True)
:ivar int early_stopping_rounds: The number of additional epochs allowed after the minimum training loss is reached. If set to 0, early stopping is disabled. (default: 0)
:ivar bool save_best: Whenver the loss improved, save the model.
:ivar bool save_best: Whenever the loss improved, save the model.
:ivar int evaluation_period: How often evaluation is run, in epochs. (default: 1)
:ivar int save_period: How often the save_best routine is run, in epochs. (default: 10)
:ivar int random_seed: Random Seed
@@ -82,9 +82,9 @@ def get_default_option(self):
:ivar float eps: epsilon for numerical stability (default: 1e-10)
:ivar float cg_tolerance: tolerance of conjugate gradient for early stopping iterations (default: 1e-10)
:ivar str optimizer: The name of optimizer, should be in [llt, ldlt, manual_cg, eigen_cg, eigen_bicg, eigen_gmres, eigen_dgmres, eigen_minres]. (default: manual_cg)
:ivar int num_cg_max_iters: The number of maximum iterations for conjuaget gradient optimizer. (default: 3)
:ivar int num_cg_max_iters: The number of maximum iterations for conjugate gradient optimizer. (default: 3)
:ivar str model_path: Where to save model.
:ivar dict data_opt: This options will be used to load data if given.
:ivar dict data_opt: This option will be used to load data if given.
"""

opt = super().get_default_option()
@@ -118,7 +118,7 @@ def get_default_optimize_option(self):
:ivar int min_trials: The minimum experiments before deploying model. (Since the best parameter may not be found after `min_trials`, the first best parameter is always deployed)
:ivar bool deployment: Set True to train the model with the best parameter. During the optimization, it tries to dump the model whenever it beats the previous best loss.
:ivar bool start_with_default_parameters: If set to True, the loss value of the default parameter is used as the starting loss to beat.
:ivar dict space: The parameter space definition. For more information, pleases reference hyperopt's express. Note) Due to hyperopt's `randint` does not provide lower value, we had to implement it a bait tricky. Pleases see optimize.py to check how we deal with `randint`.k
:ivar dict space: The parameter space definition. For more information, please refer to hyperopt's expressions. Note: because hyperopt's `randint` does not support a lower bound, we had to implement it in a slightly tricky way. Please see optimize.py to check how we deal with `randint`.
"""
opt = super().get_default_optimize_option()
opt.update({
@@ -156,9 +156,9 @@ def get_default_option(self):
:ivar float alpha: The coefficient for giving more weight to losses on positive samples. (default: 8.0)
:ivar float l: The relative weight of loss on user-item relation over item-context relation. (default: 1.0)
:ivar str optimizer: The name of optimizer, should be in [llt, ldlt, manual_cg, eigen_cg, eigen_bicg, eigen_gmres, eigen_dgmres, eigen_minres]. (default: manual_cg)
:ivar int num_cg_max_iters: The number of maximum iterations for conjuaget gradient optimizer. (default: 3)
:ivar int num_cg_max_iters: The number of maximum iterations for conjugate gradient optimizer. (default: 3)
:ivar str model_path: Where to save model. (default: '')
:ivar dict data_opt: This options will be used to load data if given. (default: {})
:ivar dict data_opt: This option will be used to load data if given. (default: {})
"""
opt = super().get_default_option()
opt.update({
@@ -190,7 +190,7 @@ def get_default_optimize_option(self):
:ivar int min_trials: Minimum experiments before deploying model. (Since the best parameter may not be found after `min_trials`, the first best parameter is always deployed)
:ivar bool deployment: Set True to train the model with the best parameter. During the optimization, it tries to dump the model whenever it beats the previous best loss.
:ivar bool start_with_default_parameters: If set to True, the loss value of the default parameter is used as the starting loss to beat.
:ivar dict space: Parameter space definition. For more information, pleases reference hyperopt's express. Note) Due to hyperopt's `randint` does not provide lower value, we had to implement it a bait tricky. Pleases see optimize.py to check how we deal with `randint`.k
:ivar dict space: Parameter space definition. For more information, please refer to hyperopt's expressions. Note: because hyperopt's `randint` does not support a lower bound, we had to implement it in a slightly tricky way. Please see optimize.py to check how we deal with `randint`.
"""
opt = super().get_default_optimize_option()
opt.update({
@@ -203,9 +203,9 @@ def get_default_optimize_option(self):
'd': ['randint', ['d', 10, 30]],
'reg_u': ['uniform', ['reg_u', 0.1, 1]],
'reg_i': ['uniform', ['reg_i', 0.1, 1]],
'reg_c': ['uniform', ['reg_i', 0.1, 1]],
'reg_c': ['uniform', ['reg_c', 0.1, 1]],
'alpha': ['randint', ['alpha', 1, 32]],
'l': ['randint', ['alpha', 1, 32]]
'l': ['randint', ['l', 1, 32]]
}
})
return Option(opt)
@@ -245,12 +245,12 @@ def get_default_option(self):
:ivar float min_lr: The minimum learning rate, to keep learning-rate decay from reaching zero. (default: 0.0001)
:ivar float beta1: The parameter of Adam optimizer. (default: 0.9)
:ivar float beta2: The parameter of Adam optimizer. (default: 0.999)
:ivar bool per_coordinate_normalize: This is a bit tricky option for Adam optimizer. Before update factors with graidents, do normalize gradients per class by its number of contributed samples. (default: False)
:ivar float sampling_power: This paramemter control sampling distribution. When it set to 0, it draw negative items from uniform distribution, while to set 1, it draw from the given data popularation. (default: 0.0)
:ivar bool per_coordinate_normalize: This is a slightly tricky option for the Adam optimizer. Before updating the factors with gradients, normalize the gradients per class by the number of contributed samples. (default: False)
:ivar float sampling_power: This parameter controls the sampling distribution. When set to 0, negative items are drawn from a uniform distribution; when set to 1, they are drawn according to item popularity in the given data. (default: 0.0)
:ivar bool random_positive: Set True to draw positive samples uniformly at random instead of using them straightforwardly (only implemented in CUDA mode). The original paper suggests True, but we found that False usually produces better results. (default: False)
:ivar bool verify_neg: Set True to ensure that a negative sample does not belong to the positive samples. (default: True)
:ivar str model_path: Where to save model.
:ivar dict data_opt: This options will be used to load data if given.
:ivar dict data_opt: This option will be used to load data if given.
"""
opt = super().get_default_option()
opt.update({
@@ -331,10 +331,10 @@ def get_default_option(self):
:ivar float min_lr: The minimum learning rate, to keep learning-rate decay from reaching zero. (default: 0.0001)
:ivar float beta1: The parameter of Adam optimizer. (default: 0.9)
:ivar float beta2: The parameter of Adam optimizer. (default: 0.999)
:ivar bool per_coordinate_normalize: This is a bit tricky option for Adam optimizer. Before update factors with graidents, do normalize gradients per class by its number of contributed samples. (default: False)
:ivar bool per_coordinate_normalize: This is a slightly tricky option for the Adam optimizer. Before updating the factors with gradients, normalize the gradients per class by the number of contributed samples. (default: False)
:ivar bool random_positive: Set True to draw positive samples uniformly at random instead of using them straightforwardly (only implemented in CUDA mode). The original paper suggests True, but we found that False usually produces better results. (default: False)
:ivar str model_path: Where to save model.
:ivar dict data_opt: This options will be used to load data if given.
:ivar dict data_opt: This option will be used to load data if given.
"""
opt = super().get_default_option()
opt.update({
@@ -366,7 +366,7 @@ def get_default_option(self):
return Option(opt)

def get_default_optimize_option(self):
"""Optimization options for BPRMF.
"""Optimization options for WARP.
"""
opt = super().get_default_optimize_option()
opt.update({
@@ -380,7 +380,7 @@ def get_default_optimize_option(self):
'threshold': ['uniform', ['threshold', 0.5, 5.0]],
'reg_u': ['uniform', ['reg_u', 0.01, 1.0]],
'reg_i': ['uniform', ['reg_i', 0.0, 0.001]],
'reg_j': ['uniform', ['reg_i', 0.0, 0.001]]
'reg_j': ['uniform', ['reg_j', 0.0, 0.001]]
}
})
return Option(opt)
@@ -402,7 +402,7 @@ def get_default_option(self):
:ivar float sample: The sampling ratio to downsample the frequent words. (default: 0.001)
:ivar float lr: The learning rate.
:ivar str model_path: Where to save model.
:ivar dict data_opt: This options will be used to load data if given.
:ivar dict data_opt: This option will be used to load data if given.
"""
opt = super().get_default_option()
opt.update({
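The hyper-parameter `space` entries fixed above follow an `[expression_name, [label, *args]]` convention. Presumably (this is an assumption about optimize.py, not code from this commit) they translate to hyperopt expressions roughly as follows, with `randint` handled specially as the docstring notes:

```python
from hyperopt import hp

def to_hyperopt(space):
    """Sketch: map [expr_name, [label, *args]] entries onto hyperopt expressions."""
    return {key: getattr(hp, expr)(*args) for key, (expr, args) in space.items()}

# Uniform entries map directly; randint needs extra care because older hyperopt
# versions only accept an upper bound (see optimize.py per the docstring above).
space = to_hyperopt({'reg_u': ['uniform', ['reg_u', 0.1, 1.0]],
                     'reg_c': ['uniform', ['reg_c', 0.1, 1.0]]})
```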
2 changes: 1 addition & 1 deletion buffalo/cli.py
@@ -17,7 +17,7 @@ def run(self, opt_path):
self.logger.info(f'ALS finished with loss({loss}).')
if opt.save_factors:
self.logger.info(f'Saving model to {opt.model_path}.')
als.dump(opt.model_path)
als.save(opt.model_path)

def optimize(self, opt_path):
als = _ALS(opt_path)
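For context, the corrected `als.save(opt.model_path)` call pairs with a matching load on the algorithm object; a minimal sketch (the no-argument constructor and the `load` signature are assumptions):

```python
from buffalo.algo.als import ALS

als = ALS()            # assumed: default options when no option path is given
als.load('als.model')  # hypothetical path; counterpart of als.save(...) in run()
```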