I tried to use the classic Yelp2018 dataset from the LightGCN and SimGCL papers. I noticed that data_handler_general_cf.py does not provide an interface for adding new datasets, so I modified the file so that it could directly read the train.txt and test.txt files in the format provided by the official LightGCN code.
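For reference, a minimal sketch of such a loader, assuming the official LightGCN split format where each line is a user id followed by that user's item ids (the function name `read_lightgcn_split` is my own, not part of SSLRec):

```python
def read_lightgcn_split(lines):
    """Parse LightGCN-style split lines: each line is `user item1 item2 ...`.

    Returns a flat list of (user, item) interaction pairs, which can then be
    turned into the sparse matrix the data handler expects.
    """
    pairs = []
    for line in lines:
        tokens = line.split()
        if len(tokens) < 2:
            continue  # user listed with no interactions
        user = int(tokens[0])
        pairs.extend((user, int(item)) for item in tokens[1:])
    return pairs

print(read_lightgcn_split(["0 10 11 12", "1 13"]))
# [(0, 10), (0, 11), (0, 12), (1, 13)]
```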
After completing this step, I ran LightGCN and SimGCL with the best parameters reported in the original papers, set as follows:
LightGCN: batch_size=2048, layer_num=3, reg_weight=1.0e-4, embedding_size=64
SimGCL: batch_size=2048, layer_num=3, reg_weight=1.0e-4, cl_weight=0.5, temperature=0.2, eps=0.05, embedding_size=64
Unfortunately, LightGCN and SimGCL do not seem to train properly at all: the loss of LightGCN stays fixed at 0.6931, while the Recall@20 of SimGCL is only 0.04. I then further tuned the parameters of SimGCL, but the best result is only 0.066 (versus 0.072 reported in the original paper).
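A quick sanity check on that loss value: 0.6931 is exactly ln 2, which is what the BPR pairwise loss evaluates to when the positive and negative scores are equal, i.e. when the model is not learning to separate them at all. A minimal illustration (my own toy function, not SSLRec's implementation):

```python
import math

def bpr_loss(pos_score, neg_score):
    """BPR pairwise loss for one triple: -log(sigmoid(pos - neg))."""
    return -math.log(1.0 / (1.0 + math.exp(-(pos_score - neg_score))))

# When positive and negative scores are indistinguishable, the loss sits at ln 2:
print(round(bpr_loss(0.0, 0.0), 4))  # 0.6931
print(round(math.log(2), 4))         # 0.6931
```

A loss frozen at this value usually means the embeddings are not being updated (e.g. wrong learning rate, optimizer not attached to the parameters, or a data-loading issue), rather than a problem with the loss itself.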
I am confused by this, and I would like to ask whether inconsistencies between SSLRec's sampling and training strategies and those of mainstream implementations could explain the above behavior.
SSLRec's sampling may differ from the strategy used in LightGCN-PyTorch, which samples uniformly by first sampling users and then sampling a positive and a negative item for each. SSLRec instead follows the distribution of the original interactions: during training, we sample a negative item for each interaction.
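To make the difference concrete, here is an illustrative sketch of the two strategies (function names and details are my own simplification, not the actual SSLRec or LightGCN-PyTorch code):

```python
import random

def sample_per_interaction(train_pairs, num_items, rng):
    """Interaction-driven sampling (the SSLRec style described above):
    every (user, pos) interaction yields one triple, so active users
    appear in proportion to their number of interactions."""
    pos_set = set(train_pairs)
    triples = []
    for user, pos in train_pairs:
        neg = rng.randrange(num_items)
        while (user, neg) in pos_set:  # resample until neg is a true negative
            neg = rng.randrange(num_items)
        triples.append((user, pos, neg))
    return triples

def sample_user_first(train_pairs, num_items, num_samples, rng):
    """User-first uniform sampling (the LightGCN-PyTorch style): draw a user
    uniformly, then one of their positives, then a negative."""
    by_user = {}
    for user, item in train_pairs:
        by_user.setdefault(user, []).append(item)
    users = sorted(by_user)
    pos_set = set(train_pairs)
    triples = []
    for _ in range(num_samples):
        user = rng.choice(users)
        pos = rng.choice(by_user[user])
        neg = rng.randrange(num_items)
        while (user, neg) in pos_set:
            neg = rng.randrange(num_items)
        triples.append((user, pos, neg))
    return triples
```

With a skewed dataset the two strategies weight users differently per epoch, which can shift results slightly between repositories even when the model code is identical.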
Additionally, during training, we employ early stopping to avoid manually tuning the number of training epochs for each method. This approach has proven effective for the currently implemented methods. However, on new datasets, the early-stopping hyperparameters (such as patience) may need to be adjusted to achieve optimal performance for different methods.
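The patience-based scheme can be sketched as follows (an illustrative stand-in, not the actual SSLRec training loop; `eval_epoch` is a hypothetical callback for one epoch of training plus validation):

```python
def train_with_early_stopping(eval_epoch, max_epochs=100, patience=10):
    """Stop once the validation metric has not improved for `patience` epochs.

    eval_epoch(epoch) should run one training epoch and return a validation
    metric where higher is better (e.g. Recall@20).
    """
    best_metric, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        metric = eval_epoch(epoch)
        if metric > best_metric:
            best_metric, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_metric
```

With too small a patience, slow-starting models on a new dataset can be cut off before they converge, which is one plausible source of the gap reported above.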
Moreover, several model-agnostic components (see models/model_utils.py and models/loss_utils.py), such as the InfoNCE loss and the regularization loss, are implemented uniformly in SSLRec for all methods. Note, however, that these implementations may differ from other repositories, each of which has its own design and implementation choices. Nonetheless, in all our experiments so far, SimGCL consistently outperforms LightGCN across scenarios and datasets.
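For readers comparing repositories, a minimal dot-product form of the InfoNCE loss for a single anchor looks like this (an illustrative sketch only; the actual shared implementation in models/loss_utils.py may normalize embeddings or batch things differently):

```python
import math

def info_nce(anchor, positive, negatives, temperature=0.2):
    """InfoNCE loss for one anchor: -log(exp(s+/t) / (exp(s+/t) + sum exp(s-/t))),
    where s+ and s- are dot-product similarities and t is the temperature."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    pos_logit = dot(anchor, positive) / temperature
    neg_logits = [dot(anchor, neg) / temperature for neg in negatives]
    denom = math.exp(pos_logit) + sum(math.exp(l) for l in neg_logits)
    return -(pos_logit - math.log(denom))
```

Small differences here, such as whether embeddings are L2-normalized before the dot product or how in-batch negatives are chosen, are exactly the kind of cross-repository detail that can shift final metrics by a point or two.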
Lastly, thank you for testing our framework and for your valuable questions. If you have any ideas for improving SSLRec, please don't hesitate to submit a pull request (PR). We appreciate your contributions to the SSLRec framework : )