Description
Hi, I've been using this library for the GraphSageDGL/GraphSage models and it has worked fine for months. Recently, however, I needed to create multiple models for different datasets, and I ran into a slowdown during training. At first I thought my server simply had less compute available because multiple inference jobs were running on it, so I moved to a new one, but to no effect.
GraphSageDGL training usually runs at about 22 it/s (7 it/s for GraphSage). Sometimes it keeps this speed and completes all 100 epochs.
However, sometimes (usually halfway through the first epoch or at the beginning of the second) it slows down to 1-1.5 it/s.
Sometimes it oscillates between 22 and 1 it/s, and sometimes it stays at 1 it/s for the rest of the run. As a result, a model that used to train in 40 minutes is still training after 15 hours.
I thought this might be a DGL problem, so I tried the vanilla GraphSage model, and it shows the same issue: the speed drops from 7 it/s to 1-1.5 it/s at some random epoch and then follows the same pattern as the DGL version.
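In case it helps anyone reproduce or narrow this down, here is roughly how I can capture the slowdown in numbers: a small stdlib-only sliding-window throughput monitor (the class name and API are mine, not part of the library) that gets ticked once per training iteration.

```python
import time
from collections import deque

class ThroughputMonitor:
    """Tracks iterations/second over a sliding window of recent ticks."""

    def __init__(self, window=50, clock=time.perf_counter):
        # clock is injectable so the monitor can be tested deterministically
        self.times = deque(maxlen=window)
        self.clock = clock

    def tick(self):
        """Call once per training iteration (e.g. per batch)."""
        self.times.append(self.clock())

    def rate(self):
        """Current it/s over the window, or None if too few samples."""
        if len(self.times) < 2:
            return None
        span = self.times[-1] - self.times[0]
        return (len(self.times) - 1) / span if span > 0 else None
```

Logging `rate()` every 50 or so iterations makes the 22 -> 1 it/s transitions show up with a timestamp in the training log, which should make it easier to correlate with epoch boundaries or system events.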
I have tried both the CPU and CUDA builds of PyTorch (I am not using a GPU, so the CUDA build also runs on CPU), and both reproduce the problem. As mentioned, it happens randomly and sometimes does not happen at all.
It also never happens on my MacBook M2 Pro, where training is a bit slower (15 it/s) but always consistent. The issue occurs on an Amazon EC2 server with 8 CPUs (x86_64) and 32 GB RAM. I checked, and no other background processes were running during training.
Here are the libraries I am using:
dgl==1.1.2
LibRecommender==1.3.0
loguru==0.7.2
pandas==2.1.1
psycopg2-binary==2.9.7
python-dotenv==1.0.0
scikit-learn==1.3.1
scipy==1.11.2
tensorflow==2.14.0
torch==2.1.0
torchvision
torchaudio
boto3==1.28.55
botocore==1.31.55
