Description
Hi, I've been using this library for the GraphSageDGL/GraphSage models and it has worked fine for months. Recently, however, I needed to create multiple models for different datasets, and I ran into a slowdown during training. At first I thought my server simply had less compute available because multiple inference jobs were running on it, so I moved to a new one, but to no effect.
GraphSageDGL training usually runs at about 22 it/s (7 it/s for GraphSage). Sometimes it keeps this speed and completes all 100 epochs.
However, sometimes (usually halfway through the first epoch or at the beginning of the second) it slows down to 1-1.5 it/s.
Sometimes it oscillates between 22 and 1 it/s, and sometimes it stays at 1 it/s for the rest of the run. As a result, a model that used to train in 40 minutes is still training after 15 hours.
I thought this might be a DGL problem, so I tried the vanilla GraphSage model, and it shows the same issue: the speed drops from 7 it/s to 1-1.5 it/s at some random epoch and then follows the same pattern as the DGL version.
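In case it helps anyone reproduce or narrow this down, here is roughly how I can capture the slowdown in numbers: a small stdlib-only sliding-window throughput monitor (the class name and API are mine, not part of the library) that gets ticked once per training iteration.

```python
import time
from collections import deque

class ThroughputMonitor:
    """Tracks iterations/second over a sliding window of recent ticks."""

    def __init__(self, window=50, clock=time.perf_counter):
        # clock is injectable so the monitor can be tested deterministically
        self.times = deque(maxlen=window)
        self.clock = clock

    def tick(self):
        """Call once per training iteration (e.g. per batch)."""
        self.times.append(self.clock())

    def rate(self):
        """Current it/s over the window, or None if too few samples."""
        if len(self.times) < 2:
            return None
        span = self.times[-1] - self.times[0]
        return (len(self.times) - 1) / span if span > 0 else None
```

Logging `rate()` every 50 or so iterations makes the 22 -> 1 it/s transitions show up with a timestamp in the training log, which should make it easier to correlate with epoch boundaries or system events.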
I have tried both the CPU and CUDA builds of PyTorch (I am not using a GPU, so the CUDA build also runs on CPU), and both reproduce the problem. As mentioned, it happens randomly and sometimes does not happen at all.
It also never happens on my MacBook M2 Pro, where training is a bit slower (15 it/s) but always consistent. The issue occurs on an Amazon EC2 server with 8 CPUs (x86_64) and 32 GB RAM. I checked, and no other background processes were running during training.
Here are the libraries I am using:
dgl==1.1.2
LibRecommender==1.3.0
loguru==0.7.2
pandas==2.1.1
psycopg2-binary==2.9.7
python-dotenv==1.0.0
scikit-learn==1.3.1
scipy==1.11.2
tensorflow==2.14.0
torch==2.1.0
torchvision
torchaudio
boto3==1.28.55
botocore==1.31.55
