warn: Cannot forward task (ID) to processing node (IP):3000: Failed sending data to the peer #23
Comments
It's likely a network configuration error.
And try to send a task, what do you get? |
The hosts were scripted builds and are exactly the same. All hosts are in the same rack, on the same switch. Locking shifts the load to the other host, but with only 3 'splits' this time (which is probably normal since the split is set to 400):

| # | Node | Status | Queue | Version | Flags |
|---|------|--------|-------|---------|-------|
| 1 | 192.168.1.172:3000 | Online | 0/4 | 1.5.2 | L |
Mm, so if both nodes are unlocked, only one node is receiving all the tasks. This is odd, I wonder if it's due to the network being really fast and the tasks being received all at once before they can be forwarded. The choice of a node is done in this function https://github.com/OpenDroneMap/ClusterODM/blob/master/libs/nodes.js#L135 but if two tasks are racing they might both end up on the same node, because they are not aware of each other. I think this is a separate issue and doesn't explain the |
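The race described above can be illustrated with a minimal sketch. This is not ClusterODM's actual code (that lives in libs/nodes.js); the `Node` class and the `pick_node` and `reserve_node` functions are hypothetical names. It assumes the node is chosen by the lowest queue-to-capacity ratio and that the reported queue count is only refreshed after a task has been forwarded.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a processing node as seen by the proxy."""
    address: str
    queue_count: int      # last queue length reported by the node
    max_slots: int        # parallel tasks the node accepts
    reserved: int = 0     # tasks picked for this node but not yet reported


def pick_node(nodes):
    """Naive choice: least-loaded node based on the reported queue only."""
    return min(nodes, key=lambda n: n.queue_count / n.max_slots)


def reserve_node(nodes):
    """Same choice, but counts in-flight picks so racing tasks see each other."""
    node = min(nodes, key=lambda n: (n.queue_count + n.reserved) / n.max_slots)
    node.reserved += 1
    return node


nodes = [Node("192.168.1.172:3000", 0, 4), Node("192.168.1.173:3000", 0, 4)]

# Five sub-models arrive before any node reports an updated queue.
naive = [pick_node(nodes).address for _ in range(5)]
aware = [reserve_node(nodes).address for _ in range(5)]

print(naive.count("192.168.1.172:3000"))  # 5: every task lands on the first node
print(aware.count("192.168.1.172:3000"))  # 3: the other 2 go to the second node
```

With the naive choice, every task sees the same stale queue of 0 and picks the first node, which would match a 5/4 queue on one node and 0/4 on the other. Tracking a reservation at pick time is one possible way to make concurrent picks aware of each other; in threaded code the reservation would also need a lock.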
This is still an issue. It seems to keep stacking the same submodel onto the same node over and over. It complains about read timeouts, yet has no trouble piling on the jobs:
|
Yes, it is correctly distributed:
There's something wrong with the splitting it seems. |
Maybe it's the number of parallel connections that's tripping something on your network. I would try to patch https://github.com/OpenDroneMap/ODM/blob/09109f33f94b8e9f2ec804fec94c3a53783daf4a/opendm/remote.py#L356 by adding the |
Thanks, that helped. I set it to 3. It seemed to upload just as fast to the processing nodes with 3 connections as it did with no limit. It's still piling everything onto one node though:
And the console says that there are two sub-models, so a rogue one has been dropped in there. This is a dev system, there are no other jobs processing. I can see from the console that it didn't try to upload any sub-models twice, so there's something else going on there. |
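Capping the number of simultaneous uploads, as tried above with a limit of 3, can be sketched with a bounded worker pool. This is a generic illustration, not the code in opendm/remote.py; `upload_all`, `fake_upload`, and the `max_parallel` parameter are hypothetical names, and the upload itself is simulated with a short sleep.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def upload_all(items, upload_one, max_parallel=3):
    """Run upload_one on every item with at most max_parallel in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() returns results in the same order as the inputs
        return list(pool.map(upload_one, items))


# Instrumentation to observe peak concurrency (illustration only)
_lock = threading.Lock()
_active = 0
peak = 0


def fake_upload(name):
    """Simulated upload; a real one would send the file to a node."""
    global _active, peak
    with _lock:
        _active += 1
        peak = max(peak, _active)
    time.sleep(0.01)
    with _lock:
        _active -= 1
    return name


images = [f"IMG_{i:04d}.JPG" for i in range(12)]
uploaded = upload_all(images, fake_upload, max_parallel=3)
print(len(uploaded), "files uploaded, peak concurrency:", peak)
```

The pool size is the only knob here: the peak concurrency never exceeds `max_parallel`, which is the behaviour a connection limit is meant to provide.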
@ChipwizBen see if the changes in #32 help this issue. Also try to set |
Closing this with the assumption that #32 fixed or helped with the network issues. |
Experimenting with ClusterODM. I have a two-node test cluster set up and a VM with ClusterODM running on it. Everything seems to be set up correctly: both devices show correctly at :10000 and with NODE LIST.
When submitting a job with split, it throws the error:
warn: Cannot forward task (ID) to processing node (IP):3000: Failed sending data to the peer
The error appears with both IPs alternately, across several test datasets ranging from 14 to 986 images. No GCPs on these. I tried the following splits on a 986-image dataset:
50 - error above
100 - error above
400 - error above
500 - job splits into FIVE parts, which only make it to one node. I'm raising the issue from my mobile, but hopefully the table below comes out OK:
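| # | Node | Status | Queue | Version |
|---|------|--------|-------|---------|
| 1 | 192.168.1.172:3000 | Online | 5/4 | 1.5.2 |
| 2 | 192.168.1.173:3000 | Online | 0/4 | 1.5.2 |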
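As a rough sanity check on the split values tried above, and assuming the split value is the target average number of images per sub-model, the expected number of sub-models for 986 images can be estimated with a ceiling division. `expected_submodels` is a hypothetical helper; the real count depends on how the images are clustered, so these are estimates only.

```python
import math


def expected_submodels(num_images, split):
    """Rough estimate: images divided by the target images per sub-model."""
    return math.ceil(num_images / split)


for split in (50, 100, 400, 500):
    print(split, expected_submodels(986, split))
```

By this estimate a split of 500 should give about 2 sub-models, so a queue of 5 on a single node is more than the split alone would explain.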