
PyTorch DataLoading issue (with Equinox ?) #248

Closed
pablo2909 opened this issue Dec 19, 2022 · 3 comments
Labels
question User queries

Comments

@pablo2909

Hello,

I encountered an issue (and a fix) with loading data through a PyTorch DataLoader when used with JAX (and, I think, Equinox). I am not sure this belongs exactly here, so please feel free to tell me and I will move it somewhere else. I also mention #137, since I feel this is related.

The setup:

I train a small MLP on a classification task on MNIST. I use the data loader given in the JAX documentation (https://jax.readthedocs.io/en/latest/notebooks/Neural_Network_and_Data_Loading.html), notably the NumpyLoader and the associated collate function. When I run the training script I get incoherent losses and accuracies; if I use the standard PyTorch DataLoader instead, I do not face the issue.
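For context, the relevant part of that notebook is a DataLoader whose collate function stacks samples into numpy arrays rather than torch tensors. A simplified sketch of the idea (paraphrased here, not copied verbatim from the notebook) looks roughly like this:

```python
import numpy as np
from torch.utils import data

def numpy_collate(batch):
    # Recursively stack samples into numpy arrays instead of torch tensors.
    if isinstance(batch[0], np.ndarray):
        return np.stack(batch)
    elif isinstance(batch[0], (tuple, list)):
        return [numpy_collate(samples) for samples in zip(*batch)]
    else:
        return np.array(batch)

class NumpyLoader(data.DataLoader):
    # A DataLoader that yields numpy arrays by swapping in the collate above.
    def __init__(self, dataset, batch_size=1, shuffle=False, num_workers=0):
        super().__init__(
            dataset,
            batch_size=batch_size,
            shuffle=shuffle,
            num_workers=num_workers,
            collate_fn=numpy_collate,
        )
```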

I link to two files (one failing and one passing). If anyone has an idea why it does not work, I would love to know. I hope this can also help others, since I have been looking into it for two days now.

Thank you for any input

Link to files: https://gist.github.com/pablo2909/3a2cec869a43421859520750990f263e

@patrick-kidger patrick-kidger added the question User queries label Dec 20, 2022
@pablo2909
Author

Update:

I think I can pinpoint the location of the problem a bit more accurately. I replaced the PyTorch DataLoader and Dataset with custom classes, to make sure that no PyTorch mechanism was interfering with JAX/Equinox during training. The code is very simple and is provided in the link below (see also the sketch after the list that follows).
Additionally, I provide a second file that trains an MLP on MNIST. The training will:

  • fail if we extract the data from the PyTorch MNIST dataset (uncomment lines 57-59)
  • pass if we extract the data from the PyTorch DataLoader (uncomment lines 55-56 and comment out lines 57-59)

Note that even though I extract the data from the PyTorch Dataset/DataLoader, I still train using my custom Dataset/DataLoader.
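For reference, a purely illustrative sketch of what such custom classes can look like (the actual code used is in the gist linked below):

```python
import numpy as np

class ArrayDataset:
    # Minimal dataset over in-memory numpy arrays; no PyTorch involved.
    def __init__(self, images, labels):
        self.images = np.asarray(images)
        self.labels = np.asarray(labels)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]

class SimpleLoader:
    # Minimal loader that shuffles indices and yields numpy batches.
    def __init__(self, dataset, batch_size, shuffle=True, seed=0):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.rng = np.random.default_rng(seed)

    def __iter__(self):
        indices = np.arange(len(self.dataset))
        if self.shuffle:
            self.rng.shuffle(indices)
        for start in range(0, len(indices), self.batch_size):
            batch_idx = indices[start:start + self.batch_size]
            images, labels = zip(*(self.dataset[i] for i in batch_idx))
            yield np.stack(images), np.asarray(labels)
```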

https://gist.github.com/pablo2909/91127b9c7cb441b3b897bbebd9c0eff1

Thank you for any input

PS: Apologies for the length of the messages and code.

@jatentaki
Contributor

I suggest against a numpy collate function in general; instead, use `tree_map(lambda tensor: tensor.numpy(), batch)` on the batch returned by a standard DataLoader. The reason is that torch tensors get special treatment when passed between multiple processes, whereas numpy arrays get the standard serialize/deserialize treatment, resulting in a big performance hit last I checked. Maybe this also makes your bug go away?
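Sketching that suggestion out (the dataset/loader setup here is just for illustration and assumes torchvision is available): keep PyTorch's default collation so that torch tensors cross the worker/main process boundary via their special shared-memory path, and only convert to numpy once the batch arrives in the main process.

```python
import jax
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_dataset = datasets.MNIST(
    "data", train=True, download=True, transform=transforms.ToTensor()
)
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True, num_workers=2)

for batch in train_loader:
    # Convert every torch tensor leaf of the (images, labels) batch to a
    # numpy array; JAX/Equinox code can then consume the result directly.
    images, labels = jax.tree_util.tree_map(lambda tensor: tensor.numpy(), batch)
    # ... training step with the JAX/Equinox model goes here ...
```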

@pablo2909
Copy link
Author

Sorry, I failed to reply to this and close the issue. I can't recall exactly what the issue was, but I ended up doing that and it fixed it.
