Resnet #108

Merged: 10 commits merged into poets-ai:feature/custom_jit on Nov 18, 2020

Conversation

alexander-g
Contributor

  • ResNet model architecture and an example for training on ImageNet
    • code is mostly adapted from the flax library
    • pretrained ResNet50 with 76.5% accuracy
    • pretrained ResNet18 with 68.7% accuracy
  • Experimental support for mixed precision: previously, all layers set their parameters' dtype to the input's dtype. This is incorrect; for numerical-stability reasons, all parameters should stay float32 even when performing float16 computations (a minimal sketch of the pattern follows this list). See more here.
  • Some issues I had during training:
    • There seems to be a memory leak during training: RAM usage constantly increased
    • I had to use smaller batch sizes than when training with Flax or with TensorFlow before maxing out GPU memory (64 instead of 128 for ResNet50 on an RTX 2080 Ti). This might of course be due to a mistake in my code, but the number of parameters is identical to the Flax and PyTorch versions, so I suspect the cause lies elsewhere.
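A minimal sketch of the mixed-precision pattern mentioned above, written in plain jax.numpy rather than Elegy's actual module API: parameters are created and stored in float32, the heavy computation runs in float16, and the result is cast back to float32 at the end (mirroring the final cast in the reviewed ResNet code):

import jax
import jax.numpy as jnp

def init_dense(rng, in_features, out_features):
    # Parameters are created and kept in float32.
    w = jax.random.normal(rng, (in_features, out_features), dtype=jnp.float32) * 0.01
    b = jnp.zeros((out_features,), dtype=jnp.float32)
    return {"w": w, "b": b}

def dense_apply(params, x, compute_dtype=jnp.float16):
    # Cast inputs and parameters to the compute dtype only for the matmul...
    w = params["w"].astype(compute_dtype)
    b = params["b"].astype(compute_dtype)
    y = x.astype(compute_dtype) @ w + b
    # ...and cast the result back to float32 for numerically sensitive steps
    # such as the softmax/loss.
    return y.astype(jnp.float32)

params = init_dense(jax.random.PRNGKey(0), 64, 10)
x = jnp.ones((8, 64), dtype=jnp.float16)
logits = dense_apply(params, x)  # float32 output, float32 parameters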

@codecov-io

codecov-io commented Nov 15, 2020

Codecov Report

❗ No coverage uploaded for pull request base (feature/custom_jit@f341e2b).
The diff coverage is n/a.

@@                  Coverage Diff                  @@
##             feature/custom_jit     #108   +/-   ##
=====================================================
  Coverage                      ?   75.91%           
=====================================================
  Files                         ?      101           
  Lines                         ?     4695           
  Branches                      ?        0           
=====================================================
  Hits                          ?     3564           
  Misses                        ?     1131           
  Partials                      ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@cgarciae
Collaborator

@alexander-g this is awesome, thanks! I'll start reviewing the code.

@@ -0,0 +1,3 @@
#additional requirements
tensorflow-datasets==4.0.1
tensorflow-gpu==2.2.0 #tensorflow-cpu also ok, but with gpu faster
Collaborator

why is it faster with tf-gpu? 🤔 Here we are only using tf.data which runs in CPU, right?

Contributor Author

I believe this is because JPEG decoding is performed on the GPU. To be honest, I have not tested it with Elegy; I noticed this when training with Flax and simply adopted it here.

x = self.block_type(dtype=self.dtype)(x, 64 * 2 ** i, strides=strides)
x = jnp.mean(x, axis=(1, 2))
x = nn.Linear(1000, dtype=self.dtype)(x)
x = jnp.asarray(x, jnp.float32)
Collaborator

The asarray here is for casting, right? Would that be similar to the following?

x = x.astype(jnp.float32)

Contributor Author

By default astype performs a copy of the array even if the type is the same, whereas asarray only performs the copy if conversion is needed.
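A small illustration of the copy-vs-no-copy distinction being described, shown with plain NumPy; whether the same eager-mode copy occurs in jax.numpy, or matters under jit, is not verified here:

import numpy as np

a = np.ones((3, 3), dtype=np.float32)

# astype copies by default, even when the dtype is unchanged...
b = a.astype(np.float32)
print(b is a)  # False: a new array was allocated

# ...whereas asarray returns the input untouched when no conversion is needed.
c = np.asarray(a, dtype=np.float32)
print(c is a)  # True: no copy was made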

@cgarciae
Collaborator

cgarciae commented Nov 16, 2020

Hey @alexander-g, thanks a lot, this is an amazing contribution! We really appreciate it.

Some notes:

  1. Regarding the memory leak: since I don't yet have the ImageNet files needed to train the model, I created a script that uses a generator of random images and labels solely to test the training code; you can find it here. I wasn't able to reproduce the memory leak on my machine. The only thing that could potentially explain it right now is that, since we don't yet have any device placement policies, the scalar arrays for the logs that accumulate during training remain on the GPU indefinitely. This isn't guaranteed to be the problem, and two floats per step doesn't seem like much. Regardless, a device placement API could be discussed in a separate issue.
  2. We have to decide on a name for the module where we will keep these standard architectures; nets sounds nice, but Keras uses applications.
  3. A point we can start discussing is how to better load pre-trained models. We should try (again) to save the weights separately from the model in a serializable format to avoid the library-version dependence that comes with pickle; HDF5 through the tables package was relatively easy. We should probably follow the weights="imagenet" API from Keras to load pretrained weights (a minimal sketch of the idea follows this list).
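For point 3, a minimal sketch of what saving weights separately from the model could look like, assuming the parameters have already been flattened into a plain name-to-array dict; the function names and file layout here are illustrative, not an Elegy API:

import numpy as np

def save_weights(path, params):
    # params: a flat dict mapping parameter names to arrays.
    # np.savez writes plain numeric arrays, so no pickle is involved.
    np.savez(path, **{name: np.asarray(value) for name, value in params.items()})

def load_weights(path):
    # Returns the same flat name-to-array dict that was saved.
    with np.load(path) as data:
        return {name: data[name] for name in data.files}

# e.g. save_weights("resnet50_imagenet.npz", flat_params)
# A weights="imagenet" constructor argument could then simply download
# and load such a file.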

CC: @charlielito

@alexander-g
Contributor Author

I'm glad you appreciate it. I really like JAX and Elegy and would like to contribute more in the future.

Re 1: I want to do some profiling myself and will let you know if I find anything.
Re 3: Consider adding some kind of version control for the pretrained models. I may retrain R18 because its performance is slightly worse than PyTorch's; R50 is fine though.

@alexander-g
Contributor Author

alexander-g commented Nov 16, 2020

Another issue I forgot: I first tried to convert a pretrained Flax model to Elegy, but I got completely different results for identical inputs. Even the very first convolutional layer gave different outputs.
This doesn't necessarily mean something is wrong, but I wanted to mention it.

@cgarciae
Collaborator

Our current implementation of most of the layers is taken from Haiku; in the case of Conv there is only a slight modification to support feature grouping. This should be a separate issue, but I think we should do the same as we do for losses and metrics:

  • Try to expose the Keras API
  • Test numerical equivalence against a base implementation; this could be Flax if we get the benefit of porting their pretrained models (a sketch of such a check follows below).
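A minimal sketch of that kind of numerical-equivalence check, using np.testing; assert_layers_match and the two apply_* callables are hypothetical placeholders for the Elegy layer and the reference (e.g. Flax) layer, both built from the same parameter arrays:

import numpy as np

def assert_layers_match(apply_ours, apply_reference, input_shape, rtol=1e-5, atol=1e-5):
    # Feed both implementations the same random input and require
    # element-wise agreement within tolerance.
    rng = np.random.RandomState(0)
    x = rng.standard_normal(input_shape).astype(np.float32)
    np.testing.assert_allclose(
        np.asarray(apply_ours(x)),
        np.asarray(apply_reference(x)),
        rtol=rtol,
        atol=atol,
    )

# e.g. assert_layers_match(elegy_conv, flax_conv, (1, 224, 224, 3)),
# where elegy_conv and flax_conv were initialized with identical weights.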

@cgarciae
Collaborator

@alexander-g I am going to merge this branch as it contains useful fixes.
Feel free to open a new PR if you want to continue improving it.

@cgarciae cgarciae merged commit 22757e0 into poets-ai:feature/custom_jit Nov 18, 2020
cgarciae added a commit that referenced this pull request Nov 18, 2020
* save

* initial refactor

* jit

* jit + init_jit

* handle rng

* jit + value_and_grad

* save

* save

* save

* fix metrics_loss

* save

* save

* *_on_batch methods

* get_states

* save

* fix tests

* fix examples

* format black

* use pickle only to save

* clean model

* save

* [Fix] Return all files to 0644 file permissions

* fix docs

* update module-system guide

* update README

* fix elegy.jit

* update jax

* fix tests

* small refactor

* jupyter dev dependency

* update docs

* update poetry in github actions

* use --no-hashes

* use --without-hashes

* update requirements during docs deployment

* specify poetry >= 1.1.4 as a dev dependency

* fix wraps init

* Resnet (#108)

* added resnet18

* imagenet input pipeline, from https://github.com/google/flax

* experimental support for mixed precision

* full training script

* black + resnet test

* format black

* re-jit when loading a model for compatibility among platforms

* format black

* use different poetry installer

Co-authored-by: Cristian Garcia <cgarcia.e88@gmail.com>

Co-authored-by: David Cardozo <david@cerberusdata.ai>
Co-authored-by: alexander-g <3867427+alexander-g@users.noreply.github.com>