GroupNorm in ResnetBlocks #64
Hi Jionghao, and thanks for the kind words! Your results line up with some papers I've been reading recently. Could you try a pixelnorm, either in place of the groupnorm or in the direct main path, and see if it leads to results comparable to your layernorm run? Today is my last day open sourcing, but I can throw in this last change if you get the experiments to me in time
Sure, I will try it! But I will have to get back to you after I wake up in the morning, in about 7-8 hours...
@shanemankiw there's a trend in transformers to remove the mean centering in layernorms (rmsnorm), so it lines up with Tero Karras' usage of pixelnorm |
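The rmsnorm/pixelnorm connection mentioned above can be sketched in a few lines of numpy (a toy illustration with made-up shapes, not code from the repo): rmsnorm is layernorm with the mean-centering step dropped, and pixelnorm applies the same root-mean-square scaling over the channel dimension.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # standard layernorm over the last dim: center, then scale by std
    mu = x.mean(axis=-1, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    # rmsnorm: the same scaling, but with the mean-centering step removed;
    # pixelnorm (Karras et al.) has this same functional form over channels
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

x = np.random.default_rng(0).normal(size=(4, 16))

# rmsnorm applied to a pre-centered input reproduces layernorm exactly,
# which is why dropping the centering is the only difference between them
centered = x - x.mean(axis=-1, keepdims=True)
print(np.allclose(layer_norm(x), rms_norm(centered)))  # True
```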
@shanemankiw thank you for those results 🙏 i've made the change in 1.1.0 just in the nick of time! alright, time to get back to those emails. go make the holodeck happen 😉
@lucidrains Thanks for your efforts! @MarcusLoppe Thanks for the experiments! The 'catastrophic forgetting' problem you talked about is precisely the problem that made me start debugging. You would think that a model this size could figure out a way to overfit on a few hundred meshes, but it always fails on around 10% of the cases. In the paper, MeshGPT could achieve 98% accuracy even on the test set, so this is definitely not normal...
@MarcusLoppe awesome! thanks for the corroboration! you should switch into the field.. i really think you have a lot of potential, even your initials are ML lol PS: i'm not kidding about the holodeck. in a decade, mark my words
In the above I used 150 chairs and 150 tables and augmented each x50, so the dataset is 15,000 meshes. I'm pretty sure it's possible to get great results like those in the paper.
While running I stored the worst and best results; in the image, the first row shows the best, the second row the worst, and the rest are 40 random samples. Here is the worst mesh: you can see some defects, but that's pretty good after a few hours of training! I trained across 16 different categories with 50 models each (800 models total) and augmented them x100 (80k meshes). I let it run for about 10hrs and got 0.5 MSE loss. The results usually get good at around 0.4 loss, so some fragments are expected.
I've heard about ring attention on the Last Week in AI podcast; it seems like they used it with sparse attention and a dozen other small tweaks. I'm not quite sure it lives up to the hype: in the testing I've seen, they ask it about one thing in the context window, but what if you ask it an abstract question that requires finding 10-20 needles in the haystack/context window? 😕 Maybe. I'm not an ML programmer and don't know how to debug a model like @shanemankiw did; if I did, I might've been able to resolve this issue a long time ago :( But I do like using and training models in my software. For example, I used Mistral-7B to extract the requirements from job adverts and output them as JSON, extracting information such as hard skills, soft skills, certifications, company culture, education, and other qualifications. I then converted it to an ONNX model, used it in my ASP.NET backend, and made a nice little React front-end. This way you can quickly sift through many job ads and don't need to waste time reading the whole thing only to realize they want 7+ years of experience :)
@MarcusLoppe Your results are great! Thank you so much for sharing. All of this with only a 2k codebook? This thing sure has a lot of potential.
Correct, only using 2k. Thank you very much :) I appreciate yours and @lucidrains' comments; not many people in real life care about this, so it's refreshing and heartwarming to get some compliments :) https://file.io/Mpg7AoUYoBgC (the mse_rows(63) contains the original model plus the reconstructed)
@shanemankiw Then I tested using a 128 codebook size and had great success: 0 fragments, and it took about 2hrs to reach 0.44 loss. You'll probably need a bigger codebook for more meshes, but when dealing with a smaller testing dataset, it's probably better to use a smaller codebook size.
…removing groupnorms from all repos
Hi,
Thanks for your code. Your implementation is an amazing starting point for further research based on MeshGPT. However, I could not make it overfit on a small dataset of around 200 triangle shapes (the shapes have varying numbers of faces) when using batch size > 1. I highly suspect it is because you used GroupNorm instead of LayerNorm in your decoder resblocks here:
meshgpt-pytorch/meshgpt_pytorch/meshgpt_pytorch.py
Line 262 in f8e30ed
I found during debugging that the outputs of self.norm(x)[mask] and self.norm(x[mask]) (not exactly the code, but you get the idea) are significantly different with groupnorm. So models trained with batch size > 1 (when the mask is active) produce wrong meshes when evaluated at batch size = 1. So I rewrote it with layernorm:
class Block(Module):
    def __init__(self, dim, dim_out=None, groups=8, dropout=0.0):
        super().__init__()
        dim_out = default(dim_out, dim)
After I switched to layernorm, it overfits fairly smoothly under a large batch size and also works for bs=1. Note that I did make some other changes for my personal use, but I think this normalization choice is a key factor here.
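The masking problem described above can be demonstrated with a small numpy toy (hypothetical shapes and a hand-rolled norm, not the repo's actual code): layernorm statistics are computed per position, so padded positions cannot leak into real ones, while groupnorm statistics are shared across the sequence, so the content of the padding changes the output at real positions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # x: (seq, dim) -- statistics per position, over the feature dim only
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def group_norm(x, groups, eps=1e-5):
    # x: (seq, dim) -- conv-style norm: statistics shared over
    # (channels within a group x all sequence positions)
    seq, dim = x.shape
    g = x.reshape(seq, groups, dim // groups)
    mu = g.mean(axis=(0, 2), keepdims=True)
    var = g.var(axis=(0, 2), keepdims=True)
    return ((g - mu) / np.sqrt(var + eps)).reshape(seq, dim)

rng = np.random.default_rng(0)
real = rng.normal(size=(2, 8))           # two real positions
pad_a = np.zeros((2, 8))                 # padding variant A: zeros
pad_b = rng.normal(size=(2, 8)) * 10.0   # padding variant B: different junk

xa = np.concatenate([real, pad_a])
xb = np.concatenate([real, pad_b])

# layernorm: real positions are byte-identical regardless of padding
ln_diff = np.abs(layer_norm(xa)[:2] - layer_norm(xb)[:2]).max()
# groupnorm: real positions change when only the padding changes
gn_diff = np.abs(group_norm(xa, 4)[:2] - group_norm(xb, 4)[:2]).max()

print(ln_diff)  # 0.0: padding cannot leak into real rows
print(gn_diff)  # clearly nonzero: padding altered the shared statistics
```

This is why self.norm(x)[mask] and self.norm(x[mask]) diverge with groupnorm but agree with layernorm, and why a model trained with padded batches breaks at bs=1.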
Looking forward to your opinion on this.