
Is this a bug? (Edit: Replace start_image with NotNANtoN's img clip embed?) #49

Closed · afiaka87 opened this issue Feb 16, 2021 · 9 comments

@afiaka87

> elif text is not None:

Saw this new functionality added. Super useful. Just making sure this function works correctly: it looks like it's called during init, but because it returns inside the nested ifs, it only ever runs the img_embed code if you didn't specify a clip_encode (I think).
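To make the pattern concrete, here's a minimal sketch (the function and encoder names are made up for illustration, not the repo's actual identifiers):

```python
# Illustrative sketch of the early-return pattern in question; the names
# (create_encoding, encode_text, encode_image) are stand-ins.

def encode_text(text):
    return f"text-embed({text})"   # placeholder for CLIP's text encoder

def encode_image(img):
    return f"img-embed({img})"     # placeholder for CLIP's image encoder

def create_encoding(text=None, img=None, encoding=None):
    if encoding is not None:
        return encoding            # a custom embedding short-circuits everything
    elif text is not None:
        return encode_text(text)   # returns here, so the img branch below never runs
    elif img is not None:
        return encode_image(img)

# With both text and img supplied, only the text path executes:
print(create_encoding(text="a sunrise", img="photo.png"))  # text-embed(a sunrise)
```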

@NotNANtoN
Contributor

Hey! I thought of that. My idea was to have either an image, a text, or a custom embedding as an optimization target. So if an encoding is passed, it should be used directly; merging the encodings from text and images can be done outside that class. I think returning inside the nested ifs isn't clean code, and that might be what causes the issue.

I was planning to submit a PR soon anyway, in which train_step also returns the latest generated image instead of saving it to disk, for a project of mine. I can clean this up there.

afiaka87 changed the title from "Is this a bug?" to "Is this a bug? (Edit: Replace start_image with NotNANtoN's img clip embed?)" on Feb 20, 2021
@afiaka87
Author

afiaka87 commented Feb 20, 2021

Hm. So we actually already had a "start_image" parameter that trained Siren directly on the image itself for a few hundred iterations. I've not had much success with that technique, though. It successfully "neuralizes" the image, but once it starts training on the phrase's CLIP embed, it just sort of swirls around in the existing colors of the image, slowly blackening more and more of them, as I mention here.

Training on the cosine similarity of an image CLIP embed (as well as your text), on the other hand, does seem to pick up some of the composition of the original image in a way that doesn't break the training for the text embed.
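Roughly what I mean, as a toy sketch (all names and the embedding size are made up for illustration):

```python
import torch
import torch.nn.functional as F

# Sketch of a combined objective: push the CLIP embedding of the generated
# image toward both the text embed and the start-image embed.
def combined_loss(gen_embed, text_embed, img_embed):
    text_sim = F.cosine_similarity(gen_embed, text_embed, dim=-1)
    img_sim = F.cosine_similarity(gen_embed, img_embed, dim=-1)
    return -(text_sim + img_sim).mean()   # minimize the negative, i.e. maximize both

gen, txt, img = (torch.randn(1, 512) for _ in range(3))
print(combined_loss(gen, txt, img))
```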

I guess what I'm saying is I'd definitely like to be able to do both in one go, without having to load up CLIP and encode/combine various things myself before passing them in (even though I technically know how). While I appreciate the power you get with that approach, and I agree it should remain in the code, I don't think beginners will be super excited about having to figure out how CLIP works just to generate some visuals.

@NotNANtoN
Contributor

A good solution could be to merge the text and image embeddings when both are passed in. By default it could be the average of the two, and one could add a text_weight parameter that controls how much the text embedding influences the final embedding.
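Something like this, as a rough sketch (merge_embeddings and text_weight are hypothetical names, not anything in the repo yet; I'm assuming CLIP-style embeddings as torch tensors):

```python
import torch
import torch.nn.functional as F

# Sketch of the proposed merge: a weighted average of the two embeddings,
# re-normalized so cosine-similarity targets stay well-scaled.
def merge_embeddings(text_embed, img_embed, text_weight=0.5):
    merged = text_weight * text_embed + (1.0 - text_weight) * img_embed
    return F.normalize(merged, dim=-1)   # keep the target on the unit sphere

text_embed = F.normalize(torch.randn(1, 512), dim=-1)
img_embed = F.normalize(torch.randn(1, 512), dim=-1)
target = merge_embeddings(text_embed, img_embed, text_weight=0.7)  # 70% text
```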

I think it could also be nice to pull the pre-training routine for the start image out into a proper method. For one application, I'd like to continuously train Siren directly on a video stream while also optimizing for similarity to a given text.

Btw, this might be completely off-topic, but I noticed that using train_step directly instead of the forward method saves me about 2.5 GB of VRAM (in my case, from 7.5 GB to 5 GB). I'll check that again at some point and open an issue about it if I can replicate it.

@afiaka87
Author

> Btw, this might be completely off-topic, but I noticed that using train_step directly instead of the forward method saves me about 2.5 GB of VRAM (in my case, from 7.5 GB to 5 GB). I'll check that again at some point and open an issue about it if I can replicate it.

Please do! VRAM usage has consistently gone up for a while now, but I'm not skilled enough with PyTorch/machine learning in general to know when/where to delete stuff that's no longer needed.

I'm fairly certain VRAM usage went up quite a bit after the "warmup step" was added to the forward method, if that helps your search.

@afiaka87
Author

I've got a notebook here that I've been using to manually define my own forward method, in order to keep VRAM usage somewhat under my control. Not sure if it's the best way to go about it, but lots of people keep forking the original research notebook because it gives them full control over everything, despite it having worse code quality and fewer features. #50

@NotNANtoN
Contributor

> Btw, this might be completely off-topic, but I noticed that using train_step directly instead of the forward method saves me about 2.5 GB of VRAM (in my case, from 7.5 GB to 5 GB). I'll check that again at some point and open an issue about it if I can replicate it.
>
> Please do! VRAM usage has consistently gone up for a while now, but I'm not skilled enough with PyTorch/machine learning in general to know when/where to delete stuff that's no longer needed.
>
> I'm fairly certain VRAM usage went up quite a bit after the "warmup step" was added to the forward method, if that helps your search.

I looked into it, and you seem to be right. The warmup step is the problem; I fixed it by just wrapping it in a "with torch.no_grad():" block.
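For reference, the pattern is just this (a toy stand-in, not the repo's actual warmup code):

```python
import torch
from torch import nn

# Toy stand-in for the generator; the point is only the no_grad wrapper.
model = nn.Linear(512, 512)
x = torch.randn(1, 512)

# Inside no_grad, autograd records no graph, so the warmup forward passes
# don't hold on to intermediate activations and VRAM stays flat.
with torch.no_grad():
    for _ in range(10):
        model(x)
```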

@NotNANtoN
Contributor

I cleaned this up and fixed the VRAM issue here: #58

@NotNANtoN
Contributor

The PR was merged, so it seems like this issue can be closed.

@afiaka87
Author

Sorry bout that. Closing.
