Where should I start if I want to train a model for usage with Neural-Style? #292
Are Network In Network (NIN) models easier to train than VGG models?

Does anyone know of any guides that cover training a model that is compatible with Neural-Style from start to finish? If not, then what do I need to look for in order to make sure the model I am learning to train is compatible with Neural-Style?

What is the easiest way to train a model for use with neural-style? Are there any AMIs available that will let me start messing around with training right away?
Comments
There are at least two parts to this question:
One has to start from the technical part. Caffe http://caffe.berkeleyvision.org is a good choice to start with. It is not too difficult to install, no coding is needed to use it, and it directly produces caffemodel files. To train a model, one needs a dataset (typically packaged as an LMDB together with a mean file), a training prototxt describing the network, and a solver prototxt with the training parameters.
With these in place, training using Caffe will create a model initialized with random weights (according to what is stated in the prototxt file) and start training it using the dataset. Training a deep network from scratch can be difficult and time-consuming. One might start with a small model first, with only a limited number of convolutional layers, or one might try finetuning an existing model. Finetuning means taking an existing, already trained model and training it further using a different dataset, like in this example: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html

Either way, one can without much difficulty create models that work with neural-style, in the sense that the model loads, iterations start, and even the losses may start diminishing. The visual results are often a disappointment, however. I have done this several times already, using wikiart, my own photo library, and a programmatically created dataset of geometrical images. Nothing really useful yet, but I am learning all the time.

Some more detailed notes: for VGG networks, it looks like training prototxt files are not available on the web, but I managed to piece together one that works. Training a VGG network from scratch is not really recommended. From what I have heard, the creators of the model couldn't train the deeper models from scratch, but had to train smaller models first and then add layers for a new training round. But maybe a VGG with only the 1st and 2nd conv layer levels would do as a first try. Or a VGG finetuned on one's own dataset.
I successfully trained a model that is similar to NIN but with fewer layers, and it produced the following images after training it for 70,000 iterations: I used the CIFAR10 data set and this github page along with the supplied scripts: https://gist.github.com/mavenlin/d802a5849de39225bcc6

I am currently wondering if there is a data set of artwork available at the moment that I could use for training. I found this data set: http://people.bath.ac.uk/hc551/dataset.html but that's all I have been able to find thus far for artwork data sets. I was also considering grabbing all the images posted to /r/art/ on Reddit for use in training, and maybe using my massive collection of styles as well.
Your results look familiar to me. They can be interesting as such, but if the model does not respond to the different styles, then what it can achieve is very limited. I cannot now locate the example from which I obtained the wikiart materials. It was not a Caffe example if I remember correctly; more like someone's python project, from which I got a list of wikiart urls with label data. Not all urls worked, but out of those which did I put together an LMDB. I'll look further and see if I find something.
Here's one of my results: only the colors derive from the style. Changing layers, weights and style image produces a number of variations, but they are quite limited. Another model I trained produced mainly clouds or blobs of color: It seems to me that these limitations derive from a too small dataset and too few training iterations.

One also needs to consider the contents of the dataset. Even if the training is successful, the model only learns to recognize such features as stand out in the dataset. To work well, it should recognize the features that are essential in both content and style images. When my geometrical shapes dataset resulted in clouds of color, the model had clearly failed to recognize essential features in the images. I have not used CIFAR10, but I assume that the small size of the images might be a handicap. In another thread here, a hypothesis was raised that a model in neural-style works best with images of the size of the training images.

Roaming a bit further, I have recently been interested in unsupervised training, using a model which first crunches the image into a vector (such as the FC6 output) and then reconstructs the image using deconvolutional and unpooling layers. With this approach, we don't need labels, as the model will learn by comparing the input and output images.
The material about finetuning using wikiart can be found here: https://computing.ece.vt.edu/~f15ece6504/homework2/ . I see it as mainly useful for the image urls and labels, as a basis for making an LMDB for Caffe. And for neural-style, forget Alexnet: it requires GROUP, which is not supported by loadcaffe.
For anyone who is interested, here's one of my VGG16 train prototxt files. Some configuration will be needed if you want to use it: you need to change the pointers to your dataset and mean files, and maybe the batch sizes as well. You may also want to comment out the prob layer to get cleaner output during training.
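The parts to change are in the data layer; roughly like this (a sketch only — the paths and batch size below are placeholders, not the actual values from my file):

```
# Data layer of a training prototxt -- a sketch; source, mean_file and
# batch_size are placeholders that must point to your own files.
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    crop_size: 224
    mean_file: "data/my_dataset/mean.binaryproto"  # your mean image
    mirror: true
  }
  data_param {
    source: "data/my_dataset/train_lmdb"  # your training LMDB
    batch_size: 16                        # reduce if you run out of memory
    backend: LMDB
  }
}
```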
If you want a big image set for training, you can download the ImageNet database. It is what was used to train the default VGG-19 model.
Imagenet is certainly a good choice if one wants to train with a general image set and has the computing platform for large-scale training. I am planning to get another Linux machine dedicated to training, but for the moment I cannot tie up my Linux computer long enough for anything other than small experiments (which are good for learning anyway).
@htoyryla So I have this data set here with art images: I just posted a few examples, but every category seems to have between 50 and 80 images. People-Art has multiple areas such as Annotations and JPEG images, whereas Photo-Art does not. Would the wiki-art data set be better, or would the People-Art/Photo-Art-50 data set be better for training?
And this previously fine-tuned model here already produces good images in neural-style: https://gist.github.com/jimmie33/509111f8a00a9ece2c3d5dde6a750129#file-readme-md How would I, step by step, convert this data set into the lmdb files, and then how exactly would I use your prototxt to train the already-made caffemodel? What train.prototxt and solver.txt files do I need, and which ones do I modify? What modifications do I make? I have tried modifying some, but it was unclear from the naming which file I should replace. I tried making a NIN model like the one in Neural-Style using the CIFAR10 data set, but it had the exact same amount of layers that my previous CIFAR10 model had, and not the same layers as Neural-Style's NIN model has. I found this fine tuning command on the Berkeley site:
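It is presumably the one from the flickr-style finetuning example; something like this (the solver and weights paths below are the example's own, not mine):

```sh
# From the Caffe finetuning example -- adjust the solver and weights
# paths to your own files before running.
./build/tools/caffe train \
    -solver models/finetune_flickr_style/solver.prototxt \
    -weights models/bvlc_reference_caffenet.caffemodel \
    -gpu 0
```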
I can easily modify the paths and filenames, but is it the right command to use? With the wiki-art data set, how exactly do I convert it to the lmdb files that I need? This lmdb part is probably the most confusing part of neural networks for me, because I have not found any guides that let me make sense of what exactly I have to do. And @htoyryla, if possible, could you post the lmdb files and mean files you made from the wiki-art data set for me to download?
So I tried to fine-tune the VGG16 SOD model on the CIFAR10 data set, and received the following error:
I was also using this solver.prototxt: https://github.com/ruimashita/caffe-train/blob/master/vgg.solver.prototxt and htoyryla's train_val.prototxt. I get the same error on the normal VGG-16 model:
I took the Cubo-Futurism jpg files from the People-Art data set. I then tried, and failed, to successfully create the val and train lmdb files.
You get the error because my training VGG16 prototxt (and any imagenet-based prototxt) expects 256x256 images (which are then cropped according to the prototxt to 224x224), while CIFAR10 images are 32x32.
I can help with LMDB and prototxt, but for a few days I am terribly busy with other things and mostly not even near a computer.

An LMDB is created using a script like caffe/examples/imagenet/create_imagenet.sh, but the script usually needs to be adjusted for paths etc. It can take some time to get used to it and get everything to match, so that the script finds the train.txt and val.txt files as well as the images referred to in them and the image sizes are correct; then it creates two LMDB files. Then you calculate the mean images based on the LMDBs using caffe/examples/imagenet/make_imagenet_mean.sh (or something like that). Then modify the training prototxt to point to your LMDBs and binaryproto files, and make sure the solver.prototxt points to the correct training prototxt.

The train.txt and val.txt for the LMDB creation contain lines like `path_to_an_image label`, where label is an integer from 0 to number_of_categories - 1. The handling of paths can be a bit tricky: they are relative to paths set in create_imagenet.sh, and it took me some time to get the paths right.

This is all I can contribute right now. After a few days I will have more time to respond. I am not sure if I still have my wikiart LMDB; I have other LMDBs, but they are usually quite large files.

PS. See also the Caffe imagenet example for the LMDB part (never mind if the page talks about leveldb instead of lmdb, it is an alternative option): http://caffe.berkeleyvision.org/gathered/examples/imagenet.html
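In outline, the command sequence is something like this (a sketch; all paths are placeholders and the scripts must be edited first):

```sh
# Sketch of the Caffe LMDB + training workflow; adjust all paths to your setup.
cd $CAFFE_ROOT

# 1. Build train/val LMDBs from train.txt and val.txt (edit the paths inside first)
./examples/imagenet/create_imagenet.sh

# 2. Compute the mean image from the training LMDB
./examples/imagenet/make_imagenet_mean.sh

# 3. Train (or finetune) using your solver, which points at your train_val.prototxt
./build/tools/caffe train -solver models/my_model/solver.prototxt
```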
So I have my images at:
Full list of the folders containing images:
Each folder of images has a "gt.txt" file. This is what the gt.txt file looks like: https://gist.github.com/ProGamerGov/2339b815b9e462cb69cd5bb7d156ee9a Though I believe this may be part of the Cross-Depiction aspect of the data set. My train.txt and val.txt are at:
train.txt: https://gist.github.com/ProGamerGov/1be5afe398c825cfc3ea119005af71fb My create_imagenet.sh file: https://gist.github.com/ProGamerGov/5f92bdc8e7d83756268f438cf15261eb
The prototxt of the model I want to fine tune has
I then run:
This creates two folders:
Inside both folders are
Trying to run the script again results in this:
This is the readme.txt that came with the data set: https://gist.github.com/ProGamerGov/dfc8652f3db5bc91acdf34ff22c86bd2 I am not exactly sure what is causing my issue, but could it be that the script is not accounting for the structure of my data set?
You need to put all the information into train.txt and val.txt. That is where caffe expects to find the urls and the labels. Like this:
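For instance (the file names and labels here are made up):

```
images/painting_0001.jpg 0
images/painting_0002.jpg 0
images/sculpture_0001.jpg 1
```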
" A total of 0 images." means that caffe does not find the image files. Setting the paths in the train.txt versus create_imagenet.sh can be a bit confusing. Unfortunately I don't have the script file for wikiart anymore. But I think what worked for me was to use full path in the train.txt and set the paths in the script as follows:
The root paths are set to / because the train.txt contains full paths. It should also work to set the data root path to the image directory and use relative urls in the txt files, but I remember having some difficulty with that. I usually write small python scripts to manipulate or create the txt files in the correct format. For my geometrical shapes test I had image files named rect000001.png, ellipse000001.png and so on, and I wrote a python script like this:
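Something along these lines (a sketch; the original script is lost, and the counts and paths are placeholders):

```python
# Minimal sketch: emit "filename label" lines for an image set whose
# class is encoded in the file name. Counts and paths are placeholders.
classes = ["rect", "ellipse"]  # labels 0, 1, ...
n_images = 1000                # images per class

for label, name in enumerate(classes):
    for i in range(1, n_images + 1):
        print("/home/me/shapes/%s%06d.png %d" % (name, i, label))
```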
and redirected the output into train.txt. Nothing fancy, but it worked.
You might have a problem with your caffe installation, too, as you had this error message:
I haven't seen this. As far as I understand, this library is for FireWire connections, which should not be needed. I found this on Google: https://kradnangel.gitbooks.io/caffe-study-guide/content/caffe_errors.html
I just used this trick to fix my train and val files quickly: https://stackoverflow.com/questions/11003761/notepad-add-to-every-line
libdc1394 is for video camera use and not critical to Caffe as far as I understand. I have disabled it a few times and everything still works fine.
Perhaps you can manage with Notepad, but for Wikiart, for instance, I think I created the txt files from a downloaded csv file which had all the paths and labels, but not in the correct format. Also, once I needed to change the label numbering to start from zero instead of one.
One more thing if you are planning to finetune: you should change the dimension of the fc8 layer (assuming you are training a VGG) to match the number of categories in your dataset. Also, change the name of fc8 to something else, so that caffe will not try to initialize the weights from the original caffemodel, which would fail because of the size mismatch. It is typical to use a name like fc8-10 if you have ten categories. Like this in the training prototxt:
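A sketch for ten categories (the layer type and the bottom/top names follow the usual VGG training prototxt):

```
layer {
  name: "fc8-10"        # renamed from fc8 so the weights are re-initialized
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8-10"
  inner_product_param {
    num_output: 10      # = number of categories in your dataset
  }
}
```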
The changes to my create_imagenet_2.sh file, val.txt, and train.txt (https://gist.github.com/ProGamerGov/8267d29262f1bd6570e5918719600695) still result in the same error.
@htoyryla Thanks, I'll make the modifications to my train_val.prototxt.
Changing the fc8 layer will not solve the LMDB creation problem. That is a separate issue, which you'll face once you have the LMDB and start finetuning.
I still don't see the labels in your train.txt, only the image paths.
For the labels, do I put a different number value for each category?
Yes, the labels should be integers from 0 to number_of_categories - 1, as I wrote earlier. During training, caffe will feed each image into the model and, as there is an output for each label, train the model to activate the correct output for each image. Without the labels, there is nothing to guide the training and the model will not learn anything. Also, if all images have the same label, the model simply learns to always output that label regardless of the image, so it will not learn anything about the images. It is only when the labels tell something essential about the images that meaningful learning is possible.
Ok, I think I got it now. Change the
train.txt and val.txt both have to conform to this format. They also should not include the same files, as val.txt is used to crosscheck that the model really learns to generalize and does not simply remember the individual images. I usually first make a train.txt containing all images and labels, and then use a script to move every tenth entry to val.txt. I might first make very short txt files to test whether the lmdb creation succeeds. There may still be an issue in create_imagenet.sh, too; I have sometimes struggled with the paths, where everything looked ok but 0 images were found, until suddenly, after changing something back and forth, it worked.
I didn't understand your "Then change it to fcpa_43". It should be enough to change the name to fc8_43, so that the layer name is not fc8, which is in the caffemodel you will finetune.
@htoyryla Ok, thanks for the help!
So I successfully created the lmdb files! https://gist.github.com/ProGamerGov/d0038f7e3186d057bb7b26398bd764f9 It seems that a few of the images listed in the train.txt and val.txt files did not exist in the actual data set.
It happened to me too, now that you mention it. Many (most?) datasets do not contain the actual images, only links for downloading them from the original location. Probably some wikiart urls no longer work, so those files don't get downloaded. It is like broken links, not unusual on the internet.
From previous testing, I found this interesting: https://i.imgur.com/XHg8CPA.jpg
@htoyryla The difference between "layers" and "layer" is that "layers" is the outdated version of the prototxt format. You can use upgrade_net_proto_text to update the prototxt file to the newer version.
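Usage is roughly like this (the file names are placeholders):

```sh
# Upgrade an old-style "layers" prototxt to the new "layer" format;
# input and output file names are placeholders.
./build/tools/upgrade_net_proto_text old_train_val.prototxt upgraded_train_val.prototxt
```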
I seem to have figured out how to change the output in a neutral manner that only affects the seed value in Neural-Style, by fine-tuning the VGG-16 SOD Finetune model. Interestingly enough, my data set was composed of art produced by neural networks. Edit: On closer inspection, it appears that the differences between the original and the fine-tuned version are in the smaller details. I only ran it for 600 iterations, as I have to use AWS spot instances for this kind of thing, but it looks like the newly fine-tuned model produces more intricate details than the original model. If I have achieved settings that result in an almost neutral change, then I can now theoretically change single parameters, target layers, etc. to achieve better artistic outputs.
So targeting specific layers seems to produce different outputs that are not worse than the original model's outputs. I really wish I had the resources to fully flesh this out, as it looks really promising for enhancing Neural-Style's outputs. I think that by targeting different combinations of the default layers that Neural-Style uses, one can improve the model's ability in specific areas with the proper data set.
This prototxt here has been configured to stop learning on all layers by default: https://gist.github.com/ProGamerGov/1514d74dc6b799389875ce1764c1a12e I was using the VGG16_SOD_finetune model: https://gist.github.com/jimmie33/509111f8a00a9ece2c3d5dde6a750129 And I ran You can enable learning on your layer of choice by changing the following lines on the desired layer:
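In the frozen state, the param blocks of a layer look like this (a sketch following the usual Caffe convention, not the literal contents of my gist):

```
param {
  lr_mult: 0    # weights frozen
  decay_mult: 0
}
param {
  lr_mult: 0    # bias frozen
  decay_mult: 0
}
```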
To:
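Something like this, where the bias values follow the common Caffe defaults (again a sketch):

```
param {
  lr_mult: 1    # weights learn
  decay_mult: 1
}
param {
  lr_mult: 2    # bias learns (bias usually gets twice the learning rate)
  decay_mult: 0
}
```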
The learning-related values are from this Caffe guide for training certain layers exclusively: https://github.com/BVLC/caffe/wiki/Fine-Tuning-or-Training-Certain-Layers-Exclusively Another note: the edge detection abilities of the model do not seem to be positively or negatively affected by this layer-specific training. I can also provide my two-category Deepart.io and Ostagram data set, which contains approximately 3000 images for each of the two categories, if you want.
crowsonkb's style_transfer has an updated Amazon AMI, which has the latest version of Caffe already installed.
It looks like training a specific layer, or the default Neural-Style layers, requires a lot longer training time before major differences between the original and fine-tuned model become noticeable. Here are the results from some small-scale experiments I ran using the newly found neutral training parameters on the upgraded model and prototxt files: https://i.imgur.com/k0jxvtv.png
So, just in case I am making the wrong assumptions, as per the prototxt file and Neural-Style's default layer-related settings, the
Or is Neural-Style using the part below each "conv" layer, which has "relu" instead of "conv"? Example of the prototxt layout:
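(A reconstructed sketch of the VGG-style layout in question; the values are the standard conv1_1 parameters:)

```
layer {
  name: "conv1_1"
  type: "Convolution"
  bottom: "data"
  top: "conv1_1"
  convolution_param {
    num_output: 64
    pad: 1
    kernel_size: 3
  }
}
layer {
  name: "relu1_1"
  type: "ReLU"
  bottom: "conv1_1"
  top: "conv1_1"   # in-place: writes over the conv output
}
```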
The prototxt I was using can be found here: https://gist.github.com/ProGamerGov/1514d74dc6b799389875ce1764c1a12e
I am not fully sure I understand your question, especially when you say 'is Neural-Style using the part below each "conv" layer which has "relu" instead of "conv"'. Below in the sense of "below in the prototxt file" or "in a lower layer"? But never mind. ReLU is really nothing more than an add-on function on top of a conv layer which sets all negative values to zero; this is why it is also called a rectifier. So in theory convx_y can output both negative and positive values, but after relux_y all negative values have been replaced by zeros. Furthermore, this discussion #93 hints that in an implementation such as Torch, the ReLU layer is actually performed in-place, which I read to mean that the ReLU directly modifies the memory containing the output of the conv layer. If this is true, then there is actually no difference whether one uses conv or relu layers in neural_style; the ReLU function is applied anyway, even if you access the conv layer.
@htoyryla You are correct that ReLU is performed in-place in Torch, so after a forward pass it doesn't matter whether you pick a conv layer or its associated ReLU layer; they will both have the same value. However, there will be a difference during the backward pass: when you backprop through a ReLU layer, the upstream gradients will be zeroed in the same places the activations were zeroed during the forward pass; if you ask neural-style to work with a conv layer, then it will not backprop through the ReLU during the backward pass. This means that when you ask neural-style to use activations on a conv layer, ReLU gets used during the forward pass but not during the backward pass, so the backward pass will not be correct in this case. You can still get nice style transfer effects even when the gradients are incorrect in this way, but for this reason I'd generally expect better results using ReLU layers.
@jcjohnson, good point, I did not think about the backward pass.
I suspect that image quality affects training accuracy. This research paper seems to show the effects of image quality on training neural network models: "Understanding How Image Quality Affects Deep Neural Networks"
I recently trained a NIN model on a roughly sorted custom data set of about 40,000 faces. There appear to be direct improvements in how the model handles faces as content images, but style images which do not contain faces do not work as well. I think that if one could train the model on artwork in addition to common content images, it would help the model understand both.
I have sometimes thought about using two models, one for style and one for content, both trained with limited material. I don't know if it would work, though, and memory usage would certainly be a problem. Yet it could be an interesting exercise.
@htoyryla That idea could be made more resource-efficient by using two small NIN-like models, each trained on a single target category.

So it turns out that, at least for the NIN model, it still has the knowledge required for style transfer in addition to the newer face-related knowledge that I gave it. The unmodified NIN model is on the right, and the fine-tuned NIN model is on the left: I used a DeepDream project based on Neural-Style to try to determine why things had changed in the modified NIN model. Below are the DeepDream layer activation tests for all 29 layers used by the NIN model: The original model: The modified model: These DeepDream images helped me figure out that by simply changing the

The NIN model itself that I created had 15,700 iterations during training, and seemed to maintain 86-96% accuracy during the last couple thousand iterations. With around 40k training images, I calculated that around 24-25 epochs occurred during the training session? I also stopped the training at 11,600 iterations in order to lower the learning rate so that the loss would continue going down. I'm not sure if I was over-fitting the model, but it seemed to have improved abilities on an image that was not part of the training data set.

After the NIN experiments, I attempted to fine-tune a VGG-16 model on my rough faces data set. It's a lot slower to fine-tune VGG-16 models than it is to fine-tune NIN models. From iterations 1000 to 8000, it seems that the model is actually improving in its ability to recognize facial features: The output from the non-fine-tuned SOD_FINETUNE model can be found here: https://i.imgur.com/wWtWysT.png Obviously, for my experiments I used the exact same parameters, seed values, etc. to eliminate other things that might cause different outputs. An album with the full versions of the images I posted in this comment can be found here:

Edit: To clarify, the VGG-16 model that I fine-tuned is called the "VGG-16 SOD Finetune" model. The "finetune" in the original model's name is there because it was fine-tuned from the regular VGG-16 model for salient object detection. I have now fine-tuned this previously fine-tuned model with a new data set.
Trying to train a NIN model from scratch with my data set did not work; it only produces blurry style transfer images and broken DeepDream images. Maybe there are certain classes that help the model learn other classes? Or maybe I just chose bad training parameters?

Edit: Analyzing the training loss (I don't know what graphing tool to use), it appears that for the NIN model trained from scratch the loss decreased quickly and then stayed constant. For the fine-tuned NIN model, the training loss dropped quickly and then seems to have decreased very slowly or stayed the same. Still, it must have worked better than when I tried to train from scratch, seeing as it does appear to have better facial feature detection abilities. The fine-tuned SOD model has the loss dropping continuously over time, which I imagine is what one should expect with good training parameters. So I think the results from my fine-tuned NIN model are questionable and need better training parameters, but the VGG-16 SOD model seems to actually be improved in a way that is appropriately reflected in the loss values.

Second Edit: After some more testing of my fine-tuned SOD model, it appears that I may have actually improved the model with very little change to its other abilities. It now deals more accurately with faces, and possibly other parts of the human body (the upper portion, I think). I wonder whether the "roughly sorted" nature of my data set helps or weakens the model's new abilities.
The training loss graphs seem to support my results. The NIN model trained from scratch is on the left, and the fine-tuned NIN model is on the right: The fine-tuned VGG-16 model: I think using a larger batch size (64 instead of fewer than 10) compared to earlier experiments is part of the reason for this recent training success.
I think I might be onto something here, as my fine-tuned model appears to be better at facial feature preservation. An album with the full images can be found here: https://imgur.com/a/tArrY It looks as though my fine-tuned model more accurately detects the eyes and mouth of the person in the photo.
The solver.prototxt file and the train_val.prototxt can be found here: https://gist.github.com/ProGamerGov/2bdf7659ee14dac03269a3ec3a7f1fcd
ImageMagick seems to be slow for resizing large data sets of images (especially when using the
You can get Parallel via your package manager. Source: https://stackoverflow.com/questions/26168783/imagemagick-convert-and-gnu-parallel-together
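For example (a sketch; the paths and the 256x256 geometry are placeholders):

```sh
# Resize every jpg in parallel, one ImageMagick convert per file;
# "!" forces the exact geometry, {/} is the input file's basename.
mkdir -p resized
find . -maxdepth 1 -name '*.jpg' | parallel convert {} -resize 256x256! resized/{/}
```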
I uploaded the Rough Faces model and added a link to download it on the alternative models wiki page: https://github.com/jcjohnson/neural-style/wiki/Using-Other-Neural-Models Hopefully it can help those seeking better facial preservation with Neural-Style.
I did test your content image with my fine-tuned model, and I think the issue may be that the "rough faces" training data was not very diverse, and as a result it performs best with certain images. My example image for testing was also part of the training data, so that may skew the results (though I did test it on other images that I think were not part of the training data).
I was looking through my old experiments, and I see that I didn't actually share the two successfully fine-tuned models that I had created. One model in particular (the "Plaster" model) creates a very different output than the non-fine-tuned version. Some experimentation with parameters may be required to achieve satisfactory results, as with all the models I trained and fine-tuned I only tested them in Neural-Style with certain parameter values. I'm not sure if the "Low Noise" model differs from the non-fine-tuned model in a way that is useful for certain styles like the "Plaster" model is, so it can be removed if it's not useful. I posted the models on the wiki page here: https://github.com/jcjohnson/neural-style/wiki/Using-Other-Neural-Models Seeing as both models are from 2016, I am going to test them with a bunch of more "modern" Neural-Style parameters, like setting the TV weight to 0, using the Adam parameters I discovered in addition to L-BFGS, and using multiscale resolution.
I want to change the number of iterations. Where do I have to change it?