Are there any available models to evaluate? #1
Hi JH-Lam. There is currently only a single checkpoint file available on the GitHub release tab, but this model is not really suitable for your purpose. Since I don't have good hardware to train the models, I only did a short training run to see whether the model converges. So if you want to evaluate the model, you have to train your own.
Hi @michael-mueller-git, thanks, and sorry that I did not notice that link. I'll dive into it later. I had considered some approaches before, e.g. calculation based on landmarks, but I think regression should be the easiest one, since there are so many complex patterns in porn scenarios. Any clues?
I agree with you on that. This was also my consideration when I developed the regression models. Additionally, you get the advantage that you don't need special labels for training, because existing funscripts + videos can be used as training material for these model types. The main problem of model 1 is that a CNN does not encode the relative position of the different features, and therefore large receptive fields are required to capture long-range dependencies within an image. Increasing the size of the convolution kernels or increasing the depth of the model can increase the representational capacity of the network, but this also loses the computational and statistical efficiency achieved by using a local convolutional structure. I think we need some sort of self-attention to solve this problem (see the sketch below). See also: Transformers in Vision: A Survey and ConViT.
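For illustration, a minimal sketch of what adding self-attention on top of a small conv backbone could look like. All layer sizes here are illustrative assumptions, not model 1's actual architecture:

```python
import torch
import torch.nn as nn

class ConvWithSelfAttention(nn.Module):
    """Sketch: small conv backbone, then one self-attention layer
    so spatially distant features can interact directly."""

    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.head = nn.Linear(channels, 1)      # single regression value

    def forward(self, x):                       # x: (B, 3, H, W)
        f = self.backbone(x)                    # (B, C, H', W')
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)   # (B, H'*W', C) spatial tokens
        attended, _ = self.attn(tokens, tokens, tokens)
        pooled = attended.mean(dim=1)           # average over all positions
        return self.head(pooled)                # (B, 1)
```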
Yep. It may be better to use an RNN than a CNN for sequence (temporal) data, though I am not familiar with RNNs.
First of all, it's nice to see the progress you've made.
If label is the output (predicted class) of your model, then this will not produce proper results, because you have multiple correct classes assigned to an output. E.g. label 50 belongs to [43, 44, .. 50, .. 46, 57] and 51 belongs to [44, 44, .. 50, .. 46, 58], ... They overlap!
I think the many negative images do not influence the result as much as the extremely unbalanced training material between 1..99 does.
The most important thing is that the loss curve on the validation data set decreases during training. (I recommend always using some form of validation with early stopping; a sketch follows below.)
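For example, a minimal early-stopping loop could look like this. This is a hypothetical sketch, not the repo's training code; model, train_loader, val_loader, train_one_epoch and evaluate are assumed to exist:

```python
import torch

# Stop when validation loss has not improved for `patience` epochs.
best_val, patience, bad_epochs = float("inf"), 5, 0
for epoch in range(100):
    train_one_epoch(model, train_loader)       # assumed helper
    val_loss = evaluate(model, val_loader)     # assumed helper
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss stopped improving
```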
The main problem here is that you probably need very large models (which are then correspondingly slow) to get acceptable accuracy, because a CNN does not encode the relative position of the different features, and therefore large receptive fields are required to capture long-range dependencies within an image. Or, directly translated: to determine the length of the toothbrush in an image, the network has to relate features that can be far apart in the frame.
Not in my opinion. You probably won't get satisfactory results with such an approach.
From your description I assume you use something like the approach above. In summary, the main problem with the approach is that it only makes a prediction locally, per frame, but does not take the previously predicted results into account, even though there are usually no jumps in a video (except at video cuts). You could help the model by reintroducing the previously predicted regression values, e.g. by using LSTMs after the feature extraction through a CNN (see the sketch below).
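A minimal sketch of that CNN + LSTM idea; all layer sizes are made up for illustration:

```python
import torch
import torch.nn as nn

class CnnLstmRegressor(nn.Module):
    """Sketch: per-frame CNN features fed into an LSTM so each
    prediction can use the preceding frames of the sequence."""

    def __init__(self, feat_dim=128, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # (B*T, 64, 1, 1)
        )
        self.proj = nn.Linear(64, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, clip):                         # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)                  # (B*T, 3, H, W)
        feats = self.cnn(frames).flatten(1)          # (B*T, 64)
        feats = self.proj(feats).view(b, t, -1)      # (B, T, feat_dim)
        out, _ = self.lstm(feats)                    # (B, T, hidden)
        return self.head(out).squeeze(-1)            # (B, T) regression values
```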
Hi, after I dived into the code 'Regression/util/dataset.py' again, I noticed that:
If yes, it's a good idea to generate per-frame position values, no matter which positions the users set (I used the uploaded funscripts' labels directly before).
Now that you mention the labels, I noticed that I didn't put the script that converts the funscripts to frame labels into git. I have now added it. For each funscript (corresponding to a video), the Python script creates labels for each frame. The labels are simply the regression values for the corresponding frames: the funscript simply gets interpolated for each frame.
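The standard funscript format stores "actions" as a list of {"at": milliseconds, "pos": 0–100} entries, so the conversion could look roughly like this (a sketch, not the actual script):

```python
import json
import numpy as np

def funscript_to_frame_labels(path, fps, num_frames):
    """Linearly interpolate funscript actions so that every video
    frame gets one regression label in [0, 1]."""
    with open(path) as f:
        actions = sorted(json.load(f)["actions"], key=lambda a: a["at"])
    times = np.array([a["at"] for a in actions]) / 1000.0      # ms -> s
    positions = np.array([a["pos"] for a in actions]) / 100.0  # 0..100 -> 0..1
    frame_times = np.arange(num_frames) / fps
    return np.interp(frame_times, times, positions)            # (num_frames,)
```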
Got it, thanks. Some more questions: ...
a) There was a change in ffmpeg 5 which made the code not work anymore. Take a look at ffmpegstream.py in the Python-Funscript-Editor repository, where I have fixed the issue. (I should probably use a git submodule in the future so that you don't have to copy it everywhere.)
Hi Michael, I'm here again ;) The core thing I've run into currently is: ... Thanks
Nice to see that you are still working on a funscript predictor model. I have currently stopped working on it due to other projects. Theoretically you could remove ... Yes.
Yep, but I just mean that it cost so much that it wasn't reasonable. I think the item() function will only propagate the loss from GPU back to host memory. Additionally, a lot of training code uses the loss to draw the learning curve, so it should not really be the pain point.
I think I have found the direct cause: the backpropagation complexity of the model. When I did transfer learning today, I found that fine-tuning (unfrozen weights) is much slower than a fixed feature extractor (frozen weights), since the former involves much more computation w.r.t. backpropagation. This is consistent with the observation that commenting out loss.backward() costs much less. But I can't find the root cause: when I dived into the source of loss.item(), I was taken to ttk.pyi, which is a stub file, not the implementation. Do you know how to find that? Thanks
If you want to dig deeper, you have to look at the source code of torch. But for debugging you probably need to compile it with debug flags.
I think I got somewhere yesterday. Basically it's a lazy execution mechanism: if net(), backward() and loss.item() are all enabled, then the main cost (time consumed) shows up in loss.item(); otherwise, if I comment out loss.item(), the total cost is scattered across net() and backward(). One more thing: when I decrease num_hidden from 64 to 8, I get about a 5x speedup. Note that this number is not the real number of hidden units, since it gets multiplied by a fixed constant in the code, so it is very large in fact. But the performance (loss) seems not to drop. So what was the idea behind designing such a complex model? Did you do it yourself from scratch? Thanks, and sorry if some words sound uncomfortable; that is not my intended meaning.
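This "lazy" behavior is most likely CUDA's asynchronous execution: the kernels for net() and backward() are only queued, and loss.item() must copy the scalar to the CPU, so it blocks until all queued GPU work has finished and gets "charged" for that time. A small sketch to illustrate (sizes are made up, and it assumes a CUDA device is available):

```python
import time
import torch

device = "cuda"
x = torch.randn(256, 4096, device=device)
w = torch.randn(4096, 4096, device=device, requires_grad=True)

t0 = time.perf_counter()
loss = (x @ w).pow(2).mean()   # kernels are only *queued* here (async)
loss.backward()                # still queued
t1 = time.perf_counter()       # looks fast: GPU may still be working

val = loss.item()              # forces a sync; waits for all queued work
t2 = time.perf_counter()

torch.cuda.synchronize()       # the honest way to bound GPU timings
print(f"queued in {t1 - t0:.4f}s, .item() waited {t2 - t1:.4f}s")
```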
We try to extract complex features from video, so the idea was to use a complex model to detect these features.
Not from scratch, but I designed the model architecture. Since I have little experience in this area, I do not know if this is a good design! Additional note: have you tried to call loss.item() only once every few iterations?
Oh sorry, by 'from scratch' I meant: did you make the models (esp. model 1) without any published papers/references? I think there will be no change if loss.item() is called once per 10 iterations, since, as stated above, it costs about the same even if I comment loss.item() out entirely; that is, 'fp + bp' actually costs the same as 'fp + bp + loss.item()'.
I have read through a number of papers but have not been able to adapt any of them to create a regression. Accordingly, I have tried my own ideas.
Note: today I think that the architecture of model 1 is not useful for a reliable prediction of funscript points.
Well, thanks. I'll have a look. I learned (basic) RNNs and trained model 1 these days, and found that the training loss can converge (even to the 0.00xx scale), but the validation loss (the validation set is split from one video) stays high, say 0.0yy. Yes, it was overfitting, but it's hard to do better even though I introduced early stopping, reduce-LR-on-plateau, etc. I have no idea what the root cause of this problem is, and I'm tired. On the other hand, I have another candidate solution: face/hand-based detection to calculate the amplitude. But I think there are also some difficulties there, e.g.:
Q1: How to make a stable prediction? I think it's possible to derive it from the moving distance divided by the diagonal length of the image, but that is hard to do in practice.
Q2: What to do in a scene without hands or faces, only a hip with a dick? I think there may be a function in OpenCV to derive the similarity between two images, but what is the direction? Plus or minus relative to the last frame...
One more thing: do you have Twitter/WeChat for more convenient discussion, maybe about other techniques and not only this topic?
Unfortunately, I never fully implemented these models; a part is still missing. The problem with models 2+3 is that training transformer models from scratch is even more time-consuming, because they usually require more data to converge. A problem shared by all the prediction models is that I have not implemented any randomness in the data loader, e.g. Gaussian noise, random transformations of frame sequences, ..., to improve generalization (see the sketch below).
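For reference, a minimal sketch of what such data-loader randomness could look like; the noise level and flip probability are illustrative values, not tuned ones:

```python
import random
import torch

def augment_clip(clip):
    """Augment a frame sequence (T, 3, H, W) with values in [0, 1]:
    Gaussian pixel noise plus a random horizontal flip applied
    consistently to the whole sequence."""
    clip = clip + 0.02 * torch.randn_like(clip)   # Gaussian noise
    if random.random() < 0.5:
        clip = torch.flip(clip, dims=[-1])        # flip width axis, all frames
    return clip.clamp(0.0, 1.0)
```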
You probably mean optical flow. I have already tried this and actually achieved useful results with it if you only want to determine the turning points. To determine the points, I simply performed a principal component decomposition with 2 components that reflect the movement of the two people in the video. As you have already indicated, it is difficult to determine the height of the movement here, because this information is not directly available with this approach.
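Roughly, such a pipeline could look like this. The resolution, Farneback parameters, and feature construction here are illustrative assumptions, not the exact code I used:

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def motion_components(video_path, n_components=2):
    """Dense Farneback optical flow per frame pair, each flow field
    flattened into a feature vector, then PCA over time. Extrema of
    the dominant component approximate the turning points."""
    cap = cv2.VideoCapture(video_path)
    _, prev = cap.read()
    prev = cv2.cvtColor(cv2.resize(prev, (160, 90)), cv2.COLOR_BGR2GRAY)
    features = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, (160, 90)), cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(
            prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        features.append(flow.reshape(-1))   # flatten the (H, W, 2) field
        prev = gray
    cap.release()
    return PCA(n_components=n_components).fit_transform(np.array(features))
```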
No, unfortunately not.
Hi, thanks for your awesome effort. I'm now trying to run predictions on some videos (Regression module), but found that some of the related models are not there. Could you please share some models so I can verify their performance (training data is optional)? Thanks in advance, and looking forward to your reply!