
are there any available models to evaluate #1

Open
JH-Lam opened this issue May 12, 2022 · 25 comments

Comments

@JH-Lam

JH-Lam commented May 12, 2022

Hi, thanks for your awesome effort. I'm now trying to run predictions on some videos (Regression module) but found that the related models are not there. Could you please
share some models so I can verify their performance (training data is optional)?
Thanks in advance, and looking forward to your reply!

@michael-mueller-git
Owner

Hi JH-Lam. There is currently only a single checkpoint file available on the GitHub release tab, but that model is not really suitable for your purpose. Since I don't have good hardware to train the models, I only ran a short training period to see whether the model converges during training. So if you want to evaluate the model you have to train your own.
I think the regression models are currently not the best approach to solve the problem (at least with my currently implemented approach).

@JH-Lam
Author

JH-Lam commented May 14, 2022

Hi @michael-mueller-git, thanks, and sorry that I did not notice that link. I'll dive into it later. I considered some approaches before, e.g. calculation based on landmarks, but I think regression should be the easiest one since there are many complex patterns in porn scenarios. Any clues?

@michael-mueller-git
Owner

I think regression should be the easiest one since there are many complex patterns in porn scenarios. Any clues?

I agree with you on that. This was also my consideration when I developed the regression models. Additionally, you get the advantage that you don't need special labels for training, because existing funscripts + videos can be used as training material for this model type.
Please note that my code currently only handles VR videos. If you have powerful hardware for training, it would be useful to increase the number of ConvLSTM layers in model1.

The main problem of model 1 is that a CNN does not encode the relative position of the different features, and therefore large receptive fields are required to capture long-range dependencies within an image. Increasing the size of the convolution kernels or increasing the depth of the model can increase the representational capacity of the network, but this also loses the computational and statistical efficiency achieved by the local convolutional structure. I think we need some sort of self-attention to solve this problem. See also: Transformers in Vision: A Survey and ConViT.
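As an illustration of that idea (an assumption on my part, not the repository's code), a minimal self-attention block that lets distant regions of a CNN feature map interact directly:

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Self-attention over the spatial positions of a CNN feature map."""
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) feature map from the CNN backbone
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (batch, h*w, channels)
        out, _ = self.attn(seq, seq, seq)       # global spatial interactions
        seq = self.norm(seq + out)              # residual connection + norm
        return seq.transpose(1, 2).reshape(b, c, h, w)

# Example: attention over a 16x16 feature map with 64 channels
features = torch.randn(2, 64, 16, 16)
print(SpatialSelfAttention(64)(features).shape)  # torch.Size([2, 64, 16, 16])
```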

@JH-Lam
Author

JH-Lam commented May 16, 2022

Yep. It may be better to use an RNN than a CNN for sequence (temporal) data, though I am not familiar with RNNs.
In addition, I want to add a binary classification that gives a probability indicating whether the regression output is reasonable or not.
Thanks for the detailed reply.

@JH-Lam
Author

JH-Lam commented Jul 18, 2022

Hi michael, pardon me if it's a bit late to continue the thread here.

During these days I did transfer learning on MobileNetV2 to train a custom regression for porn images (video frames).
My scenario:
-detect how deep a toothbrush has moved into the mouth
-only range labels [0, 100], no other landmarks; 0 means fully moved into the mouth while 100 means out (or the tip of the toothbrush is close to the lip)
-transform the predicted regression scalar into a classification-style accuracy: a prediction counts as correct if and only if it falls within [label-7, label+7] (a small sketch of this metric follows below)
-resize training samples to 224 to fit the MobileNetV2 input, so the brushes are small enough that this amounts to "small object detection"
-the training set is unbalanced and some samples are mislabeled.
(screenshot: histogram of the training label distribution)

(Current distribution: the 0 case is huge, which is caused by the data itself, while the 100 case is large because I added many negative images to the training set. It seems questionable to label negatives with the largest value of the range?)
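For reference, a minimal sketch of the tolerance-based accuracy from the list above (an illustration only, assuming NumPy arrays of labels and predictions in the 0-100 range):

```python
import numpy as np

def tolerance_accuracy(y_true: np.ndarray, y_pred: np.ndarray, k: int = 7) -> float:
    """Fraction of predictions that fall within +/- k of the true label."""
    return float(np.mean(np.abs(y_pred - y_true) <= k))

# Example
labels = np.array([0, 50, 100, 30])
preds = np.array([5, 44, 80, 33])
print(tolerance_accuracy(labels, preds))  # 0.75
```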

Results:
-low prediction accuracy on unseen negative samples (i.e. images that don't include brush actions), about 66%, and even worse on the test set
-train MAE and validation MAE are both close to 18, while MSE is close to 500; this seems very high.

What to do:
a. re-inspect the whole training set and correct the labels
b. coarsely bin the range into ~10 values, e.g. 0, 10, ..., 100, to gain more samples per label value; maybe even turn the regression into a classification over these classes
c. use other models to handle the small-object-detection problem

Questions:

  1. Are the ideas described above viable?
  2. Does MobileNetV2 fit the small-object-detection setting described above in practice?
  3. What factors mainly affect the model accuracy?

thanks

@michael-mueller-git
Owner

michael-mueller-git commented Jul 18, 2022

First of all, it's nice to see the progress you've made.

transform the predicted regression scalar into a classification-style accuracy: a prediction counts as correct if and only if it falls within [label-7, label+7]

If label is the output (predicted class) of your model, then this will not produce proper results, because you assign multiple correct classes to each output. E.g. label 50 accepts [43, 44, ..., 50, ..., 56, 57] and label 51 accepts [44, 45, ..., 51, ..., 57, 58]. They overlap!

Current distribution: the 0 case is huge, which is caused by the data itself, while the 100 case is large because I added many negative images to the training set. It seems questionable to label negatives with the largest value of the range?

I think the many negative images do not influence the result as much as the extremely unbalanced training material between 1 and 99 does.

train MAE and validation MAE are both close to 18, while MSE is close to 500; this seems very high.

The most important thing is that the loss curve on the validation data set decreases while training. (I recommend always using some sort of validation with early stopping.)
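As an illustration of that recommendation, a minimal early-stopping setup in tf.keras (a sketch with a tiny stand-in model and random data, not the actual training code from this thread):

```python
import numpy as np
import tensorflow as tf

# Tiny stand-in model and data just to make the example runnable
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

x_train = np.random.rand(256, 8).astype("float32")
y_train = np.random.rand(256, 1).astype("float32")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the validation loss curve
    patience=10,                 # stop after 10 epochs without improvement
    restore_best_weights=True,   # roll back to the best checkpoint
)

model.fit(x_train, y_train, validation_split=0.2,
          epochs=200, callbacks=[early_stop], verbose=0)
```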

use other models to handle the small-object-detection problem

The main problem here is that you probably need very large models (which are then correspondingly slow) to get acceptable accuracy. The core issue is again that a CNN does not encode the relative position of the different features, so large receptive fields are required to capture long-range dependencies within an image; put concretely, determining how far the toothbrush has moved needs exactly that long-range context.

Are the ideas described above viable?
Does MobileNetV2 fit the small-object-detection setting described above in practice?

Not in my opinion. You probably won't get satisfactory results with such an approach.

What factors mainly affect the model accuracy?

From your description I assume you use something like categorical_crossentropy as the loss. That does not make much sense in this case, because it discards the ordinal distance between the regression values. You could try adding a regression layer to the model and using MSE as the loss, if not already done.
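A minimal sketch of that suggestion in tf.keras (assumptions on my part: 224x224 RGB inputs, labels scaled to [0, 1], illustrative backbone/head sizes, not the actual model from this thread):

```python
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # transfer learning: freeze the backbone first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # single regression output in [0, 1]
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])
model.summary()
```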

In summary, the main problem with the approach is that it only makes a prediction locally per frame and does not take the previously predicted results into account, even though there are usually no jumps in a video (except video cuts). You could help the model by feeding the previously predicted regression values back in, e.g. by using LSTMs after the feature extraction through a CNN.
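For reference, a minimal sketch of that CNN-plus-LSTM pattern (an assumption, not the repository's model1; the backbone here is a tiny stand-in):

```python
import torch
import torch.nn as nn

class CnnLstmRegressor(nn.Module):
    """Per-frame CNN features fed through an LSTM so each prediction sees the preceding frames."""
    def __init__(self, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(                      # tiny stand-in backbone
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, seq_len, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)  # per-frame features
        out, _ = self.lstm(feats)                              # temporal context
        return self.head(out).squeeze(-1)                      # (batch, seq_len) positions

# Example: 8-frame sequences of 64x64 frames
print(CnnLstmRegressor()(torch.randn(2, 8, 3, 64, 64)).shape)  # torch.Size([2, 8])
```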

@JH-Lam
Author

JH-Lam commented Jul 19, 2022

Hi michael, thanks for your quick and detailed reply.
A few clarifications on my comment above:

  1. The term 'toothbrush' is a stand-in for 'blowjob'-like terms, which were not easy to use on the site where I asked before. Sorry, I forgot to restore them here :)
  2. I don't want to use LSTM-related RNN techniques because they seemed slow when I tested your model before. In addition, a long time ago I trained a simple MobileNetV2 (transfer learning, in fact) and it looked good on the validation set (with only one completely correctly labeled funscript video):
    (screenshot: predictions vs. labels on the validation set)
    (It trends well even though it's not perfectly correct. The training set has no negative images.)
  3. Yes, I used regression for the output instead of classification, so I used MSE as the loss while MAE is only a metric in TensorFlow. The statement 'transform the regression to accuracy' means I want an accuracy-like number, not to actually train a classification model. So I use a workaround: if prediction[i] is within [label[i]-k, label[i]+k], it counts as tolerant (correct).
  4. No doubt the validation loss curve is the most important thing, but in my case it seems to struggle to go down any further:
    (screenshot: training and validation loss curves)

Note that the pos values in funscripts are fairly subjective (except 0 and 100), so I often see mislabeled ones in the files. Sometimes she uses a hand, sometimes the mouth, and some of the amplitudes are clearly wrong. Maybe this is another reason it is hard to train well?

  1. "Since there are usually no jumps in a video (except video cuts)" . I think it's necessary to blend the pos which b/w the top and bottom pos even if I use skip-frames detection.
    eg. 0,10,...,100 , then when prediction come in '1,3,5,5, 10,11,20,5,0,...' , the resulted output(by round()) : 0,0,10,10,10,10, 20,10,0 => 0,20,0 . but I think this is tricky by program since there are more complex than this one. eg. big slope curve followed by small slope curve but you can't consider it as exactly one action .do you have any clues?
  2. brainstorm: split sparely the regression range to 10x to do a classification instead is possible if I dont care the trivial changes b/w two ends ?
  3. maybe I need to follow some age-face models to see if some inspirations are there . but I think it's harder to capture the porn action feature than face-texture feature ,right? DL too?

@michael-mueller-git
Owner

  1. This is understandable. LSTMs slow the whole thing down considerably.
  2. As long as this statement is not included in the loss function, and thus in the backpropagation algorithm, you can of course do this without any disadvantages.
  3. OK, the learning curve looks good. One reason why you can't get any further down is definitely the point you mentioned about subjective labels. Have you already tried reducing the learning rate after x epochs? Maybe the backpropagation algorithm will then find a better minimum, but probably this won't improve much more; the impact of the subjective labels is probably too high.
  4. For your model it is irrelevant, because each frame is considered in isolation. That was only a suggestion for the use of a recurrent network architecture, which you do not want to use according to 2.
  5. If you use a classification loss function that takes the distance between the classes into account, this would be possible (see the sketch after this list).
  6. If you mean the face detection models: they predict the absolute position in the frame, which makes the training task much better specified. Instead of labeling the frames you could label the top and bottom point of the relevant region in the frames and use a face-detection-style model architecture for prediction. This would probably work much better than the current approach, and it would be a slightly lighter approach compared to my Mask RCNN in this repository.
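A minimal sketch of one such distance-aware loss (an illustrative choice on my part, not a method from this thread: the softmax expectation over the ordered bins 0, 10, ..., 100 penalized with MSE against the true position):

```python
import torch
import torch.nn.functional as F

def expected_position_loss(logits: torch.Tensor, target_pos: torch.Tensor) -> torch.Tensor:
    # logits: (batch, 11) scores for the bins 0, 10, ..., 100
    # target_pos: (batch,) true positions in [0, 100]
    bins = torch.arange(0, 101, 10, dtype=logits.dtype, device=logits.device)
    expected = (F.softmax(logits, dim=-1) * bins).sum(dim=-1)  # soft position estimate
    return F.mse_loss(expected, target_pos)                    # distance-aware penalty

# Example
logits = torch.randn(4, 11, requires_grad=True)
target = torch.tensor([0.0, 30.0, 70.0, 100.0])
loss = expected_position_loss(logits, target)
loss.backward()
print(loss.item())
```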

@JH-Lam
Author

JH-Lam commented Jul 20, 2022

  1. Yes, I used
     lr_reduce = ReduceLROnPlateau(monitor='val_mae', factor=0.3, patience=5, verbose=1, cooldown=0, min_lr=0.5e-6)
     but no luck; it brought no significant improvement.
  2. There is the fact that even if the predictions are good, many actions are generated by the model and saved into the funscript files:
    (screenshots: generated funscript actions)
    so it's hard to know which actions should be ignored and which not. I think it's reasonable to do nothing about this in a first iteration.
  3. Yesterday I saw a similar case that splits the age range [0, 100] into age groups, say "0-2 -> 0, 3-6 -> 1, ...", to build a real classification model: https://github.com/CVxTz/face_age_gender
  4. I mainly mean the 'small object detection' problem. The age model predicts age from face texture etc., and the texture is indeed small compared to a face, but I think that task is easier than mine, since a face has explicit features to capture, e.g. eyes, nose, mouth. In addition there are many complex scenarios without any face at all, e.g. sitting down and up on a man's body with no face visible. I am not sure whether deep learning agrees with me, though.

@JH-Lam
Author

JH-Lam commented Aug 10, 2022

Hi, after diving into the code 'Regression/util/dataset.py' again, I noticed that:

  1. there are label files corresponding to each video, right?
  2. and these files should be generated from the funscripts uploaded by users, so the actions in each label file are continuous between two frames labeled in the funscript?

If yes, it's a good idea to generate evenly interpolated position values no matter which positions the users set (I used the labels from the uploaded funscripts directly before).

@michael-mueller-git
Copy link
Owner

Now that you mention the labels, I noticed that I didn't put the script that converts the funscripts into the frame labels into git. I have now added it.

For each funscript (corresponding to a video) the Python script creates a label for each frame. The labels are simply the regression values for the corresponding frame, where the funscript is simply interpolated for each frame.
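A minimal sketch of that interpolation (an illustration, not the repository's label script; it assumes the standard funscript JSON layout with "actions" entries holding "at" in milliseconds and "pos" in 0-100, and uses hypothetical names):

```python
import json
import numpy as np

def frame_labels(funscript_path: str, fps: float, num_frames: int) -> np.ndarray:
    """Interpolate funscript actions to one regression value per frame, scaled to [0, 1]."""
    with open(funscript_path) as f:
        actions = sorted(json.load(f)["actions"], key=lambda a: a["at"])
    times = np.array([a["at"] for a in actions], dtype=float)       # milliseconds
    positions = np.array([a["pos"] for a in actions], dtype=float)  # 0..100
    frame_times = np.arange(num_frames) * (1000.0 / fps)
    # linear interpolation of the script positions at every frame timestamp
    return np.interp(frame_times, times, positions) / 100.0

# Example (hypothetical file): labels = frame_labels("video.funscript", fps=30.0, num_frames=5400)
```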

@JH-Lam
Author

JH-Lam commented Aug 15, 2022

Got it, thanks.

Some more questions:
a. Why does it get stuck in ffmpegstream.py at line 202 (projection = np.frombuffer(...)) when I run label.py?
b. Why is the behavior of the training step and the prediction step not consistent? I.e. during training the loss is calculated against every frame of a seq_len sequence, but in prediction the current action's pos is taken from the first value of the prediction. I think both should be consistent, e.g. (pos[1], pos[2], ..., pos[k]) -> (output) pos[k+1], so using only the last output to compute the final loss would be enough.

train.py:
out_pos = model(frames)
loss = criterion(out_pos, pos)

test.py:
x = model(frames_tensor)
x = x.cpu().numpy()[0][0][0]
funscript.add_action(round(x * 100), round(frame_numer * (1000/fps)))

(screenshot: the relevant training/prediction code)

@michael-mueller-git
Owner

a) There was a change in ffmpeg 5 which made the code stop working. Take a look at ffmpegstream.py in the Python-Funscript-Editor repository, where I have fixed the issue. (I should probably use a git submodule in the future so that I don't have to copy it everywhere.)
b) The current implementation only allows batch_size == 1; I think you mean the individual training sequence length. The point is that I use the last seq_len frames for each frame. The idea was to use all of their values for the loss calculation to improve convergence during training, but unfortunately it did not help much. The code is no longer completely consistent due to the experiments, sorry for that. If you are lucky there is still something in the git history, but I think I published the code after my tests.

@JH-Lam
Author

JH-Lam commented Oct 17, 2022

Hi michael, I'm here again ;)
I improved performance a bit these days, e.g. by giving a clear start_frame to seek quickly to the target action, since some uploaded funscripts do not start at the first frame but at an arbitrary one.

The core issue I'm currently hitting is:
train_epoch_loss += loss.item()
It costs about 300 ms on my GTX 1660 Super with model1! I have no idea what actually happens there, but when I comment out 'loss.backward()' it drops to about 100 ms, so it seems to be BP related.

Thanks

@michael-mueller-git
Owner

Nice to see that you are still working on a funscript predictor model. I have currently stopped working on it due to other projects.

Theoretically you could remove train_epoch_loss += loss.item(). But then you would no longer have monitoring during the training and would not know if and how the model improves.

Yes, loss.backward() is BP related and is required to train the model weights.

@JH-Lam
Author

JH-Lam commented Oct 18, 2022

Yep, but I just mean that it costs so much that it doesn't seem reasonable. I think loss.item() only copies the loss from the GPU back to host memory. In addition, a lot of training code uses the loss to draw a learning curve, so it should not be the bottleneck.

@JH-Lam
Author

JH-Lam commented Oct 18, 2022

I think I have found the direct cause: the (BP) complexity of the model. When I did transfer learning today, I found that fine-tuning (unfrozen weights) is much slower than using a fixed feature extractor (frozen weights), since the former needs much more computation for BP. This is consistent with the observation that commenting out loss.backward() costs much less.

But I can't find the root cause: when I dive into the source of loss.item() I am navigated to ttk.pyi, which is a stub file and not the implementation. Do you know how to find it? Thanks.

@michael-mueller-git
Owner

If you want to dig deeper you have to look into the source code of torch. For debugging you would probably need to compile it with debug flags.

@JH-Lam
Author

JH-Lam commented Oct 21, 2022

I think I got somewhere yesterday: it is basically the lazy (asynchronous) execution mechanism. If net(), backward() and loss.item() are all enabled, then the main cost (time consumed) shows up in loss.item(); if I comment out 'loss.item()', the total cost is instead scattered across net() and backward().

One more thing: when I decrease num_hidden from 64 to 8, I get about a 5x speedup. Note this number is not the real number of hidden layers, since it is multiplied by a fixed constant in the code, so it is very large in fact. But the performance (loss) does not seem to drop.

So what is the idea behind designing such a complex model? Did you design it yourself from scratch? Thanks, and sorry if some words sound uncomfortable; that is not my intended meaning.

@michael-mueller-git
Owner

michael-mueller-git commented Oct 23, 2022

So what is the idea behind designing such a complex model?

We try to extract complex features from video, so the idea is to use a complex model to detect these features.

Did you design it yourself from scratch?

Not from scratch, but I designed the model architecture. Since I have little experience in this area I do not know if it is a good design!

Additional notes: have you tried calling loss.item() only e.g. every 10 iterations to decrease the computational cost? I think the main problem is that you cannot measure the direct influence because the code runs asynchronously on the GPU. If an instruction does not fetch a result value back into CPU memory the way loss.item() does, you don't see its actual computation time. Most instructions are just queued in some sort of pipeline for the GPU, and you don't know exactly when they complete unless an instruction needs to transfer a value back to CPU memory, as is the case with loss.item().
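A minimal sketch of that effect (an illustration with a toy model, not the repository's training loop): explicit torch.cuda.synchronize() calls make the measured time attributable to the right operation instead of being absorbed by the next synchronizing call such as loss.item().

```python
import time
import torch

def timed(label: str, fn) -> None:
    if torch.cuda.is_available():
        torch.cuda.synchronize()               # make sure prior GPU work is done
    start = time.perf_counter()
    fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()               # wait for the work we just queued
    print(f"{label}: {(time.perf_counter() - start) * 1000:.1f} ms")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)
x = torch.randn(256, 1024, device=device)

out = model(x)
loss = out.square().mean()
timed("backward", loss.backward)               # true cost now shows up here
timed("loss.item", lambda: loss.item())        # cheap once the queue is drained
```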

@JH-Lam
Author

JH-Lam commented Oct 24, 2022

Oh sorry, by 'from scratch' I mean: did you create the models (esp. model1) without any published papers/references?

I think nothing will change if I call 'loss.item()' only once every 10 iterations, since, as stated above, the cost is similar even if I comment 'loss.item()' out entirely; that is, 'bp + fp' costs about the same as 'bp + fp + loss.item()'.

@michael-mueller-git
Owner

michael-mueller-git commented Oct 25, 2022

By 'from scratch' I mean: did you create the models (esp. model1) without any published papers/references?

I have read through a number of papers but was not able to adopt any of them to create a regression, so I tried my own ideas. Since it was a long time ago, I unfortunately no longer know which papers these were. The only ones I can remember are:

Note: today I think that the architecture of model 1 is not useful for a reliable prediction of funscript points.

@JH-Lam
Author

JH-Lam commented Oct 26, 2022

Well, thanks. I'll take a look.

I learned (basic) RNNs and trained model1 these days. I found that it can converge (even down to the 0.00xx scale), but the validation loss (the validation set is split from one video) stays high, say 0.0yy. Yes, it is overfitting, and it's hard to do better even though I introduced early stopping, reduce-LR-on-plateau, etc. I have no idea what the root cause of this problem is, and I'm tired.
How about model2 and model3?

On the other hand, I have another idea: face/hand-based detection to calculate the amplitude. But there are difficulties here too. Q1: how to make a stable prediction? I think it's possible to derive it from the moving distance divided by the diagonal length of the image, but it's hard to do in practice. Q2: what to do in a scene without hands or faces, e.g. only a hip and a dick? I think there may be a function in OpenCV to derive the similarity between two images, but then what is the direction, plus or minus relative to the last frame?

@JH-Lam
Author

JH-Lam commented Oct 26, 2022

One more thing: do you have Twitter/WeChat for more convenient discussion, maybe about other techniques and not only this topic?

@michael-mueller-git
Owner

How about model2 and model3?

Unfortunately I never fully implemented those models; a part is still missing. The problem with models 2 and 3 is that training transformer models from scratch is even more time-consuming, because they usually require more data to converge.

A problem for all the prediction models is that I have not implemented any randomness in the data loader, e.g. Gaussian noise, random transformations of the frame sequences, ..., to improve generalization.

I think there may be a function in OpenCV to derive the similarity between two images, but then what is the direction?

You probably mean optical flow. I have already tried this and actually achieved useful results with it if you only want to determine the turning points. To determine the points, I simply performed a principal component decomposition with 2 components that reflect the movement of the two people in the video. As you have already indicated, it is difficult to determine the height of the movement here, because this information is not directly available with this approach.
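A simplified sketch of that idea (not the exact method described above: it uses dense Farneback flow, averages the flow per frame, and projects the per-frame motion onto a single principal component; the video path is hypothetical):

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

cap = cv2.VideoCapture("video.mp4")          # hypothetical input file
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
motion_signal = []

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motion_signal.append(flow.reshape(-1, 2).mean(axis=0))  # mean (dx, dy) per frame
    prev_gray = gray

# Project the per-frame motion vectors onto their dominant direction;
# sign changes in this 1-D signal indicate the turning points.
signal = PCA(n_components=1).fit_transform(np.array(motion_signal)).ravel()
```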

On the other hand, I have another idea: face/hand-based detection to calculate the amplitude. But there are difficulties here too. Q1: how to make a stable prediction? Q2: what to do in a scene without hands or faces, e.g. only a hip and a dick?

  • Q1: Maybe use a simple LSTM model and feed it the output of the tracking points as input. For a stable signal you could additionally apply a Kalman filter to the prediction (see the sketch after this list).
  • Q2: Create a simple CNN and train it to classify the poses in the videos (using the tracking points as input data).
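A minimal sketch of the Kalman-filter smoothing mentioned in Q1 (an illustration assuming a constant-velocity model over per-frame position predictions in [0, 1]; the noise parameters are placeholders to tune):

```python
import numpy as np

def kalman_smooth(measurements: np.ndarray, q: float = 1e-4, r: float = 1e-2) -> np.ndarray:
    x = np.array([measurements[0], 0.0])          # state: [position, velocity]
    P = np.eye(2)
    F = np.array([[1.0, 1.0], [0.0, 1.0]])        # constant-velocity transition
    H = np.array([[1.0, 0.0]])                    # we only observe the position
    Q, R = q * np.eye(2), np.array([[r]])
    smoothed = []
    for z in measurements:
        x, P = F @ x, F @ P @ F.T + Q             # predict
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (np.array([z]) - H @ x)       # update with the new measurement
        P = (np.eye(2) - K @ H) @ P
        smoothed.append(x[0])
    return np.array(smoothed)

# Example: smooth a noisy predicted position signal
noisy = 0.5 + 0.4 * np.sin(np.linspace(0, 10, 200)) + 0.05 * np.random.randn(200)
print(kalman_smooth(noisy)[:5])
```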

One more thing: do you have Twitter/WeChat for more convenient discussion, maybe about other techniques and not only this topic?

No, unfortunately not.
