
CELEB DF #38

Closed
MartaUch opened this issue Apr 17, 2021 · 17 comments
Labels
question Further information is requested

Comments

@MartaUch

Hello,
I'm trying to create a model for the Celeb dataset and I have a problem with creating the indexes. I'm not sure what the file "List_of_testing_videos.txt" should contain. I only gave two paths: one to the folder containing the two folders with synthesis and real videos, and the second one for saving the video DataFrames.

Thank you in advance

@nicobonne
Member

nicobonne commented Apr 17, 2021

You probably downloaded the wrong version of Celeb-DF v2, or some file is missing. That file should be in the root folder of the dataset, as pointed out on their official GitHub page: https://github.com/yuezunli/celeb-deepfakeforensics
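For reference, the extracted dataset root (the folder you will later pass as --source) should look roughly like this; the tree below is only a sketch based on the folder names used elsewhere in this thread, not an exhaustive listing:

```
Celeb-DF-v2/
├── Celeb-real/
├── Celeb-synthesis/
├── YouTube-real/
└── List_of_testing_videos.txt
```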

@MartaUch
Author

Right, I hadn't noticed that this file was missing from my folder. I copied only the videos to my other computer a few months ago, without this one file. Thank you very much :)

@MartaUch MartaUch reopened this Apr 18, 2021
@MartaUch
Author

MartaUch commented Apr 18, 2021

Hi again, I have another, more general question. I wouldn't like to create the index and extract faces from all the videos in the Celeb dataset.
An example of my dataset: 200 Celeb-real, 200 YouTube-real and 400 Celeb-synthesis videos. When I tried to do that, the error 'index out of bounds' appeared and I'm not sure what I can do in this situation. Do you know what might be the cause of this error?
[screenshot: 'index out of bounds' error]

@CrohnEngineer CrohnEngineer added the question Further information is requested label Apr 18, 2021
@CrohnEngineer
Collaborator

Hey @MartaUch ,

have you tried playing with the parameter '--num' in extract_faces.py?
It allows you to choose how many videos to process from the dataset, so you can just extract faces from 800 videos (400 real, 400 fake) for instance.
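For example, something along these lines (a sketch only; the paths are placeholders and the remaining arguments are the ones shown in make_dataset.sh):

```
python extract_faces.py --source /path/to/celeb_df --videodf /path/to/celebdf_videos_df.pkl --num 800 ...
```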
Bests,

Edoardo

@MartaUch
Author

MartaUch commented Apr 18, 2021

Hi @CrohnEngineer,
No I haven't, but I will try. When I set --num to 800, should --source then point to the folder containing all the videos from the first step (creating indexes), or to the folder with the 800 videos? I'm wondering because I'm not sure whether the proper videos will be chosen when --source points to the full set.

When I set --source to the folder of 800 videos, this error appeared: 'FileNotFoundError: Unable to find C:\Users\seminarium\Marta_U\praktyka\DataSet\Celeb_DF_model6\Celeb-real\id33_0000.mp4. Are you sure that C:\Users\seminarium\Marta_U\praktyka\DataSet\Celeb_DF_model6 is the correct source directory for the video you indexed in the dataframe?'

@CrohnEngineer
Collaborator

Hey @MartaUch ,

just a quick check to verify that you made all the necessary steps (and remember, please always refer to the scripts if you have any doubt about how to use our code: in this case we will look at make_dataset.sh; a sketch of the two calls follows the list below):

  1. Have you indexed Celeb-DF with index_celebdf.py? You need to pass as arguments --source, the directory where the dataset is stored, and --videodataset, the path where to store the resulting DataFrame with all the info on the dataset;
  2. you then need to execute extract_faces.py, giving as required arguments:
    2.1 --source, the directory where the dataset is stored (the same argument as in step 1);
    2.2 --videodf, the path to the DataFrame resulting from index_celebdf.py;
    2.3 --facesfolder, the path to the directory where to store the extracted faces;
    2.4 --facesdf, the path to the DataFrame containing all the info about the extracted faces, and finally
    2.5 --checkpoint, another path to a directory where the intermediate results of the face extraction process are going to be stored.
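As mentioned above, here is a rough sketch of the two calls (paths and file names are placeholders; make_dataset.sh remains the reference for the exact arguments, including any optional ones omitted here):

```
# 1. index the dataset
python index_celebdf.py --source /path/to/celeb_df --videodataset /path/to/celebdf_videos_df.pkl

# 2. extract the faces
python extract_faces.py --source /path/to/celeb_df \
                        --videodf /path/to/celebdf_videos_df.pkl \
                        --facesfolder /path/to/celebdf_faces \
                        --facesdf /path/to/celebdf_faces_df.pkl \
                        --checkpoint /path/to/celebdf_checkpoint
```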

Sorry if I sound niggling, but I'm not 100% sure about your situation. You have a reduced version of the Celeb-DF dataset right?
If that is so, does the file List_of_testing_videos.txt contain the list for the full dataset, or for your small dataset?
In the first case, as you were maybe suggesting earlier (I cannot find the comment anymore, sorry), the index from index_celebdf.py may contain info about the full dataset; when extract_faces.py tries to process a video that is not present in your smaller dataset, it gives an error.
In the second case instead, my guess is that you should be able to perform the steps above and have no problem whatsoever.
Let us know!
Bests,

Edoardo

@MartaUch
Author

MartaUch commented Apr 19, 2021

Hi @CrohnEngineer,

I understand your doubts about my problem, so I will try to explain it better. These were my steps:
1. I ran index_celebdf.py with --source set to the full dataset. I also tried setting this source to my reduced version of Celeb-DF (800 videos), but then I got an error:
[screenshot: error output]
So eventually I set --source to the full dataset. I didn't define the --videodataset argument, because a default value was given.
2. I ran extract_faces.py. In this step I set --source to my reduced dataset (800 videos) and errors occurred, but after a few of them I decided not to stop the program and waited until it finished extracting faces, even with the errors. But as you said, it is probably because of the file List_of_testing_videos.txt, which contains the list for the full set (from the first step), and indeed the errors referred to videos which my small dataset doesn't contain. When the program stopped I noticed that the data saved in the given folders (facesfolder, facesdf, checkpoint) refers to all the videos of my reduced dataset. So even with the errors I reckon this step ended with success :)

If I may, I have one more question about the next step: training. I want to split my dataset into 80% for training and 20% for validation. I understand that --traindb and --valdb should refer to two different folders.

  • Do these folders have to have the same structure (folders: Celeb-real, YouTube-real, Celeb-synthesis) as in the first two steps, or can I just paste the videos and mix them in each folder?

Also, after the training step, my test set doesn't have to contain videos from these 800 videos, does it? Because if this is necessary, then I need to split my dataset differently (for example 70% for training, 15% for validation and 15% for test).

Sorry for so many questions. I'm writing my master's dissertation about deepfakes and I want to be sure of each step that I take. I'm really grateful for your help.

Bests,
Marta

@CrohnEngineer
Collaborator

Hey @MartaUch ,

thank you for all the info, now I have a clearer picture in mind!

  1. I ran extract_faces.py. In this step I set --source to my reduced dataset (800 videos) and errors occurred, but after a few of them I decided not to stop the program and waited until it finished extracting faces, even with the errors. But as you said, it is probably because of the file List_of_testing_videos.txt, which contains the list for the full set (from the first step), and indeed the errors referred to videos which my small dataset doesn't contain. When the program stopped I noticed that the data saved in the given folders (facesfolder, facesdf, checkpoint) refers to all the videos of my reduced dataset. So even with the errors I reckon this step ended with success :)

Perfect, that's good to know :)

If I may, I have one more question about the next step: training. I want to split my dataset into 80% for training and 20% for validation. I understand that --traindb and --valdb should refer to two different folders.

Actually, the parameters --traindb and --valdb of train_binclass.py refer to the splits that are available for training and testing.
Let me explain this better: when you call train_binclass.py, you are required to pass two string arguments that refer to the dataset splits we used for our experiments. You can see them in split.py line 16.
In train_binclass.py line 226 these parameters are fed to the make_split function (which you can find again in split.py), which in turn calls the function get_split_df at line 40: here, starting from the faces DataFrame created by extract_faces.py, we create the splits for training, validation and testing.
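For instance, assuming 'celebdf' is one of the split tags listed in split.py line 16, the call would look something like this (a sketch only; the network choice and the other required arguments, e.g. the paths to the faces DataFrame and folder, are omitted and must be taken from the script's argument parser):

```
python train_binclass.py --traindb celebdf --valdb celebdf ...
```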

Also, after the training step, my test set doesn't have to contain videos from these 800 videos, does it? Because if this is necessary, then I need to split my dataset differently (for example 70% for training, 15% for validation and 15% for test).

This is true, so what I would suggest is to create a 70-15-15 split by modifying the function get_split_df from line 80.
You don't need to create any separate folders; just work with the DataFrame inside that function (a rough sketch follows below).
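A minimal sketch of the idea, assuming the faces DataFrame exposes a 'video' column and a boolean 'label' column marking fakes; both the column names and the 70/15/15 shares here are illustrative, not the repo's exact code:

```python
import numpy as np
import pandas as pd

def split_real_videos(df: pd.DataFrame, seed: int = 41):
    # Collect the real (non-fake) videos and shuffle them deterministically
    real_videos = df[df['label'] == False]['video'].unique()
    np.random.default_rng(seed).shuffle(real_videos)

    # 70% of the real videos for training, 15% for validation, the rest for testing
    n_train = int(0.70 * len(real_videos))
    n_val = int(0.15 * len(real_videos))
    train_orig = real_videos[:n_train]
    val_orig = real_videos[n_train:n_train + n_val]
    test_orig = real_videos[n_train + n_val:]
    return train_orig, val_orig, test_orig
```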
Hope this helps!

Sorry for so many questions. I'm writing my master's dissertation about deepfakes and I want to be sure of each step that I take.

Good luck with your dissertation 💪 and let us know how it goes :)
Bests,

Edoardo

@MartaUch
Author

MartaUch commented Apr 20, 2021

Hi @CrohnEngineer,
Thank you so much for the detailed explanation. I think that now your code is more understandable to me.
After trying to modify the get_split_df function I realised there is another thing I'm not sure about.
Maybe it's a trivial question, but why in line 87 do we set only the number of 'real' videos? What about the fake ones? I'm asking because I'm not sure whether I should set this value only for 'original' videos. For example, my dataset counts 800 videos, so 70% of it (560 = fake + real) is for training, 15% is for validation (120 = fake + real) and the rest for testing. At first I set this value to 560 and then changed the ranges of train_orig, val_orig and test_orig (which I added) a little bit. But I'm not sure about this "num_real_train" variable, and why it is only for 'real' samples.

Bests,
Marta

@CrohnEngineer
Collaborator

Hey @MartaUch ,

Maybe it's a trivial question, but why in line 87 do we set only the number of 'real' videos? What about the fake ones?

We take the fake videos starting from line 93.
We first select only the real videos (lines 90-92), then for each real video we take the fake ones that have been generated from it (with this instruction: split_df = pd.concat((df[df['original'].isin(train_orig)], df[df['video'].isin(train_orig)]), axis=0)).
In your case, I would set the constant num_real_train = 400, since in your dataset of 800 videos half of them are real.
You can then create the train/validation/test splits following your policy.
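A minimal sketch of that pairing step, built directly around the pd.concat instruction quoted above (the function name and the surrounding usage are mine, not the repo's exact code):

```python
import pandas as pd

def build_split(df: pd.DataFrame, orig_list) -> pd.DataFrame:
    # Faces of the real videos listed in orig_list, plus faces of the fakes
    # generated from them (fakes reference their source via the 'original' column).
    return pd.concat((df[df['original'].isin(orig_list)],
                      df[df['video'].isin(orig_list)]),
                     axis=0)

# e.g. train_df = build_split(faces_df, train_orig), and likewise for val_orig / test_orig
```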
Bests,

Edoardo

@MartaUch
Author

@CrohnEngineer,
Right, I didn't understand this line at first, but now it is clear.
Thank you for your help! :)

Bests,
Marta

@MartaUch
Author

Hello,
After training my model I wanted to test it on my dataset, but something weird is happening with my 'traindb' parameter. It seems that the program is splitting its value from 'celebdf' into 'c-e-l-e-b-d-f'. Because of that, my path to the 'weight' parameter is incorrect. Do you know why this is happening?
Bests,
Marta
[screenshot: the 'celebdf' tag split into single characters]

@CrohnEngineer
Collaborator

Hey @MartaUch ,

there was a small bug in the type of argument required by the script for --traindb.
I have fixed it in the last commit; let me know if the issue persists.
Anyway, remember that you can provide the full path for the model to test directly with the --model_path argument (that should be the preferred way to use the script in my opinion).
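Something like this, for instance (a sketch; I'm assuming here that test_model.py is the testing script you are running, and the path is a placeholder for wherever your trained weights were saved):

```
python test_model.py --model_path /path/to/your/trained_model.pth ...
```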
Bests,

Edoardo

@MartaUch
Author

MartaUch commented Apr 25, 2021

Hi @CrohnEngineer,
I used the --model_path argument and it worked. Thank you :)

I've just got my results from the test and I'm a little bit confused, because the number of testing videos was:

  • Real frames: 1913
  • Fake frames: 576
  • Real videos: 60
  • Fake videos: 18

My dataset counts 800 videos and in split.py I split it into 0.7 (train), 0.15 (val) and 0.15 (test). In the training step I got the correct values:

  • Training samples: 17906 ~(560 videos)
  • Validation samples: 3840 ~ (120 videos)

This is a piece of my code from split.py:
[screenshot: modified split.py code]
Do you know why in the testing step only 78 videos were taken into account? It should be 120 (60 real, 60 fake).

I've also calculated the average score for these testing videos and I just want to ask for advice: do these results look so bad because I trained my model for too short a time? I set --maxiter to 100, because I thought it would be enough.
The results look like this:
[screenshots: test results]

Bests,
Marta

@MartaUch
Author

Hello @CrohnEngineer,

I've already figured out why the set of testing videos didn't contain all the videos, especially the fake ones. I think it is because I selected only 400 fake videos, but they don't necessarily correspond to the original videos which my dataset contains. Do you think that might be the reason?

And still, I'm not sure about my results and whether I should train the model longer. Maybe my dataset is too small.

Bests,
Marta

@CrohnEngineer
Collaborator

Hey @MartaUch ,

I've already figured out why the set of testing videos didn't contain all the videos, especially the fake ones. I think it is because I selected only 400 fake videos, but they don't necessarily correspond to the original videos which my dataset contains. Do you think that might be the reason?

That might actually be the reason. You should check that for every FAKE video in your small dataset there exists a REAL counterpart, to be sure that your dataset ends up balanced in all splits.

I've also calculated the average score for these testing videos and I just want to ask for advice: do these results look so bad because I trained my model for too short a time? I set --maxiter to 100, because I thought it would be enough.

And still, I'm not sure about my results and whether I should train the model longer. Maybe my dataset is too small.

Your dataset is indeed quite small, and 100 iterations is definitely a small number for training your model.
Keep in mind that in the context of our paper, iteration = batch iteration, not training epoch: if your batch size is small, your model might not even have seen all the training samples.
It might be best for you to modify the training procedure of train_model.py to handle epochs instead of iterations, but in any case also be careful not to overfit your model on the training data! Since your dataset is not so big, overfitting is right around the corner.
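To give a rough idea of the scale (the batch size of 32 below is an assumption; use whatever value you passed to the training script):

```python
import math

num_train_samples = 17906  # training samples reported above
batch_size = 32            # assumption: check the batch size you actually used
iters_per_epoch = math.ceil(num_train_samples / batch_size)
print(iters_per_epoch)     # 560 iterations for one full epoch; 100 iterations cover only ~3200 samples
```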
Bests,

Edoardo

@MartaUch
Author

Hi @CrohnEngineer,

Thank you very much for your advice. I prepared my dataset once more and now each fake video has its real counterpart. As far as I can see, the split works properly. Now I have 540 real and 540 fake videos. I also set the maximum number of iterations to 550 and I hope it will be enough. To be honest, I think changing the training procedure in train_model.py would be too difficult for me now, so I'll keep it unchanged.
Thank you once more, I do appreciate your help.

Bests,
Marta
