# Deepfake Detection - Dataset Preparation

In this notebook, we will be preparing the dataset for deepfake detection training and testing. The dataset will be a combination of the following datasets:
1. **[Kaggle - Deepfake Detection Challenge](https://www.kaggle.com/c/deepfake-detection-challenge/data)**
2. **[Celeb-DF](https://github.com/yuezunli/celeb-deepfakeforensics)**
3. **[FaceForensics DeepfakeDetectionDataset](https://ai.googleblog.com/2019/09/contributing-data-to-deepfake-detection.html)**

The process will be as follows:
1. **Exploratory Data Analysis on datasets:** Create a dataframe for each dataset and conduct a simple EDA on each of the three datasets. 
2. **Collate datasets from source datasets:** The datasets currently each have their own structure/labelling format. We will be constructing single source dataset containing all videos used used for the training process.
3. **Sampling dataset:** To ensure both classes are balanced.
4. **Extracting faces from the sampled dataset:** Faces are extracted and formed into its own dataset.

## Setup and Imports

In [1]:
from platform import python_version

python_version()

'3.8.13'

In [2]:
import cv2
import json
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import pandas as pd
import pickle
import os
import time

from IPython.display import Video
from pathlib import Path
from tqdm import tqdm

from typing import List

In [3]:
# Folder names/locations
BASE = "../datasets"

KAGGLE_DFDC = "Kaggle-dfdc"
CELEB_DF = "Celeb-DF-v2"
FF_DFDC = "DeepfakeDetectionDataset"

# Create full paths
DATA_KAGGLE = os.path.join(BASE, KAGGLE_DFDC)
DATA_CELEBDF = os.path.join(BASE, CELEB_DF)
DATA_FACEFORENSICS = os.path.join(BASE, FF_DFDC)

DATA_KAGGLE, DATA_CELEBDF, DATA_FACEFORENSICS

('../datasets/Kaggle-dfdc',
 '../datasets/Celeb-DF-v2',
 '../datasets/DeepfakeDetectionDataset')

## Dataset EDA

### Helper Functions

In [4]:
# Helper Functions
def isdir(fullpath: str) -> bool:
    """Returns true if is a file, and false if otherwise (a directory)."""
    try:
        if os.path.exists(fullpath):
            if os.path.isdir(fullpath):
                return True
            return False
    except FileNotFoundError:
        print(f'{fullpath} does not exist.')

def iterate_files(directory: str) -> List:
    """Iterates over the files in the given directory and returns a list of 
    found files."""
    files = []
    for file in os.listdir(directory):
        filename = os.fsdecode(file)
        fullpath = os.path.join(directory, filename)
        if (isdir(fullpath)):
            files += iterate_files(fullpath)
        else:
            files.append(fullpath)
    return files
  
def get_filename(source: str):
    """Returns the filename given a full path."""
    return os.path.basename(source)

### Kaggle Deepfake Detection Dataset

The original Microsoft Deepfake Detection dataset available on Kaggle comprises of .mp4 files containing videos of deepfake videos denoted by the string "REAL" or "FAKE" in the label file. The original dataset is over 470 GB, which is incredibly large and outside the limits of our training. Instead, we will **only be using the train sample videos** available in the Kaggle data explorer section.

The dataset has been kept with the following folder structure:
<!--TODO: Folder structure-->

We will be starting off by exploring the labels contained in the `metadata.json` file that is contained in the `train_sample_videos` folder. The file has the following columns:
- `filename` - the filename of the video
- `label` - whether the video is REAL or FAKE
- `original` - in the case that a train set video is FAKE, the original video is listed here
- `split` - this is always equal to "train".

#### Create dataframe from labels file: `metadata.json`

In [5]:
# Folders for the Kaggle dataset
data1_paths = [
  "dfdc_train_part_0",
  "dfdc_train_part_1",
  "train_sample_videos",
]
label_file = "metadata.json"

# 
label_dicts = []
for path in data1_paths:
  data1_label_path = os.path.join(DATA_KAGGLE, path, label_file)
  with open(data1_label_path,'r', encoding='utf-8') as f:
     file = json.load(f)
  # Add indicator of which folder the data comes from
  for k in file.keys():
    file[k]['path'] = path
  label_dicts.append(file)

data1_dict = {}
# Merge/concatenate dictionaries
for dictionary in label_dicts:
  data1_dict = dict(**data1_dict, **dictionary)

len(data1_dict)

3433

In [6]:
# Preview the first line of the dictionary 
print(next(iter(data1_dict.items())))

('owxbbpjpch.mp4', {'label': 'FAKE', 'split': 'train', 'original': 'wynotylpnm.mp4', 'path': 'dfdc_train_part_0'})


In [7]:
# Restructure dictionary for pandas dataframe input
filenames = list(data1_dict.keys())
labels = [ data1_dict[name]['label'] for name in filenames ]
fullpaths = [ os.path.join(DATA_KAGGLE, data1_dict[name]['path'], name) for name in filenames]
labels_dict = {
  'filename': filenames,
  'label': labels,
  'fullpath': fullpaths
}

# Create dataframe1
data1 = pd.DataFrame(labels_dict)
data1 = data1.sort_values(by='filename', ignore_index=True)
data1.head(10)

Unnamed: 0,filename,label,fullpath
0,aagfhgtpmv.mp4,FAKE,../datasets/Kaggle-dfdc/train_sample_videos/aa...
1,aapnvogymq.mp4,FAKE,../datasets/Kaggle-dfdc/train_sample_videos/aa...
2,aaqaifqrwn.mp4,FAKE,../datasets/Kaggle-dfdc/dfdc_train_part_0/aaqa...
3,aassnaulhq.mp4,FAKE,../datasets/Kaggle-dfdc/dfdc_train_part_1/aass...
4,aayrffkzxn.mp4,REAL,../datasets/Kaggle-dfdc/dfdc_train_part_0/aayr...
5,abarnvbtwb.mp4,REAL,../datasets/Kaggle-dfdc/train_sample_videos/ab...
6,abebnhqyzv.mp4,FAKE,../datasets/Kaggle-dfdc/dfdc_train_part_1/abeb...
7,abhggqdift.mp4,FAKE,../datasets/Kaggle-dfdc/dfdc_train_part_0/abhg...
8,abofeumbvv.mp4,FAKE,../datasets/Kaggle-dfdc/train_sample_videos/ab...
9,abqwwspghj.mp4,FAKE,../datasets/Kaggle-dfdc/train_sample_videos/ab...


#### Simple EDA

In [8]:
data1.shape

(3433, 3)

In [9]:
# Validate unique values
print("Unique inputs:", len(data1['filename'].unique()))

Unique inputs: 3433


In [10]:
data1['label'].value_counts()

FAKE    3162
REAL     271
Name: label, dtype: int64

### Celeb-DF-v2

Celeb-DF-v2 is a datasets generated dfor deepfake forensics containing 590 original videos collected from YouTube with subjects of different ages, ethnic groups and genders, and 5639 corresponding DeepFake videos. 

The dataset has been kept with the following folder structure:
```
Celeb-DF-v2
|--- Celeb-real # 590 Celebrity videos downloaded from YouTube
|--- YouTube-real # 300 Additional videos downloaded from YouTube
|--- Celeb-synthesis # 5639 Synthesized videos from Celeb-real
|--- List_of_testing_videos.txt # 518 videos
```

Thus, the dataset has a total of 6529, with 890 "REAL" videos and 5639 "FAKE" videos.

#### Create dataframe

In [11]:
# Subfolder names
yt_real = "YouTube-real"
celeb_real = "Celeb-real"
celeb_fake = "Celeb-synthesis"

real_data_path = iterate_files(os.path.join(DATA_CELEBDF, yt_real)) + iterate_files(os.path.join(DATA_CELEBDF, celeb_real))
fake_data_path = iterate_files(os.path.join(DATA_CELEBDF, celeb_fake))

len(real_data_path), len(fake_data_path)

(890, 5639)

In [12]:
# Prepare data for dataframe
filenames = []

for path in real_data_path:
  filenames.append(get_filename(path))
labels = ["REAL"] * len(real_data_path)

for path in fake_data_path:
  filenames.append(get_filename(path))
labels = labels + ["FAKE"] * len(fake_data_path)

labels_dict = {
  'filename': filenames,
  'label': labels,
  'fullpath': real_data_path + fake_data_path
}

# Create dataframe
data2 = pd.DataFrame(labels_dict)
data2 = data2.sort_values(by='filename', ignore_index=True)
data2.head(10)

Unnamed: 0,filename,label,fullpath
0,00000.mp4,REAL,../datasets/Celeb-DF-v2/YouTube-real/00000.mp4
1,00001.mp4,REAL,../datasets/Celeb-DF-v2/YouTube-real/00001.mp4
2,00002.mp4,REAL,../datasets/Celeb-DF-v2/YouTube-real/00002.mp4
3,00003.mp4,REAL,../datasets/Celeb-DF-v2/YouTube-real/00003.mp4
4,00004.mp4,REAL,../datasets/Celeb-DF-v2/YouTube-real/00004.mp4
5,00005.mp4,REAL,../datasets/Celeb-DF-v2/YouTube-real/00005.mp4
6,00006.mp4,REAL,../datasets/Celeb-DF-v2/YouTube-real/00006.mp4
7,00007.mp4,REAL,../datasets/Celeb-DF-v2/YouTube-real/00007.mp4
8,00008.mp4,REAL,../datasets/Celeb-DF-v2/YouTube-real/00008.mp4
9,00009.mp4,REAL,../datasets/Celeb-DF-v2/YouTube-real/00009.mp4


#### Simple EDA

In [13]:
data2.shape

(6529, 3)

In [14]:
# Validate unique values
print("Unique inputs:", len(data2['filename'].unique()))

Unique inputs: 6529


In [15]:
data2['label'].value_counts()

FAKE    5639
REAL     890
Name: label, dtype: int64

### FaceForensics++ DeepFakeDetection

This dataset is provided by Google and Jigsaw for deepfakes detectionr esearch. Public generation methods were used to create over 3000 manipulated videos from 28 actors in various scenes. 

The dataset contains approximately 3068 "FAKE" videos and 363 "REAL" videos.

Downloaded using the script:
```
python faceforensics_download_v4.py . -d DeepFakeDetection -c c23 -t videos      
```
To download the original unmanipulated dataset:
```
python faceforensics_download_v4.py . -d DeepFakeDetection_original -c c23 -t videos 
```

#### Create dataframe

In [16]:
df_path = "manipulated_sequences/DeepFakeDetection/c23/videos"
df_file_path = iterate_files(os.path.join(DATA_FACEFORENSICS, df_path))
rl_path = "original_sequences/actors/c23/videos"
rl_file_path = iterate_files(os.path.join(DATA_FACEFORENSICS, rl_path))

In [17]:
filenames = []
for path in df_file_path:
  filenames.append(get_filename(path))
labels = ["FAKE"] * len(df_file_path)

for path in rl_file_path:
  filenames.append(get_filename(path))
labels += ["REAL"] * len(rl_file_path)

labels_dict = {
  'filename': filenames,
  'label': labels,
  'fullpath': df_file_path + rl_file_path
}

data3 = pd.DataFrame(labels_dict)
data3 = data3.sort_values(by='filename', ignore_index=True)
data3.head(10)

Unnamed: 0,filename,label,fullpath
0,01_02__exit_phone_room__YVGY8LOK.mp4,FAKE,../datasets/DeepfakeDetectionDataset/manipulat...
1,01_02__hugging_happy__YVGY8LOK.mp4,FAKE,../datasets/DeepfakeDetectionDataset/manipulat...
2,01_02__meeting_serious__YVGY8LOK.mp4,FAKE,../datasets/DeepfakeDetectionDataset/manipulat...
3,01_02__outside_talking_still_laughing__YVGY8LO...,FAKE,../datasets/DeepfakeDetectionDataset/manipulat...
4,01_02__secret_conversation__YVGY8LOK.mp4,FAKE,../datasets/DeepfakeDetectionDataset/manipulat...
5,01_02__talking_against_wall__YVGY8LOK.mp4,FAKE,../datasets/DeepfakeDetectionDataset/manipulat...
6,01_02__talking_angry_couch__YVGY8LOK.mp4,FAKE,../datasets/DeepfakeDetectionDataset/manipulat...
7,01_02__walk_down_hall_angry__YVGY8LOK.mp4,FAKE,../datasets/DeepfakeDetectionDataset/manipulat...
8,01_02__walking_and_outside_surprised__YVGY8LOK...,FAKE,../datasets/DeepfakeDetectionDataset/manipulat...
9,01_02__walking_down_indoor_hall_disgust__YVGY8...,FAKE,../datasets/DeepfakeDetectionDataset/manipulat...


#### Simple EDA

In [18]:
data3.shape

(3431, 3)

In [19]:
print("Unique inputs:", len(data3['filename'].unique()))

Unique inputs: 3431


In [20]:
data3['label'].value_counts()

FAKE    3068
REAL     363
Name: label, dtype: int64

## Merge dataset

In [21]:
# Create a new column indicating their path
data1['dataset'] = KAGGLE_DFDC
data2['dataset'] = CELEB_DF
data3['dataset'] = FF_DFDC

In [22]:
# Merge dataframes
frames = [data1, data2, data3]
full_dataset = pd.concat(frames)
full_dataset.head()

Unnamed: 0,filename,label,fullpath,dataset
0,aagfhgtpmv.mp4,FAKE,../datasets/Kaggle-dfdc/train_sample_videos/aa...,Kaggle-dfdc
1,aapnvogymq.mp4,FAKE,../datasets/Kaggle-dfdc/train_sample_videos/aa...,Kaggle-dfdc
2,aaqaifqrwn.mp4,FAKE,../datasets/Kaggle-dfdc/dfdc_train_part_0/aaqa...,Kaggle-dfdc
3,aassnaulhq.mp4,FAKE,../datasets/Kaggle-dfdc/dfdc_train_part_1/aass...,Kaggle-dfdc
4,aayrffkzxn.mp4,REAL,../datasets/Kaggle-dfdc/dfdc_train_part_0/aayr...,Kaggle-dfdc


In [23]:
full_dataset.shape

(13393, 4)

In [24]:
# Validate all unique files
print("Unique inputs:", len(full_dataset['filename'].unique()))

Unique inputs: 13393


**Number of video files for each dataset:**

In [25]:
full_dataset['dataset'].value_counts()

Celeb-DF-v2                 6529
Kaggle-dfdc                 3433
DeepfakeDetectionDataset    3431
Name: dataset, dtype: int64

**REAL/FAKE video distribution:**

In [26]:
full_dataset['label'].value_counts()

FAKE    11869
REAL     1524
Name: label, dtype: int64

### Save as csv

In [27]:
save_path = "../dataset-full.csv"

full_dataset.to_csv(save_path, index=False)
# full_dataset = pd.read_csv(save_path)

## Spliting the Dataset

From the full dataset above, we can see that the datset is incredibly unbalanced with over 7 times more deepfake videos than real videos. This could potentially lead to overfitting during training.

To prevent overfitting, and to reduce compute, we will be sampling the "FAKE" videos based on the number of "REAL" videos. The final dataset will therefore contain 1524 samples of both classes.

In [28]:
seed = 42

fake_data = full_dataset[full_dataset['label'] == "FAKE"]
real_data = full_dataset[full_dataset['label'] == "REAL"]

# Take sample of the data
fake_sample = fake_data.sample(n=len(real_data), random_state=seed)

# Create final dataset
sample_dataset = pd.concat([real_data, fake_sample])
sample_dataset = sample_dataset.sort_values(by='fullpath', ignore_index=True)

sample_dataset.head()

Unnamed: 0,filename,label,fullpath,dataset
0,id0_0000.mp4,REAL,../datasets/Celeb-DF-v2/Celeb-real/id0_0000.mp4,Celeb-DF-v2
1,id0_0001.mp4,REAL,../datasets/Celeb-DF-v2/Celeb-real/id0_0001.mp4,Celeb-DF-v2
2,id0_0002.mp4,REAL,../datasets/Celeb-DF-v2/Celeb-real/id0_0002.mp4,Celeb-DF-v2
3,id0_0003.mp4,REAL,../datasets/Celeb-DF-v2/Celeb-real/id0_0003.mp4,Celeb-DF-v2
4,id0_0004.mp4,REAL,../datasets/Celeb-DF-v2/Celeb-real/id0_0004.mp4,Celeb-DF-v2


In [29]:
sample_dataset['label'].value_counts()

FAKE    1524
REAL    1524
Name: label, dtype: int64

In [30]:
datasets = list(sample_dataset['dataset'].unique())

for d in datasets:
  vc = sample_dataset[sample_dataset['dataset'] == d]['label'].value_counts()
  print(d, '\n', vc, '\n')

Celeb-DF-v2 
 REAL    890
FAKE    725
Name: label, dtype: int64 

DeepfakeDetectionDataset 
 FAKE    394
REAL    363
Name: label, dtype: int64 

Kaggle-dfdc 
 FAKE    405
REAL    271
Name: label, dtype: int64 



In [31]:
save_path = "../dataset-sample.csv"

sample_dataset.to_csv(save_path, index=False)
# sample_dataset = pd.read_csv(save_path)

## Extract Frames and Faces

To make the dataset portable, the faces are extracted from the frames and saved as an image. To reduce the file size, only the faces are saved, not the frames themselves. 

### FaceExtractor and FrameExtractor

To extract the faces from the frames we will be using the MTCNN face detector. MTCNN and RetinaFace were both evaluated in the [Deepfake Detection - Comparing Face Detectors.ipynb](./Deepfake%20Detection%20-%20Comparing%20Face%20Detectors.ipynb) notebook. RetinaFace was clearly demonstrated to be much faster than MTCNN in classification time. In [[1]](https://arxiv.org/pdf/1905.00641v2.pdf), the accuracy for RetinaFace was slightly higher than MTCNN. Benchmarking also showed that RetinaFace was able to achieve a higher precision on the WIDER Face dataset at 0.914 compared to MTCNN 0.809 [[2]](https://paperswithcode.com/sota/face-detection-on-wider-face-hard). 

However, RetinaFace also had a few weaknesses and features that were not favourable for generating the training dataset. For images where the face dominated the frame, RetinaFace was unable to identify that as face was present, while MTCNN was. In videos where the camera zooms in and out from the subject's face, this can be a detriment as RetinaFace will be unable to detect that the face is present. In circumstances where there is face yaw, 

In [44]:
import os
import sys
# Enables import of modules under 'model' folder
sys.path.append(os.path.dirname(os.getcwd()))

from extractors import FaceExtractor, FrameExtractor

# Config variables
extract_path = "../extracted_faces/"
face_confidence_thresh = 0.87
frames_per_video = 32
padding = 0.3
rescale_video = False

face_extractor = FaceExtractor(model="mtcnn", thresh=face_confidence_thresh)

def extract_faces(video_source: str, video_filename: str):
  
  frame_extractor = FrameExtractor(video_source, verbose=False)
  
  frames = frame_extractor.get_frames_evenly(n_frames=frames_per_video,
                                            rescale=rescale_video)
  vidname, _ = os.path.splitext(video_filename) 
  image_path_base = os.path.join(extract_path, vidname)
  Path(image_path_base).mkdir(parents=True, exist_ok=True)
  
  face_filenames = []
  face_fullpath = []
  face_confidence = []
  
  for frame in frames:
    # Gets the frame_id and frame
    loc, img = frame['loc'], frame['frame']
    # Extracts only the best face from the frame 
    face, conf = face_extractor.crop_best_face(img, padding=padding)
    if face is not None:
      image_filename = f"{vidname}_{loc}.png"
      image_path = os.path.join(image_path_base, image_filename)
      cv2.imwrite(image_path, face)
      face_filenames.append(image_filename)
      face_fullpath.append(image_path)
      face_confidence.append(conf)
  
  res = {
    "video_filename": video_filename,
    "image_filename": face_filenames,
    "image_fullpath": face_fullpath,
    "face_confidence":  face_confidence
  }
  return res

### GPU config

In [49]:
# Turns off plaidml warnings (a plugin that enables tensorflow gpu on a Mac)
!export PLAIDML_VERBOSE=1

In [51]:
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"

Num GPUs Available:  1


### Extracting faces with MultiThreading enabled

In [53]:
import concurrent.futures

def time_delta_str(seconds) -> str:
  h = int(seconds/(60*60))
  m = int( (seconds - (h * 60 * 60)) / 60 )
  s = round( seconds - (m * 60), 2 )
  delta = "{} hours(s) {} minute(s) {} second(s)".format(h, m, s)
  return delta

def run_extract_faces(limit, workers):
  # Proxy function for extract_frames using dataframe ilocs
  def extract_faces_proxy(i):
    return extract_faces(sample_dataset["fullpath"].iloc[i], sample_dataset["filename"].iloc[i])
  
  # Creates a list of indices
  ilocs = list(np.arange(limit))
  
  stime = time.time()
  with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
      results = list(tqdm(executor.map(extract_faces_proxy, ilocs), total=limit))
  etime = time.time()
  print("Complete in", time_delta_str(etime-stime))
  return results

extracted_files = run_extract_faces(limit=len(sample_dataset), workers=8)

  7%|████████▍                                                                                                         | 226/3048 [1:47:27<13:59:48, 17.86s/it]2022-05-26 20:37:20.038752: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:113] Plugin optimizer for device_type GPU is enabled.
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3048/3048 [23:33:43<00:00, 27.83s/it]

Complete in 23 hours(s) 33 minute(s) 82843.47 second(s)





In [54]:
# Saves variable in format
import pickle

with open("pickle/exported_files.pickle", "wb") as f:
  pickle.dump(extracted_files, f)
  
with open("pickle/exported_files.pickle", "rb") as f:
  extracted_files = pickle.load(f)

In [55]:
len(extracted_files)

3048

### Create final training dataset

The goal is to create an image dataset file with the following columns:
- `image_filename`: filename of the image
- `image_fullpath`: relative path location of the image
- `face_confidence`: confidence level detected by MTCNN
- `label`
- `video_filename`
- `video_fullpath`
- `video_dataset`

In [63]:
vid_filename = "id0_0000.mp4"
sample_dataset[sample_dataset['filename']==vid_filename]['label'].iloc[0]

'REAL'

In [78]:
columns = [
  'image_filename',
  'image_fullpath',
  'face_confidence',
  'label',
  'video_filename',
  'video_fullpath',
  'video_dataset'
]
face_dataset = pd.DataFrame(columns=columns)

for images in tqdm(extracted_files):
  vid_filename = images['video_filename']
  sample_ref = sample_dataset[sample_dataset['filename']==vid_filename]
  
  vid_deets = {}
  vid_deets['video_filename'] = vid_filename
  vid_deets['video_fullpath'] = sample_ref['fullpath'].iloc[0]
  vid_deets['video_dataset'] = sample_ref['dataset'].iloc[0]
  vid_deets['label'] = sample_ref['label'].iloc[0]
  
  image_deets = {}
  for i in range(len(images['image_filename'])):
    image_deets['image_filename'] = images['image_filename'][i]
    image_deets['image_fullpath'] = images['image_fullpath'][i]
    image_deets['face_confidence'] = images['face_confidence'][i]
    
    row = dict(vid_deets, **image_deets)
    face_dataset = face_dataset.append(row, ignore_index=True)

100%|██████████████████████████████| 3048/3048 [24:09<00:00,  2.10it/s]


### Face Dataset EDA

In [79]:
face_dataset.shape

(95032, 7)

In [80]:
face_dataset['label'].value_counts()

REAL    48200
FAKE    46832
Name: label, dtype: int64

In [87]:
# Verify the existence of a file
def verify_files_exist(workers:int):
  
  # If file is not true, returns filepath
  def file_doesnt_exist(i):
    path = face_dataset['image_fullpath'].iloc[i]
    if not os.path.isfile(path):
      return path
    return None
  
  # Creates a list of indices
  ilocs = np.arange(len(sample_dataset))
  
  stime = time.time()
  with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
      results = list(tqdm(executor.map(file_doesnt_exist, ilocs), total=len(sample_dataset)))
  etime = time.time()
  
  # Removes None from results
  results = list(filter(None, results))
  
  print("Complete in", time_delta_str(etime-stime))
  
  return len(results) == 0, results


files_exist, res = verify_files_exist(workers=8)
print("All files exist: ", files_exist)

100%|██████████████████████████| 3048/3048 [00:00<00:00, 151687.69it/s]

Complete in 0 hours(s) 0 minute(s) 0.19 second(s)
All files exist:  True





In [88]:
# Save dataset
save_path = "../dataset-sample-faces.csv"

face_dataset.to_csv(save_path, index=False)
# face_dataset = pd.read_csv(save_path)