# Creating Standardized Train Set & Test Set Folders from CelebA Dataset

**Run this notebook only ONCE to create the train and test image folders**

**Train Folder Location: 'data/train_celeba_std'**

**Test Folder Location: 'data/test_celeba_std'**

## Import Packages

In [79]:
import os
import numpy as np
import pandas as pd
import shutil
import src
from src.utils.celeba_helper import CelebADataset
from torch.utils.data import Dataset

## Load CelebA Dataset

In [80]:
## Load the dataset
# Path to directory with all the images and mapping
img_folder = 'data/img_align_celeba'
mapping_file = 'data/identity_CelebA.txt'

# Load the dataset from file
celeba_dataset = CelebADataset(img_folder, mapping_file)

In [81]:
# labels file dataframe
file_label_mapping = celeba_dataset.file_label_mapping
display(file_label_mapping.head())

Unnamed: 0,file_name,person_id
0,000001.jpg,2880
1,000002.jpg,2937
2,000003.jpg,8692
3,000004.jpg,5805
4,000005.jpg,9295
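
For readers without access to `CelebADataset`, the following is a minimal sketch of how a mapping like the one above could be built directly with pandas. It assumes the mapping file holds one space-separated `file_name person_id` pair per line with no header; `load_identity_mapping` is a hypothetical helper, not part of `src.utils.celeba_helper`, and the sample text simply repeats the five rows displayed above.

```python
import io

import pandas as pd

# The five rows displayed above, in the space-separated layout assumed here.
sample_text = """000001.jpg 2880
000002.jpg 2937
000003.jpg 8692
000004.jpg 5805
000005.jpg 9295
"""


def load_identity_mapping(source):
    """Hypothetical helper: read 'file_name person_id' pairs into a DataFrame."""
    return pd.read_csv(
        source, sep=" ", header=None, names=["file_name", "person_id"]
    )


mapping = load_identity_mapping(io.StringIO(sample_text))
print(mapping)
```

Passing a real file path instead of the `StringIO` object should work the same way, provided the file follows the assumed layout.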


In [82]:
# How many images are in the CelebA Dataset and how many unique persons are there?
print(f'Number of Images in CelebA Dataset is: {len(celeba_dataset)} images')
print(f'Number of Unique Persons in CelebA Dataset is: {file_label_mapping.person_id.nunique()} persons')

Number of Images in CelebA Dataset is: 202599 images
Number of Unique Persons in CelebA Dataset is: 10177 persons


## Get First File For Each Unique Person in CelebA
These files will be in the train set (10,177 images)

In [83]:
first_img_file = file_label_mapping.drop_duplicates(subset='person_id', keep='first', inplace=False, ignore_index=False)
print(len(first_img_file))
first_img_file.head()

10177


Unnamed: 0,file_name,person_id
0,000001.jpg,2880
1,000002.jpg,2937
2,000003.jpg,8692
3,000004.jpg,5805
4,000005.jpg,9295


In [84]:
# remaining images after removing first_img_file
rest = file_label_mapping.drop(first_img_file.index, axis=0, inplace=False)
print(len(rest))
rest.head()

192422


Unnamed: 0,file_name,person_id
49,000050.jpg,1058
81,000082.jpg,4407
209,000210.jpg,3602
227,000228.jpg,3422
251,000252.jpg,4960


## Get Second File For Each Unique Person Remaining in CelebA 
These files will be in the test set (10,133 images)

In [85]:
second_img_file = rest.drop_duplicates(subset='person_id', keep='first', inplace=False, ignore_index=False)
print(len(second_img_file))
second_img_file.head()

10133


Unnamed: 0,file_name,person_id
49,000050.jpg,1058
81,000082.jpg,4407
209,000210.jpg,3602
227,000228.jpg,3422
251,000252.jpg,4960


In [86]:
# remaining images after removing both first_img_file and second_img_file
rest.drop(second_img_file.index, axis=0, inplace=True)
print(len(rest))
rest.head()

182289


Unnamed: 0,file_name,person_id
290,000291.jpg,4960
783,000784.jpg,3272
807,000808.jpg,1250
912,000913.jpg,6546
949,000950.jpg,2688
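
The two-pass selection above can be easier to follow on a toy table. This sketch uses made-up file names and person IDs (not CelebA data) and applies the same `drop_duplicates` / `drop` sequence: the first pass picks one image per person, the second pass picks one more for every person who still has images left.

```python
import pandas as pd

# Toy data: person 10 has three images, person 20 has one, person 30 has two.
toy = pd.DataFrame({
    "file_name": ["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg", "6.jpg"],
    "person_id": [10, 20, 10, 30, 10, 30],
})

# Pass 1: first image of every person.
first = toy.drop_duplicates(subset="person_id", keep="first")
rest = toy.drop(first.index)

# Pass 2: first remaining image of every person that still has one.
second = rest.drop_duplicates(subset="person_id", keep="first")
rest = rest.drop(second.index)

print(list(first.index))   # rows chosen in pass 1
print(list(second.index))  # rows chosen in pass 2
print(list(rest.index))    # rows still unassigned
```

Person 20 has a single image, so it is picked in the first pass and never shows up in the second one. The same effect explains why the second pass on CelebA yields 10,133 rows rather than 10,177.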


## Standardized Test Set will contain 35,000 Total Images
Test set contains: 10,133 images (second_img_file) + 24,867 randomly sampled images = 35,000 images total

The train set must contain every unique person in CelebA (10,177), but the test set need not: some people have only one image in the dataset, so they appear in train but not in test.
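
The gap between the two counts reported so far (10,177 persons in the first pass, 10,133 in the second) implies that 44 persons have exactly one image. A quick way to list such persons is `value_counts`, sketched here on toy data with made-up IDs; on the real mapping the same expression would be applied to `file_label_mapping`.

```python
import pandas as pd

# Toy data: only person 20 has a single image.
toy = pd.DataFrame({
    "file_name": ["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg", "6.jpg"],
    "person_id": [10, 20, 10, 30, 10, 30],
})

# Number of images per person, then the IDs that occur exactly once.
images_per_person = toy["person_id"].value_counts()
single_image_ids = images_per_person[images_per_person == 1].index.tolist()
print(single_image_ids)

# Arithmetic from the counts printed earlier in this notebook.
expected_single = 10177 - 10133
print(expected_single)
```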

In [87]:
# randomly sample from rest (first_img_file and second_img_file already dropped) the images
# that join second_img_file to make up the 35,000-image test set
test_sample = rest.sample(n=35000-10133, random_state=42)
print(len(test_sample))
test_sample.head()

24867


Unnamed: 0,file_name,person_id
109075,109076.jpg,9681
150494,150495.jpg,3916
125212,125213.jpg,8092
150599,150600.jpg,8867
112345,112346.jpg,6686


In [88]:
# get the remaining images, which go in the train set together with first_img_file
rest.drop(test_sample.index, axis=0, inplace=True)
print(len(rest))
rest.head()

157422


Unnamed: 0,file_name,person_id
290,000291.jpg,4960
783,000784.jpg,3272
807,000808.jpg,1250
912,000913.jpg,6546
949,000950.jpg,2688
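
The sampling step relies on two properties: a fixed `random_state` makes the draw repeatable, and dropping the sampled index leaves a pool that does not overlap with the sample. The sketch below illustrates both on a small made-up pool; the pool size and sample size are arbitrary choices for illustration only.

```python
import pandas as pd

# Made-up pool of ten images standing in for the `rest` DataFrame.
pool = pd.DataFrame({
    "file_name": [f"{i:06d}.jpg" for i in range(1, 11)],
    "person_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
})

n_test_extra = 4  # arbitrary sample size for the toy example

# Same seed, same rows: the draw is reproducible.
sample_a = pool.sample(n=n_test_extra, random_state=42)
sample_b = pool.sample(n=n_test_extra, random_state=42)

# Whatever was not sampled stays available for the train set.
leftover = pool.drop(sample_a.index)

print(sample_a.index.equals(sample_b.index))
print(len(sample_a), len(leftover))
```

With the notebook's numbers, the sample size is 35,000 - 10,133 = 24,867 and the leftover pool is 182,289 - 24,867 = 157,422 images.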


## Create CelebA Train DataFrame & Load into Train Folder

In [89]:
# Create One-Shot Training Folder
train_dir = 'data/train_celeba_std'

# create the folder if it does not exist yet
os.makedirs(train_dir, exist_ok=True)

In [90]:
# concat the 10,177 first_img_file + the remaining 157,422 images to create the new train set
train_df = pd.concat([first_img_file, rest]).sort_values(by='file_name')
print(len(train_df))
train_df.head()

167599


Unnamed: 0,file_name,person_id
0,000001.jpg,2880
1,000002.jpg,2937
2,000003.jpg,8692
3,000004.jpg,5805
4,000005.jpg,9295


In [91]:
# get list of the file paths for the training images - one-shot
train_file_paths = [os.path.join(img_folder, file_name).replace('\\','/') for file_name in train_df.file_name]

# Copy-pasting images (source path, destination path)
for name in train_file_paths:
    shutil.copy(name, train_dir)

print('CelebA Training Data is copied')

CelebA Training Data is copied
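
Since this notebook is meant to be run only once, a copy step that skips files already present in the destination may be a useful safeguard against accidental reruns. The following is a sketch of that idea, not the notebook's own method: `copy_missing` is a hypothetical helper, and the demonstration uses a temporary directory with placeholder files instead of the real image folders.

```python
import shutil
import tempfile
from pathlib import Path


def copy_missing(file_names, src_dir, dst_dir):
    """Copy each named file from src_dir to dst_dir, skipping files already there.

    Returns the number of files actually copied.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for file_name in file_names:
        target = dst_dir / file_name
        if not target.exists():
            shutil.copy(src_dir / file_name, target)
            copied += 1
    return copied


# Demonstration on placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "img_align_celeba"
    dst = Path(tmp) / "train_celeba_std"
    src.mkdir()
    names = ["000001.jpg", "000002.jpg"]
    for name in names:
        (src / name).write_bytes(b"placeholder")

    first_run = copy_missing(names, src, dst)   # copies both files
    second_run = copy_missing(names, src, dst)  # nothing left to copy
    copied_names = sorted(p.name for p in dst.iterdir())

print(first_run, second_run, copied_names)
```

In the notebook, a call such as `copy_missing(train_df.file_name, img_folder, train_dir)` would play the role of the loop above, under the same assumptions.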


## Create CelebA Test DataFrame & Load into Test Folder

In [92]:
# Create Test Folder
test_dir = 'data/test_celeba_std'

# create the folder if it does not exist yet
os.makedirs(test_dir, exist_ok=True)

In [93]:
# concat the 10,133 second_img_file + the 24867 test_sample to create the 35,000 image test set
test_df = pd.concat([second_img_file, test_sample]).sort_values(by='file_name')
print(len(test_df))
test_df.head()

35000


Unnamed: 0,file_name,person_id
49,000050.jpg,1058
81,000082.jpg,4407
209,000210.jpg,3602
227,000228.jpg,3422
251,000252.jpg,4960


In [94]:
# get list of the file paths for the testing images 
test_file_paths = [os.path.join(img_folder, file_name).replace('\\','/') for file_name in test_df.file_name]

# Copy-pasting images (source path, destination path)
for name in test_file_paths:
    shutil.copy(name, test_dir)

print('CelebA Testing Data is copied')

CelebA Testing Data is copied


## Review of CelebA Training/Test Split

In [96]:
print('Original CelebA Dataset')
print(f'Number of Images in CelebA Dataset is: {len(celeba_dataset)} images')
print(f'Number of Unique Persons in CelebA Dataset is: {file_label_mapping.person_id.nunique()} persons')

print('_____________________________________________________________')

print('One Shot Training Set')
print(f'Number of Images in Training Data is: {len(train_df)} images')
print(f'Number of Unique Persons in One-Shot Training Data is: {train_df.person_id.nunique()} persons')

print('_____________________________________________________________')

print('Testing Set')
print(f'Number of Images in Test Data is: {len(test_df)} images')
print(f'Number of Unique Persons in Test Data is: {test_df.person_id.nunique()} persons')

Original CelebA Dataset
Number of Images in CelebA Dataset is: 202599 images
Number of Unique Persons in CelebA Dataset is: 10177 persons
_____________________________________________________________
One Shot Training Set
Number of Images in Training Data is: 167599 images
Number of Unique Persons in One-Shot Training Data is: 10177 persons
_____________________________________________________________
Testing Set
Number of Images in Test Data is: 35000 images
Number of Unique Persons in Test Data is: 10133 persons
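
Beyond comparing counts, the split can be checked for three properties: no image is in both sets, every image is in one of them, and every person appears in the train set. The sketch below wraps these checks in a hypothetical `check_split` helper and runs it on toy data; in the notebook it would be called as `check_split(file_label_mapping, train_df, test_df)`, assuming the original row index is preserved as it is above.

```python
import pandas as pd


def check_split(full_df, train_df, test_df):
    """Hypothetical helper: verify a train/test split built from full_df."""
    train_idx, test_idx = set(train_df.index), set(test_df.index)
    assert train_idx.isdisjoint(test_idx), "train and test share images"
    assert train_idx | test_idx == set(full_df.index), "some images are unassigned"
    assert set(train_df.person_id) == set(full_df.person_id), "a person is missing from train"
    return True


# Toy data with made-up IDs: person 20 has a single image and is train-only.
full = pd.DataFrame({
    "file_name": ["1.jpg", "2.jpg", "3.jpg", "4.jpg", "5.jpg", "6.jpg"],
    "person_id": [10, 20, 10, 30, 10, 30],
})
train = full.loc[[0, 1, 3, 4]]
test = full.loc[[2, 5]]
split_ok = check_split(full, train, test)

# The counts reported in this notebook are also mutually consistent.
reported_counts_ok = (
    10177 + 157422 == 167599
    and 10133 + 24867 == 35000
    and 167599 + 35000 == 202599
)
print(split_ok, reported_counts_ok)
```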
