# Creating One-Shot Train Set & Test Set Folders from CelebA Dataset

**Run this notebook only ONCE to create the one-shot train image folder and the test image folder.**

**One-Shot Train Folder Location: 'data/train_celeba_one_shot'**

**Test Folder Location: 'data/test_celeba'**

## Import Packages

In [1]:
import os
import numpy as np
import pandas as pd
import shutil
import src
from src.utils.celeba_helper import CelebADataset
from torch.utils.data import Dataset

## Load CelebA Dataset

In [2]:
## Load the dataset
# Path to directory with all the images and mapping
img_folder = 'data/img_align_celeba'
mapping_file = 'data/identity_CelebA.txt'

# Load the dataset from file
celeba_dataset = CelebADataset(img_folder, mapping_file)
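
CelebADataset is a project helper whose internals are not shown in this notebook. As a rough sketch of what the mapping step involves, the snippet below parses lines in the layout that identity_CelebA.txt is assumed to use here: one whitespace-separated `file_name person_id` pair per line. The function name `read_identity_mapping` and the in-memory sample are hypothetical and stand in for the real file.

```python
import io

# In-memory sample in the assumed "<file_name> <person_id>" layout (hypothetical stand-in for the real file).
sample = io.StringIO("000001.jpg 2880\n000002.jpg 2937\n000003.jpg 8692\n")

def read_identity_mapping(handle):
    """Return a list of (file_name, person_id) tuples read from an open text file."""
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        file_name, person_id = line.split()
        rows.append((file_name, int(person_id)))
    return rows

mapping = read_identity_mapping(sample)
print(mapping[0])
```

The helper used in this notebook exposes the same two columns as a dataframe (file_name, person_id), which is what the following cells rely on.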

In [3]:
# labels file dataframe
file_label_mapping = celeba_dataset.file_label_mapping
display(file_label_mapping.head())

| | file_name | person_id |
|---|---|---|
| 0 | 000001.jpg | 2880 |
| 1 | 000002.jpg | 2937 |
| 2 | 000003.jpg | 8692 |
| 3 | 000004.jpg | 5805 |
| 4 | 000005.jpg | 9295 |


In [4]:
# How many images are in the CelebA Dataset and how many unique persons are there?
print(f'Number of Images in CelebA Dataset is: {len(celeba_dataset)} images')
print(f'Number of Unique Persons in CelebA Dataset is: {file_label_mapping.person_id.nunique()} persons')

Number of Images in CelebA Dataset is: 202599 images
Number of Unique Persons in CelebA Dataset is: 10177 persons


## Create & Load One-Shot Training Folder from CelebA

There are 10,177 unique persons in the CelebA Dataset. 

The training set contains the **FIRST** file_name listed for each person_id in the file_label_mapping dataframe.

**1 image for each person (10,177 images)**
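
The keep-first rule described above can be illustrated without pandas. This is a minimal sketch on made-up rows, not the notebook's own implementation (which uses drop_duplicates); the function name `first_image_per_person` and the toy file names are hypothetical.

```python
def first_image_per_person(rows):
    """Keep only the first (file_name, person_id) row seen for each person_id."""
    seen = set()
    kept = []
    for file_name, person_id in rows:
        if person_id not in seen:
            seen.add(person_id)
            kept.append((file_name, person_id))
    return kept

# Toy rows in file order; persons 1 and 2 each have two images, person 3 has one.
rows = [("a.jpg", 1), ("b.jpg", 2), ("c.jpg", 1), ("d.jpg", 3), ("e.jpg", 2)]
train_rows = first_image_per_person(rows)
print(train_rows)
```

Because rows are visited in file order, the result depends on how the mapping is sorted; a different ordering would select a different image for each person.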

In [5]:
# Create One-Shot Training Folder
train_dir = 'data/train_celeba_one_shot'

# Create the folder if it does not exist yet (no error if it already exists)
os.makedirs(train_dir, exist_ok=True)

In [6]:
# Obtain the first file_name for each person_id in the file_label_mapping dataframe
train_df = file_label_mapping.drop_duplicates(subset='person_id', keep='first', inplace=False, ignore_index=False)
display(train_df.head())

print(f'Number of Images in One-Shot Training Data is: {len(train_df)} images')
print(f'Number of Unique Persons in One-Shot Training Data is: {train_df.person_id.nunique()} persons')

| | file_name | person_id |
|---|---|---|
| 0 | 000001.jpg | 2880 |
| 1 | 000002.jpg | 2937 |
| 2 | 000003.jpg | 8692 |
| 3 | 000004.jpg | 5805 |
| 4 | 000005.jpg | 9295 |


Number of Images in One-Shot Training Data is: 10177 images
Number of Unique Persons in One-Shot Training Data is: 10177 persons


In [7]:
# get list of the file paths for the training images - one-shot
train_file_paths = [os.path.join(img_folder, file_name).replace('\\','/') for file_name in train_df.file_name]

# Copy each image from its source path into the training folder
for name in train_file_paths:
    shutil.copy(name, train_dir)

print('One Shot CelebA Training Data is copied')

One Shot CelebA Training Data is copied
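
Since the notebook is meant to be run only once, one option is to make the copy step safe to repeat. The sketch below is an assumption about how that could look, not part of the original workflow: the hypothetical helper `copy_missing` skips any file whose name already exists in the destination folder. It is demonstrated on placeholder files in a temporary directory rather than on the real CelebA images.

```python
import shutil
import tempfile
from pathlib import Path

def copy_missing(src_paths, dst_dir):
    """Copy each source file into dst_dir unless a file with that name is already there.

    Returns the number of files actually copied.
    """
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = 0
    for src in map(Path, src_paths):
        target = dst / src.name
        if not target.exists():
            shutil.copy(src, target)
            copied += 1
    return copied

with tempfile.TemporaryDirectory() as tmp:
    # Create two placeholder "images" to copy.
    src_dir = Path(tmp) / "src"
    src_dir.mkdir()
    files = []
    for name in ("000001.jpg", "000002.jpg"):
        path = src_dir / name
        path.write_bytes(b"placeholder")
        files.append(path)

    first_run = copy_missing(files, Path(tmp) / "train")   # copies both files
    second_run = copy_missing(files, Path(tmp) / "train")  # nothing left to copy
    print(first_run, second_run)
```

The check is by file name only, so it would not detect a partially written or stale file; that trade-off keeps a repeated run fast.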


## Create & Load Test Folder - CelebA

All images in the CelebA dataset that are not part of the one-shot training set become the test set.

As a result, the training set holds exactly one image per person, while the test set may hold several images of the same person. If a person has only one image in the original CelebA dataset, that image goes to the training set and the person has no image in the test set. Not every one of the 10,177 persons in the training set therefore has a corresponding test image (the counts printed below show 10,133 persons in the test set, so 44 persons appear only in training).
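
To make that consequence concrete, here is a small sketch on made-up rows that finds the persons who would end up in the training set only. It assumes the keep-first split described in this notebook, under which a person with exactly one image has nothing left over for the test set. The toy rows are hypothetical.

```python
from collections import Counter

# Toy (file_name, person_id) rows; person 3 has a single image.
rows = [("a.jpg", 1), ("b.jpg", 2), ("c.jpg", 1), ("d.jpg", 3), ("e.jpg", 2)]

# Count images per person.
images_per_person = Counter(person_id for _, person_id in rows)

# A person with exactly one image contributes it to training and has no test image.
train_only_persons = sorted(pid for pid, n in images_per_person.items() if n == 1)

# Every image beyond the first for each person lands in the test set.
test_image_count = sum(n - 1 for n in images_per_person.values())

print(train_only_persons, test_image_count)
```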

In [8]:
# Create Test Folder
test_dir = 'data/test_celeba'

# Create the folder if it does not exist yet (no error if it already exists)
os.makedirs(test_dir, exist_ok=True)

In [9]:
# Drop every training-set row (matched by index) from file_label_mapping; the remaining rows form the test set
test_df = file_label_mapping.drop(train_df.index, axis=0, inplace=False)
display(test_df.head())

print(f'Number of Images in Test Data is: {len(test_df)} images')
print(f'Number of Unique Persons in Test Data is: {test_df.person_id.nunique()} persons')

| | file_name | person_id |
|---|---|---|
| 49 | 000050.jpg | 1058 |
| 81 | 000082.jpg | 4407 |
| 209 | 000210.jpg | 3602 |
| 227 | 000228.jpg | 3422 |
| 251 | 000252.jpg | 4960 |


Number of Images in Test Data is: 192422 images
Number of Unique Persons in Test Data is: 10133 persons


In [10]:
# get list of the file paths for the testing images 
test_file_paths = [os.path.join(img_folder, file_name).replace('\\','/') for file_name in test_df.file_name]

# Copy each image from its source path into the test folder
for name in test_file_paths:
    shutil.copy(name, test_dir)

print('CelebA Testing Data is copied')

CelebA Testing Data is copied


### Summary: Original CelebA, One-Shot Train, and Test Split

In [11]:
print('Original CelebA Dataset')
print(f'Number of Images in CelebA Dataset is: {len(celeba_dataset)} images')
print(f'Number of Unique Persons in CelebA Dataset is: {file_label_mapping.person_id.nunique()} persons')

print('_____________________________________________________________')

print('One Shot Training Set')
print(f'Number of Images in One-Shot Training Data is: {len(train_df)} images')
print(f'Number of Unique Persons in One-Shot Training Data is: {train_df.person_id.nunique()} persons')

print('_____________________________________________________________')

print('Testing Set')
print(f'Number of Images in Test Data is: {len(test_df)} images')
print(f'Number of Unique Persons in Test Data is: {test_df.person_id.nunique()} persons')

Original CelebA Dataset
Number of Images in CelebA Dataset is: 202599 images
Number of Unique Persons in CelebA Dataset is: 10177 persons
_____________________________________________________________
One Shot Training Set
Number of Images in One-Shot Training Data is: 10177 images
Number of Unique Persons in One-Shot Training Data is: 10177 persons
_____________________________________________________________
Testing Set
Number of Images in Test Data is: 192422 images
Number of Unique Persons in Test Data is: 10133 persons
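
The printed counts can be cross-checked with a little arithmetic. The sketch below hard-codes the numbers reported above (the variable names are only illustrative) and confirms that the two subsets add up to the full dataset, then derives how many persons have no test image.

```python
# Counts copied from the output above.
total_images = 202599
train_images = 10177
test_images = 192422
train_persons = 10177
test_persons = 10133

# The train and test sets partition the original dataset.
assert train_images + test_images == total_images

# Every person is in the training set, so the difference is the number of
# persons whose only image went to training.
persons_without_test_image = train_persons - test_persons
print(persons_without_test_image)  # 44
```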
