**Reorganize Data with Symbolic Links**

A symbolic link (also called a symlink or soft link) is a file that contains a reference to another file or directory, stored as an absolute or relative path, and that affects pathname resolution (Source: https://en.wikipedia.org/wiki/Symbolic_link). Here we use symbolic links to reorganize our data without copying the image files themselves.
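As a minimal sketch of the idea, the snippet below creates a symbolic link in a throwaway directory and reads a file through it. The file names `original.txt` and `link.txt` are made up for illustration, and the example assumes a Unix-like system where creating symlinks needs no special privileges.

```python
import os
import tempfile

# Work in a temporary directory so nothing outside it is touched.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "original.txt")  # hypothetical file name
    link = os.path.join(tmp, "link.txt")        # hypothetical link name

    with open(target, "w") as f:
        f.write("hello")

    # os.symlink(src, dst): dst becomes a link whose stored path is src.
    os.symlink(target, link)

    is_link = os.path.islink(link)    # is link.txt itself a symlink?
    stored_path = os.readlink(link)   # the path stored inside the link
    with open(link) as f:             # opening the link reads the target file
        content = f.read()

print(is_link, content)
```

The link stores only a path, so it takes almost no disk space regardless of how large the target file is.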

Originally, the "Dogs vs Cats" images are organized in two folders, `train` and `test`, and within `train` the class of each image is encoded only in its file name (for example, `cat.9491.jpg` or `dog.423.jpg`). We want to reorganize the images into two folders, `train` and `valid`, each containing a `cat` subfolder and a `dog` subfolder, so that the class of an image is identified by the subfolder it sits in. This is the [format](http://wiki.fast.ai/index.php/Lesson_1_Notes#Data_Structure) that is required for completing exercises in the 2018 [FAST.AI](http://www.fast.ai/) Deep Learning course.
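The mapping from the flat layout to the nested one can be sketched as a small function. `destination_path` is a hypothetical helper written for this illustration, not part of the notebook or of any library, and the default root `COPY` simply mirrors the directory name used later in this notebook.

```python
import os

def destination_path(filename, split, root="COPY"):
    """Return where a file such as 'cat.9491.jpg' would land in the new layout.

    split is 'train' or 'valid'; the class label is the text before the
    first dot in the file name.
    """
    label = filename.split(".")[0]
    if label not in ("cat", "dog"):
        raise ValueError("cannot infer a class label from %r" % filename)
    return os.path.join(root, split, label, filename)

print(destination_path("cat.9491.jpg", "train"))
print(destination_path("dog.423.jpg", "valid"))
```

Unlabeled test images such as `1523.jpg` have no class prefix, which is why the helper rejects them rather than guessing.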


*Step 1: Describe Original Data Organization*

In [1]:
import os
print(os.listdir("../input"))

['test', 'train']


In [2]:
print(os.listdir("../input/train"))

['cat.9491.jpg', 'cat.11613.jpg', 'cat.11841.jpg', 'dog.423.jpg', 'cat.11501.jpg', 'dog.11716.jpg', 'cat.5171.jpg', 'cat.1661.jpg', 'cat.3477.jpg', 'dog.7755.jpg', 'dog.6598.jpg', 'cat.6704.jpg', 'cat.8697.jpg', 'cat.2959.jpg', 'cat.2962.jpg', 'cat.7165.jpg', 'dog.8555.jpg', 'dog.10141.jpg', 'cat.3505.jpg', 'cat.8617.jpg', 'cat.9265.jpg', 'dog.672.jpg', 'cat.5839.jpg', 'cat.11263.jpg', 'cat.11590.jpg', 'cat.129.jpg', 'cat.5936.jpg', 'cat.2492.jpg', 'cat.2528.jpg', 'cat.3582.jpg', 'dog.1188.jpg', 'cat.7453.jpg', 'dog.2568.jpg', 'cat.4453.jpg', 'cat.10919.jpg', 'cat.6147.jpg', 'cat.1129.jpg', 'cat.11794.jpg', 'cat.9042.jpg', 'dog.9858.jpg', 'cat.1292.jpg', 'dog.9948.jpg', 'cat.8856.jpg', 'dog.6979.jpg', 'cat.9910.jpg', 'dog.2352.jpg', 'dog.7457.jpg', 'dog.8805.jpg', 'dog.9561.jpg', 'dog.10506.jpg', 'cat.3851.jpg', 'cat.5312.jpg', 'cat.10238.jpg', 'cat.10284.jpg', 'dog.4586.jpg', 'cat.6930.jpg', 'cat.8647.jpg', 'dog.5187.jpg', 'dog.11290.jpg', 'cat.4810.jpg', 'dog.7692.jpg', 'cat.9766.jpg

In [3]:
from sklearn.model_selection import train_test_split
PATH = "../input/"
root_prefix = PATH
# os.path.join avoids the doubled slash that '%s/train/' % PATH would produce
train_filenames = os.listdir(os.path.join(root_prefix, 'train'))
print("Sample of Training Data:", train_filenames[0:10])
test_filenames = os.listdir(os.path.join(root_prefix, 'test'))
print("\nSample of Testing Data:", test_filenames[0:10])

Sample of Training Data: ['cat.9491.jpg', 'cat.11613.jpg', 'cat.11841.jpg', 'dog.423.jpg', 'cat.11501.jpg', 'dog.11716.jpg', 'cat.5171.jpg', 'cat.1661.jpg', 'cat.3477.jpg', 'dog.7755.jpg']

Sample of Testing Data: ['1523.jpg', '2804.jpg', '6364.jpg', '11825.jpg', '6180.jpg', '2506.jpg', '11546.jpg', '674.jpg', '862.jpg', '988.jpg']


In [4]:
# Hold out 10% of the labeled training images as a validation set
my_train, my_cv = train_test_split(train_filenames, test_size=0.1, random_state=0)
print("Number of Training Images:", len(my_train))
print("Number of Validation Images:", len(my_cv))

Number of Training Images: 22500
Number of Validation Images: 2500
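The split above is random rather than stratified, so the two classes are not guaranteed to appear in equal proportions in each part. A quick way to check is to count labels by file-name prefix. `count_labels` is a hypothetical helper, and the sample list below is taken from the file names printed earlier; in the notebook it would be called on `my_train` and `my_cv`.

```python
from collections import Counter

def count_labels(filenames):
    """Count images per class, using the text before the first dot as the label."""
    return Counter(name.split(".")[0] for name in filenames)

# A few file names from the listing above, used here only as sample input.
sample = ['cat.9491.jpg', 'cat.11613.jpg', 'cat.11841.jpg', 'dog.423.jpg']
counts = count_labels(sample)
print(counts["cat"], counts["dog"])
```

If the counts turn out noticeably unbalanced, `train_test_split` also accepts a `stratify` argument that takes one label per file name.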


*Step 2: Reorganize Data Using Symbolic Links*

In [5]:
import shutil
# Mirror ../input as a tree of symlinks (GNU cp: -a archive mode, -s link instead of copy)
!cp -as "$(pwd)/../input/" "$(pwd)/COPY"
root_prefix = 'COPY'
# Link targets must be absolute: a relative target is resolved relative to the
# directory that holds the link, and COPY/train itself is deleted and rebuilt below.
source_dir = os.path.abspath('../input/train')

def remove_and_create_class(dirname):
    """Delete dirname if it exists, then recreate it with empty cat/ and dog/ subfolders."""
    if os.path.exists(dirname):
        shutil.rmtree(dirname)
    os.mkdir(dirname)
    os.mkdir(os.path.join(dirname, 'cat'))
    os.mkdir(os.path.join(dirname, 'dog'))

remove_and_create_class('%s/train' % root_prefix)
remove_and_create_class('%s/valid' % root_prefix)

# Link every image into <split>/<label>/, where the label is the file-name prefix
for split, filenames in (('train', my_train), ('valid', my_cv)):
    for filename in filenames:
        label = filename.split('.')[0]  # 'cat' or 'dog'
        os.symlink(os.path.join(source_dir, filename),
                   os.path.join(root_prefix, split, label, filename))
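A relative link target is resolved relative to the directory that contains the link, not the current working directory, so it is easy to create links that point nowhere. The sketch below walks a directory tree and reports such dangling links. `find_broken_links` is a hypothetical helper, and the demo runs on a temporary directory with made-up file names rather than on the real data.

```python
import os
import tempfile

def find_broken_links(root):
    """Return paths of symlinks under root whose targets do not exist."""
    broken = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists follows links, so it is False for a dangling link.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return broken

with tempfile.TemporaryDirectory() as tmp:
    real = os.path.join(tmp, "real.jpg")
    open(real, "w").close()
    # One link to an existing file, one link to a file that does not exist.
    os.symlink(real, os.path.join(tmp, "good.jpg"))
    os.symlink(os.path.join(tmp, "missing.jpg"), os.path.join(tmp, "bad.jpg"))
    broken_names = [os.path.basename(p) for p in find_broken_links(tmp)]

print(broken_names)
```

In the notebook, `find_broken_links('COPY')` should ideally return an empty list before any training is started.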

*Step 3: Describe New Data Organization*

In [6]:
PATH = 'COPY'
print(os.listdir('COPY/train'))
print(os.listdir('COPY/valid'))

['dog', 'cat']
['dog', 'cat']


In [7]:
print(os.listdir('COPY/valid/cat'))

['cat.11508.jpg', 'cat.212.jpg', 'cat.7582.jpg', 'cat.1673.jpg', 'cat.6587.jpg', 'cat.4662.jpg', 'cat.3734.jpg', 'cat.441.jpg', 'cat.9274.jpg', 'cat.9976.jpg', 'cat.9853.jpg', 'cat.6081.jpg', 'cat.9097.jpg', 'cat.6519.jpg', 'cat.1184.jpg', 'cat.3292.jpg', 'cat.3270.jpg', 'cat.8114.jpg', 'cat.4299.jpg', 'cat.1107.jpg', 'cat.5070.jpg', 'cat.7955.jpg', 'cat.8179.jpg', 'cat.6327.jpg', 'cat.11422.jpg', 'cat.5412.jpg', 'cat.4308.jpg', 'cat.5903.jpg', 'cat.10025.jpg', 'cat.5775.jpg', 'cat.625.jpg', 'cat.4017.jpg', 'cat.2541.jpg', 'cat.2096.jpg', 'cat.6899.jpg', 'cat.1703.jpg', 'cat.3395.jpg', 'cat.12494.jpg', 'cat.7608.jpg', 'cat.1245.jpg', 'cat.786.jpg', 'cat.7884.jpg', 'cat.12438.jpg', 'cat.4120.jpg', 'cat.3038.jpg', 'cat.11075.jpg', 'cat.8249.jpg', 'cat.3344.jpg', 'cat.11562.jpg', 'cat.1580.jpg', 'cat.2676.jpg', 'cat.6517.jpg', 'cat.3480.jpg', 'cat.10989.jpg', 'cat.10186.jpg', 'cat.2411.jpg', 'cat.1262.jpg', 'cat.7439.jpg', 'cat.2874.jpg', 'cat.4676.jpg', 'cat.7971.jpg', 'cat.7627.jpg', 'c

In [8]:
# Remove symlinks before committing
!rm -rf "$(pwd)/COPY"
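The same cleanup can be done from Python with `shutil.rmtree`, which removes the links inside a directory without following them, so the original images are left alone. The sketch below demonstrates this on a temporary directory; the `input` and `COPY` folders and the file `cat.1.jpg` are stand-ins created only for the demonstration.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for an original image (hypothetical file name).
    original = os.path.join(tmp, "input", "cat.1.jpg")
    os.makedirs(os.path.dirname(original))
    open(original, "w").close()

    # Stand-in for the COPY directory holding a symlink to the original.
    copy_dir = os.path.join(tmp, "COPY")
    os.makedirs(copy_dir)
    os.symlink(original, os.path.join(copy_dir, "cat.1.jpg"))

    # Removing the link tree deletes the links, not the files they point to.
    shutil.rmtree(copy_dir)
    copy_gone = not os.path.exists(copy_dir)
    original_intact = os.path.exists(original)

print(copy_gone, original_intact)
```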

*Step 4: Proceed with analysis using your newly reorganized data*