## Split the dataset of politicians' photos

**Note: all folders are empty in order to respect the privacy of the politicians! There is no public dataset or final model. I used izquierda (left) for PSOE and derecha (right) for PP.**

Use `izquierda` and `derecha` from `cropped` to copy files into `train` and `validation` inside `data_politics`. This folder should already exist and contain both subfolders (or adjust the script to create them). The **data_politics** folder will be the dataset folder for the subsequent deep learning classifications.

We shall split the dataset into **80% train** and **20% validation**.
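As a rough sketch of how a percentage translates into a per-class file count, the helper below rounds a fraction of the class size. The name `validation_count` and its `val_fraction` parameter are hypothetical and not part of the original notebook, which sets the count by hand.

```python
def validation_count(n_files, val_fraction=0.2):
    """Return how many files of one class go to the validation subset."""
    if not 0 < val_fraction < 1:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    return int(round(n_files * val_fraction))

# 50 images per class, as in this notebook
n_val = validation_count(50)
n_train = 50 - n_val
print(n_val, n_train)
```

With 50 images per class and a 20% fraction this gives 10 validation and 40 training images per class.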

We need `os` and `shutil` to manage the files, and `random` to randomly split the dataset into train and validation subsets. You should have a folder structure such as:

* `./data_politics/train/izquierda`
* `./data_politics/train/derecha`
* `./data_politics/validation/izquierda`
* `./data_politics/validation/derecha`
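If the folders do not exist yet, something like the following could create them. This is a minimal sketch, assuming the folder names listed above; the function name `make_dataset_folders` is hypothetical.

```python
import os

def make_dataset_folders(base='data_politics'):
    """Create base/<subset>/<label> for both subsets and both classes."""
    for subset in ('train', 'validation'):
        for label in ('izquierda', 'derecha'):
            # exist_ok=True makes the call safe to repeat
            os.makedirs(os.path.join(base, subset, label), exist_ok=True)

make_dataset_folders()
```

Because of `exist_ok=True`, running it again on an existing structure does nothing and raises no error.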

In [1]:
import os
import random
import shutil

These lines define the locations of the source files (the entire dataset) and the destination folders for the split dataset (train and validation subsets):

In [2]:
# Source dataset: from where to copy the files
sourceFolderClass1 = 'cropped/izquierda'
sourceFolderClass2 = 'cropped/derecha'
# Destination folders: dataset split into train and validation for izquierda and derecha
destFolderClass1_tr  = 'data_politics/train/izquierda'
destFolderClass2_tr  = 'data_politics/train/derecha'
destFolderClass1_val = 'data_politics/validation/izquierda'
destFolderClass2_val = 'data_politics/validation/derecha'

Get the list with all the files in the source folder:

In [3]:
sourceFiles1 = os.listdir(sourceFolderClass1)
sourceFiles2 = os.listdir(sourceFolderClass2)
print("Class 1 - izquierda:", len(sourceFiles1))
print("Class 2 - derecha:", len(sourceFiles2))

Class 1 - izquierda: 50
Class 2 - derecha: 50


Let's shuffle the lists of source files using a fixed random seed:

In [4]:
random.seed(1)
random.shuffle(sourceFiles1)
random.shuffle(sourceFiles2)
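A seed only reproduces the same split if the list starts in the same order, and `os.listdir` returns names in arbitrary order. The sketch below is one possible variation, not the notebook's own method: it sorts the names first and uses a private `random.Random` generator. The helper name `shuffled_copy` and the file names are made up for illustration.

```python
import random

def shuffled_copy(names, seed=1):
    """Return a shuffled copy of names; same seed and same names give the same order."""
    rng = random.Random(seed)   # private generator, leaves the global random state alone
    result = sorted(names)      # fixed starting order before shuffling
    rng.shuffle(result)
    return result

files = ['c.jpg', 'a.jpg', 'b.jpg', 'd.jpg']   # made-up file names
# The input order no longer matters, only the seed and the set of names
print(shuffled_copy(files) == shuffled_copy(files[::-1]))
```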

We shall define the number of files to copy into the `validation` subfolder for each class. If you want a different split, modify `val_files`.

In [5]:
# Number of files to copy into the VALIDATION folder for each class
val_files = 10

# Copy validation files
print('--> Validation split ...')
for i in range(val_files):
    # copy validation files for PSOE
    File1 = os.path.join(sourceFolderClass1,sourceFiles1[i])
    File2 = os.path.join(destFolderClass1_val,  sourceFiles1[i])
    shutil.copy(File1,File2)
    # copy validation files for PP
    File1 = os.path.join(sourceFolderClass2, sourceFiles2[i])
    File2 = os.path.join(destFolderClass2_val,   sourceFiles2[i])
    shutil.copy(File1, File2)

print('--> Done!')

--> Validation split ...
--> Done!


In [6]:
# Copy training images for PSOE
print('--> Train split ...')
for i in range(val_files,len(sourceFiles1)):
    File1 = os.path.join(sourceFolderClass1,  sourceFiles1[i])
    File2 = os.path.join(destFolderClass1_tr, sourceFiles1[i])
    shutil.copy(File1,File2)
# Copy training images for PP
for i in range(val_files,len(sourceFiles2)):
    File1 = os.path.join(sourceFolderClass2,  sourceFiles2[i])
    File2 = os.path.join(destFolderClass2_tr, sourceFiles2[i])
    shutil.copy(File1, File2)

print('--> Done!')

--> Train split ...
--> Done!


Now the dataset is split into train and validation subfolders, each containing both classes:
* **100** images in the entire dataset;
* **80** images for training: 40 PSOE + 40 PP;
* **20** images for validation: 10 PSOE + 10 PP.
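The copy cells above repeat the same pattern for each class and subset. As an optional refactoring sketch, the two helpers below wrap that pattern; `copy_files` and `split_class` are hypothetical names, not functions from the original notebook.

```python
import os
import shutil

def copy_files(file_names, src_folder, dst_folder):
    """Copy each named file from src_folder to dst_folder; return the number copied."""
    os.makedirs(dst_folder, exist_ok=True)
    for name in file_names:
        shutil.copy(os.path.join(src_folder, name), os.path.join(dst_folder, name))
    return len(file_names)

def split_class(file_names, src_folder, train_folder, val_folder, n_val):
    """Send the first n_val names to validation and the rest to train."""
    n_val_copied = copy_files(file_names[:n_val], src_folder, val_folder)
    n_train_copied = copy_files(file_names[n_val:], src_folder, train_folder)
    return n_train_copied, n_val_copied
```

With the notebook's variables, a call such as `split_class(sourceFiles1, sourceFolderClass1, destFolderClass1_tr, destFolderClass1_val, val_files)` would handle one class and return the `(train, validation)` counts.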

Let's check the composition of the subsets for the future classification:

In [7]:
print('--> Dataset: data_politics')
print('> Train - izquierda:', len(os.listdir(destFolderClass1_tr)))
print('> Train - derecha:', len(os.listdir(destFolderClass2_tr)))
print('> Validation - izquierda:', len(os.listdir(destFolderClass1_val)))
print('> Validation - derecha:', len(os.listdir(destFolderClass2_val)))

--> Dataset: data_politics
> Train - izquierda: 40
> Train - derecha: 40
> Validation - izquierda: 10
> Validation - derecha: 10


We are ready to use deep learning to train a classifier for PSOE vs. PP political affinity. Remember that you can modify the dataset split, manually remove specific files, use different folder names, etc.
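One thing that may be worth checking before training, especially after re-running the split with a different seed or count without emptying the destination folders first: copies accumulate, so the same image could end up in both `train` and `validation`. The following is a minimal sketch of such a check; `shared_files` is a hypothetical helper, and the paths assume the folder layout used in this notebook.

```python
import os

def shared_files(folder_a, folder_b):
    """Return the sorted file names that appear in both folders."""
    return sorted(set(os.listdir(folder_a)) & set(os.listdir(folder_b)))

for label in ('izquierda', 'derecha'):
    train_dir = os.path.join('data_politics', 'train', label)
    val_dir = os.path.join('data_politics', 'validation', label)
    # Skip quietly if the folders are not there
    if os.path.isdir(train_dir) and os.path.isdir(val_dir):
        print(label, 'overlap:', shared_files(train_dir, val_dir))
```

An empty list for each class means no file name is shared between the two subsets.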

Let's create some classifiers with the next script [Small_CNNs.ipynb](./Small_CNNs.ipynb).

Have fun with DL! @muntisa

### Acknowledgements

I gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research ([https://developer.nvidia.com/academic_gpu_seeding](https://developer.nvidia.com/academic_gpu_seeding)).