# Data Transformation Pipeline

**Before** running the code, complete the following steps:
1. Download the WIDER FACE Training and Validation datasets from [here](http://shuoyang1213.me/WIDERFACE/)
2. Unzip the datasets
3. Download the Pascal VOC annotations for the dataset from [here](https://github.com/akofman/wider-face-pascal-voc-annotations)
4. Create the following folder structure:
```bash
├───images
│   ├───train
│   │   ├───[All WIDER train Annotation Files].xml
│   │   └───[All WIDER train Image Folders]
│   └───test
│       ├───[All WIDER val Annotation Files].xml
│       └───[All WIDER val Image Folders]
└───data_transformation.ipynb
```
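
Before running any cells, it can help to confirm that the two dataset folders from the layout above exist. The following is only a minimal sketch: the helper name `missing_folders` is hypothetical, and it assumes the notebook is started from the folder that contains `images`.

```python
import os


def missing_folders(root="."):
    """Return the expected dataset folders that do not exist under root."""
    expected = [
        os.path.join(root, "images", "train"),
        os.path.join(root, "images", "test"),
    ]
    return [path for path in expected if not os.path.isdir(path)]


# Assumption: the current working directory is the notebook's folder.
missing = missing_folders()
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("Folder structure looks complete.")
```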


# Run this script once to move all nested images and annotations to the top-level folder:

In [9]:
import glob
import os
import shutil

# Move every nested .jpg and .xml file up into the top-level folder of its split.
for split in ["train", "test"]:
    dst_dir = os.path.join("images", split)
    moves = [
        (os.path.join(dst_dir, "images"), "*.jpg"),
        (os.path.join(dst_dir, "images", "annotations"), "*.xml"),
    ]
    for src_dir, pattern in moves:
        # Without recursive=True, "**" behaves like "*" and matches exactly one folder level.
        for path in glob.iglob(os.path.join(src_dir, "**", pattern)):
            shutil.move(path, dst_dir)


Now your directory structure should look like this:
```bash
├───images
│   ├───train
│   │   ├───[All WIDER train Annotation Files].xml
│   │   └───[All WIDER train Image Files].jpg
│   └───test
│       ├───[All WIDER val Annotation Files].xml
│       └───[All WIDER val Image Files].jpg
└───data_transformation.ipynb
```
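
One way to check the result is to count the files in each folder. This is a minimal sketch rather than part of the original notebook; `summarize_split` is a hypothetical helper, and it assumes lowercase `.jpg` and `.xml` extensions.

```python
import glob
import os


def summarize_split(split_dir):
    """Count top-level images and annotations, plus images still left in subfolders."""
    top_level_images = len(glob.glob(os.path.join(split_dir, "*.jpg")))
    # With recursive=True, "**" matches zero or more folder levels,
    # so this count includes the top-level images as well.
    all_images = len(glob.glob(os.path.join(split_dir, "**", "*.jpg"), recursive=True))
    return {
        "images": top_level_images,
        "annotations": len(glob.glob(os.path.join(split_dir, "*.xml"))),
        "nested_images": all_images - top_level_images,
    }


for split in ["train", "test"]:
    print(split, summarize_split(os.path.join("images", split)))
```

If `nested_images` is not zero, some images were not moved to the top level.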

# Create CSV Files from your XML annotations:

In [10]:
# Source: https://github.com/datitran/raccoon_dataset
import glob
import os
import xml.etree.ElementTree as ET

import pandas as pd


def xml_to_csv(path):
    """Collect one row per annotated object from every XML file in path."""
    xml_list = []
    for xml_file in glob.glob(os.path.join(path, '*.xml')):
        tree = ET.parse(xml_file)
        root = tree.getroot()
        for member in root.findall('object'):
            # Fields are read by position, in the order of the column names below:
            # size[0] and size[1] hold width and height, member[0] the class name,
            # and member[4] the bounding box.
            value = (root.find('filename').text,
                     int(root.find('size')[0].text),
                     int(root.find('size')[1].text),
                     member[0].text,
                     int(member[4][0].text),
                     int(member[4][1].text),
                     int(member[4][2].text),
                     int(member[4][3].text)
                     )
            xml_list.append(value)
    column_name = ['filename', 'width', 'height', 'class', 'xmin', 'ymin', 'xmax', 'ymax']
    xml_df = pd.DataFrame(xml_list, columns=column_name)
    return xml_df


def main():
    # Writes images/train_labels_all.csv and images/test_labels_all.csv.
    for folder in ['train', 'test']:
        image_path = os.path.join(os.getcwd(), 'images', folder)
        xml_df = xml_to_csv(image_path)
        xml_df.to_csv(os.path.join('images', folder + '_labels_all.csv'), index=False)
        print('Successfully converted xml to csv.')


main()


Successfully converted xml to csv.
Successfully converted xml to csv.


1. Open the CSV files in Excel.
2. In Power Query, remove all images with small faces (< 20 px in any direction).
3. Export to CSV (the file must be comma-separated, not semicolon-separated).
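
As an alternative to the manual Excel steps above, the filtering could also be scripted. The sketch below is not the original workflow. It assumes that "small" means a bounding box narrower or shorter than 20 px, and that an image with at least one such face is dropped entirely. The names `drop_images_with_small_faces` and `filter_csv` are hypothetical.

```python
import csv

MIN_FACE_PX = 20  # threshold taken from the manual step above


def drop_images_with_small_faces(rows, min_px=MIN_FACE_PX):
    """Keep only rows whose image has no face box narrower or shorter than min_px."""
    small = {
        row["filename"]
        for row in rows
        if int(row["xmax"]) - int(row["xmin"]) < min_px
        or int(row["ymax"]) - int(row["ymin"]) < min_px
    }
    return [row for row in rows if row["filename"] not in small]


def filter_csv(src_path, dst_path, min_px=MIN_FACE_PX):
    """Filter a label CSV and write a comma-separated result; return (rows read, rows kept)."""
    with open(src_path, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    kept = drop_images_with_small_faces(rows, min_px)
    with open(dst_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(kept)
    return len(rows), len(kept)
```

With the filenames used in this notebook, a call such as `filter_csv("images/train_labels_all.csv", "images/train_labels.csv")` would produce the filtered training labels. The same call with the `test` files covers the validation set.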

# Note
The following files are needed for the export. Pay attention to the filenames, because the next step reads the CSV files under exactly these names:
```bash
├───images
│   ├───train
│   │   ├───[All WIDER train Annotation Files].xml
│   │   └───[All WIDER train Image Files].jpg
│   ├───train_labels.csv
│   ├───test
│   │   ├───[All WIDER val Annotation Files].xml
│   │   └───[All WIDER val Image Files].jpg
│   └───test_labels.csv
└───data_transformation.ipynb
```

# Let Python copy only the valid images to C:/tensorflow1/models/research/object_detection/images:

In [17]:
import csv
import os
import shutil

dst_root = "C:/tensorflow1/models/research/object_detection/images"

for split in ["train", "test"]:
    src_dir = os.path.join("images", split)
    dst_dir = os.path.join(dst_root, split)
    # Create the target folder first; otherwise shutil.copy would write a single file named like the folder.
    os.makedirs(dst_dir, exist_ok=True)

    with open(os.path.join("images", split + "_labels.csv"), newline="") as csvfile:
        reader = csv.reader(csvfile, delimiter=",")
        next(reader, None)  # skip the header row
        # A filename appears once per face, so collect each name only once.
        filenames = {row[0] for row in reader if row}

    for filename in filenames:
        shutil.copy(os.path.join(src_dir, filename), dst_dir)


# Last Step
Copy the train and test label CSV files (train_labels.csv and test_labels.csv) to C:/tensorflow1/models/research/object_detection/images/
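
This last copy can also be done from Python. The following is a minimal sketch, assuming the label files sit in the `images` folder as shown in the Note section; `copy_label_files` is a hypothetical helper name.

```python
import os
import shutil


def copy_label_files(src_dir, dst_dir, names=("train_labels.csv", "test_labels.csv")):
    """Copy the label CSV files from src_dir to dst_dir and return the new paths."""
    os.makedirs(dst_dir, exist_ok=True)  # make sure the target folder exists
    return [shutil.copy(os.path.join(src_dir, name), dst_dir) for name in names]


# Usage with the paths from this notebook:
# copy_label_files("images", "C:/tensorflow1/models/research/object_detection/images")
```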