# Milestone 1 - Data Collection and Preparation  

---  

# Objective of this notebook
* Prepare the image sets before the modelling phase

> **Note:** The content of this notebook follows the description provided in [Milestone-1](./Milestone-1.md)

---
## 1. Importing packages & modules
All the required modules/packages are imported here. If you prefer, feel free to import each one only where it is first needed.

In [1]:
# Common modules/packages
import math
import matplotlib.pyplot as plt
import numpy as np
import pathlib, os, shutil
import random
import requests
import warnings

from zipfile import ZipFile
from PIL import Image

from csv import reader

warnings.filterwarnings('ignore')

## 2. Data Collection
Modelling requires three image sets (`train`, `test`, `valid`), so a sufficient number of images must be collected beforehand. We propose three methods to download images and recommend the first one, since the other milestones build on it.

It is still worth understanding how images can be retrieved, so we encourage you to look at methods 2.2 and 2.3.

### 2.1. Extract from archive file (recommended)
A set of images (`downloads.zip`) is provided in the `Dataset` folder.

In [2]:
rootDir = pathlib.Path('/storage')
pathToDataset = rootDir.joinpath('Dataset')
os.chdir(pathToDataset)

images_file = os.path.join(pathToDataset, 'downloads.zip')

# Extract the whole archive into the `Dataset` folder
with ZipFile(images_file, 'r') as zipObj:
    zipObj.extractall(pathToDataset)
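
Before extracting, it can be useful to check what the archive contains. The sketch below counts the files per folder inside a zip file without extracting anything. The helper name `count_files_per_folder` is hypothetical, and the internal layout of `downloads.zip` (one folder per art category) is an assumption based on the rest of this notebook.

```python
from collections import Counter
from pathlib import PurePosixPath
from zipfile import ZipFile


def count_files_per_folder(zip_path):
    """Return {folder inside the archive: number of files} (hypothetical helper)."""
    counts = Counter()
    with ZipFile(zip_path, "r") as archive:
        for info in archive.infolist():
            # Directory entries carry no data, so they are skipped
            if info.is_dir():
                continue
            # Paths inside a zip archive always use forward slashes
            counts[str(PurePosixPath(info.filename).parent)] += 1
    return dict(counts)
```

Calling `count_files_per_folder(images_file)` in the cell above would then show how many images each folder of the archive holds.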

### 2.2. Direct download
The method below lets you be more specific, for example if you want to focus on one art category or if you already have a list of images. It assumes that you have one or more text files containing links to images, one URL per line. We provide two files as an example.

In [3]:
# Download images from a list of urls
def download_listed_images(filepath):

    # Check 'downloads' folder exists
    pathToDownload = pathlib.Path.cwd().joinpath('downloads_(direct)')
    if not pathToDownload.exists():
        pathToDownload.mkdir()

    # Check Art Category folder exists
    pathToDownload = pathToDownload.joinpath(filepath[:-4])
    if not pathToDownload.exists():
        pathToDownload.mkdir()
    
    # grab the list of URLs from the input file, then initialize the total number of images downloaded so far
    with open(filepath) as urlFile:
        urls = urlFile.read().strip().split("\n")
    urlCounter = 0

    # loop the URLs
    for url in urls:
        try:
            # try to download the image; HTTP errors (4xx/5xx) are treated as failures
            req = requests.get(url, timeout=60)
            req.raise_for_status()

            # save the image to disk
            pathToDownloadedImage = pathToDownload.joinpath("{}.jpg".format(str(urlCounter).zfill(8)))
            with open(pathToDownloadedImage, "wb") as imagePath:
                imagePath.write(req.content)

            # update the counter
            print("[INFO] downloaded: {}".format(pathToDownloadedImage))
            urlCounter += 1
            
        # handle any exception thrown during the download process
        # (the URL is reported because the target path may not be set yet)
        except Exception as exc:
            print("[INFO] error downloading {} ({})...skipping".format(url, exc))

Let's invoke the `download_listed_images` function for each file containing URLs

In [4]:
%%script false --no-raise-error
pathToDataset = rootDir.joinpath('Dataset')
os.chdir(pathToDataset)

# List of files
image_files = ['urls_cubism.txt', 'urls_surrealism.txt']
for image_file in image_files:
    download_listed_images(image_file)
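
The URL files are read as plain text with one link per line. As a minimal sketch, assuming that blank lines and repeated links should be ignored (the download function above does not do this), the list could be cleaned before downloading. The helper name `read_url_list` is hypothetical.

```python
def read_url_list(filepath):
    """Return the unique, non-empty URLs of a text file, in file order (hypothetical helper)."""
    seen = set()
    urls = []
    with open(filepath, encoding="utf-8") as url_file:
        for line in url_file:
            url = line.strip()
            # Skip blank lines and links that were already listed
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```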

## 3. Data Preparation
Now that the images are downloaded, let's prepare the datasets. 

For example, the training images are all stored in a directory path that looks like this:
```
dataset/train/artCategory_1/abc123.jpg
dataset/train/artCategory_1/abc456.jpg
dataset/train/artCategory_1/abc789.jpg
...
dataset/train/artCategory_2/abc123.jpg
dataset/train/artCategory_2/abc456.jpg
dataset/train/artCategory_2/abc789.jpg
```

In this case, the root folder for training is `dataset/train` and the classes are the names of the art categories. Likewise, `dataset/valid` and `dataset/test` are the root folders for validation and testing respectively.

### 3.1. Preparation functions (TO DO)

Before spreading the images, let's create two utility functions:
* One function should return a list of files present in a specific directory
* One function should return a sorted list of folder names present in a specific directory

In [5]:
# Retrieves the list of files within a directory
def getFilesInDirectory(pathToDir, extension = "*.*"):
    if not isinstance(pathToDir, pathlib.PurePath):
        pathToDir = pathlib.Path(pathToDir)
    if not pathToDir.exists():
        raise OSError("Directory not found: {}".format(pathToDir))
    return [p for p in pathToDir.glob(extension) if p.is_file()]
    
# Retrieves the sorted list of folder names within a directory,
# skipping the folders whose name starts with `prefix` (e.g. "." for hidden folders)
def getFolderNamesInDirectory(pathToDir, prefix = ""):
    if not isinstance(pathToDir, pathlib.PurePath):
        pathToDir = pathlib.Path(pathToDir)
    if not pathToDir.exists():
        raise OSError("Directory not found: {}".format(pathToDir))
    list_of_dirs = [d.name for d in pathToDir.iterdir() if d.is_dir()]
    if prefix:
        list_of_dirs = [d for d in list_of_dirs if not d.startswith(prefix)]
    return sorted(list_of_dirs)

### 3.2. Prepare the images
* Set the location for `train`, `test` and `valid` folders and create the missing folders

In [6]:
# Sets the root folder for image sets
pathToDataset  = rootDir.joinpath('Dataset')
pathToDownload = pathToDataset.joinpath('downloads')

pathToTrain = pathToDataset.joinpath('train')
if not pathToTrain.exists():
    pathToTrain.mkdir()

pathToTest = pathToDataset.joinpath('test')
if not pathToTest.exists():
    pathToTest.mkdir()

pathToValid = pathToDataset.joinpath('valid')
if not pathToValid.exists():
    pathToValid.mkdir()

# Sets the folder for models (where all the models will be saved)
pathToModels = pathToDataset.joinpath('..', 'models')
if not pathToModels.exists():
    pathToModels.mkdir()

* Count the art categories and list them using the function above

In [7]:
# list the folders required under 'dataset' folder (using a list to reduce the lines of code)
artCategories = getFolderNamesInDirectory(pathToDownload, ".")  #collects the list of folders
print("Total no. of categories = ", len(artCategories))  #displays the number of classes (= Art categories)
print("Categories: ", artCategories)  #displays the list of classes

Total no. of categories =  4
Categories:  ['genre', 'landscape', 'portrait', 'still-life']


* For each art category in the `downloads` folder, spread the images across the `train` folder (about 60% of them), the `test` folder (about 20%) and the `valid` folder (about 20%) (TO DO)

In [8]:
# For each art category
for artCategory in artCategories:

    # Sets the source folder
    path_source = pathToDownload.joinpath(artCategory)
    
    # Sets the datasets
    
    # lists all the 'jpg' images in the folder
    files = getFilesInDirectory(path_source, extension="*.jpg")
    # Shuffle the images
    random.shuffle(files)
    # Determines the split size: one fifth of the files (= 20%)
    split_idx = round(len(files) / 5)
    # Split the files across the 3 datasets
    split_images = np.split(files, [3*split_idx, 4*split_idx])

    # Sets the target folders
    path_target_train = pathToTrain.joinpath(artCategory)
    if not path_target_train.exists():
        path_target_train.mkdir()
    for img_file in split_images[0]:
        shutil.move(img_file, path_target_train.joinpath(img_file.name))    
            
    path_target_test = pathToTest.joinpath(artCategory)
    if not path_target_test.exists():
        path_target_test.mkdir()
    for img_file in split_images[1]:
        shutil.move(img_file, path_target_test.joinpath(img_file.name))    

    path_target_valid = pathToValid.joinpath(artCategory)
    if not path_target_valid.exists():
        path_target_valid.mkdir()
    for img_file in split_images[2]:
        shutil.move(img_file, path_target_valid.joinpath(img_file.name))
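
To see how the split indices behave, the sketch below applies the same index arithmetic as the cell above (`3*split_idx` and `4*split_idx`) to any list, using plain list slicing instead of `np.split` and without moving files. The function name `split_three_way` and the `seed` parameter are hypothetical additions for illustration; a fixed seed makes the shuffle reproducible.

```python
import random


def split_three_way(items, seed=None):
    """Shuffle a copy of `items` and split it into train/test/valid parts (hypothetical helper)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    # One fifth of the items, rounded to the nearest integer
    split_idx = round(len(shuffled) / 5)

    train = shuffled[:3 * split_idx]               # about 60%
    test = shuffled[3 * split_idx:4 * split_idx]   # about 20%
    valid = shuffled[4 * split_idx:]               # the remainder
    return train, test, valid


train, test, valid = split_three_way(range(103), seed=42)
print(len(train), len(test), len(valid))
```

When the number of files is not a multiple of 5, the rounding difference ends up in the `valid` part, so the proportions are only approximately 60/20/20.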

## 4. Check the folder content

You should have the following structure:
 * (image-segmentation) >  dataset >  downloads  
 * (image-segmentation) >  dataset >  test  
 * (image-segmentation) >  dataset >  train  
 * (image-segmentation) >  dataset >  valid  

Within each of the `train`, `test` and `valid` folders, you should find one folder per art category containing the images. Some of the downloaded files might be corrupted or simply not be images. The code below removes these files.

In [9]:
def cleanImages(location, artCategories):

    if not isinstance(location, pathlib.PurePath):
        location = pathlib.Path(location)

    # For each art category
    for artCategory in artCategories:

        # Sets the source folder
        path_source = location.joinpath(artCategory)

        # Sets the datasets
        files = getFilesInDirectory(path_source, '*.jpg')    # lists all the 'jpg' images in the folder

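        for file in files:
            try:
                # Image.open only reads the header; verify() checks the file integrity
                with Image.open(file) as img:
                    img.verify()
            except (IOError, SyntaxError):
                # removes the files that cannot be read as images
                print( file )
                os.remove(file)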

cleanImages(pathToTrain, artCategories)
cleanImages(pathToValid, artCategories)
cleanImages(pathToTest, artCategories)
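
After cleaning, a quick count per art category helps confirm that each set has the expected content. This is a minimal sketch that assumes the `<set>/<artCategory>/<image>.jpg` layout described in section 3; the function name `count_images_per_class` is hypothetical.

```python
from pathlib import Path


def count_images_per_class(split_dir, pattern="*.jpg"):
    """Return {class folder name: number of matching files} for one image set (hypothetical helper)."""
    split_dir = Path(split_dir)
    return {
        class_dir.name: sum(1 for p in class_dir.glob(pattern) if p.is_file())
        for class_dir in sorted(split_dir.iterdir())
        if class_dir.is_dir()
    }
```

For instance, `count_images_per_class(pathToTrain)` would return one entry per art category with the number of remaining `jpg` files.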
