##### **Mildew Detection in Cherry Leaves**

# **Data collection and preparation**

## Objectives

* Collect data from Kaggle and save it to separate folders in the CodeAnywhere workspace.
* Prepare the data for the project: remove non-image files and split the images into train, validation and test sets.
  
## Inputs

* Kaggle authentication token (kaggle.json).

## Outputs

* Dataset: inputs/cherry-leaves/leaf_images 

## Additional Comments

* No additional comments.



---

## Install Requirements and import packages

In [None]:
!pip install -r /workspaces/mildew-detection/requirements.txt

In [None]:
import numpy
import os

## Change working directory

Change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [None]:
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory (the project root) the new current directory
* os.path.dirname() returns the parent of a given path
* os.chdir() sets the new current directory; here the project root is passed as an explicit path

In [None]:
os.chdir('/workspaces/mildew-detection/')
print("You set a new current directory")
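
The hard-coded path above only works in this workspace. As a minimal sketch of the `os.path.dirname()` approach, assuming the notebook sits one level below the project root (the `jupyter_notebooks` folder name is a hypothetical example, not taken from this project):

```python
import os


def parent_of(directory):
    """Return the parent folder of the given directory path."""
    return os.path.dirname(directory)


# Hypothetical layout: the notebook lives in a sub-folder of the project.
print(parent_of("/workspaces/mildew-detection/jupyter_notebooks"))

# From a running notebook, the switch itself would be:
# os.chdir(parent_of(os.getcwd()))
```

Deriving the path this way keeps the notebook usable if the project folder is renamed or cloned elsewhere.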

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Install Kaggle package

The Kaggle package handles the API calls used to download the dataset.

In [None]:
!pip install kaggle

---

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
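
The cell above expects `kaggle.json` to be present in the current directory. A minimal sketch of a pre-flight check, assuming the token uses the usual `username` and `key` fields; the helper name `kaggle_token_ready` is hypothetical:

```python
import json
import os
import stat


def kaggle_token_ready(config_dir):
    """Return True if config_dir holds a kaggle.json with 'username' and 'key'."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path) as token_file:
            token = json.load(token_file)
    except json.JSONDecodeError:
        return False
    if not isinstance(token, dict):
        return False
    if not all(field in token for field in ("username", "key")):
        return False
    # Same effect as `chmod 600`: read/write for the owner only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True
```

Calling `kaggle_token_ready(os.getcwd())` before the download cell gives an early, readable failure instead of an authentication error from the Kaggle CLI.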

In [None]:
KaggleDatasetPath = 'codeinstitute/cherry-leaves'
DestinationFolder = 'inputs/cherry-leaves'
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
&& rm {DestinationFolder}/*.zip
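
The shell commands above rely on `unzip` and `rm` being available. As a sketch of the same step in pure Python, using the standard-library `zipfile` module (the helper name `extract_and_remove` is hypothetical):

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
    # Only reached if extraction succeeded, like `unzip ... && rm ...`.
    os.remove(zip_path)
```

As with the `&&` in the shell version, the archive is deleted only when extraction completes without raising an error.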


# Data preparation

### Check for and remove non-image files

In [None]:
def remove_non_image_file(my_data_dir):
    """Delete every file in each sub-folder that is not a .png, .jpg or .jpeg."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # remove non-image file
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
        print(f"Folder: {folder} - has image file {image_count}")
        print(f"Folder: {folder} - has non-image file {non_image_count}")

In [None]:
remove_non_image_file(my_data_dir='inputs/cherry-leaves/leaf_images')
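
Because deletion cannot be undone, it can help to preview what would be removed first. A minimal dry-run sketch, assuming the same folder layout and extensions as the function above; `list_non_image_files` is a hypothetical helper and, unlike the function above, it skips top-level entries that are not folders:

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def list_non_image_files(my_data_dir):
    """Return the paths that the cleanup would delete, without deleting them."""
    flagged = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # ignore stray files next to the class folders
        for file_name in sorted(os.listdir(folder_path)):
            if not file_name.lower().endswith(IMAGE_EXTENSIONS):
                flagged.append(os.path.join(folder_path, file_name))
    return flagged
```

An empty list means the cleanup has nothing to delete.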

### Split dataset into train, validation and test

In [None]:
import os
import shutil
import random
import joblib


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

    # compare with a tolerance: floating-point sums can differ from 1.0 by rounding error
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # gets classes labels
    labels = os.listdir(my_data_dir)  # it should get only the folder name
    if 'test' in labels:
        pass
    else:
        # create train, test folders with classes labels sub-folder
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)
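
The function above shuffles with the global random state, so each run produces a different split. A minimal sketch of a reproducible shuffle, assuming a fixed seed is acceptable for this project (the helper name `shuffled_copy` and the seed value are hypothetical choices):

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return the file names in a shuffled order that is stable across runs."""
    # os.listdir() gives no ordering guarantee, so sort before shuffling.
    ordered = sorted(file_names)
    # A dedicated Random instance leaves the global random state untouched.
    random.Random(seed).shuffle(ordered)
    return ordered
```

Using `files = shuffled_copy(os.listdir(...))` in place of the in-place shuffle would make the train, validation and test sets repeatable.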


We split the data into three sets using a conventional ratio:
* Training: 0.70
* Validation: 0.10
* Test: 0.20

In [None]:
split_train_validation_test_images(my_data_dir="inputs/cherry-leaves/leaf_images",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
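
The train and validation counts are truncated with `int()`, and the test set receives whatever remains, so the actual proportions can differ slightly from the requested ratios. A short worked sketch of that arithmetic, using a hypothetical class folder of 2104 images (`split_sizes` is an illustrative helper, not part of the notebook):

```python
def split_sizes(total_files, train_set_ratio, validation_set_ratio):
    """Mirror the split's counting: truncate train and validation, test takes the rest."""
    train_qty = int(total_files * train_set_ratio)
    validation_qty = int(total_files * validation_set_ratio)
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


# Hypothetical class folder with 2104 images.
print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Here the test set ends up with 422 images rather than 420.8, because the fractional parts dropped from the other two sets fall through to it.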

# Conclusion

Well done.
We have now successfully collected the relevant data and, as data preparation, split it into three datasets to train, validate and test our upcoming model.

In the next notebook, we will visualize our data.
