# **01 - DataCollection**

## Objectives

* Fetch the dataset from Kaggle, clean it, and split it into train, validation and test sets for the following notebooks.

## Inputs

* Kaggle JSON file - the authentication token. 

## Outputs

* Generate Dataset: inputs/cherry_leaves_dataset

## Additional Comments

* No additional comments. 



---

# Collect and Store Dataset

## Change working directory

Change the working directory from the current folder to its parent, /workspace/Mildew-Detection-in-Cherry-Leaves.

The output of the cell below should be '/workspace/Mildew-Detection-in-Cherry-Leaves'.

In [1]:
import os
os.chdir(os.path.dirname(os.getcwd()))
current_dir = os.getcwd()
current_dir

'/workspace/Mildew-Detection-in-Cherry-Leaves'

## Get data from Kaggle

* Install Kaggle

In [2]:
# install kaggle package
%pip install kaggle

[notice] A new release of pip is available: 23.1.1 -> 23.1.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


* Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON

In [3]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
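
As an alternative to the shell `chmod` call, the same preparation can be sketched in plain Python. The helper name `prepare_kaggle_credentials` is hypothetical (it is not part of this notebook or of the Kaggle package), and the sketch assumes that `kaggle.json` sits directly in the given directory.

```python
import os
import stat
from pathlib import Path


def prepare_kaggle_credentials(config_dir):
    """Hypothetical helper: point Kaggle at `config_dir` and lock down the token file."""
    token_path = Path(config_dir) / "kaggle.json"
    if not token_path.is_file():
        # Fail early with a clear message instead of a later authentication error.
        raise FileNotFoundError(f"Expected Kaggle token at {token_path}")
    # Equivalent of `chmod 600`: read and write for the owner only.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    os.environ["KAGGLE_CONFIG_DIR"] = str(config_dir)
    return token_path
```

Calling `prepare_kaggle_credentials(os.getcwd())` would mirror the cell above, with the added benefit of an explicit error when the token file is missing.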

* Set the Kaggle dataset path and the local destination directory, then download the dataset.

In [4]:
KaggleEndpoint = "codeinstitute/cherry-leaves"
LocalDirPath = "inputs/cherry_leaves_dataset"   
! kaggle datasets download -d {KaggleEndpoint} -p {LocalDirPath}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
 96%|████████████████████████████████████▌ | 53.0M/55.0M [00:01<00:00, 48.4MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 33.2MB/s]


* Unzip the downloaded zip file and then delete it.

In [5]:
import zipfile
with zipfile.ZipFile(LocalDirPath + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(LocalDirPath)

os.remove(LocalDirPath + '/cherry-leaves.zip')

---

# Clean Data

## Filtering out non-image files

In [6]:
import os
healthy_leaves='inputs/cherry_leaves_dataset/cherry-leaves/healthy'
powdery_mildew='inputs/cherry_leaves_dataset/cherry-leaves/powdery_mildew'
file_extension = ('.png', '.jpg', '.jpeg')

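def drop_non_imgs(folder):
    """Delete every file in `folder` that lacks an image extension and report the count."""
    files = os.listdir(folder)
    non_img_counter = 0
    for file in files:
        # endswith accepts a tuple, so all allowed extensions are checked at once
        if not file.lower().endswith(file_extension):
            file_path = folder + '/' + file
            os.remove(file_path)
            non_img_counter += 1  # count each removed file
    print(f"{folder} contained {non_img_counter} non image files before removal")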

drop_non_imgs(healthy_leaves)
drop_non_imgs(powdery_mildew)

inputs/cherry_leaves_dataset/cherry-leaves/healthy contained 0 non image files before removal
inputs/cherry_leaves_dataset/cherry-leaves/powdery_mildew contained 0 non image files before removal
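
Because the cell above deletes files permanently, a dry run can be useful first. The following is a minimal sketch under that assumption; `list_non_imgs` and `IMAGE_EXTENSIONS` are hypothetical names that do not appear elsewhere in this notebook. It only reports the files that would be removed.

```python
import os

# Same extensions as the cleaning cell; the constant name is an assumption.
IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def list_non_imgs(folder, extensions=IMAGE_EXTENSIONS):
    """Return the sorted names of regular files in `folder` without an image extension."""
    return sorted(
        name for name in os.listdir(folder)
        if os.path.isfile(os.path.join(folder, name))
        and not name.lower().endswith(extensions)
    )
```

Running `list_non_imgs(healthy_leaves)` before `drop_non_imgs(healthy_leaves)` shows what would be deleted without changing anything on disk.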


## Split dataset into train, validation and test sets

* Set the ratios for the train, validation and test sets

In [7]:
train_set_ratio = 0.7
validation_set_ratio = 0.1
test_set_ratio = 0.2

ratio_sum = round(train_set_ratio + validation_set_ratio + test_set_ratio, 2)
if not ratio_sum == 1:
    print(f'WARNING: The sum of the ratios must be 1. It is currently {ratio_sum} and needs to be corrected.')
else:
    print('Data set ratio is set.')


Data set ratio is set.
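
The check above rounds the sum to two decimals to sidestep floating-point error (in Python, `0.7 + 0.1 + 0.2` evaluates to `0.9999999999999999`). A minimal alternative sketch uses `math.isclose` instead; the function name `ratios_are_valid` is hypothetical.

```python
import math


def ratios_are_valid(train, validation, test):
    """Return True if each ratio lies strictly between 0 and 1 and they sum to 1."""
    ratios = (train, validation, test)
    each_in_range = all(0 < ratio < 1 for ratio in ratios)
    # isclose tolerates the tiny rounding error of binary floating point.
    return each_in_range and math.isclose(sum(ratios), 1.0)
```

With the values set above, `ratios_are_valid(train_set_ratio, validation_set_ratio, test_set_ratio)` returns `True`.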


* Create the new train, validation and test folders within the dataset directory

In [None]:
import os

data_dir = '/workspace/Mildew-Detection-in-Cherry-Leaves/inputs/cherry_leaves_dataset/cherry-leaves'

data_labels = os.listdir(data_dir)
folder_names = ['train', 'validation', 'test']

for foldername in folder_names:
    for data_label in data_labels:
        os.makedirs(name=data_dir + '/' + foldername + '/' + data_label)
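
The cell above raises `FileExistsError` if it is run a second time, because the folders already exist. As a sketch of a re-runnable variant, the hypothetical helper `make_split_folders` below passes `exist_ok=True` and takes the label names explicitly, assuming they are known in advance, rather than reading them from the directory listing.

```python
import os


def make_split_folders(base_dir, labels, splits=('train', 'validation', 'test')):
    """Create base_dir/<split>/<label> for every combination; safe to call repeatedly."""
    created = []
    for split in splits:
        for label in labels:
            path = os.path.join(base_dir, split, label)
            os.makedirs(path, exist_ok=True)  # no error if the folder already exists
            created.append(path)
    return created
```

For this dataset the call would be `make_split_folders(data_dir, ['healthy', 'powdery_mildew'])`.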

* Split and move the data to the new folders and delete the previous folders

In [8]:
import os
import shutil
import random

for data_label in data_labels:
    data = os.listdir(data_dir + '/' + data_label)
    random.shuffle(data)

    train_data_amount = round(len(data) * train_set_ratio)
    validation_data_amount = round(len(data) * validation_set_ratio)
    
    for img in data:
        # Fill the train folder first, then validation; the remainder goes to test.
        # Strict '<' keeps each folder at exactly its target size.
        if len(os.listdir(data_dir + '/train/' + data_label)) < train_data_amount:
            shutil.move(data_dir + '/' + data_label + '/' + img,
                        data_dir + '/train/' + data_label + '/' + img)
        elif len(os.listdir(data_dir + '/validation/' + data_label)) < validation_data_amount:
            shutil.move(data_dir + '/' + data_label + '/' + img,
                        data_dir + '/validation/' + data_label + '/' + img)
        else:
            shutil.move(data_dir + '/' + data_label + '/' + img,
                        data_dir + '/test/' + data_label + '/' + img)

    os.rmdir(data_dir + '/' + data_label)
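
The partitioning logic of the cell above can also be sketched without touching the filesystem, by slicing the shuffled list once. The function `split_filenames` and its `seed` parameter are assumptions added for illustration; the notebook itself shuffles without a fixed seed, so its split is not reproducible between runs.

```python
import random


def split_filenames(filenames, train_ratio, validation_ratio, seed=42):
    """Partition file names into train, validation and test lists by ratio."""
    shuffled = sorted(filenames)           # fixed starting order, independent of listdir
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for a reproducible split
    train_end = round(len(shuffled) * train_ratio)
    validation_end = train_end + round(len(shuffled) * validation_ratio)
    return {
        'train': shuffled[:train_end],
        'validation': shuffled[train_end:validation_end],
        'test': shuffled[validation_end:],  # whatever remains after the first two slices
    }
```

The returned lists could then drive the `shutil.move` calls, for example by looping over `split_filenames(data, train_set_ratio, validation_set_ratio).items()`. This avoids calling `os.listdir` on the target folders for every image.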

---

# Conclusions and Next Steps

* The dataset is collected from Kaggle and stored in the inputs folder.
* No input data is pushed to the GitHub repo.
* The next step is data visualization, found in the 02 - DataVisualization notebook.