# **Data Collection**

## Objectives

* Download data from Kaggle and prepare it for the next steps

* Clean and split the data into train, test and validation sets

## Inputs

* Kaggle JSON file - authentication key

* Kaggle API - to download the data

## Outputs

* Train, validation and test sets in `inputs/cherry_leaves_dataset/cherry-leaves`



---

## Setting up the environment
### Install requirements

In [None]:
! pip install -r /Users/oksanaerm/ML/ML-mildew-detection/requirements.txt

___
### Import libraries

In [1]:
import numpy
import os

___
### Change working directory

We need to change the working directory from the notebook's folder to its parent folder.
* First, we access the current directory with `os.getcwd()`

In [2]:
current_dir = os.getcwd()
current_dir

'/Users/oksanaerm/ML/ML-mildew-detection/jupyter_notebooks'

We want to make the parent of the current directory the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/Users/oksanaerm/ML/ML-mildew-detection'

___
### Install Kaggle

In [9]:
! pip install kaggle

Collecting kaggle
  Using cached kaggle-1.5.13.tar.gz (63 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.13-py3-none-any.whl size=77733 sha256=91e79661a108a1caba861877a1d69cc846c25fec6b9b57fb5848dbacd14354b6
  Stored in directory: /Users/oksanaerm/Library/Caches/pip/wheels/9c/45/15/6d6d116cd2539fb8f450d64b0aee4a480e5366bb11b42ac763
Successfully built kaggle
Installing collected packages: kaggle
Successfully installed kaggle-1.5.13


Set the Kaggle configuration directory to the current working directory and restrict the permissions of the `kaggle.json` file

In [13]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
print(os.environ['KAGGLE_CONFIG_DIR'])
! chmod 600 kaggle.json

/Users/oksanaerm/ML/ML-mildew-detection
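
The same permission step can also be done from Python, which lets the notebook fail early when the token file is missing. This is a minimal sketch: the helper name `secure_kaggle_token` is hypothetical, and the mode `0o600` simply mirrors the `chmod 600` call above.

```python
import os


def secure_kaggle_token(config_dir):
    """Restrict kaggle.json to owner read/write and return its path.

    Hypothetical helper: raises FileNotFoundError if the token is absent.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"kaggle.json not found in {config_dir}")
    os.chmod(token_path, 0o600)  # same effect as `chmod 600 kaggle.json` on POSIX systems
    return token_path
```

It could be called as `secure_kaggle_token(os.environ['KAGGLE_CONFIG_DIR'])` once the environment variable has been set.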


___
# Download data from Kaggle

Set the Kaggle dataset path and the destination folder, then download the dataset

In [19]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
100%|█████████████████████████████████████▉| 55.0M/55.0M [00:08<00:00, 7.55MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:08<00:00, 7.01MB/s]


Extract the compressed file, and remove the original file.

In [20]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')

---

# Data Preparation

## Data Cleaning
Remove non-image files, if any

In [21]:
def remove_non_image_files(my_data_dir):
    """
    Search through dataset to identify and remove non image files
    """
    image_extensions = ('.png', '.jpg', '.jpeg')
    for root, dirs, files in os.walk(my_data_dir):
        # os.walk to traverse the directory structure and look for image files
        i = 0
        j = 0
        for file in files:
            if not file.lower().endswith(image_extensions):
                file_location = os.path.join(
                    root, file)  # construct the file path
                os.remove(file_location)  # remove non image file
                i += 1
            else:
                j += 1
        print(f"Folder: {root} - has image file(s): {j}")
        print(f"Folder: {root} - has non-image file(s): {i}")


The solution using `os.walk` is time/space efficient because `os.walk` is a generator: it yields one `(root, dirs, files)` tuple per directory as it goes, so the entire directory tree never has to be loaded into memory at once.

Each tuple carries the path of the directory being visited (`root`), so the full path of a file is built with a single `os.path.join(root, file)` call instead of tracking nested folder names manually.

It also avoids the need to call `os.path.isdir()` on each entry, because `os.walk` already separates sub-directory names (`dirs`) from file names (`files`) for every directory it visits.
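
To make the shape of what `os.walk` yields concrete, the sketch below builds a small throwaway tree and lists each `(root, dirs, files)` triple. The helper name `list_walk_triples` and the folder and file names are made up for illustration.

```python
import os
import tempfile


def list_walk_triples(base_dir):
    """Return (relative root, dirs, sorted files) for every folder under base_dir."""
    triples = []
    for root, dirs, files in os.walk(base_dir):
        dirs.sort()  # sorting in place makes the traversal order deterministic
        rel_root = os.path.relpath(root, base_dir)
        triples.append((rel_root, list(dirs), sorted(files)))
    return triples


# Build a tiny illustrative tree: two class folders and one stray text file.
with tempfile.TemporaryDirectory() as tmp:
    for label in ("healthy", "powdery_mildew"):
        os.makedirs(os.path.join(tmp, label))
        open(os.path.join(tmp, label, "leaf.jpg"), "w").close()
    open(os.path.join(tmp, "healthy", "notes.txt"), "w").close()

    walk_result = list_walk_triples(tmp)
    for rel_root, dirs, files in walk_result:
        print(rel_root, dirs, files)
```

The top-level folder reports its sub-folders in `dirs` and no files, while each class folder reports only file names, which is why the cleaning function can loop over `files` directly.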

In [22]:
remove_non_image_files(my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves')

Folder: inputs/cherry_leaves_dataset/cherry-leaves - has image file(s): 0
Folder: inputs/cherry_leaves_dataset/cherry-leaves - has non-image file(s): 0
Folder: inputs/cherry_leaves_dataset/cherry-leaves/powdery_mildew - has image file(s): 2104
Folder: inputs/cherry_leaves_dataset/cherry-leaves/powdery_mildew - has non-image file(s): 0
Folder: inputs/cherry_leaves_dataset/cherry-leaves/healthy - has image file(s): 2104
Folder: inputs/cherry_leaves_dataset/cherry-leaves/healthy - has non-image file(s): 0


## Split the data into train, validation, and test sets

The data is split 70% / 10% / 20% into train, validation and test sets. With 2104 images per class, the dataset is large enough that 70% is a good starting point for training. The 10% validation set provides enough data to tune the model's hyperparameters and to monitor overfitting to the training set. Lastly, the 20% test set provides a reasonable estimate of the model's performance on unseen data.

This ratio also balances the amount of data available for training against the size of the validation and test sets. Larger validation and test sets give more reliable estimates of the model's generalization performance, which is crucial for confirming that the model is not overfitting.
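
As a worked example of what these ratios mean in practice, the sketch below computes the per-class file counts for 2104 images. It assumes the counts are truncated with `int()` and that the test set takes whatever remains, as the split function in the next cell does; the helper name `split_counts` is hypothetical.

```python
def split_counts(n_files, train_ratio, validation_ratio):
    """Return (train, validation, test) counts; the test set takes the remainder."""
    train = int(n_files * train_ratio)            # truncates 1472.8 to 1472
    validation = int(n_files * validation_ratio)  # truncates 210.4 to 210
    test = n_files - train - validation
    return train, validation, test


counts = split_counts(2104, 0.7, 0.1)
print(counts)  # (1472, 210, 422)
```

Because of the truncation, the test set ends up slightly above 20% of each class.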

In [23]:
import shutil
import random
import joblib

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
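    """
    Split the dataset into train, validation and test sets by the given ratios
    (for example 0.7, 0.1 and 0.2)
    """
    # compare with a tolerance: a sum of floats may not equal 1.0 exactly
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return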

    # get the class labels
    labels = os.listdir(my_data_dir)  # assumes the folder contains only class sub-folders
    if 'test' in labels:
        pass  # the data was already split, so do nothing
    else:
        # create train, validation and test folders with a sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            # Move files to the appropriate set directories
            # enumerate gives each file's position in the shuffled list,
            # and that position decides which set the file is moved to
            for count, file_name in enumerate(files):
                if count < train_set_files_qty:
                    # move a given file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count < (train_set_files_qty + validation_set_files_qty):
                    # move a given file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # move given file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

            os.rmdir(my_data_dir + '/' + label)

In [24]:
split_train_validation_test_images(my_data_dir="inputs/cherry_leaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
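
The split cell prints nothing, so a quick count of the files in each set can confirm that it ran as intended. This is a minimal sketch: the helper name `count_split_files` is hypothetical, and it assumes the `train`, `validation` and `test` folder layout created above.

```python
import os


def count_split_files(data_dir, sets=("train", "validation", "test")):
    """Return {set_name: {label: file_count}} for the set folders found under data_dir."""
    summary = {}
    for set_name in sets:
        set_dir = os.path.join(data_dir, set_name)
        if not os.path.isdir(set_dir):
            continue  # skip sets that were not created
        summary[set_name] = {
            label: len(os.listdir(os.path.join(set_dir, label)))
            for label in sorted(os.listdir(set_dir))
            if os.path.isdir(os.path.join(set_dir, label))
        }
    return summary
```

Calling `count_split_files("inputs/cherry_leaves_dataset/cherry-leaves")` after the split should return one entry per set, each with a count for `healthy` and `powdery_mildew`.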

---

# Push files to GitHub

git add .

git commit -m "Add and prepare cherry leaves dataset"

git push

____

## Next Step
    
* [02 - Data Visualization.ipynb](02%20-%20Data_Visualization.ipynb)