# Data Collection for Mildew Detection in Cherry Leaves

## Objectives
* Fetch the dataset of cherry leaf images from the provided source and examine its structure.
* Save the raw image data in an organized directory structure for easy access in later stages.

## Inputs
* Dataset URL or access key if the dataset is hosted on platforms like Kaggle.

## Outputs
* Directory structure containing the raw dataset divided into training, validation, test, and final test sets.

## Additional Comments
* Ensure compliance with any data use agreements or NDAs associated with the dataset.
---

# Import packages


In [1]:
%pip install -r "/Users/jordanfletorides/Desktop/github repos/ml-mildew-detector/dev_requirements"

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy
import os

---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

The working directory must move from the notebooks folder to its parent folder.
* `os.getcwd()` returns the current directory.

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/jordanfletorides/Desktop/github repos/ml-mildew-detector/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory
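
Re-running the cell above moves the working directory up one more level each time, which is an easy mistake to make in a notebook. A minimal sketch of a guard, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed earlier); the helper name `ensure_project_root` is hypothetical:

```python
import os

def ensure_project_root(notebooks_dirname="jupyter_notebooks"):
    """Move to the parent folder only while we are still inside the notebooks folder."""
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebooks_dirname:
        os.chdir(os.path.dirname(cwd))
    # Return the directory we ended up in, whether or not it changed
    return os.getcwd()
```

Calling `ensure_project_root()` a second time leaves the working directory unchanged, so the cell becomes safe to re-run.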


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

# Change file permissions (Unix-like OS only); on a first run this fails harmlessly because the dataset has not been downloaded yet
! chmod -R u+w inputs/cherry_leaves_dataset

chmod: inputs/cherry_leaves_dataset: No such file or directory


---

# Install Kaggle

In [6]:
# install kaggle package
%pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


Run the cell below **to change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON**.

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
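
Before calling the Kaggle CLI, it can help to confirm that the token file sits where `KAGGLE_CONFIG_DIR` points and is readable only by its owner. A sketch using the standard library; the helper name `check_kaggle_token` is hypothetical, and the permission check is only meaningful on Unix-like systems:

```python
import os
import stat

def check_kaggle_token(config_dir):
    """Return (exists, owner_only) for kaggle.json inside config_dir."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False, False
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    # Owner-only means no permission bits are set for group or others
    owner_only = (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
    return True, owner_only
```

For example, `check_kaggle_token(os.getcwd())` should report `(True, True)` once `kaggle.json` is in place with mode 600.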

* Get the dataset path from the [Kaggle URL](https://www.kaggle.com/datasets/codeinstitute/cherry-leaves/data). When viewing the dataset on Kaggle, take the part of the URL that follows https://www.kaggle.com/ (or, in some cases, kaggle.com/datasets/) and assign it to `KaggleDatasetPath`.
* Set your destination folder.
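
The URL step can also be done programmatically. The sketch below keeps only the first two path segments, on the assumption that the CLI's `-d` option expects `<owner>/<dataset-name>`; the notebook's own value keeps the trailing `/data`, and the download output shows that this works too. The helper name `kaggle_dataset_slug` is hypothetical:

```python
from urllib.parse import urlparse

def kaggle_dataset_slug(url):
    """Extract '<owner>/<dataset-name>' from a Kaggle dataset page URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    # Some URLs carry a leading 'datasets' segment; drop it if present
    if parts and parts[0] == "datasets":
        parts = parts[1:]
    if len(parts) < 2:
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[:2])
```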

Set the Kaggle dataset path and download the dataset.

In [8]:
KaggleDatasetPath = "codeinstitute/cherry-leaves/data"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
100%|█████████████████████████████████████▉| 55.0M/55.0M [00:03<00:00, 17.9MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:03<00:00, 15.7MB/s]


Unzip the downloaded file, and delete the zip file.

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
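
Deleting the archive right after extraction is fine when the download is intact. A slightly more defensive sketch checks the archive with `ZipFile.testzip()` first and removes it only after a successful extraction; the helper name `extract_and_remove` is hypothetical:

```python
import os
import zipfile

def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        bad_member = zip_ref.testzip()  # Name of the first corrupt member, or None
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        zip_ref.extractall(destination)
    # Only reached if the integrity check and the extraction both succeeded
    os.remove(zip_path)
```

In this notebook the call would be `extract_and_remove(os.path.join(DestinationFolder, 'cherry-leaves.zip'), DestinationFolder)`.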

---

# Data Preparation

## Data cleaning

### Check and remove non-image files

In [10]:
def clear_non_images(root_directory_path):
    valid_extensions = ('.png', '.jpg', '.jpeg')  # Lowercase versions of valid extensions
    permission_denied_files = []

    for root, dirs, files in os.walk(root_directory_path):
        image_count, non_image_count = 0, 0
        for file in files:
            # Convert the file extension to lowercase before comparison to handle both lowercase and uppercase extensions
            if file.lower().endswith(valid_extensions):
                image_count += 1  # Count as an image file
            else:
                non_image_path = os.path.join(root, file)  # Full path for non-image files
                try:
                    os.remove(non_image_path)  # Attempt to remove non-image file
                    non_image_count += 1
                except PermissionError:
                    permission_denied_files.append(non_image_path)  # Record if permission denied

        # Reporting
        if image_count or non_image_count:
            # Only print summary for directories where files were found
            print(f"Processed folder: {root}")
            print(f"- Image files: {image_count}")
            print(f"- Non-image files removed: {non_image_count}")

    # Reporting permission issues
    if permission_denied_files:
        print("\nPermission denied for the following files:")
        for file_path in permission_denied_files:
            print(file_path)


In [11]:
# Call the function with the base directory path where your dataset folders are located
clear_non_images(root_directory_path='inputs/cherry_leaves_dataset/cherry-leaves')

Processed folder: inputs/cherry_leaves_dataset/cherry-leaves/powdery_mildew
- Image files: 2104
- Non-image files removed: 0
Processed folder: inputs/cherry_leaves_dataset/cherry-leaves/healthy
- Image files: 2104
- Non-image files removed: 0
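
The extension check above trusts file names. A file can carry a `.jpg` suffix and still not be an image, so a content check is a useful complement. A minimal sketch that compares the leading bytes against the PNG and JPEG signatures; the helper name `looks_like_image` is hypothetical, and a matching header does not prove that the whole file decodes:

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
JPEG_SIGNATURE = b"\xff\xd8\xff"

def looks_like_image(file_path):
    """Return True if the file starts with a PNG or JPEG signature."""
    with open(file_path, "rb") as handle:
        header = handle.read(8)  # The PNG signature is 8 bytes, the JPEG one is 3
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)
```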


## Split train validation test set

In [12]:
import shutil
import random

def distribute_dataset_images_v2(dataset_path, train_ratio, validation_ratio, test_ratio, final_test_ratio):
    """
    Distribute images across train, validation, test, and final test folders based on specified ratios.
    
    Args:
    dataset_path (str): The directory containing the class folders.
    train_ratio (float): The proportion of images to be used for the training set.
    validation_ratio (float): The proportion of images to be used for the validation set.
    test_ratio (float): The proportion of images to be used for the initial test set.
    final_test_ratio (float): The proportion of images to be set aside for the final test set.
    """
    # Check that the four ratios add up to 1, with a tolerance for floating-point rounding
    total = train_ratio + validation_ratio + test_ratio + final_test_ratio
    if abs(total - 1.0) > 1e-9:
        print("The sum of ratios must be 1.0.")
        return

    try:
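        # Class folders only; skip subset folders left over from a previous run
        categories = [
            category for category in os.listdir(dataset_path)
            if os.path.isdir(os.path.join(dataset_path, category))
            and category not in ('train', 'validation', 'test', 'final_test')
        ]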
        
        # Creating directories for each subset and category
        for subset in ['train', 'validation', 'test', 'final_test']:
            subset_path = os.path.join(dataset_path, subset)
            for category in categories:
                dir_path = os.path.join(subset_path, category)
                os.makedirs(dir_path, exist_ok=True)

        # Distributing files
        for category in categories:
            files = os.listdir(os.path.join(dataset_path, category))
            random.shuffle(files)

            train_end = int(len(files) * train_ratio)
            validation_end = train_end + int(len(files) * validation_ratio)
            test_end = validation_end + int(len(files) * test_ratio)

            train_files = files[:train_end]
            validation_files = files[train_end:validation_end]
            test_files = files[validation_end:test_end]
            final_test_files = files[test_end:]

            # Function to move files to their new location
            def move_files(files, subset):
                for file_name in files:
                    source_path = os.path.join(dataset_path, category, file_name)
                    destination_path = os.path.join(dataset_path, subset, category, file_name)
                    shutil.move(source_path, destination_path)

            # Moving files to respective directories
            move_files(train_files, 'train')
            move_files(validation_files, 'validation')
            move_files(test_files, 'test')
            move_files(final_test_files, 'final_test')

            # Remove original category directory if empty
            if os.listdir(os.path.join(dataset_path, category)) == []:
                os.rmdir(os.path.join(dataset_path, category))

        print("Dataset successfully distributed into train, validation, test, and final test sets.")
    except Exception as e:
        print(f"An error occurred: {e}")
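
The `random.shuffle` call in the function above is unseeded, so every run yields a different split. If a reproducible split matters, one option is a dedicated `random.Random` instance with a fixed seed. A sketch; the helper name `shuffled_copy` and the seed value are arbitrary assumptions:

```python
import random

def shuffled_copy(file_names, seed=42):
    """Return a shuffled copy of file_names that is identical for a given seed."""
    rng = random.Random(seed)  # Private generator, so the global random state is untouched
    ordered = sorted(file_names)  # os.listdir order is arbitrary, so sort first
    rng.shuffle(ordered)
    return ordered
```

Replacing `random.shuffle(files)` with `files = shuffled_copy(files)` would keep the rest of the function unchanged.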


The split is as follows:
* The training set receives 70% of the data.
* The validation set receives 10% of the data.
* The test set receives 15% of the data.
* The final test set receives 5% of the data.

In [13]:
distribute_dataset_images_v2(
    dataset_path="inputs/cherry_leaves_dataset/cherry-leaves",
    train_ratio=0.7,
    validation_ratio=0.1,
    test_ratio=0.15,
    final_test_ratio=0.05
)

Dataset successfully distributed into train, validation, test, and final test sets.
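
Because each boundary is computed with `int()`, fractional images are truncated and the remainder ends up in the final test set. For the 2104 images per class reported by the cleaning step, the arithmetic works out as follows. The helper name `split_counts` is hypothetical; it mirrors the slicing logic of the function above:

```python
def split_counts(n_files, train_ratio, validation_ratio, test_ratio):
    """Return per-class counts for train, validation, test, and final test."""
    train = int(n_files * train_ratio)
    validation = int(n_files * validation_ratio)
    test = int(n_files * test_ratio)
    # Whatever is left after truncation goes to the final test set
    final_test = n_files - (train + validation + test)
    return train, validation, test, final_test

print(split_counts(2104, 0.7, 0.1, 0.15))  # (1472, 210, 315, 107)
```

The final test set therefore holds 107 images per class, slightly more than 5% of 2104.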


---

# Validation of Directory Structure Creation

In [14]:
# Validation of Directory Structure
print("Validating the newly created directory structure for Train, Validation, and Test sets...")
for root, dirs, files in os.walk("inputs/cherry_leaves_dataset", topdown=True):
    if dirs:
        print(f"Root directory: {root}")
        print(f"Subdirectories: {dirs}\n")
print("Directory structure validation completed.")


Validating the newly created directory structure for Train, Validation, and Test sets...
Root directory: inputs/cherry_leaves_dataset
Subdirectories: ['cherry-leaves']

Root directory: inputs/cherry_leaves_dataset/cherry-leaves
Subdirectories: ['final_test', 'test', 'train', 'validation']

Root directory: inputs/cherry_leaves_dataset/cherry-leaves/final_test
Subdirectories: ['powdery_mildew', 'healthy']

Root directory: inputs/cherry_leaves_dataset/cherry-leaves/test
Subdirectories: ['powdery_mildew', 'healthy']

Root directory: inputs/cherry_leaves_dataset/cherry-leaves/train
Subdirectories: ['powdery_mildew', 'healthy']

Root directory: inputs/cherry_leaves_dataset/cherry-leaves/validation
Subdirectories: ['powdery_mildew', 'healthy']

Directory structure validation completed.
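
The check above confirms that the folders exist but not how many images each one received. A minimal sketch that counts the files per subset and class; the helper name `count_images` is hypothetical, and the subset names are those created by the split function:

```python
import os

def count_images(dataset_path, subsets=("train", "validation", "test", "final_test")):
    """Return {(subset, category): number_of_files} for the subset folders that exist."""
    counts = {}
    for subset in subsets:
        subset_path = os.path.join(dataset_path, subset)
        if not os.path.isdir(subset_path):
            continue  # Skip subsets that have not been created
        for category in sorted(os.listdir(subset_path)):
            category_path = os.path.join(subset_path, category)
            if os.path.isdir(category_path):
                counts[(subset, category)] = len(os.listdir(category_path))
    return counts
```

In this notebook the call would be `count_images("inputs/cherry_leaves_dataset/cherry-leaves")`.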


---

# Reflecting on NDAs

Before proceeding further, please ensure that the dataset usage complies with any Non-Disclosure Agreements (NDAs) or data use agreements associated with the project. Specifically, check that:

- The dataset does not contain any sensitive or proprietary information not cleared for use.
- The distribution of the dataset within the project scope does not violate any terms of the NDA.

Consult with the project's legal advisor or data governance team if in doubt.

---