# **Data Collection**

## Objectives

* Import packages.
* Set working dir.
* Fetch data from Kaggle and prepare it for cleaning and processing.
* Split the data into train, validation and test sets.

## Inputs

* Kaggle JSON file - authentication token.
* Kaggle API - To download the data.

## Outputs

* Once the dataset is downloaded and structured, it will look like this:


- inputs/
    - cherryleaves_dataset/
        - cherry-leaves/
            - test/
                - healthy/
                - powdery_mildew/
            - train/
                - healthy/
                - powdery_mildew/
            - validation/
                - healthy/
                - powdery_mildew/


## Additional Comments

* These steps fetch the data, clean it and divide it into subsets for machine learning. The next step is data visualization, to understand the data and discover patterns.



---

## Import packages

In [1]:
! pip install -r /workspace/mildew-detector/requirements.txt



In [2]:
import numpy
import os

## Change working directory

* The notebooks are in a subfolder of this directory, therefore when running the notebook in the editor, it's necessary to change the working directory.

Change the working directory from its current folder to its parent folder

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detector/jupyter_notebooks'

We want to make the parent of the current directory the new current directory:
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detector'
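
Re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the output above; `move_to_project_root` is a hypothetical helper, not part of the original notebook:

```python
import os


def move_to_project_root(notebooks_dir_name="jupyter_notebooks"):
    """Move to the parent directory only if we are still inside the notebooks folder.

    notebooks_dir_name is an assumption taken from the path printed above.
    """
    current = os.getcwd()
    if os.path.basename(current) == notebooks_dir_name:
        os.chdir(os.path.dirname(current))
    return os.getcwd()


# Safe to run more than once: a second call leaves the directory unchanged
print(move_to_project_root())
```

Checking the folder name before calling `os.chdir` makes the cell safe to re-run after a kernel restart or by accident.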

---

## Install Kaggle

In [6]:
pip install kaggle

Note: you may need to restart the kernel to use updated packages.


Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
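
If `kaggle.json` is missing from the working directory, the `chmod` command fails with a shell message that is easy to overlook. The sketch below performs the same two steps in Python and raises a clear error instead. `prepare_kaggle_token` is a hypothetical helper name; `KAGGLE_CONFIG_DIR` and the `600` permission come from the cell above.

```python
import os
import stat


def prepare_kaggle_token(config_dir="."):
    """Point the Kaggle client at config_dir and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected Kaggle token at {token_path}")
    os.environ["KAGGLE_CONFIG_DIR"] = os.path.abspath(config_dir)
    # Same effect as `chmod 600`: read and write for the owner only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `prepare_kaggle_token(os.getcwd())` would reproduce the behaviour of the cell above.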

Set the kaggle dataset and download it

In [8]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherryleaves_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherryleaves_dataset
 98%|█████████████████████████████████████▎| 54.0M/55.0M [00:01<00:00, 42.5MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 34.7MB/s]


Unzip the downloaded file and delete the zip file

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
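
The unzip step can also be wrapped in a small function that checks the archive before extracting it, so an interrupted download is reported rather than half-extracted. This is a sketch only; `extract_and_remove` is a hypothetical name and the check relies on the standard library's `zipfile.is_zipfile`.

```python
import zipfile
from pathlib import Path


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive, return the member names."""
    zip_path = Path(zip_path)
    # is_zipfile returns False for a missing or corrupt file
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is missing or is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
        extracted = zip_ref.namelist()
    zip_path.unlink()
    return extracted
```

With the variables defined earlier, the call would be `extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)`.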

---

## Data Preparation

---

### Data Cleaning

* Remove any non-image files.

In [10]:
def remove_non_image_files(data_dir):
    """
    Removes non-image files from a given directory.

    Args:
        data_dir (str): The directory containing the image files.

    Returns:
        None
    """
    # Define the file extensions to keep
    valid_extensions = ('.png', '.jpg', '.jpeg')

    # Loop through each folder in the directory
    for folder in os.listdir(data_dir):
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # Skip non-directory files

        # Loop through each file in the folder
        files = os.listdir(folder_path)
        num_files = len(files)
        num_removed = 0
        for filename in files:
            file_path = os.path.join(folder_path, filename)

            # Check if the file has a valid image extension
            if not filename.lower().endswith(valid_extensions):
                os.remove(file_path)
                num_removed += 1

        print(f"Folder: {folder} - {num_files} total files, "
              f"{num_removed} non-image files removed.")


In [11]:
remove_non_image_files('inputs/cherryleaves_dataset/cherry-leaves')

Folder: healthy - 2104 total files, 0 non-image files removed.
Folder: powdery_mildew - 2104 total files, 0 non-image files removed.
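
Because `remove_non_image_files` deletes files in place, it can be useful to inspect the folders first. The sketch below is a read-only companion that counts file extensions per class folder. `count_extensions` is a hypothetical helper and is not part of the original notebook.

```python
import os
from collections import Counter


def count_extensions(data_dir):
    """Return {folder: Counter of lowercase file extensions}; nothing is deleted."""
    summary = {}
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # Skip loose files at the top level
        summary[folder] = Counter(
            os.path.splitext(name)[1].lower()
            for name in os.listdir(folder_path)
            if os.path.isfile(os.path.join(folder_path, name))
        )
    return summary
```

Any extension other than `.png`, `.jpg` or `.jpeg` in the result marks files that the cleaning function would remove.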


---

## Split train validation test set

In [13]:
import os
import shutil
import random

def split_train_validation_test_images(my_data_dir:str, train_set_ratio:float, validation_set_ratio:float, test_set_ratio:float) -> None:
    """
    Splits images in the specified directory into train, validation, and test sets.
    Moves images to corresponding subdirectories.
    
    Args:
    - my_data_dir: A string representing the directory containing the image dataset.
    - train_set_ratio: A float representing the proportion of data to be used in the train set.
    - validation_set_ratio: A float representing the proportion of data to be used in the validation set.
    - test_set_ratio: A float representing the proportion of data to be used in the test set.
    
    Returns: None
    """
    
    # Check that the ratios add up to 1.0 (with a tolerance, since float sums are inexact)
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("Error: train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Get class labels from directory
    labels = os.listdir(my_data_dir)
    if 'test' not in labels:
        # Create train, validation, and test folders with class labels sub-folder
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:
            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # Move file to train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # Move file to validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)

                else:
                    # Move file to test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1

        # Remove original label folders
        for label in labels:
            os.rmdir(my_data_dir + '/' + label)
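
A check that three ratios sum to 1.0 depends on floating-point arithmetic, and an exact comparison can pass or fail depending on the order in which the ratios are added. The sketch below shows the effect and a tolerance-based alternative using `math.isclose`. `ratios_sum_to_one` is a hypothetical helper and the tolerance value is an assumption.

```python
import math


def ratios_sum_to_one(train, validation, test, tolerance=1e-9):
    """Return True when the three ratios sum to 1.0 within an absolute tolerance."""
    return math.isclose(train + validation + test, 1.0, abs_tol=tolerance)


# Exact comparison: the result depends on the order of addition
print(0.7 + 0.1 + 0.2 == 1.0)  # True
print(0.7 + 0.2 + 0.1 == 1.0)  # False

# Tolerance-based comparison: both orders are accepted
print(ratios_sum_to_one(0.7, 0.1, 0.2))  # True
print(ratios_sum_to_one(0.7, 0.2, 0.1))  # True
```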



The call below splits the data into three parts. The training set holds 70% of the images and is used to fit the model. The validation set holds 10% and is used to tune the model's hyperparameters. The test set holds the remaining 20% and is used to evaluate the final model on unseen data.

In [14]:
split_train_validation_test_images(my_data_dir="inputs/cherryleaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2)
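
The split function truncates the train and validation counts with `int()` and sends every remaining file to the test set, so the test share can be slightly larger than its nominal ratio. The sketch below reproduces that arithmetic for the 2104 images per class reported by the cleaning step. `expected_split_sizes` is a hypothetical helper used only for illustration.

```python
def expected_split_sizes(total_files, train_ratio, validation_ratio):
    """Mirror the counting logic of split_train_validation_test_images for one class."""
    train_qty = int(total_files * train_ratio)
    validation_qty = int(total_files * validation_ratio)
    # Everything that is not train or validation ends up in the test set
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(expected_split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Counting the files in each `train`, `validation` and `test` subfolder after the split should give these numbers for both classes.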


---

## Conclusions

* In this section, we have covered the data cleaning process, where we removed all the non-image files from our dataset. This step is essential to ensure we only work with the relevant data.
* We have also split our dataset into three sets: train, validation, and test, with a 70-10-20 ratio respectively. This step is crucial for building a robust and generalisable model.

## Next steps

* In the next section, we will cover data preprocessing, where we will resize the images to a standard size, apply data augmentation techniques, and convert the images into a format that our model can understand.
* Once we have preprocessed our data, we will move on to the modelling phase, where we will train and evaluate various machine-learning models to identify mildew in cherry leaves.