# **DATA COLLECTION**

## **Objectives**

1. Import necessary packages and configure the working directory.
2. Authenticate and retrieve the mildew dataset from Kaggle.
3. Prepare the dataset by organizing it into train, validation, and test splits.
4. Ensure data integrity by removing any non-image files.

## **Inputs**

- Kaggle JSON file: Used for authentication and dataset download.
- Dataset source: The mildew dataset hosted on Kaggle.
- Local directories: Structure for storing and splitting data.

## **Outputs**

- Raw Dataset: Downloaded and unzipped into the specified folder.
- Cleaned Dataset: Non-image files removed for consistency.
- Structured Data: Split into training (70%), validation (10%), and testing (20%) sets, organized in respective directories.

## **Additional Comments**

- These steps prepare the dataset for model training and evaluation. Cleaning and structuring the data up front reduces the risk of errors during training and keeps the evaluation on held-out data reliable.



---

## **Change Working Directory**

---

### **Import Necessary Packages**

In [1]:
import os

### **Set Working Directory & File Paths**

In [2]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-app/jupyter_notebooks'

In [3]:
os.chdir('/workspaces/mildew-detection-app')
print("You set a new current directory")

You set a new current directory


Confirm the New Current Directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspaces/mildew-detection-app'
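
The hard-coded path passed to `os.chdir` only works in this particular workspace. Below is a minimal sketch of a more portable alternative, assuming the notebook always lives in a `jupyter_notebooks` folder directly under the project root. The helper name `find_project_root` is hypothetical and not part of the original notebook.

```python
import os


def find_project_root(current_dir, notebooks_folder="jupyter_notebooks"):
    """Return the parent of current_dir if it is the notebooks folder,
    otherwise return current_dir unchanged (normalized)."""
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebooks_folder:
        return os.path.dirname(normalized)
    return normalized


# Possible usage inside the notebook (assumption: run from jupyter_notebooks):
# os.chdir(find_project_root(os.getcwd()))
```

Because the helper leaves any other directory untouched, re-running the cell after the directory has already changed should not climb a second level.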

### **Install Kaggle**

In [5]:
! pip install kaggle

Collecting kaggle
  Downloading kaggle-1.6.17.tar.gz (82 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.6.17-py3-none-any.whl size=105839 sha256=22327282211a27cb900adc16eb2ea8431040cd02e21b680d15cbf97972f1b375
  Stored in directory: /home/codespace/.cache/pip/wheels/46/d2/26/84d0a1acdb9c6baccf7d28cf06962ec80529fe1ad938489983
Successf

Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON.

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
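
The shell `chmod` call fails with a terse message if `kaggle.json` is missing. As a rough sketch, the same permission change can be done in Python with an explicit existence check first. The function name `secure_kaggle_token` is an assumption made for illustration, and the permission bits are only meaningful on POSIX systems.

```python
import os
import stat


def secure_kaggle_token(config_dir, token_name="kaggle.json"):
    """Restrict the token file to owner read/write (0o600) and return its path."""
    token_path = os.path.join(config_dir, token_name)
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected Kaggle token at {token_path}")
    # Owner read + owner write, equivalent to `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path


# Possible usage, mirroring the cell above:
# os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
# secure_kaggle_token(os.getcwd())
```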

### **Set Dataset and Download**

In [7]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/mildew_dataset
 93%|███████████████████████████████████▏  | 51.0M/55.0M [00:02<00:00, 29.8MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 21.7MB/s]


### **Unzip the Downloaded File and Delete the ZIP File**

In [8]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
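
If the download was interrupted, the archive may be missing or incomplete. The sketch below wraps the same extract-then-delete steps in a function that first checks the file with `zipfile.is_zipfile`. The name `extract_and_remove` is hypothetical; the behaviour otherwise follows the cell above.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive,
    and return the names of the extracted members."""
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid ZIP archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    # Only reached when extraction succeeded
    os.remove(zip_path)
    return names


# Possible usage with the notebook's variables:
# extract_and_remove(os.path.join(DestinationFolder, 'cherry-leaves.zip'), DestinationFolder)
```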

---

## **Data Preparation**

---

### **Data Cleaning**

In [9]:
def remove_non_image_file(my_data_dir):
    """
    Remove files that are not images from the dataset directory.
    """
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if not given_file.lower().endswith(image_extension):
                # Delete anything without a recognised image extension
                os.remove(os.path.join(folder_path, given_file))
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [10]:
remove_non_image_file(my_data_dir='inputs/mildew_dataset/cherry-leaves')

Folder: Healthy - has image file 2104
Folder: Healthy - has non-image file 0
Folder: Infected - has image file 2104
Folder: Infected - has non-image file 0
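
Because the cleaning function deletes files immediately, it can be useful to preview what would be removed. The following is a minimal dry-run sketch that applies the same extension test but only collects paths. `list_non_image_files` is a hypothetical helper, not part of the original notebook.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def list_non_image_files(my_data_dir, image_extensions=IMAGE_EXTENSIONS):
    """Return paths of files lacking an image extension, without deleting anything."""
    non_images = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for file_name in sorted(os.listdir(folder_path)):
            if not file_name.lower().endswith(image_extensions):
                non_images.append(os.path.join(folder_path, file_name))
    return non_images


# Possible usage:
# list_non_image_files('inputs/mildew_dataset/cherry-leaves')
```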


### **Detect Corrupt Images**

In [11]:
import os
from PIL import Image

def check_corrupt_images(directory):
    corrupt_images = []

    for folder in os.listdir(directory):
        folder_path = os.path.join(directory, folder)
        if os.path.isdir(folder_path):  # Ensure it's a directory
            for img_name in os.listdir(folder_path):
                img_path = os.path.join(folder_path, img_name)
                try:
                    # Open in a context manager so the file handle is always closed
                    with Image.open(img_path) as img:
                        img.verify()  # Verify image integrity
                except (IOError, SyntaxError):
                    print(f"Corrupt image detected: {img_path}")
                    corrupt_images.append(img_path)

    return corrupt_images

# Define your dataset directory
dataset_path = "inputs/mildew_dataset/cherry-leaves"

# Run check on the extracted dataset (Healthy & Infected folders)
corrupt_images = check_corrupt_images(dataset_path)

print(f"Total corrupt images found: {len(corrupt_images)}")

Total corrupt images found: 0
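
The check above only reports corrupt files; it does not delete them. None were found in this run, but if the list were non-empty, a small helper along these lines could remove the flagged paths. This is a sketch under that assumption, and `remove_files` is a hypothetical name.

```python
import os


def remove_files(paths):
    """Delete each existing file in paths and return the paths actually removed."""
    removed = []
    for path in paths:
        if os.path.isfile(path):
            os.remove(path)
            removed.append(path)
    return removed


# Possible usage with the list returned by check_corrupt_images:
# removed = remove_files(corrupt_images)
# print(f"Removed {len(removed)} corrupt images")
```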


### **Split Data into Train, Validation, and Test Sets**

In [12]:
import os
import shutil
import random

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):    
    """
    Split the dataset into training, validation, and test sets.
    """
    # Validate that the train, validation, and test ratios sum to 1.0
    # (compare with a tolerance, since floating-point sums are not always exact)
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("Error: train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Get the class labels in the dataset directory 
    labels = os.listdir(my_data_dir)  

    # Create 'train', 'validation', and 'test' folders with class subfolders
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=os.path.join(my_data_dir, folder, label), exist_ok=True)

    # Iterate through each class label
    for label in labels:
        # Get the list of files in the current class label directory
        files = os.listdir(os.path.join(my_data_dir, label))
        random.shuffle(files)
        # Calculate the number of files for train, validation, and test sets
        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        count = 1
        for file_name in files:
            if count <= train_set_files_qty:
                # Move the file to the 'train' set
                shutil.move(os.path.join(my_data_dir, label, file_name),
                            os.path.join(my_data_dir, 'train', label, file_name))

            elif count <= (train_set_files_qty + validation_set_files_qty):
                # Move the file to the 'validation' set
                shutil.move(os.path.join(my_data_dir, label, file_name),
                            os.path.join(my_data_dir, 'validation', label, file_name))
            else:
                # Move the file to the 'test' set
                shutil.move(os.path.join(my_data_dir, label, file_name),
                            os.path.join(my_data_dir, 'test', label, file_name))

            count += 1
        # Remove the original class directory after all files are moved
        os.rmdir(os.path.join(my_data_dir, label))

In [13]:
split_train_validation_test_images(my_data_dir="inputs/mildew_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
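
The split function truncates the train and validation quantities with `int()` and sends every remaining file to the test set. The sketch below reproduces only that arithmetic for one class folder of 2104 images, the per-class count reported during cleaning. `split_counts` is a hypothetical helper written for illustration.

```python
def split_counts(total_files, train_set_ratio, validation_set_ratio):
    """Mirror the split arithmetic: truncate train and validation,
    then assign the remainder to test."""
    train_qty = int(total_files * train_set_ratio)
    validation_qty = int(total_files * validation_set_ratio)
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(split_counts(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because the files are shuffled with `random.shuffle` and no seed is set, each run assigns different images to each set. Calling `random.seed` with a fixed value before the split would make the assignment repeatable.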

### **Count Images in Each Set**

In [14]:
import os

sets = ['train', 'test', 'validation']
labels = ['Healthy', 'Infected']  

for set_name in sets:
    for label in labels:
        path = f'inputs/mildew_dataset/cherry-leaves/{set_name}/{label}' 
        try:
            number_of_files = len(os.listdir(path))
            print(f'There are {number_of_files} images in {set_name}/{label}')
        except FileNotFoundError:
            print(f"Error: Directory '{path}' not found.")

# Compute total number of images across all datasets (train, validation, test)
total_images = 0
for set_name in sets:
    for label in labels:
        path = f'inputs/mildew_dataset/cherry-leaves/{set_name}/{label}'  
        try:
            total_images += len(os.listdir(path))
        except FileNotFoundError:
            pass
print(f"\nTotal number of images: {total_images}")

There are 1472 images in train/Healthy
There are 1472 images in train/Infected
There are 422 images in test/Healthy
There are 422 images in test/Infected
There are 210 images in validation/Healthy
There are 210 images in validation/Infected

Total number of images: 4208
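
As a quick sanity check, the per-class counts printed above can be converted back into percentages to confirm they sit close to the 70/10/20 target. This is a small sketch using those reported numbers; `split_percentages` is a hypothetical helper.

```python
def split_percentages(counts):
    """Return each set's share of the total as a percentage rounded to 2 decimals."""
    total = sum(counts.values())
    return {name: round(100 * qty / total, 2) for name, qty in counts.items()}


# Per-class counts taken from the output above (identical for Healthy and Infected)
per_class_counts = {'train': 1472, 'validation': 210, 'test': 422}
print(split_percentages(per_class_counts))
# {'train': 69.96, 'validation': 9.98, 'test': 20.06}
```

The small deviations from the nominal ratios come from truncating the train and validation quantities, with the leftover files landing in the test set.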


---

## **Conclusion and Next Steps**

---

This notebook successfully handled **data collection and preprocessing** for the **powdery mildew detection project**. The dataset was:

- Downloaded from Kaggle using authentication.
- Cleaned by removing non-image files.
- Organized into train (70%), validation (10%), and test (20%) sets.
- Checked for corrupt images (none found) and verified by counting the images in each set.

### **Next Steps**:

1. Data Exploration & Visualization:
   - Analyze class distributions and dataset balance.
   - Generate image samples to check quality.
   - Compute image dimensions for standardization.

2. Data Augmentation & Preprocessing:
   - Implement augmentation strategies (e.g., rotation, flipping).
   - Normalize images for deep learning models.

3. Model Training & Evaluation:
   - Utilize the structured dataset for CNN training.
   - Tune hyperparameters to improve model performance.