# **Data Collection**

## Objectives

1. Import necessary packages and configure the working directory. 
2. Authenticate and retrieve the mildew dataset from Kaggle. 
3. Prepare the dataset by organizing it into train, validation, and test splits. 
4. Ensure data integrity by removing any non-image files.

## Inputs

- Kaggle JSON file: Used for authentication and dataset download. 
- Dataset source: The mildew dataset hosted on [Kaggle](https://www.kaggle.com/codeinstitute/cherry-leaves).
- Local directories: Structure for storing and splitting data. 

## Outputs

1. Raw Dataset: Downloaded and unzipped into the specified folder.
2. Cleaned Dataset: Non-image files removed for consistency. 
3. Structured Data: Split into training (70%), validation (10%), and testing (20%) sets, organized in respective directories. 

## Additional Comments

- These steps prepare the dataset for model training and evaluation. Cleaning and structuring the data up front reduces the chance of errors during training and keeps the test set separate for an unbiased evaluation.



---

## Change working directory

In [1]:
import os
import numpy

In [2]:
current_dir = os.getcwd()
current_dir

'/workspace/powdery-mildew-detector/jupyter_notebooks'

In [4]:
os.chdir('/workspace/powdery-mildew-detector')
print("You set a new current directory")

You set a new current directory


In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/powdery-mildew-detector'

## Install Kaggle

In [6]:
! pip install kaggle

Collecting kaggle
  Downloading kaggle-1.6.17.tar.gz (82 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.6.17-py3-none-any.whl size=105839 sha256=679eaa5751525bd4b42641d84b8b38775f4f90a10d8f26325b1779c01eca282b
  Stored in directory: /home/gitpod/.cache/pip/wheels/46/d2/26/84d0a1acdb9c6baccf7d28cf06962ec80529fe1ad938489983
Successfull

Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON.

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
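
The shell `chmod` call above assumes `kaggle.json` is already present in the working directory. A minimal sketch of the same step in pure Python, which fails early with a clear message if the token is missing, might look like this. The helper name `secure_kaggle_token` is hypothetical and is not part of the Kaggle package.

```python
import os
import stat


def secure_kaggle_token(config_dir):
    """Restrict kaggle.json to owner read/write (the equivalent of chmod 600)."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Expected a Kaggle API token at {token_path}")
    # stat.S_IRUSR | stat.S_IWUSR is 0o600: owner read and write only
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

Calling `secure_kaggle_token(os.getcwd())` after setting `KAGGLE_CONFIG_DIR` would then replace the `! chmod 600 kaggle.json` line on POSIX systems.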

Set the Kaggle dataset path and destination folder, then download the dataset.

In [8]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/mildew_dataset
 96%|████████████████████████████████████▌ | 53.0M/55.0M [00:02<00:00, 29.9MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 24.9MB/s]


Unzip the downloaded file, and delete the zip file.

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')

---

## Data Preparation

### Data Cleaning

In [10]:
def remove_non_image_file(my_data_dir):
    """
    Remove files that are not images from each class folder in the dataset directory.
    """
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(os.path.join(my_data_dir, folder))
        image_count = 0
        non_image_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                # Delete anything without a recognised image extension
                os.remove(os.path.join(my_data_dir, folder, given_file))
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

In [11]:
remove_non_image_file(my_data_dir='inputs/mildew_dataset/cherry-leaves')

Folder: Healthy - has image file 2104
Folder: Healthy - has non-image file 0
Folder: Infected - has image file 2104
Folder: Infected - has non-image file 0
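
Because `remove_non_image_file` deletes files as it goes, it can help to preview what would be removed first. A minimal sketch of such a dry run, assuming the same one-folder-per-class layout, is shown below. `count_files_by_type` is a hypothetical helper and not part of the original notebook.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def count_files_by_type(my_data_dir, image_extensions=IMAGE_EXTENSIONS):
    """Return {folder: (image_count, non_image_count)} without deleting anything."""
    summary = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # ignore stray files at the top level
        names = os.listdir(folder_path)
        images = sum(name.lower().endswith(image_extensions) for name in names)
        summary[folder] = (images, len(names) - images)
    return summary
```

Like the cleaning function, this check looks only at file extensions. It does not confirm that a file with an image extension can actually be decoded.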


### Split data into train, validation, and test sets

In [12]:
import math
import os
import shutil
import random
import joblib

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """
    Split the dataset into training, validation, and test sets.
    """
    # Check that the ratios sum to 1.0, allowing for floating-point rounding
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Get the class labels in the dataset directory 
    labels = os.listdir(my_data_dir)  
    # Skip the split if it has already been done (a 'test' folder exists)
    if 'test' in labels:
        return

    # Create 'train', 'validation', and 'test' folders with class subfolders
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=os.path.join(my_data_dir, folder, label))

    # Iterate through each class label
    for label in labels:
        # Get the list of files in the current class label directory
        files = os.listdir(os.path.join(my_data_dir, label))
        random.shuffle(files)
        # Calculate the number of files for the train and validation sets;
        # the remaining files go to the test set
        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                target_set = 'train'
            elif count <= (train_set_files_qty + validation_set_files_qty):
                target_set = 'validation'
            else:
                target_set = 'test'
            # Move the file into its assigned set
            shutil.move(os.path.join(my_data_dir, label, file_name),
                        os.path.join(my_data_dir, target_set, label, file_name))

        # Remove the original class directory after all files are moved
        os.rmdir(os.path.join(my_data_dir, label))
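
The split function calls `random.shuffle` without a seed, so each fresh run assigns different files to each set. A sketch of one way to make the assignment repeatable, assuming a fixed seed is acceptable for this project, is shown below. `shuffled_copy` and its `seed` parameter are hypothetical additions.

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return a reproducibly shuffled copy of file_names."""
    # Sort first so the result does not depend on os.listdir ordering
    ordered = sorted(file_names)
    # A dedicated Random instance leaves the global random state untouched
    random.Random(seed).shuffle(ordered)
    return ordered
```

Inside the split function, `files = shuffled_copy(os.listdir(...))` could stand in for the `os.listdir` and `random.shuffle` pair.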

Typically,
- 70% of the data is allocated to the training set.
- 10% of the data is used for validation.
- 20% of the data is reserved for the test set.
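
With 2104 images per class (the counts reported by the cleaning step), the truncating `int(...)` arithmetic in `split_train_validation_test_images` can be worked through as follows. This sketch reproduces only the counting logic, with whatever is left after the train and validation sets going to the test set. `split_counts` is a hypothetical helper.

```python
def split_counts(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the counting logic: truncate train and validation, remainder is test."""
    train = int(n_files * train_set_ratio)
    validation = int(n_files * validation_set_ratio)
    test = n_files - train - validation
    return train, validation, test


print(split_counts(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because the train and validation counts are rounded down, the test set absorbs the leftover files and ends up slightly above its nominal share (422 of 2104 images per class).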

In [13]:
split_train_validation_test_images(my_data_dir="inputs/mildew_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
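
After the split runs, a quick count per set and class can confirm that the files landed where expected. A minimal sketch, assuming the `train`/`validation`/`test` layout created by the split function, is shown below. `count_split_files` is a hypothetical helper.

```python
import os


def count_split_files(my_data_dir, splits=('train', 'validation', 'test')):
    """Return {split: {label: file_count}} for each split folder that exists."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(my_data_dir, split)
        if not os.path.isdir(split_dir):
            continue  # this split has not been created yet
        counts[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
            if os.path.isdir(os.path.join(split_dir, label))
        }
    return counts
```

For this dataset the call would be `count_split_files("inputs/mildew_dataset/cherry-leaves")`.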

---