# **Data Collection and Preprocessing**

## Objectives

1. Import necessary packages and configure the working directory. 
2. Authenticate and retrieve the mildew dataset from Kaggle. 
3. Prepare the dataset by organizing it into train, validation, and test splits. 
4. Ensure data integrity by removing any non-image files.
5. Perform data augmentation to increase the diversity of the training data.

## Inputs

- Kaggle JSON file: Used for authentication and dataset download. 
- Dataset source: The mildew dataset hosted on [Kaggle](https://www.kaggle.com/codeinstitute/cherry-leaves).
- Local directories: Structure for storing and splitting data. 

## Outputs

1. Raw Dataset: Downloaded and unzipped into the specified folder.
2. Cleaned Dataset: Non-image files removed for consistency. 
3. Structured Data: Split into training (70%), validation (10%), and testing (20%) sets, organized in respective directories. 
4. Augmented Data: Augmented training images generated and stored in the training directory.

## Additional Comments

- These steps are critical to ensure the dataset is properly prepared for model training and evaluation. 
- By structuring, cleaning, and augmenting the data, we minimize errors during training and improve model accuracy and generalization.

## Version Control
- This notebook and the associated datasets are version controlled using Git. Each version of the dataset and the preprocessing steps are tracked to ensure reproducibility and facilitate experimentation.



---

In [None]:
import os
import zipfile
import shutil
import random
from tensorflow.keras.preprocessing.image import ImageDataGenerator

## Change working directory

In [1]:
import os
import numpy

In [None]:
current_dir = os.getcwd()
current_dir

In [None]:
os.chdir('/workspace/powdery-mildew-detector')
print("You set a new current directory")

In [None]:
current_dir = os.getcwd()
current_dir

## Install Kaggle

In [None]:
! pip install kaggle

Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON.

In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
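
Before calling the Kaggle CLI, it can help to confirm that the token file is in place. The following is a minimal sketch, assuming `kaggle.json` holds the usual `username` and `key` fields and that the notebook runs on a POSIX system; the helper name `check_kaggle_credentials` is hypothetical and not part of the Kaggle package.

```python
import json
import os
import stat


def check_kaggle_credentials(config_dir):
    """Return a list of problems found with kaggle.json (empty if none).

    Hypothetical helper for illustration; not part of the Kaggle package.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return [f"{token_path} not found"]

    with open(token_path) as token_file:
        try:
            token = json.load(token_file)
        except json.JSONDecodeError:
            return [f"{token_path} is not valid JSON"]
    if not isinstance(token, dict):
        return [f"{token_path} does not contain a JSON object"]

    problems = []
    for field in ("username", "key"):
        if not token.get(field):
            problems.append(f"missing field: {field}")

    # Group and other permission bits should be off (what chmod 600 sets).
    mode = stat.S_IMODE(os.stat(token_path).st_mode)
    if mode & 0o077:
        problems.append(f"permissions too open: {oct(mode)}")
    return problems
```

Calling `check_kaggle_credentials(os.getcwd())` after the cell above should return an empty list when the token is ready to use.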

Set the Kaggle dataset path and destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, and delete the zip file.

In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
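
If the download is interrupted, the archive may be incomplete. As a rough sketch of a more defensive variant of the step above, the helper below (a hypothetical name, `extract_and_remove`) checks the archive with `ZipFile.testzip()` and deletes it only after extraction succeeds.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, deleting the archive only on success.

    Hypothetical helper for illustration. Returns the number of archive members.
    """
    with zipfile.ZipFile(zip_path, "r") as archive:
        # testzip() returns the name of the first corrupt member, or None.
        bad_member = archive.testzip()
        if bad_member is not None:
            raise zipfile.BadZipFile(f"corrupt member in archive: {bad_member}")
        archive.extractall(destination)
        member_count = len(archive.namelist())
    # The archive is closed at this point, so it is safe to delete.
    os.remove(zip_path)
    return member_count
```

With the variables defined earlier, the call would look like `extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)`.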

---

## Data Preparation

### Data Cleaning

In [10]:
def remove_non_image_file(my_data_dir):
    """
    Remove files that are not images from each class folder in the dataset directory.
    """
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        image_count = 0
        removed_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                # Delete anything that does not have an image extension
                os.remove(os.path.join(folder_path, given_file))
                removed_count += 1
        print(f"Folder: {folder} - image files kept: {image_count}")
        print(f"Folder: {folder} - non-image files removed: {removed_count}")

In [None]:
remove_non_image_file(my_data_dir='inputs/mildew_dataset/cherry-leaves')
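
Because `remove_non_image_file` deletes files permanently, a dry run can be useful first. The sketch below uses the same extension rule to list what would be removed without touching anything; `list_non_image_files` is a hypothetical helper name, not part of the notebook.

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def list_non_image_files(data_dir):
    """Return sorted paths of files under data_dir that lack an image extension.

    Hypothetical dry-run helper: it only reports files, it never deletes them.
    """
    non_images = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            if not name.lower().endswith(IMAGE_EXTENSIONS):
                non_images.append(os.path.join(root, name))
    return sorted(non_images)
```

Running `list_non_image_files('inputs/mildew_dataset/cherry-leaves')` before the cleaning cell shows which files would be affected.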

## Split data into train, validation and test sets

In [12]:
import math
import os
import shutil
import random

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """
    Split the dataset into training, validation, and test sets. 
    """
    # Check that the ratios sum to 1.0, allowing for floating-point rounding
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Get the class labels in the dataset directory 
    labels = os.listdir(my_data_dir)  
    # Check if the 'test' folder already exists 
    if 'test' in labels:
        pass
    else:
        # Create 'train', 'validation', and 'test' folders with class subfolders
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)
        # Iterate through each class label
        for label in labels:
            # Get the list of files in the current class label directory
            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)
            # Calculate the number of files for train, validation, and test sets
            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # Move the file to the 'train' set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/' + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # Move the file to the 'validation' set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/' + file_name)
                else:
                    # Move the file to the 'test' set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/' + file_name)

                count += 1
            # Remove the original class directory after all files are moved
            os.rmdir(my_data_dir + '/' + label)

### Justification for Data Split Ratios:
- The chosen data split ratios of 70/10/20 are common in machine learning. 
    - 70% for training provides sufficient data for the model to learn the patterns.
    - 10% for validation allows the model's performance to be monitored during training on data it has not been fitted to, and supports hyperparameter tuning.
    - 20% for testing ensures a robust final evaluation of the model's generalization ability on unseen data. 
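
The split function above truncates the train and validation counts with `int()` and sends every remaining file to the test set, so the realised proportions can differ slightly from 70/10/20 for small classes. A minimal sketch of that arithmetic follows; `split_counts` is a hypothetical helper that mirrors the counting logic only and moves no files.

```python
def split_counts(n_files, train_ratio=0.7, validation_ratio=0.1):
    """Return (train, validation, test) counts for one class folder.

    Mirrors the counting in split_train_validation_test_images:
    train and validation are truncated, test receives the remainder.
    """
    train = int(n_files * train_ratio)
    validation = int(n_files * validation_ratio)
    test = n_files - train - validation
    return train, validation, test


print(split_counts(1000))  # (700, 100, 200)
print(split_counts(15))    # (10, 1, 4)
```

For a class with only 15 files, the test set ends up with 4 of 15 images (about 27%), which shows how the remainder rule favours the test split when counts are small.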

In [13]:
split_train_validation_test_images(my_data_dir="inputs/mildew_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
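
After the split, a quick count per folder can confirm that the files landed where expected. This is a small sketch assuming the `train`/`validation`/`test` layout created above, with one subfolder per class label; `count_split_files` is a hypothetical helper name.

```python
import os


def count_split_files(data_dir, splits=("train", "validation", "test")):
    """Return {split: {label: file_count}} for the split directory layout.

    Hypothetical helper; assumes data_dir/<split>/<label>/ folders exist.
    """
    counts = {}
    for split in splits:
        split_dir = os.path.join(data_dir, split)
        counts[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
        }
    return counts
```

For this notebook the call would be `count_split_files("inputs/mildew_dataset/cherry-leaves")`.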

## Data Augmentation

- ImageDataGenerator is used to augment the training data with random transformations.
- This helps to increase the diversity of the training set and improve the model's robustness.

In [None]:
train_data_gen = ImageDataGenerator(
    rotation_range=20,  # Randomly rotate images by up to 20 degrees
    width_shift_range=0.2,  # Randomly shift images horizontally by up to 20% of the width
    height_shift_range=0.2,  # Randomly shift images vertically by up to 20% of the height
    shear_range=0.2,  # Randomly apply shearing transformations
    zoom_range=0.2,  # Randomly zoom in or out on images
    horizontal_flip=True,  # Randomly flip images horizontally
    fill_mode='nearest'  # Fill in missing pixels with the nearest value
)

### Justification for Data Augmentation Techniques:
- The chosen data augmentation techniques help to introduce variability in the training data, making the model more robust to different orientations, positions, and scales of the leaves in the images. This can help to improve the model's generalization ability. 
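
To make two of these transformations concrete, here is a toy sketch on a tiny image stored as nested lists. It illustrates a horizontal flip and a horizontal shift that pads with the nearest edge pixel, which is the idea behind `fill_mode='nearest'`. It is not how Keras implements augmentation, and the function names are hypothetical.

```python
def horizontal_flip(image):
    """Mirror each row of a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def shift_horizontal_nearest(image, shift):
    """Shift each row right by `shift` pixels (left if negative).

    Positions that fall outside the original row are filled with the
    nearest edge pixel, similar in spirit to fill_mode='nearest'.
    """
    shifted = []
    for row in image:
        width = len(row)
        shifted.append(
            [row[min(max(col - shift, 0), width - 1)] for col in range(width)]
        )
    return shifted


image = [[1, 2, 3],
         [4, 5, 6]]
print(horizontal_flip(image))               # [[3, 2, 1], [6, 5, 4]]
print(shift_horizontal_nearest(image, 1))   # [[1, 1, 2], [4, 4, 5]]
print(shift_horizontal_nearest(image, -1))  # [[2, 3, 3], [5, 6, 6]]
```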

In [None]:
# Apply data augmentation to the training set
train_path = "inputs/mildew_dataset/cherry-leaves/train"
# Keep a reference to the iterator so batches can be drawn from it later.
# Augmented images are only written to save_to_dir as batches are drawn.
train_generator = train_data_gen.flow_from_directory(
    train_path,
    target_size=(128, 128),  # Resize images to 128x128
    batch_size=32,
    class_mode='binary',  # Binary classification (healthy or infected)
    save_to_dir=train_path,  # Save augmented images to the training directory
    save_prefix='augmented_',  # Prefix for augmented image filenames
    save_format='png'  # Save images in PNG format
)

### Visualize Augmented Images

In [None]:
import matplotlib.pyplot as plt

def visualize_augmented_images(generator, n_images=5):
    """
    Visualizes augmented images drawn from a Keras directory iterator.

    Args:
      generator: The iterator returned by ImageDataGenerator.flow_from_directory.
      n_images: The number of images to visualize.
    """
    for i in range(n_images):
        # Each call returns a batch of images and labels; show the first image
        images, labels = next(generator)
        # No rescale was set, so pixel values are floats in [0, 255]; cast for display
        plt.imshow(images[0].astype('uint8'))
        plt.title(f"Augmented Image {i+1}")
        plt.axis('off')
        plt.show()

visualize_augmented_images(train_generator)

### Missing Value Handling
- This dataset does not contain any missing values. However, in general, for image datasets, missing values could manifest as missing pixels or corrupted image regions.
- Strategies for handling missing values in image data include:
    - Removing images with missing data if the percentage of missing pixels is significant.
    - Imputing missing pixels using techniques like interpolation or inpainting.
    - Using data augmentation techniques that introduce missingness to make the model more robust to missing data.
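
As a first step towards spotting unusable files, the sketch below checks whether a file starts with the standard PNG or JPEG signature bytes. This is only a rough screen under that assumption: it catches empty or mislabelled files, but not images that are truncated or damaged further in. `has_image_signature` is a hypothetical helper name.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # 8-byte PNG file signature
JPEG_SIGNATURE = b"\xff\xd8\xff"      # JPEG start-of-image marker plus next marker prefix


def has_image_signature(path):
    """Return True if the file begins with a PNG or JPEG signature.

    Hypothetical helper: a header check only, not a full decode of the image.
    """
    with open(path, "rb") as image_file:
        header = image_file.read(len(PNG_SIGNATURE))
    return header.startswith(PNG_SIGNATURE) or header.startswith(JPEG_SIGNATURE)
```

Files for which this returns `False` are candidates for manual inspection or removal before training.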

## Conclusions and Next Steps

### Conclusions
The data collection and preprocessing steps were successfully completed:
- The Cherry Leaves dataset was downloaded from Kaggle.
- Non-image files were removed to ensure data integrity.
- The data was split into training, validation, and test sets.
- Data augmentation was applied to the training set to increase diversity and improve model generalization.

### Next Steps
The preprocessed and augmented data is now ready for the next steps:
- Exploratory data analysis will be performed to gain insights into the data.
- A suitable model architecture will be selected and trained using the prepared data.
- The trained model will be evaluated on the test set to assess its performance.

---