# **Data Collection and Preparation Notebook**

## Objectives

* Fetch data from Kaggle and prepare it.

## Inputs

* Kaggle JSON file as authentication token

## Outputs

* Generate Dataset: inputs/mildew_dataset



---

## Import Libraries

In [None]:
! pip install -r requirements.txt

In [1]:
import os
import zipfile
import random
import shutil

## Set working directory

In [2]:
current_dir = os.getcwd()
current_dir

'/Users/nathalievonheyl/Documents/ci-code_institute/PP5/pp5_mildew-detection/jupyter_notebooks'

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/Users/nathalievonheyl/Documents/ci-code_institute/PP5/pp5_mildew-detection'

## Install and download Kaggle data set

In [None]:
! pip install kaggle

Set the Kaggle configuration directory to the current working directory and restrict the permissions of the `kaggle.json` authentication file so that only the owner can read and write it.

In [5]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
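
The `! chmod` shell call depends on a Unix-like shell being available. As a rough alternative, the same permission change can be sketched in pure Python with `os.chmod`. The helper name `restrict_token_permissions` is an assumption introduced here for illustration and is not part of the notebook.

```python
import os
import stat


def restrict_token_permissions(token_path="kaggle.json"):
    """Hypothetical helper: make the token readable/writable by its owner only."""
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    # S_IRUSR | S_IWUSR is the same mode as `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    # return the resulting permission bits so the caller can check them
    return stat.S_IMODE(os.stat(token_path).st_mode)
```

Calling `restrict_token_permissions()` from the project root would stand in for the shell command on macOS and Linux. On Windows, `os.chmod` only toggles the read-only flag, so the result may differ there.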

Download Kaggle dataset

In [6]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown


In [7]:
zip_path = os.path.join(DestinationFolder, "cherry-leaves.zip")

with zipfile.ZipFile(zip_path, "r") as zip_ref:
  zip_ref.extractall(DestinationFolder)

os.remove(zip_path)
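
Since the zip file is deleted after extraction, a quick look at the archive's contents beforehand can confirm the folder layout. The sketch below lists the top-level entries of an archive with `ZipFile.namelist()`; the helper name `top_level_entries` is hypothetical and not part of the notebook.

```python
import zipfile


def top_level_entries(zip_path):
    """Hypothetical helper: return the sorted top-level names inside a zip archive."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        # namelist() returns member paths with forward slashes,
        # e.g. "folder/subfolder/file.jpg"
        return sorted({name.split("/")[0] for name in zip_ref.namelist()})
```

It could be called as `top_level_entries(zip_path)` before the `extractall` step above.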

---

## Data Cleaning and Preparation

### Data Cleaning: Remove non-image files

To build file paths dynamically, I use `os.path.join()`, which avoids the separator errors that manual string concatenation can introduce.
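
A minimal sketch of the difference, assuming a made-up file name `example.jpg`:

```python
import os

my_data_dir = "inputs/mildew_dataset/cherry-leaves"
folder = "healthy"
given_file = "example.jpg"  # hypothetical file name, for illustration only

# Manual concatenation silently produces a wrong path if one separator is forgotten
broken_path = my_data_dir + folder + "/" + given_file

# os.path.join inserts the platform's separator between every component
file_location = os.path.join(my_data_dir, folder, given_file)

print(broken_path)
print(file_location)
```

The first path merges `cherry-leaves` and `healthy` into a single folder name, while the second keeps them as separate path components.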

In [8]:
def remove_non_image_file(my_data_dir):
  """
  Function to remove non-image files from the dataset directory.

  Args:
  * my_data_dir = path to image dataset folder
  """
  image_extension = ('.png', '.jpg', '.jpeg')
  folders = os.listdir(my_data_dir)  
  for folder in folders:
    folder_path =  os.path.join(my_data_dir, folder)

    if not os.path.isdir(folder_path):
      continue

    files = os.listdir(folder_path)

    image_count = 0
    non_image_count = 0
    for given_file in files:
      if not given_file.lower().endswith(image_extension):
        # delete anything that is not a .png, .jpg, or .jpeg file
        file_location = os.path.join(my_data_dir, folder, given_file)
        os.remove(file_location)
        non_image_count += 1
      else:
        image_count += 1
    print(f"Folder: {folder} - has image file", image_count)
    print(f"Folder: {folder} - has non-image file", non_image_count)


From previous experience, I discovered that the macOS Finder creates `.DS_Store` files in the folders it opens. These are files, not folders, but `os.listdir()` returns every entry in a directory, files and folders alike, so they end up in the list of folders. This can cause errors when the entries are later processed as class labels for image classification.

I included an `if` check that uses `os.path.isdir()` to test whether an item is a real directory. Anything that is not a directory, such as a `.DS_Store` file, is skipped with `continue`.
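
The effect of the `os.path.isdir()` check can be sketched on a throwaway directory. The layout below is an assumption that mimics the dataset folder on macOS; it does not touch the real data.

```python
import os
import tempfile

# Build a temporary directory with two class folders and a stray .DS_Store file
with tempfile.TemporaryDirectory() as tmp_dir:
    os.makedirs(os.path.join(tmp_dir, "healthy"))
    os.makedirs(os.path.join(tmp_dir, "powdery_mildew"))
    open(os.path.join(tmp_dir, ".DS_Store"), "w").close()

    # os.listdir returns files and folders alike
    all_entries = sorted(os.listdir(tmp_dir))

    # keep only the entries that are real directories
    class_folders = sorted(
        entry for entry in all_entries
        if os.path.isdir(os.path.join(tmp_dir, entry))
    )

print(all_entries)    # includes the .DS_Store file
print(class_folders)  # only the two class folders remain
```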

In [9]:
remove_non_image_file(my_data_dir='inputs/mildew_dataset/cherry-leaves')

Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0


### Split data into train, validation, and test sets

Organize the image dataset into train, validation, and test sets. Separate sets are essential for training the ML model properly and for evaluating it on images it has not seen.

Here's what the function does:

- Validate that the provided split ratios sum to 1
- Check whether the dataset has already been split by looking for an existing 'test' directory; if it exists, exit the function to avoid unnecessary processing
- Otherwise, create the 'train', 'validation', and 'test' directories
- Shuffle the images of each class and distribute the files to the new directories according to the split ratios
- Place images into their respective dataset folders
- Remove original class folders after successfully moving all images
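
As a worked example of the ratio step, the sketch below mirrors the arithmetic used in the function: the train and validation counts are truncated with `int()`, and the test set receives whatever remains. The helper name `split_sizes` is hypothetical. The inputs are the 2104 images per class reported above and the 0.7 / 0.1 / 0.2 ratios used in this notebook.

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Hypothetical helper: return (train, validation, test) file counts."""
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    # everything not assigned to train or validation ends up in test
    test_qty = n_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because of the truncation, the test set can end up slightly larger than its nominal ratio: here 422 files instead of 420.8.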

In [10]:
def split_dataset_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  """
  Function to split images into train, validation, and test sets.

  The function assumes `my_data_dir` contains folders representing class labels.
  
  Args:
  * my_data_dir = path to image dataset folder
  * train_set_ratio = float between 0-1
  * validation_set_ratio = float between 0-1
  * test_set_ratio = float between 0-1
  """
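  # make sure the dataset ratios sum to 1 (the tolerance guards against float rounding)
  if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return
  
  # get list of labels (class folders only; skips stray files such as .DS_Store)
  labels = [entry for entry in os.listdir(my_data_dir)
            if os.path.isdir(os.path.join(my_data_dir, entry))]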

  # detect if datasets have already been split
  if os.path.exists(os.path.join(my_data_dir, 'test')):
      print("Datasets have already been split.")
      return
  
  # create train, validation, and test folders with subfolders "healthy" and "powdery_mildew"
  else:
      for folder in ['train', 'validation', 'test']:
          for label in labels:
              os.makedirs(os.path.join(my_data_dir, folder, label))

      # looping through labels
      for label in labels:
          
          files = os.listdir(os.path.join(my_data_dir, label))
          random.shuffle(files)

          # calculate number of files per directory according to ratio
          train_set_files_qty = int(len(files) * train_set_ratio)
          validation_set_files_qty = int(len(files) * validation_set_ratio)

          # move img data into new folders
          count = 1
          for file_name in files:
              if count <= train_set_files_qty:
                  shutil.move(os.path.join(my_data_dir, label, file_name),
                              os.path.join(my_data_dir, 'train', label, file_name))

              elif count <= (train_set_files_qty + validation_set_files_qty):
                  shutil.move(os.path.join(my_data_dir, label, file_name),
                              os.path.join(my_data_dir, 'validation', label, file_name))
                  
              else:
                  shutil.move(os.path.join(my_data_dir, label, file_name),
                              os.path.join(my_data_dir, 'test', label, file_name))
              
              count += 1
          
          # delete original category folders after moving files
          os.rmdir(os.path.join(my_data_dir, label))


In [11]:
split_dataset_images(my_data_dir="inputs/mildew_dataset/cherry-leaves",
                     train_set_ratio=0.7,
                     validation_set_ratio=0.1,
                     test_set_ratio=0.2
                     )
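
The split cell prints nothing on success, so a quick count of the files in each new folder can confirm the result. The following is a minimal sketch; the helper name `count_files_per_split` and its `splits` parameter are assumptions introduced for illustration.

```python
import os


def count_files_per_split(my_data_dir, splits=("train", "validation", "test")):
    """Hypothetical helper: return {split: {label: number_of_files}}."""
    counts = {}
    for split in splits:
        split_path = os.path.join(my_data_dir, split)
        if not os.path.isdir(split_path):
            continue  # split folder missing, e.g. the split has not been run yet
        counts[split] = {}
        for label in sorted(os.listdir(split_path)):
            label_path = os.path.join(split_path, label)
            # ignore stray files such as .DS_Store
            if os.path.isdir(label_path):
                counts[split][label] = len(os.listdir(label_path))
    return counts
```

Calling `count_files_per_split("inputs/mildew_dataset/cherry-leaves")` after the split should return one entry per set, each holding the file count for every class label.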

---

## Conclusion and next steps

**Conclusion**

The original image dataset containing images classified by "healthy" and "powdery_mildew" has been successfully reorganized into structured 'train', 'validation', and 'test' directories with the respective subfolders containing the original class labels. 

The `remove_non_image_file` function handles the following points:
- Removing non-image files
- Skipping hidden files that could be mistakenly interpreted as class labels (e.g. .DS_Store)

The original unstructured class folders have been removed. The split datasets are ready for data visualization.

**Next Steps**

Next, the structured image data will be visualized to understand and verify the dataset's features.
Key objectives:
- Determining class distribution between "healthy" and "powdery_mildew" images throughout the split datasets
- Determining and plotting mean and variability images for the two class labels
- Creating an image montage for the dashboard of the future app