# **Data Collection and Preparation Notebook**

## Objectives

* Fetch data from Kaggle and prepare it.

## Inputs

* Kaggle JSON file as authentication token

## Outputs

* Generate Dataset: inputs/mildew_dataset

## Additional Comments

* No comments



---

# Import Packages

In [None]:
! pip install -r requirements.txt

Collecting numpy==1.26.1 (from -r requirements.txt (line 1))
  Using cached numpy-1.26.1-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
Collecting pandas==2.1.1 (from -r requirements.txt (line 2))
  Using cached pandas-2.1.1-cp312-cp312-macosx_11_0_arm64.whl.metadata (18 kB)
Collecting matplotlib==3.8.0 (from -r requirements.txt (line 3))
  Using cached matplotlib-3.8.0-cp312-cp312-macosx_11_0_arm64.whl.metadata (5.8 kB)
Collecting seaborn==0.13.2 (from -r requirements.txt (line 4))
  Using cached seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting plotly==5.17.0 (from -r requirements.txt (line 5))
  Using cached plotly-5.17.0-py2.py3-none-any.whl.metadata (7.0 kB)
Collecting Pillow==10.0.1 (from -r requirements.txt (line 6))
  Using cached Pillow-10.0.1-cp312-cp312-macosx_11_0_arm64.whl.metadata (9.5 kB)
Collecting streamlit==1.40.2 (from -r requirements.txt (line 7))
  Using cached streamlit-1.40.2-py2.py3-none-any.whl.metadata (8.4 kB)
Collecting joblib==1.4.2 (from -r 

# Change working directory

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/Users/nathalievonheyl/Documents/ci-code_institute/PP5/pp5_mildew-detection/jupyter_notebooks'

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/Users/nathalievonheyl/Documents/ci-code_institute/PP5/pp5_mildew-detection'

# Install Kaggle and download the dataset

In [7]:
! pip install kaggle

Collecting kaggle
  Downloading kaggle-1.7.4.2-py3-none-any.whl.metadata (16 kB)
Collecting bleach (from kaggle)
  Using cached bleach-6.2.0-py3-none-any.whl.metadata (30 kB)
Collecting certifi>=14.05.14 (from kaggle)
  Using cached certifi-2025.1.31-py3-none-any.whl.metadata (2.5 kB)
Collecting charset-normalizer (from kaggle)
  Using cached charset_normalizer-3.4.1-cp312-cp312-macosx_10_13_universal2.whl.metadata (35 kB)
Collecting idna (from kaggle)
  Using cached idna-3.10-py3-none-any.whl.metadata (10 kB)
Collecting protobuf (from kaggle)
  Downloading protobuf-6.30.1-cp39-abi3-macosx_10_9_universal2.whl.metadata (593 bytes)
Collecting python-slugify (from kaggle)
  Using cached python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting requests (from kaggle)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting setuptools>=21.0.0 (from kaggle)
  Downloading setuptools-76.1.0-py3-none-any.whl.metadata (6.7 kB)
Collecting text-unidecode (from kagg

Set the Kaggle configuration directory to the current working directory and restrict the permissions of the Kaggle authentication JSON file so that only the owner can read and write it.

In [10]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
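The same permission change can be made from Python, which also allows checking that the token exists before continuing. A minimal sketch, assuming the token sits in the working directory; `secure_kaggle_token` is a hypothetical helper name and not part of the Kaggle package:

```python
import os
import stat

def secure_kaggle_token(token_path="kaggle.json"):
    """Restrict the token file to owner read/write (equivalent to chmod 600)."""
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    # return the resulting permission bits so they can be inspected
    return stat.S_IMODE(os.stat(token_path).st_mode)
```

On macOS and Linux, `oct(secure_kaggle_token())` should then show `0o600`.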

Download Kaggle dataset

In [36]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown


In [38]:
import zipfile

zip_path = os.path.join(DestinationFolder, "cherry-leaves.zip")

with zipfile.ZipFile(zip_path, "r") as zip_ref:
  zip_ref.extractall(DestinationFolder)

os.remove(zip_path)
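If the download did not complete, the extraction step fails with a less obvious error. The sketch below wraps the same steps in a function that first checks for a valid archive and deletes it only after extraction has finished. `extract_dataset_zip` is a hypothetical helper name introduced for illustration:

```python
import os
import zipfile

def extract_dataset_zip(zip_path, destination):
    """Extract a downloaded archive and delete it once extraction has succeeded."""
    # is_zipfile returns False for missing files as well as for non-zip files
    if not zipfile.is_zipfile(zip_path):
        raise FileNotFoundError(f"No valid zip archive at: {zip_path}")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
        extracted = zip_ref.namelist()
    os.remove(zip_path)
    return extracted
```

It could be called as `extract_dataset_zip(zip_path, DestinationFolder)` with the variables defined in the cells above.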

---

# Data Cleaning and Preparation

---

## Data Cleaning: Remove non-image files

I build the file paths dynamically with `os.path.join()`, which inserts the correct path separator and makes the paths easier to handle.

In [39]:
def remove_non_image_file(my_data_dir):
  """
  Function to remove non-image files
  from the dataset directory.
  """
  image_extension = ('.png', '.jpg', '.jpeg')
  folders = os.listdir(my_data_dir)  
  for folder in folders:
    folder_path =  os.path.join(my_data_dir, folder)

    if not os.path.isdir(folder_path):
      continue

    files = os.listdir(folder_path)

    # count removed (non-image) and kept (image) files per folder
    non_image_count = 0
    image_count = 0
    for given_file in files:
      if not given_file.lower().endswith(image_extension):
        file_location = os.path.join(folder_path, given_file)
        os.remove(file_location)
        non_image_count += 1
      else:
        image_count += 1
    print(f"Folder: {folder} - has image file", image_count)
    print(f"Folder: {folder} - has non-image file", non_image_count)


From previous experience, I discovered that the macOS Finder can create `.DS_Store` files in the folders it opens. `os.listdir()` returns every entry in a directory, files and folders alike, so a `.DS_Store` file can end up in what the code treats as the list of class folders. This can then cause errors when the directories are processed further as class labels in image classification.

I included an `if` check with `os.path.isdir()` that skips every item that is not a real directory. `.DS_Store` files at the top level of the dataset are therefore already skipped in this process.
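The directory check can also be isolated into a small helper that returns only real class folders. This is a minimal sketch; `list_class_folders` is a hypothetical name and is not used elsewhere in this notebook:

```python
import os

def list_class_folders(data_dir):
    """Return the sorted names of sub-directories, ignoring files such as .DS_Store."""
    return sorted(
        entry for entry in os.listdir(data_dir)
        if os.path.isdir(os.path.join(data_dir, entry))
    )
```

For a folder that contains `healthy/`, `powdery_mildew/` and a stray `.DS_Store` file, only the two directory names are returned.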

In [41]:
remove_non_image_file(my_data_dir='inputs/mildew_dataset/cherry-leaves')

Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0


---

## Split data into train, validation, and test sets

In [42]:
import math
import os
import random
import shutil

def split_dataset_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
  """
  Function to split images into train, validation, and test sets.

  The function assumes `my_data_dir` contains folders representing class labels.

  Checks if the sum of all ratios equals 1 (within floating-point tolerance).
  Checks if dataset is already split by looking for 'test' folder.
  Creates 'train', 'validation', 'test' directories if they do not yet exist.
  Shuffles and distributes images based on ratio passed by parameters.
  Moves images into their respective dataset folders.
  Deletes original class folders after images have been moved.
  """
  # make sure the dataset ratios sum to 1
  # math.isclose avoids false failures caused by floating-point rounding
  if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
    return

  # check if datasets have already been split
  # check for test folder as an example
  # exit function to avoid unnecessary processing
  if os.path.exists(os.path.join(my_data_dir, 'test')):
    print("Datasets have already been split.")
    return

  # get list of labels (folders only, so stray files such as .DS_Store are skipped)
  labels = [entry for entry in os.listdir(my_data_dir)
            if os.path.isdir(os.path.join(my_data_dir, entry))]

  # create train, validation, and test folders with one subfolder per label
  for folder in ['train', 'validation', 'test']:
    for label in labels:
      os.makedirs(os.path.join(my_data_dir, folder, label))

  # looping through labels
  for label in labels:
    files = os.listdir(os.path.join(my_data_dir, label))
    random.shuffle(files)

    # calculate number of files for train and validation split according to ratio
    # remaining images go into test set
    train_set_files_qty = int(len(files) * train_set_ratio)
    validation_set_files_qty = int(len(files) * validation_set_ratio)

    # move image data into new folders
    for count, file_name in enumerate(files, start=1):
      if count <= train_set_files_qty:
        target = 'train'
      elif count <= (train_set_files_qty + validation_set_files_qty):
        target = 'validation'
      else:
        target = 'test'
      shutil.move(os.path.join(my_data_dir, label, file_name),
                  os.path.join(my_data_dir, target, label, file_name))

    # delete original category folders after moving files
    os.rmdir(os.path.join(my_data_dir, label))
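A note on the ratio check: sums of decimal fractions are not always exact in floating point, so a tolerance-based comparison is more robust than strict equality with `1.0`. A minimal sketch using `math.isclose()` from the standard library; `ratios_sum_to_one` is a hypothetical helper name:

```python
import math

def ratios_sum_to_one(*ratios):
    """Return True if the ratios sum to 1 within floating-point tolerance."""
    return math.isclose(sum(ratios), 1.0)

print(0.1 + 0.2 == 0.3)                   # False: rounding error in binary floats
print(ratios_sum_to_one(0.7, 0.1, 0.2))   # True
print(ratios_sum_to_one(0.7, 0.2, 0.2))   # False
```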


Note to self: 

If any other hidden files similar to `.DS_Store` are left in the folder(s), `os.rmdir()` will fail, because it only removes empty directories (https://docs.python.org/3/library/os.html#os.rmdir).
Stack Overflow recommends `shutil.rmtree()` instead (https://stackoverflow.com/a/62244641/28803519).

Do more research and possibly refactor later.
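The difference described in the note can be reproduced on a throwaway folder. This is a minimal sketch: `remove_folder` is a hypothetical helper, and the fallback to `shutil.rmtree()` is an assumption about how a refactor might look, not a decision already made in this notebook. Keep in mind that `shutil.rmtree()` deletes everything inside the folder, so it should only run once the images have been moved out.

```python
import os
import shutil

def remove_folder(folder_path):
    """Try os.rmdir first; fall back to shutil.rmtree if the folder is not empty."""
    try:
        os.rmdir(folder_path)       # only succeeds on an empty directory
        return "rmdir"
    except OSError:
        shutil.rmtree(folder_path)  # removes the folder and everything inside it
        return "rmtree"
```

The return value shows which of the two calls actually removed the folder, which helps to spot leftover hidden files.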

In [44]:
split_dataset_images(my_data_dir="inputs/mildew_dataset/cherry-leaves",
                     train_set_ratio=0.7,
                     validation_set_ratio=0.1,
                     test_set_ratio=0.2
                     )
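To make the effect of the ratios concrete, the split sizes can be worked out from the 2104 images per class reported above. A small sketch that mirrors the arithmetic in `split_dataset_images`; `split_sizes` is a hypothetical helper used only for illustration:

```python
def split_sizes(n_files, train_ratio, validation_ratio):
    """Return (train, validation, test) counts; the test set gets the remainder."""
    train = int(n_files * train_ratio)
    validation = int(n_files * validation_ratio)
    test = n_files - train - validation
    return train, validation, test

print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```

Because `int()` rounds down, the test set receives the remainder and ends up slightly above 20 % of each class.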

---

# Conclusion and next steps

### Conclusion

The original image dataset with the "healthy" and "powdery_mildew" classes has been organised into 'train', 'validation', and 'test' folders, each with its own 'healthy' and 'powdery_mildew' subfolders.

The `remove_non_image_file` and `split_dataset_images` functions ensure a clean structure: they remove non-image files, skip hidden files that would otherwise be mistaken for folders (and interpreted as class labels), and avoid manual errors by using `os` methods.

The original unstructured image dataset has been successfully removed. The split datasets are ready for visualisation.

### Next Steps

The structured image data will be manipulated and plotted to understand and verify the dataset. The most relevant question is whether the classes of the dataset are balanced. In the next step I will also check once more for incorrect labels (e.g. hidden files mistaken for labels). The goal of the next step is to align the image data with the given business requirements.
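As a first check of class balance, the images in each split can be counted per label. A minimal sketch, assuming the folder layout produced above (`<data_dir>/<split>/<label>/`); `count_images_per_split` is a hypothetical helper, and the actual analysis in the next notebook may look different:

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')

def count_images_per_split(data_dir, splits=('train', 'validation', 'test')):
    """Return {split: {label: image_count}} for an already split dataset."""
    counts = {}
    for split in splits:
        split_path = os.path.join(data_dir, split)
        if not os.path.isdir(split_path):
            continue  # split folder missing, nothing to count
        counts[split] = {}
        for label in sorted(os.listdir(split_path)):
            label_path = os.path.join(split_path, label)
            if not os.path.isdir(label_path):
                continue  # skip stray files such as .DS_Store
            counts[split][label] = sum(
                1 for name in os.listdir(label_path)
                if name.lower().endswith(IMAGE_EXTENSIONS)
            )
    return counts
```

Any key other than 'healthy' or 'powdery_mildew' in the result would point to an incorrect label folder.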