# **Data Collection**

## Objectives

* Fetch the dataset from Kaggle and prepare it for the later steps.

## Inputs

* Kaggle JSON file (kaggle.json), which holds the authentication token.

## Outputs

* The generated dataset will be saved in inputs/cherryleaves_dataset/cherry-leaves.
* The dataset will consist of 3 folders:
    - **train**, **test** & **validation**
* Each of these folders will contain 2 further folders:
    - **healthy** & **powdery_mildew**
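
The expected layout can be checked once the notebook has run. The following is a minimal sketch and not part of the original notebook: `missing_dataset_folders` is a hypothetical helper name, and the split and label names are taken from the list above.

```python
import os

SPLITS = ("train", "validation", "test")
LABELS = ("healthy", "powdery_mildew")


def missing_dataset_folders(root):
    """Return the expected split/label folders that do not exist under root."""
    missing = []
    for split in SPLITS:
        for label in LABELS:
            path = os.path.join(root, split, label)
            if not os.path.isdir(path):
                missing.append(path)
    return missing
```

If the layout matches the description, calling `missing_dataset_folders('inputs/cherryleaves_dataset/cherry-leaves')` should return an empty list.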

## Additional Comments

* These steps fetch the data, clean it and prepare it for the next step: Data Visualisation.



---

# Import Packages

In [2]:
! pip install -r /workspace/mildew-detection-in-cherry-leaves/requirements.txt

Collecting numpy==1.26.1 (from -r /workspace/mildew-detection-in-cherry-leaves/requirements.txt (line 1))
  Downloading numpy-1.26.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pandas==2.1.1 (from -r /workspace/mildew-detection-in-cherry-leaves/requirements.txt (line 2))
  Downloading pandas-2.1.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting matplotlib==3.8.0 (from -r /workspace/mildew-detection-in-cherry-leaves/requirements.txt (line 3))
  Downloading matplotlib-3.8.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Collecting seaborn==0.13.2 (from -r /workspace/mildew-detection-in-cherry-leaves/requirements.txt (line 4))
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Collecting plotly==5.17.0 (from -r /workspace/mildew-detection-in-cherry-leaves/requirements.txt (line 5))
  Downloading plotly-5.17.0-py2.py3-none-any.whl.metadata (7.0 kB)
Collecting Pillow=

# Change Working Directory

Changing the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection-in-cherry-leaves/jupyter_notebooks'

Making the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/mildew-detection-in-cherry-leaves'

# Kaggle Install

In [6]:
! pip install kaggle

Collecting kaggle
  Downloading kaggle-1.6.17.tar.gz (82 kB)
  Preparing metadata (setup.py) ... done
Collecting tqdm (from kaggle)
  Downloading tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Downloading python_slugify-8.0.4-py2.py3-none-any.whl (10 kB)
Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.6.17-py3-none-any.whl size=105786 sha256=03f87baf4ac536e33f0050e4e51c2e53fcccd787cf37c448a12b3f67f1322663
  Stored in directory: /home/gitpod/.cache/pip/wheels/46/d2/26/84d0a1acdb9c6baccf7d28cf06962ec80529fe1ad938489983
Successfull

Set the Kaggle configuration directory to the current working directory and restrict the permissions of the Kaggle authentication JSON file.

In [8]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: cannot access 'kaggle.json': No such file or directory
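
The `chmod` error above shows that kaggle.json was not in the working directory when this cell ran. A minimal sketch of a check that could run first, using only the standard library; `secure_kaggle_token` is a hypothetical helper, not part of the notebook or of the Kaggle package:

```python
import os
import stat


def secure_kaggle_token(config_dir):
    """Set kaggle.json in config_dir to mode 600; return False if it is missing."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        print(f"kaggle.json not found in {config_dir}; upload it before downloading")
        return False
    # owner read + write only, the equivalent of `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return True
```

Calling `secure_kaggle_token(os.getcwd())` reports a missing token explicitly instead of leaving the shell error as the only signal.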


Set the Kaggle dataset path and destination folder, then download the dataset.

In [9]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherryleaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Dataset URL: https://www.kaggle.com/datasets/codeinstitute/cherry-leaves
License(s): unknown
Downloading cherry-leaves.zip to inputs/cherryleaves_dataset
 98%|█████████████████████████████████████▎| 54.0M/55.0M [00:02<00:00, 34.8MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 25.7MB/s]


Unzip the downloaded file and then delete the zip file.

In [10]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
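
The cell above assumes the download succeeded. A slightly more defensive variant, sketched here as an assumption rather than the notebook's method, checks the archive before extracting and deletes it only after extraction; `extract_and_remove` is a hypothetical helper name.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Returns False without changing anything if zip_path is missing
    or is not a valid zip file.
    """
    if not zipfile.is_zipfile(zip_path):
        print(f"{zip_path} is missing or not a zip archive")
        return False
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
    os.remove(zip_path)
    return True
```

With the variables defined earlier, the call would be `extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)`.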

---

# **Data Preparation**

## Data Clean

In [11]:
def removing_non_image_file(my_data_dir):
    """Delete non-image files from each sub-folder of my_data_dir and report counts."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        if os.path.isdir(folder_path):
            files = os.listdir(folder_path)
            removed_count = 0  # non-image files that were deleted
            kept_count = 0     # entries left in place (sub-folders are counted too)
            for given_file in files:
                file_location = os.path.join(folder_path, given_file)
                if os.path.isfile(file_location) and not given_file.lower().endswith(image_extension):
                    os.remove(file_location)
                    removed_count += 1
                else:
                    kept_count += 1
            print(f"Folder: {folder} - has image file", kept_count)
            print(f"Folder: {folder} - has non-image file", removed_count)

In [12]:
removing_non_image_file(my_data_dir='inputs/cherryleaves_dataset/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
Folder: test - has image file 2
Folder: test - has non-image file 0
Folder: train - has image file 2
Folder: train - has non-image file 0
Folder: validation - has image file 2
Folder: validation - has non-image file 0
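
In the output above, test, train and validation each report 2 image files. This is consistent with the `else` branch counting every entry it does not delete, including sub-folders. A read-only sketch that counts only files with an image extension, at any depth, is shown below; `count_images` is a hypothetical helper and reuses the same extension list as the cleaning function.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def count_images(folder):
    """Count files under folder (recursively) whose name ends with an image extension."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(folder):
        total += sum(name.lower().endswith(IMAGE_EXTENSIONS) for name in filenames)
    return total
```

For example, `count_images('inputs/cherryleaves_dataset/cherry-leaves/train')` would report the images in both label folders of the training set rather than the number of sub-folders.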


## Divide Train, Validation and Test Sets

In [13]:
import os
import shutil
import random
! pip3 install joblib


def divide_train_validation_test_images(my_data_dir, train_set_ratio,
                                       validation_set_ratio, test_set_ratio):

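    # compare with a tolerance: a floating-point sum of ratios
    # is not guaranteed to equal 1.0 exactly
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio "
              "+ test_set_ratio expected sum to 1.0")
        return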

    # gets the classes labels
    labels = os.listdir(my_data_dir)  # expecting to get only the folder name
    if 'test' in labels:
        pass
    else:
        # creating train, validation and test folders, each with a sub-folder per class label
        for folder in ['train', 'validation', 'test']:
            for label in labels:
                os.makedirs(name=my_data_dir + '/' + folder + '/' + label)

        for label in labels:

            files = os.listdir(my_data_dir + '/' + label)
            random.shuffle(files)

            train_set_files_qty = int(len(files) * train_set_ratio)
            validation_set_files_qty = int(len(files) * validation_set_ratio)

            count = 1
            for file_name in files:
                if count <= train_set_files_qty:
                    # moving a selected file to the train set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/train/'
                                + label + '/' + file_name)

                elif count <= (train_set_files_qty + validation_set_files_qty):
                    # moving a selected file to the validation set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/validation/' + label + '/'
                                + file_name)

                else:
                    # moving a selected file to the test set
                    shutil.move(my_data_dir + '/' + label + '/' + file_name,
                                my_data_dir + '/test/' + label + '/'
                                + file_name)

                count += 1

            os.rmdir(my_data_dir + '/' + label)


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


- The training set receives **0.70** of the data.
- The test set receives **0.20** of the data.
- The validation set receives **0.10** of the data.

In [14]:
divide_train_validation_test_images(my_data_dir="inputs/cherryleaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
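
Because the function truncates with `int()` and sends every remaining file to the test set, the actual set sizes differ slightly from the nominal ratios. The sketch below repeats that arithmetic for the 2104 images per label reported by the cleaning step; `split_sizes` is a hypothetical helper that mirrors the counting logic without moving any files.

```python
def split_sizes(n_files, train_ratio, validation_ratio):
    """Return (train, validation, test) counts using the same truncation as the split function."""
    train = int(n_files * train_ratio)
    validation = int(n_files * validation_ratio)
    test = n_files - train - validation  # the remainder goes to the test set
    return train, validation, test


print(split_sizes(2104, 0.7, 0.1))  # (1472, 210, 422)
```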