# Data Collection: Retrieve Data from Kaggle

## Objectives

* Retrieve data from Kaggle and preprocess it for subsequent steps.

## Inputs

* Kaggle JSON file - authentication token.

## Outputs

* Generated dataset: `inputs/cherry_leaves_dataset`

## Additional Comments | Insights | Conclusions

* No further comments provided.

---

# Import packages

In [1]:
%pip install -r /workspace/cherry-leaves-mildew-detector-project-5/requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting absl-py==2.0.0
  Using cached absl_py-2.0.0-py3-none-any.whl (130 kB)
Collecting altair==5.2.0
  Using cached altair-5.2.0-py3-none-any.whl (996 kB)
Collecting astunparse==1.6.3
  Using cached astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting attrs==23.2.0
  Using cached attrs-23.2.0-py3-none-any.whl (60 kB)
Collecting blinker==1.7.0
  Using cached blinker-1.7.0-py3-none-any.whl (13 kB)
Collecting cachetools==5.3.2
  Using cached cachetools-5.3.2-py3-none-any.whl (9.3 kB)
Collecting certifi==2023.11.17
  Using cached certifi-2023.11.17-py3-none-any.whl (162 kB)
Collecting charset-normalizer==3.3.2
  Using cached charset_normalizer-3.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (142 kB)
Collecting click==8.1.7
  Using cached click-8.1.7-py3-none-any.whl (97 kB)
Collecting contourpy==1.2.0
  Using cached contourpy-1.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.w

In [13]:
import os
import numpy
import shutil
import random
from pathlib import Path

---

## Check/Set the working directory

In [3]:
expected_parent_dir = "/workspace/cherry-leaves-mildew-detector-project-5/"

current_dir = os.getcwd()
print(f"Current directory: {current_dir}")

# Compare normalized full paths. Comparing a directory basename with a
# full path can never match, so the check would always trigger chdir.
if os.path.normpath(current_dir) != os.path.normpath(expected_parent_dir):
    os.chdir(expected_parent_dir)
    print(f"Changed to the expected parent directory: {expected_parent_dir}")
else:
    print(f"Already in the expected parent directory: {current_dir}")

current_dir = os.getcwd()
print(f"Current directory: {current_dir}")


Current directory: /workspace/cherry-leaves-mildew-detector-project-5/jupyter_notebooks
Changed to the expected parent directory: /workspace/cherry-leaves-mildew-detector-project-5/
Current directory: /workspace/cherry-leaves-mildew-detector-project-5
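
The hard-coded absolute path ties this notebook to one workspace. As a minimal sketch of a more portable check, the project root could be located by walking up from the current directory until a marker file is found. The helper name `find_project_root` and the choice of `requirements.txt` as the marker are assumptions made for illustration, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Return the nearest ancestor of `start` (inclusive) containing `marker`.

    Hypothetical helper: the marker file name is an assumption.
    Returns None when no ancestor contains the marker.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None
```

A caller could then run `os.chdir(root)` only when `find_project_root(Path.cwd())` returns a path rather than `None`.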


---

### Install Kaggle
* Kaggle should already be installed by the import packages step above. If it is not, run the command below.

In [4]:
%pip install kaggle

Defaulting to user installation because normal site-packages is not writeable
Collecting kaggle
  Downloading kaggle-1.6.3.tar.gz (84 kB)
  Preparing metadata (setup.py) ... done
Collecting bleach
  Using cached bleach-6.1.0-py3-none-any.whl (162 kB)
Collecting python-slugify
  Using cached python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting tqdm
  Downloading tqdm-4.66.1-py3-none-any.whl (78 kB)
Collecting webencodings
  Downloading webencodings-0.5.1-py2.py3-none-any.whl (11 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Building wheels for collected p

### Kaggle config

Run the cell below to point the Kaggle configuration directory (`KAGGLE_CONFIG_DIR`) at the current working directory and to restrict the permissions of the Kaggle authentication JSON file to the owner.

In [14]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 /workspace/cherry-leaves-mildew-detector-project-5/kaggle.json
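
The same configuration can be done in pure Python, which also makes it possible to fail early when the token is missing. The sketch below is only an illustration: the helper name `secure_kaggle_token` is hypothetical, and the permission bits assume a POSIX system.

```python
import os
import stat


def secure_kaggle_token(config_dir):
    """Restrict kaggle.json in `config_dir` to owner read/write (mode 600).

    Hypothetical helper; raises FileNotFoundError if the token is missing.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"Kaggle token not found: {token_path}")
    # Equivalent to `chmod 600` on POSIX systems.
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    # Point the Kaggle client at the directory holding the token.
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    return token_path
```

Calling `secure_kaggle_token(os.getcwd())` would mirror the cell above, assuming `kaggle.json` sits in the project root.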

Set the Kaggle dataset path and the destination folder, then download the dataset.

In [15]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
 98%|█████████████████████████████████████▎| 54.0M/55.0M [00:01<00:00, 50.6MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 36.2MB/s]


In [16]:
zip_file_path = os.path.join(DestinationFolder, 'cherry-leaves.zip')

shutil.unpack_archive(zip_file_path, DestinationFolder, 'zip')

os.remove(zip_file_path)
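
The cell above deletes the archive unconditionally after unpacking. A slightly more defensive variant, sketched here under the assumption that the download is a regular zip file, validates the archive first and removes it only after extraction has succeeded. The helper name `unpack_and_remove` is hypothetical.

```python
import os
import shutil
import zipfile


def unpack_and_remove(zip_path, destination):
    """Extract `zip_path` into `destination`, then delete the archive.

    Hypothetical helper: the archive is removed only if extraction
    completed without raising.
    """
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"Not a valid zip archive: {zip_path}")
    shutil.unpack_archive(zip_path, destination, "zip")
    os.remove(zip_path)
    # Return the top-level entries so the caller can inspect the result.
    return sorted(os.listdir(destination))
```

With the variables from the notebook, the call would be `unpack_and_remove(zip_file_path, DestinationFolder)`.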


---

# Data Preparation

### Check for and remove non-image files

In [17]:
def remove_non_image_file(my_data_dir):
    """
    Remove non-image files from each subfolder within a given directory.

    Parameters:
    - my_data_dir (str): The path to the main directory containing subfolders.

    Pseudo-code:
    - Iterate over each subfolder in the given directory.
    - For each subfolder, iterate over the files and remove those not ending
    with specified image extensions.
    - Print the count of image and non-image files for each subfolder.
    """

    image_extensions = ('.png', '.jpg', '.jpeg')

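    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)

        # Skip stray files at the top level; only subfolders are scanned.
        if not os.path.isdir(folder_path):
            continue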

        image_count = 0
        non_image_count = 0

        for given_file in os.listdir(folder_path):
            file_location = os.path.join(folder_path, given_file)

            if not given_file.lower().endswith(image_extensions):
                if os.path.isfile(file_location):
                    os.remove(file_location)
                    non_image_count += 1
                elif os.path.isdir(file_location):
                    shutil.rmtree(file_location)
            else:
                image_count += 1

        print(f"Folder: {folder} - has image file {image_count}")
        print(f"Folder: {folder} - has non-image file {non_image_count}")


remove_non_image_file('inputs/cherry_leaves_dataset/cherry-leaves')


Folder: test - has image file 0
Folder: test - has non-image file 0
Folder: train - has image file 0
Folder: train - has non-image file 0
Folder: validation - has image file 0
Folder: validation - has non-image file 0
Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
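
Because `remove_non_image_file` deletes files and removes nested directories, a read-only pass beforehand can show what would be affected. The following is a minimal sketch of such a dry run; the helper name `count_files_by_type` is hypothetical, and unlike the function above it counts files recursively, so images nested in folders such as `train/healthy` are included.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def count_files_by_type(data_dir):
    """Return {subfolder: (image_count, non_image_count)} without deleting.

    Files are counted recursively inside each top-level subfolder.
    """
    counts = {}
    for folder in sorted(Path(data_dir).iterdir()):
        if not folder.is_dir():
            continue
        files = [p for p in folder.rglob("*") if p.is_file()]
        images = sum(p.suffix.lower() in IMAGE_EXTENSIONS for p in files)
        counts[folder.name] = (images, len(files) - images)
    return counts
```

Running `count_files_by_type('inputs/cherry_leaves_dataset/cherry-leaves')` before the destructive step gives a per-folder summary to review.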


### Split into train, validation, and test sets

In [19]:
def split_train_validation_test_images(my_data_dir, train_set_ratio,
                                       validation_set_ratio, test_set_ratio):
    """
    Split the dataset into training, validation, and test sets based on the
    given ratios.

    Parameters:
    - my_data_dir (str): The main directory containing subfolders
    representing different classes.
    - train_set_ratio (float): The ratio of the dataset to be used for
    training.
    - validation_set_ratio (float): The ratio of the dataset to be used for
    validation.
    - test_set_ratio (float): The ratio of the dataset to be used for testing.

    Pseudo-code:
    - Check if the sum of train_set_ratio, validation_set_ratio, and
    test_set_ratio is equal to 1.0.
    - Create subdirectories ('train', 'validation', 'test') for each class
    label.
    - Shuffle the files in each class label's directory.
    - Move the files to the appropriate set directories based on the given
    ratios.
    - Remove the original class label directories.
    """

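    # Compare with a tolerance: exact float equality can fail through
    # rounding even when the ratios nominally sum to 1.0.
    ratio_sum = train_set_ratio + validation_set_ratio + test_set_ratio
    if abs(ratio_sum - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + "
              "test_set_ratio should sum to 1.0")
        return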

    labels = [
        label
        for label in os.listdir(my_data_dir)
        if os.path.isdir(os.path.join(my_data_dir, label))
    ]

    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=os.path.join(my_data_dir,
                        folder, label), exist_ok=True)

    for label in labels:
        files = os.listdir(os.path.join(my_data_dir, label))
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            destination_folder = ''
            if count <= train_set_files_qty:
                destination_folder = 'train'
            elif count <= (train_set_files_qty + validation_set_files_qty):
                destination_folder = 'validation'
            else:
                destination_folder = 'test'

            source_path = os.path.join(my_data_dir, label, file_name)
            destination_path = os.path.join(my_data_dir,
                                            destination_folder,
                                            label, file_name)
            shutil.move(source_path, destination_path)

        os.rmdir(os.path.join(my_data_dir, label))


split_train_validation_test_images('inputs/cherry_leaves_dataset'
                                   '/cherry-leaves', 0.7, 0.15, 0.15)
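
To see how the ratios translate into file counts, the truncation rule used above (`int()` for the train and validation sets, with the remainder going to the test set) can be reproduced on its own. The helper name `split_sizes` is hypothetical; the total of 2104 images per class is taken from the cell output earlier in this notebook.

```python
def split_sizes(total, train_ratio, validation_ratio):
    """Return (train, validation, test) counts for `total` files.

    Mirrors the rule in split_train_validation_test_images: train and
    validation counts are truncated with int(); the test set receives
    whatever remains.
    """
    train = int(total * train_ratio)
    validation = int(total * validation_ratio)
    return train, validation, total - train - validation


print(split_sizes(2104, 0.7, 0.15))  # prints (1472, 315, 317)
```

Because of the truncation, the test set ends up slightly larger than the validation set even though both ratios are 0.15. Note also that `random.shuffle` is unseeded in the function above, so calling `random.seed(...)` with a fixed value beforehand would be needed for a reproducible split.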
