# **Data Collection**

## Objectives

* Collect the data from Kaggle, unzip it, prepare it and store it for further analysis.


## Inputs

* The Kaggle JSON authentication token file (kaggle.json).


## Outputs

* Create Dataset: inputs/datasets/cherry-leaves

## Additional Comments


* The data must be saved after being prepared: remove any files that are not images, then split the data into train, validation and test folders.



---

In [1]:
# %pip install numpy pandas matplotlib seaborn plotly streamlit scikit-learn tensorflow-cpu keras 
%pip install -r /workspace/project5/requirements.txt



Collecting numpy
  Using cached numpy-1.24.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting matplotlib
  Using cached matplotlib-3.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)
Collecting seaborn
  Using cached seaborn-0.12.2-py3-none-any.whl (293 kB)
Collecting plotly
  Using cached plotly-5.14.1-py2.py3-none-any.whl (15.3 MB)
Collecting streamlit
  Using cached streamlit-1.22.0-py2.py3-none-any.whl (8.9 MB)
Collecting scikit-learn
  Using cached scikit_learn-1.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.6 MB)
Collecting tensorflow-cpu
  Using cached tensorflow_cpu-2.12.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (231.8 MB)
Collecting keras
  Using cached keras-2.12.0-py2.py3-none-any.whl (1.7 MB)
Collecting protobuf
  Using cached protobuf-4.23.2-cp37-abi3-manylinux2014_x86_64.whl (304 kB)
Collecting pandas
  Using cached pandas-2.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_

In [2]:
import numpy
import os
import shutil
import random
import joblib

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor the working directory needs to be changed.

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/project5/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/workspace/project5'

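# Install Kaggle and download the dataset

Install the Kaggle package, point it at the authentication token, then download and unzip the dataset.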

In [6]:
%pip install kaggle==1.5.12


Collecting kaggle==1.5.12
  Using cached kaggle-1.5.12-py3-none-any.whl
Collecting python-slugify
  Using cached python_slugify-8.0.1-py2.py3-none-any.whl (9.7 kB)
Collecting text-unidecode>=1.3
  Using cached text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
Installing collected packages: text-unidecode, python-slugify, kaggle
Successfully installed kaggle-1.5.12 python-slugify-8.0.1 text-unidecode-1.3

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m23.1.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [7]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
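
A malformed or missing token file is a common cause of authentication failures, so it can help to check the file before downloading. The following is a minimal sketch, not part of the original workflow: `check_kaggle_token` is a hypothetical helper, and it assumes the token is a JSON object with `username` and `key` fields. It also applies the same owner-only permissions as the `chmod 600` command above.

```python
import json
import os
import stat
import tempfile


def check_kaggle_token(config_dir):
    """Validate kaggle.json in config_dir, restrict its permissions, return its path."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")

    with open(token_path) as token_file:
        token = json.load(token_file)

    # Assumption: a valid token holds at least these two fields
    missing = {"username", "key"} - set(token)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")

    # Same effect as `chmod 600 kaggle.json`: owner read/write only
    os.chmod(token_path, 0o600)
    return token_path


# Demonstration with a throwaway token in a temporary folder (placeholder values)
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "kaggle.json"), "w") as f:
        json.dump({"username": "demo-user", "key": "not-a-real-key"}, f)
    checked_path = check_kaggle_token(demo_dir)
    demo_mode = stat.S_IMODE(os.stat(checked_path).st_mode)
    print(oct(demo_mode))
```

In this notebook the call would be `check_kaggle_token(os.getcwd())`, since `KAGGLE_CONFIG_DIR` is set to the current directory.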

In [8]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/datasets"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/datasets
 95%|███████████████████████████████████▉  | 52.0M/55.0M [00:02<00:00, 28.3MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:02<00:00, 24.0MB/s]


In [9]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')

print("Done!")

Done!


---

# Data Preparation Stage

In this section we check the unzipped dataset for files that are not images, remove them, and split the remaining images into train, validation and test sets.

## Find and remove files that are not images

In [10]:
def remove_non_image_file(my_data_dir):
    """Removes non-image files from the specified directory"""
    image_extensions = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        if os.path.isdir(folder_path):
            for file in os.listdir(folder_path):
                if not file.lower().endswith(image_extensions):
                    os.remove(os.path.join(folder_path, file))
    print(f"Removed non-image files from {my_data_dir}")
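
Because the function above deletes files permanently, it may be worth listing what would be removed first. The sketch below is a hypothetical dry-run companion (`list_non_image_files` is not part of the original notebook). It applies the same extension test but only reports matching paths; the demo folder and file names are made up.

```python
import os
import tempfile


def list_non_image_files(my_data_dir, image_extensions=(".png", ".jpg", ".jpeg")):
    """Return paths of files in each label folder that lack an image extension."""
    non_image_files = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        for file_name in sorted(os.listdir(folder_path)):
            # Case-insensitive check, matching remove_non_image_file
            if not file_name.lower().endswith(image_extensions):
                non_image_files.append(os.path.join(folder_path, file_name))
    return non_image_files


# Demonstration on a throwaway folder tree (names are illustrative only)
with tempfile.TemporaryDirectory() as demo_dir:
    label_dir = os.path.join(demo_dir, "healthy")
    os.makedirs(label_dir)
    for name in ("leaf_1.JPG", "leaf_2.png", "notes.txt"):
        open(os.path.join(label_dir, name), "w").close()
    flagged = [os.path.basename(p) for p in list_non_image_files(demo_dir)]
    print(flagged)
```

Like the original function, this only inspects the first level of label folders and does not descend into nested subfolders.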


In [11]:
def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """ Split the data into train, validation and test sets """

    # Check that the ratios sum to 1, tolerating floating-point rounding
    if abs(train_set_ratio + validation_set_ratio + test_set_ratio - 1.0) > 1e-9:
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # Get the class labels
    labels = os.listdir(my_data_dir)

    # Create train, validation and test folders if they don't already exist
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(name=f"{my_data_dir}/{folder}/{label}", exist_ok=True)

    # Loop over each label and move files to the appropriate set
    for label in labels:
        files = os.listdir(f"{my_data_dir}/{label}")
        random.shuffle(files)

        # Calculate the number of files for each set
        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for i, file_name in enumerate(files):
            if i < train_set_files_qty:
                # Move given file to train set
                dest_dir = f"{my_data_dir}/train/{label}"
            elif i < (train_set_files_qty + validation_set_files_qty):
                # Move given file to validation set
                dest_dir = f"{my_data_dir}/validation/{label}"
            else:
                # Move given file to test set
                dest_dir = f"{my_data_dir}/test/{label}"
            shutil.move(f"{my_data_dir}/{label}/{file_name}", f"{dest_dir}/{file_name}")

        # Remove the original folder
        os.rmdir(f"{my_data_dir}/{label}")
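
The way the set sizes are derived can be shown in isolation. In this small sketch `split_counts` is a hypothetical helper that mirrors the counting logic of the function above. The train and validation sizes are truncated with `int()`, and every remaining file goes to the test set, so the test set absorbs any rounding remainder.

```python
def split_counts(total_files, train_set_ratio, validation_set_ratio):
    """Mirror the split logic: truncate train and validation, give the rest to test."""
    train_qty = int(total_files * train_set_ratio)
    validation_qty = int(total_files * validation_set_ratio)
    test_qty = total_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


# 1003 is an arbitrary illustrative count, not the size of the cherry-leaves dataset
counts = split_counts(1003, train_set_ratio=0.7, validation_set_ratio=0.1)
print(counts)  # (702, 100, 201)
```

With these numbers the test set receives 201 files, slightly more than 20% of 1003, because the fractional parts dropped from the other two sets end up there.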


In [12]:
split_train_validation_test_images(my_data_dir="inputs/datasets/cherry-leaves",
                        train_set_ratio = 0.7,
                        validation_set_ratio=0.1,
                        test_set_ratio=0.2
                        )

In [13]:


print("Done!")

Done!
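
One way to confirm the split worked is to count the files in each set and label folder. The sketch below assumes the `train`, `validation` and `test` layout created by the split function. `count_images_per_set` is a hypothetical helper, and the demo label name and counts are invented.

```python
import os
import tempfile


def count_images_per_set(my_data_dir, sets=("train", "validation", "test")):
    """Return {set_name: {label: file_count}} for the folders created by the split."""
    counts = {}
    for set_name in sets:
        set_path = os.path.join(my_data_dir, set_name)
        if not os.path.isdir(set_path):
            continue  # skip sets that have not been created
        counts[set_name] = {
            label: len(os.listdir(os.path.join(set_path, label)))
            for label in sorted(os.listdir(set_path))
        }
    return counts


# Demonstration on a throwaway tree; label name and counts are invented
with tempfile.TemporaryDirectory() as demo_dir:
    layout = {"train": 7, "validation": 1, "test": 2}
    for set_name, qty in layout.items():
        label_path = os.path.join(demo_dir, set_name, "healthy")
        os.makedirs(label_path)
        for i in range(qty):
            open(os.path.join(label_path, f"leaf_{i}.jpg"), "w").close()
    demo_counts = count_images_per_set(demo_dir)
    print(demo_counts)
```

On the real data the call would be `count_images_per_set("inputs/datasets/cherry-leaves")`.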


---


# Push files to Repo

* If you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [14]:
import os

try:
    # create your folder here
    os.makedirs(name='my_folder')
except Exception as e:
    print(e)


[Errno 17] File exists: 'my_folder'


# Summary

Work Done:

* The data has been downloaded and cleaned.

* There are three folders within the directory inputs/datasets/cherry-leaves, namely train, validation and test. Each of these folders contains two subfolders: one with images of healthy cherry leaves and the other with images of cherry leaves infected with powdery mildew.

* In the next notebook I will visualize the different types of leaves, compute their average and variability images, and show the contrast between them, to address business requirement number 1.




Issues:

* Was getting a 401 Unauthorized error when trying to use the kaggle.json file. Fixed by creating a new API key and using that instead.





To fix:

* Remove the kaggle.json file