# **Data Collection**

## Objectives

* Fetch data from Kaggle and save as raw data
* Inspect the data and check for non-image files
* Split the data into Train, Test and Validation sets
* Save it under inputs/cherry_leaves_dataset/cherry-leaves

## Inputs

* kaggle.json for the authentication token

## Outputs

* Generate Dataset Folders for sets:
  * Train Sets: 
    * inputs/cherry_leaves_dataset/cherry-leaves/train/healthy
    * inputs/cherry_leaves_dataset/cherry-leaves/train/powdery_mildew
  * Test Sets: 
    * inputs/cherry_leaves_dataset/cherry-leaves/test/healthy
    * inputs/cherry_leaves_dataset/cherry-leaves/test/powdery_mildew
  * Validation Sets: 
    * inputs/cherry_leaves_dataset/cherry-leaves/validation/healthy
    * inputs/cherry_leaves_dataset/cherry-leaves/validation/powdery_mildew

## Additional Comments

* This covers the second and third phases of the CRISP-DM workflow, which are data understanding and data preparation



---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [3]:
import os
current_dir = os.getcwd()
current_dir

'/Users/alitapantea/Documents/Projects/mildew-detection-project/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [4]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [5]:
current_dir = os.getcwd()
current_dir

'/Users/alitapantea/Documents/Projects/mildew-detection-project'
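
Note that re-running the os.chdir() cell above moves up one more level each time, which can leave the notebook outside the project. A minimal sketch of a guard follows. It assumes the notebooks live in a folder named jupyter_notebooks, as in the output above; the helper name resolve_project_root is hypothetical.

```python
from pathlib import Path


def resolve_project_root(cwd, notebooks_dirname="jupyter_notebooks"):
    """Return the parent of cwd only when cwd is the notebooks folder.

    Hypothetical helper: any other path (for example the project root
    itself) is returned unchanged, so calling it twice is harmless.
    """
    cwd = Path(cwd)
    if cwd.name == notebooks_dirname:
        return cwd.parent
    return cwd


# Illustrative path only; in the notebook you would pass os.getcwd()
# and hand the result to os.chdir().
print(resolve_project_root("/home/user/project/jupyter_notebooks"))
```

Because the helper only steps up when the current folder is the notebooks folder, the cell can be re-run safely.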

# Fetch data from Kaggle

First we need to install the Kaggle package

In [None]:
# install kaggle package
%pip install kaggle==1.5.12

We then point the Kaggle configuration directory at the current working directory and restrict the permissions of the Kaggle authentication file (kaggle.json) so that only its owner can read and write it

In [4]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
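
The `! chmod` call above relies on a Unix-like shell. As a sketch, the same permission change can be made from Python with the standard library; the helper name restrict_to_owner is hypothetical. On Windows, os.chmod can only toggle the read-only flag, so this does not give the same protection there.

```python
import os
import stat


def restrict_to_owner(path):
    """Set owner read/write only (mode 600), like `chmod 600 path`.

    Returns the resulting permission bits so the caller can check them.
    """
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)
    return stat.S_IMODE(os.stat(path).st_mode)


# Example usage (assumes kaggle.json sits in the current directory):
# restrict_to_owner("kaggle.json")
```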

* Get the dataset path from the [Kaggle URL](https://www.kaggle.com/codeinstitute/cherry-leaves).
* Set your destination folder.

In [6]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/cherry_leaves_dataset
100%|█████████████████████████████████████▉| 55.0M/55.0M [00:07<00:00, 9.36MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:07<00:00, 7.54MB/s]


Unzip the downloaded file and then delete it

In [7]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
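
The unzip step above can also be wrapped in a small function that checks the archive before extracting, so that a failed or partial download is reported rather than extracted. This is a sketch only; the function name extract_and_remove is hypothetical.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, then delete the archive.

    Returns the archive's member names so the caller can inspect
    what was extracted.
    """
    if not zipfile.is_zipfile(zip_path):
        raise ValueError(f"{zip_path} is not a valid zip archive")
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        members = zip_ref.namelist()
        zip_ref.extractall(destination)
    # Only reached when extraction succeeded
    os.remove(zip_path)
    return members


# Example usage with the names from the cells above:
# extract_and_remove(DestinationFolder + '/cherry-leaves.zip', DestinationFolder)
```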

---

# Data Preparation

## Data cleaning

#### Check for and remove non-image files

First, import the os library

In [8]:
import os

In [9]:
def check_files(data_dir):
    # Files with any other extension are treated as non-images and deleted
    img_ext = ('.png', '.jpg', '.jpeg')
    images = 0
    non_images = 0
    # os.walk visits data_dir and every subfolder beneath it
    for root, dirs, files in os.walk(data_dir):
        for file in files:
            if not file.lower().endswith(img_ext):
                filepath = os.path.join(root, file)
                os.remove(filepath)
                non_images += 1
            else:
                images += 1

    print(f'Found {non_images} files that were not images')
    print(f'Found {images} files that were images')
        

In [10]:
check_files('inputs/cherry_leaves_dataset/cherry-leaves/')

Found 0 files that were not images
Found 4208 files that were images


After this step, the dataset should not contain any non-image files.
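
Because check_files deletes files as it goes, it can help to list the candidates first. Below is a minimal sketch of a read-only variant; the name find_non_image_files is hypothetical. Matching on the file extension does not prove that a file is a valid image, only that it is named like one.

```python
import os

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def find_non_image_files(data_dir, image_extensions=IMAGE_EXTENSIONS):
    """Return sorted paths of files whose extension is not an image one.

    Nothing is deleted; the caller decides what to do with the result.
    """
    non_images = []
    for root, _dirs, files in os.walk(data_dir):
        for file in files:
            if not file.lower().endswith(image_extensions):
                non_images.append(os.path.join(root, file))
    return sorted(non_images)


# Example usage with the dataset path from this notebook:
# find_non_image_files('inputs/cherry_leaves_dataset/cherry-leaves/')
```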

---

## Split Train, Test and Validation set

The next step is to split the images into folders containing the Train, Test and Validation set needed for supervised learning. The folders will also keep the labelling as healthy or powdery_mildew.

The following function was taken from the malaria walkthrough project from Code Institute as a basis and adjusted as needed. 

In [12]:
import math
import os
import random
import shutil


def split_image_sets(data_dir, train_set_ratio, test_set_ratio, validation_set_ratio):
    # Compare with a tolerance: exact equality on floats is fragile
    # (for example, 0.1 + 0.2 != 0.3 in floating point)
    ratio_sum = train_set_ratio + validation_set_ratio + test_set_ratio
    if not math.isclose(ratio_sum, 1.0):
        print("The ratio of all three sets should sum up to 1.")
        return

    # Folder names double as class labels (healthy / powdery_mildew)
    labels = os.listdir(data_dir)

    # If a 'test' folder exists, the data was already split: do nothing
    if 'test' in labels:
        return

    for folder in ['train', 'test', 'validation']:
        for label in labels:
            os.makedirs(os.path.join(data_dir, folder, label))

    for label in labels:
        files = os.listdir(os.path.join(data_dir, label))
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        # First chunk goes to train, second to validation, the rest to test
        for count, file in enumerate(files, start=1):
            if count <= train_set_files_qty:
                target_set = 'train'
            elif count <= train_set_files_qty + validation_set_files_qty:
                target_set = 'validation'
            else:
                target_set = 'test'
            shutil.move(os.path.join(data_dir, label, file),
                        os.path.join(data_dir, target_set, label, file))

        # The original label folder is empty now, so remove it
        os.rmdir(os.path.join(data_dir, label))
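
The function above shuffles with random.shuffle and no seed, so each run on a fresh download produces a different split. If a reproducible split is wanted, one option is a seeded generator, sketched below. The helper name shuffled_copy and the seed value 42 are assumptions for illustration, not part of the original function.

```python
import random


def shuffled_copy(files, seed=42):
    """Return a shuffled copy of files; the same seed gives the same order.

    Sorting first removes the dependence on os.listdir(), which returns
    entries in arbitrary order.
    """
    rng = random.Random(seed)
    shuffled = sorted(files)
    rng.shuffle(shuffled)
    return shuffled


# Two calls with the same seed agree, whatever the input order was
first = shuffled_copy(["c.jpg", "a.jpg", "b.jpg", "d.jpg"])
second = shuffled_copy(["a.jpg", "b.jpg", "c.jpg", "d.jpg"])
print(first == second)
```

Using a dedicated random.Random instance keeps the seed local, so other code that relies on the global random state is not affected.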

A common convention, used here, is to divide the data as follows:
* The training set covers 70% of the data
* The test set covers 20% of the data
* The validation set covers 10% of the data

In [13]:
split_image_sets('inputs/cherry_leaves_dataset/cherry-leaves/', 0.7, 0.2, 0.1)

The images are now divided as follows:

In [11]:
import os

sets = ['train', 'test', 'validation']
labels = ['healthy', 'powdery_mildew']
# set_name avoids shadowing the built-in set type
for set_name in sets:
    for label in labels:
        folder = f'inputs/cherry_leaves_dataset/cherry-leaves/{set_name}/{label}'
        number_of_files = len(os.listdir(folder))
        print(f'There are {number_of_files} images in {set_name}/{label}')


There are 1472 images in train/healthy
There are 1472 images in train/powdery_mildew
There are 422 images in test/healthy
There are 422 images in test/powdery_mildew
There are 210 images in validation/healthy
There are 210 images in validation/powdery_mildew


Each set contains the same number of images for both labels, healthy and powdery_mildew, so the classes are balanced.
The train set is the largest, and the test set has roughly twice as many images as the validation set, which matches the 70/20/10 split.
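
The counts above follow from how split_image_sets sizes each set: the train and validation sizes are truncated with int(), and the test set receives whatever remains. A short sketch of that arithmetic, where the helper name split_counts is hypothetical and 2104 is the per-label total implied by the printed counts (1472 + 422 + 210):

```python
def split_counts(total, train_ratio, validation_ratio):
    """Mirror the counting in split_image_sets for one label folder.

    Train and validation sizes are truncated with int(); the test set
    takes the remainder, so it absorbs the rounding.
    """
    train = int(total * train_ratio)
    validation = int(total * validation_ratio)
    test = total - train - validation
    return train, validation, test


# Returns (train, validation, test) for one label
print(split_counts(2104, 0.7, 0.1))  # → (1472, 210, 422)
```

This is why the test set holds 422 images rather than exactly 20% of 2104 (420.8): it picks up the fractions dropped from the other two sets.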

---

# Next Steps

* Now that the data is cleaned (there are no non-image files) and the data is split into train, test and validation sets, we can start with the data visualization steps in the next notebook.