# Data Collection and Preparation

---

## Objectives

* Import packages
* Change working directory
* Install Kaggle
* Clean data
* Split train, validation and test sets

## Inputs

* Kaggle JSON file - authentication token

## Outputs

* Generate dataset: inputs/cherry_leaves_dataset/cherry-leaves
* Split dataset into train, validation and test

## Additional Comments

* Once business requirements have been identified, data collection and cleaning are the first steps in developing a Machine Learning model
* As we are developing a model for image analysis, it's important to clean the data by removing non-image files
* We split the data into three sets: the training and validation sets are needed to check whether the model is overfitting or underfitting, and the test set is needed to estimate the generalised performance of the model

---

## Import packages

In [None]:
import numpy
import os

## Change working directory

The notebooks are stored in a subfolder of the project. When running a notebook in the editor, we therefore need to change the working directory from the notebook's folder to its parent folder.

* We access the current directory with os.getcwd()

In [None]:
current_dir = os.getcwd()
current_dir

* We want to make the parent of the current directory the new current directory; here the path is hard-coded for this workspace

In [None]:
os.chdir('/workspaces/mildew-detector')
print("You set a new current directory")

* Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
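
The path passed to `os.chdir` above is specific to one workspace. As a minimal sketch, assuming the notebook is started from a direct subfolder of the project root, the parent folder can be derived instead of hard-coded. The helper name `parent_of` and the folder name `jupyter_notebooks` are illustrative assumptions, not part of the original notebook.

```python
import os


def parent_of(path):
    """Return the parent directory of the given path (hypothetical helper)."""
    return os.path.dirname(os.path.abspath(path))


# Assumed example location of the notebooks inside the project.
notebook_dir = os.path.join(os.getcwd(), "jupyter_notebooks")
project_root = parent_of(notebook_dir)
print(project_root)

# In the notebook itself this would become:
# os.chdir(parent_of(os.getcwd()))
```

Unlike a hard-coded path, this moves up one level every time the cell runs, so re-running the cell would leave the project root.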

## Install Kaggle

In [None]:
!pip install kaggle

* Set the Kaggle configuration directory to the current working directory
* Restrict the permissions of the Kaggle authentication JSON file (kaggle.json)

In [None]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
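
The same two steps can also be done from Python alone, with a check that the token file exists. This is a minimal sketch assuming a POSIX system; the helper name `prepare_kaggle_token` is hypothetical, and a temporary folder with a placeholder `kaggle.json` stands in for the project folder so that no real token is touched.

```python
import os
import stat
import tempfile


def prepare_kaggle_token(config_dir):
    """Point KAGGLE_CONFIG_DIR at config_dir and restrict kaggle.json to its owner."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # 0o600 = read and write for the owner only, the same as `chmod 600`
    os.chmod(token_path, 0o600)
    return token_path


# Demonstration with a placeholder token in a temporary folder.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "kaggle.json"), "w") as f:
    f.write("{}")

token = prepare_kaggle_token(demo_dir)
mode = stat.S_IMODE(os.stat(token).st_mode)
print(oct(mode))
```

In the notebook, the call would be `prepare_kaggle_token(os.getcwd())`, which fails early with a clear message if `kaggle.json` has not been uploaded yet.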

* Set the Kaggle dataset path and the destination folder
* Download the dataset from Kaggle into the destination folder

In [None]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/cherry_leaves_dataset"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

* Unzip downloaded file
* Delete zip file

In [None]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
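
The unzip-and-delete step can be wrapped in a small function that also reports what was extracted. This is only a sketch: the helper name `extract_and_remove` and the file names in the demonstration archive are made up, and a tiny archive is built in a temporary folder so the example runs without the Kaggle download.

```python
import os
import tempfile
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive, return member names."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        names = zip_ref.namelist()
        zip_ref.extractall(destination)
    # the archive is only removed after a successful extraction
    os.remove(zip_path)
    return names


# Build a one-file placeholder archive to demonstrate the helper.
demo_dir = tempfile.mkdtemp()
demo_zip = os.path.join(demo_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w") as zf:
    zf.writestr("cherry-leaves/label_a/leaf.jpg", b"placeholder")

extracted = extract_and_remove(demo_zip, demo_dir)
print(extracted)
```

Returning the member names makes it easy to confirm that the expected top-level folder is present before moving on to cleaning.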

---

## Data Preparation

### Data Cleaning

* Check for and remove non-image files

In [None]:
def remove_non_image_file(my_data_dir):
    """Delete every file that is not a .png, .jpg or .jpeg image from each label folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        files = os.listdir(my_data_dir + '/' + folder)
        image_count = 0
        removed_count = 0
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                file_location = my_data_dir + '/' + folder + '/' + given_file
                os.remove(file_location)  # remove non-image file
                removed_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has {image_count} image files")
        print(f"Folder: {folder} - removed {removed_count} non-image files")

In [None]:
remove_non_image_file(my_data_dir='inputs/cherry_leaves_dataset/cherry-leaves')
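
Because the cleaning function deletes files, it can be useful to list what would be removed first. The following is a sketch of such a dry run, not part of the original notebook: the helper name `find_non_image_files` and the demonstration file names are hypothetical, and it uses the same extension rule as the function above.

```python
import os
import tempfile

IMAGE_EXTENSIONS = ('.png', '.jpg', '.jpeg')


def find_non_image_files(my_data_dir):
    """Return {folder: [non-image file names]} without deleting anything."""
    report = {}
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        report[folder] = sorted(
            name for name in os.listdir(folder_path)
            if not name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return report


# Demonstration on a temporary folder with two images and one text file.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "label_a"))
for name in ("leaf1.JPG", "leaf2.png", "notes.txt"):
    open(os.path.join(demo_dir, "label_a", name), "w").close()

report = find_non_image_files(demo_dir)
print(report)
```

Note that the check only looks at the file extension; it does not verify that a file with an image extension is a readable image.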

### Split Train, Validation and Test sets

In [None]:
import os
import math
import shutil
import random

def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):

  # compare with a tolerance: float sums such as 0.7 + 0.2 + 0.1 are not exactly 1.0
  if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
    print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
    return

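  # get the class labels
  labels = os.listdir(my_data_dir) # it should contain only the label folder names
  # if a 'test' folder already exists, the data has been split before, so do nothing
  if 'test' not in labels:
    # create train, validation and test folders, each with class label sub-folders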
    for folder in ['train','validation','test']:
      for label in labels:
        os.makedirs(name=my_data_dir+ '/' + folder + '/' + label)

    for label in labels:

      files = os.listdir(my_data_dir + '/' + label)
      random.shuffle(files)

      train_set_files_qty = int(len(files) * train_set_ratio)
      validation_set_files_qty = int(len(files) * validation_set_ratio)

      count = 1
      for file_name in files:
        if count <= train_set_files_qty:
          # move given file to train set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/train/' + label + '/' + file_name)
          

        elif count <= (train_set_files_qty + validation_set_files_qty ):
          # move given file to validation set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/validation/' + label + '/' + file_name)

        else:
          # move given file to test set
          shutil.move(my_data_dir + '/' + label + '/' + file_name,
                      my_data_dir + '/test/' + label + '/' + file_name)
          
        count += 1

      os.rmdir(my_data_dir + '/' + label)
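
`random.shuffle` produces a different order on every run, so the split above is not reproducible. If reproducibility matters, one option (an addition, not part of the original function) is to shuffle with a seeded `random.Random` instance. The helper name `shuffled_copy`, the seed value and the file names below are hypothetical.

```python
import random


def shuffled_copy(files, seed):
    """Return a shuffled copy of files; the same seed always gives the same order."""
    files = sorted(files)  # os.listdir order is arbitrary, so sort first
    random.Random(seed).shuffle(files)
    return files


demo_files = [f"leaf_{i}.jpg" for i in range(10)]
first = shuffled_copy(demo_files, seed=42)
second = shuffled_copy(demo_files, seed=42)
print(first == second)
```

Using a dedicated `random.Random` instance keeps the seed local, so other code that relies on the global random state is not affected.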

Following convention, the data is split in these ratios:

* The training set receives 0.70 (70%) of the data
* The validation set receives 0.10 (10%) of the data
* The test set receives 0.20 (20%) of the data
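
Because the function truncates the train and validation counts with `int()`, the test set receives whatever is left over, so its share can be slightly above 0.20. The sketch below reproduces that arithmetic on its own; the helper name `split_counts` is hypothetical and 1003 is a made-up file count chosen so that the rounding is visible.

```python
def split_counts(n_files, train_set_ratio, validation_set_ratio):
    """Return (train, validation, test) sizes as the split function computes them."""
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    # the test set takes every remaining file, including rounding remainders
    test_qty = n_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


counts = split_counts(1003, 0.7, 0.1)
print(counts)  # (702, 100, 201)
```

Here 1003 × 0.7 = 702.1 and 1003 × 0.1 = 100.3 are truncated to 702 and 100, leaving 201 files for the test set.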

In [None]:
split_train_validation_test_images(my_data_dir="inputs/cherry_leaves_dataset/cherry-leaves",
                                   train_set_ratio=0.7,
                                   validation_set_ratio=0.1,
                                   test_set_ratio=0.2
                                   )
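
After the split it is worth confirming how many files ended up in each folder. The following is a sketch of such a check, assuming the `train`, `validation` and `test` layout created above; the helper name `count_split_files` is hypothetical, and the demonstration builds a small temporary layout with a placeholder label instead of reading the real dataset.

```python
import os
import tempfile


def count_split_files(my_data_dir):
    """Return {split: {label: file count}} for the train/validation/test folders."""
    counts = {}
    for split in ("train", "validation", "test"):
        split_dir = os.path.join(my_data_dir, split)
        if not os.path.isdir(split_dir):
            continue  # this split folder has not been created
        counts[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
        }
    return counts


# Demonstration: 10 placeholder files split 7 / 1 / 2.
demo_dir = tempfile.mkdtemp()
for split, qty in (("train", 7), ("validation", 1), ("test", 2)):
    label_dir = os.path.join(demo_dir, split, "label_a")
    os.makedirs(label_dir)
    for i in range(qty):
        open(os.path.join(label_dir, f"{split}_{i}.jpg"), "w").close()

split_report = count_split_files(demo_dir)
print(split_report)
```

In the notebook, the call would be `count_split_files("inputs/cherry_leaves_dataset/cherry-leaves")`.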

---

## Push files to Repo

* Uncomment the code below if you need to create a folder before pushing

In [None]:
# import os
# try:
#     # create your folder here
#     # os.makedirs(name='')
# except Exception as e:
#     print(e)