# **Data collection and preparation**

## Objectives

* Fetch data from Kaggle
* Prepare the data for visualisation and modelling

## Inputs

* Kaggle JSON file for authentication
* Link to dataset

## Outputs

* Dataset for mildew detector: inputs/mildew_classification/cherry-leaves

## Additional Comments

* No comments


---

# Import packages

In [1]:
import sys 
!{sys.executable} -m pip install numpy
import numpy
import os



# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you will need to change the working directory

We need to change the working directory from its current folder to its parent folder
* We access the current directory with os.getcwd()

In [2]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/Mildew-detection-project/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'/workspace/Mildew-detection-project'
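
The directory change above can be wrapped in a small helper, shown here as a minimal sketch. The function name `move_to_project_root` is hypothetical and not part of the notebook. It assumes the notebook folder sits directly under the project root.

```python
import os
from pathlib import Path


def move_to_project_root(notebook_dir):
    """Change into the parent of notebook_dir and return the new working directory.

    Hypothetical helper: assumes notebook_dir is a direct child of the project root.
    """
    project_root = Path(notebook_dir).resolve().parent  # same idea as os.path.dirname()
    os.chdir(project_root)                              # make the parent the working directory
    return os.getcwd()
```

Calling `move_to_project_root(os.getcwd())` from inside the notebook folder would have the same effect as the two cells above. Like those cells, it moves one level up each time it is called, so it should run only once per session.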

# Install Kaggle

In [5]:
!pip install kaggle

Collecting kaggle
  Downloading kaggle-1.5.12.tar.gz (58 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 59.0/59.0 kB 6.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting tqdm
  Downloading tqdm-4.64.1-py2.py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 kB 16.2 MB/s eta 0:00:00
Collecting python-slugify
  Downloading python_slugify-6.1.2-py2.py3-none-any.whl (9.4 kB)
Collecting text-unidecode>=1.3
  Downloading text_unidecode-1.3-py2.py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.2/78.2 kB 25.9 MB/s eta 0:00:00
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.12-py3-none-any.whl size=73031 sha256=d42d9d05e19051e9b22a17c8d96b8f16020d2bbff82a12813772560770727963
  Sto

---

# Data Collection

#### Set the Kaggle configuration directory to the current working directory and restrict the permissions of the Kaggle authentication JSON file

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
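
The same configuration can be done in pure Python, which also lets the notebook fail early if the token file is missing. This is a minimal sketch: the helper name `configure_kaggle` is hypothetical, and it assumes `kaggle.json` has already been placed in the given directory on a POSIX system.

```python
import os
import stat


def configure_kaggle(config_dir):
    """Point the Kaggle client at config_dir and restrict kaggle.json to its owner.

    Hypothetical helper: assumes kaggle.json already exists in config_dir.
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")
    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read/write only, equivalent to `chmod 600 kaggle.json`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In this notebook the call would be `configure_kaggle(os.getcwd())`.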

---

NOTE

* You may add as many sections as you want, as long as they support your project workflow.
* All notebook cells should run top-down: do not create a workflow in which you must go back to an earlier cell to perform a task, such as refreshing the contents of a variable.

Set Kaggle Dataset and Download it

In [7]:
KaggleDatasetPath = "codeinstitute/cherry-leaves"
DestinationFolder = "inputs/mildew_classification"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading cherry-leaves.zip to inputs/mildew_dataset
 91%|██████████████████████████████████▌   | 50.0M/55.0M [00:01<00:00, 44.1MB/s]
100%|██████████████████████████████████████| 55.0M/55.0M [00:01<00:00, 39.1MB/s]


Unzip data and delete zip file

In [11]:
import zipfile
with zipfile.ZipFile(DestinationFolder + '/cherry-leaves.zip', 'r') as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + '/cherry-leaves.zip')
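
The extract-then-delete step can also be written as a reusable function. This is a minimal sketch, and the name `extract_and_remove` is hypothetical. It uses only the standard `zipfile` and `os` modules and deletes the archive only after extraction has finished without error.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract zip_path into destination, delete the archive, and return the member names."""
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(destination)
        members = zip_ref.namelist()
    os.remove(zip_path)  # reached only if extraction raised no exception
    return members
```

With the variables defined above, the call would be `extract_and_remove(os.path.join(DestinationFolder, 'cherry-leaves.zip'), DestinationFolder)`.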

# Data Preparation

### Check and remove non-image files

In [1]:
import os

def remove_non_image_file(my_data_dir):
    """Delete every file that is not a .png, .jpg or .jpeg image from each class folder."""
    image_extension = ('.png', '.jpg', '.jpeg')
    folders = os.listdir(my_data_dir)
    for folder in folders:
        folder_path = os.path.join(my_data_dir, folder)
        files = os.listdir(folder_path)
        image_count = 0      # files kept
        non_image_count = 0  # files removed
        for given_file in files:
            if not given_file.lower().endswith(image_extension):
                os.remove(os.path.join(folder_path, given_file))  # remove non-image file
                non_image_count += 1
            else:
                image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)
    

In [4]:
remove_non_image_file(my_data_dir='/workspace/Mildew-detection-project/inputs/mildew_classification/cherry-leaves')

Folder: healthy - has image file 2104
Folder: healthy - has non-image file 0
Folder: powdery_mildew - has image file 2104
Folder: powdery_mildew - has non-image file 0
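
Because the cleanup deletes files permanently, it can be useful to list what would be removed first. This is a minimal sketch of such a dry run. The name `find_non_image_files` is hypothetical, and it assumes the same layout as above: one subfolder per class and the same three image extensions.

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def find_non_image_files(data_dir):
    """Return the paths of files in each class folder that are not images; nothing is deleted."""
    non_images = []
    for folder in sorted(os.listdir(data_dir)):
        folder_path = os.path.join(data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files at the top level
        for name in sorted(os.listdir(folder_path)):
            if not name.lower().endswith(IMAGE_EXTENSIONS):
                non_images.append(os.path.join(folder_path, name))
    return non_images
```

An empty list means the deletion step has nothing to remove, which matches the zero non-image counts reported above.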


## Split data into train, validation and test sets

In [2]:
import sys
!{sys.executable} -m pip install split-folders
import splitfolders



* Use split-folders
* Split the data into a ratio of 0.7 for the train set, 0.1 for the validation set and 0.2 for the test set, matching the (train, validation, test) order of the `ratio` argument

In [3]:
splitfolders.ratio("/workspace/Mildew-detection-project/inputs/mildew_classification/cherry-leaves", output="output",
    seed=1337, ratio=(.7, .1, .2), group_prefix=None, move=False) 

Copying files: 4208 files [00:01, 2111.58 files/s]
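
To check that the split came out as intended, the images in each resulting folder can be counted. This is a minimal sketch: the name `count_images_per_split` is hypothetical, and it assumes the output directory contains one folder per split (such as `train`, `val` and `test`), each holding one subfolder per class.

```python
import os

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def count_images_per_split(output_dir):
    """Return {split: {label: image_count}} for an output_dir/<split>/<label>/ layout."""
    counts = {}
    for split in sorted(os.listdir(output_dir)):
        split_path = os.path.join(output_dir, split)
        if not os.path.isdir(split_path):
            continue
        counts[split] = {}
        for label in sorted(os.listdir(split_path)):
            label_path = os.path.join(split_path, label)
            if os.path.isdir(label_path):
                # Count only files with a recognised image extension
                counts[split][label] = sum(
                    name.lower().endswith(IMAGE_EXTENSIONS)
                    for name in os.listdir(label_path)
                )
    return counts
```

Running `count_images_per_split("output")` after the split should give per-class counts close to the 0.7, 0.1 and 0.2 proportions of each class.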


# Push files to Repo

* If you don't need to push files to the repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [7]:
!git add .

In [8]:
!git commit -am "Collect and prepare data for visualisation and modelling"

[main 308f5f2] Collect and prepare data for visualisation and modelling
 2 files changed, 545 insertions(+), 1 deletion(-)
 create mode 100644 jupyter_notebooks/DataCollectionAndPreparation.ipynb
 rename jupyter_notebooks/{Notebook_Template.ipynb => DataVisualisation.ipynb} (99%)


In [16]:
!git push -f origin main

Enumerating objects: 7, done.
Counting objects: 100% (7/7), done.
Delta compression using up to 16 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (5/5), 5.65 KiB | 5.65 MiB/s, done.
Total 5 (delta 1), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To https://github.com/kylo74/Mildew-detection-project.git
 + 0b25beb...308f5f2 main -> main (forced update)
