# **Data Collection**
---

## Objectives

* Fetch the animal species dataset from Kaggle and prepare it for further processing.  
* Organize image data into a clean structure suitable for training machine learning models.  
* Split the dataset into training, validation, and testing subsets.

## Inputs

* Kaggle JSON file – authentication token for accessing the dataset.  
* Kaggle dataset: `anthonytherrien/image-classification-64-classes-animal` (animal images, multi-species).  

## Outputs

* Generated dataset located at: `inputs/datasets/animals/image/`  
  - Includes subfolders for each animal class.  
  - After preparation, split into `train/`, `validation/`, and `test/` sets.  

## Additional Comments | Insights | Conclusions

* The dataset is large (>5 GB). To save disk space, files are **moved** (not copied) into train/validation/test splits.  
* Original dataset folders may be removed once splitting is complete.  


---

## Import Packages

In [28]:
%pip install -r /workspaces/Animal_detection_camera/requirements.txt


[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


## Change working directory

Access the current directory

In [29]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Animal_detection_camera'

Set a new directory

In [30]:
os.chdir('/workspaces/Animal_detection_camera')
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [31]:
current_dir = os.getcwd()
current_dir

'/workspaces/Animal_detection_camera'

## Install Kaggle

In [32]:
%pip install kaggle==1.5.12


[notice] A new release of pip is available: 24.3.1 -> 25.2
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


Change the Kaggle configuration directory to the current working directory and set permissions for the Kaggle authentication JSON.

In [33]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
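
The `chmod` shell call above assumes a POSIX shell and that `kaggle.json` is already in the working directory. As a rough sketch only, the same step can be done in plain Python with a check that the token exists first. The helper name `secure_kaggle_token` is hypothetical and not part of the Kaggle package.

```python
import os
import stat


def secure_kaggle_token(config_dir):
    """Point the Kaggle client at config_dir and restrict kaggle.json to its owner.

    Hypothetical helper: it mirrors the two notebook lines above
    (setting KAGGLE_CONFIG_DIR and running chmod 600).
    """
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        raise FileNotFoundError(f"No kaggle.json found in {config_dir}")

    os.environ["KAGGLE_CONFIG_DIR"] = config_dir
    # Owner read + write only, i.e. the same permissions as `chmod 600`
    os.chmod(token_path, stat.S_IRUSR | stat.S_IWUSR)
    return token_path
```

In this notebook it would be called as `secure_kaggle_token(os.getcwd())`. Failing early with a clear error is arguably more helpful than letting the later download step fail on a missing token.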

Set the Kaggle Dataset and download it. Then, unzip image files and delete the zip file.

In [34]:
import os
import zipfile

KaggleDatasetPath = "anthonytherrien/image-classification-64-classes-animal"
ZipFolder = "inputs/animals"
DestinationFolder = "inputs/datasets/animals"
zip_path = os.path.join(ZipFolder, "image-classification-64-classes-animal.zip")

os.makedirs(DestinationFolder, exist_ok=True)

# Download and unzip only if the destination is empty,
# so a re-run does not fetch the 5.7 GB archive again
if not os.listdir(DestinationFolder):
    # Download the dataset zip
    !kaggle datasets download -d {KaggleDatasetPath} -p {ZipFolder}
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(DestinationFolder)
    print(f"✅ Extracted dataset to: {DestinationFolder}")
    # Delete the zip after extraction
    os.remove(zip_path)
else:
    print(f"📂 Dataset already available at: {DestinationFolder}")


Downloading image-classification-64-classes-animal.zip to inputs/animals
100%|█████████████████████████████████████▉| 5.71G/5.71G [03:20<00:00, 32.1MB/s]
100%|██████████████████████████████████████| 5.71G/5.71G [03:20<00:00, 30.5MB/s]
✅ Extracted dataset to: inputs/datasets/animals
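
Before extracting an archive of this size, it can help to look at its layout, for instance to confirm that the images sit under a single top-level folder. The following is a minimal sketch using only the standard `zipfile` module. The function name `count_files_per_top_level_folder` is an assumption made for illustration.

```python
import zipfile
from collections import Counter


def count_files_per_top_level_folder(zip_path):
    """Return {top-level entry: number of files} without extracting anything."""
    counts = Counter()
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        for info in zip_ref.infolist():
            if info.is_dir():
                continue  # directory entries carry no data
            # "image/lemur/001.jpg" -> "image"; a root-level file maps to its own name
            top_level = info.filename.split("/", 1)[0]
            counts[top_level] += 1
    return dict(counts)
```

Calling it on `zip_path` before `extractall` only reads the archive's central directory, so it is cheap even for a multi-gigabyte file.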


---

# Data Preparation
---

## Data Cleaning

### Check and remove any non-image files

In [38]:
def remove_non_image_file(my_data_dir):
    """Delete every file in each class folder that lacks an image extension."""
    image_extension = ('.png', '.jpg', '.jpeg')
    for folder in os.listdir(my_data_dir):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files next to the class folders
        image_count = 0
        non_image_count = 0
        for given_file in os.listdir(folder_path):
            if given_file.lower().endswith(image_extension):
                image_count += 1
            else:
                os.remove(os.path.join(folder_path, given_file))  # remove non-image file
                non_image_count += 1
        print(f"Folder: {folder} - has image file", image_count)
        print(f"Folder: {folder} - has non-image file", non_image_count)

remove_non_image_file(my_data_dir='inputs/datasets/animals/image')

Folder: lemur - has image file 216
Folder: lemur - has non-image file 0
Folder: snake - has image file 216
Folder: snake - has non-image file 0
Folder: elephant - has image file 288
Folder: elephant - has non-image file 0
Folder: frog - has image file 312
Folder: frog - has non-image file 0
Folder: chimpanzee - has image file 216
Folder: chimpanzee - has non-image file 0
Folder: chinchilla - has image file 216
Folder: chinchilla - has non-image file 0
Folder: flamingo - has image file 216
Folder: flamingo - has non-image file 0
Folder: mongoose - has image file 216
Folder: mongoose - has non-image file 0
Folder: ostrich - has image file 216
Folder: ostrich - has non-image file 0
Folder: ferret - has image file 216
Folder: ferret - has non-image file 0
Folder: camel - has image file 216
Folder: camel - has non-image file 0
Folder: bee - has image file 216
Folder: bee - has non-image file 0
Folder: mole - has image file 176
Folder: mole - has non-image file 0
Folder: penguin - has image 
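
The check above relies on file extensions alone, so a renamed or corrupted file would still be counted as an image. As an optional extra, and purely as a sketch, the first bytes of each file can be compared against the PNG and JPEG signatures. The names `IMAGE_SIGNATURES`, `looks_like_image` and `find_mislabelled_files` are hypothetical. This only inspects the header and does not prove that a file decodes correctly.

```python
import os

# Leading bytes ("magic numbers") of the two formats kept by the extension filter
IMAGE_SIGNATURES = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",        # JPEG
)


def looks_like_image(file_path):
    """Return True if the file starts with a PNG or JPEG signature."""
    with open(file_path, "rb") as file_handle:
        header = file_handle.read(8)
    return header.startswith(IMAGE_SIGNATURES)


def find_mislabelled_files(my_data_dir):
    """List files in the class folders whose header is not PNG or JPEG."""
    suspicious = []
    for folder in sorted(os.listdir(my_data_dir)):
        folder_path = os.path.join(my_data_dir, folder)
        if not os.path.isdir(folder_path):
            continue
        for file_name in sorted(os.listdir(folder_path)):
            file_path = os.path.join(folder_path, file_name)
            if os.path.isfile(file_path) and not looks_like_image(file_path):
                suspicious.append(file_path)
    return suspicious
```

The function only reports paths and deletes nothing, so the result can be reviewed by hand before anything is removed.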

## Split Train Validation Test Set

In [None]:
import math
import random
import shutil


def split_train_validation_test_images(my_data_dir, train_set_ratio, validation_set_ratio, test_set_ratio):
    """Move the images of each class folder into train/validation/test sub-folders."""

    # Compare with a tolerance: float sums such as 0.7 + 0.2 + 0.1 are not exactly 1.0
    if not math.isclose(train_set_ratio + validation_set_ratio + test_set_ratio, 1.0):
        print("train_set_ratio + validation_set_ratio + test_set_ratio should sum to 1.0")
        return

    # get the class labels (expects only class folders inside my_data_dir)
    labels = os.listdir(my_data_dir)
    if 'test' in labels:
        return  # the dataset has already been split

    # create train, validation and test folders, each with one sub-folder per class label
    for folder in ['train', 'validation', 'test']:
        for label in labels:
            os.makedirs(os.path.join(my_data_dir, folder, label))

    for label in labels:
        label_dir = os.path.join(my_data_dir, label)
        files = os.listdir(label_dir)
        random.shuffle(files)

        train_set_files_qty = int(len(files) * train_set_ratio)
        validation_set_files_qty = int(len(files) * validation_set_ratio)

        for count, file_name in enumerate(files, start=1):
            if count <= train_set_files_qty:
                target = 'train'
            elif count <= train_set_files_qty + validation_set_files_qty:
                target = 'validation'
            else:
                target = 'test'  # the remainder goes to the test set

            # move (not copy) the file into its split
            shutil.move(os.path.join(label_dir, file_name),
                        os.path.join(my_data_dir, target, label, file_name))

        # the class folder is empty now, so remove it
        os.rmdir(label_dir)
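
The split function shuffles with the global `random` state, so each fresh run of the notebook produces a different partition. If a repeatable split is wanted, one option is to sort the file names and shuffle them with a dedicated seeded generator. This is a sketch under that assumption, not part of the original notebook. The helper name `shuffled_copy` and the default seed are illustrative.

```python
import random


def shuffled_copy(file_names, seed=42):
    """Return the file names in a shuffled order that depends only on the seed."""
    rng = random.Random(seed)      # private generator, leaves global state untouched
    ordered = sorted(file_names)   # os.listdir order is arbitrary, so sort first
    rng.shuffle(ordered)
    return ordered
```

Inside the split function, `files = shuffled_copy(os.listdir(...))` could stand in for the `os.listdir` plus `random.shuffle` pair.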

* The training set receives 70% of the images in each class.
* The validation set receives 10% of the images in each class.
* The test set receives the remaining 20%.
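
Because the train and validation counts are truncated with `int()` and the test set takes whatever is left, the realised proportions differ slightly from the nominal ratios. The short sketch below reproduces that arithmetic for the class sizes reported in the cleaning output (176, 216, 288 and 312 images). The helper name `split_sizes` is hypothetical.

```python
def split_sizes(n_files, train_set_ratio, validation_set_ratio):
    """Mirror the counting logic of the split: truncate, then give the rest to test."""
    train_qty = int(n_files * train_set_ratio)
    validation_qty = int(n_files * validation_set_ratio)
    test_qty = n_files - train_qty - validation_qty
    return train_qty, validation_qty, test_qty


for n_files in (176, 216, 288, 312):
    print(n_files, split_sizes(n_files, 0.7, 0.1))
# 176 (123, 17, 36)
# 216 (151, 21, 44)
# 288 (201, 28, 59)
# 312 (218, 31, 63)
```

For a 216-image class the test share is 44/216, about 20.4%, so the test set ends up marginally larger than 20%.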

In [41]:

split_train_validation_test_images(
    my_data_dir="inputs/datasets/animals/image",
    train_set_ratio=0.7,
    validation_set_ratio=0.1,
    test_set_ratio=0.2
)

Double check that the subfolders were created in the correct location.

In [42]:

print(os.listdir("inputs/datasets/animals"))
print(os.listdir("inputs/datasets/animals/image"))

['image']
['train', 'validation', 'test']
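
Listing the folders confirms the structure but not how many images landed in each split. A possible follow-up check, sketched here with a hypothetical helper `count_images_per_split`, counts the files per class in each of the three folders.

```python
import os


def count_images_per_split(my_data_dir, splits=("train", "validation", "test")):
    """Return {split: {label: number of files}} for an already split dataset."""
    counts = {}
    for split in splits:
        split_dir = os.path.join(my_data_dir, split)
        counts[split] = {
            label: len(os.listdir(os.path.join(split_dir, label)))
            for label in sorted(os.listdir(split_dir))
            if os.path.isdir(os.path.join(split_dir, label))
        }
    return counts
```

For example, `count_images_per_split("inputs/datasets/animals/image")["train"]` would give the per-class training counts, which can then be compared with the expected 70% share.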


---