# 01 - Data manipulation

This notebook covers the first stage of the EuroSAT image classification project: extracting the data and organizing it on disk. It performs the following tasks:

- Data Extraction: The notebook starts by obtaining the EuroSAT dataset, a collection of satellite images labeled by land-cover type. The dataset can be loaded from its source repository or unpacked from a downloaded archive.

- Organization of Data: After extraction, the images of each class are randomly split into a training folder (70%) and a testing folder (30%). Keeping the two sets in separate folders makes model training and evaluation straightforward.

- Mapping Folder Names to IDs: Each folder represents one land-cover type and is mapped to a unique integer identifier (ID), assigned in alphabetical order starting from 0. The destination folders are named after these IDs, and the ID-to-name mapping is kept in a dictionary.

- Dataset Description and Download Link: The notebook describes the EuroSAT dataset and links to its download page, paper, and homepage, so that the project can be reproduced by others.
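The folder-to-ID mapping described in the list above can be sketched as follows. This is a minimal illustration, assuming the class folders sit directly under one root directory; `build_class_mapping` is a hypothetical helper name and is not defined elsewhere in this notebook.

```python
import os


def build_class_mapping(root_dir):
    """Return {class_id: folder_name} for the class folders under root_dir.

    Hypothetical helper: IDs follow the alphabetical order of the folder
    names, and hidden entries (such as '.DS_Store') are skipped.
    """
    class_names = sorted(
        name
        for name in os.listdir(root_dir)
        if not name.startswith(".")
        and os.path.isdir(os.path.join(root_dir, name))
    )
    # enumerate() assigns 0, 1, 2, ... in sorted order
    return dict(enumerate(class_names))
```

Sorting before enumerating keeps the IDs stable across machines, since `os.listdir` returns entries in arbitrary order.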

## Dataset

"*The EuroSAT dataset is a comprehensive land cover classification dataset that focuses on images taken by the ESA Sentinel-2 satellite. It contains a total of 27,000 images, each with a resolution of 64x64 pixels. These images cover 10 distinct land cover classes and are collected from over 34 European countries. The dataset is available in two versions: RGB only (this repo) and all 13 Multispectral (MS) Sentinel-2 bands. EuroSAT is considered a relatively easy dataset, with approximately 98.6% accuracy achievable using a ResNet-50 architecture.*"

Data: https://huggingface.co/datasets/blanchon/EuroSAT_RGB

Paper: https://arxiv.org/abs/1709.00029

Homepage: https://github.com/phelber/EuroSAT

In [1]:
import os
import shutil
import random
from tqdm import tqdm

import torchvision
import torchvision.transforms as transforms


import warnings
warnings.filterwarnings('ignore')

In [2]:
# Alternative: load the dataset directly with the Hugging Face `datasets` package
# from datasets import load_dataset
# EuroSAT_RGB = load_dataset("blanchon/EuroSAT_RGB")

In [3]:
# Unzip file - Linux
#!unzip EuroSAT_RGB.zip
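The `unzip` command above is only available in a Unix-like shell. A platform-independent alternative, sketched here with Python's standard `zipfile` module, could look like the block below. The helper name `extract_archive` and the example paths are assumptions; adjust them to wherever the archive was actually saved.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of a zip archive into dest_dir.

    Hypothetical helper: creates dest_dir if needed and returns it as a Path.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return dest


# Example call (paths are assumptions, not taken from the notebook):
# extract_archive("EuroSAT_RGB.zip", "../data")
```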

In [4]:
# Name of the folder with the original images
original_images_folder = '../data/EuroSAT_RGB'

# Destination folders for training and testing image groups
train_images_folder = "../data/train_imgs"
test_images_folder = "../data/test_imgs"

In [6]:
# Create a new directory for training and testing images (if it doesn't already exist)
os.makedirs(train_images_folder, exist_ok = True)
os.makedirs(test_images_folder, exist_ok = True)

# Build training and testing data

In [8]:
# Auxiliary variables
image_class = 0
class_dict = {}

# Variable to manipulate images
names_land_type = os.listdir(original_images_folder)
names_land_type.sort()

# Set train size
train_sample_size = 0.7
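Because `random.sample` is called without a fixed seed, each run of this notebook produces a different train/test split. If a reproducible split is wanted, one option is a seeded generator, sketched below. The function name `split_file_names` and the default seed of 42 are assumptions chosen for illustration only.

```python
import random


def split_file_names(file_names, train_fraction=0.7, seed=42):
    """Split file names into (train, test) lists in a reproducible way.

    Hypothetical helper: the names are sorted first so that the result
    does not depend on the order in which the file system lists them.
    """
    rng = random.Random(seed)      # private generator, leaves global state alone
    shuffled = sorted(file_names)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Slicing a single shuffled list also guarantees that the two sets are disjoint and together cover every file.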

In [9]:
# View the class folder names
names_land_type

['.DS_Store',
 'AnnualCrop',
 'Forest',
 'HerbaceousVegetation',
 'Highway',
 'Industrial',
 'Pasture',
 'PermanentCrop',
 'Residential',
 'River',
 'SeaLake']

In [10]:
# Iterate over all entries in the "names_land_type" list
for path_file in tqdm(names_land_type, desc = "Processing"):
    # Skip hidden entries such as '.DS_Store' (names starting with a dot)
    if not path_file.startswith('.'):
        # Construct the source directory path
        source_dir_path = os.path.join(original_images_folder, path_file)

        # List all images in the specified directory
        images = os.listdir(source_dir_path)

        # Calculate the sample size for the training set
        sample_size = int(len(images) * train_sample_size)

        # Randomly sample images for training
        train_images = random.sample(images, sample_size)

        # The rest are test images (a set lookup avoids rescanning the list for every image)
        train_set = set(train_images)
        test_images = [img for img in images if img not in train_set]

        # Construct the final destination path for training images
        final_train_dest = os.path.join(train_images_folder, str(image_class))

        # Create a new directory for training images (if it doesn't already exist)
        os.makedirs(final_train_dest, exist_ok = True)

        # Copy selected training images to the final destination
        for file_name in train_images:
            shutil.copy2(os.path.join(source_dir_path, file_name), final_train_dest)

        # Construct the final destination path for test images
        final_test_dest = os.path.join(test_images_folder, str(image_class))

        # Create a new directory for test images (if it doesn't already exist)
        os.makedirs(final_test_dest, exist_ok = True)

        # Copy all test images to the final destination
        for test_image in test_images:
            shutil.copy2(os.path.join(source_dir_path, test_image), final_test_dest)

        # Map the class ID to its land-cover folder name
        class_dict[image_class] = path_file

        # Increment the image class identifier
        image_class += 1

Processing: 100%|█████████████████████████████████████████████████████████████████████████████| 11/11 [03:30<00:00, 19.18s/it]
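After the copy finishes, it can be useful to check how many images landed in each class folder and to persist `class_dict`, which otherwise exists only in memory. The sketch below shows one way to do both. The helper names and the JSON file location are assumptions, not part of the original notebook.

```python
import json
import os


def count_files_per_class(split_dir):
    """Return {class_folder_name: number_of_files} for a split directory."""
    counts = {}
    for class_id in sorted(os.listdir(split_dir)):
        class_path = os.path.join(split_dir, class_id)
        if os.path.isdir(class_path):
            counts[class_id] = len(os.listdir(class_path))
    return counts


def save_class_mapping(class_dict, json_path):
    """Write the ID-to-name mapping to a JSON file."""
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(class_dict, f, indent=2)


def load_class_mapping(json_path):
    """Read the mapping back, restoring integer keys.

    JSON object keys are always strings, so they are converted with int().
    """
    with open(json_path, encoding="utf-8") as f:
        return {int(key): name for key, name in json.load(f).items()}


# Example calls (the JSON path is an assumption):
# print(count_files_per_class(train_images_folder))
# save_class_mapping(class_dict, "../data/class_mapping.json")
```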
