# Notebook to Preprocess the GIS data associated with Burial Mounds

This notebook prepares the data for YOLO model training. YOLO training places strict requirements on the format and organization of the data it consumes. In this notebook, we perform the steps that produce the images, labels, and folder layout needed to train a YOLO model successfully.

NOTE: The paths in this notebook are absolute Google Drive paths. Users may need to change them to match their own Google Drive directories, because Google Colab accesses Drive files through absolute paths under the mount point.
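One way to make the paths easier to adapt is to define the project root once and derive every other path from it. The sketch below is only an illustration: `BASE_DIR` and `project_path` are hypothetical names that are not used elsewhere in this notebook, and the cells below still use hard-coded paths.

```python
from pathlib import PurePosixPath

# Hypothetical root folder; change this one line to match your own Google Drive layout
BASE_DIR = PurePosixPath('/content/drive/MyDrive/summer_internship')

def project_path(*parts):
    """Join path components onto BASE_DIR and return the result as a string."""
    return str(BASE_DIR.joinpath(*parts))

# Paths that appear later in this notebook, rebuilt from the single root
shapefile_path = project_path('data', 'shapefiles', 'MapMounds4326.shp')
split_images_dir = project_path('split_images')
print(shapefile_path)
print(split_images_dir)
```

With this layout, moving the project to a different Drive folder would only require editing `BASE_DIR`.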

# Google Colab Additional Steps
The next two steps are only needed when running in Google Colab. Users may skip them if the notebook is run locally.


In [1]:
#installed rasterio here because packages must be reinstalled on each new Google Colab runtime
!pip install rasterio


Collecting rasterio
  Downloading rasterio-1.3.9-cp310-cp310-manylinux2014_x86_64.whl (20.6 MB)
     20.6/20.6 MB 16.8 MB/s eta 0:00:00
Collecting affine (from rasterio)
  Downloading affine-2.4.0-py3-none-any.whl (15 kB)
Collecting snuggs>=1.4.1 (from rasterio)
  Downloading snuggs-1.4.7-py3-none-any.whl (5.4 kB)
Installing collected packages: snuggs, affine, rasterio
Successfully installed affine-2.4.0 rasterio-1.3.9 snuggs-1.4.7


In [2]:
#mounted my google drive where the dataset is located
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
#imported necessary packages
import geopandas as gpd
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import rasterio
import os
from shapely.geometry import box
from rasterio.warp import transform_bounds


In [4]:
#Loaded GDF
shapefile_path = '/content/drive/MyDrive/summer_internship/data/shapefiles/MapMounds4326.shp'
gdf = gpd.read_file(shapefile_path)

#Kept only the Hairy brown circles
gdf = gdf[gdf.MpSymbl == 'Hairy brown circle']

By default, Google Colab saves results only temporarily, in folders that belong to the notebook's runtime. We therefore first define a helper function that copies a folder from one location to another. This function will be called repeatedly to copy the results of image augmentations and label creation to Google Drive.

In [5]:
def copy_paste(source_directory,output_drive_path):

  #imported packages that will enable the copying of the labels to google drive
  import os
  import shutil

  #set the output_drive_path as the destination directory
  destination_directory = output_drive_path

  #Creates the destination directory if it doesn't exist
  os.makedirs(destination_directory, exist_ok=True)

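  #Copied all contents from the source directory into a same-named subfolder of the destination directory
  #dirs_exist_ok=True (Python 3.8+) allows the cell to be rerun without raising FileExistsError
  shutil.copytree(source_directory, os.path.join(destination_directory, os.path.basename(source_directory)), dirs_exist_ok=True)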


# Image Splits

YOLO models are typically trained at a 640x640 input size, so we split each map into smaller tiles in order to preserve more detail for the model during training. First, we define a function that splits an image into a given number of rows and columns.

In [6]:
#imported necessary packages
import rasterio
from rasterio.windows import Window

In [7]:
#defined a function that will split images when called

def split_images(geotiff_dir,rows,columns):#function requires a directory path for the geotiff images and how many rows and columns we want the image to be split

    #iterates through all images found in a given directory
    for filename in os.listdir(geotiff_dir):
        #added condition that allows for the program to work only with files that end in .tif
        if filename.endswith('.tif'):

            #succeeding steps aim to get just the filename without the .tif extension.
            #split the filename based on the dot ('.') character
            filename_parts = filename.split('.')

            #got the part before the extension. This will be used later on for the naming of the Geotiff file to be created
            actual_filename = filename_parts[0]

            #opened the file using rasterio
            geotiff_path = os.path.join(geotiff_dir, filename)

            with rasterio.open(geotiff_path) as src:
                #specified the number of rows and columns for the grid
                num_rows, num_cols = rows, columns

                #calculated the width and height of each subset
                subset_width = src.width // num_cols
                subset_height = src.height // num_rows

                #iterated over rows and columns to create and save each subset
                for row in range(num_rows):
                    for col in range(num_cols):
                        #calculated the window bounds for each subset
                        window = Window(col * subset_width, row * subset_height, subset_width, subset_height)

                        #read the data for the subset
                        subset_data = src.read(window=window)

                        #defined path for the desired destination of the split images
                        destination_folder = '/content/drive/MyDrive/summer_internship/split_images'
                        #creates folder if the folder does not exist
                        if not os.path.exists(destination_folder):
                            os.makedirs(destination_folder)

                        #created a new Geotiff file for each subset
                        output_path = f'{destination_folder}/{actual_filename}_{row}_{col}.tif'
                        with rasterio.open(output_path, 'w', driver='GTiff', height=subset_height, width=subset_width, count=src.count, dtype=src.dtypes[0], crs=src.crs, transform=src.window_transform(window)) as dst:
                            dst.write(subset_data)

                        print(f"Subset {row + 1}-{col + 1} saved to {output_path}")


Next, we call the function above to split our data. We set the rows and columns to 6 x 6, which gives 36 tiles per image. We selected these dimensions because, across a series of training runs, they gave the best model performance. This is probably because the resulting tiles are closest in size to the 640-pixel image size preferred by YOLOv8.
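To see how the number of splits relates to the tile size, the sketch below mirrors the integer-division arithmetic used in `split_images` without opening any file. The 4000 x 3000 pixel map size is a made-up example, not the size of the actual maps, and `tile_windows` is a hypothetical helper.

```python
def tile_windows(width, height, rows, cols):
    """Return (col_off, row_off, tile_width, tile_height) for each tile, row by row."""
    tile_w = width // cols   # integer division, as in split_images
    tile_h = height // rows
    return [(c * tile_w, r * tile_h, tile_w, tile_h)
            for r in range(rows) for c in range(cols)]

# Assumed example size; substitute the real width and height of a map
windows = tile_windows(4000, 3000, 6, 6)
print(len(windows), windows[0], windows[-1])

# Pixels along the right and bottom edges that no tile covers
leftover_x = 4000 - 6 * (4000 // 6)
leftover_y = 3000 - 6 * (3000 // 6)
print(leftover_x, leftover_y)
```

Because of the integer division, any remainder pixels along the right and bottom edges are left out of the tiles. For the assumed size this is a 4-pixel strip on the right, so a mound lying in such a strip would not appear in any tile.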

In [None]:
#called the function above to split the images found in the path provided below.

split_images("/content/drive/MyDrive/summer_internship/data/YambolGIS/Training32635",
             6,6)

Subset 1-1 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-066-2_GKrushevo_0_0.tif
Subset 1-2 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-066-2_GKrushevo_0_1.tif
Subset 1-3 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-066-2_GKrushevo_0_2.tif
Subset 1-4 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-066-2_GKrushevo_0_3.tif
Subset 1-5 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-066-2_GKrushevo_0_4.tif
Subset 1-6 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-066-2_GKrushevo_0_5.tif
Subset 2-1 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-066-2_GKrushevo_1_0.tif
Subset 2-2 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-066-2_GKrushevo_1_1.tif
Subset 2-3 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-066-2_GKrushevo_1_2.tif
Subset 2-4 saved to /content/drive/MyDrive/summer_internship/split_images

In [None]:
#called the function above to split the images found in the path provided below
split_images("/content/drive/MyDrive/summer_internship/data/YambolGIS/TopoMapsMultipleForTesting",
             6,6)

Subset 1-1 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-078-2_Turkey_0_0.tif
Subset 1-2 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-078-2_Turkey_0_1.tif
Subset 1-3 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-078-2_Turkey_0_2.tif
Subset 1-4 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-078-2_Turkey_0_3.tif
Subset 1-5 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-078-2_Turkey_0_4.tif
Subset 1-6 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-078-2_Turkey_0_5.tif
Subset 2-1 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-078-2_Turkey_1_0.tif
Subset 2-2 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-078-2_Turkey_1_1.tif
Subset 2-3 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-078-2_Turkey_1_2.tif
Subset 2-4 saved to /content/drive/MyDrive/summer_internship/split_images/K-35-078-2_Turkey_1_3.tif


# Detection Prep
Before we create labels for the tiles, we set aside one map for the detection exercise, in which we will test how the trained model performs on unseen data. All tiles from this map are removed from the split images so that it is excluded from every set. The excluded map will be used solely for burial mound detection later on in the detect_YOLO notebook.

In [8]:
#imported os
import os

#assigned the path containing all the images that have been split
folder_path = '/content/drive/MyDrive/summer_internship/split_images'

#iterated through files in the folder
for file_name in os.listdir(folder_path):
    #checked if the file name contains 'Razdel35', a randomly chosen map image that we will be excluding for the detection exercise
    if 'Razdel35' in file_name:
        #assigns the full file path of the image to be removed
        file_path = os.path.join(folder_path, file_name)

        #removed the file
        os.remove(file_path)
        print(f"Removed: {file_path}")

print("Files containing 'Razdel35' removed successfully.")


Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_0_0.tif
Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_0_1.tif
Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_0_2.tif
Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_0_3.tif
Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_0_4.tif
Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_0_5.tif
Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_1_0.tif
Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_1_1.tif
Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_1_2.tif
Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_1_3.tif
Removed: /content/drive/MyDrive/summer_internship/split_images/K-35-066-3_Razdel35_1_4.tif

## Labels Creation

One of YOLO's training requirements is that labels are stored in txt files. Each text file tells the model which objects should be detected in a particular image and where they are located. To do this, YOLO uses bounding boxes that enclose the objects of interest. In our case, each txt file corresponds to one image (a burial mound map tile) and contains one line for every bounding box enclosing an object (burial mound) to be detected. Each line follows the format: (label index of the object) (center_x of the bounding box) (center_y of the bounding box) (width of the bounding box) (height of the bounding box), with the values separated by spaces. All values must be normalized to the 0-1 range, except for the label index, which must be an integer.
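As a small worked example of this format, the sketch below converts one bounding box given in pixels into a normalized label line. The pixel values and the 640 x 640 image size are made up for illustration, and `to_yolo_line` is a hypothetical helper; the notebook's own function writes the values with `str()` rather than a fixed number of decimals.

```python
def to_yolo_line(class_id, center_x_px, center_y_px, box_w_px, box_h_px, img_w, img_h):
    """Format one YOLO label line with values normalized by the image size."""
    values = (center_x_px / img_w,   # center_x as a fraction of the image width
              center_y_px / img_h,   # center_y as a fraction of the image height
              box_w_px / img_w,      # box width as a fraction of the image width
              box_h_px / img_h)      # box height as a fraction of the image height
    return f"{class_id} " + " ".join(f"{v:.6f}" for v in values)

# A 64 x 32 pixel box centered at (320, 160) in a 640 x 640 image
line = to_yolo_line(0, 320, 160, 64, 32, img_w=640, img_h=640)
print(line)
```

Horizontal values are divided by the image width and vertical values by the image height, which is what keeps every value between 0 and 1.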

In our first step, we will be creating labels for our dataset. This will later on be consumed by the model for training.

In [None]:
#defined a function that will create labels for images when called

def create_labels(geotiff_dir):#function requires a directory path for the geotiff images that will be provided with labels

    #iterates through all the images found in the given directory
    for filename in os.listdir(geotiff_dir):
        #initialized a bounding box list that will later on contain all bounding box information
        bounding_box_list = []
        #added condition that allows for the program to work only with files that end in .tif
        if filename.endswith('.tif'):

            #Succeeding steps aim to get just the filename without the .tif extension. This later on plays a role when it comes to
            ##naming the output txt file
            #Split the filename based on the dot ('.') character
            filename_parts = filename.split('.')

            #Got the part before the extension
            actual_filename = filename_parts[0]

            #Opened the file using rasterio
            geotiff_path = os.path.join(geotiff_dir, filename)
            with rasterio.open(geotiff_path) as src:

                #transformed the shapefile data into the coordinate system of the image
                gdf_t = gdf.to_crs(src.crs)
                bbox = box(src.bounds.left, src.bounds.bottom, src.bounds.right, src.bounds.top)
                #found the points that are inside the bounds of this image
                gdf_within_bounds = gdf_t[gdf_t.geometry.within(bbox)]

                #created a tuple list of all values under the geometry column which contains point coordinates as values
                coordinate_tuples = gdf_within_bounds['geometry'].apply(lambda point: (point.x, point.y)).tolist()

                #initialized tuple that will contain the bounding box details of each label point found from the list
                bounding_box = ()

                #iterated through the list of tuples containing the coordinate points of the labels(mounds)
                for c in coordinate_tuples:
                    longitude = c[0] #assigned the first coordinate as the longitude
                    latitude = c[1] #assigned the second coordinate as the latitude

                    #converted the longitude and latitude values to pixel values so YOLO bounding box requirements can be met
                    ##Simply using the EPSG:32635 coordinate system does not adhere to the strict 0-1 value requirements of YOLO
                    ##as EPSG:32635 converts the values to large digits

                    ##TAKE NOTE in rasterio: row, col = dataset.index(x, y); the row (vertical, y direction) is returned first even though x is
                    ##passed as the first argument. Since the order is swapped by the function, the first returned value is assigned to latitude

                    lat_pixel, long_pixel = src.index(longitude,latitude)

                    #Since the GIS files (geometry field) gave us points representing the center of the burial mounds,
                    ##Then the center x and center y are basically just the point coordinates

                    #However, since we want the bounding box to enclose the whole hairy brown circle symbols, we have to set the bounding box
                    ##width and height to have an allowance that will enclose the hairy brown circles.
                    ###NOTE: The width and height of the bounding rectangle was computed by subtracting the xmin and ymin from xmax and ymax
                    ##the addition and deduction of 50 and 30 pixel values below are a product of an exercise done outside of this notebook
                    ##that determined the optimal pixel allowance so that the rectangle will be able to enclose the whole hairy brown circle
                    ##and not just its center.

                    ##Then, the width and height was lessened by dividing the result by 3 because when we cropped the images we found that
                    ##the bounding box became larger than it should be. It was encompassing a lot more space than it should instead of just
                    ##encompassing the hairy brown circles. The value of 3 was determined by determining the pixel sizes of the created bounding
                    ##boxes using the opencv ROI packages

                    #net effect: every box is 100/3 (about 33.3) pixels wide and 80/3 (about 26.7) pixels high
                    width = ((long_pixel+50) - (long_pixel-50))/3
                    height = ((lat_pixel+50) - (lat_pixel-30))/3

                    #normalized the bounding box values by dividing by the image height and width to meet YOLO 0-1 value requirements
                    bounding_box = (0, long_pixel/src.width, lat_pixel/src.height,
                                     width/src.width, height/src.height)

                    #appended the bounding boxes to the bounding box list
                    bounding_box_list.append(bounding_box)

                #For the next part, the list of bounding boxes are to be written into a text file since YOLO requires for its
                ##annotation labels to be in a text file with exactly the same name as the image being labeled. Only difference is
                ##the image is in .tif while the annotation file is in .txt

                #all label text files will be stored in the output folder below
                output_folder = 'labels_output'
                os.makedirs(output_folder, exist_ok=True)

                #Wrote bounding box coordinates to text files in the output folder
                output_file_path = os.path.join(output_folder, f"{actual_filename}.txt")
                with open(output_file_path, "w") as file:
                    for bb in bounding_box_list:
                        line = " ".join(map(str, bb))
                        file.write(line + "\n")

In [None]:
#called the function above to create labels for the images found in the path provided below
create_labels('/content/drive/MyDrive/summer_internship/split_images')

After calling the function, the labels_output folder contains a corresponding label txt file for every image in our dataset.
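Before copying the labels anywhere, it can be worth checking that every line meets the format requirements described earlier. The sketch below is one possible check, not part of the original workflow: `is_valid_label_line` and `find_invalid_labels` are hypothetical helpers, and the demo runs on two made-up files in a temporary folder.

```python
import os
import tempfile

def is_valid_label_line(line):
    """Return True if a line has an integer class index followed by four values in [0, 1]."""
    parts = line.split()
    if len(parts) != 5:
        return False
    try:
        class_ok = int(parts[0]) >= 0
        values_ok = all(0.0 <= float(p) <= 1.0 for p in parts[1:])
    except ValueError:
        return False
    return class_ok and values_ok

def find_invalid_labels(labels_dir):
    """Return (filename, line_number) pairs for non-blank lines that fail the check."""
    problems = []
    for name in sorted(os.listdir(labels_dir)):
        if not name.endswith('.txt'):
            continue
        with open(os.path.join(labels_dir, name)) as f:
            for line_number, line in enumerate(f, start=1):
                if line.strip() and not is_valid_label_line(line):
                    problems.append((name, line_number))
    return problems

# Demo on two made-up label files; the second line of bad.txt has center_x > 1
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'good.txt'), 'w') as f:
        f.write("0 0.5 0.25 0.1 0.05\n")
    with open(os.path.join(tmp, 'bad.txt'), 'w') as f:
        f.write("0 0.5 0.25 0.1 0.05\n0 1.2 0.5 0.1 0.1\n")
    problems = find_invalid_labels(tmp)
print(problems)
```

To check the notebook's own output, the same function could be called as `find_invalid_labels('labels_output')`; an empty list means every non-blank line passed.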

Because Google Colab stores results only temporarily in the runtime's local folders, the next step saves the results to our Google Drive so that we can use the labels for training the model.

In [None]:
#set the path where the labels will be saved
output_drive_path = '/content/drive/MyDrive/summer_internship'

#assigned the temporary Google Colab folder where labels are saved to the source directory
source_directory = '/content/labels_output'

#called the function to save the contents of the labels output to our Google Drive
copy_paste(source_directory,output_drive_path)

## Preparation of training and test set in the format required by YOLO

YOLO requires the images and their corresponding label txt files to be organized as follows. Training images are stored in an images folder inside a train folder (train/images), which should contain only image files. The corresponding labels are stored in a labels folder (train/labels), which should contain only the label txt files. The same directory structure applies to the validation and test sets.
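The layout described above can be sketched as a short helper that creates the folder tree. `make_yolo_dirs` is a hypothetical name and the demo runs in a temporary directory; in this notebook the folders are instead created on demand by the copy function defined next.

```python
import tempfile
from pathlib import Path

def make_yolo_dirs(root, splits=('train', 'val', 'test')):
    """Create <root>/<split>/images and <root>/<split>/labels and return the created paths."""
    created = []
    for split in splits:
        for kind in ('images', 'labels'):
            folder = Path(root) / split / kind
            folder.mkdir(parents=True, exist_ok=True)  # safe to call again if the folder exists
            created.append(folder)
    return created

# Demo in a throwaway directory so nothing in the project is touched
with tempfile.TemporaryDirectory() as tmp:
    created = make_yolo_dirs(tmp)
    relative = [p.relative_to(tmp).as_posix() for p in created]
    all_exist = all(p.is_dir() for p in created)
print(relative)
```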



For the next part, we will be creating a function that can be called to copy image and label files from a given directory so we can easily organize our training and test sets when needed.

In [9]:
#imported packages for the function to copy files
import os
import shutil

#defined function copy files that can be called to copy image and label files from a given directory path

def copy_files(set_type, source_folder, destination_folder):#function requires source folder path and destination folder path

    #Creates destination folder if it doesn't exist, if it already exists, files will just be placed inside the existing folder
    if not os.path.exists(destination_folder):
        os.makedirs(destination_folder)

    #Got a list of all files in the source folder
    source_files = os.listdir(source_folder)

    #added condition to check if process will be for training set
    if set_type == 'train':

        #created list that will contain the files to be processed for the training set
        files_list = []

        #iterates through train_files, the list of all files assigned to the train set.
        ##This list is created in the train-val-test split step after this function definition code block
        for file in train_files:
            #Succeeding steps aim to get just the filename without the .tif extension.
            #Split the filename based on the dot ('.') character
            filename_parts = file.split('.')

            #Got the part before the extension. This will be used to match with the source files to ensure that only source files assigned to
            ##the training set will be copied
            actual_filename = filename_parts[0]

            files_list.append(actual_filename)

        #iterated through all the source files
        for file in source_files:
            #Succeeding steps aim to get just the filename without the .tif extension.
            #Split the filename based on the dot ('.') character
            sourcename_parts = file.split('.')

            #Got the part before the extension. This will be used to match with the files found in the train set list
            ##to ensure that only source files assigned to the training set will be copied
            actual_sourcename = sourcename_parts[0]

            #checks if the source file is found in the training set list, and only processes the file if it is found in the list
            if actual_sourcename in files_list:

                #introduced a condition that restricts the files to be processed only to image and label files
                if file.endswith('.tif') or file.endswith('.txt'):

                    source_path = os.path.join(source_folder, file) #joins the source folder and filename to get the source path
                    destination_path = os.path.join(destination_folder, file) #joins the destination folder and filename to provide the destination path

                    #copies the file to the chosen destination path
                    shutil.copy2(source_path, destination_path)
                    #checks which files have been successfully copied
                    print(f"File '{file}' copied to '{destination_folder}'")

    #same steps above are done below for the handling of validation set
    if set_type == 'val':

        files_list = []

        for file in val_files:
            #Succeeding steps aim to get just the filename without the .tif extension.
            #Split the filename based on the dot ('.') character
            filename_parts = file.split('.')

            #Got the part before the extension
            actual_filename = filename_parts[0]

            files_list.append(actual_filename)

        #iterated through all the files
        for file in source_files:
            #Succeeding steps aim to get just the filename without the .tif extension.
            #Split the filename based on the dot ('.') character
            sourcename_parts = file.split('.')

            #Got the part before the extension
            actual_sourcename = sourcename_parts[0]

            if actual_sourcename in files_list:

                #introduced a condition that restricts the files to be processed only to image and label files
                if file.endswith('.tif') or file.endswith('.txt'):

                    source_path = os.path.join(source_folder, file) #joins the source folder and filename to get the source path
                    destination_path = os.path.join(destination_folder, file) #joins the destination folder and filename to provide the destination path

                    #copies the file to the chosen destination path
                    shutil.copy2(source_path, destination_path)
                    #checks which files have been successfully copied
                    print(f"File '{file}' copied to '{destination_folder}'")

    #same steps above are done below for the handling of test set
    if set_type == 'test':

        files_list = []

        for file in test_files:
            #Succeeding steps aim to get just the filename without the .tif extension.
            #Split the filename based on the dot ('.') character
            filename_parts = file.split('.')

            #Got the part before the extension
            actual_filename = filename_parts[0]

            files_list.append(actual_filename)

        #iterated through all the files
        for file in source_files:
            #Succeeding steps aim to get just the filename without the .tif extension.
            #Split the filename based on the dot ('.') character
            sourcename_parts = file.split('.')

            #Got the part before the extension
            actual_sourcename = sourcename_parts[0]

            if actual_sourcename in files_list:

                #introduced a condition that restricts the files to be processed only to image and label files
                if file.endswith('.tif') or file.endswith('.txt'):

                    source_path = os.path.join(source_folder, file) #joins the source folder and filename to get the source path
                    destination_path = os.path.join(destination_folder, file) #joins the destination folder and filename to provide the destination path

                    #copies the file to the chosen destination path
                    shutil.copy2(source_path, destination_path)
                    #checks which files have been successfully copied
                    print(f"File '{file}' copied to '{destination_folder}'")

## Train-Val-Test Split

For our train/val/test split, we use 90% of the data for training and validation and hold out the remaining 10% as the test set. The training and validation portion is then split 90-10 again, with 90% used for training and the remaining 10% for validation. Overall, this gives roughly 81% training, 9% validation, and 10% test data.
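The arithmetic of this two-stage split can be sketched without scikit-learn. The helper below assumes that `train_test_split` rounds a fractional `test_size` up to the next whole sample and gives the rest to the other set, which is consistent with the counts printed in this notebook. `split_counts` is a hypothetical helper, and 684 is the sum of the two counts printed by the first split cell.

```python
import math

def split_counts(n_samples, test_size):
    """Return (n_kept, n_held_out), rounding the held-out share up to a whole sample."""
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out

# First stage: hold out 10% of all tiles as the test set
n_trainval, n_test = split_counts(684, 0.1)
# Second stage: hold out 10% of the remainder as the validation set
n_train, n_val = split_counts(n_trainval, 0.1)
print(n_train, n_val, n_test)
```

Since 553 / 684 is about 0.81, the training share matches the 0.9 x 0.9 = 0.81 expected from two successive 90-10 splits.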

In [10]:
#imported necessary packages
import os
from sklearn.model_selection import train_test_split

#assigned path to the folder containing images
data_path = '/content/drive/MyDrive/summer_internship/split_images'

#listed all the file names in the data folder
file_names = [f for f in os.listdir(data_path) if f.endswith('.tif')]

#split the file names into training and testing sets
trainval_files, test_files = train_test_split(file_names, test_size=0.1, random_state=42)

print(trainval_files[0:5])
print(len(trainval_files))

print(test_files[0:5])
print(len(test_files))




['K-35-053-2_Sliven_1_3.tif', 'K-35-053-4_BezmerBoyadzhik_3_4.tif', 'K-35-054-4_Voynika_5_5.tif', 'K-35-053-2_Sliven_0_4.tif', 'K-35-066-1_MalomirELhovo_2_0.tif']
615
['K-35-066-3_Razdel_3_0.tif', 'K-35-054-3_Yambol_2_3.tif', 'K-35-054-2_Atolov_5_2.tif', 'K-35-053-3_Elenovo_1_4.tif', 'K-35-066-2_GKrushevo_3_4.tif']
69


In [11]:
#imported necessary packages
import os
from sklearn.model_selection import train_test_split

# Split the file names into training and validation sets
train_files, val_files = train_test_split(trainval_files, test_size=0.1, random_state=42)

print(train_files[0:5])
print(len(train_files))

print(val_files[0:5])
print(len(val_files))




['K-35-066-2_GKrushevo_1_1.tif', 'K-35-065-3_Glavan_3_3.tif', 'K-35-065-4_Topolovgrad_2_1.tif', 'K-35-054-1_Straldzha_3_4.tif', 'K-35-054-3_Yambol_0_4.tif']
553
['K-35-067-3_Zhelyazkovo_0_2.tif', 'K-35-054-2_Atolov_4_3.tif', 'K-35-054-2_Atolov_0_3.tif', 'K-35-053-1_NZagora_5_5.tif', 'K-35-066-2_GKrushevo_0_1.tif']
62


### Train Set

#### Images

In [14]:
#called the function to copy all the images found from the source folder below

source_folder = "/content/drive/MyDrive/summer_internship/split_images"
destination_folder = "data_images/train/images"

copy_files('train',source_folder, destination_folder)

File 'K-35-078-2_Turkey_0_0.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_0_3.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_0_4.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_0_5.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_1_1.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_1_2.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_1_3.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_1_5.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_2_0.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_2_1.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_2_2.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_2_3.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_2_4.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Turkey_2_5.tif' copied to 'data_images/train/images'
File 'K-35-078-2_Tur

After running the code above, all the images from the train set are available inside our data_images/train/images folder.

Now, similar to what we did for the output of the labels creation function above, because we are working with Google Colab, we will be saving the results in our Google Drive so that we can use the results for training the model.

In [15]:
#set the path where the results will be saved
output_drive_path = '/content/drive/MyDrive/summer_internship/data_images/train'

#assigned the temporary Google Colab folder where results are saved to the source directory
source_directory = '/content/data_images/train/images'

#called the function to save the contents of the results to our Google Drive
copy_paste(source_directory,output_drive_path)

#### Labels

In [16]:
#Next, called the function to get the corresponding labels for the images

source_folder = '/content/drive/MyDrive/summer_internship/labels_output' #this folder contains the labels for all of the split images, generated by the create_labels function above
destination_folder = "data_images/train/labels"

copy_files('train',source_folder, destination_folder)

File 'K-35-078-2_Turkey_0_4.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_0_0.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_0_3.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_0_5.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_3_3.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_1_2.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_1_1.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_3_4.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_3_1.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_2_2.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_2_3.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_2_4.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_2_5.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Turkey_1_5.txt' copied to 'data_images/train/labels'
File 'K-35-078-2_Tur

After running the code above, we now have label text files for all of the images from our train set stored in the data_images/train/labels folder.
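Since YOLO pairs each image with the label file of the same name, a quick check that no image is missing its label can catch copy mistakes early. The sketch below is an optional check, not part of the original workflow: `images_without_labels` is a hypothetical helper, and the demo uses empty placeholder files with made-up names in a temporary directory.

```python
import os
import tempfile

def images_without_labels(images_dir, labels_dir):
    """Return the sorted image names (without extension) that have no matching .txt file."""
    image_stems = {os.path.splitext(f)[0] for f in os.listdir(images_dir) if f.endswith('.tif')}
    label_stems = {os.path.splitext(f)[0] for f in os.listdir(labels_dir) if f.endswith('.txt')}
    return sorted(image_stems - label_stems)

# Demo: two placeholder images, only one of which has a label file
with tempfile.TemporaryDirectory() as tmp:
    images_dir = os.path.join(tmp, 'images')
    labels_dir = os.path.join(tmp, 'labels')
    os.makedirs(images_dir)
    os.makedirs(labels_dir)
    for name in ('map_0_0.tif', 'map_0_1.tif'):
        open(os.path.join(images_dir, name), 'w').close()
    open(os.path.join(labels_dir, 'map_0_0.txt'), 'w').close()
    missing = images_without_labels(images_dir, labels_dir)
print(missing)
```

For the train set, the call would be `images_without_labels('data_images/train/images', 'data_images/train/labels')`, and an empty list means every image has a label file.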

Now, similar to what we did for the train images above, because we are working with Google Colab, we will be saving the results in our Google Drive so that we can use the results for training the model.

In [17]:
#set the path where the results will be saved
output_drive_path = '/content/drive/MyDrive/summer_internship/data_images/train'

#assigned the temporary Google Colab folder where results are saved to the source directory
source_directory = '/content/data_images/train/labels'

#called the function to save the contents of the results to our Google Drive
copy_paste(source_directory,output_drive_path)

### Val Set

#### Images

In [18]:
#called the function to copy all the images found from the source folder below

source_folder = "/content/drive/MyDrive/summer_internship/split_images"
destination_folder = "data_images/val/images"

copy_files('val',source_folder, destination_folder)

File 'K-35-078-2_Turkey_0_1.tif' copied to 'data_images/val/images'
File 'K-35-078-2_Turkey_1_0.tif' copied to 'data_images/val/images'
File 'K-35-078-2_Turkey_3_5.tif' copied to 'data_images/val/images'
File 'K-35-078-2_Turkey_4_2.tif' copied to 'data_images/val/images'
File 'K-35-078-2_Turkey_4_4.tif' copied to 'data_images/val/images'
File 'K-35-053-1_NZagora_2_1.tif' copied to 'data_images/val/images'
File 'K-35-053-1_NZagora_4_1.tif' copied to 'data_images/val/images'
File 'K-35-053-1_NZagora_4_2.tif' copied to 'data_images/val/images'
File 'K-35-053-1_NZagora_5_5.tif' copied to 'data_images/val/images'
File 'K-35-053-2_Sliven_0_3.tif' copied to 'data_images/val/images'
File 'K-35-053-2_Sliven_2_3.tif' copied to 'data_images/val/images'
File 'K-35-053-2_Sliven_3_2.tif' copied to 'data_images/val/images'
File 'K-35-053-2_Sliven_5_4.tif' copied to 'data_images/val/images'
File 'K-35-053-3_Elenovo_5_3.tif' copied to 'data_images/val/images'
File 'K-35-054-1_Straldzha_0_0.tif' copied 

As with the train set, we save the results to our Google Drive so that they remain available for the model training workflow.

In [19]:
#set the path where the results will be saved
output_drive_path = '/content/drive/MyDrive/summer_internship/data_images/val'

#assigned the temporary Google Colab folder where results are saved to the source directory
source_directory = '/content/data_images/val/images'

#called the function to save the contents of the results to our Google Drive
copy_paste(source_directory,output_drive_path)

#### Labels

In [20]:
#Next, called the function to get the corresponding labels for the images

#this folder contains the labels for all 20 images, generated by the create labels function above
source_folder = '/content/drive/MyDrive/summer_internship/labels_output'
destination_folder = "data_images/val/labels"

copy_files('val',source_folder, destination_folder)

File 'K-35-078-2_Turkey_1_0.txt' copied to 'data_images/val/labels'
File 'K-35-078-2_Turkey_0_1.txt' copied to 'data_images/val/labels'
File 'K-35-078-2_Turkey_3_5.txt' copied to 'data_images/val/labels'
File 'K-35-078-2_Turkey_4_2.txt' copied to 'data_images/val/labels'
File 'K-35-078-2_Turkey_4_4.txt' copied to 'data_images/val/labels'
File 'K-35-053-1_NZagora_4_1.txt' copied to 'data_images/val/labels'
File 'K-35-053-1_NZagora_2_1.txt' copied to 'data_images/val/labels'
File 'K-35-053-2_Sliven_0_3.txt' copied to 'data_images/val/labels'
File 'K-35-053-1_NZagora_5_5.txt' copied to 'data_images/val/labels'
File 'K-35-053-1_NZagora_4_2.txt' copied to 'data_images/val/labels'
File 'K-35-053-2_Sliven_3_2.txt' copied to 'data_images/val/labels'
File 'K-35-053-2_Sliven_2_3.txt' copied to 'data_images/val/labels'
File 'K-35-053-2_Sliven_5_4.txt' copied to 'data_images/val/labels'
File 'K-35-054-1_Straldzha_0_0.txt' copied to 'data_images/val/labels'
File 'K-35-053-3_Elenovo_5_3.txt' copied 

As with the train set, we save the results to our Google Drive so that they remain available for the model training workflow.

In [21]:
#set the path where the results will be saved
output_drive_path = '/content/drive/MyDrive/summer_internship/data_images/val'

#assigned the temporary Google Colab folder where results are saved to the source directory
source_directory = '/content/data_images/val/labels'

#called the function to save the contents of the results to our Google Drive
copy_paste(source_directory,output_drive_path)

### Test Set

#### Images

In [22]:
#called the function to copy the test images from the source folder below

source_folder = "/content/drive/MyDrive/summer_internship/split_images"
destination_folder = "data_images/test/images"

copy_files('test',source_folder, destination_folder)

File 'K-35-078-2_Turkey_0_2.tif' copied to 'data_images/test/images'
File 'K-35-078-2_Turkey_1_4.tif' copied to 'data_images/test/images'
File 'K-35-078-2_Turkey_5_1.tif' copied to 'data_images/test/images'
File 'K-35-053-1_NZagora_3_0.tif' copied to 'data_images/test/images'
File 'K-35-053-1_NZagora_3_1.tif' copied to 'data_images/test/images'
File 'K-35-053-1_NZagora_4_3.tif' copied to 'data_images/test/images'
File 'K-35-053-2_Sliven_0_0.tif' copied to 'data_images/test/images'
File 'K-35-053-2_Sliven_0_5.tif' copied to 'data_images/test/images'
File 'K-35-053-2_Sliven_1_0.tif' copied to 'data_images/test/images'
File 'K-35-053-2_Sliven_3_0.tif' copied to 'data_images/test/images'
File 'K-35-053-2_Sliven_4_5.tif' copied to 'data_images/test/images'
File 'K-35-053-3_Elenovo_0_1.tif' copied to 'data_images/test/images'
File 'K-35-053-3_Elenovo_1_4.tif' copied to 'data_images/test/images'
File 'K-35-053-3_Elenovo_2_0.tif' copied to 'data_images/test/images'
File 'K-35-053-3_Elenovo_4_0

As with the train set, we save the results to our Google Drive so that they remain available for the model training workflow.

In [23]:
#set the path where the results will be saved
output_drive_path = '/content/drive/MyDrive/summer_internship/data_images/test'

#assigned the temporary Google Colab folder where results are saved to the source directory
source_directory = '/content/data_images/test/images'

#called the function to save the contents of the results to our Google Drive
copy_paste(source_directory,output_drive_path)

#### Labels

In [24]:
#Next, called the function to get the corresponding labels for the images

#this folder contains the labels for all 20 images, generated by the create labels function above
source_folder = '/content/drive/MyDrive/summer_internship/labels_output'
destination_folder = "data_images/test/labels"

copy_files('test',source_folder, destination_folder)

File 'K-35-078-2_Turkey_0_2.txt' copied to 'data_images/test/labels'
File 'K-35-078-2_Turkey_1_4.txt' copied to 'data_images/test/labels'
File 'K-35-078-2_Turkey_5_1.txt' copied to 'data_images/test/labels'
File 'K-35-053-1_NZagora_3_1.txt' copied to 'data_images/test/labels'
File 'K-35-053-1_NZagora_3_0.txt' copied to 'data_images/test/labels'
File 'K-35-053-1_NZagora_4_3.txt' copied to 'data_images/test/labels'
File 'K-35-053-2_Sliven_1_0.txt' copied to 'data_images/test/labels'
File 'K-35-053-2_Sliven_0_0.txt' copied to 'data_images/test/labels'
File 'K-35-053-2_Sliven_0_5.txt' copied to 'data_images/test/labels'
File 'K-35-053-2_Sliven_3_0.txt' copied to 'data_images/test/labels'
File 'K-35-053-2_Sliven_4_5.txt' copied to 'data_images/test/labels'
File 'K-35-053-3_Elenovo_1_4.txt' copied to 'data_images/test/labels'
File 'K-35-053-3_Elenovo_2_0.txt' copied to 'data_images/test/labels'
File 'K-35-053-3_Elenovo_0_1.txt' copied to 'data_images/test/labels'
File 'K-35-053-3_Elenovo_4_3

As with the train set, we save the results to our Google Drive so that they remain available for the model training workflow.

In [25]:
#set the path where the results will be saved
output_drive_path = '/content/drive/MyDrive/summer_internship/data_images/test'

#assigned the temporary Google Colab folder where results are saved to the source directory
source_directory = '/content/data_images/test/labels'

#called the function to save the contents of the results to our Google Drive
copy_paste(source_directory,output_drive_path)
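
With all three sets copied, one optional check is that no tile ended up in more than one split, since a shared tile would leak information between training and evaluation. The sketch below is not from the original notebook: `split_overlaps` is a hypothetical helper, and it assumes the data_images/<split>/images layout used in the cells above.

```python
from itertools import combinations
from pathlib import Path


def split_overlaps(root, splits=("train", "val", "test"), subdir="images"):
    """Map each pair of splits to the sorted base names they share (empty dict if none)."""
    stems = {
        split: {p.stem for p in (Path(root) / split / subdir).glob("*") if p.is_file()}
        for split in splits
    }
    overlaps = {}
    for first, second in combinations(splits, 2):
        shared = stems[first] & stems[second]
        if shared:
            overlaps[(first, second)] = sorted(shared)
    return overlaps
```

Running `split_overlaps("data_images")` and getting an empty dictionary would suggest that the three splits are disjoint.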

After performing these steps, we now have train, val, and test sets, each with its own images and labels folders, which can be used by our YOLO model. Please refer to train_YOLO.ipynb for the training of the model.
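
Ultralytics-style YOLO training is usually pointed at a small dataset YAML file that lists the split folders and class names. Whether train_YOLO.ipynb uses such a file is not shown here, so the following is only a sketch under that assumption: `write_dataset_yaml` is a hypothetical helper, and the single class name `mound` is a placeholder rather than the label used in the original project.

```python
from pathlib import Path


def write_dataset_yaml(dataset_root, yaml_path, class_names=("mound",)):
    """Write a minimal dataset YAML for the train/val/test layout and return its text."""
    lines = [
        f"path: {dataset_root}",   # root folder that holds train/, val/, test/
        "train: train/images",     # split paths are relative to 'path'
        "val: val/images",
        "test: test/images",
        "names:",
    ]
    # Class ids must match the ids written in the label .txt files
    lines += [f"  {index}: {name}" for index, name in enumerate(class_names)]
    text = "\n".join(lines) + "\n"
    Path(yaml_path).write_text(text)
    return text
```

For example, `write_dataset_yaml("/content/drive/MyDrive/summer_internship/data_images", "mounds.yaml")` would describe the folders saved above. The labels folders are not listed because Ultralytics locates them by replacing `images` with `labels` in each image path.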