# Analysis Overview

This notebook gives a general overview of the dataset by (1) preparing your workspace to use the dataset in the COCO format and (2) visualizing a few images from one of its splits for reference.  The dataset used is the SARscope dataset, linked below.  The objective of this project is to determine whether the proposed image processing methods improve the performance of different models on Synthetic Aperture Radar (SAR) data of maritime vessels.

Dataset: https://www.kaggle.com/datasets/kailaspsudheer/sarscope-unveiling-the-maritime-landscape

## Section 1 - Workspace Preparation

To run this notebook without issue, please do the following:

1. Ensure your Python installation is 3.8.10 or higher.
2. Ensure you are using the pip (pip3) package manager.
3. Run the installation steps below. These are all the packages used in this notebook.

In [None]:
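# json ships with the Python standard library, so it needs no install.
%pip install torch
%pip install torchvision
%pip install torchmetrics
%pip install kagglehub
%pip install matplotlib
# The cv2 module is provided by the opencv-python package.
%pip install opencv-python
%pip install pycocotools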
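
After installing, it can help to confirm that the interpreter and packages match the requirements listed above. The following is a minimal sketch, not part of the original notebook: the helper names `missing_modules` and `python_is_supported` are hypothetical, and the module list is an assumption taken from this notebook's import cell.

```python
import importlib.util
import sys

# Import names (not pip package names) that this notebook relies on.
# ASSUMPTION: this list mirrors the import cell of this notebook.
REQUIRED_MODULES = ["torch", "torchvision", "torchmetrics", "kagglehub",
                    "matplotlib", "numpy", "PIL", "pycocotools"]
MIN_PYTHON = (3, 8, 10)


def missing_modules(module_names):
    """Return the subset of module_names that the import system cannot find."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the given interpreter version meets the minimum version."""
    return tuple(version_info[:3]) >= minimum


print(f"Python {sys.version.split()[0]} supported: {python_is_supported()}")
print(f"Missing modules: {missing_modules(REQUIRED_MODULES)}")
```

An empty "Missing modules" list suggests the installation cell completed for every package the notebook imports.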

In [None]:
# Utility Imports
import os
import pathlib
import shutil

# Data Handling Imports
import kagglehub
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from pycocotools.coco import COCO


# Model & Metric Imports
import torch
import torchmetrics
import torchvision

project_path = pathlib.Path.cwd().parent.resolve()
print(f"Project path: {project_path}")

## Section 2 - Dataset Loading

### Section 2.1: Note on Kagglehub

Kagglehub does not natively support downloading to a specific directory on the user's file system.  Instead, it downloads the dataset to a cache folder, whose location may vary between users.  Thus, the script below moves the downloaded dataset folder to the included */data* folder in this repo.

If you get an error, it is most likely because `shutil.move()` failed while the dataset was still cached.  To fix this, `cd` into the cache directory printed in the output, delete the entire data folder, and run this block again.  See Section 2.2 and the comments in the code block below for additional information.
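
One way to make a failed move easier to diagnose is to check both ends before calling `shutil.move()`. The sketch below is an illustration only: `move_dataset` is a hypothetical helper (it is not part of this notebook or of Kagglehub), and the demonstration uses throwaway temporary directories in place of the real cache and repo folders.

```python
import os
import shutil
import tempfile


def move_dataset(src_dir, dst_parent):
    """Move src_dir into dst_parent, raising a descriptive error if either end is unusable."""
    src_dir = os.path.normpath(src_dir)
    if not os.path.isdir(src_dir):
        raise FileNotFoundError(
            f"Source folder {src_dir} does not exist; the cache may be stale or already moved."
        )
    target = os.path.join(dst_parent, os.path.basename(src_dir))
    if os.path.exists(target):
        raise FileExistsError(f"Destination {target} already exists; remove it before moving.")
    os.makedirs(dst_parent, exist_ok=True)
    # dst_parent exists at this point, so shutil.move places src_dir inside it.
    shutil.move(src_dir, dst_parent)
    return target


# Demonstration with temporary directories standing in for the cache and the repo.
with tempfile.TemporaryDirectory() as tmp:
    fake_cache = os.path.join(tmp, "cache", "SARscope")
    os.makedirs(fake_cache)
    moved_to = move_dataset(fake_cache, os.path.join(tmp, "data", "kaggle"))
    moved_ok = os.path.isdir(moved_to) and not os.path.exists(fake_cache)
    print(f"Moved to: {moved_to} (success: {moved_ok})")
```

Calling the helper a second time with the same source would raise `FileNotFoundError` with a message pointing at the cache, rather than a less specific error from `shutil`.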

### Section 2.2: Deleting the Cache

To re-download the dataset, you need to remove both the formatted folder in this repo's data directory (the *kaggle* folder) and the *kailaspsudheer* folder in the cache.

**I HIGHLY RECOMMEND YOU DOWNLOAD THE DATA THROUGH KAGGLEHUB AND CLEAR THE CACHE MANUALLY YOUR FIRST TIME.  THIS WILL SHOW YOU WHERE YOUR CACHE IS AND THAT YOUR DELETION PATHS ARE CORRECT.**
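
To check a deletion path without deleting anything, the owner folder can be located in the cache path as a dry run. This is a sketch under stated assumptions: `owner_cache_dir` is a hypothetical helper, and the example path is made up for illustration; your real cache path is the one printed by the download cell.

```python
from pathlib import PurePosixPath, PureWindowsPath


def owner_cache_dir(cached_path, owner="kailaspsudheer", path_cls=PurePosixPath):
    """Return the path up to and including the owner folder, or None if it is absent."""
    parts = path_cls(cached_path).parts
    if owner not in parts:
        return None
    return str(path_cls(*parts[: parts.index(owner) + 1]))


# HYPOTHETICAL cache location, for illustration only.
example_path = "/home/user/cache/kailaspsudheer/some-dataset/versions/1/SARscope"
print(f"Would delete: {owner_cache_dir(example_path)}")
```

The `path_cls` parameter exists only so the example behaves the same on any operating system; in the notebook itself, `pathlib.Path` would pick the right flavor automatically.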

In [None]:
# Create the Kaggle directory to move the downloaded data to
kaggle_path = os.path.join(project_path, "data", "kaggle")

# Flag to delete the cached directory so you can re-download the dataset.
# NOTE: I recommend you download once with this false to know the following:
#   1. Where your cache is.
#   2. That the program is finding the "kailaspsudheer" directory to delete.
#
# Once the above is confirmed, you can turn this flag on for future downloads rather than manually deleting it.
clear_cache = False

if not os.path.exists(kaggle_path):

    os.makedirs(kaggle_path, exist_ok=True)

    # Download the SARscope dataset from Kaggle
    try:
        cached_path = kagglehub.dataset_download("kailaspsudheer/sarscope-unveiling-the-maritime-landscape")
    except Exception as exc:
        # Chain the original exception so the underlying cause stays visible.
        raise LookupError("Unable to download SARscope dataset.") from exc

    # Get the absolute path and move it.
    cached_path = os.path.abspath(os.path.join(cached_path, "SARscope"))

    print(f"Moving cached dataset from directory {cached_path} to {kaggle_path}")
    shutil.move(cached_path, kaggle_path)

    data_path = os.path.join(kaggle_path, "SARscope")

    # Move the annotation files outside the actual data and into their own folder.
    annotation_files = []
    annotation_folder = os.path.join(data_path, "annotations")
    print(f"Making annotations directory at path {annotation_folder}")
    os.makedirs(annotation_folder, exist_ok=True)

    for folder in os.listdir(data_path):
        # Skip anything that isn't the test, train, or valid directories.
        if folder == "annotations" or not os.path.isdir(os.path.join(data_path, folder)):
            continue
        else:
            # Extract the annotations JSON file, move it to the annotations directory, and rename it after its set.
            files = os.listdir(os.path.join(data_path, folder))
            annotation_file = [x for x in files if x.endswith(".json")]

            if len(annotation_file) != 1:
                raise FileNotFoundError(f"Annotation file not found for {folder} set.")

            # Rename the annotation file and move it.
            new_annotation_file = folder + annotation_file[0]
            shutil.move(os.path.join(data_path, folder, annotation_file[0]), os.path.join(annotation_folder, new_annotation_file))

    # Show the location of the cache folder.
    print(f"Cached path: {cached_path}")

    if clear_cache:
        # Delete the kailaspsudheer directory to allow for a re-download.
        # Split with pathlib so this works with both "/" and "\" path separators.
        cached_parts = pathlib.Path(cached_path).parts
        kail_idx = cached_parts.index("kailaspsudheer")

        kail_dir = str(pathlib.Path(*cached_parts[:kail_idx + 1]))
        print(f"Deleting cached directory at {kail_dir}")
        shutil.rmtree(kail_dir)
else: # Will default to this if you've already downloaded it to the right place.
    data_path = os.path.join(project_path, "data", "kaggle", "SARscope")

In [None]:
if not os.path.exists(data_path):
    raise FileNotFoundError(f"Not able to find data directory at path: {data_path}")
else:
    print(f"Using data path: {data_path}")
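
Once the data path resolves, a quick summary of each split can confirm that the images and the relocated annotation files are where the later cells expect them. The sketch below is illustrative: `summarize_splits` is a hypothetical helper, the image extensions are an assumption, and the `<split>_annotations.coco.json` naming is inferred from the renaming step in the download cell.

```python
import os
import tempfile

# ASSUMPTION: images use one of these extensions; adjust if the dataset differs.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def summarize_splits(data_path, splits=("train", "valid", "test")):
    """Return {split: (image_count, annotation_file_present)} for each split folder that exists."""
    summary = {}
    for split in splits:
        split_dir = os.path.join(data_path, split)
        if not os.path.isdir(split_dir):
            continue
        images = [f for f in os.listdir(split_dir) if f.lower().endswith(IMAGE_EXTENSIONS)]
        annotation = os.path.join(data_path, "annotations", f"{split}_annotations.coco.json")
        summary[split] = (len(images), os.path.isfile(annotation))
    return summary


# Demonstration on a temporary directory that mimics the layout built by the download cell.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "valid"))
    os.makedirs(os.path.join(tmp, "annotations"))
    for name in ("a.jpg", "b.jpg"):
        open(os.path.join(tmp, "valid", name), "w").close()
    open(os.path.join(tmp, "annotations", "valid_annotations.coco.json"), "w").close()
    demo_summary = summarize_splits(tmp)
    print(demo_summary)
```

In the notebook, `summarize_splits(data_path)` would report the real splits; a `False` in any tuple suggests the annotation file for that split was not moved.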

## Section 3 - Data Visualization

Below, we visualize a few randomly selected images from the validation dataset, both as examples of the different types of images the models will encounter and to confirm that the annotations work as expected.  All targets share the same category ID and category name: (1, "ship").

In [None]:
# Validation Path retrieval
val_annotations = os.path.join(data_path, "annotations", "valid_annotations.coco.json")
val_images = os.path.join(data_path, "valid")

# Extract the annotations
coco_annotation = COCO(val_annotations)
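
The statement that every target shares the category (1, "ship") can be checked directly against a COCO annotation file. The following sketch uses only the standard library and a tiny made-up COCO-style document (not taken from SARscope); `category_summary` is a hypothetical helper name.

```python
import json
from collections import Counter


def category_summary(coco_dict):
    """Return ({category_id: name}, Counter of annotations per category_id) for a COCO-style dict."""
    names = {cat["id"]: cat["name"] for cat in coco_dict.get("categories", [])}
    counts = Counter(ann["category_id"] for ann in coco_dict.get("annotations", []))
    return names, counts


# Made-up example data; COCO bounding boxes are stored as [x, y, width, height].
toy = json.loads("""
{
  "images": [{"id": 0, "file_name": "example.jpg", "width": 640, "height": 640}],
  "categories": [{"id": 1, "name": "ship"}],
  "annotations": [
    {"id": 0, "image_id": 0, "category_id": 1, "bbox": [10, 20, 30, 40], "area": 1200, "iscrowd": 0},
    {"id": 1, "image_id": 0, "category_id": 1, "bbox": [100, 120, 15, 25], "area": 375, "iscrowd": 0}
  ]
}
""")
names, counts = category_summary(toy)
print(names, dict(counts))
```

To apply it to the real data, load the file at `val_annotations` with `json.load` and pass the resulting dictionary to the helper; a single entry in `names` would be consistent with the one-category description above.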

In [None]:
# Display 3 randomly selected images from the validation dataset with their annotations.
image_ids = coco_annotation.getImgIds(catIds=[1])
# Sample without replacement so the same image is not shown twice.
id_sample = np.random.choice(image_ids, 3, replace=False)

for image_id in id_sample:
    image_name = coco_annotation.loadImgs([image_id])[0]['file_name']
    annotation_id = coco_annotation.getAnnIds(imgIds=image_id, iscrowd=None)
    annotation = coco_annotation.loadAnns(annotation_id)

    # Open the Image and annotate it.
    image = Image.open(os.path.join(val_images, image_name))

    plt.imshow(np.asarray(image))
    coco_annotation.showAnns(annotation, draw_bbox=True)

    plt.title(f"Image Id: {image_id}")
    plt.show()
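
If the same example images should appear on every run (for instance, when comparing processing methods side by side), the sampling can be seeded. This is a minimal sketch using the standard library's `random` module rather than NumPy; `sample_ids` is a hypothetical helper and the seed value is arbitrary.

```python
import random


def sample_ids(image_ids, k=3, seed=0):
    """Return up to k distinct ids chosen with a seeded generator so reruns pick the same ids."""
    ids = list(image_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(k, len(ids)))


# Stand-in ids for illustration; in the notebook these would come from getImgIds().
picked = sample_ids(range(100), k=3, seed=42)
picked_again = sample_ids(range(100), k=3, seed=42)
print(picked, picked == picked_again)
```

Passing the list returned by `coco_annotation.getImgIds(catIds=[1])` in place of `range(100)` would yield plain Python integers that can be used in the display loop above.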