This notebook performs exploratory data analysis (EDA) of the dataset.
Let's import the required libraries.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

Load the ground truth annotations

In [None]:
annotations_df = pd.read_csv('../data/train_ship_segmentations.csv')


Print a basic overview of the dataset

In [None]:
print("Dataset shape:", annotations_df.shape)
print("Number of unique images:", annotations_df['ImageId'].nunique())
print("Number of annotated ships:", annotations_df['EncodedPixels'].count())

Analyze the distribution of ship vs. non-ship images

In [None]:
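# Count images, not annotation rows: one image can contain several ships
total_images = annotations_df['ImageId'].nunique()
ship_count = annotations_df.loc[annotations_df['EncodedPixels'].notna(), 'ImageId'].nunique()
no_ship_count = total_images - ship_count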
labels = ['Ships', 'No Ships']
sizes = [ship_count, no_ship_count]
plt.figure(figsize=(6, 6))
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.axis('equal')
plt.title('Distribution of Ships vs. No Ships')
plt.show()
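
The pie chart compares images, while the annotation table holds one row per ship. A minimal sketch on a made-up table shows how the two counts differ. The file names and RLE strings are hypothetical, and it assumes an image without ships appears as a single row with an empty `EncodedPixels`.

```python
import pandas as pd

# Hypothetical toy annotations: one row per ship, None/NaN for an image without ships
toy_df = pd.DataFrame({
    'ImageId': ['a.jpg', 'a.jpg', 'b.jpg', 'c.jpg'],
    'EncodedPixels': ['1 3', '10 2', None, '5 4'],
})

# Number of ship annotations (non-null rows)
annotation_rows = int(toy_df['EncodedPixels'].count())

# Per-image flag: does the image have at least one annotated ship?
has_ship = toy_df.groupby('ImageId')['EncodedPixels'].count() > 0
images_with_ships = int(has_ship.sum())
images_without_ships = int((~has_ship).sum())

print("Ship annotations:", annotation_rows)
print("Images with ships:", images_with_ships)
print("Images without ships:", images_without_ships)
```

Here image `a.jpg` contributes two annotations but only one image, so subtracting the annotation count from the number of unique images would understate the share of ship-free images.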

Calculate statistics and insights from the dataset

In [None]:
ships_per_image = annotations_df.groupby('ImageId')['EncodedPixels'].count().mean()
print("Average ships per image:", ships_per_image)
# Ship size in pixels: the sum of the run lengths in the RLE string
# (assumes "start length" pairs; rows without ships are NaN and are skipped)
ships_df = annotations_df.dropna(subset=['EncodedPixels']).copy()
ships_df['Size'] = ships_df['EncodedPixels'].apply(lambda x: sum(int(n) for n in x.split()[1::2]))
plt.figure(figsize=(12, 6))
plt.hist(ships_df['Size'], bins=50)
plt.xlabel('Ship Size (pixels)')
plt.ylabel('Number of Ships')
plt.title('Ship Size Distribution')
plt.show()
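
Ship size can be read directly from the run-length encoding. The sketch below assumes each `EncodedPixels` string is a sequence of `start length` pairs. The helper name `rle_area` and the sample string are illustrative and not taken from the dataset.

```python
def rle_area(rle):
    """Return the number of mask pixels encoded by a 'start length ...' string."""
    tokens = rle.split()
    # Run lengths sit at the odd positions: start0 len0 start1 len1 ...
    return sum(int(length) for length in tokens[1::2])

# Three runs of 3, 2 and 5 pixels
example = '1 3 10 2 20 5'
print(rle_area(example))
```

Counting the tokens with `len(x.split())` instead would give twice the number of runs, not the pixel area.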

Visualize sample images and corresponding masks

In [None]:
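import numpy as np

def rle_decode(rle, shape):
    # Decode a run-length string into a binary mask.
    # Assumes 1-indexed "start length" pairs in column-major order.
    mask = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    if isinstance(rle, str):  # NaN means the image has no ship
        values = np.array(rle.split(), dtype=int)
        for start, length in zip(values[0::2], values[1::2]):
            mask[start - 1:start - 1 + length] = 1
    return mask.reshape(shape, order='F')

sample_images = annotations_df.sample(n=4)
plt.figure(figsize=(12, 8))
# enumerate gives positions 0..3; the DataFrame index from iterrows is arbitrary
for i, (_, row) in enumerate(sample_images.iterrows()):
    image_id = row['ImageId']

    # Load and plot the image
    image_path = f'../data/train_v2/{image_id}'
    image = plt.imread(image_path)
    plt.subplot(2, 4, i + 1)
    plt.imshow(image)
    plt.axis('off')
    plt.title(f'Image: {image_id}')

    # Decode the RLE string and plot the corresponding mask
    mask = rle_decode(row['EncodedPixels'], image.shape[:2])
    plt.subplot(2, 4, i + 5)
    plt.imshow(mask, cmap='gray')
    plt.axis('off')
    plt.title('Mask')

plt.tight_layout()
plt.show()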


Because most images in the dataset contain no ships, the classes are heavily imbalanced, which can hurt model training. With far fewer ship examples than background examples, the model may show low sensitivity to ships or classify them all as background.
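
One common way to reduce this imbalance is to undersample the images without ships before training. The sketch below is only an illustration on a made-up table. The helper `balance_images`, its `ratio` and `seed` parameters, and the toy data are assumptions, not part of the analysis above.

```python
import pandas as pd

def balance_images(df, ratio=1.0, seed=0):
    """Keep all images with ships and at most ratio * that many images without."""
    per_image = df.groupby('ImageId')['EncodedPixels'].count()
    with_ships = per_image[per_image > 0].index
    without_ships = per_image[per_image == 0].index.to_series()
    # Never ask for more empty images than are available
    n_keep = min(len(without_ships), int(len(with_ships) * ratio))
    kept_empty = without_ships.sample(n=n_keep, random_state=seed)
    keep_ids = set(with_ships) | set(kept_empty)
    return df[df['ImageId'].isin(keep_ids)]

# Hypothetical toy annotations: 2 images with ships, 3 without
toy_df = pd.DataFrame({
    'ImageId': ['a.jpg', 'a.jpg', 'b.jpg', 'c.jpg', 'd.jpg', 'e.jpg'],
    'EncodedPixels': ['1 3', '10 2', None, None, None, '5 4'],
})

balanced = balance_images(toy_df)
print("Images kept:", balanced['ImageId'].nunique())
```

With `ratio=1.0` the result holds as many ship-free images as images with ships. Whether undersampling, loss weighting, or another approach works best here would need to be checked experimentally.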