# EDA on Planet's Amazon Dataset
---

A notebook for initial exploratory data analysis on the [Planet: Understanding the Amazon from Space](https://www.kaggle.com/c/planet-understanding-the-amazon-from-space/) dataset.

## Setup

In [None]:
import sys
import numpy as np
import pandas as pd
from skimage.io import imread
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.io as pio
from tqdm.auto import tqdm

In [None]:
sys.path.append('../data/')
import data_utils

In [None]:
data_utils.DATA_PATH = '/home/andreferreira/data/'
data_utils.IMG_PATH = 'train-jpg/'
data_utils.TIFF_PATH = 'train-tif-v2/'
data_utils.LABELS_PATH = 'train_v2.csv/train_v2.csv'

Set the plotting style:

In [None]:
pio.templates.default = 'plotly_dark'

## Load data

In [None]:
labels_df = pd.read_csv(data_utils.DATA_PATH + data_utils.LABELS_PATH)
labels_df

In [None]:
next(data_utils.get_amazon_sample(labels_df, load_tiff=True))

In [None]:
count = 0
for row in data_utils.get_amazon_sample(labels_df):
    print(row)
    count += 1
    if count >= 5:
        break
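
The early-exit loop above can also be expressed with `itertools.islice`, which stops consuming a generator after a fixed number of items. This is a minimal sketch: `fake_samples` is a hypothetical stand-in for `data_utils.get_amazon_sample`, whose real output format is not shown in this notebook.

```python
from itertools import islice


def fake_samples():
    """Hypothetical stand-in for data_utils.get_amazon_sample (yields pairs forever)."""
    idx = 0
    while True:
        yield f'train_{idx}', 'clear primary'
        idx += 1


# Take only the first 5 items; the rest of the generator is never consumed
first_rows = list(islice(fake_samples(), 5))
for row in first_rows:
    print(row)
```

With the real generator, the call would presumably look like `islice(data_utils.get_amazon_sample(labels_df), 5)`.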

## Explore labels

Number of labeled samples:

In [None]:
len(labels_df)

Collect the unique labels and encode the tags:

In [None]:
# Build a sorted list of the unique labels
label_list = sorted({label for tag_str in labels_df.tags.values for label in tag_str.split(' ')})
label_list

In [None]:
labels_df = data_utils.encode_tags(labels_df)
labels_df.head()
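
The implementation of `data_utils.encode_tags` is not shown here. One plausible way to turn the space-separated `tags` column into one 0/1 column per label is pandas' `Series.str.get_dummies`. The sketch below runs on a toy frame that only mimics the `image_name` and `tags` layout, so it is an assumption about the behaviour rather than the helper's actual code.

```python
import pandas as pd

# Toy frame mimicking the (image_name, tags) layout of the labels file
toy_df = pd.DataFrame({
    'image_name': ['train_0', 'train_1', 'train_2'],
    'tags': ['haze primary', 'agriculture clear primary water', 'clear primary'],
})

# One 0/1 column per unique tag, split on spaces
tag_columns = toy_df['tags'].str.get_dummies(sep=' ')
encoded_df = pd.concat([toy_df, tag_columns], axis=1)
print(encoded_df)
```

Keeping the original `tags` column alongside the indicator columns makes it easy to use the raw string in plot titles later on.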

Add a `deforestation` label:

In [None]:
labels_df = data_utils.add_deforestation_label(labels_df)
labels_df.head()
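
The rule behind `data_utils.add_deforestation_label` is not shown in this notebook. As a rough sketch, a derived label of this kind can be built by flagging samples that carry at least one tag from a chosen set. The tag set below is purely hypothetical and only illustrates the mechanics.

```python
import pandas as pd

# ASSUMPTION: illustrative tag set only; the notebook's actual definition
# lives in data_utils.add_deforestation_label and is not shown here
HYPOTHETICAL_DEFORESTATION_TAGS = ['slash_burn', 'selective_logging']

# Toy frame with already-encoded 0/1 tag columns
toy_df = pd.DataFrame({
    'image_name': ['train_0', 'train_1', 'train_2'],
    'primary': [1, 1, 1],
    'slash_burn': [0, 1, 0],
    'selective_logging': [0, 0, 1],
})

# A sample is positive if at least one of the chosen tags is present
toy_df['deforestation'] = toy_df[HYPOTHETICAL_DEFORESTATION_TAGS].any(axis=1).astype(int)
print(toy_df)
```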

Analyse labels' occurrence:

In [None]:
counts = labels_df[label_list+['deforestation']].sum()
counts['all'] = len(labels_df)
counts = counts.sort_values()
counts = counts.to_frame()
counts.columns = ['counts']

In [None]:
px.bar(counts)

Dataset balance, measured as the percentage of positive samples (those with the `deforestation` label):

In [None]:
print(f'{(len(labels_df[labels_df.deforestation == 1]) / len(labels_df)) * 100:.2f}%')

While the dataset isn't exactly balanced under our definition of deforestation, it still contains a substantial number of positive samples.

## Explore images

### JPG

Plot an image from each tag:

In [None]:
len(label_list)

In [None]:
_, ax = plt.subplots(len(label_list) // 3 + 1, 3, sharex='col', sharey='row', figsize=(20, 40))
count = 0
for label in label_list:
    sample = labels_df[labels_df[label] == 1].iloc[0]
    img = imread(f"{data_utils.DATA_PATH}{data_utils.IMG_PATH}{sample.image_name}.jpg")
    ax[count // 3, count % 3].imshow(img)
    ax[count // 3, count % 3].set_title(f'{sample.image_name} - {label}')
    count += 1

Plot some deforestation images:

In [None]:
_, ax = plt.subplots(3, 3, sharex='col', sharey='row', figsize=(20, 20))
deforestation_samples = labels_df[labels_df['deforestation'] == 1]
for count in range(9):
    sample = deforestation_samples.iloc[count]
    img = imread(f"{data_utils.DATA_PATH}{data_utils.IMG_PATH}{sample.image_name}.jpg")
    ax[count // 3, count % 3].imshow(img)
    ax[count // 3, count % 3].set_title(f'{sample.image_name} - {sample.tags}')

Estimate pixel stats:

In [None]:
imgs = []
n_samples = len(labels_df)
count = 0
for img_data, label in tqdm(data_utils.get_amazon_sample(labels_df), total=n_samples, desc='Loading samples'):
    imgs.append(img_data)
    count += 1
    if count >= n_samples:
        break

In [None]:
len(imgs)

In [None]:
imgs = np.array(imgs)

In [None]:
imgs.shape

In [None]:
imgs.min()

In [None]:
imgs.max()

In [None]:
# imgs = imgs[:10000]

In [None]:
imgs.mean()

In [None]:
imgs.mean(axis=(0, 1, 2))

In [None]:
imgs.std()

In [None]:
imgs[:5000].var(axis=(0, 1, 2))
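
Stacking every image into a single array, as above, requires the whole dataset to fit in memory, which may be why the variance is computed on a subset. A possible alternative is to accumulate per-channel sums while iterating, so only one image is held at a time. The helper name `streaming_channel_stats` and the synthetic images are assumptions for illustration.

```python
import numpy as np


def streaming_channel_stats(images):
    """Per-channel mean and std over an iterable of (H, W, C) arrays."""
    pixel_count = 0
    channel_sum = None
    channel_sq_sum = None
    for img in images:
        # float64 avoids overflow when squaring uint8/uint16 pixel values
        arr = np.asarray(img, dtype=np.float64)
        pixels = arr.reshape(-1, arr.shape[-1])
        if channel_sum is None:
            channel_sum = np.zeros(pixels.shape[1])
            channel_sq_sum = np.zeros(pixels.shape[1])
        channel_sum += pixels.sum(axis=0)
        channel_sq_sum += (pixels ** 2).sum(axis=0)
        pixel_count += pixels.shape[0]
    mean = channel_sum / pixel_count
    # Population variance E[x^2] - E[x]^2, matching NumPy's default (ddof=0)
    std = np.sqrt(channel_sq_sum / pixel_count - mean ** 2)
    return mean, std


# Synthetic 8-bit images standing in for the real samples
rng = np.random.default_rng(0)
fake_imgs = rng.integers(0, 256, size=(10, 16, 16, 3), dtype=np.uint8)
mean, std = streaming_channel_stats(fake_imgs)
print(mean, std)
```

The results should agree with `fake_imgs.mean(axis=(0, 1, 2))` and `fake_imgs.std(axis=(0, 1, 2))` up to floating-point error.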

### TIFF

Plot an image from each tag:

In [None]:
_, ax = plt.subplots(len(label_list) // 3 + 1, 3, sharex='col', sharey='row', figsize=(20, 40))
count = 0
for label in label_list:
    sample = labels_df[labels_df[label] == 1].iloc[0]
    img = imread(f"{data_utils.DATA_PATH}{data_utils.TIFF_PATH}{sample.image_name}.tif")
    img = img[:, :, :-1]  # drop the last channel before plotting
    img = (img - img.min()) / (img.max() - img.min())  # min-max scale to [0, 1]
    ax[count // 3, count % 3].imshow(img)
    ax[count // 3, count % 3].set_title(f'{sample.image_name} - {label}')
    count += 1

These plots suggest that we would likely need more robust image preprocessing to make adequate use of the TIFF files.
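
One candidate for such preprocessing is a per-channel percentile stretch. Global min-max scaling, as used in the cell above, is sensitive to outliers: a single very bright pixel compresses everything else towards zero. The sketch below is one option rather than a tested recipe for this dataset; the function name, the 2/98 percentile defaults and the synthetic 16-bit image are all assumptions.

```python
import numpy as np


def percentile_stretch(img, low=2, high=98):
    """Rescale each channel to [0, 1] using its low/high percentiles."""
    img = np.asarray(img, dtype=np.float64)
    # Per-channel percentiles computed over the two spatial axes
    lo = np.percentile(img, low, axis=(0, 1), keepdims=True)
    hi = np.percentile(img, high, axis=(0, 1), keepdims=True)
    # Guard against constant channels, where hi == lo
    scale = np.where(hi > lo, hi - lo, 1.0)
    return np.clip((img - lo) / scale, 0.0, 1.0)


# Synthetic 16-bit image with one bright outlier pixel
rng = np.random.default_rng(0)
fake_tiff = rng.integers(3000, 6000, size=(32, 32, 3)).astype(np.uint16)
fake_tiff[0, 0] = 65535
stretched = percentile_stretch(fake_tiff)
print(stretched.min(), stretched.max(), np.median(stretched))
```

Because values outside the chosen percentiles are clipped, the outlier no longer dictates the scaling, and the bulk of the pixels spread across the full display range.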

Plot some deforestation images:

In [None]:
_, ax = plt.subplots(3, 3, sharex='col', sharey='row', figsize=(20, 20))
deforestation_samples = labels_df[labels_df['deforestation'] == 1]
for count in range(9):
    sample = deforestation_samples.iloc[count]
    img = imread(f"{data_utils.DATA_PATH}{data_utils.TIFF_PATH}{sample.image_name}.tif")
    img = img[:, :, :-1]  # drop the last channel before plotting
    img = (img - img.min()) / (img.max() - img.min())  # min-max scale to [0, 1]
    ax[count // 3, count % 3].imshow(img)
    ax[count // 3, count % 3].set_title(f'{sample.image_name} - {sample.tags}')

Estimate pixel stats:

In [None]:
imgs = []
n_samples = len(labels_df)
count = 0
for img_data, label in tqdm(data_utils.get_amazon_sample(labels_df, load_tiff=True), total=n_samples, desc='Loading samples'):
    imgs.append(img_data)
    count += 1
    if count >= n_samples:
        break

In [None]:
imgs = np.array(imgs)

In [None]:
imgs.shape

In [None]:
imgs.min()

In [None]:
imgs.max()

In [None]:
imgs.mean()