# Calculating the Number of Changed Pixels
In our previous notebooks we augmented our dataset by taking the difference of two satellite images that capture the same location at different times.
To better understand our dataset, we will calculate the number of pixels that actually contain change for every image and store the counts in our annotations CSV.

This can help us determine the conditions under which our model performs well or poorly in the evaluation phase.

## Import Dependencies

In [None]:
from pathlib import Path
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm
from skimage import io
from multiprocessing import Pool
from functools import partial
from timeit import default_timer as timer
tqdm.pandas()

## Set Paths

In [None]:
annotations_path = Path('../input/spacenet-7-change-detection-chips-and-masks/annotations.csv')
root_dir = Path('../input/spacenet-7-change-detection-chips-and-masks/chip_dataset/chip_dataset/change_detection')

## Read CSV into Pandas Dataframe

In [None]:
df = pd.read_csv(annotations_path)

In [None]:
df.head()

### Obtain Number of Activated Pixels
The helper function below accepts the path to the mask whose activated pixels we want to count. Since a mask is either blank or contains positive pixel values, we can count the activated pixels by summing the boolean array produced by the comparison `im > 0`.

In [None]:
def get_number_of_activated_pixels(mask_path, root_dir=root_dir):
    # For optimization purposes, skip reading the file:
    # if the path has 'blank' in it, the number of activated pixels is 0
    if 'blank' in mask_path:
        return 0

    # Read the mask and count the pixels with a value greater than 0
    im = io.imread(root_dir / mask_path)
    n_activated = (im > 0).sum()
    return n_activated
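
To make the counting step concrete, here is a minimal sketch on a small synthetic array. The 3x3 mask and its 255 values are made up for illustration; the real masks are read from disk with `io.imread`.

```python
import numpy as np

# Hypothetical 3x3 mask: 0 means "no change", any positive value means "change"
mask = np.array([[0, 255, 0],
                 [0, 255, 255],
                 [0, 0, 0]], dtype=np.uint8)

# The comparison yields a boolean array; summing it counts the True entries
n_activated = int((mask > 0).sum())

# np.count_nonzero gives the same answer for a non-negative mask
n_nonzero = int(np.count_nonzero(mask))

print(n_activated, n_nonzero)
```

Both expressions count the same pixels here because the mask has no negative values.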

#### Example Path for Blank Mask

In [None]:
blank_mask_path = df[(df['is_blank'] == 'blank')]['mask_path'].iloc[0]
blank_mask_path

#### Example Path for Non-Blank Mask

In [None]:
non_blank_mask_path = df[(df['is_blank'] != 'blank')]['mask_path'].iloc[0]
non_blank_mask_path

Below we check that the function returns the expected counts for both masks: 0 for the blank mask and a positive number for the non-blank one.

In [None]:
get_number_of_activated_pixels(blank_mask_path)

In [None]:
get_number_of_activated_pixels(non_blank_mask_path)

Let's check how long it takes to process 10,000 images using `default_timer` from the Python `timeit` module imported earlier. We will time the non-blank images, as they are more CPU-intensive: each one has to be read from disk.

In [None]:
test_df = df[df['is_blank'] != 'blank']['mask_path'][0:10000]

In [None]:
start = timer()
test = test_df.map(get_number_of_activated_pixels)
end = timer()
print(end - start) # Time in seconds
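
The start/stop pattern above can be wrapped in a small helper so that the result and the elapsed time come back together. This is only a sketch; `time_call` is a hypothetical name that does not appear elsewhere in this notebook.

```python
from timeit import default_timer as timer

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = timer()
    result = func(*args, **kwargs)
    elapsed = timer() - start
    return result, elapsed

# Usage on a cheap built-in call, standing in for the mask-counting map
result, elapsed = time_call(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed:.4f} s")
```

A single run like this is noisy, so treat the number as a rough indication rather than a benchmark.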

In [None]:
len(df)

Our time is not too great. It took us more than 100 seconds to process 10,000 images, and we have a total of more than 3 million images.
Let's make our program faster by parallelizing it with the helper functions below.

In [None]:
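def parallelize(data, func, num_of_processes=8):
    # Split the data into one chunk per worker process
    data_split = np.array_split(data, num_of_processes)
    pool = Pool(num_of_processes)
    # Apply func to every chunk in parallel, then reassemble the results in order
    data = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return data

def run_on_subset(func, data_subset):
    # Apply func to every element of a single chunk
    return data_subset.map(func)

def parallelize_on_rows(data, func, num_of_processes=8):
    # Bind func so that each worker maps it over its own chunk
    return parallelize(data, partial(run_on_subset, func), num_of_processes)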
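
The split, map, and concatenate pattern behind `parallelize` can be checked without starting any worker processes. The sketch below processes the chunks one after another and confirms that the combined result matches a direct `map`. The function `count_chars` and the file names are hypothetical stand-ins, and the chunks are taken by row position with `iloc`, which is a choice made for this sketch rather than what the helpers above do.

```python
import numpy as np
import pandas as pd

def count_chars(path):
    # Hypothetical stand-in for get_number_of_activated_pixels
    return len(path)

paths = pd.Series([f"mask_{i}.png" for i in range(10)])

# Split the row positions into 3 nearly equal chunks
position_chunks = np.array_split(np.arange(len(paths)), 3)
chunks = [paths.iloc[pos] for pos in position_chunks]
chunk_sizes = [len(chunk) for chunk in chunks]

# Each chunk is what one worker would receive; here they run sequentially
partial_results = [chunk.map(count_chars) for chunk in chunks]
combined = pd.concat(partial_results)

# Mapping the whole Series directly should give the same values and index
direct = paths.map(count_chars)
same_values = combined.tolist() == direct.tolist()
same_index = list(combined.index) == list(direct.index)
print(chunk_sizes, same_values, same_index)
```

Because each chunk keeps its original index and the chunks are concatenated in order, the result lines up with the source rows, which is what allows the output to be assigned straight back to a dataframe column.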

Let's compare the time taken to process the same 10000 images.

In [None]:
start = timer()
test = parallelize_on_rows(test_df, get_number_of_activated_pixels)
end = timer()
print(end - start)

Great!

That's more than a 20-fold speedup, and the absolute time saved will only grow as we increase the number of images we process.
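
As a rough back-of-the-envelope estimate, the rounded figures quoted above (about 100 seconds per 10,000 non-blank images, about 3 million images, about a 20-fold speedup) can be extrapolated to the full dataset. These are assumptions taken from the text, not new measurements, and blank masks are skipped without being read, so the real run should be shorter.

```python
# Rounded figures taken from the text above (assumptions, not measurements)
seconds_per_batch = 100      # time for one batch, serial
images_per_batch = 10_000
n_images = 3_000_000
speedup = 20

# Serial estimate: scale the batch time up to the whole dataset
serial_seconds = seconds_per_batch * n_images / images_per_batch
parallel_seconds = serial_seconds / speedup

serial_hours = serial_seconds / 3600
parallel_minutes = parallel_seconds / 60
print(f"serial: ~{serial_hours:.1f} h, parallel: ~{parallel_minutes:.0f} min")
```

Under these assumptions the serial run would take hours, while the parallel run finishes in under half an hour.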

## Obtaining the number of pixels for all the rows
Below we use the function we created earlier to extract the number of pixels containing change from each image in the dataset.

In [None]:
df['n_change_pix'] = parallelize_on_rows(df['mask_path'], get_number_of_activated_pixels)

## Creating Time Periods
Below we will create columns for the time periods in which each image was captured. This will be helpful later on in case we need to calculate time-based statistics.
`period_1` and `period_2` are the months in which the first and second images were captured, respectively.

In [None]:
df['period_1'] = df.progress_apply(lambda x: pd.Period(year=x['year1'],month=x['month1'],freq='M'),axis=1)

In [None]:
df['period_2'] = df.progress_apply(lambda x: pd.Period(year=x['year2'],month=x['month2'],freq='M'),axis=1)

In [None]:
df.head()
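
For readers unfamiliar with `pd.Period`, here is a minimal sketch of what a single monthly period looks like. The year and month are arbitrary example values, not taken from the dataset.

```python
import pandas as pd

# A monthly period built the same way as in the cells above
period = pd.Period(year=2018, month=1, freq='M')

# It prints as year-month and exposes its components as attributes
label = str(period)
year, month = period.year, period.month

# Adding an integer moves forward by that many months
next_period = period + 1
print(label, year, month, next_period)
```

A monthly period represents the whole month rather than a specific day, which matches the year and month columns in the annotations.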

## Calculating the Number of Months Apart for Each Image Pair
Below we will loop over the entire dataframe and use the time periods that we created earlier in order to calculate the number of months that the images are apart.

In [None]:
df['month_diff'] = df.progress_apply(lambda x: abs((x['period_1']-x['period_2']).n), axis=1)
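
The month difference can also be computed with plain column arithmetic, which avoids the row-by-row loop. The sketch below is an alternative rather than the method used in this notebook: it assumes the `year1`, `month1`, `year2`, and `month2` columns hold integers, uses a made-up three-row table, and checks the result against the period-based calculation.

```python
import pandas as pd

# Hypothetical image pairs using the same column names as the annotations
pairs = pd.DataFrame({
    'year1': [2018, 2019, 2019],
    'month1': [1, 12, 6],
    'year2': [2019, 2020, 2018],
    'month2': [3, 1, 6],
})

# |12 * (year2 - year1) + (month2 - month1)|, computed on whole columns
arithmetic_diff = (
    12 * (pairs['year2'] - pairs['year1'])
    + (pairs['month2'] - pairs['month1'])
).abs()

# Period-based calculation, row by row, for comparison
def period_diff(row):
    p1 = pd.Period(year=row['year1'], month=row['month1'], freq='M')
    p2 = pd.Period(year=row['year2'], month=row['month2'], freq='M')
    return abs((p1 - p2).n)

period_based_diff = pairs.apply(period_diff, axis=1)
print(arithmetic_diff.tolist(), period_based_diff.tolist())
```

Both approaches agree on this example. The column arithmetic is likely to be much faster on millions of rows, though that is not measured here.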

## Saving the Output Dataframe as a CSV

In [None]:
df.to_csv('annotations.csv',index=False)
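
One thing to keep in mind when reloading the file: a CSV stores period values as text, so the period columns come back as strings. The sketch below shows one way to restore them, using an in-memory buffer and a made-up one-row table instead of the real `annotations.csv`.

```python
import io
import pandas as pd

# Hypothetical one-row table with a period column and a pixel count
frame = pd.DataFrame({
    'period_1': [pd.Period('2018-01', freq='M')],
    'n_change_pix': [42],
})

# Round trip through CSV text held in memory
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

# The period column is read back as text such as '2018-01'
raw_value = loaded['period_1'].iloc[0]

# Convert each string back into a monthly Period
loaded['period_1'] = loaded['period_1'].map(lambda s: pd.Period(s, freq='M'))
restored = loaded['period_1'].iloc[0]
print(repr(raw_value), repr(restored))
```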