# Tutorial 00: Computing features

This notebook shows how to compute features with the Pipeline and Feature classes. First, we import a function that loads data for multiple cities and the module that contains the feature code.

In [1]:
%load_ext autoreload
%autoreload 2
from damage.data import load_data_multiple_cities
from damage import features

We start by reading the data for the cities we are interested in. In this case, we will load data from Daraa.

In [2]:
## Reading
cities = ['daraa']

rasters_path = '../data/city_rasters/'
annotations_path = '../data/annotations/'
polygons_path = '../data/polygons/'

data = load_data_multiple_cities(cities, rasters_path, annotations_path, polygons_path)

The resulting object is a dictionary. Each key is a filename preceded by a keyword (populated_areas, annotation, raster, no_analysis_areas), and each value is the corresponding data (dataframes, tifs, arrays...). Adding that keyword as a prefix lets us detect in a later step what type of data each key contains.

The mapping between the city name (Daraa in this case) and the corresponding filenames is done in damage/data/data_sources.py. Ideally, we should create a standard way of naming the files so that this file is not required.

In [3]:
data.keys()

dict_keys(['populated_areas_populated_areas.shp', 'annotation_4_Damage_Sites_Daraa_CDA.shp', 'raster_daraa_2011_10_17_zoom_19.tif', 'raster_daraa_2013_11_10_zoom_19.tif', 'raster_daraa_2014_05_01_zoom_19.tif', 'raster_daraa_2016_02_25_zoom_19.tif', 'raster_daraa_2016_04_19_zoom_19.tif', 'raster_daraa_2017_02_07_zoom_19.tif', 'no_analysis_areas_5_No_Analysis_Areas_Daraa.shp'])
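As a rough illustration of how the prefix can be used to tell data types apart, the sketch below groups keys by their leading keyword. The helper `group_keys_by_prefix` and the `KNOWN_PREFIXES` tuple are hypothetical names introduced here, not part of the damage library; the prefixes are simply the ones visible in the keys above.

```python
from collections import defaultdict

# Prefixes taken from the keys shown above (assumed list, for illustration only).
KNOWN_PREFIXES = ('populated_areas', 'annotation', 'raster', 'no_analysis_areas')


def group_keys_by_prefix(keys, prefixes=KNOWN_PREFIXES):
    """Group dictionary keys by the data-type keyword they start with."""
    groups = defaultdict(list)
    for key in keys:
        for prefix in prefixes:
            if key.startswith(prefix + '_'):
                groups[prefix].append(key)
                break  # a key belongs to one data type at most
    return dict(groups)


example_keys = [
    'populated_areas_populated_areas.shp',
    'annotation_4_Damage_Sites_Daraa_CDA.shp',
    'raster_daraa_2011_10_17_zoom_19.tif',
    'raster_daraa_2017_02_07_zoom_19.tif',
    'no_analysis_areas_5_No_Analysis_Areas_Daraa.shp',
]
groups = group_keys_by_prefix(example_keys)
print(groups['raster'])
```

With the real dictionary, the same call would be `group_keys_by_prefix(data.keys())`.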

In [5]:
# check unique dates
shapefile_df = data['annotation_4_Damage_Sites_Daraa_CDA.shp']
print(shapefile_df.SensDt.unique())
print(shapefile_df.SensDt_2.unique())
print(shapefile_df.SensDt_3.unique())
print(shapefile_df.SensDt_4.unique())

['2013-09-07' None]
['2014-05-01' None]
['2015-06-04' None]
['2016-04-19']
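The raster keys listed earlier carry a date in their filename, which can be compared with these annotation dates. The following is a minimal sketch, assuming the `raster_<city>_<YYYY>_<MM>_<DD>_zoom_<n>.tif` naming seen in those keys; `raster_date_from_key` is a hypothetical helper and not a function of the damage library.

```python
import re
from datetime import date

# Assumed filename pattern: ..._YYYY_MM_DD_zoom_NN.tif
RASTER_DATE_PATTERN = re.compile(r'_(\d{4})_(\d{2})_(\d{2})_zoom_')


def raster_date_from_key(key):
    """Return the date embedded in a raster key, or None if there is none."""
    match = RASTER_DATE_PATTERN.search(key)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


print(raster_date_from_key('raster_daraa_2016_04_19_zoom_19.tif'))
print(raster_date_from_key('annotation_4_Damage_Sites_Daraa_CDA.shp'))
```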


Now, let's get into the code of the Pipeline.

## The feature pipeline

The ___Pipeline___ class takes a __list of tuples__ (name, function) for preprocessors and features and applies them with the transform method. __The feature functions need to return a dataframe with an identically structured index so the merge can be performed (e.g. city, patch_id).__

The ___transform___ method __takes a dictionary of data__ where each key represents a different data source (e.g. annotations) and each value a data object (e.g. pandas dataframe). The transform method __iterates first over the preprocessor functions, overwriting the data object__. Then, it __iterates over the feature functions, creating new keys__ in the data dictionary with the passed name. That way, features can use data generated by previously computed features.

Finally, the transform method merges the data generated by the feature functions, making use of the __common index structure__.


In [6]:
from functools import reduce
import pandas as pd

from damage.features.base import Transformer


class Pipeline(Transformer):

    def __init__(self, features, preprocessors):
        self.features = features
        self.feature_names = [feature_name for feature_name, _ in self.features]
        self.preprocessors = preprocessors

    def transform(self, data):
        # Preprocessors run in order, each one overwriting the data dictionary.
        for _, preprocessor in self.preprocessors:
            data = preprocessor(data)

        # Each feature stores its output under its own name, so later
        # features can read the results of earlier ones.
        for feature_name, feature in self.features:
            data[feature_name] = feature(data)

        feature_data = [data[name] for name in self.feature_names if name in data]
        return self._merge_feature_data(feature_data)

    def _merge_feature_data(self, feature_data):
        return reduce(lambda l, r: pd.merge(l, r, left_index=True, right_index=True, how='outer'), feature_data)
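To see why the feature dataframes need a shared index, the sketch below applies the same `reduce` and outer `pd.merge` call to two small frames whose `patch_id` values only partly overlap. The frames are invented for illustration.

```python
from functools import reduce

import pandas as pd

# Two toy feature frames indexed by patch_id; patch '1' and patch '3'
# each appear in only one of them.
left = pd.DataFrame(
    {'destroyed': [0, 1]},
    index=pd.Index(['1', '2'], name='patch_id'),
)
right = pd.DataFrame(
    {'image': ['image_b', 'image_c']},
    index=pd.Index(['2', '3'], name='patch_id'),
)

# Same merge as in _merge_feature_data: outer join on the index.
merged = reduce(
    lambda l, r: pd.merge(l, r, left_index=True, right_index=True, how='outer'),
    [left, right],
)
print(merged)
```

Because the join is an outer one, rows present in only one frame are kept and the missing columns are filled with NaN rather than dropped.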


## A toy example

Let's start with a toy example. First, we create our data.

In [6]:
annotations = pd.DataFrame({
    'destroyed': [0, 1],
    'patch_id': ['1', '2'],
})
annotations

Unnamed: 0,destroyed,patch_id
0,0,1
1,1,2


In [7]:
image = pd.DataFrame({
    'image': ['image_a', 'image_b'],
    'patch_id': ['1', '2'],
})
image

Unnamed: 0,image,patch_id
0,image_a,1
1,image_b,2


In [8]:
data = {
    'annotations': annotations,
    'image': image
}

Now, we create some simple functions: one preprocessor and two features.

In [9]:
def preprocessor_preprocess_annotations(data):
    annotation_data = data['annotations']
    data['annotations'] = annotation_data.rename(columns={'destroyed': 'damage'})
    return data

def feature_create_damage_dummy(data):
    annotation_data = data['annotations'].set_index('patch_id')
    damage_dummy = pd.get_dummies(annotation_data['damage'], drop_first=True, prefix='destroyed')
    return damage_dummy

def feature_split_images(data):
    image_data = data['image'].set_index('patch_id')
    # Split
    return image_data

Now we apply these functions to the data dictionary we created, and we will get a pandas dataframe with our features, indexed by the common index.

In [11]:
pipeline = Pipeline(
    preprocessors=[
        ('preprocess_annotations', preprocessor_preprocess_annotations)
    ],
    features=[
        ('damage_dummy', feature_create_damage_dummy),
        ('feature_split_images', feature_split_images)
    ]
)
feature_data = pipeline.transform(data)
feature_data.head()

patch_id,destroyed_1,image
1,0,image_a
2,1,image_b
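The toy features above are independent of each other. The sketch below illustrates the other case described earlier, where a feature reads the output of a previously computed one from the data dictionary. The function names and the `'intact'`/`'destroyed'` labels are hypothetical, and the loop only mimics the feature step of `transform` rather than reusing the Pipeline class.

```python
import pandas as pd


def feature_damage(data):
    # First feature: the damage column indexed by patch_id.
    return data['annotations'].set_index('patch_id')[['damage']]


def feature_damage_label(data):
    # Second feature: reads the first feature's output, stored under 'damage'.
    damage = data['damage']
    labels = damage['damage'].map({0: 'intact', 1: 'destroyed'})
    return labels.to_frame('damage_label')


data = {
    'annotations': pd.DataFrame({'damage': [0, 1], 'patch_id': ['1', '2']}),
}
features = [
    ('damage', feature_damage),
    ('damage_label', feature_damage_label),
]

# Same idea as the feature loop in transform: each output becomes a new key.
for name, feature in features:
    data[name] = feature(data)

result = data['damage'].join(data['damage_label'])
print(result)
```

The order of the list matters in this setup: `feature_damage_label` would raise a `KeyError` if it ran before `feature_damage`.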


## A real example

Next, we run a real example with the data from Daraa and some feature classes that exist in the damage library.

In [7]:
## Reading
cities = ['daraa']

rasters_path = '../data/city_rasters/'
annotations_path = '../data/annotations/'
polygons_path = '../data/polygons/'

data = load_data_multiple_cities(cities, rasters_path, annotations_path, polygons_path)

In [8]:
### Processing
from datetime import timedelta
patch_size = 64
stride = patch_size
TIME_TO_ANNOTATION_THRESHOLD = timedelta(weeks=1)
pipeline = features.Pipeline(
    preprocessors=[
        ('AnnotationPreprocessor', features.AnnotationPreprocessor()),
    ],
    features=[
        ('RasterSplitter', features.RasterSplitter(patch_size=patch_size, stride=stride)),
        ('AnnotationMaker', features.AnnotationMaker(patch_size, TIME_TO_ANNOTATION_THRESHOLD)),
        ('RasterPairMaker', features.RasterPairMaker()),
    ],

)

feature_data = pipeline.transform(data)

INFO:::AnnotationPreprocessor:::2019-07-26 17:17:38,427:::Applying AnnotationPreprocessor
INFO:::RasterSplitter:::2019-07-26 17:17:38,748:::Applying RasterSplitter
100%|██████████| 207/207 [01:59<00:00,  1.74it/s]
100%|██████████| 207/207 [00:02<00:00, 69.69it/s]
100%|██████████| 207/207 [00:01<00:00, 121.99it/s]
100%|██████████| 207/207 [00:02<00:00, 72.77it/s]
100%|██████████| 207/207 [00:01<00:00, 109.33it/s]
100%|██████████| 207/207 [01:42<00:00,  2.01it/s]
INFO:::AnnotationMaker:::2019-07-26 17:22:42,018:::Applying AnnotationMaker
INFO:::RasterPairMaker:::2019-07-26 17:22:43,817:::Applying RasterPairMaker
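For intuition about `patch_size` and `stride`, the sketch below lists the top-left corners of square patches in a sliding-window split. It is a generic illustration under the assumption that patches must fit fully inside the image; `patch_origins` is a hypothetical helper, and RasterSplitter may handle edges and indexing differently.

```python
def patch_origins(height, width, patch_size, stride):
    """Return (row, col) top-left corners of patches that fit inside the image."""
    rows = range(0, height - patch_size + 1, stride)
    cols = range(0, width - patch_size + 1, stride)
    return [(row, col) for row in rows for col in cols]


# stride == patch_size gives non-overlapping patches, as in the cell above.
non_overlapping = patch_origins(height=256, width=192, patch_size=64, stride=64)

# A smaller stride makes neighbouring patches overlap.
overlapping = patch_origins(height=256, width=192, patch_size=64, stride=32)

print(len(non_overlapping), len(overlapping))
```

Setting the stride equal to the patch size, as done here, tiles the image without overlap, so each pixel belongs to at most one patch.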


In [14]:
print(len(feature_data))
print(feature_data['destroyed'].isnull().sum())
print(feature_data.index.get_level_values('date').unique())

15574
14711
DatetimeIndex(['2017-02-07'], dtype='datetime64[ns]', name='date', freq=None)


And now we can save this data as a pickle file to retrieve it later.

In [19]:
feature_data.to_pickle('../logs/features/example_daraa.p')
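Retrieving the data later can be done with `pd.read_pickle`. The sketch below shows the round trip on a small invented frame written to a temporary directory, so it does not touch the real `../logs/features/` folder; the `(city, patch_id)` index is only an example.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for feature_data with a two-level index.
frame = pd.DataFrame(
    {'destroyed': [0.0, None]},
    index=pd.MultiIndex.from_tuples(
        [('daraa', '1'), ('daraa', '2')],
        names=['city', 'patch_id'],
    ),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.p')
    frame.to_pickle(path)             # save
    restored = pd.read_pickle(path)   # load it back

print(restored.equals(frame))
```

Pickle keeps the index and dtypes intact, but files should only be loaded from sources you trust, since unpickling can execute arbitrary code.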