In this notebook, we will prepare the data file that we will pass on to the `overlap` utility.

See: https://github.com/kchristidis/overlap

In [1]:
import pandas as pd

Load the data we generated in the previous notebook.

In [2]:
df_candidate = pd.read_csv('01-segments-candidate.csv')

In [3]:
df_candidate.head()

,dataid,city,building_type,total_square_footage,house_construction_year,egauge_min_time,egauge_max_time
0,26,Austin,Single-Family Home,2075.0,2008.0,2012-11-03 00:00:00,2017-12-02 09:59:00
1,77,Austin,Single-Family Home,2669.0,2009.0,2014-06-06 05:00:00,2017-07-23 08:34:00
2,93,Austin,Single-Family Home,2934.0,1993.0,2012-12-09 00:00:00,2017-12-02 08:59:00
3,114,Austin,Single-Family Home,1842.0,2008.0,2013-10-16 00:00:00,2017-12-02 08:31:00
4,171,Austin,Single-Family Home,2376.0,2008.0,2012-05-03 00:00:00,2017-12-02 08:59:00


In [4]:
df_candidate.set_index('dataid', inplace=True)

In [5]:
df_candidate.head()

dataid,city,building_type,total_square_footage,house_construction_year,egauge_min_time,egauge_max_time
26,Austin,Single-Family Home,2075.0,2008.0,2012-11-03 00:00:00,2017-12-02 09:59:00
77,Austin,Single-Family Home,2669.0,2009.0,2014-06-06 05:00:00,2017-07-23 08:34:00
93,Austin,Single-Family Home,2934.0,1993.0,2012-12-09 00:00:00,2017-12-02 08:59:00
114,Austin,Single-Family Home,1842.0,2008.0,2013-10-16 00:00:00,2017-12-02 08:31:00
171,Austin,Single-Family Home,2376.0,2008.0,2012-05-03 00:00:00,2017-12-02 08:59:00


In [6]:
df_candidate.dtypes

city                        object
building_type               object
total_square_footage       float64
house_construction_year    float64
egauge_min_time             object
egauge_max_time             object
dtype: object

Recast the egauge columns to datetime objects.

In [7]:
df_candidate['egauge_min_time'] = pd.to_datetime(df_candidate['egauge_min_time'])
df_candidate['egauge_max_time'] = pd.to_datetime(df_candidate['egauge_max_time'])

In [8]:
df_candidate.dtypes

city                               object
building_type                      object
total_square_footage              float64
house_construction_year           float64
egauge_min_time            datetime64[ns]
egauge_max_time            datetime64[ns]
dtype: object
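
The indexing and datetime conversion steps above can also be done at load time. The following is a minimal sketch, assuming the CSV carries the columns shown in the preview; the inline `sample_csv` string is a hypothetical stand-in for `01-segments-candidate.csv`, built from two of the preview rows with a reduced set of columns.

```python
import io

import pandas as pd

# Hypothetical stand-in for '01-segments-candidate.csv'
# (two rows copied from the preview, fewer columns).
sample_csv = io.StringIO(
    "dataid,city,egauge_min_time,egauge_max_time\n"
    "26,Austin,2012-11-03 00:00:00,2017-12-02 09:59:00\n"
    "77,Austin,2014-06-06 05:00:00,2017-07-23 08:34:00\n"
)

# index_col replaces the separate set_index() call;
# parse_dates replaces the two pd.to_datetime() calls.
df_sample = pd.read_csv(
    sample_csv,
    index_col='dataid',
    parse_dates=['egauge_min_time', 'egauge_max_time'],
)
```

With the real file, the same keyword arguments would be passed to `pd.read_csv('01-segments-candidate.csv', ...)`.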

In [9]:
df_candidate.head()

dataid,city,building_type,total_square_footage,house_construction_year,egauge_min_time,egauge_max_time
26,Austin,Single-Family Home,2075.0,2008.0,2012-11-03 00:00:00,2017-12-02 09:59:00
77,Austin,Single-Family Home,2669.0,2009.0,2014-06-06 05:00:00,2017-07-23 08:34:00
93,Austin,Single-Family Home,2934.0,1993.0,2012-12-09 00:00:00,2017-12-02 08:59:00
114,Austin,Single-Family Home,1842.0,2008.0,2013-10-16 00:00:00,2017-12-02 08:31:00
171,Austin,Single-Family Home,2376.0,2008.0,2012-05-03 00:00:00,2017-12-02 08:59:00


Convert the `egauge_min_time` and `egauge_max_time` columns into Unix timestamps (seconds since the epoch), so that we can feed them into the `overlap` utility.

See: https://stackoverflow.com/a/35630179/2363529

In [10]:
df_candidate['egauge_min_time'] = df_candidate['egauge_min_time'].astype('int64')//1e9
df_candidate['egauge_max_time'] = df_candidate['egauge_max_time'].astype('int64')//1e9

In [11]:
df_candidate.dtypes

city                        object
building_type               object
total_square_footage       float64
house_construction_year    float64
egauge_min_time            float64
egauge_max_time            float64
dtype: object

In [12]:
df_candidate.head()

dataid,city,building_type,total_square_footage,house_construction_year,egauge_min_time,egauge_max_time
26,Austin,Single-Family Home,2075.0,2008.0,1351901000.0,1512209000.0
77,Austin,Single-Family Home,2669.0,2009.0,1402031000.0,1500799000.0
93,Austin,Single-Family Home,2934.0,1993.0,1355011000.0,1512205000.0
114,Austin,Single-Family Home,1842.0,2008.0,1381882000.0,1512203000.0
171,Austin,Single-Family Home,2376.0,2008.0,1336003000.0,1512205000.0
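
Note that floor-dividing by the float `1e9` produces `float64` columns, which is why the timestamps print with a trailing `.0` and appear rounded in the preview. If integer timestamps are preferred, one option is to subtract the epoch and floor-divide by a one-second `Timedelta`. This is a sketch rather than what the notebook ran; it treats the naive timestamps as UTC, and the two sample values are taken from the first preview row.

```python
import pandas as pd

# Two timestamps taken from the first row of the preview.
times = pd.Series(pd.to_datetime(['2012-11-03 00:00:00', '2017-12-02 09:59:00']))

# Subtracting the epoch gives timedeltas; floor-dividing by one second
# gives whole seconds as integers, regardless of the stored resolution.
epoch = pd.Timestamp('1970-01-01')
unix_seconds = (times - epoch) // pd.Timedelta(seconds=1)

print(unix_seconds.tolist())  # [1351900800, 1512208740]
```

Whether `overlap` accepts both integer and float-formatted values is not something this notebook establishes, so check its README before switching.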


Create a CSV file that contains just the necessary columns for `overlap`.

In [13]:
df_candidate[['egauge_min_time', 'egauge_max_time']].head()

dataid,egauge_min_time,egauge_max_time
26,1351901000.0,1512209000.0
77,1402031000.0,1500799000.0
93,1355011000.0,1512205000.0
114,1381882000.0,1512203000.0
171,1336003000.0,1512205000.0


In [14]:
df_candidate[['egauge_min_time', 'egauge_max_time']].to_csv('02-segments-candidate-unix.csv')
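
Before handing the file to `overlap`, it may be worth checking that every interval is well-formed. The sketch below is an optional sanity check, not part of the original workflow; `find_bad_intervals` is a hypothetical helper, and the small `intervals` frame stands in for `df_candidate[['egauge_min_time', 'egauge_max_time']]`, using the first two preview rows in whole seconds.

```python
import pandas as pd

# Stand-in for the two-column frame written above.
intervals = pd.DataFrame(
    {
        'egauge_min_time': [1351900800, 1402030800],
        'egauge_max_time': [1512208740, 1500798840],
    },
    index=pd.Index([26, 77], name='dataid'),
)


def find_bad_intervals(df):
    """Return rows with a missing bound or with a start later than the end."""
    missing = df.isna().any(axis=1)
    inverted = df['egauge_min_time'] > df['egauge_max_time']
    return df[missing | inverted]


bad = find_bad_intervals(intervals)

# Calling to_csv() without a path returns the CSV text, which makes it
# easy to inspect the header row that will be written.
csv_text = intervals.to_csv()
header = csv_text.splitlines()[0]
```

An empty `bad` frame means every row has both bounds and a start no later than its end.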

Install `overlap`, then execute it as follows:

    overlap -headers 02-segments-candidate-unix.csv 02-segments-overlap.csv