In this Jupyter notebook, we will explain data transformation (from raw (.csv) to dataset (.hdf5)). In addition, this transformation code includes outlier prunning and loadshape profile for filling empty data point. All the raw data files are provided seperately by their data repository.

In [1]:
# Read relevant packages

import numpy as np 
import pandas as pd 
import os 
import h5py 
import csv 
import datetime as dt

In [2]:
# Assigning name and location of the hdf5 file

file_loc = 'data.hdf5'                   

In [3]:
# Assigning NAN values on empty data points

def fill_null(dset,time_index):
    data = dset.copy()
    time = data.loc[:,time_index]
    tlist = time.tolist()
    start = dt.datetime(year = min(tlist).year, month = 1, day = 1, hour = 0, minute = 0)
    end = dt.datetime(year = max(tlist).year, month = 12, day = 31, hour = 23, minute = 0)
    i = start
    while i<=end:
        if i not in tlist:
            data = data.append([(i,np.nan)])
    return data

Let's add UTEXAS dataset on 'data.hdf5'

For the UTEXAS dataset, the structure of the data files is segregated by building and then again by year. For example, 2015 data for the Welch building is located in 'utexas/WEL/2015.csv'

In [4]:
home = 'utexas'                         #Insert home directory where data is stored
ext = '.csv'                            #Insert data storage type
separ = ','                             #Separation
mdata = 'utexas/utexas_metadata.csv'    #Point to location of metadata file

In [5]:
root, folders,x = [m for m in os.walk(home)][0]
folder_paths = [os.path.join(home, subdir) for subdir in folders]
subfiles = []
for f in folder_paths: 
    r,a, data = [m for m in os.walk(f)][0]
    for d in data: 
        if os.path.splitext(d)[1]==ext:
            subfiles.append(os.path.join(r,d))
hdf_dsets = [s.replace('\\','/').replace(ext,'') for s in subfiles]

'Subfiles' should now have a list of the complete path of all data files,and we loop through the files one by one.

In addtion, UTEXAS dataset involves various time formats. convert_any_time converts string formats to pd datetime while float formats (already in pd format) are left alone: 

In [8]:
def convert_any_dtime(x):
    try:
        return dt.datetime.strptime(x, "%m/%d/%y %H:%M")
    except:
        try:
            return dt.datetime.strptime(x, "%m/%d/%Y %H:%M")
        except:
            return None

We need to assign metadata file for UTEXAS dataset.

In [9]:
def read_mdata(mdata_file):
    return pd.read_csv(mdata_file,index_col=0)

meta_db = read_mdata(mdata)

Below is the code that compiles the data into the file. Be warned, data processing may take over 20 minutes. For a sense of where in the process the computer is, remove the commented portions

In [10]:
with h5py.File(file_loc,'a') as f:
    #counter = 0
    for data_file in subfiles: 
        #counter = counter+1
        #print('Now reading dataset #{}......................{}'.format(counter,data_file))
        d = pd.read_csv(data_file).dropna(axis=0,how='any',inplace=False)
        dt_dates = [convert_any_dtime(x) if type(x)==str else x for x in d['DateTime'] ]
        dates= np.array( [x.strftime("%Y-%m-%d %H:%M:%S").encode('utf8') for x in dt_dates], dtype = np.string_).reshape((-1,1))
        usage =np.array([x for x in d['Electrical ( kWh )']],dtype=np.float64).reshape((-1,1))
        hdfs_dset = data_file.replace('\\','/').replace(ext,'')
        f[hdfs_dset] = np.hstack((dates,usage))  

Now we can assign proper hierarchy for the metadata of the UTEXAS dataset.

![body](data_structure.png)

In [11]:
with h5py.File(file_loc,'a') as f:
    for building in meta_db.index:
        psu,sqft = meta_db.loc[building]
        if '{}/{}'.format(home,building) in f:
            bldg_grp = f['{}/{}'.format(home,building)]
            bldg_grp.attrs['PSU'] = psu
            bldg_grp.attrs['Sqft'] = sqft

<h3><font color = #333f48>EUI, we can comback to EUI after outlier detection and filling ...</font></h3>

In [12]:
#def get_eui(dset,axis,sqft):
#    hours = len(dset)
#    energy = dset.loc[:,axis].tolist()
#    #If there are missing values, scale EUI(in kWH/sqft) linearly*
#    total_e = sum(energy)*8766/hours
#    #Convert total energy (kWH) to (kbTU)
#    total_e = total_e *3.412141633
#    return total_e/sqft

In [13]:
#with h5py.File(file_loc,'a') as f:
#    subset = f[home]
#    for buil in subset.keys():
#        building = subset[buil]
#        if 'Sqft' in building.attrs:
#            sqft = building.attrs['Sqft']
#            for year in building.keys():
#                dset = pd.DataFrame(building[year][()])
#                building[year].attrs['EUI'] = get_eui(dset,1,sqft)

<h5><font color = #ff1111>---TODO--- decide whether to fill in incomplete data using regression to calculate EUI or scale linearly with sum(available data)*(8760 hrs/yr)/(hours available data) </font></h5>

Iterate same data transformation for MIT dataset

In [14]:
home = 'MIT'                            #Insert home directory where data is stored
ext = '.csv'                            #Insert data storage type
separ = ','                             #Separation
mdata = 'mdata/mdata.csv'               #Point to location of metadata file

In [15]:
root, folders,files = [m for m in os.walk(home)][0]
subfiles = []
for d in files: 
    if os.path.splitext(d)[1]==ext:
        subfiles.append(os.path.join(root,d))

print(files)

['2014.csv', '2015.csv', '2016.csv']


Note that the MIT dataset is shaped differently than our desires in a few ways: 
<ul>
<li>The files are segregated by year instead of by building</li>
<li>The files contain multiple 'error' values, so we need to screen more than just N/A</li>
<li>The list is NOT ordered, meaning that time intervals next to each other in the original scrape will NOT necessarily be together in the final </li>
<li>In addition the separation between timestamps is not hourly as we would like but every 15 minutes</li>
<li> To further complicate matters, the list increments are in kilowatts (kW), a unit of power, not a unit of energy!!!  Power is related to energy usage in that power is average energy usage over a period of time. A Watt is 1 Joule per Second, and so 1 <b>kilowatt hour</b> is our standard measurement of energy: as
$$1\text{ kWH} =\frac{\text{1000 Joule/sec}}{\text{1 kW}}\times \frac{\text{3600 seconds}}{1 hr}= 3600000\text{ Joule}$$
A 1 kW power usage for 15 minutes is therefore $\frac{1}{4} \times 3600000 \text{ Joule} = 900000 \text{Joule}$
Thus to find the total power usage over 4 different 15 minute intervals, take the average of the power usage of each individual file 
</ul>

In [16]:
def mit_name_clean(colname):
    colname = colname.replace('RealPower','')
    return colname.replace('TFR','')
def mit_aggregate(sorted_array,name):
    sorted_array.drop_duplicates(subset = 'DATE_TIME', keep = 'first', inplace = True )
    times = sorted_array['DATE_TIME'].tolist()
    usage = sorted_array[name].tolist()
    aggregated_times = []
    aggregated_usage = []
    for i in range(len(times)-3):
        one = (times[i+1]-times[i] ==dt.timedelta(minutes=15))
        two = (times[i+2]-times[i+1] ==dt.timedelta(minutes=15))
        three = (times[i+3]-times[i+2] ==dt.timedelta(minutes=15))
        if(times[i].minute==0 and one and two and three):
            aggregated_times.append(times[i].strftime("%Y-%m-%d %H:%M:%S").encode('utf8'))
            aggregated_usage.append(str((usage[i]+usage[i+1]+usage[i+2]+usage[i+3])/4).encode('utf8'))
    return np.hstack((np.array(aggregated_times).reshape((-1,1)),np.array(aggregated_usage).reshape((-1,1))))
    
        

We can account by the error values by setting up a filter, and deal with the time issue by 1) sorting the dataframe before partitioning it and 2) writing the aggregate method

In [17]:
ERROR_VALUES = ['No Data','I/O Timeout','Error','Pt Created','Configure']
with h5py.File(file_loc) as f:
    grp = f[home] if home in f else f.create_group(home)
    for dfile in subfiles: 
        year = dfile.split('\\')[-1].split('.')[0]
        df = pd.read_csv(dfile)
        print('Now analyzing data from: ................{}'.format(year))
        df.sort_values(by =['DATE_TIME'],inplace=True)
        dt_used = False
        df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME'])
        for building in df.columns:
            if(not dt_used):
                dt_used = True
            else:
                building_name = mit_name_clean(building)
                print('Building.....{}'.format(building_name))
                subset = df[['DATE_TIME',building]]
                subset = subset[subset.apply(lambda x: str(x[building]) not in ERROR_VALUES,axis=1)]
                if len(subset) == 0:
                    continue
                else:
                    subset[building] = subset.apply(lambda x: float(x[building]), axis = 1)
                    cleaned = subset[subset[building]!=0]
                    dset_name ='{}/{}'.format(building_name,year)
                    grp[dset_name] = mit_aggregate(cleaned,building)

  interactivity=interactivity, compiler=compiler, result=result)


Now analyzing data from: ................MIT/2014
Building.....E15A
Building.....E15B
Building.....E18A
Building.....E18B
Building.....E18C


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return func(*args, **kwargs)


Building.....E19A
Building.....E19B
Building.....E1A
Building.....E23A
Building.....E23B
Building.....E2A
Building.....E38A
Building.....E40A
Building.....E40B
Building.....E40C
Building.....E40D
Building.....E52A
Building.....E62A
Building.....E62B
Building.....M13A
Building.....M13B
Building.....M13C
Building.....M13D
Building.....M13E
Building.....M18A
Building.....M18B
Building.....M24A
Building.....M26A
Building.....M26B
Building.....M26C
Building.....M26D
Building.....M2A
Building.....M32D_A
Building.....M32D_B
Building.....M32G_A
Building.....M32G_B
Building.....M33A
Building.....M35A
Building.....M36A
Building.....M36B
Building.....M36C
Building.....M36D
Building.....M36E
Building.....M37A
Building.....M37B
Building.....M46A
Building.....M46B
Building.....M46C
Building.....M46D
Building.....M50A
Building.....M50B
Building.....M54A
Building.....M54B
Building.....M54C
Building.....M56A
Building.....M56B
Building.....M66A
Building.....M66B
Building.....M66EMG
Building.....M68A
Bui

  interactivity=interactivity, compiler=compiler, result=result)


Now analyzing data from: ................MIT/2015
Building.....E15A
Building.....E15B
Building.....E18A
Building.....E18B
Building.....E18C
Building.....E19A
Building.....E19B
Building.....E1A
Building.....E23A
Building.....E23B
Building.....E2A
Building.....E38A
Building.....E40A
Building.....E40B
Building.....E40C
Building.....E40D
Building.....E52A
Building.....E62A
Building.....E62B
Building.....M13A
Building.....M13B
Building.....M13C
Building.....M13D
Building.....M13E
Building.....M18A
Building.....M18B
Building.....M24A
Building.....M26A
Building.....M26B
Building.....M26C
Building.....M26D
Building.....M2A
Building.....M32D_A
Building.....M32D_B
Building.....M32G_A
Building.....M32G_B
Building.....M33A
Building.....M35A
Building.....M36A
Building.....M36B
Building.....M36C
Building.....M36D
Building.....M36E
Building.....M37A
Building.....M37B
Building.....M46A
Building.....M46B
Building.....M46C
Building.....M46D
Building.....M50A
Building.....M50B
Building.....M54A
Building.

  interactivity=interactivity, compiler=compiler, result=result)


Now analyzing data from: ................MIT/2016
Building.....E15A
Building.....E15B
Building.....E18A
Building.....E18B
Building.....E18C
Building.....E19A
Building.....E19B
Building.....E1A
Building.....E23A
Building.....E23B
Building.....E2A
Building.....E38A
Building.....E40A
Building.....E40B
Building.....E40C
Building.....E40D
Building.....E52A
Building.....E62A
Building.....E62B
Building.....M13A
Building.....M13B
Building.....M13C
Building.....M13D
Building.....M13E
Building.....M18A
Building.....M18B
Building.....M24A
Building.....M26A
Building.....M26B
Building.....M26C
Building.....M26D
Building.....M2A
Building.....M32D_A
Building.....M32D_B
Building.....M32G_A
Building.....M32G_B
Building.....M33A
Building.....M35A
Building.....M36A
Building.....M36B
Building.....M36C
Building.....M36D
Building.....M36E
Building.....M37A
Building.....M37B
Building.....M46A
Building.....M46B
Building.....M46C
Building.....M46D
Building.....M50A
Building.....M50B
Building.....M54A
Building.

Metadata assignment for the MIT dataset

In [18]:
mdata_path = '{}/{}'.format(home,mdata)
df = pd.read_csv(mdata_path)
with h5py.File(file_loc) as f: 
    ds = f[home]
    for build in ds.keys():
        if(build in df['Building'].tolist()):
            psu = df[df['Building']==build]['PSU'].tolist()[0]
        building = ds[build]
        building.attrs['PSU'] = psu

Iterate same data transformation for the IRELAND dataset

In [19]:
home = 'ireland'                        #Insert home directory where data is stored
ext = '.txt'                            #Insert data storage type
separ = ' '                             #Separation
mdata = 'ireland/meta/metadata.csv'     #Point to location of metadata file

In [20]:
root, folders,x = [m for m in os.walk(home)][0]
data_files = [os.path.join(home,dfile) for dfile in x]
data_files

['ireland/File1.txt',
 'ireland/File2.txt',
 'ireland/File3.txt',
 'ireland/File4.txt',
 'ireland/File5.txt',
 'ireland/File6.txt']

Note that the Ireland dataset is like the MIT dataset in that it too is shaped differently than our desires.
<ul>
<li>The files are segregated randomly, 1000 meters all years to a file instead of by building</li>
<li>The list is NOT ordered, meaning that time intervals next to each other in the original scrape will NOT necessarily be together in the final </li>
<li>In addition the separation between timestamps is not hourly as we would like but every 30 minutes </li>
<li> Most curiously, there is an interesting format this dataset uses for date time, of 'xxxyy' where 'xxx' is days after Dec. 31, 2008 and yy is 30 minute intervals. 00:00 - 00:30 is yy=01
</ul>

In [21]:
#Ireland dataset stores date format in a weird form: 
def ireland_date(datum):
    #First three digits are the # of days after 31 Dec 2008
    dat = str(datum)
    num_days = int(float(dat[:3]))
    #Last three digits are the # of 30 minute intervals (1 = 0:0-0:30)
    times = int(float(dat[3:]))
    ORIGIN = dt.datetime(2008,12,31)
    DAY_INCREMENT = dt.timedelta(days=1)
    MIN_INCREMENT = dt.timedelta(minutes = 30)
    return ORIGIN + DAY_INCREMENT*num_days + MIN_INCREMENT*(times-1)
def aggregate(sorted_array):
    sorted_array.drop_duplicates(subset = 1, keep = 'first', inplace = True )
    times = sorted_array[1].dt.to_pydatetime().tolist()
    usage = sorted_array[2].tolist()
    aggregated_times = []
    aggregated_usage = []
    sets = {}
    current = 2009 
    changed = False
    for i in range(len(times)-1):
        if(times[i].year==2010 and not changed):
            changed = True
            sets[2009]=np.hstack((np.array(aggregated_times).reshape((-1,1)),np.array(aggregated_usage).reshape((-1,1))))
            aggregated_times = []
            aggregated_usage = []
        if(times[i].minute==0 and times[i+1]-times[i] ==dt.timedelta(minutes=30)):
            aggregated_times.append(times[i].strftime("%Y-%m-%d %H:%M:%S").encode('utf8'))
            aggregated_usage.append(str(usage[i]+usage[i+1]).encode('utf8'))
    sets[2010] = np.hstack((np.array(aggregated_times).reshape((-1,1)),np.array(aggregated_usage).reshape((-1,1))))
    return sets

Be very careful with running the script - it can take up to $\textbf{2 hours!!!}$

In [None]:
with h5py.File(file_loc) as f:
    counter = 0 
    for dfile in data_files:
        print('Now transcribing Data File: {}'.format(dfile))
        data = pd.read_csv(dfile,sep=separ,header=None)
        buildings = data[0].unique()
        for b in buildings:
            counter = counter + 1
            print('-----Building #{}:.........ID#:{}'.format(counter,b))
            subset = data[data[0]==b]
            subset[1] = subset.apply(lambda x: ireland_date(x[1]),axis = 1)
            subset.sort(1,inplace = True)
            sets = aggregate(subset)
            for key in sets.keys():
                f['ireland/{}/{}'.format(b,key)]=sets[key]
            

Now transcribing Data File: ireland/File1.txt
-----Building #1:.........ID#:1392


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  na_position=na_position)


-----Building #2:.........ID#:1951
-----Building #3:.........ID#:1491
-----Building #4:.........ID#:1194
-----Building #5:.........ID#:1804
-----Building #6:.........ID#:1048
-----Building #7:.........ID#:1802
-----Building #8:.........ID#:1287
-----Building #9:.........ID#:1529
-----Building #10:.........ID#:1463
-----Building #11:.........ID#:1860
-----Building #12:.........ID#:1922
-----Building #13:.........ID#:1334
-----Building #14:.........ID#:1604
-----Building #15:.........ID#:1042
-----Building #16:.........ID#:1494
-----Building #17:.........ID#:1078
-----Building #18:.........ID#:1984
-----Building #19:.........ID#:1083
-----Building #20:.........ID#:1978
-----Building #21:.........ID#:1481
-----Building #22:.........ID#:1524
-----Building #23:.........ID#:1637
-----Building #24:.........ID#:1783
-----Building #25:.........ID#:1539
-----Building #26:.........ID#:1777
-----Building #27:.........ID#:1518
-----Building #28:.........ID#:1698
-----Building #29:.........ID#:1094


-----Building #227:.........ID#:1459
-----Building #228:.........ID#:1633
-----Building #229:.........ID#:1906
-----Building #230:.........ID#:1654
-----Building #231:.........ID#:1789
-----Building #232:.........ID#:1091
-----Building #233:.........ID#:1935
-----Building #234:.........ID#:1266
-----Building #235:.........ID#:1142
-----Building #236:.........ID#:1168
-----Building #237:.........ID#:1136
-----Building #238:.........ID#:1100
-----Building #239:.........ID#:1810
-----Building #240:.........ID#:1231
-----Building #241:.........ID#:1015
-----Building #242:.........ID#:1267
-----Building #243:.........ID#:1718
-----Building #244:.........ID#:1550
-----Building #245:.........ID#:1820
-----Building #246:.........ID#:1702
-----Building #247:.........ID#:1006
-----Building #248:.........ID#:1081
-----Building #249:.........ID#:1146
-----Building #250:.........ID#:1554
-----Building #251:.........ID#:1908
-----Building #252:.........ID#:1728
-----Building #253:.........ID#:1577
-

-----Building #449:.........ID#:1127
-----Building #450:.........ID#:1456
-----Building #451:.........ID#:1369
-----Building #452:.........ID#:1639
-----Building #453:.........ID#:1983
-----Building #454:.........ID#:1347
-----Building #455:.........ID#:1726
-----Building #456:.........ID#:1704
-----Building #457:.........ID#:1991
-----Building #458:.........ID#:1988
-----Building #459:.........ID#:1635
-----Building #460:.........ID#:1360
-----Building #461:.........ID#:1786
-----Building #462:.........ID#:1662
-----Building #463:.........ID#:1036
-----Building #464:.........ID#:1579
-----Building #465:.........ID#:1075
-----Building #466:.........ID#:1371
-----Building #467:.........ID#:1900
-----Building #468:.........ID#:1367
-----Building #469:.........ID#:1040
-----Building #470:.........ID#:1098
-----Building #471:.........ID#:1382
-----Building #472:.........ID#:1046
-----Building #473:.........ID#:1921
-----Building #474:.........ID#:1468
-----Building #475:.........ID#:1803
-

-----Building #671:.........ID#:1652
-----Building #672:.........ID#:1647
-----Building #673:.........ID#:1268
-----Building #674:.........ID#:1349
-----Building #675:.........ID#:1477
-----Building #676:.........ID#:1497
-----Building #677:.........ID#:1252
-----Building #678:.........ID#:1034
-----Building #679:.........ID#:1881
-----Building #680:.........ID#:1884
-----Building #681:.........ID#:1119
-----Building #682:.........ID#:1414
-----Building #683:.........ID#:1445
-----Building #684:.........ID#:1220
-----Building #685:.........ID#:1650
-----Building #686:.........ID#:1775
-----Building #687:.........ID#:1809
-----Building #688:.........ID#:1383
-----Building #689:.........ID#:1446
-----Building #690:.........ID#:1017
-----Building #691:.........ID#:1831
-----Building #692:.........ID#:1693
-----Building #693:.........ID#:1862
-----Building #694:.........ID#:1171
-----Building #695:.........ID#:1099
-----Building #696:.........ID#:1435
-----Building #697:.........ID#:1031
-

-----Building #893:.........ID#:1012
-----Building #894:.........ID#:1085
-----Building #895:.........ID#:1673
-----Building #896:.........ID#:1770
-----Building #897:.........ID#:1546
-----Building #898:.........ID#:1506
-----Building #899:.........ID#:1116
-----Building #900:.........ID#:1203
-----Building #901:.........ID#:1738
-----Building #902:.........ID#:1366
-----Building #903:.........ID#:1507
-----Building #904:.........ID#:1994
-----Building #905:.........ID#:1686
-----Building #906:.........ID#:1812
-----Building #907:.........ID#:1385
-----Building #908:.........ID#:1547
-----Building #909:.........ID#:1942
-----Building #910:.........ID#:1192
-----Building #911:.........ID#:1193
-----Building #912:.........ID#:1300
-----Building #913:.........ID#:1771
-----Building #914:.........ID#:1690
-----Building #915:.........ID#:1291
-----Building #916:.........ID#:1457
-----Building #917:.........ID#:1692
-----Building #918:.........ID#:1492
-----Building #919:.........ID#:1645
-

-----Building #1111:.........ID#:2197
-----Building #1112:.........ID#:2255
-----Building #1113:.........ID#:2019
-----Building #1114:.........ID#:2060
-----Building #1115:.........ID#:2861
-----Building #1116:.........ID#:2684
-----Building #1117:.........ID#:2758
-----Building #1118:.........ID#:2891
-----Building #1119:.........ID#:2499
-----Building #1120:.........ID#:2352
-----Building #1121:.........ID#:2428
-----Building #1122:.........ID#:2721
-----Building #1123:.........ID#:2479
-----Building #1124:.........ID#:2984
-----Building #1125:.........ID#:2696
-----Building #1126:.........ID#:2717
-----Building #1127:.........ID#:2711
-----Building #1128:.........ID#:2458
-----Building #1129:.........ID#:2133
-----Building #1130:.........ID#:2826
-----Building #1131:.........ID#:2494
-----Building #1132:.........ID#:2858
-----Building #1133:.........ID#:2810
-----Building #1134:.........ID#:2264
-----Building #1135:.........ID#:2837
-----Building #1136:.........ID#:2770
-----Buildin

-----Building #1327:.........ID#:2450
-----Building #1328:.........ID#:2829
-----Building #1329:.........ID#:2629
-----Building #1330:.........ID#:2936
-----Building #1331:.........ID#:2273
-----Building #1332:.........ID#:2079
-----Building #1333:.........ID#:2053
-----Building #1334:.........ID#:2800
-----Building #1335:.........ID#:2259
-----Building #1336:.........ID#:2675
-----Building #1337:.........ID#:2749
-----Building #1338:.........ID#:2795
-----Building #1339:.........ID#:2234
-----Building #1340:.........ID#:2351
-----Building #1341:.........ID#:2049
-----Building #1342:.........ID#:2674
-----Building #1343:.........ID#:2619
-----Building #1344:.........ID#:2620
-----Building #1345:.........ID#:2807
-----Building #1346:.........ID#:2088
-----Building #1347:.........ID#:2584
-----Building #1348:.........ID#:2949
-----Building #1349:.........ID#:2920
-----Building #1350:.........ID#:2956
-----Building #1351:.........ID#:2417
-----Building #1352:.........ID#:2365
-----Buildin

-----Building #1543:.........ID#:2321
-----Building #1544:.........ID#:2462
-----Building #1545:.........ID#:2590
-----Building #1546:.........ID#:2004
-----Building #1547:.........ID#:2959
-----Building #1548:.........ID#:2305
-----Building #1549:.........ID#:2457
-----Building #1550:.........ID#:2859
-----Building #1551:.........ID#:2030
-----Building #1552:.........ID#:2991
-----Building #1553:.........ID#:2431
-----Building #1554:.........ID#:2902
-----Building #1555:.........ID#:2093
-----Building #1556:.........ID#:2456
-----Building #1557:.........ID#:2888
-----Building #1558:.........ID#:2284
-----Building #1559:.........ID#:2868
-----Building #1560:.........ID#:2734
-----Building #1561:.........ID#:2081
-----Building #1562:.........ID#:2114
-----Building #1563:.........ID#:2764
-----Building #1564:.........ID#:2762
-----Building #1565:.........ID#:2292
-----Building #1566:.........ID#:2033
-----Building #1567:.........ID#:2571
-----Building #1568:.........ID#:2745
-----Buildin

-----Building #1759:.........ID#:2216
-----Building #1760:.........ID#:2639
-----Building #1761:.........ID#:2336
-----Building #1762:.........ID#:2015
-----Building #1763:.........ID#:2514
-----Building #1764:.........ID#:2523
-----Building #1765:.........ID#:2324
-----Building #1766:.........ID#:2561
-----Building #1767:.........ID#:2834
-----Building #1768:.........ID#:2178
-----Building #1769:.........ID#:2399
-----Building #1770:.........ID#:2121
-----Building #1771:.........ID#:2551
-----Building #1772:.........ID#:2383
-----Building #1773:.........ID#:2756
-----Building #1774:.........ID#:2183
-----Building #1775:.........ID#:2761
-----Building #1776:.........ID#:2892
-----Building #1777:.........ID#:2490
-----Building #1778:.........ID#:2131
-----Building #1779:.........ID#:2036
-----Building #1780:.........ID#:2828
-----Building #1781:.........ID#:2507
-----Building #1782:.........ID#:2020
-----Building #1783:.........ID#:2172
-----Building #1784:.........ID#:2128
-----Buildin

-----Building #1975:.........ID#:2574
-----Building #1976:.........ID#:2145
-----Building #1977:.........ID#:2120
-----Building #1978:.........ID#:2618
-----Building #1979:.........ID#:2741
-----Building #1980:.........ID#:2279
-----Building #1981:.........ID#:2459
-----Building #1982:.........ID#:2340
-----Building #1983:.........ID#:2580
-----Building #1984:.........ID#:2657
-----Building #1985:.........ID#:2731
-----Building #1986:.........ID#:2016
-----Building #1987:.........ID#:2460
-----Building #1988:.........ID#:2557
-----Building #1989:.........ID#:2244
-----Building #1990:.........ID#:2311
-----Building #1991:.........ID#:2483
-----Building #1992:.........ID#:2326
-----Building #1993:.........ID#:2642
-----Building #1994:.........ID#:2783
-----Building #1995:.........ID#:2425
-----Building #1996:.........ID#:2368
-----Building #1997:.........ID#:2660
-----Building #1998:.........ID#:2155
Now transcribing Data File: ireland/File3.txt
-----Building #1999:.........ID#:3823
----

-----Building #2190:.........ID#:3352
-----Building #2191:.........ID#:3536
-----Building #2192:.........ID#:3019
-----Building #2193:.........ID#:3394
-----Building #2194:.........ID#:3657
-----Building #2195:.........ID#:3464
-----Building #2196:.........ID#:3164
-----Building #2197:.........ID#:3928
-----Building #2198:.........ID#:3988
-----Building #2199:.........ID#:3900
-----Building #2200:.........ID#:3516
-----Building #2201:.........ID#:3038
-----Building #2202:.........ID#:3282
-----Building #2203:.........ID#:3429
-----Building #2204:.........ID#:3211
-----Building #2205:.........ID#:3506
-----Building #2206:.........ID#:3512
-----Building #2207:.........ID#:3324
-----Building #2208:.........ID#:3249
-----Building #2209:.........ID#:3877
-----Building #2210:.........ID#:3706
-----Building #2211:.........ID#:3970
-----Building #2212:.........ID#:3909
-----Building #2213:.........ID#:3005
-----Building #2214:.........ID#:3320
-----Building #2215:.........ID#:3802
-----Buildin

-----Building #2406:.........ID#:3924
-----Building #2407:.........ID#:3888
-----Building #2408:.........ID#:3789
-----Building #2409:.........ID#:3139
-----Building #2410:.........ID#:3482
-----Building #2411:.........ID#:3576
-----Building #2412:.........ID#:3918
-----Building #2413:.........ID#:3906
-----Building #2414:.........ID#:3472
-----Building #2415:.........ID#:3131
-----Building #2416:.........ID#:3430
-----Building #2417:.........ID#:3820
-----Building #2418:.........ID#:3370
-----Building #2419:.........ID#:3775
-----Building #2420:.........ID#:3214
-----Building #2421:.........ID#:3289
-----Building #2422:.........ID#:3436
-----Building #2423:.........ID#:3830
-----Building #2424:.........ID#:3601
-----Building #2425:.........ID#:3957
-----Building #2426:.........ID#:3639
-----Building #2427:.........ID#:3431
-----Building #2428:.........ID#:3833
-----Building #2429:.........ID#:3021
-----Building #2430:.........ID#:3292
-----Building #2431:.........ID#:3121
-----Buildin

-----Building #2622:.........ID#:3375
-----Building #2623:.........ID#:3595
-----Building #2624:.........ID#:3248
-----Building #2625:.........ID#:3160
-----Building #2626:.........ID#:3996
-----Building #2627:.........ID#:3023
-----Building #2628:.........ID#:3042
-----Building #2629:.........ID#:3614
-----Building #2630:.........ID#:3843
-----Building #2631:.........ID#:3032
-----Building #2632:.........ID#:3643
-----Building #2633:.........ID#:3343
-----Building #2634:.........ID#:3138
-----Building #2635:.........ID#:3715
-----Building #2636:.........ID#:3538
-----Building #2637:.........ID#:3766
-----Building #2638:.........ID#:3210
-----Building #2639:.........ID#:3099
-----Building #2640:.........ID#:3948
-----Building #2641:.........ID#:3217
-----Building #2642:.........ID#:3058
-----Building #2643:.........ID#:3839
-----Building #2644:.........ID#:3570
-----Building #2645:.........ID#:3858
-----Building #2646:.........ID#:3724
-----Building #2647:.........ID#:3501
-----Buildin

-----Building #2838:.........ID#:3242
-----Building #2839:.........ID#:3331
-----Building #2840:.........ID#:3215
-----Building #2841:.........ID#:3401
-----Building #2842:.........ID#:3846
-----Building #2843:.........ID#:3383
-----Building #2844:.........ID#:3574
-----Building #2845:.........ID#:3674
-----Building #2846:.........ID#:3119
-----Building #2847:.........ID#:3161
-----Building #2848:.........ID#:3036
-----Building #2849:.........ID#:3687
-----Building #2850:.........ID#:3297
-----Building #2851:.........ID#:3300
-----Building #2852:.........ID#:3937
-----Building #2853:.........ID#:3012
-----Building #2854:.........ID#:3046
-----Building #2855:.........ID#:3089
-----Building #2856:.........ID#:3616
-----Building #2857:.........ID#:3852
-----Building #2858:.........ID#:3940
-----Building #2859:.........ID#:3368
-----Building #2860:.........ID#:3047
-----Building #2861:.........ID#:3165
-----Building #2862:.........ID#:3958
-----Building #2863:.........ID#:3947
-----Buildin

-----Building #3053:.........ID#:4572
-----Building #3054:.........ID#:4331
-----Building #3055:.........ID#:4565
-----Building #3056:.........ID#:4604
-----Building #3057:.........ID#:4578
-----Building #3058:.........ID#:4480
-----Building #3059:.........ID#:4609
-----Building #3060:.........ID#:4424
-----Building #3061:.........ID#:4688
-----Building #3062:.........ID#:4456
-----Building #3063:.........ID#:4147
-----Building #3064:.........ID#:4046
-----Building #3065:.........ID#:4594
-----Building #3066:.........ID#:4899
-----Building #3067:.........ID#:4730
-----Building #3068:.........ID#:4298
-----Building #3069:.........ID#:4522
-----Building #3070:.........ID#:4876
-----Building #3071:.........ID#:4479
-----Building #3072:.........ID#:4086
-----Building #3073:.........ID#:4993
-----Building #3074:.........ID#:4539
-----Building #3075:.........ID#:4133
-----Building #3076:.........ID#:4437
-----Building #3077:.........ID#:4123
-----Building #3078:.........ID#:4564
-----Buildin

Assigning metadata for IRELAND dataset

In [None]:
def ireland_meta():
    TIMEZONE = "Europe/Cork"
    INDUSTRY = "Residential"
    metadata = pd.read_csv('IRELAND/meta/metadata.csv',encoding = 'ISO-8859-1',header=None)
    subset = metadata.loc[1:,[0,34,38,39]]
    with h5py.File(file_loc) as f:
        subset[0] = [str(x) for x in subset[0]]
        length = len(subset)
        i=0
        dset = f['ireland']
        for meterid in subset[0]:
            i = i+1
            print('Adding special metadata for Building #{}/{}............ID:{}'.format(i,length,meterid))
            grp = dset[meterid] if meterid in dset else None
            if grp == None:
                print('ID not found?')
            else:
                house_type = int(subset[subset[0]==meterid][34])
                if(house_type==1):
                    grp.attrs['PSU'] = 'Apartment'
                elif(house_type==2 or house_type==3 or house_type==4 or house_type==5):
                    grp.attrs['PSU'] = 'House'
                area = float(subset[subset[0]==meterid][38])
                if(area != 999999999):
                    sqft = area
                    units = int(subset[subset[0]==meterid][39])
                    if(units==1):
                        sqft = area*10.7639    #m^2 to sqft
                    grp.attrs['Sqft'] = sqft
        for mid in f['ireland'].keys():
            print(mid)
            grp = f['ireland/{}'.format(mid)] 
            grp.attrs['Timezone'] = TIMEZONE
            grp.attrs['Industry'] = INDUSTRY    

In [None]:
ireland_meta()

Iterate same data transformation for the [GENOME](https://github.com/buds-lab/the-building-data-genome-project) dataset

In [None]:
home = 'genome'                         #Insert home directory where data is stored
ext = '.csv'                            #Insert data storage type
separ = ' '                             #Separation
mdata = 'genome/meta_open.csv'          #Point to location of metadata file

In [None]:
def get_attrs(row):
    name = row[0]
    industry = row[5]
    psu = row[9]
    sqft = row[11]
    subindustry = row[13]
    timezone = row[14]
    dataend = to_dt_md(row[1]).year
    datastart = to_dt_md(row[2]).year
    return {'Name':name,'Industry':industry,'PSU':psu,'Sqft':sqft,'Subindustry':subindustry,'Timezone':timezone,'End':dataend,'Start':datastart}
def clean_nan(dataframe,column):
    return dataframe[dataframe.apply(lambda x: not np.isnan(x[column]),axis=1)]
def parse():
    DATA_PATH = 'Genome/temp_open_utc.csv'
    METADATA_PATH = 'Genome/meta_open.csv'
    ds = pd.read_csv(DATA_PATH)
    with open(METADATA_PATH) as metadata:
        reader = csv.reader(metadata)
        header  = True
        with h5py.File(file_loc) as dfile:
            f = dfile['genome'] if 'genome' in dfile else dfile.create_group('genome')
            i = 0
            for row in reader:
                if(header):           #If header row has not been read
                    header = False    #Don't do anything
                else:
                    i=i+1
                    attrs = get_attrs(row)
                    year_start = attrs['Start']
                    year_end = attrs['End']
                    name = attrs['Name']
                    print('Processing Building: #{}/507....... ID: {}'.format(i,name))
                    cleaned = clean_nan(dataframe=ds,column=name).loc[:,['timestamp',name]]
                    cleaned['timestamp'] = pd.to_datetime(cleaned['timestamp'])
                    grp = f.create_group(name) if name not in f else f[name]
                    #Set attributes from dictionary
                    for attribute in attrs.items():
                        if(attribute[0]!='Start' and attribute[0] != 'End' and attribute[0]!='Name'):
                            grp.attrs[attribute[0]] = attribute[1]
                    for year in np.arange(year_start,year_end+1):
                        start = dt.datetime(year=year,month=1,day=1,hour=0,minute=0,second=0)
                        end = dt.datetime(year=year+1,month=1,day=1,hour=0,minute=0,second=0)
                        mask = (cleaned['timestamp']>=start)&(cleaned['timestamp']<end)
                        subset = cleaned.loc[mask]
                        dates = np.array([x.strftime("%Y-%m-%d %H:%M:%S").encode('utf8') for x in subset['timestamp']],dtype=np.string_).reshape((-1,1))
                        usage =np.array( [float(use) for use in subset[name]], dtype = np.float64).reshape((-1,1))
                        grp[str(year)]= np.hstack((dates,usage))
def to_dt_md(string):
    return dt.datetime.strptime(string,"%d/%m/%y %H:%M")
def to_dt(string):
    cuts = string.split('+')[0] #Cleave the time zone modification
    return dt.datetime.strptime(cuts,"%Y-%m-%d %H:%M:%S")

In [None]:
parse()

Iterate same data transformation for the PECAN dataset (from their sql database)

In [None]:
import psycopg2 as ps

In [None]:
def create_connection(uid,pwd):
    HOST = '67.78.67.93'         #dataport.cloud
    PORT = 5434
    USER = uid   
    PWD = pwd    
    conn = None
    try:
        conn = ps.connect(dbname = "postgres", host = HOST, user = USER, password= PWD,port = PORT)
    except ps.Error as e:
        print ("connection error"+ e)
    return conn
def destroy_connection(conn):
    conn.close()

In [None]:
def dfclean(df,clean_column):
    return df[df.apply(lambda x: x[clean_column] is not None,axis=1)]
def get_years(date_list):
    years = [date.year for date in date_list]
    return np.unique(years)
def get_building(data_id, conn):
    STMT = "SELECT localhour,use FROM university.electricity_egauge_hours WHERE dataid={}".format(data_id)
    try:
        cur = conn.cursor()
        cur.execute(STMT)
        rows = cur.fetchall()
        df = pd.DataFrame(rows)
        return df
    except:
        print('Querying Error')
    return pd.DataFrame()

In [None]:
def get_ids():
    uniques_100=[22, 26, 48, 54, 59, 68, 77, 86, 93, 94, 101, 114, 115, 121, 130, 135, 160, 171, 187, 203, 222, 232, 243, 252,
    267, 275, 280, 297, 330, 347, 364, 370, 379, 410, 434, 436, 457, 470, 483, 484, 490, 491, 499, 503, 507, 508, 516, 527, 545,
    547, 555, 573, 575, 580, 585, 604, 621, 624, 645, 661, 668, 698, 739, 744, 765, 772, 774, 781, 796, 821, 861, 871, 878, 890,
    898, 900, 930, 936, 946, 954, 974, 980, 991, 994]
    uniques_200=[1037, 1069, 1086, 1103,1105,1153, 1167, 1169, 1185, 1192, 1202, 1283, 1310, 1314, 1331, 1334, 1350, 1354, 1392,
    1403, 1415, 1450, 1463, 1464, 1479, 1500, 1507, 1508, 1524, 1551, 1577, 1586, 1589, 1597, 1601, 1617, 1629, 1632, 1642,
    1681, 1696, 1697, 1700, 1714, 1718, 1731, 1766, 1782, 1790, 1791, 1792, 1796, 1800, 1801, 1830, 1832, 1845, 1854, 1879,
    1889, 1947, 1953, 1994]
    uniques_300 = [2004, 2018, 2031, 2034, 2062, 2072, 2075, 2094, 2129, 2144, 2156, 2158, 2171, 2199, 2204, 2207, 2233, 2242,
    2247, 2335, 2337, 2354, 2360, 2361, 2365, 2366, 2378, 2401, 2449, 2458, 2461, 2465, 2470, 2472, 2505, 2510, 2520, 2523, 2532,
    2557, 2575, 2606, 2638, 2641, 2667, 2710, 2742, 2750, 2751, 2755, 2769, 2787, 2814, 2815, 2818, 2824, 2829, 2845, 2859, 2864,
    2873, 2903, 2907, 2925, 2931, 2945, 2953, 2965, 2974, 2980, 2986, 2992, 2995]
    uniques_400 = [3009, 3032, 3036, 3039, 3044, 3087, 3092, 3104, 3126, 3134, 3143, 3160, 3192, 3204, 3215, 3221, 3224, 3235,
    3263, 3268, 3273, 3299, 3310, 3353, 3367, 3368, 3392, 3394, 3401, 3411, 3413, 3425, 3426, 3443, 3456, 3482, 3484, 3500,
    3504, 3506, 3510, 3519, 3527, 3531, 3538, 3544, 3577, 3615, 3631, 3632, 3635, 3649, 3652, 3676, 3678, 3687, 3719, 3721,
    3723, 3734, 3736, 3778, 3789, 3795, 3806, 3829, 3831, 3849, 3864, 3873, 3883, 3886, 3893, 3916, 3918, 3935, 3938, 3953,
    3964, 3967, 3973]
    uniques_500 = [4000, 4022, 4031, 4042, 4053, 4083, 4095, 4135, 4147, 4154, 4193, 4213, 4220, 4224, 4251, 4296, 4297, 4298,
    4302, 4313, 4321, 4329, 4336, 4342, 4352, 4357, 4373, 4375, 4383, 4416, 4438, 4447, 4473, 4495, 4499, 4505, 4514, 4526,
    4544, 4575, 4590, 4601, 4633, 4641, 4660, 4670, 4674, 4699, 4703, 4732, 4761, 4767, 4773, 4776, 4800, 4830, 4856, 4864,
    4874, 4910, 4920, 4922, 4927, 4934, 4944, 4946, 4956, 4957, 4967, 4974, 4998]
    uniques_600 = [5009, 5026, 5035, 5060, 5087, 5109, 5129, 5164, 5187, 5209, 5218, 5226, 5246, 5252, 5262, 5271, 5275, 5279,
    5288, 5298, 5317, 5356, 5357, 5371, 5395, 5400, 5403, 5438, 5439, 5448, 5449, 5450, 5456, 5485, 5539, 5545, 5552, 5568,
    5615, 5652, 5658, 5673, 5677, 5718, 5728, 5738, 5746, 5749, 5759, 5778, 5784, 5785, 5786, 5796, 5809, 5810, 5814, 5817,
    5852, 5874, 5889, 5892, 5904, 5909, 5921, 5938, 5944, 5949, 5959, 5972, 5994]
    uniques_700=[6012, 6061, 6063, 6072, 6078, 6083, 6101, 6108, 6121, 6125, 6139, 6148, 6165, 6174, 6191, 6248, 6264, 6266, 6268,
    6286, 6324, 6334, 6348, 6377, 6378, 6412, 6418, 6423, 6429, 6460, 6497, 6498, 6500, 6536, 6545, 6547, 6578, 6593, 6614, 6636,
    6643, 6673, 6688, 6689, 6691, 6692, 6730, 6799, 6800, 6826, 6836, 6871, 6887, 6888, 6910, 6911, 6941, 6956, 6960, 6979, 6990]
    uniques_800 = [7001, 7013, 7016, 7017, 7024, 7030, 7036, 7057, 7062, 7108, 7114, 7117, 7122, 7166, 7208, 7240, 7276, 7287, 7319,
    7361, 7390, 7408, 7409, 7429, 7436, 7468, 7491, 7504, 7510, 7512, 7527, 7531, 7536, 7541, 7549, 7560, 7585, 7587, 7597, 7617,
    7627, 7638, 7639, 7641, 7680, 7693, 7703, 7719, 7731, 7739, 7741, 7764, 7767, 7769, 7787, 7788, 7792, 7793, 7794, 7800, 7818,
    7850, 7863, 7866, 7875, 7881, 7893, 7900, 7901, 7940, 7951, 7965, 7973, 7982, 7984, 7989]
    uniques_900 = [8029, 8031, 8034, 8046, 8047, 8059, 8061, 8071, 8079, 8084, 8086, 8092, 8117, 8121, 8122, 8142, 8155, 8156, 8163,
    8183, 8188, 8197, 8198, 8201, 8218, 8236, 8243, 8273, 8282, 8292, 8317, 8328, 8342, 8368, 8386, 8395, 8419, 8467, 8555, 8565,
    8574, 8589, 8597, 8600, 8622, 8626, 8645, 8669, 8729, 8730, 8733, 8736, 8741, 8767, 8807, 8829, 8847, 8848, 8852, 8857, 8862,
    8872, 8886, 8890, 8942, 8956, 8961, 8967, 8986, 8995]
    uniques_1000=[9001, 9019, 9036, 9052, 9085, 9121, 9134, 9139, 9141, 9142, 9156, 9160, 9165, 9182, 9195, 9201, 9206, 9213, 9215,
    9233, 9235, 9237, 9248, 9251, 9277, 9278, 9295, 9333, 9340, 9341, 9343, 9356, 9370, 9434, 9451, 9462, 9484, 9488, 9498, 9499,
    9509, 9548, 9555, 9578, 9585, 9605, 9609, 9610, 9612, 9613, 9624, 9631, 9642, 9643, 9647, 9654, 9670, 9674, 9688, 9701, 9729,
    9737, 9745, 9766, 9771, 9773, 9775, 9776, 9803, 9818, 9830, 9836, 9846, 9875, 9912, 9915, 9919, 9921, 9922, 9923, 9926, 9929,
    9931, 9932, 9933, 9934, 9935, 9936, 9937, 9938, 9939, 9942, 9958, 9971, 9981, 9982, 9983]
    return uniques_100 + uniques_200 + uniques_300 + uniques_400 + uniques_500 + uniques_600 + uniques_700 + uniques_800 + uniques_900 + uniques_1000

Be very careful! Code can run as fast as 30 mintues to slightly over 2 hours depends on internet connection

In [None]:
user = str(input('Please enter Username:'))
pwd = str(input('Please enter Password:'))
conn = create_connection(user,pwd)

with h5py.File(file_loc) as f:
    pecan = f['pecan'] if 'pecan' in f else f.create_group('pecan')
    i = 0 
    for building_id in get_ids():
        i = i+1 
        print('Building #{}/747 : ID #{}.....'.format(i,building_id))
        df = get_building(building_id,conn)
        if(df.empty):
                print('Empty dataset')
        else:
            grp = None
            if '{}'.format(building_id) not in pecan:
                grp = pecan.create_group(str(building_id))
            else:
                grp = pecan['{}'.format(building_id)] 
            years = get_years(df[0].tolist())
            for year in years:
                print('...Adding Dataset - {}'.format(year))
                subset = dfclean(df[df.apply(lambda x: x[0].year==year,axis=1)],1)
                date_list = [date.strftime("%Y-%m-%d %H:%M:%S").encode('utf8') for date in subset[0].tolist()]
                usage = [float(x) for x in subset[1].tolist()]
                dataset = np.hstack((np.array(date_list,dtype=np.string_).reshape((-1,1)),np.array(usage,dtype=np.float64).reshape((-1,1))))
                grp[str(year)]=dataset


In [None]:
def simplifying_function_2011(x):
    if(x[2]==None or x[3]==None or x[4]==None or x[5] ==None or x[6]==None or x[7]==None):
        return None
    elif(x[2]==0 and x[3]==0 and x[4]==0 and x[5]==0 and x[6]==0 and x[7]==0):
        return 'Apartment'
    else:
        return 'House'
def add_mdata():
    TIMEZONE = "America/Austin"
    INDUSTRY = "Residential"
    STMT_2013 = 'SELECT dataid,"Conditioned_Square_Footage__c","Type_of_Home__c" FROM university.audits_2013_main'
    STMT_2011 = 'SELECT dataid,conditions_square_foot,type_of_home_single_family,type_of_home_duplex,type_of_home_triplex,type_of_home_four_plex,type_of_home_condo,type_of_home_town_home FROM university.audits_2011'
    df_2013 = pd.DataFrame()
    df_2011 = pd.DataFrame()
    user = str(input('Please enter Username:'))
    pwd = str(input('Please enter Password:'))
    try:
        conn = create_connection(user,pwd)
        cur = conn.cursor()
        cur.execute(STMT_2011)
        rows = cur.fetchall()
        destroy_connection(conn)
        df_2011 = pd.DataFrame(rows)
    except ps.Error:
        print('Querying Error_2011')
    try:
        conn = create_connection(user,pwd)
        cur = conn.cursor()
        cur.execute(STMT_2013)
        rows = cur.fetchall()
        destroy_connection(conn)
        df_2013 = pd.DataFrame(rows)
    except ps.Rrror:
        print('Querying Error_2013')
    house = df_2011.apply(simplifying_function_2011,axis=1)
    df_2011[2] = house
    subset_2011 = df_2011.loc[:,0:2]
    df_2013[2]=[x if x=='Apartment' else 'House'for x in df_2013[2].tolist()]
    union = df_2011.merge(df_2013,how='outer',left_on=0,right_on=0)
    lists = get_ids()
    with h5py.File(file_loc) as f:
        for index,row in union.iterrows():
            bid = int(row[0])
            print(bid)
            if bid in lists:
                grp = f['pecan/{}'.format(bid)]
                if(pd.isnull(row['2_y']) and not pd.isnull(row['2_x'])):
                    grp.attrs['PSU'] = row['2_x']
                elif(not pd.isnull(row['2_y'])):
                    grp.attrs['PSU']=row['2_y']
                if(pd.isnull(row['1_y']) and not pd.isnull(row['1_x'])):
                    grp.attrs['Sqft'] = float(row['1_x'])
                elif(not pd.isnull(row['1_y'])):
                    grp.attrs['Sqft']=float(row['1_y'])
        for key in f.keys():
            grp =f[key]
            grp.attrs['Timezone'] = TIMEZONE
            grp.attrs['Industry'] = INDUSTRY
        

In [None]:
add_mdata()