# **Utilities DF**

## Expected Features

| Feature name | Type | Description |
|---|---|---|
| `PrimaryPropertyType` | str | example description of the feature |
| `PropertyGFABuilding(s)` | float | example description of the feature |
| `YearBuilt` | datetime | example description of the feature |
| `Occupancy` | float | example description of the feature |
| `Number of Buildings` | int | example description of the feature |
| `Electricity(kWh)` | float | example description of the feature |

The dataframe is a training dataset based on NYC building typologies. It lets us predict power and water demand for various buildings.

In [2]:
%load_ext autoreload
%autoreload 2

In [3]:
# import modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

pd.options.display.float_format = '{:.2f}'.format

In [4]:
from etl.extract import ProjectZero

In [5]:
# load ProjectZero data (ProjectZero is imported in the previous cell)
data = ProjectZero().get_data()

# view keys
data.keys()

# select the NYC dataframe
df = data['ext_nyc']

In [6]:
df.head(3)

| | Unnamed: 0 | Order | Property Id | Property Name | Parent Property Id | Parent Property Name | BBL - 10 digits | NYC Borough, Block and Lot (BBL) self-reported | NYC Building Identification Number (BIN) | Address 1 (self-reported) | ... | Annual Maximum Demand (kW) | Annual Maximum Demand (MM/YYYY) | Total GHG Emissions (Metric Tons CO2e) | Direct GHG Emissions (Metric Tons CO2e) | Indirect GHG Emissions (Metric Tons CO2e) | Water Use (All Water Sources) (kgal) | Water Use Intensity (All Water Sources) (gal/ft²) | Water Required? | Generation Date | DOF Benchmarking Submission Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 4593574 | The Argonaut Building | | | 1010287502.0 | 1010287502 | 1024898 | 224 West 57th St | ... | | | 732.4 | 76.3 | 656.1 | 3635.5 | 21.46 | Not found | 2018-02-14 | Not found |
| 1 | 1 | 3 | 2967701 | Cathedral Preparatory Seminary | | | 4018720007.0 | 4-01872-0007 | 4046340 | 56-25 92nd Street | ... | | | 164.5 | 109.9 | 54.6 | 102.9 | 1.09 | Not found | 2018-02-14 | Not found |
| 2 | 2 | 4 | 4898531 | The Nomad Hotel | | | 1008290050.0 | 1-00829-0050 | 1080710 | 1170 Broadway | ... | | | 1150.2 | 438.0 | 712.3 | 10762.6 | 86.1 | Not found | 2018-02-14 | Not found |


In [7]:
df.describe()

| | Unnamed: 0 | Order | Property Id | BBL - 10 digits | Self-Reported Gross Floor Area (ft²) | Largest Property Use Type - Gross Floor Area (ft²) | 2nd Largest Property Use - Gross Floor Area (ft²) | 3rd Largest Property Use Type - Gross Floor Area (ft²) | Year Built | Number of Buildings | ... | Weather Normalized Site Natural Gas Use (therms) | Electricity Use - Grid Purchase (kBtu) | Electricity Use - Grid Purchase (kWh) | Weather Normalized Site Electricity (kWh) | Annual Maximum Demand (kW) | Total GHG Emissions (Metric Tons CO2e) | Direct GHG Emissions (Metric Tons CO2e) | Indirect GHG Emissions (Metric Tons CO2e) | Water Use (All Water Sources) (kgal) | Water Use Intensity (All Water Sources) (gal/ft²) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 34355.0 | 34355.0 | 34355.0 | 33625.0 | 34355.0 | 34323.0 | 7226.0 | 2450.0 | 34355.0 | 34355.0 | ... | 28047.0 | 31576.0 | 31576.0 | 31013.0 | 2383.0 | 32957.0 | 33116.0 | 33088.0 | 18280.0 | 18271.0 |
| mean | 17177.0 | 17355.15 | 4710374.13 | 2249469964.37 | 114939.21 | 111542.22 | 20136.23 | 11179.5 | 1946.11 | 1.2 | ... | 357689.51 | 3833598.57 | 1123563.34 | 1126104.37 | 2635.07 | 18935.48 | 1665.27 | 17197.66 | 9964.98 | 586.95 |
| std | 9917.58 | 10016.66 | 1676668.58 | 1193103984.16 | 216459.1 | 210898.88 | 51022.11 | 24105.73 | 32.73 | 2.96 | ... | 33292988.63 | 12497187.58 | 3662715.73 | 3674891.71 | 56174.72 | 2954069.88 | 162171.48 | 2943758.43 | 91903.0 | 52513.08 |
| min | 0.0 | 1.0 | 7365.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1051.0 | 0.0 | ... | 0.0 | -6343.3 | -1859.1 | -1859.1 | 0.0 | 0.0 | 0.0 | -17600.5 | 0.0 | 0.0 |
| 25% | 8588.5 | 8666.5 | 2825644.0 | 1014980014.0 | 42273.0 | 41750.0 | 3500.0 | 1899.0 | 1925.0 | 1.0 | ... | 8581.15 | 688842.93 | 201888.33 | 202228.7 | 45.6 | 184.1 | 60.4 | 53.8 | 1954.2 | 28.47 |
| 50% | 17177.0 | 17383.0 | 4897531.0 | 2032480169.0 | 65820.0 | 64381.0 | 8000.0 | 5000.0 | 1935.0 | 1.0 | ... | 32265.6 | 1249026.2 | 366068.6 | 367333.1 | 176.0 | 328.3 | 188.8 | 102.4 | 3794.2 | 50.03 |
| 75% | 25765.5 | 26028.5 | 6297280.0 | 3058740072.0 | 113615.0 | 111605.0 | 16326.25 | 11200.0 | 1965.0 | 1.0 | ... | 58461.9 | 2828562.88 | 829004.28 | 834277.1 | 308.0 | 596.2 | 331.3 | 238.72 | 6631.1 | 83.4 |
| max | 34354.0 | 34686.0 | 6716654.0 | 7000500004.0 | 15077660.0 | 15077660.0 | 992059.4 | 591640.0 | 2021.0 | 161.0 | ... | 3936196559.9 | 574283382.3 | 168312811.4 | 167695514.0 | 2553601.0 | 535429700.0 | 20833880.0 | 535429700.0 | 5446589.9 | 6913227.0 |


## 1. `get_training_data`

This method should return a dataframe with the following features:\
`building`, `area`, `asset`, `electricity_demmand`

In [8]:
# Keep the important features that can also be obtained from the design model
features = [
    'Primary Property Type - Self Selected',
    'Self-Reported Gross Floor Area (ft²)',
    'Year Built',
    'Occupancy',
    'Number of Buildings',
    'Electricity Use - Grid Purchase (kWh)'
]

df = df[features]

# rename NYC columns to match the Seattle column names
df_renamed = df.rename(columns={
    'Primary Property Type - Self Selected': 'building_typology',
    'Self-Reported Gross Floor Area (ft²)': 'building_gfa',
    'Year Built': 'year_built',
    'Occupancy': 'occupancy',
    'Number of Buildings': 'num_buildings',
    'Electricity Use - Grid Purchase (kWh)': 'electricity_demmand'
})

# drop duplicates
len(df_renamed)
df_renamed.duplicated().sum()
df_renamed.drop_duplicates(inplace=True)
len(df_renamed)

# missing data
df_renamed.isnull().sum().sort_values(ascending=False)
df_renamed.dropna(subset=['electricity_demmand'], inplace=True)
df_renamed.isnull().sum().sort_values(ascending=False)


building_typology      0
building_gfa           0
year_built             0
occupancy              0
num_buildings          0
electricity_demmand    0
dtype: int64
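
The selection, renaming, and cleaning steps in this section could be collected into a single function along the following lines. This is only a sketch of what `get_training_data` might look like, not the project's actual implementation: the `COLUMN_MAP` constant and the `raw` parameter are names introduced here for illustration, and the non-negative filter anticipates the check made in the next cell.

```python
import pandas as pd

# Source-to-model column names, copied from the rename step above.
# COLUMN_MAP is a hypothetical name introduced for this sketch.
COLUMN_MAP = {
    'Primary Property Type - Self Selected': 'building_typology',
    'Self-Reported Gross Floor Area (ft²)': 'building_gfa',
    'Year Built': 'year_built',
    'Occupancy': 'occupancy',
    'Number of Buildings': 'num_buildings',
    'Electricity Use - Grid Purchase (kWh)': 'electricity_demmand',
}


def get_training_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Select, rename and clean the modelling columns (illustrative sketch)."""
    out = (
        raw[list(COLUMN_MAP)]                    # keep only the modelling columns
        .rename(columns=COLUMN_MAP)              # switch to the short column names
        .drop_duplicates()                       # remove exact duplicate rows
        .dropna(subset=['electricity_demmand'])  # the target must be present
    )
    # Negative grid purchases are treated as invalid readings.
    return out[out['electricity_demmand'] >= 0]
```

Because every step returns a new dataframe, the function does not modify `raw`, so it can be called repeatedly on `data['ext_nyc']` without reloading the data.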

In [9]:
df_renamed = df_renamed[df_renamed.electricity_demmand >= 0]
df_renamed[df_renamed.electricity_demmand < 0]

| | building_typology | building_gfa | year_built | occupancy | num_buildings | electricity_demmand |
|---|---|---|---|---|---|---|


In [10]:
# check for outliers
px.scatter(df_renamed, x='building_gfa', y='electricity_demmand', color='building_typology')

In [17]:
# Regressing GFA to electricity consumption by filtered property type
keep_types = [
    'Office',
    'K-12 School',
    'Hotel',
    'Multifamily Housing',
    'Hospital (General Medical & Surgical)',
    'Museum',
    'Retail Store',
    'College/University',
    'Laboratory',
    'Other - Mall',
    'Performing Arts',
    'Prison/Incarceration',
    'Courthouse'
]

df_renamed = df_renamed[df_renamed.building_typology.isin(keep_types)]

In [18]:

# dropping extreme outliers across all typologies
df_renamed = df_renamed[~((df_renamed['electricity_demmand'] > 50000000) | (df_renamed['building_gfa'] > 3000000))]

# dropping outliers by typology
# K-12 School
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'K-12 School') & (df_renamed['building_gfa'] > 800000))]

# Hotel
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Hotel') & (df_renamed['building_gfa'] > 100000))]
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Hotel') & (df_renamed['building_gfa'] < 20000))]
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Hotel') & (df_renamed['electricity_demmand'] > 3000000))]

# Multifamily Housing
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Multifamily Housing') & (df_renamed['building_gfa'] > 2000000))]
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Multifamily Housing') & (df_renamed['electricity_demmand'] > 14000000))]
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Multifamily Housing') & (df_renamed['electricity_demmand'] < 2000000) & (df_renamed['building_gfa'] > 700000) )]

# Hospital (General Medical & Surgical)
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Hospital (General Medical & Surgical)') & (df_renamed['electricity_demmand'] < 20000000) & (df_renamed['building_gfa'] > 1300000) )]
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Hospital (General Medical & Surgical)') & (df_renamed['electricity_demmand'] < 30000000) & (df_renamed['building_gfa'] > 1700000) )]

# Museum
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Museum') & (df_renamed['building_gfa'] > 500000))]

# Retail Store
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Retail Store') & (df_renamed['building_gfa'] > 1000000))]

# College/University
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'College/University') & (df_renamed['electricity_demmand'] > 20000000))]
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'College/University') & (df_renamed['electricity_demmand'] < 5000000) & (df_renamed['building_gfa'] > 249000) )]
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'College/University') & (df_renamed['electricity_demmand'] > 10000000) & (df_renamed['building_gfa'] < 500000) )]

# Performing Arts
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Performing Arts') & (df_renamed['electricity_demmand'] > 15000000))]

# Prison/Incarceration
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Prison/Incarceration') & (df_renamed['electricity_demmand'] > 20000000))]

# Courthouse
df_renamed = df_renamed[~((df_renamed['building_typology'] == 'Courthouse') & (df_renamed['building_gfa'] > 1000000))]

# Year built
df_renamed = df_renamed[df_renamed.year_built > 1850]

# number of buildings
df_renamed = df_renamed[df_renamed.num_buildings < 20]
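
The per-typology filters above all follow the same pattern: build a mask for one typology, then drop the rows that match it. One possible way to make the thresholds easier to review is to keep them in a rule table and apply them in a single pass. The sketch below shows only a subset of the rules, with thresholds copied from the cell above. `OUTLIER_RULES` and `drop_outliers` are hypothetical names introduced here, not part of the project code.

```python
import pandas as pd

# Each rule maps a typology to a function returning True for rows to DROP.
# Only a few of the rules from the cell above are shown.
OUTLIER_RULES = {
    'K-12 School': lambda d: d['building_gfa'] > 800_000,
    'Hotel': lambda d: (
        (d['building_gfa'] > 100_000)
        | (d['building_gfa'] < 20_000)
        | (d['electricity_demmand'] > 3_000_000)
    ),
    'Museum': lambda d: d['building_gfa'] > 500_000,
    'Retail Store': lambda d: d['building_gfa'] > 1_000_000,
}


def drop_outliers(df: pd.DataFrame, rules=OUTLIER_RULES) -> pd.DataFrame:
    """Drop rows flagged by any typology-specific rule (illustrative sketch)."""
    drop = pd.Series(False, index=df.index)
    for typology, is_outlier in rules.items():
        # A rule only applies to rows of its own typology.
        drop |= (df['building_typology'] == typology) & is_outlier(df)
    return df[~drop]
```

Each rule depends only on the values in a row, so combining the masks and filtering once keeps the same rows as applying the filters one after another.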

In [19]:
# check after filtering for outliers
px.scatter(df_renamed, x='building_gfa', y='electricity_demmand', color='building_typology')

In [14]:
df = df_renamed
df.head(3)

| | building_typology | building_gfa | year_built | occupancy | num_buildings | electricity_demmand |
|---|---|---|---|---|---|---|
| 0 | Office | 169416 | 1909 | 95 | 1 | 1920103.6 |
| 1 | K-12 School | 94380 | 1963 | 100 | 1 | 180640.0 |
| 3 | Hotel | 50000 | 1994 | 100 | 1 | 579335.2 |


## 2. `get_water_demmand`

This method should return a dataframe with the following features:\
`building`, `area`, `asset`, `potable_water_demmand`
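
The notebook does not implement this method yet. A possible starting point, mirroring the electricity steps in section 1, is sketched below. It rests on assumptions that still need checking: it treats `Water Use (All Water Sources) (kgal)` from the `describe()` output as the target, although that column may include non-potable sources. It also reuses the section 1 column names rather than `building`, `area`, and `asset`. `WATER_COLUMN_MAP` and the `raw` parameter are names introduced here for illustration.

```python
import pandas as pd

# Assumed mapping: the water column comes from the describe() output above,
# and the target name follows the feature list for this method.
WATER_COLUMN_MAP = {
    'Primary Property Type - Self Selected': 'building_typology',
    'Self-Reported Gross Floor Area (ft²)': 'building_gfa',
    'Water Use (All Water Sources) (kgal)': 'potable_water_demmand',
}


def get_water_demmand(raw: pd.DataFrame) -> pd.DataFrame:
    """Select, rename and clean the water-demand columns (illustrative sketch)."""
    out = (
        raw[list(WATER_COLUMN_MAP)]
        .rename(columns=WATER_COLUMN_MAP)
        .drop_duplicates()
        .dropna(subset=['potable_water_demmand'])  # many rows have no water reading
    )
    # Negative usage is treated as an invalid reading.
    return out[out['potable_water_demmand'] >= 0]
```

Since `df` is reassigned to the electricity features earlier in the notebook, this function would need to be called on the original `data['ext_nyc']` dataframe.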