# **Utilities DF**

## Expected Features

| feature name | type | description |
|---|---|---|
| `PrimaryPropertyType` | str | example description of the feature
| `PropertyGFABuilding(s)` | float | example description of the feature
| `YearBuilt` | datetime | example description of the feature
| `Occupancy` | float | example description of the feature
| `Number of Buildings` | int | example description of the feature
| `Electricity(kWh)` | float | example description of the feature

The dataframe provides a training dataset based on NYC building typologies, which lets us predict power and water demand for various buildings.

In [33]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [45]:
# import modules
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plot

pd.options.display.float_format = '{:.2f}'.format

In [36]:
# import projectzero data
from etl.extract import ProjectZero
data = ProjectZero().get_data()

# view keys
data.keys()

# df_model instance
df = data['ext_nyc']

In [37]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,Order,Property Id,Property Name,Parent Property Id,Parent Property Name,BBL - 10 digits,"NYC Borough, Block and Lot (BBL) self-reported",NYC Building Identification Number (BIN),Address 1 (self-reported),...,Annual Maximum Demand (kW),Annual Maximum Demand (MM/YYYY),Total GHG Emissions (Metric Tons CO2e),Direct GHG Emissions (Metric Tons CO2e),Indirect GHG Emissions (Metric Tons CO2e),Water Use (All Water Sources) (kgal),Water Use Intensity (All Water Sources) (gal/ft²),Water Required?,Generation Date,DOF Benchmarking Submission Status
0,0,1,4593574,The Argonaut Building,,,1010288000.0,1010287502,1024898,224 West 57th St,...,,,732.4,76.3,656.1,3635.5,21.46,Not found,2018-02-14,Not found
1,1,3,2967701,Cathedral Preparatory Seminary,,,4018720000.0,4-01872-0007,4046340,56-25 92nd Street,...,,,164.5,109.9,54.6,102.9,1.09,Not found,2018-02-14,Not found
2,2,4,4898531,The Nomad Hotel,,,1008290000.0,1-00829-0050,1080710,1170 Broadway,...,,,1150.2,438.0,712.3,10762.6,86.1,Not found,2018-02-14,Not found


In [46]:
df.describe()

Unnamed: 0,Self-Reported Gross Floor Area (ft²),Year Built,Occupancy,Number of Buildings,Electricity Use - Grid Purchase (kWh)
count,24143.0,24143.0,24143.0,24143.0,22382.0
mean,122451.01,1947.21,98.43,1.21,1227885.34
std,234898.49,32.82,8.54,3.09,3813108.46
min,0.0,1051.0,0.0,0.0,-1859.1
25%,43260.0,1925.0,100.0,1.0,215600.0
50%,68316.0,1937.0,100.0,1.0,397206.8
75%,120000.0,1967.0,100.0,1.0,916557.8
max,15077660.0,2021.0,100.0,161.0,168312811.4


## 1. `get_training_data`

This method should return a dataframe with the following features:\
`building`, `area`, `asset`, `electricity_demmand`

In [39]:
# Select the features that can be obtained from the design model
features = [
    'Primary Property Type - Self Selected',
    'Self-Reported Gross Floor Area (ft²)',
    'Year Built',
    'Occupancy',
    'Number of Buildings',
    'Electricity Use - Grid Purchase (kWh)'
]

df = df[features]

# drop duplicate rows; reassign instead of using inplace=True on a slice of the raw data
n_before = len(df)
df = df.drop_duplicates()
n_dropped = n_before - len(df)

# rename NYC columns to match the Seattle dataset
df_renamed = df.rename(columns={
    'Primary Property Type - Self Selected': 'building_typology',
    'Self-Reported Gross Floor Area (ft²)': 'building_gfa',
    'Year Built': 'year_built',
    'Occupancy': 'occupancy',
    'Number of Buildings': 'num_buildings',
    'Electricity Use - Grid Purchase (kWh)': 'electricity_demmand'
})

# number of rows left after deduplication
len(df_renamed)

24143
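
The steps in the cell above could be wrapped into a single reusable function. The following is only a minimal sketch: the function signature, the `NYC_TO_MODEL_COLUMNS` constant, and the example frame are assumptions made for illustration, and the values in the example frame are made up. How the renamed columns map onto `building`, `area`, and `asset` is not specified here, so the sketch keeps the names used in the rename step.

```python
import pandas as pd

# Assumed constant: NYC benchmarking column names mapped to the names used in the rename step.
NYC_TO_MODEL_COLUMNS = {
    'Primary Property Type - Self Selected': 'building_typology',
    'Self-Reported Gross Floor Area (ft²)': 'building_gfa',
    'Year Built': 'year_built',
    'Occupancy': 'occupancy',
    'Number of Buildings': 'num_buildings',
    'Electricity Use - Grid Purchase (kWh)': 'electricity_demmand',
}


def get_training_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Select, deduplicate, and rename the modelling columns (hypothetical signature)."""
    selected = raw[list(NYC_TO_MODEL_COLUMNS)]
    # drop_duplicates returns a new frame, so the raw extract is left untouched
    deduplicated = selected.drop_duplicates()
    return deduplicated.rename(columns=NYC_TO_MODEL_COLUMNS).reset_index(drop=True)


# Made-up rows, only to show the call; these are not real benchmarking records.
example = pd.DataFrame({
    'Primary Property Type - Self Selected': ['Hotel', 'Hotel', 'Office'],
    'Self-Reported Gross Floor Area (ft²)': [1000.0, 1000.0, 2500.0],
    'Year Built': [1930, 1930, 1985],
    'Occupancy': [100, 100, 90],
    'Number of Buildings': [1, 1, 2],
    'Electricity Use - Grid Purchase (kWh)': [5000.0, 5000.0, 12000.0],
    'Property Name': ['A', 'B', 'C'],
})
training = get_training_data(example)
```

Because duplicates are detected on the selected columns only, the first two example rows collapse into one even though their `Property Name` differs.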

## 2. `get_water_demmand`

This method should return a dataframe with the following features:\
`building`, `area`, `asset`, `potable_water_demmand`
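
One possible shape for this method is sketched below, following the same select, deduplicate, and rename pattern as the electricity data. Several things here are assumptions rather than settled design: the function signature, the `WATER_COLUMNS` constant, the choice of `Water Use (All Water Sources) (kgal)` as the source of `potable_water_demmand` (that column covers all water sources, which may not equal potable use), and the decision to drop rows with no reported water figure. The example rows are made up.

```python
import pandas as pd

# Assumed mapping; the water column name is taken from the raw NYC extract shown earlier.
WATER_COLUMNS = {
    'Primary Property Type - Self Selected': 'building_typology',
    'Self-Reported Gross Floor Area (ft²)': 'building_gfa',
    'Water Use (All Water Sources) (kgal)': 'potable_water_demmand',
}


def get_water_demmand(raw: pd.DataFrame) -> pd.DataFrame:
    """Select, deduplicate, and rename the water-demand columns (hypothetical signature)."""
    selected = raw[list(WATER_COLUMNS)].drop_duplicates()
    renamed = selected.rename(columns=WATER_COLUMNS)
    # Assumption: rows without a reported water figure cannot serve as training targets.
    return renamed.dropna(subset=['potable_water_demmand']).reset_index(drop=True)


# Made-up rows, only to show the call; these are not real benchmarking records.
example_water = pd.DataFrame({
    'Primary Property Type - Self Selected': ['Hotel', 'Office', 'Office'],
    'Self-Reported Gross Floor Area (ft²)': [1000.0, 2500.0, 3000.0],
    'Water Use (All Water Sources) (kgal)': [120.5, None, 80.0],
})
water = get_water_demmand(example_water)
```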