
# Introduction to Machine Learning

## Scope

Questions:

1. Given a building's characteristics, can you predict its energy use intensity (EUI)?

   For example, a two-bedroom, 1,000 ft² apartment will typically use x amount of energy.

   This could be useful for proactively estimating energy consumption in certain areas.

2. Given a building's characteristics and energy use, what should its ENERGY STAR score be?

   This could help assign ENERGY STAR scores to buildings that do not have one.
   It could also help architects predict a structure's score before investing in the rating process.

Targets:

ENERGY STAR Score                                            17.91
Site EUI (kBtu/ft²)                                           1.39
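
Site EUI, the second target above, is a building's annual site energy use divided by its gross floor area. The arithmetic can be sketched as follows; the function name `site_eui` and the input numbers are hypothetical and chosen only for illustration, not taken from the data set.

```python
def site_eui(annual_site_energy_kbtu: float, gross_floor_area_ft2: float) -> float:
    """Return site energy use intensity in kBtu/ft²."""
    if gross_floor_area_ft2 <= 0:
        raise ValueError("gross floor area must be positive")
    return annual_site_energy_kbtu / gross_floor_area_ft2


# Hypothetical inputs: 80,000 kBtu used in a year by a 1,000 ft² unit
example_eui = site_eui(annual_site_energy_kbtu=80_000, gross_floor_area_ft2=1_000)
print(example_eui)  # 80.0
```

Because EUI is normalized by floor area, it lets buildings of very different sizes be compared on the same scale.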

## Prepare

```
NYC Mayor's Office of Sustainability, Green Buildings & Energy Efficiency. (2017). [Data set]. Retrieved from http://www.nyc.gov/html/gbee/html/plan/ll84_scores.shtml
```

In [187]:
import os
import pprint
import numpy as np
import pandas as pd

In [188]:
# Load the data
data_path = os.path.join(os.path.abspath('..'), 'data', 'nyc_benchmarking_disclosure_data_reported_in_2017.xlsx')
df = pd.read_excel(data_path, sheet_name='Information and Metrics')

In [189]:
# View the structure of the data
df.head()

Unnamed: 0,Order,Property Id,Property Name,Parent Property Id,Parent Property Name,BBL - 10 digits,"NYC Borough, Block and Lot (BBL) self-reported",NYC Building Identification Number (BIN),Address 1 (self-reported),Address 2,...,Total GHG Emissions (Metric Tons CO2e),Direct GHG Emissions (Metric Tons CO2e),Indirect GHG Emissions (Metric Tons CO2e),Property GFA - Self-Reported (ft²),Water Use (All Water Sources) (kgal),Water Intensity (All Water Sources) (gal/ft²),Source EUI (kBtu/ft²),Release Date,Water Required?,DOF Benchmarking Submission Status
0,1,13286,201/205,13286,201/205,1013160001,1013160001,1037549,201/205 East 42nd st.,Not Available,...,6962.2,0.0,6962.2,762051,Not Available,Not Available,619.40,2017-05-01 17:32:03,No,In Compliance
1,2,28400,NYP Columbia (West Campus),28400,NYP Columbia (West Campus),1021380040,1-02138-0040,1084198; 1084387;1084385; 1084386; 1084388; 10...,622 168th Street,Not Available,...,55870.4,51016.4,4854.1,3889181,Not Available,Not Available,404.30,2017-04-27 11:23:27,No,In Compliance
2,3,4778226,MSCHoNY North,28400,NYP Columbia (West Campus),1021380030,1-02138-0030,1063380,3975 Broadway,Not Available,...,0.0,0.0,0.0,231342,Not Available,Not Available,Not Available,2017-04-27 11:23:27,No,In Compliance
3,4,4778267,Herbert Irving Pavilion & Millstein Hospital,28400,NYP Columbia (West Campus),1021390001,1-02139-0001,1087281; 1076746,161 Fort Washington Ave,177 Fort Washington Ave,...,0.0,0.0,0.0,1305748,Not Available,Not Available,Not Available,2017-04-27 11:23:27,No,In Compliance
4,5,4778288,Neuro Institute,28400,NYP Columbia (West Campus),1021390085,1-02139-0085,1063403,710 West 168th Street,Not Available,...,0.0,0.0,0.0,179694,Not Available,Not Available,Not Available,2017-04-27 11:23:27,No,In Compliance


### Cleaning

#### Missing values

In [190]:
# Missing values are recorded as the string "Not Available" in this data set
# Replace them with NaN
df = df.replace('Not Available', np.nan)

# Display the percentage of missing values for each column
pd.options.display.float_format = '{:.2f}'.format
percent_nan = df.isna().sum()/len(df) * 100
print(percent_nan)

Order                                                         0.00
Property Id                                                   0.00
Property Name                                                 0.00
Parent Property Id                                            0.00
Parent Property Name                                          0.00
BBL - 10 digits                                               0.00
NYC Borough, Block and Lot (BBL) self-reported                0.09
NYC Building Identification Number (BIN)                      1.38
Address 1 (self-reported)                                     0.00
Address 2                                                    98.24
Postal Code                                                   0.00
Street Number                                                 1.06
Street Name                                                   1.04
Borough                                                       1.00
DOF Gross Floor Area                                          
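
One possible alternative to replacing the placeholder after loading is to declare it at parse time with the `na_values` parameter, which both `pd.read_csv` and `pd.read_excel` accept. The sketch below uses `read_csv` on a small in-memory sample so that it runs without the spreadsheet; the three sample rows are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real spreadsheet
sample_csv = io.StringIO(
    "Property Id,Address 2,Site EUI (kBtu/ft²)\n"
    "1,Not Available,619.4\n"
    "2,Not Available,Not Available\n"
    "3,Suite 5,404.3\n"
)

# na_values lists extra strings the parser should treat as missing
sample = pd.read_csv(sample_csv, na_values=["Not Available"])

# Percentage of missing values per column
sample_percent_nan = sample.isna().mean() * 100
print(sample_percent_nan)
print(sample.dtypes)
```

Declaring the placeholder up front also lets the parser infer a numeric dtype for columns such as the EUI column, since it never sees them as text.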

In [191]:
# Remove all columns where at least 50% of the values are missing
df = df.loc[:, percent_nan < 50]
df.head()

Unnamed: 0,Order,Property Id,Property Name,Parent Property Id,Parent Property Name,BBL - 10 digits,"NYC Borough, Block and Lot (BBL) self-reported",NYC Building Identification Number (BIN),Address 1 (self-reported),Postal Code,...,Total GHG Emissions (Metric Tons CO2e),Direct GHG Emissions (Metric Tons CO2e),Indirect GHG Emissions (Metric Tons CO2e),Property GFA - Self-Reported (ft²),Water Use (All Water Sources) (kgal),Water Intensity (All Water Sources) (gal/ft²),Source EUI (kBtu/ft²),Release Date,Water Required?,DOF Benchmarking Submission Status
0,1,13286,201/205,13286,201/205,1013160001,1013160001,1037549,201/205 East 42nd st.,10017,...,6962.2,0.0,6962.2,762051,,,619.4,2017-05-01 17:32:03,No,In Compliance
1,2,28400,NYP Columbia (West Campus),28400,NYP Columbia (West Campus),1021380040,1-02138-0040,1084198; 1084387;1084385; 1084386; 1084388; 10...,622 168th Street,10032,...,55870.4,51016.4,4854.1,3889181,,,404.3,2017-04-27 11:23:27,No,In Compliance
2,3,4778226,MSCHoNY North,28400,NYP Columbia (West Campus),1021380030,1-02138-0030,1063380,3975 Broadway,10032,...,0.0,0.0,0.0,231342,,,,2017-04-27 11:23:27,No,In Compliance
3,4,4778267,Herbert Irving Pavilion & Millstein Hospital,28400,NYP Columbia (West Campus),1021390001,1-02139-0001,1087281; 1076746,161 Fort Washington Ave,10032,...,0.0,0.0,0.0,1305748,,,,2017-04-27 11:23:27,No,In Compliance
4,5,4778288,Neuro Institute,28400,NYP Columbia (West Campus),1021390085,1-02139-0085,1063403,710 West 168th Street,10032,...,0.0,0.0,0.0,179694,,,,2017-04-27 11:23:27,No,In Compliance
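
Depending on the pandas version, a column that mixed numbers with the "Not Available" placeholder may still have `object` dtype after the replacement, which would get in the way of numeric modeling later. A minimal sketch of an explicit conversion with `pd.to_numeric` follows; the frame `toy_numeric` and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame mimicking a numeric column polluted by a text placeholder
toy_numeric = pd.DataFrame({
    "id": [1, 2, 3],
    "eui_raw": [619.4, "Not Available", 404.3],
})

# errors="coerce" turns any value that cannot be parsed into NaN
toy_numeric["eui"] = pd.to_numeric(toy_numeric["eui_raw"], errors="coerce")

print(toy_numeric["eui"].dtype)
print(toy_numeric["eui"].isna().sum())
```

Coercion silently discards anything unparseable, so it is worth comparing the missing-value counts before and after to confirm that only the placeholder was affected.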


#### Column names

In [192]:
# Look at the existing column names
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(list(df.columns))

[   'Order',
    'Property Id',
    'Property Name',
    'Parent Property Id',
    'Parent Property Name',
    'BBL - 10 digits',
    'NYC Borough, Block and Lot (BBL) self-reported',
    'NYC Building Identification Number (BIN)',
    'Address 1 (self-reported)',
    'Postal Code',
    'Street Number',
    'Street Name',
    'Borough',
    'DOF Gross Floor Area',
    'Primary Property Type - Self Selected',
    'List of All Property Use Types at Property',
    'Largest Property Use Type',
    'Largest Property Use Type - Gross Floor Area (ft²)',
    'Year Built',
    'Number of Buildings - Self-reported',
    'Occupancy',
    'Metered Areas (Energy)',
    'Metered Areas  (Water)',
    'ENERGY STAR Score',
    'Site EUI (kBtu/ft²)',
    'Weather Normalized Site EUI (kBtu/ft²)',
    'Weather Normalized Site Electricity Intensity (kWh/ft²)',
    'Weather Normalized Site Natural Gas Intensity (therms/ft²)',
    'Weather Normalized Source EUI (kBtu/ft²)',
    'Natural Gas Use (kBtu)',
    

In [196]:
# Rename columns so that they are easier to reference
df = df.rename(columns={"Order": "order", 
                              "Property Id": "id", 
                              "Property Name": "name", 
                              "Parent Property Id": "prnt_id", 
                              "Parent Property Name": "prnt_name", 
                              "BBL - 10 digits": "brgh_blck_lt", 
                              "NYC Borough, Block and Lot (BBL) self-reported": "brgh_blck_lt_self", 
                              "NYC Building Identification Number (BIN)": "bldng_id_no", 
                              "Address 1 (self-reported)": "addr_1_self", 
                              "Postal Code": "zip_code", 
                              "Street Number": "st_no", 
                              "Street Name": "st_name", 
                              "Borough": "brgh", 
                              "DOF Gross Floor Area": "flr_area", 
                              "Primary Property Type - Self Selected": "prim_type", 
                              "List of All Property Use Types at Property": "all_type", 
                              "Largest Property Use Type": "lgst_type", 
                              "Largest Property Use Type - Gross Floor Area (ft²)": "lgst_type_area", 
                              "Year Built": "year_built", 
                              "Number of Buildings - Self-reported": "no_bldngs", 
                              "Occupancy": "occupancy", 
                              "Metered Areas (Energy)": "mtrd_area_energy", 
                              "Metered Areas  (Water)": "mtrd_area_water", 
                              "ENERGY STAR Score": "energy_star", 
                              "Site EUI (kBtu/ft²)": "eui", 
                              "Weather Normalized Site EUI (kBtu/ft²)": "wthr_norm_site_eui", 
                              "Weather Normalized Site Electricity Intensity (kWh/ft²)": "wthr_norm_site_elec_int", 
                              "Weather Normalized Site Natural Gas Intensity (therms/ft²)": "wthr_norm_site_gas_int", 
                              "Weather Normalized Source EUI (kBtu/ft²)": "wthr_norm_src_eui", 
                              "Natural Gas Use (kBtu)": "gas", 
                              "Weather Normalized Site Natural Gas Use (therms)": "wthr_norm_gas", 
                              "Electricity Use - Grid Purchase (kBtu)": "elec", 
                              "Weather Normalized Site Electricity (kWh)": "wthr_norm_elec", 
                              "Total GHG Emissions (Metric Tons CO2e)": "co2_tot", 
                              "Direct GHG Emissions (Metric Tons CO2e)": "co2_dir", 
                              "Indirect GHG Emissions (Metric Tons CO2e)": "co2_ind", 
                              "Property GFA - Self-Reported (ft²)": "gfa_self", 
                              "Water Use (All Water Sources) (kgal)": "water", 
                              "Water Intensity (All Water Sources) (gal/ft²)": "water_int", 
                              "Source EUI (kBtu/ft²)": "src_eui", 
                              "Release Date": "rel_date", 
                              "Water Required?": "water_req", 
                              "DOF Benchmarking Submission Status": "dof_bnchmrkng_sub_status"})
pp.pprint(list(df.columns))

[   'order',
    'id',
    'name',
    'prnt_id',
    'prnt_name',
    'brgh_blck_lt',
    'brgh_blck_lt_self',
    'bldng_id_no',
    'addr_1_self',
    'zip_code',
    'st_no',
    'st_name',
    'brgh',
    'flr_area',
    'prim_type',
    'all_type',
    'lgst_type',
    'lgst_type_area',
    'year_built',
    'no_bldngs',
    'occupancy',
    'mtrd_area_energy',
    'mtrd_area_water',
    'energy_star',
    'eui',
    'wthr_norm_site_eui',
    'wthr_norm_site_elec_int',
    'wthr_norm_site_gas_int',
    'wthr_norm_src_eui',
    'gas',
    'wthr_norm_gas',
    'elec',
    'wthr_norm_elec',
    'co2_tot',
    'co2_dir',
    'co2_ind',
    'gfa_self',
    'water',
    'water_int',
    'src_eui',
    'rel_date',
    'water_req',
    'dof_bnchmrkng_sub_status']
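
By default, `rename` ignores mapping keys that do not match any column, so a typo in a long mapping (or a column dropped earlier) goes unnoticed. A small sanity check along these lines could catch that; the miniature frame and the mapping below are hypothetical stand-ins for the real ones.

```python
import pandas as pd

# Hypothetical miniature frame and mapping, not the real data set
toy_frame = pd.DataFrame(
    {"Order": [1], "Property Id": [13286], "Postal Code": ["10017"]}
)
mapping = {"Order": "order", "Property Id": "id", "Borough": "brgh"}

# Mapping keys that match no column (rename would skip these silently)
unused_keys = sorted(set(mapping) - set(toy_frame.columns))
# Columns that would keep their original name
unmapped_columns = sorted(set(toy_frame.columns) - set(mapping))

renamed = toy_frame.rename(columns=mapping)

print(unused_keys)            # ['Borough']
print(unmapped_columns)       # ['Postal Code']
print(list(renamed.columns))  # ['order', 'id', 'Postal Code']
```

Passing `errors="raise"` to `rename` is a stricter option: it raises a `KeyError` when a mapping key is not found among the columns.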


## Analyze

### Pre-processing

### Modeling

### Hyper-parameter tuning

## Reflect