
# Introduction to Machine Learning

An interactive Machine Learning example written in Python, broken into the following sections:

1. [Scope](#scope)
   1. [Problem Definition](#scope-problem-definition)
   1. [Data](#scope-data)
1. [Prepare](#prepare)
   1. [Import](#prepare-import)
   1. [Cleaning](#prepare-cleaning)
1. [Analyze](#analyze)
   1. [Pre-processing](#analyze-pre-processing)
   1. [Modeling](#analyze-modeling)
   1. [Hyper-parameter Tuning](#analyze-hyper-parameter-tuning)
1. [Reflect](#reflect)

<a id='scope'></a>

## Scope

<a id='scope-problem-definition'></a>

### Problem Definition

We seek to answer the following questions by applying Machine Learning techniques to the NYC Mayor's Office of Sustainability dataset on green buildings and energy efficiency:

1. Given a building's characteristics, can we predict its energy usage intensity?
   - Could help urban planners proactively estimate the energy consumption of various living spaces.
1. Given a building's characteristics and energy use, can we predict its Energy Star rating?
   - Could help Energy Star assign ratings to buildings that do not yet have a score.
   - Could help architects predict the Energy Star scores of different designs.
1. Which sources of energy cause the most greenhouse gas emissions for a property?
   - Could help policy makers see which sources of energy produce the most greenhouse gas emissions.
1. What factors lead to energy loss between the source of generation and the site of the property?
   - Could help engineers identify which locations, property usages, sources of energy, etc. lead to the most energy loss and adjust the energy grid accordingly.

<a id='scope-data'></a>

### Data

#### Citation

```
NYC Mayor's Office of Sustainability, Green Buildings & Energy Efficiency. (2017). [Data set]. 
    Retrieved from http://www.nyc.gov/html/gbee/html/plan/ll84_scores.shtml
```

#### Dimensions

| Column                                                     | Functional Data Type    | Technical Data Type |
|------------------------------------------------------------|-------------------------|---------------------|
| Order                                                      | index                   | integer             |
| Property Id                                                | identifier              | text                |
| Property Name                                              | identifier              | text                |
| Parent Property Id                                         | identifier              | text                |
| Parent Property Name                                       | identifier              | text                |
| BBL - 10 digits                                            | identifier              | text                |
| NYC Borough, Block and Lot (BBL) self-reported             | identifier              | text                |
| NYC Building Identification Number (BIN)                   | identifier              | text                |
| Address 1 (self-reported)                                  | location                | text                |
| Address 2                                                  | location                | text                |
| Postal Code                                                | location                | text                |
| Street Number                                              | location                | text                |
| Street Name                                                | location                | text                |
| Borough                                                    | location                | text                |
| DOF Gross Floor Area                                       | building characteristic | numeric             |
| Primary Property Type - Self Selected                      | building characteristic | categorical         |
| List of All Property Use Types at Property                 | building characteristic | [categorical]       |
| Largest Property Use Type                                  | building characteristic | categorical         |
| Largest Property Use Type - Gross Floor Area (ft²)         | building characteristic | numeric             |
| 2nd Largest Property Use Type                              | building characteristic | categorical         |
| 2nd Largest Property Use - Gross Floor Area (ft²)          | building characteristic | numeric             |
| 3rd Largest Property Use Type                              | building characteristic | categorical         |
| 3rd Largest Property Use Type - Gross Floor Area (ft²)     | building characteristic | numeric             |
| Year Built                                                 | building characteristic | time                |
| Number of Buildings - Self-reported                        | building characteristic | integer             |
| Occupancy                                                  | building characteristic | percentage          |
| Metered Areas (Energy)                                     | building characteristic | categorical         |
| Metered Areas (Water)                                      | building characteristic | categorical         |
| ENERGY STAR Score                                          | score                   | numeric             |
| Site EUI (kBtu/ft²)                                        | energy usage            | numeric             |
| Weather Normalized Site EUI (kBtu/ft²)                     | energy usage            | numeric             |
| Weather Normalized Site Electricity Intensity (kWh/ft²)    | energy usage            | numeric             |
| Weather Normalized Site Natural Gas Intensity (therms/ft²) | energy usage            | numeric             |
| Weather Normalized Source EUI (kBtu/ft²)                   | energy usage            | numeric             |
| Fuel Oil #1 Use (kBtu)                                     | energy usage            | numeric             |
| Fuel Oil #2 Use (kBtu)                                     | energy usage            | numeric             |
| Fuel Oil #4 Use (kBtu)                                     | energy usage            | numeric             |
| Fuel Oil #5 & 6 Use (kBtu)                                 | energy usage            | numeric             |
| Diesel #2 Use (kBtu)                                       | energy usage            | numeric             |
| District Steam Use (kBtu)                                  | energy usage            | numeric             |
| Natural Gas Use (kBtu)                                     | energy usage            | numeric             |
| Weather Normalized Site Natural Gas Use (therms)           | energy usage            | numeric             |
| Electricity Use - Grid Purchase (kBtu)                     | energy usage            | numeric             |
| Weather Normalized Site Electricity (kWh)                  | energy usage            | numeric             |
| Total GHG Emissions (Metric Tons CO2e)                     | environmental footprint | numeric             |
| Direct GHG Emissions (Metric Tons CO2e)                    | environmental footprint | numeric             |
| Indirect GHG Emissions (Metric Tons CO2e)                  | environmental footprint | numeric             |
| Property GFA - Self-Reported (ft²)                         | building characteristic | numeric             |
| Water Use (All Water Sources) (kgal)                       | resource usage          | numeric             |
| Water Intensity (All Water Sources) (gal/ft²)              | resource usage          | numeric             |
| Source EUI (kBtu/ft²)                                      | energy usage            | numeric             |
| Release Date                                               | metadata                | time                |
| Water Required?                                            | building characteristic | boolean             |
| DOF Benchmarking Submission Status                         | metadata                | categorical         |
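
The technical data types in the table above can be mapped onto pandas dtypes. The following is a minimal sketch on a hypothetical two-row sample (the values are illustrative and not taken from the data set); which columns to cast, and how, is an assumption rather than a step the notebook performs.

```python
import pandas as pd

# Hypothetical two-row sample; values are illustrative, not from the data set.
sample = pd.DataFrame({
    "Property Id": [1001, 1002],
    "Borough": ["Manhattan", "Brooklyn"],
    "Largest Property Use Type": ["Office", "Multifamily Housing"],
    "Site EUI (kBtu/ft²)": ["85.3", "Not Available"],
    "Release Date": ["2017-05-01 17:32:03", "2017-04-27 11:23:27"],
})

# Identifiers and locations are labels, not quantities, so store them as text.
for col in ["Property Id", "Borough"]:
    sample[col] = sample[col].astype(str)

# Categorical columns take their values from a fixed set of labels.
sample["Largest Property Use Type"] = (
    sample["Largest Property Use Type"].astype("category")
)

# Numeric columns: unparseable entries such as "Not Available" become NaN.
sample["Site EUI (kBtu/ft²)"] = pd.to_numeric(
    sample["Site EUI (kBtu/ft²)"], errors="coerce"
)

# Time columns are parsed into datetime values.
sample["Release Date"] = pd.to_datetime(sample["Release Date"])

print(sample.dtypes)
```

Storing identifiers as text prevents them from being summed or averaged by accident in later numeric steps.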

<a id='prepare'></a>

## Prepare

<a id='prepare-import'></a>

### Import

In [226]:
import os
import pprint
import numpy as np
import pandas as pd

In [227]:
# Import the data
data_path = os.path.join(os.path.abspath('..'), 'data', 'nyc_benchmarking_disclosure_data_reported_in_2017.xlsx')
df = pd.read_excel(data_path, sheet_name='Information and Metrics')
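
As a hedged alternative to the cell above, the load can be wrapped in a small helper that fails early when the workbook is missing and converts the `Not Available` placeholder to NaN at read time. The name `load_sheet` is hypothetical, and reading `.xlsx` files assumes an Excel engine such as `openpyxl` is installed.

```python
from pathlib import Path

import pandas as pd


def load_sheet(path, sheet_name):
    """Read one worksheet, failing early with a clear message if the file is absent."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Workbook not found: {path}")
    # na_values lists extra strings that pandas should treat as missing (NaN).
    return pd.read_excel(path, sheet_name=sheet_name, na_values=["Not Available"])
```

It would be called as `df = load_sheet(data_path, 'Information and Metrics')`. Converting the placeholder during the read also lets pandas infer numeric dtypes for columns that would otherwise be loaded as text.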

In [232]:
# View the structure of the data
df.head()

| | Order | Property Id | Property Name | Parent Property Id | Parent Property Name | BBL - 10 digits | NYC Borough, Block and Lot (BBL) self-reported | NYC Building Identification Number (BIN) | Address 1 (self-reported) | Postal Code | ... | Total GHG Emissions (Metric Tons CO2e) | Direct GHG Emissions (Metric Tons CO2e) | Indirect GHG Emissions (Metric Tons CO2e) | Property GFA - Self-Reported (ft²) | Water Use (All Water Sources) (kgal) | Water Intensity (All Water Sources) (gal/ft²) | Source EUI (kBtu/ft²) | Release Date | Water Required? | DOF Benchmarking Submission Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 13286 | 201/205 | 13286 | 201/205 | 1013160001 | 1013160001 | 1037549 | 201/205 East 42nd st. | 10017 | ... | 6962.2 | 0.0 | 6962.2 | 762051 | | | 619.4 | 2017-05-01 17:32:03 | No | In Compliance |
| 1 | 2 | 28400 | NYP Columbia (West Campus) | 28400 | NYP Columbia (West Campus) | 1021380040 | 1-02138-0040 | 1084198; 1084387;1084385; 1084386; 1084388; 10... | 622 168th Street | 10032 | ... | 55870.4 | 51016.4 | 4854.1 | 3889181 | | | 404.3 | 2017-04-27 11:23:27 | No | In Compliance |
| 2 | 3 | 4778226 | MSCHoNY North | 28400 | NYP Columbia (West Campus) | 1021380030 | 1-02138-0030 | 1063380 | 3975 Broadway | 10032 | ... | 0.0 | 0.0 | 0.0 | 231342 | | | | 2017-04-27 11:23:27 | No | In Compliance |
| 3 | 4 | 4778267 | Herbert Irving Pavilion & Millstein Hospital | 28400 | NYP Columbia (West Campus) | 1021390001 | 1-02139-0001 | 1087281; 1076746 | 161 Fort Washington Ave | 10032 | ... | 0.0 | 0.0 | 0.0 | 1305748 | | | | 2017-04-27 11:23:27 | No | In Compliance |
| 4 | 5 | 4778288 | Neuro Institute | 28400 | NYP Columbia (West Campus) | 1021390085 | 1-02139-0085 | 1063403 | 710 West 168th Street | 10032 | ... | 0.0 | 0.0 | 0.0 | 179694 | | | | 2017-04-27 11:23:27 | No | In Compliance |


<a id='prepare-cleaning'></a>

### Cleaning

#### Missing Values

In [233]:
# Missing values are recorded as the string "Not Available" in this data set
# Convert them to NaN (np.nan; the np.NaN alias was removed in NumPy 2.0)
df = df.replace('Not Available', np.nan)

# Display the percentage of missing values for each column
pd.options.display.float_format = '{:.2f}'.format
percent_nan = df.isna().sum()/len(df) * 100
print(percent_nan)

Order                                                         0.00
Property Id                                                   0.00
Property Name                                                 0.00
Parent Property Id                                            0.00
Parent Property Name                                          0.00
BBL - 10 digits                                               0.00
NYC Borough, Block and Lot (BBL) self-reported                0.09
NYC Building Identification Number (BIN)                      1.38
Address 1 (self-reported)                                     0.00
Postal Code                                                   0.00
Street Number                                                 1.06
Street Name                                                   1.04
Borough                                                       1.00
DOF Gross Floor Area                                          1.00
Primary Property Type - Self Selected                         

In [235]:
# Remove all columns in which 50% or more of the values are missing
df = df.loc[:, percent_nan < 50]
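
The thresholding step can be checked on a small example. The sketch below uses a hypothetical toy frame (not the benchmarking data) and computes the percentage of missing values with `isna().mean()`, which is equivalent to dividing the per-column count of missing values by the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "b" is 25% missing, "c" is 75% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

# isna() gives a boolean frame; its column-wise mean is the fraction missing.
pct_missing = toy.isna().mean() * 100

# Keep only columns below the threshold; the boolean Series aligns on column labels.
threshold = 50
kept = toy.loc[:, pct_missing < threshold]

print(pct_missing)
print(list(kept.columns))
```

Because the mask uses a strict `<`, a column that is exactly 50% missing is dropped as well.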

#### Column Names

In [223]:
# Look at the existing column names
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(list(df.columns))

[   'Order',
    'Property Id',
    'Property Name',
    'Parent Property Id',
    'Parent Property Name',
    'BBL - 10 digits',
    'NYC Borough, Block and Lot (BBL) self-reported',
    'NYC Building Identification Number (BIN)',
    'Address 1 (self-reported)',
    'Postal Code',
    'Street Number',
    'Street Name',
    'Borough',
    'DOF Gross Floor Area',
    'Primary Property Type - Self Selected',
    'List of All Property Use Types at Property',
    'Largest Property Use Type',
    'Largest Property Use Type - Gross Floor Area (ft²)',
    'Year Built',
    'Number of Buildings - Self-reported',
    'Occupancy',
    'Metered Areas (Energy)',
    'Metered Areas  (Water)',
    'ENERGY STAR Score',
    'Site EUI (kBtu/ft²)',
    'Weather Normalized Site EUI (kBtu/ft²)',
    'Weather Normalized Site Electricity Intensity (kWh/ft²)',
    'Weather Normalized Site Natural Gas Intensity (therms/ft²)',
    'Weather Normalized Source EUI (kBtu/ft²)',
    'Natural Gas Use (kBtu)',
    

In [224]:
# Rename columns so that they are easier to reference
df = df.rename(index=str, columns={"Order": "order", 
                              "Property Id": "id", 
                              "Property Name": "name", 
                              "Parent Property Id": "prnt_id", 
                              "Parent Property Name": "prnt_name", 
                              "BBL - 10 digits": "brgh_blck_lt", 
                              "NYC Borough, Block and Lot (BBL) self-reported": "brgh_blck_lt_self", 
                              "NYC Building Identification Number (BIN)": "bldng_id_no", 
                              "Address 1 (self-reported)": "addr_1_self", 
                              "Postal Code": "zip_code", 
                              "Street Number": "st_no", 
                              "Street Name": "st_name", 
                              "Borough": "brgh", 
                              "DOF Gross Floor Area": "flr_area", 
                              "Primary Property Type - Self Selected": "prim_type", 
                              "List of All Property Use Types at Property": "all_type", 
                              "Largest Property Use Type": "lgst_type", 
                              "Largest Property Use Type - Gross Floor Area (ft²)": "lgst_type_area", 
                              "Year Built": "year_built", 
                              "Number of Buildings - Self-reported": "no_bldngs", 
                              "Occupancy": "occupancy", 
                              "Metered Areas (Energy)": "mtrd_area_energy", 
                              "Metered Areas  (Water)": "mtrd_area_water", 
                              "ENERGY STAR Score": "energy_star", 
                              "Site EUI (kBtu/ft²)": "eui", 
                              "Weather Normalized Site EUI (kBtu/ft²)": "wthr_norm_site_eui", 
                              "Weather Normalized Site Electricity Intensity (kWh/ft²)": "wthr_norm_site_elec_int", 
                              "Weather Normalized Site Natural Gas Intensity (therms/ft²)": "wthr_norm_site_gas_int", 
                              "Weather Normalized Source EUI (kBtu/ft²)": "wthr_norm_src_eui", 
                              "Natural Gas Use (kBtu)": "gas", 
                              "Weather Normalized Site Natural Gas Use (therms)": "wthr_norm_gas", 
                              "Electricity Use - Grid Purchase (kBtu)": "elec", 
                              "Weather Normalized Site Electricity (kWh)": "wthr_norm_elec", 
                              "Total GHG Emissions (Metric Tons CO2e)": "co2_tot", 
                              "Direct GHG Emissions (Metric Tons CO2e)": "co2_dir", 
                              "Indirect GHG Emissions (Metric Tons CO2e)": "co2_ind", 
                              "Property GFA - Self-Reported (ft²)": "gfa_self", 
                              "Water Use (All Water Sources) (kgal)": "water", 
                              "Water Intensity (All Water Sources) (gal/ft²)": "water_int", 
                              "Source EUI (kBtu/ft²)": "src_eui", 
                              "Release Date": "rel_date", 
                              "Water Required?": "water_req", 
                              "DOF Benchmarking Submission Status": "dof_bnchmrkng_sub_status"})
pp.pprint(list(df.columns))

[   'order',
    'id',
    'name',
    'prnt_id',
    'prnt_name',
    'brgh_blck_lt',
    'brgh_blck_lt_self',
    'bldng_id_no',
    'addr_1_self',
    'zip_code',
    'st_no',
    'st_name',
    'brgh',
    'flr_area',
    'prim_type',
    'all_type',
    'lgst_type',
    'lgst_type_area',
    'year_built',
    'no_bldngs',
    'occupancy',
    'mtrd_area_energy',
    'mtrd_area_water',
    'energy_star',
    'eui',
    'wthr_norm_site_eui',
    'wthr_norm_site_elec_int',
    'wthr_norm_site_gas_int',
    'wthr_norm_src_eui',
    'gas',
    'wthr_norm_gas',
    'elec',
    'wthr_norm_elec',
    'co2_tot',
    'co2_dir',
    'co2_ind',
    'gfa_self',
    'water',
    'water_int',
    'src_eui',
    'rel_date',
    'water_req',
    'dof_bnchmrkng_sub_status']
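
By default, `DataFrame.rename` silently ignores mapping keys that do not match an existing column, so a small mismatch such as the doubled space in `'Metered Areas  (Water)'` can go unnoticed. The sketch below, on a hypothetical two-column frame, shows one way to guard against this: normalize whitespace in the headers first, then pass `errors="raise"`. The normalization step is an assumption and is not part of the notebook.

```python
import re

import pandas as pd

# Hypothetical empty frame; the second header has a doubled space,
# like 'Metered Areas  (Water)' in the column listing above.
toy = pd.DataFrame(columns=["Property Id", "Metered Areas  (Water)"])

# Collapse runs of whitespace so mapping keys can be written with single spaces.
toy.columns = [re.sub(r"\s+", " ", name).strip() for name in toy.columns]

mapping = {"Property Id": "id", "Metered Areas (Water)": "mtrd_area_water"}

# errors="raise" makes rename fail on any key that is not an existing column.
toy = toy.rename(columns=mapping, errors="raise")
print(list(toy.columns))
```

With `errors="raise"`, a typo in a mapping key raises a `KeyError` instead of leaving the original column name in place.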


<a id='analyze'></a>

## Analyze

<a id='analyze-pre-processing'></a>

### Pre-processing

<a id='analyze-modeling'></a>

### Modeling

<a id='analyze-hyper-parameter-tuning'></a>

### Hyper-parameter Tuning

<a id='reflect'></a>

## Reflect