# Seattle Building Data Pre-processing
## Team 5 - Connor, John, Libby, & Natalie

This notebook contains the code used to clean and preprocess our selected dataset. The final step writes the cleaned, processed dataset to a new .csv file.

#### Setup

In [1]:
#Makes paths work if you just clone or pull the repo
import os
os.chdir('../')
os.getcwd()

'/Volumes/T7/DATA422_Fall2024_Team5'
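
Because `os.chdir('../')` is relative, re-running the setup cell moves the working directory up one more level each time. A minimal sketch of a re-runnable alternative, assuming the repo root can be recognized by its `Data` folder (the helper name `find_repo_root` and its `marker` argument are hypothetical, not part of this notebook):

```python
from pathlib import Path


def find_repo_root(start, marker="Data"):
    """Walk upward from `start` until a folder containing `marker` is found.

    Hypothetical helper: assumes the repo root is the nearest ancestor
    (or `start` itself) that holds a `Data` directory.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No ancestor of {start} contains '{marker}'")


# Usage in the notebook (safe to re-run, unlike a relative chdir):
# import os
# os.chdir(find_repo_root(Path.cwd()))
```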

In [2]:
#Install Required Libraries (Only needs to run once)
#scikit-learn is included because the processing step imports from sklearn
%pip install -q pandas numpy scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [3]:
#import libraries
import pandas as pd

#### Import Data

In [4]:
#Paths
PATH_TO_DATASET = "Data/Seattle Building Energy Benchmarking/2022_Building_Energy_Benchmarking_20240906.csv"

In [7]:
#Import Dataset
building_DF = pd.read_csv(PATH_TO_DATASET)

building_DF.head()
print(building_DF.columns)

Index(['OSEBuildingID', 'DataYear', 'BuildingName', 'BuildingType',
       'TaxParcelIdentificationNumber', 'Address', 'City', 'State', 'ZipCode',
       'Latitude', 'Longitude', 'Neighborhood', 'CouncilDistrictCode',
       'YearBuilt', 'NumberofFloors', 'NumberofBuildings', 'PropertyGFATotal',
       'PropertyGFABuilding(s)', 'PropertyGFAParking', 'ENERGYSTARScore',
       'SiteEUIWN(kBtu/sf)', 'SiteEUI(kBtu/sf)', 'SiteEnergyUse(kBtu)',
       'SiteEnergyUseWN(kBtu)', 'SourceEUIWN(kBtu/sf)', 'SourceEUI(kBtu/sf)',
       'EPAPropertyType', 'LargestPropertyUseType',
       'LargestPropertyUseTypeGFA', 'SecondLargestPropertyUseType',
       'SecondLargestPropertyUseTypeGFA', 'ThirdLargestPropertyUseType',
       'ThirdLargestPropertyUseTypeGFA', 'Electricity(kWh)', 'SteamUse(kBtu)',
       'NaturalGas(therms)', 'ComplianceStatus', 'ComplianceIssue',
       'Electricity(kBtu)', 'NaturalGas(kBtu)', 'TotalGHGEmissions',
       'GHGEmissionsIntensity'],
      dtype='object')


#### Data Cleaning

In [8]:
#Drop Columns That We Decided Not To Use
Dropped_Columns = ['TaxParcelIdentificationNumber', 'City', 'State', 'CouncilDistrictCode', 'PropertyGFABuilding(s)', 
                   'PropertyGFAParking', 'SiteEUIWN(kBtu/sf)', 'SiteEnergyUse(kBtu)', 'SiteEnergyUseWN(kBtu)', 
                   'SourceEUIWN(kBtu/sf)', 'LargestPropertyUseType', 'LargestPropertyUseTypeGFA', 
                   'SecondLargestPropertyUseType', 'SecondLargestPropertyUseTypeGFA', 'ThirdLargestPropertyUseType',
                   'ThirdLargestPropertyUseTypeGFA', 'Electricity(kWh)', 'NaturalGas(therms)', 'TotalGHGEmissions']
#Drop Listed Columns
df_after_drop = building_DF.drop(columns=Dropped_Columns)

In [9]:
# Column wise Null counts
column_nulls = df_after_drop.isnull().sum()
sorted_column_nulls = column_nulls[column_nulls > 0].sort_values(ascending=False)
column_dtypes = df_after_drop.dtypes
nulls_and_dtypes = pd.DataFrame({
    'Null Count': sorted_column_nulls,
    'Data Type': column_dtypes[sorted_column_nulls.index]
})

print(nulls_and_dtypes)

                       Null Count Data Type
SteamUse(kBtu)               3517   float64
NaturalGas(kBtu)             1669   float64
ENERGYSTARScore              1174   float64
SourceEUI(kBtu/sf)            458   float64
SiteEUI(kBtu/sf)              458   float64
EPAPropertyType               234    object
GHGEmissionsIntensity         209   float64
Electricity(kBtu)             208   float64
Neighborhood                    1    object
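
The notes that follow quote the share of missing values per column. A minimal sketch of how such percentages can be derived alongside the counts, shown on a small made-up frame rather than the real dataset (all values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_after_drop; the values are invented for illustration.
toy = pd.DataFrame({
    "SteamUse(kBtu)": [np.nan, np.nan, np.nan, 0.0],
    "NaturalGas(kBtu)": [1200.0, np.nan, 0.0, np.nan],
    "YearBuilt": [1990, 2005, 1978, 2011],
})

# isnull().mean() gives the fraction of nulls per column; scale to percent.
null_pct = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({
    "Null Count": toy.isnull().sum(),
    "Null %": null_pct,
    "Data Type": toy.dtypes.astype(str),
})

# Keep only columns that actually have nulls, most affected first.
summary = summary[summary["Null Count"] > 0].sort_values("Null Count", ascending=False)
print(summary)
```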


Column-by-column examination of values, based on the Data Wrangler view.

SteamUse(kBtu) - 96% missing - There are 19 '0.0' entries, so it's possible that the Null values are improperly entered zeros. Another possibility is that the buildings with Nulls don't have steam at all, while the 0.0 entries are buildings with steam that didn't use any.
Options: Drop column, or fill Nulls with 0.

NaturalGas(kBtu) - 46% missing - Has zero values; Nulls are likely from buildings that don't use natural gas.
Options: Replace Nulls with 0.0, replace Nulls with -1.0, or replace with mean/median.

ENERGYSTARScore - 32% missing - Nulls could be buildings that don't have an ENERGY STAR score calculated yet.
Options: Replace with 0.0, replace with -1, or replace with mean/median.

SourceEUI(kBtu/sf)

SiteEUI(kBtu/sf)

EPAPropertyType - Replace Nulls with 'Other' or the mode category?

GHGEmissionsIntensity - Drop rows with missing values?

Electricity(kBtu) - Fill with median/mode, or drop rows with missing values?

Neighborhood - Drop the single row with a missing value.
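
The fill options listed above behave differently downstream. A rough sketch on made-up scores, comparing a 0 fill, a -1 sentinel, and a median fill, plus a missing-value indicator column (the `ENERGYSTARScore_missing` column name is a hypothetical convention, not something this notebook defines):

```python
import numpy as np
import pandas as pd

# Invented ENERGYSTARScore values; NaN marks buildings without a score.
scores = pd.Series([80.0, np.nan, 60.0, np.nan, 100.0], name="ENERGYSTARScore")

filled = pd.DataFrame({
    "zero": scores.fillna(0.0),                # reads as a real, very poor score
    "sentinel": scores.fillna(-1.0),           # outside the 1-100 range, easy to spot
    "median": scores.fillna(scores.median()),  # keeps the center of the distribution
})

# A 0/1 flag preserves the fact that the value was missing, whichever fill is used.
filled["ENERGYSTARScore_missing"] = scores.isnull().astype(int)

print(filled)
print(filled[["zero", "sentinel", "median"]].mean())
```

In this toy case the 0 and -1 fills pull the column mean well below the observed scores, while the median fill leaves it at the median of the observed values. That difference matters if the column is later standardized.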



In [None]:
def handle_null_values(b_df):
    '''This function takes the building dataframe and returns a new dataframe with the null values handled'''
    fill_values = {
        'SteamUse(kBtu)': 0.0, #Buildings that don't have steam have zero steam use
        'ThirdLargestPropertyUseType': 'None', #Buildings that don't have a third largest property use type have none
        'ThirdLargestPropertyUseTypeGFA': -1, #marks that a building has no 3rd largest property
        'SecondLargestPropertyUseType': 'None', #Buildings that don't have a second largest property use type have none
        'SecondLargestPropertyUseTypeGFA': -1, #marks that a building has no 2nd largest property
        'NaturalGas(therms)': 0.0, #Buildings that don't have natural gas use zero natural gas
        'NaturalGas(kBtu)': 0.0, #Buildings that don't have natural gas use zero natural gas
    }
    #Only fill columns that are still present, since some of these are listed in Dropped_Columns
    present = {col: val for col, val in fill_values.items() if col in b_df.columns}
    return b_df.fillna(value=present) #fillna returns a new dataframe; the input is left unchanged

#### Processing Steps

In [None]:
#encoding
from sklearn.preprocessing import StandardScaler, OneHotEncoder

cat_cols = []
num_cols = []


#### Export Processed Dataset

In [None]:
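#Export the cleaned dataframe rather than the raw building_DF, so the .csv reflects the cleaning steps above
print(df_after_drop.head())
df_after_drop.to_csv('Data/PP_Building_Energy_Benchmarking.csv', index=False)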
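
A quick way to confirm an export is to read the file back and compare it with the dataframe in memory. A minimal sketch, using a temporary directory and a made-up frame so it does not touch the project's `Data` folder (the file name `toy_export.csv` is hypothetical):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Invented stand-in for the processed dataframe.
toy = pd.DataFrame({
    "OSEBuildingID": [1, 2, 3],
    "Neighborhood": ["DOWNTOWN", "BALLARD", "EAST"],
    "SiteEUI(kBtu/sf)": [55.5, 31.25, 78.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_export.csv"
    toy.to_csv(out_path, index=False)  # same call pattern as the export cell
    reloaded = pd.read_csv(out_path)

# Shape and column order should survive the round trip.
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```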