# Seattle Building Data Pre-processing
## Team 5 - Connor, John, Libby, & Natalie

This notebook contains the code used to preprocess and clean our selected dataset. The final step writes the cleaned, processed dataset to a new .csv file.

#### Setup

In [16]:
#Makes paths work if you just clone or pull the repo
import os
os.chdir('../')
os.getcwd()

'/Users/mainuser/Desktop/DATA422_Fall2024_Team5'
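
One thing to watch with `os.chdir('../')` is that re-running the cell moves up another directory each time. A minimal sketch of an alternative that can be re-run safely, assuming the repo root is the first parent directory that contains the `Data` folder used in the paths later in this notebook (the helper name `find_repo_root` is ours, not part of any library):

```python
import os
from pathlib import Path

def find_repo_root(start, marker="Data"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")

# In the notebook this could replace os.chdir('../'):
# os.chdir(find_repo_root(os.getcwd()))
```

Because the search starts from the current directory, running the cell a second time finds the same root and leaves the working directory unchanged.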

In [5]:
#Install Required Libraries (Only needs to run once)
%pip install -q pandas numpy 

Note: you may need to restart the kernel to use updated packages.


In [7]:
#import libraries
import pandas as pd

#### Import Data

In [29]:
#Paths
PATH_TO_DATASET = "Data/Seattle Building Energy Benchmarking/2022_Building_Energy_Benchmarking_20240906.csv"

In [30]:
#Import Dataset
building_DF = pd.read_csv(PATH_TO_DATASET)

building_DF.head()

Unnamed: 0,OSEBuildingID,DataYear,BuildingName,BuildingType,TaxParcelIdentificationNumber,Address,City,State,ZipCode,Latitude,...,ThirdLargestPropertyUseTypeGFA,Electricity(kWh),SteamUse(kBtu),NaturalGas(therms),ComplianceStatus,ComplianceIssue,Electricity(kBtu),NaturalGas(kBtu),TotalGHGEmissions,GHGEmissionsIntensity
0,1,2022,MAYFLOWER PARK HOTEL,NonResidential,659000030,405 OLIVE WAY,SEATTLE,WA,98101,47.6122,...,,1107295.0,2192383.0,13629.0,Compliant,No Issue,3778091.0,1362900.0,264.5,2.99
1,2,2022,PARAMOUNT HOTEL,NonResidential,659000220,724 PINE ST,SEATTLE,WA,98101,47.61307,...,,698673.0,,27516.0,Compliant,No Issue,2383872.0,2751630.0,155.3,1.75
2,3,2022,WESTIN HOTEL (Parent Building),NonResidential,659000475,1900 5TH AVE,SEATTLE,WA,98101,47.61367,...,0.0,10599740.0,18793416.0,57000.0,Compliant,No Issue,36166313.0,5700000.0,1963.7,2.59
3,5,2022,HOTEL MAX,NonResidential,659000640,620 STEWART ST,SEATTLE,WA,98101,47.61412,...,,783804.0,1549427.0,15762.0,Compliant,No Issue,2674341.0,1576250.0,219.5,3.58
4,8,2022,WARWICK SEATTLE HOTEL,NonResidential,659000970,401 LENORA ST,SEATTLE,WA,98121,47.61375,...,0.0,1374200.0,,73621.0,Compliant,No Issue,4688770.0,7362130.0,409.0,3.6


#### Data Cleaning

In [33]:
# Column wise Null counts
column_nulls = building_DF.isnull().sum()
sorted_column_nulls = column_nulls[column_nulls > 0].sort_values(ascending=False)
column_dtypes = building_DF.dtypes
nulls_and_dtypes = pd.DataFrame({
    'Null Count': sorted_column_nulls,
    'Data Type': column_dtypes[sorted_column_nulls.index]
})

print(nulls_and_dtypes)

                                 Null Count Data Type
SteamUse(kBtu)                         3517   float64
ThirdLargestPropertyUseType            2918    object
ThirdLargestPropertyUseTypeGFA         2897   float64
SecondLargestPropertyUseType           1709    object
SecondLargestPropertyUseTypeGFA        1697   float64
NaturalGas(therms)                     1669   float64
NaturalGas(kBtu)                       1669   float64
ENERGYSTARScore                        1174   float64
SiteEUIWN(kBtu/sf)                      519   float64
SourceEUIWN(kBtu/sf)                    519   float64
SiteEnergyUseWN(kBtu)                   518   float64
SourceEUI(kBtu/sf)                      458   float64
SiteEUI(kBtu/sf)                        458   float64
SiteEnergyUse(kBtu)                     457   float64
EPAPropertyType                         234    object
GHGEmissionsIntensity                   209   float64
Electricity(kWh)                        208   float64
Electricity(kBtu)           
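
The percentages in the notes that follow were read from the Data Wrangler view. As a rough sketch, the same figures could also be computed in pandas so they stay next to the null counts. The function name `null_summary` and the toy values are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

def null_summary(df):
    """Return null count, percent missing, and dtype for columns with at least one null."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "Null Count": counts,
        "Percent Missing": (counts / len(df) * 100).round(1),
        "Data Type": df.dtypes.astype(str),
    })
    has_nulls = summary[summary["Null Count"] > 0]
    return has_nulls.sort_values("Null Count", ascending=False)

# Toy frame with made-up values, only to show the shape of the output
toy_DF = pd.DataFrame({
    "SteamUse(kBtu)": [np.nan, np.nan, np.nan, 0.0],
    "ENERGYSTARScore": [75.0, np.nan, 60.0, 88.0],
    "OSEBuildingID": [1, 2, 3, 5],
})
print(null_summary(toy_DF))
```

In the notebook, the call would be `null_summary(building_DF)`.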

Column-by-column examination of values, based on the Data Wrangler view.

SteamUse - 96% missing - There are 19 '0.0' entries, so it's possible that the null values are zeros that were entered improperly. Another possibility is that the null rows are buildings that don't have steam, and the 0.0 entries are buildings with steam that didn't use any.
Options: Drop column, or fill nulls with 0.
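
To help choose between these readings, one could count the explicit zeros, nulls, and positive values side by side before filling anything. This is only a sketch: `zero_vs_null` is a hypothetical helper and the toy rows are made up.

```python
import numpy as np
import pandas as pd

def zero_vs_null(df, column):
    """Count nulls, explicit zeros, and positive values in one numeric column."""
    col = df[column]
    return {
        "nulls": int(col.isnull().sum()),
        "zeros": int((col == 0).sum()),
        "positive": int((col > 0).sum()),
    }

# Toy frame with made-up rows, only to illustrate the call
toy_DF = pd.DataFrame({"SteamUse(kBtu)": [np.nan, np.nan, 0.0, 2192383.0]})
print(zero_vs_null(toy_DF, "SteamUse(kBtu)"))

# The "fill nulls with 0" option: treat a missing reading as no steam use
steam_filled = toy_DF["SteamUse(kBtu)"].fillna(0.0)
```

Note that filling with 0 erases the distinction between "no steam" and "steam, but none used", so it only makes sense if that distinction is not needed later.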

ThirdLargestPropertyUseType - 80% missing - This is a string column; null values likely mean the building has no third-largest property use.
Options: Drop column since it's a string, do nothing, or change nulls to 'No Third Property'

ThirdLargestPropertyUseTypeGFA - 79% missing - Has zero values. I'm assuming the missing values come from buildings that have no third-largest property use type.
Options: Drop column, replace nulls with 0.0, replace nulls with -1, replace with mean/median
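
The null counts for the two ThirdLargest columns differ (2918 for the type, 2897 for the GFA), so the assumption above may not hold for every row. A small sketch of how the overlap could be checked, using a hypothetical helper `null_agreement` and made-up toy rows:

```python
import numpy as np
import pandas as pd

def null_agreement(df, type_col, gfa_col):
    """Count rows by whether the use type and its GFA are null together."""
    type_null = df[type_col].isnull()
    gfa_null = df[gfa_col].isnull()
    return {
        "both null": int((type_null & gfa_null).sum()),
        "type null, GFA present": int((type_null & ~gfa_null).sum()),
        "type present, GFA null": int((~type_null & gfa_null).sum()),
        "both present": int((~type_null & ~gfa_null).sum()),
    }

# Toy frame with made-up rows, only to illustrate the call
toy_DF = pd.DataFrame({
    "ThirdLargestPropertyUseType": [None, None, "Parking", "Retail Store"],
    "ThirdLargestPropertyUseTypeGFA": [np.nan, 0.0, 5000.0, 12000.0],
})
print(null_agreement(
    toy_DF, "ThirdLargestPropertyUseType", "ThirdLargestPropertyUseTypeGFA"
))
```

If most "type null, GFA present" rows turn out to have a GFA of 0.0, that would support replacing the remaining GFA nulls with 0.0 rather than with a mean or median.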

SecondLargestPropertyUseType - 47% missing - Same as ThirdLargestPropertyUseType

SecondLargestPropertyUseTypeGFA - 46% missing - Same as ThirdLargestPropertyUseTypeGFA

NaturalGas(therms) - 46% missing - Has zero values; nulls are likely from buildings that don't use natural gas.
Options: Replace nulls with 0.0, replace nulls with -1.0, replace with mean/median

NaturalGas(kBtu) - 46% missing - Same as NaturalGas(therms)
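
Both gas columns report 1669 nulls, and in the rows shown by `head()` the kBtu value is roughly 100 times the therms value, which matches 1 therm = 100 kBtu. If that holds for the whole dataset, the two columns carry the same information and should be handled identically. A sketch of that check, where `gas_columns_redundant` and the 1% tolerance are assumptions:

```python
import numpy as np
import pandas as pd

KBTU_PER_THERM = 100.0  # 1 therm = 100,000 Btu

def gas_columns_redundant(df, rtol=0.01):
    """True when both gas columns are null in the same rows and agree within rtol elsewhere."""
    therms = df["NaturalGas(therms)"]
    kbtu = df["NaturalGas(kBtu)"]
    same_nulls = (therms.isnull() == kbtu.isnull()).all()
    both = therms.notnull() & kbtu.notnull()
    values_agree = np.isclose(therms[both] * KBTU_PER_THERM, kbtu[both], rtol=rtol).all()
    return bool(same_nulls and values_agree)

# First three rows taken from the head() output above; the last row is made up
toy_DF = pd.DataFrame({
    "NaturalGas(therms)": [13629.0, 27516.0, 57000.0, np.nan],
    "NaturalGas(kBtu)": [1362900.0, 2751630.0, 5700000.0, np.nan],
})
print(gas_columns_redundant(toy_DF))
```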

ENERGYSTARScore - 32% missing - Nulls could be buildings that don't have an ENERGY STAR score calculated yet.
Options: Replace with 0.0, replace with -1, replace with mean/median
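
ENERGY STAR scores run from 1 to 100, so 0.0 or -1 would act as out-of-range sentinels and would pull down any average taken over the column. If the mean/median option is chosen instead, one way to keep track of which rows were imputed is to add a flag column first. This is a sketch only: the helper `flag_and_fill_median` and the `_missing` column suffix are our assumptions, and the toy scores are made up.

```python
import numpy as np
import pandas as pd

def flag_and_fill_median(df, column):
    """Return a copy with a '<column>_missing' flag and nulls replaced by the column median."""
    out = df.copy()
    out[f"{column}_missing"] = out[column].isnull()
    out[column] = out[column].fillna(out[column].median())
    return out

# Toy frame with made-up scores, only to illustrate the call
toy_DF = pd.DataFrame({"ENERGYSTARScore": [75.0, np.nan, 60.0, 88.0]})
flagged_DF = flag_and_fill_median(toy_DF, "ENERGYSTARScore")
print(flagged_DF)
```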



In [None]:
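def handle_null_values(building_Dataframe):
    '''Take the building dataframe and return a new dataframe with the null values handled.'''
    # Placeholder: no null-handling option has been chosen yet, so return an unmodified copy
    return building_Dataframe.copy()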
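
The function above does not yet apply any of the options listed in the column notes. As a sketch of what one combination could look like, the version below fills the energy and GFA columns with 0.0 and the use-type columns with a label, and leaves ENERGYSTARScore untouched. These choices are assumptions for illustration, not decisions made in this notebook. The name `handle_null_values_sketch` and the 'No Second Property' label are hypothetical.

```python
import numpy as np
import pandas as pd

# Assumed choices for illustration; each maps to one option from the column notes
ZERO_FILL_COLUMNS = [
    "SteamUse(kBtu)",
    "NaturalGas(therms)",
    "NaturalGas(kBtu)",
    "SecondLargestPropertyUseTypeGFA",
    "ThirdLargestPropertyUseTypeGFA",
]
LABEL_FILL_COLUMNS = {
    "SecondLargestPropertyUseType": "No Second Property",  # hypothetical label
    "ThirdLargestPropertyUseType": "No Third Property",
}

def handle_null_values_sketch(building_Dataframe):
    """Return a copy with one possible null-handling choice applied per column."""
    cleaned_DF = building_Dataframe.copy()
    for column in ZERO_FILL_COLUMNS:
        if column in cleaned_DF.columns:
            cleaned_DF[column] = cleaned_DF[column].fillna(0.0)
    for column, label in LABEL_FILL_COLUMNS.items():
        if column in cleaned_DF.columns:
            cleaned_DF[column] = cleaned_DF[column].fillna(label)
    return cleaned_DF

# Toy frame with made-up rows, only to illustrate the call
toy_DF = pd.DataFrame({
    "SteamUse(kBtu)": [np.nan, 2192383.0],
    "ThirdLargestPropertyUseType": [None, "Parking"],
    "ThirdLargestPropertyUseTypeGFA": [np.nan, 5000.0],
    "ENERGYSTARScore": [np.nan, 75.0],
})
cleaned_toy_DF = handle_null_values_sketch(toy_DF)
print(cleaned_toy_DF)
```

Working on a copy keeps the original dataframe available for comparison, and the `if column in cleaned_DF.columns` guard lets the same function run if a column is dropped earlier.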

#### Export Processed Dataset

In [None]:
print(building_DF)
building_DF.to_csv('Data/PP_Building_Energy_Benchmarking.csv', index=False)