### Note: This notebook is intended for Kaggle Days meetup #4 Delhi.
#### This is just a cleaned version of an awesome kernel https://www.kaggle.com/rohanrao/ashrae-half-and-half


<h1 align="center">ASHRAE - Great Energy Predictor III</h1>

<pre>
In this competition, you’ll develop accurate predictions of metered building energy usage in the following areas: <b>chilled water, electric, natural gas, 
hot water, and steam meters.</b> The data comes from over 1,000 buildings over a three-year timeframe. With better estimates of these energy-saving 
investments, large scale investors and financial institutions will be more inclined to invest in this area to enable progress in building efficiencies.
</pre>

<h1 align="center">Data</h1>
<pre>
Assessing the value of energy efficiency improvements can be challenging as there's no way to truly know how much energy a building would have used without 
the improvements. The best we can do is to build counterfactual models. Once a building is overhauled the new (lower) energy consumption is compared 
against modeled values for the original building to calculate the savings from the retrofit. More accurate models could support better market incentives 
and enable lower cost financing.

This competition challenges you to build these counterfactual models across four energy types based on historic usage rates and observed weather. The 
dataset includes three years of hourly meter readings from over one thousand buildings at several different sites around the world.

<b>train.csv: Main data consisting of target variable</b>
<b>test.csv: Dataset for testing purpose</b>
<b>building_meta.csv: Metadata consisting of additional information about building</b>
<b>weather[train/test].csv: Weather data from a meteorological station as close as possible to the site</b>
<b>sample_submission.csv: A valid sample submission</b>
</pre>

In [1]:
# Importing required libraries to proceed
import gc
import os
import random
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_categorical_dtype

In [2]:
# Let's look where is our data in disk space
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/ashrae-energy-prediction/sample_submission.csv
/kaggle/input/ashrae-energy-prediction/building_metadata.csv
/kaggle/input/ashrae-energy-prediction/weather_test.csv
/kaggle/input/ashrae-energy-prediction/train.csv
/kaggle/input/ashrae-energy-prediction/test.csv
/kaggle/input/ashrae-energy-prediction/weather_train.csv


<h4>Dataset Compressor: Reducing size of our dataset to fit into our RAM</h4>


In [3]:
## Memory optimization

# Original code from https://www.kaggle.com/gemartin/load-data-reduce-memory-usage by @gemartin
# Modified to support timestamp type, categorical type
# Modified to add option to use float16

def reduce_mem_usage(df, use_float16=True):
    """
    Iterate through all the columns of a dataframe and modify the data type to reduce memory usage.        
    """
    
    start_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage of dataframe is {:.2f} MB".format(start_mem))
    
    for col in df.columns:
        if is_datetime(df[col]) or is_categorical_dtype(df[col]):
            continue
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if use_float16 and c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype("category")

    end_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage after optimization is: {:.2f} MB".format(end_mem))
    print("Decreased by {:.1f}%".format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [4]:
# Let's load all of the data through our reducer
# Dictionary which will contain all the dataset mapped with their name
data = {}

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print("Loading ", os.path.join(dirname, filename))
        data[filename] = reduce_mem_usage(pd.read_csv(os.path.join(dirname, filename)))

Loading  /kaggle/input/ashrae-energy-prediction/sample_submission.csv
Memory usage of dataframe is 636.26 MB
Memory usage after optimization is: 198.83 MB
Decreased by 68.7%
Loading  /kaggle/input/ashrae-energy-prediction/building_metadata.csv
Memory usage of dataframe is 0.07 MB
Memory usage after optimization is: 0.02 MB
Decreased by 73.8%
Loading  /kaggle/input/ashrae-energy-prediction/weather_test.csv
Memory usage of dataframe is 19.04 MB
Memory usage after optimization is: 5.25 MB
Decreased by 72.4%
Loading  /kaggle/input/ashrae-energy-prediction/train.csv
Memory usage of dataframe is 616.95 MB
Memory usage after optimization is: 173.90 MB
Decreased by 71.8%
Loading  /kaggle/input/ashrae-energy-prediction/test.csv
Memory usage of dataframe is 1272.51 MB
Memory usage after optimization is: 358.65 MB
Decreased by 71.8%
Loading  /kaggle/input/ashrae-energy-prediction/weather_train.csv
Memory usage of dataframe is 9.60 MB
Memory usage after optimization is: 2.65 MB
Decreased by 72.4%


#### Small peek into Datasets

In [5]:
data["train.csv"].head()

Unnamed: 0,building_id,meter,timestamp,meter_reading
0,0,0,2016-01-01 00:00:00,0.0
1,1,0,2016-01-01 00:00:00,0.0
2,2,0,2016-01-01 00:00:00,0.0
3,3,0,2016-01-01 00:00:00,0.0
4,4,0,2016-01-01 00:00:00,0.0


In [6]:
data["test.csv"].head()

Unnamed: 0,row_id,building_id,meter,timestamp
0,0,0,0,2017-01-01 00:00:00
1,1,1,0,2017-01-01 00:00:00
2,2,2,0,2017-01-01 00:00:00
3,3,3,0,2017-01-01 00:00:00
4,4,4,0,2017-01-01 00:00:00


In [7]:
data["building_metadata.csv"].head()

Unnamed: 0,site_id,building_id,primary_use,square_feet,year_built,floor_count
0,0,0,Education,7432,2008.0,
1,0,1,Education,2720,2004.0,
2,0,2,Education,5376,1991.0,
3,0,3,Education,23685,2002.0,
4,0,4,Education,116607,1975.0,


In [8]:
data["weather_train.csv"].head()

Unnamed: 0,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,2016-01-01 00:00:00,25.0,6.0,20.0,,1019.5,0.0,0.0
1,0,2016-01-01 01:00:00,24.40625,,21.09375,-1.0,1020.0,70.0,1.5
2,0,2016-01-01 02:00:00,22.796875,2.0,21.09375,0.0,1020.0,0.0,0.0
3,0,2016-01-01 03:00:00,21.09375,2.0,20.59375,0.0,1020.0,0.0,0.0
4,0,2016-01-01 04:00:00,20.0,2.0,20.0,-1.0,1020.0,250.0,2.599609


In [9]:
data["weather_test.csv"].head()

Unnamed: 0,site_id,timestamp,air_temperature,cloud_coverage,dew_temperature,precip_depth_1_hr,sea_level_pressure,wind_direction,wind_speed
0,0,2017-01-01 00:00:00,17.796875,4.0,11.703125,,1021.5,100.0,3.599609
1,0,2017-01-01 01:00:00,17.796875,2.0,12.796875,0.0,1022.0,130.0,3.099609
2,0,2017-01-01 02:00:00,16.09375,0.0,12.796875,0.0,1022.0,140.0,3.099609
3,0,2017-01-01 03:00:00,17.203125,0.0,13.296875,0.0,1022.0,140.0,3.099609
4,0,2017-01-01 04:00:00,16.703125,2.0,13.296875,0.0,1022.5,130.0,2.599609


In [10]:
data["sample_submission.csv"].head()

Unnamed: 0,row_id,meter_reading
0,0,0
1,1,0
2,2,0
3,3,0
4,4,0


<h1 align="center">Data Preprocessing and Cleaning</h1>

In [11]:
# ***Important***
# Always set proper seed values before any experiments for reproducibility
random.seed(2019)

In [12]:
# Let's convert the non-numerical feature to numerical feature through LabelEncoder
data["building_metadata.csv"].loc[:, "primary_use"] = LabelEncoder().fit_transform(data["building_metadata.csv"].loc[:, "primary_use"].values)

<pre>
Now it's time to prepare data

There are two files with features that need to be merged with the data. One is building metadata that has information on the buildings and the other is 
weather data that has information on the weather.

Note that the only features created are hour, weekday and is_holiday!
</pre>

In [13]:
# credits @Vopani

def prepare_data(X, building_data, weather_data, test=False):
    """
    Preparing final dataset with all features.
    """
    
    X = X.merge(building_data, on="building_id", how="left")
    X = X.merge(weather_data, on=["site_id", "timestamp"], how="left")
    
    X.sort_values("timestamp")
    X.reset_index(drop=True)
    
    gc.collect()
    
    holidays = ["2016-01-01", "2016-01-18", "2016-02-15", "2016-05-30", "2016-07-04",
                "2016-09-05", "2016-10-10", "2016-11-11", "2016-11-24", "2016-12-26",
                "2017-01-02", "2017-01-16", "2017-02-20", "2017-05-29", "2017-07-04",
                "2017-09-04", "2017-10-09", "2017-11-10", "2017-11-23", "2017-12-25",
                "2018-01-01", "2018-01-15", "2018-02-19", "2018-05-28", "2018-07-04",
                "2018-09-03", "2018-10-08", "2018-11-12", "2018-11-22", "2018-12-25",
                "2019-01-01"]
    
    X.timestamp = pd.to_datetime(X.timestamp, format="%Y-%m-%d %H:%M:%S")
    X.square_feet = np.log1p(X.square_feet)
    
    X["hour"] = X.timestamp.dt.hour
    X["weekday"] = X.timestamp.dt.weekday
    X["is_holiday"] = (X.timestamp.isin(holidays)).astype(int)
    
    drop_features = ["timestamp", "sea_level_pressure", "wind_direction", "wind_speed"]

    X.drop(drop_features, axis=1, inplace=True)

    if test:
        row_ids = X.row_id
        X.drop("row_id", axis=1, inplace=True)
        return X, row_ids
    else:
        y = np.log1p(X.meter_reading)
        X.drop("meter_reading", axis=1, inplace=True)
        return X, y

In [14]:
X_train, y_train = prepare_data(data["train.csv"], data["building_metadata.csv"], data["weather_train.csv"])

# Freeing up some memory by deleting uno more usable variables
del data["train.csv"], data["weather_train.csv"]
gc.collect()

0

<h1 align="center">Baseline Modeling with Cross Validation</h1>

<h2>Training Phase</h2>

In [15]:
# Credits @Vopani

X_half_1 = X_train[:int(X_train.shape[0] / 2)]
X_half_2 = X_train[int(X_train.shape[0] / 2):]

y_half_1 = y_train[:int(X_train.shape[0] / 2)]
y_half_2 = y_train[int(X_train.shape[0] / 2):]

categorical_features = ["building_id", "site_id", "meter", "primary_use", "hour", "weekday"]

d_half_1 = lgb.Dataset(X_half_1, label=y_half_1, categorical_feature=categorical_features, free_raw_data=False)
d_half_2 = lgb.Dataset(X_half_2, label=y_half_2, categorical_feature=categorical_features, free_raw_data=False)

watchlist_1 = [d_half_1, d_half_2]
watchlist_2 = [d_half_2, d_half_1]

# https://lightgbm.readthedocs.io/en/latest/Parameters.html
params = {
    "objective": "regression",
    "boosting": "gbdt",
    "num_leaves": 40,
    "learning_rate": 0.025,
    "feature_fraction": 0.85,
    "reg_lambda": 2,
    "metric": "rmse"
}

print("Building model with first half and validating on second half:")
model_half_1 = lgb.train(params, train_set=d_half_1, num_boost_round=1400, valid_sets=watchlist_1, verbose_eval=200, early_stopping_rounds=300)

print("Building model with second half and validating on first half:")
model_half_2 = lgb.train(params, train_set=d_half_2, num_boost_round=1400, valid_sets=watchlist_2, verbose_eval=200, early_stopping_rounds=300)

Building model with first half and validating on second half:




Training until validation scores don't improve for 300 rounds
[200]	training's rmse: 1.02788	valid_1's rmse: 1.37136
[400]	training's rmse: 0.91812	valid_1's rmse: 1.33719
[600]	training's rmse: 0.885171	valid_1's rmse: 1.33513
[800]	training's rmse: 0.866395	valid_1's rmse: 1.3342
[1000]	training's rmse: 0.852276	valid_1's rmse: 1.33625
Early stopping, best iteration is:
[752]	training's rmse: 0.869987	valid_1's rmse: 1.33371
Building model with second half and validating on first half:
Training until validation scores don't improve for 300 rounds
[200]	training's rmse: 1.00517	valid_1's rmse: 1.53828
[400]	training's rmse: 0.90171	valid_1's rmse: 1.50549
[600]	training's rmse: 0.874264	valid_1's rmse: 1.50257
[800]	training's rmse: 0.85691	valid_1's rmse: 1.50153
[1000]	training's rmse: 0.844696	valid_1's rmse: 1.50154
[1200]	training's rmse: 0.835786	valid_1's rmse: 1.50284
Early stopping, best iteration is:
[985]	training's rmse: 0.845938	valid_1's rmse: 1.50117


<h2>Testing Phase</h2>

In [16]:
# Preparing test data with same features as train data.
X_test, row_ids = prepare_data(data["test.csv"], data["building_metadata.csv"], data["weather_test.csv"], test=True)

<h1 align="center">Preparing Submission</h1>

In [17]:
# Getting prediction from our 1st Half trained model and weighting its prediction 0.5
pred = np.expm1(model_half_1.predict(X_test, num_iteration=model_half_1.best_iteration)) / 2

del model_half_1
gc.collect()

# Getting prediction from our 2nd Half trained model and weighting its prediction also 0.5
pred += np.expm1(model_half_2.predict(X_test, num_iteration=model_half_2.best_iteration)) / 2
    
del model_half_2
gc.collect()

12

In [18]:
# Generating our submission file
submission = pd.DataFrame({"row_id": row_ids, "meter_reading": np.clip(pred, 0, a_max=None)})
submission.to_csv("submission.csv", index=False)