## Half and Half
This notebook is the Python implementation of this awesomely simple R code: https://www.kaggle.com/kailex/ac-dc by [kxx](https://www.kaggle.com/kailex)

It demonstrates splitting the data in half and using each half to build a model which performs very well on the public LB with minimal feature engineering.

In [9]:
import gc
import os
import random

import lightgbm as lgb
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

# path_data = "/kaggle/input/ashrae-energy-prediction/"
# path_train = path_data + "train.csv"
# path_test = path_data + "test.csv"
# path_building = path_data + "building_metadata.csv"
# path_weather_train = path_data + "weather_train.csv"
# path_weather_test = path_data + "weather_test.csv"

myfavouritenumber = 0
seed = myfavouritenumber
random.seed(seed)

## Reading train data
Reading train data along with building and weather metadata.

In [10]:
df_train = pd.read_csv("data/train.csv")

building = pd.read_csv('building_metadata.csv')
le = LabelEncoder()
building.primary_use = le.fit_transform(building.primary_use)

weather_train = pd.read_csv('data/weather_train.csv')
weather_test = pd.read_csv('data/weather_test.csv')

In [11]:
from tools import reduce_mem_usage

In [12]:
df_train = reduce_mem_usage(df_train, use_float16=True)
building = reduce_mem_usage(building, use_float16=True)
weather_train = reduce_mem_usage(weather_train, use_float16=True)

Memory usage of dataframe is 616.95 MB
Memory usage after optimization is: 173.90 MB
Decreased by 71.8%
Memory usage of dataframe is 0.07 MB
Memory usage after optimization is: 0.02 MB
Decreased by 74.9%
Memory usage of dataframe is 8.58 MB
Memory usage after optimization is: 2.26 MB
Decreased by 73.7%


## Preparing data
There are two files with features that need to be merged with the data. One is building metadata that has information on the buildings and the other is weather data that has information on the weather.   

Note that the only features created are hour, weekday and is_holiday!

In [20]:
def prepare_data(X, building_data, weather_data, test=False):
    """
    Preparing final dataset with all features.
    """
    X
    X = X.merge(building_data, on="building_id", how="left")
    X = X.merge(weather_data, on=["site_id", "timestamp"], how="left")
    
    X.sort_values("timestamp")
    X.reset_index(drop=True)
    
    gc.collect()
    
    holidays = ["2016-01-01", "2016-01-18", "2016-02-15", "2016-05-30", "2016-07-04",
                "2016-09-05", "2016-10-10", "2016-11-11", "2016-11-24", "2016-12-26",
                "2017-01-02", "2017-01-16", "2017-02-20", "2017-05-29", "2017-07-04",
                "2017-09-04", "2017-10-09", "2017-11-10", "2017-11-23", "2017-12-25",
                "2018-01-01", "2018-01-15", "2018-02-19", "2018-05-28", "2018-07-04",
                "2018-09-03", "2018-10-08", "2018-11-12", "2018-11-22", "2018-12-25",
                "2019-01-01"]
    
    X.timestamp = pd.to_datetime(X.timestamp, format="%Y-%m-%d %H:%M:%S")
    X.square_feet = np.log1p(X.square_feet)
    
    X["hour"] = X.timestamp.dt.hour
    X["weekday"] = X.timestamp.dt.weekday
    X["is_holiday"] = (X.timestamp.isin(holidays)).astype(int)
    
    drop_features = ["timestamp", "wind_direction", "wind_speed"]

    X.drop(drop_features, axis=1, inplace=True)
    if not test:
        X.drop(index=X[(X.meter_reading <=0) & (X.meter == 0)].index, inplace=True)

    if test:
        row_ids = X.row_id
        X.drop("row_id", axis=1, inplace=True)
        return X, row_ids
    else:
        y = np.log1p(X.meter_reading)
        X.drop("meter_reading", axis=1, inplace=True)
        return X, y

In [14]:
# weather_train = timestamp_align(weather_train)
X_train, y_train = prepare_data(df_train, building, weather_train)

del df_train, weather_train
gc.collect()

0

## Two-fold LightGBM Model split half-and-half
The data is split into two based on time. Each half is used as the training data for a model.

In [15]:
X_half_1 = X_train[:int(X_train.shape[0] / 2)]
X_half_2 = X_train[int(X_train.shape[0] / 2):]

y_half_1 = y_train[:int(X_train.shape[0] / 2)]
y_half_2 = y_train[int(X_train.shape[0] / 2):]

categorical_features = ["building_id", "site_id", "meter", "primary_use", "hour", "weekday"]

d_half_1 = lgb.Dataset(X_half_1, label=y_half_1, categorical_feature=categorical_features, free_raw_data=False)
d_half_2 = lgb.Dataset(X_half_2, label=y_half_2, categorical_feature=categorical_features, free_raw_data=False)

watchlist_1 = [d_half_1, d_half_2]
watchlist_2 = [d_half_2, d_half_1]

params = {
    "objective": "regression",
    "boosting": "gbdt",
    "num_leaves": 40,
    "learning_rate": 0.05,
    "feature_fraction": 0.85,
    "reg_lambda": 2,
    "metric": "rmse",
    'n_jobs': -1
}

print("Building model with first half and validating on second half:")
model_half_1 = lgb.train(params, train_set=d_half_1, num_boost_round=1000, valid_sets=watchlist_1, verbose_eval=200, early_stopping_rounds=200)

print("Building model with second half and validating on first half:")
model_half_2 = lgb.train(params, train_set=d_half_2, num_boost_round=1000, valid_sets=watchlist_2, verbose_eval=200, early_stopping_rounds=200)

Building model with first half and validating on second half:
Training until validation scores don't improve for 200 rounds.
[200]	training's rmse: 0.834404	valid_1's rmse: 1.06249
[400]	training's rmse: 0.792995	valid_1's rmse: 1.04974
[600]	training's rmse: 0.773513	valid_1's rmse: 1.05095
Early stopping, best iteration is:
[432]	training's rmse: 0.789222	valid_1's rmse: 1.04896
Building model with second half and validating on first half:
Training until validation scores don't improve for 200 rounds.
[200]	training's rmse: 0.847961	valid_1's rmse: 1.05908
[400]	training's rmse: 0.805171	valid_1's rmse: 1.04541
[600]	training's rmse: 0.785137	valid_1's rmse: 1.04214
[800]	training's rmse: 0.770398	valid_1's rmse: 1.04202
Early stopping, best iteration is:
[766]	training's rmse: 0.77265	valid_1's rmse: 1.0418


In [16]:
del X_train, X_half_1, X_half_2, y_half_1, y_half_2, d_half_1, d_half_2, watchlist_1, watchlist_2
gc.collect()

318

## Preparing test data
Preparing test data with same features as train data.

In [19]:
df_test

Unnamed: 0,row_id,building_id,meter,timestamp
0,0,0,0,2017-01-01 00:00:00
1,1,1,0,2017-01-01 00:00:00
2,2,2,0,2017-01-01 00:00:00
3,3,3,0,2017-01-01 00:00:00
4,4,4,0,2017-01-01 00:00:00
...,...,...,...,...
41697595,41697595,1444,0,2018-05-09 07:00:00
41697596,41697596,1445,0,2018-05-09 07:00:00
41697597,41697597,1446,0,2018-05-09 07:00:00
41697598,41697598,1447,0,2018-05-09 07:00:00


In [21]:
df_test = pd.read_csv('data/test.csv')
# weather_test = pd.read_csv(path_weather_test)

df_test = reduce_mem_usage(df_test)
weather_test = reduce_mem_usage(weather_test)

X_test, row_ids = prepare_data(df_test, building, weather_test, test=True)

Memory usage of dataframe is 1272.51 MB
Memory usage after optimization is: 358.65 MB
Decreased by 71.8%
Memory usage of dataframe is 6.11 MB
Memory usage after optimization is: 6.11 MB
Decreased by 0.0%


In [22]:
del df_test, building, weather_test
gc.collect()

0

## Scoring test data
Averaging predictions from the two half train data models.

In [23]:
pred = np.expm1(model_half_1.predict(X_test, num_iteration=model_half_1.best_iteration)) / 2

del model_half_1
gc.collect()

pred += np.expm1(model_half_2.predict(X_test, num_iteration=model_half_2.best_iteration)) / 2
    
del model_half_2
gc.collect()

12

## Submission
Preparing final file for submission.

In [25]:
2

2

In [26]:
submission = pd.DataFrame({"row_id": row_ids, "meter_reading": np.clip(pred, 0, a_max=None)})
submission.to_csv("submission.csv", index=False)