## Divide and Conquer
This notebook is to explore features and optimize models for each site_id. The idea is to resolve some data discrepancies that are present by dividing the data rather than cleaning.   

Note that this is just another approach, need not necessarily be better or worse, but probably can add some value to ensembles irrespective of its CV or public LB scores.

In [1]:
import gc
import os

import lightgbm as lgb
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder

from tqdm.notebook import tqdm

path_data = "/kaggle/input/ashrae-energy-prediction/"
path_train = path_data + "train.csv"
path_test = path_data + "test.csv"
path_building = path_data + "building_metadata.csv"
path_weather_train = path_data + "weather_train.csv"
path_weather_test = path_data + "weather_test.csv"

myfavouritenumber = 13
seed = myfavouritenumber

## Preparing data
There are two files with features that need to be merged with the data. One is building metadata that has information on the buildings and the other is weather data that has information on the weather.

In [2]:
df_train = pd.read_csv(path_train)
df_test = pd.read_csv(path_test)

building = pd.read_csv(path_building)
le = LabelEncoder()
building.primary_use = le.fit_transform(building.primary_use)

weather_train = pd.read_csv(path_weather_train)
weather_test = pd.read_csv(path_weather_test)

weather_train.drop(["sea_level_pressure", "wind_direction", "wind_speed"], axis=1, inplace=True)
weather_test.drop(["sea_level_pressure", "wind_direction", "wind_speed"], axis=1, inplace=True)

weather_train = weather_train.groupby("site_id").apply(lambda group: group.interpolate(limit_direction="both"))
weather_test = weather_test.groupby("site_id").apply(lambda group: group.interpolate(limit_direction="both"))

df_train = df_train.merge(building, on="building_id")
df_train = df_train.merge(weather_train, on=["site_id", "timestamp"], how="left")
df_train = df_train[~((df_train.site_id==0) & (df_train.meter==0) & (df_train.building_id <= 104) & (df_train.timestamp < "2016-05-21"))]
df_train.reset_index(drop=True, inplace=True)
df_train.timestamp = pd.to_datetime(df_train.timestamp, format='%Y-%m-%d %H:%M:%S')
df_train["log_meter_reading"] = np.log1p(df_train.meter_reading)

df_test = df_test.merge(building, on="building_id")
df_test = df_test.merge(weather_test, on=["site_id", "timestamp"], how="left")
df_test.reset_index(drop=True, inplace=True)
df_test.timestamp = pd.to_datetime(df_test.timestamp, format='%Y-%m-%d %H:%M:%S')

del building, le
gc.collect()

0

In [3]:
## Memory Optimization

# Original code from https://www.kaggle.com/gemartin/load-data-reduce-memory-usage by @gemartin
# Modified to support timestamp type, categorical type
# Modified to add option to use float16

from pandas.api.types import is_datetime64_any_dtype as is_datetime
from pandas.api.types import is_categorical_dtype

def reduce_mem_usage(df, use_float16=False):
    """
    Iterate through all the columns of a dataframe and modify the data type to reduce memory usage.        
    """
    
    start_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage of dataframe is {:.2f} MB".format(start_mem))
    
    for col in df.columns:
        if is_datetime(df[col]) or is_categorical_dtype(df[col]):
            continue
        col_type = df[col].dtype
        
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == "int":
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if use_float16 and c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
        else:
            df[col] = df[col].astype("category")

    end_mem = df.memory_usage().sum() / 1024**2
    print("Memory usage after optimization is: {:.2f} MB".format(end_mem))
    print("Decreased by {:.1f}%".format(100 * (start_mem - end_mem) / start_mem))
    
    return df

In [4]:
df_train = reduce_mem_usage(df_train, use_float16=True)
df_test = reduce_mem_usage(df_test, use_float16=True)

weather_train.timestamp = pd.to_datetime(weather_train.timestamp, format='%Y-%m-%d %H:%M:%S')
weather_test.timestamp = pd.to_datetime(weather_test.timestamp, format='%Y-%m-%d %H:%M:%S')
weather_train = reduce_mem_usage(weather_train, use_float16=True)
weather_test = reduce_mem_usage(weather_test, use_float16=True)

Memory usage of dataframe is 2122.08 MB
Memory usage after optimization is: 663.15 MB
Decreased by 68.7%
Memory usage of dataframe is 4135.66 MB
Memory usage after optimization is: 1312.28 MB
Decreased by 68.3%
Memory usage of dataframe is 6.40 MB
Memory usage after optimization is: 2.27 MB
Decreased by 64.6%
Memory usage of dataframe is 12.69 MB
Memory usage after optimization is: 4.49 MB
Decreased by 64.6%


## Feature Engineering: Time
Creating time-based features.

In [5]:
df_train["hour"] = df_train.timestamp.dt.hour
df_train["weekday"] = df_train.timestamp.dt.weekday

df_test["hour"] = df_test.timestamp.dt.hour
df_test["weekday"] = df_test.timestamp.dt.weekday

## Feature Engineering: Aggregation
Creating aggregate features for buildings at various levels.

In [6]:
df_building_meter = df_train.groupby(["building_id", "meter"]).agg(mean_building_meter=("log_meter_reading", "mean"),
                                                             median_building_meter=("log_meter_reading", "median")).reset_index()

df_train = df_train.merge(df_building_meter, on=["building_id", "meter"])
df_test = df_test.merge(df_building_meter, on=["building_id", "meter"])

df_building_meter_hour = df_train.groupby(["building_id", "meter", "hour"]).agg(mean_building_meter=("log_meter_reading", "mean"),
                                                                                median_building_meter=("log_meter_reading", "median")).reset_index()

df_train = df_train.merge(df_building_meter_hour, on=["building_id", "meter", "hour"])
df_test = df_test.merge(df_building_meter_hour, on=["building_id", "meter", "hour"])

## Feature Engineering: Lags
Creating lag-based features. These are statistics of available features looking back in time by fixed intervals.   
These features are created in the weather data itself and then merged with the train and test data.

In [7]:
def create_lag_features(df, window):
    """
    Creating lag-based features looking back in time.
    """
    
    feature_cols = ["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"]
    df_site = df.groupby("site_id")
    
    df_rolled = df_site[feature_cols].rolling(window=window, min_periods=0)
    
    df_mean = df_rolled.mean().reset_index().astype(np.float16)
    df_median = df_rolled.median().reset_index().astype(np.float16)
    df_min = df_rolled.min().reset_index().astype(np.float16)
    df_max = df_rolled.max().reset_index().astype(np.float16)
    df_std = df_rolled.std().reset_index().astype(np.float16)
    df_skew = df_rolled.skew().reset_index().astype(np.float16)
    
    for feature in feature_cols:
        df[f"{feature}_mean_lag{window}"] = df_mean[feature]
        df[f"{feature}_median_lag{window}"] = df_median[feature]
        df[f"{feature}_min_lag{window}"] = df_min[feature]
        df[f"{feature}_max_lag{window}"] = df_max[feature]
        df[f"{feature}_std_lag{window}"] = df_std[feature]
        df[f"{feature}_skew_lag{window}"] = df_std[feature]
        
    return df

## Features
Creating and selecting all the features.

In [8]:
weather_train = create_lag_features(weather_train, 18)
weather_train.drop(["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"], axis=1, inplace=True)

df_train = df_train.merge(weather_train, on=["site_id", "timestamp"], how="left")

del weather_train
gc.collect()

38

In [9]:
categorical_features = [
    "building_id",
    "primary_use",
    "meter",
    "weekday",
    "hour"
]

all_features = [col for col in df_train.columns if col not in ["timestamp", "site_id", "meter_reading", "log_meter_reading"]]

## KFold Cross Validation with LGBM
Since the test data is out of time and longer than train data, creating a reliable validation strategy is going to be a major challenge. Just using a simple KFold CV here.

The folds are applied to each site individually, thus building 16 sites x 3 folds = 48 models in total.

In [10]:
cv = 2
models = {}
cv_scores = {"site_id": [], "cv_score": []}

for site_id in tqdm(range(16), desc="site_id"):
    print(cv, "fold CV for site_id:", site_id)
    kf = KFold(n_splits=cv, random_state=seed)
    models[site_id] = []

    X_train_site = df_train[df_train.site_id==site_id].reset_index(drop=True)
    y_train_site = X_train_site.log_meter_reading
    y_pred_train_site = np.zeros(X_train_site.shape[0])
    
    score = 0

    for fold, (train_index, valid_index) in enumerate(kf.split(X_train_site, y_train_site)):
        X_train, X_valid = X_train_site.loc[train_index, all_features], X_train_site.loc[valid_index, all_features]
        y_train, y_valid = y_train_site.iloc[train_index], y_train_site.iloc[valid_index]

        dtrain = lgb.Dataset(X_train, label=y_train, categorical_feature=categorical_features)
        dvalid = lgb.Dataset(X_valid, label=y_valid, categorical_feature=categorical_features)

        watchlist = [dtrain, dvalid]

        params = {"objective": "regression",
                  "num_leaves": 41,
                  "learning_rate": 0.049,
                  "bagging_freq": 5,
                  "bagging_fraction": 0.51,
                  "feature_fraction": 0.81,
                  "metric": "rmse"
                  }

        model_lgb = lgb.train(params, train_set=dtrain, num_boost_round=999, valid_sets=watchlist, verbose_eval=101, early_stopping_rounds=21)
        models[site_id].append(model_lgb)

        y_pred_valid = model_lgb.predict(X_valid, num_iteration=model_lgb.best_iteration)
        y_pred_train_site[valid_index] = y_pred_valid

        rmse = np.sqrt(mean_squared_error(y_valid, y_pred_valid))
        print("Site Id:", site_id, ", Fold:", fold+1, ", RMSE:", rmse)
        score += rmse / cv
        
        gc.collect()
        
    cv_scores["site_id"].append(site_id)
    cv_scores["cv_score"].append(score)
        
    print("\nSite Id:", site_id, ", CV RMSE:", np.sqrt(mean_squared_error(y_train_site, y_pred_train_site)), "\n")

HBox(children=(IntProgress(value=0, description='site_id', max=16, style=ProgressStyle(description_width='init…

2 fold CV for site_id: 0




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.754068	valid_1's rmse: 0.762785
Early stopping, best iteration is:
[119]	training's rmse: 0.735001	valid_1's rmse: 0.762346
Site Id: 0 , Fold: 1 , RMSE: 0.7603602484751305




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.568264	valid_1's rmse: 0.985372
Early stopping, best iteration is:
[87]	training's rmse: 0.581728	valid_1's rmse: 0.983918
Site Id: 0 , Fold: 2 , RMSE: 0.983926888878194

Site Id: 0 , CV RMSE: 0.8792780646981602 

2 fold CV for site_id: 1




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.574773	valid_1's rmse: 0.976137
[202]	training's rmse: 0.547823	valid_1's rmse: 0.970878
Early stopping, best iteration is:
[181]	training's rmse: 0.552723	valid_1's rmse: 0.970334
Site Id: 1 , Fold: 1 , RMSE: 0.9709397209129904




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.700172	valid_1's rmse: 0.750621
[202]	training's rmse: 0.645996	valid_1's rmse: 0.746141
Early stopping, best iteration is:
[246]	training's rmse: 0.632218	valid_1's rmse: 0.745548
Site Id: 1 , Fold: 2 , RMSE: 0.7471926827603798

Site Id: 1 , CV RMSE: 0.8663202472340099 

2 fold CV for site_id: 2




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.730083	valid_1's rmse: 0.808682
Early stopping, best iteration is:
[95]	training's rmse: 0.736491	valid_1's rmse: 0.807438
Site Id: 2 , Fold: 1 , RMSE: 0.799837847260722




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.638016	valid_1's rmse: 0.941298
Early stopping, best iteration is:
[178]	training's rmse: 0.595652	valid_1's rmse: 0.936179
Site Id: 2 , Fold: 2 , RMSE: 0.9351863685181886

Site Id: 2 , CV RMSE: 0.870147724749339 

2 fold CV for site_id: 3




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.307305	valid_1's rmse: 0.438325
[202]	training's rmse: 0.280498	valid_1's rmse: 0.434314
Early stopping, best iteration is:
[266]	training's rmse: 0.271233	valid_1's rmse: 0.433539
Site Id: 3 , Fold: 1 , RMSE: 0.4337620038144956




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.329694	valid_1's rmse: 0.390556
Early stopping, best iteration is:
[124]	training's rmse: 0.320484	valid_1's rmse: 0.389637
Site Id: 3 , Fold: 2 , RMSE: 0.3895220667524987

Site Id: 3 , CV RMSE: 0.4122359347555857 

2 fold CV for site_id: 4




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.14195	valid_1's rmse: 0.26554
Early stopping, best iteration is:
[90]	training's rmse: 0.145861	valid_1's rmse: 0.263895
Site Id: 4 , Fold: 1 , RMSE: 0.26370571536670734




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.210787	valid_1's rmse: 0.315069
[202]	training's rmse: 0.187287	valid_1's rmse: 0.300403
[303]	training's rmse: 0.176775	valid_1's rmse: 0.299155
Early stopping, best iteration is:
[285]	training's rmse: 0.178475	valid_1's rmse: 0.298991
Site Id: 4 , Fold: 2 , RMSE: 0.31027605418360105

Site Id: 4 , CV RMSE: 0.28793396301653257 

2 fold CV for site_id: 5




Training until validation scores don't improve for 21 rounds
Early stopping, best iteration is:
[79]	training's rmse: 0.549913	valid_1's rmse: 0.70846
Site Id: 5 , Fold: 1 , RMSE: 0.7075285005868124




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.541455	valid_1's rmse: 0.652874
[202]	training's rmse: 0.495493	valid_1's rmse: 0.630809
[303]	training's rmse: 0.467386	valid_1's rmse: 0.624115
[404]	training's rmse: 0.445317	valid_1's rmse: 0.619606
[505]	training's rmse: 0.428658	valid_1's rmse: 0.616694
[606]	training's rmse: 0.4167	valid_1's rmse: 0.613616
[707]	training's rmse: 0.405188	valid_1's rmse: 0.609822
[808]	training's rmse: 0.394651	valid_1's rmse: 0.608164
[909]	training's rmse: 0.385582	valid_1's rmse: 0.606174
Early stopping, best iteration is:
[957]	training's rmse: 0.381804	valid_1's rmse: 0.604658
Site Id: 5 , Fold: 2 , RMSE: 0.6196898132307027

Site Id: 5 , CV RMSE: 0.6650609159184316 

2 fold CV for site_id: 6




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.993374	valid_1's rmse: 1.12765
Early stopping, best iteration is:
[135]	training's rmse: 0.946248	valid_1's rmse: 1.12202
Site Id: 6 , Fold: 1 , RMSE: 1.1186400070882925




Training until validation scores don't improve for 21 rounds
Early stopping, best iteration is:
[52]	training's rmse: 0.932655	valid_1's rmse: 1.41368
Site Id: 6 , Fold: 2 , RMSE: 1.3962066118879557

Site Id: 6 , CV RMSE: 1.2650587582756638 

2 fold CV for site_id: 7




Training until validation scores don't improve for 21 rounds
Early stopping, best iteration is:
[76]	training's rmse: 1.40913	valid_1's rmse: 1.69229
Site Id: 7 , Fold: 1 , RMSE: 1.6432376097305152




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.979455	valid_1's rmse: 2.15722
Early stopping, best iteration is:
[176]	training's rmse: 0.902909	valid_1's rmse: 2.1456
Site Id: 7 , Fold: 2 , RMSE: 2.1469975025433152

Site Id: 7 , CV RMSE: 1.9117822719644604 

2 fold CV for site_id: 8




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.437353	valid_1's rmse: 0.513175
[202]	training's rmse: 0.400014	valid_1's rmse: 0.509143
Early stopping, best iteration is:
[185]	training's rmse: 0.405075	valid_1's rmse: 0.508816
Site Id: 8 , Fold: 1 , RMSE: 0.5067870362727548




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.41253	valid_1's rmse: 0.653343
Early stopping, best iteration is:
[115]	training's rmse: 0.405708	valid_1's rmse: 0.651856
Site Id: 8 , Fold: 2 , RMSE: 0.6556535586130549

Site Id: 8 , CV RMSE: 0.5859668865847514 

2 fold CV for site_id: 9




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.874791	valid_1's rmse: 0.949661
[202]	training's rmse: 0.765614	valid_1's rmse: 0.909197
Early stopping, best iteration is:
[209]	training's rmse: 0.761647	valid_1's rmse: 0.907843
Site Id: 9 , Fold: 1 , RMSE: 0.876358715745158




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.8235	valid_1's rmse: 1.0533
Early stopping, best iteration is:
[161]	training's rmse: 0.751335	valid_1's rmse: 1.03073
Site Id: 9 , Fold: 2 , RMSE: 0.9966087708560711

Site Id: 9 , CV RMSE: 0.9384118383603081 

2 fold CV for site_id: 10




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 1.08256	valid_1's rmse: 1.18021
[202]	training's rmse: 0.980672	valid_1's rmse: 1.17077
Early stopping, best iteration is:
[240]	training's rmse: 0.948965	valid_1's rmse: 1.16908
Site Id: 10 , Fold: 1 , RMSE: 1.1795947071699184




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.807944	valid_1's rmse: 1.46441
[202]	training's rmse: 0.750272	valid_1's rmse: 1.44934
Early stopping, best iteration is:
[254]	training's rmse: 0.731262	valid_1's rmse: 1.44677
Site Id: 10 , Fold: 2 , RMSE: 1.4599551554636896

Site Id: 10 , CV RMSE: 1.3271983518640267 

2 fold CV for site_id: 11




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.600879	valid_1's rmse: 0.739121
Early stopping, best iteration is:
[113]	training's rmse: 0.589399	valid_1's rmse: 0.737615
Site Id: 11 , Fold: 1 , RMSE: 0.7363616018075736




Training until validation scores don't improve for 21 rounds
Early stopping, best iteration is:
[31]	training's rmse: 0.678459	valid_1's rmse: 1.89251
Site Id: 11 , Fold: 2 , RMSE: 1.8923248139307858

Site Id: 11 , CV RMSE: 1.4358092096706716 

2 fold CV for site_id: 12




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.373533	valid_1's rmse: 0.421868
[202]	training's rmse: 0.328804	valid_1's rmse: 0.402469
[303]	training's rmse: 0.302873	valid_1's rmse: 0.396281
[404]	training's rmse: 0.285165	valid_1's rmse: 0.393029
[505]	training's rmse: 0.271971	valid_1's rmse: 0.390854
[606]	training's rmse: 0.258939	valid_1's rmse: 0.389741
[707]	training's rmse: 0.248902	valid_1's rmse: 0.388373
[808]	training's rmse: 0.238246	valid_1's rmse: 0.387691
[909]	training's rmse: 0.23049	valid_1's rmse: 0.386922
Early stopping, best iteration is:
[900]	training's rmse: 0.231102	valid_1's rmse: 0.386702
Site Id: 12 , Fold: 1 , RMSE: 0.3855484279040529




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.331468	valid_1's rmse: 0.44194
[202]	training's rmse: 0.272916	valid_1's rmse: 0.424145
[303]	training's rmse: 0.242293	valid_1's rmse: 0.41937
[404]	training's rmse: 0.22451	valid_1's rmse: 0.417429
[505]	training's rmse: 0.20988	valid_1's rmse: 0.416134
Early stopping, best iteration is:
[509]	training's rmse: 0.20943	valid_1's rmse: 0.416057
Site Id: 12 , Fold: 2 , RMSE: 0.41591393515378655

Site Id: 12 , CV RMSE: 0.40101864961431954 

2 fold CV for site_id: 13




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.871221	valid_1's rmse: 1.22693
[202]	training's rmse: 0.778435	valid_1's rmse: 1.20029
[303]	training's rmse: 0.74379	valid_1's rmse: 1.19407
[404]	training's rmse: 0.717726	valid_1's rmse: 1.19068
[505]	training's rmse: 0.696581	valid_1's rmse: 1.18823
Early stopping, best iteration is:
[576]	training's rmse: 0.684269	valid_1's rmse: 1.18689
Site Id: 13 , Fold: 1 , RMSE: 1.1872913121246813




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.860216	valid_1's rmse: 1.18772
Early stopping, best iteration is:
[132]	training's rmse: 0.81879	valid_1's rmse: 1.17807
Site Id: 13 , Fold: 2 , RMSE: 1.184739846556789

Site Id: 13 , CV RMSE: 1.1860162659293443 

2 fold CV for site_id: 14




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 1.2151	valid_1's rmse: 1.47793
[202]	training's rmse: 1.118	valid_1's rmse: 1.4572
[303]	training's rmse: 1.0684	valid_1's rmse: 1.44831
Early stopping, best iteration is:
[364]	training's rmse: 1.04783	valid_1's rmse: 1.44547
Site Id: 14 , Fold: 1 , RMSE: 1.445219252561224




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 1.26505	valid_1's rmse: 1.53168
[202]	training's rmse: 1.18301	valid_1's rmse: 1.51246
[303]	training's rmse: 1.14269	valid_1's rmse: 1.50429
[404]	training's rmse: 1.11279	valid_1's rmse: 1.49959
Early stopping, best iteration is:
[457]	training's rmse: 1.09929	valid_1's rmse: 1.49732
Site Id: 14 , Fold: 2 , RMSE: 1.481708980368324

Site Id: 14 , CV RMSE: 1.46357784051238 

2 fold CV for site_id: 15




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.512123	valid_1's rmse: 0.802916
[202]	training's rmse: 0.472729	valid_1's rmse: 0.783816
Early stopping, best iteration is:
[243]	training's rmse: 0.46514	valid_1's rmse: 0.78054
Site Id: 15 , Fold: 1 , RMSE: 0.7686943004764188




Training until validation scores don't improve for 21 rounds
[101]	training's rmse: 0.618316	valid_1's rmse: 0.673272
Early stopping, best iteration is:
[120]	training's rmse: 0.603068	valid_1's rmse: 0.662807
Site Id: 15 , Fold: 2 , RMSE: 0.6542285358699608

Site Id: 15 , CV RMSE: 0.7137597301373492 




In [11]:
pd.DataFrame.from_dict(cv_scores)

Unnamed: 0,site_id,cv_score
0,0,0.872144
1,1,0.859066
2,2,0.867512
3,3,0.411642
4,4,0.286991
5,5,0.663609
6,6,1.257423
7,7,1.895118
8,8,0.58122
9,9,0.936484


In [12]:
del df_train, X_train_site, y_train_site, X_train, y_train, dtrain, X_valid, y_valid, dvalid, y_pred_train_site, y_pred_valid, rmse, score, cv_scores
gc.collect()

3

## Scoring on test data
The test data for each site is scored individually using the 3 models, one from each fold. The final prediction is the average of the 3 models.

In [13]:
weather_test = create_lag_features(weather_test, 18)
weather_test.drop(["air_temperature", "cloud_coverage", "dew_temperature", "precip_depth_1_hr"], axis=1, inplace=True)

In [14]:
df_test_sites = []

for site_id in tqdm(range(16), desc="site_id"):
    print("Preparing test data for site_id", site_id)

    X_test_site = df_test[df_test.site_id==site_id]
    weather_test_site = weather_test[weather_test.site_id==site_id]
    
    X_test_site = X_test_site.merge(weather_test_site, on=["site_id", "timestamp"], how="left")
    
    row_ids_site = X_test_site.row_id

    X_test_site = X_test_site[all_features]
    y_pred_test_site = np.zeros(X_test_site.shape[0])

    print("Scoring for site_id", site_id)    
    for fold in range(cv):
        model_lgb = models[site_id][fold]
        y_pred_test_site += model_lgb.predict(X_test_site, num_iteration=model_lgb.best_iteration) / cv
        gc.collect()
        
    df_test_site = pd.DataFrame({"row_id": row_ids_site, "meter_reading": y_pred_test_site})
    df_test_sites.append(df_test_site)
    
    print("Scoring for site_id", site_id, "completed\n")
    gc.collect()

HBox(children=(IntProgress(value=0, description='site_id', max=16, style=ProgressStyle(description_width='init…

Preparing test data for site_id 0
Scoring for site_id 0
Scoring for site_id 0 completed

Preparing test data for site_id 1
Scoring for site_id 1
Scoring for site_id 1 completed

Preparing test data for site_id 2
Scoring for site_id 2
Scoring for site_id 2 completed

Preparing test data for site_id 3
Scoring for site_id 3
Scoring for site_id 3 completed

Preparing test data for site_id 4
Scoring for site_id 4
Scoring for site_id 4 completed

Preparing test data for site_id 5
Scoring for site_id 5
Scoring for site_id 5 completed

Preparing test data for site_id 6
Scoring for site_id 6
Scoring for site_id 6 completed

Preparing test data for site_id 7
Scoring for site_id 7
Scoring for site_id 7 completed

Preparing test data for site_id 8
Scoring for site_id 8
Scoring for site_id 8 completed

Preparing test data for site_id 9
Scoring for site_id 9
Scoring for site_id 9 completed

Preparing test data for site_id 10
Scoring for site_id 10
Scoring for site_id 10 completed

Preparing test dat

## Submission
Preparing final file for submission.

In [15]:
submit = pd.concat(df_test_sites)
submit.meter_reading = np.clip(np.expm1(submit.meter_reading), 0, a_max=None)
submit.to_csv("submission_noleak.csv", index=False)

In [16]:
## adding leak
leak0 = pd.read_csv("/kaggle/input/ashrae-leak-data/site0.csv")
leak1 = pd.read_csv("/kaggle/input/ashrae-leak-data/site1.csv")
leak2 = pd.read_csv("/kaggle/input/ashrae-leak-data/site2.csv")
leak4 = pd.read_csv("/kaggle/input/ashrae-leak-data/site4.csv")
leak15 = pd.read_csv("/kaggle/input/ashrae-leak-data/site15.csv")

leak = pd.concat([leak0, leak1, leak2, leak4, leak15])

del leak0, leak1, leak2, leak4, leak15, df_test_sites
gc.collect()

test = pd.read_csv(path_test)
test = test[test.building_id.isin(leak.building_id.unique())]

leak = leak.merge(test, on=["building_id", "meter", "timestamp"])

del test
gc.collect()

submit = submit.merge(leak[["row_id", "meter_reading_scraped"]], on=["row_id"], how="left")
submit.loc[submit.meter_reading_scraped.notnull(), "meter_reading"] = submit.loc[submit.meter_reading_scraped.notnull(), "meter_reading_scraped"] 
submit.drop(["meter_reading_scraped"], axis=1, inplace=True)

submit.to_csv("submission.csv", index=False)