# Considerations for the solution

The solution notebook is `wids_challenge.ipynb`. Here are some considerations on the solution.

## Reconstructing a building history

We note that the training set contains observations for years 1-6 and the test set for year 7, suggesting that one could use information about **past site_eui** readings to predict the current ones.

The first step is to create an id that identifies a building throughout its history. We opt for a **building_id** based on facility_type, is_residential, floor_area and year_built.

*Note:* We have excluded elevation from the id, because we found examples where the building had the same characteristics and the same temperature values, but different values of elevation, which does not make sense.
Given that elevation still has some correlation with the target, we do not drop it altogether, but we do not use it to determine the building id.
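
A minimal sketch of how such an id could be built with pandas, assuming the columns are named as in the text above. The toy values and the string-concatenation approach are our own illustration, not necessarily what the notebook does.

```python
import pandas as pd

# Toy rows; all values are made up for illustration only.
df = pd.DataFrame({
    "facility_type": ["Office", "Office", "Office", "Warehouse"],
    "is_residential": [False, False, False, False],  # hypothetical encoding
    "floor_area": [61242.0, 61242.0, 61242.0, 27000.0],
    "year_built": [1942.0, 1942.0, 1942.0, 1985.0],
    "elevation": [2.4, 2.4, 45.7, 2.4],  # deliberately left out of the id
    "year_factor": [1, 2, 3, 1],
})

id_cols = ["facility_type", "is_residential", "floor_area", "year_built"]

# Concatenate the identifying columns into one string key per row.
# astype(str) turns missing values into the string "nan", so rows with a
# missing year_built still receive an id instead of being dropped.
df["building_id"] = df[id_cols].astype(str).agg("_".join, axis=1)

print(df[["building_id", "year_factor", "elevation"]])
```

In this toy frame the third row has a different elevation but still ends up with the same building_id as the first two rows.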

The building_id allows us to look back in time and use the information from the previous years to predict the site_eui for year 7.

We can then **add features associated with the past history**, such as how much the temperatures differ from the previous year, the target value of the previous year, the energy star rating of the previous year, or the change in energy star rating with respect to this year. The idea is to be inclusive here, then check which features are the most promising using Pearson and distance correlation.
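
As an illustration, history features of this kind could be derived as follows. This is a sketch on invented numbers; the column names follow the ones used in this document, except `january_avg_temp_diff_1`, which is a hypothetical name.

```python
import pandas as pd

# Toy history for two buildings; all numbers are invented.
df = pd.DataFrame({
    "building_id": ["A", "A", "A", "B", "B"],
    "year_factor": [1, 2, 3, 2, 3],
    "site_eui": [80.0, 82.0, 79.0, 150.0, 140.0],
    "energy_star_rating": [60.0, 62.0, 65.0, 20.0, 25.0],
    "january_avg_temp": [30.0, 28.0, 33.0, 30.0, 28.0],
})

# Rows must be in chronological order within each building before shifting.
df = df.sort_values(["building_id", "year_factor"]).reset_index(drop=True)

# Previous-year values of the target and of the energy star rating.
df["site_eui_lag_1"] = df.groupby("building_id")["site_eui"].shift(1)
df["energy_star_rating_lag_1"] = (
    df.groupby("building_id")["energy_star_rating"].shift(1)
)

# Year-on-year changes (NaN for the first observed year of each building).
df["energy_star_rating_diff_1"] = (
    df.groupby("building_id")["energy_star_rating"].diff()
)
df["january_avg_temp_diff_1"] = (
    df.groupby("building_id")["january_avg_temp"].diff()
)

# Flag rows that have at least one previous observation.
df["has_previous_history"] = df["site_eui_lag_1"].notna()
print(df)
```

Note that `shift(1)` picks the previous observed row of the building, which is the previous year only if no year is missing in between.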

These features are expected to be highly correlated with the target, probably more than any other feature, other than energy star rating; in fact all other features contain only generic indications of how much energy a building consumes.


### Note on shift and diff methods

Shift gives the same values whether it is applied on the groupby object or passed to `transform` and applied on the original dataframe; this is the desired behaviour.

The same does not hold for diff. In the check below, `transform('diff')` does not reproduce the difference computed by hand, whereas `diff()` called directly on the groupby object does. We therefore do not trust `transform('diff')` and use the groupby `diff()` instead.


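```python
# Assumes `df` is a pandas DataFrame sorted by year within each building_id.
# Lagged rating; -300 is a sentinel for rows without a previous value.
df['energy_star_rating_lagged_1'] = df.groupby('building_id')['energy_star_rating'].transform('shift').fillna(-300)
# Difference computed by hand from the lagged column.
df['energy_star_rating_diff_1'] = df['energy_star_rating'] - df['energy_star_rating_lagged_1']
# The same difference via transform('diff') and via diff() on the groupby.
df['energy_star_rating_diff_1_td'] = df.groupby('building_id')['energy_star_rating'].transform('diff')
df['energy_star_rating_diff_1_d'] = df.groupby('building_id')['energy_star_rating'].diff()

# Note: Series.sum() skips NaN, so rows where either side is NaN do not contribute.
(df['energy_star_rating_diff_1'] - df['energy_star_rating_diff_1_td']).sum()
# ==> non-zero value

(df['energy_star_rating_diff_1'] - df['energy_star_rating_diff_1_d']).sum()
# ==> 0
```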
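
The comparison can also be re-run on a small toy frame. The sketch below uses `Series.equals`, which treats NaN in the same position as equal, instead of `sum()`, which skips NaN. Whether `transform('diff')` agrees with the by-hand value may depend on the pandas version and on the row ordering, so this is meant as a check to run, not as a statement about the outcome.

```python
import pandas as pd

# Toy data with the two buildings deliberately interleaved.
df = pd.DataFrame({
    "building_id": ["A", "B", "A", "B", "A"],
    "energy_star_rating": [60.0, 20.0, 62.0, 25.0, 65.0],
})
grouped = df.groupby("building_id")["energy_star_rating"]

# By-hand difference without a sentinel fill: first rows stay NaN.
by_hand = df["energy_star_rating"] - grouped.shift()
via_diff = grouped.diff()
via_transform = grouped.transform("diff")

print("groupby().diff() matches by hand: ", by_hand.equals(via_diff))
print("transform('diff') matches by hand:", by_hand.equals(via_transform))
```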

## Two models for two parts of the dataset


I run into the **problem** that **we don't always have a history of one or two years**: how should the values be filled there? Ideally I would drop the instances that do not have past history, but about 6-7% of the instances in the test set do not have past history and I need a prediction for those as well.
A possibility is to split the training and test sets further, depending on whether they have past history or not, and train two different models.

Could a single model handle the two cases at the same time? 

A decision tree regressor is expected to be able to do that: it can use the features differently to predict the target depending on which "category" the data belongs to (in this case, whether there is previous history or not).

In contrast, we don't expect a linear model to be able to do that, given that it is a sum of linear terms with a single set of coefficients. We would likely need two separate models.
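
The point about linear models can be illustrated on synthetic data. In the sketch below everything is invented: the target follows the lagged value when history exists and a generic feature `x` otherwise, and -300 is used as a sentinel for the missing lag. A single least-squares fit has to share one coefficient for `x` across both subsets, while two separate fits do not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
has_history = np.arange(n) % 10 != 0   # 90% of rows have a previous year
x = rng.uniform(20, 100, n)            # some generic feature (hypothetical)
lag = np.where(has_history, rng.uniform(20, 200, n), -300.0)  # sentinel fill

# Synthetic target: driven by the lag when available, by x otherwise.
y = np.where(has_history, lag, 2.0 * x)

def fit_predict(X, y):
    """Ordinary least squares with an intercept; returns in-sample predictions."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# One linear model for all rows, with the history flag as an extra feature.
X_all = np.column_stack([lag, x, has_history.astype(float)])
rmse_single = rmse(y, fit_predict(X_all, y))

# Two linear models, one per subset.
pred = np.empty(n)
m = has_history
pred[m] = fit_predict(np.column_stack([lag[m], x[m]]), y[m])
pred[~m] = fit_predict(x[~m].reshape(-1, 1), y[~m])
rmse_split = rmse(y, pred)

print(f"single model RMSE: {rmse_single:.2f}")
print(f"two models RMSE:   {rmse_split:.2f}")
```

On this noise-free toy target the two separate fits are essentially exact, while the single fit is left with a visible error. Real data will of course be less clear-cut.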

## Results (events with history)
If we limit ourselves to the cases with previous history, which represent about 93% of the test set, we find that the site_eui of the previous year has a linear correlation of about 0.9 with the site_eui of this year. We decide to train a linear model based on the most strongly correlated features (absolute values of the correlation with the target are shown):


```
site_eui_lag_1                        0.912507
energy_star_rating                    0.426835
energy_star_rating_lag_1              0.392748
site_eui_lag_2                        0.190093
january_min_temp                      0.176666
cooling_heating_degree_days           0.166191
january_avg_temp                      0.162738
```
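
A ranking of this kind can be produced with `DataFrame.corrwith`. The sketch below uses synthetic data, so its numbers are unrelated to the table above; only the column names are taken from this document.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
lag = rng.uniform(20, 200, n)
site_eui = lag + rng.normal(0, 10, n)
rating = 100 - 0.3 * site_eui + rng.normal(0, 20, n)
df = pd.DataFrame({
    "site_eui": site_eui,
    "site_eui_lag_1": lag,
    "energy_star_rating": rating,
    "january_min_temp": rng.uniform(-10, 40, n),  # unrelated to the target here
})

# Pearson correlation of every feature with the target, ranked by magnitude.
ranking = (
    df.drop(columns="site_eui")
      .corrwith(df["site_eui"])
      .abs()
      .sort_values(ascending=False)
)
print(ranking)
```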

Even with only the first feature we reach an RMSE of about 22 (~1% difference with respect to training, so almost no overtraining). When adding the extra features (scaled appropriately) we actually get slightly worse performance, an RMSE of about 23 (0.1% difference with respect to training, which gauges the size of the overtraining).
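
For reference, a sketch of how such a comparison could be computed for a one-feature linear fit. It assumes that the "difference with respect to training" is the relative gap between validation and training RMSE; the data are synthetic and the hold-out split is our own choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
lag = rng.uniform(20, 200, n)      # stand-in for site_eui_lag_1
y = lag + rng.normal(0, 22, n)     # synthetic target with noise

# Simple hold-out split: first 80% for training, the rest for validation.
split = int(0.8 * n)
slope, intercept = np.polyfit(lag[:split], y[:split], deg=1)

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

rmse_train = rmse(y[:split], slope * lag[:split] + intercept)
rmse_val = rmse(y[split:], slope * lag[split:] + intercept)

# Relative gap between validation and training error, in percent.
gap_pct = 100 * (rmse_val - rmse_train) / rmse_train
print(f"train RMSE={rmse_train:.2f}  val RMSE={rmse_val:.2f}  gap={gap_pct:+.1f}%")
```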

## Results (events without history) 
For the remaining events without history we train a boosted decision tree (BDT). For the feature engineering we excluded quantities that could bias the result, like year_factor and has_previous_history, included essentially all other features with proper encodings and scaling, and added average temperatures per season.
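
The seasonal averages could be built along these lines. This is a sketch: `january_avg_temp` appears in this document, while the other monthly column names and the month-to-season mapping are assumptions.

```python
import pandas as pd

# Toy monthly averages; values are invented.
df = pd.DataFrame({
    "december_avg_temp": [35.0, 50.0],
    "january_avg_temp": [30.0, 48.0],
    "february_avg_temp": [34.0, 52.0],
    "june_avg_temp": [70.0, 75.0],
    "july_avg_temp": [76.0, 80.0],
    "august_avg_temp": [73.0, 79.0],
})

# Assumed month-to-season mapping (meteorological seasons, northern hemisphere).
seasons = {
    "winter": ["december", "january", "february"],
    "summer": ["june", "july", "august"],
}

# One new column per season: the mean of that season's monthly averages.
for season, months in seasons.items():
    cols = [f"{month}_avg_temp" for month in months]
    df[f"{season}_avg_temp"] = df[cols].mean(axis=1)

print(df[["winter_avg_temp", "summer_avg_temp"]])
```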

We train the BDT on all events in the training set. With very little optimisation, we obtain an RMSE of about 44.

The most important features are, not unexpectedly, the facility_type categories; we had noticed how much the distribution of the target differs depending on the facility_type.

Besides the facility type, the most important features include the minimum temperature in winter and the energy star rating. The latter is expected, given its correlation of about 0.4 with the target; the former is less expected.

## Results total
We get an overall expected performance of about 23.54, obtained by averaging the two RMSEs weighted by the fraction of events with and without history. This would have earned 83rd place if we had participated in the challenge while it was open.
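
The arithmetic behind this figure, using the rounded numbers quoted in this document, is sketched below. As a side note, the RMSE of a pooled sample is obtained by combining the squared errors, so the weighted average of the two RMSEs is an approximation that cannot exceed the pooled value; both are shown for comparison.

```python
import math

# Rounded figures quoted in this document.
rmse_with_history, share_with_history = 22.0, 0.93
rmse_without_history, share_without_history = 44.0, 0.07

# Weighted average of the two RMSEs, as done above.
weighted_avg = (share_with_history * rmse_with_history
                + share_without_history * rmse_without_history)

# Alternative: pool the squared errors, then take the root.
pooled = math.sqrt(share_with_history * rmse_with_history ** 2
                   + share_without_history * rmse_without_history ** 2)

print(round(weighted_avg, 2))  # 23.54
print(round(pooled, 2))        # 24.2
```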

## Some random considerations

Something in the back of my mind: is it better to drop examples when a strong predictor, like the energy star rating, has missing values? I am worried that imputing is actually worse for the performance than setting it to zero. Another possibility is to train different models, depending on whether or not this information is available.