Disclaimer: 
This colab is based on the notebook published on Kaggle. The original version can be found [here](https://www.kaggle.com/usharengaraju/wids2022-lgbm-starter-w-b#Feature-Scaling) 

# Setting up Kaggle credentials and downloading the data

In [None]:
#@title upload your Kaggle credentials
#@markdown You can generate them in your Kaggle user profile
from google.colab import files
files.upload()

In [None]:
#@title Download the WiDS datasets
#@markdown Make sure your credentials are up-to-date and you have accepted the competition's terms and conditions
! pip install -q kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets list

# `! cd` runs in a throwaway subshell; the %cd magic changes the notebook's working directory
%cd /content
! kaggle competitions download -c widsdatathon2022
! unzip train.csv.zip
! unzip test.csv.zip

<b>Problem Statement:</b> <p> Climate change is a globally relevant, urgent, and multi-faceted issue heavily impacted by energy policy and infrastructure. Addressing climate change involves mitigation (i.e. mitigating greenhouse gas emissions) and adaptation (i.e. preparing for unavoidable consequences). Mitigation of GHG emissions requires changes to electricity systems, transportation, buildings, industry, and land use. </p>

<p>According to a report issued by the International Energy Agency (IEA), the lifecycle of buildings from construction to demolition was responsible for 37% of global energy-related and process-related CO2 emissions in 2020. Yet it is possible to drastically reduce the energy consumption of buildings by a combination of easy-to-implement fixes and state-of-the-art strategies. For example, retrofitted buildings can reduce heating and cooling energy requirements by 50-90 percent. Many of these energy efficiency measures also result in overall cost savings and yield other benefits, such as cleaner air for occupants. This potential can be achieved while maintaining the services that buildings provide.</p>

<b>Goal: </b> <p>
The goal of this competition is to predict the energy consumption using building characteristics and climate and weather variables. </p>

# Importing libraries

In [None]:
# Essentials
import numpy as np
import pandas as pd
import datetime
import random

# Plots
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings("ignore", category=FutureWarning)


%matplotlib inline
sns.set(style="whitegrid", palette="muted", font_scale=1.5)
plt.rcParams["figure.figsize"] = (10, 5)

## Loading Files and Exploratory Data Analysis

We'll start by loading our data (both train and test, but remember not to look at the test set!) and performing some simple analyses to understand the data better.

We'll want to look at:
* which features are included, and their type
* how are features distributed, and do they correlate with our target?
* do we have any missing values? If so, how do we want to handle those?
* ...


In [None]:
data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

In [None]:
# Quickly check the shape (rows, columns) of our train and test sets.
# The training set has one additional column, which is our label.
print("Shape of the training set:", data.shape)
print("Shape of the test set:", test_data.shape)

In [None]:
data.head()

In [None]:
# Which columns are in the dataset?

In [None]:
# Generate a quick description of your dataset, including key statistics
# (count, mean, std, min, max etc.) for categorical and numerical columns.

In [None]:
# Check for missing data
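
The three inspection steps above could be sketched as follows. This is only an illustration on a small made-up frame (`toy` and its values are hypothetical, and `facility_type` is an assumed column name); the same calls should work on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; all values are made up.
toy = pd.DataFrame(
    {
        "facility_type": ["Office", "Warehouse", "Office", None],
        "floor_area": [1200.0, 5400.0, np.nan, 800.0],
        "site_eui": [80.5, 32.1, 95.0, 60.2],
    }
)

# Which columns are in the dataset, and what are their types?
column_types = toy.dtypes

# Key statistics for numerical and categorical columns alike.
summary = toy.describe(include="all")

# Number of missing values per column, largest first.
missing_counts = toy.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```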

### Cleaning up data types
We want to ensure that categorical columns are represented accordingly. 

In [None]:
# find categorical columns. Hint: those are usually not numerical.
# change the data-type for each of the categorical columns to 'category'

In [None]:
# create a new list of all your numerical columns. Hint: `select_dtypes(include='number')`
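
One possible way to handle both steps is sketched below, assuming that every non-numerical column should be treated as categorical. The frame `toy` and the column name `building_class` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame; `building_class` is an assumed column name.
toy = pd.DataFrame(
    {
        "building_class": ["Commercial", "Residential", "Commercial"],
        "floor_area": [1200.0, 5400.0, 800.0],
        "year_built": [1990, 2005, 1978],
    }
)

# Columns that are not numerical are treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
for col in categorical_cols:
    toy[col] = toy[col].astype("category")

# Everything that is numerical.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
print(categorical_cols, numerical_cols)
```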

## EDA
We'll now have a closer look at our data.
We'll want to know how our target is distributed, and how our features relate to the target.

Our target is called `site_eui`

In [None]:
# first, create a plot of the target distribution
# hint: seaborn can easily plot distributions

The target distribution is positively skewed, with a long right tail.
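
Skewness can also be checked numerically rather than by eye. The sketch below uses a synthetic, right-skewed sample as a stand-in for `data["site_eui"]` (the lognormal parameters are arbitrary assumptions); on the real data, `data["site_eui"].skew()` gives the same kind of number.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for the real target column.
rng = np.random.default_rng(0)
target = pd.Series(rng.lognormal(mean=4.0, sigma=0.6, size=5000), name="site_eui")

# A positive value indicates a longer right tail.
skewness = target.skew()

# With a long right tail, the mean is pulled above the median.
mean_value, median_value = target.mean(), target.median()
print(f"skew: {skewness:.2f}, mean: {mean_value:.1f}, median: {median_value:.1f}")
```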


Next, let's have a look at the relationship between our target and the categorical variables.

In [None]:
# boxplots or violinplots are great to compare distributions

There doesn't seem to be a big difference between commercial and residential properties. Some facility types, however, have fairly different target distributions.

Let's create a similar plot for (a selection of) our numerical features.

In [None]:
cols = [
    "Year_Factor",
    "floor_area",
    "year_built",
    "energy_star_rating",
    "ELEVATION",
    "cooling_degree_days",
    "heating_degree_days",
    "precipitation_inches",
    "snowfall_inches",
    "snowdepth_inches",
    "avg_temp",
    "days_below_30F",
    "days_below_20F",
    "days_below_10F",
    "days_below_0F",
    "days_above_80F",
    "days_above_90F",
    "days_above_100F",
    "days_above_110F",
    "direction_max_wind_speed",
    "direction_peak_wind_speed",
    "max_wind_speed",
    "days_with_fog",
]

# hint: check out seaborn again to plot distributions.

An important driver of energy consumption is probably the temperature in each month.
Let's visualise the minimum, maximum and average temperatures for each month.

In [None]:
min_temp = [
    "january_min_temp",
    "february_min_temp",
    "march_min_temp",
    "april_min_temp",
    "may_min_temp",
    "june_min_temp",
    "july_min_temp",
    "august_min_temp",
    "september_min_temp",
    "october_min_temp",
    "november_min_temp",
    "december_min_temp",
]

max_temp = [
    "january_max_temp",
    "february_max_temp",
    "march_max_temp",
    "april_max_temp",
    "may_max_temp",
    "june_max_temp",
    "july_max_temp",
    "august_max_temp",
    "september_max_temp",
    "october_max_temp",
    "november_max_temp",
    "december_max_temp",
]

avg_temp = [
    "january_avg_temp",
    "february_avg_temp",
    "march_avg_temp",
    "april_avg_temp",
    "may_avg_temp",
    "june_avg_temp",
    "july_avg_temp",
    "august_avg_temp",
    "september_avg_temp",
    "october_avg_temp",
    "november_avg_temp",
    "december_avg_temp",
]
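
One way to prepare these columns for plotting is to reduce them to a single mean per month and temperature type. The sketch below does this on synthetic values (the frame `toy` and its numbers are made up); the column naming pattern `<month>_<kind>_temp` follows the lists above.

```python
import numpy as np
import pandas as pd

months = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
# Arbitrary offsets so that min < avg < max in the synthetic data.
offsets = {"min": -10.0, "avg": 0.0, "max": 10.0}

# Synthetic stand-in for the temperature columns of `data`.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        f"{month}_{kind}_temp": rng.normal(50.0 + offset, 5.0, size=200)
        for month in months
        for kind, offset in offsets.items()
    }
)

# One row per month, one column per temperature type.
monthly = pd.DataFrame(
    {
        kind: [toy[f"{month}_{kind}_temp"].mean() for month in months]
        for kind in offsets
    },
    index=months,
)
print(monthly.round(1))
# monthly.plot(marker="o") would then draw the three curves.
```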


Next, investigate the correlations between your features.
A heatmap of the correlation matrix is a good starting point; from there, identify pairs of columns that are particularly highly correlated.

In [None]:
# code for your plot goes here.
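
Beyond the heatmap itself, highly correlated pairs can be listed directly from the correlation matrix. The sketch below uses synthetic data in which one column is nearly a copy of another; the 0.9 threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

# Synthetic data: the second column is almost a copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
toy = pd.DataFrame(
    {
        "heating_degree_days": base,
        "days_below_30F": base + rng.normal(scale=0.1, size=500),
        "floor_area": rng.normal(size=500),  # independent of the others
    }
)

corr = toy.corr()

# Keep only the upper triangle so that each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna()
strong_pairs = pairs[pairs.abs() > 0.9]
print(strong_pairs)
# sns.heatmap(corr, cmap="coolwarm", center=0) would draw the full matrix.
```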

## Data preparation

### Handling missing values

As we've seen above, some of our columns have missing values. We have different options to handle those, and all of them have their pros and cons:

* drop the rows with missing features
* replace missing values with the mean/median of the column
* replace missing values with a fixed value
* replace missing values with a value outside the column's distribution; this still allows us to fit a model, but effectively keeps `nan` as a possible feature value.

Note that whatever your strategy, you'll need to apply this to your test-data as well.

Tip: have a look at `sklearn.impute.SimpleImputer`!
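
A minimal sketch of median imputation with `SimpleImputer` is shown below, assuming the median is an acceptable replacement for these columns. The frames `train` and `test` are made-up stand-ins for `data` and `test_data`. The imputer is fitted on the training data only and then reused for the test data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up stand-ins for `data` and `test_data`.
train = pd.DataFrame(
    {
        "energy_star_rating": [50.0, np.nan, 70.0, 90.0],
        "max_wind_speed": [1.0, 2.0, np.nan, 3.0],
    }
)
test = pd.DataFrame(
    {
        "energy_star_rating": [np.nan, 40.0],
        "max_wind_speed": [np.nan, 5.0],
    }
)
cols = ["energy_star_rating", "max_wind_speed"]

# Learn the medians from the training data, then apply them to both sets.
imputer = SimpleImputer(strategy="median")
train[cols] = imputer.fit_transform(train[cols])
test[cols] = imputer.transform(test[cols])
print(test)
```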

In [None]:
# year_built: replace with current year.
data["year_built"] = data["year_built"].replace(np.nan, 2022)

## for test data
test_data["year_built"] = test_data["year_built"].replace(np.nan, 2022)

In [None]:
from sklearn.impute import SimpleImputer

null_col = [
    "energy_star_rating",
    "direction_max_wind_speed",
    "direction_peak_wind_speed",
    "max_wind_speed",
    "days_with_fog",
]


## don't forget to apply the same transformations to your test data

In [None]:
# good place to double-check what your preprocessed data now looks like.

In [None]:
# rechecking null values
cols_with_missing = [col for col in data.columns if data[col].isnull().any()]
cols_with_missing

### Feature one-hot encoding and scaling
First, we need to make sure that all our data is present in numerical format (instead of as a string). For this, we use something called dummy- or one-hot encoding. 
Have a look at `sklearn.preprocessing.OneHotEncoder` or `pd.get_dummies` for this.

Depending on the model chosen, it can be important that all features are on a similar scale (e.g., -1 to +1). This can help the model learn more efficiently and not get 'distracted' by some features having a much larger scale than others.

There are different ways to scale features – have a look at `sklearn.preprocessing` for different Scalers available. 
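
Both steps could look roughly like the sketch below, which uses `pd.get_dummies` and `StandardScaler` on made-up frames (`train`, `test` and the column `facility_type` are assumptions for illustration). The test frame is reindexed to the training columns so that both end up with the same features, and the scaler is fitted on the training data only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for the train and test features.
train = pd.DataFrame(
    {
        "facility_type": ["Office", "Warehouse", "Office"],
        "floor_area": [1000.0, 2000.0, 3000.0],
    }
)
test = pd.DataFrame(
    {
        "facility_type": ["Office", "School"],
        "floor_area": [1500.0, 2500.0],
    }
)

# One-hot encode the categorical column in both frames.
train_enc = pd.get_dummies(train, columns=["facility_type"])
test_enc = pd.get_dummies(test, columns=["facility_type"])

# Give the test frame exactly the training columns: categories unseen in
# training are dropped, categories missing from the test set are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

# Fit the scaler on the training data and reuse it for the test data.
scaler = StandardScaler()
train_enc[["floor_area"]] = scaler.fit_transform(train_enc[["floor_area"]])
test_enc[["floor_area"]] = scaler.transform(test_enc[["floor_area"]])
print(test_enc)
```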


In [None]:
y = data["site_eui"]
X = data.drop(["site_eui"], axis=1)

In [None]:
# Apply one-hot encoding
from sklearn.preprocessing import OneHotEncoder

# remember to apply the same transformation to your test data.


In [None]:
from sklearn.preprocessing import StandardScaler

## Model training

Note: it is usually best to start with a simple model first. We strongly recommend testing linear models before moving to boosted trees.

A few last steps before we can train our first model:
- we want to split our training data once more, so we can tune our hyperparameters while reducing the risk of overfitting
- we need to choose a model, starting with a simple one (e.g., a linear one) as noted above.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=50
)

### Baseline Model: Linear Regression

We start with a simple linear regression model to get a baseline performance on this prediction task. 

In [None]:
from sklearn.linear_model import LinearRegression

# fit your model and generate predictions for your training and test set

Let's plot our predictions against our actual target values and calculate our model performance both for the training and the validation set.
Have a look at `sklearn.metrics` to see which common evaluation metrics are available.
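
As an illustration of the evaluation step, the sketch below fits a linear regression on synthetic data and reports RMSE (the square root of the mean squared error) and R² for both splits. The data, the variable names and the choice of metrics are assumptions, not the notebook's reference solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem: a linear signal plus a little noise.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=500)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=50
)
model = LinearRegression().fit(X_tr, y_tr)

# Comparing both splits helps to diagnose over- or underfitting.
scores = {}
for name, X_part, y_part in [("train", X_tr, y_tr), ("validation", X_va, y_va)]:
    pred = model.predict(X_part)
    rmse = np.sqrt(mean_squared_error(y_part, pred))
    scores[name] = {"rmse": rmse, "r2": r2_score(y_part, pred)}
    print(f"{name}: RMSE={rmse:.3f}, R2={scores[name]['r2']:.3f}")
```

A much lower error on the training split than on the validation split would point to overfitting; similarly poor scores on both would point to underfitting.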

In [None]:
# try making a scatter plot

In [None]:
# second scatter plot for test data

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# calculate the metrics for train and test set. This helps you diagnose over- or underfitting

Overall, this is not too bad for a simple linear model. But let's try something a bit more complicated to see if we can improve the model performance further.

## Non-linear model: Gradient boosted trees example

A powerful and fast algorithm is [LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html). The documentation is really helpful, so take some time to have a look there! 

In [None]:
import lightgbm


As before, let's plot our model predictions against the true values, and calculate some model performance metrics.

In [None]:
# scatterplot for training data

In [None]:
# scatterplot for test data

In [None]:
# MSE and R2 scores for train and test data

Let's see if we can improve this model further!
The next step is to use a grid search to look for better hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV

# check out the documentation and define a grid of parameters you'd like to explore.
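
A grid search could be set up along the lines of the sketch below. To keep it dependent on scikit-learn only, it uses `GradientBoostingRegressor` as a stand-in for LightGBM; `lightgbm.LGBMRegressor` follows the scikit-learn estimator interface and can be passed to `GridSearchCV` in the same way. The data is synthetic and the grid values are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic non-linear regression problem.
rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(300, 2))
y_toy = np.sin(X_toy[:, 0]) + X_toy[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# Arbitrary example grid: 2 x 2 x 2 = 8 combinations.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in for LGBMRegressor
    param_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises scores
    cv=3,
)
search.fit(X_toy, y_toy)

print("best parameters:", search.best_params_)
print("best cross-validated RMSE:", -search.best_score_)
```

Because every combination is refitted for every fold, the cost grows quickly with the grid size, so it helps to start with a small grid.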

## Generating predictions on test data (held-out dataset)

Finally, if we are happy with our model, we can use it to generate predictions based on the test-data and upload those as our submission.

Best of luck with the WiDS 2022 Datathon!

In [None]:
# testdata prediction
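
Writing a submission file could look like the sketch below. It assumes that the test data carries an `id` column and that the submission expects the columns `id` and `site_eui`; check the competition's sample submission before relying on this. The frames, the model and the file name `submission.csv` are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up stand-ins for the preprocessed train and test data.
train = pd.DataFrame(
    {"floor_area": [1.0, 2.0, 3.0, 4.0], "site_eui": [10.0, 20.0, 30.0, 40.0]}
)
test = pd.DataFrame({"id": [100, 101], "floor_area": [5.0, 6.0]})

features = ["floor_area"]
model = LinearRegression().fit(train[features], train["site_eui"])

# One prediction per test row, keyed by the (assumed) `id` column.
submission = pd.DataFrame(
    {"id": test["id"], "site_eui": model.predict(test[features])}
)
submission.to_csv("submission.csv", index=False)
print(submission)
```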
