# Challenge 2: Freezing Fritz

Freezing Fritz is a pretty cool guy. He has one problem, though. In his house, it is quite often too cold or too hot during the night. Then he has to get up and open or close his windows or turn on the heat. Needless to say, he would like to avoid this. 

However, his flat has three doors that he can keep open or closed, four radiators, and four windows. There seem to be endless ways of prepping the flat for whatever temperature the night brings. 

Fritz does not want to push his luck any longer and has decided to take action. He recorded the temperature outside and inside his bedroom for the last year. Now he would like a model that, given the outside temperature and a certain configuration of his flat, predicts how cold or warm his bedroom will become.

Can you help Freezing Fritz to find blissful sleep?

In [1]:
import numpy as np
import pandas as pd

data_train = pd.read_csv("data_train_Temperature.csv")
data_test = pd.read_csv('data_test_Temperature.csv')

random_state = np.random.RandomState(0)

To compare the reference solution efficiently with a selection of scikit-learn models, we rewrote its inverse distance weighting interpolation to fit the scikit-learn estimator API.
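For orientation, the interpolation rule can be written out as a formula. The notation is our own and not part of the reference solution: $x_i, y_i$ are the stored training rows and targets, $w$ is the vector of per-feature `weights`, $\odot$ is the element-wise product, and $p$ is the `exponent` (4 by default).

```latex
\hat{y}(x) = \frac{\sum_{i} \lambda_i(x)\, y_i}{\sum_{i} \lambda_i(x)},
\qquad
\lambda_i(x) = \frac{1}{\lVert w \odot (x - x_i) \rVert_2^{\,p}}
```

If $x$ coincides with a stored row $x_i$, the weight $\lambda_i$ is undefined, so the prediction is simply the stored target $y_i$. A large entry in $w$ makes differences in that feature count more, so training rows that differ in it contribute less to the prediction.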

In [2]:
class PetersenRegressor:
    """Inverse distance weighting interpolation behind a scikit-learn style interface."""
    
    def __init__(self, weights = (10, 0.1, 1, 10, 1, 1, 1, 1, 10, 10, 1, 1), exponent = 4):
        # one scaling factor per feature column; a tuple avoids a shared mutable default
        self.weights = np.asarray(weights, dtype = float)
        self.exponent = exponent
        # training data, set by fit(); None marks the model as not fitted
        self.X = None
        self.Y = None
        
    
    def distance(self, x, y):
        # weighted Euclidean distance, raised to the given exponent
        return np.linalg.norm(self.weights * (x - y))**self.exponent
    
        
    def predict(self, X):
        if self.X is None or self.Y is None:
            raise Exception("Not fitted yet!")
                
        prediction = np.zeros(X.shape[0])
        
        for k, row in enumerate(X):
            # index of row if it is already in fitted data
            index = np.where(np.all(row == self.X, axis = 1))[0]
            
            if index.size == 0:
                # inverse distance weighting
                inverse_distance_weights = np.array([1 / self.distance(row, x) for x in self.X])
                value = np.sum(self.Y * inverse_distance_weights)
                total_inverse_distance = np.sum(inverse_distance_weights)
                prediction[k] = value / total_inverse_distance
                
            else:
                # exact match: return the stored target and avoid dividing by zero
                prediction[k] = self.Y[index[0]]
            
        return prediction
        
        
    def fit(self, X, Y):
        # fitting just saves the data as coefficients for the interpolation
        self.X = X
        self.Y = Y
        return self
        
        
    def score(self, X, Y):
        # root mean squared error (lower is better, unlike the R^2 score of scikit-learn regressors)
        return np.linalg.norm(self.predict(X) - Y) / np.sqrt(Y.shape[0])
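The same interpolation can be sketched in a more vectorized form. This is only an illustration and not the reference solution: the function name `idw_predict` and the toy data are made up, and an exact match is detected by a distance of zero rather than by comparing rows, which is equivalent as long as all weights are non-zero.

```python
import numpy as np

def idw_predict(X_fit, y_fit, X_new, weights, exponent=4):
    """Inverse distance weighting with pairwise distances computed in one step."""
    X_fit = np.asarray(X_fit, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # weighted differences between every query row and every stored row,
    # shape (n_new, n_fit, n_features)
    diff = weights * (X_new[:, None, :] - X_fit[None, :, :])
    # distances raised to the exponent, shape (n_new, n_fit)
    dist = np.linalg.norm(diff, axis=2) ** exponent

    prediction = np.empty(X_new.shape[0])
    for k, d in enumerate(dist):
        exact = np.flatnonzero(d == 0)
        if exact.size > 0:
            # query coincides with a stored row: return its target directly
            prediction[k] = y_fit[exact[0]]
        else:
            inverse = 1.0 / d
            prediction[k] = np.sum(inverse * y_fit) / np.sum(inverse)
    return prediction

# toy data, made up for illustration only
X_toy = np.array([[0.0], [2.0]])
y_toy = np.array([10.0, 20.0])
X_query = np.array([[1.0], [2.0]])
toy_prediction = idw_predict(X_toy, y_toy, X_query, weights=[1.0])
print(toy_prediction)
```

The first query lies midway between the two stored points, so both get the same weight and the prediction is their average, 15. The second query matches a stored point exactly and returns its target, 20.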

We train on 80% of the data and then test on the remaining 20%.

In [9]:
from sklearn.model_selection import train_test_split

X = data_train.values[:, :-1]
y = data_train["Temperature Bed"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state)

To select the best solution, we compare the reference solution to some standard regressors from the scikit-learn library.

In [10]:
from sklearn.linear_model import LinearRegression, Ridge, SGDRegressor
from sklearn.linear_model import ElasticNet, Lars, Lasso, LassoLars, OrthogonalMatchingPursuit
from sklearn.linear_model import ARDRegression, BayesianRidge
from sklearn.linear_model import HuberRegressor, TheilSenRegressor, RANSACRegressor
from sklearn.linear_model import PoissonRegressor, TweedieRegressor, GammaRegressor
from sklearn.svm import SVR
from sklearn.kernel_ridge import KernelRidge
from sklearn.cross_decomposition import PLSRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Sum, WhiteKernel, DotProduct
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

regressors = {
    "Petersen" :          PetersenRegressor(),
    "linear Model" :      LinearRegression(),
    "Ridge" :             Ridge(random_state = random_state),
    "SGD":                SGDRegressor(random_state = random_state),
    "Elastic Net":        ElasticNet(random_state = random_state),
    "Lars":               Lars(random_state = random_state),
    "Lasso":              Lasso(random_state = random_state),
    "LassoLars":          LassoLars(random_state = random_state),
    "OMP":                OrthogonalMatchingPursuit(),
    "ARD":                ARDRegression(),
    "Bayesian Ridge" :    BayesianRidge(),
    "Huber":              HuberRegressor(),
    "Theil-Sen" :         TheilSenRegressor(random_state = random_state),
    "RANSAC" :            RANSACRegressor(random_state = random_state),
    "Poisson":            PoissonRegressor(),
    "Tweedie":            TweedieRegressor(),
    "Gamma":              GammaRegressor(),
    "linear SVM" :        SVR(kernel = "linear"),
    "polynomial SVM":     SVR(kernel = "poly"),
    "radial SVM":         SVR(kernel = "rbf"),
    "linear Ridge" :      KernelRidge(kernel = "linear"),
    "polynomial Ridge":   KernelRidge(kernel = "poly"),
    "radial Ridge":       KernelRidge(kernel = "rbf"),
    "PLS":                PLSRegression(n_components = 2),
    "Gaussian Process" :  GaussianProcessRegressor(kernel = Sum(DotProduct(), WhiteKernel()), random_state = random_state),
    "Nearest Neighbors" : KNeighborsRegressor(n_neighbors = 2),
    "Decision Tree" :     DecisionTreeRegressor(random_state = random_state),
    "Random Forest" :     RandomForestRegressor(random_state = random_state),
    "Ada Boost" :         AdaBoostRegressor(random_state = random_state),
    "Gradient Boost" :    GradientBoostingRegressor(random_state = random_state),
    "Neural Network" :    MLPRegressor(max_iter = 2000, alpha = 1, random_state = random_state)
}

In [11]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, max_error, explained_variance_score

comparison = pd.DataFrame()

for name, reg in regressors.items():
    print(f"{name:17} : ", end="")
    reg.fit(X_train, y_train)
    print("fitted, ", end="")
    prediction = reg.predict(X_test)
    print("predicted, ", end="")
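    # root mean squared error; np.sqrt avoids the squared = False argument, which newer scikit-learn releases no longer accept
    comparison.loc[name, "root mean squared"] = np.sqrt(mean_squared_error(y_test, prediction))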
    comparison.loc[name, "mean absolute"] = mean_absolute_error(y_test, prediction)
    comparison.loc[name, "max"] = max_error(y_test, prediction)
    comparison.loc[name, "explained variance"] = explained_variance_score(y_test, prediction)
    print("scored.")

Petersen          : fitted, predicted, scored.
linear Model      : fitted, predicted, scored.
Ridge             : fitted, predicted, scored.
SGD               : fitted, predicted, scored.
Elastic Net       : fitted, predicted, scored.
Lars              : fitted, predicted, scored.
Lasso             : fitted, predicted, scored.
LassoLars         : fitted, predicted, scored.
OMP               : fitted, predicted, scored.
ARD               : fitted, predicted, scored.
Bayesian Ridge    : fitted, predicted, scored.
Huber             : fitted, predicted, scored.
Theil-Sen         : fitted, predicted, scored.
RANSAC            : fitted, predicted, scored.
Poisson           : fitted, predicted, scored.
Tweedie           : fitted, predicted, scored.
Gamma             : fitted, predicted, scored.
linear SVM        : fitted, predicted, scored.
polynomial SVM    : fitted, predicted, scored.
radial SVM        : fitted, predicted, scored.
linear Ridge      : fitted, predicted, scored.
polynomial Ri

In [6]:
comparison.style.highlight_min(["root mean squared", "mean absolute", "max"], "green").highlight_max("explained variance", "green")

| Regressor | root mean squared | mean absolute | max | explained variance |
|---|---|---|---|---|
| Petersen | 1.974394 | 1.473742 | 6.966961 | 0.762518 |
| linear Model | 2.097024 | 1.65455 | 5.898854 | 0.731928 |
| Ridge | 2.097573 | 1.655372 | 5.889435 | 0.731793 |
| SGD | 2.115462 | 1.688157 | 6.227903 | 0.727174 |
| Elastic Net | 3.001744 | 2.566644 | 7.544735 | 0.456665 |
| Lars | 2.097024 | 1.65455 | 5.898854 | 0.731928 |
| Lasso | 3.151712 | 2.692609 | 7.494738 | 0.401241 |
| LassoLars | 4.06136 | 3.423956 | 9.457016 | 0.0 |
| OMP | 3.211462 | 2.719296 | 7.338607 | 0.378899 |
| ARD | 2.10241 | 1.649036 | 6.106473 | 0.730526 |
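The highlighted cells can also be read off programmatically. The following is a minimal sketch on a stand-in frame, `comparison_subset`, which holds only three rows copied from the table above; its winner is therefore the best of those three rows, not of the full comparison. In the notebook, the same calls would be applied to `comparison` itself.

```python
import pandas as pd

# stand-in for `comparison`: three rows copied from the table above
comparison_subset = pd.DataFrame(
    {
        "root mean squared": [1.974394, 2.097024, 3.001744],
        "explained variance": [0.762518, 0.731928, 0.456665],
    },
    index=["Petersen", "linear Model", "Elastic Net"],
)

# lowest error and highest explained variance among these rows
best_by_rmse = comparison_subset["root mean squared"].idxmin()
best_by_variance = comparison_subset["explained variance"].idxmax()

# full ranking, best (smallest error) first
ranked = comparison_subset.sort_values("root mean squared")
print(best_by_rmse, best_by_variance)
print(ranked)
```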


The best performer is the random forest regressor. We now train it on the whole dataset and save its predictions for the given test set.

In [7]:
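# seeded like the models in the comparison, so that the saved predictions are reproducible
regressor = RandomForestRegressor(random_state = random_state)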
regressor.fit(X, y)

prediction = regressor.predict(data_test.values[:, :-1])
np.savetxt('ToTheMooners_prediction.csv', prediction, delimiter=',')