# Challenge 2: Freezing Fritz

Freezing Fritz is a pretty cool guy. He has one problem, though: his house is often too cold or too hot during the night. Then he has to get up and open or close his windows or turn on the heating. Needless to say, he would like to avoid this.

However, his flat has three doors that he can keep open or closed, four radiators, and four windows. There seem to be endless ways of prepping the flat for whatever temperature the night brings.

Fritz does not want to push his luck any longer and has decided to take action. He recorded the temperature outside and inside his bedroom for the last year. Now he would like a model that, given the outside temperature and a particular configuration of his flat, predicts how cold or warm his bedroom will become.

Can you help Freezing Fritz find blissful sleep?

In [54]:
import numpy as np
import pandas as pd

data_train = pd.read_csv("data_train_Temperature.csv")
data_test = pd.read_csv('data_test_Temperature.csv')

random_state = np.random.RandomState(0)

To compare the reference solution efficiently with a selection of scikit-learn models, we rewrote its inverse distance weighting interpolation to fit the scikit-learn estimator API.

In [43]:
class PetersenRegressor:
    
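    def __init__(self, weights = (10, 0.1, 1, 10, 1, 1, 1, 1, 10, 10, 1, 1), exponent = 4):
        # one weight per feature column, stored as an array for element-wise scaling
        self.weights = np.asarray(weights, dtype = float)
        self.exponent = exponent
        # set by fit(); predict() checks these to detect an unfitted model
        self.X = None
        self.Y = None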
        
    
    def distance(self, x, y):
        return np.linalg.norm(self.weights * (x - y))**self.exponent
    
        
    def predict(self, X):
        if self.X is None or self.Y is None:
            raise Exception("Not fitted yet!")
                
        prediction = np.zeros(X.shape[0])
        
        for k, row in enumerate(X):
            # index of row if it is already in fitted data
            index = np.where(np.all(row == self.X, axis = 1))[0]
            
            if index.size == 0:
                # inverse distance weighting
                inverse_distance_weights = np.array([1 / self.distance(row, x) for x in self.X])
                value = np.sum(self.Y * inverse_distance_weights)
                total_inverse_distance = np.sum(inverse_distance_weights)
                prediction[k] = value / total_inverse_distance
                
            else:
                prediction[k] = self.Y[index[0]]
            
        return prediction
        
        
    def fit(self, X, Y):
        # fitting just saves the data as coefficients for the interpolation
        self.X = X
        self.Y = Y
        
        
    def score(self, X, Y):
        # root mean squared error (lower is better, unlike scikit-learn's default R^2 score)
        return np.linalg.norm(self.predict(X) - Y) / np.sqrt(Y.shape[0])
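
The per-row loop in `predict` can also be written with NumPy broadcasting. The following is a minimal sketch of the same inverse distance weighting, not the reference implementation: the function name `idw_predict` and its arguments are hypothetical, and it treats a scaled distance of exactly zero as an exact match, which is assumed to be equivalent to the row comparison in `predict` as long as all weights are nonzero.

```python
import numpy as np

def idw_predict(X_fit, y_fit, X_new, weights, exponent=4):
    """Weighted mean of y_fit, each stored point weighted by 1 / distance**exponent."""
    X_fit = np.asarray(X_fit, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    w = np.asarray(weights, dtype=float)

    # pairwise feature-scaled Euclidean distances, shape (n_new, n_fit)
    diff = (X_new[:, None, :] - X_fit[None, :, :]) * w
    dist = np.linalg.norm(diff, axis=2)

    prediction = np.empty(X_new.shape[0])
    for k, d in enumerate(dist):
        exact = np.flatnonzero(d == 0)
        if exact.size > 0:
            # query coincides with a stored point: return its value directly
            prediction[k] = y_fit[exact[0]]
        else:
            inverse = 1.0 / d**exponent
            prediction[k] = np.sum(inverse * y_fit) / np.sum(inverse)
    return prediction
```

A query that lies at equal scaled distance from two stored points receives the mean of their values, and a query that coincides with a stored point receives that point's value unchanged.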

We train on 80% of the data and then test on the remaining 20%.

In [44]:
from sklearn.model_selection import train_test_split

X = data_train.values[:, :-1]
y = data_train["Temperature Bed"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = random_state)
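
One detail worth keeping in mind: `random_state` is a single `np.random.RandomState` instance, and the same instance is also handed to several regressors further down. Every consumer advances the same random stream, so the split and the fitted models depend on the order in which the cells are run. The sketch below illustrates the effect with plain NumPy; the variable names are illustrative only.

```python
import numpy as np

# a shared generator advances each time it is used ...
shared = np.random.RandomState(0)
first = shared.permutation(10)
second = shared.permutation(10)

# ... while a fresh generator with the same seed replays the first draw
replay = np.random.RandomState(0).permutation(10)

print(np.array_equal(first, replay))
print(np.array_equal(first, second))
```

Passing an integer seed instead (for example `random_state = 0`) gives each call its own reproducible stream.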

To select the best solution, we compare the reference solution with some standard regressors from the scikit-learn library.

In [45]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, TheilSenRegressor, RANSACRegressor, Ridge, BayesianRidge
from sklearn.kernel_ridge import KernelRidge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Sum, WhiteKernel, DotProduct
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor

regressors = {
    "Petersen" :          PetersenRegressor(),
    "Nearest Neighbors" : KNeighborsRegressor(),
    "linear Model" :      LinearRegression(),
    "linear SVM" :        SVR(kernel = "linear"),
    "polynomial SVM":     SVR(kernel = "poly"),
    "radial SVM":         SVR(kernel = "rbf"),
    "Theil-Sen" :         TheilSenRegressor(random_state = random_state),
    "RANSAC" :            RANSACRegressor(random_state = random_state),
    "Ridge" :             Ridge(random_state = random_state),
    "Bayesian Ridge" :    BayesianRidge(),
    "linear Ridge" :      KernelRidge(kernel = "linear"),
    "polynomial Ridge":   KernelRidge(kernel = "poly"),
    "radial Ridge":       KernelRidge(kernel = "rbf"),
    "Gaussian Process" :  GaussianProcessRegressor(kernel = Sum(DotProduct(), WhiteKernel()), random_state = random_state),
    "Decision Tree" :     DecisionTreeRegressor(random_state = random_state),
    "Random Forest" :     RandomForestRegressor(random_state = random_state),
    "Ada Boost" :         AdaBoostRegressor(random_state = random_state),
    "Gradient Boost" :    GradientBoostingRegressor(random_state = random_state),
    "Neural Network" :    MLPRegressor(max_iter = 2000, alpha = 1, random_state = random_state)
}

In [46]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, max_error, explained_variance_score

comparison = pd.DataFrame()

for name, reg in regressors.items():
    print(f"{name:17} : ", end="")
    reg.fit(X_train, y_train)
    print("fitted, ", end="")
    prediction = reg.predict(X_test)
    print("predicted, ", end="")
    # np.sqrt keeps this working on scikit-learn versions without the `squared` argument
    comparison.loc[name, "root mean squared"] = np.sqrt(mean_squared_error(y_test, prediction))
    comparison.loc[name, "mean absolute"] = mean_absolute_error(y_test, prediction)
    comparison.loc[name, "max"] = max_error(y_test, prediction)
    comparison.loc[name, "explained variance"] = explained_variance_score(y_test, prediction)
    print("scored.")

Petersen          : fitted, predicted, scored.
Nearest Neighbors : fitted, predicted, scored.
linear Model      : fitted, predicted, scored.
linear SVM        : fitted, predicted, scored.
polynomial SVM    : fitted, predicted, scored.
radial SVM        : fitted, predicted, scored.
Theil-Sen         : fitted, predicted, scored.
RANSAC            : fitted, predicted, scored.
Ridge             : fitted, predicted, scored.
Bayesian Ridge    : fitted, predicted, scored.
linear Ridge      : fitted, predicted, scored.
polynomial Ridge  : fitted, predicted, scored.
radial Ridge      : fitted, predicted, scored.
Gaussian Process  : fitted, predicted, scored.
Decision Tree     : fitted, predicted, scored.
Random Forest     : fitted, predicted, scored.
Ada Boost         : fitted, predicted, scored.
Gradient Boost    : fitted, predicted, scored.
Neural Network    : fitted, predicted, scored.
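
For reference, the four columns of the comparison can be written out directly in NumPy. This is a sketch of the standard definitions for single-output regression, with a hypothetical helper name `error_metrics`; it is meant to show what each column measures, not to replace the scikit-learn functions.

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """Return the four comparison metrics for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    return {
        # square root of the mean squared residual
        "root mean squared": np.sqrt(np.mean(residual**2)),
        # mean of the absolute residuals
        "mean absolute": np.mean(np.abs(residual)),
        # largest absolute residual
        "max": np.max(np.abs(residual)),
        # share of the target variance not left in the residuals
        "explained variance": 1.0 - np.var(residual) / np.var(y_true),
    }
```

The first three metrics are errors in degrees, so lower is better; explained variance is at most 1, and higher is better.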


In [47]:
comparison.style.highlight_min(["root mean squared", "mean absolute", "max"], "green").highlight_max("explained variance", "green")

| Regressor | root mean squared | mean absolute | max | explained variance |
|---|---|---|---|---|
| Petersen | 1.974394 | 1.473742 | 6.966961 | 0.762518 |
| Nearest Neighbors | 3.023535 | 2.549908 | 7.90705 | 0.448827 |
| linear Model | 2.097024 | 1.65455 | 5.898854 | 0.731928 |
| linear SVM | 2.151175 | 1.651761 | 6.325103 | 0.724112 |
| polynomial SVM | 3.139452 | 2.480373 | 7.367465 | 0.399296 |
| radial SVM | 2.685788 | 2.05724 | 8.425278 | 0.580823 |
| Theil-Sen | 2.097831 | 1.647327 | 6.18189 | 0.731728 |
| RANSAC | 2.909287 | 2.308985 | 7.882212 | 0.530407 |
| Ridge | 2.097573 | 1.655372 | 5.889435 | 0.731793 |
| Bayesian Ridge | 2.09923 | 1.65729 | 5.867501 | 0.731385 |


The best performer is the random forest regressor, which we now train on the whole dataset before saving its predictions for the given test set.

In [53]:
# seed the forest so the saved predictions are reproducible
regressor = RandomForestRegressor(random_state = random_state)
regressor.fit(X, y)

prediction = regressor.predict(data_test.values[:, :-1])
np.savetxt('ToTheMooners_prediction.csv', prediction, delimiter=',')