In [1]:
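# Patch scikit-learn before importing any estimator, otherwise the
# Intel-optimized implementations are not picked up
from sklearnex import patch_sklearn
patch_sklearn()

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Local helper module, plus importlib to reload it after edits
from notebooks import utility
import importlib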

Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


## Data import
Let's import the previously cleaned data.

In [2]:
X_train = pd.read_csv("../DWMProjectData/formodel/X_train.csv")
y_train = pd.read_csv("../DWMProjectData/formodel/y_train.csv")
X_valid = pd.read_csv("../DWMProjectData/formodel/X_valid.csv")
y_valid = pd.read_csv("../DWMProjectData/formodel/y_valid.csv")
X_test = pd.read_csv("../DWMProjectData/formodel/X_test.csv")
y_test = pd.read_csv("../DWMProjectData/formodel/y_test.csv")
# Flatten each y into a 1-dimensional array to avoid a shape warning when building models
y_train = np.ravel(y_train)
y_valid = np.ravel(y_valid)
y_test = np.ravel(y_test)
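
As a small illustration of the flattening step: `pd.read_csv` returns a one-column DataFrame of shape `(n, 1)`, and `np.ravel` turns it into an array of shape `(n,)`. The column name `price` and the values below are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one of the y_*.csv files: a single target column.
y_frame = pd.DataFrame({"price": [0.1, 0.2, 0.3]})

# np.ravel converts the (3, 1) DataFrame into a flat (3,) NumPy array.
y_flat = np.ravel(y_frame)

print(y_frame.shape, y_flat.shape)
```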

## Scale data
For linear regression the data is scaled so that every feature starts out with the same importance.
Whether all features should be given the same importance is debatable, since they do not contribute equally to the price evaluation. For the sake of simplicity, the first model is built this way, which is still better than a model built with random feature importance.

In [3]:
importlib.reload(utility)  # reload first so that edits to utility.py are picked up
from notebooks.utility import scale
X_train, X_valid, X_test = scale(X_train, X_valid, X_test)
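
The body of `scale` lives in `utility.py` and is not shown in this notebook. A minimal sketch of what such a helper could look like, assuming it standardizes with `StandardScaler` fitted on the training split only; the name `scale_sketch` and the tiny frames are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def scale_sketch(X_train, X_valid, X_test):
    """Hypothetical stand-in for utility.scale: standardize with training statistics."""
    scaler = StandardScaler()
    # Fit on the training split only, so no validation/test information leaks in.
    scaler.fit(X_train)
    return tuple(
        pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)
        for X in (X_train, X_valid, X_test)
    )


# Tiny made-up frames, only to show the call shape.
train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})
valid = pd.DataFrame({"a": [2.0], "b": [20.0]})
test = pd.DataFrame({"a": [4.0], "b": [40.0]})

train_s, valid_s, test_s = scale_sketch(train, valid, test)
print(train_s.mean().round(6).tolist())
```

Fitting the scaler on the training data alone keeps the validation and test splits as unseen data; they are only transformed with the training mean and standard deviation.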

## Score function

I defined the score functions used for the regression. To keep the notebook readable, I wrote the function `print_metrics` in the file `utility.py`. It prints the following values, which are used to compare models:
- mean absolute error
- mean squared error
- $r^2$, where the best score is 1 and values above 0.7 are considered good
- explained variance score, where the best score is 1
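
The four values listed above are all available in `sklearn.metrics`. The real `print_metrics` is defined in `utility.py` and is not shown here; the following is only a sketch of how the same scores could be computed, with `compute_metrics` as a hypothetical name and made-up inputs:

```python
from sklearn.metrics import (
    explained_variance_score,
    mean_absolute_error,
    mean_squared_error,
    r2_score,
)


def compute_metrics(y_true, y_pred):
    """Hypothetical helper: return the four scores as a dict."""
    return {
        "mean absolute error": mean_absolute_error(y_true, y_pred),
        "mean squared error": mean_squared_error(y_true, y_pred),
        "r^2": r2_score(y_true, y_pred),
        "explained variance score": explained_variance_score(y_true, y_pred),
    }


# Made-up values, only to show the call shape.
scores = compute_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
for name, value in scores.items():
    print(f"{name:<26}{value:.3f}")
```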

In [4]:
importlib.reload(utility)  # reload first so that edits to utility.py are picked up
from notebooks.utility import print_metrics

## Model Building + Score

In [5]:
model = LinearRegression(n_jobs=-1)
model.fit(X_train, y_train)

LinearRegression(n_jobs=-1)

In [6]:
print_metrics(y_test, model.predict(X_test))

+--------------------------+--------+
|          Method          | Value  |
+--------------------------+--------+
| mean absolute error      | 0.073  |
+--------------------------+--------+
| mean squared error       | 0.030  |
+--------------------------+--------+
| r^2                      | -0.006 |
+--------------------------+--------+
| explained variance score | -0.005 |
+--------------------------+--------+
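
For context on the $r^2$ row: a predictor that always returns the mean of the targets it is scored on gets $r^2 = 0$, so a value close to zero means the model does about as well as that constant baseline. A minimal sketch with synthetic data (not the project data), using scikit-learn's `DummyRegressor`:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))  # synthetic features, unrelated to the project data
y = rng.normal(size=200)       # synthetic target with no dependence on X

# A constant model that always predicts the mean of y.
baseline = DummyRegressor(strategy="mean").fit(X, y)
baseline_r2 = r2_score(y, baseline.predict(X))

print(f"baseline r^2: {baseline_r2:.3f}")
```

Running the same baseline on the real splits would give a reference point for comparing the linear model's scores.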
