# Dummy Models

In [1]:
import pandas as pd
import numpy as np

# Project helpers, plus importlib to reload them after edits
from notebooks import utility
import importlib

## Data import
Let's import the previously cleaned data.

In [2]:
X_train = pd.read_csv("../DWMProjectData/formodel/X_train.csv")
y_train = pd.read_csv("../DWMProjectData/formodel/y_train.csv")
X_valid = pd.read_csv("../DWMProjectData/formodel/X_valid.csv")
y_valid = pd.read_csv("../DWMProjectData/formodel/y_valid.csv")
X_test = pd.read_csv("../DWMProjectData/formodel/X_test.csv")
y_test = pd.read_csv("../DWMProjectData/formodel/y_test.csv")
# Flatten every y into a 1-dimensional array to avoid a warning when fitting the models
y_train = np.ravel(y_train)
y_valid = np.ravel(y_valid)
y_test = np.ravel(y_test)

## Score function

I defined the score functions used for the regression. To keep the notebook clean, I wrote the function `print_metrics` in the file `utility.py`. It prints the following values, which I use to compare models:
- mean absolute error
- mean squared error
- $r^2$, where the best score is 1 and anything above 0.7 is good
- explained variance score, where the best score is 1
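To make the four scores concrete, here is a minimal sketch of how they can be computed from scratch. The helper name `regression_metrics` is hypothetical and is not the `print_metrics` from `utility.py`, whose source is not shown here; scikit-learn offers equivalent functions in `sklearn.metrics` (`mean_absolute_error`, `mean_squared_error`, `r2_score`, `explained_variance_score`). The sketch assumes `y_true` is not constant, otherwise the variance in the denominator is zero.

```python
from statistics import fmean, pvariance


def regression_metrics(y_true, y_pred):
    """Return the four scores as a dict (hypothetical helper for illustration)."""
    y_true = [float(v) for v in y_true]
    y_pred = [float(v) for v in y_pred]
    residuals = [t - p for t, p in zip(y_true, y_pred)]

    mae = fmean([abs(r) for r in residuals])
    mse = fmean([r * r for r in residuals])

    # Population variance of the targets; assumed to be non-zero.
    var_true = pvariance(y_true)

    # r^2 compares the squared error with that of always predicting mean(y_true).
    r2 = 1.0 - mse / var_true
    # Explained variance only looks at the spread of the residuals,
    # so a constant offset in the predictions is not penalised.
    explained_variance = 1.0 - pvariance(residuals) / var_true

    return {
        "mean absolute error": mae,
        "mean squared error": mse,
        "r^2": r2,
        "explained variance score": explained_variance,
    }


# Toy data, not taken from the project.
toy_true = [1.0, 2.0, 3.0, 4.0]
toy_pred = [1.5, 2.0, 3.0, 3.5]
scores = regression_metrics(toy_true, toy_pred)

# Predictions that are off by a constant: r^2 drops, explained variance does not.
shifted = regression_metrics(toy_true, [v + 1.0 for v in toy_true])
```

The second call shows the difference between the last two scores: a systematic offset lowers $r^2$ but leaves the explained variance score untouched.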

In [3]:
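from utility import print_metrics
# Note: reload() re-executes notebooks.utility but does not rebind names that were
# already imported with `from ... import`; re-run the import above to pick up edits
importlib.reload(utility)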

<module 'notebooks.utility' from 'C:\\Users\\marco\\Documents\\UNI\\Y3\\DataWebMining\\project\\DWMProject\\notebooks\\utility.py'>

## Dummy Models
Taking inspiration from [here](https://towardsdatascience.com/creating-benchmark-models-the-scikit-learn-way-af227f6ea977), I first want to build some dummy models whose results can serve as a baseline for comparison with the real models.
I build two models:
- `dummy_mean` always predicts the mean of `y_train`
- `dummy_median` always predicts the median of `y_train`
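To illustrate what such a baseline does, here is a minimal stand-in written from scratch. The class `ConstantBaseline` is hypothetical and is not the code behind `get_dummy_model`; scikit-learn's `DummyRegressor` supports `strategy="mean"` and `strategy="median"` with the same fit/predict behaviour, and `get_dummy_model` presumably wraps it, but that is an assumption since `utility.py` is not shown here.

```python
from statistics import fmean, median


class ConstantBaseline:
    """Ignores the features and always predicts one constant learned from y."""

    def __init__(self, strategy="mean"):
        if strategy not in ("mean", "median"):
            raise ValueError(f"unknown strategy: {strategy!r}")
        self.strategy = strategy

    def fit(self, X, y):
        # X is accepted for API symmetry but never inspected.
        values = [float(v) for v in y]
        if self.strategy == "mean":
            self.constant_ = fmean(values)
        else:
            self.constant_ = median(values)
        return self

    def predict(self, X):
        # One identical prediction per input row.
        return [self.constant_] * len(X)


# Toy data, not taken from the project.
toy_X = [[0], [1], [2], [3]]
toy_y = [0.0, 0.0, 0.1, 0.9]

mean_pred = ConstantBaseline("mean").fit(toy_X, toy_y).predict(toy_X)
median_pred = ConstantBaseline("median").fit(toy_X, toy_y).predict(toy_X)
```

On skewed targets like the toy ones, the two constants differ noticeably, which is why it is worth keeping both baselines.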

In [4]:
from utility import get_dummy_model
importlib.reload(utility)

dummy_mean = get_dummy_model("mean", X_train, y_train)
dummy_median = get_dummy_model("median", X_train, y_train)

y_pred = dummy_mean.predict(X_test)
print("< ---------- MEAN DUMMY ---------- >")
print_metrics(y_test, y_pred)

y_pred = dummy_median.predict(X_test)
print("\n< --------- MEDIAN DUMMY --------- >")
print_metrics(y_test, y_pred)

< ---------- MEAN DUMMY ---------- >
+--------------------------+--------+
|          Method          | Value  |
+--------------------------+--------+
| mean absolute error      | 0.071  |
+--------------------------+--------+
| mean squared error       | 0.030  |
+--------------------------+--------+
| r^2                      | -0.000 |
+--------------------------+--------+
| explained variance score | 0      |
+--------------------------+--------+

< --------- MEDIAN DUMMY --------- >
+--------------------------+--------+
|          Method          | Value  |
+--------------------------+--------+
| mean absolute error      | 0.070  |
+--------------------------+--------+
| mean squared error       | 0.030  |
+--------------------------+--------+
| r^2                      | -0.003 |
+--------------------------+--------+
| explained variance score | 0      |
+--------------------------+--------+


As expected, the results are poor: $r^2$ is about 0 and the explained variance score is 0 for both models. That's totally fine, since each model ignores the features and always predicts the mean or the median of `y_train`.
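A short derivation shows why a constant prediction lands on these values. The notation is mine, not from `utility.py`: $c$ is the constant the dummy predicts, $y_i$ are the $n$ values of `y_test`, and $\bar{y}$ is their mean. Assuming the usual definitions of the two scores:

$$
r^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - c)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    = 1 - \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 + n(\bar{y} - c)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    = -\frac{n(\bar{y} - c)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \le 0
$$

$$
\text{explained variance} = 1 - \frac{\operatorname{Var}(y - c)}{\operatorname{Var}(y)} = 1 - \frac{\operatorname{Var}(y)}{\operatorname{Var}(y)} = 0
$$

So a constant model can never score above 0 on $r^2$, and it scores slightly below 0 whenever $c$ differs from the mean of `y_test`, which is consistent with the -0.000 and -0.003 above. The explained variance score is exactly 0 because subtracting a constant does not change the variance.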
Now I just hope to obtain better results with my models!