In [6]:
import pandas as pd 
# Load data
melbourne_data = pd.read_csv('melb_data.csv') 
# Drop rows that have a missing value in any column
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

In [8]:
from sklearn.tree import DecisionTreeRegressor 
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X,y)

#### Metrics to evaluate performance of a model:

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Root Mean Squared Logarithmic Error (RMSLE)
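
To make the four metrics concrete, here is a minimal sketch that computes each one by hand with NumPy. The `y_true` and `y_pred` arrays are made-up toy values, not taken from the housing data, and the RMSLE line assumes the common log(1 + x) form.

```python
import numpy as np

# Hypothetical toy values, chosen only for illustration
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

errors = y_pred - y_true  # [10, -10, 30, -40]

# Mean Absolute Error: average size of the errors
mae = np.mean(np.abs(errors))   # 22.5

# Mean Squared Error: average of the squared errors
mse = np.mean(errors ** 2)      # 675.0

# Root Mean Squared Error: back in the units of the target
rmse = np.sqrt(mse)             # about 25.98

# Root Mean Squared Logarithmic Error, assuming the log(1 + x) form
rmsle = np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(mae, mse, rmse, rmsle)
```

MAE and RMSE are in the same units as the target, which is why MAE is convenient for reading house-price errors directly in dollars.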

In [10]:
from sklearn.metrics import mean_absolute_error 
preds = melbourne_model.predict(X)
# scikit-learn expects the true values first, then the predictions
mean_absolute_error(y, preds)

434.71594577146544

#### The Problem with "In-Sample" Scores
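The measure we just computed can be called an "in-sample" score. We used a single "sample" of houses for both building the model and evaluating it. Here's why this is bad.

Imagine that, in the large real estate market, door color is unrelated to home price. If the houses in your sample happen to show a pattern anyway (say, the homes with green doors were all expensive), the model can learn that pattern. Because the pattern comes from the same data used for evaluation, the in-sample score will look accurate even though the model would predict poorly on new homes.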

In [15]:
# Code you have previously used to load data
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Read the training data
home_data = pd.read_csv('train.csv')
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]

# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)

print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())


First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]


Root Mean Squared Logarithmic Error (RMSLE) is the square root of the mean squared difference between the logarithms of the predicted and actual values. In the common form, 1 is added to each value before taking the logarithm, so the per-observation error is log(1 + predicted) - log(1 + actual).

RMSLE is similar to RMSE, but it has some advantages and disadvantages:

- It is less sensitive to outliers and large absolute errors, because it compares the logarithms of the values rather than the values themselves. A large miss on an expensive house counts no more than a proportionally equal miss on a cheap one.
- It is approximately scale-invariant: it measures relative rather than absolute error, so it changes little when the data are rescaled or expressed in different units. This makes it easier to compare the RMSLE of different models or datasets.
- It applies only where the logarithm is defined. With the log(1 + x) form, zero is allowed but negative values are not, which may limit its use in some scenarios.
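
The relative-error behaviour described above can be checked with a small sketch. The `rmsle` and `rmse` helpers below are hypothetical, written for this illustration and assuming the log(1 + x) form; the prices are made up.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSLE in the log(1 + x) form; inputs must be non-negative."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if (y_true < 0).any() or (y_pred < 0).any():
        raise ValueError("RMSLE requires non-negative values")
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def rmse(y_true, y_pred):
    """Plain RMSE, for comparison."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# The same 10% over-prediction at two very different price levels
cheap_rmsle = rmsle([100_000], [110_000])
expensive_rmsle = rmsle([1_000_000], [1_100_000])
cheap_rmse = rmse([100_000], [110_000])
expensive_rmse = rmse([1_000_000], [1_100_000])

print("RMSLE:", cheap_rmsle, expensive_rmsle)  # nearly identical
print("RMSE: ", cheap_rmse, expensive_rmse)    # differ by a factor of 10
```

The two RMSLE values are almost the same because both predictions are off by the same ratio, while RMSE grows with the price level.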

In [21]:
# Import the train_test_split function
from sklearn.model_selection import train_test_split
# Split the data into training and validation sets; random_state makes the split reproducible
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
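
As a quick sanity check on what `train_test_split` returns, here is a small sketch on made-up arrays (the `X_demo` and `y_demo` names are hypothetical). With no `test_size` given, scikit-learn holds out 25% of the rows, and a fixed `random_state` gives the same split on every run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical demo data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    X_demo, y_demo, random_state=1
)
print(demo_train_X.shape, demo_val_X.shape)  # (75, 2) (25, 2)

# Splitting again with the same random_state reproduces the same rows
repeat_train_X, repeat_val_X, _, _ = train_test_split(
    X_demo, y_demo, random_state=1
)
print(np.array_equal(demo_val_X, repeat_val_X))  # True
```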

In [25]:
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)

# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)


In [27]:
# Predict with all validation observations
preds_val = iowa_model.predict(val_X)
# Display the validation predictions
preds_val

array([186500., 184000., 130000.,  92000., 164500., 220000., 335000.,
       144152., 215000., 262000., 180000., 121000., 175900., 210000.,
       248900., 131000., 100000., 149350., 235000., 156000., 149900.,
       265979., 193500., 377500., 100000., 162900., 145000., 180000.,
       582933., 146000., 140000.,  91500., 112500., 113000., 145000.,
       312500., 110000., 132000., 305000., 128000., 162900., 115000.,
       110000., 124000., 215200., 180000.,  79000., 192000., 282922.,
       235000., 132000., 325000.,  80000., 237000., 208300., 100000.,
       120500., 162000., 153000., 187000., 185750., 335000., 129000.,
       124900., 185750., 133700., 127000., 230000., 146800., 157900.,
       136000., 153575., 335000., 177500., 143000., 202500., 168500.,
       105000., 305900., 192000., 190000., 140200., 134900., 128950.,
       213000., 108959., 149500., 190000., 175900., 160000., 250580.,
       157000., 120500., 147500., 118000., 117000., 110000., 130000.,
       148500., 1480

In [31]:
from sklearn.metrics import mean_absolute_error 
# True values first, then predictions
val_mae = mean_absolute_error(val_y, preds_val)
print(val_mae)

29652.931506849316
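
The gap between an in-sample score and a validation score can be reproduced on synthetic data. The sketch below is an illustration only: the data are randomly generated rather than taken from the housing files, and all variable names are hypothetical. An unrestricted decision tree memorises its training rows, so its in-sample MAE is essentially zero, while its error on held-out rows is much larger.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(500, 3))
y_syn = 50_000 * X_syn[:, 0] + rng.normal(0, 20_000, size=500)

syn_train_X, syn_val_X, syn_train_y, syn_val_y = train_test_split(
    X_syn, y_syn, random_state=1
)

syn_model = DecisionTreeRegressor(random_state=1)
syn_model.fit(syn_train_X, syn_train_y)

# Score on the rows the model was fitted on (in-sample)
in_sample_mae = mean_absolute_error(syn_train_y, syn_model.predict(syn_train_X))
# Score on rows the model has never seen (validation)
validation_mae = mean_absolute_error(syn_val_y, syn_model.predict(syn_val_X))

print("In-sample MAE: ", in_sample_mae)
print("Validation MAE:", validation_mae)
```

The validation figure is the one to trust when judging how the model will behave on new homes.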
