# Notes from the tutorial for "Introduction to Machine Learning" on Kaggle
https://www.kaggle.com/learn/intro-to-machine-learning

https://www.kaggle.com/code/dansbecker/model-validation

## Model Validation

Model validation is a way to measure the quality of a model. One measure of model quality is predictive accuracy: how close the model's predictions are to the actual values.

First, summarize model quality in a single, understandable number.
One such summary is Mean Absolute Error (MAE).

Prediction error: 
error = actual - predicted

MAE takes the absolute value of each error, so that positive and negative errors do not cancel out, and then averages those absolute errors. The result reads as: on average, predictions are off by about X.
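
To make the arithmetic concrete, here is a minimal sketch that computes MAE by hand. The three prices are hypothetical values chosen only so the numbers are easy to check; they are not taken from the Melbourne data.

```python
# Hypothetical prices, chosen only to make the arithmetic easy to follow
actual = [200_000, 300_000, 400_000]
predicted = [210_000, 280_000, 400_000]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Absolute values stop positive and negative errors from cancelling out
absolute_errors = [abs(e) for e in errors]

# MAE is the plain average of the absolute errors
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [-10000, 20000, 0]
print(mae)     # 10000.0
```

Without the absolute value, the average error here would be about 3,333, which understates how far off the individual predictions are.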

In [None]:
# Setup: load the data, choose target and features, fit the model
import pandas as pd

# Load data
melbourne_file_path = './melbourne-housing-snapshot/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
# Drop rows with a missing value in any column (not only Price)
filtered_melbourne_data = melbourne_data.dropna(axis=0)
# Choose target and features
y = filtered_melbourne_data.Price
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'BuildingArea', 
                        'YearBuilt', 'Lattitude', 'Longtitude']
X = filtered_melbourne_data[melbourne_features]

from sklearn.tree import DecisionTreeRegressor
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(X, y)

In [None]:
# Calculate the mean absolute error
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

### "In-Sample" Scores
The score computed above is called an "in-sample" score: a single "sample" of houses was used both to build the model and to evaluate it.

Training and evaluating on the same data hides how well the model will predict new data. For example, suppose the training data happens to show that houses with green doors have a higher value. The model picks up this pattern and looks accurate on the training data, but if new data doesn't follow the pattern, the model will be very inaccurate in practice.

Performance should be measured on data that was not used to build the model. The most straightforward way to do this is to exclude some data from the model-building process and use it to test the model's accuracy. This held-out data is called "validation data".

The function "train_test_split" from scikit-learn breaks up data into two pieces. Some data will be used to train the model, and the rest will be used as validation data to calculate MAE. 
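
As a quick illustration of what the split produces, the sketch below runs "train_test_split" on a small made-up dataset. The names `toy_X` and `toy_y` and their values are hypothetical. It assumes scikit-learn's default behaviour of holding out 25% of the rows when neither `test_size` nor `train_size` is passed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, one feature, one target
toy_X = pd.DataFrame({"Rooms": range(20)})
toy_y = pd.Series(range(100, 120), name="Price")

# With no test_size or train_size given, scikit-learn holds out 25% of the rows
toy_train_X, toy_val_X, toy_train_y, toy_val_y = train_test_split(
    toy_X, toy_y, random_state=0
)

print(len(toy_train_X), len(toy_val_X))  # 15 5

# Features and targets stay aligned row by row after the shuffle
print(toy_train_X.index.equals(toy_train_y.index))  # True
```

Passing `test_size` explicitly (for example `test_size=0.2`) changes the proportion that is held out.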

In [None]:
from sklearn.model_selection import train_test_split

# Split data into training and validation sets.
# random_state fixes the shuffle so the same split is produced on every run.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

# Define model
melbourne_model = DecisionTreeRegressor()

# Fit model
melbourne_model.fit(train_X, train_y)

# Get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
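
To see why the in-sample score is misleading, the following sketch compares the two scores side by side. It uses synthetic stand-in data (an assumption made so the example runs without the Melbourne CSV), so the absolute numbers mean nothing; only the gap between the two scores is of interest.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (an assumption, not the Melbourne dataset):
# price depends on size plus random noise
rng = np.random.default_rng(0)
size = rng.uniform(50, 250, size=200)
price = 3000 * size + rng.normal(0, 50_000, size=200)
demo_X = size.reshape(-1, 1)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, price, random_state=0
)

demo_model = DecisionTreeRegressor(random_state=0)
demo_model.fit(demo_train_X, demo_train_y)

# In-sample: scored on the same rows the tree was fitted to
in_sample_mae = mean_absolute_error(demo_train_y, demo_model.predict(demo_train_X))
# Out-of-sample: scored on rows the tree has never seen
validation_mae = mean_absolute_error(demo_val_y, demo_model.predict(demo_val_X))

print(f"in-sample MAE:  {in_sample_mae:,.0f}")
print(f"validation MAE: {validation_mae:,.0f}")
```

An unrestricted decision tree can memorise its training rows, noise included, so the in-sample MAE comes out far smaller than the validation MAE. The validation figure is the more honest estimate of how the model would do on new houses.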