## Getting Started
We begin by importing Pandas, the library this notebook uses to load, explore, and manipulate the dataset. The machine learning tools from scikit-learn are imported later, where they are first needed.

In [1]:
import pandas as pd

## Interpreting Data Description

### Loading the Dataset

Next, we specify the file path of the dataset and load the data into a Pandas DataFrame. This allows us to work with the data in a structured, tabular format.

### Exploring the Dataset

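To understand the dataset better, we generate a statistical summary using the describe() method. For each numeric column it reports the count of non-missing values, the mean, the standard deviation, the minimum and maximum, and the quartiles. A count lower than the total number of rows signals missing values in that column.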

In [2]:
# save filepath to variable for easier access
melbourne_file_path = './input/melbourne-housing-snapshot/melb_data.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 
# print a summary of the data in Melbourne data
melbourne_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |
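
In the summary, `Car`, `BuildingArea`, and `YearBuilt` have counts below 13580, which is how describe() reveals missing values. The same check can be made directly with `isna()`. A minimal sketch on a made-up three-row frame (the values are illustrative, not taken from the Melbourne data):

```python
import numpy as np
import pandas as pd

# Toy frame; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, np.nan, 142.0],
    "YearBuilt": [1900.0, np.nan, np.nan],
})

# describe() counts only the non-missing entries in each column.
counts = toy.describe().loc["count"]

# isna().sum() reports the number of missing entries directly.
missing = toy.isna().sum()

print(counts)
print(missing)
```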


Print some statistics

In [3]:
# What is the average lot size (rounded to nearest integer)?
avg_lot_size = round(melbourne_data['Landsize'].mean())

# As of 2024, how old is the newest home (2024 minus the year it was built)?
newest_home_age = 2024 - melbourne_data['YearBuilt'].max()

print(avg_lot_size, newest_home_age)

558 6.0
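
The hard-coded `2024` goes stale as time passes. One possible alternative, sketched here on a stand-in Series rather than the real `YearBuilt` column, is to read the year from the system clock. The `int(...)` conversion is an optional extra that avoids the float result (`6.0`) seen above.

```python
from datetime import date

import pandas as pd

# Stand-in for melbourne_data['YearBuilt']; values are illustrative.
year_built = pd.Series([1970.0, 1999.0, 2018.0])

# Take the current year from the system clock instead of hard-coding it.
current_year = date.today().year
newest_home_age = current_year - int(year_built.max())

print(newest_home_age)
```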


## Selecting Data for Modeling

Before building a machine learning model, we need to reduce the dataset to a manageable number of relevant variables. Using domain knowledge and intuition, we select features that are likely to influence the target variable.

At this stage, feature selection is done manually. More advanced techniques for automatic feature selection will be explored in later lessons.

### Viewing Available Columns

To choose suitable features, we first inspect all the columns present in the dataset. This helps us understand what information is available for modeling.

In [12]:
melbourne_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='str')

### Handling Missing Values

The dataset contains some missing values, meaning certain features were not recorded for all houses. Since we will learn proper techniques for handling missing data later, we take a simple approach for now.

In this step, we remove rows that contain missing values to keep the dataset clean and suitable for basic modeling.

In [4]:
# drop rows with missing values
melbourne_data = melbourne_data.dropna(axis=0)
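
To see what `dropna(axis=0)` does, here is a small sketch on a made-up frame (not the Melbourne data). It also shows the `subset` argument, which restricts the check to chosen columns and so discards fewer rows. Dropped rows leave gaps in the index, which is why the row labels printed later in this notebook skip some numbers.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; values are invented for illustration.
toy = pd.DataFrame({
    "Price": [1035000.0, 1465000.0, 1600000.0],
    "Car": [1.0, np.nan, 2.0],
    "BuildingArea": [79.0, 150.0, np.nan],
})

# axis=0 drops every row that has at least one missing value.
complete_rows = toy.dropna(axis=0)

# subset= limits the check to the named columns.
car_known = toy.dropna(axis=0, subset=["Car"])

print(len(toy), len(complete_rows), len(car_known))
print(list(complete_rows.index), list(car_known.index))
```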

### Selecting the Prediction Target

In supervised learning, the variable we want to predict is called the prediction target.
Using dot notation, we extract this column from the dataset. By convention, the prediction target is named y.

In [5]:
y = melbourne_data.Price

## Choosing Features

The variables used as inputs to the model are called features.
Rather than using all available columns, we begin with a small set of features that are likely to influence house prices.

In [7]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

By convention, feature data is stored in a variable named X.

In [8]:
X = melbourne_data[melbourne_features]
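
Dot notation and bracket notation return the same column as a Series, while indexing with a list of names returns a DataFrame. A short sketch with made-up rows illustrates the difference:

```python
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Bathroom": [1.0, 2.0, 1.0],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

y = toy.Price                    # dot notation -> Series
y_alt = toy["Price"]             # bracket notation -> the same Series
X = toy[["Rooms", "Bathroom"]]   # list of column names -> DataFrame

print(type(y).__name__, type(X).__name__, X.shape)
```

Bracket notation is the safer habit in general, since dot notation fails for column names that contain spaces or clash with DataFrame attributes.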

### Reviewing Feature Data

Before modeling, it is important to visually inspect the selected features.
We use describe() to view summary statistics and head() to see the first few rows.

In [None]:
# print both views; a notebook cell only displays its last expression
print(X.describe())
print(X.head())

This step helps identify unusual values and ensures the data looks reasonable.
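
In a notebook only the last expression of a cell is displayed, so showing both views takes explicit `print` calls. A sketch with a stand-in frame (a few made-up rows using the same column names):

```python
import pandas as pd

# Stand-in for X; values are illustrative.
X_toy = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Bathroom": [1.0, 2.0, 1.0, 2.0],
    "Landsize": [156.0, 134.0, 120.0, 245.0],
})

summary = X_toy.describe()   # count, mean, std, min, quartiles, max
first_rows = X_toy.head(2)   # first two rows only

print(summary)
print(first_rows)
```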

## Building the Model

We use the **scikit-learn** library to build machine learning models.
Model development follows four key steps:

1. Define the model
2. Fit the model to data: capture patterns from the provided data. This is the heart of modeling.
3. Predict outcomes
4. Evaluate performance

Here, we define and train a **Decision Tree Regressor**.

In [13]:
!pip install scikit-learn



In [14]:
from sklearn.tree import DecisionTreeRegressor

melbourne_model = DecisionTreeRegressor(random_state=1)
melbourne_model.fit(X, y)

DecisionTreeRegressor(random_state=1)


Setting random_state ensures reproducible results.
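
A small sketch of what reproducibility means here, using synthetic data rather than the Melbourne dataset. The `max_features=2` setting is an assumption added for this demonstration: it makes each split consider a random subset of features, so the seed plays a visible role. Two fits with the same seed give identical predictions; a different seed may not.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data; not the Melbourne dataset.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(200, 5))
weights = np.array([3.0, 1.0, 0.5, 2.0, 1.5])
y_toy = X_toy @ weights + rng.normal(scale=0.1, size=200)
X_new = rng.uniform(size=(5, 5))  # unseen rows to predict on


def fit_and_predict(seed):
    # Fit a fresh tree with the given seed and predict the unseen rows.
    model = DecisionTreeRegressor(max_features=2, random_state=seed)
    model.fit(X_toy, y_toy)
    return model.predict(X_new)


same_a = fit_and_predict(1)
same_b = fit_and_predict(1)

print(np.array_equal(same_a, same_b))
```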

### Making Predictions

Once the model is trained, we can use it to make predictions.
For demonstration, we predict prices for the first few houses in the dataset.

In [19]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]


## Model Validation

After building a model, it is important to evaluate how well it performs.
In most machine learning tasks, this means checking how close the model’s predictions are to actual values.

Simply testing the model on the same data used for training can be misleading, because the model may memorize patterns instead of learning general trends.

### Measuring Model Quality (MAE)

To summarize model performance using a single value, we use Mean Absolute Error (MAE).

MAE measures the average size of prediction errors without considering their direction.

$$
\text{MAE} = \text{average of } \lvert \text{actual} - \text{predicted} \rvert
$$
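
To make the formula concrete, the sketch below computes MAE by hand on three made-up price pairs and checks it against scikit-learn's `mean_absolute_error`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up values for illustration.
actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 230.0])

# |100-110| = 10, |150-140| = 10, |200-230| = 30, so MAE = 50 / 3
mae_by_hand = np.mean(np.abs(actual - predicted))
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_by_hand, mae_sklearn)
```

Over- and under-predictions both count as positive errors, so they cannot cancel each other out.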

In [21]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

### Problem with In-Sample Evaluation

When predictions are evaluated using the same data used for training, the model often appears unrealistically accurate.
This happens because the model learns patterns specific to the training data that may not hold for new data.

To properly assess real-world performance, we must evaluate the model on unseen data, called **validation data**: a portion of the dataset that is excluded from training and used only for evaluation. The error on this data estimates how well the model will perform on new, real-world data.
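
The effect is easiest to see in an extreme case. In the sketch below (synthetic data, not the Melbourne dataset) the target is pure noise, so there is nothing real to learn. An unconstrained tree still reaches a training error of essentially zero, while its error on fresh rows stays large.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: features and target are unrelated random numbers.
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(300, 3))
y_train = rng.normal(size=300)
X_val = rng.uniform(size=(100, 3))   # fresh rows the model never sees
y_val = rng.normal(size=100)

model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

# The fully grown tree memorizes every training row...
train_mae = mean_absolute_error(y_train, model.predict(X_train))
# ...but that says nothing about rows it has not seen.
val_mae = mean_absolute_error(y_val, model.predict(X_val))

print(train_mae, val_mae)
```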

### Splitting Data for Validation

We use train_test_split to divide the dataset into:

- Training data – used to fit the model
- Validation data – used to evaluate predictions

Setting a random_state ensures the same split every time the code is run.
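
A quick sketch of how `train_test_split` behaves, using a synthetic array of 100 rows. When no `test_size` is given, scikit-learn holds out 25% of the rows, and repeating the call with the same `random_state` reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 2 features.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# Default split: 75% training rows, 25% validation rows.
train_X, val_X, train_y, val_y = train_test_split(X_toy, y_toy, random_state=0)

# Same seed, same split.
_, val_X_again, _, _ = train_test_split(X_toy, y_toy, random_state=0)

print(train_X.shape, val_X.shape, np.array_equal(val_X, val_X_again))
```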

### Evaluating with Validation MAE

After training the model on the training set, we calculate MAE using predictions on the validation set.
This provides a more realistic estimate of model performance.

A validation error that is much larger than the training error indicates overfitting.

In [22]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

276329.719388853


## Underfitting and Overfitting

### Experimenting With Different Models

Now that we can measure model quality using validation data, we can try different model settings and choose the one that performs best on unseen data.

For Decision Trees, a key factor is model complexity, which is strongly influenced by the tree size (depth / number of leaves).

#### Tree Complexity and Model Performance

- Deeper / larger trees learn more detailed patterns from the training data.

- Smaller trees learn simpler patterns.

However, more complexity is not always better because it can hurt generalization.

#### Underfitting vs Overfitting

- **Underfitting** happens when the model is too simple to capture important patterns. The result is high error on both training and validation data.
- **Overfitting** happens when the model learns noise or random details in the training data. The result is very low training error but high validation error.

Our goal is to find a *balanced model* that gives the lowest validation error.

#### Controlling Tree Size with max_leaf_nodes

A practical way to control decision tree complexity is by setting max_leaf_nodes.

- **Small value** → simpler tree (risk of underfitting)
- **Large value** → complex tree (risk of overfitting)

We test multiple values and choose the one with the lowest validation MAE.

#### Utility Function to Compare Models

We define a helper function that:

1. Trains a model with a given max_leaf_nodes
2. Predicts on the validation set
3. Returns the validation MAE

In [23]:
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

#### Comparing Different Tree Sizes

We run the function for several values of max_leaf_nodes and compare the MAE scores.
The best choice is the one with the **lowest validation MAE**.

In [24]:
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))

Max leaf nodes: 5 		 Mean Absolute Error: 385696
Max leaf nodes: 50 		 Mean Absolute Error: 279794
Max leaf nodes: 500 		 Mean Absolute Error: 261718
Max leaf nodes: 5000 		 Mean Absolute Error: 271320
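
Picking the winner can be automated instead of read off the printout. The sketch below uses the four MAE values printed above, stored in a dictionary (the name `scores` is introduced here for illustration), and selects the key with the smallest value.

```python
# Validation MAE for each max_leaf_nodes value, copied from the output above.
scores = {5: 385696, 50: 279794, 500: 261718, 5000: 271320}

# min() with key=scores.get returns the key whose value is smallest.
best_tree_size = min(scores, key=scores.get)

print(best_tree_size)
```

In the notebook itself, the dictionary could be filled by calling `get_mae` for each candidate value rather than typing the numbers in.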


## Random Forests

### Intro

Decision trees face a trade-off between underfitting and overfitting.

- Deep trees with many leaves tend to overfit the training data.
- Shallow trees fail to capture important patterns in the data.

This challenge is common even in advanced machine learning models. To address it, we use ensemble methods, which combine **multiple models** to improve performance.

### What is a Random Forest?

A Random Forest is an ensemble of many decision trees. Each tree makes its own prediction, and the final prediction is the *average* of all trees.

This approach:

- Reduces overfitting
- Improves prediction accuracy
- Works well with default settings
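
The averaging can be checked directly. In the sketch below (synthetic data, with a small forest of 20 trees chosen only to keep the example fast) the fitted trees are read from the `estimators_` attribute, and their predictions are averaged by hand and compared with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data; not the Melbourne dataset.
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(200, 3))
y_toy = 10 * X_toy[:, 0] + rng.normal(size=200)
X_new = rng.uniform(size=(5, 3))

forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(X_toy, y_toy)

# estimators_ holds the fitted trees; collect each tree's predictions.
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_new)))
```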

### Building a Random Forest Model

We build a Random Forest in scikit-learn using the RandomForestRegressor class, following the same steps as before: define, fit, predict, and evaluate.

In [25]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

207190.6873773146


### Results and Interpretation

The Random Forest model produces a lower validation MAE (about 207,191) than the best single decision tree (261,718, obtained with max_leaf_nodes=500).
This indicates better generalization to unseen data.
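
To put the gap in perspective, the short calculation below expresses it as an absolute and a relative reduction, using the two validation MAE values reported in this notebook:

```python
tree_mae = 261718                # best single tree (max_leaf_nodes=500)
forest_mae = 207190.6873773146   # random forest with default settings

absolute_gain = tree_mae - forest_mae
relative_gain = absolute_gain / tree_mae

# prints: 54,527 lower, a 20.8% reduction
print(f"{absolute_gain:,.0f} lower, a {relative_gain:.1%} reduction")
```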