# Predicting Real Estate Values by using _**Decision Tree Diagram**_
### **Prepared by: Musaini Ramlee**

### **1) Introduction**   
Welcome to my Machine Learning Portfolio!

In this case study, I will perform a real-world task of a Machine Learning Engineer.

I have been appointed as a business partner based on my expertise in Machine Learning. My partner requires me to predict real estate values in order to help him make more decisions based on data analysis and, ultimately, more profitable investments.

In this particular case, I will be using a **Decision Tree Diagram** purely because it is simple, easy to understand, and the basic building block for some of the best models in data science.

#### **Methodology - Decision Tree Diagram**
* In its simplest form, this model divides houses into only two categories. 
* The predicted price for any house under consideration is the historical average price of houses in the same category (working out these averages is the process of fitting/training the model).
* After the model has been fit, my business partner will be able to apply the model to new data in order to predict prices of additional homes.
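
The idea in the bullets above can be sketched in a few lines of plain Python. This is only an illustration: the house data, the `ROOM_THRESHOLD` split and the function names are all hypothetical and are not part of the Melbourne dataset or of scikit-learn.

```python
from statistics import mean

# Hypothetical training data: (number of rooms, sale price in dollars)
training_houses = [(1, 400_000), (2, 600_000), (2, 650_000),
                   (3, 1_000_000), (4, 1_200_000), (5, 1_400_000)]

ROOM_THRESHOLD = 2  # assumed split: "2 rooms or fewer" vs "more than 2 rooms"

def fit_two_category_model(houses, threshold):
    """Fitting: compute the historical average price of each category."""
    small = [price for rooms, price in houses if rooms <= threshold]
    large = [price for rooms, price in houses if rooms > threshold]
    return {"small": mean(small), "large": mean(large)}

def predict_price(model, rooms, threshold):
    """Predicting: return the average price of the category the house falls into."""
    return model["small"] if rooms <= threshold else model["large"]

model = fit_two_category_model(training_houses, ROOM_THRESHOLD)
print(predict_price(model, 2, ROOM_THRESHOLD))  # average of the 'small' category
print(predict_price(model, 4, ROOM_THRESHOLD))  # average of the 'large' category
```

A real decision tree chooses the split feature and threshold from the data and keeps splitting, but every leaf still predicts an average in the same way.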


### **2) Problem Statement (Business Tasks)**            

These are the key questions and the deliverables to be achieved from this model building activity.   
i.	What is the predicted price for a new house or new development?   
ii.	What is the amount of error in the created model?   


### **3) Data Sets**

a.	**Dataset Origin**: This dataset comes from the preprocessed [Melbourne Housing Snapshot](https://www.kaggle.com/dansbecker/melbourne-housing-snapshot) 
(CC0: Public Domain). The dataset is made available through Kaggle.

b.	**Description**: The dataset was scraped from publicly available results posted every week on Domain.com.au. The data owner has cleaned it well. The dataset includes Address, Type of Real Estate, Suburb, Method of Selling, Rooms, Price, Real Estate Agent, Date of Sale and Distance from C.B.D.

c.	**Notes on Specific Variables**  

* Rooms: Number of rooms

* Price: Price in dollars

* Method: S - property sold; SP - property sold prior; PI - property passed in; PN - sold prior not disclosed; SN - sold not disclosed; NB - no bid; VB - vendor bid; W - withdrawn prior to auction; SA - sold after auction; SS - sold after auction price not disclosed. N/A - price or highest bid not available.

* Type: br - bedroom(s); h - house, cottage, villa, semi, terrace; u - unit, duplex; t - townhouse; dev site - development site; o res - other residential.

* SellerG: Real Estate Agent

* Date: Date sold

* Distance: Distance from CBD

* Regionname: General Region (West, North West, North, North east …etc)

* Propertycount: Number of properties that exist in the suburb.

* Bedroom2 : Scraped # of Bedrooms (from different source)

* Bathroom: Number of Bathrooms

* Car: Number of carspots

* Landsize: Land Size

* BuildingArea: Building Size

* CouncilArea: Governing council for the area


d.	**Data Structure and Overview**: The raw data is inspected using the code below. The objective is to identify how it is structured.

In [43]:
import pandas as pd
melb_data = pd.read_csv('melb_data.csv')
melb_data.head()

| | Suburb | Address | Rooms | Type | Price | Method | SellerG | Date | Distance | Postcode | ... | Bathroom | Car | Landsize | BuildingArea | YearBuilt | CouncilArea | Lattitude | Longtitude | Regionname | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abbotsford | 85 Turner St | 2 | h | 1480000.0 | S | Biggin | 3/12/2016 | 2.5 | 3067.0 | ... | 1.0 | 1.0 | 202.0 | NaN | NaN | Yarra | -37.7996 | 144.9984 | Northern Metropolitan | 4019.0 |
| 1 | Abbotsford | 25 Bloomburg St | 2 | h | 1035000.0 | S | Biggin | 4/02/2016 | 2.5 | 3067.0 | ... | 1.0 | 0.0 | 156.0 | 79.0 | 1900.0 | Yarra | -37.8079 | 144.9934 | Northern Metropolitan | 4019.0 |
| 2 | Abbotsford | 5 Charles St | 3 | h | 1465000.0 | SP | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 0.0 | 134.0 | 150.0 | 1900.0 | Yarra | -37.8093 | 144.9944 | Northern Metropolitan | 4019.0 |
| 3 | Abbotsford | 40 Federation La | 3 | h | 850000.0 | PI | Biggin | 4/03/2017 | 2.5 | 3067.0 | ... | 2.0 | 1.0 | 94.0 | NaN | NaN | Yarra | -37.7969 | 144.9969 | Northern Metropolitan | 4019.0 |
| 4 | Abbotsford | 55a Park St | 4 | h | 1600000.0 | VB | Nelson | 4/06/2016 | 2.5 | 3067.0 | ... | 1.0 | 2.0 | 120.0 | 142.0 | 2014.0 | Yarra | -37.8072 | 144.9941 | Northern Metropolitan | 4019.0 |


In [44]:
melb_data.dtypes

Suburb            object
Address           object
Rooms              int64
Type              object
Price            float64
Method            object
SellerG           object
Date              object
Distance         float64
Postcode         float64
Bedroom2         float64
Bathroom         float64
Car              float64
Landsize         float64
BuildingArea     float64
YearBuilt        float64
CouncilArea       object
Lattitude        float64
Longtitude       float64
Regionname        object
Propertycount    float64
dtype: object

In [45]:
# Print summary statistics in next line
melb_data.describe()

| | Rooms | Price | Distance | Postcode | Bedroom2 | Bathroom | Car | Landsize | BuildingArea | YearBuilt | Lattitude | Longtitude | Propertycount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13580.0 | 13518.0 | 13580.0 | 7130.0 | 8205.0 | 13580.0 | 13580.0 | 13580.0 |
| mean | 2.937997 | 1075684.0 | 10.137776 | 3105.301915 | 2.914728 | 1.534242 | 1.610075 | 558.416127 | 151.96765 | 1964.684217 | -37.809203 | 144.995216 | 7454.417378 |
| std | 0.955748 | 639310.7 | 5.868725 | 90.676964 | 0.965921 | 0.691712 | 0.962634 | 3990.669241 | 541.014538 | 37.273762 | 0.07926 | 0.103916 | 4378.581772 |
| min | 1.0 | 85000.0 | 0.0 | 3000.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1196.0 | -38.18255 | 144.43181 | 249.0 |
| 25% | 2.0 | 650000.0 | 6.1 | 3044.0 | 2.0 | 1.0 | 1.0 | 177.0 | 93.0 | 1940.0 | -37.856822 | 144.9296 | 4380.0 |
| 50% | 3.0 | 903000.0 | 9.2 | 3084.0 | 3.0 | 1.0 | 2.0 | 440.0 | 126.0 | 1970.0 | -37.802355 | 145.0001 | 6555.0 |
| 75% | 3.0 | 1330000.0 | 13.0 | 3148.0 | 3.0 | 2.0 | 2.0 | 651.0 | 174.0 | 1999.0 | -37.7564 | 145.058305 | 10331.0 |
| max | 10.0 | 9000000.0 | 48.1 | 3977.0 | 20.0 | 8.0 | 10.0 | 433014.0 | 44515.0 | 2018.0 | -37.40853 | 145.52635 | 21650.0 |


### **4) Data Preprocessing**

*	The data is in a CSV file. The downloaded file will be cleaned with the standard process below.

In [46]:
from pathlib import Path

cwd = Path.cwd()

# Step 1 - Remove duplicate rows
## done inside the loop below with 'drop_duplicates'


# Step 2 - Strip whitespace from both ends of every string value in the dataframe
def trim_all_columns(df):
    trim_strings = lambda x: x.strip() if isinstance(x, str) else x
    # Apply the trim column by column ('applymap' is deprecated in recent pandas versions)
    return df.apply(lambda column: column.map(trim_strings))


# Step 3 - Standardize text format (work in progress, not applied yet)
def standardize_text(df):
    lowercase = lambda s: s.str.lower() if s.dtype == 'object' else s
    return df.apply(lowercase)


# Step 4 - Check for NaN values and save the CSV file to another folder
def check_NaN(df, csv):
    if not df.isnull().values.any():
        print("No NaN values in " + csv.name)
    else:
        print(csv.name + " contains NaN values. Please check the following columns")
        print(df.isna().sum())
    output_dir = cwd / "cleaned"
    output_dir.mkdir(exist_ok=True)  # create the folder if it does not exist yet
    df.to_csv(output_dir / csv.name)  # save the file in a new dir
    return df  # return the dataframe so the caller does not end up with None


# Loop the cleaning process over all the csv files
csv_files = list(cwd.glob('*.csv'))  # list all csv

for csv in csv_files:  # iterate list
    # Get data
    df = pd.read_csv(csv)

    # The cleaning operations for Steps 1, 2, 3 & 4 above
    df = df.drop_duplicates()  # drop duplicates, keeping the first occurrence
    df = trim_all_columns(df)  # trim extra spaces
    # df = standardize_text(df)  # change all text to lower case / WIP
    df = check_NaN(df, csv)  # check for missing values

melb_data.csv contains NaN values. Please check the following columns
Suburb              0
Address             0
Rooms               0
Type                0
Price               0
Method              0
SellerG             0
Date                0
Distance            0
Postcode            0
Bedroom2            0
Bathroom            0
Car                62
Landsize            0
BuildingArea     6450
YearBuilt        5375
CouncilArea      1369
Lattitude           0
Longtitude          0
Regionname          0
Propertycount       0
dtype: int64


From this initial processing, we can see that the data set contains NaN values. For now, we will simply drop every row that has a missing value, which reduces the data from 13,580 rows to 6,196.

In [50]:
#dropping NaN values
melb_data = melb_data.dropna(axis=0)

# listing all columns
melb_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

As we can see above, there are too many variables to consider for modelling.   
There are many ways to select a subset of the data; for now, we will focus on two approaches only.

a) Dot notation, which we use to select the _"prediction target"_  
b) Selecting with a column list, which we use to select the "features"

### **5) Model Creation**

#### 5.1) Selection of Prediction Target

This process starts by pulling out a variable with dot-notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

We'll use dot notation to select the column we want to predict, which is called the prediction target. By convention, the prediction target is called **y**. So the code we need to save the house prices in the Melbourne data is:

In [52]:
y = melb_data.Price
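
As a side note, here is a small sketch of how dot notation relates to bracket notation. The `toy_data` frame is a hypothetical miniature stand-in for `melb_data`. Dot notation only works when the column name is a valid Python identifier that does not clash with an existing DataFrame attribute, whereas brackets work for any column name.

```python
import pandas as pd

# Hypothetical miniature stand-in for melb_data
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [1035000.0, 1465000.0, 1600000.0],
})

y_dot = toy_data.Price             # dot notation
y_bracket = toy_data["Price"]      # bracket notation, same column
X_single = toy_data[["Price"]]     # a list of column names gives a DataFrame

print(type(y_dot).__name__)        # Series
print(y_dot.equals(y_bracket))     # True
print(type(X_single).__name__)     # DataFrame
```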

#### 5.2) Choosing _"Features"_

The columns that are inputted into our model (and later used to make predictions) are called "features." In our case, those would be the columns used to determine the home price. Sometimes, you will use all columns except the target as features. Other times you'll be better off with fewer features.

For now, we'll build a model with only a few features. 

We select multiple features by providing a list of column names inside brackets. 

In [54]:
melb_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

By standard convention, this data is called **X**.

In [55]:
X = melb_data[melb_features]

In [72]:
# Print statistics summary
X.describe()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| count | 6196.0 | 6196.0 | 6196.0 | 6196.0 | 6196.0 |
| mean | 2.931407 | 1.57634 | 471.00694 | -37.807904 | 144.990201 |
| std | 0.971079 | 0.711362 | 897.449881 | 0.07585 | 0.099165 |
| min | 1.0 | 1.0 | 0.0 | -38.16492 | 144.54237 |
| 25% | 2.0 | 1.0 | 152.0 | -37.855438 | 144.926198 |
| 50% | 3.0 | 1.0 | 373.0 | -37.80225 | 144.9958 |
| 75% | 4.0 | 2.0 | 628.0 | -37.7582 | 145.0527 |
| max | 8.0 | 8.0 | 37000.0 | -37.45709 | 145.52635 |


In [62]:
# Quick glance of the data
X.head()

| | Rooms | Bathroom | Landsize | Lattitude | Longtitude |
|---|---|---|---|---|---|
| 1 | 2 | 1.0 | 156.0 | -37.8079 | 144.9934 |
| 2 | 3 | 2.0 | 134.0 | -37.8093 | 144.9944 |
| 4 | 4 | 1.0 | 120.0 | -37.8072 | 144.9941 |
| 6 | 3 | 2.0 | 245.0 | -37.8024 | 144.9993 |
| 7 | 2 | 1.0 | 256.0 | -37.806 | 144.9954 |


#### 5.3) Modelling process

We will use the scikit-learn library to create the model. Scikit-learn is easily the most popular library for modeling the types of data typically stored in DataFrames.

The steps to building and using a model are:

* **Define**: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
* **Fit**: Capture patterns from provided data. This is the heart of modeling.
* **Predict**: Just what it sounds like
* **Evaluate**: Determine how accurate the model's predictions are.

Here is an example of defining a decision tree model with scikit-learn and fitting it with the features and target variable:

In [84]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

Many machine learning models allow some randomness in model training. Specifying a number for random_state ensures you get the same results in each run.   
This is considered good practice. You can use any number, and model quality won't depend meaningfully on exactly what value you choose.
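
The reproducibility claim can be checked with a small sketch. The data here is synthetic and the helper `fit_and_predict` is a hypothetical name used only for this illustration. Fitting twice with the same `random_state` should give identical predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = X_demo[:, 0] * 100_000 + rng.normal(0, 50_000, size=200)

def fit_and_predict(seed):
    """Fit a small tree with the given random_state and predict on the same data."""
    model = DecisionTreeRegressor(max_leaf_nodes=20, random_state=seed)
    model.fit(X_demo, y_demo)
    return model.predict(X_demo)

# Two separate fits with the same seed give the same predictions
same_seed_match = np.array_equal(fit_and_predict(1), fit_and_predict(1))
print(same_seed_match)
```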

We now have a fitted model that we can use to make predictions.

In practice, we'll want to make predictions for new houses coming on the market rather than the houses we already have prices for.   
But we'll make predictions for the first few rows of the training data to see how the predict function works.

In [85]:
print("Making price predictions for the following 5 houses:")
print(X.head())
print("The predicted prices (in AUD $) are")
print(melbourne_model.predict(X.head()))

Making price predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predicted prices (in AUD $) are
[1035000. 1465000. 1600000. 1876000. 1636000.]


#### 5.4) Model Validation

Measuring model quality is the key to iteratively improving our models.
In most applications, the relevant measure of model quality is predictive accuracy. In other words, will the model's predictions be close to the actual values?

There are many metrics for summarizing model quality, but we'll start with one called _Mean Absolute Error_ (also called **MAE**).   
Let's break down this metric starting with the last word, error.

The prediction error for each house is:   
_error = actual−predicted_

With the MAE metric, we take the absolute value of each error. This converts each error to a positive number.     
We then take the average of those absolute errors. This is our measure of model quality. In plain English, it can be said as

**_"On average, our predictions are off by about X"_**
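
Before relying on scikit-learn, the arithmetic can be worked through by hand. The three prices below are hypothetical and are chosen only to make the steps easy to follow.

```python
# Hypothetical actual and predicted prices (in dollars) for three houses
actual = [1_000_000, 800_000, 1_200_000]
predicted = [950_000, 900_000, 1_200_000]

# error = actual - predicted, one value per house
errors = [a - p for a, p in zip(actual, predicted)]

# Take the absolute value of each error, then average them
absolute_errors = [abs(e) for e in errors]
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [50000, -100000, 0]
print(mae)     # 50000.0
```

In this toy case, the predictions are off by 50,000 dollars on average.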

Let's calculate the MAE

In [86]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

1115.7467183128902

As you can see above, the error is very small relative to the average house price of about 1 million dollars. This is because we used the same data both to train and to test the model.   
Such a result is known as an "in-sample" score, and it is a misleading way to judge a model.

Since a model's practical value comes from making predictions on new data, we measure performance on data that wasn't used to build the model.    
The most straightforward way to do this is to exclude some data from the model-building process, and then use those to test the model's accuracy on data it hasn't seen before.   
This data is called **_validation data_**.
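
To see why an in-sample score flatters a model, consider a deliberately crude sketch: a "model" that simply memorizes the training houses and answers with the price of the closest one. This is a hypothetical stand-in for a fully grown tree, not the tree itself, and all the numbers are invented for illustration.

```python
# Hypothetical data: (rooms, landsize) -> price in dollars
train = {(2, 150): 1_000_000, (3, 300): 1_400_000, (4, 500): 1_900_000}
validation = {(2, 160): 900_000, (3, 280): 1_500_000}

def distance(a, b):
    """Crude distance between two houses; rooms are weighted more than landsize."""
    return abs(a[0] - b[0]) * 100 + abs(a[1] - b[1])

def predict(features):
    """Memorizing 'model': return the price of the closest training house."""
    closest = min(train, key=lambda known: distance(known, features))
    return train[closest]

def mae(data):
    """Mean absolute error of the memorizing model on a dict of houses."""
    return sum(abs(price - predict(f)) for f, price in data.items()) / len(data)

in_sample_mae = mae(train)        # every training house finds itself
validation_mae = mae(validation)  # unseen houses are only approximated

print(in_sample_mae)   # 0.0
print(validation_mae)  # 100000.0
```

The in-sample error is zero only because the model has already seen the answers, which says nothing about how it behaves on new houses.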

#### 5.5) Model Test/Train

The scikit-learn library has a function train_test_split to break up the data into two pieces.   
We'll use some of that data as training data to fit the model, and we'll use the other data as validation data to calculate the MAE.
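
As a quick sketch of how the split behaves, the example below uses a hypothetical array of 100 rows. When no size is given, `train_test_split` holds out 25% of the rows for validation; the `test_size` argument changes that share.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# Default split: 75% training, 25% validation
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=1)
print(len(tr_X), len(va_X))  # 75 25

# Explicit 80/20 split
tr_X2, va_X2, tr_y2, va_y2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(len(tr_X2), len(va_X2))  # 80 20
```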

In [99]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
## The split is based on a random number generator. Supplying a numeric value to the random_state argument guarantees we get the same split every time we run this script.

train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 1)

# Define model
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

251688.7630729503


As you can see above, the MAE for the in-sample data was about 1,100 dollars. Out of sample, it is more than 250,000 dollars!

This is the difference between a model that is almost exactly right (too good to be true), and one that is unusable for most practical purposes.    
As a point of reference, the average home value in the validation data is 1.1 million dollars. So the error in new data is about a quarter of the average home value.
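
The "about a quarter" figure is simple arithmetic on the two numbers quoted above. The variable names are only for this sketch, and the average home value is the rounded 1.1 million from the text.

```python
validation_mae = 251688.7630729503   # out-of-sample MAE printed above
average_home_value = 1_100_000       # approximate average quoted in the text

relative_error = validation_mae / average_home_value
print(f"{relative_error:.1%}")  # 22.9%
```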

We will continue to improve this model.

### **6) Model Tuning**

Tuning is the process of trying to improve model accuracy by experimenting with the available options. In the case of a Decision Tree Diagram, we may want to experiment with the number of splits. Each additional split creates more groups (leaves).

#### Underfitting vs. Overfitting

**Overfitting** might happen when we divide the houses amongst too many leaves, because each leaf then holds fewer houses.   
Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).

Meanwhile,

**Underfitting** might happen if we fail to split the houses into enough leaves.   
An underfitted model fails to capture important distinctions and patterns in the original dataset, so its predictions may be far off for most houses, even in the training data (and it will be bad in validation too, for the same reason).
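
The two failure modes can be illustrated with a small sketch on synthetic data (one feature on an even grid with a noisy target, nothing to do with the Melbourne prices). A tree with very few leaves typically has a high error on both sets, while a tree with no limit on leaves reaches zero training error without a matching gain on validation data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = np.linspace(0, 10, 400).reshape(-1, 1)
y_demo = 100_000 * np.sin(X_demo[:, 0]) + rng.normal(0, 30_000, size=400)

tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for leaves in [2, 20, None]:  # None puts no limit on the number of leaves
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print(f"max_leaf_nodes={leaves}: train MAE {train_mae:,.0f}, validation MAE {val_mae:,.0f}")
```

The gap between training and validation MAE for the unlimited tree is the signature of overfitting; a high error on both sets for the two-leaf tree is the signature of underfitting.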

For this particular model, there are a few alternatives for controlling the tree depth, and many allow for some routes through the tree to have greater depth than other routes.   
One of them is the **_max_leaf_nodes_** argument, which provides a very sensible way to control overfitting vs underfitting. 

We can use a utility function to help compare MAE scores from different values for _max_leaf_nodes_:

In [88]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    pred_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, pred_val)
    return mae

In [90]:
# compare MAE with differing values of max_leaf_nodes
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  369673
Max leaf nodes: 50  		 Mean Absolute Error:  266644
Max leaf nodes: 500  		 Mean Absolute Error:  243613
Max leaf nodes: 5000  		 Mean Absolute Error:  256227


In [94]:
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in [5, 50, 500, 5000]}
best_tree_size =  min(scores, key=scores.get)
print('Best tree size is:')
print(best_tree_size)

Best tree size is:
500


Of the options listed, 500 is the optimal number of leaves for this model.   
We finalize the model with this information.

In [107]:
# Use the best tree size found above
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)

# Fit the final model on all of the data (training and validation rows together)
final_model.fit(X, y)

DecisionTreeRegressor(max_leaf_nodes=500, random_state=1)

In [108]:
# get predicted prices on validation data
val_predictions = final_model.predict(val_X)
print("MAE for this improved model is:")
print(mean_absolute_error(val_y, val_predictions))

MAE for this improved model is:
124961.08322355292


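As you can see above, the MAE drops from over 250,000 to 124,961. This figure is flattering, however: the final model was fit on all of the data, so it has already seen the validation rows. The fair out-of-sample comparison is the tuning run above, where 500 leaves gave an MAE of about 243,613 against about 251,689 for the unconstrained tree.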

### **7) Alternative Model - _Random Forest_**

Modelling with Decision trees may leave us with a difficult decision. A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.

Exploring other models can lead to better performance. We'll look at the random forest as an example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters.
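
The averaging step can be checked directly with a small sketch on synthetic data. It relies on the fitted forest exposing its component trees through the `estimators_` attribute; the data and variable names are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 100_000 * X_demo[:, 0] + rng.normal(0, 20_000, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X_demo, y_demo)

# Average the predictions of the individual trees by hand
tree_predictions = np.array([tree.predict(X_demo) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

# The hand-made average agrees with the forest's own prediction
matches = np.allclose(manual_average, forest.predict(X_demo))
print(matches)
```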

In [106]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
rf_val_predictions = forest_model.predict(val_X)  # predictions, not a model
print("MAE for this Random Forest model is:")
print(mean_absolute_error(val_y, rf_val_predictions))

MAE for this Random Forest model is:
190414.59149025998


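As you can see above, the random forest brings the MAE down from over 250,000 for the single, untuned decision tree to about 190,415 on the same validation data, using default parameters.      
However, there is likely room for further improvement for this random forest model.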