# Data Exploration of Housing Prices

### Setting up Imports

In [14]:
import pandas as pd

### Creation of our DataFrame

In [15]:
df = pd.read_csv("melb_data.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13580 entries, 0 to 13579
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         13580 non-null  object 
 1   Address        13580 non-null  object 
 2   Rooms          13580 non-null  int64  
 3   Type           13580 non-null  object 
 4   Price          13580 non-null  float64
 5   Method         13580 non-null  object 
 6   SellerG        13580 non-null  object 
 7   Date           13580 non-null  object 
 8   Distance       13580 non-null  float64
 9   Postcode       13580 non-null  float64
 10  Bedroom2       13580 non-null  float64
 11  Bathroom       13580 non-null  float64
 12  Car            13518 non-null  float64
 13  Landsize       13580 non-null  float64
 14  BuildingArea   7130 non-null   float64
 15  YearBuilt      8205 non-null   float64
 16  CouncilArea    12211 non-null  object 
 17  Lattitude      13580 non-null  float64
 18  Longti

### Describing the numeric qualities of our data
- `describe()` reports the count, mean, standard deviation, min, 25%, 50%, 75%, and max of each numeric column
  - The 25th percentile is the value that is larger than 25% of the data set and smaller than 75%
  - The 50th percentile (the median) is the middle value
  - The 75th percentile is similar: the value that is larger than 75% of the data set and smaller than 25%
- Together, these show the distribution of each numeric column
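
As a minimal sketch of how these percentiles relate to the data, the example below uses a hypothetical five-value Series (not taken from `melb_data.csv`) and compares `quantile()` with the rows reported by `describe()`. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, chosen so the percentiles are easy to verify by eye
prices = pd.Series([100, 200, 300, 400, 500])

# quantile(q) returns the value below which a fraction q of the data falls
q25 = prices.quantile(0.25)
q50 = prices.quantile(0.50)
q75 = prices.quantile(0.75)

# describe() reports the same values in its 25%, 50%, and 75% rows
summary = prices.describe()
print(q25, q50, q75)
print(summary[["25%", "50%", "75%"]])
```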

In [16]:
df.describe()

Unnamed: 0,Rooms,Price,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Lattitude,Longtitude,Propertycount
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,3105.301915,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217,-37.809203,144.995216,7454.417378
std,0.955748,639310.7,5.868725,90.676964,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762,0.07926,0.103916,4378.581772
min,1.0,85000.0,0.0,3000.0,0.0,0.0,0.0,0.0,0.0,1196.0,-38.18255,144.43181,249.0
25%,2.0,650000.0,6.1,3044.0,2.0,1.0,1.0,177.0,93.0,1940.0,-37.856822,144.9296,4380.0
50%,3.0,903000.0,9.2,3084.0,3.0,1.0,2.0,440.0,126.0,1970.0,-37.802355,145.0001,6555.0
75%,3.0,1330000.0,13.0,3148.0,3.0,2.0,2.0,651.0,174.0,1999.0,-37.7564,145.058305,10331.0
max,10.0,9000000.0,48.1,3977.0,20.0,8.0,10.0,433014.0,44515.0,2018.0,-37.40853,145.52635,21650.0


## Machine Learning Model Creation
---
### First, we'll preview our available data columns

In [17]:
df.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

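#### Several columns contain null values (for example, `BuildingArea` has only 7130 non-null entries out of 13580).  To remedy this, we'll use the DataFrame method `.dropna()` with an axis of 0, which drops every row that contains a null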
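To see what dropping rows does before applying it to the full data set, here is a small sketch on a hypothetical three-row frame. The values are made up for illustration; only the column names are borrowed from the data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one null in each of two columns
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, np.nan, 142.0],
    "YearBuilt": [1900.0, 1950.0, np.nan],
})

# Count the nulls in each column before dropping anything
null_counts = toy.isna().sum()
print(null_counts)

# axis=0 drops every ROW that contains at least one null
cleaned = toy.dropna(axis=0)
print(len(toy), "rows ->", len(cleaned), "rows")
```

Note that a row is removed if any single column is null, which is why the row count can fall sharply when a few columns are sparsely populated.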

In [18]:
df = df.dropna(axis=0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6196 entries, 1 to 12212
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Suburb         6196 non-null   object 
 1   Address        6196 non-null   object 
 2   Rooms          6196 non-null   int64  
 3   Type           6196 non-null   object 
 4   Price          6196 non-null   float64
 5   Method         6196 non-null   object 
 6   SellerG        6196 non-null   object 
 7   Date           6196 non-null   object 
 8   Distance       6196 non-null   float64
 9   Postcode       6196 non-null   float64
 10  Bedroom2       6196 non-null   float64
 11  Bathroom       6196 non-null   float64
 12  Car            6196 non-null   float64
 13  Landsize       6196 non-null   float64
 14  BuildingArea   6196 non-null   float64
 15  YearBuilt      6196 non-null   float64
 16  CouncilArea    6196 non-null   object 
 17  Lattitude      6196 non-null   float64
 18  Longtit

### Selecting the Prediction Target, using Dot Notation

In [19]:
# y variable corresponds to prediction target
y = df.Price

### Selecting "features" that will be used to determine our target (price)
- Depending on the situation, we may use all columns as "features", or we may target a few specific features instead
- We can also use different combinations of "features" in training our models, and compare to see which yields higher accuracy in results

In [20]:
# Providing list of column names (features) we will use
features = ["Rooms","Bathroom","Landsize","Lattitude","Longtitude"]

# Creating our X (feature data set) by filtering our original DF
X = df[features]

# Reviewing X's numeric qualities
X.describe()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635


In [21]:
# Further review & exploration
X.head()

Unnamed: 0,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


#### **General Model Creation Steps:**
---
1. **Define**
    - What type of model will you use? 
2. **Fit**
    - Capture patterns from the provided data - the heart of modeling!!
3. **Predict**
    - Make your prediction 
4. **Evaluate**
    - Determine accuracy of model's predictions
---
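
The four steps above can be sketched end to end on synthetic data. The data and the `toy_` names below are assumptions made for illustration and are separate from the housing data used in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 50 rows, 2 features, target is a simple function of both
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(50, 2))
toy_y = 3 * toy_X[:, 0] + toy_X[:, 1]

model = DecisionTreeRegressor(random_state=0)    # 1. Define
model.fit(toy_X, toy_y)                          # 2. Fit
toy_pred = model.predict(toy_X)                  # 3. Predict
toy_mae = mean_absolute_error(toy_y, toy_pred)   # 4. Evaluate (on the training rows here)
print(toy_mae)
```

Because this sketch evaluates on the same rows it was fitted on, the error it prints says little about how the model would do on new data.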

#### We will be utilizing scikit-learn's Decision Tree Model for this prediction

In [22]:
from sklearn.tree import DecisionTreeRegressor
# Including additional imports used in data validation
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Specifying random_state allows us to get the same results with each execution
# This ensures we are able to reproduce the same model
tree_model = DecisionTreeRegressor(random_state=1)

# Fitting of model
tree_model.fit(X, y)

#### To illustrate use of the model, we will make predictions for the first five rows (the head) of our training data set

In [23]:
print("Calculating price predictions for the following 5 houses:")
print(X.head())
print("Predicted price values are:")
print(tree_model.predict(X.head()))

Calculating price predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
Predicted price values are:
[1035000. 1465000. 1600000. 1876000. 1636000.]


## Model Validation
---
### Measuring Quality of Models
- Typically, we'll summarize the quality of our model into a single metric
- Here, we will start with **Mean Absolute Error** (MAE)
  - For each house: **error = actual - predicted**
  - MAE takes the absolute value of each error, then averages them
- "On average, our model predictions are off by about X"
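
As a small worked example, the sketch below computes MAE by hand on three hypothetical price pairs and checks it against scikit-learn's `mean_absolute_error`. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# error = actual - predicted, which can be positive or negative
errors = actual - predicted

# Take absolute values so over- and under-predictions do not cancel, then average
manual_mae = np.mean(np.abs(errors))

# scikit-learn performs the same calculation
sklearn_mae = mean_absolute_error(actual, predicted)
print(manual_mae, sklearn_mae)
```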

In [24]:
predicted_prices = tree_model.predict(X)
error_value = mean_absolute_error(y, predicted_prices)
print(f"On average, our model predictions are off by about ${error_value}")

On average, our model predictions are off by about $1115.7467183128902


### A brief note on In-Sample Scores:
- An in-sample score is computed on the same "sample" of data that was used to build the model
- This can make a model look more accurate than it is, because it may pick up on trends that only hold in our sample
  - EX: Home data sample showing houses with green doors being expensive, so model interprets this as a pattern
  - Since this is true in the sample, our model seems accurate
  - But if our model sees new data that contradicts that pattern, its accuracy will drop in practice
- We avoid this by measuring accuracy on *new* data that the model has not seen!
- **Validation Data** - A section of data that is excluded from model-building to instead be used for model accuracy testing
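
The gap between in-sample and validation scores can be illustrated on synthetic data. The sketch below is an assumption-laden toy (noisy linear data and hypothetical `tr_`/`va_` names), not the housing data itself.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: one feature, target is linear plus random noise
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(200, 1))
toy_y = 2 * toy_X[:, 0] + rng.normal(0, 1, size=200)

# By default, train_test_split holds out 25% of the rows for validation
tr_X, va_X, tr_y, va_y = train_test_split(toy_X, toy_y, random_state=0)

model = DecisionTreeRegressor(random_state=0).fit(tr_X, tr_y)

# In-sample: scored on the rows the tree was fitted on
in_sample_mae = mean_absolute_error(tr_y, model.predict(tr_X))
# Out-of-sample: scored on rows the tree has never seen
validation_mae = mean_absolute_error(va_y, model.predict(va_X))
print(in_sample_mae, validation_mae)
```

An unrestricted tree can memorize the noise in its training rows, so its in-sample error is far smaller than its validation error.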

### Creating our Training & Testing Sets

In [25]:
# Unpack tuple of values representing training and validation data sets
# Like model creation, we can specify random_state to ensure the split is reproducible
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Defining a new model
# Note: no random_state is passed here, so the fitted tree (and the MAE below) may vary between runs
split_tree_model = DecisionTreeRegressor()

# Using our training sets to fit our model
split_tree_model.fit(train_X,train_y)

# Finally, make predictions using our validation data
val_predictions = split_tree_model.predict(val_X)

# Calculating a new absolute error, using our validation data and predicted values
error_margin = mean_absolute_error(val_y, val_predictions)
print(f"On average, our model predictions are off by about ${error_margin}")

# Note the HUGE jump in error!  This is because we are now testing our model with out-of-sample values
# Our new error is about $250,000, versus about $1,100 in-sample, so the in-sample score was far too optimistic
# An error this large limits the model's usefulness in real-world application

On average, our model predictions are off by about $254295.8373143964


## Underfitting & Overfitting
---

### Underfitting
- Scenario: A shallow decision tree that splits into only a small number of leaves (2-4)
  - There are a large number of houses within each leaf's category
  - There is so much variation between the houses in a leaf that predictions will be inaccurate on both training and validation data - the model cannot "see" patterns in the data

### Overfitting
- Scenario: A deep decision tree that splits into a very large number of leaves
  - There are only a few houses that will fall into each leaf's category
  - This means that the model will match the training data nearly perfectly, but since each prediction is based on a small sample, it becomes inaccurate when faced with new data

### Solution
- Our goal is to reach the middle ground between both of these, where the error on validation data is lowest
- We control tree size using the DecisionTreeRegressor argument `max_leaf_nodes`, which caps the number of leaves (and so, indirectly, how deep the tree can grow)
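
To make the effect of `max_leaf_nodes` concrete, the sketch below fits trees with different caps on synthetic data and inspects the resulting leaf count and depth. The data and the `cap` values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: 300 rows with a noisy target, so an uncapped tree grows large
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(300, 2))
toy_y = toy_X[:, 0] * toy_X[:, 1] + rng.normal(0, 1, size=300)

leaf_counts = {}
for cap in [5, 50, None]:  # None means no limit on the number of leaves
    tree = DecisionTreeRegressor(max_leaf_nodes=cap, random_state=0)
    tree.fit(toy_X, toy_y)
    leaf_counts[cap] = tree.get_n_leaves()
    print(f"max_leaf_nodes={cap}: leaves={tree.get_n_leaves()}, depth={tree.get_depth()}")
```

A small cap forces many rows into each leaf (the underfitting end), while no cap lets the tree keep splitting until its leaves hold very few rows (the overfitting end).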

### Example Comparison of MAE based on `max_leaf_nodes`

In [26]:
# Example function, which will utilize previously created training/validation sets
def calculate_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Setting random state to ensure model can be reproduced
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    predicted_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, predicted_val)
    return mae

In [32]:
# Example execution of function, using variety of max_leaf_nodes values
for max_nodes in [5, 25, 50, 100, 250, 350, 500, 1000]:
    mae = calculate_mae(max_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_nodes, mae))

Max leaf nodes: 5  		 Mean Absolute Error:  369673
Max leaf nodes: 25  		 Mean Absolute Error:  283377
Max leaf nodes: 50  		 Mean Absolute Error:  266644
Max leaf nodes: 100  		 Mean Absolute Error:  256533
Max leaf nodes: 250  		 Mean Absolute Error:  240719
Max leaf nodes: 350  		 Mean Absolute Error:  242986
Max leaf nodes: 500  		 Mean Absolute Error:  243613
Max leaf nodes: 1000  		 Mean Absolute Error:  244793


### Updating model with optimal size
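The best size can also be picked programmatically instead of by eye. As a minimal sketch, the dictionary below copies the MAE values from the comparison output above; `mae_by_size` and `best_size` are illustrative names.

```python
# MAE values copied from the max_leaf_nodes comparison output
mae_by_size = {
    5: 369673, 25: 283377, 50: 266644, 100: 256533,
    250: 240719, 350: 242986, 500: 243613, 1000: 244793,
}

# Pick the key (max_leaf_nodes value) whose MAE is smallest
best_size = min(mae_by_size, key=mae_by_size.get)
print(best_size)
```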

In [34]:
# 250 max leaf nodes had lowest MAE
optimal_model = DecisionTreeRegressor(max_leaf_nodes=250, random_state=0)
# This is our final model, so we fit it on ALL of the data rather than only the training set
optimal_model.fit(X, y)
# Caution: val_X is now part of the data the model was fitted on, so the MAE below
# is an in-sample score and will look better than performance on truly new data
# The validation MAE from the comparison above (240719) remains the more honest estimate
predicted_val = optimal_model.predict(val_X)
mae = mean_absolute_error(val_y, predicted_val)
print(f"On average, our model predictions are off by about ${mae}")

On average, our model predictions are off by about $165272.73256425172


## Random Forest Models
---
- Similar to Decision Trees, but it uses many trees and makes its prediction by averaging the predictions of *each* component tree
- Has the same parameter structure as a Decision Tree, but generally predicts more accurately than a single tree
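
The averaging behaviour can be checked directly. The sketch below fits a small forest on synthetic data (an assumption for illustration, with `n_estimators=10` chosen only to keep it quick) and compares the forest's prediction with the mean of its individual trees, which scikit-learn exposes as `estimators_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 100 rows, 2 features, noisy linear target
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(100, 2))
toy_y = toy_X[:, 0] + 2 * toy_X[:, 1] + rng.normal(0, 1, size=100)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(toy_X, toy_y)

# Collect each component tree's predictions for the first five rows
per_tree = np.array([tree.predict(toy_X[:5]) for tree in forest.estimators_])

# Averaging across trees should reproduce the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(toy_X[:5])
print(np.allclose(manual_average, forest_prediction))
```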

### Example Random Forest Housing Price Model

In [35]:
# Importing random forest regressor class
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=0)
forest_model.fit(train_X, train_y)
price_prediction = forest_model.predict(val_X)
mae = mean_absolute_error(val_y, price_prediction)
print(f"On average, our model predictions are off by about ${mae}")
# For comparison, our best single tree model had a validation MAE of about $240,000!

On average, our model predictions are off by about $192287.89755541208
