In [11]:
import pandas as pd

home_data = pd.read_csv('melb_data.csv')
home_data.describe()

# What is the average lot size (rounded to nearest integer)?
avg_lot_size = 10517

# As of today, how old is the newest home (current year minus the year it was built)?
newest_home_age = 11
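
The two answers above are typed in by hand. As a minimal sketch, they could instead be derived from the data. This assumes the lot-size and construction-year columns are named `LotArea` and `YearBuilt`; `LotArea` is a hypothetical name here (the columns listed for `melb_data.csv` include `Landsize` instead), and the toy frame is illustrative, not real data.

```python
from datetime import date

import pandas as pd

# Hypothetical toy data standing in for the real CSV (assumed column names)
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9500],
    "YearBuilt": [2003, 1976, 2001, 1915],
})

# Average lot size, rounded to the nearest integer
avg_lot_size = int(round(toy["LotArea"].mean()))

# Age of the newest home: current year minus the latest construction year
newest_home_age = date.today().year - int(toy["YearBuilt"].max())

print(avg_lot_size, newest_home_age)
```

Computing the age from `date.today()` keeps the answer correct whenever the notebook is rerun, whereas a typed-in number goes stale.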

The newest house in your data isn't that new. A few potential explanations for this:

1. They haven't built new houses where this data was collected.
2. The data was collected a long time ago. Houses built after the data publication wouldn't show up.

If the reason is explanation #1 above, does that affect your trust in the model you build with this data? What about if it is reason #2?

How could you dig into the data to see which explanation is more plausible?





If explanation #1 holds (“They haven't built new houses where this data was collected”), we should still account for the time elapsed and its effects, such as asset depreciation. We could either add features that capture these effects or otherwise adjust the original model.

If the reason is #2 (“The data was collected a long time ago. Houses built after the data publication wouldn't show up”), the model is probably already outdated.

Either way, much depends on how the model is used and what we expect of it. For instance, if we train the model on data from one city, we shouldn’t expect good predictions for another city whose housing market differs substantially. If the data comes from a small town where things change slowly rather than from a big city, the model is likely to stay useful for longer.

Knowing that the data is from Iowa is a good starting point for understanding what’s happening, even before digging into the data; for example, Iowa’s population is below the average for US states. But the data itself offers more direct evidence. If we look at YearBuilt, we’ll see that several houses in our dataset were built in the same year as the “newest” one, and several more in the year before, so it would be strange if construction had simply stopped. Moreover, there are no recorded sales after 2010, which strongly suggests that the actual reason is #2.

Again, whether this is a problem depends on our needs; it wouldn’t be surprising if it were. There are also other ways to explain the situation.
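
The checks described above can be sketched roughly as follows. The column names `YearBuilt` and `YrSold` are assumptions about the dataset being discussed, and the toy frame is made up for illustration; it does not reproduce the real numbers.

```python
import pandas as pd

# Hypothetical toy data; the real frame would come from the CSV
toy = pd.DataFrame({
    "YearBuilt": [2008, 2009, 2009, 2010, 2010, 2010, 1995],
    "YrSold":    [2008, 2009, 2010, 2010, 2010, 2010, 2007],
})

# How many homes were built in each of the most recent construction years?
built_per_year = toy["YearBuilt"].value_counts().sort_index()
print(built_per_year.tail(3))

# Latest sale on record versus the newest construction year
latest_sale = int(toy["YrSold"].max())
newest_build = int(toy["YearBuilt"].max())
print(latest_sale, newest_build)
```

If construction counts stay healthy right up to the final year and the last recorded sale falls in that same year, the data most likely just ends there, which points to explanation #2.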

In [2]:
home_data.columns

Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

In [3]:
home_data = home_data.dropna(axis=0)
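
`dropna(axis=0)` removes every row that has at least one missing value, which can discard a lot of data. A small sketch on a made-up frame (the values are illustrative only) shows the effect and how to count what was lost:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with missing values in two different columns
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, np.nan, 142.0],
    "YearBuilt": [1900.0, 1990.0, np.nan],
})

# Missing values per column, before dropping anything
print(toy.isna().sum())

# Drop every row that contains at least one missing value
cleaned = toy.dropna(axis=0)
rows_dropped = len(toy) - len(cleaned)
print(rows_dropped)
```

The surviving rows keep their original index labels, which is why the index in the later outputs has gaps.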

In [4]:
y = home_data.Price

In [5]:
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude'] 

In [6]:
X = home_data[melbourne_features]

In [7]:
X.describe()

          Rooms  Bathroom    Landsize   Lattitude  Longtitude
count    6196.0    6196.0      6196.0      6196.0      6196.0
mean   2.931407   1.57634   471.00694  -37.807904  144.990201
std    0.971079  0.711362  897.449881     0.07585    0.099165
min         1.0       1.0         0.0   -38.16492   144.54237
25%         2.0       1.0       152.0  -37.855438  144.926198
50%         3.0       1.0       373.0   -37.80225    144.9958
75%         4.0       2.0       628.0    -37.7582    145.0527
max         8.0       8.0     37000.0   -37.45709   145.52635


In [8]:
X.head()

   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954


The steps to building and using a model are:

- Define: What type of model will it be? A decision tree? Some other type of model? Other parameters of the model type are specified here too.
- Fit: Capture patterns from the provided data. This is the heart of modeling.
- Predict: Use the fitted model to generate predictions for given inputs.
- Evaluate: Determine how accurate the model's predictions are.
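
The four steps can be sketched end to end on synthetic data. Everything below (the random features, the made-up price formula, the choice of mean absolute error as the metric) is an assumption for illustration, not part of the notebook's dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: two features and a made-up price formula
rng = np.random.default_rng(0)
X_toy = rng.uniform(1, 5, size=(50, 2))
y_toy = 100_000 * X_toy[:, 0] + 50_000 * X_toy[:, 1]

# Define: choose the model type and its parameters
model = DecisionTreeRegressor(random_state=1)

# Fit: capture patterns from the provided data
model.fit(X_toy, y_toy)

# Predict: generate predictions (here, for the training rows themselves)
preds = model.predict(X_toy)

# Evaluate: mean absolute error between actual and predicted values
mae = mean_absolute_error(y_toy, preds)
print(mae)
```

Because the model is scored on the same rows it was fitted on, this error is optimistic; an unconstrained decision tree can essentially memorize its training data.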


In [9]:

from sklearn.tree import DecisionTreeRegressor

# Define the model. Specify a number for random_state to ensure the same results on each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# fit model
melbourne_model.fit(X, y)

DecisionTreeRegressor(random_state=1)

In [10]:
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
4      4       1.0     120.0   -37.8072    144.9941
6      3       2.0     245.0   -37.8024    144.9993
7      2       1.0     256.0   -37.8060    144.9954
The predictions are
[1035000. 1465000. 1600000. 1876000. 1636000.]
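
The five houses above were also part of the data the model was fitted on, so these predictions say little about how it would do on unseen homes. One common way to check that is to hold out part of the data. The sketch below uses synthetic data and scikit-learn's `train_test_split`; the noisy price formula and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with noise added to the target
rng = np.random.default_rng(0)
X_toy = rng.uniform(1, 5, size=(200, 2))
noise = rng.normal(0, 20_000, size=200)
y_toy = 100_000 * X_toy[:, 0] + 50_000 * X_toy[:, 1] + noise

# Hold out a quarter of the rows (the default test_size) for validation
train_X, val_X, train_y, val_y = train_test_split(X_toy, y_toy, random_state=1)

model = DecisionTreeRegressor(random_state=1)
model.fit(train_X, train_y)

# Error on rows the model has seen versus rows it has not
train_mae = mean_absolute_error(train_y, model.predict(train_X))
val_mae = mean_absolute_error(val_y, model.predict(val_X))
print(train_mae, val_mae)
```

The gap between the two errors gives a more honest picture of the model than predictions on training rows alone.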
