## Selecting Data for Modeling

We'll start by picking a few variables using intuition. <br>
To choose variables, we need to see a list of all columns in the dataset. The DataFrame's `columns` attribute provides that list (code below).

In [4]:
%%time
import pandas as pd

melbourne_file_path = '../data/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path)
melbourne_data.columns

CPU times: user 38.6 ms, sys: 9.07 ms, total: 47.6 ms
Wall time: 48.7 ms


Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')

The Melbourne data has some missing values: some houses weren't recorded completely. <br>
For now, we'll simply drop the rows that contain missing values.

In [5]:
melbourne_data = melbourne_data.dropna(axis=0)  # axis=0 drops every row that contains a missing value
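
Before dropping rows, it can help to see how much data is missing. The following is a minimal sketch on a made-up three-row frame (the `toy` DataFrame and its values are illustrative assumptions, not taken from the Melbourne data). It counts missing values per column and then applies the same `dropna(axis=0)` call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from melb_data.csv
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [79.0, np.nan, 142.0],
    "YearBuilt": [1900.0, 1950.0, np.nan],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# axis=0 drops every row that contains at least one missing value
toy_clean = toy.dropna(axis=0)
print(toy_clean.shape)
```

Only the first toy row has no gaps, so it is the only one that survives. Dropping rows this way is simple, but it can discard a large share of a dataset when missing values are spread over many rows.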

## Selecting the Prediction Target

With pandas, we can use dot notation to select a single column. The result is a Series object. <br>
We'll use dot notation to select the column we want to predict. This column is called the "prediction target". <br>
By convention, the prediction target is called `y`.

In [6]:
y = melbourne_data.Price
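
As a small illustration of the selection above, the sketch below compares dot notation with bracket notation on a made-up frame (the `toy` DataFrame is a hypothetical example, not the Melbourne data). Both forms should return the same Series.

```python
import pandas as pd

# Hypothetical toy data, not from melb_data.csv
toy = pd.DataFrame({"Price": [1035000.0, 1465000.0], "Rooms": [2, 3]})

y_dot = toy.Price         # attribute (dot) access
y_bracket = toy["Price"]  # bracket access, works for any column name

print(type(y_dot).__name__)
print(y_dot.equals(y_bracket))
```

Bracket access is the safer choice when a column name contains spaces or collides with an existing DataFrame attribute such as `count`.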

## Choosing "Features"

The columns that are fed into the model and used to make predictions are called "features". <br>
In this case, we select the columns that could plausibly help predict the home price.

In [7]:
features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']

In [8]:
# By convention, this data is called X.
X = melbourne_data[features]
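
Indexing a DataFrame with a list of column names returns a new DataFrame that holds only those columns. A minimal sketch, assuming a made-up frame (`toy` and `toy_features` are hypothetical names used only for illustration):

```python
import pandas as pd

# Hypothetical toy data, not from melb_data.csv
toy = pd.DataFrame({
    "Rooms": [2, 3],
    "Bathroom": [1.0, 2.0],
    "Price": [1035000.0, 1465000.0],
})

toy_features = ["Rooms", "Bathroom"]
X_toy = toy[toy_features]  # a list of names selects a DataFrame, not a Series

print(type(X_toy).__name__)
print(list(X_toy.columns))
```

The target column is deliberately left out of the feature list, so the model never sees the value it is supposed to predict.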

Now we can review the data that we'll use to predict house prices.

In [10]:
X.describe()

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
count,6196.0,6196.0,6196.0,6196.0,6196.0
mean,2.931407,1.57634,471.00694,-37.807904,144.990201
std,0.971079,0.711362,897.449881,0.07585,0.099165
min,1.0,1.0,0.0,-38.16492,144.54237
25%,2.0,1.0,152.0,-37.855438,144.926198
50%,3.0,1.0,373.0,-37.80225,144.9958
75%,4.0,2.0,628.0,-37.7582,145.0527
max,8.0,8.0,37000.0,-37.45709,145.52635


In [12]:
X.head()

,Rooms,Bathroom,Landsize,Lattitude,Longtitude
1,2,1.0,156.0,-37.8079,144.9934
2,3,2.0,134.0,-37.8093,144.9944
4,4,1.0,120.0,-37.8072,144.9941
6,3,2.0,245.0,-37.8024,144.9993
7,2,1.0,256.0,-37.806,144.9954


## Building the Model

We'll use the `scikit-learn` library to create the model. Scikit-learn is probably the most popular library for modeling data stored in DataFrames. <br>

The steps to build and use a model:
1. <b> Define the type of model. </b> In our case, it is a decision tree.
2. <b> Train the model. </b> Capture patterns from the provided data.
3. <b> Predict. </b>
4. <b> Evaluate. </b> Determine how accurate the model's predictions are.
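
The four steps can be sketched end to end on a tiny made-up dataset. The toy values and the choice of `mean_absolute_error` as the evaluation metric are assumptions made for illustration; the notebook itself does not specify a metric.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.tree import DecisionTreeRegressor

# Hypothetical toy data, not from melb_data.csv
toy_X = pd.DataFrame({"Rooms": [1, 2, 3, 4], "Bathroom": [1, 1, 2, 2]})
toy_y = pd.Series([300000.0, 450000.0, 700000.0, 900000.0])

# 1. Define the type of model
toy_model = DecisionTreeRegressor(random_state=1)

# 2. Train (fit) the model on features and target
toy_model.fit(toy_X, toy_y)

# 3. Predict, here for the same rows the model was trained on
toy_pred = toy_model.predict(toy_X)

# 4. Evaluate: average absolute gap between true and predicted prices
toy_mae = mean_absolute_error(toy_y, toy_pred)
print(toy_mae)
```

Because this sketch evaluates on the rows used for training, the error it reports says little about how the model would do on unseen houses.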

In [18]:
%%time
from sklearn.tree import DecisionTreeRegressor

# Define the model; setting random_state=1 ensures the same results on each run
model = DecisionTreeRegressor(random_state=1)

# Train/fit model
model.fit(X, y)

CPU times: user 38.6 ms, sys: 3.78 ms, total: 42.4 ms
Wall time: 41.5 ms


In practice, a model should be evaluated on data it was not trained on. For now, though, we'll make predictions for the first few rows of the training data just to see how the `predict` function works.

In [65]:
# %%time
print("Data we are trying to predict (features plus the Price column):")
print(melbourne_data[features + ['Price']].head())
print("\nPredictions:")
print(model.predict(X.head()))

Data we are trying to predict (features plus the Price column):
   Rooms  Bathroom  Landsize  Lattitude  Longtitude      Price
1      2       1.0     156.0   -37.8079    144.9934  1035000.0
2      3       2.0     134.0   -37.8093    144.9944  1465000.0
4      4       1.0     120.0   -37.8072    144.9941  1600000.0
6      3       2.0     245.0   -37.8024    144.9993  1876000.0
7      2       1.0     256.0   -37.8060    144.9954  1636000.0

Predictions:
[1035000. 1465000. 1600000. 1876000. 1636000.]
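
The predictions match the recorded prices exactly because the tree has already seen these rows during training. A hedged sketch of the alternative, holding out part of the data for evaluation, is shown below. The synthetic data, the variable names, and the use of `train_test_split` and `mean_absolute_error` are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (illustrative assumption): price grows with rooms and land size, plus noise
rng = np.random.default_rng(0)
rooms = rng.integers(1, 6, size=200)
landsize = rng.uniform(100, 800, size=200)
price = 200000 * rooms + 500 * landsize + rng.normal(0, 50000, size=200)

X_syn = pd.DataFrame({"Rooms": rooms, "Landsize": landsize})
y_syn = pd.Series(price)

# Hold out a quarter of the rows (the default split) for validation
X_train, X_val, y_train, y_val = train_test_split(X_syn, y_syn, random_state=1)

syn_model = DecisionTreeRegressor(random_state=1)
syn_model.fit(X_train, y_train)

# Compare the error on rows the model has seen with rows it has not
train_mae = mean_absolute_error(y_train, syn_model.predict(X_train))
val_mae = mean_absolute_error(y_val, syn_model.predict(X_val))
print(f"MAE on training rows: {train_mae:.0f}")
print(f"MAE on held-out rows: {val_mae:.0f}")
```

The training error should be essentially zero, while the held-out error should be clearly larger, which is why in-sample predictions are not a reliable measure of model quality.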
