This is an introduction to machine learning.

The scenario used:
Your cousin has made millions of dollars speculating on real estate. He's offered to become business partners with you because of your interest in data science. He'll supply the money, and you'll supply models that predict how much various houses are worth.

A basic model: decision tree
A decision tree is a hierarchical decision-support model that represents decisions and their possible consequences as a tree-like structure.

We use data to decide how to break the houses into two groups, and then again to determine the predicted price in each group. This step of capturing patterns from data is called fitting or training the model. The data used to fit the model is called the training data.

The details of how the model is fit (e.g. how to split up the data) are complex enough that we will save them for later. After the model has been fit, you can apply it to new data to predict prices of additional homes.

The predicted price for the house is at the bottom of the tree. The point at the bottom where we make a prediction is called a leaf.

Visual representation of a decision tree: https://storage.googleapis.com/kaggle-media/learn/images/prAjgku.png
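A decision tree can be read as a set of nested if/else checks that end in a leaf. The sketch below is a hand-written toy tree, not one fitted to the Melbourne data: the function name `predict_price`, the thresholds, and the prices are all made up for illustration.

```python
def predict_price(rooms, distance_km):
    """Toy two-level decision tree. Every threshold and price here is hypothetical."""
    if rooms <= 2:
        # First split: smaller homes
        if distance_km <= 10:
            return 700_000    # leaf: small home close to the centre
        return 500_000        # leaf: small home far from the centre
    else:
        # First split: larger homes
        if distance_km <= 10:
            return 1_300_000  # leaf: large home close to the centre
        return 900_000        # leaf: large home far from the centre


# Follow the branches for one house until a leaf is reached
print(predict_price(rooms=3, distance_km=7.5))
```

A fitted model works the same way at prediction time. The difference is that the split thresholds and the leaf values are learned from the training data instead of being written by hand.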

1. Preparation for machine learning

In [1]:
import pandas as pd

The most important part of the Pandas library is the DataFrame. A DataFrame holds the type of data you might think of as a table. This is similar to a sheet in Excel, or a table in a SQL database.
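As a minimal sketch of the table idea, the block below builds a small DataFrame by hand. The variable name `homes` and all values are hypothetical; only the column names are borrowed from the Melbourne data.

```python
import pandas as pd

# Hypothetical three-row table; the values are illustrative only
homes = pd.DataFrame({
    "Rooms": [3, 2, 4],
    "Distance": [3.0, 10.4, 7.5],
    "Price": [1_200_000, 650_000, 1_100_000],
})

print(homes.shape)            # (number of rows, number of columns)
print(homes["Rooms"].mean())  # a column behaves like a labelled array
```

Each dictionary key becomes a column and each list position becomes a row, much like filling in a spreadsheet column by column.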

In [2]:
# data initialisation
# save filepath to variable for easier access
# the string below is the location of the data file; adjust it to match your own machine
melbourne_file_path = 'Desktop/MELBOURNE_HOUSE_PRICES_LESS.csv'
# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path) 

In [3]:
# data representation
# print a summary of the data in Melbourne data
print(melbourne_data.describe())
# print a list of columns contained in the data file
print(melbourne_data.columns)
# print a small part of the table in the data file
print(melbourne_data.head())

              Rooms         Price      Postcode  Propertycount      Distance
count  63023.000000  4.843300e+04  63023.000000   63023.000000  63023.000000
mean       3.110595  9.978982e+05   3125.673897    7617.728131     12.684829
std        0.957551  5.934989e+05    125.626877    4424.423167      7.592015
min        1.000000  8.500000e+04   3000.000000      39.000000      0.000000
25%        3.000000  6.200000e+05   3056.000000    4380.000000      7.000000
50%        3.000000  8.300000e+05   3107.000000    6795.000000     11.400000
75%        4.000000  1.220000e+06   3163.000000   10412.000000     16.700000
max       31.000000  1.120000e+07   3980.000000   21650.000000     64.100000
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Postcode', 'Regionname', 'Propertycount', 'Distance',
       'CouncilArea'],
      dtype='object')
         Suburb           Address  Rooms Type      Price Method   SellerG  \
0    Abbotsford     49 Lithgow St      3

In [4]:
# refining the data file
# The Melbourne data has some missing values (some houses for which some variables weren't recorded.)
# dropna drops missing values (think of na as "not available")
melbourne_data = melbourne_data.dropna(axis=0)
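To see what `dropna(axis=0)` does, here is a small sketch on a hypothetical table (`toy` and its values are made up). In the summary printed earlier, Price has 48,433 recorded values while the other numeric columns have 63,023, so rows without a price are the kind of rows that get removed.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing price
toy = pd.DataFrame({
    "Rooms": [3, 2, 4],
    "Price": [1_200_000, np.nan, 1_100_000],
})

# axis=0 drops every row that contains at least one missing value
cleaned = toy.dropna(axis=0)

print(len(toy), len(cleaned))
print(cleaned)
```

Note that the original row labels are kept, so the index of the cleaned table can have gaps.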

2. Select data for modeling

In [5]:
# select a prediction target from the columns of the data file
# conventionally, y is used as a variable to represent the prediction target
y = melbourne_data.Price

In [6]:
# choose features from the columns of the data file
# The columns that are inputted into our model (and later used to make predictions) are called "features." 
# In our case, those would be the columns used to determine the home price.
melbourne_features = ['Rooms', 'Postcode', 'Distance']
X = melbourne_data[melbourne_features]

# visualize the features, X
print(X.head())
print(X.describe())

   Rooms  Postcode  Distance
0      3      3067       3.0
1      3      3067       3.0
2      3      3067       3.0
3      3      3040       7.5
4      2      3042      10.4
              Rooms      Postcode      Distance
count  48433.000000  48433.000000  48433.000000
mean       3.071666   3123.210332     12.702761
std        0.944708    125.534940      7.550030
min        1.000000   3000.000000      0.000000
25%        2.000000   3051.000000      7.000000
50%        3.000000   3103.000000     11.700000
75%        4.000000   3163.000000     16.700000
max       31.000000   3980.000000     55.800000


3. Building the model

The steps to building and using a model are:

Define: What type of model will it be? A decision tree? Some other type of model? Some other parameters of the model type are specified too.
Fit: Capture patterns from provided data. This is the heart of modeling.
Predict: Just what it sounds like.
Evaluate: Determine how accurate the model's predictions are.

In [7]:
from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit model
melbourne_model.fit(X, y)

In [8]:
# making predictions and visualizing them
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(melbourne_model.predict(X.head()))

Making predictions for the following 5 houses:
   Rooms  Postcode  Distance
0      3      3067       3.0
1      3      3067       3.0
2      3      3067       3.0
3      3      3040       7.5
4      2      3042      10.4
The predictions are
[1241833.33333333 1241833.33333333 1241833.33333333 1162721.97309417
  647622.09302326]


4. Model validation

To validate a model, you first need to summarize its quality in an understandable way. If you compare predicted and actual home values for 10,000 houses, you'll likely find a mix of good and bad predictions. Looking through a list of 10,000 predicted and actual values would be pointless. We need to summarize this into a single metric.

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (also called MAE).

MAE = the average of the absolute values of the differences between the predicted values and the actual values.
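The definition can be checked by hand on a few numbers. In this sketch the three actual and predicted prices are made up; the manual calculation is compared with scikit-learn's `mean_absolute_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical prices, for illustration only
actual = np.array([500_000, 750_000, 1_000_000])
predicted = np.array([550_000, 700_000, 1_150_000])

# Absolute errors are 50,000, 50,000 and 150,000; MAE is their average
manual_mae = np.mean(np.abs(actual - predicted))

print(manual_mae)
print(mean_absolute_error(actual, predicted))
```

Taking the absolute value before averaging stops positive and negative errors from cancelling each other out.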

In [9]:
from sklearn.metrics import mean_absolute_error

predicted_home_prices = melbourne_model.predict(X)
mean_absolute_error(y, predicted_home_prices)

202264.1656746377

Since a model's practical value comes from making predictions on new data, it's better to measure performance on data that wasn't used to build the model.

The most straightforward way to do this is to exclude some data from the model-building process, and then use those to test the model's accuracy on data it hasn't seen before. This data is called validation data.

In [10]:
from sklearn.model_selection import train_test_split

# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time 

train_X, val_X, train_y, val_y = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
# Define model
melbourne_model = DecisionTreeRegressor()
# Fit model
melbourne_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = melbourne_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))

209132.22922836881


5. Problems: underfitting and overfitting

In a decision tree model, the depth of the tree influences the accuracy of the model's predictions.

As the tree gets deeper, the dataset gets sliced up into leaves with fewer houses. If a tree has only 1 split, it divides the data into 2 groups. If each group is split again, we get 4 groups of houses. Splitting each of those again creates 8 groups. If we keep doubling the number of groups by adding more splits at each level, we'll have 2^10=1024 groups of houses by the time we get to the 10th level.
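The doubling can be worked through in a few lines. This sketch assumes a perfectly balanced tree in which every group is split at every level, which a real fitted tree need not be; `leaves_at_depth` is a helper name introduced here, and 48433 is the row count reported by `X.describe()` in section 2.

```python
def leaves_at_depth(depth):
    """Number of groups in a fully balanced binary tree after `depth` levels of splits."""
    return 2 ** depth


n_houses = 48433  # row count reported by X.describe() in section 2

for depth in [1, 2, 3, 10]:
    n_leaves = leaves_at_depth(depth)
    # Average number of houses per leaf if the houses were spread evenly
    print(depth, n_leaves, round(n_houses / n_leaves, 1))
```

Under these assumptions, each of the 1024 leaves at the 10th level would hold fewer than 50 houses on average, and every further level halves that figure again.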

Overfitting: when we divide the houses amongst many leaves, we also have fewer houses in each leaf. Leaves with very few houses will make predictions that are quite close to those homes' actual values, but they may make very unreliable predictions for new data (because each prediction is based on only a few houses).
Underfitting: when the decision tree has fewer splits, the prediction might be inaccurate even for the training data as some important features/patterns are not captured.

https://storage.googleapis.com/kaggle-media/learn/images/AXSEOfI.png
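One way to see both problems is to compare the error on the training data with the error on held-out data as the tree grows. The sketch below uses synthetic data (a noisy sine curve) rather than the Melbourne data, and all names in it (`X_toy`, `y_toy`, `results`) are introduced only for this illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: a sine curve plus random noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(500, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.3, size=500)

tr_X, va_X, tr_y, va_y = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

results = {}
for leaves in [2, 20, 400]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[leaves] = (train_mae, val_mae)
    print(leaves, round(train_mae, 3), round(val_mae, 3))
```

With only 2 leaves the tree is too coarse to follow the curve, so both errors are high (underfitting). With 400 leaves and 400 training points, the tree can give nearly every training point its own leaf, so the training error collapses while the validation error stays well above it (overfitting).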

In [11]:
# find the most accurate model by comparing different values of max_leaf_nodes
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # fit a tree limited to max_leaf_nodes leaves and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

# compare MAE with differing values of max_leaf_nodes, for example:5,50,500,5000 leaves
for max_leaf_nodes in [5, 50, 500, 5000]:
    my_mae = get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y)
    print("Max leaf nodes: %d  \t\t Mean Absolute Error:  %d" %(max_leaf_nodes, my_mae))

Max leaf nodes: 5  		 Mean Absolute Error:  342446
Max leaf nodes: 50  		 Mean Absolute Error:  229001
Max leaf nodes: 500  		 Mean Absolute Error:  209375
Max leaf nodes: 5000  		 Mean Absolute Error:  209115
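Once the scores are collected, the best tree size can be picked programmatically. This sketch copies the four MAE values printed above into a dictionary; `scores` and `best_tree_size` are names chosen here for illustration.

```python
# Validation MAE for each candidate size, copied from the output above
scores = {5: 342446, 50: 229001, 500: 209375, 5000: 209115}

# Pick the key (max_leaf_nodes) whose MAE is smallest
best_tree_size = min(scores, key=scores.get)
print(best_tree_size)
```

Among these four candidates the lowest MAE belongs to 5000 leaves, although the improvement over 500 leaves is only 260, so most of the gain comes from moving beyond the smallest trees.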


In summary, a model can suffer from either of two problems:
Overfitting: capturing spurious patterns that won't recur in the future, leading to less accurate predictions.
Underfitting: failing to capture relevant patterns, again leading to less accurate predictions.

6. Another model: random forest model

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. However, many models use clever ideas that can lead to better performance. The random forest is one example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters.
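The averaging can be checked directly, because a fitted scikit-learn forest exposes its component trees through the `estimators_` attribute. The sketch below uses small synthetic data, not the Melbourne data, and the names `X_toy`, `y_toy`, and `per_tree` are introduced only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: a linear relationship plus random noise
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 2))
y_toy = 3 * X_toy[:, 0] + X_toy[:, 1] + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=10, random_state=1)
forest.fit(X_toy, y_toy)

new_points = X_toy[:5]

# One row of predictions per component tree
per_tree = np.array([tree.predict(new_points) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest's own prediction
manual_average = per_tree.mean(axis=0)
print(np.allclose(manual_average, forest.predict(new_points)))
```

Individual trees disagree with one another because each is trained on a different random resample of the data; averaging smooths out those differences.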

In [12]:
from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

208530.33606399526
