## 1. Decision Tree Regressor - complete

In [1]:
# Code to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


# Path of the file to read
iowa_file_path = 'train.csv'

home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]

# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)

# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae_simple = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae_simple))


Validation MAE: 29,653
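
Mean absolute error is the average of the absolute differences between the actual values and the predictions. As a minimal sketch of what `mean_absolute_error` computes, the snippet below does the arithmetic by hand. The helper name `mean_abs_error` and the prices are made up for illustration and are not taken from `train.csv`.

```python
def mean_abs_error(actual, predicted):
    """Average of the absolute differences between paired values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up prices for illustration only (not taken from train.csv)
actual_prices = [200000, 150000, 300000, 250000]
predicted_prices = [210000, 140000, 280000, 250000]

# Absolute errors are 10000, 10000, 20000 and 0, so the mean is 10000
demo_mae = mean_abs_error(actual_prices, predicted_prices)
print("{:,.0f}".format(demo_mae))  # prints 10,000
```

Because each difference is taken as an absolute value, swapping the two arguments gives the same result.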


In [29]:
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae


# candidate_max_leaf_nodes = [1500]
candidate_max_leaf_nodes = [2*i-1 for i in range(2, 75)]

best_mae = None
best_tree_size = None

# Loop over candidate_max_leaf_nodes to find the tree size with the lowest validation MAE
for leaf_node in candidate_max_leaf_nodes:
    my_mae = get_mae(leaf_node, train_X, val_X, train_y, val_y)
    print(f'Max leaf nodes: {leaf_node}\tMean Absolute Error: {my_mae}')
    if best_mae is None or my_mae < best_mae:
        best_mae = my_mae
        best_tree_size = leaf_node

# Report the best value of max_leaf_nodes
print('\nBest tree size:', best_tree_size)


Max leaf nodes: 3	Mean Absolute Error: 39912.20512711714
Max leaf nodes: 5	Mean Absolute Error: 35044.51299744237
Max leaf nodes: 7	Mean Absolute Error: 34769.10089767185
Max leaf nodes: 9	Mean Absolute Error: 31863.851616036944
Max leaf nodes: 11	Mean Absolute Error: 30389.783612505194
Max leaf nodes: 13	Mean Absolute Error: 29124.908937039498
Max leaf nodes: 15	Mean Absolute Error: 28125.478430318668
Max leaf nodes: 17	Mean Absolute Error: 27807.663665995344
Max leaf nodes: 19	Mean Absolute Error: 28648.267042530915
Max leaf nodes: 21	Mean Absolute Error: 28750.331097785598
Max leaf nodes: 23	Mean Absolute Error: 28653.86284944501
Max leaf nodes: 25	Mean Absolute Error: 29016.41319191076
Max leaf nodes: 27	Mean Absolute Error: 28616.229360696358
Max leaf nodes: 29	Mean Absolute Error: 28704.92928766505
Max leaf nodes: 31	Mean Absolute Error: 28994.467469483232
Max leaf nodes: 33	Mean Absolute Error: 28355.08322598861
Max leaf nodes: 35	Mean Absolute Error: 28761.35218024895
Max leaf 

In [23]:
# FINAL MODEL

# Use the best tree size found above.
# No random_state is set here, so the result may differ slightly between runs.
iowa_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size)

# fit the final model
iowa_model.fit(train_X, train_y)

val_predictions = iowa_model.predict(val_X)

val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of candidate_max_leaf_nodes: \t {:,.0f}".format(val_mae))
print("Validation MAE when not specifying candidate_max_leaf_nodes: \t {:,.0f}".format(val_mae_simple))


Validation MAE for best value of candidate_max_leaf_nodes: 	 26,704
Validation MAE when not specifying candidate_max_leaf_nodes: 	 29,653


In [24]:
print('Final model prediction:')
print(iowa_model.predict(val_X.head()))

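# Note: y.head() shows the first five rows of the full dataset, which are not
# the houses in val_X.head(); use val_y.head() to compare matching houses.
print('\nReal Prices:')
print(y.head())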

Final model prediction:
[181225.35416667 130647.68518519 125404.5         94060.
 149639.97826087]

Real Prices:
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64


## 2. Random Forests
Using a more sophisticated machine learning algorithm.

Decision trees leave you with a difficult decision. A deep tree with lots of leaves will overfit because each prediction is coming from historical data from only the few houses at its leaf. But a shallow tree with few leaves will perform poorly because it fails to capture as many distinctions in the raw data.
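
The trade-off above can be seen by comparing training and validation error as the tree grows. The following is a minimal sketch on synthetic data (a noisy sine curve standing in for the housing data, which is an assumption made so the example runs without `train.csv`). All `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (not the Iowa housing set): smooth signal plus noise
rng = np.random.RandomState(0)
demo_X = rng.uniform(0, 10, size=(400, 1))
demo_y = np.sin(demo_X[:, 0]) * 100 + rng.normal(0, 20, size=400)

demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=0)

# Map each max_leaf_nodes setting to (training MAE, validation MAE);
# None lets the tree grow until every leaf is pure
demo_results = {}
for leaves in [2, 20, None]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(demo_train_X, demo_train_y)
    demo_results[leaves] = (
        mean_absolute_error(demo_train_y, tree.predict(demo_train_X)),
        mean_absolute_error(demo_val_y, tree.predict(demo_val_X)),
    )
    print(f'max_leaf_nodes={leaves}\ttrain MAE: {demo_results[leaves][0]:.1f}'
          f'\tvalidation MAE: {demo_results[leaves][1]:.1f}')
```

The unrestricted tree fits the training data essentially perfectly, yet its validation error stays well above zero; that gap is the sign of overfitting. The two-leaf tree has a high error even on the data it was trained on, which is the sign of underfitting.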

Even today's most sophisticated modeling techniques face this tension between underfitting and overfitting. But many models have clever ideas that can lead to better performance. We'll look at the random forest as an example.

The random forest uses many trees, and it makes a prediction by averaging the predictions of each component tree. It generally has much better predictive accuracy than a single decision tree and it works well with default parameters. If you keep modeling, you can learn more models with even better performance, but many of those are sensitive to getting the right parameters.
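
To make the averaging concrete, the sketch below fits a small forest on synthetic data (an assumption, used so the example runs without `train.csv`) and checks that the forest's prediction equals the mean of the predictions of its component trees, which scikit-learn exposes through the fitted `estimators_` attribute. The `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (not the Iowa housing set)
demo_X, demo_y = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

demo_forest = RandomForestRegressor(n_estimators=25, random_state=0)
demo_forest.fit(demo_X, demo_y)

# Predict the first three rows with every component tree, then average
per_tree_preds = np.array([tree.predict(demo_X[:3])
                           for tree in demo_forest.estimators_])
manual_average = per_tree_preds.mean(axis=0)

forest_preds = demo_forest.predict(demo_X[:3])
print(np.allclose(manual_average, forest_preds))  # prints True
```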

We build a random forest model similarly to how we built a decision tree in scikit-learn - this time using the `RandomForestRegressor` class instead of `DecisionTreeRegressor`.



In [25]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
iowa_preds = forest_model.predict(val_X)
print("Random Forest Regressor's MAE: \t\t\t\t {:,.0f}".format(mean_absolute_error(val_y, iowa_preds)))

print('Best Decision Tree Regressor\'s MAE: \t\t\t {:,.0f}'.format(val_mae))
print("Validation MAE when not specifying max_leaf_nodes: \t {:,.0f}".format(val_mae_simple))

Random Forest Regressor's MAE: 				 21,857
Best Decision Tree Regressor's MAE: 			 26,704
Validation MAE when not specifying max_leaf_nodes: 	 29,653


There is likely room for further improvement, but a validation MAE of 21,857 is a big improvement over the best decision tree error of 26,704. There are parameters that let you change the performance of the random forest, much as we changed the maximum number of leaf nodes of the single decision tree. But one of the best features of random forest models is that they generally work reasonably well even without this tuning.
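
As a sketch of what such tuning could look like, the snippet below tries a few values of two `RandomForestRegressor` parameters, `n_estimators` and `max_leaf_nodes`, and keeps the combination with the lowest validation MAE. It runs on synthetic data (an assumption, so that it works without `train.csv`). The helper `forest_mae`, the `demo_` names, and the candidate values are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the Iowa housing set)
demo_X, demo_y = make_regression(n_samples=300, n_features=8,
                                 noise=15.0, random_state=1)
demo_train_X, demo_val_X, demo_train_y, demo_val_y = train_test_split(
    demo_X, demo_y, random_state=1)


def forest_mae(n_estimators, max_leaf_nodes):
    """Fit a forest with the given settings and return its validation MAE."""
    model = RandomForestRegressor(n_estimators=n_estimators,
                                  max_leaf_nodes=max_leaf_nodes,
                                  random_state=1)
    model.fit(demo_train_X, demo_train_y)
    return mean_absolute_error(demo_val_y, model.predict(demo_val_X))


# Try every combination of the candidate values
forest_scores = {}
for n_estimators in [10, 100]:
    for max_leaf_nodes in [10, 50, None]:
        forest_scores[(n_estimators, max_leaf_nodes)] = forest_mae(
            n_estimators, max_leaf_nodes)

best_setting = min(forest_scores, key=forest_scores.get)
print('Best (n_estimators, max_leaf_nodes):', best_setting)
```

This mirrors the loop over `candidate_max_leaf_nodes` used for the single decision tree, extended to two parameters.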

## 3. Random Forest training with selected features

In [6]:
def house_pricing(features: list) -> None:
    """Train a random forest on the given feature columns and print its validation MAE."""
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Path of the file to read
    iowa_file_path = 'train.csv'

    home_data = pd.read_csv(iowa_file_path)
    # Create target object and call it y
    y = home_data.SalePrice
    # Create X from the selected feature columns
    X = home_data[features]

    # Split into validation and training data
    train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

    # Fit the forest and make validation predictions
    forest_model = RandomForestRegressor(random_state=1)
    forest_model.fit(train_X, train_y)
    iowa_preds = forest_model.predict(val_X)

    # Calculate and report the mean absolute error on the validation data
    print("Random Forest Regressor's MAE: \t {:,.0f}".format(mean_absolute_error(val_y, iowa_preds)))


In [7]:
a = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
     'FullBath', 'MSSubClass', 'OverallQual', 'OverallCond', 
     'YearRemodAdd', 'GrLivArea', 'Fireplaces', 'WoodDeckSF']
# Note: b is identical to a, so both calls report the same MAE
b = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
     'FullBath', 'MSSubClass', 'OverallQual', 'OverallCond', 
     'YearRemodAdd', 'GrLivArea', 'Fireplaces', 'WoodDeckSF']
# Note: 'TotRmsAbvGrd' appears twice in c
c = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 
     'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd', 'MSSubClass', 
     'OverallQual', 'OverallCond', 'YearRemodAdd', 'GrLivArea', 
     'HalfBath', 'KitchenAbvGr', 'TotRmsAbvGrd', 'OpenPorchSF', 
     'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 
     'MiscVal', 'MoSold', 'YrSold']

house_pricing(a)
house_pricing(b)
house_pricing(c)

Random Forest Regressor's MAE: 	 17,077
Random Forest Regressor's MAE: 	 17,077
Random Forest Regressor's MAE: 	 18,067
