`kc_house_data.csv` comes from Kaggle: https://www.kaggle.com/harlfoxem/housesalesprediction

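Given a collection of houses with their details and sale prices, **automatically** build a model that predicts house price from the other attributes. The search below fits a linear regression on every non-empty subset of the candidate columns and keeps the subset with the best mean cross-validated score.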

In [1]:
import pandas as pd
import numpy as np

source = pd.read_csv("kc_house_data.csv")
source.head()

Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import math
import time
from itertools import combinations

# Candidate feature columns to search over
allColumns = [
    "bedrooms", 
    "bathrooms", 
    "sqft_living", 
    "sqft_lot",
    "floors", 
    "waterfront",
    "view", 
    "condition", 
    "grade",    
    "sqft_basement",
    "yr_built",
    "lat", 
    "long"    
]

# Number of non-empty column subsets: the sum of C(n, k) for k = 1..n,
# which equals 2**n - 1 (8191 for the 13 columns above)
totalAttempts = 2 ** len(allColumns) - 1

# Try all possible combinations and find the best one
attemptNum = 0
bestScore = float("-inf")  # R^2 has no lower bound, so -1 is not a safe floor
bestListOfColumnsToTake = None
for numOfColumnsToTake in range(1, len(allColumns) + 1):
    for columnsToTake in combinations(allColumns, numOfColumnsToTake):
        listOfColumnsToTake = list(columnsToTake)
        X = source[listOfColumnsToTake]
        y = source["price"]
        reg = LinearRegression()
        # 4-fold cross-validation; the default score for a regressor is R^2
        scores = cross_val_score(reg, X, y, cv=4)
        attemptNum = attemptNum + 1
        meanScore = scores.mean()
        if meanScore > bestScore:
            bestScore = meanScore
            bestListOfColumnsToTake = listOfColumnsToTake
            print(attemptNum, totalAttempts, meanScore, listOfColumnsToTake)
        #if attemptNum % 500 == 0:
        #    print(attemptNum, totalAttempts)
print("DONE")

1 8191 0.0924455132116 ['bedrooms']
2 8191 0.274164201825 ['bathrooms']
3 8191 0.492157781418 ['sqft_living']
15 8191 0.505010936423 ['bedrooms', 'sqft_living']
39 8191 0.528953457518 ['sqft_living', 'waterfront']
40 8191 0.53466751728 ['sqft_living', 'view']
45 8191 0.565433017686 ['sqft_living', 'lat']
111 8191 0.575558915642 ['bedrooms', 'sqft_living', 'lat']
235 8191 0.604220256482 ['sqft_living', 'waterfront', 'lat']
241 8191 0.608967274333 ['sqft_living', 'view', 'lat']
455 8191 0.611517894111 ['bedrooms', 'sqft_living', 'waterfront', 'lat']
461 8191 0.615083941431 ['bedrooms', 'sqft_living', 'view', 'lat']
831 8191 0.624382695697 ['sqft_living', 'waterfront', 'view', 'lat']
839 8191 0.629400315829 ['sqft_living', 'waterfront', 'grade', 'yr_built']
840 8191 0.633916025804 ['sqft_living', 'waterfront', 'grade', 'lat']
855 8191 0.635306373456 ['sqft_living', 'view', 'grade', 'lat']
876 8191 0.638439168414 ['sqft_living', 'grade', 'yr_built', 'lat']
1371 8191 0.643766092315 ['bedroo
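
The second column of the output, 8191, is the number of non-empty subsets of the 13 candidate columns: the sum of C(13, k) for k from 1 to 13, which equals 2^13 - 1. A minimal check using only the standard library (the helper name `count_nonempty_subsets` is made up for illustration, and `math.comb` needs Python 3.8 or later):

```python
import math

def count_nonempty_subsets(n):
    """Sum of C(n, k) for k = 1..n (hypothetical helper, not part of the notebook)."""
    return sum(math.comb(n, k) for k in range(1, n + 1))

n_columns = 13  # len(allColumns) in the cell above
total = count_nonempty_subsets(n_columns)

# The closed form 2**n - 1 gives the same count without enumerating anything
assert total == 2 ** n_columns - 1
print(total)  # 8191
```

Each extra candidate column roughly doubles the number of fits, so the exhaustive search is only practical for a short column list.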

Same as above, but this time learn to predict `log(price)` instead of `price` itself.

In [3]:
# Number of non-empty column subsets: 2**n - 1
totalAttempts = 2 ** len(allColumns) - 1
        
attemptNum = 0
bestScore = float("-inf")  # R^2 has no lower bound, so -1 is not a safe floor
bestListOfColumnsToTake = None
# Target is log(price); np.log is vectorized, so no row-wise apply is needed
y = np.log(source["price"])
for numOfColumnsToTake in range(1, len(allColumns) + 1):
    for columnsToTake in combinations(allColumns, numOfColumnsToTake):
        listOfColumnsToTake = list(columnsToTake)
        X = source[listOfColumnsToTake]
        # y is the log(price) target defined before the loop
        reg = LinearRegression()
        scores = cross_val_score(reg, X, y, cv=4)
        attemptNum = attemptNum + 1
        meanScore = scores.mean()
        if meanScore > bestScore:
            bestScore = meanScore
            bestListOfColumnsToTake = listOfColumnsToTake
            print(attemptNum, totalAttempts, meanScore, listOfColumnsToTake)
        #if attemptNum % 500 == 0:
        #    print(attemptNum, totalAttempts)
print("DONE")

1 8191 0.114244007365 ['bedrooms']
2 8191 0.30145029908 ['bathrooms']
3 8191 0.482391855971 ['sqft_living']
9 8191 0.493733933712 ['grade']
21 8191 0.503592278942 ['bedrooms', 'grade']
32 8191 0.505631924042 ['bathrooms', 'grade']
40 8191 0.506411743869 ['sqft_living', 'view']
42 8191 0.554346428383 ['sqft_living', 'grade']
45 8191 0.6535067705 ['sqft_living', 'lat']
111 8191 0.655895064384 ['bedrooms', 'sqft_living', 'lat']
228 8191 0.656636928922 ['sqft_living', 'floors', 'lat']
235 8191 0.665786066233 ['sqft_living', 'waterfront', 'lat']
241 8191 0.678614409627 ['sqft_living', 'view', 'lat']
250 8191 0.702987346978 ['sqft_living', 'grade', 'lat']
470 8191 0.703176795519 ['bedrooms', 'sqft_living', 'grade', 'lat']
756 8191 0.705951911134 ['bathrooms', 'grade', 'yr_built', 'lat']
840 8191 0.714893445591 ['sqft_living', 'waterfront', 'grade', 'lat']
855 8191 0.724421759042 ['sqft_living', 'view', 'grade', 'lat']
876 8191 0.729751439476 ['sqft_living', 'grade', 'yr_built', 'lat']
1371 8
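
The scores in this run are R^2 on `log(price)`, while the first run reports R^2 on `price`, so the two sets of scores are not directly comparable. One way to compare like with like is to keep fitting on `log(price)` but score the back-transformed predictions, which scikit-learn's `TransformedTargetRegressor` supports. The sketch below uses synthetic data; the variable names and the data-generating formula are assumptions for illustration, not the Kaggle data:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the housing data: price grows exponentially with x
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 3.0, size=(400, 1))
price_demo = np.exp(10.0 + 0.8 * X_demo[:, 0] + rng.normal(0.0, 0.1, size=400))

# Fit on log(price), but predict (and therefore score) on the price scale
log_target_reg = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
scores_log_target = cross_val_score(log_target_reg, X_demo, price_demo, cv=4)

# Plain linear regression on price, scored on the same scale
scores_plain = cross_val_score(LinearRegression(), X_demo, price_demo, cv=4)

print(scores_log_target.mean(), scores_plain.mean())
```

On the real data, the same wrapper could replace `LinearRegression()` inside the search loop, with `y = source["price"]` as the target. Whether the log target still wins once both are scored on the price scale is something to measure rather than assume.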