# Predicting positive or negative logerror

# Fire up GraphLab Create

In [1]:
import graphlab

# Read our cleaned data again

This is the same Zillow SFrame we worked with last week.

In [2]:
features_plus_error = graphlab.SFrame('features_with_error')


# Build an over/under price classifier

In [3]:
graphlab.canvas.set_target('ipynb')

In [4]:
features_plus_error['logerror'].show(view='Numeric')

## Split logerrors into overpriced vs underpriced

The log error is defined as log(Zestimate) - log(sale price), which is equivalent to log(Zestimate / sale price). A positive logerror therefore means the Zestimate was higher than the sale price (Zillow overpriced the home), and a negative logerror means the Zestimate was lower (Zillow underpriced it). A logerror of exactly zero means the Zestimate matched the sale price.
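As a quick sanity check of the sign convention, here is a minimal sketch using made-up prices. The helper name `log_error` and the dollar amounts are illustrative assumptions, not values from the Zillow data. It uses the natural log, but the sign of the result does not depend on the base.

```python
import math

def log_error(zestimate, sale_price):
    # log(Zestimate) - log(sale price), per the definition above
    return math.log(zestimate) - math.log(sale_price)

# Hypothetical prices, chosen only to illustrate the sign
over = log_error(330000.0, 300000.0)   # Zestimate above sale price
under = log_error(270000.0, 300000.0)  # Zestimate below sale price
exact = log_error(300000.0, 300000.0)  # Zestimate equals sale price

print(over > 0, under < 0, exact == 0)
```

All three comparisons should print `True`, matching the overpriced / underpriced / correct reading of the logerror.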

In [5]:
# Drop rows whose logerror is exactly 0: they are neither overpriced nor underpriced
features_plus_error = features_plus_error[features_plus_error['logerror'] != 0]

In [6]:
# overpriced = positive logerror
features_plus_error['overpriced'] = features_plus_error['logerror'] > 0

In [7]:
features_plus_error['overpriced'].show(view='Categorical')

## Let's train the overpriced classifier

In [8]:
train_data, test_data = features_plus_error.random_split(.8, seed=0)

In [26]:
graphlab.canvas.set_target('browser')

features_plus_error.show()


In [22]:
classifier_features = features_plus_error.column_names()
print(len(classifier_features))

# Exclude the identifier, the target, and the logerror-based columns
# (the target 'overpriced' is derived from logerror)
excluded = {'parcelid', 'logerror', 'abslogerror', 'overpriced'}
classifier_features = [e for e in classifier_features if e not in excluded]
print(len(classifier_features))

54
50


In [23]:
# Note: test_data doubles as the validation set here
overpriced_model = graphlab.logistic_classifier.create(train_data,
                                                       target='overpriced',
                                                       features=classifier_features,
                                                       validation_set=test_data,
                                                       max_iterations=100)

# Evaluate the overpriced model

In [24]:
overpriced_model.evaluate(test_data, metric='roc_curve')

{'roc_curve': Columns:
 	threshold	float
 	fpr	float
 	tpr	float
 	p	int
 	n	int
 
 Rows: 100001
 
 Data:
 +-----------+----------------+----------------+------+------+
 | threshold |      fpr       |      tpr       |  p   |  n   |
 +-----------+----------------+----------------+------+------+
 |    0.0    |      1.0       |      1.0       | 9945 | 8091 |
 |   1e-05   | 0.949697194414 | 0.958672699849 | 9945 | 8091 |
 |   2e-05   | 0.932023235694 | 0.944293614882 | 9945 | 8091 |
 |   3e-05   | 0.918675071067 | 0.934640522876 | 9945 | 8091 |
 |   4e-05   | 0.910888641701 | 0.926093514329 | 9945 | 8091 |
 |   5e-05   | 0.903349400569 | 0.920261437908 | 9945 | 8091 |
 |   6e-05   | 0.897787665307 | 0.914831573655 | 9945 | 8091 |
 |   7e-05   | 0.891360771227 | 0.911312217195 | 9945 | 8091 |
 |   8e-05   | 0.886169818317 | 0.906686777275 | 9945 | 8091 |
 |   9e-05   | 0.882461994809 | 0.904273504274 | 9945 | 8091 |
 +-----------+----------------+----------------+------+------+
 [100001 row
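To boil an ROC table like the one above down to a single number, one option is the area under the curve (AUC), computed with the trapezoidal rule over the (fpr, tpr) points. The sketch below is a plain-Python illustration, not part of GraphLab Create: the function name `auc_trapezoid` and the two toy curves are assumptions made for the example. An AUC near 0.5 means the classifier is close to random guessing, and 1.0 is a perfect ranking.

```python
def auc_trapezoid(fpr, tpr):
    # Sort the points by false positive rate so we integrate left to right
    points = sorted(zip(fpr, tpr))
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Area of the trapezoid between two neighbouring ROC points
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy curves (hypothetical, not taken from the model above)
random_guess = auc_trapezoid([0.0, 0.5, 1.0], [0.0, 0.5, 1.0])
perfect = auc_trapezoid([0.0, 0.0, 1.0], [0.0, 1.0, 1.0])

print(random_guess, perfect)
```

With the full `fpr` and `tpr` columns pulled out of the SFrame as lists, the same function would give the AUC for the overpriced model.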

In [25]:
overpriced_model.show(view='Evaluation')
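A useful reference point when reading the evaluation is the majority-class baseline. The ROC table reports p = 9945 positive (overpriced) and n = 8091 negative examples in the test set, so always predicting "overpriced" would already be right a little more than half the time. The helper name `majority_baseline_accuracy` below is an assumption for illustration.

```python
def majority_baseline_accuracy(p, n):
    # Accuracy of a classifier that always predicts the more common class
    return max(p, n) / float(p + n)

# p and n are the test-set class counts from the ROC table
baseline = majority_baseline_accuracy(9945, 8091)
print(round(baseline, 4))
```

Any classifier worth keeping should beat this baseline accuracy on the test set.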

### Some ideas on how to make this work (it doesn't right now)
* We could log-transform some of the continuous variables

* We could use one-hot encoding to ensure that the algorithm doesn't treat categorical variables as numeric

* We could try some other classifiers: scikit-learn has a wide range of classifiers that would probably be stronger than logistic regression, for example ensemble learners such as random forests, bagging, or gradient boosting
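The first two ideas can be sketched in plain Python before committing to any particular library. This is only an illustration under assumptions: the helper names `log1p_column` and `one_hot`, and the sample values, are made up and do not come from the Zillow SFrame. `log1p` is used instead of a bare log so that zero values stay defined.

```python
import math

def log1p_column(values):
    # log(1 + x) compresses large values and is defined at x = 0
    return [math.log1p(v) for v in values]

def one_hot(values):
    # One 0/1 indicator per distinct category, so that codes such as
    # 1, 2, 3 are not treated as ordered numeric quantities
    categories = sorted(set(values))
    return [dict((c, int(v == c)) for c in categories) for v in values]

# Hypothetical continuous column (for example, a square-footage feature)
sqft_log = log1p_column([0.0, 1500.0, 3200.0])

# Hypothetical integer-coded categorical column
type_indicators = one_hot([1, 3, 1, 2])

print(sqft_log)
print(type_indicators[0])
```

The transformed columns would then replace the raw ones in `classifier_features` before retraining.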