## What is the Zestimate and what is the logerror?

Before diving into this whole 'Zestimate' thing, it may help to clear up a few ideas, at least as we understand them in pursuing our project's goal.

There are two main values dictating home sale prices:

1.) The Market Value = what the BUYER says the property is worth; and

2.) The Appraised Value = what the BANK says the property is worth.

Banks win 99% of the time because banks have 99% of the money, and that's a good thing: the bank's number keeps both Buyer and Seller grounded in reality.

I can't demand my house sell for a million dollars when the bank says it's only worth a Happy Meal.  Conversely, the buyer can't buy my home for a Happy Meal when the bank says it's **AT LEAST** worth a Taco Bell Tripleupa Box.  (That took some practice.)

Understanding this communication gap, Zillow launched in 2006 as a way of providing information to both home-buyers and home-sellers, the end goal being a mutual understanding at the beginning of price negotiations.  One of their flagship offerings is the 'Zestimate,' a constantly updated and fine-tuned home valuation model that is used to predict the market value of a home based on things like '*home facts, location, and market conditions*' (italics are directly from their website, https://www.zillow.com/zestimate/).

While strong and highly durable, the Zestimate is not perfect, even by its own admission.  From the 'Median Error' section of the Zestimate website: "For most major markets, the Zestimate for on-market homes is within 10% of the final sale price more than 95% of the time."

Plain English: in cities of roughly a million or more people, Zillow's *predicted* sale price lands within 10% of the home's *actual* sale price more than 95% of the time.  Not bad, but on a \$300,000 home, that still means the Zestimate can only ballpark a sale price somewhere between \$270,000 and \$330,000, a potential dream-crusher for both parties (banks still make out like a bandit).
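
To make the arithmetic above concrete, here is a minimal sketch of the ten-percent band. The function name `zestimate_band` and its `pct_error` parameter are ours, made up for illustration. It also uses the same simplification as the paragraph above: it takes the percentage of the Zestimate itself, whereas Zillow's statement is phrased relative to the final sale price.

```python
def zestimate_band(zestimate, pct_error=0.10):
    """Return the (low, high) sale-price range implied by a +/- pct_error band.

    Hypothetical helper for illustration only; not part of any Zillow API.
    """
    low = zestimate * (1 - pct_error)
    high = zestimate * (1 + pct_error)
    return low, high


low, high = zestimate_band(300_000)
print(f"Ballpark sale price: {low:,.0f} to {high:,.0f}")
```

The width of the band scales with the price of the home, so the same percentage error costs more dollars on a more expensive property.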

Because homes are not fiat currencies (they have actual, real value), Zillow can continually improve their model with tangible feedback in hopes of minimizing that error gap.

To see what may be driving this error, we are using what we learned in the Clustering Methodologies section of our Data Science Q-Course.  Instead of 'Mean Error,' we are clustering to determine what is driving the 'logerror' experienced in Zillow's predictive model.  Using logerror (from our provided MySQL database) means that we are assuming a distribution underlying Zillow estimates and actual home sale prices. 
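
For readers who have not met logerror before, the sketch below shows how we understand it. We are assuming the definition used in Zillow's public Kaggle competition data, logerror = log(Zestimate) - log(SalePrice); this notebook never computes it directly, since the column arrives ready-made from the database. The function name and the example prices are illustrative only.

```python
import numpy as np


def logerror(zestimate, sale_price):
    """Log of the Zestimate minus log of the actual sale price.

    Assumed definition (Zillow Prize data); positive means Zillow overestimated.
    """
    return np.log(zestimate) - np.log(sale_price)


over = logerror(330_000, 300_000)   # Zestimate 10% too high
under = logerror(270_000, 300_000)  # Zestimate 10% too low
exact = logerror(300_000, 300_000)  # perfect prediction

print(round(over, 4), round(under, 4), round(exact, 4))

# Exponentiating a logerror recovers the ratio Zestimate / SalePrice.
print(round(float(np.exp(over)), 2))
```

Because the error lives on a log scale, it measures the ratio between prediction and sale price rather than the dollar gap, so a cheap home and an expensive home that are both 10% off get the same logerror.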

### NB:

You may be wondering why we're dealing with California data.  At least we know we were.

Turns out, Texas (along with a handful of other states) is a non-disclosure state, and the Texas Real Estate Commission - Rulers over all things Texas Real Estate - is under no legal obligation to provide any home sale price information to outside companies like Zillow or Redfin (the Pepsi to Zillow's Coke).

Take that for what it's worth, but that leads us to believe the logerror drivers we discover will be unique to the California Zestimate model, and cannot be applied universally without a sacrifice in accuracy.



In [5]:
import warnings
warnings.filterwarnings("ignore")

# Wrangling
import pandas as pd
import numpy as np

# Exploring
import scipy.stats as stats

# Visualizing
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Local project modules
import env
import zillow_acquire

# Fix the random seed so results are reproducible
np.random.seed(123)

# Pull the Zillow data from the SQL database
df = zillow_acquire.get_zillow_data_from_sql()

In [6]:
df.head()

| | parcelid | id | airconditioningtypeid | architecturalstyletypeid | basementsqft | bathroomcnt | bedroomcnt | buildingclasstypeid | buildingqualitytypeid | calculatedbathnbr | ... | fireplaceflag | structuretaxvaluedollarcnt | taxvaluedollarcnt | assessmentyear | landtaxvaluedollarcnt | taxamount | taxdelinquencyflag | taxdelinquencyyear | censustractandblock | transactiondate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10711855 | 1087254 | | | | 2.0 | 3.0 | | 8.0 | 2.0 | ... | | 249655.0 | 624139.0 | 2016.0 | 374484.0 | 7659.36 | | | 60371130000000.0 | 2017-07-07 |
| 1 | 10711877 | 1072280 | 1.0 | | | 2.0 | 4.0 | | 8.0 | 2.0 | ... | | 253000.0 | 660000.0 | 2016.0 | 407000.0 | 8123.91 | | | 60371130000000.0 | 2017-08-29 |
| 2 | 10711888 | 1340933 | 1.0 | | | 2.0 | 4.0 | | 8.0 | 2.0 | ... | | 257591.0 | 542923.0 | 2016.0 | 285332.0 | 6673.24 | | | 60371130000000.0 | 2017-04-04 |
| 3 | 10711910 | 1878109 | | | | 2.0 | 3.0 | | 8.0 | 2.0 | ... | | 57968.0 | 78031.0 | 2016.0 | 20063.0 | 1116.46 | | | 60371130000000.0 | 2017-03-17 |
| 4 | 10711923 | 2190858 | | | | 2.0 | 4.0 | | 8.0 | 2.0 | ... | | 167869.0 | 415459.0 | 2016.0 | 247590.0 | 5239.85 | | | 60371130000000.0 | 2017-03-24 |


In [3]:
df.columns

Index(['parcelid', 'id', 'airconditioningtypeid', 'architecturalstyletypeid',
       'basementsqft', 'bathroomcnt', 'bedroomcnt', 'buildingclasstypeid',
       'buildingqualitytypeid', 'calculatedbathnbr', 'decktypeid',
       'finishedfloor1squarefeet', 'calculatedfinishedsquarefeet',
       'finishedsquarefeet12', 'finishedsquarefeet13', 'finishedsquarefeet15',
       'finishedsquarefeet50', 'finishedsquarefeet6', 'fips', 'fireplacecnt',
       'fullbathcnt', 'garagecarcnt', 'garagetotalsqft', 'hashottuborspa',
       'heatingorsystemtypeid', 'latitude', 'longitude', 'lotsizesquarefeet',
       'poolcnt', 'poolsizesum', 'pooltypeid10', 'pooltypeid2', 'pooltypeid7',
       'propertycountylandusecode', 'propertylandusetypeid',
       'propertyzoningdesc', 'rawcensustractandblock', 'regionidcity',
       'regionidcounty', 'regionidneighborhood', 'regionidzip', 'roomcnt',
       'storytypeid', 'threequarterbathnbr', 'typeconstructiontypeid',
       'unitcnt', 'yardbuildingsqft17', 'yardb

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77414 entries, 0 to 77413
Data columns (total 60 columns):
parcelid                        77414 non-null int64
id                              77414 non-null int64
airconditioningtypeid           24953 non-null float64
architecturalstyletypeid        206 non-null float64
basementsqft                    50 non-null float64
bathroomcnt                     77381 non-null float64
bedroomcnt                      77381 non-null float64
buildingclasstypeid             15 non-null float64
buildingqualitytypeid           49672 non-null float64
calculatedbathnbr               76772 non-null float64
decktypeid                      614 non-null float64
finishedfloor1squarefeet        6023 non-null float64
calculatedfinishedsquarefeet    77185 non-null float64
finishedsquarefeet12            73749 non-null float64
finishedsquarefeet13            41 non-null float64
finishedsquarefeet15            3009 non-null float64
finishedsquarefeet50          
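
The non-null counts above show that several columns are mostly empty. A compact way to rank columns by how much is missing might look like the sketch below. The helper name `missing_summary` and the toy DataFrame are ours, invented for illustration; in the notebook the function would be called on `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame):
    """Count and percentage of nulls per column, worst offenders first."""
    n_missing = frame.isna().sum()
    pct_missing = (n_missing / len(frame) * 100).round(2)
    summary = pd.DataFrame({"n_missing": n_missing, "pct_missing": pct_missing})
    return summary.sort_values("n_missing", ascending=False)


# Toy data with made-up values, standing in for the real Zillow frame.
toy = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "bathroomcnt": [2.0, 2.0, np.nan, 1.0],
    "basementsqft": [np.nan, np.nan, np.nan, 700.0],
})

summary = missing_summary(toy)
print(summary)
```

A table like this makes it easier to decide which columns are worth imputing and which are too sparse to keep.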

#### Idea: fit a simple linear regression on 'landtaxvaluedollarcnt' to predict 'lotsizesquarefeet', then use the predictions to impute the missing lot square footage
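
One reading of the idea above, sketched under assumptions: use the land tax value as the single predictor and the lot size as the target, fit on the rows where both are known, and fill in the rows where only the lot size is missing. The helper `impute_with_regression` is hypothetical, and the toy numbers are made up so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def impute_with_regression(frame, feature, target):
    """Fill nulls in `target` with predictions from a one-feature linear model."""
    out = frame.copy()
    # Rows usable for fitting: both feature and target are present.
    known = out[out[target].notna() & out[feature].notna()]
    # Rows we can fill: target missing, feature present.
    to_fill = out[target].isna() & out[feature].notna()
    if known.empty or not to_fill.any():
        return out
    model = LinearRegression().fit(known[[feature]], known[target])
    out.loc[to_fill, target] = model.predict(out.loc[to_fill, [feature]])
    return out


# Toy data (made-up values) where lot size = 3000 + 0.02 * land tax value.
toy = pd.DataFrame({
    "landtaxvaluedollarcnt": [100_000.0, 200_000.0, 300_000.0, 400_000.0],
    "lotsizesquarefeet": [5_000.0, 7_000.0, np.nan, 11_000.0],
})

filled = impute_with_regression(toy, "landtaxvaluedollarcnt", "lotsizesquarefeet")
print(filled)
```

The function works on a copy, so the original frame keeps its nulls and the imputed version can be compared against it. Whether a single predictor is good enough is something to check against held-out rows (for example with `r2_score`) before trusting the filled values.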

In [None]:
df.corr(method = "pearson")

In [10]:
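plt.figure(figsize=(16, 9))
# Plot the correlation matrix of the numeric columns, not the raw DataFrame.
# Passing df itself (it holds non-numeric columns, the likely cause) raised:
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
sns.heatmap(df.select_dtypes(include="number").corr(method="pearson"), annot=True, cmap="Blues")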