# Exploratory Data Analysis of Zillow Real Estate Data

In this competition, Zillow has provided a file of Zillow estimates and house sale transaction data, together with some metadata (e.g. number of rooms). The goal of this competition is to come up with an algorithm that predicts the residual error, that is, the log error, defined as the difference between the log of the Zestimate and the log of the actual sale price.

In other words, the competition focuses on trying to predict when the Zillow Estimate is more reliable and when less so.
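As a quick illustration of what the log error measures, here is a minimal sketch. The prices are hypothetical numbers chosen for the example and do not come from the competition data.

```python
import numpy as np

# Hypothetical numbers, for illustration only (not taken from the competition data).
zestimate = np.array([110_000.0, 250_000.0, 400_000.0])
sale_price = np.array([100_000.0, 250_000.0, 500_000.0])

# logerror = log(Zestimate) - log(SalePrice)
logerror = np.log(zestimate) - np.log(sale_price)

# Positive: Zestimate too high; zero: exact; negative: Zestimate too low.
print(logerror)
```

Because the error is a difference of logs, it depends only on the ratio between estimate and sale price, not on the absolute price level.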

The purpose of this Jupyter notebook is to perform an initial analysis of the data at hand.

Specifically, I want to:

- Observe the format of the data input files.
- Explore completeness of data for each factor.
- Observe the distribution of the residual error.
- Explore any correlated factors.
- Explore log error with relationship to various factors.

In [86]:
import pandas as pd
import holoviews as hv
import seaborn as sb
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as stats
import pylab
%matplotlib inline

## File format exploration

First let's explore what the different input files look like.

In [87]:
train_df = pd.read_csv('./train/train_2016.csv')
property_df = pd.read_csv('./train/properties_2016.csv')

In [88]:
train_df.head(5)

In [89]:
property_df.head(5)

These two files are very simple. The train file contains just the logerror together with parcelid and transactiondate.

Property file contains a list of features. These two files are linked via "parcelid". Just looking at the head, there appear to be a lot of "gaps" of information for many of the features. So one important thing to do is to explore how complete each feature is.

How many features are there?

In [90]:
len(property_df.columns)

There appear to be 57 features (58 columns, counting the parcelid key). Some of these names are really long and almost incomprehensible, so I want to rename these columns. Furthermore, although columns like "airconditioningtypeid" look numeric (i.e. float), they are actually categorical, which is important to keep in mind when using these features.

The following columns are actually categorical:

- 'airconditioningtypeid'
- 'architecturalstyletypeid'
- 'buildingqualitytypeid'
- 'buildingclasstypeid'
- 'heatingorsystemtypeid'
- 'pooltypeid10'
- 'pooltypeid2'
- 'pooltypeid7'
- 'propertycountylandusecode'
- 'propertylandusetypeid'
- 'propertyzoningdesc'
- 'rawcensustractandblock'
- 'censustractandblock'
- 'regionidcounty'
- 'regionidcity'
- 'regionidzip'
- 'regionidneighborhood'
- 'storytypeid'
- 'typeconstructiontypeid'

These columns contain boolean values:

- 'taxdelinquencyflag'
- 'fireplaceflag'

These columns are especially interesting:

- 'fips': The values look like numbers, but each one is actually a FIPS code, which identifies the state and county a property is located in.
- 'rawcensustractandblock': Appears to be a combination of FIPS code, tract number, and block number, separated by a ".".

These categorical features will have to be converted to dummy variables before doing any machine learning on them (unless we are using decision trees). Related categorical variables could be converted to a bag of words in more advanced explorations/tests.
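A minimal sketch of the dummy-variable conversion, using `pd.get_dummies` on a toy frame. The id values and the `bedrooms` column are made up for illustration; only the column name "airconditioningtypeid" comes from the data.

```python
import pandas as pd

# Toy frame; the id values are made up for illustration.
toy_df = pd.DataFrame({
    'airconditioningtypeid': [1.0, 13.0, None, 1.0],
    'bedrooms': [3, 2, 4, 3],  # hypothetical column
})

# Cast to 'category' so pandas stops treating the ids as floats.
toy_df['airconditioningtypeid'] = toy_df['airconditioningtypeid'].astype('category')

# One indicator column per category; dummy_na=True adds a column for missing values.
dummies = pd.get_dummies(toy_df, columns=['airconditioningtypeid'], dummy_na=True)
print(dummies.columns.tolist())
```

Keeping a separate indicator for missing values is a choice, not a requirement. It seemed useful here because so many of these columns have gaps.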

When I rename the columns, categorical columns that are encoded as numbers will be prefixed with "id_". Boolean features will be prefixed with "has_". Most others will be prefixed with "num_" or "area_".

In [91]:
property_df.columns

I am making these variable names more concise and consistent.

In [92]:
property_df.columns = ['parcelid', 
'ac_id', 
'id_arci_style',
'area_basement', 
'num_bathroom', 
'num_bedroom', 
'id_build_class',
'id_build_quality', 
'calculatedbathnbr', 
'id_decktype',
'area_first_floor', 
'area_total_calc',
'area_fin_living', 
'area_fin_perim_living', 
'area_fin_total_area',
'area_first_floor_2', 
'area_base', 
'fips', 
'num_fire',
'num_fullbath', 
'num_garagecar', 
'area_garage', 
'has_spa',
'id_heating_system_id', 
'latitude', 
'longitude', 
'area_lotsize',
'num_pool', 
'area_pool_total', 
'id_spa_tub', 
'id_pool_spa_hottub', 
'id_pool_no_hottub',
'id_zone_county_landusecode', 
'id_zone_landuse',
'zone_property', 
'rawcensustractandblock', 
'region_city',
'region_county', 
'region_neighborhood',
'region_zip', 
'num_room',
'id_storytype', 
'num_3_4_bath', 
'typeconstructiontypeid',
'num_unit', 
'area_patio_yd', 
'area_shed_yd', 
'year_built',
'num_stories', 
'has_fireplace', 
'assessed_home_value',
'assessed_parcel_value', 
'assessmentyear', 
'landtaxvaluedollarcnt',
'tax_amount', 
'tax_is_delinquent', 
'tax_delinquency_year',
'censustractandblock']
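Assigning a full list to `columns` depends on the column order of the file. A possible alternative is an explicit old-to-new mapping with `DataFrame.rename`. The sketch below uses a toy frame with only a few columns, and the mapping shown is my reading of the positional list above, not something taken from the competition documentation.

```python
import pandas as pd

# Toy frame with a few of the original column names mentioned above.
toy_df = pd.DataFrame(columns=['parcelid', 'airconditioningtypeid',
                               'fireplaceflag', 'taxdelinquencyflag'])

# Explicit old -> new mapping; unlisted columns (here 'parcelid') are left untouched.
rename_map = {
    'airconditioningtypeid': 'ac_id',
    'fireplaceflag': 'has_fireplace',
    'taxdelinquencyflag': 'tax_is_delinquent',
}
renamed_df = toy_df.rename(columns=rename_map)
print(renamed_df.columns.tolist())
```

With a mapping, a reordered or extended input file does not silently shift names onto the wrong columns.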

## Exploration of Data Completeness

In [93]:
# Count missing and filled values per column
missing_df = property_df.isnull().sum(axis=0).reset_index()
missing_df.columns = ['column_name', 'missing_count']
filled_df = property_df.notnull().sum(axis=0).reset_index()
filled_df.columns = ['column_name', 'filled_count']
merged_df = pd.merge(missing_df, filled_df, on=['column_name'])
# Note: despite its name, fraction_filled holds a percentage (0-100)
merged_df['fraction_filled'] = merged_df['filled_count']/(merged_df['missing_count'] + merged_df['filled_count'])*100
# Keep only columns with at least one missing value, most complete first
merged_df = merged_df.loc[merged_df['missing_count']>0]
merged_df = merged_df.sort_values(by='missing_count', ascending=True)
merged_df.head()

In [94]:
fig = plt.figure(figsize=(30, 20))
sn_plot = sb.barplot(x='fraction_filled', y='column_name', data=merged_df)
sn_plot.set_xlabel('Percent', fontsize = 25)
sn_plot.set_ylabel('Feature', fontsize = 25)
sn_plot.set_title('Feature Completeness', fontsize=50)
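For reference, the same completeness percentages can be computed more compactly. This is a sketch on a toy frame standing in for `property_df`, with made-up values.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for property_df; values are made up.
toy_df = pd.DataFrame({
    'parcelid': [1, 2, 3, 4],
    'num_pool': [1.0, np.nan, np.nan, np.nan],
    'year_built': [1990.0, 2001.0, np.nan, 1975.0],
})

# notnull() gives booleans, so the column mean is the filled fraction.
percent_filled = toy_df.notnull().mean() * 100

# Keep only columns that have gaps, most complete first.
percent_filled = percent_filled[percent_filled < 100].sort_values(ascending=False)
print(percent_filled)
```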

It appears that many features are very incomplete. It's very tempting to eliminate features that do not have much content from the analysis. If I look at the features, though, I see things like:

- num_pool
- has_spa
- id_spa_tub
- area_pool_total
- has_fireplace
- area_basement

These are features that most homes do not have. I will have to explore these features to determine if it is likely that Null/None in this case means that the house does not have this particular feature and so it was not filled in. 

One thing to explore later in the analysis is to determine the impact of the sparsely populated fields on log error with the following options:

- looking only at filled in information.
- looking at the information content of NaN on log error.
- looking at the information after imputing missing values.
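The three options could look roughly like this. It is a sketch on toy data with made-up values, and treating NaN as "no pool" is an assumption that still has to be checked against the real data.

```python
import numpy as np
import pandas as pd

# Toy data; 'area_pool_total' follows the renamed columns above, values are made up.
toy_df = pd.DataFrame({
    'area_pool_total': [450.0, np.nan, np.nan, 600.0, np.nan],
    'logerror': [0.02, -0.01, 0.03, -0.04, 0.00],
})

# Option 1: look only at rows where the feature is filled in.
filled_only = toy_df.dropna(subset=['area_pool_total'])

# Option 2: treat "missing" itself as a feature (1 = missing) and compare log error by group.
toy_df['area_pool_total_missing'] = toy_df['area_pool_total'].isnull().astype(int)
abs_error_by_missing = (
    toy_df['logerror'].abs().groupby(toy_df['area_pool_total_missing']).mean()
)

# Option 3: impute; here NaN is assumed to mean "no pool", hence 0.
toy_df['area_pool_total_imputed'] = toy_df['area_pool_total'].fillna(0)

print(abs_error_by_missing)
```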

## Observe the distribution of the residual error


In [95]:
hv.notebook_extension('bokeh')
frequencies, edges = np.histogram(train_df['logerror'], 500)
hv.Histogram(frequencies, edges)

In [96]:
stats.probplot(train_df['logerror'], dist='norm', plot=pylab)

This looks quite tight but is not normally distributed. The deviations from the line outside the interval [-2, 2] suggest a distribution with heavy tails, that is, the ends of the distribution contain values that are more extreme than you would expect from a normal distribution. The graph looks symmetrical. If one were to look at the absolute log error, the histogram would look like the right half of this one: it would start at 0, with roughly double the counts in each bin. This brings up a point of note in terms of granularity:

- Absolute log error: Less granular. We can tell how good a particular feature is in estimating the sales price.
- Log error: More granular. We can now also tell if a feature also tends to over or underestimate.
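To make the difference concrete, here is a small sketch with made-up log errors for two hypothetical groups of homes. Both groups have the same mean absolute log error, but only the signed mean shows that one of them is consistently overestimated.

```python
import numpy as np

# Made-up log errors for two hypothetical groups of homes.
group_a = np.array([0.05, 0.04, 0.06, 0.05])    # consistently overestimated
group_b = np.array([0.05, -0.04, 0.06, -0.05])  # errors in both directions

for name, errors in [('a', group_a), ('b', group_b)]:
    signed_mean = errors.mean()            # direction of the bias
    absolute_mean = np.abs(errors).mean()  # size of the error, direction discarded
    print(name, round(signed_mean, 3), round(absolute_mean, 3))
```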

## Explore any correlated factors.

Let's start easy and just observe if there are any significant pairwise correlations between numerical features.

In [97]:
full_train_df = pd.merge(train_df, property_df, on=['parcelid'])
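Note that `pd.merge` defaults to an inner join, so any transaction without a matching property row would be dropped silently. A possible way to check for this is sketched below on toy stand-ins for the two frames; the values and the `toy_` names are made up.

```python
import pandas as pd

# Toy stand-ins for train_df and property_df; values are made up.
toy_train = pd.DataFrame({
    'parcelid': [10, 11, 11, 12],
    'logerror': [0.01, -0.02, 0.03, 0.00],
})
toy_property = pd.DataFrame({
    'parcelid': [10, 11, 13],
    'num_bedroom': [3, 2, 4],
})

# how='left' keeps every transaction; validate raises if a parcelid repeats in toy_property.
merged = pd.merge(toy_train, toy_property, on='parcelid', how='left',
                  validate='many_to_one', indicator=True)

# Transactions without a matching property row show up as 'left_only'.
unmatched = (merged['_merge'] == 'left_only').sum()
print(len(merged), unmatched)
```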

In [98]:
non_categorical_features_df = full_train_df[[
    'logerror',
    'transactiondate',
    'area_basement', 
    'num_bathroom', 
    'num_bedroom', 
    'calculatedbathnbr', 
    'area_first_floor', 
    'area_total_calc',
    'area_fin_living', 
    'area_fin_perim_living', 
    'area_fin_total_area',
    'area_first_floor_2', 
    'area_base', 
    'num_fire',
    'num_fullbath', 
    'num_garagecar', 
    'area_garage', 
    'has_spa',
    'latitude', 
    'longitude', 
    'area_lotsize',
    'num_pool', 
    'area_pool_total', 
    'num_room',
    'num_3_4_bath', 
    'num_unit', 
    'area_patio_yd', 
    'area_shed_yd', 
    'year_built',
    'num_stories', 
    'has_fireplace', 
    'assessed_home_value',
    'assessed_parcel_value', 
    'assessmentyear', 
    'landtaxvaluedollarcnt',
    'tax_amount', 
    'tax_is_delinquent', 
    'tax_delinquency_year',
]]

In [99]:
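# numeric_only=True skips non-numeric columns such as transactiondate (needs pandas >= 1.5)
cor_mat = non_categorical_features_df.corr(method='spearman', numeric_only=True)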
sb.heatmap(cor_mat)

We can see that no single variable is strongly correlated with log error (it wouldn't be a challenge otherwise :-).

Otherwise, high correlation can be observed in places that make sense. For example:

assessed_home_value highly correlates with:
- assessed_parcel_value
- landtaxvaluedollarcnt
- tax_amount

There is also some moderate correlation with:
- area_basement
- num_bathroom
- num_bedroom
- area_total_calc
- etc...

...basically things that correlate with the size of the house.
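Reading these relationships off a heatmap by eye gets hard with this many features. One way to list them is sketched below on synthetic data. The column names mirror the renamed features, but the values are randomly generated, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data for illustration; only the column names mirror the renamed features.
rng = np.random.default_rng(0)
area = rng.uniform(800, 4000, size=200)
toy_df = pd.DataFrame({
    'logerror': rng.normal(0, 0.1, size=200),
    'area_total_calc': area,
    'tax_amount': area * 3 + rng.normal(0, 500, size=200),
})

cor_mat = toy_df.corr(method='spearman')

# Correlation of every feature with logerror, strongest (by magnitude) first.
logerror_corr = cor_mat['logerror'].drop('logerror').sort_values(key=np.abs, ascending=False)

# Strongest pairs overall: keep the upper triangle so each pair appears once.
upper = cor_mat.where(np.triu(np.ones(cor_mat.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(key=np.abs, ascending=False)
print(pairs.head())
```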

## Exploring the Effects of Time on logerror

One variable not assessed in the above correlation analysis is time. My expectation is that sales are down in the winter, and that with fewer sales the logerror would be higher as well, since there are fewer data points on which to base the estimates. This is just my guess though, so I have to actually confirm it.
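One way this check could be set up is to compare sales volume and mean absolute log error per month. The sketch below uses made-up transactions that carry no real signal; it only shows the mechanics.

```python
import pandas as pd

# Toy transactions; dates and log errors are made up and carry no real signal.
toy_train = pd.DataFrame({
    'transactiondate': ['2016-01-05', '2016-01-20', '2016-06-11', '2016-06-30', '2016-12-02'],
    'logerror': [0.04, -0.06, 0.01, -0.01, 0.08],
})

# transactiondate is read as a string, so parse it before extracting the month.
toy_train['transactiondate'] = pd.to_datetime(toy_train['transactiondate'])
toy_train['month'] = toy_train['transactiondate'].dt.month

# Sales volume and mean absolute log error per month, side by side.
monthly = (
    toy_train.assign(abs_logerror=toy_train['logerror'].abs())
    .groupby('month')['abs_logerror']
    .agg(['count', 'mean'])
)
print(monthly)
```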

To be filled in....

## Conclusions from First Analysis:

For features that are not very complete, we need to evaluate whether:
- imputing values makes sense. For things like pool size it may (NaN = 0), but for things like area_total_calc it may not (it's not clear how the fill-in should best be computed).
- sparse features have an impact on logerror.

We need to convert "apparent" numerical features to categorical features. One way to do this is to convert the various "id" columns to binary dummy variables. A question to the community is whether these id variables could also be treated as a bag of words, a technique I have heard of but not used yet.

Another question to explore/ask the community is whether for decision-tree-based ML algorithms it makes sense to apply the above mentioned techniques at all or just leave each feature as is.

Personally, I think I would want to apply a model zoo to this challenge just to have a comparison on the performance of simple ML. Then also add some deep learning architectures and/or decision-tree-based methods.

The combined models could then be used in some kind of weak-learner ensemble or StackNet model.