In this notebook, I explore the data and save a version of it that I consider a starting point. I keep the features that are readily available without much manipulation.

With the basic version of the data, I can do a few preprocessing steps and easily apply different statistical learning methods.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

cd Zillow/

Ignore the cell above. I just needed it to get to the right directory.

# Data Reading & Combining

In [None]:
dftrain = pd.read_csv('train_2016.csv')

Training data

In [None]:
dfprop = pd.read_csv('properties_2016.csv')

Property data

In [None]:
dfprop.describe()

In [None]:
dfprop.index = dfprop.parcelid

In [None]:
dfprop = dfprop.drop('parcelid',axis=1)

In [None]:
dfprop.head()

In [None]:
dftrain.index = dftrain.parcelid

In [None]:
dftrain = dftrain.drop('parcelid',axis=1)

In [None]:
dftrain.head()

In [None]:
dfcomb = dfprop.join(dftrain,how='outer')

Combine both sets of data on their parcel id.
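
As a small illustration of what the outer join does, here is a sketch on two toy tables. The parcel ids and values are made up and only stand in for the property and transaction data.

```python
import pandas as pd

# Toy stand-ins for the property and transaction tables (made-up values).
props = pd.DataFrame(
    {"bedroomcnt": [3, 2, 4]},
    index=pd.Index([101, 102, 103], name="parcelid"),
)
trans = pd.DataFrame(
    {"logerror": [0.02, -0.01]},
    index=pd.Index([101, 103], name="parcelid"),
)

# An outer join on the index keeps every parcel from both tables;
# parcels with no transaction get NaN in the transaction columns.
combined = props.join(trans, how="outer")
print(combined)
```

Parcel 102 has no transaction in this toy example, so its logerror comes out as NaN.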

In [None]:
dfcomb.head()

Number of parcels without a transaction

In [None]:
np.sum(np.isnan(dfcomb.logerror))

Homes without a transaction have NaN as their logerror after the outer join. For cleaning the data, we can use the whole property data set, but for predicting log errors we can only use the homes that had a transaction.
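
The split between "cleaning on everything" and "modeling on transactions only" can be sketched as follows. The toy values and the median imputation are assumptions for illustration; this notebook itself does not impute anything.

```python
import numpy as np
import pandas as pd

# Toy combined table (made-up values): two parcels have no transaction.
combined = pd.DataFrame(
    {
        "yearbuilt": [1950.0, np.nan, 1990.0, 1970.0],
        "logerror": [0.02, np.nan, -0.01, np.nan],
    },
    index=pd.Index([101, 102, 103, 104], name="parcelid"),
)

# A cleaning statistic can use every parcel, with or without a transaction.
median_year = combined["yearbuilt"].median()
cleaned = combined.fillna({"yearbuilt": median_year})

# Only parcels with an observed logerror can be used to fit a model.
model_rows = cleaned[cleaned["logerror"].notna()]
print(model_rows)
```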

# Data Filtering

In [None]:
dfcomb_nar = dfcomb.loc[dfcomb.logerror.notnull(), :]

In [None]:
dfcomb_nar.shape

Calculate the fraction of missing values for each feature (stored in the '% nan' row) and the number of unique values each feature takes.

In [None]:
# Row '% nan': fraction of missing values per feature.
# Row 'num unique vals': number of distinct non-null values per feature.
desc_df = pd.DataFrame(index=['% nan', 'num unique vals'], columns=dfcomb_nar.columns)
for col in dfcomb_nar.columns:
    desc_df.loc['% nan', col] = dfcomb_nar[col].isnull().mean()
    desc_df.loc['num unique vals', col] = dfcomb_nar[col].nunique()

In [None]:
desc_df.iloc[:,:10]

In [None]:
desc_df.iloc[:,10:20]

In [None]:
desc_df.iloc[:,20:30]

In [None]:
desc_df.iloc[:,30:40]

In [None]:
desc_df.iloc[:,40:50]

In [None]:
desc_df.iloc[:,50:60]
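
The same per-feature summary can probably be built without a loop. A minimal sketch on a toy frame (the column names `a` and `b` and the row label `frac nan` are placeholders, not part of the data set):

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (made-up values).
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 1.0, 2.0],
        "b": [np.nan, np.nan, 5.0, 5.0],
    }
)

# isna().mean() gives the fraction of missing values per column;
# nunique() counts distinct non-null values per column.
summary = pd.DataFrame(
    {"frac nan": df.isna().mean(), "num unique vals": df.nunique()}
).T
print(summary)
```

Transposing puts the features in columns, matching the layout of `desc_df` above.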

Predictors to use:
bathroomcnt
bedroomcnt
calculatedbathnbr
calculatedfinishedsquarefeet
fullbathcnt
latitude
longitude
roomcnt
yearbuilt
structuretaxvaluedollarcnt
landtaxvaluedollarcnt


Categorical Predictors to use:
regionidcounty


Categorical Predictors with too many values to be encoded:
regionidcity
regionidneighborhood
regionidzip

Encoding these would greatly increase the number of parameters that need to be estimated and hence increase the uncertainty of our model.
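
To make the parameter-count concern concrete, here is a sketch that one-hot encodes a low-cardinality and a high-cardinality column with `pd.get_dummies`. The ids are made up; in the real data the zip column takes far more values than in this toy example.

```python
import pandas as pd

# Toy data (made-up ids): 3 distinct counties, 6 distinct zip codes.
df = pd.DataFrame(
    {
        "regionidcounty": [1, 2, 3, 1, 2, 3],
        "regionidzip": [96001, 96002, 96003, 96004, 96005, 96006],
    }
)

# One-hot encoding creates one column per distinct value, so each
# distinct category adds a coefficient for a linear model to estimate.
county_dummies = pd.get_dummies(df["regionidcounty"], prefix="county")
zip_dummies = pd.get_dummies(df["regionidzip"], prefix="zip")

print(county_dummies.shape[1], zip_dummies.shape[1])
```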

# Misc. Feature Analysis
If a feature needed extra analysis before I could decide whether to include it, I did that analysis below.

In [None]:
dfprop.index[:5]  # parcelid was moved to the index above, so it is no longer a column

In [None]:
dfprop.propertycountylandusecode.value_counts()

The land use code varies for each county. It could be used with regionidcounty.
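
One possible way to use the land use code together with the county is to build a single combined categorical key. This is only a sketch: the column name `county_landuse` is hypothetical and the values are made up.

```python
import pandas as pd

# Toy data (made-up ids and codes).
df = pd.DataFrame(
    {
        "regionidcounty": [3101, 3101, 1286],
        "propertycountylandusecode": ["0100", "010C", "122"],
    }
)

# Because a code's meaning depends on the county, prefix it with the county id
# so that identical codes from different counties stay distinct.
df["county_landuse"] = (
    df["regionidcounty"].astype(str) + "_" + df["propertycountylandusecode"]
)
print(df["county_landuse"].tolist())
```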

In [None]:
dfprop['airconditioningtypeid'].head()

In [None]:
dfprop['airconditioningtypeid'].value_counts()

Air conditioning type is categorical.

In [None]:
dfprop['architecturalstyletypeid'].head()

In [None]:
dfprop['architecturalstyletypeid'].tail()

In [None]:
dfprop['architecturalstyletypeid'].value_counts()

Architectural style type is categorical.

In [None]:
dfprop['basementsqft'].max()

In [None]:
dfprop['basementsqft'].min()

In [None]:
plt.hist(dfprop.basementsqft.dropna())
plt.show()

In [None]:
dfprop.bathroomcnt.head()

In [None]:
np.sum(np.isnan(dfprop.bathroomcnt))

11,462 places don't have an entry for a bathroom?

In [None]:
np.sum(dfprop.bathroomcnt == 0)

116,614 places don't have a bathroom?
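
If the zeros turn out to be placeholders rather than true counts, one option is to treat them as missing. The sketch below assumes that interpretation, which this notebook has not verified, and uses made-up values.

```python
import numpy as np
import pandas as pd

# Toy bathroom counts (made-up values): two zeros and one NaN.
bath = pd.Series([2.0, 0.0, np.nan, 1.5, 0.0])

n_missing = int(bath.isna().sum())  # entries with no value recorded
n_zero = int((bath == 0).sum())     # entries recorded as exactly zero

# Assumption: a zero count is a placeholder, so recode it as missing.
bath_cleaned = bath.replace(0, np.nan)
print(n_missing, n_zero, int(bath_cleaned.isna().sum()))
```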

In [None]:
dfprop.bathroomcnt.count()

In [None]:
plt.hist(dfprop.bathroomcnt.dropna())
plt.show()

In [None]:
dfprop.bedroomcnt.head()

In [None]:
dfprop.bedroomcnt.tail()

In [None]:
dfprop.fips.head()

In [None]:
dfprop.fips.tail()

In [None]:
dfprop.fips.value_counts()

In [None]:
dfprop.fullbathcnt.value_counts()

In [None]:
dfprop.propertylandusetypeid.value_counts()

In [None]:
dfprop.assessmentyear.value_counts()

In [None]:
dfprop.landtaxvaluedollarcnt.head()

In [None]:
dfprop.landtaxvaluedollarcnt.tail()

In [None]:
dfprop.landtaxvaluedollarcnt.describe()

In [None]:
tax_mask = dfprop.structuretaxvaluedollarcnt.notnull()
(dfprop.taxvaluedollarcnt[tax_mask]
 - dfprop.landtaxvaluedollarcnt[tax_mask]
 - dfprop.structuretaxvaluedollarcnt[tax_mask])

taxvaluedollarcnt = landtaxvaluedollarcnt + structuretaxvaluedollarcnt
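
Rather than reading the differences by eye, the identity can be checked programmatically. A minimal sketch on toy values (made up, chosen so that the identity holds):

```python
import numpy as np
import pandas as pd

# Toy values (made up); the last row is incomplete.
df = pd.DataFrame(
    {
        "taxvaluedollarcnt": [300000.0, 150000.0, np.nan],
        "landtaxvaluedollarcnt": [200000.0, 50000.0, 80000.0],
        "structuretaxvaluedollarcnt": [100000.0, 100000.0, np.nan],
    }
)

# Only compare rows where all three values are present.
mask = df.notna().all(axis=1)
residual = (
    df.loc[mask, "taxvaluedollarcnt"]
    - df.loc[mask, "landtaxvaluedollarcnt"]
    - df.loc[mask, "structuretaxvaluedollarcnt"]
)

# The identity holds if every residual is zero up to floating-point error.
identity_holds = bool(np.isclose(residual, 0).all())
print(identity_holds)
```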

In [None]:
dfprop.taxvaluedollarcnt - dfprop.taxamount

In [None]:
dfprop.taxamount.describe()

In [None]:
tvdc_mask = dfprop.taxvaluedollarcnt.notnull()
ta_mask = dfprop.taxamount.notnull()

ta_tvdc_mask = ta_mask & tvdc_mask

In [None]:
plt.scatter(dfprop.taxamount[ta_tvdc_mask],dfprop.taxvaluedollarcnt[ta_tvdc_mask])
plt.show()

There is a clear relationship between taxamount and taxvaluedollarcnt. To begin, I will include only the components of taxvaluedollarcnt (structuretaxvaluedollarcnt and landtaxvaluedollarcnt) and ignore taxamount.
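
The strength of such a relationship can be quantified with a correlation coefficient. The sketch below uses synthetic data, with the tax amount generated as a made-up 1.25% of the assessed value plus noise. It shows the calculation only and says nothing about the real data.

```python
import numpy as np
import pandas as pd

# Synthetic data: the 1.25% rate and the noise level are assumptions.
rng = np.random.default_rng(0)
tax_value = rng.uniform(1e5, 1e6, size=200)
tax_amount = 0.0125 * tax_value + rng.normal(0, 500, size=200)

df = pd.DataFrame({"taxvaluedollarcnt": tax_value, "taxamount": tax_amount})

# Pearson correlation; Series.corr skips pairs where either value is NaN.
corr = df["taxamount"].corr(df["taxvaluedollarcnt"])
print(round(corr, 3))
```

When two predictors are this strongly correlated, the second adds little information, which is one reason to keep only one of them in a first version.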

# Saving the data

In [None]:
predictors = ['bathroomcnt',
              'bedroomcnt',
              'calculatedbathnbr',
              'calculatedfinishedsquarefeet',
              'fullbathcnt',
              'latitude',
              'longitude',
              'roomcnt',
              'yearbuilt',
              'structuretaxvaluedollarcnt',
              'landtaxvaluedollarcnt',
              'regionidcounty']
response = ['logerror']

In [None]:
cols = predictors + response
cols

In [None]:
df_clean = dfcomb_nar.loc[:,cols].dropna()

In [None]:
df_clean.shape

In [None]:
df_clean.to_csv('version_1.csv')
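
When the saved file is read back later, the parcel id should be restored as the index. A minimal sketch, assuming the index is named `parcelid` (as it would be after the index assignments above) and using a temporary directory and toy values in place of the real file:

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for df_clean (made-up values).
df_clean = pd.DataFrame(
    {"bedroomcnt": [3.0, 4.0], "logerror": [0.02, -0.01]},
    index=pd.Index([101, 103], name="parcelid"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "version_1.csv")
    # to_csv writes the index as the first column, labeled with its name.
    df_clean.to_csv(path)
    # index_col restores that column as the index when loading.
    reloaded = pd.read_csv(path, index_col="parcelid")

print(reloaded)
```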