# Zillow EDA

_By [Michael Rosenberg](mailto:mmrosenb@andrew.cmu.edu)._

_**Description**: Contains my exploration of the Zillow dataset for the Kaggle Competition._

In [None]:
#imports

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

#constants
%matplotlib inline
sns.set_style("dark")
percentLev = 100
figWidth = figHeight = 8

In [None]:
#load in the dataset
trainFrame = pd.read_csv("../input/train_2016.csv")
sampleSubmission = pd.read_csv("../input/sample_submission.csv")
propFrame = pd.read_csv("../input/properties_2016.csv")

## Metadata Analysis

We will first explore some of the more general components of the dataset before giving summary statistics on relevant variables in the dataset.

In [None]:
print(trainFrame.shape)
print(trainFrame.columns)

We have around 90000 observations to consider in the training dataset. Let's see how many unique parcel IDs are accounted for.

In [None]:
print(len(trainFrame["parcelid"].unique()))

We see that the number of unique properties is about the same as the number of observations in this dataset, which suggests that few properties appear more than once. Given the time-sensitive nature of the [submission](https://www.kaggle.com/c/zillow-prize-1#evaluation), it is slightly concerning that we have so few time points for each property. That being said, let's see which parcel IDs have multiple observations and whether this is an error in the dataset.

In [None]:
idCountFrame = trainFrame.groupby("parcelid",as_index = False)["logerror"].count()
idCountFrame = idCountFrame.rename(columns = {"logerror":"count"})
#get max count observations
print(idCountFrame["count"].max())
#get the observations with count greater than 1
moreThanOneFrame = idCountFrame[idCountFrame["count"] > 1]
print(moreThanOneFrame)
print(moreThanOneFrame["count"].mean())

It looks like most of the properties with multiple entries have 2 entries, while the maximum number of entries found in our training set for a given property is 3.
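One way to check whether repeated parcel IDs are genuine repeat sales rather than duplicated rows is to compare the transaction dates within each parcel. The sketch below is only an illustration: `toyTrain` is a made-up frame that borrows the column names of `trainFrame`, and its values are invented. Swapping in the real frame would run the same check on the data.

```python
import pandas as pd

# toy stand-in for trainFrame (values are made up for illustration)
toyTrain = pd.DataFrame({
    "parcelid": [1, 1, 2, 3, 3, 3],
    "logerror": [0.01, -0.02, 0.03, 0.00, 0.05, 0.05],
    "transactiondate": ["2016-01-04", "2016-06-10", "2016-02-01",
                        "2016-03-15", "2016-03-15", "2016-09-30"],
})

# keep only rows whose parcel ID appears more than once
repeatFrame = toyTrain[toyTrain.duplicated("parcelid", keep=False)]

# rows sharing both parcel ID and date would look like duplicated records
sameDayFrame = repeatFrame[
    repeatFrame.duplicated(["parcelid", "transactiondate"], keep=False)]

# number of distinct sale dates per repeated parcel
datesPerParcel = repeatFrame.groupby("parcelid")["transactiondate"].nunique()
print(sameDayFrame)
print(datesPerParcel)
```

If `sameDayFrame` comes back empty on the real data, the repeats are more likely separate sales than data-entry errors, though that is a heuristic rather than proof.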

In [None]:
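#count observations per transaction date; parse the dates so the x-axis is a true time axis
dateCountFrame = trainFrame.groupby(pd.to_datetime(trainFrame["transactiondate"]))["parcelid"].count()
plt.figure(figsize=(figWidth,figHeight))
dateCountFrame.plot()
plt.xlabel("Date")
plt.ylabel("Count")
plt.title("Distribution of Date in Training Set")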

_Figure 1: Distribution of training set by date._

We see two interesting components occurring here:

*  Transactions are recorded at daily granularity, whereas we are expected to predict by month. If we want to use time-dependent features, we may need to coarsen the transaction date to the month. We should also check whether the test dataset is at this level of granularity.

* We see that our observation rate drops off for the months in which we are supposed to be predicting. This definitely suggests that we should account for time-dependent features, given that we have such limited information on the period we are targeting.
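Coarsening the transaction date to the month can be sketched as follows. This is a minimal example on made-up dates: `toyDates` is a hypothetical stand-in for `trainFrame["transactiondate"]`, and it assumes the dates are stored as `YYYY-MM-DD` strings.

```python
import pandas as pd

# toy stand-in for trainFrame["transactiondate"] (made-up dates)
toyDates = pd.Series(["2016-01-04", "2016-01-20", "2016-10-03",
                      "2016-11-15", "2016-11-28", "2016-12-30"])

# parse the strings and truncate each date to its month
monthSeries = pd.to_datetime(toyDates).dt.to_period("M")

# observations per month, in calendar order
monthCounts = monthSeries.value_counts().sort_index()
print(monthCounts)
```

The same monthly counts on the real data would also make the drop-off in the later months easier to quantify than the daily plot does.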

In [None]:
#get parcel IDs from sample submission, find if there is overlap with training data
sampleSubmissionIDSet = set(sampleSubmission["ParcelId"].unique())
trainingIDSet = set(trainFrame["parcelid"].unique())
#get overlap
overlap = trainingIDSet & sampleSubmissionIDSet
print(len(overlap))
#get difference
diff = sampleSubmissionIDSet - trainingIDSet
print(len(diff))

Interestingly, we see that the test set has many more properties than the training set and yet also contains the training set itself! This suggests to me that we need to predict well not only out-of-time for the properties we have already observed, but also out-of-sample for properties we have never seen. That is two problems at once, which makes the task considerably harder. We will try.
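To keep the two problems separate later on, it may help to flag each submission parcel by whether it has any training history. The sketch below uses invented IDs: `toyTrainIds` and `toySubmission` are hypothetical stand-ins for `trainFrame["parcelid"]` and `sampleSubmission`, and `inTraining` is a column name introduced here for illustration.

```python
import pandas as pd

# toy stand-ins (made-up IDs) for the two parcel ID columns
toyTrainIds = pd.Series([10, 11, 11, 12], name="parcelid")
toySubmission = pd.DataFrame({"ParcelId": [10, 11, 12, 13, 14]})

# True where the parcel has at least one training observation
toySubmission["inTraining"] = toySubmission["ParcelId"].isin(toyTrainIds)

# share of submission parcels that would be out-of-sample
outOfSampleShare = 1 - toySubmission["inTraining"].mean()
print(toySubmission)
print(outOfSampleShare)
```

With such a flag, validation scores could be reported separately for seen and unseen parcels, which is one possible way to see which of the two problems hurts more.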

In [None]:
plt.figure(figsize=(figWidth,figHeight))
plt.hist(trainFrame["logerror"])
plt.xlabel("$logerror$")
plt.ylabel("count")
plt.title("Distribution of $logerror$")

_Figure 2: Distribution of our Target Variable._

We see that the range of our target variable is rather small. This may help to inform the size of the weights on our features.
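The histogram's impression of the target's spread can be backed up with a few summary numbers. The sketch below runs on synthetic data, not the competition data: `toyLogError` is a hypothetical stand-in for `trainFrame["logerror"]`, and the 1st and 99th percentile cutoffs are an arbitrary choice made for illustration.

```python
import numpy as np
import pandas as pd

# toy stand-in for trainFrame["logerror"] (synthetic, not real data)
rng = np.random.default_rng(0)
toyLogError = pd.Series(rng.normal(loc=0.0, scale=0.1, size=1000))

# range plus a few quantiles to see where the bulk of the values sit
print(toyLogError.min(), toyLogError.max())
print(toyLogError.quantile([0.01, 0.25, 0.5, 0.75, 0.99]))

# clip to the 1st-99th percentile band so that a few extreme values
# do not flatten a histogram drawn from the result
lower, upper = toyLogError.quantile([0.01, 0.99])
clippedLogError = toyLogError.clip(lower=lower, upper=upper)
```

Passing the clipped series to `plt.hist` with more bins than the default would likely show the shape of the distribution near zero more clearly.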

In [None]:
propFrame.shape

We see that we have close to 3 million properties in the overall set, with 58 features to choose from. The number of observations suggests that a neural architecture could be useful later down the line, but the out-of-time problem and the relatively small number of features suggest that a shallower model might also be useful.

In [None]:
#get number of missing in each column
propFrame.isnull().sum()

_Table 1: Number of missing observations per feature in the properties dataset._

We see that we have missing observations across the board for all of our variables. This suggests that we will likely need to do a large amount of feature reduction simply based on the availability of certain variables, and that we will need to offer a different model when a selected feature is missing. Simply put, this is pretty gross.
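Raw missing counts are hard to compare across columns, so a percentage view may be a more useful starting point for feature reduction. The sketch below is only an illustration on invented values: `toyProp` is a hypothetical stand-in for `propFrame` with a few illustrative columns, and the 50% cutoff (`maxMissingPercent`) is an assumption, not a recommendation from the data.

```python
import numpy as np
import pandas as pd

percentLev = 100  # same constant as in the imports cell

# toy stand-in for propFrame (made-up values and a subset of columns)
toyProp = pd.DataFrame({
    "parcelid": [1, 2, 3, 4],
    "bathroomcnt": [2.0, np.nan, 1.0, 3.0],
    "poolcnt": [np.nan, np.nan, np.nan, 1.0],
})

# percent of rows missing in each column, most-missing first
missingPercent = (toyProp.isnull().mean() * percentLev).sort_values(ascending=False)
print(missingPercent)

# hypothetical cutoff: keep columns missing in at most half the rows
maxMissingPercent = 50
keepCols = missingPercent[missingPercent <= maxMissingPercent].index.tolist()
print(keepCols)
```

Some columns may be missing for a reason (for example, no pool at all), so a high missing percentage is not by itself grounds for dropping a feature.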