Work in progress. Tips and guidance are appreciated.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt
import datetime

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

In [None]:
train = pd.read_csv("../input/train_2016_v2.csv")
prop = pd.read_csv("../input/properties_2016.csv")
sub = pd.read_csv("../input/sample_submission.csv")

Let's take a look at the data.

In [None]:
train.head()

In [None]:
prop.head()

OK, let's append the properties of each house to the train data, so that we have a single unified dataset.

In [None]:
merged_df = train.merge(prop, how="left", on="parcelid")
print(train.shape)
print(prop.shape)
print(merged_df.shape)
merged_df.head()
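
A quick sanity check for a merge like the one above, sketched on made-up toy tables rather than the competition files. It assumes `parcelid` is unique in the properties table; `toy_train` and `toy_prop` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy stand-ins for the train and properties tables (values are made up)
toy_train = pd.DataFrame({"parcelid": [1, 2, 2, 3],
                          "logerror": [0.01, -0.02, 0.03, 0.00]})
toy_prop = pd.DataFrame({"parcelid": [1, 2, 4],
                         "bathroomcnt": [2.0, 1.0, 3.0]})

# validate="many_to_one" raises if parcelid is duplicated on the right side;
# indicator=True adds a _merge column saying where each row came from
toy_merged = toy_train.merge(toy_prop, how="left", on="parcelid",
                             validate="many_to_one", indicator=True)

# With a unique right key, a left merge keeps exactly the rows of the left table
assert len(toy_merged) == len(toy_train)

# Transactions with no matching property record
unmatched = toy_merged[toy_merged["_merge"] == "left_only"]
print(unmatched["parcelid"].tolist())
```

If the row count grows after the merge, the right-hand key is not unique and the same sale ends up counted more than once.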

In [None]:
merged_df.info()

Most of the features are numeric, except "has hot tub or spa", "property county land use code", "property zoning desc", "fireplace flag" and "tax delinquency flag".

Let's take a look at those features.

From the dictionary provided we have:
* __hashottuborspa__:  Does the home have a hot tub or spa;
* __propertycountylandusecode__:  County land use code, i.e. its zoning at the county level;
* __propertyzoningdesc__:  Description of the allowed land uses (zoning) for that property;
* __fireplaceflag__:  Is a fireplace present in this home;
* __taxdelinquencyflag__: Property taxes for this parcel are past due as of 2015.

In [None]:
cat_features = ["hashottuborspa", "propertycountylandusecode", 
                "propertyzoningdesc", "fireplaceflag", "taxdelinquencyflag"]

for col in cat_features:
    print("category", col, ":")
    print(merged_df[col].unique())

__propertycountylandusecode__ and __propertyzoningdesc__ seem to be very specific codes.

For the others, it seems that `nan` and `True`/`Y` could be replaced by a 0-1 encoding, with 0 for `nan` and 1 for `True` or `Y`. Let's save it for later.
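
A minimal sketch of that 0-1 encoding, on a made-up toy frame. It assumes the three flag columns only ever contain `nan`, `True` or `Y`, so "is not missing" is equivalent to "flag is set"; `toy` and `encoded` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame imitating the three flag columns (values are made up)
toy = pd.DataFrame({
    "hashottuborspa": [True, np.nan, True],
    "fireplaceflag": [np.nan, True, np.nan],
    "taxdelinquencyflag": ["Y", np.nan, np.nan],
})

flag_cols = ["hashottuborspa", "fireplaceflag", "taxdelinquencyflag"]

# notna() is True wherever a value is present, so nan -> 0 and True/"Y" -> 1
encoded = toy[flag_cols].notna().astype(int)
print(encoded)
```

If a column turned out to hold other values (for example an explicit `N`), this shortcut would wrongly map them to 1, so it is worth checking `unique()` first.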

Let's deal with dates. 

Let's create Quarter dummies to later check if they are correlated with the target value.

Also, let's check for intra-month seasonality in a simple way, separating the first 15 days of each month from the remaining days.


In [None]:
merged_df['Quarter'] = pd.PeriodIndex(merged_df['transactiondate'], freq='Q').strftime('Q%q')
merged_df = pd.concat([merged_df, pd.get_dummies(merged_df["Quarter"], prefix_sep='_')], axis=1)
merged_df['first15daysmonth'] = np.where(pd.to_datetime(merged_df['transactiondate']).dt.day <= 15, 1, 0)

merged_df.drop(["Quarter"], axis=1, inplace=True)

In [None]:
merged_df.tail()
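
One possible way to check the date features against the target is to compare the mean of `logerror` per group. The sketch below uses made-up dates and errors, not the competition data, and the names `toy`, `by_quarter` and `by_half` are hypothetical.

```python
import pandas as pd

# Toy transactions (dates and errors are made up)
toy = pd.DataFrame({
    "transactiondate": ["2016-01-05", "2016-01-20", "2016-04-10",
                        "2016-04-28", "2016-07-01"],
    "logerror": [0.02, 0.04, -0.01, 0.03, 0.00],
})

dates = pd.to_datetime(toy["transactiondate"])
toy["quarter"] = dates.dt.quarter
toy["first15daysmonth"] = (dates.dt.day <= 15).astype(int)

# Mean target and number of rows per quarter and per half of the month
by_quarter = toy.groupby("quarter")["logerror"].agg(["mean", "count"])
by_half = toy.groupby("first15daysmonth")["logerror"].agg(["mean", "count"])
print(by_quarter)
print(by_half)
```

Keeping the `count` column next to the mean helps avoid reading too much into groups with few transactions.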

#### Logerror exploration

In [None]:
merged_df["logerror"].plot(figsize=(16,9))

In [None]:
merged_df["logerror"].plot.density(figsize=(16,9))

In [None]:
merged_df.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2,
               c="logerror", cmap=plt.get_cmap("jet"), colorbar=True,
               figsize=(16,9))
# No plt.legend() here: the colorbar already explains the colors and there are no labeled artists

#### Correlations

Let's check for correlations between logerror and house features.

In [None]:
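# Restrict to numeric and boolean columns; recent pandas versions raise on text columns
corr_matrix = merged_df.select_dtypes(include=["number", "bool"]).corr()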

In [None]:
corr_matrix["logerror"].sort_values(ascending=False)

None of the features has a high correlation coefficient with logerror. The highest comes from basementsqft, at 0.25.

Let's create some new features and see if there is any change.

In [None]:
merged_df["bath_per_sqft"] = merged_df["bathroomcnt"] / merged_df["calculatedfinishedsquarefeet"]
merged_df["bath3-4_per_sqft"] = merged_df["threequarterbathnbr"] / merged_df["calculatedfinishedsquarefeet"]
merged_df["sqft_per_floor"] = merged_df["calculatedfinishedsquarefeet"] / merged_df["numberofstories"]
merged_df["basement_perc_totalarea"] = merged_df["basementsqft"] / merged_df["finishedsquarefeet15"]

In [None]:
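# Restrict to numeric and boolean columns; recent pandas versions raise on text columns
corr_matrix = merged_df.select_dtypes(include=["number", "bool"]).corr()
corr_matrix["logerror"].sort_values(ascending=False)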

There is no substantial change. It appears that if there is any relationship between the features and logerror, it might not be linear.
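
The default `corr()` computes Pearson correlation, which only measures linear association. One option for probing non-linear but monotone relationships is Spearman rank correlation. The sketch below uses synthetic data (an exponential relationship chosen purely for illustration), not the housing data.

```python
import numpy as np
import pandas as pd

# Synthetic example: the target is a strictly increasing but non-linear function of the feature
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=500)
toy = pd.DataFrame({"feature_monotone": x, "target": np.exp(x)})

# Pearson measures linear association; Spearman works on ranks, so it captures any monotone link
pearson = toy.corr(method="pearson").loc["feature_monotone", "target"]
spearman = toy.corr(method="spearman").loc["feature_monotone", "target"]
print("pearson:", pearson)
print("spearman:", spearman)
```

Spearman still misses relationships that are not monotone (a U-shape, for example), which is one reason to look at tree-based models as well.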

### Models

In [None]:
merged_df.head()

Since there is no strong linear relationship between logerror and any other feature, let's first try a regression tree to get an idea of how logerror relates to the features.

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Keep every non-text column after parcelid, logerror and transactiondate
X = merged_df.iloc[:, 3:].select_dtypes(exclude=["object"])
# The ratio features can be infinite when a denominator is 0; treat those as missing
X = X.replace([np.inf, -np.inf], np.nan).fillna(0)
y = merged_df["logerror"].fillna(0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
# Fix the seed so that the fitted tree is reproducible between runs
dt = DecisionTreeRegressor(random_state=0)
dt.fit(X_train, y_train)
dt_y_pred = dt.predict(X_test)

r2_score(y_true=y_test, y_pred=dt_y_pred)
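
An unconstrained decision tree can keep splitting until it memorizes the training set, so comparing train and test R² at different depths is a useful next check. The sketch below uses synthetic noisy data, not the housing data. The depth of 3 is an arbitrary assumption, and `X_toy`, `y_toy`, `scores` and `shallow` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: only the first of five features carries signal, plus heavy noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(2000, 5))
y_toy = np.where(X_toy[:, 0] > 0, 0.5, -0.5) + rng.normal(scale=1.0, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Compare an unconstrained tree (max_depth=None) with a shallow one
scores = {}
for depth in (None, 3):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    scores[depth] = (r2_score(y_tr, tree.predict(X_tr)),
                     r2_score(y_te, tree.predict(X_te)))
    print("max_depth =", depth, "train/test R2 =", scores[depth])

# Feature importances of the shallow tree hint at which inputs drive the splits
shallow = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)
print(shallow.feature_importances_)
```

A large gap between train and test R² suggests overfitting, and the importances give a first, rough idea of which features the tree relies on.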

.....

comments for improvement ar