This notebook explores the volatility of the data near the test-set date range and the relationships between house prices, inflation and interest rates, which are important drivers of people's purchasing decisions.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)

In [None]:
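# Keep only the columns needed for this analysis.
train_df = pd.read_csv("../input/train.csv")
train_df = train_df[['id','timestamp','price_doc', 'full_sq', 'num_room']]
# Rough outlier filter: drop very large properties and rows without a valid room count.
train_df = train_df[train_df['full_sq'] < 450]
train_df = train_df[train_df['num_room'] > 0]
# Macro-economic indicators: exchange rates, interest rates, wages and inflation.
macro_df = pd.read_csv("../input/macro.csv")
macro_df = macro_df[['timestamp','usdrub','eurrub','mortgage_rate', 'mortgage_value', 'deposits_rate', 'salary_growth', 'cpi', 'ppi', 'overdue_wages_per_cap']]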

In [None]:
def visualize_feature_over_time(df, feature):
    """Plot the monthly mean of `feature` over time."""
    # Work on a copy so the caller's DataFrame is not modified.
    df = df[['timestamp', feature]].copy()
    # Map every date to the first day of its month.
    df['mnth_yr'] = pd.to_datetime(df['timestamp']).dt.to_period('M').dt.to_timestamp()

    df_vis = df.groupby('mnth_yr')[feature].mean()
    df_vis = df_vis.reset_index()
    # sort_values returns a new frame, so the result has to be assigned.
    df_vis = df_vis.sort_values(by='mnth_yr')
    df_vis.plot(x='mnth_yr', y=feature)

    plt.show()

First we look at the house price over time. When loading the dataset, we made a rough attempt to filter out outliers that would distort the mean. Here is the plot:

In [None]:
visualize_feature_over_time(train_df, "price_doc")

Now we look for similar patterns in the macro data. What the test data doesn't really tell us is how the house price could be affected by the economy; only the macro data can do that. Features like full_sq, state, number of cafes and distance to roads do explain the price, but you would have to assume a static economy to isolate their effects. In this dataset the economy is highly volatile: November and December 2014 and the first months of 2015 show a steep rise in inflation and exchange rates, as the following plots show.

Note that the slightly increasing trend in house price doesn't follow the exchange rate all that well:

In [None]:
visualize_feature_over_time(macro_df, "usdrub")

In [None]:
visualize_feature_over_time(macro_df, "eurrub")

If you compare the general trend with the sharp increase, even the inflation indicators don't explain the house price well.

In [None]:
visualize_feature_over_time(macro_df, "cpi")

In [None]:
visualize_feature_over_time(macro_df, "ppi")
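To put a number on how well (or poorly) a macro feature tracks the house price, one option is to correlate the monthly means. The following is a minimal sketch, not part of the original analysis: the helper names `monthly_mean` and `price_macro_correlation` are made up for illustration, and a Pearson correlation on monthly means is only one of several reasonable choices.

```python
import pandas as pd


def monthly_mean(df, feature):
    """Return the mean of `feature` per calendar month, indexed by month start."""
    dates = pd.to_datetime(df["timestamp"])
    months = dates.dt.to_period("M").dt.to_timestamp()
    return df[feature].groupby(months).mean()


def price_macro_correlation(train_df, macro_df, macro_features):
    """Pearson correlation between monthly mean price and each macro feature."""
    price = monthly_mean(train_df, "price_doc")
    corr = {}
    for feature in macro_features:
        macro = monthly_mean(macro_df, feature)
        # Keep only the months that are present in both series.
        joined = pd.concat([price, macro], axis=1, join="inner").dropna()
        corr[feature] = joined["price_doc"].corr(joined[feature])
    return pd.Series(corr)
```

With the frames loaded above, `price_macro_correlation(train_df, macro_df, ["usdrub", "eurrub", "cpi", "ppi"])` would give one value per feature. A high correlation between two trending series does not by itself mean that one explains the other.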

The reaction of the house price to the economy is a complex relationship involving several important features. Inflation is one and interest rates are another. Here we look at the trends of the mortgage and deposit rates. The rate follows the inflation spike, then drops sharply back towards normal levels. The normal rate is around 12.5 percent and the extreme is near 14.5 percent. Over the entire period, the rates fluctuate much more than the house price, with peaks in between that are tied to the inflation rates.

In [None]:
visualize_feature_over_time(macro_df, "mortgage_rate")

In [None]:
visualize_feature_over_time(macro_df, "deposits_rate")

Salary growth should keep pace with inflation, otherwise people lose purchasing power. The recorded growth roughly follows the inflation indicators ppi and cpi, but near the steep extreme around 2015 the growth falls to zero.

In [None]:
visualize_feature_over_time(macro_df, "salary_growth")
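The purchasing-power argument can be made concrete by comparing salary growth with inflation. The sketch below rests on assumptions that are not verified here: that cpi is an index level sampled monthly (so a year-on-year ratio yields an inflation rate) and that salary growth is expressed as a fraction. The helper names `yoy_change` and `real_growth` are hypothetical.

```python
import pandas as pd


def yoy_change(monthly_series, periods=12):
    """Relative change versus the value `periods` months earlier."""
    return monthly_series / monthly_series.shift(periods) - 1


def real_growth(nominal_growth, inflation):
    """Growth after inflation; both inputs are fractions (0.10 means 10%)."""
    return (1 + nominal_growth) / (1 + inflation) - 1


# Toy numbers, not taken from macro.csv: 5% salary growth against 15% inflation.
print(round(real_growth(0.05, 0.15), 4))
```

In the toy case, a 5% raise during 15% inflation corresponds to a loss of roughly 8.7% in purchasing power. If salary growth drops to zero while prices keep rising, the loss equals the full effect of inflation.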

Missing values in the macro-economic data:

In [None]:
visualize_feature_over_time(macro_df, "overdue_wages_per_cap")
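A plot only hints at gaps, so it can help to count them directly. The following is a minimal sketch with hypothetical helper names (`missing_share`, `missing_share_by_year`); it assumes missing entries are read by pandas as NaN and that the frame has a `timestamp` column as above.

```python
import pandas as pd


def missing_share(df):
    """Fraction of missing values per column, largest first."""
    return df.isnull().mean().sort_values(ascending=False)


def missing_share_by_year(df, feature):
    """Fraction of missing values of `feature` for each calendar year."""
    years = pd.to_datetime(df["timestamp"]).dt.year
    return df[feature].isnull().groupby(years).mean()
```

For example, `missing_share(macro_df)` ranks the selected macro columns by how incomplete they are, and `missing_share_by_year(macro_df, "overdue_wages_per_cap")` shows whether the gaps are concentrated in particular years.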

There are many factors influencing house prices, and the data shows a highly volatile period for which predictions have to be made. When inflation goes up, people see their income disappear straight into a black hole of purchases and consumption. Investing in fixed assets is a strategy to keep more control over your income and bridge periods of high volatility in the economy. So high inflation does not necessarily mean that people avoid purchasing houses; it can cause people to purchase more. A high interest rate, on the other hand, makes borrowing money very expensive, which reduces people's willingness to purchase houses.

Then there is the delay over which people make decisions. You don't buy a house in one or two months' time: first you need to select a property while following the news on how inflation and interest rates are developing. Then you need to worry whether the contract you're about to close, with its associated rates, allows you to sustain a living, or whether the economy goes back to normal and you're stuck with an excessively high interest rate that you locked in during a really bad period. All these factors make it extremely difficult to predict the house price in a volatile period, especially since the training set ends with this huge spike, which strongly affects the actions people take and which could push prices either way.
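One way to probe the decision delay described above is to correlate the monthly house price with a macro series shifted back by a number of months. This is a minimal sketch, not a claim about the actual lag in this dataset: the helper name `lagged_correlation` is hypothetical and both inputs are assumed to be Series indexed by consecutive months.

```python
import pandas as pd


def lagged_correlation(price, macro, max_lag=6):
    """Correlate monthly price with a macro series lagged by 0..max_lag months.

    Both inputs are assumed to be Series indexed by consecutive months.
    """
    result = {}
    for lag in range(max_lag + 1):
        # shift(lag) pairs the price in month t with the macro value in month t - lag.
        result[lag] = price.corr(macro.shift(lag))
    return pd.Series(result, name="correlation")
```

If the correlation peaks at a lag of a few months rather than at lag 0, that would be consistent with buyers reacting to the economy with a delay. With only a few dozen monthly points, such a peak should be read with caution.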

When more people decide to purchase property, the price rises beyond the inflation rate, because Moscow has land scarcity and people are looking for the best investment they can make. The spike in inflation and interest rates could instead cause people to postpone purchases. House prices would then still trend upwards because of inflation, but might not fully follow the inflation trend because demand falls off.

Others have already noticed that RMSE has so far been a better error metric to train on than RMSLE, which suggests that training on RMSLE undervalues the house prices in the test set and that the algorithm doesn't capture the trend very well. We are looking at an anomalous situation, however, and it's unclear how house prices truly react to it. The goal should also be to find good explanatory variables or models that truly predict well rather than simply training on RMSE, because that doesn't yield a general model. Such a model could perform well while house prices go up, but be awful when house prices go down or stabilize in another period.
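For reference, the two metrics can be written out as below. This is a minimal sketch using the standard definitions, with `log1p` for the logarithmic variant; the toy prices are made up and are not taken from the dataset.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error in the original price units."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


def rmsle(y_true, y_pred):
    """Root mean squared error of log(1 + price), i.e. an error on relative differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


# Toy example: one house worth 5 million, predicted 1 million too low and 1 million too high.
true_price = [5_000_000]
print(rmse(true_price, [4_000_000]), rmse(true_price, [6_000_000]))
print(rmsle(true_price, [4_000_000]), rmsle(true_price, [6_000_000]))
```

RMSE treats both misses identically, while RMSLE penalizes the under-prediction more than the over-prediction of the same absolute size, because it measures relative rather than absolute differences. The two metrics therefore reward different kinds of models when prices are trending.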

**Vote up if these insights are useful to you!**