# A Simple Macro Model To Predict Monthly House Prices

## Introduction

This model uses linear distributed lags (Almon lags of degree 1) of `balance_trade` and `mortgage_rate` to predict monthly real log house prices (the log of the monthly median `price_doc` divided by `cpi`).

Why Almon lags?  Because I like them.  They've never had a good theoretical justification, but they're easy for me to understand, and they're a convenient way to avoid overfitting by reducing the number of parameters without obfuscating the content of the model.  (An [Almon lag structure](http://davegiles.blogspot.com/2017/01/explaining-almon-distributed-lag-model.html) effectively constrains the regression coefficients on several lagged values of the same variable to lie on a polynomial curve -- in this case a 1st-degree polynomial, otherwise known as a straight line.  So I take 6 coefficients and force them to line up in a way that involves only 2 parameters.  It's feature engineering, 1960's style.)
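
To make the "6 coefficients, 2 parameters" idea concrete, here is a minimal sketch. The values of `b0` and `b1` are made up for illustration and are not estimates from this notebook; the only point is that every lag coefficient is pinned down by the same two numbers.

```python
import numpy as np

# Hypothetical values for the two free parameters (NOT estimates from this notebook).
b0, b1 = 0.5, -0.08
lags = np.arange(6)            # lags 0..5, i.e. months 0 to -5

# Degree-1 Almon constraint: the coefficient on lag i is b0 + b1*i.
coefs = b0 + b1 * lags

# Successive differences are all equal to b1, which is what
# "lie on a straight line" means for the six coefficients.
steps = np.diff(coefs)
print(coefs)
print(np.allclose(steps, b1))
```

Changing `b0` shifts all six coefficients together, and changing `b1` tilts them; there is no way to move one lag coefficient on its own.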

Why this particular lag structure (linear from month 0 to month -5)?  Preliminary analysis indicated that this was a reasonable number of lags to include (given data limitations).  And then the structure had to be linear, or it would have too many parameters, which defeats the purpose of Almon lags.  Also, the choice to use the same number of lags for both variables acts like a constraint that reduces overfitting, since "number of lags" is really an extra parameter hiding in the background.  That's also partly why I chose a total of 6 months -- half a year -- rather than something like 5 or 7, which would be a magic number.

Why mortgage rate?  Because it's the obvious macro variable that would affect housing prices with a lag.

Why trade balance?  Mostly because it fits really, really well.  When I fit it with several lags with no constraint on the lag structure, all the lags got coefficients with the same sign.  It's hard to believe that would happen by chance.
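
For readers who want to see what an unconstrained lag fit looks like, here is a rough sketch on synthetic data. It is not the preliminary analysis itself: the series, the noise level, and the "true" lag weights are all invented, and I use plain `numpy` least squares rather than `statsmodels` to keep it short.

```python
import numpy as np

rng = np.random.default_rng(1)
n, maxlag = 60, 5
x = rng.normal(size=n)                    # made-up regressor series
true_w = np.full(maxlag + 1, 0.2)         # made-up lag weights, all positive

# Column i holds x lagged by i periods; row r corresponds to t = maxlag + r.
lagged = np.column_stack([x[maxlag - i : n - i] for i in range(maxlag + 1)])
y = lagged @ true_w + 0.1 * rng.normal(size=n - maxlag)

# Unconstrained fit: an intercept plus one free coefficient per lag.
design = np.column_stack([np.ones(n - maxlag), lagged])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
lag_coefs = beta[1:]

# Inspect the signs of the six unconstrained lag coefficients.
print(np.sign(lag_coefs))
```

With real data the interesting question is whether the signs agree even though nothing in the regression forces them to.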

On a theoretical level, trade balance is relevant to housing for Russia in particular because it's an indicator of how much excess savings Russia is generating.  (This doesn't work, for example, for the US, which is a net borrower.)  A large trade surplus corresponds to a lot of excess savings going abroad.  In that situation, there are probably also a lot of savings going into domestic housing investment.

As I understand it, the amount of savings generated in Russia varies a great deal from one year to the next, mostly because of energy prices.  When energy prices are high, producers save a lot of their income.  When prices are low, producers don't have much to save.  Presumably, when energy prices are high, producers don't keep all their saved income in foreign assets but bring some of it back to Russia to buy housing.

## Get the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

In [None]:
macro = pd.read_csv('../input/macro.csv')
train = pd.read_csv('../input/train.csv')

In [None]:
macro["timestamp"] = pd.to_datetime(macro["timestamp"])
macro["year"]  = macro["timestamp"].dt.year
macro["month"] = macro["timestamp"].dt.month
macro["yearmonth"] = 100*macro.year + macro.month
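# numeric_only=True skips the timestamp and any text columns, which recent pandas refuses to take a median of
macmeds = macro.groupby("yearmonth").median(numeric_only=True)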
macmeds.head()

In [None]:
train["timestamp"] = pd.to_datetime(train["timestamp"])
train["year"]  = train["timestamp"].dt.year
train["month"] = train["timestamp"].dt.month
train["yearmonth"] = 100*train.year + train.month
prices = train[["yearmonth","price_doc"]]
p = prices.groupby("yearmonth").median()
p.head()

In [None]:
df = macmeds.join(p)
# Take a look at some of the data, just to make sure it's there:
df.loc[ [201109,201212,201403,201506],
             ["cpi","balance_trade","mortgage_rate","year","month","price_doc"]]

## Functions to deal with Almon Lags

In [None]:
#  Adapted from code at http://adorio-research.org/wordpress/?p=7595
#  Original post was dated May 31st, 2010
#    but was unreachable last time I tried

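def almonZmatrix(X, maxlag, maxdeg):
    """
    Creates the Z matrix corresponding to vector X.

    Row t-maxlag, column j holds the sum over i = 0..maxlag of i**j * X[t-i]
    (with 0**0 taken as 1), so Z has len(X)-maxlag rows and maxdeg+1 columns.
    """
    n = len(X)
    # A plain ndarray replaces the deprecated numpy.matlib matrix.
    Z = np.zeros((n - maxlag, maxdeg + 1))
    for t in range(maxlag, n):
        # Degree-0 column: plain sum of X[t], X[t-1], ..., X[t-maxlag].
        Z[t-maxlag, 0] = sum([X[t-lag] for lag in range(maxlag+1)])
        for j in range(1, maxdeg+1):
            s = 0.0
            for i in range(1, maxlag+1):
                s += (i)**j * X[t-i]
            Z[t-maxlag, j] = s
    return Z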

def almonXcof(zcof, maxlag):
    """
    Transforms the 'b' coefficients in Z to 'a' coefficients in X.
    """
    maxdeg  = len(zcof)-1
    xcof    = [zcof[0]] * (maxlag+1)
    for i in range(1, maxlag+1):
        s = 0.0
        k = i                      # k runs through i**1, i**2, ...
        for j in range(1, maxdeg+1):
            s += (k * zcof[j])
            k *= i
        xcof[i] += s
    return xcof
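
As a sanity check on what these two functions are for, the sketch below shows that regressing on the Z columns is the same thing as regressing on the individual lags with coefficients forced onto a polynomial. To keep it self-contained it uses compact stand-ins, `z_matrix` and `x_coefs` (my names, not part of the notebook), which I believe compute the same quantities as `almonZmatrix` and `almonXcof`. The series and coefficients are made up.

```python
import numpy as np

def z_matrix(x, maxlag, maxdeg):
    # Stand-in for almonZmatrix: column j is sum_{i=0..maxlag} i**j * x[t-i] (0**0 == 1).
    n = len(x)
    powers = np.arange(maxlag + 1)[:, None] ** np.arange(maxdeg + 1)[None, :]
    lagged = np.array([[x[t - i] for i in range(maxlag + 1)] for t in range(maxlag, n)])
    return lagged @ powers, lagged

def x_coefs(zcof, maxlag):
    # Stand-in for almonXcof: the coefficient on lag i is sum_j zcof[j] * i**j.
    powers = np.arange(maxlag + 1)[:, None] ** np.arange(len(zcof))[None, :]
    return powers @ np.asarray(zcof)

rng = np.random.default_rng(0)
x = rng.normal(size=20)            # made-up series
zcof = np.array([0.3, -0.05])      # made-up (b0, b1) coefficients on the Z columns

Z, lagged = z_matrix(x, maxlag=5, maxdeg=1)
via_z = Z @ zcof                               # prediction from the 2 Z columns
via_x = lagged @ x_coefs(zcof, maxlag=5)       # prediction from the 6 individual lags
print(np.allclose(via_z, via_x))
```

The two predictions agree because the Z columns are just the lagged values pre-multiplied by the powers of the lag index.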

## Prepare data for macro model

In [None]:
y = df.price_doc.div(df.cpi).apply(np.log).loc[201108:201506]
print( y.head() )
y.shape

In [None]:
nobs = 47  # August 2011 through June 2015, months with price_doc data
tblags = 5    # Number of lags used on PDL for Trade Balance
mrlags = 5    # Number of lags used on PDL for Mortgage Rate
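# The regressors start 5 months before y (March 2011) so the first row has a full set of lags.
# .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0.
ztb = almonZmatrix(df.balance_trade.loc[201103:201506].to_numpy(), tblags, 1)
zmr = almonZmatrix(df.mortgage_rate.loc[201103:201506].to_numpy(), mrlags, 1)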
columns = ['tb0', 'tb1', 'mr0', 'mr1']
z = pd.DataFrame( np.concatenate( (ztb, zmr), axis=1), y.index.values, columns )
X = sm.add_constant( z )
X.shape
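
The date windows above have to line up exactly: the regressor slice must start `maxlag` months before the first target month, or the rows of `z` will not match the index of `y`. A small sketch of that arithmetic, assuming no months are missing from the monthly data (I have not checked that here, and `regressor_months` / `target_months` are names introduced just for this illustration):

```python
import pandas as pd

maxlag = 5

# Months supplied to the lag builder: March 2011 through June 2015.
regressor_months = pd.period_range("2011-03", "2015-06", freq="M")
# Months with a target value: August 2011 through June 2015.
target_months = pd.period_range("2011-08", "2015-06", freq="M")

# The lag builder drops the first `maxlag` months, which only serve as history.
rows_out = len(regressor_months) - maxlag
print(len(regressor_months), rows_out, len(target_months))

# The first row with a full set of lags should be the first target month.
print(regressor_months[maxlag], target_months[0])
```

The same check applies to the prediction step, where the regressor slice again starts 5 months before the first predicted month.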

## Fit

In [None]:
eq = sm.OLS(y, X)
fit = eq.fit()
fit.summary()

Here's what the fit looks like in-sample.  Pretty good for fitting 47 data points with only 5 parameters.

In [None]:
%matplotlib inline
plt.plot(y.values)
plt.plot(pd.Series(fit.predict(X)).values)

## Predict

In [None]:
test_cpi = df.cpi.loc[201507:201605]
test_index = test_cpi.index
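# Start 5 months before the first test month (February 2015) to supply the lags.
# .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0.
ztb_test = almonZmatrix(df.balance_trade.loc[201502:201605].to_numpy(), tblags, 1)
zmr_test = almonZmatrix(df.mortgage_rate.loc[201502:201605].to_numpy(), mrlags, 1)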
z_test = pd.DataFrame( np.concatenate( (ztb_test, zmr_test), axis=1), test_index, columns )
X_test = sm.add_constant( z_test )
pred_lnrp = fit.predict( X_test )
pred_p = np.exp(pred_lnrp) * test_cpi
pred_p.to_csv("monthly_macro_predicted.csv")
pred_p

In [None]:
print( "Here's the average price predicted for the test period by the macro model: \n")
print( np.exp( pred_lnrp.mean() + np.log(test_cpi).mean() ) )
print( "\nDivide (logarithmic) average baseline micro model price prediction by this")
print( "   and use the result to justify multiplier for training prices in the micro model.")
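
For what it's worth, here is a sketch of the division described in that last message. Every number in it is a placeholder: `macro_avg_price` stands in for the value printed above and `micro_pred` for the baseline micro model's test-set predictions, neither of which is computed here. How the resulting ratio gets turned into a multiplier for the training prices is a judgment call that this sketch does not try to settle.

```python
import numpy as np

# Hypothetical inputs, for illustration only.
macro_avg_price = 6.0e6                         # stand-in for the macro model's average
micro_pred = np.array([5.5e6, 7.2e6, 6.8e6])    # stand-in micro-model predictions

# "Logarithmic average" = exponential of the mean log price (a geometric mean).
micro_log_avg = np.exp(np.log(micro_pred).mean())

# Ratio of the micro model's average to the macro model's average.
ratio = micro_log_avg / macro_avg_price
print(ratio)
```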