Update 2017-06-20:  The new version applies a threshold to frequency*probability and simply rounds the original prediction when the recoded price falls below that threshold.

This kernel attempts to implement a probabilistic version of [Jiwon's "Small Improvements"][1].  Jiwon's idea is that predicted prices for the test set should be selected from the set of actual prices that appeared in the training set.  For example, if your model predicts a price of 9,999,973 rubles, the actual price was more likely 10,000,000 rubles, and you would realize this if you looked at the training set and saw that many units sold for 10,000,000 rubles but none sold for 9,999,973 rubles.  My analysis also shows that this is [particularly an issue][2] with investment properties, which have an overwhelming tendency to sell at round-numbered prices.  This version of the kernel adjusts only investment prices; prices for owner-occupied units will need to be adjusted separately, if at all.

How do we know *which* actual training set price corresponds to a given model prediction on the test set?  One possibility is to choose the closest value, or the closest among values above a given frequency threshold.  Here I take a different approach.  I first posit a log-normal probability distribution for the ratio of actual to model-predicted price, that is, a normal distribution for the difference of their logs (with a standard error parameter, `prediction_stderr`, that is currently user-specified, but eventually presumably arrived at by cross-validation).  I then posit that the distribution of prices is equal to the frequency distribution of prices on the training set (with adjustments for the upward trend over time).  For each prediction, I multiply the two implied probability densities and choose the modal price from the product of the two.  (Actually I use adjusted frequencies, not a probability density per se, for the overall price distribution, because the scaling of the density is irrelevant for finding the mode.)


  [1]: https://www.kaggle.com/rezimitpo/small-improvements
  [2]: https://www.kaggle.com/aharless/an-interesting-fact-about-investment-properties
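
As a rough illustration of the selection rule described above, here is a minimal sketch for a single prediction. The toy counts and the helper name `recode_price` are hypothetical and not part of the kernel, and the time-trend adjustment is left out. Each candidate training price is scored by its frequency times a standard normal density of the scaled log difference, and the highest-scoring price wins.

```python
import math

# Hypothetical toy data: training prices and how many times each occurred.
train_counts = {9_900_000: 2, 10_000_000: 50, 10_100_000: 3}

def recode_price(pred, counts, sigma=0.03):
    """Return (price, score) maximising count * normal density of the log difference."""
    def score(price):
        z = (math.log(price) - math.log(pred)) / sigma
        return counts[price] * math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    best = max(counts, key=score)
    return best, score(best)

best_price, best_score = recode_price(9_999_973.0, train_counts)
print(best_price, best_score)
```

With these toy numbers the prediction of 9,999,973 is recoded to 10,000,000, because that price is both very close in log terms and far more frequent than its neighbours.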

In [None]:
# Parameters
prediction_stderr = 0.03  #  assumed standard error of predictions, in log-price units
                          #  (smaller values make output closer to input)
train_test_logmean_diff = 0.1  # assumed shift used to adjust frequencies for time trend
probthresh = 30  # minimum probability*frequency to use new price instead of just rounding
rounder = 2  # number of places left of decimal point to zero
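
To make the `rounder` parameter concrete, the following small sketch (with made-up prices) shows how `np.round` with a negative number of decimals zeroes the digits to the left of the decimal point. This is the fallback applied later to predictions that fall below the threshold.

```python
import numpy as np

rounder = 2  # same meaning as the parameter above

# Hypothetical predictions; a negative `decimals` rounds to tens, hundreds, ...
rounded = np.round(np.array([9_999_973.0, 5_123_456.0]), -rounder)
print(rounded)
```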

Load the required libraries and data. 

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn import model_selection, preprocessing
import xgboost as xgb
import datetime
from scipy.stats import norm

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')
macro = pd.read_csv('../input/macro.csv')
id_test = test.id

Run a quick naive XGB to generate some predictions

In [None]:
y_train = train["price_doc"]
x_train = train.drop(["id", "timestamp", "price_doc"], axis=1)
x_test = test.drop(["id", "timestamp"], axis=1)

# Fit one encoder per categorical column on train and test together, so that a
# given category gets the same integer code in both sets.
for c in x_train.columns:
    if x_train[c].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(x_train[c].values) + list(x_test[c].values))
        x_train[c] = lbl.transform(list(x_train[c].values))
        x_test[c] = lbl.transform(list(x_test[c].values))
        
xgb_params = {
    'eta': 0.05,
    'max_depth': 5,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'objective': 'reg:linear',
    'eval_metric': 'rmse',
    'silent': 1
}

dtrain = xgb.DMatrix(x_train, y_train)
dtest = xgb.DMatrix(x_test)

In [None]:
num_boost_rounds = 380
model = xgb.train(xgb_params, dtrain, num_boost_round=num_boost_rounds)

y_predict = model.predict(dtest)
output = pd.DataFrame({'id': id_test, 'price_doc': y_predict})
output.head()

Save predictions before small improvements

In [None]:
output.to_csv('before.csv', index=False)
preds = output

Select investment sales from training set and generate frequency distribution

In [None]:
invest = train[train.product_type=="Investment"]
freqs = invest.price_doc.value_counts().sort_index()
print(freqs.head(20))
freqs.sample(10)
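
For readers unfamiliar with the idiom, here is a tiny sketch on a hypothetical data frame (`toy` and its values are made up) of how filtering by `product_type` and then calling `value_counts().sort_index()` yields a frequency table ordered by price.

```python
import pandas as pd

# Hypothetical miniature of the training set
toy = pd.DataFrame({
    "product_type": ["Investment", "OwnerOccupier", "Investment", "Investment"],
    "price_doc": [1_000_000, 2_000_000, 1_000_000, 990_000],
})

# Count how often each investment price occurs, ordered by price
toy_freqs = toy[toy.product_type == "Investment"].price_doc.value_counts().sort_index()
print(toy_freqs)
```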

Select investment sales from test set predictions

In [None]:
test_invest_ids = test[test.product_type=="Investment"]["id"]
invest_preds = pd.DataFrame(test_invest_ids).merge(preds, on="id")
invest_preds.head()

Express X-axis of training set frequency distribution as logarithms, and save standard deviation to help adjust frequencies for time trend.

In [None]:
lnp = np.log(invest.price_doc)
stderr = lnp.std()
lfreqs = lnp.value_counts().sort_index()
lfreqs.head()

Adjust frequencies for time trend

In [None]:
lnp_diff = train_test_logmean_diff
lnp_mean = lnp.mean()
lnp_newmean = lnp_mean + lnp_diff

In [None]:
def norm_diff(value):
    return norm.pdf((value-lnp_diff)/stderr) / norm.pdf(value/stderr)
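
The function above reweights a frequency by the ratio of two normal densities whose means differ by the assumed shift. As a sketch with hypothetical values (`shift`, `sd`, and the helper names are illustrative, not part of the kernel), the ratio reduces to an exponential that is increasing in the log price, so higher prices are weighted up relative to lower ones.

```python
import numpy as np

def normal_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

def density_ratio(value, shift, sd):
    """Density with mean `shift` divided by density with mean 0, both with std `sd`."""
    return normal_pdf((value - shift) / sd) / normal_pdf(value / sd)

shift, sd = 0.1, 0.5                  # hypothetical shift and log-price std
values = np.array([-0.5, 0.0, 0.5])   # centred log prices
weights = density_ratio(values, shift, sd)

# Algebraically the ratio equals exp((2*v*shift - shift**2) / (2*sd**2))
closed_form = np.exp((2.0 * values * shift - shift ** 2) / (2.0 * sd ** 2))
print(weights, closed_form)
```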

In [None]:
# norm.pdf is vectorized, so norm_diff can be applied to the whole index at once
newfreqs = lfreqs * norm_diff(lfreqs.index.values - lnp_newmean)

print( "What the middle of the adjusted and unadjusted freqs look like:")
print( lfreqs.values[880:900] )
print( newfreqs.values[880:900] )

print( "\nHeads")
print( lfreqs.head() )
print( newfreqs.head() )

print( "\nTails")
print( lfreqs.tail() )
print( newfreqs.tail() )

print( "\nSums")
print( lfreqs.sum() )
print( newfreqs.sum() )

print( "\nFirst prices that have nonzero frequencies:")
print( np.exp(newfreqs.index.values[0:20]) )

newfreqs.shape

In [None]:
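# From here on, stderr is the assumed prediction error rather than the standard
# deviation of log prices used by norm_diff above, so the frequency adjustment
# should not be re-run after this cell.
stderr = prediction_stderr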

Logs of model-predicted prices

In [None]:
lnpred = np.log(invest_preds.price_doc)
lnpred.head()

`lnpred` has one entry for each test case (m=4998).  `newfreqs.index.values` has one entry for each nonzero-frequency price (n=1750).  For each test case we are going to create a corresponding probability distribution, based on the assumed distribution of the difference between actual and predicted log prices.  We will evaluate each distribution at the nonzero-frequency prices, so the result (after transposing) will be a 4998 x 1750 matrix, with one row, and hence one distribution, per test case.
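
The shapes involved can be checked on a small hypothetical example (three candidate prices, two predictions; all numbers are made up). Broadcasting a column of log candidate prices against a row of log predictions gives an n x m matrix, which is then transposed and weighted by the frequencies.

```python
import numpy as np

log_grid = np.log(np.array([1.0e6, 2.0e6, 3.0e6]))  # n = 3 candidate prices
log_pred = np.log(np.array([1.1e6, 2.9e6]))          # m = 2 predictions
grid_freqs = np.array([5.0, 1.0, 4.0])               # frequency of each candidate
sd = 0.03

# (n, 1) minus (1, m) broadcasts to an (n, m) matrix of scaled log differences
z = (log_grid[:, np.newaxis] - log_pred[np.newaxis, :]) / sd
dens = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)

# Transpose to (m, n), so each row is one prediction, then weight by frequency
weighted = dens.T * grid_freqs
print(z.shape, weighted.shape, weighted.argmax(axis=1))
```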

In [None]:
print(lnpred.shape)
print(newfreqs.index.values.shape)

Create assumed probability distributions.

In [None]:
# Scaled log differences: rows are candidate prices, columns are test cases
mat = (np.array(newfreqs.index.values)[:, np.newaxis] - np.array(lnpred)[np.newaxis, :]) / stderr
modelprobs = norm.pdf(mat)

Multiply by frequency distribution.

In [None]:
freqprobs = pd.DataFrame( np.multiply( np.transpose(modelprobs), newfreqs.values ) )
freqprobs.index = invest_preds.price_doc.values
# freqs and newfreqs are both sorted by price, so the price labels line up with the columns
freqprobs.columns = freqs.index.values.tolist()
freqprobs.head()

Find mode for each case.

In [None]:
prices = freqprobs.idxmax(axis=1)


Apply the probability*frequency threshold and reset below-threshold points to their old values, rounded.  (The point of this is that we don't want to exclude entirely the possibility of test prices that were not represented among the training prices.  So points where we have low confidence are set back to the old predictions and just rounded.)

In [None]:
priceprobs = freqprobs.max(axis=1)
mask = priceprobs<probthresh
# The index of prices holds the original predictions, which are rounded and used instead
prices[mask] = np.round(prices[mask].index, -rounder)
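
The thresholding step can be sketched on a hypothetical two-row score table (`scores` and its numbers are made up). Rows are indexed by the original prediction and columns by candidate training price. A row whose best score is below the threshold falls back to its own rounded prediction.

```python
import numpy as np
import pandas as pd

probthresh, rounder = 30, 2  # same names as the parameters at the top

# Hypothetical frequency*density scores: index = predictions, columns = candidates
scores = pd.DataFrame(
    [[45.0, 3.0], [2.0, 1.5]],
    index=[9_999_973.0, 7_654_321.0],
    columns=[10_000_000, 12_000_000],
)

best = scores.idxmax(axis=1).astype(float)  # modal candidate for each prediction
low = scores.max(axis=1) < probthresh       # rows with too little support
best[low] = np.round(best[low].index.values, -rounder)
print(best)
```

Here the first prediction is recoded to 10,000,000, while the second has no well-supported candidate and is only rounded to the nearest hundred.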

Data frames with new predictions

In [None]:
pr = invest_preds.price_doc
pd.DataFrame( {"id":test_invest_ids.values, "original":pr, "revised":prices.values}).head()

In [None]:
newpricedf = pd.DataFrame( {"id":test_invest_ids.values, "price_doc":prices.values} )
newpricedf.head()

Merge these new predictions (for just investment properties) back into the full prediction set.

In [None]:
preds.head()

In [None]:
newpreds = preds.merge(newpricedf, on="id", how="left", suffixes=("_old",""))
# Keep the original prediction wherever no revised price was merged in
newpreds["price_doc"] = newpreds.price_doc.fillna(newpreds.price_doc_old)
newpreds = newpreds.drop("price_doc_old", axis=1)
newpreds.head()
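
The merge-back step can be tried on a hypothetical three-row prediction set in which only one row has a revised price (all values are made up). The left merge keeps every original row, and rows without a revised price retain their old value.

```python
import pandas as pd

# Hypothetical full prediction set and a revised price for id 2 only
all_preds = pd.DataFrame({"id": [1, 2, 3],
                          "price_doc": [5_100_000.0, 9_999_973.0, 7_200_000.0]})
revised = pd.DataFrame({"id": [2], "price_doc": [10_000_000.0]})

# The left frame's column becomes price_doc_old; the right frame's stays price_doc
merged = all_preds.merge(revised, on="id", how="left", suffixes=("_old", ""))
merged["price_doc"] = merged["price_doc"].fillna(merged["price_doc_old"])
merged = merged.drop(columns="price_doc_old")
print(merged)
```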

Save.

In [None]:
newpreds.to_csv('after.csv', index=False)