## Why I gave up on heteroskedasticity (for now)

The intention when I started this notebook was to settle on [a set of sample weights](https://www.kaggle.com/aharless/sberbank-data-wls-with-heteroskedasticity-weights) to account for the predictably greater reliability of some data points in the Sberbank housing data.  I even drafted a pompous write-up about the importance of heteroskedasticity weighting.  The result of my out-of-sample testing, however, is that I could not find a set of weights that was an improvement over the weighting scheme I was already using (weighting by time alone, for a different reason).  Despite my disappointment, I'm uploading this script to document my analysis and serve as the fourth checkpoint in my modeling process.  And also to make it part of my code base in case I decide to copy some of it to do a similar kind of analysis in the future.

Another result of this analysis is that I decided to go back to my aggressive time weighting parameter -- actually an even more extreme version, which gives almost zero weight to the beginning of the data set.  This seems intuitively wrong to me, but it performs better both on out-of-sample testing within the training set and on the public leaderboard.  I sometimes do allow my intuition to overrule the data.  (After all, the ostensible conclusions of "the data" depend on the intuitive belief that the assumptions of the data analysis are correct, so an alternative intuition could be stronger.)  But in this case the data are strong enough to alter my priors.

I decided to leave in the part where I got impressed by the robustness of the heteroskedasticity model.  Maybe a cautionary tale.  In the end it makes bad pudding.

In [None]:
# Parameters
use_pipe = True
weight_base = "2010-07-01"  # Used for the initial analysis, but later I try other values

### Read and munge the data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv('../input/train.csv')
macro = pd.read_csv('../input/macro.csv')
test = pd.read_csv('../input/test.csv')

In [None]:
dfa = pd.concat([train, test])  # "dfa" stands for "data frame all"
# Eliminate spaces and special characters in area names
dfa.loc[:,"sub_area"] = dfa.sub_area.str.replace(" ","").str.replace("\'","").str.replace("-","")
dfa = dfa.merge(macro, on='timestamp', suffixes=['','_macro'])

Note:  As a general principle, when there are values that seem invalid but are "special", such as zeros and ones in numeric fields that shouldn't be zero or one, or values that seem out of the reasonable range but aren't entirely impossible, my approach is to treat them as special cases rather than either missing or valid.  Dummy variables will take care of those special cases in OLS, though the situation may get complicated when fancy methods are involved.  One possibility is to recode them using OLS coefficients to replace the ugly values (perhaps coefficients from a baseline OLS with a sparse set of obvious features).

In [None]:
dfa["fullzero"] = (dfa.full_sq==0)
dfa["fulltiny"] = (dfa.full_sq<4)
dfa["fullhuge"] = (dfa.full_sq>2000)
dfa["lnfull"] = np.log(dfa.full_sq+1)

dfa["nolife"] = dfa.life_sq.isnull()
dfa.life_sq = dfa.life_sq.fillna(dfa.life_sq.median())
dfa["lifezero"] = (dfa.life_sq==0)
dfa["lifetiny"] = (dfa.life_sq<4)
dfa["lifehuge"] = (dfa.life_sq>2000)
dfa["lnlife"] = np.log( dfa.life_sq + 1 )

dfa["nofloor"] = dfa.floor.isnull()
dfa.floor = dfa.floor.fillna(dfa.floor.median())
dfa["floor1"] = (dfa.floor==1)
dfa["floor0"] = (dfa.floor==0)
dfa["floorhuge"] = (dfa.floor>50)
dfa["lnfloor"] = np.log(dfa.floor+1)

dfa["nomax"] = dfa.max_floor.isnull()
dfa.max_floor = dfa.max_floor.fillna(dfa.max_floor.median())
dfa["max1"] = (dfa.max_floor==1)
dfa["max0"] = (dfa.max_floor==0)
dfa["maxhuge"] = (dfa.max_floor>80)
dfa["lnmax"] = np.log(dfa.max_floor+1)

dfa["norooms"] = dfa.num_room.isnull()
dfa.num_room = dfa.num_room.fillna(dfa.num_room.median())
dfa["zerorooms"] = (dfa.num_room==0)
dfa["lnrooms"] = np.log( dfa.num_room + 1 )

dfa["nokitch"] = dfa.kitch_sq.isnull()
dfa.kitch_sq = dfa.kitch_sq.fillna(dfa.kitch_sq.median())
dfa["kitch1"] = (dfa.kitch_sq==1)
dfa["kitch0"] = (dfa.kitch_sq==0)
dfa["kitchhuge"] = (dfa.kitch_sq>400)
dfa["lnkitch"] = np.log(dfa.kitch_sq+1)

In [None]:
dfa["material0"] = dfa.material.isnull()
dfa["material1"] = (dfa.material==1)
dfa["material2"] = (dfa.material==2)
dfa["material3"] = (dfa.material==3)
dfa["material4"] = (dfa.material==4)
dfa["material5"] = (dfa.material==5)
dfa["material6"] = (dfa.material==6)

# "state" isn't explained but it looks like an ordinal number, so for now keep numeric
dfa.loc[dfa.state>5,"state"] = np.nan  # Value 33 seems to be invalid; others all 1-4
dfa.state = dfa.state.fillna(dfa.state.median())

# product_type gonna be ugly because there are missing values in the test set but not training
# Check for the same problem with other variables
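dfa["owner_occ"] = (dfa.product_type=='OwnerOccupier').astype("float64")
# Where product_type is missing (test set only), fall back to the overall owner-occupier share
dfa.loc[dfa.product_type.isnull(), "owner_occ"] = np.nan
dfa["owner_occ"] = dfa.owner_occ.fillna(dfa.owner_occ.mean())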

dfa = pd.get_dummies(dfa, columns=['sub_area'], drop_first=True)

In [None]:
# Build year is ugly
# Can be missing
# Can be zero
# Can be one
# Can be some ridiculous pre-Medieval number
# Can be some invalid huge number like 20052009
# Can be some other invalid huge number like 4965
# Can be a reasonable number but later than purchase year
# Can be equal to purchase year
# Can be a reasonable number before purchase year

dfa.loc[dfa.build_year>2030,"build_year"] = np.nan
dfa["nobuild"] = dfa.build_year.isnull()
dfa["sincebuild"] = pd.to_datetime(dfa.timestamp).dt.year - dfa.build_year
dfa["sincebuild"] = dfa.sincebuild.fillna(dfa.sincebuild.median())
dfa["futurebuild"] = (dfa.sincebuild < 0)
dfa["newhouse"] = (dfa.sincebuild==0)
dfa["tooold"] = (dfa.sincebuild>1000)
dfa["build0"] = (dfa.build_year==0)
dfa["build1"] = (dfa.build_year==1)
dfa["untilbuild"] = -dfa.sincebuild.clip(upper=0) # How many years until planned build (zero if already built)
dfa["lnsince"] = dfa.sincebuild.mul(dfa.sincebuild>0).add(1).apply(np.log)

In [None]:
# Note for later:
# Want to check for validity of relationships, e.g. kitch_sq < life_sq < full_sq
# But this interacts with how variables are already processed, so that may have to be changed
# For example, if kitch_sq is sometimes huge and there is a dummy to identify those huge cases,
#  do we want a separate dummy to identify which of those cases are internally consistent?

In [None]:
# Interaction terms
dfa["fullzero_Xowner"] = dfa.fullzero.astype("float64") * dfa.owner_occ
dfa["fulltiny_Xowner"] = dfa.fulltiny.astype("float64") * dfa.owner_occ
dfa["fullhuge_Xowner"] = dfa.fullhuge.astype("float64") * dfa.owner_occ
dfa["lnfull_Xowner"] = dfa.lnfull * dfa.owner_occ
dfa["nofloor_Xowner"] = dfa.nofloor.astype("float64") * dfa.owner_occ
dfa["floor0_Xowner"] = dfa.floor0.astype("float64") * dfa.owner_occ
dfa["floor1_Xowner"] = dfa.floor1.astype("float64") * dfa.owner_occ
dfa["lnfloor_Xowner"] = dfa.lnfloor * dfa.owner_occ
dfa["max1_Xowner"] = dfa.max1.astype("float64") * dfa.owner_occ
dfa["max0_Xowner"] = dfa.max0.astype("float64") * dfa.owner_occ
dfa["maxhuge_Xowner"] = dfa.maxhuge.astype("float64") * dfa.owner_occ
dfa["lnmax_Xowner"] = dfa.lnmax * dfa.owner_occ
dfa["kitch1_Xowner"] = dfa.kitch1.astype("float64") * dfa.owner_occ
dfa["kitch0_Xowner"] = dfa.kitch0.astype("float64") * dfa.owner_occ
dfa["kitchhuge_Xowner"] = dfa.kitchhuge.astype("float64") * dfa.owner_occ
dfa["lnkitch_Xowner"] = dfa.lnkitch * dfa.owner_occ
dfa["nobuild_Xowner"] = dfa.nobuild.astype("float64") * dfa.owner_occ
dfa["newhouse_Xowner"] = dfa.newhouse.astype("float64") * dfa.owner_occ
dfa["tooold_Xowner"] = dfa.tooold.astype("float64") * dfa.owner_occ
dfa["build0_Xowner"] = dfa.build0.astype("float64") * dfa.owner_occ
dfa["build1_Xowner"] = dfa.build1.astype("float64") * dfa.owner_occ
dfa["lnsince_Xowner"] = dfa.lnsince * dfa.owner_occ
dfa["state_Xowner"] = dfa.state * dfa.owner_occ

### Select features to fit
(There's been a lot of back-and-forth off-camera here, mostly having to do with variables that caused collinearity problems -- sometimes, I think, because the scaler introduced imprecision in the context of bagging.  The scaling is applied to the whole training set, but models are fit to individual bags, and it sometimes seems to cause the model to mistakenly fit a coefficient for a dummy that should be omitted because it is constant in a particular bag.  That's an issue I have to think about.)
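
To get a feel for how often a rare dummy ends up constant in a bag, here is a back-of-the-envelope sketch.  It assumes each bag draws as many rows as the training set, with replacement (the bootstrap default for bagging, as far as I know), and that the dummy equals one in only a handful of rows.  The function name and the row counts are illustrative, not taken from the data.

```python
# Chance that a rare dummy is all zeros in one bootstrap bag, so that the
# column is constant there and its coefficient cannot be identified.
# All numbers below are illustrative assumptions, not taken from the data.

def prob_constant_in_bag(n_rows, n_ones, n_draws):
    """P(no draw hits any of the n_ones rows) when sampling with replacement."""
    return (1.0 - n_ones / n_rows) ** n_draws

n_rows = 30000                      # assumed size of the training set
for n_ones in (1, 3, 10):           # rows where the dummy equals one
    p = prob_constant_in_bag(n_rows, n_ones, n_draws=n_rows)
    print(f"{n_ones:2d} ones -> constant in {p:.1%} of full-size bags")
```

Under these assumptions a dummy that is one in a single row is missing from more than a third of full-size bags, which would be often enough for the problem to show up in a small ensemble.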

In [None]:
# Sets of features that go together

# Features derived from full_sq
fullvars = ["fullzero", "fulltiny",
           # For now I'm going to drop the one "fullhuge" case. Later use dummy, maybe.
           #"fullhuge",
           "lnfull" ]

# Features derived from floor
floorvars = ["nofloor", "floor1", "floor0",
             # floorhuge isn't very important, and it's causing problems, so drop it
             #"floorhuge", 
             "lnfloor"]

# Features derived from max_floor
maxvars = ["max1", "max0", "maxhuge", "lnmax"]

# Features derived from kitch_sq
kitchvars = ["kitch1", "kitch0", "kitchhuge", "lnkitch"]

# Features derived from build_year
buildvars = ["nobuild", "futurebuild", "newhouse", "tooold", 
             "build0", "build1", "untilbuild", "lnsince"]

# Features (dummy set) derived from material
matervars = ["material1", "material2",  # material3 is rare, so lumped in with missing 
             "material4", "material5", "material6"]

# Features derived from interaction of floor and product_type
floorXvars = ["nofloor_Xowner", "floor1_Xowner", "lnfloor_Xowner"]

# Features derived from interaction of kitch_sq and product_type
kitchXvars = ["kitch1_Xowner", "kitch0_Xowner", "lnkitch_Xowner"]

# Features (dummy set) derived from sub_area
subarvars = [
       'sub_area_Akademicheskoe',
       'sub_area_Alekseevskoe', 'sub_area_Altufevskoe', 'sub_area_Arbat',
       'sub_area_Babushkinskoe', 'sub_area_Basmannoe', 'sub_area_Begovoe',
       'sub_area_Beskudnikovskoe', 'sub_area_Bibirevo',
       'sub_area_BirjulevoVostochnoe', 'sub_area_BirjulevoZapadnoe',
       'sub_area_Bogorodskoe', 'sub_area_Brateevo', 'sub_area_Butyrskoe',
       'sub_area_Caricyno', 'sub_area_Cheremushki',
       'sub_area_ChertanovoCentralnoe', 'sub_area_ChertanovoJuzhnoe',
       'sub_area_ChertanovoSevernoe', 'sub_area_Danilovskoe',
       'sub_area_Dmitrovskoe', 'sub_area_Donskoe', 'sub_area_Dorogomilovo',
       'sub_area_FilevskijPark', 'sub_area_FiliDavydkovo',
       'sub_area_Gagarinskoe', 'sub_area_Goljanovo',
       'sub_area_Golovinskoe', 'sub_area_Hamovniki',
       'sub_area_HoroshevoMnevniki', 'sub_area_Horoshevskoe',
       'sub_area_Hovrino', 'sub_area_Ivanovskoe', 'sub_area_Izmajlovo',
       'sub_area_Jakimanka', 'sub_area_Jaroslavskoe', 'sub_area_Jasenevo',
       'sub_area_JuzhnoeButovo', 'sub_area_JuzhnoeMedvedkovo',
       'sub_area_JuzhnoeTushino', 'sub_area_Juzhnoportovoe',
       'sub_area_Kapotnja', 'sub_area_Konkovo', 'sub_area_Koptevo',
       'sub_area_KosinoUhtomskoe', 'sub_area_Kotlovka',
       'sub_area_Krasnoselskoe', 'sub_area_Krjukovo',
       'sub_area_Krylatskoe', 'sub_area_Kuncevo', 'sub_area_Kurkino',
       'sub_area_Kuzminki', 'sub_area_Lefortovo', 'sub_area_Levoberezhnoe',
       'sub_area_Lianozovo', 'sub_area_Ljublino', 'sub_area_Lomonosovskoe',
       'sub_area_Losinoostrovskoe', 'sub_area_Marfino',
       'sub_area_MarinaRoshha', 'sub_area_Marino', 'sub_area_Matushkino',
       'sub_area_Meshhanskoe', 'sub_area_Metrogorodok', 'sub_area_Mitino',
       'sub_area_Molzhaninovskoe', 'sub_area_MoskvorecheSaburovo',
       'sub_area_Mozhajskoe', 'sub_area_NagatinoSadovniki',
       'sub_area_NagatinskijZaton', 'sub_area_Nagornoe',
       'sub_area_Nekrasovka', 'sub_area_Nizhegorodskoe',
       'sub_area_NovoPeredelkino', 'sub_area_Novogireevo',
       'sub_area_Novokosino', 'sub_area_Obruchevskoe',
       'sub_area_OchakovoMatveevskoe', 'sub_area_OrehovoBorisovoJuzhnoe',
       'sub_area_OrehovoBorisovoSevernoe', 'sub_area_Ostankinskoe',
       'sub_area_Otradnoe', 'sub_area_Pechatniki', 'sub_area_Perovo',
       'sub_area_PokrovskoeStreshnevo', 'sub_area_PoselenieDesjonovskoe',
       'sub_area_PoselenieFilimonkovskoe', 
        # This one is almost empty.  Will lump in with another category.
        #'sub_area_PoselenieKievskij',
        # This one is almost empty.  Will lump in with another category.
        #'sub_area_PoselenieKlenovskoe', 
       'sub_area_PoselenieKokoshkino',
       'sub_area_PoselenieKrasnopahorskoe',
       'sub_area_PoselenieMarushkinskoe',
        # This one is almost empty.  Will lump in with another category.
        #'sub_area_PoselenieMihajlovoJarcevskoe',
       'sub_area_PoselenieMoskovskij', 'sub_area_PoselenieMosrentgen',
       'sub_area_PoselenieNovofedorovskoe',
       'sub_area_PoseleniePervomajskoe', 'sub_area_PoselenieRjazanovskoe',
       'sub_area_PoselenieRogovskoe', 
        # This one is almost empty.  Will lump in with another category.
        #'sub_area_PoselenieShhapovskoe',
       'sub_area_PoselenieShherbinka', 'sub_area_PoselenieSosenskoe',
       'sub_area_PoselenieVnukovskoe', 'sub_area_PoselenieVoronovskoe',
       'sub_area_PoselenieVoskresenskoe', 'sub_area_Preobrazhenskoe',
       'sub_area_Presnenskoe', 'sub_area_ProspektVernadskogo',
       'sub_area_Ramenki', 'sub_area_Rjazanskij', 'sub_area_Rostokino',
       'sub_area_Savelki', 'sub_area_Savelovskoe', 'sub_area_Severnoe',
       'sub_area_SevernoeButovo', 'sub_area_SevernoeIzmajlovo',
       'sub_area_SevernoeMedvedkovo', 'sub_area_SevernoeTushino',
       'sub_area_Shhukino', 'sub_area_Silino', 'sub_area_Sokol',
       'sub_area_SokolinajaGora', 'sub_area_Sokolniki',
       'sub_area_Solncevo', 'sub_area_StaroeKrjukovo', 'sub_area_Strogino',
       'sub_area_Sviblovo', 'sub_area_Taganskoe', 'sub_area_Tekstilshhiki',
       'sub_area_TeplyjStan', 'sub_area_Timirjazevskoe',
       'sub_area_Troickijokrug', 'sub_area_TroparevoNikulino',
       'sub_area_Tverskoe', 'sub_area_Veshnjaki', 'sub_area_Vnukovo',
       'sub_area_Vojkovskoe', 'sub_area_Vostochnoe',
       'sub_area_VostochnoeDegunino', 'sub_area_VostochnoeIzmajlovo',
       'sub_area_VyhinoZhulebino', 'sub_area_Zamoskvoreche',
       'sub_area_ZapadnoeDegunino', 'sub_area_Zjablikovo', 'sub_area_Zjuzino',   
       ]


# Lump together small sub_areas
dfa = dfa.assign( sub_area_PoselenieSmall =
   dfa.sub_area_PoselenieMihajlovoJarcevskoe +
   dfa.sub_area_PoselenieKievskij +
   dfa.sub_area_PoselenieKlenovskoe +
   dfa.sub_area_PoselenieShhapovskoe )

# For now eliminate case with ridiculous value of full_sq
dfa = dfa[~dfa.fullhuge]

    
# Independent features

indievars = ["owner_occ", "state", "state_Xowner", "lnfull_Xowner", "mortgage_growth"]


# Complete list of features to use for fit

allvars = fullvars + floorvars + maxvars + kitchvars + buildvars + matervars
allvars += floorXvars + kitchXvars + subarvars + indievars



### Set up target variable, along with some related features for heteroskedasticity model

In [None]:
# The normalized target variable:  log real sale price
training = dfa[dfa.price_doc.notnull()].copy()
# Assign with brackets so that lnrp becomes a real column rather than an attribute
training["lnrp"] = training.price_doc.div(training.cpi).apply(np.log)
y = training.lnrp

# Features to use later in heteroskedasticity model
million1 = (training.price_doc==1e6)
million2 = (training.price_doc==2e6)
million3 = (training.price_doc==3e6)

# Create X matrix for fitting
keep = allvars + ['timestamp']  # Need to keep timestamp to calculate weights
X = training[keep] 

### Set up (time-weights, imputation, scaling, etc.) for initial fit

In [None]:
def get_weights(df):
    # Weight cases linearly on time
    # with later cases (more like test data) weighted more heavily
    basedate = pd.to_datetime(weight_base).toordinal() # Basedate gets a weight of zero
    wtd = pd.to_datetime(df.timestamp).apply(lambda x: x.toordinal()) - basedate
    wts = np.array(wtd)/1e3 # The denominator here shouldn't matter, just gives nice numbers.
    return wts

In [None]:
wts = get_weights(X)
X = X.drop("timestamp", axis=1)

In [None]:
if use_pipe:
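    # Imputer has been removed from sklearn.preprocessing in newer scikit-learn;
    # SimpleImputer (mean imputation by default) is aliased so later cells still work
    from sklearn.impute import SimpleImputer as Imputer
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    # Make a pipeline that transforms X
    pipe = make_pipeline(Imputer(), StandardScaler())
    pipe.fit(X)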

### Fit a bagged WLS and generate squared residuals

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import BaggingRegressor
import random

np.random.seed(100)  # BaggingRegressor draws from NumPy's global RNG when random_state is None
lr = LinearRegression(fit_intercept=True)
br = BaggingRegressor(lr)
if use_pipe:
    br.fit(pipe.transform(X), y, sample_weight=wts)
else:
    br.fit(X, y, sample_weight=wts)

In [None]:
# Look for collinearity problems
for e in br.estimators_:
    co = e.coef_
    mask = np.abs(co)>1e4
    print( X.columns[mask].values )

In [None]:
# Predict on the training set and take the residuals
if use_pipe:
    pred = br.predict( pipe.transform(X) )
else:
    pred = br.predict(X)
resids = y - pred
resids2 = resids * resids

### Fit the initial heteroskedasticity model

In [None]:
# Add heteroskedasticity-related features
Xhetero = X.assign(million1=million1, million2=million2, million3=million3, lnrp=y)

I tried this using raw squared residuals, but that model predicts a lot of negative values.  Given the distribution of the squared residuals, logs are more appropriate.
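
Here is a toy illustration of that point, using only the standard library.  The data are synthetic: the variance is assumed to grow like `exp(4*x)` in a single made-up predictor `x`, and `fit_line` is a hypothetical helper, not part of the analysis.  A straight line fit to the raw squared residuals can dip below zero at the low end, while a line fit to the logs is exponentiated back and so stays positive.

```python
# Toy comparison: fit a line to squared residuals vs. to their logs.
# The data are synthetic; the variance pattern (growing like exp(4*x)) is an
# assumption chosen to mimic a skewed distribution of squared residuals.
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(0)
n = 5000
xs = [i / (n - 1) for i in range(n)]
res2 = [(rng.gauss(0, 1) * math.exp(2 * x)) ** 2 for x in xs]

a_raw, b_raw = fit_line(xs, res2)                         # model for res2 itself
a_log, b_log = fit_line(xs, [math.log(r) for r in res2])  # model for log(res2)

pred_raw = [a_raw + b_raw * x for x in xs]
pred_log = [math.exp(a_log + b_log * x) for x in xs]

print("min prediction, raw model:", min(pred_raw))  # can fall below zero
print("min prediction, log model:", min(pred_log))  # positive by construction
```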

In [None]:
lnres2 = np.log(resids2)
plt.hist(resids2, bins=20)
plt.show()
plt.hist(lnres2, bins=20)
plt.show()

In [None]:
# For a visual look at the results of a rough model, use statsmodels.WLS
from statsmodels.regression.linear_model import WLS
xdat = Xhetero.copy().astype(np.float64)
xdat["constant"] = 1
ydat = lnres2.copy().astype(np.float64)
result = WLS(ydat, xdat, weights=wts).fit()
result.summary()

That's a regression sorely in need of regularization, so I'm going to fit a LASSO.  But the version of SKLearn's LASSO I'm using doesn't allow for sample weights (newer releases accept `sample_weight` in `fit`), so I first want to check the unweighted regression to see if the results are similar.

In [None]:
from statsmodels.regression.linear_model import OLS
result = OLS(ydat, xdat).fit()
result.summary()

Close enough for a heteroskedasticity model, I think.  OK, let's do LASSO.

### Regularize

I experimented with various values of the LASSO parameter to see what gave reasonable-looking results.  Here's a range of possibilities:

#### First LASSO:  alpha=0.05

In [None]:
from sklearn.linear_model import Lasso
la_hetero = Lasso(alpha=5e-2)
if use_pipe:
    newpipe = make_pipeline(Imputer(), StandardScaler())
    newpipe.fit(Xhetero)
    la_hetero.fit(newpipe.transform(Xhetero), lnres2)
    pred_res2 = np.exp( la_hetero.predict( newpipe.transform(Xhetero) ) )
else:
    la_hetero.fit(Xhetero, lnres2)
    pred_res2 = np.exp( la_hetero.predict(Xhetero) )
print( np.min(pred_res2) )
print( np.max(pred_res2) )
pd.DataFrame(Xhetero.columns, la_hetero.coef_)[np.abs(la_hetero.coef_)>1e-5]

#### Second LASSO:  alpha=0.1

In [None]:
la_hetero = Lasso(alpha=1e-1)
if use_pipe:
    la_hetero.fit(newpipe.transform(Xhetero), lnres2)
    pred_res2 = np.exp( la_hetero.predict( newpipe.transform(Xhetero) ) )
else:
    la_hetero.fit(Xhetero, lnres2)
    pred_res2 = np.exp( la_hetero.predict(Xhetero) )
print( np.min(pred_res2) )
print( np.max(pred_res2) )
pd.DataFrame(Xhetero.columns, la_hetero.coef_)[np.abs(la_hetero.coef_)>1e-5]

#### Third LASSO:  alpha=0.2

In [None]:
la_hetero = Lasso(alpha=2e-1)
if use_pipe:
    la_hetero.fit(newpipe.transform(Xhetero), lnres2)
    pred_res2 = np.exp( la_hetero.predict( newpipe.transform(Xhetero) ) )
else:
    la_hetero.fit(Xhetero, lnres2)
    pred_res2 = np.exp( la_hetero.predict(Xhetero) )
print( np.min(pred_res2) )
print( np.max(pred_res2) )
pd.DataFrame(Xhetero.columns, la_hetero.coef_)[np.abs(la_hetero.coef_)>1e-5]

#### Fourth LASSO:  alpha=0.4

In [None]:
la_hetero = Lasso(alpha=4e-1)
if use_pipe:
    la_hetero.fit(newpipe.transform(Xhetero), lnres2)
    pred_res2 = np.exp( la_hetero.predict( newpipe.transform(Xhetero) ) )
else:
    la_hetero.fit(Xhetero, lnres2)
    pred_res2 = np.exp( la_hetero.predict(Xhetero) )
print( np.min(pred_res2) )
print( np.max(pred_res2) )
pd.DataFrame(Xhetero.columns, la_hetero.coef_)[np.abs(la_hetero.coef_)>1e-5]

Initially, I decided to fit an OLS, and then a WLS, to a slightly altered version of the variable set chosen by the third LASSO.

In [None]:
ls_hetero = LinearRegression(fit_intercept=True)
xh = Xhetero[["lnfloor_Xowner", "sub_area_Nekrasovka", "owner_occ",
              "state_Xowner", "million1", "million2", "million3"]]
if use_pipe:
    newerpipe = make_pipeline(Imputer(), StandardScaler())
    newerpipe.fit(xh)
    ls_hetero.fit(newerpipe.transform(xh), lnres2)
    pred_res2 = np.exp( ls_hetero.predict( newerpipe.transform(xh) ) )
else:
    ls_hetero.fit(xh, lnres2)
    pred_res2 = np.exp( ls_hetero.predict(xh) )
print( np.min(pred_res2) )
print( np.max(pred_res2) )
print( np.mean(pred_res2) )
print( np.std(pred_res2) )
print( np.min(resids2) )
print( np.max(resids2) )
print( np.mean(resids2) )
print( np.std(resids2) )

pd.DataFrame(xh.columns, ls_hetero.coef_)[np.abs(ls_hetero.coef_)>1e-5]

In [None]:
xdat = xh.copy().astype(np.float64)
xdat["constant"] = 1
ydat = lnres2.copy().astype(np.float64)
result = WLS(ydat, xdat, weights=wts).fit()
result.summary()

Here's what impressed me:  I fit a bagged version of the WLS, using 20% random samples, and the coefficients are amazingly robust across bags -- I mean, about as robust as one could hope given the amount of random variation one expects in the data.

In [None]:
np.random.seed(200)  # BaggingRegressor draws from NumPy's global RNG when random_state is None
xin = xh
lr = LinearRegression(fit_intercept=True)
br = BaggingRegressor(lr, max_samples=0.2)
br.fit(xin, lnres2, sample_weight=wts)
pd.DataFrame(data=[e.coef_ for e in br.estimators_], columns=xin.columns)

Just to make the point (and, at the time, to check that the samples were what I intended):  these are quite small samples relative to the total training data set, with only the randomly expected amount of overlap.  You should be able to see why I was impressed with the robustness.
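
For a rough idea of what "the randomly expected amount of overlap" looks like, here is a small simulation using only the standard library.  The row count is a round-number stand-in for the training set, the seed is arbitrary, and drawing with replacement is an assumption about how the bags are built.  `draw_bag` is a hypothetical helper.

```python
# Expected overlap between two 20% bags, simulated with the standard library.
# n_rows is a round-number stand-in for the training set size; the seed is
# arbitrary, and drawing with replacement is an assumption.
import random

rng = random.Random(200)
n_rows = 30000
n_draws = int(0.2 * n_rows)

def draw_bag():
    """Distinct row indices hit by n_draws draws with replacement."""
    return {rng.randrange(n_rows) for _ in range(n_draws)}

bag_a, bag_b = draw_bag(), draw_bag()
overlap = len(bag_a & bag_b)

print("distinct rows per bag:", len(bag_a), len(bag_b))
print("rows shared by both bags:", overlap)
print("share of bag A also in bag B: {:.1%}".format(overlap / len(bag_a)))
```

Under these assumptions each bag touches a bit under a fifth of the rows, and any two bags share only a few percent of the data, so similar coefficients across bags are not just the same rows being refit.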

In [None]:
sums = [sum(s) for s in br.estimators_samples_]
print( sums )
print (np.mean(sums))

In [None]:
print( xin.shape[0] )
br

In [None]:
sum(br.estimators_samples_[0]*br.estimators_samples_[1])

In [None]:
br.estimators_samples_

But one really should also do a division by time.  A 50/50 split sample test showed less robustness and led me to omit some variables which had dramatically different coefficients on the sample halves (also because I was suspicious of them to begin with).

In [None]:
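half = len(xin) // 2  # Rows are in time order, so this splits early vs. late without skipping a row
xin_top = xin.iloc[:half]
r_top = lnres2.iloc[:half]
xin_bot = xin.iloc[half:]
r_bot = lnres2.iloc[half:]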
xdat = xin_top.copy().astype(np.float64)
xdat["constant"]=1
rdat = r_top.copy().astype(np.float64)
print( OLS(rdat, xdat).fit().summary() )
xdat = xin_bot.copy().astype(np.float64)
xdat["constant"]=1
rdat = r_bot.copy().astype(np.float64)
print( OLS(rdat, xdat).fit().summary() )

So here's what, at the time, was my "final" model, which I fit first as an exploratory WLS and then as a bagged WLS that was intended to generate my heteroskedasticity weights.

In [None]:
xh = Xhetero[["sub_area_Nekrasovka", "owner_occ",
              "million1", "million2", "million3"]]
xdat = xh.copy().astype(np.float64)
xdat["constant"] = 1
ydat = lnres2.copy().astype(np.float64)
result = WLS(ydat, xdat, weights=wts).fit()
result.summary()

In [None]:
np.random.seed(200)  # BaggingRegressor draws from NumPy's global RNG when random_state is None
xin = xh
lr = LinearRegression(fit_intercept=True)
br = BaggingRegressor(lr, max_samples=0.5)
br.fit(xin, lnres2, sample_weight=wts)
pd.DataFrame(data=[e.coef_ for e in br.estimators_], columns=xin.columns)

In [None]:
pred_res2 = np.exp( br.predict(xin) )
print()
print( np.min(pred_res2) )
print( np.min(resids2) )
print()
print( np.max(pred_res2) )
print( np.max(resids2) )
print()
print( np.mean(pred_res2) )
print( np.mean(resids2) )
print()
print( np.std(pred_res2) )
print( np.std(resids2) )

Since all the predictors are binary and the model is linear, there are a small number of possible output values.  But the range of values is weird.  If I were to go with standard procedure and weight by the reciprocal of the predicted squared residuals (thereby equalizing the expected variance), I would, rather ridiculously, have to give 28 times as much weight to owner-occupied sales in Nekrasovka as to typical (investment) sales.  And a price of 3 million rubles -- which, based on my past inspection of the data, is not necessarily far out of line and probably very often just represents a round figure for an otherwise reasonable price -- would further reduce the weight by a factor of 11.

In [None]:
predvals = pd.Series(pred_res2).value_counts().sort_index()
predvals

In [None]:
print( predvals.index.values[3]/predvals.index.values[0] )
print( predvals.index.values[8]/predvals.index.values[3] )

So I tried using a "waterdown" parameter, where I add a constant to the predicted squared residuals and use the reciprocal of that as my sample weight.  The results were disappointing.  I won't go through everything I did, but I eventually decided to give up on my intuitively chosen bagged WLS and go back to LASSO, letting out-of-sample RMSLE choose the penalty coefficient and the "waterdown" parameter, as well as the parameter for my time weights.
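
A minimal sketch of what the "waterdown" constant does to the weights, in plain Python.  The three predicted squared residuals are made-up values whose extremes differ by a factor of 28, echoing the ratio quoted above; `hetero_weights` is a hypothetical helper name.

```python
# How a "waterdown" constant flattens reciprocal-variance weights.
# The three predicted squared residuals are made-up values whose extremes
# differ by a factor of 28, echoing the ratio quoted above.
pred_res2 = [0.05, 0.20, 1.40]

def hetero_weights(pred, waterdown):
    """Weight each case by 1 / (predicted squared residual + waterdown)."""
    return [1.0 / (p + waterdown) for p in pred]

ratios = {}
for waterdown in (0, 1, 8, 512):
    w = hetero_weights(pred_res2, waterdown)
    ratios[waterdown] = max(w) / min(w)
    print(f"waterdown={waterdown:>3}: largest/smallest weight = {ratios[waterdown]:.3f}")
```

With no watering down, the weights span the full factor of 28.  At 512 they are equal to within a fraction of a percent, which is why 512 can stand in for "no heteroskedasticity weighting at all".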

### Evaluate

In [None]:
from sklearn.metrics import make_scorer

def rmsle_exp(y_true_log, y_pred_log):
    y_true = np.exp(y_true_log)
    y_pred = np.exp(y_pred_log)
    return np.sqrt(np.mean(np.power(np.log(y_true + 1) - np.log(y_pred + 1), 2)))

def score_model(model, pipe=None):
    if pipe is None:
        train_error = rmsle_exp(y_train, model.predict(X_train))
        test_error = rmsle_exp(y_test, model.predict(X_test))
    else:
        train_error = rmsle_exp(y_train, model.predict(pipe.transform(X_train)))
        test_error = rmsle_exp(y_test, model.predict(pipe.transform(X_test)))
    return train_error, test_error

In [None]:
len(X)

Split the training data by time into a 75% training sample and a 25% test sample.  (Note that the data were already given to us in order of time.)

In [None]:
split = int(.75*len(X))
X_train = X[0:split]
X_test = X[split:len(X)]
y_train = y[0:split]
y_test = y[split:len(y)]
wts_train = wts[0:split]

Do a grid search to choose the most successful combination of LASSO penalty coefficient, waterdown parameter for predicted squared residuals, and zero date for time weighting.  (Note that I have given up on bagging now.  Originally I did some runs which used bagging, and the bagged versions almost always did worse out of sample.  So these just use plain WLS.)

*This takes too long to run on Kaggle, so I've replaced the grid definitions with abbreviated ones.  The originals are still in the code, but each is superseded by the line that follows it.*

In [None]:
from sklearn.linear_model import Lasso
newpipe = make_pipeline(Imputer(), StandardScaler())
newpipe.fit(Xhetero)

minval = 100.
where_min = ["Never",0.,0.]
results = pd.DataFrame()
alphas = [5e-1,2e-1,1e-1,5e-2,2e-2,1e-2,5e-3,2e-3]
# DELETE NEXT LINE TO RUN FULL VERSION
alphas = [5e-1,5e-2,2e-3]
for alpha in alphas:
    print( "alpha=", alpha)
    la_hetero = Lasso(alpha=alpha)
    # Fit once, with or without the pipeline, instead of overwriting the piped fit
    if use_pipe:
        la_hetero.fit(newpipe.transform(Xhetero), lnres2)
        pred_res2 = np.exp( la_hetero.predict( newpipe.transform(Xhetero) ) )
    else:
        la_hetero.fit(Xhetero, lnres2)
        pred_res2 = np.exp( la_hetero.predict(Xhetero) )
    wbvalues = ["2010-01-01", "2010-04-01", "2010-07-01","2010-10-01",
                "2011-01-01", "2011-04-01", "2011-07-01"]
# DELETE NEXT LINE TO RUN FULL VERSION
    wbvalues = ["2010-01-01", "2011-01-01", "2011-07-01"]
    for wbase in wbvalues:
        print( "    wbase=", wbase)
        basedate = pd.to_datetime(wbase).toordinal() # Basedate gets a weight of zero
        wtd = pd.to_datetime(training.timestamp).apply(lambda x: x.toordinal()) - basedate
        wts = np.array(wtd)/1e3 # The denominator here shouldn't matter, just gives nice numbers.
        row = []
        wdvalues = [0,1,2,4,8,16,32,64,128,256,512]
# DELETE NEXT LINE TO RUN FULL VERSION
        wdvalues = [1,32,512]
        for waterdown in wdvalues:
            pred_wd = pred_res2 + waterdown
            # Heteroskedasticity weight is the reciprocal of the watered-down prediction
            wts_train = (wts / pred_wd) [0:split]
            lr.fit(X_train, y_train, sample_weight=wts_train)
            test_error = rmsle_exp(y_test, lr.predict(X_test))
            if test_error < minval:
                minval = test_error
                where_min = [wbase, alpha, waterdown]
            row = row + [test_error]
        index = pd.MultiIndex.from_tuples([(wbase,alpha)], names=['wbase', 'alpha'])
        dfrow = pd.DataFrame( index=index, data=[row], columns=wdvalues)
        results = pd.concat( [results, dfrow] )  # DataFrame.append is gone in pandas 2.x
print( where_min )
print( minval )
results

The choice was 512 (my proxy for infinity) for the waterdown parameter, so the LASSO penalty coefficient is irrelevant.  And the results favor the most aggressive version of time weighting.  I did another grid search below, which is a combination of fine tuning and hoping the data would change their mind.

*Again, I've abbreviated the grid.*

In [None]:
minval = 100.
where_min = ["Never",0.,0.]
results = pd.DataFrame()
alphas = [2e-1,1e-1,5e-2,2e-2,1e-2]
# DELETE NEXT LINE TO RUN FULL VERSION
alphas = [2e-1,1e-2]
for alpha in alphas:
    print( "alpha=", alpha)
    la_hetero = Lasso(alpha=alpha)
    # Fit once, with or without the pipeline, instead of overwriting the piped fit
    if use_pipe:
        la_hetero.fit(newpipe.transform(Xhetero), lnres2)
        pred_res2 = np.exp( la_hetero.predict( newpipe.transform(Xhetero) ) )
    else:
        la_hetero.fit(Xhetero, lnres2)
        pred_res2 = np.exp( la_hetero.predict(Xhetero) )
    wbvalues = ["2011-06-01", "2011-06-15", "2011-07-01","2011-07-15",
                "2011-08-01", "2011-08-15"]
# DELETE NEXT LINE TO RUN FULL VERSION
    wbvalues = ["2011-06-15", "2011-08-15"]
    for wbase in wbvalues:
        print( "    wbase=", wbase)
        basedate = pd.to_datetime(wbase).toordinal() # Basedate gets a weight of zero
        wtd = pd.to_datetime(training.timestamp).apply(lambda x: x.toordinal()) - basedate
        wts = np.array(wtd)/1e3 # The denominator here shouldn't matter, just gives nice numbers.
        row = []
        wdvalues = [0,2,8,32,128,512]
# DELETE NEXT LINE TO RUN FULL VERSION
        wdvalues = [8,512]
        for waterdown in wdvalues:
            pred_wd = pred_res2 + waterdown
            # Heteroskedasticity weight is the reciprocal of the watered-down prediction
            wts_train = (wts / pred_wd) [0:split]
            lr.fit(X_train, y_train, sample_weight=wts_train)
            test_error = rmsle_exp(y_test, lr.predict(X_test))
            if test_error < minval:
                minval = test_error
                where_min = [wbase, alpha, waterdown]
            row = row + [test_error]
        index = pd.MultiIndex.from_tuples([(wbase,alpha)], names=['wbase', 'alpha'])
        dfrow = pd.DataFrame( index=index, data=[row], columns=wdvalues)
        results = pd.concat( [results, dfrow] )  # DataFrame.append is gone in pandas 2.x
print( where_min )
print( minval )
results            

Again, the data tell me to throw away the heteroskedasticity weights (water them down to nothing) and choose the most aggressive time weighting.  In subsequent analysis, I've arbitrarily chosen an even slightly more aggressive time weighting, with 2011-08-19 as the zero date, so the initial data points are given the smallest possible nonzero weight in my linear weighting scheme.  Call that OCD, or Occam's Razor: the simplest way to choose time weights is to start from zero and ramp up linearly.
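
To make the linear ramp concrete, here is a small sketch that mirrors the arithmetic in `get_weights` using the standard library's `datetime` in place of pandas.  The three sale dates are hypothetical examples, and `time_weight` is a made-up helper name.

```python
# Linear time weights that start from zero at a chosen base date.
# Uses the standard library instead of pandas; the three sale dates are
# hypothetical examples, and 1e3 is the same cosmetic divisor as in get_weights.
from datetime import date

def time_weight(sale_date, base_date):
    """Days elapsed since base_date, scaled down by 1000."""
    return (sale_date.toordinal() - base_date.toordinal()) / 1e3

base = date(2011, 8, 19)            # the zero date mentioned in the text
for d in (date(2011, 8, 20), date(2013, 8, 19), date(2015, 6, 30)):
    print(d.isoformat(), time_weight(d, base))
```

A sale on the day after the zero date gets the smallest possible nonzero weight, and the weight then grows by the same amount for every additional day.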

## Ditch the heteroskedasticity weighting and just fit with time weights

In [None]:
weight_base = "2011-08-19"
wts = get_weights(training)
newwts = wts
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression(fit_intercept=True)
if use_pipe:
    lr.fit(pipe.transform(X), y, sample_weight=newwts)
else:
    lr.fit(X, y, sample_weight=newwts)

### Predict on the test set

In [None]:
testing = dfa[dfa.price_doc.isnull()]

In [None]:
# Same columns, in the same order, as the training matrix
df_test = testing[X.columns]

In [None]:
# Make the predictions
if use_pipe:
    pred = lr.predict( pipe.transform(df_test) )
else:
    pred = lr.predict(df_test)
predictions = np.exp(pred)*testing.cpi

# And put this in a dataframe
predictions_df = pd.DataFrame()
predictions_df['id'] = testing['id']
predictions_df['price_doc'] = predictions
predictions_df.head()



In [None]:
predictions_df.to_csv('wls_predictions.csv', index=False)

### Postlogue:  Inspect the model fit

In [None]:
# Check for ridiculous coefficients, likely indicating collinearity
co = lr.coef_
ra = range(len(co))
mask = np.abs(co)>1e4
X.columns[mask].values


In [None]:
from statsmodels.regression.linear_model import WLS
xdat = X.copy().astype(np.float64)
xdat["constant"] = 1
ydat = y.copy().astype(np.float64)
result = WLS(ydat, xdat, weights=newwts).fit()
result.summary()

In [None]:
# Note that, if the model is run without the pipe transform, the coefficients below
#  should be the same as those above.  Sometimes they are, sometimes not.
#  If they're not the same, probably numerical instability due to collinearity.
pd.DataFrame(X.columns, co)

### Frequencies of sub_areas (completely out of the blue)

I'm going to use these in the next analysis, to collapse sub_areas with fewer than 10 cases into geographically contiguous aggregates.  I had to do something to point forward after such a disappointing analysis.

In [None]:
train.sub_area.value_counts().tail(20)

In [None]:
print(train.sub_area.sort_values().unique()[1:170])