## Exploring extrapolation

This notebook sorts the dataset by `YrSold` and `MoSold` and holds out the tail of the sorted data as the validation set. It then shows how a random forest can reveal temporal features: train a model to predict whether a record belongs to the training set or the validation set, and see which features it relies on. The same check can be used to determine whether a test set is a random sample or grouped by time.
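
As a minimal sketch of a time-ordered holdout, assuming the goal is to validate on the most recent sales: sort ascending by the date columns and keep the last `n_valid` rows for validation. The helper name `time_ordered_split` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd

def time_ordered_split(df, time_cols, n_valid):
    """Sort oldest-first and hold out the last n_valid rows (the most recent)."""
    ordered = df.sort_values(by=time_cols, ascending=True).reset_index(drop=True)
    n_trn = len(ordered) - n_valid
    return ordered.iloc[:n_trn].copy(), ordered.iloc[n_trn:].copy()

# Hypothetical toy data using the notebook's column names.
toy = pd.DataFrame({
    "YrSold": [2008, 2010, 2006, 2009, 2010, 2007],
    "MoSold": [5, 7, 1, 12, 3, 9],
    "SalePrice": [150000, 210000, 120000, 180000, 200000, 135000],
})

trn, val = time_ordered_split(toy, ["YrSold", "MoSold"], n_valid=2)
print(val[["YrSold", "MoSold"]])
```

With an ascending sort, every validation row is at least as recent as every training row, which is what a forecast-style evaluation needs.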

In [10]:
%load_ext autoreload
%autoreload 2

%matplotlib inline


In [11]:
from fastai.imports import *
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display

from sklearn import metrics
from sklearn.model_selection import train_test_split

PATH = "data/"
df_raw = pd.read_csv(f'{PATH}Train.csv', low_memory=False)

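# Sort by sale date. With ascending=False the newest sales come first,
# so the tail that split() holds out below contains the oldest sales.
# Use ascending=True to hold out the most recent sales instead.
df_raw = df_raw.sort_values(by=['YrSold', 'MoSold'], ascending=False)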

df_raw.SalePrice = np.log(df_raw.SalePrice)
train_cats(df_raw)
df, y, nas = proc_df(df_raw, 'SalePrice', max_n_cat=7)


In [12]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

def split(df, y):
    #return (train_test_split(df, y, test_size=0.2, random_state=42))
    #or
    n_valid = 300  # same as Kaggle's test set size
    n_trn = len(df)-n_valid
    X_train, X_valid = split_vals(df, n_trn)
    y_train, y_valid = split_vals(y, n_trn)
    return (X_train, X_valid, y_train, y_valid)


X_train, X_valid, y_train, y_valid = split(df, y)


In [13]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)
        
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

g_score = {
            'RMSE_train': [],
            'RMSE_valid': [],
            'score_train': [],
            'score_valid': []
}

def plot_trend():
    pd.DataFrame(g_score).plot(y='RMSE_valid', figsize=(10,6), legend=True);

def print_score(m):
    data = {
            'RMSE_train': [rmse(m.predict(X_train), y_train)],
            'RMSE_valid': [rmse(m.predict(X_valid), y_valid)],
            'score_train': [m.score(X_train, y_train)],
            'score_valid': [m.score(X_valid, y_valid)]
            }
    if hasattr(m, 'oob_score_'): data['score_oob_'] = [m.oob_score_]
    for k in ['RMSE_train', 'RMSE_valid', 'score_train', 'score_valid']:
        #print(data[k][0])
        g_score[k].append(data[k][0])    
    df_score = pd.DataFrame(data)
    display_all(df_score.transpose())

In [14]:
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, min_samples_leaf=1, max_features=0.5, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

Unnamed: 0,0
RMSE_train,0.055858
RMSE_valid,0.12194
score_train,0.980645
score_valid,0.902595
score_oob_,0.855669


In [8]:
# fi = rf_feat_importance(m, df)
# #df_keep = df[fi[fi.imp>0.0042].cols].copy()
# df_keep = df

# X_train, X_valid, y_train, y_valid = split(df_keep, y)
# m = RandomForestRegressor(n_estimators=40, n_jobs=-1, min_samples_leaf=1, max_features=0.5, oob_score=True)
# m.fit(X_train, y_train)
# print_score(m)

Unnamed: 0,0
RMSE_train,0.059694
RMSE_valid,0.124675
score_train,0.977895
score_valid,0.898177
score_oob_,0.85527


In [15]:
# from scipy.cluster import hierarchy as hc
# corr = np.round(scipy.stats.spearmanr(df_keep).correlation, 4)
# corr_condensed = hc.distance.squareform(1-corr)
# z = hc.linkage(corr_condensed, method='average')
# fig = plt.figure(figsize=(16,10))
# dendrogram = hc.dendrogram(z, labels=df_keep.columns, orientation='left', leaf_font_size=16)
# plt.show()

In [167]:
# df.drop(['GarageYrBlt', 'GarageCars', 'TotRmsAbvGrd', 'TotalBsmtSF'], axis=1, inplace=True)
# X_train, X_valid, y_train, y_valid = split(df_keep, y)
# m = RandomForestRegressor(n_estimators=40, n_jobs=-1, min_samples_leaf=1, max_features=0.5, oob_score=True)
# m.fit(X_train, y_train)
# print_score(m)

Unnamed: 0,0
RMSE_train,0.056592
RMSE_valid,0.123793
score_train,0.980133
score_valid,0.899613
score_oob_,0.857128


In [16]:
ex_X_train = X_train.copy()
ex_X_valid = X_valid.copy()
ex_X_train['is_valid'] = 0
ex_X_valid['is_valid'] = 1
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
ex_df = pd.concat([ex_X_train, ex_X_valid], ignore_index=True)
display(ex_df[['YrSold', 'MoSold', 'is_valid']])

Unnamed: 0,YrSold,MoSold,is_valid
0,2010,7,0
1,2010,7,0
2,2010,7,0
3,2010,7,0
4,2010,7,0
5,2010,7,0
6,2010,6,0
7,2010,6,0
8,2010,6,0
9,2010,6,0


In [17]:
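# Use separate names so the SalePrice target `y` from the earlier cells is not overwritten.
x_ex, y_ex, _ = proc_df(ex_df, 'is_valid')
m = RandomForestRegressor(n_estimators=40, n_jobs=-1, min_samples_leaf=1, max_features=0.5, oob_score=True)
m.fit(x_ex, y_ex)
m.oob_score_  # not displayed mid-cell; wrap in print() to see it

# Rank the features by how much they help separate validation rows from training rows.
fi = rf_feat_importance(m, x_ex); fi[:10]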

Unnamed: 0,cols,imp
43,YrSold,0.905022
42,MoSold,0.040928
11,YearRemodAdd,0.008416
235,SaleCondition_Partial,0.006375
10,YearBuilt,0.00518
32,GarageYrBlt,0.004633
234,SaleCondition_Normal,0.003193
0,Id,0.002508
3,LotArea,0.0024
125,BsmtCond_Gd,0.001733


In this example I ordered the data by year and month sold and held out the tail of the sorted data as the validation set, then ran [the extrapolation check](https://youtu.be/3jl2h9hSRvc?t=3134). Label the training and validation records accordingly, combine them into one DataFrame, and try to predict whether a record is a validation record. If the `oob_score_` is very high, the two sets are easy to tell apart, so look at the feature importances. The most important features (here `YrSold` and `MoSold`) are likely to be temporal, and you may want to remove them from the training set so the model predicts future values better.
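
The procedure can be sketched without fastai on synthetic data. This is an illustration under stated assumptions, not the notebook's own code: the toy columns (`year_sold`, `mo_sold`, `lot_area`, `overall_qual`) are made up, and it uses scikit-learn's `RandomForestClassifier`, so `oob_score_` here is out-of-bag accuracy rather than the regressor's R².

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy data: sales spread over 60 months, already in time order.
month_index = np.sort(rng.integers(0, 60, size=n))
toy = pd.DataFrame({
    "year_sold": 2006 + month_index // 12,
    "mo_sold": month_index % 12 + 1,
    "lot_area": rng.normal(10_000, 2_000, size=n),
    "overall_qual": rng.integers(1, 11, size=n),
})

# Label the last 200 rows (the most recent sales) as the validation set.
is_valid = np.zeros(n, dtype=int)
is_valid[-200:] = 1

# Train a forest to tell validation rows from training rows.
clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=3,
                             oob_score=True, n_jobs=-1, random_state=0)
clf.fit(toy, is_valid)

# Features the forest relies on most are the ones that differ between the sets.
importances = pd.Series(clf.feature_importances_, index=toy.columns)
importances = importances.sort_values(ascending=False)
print(f"OOB accuracy: {clf.oob_score_:.3f}")
print(importances)
```

On this toy data the sale year should dominate the importances, mirroring the `YrSold` result above; the non-temporal columns carry almost no signal about set membership.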

The same can be done with the Kaggle test set: if a model can predict whether a record comes from the test set, the test set is not a random sample.
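
One way to make "can you predict it" concrete is to compare the score against what a random split would give. The sketch below is an assumption-laden illustration on synthetic data: the helper `split_auc` and the toy columns are hypothetical, and it scores the out-of-bag predictions with ROC AUC, where about 0.5 means the two sets look alike and values near 1.0 mean they are easy to separate.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def split_auc(part_a, part_b, seed=0):
    """OOB ROC AUC of a forest trained to tell part_b rows from part_a rows."""
    X = pd.concat([part_a, part_b], ignore_index=True)
    label = np.r_[np.zeros(len(part_a), dtype=int), np.ones(len(part_b), dtype=int)]
    clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                                 oob_score=True, n_jobs=-1, random_state=seed)
    clf.fit(X, label)
    # Column 1 holds the out-of-bag probability of belonging to part_b.
    return roc_auc_score(label, clf.oob_decision_function_[:, 1])

rng = np.random.default_rng(1)
n = 1000
# Hypothetical toy data, sorted oldest to newest.
toy = pd.DataFrame({
    "year_sold": np.sort(rng.integers(2006, 2011, size=n)),
    "lot_area": rng.normal(10_000, 2_000, size=n),
})

# "Test set" taken from the end of the timeline.
auc_time = split_auc(toy.iloc[:800], toy.iloc[800:])

# "Test set" drawn at random from the same data.
shuffled = toy.sample(frac=1.0, random_state=1)
auc_random = split_auc(shuffled.iloc[:800], shuffled.iloc[800:])

print(f"time-ordered split AUC: {auc_time:.3f}")
print(f"random split AUC:       {auc_random:.3f}")
```

AUC is used here instead of accuracy because the two sets differ in size, so a model that always predicts the larger set would already score 80% accuracy.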