# Validation and cross-validation 

In this exercise you will implement a validation pipeline. 

At the end of Exercise 2, you tested your model against the training and test datasets. As you should have observed, there is a gap between the two results. Validating your model should let you both anticipate test-time performance and compare different models.

Implement the basic validation method, i.e. a random split. Test it with your model from Exercise 2.
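As a rough illustration of what a random split does, the sketch below shuffles the row indices once and cuts them into a training part and a validation part. The helper name `random_split`, its parameters, and the toy arrays are assumptions made for this example; they are not part of the exercise code.

```python
import numpy as np

def random_split(X, y, validation_fraction=0.33, seed=0):
    """Shuffle row indices once and cut them into a training and a validation part."""
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_val = int(round(len(X) * validation_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

# Toy data standing in for the feature matrix and the prices (not the real dataset).
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_val, y_tr, y_val = random_split(X_demo, y_demo, validation_fraction=0.3, seed=1)
print(X_tr.shape, X_val.shape)  # → (7, 2) (3, 2)
```

The same index array is used for `X` and `y`, so features and targets stay aligned after shuffling.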

In [1]:
%matplotlib inline

!wget -O mieszkania.csv https://www.dropbox.com/s/zey0gx91pna8irj/mieszkania.csv?dl=1
!wget -O mieszkania_test.csv https://www.dropbox.com/s/dbrj6sbxb4ayqjz/mieszkania_test.csv?dl=1

--2017-11-06 19:07:18--  https://www.dropbox.com/s/zey0gx91pna8irj/mieszkania.csv?dl=1
Resolving www.dropbox.com (www.dropbox.com)... 162.125.66.1, 2620:100:6022:1::a27d:4201
Connecting to www.dropbox.com (www.dropbox.com)|162.125.66.1|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://dl.dropboxusercontent.com/content_link/f7SJbHhRRHDSDw0F8tgnMGwRlQ46sy7ZPtS3dPVJwtjkFVgnJLzIypa6H98B9l2V/file?dl=1 [following]
--2017-11-06 19:07:19--  https://dl.dropboxusercontent.com/content_link/f7SJbHhRRHDSDw0F8tgnMGwRlQ46sy7ZPtS3dPVJwtjkFVgnJLzIypa6H98B9l2V/file?dl=1
Resolving dl.dropboxusercontent.com (dl.dropboxusercontent.com)... 162.125.66.6, 2620:100:6022:6::a27d:4206
Connecting to dl.dropboxusercontent.com (dl.dropboxusercontent.com)|162.125.66.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6211 (6,1K) [application/binary]
Saving to: `mieszkania.csv'


2017-11-06 19:07:21 (612 MB/s) - saved `mieszkan

In [2]:
!head mieszkania.csv mieszkania_test.csv

==> mieszkania.csv <==
m2,dzielnica,ilość_sypialni,ilość_łazienek,rok_budowy,parking_podziemny,cena
104,mokotowo,2,2,1940,1,780094
43,ochotowo,1,1,1970,1,346912
128,grodziskowo,3,2,1916,1,523466
112,mokotowo,3,2,1920,1,830965
149,mokotowo,3,3,1977,0,1090479
80,ochotowo,2,2,1937,0,599060
58,ochotowo,2,1,1922,0,463639
23,ochotowo,1,1,1929,0,166785
40,mokotowo,1,1,1973,0,318849

==> mieszkania_test.csv <==
m2,dzielnica,ilość_sypialni,ilość_łazienek,rok_budowy,parking_podziemny,cena
71,wolowo,2,2,1912,1,322227
45,mokotowo,1,1,1938,0,295878
38,mokotowo,1,1,1999,1,306530
70,ochotowo,2,2,1980,1,553641
136,mokotowo,3,2,1939,1,985348
128,wolowo,3,2,1983,1,695726
23,grodziskowo,1,1,1975,0,99751
117,mokotowo,3,2,1942,0,891261
65,ochotowo,2,1,2002,1,536499


In [11]:
import math

import pandas

def get_dataset(path):
    with open(path) as flats:
        data = pandas.read_csv(flats)
    #data = data.sample(frac=1)
    return data

def get_training_dataset():
    data_path = 'mieszkania.csv'
    return get_dataset(data_path)

def get_testing_dataset():
    data_path = 'mieszkania_test.csv'
    return get_dataset(data_path)

def l2_loss(ys, ps):
    """
    Root mean squared error (RMSE).
    :param ys: ground truth prices
    :param ps: predicted prices
    """
    # quicker solution:
    # return np.linalg.norm(np.asarray(ys) - np.asarray(ps)) / np.sqrt(len(ys))
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, ps)) / len(ys))

dataset = get_training_dataset()
ys = dataset['cena']

print(dataset[:10])

    m2    dzielnica  ilość_sypialni  ilość_łazienek  rok_budowy  \
0  104     mokotowo               2               2        1940   
1   43     ochotowo               1               1        1970   
2  128  grodziskowo               3               2        1916   
3  112     mokotowo               3               2        1920   
4  149     mokotowo               3               3        1977   
5   80     ochotowo               2               2        1937   
6   58     ochotowo               2               1        1922   
7   23     ochotowo               1               1        1929   
8   40     mokotowo               1               1        1973   
9  138     mokotowo               3               2        2011   

   parking_podziemny     cena  
0                  1   780094  
1                  1   346912  
2                  1   523466  
3                  1   830965  
4                  0  1090479  
5                  0   599060  
6                  0   463639  
7     

In [8]:
def get_districts_set():
    return frozenset(dataset['dzielnica'])

districts = get_districts_set()

print(districts)

frozenset({'ochotowo', 'wolowo', 'grodziskowo', 'mokotowo'})


In [10]:
import numpy as np

def get_features(data=dataset):
    # Scale the area and the building age to roughly the [0, 1] range.
    m2 = [item / 200 for item in data['m2']]
    bedrooms = data['ilość_sypialni']
    bathrooms = data['ilość_łazienek']
    construction_year = [(2017 - year) / 100 for year in data['rok_budowy']]
    parking_lot = data['parking_podziemny']
    # One-hot encode the district.
    district_features = [np.array(data['dzielnica'] == district, dtype=float) for district in districts]

    # Average price per square meter in each district, always computed from the
    # training set (the global `dataset`), never from `data`.
    # Working on new Series avoids pandas' SettingWithCopyWarning.
    price_per_meter = dataset['cena'] / dataset['m2']
    average_district_prices_per_meter = price_per_meter.groupby(dataset['dzielnica']).mean()

    average_meter_price_feature = np.array([area * average_district_prices_per_meter[district]
                                            for area, district in zip(data['m2'], data['dzielnica'])])
    average_meter_price_feature /= np.mean(average_meter_price_feature)

    # features
    xs = np.array([m2, bedrooms, bathrooms, construction_year, parking_lot] + district_features +
                  [average_meter_price_feature]).T
    return xs

xs = get_features(dataset)

print(xs.shape)

(200, 10)



In [16]:
from sklearn.linear_model import LinearRegression
import numpy as np
import math

X = xs
regr = LinearRegression()
regr.fit(X, ys) # training

print(regr.score(X, ys), 'r2_score on training')

sk_loss = l2_loss(ys, regr.predict(X))
print(sk_loss, 'l2_loss on training')

0.989255862116 r2_score on training
28160.77291537937 l2_loss on training


In [17]:
testing_data = get_testing_dataset()
ys_testing = testing_data['cena']
xs_testing = get_features(testing_data)

print(regr.score(xs_testing, ys_testing), 'r2_score on testing')
print(l2_loss(ys_testing, regr.predict(xs_testing)), 'l2_loss on testing')

0.923261962644 r2_score on testing
80204.34508809487 l2_loss on testing



Let's split the training data first. Our goal is to get a better model than the one above.

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(xs, ys, test_size=0.33, random_state=42)
print(y_train.shape, y_test.shape)

regr = LinearRegression()
regr.fit(X_train, y_train)

print(regr.score(X_train, y_train), 'r2_score on training')

sk_loss = l2_loss(y_train, regr.predict(X_train))
print(sk_loss, 'l2_loss on training')

(134,) (66,)
0.989313809756 r2_score on training
27276.318362422906 l2_loss on training


In [25]:
print(regr.score(X_test, y_test), 'r2_score on testing')

sk_loss = l2_loss(y_test, regr.predict(X_test))
print(sk_loss, 'l2_loss on testing')

0.987828172863 r2_score on testing
31190.588872936456 l2_loss on testing
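The scores above come from a single split with `random_state=42`. One way to see how much such an estimate depends on the particular split is to repeat it with several seeds and look at the spread. The sketch below does this on synthetic data; the arrays `X_demo` and `y_demo` and the helper `rmse` are made up for illustration and are not the flats dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def rmse(y_true, y_pred):
    """Root mean squared error, the same quantity that l2_loss computes."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

# Synthetic regression data (an assumption of this sketch): linear signal plus unit noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=200)

scores = []
for seed in range(10):
    # Same model and data each time; only the random split changes.
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_demo, y_demo, test_size=0.33, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    scores.append(rmse(y_val, model.predict(X_val)))

print("min %.3f  max %.3f  std %.3f" % (min(scores), max(scores), np.std(scores)))
```

If the spread is large relative to the difference between two candidate models, a single split is probably not enough to rank them.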


To make random-split validation reliable, a large chunk of the training data may have to be set aside. To get around this problem, one may apply cross-validation.

![alt-text](https://chrisjmccormick.files.wordpress.com/2013/07/10_fold_cv.png)

Let's now implement the method. Make sure that:
* number of partitions is a parameter,
* the method is not limited to `mieszkania.csv`,
* the method is not limited to one specific model.

Keep in mind that k-fold cross-validation is a tool for model validation, not for producing the final model; see https://stats.stackexchange.com/questions/52274/how-to-choose-a-predictive-model-after-k-fold-cross-validation

Compare your result with the scikit-learn implementation.

In [32]:
regr = LinearRegression()

def cross_validation(model, X, y, folds, loss_function):
    """
    :param model: model object that has the methods fit and predict
    :param X: training data
    :param y: target values
    :param folds: the number of folds (splits), at least 2
    :param loss_function: loss function
    :return: loss averaged over the held-out folds
    """
    X, y = np.asarray(X), np.asarray(y)
    # Split the row indices into contiguous folds (no shuffling).
    index_splits = np.array_split(np.arange(len(X)), folds)

    def single_result(i):
        # Train on every fold except the i-th, then score on the held-out i-th fold.
        train_idx = np.concatenate(index_splits[:i] + index_splits[i + 1:])
        test_idx = index_splits[i]
        model.fit(X[train_idx], y[train_idx])
        return loss_function(y[test_idx], model.predict(X[test_idx]))

    return np.average([single_result(i) for i in range(folds)])


print(cross_validation(regr, xs, ys, folds=3, loss_function=l2_loss))
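For the comparison with scikit-learn, one option is `cross_val_score` with an unshuffled `KFold`, which holds out contiguous blocks of rows in turn. The sketch below runs on synthetic data (`X_demo` and `y_demo` are assumptions of the example, not the flats dataset) and converts the `neg_mean_squared_error` scores into per-fold RMSE values, so they are on the same scale as `l2_loss`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption of this sketch).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(90, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=90)

# KFold without shuffling holds out contiguous blocks of rows, one block per fold.
cv = KFold(n_splits=3)

# scikit-learn maximizes scores, so the MSE is reported with a negative sign.
neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          cv=cv, scoring="neg_mean_squared_error")

# Flip the sign and take the square root to get one RMSE per held-out fold.
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold, rmse_per_fold.mean())
```

On the real data, the same call with `xs` and `ys` in place of the synthetic arrays should give a number to set against your own implementation, provided both use the same folds and the same loss.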


Recall that validation can sometimes be tricky, e.g. with significant class imbalance, a small number of subjects, or geographically clustered instances.

What could in theory go wrong here with random, unstratified partitions? Think about potential solutions and investigate the data in order to check whether these problems arise here.
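One possible starting point for the investigation is to count how many flats from each district land in every validation fold, first for a plain split and then for a split stratified by district. The sketch below uses a tiny made-up frame sorted by district to show the worst case; the frame `demo` and the helper `district_counts_per_fold` are hypothetical and only the column name `dzielnica` is taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical toy frame, sorted by district on purpose (not the real data).
demo = pd.DataFrame({
    "dzielnica": ["mokotowo"] * 6 + ["ochotowo"] * 3 + ["wolowo"] * 3,
    "m2": np.arange(12) * 10 + 30,
})

def district_counts_per_fold(splitter, frame, label_column="dzielnica"):
    """Return one row per validation fold with the number of flats from each district."""
    labels = frame[label_column]
    rows = []
    for _, val_idx in splitter.split(frame, labels):
        rows.append(labels.iloc[val_idx].value_counts())
    # Districts missing from a fold show up as NaN; report them as 0.
    return pd.DataFrame(rows).fillna(0).astype(int).reset_index(drop=True)

# Contiguous, unshuffled folds: some districts are missing from some folds.
plain = district_counts_per_fold(KFold(n_splits=3), demo)

# Folds stratified by district: every fold keeps the overall district proportions.
stratified = district_counts_per_fold(
    StratifiedKFold(n_splits=3, shuffle=True, random_state=0), demo)

print(plain)
print(stratified)
```

Running the same helper on the real training frame would show whether the district proportions differ noticeably between folds, which is one of the problems the question above hints at.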

In [None]:
##############################
# TODO: Investigate the data #
##############################