# Playing with cross-validation methods
The purpose of this notebook is to experiment with the cross-validation methods provided by scikit-learn.



In [1]:
# Loading the cleaned data
import pandas as pd
from sklearn import tree

train_csv = pd.read_csv('clean_data/train.csv')
train_csv.head()

Unnamed: 0,Store,Date,Weekly_Sales,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment
0,1,2010-02-05,1643690.9,39.7,2.572,0.0,0.0,0.0,0.0,0.0,211.653972,6.566
1,2,2010-02-05,2136989.46,39.7,2.572,0.0,0.0,0.0,0.0,0.0,211.653972,6.566
2,3,2010-02-05,461622.22,39.7,2.572,0.0,0.0,0.0,0.0,0.0,211.653972,6.566
3,4,2010-02-05,2135143.87,39.7,2.572,0.0,0.0,0.0,0.0,0.0,211.653972,6.566
4,5,2010-02-05,317173.1,39.7,2.572,0.0,0.0,0.0,0.0,0.0,211.653972,6.566


## K-fold
Update: at first the output looked wrong. I expected `split` to return slices of the dataframe, but it returned arrays of numbers instead.
Update 2: On closer inspection, the numbers are the row indices of each train/test split rather than the data itself, which is the intended behaviour. Note that with the default `shuffle=False` the folds are contiguous blocks of rows, so nothing is randomised here. Returning indices is less convenient, because the rows still have to be selected with those indices before they can be fed to an algorithm.


In [2]:
from sklearn.model_selection import KFold
partial_dataset_without = train_csv.loc[:,['Temperature', 'Fuel_Price']]
partial_dataset_without.head()

Unnamed: 0,Temperature,Fuel_Price
0,39.7,2.572
1,39.7,2.572
2,39.7,2.572
3,39.7,2.572
4,39.7,2.572


In [3]:
classes = train_csv['Weekly_Sales']
classes.head()

0    1643690.90
1    2136989.46
2     461622.22
3    2135143.87
4     317173.10
Name: Weekly_Sales, dtype: float64

In [4]:
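kf = KFold(n_splits=3)
# split() returns a generator of (train_indices, test_indices) pairs, one per fold
formatted_data = kf.split(partial_dataset_without.values, classes.values)
# Unpacking works only because n_splits=3 yields exactly three pairs
x, y, z = formatted_data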

In [5]:
x

(array([2055, 2056, 2057, ..., 6162, 6163, 6164]),
 array([   0,    1,    2, ..., 2052, 2053, 2054]))

In [6]:
y

(array([   0,    1,    2, ..., 6162, 6163, 6164]),
 array([2055, 2056, 2057, ..., 4107, 4108, 4109]))

In [7]:
z

(array([   0,    1,    2, ..., 4107, 4108, 4109]),
 array([4110, 4111, 4112, ..., 6162, 6163, 6164]))
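
Since `KFold.split` only yields index arrays, the rows have to be selected from them before training. The following is a minimal sketch of one way to do that, not necessarily how this project should do it. The dataframe here is a small synthetic stand-in (the values are made up), and `shuffle=True` with `random_state=0` is an assumption added so that the folds are actually randomised and repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for the notebook's data (values are made up).
features = pd.DataFrame({
    'Temperature': np.linspace(30.0, 90.0, 12),
    'Fuel_Price': np.linspace(2.5, 3.7, 12),
})
target = pd.Series(np.linspace(300000.0, 2000000.0, 12), name='Weekly_Sales')

# shuffle=True randomises which rows land in each fold;
# random_state makes the assignment repeatable.
kf = KFold(n_splits=3, shuffle=True, random_state=0)

folds = []
for train_idx, test_idx in kf.split(features):
    # split() yields positional indices, so .iloc is used to select the rows.
    fold_x_train, fold_x_test = features.iloc[train_idx], features.iloc[test_idx]
    fold_y_train, fold_y_test = target.iloc[train_idx], target.iloc[test_idx]
    folds.append((fold_x_train, fold_x_test, fold_y_train, fold_y_test))
    print(len(fold_x_train), 'train rows,', len(fold_x_test), 'test rows')
```

Each entry in `folds` then holds real dataframe and series slices that can be passed straight to an estimator's `fit` and `predict`.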

## train_test_split helper function
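Update: the function worked as expected. Unlike `KFold.split`, it returns the shuffled data itself (80% train, 20% test here) rather than indices. The next thing to test is the stratified k-fold.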

In [8]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(partial_dataset_without, classes, test_size=0.2)

In [9]:
print(len(x_train.values), 'x_train entries')

4932 x_train entries


In [14]:
x_train.head()

Unnamed: 0,Temperature,Fuel_Price
463,65.3,2.808
1678,70.96,2.725
2246,37.74,2.983
2749,69.64,3.622
363,63.18,2.719


In [13]:
x_test.head()

Unnamed: 0,Temperature,Fuel_Price
3142,83.81,3.699
2662,70.71,3.473
4792,48.25,3.51
1951,51.31,2.708
4455,45.62,3.129


In [15]:
len(x_test)

1233

In [17]:
y_train.head()

463     2121788.61
1678    1936621.09
2246     613899.15
2749     307333.62
363     1979247.12
Name: Weekly_Sales, dtype: float64

In [19]:
len(y_train)

4932

In [21]:
y_test.head()

3142     396826.06
2662     827968.36
4792    1365546.69
1951     926573.81
4455    1497462.72
Name: Weekly_Sales, dtype: float64

In [22]:
len(y_test)

1233
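
Because `train_test_split` shuffles by default, each run of the cell above produces a different split. A small sketch of making it repeatable follows. The data is synthetic and only mirrors the 6165-row count implied by the outputs above, and the `random_state=42` seed is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the notebook's data
# (4932 train + 1233 test = 6165); the values themselves are made up.
n_rows = 6165
features = pd.DataFrame({
    'Temperature': np.linspace(30.0, 90.0, n_rows),
    'Fuel_Price': np.linspace(2.5, 3.7, n_rows),
})
target = pd.Series(np.linspace(300000.0, 2000000.0, n_rows), name='Weekly_Sales')

# Fixing random_state makes the shuffle, and therefore the split, repeatable.
split_a = train_test_split(features, target, test_size=0.2, random_state=42)
split_b = train_test_split(features, target, test_size=0.2, random_state=42)

demo_x_train, demo_x_test, demo_y_train, demo_y_test = split_a
same_rows = demo_x_train.index.equals(split_b[0].index)
print(len(demo_x_train), 'train rows,', len(demo_x_test), 'test rows')
print('identical split on second call:', same_rows)
```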

## StratifiedKFold
Update: The function raises an error saying that the target type needs to be either *binary* or *multiclass*, not *continuous*. This splitter expects discrete class labels rather than continuous values such as weekly sales. The check in scikit-learn's implementation that raised the error is:

```
    def _make_test_folds(self, X, y=None):
         rng = self.random_state
         y = np.asarray(y)
         type_of_target_y = type_of_target(y)
         allowed_target_types = ('binary', 'multiclass')
         if type_of_target_y not in allowed_target_types:
             raise ValueError(
                 'Supported target types are: {}. Got {!r} instead.'.format(
                     allowed_target_types, type_of_target_y))
```

In [23]:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=10)

In [24]:
train, test = skf.split(partial_dataset_without.values, classes.values)


ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
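
One possible workaround, which is not something this notebook has tried, is to stratify on a binned version of the continuous target. The sketch below uses synthetic data (made-up values) and assumes five quantile bins via `pd.qcut`. The bin count, `shuffle=True`, and the seed are arbitrary choices. For a regression target like this one, plain `KFold` is the more usual option. Also note that `split` yields one `(train, test)` pair per fold, so it has to be iterated rather than unpacked into two names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the notebook's data (values are made up).
rng = np.random.default_rng(0)
features = pd.DataFrame({
    'Temperature': rng.uniform(30.0, 90.0, size=100),
    'Fuel_Price': rng.uniform(2.5, 3.7, size=100),
})
sales = pd.Series(rng.uniform(300000.0, 2000000.0, size=100), name='Weekly_Sales')

# Turn the continuous target into discrete labels: 5 quantile bins,
# with labels=False returning integer bin codes.
sales_bins = pd.qcut(sales, q=5, labels=False)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
for train_idx, test_idx in skf.split(features, sales_bins):
    # The bins are only used to balance the folds; a model would still be
    # trained on the original continuous sales values.
    fold_y_train = sales.iloc[train_idx]
    fold_y_test = sales.iloc[test_idx]
    fold_sizes.append((len(fold_y_train), len(fold_y_test)))

print(len(fold_sizes), 'folds')
```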