# Train/test DRY management
How do you keep the train and test datasets separate while still writing [DRY](https://en.wikipedia.org/wiki/Don%27t_repeat_yourself) code?
This is something I have been thinking about for a while, and I would be very pleased if anyone could comment or give some advice.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

## Fake some data

In [None]:
X_, y_ = make_regression(n_samples=1000,
    n_features=10,
    n_informative=5,
    n_targets=1,
    bias=0.0,
    noise=0.3,
    shuffle=True,
    random_state=0)

## Train-test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_, y_, test_size=0.33, random_state=0)

Now the data is split into four separate items: **X_train**, **X_test**, **y_train**, **y_test**, which makes it harder to keep the code DRY. For instance, if we want to remove the first 10 rows of the training data, we need to remove them from both **X_train** and **y_train**:

## Removing the first 10 rows (we suspect that they are wrong somehow)

In [None]:
X_train_ = X_train[10:]
y_train_ = y_train[10:]

Having to repeat this operation for both X and y is not very DRY. To handle this more elegantly, we can combine **X** and **y** in the same DataFrame (we will of course need to separate them again at a later stage).

In [None]:
train = pd.DataFrame(X_train)
train['y'] = y_train

test = pd.DataFrame(X_test)
test['y'] = y_test

Now we can handle both **X** and **y** at the same time (more DRY):

In [None]:
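# Take an explicit copy so that later column assignments do not act on a slice of the original frame
train = train.iloc[10:].copy()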
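When it is time to fit a model, the combined frame can be split back into features and target. A minimal sketch, assuming the target column is named `y` as above; the small `demo` frame is made up so that the cell runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the combined `train` frame built above
demo = pd.DataFrame(np.arange(12).reshape(4, 3))
demo['y'] = [0.1, 0.2, 0.3, 0.4]

# Drop rows once, on the combined frame
demo = demo.iloc[1:]

# Separate features and target again right before modelling
X_demo = demo.drop(columns='y')
y_demo = demo['y']

print(X_demo.shape, y_demo.shape)
```

Because both pieces come from the same frame, their row indices stay aligned without any extra bookkeeping.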

## Adding a column...
But what if we want to add a new column? We must do this for both the training data and the test data...
Perhaps we want to invent a new feature called **helper**:

In [None]:
train['helper'] = np.arange(0,len(train))
test['helper'] = np.arange(0,len(test))

So here we need to define **helper** twice, repeating some code, which is not good...
One solution I have seen is to put the **train** and **test** data in a list:

In [None]:
datasets = [train, test]

Then you can iterate over this list (even more DRY):

In [None]:
for dataset in datasets:
    dataset['helper'] = np.arange(0,len(dataset))

In [None]:
datasets[1].head()

And if you modify a column in **test** in place:

In [None]:
test['helper'] = 1

...this will now be visible in the **datasets** list:

In [None]:
datasets[1].head()

**BUT!** If you redefine the test dataframe (which you sometimes need to do, or do by mistake), for instance by replacing it with an empty dataframe:

In [None]:
test = pd.DataFrame()

...this operation "loses" the connection to the **datasets** list, since the list still contains the "old" test dataframe:

In [None]:
datasets[1].head()
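The difference between the two cases is in-place modification versus rebinding a name. A small self-contained sketch of what happens (the names `a` and `frames` are only for illustration):

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2, 3]})
frames = [a]

# In-place change: the list holds the same object, so it sees the new column
a['helper'] = 0
same_object = frames[0] is a
has_helper = 'helper' in frames[0].columns

# Rebinding: the name `a` now points to a new object, but the list
# still holds a reference to the old one
a = pd.DataFrame()
still_same = frames[0] is a
old_rows = len(frames[0])

print(same_object, has_helper, still_same, old_rows)  # True True False 3
```

The same applies to any statement of the form `test = <something new>`, including filtering steps such as `test = test.iloc[10:]`.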

## Better way?
So we have shown that the "list approach" above can be powerful, but that it is not a bulletproof solution.
Is this a good way to keep your data handling code DRY? Does it have any drawbacks, or is there an even better way?

I think it would be nice with a solution where **train**/**test** can either be handled separately or as one combined unit. 
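One possible direction, sketched here rather than offered as a definitive answer: put each transformation in a function and keep the frames in a dict, always reassigning the dict entries instead of separate variable names. The function name `add_helper` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def add_helper(df):
    """Return a copy of df with a running-index 'helper' column."""
    return df.assign(helper=np.arange(len(df)))

# Toy stand-ins for the real train/test frames
frames = {
    'train': pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [0, 1, 0]}),
    'test': pd.DataFrame({'x': [4.0, 5.0], 'y': [1, 1]}),
}

# Apply the same step to every split; results are read back from the dict
frames = {name: add_helper(df) for name, df in frames.items()}

# A single split can still be handled on its own
frames['train'] = frames['train'].iloc[1:]

print(frames['train'])
print(frames['test'])
```

Since the dict is the only place the frames live, there is no second name that can drift out of sync the way `test` and `datasets[1]` did.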

The nicest solution would be to have everything in the same dataframe, but I cannot figure out how to get that to work:

In [None]:
data = pd.DataFrame()
train_ = pd.DataFrame(train)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
data = pd.concat([data, train_])

In [None]:
data.head()

In [None]:
train['helper'] = 2

In [None]:
train.head()

In [None]:
data.head()
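For what it is worth, one way to keep everything in a single frame is to stack the splits with `pd.concat` and label them through its `keys` argument. Note that `pd.concat` copies the data, just like the attempt above, so the combined frame has to become the single source of truth and the splits are read back out of it when needed. A sketch with toy frames (the level names `split` and `row` are my own choice):

```python
import pandas as pd

# Toy stand-ins for the real train/test frames
train_demo = pd.DataFrame({'x': [1.0, 2.0, 3.0], 'y': [0, 1, 0]})
test_demo = pd.DataFrame({'x': [4.0, 5.0], 'y': [1, 1]})

# One frame with a two-level index: (split, row)
data = pd.concat([train_demo, test_demo],
                 keys=['train', 'test'],
                 names=['split', 'row'])

# A feature defined once for both splits
data['x_squared'] = data['x'] ** 2

# A per-split running index, like the `helper` column earlier
data['helper'] = data.groupby(level='split').cumcount()

# Read a single split back out when it must be handled separately
train_again = data.loc['train']
test_again = data.loc['test']

print(data)
```

The trade-off is that split-specific steps, such as dropping rows from the training part only, have to be expressed against the combined index instead of a standalone `train` variable.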