# Simple Use Case

We have all copied this code at least once in our lives:

```python
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
```

It's not that long, but it's still a bit annoying to write or copy. There must be a better way to do this, and there is! With the `Dataset` class, you can do this:



In [1]:
from skdataset.dataset import Dataset
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
import numpy as np
np.set_printoptions(threshold=10)

# make some data
X, y = make_classification(
    n_samples=200, 
    n_features=4, 
    n_classes=2, 
    n_clusters_per_class=1,
    random_state=42,
)

ds = Dataset(X=X, y=y)

In [2]:
train_ds, test_ds = train_test_split(ds, test_size=0.33, random_state=42)

In [3]:
len(train_ds) + len(test_ds) == len(X)

True

This by itself is not that impressive, but it's a good start. Let's see what else we can do with this class.
We can also manipulate the data in the dataset with the `transform` method, without changing the original data.

In [4]:
def multiply_by_2(dataset: Dataset):
    dataset.X = dataset.X * 2
    return dataset

ds.transform(multiply_by_2)[:3, 'X'] 

array([[-0.63315862, -0.74435214,  1.13876335, -0.46847537],
       [-0.92523065, -0.29979364,  7.23459721, -5.21427765],
       [ 2.60788321,  4.11773656,  2.7461706 , -4.11749094]])

In [5]:
ds[:3, 'X']

array([[-0.31657931, -0.37217607,  0.56938167, -0.23423769],
       [-0.46261533, -0.14989682,  3.61729861, -2.60713883],
       [ 1.3039416 ,  2.05886828,  1.3730853 , -2.05874547]])
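
The two outputs above show that `transform` returned doubled values while `ds` kept the original ones. One way such behaviour could be obtained is to hand the function a copy of the dataset. The following is only a minimal sketch of that idea: `TinyDataset` is a hypothetical stand-in written for illustration, not the library's actual implementation.

```python
import copy

import numpy as np


class TinyDataset:
    """Hypothetical stand-in for a dataset that holds named arrays."""

    def __init__(self, **variables):
        for name, value in variables.items():
            setattr(self, name, value)

    def transform(self, func):
        # Pass a shallow copy to func, so rebinding an attribute inside func
        # (e.g. dataset.X = dataset.X * 2) does not touch this object.
        return func(copy.copy(self))


def multiply_by_2(dataset):
    # dataset.X * 2 builds a new array, which is then bound on the copy only.
    dataset.X = dataset.X * 2
    return dataset


original = TinyDataset(X=np.array([[1.0, 2.0], [3.0, 4.0]]), y=np.array([0, 1]))
doubled = original.transform(multiply_by_2)

print(doubled.X)   # doubled values
print(original.X)  # original values, unchanged
```

Note that a shallow copy only protects against rebinding. A function that modifies an array in place (for example `dataset.X *= 2`) would still change the original in this sketch, so a deep copy would be needed to guard against that.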

You can also filter rows based on a condition, and the filter is applied to every variable in the dataset:


In [6]:
import numpy as np

np.set_printoptions(threshold=10)

In [7]:
ds.filter(lambda x: x.y == 1)

{'X': array([[-3.16579308e-01, -3.72176068e-01,  5.69381673e-01,
         -2.34237685e-01],
        [ 1.30394160e+00,  2.05886828e+00,  1.37308530e+00,
         -2.05874547e+00],
        [ 2.34871791e+00,  3.66332221e+00,  2.15367981e+00,
         -3.44843438e+00],
        ...,
        [ 4.57034850e-01,  7.10998715e-01,  4.06047306e-01,
         -6.60427991e-01],
        [ 2.95498748e-03,  2.41383787e-01,  1.67668118e+00,
         -1.36553500e+00],
        [ 5.82578896e-01,  8.75518424e-01,  2.99929732e-01,
         -6.64855018e-01]]),
 'y': array([1, 1, 1, ..., 1, 1, 1])}

The obvious benefit is that all of your data stays together, so you don't have to filter one part and then map the same filter onto the others (e.g. filtering some rows of `X` and then applying the same filter to `y` or `sample_weight`).
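
For comparison, this is roughly what the manual bookkeeping looks like with plain NumPy arrays. The arrays below are made-up illustration data, not the dataset used above.

```python
import numpy as np

# Three arrays that share the same row axis (illustrative values only).
X = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
y = np.array([1, 0, 1, 0])
sample_weight = np.array([1.0, 2.0, 3.0, 4.0])

# Build the condition once...
mask = y == 1

# ...then remember to apply it to every array by hand.
X_pos = X[mask]
y_pos = y[mask]
sample_weight_pos = sample_weight[mask]

print(X_pos.shape, y_pos, sample_weight_pos)
```

Forgetting one of those three indexing lines leaves the arrays with mismatched lengths, which is the kind of mistake a single `filter` call avoids.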

## DatasetDict

Here we introduce `DatasetDict`, a dictionary-like object that holds multiple `Dataset` objects. It's very useful when you have multiple datasets (e.g. train, val and test) that you want to keep together.

You can create one by just passing a dict with some keys and `Dataset` objects as values, or by calling `split` with a splitter function on your dataset. Here is an example:

In [8]:
from skdataset import DatasetDict

ds_dict = DatasetDict({'train': train_ds, 'test': test_ds})

ds_dict.keys()

dict_keys(['train', 'test'])

In [9]:

ds_dict = ds.split(train_test_split, test_size=0.33, random_state=42)

ds_dict.keys()

dict_keys(['train', 'test'])

This new object has some handy attributes like `X_train` or `y_test`, which are all automatically generated from the keys in the dict:

In [10]:
ds_dict.X_train

array([[ 1.08384298,  1.63367133,  0.59219328, -1.26471893],
       [ 2.0629089 ,  3.58425588,  4.48422581, -5.13700395],
       [ 1.39912667,  2.25751204,  1.81515089, -2.48699264],
       ...,
       [ 0.52175103,  0.68583542, -0.4261365 , -0.03049782],
       [ 0.24602058,  0.69302302,  2.41232056, -2.1393599 ],
       [ 1.64450901,  2.64268527,  2.05746083, -2.86133886]])

which is equivalent to `ds_dict['train']['X']`.
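
Attributes of this kind can be generated on the fly with Python's `__getattr__` hook. The sketch below shows one possible approach using a hypothetical `TinyDatasetDict` built on plain dicts; it is an illustration of the idea and makes no claim about how `DatasetDict` is actually implemented.

```python
class TinyDatasetDict(dict):
    """Hypothetical dict of splits; each split is a dict of named variables."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        # Resolve "<variable>_<split>" to self[split][variable].
        for split, variables in self.items():
            suffix = f"_{split}"
            if name.endswith(suffix):
                variable = name[: -len(suffix)]
                if variable in variables:
                    return variables[variable]
        raise AttributeError(name)


splits = TinyDatasetDict(
    train={"X": [[1, 2], [3, 4]], "y": [0, 1]},
    test={"X": [[5, 6]], "y": [1]},
)

print(splits.X_train)  # same object as splits["train"]["X"]
print(splits.y_test)   # same object as splits["test"]["y"]
```

Matching on the split name as a suffix, rather than splitting on the first underscore, keeps variable names that contain underscores (such as `sample_weight`) working in this sketch.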

Imagine you have a function that you want to apply to all of your splits. Instead of looping over them, you can call `ds_dict.transform(func)`, which applies the function to every split and returns a new `DatasetDict` object:

In [11]:
ds_dict.transform(multiply_by_2)

{'train': {'X': array([[  2.16768595,   3.26734266,   1.18438656,  -2.52943786],
         [  4.1258178 ,   7.16851176,   8.96845162, -10.27400789],
         [  2.79825335,   4.51502408,   3.63030177,  -4.97398527],
         ...,
         [  1.04350205,   1.37167084,  -0.852273  ,  -0.06099565],
         [  0.49204117,   1.38604604,   4.82464112,  -4.27871979],
         [  3.28901801,   5.28537054,   4.11492167,  -5.72267771]]),
  'y': array([1, 1, 1, ..., 0, 0, 1])},
 'test': {'X': array([[ 2.6358973 ,  4.3559125 ,  4.14682446, -5.27668074],
         [ 1.75768303,  3.10428922,  4.17675655, -4.66643081],
         [-2.29434589, -3.29988447, -0.13391834,  1.76676886],
         ...,
         [-0.0109273 ,  0.97612538,  7.01157169, -5.69359055],
         [-1.82972602, -2.19872701,  2.9538225 , -1.07976942],
         [ 2.58931623,  3.67742897, -0.17904181, -1.72542688]]),
  'y': array([1, 1, 0, ..., 0, 0, 0])}}
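
For comparison, the explicit loop that a single `transform` call replaces might look like the sketch below. It uses plain dicts and a hypothetical helper named `transform_all`, so it only illustrates the pattern rather than the library's code.

```python
def transform_all(splits, func):
    """Apply func to every split and return a new mapping with the same keys."""
    return {name: func(split) for name, split in splits.items()}


def multiply_by_2(split):
    # Build a new split instead of modifying the input.
    return {
        "X": [[value * 2 for value in row] for row in split["X"]],
        "y": split["y"],
    }


splits = {
    "train": {"X": [[1, 2]], "y": [0]},
    "test": {"X": [[3, 4]], "y": [1]},
}
doubled = transform_all(splits, multiply_by_2)

print(doubled["train"]["X"])  # transformed copy
print(splits["train"]["X"])   # input left as it was
```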

## Put It All Together

Let's see how we can use all of these together. Without `Dataset`, it would look something like this:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200, 
    n_features=4, 
    n_classes=2, 
    n_clusters_per_class=1,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print('Train score:', model.score(X_train, y_train))
print('Test score:', model.score(X_test, y_test))
```

With `Dataset`, the same workflow becomes:

In [12]:
from skdataset.dataset import Dataset
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=200, 
    n_features=4, 
    n_classes=2, 
    n_clusters_per_class=1,
    random_state=42,
)

ds_dict = Dataset(X=X, y=y).split(train_test_split, test_size=0.33, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(**ds_dict['train'])

print('Train score:', model.score(**ds_dict['train']))
print('Test score:', model.score(**ds_dict['test']))

Train score: 1.0
Test score: 0.8181818181818182
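
The `**ds_dict['train']` calls above rely on ordinary Python keyword unpacking: any object that provides `keys()` and `__getitem__` can be unpacked with `**`, and the keys `X` and `y` then line up with the `X` and `y` parameters of `fit` and `score`. The following sketch demonstrates the mechanism with two hypothetical classes, `TinySplit` and `MajorityClassifier`, which are assumptions made for illustration and are not part of skdataset or scikit-learn.

```python
class TinySplit:
    """Hypothetical container exposing just enough for ** unpacking."""

    def __init__(self, X, y):
        self._variables = {"X": X, "y": y}

    def keys(self):
        # ** unpacking asks for the keys first...
        return self._variables.keys()

    def __getitem__(self, key):
        # ...and then looks up each key to build the keyword arguments.
        return self._variables[key]


class MajorityClassifier:
    """Toy model whose fit and score take X and y, like scikit-learn estimators."""

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def score(self, X, y):
        # Fraction of labels equal to the remembered majority label.
        return sum(label == self.majority_ for label in y) / len(y)


train = TinySplit(X=[[0], [1], [2]], y=[1, 1, 0])

# Equivalent to MajorityClassifier().fit(X=train["X"], y=train["y"]).
model = MajorityClassifier().fit(**train)
print(model.score(**train))
```

One consequence of this design is that every key in the unpacked object must be a parameter the method accepts. A key the method does not know would raise a `TypeError`, as with any other unexpected keyword argument.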
