# Example: Train/test/evaluate pipeline with `BCDict`

In [2]:
from pprint import pprint
import math
import pandas as pd
import numpy as np
from typing import Collection
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

import bcdict
from bcdict import BCDict

np.set_printoptions(precision=2)
pd.options.display.precision = 2

# Generate random data

Let's start by generating some random data.

First of all, a function that returns a random DataFrame with 4 feature columns and one target column:

In [3]:
np.random.seed(42)

def get_random_data() -> pd.DataFrame:
    """Just create some random data."""
    columns = list("ABCD") + ["target"]
    nrows = np.random.randint(10, 25)
    df = pd.DataFrame(
        np.random.random((nrows, len(columns))) + 0.01, 
        columns=columns,
    )
    return df

We will work with three different datasets:

In [4]:
keys = ["apples", "pears", "bananas"]

## First BCDict magic

Now, generate a dictionary with 3 entries of random data.

The `bootstrap()` function calls a function for every item in a list and returns a BCDict:

In [5]:
dfs = bcdict.bootstrap(keys, get_random_data)

`dfs` is a broadcast dict with keys apples, pears and bananas.

Its values are dataframes of random values.
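
As a rough mental model, and assuming `bootstrap()` simply calls the given function once per key without arguments (which is how it is used above), the call behaves much like a dict comprehension. The sketch below uses only the standard library; `make_record` is a hypothetical stand-in for `get_random_data`:

```python
import random

random.seed(42)


def make_record() -> list[float]:
    """Hypothetical stand-in for get_random_data(): a short list of random floats."""
    n = random.randint(3, 6)
    return [random.random() for _ in range(n)]


keys = ["apples", "pears", "bananas"]

# One independent call per key, so every key gets its own random data.
records = {key: make_record() for key in keys}

for key, values in records.items():
    print(key, len(values))
```

The difference is that `bootstrap()` returns a `BCDict` rather than a plain `dict`, which is what enables the broadcasting shown in the rest of this notebook.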

We can now call any method of the values directly on the BCDict.

The method is called on every value of the dictionary, and the results come back as a dictionary with the same keys.

Let's try with the `head()` function:

In [None]:
pprint(dfs.head(3))

We can also access attributes the same way. The following line returns the `shape` attribute of every value in the dictionary:

In [None]:
dfs.shape
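
To make the mechanism less magical, here is a minimal sketch of how attribute and method broadcasting can be built on `__getattr__`. `TinyBroadcastDict` and `Table` are hypothetical names invented for this illustration. This is not how `bcdict` is implemented, only a toy with similar behaviour:

```python
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical stand-in for a DataFrame: a list of equally long rows."""

    rows: list

    @property
    def shape(self) -> tuple:
        ncols = len(self.rows[0]) if self.rows else 0
        return (len(self.rows), ncols)

    def head(self, n: int = 5) -> "Table":
        return Table(self.rows[:n])


class TinyBroadcastDict(dict):
    """Toy broadcast dict (an assumption for illustration, not bcdict's code)."""

    def __getattr__(self, name):
        # Reached only when `name` is not a regular dict attribute.
        attrs = {key: getattr(value, name) for key, value in self.items()}
        if attrs and all(callable(attr) for attr in attrs.values()):
            # Methods: return a function that calls each one with the same arguments.
            def call_all(*args, **kwargs):
                return TinyBroadcastDict(
                    {key: attr(*args, **kwargs) for key, attr in attrs.items()}
                )

            return call_all
        # Plain attributes: return them keyed like the original dict.
        return TinyBroadcastDict(attrs)


tables = TinyBroadcastDict(
    apples=Table([[1, 2], [3, 4], [5, 6]]),
    pears=Table([[7, 8], [9, 10]]),
)

print(tables.shape)          # attribute access on every value
print(tables.head(1).shape)  # method call on every value, then attribute access
```

Because the results are wrapped in the same class again, calls can be chained, just like `dfs.head(3)` followed by further attribute access.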

# Indexing and column selection

We can also slice all values in the dictionary at once.

We'll use this here to get a dictionary of Series holding the target column, and a dictionary of DataFrames holding all features (`y` and `X` in sklearn terminology).

Here we select the 'target' column and save it in `y`:


In [None]:
y = dfs['target']
y.shape

And we get all `X` dataframes by dropping the target column:

In [None]:
X = dfs.drop(columns="target")
X.shape

# Split the data into train and test

Using the `apply()` function we can apply arbitrary functions on the dictionaries:

In [None]:
from sklearn.model_selection import train_test_split

splits = bcdict.apply(train_test_split, X, y)

Each entry in the dictionary now contains a list with X_train, X_test, y_train, y_test:

In [None]:
splits['apples']
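
Note that square brackets have now been used in two ways: `dfs['target']` indexed every value, while `splits['apples']` returned a single entry. One rule consistent with both uses is "dictionary keys win, everything else is forwarded to the values". The sketch below implements that rule in a hypothetical `lookup()` helper on plain dictionaries. It is an assumption drawn from the behaviour shown in this notebook, not a copy of the library's code:

```python
def lookup(data: dict, item):
    """Hypothetical helper: return data[item] if it is a key, else index every value."""
    if item in data:
        return data[item]
    return {key: value[item] for key, value in data.items()}


# Plain nested dicts stand in for the DataFrames.
tables = {
    "apples": {"A": [0.1, 0.2], "target": [1.0, 2.0]},
    "pears": {"A": [0.3, 0.4], "target": [3.0, 4.0]},
}

print(lookup(tables, "apples"))  # a dictionary key: one entry comes back
print(lookup(tables, "target"))  # not a key: the lookup goes to every value
```

Under such a rule, a column named like one of the dictionary keys would be shadowed by the key, so it is worth keeping the two namespaces distinct.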

## Unpacking dictionaries

A dictionary with a tuple or a list in each value can be unpacked.

So instead of one dictionary with tuples of 4 values we get 4 separate dictionaries:

In [None]:
X_train, X_test, y_train, y_test = splits.unpack()
X_train.shape, y_train.shape, X_test.shape, y_test.shape
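
Conceptually, unpacking transposes the structure: a dictionary of 4-tuples becomes a 4-tuple of dictionaries. A minimal sketch on plain dictionaries, where `unpack_sketch` is a hypothetical helper and the strings are placeholders for the real train/test objects:

```python
def unpack_sketch(data: dict) -> tuple:
    """Hypothetical helper: turn a dict of n-tuples into n dicts with the same keys."""
    lengths = {len(value) for value in data.values()}
    if len(lengths) != 1:
        raise ValueError("all values must have the same length")
    (n,) = lengths
    return tuple({key: value[i] for key, value in data.items()} for i in range(n))


# Placeholder strings stand in for the arrays returned by train_test_split.
splits = {
    "apples": ("Xa_train", "Xa_test", "ya_train", "ya_test"),
    "pears": ("Xp_train", "Xp_test", "yp_train", "yp_test"),
}

X_train, X_test, y_train, y_test = unpack_sketch(splits)
print(X_train)
print(y_test)
```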

# Create models

Let us now create an (unfitted) linear regression model for each key. We use the `bootstrap()` function again:

In [None]:
models = bcdict.bootstrap(keys, LinearRegression)
models

... and train all three models:

In [None]:
models.fit(X_train, y_train)
pprint(models.coef_)

We have just fitted 3 models without a for loop or any code repetition!

# Make predictions...

*...and demonstrate argument broadcast*

Apply each model to the correct dataset:

In [None]:
preds = models.predict(X_test)
preds

`models` is a BCDict.

`X_test` is a dictionary with the same keys as `models`.

When calling the `predict()` function, the `X_test` argument gets *broadcast*.

The above line is equivalent to:

```python
preds = {k: model.predict(X_test[k]) for k, model in models.items()}
```

# Evaluate the predictions

In [None]:
# now we pipe the true targets and the predictions into r2_score
scores = bcdict.apply(r2_score, y_test, preds)
pprint(scores)

The `apply()` function applies a callable (in this case, `r2_score`) on each element of a BCDict.

The above line is equivalent to:

```python
scores = {k: r2_score(y_test[k], preds[k]) for k in y_test}
```

The *first* broadcast dictionary in the arguments determines the keys of the output dictionary. All other arguments are either passed on unmodified, or they are broadcast if they are also a BCDict with the same keys.
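
The rule in the previous paragraph can be sketched in a few lines of plain Python. `Broadcast`, `apply_sketch` and `scaled_error` are hypothetical names for this illustration, with `Broadcast` standing in for `BCDict`. The real `apply()` may differ in details such as error handling:

```python
class Broadcast(dict):
    """Hypothetical marker type standing in for BCDict."""


def apply_sketch(func, *args, **kwargs):
    """Call func once per key of the first Broadcast argument."""
    candidates = list(args) + list(kwargs.values())
    first = next(arg for arg in candidates if isinstance(arg, Broadcast))
    keys = list(first)

    def pick(arg, key):
        # Broadcast arguments with matching keys contribute one value per key;
        # anything else is passed to every call unchanged.
        if isinstance(arg, Broadcast) and set(arg) == set(keys):
            return arg[key]
        return arg

    return Broadcast(
        {
            key: func(
                *(pick(arg, key) for arg in args),
                **{name: pick(arg, key) for name, arg in kwargs.items()},
            )
            for key in keys
        }
    )


def scaled_error(y_true, y_pred, scale=1.0):
    """Toy metric: scaled mean absolute error."""
    return scale * sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_test = Broadcast(apples=[1.0, 2.0], pears=[3.0, 5.0])
preds = Broadcast(apples=[1.5, 2.0], pears=[2.0, 5.0])

# y_test and preds are broadcast per key; scale is passed on unmodified.
errors = apply_sketch(scaled_error, y_test, preds, scale=2.0)
print(errors)
```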


Conclusion: we did not need a single for loop or dict comprehension to train 3 models, make predictions, and evaluate them on 3 datasets :)

## Cross validation

Of course, we can also run cross-validation on all our datasets:

In [39]:
from sklearn.model_selection import cross_val_score
models = bcdict.bootstrap(keys, LinearRegression)
res = bcdict.apply(cross_val_score, models, X, y, cv=3)
pprint(res)

{'apples': array([-1.99, -1.96, -0.38]),
 'bananas': array([-0.91, -2.28, -1.55]),
 'pears': array([-6.94, -2.62, -0.59])}
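
The negative scores are expected here. `cross_val_score` uses the estimator's default score, which for `LinearRegression` is R², and our features and target are independent random numbers. A model fitted on such data tends to predict held-out targets worse than simply predicting their mean, and R² = 1 - SS_res / SS_tot drops below zero in that case. A minimal standard-library sketch of the formula (the function name `r2` and the numbers are made up for illustration):

```python
def r2(y_true: list, y_pred: list) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0]

print(r2(y_true, [1.0, 2.0, 3.0]))  # perfect predictions
print(r2(y_true, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r2(y_true, [3.0, 2.0, 1.0]))  # worse than predicting the mean
```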


# Conclusion

We just created a pipeline to train a model, generate predictions *and* validate the model for three datasets.

And we did that without writing a single for-loop!