# df-and-order how-to!

## What is df-and-order anyway?

With `df-and-order`, your interactions with dataframes become clean and predictable.

Say you've been working on a project for a month and have already run a bunch of experiments.

Your working directory ended up like this:

    data/
    ├── raw_df_proj1.csv
    ├── raw_df_new_prj1.csv
    ├── cleaned_df_v1.csv
    ├── cleaned_df_the_best.csv
    ├── cleaned_df.csv
    └── cleaned_df_improved.csv

Looks familiar? :) Apart from the names, there is little to tell you how exactly those files were generated or how to reproduce the result. You could trace their origins (at least if you use a VCS), but it would be very time-consuming.

`df-and-order` was made to tackle these problems.

Every task starts with some initial, commonly raw dataframe. It could be some logs, a backend table etc. Then we play with it and transform it to finally get a nice and clean dataframe.

`df-and-order` assigns a config file to every raw dataframe. The config contains all the useful metadata and, more importantly, a declaration of every transformation performed on the dataframe. Just by looking at the config file we can tell how a transformation was done.

`df-and-order` assumes that you already have a dataframe to work with. (Unfortunately, it can't provide one for you...)

The only thing the lib wants you to do is to organize your dataframes in separate folders. The lib is config-based, so it's nice to have a folder that contains everything at once:

- the initial dataframe 

- a config for it 

- all transformed variations of the initial dataframe.

You should pick a unique identifier for each dataframe; it will serve as both the folder name and the filename of the initial dataframe.

Example of such structure:

    data/
    ├── unique_df_id_1/ - folder with all artifacts for a df with id unique_df_id_1
    │   ├── unique_df_id_1.csv - initial dataframe
    │   ├── df_config.yaml - contains metadata and declared transformations
    │   ├── transform_1_unique_df_id_1.csv - first transformed df
    │   └── transform_2_unique_df_id_1.csv - second transformed df
    ├── unique_df_id_2/ - same goes with other dataframes
    │   ├── ...
    │   └── ...
    └── unique_df_id_3/
        ├── ...
        ├── ...
        ├── ...
        └── ...
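
The naming convention above can be captured in a few path helpers. This is only a sketch of the layout described here, not the library's own code: the function names are made up for illustration, and the `{transform_id}_{df_id}` pattern is read off the example tree.

```python
import os


def df_dir(data_dir: str, df_id: str) -> str:
    """Folder that holds every artifact for one dataframe id."""
    return os.path.join(data_dir, df_id)


def initial_df_path(data_dir: str, df_id: str, df_format: str = 'csv') -> str:
    """The initial dataframe is named after its id."""
    return os.path.join(df_dir(data_dir, df_id), f'{df_id}.{df_format}')


def config_path(data_dir: str, df_id: str) -> str:
    """The config lives next to the initial dataframe."""
    return os.path.join(df_dir(data_dir, df_id), 'df_config.yaml')


def transformed_df_path(data_dir: str, df_id: str, transform_id: str,
                        df_format: str = 'csv') -> str:
    """Transformed variations are prefixed with the transform id."""
    filename = f'{transform_id}_{df_id}.{df_format}'
    return os.path.join(df_dir(data_dir, df_id), filename)


print(initial_df_path('data', 'unique_df_id_1'))
print(config_path('data', 'unique_df_id_1'))
print(transformed_df_path('data', 'unique_df_id_1', 'transform_1'))
```

With helpers like these, the only thing a caller needs to know is the dataframe id.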

---

## 0. We need a dataframe!

We are going to create it by hand!

In [None]:
import pandas as pd

In [None]:
example_df = pd.DataFrame({
    'num_col': [1,2,3,4,5],
    'str_col': ['one', 'two', 'three', 'four', 'five'],
    'date_col': ['2020-05-17', '2020-05-18', '2020-05-19', '2020-05-20', '2020-05-21'],
    'redundant_col': [0, 0, 0, 0, 0]
})
example_df

What an amazing dataframe we have! Let's choose an id for our dataframe. It can be anything, as long as it is unique within your data folder.

In [None]:
example_df_id = 'super_demo_df_2020'

Now let's create a folder for it.

In [None]:
import os
df_folder_path = os.path.join('data', example_df_id)
# exist_ok=True makes the call a no-op if the folder is already there
os.makedirs(df_folder_path, exist_ok=True)

The only thing left is to save our dataframe there.

In [None]:
filename = example_df_id + '.csv'
example_df.to_csv(os.path.join(df_folder_path, filename), index=False)

In [None]:
!ls -l data/$example_df_id

Hooray! Next step is to create a config file.

## 1. Config file

The config file contains all the metadata we find useful, as well as all the transformations we need.

`DfReader` operates in your data folder and knows where to locate all dataframes and their configs. We will create a new config using a `DfReader` instance.

In [None]:
import pandas as pd
# in case you've cloned the repo without installing the lib via pip
import sys
sys.path.append('../')
from df_and_order.df_reader import DfReader
from df_and_order.df_cache import DfCache

`DfReader` can work with any format you want by using `DfCache` subclasses. Each subclass provides the logic for saving and loading a dataframe.

See the example below, where we create a simple pandas wrapper for saving/loading csv files:

In [None]:
class CsvDfCache(DfCache):
    # just a basic wrapper around pandas csv built-in methods.
    def _save(self, df: pd.DataFrame, path: str, *args, **kwargs):
        df.to_csv(path, index=False, *args, **kwargs)

    def _load(self, path: str, *args, **kwargs) -> pd.DataFrame:
        return pd.read_csv(path, *args, **kwargs)

Just as I mentioned earlier, we first need an instance of `DfReader`.

In [None]:
# we must declare which format our dataframes are saved in
df_format = 'csv'
# can be any path you want, in our case it's 'data' folder
dir_path = 'data/'
reader = DfReader(dir_path=dir_path, format_to_cache_map={
    # DfReader now knows how to work with csv files.
    df_format: CsvDfCache()
})
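
The idea behind `format_to_cache_map` is a plain lookup from a format name to the object that knows how to handle it. The sketch below illustrates that dispatch with the standard library only. It is not how `DfReader` is implemented: `FORMAT_TO_LOADER` and `load_by_format` are hypothetical names, and rows are loaded as plain dicts rather than dataframes.

```python
import csv
import json
import os
import tempfile


def load_csv(path):
    # every row becomes a dict keyed by the header line
    with open(path, newline='') as f:
        return list(csv.DictReader(f))


def load_json(path):
    with open(path) as f:
        return json.load(f)


# hypothetical stand-in for format_to_cache_map: format name -> loader
FORMAT_TO_LOADER = {'csv': load_csv, 'json': load_json}


def load_by_format(path, df_format):
    try:
        loader = FORMAT_TO_LOADER[df_format]
    except KeyError:
        raise ValueError(f'no loader registered for format {df_format!r}')
    return loader(path)


# tiny demo file
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, 'demo.csv')
with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['num_col', 'str_col'])
    writer.writerow([1, 'one'])

rows = load_by_format(csv_path, 'csv')
print(rows)
```

Supporting one more format then means registering one more entry in the map, which is the role each `DfCache` subclass plays for `DfReader`.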

We are all set for now and ready to create a config!

In [None]:
# you may want to provide any additional information for describing a dataset
# here, as an example, we save the info about the dataset's author
metadata = {'author': 'Data Man'}
# the unique id we came up with above. 
df_id = example_df_id
# other information is already available for us
reader.create_df_config(df_id=df_id, # config will store dataframe id as well
                        initial_df_format=df_format, # in which format initial dataframe is saved
                        metadata=metadata)

Done! Let's take a look at the config file.

In [None]:
!cat data/$example_df_id/df_config.yaml

Simple as that.

## 2. Reading a dataframe

In [None]:
reader.read(df_id=df_id)

I started the section with the code right away because it's so simple and intuitive, no need for comments! :)

You just tell `DfReader` a dataframe id and you get the dataframe right back. No more hardcoded paths and mixed-up formats. Once you set up `DfReader`, everything just works.

Close your eyes and imagine how beneficial it is when working in the same repository with many fellow colleagues. No more shared notebooks with hardcoded paths leading to who-knows-how generated dataframes.

### Still not convinced df-and-order is useful? Just watch!

It's a good idea to hide all the logic behind your own subclass:

In [None]:
class AmazingDfReader(DfReader):
    def __init__(self):
        # while working in some repo, our data is usually stored in some specific
        # place we can provide a path for. Ideally you should write some path generator
        # to be able to run the code from any place in your repository.
        dir_path = 'data'
        super().__init__(dir_path=dir_path, format_to_cache_map={
            # here we list all the formats we want to work with
            'csv': CsvDfCache()
        })
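
The "path generator" mentioned in the comment above could look roughly like this. It is a sketch under two assumptions that are not part of `df-and-order`: the repository root is recognised by a `.git` folder, and the data lives in a `data` folder at that root. `find_repo_root` and `data_dir_path` are illustrative names.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_repo_root(start: Optional[Path] = None, marker: str = '.git') -> Path:
    """Walk up from `start` (default: cwd) until a folder containing `marker`."""
    current = Path(start or Path.cwd()).resolve()
    for candidate in (current, *current.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f'no {marker!r} found at or above {current}')


def data_dir_path(start: Optional[Path] = None, data_folder: str = 'data') -> str:
    """Absolute path of the data folder, wherever the code is run from."""
    return str(find_repo_root(start) / data_folder)


# demo on a throwaway folder that imitates a repository
fake_repo = Path(tempfile.mkdtemp()).resolve()
(fake_repo / '.git').mkdir()
nested = fake_repo / 'notebooks' / 'experiments'
nested.mkdir(parents=True)

found = data_dir_path(start=nested)
print(found)
```

A reader subclass could then pass `data_dir_path()` as `dir_path` instead of the relative `'data'`.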

Enjoy the next cell:

In [None]:
amazing_reader = AmazingDfReader()
amazing_reader.read(df_id=df_id)

Now you see how cool it is? Anybody can use `AmazingDfReader` across the codebase in a super clean way without worrying about how it's configured!

## 3. Transforms

Very often our initial dataframe is the raw one and needs to be transformed in some way. 

e.g. we still need the initial dataframe since it contains some important information, yet we can't use it as is to fit our model. It clearly requires some changes.

`df-and-order` supports `in-memory` transformations as well as `permanent` ones. The only difference is that in the permanent case we store the resulting dataframe on disk next to the initial df. 

You can see a transformation as a combination of one or many steps.

e.g. we may want to:

- first drop the column `redundant_col`

- then convert the column `date_col` from str to date

- do it all in memory only

Behind the scenes each step is a class with a single method called `transform`. It takes a df and returns a df. Here's an intuitive example:

    class DropColsTransformStep(DfTransformStep):
        """
        Simply drops some undesired columns from a dataframe.
        """
        def __init__(self, cols: List[str]):
            self._cols_to_drop = cols

        def transform(self, df: pd.DataFrame) -> pd.DataFrame:
            return df.drop(self._cols_to_drop, axis=1)
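
To see how several such steps combine into one transformation, here is a dependency-free sketch. A plain dict of columns stands in for a dataframe, and `DropCols`, `ParseDates` and `apply_steps` are illustrative names rather than library classes. The point is only that each step takes a table and returns a table, so steps can be applied one after another.

```python
from datetime import date
from typing import Dict, List

Table = Dict[str, list]  # column name -> values; stand-in for a dataframe


class DropCols:
    def __init__(self, cols: List[str]):
        self._cols = cols

    def transform(self, table: Table) -> Table:
        # returns a new table, the input is left untouched
        return {name: values for name, values in table.items()
                if name not in self._cols}


class ParseDates:
    def __init__(self, cols: List[str]):
        self._cols = cols

    def transform(self, table: Table) -> Table:
        return {
            name: [date.fromisoformat(v) for v in values] if name in self._cols else values
            for name, values in table.items()
        }


def apply_steps(table: Table, steps) -> Table:
    # a transformation is just its steps applied in order
    for step in steps:
        table = step.transform(table)
    return table


table = {
    'num_col': [1, 2],
    'date_col': ['2020-05-17', '2020-05-18'],
    'redundant_col': [0, 0],
}
result = apply_steps(table, [DropCols(['redundant_col']), ParseDates(['date_col'])])
print(result)
```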

Then we wrap it in the `DfTransformStepConfig` class, which doesn't perform the transformation but only describes the step.

The easiest way to initialize `DfTransformStepConfig` is by passing the `DfTransformStep` subclass type along with its init parameters:

    DfTransformStepConfig.from_step_type(step_type=DropColsTransformStep,
                                         params={'cols': ['redundant_col']}),
                                         
Important note here:

The `DfTransformStep` subclass should be stored in a separate Python file, not in a notebook.

Otherwise, `df-and-order` will not be able to locate it.

Another way is to provide the full module path of your `DfTransformStep` subclass, including the class name. Choose whatever suits you.

    DfTransformStepConfig(module_path='df_and_order.steps.DropColsTransformStep',
                          params={'cols': ['redundant_col']}),

In both cases `params` will be passed to the init method of the specified `DfTransformStep` subclass.
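
The following sketch shows one common way to turn a dotted path plus `params` back into an object with `importlib`. Whether `df-and-order` does exactly this internally is an assumption. `resolve_class` and `build_step` are hypothetical helpers, and `fractions.Fraction` is used only because it is importable everywhere.

```python
import importlib


def resolve_class(module_path: str):
    """Split 'package.module.ClassName' and import the class."""
    module_name, _, class_name = module_path.rpartition('.')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def build_step(module_path: str, params: dict):
    """Instantiate the class, passing params to its init method."""
    return resolve_class(module_path)(**params)


# any importable class works the same way
frac = build_step('fractions.Fraction', {'numerator': 1, 'denominator': 3})
print(frac)
```

Under this assumption it is also easy to see why a step class has to live in an importable module: a class defined inside a notebook has no stable module path that `import_module` could find later.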

All the transform declarations will be written to the config file.

If it feels overwhelming, just follow the example below and everything will become clear:

We want to remove `redundant_col` since it doesn't provide any useful information and we also need to convert `date_col` to datetime. Since our dataframe is quite small, we will do all the transformations in memory, without any intermediates.

In [None]:
from df_and_order.df_transform import DfTransformConfig
from df_and_order.df_transform_step import DfTransformStepConfig
from df_and_order.steps.pd import DropColsTransformStep, DatesTransformStep

# we describe all the steps required
in_memory_steps = [
    DfTransformStepConfig.from_step_type(step_type=DropColsTransformStep,
                                         params={'cols': ['redundant_col']}),
    DfTransformStepConfig.from_step_type(step_type=DatesTransformStep,
                                         params={'cols': ['date_col']})
]

# arbitrary unique id for our transformation
example_transform_id = 'model_input'
# here's the instance of our entire transform
example_transform = DfTransformConfig(transform_id=example_transform_id, 
                                      df_format=df_format,
                                      in_memory_steps=in_memory_steps)

In [None]:
transformed_df = amazing_reader.read(df_id=df_id, 
                                     transform=example_transform)
transformed_df

In [None]:
transformed_df.info()

**Pretty rad, isn't it?**

Our transform is now visible in the config:

In [None]:
!cat data/$example_df_id/df_config.yaml

###### Note: you are free to edit the config file manually as well!

Once a transform is declared in the config file you can just pass `transform_id` to the `DfReader.read` method. See:

In [None]:
amazing_reader.read(df_id=df_id, transform_id=example_transform_id)

Maybe you want to switch to your initial dataframe? No problem! Just don't pass `transform_id`.

In [None]:
initial_df = amazing_reader.read(df_id=df_id)
initial_df

In [None]:
initial_df.info()

Finally, let's cover the case when we want to persist a transform's result. It's a good idea to remove `redundant_col` once and for all.

In [None]:
# we describe all the steps required
in_memory_steps = [
    DfTransformStepConfig.from_step_type(step_type=DatesTransformStep,
                                         params={'cols': ['date_col']})
]

# let's just move DropColsTransformStep from in_memory to permanent steps
permanent_steps = [
    DfTransformStepConfig.from_step_type(step_type=DropColsTransformStep,
                                         params={'cols': ['redundant_col']}),
]

# arbitrary unique id for our transformation
permanent_transform_id = 'model_input_permanent'
# here's the instance of our entire transform
permanent_transform = DfTransformConfig(transform_id=permanent_transform_id, 
                                        df_format=df_format,
                                        in_memory_steps=in_memory_steps,
                                        permanent_steps=permanent_steps)

In [None]:
final_df = amazing_reader.read(df_id=df_id, 
                               transform=permanent_transform)
final_df

In [None]:
!cat data/$example_df_id/df_config.yaml

In [None]:
!ls -l data/$example_df_id/

Notice that we now have the `model_input_permanent_super_demo_df_2020.csv` file stored on disk.

Every time `read` is called with this transform afterwards, the dataframe is restored from that file.

In [None]:
amazing_reader.read(df_id=df_id, 
                    transform=permanent_transform)

### Important note: `in-memory` transforms run every time you read a dataframe, whether or not it was stored on disk.
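
A small stand-alone sketch of that behaviour, using a JSON file as the persisted copy and counters to show which step runs when. It is not the library's implementation: the `read` function, the step functions and the order in which the two kinds of steps are applied on the first read are all assumptions made for illustration.

```python
import json
import os
import tempfile

calls = {'permanent': 0, 'in_memory': 0}


def permanent_step(rows):
    # result of this step gets persisted
    calls['permanent'] += 1
    return [{k: v for k, v in row.items() if k != 'redundant_col'} for row in rows]


def in_memory_step(rows):
    # result of this step is never written to disk
    calls['in_memory'] += 1
    return [dict(row, num_col=float(row['num_col'])) for row in rows]


def read(initial_rows, cache_path):
    if os.path.exists(cache_path):
        # persisted result exists: skip the permanent step
        with open(cache_path) as f:
            rows = json.load(f)
    else:
        rows = permanent_step(initial_rows)
        with open(cache_path, 'w') as f:
            json.dump(rows, f)
    # runs on every read, cached or not
    return in_memory_step(rows)


cache_path = os.path.join(tempfile.mkdtemp(), 'cached.json')
initial = [{'num_col': 1, 'redundant_col': 0}]

first = read(initial, cache_path)
second = read(initial, cache_path)
print(first, second, calls)
```

After two reads the permanent step has run once and the in-memory step twice, which is why expensive work belongs in permanent steps.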

### That's it, now you are ready to try df-and-order power in your own projects.

# Some advanced stuff

### Reacting to changes in transformations codebase

Obviously, declaring all the transformation steps in the config file doesn't prevent the code of those step subclasses from changing. Once a step is changed, the transformed dataframe on disk is outdated.

`df-and-order` has a built-in safety mechanism for avoiding such cases.

It compares the creation date of the persisted dataframe with the last modification date of each of the permanent steps. If a permanent step used to produce the dataframe was changed afterwards, the persisted dataframe can no longer be used. This is crucial when working in the same repo with others: all your team members must read the same dataframe using the same config.
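
The comparison can be sketched with file timestamps from the standard library. This is an approximation of the idea, not the library's code: `is_stale` is a hypothetical helper, and it uses the modification time of the persisted file as a portable stand-in for its creation date.

```python
import os
import tempfile


def is_stale(persisted_df_path, step_source_paths):
    """True if any step source file was modified after the dataframe was saved."""
    persisted_at = os.path.getmtime(persisted_df_path)
    return any(os.path.getmtime(path) > persisted_at for path in step_source_paths)


# demo with two empty files and explicit timestamps
tmp_dir = tempfile.mkdtemp()
df_path = os.path.join(tmp_dir, 'dummy_df.csv')
step_path = os.path.join(tmp_dir, 'steps.py')
for path in (df_path, step_path):
    with open(path, 'w') as f:
        f.write('')

os.utime(step_path, (1000, 1000))  # step code written first
os.utime(df_path, (2000, 2000))    # dataframe persisted later
stale_before_edit = is_stale(df_path, [step_path])

os.utime(step_path, (3000, 3000))  # step code touched afterwards
stale_after_edit = is_stale(df_path, [step_path])

print(stale_before_edit, stale_after_edit)
```

Note that a timestamp check reacts to any edit of the file, even one that does not change behaviour.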

Example:

In [None]:
from example_steps.steps import DummyTransformStep

In [None]:
!cat example_steps/steps.py

The transform above does literally nothing, but bear with me.

In [None]:
permanent_steps = [
    DfTransformStepConfig.from_step_type(step_type=DummyTransformStep, params={})
]
dummy_transform_id = 'dummy'
dummy_transform = DfTransformConfig(transform_id=dummy_transform_id, 
                                    df_format=df_format,
                                    permanent_steps=permanent_steps)

In [None]:
amazing_reader.read(df_id=df_id, 
                    transform=dummy_transform)

In [None]:
!cat data/$example_df_id/df_config.yaml

In [None]:
!ls -l data/$example_df_id/

Nothing new so far. But now let's change the transform step file.

In [None]:
with open('example_steps/steps.py', "a") as file:
    file.write('\n')

If we then try to read the transformed dataframe, the read fails with an error, since the code of our dummy step was modified after the dataframe was persisted.

In [None]:
amazing_reader.read(df_id=df_id, transform_id=dummy_transform_id)

There are two ways to deal with it. 

First one is to force the read operation by passing `forced=True`:

In [None]:
amazing_reader.read(df_id=df_id, transform_id=dummy_transform_id, forced=True)

It can save you time when you are sure that your data will be consistent with your expectations, yet this approach is certainly not recommended.

Yeah, it can be annoying to get such an error after some minor change, e.g. something was renamed or blank lines were removed.

But it's better to get an error than an outdated, wrong dataframe.

The second way is to remove the persisted file and read again. The dataframe is then regenerated and everything works just fine.

In [None]:
!rm data/$example_df_id/dummy_super_demo_df_2020.csv

In [None]:
amazing_reader.read(df_id=df_id, transform_id=dummy_transform_id)

### Note on in-memory transforms

If your transform consists of both in-memory and permanent steps, your in-memory steps are not allowed to change the shape of the dataframe. Remember, in-memory steps are applied every time you read a dataframe.

In [None]:
# made up example when we remove some cols in memory and then perform 
# some permanent transform step that will cause our dataframe to be persisted
in_memory_steps = [
    DfTransformStepConfig.from_step_type(step_type=DropColsTransformStep,
                                         params={'cols': ['redundant_col']}),
]
permanent_steps = [
    DfTransformStepConfig.from_step_type(step_type=DatesTransformStep,
                                         params={'cols': ['date_col']})
]

# arbitrary unique id for our transformation
bad_in_memory_transform_id = 'bad_in_memory'
# here's the instance of our entire transform
bad_in_memory_transform = DfTransformConfig(transform_id=bad_in_memory_transform_id, 
                                            df_format='csv',
                                            in_memory_steps=in_memory_steps,
                                            permanent_steps=permanent_steps)

In [None]:
amazing_reader.read(df_id=df_id, transform=bad_in_memory_transform)
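
How the library enforces this rule is not shown in this how-to. As a rough illustration of the idea only, a guard could compare the shape before and after each in-memory step. Everything below is hypothetical: a dict of columns stands in for a dataframe, and `shape`, `apply_shape_preserving` and the two step functions are made-up names.

```python
from typing import Dict, Tuple

Table = Dict[str, list]  # column name -> values; stand-in for a dataframe


def shape(table: Table) -> Tuple[int, int]:
    n_cols = len(table)
    n_rows = len(next(iter(table.values()))) if table else 0
    return n_rows, n_cols


def apply_shape_preserving(table: Table, steps) -> Table:
    for step in steps:
        before = shape(table)
        table = step(table)
        after = shape(table)
        if after != before:
            raise ValueError(
                f'in-memory step {step.__name__} changed shape from {before} to {after}'
            )
    return table


def upper_str(table: Table) -> Table:
    # changes values only, shape stays the same
    return dict(table, str_col=[s.upper() for s in table['str_col']])


def drop_redundant(table: Table) -> Table:
    # removes a column, so the shape changes
    return {k: v for k, v in table.items() if k != 'redundant_col'}


table = {'str_col': ['one', 'two'], 'redundant_col': [0, 0]}

ok = apply_shape_preserving(table, [upper_str])

try:
    apply_shape_preserving(table, [drop_redundant])
    rejected = False
except ValueError:
    rejected = True

print(ok, rejected)
```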