# df-and-order how-to!

### Important note

`df-and-order` assumes that you already have a dataframe to work with. 

The only thing the library asks of you is to keep each dataframe in its own folder. The library is config-based, so it's nice to have a single folder that contains everything at once:

- the initial dataframe 

- a config for it 

- all transformed variations of the initial dataframe.

You should pick a unique identifier for each dataframe. It will serve as both the folder name and the filename of the initial dataframe.

Example of such structure:

    data/
    ├── unique_df_id_1/ - folder with all artifacts for df with id unique_df_id_1
    │   ├── unique_df_id_1.csv - initial dataframe
    │   ├── df_config.yaml - contains metadata and declared transformations
    │   ├── transform_1_unique_df_id_1.csv - first transformed df
    │   └── transform_2_unique_df_id_1.csv - second transformed df
    ├── unique_df_id_2/ - same goes with other dataframes
    │   ├── ...
    │   └── ...
    └── unique_df_id_3/
        ├── ...
        ├── ...
        ├── ...
        └── ...
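The naming convention behind this layout can be captured in a few small helpers. This is only a sketch: the function names and the `data` root folder are hypothetical, and the `<transform_id>_<df_id>` pattern for transformed files is inferred from the file listings in this guide, not taken from the library's API.

```python
import os

# config filename used throughout this guide
CONFIG_FILENAME = 'df_config.yaml'


def df_dir(root: str, df_id: str) -> str:
    """Folder that holds every artifact of the dataframe with the given id."""
    return os.path.join(root, df_id)


def initial_df_path(root: str, df_id: str, df_format: str) -> str:
    """The initial dataframe is named after its id."""
    return os.path.join(df_dir(root, df_id), f'{df_id}.{df_format}')


def config_path(root: str, df_id: str) -> str:
    """The config lives next to the initial dataframe."""
    return os.path.join(df_dir(root, df_id), CONFIG_FILENAME)


def transformed_df_path(root: str, df_id: str, transform_id: str, df_format: str) -> str:
    """Assumed pattern for persisted transforms: <transform_id>_<df_id>.<format>."""
    return os.path.join(df_dir(root, df_id), f'{transform_id}_{df_id}.{df_format}')


print(initial_df_path('data', 'unique_df_id_1', 'csv'))
print(config_path('data', 'unique_df_id_1'))
print(transformed_df_path('data', 'unique_df_id_1', 'transform_1', 'csv'))
```

You never have to build these paths yourself when using the library; the point is only to show what the id-based layout buys you.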

---

## 0. We need a dataframe!

We need a dataframe so we are going to create it by hand!

In [1]:
import pandas as pd

In [2]:
example_df = pd.DataFrame({
    'num_col': [1,2,3,4,5],
    'str_col': ['one', 'two', 'three', 'four', 'five'],
    'date_col': ['2020-05-17', '2020-05-18', '2020-05-19', '2020-05-20', '2020-05-21'],
    'redundant_col': [0, 0, 0, 0, 0]
})
example_df

| | num_col | str_col | date_col | redundant_col |
|---|---|---|---|---|
| 0 | 1 | one | 2020-05-17 | 0 |
| 1 | 2 | two | 2020-05-18 | 0 |
| 2 | 3 | three | 2020-05-19 | 0 |
| 3 | 4 | four | 2020-05-20 | 0 |
| 4 | 5 | five | 2020-05-21 | 0 |


Let's choose an id for our dataframe.

In [3]:
example_df_id = 'example_df_2020'

Now let's create a folder for it.

In [4]:
import os
# exist_ok=True makes the call a no-op when the folder is already there
os.makedirs(example_df_id, exist_ok=True)

The only thing left is to save our dataframe there.

In [5]:
filename = example_df_id + '.csv'
example_df.to_csv(os.path.join(example_df_id, filename), index=False)

Hooray! Next step is to create a config file.

## 1. Config file

The config file contains all the metadata we are interested in, as well as all the transformations we need.

`DfReader` operates in your data folder and knows where to locate all dataframes and their configs. We will create a new config using a `DfReader` instance.

In [6]:
import pandas as pd
from df_and_order.df_reader import DfReader
from df_and_order.df_cache import DfCache

`DfReader` can work with any format you want by means of `DfCache` subclasses. Each subclass provides the logic for saving and loading a dataframe in one format.

See the example below:

In [7]:
class CsvDfCache(DfCache):
    # just a basic wrapper around pandas csv built-in methods.
    def _save(self, df: pd.DataFrame, path: str, *args, **kwargs):
        df.to_csv(path, index=False, *args, **kwargs)

    def _load(self, path: str, *args, **kwargs) -> pd.DataFrame:
        return pd.read_csv(path, *args, **kwargs)

We will use it for our needs.
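Other formats follow the same pattern. Below is a sketch of a pickle-based cache, assuming only the `_save`/`_load` interface shown above. `PickleDfCache` is a hypothetical name, and the tiny `DfCache` class in the snippet is a stand-in so that the example runs without the library installed; in real code you would import `DfCache` from `df_and_order.df_cache` instead.

```python
import os
import tempfile

import pandas as pd


class DfCache:
    """Stand-in for df_and_order.df_cache.DfCache (assumption: same two hooks)."""
    def _save(self, df: pd.DataFrame, path: str, *args, **kwargs):
        raise NotImplementedError

    def _load(self, path: str, *args, **kwargs) -> pd.DataFrame:
        raise NotImplementedError


class PickleDfCache(DfCache):
    # a basic wrapper around pandas pickle built-in methods.
    def _save(self, df: pd.DataFrame, path: str, *args, **kwargs):
        df.to_pickle(path, *args, **kwargs)

    def _load(self, path: str, *args, **kwargs) -> pd.DataFrame:
        return pd.read_pickle(path, *args, **kwargs)


df = pd.DataFrame({'date_col': pd.to_datetime(['2020-05-17', '2020-05-18'])})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')
    cache = PickleDfCache()
    cache._save(df, path)
    restored = cache._load(path)

# unlike csv, pickle keeps dtypes, so date_col is still a datetime column
print(restored.dtypes)
```

Such a cache could then be registered in `format_to_cache_map` next to the csv one, under a format key of your choice.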

Just as I mentioned earlier, we first need an instance of `DfReader`.

In [8]:
# we must declare with which format we want to operate
df_format = 'csv'
# empty string means we want to store dataframes in the current directory.
# can be any path you want.
dir_path = ''
reader = DfReader(dir_path=dir_path, format_to_cache_map={
    # DfReader now knows how to work with csv files.
    df_format: CsvDfCache()
})

We are all set for now and ready to create a config!

In [9]:
# you may want to provide any additional information for describing a dataset
# here, as an example, we save the info about the dataset's author
metadata = {'author': 'Data Man'}
# the unique id we came up with above. 
df_id = example_df_id
# other information is already available for us
reader.create_df_config(df_id=df_id, # config will store dataframe id as well
                        initial_df_format=df_format, # in which format initial dataframe is saved
                        transformed_df_format=df_format, # in which format to save a transformed dataframe
                        metadata=metadata)

In [10]:
!cat example_df_2020/df_config.yaml

df_id: example_df_2020
initial_df_format: csv
metadata:
  author: Data Man
transformed_df_format: csv


Simple as that.

## 2. Reading a dataframe

In [11]:
reader.read(df_id=df_id)

| | num_col | str_col | date_col | redundant_col |
|---|---|---|---|---|
| 0 | 1 | one | 2020-05-17 | 0 |
| 1 | 2 | two | 2020-05-18 | 0 |
| 2 | 3 | three | 2020-05-19 | 0 |
| 3 | 4 | four | 2020-05-20 | 0 |
| 4 | 5 | five | 2020-05-21 | 0 |


I started with the code right away because that's how simple and intuitive it is!

You just need to provide a dataframe id and you are ready to go. No more hardcoded paths and mixed-up formats. Once you set up `DfReader`, everything just works behind the scenes.

Just imagine how beneficial it is when working in the same repository with many fellow colleagues.

### Still not convinced df-and-order is useful? Just watch!

It's a good idea to hide all the logic behind your own subclass:

In [12]:
class AmazingDfReader(DfReader):
    def __init__(self):
        # while working in some repo, our data is usually stored in some specific
        # place we can provide a path for
        dir_path = ''
        super().__init__(dir_path=dir_path, format_to_cache_map={
            # here we list all the formats we want to work with
            'csv': CsvDfCache()
        })

In [13]:
amazing_reader = AmazingDfReader()
amazing_reader.read(df_id=df_id)

| | num_col | str_col | date_col | redundant_col |
|---|---|---|---|---|
| 0 | 1 | one | 2020-05-17 | 0 |
| 1 | 2 | two | 2020-05-18 | 0 |
| 2 | 3 | three | 2020-05-19 | 0 |
| 3 | 4 | four | 2020-05-20 | 0 |
| 4 | 5 | five | 2020-05-21 | 0 |


Now you see how cool it is? Anybody can use `AmazingDfReader` across the codebase in a super clean way without worrying about how it's configured!

## 3. Transforms

Very often our initial dataframe is the raw one and needs to be transformed in some way. 

E.g. the initial dataframe may contain some important information, yet we don't need all of it to fit our model.

`df-and-order` supports `in-memory` transformations as well as permanent ones. The only difference is that in the permanent case we store the resulting dataframe on disk next to the initial df. 

You can see a transformation as a combination of one or many steps.

e.g.

    - first drop column 'redundant_col'
    - then convert column 'date_col' from str to date
    Do it all in memory only
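In plain pandas, without `df-and-order`, the two steps listed above might look like the sketch below. The function names are made up for illustration; the point is only that a transformation is an ordered list of df-in, df-out steps applied one after another.

```python
import pandas as pd

df = pd.DataFrame({
    'num_col': [1, 2, 3],
    'date_col': ['2020-05-17', '2020-05-18', '2020-05-19'],
    'redundant_col': [0, 0, 0],
})


def drop_redundant(df: pd.DataFrame) -> pd.DataFrame:
    # first step: drop the column we don't need
    return df.drop(columns=['redundant_col'])


def parse_dates(df: pd.DataFrame) -> pd.DataFrame:
    # second step: convert date_col from str to datetime
    return df.assign(date_col=pd.to_datetime(df['date_col']))


# the order of the steps matters, so we keep them in a list
steps = [drop_redundant, parse_dates]

result = df
for step in steps:
    result = step(result)

# everything happened in memory: the original df is left untouched
print(result.dtypes)
```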

Each step is a class with a single method, `transform`, that takes a df and returns a df.

    class DropColsTransformStep(DfTransformStep):
        """
        Simply drops some undesired columns from a dataframe.
        """
        def __init__(self, cols: List[str]):
            self._cols_to_drop = cols

        def transform(self, df: pd.DataFrame) -> pd.DataFrame:
            return df.drop(self._cols_to_drop, axis=1)

Then we can declare it as a transform step.

The easiest way is to just pass the `DfTransformStep` subclass type:

    DfTransformStepConfig.from_step_type(step_type=DropColsTransformStep,
                                         params={'cols': ['redundant_col']}),
                                         
Important note here:

The `DfTransformStep` subclass should be stored in a separate file (a regular Python module), not in a notebook etc.

Otherwise, `df-and-order` will not be able to locate it.

Another way is to provide the full module path of your `DfTransformStep` subclass, including the class name. Choose whatever suits you.

    DfTransformStepConfig(module_path='df_and_order.steps.DropColsTransformStep',
                          params={'cols': ['redundant_col']}),

In both cases `params` will be passed to the `__init__` method of the specified `DfTransformStep` subclass.
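To get a feel for what a module path plus `params` means, here is a generic sketch of turning a class into a dotted path and back into an instance. It uses only the standard library and is not necessarily how `df-and-order` does it internally; both helper names are hypothetical, and `fractions.Fraction` merely plays the role of a step class.

```python
import importlib
from fractions import Fraction


def class_module_path(cls: type) -> str:
    """Dotted path of a class: its module name followed by the class name."""
    return f'{cls.__module__}.{cls.__qualname__}'


def instantiate_from_module_path(module_path: str, params: dict):
    """Import the module, look the class up by name and pass params to its init."""
    module_name, class_name = module_path.rsplit('.', 1)
    module = importlib.import_module(module_name)
    cls = getattr(module, class_name)
    return cls(**params)


path = class_module_path(Fraction)
instance = instantiate_from_module_path(path, {'numerator': 3, 'denominator': 4})
print(path, instance)
```

A class defined in a notebook cell reports `__main__` as its module, so a path built this way cannot be imported later from another session. That is consistent with the note above about keeping step classes in a separate file.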

All transform declarations are written to the config file.

Let's see how it works in the next example:

We want to remove `redundant_col` since it doesn't provide any useful information and we also need to convert `date_col` to datetime. Since our dataframe is quite small, we will do all the transformations in memory, without any intermediates.

In [14]:
from df_and_order.df_transform import DfTransformConfig
from df_and_order.df_transform_step import DfTransformStepConfig
from df_and_order.steps import DropColsTransformStep, DatesTransformStep

# we describe all the steps required
in_memory_steps = [
    DfTransformStepConfig.from_step_type(step_type=DropColsTransformStep,
                                         params={'cols': ['redundant_col']}),
    DfTransformStepConfig.from_step_type(step_type=DatesTransformStep,
                                         params={'cols': ['date_col']})
]

# arbitrary unique id for our transformation
example_transform_id = 'model_input'
# here's the instance of our entire transform
example_transform = DfTransformConfig(transform_id=example_transform_id, 
                                      in_memory_steps=in_memory_steps)

In [15]:
transformed_df = amazing_reader.read(df_id=df_id, 
                                     transform=example_transform)
transformed_df

| | num_col | str_col | date_col |
|---|---|---|---|
| 0 | 1 | one | 2020-05-17 |
| 1 | 2 | two | 2020-05-18 |
| 2 | 3 | three | 2020-05-19 |
| 3 | 4 | four | 2020-05-20 |
| 4 | 5 | five | 2020-05-21 |


In [16]:
transformed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   num_col   5 non-null      int64         
 1   str_col   5 non-null      object        
 2   date_col  5 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 248.0+ bytes


**Pretty rad, isn't it?**

Our transform is now visible in the config:

In [17]:
!cat example_df_2020/df_config.yaml

df_id: example_df_2020
initial_df_format: csv
metadata:
  author: Data Man
transformed_df_format: csv
transforms:
  model_input:
    in_memory:
    - module_path: df_and_order.steps.DropColsTransformStep
      params:
        cols:
        - redundant_col
    - module_path: df_and_order.steps.DatesTransformStep
      params:
        cols:
        - date_col


Note: you are free to edit the config file manually as well!

Once a transform is declared in the config file you can just pass `transform_id` to the `DfReader.read` method. See:

In [18]:
amazing_reader.read(df_id=df_id, transform_id=example_transform_id)

| | num_col | str_col | date_col |
|---|---|---|---|
| 0 | 1 | one | 2020-05-17 |
| 1 | 2 | two | 2020-05-18 |
| 2 | 3 | three | 2020-05-19 |
| 3 | 4 | four | 2020-05-20 |
| 4 | 5 | five | 2020-05-21 |


Maybe you want to switch to your initial dataframe? No problem! Just don't pass `transform_id`.

In [19]:
initial_df = amazing_reader.read(df_id=df_id)
initial_df

| | num_col | str_col | date_col | redundant_col |
|---|---|---|---|---|
| 0 | 1 | one | 2020-05-17 | 0 |
| 1 | 2 | two | 2020-05-18 | 0 |
| 2 | 3 | three | 2020-05-19 | 0 |
| 3 | 4 | four | 2020-05-20 | 0 |
| 4 | 5 | five | 2020-05-21 | 0 |


In [20]:
initial_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   num_col        5 non-null      int64 
 1   str_col        5 non-null      object
 2   date_col       5 non-null      object
 3   redundant_col  5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 288.0+ bytes


Finally, let's cover the case when we want to persist a transform's result. It's a good idea to remove `redundant_col` once and for all.

In [21]:
# we describe all the steps required
in_memory_steps = [
    DfTransformStepConfig.from_step_type(step_type=DatesTransformStep,
                                         params={'cols': ['date_col']})
]

# let's just move DropColsTransformStep from in_memory to permanent steps
permanent_steps = [
    DfTransformStepConfig.from_step_type(step_type=DropColsTransformStep,
                                         params={'cols': ['redundant_col']}),
]

# arbitrary unique id for our transformation
permanent_transform_id = 'model_input_permanent'
# here's the instance of our entire transform
permanent_transform = DfTransformConfig(transform_id=permanent_transform_id, 
                                        in_memory_steps=in_memory_steps,
                                        permanent_steps=permanent_steps)

In [22]:
final_df = amazing_reader.read(df_id=df_id, 
                               transform=permanent_transform)
final_df

| | num_col | str_col | date_col |
|---|---|---|---|
| 0 | 1 | one | 2020-05-17 |
| 1 | 2 | two | 2020-05-18 |
| 2 | 3 | three | 2020-05-19 |
| 3 | 4 | four | 2020-05-20 |
| 4 | 5 | five | 2020-05-21 |


In [23]:
!cat example_df_2020/df_config.yaml

df_id: example_df_2020
initial_df_format: csv
metadata:
  author: Data Man
transformed_df_format: csv
transforms:
  model_input:
    in_memory:
    - module_path: df_and_order.steps.DropColsTransformStep
      params:
        cols:
        - redundant_col
    - module_path: df_and_order.steps.DatesTransformStep
      params:
        cols:
        - date_col
  model_input_permanent:
    in_memory:
    - module_path: df_and_order.steps.DatesTransformStep
      params:
        cols:
        - date_col
    permanent:
    - module_path: df_and_order.steps.DropColsTransformStep
      params:
        cols:
        - redundant_col


In [24]:
!ls -l example_df_2020/

total 24
-rw-r--r--  1 ilya.tyutin  staff  631 May 18 11:02 df_config.yaml
-rw-r--r--  1 ilya.tyutin  staff  138 May 18 11:02 example_df_2020.csv
-rw-r--r--  1 ilya.tyutin  staff  114 May 18 11:02 model_input_permanent_example_df_2020.csv


Notice that the `model_input_permanent_example_df_2020.csv` file is now stored on disk.

Every time we read the dataframe with this transform, the result of the permanent steps is restored from that file.

In [25]:
amazing_reader.read(df_id=df_id, 
                    transform=permanent_transform)

| | num_col | str_col | date_col |
|---|---|---|---|
| 0 | 1 | one | 2020-05-17 |
| 1 | 2 | two | 2020-05-18 |
| 2 | 3 | three | 2020-05-19 |
| 3 | 4 | four | 2020-05-20 |
| 4 | 5 | five | 2020-05-21 |


Important note: `in-memory` transforms run every time you read a dataframe, no matter whether it was stored on disk or not.
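The interplay between permanent and in-memory steps can be modelled in plain pandas. This is not the library's implementation, only a sketch of the behaviour described above, assuming that permanent steps run once and are saved to a file, while in-memory steps are applied on every read. `read_with_transform` and the step functions are hypothetical names.

```python
import os
import tempfile

import pandas as pd


def read_with_transform(initial_path, cached_path, permanent_steps, in_memory_steps):
    if os.path.exists(cached_path):
        # the permanent result is already on disk: just restore it
        df = pd.read_csv(cached_path)
    else:
        # first read: run the permanent steps and persist their result
        df = pd.read_csv(initial_path)
        for step in permanent_steps:
            df = step(df)
        df.to_csv(cached_path, index=False)
    # in-memory steps run on every read, cached or not
    for step in in_memory_steps:
        df = step(df)
    return df


def drop_redundant(df):
    return df.drop(columns=['redundant_col'])


def parse_dates(df):
    return df.assign(date_col=pd.to_datetime(df['date_col']))


with tempfile.TemporaryDirectory() as tmp_dir:
    initial_path = os.path.join(tmp_dir, 'example_df_2020.csv')
    cached_path = os.path.join(tmp_dir, 'model_input_permanent_example_df_2020.csv')
    pd.DataFrame({
        'date_col': ['2020-05-17', '2020-05-18'],
        'redundant_col': [0, 0],
    }).to_csv(initial_path, index=False)

    first = read_with_transform(initial_path, cached_path, [drop_redundant], [parse_dates])
    cached_exists = os.path.exists(cached_path)
    cached_columns = list(pd.read_csv(cached_path).columns)
    # the second read takes the cached branch but yields the same dataframe
    second = read_with_transform(initial_path, cached_path, [drop_redundant], [parse_dates])

print(cached_columns)
print(second.dtypes)
```

Since csv does not preserve dtypes, a date conversion is a natural candidate for an in-memory step, while dropping a column is cheap to persist once.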