# Basic usage of `pdpipe`

So what does using `pdpipe` look like? Let's first import `pandas` and initialize a nice little dataframe (`pdpipe` itself is imported a bit further down, when we build the pipeline):


In [1]:
import pandas as pd

In [2]:
raw_df = pd.DataFrame(
    data=[
        [
            42,
            23,
            "Jo",
            "M",
            True,
            False,
            0.07,
            "USA",
            "Living life to its fullest",
        ],
        [
            81,
            23,
            "Dana",
            "F",
            True,
            True,
            0.3,
            "USA",
            "the pen is mightier then the sword",
        ],
        [
            11,
            25,
            "Bo",
            "M",
            False,
            True,
            2.3,
            "Greece",
            "all for one and one for all",
        ],
        [
            14,
            44,
            "Derek",
            "M",
            True,
            True,
            1.1,
            "Denmark",
            "every life is precious",
        ],
        [
            22,
            72,
            "Regina",
            "F",
            True,
            False,
            7.1,
            "Greece",
            "all of you get off my porch",
        ],
        [
            48,
            50,
            "Jim",
            "M",
            False,
            False,
            0.2,
            "Germany",
            "boy do I love dogs and cats",
        ],
        [
            50,
            80,
            "Richy",
            "M",
            False,
            True,
            100.2,
            "Finland",
            "I love Euro bills",
        ],
        [
            80,
            80,
            "Wealthus",
            "F",
            False,
            True,
            123.2,
            "Finland",
            "In Finance We Trust",
        ],
    ],
    columns=[
        "Id",
        "Age",
        "Name",
        "Gender",
        "Smoking",
        "Runs",
        "Savings",
        "Country",
        "Quote",
    ],
)

This results in the following dataframe:

In [3]:
raw_df

|   | Id | Age | Name | Gender | Smoking | Runs | Savings | Country | Quote |
|---|----|-----|------|--------|---------|------|---------|---------|-------|
| 0 | 42 | 23 | Jo | M | True | False | 0.07 | USA | Living life to its fullest |
| 1 | 81 | 23 | Dana | F | True | True | 0.3 | USA | the pen is mightier then the sword |
| 2 | 11 | 25 | Bo | M | False | True | 2.3 | Greece | all for one and one for all |
| 3 | 14 | 44 | Derek | M | True | True | 1.1 | Denmark | every life is precious |
| 4 | 22 | 72 | Regina | F | True | False | 7.1 | Greece | all of you get off my porch |
| 5 | 48 | 50 | Jim | M | False | False | 0.2 | Germany | boy do I love dogs and cats |
| 6 | 50 | 80 | Richy | M | False | True | 100.2 | Finland | I love Euro bills |
| 7 | 80 | 80 | Wealthus | F | False | True | 123.2 | Finland | In Finance We Trust |


## Constructing pipelines

We can create different pipeline stage objects by calling their constructors,
which are easy to identify by their camel-cased names, such as
`pdp.ColDrop` for dropping columns and `pdp.Encode` for encoding them.

To build a pipeline, we will usually call the `PdPipeline` class constructor,
and provide it with a list of pipeline stage objects:

In [4]:
import pdpipe as pdp
from pdpipe import df

pipeline = pdp.PdPipeline(
    [
        df.set_index("Id"),
        pdp.ColDrop("Name"),
        df.drop_rows_where["Savings"] > 100,
        df["Healthy"] << df["Runs"] & ~df["Smoking"],
        pdp.Bin({"Savings": [1]}, drop=False),
        pdp.Scale("StandardScaler"),
        pdp.TokenizeText("Quote"),
        pdp.SnowballStem("EnglishStemmer", columns=["Quote"]),
        pdp.RemoveStopwords("English", "Quote"),
        pdp.Encode("Gender"),
        pdp.OneHotEncode("Country"),
    ]
)
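
For intuition, here is a rough sketch of what the first four stages amount to in plain `pandas`. This is only an illustrative equivalent, not how `pdpipe` implements these stages internally; the small `people` dataframe and the `step*` names are made up for the example.

```python
import pandas as pd

# A hypothetical three-row excerpt of the dataframe used in this tutorial.
people = pd.DataFrame(
    {
        "Id": [42, 11, 50],
        "Name": ["Jo", "Bo", "Richy"],
        "Smoking": [True, False, False],
        "Runs": [False, True, True],
        "Savings": [0.07, 2.3, 100.2],
    }
)

# Stage 0: use the Id column as the index.
step0 = people.set_index("Id")
# Stage 1: drop the Name column.
step1 = step0.drop(columns="Name")
# Stage 2: drop rows where Savings is greater than 100.
step2 = step1[~(step1["Savings"] > 100)]
# Stage 3: assign a Healthy column computed from Runs and Smoking.
step3 = step2.assign(Healthy=step2["Runs"] & ~step2["Smoking"])

print(step3)
```

The pipeline version expresses the same steps declaratively, so they can be inspected, sliced and re-applied to other dataframes.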

Printing the pipeline object lists its stages in order:

In [5]:
pipeline

A pdpipe pipeline:
[ 0]  Apply dataframe method set_index with kwargs {}
[ 1]  Drop columns Name
[ 2]  Drop rows by qualifier <RowQualifier: Qualify rows with df[Savings] >
      100>
[ 3]  Assign column Healthy with df[Runs] & ~df[Smoking]
[ 4]  Bin Savings by [1].
[ 5]  Scale columns Columns of dtypes <class 'numpy.number'>
[ 6]  Tokenize Quote
[ 7]  Stemming tokens in Quote...
[ 8]  Remove stopwords from Quote
[ 9]  Encode Gender
[10]  One-hot encode Country

The numbers presented in square brackets are the indices of the
corresponding pipeline stages, and they can be used to retrieve either the
specific pipeline stage objects composing the pipeline, e.g. with
`pipeline[5]`, or sub-pipelines composed of sub-sequences of
the pipeline, e.g. with `pipeline[2:6]`.

## Applying pipelines

The pipeline can now be applied to an input dataframe using the `apply`
method. We will also set the `verbose` keyword to `True` to print
informative messages on the progress of dataframe processing, stage by stage:

In [6]:
res = pipeline.apply(raw_df, verbose=True)

- set_index: Apply dataframe method set_index with kwargs {}
- Drop columns Name
- Drop rows by qualifier <RowQualifier: Qualify rows with df[Savings] >
  100>
2 rows dropped.
- Assign column Healthy with df[Runs] & ~df[Smoking]
- Bin Savings by [1].


Savings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 180.80it/s]

- Scale columns Columns of dtypes <class 'numpy.number'>
- Tokenize Quote
- Stemming tokens in Quote...
- Remove stopwords from Quote
- Encode Gender



100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 363.08it/s]

- One-hot encode Country



Country: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 280.57it/s]


We will thus get the dataframe below. We can see all numerical columns were scaled, the `Country` column was one-hot-encoded, `Savings` also got a binned version and the textual `Quote` column underwent some word-level manipulations:

In [7]:
res

| Id | Age | Gender | Smoking | Runs | Savings | Savings_bin | Quote | Healthy | Country_Germany | Country_Greece | Country_USA |
|----|-----|--------|---------|------|---------|-------------|-------|---------|-----------------|----------------|-------------|
| 42 | -0.917257 | 1 | True | False | -0.718473 | <1 | [live, life, fullest] | False | 0 | 0 | 1 |
| 81 | -0.917257 | 0 | True | True | -0.625375 | <1 | [pen, mightier, sword] | False | 0 | 0 | 1 |
| 11 | -0.806074 | 1 | False | True | 0.184172 | 1≤ | [one, one] | True | 0 | 1 | 0 |
| 14 | 0.250161 | 1 | True | True | -0.301556 | 1≤ | [everi, life, precious] | False | 0 | 0 | 0 |
| 22 | 1.806718 | 0 | True | False | 2.127084 | 1≤ | [get, porch] | False | 0 | 1 | 0 |
| 48 | 0.583709 | 1 | False | False | -0.665852 | <1 | [boy, love, dog, cat] | False | 1 | 0 | 0 |


## Fit and transform

Pipelines are also callable objects themselves, so calling `pipeline(df)` is 
equivalent to calling `pipeline.apply(df)`.

Additionally, pipelines inherently have a fit state. If none of the stages
composing them is fittable in nature this doesn't make much of a difference,
but many stages have a `fit_transform` vs `transform` logic, like encoders,
scalers and so forth.

>   *The `apply` pipeline method uses either `fit_transform` or `transform`
    in an intelligent and sensible way: If the pipeline is not fitted, calling
    it is equivalent to calling `fit_transform`, while if it is fitted, the
    call is practically a `transform` call.*

Let's say we want to utilize pdpipe's powerful slicing syntax to apply only
*some* of the pipeline stages to the raw dataframe. We will now use the 
`fit_transform` method of the pipeline itself to force all encompassed pipeline
stages to fit-transform themselves.

Here, we will use `pipeline[4:7]` to apply the binning, scaling and tokenization stages only:

In [8]:
pipeline[4:7].fit_transform(raw_df)

|   | Id | Age | Name | Gender | Smoking | Runs | Savings | Savings_bin | Country | Quote |
|---|----|-----|------|--------|---------|------|---------|-------------|---------|-------|
| 0 | -0.05888 | -1.135052 | Jo | M | True | False | -0.609615 | <1 | USA | [Living, life, to, its, fullest] |
| 1 | 1.472004 | -1.135052 | Dana | F | True | True | -0.60482 | <1 | USA | [the, pen, is, mightier, then, the, sword] |
| 2 | -1.275737 | -1.04979 | Bo | M | False | True | -0.563121 | 1≤ | Greece | [all, for, one, and, one, for, all] |
| 3 | -1.157976 | -0.2398 | Derek | M | True | True | -0.58814 | 1≤ | Denmark | [every, life, is, precious] |
| 4 | -0.843949 | 0.95387 | Regina | F | True | False | -0.463043 | 1≤ | Greece | [all, of, you, get, off, my, porch] |
| 5 | 0.17664 | 0.015987 | Jim | M | False | False | -0.606905 | <1 | Germany | [boy, do, I, love, dogs, and, cats] |
| 6 | 0.255147 | 1.294918 | Richy | M | False | True | 1.478052 | 1≤ | Finland | [I, love, Euro, bills] |
| 7 | 1.43275 | 1.294918 | Wealthus | F | False | True | 1.957592 | 1≤ | Finland | [In, Finance, We, Trust] |
