<cite>Darryl Oatridge, August 2022</cite>

## Building a Pipeline
Now that we know what a component looks like, we can start to build the pipeline, adding the actions, or Intent, that give the component its purpose.

The first component we will build as part of the pipeline is the data selection component, with the class name `Transition`. This component provides a set of Intent that focuses on tidying raw data by removing columns that are not useful to the final feature set. These may include null columns, single-value columns, duplicate columns and noise. We can also ensure the data is properly canonicalised by enforcing data typing.

Before we do that, and as shown in the previous section, we now use the environment variables to define the location of the Domain Contract and datastore.


In [1]:
import os 

In [2]:
os.environ['HADRON_PM_PATH'] = '0_hello_meta/demo/contracts'
os.environ['HADRON_DEFAULT_PATH'] = '0_hello_meta/demo/data'
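
As a minimal sketch of the idea, the snippet below sets the same two variables and reads them back with a fallback. The helper `resolve_path`, the fallback value and the variable name `HADRON_NOT_SET_EXAMPLE` are hypothetical and used only for illustration; how `ds_discovery` itself resolves these variables is not shown here.

```python
import os

# Same variables as above; the values are the demo locations used in this walkthrough.
os.environ['HADRON_PM_PATH'] = '0_hello_meta/demo/contracts'
os.environ['HADRON_DEFAULT_PATH'] = '0_hello_meta/demo/data'


def resolve_path(name: str, fallback: str = '.') -> str:
    """Hypothetical helper: return the environment value, or a fallback if unset."""
    return os.environ.get(name, fallback)


contract_path = resolve_path('HADRON_PM_PATH')
data_path = resolve_path('HADRON_DEFAULT_PATH')
# A variable that has not been set falls back to the default.
missing_path = resolve_path('HADRON_NOT_SET_EXAMPLE')
```

Setting the locations through the environment keeps them out of the notebook logic, so the same code can point at a different contract store or datastore without being edited.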

For feature selection we use the Transition component, which can select the relevant columns from the raw data, potentially reducing the column count. In addition, the Transition component extends the common reporting tools and provides additional functionality for identifying quality, quantity, veracity and availability.

It is worth noting that we are creating a new component and as such must set up its input and output.

In [3]:
from ds_discovery import Transition

In [4]:
# get the instance
tr = Transition.from_env('hello_tr', has_contract=False)

In [5]:
tr.set_source_uri('https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv')
tr.set_persist()

### Adding Intent
At the core of a component are its tasks, in other words how it changes incoming data into a different data outcome. To achieve this we use the component's Intent. Intent is a finite set of methods, unique to each component, that can be applied to the raw data in order to change it in a way that is useful to the outcome of the task. Each intent is applied individually and selectively so that the outcome is tailored to its specific needs.

Firstly load the raw data.

In [6]:
df = tr.load_source_canonical()

Through the reporting we first observed that empty cells have been replaced by a `?`. Here we can use `auto_reinstate_nulls` to replace all the obfuscated cells with nulls. In addition, we can immediately see columns that are inappropriate for our needs. In this case we do not need to know the name of the traveller.

In [7]:
# reinstate obfuscated nulls
df = tr.tools.auto_reinstate_nulls(df, nulls_list=['?'])
# removes data columns of no interest
df = tr.tools.to_remove(df, headers=['name'])
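
To make the two steps concrete, here is a plain-Python sketch of the same idea on a couple of hand-written rows. It is not how `auto_reinstate_nulls` or `to_remove` are implemented; the function names `reinstate_nulls` and `remove_columns` and the sample rows are illustrative assumptions.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Optional[Any]]


def reinstate_nulls(rows: List[Row], nulls_list: List[str]) -> List[Row]:
    """Replace any cell whose value is in nulls_list with None."""
    return [
        {key: (None if value in nulls_list else value) for key, value in row.items()}
        for row in rows
    ]


def remove_columns(rows: List[Row], headers: List[str]) -> List[Row]:
    """Drop the named columns from every row."""
    return [
        {key: value for key, value in row.items() if key not in headers}
        for row in rows
    ]


# Illustrative rows only, not taken from the dataset.
rows = [
    {'name': 'Passenger A', 'age': '29', 'cabin': 'B5'},
    {'name': 'Passenger B', 'age': '?', 'cabin': '?'},
]

cleaned = remove_columns(reinstate_nulls(rows, nulls_list=['?']), headers=['name'])
```

After both steps the placeholder `?` has become a real null and the `name` column is gone, which is the state the canonical report later describes.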

Using the Intent report, we can now see the selections made up to this point.

In [8]:
tr.report_intent()

Unnamed: 0,level,order,intent,parameters,creator
0,base,0,auto_reinstate_nulls,"[""nulls_list=['?']""]",doatridge
1,,0,to_remove,"[""headers=['name']""]",doatridge


We can now run the pipeline and use the canonical report to observe the outcome. From it we can see that the `%_Null` column now correctly indicates the proportion of nulls in each column, so we can deal with them later. We have also removed `name`.

In [9]:
tr.run_component_pipeline()
tr.canonical_report(tr.load_persist_canonical())

Unnamed: 0,Attributes (13),dType,%_Null,%_Dom,Count,Unique,Observations
0,age,object,20.1%,20.1%,1309,99,Sample: 24 | 22 | 21 | 30 | 18
1,boat,object,62.9%,62.9%,1309,28,Sample: 13 | C | 15 | 14 | 4
2,body,object,90.8%,90.8%,1309,122,Sample: 135 | 101 | 37 | 285 | 156
3,cabin,object,77.5%,77.5%,1309,187,Sample: C23 C25 C27 | G6 | B57 B59 B63 B66 | F4 | F33
4,embarked,object,0.2%,69.8%,1309,4,Sample: S | C | Q
5,fare,object,0.1%,4.6%,1309,282,Sample: 8.05 | 13 | 7.75 | 26 | 7.8958
6,home.dest,object,43.1%,43.1%,1309,370,"Sample: New York, NY | London | Montreal, PQ | Paris, France | Cornwall / Akron, OH"
7,parch,int64,0.0%,76.5%,1309,8,max=9 | min=0 | mean=0.39 | dominant=0
8,pclass,int64,0.0%,54.2%,1309,3,max=3 | min=1 | mean=2.29 | dominant=3
9,sex,object,0.0%,64.4%,1309,2,Sample: male | female


Now that we have seen the changes, let's add some more Intent. In this case, as well as `name`, we notice we don't need `boat`, `body` and `home.dest`, so we add these to the list. In addition, we will auto-categorise any obvious columns in the data and then ensure the numeric values are of the correct type.

In [10]:
df = tr.tools.to_remove(df, headers=['name', 'boat', 'body', 'home.dest'])
df = tr.tools.auto_to_category(df, unique_max=20)
df = tr.tools.to_numeric_type(df, headers=['age', 'fare'])
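
A rough plain-Python picture of what these two steps decide, assuming `unique_max` means "treat a column as categorical when it has at most this many distinct values". The helper names and the sample columns are illustrative; the library's own rules may differ.

```python
from typing import Dict, List, Optional


def category_candidates(columns: Dict[str, List[Optional[str]]], unique_max: int) -> List[str]:
    """Return the columns whose number of distinct non-null values is at most unique_max."""
    return [
        name
        for name, values in columns.items()
        if len({v for v in values if v is not None}) <= unique_max
    ]


def to_float(values: List[Optional[str]]) -> List[Optional[float]]:
    """Convert text to float, keeping nulls as None."""
    return [None if v is None else float(v) for v in values]


# Illustrative columns only.
columns = {
    'sex': ['male', 'female', 'male', 'female'],
    'fare': ['8.05', '13', None, '7.75'],
}

# 'sex' has two distinct values; 'fare' has three, so it is not a candidate here.
candidates = category_candidates(columns, unique_max=2)
fares = to_float(columns['fare'])
```

Numeric conversion only works cleanly once placeholders such as `?` have been turned into real nulls, which is one reason the order of the steps can matter.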

We use the Intent reporting tool to check the work and see what the Intent currently looks like all together.

In [11]:
tr.report_intent()

Unnamed: 0,level,order,intent,parameters,creator
0,base,0,auto_reinstate_nulls,"[""nulls_list=['?']""]",doatridge
1,,0,auto_to_category,['unique_max=20'],doatridge
2,,0,to_numeric_type,"[""headers=['age', 'fare']""]",doatridge
3,,0,to_remove,"[""headers=['name', 'boat', 'body', 'home.dest']""]",doatridge
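
Notice that `to_remove` appears only once in the report even though it was called twice; the second call appears to have replaced the first. One way to picture a call that both transforms data and records itself is a small wrapper like the one below. This is purely an illustration, not the library's mechanism, and `IntentLog` and `strip_marker` are hypothetical names.

```python
from typing import Any, Callable, Dict, List


class IntentLog:
    """Hypothetical recorder: stores each intent call so it can be reported later."""

    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def record(self, func: Callable[..., Any]) -> Callable[..., Any]:
        def wrapper(data: Any, **kwargs: Any) -> Any:
            # Replace any earlier entry for the same intent rather than duplicating it.
            self.entries = [e for e in self.entries if e['intent'] != func.__name__]
            self.entries.append({'intent': func.__name__, 'parameters': kwargs})
            return func(data, **kwargs)
        return wrapper


log = IntentLog()


@log.record
def strip_marker(values: List[str], marker: str = '?') -> List[Any]:
    """Toy intent: turn a placeholder marker into None."""
    return [None if v == marker else v for v in values]


result = strip_marker(['a', '?', 'b'], marker='?')
result = strip_marker(['a', '?', 'b'], marker='?')  # second call replaces the first entry
```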


As we can see, adding the component's Intent is a process of looking at the raw data and making decisions on the selection of the features of interest. From here we would run the pipeline and use the canonical reporting and other tools to check the results, then keep iterating until we have what we need.

----------------

## Ordering Intent

With the component Intent now defined, the run pipeline does its best to guess the best order for that Intent, but sometimes we want to ensure things run in a certain order because of dependencies or other challenges. Though not necessary, we will clear the previous Intent and write it again, this time in order.

In [12]:
tr.remove_intent()

True

In [13]:
tr.report_intent()

Unnamed: 0,level,order,intent,parameters,creator


This time when we add the Intent we include the parameter `intent_level` to indicate the different order or level of execution.  

We load the source canonical and repeat the Intent, this time including the new intent level.  

In [14]:
df = tr.load_source_canonical()

In [15]:
df = tr.tools.auto_reinstate_nulls(df, nulls_list=['?'], intent_level='reinstate')
df = tr.tools.to_remove(df, headers=['name', 'boat', 'body', 'home.dest'], intent_level='remove')
df = tr.tools.auto_to_category(df, unique_max=20, intent_level='auto_category')
df = tr.tools.to_numeric_type(df, headers=['age', 'fare'], intent_level='to_dtype')
df = tr.tools.to_str_type(df, headers=['cabin', 'ticket'], use_string_type=True, intent_level='to_dtype')

In addition, and as an introduction to a new feature, we will add in the column description that describes the reasoning behind why an Intent was added.

In [16]:
tr.add_column_description('reinstate', description="reinstate nulls that where obfuscated with '?'")
tr.add_column_description('remove', description="remove column of no value")
tr.add_column_description('auto_category', description="auto fit features to categories where their uniqueness is 20 or less")
tr.add_column_description('to_dtype', description="ensure all other columns of interest are appropriately typed")


Using the report we can see the level names in the `level` column, which help the run component run the tasks in the order given. It is worth noting that tasks can be given the same level if their order is not important, and the run component will deal with them using its ordering algorithm.

In [17]:
tr.report_intent()

Unnamed: 0,level,order,intent,parameters,creator
0,auto_category,0,auto_to_category,['unique_max=20'],doatridge
1,reinstate,0,auto_reinstate_nulls,"[""nulls_list=['?']""]",doatridge
2,remove,0,to_remove,"[""headers=['name', 'boat', 'body', 'home.dest']""]",doatridge
3,to_dtype,0,to_numeric_type,"[""headers=['age', 'fare']""]",doatridge
4,,0,to_str_type,"[""headers=['cabin', 'ticket']"", 'use_string_type=True']",doatridge
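
Conceptually, an intent level is a label that groups one or more steps. The sketch below models that with a dictionary keyed by level. It illustrates the grouping seen in the report (two steps under `to_dtype`) and is not the library's storage format.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple

# (intent name, level) pairs in the order they were added above.
added: List[Tuple[str, str]] = [
    ('auto_reinstate_nulls', 'reinstate'),
    ('to_remove', 'remove'),
    ('auto_to_category', 'auto_category'),
    ('to_numeric_type', 'to_dtype'),
    ('to_str_type', 'to_dtype'),
]

# Group the intents under their level label.
by_level: DefaultDict[str, List[str]] = defaultdict(list)
for intent, level in added:
    by_level[level].append(intent)

# Listing the levels alphabetically matches the row order in the report above.
report_order = sorted(by_level)
```

Note that the report lists the levels alphabetically rather than in the order they were added, so the display order should not be read as the execution order.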


As we have taken the time to capture the reasoning behind the component Intent, we can use the reports to produce a view of the Intent column comments, which are invaluable when interrogating a component and understanding why decisions were made.

In [18]:
tr.report_column_catalog()

Unnamed: 0,column_name,description
0,auto_category,auto fit features to categories where their uniqueness is 20 or less
1,reinstate,reinstate nulls that where obfuscated with '?'
2,remove,remove column of no value
3,to_dtype,ensure all other columns of interest are appropriately typed


## Component Pipeline

As usual we can now run the Component pipeline to apply the component's tasks.

In [19]:
tr.run_component_pipeline()

As an extension of the default, `run_component_pipeline` provides useful tools to help manage the outcome. In this case we specifically define the order in which the Intent levels run.

In [20]:
tr.run_component_pipeline(intent_levels=['remove', 'reinstate', 'auto_category', 'to_dtype'])
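
The effect of passing an ordered list of levels can be pictured as looking up each level in turn and applying its steps. This is a toy model with hypothetical step functions and level names operating on a list of strings, not the library's runner.

```python
from typing import Callable, Dict, List, Optional

Values = List[Optional[str]]
Step = Callable[[Values], Values]


def reinstate(values: Values) -> Values:
    """Toy step: turn '?' into None."""
    return [None if v == '?' else v for v in values]


def drop_nulls(values: Values) -> Values:
    """Toy step: remove None entries."""
    return [v for v in values if v is not None]


# Hypothetical registry of level name -> steps.
levels: Dict[str, List[Step]] = {'reinstate': [reinstate], 'drop': [drop_nulls]}


def run(values: Values, intent_levels: List[str]) -> Values:
    """Apply the steps of each level, in the order the levels are listed."""
    for level in intent_levels:
        for step in levels[level]:
            values = step(values)
    return values


data: Values = ['a', '?', 'b']
in_order = run(data, ['reinstate', 'drop'])        # '?' becomes None, then is dropped
reversed_order = run(data, ['drop', 'reinstate'])  # nothing to drop yet, so None remains
```

The two runs give different results from the same steps, which is why an explicit order is useful when one step depends on another.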

----------------------
## Run Books
We've seen how `run_component_pipeline` allows us to define the Intent order at run time, but we can also use 'Run Books' to predefine that same order and reuse it. We simply add our list of Intent levels to a book in the order needed. In this case we have not specified a book name, so this book is allocated to the primary Run Book. Now each time we run our pipeline, it is set to run the primary Run Book.

In [21]:
tr.add_run_book(run_levels=['remove', 'reinstate', 'auto_category', 'to_dtype'])

Here we add a book by name, in which we select only the Intent that cleans the raw data. The Run Book report now shows us the two Run Books:

In [22]:
tr.add_run_book(book_name='cleaner', run_levels=['remove', 'reinstate'])

In [23]:
tr.report_run_book()

Unnamed: 0,name,run_book
0,primary_run_book,"['remove', 'reinstate', 'auto_category', 'to_dtype']"
1,cleaner,"['remove', 'reinstate']"
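
A Run Book can be thought of as a named list of levels, with one name reserved as the default. The sketch below models that with a dictionary. `select_run_book` is a hypothetical helper, and only the book names and levels come from the report above.

```python
from typing import Dict, List, Optional

# Book name -> ordered list of intent levels, as shown in the report.
run_books: Dict[str, List[str]] = {
    'primary_run_book': ['remove', 'reinstate', 'auto_category', 'to_dtype'],
    'cleaner': ['remove', 'reinstate'],
}


def select_run_book(name: Optional[str] = None) -> List[str]:
    """Return the named book's levels, falling back to the primary book when no name is given."""
    return run_books[name or 'primary_run_book']


default_levels = select_run_book()
cleaner_levels = select_run_book('cleaner')
```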


In this next example we use the additional Run Book, which is a subset of the tasks that only cleans the data. By passing this named Run Book to the run pipeline, it is obliged to run only this subset and so only clean the data. We can see the results of this in our canonical report below.

In [24]:
tr.run_component_pipeline(run_book='cleaner')

In [25]:
tr.canonical_report(tr.load_persist_canonical())

Unnamed: 0,Attributes (10),dType,%_Null,%_Dom,Count,Unique,Observations
0,age,object,20.1%,20.1%,1309,99,Sample: 24 | 22 | 21 | 30 | 18
1,cabin,object,77.5%,77.5%,1309,187,Sample: C23 C25 C27 | G6 | B57 B59 B63 B66 | F4 | F33
2,embarked,object,0.2%,69.8%,1309,4,Sample: S | C | Q
3,fare,object,0.1%,4.6%,1309,282,Sample: 8.05 | 13 | 7.75 | 26 | 7.8958
4,parch,int64,0.0%,76.5%,1309,8,max=9 | min=0 | mean=0.39 | dominant=0
5,pclass,int64,0.0%,54.2%,1309,3,max=3 | min=1 | mean=2.29 | dominant=3
6,sex,object,0.0%,64.4%,1309,2,Sample: male | female
7,sibsp,int64,0.0%,68.1%,1309,7,max=8 | min=0 | mean=0.5 | dominant=0
8,survived,int64,0.0%,61.8%,1309,2,max=1 | min=0 | mean=0.38 | dominant=0
9,ticket,object,0.0%,0.8%,1309,929,Sample: CA. 2343 | 1601 | CA 2144 | PC 17608 | 347077


In contrast to the above, we can run the pipeline without providing a Run Book name and it will automatically default to the primary Run Book, assuming this has been set up. In this case the full component Intent runs, and the resulting outcome is shown below in the canonical report.

In [26]:
tr.run_component_pipeline()

In [27]:
tr.canonical_report(tr.load_persist_canonical())

Unnamed: 0,Attributes (10),dType,%_Null,%_Dom,Count,Unique,Observations
0,age,float64,20.1%,20.1%,1309,99,max=80.0 | min=0.1667 | mean=29.88 | dominant=24.0
1,cabin,string,77.5%,77.5%,1309,187,Sample: C23 C25 C27 | G6 | B57 B59 B63 B66 | F4 | F33
2,embarked,category,0.0%,69.8%,1309,4,Sample: S | C | Q | nan
3,fare,float64,0.1%,4.6%,1309,282,max=512.3292 | min=0.0 | mean=33.3 | dominant=8.05
4,parch,category,0.0%,76.5%,1309,8,Sample: 0 | 1 | 2 | 3 | 4
5,pclass,category,0.0%,54.2%,1309,3,Sample: 3 | 1 | 2
6,sex,category,0.0%,64.4%,1309,2,Sample: male | female
7,sibsp,category,0.0%,68.1%,1309,7,Sample: 0 | 1 | 2 | 4 | 3
8,survived,category,0.0%,61.8%,1309,2,Sample: 0 | 1
9,ticket,string,0.0%,0.8%,1309,929,Sample: CA. 2343 | 1601 | CA 2144 | PC 17608 | 347077
