<cite>Darryl Oatridge, August 2022</cite>

## Building a Component
This tutorial shows the fundamentals of running a basic Project Hadron component and creating a reusable Domain Contract from it. It is the simplest form of running a task, demonstrating the input, throughput and output of a dataset. The term 'Domain Contract' is used throughout the tutorials and is fundamental to a component. A Domain Contract contains everything known about an instance of a component. Each time a named component is loaded, the Domain Contract defines that component's instance.


First, we import a component from the Project Hadron library for this demonstration. The choice of component is arbitrary, as all the methods used here are common across all components.

In [6]:
from ds_discovery import Transition

To create a Domain Contract instance of the component we use the factory method `from_env` and give it a referenceable name, `hello_comp`. As this is the first instantiation, we also pass the one-off parameter `has_contract=False`. This parameter defaults to True and guards against the accidental loading of a Domain Contract instance of the same task name.

In [7]:
tr = Transition.from_env('hello_comp', has_contract=False)
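
The guard described above can be pictured with a toy factory. This is a minimal sketch and not the Project Hadron implementation: the class `ToyComponent`, its in-memory `_STORE`, the example URI and the choice of `FileNotFoundError` are all hypothetical, and it assumes that `has_contract=True` means a contract of that name must already exist.

```python
class ToyComponent:
    """Hypothetical stand-in for a component; not part of ds_discovery."""

    # task name -> contract dict; stands in for contracts persisted on disk
    _STORE = {}

    def __init__(self, task_name, contract):
        self.task_name = task_name
        self.contract = contract

    @classmethod
    def from_env(cls, task_name, has_contract=True):
        if task_name not in cls._STORE:
            if has_contract:
                # Assumed behaviour: refuse to silently start an empty contract.
                raise FileNotFoundError(f"no contract named '{task_name}'")
            # First instantiation: create and remember a new, empty contract.
            cls._STORE[task_name] = {"task_name": task_name, "connectors": {}}
        return cls(task_name, cls._STORE[task_name])


# First use: explicitly state that no contract exists yet.
first = ToyComponent.from_env("hello_comp", has_contract=False)
first.contract["connectors"]["primary_source"] = "some/uri.csv"  # made-up URI

# Later use: the default has_contract=True reloads the stored contract.
again = ToyComponent.from_env("hello_comp")
print(again.contract["connectors"])
```

In this sketch a misspelt task name fails loudly on reload instead of quietly producing a fresh, unconfigured instance.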

Next we set where the data comes from and where the resulting data goes. The source is a URI (here a URL) from which the data is collected. In this case persistence uses the default settings; more on this later.

In [8]:
tr.set_source_uri('https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv')
tr.set_persist()

Finally, we run the component, which takes the source data, passes it through our task, and persists it.

In [9]:
tr.run_component_pipeline()

This concludes building a component. Though the component doesn't change the throughput, it shows the core steps to building any component.

--------------------------------------

## Reloading and Extending our Component
As we have already set up the input and output and stored them in our Domain Contract, we can now reload the component instance, with all its content, by referencing the component's individual name. To make sure everything is working, we run the component pipeline and check that it completes cleanly.

In [10]:
tr = Transition.from_env('hello_comp')

In [11]:
tr.run_component_pipeline()

Now that we have reloaded the Domain Contract, let's look at a sample of commonly used features that allow us to peek inside our components.

The first and probably most useful method call retrieves the run results of the pipeline.
We do this using the component method `load_persist_canonical`. Because of the Domain Contract, the component already knows what to get and where to get it. In this instance it returns a pandas DataFrame, which for these components is our canonical.

In [12]:
df = tr.load_persist_canonical()

The second most used feature is the reporting tool for the canonical. It presents the results of the run as an informative dictionary, giving a deeper insight into the canonical. Unlike other reports, it takes the canonical of interest as an argument, which means it can be used in a wider range of circumstances, such as looking at the source or other data being ingested by the task.

Below we have an example of the processed canonical where we can potentially see the results of the pipeline that was persisted.  The report has a wealth of information and is worth taking time to explore as it is likely to speed up your data discovery and the understanding of the dataset.


In [13]:
tr.canonical_report(df)

| | Attributes (14) | dType | %_Null | %_Dom | Count | Unique | Observations |
|---|---|---|---|---|---|---|---|
| 0 | age | object | 0.0% | 20.1% | 1309 | 99 | Sample: ? \| 24 \| 22 \| 21 \| 30 |
| 1 | boat | object | 0.0% | 62.9% | 1309 | 28 | Sample: ? \| 13 \| C \| 15 \| 14 |
| 2 | body | object | 0.0% | 90.8% | 1309 | 122 | Sample: ? \| 58 \| 285 \| 156 \| 143 |
| 3 | cabin | object | 0.0% | 77.5% | 1309 | 187 | Sample: ? \| C23 C25 C27 \| G6 \| B57 B59 B63 B66 \| C22 C26 |
| 4 | embarked | object | 0.0% | 69.8% | 1309 | 4 | Sample: S \| C \| Q \| ? |
| 5 | fare | object | 0.0% | 4.6% | 1309 | 282 | Sample: 8.05 \| 13 \| 7.75 \| 26 \| 7.8958 |
| 6 | home.dest | object | 0.0% | 43.1% | 1309 | 370 | Sample: ? \| New York, NY \| London \| Montreal, PQ \| Paris, France |
| 7 | name | object | 0.0% | 0.2% | 1309 | 1307 | Sample: Connolly, Miss. Kate \| Kelly, Mr. James \| Allen, Miss. Elisabeth Walton \| Ilmakangas, Miss. ... |
| 8 | parch | int64 | 0.0% | 76.5% | 1309 | 8 | max=9 \| min=0 \| mean=0.39 \| dominant=0 |
| 9 | pclass | int64 | 0.0% | 54.2% | 1309 | 3 | max=3 \| min=1 \| mean=2.29 \| dominant=3 |
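
To make the report columns concrete: the `pclass` row pairs `dominant=3` with a `%_Dom` of 54.2%, which suggests `%_Dom` is the share of rows taken by the most frequent value. The sketch below computes comparable figures for a single column using only the standard library. It is an illustration, not the `canonical_report` implementation; the function name `summarise_column` and the sample values are made up, and measuring both percentages against all rows (nulls included) is an assumption.

```python
from collections import Counter


def summarise_column(values):
    """Return report-style figures for one column (hypothetical helper)."""
    count = len(values)
    nulls = sum(1 for v in values if v is None)
    # Frequency of each non-null value.
    counts = Counter(v for v in values if v is not None)
    dominant, dom_count = counts.most_common(1)[0] if counts else (None, 0)
    return {
        "count": count,
        "unique": len(counts),
        # Assumption: percentages are taken over all rows, nulls included.
        "pct_null": round(100 * nulls / count, 1) if count else 0.0,
        "pct_dom": round(100 * dom_count / count, 1) if count else 0.0,
        "dominant": dominant,
    }


sample = [3, 3, 1, 2, 3, None, 1, 3]  # made-up values, not the Titanic data
summary = summarise_column(sample)
print(summary)
```

A high `pct_dom` on a column is often a hint that one value, possibly a placeholder such as `?`, dominates and deserves a closer look.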


Thirdly, we can examine the connector contracts. Connector contracts are set up as part of the Domain Contract and record where both the datasets and the Domain Contract are persisted.

Through this report we can identify where the source data is taken from and where the results are persisted to. In this instance we have also added the parameter `inc_pm` so the report includes the location of the Domain Contract.

In [14]:
tr.report_connectors(inc_pm=True)

| | connector_name | uri | module_name | handler | version | kwargs | query | aligned |
|---|---|---|---|---|---|---|---|---|
| 0 | primary_source | https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv | ds_discovery.handlers.pandas_handlers | PandasPersistHandler | v0.00 | | | False |
| 1 | primary_persist | ./hadron/data/hadron_transition_hello_comp_primary_persist.pickle | ds_discovery.handlers.pandas_handlers | PandasPersistHandler | v0.00 | | | True |
| 2 | pm_transition_hello_comp | ./hadron/contracts/hadron_pm_transition_hello_comp.json | ds_discovery.handlers.pandas_handlers | PandasPersistHandler | v0.00 | | | False |
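
The default locations in this report follow a visible naming pattern built from the component type, the task name and the connector name. As a minimal sketch of that pattern (the helper `default_file_names` is hypothetical, and the pattern is only inferred from the rows of the report, not taken from the library):

```python
def default_file_names(component, task_name):
    """Build default file names following the pattern seen in the report.

    Hypothetical helper: the pattern is inferred from the report rows only.
    """
    return {
        # Persisted dataset, e.g. hadron_transition_hello_comp_primary_persist.pickle
        "primary_persist": f"hadron_{component}_{task_name}_primary_persist.pickle",
        # Domain Contract, e.g. hadron_pm_transition_hello_comp.json
        f"pm_{component}_{task_name}": f"hadron_pm_{component}_{task_name}.json",
    }


names = default_file_names("transition", "hello_comp")
for connector, file_name in names.items():
    print(connector, "->", file_name)
```

Because the task name is part of every file name, two components with different names can share the same directories without overwriting each other's files.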


This gives a flavour of the tools available to look inside a component. It is worth taking time to view the different reports a component offers.

------------------------------

## Environment Variables
To this point we have been using the default settings for where to store the Domain Contract and the persisted dataset. These are in general local and within your working directory. Environment variables free us to use an extensive list of connector contracts to store the data in a location of our choice or as requirements dictate.

Hadron provides an extensive list of environment variables to tailor how your components retrieve and persist their information. Most are beyond the scope of this tutorial and tend to be for specialist use, so we focus on the two most commonly used in the majority of projects.

We initially import Python's `os` package.

In [15]:
import os

In general, and as good practice, most notebooks would `run` a setup file containing the imports and environment variables that are common across all notebooks. Because this is a tutorial, for visibility we import the packages and set the two environment variables within each notebook.

The first environment variable we set is for the Domain Contract. This is critical to the component and to the other components that rely on it (more on this later). In this case we set the Domain Contract location to a common local directory, specific to Domain Contracts.

In [16]:
os.environ['HADRON_PM_PATH'] = '0_hello_meta/demo/contracts'

Next we set up where all the data is going to be persisted to and recovered from. This tends to be a common shared area so that other components can access this component's output (more on this later). In this case we set the default data store location to a common local directory, specific to datasets.

In [17]:
os.environ['HADRON_DEFAULT_PATH'] = '0_hello_meta/demo/data'
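
The relationship between these two variables and the default locations seen earlier in the connector report can be sketched as a lookup with a fallback. This is only an illustration: how the library actually resolves the variables is not shown in this tutorial, the helper `resolve_path` is hypothetical, and the fallback values are the `./hadron/contracts` and `./hadron/data` directories from that report.

```python
import os

# Fallbacks taken from the default locations in the connector report.
DEFAULTS = {
    "HADRON_PM_PATH": "./hadron/contracts",
    "HADRON_DEFAULT_PATH": "./hadron/data",
}


def resolve_path(var_name, environ=None):
    """Return the configured path, or the assumed default if unset."""
    environ = os.environ if environ is None else environ
    return environ.get(var_name, DEFAULTS[var_name])


# With nothing set, the assumed defaults apply.
print(resolve_path("HADRON_PM_PATH", environ={}))

# With a variable set, its value wins over the default.
custom = {"HADRON_DEFAULT_PATH": "0_hello_meta/demo/data"}
print(resolve_path("HADRON_DEFAULT_PATH", environ=custom))
```

Passing a plain dictionary as `environ` keeps the sketch easy to try without changing the real process environment.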

In [18]:
from ds_discovery import Transition

Because we have changed where the Domain Contract is found, we need to start again from the beginning: we create the instance, give the source location, and use the default persist location, which is now set by the environment variable.

In [19]:
tr = Transition.from_env('hello_tr', has_contract=False)

In [20]:
tr.set_source_uri('https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv')
tr.set_persist()

Finally we run the pipeline with the new environment variables in place and check everything runs okay.

In [21]:
tr.run_component_pipeline()

And we are there! We now know how to build a component and set its environment variables.  The next step is to build a real pipeline and join that with other pipelines to construct our complete master Domain Contract.