In [1]:
# saves you having to use print as all exposed variables are printed in the cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# core libraries
import pandas as pd
import os
from pathlib import Path

%reload_ext autoreload
%autoreload 2


# for cleaning and discovery
from ds_discovery.transition.transitioning import Transition
from ds_foundation.handlers.abstract_handlers import ConnectorContract

# Set the environment Contract Path to use with the from_env() init factory method
os.environ['AISTAC_PM_PATH'] = Path(os.environ['PWD'], 'contracts').as_posix()

# These are not required but are used as reference when setting up the connectors
os.environ['TR_SOURCE_PATH'] = Path(os.environ['PWD'], 'data', '0_raw').as_posix()
os.environ['TR_PERSIST_PATH'] = Path(os.environ['PWD'], 'data', '2_transition').as_posix()

import ds_discovery
print('DTU: {}'.format(ds_discovery.__version__))

DTU: 2.05.030


# Accelerated Machine learning
## Transitioning: Data Sourcing
As part of the Accelerated ML Discovery Vertical, Transitioning is a foundation base truth, facilitating a **transparent** transition<br>
of the raw canonical dataset to a **fit-for-purpose** canonical dataset, enabling the optimisation of discovery analysis and the identification of **features-of-interest**.
Canonical here means converting formats into a common data language: not just bringing over the dataset but also its construct, i.e. type, format, structure and functionality. Because we are Python-centric, we use Pandas DataFrames as our canonical.
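
As a minimal sketch of what "canonical" means in practice, the snippet below reads a small, made-up CSV into a Pandas DataFrame so that type, format and structure travel with the data. The column names and values are invented for illustration and are not part of the synthetic customer dataset.

```python
import io

import pandas as pd

# Made-up raw data: every value starts life as text in the source file.
raw_csv = "id,joined,balance\n1,2019-01-05,10.5\n2,2019-02-11,7.25\n"

# Reading into a DataFrame gives each column a type, not just its values.
df = pd.read_csv(io.StringIO(raw_csv), parse_dates=["joined"])

# The structure (column names and dtypes) is now part of the dataset itself.
structure = {column: str(dtype) for column, dtype in df.dtypes.items()}
print(structure)
```

The point of the sketch is that downstream code can rely on the DataFrame's columns and dtypes without knowing how the source file was laid out.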

With reference to the diagram, this notebook deals with the Sourcing Contract and the raw canonical dataset as a prerequisite of the Sourcing Contract:
1. Sourcing Notebooks
2. Sourcing Contract
3. Source Connectivity and the Raw Canonical

![transition](../98_images/AccML-Transition.png)


## Creating a Transitioning Contract Pipeline
* Create an instance of the Transitioning class, passing a unique reference name that can then be used to reference it from other Jupyter Notebooks.
* The reference name identifies the unique transitioning contract pipeline.

In [2]:
tr = Transition.from_env('synthetic_customer')

### Reset the Source Contract
Reset the source contract so we start afresh. 

**Note:** as the source report shows, the reset does not remove the property source, which is set up in the factory method `from_env()`.

In [3]:
# reset the transition contract
tr.pm_reset()

### Build the Source Contract
A Source Contract is a set of attributes that define the resource, its type and its location for retrieval and conversion to the raw canonical for transitioning. The Source Contract additionally defines the module and handler that are dynamically loaded at runtime.

By default the source contract requires:
* resource: a local file, connector, URI or URL
* source_type: a reference to the type of resource. If None, the extension of the resource is assumed
* location: a path, region or URI reference that can be used to identify the location of the resource
* module_name: a module name with full package path, e.g. 'ds_discovery.handlers.pandas_handlers'
* handler: the name of the handler class
* kwargs: additional arguments the handler might require

In this example, we are using our environment variables and default handlers to define our source contract.
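
To illustrate the idea of a module and handler being loaded dynamically at runtime, here is a minimal sketch using only the standard library. `SketchContract` and `load_handler` are hypothetical names invented for this example and are not the `ds_foundation` implementation; the standard library's `json.JSONDecoder` simply stands in for a handler class.

```python
import importlib
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchContract:
    """Hypothetical stand-in for a connector contract (not the ds_foundation class)."""
    uri: str
    module_name: str
    handler: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


def load_handler(contract: SketchContract) -> type:
    """Import the module named in the contract and return the handler class."""
    module = importlib.import_module(contract.module_name)
    return getattr(module, contract.handler)


# Illustration only: json.JSONDecoder plays the role of the handler class.
contract = SketchContract(uri="example.json", module_name="json",
                          handler="JSONDecoder", kwargs={"strict": False})

handler_cls = load_handler(contract)       # resolved from strings at runtime
handler = handler_cls(**contract.kwargs)   # kwargs are passed through to the handler
result = handler.decode('{"customers": 3}')
print(handler_cls.__name__, result)
```

Because the module and handler are held as plain strings, a contract of this shape can be saved to a file and resolved later, which is one plausible reason for storing them this way.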


In [4]:
sc = ConnectorContract(uri=os.path.join(os.environ['TR_SOURCE_PATH'], 'synthetic_customer.csv'), 
                       module_name=tr.PYTHON_MODULE_NAME, handler=tr.PYTHON_HANDLER, sep=',', encoding='latin1')
tr.set_source_contract(connector_contract=sc)

### Defining the Persist Connectivity
Once we have defined the read-only source contract, we also need to define where we are going to read and write our persisted 'cleaned' data.

When defining the resource we follow the naming convention and call a method, `file_pattern()`, that defines the clean persist name.

**Note:** You only set up the source and persist handlers once; they are then saved to the transition contract.
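
As a rough sketch of such a naming convention, the hypothetical helper below rebuilds the persist file name that appears in the connector report later in this notebook. The function name, parameters and defaults are assumptions made for illustration; the real signature of `file_pattern()` is not shown here.

```python
def sketch_file_pattern(contract_name: str, version: str = "v0.00",
                        ext: str = "pickle",
                        prefix: str = "aistac_transition") -> str:
    """Hypothetical helper: build a persist file name from the contract name."""
    return f"{prefix}_{contract_name}_{version}.{ext}"


persist_name = sketch_file_pattern("synthetic_customer")
print(persist_name)  # aistac_transition_synthetic_customer_v0.00.pickle
```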

In [5]:
pc = ConnectorContract(uri=os.path.join(os.environ['TR_PERSIST_PATH'], tr.file_pattern()), module_name=tr.PYTHON_MODULE_NAME, 
                       handler=tr.PYTHON_HANDLER)
tr.set_persist_contract(connector_contract=pc)


## Source Separation of Concerns
The source details have now been recorded in the contract pipeline.

This Source separation of concerns means:
* New Notebooks are no longer tied to the name or location of the data source
* File governance and naming convention is managed automatically
* Connectivity can be updated or reallocated independently of the data science activities
* Data location and infrastructure, through the delivery lifecycle, can be hardened without affecting the Machine Learning discovery process
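
The separation described above can be sketched with a plain dictionary acting as a connector registry. The `registry`, `set_connector` and `load` names are hypothetical and only illustrate the principle: the analysis code asks for a connector by name, so the data can be relocated without that code changing.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical registry: connector name -> location and read options.
registry = {}


def set_connector(name, uri, **kwargs):
    """Record (or re-point) where a named connector reads from."""
    registry[name] = {"uri": uri, "kwargs": kwargs}


def load(name):
    """Load rows by connector name only; the caller never sees the path."""
    entry = registry[name]
    with open(entry["uri"], newline="", **entry["kwargs"]) as fh:
        return list(csv.DictReader(fh))


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp, "raw.csv")
    second = Path(tmp, "moved.csv")
    first.write_text("id,name\n1,Ann\n", encoding="latin1")
    second.write_text("id,name\n1,Ann\n2,Bob\n", encoding="latin1")

    set_connector("read_only_connector", first.as_posix(), encoding="latin1")
    rows_before = load("read_only_connector")

    # Re-point the connector; the load call itself is unchanged.
    set_connector("read_only_connector", second.as_posix(), encoding="latin1")
    rows_after = load("read_only_connector")

print(len(rows_before), len(rows_after))
```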


In [6]:
tr.report_connectors(stylise=True)

| | connector_name | uri | module_name | handler | kwargs | query |
|---|---|---|---|---|---|---|
| 0 | pm_transition_synthetic_customer | /Users/doatridge/code/projects/prod/discovery-transition-ds/jupyter/working/contracts/aistac_pm_transition_synthetic_customer.yaml | ds_foundation.handlers.python_handlers | PythonPersistHandler | | |
| 1 | read_only_connector | /Users/doatridge/code/projects/prod/discovery-transition-ds/jupyter/working/data/0_raw/synthetic_customer.csv | ds_foundation.handlers.python_handlers | PythonPersistHandler | sep=',' encoding='latin1' | |
| 2 | persist_connector | /Users/doatridge/code/projects/prod/discovery-transition-ds/jupyter/working/data/2_transition/aistac_transition_synthetic_customer_v0.00.pickle | ds_foundation.handlers.python_handlers | PythonPersistHandler | | |


### Loading the Canonical
Now that we have recorded the file information, we no longer need to reference these details again.<br>
To load the contract data we use the transitioning method `load_source_canonical()`<br>
and then we can use the canonical dictionary report to examine the dataset.


In [8]:
df = tr.load_source_canonical()

AttributeError: 'dict' object has no attribute 'columns'

### Observations
The report presents our attribute summary as a stylised data frame, highlighting data points of interest.  We will see more of this in the next tutorial.

### Next Steps

Now that we have our raw canonical data extracted from the source and converted to the canonical, we can start the transitioning...