In [1]:
# saves you having to use print as all exposed variables are printed in the cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# core libraries
import pandas as pd
import os
from pathlib import Path

%reload_ext autoreload
%autoreload 2
# for cleaning and discovery
from ds_discovery import Transition

# Set the environment working path as the root of the Jupyter instance
os.environ['DSTU_WORK_PATH'] = Path(os.environ['PWD']).as_posix()

import ds_discovery
print('DTU: {}'.format(ds_discovery.__version__))

DTU: 1.09.056


# Accelerated Machine Learning
## Transitioning: Data Sourcing
As part of the Accelerated ML Discovery Vertical, Transitioning is a foundation base truth, facilitating a **transparent** transition<br>
of the raw canonical dataset to a **fit-for-purpose** canonical dataset, enabling the optimisation of discovery analysis and the identification of **features-of-interest**.
A canonical converts source formats into a common data language: it brings over not just the dataset but also the construct of that dataset, i.e. its type, format, structure and functionality. Because we are Python-centric, we use Pandas DataFrames as our canonical.
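As a minimal sketch of what "bringing the construct" can mean in practice, the snippet below reads a small, made-up CSV into a Pandas DataFrame and inspects the types it carries. The column names and values are hypothetical and are not taken from the tutorial dataset.

```python
import io

import pandas as pd

# Hypothetical raw CSV content standing in for a source file
raw_csv = io.StringIO(
    "id,balance,last_login\n"
    "CU_1,178.30,2019-02-05 15:33\n"
    "CU_2,33.24,2019-04-04 23:54\n"
)

# Reading into a DataFrame carries over structure (rows, columns) and
# type (float, datetime), not just the raw text
df = pd.read_csv(raw_csv, parse_dates=["last_login"])

print(df.dtypes)
print(df.shape)
```

Once the data is in this form, every downstream step can rely on the same DataFrame interface regardless of where the data originally came from.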

With reference to the diagram, this notebook deals with the Sourcing Contract and the raw canonical dataset as a prerequisite of the Sourcing Contract:
1. Sourcing Notebooks
2. Sourcing Contract
3. Source Connectivity and the Raw Canonical

![transition](../98_images/AccML-Transition.png)


## Creating a Transitioning Contract Pipeline
* Create an instance of the `Transition` class, passing a unique reference name. Use the same name when you wish to reference this pipeline in other Jupyter Notebooks.
* The reference name identifies the unique transitioning contract pipeline.

In [2]:
tr = Transition('synthetic_customer')

### Reset the Source Contract
Reset the source contract so we start afresh. Printing the source report validates that our values are empty.

In [3]:
# reset the contracts and report the (now empty) source contract
tr.reset_transition_contracts()
tr.report_source()

Unnamed: 0,param,values
0,resource,
1,source_type,
2,location,
3,module_name,
4,handler,
5,modified,


### Find the files
* Use the discovery method `find_file(...)` to explore the names of the raw files
* Note that we use the file 'property manager' `file_pm` to get the `data_path`
* Because the result is itself a canonical (a Pandas DataFrame), we can manipulate it as we would our source data
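For readers without `ds_discovery` installed, a similar listing of file names and timestamps can be approximated with the standard library. This is only a sketch: `find_files_by_suffix` is a hypothetical helper, not the library's `find_file`, and it returns plain tuples rather than a DataFrame.

```python
import time
from pathlib import Path


def find_files_by_suffix(suffix, root_dir):
    """Return sorted (name, created) pairs for files under root_dir ending with suffix.

    Hypothetical helper for illustration; not part of ds_discovery.
    """
    rows = []
    for path in Path(root_dir).rglob(f"*{suffix}"):
        if path.is_file():
            # st_ctime is creation time on Windows and last metadata change on Unix
            created = time.ctime(path.stat().st_ctime)
            rows.append((path.name, created))
    # tuples sort by their first element, so this orders the rows by file name
    return sorted(rows)
```

Calling `find_files_by_suffix('.csv', some_directory)` yields one `(name, created)` pair per matching file, ordered by name.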

In [4]:
files = tr.discover.find_file('.csv', root_dir=tr.file_pm.data_path).iloc[:,[0,4]].sort_values('name', axis=0)
files

Unnamed: 0,name,created
14,ames_housing.csv,Thu May 9 11:04:27 2019
21,ames_housing_dictionary copy-checkpoint.csv,Fri Jun 7 14:04:25 2019
19,ames_housing_dictionary-checkpoint.csv,Fri Jun 7 13:54:35 2019
12,ames_housing_dictionary.csv,Fri Jun 7 14:12:07 2019
11,paribas.csv,Sun Jun 16 22:43:41 2019
15,santander.csv,Mon May 20 18:47:59 2019
18,synthetic_agent.csv,Thu Jun 13 13:27:39 2019
20,synthetic_customer-checkpoint.csv,Thu Jun 13 15:31:49 2019
13,synthetic_customer.csv,Wed Jun 19 09:58:23 2019
22,synthetic_customer_corrupt-checkpoint.csv,Tue Jun 4 20:54:12 2019


### Build the Source Contract
A Source Contract is a set of attributes that define the resource, its type and its location for retrieval and conversion to the raw canonical for transitioning. The Source Contract additionally defines the module and handler that are dynamically loaded at runtime.
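Loading a handler dynamically from a module name and a class name is commonly done with `importlib`. The sketch below shows the general technique only; `load_handler_class` is a hypothetical helper, and the standard-library `json.JSONDecoder` is used as a stand-in because the source does not show how `ds_discovery` performs the load internally.

```python
import importlib


def load_handler_class(module_name, handler):
    """Import module_name at runtime and return the attribute named handler."""
    module = importlib.import_module(module_name)
    return getattr(module, handler)


# Stand-in values from the standard library; a real contract would name
# a data handler module and class instead
handler_cls = load_handler_class("json", "JSONDecoder")
decoder = handler_cls()
print(decoder.decode('{"resource": "synthetic_customer.csv"}'))
```

Because the module and class are resolved from strings, the handler can be swapped by editing the contract rather than the notebook code.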

By default the source contract requires:
* resource: a local file, connector, URI or URL
* source_type: a reference to the type of resource; if None, the extension of the resource is assumed
* location: a path, region or URI reference that can be used to identify the location of the resource
* module_name: a module name with full package path, e.g. 'ds_discovery.handlers.pandas_handlers'
* handler: the name of the handler class
* kwargs: additional arguments the handler might require
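To make the attribute list concrete, the sketch below gathers the same fields into a plain dictionary and falls back to the file extension when `source_type` is None. `build_source_contract` is a hypothetical helper written for illustration; it is not the `ds_discovery` implementation.

```python
from pathlib import Path


def build_source_contract(resource, source_type=None, location=None,
                          module_name=None, handler=None, **kwargs):
    """Collect source attributes into a plain dict (illustrative only)."""
    if source_type is None:
        # fall back to the resource's file extension, e.g. 'csv'
        source_type = Path(resource).suffix.lstrip(".").lower()
    return {
        "resource": resource,
        "source_type": source_type,
        "location": location,
        "module_name": module_name,
        "handler": handler,
        "kwargs": kwargs,
    }


contract = build_source_contract("synthetic_customer.csv", sep=",", encoding="latin1")
print(contract["source_type"])  # csv
print(contract["kwargs"])
```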

In this example we are using the standard Pandas data frame file handlers and the localized Transitioning default path locations, so we only need to provide the resource name and any other keyword arguments that the specific file handler may need. As our file is a CSV, we define the file separator and encoding.


In [5]:
tr.set_source_contract(resource='synthetic_customer.csv', sep=',', encoding='latin1', load=False)

#### Cortex Connectivity
By comparison, in the following example we utilize the Cortex connectivity options. Here we connect to a remote Mongo database, using the Cortex handler to retrieve the specified source data.

## Source Separation of Concerns
The source details have now been recorded in the contract pipeline.

This Source separation of concerns means:
* New Notebooks are no longer tied to the name or location of the data source
* File governance and naming convention is managed automatically
* Connectivity can be updated or reallocated independently of the data science activities
* Data location and infrastructure, through the delivery lifecycle, can be hardened without affecting the Machine Learning discovery process
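The idea behind these points can be sketched with a small named-contract store: the sourcing notebook saves the connection details once, and later notebooks look them up by reference name alone. The JSON-file storage and the `save_contract` and `load_contract` helpers are assumptions made for illustration; the source does not describe how `ds_discovery` persists its contracts.

```python
import json
import tempfile
from pathlib import Path


def save_contract(store_dir, name, contract):
    """Persist a contract as JSON under its reference name."""
    path = Path(store_dir) / f"{name}.json"
    path.write_text(json.dumps(contract))
    return path


def load_contract(store_dir, name):
    """Look up a contract by reference name only."""
    return json.loads((Path(store_dir) / f"{name}.json").read_text())


with tempfile.TemporaryDirectory() as store_dir:
    # done once, in the sourcing notebook
    save_contract(store_dir, "synthetic_customer",
                  {"resource": "synthetic_customer.csv", "sep": ","})
    # any later notebook needs only the reference name
    reloaded = load_contract(store_dir, "synthetic_customer")

print(reloaded["resource"])
```

If the file is later moved or replaced by a database connection, only the stored contract changes; the notebooks that load it by name stay the same.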


### Loading the Canonical
Now that we have recorded the file information, we no longer need to reference these details again.<br>
To load the contract data we use the transitioning method `load_source_canonical()`,<br>
and then we can use the canonical dictionary report to examine the data set.


In [6]:
df = tr.load_source_canonical()
tr.canonical_report(df)

Unnamed: 0,Attribute,dType,%_Null,%_Dom,Count,Unique,Observations
0,age,float64,15.0%,0.5%,425,422,max=88.09299999999999 | min=20.326 | mean=46.34
1,balance,float64,0.0%,0.4%,500,493,max=724.39 | min=33.24 | mean=178.3
2,forename,object,0.0%,0.4%,500,498,Sample: Ruthie | Wilma | Ayah
3,gender,object,0.0%,66.2%,500,2,Sample: M | F
4,id,object,0.0%,0.2%,500,500,Sample: CU_4744313 | CU_7953126 | CU_8389550
5,last_login,object,0.0%,0.4%,500,497,Sample: 02-05-19 15:33 | 04-04-19 23:54 | 03-26-19 13:12
6,,float64,100.0%,0.0%,0,0,max=nan | min=nan | mean=nan
7,online,int64,0.0%,77.6%,500,2,max=1 | min=0 | mean=0.22
8,profession,object,10.0%,24.4%,450,14,Sample: Accountant II | Food Chemist | Statistician III
9,single cat,object,40.0%,100.0%,300,1,Sample: A


### Observations
The report presents our attribute summary as a stylised data frame, highlighting data points of interest.  We will see more of this in the next tutorial.
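To give a feel for what the report columns measure, the sketch below computes similar figures with plain Pandas on a tiny, made-up data frame. It is not the `canonical_report` implementation. In particular, `%_Dom` is assumed here to be the share of non-null values taken by the most frequent value, which is an inference from the output above rather than a documented definition.

```python
import pandas as pd

# Small hypothetical dataset; values are made up for illustration
df = pd.DataFrame({
    "gender": ["M", "M", "F", "M"],
    "age": [34.0, float("nan"), 51.5, 27.0],
})

rows = []
for col in df.columns:
    series = df[col]
    non_null = series.dropna()
    rows.append({
        "Attribute": col,
        "dType": str(series.dtype),
        # share of missing values in the column
        "%_Null": round(series.isna().mean() * 100, 1),
        # assumption: share of non-null values taken by the most frequent value
        "%_Dom": round(non_null.value_counts().iloc[0] / len(non_null) * 100, 1),
        "Count": int(series.count()),
        "Unique": int(non_null.nunique()),
    })

summary = pd.DataFrame(rows)
print(summary)
```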

### Next Steps

Now that our raw data has been extracted from the source and converted to the canonical, we can start the transitioning...