In [1]:
# saves you having to use print as all exposed variables are printed in the cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# core libraries
import pandas as pd
import os
from pathlib import Path

%reload_ext autoreload
%autoreload 2
# for cleaning and discovery
from ds_discovery import Transition
from ds_behavioral.simulator.cortex_sim import CortexTransitionAgent

# Set the environment working path as the root of the Jupyter instance
os.environ['DSTU_WORK_PATH'] = Path(os.environ['PWD']).as_posix()

import ds_discovery
print('DTU: {}'.format(ds_discovery.__version__))

DTU: 1.09.056


### Reset our example
To ensure we have the appropriate starting point we:
* regenerate the original Customer dataset
* rerun the transitioning notebook for the customer dataset

In [2]:
# run the synthetic data gen
%run ../global_functions.ipynb
Synthetic.Customer()

# run both transitioning contracts
%run tr_synthetic_customer.ipynb

# Accelerated Machine learning
## Transitioning: Productionisation
As part of the Accelerated ML discovery vertical, productionisation of the discovery activities from multiple Notebooks and thought lines proves a challenging and time-consuming task. Accelerated ML, through its Separation of Concerns (SoC) and the extraction of **_intent_** from the Notebook discovery activities, allows us to productionise early without having to re-code. The diagram below illustrates the **_Discovery Transitioning_** productionisation process.
![transition-prod](../98_images/AccML-Transition-Prod.png)

With our initial investigation and discovery around the transitioning we completed steps 1, 2, 4, 5 and 6.

The following sections illustrate the final migration and reuse of the productionisation process into Cortex.

## Retrieving the Transitioning Contract Instance
Firstly we need to get back our Discovery Transitioning Contract, in this case `synthetic_customer`.

In [3]:
tr = Transition('synthetic_customer')

### Source Contract 
To this point we have created our Source Contract...


In [24]:
tr.report_source()

Unnamed: 0,param,values
0,resource,synthetic_customer.csv
1,source_type,csv
2,location,/Users/doatridge/code/projects/prod/discovery-transitioning-utils/jupyter/working/data/0_raw
3,module_name,ds_discovery.handlers.pandas_handlers
4,handler,PandasHandler
5,modified,1560934820
6,sep,","
7,encoding,latin1


### Transitioning Contract Pipeline
... and created our Transitioning Contract Pipeline from the Discovery and Transitioning Phase.

In [25]:
tr.report_cleaners()

Unnamed: 0,level,intent,parameters
0,0.0,auto_clean_header,"rename_map={'start': 'start_date'}, replace_spaces=_"
1,,auto_remove_columns,"null_min=0.99, predominant_max=0.9, nulls_list=['']"
2,,auto_to_category,"null_max=0.7, unique_max=20"
3,,to_bool_type,"headers=online, drop=False, bool_map={1: True}"
4,,to_category_type,"headers=['gender', 'profession'], drop=False"
5,,to_date_type,"headers=start_date, drop=False, as_num=False, day_first=True, year_first=False"
6,,to_float_type,"dtype=['float'], exclude=False, fillna=nan, errors=coerce, precision=3"
7,,to_str_type,"dtype=['object'], exclude=False, nulls_list=['', 'nan']"


## Cortex Transitioning Agent 
The Cortex Transitioning Agent is a reusable library of transitioning methods that can be actioned against a dataset. It represents the **_reusable code_**, the first part of the Separation of Concerns of **_Accelerated ML_**.

In order to use this production-ready Cortex agent we must first create or reuse an instance of the `CortexTransitionAgent` that runs within the Cortex Ecosystem.
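As a rough illustration of what such an agent amounts to, the sketch below uses a hypothetical `MiniTransitionAgent` (not the real `CortexTransitionAgent`, whose internals are not shown in this notebook). The agent only holds two contracts and replays them; the data location and the cleaning parameters are supplied from outside. The contract keys `text` and `sep` and the `rename_headers` function are assumptions made for this example.

```python
import csv
import io


class MiniTransitionAgent:
    """Hypothetical stand-in for a transition agent: holds contracts, owns no data logic."""

    def __init__(self, name):
        self.name = name
        self.source = None   # source contract: where the raw data lives and how to read it
        self.pipeline = []   # parameterised intent: what to do with the data

    def set_source_contract(self, source):
        self.source = source

    def set_transition_pipeline(self, pipeline):
        self.pipeline = list(pipeline)

    def run_transition_pipeline(self):
        # Load rows as described by the source contract.
        reader = csv.DictReader(io.StringIO(self.source["text"]),
                                delimiter=self.source["sep"])
        rows = [dict(row) for row in reader]
        # Replay each (function, parameters) pair in order.
        for func, params in self.pipeline:
            rows = func(rows, **params)
        return rows


def rename_headers(rows, rename_map):
    """Reusable code: rename columns according to rename_map."""
    return [{rename_map.get(k, k): v for k, v in row.items()} for row in rows]


agent = MiniTransitionAgent("customer")
agent.set_source_contract({"text": "id,start\nCU_1,01-02-2018\n", "sep": ","})
agent.set_transition_pipeline([(rename_headers, {"rename_map": {"start": "start_date"}})])
result = agent.run_transition_pipeline()
print(result)
```

Swapping either contract changes what the agent produces without touching the class itself, which is the property the rest of this notebook relies on.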

In [26]:
cortex = CortexTransitionAgent('customer')

### Setting the source contract
As we saw in the `transition_01_source.ipynb` notebook, we set up our **_Source Contract_** (2) and can easily take advantage of the huge array of Cortex Connectivity handlers.

At this point we could, if we had not already, quickly switch from our local source contract to a different source, as we are guaranteed our raw dataset in its canonical form, separating the source format from our canonical environment.

In this example we demonstrate connecting to a MongoDB instance:

### Passing the Source Contract to the Cortex Transitioning Skill
As Cortex is fully integrated into the ML discovery vertical we simply pass over the **_Source Contract_** of intent to the Cortex Agent. This sets up the **_canonical dataset_**, our second part of the Separation of Concerns of **_Accelerated ML_**.

In [27]:
cortex.set_source_contract(tr.data_pm.source)
cortex.report_source()

Unnamed: 0,param,values
0,resource,synthetic_customer.csv
1,source_type,csv
2,location,/Users/doatridge/code/projects/prod/discovery-transitioning-utils/jupyter/working/data/0_raw
3,module_name,ds_discovery.handlers.pandas_handlers
4,handler,PandasHandler
5,modified,1560934820
6,sep,","
7,encoding,latin1
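The `module_name` and `handler` entries in the report suggest that the handler is looked up by name at run time. How `ds_discovery` actually does this is not shown here, so the following is only a minimal sketch of the idea, assuming a contract is a plain dictionary and using the standard-library `csv.DictReader` in place of `PandasHandler`. The function `load_from_contract` is hypothetical.

```python
import importlib
import io


def load_from_contract(contract, stream):
    """Resolve the handler named in the contract and use it to read the stream."""
    module = importlib.import_module(contract["module_name"])
    handler = getattr(module, contract["handler"])
    # Everything except the lookup keys is passed to the handler as keyword arguments.
    kwargs = {k: v for k, v in contract.items() if k not in ("module_name", "handler")}
    return [dict(row) for row in handler(stream, **kwargs)]


# Hypothetical contract: the standard-library csv module stands in for PandasHandler.
contract = {"module_name": "csv", "handler": "DictReader", "delimiter": ","}
rows = load_from_contract(contract, io.StringIO("id,online\nCU_1,1\nCU_2,0\n"))
print(rows)
```

Because the handler is named in data rather than imported in code, pointing the contract at a different module or handler changes the source without changing the code that consumes the rows.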


### Setting the Transitioning Contract
We also saw in the `transition_02_transition.ipynb` notebook how our discovery and cleaning tools create the **_Transitioning Contract_** (6), which holds our **_Parameterised Intent_**, more on which in a minute.

In [28]:
tr.report_cleaners()

Unnamed: 0,level,intent,parameters
0,0.0,auto_clean_header,"rename_map={'start': 'start_date'}, replace_spaces=_"
1,,auto_remove_columns,"null_min=0.99, predominant_max=0.9, nulls_list=['']"
2,,auto_to_category,"null_max=0.7, unique_max=20"
3,,to_bool_type,"headers=online, drop=False, bool_map={1: True}"
4,,to_category_type,"headers=['gender', 'profession'], drop=False"
5,,to_date_type,"headers=start_date, drop=False, as_num=False, day_first=True, year_first=False"
6,,to_float_type,"dtype=['float'], exclude=False, fillna=nan, errors=coerce, precision=3"
7,,to_str_type,"dtype=['object'], exclude=False, nulls_list=['', 'nan']"


### Passing the Transitioning Contract to the Cortex Transitioning Skill
As with the Source Contract, we simply pass over the Transitioning Contract, or **_Parameterised Intent_**, our third part of the Separation of Concerns of **_Accelerated ML_**.

In [29]:
cortex.set_transition_pipeline(tr.data_pm.cleaners)
cortex.report_transition()

Unnamed: 0,level,intent,parameters
0,0.0,auto_clean_header,"rename_map={'start': 'start_date'}, replace_spaces=_"
1,,auto_remove_columns,"null_min=0.99, predominant_max=0.9, nulls_list=['']"
2,,auto_to_category,"null_max=0.7, unique_max=20"
3,,to_bool_type,"headers=online, drop=False, bool_map={1: True}"
4,,to_category_type,"headers=['gender', 'profession'], drop=False"
5,,to_date_type,"headers=start_date, drop=False, as_num=False, day_first=True, year_first=False"
6,,to_float_type,"dtype=['float'], exclude=False, fillna=nan, errors=coerce, precision=3"
7,,to_str_type,"dtype=['object'], exclude=False, nulls_list=['', 'nan']"
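The report above is effectively a list of method names with their parameters. A minimal sketch of how such a list could be replayed against a library of reusable methods is shown below. The `MiniCleaners` class and its two methods are hypothetical simplifications that borrow names from the report. They operate on a list of dictionaries rather than a DataFrame, and their exact behaviour (for example, mapping unmapped values to `False`) is an assumption.

```python
class MiniCleaners:
    """Hypothetical library of reusable cleaning methods working on a list of dict rows."""

    @staticmethod
    def auto_clean_header(rows, rename_map=None, replace_spaces="_"):
        rename_map = rename_map or {}
        cleaned = []
        for row in rows:
            new_row = {}
            for key, value in row.items():
                key = rename_map.get(key, key).strip().replace(" ", replace_spaces)
                new_row[key] = value
            cleaned.append(new_row)
        return cleaned

    @staticmethod
    def to_bool_type(rows, headers, bool_map):
        # Assumption: values found in bool_map take the mapped value, all others become False.
        return [{k: (bool_map.get(v, False) if k == headers else v) for k, v in row.items()}
                for row in rows]


# The intent is plain data: method names and parameters, no code.
intent = [
    ("auto_clean_header", {"rename_map": {"start": "start_date"}, "replace_spaces": "_"}),
    ("to_bool_type", {"headers": "online", "bool_map": {1: True}}),
]

rows = [{"start": "02-01-2018", "online": 1, "last login": "x"}]
for name, params in intent:
    rows = getattr(MiniCleaners, name)(rows, **params)
print(rows)
```

Since the intent holds only names and parameters, it can be stored, reported and handed to another process that owns the method library.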


### Setting the Augmented Knowledge
As part of the ML discovery, augmented knowledge was captured that can now be fed into the Cortex Transitioning Skill and referenced within Cortex as part of the wider **_Augmented Knowledge Catalogue_**, which services various challenges around explainability, transparency and ethical AI through provision of a richer view of collective subject matter.

In [30]:
tr.report_notes()

Unnamed: 0,section,label,date,text
0,overview,notes,2019-06-19 10:00,The file is a synthetic customer data file created for this demonstration
1,,source,2019-06-19 10:00,This was generated using the Discovery Behavioral Synthetic Data Generator
2,,,2019-06-19 10:00,The script to rerun the data generation can be found in the synthetic scripts folder
3,attribute,start,2019-06-19 10:00,changing this to start_date so it being a date is obvious


### Passing the Augmented Knowledge to the Cortex Transitioning Skill
Again, this is a simple process of passing over the **_Augmented Knowledge_**, an augmentation of **_Accelerated ML_**.

In [31]:
cortex.set_augmented_knowledge(tr.data_pm.notes)
cortex.report_notes()

Unnamed: 0,section,label,date,text
0,overview,notes,2019-06-19 10:00,The file is a synthetic customer data file created for this demonstration
1,,source,2019-06-19 10:00,This was generated using the Discovery Behavioral Synthetic Data Generator
2,,,2019-06-19 10:00,The script to rerun the data generation can be found in the synthetic scripts folder
3,attribute,start,2019-06-19 10:00,changing this to start_date so it being a date is obvious


#### That's it, we are now fully productionised

--------------

------------------------
## Running the Cortex Production Pipeline
Running the Cortex production pipeline in Cortex is now a simple method call, the outcome of which is our Canonical Dataset, ready for feature extraction and model build: typed, cleaned and prepared, and in exactly the same canonical format as all the other datasets the Data Scientist is managing.

In [32]:
df = cortex.run_transition_pipeline()
tr.canonical_report(df)

Unnamed: 0,Attribute,dType,%_Null,%_Dom,Count,Unique,Observations
0,age,float64,15.0%,0.2%,425,425,max=79.313 | min=20.203 | mean=45.94
1,balance,float64,0.0%,0.4%,500,493,max=965.35 | min=33.15 | mean=185.35
2,forename,object,0.0%,0.4%,500,499,Sample: Elise | Matylda | Brian
3,gender,category,0.0%,57.8%,500,2,F|M
4,id,object,0.0%,0.2%,500,500,Sample: CU_8727780 | CU_5110035 | CU_8367034
5,online,bool,0.0%,79.4%,500,2,False | True
6,profession,category,10.0%,24.4%,450,15,Accounting Assistant I|Accounting Assistant IV|Computer Systems Analyst II|Computer Systems Analyst ...
7,start_date,datetime64[ns],0.0%,1.6%,500,267,max=2018-12-30 00:00:00 | min=2018-01-02 00:00:00 | yr mean= 2018
8,surname,object,0.0%,0.2%,500,500,Sample: Nauarro | Villanveua | Braverman
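For readers unfamiliar with the report columns, the sketch below shows one plausible way to derive `%_Null`, `%_Dom`, `Count` and `Unique` for a single column. It is an assumption inferred from the figures above, not the `canonical_report` implementation: `%_Null` is taken as the share of missing values, `Count` as the number of non-null values, and `%_Dom` as the share of non-null values taken by the most frequent one. The function `column_profile` is hypothetical.

```python
from collections import Counter


def column_profile(values):
    """Profile one column: null share, dominant-value share, non-null count, unique count."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    dominant = max(counts.values()) if counts else 0
    return {
        "%_Null": round(100 * (total - len(non_null)) / total, 1),
        # Assumption: dominance is measured against non-null values only.
        "%_Dom": round(100 * dominant / len(non_null), 1) if non_null else 0.0,
        "Count": len(non_null),
        "Unique": len(counts),
    }


gender = ["F", "F", "M", "F", None]
profile = column_profile(gender)
print(profile)
```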


------------

----------------
## Accelerated ML and the Separation of Concerns
Let's now explore how Separation of Concerns works and how we can remain productionised while still discovering our datasets.

----------------
### Changing the dataset
Let's consider the case where our dataset changes: it is still Customer Data, but the file has been updated or its origin has changed.

In this example we 
* Generate an update to our data using our Synthetic data generator adding additional attributes so we can clearly see the change.
* Run our production contract pipeline against the new data
* Report the results


In [33]:
# run the synthetic data gen
Synthetic.Customer(extra=True)

# Run the production contract
df = cortex.run_transition_pipeline()
tr.canonical_report(df)


Unnamed: 0,Attribute,dType,%_Null,%_Dom,Count,Unique,Observations
0,age,float64,15.0%,0.5%,425,424,max=89.645 | min=20.089 | mean=46.09
1,balance,float64,0.0%,0.4%,500,493,max=998.64 | min=33.27 | mean=190.27
2,forename,object,0.0%,0.4%,500,497,Sample: Winifred | Gabriel | Mustafa
3,gender,category,0.0%,63.4%,500,2,F|M
4,id,object,0.0%,0.2%,500,500,Sample: CU_6947984 | CU_9890116 | CU_2704139
5,last_login,object,0.0%,0.4%,500,497,Sample: 04-21-19 12:44 | 03-02-19 05:32 | 04-25-19 14:36
6,online,bool,0.0%,79.8%,500,2,False | True
7,profession,category,10.0%,23.8%,450,15,Analyst Programmer|Community Outreach Specialist|Director of Sales|Food Chemist|Junior Executive|Mar...
8,start_date,datetime64[ns],0.0%,1.2%,500,271,max=2018-12-30 00:00:00 | min=2018-01-02 00:00:00 | yr mean= 2018
9,status,category,0.0%,54.0%,500,4,Active|Closed|Pending|Suspended


### Observations: 
As we see, without any changes to code or intent we have our new transitioned canonical with the extra fields ``last_login`` and ``status``.
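New columns flow through because the automatic cleaners select columns by their properties rather than by name. The sketch below illustrates this for a rule in the spirit of `auto_to_category` with its `null_max` and `unique_max` parameters. The function `category_candidates` is hypothetical, and the exact threshold semantics of the real cleaner (inclusive or exclusive, and which dtypes it considers) are assumptions.

```python
def category_candidates(columns, null_max=0.7, unique_max=20):
    """Return the names of columns that qualify for conversion to a category type."""
    selected = []
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        null_ratio = 1 - len(non_null) / len(values)
        # Assumption: both thresholds are inclusive.
        if null_ratio <= null_max and len(set(non_null)) <= unique_max:
            selected.append(name)
    return selected


columns = {
    # 20 rows drawn from 4 distinct values: few unique values, so it qualifies.
    "status": ["Active", "Closed", "Pending", "Suspended"] * 5,
    # 25 distinct timestamps: too many unique values, so it is left alone.
    "last_login": [f"04-{day:02d}-19 12:44" for day in range(1, 26)],
}
selected = category_candidates(columns)
print(selected)
```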

### Augmented Knowledge
We can even add notes to pass through into the Cortex Ecosystem to capture the difference in the dataset.

In [34]:
tr.add_notes(label='source', text='The source has been rerun and new attributes added')
tr.add_attribute_notes(attribute='last_login', text="Last_login has been included in the new run.")
tr.add_attribute_notes(attribute='status', text="status has also been added and auto categorised")
tr.add_notes(note_type='dictionary', label='last_login', text="The last time the customer logged into the online system")
tr.add_notes(note_type='dictionary', label='status', text="the current status of the customer")
cortex.set_augmented_knowledge(tr.data_pm.notes)
cortex.report_notes()

Unnamed: 0,section,label,date,text
0,overview,notes,2019-06-19 10:00,The file is a synthetic customer data file created for this demonstration
1,,source,2019-06-19 10:00,This was generated using the Discovery Behavioral Synthetic Data Generator
2,,,2019-06-19 10:00,The script to rerun the data generation can be found in the synthetic scripts folder
3,,,2019-06-19 10:00,The source has been rerun and new attributes added
4,attribute,last_login,2019-06-19 10:00,Last_login has been included in the new run.
5,,start,2019-06-19 10:00,changing this to start_date so it being a date is obvious
6,,status,2019-06-19 10:00,status has also been added and auto categorised
7,dictionary,last_login,2019-06-19 10:00,The last time the customer logged into the online system
8,,status,2019-06-19 10:00,the current status of the customer
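The report layout, with repeated section and label values left blank, can be reproduced from a simple nested mapping. The sketch below is a hypothetical `MiniNotes` catalogue, not the `ds_discovery` notes store. It assumes notes are grouped by section and then by label, and it omits dates and sorting for brevity.

```python
from collections import defaultdict


class MiniNotes:
    """Hypothetical notes catalogue: section -> label -> list of note texts."""

    def __init__(self):
        self._notes = defaultdict(lambda: defaultdict(list))

    def add(self, section, label, text):
        self._notes[section][label].append(text)

    def report(self):
        """Flatten to (section, label, text) rows, blanking repeated section and label values."""
        rows = []
        last_section = last_label = None
        for section, labels in self._notes.items():
            for label, texts in labels.items():
                for text in texts:
                    rows.append((
                        section if section != last_section else "",
                        label if (section, label) != (last_section, last_label) else "",
                        text,
                    ))
                    last_section, last_label = section, label
        return rows


notes = MiniNotes()
notes.add("overview", "source", "Generated by the synthetic data generator")
notes.add("overview", "source", "The source has been rerun and new attributes added")
notes.add("attribute", "status", "status has also been added and auto categorised")
report = notes.report()
for row in report:
    print(row)
```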


### Changing the Parameterised Intent
Of the new fields, ``status`` has already been auto-converted to an appropriate data type as part of the ``auto_to_category(...)`` cleaner, but ``last_login`` is still an ``object`` type.

We don't need to pull everything out of production to change our intent. Because of Separation of Concerns, we can change our **_Parameterised Intent_** without changing any code: we simply change the transitioning contract we give to the Cortex Transitioning Agent.

To do this we:
1. update the Transitioning instance with the additional ``to_date_type(...)`` cleaner
2. set the new cleaner with level -1 so that if a date cleaner already exists it won't be overwritten. -1 places the cleaner in the next available level.
3. update the Cortex Agent with the new Transitioning Contract
4. re-run the Cortex Transitioning Contract Pipeline
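The role of the level in step 2 can be sketched as follows, assuming the pipeline is stored as a mapping of level to intent name to parameters, so that two entries with the same intent name cannot share a level. This storage layout and the `add_intent` function are assumptions for illustration, not the `set_cleaner` implementation.

```python
def add_intent(pipeline, intent, params, level=0):
    """Store an intent at a level; level=-1 means one past the highest level in use."""
    if level == -1:
        level = max(pipeline, default=-1) + 1
    pipeline.setdefault(level, {})[intent] = params
    return level


pipeline = {0: {"to_date_type": {"headers": "start_date"}}}

# Writing to level 0 would replace the existing to_date_type entry for start_date.
# level=-1 keeps it and places the new cleaner at the next free level instead.
used = add_intent(pipeline, "to_date_type", {"headers": "last_login"}, level=-1)
print(used, pipeline)
```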



In [35]:
tr.set_cleaner(tr.clean.to_date_type(df, headers='last_login', day_first=True, inplace=True), level=-1)
tr.report_cleaners()

Unnamed: 0,level,intent,parameters
0,0.0,auto_clean_header,"rename_map={'start': 'start_date'}, replace_spaces=_"
1,,auto_remove_columns,"null_min=0.99, predominant_max=0.9, nulls_list=['']"
2,,auto_to_category,"null_max=0.7, unique_max=20"
3,,to_bool_type,"headers=online, drop=False, bool_map={1: True}"
4,,to_category_type,"headers=['gender', 'profession'], drop=False"
5,,to_date_type,"headers=start_date, drop=False, as_num=False, day_first=True, year_first=False"
6,,to_float_type,"dtype=['float'], exclude=False, fillna=nan, errors=coerce, precision=3"
7,,to_str_type,"dtype=['object'], exclude=False, nulls_list=['', 'nan']"
8,1.0,to_date_type,"headers=last_login, drop=False, as_num=False, day_first=True, year_first=False"
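The `day_first` parameter matters because a string such as `04-12-19` is ambiguous. The sketch below uses only the standard-library `datetime.strptime` with explicit formats to show the two readings. How `to_date_type` itself resolves values that do not fit the chosen order is not shown in this notebook, so this only illustrates the ambiguity.

```python
from datetime import datetime

raw = "04-12-19 12:44"

# The same text read with the day first, then with the month first.
day_first = datetime.strptime(raw, "%d-%m-%y %H:%M")    # 4 December 2019
month_first = datetime.strptime(raw, "%m-%d-%y %H:%M")  # 12 April 2019

print(day_first.date(), month_first.date())
```

It is worth checking a few raw sample values against the chosen order before fixing the parameter in the contract.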


In [36]:
#set the Transitioning Contract
cortex.set_transition_pipeline(tr.data_pm.cleaners)

# Now run the pipeline and view the results
df = cortex.run_transition_pipeline()
tr.canonical_report(df, stylise=True)

Unnamed: 0,Attribute,dType,%_Null,%_Dom,Count,Unique,Observations
0,age,float64,15.0%,0.5%,425,424,max=89.645 | min=20.089 | mean=46.09
1,balance,float64,0.0%,0.4%,500,493,max=998.64 | min=33.27 | mean=190.27
2,forename,object,0.0%,0.4%,500,497,Sample: Honey | Shane | Alan
3,gender,category,0.0%,63.4%,500,2,F|M
4,id,object,0.0%,0.2%,500,500,Sample: CU_6243334 | CU_2247011 | CU_8579526
5,last_login,datetime64[ns],0.0%,0.4%,500,497,max=2019-12-04 20:00:00 | min=2019-01-02 01:37:00 | yr mean= 2019
6,online,bool,0.0%,79.8%,500,2,False | True
7,profession,category,10.0%,23.8%,450,15,Analyst Programmer|Community Outreach Specialist|Director of Sales|Food Chemist|Junior Executive|Mar...
8,start_date,datetime64[ns],0.0%,1.2%,500,271,max=2018-12-30 00:00:00 | min=2018-01-02 00:00:00 | yr mean= 2018
9,status,category,0.0%,54.0%,500,4,Active|Closed|Pending|Suspended


#### Results:
As we can see, ``last_login`` is now correctly typed, and notice we didn't have to change anything in our production environment. The Separation of Concerns results in us being able to change the **_Intent_** without changing production code or production parameters.

... and we did all this in our familiar Jupyter Notebook and, more importantly, in the middle of a notebook that has lots of other ideas, without the Data Scientist having to write production code or go back through all the other notebooks to find where the original cleaning was written.

---------------------

---------------------
## Changed Intent on a Different Dataset
Having seen how we can change ``Intent`` without affecting production code, we now want to transition a completely new dataset with a completely different intent.

Normally this would involve creating a new set of skills, but not here. In fact we are going to use the exact same instance of the Cortex Transitioning Agent, so we don't even have to spin up a new instance...


### Retrieving the Transitioning Instance
Firstly we retrieve the Transitioning instance, in this case for the customer agents.

In [37]:
tr_agent = Transition('synthetic_agent')

### Passing the Source Contract to the Cortex Transitioning Skill
As before, Cortex is fully integrated into the ML discovery vertical so we simply pass over the **_Source Contract_** of intent to the Cortex Agent. This sets up the **_canonical dataset_**, our second part of the Separation of Concerns of **_Accelerated ML_**.

In [38]:
cortex.set_source_contract(tr_agent.data_pm.source)
cortex.report_source()

Unnamed: 0,param,values
0,resource,synthetic_agent.csv
1,source_type,csv
2,location,/Users/doatridge/code/projects/prod/discovery-transitioning-utils/jupyter/working/data/0_raw
3,module_name,ds_discovery.handlers.pandas_handlers
4,handler,PandasHandler
5,modified,1560428859
6,encoding,latin1
7,sep,","


### Passing the Transitioning Contract to the Cortex Transitioning Skill
With this new transitioning instance, as before, we pass the **_Parameterised Intent_** to the Cortex Transitioning Agent.

In [39]:
# set the agent transitioning pipeline
cortex.set_transition_pipeline(tr_agent.data_pm.cleaners)
cortex.report_transition()

Unnamed: 0,level,intent,parameters
0,0.0,auto_clean_header,replace_spaces=_
1,,auto_remove_columns,"null_min=0.98, nulls_list=[''], predominant_max=0.99"
2,,auto_to_category,"null_max=0.7, unique_max=45"
3,,to_date_type,"as_num=False, day_first=False, drop=False, headers=['call_date', 'duration'], year_first=False"
4,,to_remove,"re_ignore_case=False, regex=['stat']"
5,,to_str_type,"drop=False, headers=['call_id', 'customer_id'], nulls_list=['']"


### Run the new Intent 

In [40]:
df = cortex.run_transition_pipeline()
tr.canonical_report(df)

Unnamed: 0,Attribute,dType,%_Null,%_Dom,Count,Unique,Observations
0,agent,category,0.0%,5.2%,10000,40,Alana |Aleksandra |Amalie |Aniyah |Beryl|Carole |Casey |Cecilia|Claudette|Cornelia|Elina |Elisa|Ella...
1,call_date,datetime64[ns],0.0%,0.0%,10000,9973,max=2030-08-18 22:27:16 | min=2001-05-18 06:15:29 | yr mean= 2016
2,call_id,object,0.0%,0.0%,10000,10000,Sample: 4227125 | 5911774 | 8481374
3,complaint,category,0.0%,15.9%,10000,29,All points not addressed|Customer payment processed incorrectly|FA advice queried|Fund Performance -...
4,contact,category,0.0%,53.7%,10000,13,Account manager|E-mail|E-mail & Phone Call|Fax|Internet|Letter|Letter & Phone Call|MyPortal|Phone Ca...
5,customer_id,object,0.0%,0.1%,10000,1000,Sample: CU_6575128 | CU_5882614 | CU_9925330
6,duration,datetime64[ns],1.5%,0.3%,9853,1259,max=2019-06-19 23:59:00 | min=2019-06-19 00:00:00 | yr mean= 2019
7,escalated,int64,0.0%,95.4%,10000,2,max=1 | min=0 | mean=0.05
8,referred,int64,0.0%,95.4%,10000,2,max=1 | min=0 | mean=0.05


#### Result:
Without changing any production code we have used the exact same Cortex Skill with a completely different set of data and a different intent.