In [3]:
# saves you having to use print as all exposed variables are printed in the cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# core libraries
import pandas as pd
import os
from pathlib import Path

%reload_ext autoreload
%autoreload 2
# for cleaning and discovery
from ds_discovery import TransitionAgent as Transition

# Set the environment working path as the root of the Jupyter instance
os.environ['DTU_CONFIG_PATH'] = Path(os.environ['PWD']).as_posix()

import ds_discovery
print('DTU: {}'.format(ds_discovery.__version__))

DTU: 2.03.046


# Accelerated Machine learning
## Transitioning: Contract and Canonical Version control
Version control is a vital enabler for any discovery activity, particularly in Jupyter Notebooks, where free-form discovery code can quickly get out of hand.<br>
Built into the transition instance is managed governance, working under the hood, to help quickly manage naming, versioning, snapshots and backups.

### Retrieving the Transitioning instance
* To retrieve the transitioning contract we create a Transitioning instance using the unique contract name
* The transitioning object is a singleton instance and will load the current contract

In [6]:
tr = Transition.from_env('synthetic_customer')
tr.data_pm.reset_snapshots()

AttributeError: 'AbstractProperty' object has no attribute 'augmented'

### Create Backup and Snapshot
* Contract Backup creates a version-named copy of the configuration file. Backups are self-managing with a rolling versions limit
* Creating a snapshot takes a copy of the current transitioning contract and saves it as part of the Transitioning contract
* Snapshot versions can be recovered using `tr.recover_snapshot(name)`
* Finally we clean out the 

In [3]:
# Create a permanent backup copy of our current Transitioning Contract as good practice
tr.backup_contract()

# Make a snapshot of the Transitioning Contract so we can recover back to it if we need
tr.create_snapshot('reset')

# reset the current Transitioning Contract attributes
tr.reset_transition_contracts()

'synthetic_customer_#reset'

### Canonical Version control
Canonical version control lets you version the output canonical datasets, so you can experiment or progressively version the canonical datasets without overwriting earlier output.

#### Check the current version
To check the version, you call the `version` property of the transitioning instance

In [4]:
tr.version

'v0.00'

#### Changing the version
To change the version you call the method `set_version`

In [5]:
tr.set_version('v0.01')
tr.version

'v0.01'

#### What does this do?
The version number controls the naming pattern of output canonicals and artefacts, for example the output of the contract pipeline and the data dictionary Excel file.<br>
This allows you to create different versions of files without overwriting previous versions and also to return to previous versions if desired.
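
As a rough illustration of the idea, the sketch below builds version-stamped file names. The function name `versioned_filename` and the `<contract>_<artefact>_<version>.<ext>` pattern are assumptions made for this example; they are not taken from `ds_discovery`, whose actual naming pattern may differ.

```python
def versioned_filename(contract: str, artefact: str, version: str, ext: str) -> str:
    """Build a file name that embeds the version (hypothetical naming pattern)."""
    return f"{contract}_{artefact}_{version}.{ext}"


# The same contract and artefact map to different files for different versions,
# so writing under v0.01 never overwrites the file written under v0.00.
name_v0 = versioned_filename("synthetic_customer", "clean", "v0.00", "csv")
name_v1 = versioned_filename("synthetic_customer", "clean", "v0.01", "csv")
print(name_v0)
print(name_v1)
```

Because the version is part of the name, switching the version back simply points at the earlier file again.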

Let's see how this works.
#### Change back our version number 

In [6]:
tr.set_version('v0.00')
tr.version

'v0.00'

#### Run our contract pipeline
After running, we are going to pay particular attention to the `id` column and the number of nulls.

In [7]:
df = tr.refresh_clean_canonical()
tr.canonical_report(df, stylise=False).iloc[3:6]

Unnamed: 0,Attribute,dType,%_Null,%_Dom,Count,Unique,Observations


#### Change version to v0.01

In [8]:
tr.set_version('v0.01')
tr.version

'v0.01'

#### Change the source to a new source file
For the purposes of demo we are going to point to a corrupted file

**Note:** Version control **ONLY** manages the naming convention of output canonicals and artefacts and does not affect **_source contracts_** or **_contract pipelines_**<br>
For more information on managing these, see **_contract snapshots_** below

In [9]:
tr.set_source_contract(resource='synthetic_customer_corrupt.csv', sep=',', encoding='latin1', load=False)

#### Re-run the contract pipeline
After running again, we are going to pay particular attention to the `id` column and the number of nulls.

In [10]:
df = tr.refresh_clean_canonical()
tr.canonical_report(df, stylise=False).iloc[3:6]

Unnamed: 0,Attribute,dType,%_Null,%_Dom,Count,Unique,Observations
3,gender,object,0.5,0.61,500,2,Sample: M | F
4,id,object,0.4,0.002,600,600,Sample: CU_9850914 | CU_1631492 | CU_4979758
5,,float64,1.0,0.0,0,0,max=nan | min=nan | mean=nan


#### Returning to the previous version
As we can see, the corrupted file has 40% nulls in the `id` column.<br>
By resetting the version we can return to our original data without rerunning the contract pipeline


In [11]:
tr.set_version('v0.00')
tr.version

'v0.00'

In [12]:
df = tr.load_clean_canonical()
tr.discover.data_dictionary(df).iloc[3:6]

Unnamed: 0,Attribute,dType,%_Null,%_Dom,Count,Unique,Observations


#### Changing back we can see the corrupted file is still under version 0.01

In [13]:
tr.set_version('v0.01')
tr.version

'v0.01'

In [14]:
df = tr.load_clean_canonical()
tr.discover.data_dictionary(df).iloc[3:6]

Unnamed: 0,Attribute,dType,%_Null,%_Dom,Count,Unique,Observations
3,gender,object,0.5,0.61,500,2,Sample: M | F
4,id,object,0.4,0.002,600,600,Sample: CU_7760922 | CU_3413348 | CU_2270638
5,,float64,1.0,0.0,0,0,max=nan | min=nan | mean=nan


#### Remove corrupted canonical
Finally, let's remove the corrupted canonical and return everything back to normal.<br>
Note: When we remove the corrupted canonical we don't have to give any names; the removal applies to the current version only, which ensures we don't inadvertently remove the wrong data
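
To make the "no names needed" point concrete, here is a toy store in which load, save and remove all act on whichever version is currently set. The class `VersionedStore` and its file layout are hypothetical and only mimic the behaviour described in this notebook; they are not the `ds_discovery` implementation.

```python
import tempfile
from pathlib import Path


class VersionedStore:
    """Toy store keeping one file per version (hypothetical, for illustration)."""

    def __init__(self, root: Path, name: str, version: str = "v0.00"):
        self.root = root
        self.name = name
        self.version = version

    def _path(self) -> Path:
        # The target file is derived from the current version only
        return self.root / f"{self.name}_{self.version}.csv"

    def save(self, text: str) -> None:
        self._path().write_text(text)

    def load(self) -> str:
        return self._path().read_text()

    def remove(self) -> None:
        # No file name argument: only the current version's file can be removed
        self._path().unlink()


with tempfile.TemporaryDirectory() as tmp:
    store = VersionedStore(Path(tmp), "synthetic_customer")
    store.save("id\nCU_1\n")      # written under v0.00
    store.version = "v0.01"
    store.save("id\n\n")          # written under v0.01, v0.00 is untouched
    store.remove()                # removes the v0.01 file only
    store.version = "v0.00"
    original = store.load()       # the v0.00 data is still available
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

Since `remove` takes no arguments, the only way to delete a different file is to change the version first, which makes accidental deletion less likely.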

In [15]:
tr.remove_clean_canonical()

In [16]:
tr.set_source_contract(resource='synthetic_customer.csv', sep=',', encoding='latin1', load=False)
tr.set_version('v0.00')
tr.version

'v0.00'

### Contract Version Control
Unlike canonical version control, contract version control takes a copy of the whole contract.<br>
There are two types of Contract Versioning
* **Snapshot**: allows you to create, list and return to versions of your contracts
* **Backup**: a one-time backup of the current contract
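
Backups were described earlier as self-managing with a rolling versions limit. One simple way such a limit could work is sketched below; the class `RollingBackups` and the limit of two are assumptions chosen for illustration, not the library's actual mechanism.

```python
from collections import deque
from copy import deepcopy


class RollingBackups:
    """Keep at most `limit` backups, discarding the oldest first (hypothetical)."""

    def __init__(self, limit: int = 3):
        # A deque with maxlen drops the oldest entry automatically when full
        self._backups = deque(maxlen=limit)

    def backup(self, contract: dict) -> None:
        # Deep copy so later edits to the live contract do not alter the backup
        self._backups.append(deepcopy(contract))

    def latest(self) -> dict:
        return deepcopy(self._backups[-1])

    def __len__(self) -> int:
        return len(self._backups)


contract = {"version": "v0.00"}
backups = RollingBackups(limit=2)
for version in ("v0.00", "v0.01", "v0.02"):
    contract["version"] = version
    backups.backup(contract)

print(len(backups))
print(backups.latest())
```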

#### Creating a snapshot
The Transitioning object has a number of snapshot methods, the first of which is `create_snapshot`.<br>
Here we are going to create a snapshot of our current `synthetic_customer` contract. Note that it returns the name of the snapshot for reference

In [17]:
tr.create_snapshot()

'synthetic_customer_#2019-06-19_10:00'

#### Create named snapshot
This time let's create another snapshot but give it a suffix as a version placeholder. We can also add extra notes

In [18]:
start_of_test_snap = tr.create_snapshot(suffix='start_of_test', note="The beginning of our test cycle")

#### Make changes to our contract
We can now make changes to our contract knowing we have a safe return point.<br>
In this case, changing the source contract to point to another file we want to test out

In [19]:
tr.set_source_contract(resource='synthetic_customer_corrupt.csv', sep=',', encoding='latin1', load=False)
tr.report_source()

Unnamed: 0,param,values
0,resource,synthetic_customer_corrupt.csv
1,source_type,csv
2,location,/Users/doatridge/code/projects/prod/discovery-transitioning-utils/jupyter/working/data/0_raw
3,module_name,ds_discovery.handlers.pandas_handlers
4,handler,PandasHandler
5,modified,
6,sep,","
7,encoding,latin1


Let's also change the version number to make sure we don't overwrite our current canonicals.

In [20]:
tr.set_version('corrupt')
tr.version

'corrupt'

#### Viewing available snapshots
Now that we have played with our 'alternative' contract we want to return to where we were...<br>
We have our variable `start_of_test_snap` but we can also list our available `snapshots` by calling `tr.snapshots`

In [21]:
tr.snapshots

['synthetic_customer_#2019-06-19_10:00', 'synthetic_customer_#start_of_test']

#### Return to previous snapshot
To return the current contract to a previous snapshot you use `tr.recover_snapshot()` and pass it the name of the snapshot you wish to recover.<br>
It is worth noting that the recovered snapshot remains and can be returned to for as long as it exists.
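
The behaviour described above can be pictured with a small stand-in class. `SnapshotContract` is hypothetical; it borrows the method names and the `<contract>_#<suffix>` naming seen in this notebook purely for readability and makes no claim about how `ds_discovery` stores snapshots internally.

```python
from copy import deepcopy


class SnapshotContract:
    """Toy contract with named snapshots (hypothetical, for illustration)."""

    def __init__(self, name: str, properties: dict):
        self.name = name
        self.properties = properties
        self._snapshots = {}

    def create_snapshot(self, suffix: str) -> str:
        snap_name = f"{self.name}_#{suffix}"
        # Store an independent copy of the current properties
        self._snapshots[snap_name] = deepcopy(self.properties)
        return snap_name

    def recover_snapshot(self, snap_name: str) -> None:
        # Copy out of the snapshot so the stored snapshot itself is left intact
        self.properties = deepcopy(self._snapshots[snap_name])

    @property
    def snapshots(self) -> list:
        return list(self._snapshots)


contract = SnapshotContract(
    "synthetic_customer",
    {"resource": "synthetic_customer.csv", "version": "v0.00"},
)
snap = contract.create_snapshot("start_of_test")

# Experiment with the contract, then roll back to the snapshot
contract.properties["resource"] = "synthetic_customer_corrupt.csv"
contract.properties["version"] = "corrupt"
contract.recover_snapshot(snap)

print(contract.properties)
print(contract.snapshots)
```

Recovering copies the snapshot rather than consuming it, which is why the same snapshot can be recovered again later.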

In [22]:
tr.recover_snapshot(start_of_test_snap)

#### Checking our work
Now when we look at our source contract and our version, we see we have returned to our original contract

In [23]:
tr.report_source()

Unnamed: 0,param,values
0,resource,synthetic_customer.csv
1,source_type,csv
2,location,/Users/doatridge/code/projects/prod/discovery-transitioning-utils/jupyter/working/data/0_raw
3,module_name,ds_discovery.handlers.pandas_handlers
4,handler,PandasHandler
5,modified,
6,sep,","
7,encoding,latin1


In [24]:
tr.version

'v0.00'