In [1]:
%run ../base_setup.ipynb

DTU: 1.08.027
DBU: 1.01.003


# Accelerated Machine Learning
## Transitioning: Discovery and Observations
As part of the Accelerated ML discovery vertical, Transitioning is a foundation base truth that separates raw data<br>
from data that is fit for purpose for discovery analysis and the identification of features-of-interest.

### Retrieving the Transitioning Contract Pipeline
* to retrieve the transitioning contract we create a Transitioning instance using the unique reference name
* the transitioning object is a singleton instance and will load the current pipeline contract, or create a new one if it doesn't exist


In [2]:
tr = Transition('synthetic')

### Loading the source dataset
All data coming through the Accelerated ML vertical is now in __canonical__ form, be it source data, reference data, data dictionaries or value-add information.<br>
In this case our __canonical__ is a Pandas DataFrame, as it is the canonical most familiar to Python users and Data Scientists. We will see more of this later.


In [3]:
df = tr.get_source_data()

### Transitioning Discovery
Within the Transitioning instance are a number of discovery tools to help with the visualisation and selection of data.<br>
Among these is a set of discovery dictionaries:
* Data Dictionary
* Stats Dictionary
* Analysis Dictionary

At this stage we are going to use `tr.discover.data_dictionary(df)` to help us examine the raw canonical.<br>
Note: as all dictionaries are also canonical in form, they can also be used as a source of features-of-interest.

We will see more of this later.


#### Data Dictionary
The Data Dictionary provides a reference view of the dataset's attribute properties to allow for attribute typing and selection.

In [4]:
tr.discover.data_dictionary(df)

Unnamed: 0,Attribute,Type,% Nulls,Count,Unique,Observations
0,age,float64,0.15,4250,4011,max=89.932 | min=20.07 | mean=46.92
1,balance,float64,0.0,5000,4386,max=990.62 | min=2.81 | mean=184.23
2,forename,object,0.0,5000,3962,Sample: Jared | Scarlett-Rose | Silvia
3,gender,object,0.0,5000,2,Sample: M | F
4,id,object,0.0,5000,5000,Sample: CU_7659073 | CU_7864005 | CU_5298287
5,,float64,1.0,0,0,max=nan | min=nan | mean=nan
6,online,float64,0.35,3250,2,max=1.0 | min=0.0 | mean=0.21
7,profession,object,0.0,5000,14,Sample: Health Coach III | Help Desk Operator | Senior Quality Engineer
8,single cat,object,0.4,3000,1,Sample: A
9,single num,float64,0.2,4000,1,max=1.0 | min=1.0 | mean=1.0
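
As a rough illustration of what a data dictionary summarises, the sketch below builds a similar per-attribute view using plain pandas. The function name `sketch_data_dictionary` and the sample frame are assumptions made for this example; it is not the implementation behind `tr.discover.data_dictionary`.

```python
import pandas as pd


def sketch_data_dictionary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: one summary row per attribute of df."""
    rows = []
    for name in df.columns:
        col = df[name]
        rows.append({
            'Attribute': name,
            'Type': str(col.dtype),
            '% Nulls': round(col.isna().mean(), 2),  # share of null rows
            'Count': int(col.count()),               # populated rows
            'Unique': int(col.nunique()),            # distinct non-null values
        })
    return pd.DataFrame(rows)


# small made-up sample, not the synthetic dataset used in this notebook
sample = pd.DataFrame({
    'age': [25.0, None, 40.0, 31.0],
    'gender': ['M', 'F', 'F', 'M'],
})
dictionary = sketch_data_dictionary(sample)
print(dictionary)
```

Because the result is itself a DataFrame, it can be filtered and queried like any other canonical.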


-------------
### Discovery Observations
#### Add any observations of the dataset
* Add an overview description of the dataset
* Include relevant information such as the source system and any issues or problems
* Then add any observations for specific attributes that are noteworthy
* Notes on attributes should relate only to transitioning, not to enrichment knowledge that belongs later in the process

First we add any general notes about this contract

In [5]:
tr.notes_add(text='The file is a synthetic data file created for this demonstration')

#### Add the data source
It is good practice to also include the source of the data

In [6]:
tr.notes_add(label='source', text='This was generated using the Discovery Behavioral Synthetic Data Generator')
tr.notes_add(label='source', text='The script to rerun the data generation can be found in the synthetic scripts folder')

#### Attribution Observations
It is worth capturing observations where attributes might be removed or changed, as these changes are hidden from the transitioned view of the data.<br>
For example, the `weight_cat` attribute has values, but it has a predominant value that makes this column a likely candidate for removal.

In [7]:
tr.notes_add(label='attr: null', text="Here for demo of removal of nulls")
tr.notes_add(label='attr: weight_cat', text="Demonstration of removal of columns with predominant values")
tr.notes_add(label='attr: weight_cat', text="the value 'A' is over 95% predominant")
tr.notes_add(label='attr: start', text="changing this to start_date so it being a date is obvious")


### Create a report on the notes
We have asked for the notes to be stylised; this returns a styled DataFrame with repeated elements blanked and dates formatted for presentation.<br>
Removing this parameter returns a canonical DataFrame.

In [8]:
tr.notes_report(stylise=True)

Unnamed: 0,label,date,text
0,attr: null,2019-05-23 12:16,Here for demo of removal of nulls
1,attr: start,2019-05-23 12:16,changing this to start_date so it being a date is obvious
2,attr: weight_cat,2019-05-23 12:16,Demonstration of removal of columns with predominant values
3,,2019-05-23 12:16,the value 'A' is over 95% predominant
4,notes,2019-05-23 12:16,The file is a synthetic data file created for this demonstration
5,source,2019-05-23 12:16,This was generated using the Discovery Behavioral Synthetic Data Generator
6,,2019-05-23 12:16,The script to rerun the data generation can be found in the synthetic scripts folder
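
The blanking of repeated labels seen in the report can be approximated in plain pandas. This is only a sketch under assumed names (`notes`, `report`) and made-up note text; it does not show how `notes_report` is implemented.

```python
import pandas as pd

# made-up notes, each tagged with a label
notes = pd.DataFrame({
    'label': ['source', 'source', 'notes'],
    'text': ['generated synthetically', 'script is in the scripts folder', 'demo file'],
})

# group notes by label, then blank every repeat of a label for presentation
report = notes.sort_values('label', kind='stable').reset_index(drop=True)
report['label'] = report['label'].mask(report['label'].duplicated(), '')
print(report)
```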


-----------------
## Transitioning: Contract Pipeline
The clean methods are separated into two main types:
* auto: allows the auto selection and filtering of a complete dataset
* to: for data typing as a transitioning process into useable datatypes for feature discovery

They are static methods that can be used as tools but here we use them to build our filter and typing intent as a contract pipeline.

In [9]:
tr.clean.__dir__()

['auto_clean_header',
 'auto_drop_duplicates',
 'auto_remove_columns',
 'auto_to_category',
 'filter_columns',
 'filter_headers',
 'run_contract_pipeline',
 'to_bool_type',
 'to_category_type',
 'to_date_from_excel_type',
 'to_date_type',
 'to_float_type',
 'to_int_type',
 'to_numeric_type',
 'to_remove',
 'to_select',
 'to_str_type']

### Using the Cleaner Class methods
The class methods are static and by default return the typed or filtered DataFrame,<br>
in this example converting `start` to a date type. Note that, now it is a typed attribute, the observations change to give max, min and mean.

In [10]:
df_typed = tr.clean.to_date_type(df, headers=['start'])
tr.discover.data_dictionary(df_typed).iloc[9:12]

Unnamed: 0,Attribute,Type,% Nulls,Count,Unique,Observations
9,single num,float64,0.2,4000,1,max=1.0 | min=1.0 | mean=1.0
10,start,datetime64[ns],0.0,5000,364,max=2018-12-30 00:00:00 | min=2018-01-01 00:00:00 | yr mean= 2018
11,surname,object,0.0,5000,5000,Sample: Brems | Waycott | Rivena


### Extracting the Parameterised Intent
In order to create the **_contract pipeline_** we are looking to extract the **_parameterised intent_** from the method.<br>
To do this we commit the change to the original DataFrame by setting the parameter `inplace` to True.

In [11]:
intent = tr.clean.to_date_type(df, headers=['start'], inplace=True)
intent

{'to_date': {'headers': ['start'],
  'drop': False,
  'exclude': False,
  'as_num': False,
  'day_first': False,
  'year_first': False}}
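
The pattern of returning either a new DataFrame or a recorded intent can be sketched as follows. The function `sketch_to_date` and its reduced parameter set are hypothetical, written only to illustrate the idea; the real `to_date_type` records more parameters, as the output above shows.

```python
import pandas as pd


def sketch_to_date(df, headers, inplace=False):
    """Hypothetical typing method: returns a typed copy, or the intent when inplace."""
    target = df if inplace else df.copy()
    for header in headers:
        target[header] = pd.to_datetime(target[header])
    if inplace:
        # the caller's DataFrame has been changed, so return what was asked for
        return {'to_date': {'headers': list(headers)}}
    return target


frame = pd.DataFrame({'start': ['2018-01-26', '2018-05-04']})

typed = sketch_to_date(frame, headers=['start'])        # frame is left untouched
kind_before = frame['start'].dtype.kind

intent = sketch_to_date(frame, headers=['start'], inplace=True)  # frame is now typed
kind_after = frame['start'].dtype.kind
print(intent)
```

Returning the parameters rather than the data is what makes the call replayable against a later file.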

### Adding the Intent to the Transitioning Pipeline
When we include `inplace`, what is returned is the **_parameterised intent_** from the change to the DataFrame.<br>
We can now take this **_intent_** and record it in our transitioning instance as a **_pipeline contract_**.

In [12]:
tr.set_cleaner(intent)

... and that's it: the intent is now recorded as part of our runnable pipeline contract.

We can of course run it as a single command
> `tr.set_cleaner(tr.clean.to_date_type(df, headers=['start'], inplace=True))`

### Auto filters special case
The methods `auto_remove_columns` and `auto_to_category` have two different ways of recording **_intent_**:
* an auto-generated outcome that represents the actual parameterised intent
* the intent to auto filter

Here is an example:

* in the first instance we generate the intent to auto remove
* we then reload the file and run it again with the parameter `auto_contract` set to False

In [13]:
tr.clean.auto_remove_columns(df, inplace=True)

{'auto_remove': {'null_min': 0.998, 'predominant_max': 0.998}}

In [14]:
# reload the data source to return the missing columns
df = tr.get_source_data()

tr.clean.auto_remove_columns(df, auto_contract=False, inplace=True)

{'to_remove': {'headers': ['single num', 'null', 'single cat'],
  'drop': False,
  'exclude': False}}

As you can see with the second output, the actual headers to be removed have been recorded as the intent. This allows flexibility in how we choose to control the auto filtering of incoming files through our contract pipeline.

The logic works like this:
* With the first example, each time we run our contract pipeline the auto remove will remove **ANY** column that matches the auto remove criteria.
* With the second example, **ONLY** the columns identified in this discovery analysis will be removed. Therefore, should another column appear in a subsequent file with null, constant or quasi-constant values, it will be passed through.
* The second example also allows us to discover what the Auto Remove has removed, allowing us to optimise the threshold values passed.
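
The two ways of recording intent described above can be sketched in plain pandas. Everything here is an assumption made for illustration: the function `sketch_auto_remove`, the strict `>` comparisons, and measuring predominance over populated rows only are guesses, not the library's documented behaviour.

```python
import pandas as pd


def sketch_auto_remove(df, null_min=0.998, predominant_max=0.998, auto_contract=True):
    """Hypothetical sketch: drop mostly-null or near-constant columns in place."""
    to_drop = []
    for name in df.columns:
        col = df[name]
        null_ratio = col.isna().mean()
        populated = col.dropna()
        # share of the most frequent value among populated rows (assumption)
        top_ratio = populated.value_counts(normalize=True).iloc[0] if len(populated) else 1.0
        if null_ratio > null_min or top_ratio > predominant_max:
            to_drop.append(name)
    df.drop(columns=to_drop, inplace=True)
    if auto_contract:
        # record the rule: any matching column is removed on every run
        return {'auto_remove': {'null_min': null_min, 'predominant_max': predominant_max}}
    # record the outcome: only these named columns are removed on later runs
    return {'to_remove': {'headers': to_drop}}


frame = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],
    'null': [None, None, None, None],
    'single cat': ['A', None, 'A', 'A'],
})
intent = sketch_auto_remove(frame, auto_contract=False)
print(intent)
```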

------------
## Transitioning: Selection, Filter and Typing
We can reload our data source, re-examine the Data Dictionary and start the process of transitioning the Dataset.


In [15]:
df = tr.get_source_data()
tr.discover.data_dictionary(df)

Unnamed: 0,Attribute,Type,% Nulls,Count,Unique,Observations
0,age,float64,0.15,4250,4011,max=89.932 | min=20.07 | mean=46.92
1,balance,float64,0.0,5000,4386,max=990.62 | min=2.81 | mean=184.23
2,forename,object,0.0,5000,3962,Sample: Richie | Elisa | Fermin
3,gender,object,0.0,5000,2,Sample: F | M
4,id,object,0.0,5000,5000,Sample: CU_8927785 | CU_9279989 | CU_1927711
5,,float64,1.0,0,0,max=nan | min=nan | mean=nan
6,online,float64,0.35,3250,2,max=1.0 | min=0.0 | mean=0.21
7,profession,object,0.0,5000,14,Sample: Structural Engineer | Help Desk Operator | Actuary
8,single cat,object,0.4,3000,1,Sample: A
9,single num,float64,0.2,4000,1,max=1.0 | min=1.0 | mean=1.0


### Tidy the headers
As good practice we clean the headers. This:
* removes any hidden characters that sometimes lurk in the header name
* replaces spaces in `single num` and `single cat` with an underscore (use the `replace_spaces` parameter to specify a different character)
* optionally sets a case type for consistency across the headers (options are `lower`, `upper`, `title`)
* optionally renames headers; in this case we rename `start` to `start_date` to identify it as a date

In [16]:
tr.set_cleaner(tr.clean.auto_clean_header(df, rename_map={'start': 'start_date'}, inplace=True))
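
The header clean-up steps listed above can be approximated in plain pandas. The function `sketch_clean_header` and the order in which it applies the steps are assumptions for this example, not the behaviour of `auto_clean_header` itself.

```python
import pandas as pd


def sketch_clean_header(df, rename_map=None, replace_spaces='_', case=None):
    """Hypothetical sketch: tidy the column names of df in place."""
    rename_map = rename_map or {}
    cleaned = []
    for name in df.columns:
        # drop non-printable characters (e.g. a byte-order mark) and outer whitespace
        name = ''.join(ch for ch in str(name) if ch.isprintable()).strip()
        name = rename_map.get(name, name)
        name = name.replace(' ', replace_spaces)
        if case in ('lower', 'upper', 'title'):
            name = getattr(name, case)()
        cleaned.append(name)
    df.columns = cleaned


frame = pd.DataFrame([[1, 2, 3]], columns=['\ufeffid', 'single num', 'start '])
sketch_clean_header(frame, rename_map={'start': 'start_date'})
print(list(frame.columns))
```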

### Auto remove selection
The `auto_remove_columns` method allows us to quickly remove columns that contain poor quality data. We will remove columns that:
* have more than 99% nulls
* have a predominant value of more than 90%

In addition, we pass in an extra list of values to be treated as null, in this case the empty string.

In [17]:
tr.set_cleaner(tr.clean.auto_remove_columns(df, null_min=0.99, predominant_max=0.90, inplace=True, nulls_list=['']))

### Auto categorise filter
The `auto_to_category` method allows us to quickly convert large columns of data into the useful Categorical data type.<br>
As settable parameters it considers:
* the number of unique items in the column
* a null value threshold, so that a low unique count is not just the result of too little populated data

In [18]:
tr.set_cleaner(tr.clean.auto_to_category(df, unique_max=20, null_max=0.7, inplace=True))
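
A minimal sketch of the selection logic, assuming a column qualifies when it is non-numeric, has at most `unique_max` distinct values and at most `null_max` nulls. The function `sketch_auto_to_category` and the decision to skip numeric columns are assumptions for this example, not confirmed behaviour of `auto_to_category`.

```python
import pandas as pd


def sketch_auto_to_category(df, unique_max=20, null_max=0.7):
    """Hypothetical sketch: convert low-cardinality, well-populated columns in place."""
    converted = []
    for name in df.columns:
        col = df[name]
        if pd.api.types.is_numeric_dtype(col):
            continue  # assumption: leave numeric columns to the numeric typing methods
        if col.nunique() <= unique_max and col.isna().mean() <= null_max:
            df[name] = col.astype('category')
            converted.append(name)
    return converted


frame = pd.DataFrame({
    'gender': ['M', 'F', 'F', 'M'],
    'surname': ['Brems', 'Waycott', 'Rivena', 'Keen'],  # too many unique values
    'sparse': ['x', None, None, None],                  # too many nulls
    'balance': [1.5, 2.5, 3.5, 4.5],                    # numeric, skipped
})
converted = sketch_auto_to_category(frame, unique_max=2, null_max=0.7)
print(converted)
```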

### Category and Date typing
In most transitions of data into a useful and usable dataset, the conversion of Dates and Categories is probably the most common task.<br>
The methods `to_category_type` and `to_date_type` fulfil this.

In both methods we use the `headers` parameter, though you can also filter by `dtype` or a regex.

In [19]:
# Typing Categories
tr.set_cleaner(tr.clean.to_category_type(df, headers=['gender', 'profession'], inplace=True))
# Typing Dates
tr.set_cleaner(tr.clean.to_date_type(df, headers='start_date', inplace=True))


### Boolean typing
The `to_bool_type` method allows us to specify a map of values to turn to True, with all others defaulting to False.

In [20]:
tr.set_cleaner(tr.clean.to_bool_type(df, bool_map={1: True}, headers='online', inplace=True))
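
The effect of a boolean map can be approximated in plain pandas as below. The function `sketch_to_bool` is hypothetical, and treating nulls as False is an assumption made here for the example.

```python
import pandas as pd


def sketch_to_bool(df, headers, bool_map):
    """Hypothetical sketch: values mapped to True become True, everything else False."""
    true_values = [value for value, flag in bool_map.items() if flag]
    for header in headers:
        # isin returns False for nulls and for any value not listed
        df[header] = df[header].isin(true_values)


frame = pd.DataFrame({'online': [1.0, 0.0, None, 1.0]})
sketch_to_bool(frame, headers=['online'], bool_map={1: True})
print(frame['online'].tolist())
```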

### Float, String Typing
Finally, to tidy up our remaining types, we run them through our typing methods to ensure they are fit for purpose.<br>
Here we have used the `dtype` parameter to capture all remaining columns of those types, and also set the precision of the `floats` to 3.

In [21]:
tr.set_cleaner(tr.clean.to_float_type(df, dtype=['float'], precision=3, inplace=True))

tr.set_cleaner(tr.clean.to_str_type(df, dtype=['object'], inplace=True))

### Integer Typing
It should be noted that we didn't convert `age` to an `int`, as we are going to defer the decision on how to convert the `nulls` to the feature cataloging. But we could have used the built-in functions here to convert `age` and replace the `nulls` with an alternative value:

> `tr.clean.to_int_type(df, headers='age', fillna='mean', inplace=True)`

The above code snippet would convert `age` from a `float`, replacing the `nulls` that stopped it being an `int` with the `mean` of the values.

This conversion should be made with care, as you are hiding data changes from the feature cataloguing.
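
To make the hidden change concrete, here is a small pandas sketch of filling nulls with the mean before casting to an integer. The function `sketch_to_int` and the rounding step are assumptions for this example and may differ from what `to_int_type` does.

```python
import pandas as pd


def sketch_to_int(df, header, fillna='mean'):
    """Hypothetical sketch: fill nulls, then cast the column to int in place."""
    col = df[header]
    fill_value = col.mean() if fillna == 'mean' else fillna
    df[header] = col.fillna(fill_value).round().astype(int)
    return fill_value


frame = pd.DataFrame({'age': [20.0, None, 40.0]})
was_null = frame['age'].isna()   # keep a record of which rows were filled
used = sketch_to_int(frame, 'age')
print(frame['age'].tolist(), was_null.tolist())
```

After the conversion nothing in the column itself shows which rows were filled, which is why the mask is captured first.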


----------
## Transitioning: Finalise and Validate
### Persist the Canonical
We now have our typed, selected and filtered canonical ready for Feature Cataloging. From this we need to:
* Create an Excel Data Dictionary for external SME feedback and reporting
* Persist the canonical so it can be used for Feature Cataloging

Because the transitioning instance is managing governance and naming convention we only have to pass the DataFrame and it does all the rest.

In [22]:
# Create the excel data dictionary
tr.create_data_dictionary(df)

# save the clean file
tr.save_canonical(df)

### Validation
We can validate our file has been saved by reloading it and checking all the correct typing and filtering has happened.

Again because we are using the transitioning instance we only need to call `load_canonical` without parameters to retrieve our canonical dataset

In [23]:
# check the results worked
df = tr.load_canonical()
tr.discover.data_dictionary(df)

Unnamed: 0,Attribute,Type,% Nulls,Count,Unique,Observations
0,age,float64,0.15,4250,4011,max=89.932 | min=20.07 | mean=46.92
1,balance,float64,0.0,5000,4386,max=990.62 | min=2.81 | mean=184.23
2,forename,object,0.0,5000,3962,Sample: Emir | Tyrese | Fay
3,gender,category,0.0,5000,2,F|M
4,id,object,0.0,5000,5000,Sample: CU_1707826 | CU_3033492 | CU_6880978
5,online,bool,0.0,5000,2,False | True
6,profession,category,0.0,5000,14,Actuary|Financial Advisor|Health Coach III|Help Desk Operator|Internal ...
7,start_date,datetime64[ns],0.0,5000,364,max=2018-12-30 00:00:00 | min=2018-01-01 00:00:00 | yr mean= 2018
8,surname,object,0.0,5000,5000,Sample: Garnham | Ottum | Keen


---------
## Transitioning: Running the Pipeline Contract
Running the Pipeline Contract is very easy: because you have the Transitioning instance, you simply refresh your canonical dataset.

How it works:
* loads the new source dataset
* loads the contract's parameterised intent (pipeline contract)
* runs the intent against the dataset
* returns the transitioned canonical
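
The steps above amount to replaying recorded intents against a fresh dataset. The sketch below shows that idea with a plain dictionary registry; the names `REGISTRY`, `run_pipeline`, `to_date` and `to_category`, the contract layout and the sample data are all hypothetical, and this is not how `refresh_canonical` is implemented.

```python
import pandas as pd


def to_date(df, headers):
    for header in headers:
        df[header] = pd.to_datetime(df[header], format='%m-%d-%y')


def to_category(df, headers):
    for header in headers:
        df[header] = df[header].astype('category')


# maps the method name recorded in an intent to the function that applies it
REGISTRY = {'to_date': to_date, 'to_category': to_category}


def run_pipeline(df, contract):
    """Apply each recorded intent, in order, to the given DataFrame."""
    for intent in contract:
        for method, params in intent.items():
            REGISTRY[method](df, **params)
    return df


# a recorded contract and a newly arrived source file (both made up)
contract = [
    {'to_category': {'headers': ['gender']}},
    {'to_date': {'headers': ['start']}},
]
source = pd.DataFrame({'gender': ['F', 'M'], 'start': ['01-26-18', '05-04-18']})
result = run_pipeline(source, contract)
print(result.dtypes)
```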

In [25]:
df = tr.refresh_canonical()

tr.discover.data_dictionary(df)

Unnamed: 0,Attribute,Type,% Nulls,Count,Unique,Observations
0,age,float64,0.15,4250,4040,max=89.603 | min=20.003 | mean=46.38
1,balance,float64,0.0,5000,4382,max=979.54 | min=8.72 | mean=185.9
2,forename,object,0.0,5000,3925,Sample: Xavier | Mahir | Jimmy
3,gender,category,0.0,5000,2,F|M
4,id,object,0.0,5000,5000,Sample: CU_3564360 | CU_7136283 | CU_5223059
5,online,bool,0.0,5000,2,True | False
6,profession,category,0.1,4500,15,Accountant I|Assistant Professor|Data Coordiator|Developer I|Food Chemi...
7,start,object,0.0,5000,364,Sample: 01-26-18 | 05-04-18 | 11-05-18
8,surname,object,0.0,5000,5000,Sample: Kassem | Amauty | Neary
