<cite>Darryl Oatridge, August 2022</cite>

## Contextualizing
- Using metadata to link the data set to related sources and attributions and/or projects that provide added context for how the data were generated and why.

## Validating and Adding Metadata 
- Information about a data set that is structured (often in machine-readable format) for purposes of search and retrieval.

In [16]:
from ds_discovery import Transition

## Schema

A schema is a representation of our dataset as a set of statistical and probabilistic values that are semantically common across all schemas. The schema separates each data element into four parts:

- Intent: shows how the data content is being discretised and its type.
- Params: the parameters used to specialise the Intent, such as granularity, value limits etc.
- Patterns: probabilistic values describing how the data's relative frequency is distributed, along with a number of other values related to the data type.
- Stats: a broad set of statistical analyses of the data, dependent upon the data type, including distribution indicators, limits and observations.

A schema can be fully or partially stored or represented as a relational tree, through naming. One can build a semantic and contextualised view of the data that can be distributed as a machine-readable set of comparatives or as part of some other outcome.
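To make the four parts concrete, here is a minimal sketch in plain Python of how such an entry might be assembled for a categorical column. The function `describe_categorical` and its dictionary layout are illustrative assumptions that mirror the element names in the report shown later; this is not the ds_discovery implementation.

```python
from collections import Counter


def describe_categorical(values, freq_precision=2):
    """Build a four-part entry (intent, params, patterns, stats) for one column."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    categories = sorted(counts)
    total = len(present)
    # Count of observations per category, in category order
    sample_distribution = [counts[c] for c in categories]
    # The same counts expressed as percentages of the non-null total
    relative_freq = [round(100 * n / total, freq_precision) for n in sample_distribution]
    nulls_percent = round(100 * (len(values) - total) / len(values), freq_precision)
    return {
        "intent": {"categories": categories, "dtype": "category"},
        "params": {"freq_precision": freq_precision},
        "patterns": {
            "relative_freq": relative_freq,
            "sample_distribution": sample_distribution,
        },
        "stats": {"category_count": len(categories), "nulls_percent": nulls_percent},
    }


# 549 zeros and 342 ones: the counts reported for 'survived' in this notebook
entry = describe_categorical([0] * 549 + [1] * 342)
print(entry["patterns"]["relative_freq"])  # [61.62, 38.38]
```

Keeping intent and params apart from patterns and stats means two entries built with the same intent and params can be compared element by element.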

In [17]:
tr = Transition.from_env('demo_schema', has_contract=False)

#### Set File Source
Initially we set the file source for the data of interest and run the component.


In [18]:
# Set the file source location
tr.set_source_uri('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
tr.set_persist()
tr.set_description("Titanic Dataset used by Seaborn")

In [19]:
df = tr.load_source_canonical()
tr.tools.auto_to_category(df, unique_max=20, inplace=True)
tr.tools.to_numeric_type(df, headers=['age', 'fare'], inplace=True)
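
As a rough illustration of the kind of rule a `unique_max` threshold suggests, the sketch below selects columns whose number of distinct values does not exceed a limit. It assumes that `unique_max` caps the count of distinct non-null values; the helper `columns_to_categorise` is hypothetical and does not reproduce the library's actual selection logic.

```python
def columns_to_categorise(table, unique_max=20):
    """Return the names of columns with at most unique_max distinct non-null values."""
    selected = []
    for name, values in table.items():
        distinct = {v for v in values if v is not None}
        if 0 < len(distinct) <= unique_max:
            selected.append(name)
    return selected


# Made-up values for illustration only
table = {
    "survived": [0, 1, 1, 0, 1],
    "fare": [7.25, 71.28, 7.92, 53.1, 8.05],
}
print(columns_to_categorise(table, unique_max=2))  # ['survived']
```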

In [20]:
tr.run_component_pipeline()

## Creating and Presenting Schema

By default, the primary schema is generated using default values, taking a flat view of the data or feature set. The resulting schema is either distributable through a given connector contract or, as in our case, displayed within the notebook.



In [21]:
tr.save_canonical_schema()

In [22]:
tr.report_canonical_schema()

|   | root | section | element | value |
|---|------|---------|---------|-------|
| 0 | survived | intent | categories | [0, 1] |
| 1 |  |  | dtype | category |
| 2 |  | params | freq_precision | 2 |
| 3 |  |  | top | 10 |
| 4 |  | patterns | relative_freq | [61.62, 38.38] |
| 5 |  |  | sample_distribution | [549, 342] |
| 6 |  | stats | category_count | 2 |
| 7 |  |  | highest_unique | 61.620000 |
| 8 |  |  | lowest_unique | 38.380000 |
| 9 |  |  | nulls_percent | 0.000000 |


### Report

As with all reports, one can redistribute the schema to interested parties or systems, where the data can be observed or schematically examined to produce decision-making outcomes, for example the observation of concept drift.

In [23]:
schema = tr.report_canonical_schema(tr.load_persist_canonical(), stylise=False)
tr.save_report_canonical(reports=tr.REPORT_SCHEMA, report_canonical=schema)
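
One way a receiving system might use a distributed schema to observe drift is to compare the `relative_freq` values of two snapshots. The sketch below uses total variation distance as the comparison measure; that choice, the later snapshot and the threshold are all assumptions made for illustration, not something the schema report prescribes.

```python
def total_variation(baseline, current):
    """Half the summed absolute difference between two percentage distributions."""
    if len(baseline) != len(current):
        raise ValueError("distributions must cover the same categories")
    return sum(abs(b - c) for b, c in zip(baseline, current)) / 2


baseline = [61.62, 38.38]  # relative_freq for 'survived' in this notebook
later = [55.00, 45.00]     # hypothetical later snapshot, for illustration only
THRESHOLD = 5.0            # hypothetical alert level, in percentage points

drift = total_variation(baseline, later)
print(round(drift, 2), drift > THRESHOLD)  # 6.62 True
```

The result is expressed in percentage points, so a value of 0 means the two snapshots agree exactly and 100 means they share no mass at all.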

### Filter the Schema

In the following example we tailor the view of the schema without changing the underlying schema's content. In this instance we have filtered on:

- root, with our interest in the data features 'survived' and 'fare', and
- section, where our interest is particularly the patterns subset.

This provides quick and easy visualisation of complex schemas and can help to identify individuals or groups of elements of interest within that schema.

In [24]:
tr.report_canonical_schema(roots=['survived', 'fare'], sections='patterns')

|   | root | section | element | value |
|---|------|---------|---------|-------|
| 0 | survived | patterns | relative_freq | [61.62, 38.38] |
| 1 |  |  | sample_distribution | [549, 342] |
| 2 | fare |  | relative_freq | [34.67, 21.46, 17.33, 5.9, 1.89, 4.6, 2.12, 3.42, 2.12, 0.24, 0.94, 0.83, 0.0, 0.83, 1.06, 0.0, 0.24, 0.0, 0.0, 0.0, 0.47, 0.12, 0.47, 0.0, 0.24, 0.71, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.35] |
| 3 |  |  | sample_distribution | [294, 182, 147, 50, 16, 39, 18, 29, 18, 2, 8, 7, 0, 7, 9, 0, 2, 0, 0, 0, 4, 1, 4, 0, 2, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3] |
| 4 |  |  | freq_mean | [7.45 14.04 26.01 35.14 46.54 54.89 68.18 77.57 86.7 93.5 109.27 117.12 0. 134.74 151.07 0. 164.87 0. 0. 0. 211.38 221.78 227.52 0. 247.52 262.79 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 512.33] |
| 5 |  |  | freq_std | [1.87 2.56 2.64 3.28 3.02 2.68 2.67 2.28 3.27 0. 1.83 3.33 0. 0.84 2.57 0. 0. 0. 0. 0. 0.07 0. 0. 0. 0. 0.29 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.] |
| 6 |  |  | dominant_excluded | [8.05] |
| 7 |  |  | dominance_freq | [100.0] |
| 8 |  |  | dominant_percent | 4.830000 |
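
The report above leaves `root` and `section` blank on continuation rows, so any filtering done outside the library first has to carry those values down. The sketch below shows one way to do that on a list of row dictionaries and then filter by root and section without altering the original rows. The helpers `fill_down` and `filter_rows` are hypothetical and are not part of ds_discovery.

```python
def fill_down(rows, keys=("root", "section")):
    """Copy each blank key from the row above so every row is self-describing."""
    filled, last = [], {}
    for row in rows:
        row = dict(row)  # work on a copy so the input rows stay untouched
        for key in keys:
            if row[key]:
                last[key] = row[key]
            else:
                row[key] = last.get(key, "")
        filled.append(row)
    return filled


def filter_rows(rows, roots, sections):
    """Return only the rows whose filled root and section are both wanted."""
    return [r for r in fill_down(rows) if r["root"] in roots and r["section"] in sections]


report = [
    {"root": "survived", "section": "intent", "element": "dtype", "value": "category"},
    {"root": "", "section": "patterns", "element": "relative_freq", "value": [61.62, 38.38]},
    {"root": "", "section": "", "element": "sample_distribution", "value": [549, 342]},
    {"root": "", "section": "stats", "element": "category_count", "value": 2},
]
view = filter_rows(report, roots={"survived"}, sections={"patterns"})
print([r["element"] for r in view])  # ['relative_freq', 'sample_distribution']
```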


## Semantic Schema

Beyond the basic schema lies a complex but accessible set of parameters that allows relational comparisons to be created between data types.

In the demonstration below, when creating the schema, we give it a name and then provide the relational tree we are interested in. In this case we take 'survived' as our root, being the target feature of interest. We then relate this to 'age' to understand how age is distributed by 'survived'.



In [25]:
tr.save_canonical_schema(schema_name='survived', schema_tree=[
    {'survived': {'dtype': 'bool'}},
    {'age': {'granularity': [(0, 18), (18, 30), (30, 50), (50, 100)]}}])

In [26]:
tr.report_canonical_schema(schema='survived')

|   | root | section | element | value |
|---|------|---------|---------|-------|
| 0 | survived | intent | categories | [0, 1] |
| 1 |  |  | dtype | category |
| 2 |  | params | freq_precision | 2 |
| 3 |  |  | top | 10 |
| 4 |  | patterns | relative_freq | [61.62, 38.38] |
| 5 |  |  | sample_distribution | [549, 342] |
| 6 |  | stats | category_count | 2 |
| 7 |  |  | highest_unique | 61.620000 |
| 8 |  |  | lowest_unique | 38.380000 |
| 9 |  |  | nulls_percent | 0.000000 |


### Distributable Reporting

With this done one can now further investigate distributions and discover a view of the data. In this case, as a simple example, one can see the age range percentage of those that 'survived'.

From this simple example one can see how schemas can be captured over a period of time, or fixed at a moment in time, and then distributed and compared to provide monitoring and insight into data as it flows through a system.

In [27]:
result = tr.report_canonical_schema(schema='survived', roots='survived.1.age', elements=['relative_freq'], stylise=False)
result['value'].to_list()

[[24.14, 33.1, 35.17, 7.59]]
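
To show the kind of calculation that sits behind a result like this, the sketch below follows the path survived, then the value 1, then age, on a tiny made-up sample. It assumes the root `'survived.1.age'` reads as "age, for rows where survived is 1", and that each granularity pair is a half-open interval (low inclusive, high exclusive). Both are assumptions, and `band_percentages` is a hypothetical helper.

```python
def band_percentages(ages, bands, precision=2):
    """Percentage of ages in each (low, high) band, treated as low <= age < high."""
    counts = [sum(1 for a in ages if low <= a < high) for low, high in bands]
    total = sum(counts)
    return [round(100 * n / total, precision) if total else 0.0 for n in counts]


bands = [(0, 18), (18, 30), (30, 50), (50, 100)]

# Made-up (survived, age) records, not taken from the Titanic data
records = [(1, 4), (1, 22), (1, 27), (1, 35), (0, 19), (0, 40), (0, 61), (1, 58)]

# Keep the survivors only, then band their ages
survivor_ages = [age for survived, age in records if survived == 1]
print(band_percentages(survivor_ages, bands))  # [20.0, 40.0, 20.0, 20.0]
```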