<cite>Darryl Oatridge, August 2022</cite>

## Validating and Adding Metadata 
- Information about a data set that is structured (often in machine-readable format) for purposes of search and retrieval.

## Citing the Data 
- Adding citations to support appropriate attribution by third-party users in order to formally incorporate data reuse.


In [1]:
from ds_discovery import Transition, Wrangle

## Adding Metadata

During development, multiple experts add value to our understanding of the dataset. Project Hadron captures this knowledge as part of its metadata and provides easy-access tools to record it in real or near-real time, as well as to add it retrospectively through automated processes.

Captured knowledge is placed under a tree structure of:
- catalogue: provides an encompassing group identifier such as attributes or observations.
- label: a subset of categories identifying the individual set of text, such as an attribute name or observation type.
- text: a brief or descriptive narrative of the catalogue and label. Text is immutable, so new text with the same catalogue and label is added to the existing content rather than replacing it.
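
The tree structure above can be pictured with a minimal, standard-library sketch. This is only an illustration of the catalogue, label and text nesting with append-only text, not Project Hadron's implementation; the `NoteStore` class and its methods are hypothetical names introduced for the example.

```python
from collections import defaultdict


class NoteStore:
    """Hypothetical stand-in for a catalogue -> label -> text tree (illustration only)."""

    def __init__(self):
        # catalogue -> label -> list of text entries, kept in insertion order
        self._tree = defaultdict(lambda: defaultdict(list))

    def add_note(self, catalog, label, text):
        # Existing text is never overwritten; new text is appended alongside it
        self._tree[catalog][label].append(text)

    def texts(self, catalog, label):
        # Return a copy so callers cannot mutate the stored entries
        return list(self._tree[catalog][label])


store = NoteStore()
store.add_note('observations', 'describe', 'first remark')
store.add_note('observations', 'describe', 'second remark')
print(store.texts('observations', 'describe'))  # ['first remark', 'second remark']
```

Adding a second note under the same catalogue and label leaves the first one in place, which is the behaviour the "text is immutable" rule describes.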




In [2]:
tr = Transition.from_env('demo_metadata', has_contract=False)

#### Set File Source
Initially we set the file source for our data of interest and run the component.

In [3]:
## Set the file source location
tr.set_source_uri('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv', template_aligned=False)
tr.set_persist()
tr.set_description("Titanic Dataset used by Seaborn")

### Adding Attributes

A vital part of understanding a dataset is describing the attributes it provides. In this instance we name our catalogue group 'attributes'. Each note is labeled with the name of the attribute and given a description.


In [4]:
## Add some attribute descriptions
tr.add_notes(catalog='attributes', label='age', text='The age of the passenger has limited null values')
tr.add_notes(catalog='attributes', label='deck', text='cabin has already been split into deck from the originals')
tr.add_notes(catalog='attributes', label='fare', text='the price of the fair')
tr.add_notes(catalog='attributes', label='pclass', text='The class of the passenger')
tr.add_notes(catalog='attributes', label='sex', text='The gender of the passenger')
tr.add_notes(catalog='attributes', label='survived', text='If the passenger survived or not as the target')
tr.add_notes(catalog='attributes', label='embarked', text='The code for the port the passengered embarked')

### Adding Observations

In addition, we can capture feedback from, for example, an SME or data owner. In this case we use 'observations' as our catalogue and 'describe' as our label, keeping the same label for both descriptions.

One can now use the reporting tool to visually present the knowledge added. Note that under observations both descriptions have been retained.



In [5]:
tr.add_notes(catalog='observations', label='describe', 
             text='The original Titanic dataset has been engineered to fit Seaborn functionality')
tr.add_notes(catalog='observations', label='describe', 
             text='The age and deck attributes still maintain their null values')


In [6]:
tr.report_notes(drop_dates=True)

| | section | label | text |
|---|---|---|---|
| 0 | observations | describe | The original Titanic dataset has been engineered to fit Seaborn functionality |
| 1 | | describe | The age and deck attributes still maintain their null values |
| 2 | attributes | age | The age of the passenger has limited null values |
| 3 | | deck | cabin has already been split into deck from the originals |
| 4 | | fare | the price of the fair |
| 5 | | pclass | The class of the passenger |
| 6 | | sex | The gender of the passenger |
| 7 | | survived | If the passenger survived or not as the target |
| 8 | | embarked | The code for the port the passengered embarked |


-------------------
## Bulk Notes

In addition to adding individual notes, one can upload notes in bulk from an external data source. In the next example we take an order book, extract the knowledge held in an existing description catalogue, and add it to our attributes.

In [7]:
tr = Transition.from_env('cs_orders', has_contract=False)

#### Set File Source

Initially set the file source for the data of interest and run the component.

In [8]:
tr.set_source_uri(uri='data/CS_ORDERS.txt', sep='\t', error_bad_lines=False, low_memory=True, encoding='Latin1')
tr.set_persist()
tr.set_description("Consumer Notebook Orders for Q4 FY20")

#### Connect the Bulk Upload

First create a connector to the information source.

In [9]:
tr.add_connector_uri(connector_name='bulk_notes', uri='data/cs_orders_dictionary.csv')

#### Upload the Descriptions

With our connector in place, one can now load that data and specify the columns of interest that provide the label and the text.

Using our reporting tool one can now observe that attribute descriptions have been uploaded.

In [10]:
notes = tr.load_canonical(connector_name='bulk_notes')
tr.upload_notes(canonical=notes, catalog='attributes', label_key='Attribute', text_key='Description')

In [11]:
tr.report_notes(drop_dates=True)

| | section | label | text |
|---|---|---|---|
| 0 | attributes | BILT_CUST_NBR | Bill to customer number |
| 1 | | BU_ID | Business Unit ID (country) |
| 2 | | CNCL_DTS | Order Cancel date |
| 3 | | CUST_TYPE_CODE | Customer Type (segment ex: consumer, small business) |
| 4 | | CUST_TYPE_DESC | Customer Type (segment ex: consumer, small business) |
| 5 | | DLVR_DTS | Delivery data (not sure how accurate this is) |
| 6 | | EXTRNL_COMB_HIER_CD | This is the product Hierarchy Code |
| 7 | | FISC_QTR_ID | Fiscal Quarter |
| 8 | | FISC_WK_ID | Fiscal Week |
| 9 | | HOLD_DTS | Date of order put on hold |
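
As a rough illustration of what mapping a dictionary file onto notes involves, the sketch below uses only the Python standard library. It is not the `upload_notes` implementation. The inline CSV layout, the `rows_to_notes` helper and the choice to skip incomplete rows are all assumptions made for the example; only the column names `Attribute` and `Description` come from the call above.

```python
import csv
import io

# Assumed layout for illustration; the real data/cs_orders_dictionary.csv is not shown here
DICTIONARY_CSV = """Attribute,Description
ORD_DTS,Order date
INV_DTS,Date of order invoiced
HOLD_DTS,Date of order put on hold
"""


def rows_to_notes(rows, catalog, label_key, text_key):
    """Map each row to a (catalog, label, text) triple, skipping rows missing a label or text."""
    notes = []
    for row in rows:
        label = (row.get(label_key) or '').strip()
        text = (row.get(text_key) or '').strip()
        if label and text:
            notes.append((catalog, label, text))
    return notes


reader = csv.DictReader(io.StringIO(DICTIONARY_CSV))
bulk_notes = rows_to_notes(reader, catalog='attributes',
                           label_key='Attribute', text_key='Description')
for note in bulk_notes:
    print(note)
```

Every row of the dictionary becomes one note under a single catalogue, with one column supplying the label and another the text.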


#### Report Filtering

Sometimes bulk uploads can result in a large amount of added information. Our reporting tool can filter what we visualize, giving us a clean summary of the items of interest. In our example we are filtering on 'label' across all sections, or catalogues.


In [12]:
tr.report_notes(labels=['ORD_DTS', 'INV_DTS', 'HOLD_DTS'], drop_dates=True)

| | section | label | text |
|---|---|---|---|
| 0 | attributes | HOLD_DTS | Date of order put on hold |
| 1 | | INV_DTS | Date of order invoiced |
| 2 | | ORD_DTS | Order date |
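
Conceptually, filtering on label amounts to a membership test applied to every note, whatever its section. The sketch below shows that idea on a plain list of dictionaries. It is an assumption about the behaviour, not the `report_notes` internals, and `filter_by_labels` is a hypothetical helper.

```python
# In-memory notes for illustration; the rows mirror entries shown in the reports above
notes = [
    {'section': 'attributes', 'label': 'BU_ID', 'text': 'Business Unit ID (country)'},
    {'section': 'attributes', 'label': 'HOLD_DTS', 'text': 'Date of order put on hold'},
    {'section': 'attributes', 'label': 'INV_DTS', 'text': 'Date of order invoiced'},
    {'section': 'attributes', 'label': 'ORD_DTS', 'text': 'Order date'},
]


def filter_by_labels(rows, labels):
    """Keep rows whose label is in the requested set, regardless of section."""
    wanted = set(labels)
    return [row for row in rows if row['label'] in wanted]


selected = filter_by_labels(notes, ['ORD_DTS', 'INV_DTS', 'HOLD_DTS'])
print([row['label'] for row in selected])  # ['HOLD_DTS', 'INV_DTS', 'ORD_DTS']
```

The result follows the stored order of the notes, not the order in which the labels were requested.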
