<cite>Darryl Oatridge, August 2022</cite>

## Validation of Data

- The review of a dataset by an expert whose credentials and subject knowledge are similar to those of the data creator, in order to validate the accuracy of the data.

In [1]:
from ds_discovery import Transition

## Quality Assurance

Quality assurance provides immediate insight into the quality, quantity, veracity and availability of the dataset being provided. This is a critical step in the success of any machine learning or product outcome.

Immediate visibility of the dataset's content allows quick decision making at the earliest stage of the process. It also gives SMEs and data architects common reports to discuss, based on best practice and familiar to both parties.

Finally, it provides observational tools that present a broad set of information in a compact and common display format.

In [2]:
tr = Transition.from_env('demo_quality', has_contract=False)

#### Set File Source
Initially we set the file source for our data of interest and run the component.


In [3]:
# Set the file source location
data = 'https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv'
tr.set_source_uri(data)
tr.set_persist()
tr.set_description("Original Titanic Dataset")

### Data Dictionary

The data dictionary is a go-to tool that gives both a visual and shareable summary of the dataset provided. In this case one looks at the raw source to assess its suitability at a glance.

In this instance, taking the original Titanic dataset, nulls have been masked (here with `?`) and, as a result, some columns have been inappropriately typed. There are also multiple features that are not required. All of this needs to be dealt with before one can get a better view of the data presented.

In [4]:
df = tr.load_source_canonical()
tr.canonical_report(df)

Unnamed: 0,Attributes (14),dType,%_Null,%_Dom,Count,Unique,Observations
0,age,object,0.0%,20.1%,1309,99,Sample: ? | 24 | 22 | 21 | 30
1,boat,object,0.0%,62.9%,1309,28,Sample: ? | 13 | C | 15 | 14
2,body,object,0.0%,90.8%,1309,122,Sample: ? | 58 | 285 | 156 | 143
3,cabin,object,0.0%,77.5%,1309,187,Sample: ? | C23 C25 C27 | G6 | B57 B59 B63 B66 | C22 C26
4,embarked,object,0.0%,69.8%,1309,4,Sample: S | C | Q | ?
5,fare,object,0.0%,4.6%,1309,282,Sample: 8.05 | 13 | 7.75 | 26 | 7.8958
6,home.dest,object,0.0%,43.1%,1309,370,"Sample: ? | New York, NY | London | Montreal, PQ | Paris, France"
7,name,object,0.0%,0.2%,1309,1307,"Sample: Connolly, Miss. Kate | Kelly, Mr. James | Allen, Miss. Elisabeth Walton | Ilmakangas, Miss. ..."
8,parch,int64,0.0%,76.5%,1309,8,max=9 | min=0 | mean=0.39 | dominant=0
9,pclass,int64,0.0%,54.2%,1309,3,max=3 | min=1 | mean=2.29 | dominant=3
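
To make the report columns concrete, here is a minimal sketch in plain pandas of how figures such as `%_Null`, `%_Dom`, `Count` and `Unique` could be derived. It is an approximation for illustration only and is not how `canonical_report` is implemented; the helper `summarise_columns` and the tiny `sample` frame are hypothetical. It also shows why a masked null such as `?` reports as 0% null until it is reinstated.

```python
import numpy as np
import pandas as pd


def summarise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: one summary row per column of a non-empty frame."""
    rows = []
    for name in df.columns:
        col = df[name]
        # value_counts sorts by frequency, so the first entry is the dominant value
        counts = col.value_counts(dropna=False)
        rows.append({
            "attribute": name,
            "dtype": str(col.dtype),
            "pct_null": round(100 * col.isna().mean(), 1),
            "pct_dom": round(100 * counts.iloc[0] / len(col), 1),
            "count": len(col),
            "unique": col.nunique(dropna=False),
        })
    return pd.DataFrame(rows)


# Illustrative data only: '?' stands in for a missing value, as in the raw source
sample = pd.DataFrame({
    "age": ["?", "24", "22", "?"],
    "pclass": [3, 1, 3, 3],
})

masked = summarise_columns(sample)                            # '?' counted as a value
reinstated = summarise_columns(sample.replace("?", np.nan))   # '?' counted as null

print(masked)
print(reinstated)
```

In the masked summary `age` shows no nulls even though half of its values carry no information; once `?` is replaced with `NaN` the same column reports half of its values as null.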


#### Engineering Selection

The canonical is tidied up through engineering selection, where one keeps the features of interest, removes the columns that are of no interest and makes sure the data is correctly typed.

In [5]:
df = tr.tools.auto_reinstate_nulls(df, nulls_list=['?'])
df = tr.tools.to_remove(df, headers=['body', 'name', 'ticket', 'boat', 'home.dest'])
df = tr.tools.auto_to_category(df)
df = tr.tools.to_numeric_type(df, headers=['age', 'fare'])
df = tr.tools.to_int_type(df, headers='survived')

tr.run_component_pipeline()
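
For readers unfamiliar with the `tr.tools` methods, the same kind of selection can be sketched in plain pandas. This is a rough analogue under stated assumptions, not the library's implementation: the small `raw` frame is made up for illustration, and the choice of which columns to convert is hard-coded rather than detected automatically.

```python
import numpy as np
import pandas as pd

# Illustrative rows only, shaped like the raw source where '?' masks a null
raw = pd.DataFrame({
    "age": ["29", "?", "0.9167"],
    "fare": ["211.3375", "151.55", "?"],
    "sex": ["female", "male", "male"],
    "name": ["A", "B", "C"],
    "survived": ["1", "0", "1"],
})

# 1. Reinstate nulls: turn the '?' placeholder back into a real missing value
clean = raw.replace("?", np.nan)

# 2. Remove columns that are of no interest
clean = clean.drop(columns=["name"])

# 3. Correct the types: numeric measures, a category, and an integer target
for header in ["age", "fare"]:
    clean[header] = pd.to_numeric(clean[header])
clean["sex"] = clean["sex"].astype("category")
clean["survived"] = pd.to_numeric(clean["survived"]).astype("int64")

print(clean.dtypes)
```

The order matters: the placeholder has to be replaced before the numeric conversion, otherwise `pd.to_numeric` raises an error on the `?` strings.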

### Validation

Now that our selection engineering has been applied to the dataset, one has a clearer view of the value of the data provided.

The canonical report enhances existing data science tools to give a clear, single view of our dataset that is familiar to a broader audience.

In [6]:
tr.canonical_report(df)

Unnamed: 0,Attributes (9),dType,%_Null,%_Dom,Count,Unique,Observations
0,age,float64,20.1%,20.1%,1309,99,max=80.0 | min=0.1667 | mean=29.88 | dominant=24.0
1,cabin,object,77.5%,77.5%,1309,187,Sample: C23 C25 C27 | G6 | B57 B59 B63 B66 | F4 | F33
2,embarked,category,0.0%,69.8%,1309,4,Sample: S | C | Q | nan
3,fare,float64,0.1%,4.6%,1309,282,max=512.3292 | min=0.0 | mean=33.3 | dominant=8.05
4,parch,category,0.0%,76.5%,1309,8,Sample: 0 | 1 | 2 | 3 | 4
5,pclass,category,0.0%,54.2%,1309,3,Sample: 3 | 1 | 2
6,sex,category,0.0%,64.4%,1309,2,Sample: male | female
7,sibsp,category,0.0%,68.1%,1309,7,Sample: 0 | 1 | 2 | 4 | 3
8,survived,int64,0.0%,61.8%,1309,2,max=1 | min=0 | mean=0.38 | dominant=0


### Reporting

As well as being displayed visually, the enhanced dictionary can be distributed to any connecting service, such as an Excel spreadsheet and its graphical tooling.


In [7]:
dictionary = tr.canonical_report(tr.load_persist_canonical(), stylise=False)
tr.save_report_canonical(reports=tr.REPORT_DICTIONARY, report_canonical=dictionary)
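
As a library-independent illustration of why an unstyled report is easy to share, the sketch below writes a small dictionary-like DataFrame to CSV text and reads it back with pandas. The frame `report_df` is a hand-made stand-in (its two rows reuse the `age` and `fare` null percentages from the report above), and an in-memory buffer is assumed in place of a real file or service.

```python
import io

import pandas as pd

# Stand-in for an unstyled dictionary report
report_df = pd.DataFrame({
    "attribute": ["age", "fare"],
    "dtype": ["float64", "float64"],
    "pct_null": [20.1, 0.1],
})

# Write to CSV text; a spreadsheet tool could open the same content from a file
buffer = io.StringIO()
report_df.to_csv(buffer, index=False)

# Read it back to confirm nothing is lost in the exchange
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```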

### Report Tailoring

By default, reports are given their own name and data type, though this can be tailored to suit a target system, with options for name, versioning, timestamp and the data type of the report.

In [8]:
reports = [tr.report2dict(report=tr.REPORT_DICTIONARY, prefix='titanic_', file_type='csv', stamped='days')]
tr.save_report_canonical(reports=reports, report_canonical=dictionary)
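
The idea of tailoring a report name with a prefix, file type and timestamp can be sketched as follows. This is purely illustrative: `build_report_name`, its stamp formats and the resulting naming pattern are assumptions made for this example and do not describe the names that `report2dict` actually produces.

```python
from datetime import datetime
from typing import Optional


def build_report_name(report: str, prefix: str = "", file_type: str = "json",
                      stamped: Optional[str] = None,
                      now: Optional[datetime] = None) -> str:
    """Hypothetical helper: compose '<prefix><report>[_<stamp>].<file_type>'."""
    # Assumed stamp granularities, chosen only for this sketch
    formats = {"days": "%Y%m%d", "hours": "%Y%m%d%H"}
    now = now or datetime.now()
    stamp = ""
    if stamped is not None:
        stamp = "_" + now.strftime(formats[stamped])
    return f"{prefix}{report}{stamp}.{file_type}"


# A fixed, arbitrary date keeps the example reproducible
name = build_report_name("dictionary", prefix="titanic_", file_type="csv",
                         stamped="days", now=datetime(2022, 8, 15))
print(name)  # titanic_dictionary_20220815.csv
```

Stamping by day means repeated runs on the same day overwrite one file, while runs on different days accumulate a history.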

## Quality Summary

In addition to the column-level detail in the dictionary, one can also produce a summary overview of the dataset as a whole. The quality report provides a condensed view of the quality score, data shape, data types, usability summary and, if applicable, cost.


In [9]:
tr.report_quality_summary()

Unnamed: 0,report,summary,result
0,score,quality_avg,75%
1,,usability_avg,100%
2,,provenance_complete,0%
3,,data_described,0%
4,data_shape,rows,1309
5,,columns,14
6,,memory,593.08KB
7,data_type,numeric,0
8,,category,6
9,,datetime,0
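
The `data_shape` and `data_type` sections of such a summary can be approximated directly in pandas. The sketch below is an assumption about what those figures represent, not the logic behind `report_quality_summary`; the helper `shape_and_types` and the sample frame are hypothetical, and the score rows are left out because the source does not describe how they are calculated.

```python
import pandas as pd
from pandas.api import types as ptypes


def shape_and_types(df: pd.DataFrame) -> dict:
    """Hypothetical helper: basic shape and dtype counts for a DataFrame."""
    columns = [df[name] for name in df.columns]
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # deep=True includes the memory held by string values
        "memory_bytes": int(df.memory_usage(deep=True).sum()),
        "numeric": sum(ptypes.is_numeric_dtype(c) for c in columns),
        "category": sum(isinstance(c.dtype, pd.CategoricalDtype) for c in columns),
        "datetime": sum(ptypes.is_datetime64_any_dtype(c) for c in columns),
    }


# Illustrative frame with one column of each kind
frame = pd.DataFrame({
    "age": [29.0, None],
    "sex": pd.Categorical(["female", "male"]),
    "when": pd.to_datetime(["2022-08-01", "2022-08-02"]),
})

overview = shape_and_types(frame)
print(overview)
```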


### Report

As with the dictionary, the quality report can be saved and redistributed to interested parties.

In [10]:
quality = tr.report_quality_summary(stylise=False)
tr.save_report_canonical(reports=tr.REPORT_SUMMARY, report_canonical=quality)