# Data quality

In [1]:
from pathlib import Path

import pandas as pd
from ydata_profiling import ProfileReport

## Load (restricted) health data

We consider synthetic, simplified data with a small number of features.

In [2]:
project_root = Path('.') / '..'
data_dir = project_root / 'data'
notebooks_dir = project_root / 'notebooks'
eda_out_dir = notebooks_dir / 'eda-output'
eda_out_dir.mkdir(exist_ok=True)


simple_customer_df = pd.read_csv(data_dir / 'simplified-customer-health.csv')
simple_customer_df.head()

Unnamed: 0,customer_id,height,weight,occupation_group_idx,gender_idx,skin_cancer,depression
0,0,162.29231,83.763466,3,0,0.0,0.0
1,1,136.2501,53.3336,2,0,0.0,0.0
2,2,162.22852,78.86671,0,0,0.0,0.0
3,3,206.18896,141.28181,2,1,0.0,0.0
4,4,168.00403,108.01881,3,1,0.0,0.0


In [3]:
aggregate_claims_df = pd.read_csv(data_dir / 'aggregate-claim.csv')
aggregate_claims_df.head()

Unnamed: 0,customer_id,agg_claim_amount,year
0,0,10.26,2022
1,1,0.0,2022
2,2,0.0,2022
3,3,0.03,2022
4,4,4.84,2022


## Explore data with ydata-profiling

In [4]:
# minimal set to True due to BinderHub resource limitations
profile_health = ProfileReport(simple_customer_df, title="Simple customer health data", minimal=True)
# Write to disk
profile_health.to_file(eda_out_dir / 'simple-health-eda.html')
# Display in notebook (note: this step may crash BinderHub; as a workaround, open the generated HTML file in BinderHub instead)
# profile_health


In [5]:
# minimal set to True due to BinderHub resource limitations
profile_claims = ProfileReport(aggregate_claims_df, title="Aggregate claims data", minimal=True)
# Write to disk
profile_claims.to_file(eda_out_dir / 'aggregate-claims-eda.html')
# Display in notebook (note: this step may crash BinderHub; as a workaround, open the generated HTML file in BinderHub instead)
# profile_claims


## Task: Data quality concerns

Given your use case of clustering existing insurance customers into segments for pricing private health insurance, identify data quality issues, or questions for business experts, in the data sets `data/simplified-customer-health.csv` and `data/aggregate-claim.csv`.
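As a starting point for this task, a few checks can be scripted alongside the profiling reports. The sketch below is only one possible approach: the helper names `basic_quality_summary` and `out_of_range` are hypothetical, and the height bounds in the example are placeholder assumptions (the unit of `height` is itself a question for business experts).

```python
import pandas as pd


def basic_quality_summary(df: pd.DataFrame, key: str = "customer_id") -> dict:
    """Collect a few simple indicators worth raising with business experts."""
    return {
        "n_rows": len(df),
        "n_missing_per_column": df.isna().sum().to_dict(),
        # Rows whose key value already appeared in an earlier row
        "n_duplicate_keys": int(df[key].duplicated().sum()),
        # Rows that are identical to an earlier row in every column
        "n_duplicate_rows": int(df.duplicated().sum()),
    }


def out_of_range(df: pd.DataFrame, column: str, lower: float, upper: float) -> pd.DataFrame:
    """Return rows whose value in `column` falls outside [lower, upper].

    `between` is False for missing values, so rows with a missing value
    in `column` are returned as well.
    """
    return df[~df[column].between(lower, upper)]


# Toy data for illustration only; the bounds 140 and 200 are assumptions.
toy_df = pd.DataFrame(
    {"customer_id": [0, 1, 1], "height": [162.3, 136.3, 206.2]}
)
print(basic_quality_summary(toy_df))
print(out_of_range(toy_df, "height", lower=140, upper=200))
```

Which bounds count as plausible, and whether a repeated `customer_id` is an error at all, are decisions to take together with the data producer rather than in code.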

## Solution

TODO


## Task: Implement a basic data contract / expectation enforcement to flag duplicate records


You may either implement your code here in the notebook, or within one of the project's modules (a potential solution is given in `src/workshop/possible_data_contract.py`).
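To illustrate what such an expectation could look like, here is a minimal sketch using plain pandas. It is not the content of `src/workshop/possible_data_contract.py`; the names `DataContractViolation`, `find_duplicates`, and `expect_no_duplicates` are hypothetical, and treating `(customer_id, year)` as the unique key of the claims data is an assumption to confirm with the data producer.

```python
import warnings

import pandas as pd


class DataContractViolation(Exception):
    """Raised when an expectation on the data is not met (hypothetical name)."""


def find_duplicates(df: pd.DataFrame, subset: list) -> pd.DataFrame:
    """Return every row whose key columns `subset` occur more than once."""
    # keep=False marks all members of a duplicate group, not just the repeats
    return df[df.duplicated(subset=subset, keep=False)]


def expect_no_duplicates(
    df: pd.DataFrame, subset: list, raise_on_violation: bool = False
) -> pd.DataFrame:
    """Flag duplicate keys with a warning, or with an error if requested."""
    duplicates = find_duplicates(df, subset)
    if duplicates.empty:
        return duplicates
    message = f"{len(duplicates)} rows share a duplicated key {subset}"
    if raise_on_violation:
        raise DataContractViolation(message)
    warnings.warn(message)
    return duplicates


# Toy data for illustration only; the key columns are an assumption.
toy_claims_df = pd.DataFrame(
    {
        "customer_id": [0, 1, 1],
        "year": [2022, 2022, 2022],
        "agg_claim_amount": [10.26, 0.0, 0.03],
    }
)
flagged = expect_no_duplicates(toy_claims_df, subset=["customer_id", "year"])
print(flagged)
```

The `raise_on_violation` switch mirrors the `WARNING` versus `ERROR` decision listed as out of scope: the check itself stays the same, and only the reaction changes.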

### Out-of-scope

Unless you have extra time now :)

* Integrate with data pipeline
* Together with business / data producer, decide what action to take, e.g. `WARNING`, `ERROR`
* Data document generation

## Solution

TODO

## Task (optional): More realistic data

Note: by "optional" I mean that no solution is presented in the `notebooks/completed` version of this notebook, and we may not have time to discuss your work during the workshop.

Perform EDA on the synthetic data in `data/customer-health`, which is more realistic in terms of the number of fields.

What challenges arise as the number of fields grows large?
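With many fields, reading a full profiling report column by column becomes slow, so a compact per-column overview can help decide where to look first. The following is a rough sketch under that assumption; `column_overview` is a hypothetical helper, not part of the workshop code.

```python
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column in one row: dtype, missing share, distinct values."""
    overview = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_share": df.isna().mean(),
            "n_unique": df.nunique(dropna=True),
        }
    )
    # Columns with at most one distinct value carry no information for clustering
    overview["is_constant"] = overview["n_unique"] <= 1
    return overview.sort_values("missing_share", ascending=False)


# Toy data for illustration only.
toy_wide_df = pd.DataFrame(
    {"a": [1.0, 2.0, None], "b": [5, 5, 5], "c": ["x", "y", "z"]}
)
print(column_overview(toy_wide_df))
```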

## Solution

TODO