# Demo 1: Computation and Storing of DQ Measurement Results

This first demonstration shows how DQ measurement results and the metric used for their computation can be stored using DaQSS.
As an example, this demonstration uses the completeness per record of a CSV file of [fake customer data](https://github.com/johannesschrott/fake_customer_data) that contains missing values.

1. Import required Python packages:

In [10]:
from daqss import *  # for accessing the database part of DaQSS

import pandas  # for the representation of data and the computation of DQ measurement results

2. Load data from a CSV file into a Pandas DataFrame:

In [11]:
data: pandas.DataFrame = pandas.read_csv("../demo_data/fake_customer_data.csv")
data = data.set_index("CustomerID", drop=False)
# Create a DataFrame holding the data from the CSV file, set its index, 
# and retain its index as part of the data.

data_global_identifier: str = "https://johannes.schrott.onl/fake_customer_data/fake_customer_data.csv"
data_local_identifier: str = "fake_customer_data.csv"
# Identifiers that uniquely identify the CSV file within DaQSS.
# The global_identifier may use other protocols and schemes than the one used above.

# Show some records of the data:
data.head(5)

| CustomerID (index) | CustomerID | FirstName | LastName | AddressID | EmailAddress | Phone | Mobile |
|---|---|---|---|---|---|---|---|
| 77700.0 | 77700.0 | Susan | Ellis | 64811.0 | yorkvanessa@example.org | 001-301-718-7221x9375 | |
| 86928.0 | 86928.0 | Caroline | Barr | 7716.0 | | (363)436-7243x6386 | |
| 16019.0 | 16019.0 | Katherine | Hess | 40569.0 | | 636.605.6222x71499 | |
| 30163.0 | 30163.0 | Gina | Gross | 33463.0 | gluna@example.org | | (971)505-3898x2674 |
| 85773.0 | 85773.0 | Albert | Hall | 66320.0 | john03@example.net | 5863061645 | 001-377-640-5676x2420 |
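
The effect of `set_index("CustomerID", drop=False)` can be illustrated with a minimal sketch. The two-row frame `toy` is a hypothetical stand-in for the CSV contents, not part of the demo data. It shows that the identifier becomes the index while remaining an ordinary column, which is why `CustomerID` appears twice in the output above.

```python
import pandas

# Hypothetical two-row frame standing in for the CSV contents.
toy = pandas.DataFrame({"CustomerID": [1.0, 2.0], "FirstName": ["Ann", "Bob"]})

# drop=False keeps the column in the data in addition to using it as the index.
indexed = toy.set_index("CustomerID", drop=False)

print(indexed.index.name)     # name of the index
print(list(indexed.columns))  # columns still include the identifier
```

Because the identifier stays in the data, it also counts as one of the values of each row when a row-wise metric is computed later.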


3. Define a DQ metric:

In [12]:
def arith_mean_completeness_per_row(row) -> float:
    """Computes the unweighted arithmetic mean of the availability of values in a row.
    A value of 1 means that all values in the row are available, whereas 0 means that the row contains no values."""
    return 1 - row.isna().mean()
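
As a worked example, the following sketch applies the metric to a single row that mirrors the first record shown above (seven fields, `Mobile` missing). The function is repeated so that the block runs on its own. With one of seven values missing, the result is 1 - 1/7 = 6/7.

```python
import math
import pandas


def arith_mean_completeness_per_row(row) -> float:
    """Share of available (non-missing) values in a row."""
    return 1 - row.isna().mean()


# One row with seven fields, of which one (Mobile) is missing.
row = pandas.Series({
    "CustomerID": 77700.0,
    "FirstName": "Susan",
    "LastName": "Ellis",
    "AddressID": 64811.0,
    "EmailAddress": "yorkvanessa@example.org",
    "Phone": "001-301-718-7221x9375",
    "Mobile": None,
})

completeness = arith_mean_completeness_per_row(row)
print(round(completeness, 6))  # 6/7, rounded to six decimals
```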


4. Apply the DQ metric to the data:

In [13]:
dq_measurement_results: pandas.Series = data.apply(arith_mean_completeness_per_row, axis=1)  # apply the DQ metric

# Show some records of the DQ measurement results:
dq_measurement_results.head(5)

CustomerID
77700.0    0.857143
86928.0    0.714286
16019.0    0.714286
30163.0    0.857143
85773.0    1.000000
dtype: float64
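
For larger files, the row-wise `apply` can be slow. A possible alternative, sketched here on a small hypothetical frame `toy`, computes the same values with a vectorized expression. This is a design option and not something the demo requires, and the function object is still needed when storing the metric itself.

```python
import pandas

# Hypothetical three-row frame with a varying number of missing values.
toy = pandas.DataFrame({
    "CustomerID": [1.0, 2.0, None],
    "EmailAddress": ["a@example.org", None, None],
})

# Row-wise application, as in the demo.
row_wise = toy.apply(lambda row: 1 - row.isna().mean(), axis=1)

# Vectorized equivalent: mean of the missing-value mask along each row.
vectorized = 1 - toy.isna().mean(axis=1)

# Raises an AssertionError if the two results differ.
pandas.testing.assert_series_equal(row_wise, vectorized)
print(vectorized.tolist())
```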

5. Store the DQ metric and the computed results in the DaQSS database:

In [14]:
d = DaQSS()  # create a new instance of the DaQSS class to provide easy access to the database

In [16]:
# Create a new DQ dimension for the completeness metric,
# in case it has not been created already.
# If it already exists, nothing happens.
d.store_dq_dimension("Completeness")

# Store the metric. If it already exists, nothing happens.
# If it is associated with a DQ dimension that does not exist,
# the metric will still be created, but that dimension association is skipped.
d.store_dq_metric(arith_mean_completeness_per_row, ["Completeness"], ROW)

 - this association is already in place, or 
 the dimension "Completeness" does not exist.


In [17]:
# DaQSS requires that every parent data element of the data elements whose DQ is measured
# is also represented in the system.
d.store_data_element(data_global_identifier, data_local_identifier, TABLE)

In [18]:
# Store the DQ measurement result computed for each row
d.store_dq_measurement_results_from_series(arith_mean_completeness_per_row, ROW, data_global_identifier,
                                           dq_measurement_results)

 - no data element global_identifier was provided, or
 - the DQ metric computed two results for the same data element, or
 - the provided parent data element is not represented in DaQSS


The warning raised when storing the DQ measurement results for `fake_customer_data.csv` is acceptable, since the CSV file contains rows that lack an identifier.
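
To see which results are affected, the index of the results Series can be inspected with plain pandas before storing. The sketch below uses a hypothetical Series `results` in place of `dq_measurement_results` and checks two of the conditions named in the warning: missing identifiers and identifiers that occur more than once. Whether DaQSS stores the filtered Series without a warning is an assumption that this sketch does not verify.

```python
import pandas

# Hypothetical results indexed by CustomerID:
# one identifier is missing and one occurs twice.
results = pandas.Series(
    [1.0, 0.5, 0.75, 0.25],
    index=pandas.Index([10.0, None, 20.0, 20.0], name="CustomerID"),
)

missing_id = results.index.isna()                    # rows without an identifier
duplicate_id = results.index.duplicated(keep=False)  # rows sharing an identifier

print(int(missing_id.sum()), "result(s) without an identifier")
print(int(duplicate_id.sum()), "result(s) with a duplicated identifier")

# Keep only results that have a unique, non-missing identifier.
clean = results[~missing_id & ~duplicate_id]
```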