# Dimensions of Data Quality
#### By: Devin McCormack

## 0. Types of data.

The two broad categories of data are **Quantitative**, or measurable amounts, and **Qualitative**, or descriptive characteristics.

### Quantitative variables:
There are two major categories within quantitative variables: **Continuous** and **Discrete**. All quantitative variables require an inherent quality of "distance", where closer numbers are more related than more distant ones, so you can perform math with the values. Although all quantitative variables are numerical, do not assume that all numbers are quantitative. A zip code is an example of a number that carries no quantitative value.

- **Continuous Variables:** Infinitely divisible units, such as height, weight, distance, etc.


- **Discrete Variables:** A tally, or a non-divisible unit; only whole quantities, like population or goals scored.


### Qualitative variables:
There are three major categories of qualitative variables: **Binomial**, **Ordinal**, and **Nominal**. The major distinction between them is whether the qualitative values can be ordered in a meaningful way. The values may be numbers, but those numbers are not mathematical in nature, even if there is a strong order to them.

- **Binomial Variables:** These are true/false type variables, with only two possible values.


- **Ordinal Variables:** These are variables where you can assign a meaningful order to the values with respect to the specific question. For example, Awful, Bad, Okay, Good, and Great have a strong order, but our concept of distance is limited to the fact that some categories are closer than others; we have no concept of how much closer they are. Ordinal variables can often be binned quantitative variables as well, if they are tied to a relative conceptual model, like short, medium, tall. Note that some orders, like alphabetical, may not merit calling a variable ordinal if the order adds no value to it.


- **Nominal Variables:** All other variables are likely nominal. These are things like names. Nominal variables lack any semblance of "distance". The nominality/ordinality of a specific variable is heavily dependent on the question posed. "Cat" and "Cactus" are nominal values, but if we are building an NLP model, we may use a Levenshtein distance to create a distance measure by comparing letter composition. Words would then be framed with respect to size and letter composition (potentially quantitative measures) instead of their semantic content. Semantic content itself has clusterable distance: a cat is closer to a lion than to a dog in one semantic sense (biological), and closer to a dog in others (domestication or pet status). Either way, creating a distance requires a three-way comparison.
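
Two ideas from the list above can be sketched in code. This is only an illustration, assuming pandas is the tool at hand: the rating values come from the ordinal example, while the variable names and the hand-written `levenshtein` helper are assumptions of this sketch rather than anything the definitions prescribe.

```python
import pandas as pd

# Ordinal: an ordered categorical keeps the ranking without implying arithmetic.
rating_order = ["Awful", "Bad", "Okay", "Good", "Great"]
ratings = pd.Series(
    pd.Categorical(["Good", "Awful", "Great", "Okay"],
                   categories=rating_order, ordered=True)
)
at_least_okay = ratings >= "Okay"  # comparisons respect the declared order
best_rating = ratings.max()        # the highest category that is present


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


# Nominal values given a letter-based distance (not a semantic one).
cat_to_cactus = levenshtein("cat", "cactus")
cat_to_cot = levenshtein("cat", "cot")
```

The ordered categorical supports comparisons and `max`, but it deliberately defines no notion of how far "Good" is from "Great", which matches the limited sense of distance described for ordinal variables.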


NOW THEN. On to the dimensions of data quality.

# Dimensions

There is no consistent definition of these dimensions, but they can broadly be sorted into four buckets, listed in decreasing order of severity:

## 1. Completeness
Data is complete if all essential variables are measured and there are no missing/null values. Data can be incomplete and still usable, but more complete data is *always* better (in the sense of analytic possibility).

Check if we have all the data we need/want. Essential data should be 100% complete.
> ### Questions to ask:

> - Do we have all the records that we should? 
> - Do we have missing records? 
> - Are there specific rows (observations), columns (variables), or cells (values) missing?

> ### Examples:
> - Important data columns missing
> - Unequal number of observations across different sources for same observational unit
> - NaNs in cells
> - Missing timepoints
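
A few of these completeness checks might look like the following in pandas. This is a minimal sketch: the `readings` table, its column names, the `required_columns` set, and the daily frequency are all made up for illustration.

```python
import pandas as pd

# Hypothetical daily readings: 2024-01-03 is absent and one value is null.
readings = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05"]
    ),
    "sensor_id": ["a", "a", "a", "a"],
    "value": [1.2, None, 0.9, 1.1],
})

# Assumed list of essential variables for this dataset.
required_columns = {"date", "sensor_id", "value", "unit"}

# Important data columns missing
missing_columns = required_columns - set(readings.columns)

# NaNs in cells: count per column, and share of non-null cells per column
null_counts = readings.isna().sum()
completeness_pct = readings.notna().mean() * 100

# Missing timepoints: compare against the full expected daily range
expected_dates = pd.date_range(
    readings["date"].min(), readings["date"].max(), freq="D"
)
missing_dates = expected_dates.difference(pd.DatetimeIndex(readings["date"]))
```

Here `missing_columns` flags the absent `unit` column, `completeness_pct` gives a per-column figure to compare against the 100% target for essential data, and `missing_dates` lists the gaps in the time series.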

## 2. Validity
Data is valid if it conforms to the syntax (format, type, range) of its definition.

Check if the data makes sense. Special datatype functions require valid datatypes.
> ### Questions to ask:
> - Does the data conform to a defined schema (set of rules for the data)?
> - Are variable data types consistent with the data values?
> - Are there nonsense values for certain dimensions? 
> - Does the data conform to primary/foreign key constraints?

> ### Examples:
> - Date stored as an object string (instead of datetime), zipcode stored as a float (instead of integer), assigned sex as an object string (instead of category), numerical id as a string (instead of integer)
> - Negative height, zip codes with only 4 digits, names with digits in them
> - Multiple rows for the same primary key
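
The validity examples above could be checked along these lines. Treat it as a sketch under assumptions: the `patients` table and its columns are hypothetical, the five-digit rule stands in for whatever schema actually applies, and zip codes are held as strings here purely so the digit-count check can be written as a pattern.

```python
import pandas as pd

# Hypothetical table with one problem of each kind planted in it.
patients = pd.DataFrame({
    "patient_id": ["1", "2", "2", "4"],
    "visit_date": ["2024-01-05", "2024-02-30", "2024-03-01", "2024-03-15"],
    "zip_code": ["44101", "4410", "10001", "94105"],
    "height_cm": [172.0, -5.0, 180.5, 165.0],
})

# Type check: coerce rather than raise, so unparseable dates become NaT.
visit_date = pd.to_datetime(
    patients["visit_date"], format="%Y-%m-%d", errors="coerce"
)
bad_dates = patients[visit_date.isna()]

# Format check: exactly five digits (assumed rule for this sketch).
bad_zip = patients[~patients["zip_code"].str.fullmatch(r"\d{5}")]

# Range check: height must be positive.
bad_height = patients[patients["height_cm"] <= 0]

# Key constraint: every row sharing a supposedly unique id.
duplicate_ids = patients[patients["patient_id"].duplicated(keep=False)]
```

Each result is a filtered view of the offending rows, which makes it easy to inspect them before deciding whether to fix, drop, or escalate.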

## 3. Accuracy
Data is accurate if it correctly describes the real world object or event being described.

Check if the data is correct. Assessing accuracy may require specific domain knowledge.
> ### Questions to ask:
> - Were there any data entry errors? (or was this instrumented properly?)
> - Does this encompass all ways an event can occur? (are there edge cases that are not captured?)
> - Even if valid, does it make sense?
> - Do we have the means to check accuracy? - calculations or references?

> ### Examples:
> - Highly unlikely weight measurement (10 lbs for a full grown adult)
> - Default or filler data (John Doe, 1234 Main Street)
> - Calculated variables do not match the variables they are derived from (bmi not consistent with height and weight measurements)
> - Data indicates completion of step 2 before step 1 when that is not physically possible
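
Accuracy checks usually encode domain knowledge, so the thresholds below are assumptions rather than standards. This sketch uses a hypothetical `records` table in metric units, an assumed plausible adult weight range of 30 to 300 kg, an assumed tolerance of 0.5 BMI points, and a short made-up list of placeholder names.

```python
import pandas as pd

# Hypothetical records with one planted problem per check.
records = pd.DataFrame({
    "name": ["Ana Ruiz", "John Doe", "Li Wei"],
    "height_m": [1.60, 1.80, 1.75],
    "weight_kg": [64.0, 81.0, 4.5],
    "bmi": [25.0, 31.0, 1.47],
    "step1_done": pd.to_datetime(["2024-01-01", "2024-01-03", "2024-01-02"]),
    "step2_done": pd.to_datetime(["2024-01-02", "2024-01-02", "2024-01-05"]),
})

# Highly unlikely measurement (range is an assumption of this sketch).
implausible_weight = ~records["weight_kg"].between(30, 300)

# Default or filler data.
placeholder_name = records["name"].isin(["John Doe", "Jane Doe"])

# Calculated variable vs. its inputs: BMI = weight (kg) / height (m) squared.
recomputed_bmi = records["weight_kg"] / records["height_m"] ** 2
bmi_mismatch = (recomputed_bmi - records["bmi"]).abs() > 0.5

# Impossible ordering: step 2 recorded before step 1.
steps_out_of_order = records["step2_done"] < records["step1_done"]
```

Note that the third row passes the BMI check even though its weight is implausible: the stored BMI is consistent with the inputs, so only the range check catches it. Valid and internally consistent data can still be inaccurate.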

## 4. Consistency
Data is consistent when there is no difference when comparing two or more representations of a thing against a definition.

Check if the data is coherent across the whole dataset. Important for aggregation.

> ### Questions to ask:
> - Are there multiple ways the dataset is representing a single thing?
> - Do same values mean the same thing across a variable?

> ### Examples:
> - State names spelled fully for some observations, but abbreviations for others (Ohio and OH in same dataset)
> - Dates (birth date, visit date) encoded differently
> - A patient_id is associated with different names (specifically, different *identities*) across multiple tables
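
One way to surface and repair these inconsistencies with pandas is sketched below. The tables, the `state_codes` lookup (deliberately tiny and incomplete), and the two date formats are illustrative assumptions.

```python
import pandas as pd

# Hypothetical visits table with mixed state spellings and date formats.
visits = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "state": ["Ohio", "OH", " ny"],
    "visit_date": ["2024-01-05", "01/07/2024", "2024-03-03"],
})

# Multiple representations of one thing: map every spelling to one code.
state_codes = {"ohio": "OH", "oh": "OH", "new york": "NY", "ny": "NY"}
state = visits["state"].str.strip().str.lower().map(state_codes)

# Dates encoded differently: try each known format, keep whichever parses.
iso_dates = pd.to_datetime(
    visits["visit_date"], format="%Y-%m-%d", errors="coerce"
)
us_dates = pd.to_datetime(
    visits["visit_date"], format="%m/%d/%Y", errors="coerce"
)
visit_date = iso_dates.fillna(us_dates)

# Same id, different identity across tables.
registrations = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "name": ["Ana Ruiz", "Li Wei", "Sam Okoro"],
})
billing = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "name": ["Ana Ruiz", "Lee Way", "Sam Okoro"],
})
names_per_id = (
    pd.concat([registrations, billing])
    .groupby("patient_id")["name"]
    .nunique()
)
conflicting_ids = names_per_id[names_per_id > 1].index.tolist()
```

Any spelling missing from the lookup maps to NaN, and any date matching neither format stays NaT, so unrecognized representations show up as missing values instead of slipping through into aggregates.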




Definitions and organization inspired by the DAMA UK Working Group “Data Quality Dimensions” white paper and Udacity.