# Dimensions of Data Quality
#### By: Devin McCormack

There is no single agreed-upon definition of the dimensions of data quality, but they can broadly be sorted into four buckets, listed in decreasing order of severity:

## 1. Completeness
Data is complete if all essential variables are measured and there are no missing (null) values. Data can be incomplete and still usable.

Check if we have all the data we need/want. Essential data should be 100% complete.
> ### Questions to ask:

> - Do we have all the records that we should? 
> - Do we have missing records? 
> - Are there specific rows (observations), columns (variables), or cells (values) missing?

> ### Examples:
> - Important data columns missing
> - Unequal number of observations across different sources for the same observational unit
> - NaNs in cells
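
The checks above can be sketched with pandas. This is a minimal illustration, not a prescribed workflow: the `patients` and `visits` tables, their column names, and the `required_columns` list are all hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data from two sources describing the same patients.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "height_cm": [170.0, np.nan, 165.0, 180.0],
    "weight_kg": [70.0, 82.5, np.nan, np.nan],
})
visits = pd.DataFrame({"patient_id": [1, 2, 3]})

# Assumed list of variables we consider essential.
required_columns = ["patient_id", "height_cm", "weight_kg", "zip_code"]

# Columns: which essential variables are absent entirely?
missing_columns = [c for c in required_columns if c not in patients.columns]

# Cells: how many NaNs does each column contain?
nulls_per_column = patients.isna().sum()

# Fraction of non-null values per column (1.0 means fully complete).
completeness = patients.notna().mean()

# Rows: which patients appear in one source but not the other?
missing_from_visits = set(patients["patient_id"]) - set(visits["patient_id"])
```

Here `completeness` gives a per-column score that can be compared against whatever threshold the project sets for essential data.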

## 2. Validity
Data is valid if it conforms to the syntax (format, type, range) of its definition.

Check if the data makes sense. Type-specific functions (for example, datetime operations) only work on correctly typed data.
> ### Questions to ask:
> - Does the data conform to a defined schema (set of rules for the data)?
> - Are variable data types consistent with the data values?
> - Are there nonsensical values for certain variables?
> - Does the data conform to primary/foreign key constraints?

> ### Examples:
> - Dates stored as object strings (instead of datetime), zip codes stored as floats (instead of integers), assigned sex stored as an object string (instead of category)
> - Negative height, zip codes with only 4 digits, names with digits in them
> - Multiple rows for the same primary key
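
A rough sketch of how these validity checks might look in pandas follows. The table, its column names, and the rules (ISO-formatted dates, five-digit zip codes, positive heights, `patient_id` as primary key) are assumptions made for illustration. Zip codes are kept as strings here so that their format can be checked against a pattern.

```python
import pandas as pd

# Hypothetical records; every value is made up for illustration.
df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "visit_date": ["2021-01-05", "2021-02-30", "2021-03-12", "2021-04-01"],
    "zip_code": ["44106", "90210", "4410", "10001"],
    "height_cm": [170.0, 165.0, 180.0, -175.0],
})

# Type: parse strings as dates; anything that is not a real date becomes NaT.
parsed_dates = pd.to_datetime(df["visit_date"], format="%Y-%m-%d", errors="coerce")
invalid_dates = df[parsed_dates.isna()]

# Format: assume a valid zip code is exactly five digits.
invalid_zips = df[~df["zip_code"].str.fullmatch(r"\d{5}")]

# Range: a height must be positive.
invalid_heights = df[df["height_cm"] <= 0]

# Key constraint: a primary key should identify exactly one row.
duplicate_keys = df[df["patient_id"].duplicated(keep=False)]
```

Each result is a filtered copy of the offending rows, so an empty frame means that rule passed.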

## 3. Accuracy
Data is accurate if it correctly describes the real-world object or event being described.

Check if the data is correct. Assessing accuracy may require specific domain knowledge.
> ### Questions to ask:
> - Were there any data entry errors?
> - Even if valid, does it make sense?
> - Do we have the means to check accuracy, such as calculations or references?

> ### Examples:
> - Highly unlikely weight measurement (10 lbs for a full-grown adult)
> - Default or filler data (John Doe, 1234 Main Street)
> - Calculated variables that do not match the variables used to create them (BMI not consistent with height and weight measurements)
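
Accuracy checks usually depend on domain knowledge, but some can be approximated in code. The sketch below assumes a hypothetical table with metric units, an assumed lower bound for adult weight, an assumed set of placeholder names, and the standard BMI formula (weight in kilograms divided by height in meters squared). The threshold and tolerance are illustrative choices, not recommendations.

```python
import pandas as pd

# Hypothetical measurements; all values are made up for illustration.
df = pd.DataFrame({
    "name": ["Ann Lee", "John Doe", "Raj Patel"],
    "height_m": [1.60, 1.80, 1.75],
    "weight_kg": [64.0, 81.0, 4.5],
    "bmi": [25.0, 30.0, 1.47],
})

# Plausibility: assumed minimum weight for an adult, in kilograms.
MIN_ADULT_WEIGHT_KG = 25
implausible_weight = df[df["weight_kg"] < MIN_ADULT_WEIGHT_KG]

# Filler data: assumed list of known placeholder names.
PLACEHOLDER_NAMES = {"John Doe", "Jane Doe"}
placeholder_rows = df[df["name"].isin(PLACEHOLDER_NAMES)]

# Derived values: recompute BMI and flag rows where the recorded value
# differs from the recomputed one by more than a small tolerance.
expected_bmi = df["weight_kg"] / df["height_m"] ** 2
bmi_mismatch = df[(df["bmi"] - expected_bmi).abs() > 0.1]
```

A flagged row is a candidate for review rather than proof of an error. The third row's BMI agrees with its height and weight, for example, even though the weight itself is implausible.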

## 4. Consistency
Data is consistent when there is no difference when comparing two or more representations of a thing against a definition.

Check if the data is coherent across the whole dataset. Important for aggregation.

> ### Questions to ask:
> - Are there multiple ways the dataset is representing a single thing?
> - Do the same values mean the same thing across a variable?

> ### Examples:
> - State names spelled fully for some observations, but abbreviations for others (Ohio and OH in same dataset)
> - Dates (birth date, visit date) encoded differently
> - A patient_id associated with different names across multiple tables
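
Consistency problems like these can be surfaced with a few pandas operations. In this sketch the two tables, the state abbreviation mapping, and the choice of `YYYY-MM-DD` as the expected date format are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical tables that should describe the same patients.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "name": ["Ann Lee", "Raj Patel", "Maria Gomez"],
    "state": ["Ohio", "OH", "CA"],
})
visits = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "name": ["Ann Lee", "Raj Patel", "Maria Gomes"],
    "visit_date": ["2021-01-05", "01/17/2021", "2021-03-12"],
})

# One representation per thing: map full state names to abbreviations.
# Values not in the mapping are left unchanged.
STATE_ABBREVIATIONS = {"Ohio": "OH", "California": "CA"}
patients["state_clean"] = patients["state"].replace(STATE_ABBREVIATIONS)

# Encoding: flag dates that do not follow the assumed YYYY-MM-DD format.
iso_dates = pd.to_datetime(visits["visit_date"], format="%Y-%m-%d", errors="coerce")
non_iso_dates = visits[iso_dates.isna()]

# Cross-table agreement: the same patient_id should carry the same name.
merged = patients.merge(visits, on="patient_id", suffixes=("_patients", "_visits"))
name_conflicts = merged[merged["name_patients"] != merged["name_visits"]]
```

Normalizing to a single representation before aggregating keeps `Ohio` and `OH` from being counted as two different groups.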




Definitions and organization inspired by the DAMA UK Working Group “Data Quality Dimensions” white paper and Udacity.