# Measuring Data Quality

## Measures of Data Quality

- Validity: the data conforms to a schema
- Accuracy: the data conforms to a gold standard
- Completeness: are all records present?
- Consistency: the data matches other data
- Uniformity: the same units are used throughout
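
As a rough illustration of how two of these measures might be turned into checks, here is a minimal sketch. The records, field names, and helper names (`is_valid_year`, `completeness`) are made up for this example, and the fill rate of a field is only a proxy for completeness, which strictly asks whether whole records are missing.

```python
# Hypothetical records; the field names and values are invented for illustration.
records = [
    {"name": "Model A", "year": "1999", "length_m": "4.2"},
    {"name": "Model B", "year": "19x9", "length_m": "4.5"},
    {"name": "Model C", "year": "2005", "length_m": ""},
]


def is_valid_year(value):
    """Validity: the value matches the expected format (four digits)."""
    return len(value) == 4 and value.isdigit()


def completeness(rows, field):
    """Completeness proxy: fraction of rows with a non-empty value for `field`."""
    filled = sum(1 for row in rows if row[field].strip())
    return filled / float(len(rows))


# Names of the records whose year fails the validity check
invalid_years = [r["name"] for r in records if not is_valid_year(r["year"])]
# Share of records that have a length at all
length_completeness = completeness(records, "length_m")
```

Accuracy and consistency need something to compare against (a gold standard or a second dataset), so they are left out of this sketch.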

## Blueprint for Cleaning Data

- Audit your Data
- Create a Data Cleaning Plan
  - Identify causes
  - Define operations
  - Test
- Execute the plan (run a script)
- Manually correct


Iterate on this process.
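
The audit step could start with something as simple as counting what kinds of values a field holds. The sketch below is one possible approach, not a prescribed method: `classify` and `audit_field` are hypothetical helpers, the sample rows are invented, and treating the string `NULL` as empty is an assumption about how missing values are encoded.

```python
from collections import Counter


def classify(value):
    """Return a coarse label for a raw string value."""
    value = value.strip()
    # Assumption: missing values appear as an empty string or the text "NULL"
    if value == "" or value == "NULL":
        return "empty"
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"


def audit_field(rows, field):
    """Count how many values of `field` fall under each label."""
    return Counter(classify(row[field]) for row in rows)


# Invented sample rows standing in for the output of csv.DictReader
sample_rows = [
    {"productionStartYear": "1999-01-01T00:00:00+02:00"},
    {"productionStartYear": "NULL"},
    {"productionStartYear": "1987"},
    {"productionStartYear": ""},
]
summary = audit_field(sample_rows, "productionStartYear")
```

A summary like this points at the causes to investigate (for example, why some values are full timestamps and others bare integers) before any cleaning operations are defined.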

# Quiz 1

In [27]:
"""
Your task is to check the "productionStartYear" of the DBPedia autos datafile for valid values.
The following things should be done:
- check if the field "productionStartYear" contains a year
- check if the year is in range 1886-2014
- convert the value of the field to be just a year (not full datetime)
- the rest of the fields and values should stay the same
- if the value of the field is a valid year in the range as described above,
  write that line to the output_good file
- if the value of the field is not a valid year as described above, 
  write that line to the output_bad file
- discard rows (neither write to good nor bad) if the URI is not from dbpedia.org
- you should use the provided way of reading and writing data (DictReader and DictWriter)
  They will take care of dealing with the header.

You can write helper functions for checking the data and writing the files, but we will call only
'process_file', with 3 arguments (inputfile, output_good, output_bad).
"""
import csv
import datetime as dt

INPUT_FILE = 'data/autos.csv'
OUTPUT_GOOD = 'data/autos-valid.csv'
OUTPUT_BAD = 'data/FIXME-autos.csv'

In [28]:
def get_date(date_str):
    try:
        return dt.datetime.strptime(date_str[:10], '%Y-%m-%d')
    except ValueError:
        return None
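
To see what this helper accepts, it can be run on a few sample strings. The snippet repeats `get_date` so that it runs on its own; the sample values are made up and not taken from the data file.

```python
import datetime as dt


def get_date(date_str):
    # Same helper as in the cell above, repeated so this snippet is standalone
    try:
        return dt.datetime.strptime(date_str[:10], '%Y-%m-%d')
    except ValueError:
        return None


# Made-up sample values
samples = ["1999-01-01T00:00:00+02:00", "1999", "NULL", ""]
results = {s: get_date(s) for s in samples}
```

Only the full timestamp parses. A bare year such as `"1999"` does not match `'%Y-%m-%d'`, so `get_date` returns `None` for it and `process_file` would send that row to the bad file. Whether that is the desired outcome depends on how "contains a year" in the task is read.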

In [29]:
def date_in_range(year):
    # The task fixes the valid range at 1886-2014 (inclusive),
    # so the upper bound must not follow the current year
    min_year = 1886
    max_year = 2014
    return min_year <= year <= max_year

In [30]:
def process_file(input_file, output_good, output_bad):
    good_data = list()
    bad_data = list()

    with open(input_file, "r") as f:
        reader = csv.DictReader(f)
        # Keep the header so the writers can reuse the same field order
        header = reader.fieldnames
        for row in reader:
            # Discard rows whose URI is not from dbpedia.org
            if not row['URI'].startswith('http://dbpedia.org'):
                continue
            # Check that the field holds a parseable date
            date = get_date(row['productionStartYear'])
            if not date:
                bad_data.append(row)
                continue
            # Convert the value to the year alone
            year = date.year
            row['productionStartYear'] = year
            # Check that the year is in the allowed range
            if not date_in_range(year):
                bad_data.append(row)
                continue
            good_data.append(row)

    # Write the valid rows and the rejected rows to separate files
    with open(output_good, "w") as g:
        writer = csv.DictWriter(g, delimiter=",", fieldnames=header)
        writer.writeheader()
        for row in good_data:
            writer.writerow(row)

    with open(output_bad, "w") as g:
        writer = csv.DictWriter(g, delimiter=",", fieldnames=header)
        writer.writeheader()
        for row in bad_data:
            writer.writerow(row)

In [32]:
def test():
    process_file(INPUT_FILE, OUTPUT_GOOD, OUTPUT_BAD)


test()
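
The `test` function above only runs the script; it does not check the result. One way to cover the "Test" step of the cleaning plan is to read an output file back and look for rows that break the rule. This is a minimal sketch: `find_bad_years` is a hypothetical helper, and the default bounds are taken from the task description.

```python
import csv


def find_bad_years(path, min_year=1886, max_year=2014):
    """Return rows whose productionStartYear is not a bare year in range."""
    offenders = []
    with open(path, "r") as f:
        for row in csv.DictReader(f):
            # A short row can yield None here, so fall back to an empty string
            value = row.get("productionStartYear") or ""
            if not (value.isdigit() and min_year <= int(value) <= max_year):
                offenders.append(row)
    return offenders
```

After `process_file` has run, `find_bad_years(OUTPUT_GOOD)` would be expected to return an empty list; any rows it does return are worth inspecting by hand.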