# Auditing Data Quality Example
In this example we audit a dataset of city infobox data, decide how to clean it, and then clean it. We start by auditing the data types found in selected fields of the dataset. Each value is classified as exactly one of the following types:
- NoneType, if the value is the string "NULL" or an empty string ""
- list, if the value starts with "{"
- int, if the value can be cast to int
- float, if the value can be cast to float but CANNOT be cast to int.
   For example, '3.23e+07' should be considered a float because it can be cast
   to float, but int('3.23e+07') raises a ValueError
- str, for all other values
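
The rules above can be sketched as a small standalone helper. The name `infer_type` is hypothetical (it is not part of this notebook), and the sample values are made up for illustration.

```python
def infer_type(value):
    """Return the type a raw CSV string would be audited as (hypothetical helper)."""
    if value == 'NULL' or value == '':
        return type(None)
    if value.startswith('{'):
        return list
    try:
        int(value)
        return int
    except ValueError:
        pass
    try:
        float(value)
        return float
    except ValueError:
        return str


# Made-up sample values, one per rule.
for sample in ['NULL', '{a|b}', '42', '3.23e+07', 'abc']:
    print(repr(sample), '->', infer_type(sample).__name__)
```

Checking int before float matters here: every string that int() accepts is also accepted by float(), so the order of the two attempts decides which type is reported.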

In [1]:
import csv
from zipfile import ZipFile
import io
import re

fname = 'cities.zip'

with ZipFile(fname, 'r') as zfile:
    CITIES = io.TextIOWrapper(zfile.open('cities.csv'))

FIELDS = ["name", "timeZone_label", "utcOffset", "homepage", "governmentType_label",
          "isPartOf_label", "areaCode", "populationTotal", "elevation",
          "maximumElevation", "minimumElevation", "populationDensity",
          "wgs84_pos#lat", "wgs84_pos#long", "areaLand", "areaMetro", "areaUrban"]

def audit_file(filename, fields):
    fieldtypes = {}
    # Initialise fieldtypes as dictionary of empty sets (unordered collections with no duplicate elements).
    for field in fields:
        fieldtypes[field] = set([])
        
    # Skip the three unwanted header rows.
    reader = csv.DictReader(filename)
    for i in range(3):
        next(reader)
    
    for row in reader:
        for field in fields:
            value = row[field]
            # Classify each value as exactly one type, checking int before float
            # so that integer strings are not also recorded as floats.
            if value == 'NULL' or value == '':
                value_type = type(None)
            elif value.startswith('{'):
                value_type = list
            else:
                try:
                    int(value)
                    value_type = int
                except ValueError:
                    try:
                        float(value)
                        value_type = float
                    except ValueError:
                        value_type = str
            fieldtypes[field].add(value_type)
            
            
    return fieldtypes

# audit_file(CITIES, FIELDS)

## Fix the Name
It would be easier to process and query the data later if every value in the name field were a Python list, rather than a string separated by special characters. The function fix_name() receives a string as input and returns a list of all the names. If there is only one name, the list will have only one item in it; if the name is "NULL" or an empty string, the list should be empty.

In [2]:
with ZipFile(fname, 'r') as zfile:
    CITIES = io.TextIOWrapper(zfile.open('cities.csv'))
    
def fix_name(name):
    if name == 'NULL' or name == '':
        return []
    if name[0] == '{':
        # Drop the surrounding braces and split on the '|' separator.
        return name[1:-1].split('|')
    else:
        return [name]

def process_file(filename):
    data = []
    reader = csv.DictReader(filename)
    # Skip the extra metadata.
    for i in range(3):
        next(reader)
    # Process the file.
    for line in reader:
        # Call fix_name() function to fix the city names.
        if 'name' in line:
            line['name'] = fix_name(line['name'])
        data.append(line)
    return data

# process_file(CITIES)
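
To try the name fix without cities.zip, the sketch below runs the same steps on a small in-memory CSV. The sample rows are made up, and fix_name() is repeated here only so the block runs on its own; it is meant to behave like the function defined in the cell above.

```python
import csv
import io


def fix_name(name):
    # Repeated from the cell above so this sketch is self-contained.
    if name == 'NULL' or name == '':
        return []
    if name.startswith('{'):
        return name[1:-1].split('|')
    return [name]


# Made-up sample: one header row, three metadata rows, then three data rows.
SAMPLE = io.StringIO(
    "name,populationTotal\n"
    "meta1,meta1\n"
    "meta2,meta2\n"
    "meta3,meta3\n"
    "Alpha,100\n"
    "{Beta|Gamma},200\n"
    "NULL,NULL\n"
)

reader = csv.DictReader(SAMPLE)
for _ in range(3):
    next(reader)  # skip the metadata rows that follow the header

names = [fix_name(row['name']) for row in reader]
print(names)  # [['Alpha'], ['Beta', 'Gamma'], []]
```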

## Crossfield Auditing
There are a couple of fields that seem to provide the same information in different formats: "point" seems to be the combination of "wgs84_pos#lat" and "wgs84_pos#long". However, we do not know whether that is the case, so we check that they are equivalent with the function check_loc(). It receives three strings: first the combined value of "point", then the "wgs84_pos#lat" and "wgs84_pos#long" values separately. The lat and long values are extracted from "point" and compared to the "wgs84_pos#" values, and the function returns True or False.

In [3]:
with ZipFile(fname, 'r') as zfile:
    CITIES = io.TextIOWrapper(zfile.open('cities.csv'))

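def check_loc(point, lat, longi):
    try:
        # "point" holds the latitude and longitude separated by a single space.
        p_lat, p_long = [float(x) for x in point.split(' ')]
        return p_lat == float(lat) and p_long == float(longi)
    except ValueError:
        # Values that cannot be parsed as floats (e.g. "NULL") count as a mismatch.
        return False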

def process_file(filename):
    data = []
    reader = csv.DictReader(filename)
    # Skip the extra metadata.
    for i in range(3):
        next(reader)
    # Process the file.
    for line in reader:
        # Call check_loc() function to check the location.
        result = check_loc(line['point'], line['wgs84_pos#lat'], line['wgs84_pos#long'])
        if not result:
            print('{}: {} != {} {}'.format(line['name'], line['point'], line['wgs84_pos#lat'], line['wgs84_pos#long']))
        data.append(line)

    return data

# process_file(CITIES)
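
Exact equality between floats can report a mismatch when the two representations differ only by rounding. As a hedged alternative, the sketch below compares with a tolerance using math.isclose from the standard library. The name `check_loc_close` and the default `abs_tol` of 1e-6 are assumptions for illustration, not part of the original exercise.

```python
import math


def check_loc_close(point, lat, longi, abs_tol=1e-6):
    """Hypothetical variant of check_loc() that tolerates tiny rounding differences."""
    try:
        p_lat, p_long = (float(x) for x in point.split(' '))
        return (math.isclose(p_lat, float(lat), abs_tol=abs_tol)
                and math.isclose(p_long, float(longi), abs_tol=abs_tol))
    except ValueError:
        # Unparseable input (e.g. 'NULL' or a '{...}' list) is treated as a mismatch.
        return False


# Made-up coordinates for illustration.
print(check_loc_close('33.08 75.28', '33.08', '75.28'))       # True
print(check_loc_close('33.08 75.28', '33.0800001', '75.28'))  # True
print(check_loc_close('NULL', 'NULL', 'NULL'))                # False
```

Whether a tolerance is appropriate depends on how the audit results are used; for a strict equivalence check between the two fields, the exact comparison may still be the better choice.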