## What is data cleaning?

#### A field might hold numbers or strings, or dates in European format vs. US format
#### Sources of dirty data include user entry errors, poor coding standards, different schemas used for the same type of item, legacy data systems (built when memory constraints applied), data lost when transforming between formats, programmer error, corruption in transmission, etc.

### 5 Measures of Data Quality
#### Validity: conforms to a schema
#### Accuracy: conforms to gold standard (do all street addresses exist? gold standard would be a subset of data that is 100% accurate)
#### Completeness: are all records present?
#### Consistency: matches other data
#### Uniformity: same units (ex. miles vs kilometers)

### Blueprint for Cleaning
#### 1) Audit your data
#### 2) Create a data cleaning plan - identify causes, define operations, test
#### 3) Execute the plan & manually correct - manual correction will most likely be necessary

In [None]:
# looking to standardize data in street maps - ex. Avenue to Ave.
# taking 'addr:street' 

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
from collections import defaultdict
import re

osm_file = open("chicago.osm", "r")

# match a sequence of non-whitespace characters, optionally followed
# by a period, that occurs at the end of the string ($)
street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)
street_types = defaultdict(int)

def audit_street_type(street_types, street_name):
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()

        street_types[street_type] += 1

def print_sorted_dict(d):
    keys = d.keys()
    keys = sorted(keys, key=lambda s: s.lower())
    for k in keys:
        v = d[k]
        print("%s: %d" % (k, v))

def is_street_name(elem):
    return (elem.tag == "tag") and (elem.attrib['k'] == "addr:street")

def audit():
    for event, elem in ET.iterparse(osm_file):
        if is_street_name(elem):
            audit_street_type(street_types, elem.attrib['v'])    
    print_sorted_dict(street_types)    

if __name__ == '__main__':
    audit()
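
A possible next step after the audit, sketched under assumptions: once the audit shows which street types occur, a mapping can standardize them (ex. Avenue to Ave.). The `mapping` dictionary and the `update_name` helper below are hypothetical and not part of the original exercise; a real mapping would be built from the audit output.

```python
import re

# Same pattern as the audit above: the last token of the street name
street_type_re = re.compile(r'\S+\.?$', re.IGNORECASE)

# Hypothetical mapping; a real one would be built from the audit output
mapping = {"Avenue": "Ave.", "Ave": "Ave.", "Street": "St.", "St": "St."}

def update_name(name, mapping):
    """Replace the trailing street type with its standard form, if known."""
    m = street_type_re.search(name)
    if m and m.group() in mapping:
        # swap only the matched suffix, leave the rest of the name untouched
        name = name[:m.start()] + mapping[m.group()]
    return name

print(update_name("North Lincoln Avenue", mapping))
print(update_name("West Lake St", mapping))
```

Names whose last token is not in the mapping are returned unchanged, so unexpected street types stay visible for manual correction.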

### Looking at Data Quality Metrics in Further Detail

#### Auditing Validity - here we are concerned with individual fields
#### Some fields are mandatory, some have foreign-key constraints (ex. product records - each must reference a manufacturer), cross-field constraints (ex. start date before end date), data type constraints (structure - dictionaries vs arrays, numbers), regular expressions (repeated structure like a phone number xxx-xxx-xxxx), and some fields have ranges we expect (think t-shirt size)
#### Auditing Validity is about figuring out the constraints of the data
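
The constraint types listed above can be sketched as simple checks. This is a minimal sketch, not a complete validator: the record layout, the field names (`manufacturer`, `phone`, `size`, `start`, `end`) and the allowed sizes are all assumptions made for illustration.

```python
import re
from datetime import date

# xxx-xxx-xxxx, the repeated structure of a phone number
PHONE_RE = re.compile(r'^\d{3}-\d{3}-\d{4}$')
# Hypothetical set of values we expect in the size field
SHIRT_SIZES = {"S", "M", "L", "XL"}

def audit_record(record):
    """Return a list of the constraints this record violates."""
    problems = []
    # mandatory field: must be present and non-empty
    if not record.get("manufacturer"):
        problems.append("missing manufacturer")
    # regular expression: value must follow the expected structure
    if not PHONE_RE.match(record.get("phone", "")):
        problems.append("bad phone format")
    # expected range: value must be one of the known sizes
    if record.get("size") not in SHIRT_SIZES:
        problems.append("unknown size")
    # cross-field constraint: start date must not come after end date
    if record["start"] > record["end"]:
        problems.append("start date after end date")
    return problems

good = {"manufacturer": "Acme", "phone": "312-555-0100", "size": "M",
        "start": date(2014, 1, 1), "end": date(2014, 6, 1)}
bad = {"manufacturer": "", "phone": "3125550100", "size": "XXS",
       "start": date(2014, 6, 1), "end": date(2014, 1, 1)}
print(audit_record(good))
print(audit_record(bad))
```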

In [None]:
# looking to transform human made data to JSON data
# each column has different data validity concerns

# auditing a cross-field constraint
# ex. population / area should equal population density
# checking this relationship gives us a way to test data validity

"""
Your task is to check the "productionStartYear" of the DBPedia autos datafile for valid values.
The following things should be done:
- check if the field "productionStartYear" contains a year
- check if the year is in range 1886-2014
- convert the value of the field to be just a year (not full datetime)
- the rest of the fields and values should stay the same
- if the value of the field is a valid year in the range as described above,
  write that line to the output_good file
- if the value of the field is not a valid year as described above, 
  write that line to the output_bad file
- discard rows (neither write to good nor bad) if the URI is not from dbpedia.org
- you should use the provided way of reading and writing data (DictReader and DictWriter)
  They will take care of dealing with the header.

You can write helper functions for checking the data and writing the files, but we will call only the 
'process_file' with 3 arguments (inputfile, output_good, output_bad).
"""
import csv
import pprint

INPUT_FILE = 'autos.csv'
OUTPUT_GOOD = 'autos-valid.csv'
OUTPUT_BAD = 'FIXME-autos.csv'

def process_file(input_file, output_good, output_bad):
    data_good = []
    data_bad = []
    with open(input_file, "r") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        for row in reader:
            # validate URI value
            if row['URI'].find("dbpedia.org") < 0:
                continue

            ps_year = row['productionStartYear'][:4]
            try: # use try/except to filter valid items
                ps_year = int(ps_year)
                row['productionStartYear'] = ps_year
                if (ps_year >= 1886) and (ps_year <= 2014):
                    data_good.append(row)
                else:
                    data_bad.append(row)
            except ValueError: # non-numeric values (including 'NULL') are not valid years
                data_bad.append(row)

        # Plan followed above:
        # skip rows whose row['URI'] does not contain the domain 'dbpedia.org'
        # convert row['productionStartYear'] to just a year (not full datetime)
        # test whether the year is in the range 1886-2014
        # if it is: write line to 'output_good' file
        # if it isn't: write line to 'output_bad' file
        # use DictReader (reads each CSV row into a dictionary) & DictWriter


    # This is just an example on how you can use csv.DictWriter
    # Remember that you have to output 2 files
    with open(output_good, "w") as good:
        writer = csv.DictWriter(good, delimiter=",", fieldnames= header)
        writer.writeheader()
        for row in data_good:
            writer.writerow(row)

    with open(output_bad, "w") as bad:
        writer = csv.DictWriter(bad, delimiter=",", fieldnames= header)
        writer.writeheader()
        for row in data_bad:
            writer.writerow(row)

def test():

    process_file(INPUT_FILE, OUTPUT_GOOD, OUTPUT_BAD)


if __name__ == "__main__":
    test()


### Auditing Accuracy -  difficult to do because it requires a legend ("gold standard")

#### Problems with Country column:
#### Some values are arrays, some have column shift issues; regex (regular expressions) to the rescue (pulls the country name out of the string)
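
A rough sketch of how a regex plus a gold standard could be used to audit a country field. It assumes array values use the same {a|b} form seen in the cities data later in these notes, and the `GOLD_COUNTRIES` set is a made-up stand-in for real reference data.

```python
import re

# Hypothetical gold standard: a small set of names known to be correct
GOLD_COUNTRIES = {"Germany", "France", "United States"}

# Pulls candidate names out of a plain value or a {a|b} array value
NAME_RE = re.compile(r'[^{}|]+')

def audit_country(value):
    """Return the names found in value that are not in the gold standard."""
    names = [n.strip() for n in NAME_RE.findall(value)]
    return [n for n in names if n and n not in GOLD_COUNTRIES]

print(audit_country("France"))
print(audit_country("{Germany|Deutschland}"))
print(audit_country("12.5"))  # a number here suggests a column shift
```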

### Auditing Completeness

#### You don't know what you don't know - missing records 
#### Similar solution to accuracy (need 'gold standard' or reference data), except you are trying to find out if an entire record is missing
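
One way to look for missing records, assuming reference data is available: compare the identifiers in the data against the identifiers in the reference. The `missing_records` helper and the `id` key are hypothetical names chosen for this sketch.

```python
def missing_records(records, reference_ids, key="id"):
    """Return the reference ids that have no matching record, sorted."""
    seen = set(r[key] for r in records)
    return sorted(set(reference_ids) - seen)

# made-up data: the reference says ids 1-4 should exist
records = [{"id": 1}, {"id": 3}]
print(missing_records(records, [1, 2, 3, 4]))
```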

### Auditing Consistency

#### When two different entries contradict one another
#### This issue usually comes down to 'which data entry do I trust the most?' and using that data
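
A small sketch of that idea, assuming each source is a dictionary keyed by the same identifier. The helper names and the sample values are made up; deciding which source to trust is still a judgment call the code cannot make for you.

```python
def find_conflicts(trusted, other):
    """Return {key: (trusted_value, other_value)} where the sources disagree."""
    conflicts = {}
    for key, value in trusted.items():
        if key in other and other[key] != value:
            conflicts[key] = (value, other[key])
    return conflicts

def merge_preferring(trusted, other):
    """Combine both sources; when they contradict, keep the trusted value."""
    merged = dict(other)
    merged.update(trusted)
    return merged

# made-up values from two hypothetical sources
source_a = {"Springfield": 1000, "Shelbyville": 500}
source_b = {"Springfield": 1200, "Shelbyville": 500, "Ogdenville": 300}
print(find_conflicts(source_a, source_b))
print(merge_preferring(source_a, source_b))
```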

### Auditing Uniformity

#### This checks that all values in a field use the same unit of measure
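
A minimal sketch of enforcing uniformity, assuming each value comes with a unit label. The `to_km` helper and the unit codes `"km"` and `"mi"` are assumptions for illustration.

```python
KM_PER_MILE = 1.609344  # one international mile in kilometers

def to_km(value, unit):
    """Convert a distance to kilometers so the whole field uses one unit."""
    if unit == "km":
        return float(value)
    if unit == "mi":
        return float(value) * KM_PER_MILE
    # an unknown unit is an audit finding, not something to guess at
    raise ValueError("unknown unit: %r" % unit)

print(to_km(10, "km"))
print(to_km(1, "mi"))
```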

### A little more about correcting data

#### Other potential data cleansing steps: removing typographical errors, cross checking data with a reference sheet, data enhancement (merging data sets), data harmonization (street vs road), changing reference data (new standard for some set of codes) - the issues you will face are very situation specific

# Quiz for Lesson 3

#### Quiz 1 (below) would be easier using np.dtype to categorize the data

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
In this problem set you work with cities infobox data, audit it, come up with a
cleaning idea and then clean it up. In the first exercise we want you to audit
the datatypes that can be found in some particular fields in the dataset.
The possible types of values can be:
- NoneType if the value is a string "NULL" or an empty string ""
- list, if the value starts with "{"
- int, if the value can be cast to int
- float, if the value can be cast to float, but CANNOT be cast to int.
   For example, '3.23e+07' should be considered a float because it can be cast
   as float but int('3.23e+07') will throw a ValueError
- 'str', for all other values

The audit_file function should return a dictionary containing fieldnames and a 
SET of the types that can be found in the field. e.g.
{"field1": set([type(float()), type(int()), type(str())]),
 "field2": set([type(str())]),
  ....
}
The type() function returns a type object describing the argument given to the 
function. You can also use examples of objects to create type objects, e.g.
type(1.1) for a float: see the test function below for examples.

Note that the first three rows (after the header row) in the cities.csv file
are not actual data points. The contents of these rows should not be included
when processing data types. Be sure to include functionality in your code to
skip over or detect these rows.
"""
import codecs
import csv
import json
import pprint

CITIES = 'cities.csv'

FIELDS = ["name", "timeZone_label", "utcOffset", "homepage", "governmentType_label",
          "isPartOf_label", "areaCode", "populationTotal", "elevation",
          "maximumElevation", "minimumElevation", "populationDensity",
          "wgs84_pos#lat", "wgs84_pos#long", "areaLand", "areaMetro", "areaUrban"]

def audit_file(filename, fields):
    fieldtypes = {}

    for k in fields:
        fieldtypes[k]=set([])
    
    with open(filename, 'r') as file:
        r = csv.DictReader(file)
        for row in r:
            if row["URI"].find("dbpedia")>0:
                for i in fields:
                    # Null is spelled NULL
                    if row[i] == 'NULL':
                        fieldtypes[i].add(type(None))
                    elif row[i] == "":
                        fieldtypes[i].add(type(None))
                    # per the instructions, a value is a list if it starts with '{';
                    # this branch must come before the int/float/str checks so a
                    # list value is not also recorded as a string
                    elif row[i].startswith('{'):
                        fieldtypes[i].add(type([]))
                    else:
                        try:
                            int(row[i])
                            fieldtypes[i].add(type(1))
                            # print "int"
                        except ValueError:
                            try:
                                float(row[i])
                                fieldtypes[i].add(type(1.1))
                                # print 'float'
                            except ValueError:
                                fieldtypes[i].add(type('abc'))
        return fieldtypes

def test():
    fieldtypes = audit_file(CITIES, FIELDS)

    pprint.pprint(fieldtypes)

    assert fieldtypes["areaLand"] == set([type(1.1), type([]), type(None)])
    assert fieldtypes['areaMetro'] == set([type(1.1), type(None)])
    
if __name__ == "__main__":
    test()


### More examples of try & exceptions

https://docs.python.org/3/library/exceptions.html

In [None]:
# try/except lets you handle an error raised while the try block runs
# each except clause catches one type of error; examples below

try:
    # hypothetical example: add up the integers in a file, one per line
    with open('numbers.txt', 'r') as f:
        total = sum(int(line) for line in f)

except IOError:
    print('An error occurred trying to read the file.')

except ValueError:
    print('Non-numeric data found in the file.')

except ImportError:
    print('No module found.')

except EOFError:
    print('Why did you do an EOF on me?')

except KeyboardInterrupt:
    print('You cancelled the operation.')

except:
    print('An error occurred.')

### Quiz 2

#### Keep the value with the most significant digits
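
One way to compare candidates by significant digits, as a sketch only. It counts the digits of the mantissa and ignores the exponent; treating trailing zeros as significant is a simplification, and both helper names are hypothetical.

```python
def significant_digits(text):
    """Count the digits in the mantissa of a numeric string."""
    mantissa = text.lower().split('e')[0]  # drop the exponent part
    digits = mantissa.replace('-', '').replace('+', '').replace('.', '')
    return len(digits.lstrip('0'))         # leading zeros are not significant

def pick_most_precise(candidates):
    """Return the candidate string with the most significant digits."""
    return max(candidates, key=significant_digits)

# made-up candidate values for the same field
print(pick_most_precise(["5.52e+07", "5.51667e+07"]))
```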

### Quiz 3

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
In this problem set you work with cities infobox data, audit it, come up with a
cleaning idea and then clean it up.

Since in the previous quiz you made a decision on which value to keep for the
"areaLand" field, you now know what has to be done.

Finish the function fix_area(). It will receive a string as an input, and it
has to return a float representing the value of the area or None.
You have to change the function fix_area. You can use extra functions if you
like, but changes to process_file will not be taken into account.
The rest of the code is just an example on how this function can be used.
"""
import codecs
import csv
import json
import pprint

CITIES = 'cities.csv'


def fix_area(area):
    
    # YOUR CODE BELOW
    
    if (area == 'NULL') or (area == ''):
        area = None
    
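    elif area.startswith('{'):
        # value looks like {a|b}: strip the braces and split into two candidates
        area1 = area.split('|')[0][1:]
        area2 = area.split('|')[1][:-1]
        # keep the candidate with more significant digits,
        # using string length as a proxy
        if len(area1) >= len(area2):
            area = float(area1)
        else:
            area = float(area2)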
    
    else:
        area = float(area)

    return area



def process_file(filename):
    # CHANGES TO THIS FUNCTION WILL BE IGNORED WHEN YOU SUBMIT THE EXERCISE
    data = []

    with open(filename, "r") as f:
        reader = csv.DictReader(f)

        #skipping the extra metadata
        for i in range(3):
            l = next(reader)

        # processing file
        for line in reader:
            # calling your function to fix the area value
            if "areaLand" in line:
                line["areaLand"] = fix_area(line["areaLand"])
            data.append(line)

    return data


def test():
    data = process_file(CITIES)

    print("Printing three example results:")
    for n in range(5,8):
        pprint.pprint(data[n]["areaLand"])

    assert data[3]["areaLand"] is None
    assert data[8]["areaLand"] == 55166700.0
    assert data[20]["areaLand"] == 14581600.0
    assert data[33]["areaLand"] == 20564500.0    


if __name__ == "__main__":
    test()

### Quiz 4

In [3]:
# populationTotal & areaMetro are columns that could be cleaned with a method similar to the code above

### Quiz 5

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
In this problem set you work with cities infobox data, audit it, come up with a
cleaning idea and then clean it up.

In the previous quiz you recognized that the "name" value can be an array (or
list in Python terms). It would make it easier to process and query the data
later if all values for the name are in a Python list, instead of being
just a string separated with special characters, like now.

Finish the function fix_name(). It will receive a string as an input, and it
will return a list of all the names. If there is only one name, the list will
have only one item in it; if the name is "NULL", the list should be empty.
The rest of the code is just an example on how this function can be used.
"""
import codecs
import csv
import pprint

CITIES = 'cities.csv'


def fix_name(name):

    # YOUR CODE HERE
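    # check for missing values first so that None never reaches startswith()
    if (name == 'NULL') or (name is None) or (name == ''):
        name = []
    elif name.startswith('{'):
        # value looks like {a|b}: strip the braces and split on the separator
        name = name.replace("{","").replace("}","").split('|')
    else:
        name = [name]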
    return name


def process_file(filename):
    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        #skipping the extra metadata
        for i in range(3):
            l = next(reader)
        # processing file
        for line in reader:
            # calling your function to fix the name value
            if "name" in line:
                line["name"] = fix_name(line["name"])
            data.append(line)
    return data


def test():
    data = process_file(CITIES)

    print("Printing 20 results:")
    for n in range(20):
        pprint.pprint(data[n]["name"])

    assert data[14]["name"] == ['Negtemiut', 'Nightmute']
    assert data[9]["name"] == ['Pell City Alabama']
    assert data[3]["name"] == ['Kumhari']

if __name__ == "__main__":
    test()

### Quiz 6

In [4]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
In this problem set you work with cities infobox data, audit it, come up with a
cleaning idea and then clean it up.

If you look at the full city data, you will notice that there are couple of
values that seem to provide the same information in different formats: "point"
seems to be the combination of "wgs84_pos#lat" and "wgs84_pos#long". However,
we do not know if that is the case and should check if they are equivalent.

Finish the function check_loc(). It will receive 3 strings: first, the combined
value of "point" followed by the separate "wgs84_pos#" values. You have to
extract the lat and long values from the "point" argument and compare them to
the "wgs84_pos#" values, returning True or False.

Note that you do not have to fix the values, only determine if they are
consistent. To fix them in this case you would need more information. Feel free
to discuss possible strategies for fixing this on the discussion forum.

The rest of the code is just an example on how this function can be used.
Changes to "process_file" function will not be taken into account for grading.
"""
import csv
import pprint

CITIES = 'cities.csv'


def check_loc(point, lat, longi):
    # YOUR CODE HERE
    # "point" holds the lat and long values separated by a space
    data = point.split(" ")
    return data[0] == lat and data[1] == longi


def process_file(filename):
    data = []
    with open(filename, "r") as f:
        reader = csv.DictReader(f)
        #skipping the extra metadata
        for i in range(3):
            l = next(reader)
        # processing file
        for line in reader:
            # calling your function to check the location
            result = check_loc(line["point"], line["wgs84_pos#lat"], line["wgs84_pos#long"])
            if not result:
                print("{}: {} != {} {}".format(line["name"], line["point"], line["wgs84_pos#lat"], line["wgs84_pos#long"]))
            data.append(line)

    return data


def test():
    assert check_loc("33.08 75.28", "33.08", "75.28") == True
    assert check_loc("44.57833333333333 -91.21833333333333", "44.5783", "-91.2183") == False

if __name__ == "__main__":
    test()

## Next we will be studying SQL
### To understand the differences between SQL & NoSQL - check out links below

https://www.sitepoint.com/sql-vs-nosql-differences/

https://www.quora.com/Should-a-newbie-learn-SQL-or-NoSQL

https://www.quora.com/Should-I-learn-SQL-or-NoSQL-MySQL-or-MongoDB-And-why