# ETL


## Overview

- What is it?
- Types of ETL
  -  Batch
  -  Realtime


## What is it?

The bright future of decision making has been, and still is making decisions using data; not solely trusting human intuition.
Analysts, scientists and statisticians have a problem, though. They want to understand data: but data is almost always inconsistent,
corrupted, missing, or just plain invalid.

That's because people are involved in data collection most of the time.

However, as we hear over and over again: 

- "You can have data without information, but you cannot have information without data." - Daniel Keys Moran

The job of an Extraction Transformation and Loading (ETL) system is to try and homogenize those data into a consistent
format so the data can be compared.

It's much like a body's digestive system. It digests information into its constituent parts, orders what it can for use and 
discards the rest. As data engineers, you're the plumbers for your organization's GI tracts.

### You're already practiced

Already you know something about ETL. Even in your first classes you were loading data into the database using the `COPY FROM CSV` command.

You were doing ETL there! Admittedly it was a very simple workflow -- most of the work was being doing in the database, but ETL is a continuum.


### Extraction

This is where we take information in one format and pull out the bits that are useful to our purpose.

e.g. Pulling certain attributes out of a JSON object result from an API call.

### Transformation

Taking those extracted data, and putting them into whatever format we desire, correcting incorrect values where possible, possibly annotating related
information into the same destination format.

e.g. Putting the selected JSON attributes into a Protobuffer, adding identifier annotations to data in other systems.


#### Loading

Putting your data into a database for later analysis.

e.g. psql -c \COPY your_table FROM 'your_file.csv' CSV


## Types of ETL

### Batch

This is in many ways the simplest way to construct a system, and how many of the highest performance ETL systems organize their work.

One downside is that up-to-date information is only available after each batch is run.

### Realtime

This system means that you continuously update your database(s) as new information comes into your system. It's a good choice
when the requirement is that your system's information must be close to real-time.

One downside is that this is a more difficult system to scale as your data size and frequency increase.


In [27]:
### Imports

import collections
import random

import numpy as np
import pandas

from functools import wraps

In [3]:
### Data Vars

columns_headers = []
num_rows = 10


In [4]:
### Decorators

def destroy_percent(percent, value):
    """Will corrupt, destoy or mangle a percentage of whatever data your wrapped function returns."""
    def decorator(func):
        @wraps(func)
        def _wrapped(*args, **kwargs):
            ret_val = func(*args, **kwargs)
            if isinstance(ret_val, collections.Iterable):
                changed_values = {}
                for idx, item in enumerate(ret_val):
                    if random.randint(0, 100) < percent:
                        changed_values[idx] = item
                        
                for change_idx, item in changed_values.items():
                    if callable (value):
                        ret_val[change_idx] = value(item)
                    else:
                        ret_val[change_idx] = value
                    
                return ret_val
                        
            # if we're a regular scalar, just replace our return value a random percent of the time.
            if (random.randint(0, 100) < percent):
                if callable(value):
                    return value(ret_val)
                return value
            else:
                return ret_val
            
        return _wrapped 

    return decorator

In [19]:
### Finite Data
states = ['OR', 'WA', 'CA', 'ID']
state_initial_pops = {state : random.randint(10, 400) for state in states}
BAD_CONTINUOUS_DATA_VALUES = [-1, None, 0, 45.3]

def bad_data(*args, **kwargs):
    return random.choice(BAD_CONTINUOUS_DATA_VALUES)

def capitalize(input):
    """returns list of single letter that is captialized."""
    return [input.capitalize()]

def insert_space(input):
    """returns list of a letter and a space character"""
    return [input, ' ']

string_transforms = [capitalize, insert_space]

def randomize_string(input, percent=5):
    output = []
    letters = input.split()
    for letter in input:
        out_letter = [letter]
        if random.randint(0, 100) < percent:
            out_letter = random.choice(string_transforms)(letter)
        output.extend(out_letter)
        
    return ''.join(output)
    

In [20]:
### Continuous Data

def get_population(mean, sigma, num_years):
    return np.random.normal(mean, sigma, num_years)

@destroy_percent(30, None)
def get_pop_30_nan(current, sigma, num_years):
    return get_population(current, sigma, num_years)

@destroy_percent(50, bad_data)
def get_pop_50_bad(current, sigma, num_years):
    return get_population(current, sigma, num_years)

def get_average_annual_income(current, sigma, num_years):
    return np.random.normal(current, sigma, num_years)

@destroy_percent(2, bad_data)
def get_monthly_income(current, sigma, num_years):
    return get_average_annual_income(current, sigma, num_years * 12)
    

In [21]:
num_years = 4
simple_data = [
    {
        'state': state,
        'population': get_pop_50_bad(
            state_initial_pops[state], random.randint(0, 40), num_years
        ),
        'income': get_average_annual_income(40, 7, num_years)
    }
    for state in states
]

In [22]:
simple_data

[{'income': array([ 44.15375771,  42.46976976,  42.59330565,  34.93537914]),
  'population': array([  45.3       ,  240.02832273,  250.8050366 ,           nan]),
  'state': 'OR'},
 {'income': array([ 34.41578499,  40.62633106,  44.45342219,  43.39362928]),
  'population': array([ 343.9634566,   45.3      ,          nan,  324.5777359]),
  'state': 'WA'},
 {'income': array([ 24.71569336,  42.31702963,  32.62325157,  47.5580181 ]),
  'population': array([          nan,  328.71091114,  324.69037304,    0.        ]),
  'state': 'CA'},
 {'income': array([ 42.96313886,  41.14147102,  52.21482124,  48.00284636]),
  'population': array([   0.        ,  374.65407565,   -1.        ,  371.48091012]),
  'state': 'ID'}]

In [35]:
# Interpolation of missing data
# Sometimes this is pretty straight forward, esp. for missing data

pandas.Series(simple_data[0]['population'])

0     45.300000
1    240.028323
2    250.805037
3           NaN
dtype: float64

In [34]:
pandas.Series(simple_data[0]['population']).interpolate()

0     45.300000
1    240.028323
2    250.805037
3    250.805037
dtype: float64

In [None]:
### Exercise
## Build an ETL pipeline for our Simple Data

def extract(uncleaned_data):
    pass

def transform(untransformed_data):
    pass

def load(unloaded_data):
    """Here let's just return a format that can be converted into a CSV with headers easily,
    a list of dictionaries would do nicely.
    """
    pass

In [23]:
advanced_data = [
    {
        randomize_string('state', 50): randomize_string(state),
        randomize_string('population', 25): get_pop_50_bad(
            state_initial_pops[state], random.randint(0, 40), num_years
        ),
        randomize_string('income', 40): get_average_annual_income(40, 7, num_years)
    }
    for state in states
]

In [24]:
advanced_data

[{'iNcome': array([ 27.48434952,  32.97035589,  26.82094304,  51.45420639]),
  'popu latio n': array([  -1.       ,   45.3      ,  218.2871671,          nan]),
  'st ate': 'OR'},
 {'income': array([ 38.47283945,  42.45225343,  30.91099613,  34.70738569]),
  'populAtion ': array([ 368.90184878,    0.        ,   -1.        ,  365.1133105 ]),
  's taTE': 'WA'},
 {'iNc oME': array([ 36.85357938,  37.00056181,  43.07557702,  36.09485356]),
  'po pulation': array([          nan,   -1.        ,  348.33769686,           nan]),
  'sta te': 'CA'},
 {'in Co Me ': array([ 36.04575008,  33.19909722,  32.01395239,  38.83888557]),
  'poPu la tion': array([   0.        ,           nan,  340.06933955,  353.046158  ]),
  's TatE': 'ID'}]

In [65]:
# Normalizing Strings
from difflib import SequenceMatcher

all_column_names = [sorted(item.keys()) for item in advanced_data]
first_column_names = [item[0] for item in all_column_names]

def get_column_similarities(list_of_columns):
    ratios = []
    for name in list_of_columns:
        # Going through our list of column_names and comparing it with the next one in the list.
        if list_of_columns.index(name) + 1 < len(list_of_columns):
            ratios.append((name, SequenceMatcher(
                        None, name, list_of_columns[list_of_columns.index(name) + 1]
                    ).ratio()))
        else:
            ratios.append((name, 0.0))
    return ratios

my_ratios = get_column_similarities(first_column_names)
un_sorted = get_column_similarities([item.keys()[0] for item in advanced_data])

In [66]:
my_ratios

[('iNcome', 0.8333333333333334),
 ('income', 0.46153846153846156),
 ('iNc oME', 0.5),
 ('in Co Me ', 0.0)]

In [67]:
un_sorted

[('st ate', 0.5),
 ('s taTE', 0.23529411764705882),
 ('po pulation', 0.782608695652174),
 ('poPu la tion', 0.0)]

In [None]:
###Exercise
## Build and ETL pipleine for our Advanced Data

def extract(uncleaned_data):
    pass

def transform(untransformed_data):
    pass

def load(unloaded_data):
    """Here let's just return a format that can be converted into a CSV with headers easily,
    a list of dictionaries would do nicely.
    """
    pass
