# Part 2: Dependency Inversion/Abstraction to manage complexity and enable POCs to scale

This is the most critical takeaway from today's workshop, so I will spend the bulk of our 2 hours on it. If this causes us to miss the later material, feel free to go through it on your own after the workshop and reach out via Slack if you have any questions.

To start, let's talk about dependencies. If you are new to the software world you might be asking yourself: what is a dependency? Simply put, it is a relationship between two parts of your pipeline/program in which one part relies on the other. Your pipeline will accumulate dependencies as you work. These could be third-party libraries outside your control like Pandas, certain files or databases, or other parts of your own code. Your code could also become dependent on a certain visualization tool (Spotfire, Excel, etc.), and there are many more potential examples.

If you were to isolate one part of your pipeline (let's say it is some code for this example), you can imagine looking backwards in your pipeline: everything upstream is a dependency of this bit of code (i.e. this code depends on *x*). You could also imagine looking forward in your pipeline: these are the things that *depend* on this particular bit of code. Let's now introduce the idea of code **stability**. The more things that depend on a particular bit of code, the more stable that code must be. Why? Imagine a bit of code with only one thing depending on it. If we make a change to this code, we only have to consider the impact of that change on that one downstream dependent. This makes changes to the code cheap, so this code has low stability/high agility. Now imagine 30 things depended on this piece of code instead. This code would have to be very stable, because any change is enormously costly! You have to account for the impact of the change on every one of those dependents.

This relationship between agility and stability is at the core of why data projects that started as small POCs grow unmanageable with scale. Luckily, there is a strategy for managing this dynamic: dependency inversion (also called abstraction). Let's see a simple example of this in practice:

In [1]:
# Let's say we have geologic pick data for a POC data science project on a small field
import pandas as pd

# Example Pick Data
poc_picks_df = pd.DataFrame({
    'pick_name': ['A1', 'B2', 'C2', 'A1', 'B2', 'D1'],
    'pick_depth': [100.0, 150.0, 400.0, 300.0, 400.0, 800.0],
    'pick_interpreter': ['AJ', 'AJ', 'AJ', 'AJ', 'AJ', 'AJ'],
    'well': ['00000000010000', '00000000010000', '00000000010000', '00000000020000', '00000000020000', '00000000020000']
})

# Our project pipeline (imagine this is 2000 lines long and very awesome/complicated)
def project_pipeline(picks_df):
    # Imagine a long, complicated pipeline for the rest of the function!!
    for idx, row in picks_df.iterrows():
        # We are just printing the pick name to keep the actual example really simple
        print(row['pick_name'])
    return 'awesome pipeline outputs'

# Alright let's run our 'POC' pipeline and see what happens!
pipeline_output = project_pipeline(poc_picks_df)

A1
B2
C2
A1
B2
D1


Alright, our POC is working great: the data runs through our 'pipeline' and returns outputs. The higher-ups are excited by the results and the inevitable request comes: 'How soon can you run this for the whole company?' Let's see what happens for our example...

In [2]:
# The boss suggests we try our pipeline on an up-and-coming asset called 'The pit of despair'. What a strange name!
# They send you the info to get their picks:

pit_of_despair_picks = pd.DataFrame({
    'PICK_SURF_NAME': ['a_1', None, 'AA2', 'bb1', 'AA2'],
    'MD': [100.0, 200.0, 300.0, 100.0, 600.0],
    'GUY WHO PICKED IT': ['BOB', 'SATAN', 'BOB', 'BOB', 'BOB'],
    'WellName': ['HADES1', 'HADIES2', 'HADES2', 'HADES3', 'HADES3'],
    'API': ['0000000010', '0000000011', '0000000011', '0000000012', '0000000012']
})

# Thinking that your big, well organized company had excellent controls over the data, you deploy your massive, 
# complex pipeline on this grand new dataset.
pipeline_output = project_pipeline(pit_of_despair_picks)

KeyError: 'pick_name'

Oh my! Your pipeline has failed because the schema of the original POC does not match the new data! In reality our pipeline is tiny, and a small tweak to the data could fix this, but what if the pipeline were truly large and thousands of lines long? What if we were in a company with hundreds of assets, each with its own quirks and differences in how picks are managed? How can we prevent this **dependency** on the POC's database schema from bogging down our scale-up to the whole company? The solution? Let's invert the dependency using an abstraction!

In [4]:
# Let's create an abstract class called a validator
class Validator(object):
    # This class will define two methods that all validators should implement:
    def check_if_valid(self, df)->bool:
        # This method will be used to quickly check if the validator needs to run
        return True
    def validate(self, df)->pd.DataFrame:
        # This method will attempt to automatically validate an input dataframe and return the result
        # Because this is an abstract class, we want to raise an error if it is called directly!
        raise NotImplementedError('Do not invoke abstract method directly')
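
As an aside, Python's standard library has an `abc` module that can enforce the same idea at construction time rather than at call time. The sketch below is only an illustration of that alternative, not the version used in the rest of this notebook; the names `AbstractValidator` and `PassThroughValidator` are hypothetical.

```python
from abc import ABC, abstractmethod


class AbstractValidator(ABC):
    """Hypothetical ABC-based variant of the Validator class above."""

    def check_if_valid(self, df) -> bool:
        # Default: assume the dataframe is fine unless a subclass says otherwise
        return True

    @abstractmethod
    def validate(self, df):
        """Return a corrected dataframe; every subclass must implement this."""


class PassThroughValidator(AbstractValidator):
    # Trivial implementation, only here to show that a complete subclass works
    def validate(self, df):
        return df


try:
    AbstractValidator()  # the abstract class itself cannot be instantiated
except TypeError as err:
    print(f'Cannot instantiate: {err}')

validator = PassThroughValidator()  # a subclass that implements validate() is fine
```

With this variant a subclass that forgets to implement `validate` fails as soon as it is created, instead of failing later when the pipeline calls the method.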
        

What did we do above? We created an abstract class or simply an abstraction. What **is** an abstraction? Think of it as communicating an idea instead of instructions for how to do something specifically. In this case, we are conveying the **idea** that all validators should implement a function to check the validity of data and a function to attempt to perform an automatic validation on the data. Let's make a specific **implementation** of this **idea** to fix our column naming problem above.

In [5]:
# We are creating a class that is a child of our abstract class
class ColumnNameValidation(Validator):
    column_names = {
        'PICK_SURF_NAME': 'pick_name',
        'MD': 'pick_depth',
        'GUY WHO PICKED IT': 'pick_interpreter',
        'API': 'well'
    }
    # First we implement the first function (in an overly simple way for this example)
    def check_if_valid(self, df)->bool:
        # If pick_name is not in the dataframe, we know our pipeline will fail, so let's focus on that in this example
        if 'pick_name' not in df:
            # Tell the pipeline we need to validate the dataframe
            return False
        # If we get here, assume the df is ok
        return True
    
    # Next up, actually validating the dataframe if we can
    def validate(self, df)->pd.DataFrame:
        df = df.rename(columns=self.column_names)
        return df



In [6]:
# Let's update the project pipeline
# Our project pipeline (imagine this is 2000 lines long and very awesome/complicated)
def project_pipeline(picks_df):
    #*** NEW *** Let's put our validation process here
    validators = [ColumnNameValidation()]
    for validator in validators:
        if not validator.check_if_valid(picks_df):
            picks_df = validator.validate(picks_df)
        
    # Imagine a long, complicated pipeline for the rest of the function!!
    for idx, row in picks_df.iterrows():
        # We are just printing the pick name to keep the actual example really simple
        print(row['pick_name'])
    return 'awesome pipeline outputs'

# Now re-run the 'pit of despair'
pipeline_output = project_pipeline(pit_of_despair_picks)

# And run the old pipeline to check it still works
pipeline_output = project_pipeline(poc_picks_df)

a_1
None
AA2
bb1
AA2
A1
B2
C2
A1
B2
D1


Ok, so it works on both assets now! If you are thinking, hey, that seems really complicated for just changing some column names, then you would be right! However, think about this in terms of a larger, more complex pipeline. There are two points where you are using abstractions. The first is when you **created an abstraction of what all the picks should look like after validation for the entire pipeline**. No matter how the picks looked originally, in **your** pipeline they all look exactly the same post validation. This is actually pretty huge because your pipeline **no longer has a direct dependency on any of the database/file schemas.** Your pipeline now depends on an *idea*, an abstraction of what picks should look like.

The second abstraction was the validator itself. We will talk more about this in part 2, but notice that your pipeline has essentially an **unlimited capacity to add new implementations of validators**. We created one to handle column names, but we could create type, well name, or other kinds of validations and add them to our pipeline at any time, and our pipeline won't care one bit. Why? Because the pipeline only **depends on the Validator abstraction and not on any specific implementation of a validator**. As long as the validator has a check_if_valid method and a validate method, our pipeline can run it **without knowing a single thing about how it is implemented**.
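
To make that concrete, here is a minimal sketch of a second validator slotting into the same loop. The class `MissingPickNameValidation`, the helper `run_validators`, and the policy of dropping rows with no pick name are all assumptions made for illustration; a real project might prefer to flag or repair those rows instead.

```python
import pandas as pd


class Validator(object):
    # Minimal copy of the abstraction defined earlier so this sketch runs on its own
    def check_if_valid(self, df) -> bool:
        return True

    def validate(self, df) -> pd.DataFrame:
        raise NotImplementedError('Do not invoke abstract method directly')


class MissingPickNameValidation(Validator):
    # Hypothetical validator: handles rows that have no pick name
    def check_if_valid(self, df) -> bool:
        # Valid only if no pick name is missing
        return not df['pick_name'].isna().any()

    def validate(self, df) -> pd.DataFrame:
        # Assumed policy for this example: drop the rows with a missing pick name
        return df.dropna(subset=['pick_name']).reset_index(drop=True)


def run_validators(picks_df, validators):
    # Same loop as in project_pipeline: it only uses the two abstract methods
    for validator in validators:
        if not validator.check_if_valid(picks_df):
            picks_df = validator.validate(picks_df)
    return picks_df


picks = pd.DataFrame({
    'pick_name': ['a_1', None, 'AA2'],
    'pick_depth': [100.0, 200.0, 300.0],
})
cleaned = run_validators(picks, [MissingPickNameValidation()])
print(cleaned['pick_name'].tolist())
```

Note that the loop did not change at all: adding the new behaviour only meant adding one more object to the list of validators.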

### The power of a dependency inversion

Imagine a pipeline that goes from start to finish with no effort at abstraction/inverting dependencies. A failure or change anywhere in the chain will ripple through the pipeline and cause failures, technical debt, and above all forced **stability**, or resistance to change. Code with too many things depending on it can't change easily, and when it is **forced** to change, it causes a lot of technical debt and suffering on your part. Abstraction can free you from this dynamic because it allows you to *isolate* changes to one part of your code base. In our pick example above, changes to the database schema are isolated to the region below the abstraction. A new asset or schema **will never require any changes to your pipeline post the column validation/abstraction**. One place for the changes to happen is WAY better than tens or hundreds of places to handle those types of changes. Let's look at another example with 3rd party libraries:


In [7]:
# Let's create an abstraction for machine learning models that we can use to wrap sklearn
class MachineLearningModel(object):
    # For our simple example let's define two methods for our abstraction:
    def fit(self, x, y):
        pass
    def predict(self, x):
        pass


Ok, so we have an abstraction. Now how should we use it? Let's try 'wrapping' a linear model from scikit-learn:



In [8]:
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_moons

class SkLearnWrappedModel(MachineLearningModel):
    def __init__(self):
        self.model = LinearRegression()
        
    def fit(self, x, y):
        self.model.fit(x, y)
        
    def predict(self, x):
        return self.model.predict(x)

## Let's create an example dataset to try:
dataset, y = make_moons(100)

model = SkLearnWrappedModel()
# Fit the dataset
model.fit(dataset, y)
# Predict the dataset
predictions = model.predict(dataset)
predictions
# Note this is to illustrate abstraction not good data science practices 
# (note the lack of test/train split, cross validation, etc)

array([ 1.05007685,  0.33744006,  0.0069527 ,  0.05596978,  0.63257931,
        0.49220713,  0.03034089,  0.32403152,  1.03415888,  1.08163055,
        0.24723058,  0.71496601,  0.32731097,  0.05376386, -0.08754283,
        0.42260194,  0.91864455,  0.62007031,  0.99488139,  0.21374181,
        0.36742069, -0.09005192,  0.78922337,  0.17582194,  0.82134264,
        1.07335718, -0.05007685,  0.21077663,  1.06342366,  1.08720309,
        0.94403022,  0.78625819,  0.45003852,  1.06241699,  0.59161401,
        0.97168328,  0.67268903,  1.04885494,  0.28822164,  0.70469236,
        0.37992969,  0.40838599,  0.11097812, -0.03272674,  0.08135545,
        0.85748986,  0.96965911,  0.25031333,  0.88902188,  0.46528146,
       -0.08219522, -0.07414447,  0.71177836,  1.08754283,  0.63593709,
       -0.04885494,  0.50779287, -0.01573515,  1.01409868,  0.17865736,
        0.08373407,  0.54996148,  0.02831672,  1.07414447,  0.59503633,
        0.88648034,  1.09016532,  0.94623614, -0.06342366,  0.74

Ok, that might seem like kind of a dumb thing to do, right? The developers of scikit-learn put so much effort into designing a nice API and you go and make your own to put around theirs? Why do that? Well, let's explore the benefits of building a 'wrapper' type of abstraction/dependency inversion:
- 1: It isolates changes in the third-party library to one location in your code.
- 2: It abstracts away the API of the library to one *you* control, which allows you to use other similar packages interchangeably in your pipeline!

This is actually pretty huge! Imagine you did NOT do this and the library releases a new version that breaks your pipeline. This could mean hunting down dozens or hundreds of references to this library in your pipeline to make your code compatible with the changes. If you do this though, you only have to go to one place to update your code to be compatible... The best part? It doesn't cost you much time at all if you do it early! 
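
To illustrate the second benefit in the list above, here is a small sketch of a different implementation of the same abstraction. `MeanBaselineModel` and `fit_and_score` are hypothetical names, and the baseline is written in plain Python purely to stand in for "some other library"; the point is that the scoring function only relies on `fit` and `predict`.

```python
class MachineLearningModel(object):
    # Minimal copy of the abstraction above so this sketch runs on its own
    def fit(self, x, y):
        pass

    def predict(self, x):
        pass


class MeanBaselineModel(MachineLearningModel):
    # Hypothetical stand-in for another library: always predicts the mean of y
    def __init__(self):
        self.mean_ = None

    def fit(self, x, y):
        self.mean_ = sum(y) / len(y)

    def predict(self, x):
        return [self.mean_ for _ in x]


def fit_and_score(model, x, y):
    # 'Pipeline' code: depends only on fit/predict, not on any specific library
    model.fit(x, y)
    predictions = model.predict(x)
    # Mean absolute error between predictions and targets
    return sum(abs(p - t) for p, t in zip(predictions, y)) / len(y)


x = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 1.0, 0.0]
print(fit_and_score(MeanBaselineModel(), x, y))
```

Any other class that implements the same two methods, including the `SkLearnWrappedModel` from the earlier cell, could be passed to `fit_and_score` without changing it.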

When it comes to abstraction there is not really a speed vs. quality trade-off. The real question is: are you going to set up your abstractions early, when it is cheap to do, or are you going to wait until it is a big, unpleasant job, or not do it at all and watch the complexity and technical debt kill the project?