# Part 3: Open/Closed Principle

This part of the workshop focuses on the open/closed principle: "Code should be open for *extension* but closed for *modification*." The principle matters most for projects that need to keep technical debt low and keep deployed portions of the pipeline working despite ongoing development. Applied well in data science, it lets you adapt quickly to new requirements without destabilizing what already works. Let's go back to the example from part 2 and look at the open/closed principle as applied to the 'validator' abstraction.

In [1]:
import pandas as pd


# Let's return to our simple example "pipeline" from part 2:
def project_pipeline(picks_df):
    # We are going to focus here for this part of the workshop:
    validators = [ColumnNameValidation()]
    for validator in validators:
        if not validator.check_if_valid(picks_df):
            picks_df = validator.validate(picks_df)
        
    # Imagine a long, complicated pipeline for the rest of the function!!
    for idx, row in picks_df.iterrows():
        # We are just printing the pick name to keep the actual example really simple
        print(row['pick_name'])
    return 'awesome pipeline outputs'


# Example Picks data 
poc_picks_df = pd.DataFrame({
    'pick_name': ['A1', 'B2', 'C2', 'A1', 'B2', 'D1'],
    'pick_depth': [100.0, 150.0, 400.0, 300.0, 400.0, 800.0],
    'pick_interpreter': ['AJ', 'AJ', 'AJ', 'AJ', 'AJ', 'AJ'],
    'well': ['00000000010000', '00000000010000', '00000000010000', '00000000020000', '00000000020000', '00000000020000']
})
pit_of_despair_picks = pd.DataFrame({
    'PICK_SURF_NAME': ['a_1', None, 'AA2', 'bb1', 'AA2'],
    'MD': [100.0, 200.0, 300.0, 100.0, 600.0],
    'GUY WHO PICKED IT': ['BOB', 'SATAN', 'BOB', 'BOB', 'BOB'],
    'WellName': ['HADES1', 'HADIES2', 'HADES2', 'HADES3', 'HADES3'],
    'API': ['0000000010', '0000000011', '0000000011', '0000000012', '0000000012']
})

Now let's add in the validator class and the implementation we did from last time:

In [2]:
# Let's create an abstract class called a validator
class Validator(object):
    # This class will define two methods that all validators should implement:
    def check_if_valid(self, df)->bool:
        # This method will be used to quickly check if the validator needs to run
        return True
    def validate(self, df)->pd.DataFrame:
        # This method will attempt to automatically validate an input dataframe and return the result
        # Because this is an abstract class, we want to raise an error if it is called directly!
        raise NotImplementedError('Do not invoke abstract method directly')
        
# We are creating a class that is a child of our abstract class
class ColumnNameValidation(Validator):
    column_names = {
        'PICK_SURF_NAME': 'pick_name',
        'MD': 'pick_depth',
        'GUY WHO PICKED IT': 'pick_interpreter',
        'API': 'well'
    }
    # First we implement the first function (in an overly simple way for this example)
    def check_if_valid(self, df)->bool:
        # If pick name is not in the dataframe, we know our pipeline will fail, so let's focus on that in this example
        if 'pick_name' not in df:
            # Tell the pipeline we need to validate the dataframe
            return False
        # If we get here, assume the df is ok
        return True
    
    # Next up, actually validating the dataframe if we can
    def validate(self, df)->pd.DataFrame:
        df = df.rename(columns=self.column_names)
        return df
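
As an aside, the "raise an error if the abstract method is called" idea above can also be enforced earlier, at instantiation time, with Python's standard-library `abc` module. The following is a minimal sketch, not the workshop's implementation: `StrictValidator` and `StrictColumnNameValidation` are hypothetical names, and the column mapping is shortened for brevity.

```python
from abc import ABC, abstractmethod

import pandas as pd


class StrictValidator(ABC):
    """Hypothetical variant of Validator built on the abc module."""

    def check_if_valid(self, df: pd.DataFrame) -> bool:
        # Same default as the workshop's Validator: assume the data is fine
        return True

    @abstractmethod
    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        """Return a corrected dataframe; every concrete validator must implement this."""


class StrictColumnNameValidation(StrictValidator):
    # Shortened mapping, for illustration only
    column_names = {'PICK_SURF_NAME': 'pick_name', 'MD': 'pick_depth'}

    def check_if_valid(self, df: pd.DataFrame) -> bool:
        return 'pick_name' in df

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.rename(columns=self.column_names)


# Instantiating the abstract class fails immediately, not when validate() is called
try:
    StrictValidator()
except TypeError as err:
    print('cannot instantiate:', err)

raw = pd.DataFrame({'PICK_SURF_NAME': ['a_1'], 'MD': [100.0]})
fixed = StrictColumnNameValidation().validate(raw)
print(list(fixed.columns))
```

The trade-off is small: a subclass that forgets to implement `validate` fails as soon as it is constructed, rather than partway through a pipeline run.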

Let's verify that our pipeline functions before we begin:

In [3]:
# 'pit of despair' asset
print('starting pit of despair')
pipeline_output = project_pipeline(pit_of_despair_picks)
print('starting POC')
# poc asset
pipeline_output = project_pipeline(poc_picks_df)

starting pit of despair
a_1
None
AA2
bb1
AA2
starting POC
A1
B2
C2
A1
B2
D1


### And now we begin!

Your management loves the pipeline, and the speed with which you can add new assets is astonishing, but there is a problem... One of your 'customers' needs you to process your well API numbers as integers instead of strings! What should we do? If we make this change directly on our data, our current pipeline could break! Let's apply the idea that our code should be closed to *modification* but open to *extension*. Step one is to extend our validators with a new implementation that makes this change for us:

In [4]:
from pandas.api.types import is_integer_dtype
# This is a simple (and a bit absurd!) implementation of validator to change the target column to an integer
class ApiToIntegerValidator(Validator):
    def check_if_valid(self, df)->bool:
        # Determine if the well column already holds integers
        # (this validator could be made more flexible here, but to keep things simple, I did not)
        return 'well' in df and is_integer_dtype(df['well'])

    def validate(self, df)->pd.DataFrame:
        # Convert the well column to integers. assign() returns a new dataframe,
        # so the dataframe passed in by the caller is left untouched
        return df.assign(well=df['well'].astype(int))
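
The code comment above notes that this validator could be made more flexible. One possible direction, sketched here under assumptions rather than taken from the workshop, is to pass the column name in as a parameter. `ColumnToIntegerValidator` is a hypothetical name; in the notebook it would subclass `Validator`, but it is written standalone here so the snippet runs on its own.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


class ColumnToIntegerValidator:
    """Hypothetical generalisation: convert any named column to integers."""

    def __init__(self, column: str):
        self.column = column

    def check_if_valid(self, df: pd.DataFrame) -> bool:
        # Valid only if the column exists and already holds integers
        return self.column in df and bool(is_integer_dtype(df[self.column]))

    def validate(self, df: pd.DataFrame) -> pd.DataFrame:
        # assign() returns a new dataframe, leaving the caller's data untouched
        return df.assign(**{self.column: df[self.column].astype(int)})


picks = pd.DataFrame({
    'pick_name': ['A1', 'B2'],
    'well': ['00000000010000', '00000000020000'],
})
validator = ColumnToIntegerValidator('well')

# Mirror the pipeline's loop: only validate when the check fails
converted = picks if validator.check_if_valid(picks) else validator.validate(picks)
print(converted.dtypes)
```

Note that converting to an integer drops the leading zeros of the API number, which is presumably what the customer wants but is worth confirming before deploying a change like this.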


OK, we wrote a new implementation of validator (an extension in this case) to do what the customer asked of us. Now we need to restructure our pipeline so that both the old pipeline and the new, modified one can run side by side:

In [5]:
# Let's create an idea of what we want our pipelines to do via an abstraction
class AbstractPipeline(object):
    def __init__(self):
        self.df = None
        self.validators = []
        
    def run(self, df):
        self.df = df
        self.validate_dataset()
        return self.pipeline_function()
    
    def validate_dataset(self):
        for validator in self.validators:
            if not validator.check_if_valid(self.df):
                self.df = validator.validate(self.df)
    
    def pipeline_function(self):
        print('my pipeline')
        return ''
    
# Let's turn our basic pipeline into an implementation of this abstraction
class MainPipeline(AbstractPipeline):
    def __init__(self):
        super().__init__()
        self.validators.append(ColumnNameValidation())
    
    # This is our old pipeline implemented in this class now
    def pipeline_function(self):
        # Imagine a long, complicated pipeline for the rest of the function!!
        for idx, row in self.df.iterrows():
            # We are just printing the pick name to keep the actual example really simple
            print(row['pick_name'])
        return 'awesome pipeline outputs'

OK, we've changed our pipeline into a class and made our old pipeline an implementation of it. Let's check that it works:


In [6]:
# 'pit of despair' asset
print('starting pit of despair')
pipeline_output = MainPipeline().run(pit_of_despair_picks)
print('starting POC')
# poc asset
pipeline_output = MainPipeline().run(poc_picks_df)


starting pit of despair
a_1
None
AA2
bb1
AA2
starting POC
A1
B2
C2
A1
B2
D1


Looks good! Now let's extend our old pipeline to use our new validator:

In [8]:
class NewPipeline(MainPipeline):
    def __init__(self):
        super().__init__()
        self.validators.append(ApiToIntegerValidator())

Pretty quick, right!? We just needed to add the new validator and leave the rest untouched. Let's try it and see if the well column is now an int:

In [9]:
pipeline = NewPipeline()
pipeline.run(poc_picks_df)
pipeline.df.head(4)

A1
B2
C2
A1
B2
D1


,pick_name,pick_depth,pick_interpreter,well
0,A1,100.0,AJ,10000
1,B2,150.0,AJ,10000
2,C2,400.0,AJ,10000
3,A1,300.0,AJ,20000


In [10]:
# Note that the well column of pipeline.df above is now an integer instead of a string

# Here is the original (pre-validation df):
poc_picks_df = pd.DataFrame({
    'pick_name': ['A1', 'B2', 'C2', 'A1', 'B2', 'D1'],
    'pick_depth': [100.0, 150.0, 400.0, 300.0, 400.0, 800.0],
    'pick_interpreter': ['AJ', 'AJ', 'AJ', 'AJ', 'AJ', 'AJ'],
    'well': ['00000000010000', '00000000010000', '00000000010000', '00000000020000', '00000000020000', '00000000020000']
})
poc_picks_df.head(4)

,pick_name,pick_depth,pick_interpreter,well
0,A1,100.0,AJ,00000000010000
1,B2,150.0,AJ,00000000010000
2,C2,400.0,AJ,00000000010000
3,A1,300.0,AJ,00000000020000
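
One detail worth spelling out about `NewPipeline`: because it calls `super().__init__()` before appending its own validator, the column rename always runs before the integer conversion. That order matters for raw assets, since the `well` column only exists after `API` has been renamed. The sketch below illustrates this with two plain functions standing in for the validators' `validate` methods; `rename_columns` and `well_to_int` are hypothetical helpers, not part of the workshop code.

```python
import pandas as pd


def rename_columns(df):
    # Stand-in for ColumnNameValidation.validate (shortened mapping)
    return df.rename(columns={'API': 'well', 'PICK_SURF_NAME': 'pick_name'})


def well_to_int(df):
    # Stand-in for ApiToIntegerValidator.validate
    return df.assign(well=df['well'].astype(int))


raw = pd.DataFrame({
    'PICK_SURF_NAME': ['a_1', 'AA2'],
    'API': ['0000000010', '0000000011'],
})

# Same order as NewPipeline: rename first, then convert
in_order = well_to_int(rename_columns(raw))
print(in_order['well'].tolist())

# Reversed order: 'well' does not exist yet, so the conversion fails
try:
    rename_columns(well_to_int(raw))
except KeyError as err:
    print('missing column:', err)
```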


Hopefully this illustrates the power and flexibility of the open/closed principle. With some refactoring of the way we ran our pipeline, we were able to avoid changing anything about our legacy 'deployed' pipeline and were still able to service a new request by writing new implementations and extensions of our existing code. The new code was minimal, focused only on the required changes, and did not duplicate anything already implemented.

This example also touches on the interface segregation principle, in that we split the implementation of the original pipeline from the implementation of the new customer's request. This is important because if too many customers/actors depend on the same code, they can have conflicting requirements, and resolving those conflicts creates friction and technical debt.
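
The claim that the legacy pipeline is untouched can be made checkable with a small regression test. The sketch below is one way to do that, using condensed, hypothetical stand-ins (`LegacyPipeline` for `MainPipeline`, `IntegerWellPipeline` for `NewPipeline`) and plain functions in place of validator objects so that it runs on its own.

```python
import pandas as pd


class LegacyPipeline:
    """Condensed, hypothetical stand-in for MainPipeline."""

    def __init__(self):
        # Each validator here is just a function from dataframe to dataframe
        self.validators = [lambda df: df.rename(columns={'API': 'well'})]

    def run(self, df):
        for validate in self.validators:
            df = validate(df)
        return df


class IntegerWellPipeline(LegacyPipeline):
    """Extension for the new customer: same steps plus one more validator."""

    def __init__(self):
        super().__init__()
        self.validators.append(lambda df: df.assign(well=df['well'].astype(int)))


raw = pd.DataFrame({'API': ['0000000010', '0000000011']})

legacy_out = LegacyPipeline().run(raw)
extended_out = IntegerWellPipeline().run(raw)

# The legacy pipeline still returns the strings it always did
assert legacy_out['well'].tolist() == ['0000000010', '0000000011']
# Only the extension sees integers
assert extended_out['well'].tolist() == [10, 11]
# The shared input was not modified by either run
assert list(raw.columns) == ['API']
```

Because each pipeline instance builds its own validator list in `__init__`, adding a validator in the subclass cannot leak into instances of the parent class.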