# The power of abstraction (for geoscientist types)
### **NOTICE: Lesson not guaranteed to be effective for engineers.**

By: Nathan Jones, Data Geoscientist

<img src="https://encrypted-tbn0.gstatic.com/images?q=tbn%3AANd9GcTOji6pqnyV_8CijbZAeyfEAVMmX1YQ5TxSjBPgkqTcPu1EZrMa&usqp=CAU"
     alt="Agile Programming"
     style="float: left; margin-right: 10px;" />
www.dilbert.com



## Abstraction
<img src="https://imgs.xkcd.com/comics/abstraction.png"
     alt="Relevant XKCD"
     style="float: left; margin-right: 10px;" />
     https://xkcd.com/676/



## Seriously? Why is abstraction important??

We all aspire to flexible, agile, modular, ... code. Why do we fail!?? 

Well, there are several reasons, but a big one is the concept of dependencies...
<img src="https://i.ibb.co/yfrrsgk/Dependancies-1.png"
     alt="Dependencies-1"
     style="float: left; margin-right: 10px;" />


### Not too bad yet...
<img src="https://i.ibb.co/Mfb7RTh/slide-2.png"
     alt="slide-2"
     style="float: left; margin-right: 10px;" />

### Uh, oh....

<img src="https://i.ibb.co/gTxc1jz/slide-3.png"
 alt="slide-3"
 style="float: left; margin-right: 10px;" />

As dependencies increase, your ability to change the code that is depended on decreases, until it is nearly impossible to change even the simplest thing without spawning an epic bug hunt through all the downstream code...

### Abstraction to the rescue!
<img src="https://i.ibb.co/c2Hb9cf/slide-4.png"
 alt="slide-4"
 style="float: left; margin-right: 10px;" />

Abstraction takes an **idea** and makes all the downstream code depend on it. This 'inverts' the dependency flow, so that the downstream code no longer depends directly on anything beyond the abstraction...

Let's do a simple code example...


In [16]:
import pandas as pd
### Let's load some picks!
picks_data_1 = pd.DataFrame(dict(picks=['A', 'B', 'C'],
                    well_name=['Well-1', 'Well-1', 'Well-2'],
                    uwis=['00000001', '00000001', '00000002'],
                    measured_depth=[100.0, 200.0, 150.0],
                   ))
picks_data_1.head()

| | picks | well_name | uwis | measured_depth |
|---|---|---|---|---|
| 0 | A | Well-1 | 00000001 | 100.0 |
| 1 | B | Well-1 | 00000001 | 200.0 |
| 2 | C | Well-2 | 00000002 | 150.0 |


In [15]:
# Let's do some kind of process on our picks
def sum_picks_for_well_process(picks_df):
    wells = picks_df.groupby('well_name')
    output = wells['measured_depth'].sum()
    return output

# Let's try our process on our data
summed_picks = sum_picks_for_well_process(picks_data_1)
print(summed_picks)

well_name
Well-1    300.0
Well-2    150.0
Name: measured_depth, dtype: float64


In [5]:
# All is well! Our process works on the picks data, so it's time to move on!!!

# Enter: The Request: "Hi! We just bought a field, could you use your amazing script on their data for me??"

# Here's the data:
picks_data_from_our_fantastic_new_asset = pd.DataFrame(dict(HERE_ARE_THE_PICKS=['HADES1', 'PIT OF DESPAIR', 'PIT OF DESPAIR'],
                                                            WELL=['MUHAHAHA', 'MUHAHAHA', 'DIEDIEDIE'],
                                                            API2000=['00000001', '00000001', '00000002'],
                                                            MD_FIXED=[200.0, 400.0, 150.0],
                                                           ))
picks_data_from_our_fantastic_new_asset.head()

| | HERE_ARE_THE_PICKS | WELL | API2000 | MD_FIXED |
|---|---|---|---|---|
| 0 | HADES1 | MUHAHAHA | 00000001 | 200.0 |
| 1 | PIT OF DESPAIR | MUHAHAHA | 00000001 | 400.0 |
| 2 | PIT OF DESPAIR | DIEDIEDIE | 00000002 | 150.0 |


In [14]:
# Welp, let's try my awesome script on it...
# (This raises KeyError: 'well_name', because the new data has no column with that name.)
summed_picks = sum_picks_for_well_process(picks_data_from_our_fantastic_new_asset)

In [13]:
# WHAT!??? WHY??? WHHHHHYYYYYYYYYY!!!!!!???? THIS NEVER HAPPENS!!!

# What would an abstraction look like here?

def column_cleaner(df):
    # Map the source-specific column names onto the names our process expects
    cleaner = dict(HERE_ARE_THE_PICKS='picks',
                   WELL='well_name',
                   API2000='uwis',
                   MD_FIXED='measured_depth'
                  )
    # rename() returns a new DataFrame and ignores mapping keys that are not present
    return df.rename(columns=cleaner)

# Let's try this again
summed_picks = sum_picks_for_well_process(column_cleaner(picks_data_from_our_fantastic_new_asset))
print(summed_picks)

well_name
DIEDIEDIE    150.0
MUHAHAHA     600.0
Name: measured_depth, dtype: float64


### The Abstraction
<img src="https://i.ibb.co/Yyw4v20/slide-5.png"
 alt="slide-5"
 style="float: left; margin-right: 10px;" />

It worked! Our process now works with both sets of picks, but... what was the 'abstraction'? In this case we are creating the "idea" of a universal pick schema that applies to ALL of the code downstream of `column_cleaner`, whether that is one little function or a data pipeline for your awesome new geomodelling script. Without this abstract schema in place, the code would depend directly on the data sources... and chaos would eventually ensue...
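
One way to make the "universal pick schema" idea a bit more explicit is to write the schema down once, keep one column mapping per data source, and check that every expected column is there before anything downstream runs. The sketch below is only an illustration under assumptions: `PICK_SCHEMA`, `SOURCE_MAPPINGS`, the source labels `"in_house"` and `"new_asset"`, and `to_pick_schema` are all made-up names, not part of the example above.

```python
import pandas as pd

# Hypothetical: the only columns downstream code is allowed to rely on.
PICK_SCHEMA = ["picks", "well_name", "uwis", "measured_depth"]

# Hypothetical: one column mapping per data source (labels are made up).
SOURCE_MAPPINGS = {
    "in_house": {},  # already uses the schema names
    "new_asset": {
        "HERE_ARE_THE_PICKS": "picks",
        "WELL": "well_name",
        "API2000": "uwis",
        "MD_FIXED": "measured_depth",
    },
}


def to_pick_schema(df, source):
    """Return a copy of df renamed and ordered to match PICK_SCHEMA."""
    renamed = df.rename(columns=SOURCE_MAPPINGS[source])
    missing = [col for col in PICK_SCHEMA if col not in renamed.columns]
    if missing:
        # Fail loudly here instead of deep inside some downstream process.
        raise ValueError(f"{source!r} data is missing schema columns: {missing}")
    return renamed[PICK_SCHEMA]


new_asset_picks = pd.DataFrame(dict(
    HERE_ARE_THE_PICKS=["HADES1", "PIT OF DESPAIR", "PIT OF DESPAIR"],
    WELL=["MUHAHAHA", "MUHAHAHA", "DIEDIEDIE"],
    API2000=["00000001", "00000001", "00000002"],
    MD_FIXED=[200.0, 400.0, 150.0],
))

clean_picks = to_pick_schema(new_asset_picks, "new_asset")
print(list(clean_picks.columns))
```

With something like this in place, the next "we just bought a field" request should only mean adding one more entry to the mapping, and a bad or incomplete file gets caught at the door rather than in the middle of the bug hunt.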



### In Closing

This is a simple, kinda dumb example, but abstraction is a powerful principle of agility in code. There are many use cases for preventing code dependencies from ruining your day, but I'll mention one more: outside, 3rd-party code packages. Outside packages may not care about you, your code that suddenly broke from their change, or your need for them to keep maintaining said code. They owe you nothing (probably; look at the license, but if you didn't pay them... nothing). Building an abstraction layer between your code and theirs can save the day!

Sort of dumb example: pandas. You could build an abstraction layer between your code and pandas. Instead of calling pandas directly, you'd call your abstract dataframe class (which would route the request on to pandas). Why do this? Well, one reason is that you could write an extension to your abstraction to support another dataframe package (Spark, ArcGIS dataframes, xarray, etc.) and hot swap them in your code whenever you want or need, without your code being the wiser. Another reason is that if pandas breaks your code, you probably only need to fix it in one spot (the abstract class). Pretty cool, eh?
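
Here is a rough sketch of what such a layer could look like. Everything in it is an assumption made up for illustration: `PickTable`, `PandasPickTable`, `PurePythonPickTable`, and `sum_depths_by_well` are hypothetical names, and the pure-Python backend is just a stand-in for "some other dataframe package" so the example stays self-contained.

```python
from abc import ABC, abstractmethod

import pandas as pd


class PickTable(ABC):
    """Hypothetical interface: the only table operations our code may use."""

    @abstractmethod
    def sum_by(self, group_col, value_col):
        """Return {group: sum of value_col} as a plain dict."""


class PandasPickTable(PickTable):
    """Backend that routes the request on to pandas."""

    def __init__(self, data):
        self._df = pd.DataFrame(data)

    def sum_by(self, group_col, value_col):
        summed = self._df.groupby(group_col)[value_col].sum()
        # Convert to plain Python types so no pandas objects leak downstream.
        return {str(key): float(value) for key, value in summed.items()}


class PurePythonPickTable(PickTable):
    """Stand-in for another backend: a plain dict of lists, no pandas."""

    def __init__(self, data):
        self._data = data

    def sum_by(self, group_col, value_col):
        totals = {}
        for group, value in zip(self._data[group_col], self._data[value_col]):
            totals[group] = totals.get(group, 0.0) + value
        return totals


def sum_depths_by_well(table):
    # Depends only on the PickTable idea, never on pandas itself.
    return table.sum_by("well_name", "measured_depth")


data = dict(well_name=["Well-1", "Well-1", "Well-2"],
            measured_depth=[100.0, 200.0, 150.0])

# Hot swap the backend; the process function does not change.
print(sum_depths_by_well(PandasPickTable(data)))
print(sum_depths_by_well(PurePythonPickTable(data)))
```

The design choice worth noticing is that `sum_by` returns a plain dict rather than a pandas Series. If a pandas object leaked out of the layer, the downstream code would quietly start depending on pandas again.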