## `singledispatch` from `functools` in the Python standard library
[Documentation](https://docs.python.org/3/library/functools.html#functools.singledispatch)

## Motivation to use this module

I want to take a quick look at my data. Depending on the size of the dataset, I may want to use pandas (small data) or pyspark/dask (big data) to explore it.

**General case**
> You want to implement the same action across classes from many packages. You know what result you want, but each class may need to be handled in its own way.


In [107]:
import pandas as pd
import pyspark as pys
import pyspark.sql.functions as F

columns = ["firstname","middlename","lastname","dob","gender","salary"]
data = [('James','','Smith','1991-04-01','M',3000),
  ('Michael','Rose','','2000-05-19','M',4000),
  ('Robert','','Williams','1978-09-05','M',4000),
  ('Maria','Anne','Jones','1967-12-01','F',4000),
  ('Jen','Mary','Brown','1980-02-17','F',-1)
]

# pandas dataframe
df_pandas = pd.DataFrame(data, columns=columns)

# pyspark dataframe
builder = pys.sql.SparkSession.builder.appName("tutorial")
spark = builder.getOrCreate()
df_pyspark = spark.createDataFrame(data=data, schema=columns)

In [108]:
df_pandas.head()

Unnamed: 0,firstname,middlename,lastname,dob,gender,salary
0,James,,Smith,1991-04-01,M,3000
1,Michael,Rose,,2000-05-19,M,4000
2,Robert,,Williams,1978-09-05,M,4000
3,Maria,Anne,Jones,1967-12-01,F,4000
4,Jen,Mary,Brown,1980-02-17,F,-1


In [109]:
df_pyspark.show()

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+



**What would I like?** Functions that work with both packages, even though the API calls for similar functionality (`head`, `filter`, `groupBy`, etc.) often differ between pandas and pyspark.

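**What does this mean in computer science?** Polymorphism. Choosing an implementation based on the argument's type is usually classed as ad hoc polymorphism, as opposed to [parametric polymorphism](https://en.wikipedia.org/wiki/Parametric_polymorphism), where a single implementation works uniformly for every type.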
> In computer science, it describes the concept that objects of different types can be accessed through the same interface. Each type can provide its own, independent implementation of this interface. 

**How to do it?** [Function overloading](https://en.wikipedia.org/wiki/Function_overloading)
> In some programming languages, function overloading or method overloading is the ability to create multiple functions of the same name with different implementations. Calls to an overloaded function will run a specific implementation of that function appropriate to the context of the call, allowing one function call to perform different tasks depending on context.
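
Python itself has no built-in overloading by argument type: defining a function twice simply rebinds the name. A minimal sketch of that behaviour (the function name `describe` is made up for illustration):

```python
def describe(value: int) -> str:
    return f"int: {value}"


# This second definition replaces the first one. Type annotations are
# not used to pick an implementation at call time.
def describe(value: str) -> str:
    return f"str: {value}"


print(describe("a"))  # str: a
print(describe(1))    # str: 1  (the int version no longer exists)
```

This is the gap that a dispatch mechanism has to fill.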


### In Python, **singledispatch** covers most cases of type-based polymorphism for plain functions.
As per PEP 443, `singledispatch` picks the implementation based only on the type of the first argument.
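
To make the first-argument rule concrete, here is a small sketch using only the standard library. The function `combine` and its return strings are hypothetical; the point is that swapping the arguments changes which implementation runs, while the type of the second argument is never consulted.

```python
from functools import singledispatch


@singledispatch
def combine(first, second):
    # Fallback used when no implementation is registered for type(first).
    return f"fallback for {type(first).__name__}"


@combine.register(int)
def _(first, second):
    return f"int branch, second is {type(second).__name__}"


@combine.register(str)
def _(first, second):
    return f"str branch, second is {type(second).__name__}"


print(combine(1, "x"))  # int branch, second is str
print(combine("x", 1))  # str branch, second is int
print(combine(1.5, 1))  # fallback for float
```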

## Show me an example already!

#### Code without using singledispatch

In [114]:
# procedural
def process(input):
    if isinstance(input, int):
        return process_int(input)
    elif isinstance(input, list):
        return process_list(input)
    # any other type falls through and returns None

def process_int(input):
    # processing integer
    return f"Integer {input**3} has been processed successfully!"

def process_list(input):
    # processing list
    for idx, val in enumerate(input):
        print(idx, "ALPS "*val)
    return "Enumerated the input list!"

print(process(1), "\n")
print(process([3,4,5,6]), "\n")
print(process(12.0), "\n")  # float is not handled, so this prints None




Integer 1 has been processed successfully! 

0 ALPS ALPS ALPS 
1 ALPS ALPS ALPS ALPS 
2 ALPS ALPS ALPS ALPS ALPS 
3 ALPS ALPS ALPS ALPS ALPS ALPS 
Enumerated the input list! 

None 



#### Refactored code to use singledispatch

In [118]:
# single_dispatch
from functools import singledispatch

@singledispatch
def process(input):
    raise NotImplementedError("Implement process function.")

# Syntax: @<generic_function>.register(<type>)
@process.register(int)
def _process_int(input):
    return f"Integer {input**3} has been processed successfully!"

@process.register(list)
def _process_list(input):
    for idx, val in enumerate(input):
        print(idx, "ALPS "*val)
    return "Enumerated the input list!"

In [119]:
# pass an int
print(process(1))


Integer 1 has been processed successfully!


In [120]:
# pass a list
print(process([3,4,5,6]))

0 ALPS ALPS ALPS 
1 ALPS ALPS ALPS ALPS 
2 ALPS ALPS ALPS ALPS ALPS 
3 ALPS ALPS ALPS ALPS ALPS ALPS 
Enumerated the input list!


In [121]:
# fails for float
print(process(12.0))


NotImplementedError: Implement process function.

In [122]:
@process.register(float)
def _process_float(input):
    # processing float
    return f"Float {input**3} has been processed successfully!"


In [123]:
# success for float
print(process(12.0))


Float 1728.0 has been processed successfully!
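
Two further details of `singledispatch` are worth a quick sketch: implementations can be registered from a type annotation instead of an explicit argument (Python 3.7+), and lookup follows the class hierarchy, including abstract base classes. The function `describe` below is hypothetical and uses only the standard library.

```python
from collections.abc import Sequence
from functools import singledispatch


@singledispatch
def describe(value):
    return "unknown"


@describe.register
def _(value: int):
    # Registered for int via the annotation on the first parameter.
    return "int"


@describe.register
def _(value: Sequence):
    # Registered for an abstract base class: covers list, tuple, str, ...
    return "sequence"


print(describe(True))    # int  (bool is a subclass of int)
print(describe([1, 2]))  # sequence
print(describe("ab"))    # sequence
print(describe(1.5))     # unknown

# dispatch() reports which implementation a given type resolves to.
print(describe.dispatch(bool) is describe.dispatch(int))  # True
```

Registering a more specific type later (for example `str`) would take precedence over the `Sequence` implementation for that type.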


## Coming back to data science/analysis usage

In [124]:
df_pandas.head()

Unnamed: 0,firstname,middlename,lastname,dob,gender,salary
0,James,,Smith,1991-04-01,M,3000
1,Michael,Rose,,2000-05-19,M,4000
2,Robert,,Williams,1978-09-05,M,4000
3,Maria,Anne,Jones,1967-12-01,F,4000
4,Jen,Mary,Brown,1980-02-17,F,-1


In [125]:
df_pyspark.show()

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+



### Implement a function that outputs the top n rows of a dataframe, using the same call for pandas and pyspark dataframes

In [127]:
@singledispatch
def show_top_rows(df, n=5):
    raise NotImplementedError("Implement head function. ")

@show_top_rows.register(pd.DataFrame)
def _show_top_rows_pandas(df, n=5):
    # pandas: head(n) returns the first n rows as a new DataFrame
    return df.head(n)

@show_top_rows.register(pys.sql.DataFrame)
def _show_top_rows_pyspark(df, n=5):
    # pyspark: show(n) prints the first n rows and returns None
    return df.show(n)



In [128]:
show_top_rows(df_pandas, 10)

Unnamed: 0,firstname,middlename,lastname,dob,gender,salary
0,James,,Smith,1991-04-01,M,3000
1,Michael,Rose,,2000-05-19,M,4000
2,Robert,,Williams,1978-09-05,M,4000
3,Maria,Anne,Jones,1967-12-01,F,4000
4,Jen,Mary,Brown,1980-02-17,F,-1


In [129]:
show_top_rows(df_pyspark, 10)

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
|  Michael|      Rose|        |2000-05-19|     M|  4000|
|   Robert|          |Williams|1978-09-05|     M|  4000|
|    Maria|      Anne|   Jones|1967-12-01|     F|  4000|
|      Jen|      Mary|   Brown|1980-02-17|     F|    -1|
+---------+----------+--------+----------+------+------+



In [130]:
show_top_rows(data)

NotImplementedError: Implement head function. 
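
If plain Python rows should be supported as well, one more registration is enough. The sketch below is self-contained and uses a separate, hypothetical generic function `preview_rows` with shortened sample rows; treating a list slice as the analogue of `head(n)` is an assumption for illustration.

```python
from functools import singledispatch


@singledispatch
def preview_rows(rows, n=5):
    # Fallback: report the unsupported type in the error message.
    raise NotImplementedError(
        f"preview_rows is not implemented for {type(rows).__name__}"
    )


@preview_rows.register(list)
def _preview_rows_list(rows, n=5):
    # For a plain list of tuples, slicing plays the role of head(n).
    return rows[:n]


sample = [
    ("James", "Smith", 3000),
    ("Michael", "", 4000),
    ("Robert", "Williams", 4000),
]

print(preview_rows(sample, 2))
# [('James', 'Smith', 3000), ('Michael', '', 4000)]
```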

### Now, filter the dataframe by a column value

In [131]:
@singledispatch
def filter_based_on_column_value(df, filter_column:str, filter_value):
    raise NotImplementedError("Implement filtering on data. Currently not implemented for datatype {}".format(type(df)))

@filter_based_on_column_value.register(pd.DataFrame)
def _filter_pandas(df, filter_column, filter_value):
    return df[df[filter_column]==filter_value]

@filter_based_on_column_value.register(pys.sql.DataFrame)
def _filter_pyspark(df, filter_column, filter_value):
    return df.filter(F.col(filter_column)==filter_value)

In [132]:
filter_based_on_column_value(df_pandas, "firstname", "James")

Unnamed: 0,firstname,middlename,lastname,dob,gender,salary
0,James,,Smith,1991-04-01,M,3000


In [133]:
filter_based_on_column_value(df_pyspark, "firstname", "James")

DataFrame[firstname: string, middlename: string, lastname: string, dob: string, gender: string, salary: bigint]

In [134]:
filter_based_on_column_value(data, "firstname", "James")

NotImplementedError: Implement filtering on data. Currently not implemented for datatype <class 'list'>

In [135]:
show_top_rows(
    filter_based_on_column_value(df_pandas, "firstname", "James")
    )

Unnamed: 0,firstname,middlename,lastname,dob,gender,salary
0,James,,Smith,1991-04-01,M,3000


In [136]:
show_top_rows(
    filter_based_on_column_value(df_pyspark, "firstname", "James")
    )

+---------+----------+--------+----------+------+------+
|firstname|middlename|lastname|       dob|gender|salary|
+---------+----------+--------+----------+------+------+
|    James|          |   Smith|1991-04-01|     M|  3000|
+---------+----------+--------+----------+------+------+



In [137]:
spark.stop()

P.S. If dispatching on a single argument is not sufficient, the third-party library [multipledispatch](https://pypi.org/project/multipledispatch/) dispatches on the types of several arguments.
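
For small cases, dispatch on two arguments can also be approximated with the standard library by chaining two single-dispatch functions. This is only a sketch of that idea, not how multipledispatch works internally; the names `scale` and `_scale_list_by` are hypothetical.

```python
from functools import singledispatch


@singledispatch
def scale(value, factor):
    raise NotImplementedError(
        f"scale is not implemented for {type(value).__name__}"
    )


@singledispatch
def _scale_list_by(factor, values):
    # Second stage: dispatches on the type of the factor.
    raise NotImplementedError(
        f"cannot scale a list by {type(factor).__name__}"
    )


@_scale_list_by.register(int)
def _(factor, values):
    # An integer factor repeats the list.
    return values * factor


@_scale_list_by.register(float)
def _(factor, values):
    # A float factor multiplies every element.
    return [v * factor for v in values]


@scale.register(list)
def _(value, factor):
    # First stage matched on list; hand over to the second stage with
    # the arguments swapped so that factor becomes the dispatch argument.
    return _scale_list_by(factor, value)


print(scale([1, 2], 2))    # [1, 2, 1, 2]
print(scale([1, 2], 0.5))  # [0.5, 1.0]
```

The chaining gets unwieldy quickly as the number of type combinations grows, which is where a dedicated library becomes attractive.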


## References:
1. [Confluence Page within C1](https://confluence.kdc.capitalone.com/display/CONVERS/Team+SynApps+Python+Style+Guide)
2. [Random Blog post](https://rednafi.github.io/digressions/python/2020/04/05/python-singledispatch.html)
3. [Yet Another blog post](https://mchow.com/posts/2020-02-24-single-dispatch-data-science/)