# Consistency Issues (10 mins)

To be a proper abstraction layer, Fugue puts a lot of effort into guaranteeing consistency. A solution prototyped on the Pandas engine must behave the same way when running on Spark, Dask, and Ray. The core Fugue repository has a unified test suite that checks that all of the operations produce the same results on each engine. Without such a layer, even data teams with the bandwidth to rewrite Python and Pandas solutions in native Spark would still have to worry about consistency themselves.

Consistency comes in two forms: result consistency and behavior (execution) consistency.

## Result Consistency

Dask is more compatible with Pandas, but Spark is less so. The following examples outline some differences between Pandas and Spark.

### Setup

First, we create the same two DataFrames in both Pandas and Spark.

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [4]:
import pandas as pd

df = pd.DataFrame({"a": [None, None, 1, 1, 2, 2], "b": [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({"a":[None,1,2], "c":["a","b","c"]})
sdf = spark.createDataFrame(df)
sdf2 = spark.createDataFrame(df2)
df.head()

     a  b
0  NaN  1
1  NaN  2
2  1.0  3
3  1.0  4
4  2.0  5


### Joining

**Pandas**

Pandas has `merge()` and `join()`. To join on columns, we need to use `merge()`. Pay attention to the NULL values in the result DataFrame: Pandas matches NULL keys with each other, so those rows are kept.

In [7]:
df.merge(df2)

     a  b  c
0  NaN  1  a
1  NaN  2  a
2  1.0  3  b
3  1.0  4  b
4  2.0  5  c
5  2.0  6  c


**Spark**

On the other hand, a Spark join drops the rows whose join key is NULL, because it follows SQL semantics, where NULL never equals NULL.

In [8]:
sdf.join(sdf2, on="a").show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
|1.0|  3|  b|
|1.0|  4|  b|
|2.0|  5|  c|
|2.0|  6|  c|
+---+---+---+
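
If the Spark result is the one we want, one way to reproduce it locally is to drop NULL join keys before merging. The following is a minimal sketch under that assumption, not something Fugue or Pandas does for us; `sql_style_merge` is a hypothetical helper name introduced only for this illustration.

```python
import pandas as pd


def sql_style_merge(left: pd.DataFrame, right: pd.DataFrame, on: str) -> pd.DataFrame:
    # Hypothetical helper: drop rows with a NULL join key on both sides,
    # so the Pandas inner join can no longer match NULL with NULL
    return left.dropna(subset=[on]).merge(right.dropna(subset=[on]), on=on)


df = pd.DataFrame({"a": [None, None, 1, 1, 2, 2], "b": [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({"a": [None, 1, 2], "c": ["a", "b", "c"]})

result = sql_style_merge(df, df2, on="a")
print(result)
```

The rows with a NULL key no longer appear, which lines up with the Spark output above. Whether to drop them or keep them depends on the use case; the point is that the choice has to be made explicitly.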



### Sorting

**Pandas**

When we sort a DataFrame in Pandas in descending order, the NULL values appear at the bottom.

In [11]:
df2.sort_values("a", ascending=False)

     a  c
2  2.0  c
1  1.0  b
0  NaN  a


But when we sort in ascending order, Pandas still places the NULL values at the bottom.

In [12]:
df2.sort_values("a", ascending=True)

     a  c
1  1.0  b
2  2.0  c
0  NaN  a


**Spark**

On the other hand, Spark gives NULL a position in the ordering. It is treated as smaller than any other value, so it comes first in ascending order and last in descending order.

In [15]:
sdf2.orderBy("a", ascending = True).show()

+----+---+
|   a|  c|
+----+---+
|null|  a|
| 1.0|  b|
| 2.0|  c|
+----+---+



In [16]:
sdf2.orderBy("a", ascending = False).show()

+----+---+
|   a|  c|
+----+---+
| 2.0|  c|
| 1.0|  b|
|null|  a|
+----+---+
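
To get Spark's ordering from Pandas, `sort_values` accepts an `na_position` argument. The sketch below assumes we want the local Pandas result to match the Spark output shown above; it only rebuilds the small `df2` frame from the setup.

```python
import pandas as pd

df2 = pd.DataFrame({"a": [None, 1, 2], "c": ["a", "b", "c"]})

# Ascending with NULL first, matching the Spark ascending output
asc = df2.sort_values("a", ascending=True, na_position="first")

# Descending with NULL last, which is already the Pandas default
desc = df2.sort_values("a", ascending=False, na_position="last")

print(asc)
print(desc)
```

Going the other way, Spark columns expose methods such as `asc_nulls_last()` that can be passed to `orderBy` to mimic the Pandas ordering; that variant is not run here.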



**PySpark Pandas**

Which one does PySpark Pandas follow? In this case it follows Pandas and places the NULL values at the bottom.

In [21]:
import pyspark.pandas as ps

kdf = ps.DataFrame(df2)
kdf.sort_values("a", ascending=True)

     a  c
1  1.0  b
2  2.0  c
0  NaN  a


## Table of Consistency Issues

If our local computing engine produces different results from the distributed computing engine, then our local tests become unreliable.

<img src="https://miro.medium.com/v2/resize:fit:1400/0*fv0FKyt3jB0ehVrU" align="left" width="700"/>

## Behavior Consistency

### Lazy Evaluation

One of the fundamental concepts in distributed computing is lazy evaluation. In Pandas, all code is executed eagerly, meaning that each statement runs the moment the Python interpreter reaches it. On the other hand, distributed computing frameworks like Spark, Dask, and Ray are lazily evaluated. The code is not immediately executed. Instead, a computation graph is built. When a result needs to be materialized, the graph is optimized and run.



In [28]:
import numpy as np

def gen_data(df: pd.DataFrame) -> pd.DataFrame:
    # Add a column "b" with one random float per row
    return df.assign(b=np.random.random(df.shape[0]))

pdf = pd.DataFrame([[0],[1],[2],[3],[4],[5],[6],[7]], columns=["a"])
# Pandas is eager: gen_data runs once here, so the random values are fixed
result = gen_data(pdf)
print(result.head(3))
print(result.head(3))

   a         b
0  0  0.688982
1  1  0.840180
2  2  0.665857
   a         b
0  0  0.688982
1  1  0.840180
2  2  0.665857
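
In Pandas the two printed results are identical because `gen_data` ran exactly once. With a lazy engine, and assuming nothing is cached or persisted, each materialization can re-run the graph, so a step that draws random numbers may give different values each time. The toy class below is only a plain-Python sketch of that idea; `LazyFrame` and its methods are hypothetical and are not the API of Spark, Dask, or Ray.

```python
import random


class LazyFrame:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, for illustration)."""

    def __init__(self, compute):
        # compute is a zero-argument function that produces the rows on demand
        self._compute = compute

    def assign_random(self):
        # Only records the step in the graph; nothing is executed yet
        return LazyFrame(lambda: [(a, random.random()) for a in self._compute()])

    def head(self, n):
        # Materializing re-runs the whole recorded graph from the start
        return self._compute()[:n]


lazy = LazyFrame(lambda: list(range(8))).assign_random()

first = lazy.head(3)
second = lazy.head(3)
print(first)
print(second)
```

The `a` values match across the two calls, but the random column is regenerated on every `head`, which is the kind of behavior difference that makes a locally tested solution surprising once it runs on a lazy engine.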


In [51]:
kdf = ps.DataFrame([[0],[1],[2],[3],[4],[5],[6],[7]], columns=["a"])

def gen_data(df):
    return df.assign(b=np.random.random((df.shape[0], 1)))

# test = kdf.pandas_on_spark.apply_batch(gen_data)
# print(test.head())
# print(test.head())

kdf.sum()

a    28
dtype: int64