## Fugue API - Fully Agnostic Workflows

Everything so far has focused on the `transform()` function. While `transform()` can already handle a lot, it is not enough on its own to describe end-to-end, framework-agnostic workflows.

Take a look at the previous examples. We're always creating a DataFrame with Pandas before the distributed operation. If we load with Spark, the workflow becomes dependent on Spark.

```
df = ...
sdf = transform(..., engine="spark")
sdf.write.parquet(...)
```

This is where the broader Fugue API comes in. Fugue has a suite of standalone functions that are all compatible with Pandas, Spark, Dask, and Ray DataFrames. First, we take a look at loading and saving.

## Saving and Loading

Let's set up a DataFrame and save it for use in this section.

In [1]:
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "a", "b", "b", "b"],
                   "col2": [1, 4, 2, 5, 3, 2]})
df.to_parquet("/tmp/test.parquet")

Now we can see the engine-agnostic functions in action.

In [2]:
import fugue.api as fa

def add_new_col(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(col3 = df['col2'] * 3)

# Using Pandas
df = fa.load("/tmp/test.parquet")
tdf = fa.transform(df, add_new_col, schema="*, col3: int")
fa.save(tdf, "/tmp/output.parquet")

To check the results, we can use Pandas again:

In [3]:
pd.read_parquet("/tmp/output.parquet")

|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | a | 1 | 3 |
| 1 | a | 4 | 12 |
| 2 | a | 2 | 6 |
| 3 | b | 5 | 15 |
| 4 | b | 3 | 9 |
| 5 | b | 2 | 6 |


## Bringing to Spark

These functions support an `engine` keyword argument. We can set it to a Spark session. If we want to use Pandas, we can simply set it to `None`.

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

engine = spark
df = fa.load("/tmp/test.parquet", engine=engine)
tdf = fa.transform(df, add_new_col, schema="*, col3:int", engine=engine)
fa.save(tdf, "/tmp/output.parquet", engine=engine)



## Engine Context

Writing out the engine several times can be tedious, so Fugue also has an `engine_context` that can be used to define the engine for a set of operations.

In [5]:
with fa.engine_context(spark):
    df = fa.load("/tmp/test.parquet")
    tdf = fa.transform(df, add_new_col, schema="*, col3:int")
    print(type(tdf))
    fa.save(tdf, "/tmp/output.parquet")

<class 'pyspark.sql.dataframe.DataFrame'>


## Other Functions

This tutorial doesn't list every function; the [Fugue API documentation](https://fugue.readthedocs.io/en/latest/top_api.html) has a dedicated section for that. Here we show only some commonly used operations.

**Drop Columns**, **Rename**, **Distinct**

In [13]:
def get_distinct(engine=None):
    with fa.engine_context(engine):
        df = fa.load("/tmp/test.parquet")
        res = fa.drop_columns(df, ["col2"])
        res = fa.rename(res, {"col1": "_col1"})
        res = fa.distinct(res)
        fa.show(res)
    return

In [14]:
get_distinct()

|   | _col1:str |
|---|-----------|
| 0 | a |
| 1 | b |


In [15]:
get_distinct(spark)

|   | _col1:str |
|---|-----------|
| 0 | b |
| 1 | a |


Other functions include **Alter Schema**, **Dropna**, **Fillna**.

## Take

`take` is a commonly used operation that returns the first `n` rows, optionally after sorting with `presort` and, when `partition` is given, per group.

In [18]:
df = pd.DataFrame({"a": ["Apple", "Apple", "Banana", "Banana", "Carrot", "Carrot"],
                   "b": [1,2,3,4,5,6]})

In [20]:
fa.show(fa.take(df, 2, presort="b desc"))

|   | a:str | b:long |
|---|-------|--------|
| 0 | Carrot | 6 |
| 1 | Carrot | 5 |


In [22]:
fa.show(fa.take(df, 1, presort="b asc", partition={"by": "a"}))

|   | a:str | b:long |
|---|-------|--------|
| 0 | Apple | 1 |
| 1 | Banana | 3 |
| 2 | Carrot | 5 |


In [23]:
fa.show(fa.take(df, 1, presort="b asc", partition={"by": "a"}, engine="dask"))

|   | a:str | b:long |
|---|-------|--------|
| 0 | Apple | 1 |
| 0 | Banana | 3 |
| 0 | Carrot | 5 |
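
For readers who think in Pandas terms, the partitioned `take` with `presort="b asc"` corresponds roughly to sorting and then keeping the first row of each group. The sketch below uses plain Pandas only; it is an analogy for what the operation computes, not how Fugue implements it.

```python
import pandas as pd

df = pd.DataFrame({"a": ["Apple", "Apple", "Banana", "Banana", "Carrot", "Carrot"],
                   "b": [1, 2, 3, 4, 5, 6]})

# Sort by b ascending, then keep the first row within each value of a.
first_per_group = (
    df.sort_values("b", ascending=True)
      .groupby("a")
      .head(1)
      .reset_index(drop=True)
)
print(first_per_group)
# One row per group: (Apple, 1), (Banana, 3), (Carrot, 5)
```

The Fugue version has the advantage that the same line runs unchanged on Dask or Spark, where a global sort followed by a group-wise head is more awkward to express.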


## Raw SQL

`raw_sql` lets us run one-line SQL statements, with DataFrames passed in as arguments between the query strings. For more involved queries, FugueSQL is recommended.

In [25]:
res = fa.raw_sql("SELECT * FROM", df, engine="duckdb")
res

pyarrow.Table
a: string
b: int64
----
a: [["Apple","Apple","Banana","Banana","Carrot","Carrot"]]
b: [[1,2,3,4,5,6]]

In [26]:
fa.as_pandas(res)

|   | a | b |
|---|---|---|
| 0 | Apple | 1 |
| 1 | Apple | 2 |
| 2 | Banana | 3 |
| 3 | Banana | 4 |
| 4 | Carrot | 5 |
| 5 | Carrot | 6 |


## Stratified Sampling

In [None]:
sdf.when().when().when().otherwise()
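
One engine-agnostic way to approach stratified sampling, offered as a sketch rather than the tutorial's own method, is to sample inside each partition with `fa.transform`. The helper names `sample_group` and `stratified_sample`, the 50% fraction, and the fixed seed are all assumptions made for illustration. Because `sample_group` is plain Pandas, it can be checked locally before any engine is involved.

```python
import pandas as pd

def sample_group(df: pd.DataFrame) -> pd.DataFrame:
    # Fugue passes in one partition (one value of "a") at a time;
    # keep half of its rows. The seed is fixed only for reproducibility.
    return df.sample(frac=0.5, random_state=0)

def stratified_sample(df, engine=None):
    # Imported lazily so the Pandas helper above can be tested without Fugue.
    import fugue.api as fa
    return fa.transform(df, sample_group, schema="*",
                        partition={"by": "a"}, engine=engine)

fruits = pd.DataFrame({"a": ["Apple", "Apple", "Banana", "Banana", "Carrot", "Carrot"],
                       "b": [1, 2, 3, 4, 5, 6]})

# Local check of the per-group logic with plain Pandas, no Fugue required.
local = pd.concat(sample_group(group) for _, group in fruits.groupby("a"))
print(local)
```

Calling `stratified_sample(fruits)` would run the same logic through Fugue on Pandas, and `stratified_sample(fruits, engine=spark)` on Spark, with each group sampled independently.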