# Fugue API - Fully Agnostic Workflows

[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://slack.fugue.ai)

So far, the discussion has focused on the `transform()` function. While `transform()` can already handle a lot, it is not enough to express end-to-end framework-agnostic workflows.

Take a look at the previous examples: we always create a DataFrame with Pandas before the distributed operation. If we instead load or save the data with Spark, as in the snippet below, the workflow becomes tied to Spark.

```
df = ...
sdf = transform(..., engine="spark")
sdf.write.parquet(...)
```

This is where the broader Fugue API comes in. Fugue has a suite of standalone functions that are all compatible with Pandas, Spark, Dask, and Ray DataFrames. First, we take a look at loading and saving.

In [None]:
_=!mamba install -y openjdk
_=!pip install -r ../requirements.txt

## Saving and Loading

Let's set up a DataFrame and save it for use in this section.

In [15]:
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "a", "b", "b", "b"],
                   "col2": [1, 4, 2, 5, 3, 2]})
df.to_parquet("/tmp/test.parquet")

Now we can try the engine-agnostic functions. With no engine specified, they run on Pandas.

In [16]:
import fugue.api as fa

def add_new_col(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(col3 = df['col2'] * 3)

# Using Pandas
df = fa.load("/tmp/test.parquet")
tdf = fa.transform(df, add_new_col, schema="*, col3: int")
fa.save(tdf, "/tmp/output.parquet")

To check the results, we can use Pandas again:

In [17]:
pd.read_parquet("/tmp/output.parquet")

Unnamed: 0,col1,col2,col3
0,a,1,3
1,a,4,12
2,a,2,6
3,b,5,15
4,b,3,9
5,b,2,6


## Bringing to Spark

These functions support an `engine` keyword argument. We can set it to a Spark session. If we want to use Pandas, we can simply set it to `None`.

In [18]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

engine = spark
df = fa.load("/tmp/test.parquet", engine=engine)
tdf = fa.transform(df, add_new_col, schema="*, col3:int", engine=engine)
fa.save(tdf, "/tmp/output.parquet", engine=engine)

## Engine Context

Writing out the engine several times can be tedious, so Fugue also has an `engine_context` that can be used to define the engine for a set of operations.

In [19]:
with fa.engine_context(spark):
    df = fa.load("/tmp/test.parquet")
    tdf = fa.transform(df, add_new_col, schema="*, col3:int")
    print(type(tdf))
    fa.save(tdf, "/tmp/output.parquet")

<class 'pyspark.sql.dataframe.DataFrame'>
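
To make the idea of a scoped default engine concrete, here is a minimal plain-Python sketch of how such a context could be built with `contextvars`. It is a conceptual illustration only, not Fugue's implementation: the names `_current_engine`, `resolve_engine`, and the stand-in `load` are hypothetical, and the rule that an explicit `engine` argument takes precedence over the context is an assumption of this sketch.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Hypothetical module-level state holding the engine for the current scope.
_current_engine = ContextVar("current_engine", default=None)


@contextmanager
def engine_context(engine=None):
    """Set the default engine for the enclosed block, then restore the old one."""
    token = _current_engine.set(engine)
    try:
        yield engine
    finally:
        _current_engine.reset(token)


def resolve_engine(engine=None):
    """Assumption of this sketch: an explicit argument wins over the context."""
    return engine if engine is not None else _current_engine.get()


def load(path, engine=None):
    """Stand-in for an engine-aware function; it only reports the chosen engine."""
    return f"load({path}) on {resolve_engine(engine) or 'pandas'}"


print(load("/tmp/test.parquet"))
with engine_context("spark"):
    print(load("/tmp/test.parquet"))
    print(load("/tmp/test.parquet", engine="dask"))
```

Restoring the previous value on exit is what allows contexts to nest safely: an inner block can switch engines without affecting the code after it.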


## Other Functions

We won't provide a full list of functions in this tutorial. There is a dedicated section in the [Fugue API Documentation](https://fugue.readthedocs.io/en/latest/top_api.html). We'll just show some commonly used operations.

**Drop Columns**, **Rename**, **Distinct**

In [20]:
def get_distinct(engine=None):
    with fa.engine_context(engine):
        df = fa.load("/tmp/test.parquet")
        res = fa.drop_columns(df, ["col2"])
        res = fa.rename(res, {"col1": "_col1"})
        res = fa.distinct(res)
        fa.show(res)
    return

In [21]:
get_distinct()

Unnamed: 0,_col1:str
0,a
1,b


In [22]:
get_distinct(spark)

Unnamed: 0,_col1:str
0,b
1,a


Other functions include **Alter Schema**, **Dropna**, and **Fillna**.

## Take

`take` is a commonly used operation that returns the first `n` rows after sorting, optionally within each partition.

In [23]:
df = pd.DataFrame({"a": ["Apple", "Apple", "Banana", "Banana", "Carrot", "Carrot"],
                   "b": [1,2,3,4,5,6]})

In [24]:
fa.show(fa.take(df, 2, presort="b desc"))

Unnamed: 0,a:str,b:long
0,Carrot,6
1,Carrot,5


In [25]:
fa.show(fa.take(df, 1, presort="b asc", partition={"by": "a"}))

Unnamed: 0,a:str,b:long
0,Apple,1
1,Banana,3
2,Carrot,5


In [26]:
fa.show(fa.take(df, 1, presort="b asc", partition={"by": "a"}, engine="dask"))

Unnamed: 0,a:str,b:long
0,Apple,1
0,Banana,3
0,Carrot,5
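
The semantics of `take` with `presort` and `partition` can be sketched in plain Python on a list of dictionaries. This is an illustrative model rather than Fugue's code: the `take` function below is a hypothetical stand-in that handles a single sort key and a single partition column, and it assumes ascending order when no direction is given.

```python
from collections import defaultdict


def take(rows, n, presort, partition_by=None):
    """Return the first n rows per group after sorting by 'col' or 'col asc|desc'."""
    col, _, direction = presort.partition(" ")
    reverse = direction.strip().lower() == "desc"

    # Without a partition column, every row falls into one group.
    groups = defaultdict(list)
    for row in rows:
        key = row[partition_by] if partition_by is not None else None
        groups[key].append(row)

    result = []
    for members in groups.values():
        ordered = sorted(members, key=lambda r: r[col], reverse=reverse)
        result.extend(ordered[:n])
    return result


rows = [
    {"a": "Apple", "b": 1}, {"a": "Apple", "b": 2},
    {"a": "Banana", "b": 3}, {"a": "Banana", "b": 4},
    {"a": "Carrot", "b": 5}, {"a": "Carrot", "b": 6},
]
print(take(rows, 2, "b desc"))
print(take(rows, 1, "b asc", partition_by="a"))
```

The sketch returns groups in insertion order. A distributed engine makes no such promise, so only the set of returned rows should be relied on, not their order.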


## Raw SQL

`raw_sql` lets us run one-line SQL statements, with DataFrames passed inline as arguments. For more involved queries, FugueSQL is recommended.

In [27]:
res = fa.raw_sql("SELECT * FROM", df, engine="duckdb")
res

pyarrow.Table
a: string
b: int64
----
a: [["Apple","Apple","Banana","Banana","Carrot","Carrot"]]
b: [[1,2,3,4,5,6]]

In [28]:
fa.as_pandas(res)

Unnamed: 0,a,b
0,Apple,1
1,Apple,2
2,Banana,3
3,Banana,4
4,Carrot,5
5,Carrot,6
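
The call above mixes SQL text and a DataFrame as alternating arguments. One way to picture this pattern is that each non-string argument is registered under a generated table name and the fragments are joined into a single statement. The sketch below illustrates that idea only. `build_statement` and the `_tbl0` naming scheme are hypothetical and say nothing about how Fugue or DuckDB actually handle it.

```python
def build_statement(*parts):
    """Join SQL fragments, replacing each non-string part with a generated table name."""
    tables = {}
    pieces = []
    for part in parts:
        if isinstance(part, str):
            pieces.append(part.strip())
        else:
            # Hypothetical naming scheme for registered tables.
            name = f"_tbl{len(tables)}"
            tables[name] = part
            pieces.append(name)
    return " ".join(pieces), tables


# Any object can stand in for a DataFrame in this sketch.
data = [("Apple", 1), ("Apple", 2), ("Banana", 3)]
sql, tables = build_statement("SELECT a, SUM(b) AS total FROM", data, "GROUP BY a")
print(sql)
print(list(tables))
```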


## Stratified Sampling

Combining `transform`, `take`, and `drop_columns` gives an engine-agnostic stratified sampler that keeps at most `max_ct` rows per group.

In [29]:
import numpy as np

def stratified_sampling(df, by, max_ct):
    def label(df: pd.DataFrame) -> pd.DataFrame:
        return df.assign(_label=np.random.rand(len(df)))

    _df = fa.transform(df, label, schema="*, _label:double")
    _df = fa.take(_df, n=max_ct, presort="_label", partition={"by": by})
    return fa.drop_columns(_df, ["_label"])

In [30]:
def create_data(n):
    target = np.concatenate((np.ones(int(n*0.9)), np.zeros(int(n*0.1))), axis=0)
    np.random.shuffle(target)
    feat1 = np.random.normal(0, 1, n)
    feat2 = np.random.normal(0, 1, n)
    data = pd.DataFrame({'feature1': feat1, 'feature2': feat2, 'target': target})
    return data

data = create_data(10000)
data['target'].value_counts()

1.0    9000
0.0    1000
Name: target, dtype: int64

In [31]:
with fa.engine_context():
    res = stratified_sampling(data, "target", 800)

In [32]:
res["target"].value_counts()

1.0    800
0.0    800
Name: target, dtype: int64

In [33]:
with fa.engine_context("spark"):
    res = stratified_sampling(data, "target", 800)
print(type(res))
res.groupBy('target').count().show()

<class 'pyspark.sql.dataframe.DataFrame'>
+------+-----+
|target|count|
+------+-----+
|   1.0|  800|
|   0.0|  800|
+------+-----+
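
The sampler works because sorting by an independent uniform random label and keeping the first `max_ct` rows of each group is a uniform sample without replacement from that group. The plain-Python sketch below shows the same logic with no engine involved. `stratified_sample` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, by, max_ct, seed=None):
    """Keep at most max_ct rows per group, chosen uniformly at random."""
    rng = random.Random(seed)

    # Step 1: attach a random label to every row, grouped by the stratum column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append((rng.random(), row))

    # Step 2: within each group, sort by label and keep the first max_ct rows.
    sample = []
    for members in groups.values():
        members.sort(key=lambda pair: pair[0])
        sample.extend(row for _, row in members[:max_ct])
    return sample


rows = [{"id": i, "target": 1.0} for i in range(90)]
rows += [{"id": i, "target": 0.0} for i in range(90, 100)]
counts = Counter(r["target"] for r in stratified_sample(rows, "target", 20, seed=0))
print(counts)
```

Here the larger group is capped at 20 rows while the smaller group, which has only 10 rows, is kept whole. In the notebook run above both groups exceed the cap of 800, so both end up with exactly 800 rows.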



## Utility

**Peek**

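The peek functions return the first row of the DataFrame, either as a dictionary (`peek_dict`) or as a list (`peek_array`). Because a Spark DataFrame has no guaranteed row order, repeated calls can return different rows, as in the two outputs below.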

In [35]:
fa.peek_dict(res)

{'feature1': 3.7341876305773423, 'feature2': 0.9921228147036096, 'target': 1.0}

In [36]:
fa.peek_array(res)

[0.2462285588228613, -0.0060895710652030975, 1.0]

`is_local` reports whether a DataFrame is held in local memory, and `as_pandas` converts the result to a Pandas DataFrame.

In [37]:
fa.is_local(res)

False

In [38]:
fa.as_pandas(res)

Unnamed: 0,feature1,feature2,target
0,1.696083,-0.203977,1.0
1,-1.129033,0.135361,1.0
2,0.006575,-0.752696,1.0
3,-2.431510,-1.412702,1.0
4,1.220347,-0.070418,1.0
...,...,...,...
1595,0.133997,0.395380,0.0
1596,1.033733,0.069877,0.0
1597,-1.668650,0.041101,0.0
1598,2.292593,-0.291643,0.0
