# Introduction to Fugue

The [Fugue](https://github.com/fugue-project/fugue/) project aims to make big data effortless by accelerating iteration speed and providing a simpler interface for users to utilize distributed computing engines.

This notebook takes a quick look at Fugue. The examples are taken from [Fugue in 10 minutes](https://fugue-tutorials.readthedocs.io/tutorials/quick_look/ten_minutes.html#partitioning).

In [None]:
import pandas as pd
import os
import numpy as np

## Setup

The simplest way to scale pandas-based code to Spark or Dask is the `transform()` function. This thin wrapper brings existing pandas and Python code to distributed execution with little refactoring, and it removes much of the boilerplate that users would otherwise have to write.

Let’s see how this works in practice. In the code below, we train a model with scikit-learn and pandas, then run predictions with that model in parallel on Spark through Fugue.

In [None]:
from sklearn.linear_model import LinearRegression

X = pd.DataFrame({"x_1": [1, 1, 2, 2], "x_2":[1, 2, 2, 3]})
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)

In [None]:
def predict(df: pd.DataFrame, model: LinearRegression) -> pd.DataFrame:
    return df.assign(predicted=model.predict(df))

In [None]:
# create test data
input_df = pd.DataFrame({"x_1": [3, 4, 6, 6], "x_2":[3, 3, 6, 6]})

# test the predict function
predict(input_df, reg)
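
As a sanity check on the cells above: the training targets were generated as `y = 1 * x_1 + 2 * x_2 + 3`, so the fitted model should recover those coefficients, and the predictions for the test rows should come out close to 12, 13, 21 and 21. The sketch below repeats the setup so that it runs on its own; the names `check_reg`, `expected` and `predictions` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Same training data as above: y = 1 * x_1 + 2 * x_2 + 3
X = pd.DataFrame({"x_1": [1, 1, 2, 2], "x_2": [1, 2, 2, 3]})
y = np.dot(X, np.array([1, 2])) + 3
check_reg = LinearRegression().fit(X, y)

# The data is noise-free, so the fit should match up to floating point error
print(check_reg.coef_, check_reg.intercept_)

# Compare the model's predictions with the closed-form values
input_df = pd.DataFrame({"x_1": [3, 4, 6, 6], "x_2": [3, 3, 6, 6]})
expected = input_df["x_1"] + 2 * input_df["x_2"] + 3
predictions = check_reg.predict(input_df)
print(np.allclose(predictions, expected))
```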

## Bringing to Spark

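We can bring the `predict()` function to Spark by calling Fugue's `transform()` and passing the Spark session as the `engine`. The `schema` argument declares the output columns: `*` keeps every input column and `predicted:double` adds the new one. Extra arguments to `predict()` are supplied through `params`.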

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [None]:
from fugue import transform

# create a spark dataframe
sdf = spark.createDataFrame(input_df)

# use Fugue transform to switch execution to Spark
result = transform(
    df=sdf,
    using=predict,
    schema="*,predicted:double",
    params=dict(model=reg),
    engine=spark
)

# display results
print(type(result))
result.show()

## Partitioning

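This example shows what `transform()` does with partitioned input. With `partition={"by": "col1"}`, the data is split by the values of `col1` and the function is applied to each partition separately, so each call sees the rows of one group.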

In [None]:
df = pd.DataFrame({"col1": ["a","a","a","b","b","b"], 
                   "col2": [1,2,3,4,5,6]})
df

In [None]:
from typing import Any, List, Dict

def min_max(df: pd.DataFrame) -> List[Dict[str, Any]]:
    # called once per partition, so df holds the rows of a single col1 group
    return [{"group": df.iloc[0]["col1"],
             "max": df["col2"].max(),
             "min": df["col2"].min()}]

In [None]:
res = transform(
    df=df,
    using=min_max,
    schema="group:str, max:int, min:int",
    partition={"by": "col1"},
    engine=spark
)
res.show()
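
What the partitioned call computes can be approximated in plain pandas: split the DataFrame by `col1`, call the function once per group, and stack the returned rows. This is only an analogy for the logic, not a description of how Fugue or Spark executes it; the explicit loop and the names `min_max_rows`, `rows` and `local_res` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["a", "a", "a", "b", "b", "b"],
                   "col2": [1, 2, 3, 4, 5, 6]})

# Illustrative per-group function: receives the rows of one col1 value
def min_max_rows(part: pd.DataFrame) -> list:
    return [{"group": part.iloc[0]["col1"],
             "max": int(part["col2"].max()),
             "min": int(part["col2"].min())}]

# Split by col1, apply the function to each group, collect the rows
rows = []
for _, part in df.groupby("col1"):
    rows.extend(min_max_rows(part))

local_res = pd.DataFrame(rows)
print(local_res)
```

Each group contributes one output row here because the function returns a single-element list. On Spark the order of the output rows is not guaranteed, so results should be compared by group rather than by position.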