# Introduction to Fugue


<img src="img/fugue_architecture.png" align="center" width="800"/>

## PyCaret

[PyCaret](https://github.com/pycaret/pycaret) is a low-code machine learning framework that automates much of the machine learning pipeline. With just a few lines of code, it can train and compare several models on a dataset. In this post, we explore how to scale this capability by running several PyCaret training jobs in a distributed manner on Spark or Dask.

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import pandas as pd

In [None]:
from pycaret.datasets import get_data
df = get_data('titanic')

In [None]:
# Import only the names used below instead of a wildcard import
from pycaret.classification import setup, compare_models, pull
clf = setup(data = df, target = 'Survived', session_id=123, silent = True, verbose=False, html=False)
models = compare_models(fold = 3, sort = "Accuracy", turbo = True, verbose=False)
results = pull().reset_index(drop=True)
results

## Wrapping the Logic in a Function

In [None]:
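def wrapper(df: pd.DataFrame) -> pd.DataFrame:
    # Train and rank models on whatever slice of the data is passed in
    clf = setup(data=df,
                target='Survived',
                session_id=123,
                silent=True,
                verbose=False,
                html=False)
    models = compare_models(fold=3,
                            sort="Accuracy",
                            turbo=True,
                            verbose=False)
    results = pull().reset_index(drop=True)

    # Rename the metrics to short names without spaces or punctuation.
    # "sex" is read from the first row, so it is meaningful only when the
    # data is partitioned by Sex and the output schema lists it.
    return pd.DataFrame(dict(sex=df["Sex"].iloc[0],
                             model=results["Model"],
                             auc=results["AUC"],
                             recall=results["Recall"],
                             precision=results["Prec."],
                             time=results["TT (Sec)"]))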

## Fugue Transform

In [None]:
from fugue import transform

schema = """model:str, auc:float, recall:float, precision:float, time:float"""

res = transform(df, wrapper, schema=schema)
res[0:5]

## Partitioning by Sex

In [None]:
schema = """sex:str, model:str, auc:float, recall:float, precision:float, time:float"""

res = transform(df, wrapper, schema=schema, partition={"by":"Sex"})
res.sort_values("auc")[0:5]
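
A rough mental model for `partition={"by": "Sex"}`: the rows are split on the partition key, the function runs once per group, and the outputs are concatenated. The sketch below mimics that flow with plain pandas. `toy` and `summarize` are hypothetical stand-ins (a tiny made-up dataset and a cheap replacement for `wrapper`), and this is only an illustration, not how Fugue is implemented internally.

```python
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Survived": [0, 1, 1, 1],
})

def summarize(part: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical stand-in for `wrapper`: returns one row per partition.
    return pd.DataFrame({
        "sex": [part["Sex"].iloc[0]],
        "rows": [len(part)],
        "survival_rate": [part["Survived"].mean()],
    })

# Split on the partition key, call the function once per group, then combine.
pieces = [summarize(part) for _, part in toy.groupby("Sex")]
combined = pd.concat(pieces, ignore_index=True)
print(combined)
```

Because the function only ever sees one group at a time, the same function can in principle be handed to a distributed engine without changes, which is the idea the next section relies on.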

## Bringing to Spark

In [None]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

In [None]:
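import numpy as np

# Replace NaN with None so that Spark treats missing values as nulls.
# Passing engine=spark runs the same wrapper on the Spark session created above.
res = transform(df.replace({np.nan: None}),
                wrapper,
                schema=schema,
                partition={"by": "Sex"},
                engine=spark,
                save_path="/tmp/results.parquet")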

## FugueSQL

In [None]:
import fugue_notebook
# Registers the %%fsql cell magic (skip if your environment already loads it)
fugue_notebook.setup()

In [None]:
%%fsql spark
LOAD "/tmp/results.parquet"
PRINT

In [None]:
%%fsql spark
df = LOAD "/tmp/results.parquet"

SELECT sex, AVG(time) AS time
  FROM df
 GROUP BY sex
 PRINT

## Collecting to Local DataFrame

In [None]:
%%fsql spark
df = LOAD "/tmp/results.parquet"

TAKE 5 ROWS FROM df PREPARTITION BY sex PRESORT auc DESC
YIELD LOCAL DATAFRAME AS result

In [None]:
result.native

## Invoking Python Code

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def plotter(df: pd.DataFrame) -> None:
    plt.figure(figsize=(12, 10))
    ax = sns.scatterplot(x=df["precision"], y=df["recall"], hue=df["sex"])
    # Label each point with its model name, nudged slightly to the right
    for line in range(df.shape[0]):
        ax.text(df["precision"].iloc[line] + 0.01, df["recall"].iloc[line],
                df["model"].iloc[line], horizontalalignment='left',
                size='medium', color='black', weight='semibold')

    plt.title('Precision and Recall')
    plt.xlabel('Precision')
    plt.ylabel('Recall')

In [None]:
%%fsql
SELECT * 
  FROM result
 WHERE sex = 'male'

OUTPUT USING plotter