# Defining Schema (10 mins)

When using Pandas, it's easy to rely on schema inference. In a distributed setting, this is bad practice because inference can be expensive, inaccurate, or both.

In this section we answer:

* Why is schema important for distributed computing?
* What are the ways to define schema in Fugue?
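
As a small illustration of how inference can be inaccurate, the sketch below uses plain Pandas on two hypothetical partitions of the same column. The names `part1` and `part2` are made up for this example. Each partition is inferred on its own, so a missing value in one of them is enough to produce a different type.

```python
import pandas as pd

# Two hypothetical partitions of the same logical column "a".
part1 = pd.DataFrame({"a": [1, 2, 3]})
part2 = pd.DataFrame({"a": [4, None, 6]})  # contains a missing value

# Pandas infers the dtype of each partition independently.
kind1 = part1["a"].dtype.kind  # "i": an integer dtype
kind2 = part2["a"].dtype.kind  # "f": the missing value forces a float dtype

print(part1["a"].dtype, part2["a"].dtype)
```

With an explicit schema, both partitions would be expected to share one type, and a mismatch like this would surface early instead of depending on which data happened to be sampled.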


## Inference can be Expensive

Take a look at the following operation. If we don't supply the output schema, Dask first runs the function on a small sample of data to infer it. For an expensive function, this extra run can easily double the execution time.

In [34]:
import pandas as pd
import dask.dataframe as dd
from time import sleep

pdf = pd.DataFrame([[0],[1],[2],[3],[4],[5],[6],[7]], columns=["a"])
ddf = dd.from_pandas(pdf, npartitions=2)

In [35]:
def add_col(df):
    if df["a"].iloc[0] == 1:
        sleep(5)
    return df.assign(b=1)

In [36]:
%%time
ddf.groupby("a").apply(add_col).compute()

  Before: .apply(func)
  After:  .apply(func, meta={'x': 'f8', 'y': 'f8'}) for dataframe result
  or:     .apply(func, meta=('x', 'f8'))            for series result


CPU times: user 46.4 ms, sys: 4.83 ms, total: 51.2 ms
Wall time: 10 s


|   | a | b |
|---|---|---|
| 1 | 1 | 1 |
| 0 | 0 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 1 |
| 5 | 5 | 1 |
| 6 | 6 | 1 |
| 7 | 7 | 1 |
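
The 10 second wall time above is consistent with the function running once for inference and once for the real work. The toy sketch below shows that pattern with plain Pandas. The helper `run` and the `calls` counter are hypothetical names for this illustration and are not how Dask is implemented internally.

```python
import pandas as pd

calls = {"count": 0}  # tracks how often the user function runs


def add_col(df: pd.DataFrame) -> pd.DataFrame:
    calls["count"] += 1
    return df.assign(b=1)


def run(df, func, schema=None):
    # Hypothetical helper, NOT Dask's implementation: if no schema is given,
    # call func on a one-row sample just to discover the output dtypes.
    if schema is None:
        schema = func(df.head(1)).dtypes.astype(str).to_dict()
    return func(df).astype(schema)


pdf = pd.DataFrame({"a": range(8)})

calls["count"] = 0
run(pdf, add_col)
calls_inferred = calls["count"]  # inference call plus the real call

calls["count"] = 0
run(pdf, add_col, schema={"a": "int64", "b": "int64"})
calls_explicit = calls["count"]  # only the real call

print(calls_inferred, calls_explicit)
```

When the function is slow, every extra call made only to discover the schema adds directly to the total run time.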


In [38]:
%%time
# Now we add the meta argument
ddf.groupby("a").apply(add_col, meta={'a': 'int', 'b': 'int'}).compute()

CPU times: user 30.4 ms, sys: 4.29 ms, total: 34.7 ms
Wall time: 5.03 s


|   | a | b |
|---|---|---|
| 1 | 1 | 1 |
| 0 | 0 | 1 |
| 2 | 2 | 1 |
| 3 | 3 | 1 |
| 4 | 4 | 1 |
| 5 | 5 | 1 |
| 6 | 6 | 1 |
| 7 | 7 | 1 |
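
Besides a dict of column names and types, Dask also accepts an empty Pandas DataFrame with the right column names and dtypes as `meta`. The sketch below builds one. The choice of `int64` is an assumption about the output types wanted here, and the Dask call is left as a comment because it relies on `ddf` and `add_col` from the cells above.

```python
import pandas as pd

# An empty DataFrame that carries only column names and dtypes.
meta = pd.DataFrame(
    {
        "a": pd.Series(dtype="int64"),
        "b": pd.Series(dtype="int64"),
    }
)

print(meta.dtypes)

# Assuming `ddf` and `add_col` from the cells above, this could be passed as:
# ddf.groupby("a").apply(add_col, meta=meta).compute()
```

Building `meta` this way can be convenient when the output types are already available from an existing Pandas DataFrame.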


## Schema in Fugue

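There are three ways to define the output schema of a function in Fugue. Each example below uses the schema expression `*, new:int`, which means all columns of the input DataFrame plus a new column `new` of type `int`.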

In [42]:
df = pd.DataFrame({"a": [2,3,4]})

def add_new_col(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(new=[1,2,3])

### Defining During Runtime

The schema can be passed through the `schema` argument when calling `transform()`.

In [44]:
from fugue import transform

transform(df, add_new_col, schema="*, new:int")

|   | a | new |
|---|---|-----|
| 0 | 2 | 1 |
| 1 | 3 | 2 |
| 2 | 4 | 3 |


### Decorator Approach

The schema can also be attached to the function itself with the `@transformer` decorator, so it no longer needs to be passed to `transform()`.

In [45]:
from fugue import transformer

@transformer(schema="*, new:int")
def add_new_col(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(new=[1,2,3])

In [46]:
add_new_col(df)

|   | a | new |
|---|---|-----|
| 0 | 2 | 1 |
| 1 | 3 | 2 |
| 2 | 4 | 3 |


In [47]:
transform(df, add_new_col)

|   | a | new |
|---|---|-----|
| 0 | 2 | 1 |
| 1 | 3 | 2 |
| 2 | 4 | 3 |


### Schema Hint

A schema hint is a comment placed directly above the function definition. Fugue reads it when the function is passed to `transform()`, and the function itself remains plain Python with no Fugue dependency.

In [48]:
# schema: *, new:int
def add_new_col(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(new=[1,2,3])

transform(df, add_new_col)

|   | a | new |
|---|---|-----|
| 0 | 2 | 1 |
| 1 | 3 | 2 |
| 2 | 4 | 3 |
