# Using Type Hints (15 mins)

[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://slack.fugue.ai)

One of Fugue's core philosophies is adapting to data practitioners, letting them define their logic in the grammar they are most comfortable with. In some cases, native Python is the most natural way to express operations. By abstracting away framework-specific code, we can focus on the pure business logic.

In effect, this:

* reduces boilerplate code
* reduces framework lock-in
* increases maintainability

## Different Ways to Define Functions

Let's take a look at the following operation, where we preprocess a list of tweets before feeding them into a machine learning model. First, we create some mock data.

In [1]:
import pandas as pd

tweet_data = {'tweet': ['Just saw the most beautiful sunset 🌅', 
                  'I can’t believe I finished my first marathon! 🏃‍♀️🏅', 
                  'This new restaurant in town is amazing 🍔🍟', 
                  'I’m so excited to start my new job tomorrow! 💼', 
                  'Finally got around to reading this amazing book 📖']}

tweets = pd.DataFrame(tweet_data)

In order to preprocess, it's easiest to focus on one tweet at a time. We use the `nltk` library to remove stop words and lemmatize the remaining words. All of this code is just meant to operate on one string. In practice, this function can be more complicated because of cleaning around capitalization, punctuation, symbols, etc.

Similar operations can also be applied to parse for phone numbers or e-mail addresses.
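
The same one-string-at-a-time pattern applies to the extraction tasks mentioned above. As a minimal sketch, the function below pulls e-mail-like substrings out of a single string using only the standard library. The name `extract_emails` and the deliberately simple pattern are illustrative assumptions, not part of Fugue or `nltk`; real-world e-mail matching usually needs a stricter rule.

```python
import re as regex  # aliased because a later cell binds the name `re` to `requests`

# Simplified, illustrative pattern: local part, "@", domain, dot, top-level domain
EMAIL_PATTERN = regex.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text: str) -> str:
    """Return all e-mail-like substrings found in `text`, comma-separated."""
    return ",".join(EMAIL_PATTERN.findall(text))

found = extract_emails("Contact a@b.com or c.d@example.org today")
```

Because it takes and returns a plain `str`, a function like this can be applied to a column in the same way as `clean_tweet`.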

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

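# word_tokenize and WordNetLemmatizer also rely on NLTK's tokenizer and
# WordNet resources; fetch them with nltk.download() if they are missing.
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_tweet(tweet: str) -> str:
    # Lowercase the tweet and split it into tokens
    tokens = word_tokenize(tweet.lower())
    # Drop stop words, then reduce the remaining words to their lemmas
    filtered_tokens = [token for token in tokens if token not in stop_words]
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    cleaned_tweet = ' '.join(lemmatized_tokens)
    return cleaned_tweet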


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kevinkho/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Now, people who are very familiar with Pandas may immediately write a function with Pandas in and Pandas out. But there are other ways to run the above function on our mock data. Here, we look at three possible definitions. The second one is interesting because it has no dependency on Pandas at all.

It may seem like Pandas is the simplest for now, and that we shouldn't need to bother with the other ways. But we'll see later that some of these may be a better fit for other use cases.

In [4]:
from typing import List, Dict, Iterable, Any

def preprocess_text(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(cleaned=df['tweet'].apply(clean_tweet))

def preprocess_text2(df: List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
    for row in df:
        row['cleaned'] = clean_tweet(row['tweet'])
        yield row

def preprocess_text3(df: List[List[Any]]) -> pd.DataFrame:
    for row in df:
        row.append(clean_tweet(row[0]))

    return pd.DataFrame(df, columns=['tweet', 'cleaned'])


These are all still native Python functions. We can run them without any concern for Fugue.

In [6]:
preprocess_text(tweets)

Unnamed: 0,tweet,cleaned
0,Just saw the most beautiful sunset 🌅,saw beautiful sunset 🌅
1,I can’t believe I finished my first marathon! ...,’ believe finished first marathon ! 🏃‍♀️🏅
2,This new restaurant in town is amazing 🍔🍟,new restaurant town amazing 🍔🍟
3,I’m so excited to start my new job tomorrow! 💼,’ excited start new job tomorrow ! 💼
4,Finally got around to reading this amazing book 📖,finally got around reading amazing book 📖


The second definition returns a generator, so we need to call `list()` to see the results.

In [8]:
list(preprocess_text2(tweets.to_dict('records')))

[{'tweet': 'Just saw the most beautiful sunset 🌅',
  'cleaned': 'saw beautiful sunset 🌅'},
 {'tweet': 'I can’t believe I finished my first marathon! 🏃\u200d♀️🏅',
  'cleaned': '’ believe finished first marathon ! 🏃\u200d♀️🏅'},
 {'tweet': 'This new restaurant in town is amazing 🍔🍟',
  'cleaned': 'new restaurant town amazing 🍔🍟'},
 {'tweet': 'I’m so excited to start my new job tomorrow! 💼',
  'cleaned': '’ excited start new job tomorrow ! 💼'},
 {'tweet': 'Finally got around to reading this amazing book 📖',
  'cleaned': 'finally got around reading amazing book 📖'}]
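
To see why the generator has to be consumed, here is a small standalone sketch. The function `add_length` and its sample rows are hypothetical stand-ins for `preprocess_text2`; the point is that no row is processed until something iterates over the result.

```python
from typing import Any, Dict, Iterable, List

def add_length(rows: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    # Same shape as preprocess_text2: update each row, then yield it
    for row in rows:
        row["length"] = len(row["text"])
        yield row

rows = [{"text": "hello"}, {"text": "hi"}]

lazy = add_length(rows)               # creates the generator; the body has not run yet
untouched = "length" not in rows[0]   # True: no row has been processed so far

result = list(lazy)                   # iterating runs the loop and collects every row
```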

In [10]:
preprocess_text3(tweets.values.tolist())

Unnamed: 0,tweet,cleaned
0,Just saw the most beautiful sunset 🌅,saw beautiful sunset 🌅
1,I can’t believe I finished my first marathon! ...,’ believe finished first marathon ! 🏃‍♀️🏅
2,This new restaurant in town is amazing 🍔🍟,new restaurant town amazing 🍔🍟
3,I’m so excited to start my new job tomorrow! 💼,’ excited start new job tomorrow ! 💼
4,Finally got around to reading this amazing book 📖,finally got around reading amazing book 📖


## Bringing it to Fugue

All of the above definitions can be interpreted by Fugue. If it works on the Pandas-engine with Fugue, it will automatically work on Spark, Dask, and Ray.

In [40]:
from fugue import transform
transform(tweets, preprocess_text, schema="*, cleaned:str")

Unnamed: 0,tweet,cleaned
0,Just saw the most beautiful sunset 🌅,saw beautiful sunset 🌅
1,I can’t believe I finished my first marathon! ...,’ believe finished first marathon ! 🏃‍♀️🏅
2,This new restaurant in town is amazing 🍔🍟,new restaurant town amazing 🍔🍟
3,I’m so excited to start my new job tomorrow! 💼,’ excited start new job tomorrow ! 💼
4,Finally got around to reading this amazing book 📖,finally got around reading amazing book 📖


Recall that `preprocess_text2` is the definition without any dependency on Pandas. Fugue applies the necessary conversions.

In [41]:
transform(tweets, preprocess_text2, schema="*, cleaned:str")

Unnamed: 0,tweet,cleaned
0,Just saw the most beautiful sunset 🌅,saw beautiful sunset 🌅
1,I can’t believe I finished my first marathon! ...,’ believe finished first marathon ! 🏃‍♀️🏅
2,This new restaurant in town is amazing 🍔🍟,new restaurant town amazing 🍔🍟
3,I’m so excited to start my new job tomorrow! 💼,’ excited start new job tomorrow ! 💼
4,Finally got around to reading this amazing book 📖,finally got around reading amazing book 📖
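
Conceptually, the conversion amounts to turning tabular data into a list of row dictionaries before the call and collecting the yielded dictionaries back into columns afterwards. The helpers below are a pure-Python illustration of that idea only. `rows_to_records`, `records_to_rows`, and `shout` are hypothetical names, and this is not how Fugue implements the conversion internally.

```python
from typing import Any, Dict, Iterable, List

def rows_to_records(columns: List[str], rows: List[List[Any]]) -> List[Dict[str, Any]]:
    # Pair each value with its column name: the List[Dict[str, Any]] input shape
    return [dict(zip(columns, row)) for row in rows]

def records_to_rows(columns: List[str], records: Iterable[Dict[str, Any]]) -> List[List[Any]]:
    # Order each record's values by the column order of the output schema
    return [[record[name] for name in columns] for record in records]

def shout(df: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    # Stand-in for preprocess_text2: adds one new column to every row
    for row in df:
        row["cleaned"] = row["tweet"].upper()
        yield row

records = rows_to_records(["tweet"], [["good morning"], ["good night"]])
table = records_to_rows(["tweet", "cleaned"], shout(records))
```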


For the last one, we'll run it with the Dask engine. If we use the Spark, Dask, or Ray engine, the output will be a DataFrame that matches that engine. Note that some of these frameworks are evaluated lazily, so we may need to call the appropriate methods to trigger computation.

In [42]:
result = transform(tweets, preprocess_text3, schema="*, cleaned:str", engine="dask")
print(type(result))

# Use Dask methods to trigger lazy evaluation
result.compute().head()

Unnamed: 0,tweet,cleaned
0,Just saw the most beautiful sunset 🌅,saw beautiful sunset 🌅
1,I can’t believe I finished my first marathon! ...,’ believe finished first marathon ! 🏃‍♀️🏅
2,This new restaurant in town is amazing 🍔🍟,new restaurant town amazing 🍔🍟
3,I’m so excited to start my new job tomorrow! 💼,’ excited start new job tomorrow ! 💼
4,Finally got around to reading this amazing book 📖,finally got around reading amazing book 📖


## Incremental Adoption of Fugue

One of Fugue's core principles is being non-invasive: it adjusts to developers. This is a good example of how Fugue can be leveraged for a single step in a pipeline. It's not all or nothing. A very common use case is having one or two steps that are very computationally expensive, while the rest can be left in Pandas. We can choose to bring only the necessary step to a cluster, saving money.

For example, look at the code below.

```python
df = pd.read_parquet(...)
result = transform(df, expensive_fn, schema="*, new_col:str", engine="dask")

# This brings Dask DataFrames to Pandas
result.compute()
```

In the same vein, Fugue can be used as a single step inside already existing Spark, Dask, and Ray pipelines to bring a Python/Pandas function over (we'll see more of this later). If a Spark, Dask, or Ray DataFrame is passed, we don't even need to specify the execution engine.

```python
import dask.dataframe as dd

ddf = dd.read_parquet(...)
result = transform(ddf, expensive_fn, schema="*, new_col:str")
result.to_parquet(...)
```

## Use Cases

### Highly Custom Business Logic

It's very common for organizations to have both Pandas-sized and Spark-sized projects. At the same time, there can be highly custom, rule-based business logic. Fugue can be used to keep that code native, so users avoid having to maintain both a Pandas implementation and a Spark implementation. In these cases, the Pandas version is often well tested and stable, so we should reuse it rather than reinvent the wheel. Having a single implementation also helps operationally when the business logic changes.

In [51]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry', 'Isabella', 'Jack'],
    'Age': [25, 35, 45, 22, 29, 38, 50, 31, 27, 42],
    'Salary': [50000, 70000, 90000, 55000, 60000, 80000, 100000, 75000, 65000, 85000],
    'Experience': [3, 7, 12, 2, 5, 9, 15, 6, 4, 10]
})

def custom_logic(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        if row['Age'] <= 30:
            row['title'] = 'Junior'
        elif row['Experience'] >= 10:
            row['title'] = 'Senior'
        elif row['Salary'] > 80000:
            row['title'] = 'High Earner'
        elif row['Age'] > 40:
            row['title'] = 'Mid-Career'
        else:
            row['title'] = 'Other'
    return df

In [52]:
transform(df, custom_logic, schema="*, title:str")

Unnamed: 0,Name,Age,Salary,Experience,title
0,Alice,25,50000,3,Junior
1,Bob,35,70000,7,Other
2,Charlie,45,90000,12,Senior
3,David,22,55000,2,Junior
4,Eve,29,60000,5,Junior
5,Frank,38,80000,9,Other
6,Grace,50,100000,15,Senior
7,Henry,31,75000,6,Other
8,Isabella,27,65000,4,Junior
9,Jack,42,85000,10,Senior


In [11]:
from fugue import transform
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Pass the SparkSession as the engine; the result is a Spark DataFrame
transform(df, custom_logic, schema="*, title:str", engine=spark).show()

### Distributed API Calls

Another use case is mapping over many items that each take a long time to process but produce a small result. These can easily be parallelized by Fugue, and users can get an immediate benefit, even on a single machine. Below, we show how to parallelize expensive API calls using Fugue.

Using type hints can also give us an elegant solution.

In [None]:
import time
import requests as re

Here we use the Pokemon API. The result is very large, but we're only concerned with getting a few fields.

In [4]:
res = re.get("https://pokeapi.co/api/v2/pokemon/ditto")
res.json()

We will hit the API for each of the following Pokemon.

In [29]:
pokemon_to_get = pd.DataFrame({"pokemon":["ditto", "pikachu", "bulbasaur", "squirtle", "geodude", "charmander", "articuno", "jigglypuff"]})
pokemon_to_get.head()

Unnamed: 0,pokemon
0,ditto
1,pikachu
2,bulbasaur
3,squirtle
4,geodude


We are just adding a `time.sleep(3)` to our API call so that we can simulate an expensive API call or operation. It also helps us see the parallelization benefits.

In [21]:
def expensive_api(pokemon):
    base_url = "https://pokeapi.co/api/v2/pokemon/"
    time.sleep(3)
    return re.get(f"{base_url}{pokemon}").json()

How would we do this with a Pandas DataFrame in and a Pandas DataFrame out? The function below is one possible solution. It's a best practice to create the Pandas DataFrame only once rather than appending to it row by row.

In [16]:
def get_data(pokemon: pd.DataFrame) -> pd.DataFrame:
    names = []
    id_numbers = []
    types = []
    for item in pokemon.values.tolist():
        # Each item is a one-element list such as ['ditto']
        res = expensive_api(item[0])
        names.append(res['forms'][0]['name'])
        id_numbers.append(res['id'])
        types.append(res['types'][0]['type']['name'])
    # Build the DataFrame once, after all results are collected
    return pd.DataFrame({'name': names, 'id': id_numbers, 'type': types})

get_data(pokemon_to_get)

Unnamed: 0,name,id,type
0,ditto,132,normal
1,pikachu,25,electric
2,bulbasaur,1,grass
3,squirtle,7,water
4,geodude,74,rock


But if we break free from the Pandas constraint, we can use a more elegant definition. Fugue will handle the conversions.

In [26]:
def get_data_native(pokemon: List[List[Any]]) -> Iterable[Dict[str,Any]]:
    for item in pokemon:
        # expensive_api already prepends the base URL, so pass only the name
        res = expensive_api(item[0])
        yield {"name": res['forms'][0]['name'],
               "id": res['id'],
               "type": res['types'][0]['type']['name']}

In [27]:
transform(pokemon_to_get, get_data_native, schema="name: str, id: int, type: str")

Unnamed: 0,name,id,type
0,ditto,132,normal
1,pikachu,25,electric
2,bulbasaur,1,grass
3,squirtle,7,water
4,geodude,74,rock


Now, this will also work on Spark, Dask, and Ray. These engines split the DataFrame into partitions and run the function on them in parallel, either on a single machine or across a cluster.

In [28]:
transform(pokemon_to_get, 
          get_data_native, 
          schema="name: str, id: int, type: str",
          engine="dask").compute()

Unnamed: 0,name,id,type
0,ditto,132,normal
0,pikachu,25,electric
0,bulbasaur,1,grass
0,squirtle,7,water
0,geodude,74,rock
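
The benefit comes from running slow, independent calls side by side instead of one after another. The sketch below illustrates that effect on a single machine with the standard library's thread pool. It is not how Fugue or Dask schedule work, and `slow_lookup` is a hypothetical stand-in for `expensive_api` with a much shorter sleep.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List

def slow_lookup(name: str) -> str:
    # Hypothetical stand-in for an expensive API call
    time.sleep(0.2)
    return name.upper()

def run_sequential(names: List[str]) -> List[str]:
    # One call at a time: total time grows with the number of items
    return [slow_lookup(name) for name in names]

def run_parallel(names: List[str], workers: int = 4) -> List[str]:
    # map() preserves input order even though the calls overlap in time
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(slow_lookup, names))

names = ["ditto", "pikachu", "bulbasaur", "squirtle"]

start = time.perf_counter()
sequential = run_sequential(names)
sequential_seconds = time.perf_counter() - start

start = time.perf_counter()
parallel = run_parallel(names)
parallel_seconds = time.perf_counter() - start

print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

With four workers, the four 0.2-second calls overlap, so the parallel run should take roughly the time of a single call rather than the sum of all four.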


## Exercise

Create a function to call the [Dinosaur API](https://dinosaur-facts-api.shultzlab.com/) random endpoint. It can take any input/output type that Fugue supports.

For input, the suggested ones are:

* `List[Dict[str,Any]]`
* `pd.DataFrame`
* `List[List[Any]]`

For output, the suggested ones are:

* `Iterable[Dict[str,Any]]`
* `pd.DataFrame`
* `List[List[Any]]`

Remember that you can mix and match. Call the random endpoint for each item in the DataFrame below, and return the `name` and `description` of the dinosaur.

In [31]:
df = pd.DataFrame({"item": [1,2,3,4,5,6,7,8]})
df.head()

Unnamed: 0,item
0,1
1,2
2,3
3,4
4,5


The `transform()` call should look like the following. You just need to write the appropriate `fn`.

`transform(df, fn, schema="name:str, description:str")`

## Increased Maintainability

By keeping code as native as possible, and only bringing it to Spark, Dask, or Ray at runtime, the code becomes highly testable and maintainable. First, no framework-specific expertise is needed to maintain the logic: data practitioners who only know Python can read and understand it. Second, the logic can be unit tested without a cluster, so users can iterate as much as they need on a local machine.
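
As a sketch of what such a test can look like, the example below checks a native function with plain assertions and no cluster or Fugue import. `assign_band` and `test_assign_band` are hypothetical names, and the rule set is a simplified variant of `custom_logic` from earlier.

```python
from typing import Any, Dict, List

def assign_band(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Simplified, hypothetical rule set in the style of custom_logic
    for row in df:
        if row["Age"] <= 30:
            row["title"] = "Junior"
        elif row["Experience"] >= 10:
            row["title"] = "Senior"
        else:
            row["title"] = "Other"
    return df

def test_assign_band() -> None:
    # One row per branch of the rule set
    rows = [
        {"Age": 25, "Experience": 3},
        {"Age": 45, "Experience": 12},
        {"Age": 35, "Experience": 7},
    ]
    titles = [row["title"] for row in assign_band(rows)]
    assert titles == ["Junior", "Senior", "Other"]

test_assign_band()  # a test runner such as pytest would discover and call this
```

The same function can then be passed to `transform()` unchanged when it is time to run on a distributed engine.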