# Using Type Hints (15 mins)

One of Fugue's core philosophies is to adapt to data practitioners, letting them define their logic in whatever grammar is most native to them. In practice, this:

* reduces boilerplate code
* reduces framework lock-in
* increases maintainability
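
As a minimal sketch of what "native grammar" can mean, the function below uses only built-in Python types in its annotations and imports nothing from Fugue. The function name `add_length` and the `text` field are hypothetical, chosen purely for illustration.

```python
from typing import Any, Dict, Iterable, List


def add_length(rows: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    """Add a 'length' field to each row using plain Python only."""
    for row in rows:
        # Copy the row and attach the length of its 'text' value
        yield {**row, "length": len(row["text"])}


# The function can be called and checked directly, without any framework
result = list(add_length([{"text": "hello"}, {"text": "hi"}]))
print(result)
# [{'text': 'hello', 'length': 5}, {'text': 'hi', 'length': 2}]
```

Assuming Fugue is installed, a function like this could be handed to Fugue's `transform` with a schema such as `"*, length:int"`, while the logic itself stays testable on its own.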

## Different Ways to Define Functions



In [37]:
import pandas as pd

tweet_data = {'tweet': ['Just saw the most beautiful sunset 🌅', 
                  'I can’t believe I finished my first marathon! 🏃‍♀️🏅', 
                  'This new restaurant in town is amazing 🍔🍟', 
                  'I’m so excited to start my new job tomorrow! 💼', 
                  'Finally got around to reading this amazing book 📖']}

tweets = pd.DataFrame(tweet_data)

In [38]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

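nltk.download('stopwords')
# word_tokenize and WordNetLemmatizer also rely on NLTK's tokenizer and WordNet
# data (e.g. 'punkt' and 'wordnet'); download them the same way if they are missing
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_tweet(tweet: str) -> str:
    # Lowercase the tweet and split it into tokens
    tokens = word_tokenize(tweet.lower())
    # Drop English stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Reduce each remaining token to its lemma
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
    return ' '.join(lemmatized_tokens)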


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kevinkho/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [39]:
from typing import Any, Dict, Iterable, List

from fugue import transform

def preprocess_text(df: pd.DataFrame) -> pd.DataFrame:
    return df.assign(cleaned=df['tweet'].apply(clean_tweet))

def preprocess_text2(df: List[Dict[str,Any]]) -> Iterable[Dict[str,Any]]:
    for row in df:
        row['cleaned'] = clean_tweet(row['tweet'])
        yield row

def preprocess_text3(df: List[List[Any]]) -> pd.DataFrame:
    for row in df:
        row.append(clean_tweet(row[0]))

    return pd.DataFrame(df, columns=['tweet', 'cleaned'])


In [40]:
transform(tweets, preprocess_text, schema="tweet:str, cleaned:str")

Unnamed: 0,tweet,cleaned
0,Just saw the most beautiful sunset 🌅,saw beautiful sunset 🌅
1,I can’t believe I finished my first marathon! ...,’ believe finished first marathon ! 🏃‍♀️🏅
2,This new restaurant in town is amazing 🍔🍟,new restaurant town amazing 🍔🍟
3,I’m so excited to start my new job tomorrow! 💼,’ excited start new job tomorrow ! 💼
4,Finally got around to reading this amazing book 📖,finally got around reading amazing book 📖


In [41]:
transform(tweets, preprocess_text2, schema="tweet:str, cleaned:str")

Unnamed: 0,tweet,cleaned
0,Just saw the most beautiful sunset 🌅,saw beautiful sunset 🌅
1,I can’t believe I finished my first marathon! ...,’ believe finished first marathon ! 🏃‍♀️🏅
2,This new restaurant in town is amazing 🍔🍟,new restaurant town amazing 🍔🍟
3,I’m so excited to start my new job tomorrow! 💼,’ excited start new job tomorrow ! 💼
4,Finally got around to reading this amazing book 📖,finally got around reading amazing book 📖


In [42]:
transform(tweets, preprocess_text3, schema="tweet:str, cleaned:str")

Unnamed: 0,tweet,cleaned
0,Just saw the most beautiful sunset 🌅,saw beautiful sunset 🌅
1,I can’t believe I finished my first marathon! ...,’ believe finished first marathon ! 🏃‍♀️🏅
2,This new restaurant in town is amazing 🍔🍟,new restaurant town amazing 🍔🍟
3,I’m so excited to start my new job tomorrow! 💼,’ excited start new job tomorrow ! 💼
4,Finally got around to reading this amazing book 📖,finally got around reading amazing book 📖


### Highly Custom Business Logic



In [51]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry', 'Isabella', 'Jack'],
    'Age': [25, 35, 45, 22, 29, 38, 50, 31, 27, 42],
    'Salary': [50000, 70000, 90000, 55000, 60000, 80000, 100000, 75000, 65000, 85000],
    'Experience': [3, 7, 12, 2, 5, 9, 15, 6, 4, 10]
})

def custom_logic(df: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    for row in df:
        if row['Age'] <= 30:
            row['title'] = 'Junior'
        elif row['Experience'] >= 10:
            row['title'] = 'Senior'
        elif row['Salary'] > 80000:
            row['title'] = 'High Earner'
        elif row['Age'] > 40:
            row['title'] = 'Mid-Career'
        else:
            row['title'] = 'Other'
    return df

In [52]:
transform(df, custom_logic, schema="*, title:str")

Unnamed: 0,Name,Age,Salary,Experience,title
0,Alice,25,50000,3,Junior
1,Bob,35,70000,7,Other
2,Charlie,45,90000,12,Senior
3,David,22,55000,2,Junior
4,Eve,29,60000,5,Junior
5,Frank,38,80000,9,Other
6,Grace,50,100000,15,Senior
7,Henry,31,75000,6,Other
8,Isabella,27,65000,4,Junior
9,Jack,42,85000,10,Senior


### Distributed API Calls

Below we show how to parallelize expensive API calls easily using Fugue.

In [None]:
import pandas as pd
from typing import List
import time

In [4]:
import requests as re

res = re.get("https://pokeapi.co/api/v2/pokemon/ditto")
res.json()

In [29]:
pokemon_to_get = pd.DataFrame({"pokemon":["ditto", "pikachu", "bulbasaur", "squirtle", "geodude", "charmander", "articuno", "jigglypuff"]})
pokemon_to_get.head()

Unnamed: 0,pokemon
0,ditto
1,pikachu
2,bulbasaur
3,squirtle
4,geodude


In [21]:
base_url = "https://pokeapi.co/api/v2/pokemon/"

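def expensive_api(url: str) -> dict:
    # Simulate a slow endpoint with a 3-second delay before the real request
    time.sleep(3)
    # `re` here is the requests alias imported above, not the regex module
    return re.get(url).json()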

In [16]:
def get_data(pokemon: pd.DataFrame) -> pd.DataFrame:
    names = []
    id_numbers = []
    types = []
    for item in pokemon.values.tolist():
        res = expensive_api(f"{base_url}{item[0]}")
        names.append(res['forms'][0]['name'])
        id_numbers.append(res['id'])
        types.append(res['types'][0]['type']['name'])
    return pd.DataFrame({'name': names, 'id': id_numbers, 'type': types})

get_data(pokemon_to_get)

Unnamed: 0,name,id,type
0,ditto,132,normal
1,pikachu,25,electric
2,bulbasaur,1,grass
3,squirtle,7,water
4,geodude,74,rock


In [26]:
from typing import Iterable, List, Any, Dict

def get_data_native(pokemon: List[List[Any]]) -> Iterable[Dict[str,Any]]:
    for item in pokemon:
        res = expensive_api(f"{base_url}{item[0]}")
        yield {"name": res['forms'][0]['name'],
               "id": res['id'],
               "type": res['types'][0]['type']['name']}

In [27]:
from fugue import transform
transform(pokemon_to_get, get_data_native, schema="name: str, id: int, type: str")

Unnamed: 0,name,id,type
0,ditto,132,normal
1,pikachu,25,electric
2,bulbasaur,1,grass
3,squirtle,7,water
4,geodude,74,rock


In [28]:
transform(pokemon_to_get, 
          get_data_native, 
          schema="name: str, id: int, type: str",
          engine="dask").compute()

Unnamed: 0,name,id,type
0,ditto,132,normal
0,pikachu,25,electric
0,bulbasaur,1,grass
0,squirtle,7,water
0,geodude,74,rock
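
As a rough mental model of the distributed run above (not Fugue's or Dask's actual implementation), the input rows are split into partitions and the same function is applied to each partition in parallel. The sketch below imitates that with a thread pool. `fake_api`, `get_data_stub`, and `run_partitioned` are hypothetical names, and `fake_api` stands in for `expensive_api` so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, Iterable, List


def fake_api(name: str) -> Dict[str, Any]:
    """Hypothetical stand-in for expensive_api so the sketch runs offline."""
    return {"name": name, "length": len(name)}


def get_data_stub(pokemon: List[List[Any]]) -> Iterable[Dict[str, Any]]:
    # Same signature style as get_data_native: rows in, dictionaries out
    for item in pokemon:
        yield fake_api(item[0])


def run_partitioned(rows: List[List[Any]], num_partitions: int) -> List[Dict[str, Any]]:
    # Split the rows into roughly equal chunks (the "partitions")
    partitions = [rows[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Each worker consumes the generator for its own partition
        results = list(pool.map(lambda part: list(get_data_stub(part)), partitions))
    # Flatten the per-partition outputs back into one list
    return [row for part in results for row in part]


rows = [["ditto"], ["pikachu"], ["bulbasaur"], ["squirtle"]]
out = run_partitioned(rows, num_partitions=2)
```

The function body is identical whether it runs on one partition or many, which is why the earlier `transform` call only needed an `engine` argument to run on Dask.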


### Exercise

Create a function that calls the random endpoint of the [Dinosaur API](https://dinosaur-facts-api.shultzlab.com/). It can use any input and output types that Fugue supports.

For input, the suggested ones are:

* `List[Dict[str,Any]]`
* `pd.DataFrame`
* `List[List[Any]]`

For output, the suggested ones are:

* `Iterable[Dict[str,Any]]`
* `pd.DataFrame`
* `List[List[Any]]`

Remember that you can mix and match input and output types.

In [31]:
df = pd.DataFrame({"item": [1,2,3,4,5,6,7,8]})
df.head()

Unnamed: 0,item
0,1
1,2
2,3
3,4
4,5


`transform(df, fn, schema="name:str, description:str")`

## Increased Maintainability