## Better Pandas

This section covers tools that make your experience with pandas a little bit better.

### tqdm: Add Progress Bar to Your Pandas Apply

In [None]:
!pip install tqdm 

If you want a progress bar that tracks your pandas `apply`, try tqdm. Calling `tqdm.pandas()` registers a `progress_apply` method that works like `apply` but displays a progress bar while it runs.

In [1]:
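import time

import pandas as pd
from tqdm import tqdm

df = pd.DataFrame({'a': [1, 2, 3, 4, 5], 'b': [2, 3, 4, 5, 6]})

# Register `progress_apply` on pandas objects
tqdm.pandas()


def func(value):
    time.sleep(1)  # simulate slow work so the progress bar is visible
    return value + 1


df['a'].progress_apply(func)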

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:05<00:00,  1.00s/it]


0    2
1    3
2    4
3    5
4    6
Name: a, dtype: int64
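
`progress_apply` is also available on DataFrames, and `tqdm.pandas()` accepts the usual tqdm keyword arguments such as `desc`. The following is a minimal sketch under those assumptions; the function name `add_columns` and the label text are illustrative only.

```python
import pandas as pd
from tqdm import tqdm

df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [2, 3, 4, 5, 6]})

# Register progress_apply on pandas objects; desc sets the label shown next to the bar
tqdm.pandas(desc="Summing rows")


def add_columns(row: pd.Series) -> int:
    # With axis=1, each call receives one row as a Series
    return row["a"] + row["b"]


row_sums = df.progress_apply(add_columns, axis=1)
```

Here the bar advances once per row, so it is most useful when the function applied to each row is slow.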

[Link to tqdm](https://github.com/tqdm/tqdm).

### fugue: Use pandas Functions on the Spark and Dask Engines

In [None]:
!pip install fugue pyspark

Wouldn't it be nice if you could leverage Spark or Dask to parallelize data science workloads using pandas syntax? Fugue allows you to do exactly that.

Fugue provides the `transform` function, which lets you run pandas functions on the Spark and Dask engines.

In [25]:
import pandas as pd
from fugue import transform
from fugue_spark import SparkExecutionEngine

input_df = pd.DataFrame({"id": [0, 1, 2], "fruit": ["apple", "banana", "orange"]})
map_price = {"apple": 2, "banana": 1, "orange": 3}


def map_price_to_fruit(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    # Return a new DataFrame instead of mutating the input
    return df.assign(price=df["fruit"].map(mapping))


df = transform(
    input_df,
    map_price_to_fruit,
    schema="*, price:int",  # keep the input columns and declare the new one
    params=dict(mapping=map_price),
    engine=SparkExecutionEngine,
)
df.show()

+---+------+-----+
| id| fruit|price|
+---+------+-----+
|  0| apple|    2|
|  1|banana|    1|
|  2|orange|    3|
+---+------+-----+






[Link to fugue](https://github.com/fugue-project/fugue).