# Introduction

In [None]:
import pandas as pd
import numpy as np

One way to obtain Pandas DataFrames (other than reading them from a file or database) is to create them manually, as in the example below. Such small toy examples are useful for studying the exact behavior of API methods or of our own tooling.

In [None]:
transaction_df = pd.DataFrame({
    'amount': [42., 100., 999.],
    'from': ['bob', 'alice', 'bob'],
    'to': ['alice', 'bob', 'alice']
})
transaction_df

Below are some examples of commonly used methods from the Pandas DataFrame API, with prompts to study the API docs. In this context, "study" doesn't mean you should fully understand and master these API methods. It is an invitation to start thinking about how these methods work internally, how Pandas as a toolset is constructed, and how you could create your own tools using similar constructs.

## Selection and Transformation

The following statement selects all transactions with 'alice' as recipient and adds a `mod_amount` column containing twice the transaction amount.

In [None]:
(
    transaction_df
    .loc[lambda df: df['to'] == 'alice']  # keep rows whose recipient is 'alice'
    .assign(mod_amount=lambda df: df['amount'] * 2)  # add a column with twice the amount
)

Study the [API documentation for `.loc[]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html), and the [API documentation for `.assign()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.assign.html).

- What _is_ `loc[]` exactly? Is it a method or function? Something else entirely?
- How does it handle the expression that is passed inside these square brackets?
- How does it have access to the DataFrame's data?
- What is this `lambda` expression?
- What is the `df` within the `lambda` expression?
- Is the original `transaction_df` DataFrame modified by the above statement?

Again, the goal is not to answer these questions right now, but to prepare you for the content of this tutorial.

## Grouping and Aggregating

The following cells first iterate over the transactions grouped by recipient, and then aggregate each group with `sum()`.

In [None]:
for recipient, recipient_df in transaction_df.groupby('to'):
    print(f'{recipient} received a total sum of {recipient_df["amount"].sum()}')

In [None]:
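# numeric_only=True restricts the sum to numeric columns (here 'amount');
# without it, recent Pandas versions also "sum" the string column 'from'.
transaction_df.groupby('to').sum(numeric_only=True)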

Study the [API documentation for `.groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html).

- What is the result of calling `groupby()` on the DataFrame? What is being returned?
- How is it possible that we can use that result in a `for ... in` loop?
- How is the `sum()` aggregation created?
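To make these questions more tangible, the sketch below builds a tiny grouped object in pure Python. It is an illustration under stated assumptions, not the Pandas implementation: `ToyGroupBy` is a hypothetical class, rows are plain dicts, and `sum()` takes the column name as an argument for simplicity.

```python
from collections import defaultdict


class ToyGroupBy:
    """Hypothetical grouped object: iterable, with a sum() aggregation."""

    def __init__(self, rows, key):
        # Bucket the rows by the value found under `key`.
        self._groups = defaultdict(list)
        for row in rows:
            self._groups[row[key]].append(row)

    def __iter__(self):
        # Defining __iter__ is what allows `for ... in` on this object.
        # Each step yields a (group key, rows in that group) pair, in sorted key order.
        for group_key in sorted(self._groups):
            yield group_key, self._groups[group_key]

    def sum(self, column):
        # An aggregation reduces every group to a single value.
        return {
            group_key: sum(row[column] for row in group_rows)
            for group_key, group_rows in self
        }


rows = [
    {'amount': 42., 'from': 'bob', 'to': 'alice'},
    {'amount': 100., 'from': 'alice', 'to': 'bob'},
    {'amount': 999., 'from': 'bob', 'to': 'alice'},
]

grouped = ToyGroupBy(rows, 'to')
for recipient, recipient_rows in grouped:
    print(f'{recipient} has {len(recipient_rows)} incoming transaction(s)')

totals = grouped.sum('amount')
print(totals)
```

Note that the grouped object is reusable: iterating over it and aggregating it are two independent operations on the same intermediate result.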

## Pipelines

The following cell defines a small filter function and applies it to the DataFrame through `.pipe()`.

In [None]:
def select_amounts_greater_than(transaction_df, amount=100):
    """Return only the transactions whose amount exceeds `amount`."""
    return transaction_df.loc[lambda df: df['amount'] > amount]

transaction_df.pipe(select_amounts_greater_than, amount=99)

Study the [API documentation for `.pipe()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pipe.html).

- What is being passed as the first argument to `pipe()`?
- What object or method or function is actually calling the `select_amounts_greater_than()` function?
- What is being returned by `pipe()`?
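One way to reason about these questions is to write a `pipe`-like method yourself. The following is a minimal sketch, assuming a hypothetical `ToyFrame` class that stores rows as dicts; `toy_select_amounts_greater_than` is a toy counterpart of the function above and is likewise invented for this illustration. It does not claim to mirror the actual Pandas source.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: a list of row dicts."""

    def __init__(self, rows):
        self.rows = list(rows)

    def pipe(self, func, *args, **kwargs):
        # The frame calls `func` itself, passing itself as the first argument
        # and forwarding any extra arguments unchanged.
        return func(self, *args, **kwargs)


def toy_select_amounts_greater_than(frame, amount=100):
    # Returns a new ToyFrame; the input frame is left as it was.
    return ToyFrame(row for row in frame.rows if row['amount'] > amount)


toy = ToyFrame([
    {'amount': 42., 'from': 'bob', 'to': 'alice'},
    {'amount': 100., 'from': 'alice', 'to': 'bob'},
    {'amount': 999., 'from': 'bob', 'to': 'alice'},
])

# The function object is passed uncalled; pipe() decides when to call it.
result = toy.pipe(toy_select_amounts_greater_than, amount=99)
print([row['amount'] for row in result.rows])
```

Because `pipe()` returns whatever the function returns, several such calls can be chained as long as each function hands back another frame.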