# `pandas` Part II

This document continues our coverage of data manipulation with `pandas`, including aggregating, reorganizing, and merging data.

## Grouping data

Grouping collects the rows that fall in the same category so that we can aggregate over the rows in each category.

Grouping is useful for
- performing large operations, and
- summarizing trends in a dataset.

Say we have a dataset of baby name frequencies by year. A natural first question is:

- How many babies were born in each year? (This may be a good indicator of societal confidence.)

```{figure} ../img/pandas-group-schema.png
---
width: 80%
name: pandas-group
---
Example of aggregation in `pandas` {cite:p}`lau2023learning`
```

In [None]:
import pandas as pd

baby = pd.read_csv('../data/ssa-names.csv.zip')

In [None]:
# number of total babies
baby['Count'].sum()

### Grouping and aggregating

**How many babies are born each year?**

In [None]:
counts_by_year = baby.groupby('Year')['Count'].sum()
counts_by_year

### A general recipe for grouping

```python
(baby                # the dataframe
 .groupby('Year')    # column(s) to group
 ['Count']           # column(s) to aggregate
 .sum()              # how to aggregate
)
```
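
To see the recipe at work without the full dataset, here is a minimal sketch on a small made-up table. The `toy` dataframe and its values are hypothetical; only the column names follow the baby names data.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the baby names table
toy = pd.DataFrame({
    'Name':  ['Ava', 'Liam', 'Ava', 'Noah', 'Liam'],
    'Sex':   ['F', 'M', 'F', 'M', 'M'],
    'Count': [10, 20, 30, 40, 50],
    'Year':  [2000, 2000, 2001, 2001, 2001],
})

toy_counts_by_year = (toy   # the dataframe
 .groupby('Year')           # column(s) to group
 ['Count']                  # column(s) to aggregate
 .sum()                     # how to aggregate
)
toy_counts_by_year
```

The result is a `Series` indexed by `Year`, with one aggregated value per group.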

### Grouping by multiple attributes

**How many female and male babies are born each year?**

In [None]:
counts_by_year_and_sex = baby.groupby(['Year', 'Sex'])['Count'].sum()
counts_by_year_and_sex
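
Grouping by two columns produces a result with a two-level index. The sketch below, which uses a hypothetical `toy` table rather than the real data, shows one way to pick out individual groups from such a result with `.loc`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the baby names table
toy = pd.DataFrame({
    'Year':  [2000, 2000, 2001, 2001, 2001],
    'Sex':   ['F', 'M', 'F', 'M', 'M'],
    'Count': [10, 20, 30, 40, 50],
})

toy_by_year_sex = toy.groupby(['Year', 'Sex'])['Count'].sum()

# Select a single (Year, Sex) group with a tuple
one_group = toy_by_year_sex.loc[(2001, 'M')]

# Select all groups for one year; the result is indexed by Sex
one_year = toy_by_year_sex.loc[2001]
```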

### Aggregating by a custom function

**What about the number of unique names by year?**

In [None]:
def count_unique(names):
    return len(names.unique())

unique_names_by_year = (baby
 .groupby('Year')
 ['Name']
 .agg(count_unique) # aggregate using the custom count_unique function
)
unique_names_by_year
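
For this particular question, `pandas` also has a built-in `nunique()` aggregation. The sketch below compares it with the custom function on a hypothetical `toy` table. One difference to keep in mind: `unique()` keeps missing values, whereas `nunique()` drops them by default, so the two can disagree when a column contains `NaN`.

```python
import pandas as pd

# Hypothetical toy data; 'Noah' appears twice in 2001 (once per sex)
toy = pd.DataFrame({
    'Name':  ['Ava', 'Liam', 'Ava', 'Noah', 'Liam', 'Noah'],
    'Sex':   ['F', 'M', 'F', 'M', 'M', 'F'],
    'Count': [10, 20, 30, 40, 50, 5],
    'Year':  [2000, 2000, 2001, 2001, 2001, 2001],
})

def count_unique(names):
    return len(names.unique())

# Custom aggregation function passed to .agg()
custom = toy.groupby('Year')['Name'].agg(count_unique)

# Built-in aggregation that counts distinct non-missing values
builtin = toy.groupby('Year')['Name'].nunique()
```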

## Pivoting
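Pivoting is one way to organize and present the results of grouping by two columns and aggregating: the values of one grouping column become the row index, the values of the other become the column labels, and each cell holds the aggregated value.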

In [None]:
mf_pivot = pd.pivot_table(
    baby,
    index='Year',   # Column to turn into new index
    columns='Sex',  # Column to turn into new columns
    values='Count', # Column to aggregate for values
    aggfunc='sum')  # Aggregation function
mf_pivot
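
To connect pivoting back to grouping, the sketch below builds the same table in two ways on a hypothetical `toy` dataframe: once with `pd.pivot_table`, and once by grouping on both columns and moving the `Sex` level into the columns with `unstack`.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the baby names table
toy = pd.DataFrame({
    'Year':  [2000, 2000, 2001, 2001, 2001],
    'Sex':   ['F', 'M', 'F', 'M', 'M'],
    'Count': [10, 20, 30, 40, 50],
})

# Pivot table: Year as rows, Sex as columns, summed Count as values
toy_pivot = pd.pivot_table(
    toy, index='Year', columns='Sex', values='Count', aggfunc='sum')

# Same layout from a two-column groupby followed by unstack
toy_unstacked = toy.groupby(['Year', 'Sex'])['Count'].sum().unstack('Sex')
```

Both tables have one row per year and one column per sex, and hold the same values.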

## Melting
Melting is the "reverse" of pivoting, transforming *wide* tables into *long* tables.

In [None]:
mf_long = mf_pivot.reset_index().melt(
    id_vars='Year', # column that uniquely identifies a row (can be multiple)
    var_name='Sex', # name for the new column created by melting
    value_name='Count' # name for new column containing values from melted columns
)
mf_long

*Why do we need* `reset_index()`?
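
One way to explore this question is to check where `Year` lives before and after `reset_index()`. The sketch below uses a hypothetical `toy_pivot` table shaped like `mf_pivot`; `id_vars` refers to columns, so `Year` has to be a column before melting.

```python
import pandas as pd

# Hypothetical pivot-style table: Year is the index, Sex values are the columns
toy_pivot = pd.DataFrame(
    {'F': [10, 30], 'M': [20, 90]},
    index=pd.Index([2000, 2001], name='Year'),
)
toy_pivot.columns.name = 'Sex'

# Before reset_index(), 'Year' is the index and not a column
year_is_column_before = 'Year' in toy_pivot.columns

# After reset_index(), 'Year' is an ordinary column
year_is_column_after = 'Year' in toy_pivot.reset_index().columns

toy_long = toy_pivot.reset_index().melt(
    id_vars='Year',
    var_name='Sex',
    value_name='Count',
)
```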

## Practice 3

Using the meteorite data from the `Meteorite_Landings.csv` file, 

1. create a pivot table that shows, for each year,
    - the number of meteorites, and
    - the 95th percentile of meteorite mass.
2. create a pivot table that compares, for each year,
    - the 5th, 25th, 50th, 75th, and 95th percentiles of the mass column for meteorites that were found versus those that were observed falling.
3. melt the two tables above into *long*-format tables.