# `pandas` Part II

This document continues the coverage of data manipulation with `pandas`, including aggregating, reorganizing, and merging data.

## Grouping data

Grouping collects the rows that fall in the same category so that we can aggregate over the rows in each category.

Grouping is useful in
- performing large operations, and
- summarizing trends in a dataset.

Say we have a dataset of baby name frequencies over the years. Perhaps we are first interested in

- how many babies are born in each year? (A good indicator of societal confidence.)

```{figure} ../img/pandas-group-schema.png
---
width: 80%
name: pandas-group
---
Example of aggregation in `pandas` {cite:p}`lau2023learning`
```

In [1]:
import pandas as pd

baby = pd.read_csv('../data/ssa-names.csv.zip')

In [None]:
# number of total babies
baby['Count'].sum()

### Grouping and aggregating

**How many babies are born each year?**

In [None]:
counts_by_year = baby.groupby('Year')['Count'].sum()
counts_by_year

### A general recipe for grouping

```python
(baby                # the dataframe
 .groupby('Year')    # column(s) to group
 ['Count']           # column(s) to aggregate
 .sum()              # how to aggregate
)

# general form (placeholder names, not runnable as written):
# dataframe.groupby(column_name)[columns_to_aggregate].agg(aggregation_function)
```
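
To make the recipe concrete, here is a minimal sketch on a small made-up table (the `toy` dataframe and its values are illustrative assumptions, not the SSA data). It suggests that calling `.sum()` directly and passing `'sum'` to `.agg()` give the same result.

```python
import pandas as pd

# Toy stand-in for the baby names table (values are made up for illustration)
toy = pd.DataFrame({
    'Year':  [2000, 2000, 2001, 2001, 2001],
    'Name':  ['Ann', 'Bob', 'Ann', 'Cy', 'Bob'],
    'Count': [10, 20, 30, 40, 50],
})

# Shorthand: call the aggregation method directly on the grouped column
by_method = toy.groupby('Year')['Count'].sum()

# General form: pass the aggregation to .agg()
by_agg = toy.groupby('Year')['Count'].agg('sum')

print(by_method)
```

Both variables hold a Series indexed by `Year`, with one total per group.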

### Grouping by multiple attributes

**How many female and male babies are born each year?**

In [None]:
counts_by_year_and_sex = baby.groupby(['Year', 'Sex'])['Count'].sum()
counts_by_year_and_sex
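
Grouping by two columns produces a result indexed by both keys (a `MultiIndex`). The sketch below, on a made-up `toy` table, shows one way to look up a single group and to flatten the result back into ordinary columns.

```python
import pandas as pd

# Toy stand-in for the baby names table (values are made up for illustration)
toy = pd.DataFrame({
    'Year':  [2000, 2000, 2000, 2001, 2001],
    'Sex':   ['F', 'M', 'F', 'F', 'M'],
    'Count': [10, 20, 5, 30, 40],
})

by_year_sex = toy.groupby(['Year', 'Sex'])['Count'].sum()

# A single (Year, Sex) combination is selected with a tuple
f_2000 = by_year_sex.loc[(2000, 'F')]

# reset_index() turns the index levels back into regular columns
flat = by_year_sex.reset_index()
print(flat)
```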

### Aggregating by a custom function

**What about the number of unique names in each year?**

In [None]:
def count_unique(names):
    return len(names.unique())

unique_names_by_year = (baby
 .groupby('Year')
 ['Name']
 .agg(count_unique) # aggregate using the custom count_unique function
)
unique_names_by_year
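
For this particular question, `pandas` also has a built-in `nunique()` aggregation. The sketch below compares it with the custom function on a made-up `toy` table. One caveat worth knowing: `unique()` keeps missing values as a distinct entry, while `nunique()` skips them by default, so the two can differ when a column contains `NaN`.

```python
import pandas as pd

# Toy stand-in for the baby names table (values are made up for illustration)
toy = pd.DataFrame({
    'Year': [2000, 2000, 2000, 2001, 2001, 2001],
    'Name': ['Ann', 'Bob', 'Ann', 'Ann', 'Cy', 'Dee'],
})

def count_unique(names):
    return len(names.unique())

custom = toy.groupby('Year')['Name'].agg(count_unique)   # custom aggregation
builtin = toy.groupby('Year')['Name'].nunique()          # built-in equivalent

print(builtin)
```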

## Apply
The `Series.apply()` method applies an arbitrary function to each element of a `Series`.

**Retrieve first letter of name**

In [2]:
def get_first_letter(s):
    return s[0]  # assumes string input

In [3]:
names = baby['Name']
names.apply(get_first_letter)

0          M
1          V
2          E
3          R
4          M
          ..
6311499    Z
6311500    Z
6311501    Z
6311502    Z
6311503    Z
Name: Name, Length: 6311504, dtype: object

**Number of letters in name**
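
One possible approach, sketched on a short made-up Series standing in for `baby['Name']`: pass the built-in `len` to `apply()`, or use the `.str.len()` string accessor.

```python
import pandas as pd

# Toy stand-in for baby['Name'] (values are made up for illustration)
names = pd.Series(['Mary', 'Violet', 'Eve'], name='Name')

lengths_apply = names.apply(len)   # call len() on each name
lengths_str = names.str.len()      # same result via the string accessor

print(lengths_apply)
```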

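### A quick word about `apply()` performance

The `apply()` method is flexible and accommodates custom operations. But it is *slow*: it calls a Python function once per element, so prefer a built-in vectorized operation when one exists.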
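
As a rough illustration of the cost, the sketch below times the same arithmetic done with `apply()` and with a vectorized expression on a synthetic Series (the data and the doubling operation are assumptions chosen for the demo). Exact timings depend on the machine and the `pandas` version.

```python
import time
import pandas as pd

# Synthetic counts (made-up values), large enough for a difference to show
counts = pd.Series(range(1_000_000))

start = time.perf_counter()
doubled_apply = counts.apply(lambda x: x * 2)   # Python function called once per element
apply_seconds = time.perf_counter() - start

start = time.perf_counter()
doubled_vectorized = counts * 2                  # one operation on the whole Series
vectorized_seconds = time.perf_counter() - start

print(f'apply: {apply_seconds:.4f} s, vectorized: {vectorized_seconds:.4f} s')
```

Both approaches produce the same values; only the running time differs.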

## Pivoting
Pivoting is one way to organize and present the result of grouping by two columns and aggregating: the values of one grouping column become the row index, and the values of the other become the columns.

```{figure} ../img/pandas-pivot.png
---
width: 80%
name: pandas-pivot
---
Example of pivoting in `pandas` (Data 100)
```

In [55]:
mf_pivot = pd.pivot_table(
    baby,
    index='Year',   # Column to turn into new index
    columns='Sex',  # Column to turn into new columns
    values='Count', # Column to aggregate for values
    aggfunc='sum')    # Aggregation function
mf_pivot

Sex         F        M
Year                  
1910   352089   164223
1911   372382   193441
1912   504299   383704
1913   566973   461606
1914   696907   596441
...       ...      ...
2017  1405262  1606186
2018  1382391  1570957
2019  1360299  1545678
2020  1303090  1478890
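
To connect pivoting back to grouping, the sketch below builds the same table two ways on a made-up `toy` dataframe: once with `pd.pivot_table`, and once by grouping on both columns and moving the `Sex` level into the columns with `unstack()`. The toy data is an assumption for illustration only.

```python
import pandas as pd

# Toy stand-in for the baby names table (values are made up for illustration)
toy = pd.DataFrame({
    'Year':  [2000, 2000, 2000, 2001, 2001],
    'Sex':   ['F', 'M', 'F', 'F', 'M'],
    'Count': [10, 20, 5, 30, 40],
})

# Pivot directly
pivoted = pd.pivot_table(toy, index='Year', columns='Sex',
                         values='Count', aggfunc='sum')

# Group by both columns, then move the 'Sex' level into the columns
grouped = toy.groupby(['Year', 'Sex'])['Count'].sum().unstack()

print(pivoted)
```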


## Melting
Melting is the "reverse" of pivoting, transforming *wide* tables into *long* tables.

In [None]:
mf_long = mf_pivot.reset_index().melt(
    id_vars='Year', # column that uniquely identifies a row (can be multiple)
    var_name='Sex', # name for the new column created by melting
    value_name='Count' # name for new column containing values from melted columns
)
mf_long

*Why do we need* `reset_index()`?
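
One way to explore this question is to try melting without it. In the sketch below, `wide` is a made-up table shaped like `mf_pivot`, with `Year` stored in the index rather than as a column. Under that assumption, `melt()` cannot find `Year` among the columns and raises a `KeyError`.

```python
import pandas as pd

# Toy wide table shaped like mf_pivot: 'Year' is the index, not a column
wide = pd.DataFrame(
    {'F': [15, 30], 'M': [20, 40]},
    index=pd.Index([2000, 2001], name='Year'),
)
wide.columns.name = 'Sex'

try:
    wide.melt(id_vars='Year')
    melt_failed = False
except KeyError:
    # 'Year' lives in the index, so melt does not see it as a column
    melt_failed = True

# After reset_index(), 'Year' is an ordinary column and melt works
long_table = wide.reset_index().melt(
    id_vars='Year', var_name='Sex', value_name='Count'
)
print(long_table)
```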

## Practice 3

Using the baby names data, find the name with the most occurrences in each year for each sex.

Using the meteorite data from the `Meteorite_Landings.csv` file, 

1. use `groupby` to examine the number of meteorites recorded each year.
2. use `groupby` to find the heaviest meteorite from each year and report its name and mass.
3. create a pivot table that shows for each year
    - the number of meteorites, and
    - the 95th percentile of meteorite mass.
4. create a pivot table to compare for each year
    - the 5th, 25th, 50th, 75th, and 95th percentiles of the mass column for the meteorites that were found versus those observed falling.
5. melt the two tables above to create a *long*-format table.

In [40]:
meteor = pd.read_csv('../data/Meteorite_Landings.csv')

In [41]:
meteor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45716 entries, 0 to 45715
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   name         45716 non-null  object 
 1   id           45716 non-null  int64  
 2   nametype     45716 non-null  object 
 3   recclass     45716 non-null  object 
 4   mass (g)     45585 non-null  float64
 5   fall         45716 non-null  object 
 6   year         45425 non-null  float64
 7   reclat       38401 non-null  float64
 8   reclong      38401 non-null  float64
 9   GeoLocation  38401 non-null  object 
dtypes: float64(4), int64(1), object(5)
memory usage: 3.5+ MB
