# Grouping in pandas

Sometimes all you need is a good example. Or a copy-paste of a [good example](https://realpython.com/pandas-groupby/) into a notebook you can run yourself.

Let's start by importing the U.S. Congress dataset:

In [None]:
import pandas as pd

dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
}
df = pd.read_csv("files/legislators-historical.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["birthday", "last_name"],
    parse_dates=["birthday"]
)

df.tail()

Note how importing this way minimizes cleanup later on. It is a good approach for data you already know and are familiar with. If you don't know the data yet, simply import it first and clean up later.
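If you go the import-first, clean-up-later route, it could look roughly like the sketch below. The two CSV rows are made up so the cell runs without the real file, and `raw` and `category_cols` are names chosen just for this illustration.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV file, so the sketch runs on its own.
raw_csv = io.StringIO(
    "last_name,first_name,birthday,gender,type,state,party\n"
    "Adams,John,1801-02-03,M,rep,PA,Whig\n"
    "Brown,Mary,1950-07-15,F,sen,NY,Democrat\n"
)

# Step 1: plain import, no dtype hints.
raw = pd.read_csv(raw_csv)

# Step 2: clean up afterwards, once you have looked at the columns.
category_cols = ["first_name", "gender", "type", "state", "party"]
raw = raw.astype({col: "category" for col in category_cols})
raw["birthday"] = pd.to_datetime(raw["birthday"])

print(raw.dtypes)
```

The end result has the same column types as the one-step import; you just get to inspect the raw data before deciding on them.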

First question: how many legislators are there per state?

In [None]:
n_by_state = df.groupby("state", observed=True)["last_name"].count()
n_by_state.head(10)

You can also group on two columns.

In [None]:
df.groupby(["state", "gender"], observed=True)["last_name"].count()

Fun fact: when you group on two columns in SQL (say "name" and "year"), you get one flat row for every "name-year" combination. In pandas the group keys become a multi-index, meaning you can still address each level of the grouping (here "state" and "gender") on its own.

In [None]:
n_by_state_gender = df.groupby(["state", "gender"], observed=True)["last_name"].count()
print(type(n_by_state_gender))
print(n_by_state_gender.index[:5])
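To make "addressing each level" concrete, here is a small sketch. The `toy` dataframe is invented for illustration so the cell runs without the Congress file; it uses plain string columns, so `observed=True` is not needed.

```python
import pandas as pd

# Invented mini-dataset, only meant to mimic the shape of the real one.
toy = pd.DataFrame({
    "state": ["PA", "PA", "NY", "NY", "NY", "VA"],
    "gender": ["M", "F", "M", "M", "F", "M"],
    "last_name": ["Adams", "Brown", "Clark", "Davis", "Evans", "Frank"],
})

counts = toy.groupby(["state", "gender"])["last_name"].count()

# Outer level: everything for one state.
print(counts.loc["NY"])

# Inner level: one gender across all states.
print(counts.xs("F", level="gender"))

# Pivot the inner level into columns; missing combinations become 0.
print(counts.unstack("gender", fill_value=0))
```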

Another alternative is to drop the multi-index altogether and get the group keys back as ordinary columns.

In [None]:
df.groupby(["state", "gender"], observed=True, as_index=False)["last_name"].count()
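As far as the resulting table goes, `as_index=False` should give you the same thing as grouping normally and calling `reset_index()` afterwards. A quick sketch on an invented `toy` dataframe (not the Congress data) to compare the two:

```python
import pandas as pd

# Invented mini-dataset so the cell runs on its own.
toy = pd.DataFrame({
    "state": ["PA", "PA", "NY", "NY", "NY", "VA"],
    "gender": ["M", "F", "M", "M", "F", "M"],
    "last_name": ["Adams", "Brown", "Clark", "Davis", "Evans", "Frank"],
})

# Variant 1: ask groupby not to build an index from the group keys.
flat_a = toy.groupby(["state", "gender"], as_index=False)["last_name"].count()

# Variant 2: build the multi-index first, then turn it back into columns.
flat_b = toy.groupby(["state", "gender"])["last_name"].count().reset_index()

print(flat_a)
print(flat_b)
```

Which one you pick is mostly a matter of taste; `reset_index()` is handy when you only decide afterwards that you want flat columns.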

So far we have always counted right away and stored only the result of that count. We can, however, also keep the groups themselves, without aggregating yet, and store them in a variable.

In [None]:
by_state = df.groupby("state", observed=True)
print(by_state)

When doing grouping operations, keep the "split-apply-combine" pattern in mind:

1. **Split** a table into groups.
2. **Apply** some operations to each of those smaller tables.
3. **Combine** the results.
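The three steps above can be spelled out by hand. This is only a sketch of the idea, not how pandas does it internally, and the `toy` dataframe is invented so the cell runs on its own:

```python
import pandas as pd

# Invented mini-dataset so the cell runs on its own.
toy = pd.DataFrame({
    "state": ["PA", "PA", "NY", "NY", "NY", "VA"],
    "last_name": ["Adams", "Brown", "Clark", "Davis", "Evans", "Frank"],
})

# Split: one small dataframe per state.
pieces = {state: frame for state, frame in toy.groupby("state")}

# Apply: count the last names within each small dataframe.
counted = {state: frame["last_name"].count() for state, frame in pieces.items()}

# Combine: collect the per-group results into one Series.
manual = pd.Series(counted, name="last_name")
manual.index.name = "state"

# The one-liner that does all three steps at once.
shortcut = toy.groupby("state")["last_name"].count()

print(manual)
print(shortcut)
```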

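A consequence of split-apply-combine is that, as long as you don't apply an operation to the grouped object, nothing gets combined into a new table. That is why printing "by_state" earlier only showed an object description instead of data: it is a `DataFrameGroupBy` object, not a dataframe. You can print its contents when you loop through it: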

In [None]:
for state, frame in by_state:
    print(f"First 2 entries for {state!r}")
    print("------------------------")
    print(frame.head(2), end="\n\n")

The above is probably not an end result, but it's a nice step along the way. Sometimes you need to see your data before you can understand (and analyze) it.

You could also go directly to a group. Let's take the state of Pennsylvania for example:

In [None]:
by_state.groups["PA"]

Or use "get_group" to get the rows of that group back as a dataframe.

In [None]:
by_state.get_group("PA")

And to close the loop: can we get back to counting all the values in one dataframe?

In [None]:
by_state.get_group("PA")["last_name"].count()
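To convince yourself that the three views of a single group line up, you could run a check like the one below. It uses an invented `toy` dataframe rather than the Congress data, so the numbers are only illustrative.

```python
import pandas as pd

# Invented mini-dataset so the cell runs on its own.
toy = pd.DataFrame({
    "state": ["PA", "PA", "NY", "NY", "NY", "VA"],
    "last_name": ["Adams", "Brown", "Clark", "Davis", "Evans", "Frank"],
})

toy_by_state = toy.groupby("state")
toy_n_by_state = toy_by_state["last_name"].count()

# .groups maps each key to the row labels that belong to it.
pa_labels = toy_by_state.groups["PA"]

# .get_group returns those same rows as a dataframe.
pa_frame = toy_by_state.get_group("PA")

# The row labels agree, and counting one group by hand
# matches the entry in the aggregated result.
print(list(pa_labels) == list(pa_frame.index))
print(pa_frame["last_name"].count() == toy_n_by_state["PA"])
```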

Our example ends here. The one we have been [following online](https://realpython.com/pandas-groupby/) keeps going with some very interesting groupings on parts of columns. Do take a look at that as well!