# Problem Set 2.9: Aggregations

[Click here to open this notebook in your browser](https://leifwalsh.github.io/data-analysis-problem-sets/lab/index.html?path=2-pandas-basics/2.9-aggregations/2.9-aggregations.ipynb)

Learn how to summarize numerical data with different aggregations like `sum()` and `mean()`.

In [None]:
import pandas as pd

In this notebook we'll explore historical NFL team stats, from http://www.habitatring.com/standings.csv, via https://github.com/nflverse/nfldata/blob/master/DATASETS.md#standings.

Columns:

* `season`: The year of the NFL season. This represents the whole season, so regular-season games played in January, as well as playoff games, occur in the calendar year after this number.
* `conf`: The conference the team is in. This will be either AFC or NFC.
* `division`: The division the team is in. This will be the value of `conf` followed by either East, North, South, or West.
* `team`: The team.
* `wins`: The number of games the team won in the regular season.
* `losses`: The number of games the team lost in the regular season.
* `ties`: The number of games the team tied in the regular season.
* `pct`: The win rate of the team in the regular season. Equals (wins + 0.5 * ties) / (wins + losses + ties).
* `div_rank`: This is where this team ranks compared to the other teams in the division based on regular season games only. Will be a number 1-4. If the teams have identical pct values, NFL tiebreakers are applied.
* `scored`: The number of points the team has scored in regular season games.
* `allowed`: The number of points the team has allowed to be scored on them in regular season games.
* `net`: Net points scored in regular season games. Equals scored - allowed.
* `sov`: As used in NFL tiebreakers, strength of victory, defined as the combined win rates for teams this team has beaten.
* `sos`: As used in NFL tiebreakers, strength of schedule, defined as the combined win rates for teams this team has played.
* `seed`: The seed earned by the team in its conference for playoff games. Is NA for teams which do not make the playoffs.
* `playoff`: The outcome of the team's playoff run. Is NA for teams which do not make the playoffs, otherwise will be one of LostWC, LostDV, LostCC, LostSB, or WonSB.
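
The `pct` and `net` definitions in the list above can be checked by hand. Here is a minimal sketch that recomputes both on a tiny made-up table; the team codes and numbers are invented for illustration and are not taken from `standings.csv`.

```python
import pandas as pd

# Made-up rows for illustration only; these are not real standings.
toy = pd.DataFrame(
    {
        "team": ["AAA", "BBB"],
        "wins": [10, 7],
        "losses": [6, 8],
        "ties": [0, 1],
        "scored": [400, 310],
        "allowed": [350, 330],
    }
)

# pct = (wins + 0.5 * ties) / (wins + losses + ties)
games = toy["wins"] + toy["losses"] + toy["ties"]
toy["pct"] = (toy["wins"] + 0.5 * toy["ties"]) / games

# net = scored - allowed
toy["net"] = toy["scored"] - toy["allowed"]

print(toy)
```

A tie counts as half a win, so the 7-8-1 team ends up just below 0.5.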


In [None]:
standings = pd.read_csv("standings.csv")
standings.head()

Which years do we have coverage for?

In [None]:
standings["season"].unique()

We've seen one form of aggregation before, the `.describe()` method:

In [None]:
standings.loc[:, ["wins", "losses", "ties"]].describe()

I think it makes sense that the median (50%) is 8 wins, 8 losses, and 0 ties (since each team played 16 regular-season games per season through 2020, and 17 since 2021). But it doesn't have to be this way!

Now, think about the mean. Why are the mean wins and losses the same? Why are they less than 8? Does this make sense to you?

## Aggregation Functions

There are a lot of aggregation functions available on a DataFrame: https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats. We can't cover them all, but we'll try out a few here.

The basic stats you might be familiar with are available. All of these are computed by `.describe()`, but if you need a specific one you can use these:

In [None]:
standings[["wins", "losses", "ties"]].mean()

In [None]:
standings[["wins", "losses", "ties"]].median()

In [None]:
standings[["wins", "losses", "ties"]].max()

In [None]:
standings[["wins", "losses", "ties"]].min()

In [None]:
standings[["wins", "losses", "ties"]].count()
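
To see how these methods line up with the rows of `.describe()`, here is a small sketch on made-up numbers (the `toy` table is an assumption for illustration, not part of the dataset). Each row of the summary should match the corresponding standalone method.

```python
import pandas as pd

# Made-up numbers for illustration only.
toy = pd.DataFrame({"wins": [10, 6, 8, 12], "losses": [6, 10, 8, 4]})

summary = toy.describe()

# Each row label of the summary corresponds to one aggregation method.
print(summary.loc["mean"])   # compare with toy.mean()
print(summary.loc["50%"])    # compare with toy.median()
print(summary.loc["max"])    # compare with toy.max()
print(summary.loc["min"])    # compare with toy.min()
print(summary.loc["count"])  # compare with toy.count()
```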

There are others that `.describe()` doesn't give us:

In [None]:
standings[["wins", "losses", "ties"]].sum()

When you compute an aggregation over a DataFrame with multiple columns, you get a Series back with one value per column, as we just saw. It aggregates over each column individually.

If you aggregate a single column, you get a single number back:

In [None]:
standings["wins"].mean()

In other words, an aggregation function takes a Series and produces a single value. When you call an aggregation function on a DataFrame, pandas applies the same function to each column (each one a Series) and collects the results.
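
As a rough check of that description, the sketch below aggregates a made-up table (the `toy` name and its numbers are assumptions for illustration) both ways: once on the whole DataFrame, and once column by column.

```python
import pandas as pd

# Made-up numbers for illustration only.
toy = pd.DataFrame({"wins": [10, 6, 8], "losses": [6, 10, 8]})

# Aggregating the DataFrame gives a Series with one value per column.
per_column = toy.mean()

# Aggregating one column gives a single number.
single = toy["wins"].mean()

# Doing the same thing by hand, one column at a time.
by_hand = pd.Series({col: toy[col].mean() for col in toy.columns})

print(per_column)
print(single)
print(by_hand)
```

The `per_column` and `by_hand` results hold the same values, which is the sense in which a DataFrame aggregation is just a Series aggregation repeated for each column.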

Since the result is a Series with one value per original column, you can aggregate it a second time. For example, to count the total number of games played, we can do this:

In [None]:
standings[["wins", "losses", "ties"]].sum().sum() / 2

(Why divide by 2?)

## Aggregating subsets

Above, we're getting totals over the whole table. What if we want to break things down by team?

One way is to combine what we already learned (filtering to subsets of data with `.loc[]`) with aggregations:

In [None]:
standings.loc[standings["team"] == "PIT", ["wins", "losses", "ties"]].sum()
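
It can help to see that filter-then-aggregate pattern broken into steps. The sketch below uses a made-up table with invented team codes (`AAA`, `BBB`); none of these values come from `standings.csv`.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame(
    {
        "team": ["AAA", "BBB", "AAA", "BBB"],
        "wins": [10, 6, 9, 7],
        "losses": [6, 10, 7, 8],
        "ties": [0, 0, 0, 1],
    }
)

# Step 1: a boolean Series, True on the rows we want to keep.
mask = toy["team"] == "AAA"

# Step 2: keep only those rows and the columns we care about.
subset = toy.loc[mask, ["wins", "losses", "ties"]]

# Step 3: aggregate the subset; the result is one value per column.
totals = subset.sum()

print(mask)
print(subset)
print(totals)
```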

Now let's do that for every team. First, we need to know which teams exist:

In [None]:
standings["team"].unique()

We can loop over those teams like this:

In [None]:
for team in standings["team"].unique():
    print(team)

Now, can you compute the wins, losses, and ties for each team?

## Examples

In [None]:
# Re-read data just in case:
standings = pd.read_csv("standings.csv")
standings

### Example 1

Compute the sums of wins, losses, and ties for the AFC:

### Example 2

Compute the sums of wins, losses, and ties for every division:

### Example 3

Compute the average number of points scored and allowed per season (and check that the result makes sense):

### Example 4

Count the number of teams in each season (check the pandas documentation linked above: is there a descriptive statistic you can use?):