In [None]:
# HIDDEN
Base.displaysize() = (5, 80)
using DataFrames
using CSV

## Grouping

In this section, we will answer the question:

**What were the most popular male and female names in each year?**

Here's the Baby Names dataset once again:

In [None]:
baby = CSV.read("babynames.csv")
first(baby, 5)
# the first(df, n) method outputs the first n rows of the dataframe df

### Breaking the Problem Down

This question is similar to the one in the previous section: that question restricted names to babies born in 2016, whereas this one asks about names in every year.

We once again decompose this problem into simpler table manipulations.

1. Group the `baby` DataFrame by 'Year' and 'Sex'.
2. For each group, compute the most popular name.

Recognizing which operation is needed for each problem is sometimes tricky. Usually, a convoluted series of steps will signal to you that there might be a simpler way to express what you want. If we didn't immediately recognize that we needed to group, for example, we might write steps like the following:

1. Loop through each unique year.
2. For each year, loop through each unique sex.
3. For each unique year and sex, find the most common name.

There is almost always a better alternative to looping over a `DataFrame`. **In particular, looping over unique values of a DataFrame should usually be replaced with a group.**
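
To make the contrast concrete, the sketch below computes per-year totals twice on a small made-up table (the `toy` data and its counts are illustrative, not taken from `babynames.csv`): once by looping over unique years, and once by iterating over the groups that `groupby` produces. Both approaches build the same dictionary of totals, but the grouped version splits the table only once and states the intent directly.

```julia
using DataFrames

# Hypothetical miniature of the baby names table; the counts are made up.
toy = DataFrame(
    Name  = ["Mary", "Anna", "John", "Mary", "John", "James"],
    Sex   = ["F", "F", "M", "F", "M", "M"],
    Count = [90, 30, 95, 70, 85, 50],
    Year  = [1880, 1880, 1880, 1881, 1881, 1881],
)

# Loop version: filter the whole table once per unique year.
loop_totals = Dict{Int,Int}()
for year in unique(toy.Year)
    rows = toy[toy.Year .== year, :]
    loop_totals[year] = sum(rows.Count)
end

# Grouped version: split the table once, then summarize each group.
group_totals = Dict(first(g.Year) => sum(g.Count) for g in groupby(toy, :Year))
```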

### Grouping

To group a `DataFrame` by a column, we can use the `groupby()` function from `DataFrames`. It returns a single `GroupedDataFrame`, a collection that holds one sub-`DataFrame` for each distinct value of the specified key.

In [4]:
groupby(baby, :Year)

| Row | Name | Sex | Count | Year |
| --- | --- | --- | --- | --- |
|  | String | String | Int64 | Int64 |
| 1 | Mary | F | 9217 | 1884 |
| 2 | Anna | F | 3860 | 1884 |
| 3 | Emma | F | 2587 | 1884 |
| 4 | Elizabeth | F | 2549 | 1884 |
| 5 | Minnie | F | 2243 | 1884 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |

| Row | Name | Sex | Count | Year |
| --- | --- | --- | --- | --- |
|  | String | String | Int64 | Int64 |
| 1 | Mary | F | 8012 | 1883 |
| 2 | Anna | F | 3306 | 1883 |
| 3 | Emma | F | 2367 | 1883 |
| 4 | Elizabeth | F | 2255 | 1883 |
| 5 | Minnie | F | 2035 | 1883 |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
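
As a minimal sketch of what a `GroupedDataFrame` holds, the following builds a small made-up table (the `toy` data is hypothetical) and inspects its groups. It assumes only that the result of `groupby` supports `length` and iteration; the order of the groups is not relied upon.

```julia
using DataFrames

# Hypothetical data; the counts are made up.
toy = DataFrame(
    Name  = ["Mary", "Anna", "John", "Mary", "John"],
    Count = [90, 30, 95, 70, 85],
    Year  = [1880, 1880, 1880, 1881, 1881],
)

groups = groupby(toy, :Year)

# One group per distinct year.
n_groups = length(groups)

# Each group contains only the rows that share a single key value,
# so the first Year in a group identifies the whole group.
group_sizes = Dict(first(g.Year) => nrow(g) for g in groups)
```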


More simply, to group by a column and apply an aggregation function in a single step, we use `by()`:

In [3]:
baby_by_year = by(baby, :Year, :Count => length)

| Row | Year | Count_length |
| --- | --- | --- |
|  | Int64 | Int64 |
| 1 | 1884 | 2297 |
| 2 | 1885 | 2294 |
| 3 | 1886 | 2392 |
| 4 | 1887 | 2373 |
| 5 | 1888 | 2651 |
| ⋮ | ⋮ | ⋮ |


`by()` groups the `DataFrame`, applies the given aggregation function to the selected column within each group, and returns the results as a `DataFrame`. We can now sort the resulting `DataFrame` by year and take every 20th row using the `:` range notation as before:

In [5]:
sort!(baby_by_year, :Year)
baby_by_year[1:20:end, :]

| Row | Year | Count_length |
| --- | --- | --- |
|  | Int64 | Int64 |
| 1 | 1880 | 2000 |
| 2 | 1900 | 3730 |
| 3 | 1920 | 10755 |
| 4 | 1940 | 8961 |
| 5 | 1960 | 11924 |
| ⋮ | ⋮ | ⋮ |
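
The `by()` calls in this section assume an older release of DataFrames.jl. In more recent releases (1.0 and later, to our knowledge) `by` is no longer available, and the same group-and-aggregate step is written as `combine` applied to the result of `groupby`. A minimal sketch on a small made-up table (the `toy` data is hypothetical):

```julia
using DataFrames

# Hypothetical data, deliberately not ordered by year.
toy = DataFrame(
    Name  = ["Mary", "Anna", "John", "John", "Mary"],
    Count = [70, 30, 95, 85, 90],
    Year  = [1881, 1880, 1880, 1881, 1880],
)

# Counterpart of by(toy, :Year, :Count => length):
# count the rows in each year.
names_per_year = combine(groupby(toy, :Year), :Count => length)

# The order of the groups is not guaranteed, so sort by the key
# before slicing the result.
sort!(names_per_year, :Year)
```

The pair `:Count => length` is the same one passed to `by()`, so switching between the two forms should only change the outer call.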


### Grouping on Multiple Columns

As we've seen in Data 8, we can group on multiple columns to get groups based on unique pairs of values. To do this, pass a vector of column names to `by()`.

In [6]:
grouped_counts = by(baby, [:Year, :Sex], :Count => sum)

| Row | Year | Sex | Count_sum |
| --- | --- | --- | --- |
|  | Int64 | String | Int64 |
| 1 | 1884 | F | 129020 |
| 2 | 1884 | M | 114443 |
| 3 | 1885 | F | 133055 |
| 4 | 1885 | M | 107799 |
| 5 | 1886 | F | 144533 |
| ⋮ | ⋮ | ⋮ | ⋮ |


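The code above computes the total number of babies born for each year and sex. Let's now group by multiple columns to compute the most popular name for each year and sex. We call `by()` with multiple columns as we did before, but this time we also pass multiple aggregations: the maximum of `Count` and the first `Name` in each group. Taking the first name yields the most popular one only because the rows within each year and sex appear to be ordered from the highest count to the lowest in this dataset, as the tables above suggest. Since the groups themselves are not ordered by year, we use `sort!()` on the resulting `DataFrame`: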

In [7]:
most_popular_baby_names = by(baby, [:Year, :Sex], [:Count => maximum, :Name => first])
sort!(most_popular_baby_names, :Year)

| Row | Year | Sex | Count_maximum | Name_first |
| --- | --- | --- | --- | --- |
|  | Int64 | String | Int64 | String |
| 1 | 1880 | F | 7065 | Mary |
| 2 | 1880 | M | 9655 | John |
| 3 | 1881 | F | 6919 | Mary |
| 4 | 1881 | M | 8769 | John |
| 5 | 1882 | F | 8148 | Mary |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
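
The `first` aggregation above depends on the row order inside each group. As a sketch of an alternative that does not, the following picks the name at the position of the largest count using `argmax`. It is written with `combine` over `groupby`, which assumes a more recent DataFrames.jl release than the `by()` calls in this section, and it runs on a small made-up table (`toy` is hypothetical and deliberately not sorted by count).

```julia
using DataFrames

# Hypothetical data: within each year and sex, the most popular
# name is not always the first row.
toy = DataFrame(
    Name  = ["Anna", "Mary", "William", "John", "Mary", "Anna", "John", "William"],
    Sex   = ["F", "F", "M", "M", "F", "F", "M", "M"],
    Count = [30, 90, 80, 95, 70, 20, 85, 60],
    Year  = [1880, 1880, 1880, 1880, 1881, 1881, 1881, 1881],
)

# For each (Year, Sex) group, locate the row with the largest count
# and return its name and count. On ties, argmax picks the first maximum.
top = combine(groupby(toy, [:Year, :Sex])) do g
    i = argmax(g.Count)
    (Name = g.Name[i], Count = g.Count[i])
end

sort!(top, [:Year, :Sex])
```

On the 1880 female group, taking the first name would return Anna, while the `argmax` version returns Mary, the name with the larger count.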


## In Conclusion

We now have the most popular baby names for each sex and year in our dataset, and we have learned to express the following operations in `DataFrames`:

| Operation | `DataFrames` |
| --------- | -------  |
| Group | `groupby(df, label)` |
| Group and aggregate | `by(df, label, col => func)` |
| Group on multiple columns and aggregate with multiple functions | `by(df, [label1, label2], [col1 => func1, col2 => func2])` |