In [2]:
import pandas as pd
import numpy as np

# GroupBy and Aggregates

This section of the course introduces the split-apply-combine pattern. We will first do this manually, and then learn the `groupby()` method. This method allows us to quickly slice and transform our dataset depending on the keys and logic that we provide.

We will then explore aggregation functions, grouping by multiple keys, and some advanced topics such as combining `groupby()` with additional methods like `transform()`, `filter()`, and `apply()`. This allows for highly customizable operations.

## New Data: Game Sales

In this section, we'll work with a dataset that contains sales information for the most popular games for the Xbox and Playstation consoles. It provides sales in millions of dollars. This dataset is originally from Kaggle, and has been modified by the instructor for the purposes of this section.

https://andybek.com/pandas-games

Let's start by importing the data.

In [3]:
games_url = 'https://andybek.com/pandas-games'

In [4]:
games = pd.read_csv(games_url)

In [5]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3143 entries, 0 to 3142
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          3143 non-null   object 
 1   Platform      3143 non-null   object 
 2   Year          3088 non-null   float64
 3   Genre         3143 non-null   object 
 4   Publisher     3136 non-null   object 
 5   NA_Sales      3143 non-null   float64
 6   EU_Sales      3143 non-null   float64
 7   JP_Sales      3143 non-null   float64
 8   Other_Sales   3143 non-null   float64
 9   Global_Sales  3143 non-null   float64
dtypes: float64(6), object(4)
memory usage: 245.7+ KB


In [6]:
games.head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Kinect Adventures!,X360,2010.0,Misc,Microsoft Game Studios,14.97,4.94,0.24,1.67,21.82
1,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.4
2,Grand Theft Auto V,X360,2013.0,Action,Take-Two Interactive,9.63,5.31,0.06,1.38,16.38
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64


We've got some information on the game name, the console ("Platform"), year of release, genre of game, the game publisher, and sales in different regions as well as overall global sales. Let's get to it!

In [7]:
games["Platform"].unique()

array(['X360', 'PS3', 'PS4', 'XOne'], dtype=object)

## Simple Aggregations Overview

In previous sections, we used simple aggregation functions to calculate things like mean, sum, standard deviation, variance, etc. Let's review in this lecture, in preparation for the `groupby()` method.

Let's first answer "what are the total sales across ALL regions". One way to solve this is to isolate the columns of interest using the `loc[]` indexer.

In [8]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]

Unnamed: 0,NA_Sales,EU_Sales,JP_Sales,Other_Sales
0,14.97,4.94,0.24,1.67
1,7.01,9.27,0.97,4.14
2,9.63,5.31,0.06,1.38
3,9.03,4.28,0.13,1.32
4,9.67,3.73,0.11,1.13
...,...,...,...,...
3138,0.00,0.01,0.00,0.00
3139,0.01,0.00,0.00,0.00
3140,0.01,0.00,0.00,0.00
3141,0.00,0.01,0.00,0.00


We've created a small dataframe slice with just the sales columns. Now we can chain on a `sum()` function to get the sums of all of the columns.

In [9]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()

NA_Sales       1173.30
EU_Sales        793.64
JP_Sales        107.06
Other_Sales     282.75
dtype: float64

Notice that the dimensionality changed when we applied the aggregate `sum()` function. What began as a dataframe of 3143 rows and 4 columns ended up as a series with four entries. This is a feature of aggregate functions in general.

In [10]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].median()

NA_Sales       0.14
EU_Sales       0.07
JP_Sales       0.00
Other_Sales    0.03
dtype: float64

In [11]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].std()

NA_Sales       0.801483
EU_Sales       0.570164
JP_Sales       0.093115
Other_Sales    0.203866
dtype: float64

One thing to note about aggregation: it can be applied in more than one direction. For example, any given column can be collapsed or aggregated across all of its rows. Similarly, any row can be aggregated across all of its columns.

When dealing with dataframes, Pandas defaults to vertical aggregation (`axis=0`). We can change this with the `axis` parameter of the aggregate function. For instance, let's compute the standard deviation across the columns of each row with `axis=1`.

In [12]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].std(axis = 1)

0       6.641358
1       3.594926
2       4.311415
3       3.964602
4       4.286739
          ...   
3138    0.005000
3139    0.005000
3140    0.005000
3141    0.005000
3142    0.005000
Length: 3143, dtype: float64

Notice that after changing the axis of aggregation, we have a series with 3143 entries. We're getting the standard deviation of the sales across the four regions for each individual row (video game).

So two quick takeaways:
1. When we apply aggregation functions, the dimensions of the data frequently change.
2. We can aggregate either horizontally or vertically.
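
Both takeaways can be checked on a tiny example. The sketch below uses a made-up dataframe (the name `toy` and its values are assumptions for illustration only, not part of the games dataset) and compares the shapes before and after aggregating in each direction.

```python
import pandas as pd

# Hypothetical toy data: three games, two regional sales columns.
toy = pd.DataFrame({
    'NA_Sales': [1.0, 2.0, 3.0],
    'EU_Sales': [0.5, 1.5, 2.5],
})

# Default (axis=0): collapse the rows, leaving one value per column.
per_column = toy.sum()

# axis=1: collapse the columns, leaving one value per row.
per_row = toy.sum(axis=1)

# The dataframe is 2-dimensional; both results are 1-dimensional series.
print(toy.shape, per_column.shape, per_row.shape)
```

The vertical sum has one entry per column, while the horizontal sum has one entry per row.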

## Conditional Aggregates

The aggregates that we computed in the previous lecture applied across the entire column or row. Sometimes however, we only want to apply the function to a smaller set of values within the dataframe.

Suppose we want the total sales by region, but only for particular platforms/consoles. How do we do that?

In [13]:
games.Platform.unique()

array(['X360', 'PS3', 'PS4', 'XOne'], dtype=object)

Let's go with the Xbox 360 and PS3 platforms. Find the total sales by region for the X360 and PS3.

Let's start by creating a new dataframe that only includes the Platform and Sales columns.

In [14]:
sales = games.loc[:, ['Platform', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]

In [15]:
sales.head()

Unnamed: 0,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
0,X360,14.97,4.94,0.24,1.67
1,PS3,7.01,9.27,0.97,4.14
2,X360,9.63,5.31,0.06,1.38
3,X360,9.03,4.28,0.13,1.32
4,X360,9.67,3.73,0.11,1.13


How do we separate the platforms? If we apply the `sum()` function here directly, we get the sum across all platforms.

In [16]:
sales.sum(numeric_only=True)

NA_Sales       1173.30
EU_Sales        793.64
JP_Sales        107.06
Other_Sales     282.75
dtype: float64

That's not what we want. Instead, we want *platform-specific* sales for each region. To do that, we'll need to be more specific in our data selection.

One thing we can do is use boolean indexing with `loc[]`, selecting only the rows for the platform we want, and then summing for that platform.

In [17]:
sales.loc[sales.Platform == 'X360']

Unnamed: 0,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
0,X360,14.97,4.94,0.24,1.67
2,X360,9.63,5.31,0.06,1.38
3,X360,9.03,4.28,0.13,1.32
4,X360,9.67,3.73,0.11,1.13
7,X360,8.25,4.30,0.07,1.12
...,...,...,...,...,...
3127,X360,0.00,0.01,0.00,0.00
3128,X360,0.00,0.00,0.01,0.00
3130,X360,0.01,0.00,0.00,0.00
3135,X360,0.00,0.00,0.01,0.00


In [18]:
sales.loc[sales.Platform == 'X360'].sum(numeric_only=True)

NA_Sales       601.05
EU_Sales       280.58
JP_Sales        12.43
Other_Sales     85.54
dtype: float64

We can do the same for PS3.

In [19]:
sales.loc[sales.Platform == 'PS3'].sum(numeric_only=True)

NA_Sales       392.26
EU_Sales       343.71
JP_Sales        79.99
Other_Sales    141.93
dtype: float64

Simple enough, but rather verbose and repetitive. What if we wanted to do it in a more elegant manner with less code? Moreover, what if we had dozens or hundreds of platforms that we wanted to do an aggregate analysis for? This manual platform-by-platform approach is not practical for that. We would need a better approach. And there is such an approach. Stay tuned!

## The Split-Apply-Combine Pattern

Before we get into the `groupby()` method, let's conceptually review the data analysis pattern that we've been applying.

In the previous lecture, we wanted to calculate the sum of sales for all regions for two specific platforms - the X360 and the PS3. We could not just simply apply `sum()` to the dataframe because the resulting total gave us the total sales across all platforms.

To address this issue, we partitioned the data into separate groups and created smaller dataframes containing data from only the platforms we cared about. Once we had those datasets, we applied aggregation functions to calculate our sums. This approach is known as the **split-apply-combine** pattern, coined by Hadley Wickham from Rice University.

The main idea is to break down a big problem into smaller pieces, solve those pieces, and then put it all back together at the end. We've covered the first two steps - split and apply. We have not yet seen combine, but we will! In fact, the `groupby()` method, which we will see in the next lecture, implements ALL THREE of these steps in one single operation. Totally awesome.
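
To make the three steps concrete, here is a minimal by-hand sketch of the whole pattern, including the combine step. The dataframe `toy_sales` and its values are made up for illustration; they stand in for the real *sales* data.

```python
import pandas as pd

# Hypothetical toy data standing in for the sales dataframe.
toy_sales = pd.DataFrame({
    'Platform': ['X360', 'PS3', 'X360', 'PS3'],
    'NA_Sales': [2.0, 1.0, 3.0, 4.0],
    'EU_Sales': [1.0, 2.0, 1.0, 3.0],
})

pieces = {}
for platform in toy_sales['Platform'].unique():
    # Split: keep only the rows for this platform.
    subset = toy_sales.loc[toy_sales['Platform'] == platform]
    # Apply: aggregate the numeric columns of the subset.
    pieces[platform] = subset.sum(numeric_only=True)

# Combine: collect the per-platform results into one dataframe,
# with one row per platform and one column per region.
manual = pd.DataFrame(pieces).T.sort_index()
print(manual)
```

The loop is exactly the repetitive work we did by hand for the X360 and PS3, just generalized to every platform in the column.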

## Introducing the `groupby()` Method

Everything that we have been doing manually earlier in this section, and more, can be done by the `groupby()` method. It is essentially a pivot table for Pandas, which applies the split, apply, and combine steps in one go.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Let's tackle that platform-specific sales question again. All we need to do is identify what we want to group by ("Platform" in this case), and which aggregate function we want to apply!

In [20]:
sales.groupby("Platform").sum()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
PS3,392.26,343.71,79.99,141.93
PS4,96.8,123.7,14.3,43.36
X360,601.05,280.58,12.43,85.54
XOne,83.19,45.65,0.34,11.92


Man that was so easy, and much better than the manual work we had to do earlier. Even better, it combined all of the output in the end. The `groupby()` method is extremely powerful, and there will be many instances in our data analysis journey where we'll want to use it.

We can apply essentially any function that we want. For example, we can find the average sales across all regions for each platform.

In [21]:
sales.groupby("Platform").mean()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
PS3,0.295154,0.258623,0.060188,0.106795
PS4,0.288095,0.368155,0.04256,0.129048
X360,0.475138,0.221802,0.009826,0.067621
XOne,0.390563,0.214319,0.001596,0.055962


It looks like Xbox 360 games sell more in North America, on average, than games on the other platforms. But the average is only one side of the story. Perhaps this value is skewed by some very high-selling games. We can investigate by looking at the median.

In [22]:
sales.groupby("Platform").median()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
PS3,0.12,0.07,0.01,0.03
PS4,0.07,0.08,0.02,0.03
X360,0.17,0.06,0.0,0.02
XOne,0.15,0.07,0.0,0.02


We see that the median North American sales for the Xbox 360 is less than half of the mean, indicating that most games sell below the average and that the average is driven upward by a few games that sell very well.
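
The mean-versus-median comparison can also be made in a single table. The sketch below uses the `agg()` method, which has not been covered at this point in the section and is used here only as a convenience; the data is a made-up example in which one blockbuster title inflates the X360 mean.

```python
import pandas as pd

# Hypothetical toy data: one X360 title sells far more than the rest.
toy_sales = pd.DataFrame({
    'Platform': ['X360', 'X360', 'X360', 'PS3', 'PS3', 'PS3'],
    'NA_Sales': [0.1, 0.2, 9.0, 0.3, 0.4, 0.5],
})

# Compute both statistics per platform; each becomes a column in the result.
summary = toy_sales.groupby('Platform')['NA_Sales'].agg(['mean', 'median'])
print(summary)
```

A mean that sits well above the median, as it does for the X360 rows here, is the signature of a few very large values pulling the average upward.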

Stay tuned, we'll later see more advanced operations such as splitting by multiple keys.

## The DataFrameGroupBy Object

When using `groupby()`, you create what's known as a **DataFrameGroupBy** object. Let's take a look under the hood here.

What happens if we just perform the `groupby()` but do not apply a method/function to it?

In [23]:
sales.groupby("Platform")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff6bdf7dcd0>

What is returned is a DataFrameGroupBy object and a reference to its place in memory. Think of this as a special view into an intermediate output. At this point, Pandas has validated our mapping based on the key that we want to group by, and is awaiting instructions on what to do next. The split has not yet happened, but it will once Pandas knows what to do.

There are four platforms that we are grouping by, and thus the length of the DataFrameGroupBy object is 4.

In [24]:
len(sales.groupby("Platform"))

4

As soon as we apply a function to this object, the computation will kick in. So the key thing to remember here is that applying `groupby()` on a dataframe returns this DataFrameGroupBy object. It does *not* return a dataframe.
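
Even before applying an aggregate, the object can be inspected. The following sketch, on a made-up `toy_sales` dataframe, uses the `ngroups` and `groups` attributes and the `size()` method, none of which appear elsewhere in this section.

```python
import pandas as pd

# Hypothetical toy data with three distinct platforms.
toy_sales = pd.DataFrame({
    'Platform': ['X360', 'PS3', 'X360', 'PS4'],
    'NA_Sales': [2.0, 1.0, 3.0, 4.0],
})

grouped = toy_sales.groupby('Platform')

# The result is a GroupBy object, not a dataframe.
print(type(grouped).__name__)

# Number of distinct groups (the same value len() reports).
print(grouped.ngroups)

# Mapping from each group label to the row labels that belong to it.
print(grouped.groups)

# Number of rows in each group.
print(grouped.size())
```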

## Customizing Index to Group Mappings

So far in this section we have grouped by "Platform", resulting in regional sales totals for all platforms. This is actually equivalent to the following long form version of the syntax.

In [25]:
sales.groupby(sales['Platform']).sum()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
PS3,392.26,343.71,79.99,141.93
PS4,96.8,123.7,14.3,43.36
X360,601.05,280.58,12.43,85.54
XOne,83.19,45.65,0.34,11.92


In fact, the long form is what Pandas is technically doing behind the scenes. The instructor however never uses this form due to its verbosity.

Now, suppose we don't want to analyze all four platforms individually, but instead want to analyze a smaller number of broader groups than appear in the "Platform" column. For example, suppose we want to analyze by Playstation or Xbox.

To accomplish this in Pandas, we **provide a dictionary that maps the index keys to the group labels that we want them to fall under**. As an added bonus, this can be done without changing the underlying data at all.

In [26]:
platform_names = {
    'PS3': 'Playstation',
    'PS4': 'Playstation',
    'X360': 'Xbox',
    'XOne': 'Xbox'
}

In [27]:
platform_names

{'PS3': 'Playstation', 'PS4': 'Playstation', 'X360': 'Xbox', 'XOne': 'Xbox'}

Let's now set Platform as the index of our dataframe. This is important because projection of one label to another in `groupby()` only works on the index of the dataframe. Thus, we must make the column into the index.

In [28]:
sales.set_index('Platform')

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
X360,14.97,4.94,0.24,1.67
PS3,7.01,9.27,0.97,4.14
X360,9.63,5.31,0.06,1.38
X360,9.03,4.28,0.13,1.32
X360,9.67,3.73,0.11,1.13
...,...,...,...,...
X360,0.00,0.01,0.00,0.00
XOne,0.01,0.00,0.00,0.00
XOne,0.01,0.00,0.00,0.00
PS4,0.00,0.01,0.00,0.00


Finally, we use `groupby()` to create our new subgroups. Instead of passing in a column name like we did previously, we pass in the dictionary of platform names.

In [29]:
sales.set_index('Platform').groupby(platform_names).sum()

Unnamed: 0,NA_Sales,EU_Sales,JP_Sales,Other_Sales
Playstation,489.06,467.41,94.29,185.29
Xbox,684.24,326.23,12.77,97.46


Thus, we now get Playstation and Xbox totals across all regions. And we got this without affecting the structure of the underlying data.
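
As a side note, the dictionary approach can be compared with one possible alternative that skips the index step: translating the column with `Series.map()` and grouping by the resulting Series. This alternative is not part of the lecture, and the `toy_sales` data below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with one row per platform.
toy_sales = pd.DataFrame({
    'Platform': ['X360', 'PS3', 'XOne', 'PS4'],
    'NA_Sales': [2.0, 1.0, 3.0, 4.5],
})
platform_names = {
    'PS3': 'Playstation',
    'PS4': 'Playstation',
    'X360': 'Xbox',
    'XOne': 'Xbox',
}

# Approach from the text: move Platform into the index, then group by the dictionary.
via_index = toy_sales.set_index('Platform').groupby(platform_names).sum()

# Alternative: map the column values to group labels and group by that Series.
via_map = toy_sales.groupby(toy_sales['Platform'].map(platform_names))['NA_Sales'].sum()

print(via_index)
print(via_map)
```

Neither approach modifies `toy_sales` itself; both `set_index()` and `map()` return new objects.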

## BONUS - Series `groupby()`

The split-apply-combine pattern that is utilized by `groupby()` is entirely compatible with Series in addition to dataframes.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.groupby.html

In the first example, we'll take a look at global sales by Genre. We start by creating a series for this data.

In [30]:
games.head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Kinect Adventures!,X360,2010.0,Misc,Microsoft Game Studios,14.97,4.94,0.24,1.67,21.82
1,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.4
2,Grand Theft Auto V,X360,2013.0,Action,Take-Two Interactive,9.63,5.31,0.06,1.38,16.38
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64


In [31]:
games.loc[:, ['Genre', 'Global_Sales']].set_index('Genre')

Genre,Global_Sales
Misc,21.82
Action,21.40
Action,16.38
Shooter,14.76
Shooter,14.64
...,...
Role-Playing,0.01
Platform,0.01
Shooter,0.01
Simulation,0.01


Note that this is still technically a dataframe.

In [32]:
type(games.loc[:, ['Genre', 'Global_Sales']].set_index('Genre'))

pandas.core.frame.DataFrame

We can use the `squeeze()` method to convert it into a series, which we'll call *ser*.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.squeeze.html

In [33]:
ser = games.loc[:, ['Genre', 'Global_Sales']].set_index('Genre').squeeze()

In [34]:
ser.head(10)

Genre
Misc       21.82
Action     21.40
Action     16.38
Shooter    14.76
Shooter    14.64
Shooter    14.24
Shooter    14.03
Shooter    13.73
Shooter    13.51
Shooter    13.46
Name: Global_Sales, dtype: float64

In [35]:
type(ser)

pandas.core.series.Series

We now have a series which gives the genre and global sales for each game (note that we have no idea which game it is, but that's not important for this example).

To calculate the average global sales by genre, we can apply the `groupby()` method to this series.

In [36]:
ser.groupby('Genre').mean()

Genre
Action          0.751007
Adventure       0.298289
Fighting        0.604182
Misc            0.550250
Platform        0.651842
Puzzle          0.133636
Racing          0.687854
Role-Playing    0.715804
Shooter         1.412019
Simulation      0.336076
Sports          0.681094
Strategy        0.264333
Name: Global_Sales, dtype: float64

Let's sort it to make it nicer looking.

In [37]:
ser.groupby('Genre').mean().sort_values(ascending = False)

Genre
Shooter         1.412019
Action          0.751007
Role-Playing    0.715804
Racing          0.687854
Sports          0.681094
Platform        0.651842
Fighting        0.604182
Misc            0.550250
Simulation      0.336076
Adventure       0.298289
Strategy        0.264333
Puzzle          0.133636
Name: Global_Sales, dtype: float64

Looks like shooters are quite popular.

As an aside, the following code could have also been used to accomplish the same task by acting on the initial dataframe.

In [38]:
games.groupby('Genre')['Global_Sales'].mean()

Genre
Action          0.751007
Adventure       0.298289
Fighting        0.604182
Misc            0.550250
Platform        0.651842
Puzzle          0.133636
Racing          0.687854
Role-Playing    0.715804
Shooter         1.412019
Simulation      0.336076
Sports          0.681094
Strategy        0.264333
Name: Global_Sales, dtype: float64
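
When grouping a Series by its index, the index can be referenced in more than one way. The sketch below, on a made-up series `ser_toy`, shows three equivalent spellings: the index name as a key (as used above), the `level` parameter with the name, and the `level` parameter with the position.

```python
import pandas as pd

# Hypothetical toy series indexed by genre, mirroring the structure of `ser`.
ser_toy = pd.Series(
    [3.0, 1.0, 5.0, 4.0],
    index=pd.Index(['Shooter', 'Action', 'Shooter', 'Action'], name='Genre'),
    name='Global_Sales',
)

# Group by the name of the index, as in the example above.
by_name = ser_toy.groupby('Genre').mean()

# Group by index level, referenced by name or by position.
by_level = ser_toy.groupby(level='Genre').mean()
by_position = ser_toy.groupby(level=0).mean()

print(by_name)
```

The `level` form is handy when the index has no name to refer to.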

## Skill Challenge

#### 1. Create a smaller dataframe from *games*, selecting only the Publisher, Genre, Platform, and NA_Sales columns. Assign this dataframe to the variable *publishers*.

We can accomplish this by using a simple `loc[]` indexing command.

In [39]:
publishers = games.loc[:, ['Publisher','Genre','Platform','NA_Sales']]

In [40]:
publishers.head(10)

Unnamed: 0,Publisher,Genre,Platform,NA_Sales
0,Microsoft Game Studios,Misc,X360,14.97
1,Take-Two Interactive,Action,PS3,7.01
2,Take-Two Interactive,Action,X360,9.63
3,Activision,Shooter,X360,9.03
4,Activision,Shooter,X360,9.67
5,Activision,Shooter,PS4,5.77
6,Activision,Shooter,PS3,4.99
7,Activision,Shooter,X360,8.25
8,Activision,Shooter,X360,8.52
9,Activision,Shooter,PS3,5.54


#### 2. From the *publishers* dataframe, find the top 10 game publishers in North America by total sales.

We'll use the `groupby()` method here, grouping by Publisher and then summing over "NA_Sales", and then sorting in descending order.

In [41]:
publishers.groupby("Publisher")['NA_Sales'].sum().sort_values(ascending = False).head(10)

Publisher
Electronic Arts                           213.38
Activision                                193.16
Take-Two Interactive                      120.99
Microsoft Game Studios                    116.77
Ubisoft                                    98.65
Sony Computer Entertainment                76.35
Warner Bros. Interactive Entertainment     45.24
THQ                                        36.44
Bethesda Softworks                         33.88
Capcom                                     24.74
Name: NA_Sales, dtype: float64

In [42]:
type(publishers.groupby("Publisher")['NA_Sales'].sum().sort_values(ascending = False).head(10))

pandas.core.series.Series

Note that we could also have done this without selecting "NA_Sales", because only the "NA_Sales" column contains numeric data. (In older versions of Pandas, `sum()` after a `groupby()` silently skips non-numeric columns; in Pandas 2.0 and later, pass `numeric_only=True` to get the same result.) However, this does mean that we need to identify the column to sort by when implementing `sort_values()`, by passing it into the `by` parameter.

Essentially you can identify the column now or later. It's up to you. The only difference between the approaches is that the former results in a Series, while the latter results in a dataframe with only one column.

In [43]:
publishers.groupby("Publisher").sum(numeric_only=True).sort_values(by = "NA_Sales", ascending = False).head(10)

Publisher,NA_Sales
Electronic Arts,213.38
Activision,193.16
Take-Two Interactive,120.99
Microsoft Game Studios,116.77
Ubisoft,98.65
Sony Computer Entertainment,76.35
Warner Bros. Interactive Entertainment,45.24
THQ,36.44
Bethesda Softworks,33.88
Capcom,24.74


In [44]:
type(publishers.groupby("Publisher").sum(numeric_only=True).sort_values(by = "NA_Sales", ascending = False).head(10))

pandas.core.frame.DataFrame

This shows that Electronic Arts is the highest-selling publisher, followed by Activision, Take-Two Interactive, and so forth.
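
The Series-versus-dataframe distinction between the two approaches can be verified on a small made-up dataframe (the name `toy` and its values are illustrative assumptions). Passing `numeric_only=True` keeps the text columns out of the sum regardless of the Pandas version.

```python
import pandas as pd

# Hypothetical toy data with two text columns and one numeric column.
toy = pd.DataFrame({
    'Publisher': ['A', 'B', 'A'],
    'Genre': ['Action', 'Sports', 'Sports'],
    'NA_Sales': [1.0, 2.0, 3.0],
})

# Select the column first: the aggregate is a Series, so no `by` is needed.
as_series = toy.groupby('Publisher')['NA_Sales'].sum().sort_values(ascending=False)

# Aggregate first: the result stays a dataframe, so sort_values() needs `by`.
as_frame = (
    toy.groupby('Publisher')
       .sum(numeric_only=True)
       .sort_values(by='NA_Sales', ascending=False)
)

print(type(as_series).__name__, type(as_frame).__name__)
```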

#### 3. Determine the gaming platform/system that has attracted the most sales in North America.

This is similar to above, except we will group by "Platform" instead of by "Publisher". We will also change up our approach by first creating a Series with just the "Platform" and "NA_Sales" data, and then performing our `groupby()` aggregate analysis. The logic flows as follows:
1. Use the `loc[]` indexer to isolate the "Platform" and "NA_Sales" columns from *publishers*.
2. Use `set_index()` to set the index to "Platform".
3. Use `squeeze()` to convert our dataframe to a Series.
4. Use `groupby()` to aggregate by "Platform".
5. Apply the `sum()` function.

In [45]:
publishers.loc[:, ['Platform','NA_Sales']].set_index("Platform").squeeze().groupby('Platform').sum()

Platform
PS3     392.26
PS4      96.80
X360    601.05
XOne     83.19
Name: NA_Sales, dtype: float64

We therefore see that the Xbox 360 had the greatest sales in North America by a significant margin, with total sales of $601 million.

The alternative dataframe approach is equally valid and arguably much easier. Once again this approach leaves you with a dataframe instead of a series.

In [46]:
publishers.groupby("Platform").sum(numeric_only=True)

Platform,NA_Sales
PS3,392.26
PS4,96.8
X360,601.05
XOne,83.19


## Iterating Through Groups

Recall that the `groupby()` method returns a DataFrameGroupBy object in which the grouping has been defined, but no computation has been performed on the subgroups yet.

In [47]:
sales.groupby('Platform')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff6bdf68590>

In [48]:
sales.Platform.unique()

array(['X360', 'PS3', 'PS4', 'XOne'], dtype=object)

How do we access the information in that object without actually doing anything to it? It's actually simple - it's an iterable that we can iterate over.

In [49]:
for i in sales.groupby('Platform'):
  print(i)

('PS3',      Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
1         PS3      7.01      9.27      0.97         4.14
6         PS3      4.99      5.88      0.65         2.52
9         PS3      5.54      5.82      0.49         1.62
10        PS3      5.98      4.44      0.48         1.83
14        PS3      2.96      4.88      0.81         2.12
...       ...       ...       ...       ...          ...
3124      PS3      0.00      0.01      0.00         0.00
3125      PS3      0.00      0.00      0.01         0.00
3129      PS3      0.00      0.00      0.01         0.00
3132      PS3      0.00      0.00      0.01         0.00
3136      PS3      0.00      0.00      0.01         0.00

[1329 rows x 5 columns])
('PS4',      Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
5         PS4      5.77      5.81      0.35         2.31
12        PS4      3.80      5.81      0.36         2.02
24        PS4      1.11      6.06      0.06         1.26
26        PS4      2.93      3.29      0.22   

What the heck are we looking at here? 

Well first off, there are four blocks of "stuff", which lines up with the fact that we have four subgroups ("Platforms" in this case) in this particular groupby. 

Secondly, each "block" consists of two things - a "name", and a "dataframe-like object". Let's modify our call a bit so that we can access the labels and dataframe-like objects inside.

In [50]:
for name, df in sales.groupby('Platform'):
  print('-----------------')
  print('Subgroup Label: ', name)
  print('-----------------')
  print(df, '\n')


-----------------
Subgroup Label:  PS3
-----------------
     Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
1         PS3      7.01      9.27      0.97         4.14
6         PS3      4.99      5.88      0.65         2.52
9         PS3      5.54      5.82      0.49         1.62
10        PS3      5.98      4.44      0.48         1.83
14        PS3      2.96      4.88      0.81         2.12
...       ...       ...       ...       ...          ...
3124      PS3      0.00      0.01      0.00         0.00
3125      PS3      0.00      0.00      0.01         0.00
3129      PS3      0.00      0.00      0.01         0.00
3132      PS3      0.00      0.00      0.01         0.00
3136      PS3      0.00      0.00      0.01         0.00

[1329 rows x 5 columns] 

-----------------
Subgroup Label:  PS4
-----------------
     Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
5         PS4      5.77      5.81      0.35         2.31
12        PS4      3.80      5.81      0.36         2.02
24  

We can now see the content of that DataFrameGroupBy object. Again we see four groups, one for each platform, and the associated data for each platform. With this view, we can think of `groupby()` aggregate analysis as applying individually to each one of these subgroups within the DataFrameGroupBy object.
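
Because each iteration yields a label and an ordinary dataframe, the loop body can do anything we would do with a dataframe. As a minimal sketch on made-up data (`toy_sales` is an illustrative assumption), the loop below records how many rows each subgroup contains.

```python
import pandas as pd

# Hypothetical toy data with three distinct platforms.
toy_sales = pd.DataFrame({
    'Platform': ['X360', 'PS3', 'X360', 'PS4'],
    'NA_Sales': [2.0, 1.0, 3.0, 4.0],
})

row_counts = {}
for name, df in toy_sales.groupby('Platform'):
    # `name` is the group label; `df` is a dataframe holding that group's rows.
    row_counts[name] = len(df)
    print(name, type(df).__name__, df.shape)

print(row_counts)
```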

## Handpicking Subgroups with `get_group()`



We saw in the previous lecture that we can easily iterate through each one of the subgroups in the DataFrameGroupBy. Occasionally, you may get output that is surprising or seemingly nonsensical, such as null values or invalid datapoints that may affect the results in ways that are confusing.

Therefore, it is sometimes useful to source subgroups in isolation. How do we do this?

It's tempting to use square bracketing to select a single subgroup. Let's try it. (Spoiler: it won't work)

In [51]:
## This results in a "Column not found" error.
# sales.groupby('Platform')['PS3']

That didn't work because square bracket notation is used for selecting **columns**. In this case, the individual platforms are not columns - they are values embedded within the "Platform" column.

As a side note, bracket notation is a totally valid way to select for columns after a `groupby()`. Notice that when you do this (and as we saw above), you get a reduction in dimensionality, resulting in a SeriesGroupBy object.
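
Both behaviors of bracket notation can be reproduced safely on a small made-up dataframe (`toy_sales` is an illustrative assumption). The failed lookup is wrapped in `try`/`except` so that the cell runs to completion.

```python
import pandas as pd

# Hypothetical toy data with a platform column and one sales column.
toy_sales = pd.DataFrame({
    'Platform': ['X360', 'PS3', 'X360'],
    'JP_Sales': [0.2, 0.9, 0.1],
})
grouped = toy_sales.groupby('Platform')

# Brackets look up COLUMNS, so a column name works and yields a SeriesGroupBy.
jp = grouped['JP_Sales']
print(type(jp).__name__)

# A group label is not a column, so the same syntax raises a KeyError.
try:
    grouped['PS3']
    lookup_failed = False
except KeyError as err:
    lookup_failed = True
    print('KeyError:', err)
```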

In [52]:
sales.groupby('Platform')['JP_Sales']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7ff6bdf11210>

Back to the question at hand: how do we access any given subgroup? One way to extract a single subgroup is to convert the DataFrameGroupBy object into an iterator, and then convert that iterator into a dictionary.

In [53]:
dict(iter(sales.groupby('Platform')))

{'PS3':      Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
 1         PS3      7.01      9.27      0.97         4.14
 6         PS3      4.99      5.88      0.65         2.52
 9         PS3      5.54      5.82      0.49         1.62
 10        PS3      5.98      4.44      0.48         1.83
 14        PS3      2.96      4.88      0.81         2.12
 ...       ...       ...       ...       ...          ...
 3124      PS3      0.00      0.01      0.00         0.00
 3125      PS3      0.00      0.00      0.01         0.00
 3129      PS3      0.00      0.00      0.01         0.00
 3132      PS3      0.00      0.00      0.01         0.00
 3136      PS3      0.00      0.00      0.01         0.00
 
 [1329 rows x 5 columns],
 'PS4':      Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
 5         PS4      5.77      5.81      0.35         2.31
 12        PS4      3.80      5.81      0.36         2.02
 24        PS4      1.11      6.06      0.06         1.26
 26        PS4      2.93      

What did we do here? We created a dictionary, where the platform label is the dictionary key and the value is the dataframe-like stuff. We can now treat this as a normal dictionary and access the dataframe-like information for a specific platform by identifying that platform as the key.
* Important note: We did not apply any aggregate function on this. We're simply pulling data out of the `groupby()` object.

In [54]:
dict(iter(sales.groupby('Platform')))['PS3']

Unnamed: 0,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
1,PS3,7.01,9.27,0.97,4.14
6,PS3,4.99,5.88,0.65,2.52
9,PS3,5.54,5.82,0.49,1.62
10,PS3,5.98,4.44,0.48,1.83
14,PS3,2.96,4.88,0.81,2.12
...,...,...,...,...,...
3124,PS3,0.00,0.01,0.00,0.00
3125,PS3,0.00,0.00,0.01,0.00
3129,PS3,0.00,0.00,0.01,0.00
3132,PS3,0.00,0.00,0.01,0.00


The instructor doesn't like this approach because you have to go through the annoying step of creating a dictionary with all of the information available, prior to pulling out what we want.

Pandas can help us circumvent this nonsense with the `get_group()` method, which is executed on a `GroupBy` object. It does the exact same thing as the convoluted path we took above.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.get_group.html
* This is more efficient because it doesn't waste computational power calculating values for each group, when we only want one group.
* It is also more efficient because we stay in "Pandas code" the entire time, not having to switch to pure Python

In [55]:
sales.groupby("Platform").get_group('PS3')

Unnamed: 0,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
1,PS3,7.01,9.27,0.97,4.14
6,PS3,4.99,5.88,0.65,2.52
9,PS3,5.54,5.82,0.49,1.62
10,PS3,5.98,4.44,0.48,1.83
14,PS3,2.96,4.88,0.81,2.12
...,...,...,...,...,...
3124,PS3,0.00,0.01,0.00,0.00
3125,PS3,0.00,0.00,0.01,0.00
3129,PS3,0.00,0.00,0.01,0.00
3132,PS3,0.00,0.00,0.01,0.00
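
As a quick sanity check, the subgroup returned by `get_group()` contains the same rows we would get from the manual boolean filter used earlier in this section. The sketch below compares the two on a made-up dataframe (`toy_sales` is an illustrative assumption).

```python
import pandas as pd

# Hypothetical toy data with two platforms.
toy_sales = pd.DataFrame({
    'Platform': ['X360', 'PS3', 'X360', 'PS3'],
    'NA_Sales': [2.0, 1.0, 3.0, 4.0],
})

# Pull a single subgroup out of the GroupBy object.
ps3_group = toy_sales.groupby('Platform').get_group('PS3')

# The manual equivalent: filter the rows with a boolean mask.
ps3_mask = toy_sales.loc[toy_sales['Platform'] == 'PS3']

# Same values and same original row labels in both cases.
print(ps3_group['NA_Sales'].tolist(), list(ps3_group.index))
print(ps3_mask['NA_Sales'].tolist(), list(ps3_mask.index))
```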


## MultiIndex Grouping

We can use `groupby()` with multiple keys!

In [56]:
games.head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Kinect Adventures!,X360,2010.0,Misc,Microsoft Game Studios,14.97,4.94,0.24,1.67,21.82
1,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.4
2,Grand Theft Auto V,X360,2013.0,Action,Take-Two Interactive,9.63,5.31,0.06,1.38,16.38
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64


Let's start by subsetting the dataframe to just include Genre, Publisher, and Global_Sales.

In [57]:
studios = games.loc[:, ['Genre','Publisher','Global_Sales']]

In [58]:
studios.head()

Unnamed: 0,Genre,Publisher,Global_Sales
0,Misc,Microsoft Game Studios,21.82
1,Action,Take-Two Interactive,21.4
2,Action,Take-Two Interactive,16.38
3,Shooter,Activision,14.76
4,Shooter,Activision,14.64


We have a 3-column dataframe with each row representing a single game.

So far we have grouped by a single key. Suppose we want to know the top publisher by `Global_Sales`. That is easy enough with `groupby()`.

In [59]:
studios.groupby('Publisher').sum().sort_values(by = 'Global_Sales', ascending = False)

Unnamed: 0_level_0,Global_Sales
Publisher,Unnamed: 1_level_1
Electronic Arts,434.41
Activision,349.22
Take-Two Interactive,218.08
Ubisoft,201.98
Microsoft Game Studios,190.56
...,...
UIG Entertainment,0.01
ChunSoft,0.01
Kaga Create,0.01
Epic Games,0.01


From this analysis, EA is the leading publisher by sales. However, this aggregation becomes inadequate if we add another dimension to our question. For instance, what if we want to go deeper and find the top publisher by `Global_Sales` for each genre of game?

To do this using `groupby()`, we pass in a *list of keys*.



In [60]:
studios.groupby(['Genre','Publisher'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ff6b9bf89d0>

As before, in the absence of any function, this produces our DataFrameGroupBy object that is awaiting further orders.
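
Even before any aggregation, the GroupBy object can be inspected. The following is a minimal sketch on hypothetical toy data (the values are made up); it shows that with a list of keys, each group is labelled by a `(Genre, Publisher)` tuple.

```python
import pandas as pd

# Hypothetical toy data with the same three columns as `studios`
toy = pd.DataFrame({
    "Genre": ["Shooter", "Shooter", "Sports", "Shooter", "Sports"],
    "Publisher": ["Activision", "Activision", "Electronic Arts",
                  "Electronic Arts", "Electronic Arts"],
    "Global_Sales": [14.76, 14.64, 8.0, 7.0, 6.0],
})

grouped = toy.groupby(["Genre", "Publisher"])

n_groups = grouped.ngroups            # number of distinct key combinations
group_keys = sorted(grouped.groups)   # the dict keys are (Genre, Publisher) tuples
sizes = grouped.size()                # rows per group, indexed by both keys

print(n_groups)
print(group_keys)
print(sizes)
```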

Let's now force the evaluation.

In [62]:
studios.groupby(['Genre','Publisher']).sum().sort_values(by = "Global_Sales", ascending = False).head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Genre,Publisher,Unnamed: 2_level_1
Shooter,Activision,245.46
Sports,Electronic Arts,203.5
Action,Take-Two Interactive,106.04
Action,Ubisoft,96.44
Shooter,Electronic Arts,92.58
Shooter,Microsoft Game Studios,77.02
Action,Warner Bros. Interactive Entertainment,71.89
Action,Sony Computer Entertainment,60.38
Sports,Take-Two Interactive,56.89
Action,Electronic Arts,49.61


We now see that while EA is the leading publisher overall, the single best-selling genre-publisher combination belongs to Activision. More specifically, Activision leads all publishers in the sales of Shooter games, while EA leads the pack in sales of Sports games.

More generally, notice how this multiple-key `groupby()` broke our analysis down even further into groups that are indexed by both genre and publisher. This allows us to answer a more specific question than aggregate analysis by any one key alone.

Finally, note that using multiple keys in `groupby()` results in a dataframe with a `MultiIndex`.
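
To pull out the top publisher for each genre directly, one option (a sketch on hypothetical toy data, not necessarily the approach the course takes) is to aggregate by both keys and then, within each genre, look up the index label of the largest total.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "Genre": ["Shooter", "Shooter", "Shooter", "Sports", "Sports", "Action"],
    "Publisher": ["Activision", "Activision", "Electronic Arts",
                  "Electronic Arts", "Take-Two Interactive",
                  "Take-Two Interactive"],
    "Global_Sales": [14.76, 14.64, 5.0, 8.0, 3.0, 21.4],
})

# Total sales per (Genre, Publisher); the result has a two-level MultiIndex
totals = toy.groupby(["Genre", "Publisher"])["Global_Sales"].sum()

# Within each genre, find the (Genre, Publisher) label with the largest total
top_idx = totals.groupby(level="Genre").idxmax()

# Select those rows from the aggregated Series
top_per_genre = totals.loc[top_idx.tolist()]

print(top_per_genre)
```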

## Fine-Tuned Aggregates

Let's now shift our attention to the aggregation functions. So far we have mainly been using the built-in aggregate functions such as `sum()` and `max()`. 

Taking `sum()` as an example, the following command is equivalent to what we've done above.


In [63]:
studios.groupby(['Genre','Publisher']).aggregate('sum')

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Genre,Publisher,Unnamed: 2_level_1
Action,505 Games,2.25
Action,Abylight,0.08
Action,Ackkstudios,0.33
Action,Acquire,0.11
Action,Activision,42.84
...,...,...
Strategy,Square Enix,0.35
Strategy,Takara Tomy,0.09
Strategy,Take-Two Interactive,2.92
Strategy,Tecmo Koei,0.58


Or equivalently, we can use the shorthand alias `agg()`.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html

In [64]:
studios.groupby(['Genre','Publisher']).agg('sum')

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Genre,Publisher,Unnamed: 2_level_1
Action,505 Games,2.25
Action,Abylight,0.08
Action,Ackkstudios,0.33
Action,Acquire,0.11
Action,Activision,42.84
...,...,...
Strategy,Square Enix,0.35
Strategy,Takara Tomy,0.09
Strategy,Take-Two Interactive,2.92
Strategy,Tecmo Koei,0.58


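The `agg()` function also accepts NumPy functions. Note that recent pandas versions emit a `FutureWarning` when a NumPy reduction such as `np.sum` is passed here, and recommend the string alias `'sum'` instead.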

In [65]:
studios.groupby(['Genre','Publisher']).agg(np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Genre,Publisher,Unnamed: 2_level_1
Action,505 Games,2.25
Action,Abylight,0.08
Action,Ackkstudios,0.33
Action,Acquire,0.11
Action,Activision,42.84
...,...,...
Strategy,Square Enix,0.35
Strategy,Takara Tomy,0.09
Strategy,Take-Two Interactive,2.92
Strategy,Tecmo Koei,0.58
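
The equivalence of these spellings can be checked on a small example. The sketch below uses hypothetical toy data and compares the `sum()` method, the `'sum'` string passed to `aggregate()` and `agg()`, and a custom callable (a lambda, used here purely for illustration).

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "Genre": ["Shooter", "Shooter", "Sports", "Sports"],
    "Publisher": ["Activision", "Activision", "Electronic Arts", "Electronic Arts"],
    "Global_Sales": [1.5, 2.5, 4.0, 1.0],
})

grouped = toy.groupby(["Genre", "Publisher"])

by_method = grouped.sum()                      # built-in method
by_aggregate = grouped.aggregate("sum")        # full name, string alias
by_agg = grouped.agg("sum")                    # shorthand, string alias
by_callable = grouped.agg(lambda s: s.sum())   # any callable that reduces a Series

print(by_method.equals(by_aggregate) and by_method.equals(by_agg))
print(by_callable)
```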


One of the biggest advantages of using the `agg()` function is that **we can apply several functions in one go**, as the function accepts a Python list of functions. 

As an example, suppose we want to summarize the sum, average, and standard deviation of sales, as well as the number of games published, for each publisher within each genre. This sounds like a complicated mess of a question, but with `agg()` it becomes relatively simple to answer.

We start by grouping our data according to Genre and Publisher, as that is the question we want to answer. We will then pass in a list of functions.

In [67]:
studios.groupby(['Genre','Publisher']).agg(['sum','count','mean','std'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales,Global_Sales,Global_Sales,Global_Sales
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,count,mean,std
Genre,Publisher,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Action,505 Games,2.25,8,0.281250,0.266482
Action,Abylight,0.08,1,0.080000,
Action,Ackkstudios,0.33,1,0.330000,
Action,Acquire,0.11,1,0.110000,
Action,Activision,42.84,95,0.450947,0.559717
...,...,...,...,...,...
Strategy,Square Enix,0.35,1,0.350000,
Strategy,Takara Tomy,0.09,1,0.090000,
Strategy,Take-Two Interactive,2.92,6,0.486667,0.364289
Strategy,Tecmo Koei,0.58,6,0.096667,0.055015


Check this out. We now have an aggregate analysis for each publisher and each genre, with data pertaining to the functions that we passed in. Note that some publishers have only one game within a given genre, so the standard deviation cannot be calculated and shows up as missing.
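
The missing standard deviations come from the sample formula that pandas uses by default (`ddof=1`), which divides by n - 1 and is therefore undefined for a single observation. A small sketch on made-up numbers (publishers `A` and `B` are hypothetical):

```python
import pandas as pd

# Hypothetical data: publisher A has two games, publisher B only one
toy = pd.DataFrame({
    "Publisher": ["A", "A", "B"],
    "Global_Sales": [1.0, 3.0, 2.0],
})

stats = toy.groupby("Publisher")["Global_Sales"].agg(["count", "std"])

# A gets a numeric std; B has count 1, so its std is NaN
print(stats)
```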

Notice that we've created a hierarchical index on both axes of the dataframe. The row axis has a two-level index: Genre and Publisher. The column axis has a two-level index as well: `Global_Sales` on the higher level, then sum, count, mean, and std on the lower level.

In [68]:
studios.groupby(['Genre','Publisher']).agg(['sum','count','mean','std']).columns

MultiIndex([('Global_Sales',   'sum'),
            ('Global_Sales', 'count'),
            ('Global_Sales',  'mean'),
            ('Global_Sales',   'std')],
           )

Thus, if we want to sort by a column in this multi-index dataframe, we need to approach it slightly differently. Case in point: we cannot simply sort by "Global_Sales".

In [71]:
## This results in a ValueError
# studios.groupby(['Genre','Publisher']).agg(['sum','count','mean','std']).sort_values(by = "Global_Sales")

For a multi-index dataframe with multiple levels on the column axis, we need to identify each column uniquely by specifying labels on each level. This means we have to use a tuple to perform our `sort_values()` method.

In [73]:
studios.groupby(['Genre','Publisher']).agg(['sum','count','mean','std']).sort_values(by = ("Global_Sales",'sum'), ascending = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales,Global_Sales,Global_Sales,Global_Sales
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,count,mean,std
Genre,Publisher,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Shooter,Activision,245.46,72,3.409167,4.621920
Sports,Electronic Arts,203.50,170,1.197059,1.404108
Action,Take-Two Interactive,106.04,23,4.610435,5.843768
Action,Ubisoft,96.44,67,1.439403,1.636460
Shooter,Electronic Arts,92.58,50,1.851600,1.794404
...,...,...,...,...,...
Adventure,Cave,0.01,1,0.010000,
Role-Playing,TopWare Interactive,0.01,1,0.010000,
Sports,"Interworks Unlimited, Inc.",0.01,1,0.010000,
Strategy,Ackkstudios,0.01,1,0.010000,


We have all of these nice metrics in one organized dataframe. What does this analysis tell us? Well, we see that Activision's Shooter titles out-sell EA's Sports titles (245.46 vs. 203.50) even though Activision published far fewer of them (72 vs. 170 games), because its average sales per game are much higher (about 3.41 vs. 1.20).

What games does Activision publish? Let's find out.



In [74]:
games[games.Publisher == 'Activision'].sort_values('Global_Sales', ascending = False)

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64
5,Call of Duty: Black Ops 3,PS4,2015.0,Shooter,Activision,5.77,5.81,0.35,2.31,14.24
6,Call of Duty: Black Ops II,PS3,2012.0,Shooter,Activision,4.99,5.88,0.65,2.52,14.03
7,Call of Duty: Black Ops II,X360,2012.0,Shooter,Activision,8.25,4.30,0.07,1.12,13.73
...,...,...,...,...,...,...,...,...,...,...
2923,Ghostbusters (2016),XOne,2016.0,Action,Activision,0.02,0.00,0.00,0.00,0.02
2932,The Voice,PS3,2014.0,Action,Activision,0.02,0.00,0.00,0.00,0.02
2935,Rapala for Kinect,X360,2011.0,Sports,Activision,0.00,0.02,0.00,0.00,0.02
3012,Call of Duty: Modern Warfare Trilogy,PS3,2016.0,Shooter,Activision,0.00,0.01,0.00,0.00,0.02


The Call of Duty series seems to be a massive cash cow for Activision, especially in North America.
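
One way to check the regional claim is to filter titles by name and add up each region. The sketch below hard-codes only the five top rows shown in the output above rather than loading the full dataset, so the totals are illustrative and cover just those rows.

```python
import pandas as pd

# The five top Activision rows from the output above (not the full dataset)
top_rows = pd.DataFrame({
    "Name": [
        "Call of Duty: Modern Warfare 3",
        "Call of Duty: Black Ops",
        "Call of Duty: Black Ops 3",
        "Call of Duty: Black Ops II",
        "Call of Duty: Black Ops II",
    ],
    "NA_Sales": [9.03, 9.67, 5.77, 4.99, 8.25],
    "EU_Sales": [4.28, 3.73, 5.81, 5.88, 4.30],
    "JP_Sales": [0.13, 0.11, 0.35, 0.65, 0.07],
    "Other_Sales": [1.32, 1.13, 2.31, 2.52, 1.12],
})

regions = ["NA_Sales", "EU_Sales", "JP_Sales", "Other_Sales"]

# Keep rows whose name contains the literal text "Call of Duty"
is_cod = top_rows["Name"].str.contains("Call of Duty", regex=False)

# Sum each region, then express each as a share of the combined total
regional_totals = top_rows.loc[is_cod, regions].sum()
regional_share = regional_totals / regional_totals.sum()

print(regional_share.round(2))
```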