In [1]:
import pandas as pd
import numpy as np

# GroupBy and Aggregates

This section of the course introduces the split-apply-combine pattern. We will first do this manually, and then learn the `groupby()` method. This method allows us to quickly slice and transform our dataset depending on the keys and logic that we provide.

We will then explore aggregation functions, grouping by multiple keys, and some advanced topics, such as combining `groupby()` with additional methods like `transform()`, `filter()`, and `apply()`. This allows for highly customizable operations.

## New Data: Game Sales

In this section, we'll work with a dataset that contains sales information for the most popular games on the Xbox and PlayStation consoles. The dataset is originally from Kaggle and has been modified by the instructor to suit the exercises in this section.

https://andybek.com/pandas-games

Let's start by importing the data.

In [2]:
games_url = 'https://andybek.com/pandas-games'

In [3]:
games = pd.read_csv(games_url)

In [10]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3143 entries, 0 to 3142
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          3143 non-null   object 
 1   Platform      3143 non-null   object 
 2   Year          3088 non-null   float64
 3   Genre         3143 non-null   object 
 4   Publisher     3136 non-null   object 
 5   NA_Sales      3143 non-null   float64
 6   EU_Sales      3143 non-null   float64
 7   JP_Sales      3143 non-null   float64
 8   Other_Sales   3143 non-null   float64
 9   Global_Sales  3143 non-null   float64
dtypes: float64(6), object(4)
memory usage: 245.7+ KB


In [5]:
games.head()

,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Kinect Adventures!,X360,2010.0,Misc,Microsoft Game Studios,14.97,4.94,0.24,1.67,21.82
1,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.4
2,Grand Theft Auto V,X360,2013.0,Action,Take-Two Interactive,9.63,5.31,0.06,1.38,16.38
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64


We've got the game name, the console ("Platform"), the year of release, the genre, the publisher, and sales in different regions as well as overall global sales. Let's get to it!

In [12]:
games["Platform"].unique()

array(['X360', 'PS3', 'PS4', 'XOne'], dtype=object)

## Simple Aggregations Overview

In previous sections, we used simple aggregation functions to calculate things like mean, sum, standard deviation, variance, etc. Let's review in this lecture, in preparation for the `groupby()` method.

Let's first answer the question "What are the total sales in each region?" One way to solve this is to isolate the columns of interest using the `loc[]` indexer.

In [13]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]

,NA_Sales,EU_Sales,JP_Sales,Other_Sales
0,14.97,4.94,0.24,1.67
1,7.01,9.27,0.97,4.14
2,9.63,5.31,0.06,1.38
3,9.03,4.28,0.13,1.32
4,9.67,3.73,0.11,1.13
...,...,...,...,...
3138,0.00,0.01,0.00,0.00
3139,0.01,0.00,0.00,0.00
3140,0.01,0.00,0.00,0.00
3141,0.00,0.01,0.00,0.00


We've created a small dataframe slice with just the sales columns. Now we can chain on the `sum()` method to get the sum of each column.

In [14]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()

NA_Sales       1173.30
EU_Sales        793.64
JP_Sales        107.06
Other_Sales     282.75
dtype: float64

Notice that the dimensionality changed when we applied the aggregate `sum()` function. What began as a dataframe of 3143 rows and 4 columns ended up as a series with four entries. This is a feature of aggregate functions in general.
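To make the change in dimensionality concrete, here is a minimal sketch. It assumes a hypothetical `toy` dataframe built from the five rows shown by `games.head()` earlier, not the full dataset, so only the shapes and types are of interest.

```python
import pandas as pd

# Toy data: the five rows shown by games.head() (an assumption for
# illustration, not the full 3143-row dataset).
toy = pd.DataFrame({
    "NA_Sales": [14.97, 7.01, 9.63, 9.03, 9.67],
    "EU_Sales": [4.94, 9.27, 5.31, 4.28, 3.73],
    "JP_Sales": [0.24, 0.97, 0.06, 0.13, 0.11],
    "Other_Sales": [1.67, 4.14, 1.38, 1.32, 1.13],
})

totals = toy.sum()  # aggregate down each column

print(toy.shape)     # two-dimensional: (rows, columns)
print(type(totals))  # a Series with one entry per original column
print(totals.shape)
```

The column labels of the dataframe become the index labels of the resulting series.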

In [15]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].median()

NA_Sales       0.14
EU_Sales       0.07
JP_Sales       0.00
Other_Sales    0.03
dtype: float64

In [16]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].std()

NA_Sales       0.801483
EU_Sales       0.570164
JP_Sales       0.093115
Other_Sales    0.203866
dtype: float64

One thing to note about aggregation: it can be applied in more than one direction. For example, any given column can be collapsed, or aggregated, across all of its rows. Similarly, any row can be aggregated across all of its columns.

When dealing with dataframes, pandas defaults to vertical aggregation, which produces one value per column. We can change this with the `axis` parameter of the aggregate function. For instance, let's compute the standard deviation along the column axis, which produces one value per row.

In [17]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].std(axis = 1)

0       6.641358
1       3.594926
2       4.311415
3       3.964602
4       4.286739
          ...   
3138    0.005000
3139    0.005000
3140    0.005000
3141    0.005000
3142    0.005000
Length: 3143, dtype: float64

Notice that after changing the axis of aggregation, we have a series with 3143 entries. We're getting the standard deviation of the sales across the four regions for each individual row (video game).

So two quick takeaways:
1. When we apply aggregation functions, the dimensions of the data frequently change.
2. We can aggregate either horizontally or vertically.
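
Both takeaways can be checked side by side in a short sketch. As before, the `toy` dataframe is a hypothetical five-row sample copied from the `games.head()` output, used here only so the example is self-contained.

```python
import pandas as pd

# Toy data: the five rows shown by games.head() (illustrative assumption).
toy = pd.DataFrame({
    "NA_Sales": [14.97, 7.01, 9.63, 9.03, 9.67],
    "EU_Sales": [4.94, 9.27, 5.31, 4.28, 3.73],
    "JP_Sales": [0.24, 0.97, 0.06, 0.13, 0.11],
    "Other_Sales": [1.67, 4.14, 1.38, 1.32, 1.13],
})

per_column = toy.std()     # axis=0 is the default: one value per column
per_row = toy.std(axis=1)  # axis=1: one value per row (game)

print(len(per_column))  # as many entries as there are columns
print(len(per_row))     # as many entries as there are rows
print(per_row)
```

Passing `axis="columns"` is an equivalent spelling of `axis=1`, which some readers find easier to remember.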

## Conditional Aggregates

The aggregates that we computed in the previous lecture applied across an entire column or row. Sometimes, however, we only want to apply the function to a smaller set of values within the dataframe.

Suppose we want the total sales by region, but only for particular platforms/consoles. How do we do that?

In [18]:
games.Platform.unique()

array(['X360', 'PS3', 'PS4', 'XOne'], dtype=object)

Let's go with the Xbox 360 and PS3 platforms. Find the total sales by region for the X360 and PS3.

Let's start by creating a new dataframe that only includes the Platform and Sales columns.

In [20]:
sales = games.loc[:, ['Platform', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]

In [21]:
sales.head()

,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
0,X360,14.97,4.94,0.24,1.67
1,PS3,7.01,9.27,0.97,4.14
2,X360,9.63,5.31,0.06,1.38
3,X360,9.03,4.28,0.13,1.32
4,X360,9.67,3.73,0.11,1.13


How do we separate the platforms? If we apply the `sum()` function here directly, we get the sum across all platforms.

In [22]:
sales.sum(numeric_only=True)

NA_Sales       1173.30
EU_Sales        793.64
JP_Sales        107.06
Other_Sales     282.75
dtype: float64

That's not what we want. Instead, we want *platform-specific* sales for each region. To do that, we'll need to be more specific in our data selection.

One thing we can do is use conditional (boolean) indexing with `loc[]`, selecting only the rows for the platform we want, and then summing for that platform.

In [24]:
sales.loc[sales.Platform == 'X360']

,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
0,X360,14.97,4.94,0.24,1.67
2,X360,9.63,5.31,0.06,1.38
3,X360,9.03,4.28,0.13,1.32
4,X360,9.67,3.73,0.11,1.13
7,X360,8.25,4.30,0.07,1.12
...,...,...,...,...,...
3127,X360,0.00,0.01,0.00,0.00
3128,X360,0.00,0.00,0.01,0.00
3130,X360,0.01,0.00,0.00,0.00
3135,X360,0.00,0.00,0.01,0.00


In [25]:
sales.loc[sales.Platform == 'X360'].sum(numeric_only=True)

NA_Sales       601.05
EU_Sales       280.58
JP_Sales        12.43
Other_Sales     85.54
dtype: float64
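
It can help to see the intermediate step that the condition produces. The sketch below assumes a hypothetical `toy_sales` dataframe holding the five rows from `sales.head()` with only two of the region columns; the name `is_x360` is also made up for illustration.

```python
import pandas as pd

# Toy data: the five rows shown by sales.head(), two region columns only
# (an assumption for illustration, not the full dataset).
toy_sales = pd.DataFrame({
    "Platform": ["X360", "PS3", "X360", "X360", "X360"],
    "NA_Sales": [14.97, 7.01, 9.63, 9.03, 9.67],
    "EU_Sales": [4.94, 9.27, 5.31, 4.28, 3.73],
})

# The comparison yields a boolean Series: True where the row is an X360 game.
is_x360 = toy_sales["Platform"] == "X360"
print(is_x360)

# loc[] keeps only the rows where the mask is True; sum() then aggregates them.
x360_totals = toy_sales.loc[is_x360].sum(numeric_only=True)
print(x360_totals)
```

Naming the mask is optional, but it makes the filter reusable and the selection easier to read.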

We can do the same for PS3.

In [26]:
sales.loc[sales.Platform == 'PS3'].sum(numeric_only=True)

NA_Sales       392.26
EU_Sales       343.71
JP_Sales        79.99
Other_Sales    141.93
dtype: float64

Simple enough, but rather verbose and repetitive. What if we wanted to do it in a more elegant manner with less code? Moreover, what if we had dozens or hundreds of platforms to aggregate? This manual platform-by-platform approach is not practical at that scale. We need a better approach, and there is one. Stay tuned!
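
In the meantime, the copy-and-paste can at least be reduced with a plain Python loop. The sketch below is a manual split-apply-combine, not the better approach promised above. It assumes a hypothetical `toy_sales` dataframe with the five rows from `sales.head()`, and the names `region_cols`, `totals`, and `by_platform` are illustrative.

```python
import pandas as pd

# Toy data: the five rows shown by sales.head(), two region columns only
# (an assumption for illustration, not the full dataset).
toy_sales = pd.DataFrame({
    "Platform": ["X360", "PS3", "X360", "X360", "X360"],
    "NA_Sales": [14.97, 7.01, 9.63, 9.03, 9.67],
    "EU_Sales": [4.94, 9.27, 5.31, 4.28, 3.73],
})

region_cols = ["NA_Sales", "EU_Sales"]
totals = {}

for platform in ["X360", "PS3"]:
    # Split: keep only the rows for this platform.
    subset = toy_sales.loc[toy_sales["Platform"] == platform, region_cols]
    # Apply: aggregate the subset into a Series of regional totals.
    totals[platform] = subset.sum()

# Combine: one row per platform, one column per region.
by_platform = pd.DataFrame(totals).T
print(by_platform)
```

This still requires us to list the platforms ourselves and to write the loop by hand, which is the bookkeeping we would like pandas to take over.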