In [None]:
import pandas as pd
import numpy as np

# GroupBy and Aggregates

This section of the course introduces the split-apply-combine pattern. We will first do this manually, and then learn the `groupby()` method. This method allows us to quickly slice and transform our dataset depending on the keys and logic that we provide.

We will then explore aggregation functions, grouping by multiple keys, and some advanced topics such as combining `groupby()` with additional methods like `transform()`, `filter()`, and `apply()`. This allows for highly customizable operations.

## New Data: Game Sales

In this section, we'll work with a dataset that contains sales information for the most popular games for the Xbox and Playstation consoles. It provides sales in millions of dollars. This dataset is originally from Kaggle, and has been modified by the instructor to suit the lessons in this section.

https://andybek.com/pandas-games

Let's start by importing the data.

In [None]:
games_url = 'https://andybek.com/pandas-games'

In [None]:
games = pd.read_csv(games_url)

In [None]:
games.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3143 entries, 0 to 3142
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Name          3143 non-null   object 
 1   Platform      3143 non-null   object 
 2   Year          3088 non-null   float64
 3   Genre         3143 non-null   object 
 4   Publisher     3136 non-null   object 
 5   NA_Sales      3143 non-null   float64
 6   EU_Sales      3143 non-null   float64
 7   JP_Sales      3143 non-null   float64
 8   Other_Sales   3143 non-null   float64
 9   Global_Sales  3143 non-null   float64
dtypes: float64(6), object(4)
memory usage: 245.7+ KB


In [None]:
games.head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Kinect Adventures!,X360,2010.0,Misc,Microsoft Game Studios,14.97,4.94,0.24,1.67,21.82
1,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.4
2,Grand Theft Auto V,X360,2013.0,Action,Take-Two Interactive,9.63,5.31,0.06,1.38,16.38
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64


We've got some information on the game name, the console ("Platform"), year of release, genre of game, the game publisher, and sales in different regions as well as overall global sales. Let's get to it!

In [None]:
games["Platform"].unique()

array(['X360', 'PS3', 'PS4', 'XOne'], dtype=object)

## Simple Aggregations Overview

In previous sections, we used simple aggregation functions to calculate things like mean, sum, standard deviation, variance, etc. Let's review in this lecture, in preparation for the `groupby()` method.

Let's first answer "what are the total sales across ALL regions". One way to solve this is to isolate the columns of interest using the `loc[]` indexer.

In [None]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]

Unnamed: 0,NA_Sales,EU_Sales,JP_Sales,Other_Sales
0,14.97,4.94,0.24,1.67
1,7.01,9.27,0.97,4.14
2,9.63,5.31,0.06,1.38
3,9.03,4.28,0.13,1.32
4,9.67,3.73,0.11,1.13
...,...,...,...,...
3138,0.00,0.01,0.00,0.00
3139,0.01,0.00,0.00,0.00
3140,0.01,0.00,0.00,0.00
3141,0.00,0.01,0.00,0.00


We've created a small dataframe slice with just the sales columns. Now we can chain on a `sum()` function to get the sums of all of the columns.

In [None]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()

NA_Sales       1173.30
EU_Sales        793.64
JP_Sales        107.06
Other_Sales     282.75
dtype: float64

Notice that the dimensionality changed when we applied the aggregate `sum()` function. What began as a dataframe of 3143 rows and 4 columns ended up as a series with four entries. This is a feature of aggregate functions in general.

In [None]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].median()

NA_Sales       0.14
EU_Sales       0.07
JP_Sales       0.00
Other_Sales    0.03
dtype: float64

In [None]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].std()

NA_Sales       0.801483
EU_Sales       0.570164
JP_Sales       0.093115
Other_Sales    0.203866
dtype: float64

One thing to note about aggregation: it is technically an operation that can be applied in more ways than one. For example, any given column can be collapsed or aggregated across all of its rows. Similarly, any row can be aggregated across all of its columns.

When dealing with dataframes, Pandas defaults to the vertical aggregation, collapsing the rows of each column. We can change this by changing the `axis` parameter of the aggregate function. For instance, let's calculate the standard deviation across the columns instead, which gives one value per row.

In [None]:
games.loc[:, ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].std(axis = 1)

0       6.641358
1       3.594926
2       4.311415
3       3.964602
4       4.286739
          ...   
3138    0.005000
3139    0.005000
3140    0.005000
3141    0.005000
3142    0.005000
Length: 3143, dtype: float64

Notice that after changing the axis of aggregation, we get a series with 3143 entries. We're getting the standard deviation of the sales across the four regions for each individual row (video game).

So two quick takeaways:
1. When we apply aggregation functions, the dimensions of the data frequently change.
2. We can aggregate either horizontally or vertically.
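To make the two takeaways concrete, here is a minimal sketch on a tiny made-up dataframe (the `toy` frame and its values are assumptions for illustration, not part of the games dataset). It shows how the shape of the result depends on the `axis` argument.

```python
import pandas as pd

# Tiny, made-up frame (not the course dataset): three games, two regions.
toy = pd.DataFrame(
    {"NA_Sales": [1.0, 2.0, 3.0], "EU_Sales": [0.5, 0.5, 1.0]},
    index=["game_a", "game_b", "game_c"],
)

# axis=0 (the default): collapse the rows, leaving one value per column.
per_region = toy.sum(axis=0)

# axis=1: collapse the columns, leaving one value per row.
per_game = toy.sum(axis=1)

# Compare the shape of the input with the shape of each result.
print(toy.shape, per_region.shape, per_game.shape)
```

In both cases a two-dimensional dataframe is reduced to a one-dimensional series; only the direction of the collapse differs.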

## Conditional Aggregates

The aggregates that we computed in the previous lecture applied across the entire column or row. Sometimes however, we only want to apply the function to a smaller set of values within the dataframe.

Suppose we want the total sales by region, but only for particular platforms/consoles. How do we do that?

In [None]:
games.Platform.unique()

array(['X360', 'PS3', 'PS4', 'XOne'], dtype=object)

Let's go with the Xbox 360 and PS3 platforms. Find the total sales by region for the X360 and PS3.

Let's start by creating a new view that only includes the Platform and Sales columns.

In [None]:
sales = games.loc[:, ['Platform', 'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']]

In [None]:
sales.head()

Unnamed: 0,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
0,X360,14.97,4.94,0.24,1.67
1,PS3,7.01,9.27,0.97,4.14
2,X360,9.63,5.31,0.06,1.38
3,X360,9.03,4.28,0.13,1.32
4,X360,9.67,3.73,0.11,1.13


How do we segregate the platforms? If we apply the `sum()` function here directly, we'll get the sum across all platforms.

In [None]:
sales.sum(numeric_only=True)

NA_Sales       1173.30
EU_Sales        793.64
JP_Sales        107.06
Other_Sales     282.75
dtype: float64

That's not what we want. Instead, we want *platform-specific* sales for each region. To do that, we'll need to be more specific in our data selection.

One thing we can do is use label-based conditional indexing, selecting only for the platform we want, and then summing for that platform.

In [None]:
sales.loc[sales.Platform == 'X360']

Unnamed: 0,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
0,X360,14.97,4.94,0.24,1.67
2,X360,9.63,5.31,0.06,1.38
3,X360,9.03,4.28,0.13,1.32
4,X360,9.67,3.73,0.11,1.13
7,X360,8.25,4.30,0.07,1.12
...,...,...,...,...,...
3127,X360,0.00,0.01,0.00,0.00
3128,X360,0.00,0.00,0.01,0.00
3130,X360,0.01,0.00,0.00,0.00
3135,X360,0.00,0.00,0.01,0.00


In [None]:
sales.loc[sales.Platform == 'X360'].sum(numeric_only=True)

NA_Sales       601.05
EU_Sales       280.58
JP_Sales        12.43
Other_Sales     85.54
dtype: float64

We can do the same for PS3.

In [None]:
sales.loc[sales.Platform == 'PS3'].sum(numeric_only=True)

NA_Sales       392.26
EU_Sales       343.71
JP_Sales        79.99
Other_Sales    141.93
dtype: float64

Simple enough, but rather verbose and repetitive. What if we wanted to do it in a more elegant manner with less code? Moreover, what if we had dozens or hundreds of platforms that we wanted to do an aggregate analysis for? This manual platform-by-platform approach is not practical for that. We would need a better approach. And there is such an approach. Stay tuned!

## The Split-Apply-Combine Pattern

Before we get into the `groupby()` method, let's conceptually review the data analysis pattern that we've been applying.

In the previous lecture, we wanted to calculate the sum of sales for all regions for two specific platforms - the X360 and the PS3. We could not just simply apply `sum()` to the dataframe because the resulting total gave us the total sales across all platforms.

To address this issue, we partitioned the data into separate groups and created smaller dataframes containing data from only the platforms we cared about. Once we had those datasets, we applied aggregation functions to calculate our sums. This approach is known as the **split-apply-combine** pattern, coined by Hadley Wickham from Rice University.

The main idea is to break down a big problem into smaller pieces, solve those pieces, and then put it all back together at the end. We've covered the first two steps - split and apply. We have not yet seen combine, but we will! In fact, the `groupby()` method which we will see in the next lecture implements ALL THREE of these steps in one single operation. Totally awesome.
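Here is a rough sketch of what all three steps look like when written out by hand, including the combine step. The `toy_sales` dataframe and its numbers are made up for illustration; the last line previews the one-line `groupby()` equivalent.

```python
import pandas as pd

# Made-up stand-in for the `sales` dataframe used in this section.
toy_sales = pd.DataFrame({
    "Platform": ["X360", "PS3", "X360", "PS3", "PS4"],
    "NA_Sales": [3.0, 1.0, 2.0, 4.0, 1.5],
    "EU_Sales": [1.0, 2.0, 0.5, 1.0, 2.5],
})

pieces = {}
for platform in sorted(toy_sales["Platform"].unique()):
    # Split: keep only the rows for one platform.
    subset = toy_sales.loc[toy_sales["Platform"] == platform]
    # Apply: aggregate the numeric columns of that subset.
    pieces[platform] = subset[["NA_Sales", "EU_Sales"]].sum()

# Combine: stack the per-platform results into one dataframe.
manual = pd.DataFrame(pieces).T
manual.index.name = "Platform"

# The same three steps in a single groupby() call.
grouped = toy_sales.groupby("Platform")[["NA_Sales", "EU_Sales"]].sum()

print(manual)
print(grouped)
```

The loop is fine for a handful of groups, but it grows awkward quickly, which is the motivation for letting `groupby()` do the bookkeeping.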

## Introducing the `groupby()` Method

Everything that we have been doing manually earlier in this section, and more, can be done with the `groupby()` method. It works much like a pivot table, performing the split, apply, and combine steps in one go.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

Let's tackle that platform-specific sales question again. All we need to do is identify what we want to group by ("Platform" in this case), and which aggregate function we want to apply!

In [None]:
sales.groupby("Platform").sum()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
PS3,392.26,343.71,79.99,141.93
PS4,96.8,123.7,14.3,43.36
X360,601.05,280.58,12.43,85.54
XOne,83.19,45.65,0.34,11.92


Man that was so easy, and much better than the manual work we had to do earlier. Even better, it combined all of the output in the end. The `groupby()` method is extremely powerful, and there will be many instances in our data analysis journey where we'll want to use it.

We can apply essentially any function that we want. For example, we can find the average sales across all regions for each platform.

In [None]:
sales.groupby("Platform").mean()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
PS3,0.295154,0.258623,0.060188,0.106795
PS4,0.288095,0.368155,0.04256,0.129048
X360,0.475138,0.221802,0.009826,0.067621
XOne,0.390563,0.214319,0.001596,0.055962


It looks like the Xbox 360 sells far more in North America than the other platforms. But the average is only one side of the story. Perhaps this value is skewed by some very high-selling games. We can investigate by looking at the median.

In [None]:
sales.groupby("Platform").median()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
PS3,0.12,0.07,0.01,0.03
PS4,0.07,0.08,0.02,0.03
X360,0.17,0.06,0.0,0.02
XOne,0.15,0.07,0.0,0.02


We see that the median North American sales for the Xbox 360 is less than half of the mean, indicating that the majority of the sales are below-average and that the average is driven upward by a few games that sell very well. 
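If you want the mean and median side by side rather than in two separate calls, one option is the `agg()` method with a list of function names. The sketch below uses a made-up `toy_sales` frame in which a single blockbuster pulls the X360 mean well above its median.

```python
import pandas as pd

# Made-up numbers: one blockbuster among several small X360 titles.
toy_sales = pd.DataFrame({
    "Platform": ["X360", "X360", "X360", "PS3", "PS3", "PS3"],
    "NA_Sales": [9.0, 0.2, 0.1, 1.0, 1.2, 0.8],
})

# agg() accepts a list of function names and returns one column per function.
summary = toy_sales.groupby("Platform")["NA_Sales"].agg(["mean", "median"])
print(summary)
```

A large gap between the two columns for a platform is a hint that a few high sellers are skewing the average.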

Stay tuned, we'll later see more advanced operations such as splitting by multiple keys.

## The DataFrameGroupBy Object

When using `groupby()`, you create what's known as a **DataFrameGroupBy** object. Let's take a look under the hood here.

What happens if we just perform the `groupby()` but do not apply a method/function to it?

In [None]:
sales.groupby("Platform")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd593196fd0>

What is returned is a DataFrameGroupBy object and a reference to its place in memory. Think of this as a special view into an intermediate output. At this point, Pandas has validated our mapping based on the key that we want to group by, and is awaiting instructions on what to do next. The split has not yet happened, but it will once Pandas knows what to do.

There are four platforms that we are grouping by, and thus the length of the DataFrameGroupBy object is 4.

In [None]:
len(sales.groupby("Platform"))

4

As soon as we apply a function to this object, the computation will kick in. So the key thing to remember here is that applying `groupby()` on a dataframe returns this DataFrameGroupBy object. It does *not* return a dataframe.
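Even before applying a function, the DataFrameGroupBy object can tell us a few things about the groups it has in mind. The following sketch, on a made-up `toy_sales` frame, uses the `ngroups` and `groups` attributes and the `size()` method to peek at them.

```python
import pandas as pd

# Made-up stand-in for the `sales` dataframe.
toy_sales = pd.DataFrame({
    "Platform": ["X360", "PS3", "X360", "PS4"],
    "NA_Sales": [3.0, 1.0, 2.0, 1.5],
})

grouped = toy_sales.groupby("Platform")

# Number of distinct groups, the same value that len(grouped) reports.
print(grouped.ngroups)

# Mapping of each group label to the row labels that belong to it.
group_rows = {label: list(rows) for label, rows in grouped.groups.items()}
print(group_rows)

# Number of rows in each group.
print(grouped.size())
```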

## Customizing Index to Group Mappings

So far in this section we have grouped by "Platform", resulting in regional sales totals for all platforms. This is actually equivalent to the following long form version of the syntax.

In [None]:
sales.groupby(sales['Platform']).sum()

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
PS3,392.26,343.71,79.99,141.93
PS4,96.8,123.7,14.3,43.36
X360,601.05,280.58,12.43,85.54
XOne,83.19,45.65,0.34,11.92


In fact, the long form is what Pandas is technically doing behind the scenes. The instructor however never uses this form due to its verbosity.

Now, suppose we don't want to analyze all four platforms individually, but instead want to analyze by fewer, broader groups than the values in the "Platform" column. For example, suppose we want to analyze by Playstation or Xbox.

To accomplish this in Pandas, we **provide a dictionary that maps each index label to the group label we want it to fall under**. As an added bonus, this can be done without changing the underlying data at all.

In [None]:
platform_names = {
    'PS3': 'Playstation',
    'PS4': 'Playstation',
    'X360': 'Xbox',
    'XOne': 'Xbox'
}

In [None]:
platform_names

{'PS3': 'Playstation', 'PS4': 'Playstation', 'X360': 'Xbox', 'XOne': 'Xbox'}

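Let's now set Platform as the index of our dataframe. This is important because a dictionary passed to `groupby()` is matched against the index labels of the dataframe, not against the values of a column. Thus, we must make the column into the index. Note that `set_index()` returns a new dataframe, so `sales` itself is left unchanged.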

In [None]:
sales.set_index('Platform')

Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
X360,14.97,4.94,0.24,1.67
PS3,7.01,9.27,0.97,4.14
X360,9.63,5.31,0.06,1.38
X360,9.03,4.28,0.13,1.32
X360,9.67,3.73,0.11,1.13
...,...,...,...,...
X360,0.00,0.01,0.00,0.00
XOne,0.01,0.00,0.00,0.00
XOne,0.01,0.00,0.00,0.00
PS4,0.00,0.01,0.00,0.00


Finally, we use `groupby()` to create our new subgroups. Instead of passing in a column name like we did previously, we pass in the dictionary of platform names.

In [None]:
sales.set_index('Platform').groupby(platform_names).sum()

Unnamed: 0,NA_Sales,EU_Sales,JP_Sales,Other_Sales
Playstation,489.06,467.41,94.29,185.29
Xbox,684.24,326.23,12.77,97.46


Thus, we now get Playstation and Xbox totals across all regions. And we got this without affecting the structure of the underlying data.
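As an alternative sketch (not the approach used in this lecture), the same dictionary can be applied with `Series.map()` and the resulting series passed to `groupby()`, which avoids the `set_index()` step. The `toy_sales` frame and the "Family" label below are made up for illustration.

```python
import pandas as pd

# Made-up stand-in for the `sales` dataframe.
toy_sales = pd.DataFrame({
    "Platform": ["X360", "PS3", "XOne", "PS4"],
    "NA_Sales": [3.0, 1.0, 2.0, 1.5],
    "EU_Sales": [1.0, 2.0, 0.5, 2.5],
})

platform_names = {
    "PS3": "Playstation",
    "PS4": "Playstation",
    "X360": "Xbox",
    "XOne": "Xbox",
}

# Translate each row's platform into its console family ("Family" is a
# hypothetical name chosen for this example), then group by that series.
family = toy_sales["Platform"].map(platform_names).rename("Family")
by_family = toy_sales.groupby(family)[["NA_Sales", "EU_Sales"]].sum()
print(by_family)
```

Either way, the original dataframe is left as it was; the mapping only affects how rows are assigned to groups.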

## BONUS - Series `groupby()`

The split-apply-combine pattern that is utilized by `groupby()` is entirely compatible with Series in addition to dataframes.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.groupby.html

In the first example, we'll take a look at global sales by Genre. We start by creating a series for this data.

In [None]:
games.head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Kinect Adventures!,X360,2010.0,Misc,Microsoft Game Studios,14.97,4.94,0.24,1.67,21.82
1,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.4
2,Grand Theft Auto V,X360,2013.0,Action,Take-Two Interactive,9.63,5.31,0.06,1.38,16.38
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64


In [None]:
games.loc[:, ['Genre', 'Global_Sales']].set_index('Genre')

Genre,Global_Sales
Misc,21.82
Action,21.40
Action,16.38
Shooter,14.76
Shooter,14.64
...,...
Role-Playing,0.01
Platform,0.01
Shooter,0.01
Simulation,0.01


Note that this is still technically a dataframe.

In [None]:
type(games.loc[:, ['Genre', 'Global_Sales']].set_index('Genre'))

pandas.core.frame.DataFrame

We can use the `squeeze()` method to convert it into a series, which we'll call *ser*.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.squeeze.html

In [None]:
ser = games.loc[:, ['Genre', 'Global_Sales']].set_index('Genre').squeeze()

In [None]:
ser.head(10)

Genre
Misc       21.82
Action     21.40
Action     16.38
Shooter    14.76
Shooter    14.64
Shooter    14.24
Shooter    14.03
Shooter    13.73
Shooter    13.51
Shooter    13.46
Name: Global_Sales, dtype: float64

In [None]:
type(ser)

pandas.core.series.Series

We now have a series which gives the genre and global sales for each game (note that we have no idea which game it is, but that's not important for this example).

To calculate the average global sales by genre, we can apply the `groupby()` method to this series.

In [None]:
ser.groupby('Genre').mean()

Genre
Action          0.751007
Adventure       0.298289
Fighting        0.604182
Misc            0.550250
Platform        0.651842
Puzzle          0.133636
Racing          0.687854
Role-Playing    0.715804
Shooter         1.412019
Simulation      0.336076
Sports          0.681094
Strategy        0.264333
Name: Global_Sales, dtype: float64

Let's sort it to make it nicer looking.

In [None]:
ser.groupby('Genre').mean().sort_values(ascending = False)

Genre
Shooter         1.412019
Action          0.751007
Role-Playing    0.715804
Racing          0.687854
Sports          0.681094
Platform        0.651842
Fighting        0.604182
Misc            0.550250
Simulation      0.336076
Adventure       0.298289
Strategy        0.264333
Puzzle          0.133636
Name: Global_Sales, dtype: float64

Looks like shooters are quite popular.

As an aside, the following code could have also been used to accomplish the same task by acting on the initial dataframe.

In [None]:
games.groupby('Genre')['Global_Sales'].mean()

Genre
Action          0.751007
Adventure       0.298289
Fighting        0.604182
Misc            0.550250
Platform        0.651842
Puzzle          0.133636
Racing          0.687854
Role-Playing    0.715804
Shooter         1.412019
Simulation      0.336076
Sports          0.681094
Strategy        0.264333
Name: Global_Sales, dtype: float64
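When grouping a Series by its index, as we did with *ser*, the index can also be named explicitly through the `level` parameter. A small sketch on a made-up series (`ser_toy` is an assumption for illustration) shows both spellings giving the same averages.

```python
import pandas as pd

# Made-up series indexed by genre, mimicking the structure of `ser`.
ser_toy = pd.Series(
    [4.0, 2.0, 1.0, 4.0],
    index=pd.Index(["Shooter", "Action", "Shooter", "Action"], name="Genre"),
    name="Global_Sales",
)

# Group by the index name, as done with `ser` above.
by_name = ser_toy.groupby("Genre").mean()

# Group by the index level explicitly.
by_level = ser_toy.groupby(level="Genre").mean()

print(by_name)
print(by_level)
```

The `level` form makes it clear to a reader that the grouping key lives in the index rather than in a column.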

## Skill Challenge

#### 1. Create a smaller dataframe from *games*, selecting only the Publisher, Genre, Platform, and NA_Sales columns. Assign this dataframe to the variable *publishers*.

We can accomplish this by using a simple `loc[]` indexing command.

In [None]:
publishers = games.loc[:, ['Publisher','Genre','Platform','NA_Sales']]

In [None]:
publishers.head(10)

Unnamed: 0,Publisher,Genre,Platform,NA_Sales
0,Microsoft Game Studios,Misc,X360,14.97
1,Take-Two Interactive,Action,PS3,7.01
2,Take-Two Interactive,Action,X360,9.63
3,Activision,Shooter,X360,9.03
4,Activision,Shooter,X360,9.67
5,Activision,Shooter,PS4,5.77
6,Activision,Shooter,PS3,4.99
7,Activision,Shooter,X360,8.25
8,Activision,Shooter,X360,8.52
9,Activision,Shooter,PS3,5.54


#### 2. From the *publishers* dataframe, find the top 10 game publishers in North America by total sales.

We'll use the `groupby()` method here, grouping by Publisher and then summing over "NA_Sales", and then sorting in descending order.

In [None]:
publishers.groupby("Publisher")['NA_Sales'].sum().sort_values(ascending = False).head(10)

Publisher
Electronic Arts                           213.38
Activision                                193.16
Take-Two Interactive                      120.99
Microsoft Game Studios                    116.77
Ubisoft                                    98.65
Sony Computer Entertainment                76.35
Warner Bros. Interactive Entertainment     45.24
THQ                                        36.44
Bethesda Softworks                         33.88
Capcom                                     24.74
Name: NA_Sales, dtype: float64

In [None]:
type(publishers.groupby("Publisher")['NA_Sales'].sum().sort_values(ascending = False).head(10))

pandas.core.series.Series

Note that we could also have done this without selecting for "NA_Sales", because only the "NA_Sales" column contains numeric data. Older versions of pandas dropped non-numeric columns automatically when summing; from pandas 2.0 onward, pass `numeric_only=True` to `sum()` to get that behavior. This approach does mean that we need to identify the column to sort by when implementing `sort_values()` by passing it into the `by` parameter.

Essentially you can identify the column now or later. It's up to you. The only difference between the approaches is that the former results in a Series, while the latter results in a dataframe with only one column.

In [None]:
publishers.groupby("Publisher").sum(numeric_only=True).sort_values(by = "NA_Sales", ascending = False).head(10)

Publisher,NA_Sales
Electronic Arts,213.38
Activision,193.16
Take-Two Interactive,120.99
Microsoft Game Studios,116.77
Ubisoft,98.65
Sony Computer Entertainment,76.35
Warner Bros. Interactive Entertainment,45.24
THQ,36.44
Bethesda Softworks,33.88
Capcom,24.74


In [None]:
type(publishers.groupby("Publisher").sum(numeric_only=True).sort_values(by = "NA_Sales", ascending = False).head(10))

pandas.core.frame.DataFrame

This shows that Electronic Arts is the highest-selling publisher, followed by Activision, Take-Two Interactive, and so forth.
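The Series-versus-dataframe distinction can also be controlled with the brackets used after `groupby()`. The sketch below, on a made-up `toy_publishers` frame, contrasts single and double brackets, and uses `nlargest()` as a possible shortcut for "sort descending, then take the top n".

```python
import pandas as pd

# Made-up stand-in for the `publishers` dataframe.
toy_publishers = pd.DataFrame({
    "Publisher": ["A", "B", "A", "C", "B"],
    "Genre": ["Action", "Sports", "Shooter", "Action", "Sports"],
    "NA_Sales": [1.0, 2.0, 3.0, 0.5, 1.5],
})

# Single brackets select one column, so the result is a Series.
as_series = toy_publishers.groupby("Publisher")["NA_Sales"].sum()

# Double brackets select a list of columns, so the result is a DataFrame.
as_frame = toy_publishers.groupby("Publisher")[["NA_Sales"]].sum()

# nlargest() orders by value, largest first, and keeps the top n rows.
top_two = as_series.nlargest(2)

print(type(as_series), type(as_frame))
print(top_two)
```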

#### 3. Determine the gaming platform/system that has attracted the most sales in North America.

This is similar to above, except we will group by "Platform" instead of by "Publisher". We will also change up our approach by first creating a Series with just the "Platform" and "NA_Sales" data, and then performing our `groupby()` aggregate analysis. The logic flows as follows:
1. Use the `loc[]` indexer to isolate the "Platform" and "NA_Sales" columns from *publishers*.
2. Use `set_index()` to set the index to "Platform".
3. Use `squeeze()` to convert our dataframe to a Series.
4. Use `groupby()` to aggregate by "Platform".
5. Apply the `sum()` function.

In [None]:
publishers.loc[:, ['Platform','NA_Sales']].set_index("Platform").squeeze().groupby('Platform').sum()

Platform
PS3     392.26
PS4      96.80
X360    601.05
XOne     83.19
Name: NA_Sales, dtype: float64

We therefore see that the Xbox 360 had the greatest sales in North America by a significant margin, with total sales of $601 million.

The alternative dataframe approach is equally valid and arguably much easier. Once again this approach leaves you with a dataframe instead of a series.

In [None]:
publishers.groupby("Platform").sum(numeric_only=True)

Platform,NA_Sales
PS3,392.26
PS4,96.8
X360,601.05
XOne,83.19


## Iterating Through Groups

Recall that the `groupby()` method returns a DataFrameGroupBy object in which the subgroups have been defined, but no function has been applied to them yet.

In [None]:
sales.groupby('Platform')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd5931961d0>

In [None]:
sales.Platform.unique()

array(['X360', 'PS3', 'PS4', 'XOne'], dtype=object)

How do we access the information in that object without actually doing anything to it? It's actually simple - it's an iterable that we can iterate over.

In [None]:
for i in sales.groupby('Platform'):
  print(i)

('PS3',      Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
1         PS3      7.01      9.27      0.97         4.14
6         PS3      4.99      5.88      0.65         2.52
9         PS3      5.54      5.82      0.49         1.62
10        PS3      5.98      4.44      0.48         1.83
14        PS3      2.96      4.88      0.81         2.12
...       ...       ...       ...       ...          ...
3124      PS3      0.00      0.01      0.00         0.00
3125      PS3      0.00      0.00      0.01         0.00
3129      PS3      0.00      0.00      0.01         0.00
3132      PS3      0.00      0.00      0.01         0.00
3136      PS3      0.00      0.00      0.01         0.00

[1329 rows x 5 columns])
('PS4',      Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
5         PS4      5.77      5.81      0.35         2.31
12        PS4      3.80      5.81      0.36         2.02
24        PS4      1.11      6.06      0.06         1.26
26        PS4      2.93      3.29      0.22   

What the heck are we looking at here? 

Well first off, there are four blocks of "stuff", which lines up with the fact that we have four subgroups ("Platforms" in this case) in this particular groupby. 

Secondly, each "block" consists of two things - a "name", and a "dataframe-like object". Let's modify our call a bit so that we can access the labels and dataframe-like objects inside.

In [None]:
for name, df in sales.groupby('Platform'):
  print('-----------------')
  print('Subgroup Label: ', name)
  print('-----------------')
  print(df, '\n')


-----------------
Subgroup Label:  PS3
-----------------
     Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
1         PS3      7.01      9.27      0.97         4.14
6         PS3      4.99      5.88      0.65         2.52
9         PS3      5.54      5.82      0.49         1.62
10        PS3      5.98      4.44      0.48         1.83
14        PS3      2.96      4.88      0.81         2.12
...       ...       ...       ...       ...          ...
3124      PS3      0.00      0.01      0.00         0.00
3125      PS3      0.00      0.00      0.01         0.00
3129      PS3      0.00      0.00      0.01         0.00
3132      PS3      0.00      0.00      0.01         0.00
3136      PS3      0.00      0.00      0.01         0.00

[1329 rows x 5 columns] 

-----------------
Subgroup Label:  PS4
-----------------
     Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
5         PS4      5.77      5.81      0.35         2.31
12        PS4      3.80      5.81      0.36         2.02
24  

We can now see the content of that DataFrameGroupBy object. Again we see four groups, one for each platform, and the associated data for each platform. With this view, we can think of `groupby()` aggregate analysis as applying individually to each one of these subgroups within the DataFrameGroupBy object.
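Since each item in the loop is just a label paired with a dataframe, we can do ordinary dataframe work inside the loop. Here is a small sketch on a made-up `toy_sales` frame that collects the row count of every subgroup (the `row_counts` dictionary is a name chosen for this example).

```python
import pandas as pd

# Made-up stand-in for the `sales` dataframe.
toy_sales = pd.DataFrame({
    "Platform": ["X360", "PS3", "X360", "PS4"],
    "NA_Sales": [3.0, 1.0, 2.0, 1.5],
})

row_counts = {}
for name, df in toy_sales.groupby("Platform"):
    # `name` is the group label; `df` holds only that group's rows.
    row_counts[name] = len(df)
    print(name, df.shape, df["NA_Sales"].max())

print(row_counts)
```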

## Handpicking Subgroups with `get_group()`



We saw in the previous lecture that we can easily iterate through each one of the subgroups in the DataFrameGroupBy. Occasionally, you may get output that is surprising or seemingly nonsensical, such as null values or invalid datapoints that may affect the results in ways that are confusing.

Therefore, it is sometimes useful to source subgroups in isolation. How do we do this?

It's tempting to use square bracketing to select a single subgroup. Let's try it. (Spoiler: it won't work)

In [None]:
## This results in a "Column not found" error.
# sales.groupby('Platform')['PS3']

That didn't work because square bracket notation is used for selecting **columns**. In this case, the individual platforms are not columns - they are values embedded within the "Platform" column.

As a side note, bracket notation is a totally valid way to select for columns after a `groupby()`. Notice that when you do this (and as we saw above), you get a reduction in dimensionality, resulting in a SeriesGroupBy object.

In [None]:
sales.groupby('Platform')['JP_Sales']

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fd593155890>

Back to the question at hand: how do we access any given subgroup? One way to extract a single subgroup is to convert the entire DataFrameGroupBy object into an iterator, and then convert that iterator into a dictionary.

In [None]:
dict(iter(sales.groupby('Platform')))

{'PS3':      Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
 1         PS3      7.01      9.27      0.97         4.14
 6         PS3      4.99      5.88      0.65         2.52
 9         PS3      5.54      5.82      0.49         1.62
 10        PS3      5.98      4.44      0.48         1.83
 14        PS3      2.96      4.88      0.81         2.12
 ...       ...       ...       ...       ...          ...
 3124      PS3      0.00      0.01      0.00         0.00
 3125      PS3      0.00      0.00      0.01         0.00
 3129      PS3      0.00      0.00      0.01         0.00
 3132      PS3      0.00      0.00      0.01         0.00
 3136      PS3      0.00      0.00      0.01         0.00
 
 [1329 rows x 5 columns],
 'PS4':      Platform  NA_Sales  EU_Sales  JP_Sales  Other_Sales
 5         PS4      5.77      5.81      0.35         2.31
 12        PS4      3.80      5.81      0.36         2.02
 24        PS4      1.11      6.06      0.06         1.26
 26        PS4      2.93      

What did we do here? We created a dictionary, where the platform label is the dictionary key and the value is the dataframe-like stuff. We can now treat this as a normal dictionary and access the dataframe-like information for a specific platform by identifying that platform as the key.
* Important note: We did not apply any aggregate function on this. We're simply pulling data out of the `groupby()` object.

In [None]:
dict(iter(sales.groupby('Platform')))['PS3']

Unnamed: 0,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
1,PS3,7.01,9.27,0.97,4.14
6,PS3,4.99,5.88,0.65,2.52
9,PS3,5.54,5.82,0.49,1.62
10,PS3,5.98,4.44,0.48,1.83
14,PS3,2.96,4.88,0.81,2.12
...,...,...,...,...,...
3124,PS3,0.00,0.01,0.00,0.00
3125,PS3,0.00,0.00,0.01,0.00
3129,PS3,0.00,0.00,0.01,0.00
3132,PS3,0.00,0.00,0.01,0.00


The instructor doesn't like this approach because you have to go through the annoying step of creating a dictionary with all of the information available, prior to pulling out what we want.

Pandas can help us circumvent this nonsense with the `get_group()` method, which is executed on a `GroupBy` object. It does the exact same thing as the convoluted path we took above.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.get_group.html
* This is more efficient because it doesn't waste computational power calculating values for each group, when we only want one group.
* It is also more efficient because we stay in "Pandas code" the entire time, not having to switch to pure Python

In [None]:
sales.groupby("Platform").get_group('PS3')

Unnamed: 0,Platform,NA_Sales,EU_Sales,JP_Sales,Other_Sales
1,PS3,7.01,9.27,0.97,4.14
6,PS3,4.99,5.88,0.65,2.52
9,PS3,5.54,5.82,0.49,1.62
10,PS3,5.98,4.44,0.48,1.83
14,PS3,2.96,4.88,0.81,2.12
...,...,...,...,...,...
3124,PS3,0.00,0.01,0.00,0.00
3125,PS3,0.00,0.00,0.01,0.00
3129,PS3,0.00,0.00,0.01,0.00
3132,PS3,0.00,0.00,0.01,0.00
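As a quick sanity check, `get_group()` should hand back the same rows that the boolean indexing from the Conditional Aggregates lecture would select. A sketch on a made-up `toy_sales` frame:

```python
import pandas as pd

# Made-up stand-in for the `sales` dataframe.
toy_sales = pd.DataFrame({
    "Platform": ["X360", "PS3", "X360", "PS3"],
    "NA_Sales": [3.0, 1.0, 2.0, 4.0],
})

# Pull one subgroup out of the GroupBy object.
via_get_group = toy_sales.groupby("Platform").get_group("PS3")

# Select the same platform with a boolean mask.
via_mask = toy_sales.loc[toy_sales["Platform"] == "PS3"]

# Compare the row labels and values of the two selections.
print(list(via_get_group.index), list(via_mask.index))
print(via_get_group["NA_Sales"].tolist(), via_mask["NA_Sales"].tolist())
```

In both cases the original row labels are preserved, which makes it easy to trace a surprising value back to the source data.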


## MultiIndex Grouping

We can use `groupby()` with multiple keys!

In [None]:
games.head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Kinect Adventures!,X360,2010.0,Misc,Microsoft Game Studios,14.97,4.94,0.24,1.67,21.82
1,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.4
2,Grand Theft Auto V,X360,2013.0,Action,Take-Two Interactive,9.63,5.31,0.06,1.38,16.38
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64


Let's start by subsetting the dataframe to just include Genre, Publisher, and Global_Sales.

In [None]:
studios = games.loc[:, ['Genre','Publisher','Global_Sales']]

In [None]:
studios.head()

Unnamed: 0,Genre,Publisher,Global_Sales
0,Misc,Microsoft Game Studios,21.82
1,Action,Take-Two Interactive,21.4
2,Action,Take-Two Interactive,16.38
3,Shooter,Activision,14.76
4,Shooter,Activision,14.64


We have a 3-column dataframe with each row representing a single game.

We previously grouped by single keys as follows. Suppose we want to know the top publisher by Global_Sales. Easy enough with `groupby()`.

In [None]:
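# numeric_only=True sums only Global_Sales; newer pandas versions no longer drop the text Genre column automatically
studios.groupby('Publisher').sum(numeric_only = True).sort_values(by = 'Global_Sales', ascending = False)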

Unnamed: 0_level_0,Global_Sales
Publisher,Unnamed: 1_level_1
Electronic Arts,434.41
Activision,349.22
Take-Two Interactive,218.08
Ubisoft,201.98
Microsoft Game Studios,190.56
...,...
UIG Entertainment,0.01
ChunSoft,0.01
Kaga Create,0.01
Epic Games,0.01


From this analysis, EA is the leading publisher by sales. However, this aggregation becomes inadequate if we add another dimension to our question. For instance, what if we want to go deeper and find the top publisher by "Global_Sales" for each genre of game?

To do this using `groupby()`, we pass in a *list of keys*.



In [None]:
studios.groupby(['Genre','Publisher'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd5930f45d0>

As before, in the absence of any function, this produces our DataFrameGroupBy object that is awaiting further orders.

Let's now force the evaluation.

In [None]:
studios.groupby(['Genre','Publisher']).sum().sort_values(by = "Global_Sales", ascending = False).head(20)

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Genre,Publisher,Unnamed: 2_level_1
Shooter,Activision,245.46
Sports,Electronic Arts,203.5
Action,Take-Two Interactive,106.04
Action,Ubisoft,96.44
Shooter,Electronic Arts,92.58
Shooter,Microsoft Game Studios,77.02
Action,Warner Bros. Interactive Entertainment,71.89
Action,Sony Computer Entertainment,60.38
Sports,Take-Two Interactive,56.89
Action,Electronic Arts,49.61


We now see that while EA is the leading publisher overall, Activision has the single largest genre-level total. More specifically, Activision leads all publishers in the sales of Shooter games, while EA leads the pack in sales of Sports games.

More generally, notice how this multiple-key `groupby()` broke our analysis down even further into groups that are indexed by both genre and publisher. This allows us to answer a more specific question than aggregate analysis by any one key alone.

Finally, note that using multiple keys with `groupby()` produces a dataframe whose row index is a MultiIndex, with one level per key.
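
As a small illustration of working with that MultiIndex, the sketch below answers the "top publisher for each genre" question on a made-up dataframe. The name `toy` and its numbers are hypothetical, and sorting followed by `head(1)` per genre is just one of several ways to do this.

```python
import pandas as pd

# Hypothetical toy data with the same three columns as `studios`
toy = pd.DataFrame({
    "Genre": ["Shooter", "Shooter", "Shooter", "Sports", "Sports"],
    "Publisher": ["Activision", "Activision", "Electronic Arts",
                  "Electronic Arts", "Take-Two Interactive"],
    "Global_Sales": [14.76, 14.64, 7.23, 8.24, 3.10],
})

# Grouping by two keys gives a Series indexed by (Genre, Publisher)
totals = toy.groupby(["Genre", "Publisher"])["Global_Sales"].sum()
print(totals.index.names)

# Sort by total sales, then keep the first (largest) row within each genre
top_per_genre = (
    totals.sort_values(ascending=False)
          .groupby(level="Genre")
          .head(1)
)
print(top_per_genre)
```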

## Fine-Tuned Aggregates with `agg()`

Let's now shift our attention to the aggregation functions. So far we have mainly been using the built-in aggregate functions such as `sum()` and `max()`. 

Taking `sum()` as an example, the following command is equivalent to what we've done above.


In [None]:
studios.groupby(['Genre','Publisher']).aggregate('sum')

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Genre,Publisher,Unnamed: 2_level_1
Action,505 Games,2.25
Action,Abylight,0.08
Action,Ackkstudios,0.33
Action,Acquire,0.11
Action,Activision,42.84
...,...,...
Strategy,Square Enix,0.35
Strategy,Takara Tomy,0.09
Strategy,Take-Two Interactive,2.92
Strategy,Tecmo Koei,0.58


Or equivalently, we can use the shorthand alias `agg()`.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html

In [None]:
studios.groupby(['Genre','Publisher']).agg('sum')

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Genre,Publisher,Unnamed: 2_level_1
Action,505 Games,2.25
Action,Abylight,0.08
Action,Ackkstudios,0.33
Action,Acquire,0.11
Action,Activision,42.84
...,...,...
Strategy,Square Enix,0.35
Strategy,Takara Tomy,0.09
Strategy,Take-Two Interactive,2.92
Strategy,Tecmo Koei,0.58


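The `agg()` method also accepts NumPy functions, although recent pandas versions may emit a FutureWarning recommending the string alias (`'sum'`) instead.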

In [None]:
studios.groupby(['Genre','Publisher']).agg(np.sum)

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Genre,Publisher,Unnamed: 2_level_1
Action,505 Games,2.25
Action,Abylight,0.08
Action,Ackkstudios,0.33
Action,Acquire,0.11
Action,Activision,42.84
...,...,...
Strategy,Square Enix,0.35
Strategy,Takara Tomy,0.09
Strategy,Take-Two Interactive,2.92
Strategy,Tecmo Koei,0.58
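
Beyond strings and NumPy functions, `agg()` also accepts our own functions. Here is a minimal sketch on made-up data; `toy` and the helper `sales_range` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Shooter", "Shooter"],
    "Global_Sales": [21.40, 0.02, 14.76, 0.02],
})

def sales_range(values):
    """Gap between the best and worst seller in a group (illustrative helper)."""
    return values.max() - values.min()

# A custom function is called once per group and must return a single value
spread = toy.groupby("Genre")["Global_Sales"].agg(sales_range)
print(spread)

# Strings and custom functions can be mixed in one list;
# the function's name becomes the column label
mixed = toy.groupby("Genre")["Global_Sales"].agg(["sum", sales_range])
print(mixed)
```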


One of the biggest advantages of using the `agg()` function is that **we can apply several functions in one go**, as the function accepts a Python list of functions. 

As an example, suppose we want to summarize the sum, average, and standard deviation of sales, as well as the number of games published, for each publisher within each genre. This sounds like a complicated mess of a question, but with `agg()` it becomes relatively simple to answer.

We start by grouping our data according to Genre and Publisher, as that is the question we want to answer. We will then pass in a list of functions.

In [None]:
studios.groupby(['Genre','Publisher']).agg(['sum','count','mean','std'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales,Global_Sales,Global_Sales,Global_Sales
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,count,mean,std
Genre,Publisher,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Action,505 Games,2.25,8,0.281250,0.266482
Action,Abylight,0.08,1,0.080000,
Action,Ackkstudios,0.33,1,0.330000,
Action,Acquire,0.11,1,0.110000,
Action,Activision,42.84,95,0.450947,0.559717
...,...,...,...,...,...
Strategy,Square Enix,0.35,1,0.350000,
Strategy,Takara Tomy,0.09,1,0.090000,
Strategy,Take-Two Interactive,2.92,6,0.486667,0.364289
Strategy,Tecmo Koei,0.58,6,0.096667,0.055015


Check this out. We now have an aggregate analysis for each publisher and each genre, with data pertaining to the functions that we passed in. Note that some publishers have only one game within a given genre. The sample standard deviation is undefined for a single observation, so those rows show a missing value (NaN) in the std column.
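
The missing standard deviations are easy to reproduce. The sketch below uses a tiny made-up dataframe (`toy` is hypothetical) and also shows that the population standard deviation, requested with `ddof=0`, is defined even for a single game.

```python
import pandas as pd

# Hypothetical toy data: one publisher with a single game, one with two
toy = pd.DataFrame({
    "Publisher": ["Abylight", "505 Games", "505 Games"],
    "Global_Sales": [0.08, 0.10, 0.30],
})

# The default std is the sample standard deviation (ddof=1),
# which is NaN when a group holds only one value
stats = toy.groupby("Publisher")["Global_Sales"].agg(["count", "std"])
print(stats)

# The population standard deviation (ddof=0) is 0 for a single value
pop_std = toy.groupby("Publisher")["Global_Sales"].std(ddof=0)
print(pop_std)
```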

Notice that we've created a hierarchical index on both axes of the dataframe. The row axis has a two-level index: Genre and Publisher. The column axis has a two-level index as well: Global_Sales on the higher level, then sum, count, mean, and std on the lower level.

In [None]:
studios.groupby(['Genre','Publisher']).agg(['sum','count','mean','std']).columns

MultiIndex([('Global_Sales',   'sum'),
            ('Global_Sales', 'count'),
            ('Global_Sales',  'mean'),
            ('Global_Sales',   'std')],
           )

Thus, if we want to sort by a column in this multi-index dataframe, we need to approach it slightly differently. Case in point, we cannot simply sort by "Global_Sales".

In [None]:
## This results in a ValueError
# studios.groupby(['Genre','Publisher']).agg(['sum','count','mean','std']).sort_values(by = "Global_Sales")

For a multi-index dataframe with multiple levels on the column axis, we need to identify each column uniquely by specifying labels on each level. This means we have to use a tuple to perform our `sort_values()` method.

In [None]:
studios.groupby(['Genre','Publisher']).agg(['sum','count','mean','std']).sort_values(by = ("Global_Sales",'sum'), ascending = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales,Global_Sales,Global_Sales,Global_Sales
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,count,mean,std
Genre,Publisher,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Shooter,Activision,245.46,72,3.409167,4.621920
Sports,Electronic Arts,203.50,170,1.197059,1.404108
Action,Take-Two Interactive,106.04,23,4.610435,5.843768
Action,Ubisoft,96.44,67,1.439403,1.636460
Shooter,Electronic Arts,92.58,50,1.851600,1.794404
...,...,...,...,...,...
Adventure,Cave,0.01,1,0.010000,
Role-Playing,TopWare Interactive,0.01,1,0.010000,
Sports,"Interworks Unlimited, Inc.",0.01,1,0.010000,
Strategy,Ackkstudios,0.01,1,0.010000,


We have all of these nice metrics in one organized dataframe. What does this analysis tell us? Well, we see that Activision's Shooter lineup contains far fewer games (72) than EA's Sports lineup (170), yet it brings in more total revenue. As a result, its average sales per game are much higher (about 3.41 million versus 1.20 million).

What games does Activision publish? Let's find out.



In [None]:
games[games.Publisher == 'Activision'].sort_values('Global_Sales', ascending = False)

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64
5,Call of Duty: Black Ops 3,PS4,2015.0,Shooter,Activision,5.77,5.81,0.35,2.31,14.24
6,Call of Duty: Black Ops II,PS3,2012.0,Shooter,Activision,4.99,5.88,0.65,2.52,14.03
7,Call of Duty: Black Ops II,X360,2012.0,Shooter,Activision,8.25,4.30,0.07,1.12,13.73
...,...,...,...,...,...,...,...,...,...,...
2923,Ghostbusters (2016),XOne,2016.0,Action,Activision,0.02,0.00,0.00,0.00,0.02
2932,The Voice,PS3,2014.0,Action,Activision,0.02,0.00,0.00,0.00,0.02
2935,Rapala for Kinect,X360,2011.0,Sports,Activision,0.00,0.02,0.00,0.00,0.02
3012,Call of Duty: Modern Warfare Trilogy,PS3,2016.0,Shooter,Activision,0.00,0.01,0.00,0.00,0.02


The Call of Duty series seems to be a massive cash cow for Activision, especially in North America.

## Named Aggregations with `agg()`

In the last aggregation we did, Pandas by default assigned the function name (sum, count, mean, std) to the columns that were created in the resulting dataframe. We have the option to chain on a `rename()` method in which we can change the names of these columns.
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

In [None]:
studios.groupby(['Genre','Publisher']).agg(['sum','count','mean','std']).rename(columns = {'sum': 'total_revenue','count':'num_games'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales,Global_Sales,Global_Sales,Global_Sales
Unnamed: 0_level_1,Unnamed: 1_level_1,total_revenue,num_games,mean,std
Genre,Publisher,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Action,505 Games,2.25,8,0.281250,0.266482
Action,Abylight,0.08,1,0.080000,
Action,Ackkstudios,0.33,1,0.330000,
Action,Acquire,0.11,1,0.110000,
Action,Activision,42.84,95,0.450947,0.559717
...,...,...,...,...,...
Strategy,Square Enix,0.35,1,0.350000,
Strategy,Takara Tomy,0.09,1,0.090000,
Strategy,Take-Two Interactive,2.92,6,0.486667,0.364289
Strategy,Tecmo Koei,0.58,6,0.096667,0.055015


As an alternative, the `agg()` method itself also supports **named aggregations**. This is a special syntax that gives us the option of customizing the output column labels. 

To do this, we pass in tuples containing the aggregation columns and functions, and then assign names to those tuples based on what we want the column names to be. 

In [None]:
studios.groupby(['Genre','Publisher']).agg(total_global_revenue = ("Global_Sales",'sum'), num_games = ('Global_Sales', 'count'))

Unnamed: 0_level_0,Unnamed: 1_level_0,total_global_revenue,num_games
Genre,Publisher,Unnamed: 2_level_1,Unnamed: 3_level_1
Action,505 Games,2.25,8
Action,Abylight,0.08,1
Action,Ackkstudios,0.33,1
Action,Acquire,0.11,1
Action,Activision,42.84,95
...,...,...,...
Strategy,Square Enix,0.35,1
Strategy,Takara Tomy,0.09,1
Strategy,Take-Two Interactive,2.92,6
Strategy,Tecmo Koei,0.58,6


We have therefore grouped all games by Genre and Publisher, then applied the "sum" and "count" functions to the "Global_Sales" column. This results in a dataframe, indexed by the grouping keys, with the total revenue and the number of games for each publisher within each genre.

Notice that we do not have a multilevel axis on the column axis when using this approach. 
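
The tuples used above are shorthand for `pd.NamedAgg`. The sketch below, on a hypothetical `toy` dataframe, spells both forms out and confirms that the resulting column axis has a single level.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Shooter"],
    "Global_Sales": [2.0, 3.0, 5.0],
})

# Named aggregation with plain (column, function) tuples
with_tuples = toy.groupby("Genre").agg(
    total_global_revenue=("Global_Sales", "sum"),
    num_games=("Global_Sales", "count"),
)

# The same aggregation written with the more explicit pd.NamedAgg
with_named_agg = toy.groupby("Genre").agg(
    total_global_revenue=pd.NamedAgg(column="Global_Sales", aggfunc="sum"),
    num_games=pd.NamedAgg(column="Global_Sales", aggfunc="count"),
)

print(with_tuples.equals(with_named_agg))
print(with_tuples.columns.tolist())
print(with_tuples.columns.nlevels)
```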

Let's add a few more columns

In [None]:
studios.groupby(['Genre','Publisher']).agg(
    total_global_revenue = ("Global_Sales",'sum'), 
    num_games = ('Global_Sales', 'count'), 
    revenue_std = ('Global_Sales', 'std'), 
    revenue_avg = ('Global_Sales', 'mean')
    ).sort_values('total_global_revenue', ascending = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_global_revenue,num_games,revenue_std,revenue_avg
Genre,Publisher,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Shooter,Activision,245.46,72,4.621920,3.409167
Sports,Electronic Arts,203.50,170,1.404108,1.197059
Action,Take-Two Interactive,106.04,23,5.843768,4.610435
Action,Ubisoft,96.44,67,1.636460,1.439403
Shooter,Electronic Arts,92.58,50,1.794404,1.851600
...,...,...,...,...,...
Adventure,Cave,0.01,1,,0.010000
Role-Playing,TopWare Interactive,0.01,1,,0.010000
Sports,"Interworks Unlimited, Inc.",0.01,1,,0.010000
Strategy,Ackkstudios,0.01,1,,0.010000


Notice that we also chained a sort at the end. Since the column axis is single-level, we just specify the name of the column that we want to sort by.

The power of named aggregations doesn't stop there. We can elegantly aggregate various columns in one single step. Let's illustrate this by going back to our *games* dataframe.

In [None]:
games.groupby(["Genre","Publisher"]).agg(
    total_global_revenue=('Global_Sales', 'sum'),
    average_EU_revenue = ("EU_Sales",'mean')
)

Unnamed: 0_level_0,Unnamed: 1_level_0,total_global_revenue,average_EU_revenue
Genre,Publisher,Unnamed: 2_level_1,Unnamed: 3_level_1
Action,505 Games,2.25,0.131250
Action,Abylight,0.08,0.000000
Action,Ackkstudios,0.33,0.000000
Action,Acquire,0.11,0.000000
Action,Activision,42.84,0.143053
...,...,...,...
Strategy,Square Enix,0.35,0.100000
Strategy,Takara Tomy,0.09,0.000000
Strategy,Take-Two Interactive,2.92,0.145000
Strategy,Tecmo Koei,0.58,0.000000


We essentially applied two different aggregate functions to two different columns in an elegant, easy-to-understand manner.

We can achieve a similar effect by passing a Python dictionary that maps column names to functions into our `agg()` method. The disadvantage here, however, is that we don't have control over the output column names, which simply reuse the original column names. To get custom-named columns, we could chain on a `.rename()`.

In [None]:
games.groupby(["Genre", "Publisher"]).agg({
    'Global_Sales':'sum',
    'EU_Sales': 'mean'
})

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales,EU_Sales
Genre,Publisher,Unnamed: 2_level_1,Unnamed: 3_level_1
Action,505 Games,2.25,0.131250
Action,Abylight,0.08,0.000000
Action,Ackkstudios,0.33,0.000000
Action,Acquire,0.11,0.000000
Action,Activision,42.84,0.143053
...,...,...,...
Strategy,Square Enix,0.35,0.100000
Strategy,Takara Tomy,0.09,0.000000
Strategy,Take-Two Interactive,2.92,0.145000
Strategy,Tecmo Koei,0.58,0.000000
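
Here is a minimal sketch of that rename chain-on, using a hypothetical `toy` dataframe and the same output names as the named aggregation earlier.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Shooter"],
    "Global_Sales": [2.0, 3.0, 5.0],
    "EU_Sales": [1.0, 2.0, 4.0],
})

# The dictionary maps each column to the function applied to it;
# rename() then swaps in the custom output labels
renamed = (
    toy.groupby("Genre")
       .agg({"Global_Sales": "sum", "EU_Sales": "mean"})
       .rename(columns={
           "Global_Sales": "total_global_revenue",
           "EU_Sales": "average_EU_revenue",
       })
)
print(renamed)
```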


## The `filter()` Method with GroupBy

We can use the `groupby()` method for more than just aggregations (although that is likely the most common use case). Specifically, we can combine `groupby()` with `filter()` to exclude records from the dataframe based on group-level characteristics.

Suppose we want to analyze only games whose publishers have sold over $50 million in North America on a per-genre basis. The challenge here is in the complexity of the question: 

* We need to group by publisher and genre in order to access the relevant subgroups
* Next we need to calculate the total revenue for each subgroup
* Finally we need to exclude some records based on their subgroup total revenue. That is, we need to exclude all games whose publisher sold $50 million or less within that genre

* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html


In [None]:
games.groupby(['Publisher','Genre']).filter(lambda subgroup: subgroup['NA_Sales'].sum() > 50)

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64
5,Call of Duty: Black Ops 3,PS4,2015.0,Shooter,Activision,5.77,5.81,0.35,2.31,14.24
6,Call of Duty: Black Ops II,PS3,2012.0,Shooter,Activision,4.99,5.88,0.65,2.52,14.03
7,Call of Duty: Black Ops II,X360,2012.0,Shooter,Activision,8.25,4.30,0.07,1.12,13.73
...,...,...,...,...,...,...,...,...,...,...
2908,Cabela's Big Game Hunter: Pro Hunts,X360,2014.0,Shooter,Activision,0.02,0.00,0.00,0.00,0.03
3012,Call of Duty: Modern Warfare Trilogy,PS3,2016.0,Shooter,Activision,0.00,0.01,0.00,0.00,0.02
3033,NHL 16,X360,2015.0,Sports,Electronic Arts,0.00,0.02,0.00,0.00,0.02
3035,Call of Duty: Modern Warfare Trilogy,X360,2016.0,Shooter,Activision,0.01,0.01,0.00,0.00,0.02


So what did we do here? 
1. We grouped the games dataframe by Publisher and Genre using `groupby()`, which created a DataFrameGroupBy object. Remember that this object is an iterable containing one subgroup for each Publisher-Genre combination.
2. We then passed those subgroups into the lambda function, which calculates the total NA_Sales for each Publisher-Genre subgroup (aka subframe).
3. Finally, the lambda function compares that total to 50 and returns True or False. The `filter()` method keeps every row of a subgroup that returns True and drops every row of a subgroup that returns False. In other words, if a Publisher sold $50 million or less worth of games within a given genre, then that publisher's games of that genre are excluded from the output.
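
The steps above can be sketched on a small made-up dataframe (`toy` and its values are hypothetical). For comparison, the sketch also builds the same result from a boolean mask of per-row group totals, which is an alternative formulation rather than what `filter()` does internally.

```python
import pandas as pd

# Hypothetical toy data: three Publisher-Genre subgroups
toy = pd.DataFrame({
    "Publisher": ["A", "A", "B", "B", "C"],
    "Genre": ["Shooter", "Shooter", "Sports", "Sports", "Shooter"],
    "NA_Sales": [30.0, 25.0, 20.0, 10.0, 60.0],
})

# Keep only the rows whose subgroup total exceeds 50
kept = toy.groupby(["Publisher", "Genre"]).filter(
    lambda subgroup: subgroup["NA_Sales"].sum() > 50
)

# Alternative: broadcast each subgroup total back onto its rows, then mask
group_totals = toy.groupby(["Publisher", "Genre"])["NA_Sales"].transform("sum")
kept_with_mask = toy[group_totals > 50]

print(kept)
print(kept.equals(kept_with_mask))
```

Publisher B's Sports games total 30, so both of its rows are dropped, while every row for the other two subgroups survives unchanged.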

Let's take a closer look to better understand the output. Let's first look at the "Publisher" column of the output dataframe and see how many publishers made the cut.



In [None]:
games.groupby(['Publisher','Genre']).filter(lambda subgroup: subgroup['NA_Sales'].sum() > 50)['Publisher'].unique()

array(['Activision', 'Microsoft Game Studios', 'Electronic Arts'],
      dtype=object)

Only Activision, Microsoft Game Studios, and Electronic Arts have at least one genre in which their North American sales total more than $50 million. We can further verify this by performing a sum aggregate analysis on the original dataframe and sorting by "NA_Sales".

In [None]:
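# numeric_only=True restricts the sum to numeric columns; newer pandas versions no longer drop text columns automatically
games.groupby(['Publisher','Genre']).sum(numeric_only = True).sort_values(by = 'NA_Sales', ascending = False)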

Unnamed: 0_level_0,Unnamed: 1_level_0,Year,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
Publisher,Genre,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Activision,Shooter,142800.0,129.77,81.50,4.60,29.48,245.46
Electronic Arts,Sports,339823.0,99.73,77.19,1.05,25.47,203.50
Microsoft Game Studios,Shooter,36184.0,50.27,19.18,0.74,6.82,77.02
Take-Two Interactive,Action,46220.0,49.55,39.77,2.58,14.16,106.04
Electronic Arts,Shooter,100573.0,47.58,31.94,2.05,11.00,92.58
...,...,...,...,...,...,...,...
G.Rev,Fighting,2010.0,0.00,0.00,0.03,0.00,0.03
G.Rev,Shooter,2009.0,0.00,0.00,0.01,0.00,0.01
Genki,Racing,2007.0,0.00,0.00,0.03,0.00,0.03
Ghostlight,Role-Playing,2007.0,0.00,0.03,0.03,0.01,0.07


Indeed, if we look at the "NA_Sales" column, the only publishers that sold more than $50 million within a particular genre in North America were:
* Activision in the Shooter genre
* Electronic Arts in the Sports genre
* Microsoft Game Studios in the Shooter genre

Thus, only those games, 260 in total, are included in the output of the filter above.

As a final point, we don't need to use in-line lambda functions in the `filter()` method if we don't want to. We can define a standalone function just as well. Let's do that - we'll create a function that accepts a dataframe (each subgroup) as an argument, and then returns a boolean describing whether the sum for the subgroup for NA_Sales is greater than 50 million.

In [None]:
def more_than_50(df):
  return df['NA_Sales'].sum() > 50

In [None]:
games.groupby(['Publisher','Genre']).filter(more_than_50)

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64
5,Call of Duty: Black Ops 3,PS4,2015.0,Shooter,Activision,5.77,5.81,0.35,2.31,14.24
6,Call of Duty: Black Ops II,PS3,2012.0,Shooter,Activision,4.99,5.88,0.65,2.52,14.03
7,Call of Duty: Black Ops II,X360,2012.0,Shooter,Activision,8.25,4.30,0.07,1.12,13.73
...,...,...,...,...,...,...,...,...,...,...
2908,Cabela's Big Game Hunter: Pro Hunts,X360,2014.0,Shooter,Activision,0.02,0.00,0.00,0.00,0.03
3012,Call of Duty: Modern Warfare Trilogy,PS3,2016.0,Shooter,Activision,0.00,0.01,0.00,0.00,0.02
3033,NHL 16,X360,2015.0,Sports,Electronic Arts,0.00,0.02,0.00,0.00,0.02
3035,Call of Duty: Modern Warfare Trilogy,X360,2016.0,Shooter,Activision,0.01,0.01,0.00,0.00,0.02


In [None]:
# numeric_only=True restricts the sum to numeric columns
games.groupby(['Publisher','Genre']).sum(numeric_only = True)

Unnamed: 0_level_0,Unnamed: 1_level_0,Year,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
Publisher,Genre,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
49Games,Sports,2009.0,0.00,0.04,0.00,0.00,0.04
505 Games,Action,16098.0,0.84,1.05,0.10,0.26,2.25
505 Games,Adventure,4030.0,0.06,0.09,0.00,0.02,0.17
505 Games,Fighting,10049.0,0.28,0.09,0.02,0.05,0.44
505 Games,Misc,14078.0,0.56,0.16,0.00,0.06,0.80
...,...,...,...,...,...,...,...
Yeti,Fighting,2016.0,0.00,0.00,0.02,0.00,0.02
Zoo Games,Misc,2011.0,0.30,0.00,0.00,0.02,0.32
Zushi Games,Racing,2009.0,0.02,0.00,0.00,0.00,0.02
Zushi Games,Sports,2009.0,0.18,0.01,0.00,0.01,0.21


## GroupBy Transformations with `transform()`

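The `transform()` method also combines nicely with `groupby()`. Just like `filter()`, `transform()` accepts a function and applies it to each subgroup. Unlike an aggregation, it returns a new object with the same shape and index as the original data (the original dataframe itself is not modified). It gives us access to a whole new array of transforms that can be applied at the subgroup level.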
* https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html

Suppose our manager has asked us to convert the raw Global_Sales values to within-genre standard scores. In other words, how many standard deviations above or below its genre's average did each game sell? What do we need to do here? We need to convert the raw Global_Sales data to a measure of each game's sales relative to its genre's average and standard deviation.

In [None]:
games.head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Kinect Adventures!,X360,2010.0,Misc,Microsoft Game Studios,14.97,4.94,0.24,1.67,21.82
1,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.4
2,Grand Theft Auto V,X360,2013.0,Action,Take-Two Interactive,9.63,5.31,0.06,1.38,16.38
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64


Let's do a quick example before tackling this task. Suppose we have a game that sold $5 million and it came from a genre in which the average sales figure was 0.5 million (half a million dollars) and the standard deviation was 0.1 million (100,000 dollars).

In [None]:
game_sale = 5
game_genre_avg = 0.5
game_genre_std = 0.1

This game performed exceptionally well, as can be shown by the z-score. 

In [None]:
(game_sale - game_genre_avg) / game_genre_std

45.0

Thus, we have transformed the game_sale value to a z-score. 
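
Written as a formula, with $x$ for the game's sales and $\mu$ and $\sigma$ for the genre's mean and standard deviation (these symbols are our own shorthand, not names from the dataset), the calculation above is:

$$
z = \frac{x - \mu}{\sigma} = \frac{5 - 0.5}{0.1} = 45
$$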

Now back to the main question. Our goal is to convert all of our global_sales data into z-scores, and this conversion depends on the within-genre aggregate analysis of average and standard deviation.

Let's start by creating a new dataframe to focus only on the columns we need: Name, Genre, Platform, and Global_Sales

In [None]:
games_relative = games.loc[:, ['Name','Genre','Platform','Global_Sales']]

In [None]:
games_relative

Unnamed: 0,Name,Genre,Platform,Global_Sales
0,Kinect Adventures!,Misc,X360,21.82
1,Grand Theft Auto V,Action,PS3,21.40
2,Grand Theft Auto V,Action,X360,16.38
3,Call of Duty: Modern Warfare 3,Shooter,X360,14.76
4,Call of Duty: Black Ops,Shooter,X360,14.64
...,...,...,...,...
3138,Bound By Flame,Role-Playing,X360,0.01
3139,Mighty No. 9,Platform,XOne,0.01
3140,Resident Evil 4 HD,Shooter,XOne,0.01
3141,Farming 2017 - The Simulation,Simulation,PS4,0.01


Next, let's set the game name and platform as the index so it's clearer how we want to look at the data. If we don't do this step, then the basic range index will carry through to the end and won't be very informative.

In [None]:
games_relative.set_index(['Name','Platform'])

Unnamed: 0_level_0,Unnamed: 1_level_0,Genre,Global_Sales
Name,Platform,Unnamed: 2_level_1,Unnamed: 3_level_1
Kinect Adventures!,X360,Misc,21.82
Grand Theft Auto V,PS3,Action,21.40
Grand Theft Auto V,X360,Action,16.38
Call of Duty: Modern Warfare 3,X360,Shooter,14.76
Call of Duty: Black Ops,X360,Shooter,14.64
...,...,...,...
Bound By Flame,X360,Role-Playing,0.01
Mighty No. 9,XOne,Platform,0.01
Resident Evil 4 HD,XOne,Shooter,0.01
Farming 2017 - The Simulation,PS4,Simulation,0.01


Next, we'll group by genre. Remember that the final analysis depends on within-genre performance.

In [None]:
games_relative.set_index(['Name','Platform']).groupby("Genre")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fd593093fd0>

Finally, we apply the `transform()` method using the z-score calculation as a lambda function. 

In [None]:
games_relative.set_index(['Name','Platform']).groupby("Genre").transform(lambda x: (x - x.mean())/ x.std())

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Name,Platform,Unnamed: 2_level_1
Kinect Adventures!,X360,13.814162
Grand Theft Auto V,PS3,13.831175
Grand Theft Auto V,X360,10.468663
Call of Duty: Modern Warfare 3,X360,5.301112
Call of Duty: Black Ops,X360,5.253454
...,...,...
Bound By Flame,X360,-0.576944
Mighty No. 9,XOne,-0.728965
Resident Evil 4 HD,XOne,-0.556808
Farming 2017 - The Simulation,PS4,-0.728496


Very nice. We can also sort this result to see the top performers in terms of above-average sales for the genre.

In [None]:
games_relative.set_index(['Name','Platform']).groupby("Genre").transform(lambda x: (x - x.mean())/ x.std()).sort_values("Global_Sales", ascending = False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Global_Sales
Name,Platform,Unnamed: 2_level_1
Grand Theft Auto V,PS3,13.831175
Kinect Adventures!,X360,13.814162
Grand Theft Auto V,X360,10.468663
Gran Turismo 5,PS3,9.159261
Grand Theft Auto V,PS4,7.521441
...,...,...
Dragon Ball Z for Kinect,X360,-0.872762
Nitroplus Blasterz: Heroines Infinite Duel,PS3,-0.872762
Battle Fantasia,PS3,-0.872762
"Sakigake!! Otokojuku - Nihon yo, Kore ga Otoko Dearu!",PS3,-0.872762


According to our dataset, GTA V is a top seller within its genre, as indicated by its very high z-score.

Let's summarize what we did here.
1. We set the index of the *games_relative* dataframe to include the game name and platform. We did this so that once the aggregate analysis is done, we can see the results in terms of this index instead of the rather uninformative range index.
2. We performed a `groupby()` on "Genre", as we want to calculate the mean and standard deviation of total sales by genre.
3. We applied `transform()` to the groupby object, passing in a lambda function that calculates the z-score for every game using the mean and standard deviation of that game's own genre subgroup. The result has one row per game, just like the original dataframe.
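
The same recipe can be checked on a tiny made-up dataframe where the z-scores are easy to verify by hand (`toy` and its values are hypothetical). The second calculation, built from the string aliases `"mean"` and `"std"`, is an alternative we add for comparison.

```python
import pandas as pd

# Hypothetical toy data: Action has mean 2 and sample std 1
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Action", "Shooter", "Shooter"],
    "Global_Sales": [1.0, 2.0, 3.0, 10.0, 20.0],
})

by_genre = toy.groupby("Genre")["Global_Sales"]

# z-score via a lambda, as in the walkthrough above
z_lambda = by_genre.transform(lambda x: (x - x.mean()) / x.std())

# Same result from broadcasting the group mean and std back to each row
z_builtin = (toy["Global_Sales"] - by_genre.transform("mean")) / by_genre.transform("std")

print(z_lambda)
print(len(z_lambda) == len(toy))
```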

To verify the result and convince ourselves that it is correct, let's do the analysis for GTA V manually. First, we isolate GTA V from the rest of the dataframe and determine its sales.



In [None]:
games_relative.loc[games_relative['Name'] == 'Grand Theft Auto V']

Unnamed: 0,Name,Genre,Platform,Global_Sales
1,Grand Theft Auto V,Action,PS3,21.4
2,Grand Theft Auto V,Action,X360,16.38
12,Grand Theft Auto V,Action,PS4,11.98
65,Grand Theft Auto V,Action,XOne,5.08


We see that GTA V sold 21.4 million in global sales on the PS3. Keeping this in mind, we now perform a `groupby()` on "Genre" and calculate the mean and standard deviation of sales per genre.

In [None]:
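# numeric_only=True skips the text columns (Name, Platform), which cannot be averaged
games_relative.groupby("Genre").mean(numeric_only = True)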

Unnamed: 0_level_0,Global_Sales
Genre,Unnamed: 1_level_1
Action,0.751007
Adventure,0.298289
Fighting,0.604182
Misc,0.55025
Platform,0.651842
Puzzle,0.133636
Racing,0.687854
Role-Playing,0.715804
Shooter,1.412019
Simulation,0.336076


In [None]:
# numeric_only=True skips the text columns (Name, Platform)
games_relative.groupby("Genre").std(numeric_only = True)

Unnamed: 0_level_0,Global_Sales
Genre,Unnamed: 1_level_1
Action,1.492931
Adventure,0.750145
Fighting,0.680807
Misc,1.539706
Platform,0.880484
Puzzle,0.126354
Racing,1.10076
Role-Playing,1.223348
Shooter,2.517959
Simulation,0.447602


GTA V is an action game with sales of 21.4 million dollars on the PS3. From our calculations, within the entire action genre, the mean sale was 0.751007 million dollars while the standard deviation was 1.492931 million dollars. With this data we can calculate the z-score for GTA V on PS3.

In [None]:
(21.4 - 0.751007) / 1.492931

13.83117706042677

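Thus, our manually calculated z-score matches what we got from the `transform()` method in one line of code. The tiny difference in the last digits comes from the rounded mean and standard deviation that we typed in by hand.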

Some key takeaways from these recent lessons when pairing `groupby()` with additional methods:
* The `agg()` method reduces each group to a summary, so the result usually has one row per group rather than one row per original record.
* The `filter()` method keeps or removes entire groups of records according to a group-level criterion.
* The `transform()` method returns new values with the same shape and index as the original data. It does not modify the original dataframe.
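
These differences show up clearly in the shapes of the results. Here is a minimal sketch on hypothetical data (`toy` and the threshold of 2 are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data: 5 rows spread over 3 genres
toy = pd.DataFrame({
    "Genre": ["Action", "Action", "Shooter", "Shooter", "Puzzle"],
    "Global_Sales": [1.0, 2.0, 10.0, 20.0, 0.5],
})
grouped = toy.groupby("Genre")

# agg(): one row per group
aggregated = grouped.agg(total=("Global_Sales", "sum"))

# filter(): original rows, minus the groups that fail the test
filtered = grouped.filter(lambda sg: sg["Global_Sales"].sum() > 2)

# transform(): one value per original row
transformed = grouped["Global_Sales"].transform("sum")

print(len(toy), len(aggregated), len(filtered), len(transformed))
```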

## Bonus: The `apply()` method

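The `groupby()` method can also be combined with the `apply()` method. In fact, `apply()` is the most flexible choice in that it can reproduce the functionality of all of the other methods, although that flexibility often makes it slower than the specialized `agg()`, `filter()`, and `transform()` methods.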

* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html

Let's start with a new slice of our data focusing exclusively on PS3 games.

In [None]:
ps3 = games.loc[games.Platform == 'PS3', ['Name','Genre','EU_Sales','Global_Sales']]

In [None]:
ps3.head(20)

Unnamed: 0,Name,Genre,EU_Sales,Global_Sales
1,Grand Theft Auto V,Action,9.27,21.4
6,Call of Duty: Black Ops II,Shooter,5.88,14.03
9,Call of Duty: Modern Warfare 3,Shooter,5.82,13.46
10,Call of Duty: Black Ops,Shooter,4.44,12.73
14,Gran Turismo 5,Racing,4.88,10.77
15,Call of Duty: Modern Warfare 2,Shooter,3.69,10.69
16,Grand Theft Auto IV,Action,3.76,10.57
20,Call of Duty: Ghosts,Shooter,3.73,9.59
25,FIFA Soccer 13,Action,5.05,8.24
31,Battlefield 3,Shooter,2.93,7.23


The `apply()` method passes each entire subgroup to a function and then combines whatever that function returns. The output for each subgroup could be a dataframe of the same dimensions, a Python list, or a single scalar value.

As an example, let's use the `apply()` method to classify PS3 game genres as either "solid" or "weak" based on the sum of their EU sales.

In [None]:
ps3.groupby('Genre').apply(lambda sg: 'solid' if sg.EU_Sales.sum() > 50 else 'weak')

Genre
Action          solid
Adventure        weak
Fighting         weak
Misc             weak
Platform         weak
Puzzle           weak
Racing           weak
Role-Playing     weak
Shooter         solid
Simulation       weak
Sports           weak
Strategy         weak
dtype: object

In doing this analysis, we created a series from our dataframe where each "Genre" subgroup was given a label depending on EU sales as defined by our lambda function.

To kick things up a notch, suppose we also want to get a feel for how variable these sales are. For this, we'll create a separate function that calculates the qualitative level of the sales and also characterizes the sales as volatile or steady depending on the relative standard deviation (aka coefficient of variation).

In [None]:
def sales_detail(sg):
  level = 'solid' if sg.EU_Sales.sum() > 50 else 'weak'
  variability = 'volatile' if sg.EU_Sales.std()/sg.EU_Sales.mean() > 2 else 'steady'
  return (variability, level + ' sales')

Now let's go back to the syntax for the application. Again, what this will do is group the PS3 games by genre, then apply our function to each genre subgroup, providing us with classifications for level and variability.

In [None]:
ps3.groupby('Genre').apply(sales_detail)

Genre
Action          (volatile, solid sales)
Adventure        (volatile, weak sales)
Fighting           (steady, weak sales)
Misc               (steady, weak sales)
Platform           (steady, weak sales)
Puzzle             (steady, weak sales)
Racing             (steady, weak sales)
Role-Playing     (volatile, weak sales)
Shooter           (steady, solid sales)
Simulation         (steady, weak sales)
Sports           (volatile, weak sales)
Strategy           (steady, weak sales)
dtype: object

This only scratches the surface of what you can do with `apply()`. The key takeaway here is that the function passed to the `apply()` method takes each subgroup as an argument and returns anything you want the function to return.
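As one illustration of "anything you want", the function can return a labeled `pd.Series` instead of a tuple, in which case pandas stacks the results into a dataframe with one column per label. The sketch below is an assumption-laden variant of `sales_detail`: the name `sales_detail_wide` and the `toy` data are hypothetical, and the thresholds are copied from the function above.

```python
import pandas as pd

# Hypothetical toy data standing in for the ps3 dataframe
toy = pd.DataFrame({
    'Genre': ['Action'] * 5 + ['Puzzle'] * 2,
    'EU_Sales': [60.0, 0.1, 0.1, 0.1, 0.1, 2.0, 2.5],
})

def sales_detail_wide(sg):
    # Same thresholds as sales_detail, returned as a labeled Series
    level = 'solid' if sg['EU_Sales'].sum() > 50 else 'weak'
    cv = sg['EU_Sales'].std() / sg['EU_Sales'].mean()  # coefficient of variation
    variability = 'volatile' if cv > 2 else 'steady'
    return pd.Series({'variability': variability, 'level': level})

# Selecting the column first means each subgroup holds only EU_Sales
detail = toy.groupby('Genre')[['EU_Sales']].apply(sales_detail_wide)
print(detail)
```

Separate columns tend to be easier to filter and sort than a single column of tuples.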

## Skill Challenge

#### 1. Starting with the *games* dataframe, calculate the total global sales (Global_Sales) for each year (Year) across all records. What are the top 3 years in terms of total aggregate global sales?

This task should be simple enough. Start by grouping the games by Year, and then perform a sum of the sales by Year. Finally, sort by "Global_Sales" and determine the top 3. 

In [None]:
games.head()

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,Kinect Adventures!,X360,2010.0,Misc,Microsoft Game Studios,14.97,4.94,0.24,1.67,21.82
1,Grand Theft Auto V,PS3,2013.0,Action,Take-Two Interactive,7.01,9.27,0.97,4.14,21.4
2,Grand Theft Auto V,X360,2013.0,Action,Take-Two Interactive,9.63,5.31,0.06,1.38,16.38
3,Call of Duty: Modern Warfare 3,X360,2011.0,Shooter,Activision,9.03,4.28,0.13,1.32,14.76
4,Call of Duty: Black Ops,X360,2010.0,Shooter,Activision,9.67,3.73,0.11,1.13,14.64


In [None]:
games.groupby('Year').sum(numeric_only=True).sort_values(by='Global_Sales', ascending=False)

Unnamed: 0_level_0,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2010.0,168.13,99.57,11.98,35.65,315.47
2011.0,151.5,101.96,15.88,35.11,304.49
2008.0,139.62,78.35,7.72,29.77,255.45
2009.0,136.6,76.34,10.98,29.35,253.19
2013.0,116.33,89.5,13.5,31.02,250.36
2014.0,100.71,96.2,9.37,32.39,238.57
2012.0,98.12,73.72,13.0,25.39,210.37
2015.0,86.92,80.61,10.03,26.58,204.23
2007.0,95.1,49.17,5.74,19.65,169.65
2006.0,43.99,18.59,2.27,8.15,72.95


The analysis shows that 2010, 2011, and 2008 were the top 3 years in terms of global aggregate sales.
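If only the top 3 are needed, one alternative is to select the single column of interest and use `nlargest()`. The sketch below runs on a small made-up dataframe (`toy` is hypothetical) so that it is self-contained. It also shows that `groupby()` drops records whose key is missing by default, which matters here because some games have no Year.

```python
import pandas as pd

# Hypothetical toy data; the last record has a missing Year
toy = pd.DataFrame({
    'Year': [2010.0, 2010.0, 2011.0, 2008.0, 2009.0, None],
    'Global_Sales': [5.0, 3.0, 6.0, 4.0, 1.0, 9.0],
})

# Sum Global_Sales per Year, then keep the three largest totals
top3 = toy.groupby('Year')['Global_Sales'].sum().nlargest(3)
print(top3)
```

The record with the missing Year does not appear in any group, even though its sales figure is the largest in the toy data.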

#### 2. In the *games* dataframe, determine which Genre, Year, and Platform sold the most in the EU.

We can approach this by grouping the games by Genre, Year, and Platform, then summing the sales columns and sorting the result by EU_Sales in descending order.

In [None]:
games.groupby(['Genre', 'Year', 'Platform']).sum(numeric_only=True).sort_values(by='EU_Sales', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
Genre,Year,Platform,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Action,2013.0,PS3,18.47,21.72,4.95,9.34,54.44
Action,2014.0,PS4,14.68,19.18,1.63,7.36,42.84
Action,2012.0,PS3,16.27,17.37,4.35,7.46,45.45
Action,2011.0,PS3,16.66,16.49,5.05,6.37,44.59
Shooter,2011.0,PS3,16.83,15.98,1.88,5.59,40.25
...,...,...,...,...,...,...,...
Adventure,2013.0,X360,0.00,0.00,0.02,0.00,0.02
Strategy,2011.0,PS3,0.18,0.00,0.12,0.01,0.31
Misc,2016.0,PS4,0.01,0.00,0.20,0.00,0.22
Misc,2016.0,PS3,0.00,0.00,0.14,0.00,0.14


It looks like Action games on the PS3 in the year 2013 had the highest sales in the EU compared to all other Genre-Year-Platform groupings.
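When only the winning combination is needed, rather than the full sorted table, `idxmax()` on the grouped sums returns the index label of the largest value, which for a multi-key grouping is a tuple. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with the same key columns as the games dataframe
toy = pd.DataFrame({
    'Genre': ['Action', 'Action', 'Action', 'Shooter'],
    'Year': [2013.0, 2013.0, 2014.0, 2011.0],
    'Platform': ['PS3', 'PS3', 'PS4', 'PS3'],
    'EU_Sales': [2.0, 1.5, 3.0, 2.5],
})

# Total EU sales for each Genre-Year-Platform combination
eu_totals = toy.groupby(['Genre', 'Year', 'Platform'])['EU_Sales'].sum()

best_group = eu_totals.idxmax()  # tuple of (Genre, Year, Platform)
best_value = eu_totals.max()
print(best_group, best_value)
```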

#### 3. Find all of the game names (Name) in the *games* dataset whose Genre in their respective Platform sold more in Japan (JP_Sales) than in Europe (EU_Sales).

This one is a bit more complex, requiring us to filter the original dataframe based on how much the games sold on a Genre-Platform basis. 

We start by grouping the games by Genre and Platform. We then use the `filter()` method and pass in a lambda function that compares the sum of sales in Japan for each Genre-Platform subgroup to the sum of sales in the EU. Only games that belong to Genre-Platform subgroups that sold more in Japan than in the EU are included.

In [None]:
games.groupby(["Genre","Platform"]).filter(lambda x: x['JP_Sales'].sum() > x['EU_Sales'].sum())

Unnamed: 0,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
1246,Katamari Forever,PS3,2009.0,Puzzle,Namco Bandai Games,0.26,0.05,0.06,0.04,0.42
1440,Beautiful Katamari,X360,2007.0,Puzzle,Namco Bandai Games,0.14,0.02,0.15,0.02,0.32
2117,Bejeweled 3,PS3,,Puzzle,Unknown,0.13,0.0,0.0,0.01,0.14
2132,Bejeweled 3,X360,,Puzzle,Unknown,0.13,0.0,0.0,0.01,0.14
2214,Are You Smarter than a 5th Grader? Game Time,X360,2009.0,Puzzle,THQ,0.12,0.0,0.0,0.01,0.12
2318,Tetris Evolution,X360,2007.0,Puzzle,THQ,0.08,0.02,0.0,0.01,0.11
2497,Qubed,X360,2009.0,Puzzle,Atari,0.07,0.0,0.0,0.01,0.08
2744,Puyo Puyo Tetris,PS3,2014.0,Puzzle,Sega,0.0,0.0,0.04,0.0,0.04
2767,PopCap Arcade Vol 1,X360,2007.0,Puzzle,PopCap Games,0.04,0.0,0.0,0.0,0.04
2787,Bomberman: Act Zero,X360,2006.0,Puzzle,Konami Digital Entertainment,0.04,0.0,0.0,0.0,0.04


And there we have it! A total of 10 games belong to Genre-Platform subgroups that sold better in Japan than in the EU. Interestingly, these are all puzzle games.
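Note that the condition is evaluated per subgroup, not per game. Tetris Evolution, for example, appears in the result even though it sold more in the EU than in Japan, because its Genre-Platform subgroup as a whole sold more in Japan. The sketch below reproduces that behavior on made-up data (`toy` is hypothetical) and shows an equivalent approach that builds a boolean mask with `transform()`, then keeps only the Name column.

```python
import pandas as pd

# Hypothetical toy data: game A sells more in the EU individually,
# but its Puzzle/PS3 subgroup sells more in Japan overall
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Genre': ['Puzzle', 'Puzzle', 'Action', 'Action'],
    'Platform': ['PS3', 'PS3', 'PS3', 'PS3'],
    'EU_Sales': [0.05, 0.0, 2.0, 1.0],
    'JP_Sales': [0.02, 0.04, 0.5, 1.2],
})

grouped = toy.groupby(['Genre', 'Platform'])

# Approach 1: filter() keeps or drops whole subgroups
via_filter = grouped.filter(lambda sg: sg['JP_Sales'].sum() > sg['EU_Sales'].sum())

# Approach 2: broadcast the subgroup sums back to each row and compare
mask = grouped['JP_Sales'].transform('sum') > grouped['EU_Sales'].transform('sum')
via_mask = toy.loc[mask]

print(via_filter['Name'].tolist())
```

Game D is excluded even though it sold more in Japan on its own, since the Action/PS3 subgroup did not.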