In [1]:
import pandas as pd

# Aggregating & Reshaping DataFrames

 <img src="../images/sec04-01_goals.png">

## 1 - Grouping Columns

### 1.1 - Aggregating DataFrames

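You can `aggregate a DataFrame` column by calling an aggregation method on it, just as you would on a Series. When called on several columns at once, the method returns one value per column.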

In [2]:
premier_league = pd.read_excel('../retail/premier_league_games.xlsx')

In [3]:
premier_league.head()

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,4389,England Premier League,2015/2016,Arsenal,West Ham United,0,2
1,4390,England Premier League,2015/2016,Bournemouth,Aston Villa,0,1
2,4391,England Premier League,2015/2016,Chelsea,Swansea City,2,2
3,4392,England Premier League,2015/2016,Everton,Watford,2,2
4,4393,England Premier League,2015/2016,Leicester City,Sunderland,4,2


In [5]:
premier_league.loc[:, ['HomeGoals', 'AwayGoals']].sum()

HomeGoals    567
AwayGoals    459
dtype: int64

In [4]:
premier_league.loc[:, ['HomeGoals', 'AwayGoals']].mean()

HomeGoals    1.492105
AwayGoals    1.207895
dtype: float64

In [6]:
premier_league.loc[:, ['HomeGoals', 'AwayGoals']].std()

HomeGoals    1.259242
AwayGoals    1.146955
dtype: float64

### 1.2 - Grouping DataFrames
`Grouping a DataFrame` allows you to aggregate the data at a different level
* For example, transform daily data into monthly data, roll up transaction-level data by store, etc.

To group data, use the `.groupby()` method and specify a column to group by 
* The grouped column becomes the index by default
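
As a minimal, self-contained sketch of the two points above (the `games` frame and its values are invented for illustration and are not the Premier League file used in this notebook), grouping by one column and aggregating another could look like this:

```python
import pandas as pd

# Hypothetical match results, invented for illustration only
games = pd.DataFrame({
    'HomeTeam': ['Arsenal', 'Chelsea', 'Arsenal', 'Chelsea'],
    'HomeGoals': [2, 1, 4, 3],
})

# Group the rows by team, then average the goals within each group
avg_home_goals = games.groupby('HomeTeam')['HomeGoals'].mean()

print(avg_home_goals.index.name)  # HomeTeam (the grouped column became the index)
print(avg_home_goals['Arsenal'])  # 3.0
```

Because the grouped column is now the index, a single team can be looked up by label, as in the last line.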

 <img src="../images/sec04-02_groupby.png">

In [7]:
premier_league.head()

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,4389,England Premier League,2015/2016,Arsenal,West Ham United,0,2
1,4390,England Premier League,2015/2016,Bournemouth,Aston Villa,0,1
2,4391,England Premier League,2015/2016,Chelsea,Swansea City,2,2
3,4392,England Premier League,2015/2016,Everton,Watford,2,2
4,4393,England Premier League,2015/2016,Leicester City,Sunderland,4,2


In [9]:
premier_league.groupby('HomeTeam')['HomeGoals'].mean() # single bracket returns a series

HomeTeam
Arsenal                 1.631579
Aston Villa             0.736842
Bournemouth             1.210526
Chelsea                 1.684211
Crystal Palace          1.000000
Everton                 1.842105
Leicester City          1.842105
Liverpool               1.736842
Manchester City         2.473684
Manchester United       1.421053
Newcastle United        1.684211
Norwich City            1.368421
Southampton             2.052632
Stoke City              1.157895
Sunderland              1.210526
Swansea City            1.052632
Tottenham Hotspur       1.842105
Watford                 1.052632
West Bromwich Albion    1.052632
West Ham United         1.789474
Name: HomeGoals, dtype: float64

In [10]:
# Average number of goals scored by each team at home

premier_league.groupby('HomeTeam')[['HomeGoals']].mean() # double bracket returns a dataframe

Unnamed: 0_level_0,HomeGoals
HomeTeam,Unnamed: 1_level_1
Arsenal,1.631579
Aston Villa,0.736842
Bournemouth,1.210526
Chelsea,1.684211
Crystal Palace,1.0
Everton,1.842105
Leicester City,1.842105
Liverpool,1.736842
Manchester City,2.473684
Manchester United,1.421053


In [11]:
# sort the results

premier_league.groupby('HomeTeam')[['HomeGoals']].mean().sort_values('HomeGoals', ascending=False)

Unnamed: 0_level_0,HomeGoals
HomeTeam,Unnamed: 1_level_1
Manchester City,2.473684
Southampton,2.052632
Tottenham Hotspur,1.842105
Everton,1.842105
Leicester City,1.842105
West Ham United,1.789474
Liverpool,1.736842
Newcastle United,1.684211
Chelsea,1.684211
Arsenal,1.631579


In [13]:
premier_league.groupby('AwayTeam')[['AwayGoals']].sum().sort_values('AwayGoals', ascending=False)

Unnamed: 0_level_0,AwayGoals
AwayTeam,Unnamed: 1_level_1
Arsenal,34
Tottenham Hotspur,34
Leicester City,33
West Ham United,31
Liverpool,30
Chelsea,27
Sunderland,25
Everton,24
Manchester City,24
Manchester United,22


### 1.3 - Grouping By Multiple Columns

You can `group by multiple columns` by passing the list of columns into `.groupby()`
* This creates a multi-index object with one index level for each column the data was grouped by
* Specify `as_index=False` to prevent the grouped columns from becoming indices

For example, this returns the sum of sales for each combination of `family` and `store_nbr`, but keeps a numeric index:

```python
sales_sums = small_retail.groupby(['family', 'store_nbr'], as_index=False)[['sales']].sum()
```
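
The `small_retail` DataFrame is not defined in this notebook, so the sketch below builds a tiny hypothetical stand-in (column names taken from the snippet above, values invented) to compare the result with and without `as_index=False`:

```python
import pandas as pd

# Hypothetical stand-in for small_retail; the values are invented
small_retail = pd.DataFrame({
    'family': ['DAIRY', 'DAIRY', 'DAIRY', 'BREAD'],
    'store_nbr': [1, 1, 2, 1],
    'sales': [10.0, 5.0, 7.0, 3.0],
})

# Default: the grouped columns become a two-level row index
indexed = small_retail.groupby(['family', 'store_nbr'])[['sales']].sum()

# as_index=False: the grouped columns stay as regular columns
flat = small_retail.groupby(['family', 'store_nbr'], as_index=False)[['sales']].sum()

print(indexed.index.nlevels)  # 2
print(list(flat.columns))     # ['family', 'store_nbr', 'sales']
```

The flat form is usually easier to filter with `.query()`, while the indexed form supports label lookups with `.loc[]`.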

#### Practice

In [14]:
premier_league = pd.read_excel('../retail/premier_league_games_full.xlsx')

In [15]:
premier_league.head()

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,1729,England Premier League,2008/2009,Manchester United,Newcastle United,1,1
1,1730,England Premier League,2008/2009,Arsenal,West Bromwich Albion,1,0
2,1731,England Premier League,2008/2009,Sunderland,Liverpool,0,1
3,1732,England Premier League,2008/2009,West Ham United,Wigan Athletic,2,1
4,1733,England Premier League,2008/2009,Aston Villa,Manchester City,4,2


In [17]:
premier_league.groupby('HomeTeam')[['HomeGoals']].sum()

Unnamed: 0_level_0,HomeGoals
HomeTeam,Unnamed: 1_level_1
Arsenal,306
Aston Villa,179
Birmingham City,38
Blackburn Rovers,98
Blackpool,30
Bolton Wanderers,104
Bournemouth,23
Burnley,39
Cardiff City,20
Chelsea,333


In [19]:
premier_league.groupby(['HomeTeam', 'season'])[['HomeGoals']].sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,HomeGoals
HomeTeam,season,Unnamed: 2_level_1
Arsenal,2008/2009,31
Arsenal,2009/2010,48
Arsenal,2010/2011,33
Arsenal,2011/2012,39
Arsenal,2012/2013,47
...,...,...
Wigan Athletic,2011/2012,22
Wigan Athletic,2012/2013,26
Wolverhampton Wanderers,2009/2010,13
Wolverhampton Wanderers,2010/2011,30


In [20]:
# swap the order of the grouping columns (and therefore of the index levels)

premier_league.groupby(['season', 'HomeTeam'])[['HomeGoals']].sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,HomeGoals
season,HomeTeam,Unnamed: 2_level_1
2008/2009,Arsenal,31
2008/2009,Aston Villa,27
2008/2009,Blackburn Rovers,22
2008/2009,Bolton Wanderers,21
2008/2009,Chelsea,33
...,...,...
2015/2016,Swansea City,20
2015/2016,Tottenham Hotspur,35
2015/2016,Watford,20
2015/2016,West Bromwich Albion,20


In [21]:
premier_league.groupby(['season', 'HomeTeam'], as_index=False)[['HomeGoals']].sum()

Unnamed: 0,season,HomeTeam,HomeGoals
0,2008/2009,Arsenal,31
1,2008/2009,Aston Villa,27
2,2008/2009,Blackburn Rovers,22
3,2008/2009,Bolton Wanderers,21
4,2008/2009,Chelsea,33
...,...,...,...
155,2015/2016,Swansea City,20
156,2015/2016,Tottenham Hotspur,35
157,2015/2016,Watford,20
158,2015/2016,West Bromwich Albion,20


In [22]:
premier_league.groupby(['season', 'HomeTeam'], as_index=False)[['HomeGoals']].sum().query('HomeTeam == "Arsenal"')

Unnamed: 0,season,HomeTeam,HomeGoals
0,2008/2009,Arsenal,31
20,2009/2010,Arsenal,48
40,2010/2011,Arsenal,33
60,2011/2012,Arsenal,39
80,2012/2013,Arsenal,47
100,2013/2014,Arsenal,36
120,2014/2015,Arsenal,41
140,2015/2016,Arsenal,31


## 2 - Multi-index DataFrames

### 2.1 - Multi-Index DataFrames

`Multi-index DataFrames` are generally created through aggregation operations
* They are stored as a list of tuples, with an item for each layer of the index
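
To make the "list of tuples" idea concrete, here is a small sketch on invented data (the `games` frame is hypothetical) that builds a two-level index through a groupby and then inspects its labels:

```python
import pandas as pd

# Hypothetical data, invented for illustration
games = pd.DataFrame({
    'season': ['2008/2009', '2008/2009', '2009/2010'],
    'HomeTeam': ['Arsenal', 'Chelsea', 'Arsenal'],
    'HomeGoals': [1, 2, 3],
})
totals = games.groupby(['season', 'HomeTeam'])[['HomeGoals']].sum()

# Each row label is a tuple with one item per index level
index_labels = totals.index.tolist()
level_names = list(totals.index.names)

print(index_labels)
# [('2008/2009', 'Arsenal'), ('2008/2009', 'Chelsea'), ('2009/2010', 'Arsenal')]
print(level_names)  # ['season', 'HomeTeam']
```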

 <img src="../images/sec04-03_multi-index_dataframe.png">

### 2.2 - Accessing Multi-Index DataFrames

The `.loc[]` accessor lets you `access multi-index DataFrames` in different ways

1. Access rows via the `outer index` only

 <img src="../images/sec04-04_access_1.png">

2. Access rows via the `outer & inner indices` 

 <img src="../images/sec04-05_access_2.png">
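
The two access patterns above can be sketched on a small invented frame (hypothetical data, not the course file). The last line goes one step beyond the text: `.xs()` is a pandas method for selecting on the inner level alone, included here only as an optional extra.

```python
import pandas as pd

# Hypothetical season totals, invented for illustration
games = pd.DataFrame({
    'season': ['2008/2009', '2008/2009', '2009/2010', '2009/2010'],
    'HomeTeam': ['Arsenal', 'Chelsea', 'Arsenal', 'Chelsea'],
    'HomeGoals': [31, 33, 48, 68],
})
totals = games.groupby(['season', 'HomeTeam'])[['HomeGoals']].sum()

# 1. Outer index only: every team in one season
one_season = totals.loc['2008/2009']

# 2. Outer and inner indices: a single value
one_row = totals.loc[('2009/2010', 'Chelsea'), 'HomeGoals']

# Extra: inner level only, across all seasons
one_team = totals.xs('Arsenal', level='HomeTeam')

print(one_row)  # 68
```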

#### Practice

In [23]:
premier_league = pd.read_excel('../retail/premier_league_games_full.xlsx')

In [24]:
premier_league.head()

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,1729,England Premier League,2008/2009,Manchester United,Newcastle United,1,1
1,1730,England Premier League,2008/2009,Arsenal,West Bromwich Albion,1,0
2,1731,England Premier League,2008/2009,Sunderland,Liverpool,0,1
3,1732,England Premier League,2008/2009,West Ham United,Wigan Athletic,2,1
4,1733,England Premier League,2008/2009,Aston Villa,Manchester City,4,2


In [25]:
agg_prem_league = premier_league.groupby(['season', 'HomeTeam'])[['HomeGoals']].sum()

In [26]:
agg_prem_league.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,HomeGoals
season,HomeTeam,Unnamed: 2_level_1
2008/2009,Arsenal,31
2008/2009,Aston Villa,27
2008/2009,Blackburn Rovers,22
2008/2009,Bolton Wanderers,21
2008/2009,Chelsea,33


In [27]:
# access season 2008/2009

agg_prem_league.loc['2008/2009']

Unnamed: 0_level_0,HomeGoals
HomeTeam,Unnamed: 1_level_1
Arsenal,31
Aston Villa,27
Blackburn Rovers,22
Bolton Wanderers,21
Chelsea,33
Everton,31
Fulham,28
Hull City,18
Liverpool,41
Manchester City,40


In [28]:
# slice to grab a range of seasons (3 seasons)

agg_prem_league.loc['2008/2009':'2010/2011']

Unnamed: 0_level_0,Unnamed: 1_level_0,HomeGoals
season,HomeTeam,Unnamed: 2_level_1
2008/2009,Arsenal,31
2008/2009,Aston Villa,27
2008/2009,Blackburn Rovers,22
2008/2009,Bolton Wanderers,21
2008/2009,Chelsea,33
2008/2009,Everton,31
2008/2009,Fulham,28
2008/2009,Hull City,18
2008/2009,Liverpool,41
2008/2009,Manchester City,40


In [29]:
# grab individual rows
# Arsenal in 2008/2009

agg_prem_league.loc[('2008/2009', 'Arsenal')]

HomeGoals    31
Name: (2008/2009, Arsenal), dtype: int64

In [30]:
agg_prem_league.loc[('2008/2009', 'Arsenal'):('2008/2009', 'Bolton Wanderers')]

Unnamed: 0_level_0,Unnamed: 1_level_0,HomeGoals
season,HomeTeam,Unnamed: 2_level_1
2008/2009,Arsenal,31
2008/2009,Aston Villa,27
2008/2009,Blackburn Rovers,22
2008/2009,Bolton Wanderers,21


In [31]:
# we can always use the iloc accessor (positional indexing)
# for example, grab the second row 

agg_prem_league.iloc[1]

HomeGoals    27
Name: (2008/2009, Aston Villa), dtype: int64

In [32]:
# Use the `agg` method to apply multiple aggregations at once
# Now we have multi-level row index and multi-level column index

agg_prem_league = premier_league.groupby(['season', 'HomeTeam']).agg({'HomeGoals': ['sum', 'mean']})

agg_prem_league.head()

# agg_prem_league = premier_league.groupby(['season', 'HomeTeam'])[['HomeGoals']].agg(['sum', 'mean'])

Unnamed: 0_level_0,Unnamed: 1_level_0,HomeGoals,HomeGoals
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean
season,HomeTeam,Unnamed: 2_level_2,Unnamed: 3_level_2
2008/2009,Arsenal,31,1.631579
2008/2009,Aston Villa,27,1.421053
2008/2009,Blackburn Rovers,22,1.157895
2008/2009,Bolton Wanderers,21,1.105263
2008/2009,Chelsea,33,1.736842


In [33]:
# look at accessing data with these two layers of indices

agg_prem_league.iloc[0, 0]

31

In [34]:
agg_prem_league.iloc[1, 1]

1.4210526315789473

In [35]:
# slice 

agg_prem_league.iloc[1, :]

HomeGoals  sum     27.000000
           mean     1.421053
Name: (2008/2009, Aston Villa), dtype: float64

In [36]:
# loc accessor

agg_prem_league.loc['2010/2011', ('HomeGoals', )]

Unnamed: 0_level_0,sum,mean
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1
Arsenal,33,1.736842
Aston Villa,26,1.368421
Birmingham City,19,1.0
Blackburn Rovers,22,1.157895
Blackpool,30,1.578947
Bolton Wanderers,34,1.789474
Chelsea,39,2.052632
Everton,31,1.631579
Fulham,30,1.578947
Liverpool,37,1.947368


In [37]:
# loc accessor

agg_prem_league.loc['2010/2011', ('HomeGoals', 'mean')]

HomeTeam
Arsenal                    1.736842
Aston Villa                1.368421
Birmingham City            1.000000
Blackburn Rovers           1.157895
Blackpool                  1.578947
Bolton Wanderers           1.789474
Chelsea                    2.052632
Everton                    1.631579
Fulham                     1.578947
Liverpool                  1.947368
Manchester City            1.789474
Manchester United          2.578947
Newcastle United           2.157895
Stoke City                 1.631579
Sunderland                 1.315789
Tottenham Hotspur          1.578947
West Bromwich Albion       1.578947
West Ham United            1.263158
Wigan Athletic             1.157895
Wolverhampton Wanderers    1.578947
Name: (HomeGoals, mean), dtype: float64

### 2.3 - Modifying Multi-Index DataFrames

There are several ways to `modify multi-index DataFrames`:

 <img src="../images/sec04-06_modifying.png">

In [38]:
agg_prem_league = premier_league.groupby(['season', 'HomeTeam']).agg({'HomeGoals': ['sum', 'mean']})

agg_prem_league.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,HomeGoals,HomeGoals
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean
season,HomeTeam,Unnamed: 2_level_2,Unnamed: 3_level_2
2008/2009,Arsenal,31,1.631579
2008/2009,Aston Villa,27,1.421053
2008/2009,Blackburn Rovers,22,1.157895
2008/2009,Bolton Wanderers,21,1.105263
2008/2009,Chelsea,33,1.736842


In [40]:
agg_prem_league.droplevel(0, axis=1).loc[:, 'mean'] # preview: drop the top column level, then select the 'mean' column

season     HomeTeam            
2008/2009  Arsenal                 1.631579
           Aston Villa             1.421053
           Blackburn Rovers        1.157895
           Bolton Wanderers        1.105263
           Chelsea                 1.736842
                                     ...   
2015/2016  Swansea City            1.052632
           Tottenham Hotspur       1.842105
           Watford                 1.052632
           West Bromwich Albion    1.052632
           West Ham United         1.789474
Name: mean, Length: 160, dtype: float64

In [42]:
agg_prem_league = agg_prem_league.droplevel(0, axis=1) # keep the change: the top column level (the HomeGoals label) is gone from here on

In [43]:
agg_prem_league.swaplevel() 

Unnamed: 0_level_0,Unnamed: 1_level_0,sum,mean
HomeTeam,season,Unnamed: 2_level_1,Unnamed: 3_level_1
Arsenal,2008/2009,31,1.631579
Aston Villa,2008/2009,27,1.421053
Blackburn Rovers,2008/2009,22,1.157895
Bolton Wanderers,2008/2009,21,1.105263
Chelsea,2008/2009,33,1.736842
...,...,...,...
Swansea City,2015/2016,20,1.052632
Tottenham Hotspur,2015/2016,35,1.842105
Watford,2015/2016,20,1.052632
West Bromwich Albion,2015/2016,20,1.052632


In [44]:
agg_prem_league.swaplevel().loc['Arsenal'] # get all the rows for Arsenal

Unnamed: 0_level_0,sum,mean
season,Unnamed: 1_level_1,Unnamed: 2_level_1
2008/2009,31,1.631579
2009/2010,48,2.526316
2010/2011,33,1.736842
2011/2012,39,2.052632
2012/2013,47,2.473684
2013/2014,36,1.894737
2014/2015,41,2.157895
2015/2016,31,1.631579


In [45]:
agg_prem_league.reset_index() # end up with our integer based index

Unnamed: 0,season,HomeTeam,sum,mean
0,2008/2009,Arsenal,31,1.631579
1,2008/2009,Aston Villa,27,1.421053
2,2008/2009,Blackburn Rovers,22,1.157895
3,2008/2009,Bolton Wanderers,21,1.105263
4,2008/2009,Chelsea,33,1.736842
...,...,...,...,...
155,2015/2016,Swansea City,20,1.052632
156,2015/2016,Tottenham Hotspur,35,1.842105
157,2015/2016,Watford,20,1.052632
158,2015/2016,West Bromwich Albion,20,1.052632


## 3 - Aggregating Groups

### 3.1 - The AGG Method

The `.agg()` method lets you perform multiple aggregations on a "groupby" object 

```python
    small_retail.groupby(['store_nbr', 'family']).agg('sum')
```

### 3.2 - Multiple Aggregations

You can perform `multiple aggregations` by passing a list of aggregation functions 

```python
    small_retail.groupby(['store_nbr', 'family']).agg(['sum', 'mean'])
```

You can perform `specific aggregations by column` by passing a dictionary with column names as keys and lists of aggregation functions as values

```python
    (small_retail
     .groupby(['family', 'store_nbr'])
     .agg({'sales': ['sum', 'mean'],
           'onpromotion': ['min', 'max']})
    )
```
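
Since `small_retail` is not loaded in this notebook, the following sketch uses a hypothetical stand-in with invented values to show what the dictionary form returns: a DataFrame whose columns form a two-level index of (column, function) pairs.

```python
import pandas as pd

# Hypothetical stand-in for small_retail; the values are invented
small_retail = pd.DataFrame({
    'family': ['DAIRY', 'DAIRY', 'BREAD', 'BREAD'],
    'store_nbr': [1, 1, 1, 1],
    'sales': [10.0, 20.0, 4.0, 6.0],
    'onpromotion': [0, 3, 1, 2],
})

summary = (small_retail
           .groupby(['family', 'store_nbr'])
           .agg({'sales': ['sum', 'mean'],
                 'onpromotion': ['min', 'max']}))

print(summary.columns.tolist())
# [('sales', 'sum'), ('sales', 'mean'), ('onpromotion', 'min'), ('onpromotion', 'max')]
print(summary.loc[('DAIRY', 1), ('sales', 'mean')])  # 15.0
```

Selecting a single cell therefore needs a tuple on both axes, which is the inconvenience that named aggregations avoid.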

### 3.3 - Named Aggregations

You can `name aggregated columns` upon creation to avoid multi-index columns

 <img src="../images/sec04-07_named_aggregations.png">
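
A minimal sketch of the named-aggregation syntax, `new_name=('column', 'function')`, on a hypothetical `small_retail` frame (the values and the output names `sales_sum` and `max_promo` are invented for illustration):

```python
import pandas as pd

# Hypothetical stand-in for small_retail; the values are invented
small_retail = pd.DataFrame({
    'family': ['DAIRY', 'DAIRY', 'BREAD', 'BREAD'],
    'store_nbr': [1, 1, 1, 1],
    'sales': [10.0, 20.0, 4.0, 6.0],
    'onpromotion': [0, 3, 1, 2],
})

named = (small_retail
         .groupby(['family', 'store_nbr'], as_index=False)
         .agg(sales_sum=('sales', 'sum'),
              max_promo=('onpromotion', 'max')))

print(list(named.columns))
# ['family', 'store_nbr', 'sales_sum', 'max_promo']
```

The result has ordinary single-level columns, so it can be filtered or sorted by name without tuples.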

#### Practice

In [46]:
premier_league = pd.read_excel('../retail/premier_league_games_full.xlsx')

premier_league.head()

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,1729,England Premier League,2008/2009,Manchester United,Newcastle United,1,1
1,1730,England Premier League,2008/2009,Arsenal,West Bromwich Albion,1,0
2,1731,England Premier League,2008/2009,Sunderland,Liverpool,0,1
3,1732,England Premier League,2008/2009,West Ham United,Wigan Athletic,2,1
4,1733,England Premier League,2008/2009,Aston Villa,Manchester City,4,2


In [50]:
(premier_league
 .groupby(['season', 'HomeTeam'], as_index=False)
 .agg(['sum', 'count']))

Unnamed: 0_level_0,Unnamed: 1_level_0,id,id,league_name,league_name,AwayTeam,AwayTeam,HomeGoals,HomeGoals,AwayGoals,AwayGoals
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,count,sum,count,sum,count,sum,count,sum,count
season,HomeTeam,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2
2008/2009,Arsenal,36128,19,England Premier LeagueEngland Premier LeagueEn...,19,West Bromwich AlbionTottenham HotspurMancheste...,19,31,19,16,19
2008/2009,Aston Villa,36136,19,England Premier LeagueEngland Premier LeagueEn...,19,Manchester CityBlackburn RoversMiddlesbroughMa...,19,27,19,21,19
2008/2009,Blackburn Rovers,36781,19,England Premier LeagueEngland Premier LeagueEn...,19,ChelseaSunderlandLiverpoolStoke CityHull CityM...,19,22,19,23,19
2008/2009,Bolton Wanderers,36214,19,England Premier LeagueEngland Premier LeagueEn...,19,Stoke CityEvertonManchester CityLiverpoolChels...,19,21,19,21,19
2008/2009,Chelsea,36552,19,England Premier LeagueEngland Premier LeagueEn...,19,PortsmouthSunderlandNewcastle UnitedArsenalWes...,19,33,19,12,19
...,...,...,...,...,...,...,...,...,...,...,...
2015/2016,Swansea City,87241,19,England Premier LeagueEngland Premier LeagueEn...,19,ArsenalBournemouthLeicester CityWest Ham Unite...,19,20,19,20,19
2015/2016,Tottenham Hotspur,87218,19,England Premier LeagueEngland Premier LeagueEn...,19,Aston VillaWest Ham UnitedChelseaNewcastle Uni...,19,35,19,15,19
2015/2016,Watford,87142,19,England Premier LeagueEngland Premier LeagueEn...,19,West Ham UnitedManchester UnitedNorwich CityLi...,19,20,19,19,19
2015/2016,West Bromwich Albion,87064,19,England Premier LeagueEngland Premier LeagueEn...,19,Manchester CityLeicester CityArsenalTottenham ...,19,20,19,26,19


In [51]:
# better to start with the dictionary based approach 

(premier_league
 .groupby(['season', 'HomeTeam'], as_index=False)
 .agg({'HomeGoals': ['sum', 'mean'], 
       'AwayGoals': ['sum', 'mean']}))

Unnamed: 0_level_0,season,HomeTeam,HomeGoals,HomeGoals,AwayGoals,AwayGoals
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,sum,mean,sum,mean
0,2008/2009,Arsenal,31,1.631579,16,0.842105
1,2008/2009,Aston Villa,27,1.421053,21,1.105263
2,2008/2009,Blackburn Rovers,22,1.157895,23,1.210526
3,2008/2009,Bolton Wanderers,21,1.105263,21,1.105263
4,2008/2009,Chelsea,33,1.736842,12,0.631579
...,...,...,...,...,...,...
155,2015/2016,Swansea City,20,1.052632,20,1.052632
156,2015/2016,Tottenham Hotspur,35,1.842105,15,0.789474
157,2015/2016,Watford,20,1.052632,19,1.000000
158,2015/2016,West Bromwich Albion,20,1.052632,26,1.368421


In [52]:
# named aggregations

(premier_league
 .groupby(['season', 'HomeTeam'], as_index=False)
 .agg(home_goal_sum=('HomeGoals', 'sum'),
      away_goal_sum=('AwayGoals', 'sum')))

Unnamed: 0,season,HomeTeam,home_goal_sum,away_goal_sum
0,2008/2009,Arsenal,31,16
1,2008/2009,Aston Villa,27,21
2,2008/2009,Blackburn Rovers,22,23
3,2008/2009,Bolton Wanderers,21,21
4,2008/2009,Chelsea,33,12
...,...,...,...,...
155,2015/2016,Swansea City,20,20
156,2015/2016,Tottenham Hotspur,35,15
157,2015/2016,Watford,20,19
158,2015/2016,West Bromwich Albion,20,26


### 3.4 - **PRO TIP:** TRANSFORM 

The `.transform()` method can be used to perform aggregations without reshaping
* This is useful for calculating group-level statistics that you then use in row-level analysis

 <img src="../images/sec04-08_transform.png">

* _Here we grab the total sales for each store without losing any rows, so the DataFrame keeps its original structure_
* _This lets us answer questions such as: what percentage of its store's total sales does a given row represent?_
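
The percentage-of-store-sales idea can be sketched as follows. The `small_retail` frame here is a hypothetical stand-in with invented values, and the column names `store_sales` and `pct_of_store` are chosen only for this example.

```python
import pandas as pd

# Hypothetical stand-in for small_retail; the values are invented
small_retail = pd.DataFrame({
    'store_nbr': [1, 1, 2, 2],
    'family': ['DAIRY', 'BREAD', 'DAIRY', 'BREAD'],
    'sales': [30.0, 10.0, 5.0, 15.0],
})

with_share = small_retail.assign(
    # transform('sum') returns one value per original row, not one per group
    store_sales=small_retail.groupby('store_nbr')['sales'].transform('sum'),
    # each row's share of its own store's total
    pct_of_store=lambda df: df['sales'] / df['store_sales'],
)

print(len(with_share) == len(small_retail))  # True
print(with_share['pct_of_store'].tolist())   # [0.75, 0.25, 0.25, 0.75]
```

A plain `.sum()` on the groupby would have returned two rows (one per store), which could not be assigned back row by row without a merge.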

#### Practice

In [53]:
premier_league.head()

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,1729,England Premier League,2008/2009,Manchester United,Newcastle United,1,1
1,1730,England Premier League,2008/2009,Arsenal,West Bromwich Albion,1,0
2,1731,England Premier League,2008/2009,Sunderland,Liverpool,0,1
3,1732,England Premier League,2008/2009,West Ham United,Wigan Athletic,2,1
4,1733,England Premier League,2008/2009,Aston Villa,Manchester City,4,2


In [55]:
# calculate the average number of goals each home team scored, without collapsing the rows of the DataFrame

premier_league.assign(
    avg_team_goals = premier_league.groupby(['HomeTeam'])['HomeGoals'].transform('mean')
)

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals,avg_team_goals
0,1729,England Premier League,2008/2009,Manchester United,Newcastle United,1,1,2.223684
1,1730,England Premier League,2008/2009,Arsenal,West Bromwich Albion,1,0,2.013158
2,1731,England Premier League,2008/2009,Sunderland,Liverpool,0,1,1.210526
3,1732,England Premier League,2008/2009,West Ham United,Wigan Athletic,2,1,1.466165
4,1733,England Premier League,2008/2009,Aston Villa,Manchester City,4,2,1.177632
...,...,...,...,...,...,...,...,...
3035,4764,England Premier League,2015/2016,Southampton,Leicester City,2,2,1.763158
3036,4765,England Premier League,2015/2016,Swansea City,Stoke City,0,1,1.421053
3037,4766,England Premier League,2015/2016,Tottenham Hotspur,Liverpool,0,0,1.677632
3038,4767,England Premier League,2015/2016,Watford,Arsenal,0,3,1.052632


In [56]:
# create one more column 

premier_league.assign(
    avg_team_goals = premier_league.groupby(['HomeTeam'])['HomeGoals'].transform('mean'),
    difference = lambda x: x['HomeGoals'] - x['avg_team_goals']
)

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals,avg_team_goals,difference
0,1729,England Premier League,2008/2009,Manchester United,Newcastle United,1,1,2.223684,-1.223684
1,1730,England Premier League,2008/2009,Arsenal,West Bromwich Albion,1,0,2.013158,-1.013158
2,1731,England Premier League,2008/2009,Sunderland,Liverpool,0,1,1.210526,-1.210526
3,1732,England Premier League,2008/2009,West Ham United,Wigan Athletic,2,1,1.466165,0.533835
4,1733,England Premier League,2008/2009,Aston Villa,Manchester City,4,2,1.177632,2.822368
...,...,...,...,...,...,...,...,...,...
3035,4764,England Premier League,2015/2016,Southampton,Leicester City,2,2,1.763158,0.236842
3036,4765,England Premier League,2015/2016,Swansea City,Stoke City,0,1,1.421053,-1.421053
3037,4766,England Premier League,2015/2016,Tottenham Hotspur,Liverpool,0,0,1.677632,-1.677632
3038,4767,England Premier League,2015/2016,Watford,Arsenal,0,3,1.052632,-1.052632


In [57]:
# Now, store this DataFrame in a variable

pm = premier_league.assign(
    avg_team_goals = premier_league.groupby(['HomeTeam'])['HomeGoals'].transform('mean'),
    difference = lambda x: x['HomeGoals'] - x['avg_team_goals']
    )

In [58]:
# which teams does our home team perform the best/worst against?

# On average, Chelsea performs the worst against Bournemouth
# Arsenal appears to crush Blackpool whenever they play

pm.groupby(['HomeTeam', 'AwayTeam']).agg({'difference': 'mean'}).sort_values('difference')

Unnamed: 0_level_0,Unnamed: 1_level_0,difference
HomeTeam,AwayTeam,Unnamed: 2_level_1
Chelsea,Bournemouth,-2.190789
Southampton,Wigan Athletic,-1.763158
Southampton,Cardiff City,-1.763158
Leicester City,Hull City,-1.657895
Leicester City,Manchester City,-1.657895
...,...,...
Wolverhampton Wanderers,Blackpool,2.912281
Fulham,Queens Park Rangers,2.982456
Everton,Blackpool,3.302632
Leicester City,Queens Park Rangers,3.342105


In [59]:
# How often does this really happen?

# They only played each other once, so we can drop the earlier conclusion

pm.query("HomeTeam == 'Arsenal' & AwayTeam == 'Blackpool'")

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals,avg_team_goals,difference
870,2599,England Premier League,2010/2011,Arsenal,Blackpool,6,0,2.013158,3.986842


In [60]:
# Look at how Blackpool generally does when they're the AwayTeam

pm.query("AwayTeam == 'Blackpool'")

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals,avg_team_goals,difference
769,2498,England Premier League,2010/2011,Wigan Athletic,Blackpool,0,4,1.115789,-1.115789
795,2524,England Premier League,2010/2011,Aston Villa,Blackpool,3,2,1.177632,1.822368
801,2530,England Premier League,2010/2011,West Ham United,Blackpool,0,0,1.466165,-1.466165
828,2557,England Premier League,2010/2011,Bolton Wanderers,Blackpool,2,2,1.368421,0.631579
849,2578,England Premier League,2010/2011,Stoke City,Blackpool,0,1,1.342105,-1.342105
870,2599,England Premier League,2010/2011,Arsenal,Blackpool,6,0,2.013158,3.986842
881,2610,England Premier League,2010/2011,Sunderland,Blackpool,0,2,1.210526,-1.210526
894,2623,England Premier League,2010/2011,Manchester City,Blackpool,1,0,2.401316,-1.401316
916,2645,England Premier League,2010/2011,West Bromwich Albion,Blackpool,3,2,1.330827,1.669173
943,2672,England Premier League,2010/2011,Everton,Blackpool,5,3,1.697368,3.302632


## 4 - Pivot Tables

The `.pivot_table()` method lets you create Excel-style `pivot tables`



### 4.1 - Pivot Table Arguments

 <img src="../images/sec04-09_pivot_table.png">



#### Practice

In [61]:
premier_league.head()

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,1729,England Premier League,2008/2009,Manchester United,Newcastle United,1,1
1,1730,England Premier League,2008/2009,Arsenal,West Bromwich Albion,1,0
2,1731,England Premier League,2008/2009,Sunderland,Liverpool,0,1
3,1732,England Premier League,2008/2009,West Ham United,Wigan Athletic,2,1
4,1733,England Premier League,2008/2009,Aston Villa,Manchester City,4,2


In [62]:
premier_league.pivot_table(index='HomeTeam', 
                           columns='season', 
                           values='HomeGoals', 
                           aggfunc='sum')

season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Arsenal,31.0,48.0,33.0,39.0,47.0,36.0,41.0,31.0
Aston Villa,27.0,29.0,26.0,20.0,23.0,22.0,18.0,14.0
Birmingham City,,19.0,19.0,,,,,
Blackburn Rovers,22.0,28.0,22.0,26.0,,,,
Blackpool,,,30.0,,,,,
Bolton Wanderers,21.0,26.0,34.0,23.0,,,,
Bournemouth,,,,,,,,23.0
Burnley,,25.0,,,,,14.0,
Cardiff City,,,,,,20.0,,
Chelsea,33.0,68.0,39.0,41.0,41.0,43.0,36.0,32.0


In [67]:
# filter before pivoting

(premier_league
 .query("HomeTeam in ['Arsenal', 'Chelsea', 'Everton']")
 .pivot_table(index='HomeTeam', 
              columns='season', 
              values='HomeGoals', 
              aggfunc='sum')
)

season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Arsenal,31,48,33,39,47,36,41,31
Chelsea,33,68,39,41,41,43,36,32
Everton,31,35,31,28,33,38,27,35


In [68]:
(premier_league
 .query("HomeTeam in ['Arsenal', 'Chelsea', 'Everton']")
 .pivot_table(index='HomeTeam', 
              columns='season', 
              values='HomeGoals', 
              aggfunc='sum',
              margins=True)
)

season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016,All
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Arsenal,31,48,33,39,47,36,41,31,306
Chelsea,33,68,39,41,41,43,36,32,333
Everton,31,35,31,28,33,38,27,35,258
All,95,151,103,108,121,117,104,98,897


### 4.2 - Multiple Aggregation Functions

`Multiple aggregation functions` can be passed to the `aggfunc` argument
* The new values are added as additional columns

_The functions are passed as a tuple (a list such as `['min', 'max']` works as well)_

```python
    smaller_retail.pivot_table(index='family',
                              columns='store_nbr',
                              values='sales',
                              aggfunc=('min', 'max'))
```

_Use a dictionary to apply specific functions to specific columns_

```python
    smaller_retail.pivot_table(
        index='family',
        columns='store_nbr',
        aggfunc=({'sales': ['sum', 'mean'], 'onpromotion': 'max'})
    )
```
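
As a hedged, self-contained sketch of the first pattern (the `smaller_retail` frame below is a hypothetical stand-in with invented values, and the functions are passed as a list), passing several functions to `aggfunc` adds an extra column level for the function name:

```python
import pandas as pd

# Hypothetical stand-in for smaller_retail; the values are invented
smaller_retail = pd.DataFrame({
    'family': ['DAIRY', 'DAIRY', 'DAIRY', 'BREAD', 'BREAD', 'BREAD'],
    'store_nbr': [1, 1, 2, 1, 2, 2],
    'sales': [10.0, 20.0, 7.0, 4.0, 6.0, 8.0],
})

table = smaller_retail.pivot_table(index='family',
                                   columns='store_nbr',
                                   values='sales',
                                   aggfunc=['min', 'max'])

# Columns are (function, store_nbr) pairs
print(table.loc['DAIRY', ('max', 1)])  # 20.0
print(table.loc['BREAD', ('min', 2)])  # 6.0
```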



### 4.3 - Pivot Tables Vs. Groupby

<img src="../images/sec04-10_pivot_vs_groupby_1.png">

<img src="../images/sec04-11_pivot_vs_groupby_2.png">

_groupby allows us to create named aggregations_
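
One way to see how the two approaches relate is to build the same table both ways. The sketch below uses invented data (the `games` frame is hypothetical) and assumes a single aggregation function; `.unstack()` moves the `season` index level into the columns.

```python
import pandas as pd

# Hypothetical data, invented for illustration
games = pd.DataFrame({
    'season': ['2008/2009', '2008/2009', '2009/2010', '2009/2010', '2009/2010'],
    'HomeTeam': ['Arsenal', 'Chelsea', 'Arsenal', 'Chelsea', 'Chelsea'],
    'HomeGoals': [1, 2, 3, 4, 5],
})

# Pivot table: rows, columns, values and aggregation in one call
pivoted = games.pivot_table(index='HomeTeam', columns='season',
                            values='HomeGoals', aggfunc='sum')

# Groupby: aggregate first, then move 'season' from the index to the columns
grouped = (games
           .groupby(['HomeTeam', 'season'])['HomeGoals']
           .sum()
           .unstack('season'))

print(pivoted)
print(grouped)
```

Both calls produce a team-by-season grid of totals; the groupby route takes an extra step but leaves room for named aggregations before reshaping.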

### 4.4 - **PRO TIP:** Pivot Table Heat Maps

`cmap` is short for `color map`

<img src="../images/sec04-12_heatmap.png">

## 5 - Melting DataFrames

<img src="../images/sec04-13_melt_1.png">

<img src="../images/sec04-14_melt_2.png">

<img src="../images/sec04-15_melt_3.png">
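
A minimal sketch of `.melt()` on a small wide table (the `wide` frame is hypothetical, and the output names `season` and `home_goals` are chosen for this example): `id_vars` stays as a column, each column listed in `value_vars` becomes rows, `var_name` labels the column holding the old headers, and `value_name` labels the column holding the values.

```python
import pandas as pd

# Hypothetical wide table: one row per team, one column per season
wide = pd.DataFrame({
    'HomeTeam': ['Arsenal', 'Chelsea'],
    '2008/2009': [31, 33],
    '2009/2010': [48, 68],
})

long = wide.melt(id_vars='HomeTeam',
                 value_vars=['2008/2009', '2009/2010'],
                 var_name='season',
                 value_name='home_goals')

print(list(long.columns))  # ['HomeTeam', 'season', 'home_goals']
print(len(long))           # 4 (2 teams x 2 seasons)
```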

#### Practice

In [76]:
pm = (premier_league
 .query("HomeTeam in ['Arsenal', 'Chelsea', 'Everton']")
 .pivot_table(index='HomeTeam', 
              columns='season', 
              values='HomeGoals', 
              aggfunc='mean')
)

In [77]:
pm 

season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Arsenal,1.631579,2.526316,1.736842,2.052632,2.473684,1.894737,2.157895,1.631579
Chelsea,1.736842,3.578947,2.052632,2.157895,2.157895,2.263158,1.894737,1.684211
Everton,1.631579,1.842105,1.631579,1.473684,1.736842,2.0,1.421053,1.842105


In [78]:
# HomeTeam is stored in the index, so reset the index first to turn it back into a column that melt can use as id_vars
# var_name labels the column holding the old column headers (the seasons); value_name labels the melted values

pm.reset_index().melt(id_vars='HomeTeam',
                      value_vars=['2008/2009', '2009/2010', '2010/2011'],
                      var_name='season',
                      value_name='avg_goals')


Unnamed: 0,HomeTeam,season,avg_goals
0,Arsenal,2008/2009,1.631579
1,Chelsea,2008/2009,1.736842
2,Everton,2008/2009,1.631579
3,Arsenal,2009/2010,2.526316
4,Chelsea,2009/2010,3.578947
5,Everton,2009/2010,1.842105
6,Arsenal,2010/2011,1.736842
7,Chelsea,2010/2011,2.052632
8,Everton,2010/2011,1.631579


# Key Takeaways

<img src="../images/sec04-16_key_takeaways.png">