### **Aggregating DataFrames** </br> 

In [1]:
import pandas as pd
import numpy as np

##### **You can aggregate a DataFrame's columns by using aggregation methods** </br> 

In [2]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [3]:
# aggregate columns like a Series: .sum() over a random sample of 100 rows (no random_state, so the result changes on each run)
retail_df.loc[:, ['sales', 'onpromotion']].sample(100).sum().round(2)

sales          50821.11
onpromotion      679.00
dtype: float64

In [4]:
# aggregate columns like a Series: .mean() over a random sample of 100 rows (no random_state, so the result changes on each run)
retail_df.loc[:, ['sales', 'onpromotion']].sample(100).mean().round(2)

sales          449.68
onpromotion      7.81
dtype: float64

##### **Aggregation methods can be called on an entire DataFrame </br> But this is not ideal** </br> For the DataFrame as a whole, summary statistics from the `.describe()` method are the better choice
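
As a rough sketch of the point above, the cell below uses a tiny made-up DataFrame (`toy` and its values are invented for illustration, not taken from the retail data) to show two ways of restricting an aggregation to numeric columns, next to `.describe()`, which skips non-numeric columns by default.

```python
import pandas as pd

# Hypothetical miniature data, invented for illustration only
toy = pd.DataFrame({
    "date": ["2016-01-01", "2016-01-02", "2016-01-03"],
    "family": ["BOOKS", "DAIRY", "DAIRY"],
    "sales": [1.0, 2.0, 3.0],
    "onpromotion": [0, 1, 2],
})

# Option 1: keep only numeric columns, then aggregate
numeric_sum = toy.select_dtypes("number").sum()

# Option 2: ask the aggregation itself to skip non-numeric columns
numeric_mean = toy.mean(numeric_only=True)

# .describe() summarizes only the numeric columns by default
summary = toy.describe()

print(numeric_sum)
print(numeric_mean)
print(summary)
```

Either option avoids concatenated strings in `.sum()` and the `TypeError` from `.mean()` on object columns.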

In [5]:
# Create random sample of 100 for aggregation examples
sample_df = retail_df.sample(100, random_state=616)
sample_df

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
399033,2344977,2016-08-11,54,PRODUCE,487.239,1
579626,2525570,2016-11-21,22,HARDWARE,0.000,0
546385,2492329,2016-11-02,4,BOOKS,3.000,0
534555,2480499,2016-10-26,8,LINGERIE,7.000,0
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0
...,...,...,...,...,...,...
402640,2348584,2016-08-13,7,CLEANING,702.000,6
424226,2370170,2016-08-26,12,FROZEN FOODS,34.000,0
650850,2596794,2017-01-01,20,MEATS,0.000,0
384193,2330137,2016-08-03,39,CLEANING,1031.000,21


In [6]:
# Call aggregate function .sum() on entire DataFrame
sample_df.sum()
# not ideal: object (string) columns such as 'date' and 'family' are concatenated instead of summed

id                                                     247919675
date           2016-08-112016-11-212016-11-022016-10-262016-0...
store_nbr                                                   2946
family         PRODUCEHARDWAREBOOKSLINGERIEPRODUCEBOOKSBREAD/...
sales                                                  85086.114
onpromotion                                                  905
dtype: object

In [7]:
# Call aggregate function .mean() on entire DataFrame
# sample_df.mean()
##############################
# TypeError: Could not convert
##############################
# the object columns ('date', 'family') cannot be averaged; select the numeric columns first

In [8]:
# call sum aggregation on specific columns of DataFrame (.sum())
sample_df.loc[:, ['sales', 'onpromotion']].sum()

sales          85086.114
onpromotion      905.000
dtype: float64

In [9]:
# call Standard Deviation aggregation on specific columns of DataFrame (.std())
sample_df.loc[:, ['sales', 'onpromotion']].std()

sales          2177.512241
onpromotion      19.364330
dtype: float64

In [10]:
# .describe() is the better method to get summary statistics for an entire DataFrame
sample_df.describe().round(2)

Unnamed: 0,id,store_nbr,sales,onpromotion
count,100.0,100.0,100.0,100.0
mean,2479196.75,29.46,850.86,9.05
std,291897.71,15.95,2177.51,19.36
min,1949080.0,1.0,0.0,0.0
25%,2248124.5,16.75,3.75,0.0
50%,2480757.5,31.0,47.0,0.0
75%,2759322.5,44.0,521.81,6.5
max,2991624.0,54.0,11596.0,95.0


##### **Grouping DataFrames**</br> Grouping a DataFrame allows data to be aggregated at a different level </br> - Can roll daily data up to monthly </br> - Can roll transaction-level data up to store level </br> `.groupby(column_to_groupby)[[list_of_columns_to_aggregate]].aggregation_method()` </br> Specify the column to group by, then the column(s) to aggregate as a list in nested `[[]]` so that a DataFrame is returned. The `column_to_groupby` column becomes the index by default
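
A minimal sketch of the bracket behaviour described above, assuming a small invented DataFrame (`toy` is hypothetical and not part of the course data): nested `[[ ]]` returns a DataFrame, while a single `[ ]` returns a Series.

```python
import pandas as pd

# Hypothetical miniature data, invented for illustration only
toy = pd.DataFrame({
    "family": ["BOOKS", "BOOKS", "DAIRY", "DAIRY"],
    "sales": [1.0, 2.0, 10.0, 20.0],
})

as_frame = toy.groupby("family")[["sales"]].sum()   # nested [[ ]] -> DataFrame
as_series = toy.groupby("family")["sales"].sum()    # single [ ]  -> Series

print(type(as_frame).__name__, type(as_series).__name__)
print(as_frame.index.name)  # the grouping column becomes the index
```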

In [11]:
# call groupby method on just 'family' column
retail_df.groupby('family')
# this returns a DataFrameGroupBy object; nothing is aggregated until an aggregation method is called

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001D87AE0F4D0>

In [12]:
# aggregate 'sales' grouped by 'family'
retail_df.groupby('family')[['sales']].sum().round().sort_values('sales', ascending=False)
# returns the sum of 'sales' grouped by the 'family' values in a DataFrame

Unnamed: 0_level_0,sales
family,Unnamed: 1_level_1
GROCERY I,143227476.0
BEVERAGES,105700279.0
PRODUCE,73523507.0
CLEANING,38127743.0
DAIRY,28422893.0
BREAD/BAKERY,17092978.0
POULTRY,12375952.0
MEATS,11551426.0
PERSONAL CARE,10193693.0
DELI,9617777.0


##### **Grouping Multiple Columns in DataFrames**</br> Grouping by multiple columns aggregates data at a different level and returns a multi-index </br> - Can roll daily data up to monthly </br> - Can roll transaction-level data up to store level </br> `.groupby([columns_to_groupby])[[list_of_columns_to_aggregate]].aggregation_method()` </br> Specify the columns to group by as a list, then the column(s) to aggregate in nested `[[]]`; the result is a multi-indexed DataFrame. </br> The result can also be returned with an integer index by passing `as_index=False` to the groupby method</br> To sort the aggregated DataFrame, use `.sort_values(by=[list_of_columns_to_sort], ascending=[list_of_True_False_for_sort_order])` </br> The `.query()` method can also be chained
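
The chain described above might look like the sketch below. The DataFrame `toy` and the `sales > 1` threshold are made up for illustration; only the method chain mirrors the syntax in the text.

```python
import pandas as pd

# Hypothetical miniature data, invented for illustration only
toy = pd.DataFrame({
    "family": ["BOOKS", "BOOKS", "DAIRY", "DAIRY", "DAIRY"],
    "store_nbr": [1, 2, 1, 1, 2],
    "sales": [1.0, 2.0, 10.0, 20.0, 5.0],
})

result = (
    toy.groupby(["family", "store_nbr"], as_index=False)[["sales"]]
    .sum()                                   # one row per (family, store_nbr)
    .sort_values(by=["family", "sales"],     # family A-Z, then sales high to low
                 ascending=[True, False])
    .query("sales > 1")                      # filter the aggregated rows
)

print(result)
```

Because `as_index=False` keeps `family` and `store_nbr` as ordinary columns, both `.sort_values()` and `.query()` can refer to them by name.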

In [13]:
# aggregate 'sales' grouped by 'family' and 'store_nbr', which creates a multi-index DataFrame
retail_df.groupby(['family', 'store_nbr'])[['sales']].sum().round()
# displayed as multi-index DataFrame

Unnamed: 0_level_0,Unnamed: 1_level_0,sales
family,store_nbr,Unnamed: 2_level_1
AUTOMOTIVE,1,2524.0
AUTOMOTIVE,2,3918.0
AUTOMOTIVE,3,6790.0
AUTOMOTIVE,4,2565.0
AUTOMOTIVE,5,3667.0
...,...,...
SEAFOOD,50,12774.0
SEAFOOD,51,34251.0
SEAFOOD,52,1219.0
SEAFOOD,53,3745.0


In [14]:
# aggregate 'sales' grouped by 'family' and 'store_nbr'
# pass as_index=False to get an integer index instead of a multi-index
retail_df.groupby(['family', 'store_nbr'], as_index=False)[['sales']].sum().round()
# displayed as a DataFrame with an integer index and the aggregations

Unnamed: 0,family,store_nbr,sales
0,AUTOMOTIVE,1,2524.0
1,AUTOMOTIVE,2,3918.0
2,AUTOMOTIVE,3,6790.0
3,AUTOMOTIVE,4,2565.0
4,AUTOMOTIVE,5,3667.0
...,...,...,...
1777,SEAFOOD,50,12774.0
1778,SEAFOOD,51,34251.0
1779,SEAFOOD,52,1219.0
1780,SEAFOOD,53,3745.0


##### **Multi-Index DataFrames**</br> Typically created through aggregation operations that group by more than one column </br> The index is presented as a list of tuples, with one item for each level of the index

In [15]:
# Create a DataFrame of 'sales' sums grouped by 'family' and 'store_nbr', which has a multi-index
sales_sums = retail_df.groupby(['family', 'store_nbr'])[['sales']].sum().round()
sales_sums

Unnamed: 0_level_0,Unnamed: 1_level_0,sales
family,store_nbr,Unnamed: 2_level_1
AUTOMOTIVE,1,2524.0
AUTOMOTIVE,2,3918.0
AUTOMOTIVE,3,6790.0
AUTOMOTIVE,4,2565.0
AUTOMOTIVE,5,3667.0
...,...,...
SEAFOOD,50,12774.0
SEAFOOD,51,34251.0
SEAFOOD,52,1219.0
SEAFOOD,53,3745.0


In [16]:
# display the index of sales_sums
sales_sums.index
# each index entry is a tuple of (family, store_nbr)

MultiIndex([('AUTOMOTIVE',  1),
            ('AUTOMOTIVE',  2),
            ('AUTOMOTIVE',  3),
            ('AUTOMOTIVE',  4),
            ('AUTOMOTIVE',  5),
            ('AUTOMOTIVE',  6),
            ('AUTOMOTIVE',  7),
            ('AUTOMOTIVE',  8),
            ('AUTOMOTIVE',  9),
            ('AUTOMOTIVE', 10),
            ...
            (   'SEAFOOD', 45),
            (   'SEAFOOD', 46),
            (   'SEAFOOD', 47),
            (   'SEAFOOD', 48),
            (   'SEAFOOD', 49),
            (   'SEAFOOD', 50),
            (   'SEAFOOD', 51),
            (   'SEAFOOD', 52),
            (   'SEAFOOD', 53),
            (   'SEAFOOD', 54)],
           names=['family', 'store_nbr'], length=1782)

##### **Accessing Multi-Index DataFrames**</br> Use the `.loc[]` accessor in different ways
|Way|Description|
|---|-----------|
|&nbsp;&nbsp;1&nbsp;|Access rows via the outer index only|
|&nbsp;&nbsp;2&nbsp;|Access rows via the outer and inner index `as a tuple`: `[('outer_index', 'inner_index')]`|
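
The two access patterns in the table can be sketched on a small invented DataFrame (`toy` and `sums` are hypothetical names used only for this illustration):

```python
import pandas as pd

# Hypothetical miniature data, invented for illustration only
toy = pd.DataFrame({
    "family": ["BOOKS", "BOOKS", "DAIRY", "DAIRY"],
    "store_nbr": [1, 2, 1, 2],
    "sales": [1.0, 2.0, 10.0, 20.0],
})
sums = toy.groupby(["family", "store_nbr"])[["sales"]].sum()

outer_only = sums.loc["DAIRY"]             # way 1: DataFrame indexed by store_nbr
one_row = sums.loc[("DAIRY", 2), :]        # way 2: Series for one (outer, inner) pair
one_row_df = sums.loc[[("DAIRY", 2)], :]   # list of tuples keeps a DataFrame

print(outer_only)
print(one_row)
print(one_row_df)
```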

In [17]:
# Access rows via the outer index only
sales_sums.loc['AUTOMOTIVE'].head(15)
# this displays rows for 'AUTOMOTIVE' only, and the 'family' index level is dropped

Unnamed: 0_level_0,sales
store_nbr,Unnamed: 1_level_1
1,2524.0
2,3918.0
3,6790.0
4,2565.0
5,3667.0
6,3442.0
7,3031.0
8,3225.0
9,7695.0
10,1772.0


In [18]:
# can also slice the outer layer
sales_sums.loc['AUTOMOTIVE':'BEAUTY']
# this displays all rows from 'AUTOMOTIVE' through 'BEAUTY', and the 'family' index level is kept

Unnamed: 0_level_0,Unnamed: 1_level_0,sales
family,store_nbr,Unnamed: 2_level_1
AUTOMOTIVE,1,2524.0
AUTOMOTIVE,2,3918.0
AUTOMOTIVE,3,6790.0
AUTOMOTIVE,4,2565.0
AUTOMOTIVE,5,3667.0
...,...,...
BEAUTY,50,6353.0
BEAUTY,51,3566.0
BEAUTY,52,972.0
BEAUTY,53,3812.0


In [19]:
# Access rows via the outer and inner indices
# pass the labels as a tuple and match each level's data type (strings in quotes, integers without)
sales_sums.loc[('AUTOMOTIVE', 5), :]
# this returns a Series; the 'family' and 'store_nbr' index levels are dropped

sales    3667.0
Name: (AUTOMOTIVE, 5), dtype: float64

In [20]:
# Access rows via the outer and inner indices
# pass the labels as a tuple and match each level's data type (strings in quotes, integers without)
# Wrapping the tuple in [] returns a DataFrame
sales_sums.loc[[('AUTOMOTIVE', 5)], :]
# this displays as a DataFrame that keeps both index levels

Unnamed: 0_level_0,Unnamed: 1_level_0,sales
family,store_nbr,Unnamed: 2_level_1
AUTOMOTIVE,5,3667.0


In [21]:
# slicing between two tuples keeps the multi-index; chaining .reset_index() moves the index levels back into columns
sales_sums.loc[('AUTOMOTIVE', 5):('BEAUTY', 5),:]

Unnamed: 0_level_0,Unnamed: 1_level_0,sales
family,store_nbr,Unnamed: 2_level_1
AUTOMOTIVE,5,3667.0
AUTOMOTIVE,6,3442.0
AUTOMOTIVE,7,3031.0
AUTOMOTIVE,8,3225.0
AUTOMOTIVE,9,7695.0
...,...,...
BEAUTY,1,1776.0
BEAUTY,2,3824.0
BEAUTY,3,8150.0
BEAUTY,4,3063.0


In [22]:
# when slicing multi-index DataFrames, chain .reset_index()
# to create a new integer index; the former index levels become regular columns next to the aggregations
sales_sums.loc[('AUTOMOTIVE', 5):('BEAUTY', 5),:].reset_index()

Unnamed: 0,family,store_nbr,sales
0,AUTOMOTIVE,5,3667.0
1,AUTOMOTIVE,6,3442.0
2,AUTOMOTIVE,7,3031.0
3,AUTOMOTIVE,8,3225.0
4,AUTOMOTIVE,9,7695.0
...,...,...,...
104,BEAUTY,1,1776.0
105,BEAUTY,2,3824.0
106,BEAUTY,3,8150.0
107,BEAUTY,4,3063.0


##### **Modifying Multi-Index DataFrames**</br> Several ways to modify multi-index DataFrames </br> ---- Best to Reset The Index ----
|Way|Description|
|---|-----------|
|`.reset_index()`|Most common operation; returns the DataFrame to an integer-based index and keeps the aggregations|
|`.swaplevel()`|Changes the hierarchy of index levels|
|`.droplevel()`|Drops an index level from the DataFrame entirely; the labels in that level are lost in the returned DataFrame|
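
A small sketch of the three methods in the table, again on invented data (`toy` and `sums` are hypothetical). Each method returns a new DataFrame, so in this sketch the original `sums` keeps both index levels unless the result is assigned back.

```python
import pandas as pd

# Hypothetical miniature data, invented for illustration only
toy = pd.DataFrame({
    "family": ["BOOKS", "BOOKS", "DAIRY", "DAIRY"],
    "store_nbr": [1, 2, 1, 2],
    "sales": [1.0, 2.0, 10.0, 20.0],
})
sums = toy.groupby(["family", "store_nbr"])[["sales"]].sum()

flat = sums.reset_index()           # index levels become ordinary columns
swapped = sums.swaplevel()          # store_nbr becomes the outer level
dropped = sums.droplevel("family")  # only store_nbr remains in the index

print(flat.columns.tolist())
print(list(swapped.index.names))
print(dropped.index.name)
print(sums.index.nlevels)           # the original still has two levels
```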

In [23]:
# reset_index() keeps the aggregation and reverts to an integer index
sales_sums.reset_index()

Unnamed: 0,family,store_nbr,sales
0,AUTOMOTIVE,1,2524.0
1,AUTOMOTIVE,2,3918.0
2,AUTOMOTIVE,3,6790.0
3,AUTOMOTIVE,4,2565.0
4,AUTOMOTIVE,5,3667.0
...,...,...,...
1777,SEAFOOD,50,12774.0
1778,SEAFOOD,51,34251.0
1779,SEAFOOD,52,1219.0
1780,SEAFOOD,53,3745.0


In [24]:
# swap levels so 'store_nbr' becomes the outer level instead of 'family'; useful for selecting rows by what was the inner level
sales_sums.swaplevel()

Unnamed: 0_level_0,Unnamed: 1_level_0,sales
store_nbr,family,Unnamed: 2_level_1
1,AUTOMOTIVE,2524.0
2,AUTOMOTIVE,3918.0
3,AUTOMOTIVE,6790.0
4,AUTOMOTIVE,2565.0
5,AUTOMOTIVE,3667.0
...,...,...
50,SEAFOOD,12774.0
51,SEAFOOD,34251.0
52,SEAFOOD,1219.0
53,SEAFOOD,3745.0


In [25]:
# drop the 'family' level; the returned DataFrame no longer has it (sales_sums itself changes only if the result is assigned back)
sales_sums.droplevel('family')

Unnamed: 0_level_0,sales
store_nbr,Unnamed: 1_level_1
1,2524.0
2,3918.0
3,6790.0
4,2565.0
5,3667.0
...,...
50,12774.0
51,34251.0
52,1219.0
53,3745.0


In [26]:
# remove 'date' and 'id' from retail_df for .agg() method functions (issue with object dtypes)
small_retail = retail_df.drop(columns=['date','id'])
small_retail

Unnamed: 0,store_nbr,family,sales,onpromotion
0,1,AUTOMOTIVE,0.000,0
1,1,BABY CARE,0.000,0
2,1,BEAUTY,0.000,0
3,1,BEVERAGES,0.000,0
4,1,BOOKS,0.000,0
...,...,...,...,...
1054939,9,POULTRY,438.133,0
1054940,9,PREPARED FOODS,154.553,1
1054941,9,PRODUCE,2419.729,148
1054942,9,SCHOOL AND OFFICE SUPPLIES,121.000,8


##### **AGG Method**</br> Enables multiple aggregations on a `groupby` object</br> The `.agg('operation')` method is the more flexible way to perform aggregation calculations

In [27]:
# using .agg('operation') will perform the operation on all applicable columns
small_retail.groupby(['store_nbr', 'family']).agg('sum').round()
# 'sum' is applied to the 'sales' and 'onpromotion' columns

Unnamed: 0_level_0,Unnamed: 1_level_0,sales,onpromotion
store_nbr,family,Unnamed: 2_level_1,Unnamed: 3_level_1
1,AUTOMOTIVE,2524.0,14
1,BABY CARE,0.0,0
1,BEAUTY,1776.0,190
1,BEVERAGES,1238601.0,13793
1,BOOKS,211.0,0
...,...,...,...
54,POULTRY,35537.0,909
54,PREPARED FOODS,42792.0,577
54,PRODUCE,378612.0,6734
54,SCHOOL AND OFFICE SUPPLIES,997.0,277


##### **Multiple Aggregations using .agg() method**</br> Can perform `multiple` aggregations by passing a list of aggregation functions </br>`df.groupby(['list_of_groupby_columns']).agg(['list_of_agg_functions'])` </br> Can perform `specific` aggregations by column by passing a dictionary with column names as keys and aggregation functions as values</br>`df.groupby(['list_of_groupby_columns']).agg({'column_name':'aggregation_function'})` 
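
To make the two call shapes concrete, here is a sketch on a made-up DataFrame (`toy` is hypothetical). Both forms produce a two-level column index, which is what the cells that follow display.

```python
import pandas as pd

# Hypothetical miniature data, invented for illustration only
toy = pd.DataFrame({
    "family": ["BOOKS", "BOOKS", "DAIRY", "DAIRY"],
    "sales": [1.0, 2.0, 10.0, 20.0],
    "onpromotion": [0, 1, 3, 5],
})
grouped = toy.groupby("family")

# List: every function is applied to every remaining column
multi = grouped.agg(["sum", "mean"])

# Dictionary: choose the functions column by column
per_col = grouped.agg({"sales": ["sum", "mean"], "onpromotion": "max"})

print(multi)
print(per_col)
```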

In [28]:
# passing a list to .agg() performs every listed operation on all applicable columns
small_retail.groupby(['store_nbr', 'family']).agg(['sum','mean']).round()
# 'sum' and 'mean' are applied to the 'sales' and 'onpromotion' columns, creating a multi-level column index

Unnamed: 0_level_0,Unnamed: 1_level_0,sales,sales,onpromotion,onpromotion
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,sum,mean
store_nbr,family,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1,AUTOMOTIVE,2524.0,4.0,14,0.0
1,BABY CARE,0.0,0.0,0,0.0
1,BEAUTY,1776.0,3.0,190,0.0
1,BEVERAGES,1238601.0,2092.0,13793,23.0
1,BOOKS,211.0,0.0,0,0.0
...,...,...,...,...,...
54,POULTRY,35537.0,60.0,909,2.0
54,PREPARED FOODS,42792.0,72.0,577,1.0
54,PRODUCE,378612.0,640.0,6734,11.0
54,SCHOOL AND OFFICE SUPPLIES,997.0,2.0,277,0.0


In [29]:
# Multiple aggregations: 'sales' gets 'sum' and 'mean', 'onpromotion' gets 'min' and 'max'
small_retail.groupby(
    ['family', 'store_nbr']).agg(
        {'sales':['sum','mean'],
        'onpromotion':['min','max']}).round()


Unnamed: 0_level_0,Unnamed: 1_level_0,sales,sales,onpromotion,onpromotion
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,min,max
family,store_nbr,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
AUTOMOTIVE,1,2524.0,4.0,0,1
AUTOMOTIVE,2,3918.0,7.0,0,1
AUTOMOTIVE,3,6790.0,11.0,0,1
AUTOMOTIVE,4,2565.0,4.0,0,1
AUTOMOTIVE,5,3667.0,6.0,0,2
...,...,...,...,...,...
SEAFOOD,50,12774.0,22.0,0,7
SEAFOOD,51,34251.0,58.0,0,7
SEAFOOD,52,1219.0,2.0,0,5
SEAFOOD,53,3745.0,6.0,0,5


##### **Named Aggregations using .agg() Method**</br> Aggregated columns can be named on creation to avoid multi-index columns </br>`df.groupby('groupby_column').agg(new_column_name=('column_to_aggregate', 'agg_function'))` </br> Multiple columns can be created by separating each `new_column_name=()` with a comma </br> This provides column labels that are easier to understand
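
A minimal sketch of named aggregation, assuming an invented DataFrame (`toy` and the output column names are illustrative choices): each keyword becomes a flat column label in the result.

```python
import pandas as pd

# Hypothetical miniature data, invented for illustration only
toy = pd.DataFrame({
    "family": ["BOOKS", "BOOKS", "DAIRY", "DAIRY"],
    "sales": [1.0, 2.0, 10.0, 20.0],
    "onpromotion": [0, 1, 3, 5],
})

summary = toy.groupby("family", as_index=False).agg(
    sales_sum=("sales", "sum"),              # new_name=(source_column, function)
    sales_avg=("sales", "mean"),
    onpromotion_max=("onpromotion", "max"),
)

print(summary)
```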

In [30]:
# use as_index=False to remove multi-index rows, and then create columns with .agg() function to prevent multi-index columns
(small_retail.groupby(
    ['family', 'store_nbr'],as_index=False).agg(
        sales_sum = ('sales','sum'),
        sales_avg = ('sales', 'mean'),
        onpromotion_max = ('onpromotion', 'max')
    )
)

Unnamed: 0,family,store_nbr,sales_sum,sales_avg,onpromotion_max
0,AUTOMOTIVE,1,2524.000000,4.263514,1
1,AUTOMOTIVE,2,3918.000000,6.618243,1
2,AUTOMOTIVE,3,6790.000000,11.469595,1
3,AUTOMOTIVE,4,2565.000000,4.332770,1
4,AUTOMOTIVE,5,3667.000000,6.194257,2
...,...,...,...,...,...
1777,SEAFOOD,50,12773.966999,21.577647,7
1778,SEAFOOD,51,34250.948976,57.856333,7
1779,SEAFOOD,52,1219.475999,2.059926,5
1780,SEAFOOD,53,3745.180001,6.326318,5


In [31]:
# use as_index=False to remove multi-index rows, and then create columns with .agg() function to prevent multi-index columns
(sample_df.groupby(
    ['family', 'store_nbr'],as_index=False).agg(
        sales_sum = ('sales','sum'),
        sales_avg = ('sales', 'mean'),
        onpromotion_max = ('onpromotion', 'max')
    )
)

Unnamed: 0,family,store_nbr,sales_sum,sales_avg,onpromotion_max
0,AUTOMOTIVE,6,4.000,4.000,0
1,AUTOMOTIVE,13,3.000,3.000,0
2,BABY CARE,10,0.000,0.000,0
3,BABY CARE,30,0.000,0.000,0
4,BABY CARE,35,0.000,0.000,0
...,...,...,...,...,...
92,PRODUCE,54,487.239,487.239,1
93,SCHOOL AND OFFICE SUPPLIES,19,0.000,0.000,0
94,SEAFOOD,4,34.689,34.689,0
95,SEAFOOD,40,5.000,5.000,1


##### **Transform .transform() Method**</br> Can be used to perform aggregations without reshaping </br> Useful for calculating group-level statistics to perform row-level analysis </br> Typically used for operations that need to return a result of the same size as the input; the result has the same index as the original. </br> `df.groupby(['column(s)_to_group'])['column(s)_to_transform'].transform('function')`
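
As an illustrative sketch (the DataFrame `toy` and the `share` column are invented here), `.transform('sum')` broadcasts each group's total back onto every row, so row-level values can be compared with their group-level statistic.

```python
import pandas as pd

# Hypothetical miniature data, invented for illustration only
toy = pd.DataFrame({
    "store_nbr": [1, 1, 2],
    "sales": [10.0, 30.0, 5.0],
})

with_totals = toy.assign(
    # group total repeated on every row of the group; row count is unchanged
    store_sales=toy.groupby("store_nbr")["sales"].transform("sum"),
    # the lambda receives the DataFrame including the column created above
    share=lambda d: d["sales"] / d["store_sales"],
)

print(with_totals)
```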

In [32]:
# add a 'store_sales' column holding the total sales of each row's store
sample_df.assign(
    store_sales = (sample_df.groupby('store_nbr')['sales'].transform('sum'))
)

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion,store_sales
399033,2344977,2016-08-11,54,PRODUCE,487.239,1,1529.239
579626,2525570,2016-11-21,22,HARDWARE,0.000,0,42.000
546385,2492329,2016-11-02,4,BOOKS,3.000,0,310.689
534555,2480499,2016-10-26,8,LINGERIE,7.000,0,212.390
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0,5927.624
...,...,...,...,...,...,...,...
402640,2348584,2016-08-13,7,CLEANING,702.000,6,5927.624
424226,2370170,2016-08-26,12,FROZEN FOODS,34.000,0,405.772
650850,2596794,2017-01-01,20,MEATS,0.000,0,1025.000
384193,2330137,2016-08-03,39,CLEANING,1031.000,21,1049.000


In [33]:
# demo with soccer excel file
premier_league = pd.read_excel('Pandas Course Resources/retail/premier_league_games_full.xlsx')
premier_league.head()

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,1729,England Premier League,2008/2009,Manchester United,Newcastle United,1,1
1,1730,England Premier League,2008/2009,Arsenal,West Bromwich Albion,1,0
2,1731,England Premier League,2008/2009,Sunderland,Liverpool,0,1
3,1732,England Premier League,2008/2009,West Ham United,Wigan Athletic,2,1
4,1733,England Premier League,2008/2009,Aston Villa,Manchester City,4,2


In [34]:
# calc avg goals for HomeTeam in new column, without collapsing rows grouping by HomeTeam
premier_league.assign(
    avg_team_goals = premier_league.groupby(['HomeTeam'])['HomeGoals'].transform('mean'),
    # use lambda to use previous column created
    difference = lambda x: x['HomeGoals'] - x['avg_team_goals']
)
# the result is still a row-level DataFrame

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals,avg_team_goals,difference
0,1729,England Premier League,2008/2009,Manchester United,Newcastle United,1,1,2.223684,-1.223684
1,1730,England Premier League,2008/2009,Arsenal,West Bromwich Albion,1,0,2.013158,-1.013158
2,1731,England Premier League,2008/2009,Sunderland,Liverpool,0,1,1.210526,-1.210526
3,1732,England Premier League,2008/2009,West Ham United,Wigan Athletic,2,1,1.466165,0.533835
4,1733,England Premier League,2008/2009,Aston Villa,Manchester City,4,2,1.177632,2.822368
...,...,...,...,...,...,...,...,...,...
3035,4764,England Premier League,2015/2016,Southampton,Leicester City,2,2,1.763158,0.236842
3036,4765,England Premier League,2015/2016,Swansea City,Stoke City,0,1,1.421053,-1.421053
3037,4766,England Premier League,2015/2016,Tottenham Hotspur,Liverpool,0,0,1.677632,-1.677632
3038,4767,England Premier League,2015/2016,Watford,Arsenal,0,3,1.052632,-1.052632


In [35]:
# calc avg goals for HomeTeam in new column, without collapsing rows grouping by HomeTeam
pm = premier_league.assign(
    avg_team_goals = premier_league.groupby(['HomeTeam'])['HomeGoals'].transform('mean'),
    # use lambda to use previous column created
    difference = lambda x: x['HomeGoals'] - x['avg_team_goals']
)
# the result is still a row-level DataFrame

In [36]:
# calculate mean of difference column grouped by HomeTeam, AwayTeam
pm.groupby(['HomeTeam', 'AwayTeam']).agg({'difference':'mean'}).sort_values('difference')

Unnamed: 0_level_0,Unnamed: 1_level_0,difference
HomeTeam,AwayTeam,Unnamed: 2_level_1
Chelsea,Bournemouth,-2.190789
Southampton,Wigan Athletic,-1.763158
Southampton,Cardiff City,-1.763158
Leicester City,Hull City,-1.657895
Leicester City,Manchester City,-1.657895
...,...,...
Wolverhampton Wanderers,Blackpool,2.912281
Fulham,Queens Park Rangers,2.982456
Everton,Blackpool,3.302632
Leicester City,Queens Park Rangers,3.342105


In [37]:
# filter rows with .query(): the expression is one string, with string values in single quotes
pm.query("HomeTeam == 'Arsenal' and AwayTeam == 'Blackpool'")

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals,avg_team_goals,difference
870,2599,England Premier League,2010/2011,Arsenal,Blackpool,6,0,2.013158,3.986842


In [38]:
# filter rows with .query(): the expression is one string, with string values in single quotes
pm.query("AwayTeam == 'Blackpool'").head()

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals,avg_team_goals,difference
769,2498,England Premier League,2010/2011,Wigan Athletic,Blackpool,0,4,1.115789,-1.115789
795,2524,England Premier League,2010/2011,Aston Villa,Blackpool,3,2,1.177632,1.822368
801,2530,England Premier League,2010/2011,West Ham United,Blackpool,0,0,1.466165,-1.466165
828,2557,England Premier League,2010/2011,Bolton Wanderers,Blackpool,2,2,1.368421,0.631579
849,2578,England Premier League,2010/2011,Stoke City,Blackpool,0,1,1.342105,-1.342105


### **Pivot Tables** 

In [None]:
path_retail = 'Pandas Course Resources/retail/retail_2016_2017.csv'
retail_df = pd.read_csv(path_retail)

retail_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
0,1945944,2016-01-01,1,AUTOMOTIVE,0.0,0
1,1945945,2016-01-01,1,BABY CARE,0.0,0
2,1945946,2016-01-01,1,BEAUTY,0.0,0
3,1945947,2016-01-01,1,BEVERAGES,0.0,0
4,1945948,2016-01-01,1,BOOKS,0.0,0


In [None]:
# Create random sample of 100 for aggregation examples
sample_df = retail_df.sample(100, random_state=616)
sample_df.head()

Unnamed: 0,id,date,store_nbr,family,sales,onpromotion
399033,2344977,2016-08-11,54,PRODUCE,487.239,1
579626,2525570,2016-11-21,22,HARDWARE,0.0,0
546385,2492329,2016-11-02,4,BOOKS,3.0,0
534555,2480499,2016-10-26,8,LINGERIE,7.0,0
96159,2042103,2016-02-23,7,PRODUCE,5212.624,0


##### **Pivot Tables Method** </br> The `.pivot_table()` method creates Excel-style pivot tables </br> Commonly used arguments: </br> `.pivot_table(index='', columns='', values='', aggfunc='')`

|Argument|Description|
|---|-----------|
|`index=''`|returns a row index with distinct values from the specified column|
|`columns=''`|returns a column index with distinct values from the specified column|
|`values=''`|the column(s) to perform aggregation operations on|
|`aggfunc=''`|defines the aggregation operation to perform|
|`margins=`|returns row and column totals when True (default is False)|

##### **Pivot Tables Do Not Have A Filter** </br> *Filter the DataFrame before calling the `pivot_table()` method* </br> using `.query()`, boolean conditions, or slicing
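
A sketch of the filter-then-pivot pattern, assuming a tiny invented table of games (`games`, the team letters, and the season labels are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature data, invented for illustration only
games = pd.DataFrame({
    "HomeTeam": ["A", "A", "B", "B", "C"],
    "season": ["s1", "s2", "s1", "s2", "s1"],
    "HomeGoals": [1, 3, 2, 2, 5],
})

pivot = (
    games.query("HomeTeam in ['A', 'B']")   # filter first: team C never reaches the pivot
    .pivot_table(
        index="HomeTeam",
        columns="season",
        values="HomeGoals",
        aggfunc="sum",
        margins=True,                       # adds an 'All' row and an 'All' column
    )
)

print(pivot)
```

Since the filter runs before the pivot, the `All` totals reflect only the teams that were kept.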

In [None]:
# Create pivot_table and show the first 3 rows
# margins=False by default, so row and column totals are not displayed
retail_df.pivot_table(
    index='family',
    columns='store_nbr',
    values='sales',
    aggfunc='sum').round().head(3)

store_nbr,1,2,3,4,5,6,7,8,9,10,...,45,46,47,48,49,50,51,52,53,54
family,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AUTOMOTIVE,2524.0,3918.0,6790.0,2565.0,3667.0,3442.0,3031.0,3225.0,7695.0,1772.0,...,9809.0,8670.0,9537.0,7264.0,7477.0,6702.0,4487.0,1497.0,5811.0,4199.0
BABY CARE,0.0,84.0,672.0,24.0,215.0,12.0,48.0,142.0,228.0,179.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0,198.0
BEAUTY,1776.0,3824.0,8150.0,3063.0,3604.0,4524.0,2622.0,5942.0,4462.0,933.0,...,8068.0,8901.0,9766.0,8680.0,6603.0,6353.0,3566.0,972.0,3812.0,405.0


In [None]:
# Create pivot_table showing totals
# with margins=True, an 'All' column (row totals) and an 'All' row (column totals) are added at the end of the DataFrame
# the 'All' row is the last row, so it would be cut off by .head()
retail_df.pivot_table(
    index='family', 
    columns='store_nbr', 
    values='sales', 
    aggfunc='sum',
    margins=True
).round()

store_nbr,1,2,3,4,5,6,7,8,9,10,...,46,47,48,49,50,51,52,53,54,All
family,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
AUTOMOTIVE,2524.0,3918.0,6790.0,2565.0,3667.0,3442.0,3031.0,3225.0,7695.0,1772.0,...,8670.0,9537.0,7264.0,7477.0,6702.0,4487.0,1497.0,5811.0,4199.0,226139.0
BABY CARE,0.0,84.0,672.0,24.0,215.0,12.0,48.0,142.0,228.0,179.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0,198.0,7903.0
BEAUTY,1776.0,3824.0,8150.0,3063.0,3604.0,4524.0,2622.0,5942.0,4462.0,933.0,...,8901.0,9766.0,8680.0,6603.0,6353.0,3566.0,972.0,3812.0,405.0,166189.0
BEVERAGES,1238601.0,1915519.0,5280120.0,1742495.0,1110429.0,2477150.0,2612147.0,2948874.0,2218853.0,685311.0,...,3680161.0,5162704.0,3006995.0,4470550.0,2619648.0,2884739.0,537796.0,1396928.0,1186768.0,105700279.0
BOOKS,211.0,239.0,540.0,266.0,230.0,76.0,211.0,317.0,0.0,0.0,...,199.0,581.0,57.0,454.0,291.0,259.0,0.0,77.0,0.0,6438.0
BREAD/BAKERY,223998.0,410154.0,805629.0,264225.0,209913.0,365313.0,465510.0,534130.0,339050.0,63886.0,...,577221.0,734913.0,509099.0,777951.0,383754.0,534936.0,97521.0,313372.0,131379.0,17092978.0
CELEBRATION,9072.0,8342.0,29351.0,5931.0,15235.0,8932.0,11696.0,14258.0,9131.0,2373.0,...,13178.0,17113.0,12845.0,21300.0,21917.0,9150.0,1343.0,6110.0,2892.0,444901.0
CLEANING,390600.0,579198.0,1268147.0,522716.0,538626.0,760643.0,662558.0,773669.0,1078511.0,442035.0,...,1417153.0,1454345.0,1315531.0,1187930.0,992487.0,867712.0,201411.0,683924.0,688297.0,38127743.0
DAIRY,430983.0,579438.0,1427037.0,532893.0,326557.0,664470.0,876615.0,968274.0,547710.0,186123.0,...,988991.0,1429740.0,823365.0,1493050.0,714319.0,968845.0,140584.0,368223.0,151139.0,28422893.0
DELI,76538.0,147992.0,239963.0,132771.0,130029.0,210272.0,106357.0,174613.0,364163.0,155772.0,...,483142.0,393357.0,395257.0,260027.0,284165.0,142728.0,44664.0,150784.0,131721.0,9617777.0


In [None]:
# load soccer excel
premier_league = pd.read_excel('Pandas Course Resources/retail/premier_league_games_full.xlsx')

In [None]:
# pivot table that has seasons for columns, with the sum of the HomeTeam HomeGoals
premier_league.pivot_table(
    index='HomeTeam',
    columns='season',
    values='HomeGoals',
    aggfunc='sum'
).head()


season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Arsenal,31.0,48.0,33.0,39.0,47.0,36.0,41.0,31.0
Aston Villa,27.0,29.0,26.0,20.0,23.0,22.0,18.0,14.0
Birmingham City,,19.0,19.0,,,,,
Blackburn Rovers,22.0,28.0,22.0,26.0,,,,
Blackpool,,,30.0,,,,,


In [None]:
# using query to filter data before pivot_table() method applied for Sum of HomeGoals
premier_league.query("HomeTeam in ['Arsenal', 'Chelsea', 'Everton']").pivot_table(
    index='HomeTeam',
    columns='season',
    values='HomeGoals',
    aggfunc='sum'
)
# results in pivot_table that was filtered first

season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Arsenal,31,48,33,39,47,36,41,31
Chelsea,33,68,39,41,41,43,36,32
Everton,31,35,31,28,33,38,27,35


In [None]:
# using query to filter data before pivot_table() method applied for average HomeGoals with totals from margins=True argument 
premier_league.query("HomeTeam in ['Arsenal', 'Chelsea', 'Everton']").pivot_table(
    index='HomeTeam',
    columns='season',
    values='HomeGoals',
    aggfunc='mean',
    margins=True
)
# results in pivot_table that was filtered first

season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016,All
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Arsenal,1.631579,2.526316,1.736842,2.052632,2.473684,1.894737,2.157895,1.631579,2.013158
Chelsea,1.736842,3.578947,2.052632,2.157895,2.157895,2.263158,1.894737,1.684211,2.190789
Everton,1.631579,1.842105,1.631579,1.473684,1.736842,2.0,1.421053,1.842105,1.697368
All,1.666667,2.649123,1.807018,1.894737,2.122807,2.052632,1.824561,1.719298,1.967105


##### **Multiple Aggregation Functions** </br> Multiple aggregation functions can be passed to the `aggfunc=` argument, usually as a list (the cell below passes them as a tuple)</br> Each additional aggregation adds another set of columns </br> This can create a wide DataFrame which may not be best for analysis
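
The widening effect can be seen in a sketch on invented data (`toy` is hypothetical; a list is used for `aggfunc` here): the function names become an extra column level above `store_nbr`.

```python
import pandas as pd

# Hypothetical miniature data, invented for illustration only
toy = pd.DataFrame({
    "family": ["BOOKS", "BOOKS", "DAIRY", "DAIRY"],
    "store_nbr": [1, 1, 2, 2],
    "sales": [1.0, 2.0, 10.0, 20.0],
})

wide = toy.pivot_table(
    index="family",
    columns="store_nbr",
    values="sales",
    aggfunc=["min", "max"],   # one block of store columns per function
)

print(wide)
print(wide.columns.nlevels)   # function name + store_nbr
```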

In [None]:
# create a 'max' column and a 'min' column of sales for each store_nbr
retail_df.pivot_table(
    index='family', 
    columns='store_nbr', 
    values='sales', 
    aggfunc=('min', 'max')
).head()

Unnamed: 0_level_0,max,max,max,max,max,max,max,max,max,max,...,min,min,min,min,min,min,min,min,min,min
store_nbr,1,2,3,4,5,6,7,8,9,10,...,45,46,47,48,49,50,51,52,53,54
family,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
AUTOMOTIVE,19.0,23.0,32.0,19.0,18.0,24.0,31.0,43.0,44.0,21.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BABY CARE,0.0,5.0,11.0,3.0,5.0,4.0,2.0,4.0,5.0,9.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BEAUTY,11.0,108.0,93.0,19.0,25.0,27.0,16.0,30.0,32.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BEVERAGES,5051.0,6049.0,19154.0,6056.0,3745.0,9537.0,9009.0,13511.0,9188.0,2687.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
BOOKS,8.0,9.0,11.0,9.0,6.0,6.0,10.0,13.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# pass a dictionary to aggfunc to apply specific aggregations to specific columns; 'values=' is not needed because the dictionary keys name the value columns
retail_df.pivot_table(
    index='family', 
    columns='store_nbr', 
    aggfunc=({'sales':['sum', 'mean'], 'onpromotion':'max'})
).round()

Unnamed: 0_level_0,onpromotion,onpromotion,onpromotion,onpromotion,onpromotion,onpromotion,onpromotion,onpromotion,onpromotion,onpromotion,...,sales,sales,sales,sales,sales,sales,sales,sales,sales,sales
Unnamed: 0_level_1,max,max,max,max,max,max,max,max,max,max,...,sum,sum,sum,sum,sum,sum,sum,sum,sum,sum
store_nbr,1,2,3,4,5,6,7,8,9,10,...,45,46,47,48,49,50,51,52,53,54
family,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
AUTOMOTIVE,1,1,1,1,2,1,2,1,4,1,...,9809.0,8670.0,9537.0,7264.0,7477.0,6702.0,4487.0,1497.0,5811.0,4199.0
BABY CARE,0,0,1,0,1,0,0,0,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,22.0,198.0
BEAUTY,2,2,2,2,2,2,2,2,4,4,...,8068.0,8901.0,9766.0,8680.0,6603.0,6353.0,3566.0,972.0,3812.0,405.0
BEVERAGES,84,91,97,93,93,92,89,96,101,62,...,5564655.0,3680161.0,5162704.0,3006995.0,4470550.0,2619648.0,2884739.0,537796.0,1396928.0,1186768.0
BOOKS,0,0,0,0,0,0,0,0,0,0,...,579.0,199.0,581.0,57.0,454.0,291.0,259.0,0.0,77.0,0.0
BREAD/BAKERY,24,30,29,27,21,27,29,27,82,35,...,787221.0,577221.0,734913.0,509099.0,777951.0,383754.0,534936.0,97521.0,313372.0,131379.0
CELEBRATION,7,9,10,5,10,9,9,10,6,5,...,23177.0,13178.0,17113.0,12845.0,21300.0,21917.0,9150.0,1343.0,6110.0,2892.0
CLEANING,56,61,73,59,64,68,68,66,61,44,...,1595863.0,1417153.0,1454345.0,1315531.0,1187930.0,992487.0,867712.0,201411.0,683924.0,688297.0
DAIRY,49,47,55,50,53,50,49,56,151,83,...,1477538.0,988991.0,1429740.0,823365.0,1493050.0,714319.0,968845.0,140584.0,368223.0,151139.0
DELI,60,69,71,61,66,71,60,70,71,48,...,415401.0,483142.0,393357.0,395257.0,260027.0,284165.0,142728.0,44664.0,150784.0,131721.0


##### **Pivot Table vs. Groupby** </br> If no `columns=` argument is being passed to the pivot table, use the `groupby()` method instead, because `groupby()` supports named aggregation, which flattens the column index </br> `pivot_table()` can often create wide DataFrames with multi-level columns
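
A minimal sketch of the difference, assuming a small made-up DataFrame (`toy_df` and its values are hypothetical, not taken from the retail data). With only an index and several aggregations, `pivot_table()` returns multi-level columns, while `groupby()` with named aggregation returns flat column names:

```python
import pandas as pd

# Hypothetical miniature of the retail data (values are made up for illustration)
toy_df = pd.DataFrame({
    'family': ['BOOKS', 'BOOKS', 'DAIRY', 'DAIRY'],
    'sales': [3.0, 5.0, 100.0, 140.0],
    'onpromotion': [0, 1, 4, 6],
})

# pivot_table with a dictionary of aggregations produces multi-level columns
wide = toy_df.pivot_table(
    index='family',
    aggfunc={'sales': ['sum', 'mean'], 'onpromotion': 'max'},
)

# groupby with named aggregation produces flat, self-describing column names
flat = toy_df.groupby('family').agg(
    sales_sum=('sales', 'sum'),
    sales_mean=('sales', 'mean'),
    onpromotion_max=('onpromotion', 'max'),
)

print(wide.columns.tolist())
print(flat.columns.tolist())
```

The named-aggregation result can be filtered or exported directly, since each column is addressed by a single string instead of a tuple.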


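##### **Heatmaps** </br> You can quickly style a DataFrame based on its values to create a `heatmap` </br> Chain `.style.background_gradient()` to the DataFrame and pass a `cmap` argument </br> The `axis` argument sets the scope of the color scale: `axis=0` (the default) scales within each column, `axis=1` within each row, and `axis=None` across the whole table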

In [None]:
# create heatmap with axis=None so colors are scaled across the entire pivot table
premier_league.query("HomeTeam in ['Arsenal', 'Chelsea', 'Everton']").pivot_table(
    index='HomeTeam',
    columns='season',
    values='HomeGoals',
    aggfunc='mean',
    margins=True
).style.background_gradient(cmap='RdYlGn', axis=None)

season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016,All
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Arsenal,1.631579,2.526316,1.736842,2.052632,2.473684,1.894737,2.157895,1.631579,2.013158
Chelsea,1.736842,3.578947,2.052632,2.157895,2.157895,2.263158,1.894737,1.684211,2.190789
Everton,1.631579,1.842105,1.631579,1.473684,1.736842,2.0,1.421053,1.842105,1.697368
All,1.666667,2.649123,1.807018,1.894737,2.122807,2.052632,1.824561,1.719298,1.967105


In [None]:
# create heatmap with axis=1 so colors are scaled across the columns within each row
premier_league.query("HomeTeam in ['Arsenal', 'Chelsea', 'Everton']").pivot_table(
    index='HomeTeam',
    columns='season',
    values='HomeGoals',
    aggfunc='mean',
    margins=True
).style.background_gradient(cmap='RdYlGn', axis=1)

season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016,All
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Arsenal,1.631579,2.526316,1.736842,2.052632,2.473684,1.894737,2.157895,1.631579,2.013158
Chelsea,1.736842,3.578947,2.052632,2.157895,2.157895,2.263158,1.894737,1.684211,2.190789
Everton,1.631579,1.842105,1.631579,1.473684,1.736842,2.0,1.421053,1.842105,1.697368
All,1.666667,2.649123,1.807018,1.894737,2.122807,2.052632,1.824561,1.719298,1.967105


In [None]:
# create heatmap with axis=0 so colors are scaled down the rows within each column
premier_league.query("HomeTeam in ['Arsenal', 'Chelsea', 'Everton']").pivot_table(
    index='HomeTeam',
    columns='season',
    values='HomeGoals',
    aggfunc='mean',
    margins=True
).style.background_gradient(cmap='RdYlGn', axis=0)

season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016,All
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
Arsenal,1.631579,2.526316,1.736842,2.052632,2.473684,1.894737,2.157895,1.631579,2.013158
Chelsea,1.736842,3.578947,2.052632,2.157895,2.157895,2.263158,1.894737,1.684211,2.190789
Everton,1.631579,1.842105,1.631579,1.473684,1.736842,2.0,1.421053,1.842105,1.697368
All,1.666667,2.649123,1.807018,1.894737,2.122807,2.052632,1.824561,1.719298,1.967105


### **Melting DataFrames** </br> 

In [None]:
path_premier = 'Pandas Course Resources/retail/premier_league_games_full.xlsx'
premier = pd.read_excel(path_premier)

premier.head()

Unnamed: 0,id,league_name,season,HomeTeam,AwayTeam,HomeGoals,AwayGoals
0,1729,England Premier League,2008/2009,Manchester United,Newcastle United,1,1
1,1730,England Premier League,2008/2009,Arsenal,West Bromwich Albion,1,0
2,1731,England Premier League,2008/2009,Sunderland,Liverpool,0,1
3,1732,England Premier League,2008/2009,West Ham United,Wigan Athletic,2,1
4,1733,England Premier League,2008/2009,Aston Villa,Manchester City,4,2


##### **Melt Method** </br> This method will `unpivot` a DataFrame, or `convert columns into rows` </br> It turns a `'wide'` DataFrame into `'long'` format </br> When using the `.melt()` method, the values from each original column are stacked into a single value column, next to a column holding the corresponding original column name </br> Use the `id_vars=` argument to name the column(s) that should stay as identifiers

##### **---- Available .melt() Arguments ----**
| Argument     | Description |
|--------------|-------------|
| `id_vars=`   | Specifies the column(s) to retain as identifier variables. These columns remain unchanged in the reshaped DataFrame. |
| `value_vars=`| Identifies the column(s) that are to be melted into rows. If not specified, all columns not set as `id_vars` are melted. |
| `var_name=`  | Allows renaming the new column created from the original column names post-melting. Defaults to "variable" if unspecified. |
| `value_name=`| Specifies the name for the new column created from the values of the melted columns. Defaults to "value" if not specified. |
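
A short sketch of the four arguments working together, assuming a small made-up wide table (`wide_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical wide table: one row per team, one column per season (made-up values)
wide_df = pd.DataFrame({
    'HomeTeam': ['Arsenal', 'Chelsea'],
    '2008/2009': [1.6, 1.7],
    '2009/2010': [2.5, 3.6],
    '2010/2011': [1.7, 2.1],
})

long_df = wide_df.melt(
    id_vars='HomeTeam',                      # kept as an identifier column
    value_vars=['2008/2009', '2009/2010'],   # only these columns are melted
    var_name='season',                       # name for the column of old headers
    value_name='avg_goals',                  # name for the column of values
)
print(long_df)
```

Because `2010/2011` is left out of `value_vars`, it does not appear in the result; the two melted seasons for two teams give four rows.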


In [None]:
# create filtered pivot_table to be melted back to 'long' format
pm = premier.query("HomeTeam in ['Arsenal', 'Chelsea', 'Everton']").pivot_table(
    index='HomeTeam',
    columns='season',
    values='HomeGoals',
    aggfunc='mean',
)
pm

season,2008/2009,2009/2010,2010/2011,2011/2012,2012/2013,2013/2014,2014/2015,2015/2016
HomeTeam,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Arsenal,1.631579,2.526316,1.736842,2.052632,2.473684,1.894737,2.157895,1.631579
Chelsea,1.736842,3.578947,2.052632,2.157895,2.157895,2.263158,1.894737,1.684211
Everton,1.631579,1.842105,1.631579,1.473684,1.736842,2.0,1.421053,1.842105


In [None]:
# melting pivot table pm without arguments drops the HomeTeam index, so the result isn't very easy to use
pm.melt()

Unnamed: 0,season,value
0,2008/2009,1.631579
1,2008/2009,1.736842
2,2008/2009,1.631579
3,2009/2010,2.526316
4,2009/2010,3.578947
5,2009/2010,1.842105
6,2010/2011,1.736842
7,2010/2011,2.052632
8,2010/2011,1.631579
9,2011/2012,2.052632


In [None]:
# melt pivot table pm with arguments to better define the resulting DataFrame
# reset the index first: HomeTeam is the index of pm, so it must become a regular column to be a valid id_vars
pm.reset_index().melt(
    id_vars='HomeTeam',
    # pass a list to value_vars to keep only those columns in the result, or omit it to melt all columns
    # value_vars=['2008/2009', '2009/2010', '2010/2011'],
    value_name='avg_goals'
)

Unnamed: 0,HomeTeam,season,avg_goals
0,Arsenal,2008/2009,1.631579
1,Chelsea,2008/2009,1.736842
2,Everton,2008/2009,1.631579
3,Arsenal,2009/2010,2.526316
4,Chelsea,2009/2010,3.578947
5,Everton,2009/2010,1.842105
6,Arsenal,2010/2011,1.736842
7,Chelsea,2010/2011,2.052632
8,Everton,2010/2011,1.631579
9,Arsenal,2011/2012,2.052632
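
As a rough check that melting only reshapes the data, the long table can be pivoted back to wide format. This sketch assumes a small made-up table (`wide_df` is hypothetical) shaped like `pm`, with `HomeTeam` as the index and `season` as the name of the columns:

```python
import pandas as pd

# Hypothetical stand-in for pm: teams in the index, seasons in the columns (made-up values)
wide_df = pd.DataFrame(
    {'2008/2009': [1.6, 1.7], '2009/2010': [2.5, 3.6]},
    index=pd.Index(['Arsenal', 'Chelsea'], name='HomeTeam'),
)
wide_df.columns.name = 'season'

# var_name is omitted, so melt uses the columns' name ('season') for the new column
long_df = wide_df.reset_index().melt(id_vars='HomeTeam', value_name='avg_goals')

# pivot the long table back to one row per team and one column per season
restored = long_df.pivot(index='HomeTeam', columns='season', values='avg_goals')

# prints whether the round trip reproduced the original table
print(restored.equals(wide_df))
```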
