# Reshaping, pivoting and GroupBy

*Reshaping* a data set means manipulating it into a different shape.

Reshaping offers a new view or perspective on the data.

Hadley Wickham's paper on tidy data repeats the common estimate that about 80% of data analysis consists of cleaning up data and getting it into the proper shape:

https://vita.had.co.nz/papers/tidy-data.pdf

## Wide vs. narrow (long) data

A *narrow* data set is also called a *long* or a tall data set.

These names reflect the direction in which the data set expands as we add more values to it. 

A *wide* data set increases in width; it grows out.

A narrow/long/tall data set increases in height; it grows down.

Take a peek at the following table, which measures temperatures in two cities over two days:

In [6]:
import pandas as pd 

cities_temp_wide = pd.DataFrame({"Weekday": ["Monday","Tuesday"],"Miami":[100,105],"New York":[65,70]})
cities_temp_wide

Unnamed: 0,Weekday,Miami,New York
0,Monday,100,65
1,Tuesday,105,70


This data set stores the same variable—temperature—across two columns instead of one.

We can categorize this table as being a wide data set. 

A wide data set expands horizontally.

Suppose that we introduced temperature measurements for two more cities.

The data grows wider, not taller:

In [7]:
cities_temp_wide["Chicago"] = [50, 58]
cities_temp_wide["San Francisco"] = [60, 62]
cities_temp_wide

Unnamed: 0,Weekday,Miami,New York,Chicago,San Francisco
0,Monday,100,65,50,60
1,Tuesday,105,70,58,62


A narrow data set grows vertically. A narrow format makes it easier to manipulate existing data and to add new records. Each variable is isolated to a single column.

In [10]:
cities_temp_long = pd.DataFrame([("Monday","Miami",100),
                                 ("Monday","New York",65),
                                 ("Tuesday","Miami",105),
                                 ("Tuesday","New York",70)], columns=["Weekday","City","Temperature"])
cities_temp_long

Unnamed: 0,Weekday,City,Temperature
0,Monday,Miami,100
1,Monday,New York,65
2,Tuesday,Miami,105
3,Tuesday,New York,70


To include temperatures for two more cities, we would add rows instead of columns.

The data grows taller, not wider:

In [14]:
cities_temp_long = pd.DataFrame([("Monday","Miami",100),
                                 ("Monday","New York",65),
                                 ("Monday","Chicago",50),
                                 ("Monday","San Francisco",60),
                                 ("Tuesday","Miami",105),
                                 ("Tuesday","New York",70),
                                 ("Tuesday","Chicago",58),
                                 ("Tuesday","San Francisco",62)], columns=["Weekday","City","Temperature"])
# Isolating the temperature values in a single column makes calculations such as the average temperature easier.
cities_temp_long

Unnamed: 0,Weekday,City,Temperature
0,Monday,Miami,100
1,Monday,New York,65
2,Monday,Chicago,50
3,Monday,San Francisco,60
4,Tuesday,Miami,105
5,Tuesday,New York,70
6,Tuesday,Chicago,58
7,Tuesday,San Francisco,62


The optimal storage format for a data set depends on the insight we’re trying to glean from it. 

Pandas offers tools to transform <code>DataFrame</code>s from narrow formats to wide formats and vice versa.
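
As a preview of those tools, here is a minimal sketch that converts the two-city temperature table from wide to narrow and back. It uses the `melt` and `pivot` methods; the variable names (`wide`, `narrow`, `wide_again`) are chosen for this illustration only.

```python
import pandas as pd

# Wide table: one column per city.
wide = pd.DataFrame({
    "Weekday": ["Monday", "Tuesday"],
    "Miami": [100, 105],
    "New York": [65, 70],
})

# Wide -> narrow: every city column becomes rows of (City, Temperature).
narrow = wide.melt(id_vars="Weekday", var_name="City", value_name="Temperature")

# Narrow -> wide: spread the City values back out into columns.
wide_again = narrow.pivot(index="Weekday", columns="City", values="Temperature")

# With a single temperature column, the overall average is one expression.
average_temperature = narrow["Temperature"].mean()
print(narrow)
print(wide_again)
print(average_temperature)
```

The narrow table has one row per weekday and city, which is why the average needs no knowledge of how many cities exist.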

## Creating a pivot table from a DataFrame

In [20]:
sales = pd.read_csv("../data/sales_by_employee.csv", parse_dates = ["Date"])
sales.head(5)

Unnamed: 0,Date,Name,Customer,Revenue,Expenses
0,2020-01-01,Oscar,Logistics XYZ,5250,531
1,2020-01-01,Oscar,Money Corp.,4406,661
2,2020-01-02,Oscar,PaperMaven,8661,1401
3,2020-01-03,Oscar,PaperGenius,7075,906
4,2020-01-04,Oscar,Paper Pound,2524,1767


### The <code>pivot_table</code> method

A *pivot table* aggregates a column’s values and groups the results by using other columns’ values.

The word *aggregate* describes a summary computation that involves multiple values.

Example aggregations include average, sum, median, and count.

Multiple salesmen closed deals on the same date. 

In addition, a single salesman sometimes closed multiple deals on the same date. 

What if we want to sum the revenue by date and see how much each salesman contributed to the daily totals?

We follow four steps to create a pivot table:

* Select the column(s) whose values we want to aggregate.
* Choose the aggregation operation to apply to the column(s).
* Select the column(s) whose values will group the aggregated data into categories.
* Determine whether to place the groups on the row axis, the column axis, or both axes.
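
The four steps map onto `pivot_table` arguments roughly as in the sketch below. The `mini_sales` table is a hypothetical miniature built for this illustration, not the full CSV file.

```python
import pandas as pd

# Hypothetical miniature of the sales data: two dates, two salesmen.
mini_sales = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-02"]),
    "Name": ["Oscar", "Oscar", "Creed", "Creed"],
    "Revenue": [5250, 4406, 4430, 13214],
})

daily_revenue = mini_sales.pivot_table(
    values="Revenue",   # step 1: the column to aggregate
    aggfunc="sum",      # step 2: the aggregation operation
    index="Date",       # steps 3 and 4: group by Date on the row axis
    columns="Name",     # steps 3 and 4: group by Name on the column axis
)
print(daily_revenue)
```

Oscar's two deals on the first date collapse into a single summed cell, and the date on which he closed nothing shows up as `NaN`.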

First, we’ll need to invoke the <code>pivot_table</code> method on our existing <code>sales</code> <code>DataFrame</code>.

The method’s <code>index</code> parameter accepts the column whose values will make up the pivot table’s index labels.

In [22]:
# Date column’s values for the index labels of the pivot table
sales.pivot_table(index = "Date")

Unnamed: 0_level_0,Expenses,Revenue
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2020-01-01,637.5,4293.5
2020-01-02,1244.4,7303.0
2020-01-03,1313.666667,4865.833333
2020-01-04,1450.6,3948.0
2020-01-05,1196.25,4834.75


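Pandas applies its default aggregation operation, the mean, to the numeric columns in <code>sales</code> (Expenses and Revenue). Recent pandas versions may raise an error on the text columns instead of skipping them; if that happens, list the numeric columns in the <code>values</code> parameter.

Let’s pass "<code>sum</code>" to the <code>aggfunc</code> parameter to add the values in Expenses and Revenue: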

In [24]:
sales.pivot_table(index = "Date", aggfunc = "sum")

Unnamed: 0_level_0,Expenses,Revenue
Date,Unnamed: 1_level_1,Unnamed: 2_level_1
2020-01-01,3825,25761
2020-01-02,6222,36515
2020-01-03,7882,29195
2020-01-04,7253,19740
2020-01-05,4785,19339


To aggregate only one column’s values, we can pass the <code>values</code> parameter a string with the column name:

In [25]:
sales.pivot_table(index = "Date", values = "Revenue", aggfunc = "sum")

Unnamed: 0_level_0,Revenue
Date,Unnamed: 1_level_1
2020-01-01,25761
2020-01-02,36515
2020-01-03,29195
2020-01-04,19740
2020-01-05,19339


To aggregate values across multiple columns, we can pass <code>values</code> a list of columns.

We have a sum of revenue grouped by date.

Our final step is communicating how much each salesman contributed to the daily total.

A convenient way to present this is to place each salesman’s name in a separate column.

Let’s add the <code>columns</code> parameter to the method invocation and pass it an argument of "<code>Name</code>":

In [30]:
sales.pivot_table(
                index = "Date",
                columns = "Name",
                values = "Revenue",
                aggfunc = "sum"
)

Name,Creed,Dwight,Jim,Michael,Oscar
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020-01-01,4430.0,2639.0,1864.0,7172.0,9656.0
2020-01-02,13214.0,,8278.0,6362.0,8661.0
2020-01-03,,11912.0,4226.0,5982.0,7075.0
2020-01-04,3144.0,,6155.0,7917.0,2524.0
2020-01-05,938.0,7771.0,,7837.0,2793.0


We can use the <code>fill_value</code> parameter to replace all of the pivot table’s <code>NaN</code>s with a fixed value. Let’s fill in the data gaps with zeros:

In [31]:
sales.pivot_table(
                index = "Date",
                columns = "Name",
                values = "Revenue",
                aggfunc = "sum",
                fill_value = 0
)

Name,Creed,Dwight,Jim,Michael,Oscar
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020-01-01,4430,2639,1864,7172,9656
2020-01-02,13214,0,8278,6362,8661
2020-01-03,0,11912,4226,5982,7075
2020-01-04,3144,0,6155,7917,2524
2020-01-05,938,7771,0,7837,2793


We may also want to see the revenue totals for each date and for each salesman. 

We can pass an argument of <code>True</code> to the <code>margins</code> parameter to add totals for each row and column:

In [32]:
sales.pivot_table(
            index = "Date",
            columns = "Name",
            values = "Revenue",
            aggfunc = "sum",
            fill_value = 0,
            margins = True
)

Name,Creed,Dwight,Jim,Michael,Oscar,All
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2020-01-01 00:00:00,4430,2639,1864,7172,9656,25761
2020-01-02 00:00:00,13214,0,8278,6362,8661,36515
2020-01-03 00:00:00,0,11912,4226,5982,7075,29195
2020-01-04 00:00:00,3144,0,6155,7917,2524,19740
2020-01-05 00:00:00,938,7771,0,7837,2793,19339
All,21726,22322,20523,35270,30709,130550


### Additional options for pivot tables

Suppose that we’re interested in the number of business deals each employee closed per day. 

We can pass <code>aggfunc</code> an argument of "<code>count</code>" to count the number of <code>sales</code> rows for each combination of date and employee:

In [33]:
sales.pivot_table(
            index = "Date",
            columns = "Name",
            values = "Revenue",
            aggfunc = "count",
            fill_value = 0
)

Name,Creed,Dwight,Jim,Michael,Oscar
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2020-01-01,1,1,1,1,2
2020-01-02,2,0,1,1,1
2020-01-03,0,3,1,1,1
2020-01-04,1,0,2,1,1
2020-01-05,1,1,0,1,1


Some additional options for the <code>aggfunc</code> parameter are listed in the following table:

![agg_function_pivot_table.png](attachment:agg_function_pivot_table.png)
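
Besides a single operation name, `aggfunc` also accepts a list of operations or a dictionary that maps columns to operations. The sketch below shows both on a hypothetical three-row table; Creed's expense figure is invented for the illustration.

```python
import pandas as pd

mini_sales = pd.DataFrame({
    "Name": ["Oscar", "Oscar", "Creed"],
    "Revenue": [5250, 4406, 4430],
    "Expenses": [531, 661, 300],  # Creed's expense value is made up
})

# A list of operations yields one block of columns per operation.
extremes = mini_sales.pivot_table(
    index="Name", values="Revenue", aggfunc=["min", "max"]
)

# A dictionary applies a different operation to each column.
mixed = mini_sales.pivot_table(
    index="Name",
    values=["Revenue", "Expenses"],
    aggfunc={"Revenue": "sum", "Expenses": "mean"},
)
print(extremes)
print(mixed)
```

With a list, the resulting columns form a two-level `MultiIndex` whose outer level is the operation name.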

We can also stack multiple groupings on a single axis by passing the <code>index</code> parameter a list of columns. 

The next example aggregates the sum of revenue by salesman and date on the row axis.

Pandas returns a <code>DataFrame</code> with a two-level <code>MultiIndex</code>:

In [36]:
sales.pivot_table(
                    index = ["Name", "Date"], values = "Revenue", aggfunc = "sum"
                 ).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Revenue
Name,Date,Unnamed: 2_level_1
Creed,2020-01-01,4430
Creed,2020-01-02,13214
Creed,2020-01-04,3144
Creed,2020-01-05,938
Dwight,2020-01-01,2639
Dwight,2020-01-03,11912
Dwight,2020-01-05,7771
Jim,2020-01-01,1864
Jim,2020-01-02,8278
Jim,2020-01-03,4226


Switch the order of strings in the <code>index</code> list to rearrange the levels in the pivot table’s <code>MultiIndex</code>. 

The next example swaps the positions of Name and Date:

In [42]:
sales.pivot_table(
                    index = ["Date", "Name"], values = "Revenue", aggfunc = "sum"
                 ).head(10)

Unnamed: 0_level_0,Unnamed: 1_level_0,Revenue
Date,Name,Unnamed: 2_level_1
2020-01-01,Creed,4430
2020-01-01,Dwight,2639
2020-01-01,Jim,1864
2020-01-01,Michael,7172
2020-01-01,Oscar,9656
2020-01-02,Creed,13214
2020-01-02,Jim,8278
2020-01-02,Michael,6362
2020-01-02,Oscar,8661
2020-01-03,Dwight,11912


## Stacking and unstacking index levels

Let’s pivot sales to organize revenue by employee name and date. 

We’ll place dates on the column axis and names on the row axis:

In [41]:
by_name_and_date = sales.pivot_table(
                    index = "Name",
                    columns = "Date",
                    values = "Revenue",
                    aggfunc = "sum"
                    )
by_name_and_date

Date,2020-01-01,2020-01-02,2020-01-03,2020-01-04,2020-01-05
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Creed,4430.0,13214.0,,3144.0,938.0
Dwight,2639.0,,11912.0,,7771.0
Jim,1864.0,8278.0,4226.0,6155.0,
Michael,7172.0,6362.0,5982.0,7917.0,7837.0
Oscar,9656.0,8661.0,7075.0,2524.0,2793.0


The <code>stack</code> method moves an index level from the column axis to the row axis.

The next example moves the Date index level from the column axis to the row axis.

Pandas creates a <code>MultiIndex</code> to store the two row levels: Name and Date.

In [43]:
by_name_and_date.stack().head(7)

Name    Date      
Creed   2020-01-01     4430.0
        2020-01-02    13214.0
        2020-01-04     3144.0
        2020-01-05      938.0
Dwight  2020-01-01     2639.0
        2020-01-03    11912.0
        2020-01-05     7771.0
dtype: float64

The complementary <code>unstack</code> method moves an index level from the row axis to the column axis. 

Consider the following pivot table, which groups revenue by customer and salesman.

In [49]:
sales_by_customer = sales.pivot_table(
                    index = ["Customer", "Name"],
                    values = "Revenue",
                    aggfunc = "sum"
                    )
sales_by_customer.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Revenue
Customer,Name,Unnamed: 2_level_1
Average Paper Co.,Creed,13214
Average Paper Co.,Jim,2287
Best Paper Co.,Dwight,2703
Best Paper Co.,Michael,15754
Logistics XYZ,Dwight,9209


The <code>unstack</code> method moves the innermost level of the row index to the column index:

In [50]:
sales_by_customer.unstack()

Unnamed: 0_level_0,Revenue,Revenue,Revenue,Revenue,Revenue
Name,Creed,Dwight,Jim,Michael,Oscar
Customer,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Average Paper Co.,13214.0,,2287.0,,
Best Paper Co.,,2703.0,,15754.0,
Logistics XYZ,,9209.0,,7172.0,5250.0
Money Corp.,5368.0,,8278.0,,4406.0
Paper Pound,,7771.0,4226.0,,5317.0
PaperGenius,,2639.0,1864.0,12344.0,7075.0
PaperMaven,3144.0,,3868.0,,8661.0


In the new <code>DataFrame</code>, the column axis now has a two-level <code>MultiIndex</code>, and the row axis has a regular one-level index.
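
The two methods undo each other, which the following sketch checks on a small table. The table is hypothetical and deliberately has no missing values; `unstack` also accepts a `level` argument to choose which row level moves to the columns.

```python
import pandas as pd

# Hypothetical two-by-two revenue table: names on rows, dates on columns.
table = pd.DataFrame(
    {"2020-01-01": [4430, 2639], "2020-01-02": [13214, 11912]},
    index=pd.Index(["Creed", "Dwight"], name="Name"),
)
table.columns.name = "Date"

# Columns -> rows: a Series with a (Name, Date) MultiIndex.
stacked = table.stack()

# Rows -> columns: by default the innermost level (Date) moves back.
restored = stacked.unstack()

# Moving the Name level instead puts dates on the row axis.
by_date = stacked.unstack(level="Name")
print(stacked)
print(by_date)
```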

## Melting a data set

A pivot table aggregates the values in a data set. 

In this section, we’ll learn how to do the opposite: break an aggregated collection of data into an unaggregated one.

The next data set, <code>video_game_sales.csv</code>, is a listing of regional sales for more than 16,000 video games. 

Each row includes the game’s name as well as the number of units sold (in millions) in the North America (NA), Europe (EU), Japan (JP), and other (Other) regions:

In [54]:
video_game_sales = pd.read_csv("../data/video_game_sales.csv")
video_game_sales.head()

Unnamed: 0,Name,NA,EU,JP,Other
0,Wii Sports,41.49,29.02,3.77,8.46
1,Super Mario Bros.,29.08,3.58,6.81,0.77
2,Mario Kart Wii,15.85,12.88,3.79,3.31
3,Wii Sports Resort,15.75,11.01,3.28,2.96
4,Pokemon Red/Pokemon Blue,11.27,8.89,10.22,1.0


Suppose that we moved the values "NA", "EU", "JP", and "Other" to a new Region column. 

Compare the preceding presentation with the following one:

In [58]:
pd.DataFrame([("Wii Sports","NA",41.49), 
              ("Wii Sports","EU",29.02), 
              ("Wii Sports","JP",3.77), 
              ("Wii Sports","Other",8.46)], columns = ["Name","Region","Sales"])

Unnamed: 0,Name,Region,Sales
0,Wii Sports,NA,41.49
1,Wii Sports,EU,29.02
2,Wii Sports,JP,3.77
3,Wii Sports,Other,8.46


In a way, we are unpivoting the <code>DataFrame</code>.

We are converting an aggregate, summary view of the data to one in which each column stores a single variable.

Pandas melts a <code>DataFrame</code> with the <code>melt</code> method. (*Melting* is the process of converting a wide data set to a narrow one.) 

The method accepts two primary parameters:

* The <code>id_vars</code> parameter sets the identifier column, the column for which the wide data set aggregates data. Name is the identifier column in <code>video_game_sales</code>; the data set aggregates sales per video game.
* The <code>value_vars</code> parameter accepts the column(s) whose values pandas will melt and store in a new column.

Let’s start simple, melting only the NA column’s values.

The library stores the former column name (NA) in a new variable column:

In [59]:
video_game_sales.melt(id_vars = "Name", value_vars = "NA").head()

Unnamed: 0,Name,variable,value
0,Wii Sports,NA,41.49
1,Super Mario Bros.,NA,29.08
2,Mario Kart Wii,NA,15.85
3,Wii Sports Resort,NA,15.75
4,Pokemon Red/Pokemon Blue,NA,11.27


Next, let’s melt all four of the regional sales columns. 

The next code sample passes the <code>value_vars</code> parameter a list of the four regional sales columns from video_game_sales:

In [60]:
regional_sales_columns = ["NA", "EU", "JP", "Other"]

video_game_sales.melt(id_vars = "Name", value_vars = regional_sales_columns)

Unnamed: 0,Name,variable,value
0,Wii Sports,NA,41.49
1,Super Mario Bros.,NA,29.08
2,Mario Kart Wii,NA,15.85
3,Wii Sports Resort,NA,15.75
4,Pokemon Red/Pokemon Blue,NA,11.27
...,...,...,...
66259,Woody Woodpecker in Crazy Castle 5,Other,0.00
66260,Men in Black II: Alien Escape,Other,0.00
66261,SCORE International Baja 1000: The Official Game,Other,0.00
66262,Know How 2,Other,0.00


We can customize the melted <code>DataFrame</code>’s column names by passing arguments to the <code>var_name</code> and <code>value_name</code> parameters. 

The next example uses Region for the variable column and Sales for the value column:

In [61]:
video_game_sales_by_region = video_game_sales.melt(
                                id_vars = "Name",
                                value_vars = regional_sales_columns,
                                var_name = "Region",
                                value_name = "Sales"
                                )
video_game_sales_by_region.head()

Unnamed: 0,Name,Region,Sales
0,Wii Sports,NA,41.49
1,Super Mario Bros.,NA,29.08
2,Mario Kart Wii,NA,15.85
3,Wii Sports Resort,NA,15.75
4,Pokemon Red/Pokemon Blue,NA,11.27
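
One payoff of the narrow format is that a single value column is easy to aggregate. The sketch below melts a hypothetical two-game, two-region excerpt and then totals sales per region with `pivot_table`; the names `games`, `long_games`, and `sales_per_region` are illustrative.

```python
import pandas as pd

# Hypothetical excerpt with only two games and two regions.
games = pd.DataFrame({
    "Name": ["Wii Sports", "Super Mario Bros."],
    "NA": [41.49, 29.08],
    "EU": [29.02, 3.58],
})

long_games = games.melt(
    id_vars="Name",
    value_vars=["NA", "EU"],
    var_name="Region",
    value_name="Sales",
)

# With all sales in one column, a per-region total is a single pivot.
sales_per_region = long_games.pivot_table(
    index="Region", values="Sales", aggfunc="sum"
)
print(sales_per_region)
```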


# The GroupBy object

The pandas library’s <code>GroupBy</code> object is a storage container for grouping <code>DataFrame</code> rows into buckets. 

It provides a set of methods to aggregate and analyze each independent group in the collection. 

Let’s begin by creating a <code>DataFrame</code> that stores the prices of fruits and vegetables in a supermarket:

In [62]:
food_data = {
            "Item": ["Banana", "Cucumber", "Orange", "Tomato", "Watermelon"],
            "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit"],
            "Price": [0.99, 1.25, 0.25, 0.33, 3.00]
            }
supermarket = pd.DataFrame(data = food_data)
supermarket

Unnamed: 0,Item,Type,Price
0,Banana,Fruit,0.99
1,Cucumber,Vegetable,1.25
2,Orange,Fruit,0.25
3,Tomato,Vegetable,0.33
4,Watermelon,Fruit,3.0


The Type column identifies the group to which an Item belongs. 

There are two groups of items in the supermarket data set: fruits and vegetables. 

We can use terms such as *groups*, *buckets*, and *clusters* interchangeably to describe the same idea. 

Multiple rows fall into the same category.

The <code>GroupBy</code> object organizes <code>DataFrame</code> rows into buckets based on shared values in a column.

Let’s begin by invoking the <code>groupby</code> method on the supermarket <code>DataFrame</code>. 

We need to pass it the column whose values pandas will use to create the groups.

In [63]:
groups = supermarket.groupby("Type")
groups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f87ee043c10>

The <code>get_group</code> method accepts a group name and returns a <code>DataFrame</code> with the
corresponding rows. Let’s pull out the "<code>Fruit</code>" rows:

In [64]:
groups.get_group("Fruit")

Unnamed: 0,Item,Type,Price
0,Banana,Fruit,0.99
2,Orange,Fruit,0.25
4,Watermelon,Fruit,3.0


In [65]:
groups.get_group("Vegetable")

Unnamed: 0,Item,Type,Price
1,Cucumber,Vegetable,1.25
3,Tomato,Vegetable,0.33
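
A `GroupBy` object can also be iterated directly. Each step yields a group name together with the `DataFrame` of rows in that group. The sketch below rebuilds the supermarket data so it runs on its own; `group_sizes` is a name made up for the example.

```python
import pandas as pd

supermarket = pd.DataFrame({
    "Item": ["Banana", "Cucumber", "Orange", "Tomato", "Watermelon"],
    "Type": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit"],
    "Price": [0.99, 1.25, 0.25, 0.33, 3.00],
})
groups = supermarket.groupby("Type")

# Collect the number of rows in each group while looping.
group_sizes = {}
for type_name, rows in groups:
    group_sizes[type_name] = len(rows)
    print(type_name, list(rows["Item"]))
```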


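The <code>GroupBy</code> object excels at aggregate operations. Suppose we want to calculate the average price of the fruits and of the vegetables in <code>supermarket</code>. The <code>mean</code> method computes it per group; passing <code>numeric_only = True</code> restricts the calculation to numeric columns, which recent pandas versions require here because Item holds text.

In [68]:
groups.mean(numeric_only = True)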

Unnamed: 0_level_0,Price
Type,Unnamed: 1_level_1
Fruit,1.413333
Vegetable,0.79


## Creating a GroupBy object from a data set

The Fortune 1000 is a listing of the 1,000 largest companies in the United States by revenue. 

The <code>fortune1000.csv</code> file is a collection of Fortune 1000 companies from 2018. 

Each row includes a company’s name, revenue, profits, employee count, sector, and industry:

In [69]:
fortune = pd.read_csv("../data/fortune1000.csv")
fortune

Unnamed: 0,Company,Revenues,Profits,Employees,Sector,Industry
0,Walmart,500343.0,9862.0,2300000,Retailing,General Merchandisers
1,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining
2,Berkshire Hathaway,242137.0,44940.0,377000,Financials,Insurance: Property and Casualty (Stock)
3,Apple,229234.0,48351.0,123000,Technology,"Computers, Office Equipment"
4,UnitedHealth Group,201159.0,10558.0,260000,Health Care,Health Care: Insurance and Managed Care
...,...,...,...,...,...,...
995,SiteOne Landscape Supply,1862.0,54.6,3664,Wholesalers,Wholesalers: Diversified
996,Charles River Laboratories Intl,1858.0,123.4,11800,Health Care,Health Care: Pharmacy and Other Services
997,CoreLogic,1851.0,152.2,5900,Business Services,Financial Data Services
998,Ensign Group,1849.0,40.5,21301,Health Care,Health Care: Medical Facilities


The Sector column holds 21 unique sectors. 

Suppose that we want to find the average revenue across the companies within each sector.

Let’s invoke the <code>groupby</code> method on the fortune <code>DataFrame</code>. 

The method accepts the column whose values pandas will use to group the rows.

In [71]:
sectors = fortune.groupby("Sector")
sectors

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f87ee062490>

A <code>DataFrameGroupBy</code> object is a bundle of <code>DataFrame</code>s. 

Behind the scenes, pandas performs for each of the 21 values in the Sector column the same extraction that a manual filter such as <code>fortune[fortune["Sector"] == "Retailing"]</code> would perform for a single sector.
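
To see that a group is just a filtered subset, the sketch below compares a manual Boolean filter with `get_group` on a hypothetical five-row excerpt of the data (`mini` is a name chosen for this example).

```python
import pandas as pd

# Five rows excerpted from the Fortune 1000 listing.
mini = pd.DataFrame({
    "Company": ["Walmart", "UnitedHealth Group",
                "Charles River Laboratories Intl", "Ensign Group", "Exxon Mobil"],
    "Revenues": [500343.0, 201159.0, 1858.0, 1849.0, 244363.0],
    "Sector": ["Retailing", "Health Care", "Health Care", "Health Care", "Energy"],
})

# Manual extraction of one sector with a Boolean filter.
manual = mini[mini["Sector"] == "Health Care"]

# The same rows, pulled out of the GroupBy object.
via_groupby = mini.groupby("Sector").get_group("Health Care")
print(via_groupby)
```

The `groupby` call prepares this kind of subset for every sector at once, so no filter has to be written per sector.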

We can count the number of groups in sectors by passing the GroupBy object into Python’s built-in <code>len</code> function:

In [72]:
len(sectors)

21

The sectors <code>GroupBy</code> object has 21 <code>DataFrame</code>s. 

The number is equal to the number of unique values in <code>fortune</code>’s Sector column.

The <code>size</code> method on the <code>GroupBy</code> object returns a <code>Series</code> with an alphabetical list of the groups and their row counts.

In [77]:
sectors.size()

Sector
Aerospace & Defense               25
Apparel                           14
Business Services                 53
Chemicals                         33
Energy                           107
Engineering & Construction        27
Financials                       155
Food &  Drug Stores               12
Food, Beverages & Tobacco         37
Health Care                       71
Hotels, Restaurants & Leisure     26
Household Products                28
Industrials                       49
Materials                         45
Media                             25
Motor Vehicles & Parts            19
Retailing                         77
Technology                       103
Telecommunications                10
Transportation                    40
Wholesalers                       44
dtype: int64

## Attributes of a <code>GroupBy</code> object

One way to visualize our <code>GroupBy</code> object is as a dictionary that maps the 21 sectors to a collection of fortune rows belonging to each one. 

The <code>groups</code> attribute stores a dictionary with these group-to-row associations; its keys are sector names, and its values are <code>Index</code> objects storing the index labels of the matching rows in the <code>fortune</code> <code>DataFrame</code>.

In [81]:
sectors.groups['Aerospace & Defense']

Int64Index([ 26,  50,  58,  98, 117, 118, 207, 224, 275, 380, 404, 406, 414,
            540, 660, 661, 806, 829, 884, 930, 954, 955, 959, 975, 988],
           dtype='int64')
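
The labels stored in `groups` can be passed to `loc` to fetch the matching rows. The sketch below does this on a hypothetical five-row excerpt; `mini`, `health_labels`, and `health_rows` are illustrative names.

```python
import pandas as pd

mini = pd.DataFrame({
    "Company": ["Walmart", "UnitedHealth Group",
                "Charles River Laboratories Intl", "Ensign Group", "Exxon Mobil"],
    "Sector": ["Retailing", "Health Care", "Health Care", "Health Care", "Energy"],
})
sectors = mini.groupby("Sector")

# Index labels of the rows that belong to one sector.
health_labels = sectors.groups["Health Care"]

# Use the labels to pull those rows out of the original DataFrame.
health_rows = mini.loc[health_labels]
print(list(health_labels))
print(health_rows)
```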

## Aggregate operations

We can invoke methods on the <code>GroupBy</code> object to apply aggregate operations to every nested group. 

In the next example, the <code>sum</code> method calculates the sum per sector for the three numeric columns (Revenues, Profits, and Employees) in the <code>fortune</code> <code>DataFrame</code>.

In [83]:
# numeric_only = True skips text columns such as Company and Industry
sectors.sum(numeric_only = True).head()

Unnamed: 0_level_0,Revenues,Profits,Employees
Sector,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Aerospace & Defense,383835.0,26733.5,1010124
Apparel,101157.3,6350.7,355699
Business Services,316090.0,37179.2,1593999
Chemicals,251151.0,20475.0,474020
Energy,1543507.2,85369.6,981207


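The Revenues total for a sector matches what we get by summing the Revenues column of that single group directly:

In [88]: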
sectors.get_group("Aerospace & Defense")["Revenues"].sum()

383835.0

The <code>GroupBy</code> object supports many other aggregation methods. 

The next example invokes the <code>mean</code> method to calculate the average of the Revenues, Profits, and Employees columns per sector.

In [89]:
# numeric_only = True skips text columns such as Company and Industry
sectors.mean(numeric_only = True).head()

Unnamed: 0_level_0,Revenues,Profits,Employees
Sector,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Aerospace & Defense,15353.4,1069.34,40404.96
Apparel,7225.521429,453.621429,25407.071429
Business Services,5963.962264,701.49434,30075.45283
Chemicals,7610.636364,620.454545,14364.242424
Energy,14425.300935,805.373585,9170.158879


We can target a single <code>fortune</code> column by passing its name inside square brackets after the <code>GroupBy</code> object. 

Pandas returns a new object, a <code>SeriesGroupBy</code>:

In [90]:
sectors["Revenues"]

<pandas.core.groupby.generic.SeriesGroupBy object at 0x7f87edc13650>

In [93]:
sectors["Revenues"].sum().head()

Sector
Aerospace & Defense     383835.0
Apparel                 101157.3
Business Services       316090.0
Chemicals               251151.0
Energy                 1543507.2
Name: Revenues, dtype: float64

The next example calculates the average number of employees per sector:

In [92]:
sectors["Employees"].mean().head()

Sector
Aerospace & Defense    40404.960000
Apparel                25407.071429
Business Services      30075.452830
Chemicals              14364.242424
Energy                  9170.158879
Name: Employees, dtype: float64

The <code>agg</code> method applies multiple aggregate operations to different columns and accepts a dictionary as its argument. 

In each key-value pair, the key denotes a <code>DataFrame</code> column, and the value specifies the aggregate operation to apply to the column. 

The next example extracts the lowest revenue, highest profit, and average number of employees for each sector:

In [94]:
aggregations = {
                "Revenues": "min",
                "Profits": "max",
                "Employees": "mean"
                }
sectors.agg(aggregations).head()

Unnamed: 0_level_0,Revenues,Profits,Employees
Sector,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Aerospace & Defense,1877.0,8197.0,40404.96
Apparel,2350.0,4240.0,25407.071429
Business Services,1851.0,6699.0,30075.45283
Chemicals,1925.0,3000.4,14364.242424
Energy,1874.0,19710.0,9170.158879
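
As an alternative to the dictionary form, `agg` also supports named aggregation, in which each keyword names an output column and pairs a source column with an operation. The sketch below uses a hypothetical five-row excerpt; the output column names are invented for the example.

```python
import pandas as pd

mini = pd.DataFrame({
    "Company": ["Walmart", "Exxon Mobil", "UnitedHealth Group",
                "Charles River Laboratories Intl", "Ensign Group"],
    "Revenues": [500343.0, 244363.0, 201159.0, 1858.0, 1849.0],
    "Profits": [9862.0, 19710.0, 10558.0, 123.4, 40.5],
    "Employees": [2300000, 71200, 260000, 11800, 21301],
    "Sector": ["Retailing", "Energy", "Health Care", "Health Care", "Health Care"],
})

# Each keyword becomes a column: name = (source column, operation).
summary = mini.groupby("Sector").agg(
    lowest_revenue=("Revenues", "min"),
    highest_profit=("Profits", "max"),
    average_employees=("Employees", "mean"),
)
print(summary)
```

Named aggregation gives the result descriptive column labels, which helps when the same source column is aggregated in more than one way.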


### Applying a custom operation to all groups

The <code>GroupBy</code> object’s <code>apply</code> method expects a function as an argument. 

It invokes the function once for each group in the <code>GroupBy</code> object. Then it collects the return values from the function invocations and returns them in a new <code>DataFrame</code>.

Let’s define a <code>get_largest_row</code> function that accepts a single argument: a <code>DataFrame</code>.

The function will return the <code>DataFrame</code> row with the greatest value in the Revenues column.

In [96]:
def get_largest_row(df):
    return df.nlargest(1, "Revenues")

We can invoke the <code>apply</code> method and pass in the uninvoked <code>get_largest_row</code> function. 

Pandas invokes <code>get_largest_row</code> once for each sector and returns a <code>DataFrame</code> with the companies with the highest revenue in their sector:

In [97]:
sectors.apply(get_largest_row).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Company,Revenues,Profits,Employees,Sector,Industry
Sector,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Aerospace & Defense,26,Boeing,93392.0,8197.0,140800,Aerospace & Defense,Aerospace and Defense
Apparel,88,Nike,34350.0,4240.0,74400,Apparel,Apparel
Business Services,142,ManpowerGroup,21034.0,545.4,29000,Business Services,Temporary Help
Chemicals,46,DowDuPont,62683.0,1460.0,98000,Chemicals,Chemicals
Energy,1,Exxon Mobil,244363.0,19710.0,71200,Energy,Petroleum Refining
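
The function passed to `apply` may also return a single value per group, in which case the results come back as a `Series`. The sketch below computes a profit margin per sector on a hypothetical excerpt; `profit_margin` is a helper defined only for this illustration.

```python
import pandas as pd

mini = pd.DataFrame({
    "Revenues": [500343.0, 244363.0, 201159.0, 1858.0, 1849.0],
    "Profits": [9862.0, 19710.0, 10558.0, 123.4, 40.5],
    "Sector": ["Retailing", "Energy", "Health Care", "Health Care", "Health Care"],
})

def profit_margin(df):
    # Total profit divided by total revenue for one group's rows.
    return df["Profits"].sum() / df["Revenues"].sum()

# Selecting the columns first hands only those columns to the function.
margins = mini.groupby("Sector")[["Revenues", "Profits"]].apply(profit_margin)
print(margins)
```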


### Grouping by multiple columns

We can create a <code>GroupBy</code> object with values from multiple <code>DataFrame</code> columns. 

This operation is optimal when a combination of column values serves as the best identifier for a group.

The next example passes a list of two strings to the <code>groupby</code> method.

Pandas groups the rows first by the Sector column’s values and then by the Industry column’s values.

In [102]:
sector_and_industry = fortune.groupby(by = ["Sector", "Industry"])
sector_and_industry.size()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            25
Apparel              Apparel                                          14
Business Services    Advertising, marketing                            2
                     Diversified Outsourcing Services                 14
                     Education                                         2
                                                                      ..
Transportation       Trucking, Truck Leasing                          11
Wholesalers          Wholesalers: Diversified                         24
                     Wholesalers: Electronics and Office Equipment     8
                     Wholesalers: Food and Grocery                     6
                     Wholesalers: Health Care                          6
Length: 82, dtype: int64

For all aggregations, pandas returns a <code>DataFrame</code> with a <code>MultiIndex</code> that holds the calculations.

The next example calculates the sum of the three numeric columns in <code>fortune</code> (Revenues, Profits, and Employees), grouped first by sector and then by the industries within each sector:

In [104]:
# numeric_only = True skips text columns such as Company
sector_and_industry.sum(numeric_only = True).head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Revenues,Profits,Employees
Sector,Industry,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Aerospace & Defense,Aerospace and Defense,383835.0,26733.5,1010124
Apparel,Apparel,101157.3,6350.7,355699
Business Services,"Advertising, marketing",23156.0,1667.4,127500
Business Services,Diversified Outsourcing Services,74175.0,5043.7,858600
Business Services,Education,6970.0,393.5,70653


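Chaining the <code>reset_index</code> method moves the Sector and Industry levels back into regular columns:

In [105]:
sector_and_industry.sum(numeric_only = True).reset_index()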

Unnamed: 0,Sector,Industry,Revenues,Profits,Employees
0,Aerospace & Defense,Aerospace and Defense,383835.0,26733.5,1010124
1,Apparel,Apparel,101157.3,6350.7,355699
2,Business Services,"Advertising, marketing",23156.0,1667.4,127500
3,Business Services,Diversified Outsourcing Services,74175.0,5043.7,858600
4,Business Services,Education,6970.0,393.5,70653
...,...,...,...,...,...
77,Transportation,"Trucking, Truck Leasing",43676.0,3535.5,208312
78,Wholesalers,Wholesalers: Diversified,130984.0,5231.5,262390
79,Wholesalers,Wholesalers: Electronics and Office Equipment,122231.0,1259.4,183518
80,Wholesalers,Wholesalers: Food and Grocery,125908.0,1794.0,135767
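
Rows of a two-level result can be selected with `loc`, using either a full (sector, industry) pair or an outer label alone. The sketch below works on a hypothetical four-row excerpt; `mini`, `totals`, and `flat` are illustrative names.

```python
import pandas as pd

mini = pd.DataFrame({
    "Company": ["UnitedHealth Group", "Charles River Laboratories Intl",
                "Ensign Group", "SiteOne Landscape Supply"],
    "Revenues": [201159.0, 1858.0, 1849.0, 1862.0],
    "Sector": ["Health Care", "Health Care", "Health Care", "Wholesalers"],
    "Industry": ["Health Care: Insurance and Managed Care",
                 "Health Care: Pharmacy and Other Services",
                 "Health Care: Medical Facilities",
                 "Wholesalers: Diversified"],
})

totals = mini.groupby(["Sector", "Industry"])[["Revenues"]].sum()

# A full (sector, industry) pair selects one cell.
facilities = totals.loc[("Health Care", "Health Care: Medical Facilities"), "Revenues"]

# An outer label alone selects every industry within that sector.
health_care = totals.loc["Health Care"]

# reset_index turns both index levels back into ordinary columns.
flat = totals.reset_index()
print(health_care)
print(flat)
```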
