# The GroupBy Object

In [1]:
import pandas as pd

## The Fortune 1000 Dataset
- The **Fortune 1000** is a listing of the 1000 largest American companies as ranked by Fortune magazine.
- The **DataFrame** includes each company's name, sector, industry, revenue, profits, and number of employees.

In [2]:
fortune = pd.read_csv("fortune1000.csv")
fortune.head()

,Rank,Company,Sector,Industry,Revenue,Profits,Employees
0,1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
1,2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
2,3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
3,4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
4,5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


## The groupby Method
- **Grouping** organizes the rows of a dataset into categories based on the values in a column.
- The `groupby` method returns a **DataFrameGroupBy** object. It resembles a dictionary-like collection of **DataFrames**, one per group.
- The **DataFrameGroupBy** object can perform aggregate operations on *each* group within it.
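The dictionary-like nature of the object can be sketched on a tiny in-memory sample. The `sample` frame below is a hypothetical stand-in built from a few rows shown in this notebook, and the names `sample_sectors`, `group_labels`, and `row_counts` are chosen only for illustration.

```python
import pandas as pd

# Hypothetical mini-sample: a few rows copied from the Fortune 1000 outputs in this notebook.
sample = pd.DataFrame({
    "Company": ["Walmart", "Exxon Mobil", "Apple", "Chevron", "Microsoft"],
    "Sector": ["Retailing", "Energy", "Technology", "Energy", "Technology"],
    "Revenue": [482130, 246204, 233715, 131118, 93580],
})

sample_sectors = sample.groupby("Sector")

# .groups maps each group label to the index labels of the rows in that group.
group_labels = sorted(sample_sectors.groups.keys())
print(group_labels)

# Iterating yields (label, DataFrame) pairs, much like dict.items().
row_counts = {label: len(group) for label, group in sample_sectors}
print(row_counts)
```

Each `group` in the loop is an ordinary **DataFrame**, so any **DataFrame** method can be called on it.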

In [3]:
fortune = pd.read_csv("fortune1000.csv")
fortune.head()

,Rank,Company,Sector,Industry,Revenue,Profits,Employees
0,1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
1,2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
2,3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
3,4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
4,5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [4]:
# Sum of revenues for a single sector (Retailing), computed manually with a Boolean filter

fortune[fortune["Sector"] == "Retailing"]["Revenue"].sum()

1465076

In [5]:
# We can think of this object as a collection of DataFrames, one per sector
# Each DataFrame holds all the rows for its sector
sectors = fortune.groupby("Sector")
sectors

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001A850E92870>

In [6]:
# How many sectors are in the dataset
len(sectors)

21

In [7]:
# How many rows each sector has
sectors.size()

Sector
Aerospace & Defense              20
Apparel                          15
Business Services                51
Chemicals                        30
Energy                          122
Engineering & Construction       26
Financials                      139
Food and Drug Stores             15
Food, Beverages & Tobacco        43
Health Care                      75
Hotels, Resturants & Leisure     25
Household Products               28
Industrials                      46
Materials                        43
Media                            25
Motor Vehicles & Parts           24
Retailing                        80
Technology                      102
Telecommunications               15
Transportation                   36
Wholesalers                      40
dtype: int64

In [8]:
# The first method returns the first row from each group's DataFrame
sectors.first()

Sector,Rank,Company,Industry,Revenue,Profits,Employees
Aerospace & Defense,24,Boeing,Aerospace and Defense,96114,5176,161400
Apparel,91,Nike,Apparel,30601,3273,62600
Business Services,144,ManpowerGroup,Temporary Help,19330,419,27000
Chemicals,56,Dow Chemical,Chemicals,48778,7685,49495
Energy,2,Exxon Mobil,Petroleum Refining,246204,16150,75600
Engineering & Construction,155,Fluor,"Engineering, Construction",18114,413,38758
Financials,4,Berkshire Hathaway,Insurance: Property and Casualty (Stock),210821,24083,331000
Food and Drug Stores,7,CVS Health,Food and Drug Stores,153290,5237,199000
"Food, Beverages & Tobacco",41,Archer Daniels Midland,Food Production,67702,1849,32300
Health Care,5,McKesson,Wholesalers: Health Care,181241,1476,70400


In [9]:
sectors.last()

Sector,Rank,Company,Industry,Revenue,Profits,Employees
Aerospace & Defense,987,Delta Tucker Holdings,Aerospace and Defense,1923,-133,12000
Apparel,917,Guess,Apparel,2204,82,13500
Business Services,993,DeVry Education Group,Education,1910,140,11770
Chemicals,949,H.B. Fuller,Chemicals,2084,87,4425
Energy,997,Portland General Electric,Utilities: Gas and Electric,1898,172,2646
Engineering & Construction,994,MDC Holdings,Homebuilders,1909,66,1225
Financials,996,New York Community Bancorp,Commercial Banks,1902,-47,3448
Food and Drug Stores,928,Fred’s,Food and Drug Stores,2151,-7,7103
"Food, Beverages & Tobacco",954,Alliance One International,Tobacco,2066,-15,6835
Health Care,978,Providence Service,Health Care: Pharmacy and Other Services,1987,84,9072


## Retrieve a Group with the get_group Method
- The `get_group` method on the **DataFrameGroupBy** object retrieves a nested **DataFrame** belonging to a specific group/category.

In [10]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby("Sector")

fortune.head(5)

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [11]:
# The get_group method returns the DataFrame for a given sector
sectors.get_group("Energy")

Rank,Company,Sector,Industry,Revenue,Profits,Employees
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
14,Chevron,Energy,Petroleum Refining,131118,4587,61500
30,Phillips 66,Energy,Petroleum Refining,87169,4227,14000
32,Valero Energy,Energy,Petroleum Refining,81824,3990,10103
42,Marathon Petroleum,Energy,Petroleum Refining,64566,2852,45440
...,...,...,...,...,...,...
981,WPX Energy,Energy,"Mining, Crude-Oil Production",1958,-1727,1040
983,Adams Resources & Energy,Energy,Petroleum Refining,1944,-1,809
995,EP Energy,Energy,"Mining, Crude-Oil Production",1908,-3748,665
997,Portland General Electric,Energy,Utilities: Gas and Electric,1898,172,2646


In [12]:
sectors.get_group("Technology")

Rank,Company,Sector,Industry,Revenue,Profits,Employees
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
18,Amazon.com,Technology,Internet Services and Retailing,107006,596,230800
20,HP,Technology,"Computers, Office Equipment",103355,4554,287000
25,Microsoft,Technology,Computer Software,93580,12193,118000
31,IBM,Technology,Information Technology Services,82461,13190,411798
...,...,...,...,...,...,...
970,Rackspace Hosting,Technology,Internet Services and Retailing,2001,126,6189
971,VeriFone Systems,Technology,"Computers, Office Equipment",2001,79,5400
975,Super Micro Computer,Technology,"Computers, Office Equipment",1991,102,2285
984,Nuance Communications,Technology,Computer Software,1931,-115,13500


## Methods on the GroupBy Object
- Use square brackets on the **DataFrameGroupBy** object to "extract" a column from the original **DataFrame**.
- The resulting **SeriesGroupBy** object will have aggregation methods available on it.
- Pandas will perform the calculation on *every* group within the collection.
- For example, calling `sum` on the `Revenue` column adds up the revenue of every row within each group.

In [13]:
fortune = pd.read_csv("fortune1000.csv", index_col="Rank")
sectors = fortune.groupby("Sector")

fortune.head(5)

Rank,Company,Sector,Industry,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment",233715,53394,110000
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),210821,24083,331000
5,McKesson,Health Care,Wholesalers: Health Care,181241,1476,70400


In [14]:
sectors["Revenue"].sum()

Sector
Aerospace & Defense              357940
Apparel                           95968
Business Services                272195
Chemicals                        243897
Energy                          1517809
Engineering & Construction       153983
Financials                      2217159
Food and Drug Stores             483769
Food, Beverages & Tobacco        555967
Health Care                     1614707
Hotels, Resturants & Leisure     169546
Household Products               234737
Industrials                      497581
Materials                        259145
Media                            220764
Motor Vehicles & Parts           482540
Retailing                       1465076
Technology                      1377600
Telecommunications               461834
Transportation                   408508
Wholesalers                      444800
Name: Revenue, dtype: int64

In [15]:
fortune[fortune["Sector"] == "Apparel"]["Revenue"].sum()

95968

In [16]:
sectors["Employees"].sum()

Sector
Aerospace & Defense              968057
Apparel                          346397
Business Services               1361050
Chemicals                        463651
Energy                          1188927
Engineering & Construction       406708
Financials                      3359948
Food and Drug Stores            1395398
Food, Beverages & Tobacco       1211632
Health Care                     2678289
Hotels, Resturants & Leisure    2484245
Household Products               646038
Industrials                     1545229
Materials                        638123
Media                            550314
Motor Vehicles & Parts          1082560
Retailing                       6227629
Technology                      3578949
Telecommunications               832468
Transportation                  1536793
Wholesalers                      525597
Name: Employees, dtype: int64

In [17]:
sectors["Profits"].max()

Sector
Aerospace & Defense              7608
Apparel                          3273
Business Services                6328
Chemicals                        7685
Energy                          16150
Engineering & Construction        803
Financials                      24442
Food and Drug Stores             5237
Food, Beverages & Tobacco        7351
Health Care                     18108
Hotels, Resturants & Leisure     5920
Household Products               7036
Industrials                      4833
Materials                         991
Media                            8382
Motor Vehicles & Parts           9687
Retailing                       14694
Technology                      53394
Telecommunications              17879
Transportation                   7610
Wholesalers                      1472
Name: Profits, dtype: int64

In [18]:
sectors["Profits"].min()

Sector
Aerospace & Defense              -240
Apparel                            82
Business Services               -1481
Chemicals                        -816
Energy                         -23119
Engineering & Construction       -155
Financials                      -1194
Food and Drug Stores              -62
Food, Beverages & Tobacco        -253
Health Care                      -458
Hotels, Resturants & Leisure    -1394
Household Products              -1149
Industrials                     -6126
Materials                       -1642
Media                            -881
Motor Vehicles & Parts           -889
Retailing                       -1243
Technology                      -4359
Telecommunications               -271
Transportation                   -191
Wholesalers                      -502
Name: Profits, dtype: int64

In [19]:
sectors["Employees"].mean()

Sector
Aerospace & Defense             48402.850000
Apparel                         23093.133333
Business Services               26687.254902
Chemicals                       15455.033333
Energy                           9745.303279
Engineering & Construction      15642.615385
Financials                      24172.287770
Food and Drug Stores            93026.533333
Food, Beverages & Tobacco       28177.488372
Health Care                     35710.520000
Hotels, Resturants & Leisure    99369.800000
Household Products              23072.785714
Industrials                     33591.934783
Materials                       14840.069767
Media                           22012.560000
Motor Vehicles & Parts          45106.666667
Retailing                       77845.362500
Technology                      35087.735294
Telecommunications              55497.866667
Transportation                  42688.694444
Wholesalers                     13139.925000
Name: Employees, dtype: float64

In [20]:
# We can also call aggregation methods on multiple columns at once
sectors[["Revenue", "Profits"]].sum()

Sector,Revenue,Profits
Aerospace & Defense,357940,28742
Apparel,95968,8236
Business Services,272195,28227
Chemicals,243897,22628
Energy,1517809,-73447
Engineering & Construction,153983,5304
Financials,2217159,260209
Food and Drug Stores,483769,16759
"Food, Beverages & Tobacco",555967,51417
Health Care,1614707,106114


## Grouping by Multiple Columns
- Pass a list of columns to the `groupby` method to group by pairings of values across columns.
- Target a column to retrieve the **SeriesGroupBy** object, then perform an aggregation with a method.
- Pandas will return a **Series** with a **MultiIndex** whose levels correspond to the grouping columns.
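A minimal sketch of grouping by two columns, assuming a small hand-built `sample` frame (a hypothetical stand-in made of rows shown earlier in this notebook) rather than the full CSV. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical mini-sample: a few rows copied from the Fortune 1000 outputs in this notebook.
sample = pd.DataFrame({
    "Company": ["Exxon Mobil", "Chevron", "Apple", "HP", "Microsoft"],
    "Sector": ["Energy", "Energy", "Technology", "Technology", "Technology"],
    "Industry": [
        "Petroleum Refining",
        "Petroleum Refining",
        "Computers, Office Equipment",
        "Computers, Office Equipment",
        "Computer Software",
    ],
    "Revenue": [246204, 131118, 233715, 103355, 93580],
})

# Group by each (Sector, Industry) pairing, then total the Revenue column.
revenue_totals = sample.groupby(["Sector", "Industry"])["Revenue"].sum()
print(revenue_totals)

# A single pairing can be looked up with a tuple of (Sector, Industry).
energy_refining = revenue_totals.loc[("Energy", "Petroleum Refining")]
print(energy_refining)
```

The same call should work on the full dataset by swapping `sample` for `fortune`.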

## The agg Method
- The `agg` method applies different aggregation methods on different columns.
- Invoke the `agg` method directly on the **DataFrameGroupBy** object.
- Pass the method a dictionary where the keys are the columns and the values are the aggregation operations.
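As an illustration of the dictionary form, the sketch below applies a different aggregation to each of three columns. The `sample` frame is a hypothetical stand-in built from rows shown earlier in this notebook, and the choice of `"sum"`, `"max"`, and `"mean"` is arbitrary.

```python
import pandas as pd

# Hypothetical mini-sample: a few rows copied from the Fortune 1000 outputs in this notebook.
sample = pd.DataFrame({
    "Company": ["Exxon Mobil", "Chevron", "Apple", "Microsoft"],
    "Sector": ["Energy", "Energy", "Technology", "Technology"],
    "Revenue": [246204, 131118, 233715, 93580],
    "Profits": [16150, 4587, 53394, 12193],
    "Employees": [75600, 61500, 110000, 118000],
})

# Keys are column names; values are the aggregation to run on that column.
summary = sample.groupby("Sector").agg({
    "Revenue": "sum",
    "Profits": "max",
    "Employees": "mean",
})
print(summary)
```

The result is a **DataFrame** indexed by sector, with one column per dictionary key.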

## Iterating through Groups 
- The **DataFrameGroupBy** object supports the `apply` method (just like a **Series** and a **DataFrame** do).
- The `apply` method invokes a function on every nested **DataFrame** in the **DataFrameGroupBy** object.
- It collects the return value of each function call and combines them into a new **DataFrame**, which `apply` returns.
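One possible use of `apply` is to pull the highest-revenue company from each sector. This is a sketch on a hypothetical `sample` frame built from rows shown earlier in this notebook, and `top_by_revenue` is an illustrative helper, not part of pandas. The columns are selected before calling `apply` because how the grouping column is passed to the function has changed between pandas versions.

```python
import pandas as pd

# Hypothetical mini-sample: a few rows copied from the Fortune 1000 outputs in this notebook.
sample = pd.DataFrame({
    "Company": ["Exxon Mobil", "Apple", "Chevron", "Microsoft"],
    "Sector": ["Energy", "Technology", "Energy", "Technology"],
    "Revenue": [246204, 233715, 131118, 93580],
})


def top_by_revenue(group):
    """Return the single row with the highest Revenue in one group's DataFrame."""
    return group.nlargest(1, "Revenue")


# apply calls top_by_revenue once per sector and stacks the returned rows.
top_companies = sample.groupby("Sector")[["Company", "Revenue"]].apply(top_by_revenue)
print(top_companies)
```

Changing the `1` passed to `nlargest` would return that many rows per sector instead.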