# 1) Intro to GroupBy
+ A GroupBy object is like a container of bundles, where the rows in each bundle share a common value

In [3]:
import pandas as pd

In [8]:
fortune = pd.read_csv('Data/fortune1000.csv', index_col=['Rank'])
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


## Groupby TIPS
- When choosing a column to group by, we want one with a small number of unique values (such as Sector), so that multiple companies fall under each value. Industry is a good candidate too.
- There is no point in grouping by Company in this example because nearly every Company value is unique, so almost every group would contain a single row.

In [9]:
# check the number of unique values
fortune.nunique()

Company      996
Sector        21
Industry      73
Location     416
Revenue      945
Profits      760
Employees    755
dtype: int64

## How does it work behind the scenes?
+ Conceptually, pandas takes one Sector value (for example, Retailing), scans the dataset, and collects every row that falls under that Sector: it picks up Walmart, then checks the remaining companies among the 1,000 and keeps any row with the same Sector. The same procedure repeats for each Sector.
+ Once all the rows are collected, pandas bundles them into a larger object, which we see as a DataFrameGroupBy object.
+ So we can think of it as a container where each bundle has a common theme (in our case, a common Sector).

**Groupby Object itself doesn't do anything, until we call methods upon it.**
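The bundling idea above can be sketched by hand. This is only an illustration on a made-up four-row table (the `toy` frame and its values are assumptions, not the Fortune 1000 data), and it does not describe how pandas implements grouping internally:

```python
import pandas as pd

# Tiny made-up table standing in for the Fortune 1000 data (illustrative values only).
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Retailing", "Energy", "Retailing", "Energy"],
        "Revenue": [10, 20, 30, 40],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)

# Manual bundling: collect the index labels of the rows that share a Sector value.
manual_bundles = {}
for rank, sector in toy["Sector"].items():
    manual_bundles.setdefault(sector, []).append(rank)

# groupby() exposes the same mapping; nothing is aggregated until a method is called.
grouped = toy.groupby("Sector")
pandas_bundles = {sector: list(labels) for sector, labels in grouped.groups.items()}

print(manual_bundles)
print(pandas_bundles)
```

Both dictionaries map each Sector to the Rank labels of its rows.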

In [12]:
sectors = fortune.groupby('Sector')
sectors

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000004925F3E1F0>

#### Take note that DataFrame and DataFrameGroupBy objects are completely different.

In [11]:
type(fortune), type(sectors)

(pandas.core.frame.DataFrame, pandas.core.groupby.generic.DataFrameGroupBy)

------

# 2) The `.groupby()` Method

In [13]:
fortune = pd.read_csv('Data/fortune1000.csv', index_col=['Rank'])
sectors = fortune.groupby('Sector')
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


## Calling `len()` on a GroupBy object returns the number of groups
There are 21 unique sectors, so our object holds 21 groups.

In [15]:
len(sectors)

21

In [16]:
fortune['Sector'].nunique()

21
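A small sketch of the same check on made-up data (the `toy` frame is an assumption for illustration). It also uses the `ngroups` attribute, which reports the same number as `len()`:

```python
import pandas as pd

# Made-up data: three distinct sectors across four rows.
toy = pd.DataFrame(
    {
        "Sector": ["Energy", "Retailing", "Energy", "Technology"],
        "Revenue": [5, 7, 9, 11],
    }
)
grouped = toy.groupby("Sector")

n_groups = len(grouped)                   # number of groups
n_unique = toy["Sector"].nunique()        # number of distinct Sector values

print(n_groups, grouped.ngroups, n_unique)
```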

## `size()` returns the number of rows in each group

In [17]:
sectors.size()

Sector
Aerospace & Defense              20
Apparel                          15
Business Services                51
Chemicals                        30
Energy                          122
Engineering & Construction       26
Financials                      139
Food and Drug Stores             15
Food, Beverages & Tobacco        43
Health Care                      75
Hotels, Resturants & Leisure     25
Household Products               28
Industrials                      46
Materials                        43
Media                            25
Motor Vehicles & Parts           24
Retailing                        80
Technology                      102
Telecommunications               15
Transportation                   36
Wholesalers                      40
dtype: int64

### It is very similar to calling `value_counts()`
+ The only visible difference is the sorting: `size()` is ordered by group name, while `value_counts()` is ordered by count, largest first.
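To make the sorting difference concrete, here is a minimal sketch on made-up data (the `toy` frame is an assumption). Once both results are put in the same order, they hold the same numbers:

```python
import pandas as pd

# Made-up data: Technology appears 3 times, Retailing 2, Energy 1.
toy = pd.DataFrame(
    {"Sector": ["Technology", "Energy", "Technology", "Retailing", "Technology", "Retailing"]}
)

sizes = toy.groupby("Sector").size()      # ordered by group name
counts = toy["Sector"].value_counts()     # ordered by count, largest first

# After aligning the order, the two results agree.
same_numbers = sizes.to_dict() == counts.sort_index().to_dict()

print(list(sizes.index))
print(list(counts.index))
print(same_numbers)
```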

In [18]:
fortune['Sector'].value_counts()

Financials                      139
Energy                          122
Technology                      102
Retailing                        80
Health Care                      75
Business Services                51
Industrials                      46
Food, Beverages & Tobacco        43
Materials                        43
Wholesalers                      40
Transportation                   36
Chemicals                        30
Household Products               28
Engineering & Construction       26
Hotels, Resturants & Leisure     25
Media                            25
Motor Vehicles & Parts           24
Aerospace & Defense              20
Food and Drug Stores             15
Telecommunications               15
Apparel                          15
Name: Sector, dtype: int64

## `first()` gives us the first value of every group
We get 21 rows because our GroupBy object has 21 groups. For each Sector, `first()` returns the first non-null value in each column, which matches the Sector's first row when that row has no missing values.
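One subtlety worth a quick sketch: `first()` works column by column and skips missing values, whereas `head(1)` returns the literal first row of each group. The `toy` frame below is made up to show the difference. On data whose first rows have no missing values, the two give the same entries.

```python
import numpy as np
import pandas as pd

# Made-up data: the first Energy row has no Profits value.
toy = pd.DataFrame(
    {
        "Sector": ["Energy", "Energy", "Retailing"],
        "Company": ["A", "B", "C"],
        "Profits": [np.nan, 5.0, 7.0],
    }
)
grouped = toy.groupby("Sector")

first_values = grouped.first()   # first non-null entry of each column, per group
first_rows = grouped.head(1)     # the literal first row of each group, NaN included

print(first_values)
print(first_rows)
```

Here the Energy row of `first_values` pairs company A with the Profits value from company B's row, because A's own value is missing.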

In [22]:
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


In [19]:
sectors.first()

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,Boeing,Aerospace and Defense,"Chicago, IL",96114,5176,161400
Apparel,Nike,Apparel,"Beaverton, OR",30601,3273,62600
Business Services,ManpowerGroup,Temporary Help,"Milwaukee, WI",19330,419,27000
Chemicals,Dow Chemical,Chemicals,"Midland, MI",48778,7685,49495
Energy,Exxon Mobil,Petroleum Refining,"Irving, TX",246204,16150,75600
Engineering & Construction,Fluor,"Engineering, Construction","Irving, TX",18114,413,38758
Financials,Berkshire Hathaway,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
Food and Drug Stores,CVS Health,Food and Drug Stores,"Woonsocket, RI",153290,5237,199000
"Food, Beverages & Tobacco",Archer Daniels Midland,Food Production,"Chicago, IL",67702,1849,32300
Health Care,McKesson,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


## `last()` gives us the last value of every group

In [25]:
fortune.tail(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646
999,Wendy’s,"Hotels, Resturants & Leisure",Food Services,"Dublin, OH",1896,161,21200
1000,Briggs & Stratton,Industrials,Industrial Machinery,"Wauwatosa, WI",1895,46,5480


In [23]:
sectors.last()

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,Delta Tucker Holdings,Aerospace and Defense,"McLean, VA",1923,-133,12000
Apparel,Guess,Apparel,"Los Angeles, CA",2204,82,13500
Business Services,DeVry Education Group,Education,"Downers Grove, IL",1910,140,11770
Chemicals,H.B. Fuller,Chemicals,"St. Paul, MN",2084,87,4425
Energy,Portland General Electric,Utilities: Gas and Electric,"Portland, OR",1898,172,2646
Engineering & Construction,MDC Holdings,Homebuilders,"Denver, CO",1909,66,1225
Financials,New York Community Bancorp,Commercial Banks,"Westbury, NY",1902,-47,3448
Food and Drug Stores,Fred’s,Food and Drug Stores,"Memphis, TN",2151,-7,7103
"Food, Beverages & Tobacco",Alliance One International,Tobacco,"Morrisville, NC",2066,-15,6835
Health Care,Providence Service,Health Care: Pharmacy and Other Services,"Tucson, AZ",1987,84,9072


## The `groups` attribute gives a dictionary-like object
+ keys are the group names (in our case, Sector names)
+ values are the index labels of the rows in that group (in our case, the Rank labels of the companies that belong to each Sector)
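As a small sketch of how the keys and values fit together, the index labels stored under a key can be passed straight to `.loc` to fetch that group's rows. The `toy` frame and its Rank labels are made up for illustration:

```python
import pandas as pd

# Made-up data with Rank as the index, mirroring the layout of the real table.
toy = pd.DataFrame(
    {"Company": ["A", "B", "C"], "Sector": ["Energy", "Retailing", "Energy"]},
    index=pd.Index([2, 1, 14], name="Rank"),
)
grouped = toy.groupby("Sector")

energy_labels = list(grouped.groups["Energy"])   # index labels of the Energy rows
energy_rows = toy.loc[energy_labels]             # look the rows up by label

print(energy_labels)
print(energy_rows)
```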

In [32]:
fortune.head(1)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000


In [31]:
fortune.loc[24] # check which company and Sector sit at index label 24

Company                     Boeing
Sector         Aerospace & Defense
Industry     Aerospace and Defense
Location               Chicago, IL
Revenue                      96114
Profits                       5176
Employees                   161400
Name: 24, dtype: object

In [27]:
sectors.groups

{'Aerospace & Defense': [24, 45, 60, 88, 118, 120, 209, 245, 282, 378, 389, 490, 560, 605, 785, 788, 836, 903, 958, 987], 'Apparel': [91, 231, 340, 354, 448, 547, 575, 597, 683, 695, 726, 794, 877, 882, 917], 'Business Services': [144, 186, 199, 204, 221, 248, 249, 294, 307, 312, 355, 392, 404, 440, 467, 468, 481, 485, 492, 503, 545, 626, 635, 652, 677, 694, 714, 729, 734, 735, 737, 744, 767, 776, 777, 783, 791, 792, 796, 801, 803, 816, 819, 820, 869, 870, 886, 939, 951, 952, 993], 'Chemicals': [56, 101, 182, 189, 206, 253, 262, 277, 288, 296, 316, 538, 549, 555, 566, 580, 613, 624, 654, 668, 717, 720, 724, 758, 761, 829, 865, 898, 934, 949], 'Energy': [2, 14, 30, 32, 42, 65, 90, 95, 98, 104, 115, 117, 121, 162, 163, 165, 166, 175, 178, 188, 190, 192, 193, 198, 214, 216, 217, 223, 225, 229, 243, 246, 247, 257, 272, 274, 279, 289, 319, 322, 324, 343, 348, 349, 350, 363, 364, 384, 387, 388, 394, 402, 403, 410, 425, 437, 438, 445, 458, 475, 483, 493, 507, 522, 541, 548, 556, 558, 569, 571

------

# 3) Retrieve A Group with the `.get_group()` Method

In [33]:
fortune = pd.read_csv('Data/fortune1000.csv', index_col=['Rank'])
sectors = fortune.groupby('Sector')
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


In [39]:
sectors.get_group('Energy')
sectors.get_group('Technology')
sectors.get_group('Apparel')

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
91,Nike,Apparel,Apparel,"Beaverton, OR",30601,3273,62600
231,VF,Apparel,Apparel,"Greensboro, NC",12377,1232,64000
340,PVH,Apparel,Apparel,"New York, NY",8020,572,26200
354,Ralph Lauren,Apparel,Apparel,"New York, NY",7620,702,20000
448,Hanesbrands,Apparel,Apparel,"Winston-Salem, NC",5732,429,65300
547,Levi Strauss,Apparel,Apparel,"San Francisco, CA",4495,209,12500
575,Coach,Apparel,Apparel,"New York, NY",4192,402,12950
597,Under Armour,Apparel,Apparel,"Baltimore, MD",3963,233,9600
683,Fossil Group,Apparel,Apparel,"Richardson, TX",3229,221,15100
695,Skechers U.S.A.,Apparel,Apparel,"Manhattan Beach, CA",3159,232,6400


In [40]:
# equivalent to filtering the DataFrame with a boolean mask
fortune[fortune['Sector'] == 'Apparel']

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
91,Nike,Apparel,Apparel,"Beaverton, OR",30601,3273,62600
231,VF,Apparel,Apparel,"Greensboro, NC",12377,1232,64000
340,PVH,Apparel,Apparel,"New York, NY",8020,572,26200
354,Ralph Lauren,Apparel,Apparel,"New York, NY",7620,702,20000
448,Hanesbrands,Apparel,Apparel,"Winston-Salem, NC",5732,429,65300
547,Levi Strauss,Apparel,Apparel,"San Francisco, CA",4495,209,12500
575,Coach,Apparel,Apparel,"New York, NY",4192,402,12950
597,Under Armour,Apparel,Apparel,"Baltimore, MD",3963,233,9600
683,Fossil Group,Apparel,Apparel,"Richardson, TX",3229,221,15100
695,Skechers U.S.A.,Apparel,Apparel,"Manhattan Beach, CA",3159,232,6400
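The equivalence between `get_group()` and a boolean mask can be checked on a small made-up frame (the `toy` data is an assumption). The comparison below uses only the non-key columns, so it does not depend on whether a given pandas version keeps the grouping column in the `get_group()` result:

```python
import pandas as pd

# Made-up data: two Apparel rows and one Energy row.
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C"],
        "Sector": ["Apparel", "Energy", "Apparel"],
        "Revenue": [3, 2, 1],
    }
)

via_group = toy.groupby("Sector").get_group("Apparel")
via_mask = toy[toy["Sector"] == "Apparel"]

# Compare the same rows and the same non-key columns from both routes.
cols = ["Company", "Revenue"]
same_rows = via_group[cols].equals(via_mask[cols])

print(same_rows)
```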


------

# 4) Methods on the `Groupby` Object and `DataFrame` Columns

In [41]:
fortune = pd.read_csv('Data/fortune1000.csv', index_col=['Rank'])
sectors = fortune.groupby('Sector')
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


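## Using the `max()` method
+ Pandas takes the largest value of **each column separately** within each group. For text columns such as 'Company', that is the string that sorts last (uppercase letters sort before lowercase, which is why 'inVentiv Health' tops Health Care).
+ Because every column is evaluated on its own, a row of the result usually mixes values from different companies. Example: the Aerospace & Defense row pairs 'Woodward' with a Revenue of 96114, which is Boeing's revenue.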
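A minimal sketch on made-up data (the `toy` frame is an assumption) shows that `max()` is evaluated column by column, so the result need not be a real row. If the actual top row is wanted, one option is `idxmax()` on a single column combined with `.loc`:

```python
import pandas as pd

# Made-up data: the company that sorts last by name is not the one with the top Revenue.
toy = pd.DataFrame(
    {
        "Company": ["Zeta", "Alpha"],
        "Sector": ["Energy", "Energy"],
        "Revenue": [10, 99],
    }
)
grouped = toy.groupby("Sector")

column_max = grouped.max()                        # each column's maximum, taken separately
top_row = toy.loc[grouped["Revenue"].idxmax()]    # the actual row with the highest Revenue

print(column_max)
print(top_row)
```

`column_max` pairs 'Zeta' with 99 even though no such row exists, while `top_row` returns Alpha's real row.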

In [45]:
sectors.get_group('Energy').head(1)

Rank,Company,Industry,Location,Revenue,Profits,Employees
2,Exxon Mobil,Petroleum Refining,"Irving, TX",246204,16150,75600


**We can see that 'Woodward' is the largest 'Company' value (the last name in alphabetical order) in the Aerospace & Defense Sector.**

In [43]:
sectors.max()

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,Woodward,Aerospace and Defense,"Wichita, KS",96114,7608,197200
Apparel,Wolverine World Wide,Apparel,"Winston-Salem, NC",30601,3273,65300
Business Services,Western Union,Waste Management,"Troy, MI",19330,6328,216500
Chemicals,Westlake Chemical,Chemicals,"Wilmington, DE",48778,7685,52000
Energy,Xcel Energy,Utilities: Gas and Electric,"Washington, DC",246204,16150,75600
Engineering & Construction,Tutor Perini,Homebuilders,"Watsonville, CA",18114,803,92000
Financials,Zions Bancorp.,Securities,"Worcester, MA",210821,24442,331000
Food and Drug Stores,Whole Foods Market,Food and Drug Stores,"Woonsocket, RI",153290,5237,431000
"Food, Beverages & Tobacco",WhiteWave Foods,Tobacco,"Winston-Salem, NC",67702,7351,263000
Health Care,inVentiv Health,Wholesalers: Health Care,"York, PA",181241,18108,203500


## Using `min()`
+ This is the opposite of `max()`: it gives the smallest value of each column in each group.

In [46]:
sectors.min()

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,B/E Aerospace,Aerospace and Defense,"Berwyn, PA",1923,-240,6955
Apparel,Carter’s,Apparel,"Atlanta, GA",2204,82,5978
Business Services,ABM Industries,"Advertising, marketing","Arlington, VA",1910,-1481,2400
Chemicals,A. Schulman,Chemicals,"Allentown, PA",2084,-816,1979
Energy,AES,Energy,"Akron, OH",1898,-23119,480
Engineering & Construction,AECOM,"Engineering, Construction","Atlanta, GA",1909,-155,1036
Financials,AIG,Commercial Banks,"Atlanta, GA",1902,-1194,187
Food and Drug Stores,CVS Health,Food and Drug Stores,"Austin, TX",2151,-62,1616
"Food, Beverages & Tobacco",Alliance One International,Beverages,"Arden Hills, MN",2066,-253,1857
Health Care,AbbVie,Health Care: Insurance and Managed Care,"Abbott Park, IL",1987,-458,2924


## Using `sum()` and `mean()`
+ These give the **total/mean of every numeric column** in each group.
+ In our case, that means the sum/mean of Revenue, Profits, and Employees.
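A short sketch on made-up data (the `toy` frame is an assumption): passing `numeric_only=True` makes the restriction to numeric columns explicit, which matters because newer pandas versions no longer drop text columns silently. One cell is cross-checked against a manual calculation on the group itself.

```python
import pandas as pd

# Made-up data with one text column (Company) and two numeric columns.
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C"],
        "Sector": ["Apparel", "Apparel", "Energy"],
        "Revenue": [10, 20, 40],
        "Employees": [100, 300, 50],
    }
)
grouped = toy.groupby("Sector")

totals = grouped.sum(numeric_only=True)      # Revenue and Employees only
averages = grouped.mean(numeric_only=True)

# Cross-check one cell against a manual calculation on a single group.
manual_mean = grouped.get_group("Apparel")["Revenue"].mean()

print(totals)
print(averages)
print(manual_mean)
```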

In [53]:
# we can verify this on a single group
sectors.get_group('Apparel')['Employees'].sum()
sectors.get_group('Apparel')['Revenue'].mean()

6397.866666666667

In [54]:
sectors.sum(numeric_only=True)  # numeric_only=True skips the text columns
sectors.mean(numeric_only=True)

Sector,Revenue,Profits,Employees
Aerospace & Defense,17897.0,1437.1,48402.85
Apparel,6397.866667,549.066667,23093.133333
Business Services,5337.156863,553.470588,26687.254902
Chemicals,8129.9,754.266667,15455.033333
Energy,12441.057377,-602.02459,9745.303279
Engineering & Construction,5922.423077,204.0,15642.615385
Financials,15950.784173,1872.007194,24172.28777
Food and Drug Stores,32251.266667,1117.266667,93026.533333
"Food, Beverages & Tobacco",12929.465116,1195.744186,28177.488372
Health Care,21529.426667,1414.853333,35710.52


## We can also select columns on the GroupBy object and run calculations on them directly.

In [63]:
sectors['Revenue'].sum()
sectors['Profits'].mean()
sectors['Revenue'].min()
sectors['Employees'].max()

sectors[['Revenue', 'Profits']].sum()
sectors[['Employees', 'Profits']].max()

Sector,Employees,Profits
Aerospace & Defense,197200,7608
Apparel,65300,3273
Business Services,216500,6328
Chemicals,52000,7685
Energy,75600,16150
Engineering & Construction,92000,803
Financials,331000,24442
Food and Drug Stores,431000,5237
"Food, Beverages & Tobacco",263000,7351
Health Care,203500,18108


-------

# 5) Grouping by Multiple Columns

### Let's say we want to group by `Sector` and `Industry`

In [65]:
fortune = pd.read_csv('Data/fortune1000.csv', index_col=['Rank'])
sectors = fortune.groupby(['Sector', 'Industry']) # group by multiple columns (in our case Sector and Industry)
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


In [67]:
sectors.groups

{('Aerospace & Defense', 'Aerospace and Defense'): [24, 45, 60, 88, 118, 120, 209, 245, 282, 378, 389, 490, 560, 605, 785, 788, 836, 903, 958, 987], ('Apparel', 'Apparel'): [91, 231, 340, 354, 448, 547, 575, 597, 683, 695, 726, 794, 877, 882, 917], ('Business Services', 'Advertising, marketing'): [186, 355], ('Business Services', 'Diversified Outsourcing Services'): [199, 248, 485, 545, 626, 635, 714, 729, 744, 783, 803, 816, 819, 869], ('Business Services', 'Education'): [737, 820, 993], ('Business Services', 'Financial Data Services'): [204, 249, 294, 307, 392, 404, 468, 481, 492, 652, 694, 767, 776, 777, 792, 796, 801, 886, 952], ('Business Services', 'Miscellaneous'): [440, 734, 870], ('Business Services', 'Temporary Help'): [144, 467, 503, 791, 951], ('Business Services', 'Waste Management'): [221, 312, 677, 735, 939], ('Chemicals', 'Chemicals'): [56, 101, 182, 189, 206, 253, 262, 277, 288, 296, 316, 538, 549, 555, 566, 580, 613, 624, 654, 668, 717, 720, 724, 758, 761, 829, 865, 8

### This lets us drill into more detail by breaking each Sector down by Industry
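Grouping by two columns produces a result indexed by (Sector, Industry) pairs. The sketch below, on made-up data (the `toy` frame and its values are assumptions), shows a few ways to read such a result:

```python
import pandas as pd

# Made-up data: one Sector split across two Industries, plus a second Sector.
toy = pd.DataFrame(
    {
        "Sector": ["Business Services", "Business Services", "Business Services", "Apparel"],
        "Industry": ["Education", "Education", "Temporary Help", "Apparel"],
        "Revenue": [1, 2, 4, 8],
    }
)
by_both = toy.groupby(["Sector", "Industry"])["Revenue"].sum()

# A full (Sector, Industry) pair selects a single value.
education_total = by_both.loc[("Business Services", "Education")]

# Selecting on the first level alone returns every Industry in that Sector.
business_services = by_both.loc["Business Services"]

# reset_index() turns the two index levels back into ordinary columns.
flat = by_both.reset_index()

print(education_total)
print(business_services)
print(flat)
```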

In [72]:
sectors.size()
sectors.sum(numeric_only=True)  # keep only the numeric columns

Sector,Industry,Revenue,Profits,Employees
Aerospace & Defense,Aerospace and Defense,357940,28742,968057
Apparel,Apparel,95968,8236,346397
Business Services,"Advertising, marketing",22748,1549,124100
Business Services,Diversified Outsourcing Services,64829,4305,708330
Business Services,Education,7485,69,46755
...,...,...,...,...
Transportation,"Trucking, Truck Leasing",35950,1910,170456
Wholesalers,Miscellaneous,8982,17,9200
Wholesalers,Wholesalers: Diversified,176138,5193,233831
Wholesalers,Wholesalers: Electronics and Office Equipment,147906,1857,166661


In [74]:
sectors['Revenue'].sum()
sectors['Employees'].mean()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            48402.850000
Apparel              Apparel                                          23093.133333
Business Services    Advertising, marketing                           62050.000000
                     Diversified Outsourcing Services                 50595.000000
                     Education                                        15585.000000
                                                                          ...     
Transportation       Trucking, Truck Leasing                          18939.555556
Wholesalers          Miscellaneous                                     9200.000000
                     Wholesalers: Diversified                          9353.240000
                     Wholesalers: Electronics and Office Equipment    20832.625000
                     Wholesalers: Food and Grocery                    19317.500000
Name: Employees, Len

-----

# 6) The `.agg()` Method

In [75]:
fortune = pd.read_csv('Data/fortune1000.csv', index_col=['Rank'])
sectors = fortune.groupby('Sector')
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


In [77]:
sectors.mean(numeric_only=True)  # keep only the numeric columns
sectors['Profits'].sum()

Sector
Aerospace & Defense              28742
Apparel                           8236
Business Services                28227
Chemicals                        22628
Energy                          -73447
Engineering & Construction        5304
Financials                      260209
Food and Drug Stores             16759
Food, Beverages & Tobacco        51417
Health Care                     106114
Hotels, Resturants & Leisure     20697
Household Products               14428
Industrials                      20764
Materials                         4428
Media                            24347
Motor Vehicles & Parts           25898
Retailing                        47830
Technology                      180473
Telecommunications               48637
Transportation                   44169
Wholesalers                       8233
Name: Profits, dtype: int64

### The `.agg()` method gives us the best of both worlds by combining the usages above.
We need to provide one of the following:
+ Option 1) a dictionary of key-value pairs of the form **`column_name`: `aggregation method`**
+ Option 2) a **list of aggregation methods**, each of which is applied to the numeric columns (newer pandas versions may require selecting those columns first)
+ Option 3) a combination of the two options above

We can mix and match based on our requirements.
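Beyond the three options above, pandas also supports keyword-style "named aggregation", which is not used in this notebook but yields flat, self-describing column names. A minimal sketch on made-up data follows. The output names `total_revenue`, `mean_revenue`, and `mean_employees` are arbitrary labels chosen for illustration:

```python
import pandas as pd

# Made-up data: two Energy rows and one Retailing row.
toy = pd.DataFrame(
    {
        "Sector": ["Energy", "Energy", "Retailing"],
        "Revenue": [10, 30, 5],
        "Employees": [100, 300, 50],
    }
)

# Each keyword is an output column name; each value is (input column, aggregation).
summary = toy.groupby("Sector").agg(
    total_revenue=("Revenue", "sum"),
    mean_revenue=("Revenue", "mean"),
    mean_employees=("Employees", "mean"),
)

print(summary)
```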

## Option 1)

In [78]:
sectors.agg({
    'Revenue': 'sum',
    'Profits': 'sum',
    'Employees': 'mean'
})

Sector,Revenue,Profits,Employees
Aerospace & Defense,357940,28742,48402.85
Apparel,95968,8236,23093.133333
Business Services,272195,28227,26687.254902
Chemicals,243897,22628,15455.033333
Energy,1517809,-73447,9745.303279
Engineering & Construction,153983,5304,15642.615385
Financials,2217159,260209,24172.28777
Food and Drug Stores,483769,16759,93026.533333
"Food, Beverages & Tobacco",555967,51417,28177.488372
Health Care,1614707,106114,35710.52


## Option 2)

In [85]:
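numeric = sectors[['Revenue', 'Profits', 'Employees']] # select the numeric columns; mean and std do not apply to text
numeric.agg(['size', 'sum', 'mean', 'count', 'max', 'min', 'std']) # count matches size only when there are no missing values
numeric.agg(['size', 'sum', 'mean'])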

,Revenue,Revenue,Revenue,Profits,Profits,Profits,Employees,Employees,Employees
Sector,size,sum,mean,size,sum,mean,size,sum,mean
Aerospace & Defense,20,357940,17897.0,20,28742,1437.1,20,968057,48402.85
Apparel,15,95968,6397.866667,15,8236,549.066667,15,346397,23093.133333
Business Services,51,272195,5337.156863,51,28227,553.470588,51,1361050,26687.254902
Chemicals,30,243897,8129.9,30,22628,754.266667,30,463651,15455.033333
Energy,122,1517809,12441.057377,122,-73447,-602.02459,122,1188927,9745.303279
Engineering & Construction,26,153983,5922.423077,26,5304,204.0,26,406708,15642.615385
Financials,139,2217159,15950.784173,139,260209,1872.007194,139,3359948,24172.28777
Food and Drug Stores,15,483769,32251.266667,15,16759,1117.266667,15,1395398,93026.533333
"Food, Beverages & Tobacco",43,555967,12929.465116,43,51417,1195.744186,43,1211632,28177.488372
Health Care,75,1614707,21529.426667,75,106114,1414.853333,75,2678289,35710.52


## Option 3)

In [90]:
sectors.agg({
    'Revenue': ['sum', 'mean'],
    'Profits': 'sum',
    'Employees': 'mean'
})

,Revenue,Revenue,Profits,Employees
Sector,sum,mean,sum,mean
Aerospace & Defense,357940,17897.0,28742,48402.85
Apparel,95968,6397.866667,8236,23093.133333
Business Services,272195,5337.156863,28227,26687.254902
Chemicals,243897,8129.9,22628,15455.033333
Energy,1517809,12441.057377,-73447,9745.303279
Engineering & Construction,153983,5922.423077,5304,15642.615385
Financials,2217159,15950.784173,260209,24172.28777
Food and Drug Stores,483769,32251.266667,16759,93026.533333
"Food, Beverages & Tobacco",555967,12929.465116,51417,28177.488372
Health Care,1614707,21529.426667,106114,35710.52


------

# 7) Iterating through Groups

In [91]:
fortune = pd.read_csv('Data/fortune1000.csv', index_col=['Rank'])
sectors = fortune.groupby('Sector')
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


## We want to extract the company with the highest Revenue in each Sector. How do we get it?

In [94]:
sectors['Profits'].max() # this gives only the maximum value per Sector, not the whole row

Sector
Aerospace & Defense              7608
Apparel                          3273
Business Services                6328
Chemicals                        7685
Energy                          16150
Engineering & Construction        803
Financials                      24442
Food and Drug Stores             5237
Food, Beverages & Tobacco        7351
Health Care                     18108
Hotels, Resturants & Leisure     5920
Household Products               7036
Industrials                      4833
Materials                         991
Media                            8382
Motor Vehicles & Parts           9687
Retailing                       14694
Technology                      53394
Telecommunications              17879
Transportation                   7610
Wholesalers                      1472
Name: Profits, dtype: int64

### First, create an empty DataFrame with the same columns as the original DataFrame

In [95]:
df = pd.DataFrame(columns=fortune.columns)
df

,Company,Sector,Industry,Location,Revenue,Profits,Employees


### Then we iterate through the groups and append the top row of each

In [97]:
for sector, data in sectors:
    highest_revenue_company_in_group = data.nlargest(1, 'Revenue') # get the top one result of highest revenue
    df = pd.concat([df, highest_revenue_company_in_group]) # DataFrame.append was removed in pandas 2.0

In [98]:
df

,Company,Sector,Industry,Location,Revenue,Profits,Employees
24,Boeing,Aerospace & Defense,Aerospace and Defense,"Chicago, IL",96114,5176,161400
91,Nike,Apparel,Apparel,"Beaverton, OR",30601,3273,62600
144,ManpowerGroup,Business Services,Temporary Help,"Milwaukee, WI",19330,419,27000
56,Dow Chemical,Chemicals,Chemicals,"Midland, MI",48778,7685,49495
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
155,Fluor,Engineering & Construction,"Engineering, Construction","Irving, TX",18114,413,38758
4,Berkshire Hathaway,Financials,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
7,CVS Health,Food and Drug Stores,Food and Drug Stores,"Woonsocket, RI",153290,5237,199000
41,Archer Daniels Midland,"Food, Beverages & Tobacco",Food Production,"Chicago, IL",67702,1849,32300
5,McKesson,Health Care,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400
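As a sketch of two alternatives to growing a DataFrame inside the loop, the pieces can be collected in a list and concatenated once, or the loop can be skipped entirely with `idxmax()`. The `toy` frame is made up for illustration. When revenues tie within a group, which row each approach returns is worth checking rather than assuming.

```python
import pandas as pd

# Made-up data with Rank as the index, mirroring the layout of the real table.
toy = pd.DataFrame(
    {
        "Company": ["A", "B", "C", "D"],
        "Sector": ["Energy", "Retailing", "Energy", "Retailing"],
        "Revenue": [10, 70, 30, 20],
    },
    index=pd.Index([1, 2, 3, 4], name="Rank"),
)
grouped = toy.groupby("Sector")

# Loop version: collect one-row frames, then concatenate once at the end.
pieces = [data.nlargest(1, "Revenue") for _, data in grouped]
looped = pd.concat(pieces)

# Loop-free version: idxmax() gives the index label of the top row in each group.
direct = toy.loc[grouped["Revenue"].idxmax()]

print(looped)
print(direct)
```

Both results contain company C for Energy and company B for Retailing.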


## Example 2) Let's say we want the company with the highest Revenue in each city (Location).

In [100]:
# first create a groupby object based on cities
cities = fortune.groupby('Location')
cities

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000004927A374F0>

In [101]:
# create empty df with columns
df = pd.DataFrame(columns=fortune.columns)
df

,Company,Sector,Industry,Location,Revenue,Profits,Employees


In [104]:
# loop through to find the largest one
for city, data in cities:
    highest_revenue_company_in_city = data.nlargest(1, 'Revenue')
    df = pd.concat([df, highest_revenue_company_in_city]) # DataFrame.append was removed in pandas 2.0

In [105]:
df

,Company,Sector,Industry,Location,Revenue,Profits,Employees
138,Abbott Laboratories,Health Care,Medical Products and Equipment,"Abbott Park, IL",20661,4423,74000
169,Goodyear Tire & Rubber,Motor Vehicles & Parts,Motor Vehicles and Parts,"Akron, OH",16443,307,66000
288,Air Products & Chemicals,Chemicals,Chemicals,"Allentown, PA",9895,1278,19550
830,Benchmark Electronics,Technology,Semiconductors and Other Electronic Components,"Angleton, TX",2541,95,10500
374,Casey’s General Stores,Retailing,Specialty Retailers: Other,"Ankeny, IA",7052,181,22408
...,...,...,...,...,...,...,...
7,CVS Health,Food and Drug Stores,Food and Drug Stores,"Woonsocket, RI",153290,5237,199000
506,Hanover Insurance Group,Financials,Insurance: Property and Casualty (Stock),"Worcester, MA",5034,332,4800
764,Penn National Gaming,"Hotels, Resturants & Leisure","Hotels, Casinos, Resorts","Wyomissing, PA",2838,1,18204
773,Bon-Ton Stores,Retailing,General Merchandisers,"York, PA",2790,-57,24100


-------

