## **Introduction to the `.groupby()` Method**

In [1]:
import pandas as pd

In [42]:
fortune = pd.read_csv("fortune1000.csv", index_col = ["Rank"])
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


The idea behind grouping is always to group by a column that contains repeated values.

It makes no sense to group by a column that holds only unique values, such as "Company". Here it makes more sense to group by "Sector", "Industry" or "Location".
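As a rough illustration of this rule of thumb, the sketch below counts distinct values per column on a small made-up frame (the rows and the `toy` name are assumptions, not the Fortune 1000 data). Columns with fewer distinct values than rows are the natural grouping candidates.

```python
import pandas as pd

# Hypothetical miniature dataset; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "Company": ["A Corp", "B Corp", "C Corp", "D Corp"],
        "Sector": ["Energy", "Energy", "Retailing", "Retailing"],
    }
)

# Number of distinct values in each column.
distinct = toy.nunique()

# A column is a reasonable grouping key when it has repeated values,
# i.e. fewer distinct values than there are rows.
candidates = [col for col in toy.columns if distinct[col] < len(toy)]
print(candidates)
```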

In [12]:
sectors = fortune.groupby(by = ["Sector"])
type(sectors)

pandas.core.groupby.generic.DataFrameGroupBy

### What we have here is only a `DataFrameGroupBy` object stored in memory; it becomes useful only when we call methods on it (aggregation methods, i.e. mathematical operations).
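A minimal sketch of that point, using an invented three-row frame (`toy` and its values are assumptions): the GroupBy object by itself shows no result, and a table only appears once an aggregation is called on it.

```python
import pandas as pd

# Hypothetical data, not taken from fortune1000.csv.
toy = pd.DataFrame(
    {
        "Sector": ["Energy", "Energy", "Retailing"],
        "Revenue": [10, 20, 5],
    }
)

grouped = toy.groupby("Sector")     # only a GroupBy object, no result table yet
totals = grouped["Revenue"].sum()   # the aggregation produces an actual Series

print(type(grouped).__name__)
print(totals)
```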

## **The .groupby() Method**

In [13]:
fortune = pd.read_csv("fortune1000.csv", index_col = ["Rank"])
sectors = fortune.groupby(by = ["Sector"])
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


In [15]:
len(fortune) # Returns the number of rows

1000

In [16]:
len(sectors) # Returns the number of groups; in this case there are 21 distinct sectors.

21

In [17]:
# We can confirm that the GroupBy object has 21 distinct sectors with the code below:

fortune["Sector"].nunique()

21

In [21]:
# One of the first methods we can call on a DataFrameGroupBy object is 'size'
# The index holds the groups (sectors) and the values are the number of rows in each group
# The counts must add up to 1000 rows: sectors.size().sum() == 1000

sectors.size()

Sector
Aerospace & Defense              20
Apparel                          15
Business Services                51
Chemicals                        30
Energy                          122
Engineering & Construction       26
Financials                      139
Food and Drug Stores             15
Food, Beverages & Tobacco        43
Health Care                      75
Hotels, Resturants & Leisure     25
Household Products               28
Industrials                      46
Materials                        43
Media                            25
Motor Vehicles & Parts           24
Retailing                        80
Technology                      102
Telecommunications               15
Transportation                   36
Wholesalers                      40
dtype: int64
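The check mentioned in the comment above can be sketched on a tiny invented column (the `toy` frame is an assumption): the group sizes always add up to the number of rows, because every row belongs to exactly one group.

```python
import pandas as pd

# Hypothetical data: five rows spread over three sectors.
toy = pd.DataFrame(
    {"Sector": ["Energy", "Retailing", "Energy", "Technology", "Energy"]}
)

sizes = toy.groupby("Sector").size()

# Every row falls in exactly one group, so the sizes sum to the row count.
print(sizes.sum() == len(toy))
```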

In [22]:
fortune["Sector"].value_counts() # same counts as .size(), but sorted by count (largest first) instead of by sector name

Financials                      139
Energy                          122
Technology                      102
Retailing                        80
Health Care                      75
Business Services                51
Industrials                      46
Materials                        43
Food, Beverages & Tobacco        43
Wholesalers                      40
Transportation                   36
Chemicals                        30
Household Products               28
Engineering & Construction       26
Media                            25
Hotels, Resturants & Leisure     25
Motor Vehicles & Parts           24
Aerospace & Defense              20
Telecommunications               15
Apparel                          15
Food and Drug Stores             15
Name: Sector, dtype: int64
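To make the comparison concrete, here is a small sketch on invented data (the `toy` frame is an assumption). Both calls yield the same counts; only the ordering differs.

```python
import pandas as pd

# Hypothetical data chosen so that the two orderings differ.
toy = pd.DataFrame(
    {
        "Sector": [
            "Technology", "Energy", "Technology",
            "Retailing", "Technology", "Retailing",
        ]
    }
)

by_key = toy.groupby("Sector").size()      # ordered by group label
by_count = toy["Sector"].value_counts()    # ordered by count, largest first

# Same label-to-count mapping in both results.
same_counts = by_key.to_dict() == by_count.to_dict()
print(same_counts)
print(list(by_key.index))
print(list(by_count.index))
```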

In [26]:
sectors.first() # returns the first non-null value of each column for each group (here, the first row of each group)

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,Boeing,Aerospace and Defense,"Chicago, IL",96114,5176,161400
Apparel,Nike,Apparel,"Beaverton, OR",30601,3273,62600
Business Services,ManpowerGroup,Temporary Help,"Milwaukee, WI",19330,419,27000
Chemicals,Dow Chemical,Chemicals,"Midland, MI",48778,7685,49495
Energy,Exxon Mobil,Petroleum Refining,"Irving, TX",246204,16150,75600
Engineering & Construction,Fluor,"Engineering, Construction","Irving, TX",18114,413,38758
Financials,Berkshire Hathaway,Insurance: Property and Casualty (Stock),"Omaha, NE",210821,24083,331000
Food and Drug Stores,CVS Health,Food and Drug Stores,"Woonsocket, RI",153290,5237,199000
"Food, Beverages & Tobacco",Archer Daniels Midland,Food Production,"Chicago, IL",67702,1849,32300
Health Care,McKesson,Wholesalers: Health Care,"San Francisco, CA",181241,1476,70400


In [28]:
sectors.last() # returns the last non-null value of each column for each group (here, the last row of each group)

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,Delta Tucker Holdings,Aerospace and Defense,"McLean, VA",1923,-133,12000
Apparel,Guess,Apparel,"Los Angeles, CA",2204,82,13500
Business Services,DeVry Education Group,Education,"Downers Grove, IL",1910,140,11770
Chemicals,H.B. Fuller,Chemicals,"St. Paul, MN",2084,87,4425
Energy,Portland General Electric,Utilities: Gas and Electric,"Portland, OR",1898,172,2646
Engineering & Construction,MDC Holdings,Homebuilders,"Denver, CO",1909,66,1225
Financials,New York Community Bancorp,Commercial Banks,"Westbury, NY",1902,-47,3448
Food and Drug Stores,Fred’s,Food and Drug Stores,"Memphis, TN",2151,-7,7103
"Food, Beverages & Tobacco",Alliance One International,Tobacco,"Morrisville, NC",2066,-15,6835
Health Care,Providence Service,Health Care: Pharmacy and Other Services,"Tucson, AZ",1987,84,9072
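One subtlety worth a quick sketch: by default `.first()` skips missing values column by column, so it is not always the literal first row. The invented frame below (an assumption, not the Fortune data) contrasts it with `.head(1)`, which keeps the actual first row of each group.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value in the first "Energy" row.
toy = pd.DataFrame(
    {
        "Sector": ["Energy", "Energy", "Retailing"],
        "Profits": [np.nan, 7.0, 3.0],
    },
    index=[1, 2, 3],
)

grouped = toy.groupby("Sector")
first_values = grouped.first()   # first non-null value per column and group
first_rows = grouped.head(1)     # literal first row of each group, original index kept

print(first_values)
print(first_rows)
```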


In [32]:
sectors.groups # returns a dict-like object whose keys are the group names and whose values are the index labels of the rows in each group

{'Aerospace & Defense': [24, 45, 60, 88, 118, 120, 209, 245, 282, 378, 389, 490, 560, 605, 785, 788, 836, 903, 958, 987], 'Apparel': [91, 231, 340, 354, 448, 547, 575, 597, 683, 695, 726, 794, 877, 882, 917], 'Business Services': [144, 186, 199, 204, 221, 248, 249, 294, 307, 312, 355, 392, 404, 440, 467, 468, 481, 485, 492, 503, 545, 626, 635, 652, 677, 694, 714, 729, 734, 735, 737, 744, 767, 776, 777, 783, 791, 792, 796, 801, 803, 816, 819, 820, 869, 870, 886, 939, 951, 952, 993], 'Chemicals': [56, 101, 182, 189, 206, 253, 262, 277, 288, 296, 316, 538, 549, 555, 566, 580, 613, 624, 654, 668, 717, 720, 724, 758, 761, 829, 865, 898, 934, 949], 'Energy': [2, 14, 30, 32, 42, 65, 90, 95, 98, 104, 115, 117, 121, 162, 163, 165, 166, 175, 178, 188, 190, 192, 193, 198, 214, 216, 217, 223, 225, 229, 243, 246, 247, 257, 272, 274, 279, 289, 319, 322, 324, 343, 348, 349, 350, 363, 364, 384, 387, 388, 394, 402, 403, 410, 425, 437, 438, 445, 458, 475, 483, 493, 507, 522, 541, 548, 556, 558, 569, 571

In [40]:
# verifying the statement above:

fortune.loc[24]

Company                     Boeing
Sector         Aerospace & Defense
Industry     Aerospace and Defense
Location               Chicago, IL
Revenue                      96114
Profits                       5176
Employees                   161400
Name: 24, dtype: object
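The same lookup can be done for a whole group at once. The sketch below uses an invented frame (the `toy` data is an assumption) and groups by the plain string `"Sector"`, so the keys of `.groups` are plain strings.

```python
import pandas as pd

# Hypothetical data indexed by a "Rank"-like label.
toy = pd.DataFrame(
    {
        "Company": ["A Corp", "B Corp", "C Corp"],
        "Sector": ["Energy", "Retailing", "Energy"],
    },
    index=pd.Index([1, 2, 3], name="Rank"),
)

groups = toy.groupby("Sector").groups

# The labels stored for one group can be fed straight into .loc.
energy_labels = list(groups["Energy"])
energy_rows = toy.loc[groups["Energy"]]

print(energy_labels)
print(energy_rows)
```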

## **Retrieve A Group with the .get_group() Method**

In [43]:
fortune = pd.read_csv("fortune1000.csv", index_col = ["Rank"])
sectors = fortune.groupby(by = ["Sector"])
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


In [45]:
sectors.get_group(name = "Energy") # returns the rows whose "Sector" equals the argument passed to the "name" parameter

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
14,Chevron,Energy,Petroleum Refining,"San Ramon, CA",131118,4587,61500
30,Phillips 66,Energy,Petroleum Refining,"Houston, TX",87169,4227,14000
32,Valero Energy,Energy,Petroleum Refining,"San Antonio, TX",81824,3990,10103
42,Marathon Petroleum,Energy,Petroleum Refining,"Findlay, OH",64566,2852,45440
...,...,...,...,...,...,...,...
981,WPX Energy,Energy,"Mining, Crude-Oil Production","Tulsa, OK",1958,-1727,1040
983,Adams Resources & Energy,Energy,Petroleum Refining,"Houston, TX",1944,-1,809
995,EP Energy,Energy,"Mining, Crude-Oil Production","Houston, TX",1908,-3748,665
997,Portland General Electric,Energy,Utilities: Gas and Electric,"Portland, OR",1898,172,2646
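For intuition, `.get_group()` selects the same rows as a boolean filter on the grouping column. This is a minimal sketch on invented data (the `toy` frame is an assumption), grouped by the plain string `"Sector"`.

```python
import pandas as pd

# Hypothetical data, not taken from fortune1000.csv.
toy = pd.DataFrame(
    {
        "Company": ["A Corp", "B Corp", "C Corp"],
        "Sector": ["Energy", "Retailing", "Energy"],
    },
    index=pd.Index([1, 2, 3], name="Rank"),
)

energy_group = toy.groupby("Sector").get_group("Energy")
energy_mask = toy[toy["Sector"] == "Energy"]   # equivalent boolean filter

print(list(energy_group.index) == list(energy_mask.index))
```

The filter is simpler for a one-off lookup; `.get_group()` is convenient when the GroupBy object already exists.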


## **Methods on the Groupby Object and Dataframe Columns**

In [49]:
fortune = pd.read_csv("fortune1000.csv", index_col = ["Rank"])
sectors = fortune.groupby(by = ["Sector"])
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


In [53]:
# The two methods below work on every column, whatever its data type.
# Each column is reduced independently within a group, so the values in one output row can come from different companies.

sectors.max() # for string columns such as 'Company', returns the value that sorts last alphabetically within each group
sectors.min() # only this last expression is displayed by the notebook, so the table below is the result of .min()

Sector,Company,Industry,Location,Revenue,Profits,Employees
Aerospace & Defense,B/E Aerospace,Aerospace and Defense,"Berwyn, PA",1923,-240,6955
Apparel,Carter’s,Apparel,"Atlanta, GA",2204,82,5978
Business Services,ABM Industries,"Advertising, marketing","Arlington, VA",1910,-1481,2400
Chemicals,A. Schulman,Chemicals,"Allentown, PA",2084,-816,1979
Energy,AES,Energy,"Akron, OH",1898,-23119,480
Engineering & Construction,AECOM,"Engineering, Construction","Atlanta, GA",1909,-155,1036
Financials,AIG,Commercial Banks,"Atlanta, GA",1902,-1194,187
Food and Drug Stores,CVS Health,Food and Drug Stores,"Austin, TX",2151,-62,1616
"Food, Beverages & Tobacco",Alliance One International,Beverages,"Arden Hills, MN",2066,-253,1857
Health Care,AbbVie,Health Care: Insurance and Managed Care,"Abbott Park, IL",1987,-458,2924
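Because each column is reduced on its own, a row of `.min()` output does not describe a single company. The sketch below, on invented data (the `toy` frame is an assumption), shows this and one possible way to fetch the complete row with the lowest revenue per group using `idxmin()`.

```python
import pandas as pd

# Hypothetical data: in "Energy", the alphabetically first company
# is not the one with the lowest revenue.
toy = pd.DataFrame(
    {
        "Company": ["Zeta", "Alpha", "Mid"],
        "Sector": ["Energy", "Energy", "Retailing"],
        "Revenue": [10, 50, 30],
    },
    index=pd.Index([1, 2, 3], name="Rank"),
)

grouped = toy.groupby("Sector")

# Column-wise minimum: "Alpha" and 10 come from two different rows.
col_mins = grouped.min()

# Whole rows holding the lowest revenue in each group.
lowest_rows = toy.loc[grouped["Revenue"].idxmin()]

print(col_mins)
print(lowest_rows)
```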


In [70]:
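# Methods that apply only to numeric columns (in pandas 2.0 and later, pass numeric_only=True to exclude the text columns; only the last call, .std(), is displayed below)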

sectors.sum()
sectors.median()
sectors.mean()
sectors.std()

Sector,Revenue,Profits,Employees
Aerospace & Defense,24468.237017,2026.281788,55389.486777
Apparel,7254.770616,810.479441,21833.228179
Business Services,4102.377498,1058.517026,37717.982486
Chemicals,9575.323185,1481.756552,15644.31444
Energy,27362.358215,3872.84622,12949.476412
Engineering & Construction,4579.682742,229.344283,20625.225235
Financials,27105.928018,4108.831303,49070.50437
Food and Drug Stores,48595.512604,1677.128944,129049.965843
"Food, Beverages & Tobacco",15732.585231,2032.684843,46494.442039
Health Care,35680.266614,3025.285364,40300.162385


#### When applying mathematical methods to a GroupBy object, it is best to select the exact columns to aggregate.

In [65]:
sectors["Revenue"].sum()

Sector
Aerospace & Defense              357940
Apparel                           95968
Business Services                272195
Chemicals                        243897
Energy                          1517809
Engineering & Construction       153983
Financials                      2217159
Food and Drug Stores             483769
Food, Beverages & Tobacco        555967
Health Care                     1614707
Hotels, Resturants & Leisure     169546
Household Products               234737
Industrials                      497581
Materials                        259145
Media                            220764
Motor Vehicles & Parts           482540
Retailing                       1465076
Technology                      1377600
Telecommunications               461834
Transportation                   408508
Wholesalers                      444800
Name: Revenue, dtype: int64

In [68]:
sectors[["Revenue", "Profits"]].sum()

Sector,Revenue,Profits
Aerospace & Defense,357940,28742
Apparel,95968,8236
Business Services,272195,28227
Chemicals,243897,22628
Energy,1517809,-73447
Engineering & Construction,153983,5304
Financials,2217159,260209
Food and Drug Stores,483769,16759
"Food, Beverages & Tobacco",555967,51417
Health Care,1614707,106114
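The two selection styles above return different kinds of objects. This small sketch on invented data (the `toy` frame is an assumption) shows that a single column name yields a Series, while a list of names yields a DataFrame.

```python
import pandas as pd

# Hypothetical data, not taken from fortune1000.csv.
toy = pd.DataFrame(
    {
        "Sector": ["Energy", "Energy", "Retailing"],
        "Revenue": [10, 30, 5],
        "Profits": [1, 3, 2],
    }
)

grouped = toy.groupby("Sector")

revenue_total = grouped["Revenue"].sum()              # one column name: Series
both_totals = grouped[["Revenue", "Profits"]].sum()   # list of names: DataFrame

print(type(revenue_total).__name__)
print(type(both_totals).__name__)
```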


## **Grouping by Multiple Columns**

In [75]:
fortune = pd.read_csv("fortune1000.csv", index_col = ["Rank"])
sectors = fortune.groupby(by = ["Sector", "Industry"]) # column order matters, just as it does when building a MultiIndex
fortune.head(3)

Rank,Company,Sector,Industry,Location,Revenue,Profits,Employees
1,Walmart,Retailing,General Merchandisers,"Bentonville, AR",482130,14694,2300000
2,Exxon Mobil,Energy,Petroleum Refining,"Irving, TX",246204,16150,75600
3,Apple,Technology,"Computers, Office Equipment","Cupertino, CA",233715,53394,110000


In [73]:
sectors.groups

{('Aerospace & Defense', 'Aerospace and Defense'): [24, 45, 60, 88, 118, 120, 209, 245, 282, 378, 389, 490, 560, 605, 785, 788, 836, 903, 958, 987], ('Apparel', 'Apparel'): [91, 231, 340, 354, 448, 547, 575, 597, 683, 695, 726, 794, 877, 882, 917], ('Business Services', 'Advertising, marketing'): [186, 355], ('Business Services', 'Diversified Outsourcing Services'): [199, 248, 485, 545, 626, 635, 714, 729, 744, 783, 803, 816, 819, 869], ('Business Services', 'Education'): [737, 820, 993], ('Business Services', 'Financial Data Services'): [204, 249, 294, 307, 392, 404, 468, 481, 492, 652, 694, 767, 776, 777, 792, 796, 801, 886, 952], ('Business Services', 'Miscellaneous'): [440, 734, 870], ('Business Services', 'Temporary Help'): [144, 467, 503, 791, 951], ('Business Services', 'Waste Management'): [221, 312, 677, 735, 939], ('Chemicals', 'Chemicals'): [56, 101, 182, 189, 206, 253, 262, 277, 288, 296, 316, 538, 549, 555, 566, 580, 613, 624, 654, 668, 717, 720, 724, 758, 761, 829, 865, 8

In [74]:
sectors.size() # the groups are now labeled by a MultiIndex

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            20
Apparel              Apparel                                          15
Business Services    Advertising, marketing                            2
                     Diversified Outsourcing Services                 14
                     Education                                         3
                                                                      ..
Transportation       Trucking, Truck Leasing                           9
Wholesalers          Miscellaneous                                     1
                     Wholesalers: Diversified                         25
                     Wholesalers: Electronics and Office Equipment     8
                     Wholesalers: Food and Grocery                     6
Length: 79, dtype: int64
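A short sketch of how such a result can be navigated, using an invented frame (the `toy` data is an assumption): the outer level can be selected with `.loc`, and swapping the column order swaps the index levels.

```python
import pandas as pd

# Hypothetical data with two grouping columns.
toy = pd.DataFrame(
    {
        "Sector": [
            "Business Services", "Business Services",
            "Business Services", "Apparel",
        ],
        "Industry": ["Education", "Education", "Waste Management", "Apparel"],
        "Revenue": [1, 2, 4, 8],
    }
)

sizes = toy.groupby(["Sector", "Industry"]).size()

# Selecting one outer label returns a Series indexed by the inner level.
inner = sizes.loc["Business Services"]

# Reversing the column order reverses the index levels.
swapped = toy.groupby(["Industry", "Sector"]).size()

print(sizes)
print(inner)
print(list(swapped.index.names))
```

Only combinations that actually occur in the data become groups, which is why the Fortune result has 79 rows rather than one per possible sector and industry pair.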

In [76]:
sectors.sum()

Sector,Industry,Revenue,Profits,Employees
Aerospace & Defense,Aerospace and Defense,357940,28742,968057
Apparel,Apparel,95968,8236,346397
Business Services,"Advertising, marketing",22748,1549,124100
Business Services,Diversified Outsourcing Services,64829,4305,708330
Business Services,Education,7485,69,46755
...,...,...,...,...
Transportation,"Trucking, Truck Leasing",35950,1910,170456
Wholesalers,Miscellaneous,8982,17,9200
Wholesalers,Wholesalers: Diversified,176138,5193,233831
Wholesalers,Wholesalers: Electronics and Office Equipment,147906,1857,166661


In [78]:
sectors["Revenue"].sum()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            357940
Apparel              Apparel                                           95968
Business Services    Advertising, marketing                            22748
                     Diversified Outsourcing Services                  64829
                     Education                                          7485
                                                                       ...  
Transportation       Trucking, Truck Leasing                           35950
Wholesalers          Miscellaneous                                      8982
                     Wholesalers: Diversified                         176138
                     Wholesalers: Electronics and Office Equipment    147906
                     Wholesalers: Food and Grocery                    111774
Name: Revenue, Length: 79, dtype: int64

In [79]:
sectors["Employees"].mean()

Sector               Industry                                     
Aerospace & Defense  Aerospace and Defense                            48402.850000
Apparel              Apparel                                          23093.133333
Business Services    Advertising, marketing                           62050.000000
                     Diversified Outsourcing Services                 50595.000000
                     Education                                        15585.000000
                                                                          ...     
Transportation       Trucking, Truck Leasing                          18939.555556
Wholesalers          Miscellaneous                                     9200.000000
                     Wholesalers: Diversified                          9353.240000
                     Wholesalers: Electronics and Office Equipment    20832.625000
                     Wholesalers: Food and Grocery                    19317.500000
Name: Employees, Len