# Grouping data in pandas

You can group and aggregate data in pandas in ways that will be familiar if you've ever built a pivot table in Excel or written a `GROUP BY` statement in SQL.

In this notebook, we'll use the eel import data that lives at `../data/eels.csv`.

- [value_counts()](#value_counts())
- [groupby()](#groupby())
- [Grouping by multiple columns](#Grouping-by-multiple-columns)
- [pivot_table()](#pivot_table())

In [3]:
# import pandas
import pandas as pd

In [4]:
# read the CSV into a data frame
df = pd.read_csv('../data/eels.csv')

In [5]:
# check the output with `head()`
df.head()

|   | year | month | country | product | kilos | dollars |
|---|---|---|---|---|---|---|
| 0 | 2010 | 1 | CHINA | EELS FROZEN | 49087 | 393583 |
| 1 | 2010 | 1 | JAPAN | EELS FRESH | 263 | 7651 |
| 2 | 2010 | 1 | TAIWAN | EELS FROZEN | 9979 | 116359 |
| 3 | 2010 | 1 | VIETNAM | EELS FRESH | 1938 | 10851 |
| 4 | 2010 | 1 | VIETNAM | EELS FROZEN | 21851 | 69955 |


### `value_counts()`

If all you need to do is count the occurrences of a value in a column, you can use the [`value_counts()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html) method.

In our eel data, every row is one month's worth of shipments of a particular eel product from one country. Let's count how many rows each country has in the data. (A country can show up more than once in the same month if it shipped more than one product.)

In [8]:
# get value counts of country column
df.country.value_counts()

CHINA                187
JAPAN                145
CANADA               101
VIETNAM               84
TAIWAN                77
PORTUGAL              76
SOUTH KOREA           65
THAILAND              25
SPAIN                 14
NEW ZEALAND            5
NORWAY                 4
BANGLADESH             3
PANAMA                 3
MEXICO                 3
POLAND                 2
UKRAINE                2
CHILE                  2
BURMA                  1
INDIA                  1
SENEGAL                1
PAKISTAN               1
PHILIPPINES            1
COSTA RICA             1
CHINA - HONG KONG      1
Name: country, dtype: int64
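
If you'd like to poke at `value_counts()` without the eel file, here's a minimal sketch on a made-up data frame. The `toy` frame and its numbers are hypothetical, not taken from `eels.csv`. It also shows the `normalize=True` option, which returns each value's share of the rows instead of a raw count.

```python
import pandas as pd

# Hypothetical toy data, NOT the real eel data
toy = pd.DataFrame({
    'country': ['CHINA', 'JAPAN', 'CHINA', 'VIETNAM', 'CHINA', 'JAPAN'],
    'kilos': [100, 20, 300, 50, 250, 10],
})

# Raw row counts per country, most frequent first
counts = toy['country'].value_counts()

# Share of rows per country instead of raw counts
shares = toy['country'].value_counts(normalize=True)

print(counts)
print(shares)
```

Bracket notation (`toy['country']`) and dot notation (`toy.country`) select the same column here; brackets also work when a column name has spaces in it.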

### `groupby()`

Let's group the data by country and sum the kilos for each country.

If this were a pivot table in Excel, we'd drag the `country` column into Rows and the `kilos` column into Values, then summarize by Sum.

If this were SQL, we might write something like:

```sql
SELECT country, SUM(kilos)
FROM eels
GROUP BY country
ORDER BY 2 desc
```

Let's do the same thing in pandas using [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html):

- Select our two columns of interest (`country` and `kilos`) by passing a list of column names
- Call the `groupby()` method and hand it the grouping column (`country`)
- Call the `sum()` method
- Sort by kilos descending

In [4]:
df[['country', 'kilos']].groupby('country').sum().sort_values('kilos', ascending=False)

| country | kilos |
|---|---|
| CHINA | 15965996 |
| VIETNAM | 637737 |
| TAIWAN | 442740 |
| JAPAN | 361364 |
| CANADA | 346075 |
| SOUTH KOREA | 243540 |
| THAILAND | 137556 |
| PORTUGAL | 41453 |
| PAKISTAN | 22453 |
| MEXICO | 20860 |
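
Here's a small, self-contained sketch of the same group-sum-sort pattern on made-up data (the `toy` frame is hypothetical). It also shows an alternative spelling: group first, then pick the column to aggregate. That version gives you back a Series rather than a data frame, so `sort_values()` doesn't need a column name.

```python
import pandas as pd

# Hypothetical toy data, NOT the real eel data
toy = pd.DataFrame({
    'country': ['CHINA', 'JAPAN', 'CHINA', 'VIETNAM', 'JAPAN'],
    'kilos': [100, 20, 300, 50, 10],
})

# Group by country, pick the kilos column, sum it, sort descending
totals = (
    toy.groupby('country')['kilos']
       .sum()
       .sort_values(ascending=False)
)

print(totals)
```

Whether you select columns before or after `groupby()` is mostly a matter of taste; both approaches produce the same totals.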


You can use other aggregations, too -- let's do the `median()`.

In [13]:
df[['country', 'kilos']].groupby('country').median().sort_values('kilos', ascending=False)

| country | kilos |
|---|---|
| CHINA | 40000.0 |
| PAKISTAN | 22453.0 |
| UKRAINE | 5707.0 |
| VIETNAM | 5326.5 |
| MEXICO | 5307.0 |
| TAIWAN | 4195.0 |
| CHILE | 3092.5 |
| NORWAY | 3062.0 |
| INDIA | 2200.0 |
| CANADA | 2063.0 |


... and you can do _multiple_ aggregations at once, too, if that's useful. Just use the `agg()` method and pass it a list with the names of the functions you'd like to compute on the numeric columns:

In [14]:
df[['country', 'kilos']].groupby('country').agg(['sum', 'median', 'mean'])

| country | kilos: sum | kilos: median | kilos: mean |
|---|---|---|---|
| BANGLADESH | 613 | 300.0 | 204.333333 |
| BURMA | 699 | 699.0 | 699.0 |
| CANADA | 346075 | 2063.0 | 3426.485149 |
| CHILE | 6185 | 3092.5 | 3092.5 |
| CHINA | 15965996 | 40000.0 | 85379.657754 |
| CHINA - HONG KONG | 735 | 735.0 | 735.0 |
| COSTA RICA | 563 | 563.0 | 563.0 |
| INDIA | 2200 | 2200.0 | 2200.0 |
| JAPAN | 361364 | 223.0 | 2492.165517 |
| MEXICO | 20860 | 5307.0 | 6953.333333 |
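
Passing a list to `agg()` gives you two levels of column headers (`kilos` on top, the function names underneath). If you'd rather have flat, readable column names, one option is pandas' "named aggregation" syntax, sketched here on made-up data. The `toy` frame and the output names `total_kilos` and `median_kilos` are hypothetical, chosen just for illustration.

```python
import pandas as pd

# Hypothetical toy data, NOT the real eel data
toy = pd.DataFrame({
    'country': ['CHINA', 'CHINA', 'CHINA', 'JAPAN', 'JAPAN'],
    'kilos': [100, 300, 200, 20, 10],
})

# Named aggregation: new_column_name=(source_column, function_name)
summary = toy.groupby('country').agg(
    total_kilos=('kilos', 'sum'),
    median_kilos=('kilos', 'median'),
)

print(summary)
```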


### Grouping by multiple columns

You can group by multiple columns! Just pass a _list_ of columns to the `groupby()` method instead of a single column name. If, for example, we want to get the total kilos by country by year, we could select our three columns of data, group by `country` and `year`, and call the `sum()` method. Like this:

In [8]:
df[['country', 'year', 'kilos']].groupby(['country', 'year']).sum()

| country | year | kilos |
|---|---|---|
| BANGLADESH | 2012 | 13 |
| BANGLADESH | 2015 | 600 |
| BURMA | 2016 | 699 |
| CANADA | 2010 | 13552 |
| CANADA | 2011 | 24968 |
| CANADA | 2012 | 110796 |
| CANADA | 2013 | 44455 |
| CANADA | 2014 | 31546 |
| CANADA | 2015 | 28619 |
| CANADA | 2016 | 68568 |
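
Grouping by two columns gives the result a two-level index (`country`, then `year`). Here's a minimal sketch on made-up data (the `toy` frame is hypothetical) that shows how to look up one country/year pair in that index, and how `as_index=False` keeps the grouping columns as ordinary columns if you'd rather have a flat table.

```python
import pandas as pd

# Hypothetical toy data, NOT the real eel data
toy = pd.DataFrame({
    'country': ['CANADA', 'CANADA', 'CANADA', 'BURMA'],
    'year': [2010, 2010, 2011, 2011],
    'kilos': [5, 7, 4, 9],
})

# Result is a Series indexed by (country, year)
by_both = toy.groupby(['country', 'year'])['kilos'].sum()

# Look up a single (country, year) pair with a tuple
canada_2010 = by_both.loc[('CANADA', 2010)]

# as_index=False keeps country and year as regular columns
flat = toy.groupby(['country', 'year'], as_index=False)['kilos'].sum()

print(by_both)
print(flat)
```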


### `pivot_table()`

... which is fine and all, but there's a more intuitive way to look at this data, I think: using the [`pivot_table()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html) method.

If we were making this pivot table in Excel, we would drag `country` to Rows, `kilos` to Values and `year` to Columns. But we're gonna do it in pandas. We're gonna hand the `pivot_table()` method five things:
- The data frame you're pivoting (`df`)
- The `index` column -- what to group your data by (`index='country'`)
- The `columns` column -- the second grouping factor (`columns='year'`)
- The `values` column -- what column are we doing math on? (`values='kilos'`)
- The `aggfunc` -- what function to use to aggregate the data; the default is to use an average, but we'll use Python's built-in `sum` function

Then we'll sort the results by the latest year of data -- 2017 -- and fill null values with zeroes.

In [10]:
pd.pivot_table(df,
               index='country',
               columns='year',
               values='kilos',
               aggfunc=sum).sort_values(2017, ascending=False) \
                           .fillna(0)

| country / year | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 |
|---|---|---|---|---|---|---|---|---|
| CHINA | 372397.0 | 249232.0 | 1437392.0 | 1090135.0 | 1753140.0 | 4713882.0 | 4578546.0 | 1771272.0 |
| TAIWAN | 73842.0 | 0.0 | 53774.0 | 39752.0 | 83478.0 | 48272.0 | 99535.0 | 44087.0 |
| SOUTH KOREA | 42929.0 | 41385.0 | 28146.0 | 27353.0 | 37708.0 | 8386.0 | 14729.0 | 42904.0 |
| JAPAN | 1326.0 | 2509.0 | 32255.0 | 105758.0 | 40177.0 | 69699.0 | 71748.0 | 37892.0 |
| THAILAND | 2866.0 | 5018.0 | 9488.0 | 4488.0 | 15110.0 | 41771.0 | 26931.0 | 31884.0 |
| VIETNAM | 63718.0 | 155488.0 | 118063.0 | 100828.0 | 38112.0 | 36859.0 | 96179.0 | 28490.0 |
| CANADA | 13552.0 | 24968.0 | 110796.0 | 44455.0 | 31546.0 | 28619.0 | 68568.0 | 23571.0 |
| PORTUGAL | 2081.0 | 3672.0 | 2579.0 | 2041.0 | 7215.0 | 8013.0 | 9105.0 | 6747.0 |
| PANAMA | 0.0 | 0.0 | 0.0 | 11849.0 | 0.0 | 0.0 | 0.0 | 974.0 |
| BANGLADESH | 0.0 | 0.0 | 13.0 | 0.0 | 0.0 | 600.0 | 0.0 | 0.0 |
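
For a version you can run without the eel file, here's a minimal sketch of the same pivot on made-up data (the `toy` frame is hypothetical). It makes two small substitutions that are choices of this sketch, not requirements: it passes the string `'sum'` as the `aggfunc`, since recent pandas releases may warn when handed Python's built-in `sum`, and it uses the `fill_value` argument to fill in missing country/year combinations instead of chaining `fillna(0)`.

```python
import pandas as pd

# Hypothetical toy data, NOT the real eel data
toy = pd.DataFrame({
    'country': ['CHINA', 'CHINA', 'CHINA', 'JAPAN'],
    'year': [2016, 2017, 2017, 2016],
    'kilos': [100, 300, 50, 20],
})

# Rows = country, columns = year, cells = summed kilos;
# country/year pairs with no data are filled with 0
wide = pd.pivot_table(toy,
                      index='country',
                      columns='year',
                      values='kilos',
                      aggfunc='sum',
                      fill_value=0).sort_values(2017, ascending=False)

print(wide)
```

Note that the year columns are labeled with integers, so you sort by `2017`, not `'2017'`.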
