---
title: Grouping and Sorting
tags: [jupyter]
keywords: pandas
summary: "Gaining deeper insight into more complex data using grouping and sorting functions."
mlType: dataFrame
infoType: pandas
sidebar: pandas_sidebar
permalink: __AutoGenThis__
notebookfilename:  __AutoGenThis__
---

In [1]:
import sys

sys.path.append("../")

In [2]:
import pandas as pd
from pprint import pprint

# Pandas Options

In [3]:
# Use the full option name; the short form 'max_rows' can match more than one option.
pd.set_option('display.max_rows', 8)

# I/O

In [4]:
reviews = pd.read_csv("../data/winemag-data-130k-v2.csv", index_col=0)

In [5]:
reviews.head()

Unnamed: 0,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
0,Italy,"Aromas include tropical fruit, broom, brimston...",Vulkà Bianco,87,,Sicily & Sardinia,Etna,,Kerin O’Keefe,@kerinokeefe,Nicosia 2013 Vulkà Bianco (Etna),White Blend,Nicosia
1,Portugal,"This is ripe and fruity, a wine that is smooth...",Avidagos,87,15.0,Douro,,,Roger Voss,@vossroger,Quinta dos Avidagos 2011 Avidagos Red (Douro),Portuguese Red,Quinta dos Avidagos
2,US,"Tart and snappy, the flavors of lime flesh and...",,87,14.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Rainstorm 2013 Pinot Gris (Willamette Valley),Pinot Gris,Rainstorm
3,US,"Pineapple rind, lemon pith and orange blossom ...",Reserve Late Harvest,87,13.0,Michigan,Lake Michigan Shore,,Alexander Peartree,,St. Julian 2013 Reserve Late Harvest Riesling ...,Riesling,St. Julian
4,US,"Much like the regular bottling from 2012, this...",Vintner's Reserve Wild Child Block,87,65.0,Oregon,Willamette Valley,Willamette Valley,Paul Gregutt,@paulgwine,Sweet Cheeks 2012 Vintner's Reserve Wild Child...,Pinot Noir,Sweet Cheeks


# Groupwise Analysis

**groupby()** splits the data into groups that share a value in one or more columns. You can then apply the same summary methods as before, such as mean() or count(), to each group.

For the purpose of this demonstration, let us try to replicate the **value_counts()** method.  First we group the reviews by points, and then for each group we count the number of rows in that group.

In [9]:
reviews.groupby('points').points.count()

points
80      397
81      692
82     1836
83     3025
       ... 
97      229
98       77
99       33
100      19
Name: points, Length: 21, dtype: int64
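
As a minimal sketch of the equivalence described above, the following uses a small made-up DataFrame (the `toy` data is hypothetical, not the wine dataset) to compare the groupby count with **value_counts()**.  The only difference is ordering: **value_counts()** sorts by frequency by default, so we sort it by index before comparing.

```python
import pandas as pd

# Hypothetical miniature stand-in for the reviews data.
toy = pd.DataFrame({
    "points": [87, 87, 90, 88, 90, 87],
    "price": [15.0, None, 30.0, 20.0, 45.0, 12.0],
})

# Count rows per distinct score; the result is ordered by the group key.
by_group = toy.groupby("points").points.count()

# value_counts() orders by frequency, so sort by index to line the two up.
by_value_counts = toy.points.value_counts().sort_index()

print(by_group)
print(by_value_counts)
```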

You can think of each group we generate as a slice of our DF containing only the rows whose values match.  This slice is accessible to us directly through the **apply()** method, and we can then manipulate the data in any way we like.

For instance, here is how to select the name of the first wine reviewed from each winery in the dataset.

In [18]:
reviews.groupby('winery').apply(lambda df:df.title.iloc[0])

winery
1+1=3                                     1+1=3 NV Rosé Sparkling (Cava)
10 Knots                            10 Knots 2010 Viognier (Paso Robles)
100 Percent Wine              100 Percent Wine 2015 Moscato (California)
1000 Stories           1000 Stories 2013 Bourbon Barrel Aged Zinfande...
                                             ...                        
Öko                    Öko 2013 Made With Organically Grown Grapes Ma...
Ökonomierat Rebholz    Ökonomierat Rebholz 2007 Von Rotliegenden Spät...
àMaurice               àMaurice 2013 Fred Estate Syrah (Walla Walla V...
Štoka                                    Štoka 2009 Izbrani Teran (Kras)
Length: 16757, dtype: object
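
The same pattern can be sketched on a tiny made-up DataFrame (the winery names and titles below are hypothetical).  Here the columns handed to **apply()** are selected explicitly, which keeps the grouping column out of each slice; this is an assumption about what suits recent pandas versions rather than something the notebook itself does.

```python
import pandas as pd

# Hypothetical data: two wineries with two reviews each.
toy = pd.DataFrame({
    "winery": ["Alpha", "Beta", "Alpha", "Beta"],
    "title": ["Alpha 2010 Red", "Beta 2012 White",
              "Alpha 2011 Rosé", "Beta 2013 Red"],
})

# Each group arrives in the lambda as a small DataFrame;
# we return the first title found in that slice.
first_titles = toy.groupby("winery")[["title"]].apply(
    lambda df: df.title.iloc[0]
)

print(first_titles)
```

Because the lambda returns a single value per group, the result is a Series indexed by winery.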

You can also group by more than one column.

For instance, this is how you select the best wine for each country and province.

In [19]:
colListToGroup = ['country','province']
reviews.groupby(colListToGroup).apply(lambda df: df.loc[df.points.idxmax()])

country,province,country,description,designation,points,price,province,region_1,region_2,taster_name,taster_twitter_handle,title,variety,winery
Argentina,Mendoza Province,Argentina,"If the color doesn't tell the full story, the ...",Nicasia Vineyard,97,120.0,Mendoza Province,Mendoza,,Michael Schachner,@wineschach,Bodega Catena Zapata 2006 Nicasia Vineyard Mal...,Malbec,Bodega Catena Zapata
Argentina,Other,Argentina,"Take note, this could be the best wine Colomé ...",Reserva,95,90.0,Other,Salta,,Michael Schachner,@wineschach,Colomé 2010 Reserva Malbec (Salta),Malbec,Colomé
Armenia,Armenia,Armenia,"Deep salmon in color, this wine offers a bouqu...",Estate Bottled,88,15.0,Armenia,,,Mike DeSimone,@worldwineguys,Van Ardi 2015 Estate Bottled Rosé (Armenia),Rosé,Van Ardi
Australia,Australia Other,Australia,Writes the book on how to make a wine filled w...,Sarah's Blend,93,15.0,Australia Other,South Eastern Australia,,,,Marquis Philips 2000 Sarah's Blend Red (South ...,Red Blend,Marquis Philips
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Uruguay,Montevideo,Uruguay,"A rich, heady bouquet offers aromas of blackbe...",Monte Vide Eu Tannat-Merlot-Tempranillo,91,60.0,Montevideo,,,Michael Schachner,@wineschach,Bouza 2015 Monte Vide Eu Tannat-Merlot-Tempran...,Red Blend,Bouza
Uruguay,Progreso,Uruguay,"Rusty in color but deep and complex in nature,...",Etxe Oneko Fortified Sweet Red,90,46.0,Progreso,,,Michael Schachner,@wineschach,Pisano 2007 Etxe Oneko Fortified Sweet Red Tan...,Tannat,Pisano
Uruguay,San Jose,Uruguay,"Baked, sweet, heavy aromas turn earthy with ti...",El Preciado Gran Reserva,87,50.0,San Jose,,,Michael Schachner,@wineschach,Castillo Viejo 2005 El Preciado Gran Reserva R...,Red Blend,Castillo Viejo
Uruguay,Uruguay,Uruguay,"Cherry and berry aromas are ripe, healthy and ...",Blend 002 Limited Edition,91,22.0,Uruguay,,,Michael Schachner,@wineschach,Narbona NV Blend 002 Limited Edition Tannat-Ca...,Tannat-Cabernet Franc,Narbona


Note that the index of the new DF is made up of the country and province that we grouped by.  Within each group, the lambda function uses **idxmax()** to find the label of the row with the highest points and returns that row.
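
A possible alternative, sketched here on hypothetical data, is to skip **apply()** and ask the grouped `points` column for its **idxmax()** directly, then use those labels to pull the matching rows from the original DF.  This is a swapped-in technique, not the approach used in the cell above, and the result keeps the original row labels instead of a country/province index.

```python
import pandas as pd

# Hypothetical reviews: w0..w4 are placeholder wine titles.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Italy", "Italy"],
    "province": ["Oregon", "Oregon", "California", "Tuscany", "Tuscany"],
    "points": [87, 92, 90, 88, 95],
    "title": ["w0", "w1", "w2", "w3", "w4"],
})

# For every (country, province) pair, find the row label of the top score.
best_idx = toy.groupby(["country", "province"]).points.idxmax()

# Use those labels to select the full rows from the original frame.
best = toy.loc[best_idx]

print(best)
```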

## agg()

This method lets us run several different functions on each group in a single call.  This is useful when we want a quick statistical summary.

In [23]:
col = ['country','province']
functionsToRun = [len,min,max]
reviews.groupby(col).price.agg(functionsToRun)

country,province,len,min,max
Argentina,Mendoza Province,3264.0,4.0,230.0
Argentina,Other,536.0,7.0,150.0
Armenia,Armenia,2.0,14.0,15.0
Australia,Australia Other,245.0,5.0,130.0
...,...,...,...,...
Uruguay,Montevideo,11.0,17.0,60.0
Uruguay,Progreso,11.0,12.0,46.0
Uruguay,San Jose,3.0,20.0,50.0
Uruguay,Uruguay,24.0,12.0,50.0


Notice that the functions in the list are written without **()**.  Each name is a reference to the function, which **agg()** calls later on every group; adding the parentheses would call the function immediately, at the point where the list is built.
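
The sketch below, on hypothetical prices, mixes a function reference (`len`) with string names (`"count"`, `"min"`, `"max"`), which **agg()** also accepts.  It also illustrates one subtlety: `len` counts every row in the group, including rows with a missing price, while `"count"` skips them.  That would be consistent with rows in the output above where `len` is 1 but `min` and `max` are empty.

```python
import pandas as pd

# Hypothetical data with one missing price in the US group.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "Italy", "Italy"],
    "price": [10.0, None, 30.0, 25.0, 40.0],
})

# len is passed as a function reference (no parentheses);
# the strings name built-in pandas aggregations.
summary = toy.groupby("country").price.agg([len, "count", "min", "max"])

print(summary)
```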

# Multi-Indexes

So far, all the examples have worked with a DF or Series that has a single-label index.  **groupby()** is slightly different in that, depending on the operation we run, it sometimes returns what is called a **multi-index**, as in the example above.

Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices.  They also require two levels of labels to retrieve a value.

To revert back to regular indexing you can use the method **reset_index()**.

In [26]:
col = ['country','province']
functionsToRun = [len,min,max]
multIndex = reviews.groupby(col).price.agg(functionsToRun) 

In [28]:
type(multIndex.index)

pandas.core.indexes.multi.MultiIndex

In [29]:
multIndex.reset_index()

Unnamed: 0,country,province,len,min,max
0,Argentina,Mendoza Province,3264.0,4.0,230.0
1,Argentina,Other,536.0,7.0,150.0
2,Armenia,Armenia,2.0,14.0,15.0
3,Australia,Australia Other,245.0,5.0,130.0
...,...,...,...,...,...
421,Uruguay,Montevideo,11.0,17.0,60.0
422,Uruguay,Progreso,11.0,12.0,46.0
423,Uruguay,San Jose,3.0,20.0,50.0
424,Uruguay,Uruguay,24.0,12.0,50.0


Notice that country and province are now two ordinary columns instead of index levels.
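
To make the two-level lookup concrete, here is a minimal sketch on hypothetical data.  A tuple of labels retrieves a single row, a lone outer label retrieves every inner row beneath it, and **reset_index()** flattens the result.

```python
import pandas as pd

# Hypothetical data: three (country, province) groups.
toy = pd.DataFrame({
    "country": ["US", "US", "Italy", "Italy"],
    "province": ["Oregon", "California", "Tuscany", "Tuscany"],
    "price": [14.0, 65.0, 20.0, 50.0],
})

grouped = toy.groupby(["country", "province"]).price.agg(["min", "max"])

# Both levels are needed to identify exactly one row.
tuscany = grouped.loc[("Italy", "Tuscany")]

# One outer label returns all provinces for that country.
us_rows = grouped.loc["US"]

# reset_index() moves both index levels back into columns.
flat = grouped.reset_index()

print(tuscany)
print(us_rows)
print(flat)
```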

# Sorting

Looking again at the grouped output above, we can see that grouping returns data in index order, not in value order.  That is to say, when outputting the results of a **groupby()**, the order of the rows depends on the values in the index, not on the data.

To get data in the order we want it in we can sort it ourselves using the **sort_values()** method.

In [31]:
multIndex.sort_values(by='len',
                     ascending=False)

country,province,len,min,max
US,California,36247.0,4.0,2013.0
US,Washington,8639.0,6.0,240.0
France,Bordeaux,5941.0,6.0,3300.0
Italy,Tuscany,5897.0,6.0,900.0
...,...,...,...,...
New Zealand,Gladstone,1.0,16.0,16.0
South Africa,Piekenierskloof,1.0,,
Chile,Coelemu,1.0,25.0,25.0
Greece,Beotia,1.0,27.0,27.0


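To get back to index order, we can sort by the index with **sort_index()**.  Here we first reset the index so that the rows are labelled with plain integers.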

In [32]:
multIndex = multIndex.reset_index()

In [33]:
multIndex.sort_index()

Unnamed: 0,country,province,len,min,max
0,Argentina,Mendoza Province,3264.0,4.0,230.0
1,Argentina,Other,536.0,7.0,150.0
2,Armenia,Armenia,2.0,14.0,15.0
3,Australia,Australia Other,245.0,5.0,130.0
...,...,...,...,...,...
421,Uruguay,Montevideo,11.0,17.0,60.0
422,Uruguay,Progreso,11.0,12.0,46.0
423,Uruguay,San Jose,3.0,20.0,50.0
424,Uruguay,Uruguay,24.0,12.0,50.0
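
As a small sketch on hypothetical data, sorting by value shuffles the row labels, and **sort_index()** puts them back in their original order.  Both methods return a new DF and leave the original untouched.

```python
import pandas as pd

# Hypothetical flat frame with the default 0, 1, 2 index.
toy = pd.DataFrame({
    "province": ["Oregon", "California", "Tuscany"],
    "len": [2, 5, 3],
})

# Largest group first; the row labels travel with their rows.
by_len = toy.sort_values(by="len", ascending=False)

# Sorting by index restores the original row order.
restored = by_len.sort_index()

print(by_len)
print(restored)
```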


You can also sort by more than one column at a time.

In [35]:
multIndex.sort_values(by=['country', 'len'])

Unnamed: 0,country,province,len,min,max
1,Argentina,Other,536.0,7.0,150.0
0,Argentina,Mendoza Province,3264.0,4.0,230.0
2,Armenia,Armenia,2.0,14.0,15.0
6,Australia,Tasmania,42.0,16.0,130.0
...,...,...,...,...,...
422,Uruguay,Progreso,11.0,12.0,46.0
420,Uruguay,Juanico,12.0,10.0,130.0
424,Uruguay,Uruguay,24.0,12.0,50.0
419,Uruguay,Canelones,43.0,12.0,65.0
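
When sorting by several columns, `ascending` can also be a list with one flag per column.  The sketch below, on hypothetical data, sorts countries alphabetically and, within each country, puts the largest group first.

```python
import pandas as pd

# Hypothetical group sizes per (country, province).
toy = pd.DataFrame({
    "country": ["US", "Italy", "US", "Italy"],
    "province": ["Oregon", "Tuscany", "California", "Sicily"],
    "len": [5, 9, 36, 4],
})

# country ascending, then len descending within each country.
ordered = toy.sort_values(by=["country", "len"], ascending=[True, False])

print(ordered)
```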


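# idxmax()

**idxmax()** returns the index label of the row holding the maximum value.  If the maximum occurs more than once, the label of its first occurrence is returned.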
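
A minimal sketch with a hypothetical Series of scores, where the top score appears twice:

```python
import pandas as pd

# Hypothetical scores; w1 and w3 tie for the maximum.
points = pd.Series([87, 95, 90, 95], index=["w0", "w1", "w2", "w3"])

# idxmax() gives the label, not the value, of the first maximum.
best_label = points.idxmax()

# The label can then be used with .loc to fetch the value itself.
best_value = points.loc[best_label]

print(best_label, best_value)
```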