# Pandas
- Solve short hands-on challenges to perfect your data manipulation skills.
- https://www.kaggle.com/learn/pandas

## 4.- Grouping and Sorting
- Scale up your level of insight. The more complex the dataset, the more this matters  

In [1]:
import numpy as np
import pandas as pd

print('np.__version__:', np.__version__)
print('pd.__version__:', pd.__version__)

#pd.set_option('display.max_rows', 5)

np.__version__: 1.23.5
pd.__version__: 1.5.3


In [2]:
reviews = pd.read_csv('Red.csv')
reviews.head(2)

|  | Name | Country | Region | Winery | Rating | NumberOfRatings | Price | Year |
|---|---|---|---|---|---|---|---|---|
| 0 | Pomerol 2011 | France | Pomerol | Château La Providence | 4.2 | 100 | 95.0 | 2011 |
| 1 | Lirac 2017 | France | Lirac | Château Mont-Redon | 4.3 | 100 | 15.5 | 2017 |


### Groupwise analysis

In [3]:
# .value_counts() on a string column (pandas object dtype)
print(reviews.Country.dtype)
reviews.Country.value_counts()

object


Italy            2650
France           2256
Spain            1142
South Africa      500
United States     374
Chile             326
Germany           248
Australia         246
Argentina         246
Portugal          230
Austria           220
New Zealand        63
Brazil             40
Romania            23
Lebanon            15
Israel             13
Greece             13
Switzerland        12
Hungary             9
Moldova             8
Slovenia            8
Turkey              6
Georgia             5
Uruguay             4
Croatia             2
Bulgaria            2
Canada              2
Mexico              1
China               1
Slovakia            1
Name: Country, dtype: int64

In [4]:
# .value_counts (numerical column)
print(reviews.Rating.dtype)
reviews.Rating.value_counts()

float64


3.8    1171
3.9    1044
3.7     994
4.0     905
4.1     887
3.6     754
4.2     743
4.3     483
3.5     477
4.4     301
3.4     275
3.3     184
4.5     149
4.6     104
3.2      89
3.1      30
4.7      28
3.0      27
2.9       6
4.8       6
2.8       5
2.5       2
2.6       1
2.7       1
Name: Rating, dtype: int64

In [5]:
# value_counts() is a shortcut for this groupby operation:
reviews.groupby('Rating').Rating.count().sort_values(ascending=False)

Rating
3.8    1171
3.9    1044
3.7     994
4.0     905
4.1     887
3.6     754
4.2     743
4.3     483
3.5     477
4.4     301
3.4     275
3.3     184
4.5     149
4.6     104
3.2      89
3.1      30
4.7      28
3.0      27
4.8       6
2.9       6
2.8       5
2.5       2
2.6       1
2.7       1
Name: Rating, dtype: int64
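
The equivalence between value_counts() and the groupby count can be checked on a small example. This is only a sketch: the `toy` DataFrame below is made-up data, not the wine dataset. It compares the two results as dictionaries because tied counts may come back in a different order (as 2.9 and 4.8 do in the outputs above).

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
toy = pd.DataFrame({
    'Rating': [3.8, 3.9, 3.8, 4.0, 3.8, 3.9],
    'Price': [10.0, 12.5, 8.0, 20.0, 9.5, 11.0],
})

# Count the rows per distinct Rating in two ways.
via_value_counts = toy.Rating.value_counts()
via_groupby = toy.groupby('Rating').Rating.count().sort_values(ascending=False)

# Compare label -> count mappings, which ignores row order.
same = via_value_counts.to_dict() == via_groupby.to_dict()
print(same)
```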

In [6]:
# Likewise, to get the cheapest wine in each rating category:
reviews.groupby('Rating').Price.min()

Rating
2.5      7.50
2.6     11.50
2.7      9.12
2.8      7.32
2.9      5.95
3.0      4.69
3.1      3.70
3.2      3.55
3.3      4.30
3.4      3.95
3.5      4.28
3.6      4.50
3.7      3.99
3.8      3.79
3.9      4.55
4.0      5.43
4.1      5.81
4.2      5.65
4.3      9.26
4.4     14.35
4.5     21.90
4.6     29.99
4.7     53.35
4.8    324.95
Name: Price, dtype: float64

In [7]:
# Is the best bargain the 3.8-rated wine priced at 3.79? Check the highest Rating/Price ratio: yes.
ratio = reviews.Rating / reviews.Price   # compute the ratio once and reuse it
barg_ix = ratio.idxmax()                 # row label of the best ratio
barg_ratio = ratio.max()                 # value of the best ratio
print(reviews.loc[ratio == barg_ratio].index.item(), ' - ', barg_ix)
reviews.loc[ratio == barg_ratio]



6408  -  6408


|  | Name | Country | Region | Winery | Rating | NumberOfRatings | Price | Year |
|---|---|---|---|---|---|---|---|---|
| 6408 | Nero d'Avola 2018 | Italy | Terre Siciliane | Monte Pietroso | 3.8 | 56 | 3.79 | 2018 |


In [8]:
# Back to the cheapest wine in each Rating category:
reviews.groupby('Rating').Price.min()

Rating
2.5      7.50
2.6     11.50
2.7      9.12
2.8      7.32
2.9      5.95
3.0      4.69
3.1      3.70
3.2      3.55
3.3      4.30
3.4      3.95
3.5      4.28
3.6      4.50
3.7      3.99
3.8      3.79
3.9      4.55
4.0      5.43
4.1      5.81
4.2      5.65
4.3      9.26
4.4     14.35
4.5     21.90
4.6     29.99
4.7     53.35
4.8    324.95
Name: Price, dtype: float64

You can think of each group we generate as a slice of our DataFrame containing only the rows whose values match. The apply() method gives us direct access to each of these slices, and we can then manipulate the data in any way we see fit.

In [9]:
# Here's one way of selecting the name of the first wine reviewed from each Region in the dataset:
reviews.groupby('Region').apply(lambda df: df.Name.iloc[0])


Region
Aargau                              Alpberg Pinot Noir 2017
Abruzzo                                   Appassimento 2017
Aconcagua                                        Brisa 2016
Aconcagua Costa                  Aconcagua Costa Syrah 2014
Aconcagua Valley       Don Maximiano Founder's Reserve 2015
                                       ...                 
Yecla                                Hécula Monastrell 2017
Zürich                             Compleo Cuvée Noire 2018
delle Venezie                                   Merlot 2018
Échezeaux Grand Cru                Echézeaux Grand Cru 2013
Štajerska                           Cuvée Benedict Red 2012
Length: 624, dtype: object
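
A minimal sketch of the same idea on made-up data (the `toy` DataFrame and its wine names are hypothetical). Here the Name column is selected before apply(), so the function receives one Series per group rather than the whole group DataFrame.

```python
import pandas as pd

# Hypothetical toy data: two regions with two wines each, one region with a single wine.
toy = pd.DataFrame({
    'Region': ['Rioja', 'Barolo', 'Rioja', 'Barolo', 'Toscana'],
    'Name': ['Wine A', 'Wine B', 'Wine C', 'Wine D', 'Wine E'],
})

# Each group is the slice of Name values belonging to one Region.
first_names = toy.groupby('Region').Name.apply(lambda names: names.iloc[0])
print(first_names)
```

The result is a Series indexed by Region, just like the output of the cell above.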

In [10]:
# For even more fine-grained control, you can also group by more than one column. For example,
# here's how we would pick out the best-rated wine for each country and region:
reviews.groupby(['Country', 'Region']).apply(lambda df: df.loc[df.Rating.idxmax()])


| Country | Region | Name | Country | Region | Winery | Rating | NumberOfRatings | Price | Year |
|---|---|---|---|---|---|---|---|---|---|
| Argentina | Agrelo | Malbec 2017 | Argentina | Agrelo | Crios | 3.6 | 3705 | 12.51 | 2017 |
| Argentina | Cafayate Valley | Auténtico Malbec 2018 | Argentina | Cafayate Valley | Colomé | 4.2 | 507 | 25.12 | 2018 |
| Argentina | Calchaqui Valley | Malbec Estate 2017 | Argentina | Calchaqui Valley | Colomé | 4.0 | 1568 | 15.03 | 2017 |
| Argentina | Gualtallary | Gran Enemigo Single Vineyard Gualtallary Caber... | Argentina | Gualtallary | El Enemigo | 4.6 | 4180 | 82.86 | 2014 |
| Argentina | La Consulta | Nicasia Vineyard La Consulta Malbec 2014 | Argentina | La Consulta | Catena Zapata | 4.3 | 178 | 101.52 | 2014 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| United States | Willamette Valley | Evenstad Reserve Pinot Noir 2014 | United States | Willamette Valley | Domaine Serene | 4.4 | 1417 | 89.00 | 2014 |
| United States | Yamhill-Carlton District | Résonance Vineyard Pinot Noir 2015 | United States | Yamhill-Carlton District | Résonance | 4.0 | 214 | 66.93 | 2015 |
| Uruguay | Maldonado | Reserva Tannat 2018 | Uruguay | Maldonado | Bodega Garzón | 4.1 | 3435 | 15.90 | 2018 |
| Uruguay | Progreso | Río de Los Pájaros Reserve Tannat 2016 | Uruguay | Progreso | Pisano | 3.9 | 190 | 13.09 | 2016 |
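
As an alternative sketch, the best-rated row per group can also be picked without apply(): take idxmax() per group and look those row labels up with loc. The `toy` data is hypothetical, and the result keeps the original row index instead of a Country/Region multi-index.

```python
import pandas as pd

# Hypothetical toy data with two wines in Barolo and Rioja, one in Toscana.
toy = pd.DataFrame({
    'Country': ['Italy', 'Italy', 'Italy', 'Spain', 'Spain'],
    'Region': ['Barolo', 'Barolo', 'Toscana', 'Rioja', 'Rioja'],
    'Name': ['A', 'B', 'C', 'D', 'E'],
    'Rating': [4.0, 4.5, 3.9, 4.2, 4.1],
})

# idxmax() returns, for each (Country, Region) group, the row label of the highest Rating.
best_rows = toy.groupby(['Country', 'Region']).Rating.idxmax()

# Use those labels to select the full rows.
best = toy.loc[best_rows]
print(best)
```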


Another groupby() method worth mentioning is agg(), which lets you run several different functions on each group in a single call.

In [11]:
# We can generate a simple statistical summary of the dataset as follows:
reviews.groupby(['Country']).Price.agg([len, min, max])

| Country | len | min | max |
|---|---|---|---|
| Argentina | 246 | 4.78 | 273.71 |
| Australia | 246 | 5.79 | 730.12 |
| Austria | 220 | 5.65 | 149.0 |
| Brazil | 40 | 6.77 | 73.06 |
| Bulgaria | 2 | 8.87 | 11.43 |
| Canada | 2 | 14.95 | 14.95 |
| Chile | 326 | 4.25 | 239.0 |
| China | 1 | 35.0 | 35.0 |
| Croatia | 2 | 18.9 | 23.9 |
| France | 2256 | 4.5 | 3410.79 |
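
The same kind of summary can be written with string function names, and with named aggregation to control the output column names. This is a sketch on hypothetical `toy` data; the column names `n_wines`, `cheapest` and `priciest` are made up for illustration. Newer pandas releases may emit a warning when Python builtins such as min or max are passed to agg(), so the string names can be the safer spelling. 'size' counts rows per group, like len.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'Country': ['Italy', 'Italy', 'France', 'Italy', 'France'],
    'Price': [10.0, 5.0, 30.0, 20.0, 8.0],
})

# String names instead of the builtins len, min, max.
summary = toy.groupby('Country').Price.agg(['size', 'min', 'max'])

# Named aggregation: keyword = output column name, value = function name.
named = toy.groupby('Country').Price.agg(n_wines='size', cheapest='min', priciest='max')

print(summary)
print(named)
```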


In [12]:
reviews.groupby(['Country']).Price.agg(len).sort_values(ascending=False)

Country
Italy            2650
France           2256
Spain            1142
South Africa      500
United States     374
Chile             326
Germany           248
Australia         246
Argentina         246
Portugal          230
Austria           220
New Zealand        63
Brazil             40
Romania            23
Lebanon            15
Greece             13
Israel             13
Switzerland        12
Hungary             9
Slovenia            8
Moldova             8
Turkey              6
Georgia             5
Uruguay             4
Bulgaria            2
Canada              2
Croatia             2
Slovakia            1
Mexico              1
China               1
Name: Price, dtype: int64

In [13]:
reviews.Country.value_counts()

Italy            2650
France           2256
Spain            1142
South Africa      500
United States     374
Chile             326
Germany           248
Australia         246
Argentina         246
Portugal          230
Austria           220
New Zealand        63
Brazil             40
Romania            23
Lebanon            15
Israel             13
Greece             13
Switzerland        12
Hungary             9
Moldova             8
Slovenia            8
Turkey              6
Georgia             5
Uruguay             4
Croatia             2
Bulgaria            2
Canada              2
Mexico              1
China               1
Slovakia            1
Name: Country, dtype: int64

In [14]:
reviews.groupby('Country').Price.max()

Country
Argentina         273.71
Australia         730.12
Austria           149.00
Brazil             73.06
Bulgaria           11.43
Canada             14.95
Chile             239.00
China              35.00
Croatia            23.90
France           3410.79
Georgia            32.90
Germany           129.00
Greece             36.95
Hungary            24.50
Israel             34.90
Italy            1115.50
Lebanon            53.52
Mexico              8.65
Moldova            98.98
New Zealand       135.00
Portugal          136.47
Romania            29.50
Slovakia           19.90
Slovenia           19.99
South Africa      463.03
Spain             672.60
Switzerland        70.19
Turkey             35.00
United States     721.34
Uruguay            15.90
Name: Price, dtype: float64

In [15]:
reviews.groupby('Country').Price.agg([len, min, max])

| Country | len | min | max |
|---|---|---|---|
| Argentina | 246 | 4.78 | 273.71 |
| Australia | 246 | 5.79 | 730.12 |
| Austria | 220 | 5.65 | 149.0 |
| Brazil | 40 | 6.77 | 73.06 |
| Bulgaria | 2 | 8.87 | 11.43 |
| Canada | 2 | 14.95 | 14.95 |
| Chile | 326 | 4.25 | 239.0 |
| China | 1 | 35.0 | 35.0 |
| Croatia | 2 | 18.9 | 23.9 |
| France | 2256 | 4.5 | 3410.79 |


### Multi-indexes
In all of the examples we've seen thus far we've been working with DataFrame or Series objects with a single-label index. groupby() is slightly different in that, depending on the operation we run, it will sometimes result in what is called a multi-index.

In [16]:
# A multi-index differs from a regular index in that it has multiple levels. For example:
reviews.groupby(['Country', 'Region']).Winery.agg([len])

| Country | Region | len |
|---|---|---|
| Argentina | Agrelo | 1 |
| Argentina | Cafayate Valley | 3 |
| Argentina | Calchaqui Valley | 3 |
| Argentina | Gualtallary | 3 |
| Argentina | La Consulta | 1 |
| ... | ... | ... |
| United States | Willamette Valley | 7 |
| United States | Yamhill-Carlton District | 1 |
| Uruguay | Maldonado | 2 |
| Uruguay | Progreso | 1 |


In [17]:
reviews.groupby(['Country', 'Region']).NumberOfRatings.agg([len])

| Country | Region | len |
|---|---|---|
| Argentina | Agrelo | 1 |
| Argentina | Cafayate Valley | 3 |
| Argentina | Calchaqui Valley | 3 |
| Argentina | Gualtallary | 3 |
| Argentina | La Consulta | 1 |
| ... | ... | ... |
| United States | Willamette Valley | 7 |
| United States | Yamhill-Carlton District | 1 |
| Uruguay | Maldonado | 2 |
| Uruguay | Progreso | 1 |


In [18]:
reviews.groupby(['Country', 'Region']).Price.agg([len])

| Country | Region | len |
|---|---|---|
| Argentina | Agrelo | 1 |
| Argentina | Cafayate Valley | 3 |
| Argentina | Calchaqui Valley | 3 |
| Argentina | Gualtallary | 3 |
| Argentina | La Consulta | 1 |
| ... | ... | ... |
| United States | Willamette Valley | 7 |
| United States | Yamhill-Carlton District | 1 |
| Uruguay | Maldonado | 2 |
| Uruguay | Progreso | 1 |


In [19]:
print(type(reviews.groupby(['Country', 'Region']).count()))
reviews.groupby(['Country', 'Region']).count()

<class 'pandas.core.frame.DataFrame'>


| Country | Region | Name | Winery | Rating | NumberOfRatings | Price | Year |
|---|---|---|---|---|---|---|---|
| Argentina | Agrelo | 1 | 1 | 1 | 1 | 1 | 1 |
| Argentina | Cafayate Valley | 3 | 3 | 3 | 3 | 3 | 3 |
| Argentina | Calchaqui Valley | 3 | 3 | 3 | 3 | 3 | 3 |
| Argentina | Gualtallary | 3 | 3 | 3 | 3 | 3 | 3 |
| Argentina | La Consulta | 1 | 1 | 1 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| United States | Willamette Valley | 7 | 7 | 7 | 7 | 7 | 7 |
| United States | Yamhill-Carlton District | 1 | 1 | 1 | 1 | 1 | 1 |
| Uruguay | Maldonado | 2 | 2 | 2 | 2 | 2 | 2 |
| Uruguay | Progreso | 1 | 1 | 1 | 1 | 1 | 1 |


In [20]:
reviews.groupby(['Country', 'Region']).Winery.count()

Country        Region                  
Argentina      Agrelo                      1
               Cafayate Valley             3
               Calchaqui Valley            3
               Gualtallary                 3
               La Consulta                 1
                                          ..
United States  Willamette Valley           7
               Yamhill-Carlton District    1
Uruguay        Maldonado                   2
               Progreso                    1
               San José                    1
Name: Winery, Length: 624, dtype: int64

In [21]:
countries_reviewed = reviews.groupby(['Country', 'Region']).Winery.agg([len])
#display(countries_reviewed)
print(type(countries_reviewed))
print('_________________________________________________')
print('***', type(countries_reviewed.index))
print('-------------------------------------------------')
#reviews.groupby(['Country', 'Region']).Winery.agg([len])
print(type(reviews.groupby(['Country', 'Region']).Winery.agg([len])))  # DataFrame     !!
print(type(reviews.groupby(['Country', 'Region']).Winery.agg(len)))   # Series        !!
countries_reviewed

<class 'pandas.core.frame.DataFrame'>
_________________________________________________
*** <class 'pandas.core.indexes.multi.MultiIndex'>
-------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


| Country | Region | len |
|---|---|---|
| Argentina | Agrelo | 1 |
| Argentina | Cafayate Valley | 3 |
| Argentina | Calchaqui Valley | 3 |
| Argentina | Gualtallary | 3 |
| Argentina | La Consulta | 1 |
| ... | ... | ... |
| United States | Willamette Valley | 7 |
| United States | Yamhill-Carlton District | 1 |
| Uruguay | Maldonado | 2 |
| Uruguay | Progreso | 1 |


Multi-indices have several methods for dealing with their tiered structure which are absent for single-level indices. They also require two levels of labels to retrieve a value. Dealing with multi-index output is a common "gotcha" for users new to pandas.
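
To illustrate the "two levels of labels" point, here is a small sketch on hypothetical `toy` data: a full key is a tuple with one label per level, while a key for the outer level alone returns everything stored under it.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'Country': ['Italy', 'Italy', 'Italy', 'Spain', 'Spain'],
    'Region': ['Barolo', 'Toscana', 'Toscana', 'Rioja', 'Rioja'],
    'Winery': ['W1', 'W2', 'W3', 'W4', 'W5'],
})

# Grouping by two columns produces a Series with a two-level MultiIndex.
counts = toy.groupby(['Country', 'Region']).Winery.count()

# Full key: one label per level, passed as a tuple, returns a single value.
toscana = counts.loc[('Italy', 'Toscana')]

# Partial key: only the outer level, returns a Series indexed by Region.
italy = counts.loc['Italy']

print(counts.index.nlevels, toscana)
print(italy)
```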

The use cases for a multi-index are detailed alongside instructions on using them in the MultiIndex / Advanced Selection section of the pandas documentation.

However, in general the multi-index method you will use most often is the one for converting back to a regular index, the reset_index() method:

In [22]:
print(countries_reviewed.reset_index().index)
countries_reviewed.reset_index()

RangeIndex(start=0, stop=624, step=1)


|  | Country | Region | len |
|---|---|---|---|
| 0 | Argentina | Agrelo | 1 |
| 1 | Argentina | Cafayate Valley | 3 |
| 2 | Argentina | Calchaqui Valley | 3 |
| 3 | Argentina | Gualtallary | 3 |
| 4 | Argentina | La Consulta | 1 |
| ... | ... | ... | ... |
| 619 | United States | Willamette Valley | 7 |
| 620 | United States | Yamhill-Carlton District | 1 |
| 621 | Uruguay | Maldonado | 2 |
| 622 | Uruguay | Progreso | 1 |
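
A related sketch, again on hypothetical `toy` data: when the grouped result is a Series, reset_index() accepts a `name` argument, so the count column can be labeled in the same step that flattens the multi-index. The column name `n_wines` is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    'Country': ['Italy', 'Italy', 'Italy', 'Spain', 'Spain'],
    'Region': ['Barolo', 'Toscana', 'Toscana', 'Rioja', 'Rioja'],
    'Winery': ['W1', 'W2', 'W3', 'W4', 'W5'],
})

# size() gives a Series with a (Country, Region) MultiIndex;
# reset_index(name=...) turns both levels into columns and names the values column.
flat = toy.groupby(['Country', 'Region']).size().reset_index(name='n_wines')

print(flat.index)
print(flat)
```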


### Sorting

In [23]:
countries_reviewed = countries_reviewed.reset_index()
countries_reviewed.sort_values(by='len', ascending=False)

|  | Country | Region | len |
|---|---|---|---|
| 553 | Spain | Rioja | 325 |
| 428 | Italy | Toscana | 249 |
| 509 | South Africa | Stellenbosch | 232 |
| 240 | France | Saint-Émilion Grand Cru | 207 |
| 314 | Italy | Barolo | 175 |
| ... | ... | ... | ... |
| 295 | Greece | Slopes of Enos | 1 |
| 294 | Greece | Santorini | 1 |
| 293 | Greece | Rapsani | 1 |
| 292 | Greece | Nemea | 1 |


In [24]:
# To sort by index values, use the companion method sort_index(). 
# This method has the same arguments and default order:

countries_reviewed.sort_index()

|  | Country | Region | len |
|---|---|---|---|
| 0 | Argentina | Agrelo | 1 |
| 1 | Argentina | Cafayate Valley | 3 |
| 2 | Argentina | Calchaqui Valley | 3 |
| 3 | Argentina | Gualtallary | 3 |
| 4 | Argentina | La Consulta | 1 |
| ... | ... | ... | ... |
| 619 | United States | Willamette Valley | 7 |
| 620 | United States | Yamhill-Carlton District | 1 |
| 621 | Uruguay | Maldonado | 2 |
| 622 | Uruguay | Progreso | 1 |


##### Finally, know that you can sort by more than one column at a time:

In [27]:
countries_reviewed.sort_values(by=['Country', 'len'])

|  | Country | Region | len |
|---|---|---|---|
| 0 | Argentina | Agrelo | 1 |
| 4 | Argentina | La Consulta | 1 |
| 11 | Argentina | San Carlos | 1 |
| 6 | Argentina | Maipu | 2 |
| 14 | Argentina | Tulum Valley | 2 |
| ... | ... | ... | ... |
| 599 | United States | Napa Valley | 72 |
| 582 | United States | California | 91 |
| 622 | Uruguay | Progreso | 1 |
| 623 | Uruguay | San José | 1 |


In [29]:
countries_reviewed.sort_values(by=['len', 'Country'], ascending=False)

|  | Country | Region | len |
|---|---|---|---|
| 553 | Spain | Rioja | 325 |
| 428 | Italy | Toscana | 249 |
| 509 | South Africa | Stellenbosch | 232 |
| 240 | France | Saint-Émilion Grand Cru | 207 |
| 314 | Italy | Barolo | 175 |
| ... | ... | ... | ... |
| 32 | Australia | Limestone Coast | 1 |
| 36 | Australia | New South Wales | 1 |
| 0 | Argentina | Agrelo | 1 |
| 4 | Argentina | La Consulta | 1 |
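
In the last cell a single ascending=False applies to both sort keys, so ties on len are broken by Country in reverse alphabetical order. A sketch on hypothetical `toy` data of how each key can get its own direction by passing a list to ascending:

```python
import pandas as pd

# Hypothetical toy data with a tie on 'len'.
toy = pd.DataFrame({
    'Country': ['Greece', 'Spain', 'Argentina', 'Italy'],
    'Region': ['Nemea', 'Rioja', 'Agrelo', 'Toscana'],
    'len': [1, 325, 1, 249],
})

# Largest counts first; ties on 'len' are broken by Country in alphabetical order.
mixed = toy.sort_values(by=['len', 'Country'], ascending=[False, True])
print(mixed)
```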
