# <p style="background-color:#9d4f8c; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:10px 10px;">Pandas (4)</p>

<div class="alert alert-block alert-info alert">
    
# <span style=" color:red"> Groupby Operations and Multi-level Index

## Table of Contents
* How can we use groupby() method?
* groupby()
* cross-section: xs()
* swaplevel()
* sort_index()
* agg()

## How can we use groupby() method?

* A **groupby()** operation allows us to examine data on a **per category** basis. 
* We can use it to answer questions such as "What is the average per category?" or "How many rows do we have per category?"
![image.png](attachment:3d884049-85cc-4660-90ed-ed3845030909.png)
* We need to choose a categorical column to pass to **groupby()**.
* Categorical columns are non-continuous.
* They can still be numerical as long as the numbers denote categories, such as cabin classes on a ship or different years.
* **groupby()** groups the rows according to these categories.
  
![image.png](attachment:e5164db1-c7e3-45de-9a7d-22f85eb0c8fd.png)
* If we use an aggregate function (sum, mean, count, etc.) together with groupby, each group is reduced to a single value per column:

![image.png](attachment:dda35225-60e3-4edc-8a8b-0417c58b0fcd.png)
* Note that calling **groupby()** by itself creates a "lazy" groupby object waiting to be evaluated by an aggregate method call.
* There are multiple ways to split the data, for example:

  obj.groupby("key")

  obj.groupby("key", axis=1)

  obj.groupby(["key1", "key2"])
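
The "lazy" behaviour described above can be sketched on a tiny, made-up DataFrame. The names `toy`, `key1`, `key2` and `value` are assumptions for illustration only and are not part of the mpg data used below.

```python
import pandas as pd

# Hypothetical toy data, not the mpg.csv file used in this notebook
toy = pd.DataFrame({
    "key1": ["a", "a", "b", "b"],
    "key2": ["x", "y", "x", "x"],
    "value": [1, 2, 3, 4],
})

# groupby() alone only records how to split the rows
grouped = toy.groupby("key1")
print(type(grouped).__name__)

# The aggregate call is what actually produces a result
sums = grouped["value"].sum()
print(sums)

# Grouping by two keys gives one entry per (key1, key2) combination
pair_counts = toy.groupby(["key1", "key2"]).size()
print(pair_counts)
```

Printing `grouped` itself shows only an object description, which is a quick way to check that no aggregation has run yet.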
  

## Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("mpg.csv")
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


## groupby()

In [3]:
# We can use "describe()" to see the main summary statistics of every group at once
df.groupby("model_year").describe().T # The transposed output is easier to read

Unnamed: 0,model_year,70,71,72,73,74,75,76,77,78,79,80,81,82
mpg,count,29.0,28.0,28.0,40.0,27.0,30.0,34.0,28.0,36.0,29.0,29.0,29.0,31.0
mpg,mean,17.689655,21.25,18.714286,17.1,22.703704,20.266667,21.573529,23.375,24.061111,25.093103,33.696552,30.334483,31.709677
mpg,std,5.339231,6.591942,5.435529,4.700245,6.42001,4.940566,5.889297,6.675862,6.898044,6.794217,7.037983,5.591465,5.392548
mpg,min,9.0,12.0,11.0,11.0,13.0,13.0,13.0,15.0,16.2,15.5,19.1,17.6,22.0
mpg,25%,14.0,15.5,13.75,13.0,16.0,16.0,16.75,17.375,19.35,19.2,29.8,26.6,27.0
mpg,50%,16.0,19.0,18.5,16.0,24.0,19.5,21.0,21.75,20.7,23.9,32.7,31.6,32.0
mpg,75%,22.0,27.0,23.0,20.0,27.0,23.0,26.375,30.0,28.0,31.8,38.1,34.4,36.0
mpg,max,27.0,35.0,28.0,29.0,32.0,33.0,33.0,36.0,43.1,37.3,46.6,39.1,44.0
cylinders,count,29.0,28.0,28.0,40.0,27.0,30.0,34.0,28.0,36.0,29.0,29.0,29.0,31.0
cylinders,mean,6.758621,5.571429,5.821429,6.375,5.259259,5.6,5.647059,5.464286,5.361111,5.827586,4.137931,4.62069,4.193548


#### Q1: How has the performance of the cars changed over the years?

What is the categorical column in this question?

In [4]:
# The categorical column here is model_year. Let's look at its categories.
df["model_year"].unique()

array([70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], dtype=int64)

In [5]:
# Or we could use value_counts
df["model_year"].value_counts()

model_year
73    40
78    36
76    34
82    31
75    30
70    29
79    29
80    29
81    29
71    28
72    28
77    28
74    27
Name: count, dtype: int64

In [6]:
# Let's use groupby to find the mean of all numerical columns
# An empty mean() raises an error because of the text columns, so we restrict it with numeric_only=True
df.groupby("model_year").mean(numeric_only=True)

# model_year is our index now

Unnamed: 0_level_0,mpg,cylinders,displacement,weight,acceleration,origin
model_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
70,17.689655,6.758621,281.413793,3372.793103,12.948276,1.310345
71,21.25,5.571429,209.75,2995.428571,15.142857,1.428571
72,18.714286,5.821429,218.375,3237.714286,15.125,1.535714
73,17.1,6.375,256.875,3419.025,14.3125,1.375
74,22.703704,5.259259,171.740741,2877.925926,16.203704,1.666667
75,20.266667,5.6,205.533333,3176.8,16.05,1.466667
76,21.573529,5.647059,197.794118,3078.735294,15.941176,1.470588
77,23.375,5.464286,191.392857,2997.357143,15.435714,1.571429
78,24.061111,5.361111,177.805556,2861.805556,15.805556,1.611111
79,25.093103,5.827586,206.689655,3055.344828,15.813793,1.275862
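
The error mentioned in the cell above comes from text columns being passed to `mean()`. A minimal sketch, assuming pandas 2.x behaviour and a made-up frame `toy`, shows two ways to keep the aggregation on numeric columns only:

```python
import pandas as pd

# Hypothetical toy data with one text column
toy = pd.DataFrame({
    "model_year": [70, 70, 71, 71],
    "mpg": [18.0, 24.0, 14.0, 28.0],
    "name": ["car a", "car b", "car c", "car d"],
})

# numeric_only=True skips the text column "name"
means = toy.groupby("model_year").mean(numeric_only=True)
print(means)

# Selecting the columns first is an equivalent, more explicit route
same = toy.groupby("model_year")[["mpg"]].mean()
print(means.equals(same))
```

Selecting columns explicitly is usually the safer habit, because the result does not change when new text columns are added to the data.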


In [7]:
# I want to display only the mpg column together with model_year
df.groupby("model_year")["mpg"].mean()

# df.groupby("model_year").mean()["mpg"] # This raises an error: mean() runs on every column, including the text ones, before "mpg" is selected

model_year
70    17.689655
71    21.250000
72    18.714286
73    17.100000
74    22.703704
75    20.266667
76    21.573529
77    23.375000
78    24.061111
79    25.093103
80    33.696552
81    30.334483
82    31.709677
Name: mpg, dtype: float64

In [8]:
# sum grouped by horsepower
df.groupby(['horsepower']).sum()

Unnamed: 0_level_0,mpg,cylinders,displacement,weight,acceleration,model_year,origin,name
horsepower,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
100,333.1,97,3731.0,53456,269.3,1268,21,amc gremlinchevrolet chevelle malibuamc matado...
102,20.0,4,130.0,3150,15.7,76,2,volvo 245
103,20.3,5,131.0,2830,15.9,78,2,audi 5000
105,246.0,70,2780.0,40492,199.5,916,12,plymouth satellite customplymouth valiantplymo...
107,21.0,6,155.0,2472,14.0,73,1,mercury capri v6
...,...,...,...,...,...,...,...,...
95,309.6,70,2328.0,39258,225.0,1035,26,toyota corona mark iiplymouth dustersaab 99eto...
96,81.5,12,400.0,7667,42.9,234,7,toyota coronaplymouth arrow gstoyota celica gt
97,199.1,41,1183.0,23148,134.1,671,23,amc hornetmazda rx2 coupetoyouta corona mark i...
98,40.5,10,371.0,6470,33.5,152,3,volvo 244dlford granada


In [9]:
# min values grouped by cylinders
df.groupby(['cylinders']).min()

Unnamed: 0_level_0,mpg,displacement,horsepower,weight,acceleration,model_year,origin,name
cylinders,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
3,18.0,70.0,100,2124,12.5,72,3,maxda rx3
4,18.0,68.0,100,1613,11.6,70,1,amc concord
5,20.3,121.0,103,2830,15.9,78,2,audi 5000
6,15.0,145.0,100,2472,11.3,70,1,amc concord
8,9.0,260.0,105,3086,8.0,70,1,amc ambassador brougham


In [10]:
# using groupby function with "sort"
 
df.groupby(['cylinders'], sort = False).sum() # No sorting in cylinders

Unnamed: 0_level_0,mpg,displacement,horsepower,weight,acceleration,model_year,origin,name
cylinders,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
8,1541.2,35536.0,1301651501501401982202152251901701601502252152...,423816,1334.4,7612,103,chevrolet chevelle malibubuick skylark 320plym...
4,5974.5,22398.5,958846879095113889095?728690707665696070958054...,470858,3386.7,15723,405,toyota corona mark iidatsun pl510volkswagen 11...
6,1678.8,18324.0,9597859010010510088100110100881051001008895100...,268651,1366.1,6378,100,plymouth dusteramc hornetford maverickamc grem...
3,82.2,290.0,9790110100,9594,53.0,302,12,mazda rx2 coupemaxda rx3mazda rx-4mazda rx-7 gs
5,82.1,435.0,1037767,9310,55.9,237,6,audi 5000mercedes benz 300daudi 5000s (diesel)


In [11]:
# sort= True
df.groupby(['cylinders'], sort = True).sum() # Begins from the least (from 3 to 8)

Unnamed: 0_level_0,mpg,displacement,horsepower,weight,acceleration,model_year,origin,name
cylinders,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
3,82.2,290.0,9790110100,9594,53.0,302,12,mazda rx2 coupemaxda rx3mazda rx-4mazda rx-7 gs
4,5974.5,22398.5,958846879095113889095?728690707665696070958054...,470858,3386.7,15723,405,toyota corona mark iidatsun pl510volkswagen 11...
5,82.1,435.0,1037767,9310,55.9,237,6,audi 5000mercedes benz 300daudi 5000s (diesel)
6,1678.8,18324.0,9597859010010510088100110100881051001008895100...,268651,1366.1,6378,100,plymouth dusteramc hornetford maverickamc grem...
8,1541.2,35536.0,1301651501501401982202152251901701601502252152...,423816,1334.4,7612,103,chevrolet chevelle malibubuick skylark 320plym...
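
The effect of the `sort` argument is easier to see on a small example. The sketch below uses a made-up frame `toy`; with `sort=False` the groups appear in order of first occurrence, with `sort=True` (the default) in ascending key order.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "cylinders": [8, 4, 6, 4, 8],
    "weight": [3500, 2100, 2800, 2200, 3700],
})

unsorted_sum = toy.groupby("cylinders", sort=False)["weight"].sum()
sorted_sum = toy.groupby("cylinders", sort=True)["weight"].sum()

print(unsorted_sum.index.tolist())  # order of first appearance in the data
print(sorted_sum.index.tolist())    # ascending group keys
```

The aggregated values are identical in both cases; only the row order differs.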


In [12]:
# Selecting a single group (choose one category of the grouped feature)
# In "model_year", select only the 70 category (it is an integer, not a string, so we do not use "")
 
grp = df.groupby('model_year')
grp.get_group(70)

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino
5,15.0,8,429.0,198,4341,10.0,70,1,ford galaxie 500
6,14.0,8,454.0,220,4354,9.0,70,1,chevrolet impala
7,14.0,8,440.0,215,4312,8.5,70,1,plymouth fury iii
8,14.0,8,455.0,225,4425,10.0,70,1,pontiac catalina
9,15.0,8,390.0,190,3850,8.5,70,1,amc ambassador dpl


In [13]:
# selecting a group from an object grouped on multiple columns
# name "chevrolet impala" and cylinders 8, passed together as a tuple
grp = df.groupby(['name', 'cylinders'])
grp.get_group(('chevrolet impala', 8))

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
6,14.0,8,454.0,220,4354,9.0,70,1,chevrolet impala
38,14.0,8,350.0,165,4209,12.0,71,1,chevrolet impala
62,13.0,8,350.0,165,4274,12.0,72,1,chevrolet impala
103,11.0,8,400.0,150,4997,14.0,73,1,chevrolet impala
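
Besides `get_group()`, a grouped object can be counted and iterated. The following is a small sketch on a made-up frame `toy`; the column names mirror the mpg data, but the values are invented.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "model_year": [70, 70, 71, 71],
    "cylinders": [8, 4, 8, 8],
    "mpg": [18.0, 24.0, 14.0, 13.0],
})

grp = toy.groupby(["model_year", "cylinders"])

# ngroups counts the distinct (model_year, cylinders) pairs
print(grp.ngroups)

# Iterating yields (key, sub-DataFrame) pairs; the key is a tuple here
for key, frame in grp:
    print(key, len(frame))

# get_group returns the original rows that belong to one pair
subset = grp.get_group((71, 8))
print(subset)
```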


#### Q2: Group by more than one column, for example by model_year and the number of cylinders, and find the average values.

In [14]:
# df.groupby(["model_year", "cylinders"]).mean() # This raises an error because of the text columns

df.groupby(["model_year", "cylinders"]).mean(numeric_only=True)

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
70,4,25.285714,107.0,2292.571429,16.0,2.285714
70,6,20.5,199.0,2710.5,15.5,1.0
70,8,14.111111,367.555556,3940.055556,11.194444,1.0
71,4,27.461538,101.846154,2056.384615,16.961538,1.923077
71,6,18.0,243.375,3171.875,14.75,1.0
71,8,13.428571,371.714286,4537.714286,12.214286,1.0
72,3,19.0,70.0,2330.0,13.5,3.0
72,4,23.428571,111.535714,2382.642857,17.214286,1.928571
72,8,13.615385,344.846154,4228.384615,13.0,1.0
73,3,18.0,70.0,2124.0,13.5,3.0


In [15]:
# The two columns used by groupby become the index.
# We can check this with "columns"

df.groupby(["model_year", "cylinders"]).mean(numeric_only=True).columns

# They are not among the columns

Index(['mpg', 'displacement', 'weight', 'acceleration', 'origin'], dtype='object')

In [16]:
# The result has a MultiIndex with two levels: "model_year" and "cylinders"
df.groupby(["model_year", "cylinders"]).mean(numeric_only=True).index

MultiIndex([(70, 4),
            (70, 6),
            (70, 8),
            (71, 4),
            (71, 6),
            (71, 8),
            (72, 3),
            (72, 4),
            (72, 8),
            (73, 3),
            (73, 4),
            (73, 6),
            (73, 8),
            (74, 4),
            (74, 6),
            (74, 8),
            (75, 4),
            (75, 6),
            (75, 8),
            (76, 4),
            (76, 6),
            (76, 8),
            (77, 3),
            (77, 4),
            (77, 6),
            (77, 8),
            (78, 4),
            (78, 5),
            (78, 6),
            (78, 8),
            (79, 4),
            (79, 5),
            (79, 6),
            (79, 8),
            (80, 3),
            (80, 4),
            (80, 5),
            (80, 6),
            (81, 4),
            (81, 6),
            (81, 8),
            (82, 4),
            (82, 6)],
           names=['model_year', 'cylinders'])

In [17]:
# df.groupby(["model_year", "cylinders"]).mean["mpg"] # This does not work: mean is a method and must be called with ()

df.groupby(["model_year", "cylinders"])["mpg"].mean()

model_year  cylinders
70          4            25.285714
            6            20.500000
            8            14.111111
71          4            27.461538
            6            18.000000
            8            13.428571
72          3            19.000000
            4            23.428571
            8            13.615385
73          3            18.000000
            4            22.727273
            6            19.000000
            8            13.200000
74          4            27.800000
            6            17.857143
            8            14.200000
75          4            25.250000
            6            17.583333
            8            15.666667
76          4            26.766667
            6            20.000000
            8            14.666667
77          3            21.500000
            4            29.107143
            6            19.500000
            8            16.000000
78          4            29.576471
            5            20.30000

In [18]:
# We can assign the multi-level indexed result of groupby() to a variable

# year_cylinders = df.groupby(["model_year","cylinders"]).mean() # this code (empty mean) raises an error because of the text columns

year_cylinders = df.groupby(["model_year","cylinders"]).mean(numeric_only=True)
year_cylinders

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
70,4,25.285714,107.0,2292.571429,16.0,2.285714
70,6,20.5,199.0,2710.5,15.5,1.0
70,8,14.111111,367.555556,3940.055556,11.194444,1.0
71,4,27.461538,101.846154,2056.384615,16.961538,1.923077
71,6,18.0,243.375,3171.875,14.75,1.0
71,8,13.428571,371.714286,4537.714286,12.214286,1.0
72,3,19.0,70.0,2330.0,13.5,3.0
72,4,23.428571,111.535714,2382.642857,17.214286,1.928571
72,8,13.615385,344.846154,4228.384615,13.0,1.0
73,3,18.0,70.0,2124.0,13.5,3.0


In [19]:
# Our index levels are "model_year" and "cylinders". We can verify it...
year_cylinders.index.names

FrozenList(['model_year', 'cylinders'])

In [20]:
# To see the values of each level in detail...
# The first list holds the "model_year" values and the second the "cylinders" values
year_cylinders.index.levels

FrozenList([[70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82], [3, 4, 5, 6, 8]])
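
If the group keys are needed as ordinary columns rather than index levels, the MultiIndex can be flattened. A minimal sketch on a made-up frame `toy`, using `reset_index()` and the `as_index=False` option of `groupby()`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "model_year": [70, 70, 71, 71],
    "cylinders": [8, 4, 8, 8],
    "mpg": [18.0, 24.0, 14.0, 13.0],
})

by_year_cyl = toy.groupby(["model_year", "cylinders"])[["mpg"]].mean()
print(list(by_year_cyl.index.names))

# reset_index() turns both index levels back into columns
flat = by_year_cyl.reset_index()
print(flat.columns.tolist())

# as_index=False asks groupby to keep the keys as columns from the start
flat_direct = toy.groupby(["model_year", "cylinders"], as_index=False)["mpg"].mean()
print(flat_direct)
```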

#### Q3: Display only model_year "70"

In [21]:
# One way to find it is "loc". It selects by index label, whereas iloc selects by integer position.

year_cylinders.loc[70]

# It returns all rows for 70, but the model_year level is dropped and only "cylinders" remains as the index.

Unnamed: 0_level_0,mpg,displacement,weight,acceleration,origin
cylinders,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
4,25.285714,107.0,2292.571429,16.0,2.285714
6,20.5,199.0,2710.5,15.5,1.0
8,14.111111,367.555556,3940.055556,11.194444,1.0


In [22]:
# We can also show several groups using a list
year_cylinders.loc[[70, 74, 82]]

# if we want to select more than one group we need double brackets [[]]
# remember, the "year_cylinders" variable is still indexed by "model_year" and "cylinders"

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
70,4,25.285714,107.0,2292.571429,16.0,2.285714
70,6,20.5,199.0,2710.5,15.5,1.0
70,8,14.111111,367.555556,3940.055556,11.194444,1.0
74,4,27.8,96.533333,2151.466667,16.4,2.2
74,6,17.857143,230.428571,3320.0,16.857143,1.0
74,8,14.2,315.2,4438.4,14.7,1.0
82,4,32.071429,118.571429,2402.321429,16.703571,1.714286
82,6,28.333333,225.0,2931.666667,16.033333,1.0


#### Q4: Show all values for model_year 70 and cylinders 4

In [23]:
year_cylinders.loc[(70,4)] # the values inside the parentheses form a tuple

mpg               25.285714
displacement     107.000000
weight          2292.571429
acceleration      16.000000
origin             2.285714
Name: (70, 4), dtype: float64
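
The different `loc` forms used above return different kinds of objects. The sketch below illustrates this on a made-up frame `toy`; `year_cyl` is a stand-in for the notebook's `year_cylinders`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "model_year": [70, 70, 71, 71],
    "cylinders": [8, 4, 8, 8],
    "mpg": [18.0, 24.0, 14.0, 13.0],
})
year_cyl = toy.groupby(["model_year", "cylinders"])[["mpg"]].mean()

# A single outer label returns a DataFrame indexed by the inner level
one_year = year_cyl.loc[70]
print(one_year)

# A full (outer, inner) tuple returns one row as a Series
one_row = year_cyl.loc[(70, 4)]
print(one_row)

# Adding a column label returns a single value
mpg_value = year_cyl.loc[(70, 4), "mpg"]
print(mpg_value)
```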

In [24]:
# Every entry of the index is a (model_year, cylinders) tuple
# We can pass any one of these tuples to loc to get the values for that combination
year_cylinders.index

MultiIndex([(70, 4),
            (70, 6),
            (70, 8),
            (71, 4),
            (71, 6),
            (71, 8),
            (72, 3),
            (72, 4),
            (72, 8),
            (73, 3),
            (73, 4),
            (73, 6),
            (73, 8),
            (74, 4),
            (74, 6),
            (74, 8),
            (75, 4),
            (75, 6),
            (75, 8),
            (76, 4),
            (76, 6),
            (76, 8),
            (77, 3),
            (77, 4),
            (77, 6),
            (77, 8),
            (78, 4),
            (78, 5),
            (78, 6),
            (78, 8),
            (79, 4),
            (79, 5),
            (79, 6),
            (79, 8),
            (80, 3),
            (80, 4),
            (80, 5),
            (80, 6),
            (81, 4),
            (81, 6),
            (81, 8),
            (82, 4),
            (82, 6)],
           names=['model_year', 'cylinders'])

### cross-section: xs()
The **xs() function** is used to get a cross-section from a Series/DataFrame.

This method takes a key argument to select data at a particular level of a MultiIndex.

**Syntax:** Series.xs(self, key, axis=0, level=None, drop_level=True)


In [25]:
# Let's try it on our variable year_cylinders
# model_year for "70"
year_cylinders.xs(key=70, level="model_year")

# "key" takes only one value
# for more than one key, use loc[] # For example: year_cylinders.loc[[70,76]]
# 70 is not visible in the table: the model_year level is dropped from the result (drop_level=True by default) because it is the same for every row.

Unnamed: 0_level_0,mpg,displacement,weight,acceleration,origin
cylinders,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
4,25.285714,107.0,2292.571429,16.0,2.285714
6,20.5,199.0,2710.5,15.5,1.0
8,14.111111,367.555556,3940.055556,11.194444,1.0
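
The `drop_level` parameter from the syntax line controls whether the selected level stays in the result. A short sketch on a made-up frame `toy`, where `year_cyl` stands in for `year_cylinders`:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "model_year": [70, 70, 71, 71],
    "cylinders": [8, 4, 8, 8],
    "mpg": [18.0, 24.0, 14.0, 13.0],
})
year_cyl = toy.groupby(["model_year", "cylinders"])[["mpg"]].mean()

# Default: the selected level disappears from the index
dropped = year_cyl.xs(key=70, level="model_year")
print(dropped)

# drop_level=False keeps both levels, so 70 stays visible
kept = year_cyl.xs(key=70, level="model_year", drop_level=False)
print(kept)
```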


#### Q5: Show the rows with 4 cylinders for each model_year

In [26]:
year_cylinders.xs(key=4, level="cylinders")

# in this table cylinders "4" is not shown because it is the same throughout the rows

Unnamed: 0_level_0,mpg,displacement,weight,acceleration,origin
model_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
70,25.285714,107.0,2292.571429,16.0,2.285714
71,27.461538,101.846154,2056.384615,16.961538,1.923077
72,23.428571,111.535714,2382.642857,17.214286,1.928571
73,22.727273,109.272727,2338.090909,17.136364,2.0
74,27.8,96.533333,2151.466667,16.4,2.2
75,25.25,114.833333,2489.25,15.833333,2.166667
76,26.766667,106.333333,2306.6,16.866667,1.866667
77,29.107143,106.5,2205.071429,16.064286,1.857143
78,29.576471,112.117647,2296.764706,16.282353,2.117647
79,31.525,113.583333,2357.583333,15.991667,1.583333


#### Q6: Find maximum values of all columns for cylinders 6 and 8 in each model_year.

In [27]:
# Since key takes only one value, we first filter the data frame down to cylinders 6 and 8, then we use groupby()

df[df["cylinders"].isin([6,8])].groupby(["model_year", "cylinders"]).max()

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,horsepower,weight,acceleration,origin,name
model_year,cylinders,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
70,6,22.0,200.0,97,2833,16.0,1,plymouth duster
70,8,18.0,455.0,225,4732,18.5,1,pontiac catalina
71,6,19.0,258.0,88,3439,15.5,1,pontiac firebird
71,8,14.0,400.0,180,5140,13.5,1,pontiac safari (sw)
72,8,17.0,429.0,208,4633,16.0,1,pontiac catalina
73,6,23.0,250.0,95,3278,18.0,3,toyota mark ii
73,8,16.0,455.0,230,4997,14.5,1,pontiac grand prix
74,6,21.0,258.0,?,3781,18.0,1,plymouth satellite sebring
74,8,16.0,350.0,150,4699,16.0,1,ford gran torino (sw)
75,6,21.0,258.0,97,3907,21.0,1,plymouth valiant custom


### swaplevel()
If you want to swap the groupby features (indexes), use swaplevel().

In [28]:
# Let's try it on our year_cylinders. Normally, model_year is followed by cylinders. Let's change their order...
year_cylinders.swaplevel()

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
cylinders,model_year,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
4,70,25.285714,107.0,2292.571429,16.0,2.285714
6,70,20.5,199.0,2710.5,15.5,1.0
8,70,14.111111,367.555556,3940.055556,11.194444,1.0
4,71,27.461538,101.846154,2056.384615,16.961538,1.923077
6,71,18.0,243.375,3171.875,14.75,1.0
8,71,13.428571,371.714286,4537.714286,12.214286,1.0
3,72,19.0,70.0,2330.0,13.5,3.0
4,72,23.428571,111.535714,2382.642857,17.214286,1.928571
8,72,13.615385,344.846154,4228.384615,13.0,1.0
3,73,18.0,70.0,2124.0,13.5,3.0
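
Note that `swaplevel()` only exchanges the levels and leaves the row order untouched, as the output above shows. A small sketch on a made-up frame `toy` combines it with `sort_index()` so that rows are also ordered by the new outer level:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71],
    "cylinders": [4, 6, 8, 4, 8],
    "mpg": [25.0, 20.0, 14.0, 27.0, 13.0],
})
year_cyl = toy.groupby(["model_year", "cylinders"]).mean()

# Levels are swapped, rows keep their original order
swapped = year_cyl.swaplevel()
print(swapped.index.tolist())

# Sorting afterwards groups the rows by cylinders first
swapped_sorted = swapped.sort_index()
print(swapped_sorted.index.tolist())
```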


### sort_index()
Use it to sort the index (or groupby columns) in ascending or descending order.

In [29]:
year_cylinders.sort_index(level="model_year", ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
82,6,28.333333,225.0,2931.666667,16.033333,1.0
82,4,32.071429,118.571429,2402.321429,16.703571,1.714286
81,8,26.6,350.0,3725.0,19.0,1.0
81,6,23.428571,184.0,3093.571429,15.442857,1.714286
81,4,32.814286,108.857143,2275.47619,16.466667,2.095238
80,6,25.9,196.5,3145.5,15.05,2.0
80,5,36.4,121.0,2950.0,19.9,2.0
80,4,34.612,111.0,2360.08,17.144,2.2
80,3,23.7,70.0,2420.0,12.5,3.0
79,8,18.63,321.4,3862.9,15.4,1.0


In [30]:
year_cylinders.sort_index(level="cylinders", ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,displacement,weight,acceleration,origin
model_year,cylinders,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
81,8,26.6,350.0,3725.0,19.0,1.0
79,8,18.63,321.4,3862.9,15.4,1.0
78,8,19.05,300.833333,3563.333333,13.266667,1.0
77,8,16.0,335.75,4177.5,13.6625,1.0
76,8,14.666667,324.0,4064.666667,13.222222,1.0
75,8,15.666667,330.5,4108.833333,13.166667,1.0
74,8,14.2,315.2,4438.4,14.7,1.0
73,8,13.2,365.25,4279.05,12.25,1.0
72,8,13.615385,344.846154,4228.384615,13.0,1.0
71,8,13.428571,371.714286,4537.714286,12.214286,1.0
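
When the levels should be sorted in different directions, `sort_index()` also accepts lists for `level` and `ascending`. A minimal sketch on a made-up frame `toy`, sorting model_year descending and cylinders ascending:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "model_year": [70, 70, 70, 71, 71],
    "cylinders": [4, 6, 8, 4, 8],
    "mpg": [25.0, 20.0, 14.0, 27.0, 13.0],
})
year_cyl = toy.groupby(["model_year", "cylinders"]).mean()

# One direction per level: newest year first, fewest cylinders first
ordered = year_cyl.sort_index(
    level=["model_year", "cylinders"],
    ascending=[False, True],
)
print(ordered)
```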


### agg()

**Syntax:** DataFrameGroupBy.aggregate(func=None, *args, engine=None, engine_kwargs=None, **kwargs)

*args: 
Positional arguments to pass to func

**kwargs: 
* If func is None, **kwargs are used to define the output names and aggregations via Named Aggregation. See the func entry.
* Otherwise, keyword arguments to be passed into func.
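
Named Aggregation, mentioned in the `**kwargs` entry, can be sketched as follows. Each keyword becomes an output column name and maps to a `(column, function)` pair. The frame `toy` and the output names `min_mpg` and `max_weight` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "model_year": [70, 70, 71, 71],
    "mpg": [18.0, 24.0, 14.0, 28.0],
    "weight": [3500, 2300, 4100, 2100],
})

# No func is given, so the keywords define the output columns
summary = toy.groupby("model_year").agg(
    min_mpg=("mpg", "min"),
    max_weight=("weight", "max"),
)
print(summary)
```

Compared with passing a dictionary, this form gives flat, descriptive column names instead of a multi-level column header.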


#### Q7: Find the standard deviation and mean of all columns

In [31]:
# df.agg(["std","mean"])
# It does not work because some columns are not numeric: "name" holds text, and "horsepower" is stored as text because it contains question marks "?".

In [32]:
# Let's select a single column and look at these aggregation functions
# For only "mpg"
df["mpg"].agg(["std","mean"])

# This code did not work because agg runs on all columns first   # df.agg(["std","mean"])["mpg"]

std      7.815984
mean    23.514573
Name: mpg, dtype: float64
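
One possible way to make a column such as "horsepower" usable in aggregations is to convert the "?" placeholders to missing values. The sketch below is an assumption about how this could be handled, shown on a made-up frame `toy`; it uses `pd.to_numeric` with `errors="coerce"` and then aggregates only the numeric columns.

```python
import pandas as pd

# Hypothetical toy data with a "?" placeholder, as in the horsepower column
toy = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0],
    "horsepower": ["130", "?", "95"],
    "name": ["car a", "car b", "car c"],
})

# Values that cannot be parsed as numbers become NaN
toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")

# Keep only numeric columns, then aggregate; NaN values are skipped
stats = toy.select_dtypes("number").agg(["std", "mean"])
print(stats)
```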

In [33]:
df.groupby("model_year").agg("min")

Unnamed: 0_level_0,mpg,cylinders,displacement,horsepower,weight,acceleration,origin,name
model_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
70,9.0,4,97.0,113,1835,8.0,1,amc ambassador dpl
71,12.0,4,71.0,100,1613,11.5,1,amc gremlin
72,11.0,3,70.0,112,2100,11.0,1,amc ambassador sst
73,11.0,3,68.0,100,1867,9.5,1,amc ambassador brougham
74,13.0,4,71.0,100,1649,13.5,1,amc hornet
75,13.0,4,90.0,100,1795,11.5,1,amc gremlin
76,13.0,4,85.0,100,1795,12.0,1,amc hornet
77,15.0,3,79.0,100,1825,11.1,1,bmw 320i
78,16.2,4,78.0,100,1800,11.2,1,amc concord
79,15.5,4,85.0,110,1915,11.3,1,amc concord dl 6


In [34]:
df.groupby(["model_year","cylinders"]).agg(['min', 'max'])

Unnamed: 0_level_0,Unnamed: 1_level_0,mpg,mpg,displacement,displacement,horsepower,horsepower,weight,weight,acceleration,acceleration,origin,origin,name,name
Unnamed: 0_level_1,Unnamed: 1_level_1,min,max,min,max,min,max,min,max,min,max,min,max,min,max
model_year,cylinders,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2
70,4,24.0,27.0,97.0,121.0,113,95,1835,2672,12.5,20.5,2,3,audi 100 ls,volkswagen 1131 deluxe sedan
70,6,18.0,22.0,198.0,200.0,85,97,2587,2833,15.0,16.0,1,1,amc gremlin,plymouth duster
70,8,9.0,18.0,302.0,455.0,130,225,3086,4732,8.0,18.5,1,1,amc ambassador dpl,pontiac catalina
71,4,22.0,35.0,71.0,140.0,60,?,1613,2408,14.0,20.5,1,3,chevrolet vega (sw),volkswagen model 111
71,6,16.0,19.0,225.0,258.0,100,88,2634,3439,13.0,15.5,1,1,amc gremlin,pontiac firebird
71,8,12.0,14.0,318.0,400.0,150,180,4096,5140,11.5,13.5,1,1,chevrolet impala,pontiac safari (sw)
72,3,19.0,19.0,70.0,70.0,97,97,2330,2330,13.5,13.5,3,3,mazda rx2 coupe,mazda rx2 coupe
72,4,18.0,28.0,96.0,140.0,112,97,2100,2979,14.5,23.5,1,3,chevrolet vega,volvo 145e (sw)
72,8,11.0,17.0,302.0,429.0,130,208,3672,4633,11.0,16.0,1,1,amc ambassador sst,pontiac catalina
73,3,18.0,18.0,70.0,70.0,90,90,2124,2124,13.5,13.5,3,3,maxda rx3,maxda rx3


In [35]:
# Different aggregations per column
df.groupby("model_year").agg({'cylinders': ['min', 'max'], 'acceleration': 'sum'})

Unnamed: 0_level_0,cylinders,cylinders,acceleration
Unnamed: 0_level_1,min,max,sum
model_year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
70,4,8,375.5
71,4,8,424.0
72,3,8,423.5
73,3,8,572.5
74,4,8,437.5
75,4,8,481.5
76,4,8,542.0
77,3,8,432.2
78,4,8,569.0
79,4,8,458.6


#### Q8: Group by model_year and display the min, max, mean and std of weight

In [36]:
# applying several functions at once by passing a list of function names
# NumPy functions such as np.min or np.sum are also accepted, but recent pandas versions emit a FutureWarning for them and recommend the string names

grp = df.groupby('model_year')
grp['weight'].agg(['min', 'max', 'mean', 'std'])



Unnamed: 0_level_0,min,max,mean,std
model_year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
70,1835,4732,3372.793103,852.868663
71,1613,5140,2995.428571,1061.830859
72,2100,4633,3237.714286,974.52096
73,1867,4997,3419.025,974.809133
74,1649,4699,2877.925926,949.308571
75,1795,4668,3176.8,765.179781
76,1795,4380,3078.735294,821.371481
77,1825,4335,2997.357143,912.825902
78,1800,4080,2861.805556,626.023907
79,1915,4360,3055.344828,747.881497


#### Q9: Find minimum "mpg" and maximum "weight" values according to "model_year"

In [37]:
# agg() also accepts a dictionary that maps each column to its aggregation

grp = df.groupby('model_year')
grp.agg({'mpg' : 'min', 'weight' : 'max'})

Unnamed: 0_level_0,mpg,weight
model_year,Unnamed: 1_level_1,Unnamed: 2_level_1
70,9.0,4732
71,12.0,5140
72,11.0,4633
73,11.0,4997
74,13.0,4699
75,13.0,4668
76,13.0,4380
77,15.0,4335
78,16.2,4080
79,15.5,4360


In [38]:
# Another example with dictionary
df.agg({"mpg":["max", "mean"], "weight":["mean", "std"]})

# It shows NaN for the aggregation functions we did not request for a column.
# There is no max value for the weight column and no std value for the mpg column

Unnamed: 0,mpg,weight
max,46.6,
mean,23.514573,2970.424623
std,,846.841774
