# **Guided LAB - 343.4.6 - Pandas Grouping and Aggregate Functions**

---



## **Lab Overview:**

In this lab, we will demonstrate how to group a DataFrame by a single column or by multiple columns, and how to apply aggregate functions to the resulting groups.

## **Learning Objective**
By the end of this lab, learners will be able to:
- Utilize the groupby() function.
- Combine aggregate functions with the groupby() function for data manipulation.

## **Dataset**
**In this lab we will use the student_scores.csv dataset. [Click here to download the dataset.](https://drive.google.com/file/d/1GxvbD5kV6-zzrbDS3uXUlPtm14sSZkBc/view?usp=sharing)**

## **Introduction:**

Similar to the SQL GROUP BY clause, the pandas DataFrame.groupby() function collects rows with identical key values into groups so that aggregate functions can be applied to each group. A group-by operation involves splitting the data into groups, applying a function to each group, and combining the results.

In pandas, you can combine groupby() with sum(), mean(), aggregate(), and many other methods.
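The split, apply, and combine steps described above can be sketched by hand and compared with a single groupby() call. This is a minimal illustration, assuming a small hand-built `sample` DataFrame that copies a few rows from the lab dataset; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample: a few rows copied from the lab dataset
sample = pd.DataFrame({
    "first_name": ["John", "John", "John", "Jane", "Jane", "Jane"],
    "Subject": ["Calculus", "Geometry", "Statistics"] * 2,
    "score": [63, 65, 63, 63, 64, 94],
})

# Split: one sub-frame per distinct Subject
pieces = {subject: part for subject, part in sample.groupby("Subject")}

# Apply: compute a statistic on each piece separately
means_by_hand = {subject: part["score"].mean() for subject, part in pieces.items()}

# Combine: groupby performs all three steps in one expression
means = sample.groupby("Subject")["score"].mean()

print(means_by_hand)
print(means)
```

Both approaches give the same per-subject averages; the groupby() form is shorter and returns a Series indexed by the group key.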

**Syntax of groupby() function**

```
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)
```

- by – Column name, list of column names, or index label(s) to group by.
- axis – Defaults to 0. Accepts 0 or ‘index’, 1 or ‘columns’. This parameter is deprecated in recent pandas releases.
- level – Used with a MultiIndex to group by one or more index levels.
- as_index – Defaults to True, which returns the group keys as the index of the result. Set it to False for SQL-style grouped output, where the keys remain regular columns.
- sort – Defaults to True. Specifies whether to sort the group keys.
- group_keys – Whether to add the group keys to the index when calling apply().
- observed – Applies only if any of the groupers are Categorical.
- dropna – Defaults to True. If True, rows whose group key is NA are dropped. If False, NA values are treated as a key and form their own group.
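The effect of `as_index`, `dropna`, and `sort` is easiest to see side by side. The sketch below assumes a tiny made-up DataFrame (the row with a missing class is hypothetical and not part of the lab dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical data: one row has a missing group key
sample = pd.DataFrame({
    "class": ["B", "A", np.nan, "A"],
    "score": [94, 63, 70, 88],
})

# Defaults: keys become the index, keys are sorted, NA keys are dropped
default = sample.groupby("class")["score"].sum()

# as_index=False keeps "class" as a regular column (SQL-style output)
flat = sample.groupby("class", as_index=False)["score"].sum()

# dropna=False keeps rows with a missing key as their own group
with_na = sample.groupby("class", dropna=False)["score"].sum()

# sort=False keeps the groups in order of first appearance
unsorted = sample.groupby("class", sort=False)["score"].sum()

print(default, flat, with_na, unsorted, sep="\n\n")
```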


# **Instruction:**
To work through several group-by examples, first import the student_scores.csv file into a pandas DataFrame.

In [14]:
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("./Data/student_scores.csv", header=0)
df

|  | id | first_name | last_name | birth | gender | class | Subject | score |
|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | John | Doe | 2000-01-01 | M | A | Calculus | 63 |
| 1 | 10001 | John | Doe | 2000-01-01 | M | A | Geometry | 65 |
| 2 | 10001 | John | Doe | 2000-01-01 | M | A | Statistics | 63 |
| 3 | 10002 | Jane | Smith | 2000-01-02 | F | B | Calculus | 63 |
| 4 | 10002 | Jane | Smith | 2000-01-02 | F | B | Geometry | 64 |
| 5 | 10002 | Jane | Smith | 2000-01-02 | F | B | Statistics | 94 |
| 6 | 10003 | Sarah | Thomas | 2000-01-03 | M | B | Calculus | 96 |
| 7 | 10003 | Sarah | Thomas | 2000-01-03 | M | B | Geometry | 73 |
| 8 | 10003 | Sarah | Thomas | 2000-01-03 | M | B | Statistics | 61 |
| 9 | 10004 | Frank | Brown | 2000-01-04 | M | A | Calculus | 88 |


In [3]:
df.shape

(27, 8)

In [4]:
df.describe()

|  | id | score |
|---|---|---|
| count | 27.0 | 27.0 |
| mean | 10005.0 | 77.703704 |
| std | 2.631174 | 11.591823 |
| min | 10001.0 | 61.0 |
| 25% | 10003.0 | 67.0 |
| 50% | 10005.0 | 78.0 |
| 75% | 10007.0 | 89.0 |
| max | 10009.0 | 96.0 |


In [5]:
df.info()

info() is a method, so the parentheses are required. It prints a summary of the DataFrame: each column's name, non-null count, and dtype.

# **Split Data into Groups**

- The **by** parameter accepts a single column or a list of columns.
- A pandas object can be split into groups in many ways. The **groups** attribute of the grouped object returns a dictionary that maps each group key to the row labels belonging to that group.




### **Example: Groupby using a single column – It forms the groups by using a single column.**

In [6]:
item_group = df.groupby('first_name')
item_group.groups

{'Bob': [24, 25, 26], 'Frank': [9, 10, 11], 'Fred': [21, 22, 23], 'Jane': [3, 4, 5], 'Jennifer': [15, 16, 17], 'Jessica': [18, 19, 20], 'John': [0, 1, 2], 'Mike': [12, 13, 14], 'Sarah': [6, 7, 8]}

### **Example: Groupby using multiple columns – It forms the group by using multiple columns.**

In [7]:
Groupby_MultipleColumns = df.groupby(["first_name", "last_name"])
Groupby_MultipleColumns.groups

{('Bob', 'Lopez'): [24, 25, 26], ('Frank', 'Brown'): [9, 10, 11], ('Fred', 'Clark'): [21, 22, 23], ('Jane', 'Smith'): [3, 4, 5], ('Jennifer', 'Wilson'): [15, 16, 17], ('Jessica', 'Garcia'): [18, 19, 20], ('John', 'Doe'): [0, 1, 2], ('Mike', 'Davis'): [12, 13, 14], ('Sarah', 'Thomas'): [6, 7, 8]}

### **Example: Iterating through Groups**
You can also print the elements of each group by iterating over the grouped object with a for loop.

In [9]:
for name, group in item_group:
    print('{}:'.format(name))
    print(group, '\n')

Bob:
       id first_name last_name       birth gender class     Subject  score
24  10009        Bob     Lopez  2000-01-09      F     B    Calculus     91
25  10009        Bob     Lopez  2000-01-09      F     B    Geometry     81
26  10009        Bob     Lopez  2000-01-09      F     B  Statistics     75 

Frank:
       id first_name last_name       birth gender class     Subject  score
9   10004      Frank     Brown  2000-01-04      M     A    Calculus     88
10  10004      Frank     Brown  2000-01-04      M     A    Geometry     73
11  10004      Frank     Brown  2000-01-04      M     A  Statistics     86 

Fred:
       id first_name last_name       birth gender class     Subject  score
21  10008       Fred     Clark  2000-01-08      F     C    Calculus     66
22  10008       Fred     Clark  2000-01-08      F     C    Geometry     71
23  10008       Fred     Clark  2000-01-08      F     C  Statistics     94 

Jane:
      id first_name last_name       birth gender class     Subject  sc
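When the grouping uses several columns, each iteration yields the key as a tuple, which can be unpacked directly in the loop header. A minimal sketch, assuming a hand-built `sample` that copies a few rows from the lab dataset:

```python
import pandas as pd

# Hypothetical sample: a few rows copied from the lab dataset
sample = pd.DataFrame({
    "first_name": ["John"] * 3 + ["Jane"] * 3,
    "last_name": ["Doe"] * 3 + ["Smith"] * 3,
    "score": [63, 65, 63, 63, 64, 94],
})

lines = []
# With two grouping columns, the group name is a (first_name, last_name) tuple
for (first, last), group in sample.groupby(["first_name", "last_name"]):
    lines.append(f"{first} {last}: {len(group)} rows, best score {group['score'].max()}")

print("\n".join(lines))
```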

### **Example: Selecting a Group**
The **get_group()** method is used to select a particular group.

In [10]:
item_group = df.groupby('Subject')
#item_group.groups
item_group.get_group('Calculus')

|  | id | first_name | last_name | birth | gender | class | Subject | score |
|---|---|---|---|---|---|---|---|---|
| 0 | 10001 | John | Doe | 2000-01-01 | M | A | Calculus | 63 |
| 3 | 10002 | Jane | Smith | 2000-01-02 | F | B | Calculus | 63 |
| 6 | 10003 | Sarah | Thomas | 2000-01-03 | M | B | Calculus | 96 |
| 9 | 10004 | Frank | Brown | 2000-01-04 | M | A | Calculus | 88 |
| 12 | 10005 | Mike | Davis | 2000-01-05 | F | C | Calculus | 94 |
| 15 | 10006 | Jennifer | Wilson | 2000-01-06 | M | C | Calculus | 90 |
| 18 | 10007 | Jessica | Garcia | 2000-01-07 | F | B | Calculus | 70 |
| 21 | 10008 | Fred | Clark | 2000-01-08 | F | C | Calculus | 66 |
| 24 | 10009 | Bob | Lopez | 2000-01-09 | F | B | Calculus | 91 |
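Selecting a group with get_group() picks out the same rows as filtering the original DataFrame with a boolean mask on the grouping column. The sketch below assumes a small hand-built `sample` with rows copied from the lab dataset.

```python
import pandas as pd

# Hypothetical sample: a few rows copied from the lab dataset
sample = pd.DataFrame({
    "first_name": ["John", "John", "Jane", "Jane"],
    "Subject": ["Calculus", "Geometry", "Calculus", "Geometry"],
    "score": [63, 65, 63, 64],
})

by_subject = sample.groupby("Subject")

# Rows belonging to the "Calculus" group
calculus = by_subject.get_group("Calculus")

# The same rows selected with a boolean mask
same_rows = sample[sample["Subject"] == "Calculus"]

print(calculus)
print(calculus.index.equals(same_rows.index))
```

get_group() is convenient when the grouped object already exists; for a one-off selection, the boolean mask avoids building the groups at all.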


### **Example: Groupby – Aggregations**

You can use aggregation functions such as mean, sum, etc. to get an aggregate value for each group. Aggregation functions are applied once the groupby object has been created.

Let’s calculate the average score of each Subject.

In [11]:
# Directly using mean() function
agg_group_subject = df.groupby('Subject')['score'].mean()
agg_group_subject

Subject
Calculus      80.111111
Geometry      74.000000
Statistics    79.000000
Name: score, dtype: float64

**Alternatively**, the line below gives the same output.


In [13]:
agg_group_subject = df.groupby('Subject')['score'].agg('mean')
agg_group_subject

Subject
Calculus      80.111111
Geometry      74.000000
Statistics    79.000000
Name: score, dtype: float64

Let’s calculate the average score of each Student.

In [15]:
agg_group_stu = df.groupby(["first_name", "last_name"])['score'].mean()
print(agg_group_stu)

first_name  last_name
Bob         Lopez        82.333333
Frank       Brown        82.333333
Fred        Clark        77.000000
Jane        Smith        73.666667
Jennifer    Wilson       78.666667
Jessica     Garcia       77.666667
John        Doe          63.666667
Mike        Davis        87.333333
Sarah       Thomas       76.666667
Name: score, dtype: float64


**Alternatively**, the line below gives the same output.

In [16]:
agg_group_stu = df.groupby(["first_name", "last_name"])['score'].agg('mean')
print(agg_group_stu)

first_name  last_name
Bob         Lopez        82.333333
Frank       Brown        82.333333
Fred        Clark        77.000000
Jane        Smith        73.666667
Jennifer    Wilson       78.666667
Jessica     Garcia       77.666667
John        Doe          63.666667
Mike        Davis        87.333333
Sarah       Thomas       76.666667
Name: score, dtype: float64


### **Example: Aggregation group for Multiple columns:**
You can group by multiple columns and pass a list of functions to agg() to compute several aggregate values at once.

Let’s calculate the average and total score of each student.

In [None]:
# Pass the function names as strings; passing np.mean or np.sum directly is deprecated in recent pandas versions.
agg_group = df.groupby(["first_name", "last_name"])['score'].agg(["mean", "sum"])
print(agg_group)

                           mean  sum
first_name last_name                
Bob        Lopez      82.333333  247
Frank      Brown      82.333333  247
Fred       Clark      77.000000  231
Jane       Smith      73.666667  221
Jennifer   Wilson     78.666667  236
Jessica    Garcia     77.666667  233
John       Doe        63.666667  191
Mike       Davis      87.333333  262
Sarah      Thomas     76.666667  230
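If more descriptive column names are wanted in the result, pandas also supports named aggregation, where each keyword argument to agg() is an output column name paired with a (column, function) tuple. A minimal sketch, assuming a hand-built `sample` with rows copied from the lab dataset; the output column names `average_score` and `total_score` are illustrative choices.

```python
import pandas as pd

# Hypothetical sample: a few rows copied from the lab dataset
sample = pd.DataFrame({
    "first_name": ["John"] * 3 + ["Jane"] * 3,
    "last_name": ["Doe"] * 3 + ["Smith"] * 3,
    "score": [63, 65, 63, 63, 64, 94],
})

# Named aggregation: output_name=(input_column, function_name)
summary = sample.groupby(["first_name", "last_name"]).agg(
    average_score=("score", "mean"),
    total_score=("score", "sum"),
)

print(summary)
```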




### **Example: Let's count the number of score records for each student**

In [19]:
agg_group_count = df.groupby(["first_name", "last_name"])["id"].count()

agg_group_count

first_name  last_name
Bob         Lopez        3
Frank       Brown        3
Fred        Clark        3
Jane        Smith        3
Jennifer    Wilson       3
Jessica     Garcia       3
John        Doe          3
Mike        Davis        3
Sarah       Thomas       3
Name: id, dtype: int64
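The count above is the number of rows per student, not the number of students. The sketch below contrasts count(), size(), and two ways of counting the distinct students, assuming a hand-built `sample` with rows copied from the lab dataset.

```python
import pandas as pd

# Hypothetical sample: a few rows copied from the lab dataset
sample = pd.DataFrame({
    "id": [10001] * 3 + [10002] * 3,
    "Subject": ["Calculus", "Geometry", "Statistics"] * 2,
    "score": [63, 65, 63, 63, 64, 94],
})

grouped = sample.groupby("id")

# count(): non-null values of the selected column in each group
scores_per_student = grouped["score"].count()

# size(): number of rows in each group, including rows with missing values
rows_per_student = grouped.size()

# Number of distinct students, computed two ways
n_students = sample["id"].nunique()
n_groups = grouped.ngroups

print(scores_per_student)
print(rows_per_student)
print(n_students, n_groups)
```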

### **Example: Find the highest score of each Student**

In [20]:
df.groupby(["first_name", "last_name"]).max()

| first_name | last_name | id | birth | gender | class | Subject | score |
|---|---|---|---|---|---|---|---|
| Bob | Lopez | 10009 | 2000-01-09 | F | B | Statistics | 91 |
| Frank | Brown | 10004 | 2000-01-04 | M | A | Statistics | 88 |
| Fred | Clark | 10008 | 2000-01-08 | F | C | Statistics | 94 |
| Jane | Smith | 10002 | 2000-01-02 | F | B | Statistics | 94 |
| Jennifer | Wilson | 10006 | 2000-01-06 | M | C | Statistics | 90 |
| Jessica | Garcia | 10007 | 2000-01-07 | F | B | Statistics | 83 |
| John | Doe | 10001 | 2000-01-01 | M | A | Statistics | 65 |
| Mike | Davis | 10005 | 2000-01-05 | F | C | Statistics | 94 |
| Sarah | Thomas | 10003 | 2000-01-03 | M | B | Statistics | 96 |
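Calling max() on the whole grouped frame takes the maximum of every column independently, so the Subject value in each row is simply the alphabetically last subject, not the subject in which the top score was earned. One way to address this, sketched below on a hand-built `sample` with rows copied from the lab dataset, is to aggregate only the score column, or to use idxmax() to retrieve the full row of each student's best score.

```python
import pandas as pd

# Hypothetical sample: a few rows copied from the lab dataset
sample = pd.DataFrame({
    "first_name": ["John"] * 3 + ["Jane"] * 3,
    "last_name": ["Doe"] * 3 + ["Smith"] * 3,
    "Subject": ["Calculus", "Geometry", "Statistics"] * 2,
    "score": [63, 65, 63, 63, 64, 94],
})

keys = ["first_name", "last_name"]

# Highest score per student, aggregating only the score column
highest = sample.groupby(keys)["score"].max()

# idxmax() returns the row label of each group's top score;
# .loc then pulls the complete rows, including the matching Subject
best_rows = sample.loc[sample.groupby(keys)["score"].idxmax()]

print(highest)
print(best_rows)
```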


### **Example: Find the lowest score of each Student**

In [21]:
df.groupby(["first_name", "last_name"]).min()

| first_name | last_name | id | birth | gender | class | Subject | score |
|---|---|---|---|---|---|---|---|
| Bob | Lopez | 10009 | 2000-01-09 | F | B | Calculus | 75 |
| Frank | Brown | 10004 | 2000-01-04 | M | A | Calculus | 73 |
| Fred | Clark | 10008 | 2000-01-08 | F | C | Calculus | 66 |
| Jane | Smith | 10002 | 2000-01-02 | F | B | Calculus | 63 |
| Jennifer | Wilson | 10006 | 2000-01-06 | M | C | Calculus | 68 |
| Jessica | Garcia | 10007 | 2000-01-07 | F | B | Calculus | 70 |
| John | Doe | 10001 | 2000-01-01 | M | A | Calculus | 63 |
| Mike | Davis | 10005 | 2000-01-05 | F | C | Calculus | 78 |
| Sarah | Thomas | 10003 | 2000-01-03 | M | B | Calculus | 61 |
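The lowest and highest scores can also be computed in a single pass by passing both function names to agg(), which makes it easy to derive further values such as the score range. A minimal sketch, assuming a hand-built `sample` with rows copied from the lab dataset; the `range` column is an illustrative addition.

```python
import pandas as pd

# Hypothetical sample: a few rows copied from the lab dataset
sample = pd.DataFrame({
    "first_name": ["John"] * 3 + ["Jane"] * 3,
    "last_name": ["Doe"] * 3 + ["Smith"] * 3,
    "score": [63, 65, 63, 63, 64, 94],
})

# One column per aggregation function, restricted to the score column
summary = sample.groupby(["first_name", "last_name"])["score"].agg(["min", "max"])

# Derived column: spread between each student's best and worst score
summary["range"] = summary["max"] - summary["min"]

print(summary)
```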
