# Groupby

If you've used SQL before, you know that **GROUP BY** is a SQL clause that groups rows sharing the same value in a given column, so that aggregate functions like ```sum``` or ```count``` can be applied to each group.

>Similarly, ```groupby``` is a method defined on the ```DataFrame``` object, which does the same thing in pandas.

Arguments: **column name** (a list of column names also works, to group by several columns at once)

In [1]:
import pandas as pd
import numpy as np

In [21]:
# sample data
data = {"Company":["Google", "Google", "IBM", "IBM", "Meta", "Meta"],
        "Employee":["Ram", "Shyam", "Arnav", "Charan", "Rahul", "Sarang"],
        "Profit %":[113.45, 60, 127.33, 100, 212.64, 200]}

df = pd.DataFrame(data)

In [22]:
df

|   | Company | Employee | Profit % |
|---|---|---|---|
| 0 | Google | Ram | 113.45 |
| 1 | Google | Shyam | 60.0 |
| 2 | IBM | Arnav | 127.33 |
| 3 | IBM | Charan | 100.0 |
| 4 | Meta | Rahul | 212.64 |
| 5 | Meta | Sarang | 200.0 |


In [23]:
# groupby Company
df.groupby("Company")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x00000264B3D26190>

Notice that this method does not return a new ```DataFrame``` object. Instead, it returns a ```DataFrameGroupBy``` object, which has its own aggregate methods defined on it, as we will now see.

You can save this object as a new variable:

In [24]:
grouped_df = df.groupby("Company")
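
Before aggregating, it can help to peek inside the grouped object. The following is a minimal sketch that rebuilds the sample data so it runs on its own; the names `n_groups`, `group_sizes` and `ibm_rows` are only illustrative and not part of the original notebook.

```python
import pandas as pd

# same sample data as above, rebuilt so this block is self-contained
data = {"Company": ["Google", "Google", "IBM", "IBM", "Meta", "Meta"],
        "Employee": ["Ram", "Shyam", "Arnav", "Charan", "Rahul", "Sarang"],
        "Profit %": [113.45, 60, 127.33, 100, 212.64, 200]}
df = pd.DataFrame(data)
grouped_df = df.groupby("Company")

# number of distinct groups (one per company)
n_groups = grouped_df.ngroups

# iterating yields (group key, sub-DataFrame) pairs
group_sizes = {}
for name, group in grouped_df:
    group_sizes[name] = len(group)

# pull out the rows belonging to a single group
ibm_rows = grouped_df.get_group("IBM")
```

Nothing is aggregated at this point: each group is simply the subset of rows of ```df``` that share the same ```Company``` value.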

In [25]:
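# mean profit % of the employees in each company
# numeric_only=True skips the text column Employee (newer pandas versions raise an error without it)
grouped_df.mean(numeric_only=True)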

| Company | Profit % |
|---|---|
| Google | 86.725 |
| IBM | 113.665 |
| Meta | 206.32 |


In [26]:
# as usual, we can chain the aggregate call directly onto groupby instead of storing the intermediate object
df.groupby("Company").mean(numeric_only=True)

| Company | Profit % |
|---|---|
| Google | 86.725 |
| IBM | 113.665 |
| Meta | 206.32 |


Some more examples of aggregate methods:

In [27]:
# standard deviation of the numeric columns in each company
grouped_df.std(numeric_only=True)

| Company | Profit % |
|---|---|
| Google | 37.794857 |
| IBM | 19.325228 |
| Meta | 8.93783 |


In [28]:
# minimum of each column within each company
# (computed per column: Employee is the alphabetically first name, Profit % is the lowest profit)
grouped_df.min()

| Company | Employee | Profit % |
|---|---|---|
| Google | Ram | 60.0 |
| IBM | Arnav | 100.0 |
| Meta | Rahul | 200.0 |


In [29]:
# maximum of each column within each company (again computed per column)
grouped_df.max()

| Company | Employee | Profit % |
|---|---|---|
| Google | Shyam | 113.45 |
| IBM | Charan | 127.33 |
| Meta | Sarang | 212.64 |
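
Note that ```min()``` and ```max()``` work column by column, so the ```Employee``` and ```Profit %``` values in one output row do not have to come from the same original row (Ram, for instance, did not make the 60.0 profit). If the goal is to find *which* employee had the lowest profit in each company, one possible approach is ```idxmin()```, sketched below. The data is rebuilt so the block runs on its own, and the names `lowest_idx` and `lowest_rows` are only illustrative.

```python
import pandas as pd

# same sample data as above, rebuilt so this block is self-contained
data = {"Company": ["Google", "Google", "IBM", "IBM", "Meta", "Meta"],
        "Employee": ["Ram", "Shyam", "Arnav", "Charan", "Rahul", "Sarang"],
        "Profit %": [113.45, 60, 127.33, 100, 212.64, 200]}
df = pd.DataFrame(data)

# for each company, the row label of the lowest Profit %
lowest_idx = df.groupby("Company")["Profit %"].idxmin()

# use those labels to fetch the complete original rows
lowest_rows = df.loc[lowest_idx]
```

Swapping ```idxmin()``` for ```idxmax()``` gives the employee with the highest profit in each company in the same way.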


In [30]:
# number of non-null entries per column in each company (here, the number of employees)
grouped_df.count()

| Company | Employee | Profit % |
|---|---|---|
| Google | 2 | 2 |
| IBM | 2 | 2 |
| Meta | 2 | 2 |
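
Because ```count()``` only counts non-null entries, its result can differ from the actual number of rows when data is missing. The sketch below uses a small hypothetical variant of the sample data (one profit value replaced by ```NaN```, which is not in the original notebook) to contrast it with ```size()```, which counts rows regardless of missing values.

```python
import numpy as np
import pandas as pd

# hypothetical data: Shyam's profit is missing
df_missing = pd.DataFrame({
    "Company": ["Google", "Google", "IBM", "IBM"],
    "Employee": ["Ram", "Shyam", "Arnav", "Charan"],
    "Profit %": [113.45, np.nan, 127.33, 100],
})
grouped_missing = df_missing.groupby("Company")

# count(): non-null entries per column, so Google has only 1 profit value
counts = grouped_missing.count()

# size(): number of rows per group, missing values included
sizes = grouped_missing.size()
```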


In [31]:
# getting all stats for each company
grouped_df.describe()

| Company | (Profit %, count) | (Profit %, mean) | (Profit %, std) | (Profit %, min) | (Profit %, 25%) | (Profit %, 50%) | (Profit %, 75%) | (Profit %, max) |
|---|---|---|---|---|---|---|---|---|
| Google | 2.0 | 86.725 | 37.794857 | 60.0 | 73.3625 | 86.725 | 100.0875 | 113.45 |
| IBM | 2.0 | 113.665 | 19.325228 | 100.0 | 106.8325 | 113.665 | 120.4975 | 127.33 |
| Meta | 2.0 | 206.32 | 8.93783 | 200.0 | 203.16 | 206.32 | 209.48 | 212.64 |


Let's try getting a particular statistic from this data, for example the 75% quantile of the profit % for Meta:

In [32]:
# 75% quantile of profit % for Meta
grouped_df.describe()["Profit %"]["75%"]["Meta"]

209.48
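
The same number can also be computed without building the full ```describe()``` table. A minimal sketch, assuming the same sample data (rebuilt here so the block runs on its own; the name `meta_q75` is only illustrative):

```python
import pandas as pd

# same sample data as above, rebuilt so this block is self-contained
data = {"Company": ["Google", "Google", "IBM", "IBM", "Meta", "Meta"],
        "Employee": ["Ram", "Shyam", "Arnav", "Charan", "Rahul", "Sarang"],
        "Profit %": [113.45, 60, 127.33, 100, 212.64, 200]}
df = pd.DataFrame(data)

# select the column first, then ask for the 0.75 quantile of each group
q75_by_company = df.groupby("Company")["Profit %"].quantile(0.75)

# look up a single company by its index label
meta_q75 = q75_by_company.loc["Meta"]
```

Selecting the column before aggregating avoids computing statistics that are not needed, and ```.loc``` makes it explicit that the lookup is by label.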

If we have multiple numeric columns, these aggregate methods are applied to all of them.

In [33]:

df["Profit % Last Year"] = np.random.rand(len(df))*100

In [34]:
df

|   | Company | Employee | Profit % | Profit % Last Year |
|---|---|---|---|---|
| 0 | Google | Ram | 113.45 | 83.191767 |
| 1 | Google | Shyam | 60.0 | 74.445824 |
| 2 | IBM | Arnav | 127.33 | 57.824434 |
| 3 | IBM | Charan | 100.0 | 15.098301 |
| 4 | Meta | Rahul | 212.64 | 77.785399 |
| 5 | Meta | Sarang | 200.0 | 38.414841 |


In [35]:
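# sum every numeric column per company; numeric_only=True leaves out the text column Employee
df.groupby("Company").sum(numeric_only=True)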

| Company | Profit % | Profit % Last Year |
|---|---|---|
| Google | 173.45 | 157.637592 |
| IBM | 227.33 | 72.922735 |
| Meta | 412.64 | 116.20024 |
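
When different columns need different statistics, applying one method to everything is not always what you want. One option is the ```agg``` method with named aggregation, sketched below on the original three-column sample data. The output column names `mean_profit`, `max_profit` and `employees` are made up for this illustration.

```python
import pandas as pd

# original sample data, rebuilt so this block is self-contained
data = {"Company": ["Google", "Google", "IBM", "IBM", "Meta", "Meta"],
        "Employee": ["Ram", "Shyam", "Arnav", "Charan", "Rahul", "Sarang"],
        "Profit %": [113.45, 60, 127.33, 100, 212.64, 200]}
df = pd.DataFrame(data)

# each keyword is an output column: (input column, aggregate function)
summary = df.groupby("Company").agg(
    mean_profit=("Profit %", "mean"),
    max_profit=("Profit %", "max"),
    employees=("Employee", "count"),
)
```

The result is an ordinary ```DataFrame``` indexed by ```Company```, with one column per requested statistic.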
