# Pandas GroupBy

In [46]:
# Imports
import pandas as pd
import numpy as np

In [47]:
# Loading the data
df = pd.read_csv("cars.csv")

In [3]:
df

,Id,Make,Model,Color,Type,Accident Rate
0,1,Ford,Focus,Black,Sedan,3
1,2,Ford,Mustang,Black,Sport,7
2,3,Ford,Fiesta,Blue,Sedan,2
3,4,Ford,Focus,Blue,Sedan,1
4,5,Ford,Mustang,Red,Sport,4
5,6,Ford,Fiesta,Blue,Sedan,8
6,7,Ford,Focus,Black,Sedan,6
7,8,Ford,Mustang,Red,Sport,1
8,9,Ford,Fiesta,White,Sedan,10
9,10,Ford,Focus,White,Sedan,8


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Id             30 non-null     int64 
 1   Make           30 non-null     object
 2   Model          30 non-null     object
 3   Color          30 non-null     object
 4   Type           30 non-null     object
 5   Accident Rate  30 non-null     int64 
dtypes: int64(2), object(4)
memory usage: 1.5+ KB


# 🐼 
### Print every group using the print function

For this exercise, group the data by color and print each group by passing the **print** function to the **apply** method. Notice that we are not performing any aggregation; we are only printing the groups so that we can easily see how the data was split.

This comes in handy when you want to inspect the data before performing any aggregation.

In [14]:
df.groupby("Color").apply(print)

    Id       Make     Model  Color   Type  Accident Rate
0    1       Ford     Focus  Black  Sedan              3
1    2       Ford   Mustang  Black  Sport              7
6    7       Ford     Focus  Black  Sedan              6
10  11       Ford   Mustang  Black  Sport              5
11  12       Ford    Fiesta  Black  Sedan             10
13  14       Ford   Mustang  Black  Sport              9
15  16  Chevrolet    Camaro  Black  Sport             10
19  20  Chevrolet  Corvette  Black  Sport              3
23  24  Chevrolet    Impala  Black  Sedan              2
24  25  Chevrolet    Camaro  Black  Sport              6
27  28  Chevrolet    Camaro  Black  Sport             10
28  29  Chevrolet  Corvette  Black  Sport              5
29  30  Chevrolet    Impala  Black  Sedan              2
    Id       Make   Model Color   Type  Accident Rate
2    3       Ford  Fiesta  Blue  Sedan              2
3    4       Ford   Focus  Blue  Sedan              1
5    6       Ford  Fiesta  Blue  Sedan  
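
Another way to inspect the groups is to iterate over the GroupBy object, which yields one `(key, sub-frame)` pair per group. The sketch below is only an illustration: `sample` and `group_sizes` are hypothetical names, and the rows are made up rather than read from cars.csv.

```python
import pandas as pd

# Hypothetical sample rows; the column names mirror the cars data
sample = pd.DataFrame({
    "Model": ["Focus", "Mustang", "Fiesta", "Focus"],
    "Color": ["Black", "Black", "Blue", "Blue"],
    "Accident Rate": [3, 7, 2, 1],
})

# Iterating over a GroupBy yields (group key, sub-frame) pairs
group_sizes = {}
for color, group in sample.groupby("Color"):
    print(f"--- {color} ---")
    print(group)
    group_sizes[color] = len(group)
```

Iterating gives you the group key alongside the rows, which can make the printed output easier to read when there are many groups.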

# 🐼🐼
### Compute totals and percentages

Compute the number of cars of each color and the proportion of the total that each color represents.

In [9]:
df.groupby("Color").count() # Counts the non-null values in each column, per group.
df.groupby("Color").sum(numeric_only=True) # Adds up the numeric columns only, per group.

Color,Id,Make,Model,Type,Accident Rate
Black,13,13,13,13,13
Blue,7,7,7,7,7
Red,5,5,5,5,5
White,3,3,3,3,3
Yellow,2,2,2,2,2


Color,Id,Accident Rate
Black,219,78
Blue,94,28
Red,80,30
White,32,22
Yellow,40,16


Group the data by color and count the cars in each group. Then, in a new data frame, add the overall total to every row as a new column and use it to compute the percentage that each color represents.

In [31]:
colors = df[["Color", "Id"]].groupby("Color").count()
colors["Total"] = len(df)
colors["Pctg"] = round(100 * colors["Id"] / colors["Total"], 2)
colors

Color,Id,Total,Pctg
Black,13,30,43.33
Blue,7,30,23.33
Red,5,30,16.67
White,3,30,10.0
Yellow,2,30,6.67
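
The same counts and percentages can probably be obtained more directly with `value_counts`, whose `normalize=True` option returns proportions instead of counts. This is a sketch on made-up data: `colors_col` and `summary` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical color column standing in for df["Color"]
colors_col = pd.Series(["Black", "Black", "Blue", "Red"], name="Color")

# Absolute counts per color
counts = colors_col.value_counts()

# normalize=True returns fractions of the total; scale them to percentages
pctg = (colors_col.value_counts(normalize=True) * 100).round(2)

# Both Series are indexed by color, so they align when combined
summary = pd.DataFrame({"Count": counts, "Pctg": pctg})
print(summary)
```

This avoids the helper `Total` column, at the cost of hiding the arithmetic that the groupby version makes explicit.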


# 🐼🐼🐼
### Group by columns without using any agg function

Imagine you want to arrange your data so that the resulting data frame has the same length as the original and its rows are organized by **Type**. Although you might be tempted to use the groupby method here, that's not the right tool. Keep in mind that a groupby aggregation returns one row per group, so its result is normally smaller than the input. As a rule of thumb, if you only want to rearrange the rows and your input and output have the same length (number of rows), you DON'T need to use groupby.

In [59]:
df.set_index("Type").sort_index()

Type,Id,Make,Model,Color,Accident Rate
Sedan,1,Ford,Focus,Black,3
Sedan,27,Chevrolet,Impala,Blue,7
Sedan,24,Chevrolet,Impala,Black,2
Sedan,21,Chevrolet,Impala,Blue,5
Sedan,18,Chevrolet,Impala,Blue,2
Sedan,13,Ford,Focus,White,4
Sedan,12,Ford,Fiesta,Black,10
Sedan,10,Ford,Focus,White,8
Sedan,15,Ford,Fiesta,Blue,3
Sedan,30,Chevrolet,Impala,Black,2
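
To make the length argument concrete, the sketch below compares the row counts of the two approaches on a small made-up frame. The names `sample`, `by_index`, and `aggregated` are hypothetical, and the rows are not taken from cars.csv.

```python
import pandas as pd

# Hypothetical sample rows with the same column names as the cars data
sample = pd.DataFrame({
    "Type": ["Sport", "Sedan", "Sport", "Sedan", "Sedan"],
    "Accident Rate": [7, 3, 4, 2, 8],
})

# Moving "Type" into the index and sorting keeps every row
by_index = sample.set_index("Type").sort_index()

# Aggregating collapses each group into a single row
aggregated = sample.groupby("Type")["Accident Rate"].sum()

# Row counts: original, re-indexed, aggregated
print(len(sample), len(by_index), len(aggregated))  # prints: 5 5 2
```

The re-indexed frame keeps all five rows, while the aggregation keeps only one row per type.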


# 🐼🐼🐼🐼
### Apply different functions given different conditions

Group by the car maker (Chevrolet and Ford) and return the sum of the accident rate column when the maker is Chevrolet and the mean when the maker is Ford. With your own custom functions, the sky is the limit.

In [52]:
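def sum_mean(x):
    # x is the "Accident Rate" Series of one group; x.name holds the group key (the maker)
    make = x.name.lower()
    if make == "chevrolet":
        return x.sum()
    elif make == "ford":
        return x.mean()
    # Any other maker falls through and returns None

df.groupby("Make")["Accident Rate"].apply(sum_mean)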

Make
Chevrolet    93.0
Ford          5.4
Name: Accident Rate, dtype: float64
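
If the list of makers grows, the if/elif chain could be swapped for a lookup table that maps each maker to its aggregation. This is only one possible variation, not the notebook's approach: `AGG_BY_MAKE`, `aggregate_by_make`, the fallback to the mean, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample rows with the same column names as the cars data
sample = pd.DataFrame({
    "Make": ["Ford", "Ford", "Chevrolet", "Chevrolet"],
    "Accident Rate": [3, 7, 10, 2],
})

# Hypothetical lookup: which aggregation to use for each maker
AGG_BY_MAKE = {
    "chevrolet": pd.Series.sum,
    "ford": pd.Series.mean,
}

def aggregate_by_make(x):
    # x.name holds the group key; unknown makers fall back to the mean (an assumption)
    func = AGG_BY_MAKE.get(x.name.lower(), pd.Series.mean)
    return func(x)

result = sample.groupby("Make")["Accident Rate"].apply(aggregate_by_make)
print(result)
```

Adding a new maker then means adding one dictionary entry rather than another branch.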