## 4.8 - Grouping Data & Aggregating Variables
- Group data by using the groupby() function
- Use aggregation functions when deriving new columns

## Import Libraries and Paths/Data

In [1]:
#Import Libraries
import pandas as pd
import numpy as np
import os

In [2]:
#Create path
path = r'/Users/puneet/Desktop/Instacart Basket Analysis 08-2025'

In [3]:
#Insert Data
ords_prods_merge = pd.read_pickle(os.path.join(path,'02 Data', 'Prepared Data', 'ords_prods_merge_new.pkl'))

In [4]:
#Take 1M row sample
df = ords_prods_merge[:1000000]

In [5]:
#Check shape
df.shape

(1000000, 18)

In [6]:
#Check first few rows
df.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_the_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days,busiest_period_of_day
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Mid range product,Regularly busy,Regular days,Average hours
1,2539329,1,1,2,8,,14084,2,0,Organic Unsweetened Vanilla Almond Milk,91,16,12.5,both,Mid range product,Regularly busy,Regular days,Average hours
2,2539329,1,1,2,8,,12427,3,0,Original Beef Jerky,23,19,4.4,both,Low range product,Regularly busy,Regular days,Average hours
3,2539329,1,1,2,8,,26088,4,0,Aged White Cheddar Popcorn,23,19,4.7,both,Low range product,Regularly busy,Regular days,Average hours
4,2539329,1,1,2,8,,26405,5,0,XL Pick-A-Size Paper Towel Rolls,54,17,1.0,both,Low range product,Regularly busy,Regular days,Average hours
5,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Mid range product,Regularly busy,Slowest days,Average hours
6,2398795,1,2,3,7,15.0,10258,2,0,Pistachios,117,19,3.0,both,Low range product,Regularly busy,Slowest days,Average hours
7,2398795,1,2,3,7,15.0,12427,3,1,Original Beef Jerky,23,19,4.4,both,Low range product,Regularly busy,Slowest days,Average hours
8,2398795,1,2,3,7,15.0,13176,4,0,Bag of Organic Bananas,24,4,10.3,both,Mid range product,Regularly busy,Slowest days,Average hours
9,2398795,1,2,3,7,15.0,26088,5,1,Aged White Cheddar Popcorn,23,19,4.7,both,Low range product,Regularly busy,Slowest days,Average hours


## -

## Grouping Data with pandas

In [7]:
#groupby() function
df.groupby('product_name')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12f9d5160>

This is not a result table. The output is only the representation of a DataFrameGroupBy object, which confirms that the grouping was set up. groupby() should always be one part of a series of steps:
1. Split the data into groups based on some criteria.
2. Apply a function to each group separately.
3. Combine the results into a dataframe or alternative data structure or create a new column in the current dataframe.

The first step is complete. The next step requires an aggregation.
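
To make the three steps concrete, here is a minimal sketch on a made-up toy dataframe (the `toy` name and its values are hypothetical and are not taken from the Instacart data). It shows that groupby() on its own only prepares the groups, and that numbers appear once a function is applied.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Instacart dataset
toy = pd.DataFrame({
    'product_name': ['Soda', 'Soda', 'Pistachios'],
    'prices': [9.0, 9.0, 3.0],
})

# Step 1: split. This only builds a GroupBy object; nothing is computed yet.
grouped = toy.groupby('product_name')
print(type(grouped).__name__)  # DataFrameGroupBy

# Steps 2 and 3: apply a function to each group and combine the results.
mean_prices = grouped['prices'].mean()
print(mean_prices)
```

The combined result has one row per group, indexed by the grouping column.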

## Aggregating Data with agg()

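- Goal: calculate the mean of the 'order_number' column for each department. Because 'order_number' counts a customer's orders in sequence, this is the average order number at which products from that department are purchased.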

1. Split the data into groups based on “department_id.”
2. Apply the agg() function to each group to obtain the mean values for the “order_number” column.

In [8]:
df.groupby('department_id').agg({'order_number':['mean']})

department_id,order_number (mean)
1,14.800024
2,17.091743
3,17.913544
4,17.893092
5,15.21427
6,15.382135
7,17.694027
8,16.458105
9,15.957363
10,20.091818


Here, the groupby() function is called on the df dataframe, which creates a pandas GroupBy object keyed on “department_id.” The agg() function is then applied to this object and returns the mean of the given column, in this case “order_number,” for each department.

There are some aggregations that can be conducted without use of the agg() function. For instance, the command above could be replaced with a command that uses the mean() function to achieve the same results:

In [9]:
#Alternate way to groupby()
df.groupby('department_id')['order_number'].mean()

department_id
1     14.800024
2     17.091743
3     17.913544
4     17.893092
5     15.214270
6     15.382135
7     17.694027
8     16.458105
9     15.957363
10    20.091818
11    16.482026
12    15.615061
13    16.484023
14    17.524632
15    15.691875
16    18.014071
17    16.150593
18    19.602850
19    17.631340
20    17.138607
21    21.956893
Name: order_number, dtype: float64

Just remember the key difference in syntax between the two methods: when using agg(), put the column you want to aggregate inside the parentheses of the agg() function as an argument. When using mean() (or any other standard aggregation function), simply index the column with square brackets, then follow it with the function you want to use after the dot.
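
The sketch below puts the two syntaxes side by side on a hypothetical toy dataframe (`toy` and its values are assumptions for illustration). Besides the syntax, it shows a second difference: the dictionary form of agg() returns a dataframe with two-level column labels, while the bracket-plus-mean() form returns a series.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({
    'department_id': [1, 1, 2, 2],
    'order_number': [1, 3, 10, 20],
})

# agg() with a dictionary: the column is named inside the parentheses
via_agg = toy.groupby('department_id').agg({'order_number': ['mean']})

# Bracket indexing followed by mean(): the column is selected first
via_mean = toy.groupby('department_id')['order_number'].mean()

print(type(via_agg).__name__, list(via_agg.columns))
print(type(via_mean).__name__)

# The computed values are identical in both cases
print((via_agg[('order_number', 'mean')] == via_mean).all())
```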

## -

### Performing Multiple Aggregations

In [10]:
#Aggregate by Min, Max, and Mean
df.groupby('department_id').agg({'order_number':['mean','min','max']})

department_id,order_number (mean),order_number (min),order_number (max)
1,14.800024,1,99
2,17.091743,1,98
3,17.913544,1,99
4,17.893092,1,99
5,15.21427,1,99
6,15.382135,1,99
7,17.694027,1,99
8,16.458105,1,91
9,15.957363,1,99
10,20.091818,1,99
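
The dictionary form of agg() labels the result columns on two levels (column name, then function name). One possible alternative, sketched here on hypothetical toy data, is pandas' named aggregation, which gives each result column a single flat name. The output names `order_mean`, `order_min`, and `order_max` are made up for this example.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataframe
toy = pd.DataFrame({
    'department_id': [1, 1, 2, 2],
    'order_number': [1, 3, 10, 20],
})

# Named aggregation: new_column_name=(source_column, function)
summary = toy.groupby('department_id').agg(
    order_mean=('order_number', 'mean'),
    order_min=('order_number', 'min'),
    order_max=('order_number', 'max'),
)
print(summary)
```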


## -

### Aggregating Data with transform()

- Creating flags from data after grouping and aggregating
- Finding "Loyal" customers
    - First use the transform() function to create a new column containing the maximum value of the "order_number" column for each user
    - Then use loc to create a second flag column that labels each customer's loyalty level

Loyalty Flag Criteria:

- If the maximum orders the user has made is over 40, then the customer will be labeled a “Loyal customer.”
- If the maximum orders the user has made is over 10 but less than or equal to 40, then the customer will be labeled a “Regular customer.”
- If the maximum orders the user has made is less than or equal to 10, then the customer will be labeled a “New customer.”

Now, let’s map this task onto the three-step process introduced earlier:

1. Split the data into groups based on the “user_id” column.
2. Apply the transform() function on the “order_number” column to generate the maximum orders for each user.
3. Create a new column, “max_order,” into which you’ll place the results of your aggregation.

In [11]:
#Line of code for "max_order" column
ords_prods_merge['max_order'] = ords_prods_merge.groupby(['user_id'])['order_number'].transform('max')

First, a new column called “max_order” is created, which stores the maximum order number for each user (step 3). Then, the ords_prods_merge dataframe is grouped by the “user_id” column (step 1). Finally, the transform() function is applied to the “order_number” column with the 'max' argument (step 2). Passing the string 'max' instead of np.max gives the same result and avoids the FutureWarning that recent pandas versions emit when a NumPy function is passed to transform().
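
To see why transform() is used here rather than a plain aggregation, the sketch below compares the two on a hypothetical toy dataframe (`toy` and its values are assumptions). It passes the string 'max', which pandas accepts as the name of its built-in maximum.

```python
import pandas as pd

# Hypothetical toy data: user 1 has three orders, user 2 has two
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# A plain aggregation collapses each group to a single row
per_user = toy.groupby('user_id')['order_number'].max()
print(len(per_user))  # 2, one value per user

# transform() returns one value per original row, so the result
# can be assigned straight back as a new column
toy['max_order'] = toy.groupby('user_id')['order_number'].transform('max')
print(toy)
```

Because the transformed result has the same length and index as the original dataframe, every row of a user carries that user's maximum.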

In [12]:
#See new max_order column
ords_prods_merge.head(10)

Unnamed: 0,order_id,user_id,order_number,orders_day_of_the_week,order_hour_of_day,days_since_prior_order,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,prices,_merge,price_range_loc,busiest_day,busiest_days,busiest_period_of_day,max_order
0,2539329,1,1,2,8,,196,1,0,Soda,77,7,9.0,both,Mid range product,Regularly busy,Regular days,Average hours,10
1,2539329,1,1,2,8,,14084,2,0,Organic Unsweetened Vanilla Almond Milk,91,16,12.5,both,Mid range product,Regularly busy,Regular days,Average hours,10
2,2539329,1,1,2,8,,12427,3,0,Original Beef Jerky,23,19,4.4,both,Low range product,Regularly busy,Regular days,Average hours,10
3,2539329,1,1,2,8,,26088,4,0,Aged White Cheddar Popcorn,23,19,4.7,both,Low range product,Regularly busy,Regular days,Average hours,10
4,2539329,1,1,2,8,,26405,5,0,XL Pick-A-Size Paper Towel Rolls,54,17,1.0,both,Low range product,Regularly busy,Regular days,Average hours,10
5,2398795,1,2,3,7,15.0,196,1,1,Soda,77,7,9.0,both,Mid range product,Regularly busy,Slowest days,Average hours,10
6,2398795,1,2,3,7,15.0,10258,2,0,Pistachios,117,19,3.0,both,Low range product,Regularly busy,Slowest days,Average hours,10
7,2398795,1,2,3,7,15.0,12427,3,1,Original Beef Jerky,23,19,4.4,both,Low range product,Regularly busy,Slowest days,Average hours,10
8,2398795,1,2,3,7,15.0,13176,4,0,Bag of Organic Bananas,24,4,10.3,both,Mid range product,Regularly busy,Slowest days,Average hours,10
9,2398795,1,2,3,7,15.0,26088,5,1,Aged White Cheddar Popcorn,23,19,4.7,both,Low range product,Regularly busy,Slowest days,Average hours,10


In [13]:
#Option that removes the limit on the number of rows pandas displays
pd.options.display.max_rows = None

In [14]:
#Check 'max_order' values
ords_prods_merge['max_order'].value_counts()

max_order
99    1171333
8      811843
6      811396
9      810213
7      803979
5      793140
10     773124
11     769579
4      753543
12     744454
13     736359
14     733970
15     696131
3      686741
17     661981
16     655604
18     621638
19     613711
20     595323
22     593612
21     580515
23     534271
24     529905
25     521960
26     503273
27     487635
28     460949
29     458788
30     458746
31     449436
32     424493
34     418067
33     413854
36     398947
37     374638
40     367311
35     366236
39     357489
41     347926
38     347901
43     337202
44     334794
50     334358
42     331569
47     329459
46     317294
45     316758
49     300430
48     292467
51     285693
53     285458
52     285211
54     240131
56     237279
55     227599
57     200635
58     192981
60     192055
59     188261
61     174191
62     164176
64     161095
63     159384
65     154757
66     133933
67     132211
68     126181
71     122030
73     117901
74     115302
69     111

Now that the “max_order” column is available, loc can be used to assign a loyalty flag to every customer.

## -

### Deriving Columns with loc()

In [15]:
#Use loc to create new loyalty column part 1: loyal customers
ords_prods_merge.loc[ords_prods_merge['max_order'] > 40, 'loyalty_flag'] = 'Loyal customer'

In [16]:
#Use loc to create new loyalty column part 2: regular customers
ords_prods_merge.loc[(ords_prods_merge['max_order'] <= 40) & (ords_prods_merge['max_order'] > 10), 'loyalty_flag'] = 'Regular customer'

In [17]:
#Use loc to create new loyalty column part 3: new customers
ords_prods_merge.loc[ords_prods_merge['max_order'] <= 10, 'loyalty_flag'] = 'New customer'

In [18]:
#Test new loyalty_flag column
ords_prods_merge['loyalty_flag'].value_counts(dropna = False)

loyalty_flag
Regular customer    15876776
Loyal customer      10284093
New customer         6243990
Name: count, dtype: int64
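
As a cross-check on the thresholds, the same three bands can be expressed in one step with pd.cut. This is only a sketch on hypothetical toy values, not the method used in this notebook, and it differs in one respect: pd.cut returns a categorical column rather than plain strings.

```python
import pandas as pd

# Hypothetical max_order values chosen to sit on both sides of each threshold
toy = pd.DataFrame({'max_order': [3, 10, 11, 40, 41, 99]})

# Bins are right-inclusive by default: (0, 10], (10, 40], (40, inf]
toy['loyalty_flag'] = pd.cut(
    toy['max_order'],
    bins=[0, 10, 40, float('inf')],
    labels=['New customer', 'Regular customer', 'Loyal customer'],
)
print(toy)
```

The boundary values 10 and 40 fall into the lower band in each case, matching the "less than or equal to" wording of the loyalty criteria.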

In [19]:
#Check the first 60 rows of only 3 columns: head() applied to a subset of columns
ords_prods_merge[['user_id','loyalty_flag','order_number']].head(60)

Unnamed: 0,user_id,loyalty_flag,order_number
0,1,New customer,1
1,1,New customer,1
2,1,New customer,1
3,1,New customer,1
4,1,New customer,1
5,1,New customer,2
6,1,New customer,2
7,1,New customer,2
8,1,New customer,2
9,1,New customer,2


While this may seem somewhat complicated, a closer look shows that it combines things you’ve dealt with before. The outer set of brackets is the same as the brackets used in df['column'] above, while the inner set of brackets indicates that what you’re indexing is a list of multiple columns (and lists are always written within brackets). This is why there are two sets of brackets.
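
A short sketch of the difference, using a hypothetical toy dataframe (`toy` and its values are assumptions): a single string inside the brackets returns a series, while a list inside the brackets returns a dataframe.

```python
import pandas as pd

# Hypothetical toy data with the same three column names
toy = pd.DataFrame({
    'user_id': [1, 1, 2],
    'loyalty_flag': ['New customer', 'New customer', 'Loyal customer'],
    'order_number': [1, 2, 1],
})

one_column = toy['user_id']                     # string in brackets: a Series
two_columns = toy[['user_id', 'loyalty_flag']]  # list in brackets: a DataFrame

print(type(one_column).__name__)   # Series
print(type(two_columns).__name__)  # DataFrame
```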